A high-throughput pipeline for designing microarray-based pathogen diagnostic assays
© Satya et al; licensee BioMed Central Ltd. 2008
Received: 21 November 2007
Accepted: 10 April 2008
Published: 10 April 2008
We present a methodology for high-throughput design of oligonucleotide fingerprints for microarray-based pathogen diagnostic assays. The oligonucleotide fingerprints, or DNA microarray probes, are designed for identifying target organisms in environmental or clinical samples. The design process is implemented in a high-performance computing software pipeline that incorporates major algorithmic improvements over a previous version to both reduce computation time and improve specificity assessment.
The algorithmic improvements result in significant reduction in runtimes, with the updated pipeline being nearly up to five-times faster than the previous version. The improvements in specificity assessment, based on multiple specificity criteria, result in robust and consistent evaluation of cross-hybridization with nontarget sequences. In addition, the multiple criteria provide finer control on the number of resulting fingerprints, which helps in obtaining a larger number of fingerprints with high specificity. Simulation tests for Francisella tularensis and Yersinia pestis, using a well-established hybridization model to estimate cross-hybridization with nontarget sequences, show that the improved specificity criteria yield a larger number of fingerprints as compared to using a single specificity criterion.
The faster runtimes, achieved as the result of algorithmic improvements, are critical for extending the pipeline to process multiple target genomes. The larger numbers of identified fingerprints, obtained by considering broader specificity criteria, are essential for designing probes for hard-to-distinguish target sequences.
Recent developments in technology have led to the sequencing of many eukaryotic and prokaryotic organisms. Availability of these genomic sequences unlocked opportunities for the development of whole-genome-based diagnostic assays, such as DNA microarrays and polymerase chain reaction (PCR) assays, which offer higher specificity than traditional methods based on a single gene or protein . Because of their simplicity and efficiency, these assays are increasingly being used for various applications in medicine, environmental monitoring, and biodefense. The popularity of these assays, in turn, triggered the development of different computational tools for sequence-based signature design [1–4].
Microarray-based pathogen diagnostic assays are gaining popularity due to their ability to test for hundreds, or even thousands, of pathogens in a single diagnostic test . Wilson et al.  used 50-thousand 20 mer overlapping oligonucleotides to detect 18 pathogens, Wang et al.  reported using a microarray with 11-thousand 70 mer oligonucleotides that can identify ≥ 954 distinct viruses, while Palacios et al.  designed a panmicrobial microarray comprising nearly 30-thousand 60 mer probes. Current technology for custom microarray design enables up to 385000 oligonucleotides per slide, while arrays with 2.1 million probes are available for other high-throughput applications, such as comparative genome hybridization . The ability to simultaneously screen against a wide range of targets is essential for detecting biological threat agents in environmental samples, where there may be no prior knowledge about the specific pathogens likely to be present in the sample.
In terms of the computational problems, the design of microarray probes for pathogen identification is fundamentally different from the design of microarray probes for gene expression analysis. An oligonucleotide probe designed for monitoring the expression of a gene should hybridize only to the mRNA of the corresponding gene, and should not have any significant cross-hybridization with other mRNAs from the same organism. Because all sequences involved are gene transcripts of a single organism, the combined length of the sequences is typically, at most, a few megabases. The problem, though computationally intensive, can be handled by a single processor in less than an hour . Many efficient computational tools have been developed for designing microarray probes for gene expression analysis [10–15].
In contrast, oligonucleotide probes designed for pathogen diagnostic assays should hybridize only with the intended target and should not have any significant cross-hybridization with any nontarget genome. This necessitates the comparison of each candidate probe against every available sequenced nontarget genome. To date, the combined size of the nucleotide sequences in the National Center for Biotechnology Information (NCBI) nt database is greater than 21 gigabases and is bound to grow exponentially as more and more genomes are sequenced. The amount of sequence data that need to be analyzed for each target pathogen is also increasing due to multiple strains of many pathogens being sequenced. Therefore, the sheer magnitude of the search space necessitates efficient high-throughput algorithms that both quickly scan large genomic databases and reduce the space to a small set of unique regions in the target genome.
In this paper, we present a software tool for identifying oligonucleotide fingerprints for microarray-based pathogen diagnostic assays. The software, named Tool for Oligonucleotide Fingerprint Identification (TOFI), is an integrated, scalable, high-performance-computing pipeline that combines genome comparison tools, probe design software, and sequence alignment programs in order to design highly specific microarray probes for a given target pathogen (i.e., one complete genomic sequence). This pipeline (henceforth referred to as TOFI-beta), is an improvement over an earlier version of the pipeline (TOFI-alpha), presented by Tembe et al. . TOFI-beta incorporates several optimizations and enhancements that significantly reduce the overall execution time of the pipeline, opening up possibilities for future extension of the system to design fingerprints common to a group of targets. In addition, TOFI-beta uses new, multiple criteria for estimating probe specificity, which, for any given genome, results in a considerable increase in the number of identified in silico fingerprints without increase in the (expected) false-positive rate. This is particularly important in cases where the target sequence is a close match to other sequences, where the larger number of in silico fingerprints increases the chance of identification of true fingerprints.
Existing methods for design of pathogen diagnostic assays
One method to design probes is to select regions of the pathogen genome that are known to be associated with specific functions. For example, specific genes of bacterial genomes, such as the 16S rRNA gene , virulence genes , and antibiotic resistance genes , have all been used to design microarray probes for species-level diagnostics.
Another method to design probes is to employ the whole genome of the pathogen. A few software tools/algorithms have been proposed to guide the design of whole-genome-based pathogen diagnostic assays [1–4, 11–14]. Some of these tools are intended for microarray-based assays [2, 4, 11–14], whereas others are intended for PCR-based assays [1, 3]. Most of these tools, however, do not have the capability of testing for specificity against a large number of nontarget genomes; they are based on the assumption that the signatures need only be unique with respect to the host or a small set of nontarget sequences. For instance, Kaderali and Schliep  developed an algorithm that analyzes a set of input target sequences and designs a single probe for each target, with the probe being unique with respect to all other input sequences. The uniqueness of the probe is determined by constructing a generalized suffix tree for all the input target sequences. The method presented by Putonti et al.  designs probes that are unique with respect to a host organism. Both of these approaches are clearly not adequate if the signatures are to identify pathogens from environmental samples containing a multitude of nontarget organisms. Because, in general, there is no prior knowledge of the contaminants in a sample, the signatures have to be unique with respect to all known nontarget sequences.
The KPATH pipeline for PCR assays  is the seminal software that introduced the concept of in silico comparison against all known nontarget sequences. Insignia  is another tool for designing PCR assays. Unlike the current version of TOFI, which designs fingerprints for a single genome, both KPATH and Insignia have the ability to design fingerprints that are common to multiple target genomes. To our knowledge, KPATH and Insignia are the only tools other than TOFI  that have the provision for in silico sequence-based testing for specificity against multiple nontarget genomes.
Neither KPATH nor Insignia is applicable for designing microarray fingerprints, as the design and specificity requirements of microarray fingerprints are quite different from those of PCR signatures. For example, the most commonly used PCR signatures consist of a probe and two primers, which, due to their short length [18–25 base pairs (bp)] and constraints on the inter-primer distance, can tolerate inexact matches with nontarget sequences without much degradation in specificity. Conversely, in addition to being characterized by only one DNA segment with no spacing constraints, microarray probes are generally longer and more susceptible to cross-hybridization even in the absence of an exact match . This requires more extensive searches, for both exact and inexact matches, against nontarget sequences to identify highly specific fingerprints for microarrays.
The first stage uses the suffix-tree-based MUMmer  program to perform pairwise comparisons of the target genome with each nontarget genome and eliminates regions in the target genome that have exact matches with one of the nontarget genomes. The surviving regions of the target genome, called candidate sequences, are then passed on to the second stage of the pipeline. Given a pair of sequences, MUMmer finds all maximal matches that are at least as long as a threshold (termed minmatch) between the two sequences. TOFI uses MUMmer to find these maximal matches, and eliminates regions in the target genome that are covered by these maximal matches. The threshold to use for minmatch is calculated based on the specificity parameters supplied by the user. This ensures that every segment of the target genome that is at least as long as the minimum probe length and satisfies the specificity parameters is part of a candidate sequence.
Stage 2 identifies oligonucleotides of desired length from the candidate sequences that satisfy experimental conditions, such as melting temperature (T m ) and GC content. TOFI uses the Oligonucleotide Modeling Platform (OMP) software  to identify these oligonucleotides, also referred to here as probes. OMP uses the nearest neighbor hybridization model  to calculate T m and to estimate if a probe forms any secondary structures that may prevent it from hybridizing to the target genome.
In the third and final stage of the pipeline, TOFI performs a BLAST  search against a local copy of the NCBI nt database for each probe. The program uses mpiBLAST  to run BLAST in parallel on multiple processors. Probes with alignments to nontarget genomes that exceed the specificity thresholds are eliminated, and the surviving probes become the in silico DNA fingerprints for the target genome. These probes are then subjected to experimental validation to test their sensitivity and specificity.
TOFI-beta incorporates major modifications in the first and third stages of the pipeline to increase computational speed and enhance the specificity assessment of the fingerprints. In the following, we compare TOFI-beta and TOFI-alpha and describe these improvements.
Improvements in Stage 1
In Stage 1, the major improvement in TOFI-beta over TOFI-alpha is the comparison of the target against multiple nontarget genomes for finding exact matches. TOFI-alpha only allows for comparison against a single nontarget sequence. Comparison against a single genome is effective in eliminating a large portion of the target genome when a closely-related, nontarget near-neighbor genome sequence is available. However, when such a nontarget sequence is not available, too many candidate sequences are passed on to the later stages of the pipeline, which are computationally more expensive. Even when a closely-related, nontarget near-neighbor is available, comparisons against additional nontarget genomes is advantageous. As Stage 1 is relatively inexpensive, the additional time spent in this stage is offset by the much larger reduction in computation time in the later steps, yielding a very favorable trade-off in overall runtime. Potentially, the target genome could be compared with all nontarget sequences in the entire nt database.
As described by Tembe et al. , sequence comparisons for exact matches in Stage 1 are performed using the nucmer module in MUMmer. Some modifications were required in TOFI-beta, however, to avoid some performance issues when using a large database of nontarget sequences. Since the larger databases are too big for MUMmer, they are split into smaller databases and the target sequence is compared for exact matches against the smaller databases. This procedure is parallelized in TOFI-beta so that the target sequence is compared against a different set of nontarget sequences at each processor. The results are then assembled and processed so that only unique regions of the target genome are passed on to the next stage.
Improvements in Stage 3
In Stage 3, each probe that is generated in Stage 2 is screened for cross-hybridization against all available nontarget genome sequences in the nt database using BLAST . Stage 3 is, by far, the most computationally expensive stage, which takes about 99% of the total runtime of TOFI-alpha.
The subsequent steps consist of a series of BLAST searches using blastn, where at each step the probes are queried against increasingly larger nucleotide databases of more distantly-related organisms to the target organism. For example, when Yersinia pestis is the target organism, the probes are first queried against databases consisting of sequences of Proteobacteria, then all other bacteria, and finally the nt database. Because the time taken to perform a BLAST search increases with database size and a probe is more likely to match sequences of closely-related organisms, the strategy in TOFI-beta is to perform relatively less expensive BLAST searches against small databases of related organisms first, eliminating many nonspecific probes before performing more comprehensive and costly BLAST searches. The hierarchical sequence databases are manually constructed. The probes that meet the specificity criteria in all BLAST steps are provided as the in silico DNA fingerprints for the target organism.
Improved specificity criteria
Probes designed for pathogen identification have to be unique to the target organism, and should not cross-hybridize with any nontarget organism. High sequence similarity between a probe and a nontarget sequence, apparent from the presence of good pairwise sequence alignments, is generally indicative of cross-hybridization between the two. There are multiple criteria for determining the specificity of a probe: overall sequence similarity, contiguous matches, and predicted free energy have all been shown to be important measures of the potential for cross-hybridization . In addition to these criteria, we propose the use of near-contiguous matches (i.e., long stretch of matches with very few mismatches, insertions or deletions) to measure probe specificity.
Sequence similarity, i.e., the number of matches (or mismatches) in the alignment between two sequences, is one criterion for estimating cross-hybridization. This criterion measures only the matching bases. That is, it does not take into consideration how the matches are distributed in the alignment. TOFI-alpha uses the number of mismatches in the alignment, denoted by T, as the sole criterion for determining probe specificity. However, when the probe length is variable, specifying the matching bases as a percentage of the probe length allows for a more consistent measure of similarity, as a single threshold can be used for various probe lengths. The percentage of matching bases in the alignment between two sequences is commonly referred to as the identity of the two sequences. In TOFI-beta, we use sequence identity as one of several criteria for estimating cross-hybridization.
In order to incorporate contiguous and near-contiguous matches in determining probe specificity, we use a series of thresholds M0, M1, M2, and M3, where M i is the maximum length of a contiguous region in which the alignment between a probe and a nontarget sequence has (M i - i) matches and i mismatches/insertions/deletions. Accordingly, M0 is the length of the longest stretch of contiguous matches between a probe and a nontarget genome.
The use of multiple criteria for specificity is deemed to yield a number of advantages. First, other factors, apart from overall sequence identity, influence the hybridization of a probe to a nontarget sequence. Kane et al.  performed empirical analysis on 50 mer oligonucleotides in order to measure, among other things, the effect of overall sequence identity and contiguous stretches of similarity on cross-hybridization. They concluded that a probe is likely to cross-hybridize with a nontarget if overall sequence identity is > 75% or if there is a contiguous match > 15 bp. Li et al.  also concluded that better specificity can be obtained by using multiple criteria, such as sequence identity, length of contiguous matches and hybridization energy.
Second, the use of multiple criteria for specificity gives finer control; one can relax the threshold value for each individual criterion and yet obtain more in silico fingerprints with comparable specificity to fingerprints that can be obtained using a single criterion. Conversely, using a single criterion does not give much control over the specificity of the selected probes. Applying strict thresholds for a single criterion might result in missing many specific probes, whereas relaxing the thresholds may lead to too many nonspecific probes .
An additional advantage is that using multiple specificity criteria, including contiguous matches, improves runtime. The selection of specificity criteria affects the choice of the length of minimum exact matches in Stage 1, and using multiple criteria allows for the selection of a smaller threshold for minimum exact matches. This selection results in fewer candidate sequences passing Stage 1, thereby improving the overall pipeline performance.
Using free energy (ΔG) for probe selection
Li et al.  have suggested that free energy (ΔG) is an important measure of probe specificity. However, computing the ΔG between a probe and nontarget sequences involves traversing through each nontarget genome and computing ΔG for each alignment with the probe. Given the large number of nontarget genomes and the relatively high computational cost of each ΔG calculation, such an approach is not practical for the current application. A more feasible strategy would be to obtain the BLAST hits for the probe and compute ΔG against each significant hit. However, even this strategy would be impractical, owing to the large number of fingerprints reported at the end of Stage 2. A feasible strategy is to perform the calculations only at the very final stage, after the probes have been screened using other specificity criteria. Hence, ΔG estimation (computed with OMP) is provided as an optional post-processing step in TOFI-beta.
The TOFI pipeline can design oligonucleotide probes of any length. The results that we present here are for the design of variable-length probes with length varying from 35 to 40 bp. The choice of probe length was solely based on the requirements of the downstream field-deployable microarray platform for which these probes are designed, and hence we do not attempt to find optimal probe lengths for pathogen identification. In general, shorter probes result in better specificity, and longer probes result in better sensitivity [28, 29].
Performance comparison of TOFI-alpha and TOFI-beta
Genome (length in base pairs)
Near-neighbor (length in base pairs)
TOFI-alpha runtime (hrs)
TOFI-beta runtime (hrs)
Yersinia pestis (4653728)
Yersinia pseudotuberculosis (4744671)
Francisella tularensis (1892819)
Francisella philomiragia (2049711)
Burkholderia mallei (5835527)
Burkholderia thailandensis (6723972)
Brucella melitensis (3294931)
Agrobacterium tumefaciens (2074782)
Comparison of TOFI-alpha and TOFI-beta for the F. tularensis genome on 74 processors
MUMmer with near-neighbor
1730 candidates, 1889863 bp
1730 candidates, 1889863 bp
MUMmer with Proteobacteria
1882 candidates, 1768235 bp
MUMmer with nt database
2140 candidates 1192259 bp
bl2seq with near-neighbor
BLAST with Proteobacteria
1 hr 37 m
BLAST with nt database
2 hrs 13 m
Total execution time
4 hrs 29 m
The optimum number of BLAST steps depends on many factors, including the size of the target genome, the availability of genomic sequences from closely-related, nontarget genomes, and the number of processors available. Based on empirical evidence obtained by running the pipeline for various bacterial genomes, we observe that the use of one intermediate BLAST step is optimal in reducing the overall runtime. Additional BLAST steps do not reduce the runtime significantly when a large number of processors (> 30) are available. However, when a limited number of processors are available, using additional BLAST steps might be useful in reducing the overall runtime.
Effect of multiple specificity criteria
The number of fingerprints obtained using different specificity criteria for four different target organisms
T = 9
identity = 80%, M0-M1-M2-M3 = 18-22-26-30
Case studies: Francisella tularensis and Yersinia pestis
In the following, we present a detailed comparative analysis of the number and specificity of fingerprints obtained by using individual versus multiple specificity criteria. Our aim is to consider all potential probes that satisfy the experimental constraints and evaluate the specificity criteria based on estimated ΔG of these probes with nontarget genomes. Accordingly, we evaluated all the probes resulting from Stage 2, where in Stage 1 we compared the target with a single near-neighbor genome using an exact match threshold of 32 bp. The results presented here are for F. tularensis SCHU S4 (NCBI accession no. NC_006570) and Y. pestis CO92 (NCBI accession no. NC_003143). A contig from F. philomiragia ATCC 25017  was used as the near-neighbor for F. tularensis, and Y. pseudotuberculosis IP 32953 (NCBI accession no. NC_006155) was used as the near-neighbor for Y. pestis.
Probe design on the candidate sequences, taking into consideration the experimental constraints (probe length between 35 and 40 bp, T m between 70 and 100°C and GC content between 45 and 50%), yielded a total of 33299 probes for F. tularensis and 19810 probes for Y. pestis. BLAST searches were performed on these probes against the entire nt database downloaded from NCBI in July 2007. BLAST hits with the corresponding targets and synthetic constructs were ignored. The remaining BLAST hits were extended on either side by 50 bp, and ΔG with the corresponding probe was estimated using the simulation feature of the OMP software. For each probe, the largest negative ΔG value among all BLAST hits was taken as a conservative measure of cross-hybridization between the probe and a nontarget sequence.
The probes are categorized into good, bad, or gray based on their ΔG values, as actual experimental hybridization results are not available for these probes. Good probes are expected to have little or no cross-hybridization with nontarget genomes, bad probes are expected to have significant cross-hybridization, and the behavior of gray probes is too uncertain to categorize either way. Estimated ΔG should not, however, be construed as a substitute for experimental hybridization tests. We heuristically selected well-spaced ΔG thresholds for good and bad probes in order to assess the relative performance of the different specificity criteria. Probes with ΔG greater than or equal to -16 kcal/mol are categorized as good probes. This threshold was selected because it corresponds to about 50% of the mean ΔG between a probe and its complement, which is the approximate ratio recommended for 50 mers . Probes with ΔG less than -20 kcal/mol (less than 60% of the mean ΔG between a probe and its complement) are considered as bad probes. The probes with ΔG between -20 and -16 kcal/mol are labeled as gray probes. Further increase in separation between good and bad probes is not convenient because it reduces the sample size for comparative assessment.
The numbers of good, bad and total probes (including gray) obtained for different specificity thresholds
Thresholds for specificity criteria
Identity = 80%, M 0 - M 3 = 18–22–26–30
Identity = 80%, M 0 - M 3 = 19–22–25–28
Identity = 82.5%, M 0 - M 3 = 19–22–25–28
Identity = 75%
Identity = 77.5%
Identity = 80%
T = 8
T = 9
T = 10
M 0 = 16
M 0 = 17
M 0 = 18
M 1 = 20
M 1 = 21
M 1 = 22
M 2 = 23
M 2 = 24
M 2 = 25
M 3 = 25
M 3 = 26
M 3 = 27
Effect of database size on specificity thresholds
The size of the target genome and the size of the database of nontarget sequences affect the selection of thresholds for contiguous and near-contiguous matches. For the size of target genomes and nontarget databases used in this paper, M0 = 18 appears to be an appropriate threshold, resulting in sufficient number of fingerprints for all the genomes in Table 3. However, as more and more nontarget sequences become available, it might not be possible to find unique 18 mers. Reed et al.  present a model to estimate the quality of a probe based on the size of the database of nontarget sequences. According to this model, the probability of finding a unique 18 mer is close to zero when the size of database is 100 Gb. As the size of sequence databases is quickly approaching this number, it might be necessary to use higher thresholds for M0 and other specificity parameters.
Limitations of using BLAST
Even though BLAST is among the best methods for sequence comparisons against large databases like the nt database, BLAST has its limitations. BLAST is a heuristic approach that first finds short, exact, "anchor" matches and extends the alignments around these exact matches. The size of these exact matches, termed word size, is given by the input parameter W. In the blastn program, used for BLAST searches against large sequence databases, the default value for W is 11 and the smallest possible value W can take is 7. Because every match reported by BLAST must include an exact match of length W, it imposes a lower bound on the values of the specificity thresholds M0, M1, M2, and M3. For any given W, the smallest near-contiguous match with i mismatches that can be guaranteed to be reported by BLAST is given by W(i+1), therefore, the minimum allowable value for any M i is W(i+1)-1. Accordingly, the lowest thresholds of M1, M2, and M3 for W = 7 are 13, 20, and 27, respectively. Using lower thresholds for M i could result in some near contiguous matches of length M i not being reported by BLAST.
Errors in sequence databases
TOFI currently uses the description provided in the header line of a FASTA sequence to determine if the sequence is a target or a nontarget sequence. Typographical errors or missing information in the NCBI sequence data can sometimes cause TOFI to treat a sequence from a target organism as a nontarget sequence, which can potentially lead to removal of good fingerprints.
The fingerprints designed by TOFI can only be as accurate as the sequence databases used for comparisons with nontarget sequences. However, identifying the errors and removing low-quality data from sequence databases is beyond the scope of the TOFI framework. The only complete solution for handling errors in sequence data is to use manually curated sequence databases that contain only high-quality sequences with accurate sequence descriptions.
Extension to multiple genomes
The performance improvements in TOFI-beta pave the way for extending the pipeline to design probes common to a group of target genomes, e.g., multiple strains of a species and multiple species of a genus. The identification of common fingerprints is a more computationally intensive problem, as it requires the simultaneous analyses of multiple target genomes. In the future, we will extend the software pipeline to design sets of fingerprints common to multiple targets and sets unique to each target, taking maximum advantage of the shared sequences among the multiple genomes in order to reduce the overall computation time.
The enhanced pipeline incorporates major algorithmic improvements, resulting in performance that is nearly up to five-times faster than the previous version of the pipeline. The use of multiple specificity criteria provides finer control over the number of resulting fingerprints. This is helpful in obtaining a larger number of in silico fingerprints than those obtained using individual criteria, and may be essential for the following three reasons: (1) as more and more genomic sequences become available, there will be significant fingerprint erosion; matches with the newly available sequences will eliminate some fingerprints, (2) certain near-neighbor organisms may have very similar sequences, so, in these cases, obtaining a larger number of potential in silico fingerprints for experimental testing would be desirable, and (3) due to the noisy nature of microarray experiments, redundancy is essential for confidence in the results. Therefore, it is desirable to start with a sufficiently large number of in silico fingerprints, and identify specific fingerprints based on experimental results with the target and a panel of nontarget sequences.
Availability and requirements
Project name: TOFI
Project home page: http://www.bhsai.org/downloads/tofi_beta.tar.gz
Operating systems: Linux
Programming Language: Perl
Other Requirements: mpiBLAST 1.4.0 or higher, MUMmer 3.19 or higher, and OMP developer edition
We thank Major Jeanne Geyer and Dr. Leonard Wasieloski of the U. S. Army Medical Research Institute of Infectious Diseases for their help and suggestions in selecting various design parameters and for providing the sequence of F. philomiragia prior to publication on the DOE Joint Genome Institute website. This work was sponsored by the U. S. Department of Defense High Performance Computing Modernization Program, under the High Performance Computing Software Applications Institutes initiative.
The opinions and assertions contained herein are the private views of the authors and are not to be construed as official or as reflecting the views of the U. S. Army or of the U. S. Department of Defense. This paper has been approved for public release with unlimited distribution.
- Slezak T, Kuczmarski T, Ott L, Torres C, Medeiros D, Smith J, Truitt B, Mulakken N, Lam M, Vitalis E, Zemla A, Zhou CE, Gardner S: Comparative genomics tools applied to bioterrorism defence. Brief Bioinform 2003, 4(2):133–149. 10.1093/bib/4.2.133View ArticlePubMedGoogle Scholar
- Kaderali L, Schliep A: Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics 2002, 18(10):1340–1349. 10.1093/bioinformatics/18.10.1340View ArticlePubMedGoogle Scholar
- Phillippy AM, Mason JA, Ayanbule K, Sommer DD, Taviani E, Huq A, Colwell RR, Knight IT, Salzberg SL: Comprehensive DNA signature discovery and validation. PLoS Comput Biol 2007, 3(5):e98. 10.1371/journal.pcbi.0030098PubMed CentralView ArticlePubMedGoogle Scholar
- Tembe W, Zavaljevski N, Bode E, Chase C, Geyer J, Wasieloski L, Benson G, Reifman J: Oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays. Bioinformatics 2007, 23(1):5–13. 10.1093/bioinformatics/btl549View ArticlePubMedGoogle Scholar
- Loy A, Bodrossy L: Highly parallel microbial diagnostics using oligonucleotide microarrays. Clin Chim Acta 2005/08/30 edition. 2006, 363(1–2):106–119. 10.1016/j.cccn.2005.05.041View ArticlePubMedGoogle Scholar
- Wilson WJ, Strout CL, DeSantis TZ, Stilwell JL, Carrano AV, Andersen GL: Sequence-specific identification of 18 pathogenic microorganisms using microarray technology. Mol Cell Probes 2002/05/29 edition. 2002, 16(2):119–127. 10.1006/mcpr.2001.0397View ArticlePubMedGoogle Scholar
- Wang D, Urisman A, Liu YT, Springer M, Ksiazek TG, Erdman DD, Mardis ER, Hickenbotham M, Magrini V, Eldred J, Latreille JP, Wilson RK, Ganem D, DeRisi JL: Viral discovery and sequence recovery using DNA microarrays. PLoS Biol 2003, 1(2):E2. 10.1371/journal.pbio.0000002PubMed CentralView ArticlePubMedGoogle Scholar
- Palacios G, Quan PL, Jabado OJ, Conlan S, Hirschberg DL, Liu Y, Zhai J, Renwick N, Hui J, Hegyi H, Grolla A, Strong JE, Towner JS, Geisbert TW, Jahrling PB, Buchen-Osmond C, Ellerbrok H, Sanchez-Seco MP, Lussier Y, Formenty P, Nichol MS, Feldmann H, Briese T, Lipkin WI: Panmicrobial oligonucleotide array for diagnosis of infectious diseases. Emerg Infect Dis 2007, 13(1):73–81.PubMed CentralView ArticlePubMedGoogle Scholar
- CGH Whole-Genome & Custom Fine-Tiling Microarrays and Services[http://www.nimblegen.com/products/cgh/index.html]
- Nordberg EK: YODA: selecting signature oligonucleotides. Bioinformatics 2005, 21(8):1365–1370. 10.1093/bioinformatics/bti182View ArticlePubMedGoogle Scholar
- Rimour S, Hill D, Militon C, Peyret P: GoArrays: highly dynamic and efficient microarray probe design. Bioinformatics 2005, 21(7):1094–1103. 10.1093/bioinformatics/bti112View ArticlePubMedGoogle Scholar
- Chen H, Sharp BM: Oliz, a suite of Perl scripts that assist in the design of microarrays using 50mer oligonucleotides from the 3' untranslated region. BMC Bioinformatics 2002, 3: 27. 10.1186/1471-2105-3-27PubMed CentralView ArticlePubMedGoogle Scholar
- Wernersson R, Nielsen HB: OligoWiz 2.0--integrating sequence feature annotation into the design of microarray probes. Nucleic Acids Res 2005, 33(Web Server issue):W611–5. 10.1093/nar/gki399PubMed CentralView ArticlePubMedGoogle Scholar
- Rahmann S: Fast and sensitive probe selection for DNA chips using jumps in matching statistics. Proc IEEE Comput Soc Bioinform Conf 2003, 2: 57–64.PubMedGoogle Scholar
- Rahmann S: Fast large scale oligonucleotide selection using the longest common factor approach. J Bioinform Comput Biol 2003, 1(2):343–361. 10.1142/S0219720003000125View ArticlePubMedGoogle Scholar
- Loy A, Maixner F, Wagner M, Horn M: probeBase--an online resource for rRNA-targeted oligonucleotide probes: new features 2007. Nucleic Acids Res 2006/11/14 edition. 2007, 35(Database issue):D800–4. 10.1093/nar/gkl856PubMed CentralView ArticlePubMedGoogle Scholar
- Volokhov D, Pomerantsev A, Kivovich V, Rasooly A, Chizhikov V: Identification of Bacillus anthracis by multiprobe microarray hybridization. Diagn Microbiol Infect Dis 2004/07/13 edition. 2004, 49(3):163–171. 10.1016/j.diagmicrobio.2004.03.015View ArticlePubMedGoogle Scholar
- Zhu LX, Zhang ZW, Wang C, Yang HW, Jiang D, Zhang Q, Mitchelson K, Cheng J: Use of a DNA microarray for simultaneous detection of antibiotic resistance genes among staphylococcal clinical isolates. J Clin Microbiol 2007/08/31 edition. 2007, 45(11):3514–3521. 10.1128/JCM.02340-06PubMed CentralView ArticlePubMedGoogle Scholar
- Putonti C, Chumakov S, Mitra R, Fox GE, Willson RC, Fofanov Y: Human-blind probes and primers for dengue virus identification. Febs J 2006, 273(2):398–408. 10.1111/j.1742-4658.2005.05074.xView ArticlePubMedGoogle Scholar
- He Z, Wu L, Li X, Fields MW, Zhou J: Empirical establishment of oligonucleotide probe design criteria. Appl Environ Microbiol 2005, 71(7):3753–3760. 10.1128/AEM.71.7.3753-3760.2005PubMed CentralView ArticlePubMedGoogle Scholar
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5(2):R12. 10.1186/gb-2004-5-2-r12PubMed CentralView ArticlePubMedGoogle Scholar
- OMP Developer Edition[http://www.dnasoftware.com/Products/OMP_DE/index.htm]
- SantaLucia J Jr., Hicks D: The thermodynamics of DNA structural motifs. Annu Rev Biophys Biomol Struct 2004, 33: 415–440. 10.1146/annurev.biophys.32.110601.141800View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.View ArticlePubMedGoogle Scholar
- Darling A, Carey L, Feng W: The Design, Implementation, and Evaluation of mpiBLAST . In 4th International Conference on Linux Clusters: The HPC Revolution 2003 in conjunction with the ClusterWorld Conference & Expo. San Jose, CA ; 2003.Google Scholar
- Li X, He Z, Zhou J: Selection of optimal oligonucleotide probes for microarrays using multiple criteria, global alignment and parameter estimation. Nucleic Acids Res 2005, 33(19):6114–6123. 10.1093/nar/gki914PubMed CentralView ArticlePubMedGoogle Scholar
- Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res 2000, 28(22):4552–4557. 10.1093/nar/28.22.4552PubMed CentralView ArticlePubMedGoogle Scholar
- Reed C, Fofanov V, Putonti C, Chumakov S, Slezak T, Fofanov Y: Effect of the mutation rate and background size on the quality of pathogen identification. Bioinformatics 2007/09/21 edition. 2007, 23(20):2665–2671. 10.1093/bioinformatics/btm420View ArticlePubMedGoogle Scholar
- Chou CC, Lee TT, Chen CH, Hsiao HY, Lin YL, Ho MS, Yang PC, Peck K: Design of microarray probes for virus identification and detection of emerging viruses at the genus level. BMC Bioinformatics 2006, 7: 232. 10.1186/1471-2105-7-232PubMed CentralView ArticlePubMedGoogle Scholar
- Liebich J, Schadt CW, Chong SC, He Z, Rhee SK, Zhou J: Improvement of oligonucleotide probe design criteria for functional gene microarrays in environmental applications. Appl Environ Microbiol 2006, 72(2):1688–1691. 10.1128/AEM.72.2.1688-1691.2006PubMed CentralView ArticlePubMedGoogle Scholar
- JGI Microbial Genomes[http://genome.jgi-psf.org/mic_cur1.html]
- Censor Y: Pareto Optimality in Multiobjective Problems. Appl Math Optimiz 1977, 4: 41–59. 10.1007/BF01442131View ArticleGoogle Scholar
- Agilent DNA microarrays[http://www.chem.agilent.com/Scripts/PCol.asp?lPage=494]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.