The TOFI pipeline consists of the three main stages illustrated in Figure 1. The stages are designed so that large portions of the target genome are eliminated in the less-expensive two initial stages, and the computationally more expensive searches for specific fingerprints are performed over smaller regions of the target genome in the final stage.
The first stage uses the suffix-tree-based MUMmer [21] program to perform pairwise comparisons of the target genome with each nontarget genome and eliminates regions in the target genome that have exact matches with one of the nontarget genomes. The surviving regions of the target genome, called candidate sequences, are then passed on to the second stage of the pipeline. Given a pair of sequences, MUMmer finds all maximal matches that are at least as long as a threshold (termed minmatch) between the two sequences. TOFI uses MUMmer to find these maximal matches, and eliminates regions in the target genome that are covered by these maximal matches. The threshold to use for minmatch is calculated based on the specificity parameters supplied by the user. This ensures that every segment of the target genome that is at least as long as the minimum probe length and satisfies the specificity parameters is part of a candidate sequence.
Stage 2 identifies oligonucleotides of desired length from the candidate sequences that satisfy experimental conditions, such as melting temperature (T
m
) and GC content. TOFI uses the Oligonucleotide Modeling Platform (OMP) software [22] to identify these oligonucleotides, also referred to here as probes. OMP uses the nearest neighbor hybridization model [23] to calculate T
m
and to estimate if a probe forms any secondary structures that may prevent it from hybridizing to the target genome.
In the third and final stage of the pipeline, TOFI performs a BLAST [24] search against a local copy of the NCBI nt database for each probe. The program uses mpiBLAST [25] to run BLAST in parallel on multiple processors. Probes with alignments to nontarget genomes that exceed the specificity thresholds are eliminated, and the surviving probes become the in silico DNA fingerprints for the target genome. These probes are then subjected to experimental validation to test their sensitivity and specificity.
TOFI-beta incorporates major modifications in the first and third stages of the pipeline to increase computational speed and enhance the specificity assessment of the fingerprints. In the following, we compare TOFI-beta and TOFI-alpha and describe these improvements.
Improvements in Stage 1
In Stage 1, the major improvement in TOFI-beta over TOFI-alpha is the comparison of the target against multiple nontarget genomes for finding exact matches. TOFI-alpha only allows for comparison against a single nontarget sequence. Comparison against a single genome is effective in eliminating a large portion of the target genome when a closely-related, nontarget near-neighbor genome sequence is available. However, when such a nontarget sequence is not available, too many candidate sequences are passed on to the later stages of the pipeline, which are computationally more expensive. Even when a closely-related, nontarget near-neighbor is available, comparisons against additional nontarget genomes is advantageous. As Stage 1 is relatively inexpensive, the additional time spent in this stage is offset by the much larger reduction in computation time in the later steps, yielding a very favorable trade-off in overall runtime. Potentially, the target genome could be compared with all nontarget sequences in the entire nt database.
As described by Tembe et al. [4], sequence comparisons for exact matches in Stage 1 are performed using the nucmer module in MUMmer. Some modifications were required in TOFI-beta, however, to avoid some performance issues when using a large database of nontarget sequences. Since the larger databases are too big for MUMmer, they are split into smaller databases and the target sequence is compared for exact matches against the smaller databases. This procedure is parallelized in TOFI-beta so that the target sequence is compared against a different set of nontarget sequences at each processor. The results are then assembled and processed so that only unique regions of the target genome are passed on to the next stage.
Improvements in Stage 3
In Stage 3, each probe that is generated in Stage 2 is screened for cross-hybridization against all available nontarget genome sequences in the nt database using BLAST [24]. Stage 3 is, by far, the most computationally expensive stage, which takes about 99% of the total runtime of TOFI-alpha.
In TOFI-alpha, Stage 3 consists of a single step in which a BLAST search is performed for each probe against the complete nt database. In contrast, as illustrated in Figure 2, the third stage in TOFI-beta consists of multiple, hierarchical BLAST steps, with the computational cost of the BLAST searches increasing with the number of steps. At each step, the oligonucleotide probes having significant alignments with nontarget sequences are removed, and only the surviving probes are passed on to the next, more expensive step. In the first step, we use the pairwise BLAST program bl2seq to identify matches with a near-neighbor genome, and the probes that meet the specificity requirements are passed on to the subsequent steps. In the absence of a near-neighbor genome, this step is bypassed.
The subsequent steps consist of a series of BLAST searches using blastn, where at each step the probes are queried against increasingly larger nucleotide databases of more distantly-related organisms to the target organism. For example, when Yersinia pestis is the target organism, the probes are first queried against databases consisting of sequences of Proteobacteria, then all other bacteria, and finally the nt database. Because the time taken to perform a BLAST search increases with database size and a probe is more likely to match sequences of closely-related organisms, the strategy in TOFI-beta is to perform relatively less expensive BLAST searches against small databases of related organisms first, eliminating many nonspecific probes before performing more comprehensive and costly BLAST searches. The hierarchical sequence databases are manually constructed. The probes that meet the specificity criteria in all BLAST steps are provided as the in silico DNA fingerprints for the target organism.
Improved specificity criteria
Probes designed for pathogen identification have to be unique to the target organism, and should not cross-hybridize with any nontarget organism. High sequence similarity between a probe and a nontarget sequence, apparent from the presence of good pairwise sequence alignments, is generally indicative of cross-hybridization between the two. There are multiple criteria for determining the specificity of a probe: overall sequence similarity, contiguous matches, and predicted free energy have all been shown to be important measures of the potential for cross-hybridization [26]. In addition to these criteria, we propose the use of near-contiguous matches (i.e., long stretch of matches with very few mismatches, insertions or deletions) to measure probe specificity.
Sequence similarity, i.e., the number of matches (or mismatches) in the alignment between two sequences, is one criterion for estimating cross-hybridization. This criterion measures only the matching bases. That is, it does not take into consideration how the matches are distributed in the alignment. TOFI-alpha uses the number of mismatches in the alignment, denoted by T, as the sole criterion for determining probe specificity. However, when the probe length is variable, specifying the matching bases as a percentage of the probe length allows for a more consistent measure of similarity, as a single threshold can be used for various probe lengths. The percentage of matching bases in the alignment between two sequences is commonly referred to as the identity of the two sequences. In TOFI-beta, we use sequence identity as one of several criteria for estimating cross-hybridization.
In order to incorporate contiguous and near-contiguous matches in determining probe specificity, we use a series of thresholds M0, M1, M2, and M3, where M
i
is the maximum length of a contiguous region in which the alignment between a probe and a nontarget sequence has (M
i
- i) matches and i mismatches/insertions/deletions. Accordingly, M0 is the length of the longest stretch of contiguous matches between a probe and a nontarget genome.
The use of multiple criteria for specificity is deemed to yield a number of advantages. First, other factors, apart from overall sequence identity, influence the hybridization of a probe to a nontarget sequence. Kane et al. [27] performed empirical analysis on 50 mer oligonucleotides in order to measure, among other things, the effect of overall sequence identity and contiguous stretches of similarity on cross-hybridization. They concluded that a probe is likely to cross-hybridize with a nontarget if overall sequence identity is > 75% or if there is a contiguous match > 15 bp. Li et al. [26] also concluded that better specificity can be obtained by using multiple criteria, such as sequence identity, length of contiguous matches and hybridization energy.
Second, the use of multiple criteria for specificity gives finer control; one can relax the threshold value for each individual criterion and yet obtain more in silico fingerprints with comparable specificity to fingerprints that can be obtained using a single criterion. Conversely, using a single criterion does not give much control over the specificity of the selected probes. Applying strict thresholds for a single criterion might result in missing many specific probes, whereas relaxing the thresholds may lead to too many nonspecific probes [20].
An additional advantage is that using multiple specificity criteria, including contiguous matches, improves runtime. The selection of specificity criteria affects the choice of the length of minimum exact matches in Stage 1, and using multiple criteria allows for the selection of a smaller threshold for minimum exact matches. This selection results in fewer candidate sequences passing Stage 1, thereby improving the overall pipeline performance.
The thresholds M1, M2, and M3 help design robust fingerprints that are not affected by small variations in nontarget sequences. Using M1, M2, and M3 one can avoid situations in which a small number of mutations/insertions/deletions in a nontarget sequence might potentially lead to long stretches of contiguous matches between the probe and nontarget, causing the probe to cross-hybridize with the nontarget. For instance, in the example shown in Figure 3, the probe does not have a very long stretch of contiguous matches with a nontarget, but rather a long near-contiguous match. In this instance, the probe might cross-hybridize with the nontarget because of the long stretch of near-contiguous match. The threshold M1 is particularly useful in avoiding regions around common single nucleotide polymorphisms between the target and nontarget genomes.
Using free energy (ΔG) for probe selection
Li et al. [26] have suggested that free energy (ΔG) is an important measure of probe specificity. However, computing the ΔG between a probe and nontarget sequences involves traversing through each nontarget genome and computing ΔG for each alignment with the probe. Given the large number of nontarget genomes and the relatively high computational cost of each ΔG calculation, such an approach is not practical for the current application. A more feasible strategy would be to obtain the BLAST hits for the probe and compute ΔG against each significant hit. However, even this strategy would be impractical, owing to the large number of fingerprints reported at the end of Stage 2. A feasible strategy is to perform the calculations only at the very final stage, after the probes have been screened using other specificity criteria. Hence, ΔG estimation (computed with OMP) is provided as an optional post-processing step in TOFI-beta.