Breaking the computational barriers of pairwise genome comparison

Background: Conventional pairwise sequence comparison software algorithms are being used to process much larger datasets than they were originally designed for. This can result in processing bottlenecks that limit software capabilities or prevent full use of the available hardware resources. Overcoming the barriers that limit the efficient computational analysis of large biological sequence datasets by retrofitting existing algorithms or by creating new applications represents a major challenge for the bioinformatics community.

Results: We have developed C libraries for pairwise sequence comparison within diverse architectures, ranging from commodity systems to high performance and cloud computing environments. Exhaustive tests were performed using different datasets of closely- and distantly-related sequences that span from small viral genomes to large mammalian chromosomes. The tests demonstrated that our solution is capable of generating high quality results with a linear-time response and controlled memory consumption, being comparable to or faster than the current state-of-the-art methods.

Conclusions: We have addressed the problem of pairwise and all-versus-all comparison of large sequences in general, greatly increasing the limits on input data size. The approach described here is based on a modular out-of-core strategy that uses secondary storage to avoid reaching memory limits during the identification of High-scoring Segment Pairs (HSPs) between the sequences under comparison. Software engineering concepts were applied to avoid intermediate result re-calculation, to minimise the performance impact of input/output (I/O) operations and to modularise the process, thus enhancing application flexibility and extendibility. Our computationally-efficient approach allows tasks such as the massive comparison of complete genomes, evolutionary event detection, the identification of conserved synteny blocks and inter-genome distance calculations to be performed more effectively.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0679-9) contains supplementary material, which is available to authorized users.


Words
To reduce the computational space and accelerate data processing, most of the proposed strategies use some kind of pre-processing step. K-mers are used as prefixes for the fast identification of matching words, which serve as seed points from which to extend the local alignment.
The image shows a hash table built with different "prefix" lengths (K = 1, 2, 3). The header contains the "word" and the table contains the positions at which that word appears in the sequence. The longer the prefix, the smaller the number of "occurrences" per entry. In this case, the hash is built by "full identity", so all the words are the same for a given header entry (this is a trade-off between sensitivity and memory requirements). The number of possible hash headers is 4^K, where K is the word length; however, NOT all combinations are present in the sequence. The exact number of words is L-K+1, where L is the sequence length.
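As a minimal sketch of these counts (the toy sequence, the small K and the flat counting table are assumptions for illustration only, not GECKO's data structures), the following C fragment scans a sequence, visits its L-K+1 windows and records how many of the 4^K possible words actually occur:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Map a nucleotide to a 2-bit code; -1 for any other symbol. */
static int base_index(char c) {
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

int main(void) {
    const char *seq = "ACGTACGTTACG";            /* toy sequence, L = 12        */
    const int K = 3;                             /* word ("prefix") length      */
    const size_t L = strlen(seq);
    const size_t n_keys = (size_t)1 << (2 * K);  /* 4^K possible words          */

    size_t *count = calloc(n_keys, sizeof *count);
    if (!count) return 1;

    for (size_t i = 0; i + K <= L; i++) {        /* exactly L - K + 1 windows   */
        size_t code = 0;
        int valid = 1;
        for (int j = 0; j < K; j++) {
            int b = base_index(seq[i + j]);
            if (b < 0) { valid = 0; break; }     /* skip words with other symbols */
            code = (code << 2) | (size_t)b;
        }
        if (valid) count[code]++;                /* one more occurrence of this word */
    }

    size_t present = 0;
    for (size_t w = 0; w < n_keys; w++)
        if (count[w]) present++;

    printf("L = %zu, K = %d, windows = %zu, distinct words present = %zu of %zu\n",
           L, K, L - (size_t)K + 1, present, n_keys);
    free(count);
    return 0;
}

Running it on the toy sequence shows that only a small fraction of the 4^K possible words are present, which is why only occupied entries need to be stored.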

Hits
Hits are word matches to be used as seed points. The number of hits depends on the number of matching-word repetitions and also on the word size (K).

Example:
Identical words produce hits at the coordinates where they appear (diag = h - v, when x_h matches y_v).
(A) produces hits at: (3, 1), (3, 6), (5, 1), (5, 6), (8, 1) and (8, 6).
(C) produces hits at: (2, 3), (2, 8), (6, 3) and (6, 8).
We would like to emphasise that although K is 32 for the dictionary step, users can choose a different K value (e.g. K = 1, 2, 3, ... 32) starting at the "hits determination" step, based on their knowledge of the similarity shared by the sequences under comparison (higher values mean less sensitivity and vice versa). We mention in the manuscript that all K'<K mers are detected, only as an observation. In fact, if we use K=32, the last K-mer in the sequence is at position L-32+1. Obviously, this last K-mer has 32 letters, therefore K-mers of length K'=31, 30, ... 1 are lost at the sequence ends. However, this loss is minimal, as the maximum fragment that could start at such a position would be no longer than 32 residues, for obvious reasons.
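The coordinates and diagonals of hits can be illustrated with a few lines of C (the toy sequences, the small K and the brute-force scan are assumptions for the example, not GECKO's implementation, which uses the pre-computed dictionaries):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *sx = "GGATTAGATTAC";   /* toy sequence X */
    const char *sy = "CATTAGG";        /* toy sequence Y */
    const int K = 4;                   /* word length used for seeding */
    const size_t lx = strlen(sx), ly = strlen(sy);

    for (size_t x = 0; x + K <= lx; x++) {
        for (size_t y = 0; y + K <= ly; y++) {
            if (memcmp(sx + x, sy + y, (size_t)K) == 0) {
                /* identical words at (x, y): a hit to be used as a seed point */
                long diag = (long)x - (long)y;
                printf("hit at (%zu, %zu), diagonal %ld\n", x, y, diag);
            }
        }
    }
    return 0;
}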

Big-hits
To reduce the computational space even further, the hits (i.e. matches of words coming from the sequences under comparison) are joined together based on their proximity. Hits on the same diagonal at a distance shorter than a parameter D are joined to form a "big-hit". In the image, the three hits in the first upper diagonal (at distances X1, X2 < D) and the two hits in the second diagonal (X3 < D) are joined to form a "3BigHit" and a "2BigHit", respectively.
The accompanying plot shows the behaviour of the seed-point computational space as a function of K (word length) and of the inter-hit distance parameter used to group neighbouring hits. Real data (chromosomes X from several species) have been used in the simulation.
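The joining step can be pictured with the following C sketch (the hit coordinates, the distance D and the array layout are illustrative assumptions, not the library's internal format): hits already sorted by diagonal and then by position are merged into a big-hit whenever consecutive hits on the same diagonal lie closer than D.

#include <stdio.h>

typedef struct { long diag; long x; } Hit;

int main(void) {
    /* toy list of hits, sorted by (diag, x) */
    Hit hits[] = { {2, 10}, {2, 18}, {2, 25}, {5, 40}, {5, 44}, {9, 100} };
    const int n = sizeof hits / sizeof hits[0];
    const long D = 16;               /* maximum inter-hit distance */

    int members = 1;                 /* hits merged into the current big-hit */
    for (int i = 1; i <= n; i++) {
        int same_group = (i < n) &&
                         hits[i].diag == hits[i - 1].diag &&
                         hits[i].x - hits[i - 1].x < D;
        if (same_group) {
            members++;               /* extend the current big-hit */
        } else {
            printf("big-hit on diagonal %ld made of %d hit(s), starting at x=%ld\n",
                   hits[i - members].diag, members, hits[i - members].x);
            members = 1;             /* start a new group */
        }
    }
    return 0;
}

With these toy values the output reproduces the situation in the image: a "3BigHit" on the first diagonal, a "2BigHit" on the second, and one isolated hit.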

The global idea
Pre-processing
Masking low complexity regions, …

Hashing
Including sub-processes such as sorting, grouping, etc.

HITS by diagonal
Including BigHits detection and pruning (see slide 8 for hits positions)

HITS Extension
Search for similarities using hits as seed points.

Post-processing
Visualization, frequencies, word distributions, etc.

First Step: Building the dictionaries
The dictionary calculation is based on the well-known binary tree from computer science. Each tree node contains a word (key) and its list of occurrences (values). Following the behaviour of a binary tree, nodes on the left-hand side of a given subtree come lexicographically before nodes on the right-hand side. To avoid memory consumption problems caused by the huge number of possible words (i.e. a theoretical maximum of 4^K different words, without counting repetitions), we decided to split the calculation into p steps (with p being a multiple of 4), thus reducing the amount of memory used by the program by a factor of p (assuming a normal distribution of words). To split the dictionary and conserve its lexicographical order, a prefix of length log_4(p) is used. This strategy requires us to iterate p times over the whole sequence, each time using a different prefix in lexicographical order to preserve word order. Another improvement is the reservation of a memory pool at the beginning of the process to avoid a memory allocation request for each node. Instead, we request a pool of memory, and new memory pools are only reserved once the currently reserved memory is used up. To obtain the final result we traverse the tree in order, storing the word contained in each node together with its list of occurrences. We considered other strategies at this step, such as a prefix tree, but found that they suffer memory consumption issues similar to the problems faced by existing software approaches.
It is worth noting that this process needs to be done only once for each sequence (or twice if the reverse complement is also going to be analysed). The dictionary is computed for K=32, which contains all the prefixes of length K'<K.
Visual representation of the binary tree used to store and sort K-mers. A sliding window is used to scan the sequence, then the word inside each window is inserted into the tree following lexicographic order. Blue boxes indicate the sequence position(s) of each word.
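The node layout and in-order traversal described above can be sketched in C as follows (toy word length, plain malloc/realloc instead of the memory pool, and no p-way prefix split; it is an illustration of the idea, not the GECKO source):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define K 3                      /* toy word length (32 in the real pipeline) */

typedef struct Node {
    char word[K + 1];            /* the key                                   */
    size_t *pos;                 /* occurrence positions                      */
    size_t n_pos, cap;
    struct Node *left, *right;
} Node;

static Node *insert(Node *root, const char *w, size_t p) {
    if (!root) {                 /* new word: create the node                 */
        root = calloc(1, sizeof *root);
        memcpy(root->word, w, K);
        root->cap = 4;
        root->pos = malloc(root->cap * sizeof *root->pos);
    } else {
        int cmp = strncmp(w, root->word, K);
        if (cmp < 0)      { root->left  = insert(root->left,  w, p); return root; }
        else if (cmp > 0) { root->right = insert(root->right, w, p); return root; }
    }
    if (root->n_pos == root->cap)   /* grow the occurrence list if needed     */
        root->pos = realloc(root->pos, (root->cap *= 2) * sizeof *root->pos);
    root->pos[root->n_pos++] = p;   /* append this occurrence                 */
    return root;
}

static void in_order(const Node *t) {  /* lexicographic dump of the tree     */
    if (!t) return;
    in_order(t->left);
    printf("%s: num:%zu :", t->word, t->n_pos);
    for (size_t i = 0; i < t->n_pos; i++) printf(" %zu", t->pos[i]);
    printf("\n");
    in_order(t->right);
}

int main(void) {
    const char *seq = "ACGTACGTTACG";
    Node *root = NULL;
    for (size_t i = 0; i + K <= strlen(seq); i++)   /* sliding window         */
        root = insert(root, seq + i, i);
    in_order(root);                                 /* tree memory not freed in this toy example */
    return 0;
}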

Alternative dictionary calculation using Suffix arrays
Most of the current methods for calculating genome comparisons/alignments (e.g. Gepard, Mauve, MUMmer, LASTZ) use either suffix trees or suffix arrays as the indexing step of their algorithms.
To illustrate why we use a customised solution instead of a suffix array, we implemented a suffix-array version of our indexing step. This implementation confirmed what can be extracted from the execution time and memory consumption data contained in the Results section of the paper. The suffix array is clearly faster than our customised solution for short sequences, up to a certain point where its performance gain starts to degrade. Besides execution time, the memory consumed by the suffix array implementation is approximately 9 times the length of the input sequence, which contrasts with our customised solution, whose consumption is constant at 1.5 GB due to the buffering strategy we use. As expected, the suffix array's consumption for short sequences is lower than that of our solution because of the fixed buffer size, but as soon as the sequence is long enough its consumption becomes greater. In general, an estimated consumption of 9 times the sequence length is acceptable for short sequences, but for genomes longer than 1 Gbp it starts to exceed the 8 GB present in most current desktop machines.
The previous figure shows that the execution times of the two implementations converge as the sequence size grows, confirming that our customised solution, specifically designed for long sequences, offers comparable performance in that regime.

RATIO TIME
The previous figure confirms that for long sequences the memory consumption of our customised solution is lower than that of suffix arrays, and the ratio will approach zero as the sequence size grows (since our solution has a constant memory consumption).
The data from which we draw these conclusions is shown below. It is important to note that, although we compare the same chromosome number across several species, chromosome numbering depends on sequence length rather than on content. This is to say that we will obtain significant alignments in a subset of the comparisons, but not in all of them. This fact can be observed in the alignment of human with domestic cow shown below.
The previous figure is included to illustrate the absence of significant signals in some of the sub-plots of the new figure. In the previous figure, the human chromosomes are colour-coded on the left-hand side and the corresponding ones appear on the right-hand side.

Reference software
The reference software was run calculating local alignments ('-T 0'), with a minimum initial match of 32 ('-l') and with an overlapping window along the query sequence of step 1 ('-k', similar to the '-step=1' option of LASTZ). These specific parameters were used to obtain a fair comparison. An example execution line:
lastal S1DB S2 -m 1000 -r 4 -q 4 -s 2 -T 0 -l 32 -k 1
All the programs use the following scoring matrix (or equivalent match/mismatch parameters in the cases where they do not accept a scoring matrix parameter):

Execution time
The execution time cell for Gepard contains "*" because the program suffered memory consumption problems on our test infrastructure. This does not correspond with what the authors report; see the table below, extracted from: Krumsiek, Jan, et al. (2007); "Gepard: a rapid and sensitive tool for creating dotplots on genome scale"; Bioinformatics Vol. 23 no. 8.

In the case of MUMmer, the table contains "*" because it also suffered memory consumption problems on our 8 GB infrastructure. We confirmed this with a later execution on a bigger machine (see the 15 GB maximum resident size highlighted in bold below). In any case, the execution time is greater than the one reported by GECKO (6 hours, 27 minutes and 6 seconds, compared to 3 hours, 17 minutes and 24 seconds calculated as GECKO's dictionary calculation time plus comparison time). The output of the execution is:
/usr/bin/time -v mummer -maxmatch -n -b S1 S2 > mummer.mums
# reading input file "S1" of length 234708820
# construct suffix tree for sequence of length 234708820
# (maximum reference length is 536870908)
# (maximum query length is 4294967295)
# process 2347088
# CONSTRUCTIONTIME mummer S1 322.15
# reading input file "S2" of length 231158085
# matching query-file "S2"
# against subject-file "S1"
# COMPLETETIME mummer S1 21197.13
# SPACE mummer S1 450.64
Command being timed: "mummer -maxmatch -n -b S1 S2"
User time (

In the case of LAST, the table contains "n.a." because it suffered memory consumption problems on our 8 GB infrastructure. In this case the execution was not possible even on a bigger machine (see the more than 300 GB maximum resident size highlighted in bold below). This contrasts with the statement made on their webpage (http://last.cbrc.jp): "Human vs. mouse. This took about 1 day on 1 CPU, and less than 2 GB of RAM." We believe this is because of the parameters we are using, but in order to make an apples-to-apples comparison with GECKO the parameters need to be fixed to these values. In any case, the execution time reported until it crashed is much greater than that of GECKO (39 hours, 57 minutes and 57 seconds for LAST compared to 3 hours, 17 minutes and 24 seconds calculated as GECKO's dictionary calculation time plus comparison time).
/usr/bin/time -v ~/bin/lastdb S1DB S1
Command being timed: "/mnt/home/users/tic_182_uma/oscart/bin/lastdb S1DB S1"
User time (seconds)
Exit status: 0
/usr/bin/time -v ~/bin/lastal S1DB S2 -m 1000 -r 4 -q 4 -s 2 -T 0 -l 32 -k 1 > last.maf
lastal: out of memory
Command exited with non-zero status 1
Command being timed: "/mnt/home/users/tic_182_uma/oscart/bin/lastal S1DB S2 -m 1000 -r 4 -q 4 -s 2 -T 0 -l 32 -k 1"
User time (

In the case of Mauve (i.e. progressiveMauve), the table shows ">604800.00", meaning that the execution time is greater than one week (the maximum execution time of a job on our bigger system). Since the job was cancelled we did not obtain the memory consumption, which is why the table contains "n.a.".

Determining the point where to start using GECKO
The following chart compares the execution times of GECKO and MUMmer showing that MUMmer is significantly faster when analysing short sequences. However, since the difference between the processing of smaller sequence datasets (< 5 Mbp) is in the range of seconds, it may not be worth switching between the two different methods for this reason alone. If choosing an analysis method for speed, the execution time for GECKO and MUMmer is similar on Drosophila genome-sized sequences (around 21 Mbp), and we would suggest GECKO as the better choice from this point onwards. However, there are other reasons to choose GECKO, for example MUMmer only reports the occurrence position in both sequences plus the length of the alignment, in contrast to the default GECKO output, which includes additional useful values such as the number of identities, score, similarity, etc.
In the end, the choice of software depends on many variables, such as the type of analysis or the species being worked on. However, there is a clear trend towards big-sequence analysis, not only going from genes to genomes but also from pairwise comparison to multiple genome comparison. For a lab that plans to carry out more complex sequence analyses, we truly believe that the way GECKO avoids re-calculating dictionaries and hits while filtering fragments with different parameters makes it the best option in many, if not most, situations.

Dictionary creation
Usage: dictionary seq.fasta prefixSize prefixOutfile
Parameters:
 seq.fasta: input sequence from which the words to be stored in the dictionary will be extracted.
 prefixSize: this parameter indicates the number of iterations the program will perform. The number of iterations is 4^p. For example, with p=1 the first iteration will look for the words starting with 'A', the second iteration the words starting with 'C', the third with 'G' and the fourth with 'T'. This is used to avoid memory exhaustion due to the large amount of memory that would otherwise be needed.
 prefixOutFile: this parameter indicates the output file name of this program ("prefixOutFile.dict").
We have used dustmask (from the Blast distribution) without success. Thus, we consider that low complexity regions (LCR) must be identified before using this procedure.
Dictionary considerations:
1. Only "ACGT" are accepted as valid symbols to form a word.
2. The program interprets lower-case symbols as non-valid characters for forming a word, so they are skipped (this is the usual way maskers represent an LCR).
3. Non-printing ASCII characters are removed. We have found that, in particular, the '\r\n' used by MS-DOS encoding to represent a new line is not well processed in Linux-like environments (an alternative is to use the dos2unix command).
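A small illustrative filter applying these three rules could look as follows (the classification logic is an assumption drawn from the points above, not the shipped dictionary program):

#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *raw = "ACGTacgtNACGT\r\nGGTT";   /* toy input with masked and control characters */
    for (const char *p = raw; *p; p++) {
        if (*p == 'A' || *p == 'C' || *p == 'G' || *p == 'T')
            putchar(*p);                  /* 1. valid symbol: can form part of a word              */
        else if (islower((unsigned char)*p))
            putchar('.');                 /* 2. lower-case (masked LCR): skipped for word building */
        else if (iscntrl((unsigned char)*p))
            continue;                     /* 3. non-printing characters ('\r', '\n'): removed      */
        else
            putchar('?');                 /* any other symbol (e.g. 'N') does not form a word      */
    }
    putchar('\n');
    return 0;
}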

FilterHits
Two hits will be grouped if they occur within a given distance (less than the K-mer size parameter). In this way we reduce the number of hits to be extended. It is also possible to remove all "isolated" hits.

Parameters:
 fileIn: refers to the hits file coming from the sortHits procedure. This file is sorted first by diagonal and then by position in sequence X.
 fileOut: indicates the output file of this program.
 KmerSize: indicates the length of the matches, in order to know when one hit starts at a position already covered by a previous hit.
 filterIsolatedHits: indicates whether a hit that is isolated on one of the diagonals is filtered or not. Usually, when two sequences are similar, it is normal that they have more than one hit on every diagonal.
Note: here there is the possibility of including in the output how many hits form each Big-Hit and then computing fragments using only Big-Hits formed by at least n simple hits. However, during the experiments we verified that for K=32, working with individual hits is fast enough. The main reason for this behaviour is that, when hits are ordered, the fragments program (next) is able to collect hits on the same diagonal and then skip all the hits that were already incorporated into a given fragment.
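The pruning idea behind FilterHits can be sketched in a few lines of C (the toy coordinates and the data layout are assumptions for illustration): hits sorted by diagonal and then by position in X are kept only if they start beyond the K positions covered by the previously kept hit on the same diagonal.

#include <stdio.h>

typedef struct { long diag; long x; } Hit;

int main(void) {
    const int K = 8;                                   /* K-mer size        */
    Hit hits[] = { {3, 10}, {3, 12}, {3, 30}, {7, 5}, {7, 50} };
    const int n = sizeof hits / sizeof hits[0];

    long last_diag = 0, last_x = 0;
    int have_prev = 0;
    for (int i = 0; i < n; i++) {
        int covered = have_prev &&
                      hits[i].diag == last_diag &&
                      hits[i].x < last_x + K;          /* starts inside the previous K-mer */
        if (!covered) {
            printf("keep hit (diag=%ld, x=%ld)\n", hits[i].diag, hits[i].x);
            last_diag = hits[i].diag;
            last_x = hits[i].x;
            have_prev = 1;
        }
    }
    return 0;
}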
The double-hit approach used by Blast has proven very powerful for comparing short sequences such as genes and proteins. However, when working with genomic sequences, long stretches of identical fragments are expected (not only because of the longer sequences under analysis, but also because other genomic characteristics, such as operons, suggest this possibility).
Overlapped hit points for K = 5.
Although in the previous slide we observed that the fragment detection program will avoid extending consecutive hits, there are some situations (such as a parallel implementation) in which it is better to fuse consecutive hits to avoid spurious or duplicated results.
Let's analyse this fact. For fragments with a large identical section, several overlapping words will be found (see illustration). All these hits are considered potential seeds, or starting points, of fragments. The next step is to extend the seed to identify the real fragments. In this situation, either (a) several seeds will be extended to produce the same fragment, or (b) a clever strategy will discard the consecutive seeds a posteriori once the first one integrates all the seeds into one fragment. In a parallel implementation, this would produce a false estimation of the computational load to be distributed to the different processors, and seeds assigned to different processors would not be identified as close hits.
We have analysed the possibility of extending the seed by using the 2-word matches adopted from Blast, or of enhancing the seed by letting the signal detect its own boundaries, which could be named k-word self-bounded matches.
As previously mentioned, the 2-word hits used with great success in programs such as Blast and FASTA behave well because of the length of the sequences under analysis (genes or proteins). But in the case of genomic sequences, the probability of consecutive words increases greatly and distorts the estimation of the computational load. There are more seeds than needed, since several of these seeds are formed by overlapping words.
In the table and the corresponding graph, the number of 1-word hits is displayed together with the number of 2-word Blast hits used as seeds in that well-known application. The self-bounded (N-hits) approach achieves a notable reduction in the number of seeds. A table with the numbers showing the reduction in the number of seeds using both approaches follows. IMPORTANT: since the 1-word matches are ordered by diagonal, the typical array (of length N+M-1) for the identification of 2-word matches is no longer necessary.
Thus, there are NO memory boundaries. Number of seeds using the 1-word, 2-word Blast and n-hits approaches (log scale).
Here, the accuracy of the results is evaluated. The full set of local alignments between two E. coli genomes was computed (L1 and L2 ≈ 5 Mbp). The full search space is L1×L2, which is reduced to linear space by identifying 1-word hits (an L1+L2 search space) and performing 1,035,923 fragment extensions.
Using a word of length 12, a small identical fragment of length 15 will potentially produce 3x2x1 2-hit seeds, which corresponds to the factorial of the difference between the fragment length and the word size, a number that becomes stratospheric for differences as large as 30. However, the strategy of self-bounded seeds (Big-Hits) reduces the search space to 30%.
More importantly, 99.98% of the final real fragments are identified (99.99% when counting only fragments longer than 20 residues).
Table of common words of length 12 for the two E. coli genomes used in the study. This represents the number of seed points that would have to be processed to identify fragments. A self-bounded strategy that identifies longer seeds by overlapping single hits allows a reduction to 30% of the seeds that need to be examined. The extension of seeds to identify fragments (local alignments) perfectly reproduces the results produced by exhaustive strategies.

FragHits
A fragment is a sub-string present in both sequences whose quality (score) cannot be increased by extending the string in either direction. Quality is measured using a scoring scheme. Currently this scoring scheme is hard-coded (4 for identical symbols, -1 for N-N matches and -4 for different symbols), but a version that uses a user-specified scoring matrix is also available. The program also records the number of identical symbols in the fragment.
It is easy to deduce that a fragment starts and ends with an identical symbol and grows until it reaches a maximum quality or the sequence ends.
To avoid poorly scored fragments, a filtering strategy based on similarity or statistical significance is used. Thus, when the borders of the fragment are detected, the maximum score is computed and a level of similarity is obtained. Alternatively, the p-value is computed for the corresponding fragment.
It is worth noting that the fragment with the highest score is not necessarily the best fragment from the similarity point of view; it is possible, and indeed frequent, that a discarded long fragment contains one or more shorter but well-conserved sub-fragments that satisfy the filtering requirements.
Typical fragment detection algorithms apply quality assessment at the end of fragment detection. In the picture (bottom part), the per-residue score and the similarity level are shown. The growing line represents the accumulated score the fragment would need at each point to be accepted as an interesting fragment (over the similarity threshold). The hill-shaped line corresponds to the real score the fragment obtains at each point. This fragment would be discarded if a similarity threshold were used, because at the end of the fragment the score is maximal but the similarity is not. Computing the filtering criterion at each point instead would produce two shorter but well-conserved fragments (a length-based threshold score can also be used). Although a slight increase in CPU time is needed, better results are obtained.
The following figure illustrates an example of how the actual HSP score is calculated. In this case, hits are used as seed points to extend the fragment. A backward step is also included to extend the fragment in the opposite direction from the hit. The (partial) structure used to store each fragment includes the following fields:

//Score of the fragment. This depends on the score matrix used
uint64_t score;
//Percentage of similarity. This is calculated as score/scoreMax,
//where scoreMax is the maximum possible score
float similarity;
//sequence number in the 'X' file
uint64_t seqX;
//sequence number in the 'Y' file
uint64_t seqY;
//synteny block id
int64_t block;
//'f' for the forward strand and 'r' for the reverse
char strand;
};

Parameters:
 Lmin: minimal fragment length.
 SimThr: similarity threshold (to obtain all fragments, use a low value or zero).
 WordSize: indicates the K value used to calculate the hits (seed points) based on the computed sequence dictionaries.
 FixedLen: indicates whether the length parameter should be taken as the actual value or as a percentage of the length of the input sequences. This parameter makes more sense when comparing multi-sequence FASTA files. Allowed values: 1 (actual value), 0 (to be considered as a percentage).
 Strand: indicates whether the forward or the reverse strand fragments are computed. Allowed values: f (forward), r (reverse).
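As a rough illustration of the hard-coded scoring scheme mentioned above (4 for identical symbols, -1 for N-N, -4 otherwise), the following C sketch scores an already-paired toy fragment and derives its similarity as score/scoreMax; the actual forward/backward extension logic and file handling are omitted, and all data are assumptions for the example.

#include <stdio.h>
#include <string.h>

/* Hard-coded scheme described above: 4 identical, -1 for N-N, -4 otherwise. */
static int score_pair(char a, char b) {
    if (a == 'N' && b == 'N') return -1;
    if (a == b)               return  4;
    return -4;
}

int main(void) {
    /* toy fragment, already paired residue by residue */
    const char *x = "ACGTTTGGACGTNN";
    const char *y = "ACGTTTCGACGTNN";
    const size_t len = strlen(x);

    long score = 0, best = 0;
    size_t best_len = 0, ident = 0, best_ident = 0;
    for (size_t i = 0; i < len; i++) {
        score += score_pair(x[i], y[i]);
        if (x[i] == y[i] && x[i] != 'N') ident++;   /* count identities (N-N not counted here) */
        if (score > best) {                         /* remember the prefix with the best score */
            best = score;
            best_len = i + 1;
            best_ident = ident;
        }
    }

    long score_max = 4L * (long)best_len;           /* maximum achievable score for that length */
    printf("fragment length %zu, score %ld, identities %zu, similarity %.1f%%\n",
           best_len, best, best_ident, score_max ? 100.0 * best / score_max : 0.0);
    return 0;
}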

Additional programs readDict
This program reads the calculated dictionary and produces the following output: <word>: num:<number of occurrences> : <list of occurrences>

Results quality
Although the performance aspects of GECKO's design are crucial, the production of high quality results is equally important. In this section we explain how we evaluated the quality of the results produced by our algorithm against the other applications, using the same parameters. The rationale behind our evaluation was to compare the coverage of the HSPs detected by each algorithm. To avoid biases in the evaluation, we decided to obtain a consensus set of reference HSPs. This set is composed of those HSPs reported by at least half of the reference algorithms. The HSPs produced by GECKO were then mapped onto the reference HSPs and the percentage of coverage was recorded as a measure of result quality. This means that matching positions reported by the consensus HSP reference and not reported by GECKO push the quality down, and vice versa. There are more sophisticated ways of comparing the results, such as only considering coding regions, or qualifying and weighting matches depending on the sequence type or section. However, we decided not to use these methods, as they can introduce noise or biases into the evaluation.
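The coverage measure can be sketched as follows (the interval coordinates and the data layout are toy assumptions; the real evaluation uses the HSP coordinates reported by each tool): the fraction of positions in the consensus reference HSPs that are also covered by the evaluated tool's HSPs.

#include <stdio.h>

typedef struct { long start, end; } Interval;    /* half-open [start, end) on one sequence */

/* Returns 1 if position pos falls inside any interval of the set. */
static int covered(long pos, const Interval *set, int n) {
    for (int i = 0; i < n; i++)
        if (pos >= set[i].start && pos < set[i].end) return 1;
    return 0;
}

int main(void) {
    Interval reference[] = { {100, 200}, {500, 650} };   /* toy consensus HSPs   */
    Interval evaluated[] = { {120, 210}, {500, 600} };   /* toy HSPs of one tool */
    const int nr = sizeof reference / sizeof reference[0];
    const int ne = sizeof evaluated / sizeof evaluated[0];

    long total = 0, hit = 0;
    for (int i = 0; i < nr; i++)
        for (long p = reference[i].start; p < reference[i].end; p++) {
            total++;                                      /* every reference position   */
            if (covered(p, evaluated, ne)) hit++;         /* also reported by the tool  */
        }

    printf("coverage = %.2f%% of %ld reference positions\n", 100.0 * hit / total, total);
    return 0;
}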
We have performed different experiments for closely and remotely related sequences in order to thoroughly study the results quality.

Closely related sequences
Following the previously described procedure with the results of the pairwise test described in the paper, the evaluation determined that in our experiments GECKO detected 3% more HSPs than the consensus set. Moreover, GECKO obtained a larger dataset while maintaining identity values over 65%, thus representing the identification of additional statistically-significant HSPs.
In the following sections we summarise some examples of the performed experiments, showing the HSPs we report that the others do not, and vice versa. We will also show the alignments of those we do not take into account, to point out that they are not good alignments. Additionally, we will also provide an example

TLCV vs. TYLCV-lr2
As explained before, the result of the coverage-based comparison is the following:

HS-RN
After analysing the results both in terms of coverage and identity, we can say that our results are superior in both respects for the presented studies. This makes us believe that GECKO will behave similarly in other comparisons.

Alignathon dataset study
First, we would like to stress that GECKO was designed to calculate High-Scoring Segment Pairs (HSPs) which, by definition, do not contain gaps, in contrast to the gapped alignments contained in the specified MAF file. Thus, to address the referee's point regarding GECKO validation we instead used the simTest collection of sequence files, downloaded from: http://compbio.soe.ucsc.edu/alignathon/data/simTest.seqs.tar.gz. We analysed the results of simMouse and simRat sequence comparisons using MUMmer and GECKO with the length parameter set to 20. We have used this parameter because after inspecting the MAF file we noticed that there was a significant number of short alignments.
In the comparison of both sequences from chr0, MUMmer obtained mostly short alignments with an average length of 24.84. In contrast, GECKO obtained much longer alignments with an average length of 95.76, and a maximum alignment length of 1831 versus a maximum of 107 reported by MUMmer. More info regarding alignment length distribution is provided in the image and