How genome complexity can explain the difficulty of aligning reads to genomes
© Phan et al. 2015
Published: 7 December 2015
Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for the understanding and quantification of this relationship.
We investigated several measures of complexity and found that length-sensitive measures of complexity had the highest correlation to accuracy of alignment. In particular, the rate of distinct substrings of length k, where k is similar to the read length, correlated very highly to alignment performance in terms of precision and recall. We showed how to compute this measure efficiently in linear time, making it useful in practice to estimate quickly the difficulty of alignment for new genomes without having to align reads to them first. We showed how the length-sensitive measures could provide additional information for choosing aligners that would align consistently accurately on new genomes.
We formally established a connection between genome complexity and the accuracy of short-read aligners. The relationship between genome complexity and alignment accuracy provides additional useful information for selecting suitable aligners for new genomes. Further, this work suggests that the complexity of genomes sometimes should be thought of in terms of specific computational problems, such as the alignment of short reads to genomes.
Advances in next-generation sequencing technologies have driven the development of computational approaches to address the problem of aligning short reads to reference genomes [1–10]. Even so, the alignment problem remains challenging due to the presence of genomic repeats that are much longer than reads. Yu et al.  evaluated alignment performance of several aligners on repetitive regions selected from CpG islands and concluded that long repeats seriously degraded alignment performance.
Researchers generally believe that the difficulty of aligning short reads is very much related to the complexity of genomes; it is easier to misalign short reads when the genomes of interest have long and complicated repeat patterns. While there has been an interest in measuring complexity of strings, recent attention has been focused on complexity of DNA sequences [12–15]. Whiteford et al.  utilized k-mer frequencies as a way to visualize and understand the complexity of genomes. Kurtz et al. , similarly, annotated plant genomes with k-mer frequencies so that repeat structures and characteristics can be easily visualized. With the same approach to understanding genome complexity, Chor et al.  analyzed k-mer spectra of over 100 species and observed multimodal spectra for regions with specific CG content characteristics. Unfortunately, these measures cannot be easily quantified and immediately adopted to study how complexity affects the difficulty of short-read alignment.
In a recent study, Becher et al.  introduced a measure known as the I-complexity, which seems most promising as a tool to correlate sequence complexity to difficulty of short-read alignment. The authors showed several interesting properties of this measure, including its closeness to the Lempel-Ziv complexity and its efficient computation in linear time. The I-complexity can be easily adopted for our purpose and will be used among others to understand how genome complexity affects the difficulty of short-read alignment.
In this paper, we propose measures of complexity that are best suited for the analysis and understanding of the difficulty of short-read alignment and how such measures might be helpful in selecting appropriate aligners for new genomes. The inspiration for this work lies in the observation that complex repeat structures in DNA that affect the performance of computational tasks are length specific. For instance, in finding regulatory motifs in DNA sequences, repeated structures of interest are around 8-25 characters long. On the other hand, in aligning reads to genomes, such repeats probably have little effect on the performance of aligners. This means that measures such as the I-complexity that are general and not length-specific might not be best.
I-complexity and D-complexity
where LCP [i] is the length of the longest common prefix of the suffixes of g starting at positions S[i] and at S[i - 1], and S is the suffix array of g. The suffix array S of g stores implicitly lexicographically sorted suffixes of g; i.e. for i < j, gS[i]···|g| (the suffix starting at index S[i]) is lexicographically smaller than gS[j]···|g| (the suffix starting at index S[j]).
The somewhat non-intuitive definition of the I-complexity has some advantages. The authors established upper and lower bounds for I(g), and showed that it was close to the Lempel-Ziv complexity of g. Further, it can be computed in linear time because the suffix and LCP arrays can be constructed in linear time [16, 17].
where f(x) denotes the number of occurrences of x in g. To be precise, D(g) is equal to the total number of distinct substrings divided by the total number of substrings of g. D(g) can be computed in linear time, due to the following lemma.
Proof Suppose that a substring s of g occurs exactly k times. Then, there will be a block of size k in the suffix array that corresponds to k suffixes that have s as the common prefix. More specifically, assume that s is the common prefix of the suffixes of g starting at positions . We will call the occurrence of s at position S[i] its representative occurrence, and its occurrences at its repeat occurrences.
Each repeat occurrence of s is a prefix of the longest common prefix of the suffixes starting at . This means, each repeat occurrence of s is accounted for uniquely by the values of .
If we focus on a position, for example i + 1, we can see that the longest common prefix between S[i + 1] and S[i] (let's call it ) accounts uniquely for j repeat occurrences, namely . One of these is s; the rest are repeat occurrences of other substrings. Thus, each repeat occurrence is accounted for uniquely in some entry of LCP and each entry of LCP accounts uniquely for some repeat occurrences. That implies that accounts for the total repeat occurrences of all substrings of g.
Further, is the total number of substrings of g, since there are exactly i substrings starting at position i. Thus, if we remove all repeat occurrences from the total number of substrings, we will get precisely the total number of representative occurrences. This means .
Length-sensitive measures of complexity
where f (x) is the number of occurrences of x in g. D k and R k measure the rates of distinct substrings and repeats, respectively, of length k. R k and D k are not exact "opposites" because R k does not account for non-repeats, whereas D k does. R k is related to the function C(k, r) proposed by Whiteford et al. . C(k, r) is the count of k-mers repeating exactly r times. Therefore, . D k and R k can be computed in linear time and space using suffix and LCP arrays.
Proof A k-substring of g must start at an index between 1 and |g|-k +1. Further, if LCP [j] <k, the k-prefix of the suffix starting at S[j] is different from the k-prefix of the suffix starting at S[j - 1]. Thus, each j for which S[j] ≤ |g| - k + 1 and LCP [j] <k represents exactly one distinct k-substring.
On the other hand, if S[j] > |g|-k+1 or LCP [j] ≥ k, then the k-substring starting at S[j] does not exist or is not distinct. Since LCP runs through all positions of g, all distinct k-substrings are uniquely accounted for.
Lemma 3 , where I k is the set of intervals [i, j]'s, where i ≤ j, such that
1 LCP [u] ≥ k for i ≤ u ≤ j
2 LCP [i - 1] <k unless i = 1
3 LCP [j + 1] <k unless j = |g|
Proof A k-repeat is a substring x of length k, with f (x) >1. Since the suffix array S is sorted lexicographically, S forms consecutive runs of k-repeats, which are k-prefixes of the suffixes stored implicitly by S. More specifically, each interval [i, j] ∈ I k corresponds to all occurrences of exactly one k-repeat. The number of occurrences for each k-repeat is exactly j - i + 2.
I k can be computed in linear time by scanning the LCP array once. Note that the index of LCP runs from 1 to |g|, and LCP  = 0.
Relating genome complexity to difficulty of aligning short reads to genomes
I, D, D k , and R k provide quantitative measures of complexity for each genome. Intuitively, the more distinct substrings a reference genome has (i.e. high values of I, D, and D k ), the easier to align reads to the reference genome. Conversely, the more long repeats the genome has, the more difficult to align reads to it correctly (since the probability of mismatching of a read with a wrong substring is higher.)
We measure the performance of an alignment algorithm using precision and recall, where precision is defined as the fraction of aligned reads that are correct (i.e. number of correctly aligned reads divided by the total number of aligned reads); and recall is defined as the fraction of reads that are correctly aligned (i.e. number of correctly aligned reads divided by the total number of reads). These definitions were also used by Liu et al. .
To correlate complexity values to difficulty of alignment, for each measure of complexity, we computed the linear correlation between the complexity values of sequences in a diverse data set including 100 genomic sequences, and the actual performance for each of 10 popular aligners. A good measure of complexity will correlate highly with alignment performance.
Aligners and genomic data
Information on the selected 100 genomic sequences [Part 1].
Lactobacillus johnsonii NCC 533,
Arabidopsis thaliana DNA chr. 4, long arm
Toxoplasma gondii RH, genomic DNA chr. Ib
Listeria welshimeri serovar 6b str. SLCC5334
Eimeria tenella chr. 1, ordered contigs
Bacillus halodurans C-125 DNA,
TPA: Aspergillus nidulans FGSC A4 chr. II
Caenorhabditis elegans Bristol N2 genomic chr., I
Ostreococcus tauri WGS project CAID00000000 data, contig chr. 12
Canis lupus familiaris chr. 1
Canis lupus familiaris chr. 38
Cryptococcus neoformans var. neoformans B-3501A chr. 4
Drosophila pseudoobscura pseudoobscura strain MV2-25 chr. 3
Rattus norvegicus strain BN/SsNHsdMCW chr. 20
Gallus gallus chr. 18
Oryza sativa (indica cultivar-group) chr. 9
Dictyostelium discoideum AX4 chr. 3
Drosophila yakuba strain Tai18E2 chr. 2L
Drosophila yakuba strain Tai18E2 chr. 2R
Aspergillus fumigatus Af293 chr. 1
Bos taurus chr. 1
Bos taurus chr. 25
Trypanosoma brucei brucei strain 927/4 GUTat10.1 chr. 10
Mus musculus chr. 1
Macaca mulatta chr. 16
Equus caballus chr. 1
Plasmodium vivax chr. 11
Taeniopygia guttata chr. 1
Taeniopygia guttata chr. 13
Pongo abelii chr. 22
Fusarium graminearum PH-1 chr. 2
Gibberella moniliformis 7600 chr. 3
Fusarium oxysporum f. sp. Lycopersici
4287 chr. 4
Information on the selected 100 genomic sequences [Part 2].
Phaeodactylum tricornutum CCAP
1055/1 chr. 9
Thalassiosira pseudonana CCMP1335
Saccharomyces kluyveri NRRL Y-12651
Sorghum bicolor chr. 8
Sorghum bicolor chr. 10
Zea mays chr. 1.
Oryctolagus cuniculus chr. 10
Sus scrofa chr. 18.
Drosophila virilis strain 15010-1051.88
Glycine max chr. 17
Callithrix jacchus chr. 20
Ovis aries chr. 22
Ovis aries chr. 23
Nasonia vitripennis chr. 3
Medicago truncatula chr. 5.
Medicago truncatula chr. 6.
Macaca fascicularis chr. 1
Macaca fascicularis chr. 19
Borrelia hermsii DAH,
Scheffersomyces stipitis CBS 6054 chr. 2, complete sequence.
Acaryochloris marina MBIC11017,
Nostoc punctiforme PCC 73102,
Phaeodactylum tricornutum CCAP
1055/1 chr. 11, complete sequence.
Pedobacter heparinus DSM 2366,
Chitinophaga pinensis DSM 2588,
Bacillus megaterium DSM319,
Achromobacter xylosoxidans A8,
Acetobacterium woodii DSM 1030,
Actinoplanes sp. SE50/110,
Desulfitobacterium dehalogenans ATCC 51507,
Acidovorax sp. KKS102,
Candida glabrata strain CBS138 chr. H complete sequence.
Bradyrhizobium sp. ORS278,complete sequence.
Schizosaccharomyces pombe chr. III, complete sequence
Information on the selected 100 genomic sequences [Part 3].
Zygosaccharomyces rouxii strain CBS732 chr. A complete sequence.
Oryzias latipes DNA, chr.10, strain: HdrR.
Drosophila melanogaster unordered unlocalized genomic scaffolds (chrUn)
Aliivibrio salmonicida LFI1238 chr. 1
Citrobacter rodentium ICC168,
Trypanosoma brucei gambiense DAL972 chr. 11, complete sequence
Babesia microti strain RI chr. III, complete sequence.
Clostridiales sp. SM4/1 draft genome.
Leishmania braziliensis MHOM/BR/75/M2904, chr. 6
Danio rerio genome assembly, chr1
Gorilla gorGor3.1 chr. 1
Schistosoma mansoni strain Puerto Rico chr. 7,
Torulaspora delbrueckii CBS 1146 chr. 3,
Torulaspora delbrueckii CBS 1146 chr. 8,
Tetrapisispora blattae CBS 6284 chr. 4,
Kazachstania naganishii CBS 8797 chr. 1,
Arabidopsis thaliana chr. 1, complete sequence.
Human herpesvirus 4 complete wild type genome.
Viruses, dsDNA viruses
Oryza sativa Japonica Group DNA, chr. 1, complete sequence, cultivar: Nipponbare
Oryza sativa Japonica Group DNA, chr. 4, complete sequence, cultivar: Nipponbare
Oryza sativa Japonica Group DNA, chr. 5, complete sequence, cultivar: Nipponbare
Oryza sativa Japonica Group DNA, chr. 6, complete sequence, cultivar: Nippon bare
Oryza sativa Japonica Group DNA, chr. 7, complete sequence, cultivar: Nipponbare
Oryza sativa Japonica Group DNA, chr. 8, complete sequence, cultivar: Nipponbare
Oryza sativa Japonica Group DNA, chr. 10, complete sequence, cultivar: Nipponbare
Populus trichocarpa linkage group I, whole genome shotgun sequence
Homo sapiens chr. 12 genomic contig, GRCh37.p13 Primary Assembly
Homo sapiens chr. 13 genomic contig,
GRCh37.p13 Primary Assembly
Homo sapiens chr. 3 genomic contig, GRCh37.p13 Primary Assembly
Homo sapiens chr. 7 genomic contig, GRCh37.p13 Primary Assembly
Homo sapiens chr. 15 genomic contig, GRCh37.p13 Primary Assembly
Homo sapiens chr. 1 genomic contig, GRCh37.p13 Primary Assembly
Homo sapiens chr. × genomic contig, GRCh37.p13 Primary Assembly
We selected 10 popular short-read aligners that employ different algorithmic techniques and indexing structures: SHRiMP2 , mrFAST , SeqAlto , GASSST , Bowtie2 , BWA-SW , SOAP2 , CUSHAW2 , Masai , and Smalt .
We used default parameters to run these programs because these aligners appeared to perform well and consistent over the 100 genomes at such settings.
It is not possible to compute the number of correctly aligned reads for real reads because positions of real reads in reference genomes are not known. Consequently, precision and recall cannot be computed using real reads. For this reason, we simulated reads for each genome, 2x coverage of reads at lengths 50, 75, and 100 using the wgsim program . Reads were generated with default parameters; sequencing error rates equal to 0.5%, 1%, 2%, and mutation rates between 0.1% and 1%, of which 15% are indels. These parameters should be sufficiently realistic for the current sequencing technologies and a large range of organisms.
Overview performance of aligners
Precision and recall averaged across 100 genomes at read lengths 50, 75, 100.
In brief, many of these aligners (e.g. Bowtie2, CUSHAW2, SeqAlto) performed similarly accurately on the tested 100 genomes. Without additional information, it can be difficult to decide between these high-performing aligners. It would be useful if we could predict how accurately they perform on new genomes. To explore the aligners' performance on new genomes, we will examine the correlation between various measures of genome complexity and alignment accuracy.
Correlation between genome complexity and alignment performance
We can see that there is a value of k for which D k that correlated highly with performance for both precision and recall, across all read lengths of 50, 75, and 100. For most aligners, the correlation coefficients were approximately 0.95. The only noticeable exception is for GASSST whose correlation coefficients were comparatively lower than those of the others. We think the explanation for this is in GASSST's peculiar performance as we reported earlier, whereby its recalls were above 1 for many of the 100 genomes. Additionally, we could see that when recalls were comparative lower for mrFAST and BWA-SW at read length 50, their correlations were also comparative lower than the other aligners'. It is important to note that some aligners were designed to work optimally with longer reads and consequently do not work very well with shorter reads. One can conclude that if aligners perform predictably in their comfort zones, D k (or R k ), is a good complexity measure that correlates highly to the accuracy of aligning reads to genomes.
A close examination of the results shows that the value of k for which D k correlated highest with performance was very close to the read length. For example, at read length 100, D100 had the highest correlations across aligners; at both read lengths 50 and 75, D50 had the highest correlations, although D25 also had very high correlations at read length 50. Thus, the most accurate measure of complexity to understand the difficulty of short-read alignment is length sensitive. Intuitively, this is because repeats of length close to 75, for example, influence the accuracy of the alignment of reads of length 75.
The fact that the best value of k is less than or equal to read length, and not larger than it implies that D k accounts for approximate repeats. To see this, observe that a 75-mer repeat might not be part of a 100-mer, but surely contains several 50-mer repeats (26 of them, to be precise). This means that D100 neglects to account for several 50-mers, whereas D50 accounts for all of these, and these 50-mer repeats directly have an influence of the accuracy of aligning reads of length 75. This is probably why D50 had a better correlation profile to complexity than D100 did. The fact that D k accounts for approximate repeats longer than k can be explained formally by the so-called q-gram lemma, which states that two sequences of length k with edit distance e share at least n - q + 1 - qe q-grams. An estimate of complexity involving counting approximate repeats might give better correlation. However, the computing of approximate repeats is computational expensive compared to linear time computation of D k . The best computation of approximate repeats we know of using a lossless filter  has an average running time of O(n2). For long genomes, this running time is not desirable. Since D k and R k already correlated quite highly (approximately 0.95) for many aligners, a more efficient running time (linear instead of quadratic) seemed to be a better trade-off than a potentially better correlation.
Predicting aligner performance for unknown genomes
The existence of many short-read aligners makes it challenging for researchers to pick the best one for their genomes of interest. Surveys such as  compared popular software packages on a few known genomes and served as a good starting point. But when it comes to adopt a particular software package, the decision seems to be a mixture of many factors including the authors' reputation, past familiarity with the software, its alignment accuracy, its quality and ease of use, its resource usage (running time and RAM), and recommendations of fellow researchers. Our focus is on accuracy, defined in terms of precision and recall.
To remove this uncertainty and make more informed decisions, we might additionally incorporate correlation between genome complexity and accuracy. To illustrate this strategy, consider the bottom figure in Figure 7 shows the correlation between D100 (since it is the best at read length 100) of the aligners' precision and recall across the 100 genomes. Comparing the performance of the high-performing aligners identified in the previous step (those other than MrFast and SOAP2), we can see that they have different levels of correlations. For instance, Bowtie2 had noticeably lower correlations (0.89 for precision and 0.90 for recall) than CUSHAW2 for both precision and recall). Thus, although Bowtie2 and CUSHAW2 had similar accuracies for the 100 genomes, we expect that CUSHAW2 will more likely have similar accuracies for unknown genomes.
The effect of k on D k and R k
The transition from near-uniform distributions of D12 (and R12) to very skewed distributions of D k (and R k ), for k ≥ 50, might explain for the low correlation of D12 (and R12) to alignment accuracy. Thus, we can stipulate that right choice of k is essential for correlating complexity and alignment accuracy. The right choice of k appears to be similar to read length as we have observed.
We demonstrated that length sensitive measures were suitable for studying how genome complexity affected the of short-read alignment. This work has implications for theoretical studies of genome complexity, as well as for comparing alignment methods, and designing cost-effective experiments to assemble genomes. Beyond short-read alignment, these measures should be useful for problems such as short-read assembly, which are affected by genomic repeats.
This method depends on the simulation of reads with known alignment locations, from which we can compute the number of correctly aligned reads for the calculation of precision and recall. With real reads, we cannot know this information. Better simulation of reads will improve the predictive power of this method.
This work is partly supported by NSF grant CCF-1320297.
Publication charges for this work were funded by NSF grant CCF-1320297 to VP.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 17, 2015: Selected articles from the Fourth IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S17.
- David M, Dzamba M, Lister D, Ilie L, Brudno M: SHRiMP2: sensitive yet practical short read mapping. Bioinformatics. 2011, 27 (7): 1011-1012.View ArticlePubMedGoogle Scholar
- Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, et al: Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet. 2009, 41 (10): 1061-1067.PubMed CentralView ArticlePubMedGoogle Scholar
- Mu JC, Jiang H, Kiani A, Mohiyuddin M, Asadi NB, Wong WH: Fast and accurate read alignment for resequencing. Bioinformatics. 2012, 28 (18): 2366-2373.PubMed CentralView ArticlePubMedGoogle Scholar
- Rizk G, Lavenier D: GASSST: global alignment short sequence search tool. Bioinformatics. 2010, 26 (20): 2534-2540.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with bowtie 2. Nat Methods. 2012, 9 (4): 357-359.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate long-read alignment with burrows-wheeler transform. Bioinformatics. 2010, 26 (5): 589-595.PubMed CentralView ArticlePubMedGoogle Scholar
- Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24 (5): 713-714.View ArticlePubMedGoogle Scholar
- Liu Y, Schmidt B: Long read alignment based on maximal exact match seeds. Bioinformatics. 2012, 28 (18): 318-324.View ArticleGoogle Scholar
- Siragusa E, Weese D, Reinert K: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 2013, 41 (7): e78-PubMed CentralView ArticlePubMedGoogle Scholar
- Ponstingl H, Ning Z: SMALT-a new mapper for DNA sequencing reads. F1000 Posters. 2010, 1: 313-Google Scholar
- Yu X, Guda K, Willis J, Veigl M, Wang Z, Markowitz MD, et al: How do alignment programs perform on sequencing data with varying qualities and from repetitive regions?. BioData Min. 2012, 5 (1): 6-PubMed CentralView ArticlePubMedGoogle Scholar
- Becher V, Heiber PA: A linearly computable measure of string complexity. Theoretical Computer Science. 2012, 438: 62-73.View ArticleGoogle Scholar
- Chor B, Horn D, Goldman N, Levy T, Massingham T: Genomic DNA k-mer spectra: models and modalities. Genome Biology. 2009, 10 (10): R108-PubMed CentralView ArticlePubMedGoogle Scholar
- Kurtz S, Narechania A, Stein JC, Ware D: A new method to compute k-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008, 9: 517-PubMed CentralView ArticlePubMedGoogle Scholar
- Whiteford NE, Haslam NJ, Weber G, Prugel-Bennett A, Essex JW, Neylon C, et al: Visualizing the repeat structure of genomic sequences. Complex Systems. 2008, 17 (4): 381-398.Google Scholar
- Kärkkäinen J, Sanders P, Burkhardt S: Linear work suffix array construction. J ACM. 2006, 53 (6): 918-936.View ArticleGoogle Scholar
- Kasai T, Lee G, Arimura H, Arikawa S, Park K: Linear-time longest-common-prefix computation in suffix arrays and its applications. Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching Lecture Notes in Computer Science. 2001, 181-192.View ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al: The sequence alignment/map format and samtools. Bioinformatics. 2009, 25 (16): 2078-2079.PubMed CentralView ArticlePubMedGoogle Scholar
- Peterlongo P, Sacomoto GA, do Lago AP, Pisanti N, Sagot MF: Lossless filter for multiple repeats with bounded edit distance. Algorithms Mol Biol. 2009, 4: 3-PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics. 2010, 11 (5): 473-483.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.