Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome

Background The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high throughput-sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. Results We observe that the proportion of non-singletons k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values for k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. Conclusion Read mappability as measured by the proportion of singletons increases steadily up to the length scale around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read lengths much slower than linear. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.


Introduction
Many applications of next-generation-sequencing (NGS) in human genetic and medical studies depend on the ability to uniquely align DNA reads to the human reference genome (1)(2)(3)(4)(5)(6).
This in turn is related to the level of redundancy caused by repetitive sequences in the human genome, well known in the earlier human whole-genome shotgun sequencing (7)(8)(9), and the read length k.When the read length is too short, it is theoretically impossible to construct a sequence with size comparable to the human genome that does not contain any repeats of windows with k bases.It has been shown using graph theory that the shortest DNA sequences avoiding any repeats of k-mers can be constructed by packing all unique k-mers shifting one position at the time (10).The number of different k-mer types is 4 k /2 (k even) or (4 k + 2 k )/2 (k odd) if both a subsequence and its reverse complement are considered to belong to the same k-mer type.Solving 4 k /2 ≈ 3 × 10 9 leads to the conclusion that read length k must be at least greater than 17 for all reads to be uniquely alignable to a hypothetical reference sequence that has the size of the human genome.However, in reality the human genome did not evolve by a first principle to be consistently compact and incompressible.Redundant sequences in the human genome have resulted from duplication, insertion of transposable elements, and tandem repeats due to replication slippage, and more than half of the human genome can be traced to repetitive transposable elements.
Although locally duplicated sequences can be deleterious (11) or disease-causing (12), a certain level of redundancy is crucial for biological novelty and adaptation (13)(14)(15).For higher eukaryotes, a slower removal of the deleterious repeats due to low mutation rates and smaller population sizes (16) lead to a higher level of genome-wide redundancy.This in turns may lead to more protein sequences with internal repeats and perhaps new fold or new functions such as the case for connection tissue, cytoskeletal, and muscle proteins (17).
Therefore, k=17 is a very unrealistic estimation of the minimal read length required for a perfectly successful NGS reads alignment.Accordingly, NGS technologies utilize reads with various larger lengths: k=70 for Complete Genomics, 35 ∼ 85 for ABI SOLiD, 75 ∼ 150 pairend for Illumina HiSeq, 400 for Ion Torrent PGM, 450 ∼ 600 for Roche 454 GS FLX Titanium XLR70, etc. (18).Currently, the technology is pushing towards read lengths of k=1000 (e.g., Roche 454 GS FLX Titanium XL+) or even k=10000 (19).Needless to say, the longer the read length, the higher the chance that reads can be aligned to the reference genome.Ultimately, high quality genome will be obtained by a mix of technologies.To find this optimal mixture, a quantitative understanding of the repeat structure of the human genome is required.
Our analysis of the repeat structure is different from some earlier investigations of read mappability (3,5).In these studies, the actual reads from the current sequencing technology are used.There are two shortcomings in these approaches: (i) it is impossible to extrapolate the result to read lengths which is beyond the current technology; (ii) a certain proportion of reads are never mappable because the corresponding regions in the reference genome are not finished.Using the existing reference genome makes it possible to treat k-mers as hypothetic reads whose length k can be as long as possible, and unfinished regions can be excluded from the analysis.
In this paper we quantitatively address the question of how alignment improves for greater read lengths.To this end we artificially cut the human reference genome into overlapping k-windows (k-mers, k-tuples, or k-gram (20)), each considered to be possible a "read", and count the number of appearances (or "tokens", borrowing a terminology from linguistics (21)) of each k-mer type across the full reference sequence.Those k-mer types that appear in the genome only once (f =1) are labeled singletons, and the remainder (f > 1) are non-singletons.
Intuitively, the percentage of non-singleton reads is expected to decrease with increasing read length k.Obtaining the functional form of this decay enables us to predict the percentage of difficult-to-align reads at longer k's.
These seemingly simple calculations already encounter a "big data" problem on a regularsized computer.In particular, storing counts in a hash table requires large amount of RAM.Suppose a k-mer needs K byte to store (e.g.K=k/4), a hash table to count all k-mers in the human genome would require 3K GByte RAM, which quickly becomes implausible when k is greater than 100.Using a solution that is similar to other applications where the hard disk (22)(23)(24) or computing time (25) is traded with RAM, we use a new public-domain program DSK which utilizes the less expensive hard disk or longer CPU time to compensate a lack of RAM (26).Other efficient k-mer count procedures have been proposed in (27)(28)(29).
The mathematical relationship between the fraction of non-singleton k-mers and k predicts the fraction of putative reads that can be mapped uniquely.Another statistic of interest is the distribution of k-mer frequencies when k is fixed at a given value.This distribution has a head and a tail, a head for low frequency k-mers (including singletons), and a tail for high frequency k-mers.In the situation when these distributions exhibit long-tails (30) and power-law-like trends (31), thus fitting a straight line in log-log scale, the head end is best characterized by the frequency distribution (21), whereas the tail end is better characterized by the rank-frequency distribution commonly related to Zipf's law in quantitative linguistics (32).Our analysis of these distributions provides information on the level of redundancy in the human genome at various scales.
Locating human genome regions that cannot be uniquely mapped by sequencing reads (which can be called "non-uniqueome" following the term "uniqueome" used in (3)) is important in any NGS-based studies.These regions may contribute the most to the false-positive and false-negative variant callings.These may also be hotspot for other structural variations such as indel and copy-number-variation (33,34).We will specifically examine the location of some of these redundant regions at the k = 1000 level.

Genome sequence data
The human reference genome GRCh37 (hg19) was downloaded from UCSC's genome browser (http://genome.ucsc.edu/).The intermittent strings of N's (marking unfinished basepairs that cannot be sequenced with the applied technology (35)) are used to partition the 22 autosomes and 2 sex chromosomes into 322 subsequences, and k-mers overlapping two chromosome partitions are not allowed.
For an additional analysis on repeat-filtered sequences, strings of lowercase letters in the reference genome (which mark repetitive sequences identified by the RepeatMasker program, http://www.repeatmasker.org/) are used to partition the genome into 3456905 subsequences with all transposable elements removed.

Counting k-mers
A k-mer type includes both the direct and the reverse complement substring; AAGC/GCTT is an example of such a 4-mer type.We use a state-of-art k-mer counting program DSK (26) (http://minia.genouest.org/dsk/),version 1.5031 (March 26, 2013).Most of the DSK calculations were carried out on a Linux computer with 48 GByte RAM and around 900 GByte disk space, except a calculation at k=1000 which was run on another Linux computer with the same RAM but 30 TByte of disk space.The parameter setting of DSK was determined by a trial-and-error process.The output of the DSK program consists of a list of k-mers.The BLAT program from UCSC's genome browser is used to map frequent k-mers back to the reference genome.
Frequency distribution, rank frequency plot, and data fitting Suppose a k-mer type appears in the genome f times (f is frequency, or copy number); frequency distribution (FD) is the number of k-mer types with frequency f .Individual k-mer types can be ranked by their f , highest f ranks number 1, second highest f ranks number 2, etc.The ranked f 's of k-mer types as a function of rank r is the rank-frequency distribution (RFD).

Percentage of non-singleton reads vs. read length: piece-wise power-law function
In Figure 1 we show the percentage of non-singleton reads/tokens (p ns ) as a function of k-mer length k in log-log scale.The p ns is 28.35% at k=20, 8.16% at k=50, 4.26% at k=80, 3.40% at k=100, 2.44% at k=150, 1.33% at k=400, 1.18% at k=500, and 0.82% at k=1000.If k is shorter than the "shortest unique substring" length, which is 11 in the human genome (40), singletons do not exist (i.e., p ns = 100%).
Visual inspection of the trend suggests the use of piecewise power-law function in fitting the data.We fit the points in k = 20-80 and k= 200-1000 ranges separately by linear regressions in the log-log scale: log 10 p ns = a + b log 10 k (or log p ns ∼ log k).The fitted (â, b) is (1.58366, -1.5478) and (-0.4371, -0.5495) for the two segments, equivalent to p ns = 38.34/k 1.548 and p ns = 0.365/k 0.55 .The steep decay in the first segment shows a stronger increase of the amount of uniquely mappable sequences with read length, which implies that obtaining read lengths of at least around 100 is more cost-efficient with respect to reducing the amount of nonmappable reads.Of course, longer reads have extra benefits such as more robust alignments in the presence of polymorphisms or the ability to determine the length of longer repeat polymorphisms.The power-law function also indicates that the reduction of non-specific, difficult-to-align reads with longer read length is not linear.
If we assume our fitting function can be extrapolated to larger k's for which a direct analysis of k-mer frequencies is restricted by computational constraints, the proportion of non-singleton reads can be predicted.For example, this leads to the prediction of a 0.2% non-singleton rate at the 10kb read length.
It is known that repetitive sequences such as transposable elements and heterochromatin sequences create considerable obstacle in NGS alignment (41).Though transposable elements may exhibit subtle correlation with functional units in the genome (42), it is generally assumed that their biological role is indirect.Accordingly, we also looked at the non-singleton k-mer percentages in RepeatMasker filtered sequences (Figure 1).As expected, the percentage of uniquely mappable sequence is much higher than that in the all-inclusive sequence for short k-mers (e.g.k < 100).Interestingly, the differences between the two disappear for longer kmers (e.g.k=500).A note of caution is that 89% of these RepeatMasker-filtered subsequences are shorter than 1kb, making the statistics less reliable at longer k's.

Maximum k-mer frequency decreases with k slowly
Another measure of the level of redundancy at length scale k is the maximum frequency (max(f )) of k-mer types.For example, base A/T homopolymers of length 20 appear most often with 898647 copies; at k=400, AT repeats have more copy numbers (f =150) than other 400-mers; the max(f ) for k=1000 is equal to 24 for a sequence which is not filtered by the RepeatMasker.The max(f ) as a function of k is shown in Figure 2 in log-log scale.
For RepeatMasker-filtered sequences, max(f ) quickly decays below 100 and then falls only slowly, indicating that RepeatMasker usually finds shorter repeats.At k= 200-500, the k-mer with the max(f ) ∼ 50 is a low-complexity sequence, with internal repeats of GGGGGGAACAGC-GACAC/GTGTCCGCTGTTCCCCCC.Despite its high prevalence, this low-complexity sequence is not masked by RepeatMasker in the human reference genome.
Fitting of the linear regression, log 10 max(f ) = a + b log 10 k (or log max(f ) ∼ log k), leads to (a, b)= (8.99, -2.62).Extrapolating this regression to longer k's predicts that at k=2724, max(f )=1.This prediction should be viewed with caution as max(f ) is mainly determined by "outlier" events thus un-reproducible in principle.
Frequency distributions at fixed k values exhibit power-law-like trend: The frequency distribution (FD) describes the distribution of k-mer types according their copy numbers in the genome.When plotted in log-log scale, low-frequent k-mer types and the less redundant portion of the sequence are highlighted.Figure 3 shows five FDs at k=30, 50, 150, 500, and 1000 in log-log scale.The FDs at k=30 and 50 span a wider frequency range, and the power-law trend is obvious.
A similar FD for k=40 in human genome was shown in (43,44), and a slope of −2.3 in linear regression (in log-log scale) in the f = 3-500 range was reported.When we fit the k=50 FD by linear regression in log-log scale, a very similar fitting slope value is obtained (−2.38, for f = 3-200).However, it is clear from Figure 3 that the slopes are steeper for k=150 (−2.7 for f = 2-100), k=500 (−3.5 for f =2-40), and k=1000 (−5.3 for f = 2-19, or −5.9 from f = 2-9), indicating that the slope is not a universal parameter.
From the short read alignment perspective, the long tail at the high copy-numbers shows that many sequences cannot be uniquely mapped at smaller k values (e.g.k= 30, 50).However, the tail is much shortened at k=1000.As expected, the tail for RepeatMasker-filtered sequences at various k values are much shorter (Figure 3, grey lines).
Rank-frequency distributions at fixed k values mostly follows a concave curve in log-log scale Although rank-frequency distributions (RFD's) can be converted to cumulative FD (36), in log-log scale, it zooms in the high-frequency tail of the frequency distribution.Figure 4 shows five RFD at k's from 30 to 1000.While the RFD at k=30 may maintain a power-law or piecewise power-law trend, those at larger k values become more concave.This concave Zipf's curve is commonly observed in city size distributions (45,46).
For RFDs deviating from the Zipf's law, functions with two parameters may be used to account for the concave or convex shape of the curve in log-log scale (36).We found that the quadratic logarithmic function, but not the Weibull function, fits the RFDs well (Figure 5).The Beta rank function usually exhibit "S" shapes (39), whereas the RFD in Figure 4 shows a "Z" shape.This motivated us to use a novel reverse Beta function to fit the data (Figure 5).The "Z" shaped log-log RFD means that if the power-law function is the default functional relationship between frequency and rank, frequencies of the intermediately-ranked k-mers decrease faster than the two tails.The "S" shaped log-log RFD implies the opposite.
Mapping f ≥ 10 1000-mer to the reference genome For k=1000, there are 6107 k-mer types with frequency f larger or equal to 10. Due to the fact that these are overlapping k-mers, they are mapped to only 172 chromosomal regions, each of a few kb (the 172 locations, number of high-frequency 1000-mers, and the distance from the previous chromosome regions are included in Supplementary Table S1).

DISCUSSION
Long k-mers in the reference genome as surrogate for sequencing reads: The k-mer distribution has many application in sequence analysis, such as measuring similarity between two genomes (56), correcting sequencing error (57), finding repeat structures (58), determining the feasibility of gene patents (59).In many applications, only short k-mers are considered to be relevant, such as k = 6 (60), k ≤ 7 (61), k=8 (62), k=11 (63).This paper essentially uses long k-mers taken from the reference genome as surrogate for reads from future NGS technologies.Computationally speaking, counting long k-mers is more challenging and we are not aware of any prior publications on the long k-mer distributions in the human genome for k as long as 1000.
As compared to other papers on mappability of genome sequencing reads (3,5), our more theoretical approach has the advantage of being able to discuss long reads (e.g.k=1000) where such data is not available from the current NGS technology.Our approach also separates the two causes of the unmappability: one due to the unfinished sequence in the reference genome and another due to the redundancy in the finished sequences.The unfinished bases are mainly located in the centromeres, short arms of acrocentric chromosomes and other heterochromatic regions, and rich in repetitive sequences.If we always treat this unfinished sequences (total 234 Mbases) to be non-singletons regardless of k, p ns would flatten out around 0.1 (see Figure 1).

A baseline knowledge of redundancy of the human genome at length k level:
Figures 1-3 provides a baseline knowledge of the redundancy of the human genome at the k-mer level.Our results give a more quantitative description on the effect of read length k on the mappability of reads from the finished region of the human genome.
Reference assembly is easier than de novo assembly, and our approach does not directly apply to de novo sequencing "assemblability".However the mappability in reference assembly and assemblability in de novo assembly are closely related, as repetitive sequences cause problems in both situations (64).The current de novo assemblies still do not perform consistently (65,66) and a quantitative assessment of the impact of repetitive sequences on reference assembly could be a useful piece of information for de novo assembly as well.Note that some discussion on k-mer-based assembly actually refers to k ′ -mer (k ′ << k) (67,68).
Highly redundant regions at k = 1000 level and copy-number-variation regions: The chromosome 1 and X regions which we have identified by showing at least 10 copy numbers of 1000-mers are discussed in the literature as regions with common copy-number-variations (CNV).CNV in the 1q21.1 region, if not NBPF-specific, has been linked to congenital cardiac defect (69)(70)(71), autism (72,73), mental retardation (74), developmental abnormalities (75), schizophrenia (76,77), and neuroblastoma (78).With so many abnormalities mapped to this region, these are collectively called the chromosome 1q21.1 duplication syndrome in the Online Mendelian Inheritance in Man (OMIM 612475).The Xq23 region, if not macrosatellite DXZ4 specific, has been identified as likely CNV regions linked to developmental and behavioral problems (79).Chromatin configuration at DXZ4 region is reported to differ between male melanoma cells and normal skin cells (80).The Xq24 region if not the CT47A gene is listed as a candidate CNV region for intellectual disability (81), mental retardation (82), etc.
A well-known mechanism for CNV formation is the non-allelic homologous recombinations (NAHR) between repetitive elements (83).More copies of a repetitive sequence give more opportunities that NAHR could occur, resulting in a natural connection between repetitive sequences and common CNV.The fact that simple counting of 1000-mer frequencies leads to CNV regions with medical implications indicates that understanding the k-mer distribution is an important part of genomic analyses.
Long-tails and the diminishing return of longer reads: Our analysis shows that all distributions discussed in this paper are better viewed in log-log scale, proving the existence of power-law distributions or long-tails.This has been observed in the past for other genomic distributions, such as correlation function (84)(85)(86)(87), power spectrum of base composition (88)(89)(90)(91), frequency distribution of gene or protein family size (92)(93)(94)(95), sizes of ultraconserved regions (96), and in models with duplications (97)(98)(99)(100).Ongoing duplications increase the copy number geometrically, which explains the presence of long-tails.
A consequence of the long-tail in Figure 1 is that with increasing read (or k-mer) lengths, the proportion of reads that cannot be mapped to a unique genomic region (within the finished sequences) decreases algebraically, as compared to linearly or exponentially.Numerically, if not economically, this defines a diminishing return.To assess the economic return with NGS technology with longer reads, other factors should be considered, such as the choice of less redundant target regions such as exome (101), the choice of pair-end sequencing technologies (102) which effectively increases the read length (103) though the mappability may not necessarily improve over the single-end sequencing (4), and the overall cost of longer-read sequencing.
Anticipating the eventual high-quality human genome sequences obtained by a combination of various technologies, the k-mer distribution will be a prominent factor determining how these technologies are optimally combined.Supplementary material: Table S1 Chromosome locations of k=1000-mers with frequency f ≥ 10 mapped to the human reference genome s considered to be non−singletons

Figure 1 :Figure 2 :
Figure 1: Proportion of non-singleton k-mers/tokens in the human genome (24 chromosomes) as a function of k (in log-log scale).Circles (o) show the results for all finished basepairs, whereas crosses (x) for the result from RepeatMasker-filtered sequences.Pluses (+) are results when unfinished sequences (234 Mbase) are included as non-singletons.