Vindel: a simple pipeline for checking indel redundancy
© Li et al.; licensee BioMed Central Ltd. 2014
Received: 28 June 2014
Accepted: 23 October 2014
Published: 19 November 2014
With the advance of next generation sequencing (NGS) technologies, a large number of insertion and deletion (indel) variants have been identified in human populations. Despite much research into variant calling, it has been found that a non-negligible proportion of the identified indel variants might be false positives due to sequencing errors, artifacts caused by ambiguous alignments, and annotation errors.
In this paper, we examine indel redundancy in dbSNP, one of the central databases for indel variants, and develop a standalone computational pipeline, dubbed Vindel, to detect redundant indels. The pipeline first applies indel position information to form candidate redundant groups, then performs indel mutations to the reference genome to generate corresponding indel variant substrings. Finally the indel variant substrings in the same candidate redundant groups are compared in a pairwise fashion to identify redundant indels. We applied our pipeline to check for redundancy in the human indels in dbSNP. Our pipeline identified approximately 8% redundancy in insertion type indels, 12% in deletion type indels, and overall 10% for insertions and deletions combined. These numbers are largely consistent across all human autosomes. We also investigated indel size distribution and adjacent indel distance distribution for a better understanding of the mechanisms generating indel variants.
Vindel, a simple yet effective computational pipeline, can be used to check whether a set of indels are redundant with respect to those already in the database of interest such as NCBI’s dbSNP. Of the approximately 5.9 million indels we examined, nearly 0.6 million are redundant, revealing a serious limitation in the current indel annotation. Statistics results prove the consistency of the pipeline on indel redundancy detection for all 22 chromosomes. Apart from the standalone Vindel pipeline, the indel redundancy check algorithm is also implemented in the web server http://bioinformatics.cs.vt.edu/zhanglab/indelRedundant.php.
KeywordsIndel redundancy Gap opening Gap extension
Genetic variations include single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and structural variants such as inversions, large-scale duplications/deletions, and transpositions. Among all these types of variation, indels are the second most common in human populations, only after SNPs, demonstrated by recent large-scale human genome sequencing projects i. However, with the availability of newly sequenced human genomes, the number of novel indels increases at a much faster pace than that of SNPs. For example, a 2011 study shows that more than 63% of the nearly 2 million indels identified in 79 diverse human genomes are novel , compared to those in dbSNP. Recently, sequencing and analysis of an Indian female’s genome reveals that about 84% of her indels are unique, i.e., not documented in any of the sequenced genome databases, in contrast to less than 3% of her SNPs being unique . Thus, compared to SNPs, the research on cataloging indel variants is still in its infancy, and intense effort is needed to obtain a complete inventory. Indels also present great technical difficulty and challenge to short read mapping algorithms. Improved from the first generation short read mappers, various mapping programs and indel detection programs have been developed to allow for indel detection -. However, if indels happen to occur in seed regions (where only mismatches are allowed), mappers and indel detection programs may still fail to map the reads, though it is unclear how this impacts the overall mapping performance.
This paper develops methods and strategies to check for indel redundancy. Using dbSNP indels as the test case, we examine the extent of indel redundancy for humans and develop Vindel, a standalone indel redundancy verification pipeline, together with a corresponding web tool. Statistics analysis is applied to check for the correctness of the pipeline. As indels have been shown to be linked to diseases and cancer and have been used as genetic markers for various purposes, it is essential to catalog redundant indels and develop annotations with non-redundant information that represent real biological signals instead of computational artifacts. Our Vindel system provides the tool needed for this purpose.
Based on the two examples of redundant indels in dbSNP (Figure 1) and the comparison of distance distribution of adjacent SNPs and indels (Figure 2 for chromosome 22, see Additional file 1: Figure S1 for other chromosomes), we designed and implemented an indel redundancy verification pipeline. The pipeline consists of three phases: First, indel information was retrieved from the SNP/indel flat files downloaded from dbSNPii using a Python program. Second, based on their position information, indels were allocated into candidate redundant indel groups by clustering. Third, indel variant substrings were generated correspondingly for indels in the same candidate redundant groups and pairwise comparisons were conducted to identify redundant indels. Details are described in the following.
NCBI dbSNP is a widely used public database for short genetic variants. We collected indel information by parsing the human genome dbSNP (GRCh37 build version p10) flat files that contain both SNPs and indels for chromosomes 1 to 22. The original files were parsed to retrieve indel information, including indel ID, chromosome number, chromosome position, allele information, and alignment type.
Indel alignment type specification verification
To check indel redundancy, we need to determine the alignment type (insertion or deletion) for each indel relative to the reference genome. NCBI dbSNPiii specifies indel alignment types in four categories, with loctype = 1 denoting “insertion on the subject sequence”, Loctype = 3 “deletion on the subject sequence”, Loctype = 4 “range nsertion”, and Loctype = 6 “range deletion”. In this work, we focus on small indels, i.e., loctype = 1 and loctype = 3, which account for the majority of human indels (>99%). As it is unclear from the description page what the subject sequence means in the annotation and our email inquiry to dbSNP helpdesk was not answered, we checked the reference genome to see whether the substrings exist, so the type corresponds to deletion of the substrings. Our results show that indels with loctype 1 should correspond to deletion made to the reference genome as only about 1% of loctype 1 failed to find the substring versus more than 38% of loctype 3. Therefore, from here on, we treat loctype 1 as deletion and loctype 3 as insertion relative to the reference genome.
Candidate redundant indel groups
Algorithm for clustering indels into candidate redundant indel groups
Clustering indels into candidate redundant indel groups Algorithm
An indel List: List (I) ordered by indel positions on the reference genome, each indel I has position P, threshold value D of distance between adjacent indels
Candidate redundant indel groups List List (G k )
Candidate-Group-Generation (indel list: List(I), threshold-value: D)
Set List (G k ) empty: Ø;
Set k =0;
Set current indel I current = I 0 , the first element in the List (I);
for each indel i = 2 to n in indel list List (I)
if next adjacent indel I i ’s position P i - P current < = D then
Add the next indel into the current candidate group G (k);
Set current indel I current = I i ;
Append G(k) to candidate group list List (G k );
k = k + 1;
return candidate redundant group list List (G k );
Indel redundancy check
Algorithm for indel pair redundancy check by applying sliding window on reference genome
Indel pair redundancy check Algorithm
Two candidate redundant indels A and B’s information
same type T (either insertion or deletion), T A = T B,, same length L A = L B,
allele information S A,S B, position information P A,P B, where P A < = P B,
reference genome sequence S.
A pair indels A, B are redundant or not: Redundancy
Set Redundancy = False;
Phase 1: template substring formation
Form template substring for insertion type S I or for deletion type S D separately
S I =Substring in reference genome with P B - P A ;
S D = Substring in reference genome with P B - P A + L A ;
Phase 2: variant substring formation for insertion type
if T A = T B = Insertion then
Insert S A in front of template substring S I to form variant substring V A for indel A;
Append S B at the end of template substring S I to form variant substring V B for indel B;
if V A = V B then
Redundancy found: Redundancy = True;
Phase 3: variant substring formation for deletion type
if T A = T B = Deletion then
Cut S A in front of template string S D , form variant substring V A for indel A;
Cut S B at the end of template string S D , form variant substring V B for indel B;
if V A = V B then
Redundancy found: Redundancy = True;
Statistical analysis of the distribution of indel sizes and adjacent-indel distances
Results and discussion
Indel redundancy rate with different distance cutoffs
We first applied our pipeline to human indels on chromosome 22. As pairwise comparison of all the indels on a chromosome is too time consuming, we set cutoff values for the distance between adjacent indels to 1, 5, 10, and 100 bps. We generated candidate redundant indel groups based on these cutoff values, then applied indel redundancy verification methods to identify redundant indels. In this process, we calculated the redundancy percentage for indel insertion type and deletion type separately since we only handled alignment type 1 (Deletion) and 3 (Insertion) as discussed in the method section. The results are shown in Figure 6. As the distance threshold increases from 1 to 5, the total indel redundancy rate increases sharply, from 3.72% to 8.88%, however, from 10 to 100, the percentage increase trend becomes flat, from 8.88% to 10.62%, suggesting that for large distance groups, there is less increase in the number of redundant indels. Specifically, for distance threshold 100, we get 13% redundancy rate for insertion type indels and 9% redundancy rate for deletion type indels. Based on this observation, we adopted the cutoff of 100 as the distance cutoff for identifying redundant indels on all chromosomes.
Indel redundancy rates for all the chromosomes
Original number of indels
rate (%) a
rate (%) b
Indel size distribution
After redundancy filtration, we fit the indel size distribution with a Pareto distribution. The scale or location parameter is fixed to be 1. MLE of the shape parameters for all 22 chromosomes are listed in Table 3. The shape parameter varies little across chromosomes, ranging from 1.33 to 1.47, with mean 1.43 and standard deviation 0.035. This is consistent with the shape parameter of ,.
Distribution of distances between adjacent variants
We first investigate the distribution of distances between adjacent SNPs. By fitting the distance distribution with a Gamma distribution, we obtain MLE for the shape parameter α and rate parameter β for all 22 chromosomes (see Table 3). The shape parameter estimates are all close to (though smaller than) 1, suggesting that the occurrence of SNPs on human chromosomes may be described approximately by a Poisson process with rate 0.017 times the genetic distance within humans. On the other hand, after removing the redundant indels, we fit the distribution of distances between adjacent indels with a Gamma distribution. The parameter estimates for all 22 chromosomes are listed in Table 3, with average , and average . Comparing the corresponding parameter estimates and in the Gamma count models for the distance distribution of adjacent SNPs and for the distance distribution of adjacent indels, we see that the mean adjacent indel distance is times larger than the mean adjacent SNP distance, and also the variance is times larger. This result indicates several differences between SNPs and indels. First, the rate of indel mutations might be about ten times lower than that of single nucleotide mutations. Large-scale genome sequencing projects ,, have shown that there are about 1 SNP every 100 bps, whereas about 1 indel per 1000 bps. The rate difference could also be partially contributed by the fact that indels are under stronger selective constraints than SNPs and stronger purifying selection on indels might have removed more indels than SNPs.
Several aspects merit discussion. First, results on modeling indel size distribution and modeling adjacent-indel distance distribution can be used to estimate gap extension and gap opening in sequence alignment and indel calling algorithms. Second, as we limited the distance cutoff to be 100, there is still redundancy, albeit small, in indels that are farther apart than 100 bps. Therefore, future improvement includes incorporation of more efficient algorithm for examining all possible indel pairs to identify all the redundant indels. Furthermore, other important features, such as sequencing errors, mapping errors, and coverage, may also be incorporated in our algorithm to aid the selection of distance cutoff. One may argue that sequence alignment ambiguity may also reflect true biological events, in the sense that there are correspondingly multiple ways for indels to happen. However, if we focus on the net effect of these variations, it is clear that regardless of the exact indel events, they create the same variant string or genomic sequence and therefore, should most likely have the same effect on the individual carrying the variant. Therefore, we believe that it is important to keep only the unique indels.
A web-based tool for indel redundancy check process based on standalone pipeline algorithm
In addition of the standalone indel redundancy check pipeline, we also developed a Web-based user friendly tool for indel redundancy check. The Web tool applies PHP/APACHE/MYSQL/Linux architecture, based on a Model-View-Controller design strategy. In the Web interface front end, a user can input indel information, such as chromosome number, chromosome position, and indel allele information. In the server backend, we have a database table that stores indel information from dbSNPs and the indel redundancy check pipeline Python program that checks for redundancy based on the user’s input. The redundancy checking results are displayed in the Web front end. For computational efficiency, the current Web tool only searches for and checks adjacent indels less than 5 bps from the user’s query indel in our non-redundant indel database. The response time is at most a few seconds and the result is displayed to the Web front end. However, for their target indel, users can also choose to examine all indels on the same chromosome as the target indel for redundancy check, which significantly increases the computational time. One limitation with the current redundancy check standalone pipeline and Web tool is that they only handle indels in humans. As more and more indels from other species are identified, we will add the capability of indel redundancy check for additional species.
Based on the observed indel redundancy in the current dbSNP, we developed Vindel, a simple computational pipeline to check for indel redundancy in a database of interest. Vindel is implemented in Python and used to investigate the degree of redundancy in human indels in dbSNP. The approximate 10% redundancy is observed consistently across all the 22 human chromosomes. Further statistics results prove the consistency of the pipeline. In addition to the standalone Vindel pipeline, the indel redundancy check algorithm is also implemented in the Web server http://bioinformatics.cs.vt.edu/zhanglab/indelRedundant.php.
ZL performed the computational experiments for the pipeline. BH did data processing and prototype coding. XW did the statistical analysis of the data. LZ proposed the design of the pipeline to detect indel redundancy in dbSNP. ZL, XW, and LZ wrote the paper. All authors read and approved the final manuscript.
We thank Mingming Liu and Lenwood Heath for comments and suggestions. This project is partially funded by NSF Award No. OCI-1124123 to Liqing Zhang. This article is partially supported by Virginia Tech’s Open Access Subvention Fund and the Statistics department's startup fund to Xiaowei Wu.
- Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 56-65. 10.1038/nature11632.View ArticlePubMedGoogle Scholar
- Ryan E, Mills WSP: Natural genetic variation caused by small insertions and deletions in the human genome. Genome Res. 2011, 21: 830-839. 10.1101/gr.115907.110.View ArticleGoogle Scholar
- Rivi Gupta AR: Sequencing and analysis of a South Asian-Indian personal genome. BMC Genomics. 2012, 13: 440-10.1186/1471-2164-13-440.View ArticleGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, 18 (11): 1851-1858. 10.1101/gr.078212.108.View ArticlePubMed CentralPubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.View ArticlePubMed CentralPubMedGoogle Scholar
- Homer N, Merriman B, Nelson SF: BFAST: an alignment tool for large scale genome resequencing. PLoS One. 2009, 4 (11): A95-A106. 10.1371/journal.pone.0007767.View ArticleGoogle Scholar
- David M, Dzamba M, Lister D, Ilie L, Brudno M: SHRiMP2: sensitive yet practical short read mapping. Bioinformatics. 2011, 27 (7): 1011-1012. 10.1093/bioinformatics/btr046.View ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.View ArticlePubMed CentralPubMedGoogle Scholar
- Grimm D, Hagmann J, Koenig D, Weigel D, Borgwardt K: Accurate indel prediction using paired-end short reads. BMC Genomics. 2013, 14: 132-10.1186/1471-2164-14-132.View ArticlePubMed CentralPubMedGoogle Scholar
- Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009, 25 (21): 2865-2871. 10.1093/bioinformatics/btp394.View ArticlePubMed CentralPubMedGoogle Scholar
- Zhang J, Wu Y: SVseq: an approach for detecting exact breakpoints of deletions with low-coverage sequence data. Bioinformatics. 2011, 27 (23): 3228-3234. 10.1093/bioinformatics/btr563.View ArticlePubMedGoogle Scholar
- Abyzov A, Gerstein M: AGE: defining breakpoints of genomic structural variants at single- nucleotide resolution, through optimal alignments with gap excision. Bioinformatics. 2011, 27 (5): 595-603. 10.1093/bioinformatics/btq713.View ArticlePubMed CentralPubMedGoogle Scholar
- Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R: Dindel: accurate indel calls from short-read data. Genome Res. 2011, 21 (6): 961-973. 10.1101/gr.112326.110.View ArticlePubMed CentralPubMedGoogle Scholar
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011, 43 (5): 491-498. 10.1038/ng.806.View ArticlePubMed CentralPubMedGoogle Scholar
- Gu XLW: The size distribution of insertions and deletions in human ad rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J Mol Evol. 1995, 40 (4): 464-473. 10.1007/BF00164032.View ArticlePubMedGoogle Scholar
- Kamal NL M: Power Laws, Scale-Free Networks and Genome Biology. 2006, Springer, New York, New YorkGoogle Scholar
- Winkelmann R: Duration dependence and dispersion in count-data models. J Bus Econ Stat. 1995, 13 (4): 467-474.Google Scholar
- S. V. Lekshmi CT: Some generalization of poisson processes. J Stat Theor Appl. 2012, 11 (3): 225-235.Google Scholar
- Foss E, Lande R, Stahl FW, Steinberg CM: Chiasma interference as a function of genetic-distance. Genetics. 1993, 133 (3): 681-691.PubMed CentralPubMedGoogle Scholar
- Mcpeek MS, Speed TP: Modeling interference in genetic-recombination. Genetics. 1995, 139 (2): 1031-1044.PubMed CentralPubMedGoogle Scholar
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng ZP, Snyder M, Dermitzakis ET, Stamatoyannopoulos JA, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. 10.1038/nature05874.View ArticlePubMedGoogle Scholar
- Fu WQ, O’Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, Gabriel S, Altshuler D, Shendure J, Nickerson DA, Bamshad MJ, Exome Sequencing Project NHLBI, Akey JM: Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013, 493 (7431): 216-220. 10.1038/nature11690.View ArticlePubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.