Population genetic analysis of bi-allelic structural variants from low-coverage sequence data with an expectation-maximization algorithm
© Lucas-Lledóet al.; licensee BioMed Central Ltd. 2014
Received: 30 October 2013
Accepted: 14 May 2014
Published: 29 May 2014
Population genetics and association studies usually rely on a set of known variable sites that are then genotyped in subsequent samples, because it is easier to genotype than to discover the variation. This is also true for structural variation detected from sequence data. However, the genotypes at known variable sites can only be inferred with uncertainty from low coverage data. Thus, statistical approaches that infer genotype likelihoods, test hypotheses, and estimate population parameters without requiring accurate genotypes are becoming popular. Unfortunately, the current implementations of these methods are intended to analyse only single nucleotide and short indel variation, and they usually assume that the two alleles in a heterozygous individual are sampled with equal probability. This is generally false for structural variants detected with paired ends or split reads. Therefore, the population genetics of structural variants cannot be studied, unless a painstaking and potentially biased genotyping is performed first.
We present svgem, an expectation-maximization implementation to estimate allele and genotype frequencies, calculate genotype posterior probabilities, and test for Hardy-Weinberg equilibrium and for population differences, from the numbers of times the alleles are observed in each individual. Although applicable to single nucleotide variation, it aims at bi-allelic structural variation of any type, observed by either split reads or paired ends, with arbitrarily high allele sampling bias. We test svgem with simulated and real data from the 1000 Genomes Project.
svgem makes it possible to use low-coverage sequencing data to study the population distribution of structural variants without having to know their genotypes. Furthermore, this advance allows the combined analysis of structural and nucleotide variation within the same genotype-free statistical framework, thus preventing biases introduced by genotype imputation.
KeywordsStructural variation Population genetics Maximum likelihood Reference bias Genotyping
Ongoing efforts to discover genetic variation in humans and other species are yielding long lists of known variants [1–3]. The discovery of genetic variation is always the first step of population genetics or association studies. Once structural or nucleotide variation is revealed, individuals not present in the original sample are usually genotyped, for subsequent studies. Genotyping individuals is much easier than discovering new variants, because fewer loci have to be tested, and the prior probability of there being an alternative allele is higher than in sites not known to be variable. However, the presence of sequencing and mapping errors, and undersampling at heterozygous sites demand high coverage in order to infer individual genotypes accurately. Accurate genotypes are the basic information upon which most classic methods of population genetics depend. Our reliance on classic methods and the convenience of low-coverage data are such that the population-level structure of the variation is being used to improve the genotype calls on low-coverage data [4, 5], which are then expected to help us understand the population-level structure of the variation. This circularity prevents, for example, the use of genotypes imputed on the bases of known patterns of linkage disequilibrium to infer recombination rates. The problem stems from considering the genotypes as the ultimate result of a research program. For most studies, other than personalized medicine, the genotypes are just the means to gain insight into population-level processes: association between genotypes and phenotypes, history of migration and admixture, patterns of recombination, natural selection, etc.
When studying structural variants, there is too much to loose from relying on imputed genotypes. Polymorphic structural variants contribute to phenotypic diversity and disease susceptibility in humans , and other species [7–11]. The population genetics of structural variants is both a classic field, and, thanks to the new sequencing techniques, an emergent research venue [12, 13]. In the last years, many programs have been developed to identify structural variants from sequence data, using diverse signatures from split reads [14, 15], read depth [16, 17], paired reads [18, 19], or a combination thereof [20, 21]. Among the hotest topics that await to harvest full benefits from the high-throughput sequencing technologies, is the interplay between structural and nucleotide variation. The nucleotide variation linked to structural variants can inform of their evolutionary history and their effects on fitness, and it can also reveal to what extent structural variants affect recombination patterns. However, these questions cannot be addressed with SNP genotypes imputed on the bases of assumed patterns of linkage disequilibrium.
An alternative to genotype imputation is to study the population-level structure of the variation with new methods that take genotype uncertainty into account in the analysis. This is what has been proposed before for single-nucleotide variation [22–24]. The idea is to calculate the genotype likelihoods at every site, instead of calling the genotypes. Genotype likelihoods can be used to obtain unbiased estimates of allele frequencies or site-frequency spectra, and likelihood-ratio tests can be used to address population-level hypotheses, such as Hardy-Weinberg equilibrium, population differentiation, linkage disequilibrium, genotype-phenotype association, etc [24, 25]. However, this promissing approach is not applicable yet to structural variants, mostly because existing implementations assume even sampling probabilities of the two alleles from a heterozygous sample. While this assumption is a reasonable simplification for the analysis of SNP data, it is not for the analysis of structural variants, where one of the alleles may be observed much more frequently than the other in a heterozygous sample . One source of bias is the conservative alignment of reads to the reference genome. By default, mapping tools are optimized for the discovery phase, when the prior of there being a structural variant is very low. Therefore, aligners favor concordant mappings, that is, those that do not report the variation [27–29]. Another source of allele sampling bias is the different detectability of the alleles. For example, using single reads, the presence of a polymorphic insertion may be revealed by any read mapping to either of the two breakpoints, or even to its inner sequence, if unique and known; while the absence of the inserted fragment can only be positively attested by reads mapping on a single breakpoint. Also, repetitive sequences are frequently present around a structural variant, and they can impair the detection of one allele more than the other. For example, when the sequence of a transposable element is broken by an inversion in one of the breakpoints. In summary, allele sampling bias is the rule, rather than the exception, when genotyping structural variation with sequence data; and the bias can be orders of magnitude higher than in the case of SNPs (see below). In addition, the few existing tools able to report genotype likelihoods [5, 24] require a bam file  for their calculation, which is not designed for structural variation at all.
Here, we present svgem, a simple and flexible command-line application to infer genotype likelihoods and allele frequencies from counts of reference and alternative alleles, appropriate for structural variation data, with an arbitrarily high reference (or alternative) bias. This program is not concerned with the discovery phase of the structural variants, but with the post-discovery analyses. The only assumptions made about the type of variation that can be analysed are that it is bi-allelic, and that the two alleles can be distinguished from sequencing data. Thus, simple insertions, deletions, inversions, and translocations of any length can be analysed with svgem, while multiple, overlapping rearrangements, with more than two possible alleles are not. The sampling bias must be known in advance, and passed as a parameter. We offer some guidelines on how to estimate it, and explain how to test for Hardy-Weinberg equilibrium. Also, the ploidy of the samples is taken into account to infer genotype likelihoods and allele frequencies in sex chromosomes properly.
Overview and requirements
The program svgem implements an expectation-maximization algorithm in C++. The source code is freely available in Additional file 1, and in . It runs from the command line, analyses one structural variant at a time, and it takes as input a text file with three (optionally, four) columns: a sample identifier and the numbers of times the reference and the alternative alleles have been observed. The optional fourth column represents the ploidy, for the case when the sample is composed of a mixture of males and females, and the variant being analysed is on the sex chromosome.
In order to make svgem compatible with virtually any way of counting the observations of reference and alternative alleles, only two qualities are distinguished: the average quality of the reference counts, and the average quality of the alternative counts. These are analogous to the base qualities of a sequenced read, and they can be passed as parameters to the program, in terms of the frequencies of erroneous counts, or estimated from the data. Alternatively, if the individual quality (probability of error) of each observation is known, the expected number of true counts or ‘effective’ number of times the alleles are observed, instead of the raw counts, can be used. This approach proved to be very useful in SNPTools .
The allele sampling bias is represented by one parameter, λ, defined as the odds of sampling the reference allele from a diploid, heterozygous genotype. Even though in some cases it could be estimated from the data, svgem requires λ to be passed as a parameter, in order to save degrees of freedom. Otherwise, it would be impossible to get accurate estimates of extreme allele frequencies in the presence of errors. Plus, because svgem is not designed to discover variants, but to analyse already known variants, it is fairly easy to estimate λ. There are two different situations. First, if the exact sequences of both structural alleles are completely known, the alleles can be distinguished by the single reads mapping specifically to one of them. All possible reads from the informative regions of both alleles can be extracted, as if they came from a heterozygous sample, and mapped back to the reference genome and to the alternative allele. Then, λ would be estimated as the ratio between the number of reference reads that map uniquely to the reference allele, and the number of alternative reads that map uniquely to the alternative allele.
Second, if the exact sequences of the structural alleles are not known (e.g., imprecise breakpoints), they must be distinguished by the pattern of paired-end mappings: concordant for the reference, and discordant for the alternative (methods based on the depth of coverage, usually requiring high coverage, are not considered here). Still, the most likely version of the alternative allele could be composed, and used to extract paired end reads from it. In this case, the complete extraction of all possible paired-end reads is not feasible, but extensive simulations can be done with available programs, such as wgsim, or ART. If the same coverage is simulated in both alleles, λ can be estimated as the ratio of the number of reference paired ends that map concordantly to the number of alternative paired ends that map discordantly. To overcome the imprecision of the breakpoints, several alternative alleles could be simulated and averaged. Alternatively, if a subset of individuals are known to be heterozygous by other means (e.g., PCR evidence, or higher sequencing coverage), the ratio between their pooled numbers of reference and alternative reads can also be used as an estimate of λ. Inaccurate estimates of λ have a mild impact on genotype likelihoods, and they are always preferred to the default value of λ=1, as long as they are closer to the true value of the allele sampling bias (see below).
Implementation of the expectation-maximization algorithm
Total number of allele observations, or counts, in one individual.
Number of times the reference allele is observed in one individual (l ≤ k).
Number of reference alleles in the genotype (g ≤ m).
Allele sampling bias in heterozygous individuals.
Frequency of erroneous counts among reference counts.
Frequency of erroneous counts among alternative counts.
Likelihoods of the three diploid genotypes ( m = 2)
In the equations above, is the t th estimate of the total number of diploid individuals with genotype g, and is the t th estimate of the total number of hemizygous individuals with genotype g. That is, they are the summations of the posterior probabilities of genotype g over the respective kind of individuals. D and H are the total number of diploid and hemizygous individuals, respectively, where D + H = N. is the t th estimate of the total number of alternative counts coming from hemizygous and homozygous individuals for either the alternative (g = 0) or the reference (g = 2) allele. Finally, is the t th estimate of the total number of reference counts that come from hemizygous and homozygous individuals for either the alternative (g = 0) or the reference (g = 2) allele.
When there is sampling bias in heterozygous individuals, λ≠1, and the next values of the proportions of errors among reference (ε r ) and alternative (ε a ) counts are the results of two quadratic equations (Additional file 2). In practice, it is assumed that the erroneous counts are a minority, and the program halts when ε r ≥ 0.5 or ε a ≥ 0.5. This can prevent the correct estimation of extreme allele frequencies in the presence of erroneous counts, as should be expected.
Output and applications
The output includes: maximum likelihood estimates of the parameters mentioned above, the likelihood of such estimates, the genotype likelihoods of all individuals, and the posterior probabilities of the genotypes of all individuals. The main purpose of svgem is to obtain unbiased estimates of allele and genotype frequencies, which are fundamental parameters in population genetics. From these estimates, several other population parameters can be estimated. The maximum likelihood estimate of the frequency of the heterozygous genotype (), estimated without assuming Hardy-Weinberg equilibrium, is a direct estimate of heterozygosity. An estimate of the inbreeding coefficient follows from comparing with the expected frequency of heterozygous individuals under Hardy-Weinberg equilibrium: (where is the maximum likelihood estimate of the alternative allele frequency). The fixation index, FST, which measures genetic differentiation among populations, is also readily estimated from allele frequencies [33–35]. A test for Hardy-Weinberg equilibrium (HWE) can be performed by running svgem with and without the equilibrium assumption, and comparing the log-likelihoods of the estimated frequencies. Twice the difference between the log-likelihoods must be compared with a χ2 distribution with 1 degree of freedom, if all individuals are diploid, or with 2 degrees of freedom if the frequencies of hemizygous genotypes are also being estimated.
Some analyses that used to require accurate knowledge of individual genotypes can be performed now using only genotype likelihoods. For example, it is possible to estimate the linkage disequilibrium between pairs of variants using genotype likelihoods, instead of individual genotypes . At the end of the next section, we show how to estimate the linkage disequilibrium between a structural variant and the SNPs around it, without the biases typically associated with genotype imputation.
It is also possible to run genetic association tests from genotype likelihoods, without knowing the exact genotype of the individuals . Associations between phenotypes and genetic variants are a significant difference in allele frequency between two samples (cases and controls), and they are routinely searched along the human genome to infer the causal variants of diseases. To compare the allele frequency of a variant between two samples, svgem must be run three times: once in each sample separately, and once in the joint dataset. Lets call ℓ a and ℓ b the log-likelihoods of the two independent estimates for samples a and b. The total log-likelihood of the hypothesis of two different frequencies is just the sum of the log-likelihoods of the two samples: ℓ1 = ℓ a + ℓ b . The log-likelihood of the hypothesis of one common allele frequency, ℓ0, is obtained from the run on the joint data set. Because the two hypotheses are nested, the application of a likelihood ratio test is justified. Thus, if the null hypothesis of a common allele frequency is true, the statistic 2(ℓ1 - ℓ0) is expected to follow a χ2 distribution with as many degrees of freedom as additional parameters the most complex model has, which is 1 in this case (see , page 137).
Although svgem is designed to analyse one variant at a time, more than one variant can easily be analysed in multiple parallel runs. For the purpose of detailed population genetic analyses of specific variants of interest, svgem performs well, since its typical run time is always a tiny fraction of the time usually needed to obtain the allele counts. In absolute terms, it can analyse >1000 individuals in, at most, a few seconds, in a standard PC. However, the run time is variable, just as in any expectation-maximization algorithm, and convergence may take longer if the information content of the input data set is limited.
Results and discussion
Estimates of λfrom real structural variants
In order to assess the expected range of values of the λ parameter, we downloaded the BreakDB database , last accessed on January 26th 2014, and built a library with the known sequences of both alleles of 568 structural variants (54 inversions, 161 insertions, and 353 deletions). Then, we simulated the exhaustive sequencing of both alleles with single-end, 100-bp reads. Next, we mapped the sequenced reads first to the library built with the sequences of the reference and the alternative alleles, and afterwards to the whole reference genome (HG18), in order to discard any non-specific read. For this purpose, we used the pipeline BreakSeq , with minor modifications. We removed from BreakSeq a filter that required the reads to map on the breakpoints, in order to be able to use the inserted or deleted sequences as evidence of the presence or the absence of an insertion or a deletion. Finally, we counted how many reads mapped specifically on the reference or on the alternative allele, and estimated λ as the ratio of the two counts.
Analysis of simulated data
We run some simulations to test svgem. In the artificial datasets, genotypes followed Hardy-Weinberg equilibrium, with allele frequencies between 0.01 and 0.99, and the allele counts were sampled with a simulated error rate of 0.005. Coverage was Poisson-distributed, in order to introduce variation in coverage among individuals, although the exact distribution is irrelevant for the calculation of genotype likelihoods. In all, 100 simulations were run with the same parameter values.
Next, we checked svgem performance with different sample sizes from 10 to 1000 and a fixed SV frequency of 0.5, which is the one with higher sampling variance. Figure S1 in Additional file 2 shows that smaller samples, with a mean coverage of 4, also yield unbiased estimates with a precision comparable to the expected under accurate knowledge of genotypes.
It is also important to mention that the frequency of erroneous reference or alternative allele counts, ε r and ε a , need to be either known or co-estimated from the data to get accurate estimates of allele or genotype frequencies. An erroneous count is a false observation of an allele, which should not contribute to the estimate of allele frequency. They are assumed to be less frequent than true counts. In practice, the accurate estimation of both the allele (or genotype) frequency and the frequency of erroneous counts is only feasible if there is enough information in the data. As a rule of thumb, when coverage is below 4, or the number of individuals is below 100, a priori estimates of ε r and ε a are highly recommended. They can be obtained from simulations, or estimated from a subset of individuals with high coverage, or empirically determined in a subset of homozygous individuals.
Analysis of real data
To genotype experimentally the inversion, we used different pairs of primers specific for the reference orientation (A2-B2) or the inverted orientation (A4-C3 and B2-D1; Table S3 in Additional file 3) and carried out duplicate PCRs of each individual, both in simplex and multiplex format. In total, the 270 individuals of the Phase II of the HapMap Project were analyzed, including 90 Yoruba (YRI), 90 from European origin (CEU), 45 Chinese (CHB) and 45 Japanase (JPT), and the PCR results can be accessed in the InvFEST database . Table S4 shows the observed genotypes and the genotype posterior probabilities calculated with svgem for the 122 individuals that were both genotyped by PCR and analysed with BreakSeq and svgem. The alternative allele frequencies determined experimentally or estimated with svgem in this subsample were, respectively, 0.545 (standard error 0.045), and 0.541. The most probable genotype determined by svgem matched the true genotype in 111 (91%) individuals, with only 1 error (out of 86) when the coverage is higher than 2. From the allele counts of homozygous individuals, it can be seen that the opposite allele is never observed, confirming that the rate of erroneous counts is negligible.
The development of methods to discover structural variants in individual genomes is giving way to population-level analyses. The most recently developed discovery tools, such as GASVPro  or CloudBreak , call the genotypes of the individuals analysed, instead of just reporting the variants discovered (a notable exception being ForestSV ). However, these individual-based methods require high coverage, and they are oblivious to the information present at the population level. While these methods are still useful in some applications, there is a current demand for efficient ways to analyse low-coverage population genomics data. Most studies still insist in genotyping the individuals, despite of the loss of data caused by arbitrary quality thresholds, and despite the circularity and biases associated with genotype imputation [1, 47, 48]. The alternative of using likelihood or Bayesian approaches, which take genotype uncertainty into account, is an optimal strategy to explore genetic diversity, since it does not require high coverage per individual, and allows the sequencing of more individuals at the same cost . Not surprisingly, a new method to genotype indels from sequence data in polyploid genomes uses the same approach of likelihood calculation, frequency estimation through an expectation-maximization algorithm, and reporting of posterior probabilities . However, this method does not consider any allele sampling bias, because it targets indels shorter than the reads used to distinguish the alleles. By including allele sampling bias in the genotype likelihood calculation, our program extends the applicability of these methods to the analysis of large structural variation. Furthermore, for the first time nucleotide and structural variation can be analysed in the same statistical framework, without having to rely on the accuracy of the genotypes.
Two of the key features of svgem are its simplicity and its few assumptions about the data, which make the program useful for a wide variety of data types. Any bi-allelic structural variant detected by sequenced paired-ends or split reads, including inversions, mobile element insertions, duplications, and deletions, can be analysed by svgem. Using simulations, we have shown that estimates of allele or genotype frequencies are accurate, even in the face of rampant allele sampling bias, that usually accompanies the detection of structural alleles. Finally, using data from the 1000 Genomes Project and PCR experiments, we prove its applicability to real data.
Availability and requirements
Project name: svgem.
Project home page:http://grupsderecerca.uab.cat/cacereslab/content/resources.
Operating system(s): Platform independent.
Programming language: C++.
Other requirements: None.
License: GNU General Public License.
Any restrictions to use by non-academics: None.
Polymerase chain reaction
Minor allele frequency.
We thank David Izquierdo for help with the experimental genotyping of the inversion, and Xavier Estivill and Marta Morell for help with the CEU population lymphoblastoid cell cultures. This work was supported by the European Research Council [Starting Grant 243212 (INVFEST) to MC] under the European Union Seventh Research Framework Programme (FP7); and by the Commission for Universities and Research of the Ministry of Innovation, Universities and Enterprise of the Autonomous Government of Catalonia and the Cofund programme of the Marie Curie Actions of the FP7 [‘Beatriu de Pinós’ postdoctoral fellowship to JILL].
- Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM, Konkel MK, Korn J, Khurana E, Kural D, Lam HYK, Leng J, Li R, Li Y, Lin CY, Luo R, Hajirasouliha I, et al: Mapping copy number variation by population-scale genome sequencing. Nature. 2011, 470 (7332): 59-65. 10.1038/nature09708.View ArticlePubMed CentralPubMedGoogle Scholar
- Stewart C, Kural D, Strömberg MP, Walker JA, Konkel MK, Stütz AM, Urban AE, Grubert F, Lam HYK, Lee WP, Busby M, Indap AR, Garrison E, Huff C, Xing J, Snyder MP, Jorde LB, Batzer MA, Korbel JO, Marth GT, 1000 Genomes Project: A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet. 2011, 7 (8): e1002236-10.1371/journal.pgen.1002236.View ArticlePubMed CentralPubMedGoogle Scholar
- Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 56-65. 10.1038/nature11632.View ArticlePubMedGoogle Scholar
- Handsaker RE, Korn JM, Nemesh J, McCarroll SA: Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet. 2011, 43 (3): 269-276. 10.1038/ng.768.View ArticlePubMedGoogle Scholar
- Wang Y, Lu J, Yu J, Gibbs RA, Yu F: An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. Genome Res. 2013, 23 (5): 833-842. 10.1101/gr.146084.112.View ArticlePubMed CentralPubMedGoogle Scholar
- Girirajan S, Campbell CD, Eichler EE: Human copy number variation and complex genetic disease. Annu Rev Genet. 2011, 45: 203-226. 10.1146/annurev-genet-102209-163544.View ArticlePubMedGoogle Scholar
- Guryev V, Saar K, Adamovic T, Verheul M, van Heesch SAAC, Cook S, Pravenec M, Aitman T, Jacob H, Shull JD, Hubner N, Cuppen E: Distribution and functional impact of DNA copy number variation in the rat. Nat Genet. 2008, 40 (5): 538-545. 10.1038/ng.141.View ArticlePubMedGoogle Scholar
- Gazave E, Morcillo-Suarez C, Petit-Marty N, Carreño A, Marigorta UM, Ryder OA, Blancher A, Rocchi M, Bosch E, Baker C, Marquès-Bonet T, Eichler EE, Navarro A, Darré F: Copy number variation analysis in the great apes reveals species-specific patterns of structural variation. Genome Res. 2011, 21 (10): 1626-1639. 10.1101/gr.117242.110.View ArticlePubMed CentralPubMedGoogle Scholar
- Berglund J, Nevalainen EM, Molin AM, Perloski M, Zody MC, Sharpe T, Hitte C, Lindblad-Toh K, Lohi H, Webster MT, The LUPA Consortium: Novel origins of copy number variation in the dog genome. Genome Biol. 2012, 13 (8): R73-10.1186/gb-2012-13-8-r73.View ArticlePubMed CentralPubMedGoogle Scholar
- Muñoz Amatriaín M, Eichten SR, Wicker T, Richmond TA, Mascher M, Steuernagel B, Scholz U, Ariyadasa R, Spannagl M, Nussbaumer T, Mayer KF, Taudien S, Platzer M, Jeddeloh JA, Springer NM, Muehlbauer GJ, Stein N: Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome. Genome Biol. 2013, 14 (6): R58-10.1186/gb-2013-14-6-r58.View ArticlePubMed CentralPubMedGoogle Scholar
- Sudmant PH, Huddleston J, Catacchio CR, Malig M, Hillier LW, Baker C, Mohajeri K, Kondova I, Bontrop RE, Persengiev S, Antonacci F, Ventura M, Prado-Martinez J, Great Ape GenomeProject, Marques-Bonet T, Eichler EE: Evolution and diversity of copy number variation in the great ape lineage. Genome Res. 2013, 23 (9): 1373-1382. 10.1101/gr.158543.113.View ArticlePubMed CentralPubMedGoogle Scholar
- Corbett-Detig RB, Hartl DL: Population genomics of inversion polymorphisms in Drosophila melanogaster. PLoS Genet. 2012, 8 (12): e1003056-10.1371/journal.pgen.1003056.View ArticlePubMed CentralPubMedGoogle Scholar
- Zichner T, Garfield DA, Rausch T, Stütz AM, Braun M, Furlong EEM, Korbel JO, Cannavó E: Impact of genomic structural variation in Drosophila melanogasterbased on population-scale sequencing. Genome Res. 2013, 23 (3): 568-579. 10.1101/gr.142646.112.View ArticlePubMed CentralPubMedGoogle Scholar
- Wang J, Mullighan CG, Easton J, Roberts S, Ma J, Rusch MC, Chen K, Harris CC, Ding L, Heatley SL, Holmfeldt L, Payne-Turner D, Fan X, Wei L, Zhao D, Obenauer JC, Naeve C, Mardis ER, Wilson RK, Downing JR, Zhang J: CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods. 2011, 8 (8): 652-654. 10.1038/nmeth.1628.View ArticlePubMed CentralPubMedGoogle Scholar
- Karakoc E, Alkan C, O’Roak BJ, Dennis MY, Vives L, Mark K, Rieder MJ, Nickerson DA, Eichler EE: Detection of structural variants and indels within exome data. Nat Methods. 2012, 9 (2): 176-178.View ArticleGoogle Scholar
- Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011, 21 (6): 974-984. 10.1101/gr.114876.110.View ArticlePubMed CentralPubMedGoogle Scholar
- Xi R, Hadjipanayis AG, Luquette LJ, Kim TM, Lee E, Zhang J, Johnson MD, Muzny DM, Wheeler DA, Gibbs RA, Kucherlapati R, Park PJ: Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc Natl Acad Sci U S A. 2011, 108 (46): E1128-E1136. 10.1073/pnas.1110574108.View ArticlePubMed CentralPubMedGoogle Scholar
- Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER: BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009, 6 (9): 677-681. 10.1038/nmeth.1363.View ArticlePubMed CentralPubMedGoogle Scholar
- Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME, Mell JC, Hall IM: Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res. 2010, 20 (5): 623-635. 10.1101/gr.102970.109.View ArticlePubMed CentralPubMedGoogle Scholar
- Michaelson JJ, Sebat J: forestSV: structural variant discovery through statistical learning. Nat Meth. 2012, 9 (8): 819-821. 10.1038/nmeth.2085.View ArticleGoogle Scholar
- Sindi SS, Önal S, Peng LC, Wu HT, Raphael BJ: An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol. 2012, 13 (3): R22-10.1186/gb-2012-13-3-r22.View ArticlePubMed CentralPubMedGoogle Scholar
- Keightley PD, Halligan DL: Inference of site frequency spectra from high-throughput sequence data: quantification of selection on nonsynonymous and synonymous sites in humans. Genetics. 2011, 188 (4): 931-940. 10.1534/genetics.111.128355.View ArticlePubMed CentralPubMedGoogle Scholar
- Kim SY, Lohmueller KE, Albrechtsen A, Li Y, Korneliussen T, Tian G, Grarup N, Jiang T, Andersen G, Witte D, Jorgensen T, Hansen T, Pedersen O, Wang J, Nielsen R: Estimation of allele frequency and association mapping using next-generation sequencing data. BMC Bioinformatics. 2011, 12: 231-10.1186/1471-2105-12-231.View ArticlePubMed CentralPubMedGoogle Scholar
- Li H: A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011, 27 (21): 2987-2993. 10.1093/bioinformatics/btr509.View ArticlePubMed CentralPubMedGoogle Scholar
- Nielsen R, Korneliussen T, Albrechtsen A, Li Y, Wang J: SNP calling, genotype calling, and sample allele frequency estimation from new-generation sequencing data. PLoS ONE. 2012, 7 (7): e37558-10.1371/journal.pone.0037558.View ArticlePubMed CentralPubMedGoogle Scholar
- Cáceres M, Lucas Lledó JI: On the power and the systematic biases of the detection of chromosomal inversions by paired-end genome sequencing. PLoS ONE. 2013, 8 (4): e61292-10.1371/journal.pone.0061292.View ArticlePubMed CentralPubMedGoogle Scholar
- Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (15): 1966-1967. 10.1093/bioinformatics/btp336.View ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595. 10.1093/bioinformatics/btp698.View ArticlePubMed CentralPubMedGoogle Scholar
- Novocraft technologies. [http://www.novocraft.com],
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project DataProcessingSubgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.View ArticlePubMed CentralPubMedGoogle Scholar
- Mario Cáceres Lab. [http://grupsderecerca.uab.cat/cacereslab/content/resources],
- Huang W, Li L, Myers JR, Marth GT: ART: a next-generation sequencing read simulator. Bioinformatics. 2012, 28 (4): 593-594. 10.1093/bioinformatics/btr708.View ArticlePubMed CentralPubMedGoogle Scholar
- Nei M: Analysis of gene diversity in subdivided populations. PNAS. 1973, 70 (12): 3321-3323. 10.1073/pnas.70.12.3321.View ArticlePubMed CentralPubMedGoogle Scholar
- Weir BS, Cockerham CC: Estimating F-Statistics for the analysis of population structure. Evolution. 1984, 38 (6): 1358-1370. 10.2307/2408641.View ArticleGoogle Scholar
- Bhatia G, Patterson N, Sankararaman S, Price AL: Estimating and interpreting FST: the impact of rare variants. Genome Res. 2013, 23 (9): 1514-1521. 10.1101/gr.154831.113.View ArticlePubMed CentralPubMedGoogle Scholar
- Yang Z: Computational Molecular Evolution. 2006, New York: Oxford University PressView ArticleGoogle Scholar
- BreakDB. [http://sv.gersteinlab.org/breakdb/],
- Lam HYK, Mu XJ, Stütz AM, Tanzer A, Cayting PD, Snyder M, Kim PM, Korbel JO, Gerstein MB: Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat Biotech. 2010, 28: 47-55. 10.1038/nbt.1600.View ArticleGoogle Scholar
- Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P, Newman TL, Tüzün E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W, Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K, Chen L, Malig M, et al: Mapping and sequencing of structural variation from eight human genomes. Nature. 2008, 453 (7191): 56-64. 10.1038/nature06862.View ArticlePubMed CentralPubMedGoogle Scholar
- Ahn SM, Kim TH, Lee S, Kim D, Ghang H, Kim DS, Kim BC, Kim SY, Kim WY, Kim C, Park D, Lee YS, Kim S, Reja R, Jho S, Kim CG, Cha JY, Kim KH, Lee B, Bhak J, Kim SJ: The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res. 2009, 19 (9): 1622-1629. 10.1101/gr.092197.109.View ArticlePubMed CentralPubMedGoogle Scholar
- McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, Clouser CR, Duncan C, Ichikawa JK, Lee CC, Zhang Z, Ranade SS, Dimalanta ET, Hyland FC, Sokolsky TD, Zhang L, Sheridan A, Fu H, Hendrickson CL, Li B, Kotler L, Stuart JR, Malek JA, Manning JM, Antipova AA, Perez DS, Moore MP, Hayashibara KC, Lyons MR, Beaudoin RE, et al: Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 2009, 19 (9): 1527-1541. 10.1101/gr.091868.109.View ArticlePubMed CentralPubMedGoogle Scholar
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409 (6822): 860-921. 10.1038/35057062.View ArticlePubMedGoogle Scholar
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, et al: The sequence of the human genome. Science. 2001, 291 (5507): 1304-1351. 10.1126/science.1058040.View ArticlePubMedGoogle Scholar
- Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AWC, Shago M, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Stockwell TB, et al: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5 (10): e254-10.1371/journal.pbio.0050254.View ArticlePubMed CentralPubMedGoogle Scholar
- Martínez-Fundichely A, Casillas S, Egea R Ràmia, Barbadilla A, Pantano L, Puig M, Cáceres M: InvFEST, a database integrating information of polymorphic inversions in the human genome. Nucleic Acids Res. 2013, 42 (D1): D1027-D1032.View ArticlePubMed CentralPubMedGoogle Scholar
- Whelan CW, Tyner J, L’Abbate A, Storlazzi CT, Carbone L, Sönmez K: Cloudbreak: accurate and scalable genomic structural variation detection in the cloud with MapReduce. arXiv:1307.2331 [q-bio]. 2013,Google Scholar
- Hormozdiari F, Konkel MK, Prado-Martinez J, Chiatante G, Herraez IH, Walker JA, Nelson B, Alkan C, Sudmant PH, Huddleston J, Catacchio CR, Ko A, Malig M, Baker C, Project GAG, Marques-Bonet T, Ventura M, Batzer MA, Eichler EE: Rates and patterns of great ape retrotransposition. PNAS. 2013, 110 (33): 13457-13462. 10.1073/pnas.1310914110.View ArticlePubMed CentralPubMedGoogle Scholar
- Li X, Chen S, Xie W, Vogel I, Choy KW, Chen F, Christensen R, Zhang C, Ge H, Jiang H, Yu C, Huang F, Wang W, Jiang H, Zhang X: PSCC: Sensitive and reliable population-scale copy number variation detection method based on low coverage sequencing. PLoS ONE. 2014, 9: e85096-10.1371/journal.pone.0085096.View ArticlePubMed CentralPubMedGoogle Scholar
- Buerkle CA, Gompert Z: Population genomics based on low coverage sequencing: how low should we go?. Mol Ecol. 2013, 22 (11): 3028-3035. 10.1111/mec.12105.View ArticleGoogle Scholar
- Shao H, Bellos E, Yin H, Liu X, Zou J, Li Y, Wang J, Coin LJM: A population model for genotyping indels from next-generation sequence data. Nucl Acids Res. 2013, 41 (3): e46-10.1093/nar/gks1143.View ArticlePubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.