Population genetic analysis of bi-allelic structural variants from low-coverage sequence data with an expectation-maximization algorithm

Background Population genetics and association studies usually rely on a set of known variable sites that are then genotyped in subsequent samples, because it is easier to genotype than to discover the variation. This is also true for structural variation detected from sequence data. However, the genotypes at known variable sites can only be inferred with uncertainty from low coverage data. Thus, statistical approaches that infer genotype likelihoods, test hypotheses, and estimate population parameters without requiring accurate genotypes are becoming popular. Unfortunately, the current implementations of these methods are intended to analyse only single nucleotide and short indel variation, and they usually assume that the two alleles in a heterozygous individual are sampled with equal probability. This is generally false for structural variants detected with paired ends or split reads. Therefore, the population genetics of structural variants cannot be studied, unless a painstaking and potentially biased genotyping is performed first. Results We present svgem, an expectation-maximization implementation to estimate allele and genotype frequencies, calculate genotype posterior probabilities, and test for Hardy-Weinberg equilibrium and for population differences, from the numbers of times the alleles are observed in each individual. Although applicable to single nucleotide variation, it aims at bi-allelic structural variation of any type, observed by either split reads or paired ends, with arbitrarily high allele sampling bias. We test svgem with simulated and real data from the 1000 Genomes Project. Conclusions svgem makes it possible to use low-coverage sequencing data to study the population distribution of structural variants without having to know their genotypes. Furthermore, this advance allows the combined analysis of structural and nucleotide variation within the same genotype-free statistical framework, thus preventing biases introduced by genotype imputation.


Installation
It has only been tested on Linux. Compile the source code typing the following: g++ svgem.cpp -o svgem Options -i <string> Input le name (required). The input le must be a tabdelimited, text le with three or four columns. It contains the counts of reference and alternative alleles for all the individuals at one variable site. The columns are the following: 1. Sample, or individual identier.
2. Reference count (oat). It is the number of times the reference allele has been observed in this individual, in the locus being analysed. For example, for a structural variant detected with paired-end sequencing, it is the number of concordant pairs of reads compatible with the site being in standard conformation. If dierent qualities are assigned to the observations, the expected number of observations should be reported here, calculated as the summation of the probabilities of the counts being true.
3. Alternative count (oat). It is the number of times the alternative allele has been observed in this individual and in the locus being analysed, or the expected number of true observations, if each observation has a dierent quality.
4. Ploidy of the locus in this individual (1 or 2). Optionally, the ploidy of the locus can be specied for each individual, so that variation in the X chromosome can be analysed in a mixture of males and females. If ploidy is specied in the input le, the option -p must be set to 0.
-p <int> Ploidy (0,1,or 2;default,2). If this parameter is set to 1 or 2, all individuals are assumed to have the same ploidy, and the input le is assumed to have 3 columns. If this parameter is set to 0, the input le must be in the 4-columns format.
-b <oat> Reference bias (default, 1.0). It is the ratio between the probability of observing the reference allele and the probability of observing the alternative allele from a heterozygous sample. The reference bias is not estimated by svgem, and needs to be provided here. The recommended way to estimate it is to perform simulations of the sequencing and mapping of reads from both the reference and the alternative alleles, and then use the ratio of reference to alternative counts observed.
-f <oat> Initial value of the alternative allele frequency to start the iterations (default, 0.5). It is advised to run svgem more than once with dierent starting values, to make sure that the algorithm converges to the maximum-likelihood estimate. Setting -f implies the assumption of Hardy-Weinberg equilibrium.
-F <string> Comma separated initial values of genotype frequencies (default`0.25,0.5,0.25'). There must be no spaces between the values. They correspond to the genotypes: homozygous for the alternative (Alt/Alt), heterozygous (Ref/Alt), and homozygous for the reference (Ref/Ref). This is incompatible with -f, and allows to override the assumption of Hardy-Weinberg equilibrium. It is a good practice to re-run svgem with alternative initial values to ensure that the nal estimates are the maximum-likelihood estimates.
-e <oat> (Initial) estimate of the frequency of errors among counts of the alternative allele (default, 0.001). It is assumed to be low, and svgem stops and complains if it is equal or higher than 0.5. This limitation may prevent the estimation of very low alternative allele frequencies.
This value is not optimized, unless the ag -s is set.
-E <oat> (Initial) estimate of the frequency of errors among counts of the reference allele (default, 0.001). This is also assumed to be low, and not optimized unless -s is set.
-s Turn on the estimation of frequencies of errors (default, false). This ag determines the estimation of two more parameters, the initial values of which are set by -e and -E. This is not recommended on very low coverage data. Under some circumstances, it can prevent the convergence of the estimates.
-m <int> Maximum number of iterations. The estimates keep being rened until the maximum number of iterations is reached, or the dierence between consecutive estimates of the alternative allele frequency become smaller than the tolerance, whichever happens before.
-l Report relative, log 10 genotype likelihoods, instead of posterior probabilities, in the svgem.log output (see below).

Output
Two outputs are produced. The main output, sent to either the standard output or to the requested le (see Options), reports the renement of the estimates along the iterations of the EM algorithm. It has 8 elds: 1. Iteration number.
2. Current estimate of the alternative allele frequency.
3. Current estimate of the frequency of homozygous alternative genotypes.
4. Current estimate of the frequency of heterozygous genotypes.
5. Current estimate of the frequency of homozygous reference genotypes.
6. Current estimate of the frequency of errors among alternative counts.
7. Current estimate of the frequency of errors among reference counts.
8. Log-likelihood of the current estimates.
The log-likelihood should always increase, and its nal value can be used to perform likelihood ratio tests (see below). The second output is an extended version of the input le with three additional elds, reporting the posterior probabilities of the three genotypes in  k Total number of allele observations, or counts, in one individual. l Number of times the reference allele is observed in one individual (l ≤ k). m Ploidy. g Number of reference alleles in the genotype (g ≤ m). λ Allele sampling bias in heterozygous individuals. r Frequency of erroneous counts among reference counts. a Frequency of erroneous counts among alternative counts. Table 2: Likelihoods of the three diploid genotypes (m = 2).
hemizygous samples, the rst and the third of these additional elds are used to represent the probabilities of genotypes Alt/0 and Ref/0, respectively. The genotype likelihoods, instead of the posterior probabilities, can be requested in this output (see option -l ).
Derivation of the likelihood functions Following the notation in [Li, 2011] (see Table 1), we refer to a genotype by its number of reference alleles, g ∈ {0, 1 . . . m}, where m is the ploidy, usually 2. We assume that variants are biallelic, so that m − g is the number of alternative alleles in the genotype. Table 2 shows the likelihoods of the three diploid genotypes.
The likelihood functions shown on table 2 are derived as follows. k is the total number of observations of the two possible (structural) alleles. That is, the coverage of the structural variant. Each observation is a direct evidence of one or the other allele, whether these are single or paired-end reads mapping on the breakpoints or on the inserted or deleted region. Allele observations are independent among them, and they may have dierent qualities (that is, dierent probabilities of being erroneous). The likelihood of a genotype is the probability of observing the reference allele exactly l times (l ≤ k) and the alternative allele, k − l times, with the corresponding probabilities of the observations being erroneous, given that the genotype is g. An erroneous observation would be the observation of the allele that was not actually sequenced, but the other, for example, due to mapping errors.
Given the assumed independence among observations, the likelihood is the product of the probabilities of all individual observations. The probability of observing an allele x with a given quality is equal to: is the probability of sampling that allele during sequencing, and j is the probability of the observation being erroneous, according to the its quality.
The probability of sampling one of the two alleles during sequencing is usually assumed to be proportional to the number of chromosomes carrying that allele in the genotype: g for the reference allele, and m − g for the alternative. In the case of a sampling bias of λ, any chromosome with the reference allele is λ times as likely to be sequenced as a chromosome with the alternative allele. Thus, the probability of sampling the reference allele is λg m+λg−g . For example, in a triploid individual (m = 3), with two homologous chromosomes carrying the reference allele of a deletion, and the third chromosome carrying the alternative (deleted) allele (g = 2), the probability of sampling the reference allele could be magnied from the naive 2 3 to the actual 4 5 , if λ = 2, because each reference allele counts twice as much as an alternative. Putting it all together, and letting the rst l observations be of the reference allele, we have: The equation above becomes identical to Li's equation 2 if λ = 1. In order to obtain the likelihood functions shown on table 2, we rst treat all the observations of the same allele as having the same, average, probability of being errnoeous: r in the case of a reference observation, and a in the case of an alternative observation. And then, we x m = 2 for diploid variants. Because L(0) and L(m) are independent of m, the two possible genotypes in a hemizygous (sex) chromosome (g = 0, and g = 1) have the same likelihood functions as the genotypes g = 0, and g = 2 of diploid variants.

Implementation of the expectation-maximization algorithm
Treating the genotypes as missing values, we implement an expectationmaximization (EM) method to estimate either the alternative allele fre-quency, ψ, under the assumption of Hardy-Weinberg equilibrium, or the genotype frequencies ψ g (with g ∈ {0, 1, 2} for diploids) or φ g (with g ∈ {0, 1}, for hemizygous individuals), and eventually the proportions of errors among reference ( r ) and alternative ( a ) counts. Note that ψ g is the frequency of genotype g among diploids, and φ g is the frequency of genotype g among hemizygous individuals. The EM algorithm is an iterative estimation of the parameters that gets closer to the maximum likelihood estimates in every iteration. Let Θ represent the parameters being estimated. Several good manuals show that the next (t + 1) estimate of the parameters is obtained from the current (t) estimates by the following expression: where, X is the whole set of allele counts across individuals, and the summation is over all possible combinations of genotypes among individuals (G). This becomes tractable when, following Gupta and Chen [2010], the summation is performed over individuals and genotypes, because the data X is composed of N independent and identically distributed samples, namely the individuals: Using Bayes theorem and the likelihood function described in table 2, the function in curly brackets can be expressed in terms of the parameters r , a , and either ψ or ψ g , depending on whether Hardy-Weinberg equilibrium is assumed or not. If not, P (g|Θ) = ψ g (if diploid; φ g , otherwise), and the conditions g ψ g = 1 and g φ g = 1 must be used during optimization, by means of Lagrangian multipliers. Then, the next values of the parameters can be found to be the following: In the equations above, D g is the t th estimate of the total number of diploid individuals with genotype g, and H (t) g is the t th estimate of the total number of hemizygous individuals with genotype g. That is, they are the summations of the posterior probabilities of genotype g over the respective kind of individuals. D and H are the total number of diploid and hemizygous individuals, respectively, where g is the t th estimate of the total number of alternative counts coming from hemizygous and homozygous individuals for either the alternative (g = 0) or the reference (g = 2) allele. Finally, R (t) g is the t th estimate of the total number of reference counts that come from hemizygous and homozygous individuals for either the alternative (g = 0) or the reference (g = 2) allele. Explicitly, When there is sampling bias in heterozygous individuals, λ = 1, and the next values of the proportions of errors among reference ( r ) and alternative ( a ) counts are the results of the two quadratic equations below: Among the possible solutions, r = a = 0, r = a = 1, r = 1/(1 + λ), and a = 1/(1 − λ) are excluded. In practice, it is assumed that the erroneous counts are a minority, and the program halts when r ≥ 0.5 or a ≥ 0.5. This can prevent the correct estimation of very low allele frequencies, in the presence of erroneous counts, as should be expected. Figure S1: Precision of the estimates of allele frequency from samples of dierent sizes, compared to the expected precision (orange lines) with known genotypes. Again, for each sample size, we performed 50 simulations. Here, the mean coverage is 4.
Eect of sample size on allele frequency estimates Figure S1 shows that sample size aects the precision of the allele frequency estimates as expected, without imparing the accuracy: while smaller samples produce necessarily less precise estimates, they are not biased.
Accuracy of genotype frequencies svgem can be run with or without the assumption of Hardy-Weinberg equilibrium. When the equilibrium is assumed, genotype frequencies follow from the alternative allele frequency, which is the only parameter being estimated, in this case. When the equilibrium is not assumed, however, the frequencies of the three genotypes among diploid individuals (and the frequencies of the two possible genotypes among hemizygous individuals, if any) are estimated. The estimation of genotype frequencies without assuming Hardy-Weinberg equilibrium is necessary for some applications, such as estimating heterozygosity and inbreeding coecients. The fact of estimating more parameters with the same amount of data entails that the precision cannot be as high as when estimating only one parameter. In comparison with Figure 2

Accuracy of the predicted genotypes
For some analyses, the genotypes need to be known, because genotype-free methods have not been developed yet. For example, coalescence-based studies may always require accurate genotypes, because they already deal with enough uncertainty about the genealogy. In these situations, higher levels of coverage are necessary to achieve genotype accuracy. With high coverage, the benets of running the expectation-maximization algorithm implemented in svgem, as oposed to simply using the most likely genotype of each individual, diminish. Indeed, svgem was designed to take advantage of low-coverage   Figure S5: Expected probability of the true diploid genotype being dierent from the most probable genotype after sampling its alleles between 0 and 10 times (Coverage), with a given allele bias (λ) and with a nite probability of spurious erroneous observations (Error). datasets, where genotypes are dicult or impossible to guess. The performance of svgem at predicting individual genotypes (Figures 3 and 5 of the main text) should be compared with the theoretical expectation. Given a number of allele observations in a diploid individual (coverage), the probability of a wrong genotype having higher posterior probability than the true one depends on the allele frequency, the rates of erroneous observations, and the allele sampling bias (λ). Figure S5 shows the numerical solutions of this function, which represent the expected proportions of individuals with a most probable genotype dierent from their true genotype, when allele frequency and allele sampling bias are known with accuracy. Interestingly, extreme allele sampling biases impair genotype prediction even with moderate depth of coverage.