A hidden Markov approach for ascertaining cSNP genotypes from RNA sequence data in the presence of allelic imbalance by exploiting linkage disequilibrium

Background Allelic specific expression (ASE) increases our understanding of the genetic control of gene expression and its links to phenotypic variation. ASE testing is implemented through binomial or beta-binomial tests of sequence read counts of alternative alleles at a cSNP of interest in heterozygous individuals. This requires prior ascertainment of the cSNP genotypes for all individuals. To meet the needs, we propose hidden Markov methods to call SNPs from next generation RNA sequence data when ASE possibly exists. Results We propose two hidden Markov models (HMMs), HMM-ASE and HMM-NASE that consider or do not consider ASE, respectively, in order to improve genotyping accuracy. Both HMMs have the advantages of calling the genotypes of several SNPs simultaneously and allow mapping error which, respectively, utilize the dependence among SNPs and correct the bias due to mapping error. In addition, HMM-ASE exploits ASE information to further improve genotype accuracy when the ASE is likely to be present. Simulation results indicate that the HMMs proposed demonstrate a very good prediction accuracy in terms of controlling both the false discovery rate (FDR) and the false negative rate (FNR). When ASE is present, the HMM-ASE had a lower FNR than HMM-NASE, while both can control the false discovery rate (FDR) at a similar level. By exploiting linkage disequilibrium (LD), a real data application demonstrate that the proposed methods have better sensitivity and similar FDR in calling heterozygous SNPs than the VarScan method. Sensitivity and FDR are similar to that of the BCFtools and Beagle methods. The resulting genotypes show good properties for the estimation of the genetic parameters and ASE ratios. Conclusions We introduce HMMs, which are able to exploit LD and account for the ASE and mapping errors, to simultaneously call SNPs from the next generation RNA sequence data. The method introduced can reliably call for cSNP genotypes even in the presence of ASE and under low sequencing coverage. As a byproduct, the proposed method is able to provide predictions of ASE ratios for the heterozygous genotypes, which can then be used for ASE testing. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0479-2) contains supplementary material, which is available to authorized users.


Details of the EM Algorithm and its Implementation
In this section, we provide details of the EM algorithm for obtaining the maximum likelihood estimates (MLE) of θ where θ = (α 1 , β 1 , α 2 , β 2 , e, A) T , where A = (a kk ) k,k =1,··· ,M are parameters in the transition matrix.
The likelihood function for the complete data is a g i(l−1) ,g il (θ)π g i1 (θ). * The authors acknowledge the support from NSF grants DMS-1209112, DMS-1309156, NIFA-AFRI 2010-65205-20342 andNIFA-AFRI 2011-07015-30338. where f X (X il |G il , δ il ) is the conditional density of X il . It follows that the loglikelihood function of L(θ|Y) is given by Define L i,k (l) as The conditional expection of log L(θ|Y) given X evaluated at θ (m−1) is We then maximize E{log L(θ|Y)|X, θ (m−1) } with respect to θ, say, the maximal is taken at point θ (m) . We updated the parameter θ (m−1) by θ (m) . It can be shown, by a constrained maximization, that a where the marginal probability mass function of X il given G il is The details of the implementation of above EM algorithm can be done by a forward and backward method. The following forward-backward algorithm implements the EM algorithm in three steps: 1. Compute α i,k (forward probabilities), β i,k (backward probabilities) and P X i .
Since there is no closed form integration C 0 (θ; X il1 , X il2 ) and C 1 (θ; X il1 , X il2 ), we compute them using a numerical integration. To update parameters α, β, e in (1.4), we defineê (m) as the solution of , 2) as the solutions to the following two equations:

Transition Probabilities Depending on Distances Among SNPs
In this section, we discuss a generalized version of HMM-ASE, with transition probability taking into consideration of distances among SNPs. The idea is, if the two SNPs are close to each other, it is less likely that the genotype state changes from one SNP to another. While if the two SNPs are far apart, it is more likely that there exists a change on genotype status between the two SNPs. Similar idea was been applied in copy number variation detection by Wang et al. (2013).
If distances among SNPs affect the transition probability, then the transition matrix A l = (a kk (l)) depending on the location of a SNP, which is a function of the SNP location l, where for k, k = 1, . . . , 5 and l = 2, . . . , L. Here d l represents the genomic distance between the locations of SNP l and SNP l + 1. The parameter ρ determines the effect of the distance on the transition probabilities (ρ > 0). The parameter a * kk affects the transition probabilities from state k to state k , besides the effect of distances. Also, there is a constraint that a * kk ∈ (0, 1) and l =k a * kk < 1 for each k. The expectation of the log-likelihood function is now changed to where a * = (a * 12 , . . . , a * M,M −1 ). We modify the forward-backward algorithm in the last section to accommodate the new model (1.5) on the transition matrix. The changes are summarized in the following (1) change the transition probabilities a k ,g in step 1 into a k ,g (l) and a k,k in step 2 into a k,k (l); (2) in additional to the update for parameters α 1 , β 1 , α 2 , β 2 , e in step 3, we also need to update the parameters a * k,k , ρ, which can be done by using the following method. Equating to zero the derivative of E{log L(θ|Y)|X, θ (k−1) } with respect to a * kk yields for each k with ρ initially fixed at its value from the previous EM iteration (ρ (m) ).
Then a new value of a * can be obtained by a kk = , k, k = 1 . . . , M , k = k. Now, an updated value of ρ can be obtained by directly maximizing R 2 (a * , ρ) + R 3 (a * , ρ) with respect to ρ, using the new a * value.

Additional Results in Real Data Analysis
In this section, we present some additional results from the real data analysis. The following contigency table Table S.1 reports the performance of the HMM-NASE DD and the HMM-ASE DD method, which are, respectively, HMM-NASE and HMM-ASE methods with transition probability matrix depending on distance among adjacent SNPs.
Comparing the results in the following Table S.1 with the results reported in Table   5 in the paper, we found that HMM-NASE DD method produced exactly the same results as the HMM-NASE method. And the HMM-ASE DD method had a higher empirical false positive rate than HMM-ASE method, which might be due to the over parameterization in the HMM-ASE DD model. This indicates that, there is no advantage of using distance dependent transition matrix for SNPs in a small neighborhood.
We also compared with BCF tools and the BEAGLE methods to the actual genotypes after excluding the zero counts data excluded in the HMM-ASE and HMM-NASE. The results are collected in Table S.2. As we can see that the HMM-ASE and HMM-NASE did slightly better than the BCF tools and the BEAGLE on the non-zero counts.