 Methodology article
 Open Access
 Published:
A hidden Markov approach for ascertaining cSNP genotypes from RNA sequence data in the presence of allelic imbalance by exploiting linkage disequilibrium
BMC Bioinformatics volume 16, Article number: 61 (2015)
Abstract
Background
Allelic specific expression (ASE) increases our understanding of the genetic control of gene expression and its links to phenotypic variation. ASE testing is implemented through binomial or betabinomial tests of sequence read counts of alternative alleles at a cSNP of interest in heterozygous individuals. This requires prior ascertainment of the cSNP genotypes for all individuals. To meet the needs, we propose hidden Markov methods to call SNPs from next generation RNA sequence data when ASE possibly exists.
Results
We propose two hidden Markov models (HMMs), HMMASE and HMMNASE that consider or do not consider ASE, respectively, in order to improve genotyping accuracy. Both HMMs have the advantages of calling the genotypes of several SNPs simultaneously and allow mapping error which, respectively, utilize the dependence among SNPs and correct the bias due to mapping error. In addition, HMMASE exploits ASE information to further improve genotype accuracy when the ASE is likely to be present.
Simulation results indicate that the HMMs proposed demonstrate a very good prediction accuracy in terms of controlling both the false discovery rate (FDR) and the false negative rate (FNR). When ASE is present, the HMMASE had a lower FNR than HMMNASE, while both can control the false discovery rate (FDR) at a similar level. By exploiting linkage disequilibrium (LD), a real data application demonstrate that the proposed methods have better sensitivity and similar FDR in calling heterozygous SNPs than the VarScan method. Sensitivity and FDR are similar to that of the BCFtools and Beagle methods. The resulting genotypes show good properties for the estimation of the genetic parameters and ASE ratios.
Conclusions
We introduce HMMs, which are able to exploit LD and account for the ASE and mapping errors, to simultaneously call SNPs from the next generation RNA sequence data. The method introduced can reliably call for cSNP genotypes even in the presence of ASE and under low sequencing coverage. As a byproduct, the proposed method is able to provide predictions of ASE ratios for the heterozygous genotypes, which can then be used for ASE testing.
Background
RNAseq is revolutionizing transcriptome analyses [1]. While RNAseq is typically used for transcriptcentric analysis, where differential expression of genes or transcripts is tested between treatments or tissues [2], recently, RNAseq has been increasingly utilized for nucleotidecentric inferences such as, for coding SNP (cSNP) discovery [3], for cSNP genotyping to estimate population parameters [4] or for allelic specific expression [5,6].
ASE is particularly promising because it illuminates the genetic control of gene expression and its links to phenotypic variation [7]. In general, ASE testing is implemented through binomial or betabinomial tests of counts of alternative alleles in reads aligned to cSNPs of interest in heterozygous individuals [8]. Some algorithms and models have been specifically tailored to perform this inference using RNAseq data [811], but most of them require prior ascertainment of cSNP genotypes to extract read counts for heterozygous sites or they require RNAseq or genomic sequence on parents of the individuals used for ASE testing to reliably infer cSNP genotypes. Moreover, most models do not include biological replication and assume either a single replicate or treat all biological replicates alike and collapse counts down to the nucleotide level. These assumptions may not be too restrictive in F1 crosses of inbred strains of individuals of model organisms [12] for which exhaustive sequence resources are available and biological variation is minimal, but they become more problematic for outbred populations and their crosses [13] and even for crosses of inbred lines when the purpose is to focus on individual variation in ASE for breeding [14] or population genetics inferences [15].
In the above cases, genotypes are called first from RNAseq using models designed for calling SNP from genomic sequence data [1619], but there is a concern that extreme allelic imbalance could cause a heterozygous SNP to be mislabeled as homozygous or even not called at all, especially when coverage is low, as happens with low expressed genes [11,15]. This type of error is also present when calling SNP from pooled DNA samples where the expected allele frequency differs from 0, 0.5 and 1, as modeled in most SNP calling programs [20]. Moreover, while mislabeling a heterozygote as homozygote will not affect the estimation of the ASE ratio, it leads to loss in power. This is particularly important when working with outbred populations or their crosses, where the number of heterozygote individuals may be limited if the frequency of the minor allele is low. As a way to mitigate this problem, the use of phased haplotypes has been proposed [11], with the purpose of more reliably calling genotypes by exploiting linkage disequilibrium and minimizing the chances of missing heterozygote individuals. Exploitation of linkage disequilibrium is important because it has the potential to call SNP genotypes more accurately even with low expressed genes due to low sequence coverage.
In this paper, we concentrate on the problem of ascertaining the cSNP genotype when ASE is likely to be present. The HMM methods we propose improve genotyping accuracy by accounting for allelic imbalance, exploiting LD and allowing for mapping error. The hidden Markov approach has several advantages. First, it can model multiple SNPs simultaneously and their dependence through underlying hidden variables. Simultaneous modeling allows the HMM to make use of more data to estimate global parameters than a single SNP method. This results in increased accuracy in SNP calling especially in low expressed transcripts with low coverage. Second, the HMM is easy to implement through an ExpectationMaximization (EM) algorithm. Third, the HMM is very flexible. For example, it can be adapted to all kinds of modeling to account for individual variation in ASE ratios and sequence mapping errors. Fourth, the HMM is a likelihood based approach that can be easily used to make statistical inference. Consistency and asymptotic normality [21] can be established under some regularity conditions. The likelihood ratio approach may also be applied directly. Although HMM has been used successfully to identify copy number variations [22,23], it has not been applied to identify genotypes from RNA sequence data when allele specific expression exists. The proposed HMMs are immediately applicable after the SNPs are identified and locations are ascertained by existing software (e.g. VarScan). A comparison of existing software can be found in [18]
This paper is organized as follows. In the Methods section, we introduce HMMASE for calling the underlying genotype while predicting the ASE status; HMMNASE will be introduced as a special case. In the Results section, we present simulation results for the prediction of the underlying SNP, using HMMASE and HMMNASE first. Then a real data analysis is used to demonstrate the method proposed and compare it to other popular methods such as VarScan [17], BCFtools [19] and Beagle [24]. Concluding remarks are given in the Conclusion and Discussion section. We also provide Additional file 1 for additional details of the EM algorithms and some additional numerical results. A manual for the R package HMMASE can be found at http://www.stt.msu.edu/users/pszhong/HMMASE.html.
Methods
The purpose of this Section is to introduce HMMASE, which can infer the underlying genotypes and predict SNP with ASE simultaneously using the RNA counts from the next generation sequence data. The introduction of HMMNASE will be given as a special case at the end of this Section. Let X _{ il }=(X _{ i l1},X _{ i l2},X _{ i l3},X _{ i l4})^{T} be observed allele specific RNA counts of the lth SNP for l=1,⋯,L and the ith individual (i=1,⋯,n), where X _{ i l1},X _{ i l2},X _{ i l3},X _{ i l4} represent observed counts for A,C,G and T, respectively. Let \(n_{\textit {il}}=\sum _{j=1}^{4}X_{\textit {ilj}}\) be the total counts at the lth SNP of the ith individual.
To simplify the notation, we will introduce the proposed HMMASE method for a biallelic SNP. The extension to other cases can be done in a similar manner. Without loss of generality, consider two possible alleles A and T. There are three possible genotypes, two homozygous AA and TT, and one heterozygous AT. For the heterozygous genotype, we also wish to predict if an allelic specific expression exists for alleles A and T. Specifically, we want to further classify the heterozygous genotypes AT into three states ATNASE (heterozygous without ASE), ATASEHIGH (heterozygous with ASE, with reads of A more than T) and ATASELOW (heterozygous with ASE, with reads of T more than A). For convenience, let G _{ il } represent the (hidden) combination of genotype and allelic specific status at the position l (l=1,⋯,L) for the ith individual where
where “ATNASE” means the combination of genotype “AT” and nonallelic specific expression (NASE); “ATASE” means the combination of genotype “AT” and allelic specific expression (ASE). Given the observed RNA counts {X _{ il }:i=1,⋯,n;l=1,⋯,L}, we wish to predict the underlying genotypes G _{ il } for all i and l. This prediction simultaneously determines the genotypes and the ASE status of each SNP.
Assume that sequence error exists in X _{ il } so that all alleles are possibly observed. The read counts X _{ il } are generated from a hierarchical model, which is determined by a hidden genotype G _{ il } and allele specific ratio δ _{ il }. That is, given n _{ il } and δ _{ il }, X _{ il }=(X _{ i l1},X _{ i l2},X _{ i l3},X _{ i l4})^{T} follows a multinomial distribution, i.e.,
where p(δ _{ il },e) represents the probability vector of multinomial distribution, δ _{ il } represents the allelic specific ratio, e is used to account for the mapping error. Given δ _{ il }, the probabilities of observing a read as A, C, G or T are specified in the following probability vector
The ASE ratios δ _{ il } is a random variable that is generated from a distribution depending on G _{ il }=k in the following ways
where \(I_{\{\delta _{\textit {il}}=a\}}\) represents a discrete random variable with probability mass one on point a, and B _{(S,U)}(α,β) (S<U) is a rescaled beta distribution taking values within (S,U) which has a probability density function
where B e t a(α,β) is the beta function with parameters α and β. Further, we assume that δ _{ il } are independent given G _{ il }. As a usual HMM model, we assume the hidden states {G _{ il }:l=1,⋯,L} follows a Markov process to allow dependence among the observed counts X _{ il }s. The transition probability among underlying genotypes G _{ il } is assumed to be
with initial probabilities P(G _{ i1}=k)=π _{ ik } and M=5. One may note that the transition probabilities in (0.6) do not depend on the distances between SNPs, which motivates us to extend the transition probability matrix as a function of the distance between adjacent SNPs. The details of this extension and the associated algorithm are summarized in Additional file 1.
Denote the observed RNA counts data (incomplete) to be X _{ i }={X _{ i1},⋯,X _{ iL }} for i=1,⋯,n. Then the posterior probability of G _{ il }=k given X={X _{1},⋯,X _{ n }}, i.e., P(G _{ il }=kX), will be used for predicting the underlying genotypes and simultaneously infer the allelic specific status at the lth SNP for the ith individual. The posterior probability P(G _{ il }X) can be computed by Bayes’ formula
where G _{ i }=(G _{ i1},⋯,G _{ iL })^{T} is all the possible underlying genotypes combinations on the L positions.
To obtain the posterior probability in (0.7), we note the probabilities P(X,G _{ i }) and P(X) depend on a vector of unknown parameters θ=(α _{1},β _{1},α _{2},β _{2},e,A)^{T}, where A=(a _{ k k′}) are parameters in the transition matrix. We will use maximum likelihood estimates (MLE) to estimate θ. We can find the MLEs of θ by an EM algorithm [25,26]. To this end, we introduce the following complete data corresponding to the observed data X,
The likelihood function for the complete data is
where f _{ X }(X _{ il }G _{ il }) is the conditional density of X _{ il } given G _{ il } obtained from (0.2) and (0.4) whose explicit forms can be found in the Additional file 1 and G=(G _{1},⋯,G _{ n })^{T}. It follows that the loglikelihood function of L(θY) is given by
Given θ ^{(m)}, the update θ ^{(m+1)} is found by maximizing E{logL(θY)X,θ ^{(m)}}. The details of the EM algorithm can be found in the Additional file 1. We implemented the EM algorithm by a forward and backward method [27]. Further details about the forward and backward algorithm can be found in the Additional file 1.
The HMMNASE method could be considered as a simplification of the HMMASE method. The difference between HMMNASE and HMMASE is that HMMNASE does not consider the possible existence of ASE. As a result, the underlying genotypes of HMMNASE only contain three states AA,TT and ATNASE, which means that G _{ il } in (0.1) can only have three possible values 1,2 and 5. The emission probability will be (0.3) with k set to be 1,2 and 5 in (0.4). Then the above forward backward algorithm is still applicable except that the unknown parameter is reduced to θ=(e,A)^{T}.
The real data analyzed in this paper were collected by [28]. The experimental procedures were approved by the All University Committee on Animal Use and Care at Michigan State University (AUF# 09/0311400).
Results
ᅟ
Simulation study
We performed a simulation study to demonstrate the proposed HMMASE and HMMNASE methods. The underlying genotypes of SNPs were generated with the linkage disequilibrium (LD) information. Assume that LD(d) is the LD between two SNPs with distance d, which is a known function of d. For simplicity, in this simulation, we assume that the LD is a constant function of d and each SNP has only two possible alleles: either A or T. The genotypes were generated by combining two independent haplotypes.
Let S _{ il } be the allele (either A or T) at the lth SNP for the ith individual. The marginal probabilities for A and T are P(S _{ i l }=A)=p _{ A·}=p _{·A } and P(S _{ il }=T)=p _{ T·}=p _{·T } respectively. For any pair of SNPs which are next to each other in the position, the joint probability mass is defined as P(S _{ il }=x,S _{ i(l+1)}=y)=p _{ x y } where x,y are either A or T.
Note that the LD(d) is then defined as
Hence p _{ AA },p _{ AT },p _{ TA } and p _{ TT } can be computed once LD(d),p _{ A·} and p _{·A } are given. We generated each side (a haplotype) of the SNP sequences independently. Both sides are generated by the following three steps:

Set l=1 and generate a random variable b _{ l1}∼Bernoulli(p _{ A·}). If b _{ il }=1 then we set S _{ il }=A, otherwise S _{ il }=T.

Let l=l+1. Generating allele at l+1 position conditional on the l position. Namely, P(S _{ i(l+1)}=AS _{ il }=x)=p _{ xA }/p _{ x·}, where x could be either A or T.

Repeating step (b) until we get L SNPs.
We then generated total read counts \(n_{\textit {il}}=\sum _{k=1}^{4} X_{\textit {ilk}}\) from a Negative Binomial (NB) distribution independently for at each SNP l=1,⋯,L and individual i=1,⋯,n. Conditional on the underlying genotype and the total RNA counts, we generated the allele specific RNA counts X _{ il } through the hierarchical model given in (0.2) through (0.4). For illustration purpose, we considered a data set that was generated only by 4 underlying states in the simulation. Namely, AA, ATNASE, ATASEHIGH and TT, where the distribution of the allele specific ratio δ _{ il } for the ATASEHIGH state was changed to a Beta(α,β) distribution where we set α=30,β=10 such that the mode and center of the beta distribution is concentrated around 0.75.
The following scenarios were used in the simulation. We designed three different numbers of individuals (n): 6, 12 and 24, and two different numbers of SNP (L): 10 and 100. For each of the above six individual/SNP combinations, the total number of reads were simulated from negative binomial distributions NB(λ,0.4) with five different λ values: 8, 16, 24, 32 and 56 as well as two different values for LD: 0.5 and 0.8. We did 10 replications for each of the above combinations. We measured the performance of the proposed method by empirical false discovery rate (EFDR) and empirical false negative rate (EFNR), where were defined, respectively, as
The proposed HMMASE and HMMNASE were applied to the above scenarios. Table 1 and Table 2 summarize the EFDR and EFNR for both methods in the case with 10 SNPs. The first and second columns of both tables represent the values of λ and LD. The larger the value of λ, the larger the average of the RNA counts. On one hand, HMMASE and HMMNASE share some similarity. By increasing the value of λ, the EFDR and EFNR of both HMMs were smaller with lower variability (smaller standard deviation). Increasing the LD value led to better predictions, which shows that both HMMs made use of the LD information in predicting genotypes. Specifically, one can see that the EFDR and EFNR rates were improved with increased LD when λ is relative small. On the other hand, the EFNR rates of HMMNASE in Table 2 were almost always consistently larger than the EFNR rates from HMMASE in the same table while the EFDR (Table 1) of HMMASE were slightly higher than those of HMMNASE when λ is relative small. This demonstrates that HMMASE has better sensitivity than HMMNASE in heterozygote SNP genotypes from RNAseq data when ASE is likely present because ASE is considered in HMMASE. To better illustrate these two methods, Figure 1 compared the EFDR and EFNR for HMMASE and HMMNASE methods when n=24.
Finally, we assessed the effect of increasing the size of the SNP block on HMMASE and HMMNASE results. We repeated the simulations using 100 SNP blocks. The results are presented in Tables 3 and 4. For the LD structure, coverage and sample sizes used in this simulation, there was virtually no significant difference in error rates compared to the results using a smaller SNP block. This suggests that small SNP blocks are as efficient as larger blocks in utilizing the advantage of LD. Thus, we prefer using small blocks to reduce the computation load. In addition, we also observed that the number of individuals used in the simulation only had minor effects on the EFDR and EFNR with differences well within the range of standard errors.
Simulation conditional on haplotypes from real data
To make the haplotype structures in the simulation data more realistic, we randomly selected hayplotype structures from a real data in the pig resource population [28] to create the genotypes. Conditional on the genotypes, we generated the counts data using the same methods as those in Tables 1, 2, 3 and 4. Results of the EFDR and EFNR of the HMMASE and HMMNASE methods are shown in Table 5. HMMASE still maintained a low level EFDR and EFNR indicating that the HMMASE method is robust to the change of underlying haplotype structures. But the EFNR of the HMMNASE method was higher in Table 5, because HMMNASE did not account for the ASE, which exists in the simulated data. To confirm this, we further generated counts without ASE (ASE ratios =0.5), the results are summarized in Table 6. Both the HMMASE and HMMNASE methods performed well in this case, suggesting that the HMMNASE method is robust to the change of haplotype structures but not to the existence of ASE. This confirms the importance of developing the HMMASE method.
Real data analysis
Assessing accuracy of heterozygote calling rates Called cSNP genotypes were compared to goldstandards or true genotypes. In the simulation, the true genotype was readily available. For the real data analysis, genotypes obtained from a DNA SNP chip were used as a gold standard to evaluate the performance of cSNP calling and genotyping. The EFDR and Sensitivity were computed to assess the accuracy of genotype calling where
These measures are especially relevant when the intent of genotype calling is to perform ASE testing, because they focus on key heterozygous genotypes. EFDR and Sensitivity can be computed globally across all sites and individuals on a cSNP site basis.
Comparison with alternative methods We applied the proposed HMMs, HMMASE and HMMNASE, to a real RNAseq dataset and compared SNP genotype calls with those from VarScan, SAMtools+BCFtools and BEAGLE, wellknown methods for SNP and mutation calling from sequence data. RNAseq data were available for 24 female pigs from an F2 cross of Duroc and Pietrain in our pig resource population [2932]. Pig breeds are outbred and show substantial variation in allele frequency, high linkage disequilibrium within breed and limited phase agreement between breeds [33]. These animals were part of a larger transcriptional profiling study [34] and had been selected because they showed extreme phenotypes for loin eye area (a trait of economic value) compared to their litter mates. SNP chip data were available from the 60K illumina chip [35] from a recent study. Genotype data from the chip were treated as a gold standard against which cSNP called with RNAseq were validated. RNA from each sample was reverse transcribed into cDNA, fragmented and labeled to generate 24 barcoded libraries that were sequenced on an Illumina HiSeq 2000 (100 bp, pairedend reads). Each library was sequenced in four lanes with the raw read data consisting of 96 pairs of fastq files (4 per sample) containing approximately 15million shortreads (100 bp) each. Those fastq files were preprocessed using FASTX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess read quality. Then, Tophat [36] was used for mapping the reads to the reference genome (Sus scrofa 10.2.69 retrieved from the Ensembl database) using an index generated by Bowtie2 [37]. The aligned records were stored in BAM/SAM format [38]. Alignment statistics and base coverage were calculated for each file using SAMTools [38]. Initially coding SNP discovery and genotyping were done with VarScan [17]. First, a base alignment file (.mpileup) for each covered position was obtained for each chromosome using the mpileup option of SAMTools [19] and subsequently VarScan [17] was used to call genotypes and count reads mapping to each segregating allele. We focused on chromosome 13 and extracted counts of reads agreeing with reference (R) or alternative (A) allele with respect to the reference genome at putative 5364 cSNP discovered by VarScan, which included 65 SNPs represented in the 60K chip for which we had reliable genotype data. We segmented the SNP data into 65 brackets that included each of those SNP and their surrounding cSNPs. Each bracket was analyzed separately with our program because we expect that only closely linked cSNPs will benefit from our multiSNP HMM model.
There were a total of 1560 genotypes to impute (24 animals and 65 SNPs), 591 heterozygotes and 969 homozygotes. VarScan (Table 7) did not impute any homozygotes as heterozygotes (EFDR = 0), but it only correctly identified 449 of the 591 heterozygotes (Sensitivity = 0.76). This drop in sensitivity to detect heterozygotes was accounted for by the noncall rate (64/591 =0.11) and wrongly calling 78 heterozygotes as homozygotes (0.14). The HMMASE and HMMNASE had an EFDR =0.015 (9/(571 + 9)) and 2/(570 + 2) =0.0035, respectively. But the sensitivities were 1.0 (detected all Heterozygotes) and 0.998, respectively if the restriction of having at least one read was imposed. Remarkably, even some genotypes without any reads were imputed correctly due to exploitation of zygotic disequilibrium. In this particular data set, HMMASE did not show additional advantage by considering ASE effects, indicating that modeling dependence is more important than modeling ASE. We also called cSNPs using two widely used algorithms: SAMtools+BCFtools and SAMtools+BCFtools+Beagle. SAMtools+BCFtools is probably the most commonly used algorithm for calling SNP, it calls SNP genotypes independently and its likelihood function assumes no ASE. When read counts are very low, SAMtools+BCFtools may not call SNP genotypes. Alternatively, BEAGLE uses the output from the previous algorithm and performs SNP calling by accounting for LD. Table 7 shows that these two methods were very similar to the proposed methods in terms of EFDR, which were 0.007 and 0.010, respectively. These two methods performed slightly better than HMMASE and HMMNASE for SNPs with zero counts. But the proposed methods were slightly better than the SAMtools+BCFtools and Beagle methods for SNPs with nonzero counts (See Table S.2 in Section 3 of the Additional file 1). Since the performance of HMMASE, HMMNASE, SAMtools+BCFtools and Beagle methods were very similar in discovering heterozygous SNPs, we decided to proceed with HMMASE analysis after assigning noncalls to those genotypes without any read.
Upon a SNPbySNP analysis of genotype calls, the low sensitivities obtained with the two programs (HMMASE and VarScan) for some markers can be largely explained by low coverage (Figure 2). In our HMMASE model, having at least 200 total reads (across all 24 individuals) produced sensitivities over 0.8 but the effect of these errors in final inferences is not clear. For example, if inferred genotypes are used for ASE, the effect of missing a heterozygote is lower power, while the effect of incorrectly imputing a homozygote as heterozygote could be biases in the estimated ASE (and false positive rate of ASE tests). Conversely, the two errors could potentially cancel out when using genotypes to estimate minor allelic frequency (MAF). Consequently, we proceeded to estimate MAF using ascertained genotypes and confirmed that HMMASE estimated MAF very precisely (Figure 3) with a correlation of 0.94 with DNA chip based estimates.
In order to assess the effect of genotype ascertainment on estimation and testing of ASE, we used heterozygous genotypes (either observed with chip or ascertained) and fit a betabinomial model [39] to read counts. In fact, when working with HMMASE this would not be necessary because our method produced estimates of ASE parameters, but we used the betabinomial model in order to separately assess the effect of genotype calling errors. In Figure 4 we observe that using genotypes ascertained with the HMMASE produced very accurate estimates of ASE. On the other hand, using heterozygote genotypes ascertained with VarScan also produced good agreement with those from chip data, except when the sensitivity was very low either because of calling heterozygotes as homozygotes or because of noncalling a genotype (horizontal points close to zero). These effects were even more obvious when looking at associated pvalues. In that case, missing heterozygotes from the single SNP analysis program (in this case VarScan) substantially reduced significance (Figure 5).
Conclusion and discussion
In this paper, we present HMM methods to call SNP genotypes in the presence of allelic imbalance by exploiting zygotic disequilibrium. In its present form, HMMASE and HMMNASE require that cSNP locations have been previously ascertained and conditional on those it can accurately call their genotypes. Our program is particularly useful for cSNP genotyping after a SNP discovery step has been applied [18]. This is important because while many programs are tailored for cSNP discovery [18] and genotyping assuming allelic balance in heterozygotes, there is a need for accurate genotype calling by exploiting linkage disequilibrium in the presence of ASE in individuals from outbred populations [15]. HMMASE and HMMNASE can use read counts of 4 bases or prefiltered biallelic counts. The biallelic option was used in the real data case because previous cSNP discovery using VarScan produced counts of reference and alternative alleles. A similar preprocessing step can be performed with a number of available programs [18]. The strength of HMM compared to other methods is that it can reliably call heterozygous cSNP genotypes even in the presence of ASE and under low sequencing coverage. Furthermore, by comparing to DNAchip genotypes, genotypes produced using the proposed methods resulted in good estimates of ASE and MAF. This is important in population genetics that can use lowcoverage sequence of many individuals in order to accurately estimate MAF, linkage disequilibrium and other population genomics parameters [33]. Another potential use of genotypes obtained from HMMs is to perform ASE testing [13]. Although not extensively studied in this first paper, HMMASE could be used to derive not only the cSNP genotypes but their ASE ratios. In its current form, HMMASE integrates out such information when calling genotypes, but further work in this area is warranted.
An important part of implementing our HMM algorithm in HMMASE and HMMNASE consists of segmenting the SNPs in groups that are tractable and informative. From the simulation study and real data analysis, we found that the HMMNASE was very robust in terms of group segmentation but HMMASE was slightly more sensitive to the number of SNPs and the length of the segment. In particular, EFDR (genotyping homozygotes as heterozygotes) was slightly lower when the segment was 1kb long compared to 4 kb long. We experimented with many criteria to partition the SNPs in the real data set and found that the inferences were robust to the number of SNPs in the segment for a range of 2 to 25 SNPs over a 1/2 kb to 2 kb long segments. We only observed one SNP within a 1 kb segment containing 35 SNPs had slightly higher EFDR. Further inspection of the segment indicated that this region likely included SNPs from several transcripts and that the zygotic disequilibrium seemed to be low for SNPs on different transcripts. A possibility to mitigate this problem could be to group SNP by transcript by using bioinformatics tools such as the ensemble variation API [40]. Since this problem was sporadic (one segment) in our data set we believe that such an approach was not needed.
In summary, in this paper we present and evaluate an algorithm for calling SNP genotypes in the presence of allelic imbalance by exploiting linkage disequilibrium. The method is particularly suitable for calling cSNP from lowcoverage RNAseq data and the resulting genotypes show good properties for estimation of genetic parameters and allelic ratios. We provide HMMASE, an R package to implement the proposed algorithm (http://www.stt.msu.edu/users/pszhong/HMMASE.html). Our algorithm performed better than VarScan and similarly to BCFtools and Beagle, indicating that the joint modeling of ASE and LD recovered important information although our algorithm did not use haplotypes information. Furthermore, our promising results encourage further research on extending the algorithm to incorporate haplotype structures and performing the ASE testing.
Availability of supporting data
The data set and the R package HMMASE supporting the results of this article are available in http://www.stt.msu.edu/users/pszhong/HMMASE.html.
References
Wang Z, Gerstein M, Snyder M. RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10:57–63.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–40.
Cánovas A, Rincon G, IslasTrejo A, Wickramasinghe S, Medrano JF. SNP discovery in the bovine milk transcriptome using RNASeq technology. Mammalian Genome. 2010; 21:592–8.
Nielsen R, Korneliussen T, Albrechtsen A, Li Y, Wang J. SNP calling, genotype calling, and sample allele frequency estimation from NewGeneration sequencing data. PLoS One. 2012; 7:37558.
Montgomery S, Sammeth M, GutierrezArcelus M, Lach R, Ingle C, Nisbett J, et al.Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010; 464:773–77.
Pickrell J, Marioni Pai A, Degner J, Engelhardt B, Nkadori E, Veyrieras J, et al.Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010; 464:768–72.
Pastinen T. Genomewide allelespecific analysis: insights into regulatory variation. Nat Rev Genet. 2010; 11:533–8.
Sun W. A statistical framework for eQTL mapping using RNAseq data. Biometrics. 2012; 68:1–11.
Pandey R, Franssen S, Futschik A, Schlotterer C. Allelic imbalance metre (Allim), a new tool for measuring allelespecific gene expression with RNAseq data. Mol Ecol Resour. 2013; 13(4):740–5.
Rozowsky J, Abyzoy A, Wang J, Alves P, Raha D, Harmanci A, et al.AlleleSeq: analysis of allelespecific expression and binding in a network framework. Mol Syst Biol. 2011; 7:522.
Turro E, Su S, Gonçalves A, Coin L, Richardson S, Lewin A. Haplotype and isoform specific expression estimation using multimapping RNAseq reads. Genome Biol. 2011; 12:R13.
Skelly D, Johansson M, Madeoy J, Wakefield J, Akey J. A powerful and flexible statistical framework for testing hypotheses of allelespecific gene expression from RNAseq data. Genome Res. 2011; 21:1728–37.
Ernst CW, Steibel JP. Molecular advances in QTL discovery and application in pig breeding. Trends Genet. 2013; 29:215–224.
Perumbakkam Muir W, BlackPyrkosz A, Okimoto R, Cheng H. Comparison and contrast of genes and biological pathways responding to Marek’s disease virus infection using allelespecific expression and differential expression in broiler and layer chickens. BMC Genomics. 2013; 14:64.
Singhal S. De novo transcriptomic analyses for nonmodel organisms: an evaluation of methods across a multispecies data set. Mol Ecol Resources. 2013; 13:403–16.
DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, et al.A framework for variation discovery and genotyping using nextgeneration DNA sequencing data. Nat Genet. 2011; 43:491–8.
Koboldt D, Chen K, Wylie T, Larson D, McLellan M, Mardis E, et al.VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics. 2009; 25:2283–85.
You N, Murillo G, Su X, Zeng X, Ning K, Zhang S, et al.SNP calling using genotype model selection on highthroughput sequencing data. Bioinformatics. 2012; 28:643–50.
Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011; 27:2987–93.
Raineri E, Ferretti L, EsteveCodina A, Nevado B, Heath S, PérezEnciso M. SNP calling by sequencing pooled samples. BMC Bioinformatics. 2012; 13:239.
Bickel PJ, Ritov Y, Rydén T. Asymptotic normality of the maximumlikelihood estimator for general hidden Markov models. Ann Stat. 1998; 26:1614–1635.
Chen H, Xing H, Zhang N. Estimation of parent specific DNA copy number in tumors using highdensity genotyping arrays. PLoS Comput Biol. 2011; 7(1):e1001060. doi:10.1371/journal.pcbi.1001060.
Wang K, Li M, Hadley D, Liu R, Glessner J, Grant S, et al.PennCNV: An integrated hidden Markov model designed for highresolution copy number variation detection in wholegenome SNP genotyping data. Genome Res. 2007; 17(11):1665–74.
Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing data inference for whole genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007; 81:1084–97.
Baum L, Petrie T, Soules G, Weiss N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat. 1970; 41:164–71.
Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm (With discussion). J R Stat Soc Ser B. 1977; 39:1–38.
Rabiner L. A tutorial on Hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989; 77:257–86.
Gualdrón Duarte J, Bates R, Ernst C, Raney N, Cantet R, Steibel J. Genotype imputation accuracy in a F2 pig population using high density and low density SNP panels. BMC Genet. 2013; 14:38. doi:10.1186/147121561438.
Choi I, Bates R, Raney N, Steibel J, Ernst C. Evaluation of QTL for carcass merit and meat quality traits in a US commercial Duroc population. Meat Sci. 2012; 92:132–8.
Choi I, Steibel J, Bates R, Raney N, Rumph J, Ernst C. Identification of Carcass and Meat Quality QTL in an F(2) Duroc x Pietrain pig resource population using different leastsquares analysis models. Front Genet. 2011; 2:18.
Edwards D, Ernst C, Raney N, Doumit M, Hoge M, Bates R. Quantitative trait loci mapping in an F2 Duroc x Pietrain resource population. I. Growth traits. J Anim Sci. 2008a; 86:241–53.
Edwards D, Ernst C, Raney N, Doumit M, Hoge M, Bates R. Quantitative trait locus mapping in an F2 Duroc x Pietrain resource population. II. Carcass and meat quality traits, J Anim Sci. 2008b; 86:254–66.
Badke Y, Bates R, Ernst C, Schwab C, Steibel J. Estimation of linkage disequilibrium in four US pig breeds. BMC Genomics. 2012; 13:24.
Steibel J, Bates R, Rosa G, Tempelman R, Rilington V, Ragavendran A, et al.Genomewide linkage analysis of global gene expression in loin muscle tissue identifies candiyear genes in pigs. PLoS One. 2011; 6:e16766.
Ramos A, Crooijmans R, Affara N, Amaral A, Archibald A, Beeyer J, et al.Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS One. 2009; 4:e6524.
Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNASeq. Bioinformatics. 2009; 25(9):1105–11.
Langmead B, Salzberg SL. Fast gappedread alignment with Bowtie 2. Nat Methods. 2012; 9:357–59.
Li H, Handsaker B, Fennell T, Ruan J, Homer N, Marth G, et al.The sequence alignment/Map format and SAMtools. Bioinformatics. 2009; 25:2078–9.
Griffiths DA. Maximum likelihood estimation for the betabinomial distribution and an application to the household distribution of the total number of cases of disease. Biometrics. 1973; 29:637–48.
McLaren W. Deriving the consequences of genomic variants with the Ensembl API and SNP effect predictor. Bioinformatics. 2010; 26:2069–70.
Acknowledgements
The authors thank the Associate Editor and three referees for constructive comments and suggestions. We also acknowledge the supports from NSF grants DMS1209112, DMS1309156, NIFAAFRI 20106520520342 and NIFAAFRI 20110701530338.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Development of the method (PSZ & JPS); Developing the software (HW); Writing the manuscript (PSZ & JPS). All authors read and approved the final manuscript.
Additional file
Additional file 1
Supplemental. Contains a detailed algorithm of the proposed HMMASE algorithm (Section 1), with an extension to the case with distance dependent transition matrix (Section 2). In Section 3, we include some additional real data analysis results using distance dependent transition matrix and the comparison with BCFtools and Beagle after excluding the zero counts in the data set. In Section 4, two additional tables for simulation studies with haplotypes generated from the real data are also included.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Steibel, J.P., Wang, H. & Zhong, PS. A hidden Markov approach for ascertaining cSNP genotypes from RNA sequence data in the presence of allelic imbalance by exploiting linkage disequilibrium. BMC Bioinformatics 16, 61 (2015). https://doi.org/10.1186/s1285901504792
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901504792
Keywords
 Hidden Markov model
 RNAseq data
 Allelic specific expression