dsPIG: a tool to predict imprinted genes from the deep sequencing of whole transcriptomes

Li, Hua; Su, Xiao; Gallegos, Juan; Lu, Yue; Ji, Yuan; Molldrem, Jeffrey J; Liang, Shoudan

doi:10.1186/1471-2105-13-271

Research article
Open access
Published: 19 October 2012

dsPIG: a tool to predict imprinted genes from the deep sequencing of whole transcriptomes

Hua Li^1,2,
Xiao Su³,
Juan Gallegos⁴,
Yue Lu⁵,
Yuan Ji⁶,
Jeffrey J Molldrem² &
…
Shoudan Liang⁷

BMC Bioinformatics volume 13, Article number: 271 (2012) Cite this article

4674 Accesses
10 Citations
Metrics details

Abstract

Background

Dysregulation of imprinted genes, which are expressed in a parent-of-origin-specific manner, plays an important role in various human diseases, such as cancer and behavioral disorder. To date, however, fewer than 100 imprinted genes have been identified in the human genome. The recent availability of high-throughput technology makes it possible to have large-scale prediction of imprinted genes. Here we propose a Bayesian model (dsPIG) to predict imprinted genes on the basis of allelic expression observed in mRNA-Seq data of independent human tissues.

Results

Our model (dsPIG) was capable of identifying imprinted genes with high sensitivity and specificity and a low false discovery rate when the number of sequenced tissue samples was fairly large, according to simulations. By applying dsPIG to the mRNA-Seq data, we predicted 94 imprinted genes in 20 cerebellum samples and 57 imprinted genes in 9 diverse tissue samples with expected low false discovery rates. We also assessed dsPIG using previously validated imprinted and non-imprinted genes. With simulations, we further analyzed how imbalanced allelic expression of non-imprinted genes or different minor allele frequencies affected the predictions of dsPIG. Interestingly, we found that, among biallelically expressed genes, at least 18 genes expressed significantly more transcripts from one allele than the other among different individuals and tissues.

Conclusion

With the prevalence of the mRNA-Seq technology, dsPIG has become a useful tool for analysis of allelic expression and large-scale prediction of imprinted genes. For ease of use, we have set up a web service and also provided an R package for dsPIG at http://www.shoudanliang.com/dsPIG/.

Background

Diploid eukaryotic species inherit two copies (i.e., two alleles) of the same gene from both parents. If one allele fails to work properly, the other allele can still implement a gene’s cellular function. For some genes, however, this protective mechanism is disabled because only one allele is expressed and its failure probably leads to cellular malfunction. These monoallelically expressed genes can be classified into one of three categories [1]: X-inactivated genes, which are regulated by a random process in which one of the two X chromosomes present in female mammals is silenced [2]; autosomal genes subject to random monoallelic gene expression, such as the T cell receptors and natural killer cell receptors [3–9]; and autosomal imprinted genes (e.g., CDKN1C and H19), which express from only one of the two alleles according to their parental origin [10–13]. Imprinted genes play important functional roles in the control of embryonic growth and development, as well as in post-natal development [14–16]. As imprinted genes are expressed from only one of the two parental chromosomes, a de facto haploid state is caused by imprinting and leads to asymmetric functions of parental genomes and loss of diploid protection against recessive mutations [11]. Thus, imprinting dysregulation is linked to numerous human genetic diseases, such as developmental disorders (Prader-Willi syndrome, Angelman syndrome) and cancers (neuroblastoma, Wilms’ tumor) [17–20]. In addition, environmental factors can influence gene expression by targeting imprinted genes [21, 22]. Because imprinted genes are more susceptible to disease than non-imprinted genes [23, 24], it is of great importance to identify novel imprinted genes for human.

Identifying imprinted genes experimentally has traditionally been a slow process, and the number of validated ones is much lower than the previous estimation (~1% of all human genes) [25]. However, recently developed high-throughput screening approaches (e.g., expression profiling and single-nucleotide polymorphism [SNP] microarrays) and recently identified DNA sequence characteristics (e.g., the number and type of repeated elements flanking a gene) have led to the proposal of several new methods to predict imprinted genes on a global scale [26–31]. With advances in next-generation sequencing technology [32, 33], mRNA-Seq is becoming a powerful tool for transcriptome profiling [34]. It can generate not only the number of reads mapped to exons, which reflects the expression levels of a gene, but also the actual sequence, which may identify the allele from which the mRNA is expressed. Therefore, inference can be made for predicting imprinted genes. For example, by sequencing whole transcriptomes from mice embryos, Babak et al. (2008) measured allelic expression bias and identified six novel imprinted genes [35].

However, to our knowledge, prediction of imprinted genes by deeply sequencing transcriptomes (mRNA-Seq) from multiple independent tissues is still an open problem. In this study, we proposed a Bayesian model – dsPIG (d eep s equencing-based P rediction of I mprinted G enes) – to predict imprinted genes based on allelic expression inferred from observed SNPs in mRNA-Seq data of independent human tissues. With dsPIG, we were able to measure the imbalance of allelic expression among various tissues and calculate the posterior probability of imprinting status for each gene. Under a stringent probabilistic cut-off of the posteriors and other reasonable biological criteria, we identified 57 potentially imprinted genes from 9 diverse human tissues and 94 potentially imprinted genes from 20 cerebellar cortices, with an expected low false discovery rate (FDR). Furthermore, analysis of allelic expression of the same genes among different tissues revealed that, in some cases, even if a gene was biallelically expressed, one allele always had higher expression level than the other.

Results

Statistical model development

Monoallelic expression generally falls into one of three categories: imprinted expression, random monoallelic expression and X-inactivation, all of which express only one of two alleles in a single cell [1–10]. At a tissue level, however, random monoallelic expression will allow both alleles to be detected in total RNA because of the “mosaicism” of the tissue [9, 36] (also see discussion). Because our study was based on whole transcriptomes of tissue samples, random monoallelic expression was reasonably considered as biallelic expression when averaged over the entire tissue. X-inactivation was also excluded from this study by discarding all predictions on the X chromosome. Thus imprinting is the most likely cause of the observed monoallelic expression among transcriptomes of different tissues even though we cannot infer the parent of origin.

We used known SNPs from dbSNP [37] to distinguish and count the two alleles of each gene. If a gene was imprinted, we expected to observe only one of the two alleles of each SNP in the exon region from the whole transcriptome. With the allelic counts obtained from the mRNA-Seq data (see Materials and Methods), we developed a Bayesian model (dsPIG) to compute the posterior probability of imprinting based on each single SNP. Suppose we have sequenced transcriptomes from n independent tissue samples. For each sample, we count the alleles of all known SNPs, discarding those with 0 counts. For each SNP, let the allelic counts be: (x₁, y₁), (x₂, y₂)…, (x_n, y_n), where x_k and y_k are the counts for two alleles X and Y in the sample k (k= 1, …, n). Because each gene can only have two statuses: imprinted (I) or non-imprinted (NI), we consider (I, NI) as a binary event vector for the imprinting status. By denoting data = {(x₁, y₁), (x₂, y₂)…, (x_n, y_n)}, we have by Bayes’ Theorem:

\Pr (I | d a t a) = \frac{(\Pr d a t a | I) \times \Pr (I)}{\Pr (d a t a | I) \times \Pr (I) + \Pr (d a t a | N I) \times \Pr (N I)}

(1)

where Pr(I | data) is the posterior probability of imprinting and Pr(I) is the prior of imprinting. Based on current knowledge of prevalence of imprinted genes [25], we set the prior Pr(I)=1%, thus Pr(NI) = 1 – Pr(I) = 99%. Since samples were independent of each other and genotype has only 3 possible combinations (XX, XY, YY), we denote the genotype as follows:

δ = {\begin{cases} 1 the genotype is X X \\ 2 the genotype is X Y \\ 3 the genotype is Y Y \end{cases}

Assuming p and q are the allele frequencies for allele X and Y, p + q =1. According to Hardy-Weinberg equilibrium, the prior probabilities for the three genotypes are calculated as follows:

\Pr (δ) = {\begin{cases} = \Pr {(X)}^{2} = p^{2} δ = 1 \\ = \Pr (X) \Pr (Y) = 2 p q δ = 2 \\ = \Pr {(Y)}^{2} = q^{2} δ = 3 \end{cases}

Since values of p and q can be retrieved from dbSNP [37], p and q are treated as constants.

We used the law of total probability to calculate the likelihood Pr(data|I) as follows:

\Pr (data| I) = \prod_{k = 1}^{n} \Pr (x_{k}, y_{k} | I) (n \geq 1)

(2)

\begin{array}{l} = \prod_{k = 1}^{n} [\Pr (x_{k}, y_{k} | I, δ = 1) \times p^{2} + \Pr (x_{k}, y_{k} | I, δ = 2) \\ \times 2 p q + \Pr (x_{k}, y_{k} | I, δ = 3) \times q^{2}] \end{array}

By assuming that (i) the transcript levels of a gene’s two alleles are equal if the gene is biallelically expressed, (ii) two different alleles, if both expressed, have the same chance to be sequenced by mRNA-Seq, and (iii) unexpressed alleles with counts >0 are due to sequencing errors, we had the following derivation:

\Pr (x_{k}, y_{k} | I, δ = 1) = f (y_{k}; n_{k}, p_{e})

(3)

\Pr (x_{k}, y_{k} | I, δ = 3) = f (x_{k}; n_{k}, p_{e})

(4)

\Pr (x_{k}, y_{k} | I, δ = 2)

\begin{array}{l} = \Pr (x_{k}, y_{k} | I, δ = 2, θ = 0) \times \Pr (θ = 0) \\ + Pr (x_{k}, y_{k} | I, δ = 2, θ = 1) \times Pr (θ = 1) \end{array}

= f (y_{k}; n_{k}, p_{e}) \times \frac{1}{2} + f (x_{k}; n_{k}, p_{e}) \times \frac{1}{2}

(5)

Here, f denotes binomial distribution [i.e., $f (x; n, p) = (\begin{array}{l} n \\ x \end{array}) {(1 - p)}^{(n - x)} p^{x}$ ], n_k = x_k + y_k is assumed fixed, and p_e is the averaged sequencing error rate for each SNP (p_e was obtained from Wang et al. 2008). The binary variable θ is defined as follows $θ = {\begin{cases} 0 Only X can be expressed due to imprinting \\ 1 Only Y can be expressed due to imprinting \end{cases}$ . Since X and Y have an equal chance to be inherited from either maternal or paternal genome, X and Y have an equal chance to be expressed in imprinted genes. Hence, $\Pr (θ) = {\begin{cases} 1 / 2 θ = 0 \\ 1 / 2 θ = 1 \end{cases}$ .

Similarly for the likelihood function Pr(data|NI), we have:

\Pr (d a t a | N I) = \prod_{k = 1}^{n} \Pr (x_{k}, y_{k} | N I) (n \geq 1)

(6)

\begin{array}{l} = \prod_{k = 1}^{n} [\Pr (x_{k}, y_{k} | N I, δ = 1) \times p^{2} + \Pr (x_{k}, y_{k} | N I, δ = 2) \\ \times 2 p q + \Pr (x_{k}, y_{k} | N I, δ = 3) \times q^{2}] \end{array}

\Pr (x_{k}, y_{k} | N I, δ = 1) = f (y_{k}; n_{k}, p_{e})

(7)

\Pr (x_{k}, y_{k} | N I, δ = 3) = f (x_{k}; n_{k}, p_{e})

(8)

Based on our three assumptions (i) – (iii), Pr(x_k, y_k|NI, δ = 2) follows a binomial distribution with p=0.5. Therefore, we have:

\Pr (x_{k}, y_{k} | N I, δ = 2) = (\begin{array}{l} x_{k} + y_{k} \\ x_{k} \end{array}) {(\frac{1}{2})}^{x_{k} + y_{k}}

(9)

Computation is performed separately for each single SNP. Therefore, a posterior probability of imprinting for a gene is associated with a specific SNP in this gene, unless otherwise specified.

Simulation-based model analysis

We generated allelic counts from simulated data by taking into account imprinting status, SNP frequency, and sequencing error. We then applied dsPIG to estimate the sensitivity, specificity and the FDR. We generated two sets of allelic counts under the assumption that the locus was either imprinted or not. The number of reads for each allele was generated assuming the presence of one dominating allele plus sequencing error for imprinted case; and presence of equal amount of two alleles plus sequencing error for non-imprinted case. The generated allelic counts followed a distribution similar to the actual mRNA-Seq data (Additional file 1 Figure S1). More details of the procedure were illustrated in Additional file 2 Figure S2. Given an allele frequency (0.5), a sequencing error (0.02) and a prior (0.01) of imprinting, the posterior probability calculated from the allelic counts generated for imprinted genes approached 1 as the sample size increased, while concomitantly the posterior probability for biallelically expressed approached 0 (Figure 1a,b; for each SNP, only the tissue samples with allelic counts >2 were considered valid samples, and we used “sample size” to refer to the number of valid samples in this study). With minor allele frequencies between 0.005 to 0.5, sequencing errors between 0.01 to 0.05 and priors between 0.005 to 0.02, we obtained similar results as Figure 1a,b. Using 0.2 as the cut-off for posteriors Pr(I | data) (see Model Analysis Based on Independent Test Sets), sensitivity of our model-based prediction exceeded 99.9% when sample size was >9, and specificity exceeded 99.99% when sample size was >18 (Figure 1c). Under the same cut-off and the allele frequency, the FDR of predicted imprinted genes decreased as sample size increased: it dropped to 5% and 1% as sample size exceeded 20 and 25, respectively (Figure 1d). Using the same cut-off (0.2), we also examined how FDR varied when minor allele frequency changed from 0.1 to 0.5 and sample size increased from 1 to 50 (Figure 2). Based on these simulations, we were able to provide estimated FDRs for most of our predictions in this study (Additional file 3 Table S1).

Predictions of imprinted genes

We collected two previously published mRNA-Seq data sets. One set included 9 diverse tissue samples (Group I) from Wang et al. 2008, and the other set had 20 cerebellum cortex samples (Group II) from Mudge et al. 2008 (Table 1; see Data Collection in Materials and Methods) [38, 39]. Wang et al. 2008 showed that, in terms of alternative isoform expression, cerebellum tissues were clustered together and the 9 diverse samples were more closely correlated. Here we performed hierarchical clustering based on the imprinting-inclined SNPs (i.e., SNPs with posteriors >0.01) and obtained similar results (Figure 3; see Sample Clustering in Materials and Methods). As we expected, samples from the cerebellar cortex were clustered together, with Caucasian and African American separated in two sub-clusters (Figure 3a). Using Caucasian allele frequency on African American samples, however, yielded a sub-cluster without separation between the two ethnicities (Figure 3b). This suggests that the separation observed in Figure 3a was due to differences in minor allele frequencies. As a test set, 3 breast cancer cell line samples were clustered together in both cases. Compared with other non-cerebellum samples, the brain sample had higher correlation with cerebellum samples in both cases, which is sensible biologically. The result that the 9 diverse tissue samples were clustered together could be caused by many factors such as different experimental conditions between Group I and Group II samples.

Table 1 Sample information and sequencing results from 9 various normal tissues (Group I) and 14 cerebellar cortices with schizophrenia (SCZD) and 6 normal cerebellar cortices (Group II)

Full size table

Using dsPIG, we predicted imprinted genes for Group I and Group II separately. To call a gene imprinted, we checked the posteriors of all the SNPs on the same gene to make sure that (i) the gene had at least one SNP with a posterior >0.2, which was the same cut-off used in simulations, and (ii) all the other SNPs did not show any contradictory evidence (i.e., all the other posteriors were >0.002 [our 20,000-time simulations showed that 95% of the posteriors of imprinted genes were >0.002]). After applying dsPIG to the mRNA-Seq data and using the above criteria, we predicted 57 potentially imprinted genes for Group I samples and 94 potentially imprinted genes for Group II samples out of a total of 20,559 genes (0.28% and 0.46%, respectively) that had allele frequency data in dbSNP. The distribution of sample sizes for SNPs was shown in Additional file 5 Figure S3. We listed the top 15 predictions for both groups with respect to their posterior probabilities of imprinting in Table 2 (see Additional file 3 Table S1 for the complete list). Surprisingly, we found only 13 common genes between the 57-gene list and the 94-gene list. Functional enrichment analysis performed in Ingenuity® Pathway Analysis (IPA) showed that the two lists of genes were significantly enriched (p-value<0.05 after “BH” correction [40]) in certain Bio Functions but not in any canonical pathways defined by IPA (Additional file 6 Figure S4). We also compared our predictions with the 371 genes that are subject to random monoallelic expression [1] and found none of our predicted genes in either group overlapped with them, which further validates the quality of our predictions and strongly suggests that our predictions are not affected by random monoallelic expression.

Table 2 The top 30 predictions of imprinted genes based on mRNA-Seq data from Group I and Group II samples

Full size table

Model analysis based on independent test sets

We collected 66 validated imprinted genes and 155 validated non-imprinted genes ([31, 41]; see Data Collection) and used them to check for false-positive and false-negative predictions made by dsPIG. Based on the mRNA-Seq data of Group I samples, 28 of the 66 imprinted genes and 119 of the 155 non-imprinted genes had allelic counts for known SNPs; based on the mRNA-Seq data of Group II samples, 26 of the 66 imprinted genes and 110 of the 155 non-imprinted genes had allelic counts for known SNPs. Under the same criteria used to predict imprinted genes in Table 2, dsPIG identified 2 out of 28 imprinted genes from Group I samples and 9 out of 26 imprinted genes from Group II samples, based on the mRNA-Seq data; under the same criteria, dsPIG misidentified 0 out of 119 non-imprinted genes from Group I samples and 0 out of 110 non-imprinted genes from Group II samples. Moreover, among those imprinted genes, only 8 out of 28 (Group I) and 17 out of 26 (Group II) have sample sizes >5. On the contrary, all 11 dsPIG-identified imprinted genes (2 for Group I and 9 for Group II) have sample size >5. Therefore, for imprinted genes with sample size > 5, 2 out of 8 genes and 9 out of 17 genes could be identified by dsPIG for Group I and Group II, respectively. Again, this showed that the sensitivity of dsPIG increased as sample size increased and dsPIG could probably identify more imprinted genes if the number of sequenced tissue samples further increased. This also agreed with the simulation results. Interestingly, some of the validated imprinted genes had very small posteriors (<10^-8), which indicated that they had biallelic expression (or random monoallelic expression) in certain tissues (Table 3).

Table 3 Tissues where validated imprinted genes most likely had biallelic expression

Full size table

We used the same sets of imprinted and non-imprinted genes to determine the cut-off of the posteriors used in the prediction of imprinted genes. Because most genes (~99%) are expected to be non-imprinted, the cut-off has to yield a very high specificity (>99%) so that the overall FDR of our predictions can be low enough (<50%) for further validations. After trying different cut-offs (0.1, 0.2, …, 0.9), we found 0.2 to be the most appropriate cut-off in terms of the validated gene sets because (i) increasing the cut-off from 0.2 only lowered the sensitivity while left the specificity unchanged (~0%), and (ii) decreasing the cut-off from 0.2 lowered the specificity while the sensitivity didn’t change a lot, which substantially increased the FDR of predictions (Additional file 7: Figure S5). We also showed the ROC curves for dsPIG in Additional file 7: Figure S5.

Candidates for experimental validation

We chose top 30 candidate genes from our predictions and listed them in Table 4. Except one SNP (rs#3106189), all the SNPs in Table 4 have high minor allele frequencies (>0.184), which indicates >30% chance of observing heterozygous alleles in experiments. In addition, these genes also met at least one of the following three criteria: (i) their SNPs (the 4^th column of Table 4) had a relatively low FDR (<0.3), (ii) they had multiple SNPs with posteriors > 0.2 (dsPIG calculated a posterior for each SNP in a gene), and (iii) they were located near existing imprinted genes (distance <2M base pairs) [10, 42]. These additional criteria further increased the possibility of imprinting.

Table 4 Suggested predictions for experimental validation.

Full size table

Detection of allele-preferred expression

By investigating the biallelically expressed genes identified in the mRNA-Seq data, we found that at least 18 genes expressed significantly more transcripts from one specific allele than the other among various individuals and tissues (P < 0.05 by binomial test; Table 5; see Binomial Test in Materials and Methods). This indicated that the difference between expression levels of the two alleles was not caused by sequencing errors or stochastic effects in RT-PCR. In future, as more mRNA-Seq data are generated, if more genes with one specific allele always under-expressed are observed, we would speculate that a sophisticated mechanism (such as nonsense-mediated mRNA decay [43]) may exist to explain this type of allelic preference in gene expression for biallelically expressed genes.

Table 5 20 SNPs of which one specific allele had a higher transcript level than the other one among various tissues and individuals

Full size table

Web-based service and R package for dsPIG

We have provided a web-based service for dsPIG at http://www.shoudanliang.com/dsPIG/. Users need to upload either mapped mRNA-Seq data in the supported format or processed data files containing allelic counts for each SNP (see the website for more details). After uploading the data, users may set the values for (i) the cut-off for posteriors, (ii) the average sequencing error rate, (iii) the prior of imprinting, and (iv) QS (quality score). In addition, users need to specify the human genome build and the SNP build for dsPIG. Our server will calculate the posterior probabilities of imprinting for each gene and email the results back to the users. In the final submission form, users may request additional analysis, such as suggesting tissues where known imprinted genes most likely have biallelic expression. In addition, we have made an R package for dsPIG, which is available both on the website and in the supplementary materials (Additional file 8 for UNIX and Additional file 9 for Windows). The instruction and the sample files for this R package are in Additional file 10. The web service and the R package generate the same result on the predictions of imprinted genes. For reference, we have also provided the annotated code for dsPIG (including both the R code and the C code) used in this study as Additional file 11 and on the website.

Discussion

dsPIG operates under the assumption that if a gene is biallelically expressed, the transcript levels of two alleles are the same. However, because imbalance of allelic expression has been widely detected in human tissues [27, 44–46], our assumption may not always be correct. Indeed, within our 29-tissue samples, allelic imbalance was observed even after excluding possible expression from imprinted genes (data not shown here). However, our simulation studies showed that our model can still distinguish imprinted genes from non-imprinted ones in most situations (Figure 4), unless one specific allele is always expressed at <13% of the other allele’s expression level across different samples (no supporting literature for this yet). Stochastic RT-PCR amplification [4] is not of particular concern in dsPIG because this has been taken into account as allelic imbalances.

dsPIG is sensitive to biallelic expression and unlikely to falsely predict imprinted genes. A gene will get a very low posterior probability of imprinting when it obviously has biallelic expression in one tissue, even if only one allele of this gene is observed in the transcriptomes from all other tissues (Additional file 12: Figure S6). For example, a gene may have monoallelic expression because of allele-specific differences caused by either heterozygous SNPs or somatic mutations in the promoter region of the gene; as long as monoallelic expression caused by these conditions is not present population-wide and a large number (e.g., >25 in our study) of independent tissue samples is used in dsPIG, these genes will not be falsely predicted as imprinted genes. However, for the same reason, if transcriptomes are collected from various tissues, it becomes very hard for dsPIG to detect tissue-specific imprinted genes. This partially explains why dsPIG predicted much less imprinted genes in Group I samples than in Group II samples (another reason is that Group II has more samples than Group I).

One advantage of dsPIG is that it is able to predict imprinted genes without sequencing the genotype. Although a homozygous allele will also lead to identification of only one allele in the transcriptome, the result will not elevate the posterior belief of imprinting in our model. This is very important and practical for human because we no longer need to sequence the genome for genotypes. However, a disadvantage is that dsPIG cannot tell the parent of origin for predicted imprinted genes, which can be verified only by other studies.

One obvious limit of dsPIG is that it was modeled based on single SNPs. This means, if a gene has more than 1 SNP site in its exons, it may have different single-SNP-based allelic counts and thus different posterior probabilities from dsPIG. Therefore, one single strong posterior cannot determine the imprinting status of this gene. Instead, we have to look into all SNP sites of each gene and make sure no contradictory posterior exists for our predictions (as stated in results). In future, a possible improvement would be integration of all SNP information of a gene and calculate a single posterior to predict imprinting status.

Conclusions

In this paper, we proposed a new method – dsPIG, applicable to all mammals with genomic imprinting, to predict imprinted genes based on mRNA-Seq data of various independent tissues. With enough sequenced samples, dsPIG is capable of predicting imprinted genes on a genome-wide scale with expected low FDRs. The power of dsPIG will be further enhanced after more data generated by mRNA-Seq technology become available in the near future.

Methods

Data collection

To predict gene imprinting status, we used ~650 million short reads from 29 human tissue samples [38, 39]: 206 million from 9 different normal tissues, 320 million from 14 cerebellar cortices with schizophrenia and 124 million from 6 normal cerebellar cortices. Tissue samples were collected from 29 independent individuals, among which 20 were Caucasian and 9 were African American (Table 1). We downloaded SNP data from UCSC genome browser using the table function [47] and used SNPs with only two alleles to make predictions in this study. The source of the original data was SNP Database (dbSNP) build 129 from NCBI [37]. Known imprinted and non-imprinted genes were collected from Additional file 3: Table S1 and Additional file 4: Table S2 in the supplementary research data of Luedi et al. (2007) [31] and from Catalogue of Parent of Origin Effects (http://igc.otago.ac.nz/home.html) [41].

Calculation of allelic counts

RNA-Seq reads were mapped to human genome hg18 from UCSC genome browser using Eland (GAPipeline-1.3.2). The unmapped reads were mapped to the exon-exon junctions downloaded from http://genes.mit.edu/burgelab/mrna-seq/[38]. The junctions contain 56 (28×2) base pairs in total, allowing the reads (32 base pairs) to be mapped with a minimum of 4 base pairs on each side of the junctions [38]. To compute the number of alleles for each SNP, we scanned each mapped tag for all known SNPs (in terms of dbSNP) and counted the number of times each nucleotide occurred at each SNP position in each sample. To reduce the amount of calculation, we only retained SNPs that were covered by any sequencing tag in any sample. This generated allelic counts for a total of 1,261,906 SNPs in the 29 tissue samples. We then discarded SNPs with unknown frequency and very low allelic counts (i.e., total allelic counts <3 in each sample). In addition, we defined a quality score-QS (see Definition of QS) and discarded SNPs with QS <0.9 in all 29 samples (see Additional file 13: Figure S7). After these steps, allelic counts for 82,916 SNPs remained, and these were used in dsPIG.

Definition of QS

We only used biallelic SNPs in dbSNP, which is the majority of SNPs. Nucleotides observed other than these two (alleles) were considered as sequencing errors due to low sequencing quality at the SNP site, or indicted that the allelic information of the SNP was wrong in dbSNP. Thus, we defined quality score-QS to improve the quality of the SNPs used in our predictions:

{QS}_{k} = \frac{x_{k} + y_{k}}{x_{k} + y_{k} + e_{k}}

In the above equation, x_k and y_k are the allelic counts of allele X and Y in the kth sample (X and Y were determined by dsSNP), and e_k is the count of additional nucleotide(s). QS was calculated for each SNP in each sample. We arbitrarily chose 0.9 as a cut-off for QS ( Additional file 13: Figure S7) and only those SNPs with QS>0.9 were used in dsPIG.

Sample clustering

By applying dsPIG to each of the 29 samples, we obtained 29 lists of posterior probabilities (each list had 87,852 posteriors for 87,852 SNPs), which were first multiplied by 100 and then natural logarithm transformed. Thus, if the posterior was the same as the prior (0.01), it would become 0 after the transform. After that, for each SNP, all posteriors ≤ 0 were modified to 0, while all posteriors > 0 were kept at the same value. If all 29 posteriors for a SNP were 0, the SNP was removed from the 29 lists. By doing this, only the SNPs that showed an increased probability of imprinting were kept for clustering. We then computed spearman correlations between samples based on the remaining 29 lists of posteriors, used these correlation values to determine distance between samples, and performed hierarchical clustering in R (http://www.r-project.org). By using this method, we clustered samples in terms of imprinting-inclined SNPs, and thus reduced the influence of biallelically expressed genes.

Binomial test

For a single SNP, we first defined $P_{k} = \frac{x_{k}}{y_{k}}$ , where x_k and y_k are the allelic counts for its alleles X and Y in the sample k (k=1, …, n). We then defined $z = \sum_{k = 1}^{n} I (P_{k})$ , where $I (P_{k}) = {\begin{cases} 1, if P_{k} > 1 \\ 0, if P_{k} < 1 \end{cases}$ . We used the following criteria to find the SNPs with both alleles (XY) expressed in the sample k: x_k>10, y_k>10, x_k+y_k>50, x_k/y_k<10 and y_k/x_k<10. The criteria made it very unlikely to observe biallelic expression because of sequencing error. For each SNP that met the criteria in n samples, if no allele was preferably expressed by the transcription machinery, z should follow a binomial distribution f(z; n, p) with p=0.5. To obtain enough testing power, SNPs with n ≥ 8 were deemed qualified for the binomial test. Under the above criteria, we found 24 qualified SNPs and listed all 20 significant testing results in Table 5.

References

Gimelbrant AA, Hutchinson JN, Thompson BR, Chess A: Widespread monoallelic expression on human autosomes. Science 2007, 318: 1136–1140. 10.1126/science.1148910
Article CAS Google Scholar
Lyon MF: X chromosomes and dosage compensation. Nature 1986, 320: 313.
Article CAS Google Scholar
Pernis B, Chiappino G, Kelus AS, Gell PG: Cellular localization of immunoglobulins with different allotypic specificities in rabbit lymphoid tissues. J Exp Med 1965, 122: 853–876. 10.1084/jem.122.5.853
Article PubMed Central CAS Google Scholar
Chess A, Simon I, Cedar H, Axel R: Allelic inactivation regulates olfactory receptor gene expression. Cell 1994, 78: 823–834. 10.1016/S0092-8674(94)90562-2
Article CAS Google Scholar
Rajewsky K: Clonal selection and learning in the antibody system. Nature 1996, 381: 751–758. 10.1038/381751a0
Article CAS Google Scholar
Hollander GA, Zuklys S, Morel C, Mizoguchi E, Mobisson K, Simpson S, Terhorst C, Wishart W, Golan DE, Bhan AK, Burakoff SJ: Monoallelic expression of the interleukin-2 locus. Science 1998, 279: 2118–2121. 10.1126/science.279.5359.2118
Article CAS Google Scholar
Bix M, Locksley RM: Independent and epigenetic regulation of the interleukin-4 alleles in CD4+ T Cells. Science 1998, 281: 1352–1354.
Article CAS Google Scholar
Rhoades KL, Singh N, Simon I, Glidden B, Cedar H, Chess A: Allele-specific expression patterns of interleukin-2 and Pax-5 revealed by a sensitive single-cell RT-PCR analysis. Curr Biol 2000, 10: 789–792. 10.1016/S0960-9822(00)00565-0
Article CAS Google Scholar
Gimelbrant AA, Ensminger AW, Qi P, Zucher J, Chess A: Monoallelic Expression and Asynchronous Replication of p120 Catenin in Mouse and Human Cells. J Biol Chem 2005, 280: 1354–1359.
Article CAS Google Scholar
Reik W, Walter J: Genomic imprinting: parental influence on the genome. Nat Rev Genet 2001, 2: 21–32.
Article CAS Google Scholar
Ferguson-Smith AC, Surani MA: Imprinting and the epigenetic asymmetry between parental genomes. Science 2001, 293: 1086–1089. 10.1126/science.1064020
Article CAS Google Scholar
Morison IM, Ramsay JP, Spencer HG: A census of mammalian imprinting. Trends Genet 2005, 21: 457–465. 10.1016/j.tig.2005.06.008
Article CAS Google Scholar
Zhang TY, Meaney MJ: Epigenetics and the environmental regulation of the genome and its function. Annu Rev Psychol 2010, 61: 439–466. 10.1146/annurev.psych.60.110707.163625
Article Google Scholar
Constância M, Pickard B, Kelsey G, Reik W: Imprinting mechanisms. Genome Res 1998, 8: 881–900.
Google Scholar
Tycko B, Morison IM: Physiological functions of imprinted genes. J Cell Physiol 2002, 192: 245–258. 10.1002/jcp.10129
Article CAS Google Scholar
Isles AR, Holland AJ: Imprinted genes and mother-offspring interactions. Early Hum Dev 2005, 81: 73–77. 10.1016/j.earlhumdev.2004.10.006
Article Google Scholar
Caron H, van Sluis P, van Hoeve M, de Kraker J, Bras J, Slater R, Mannens M, Voute PA, Westerveld A, Versteeg R: Allelic loss of chromosome 1p36 in neuroblastoma is preferential maternal origin and correlates with N-myc amplification. Nat Genet 1993, 4: 187–190. 10.1038/ng0693-187
Article CAS Google Scholar
Moulton T, Chung WY, Yuan L, Hensle T, Waber P, Nisen P, Tycko B: Genomic imprinting in Wilms’ tumor. Med Ped Oncol 1996, 27: 476–483. 10.1002/(SICI)1096-911X(199611)27:5<476::AID-MPO15>3.0.CO;2-8
Article CAS Google Scholar
Bartolomei MS, Tilghman SM: Genomic imprinting in mammals. Annu Rev Genet 1997, 32: 493–525.
Article Google Scholar
Nicholls RD, Saitoh S, Horsthemke B: Imprinting in Prader-willi and Angelman syndromes. Trends Genet 1998, 14: 194–200. 10.1016/S0168-9525(98)01432-2
Article CAS Google Scholar
Waterland RA, Jirtle RL: Early nutrition, epigenetic changes at transposons and imprinted genes, and enhanced susceptibility to adult chronic diseases. Nutrition 2004, 20: 63–68. 10.1016/j.nut.2003.09.011
Article CAS Google Scholar
Jirtle RL, Skinner MK: Environmental epigenomics and disease susceptibility. Nat Rev Genet 2007, 8: 253–262. 10.1038/nrg2045
Article CAS Google Scholar
Falls JG, Pulford DJ, Wylie AA, Jirtle RL: Genomic imprinting: implications for human disease. Am J Pathol 1999, 154: 635–647. 10.1016/S0002-9440(10)65309-6
Article PubMed Central CAS Google Scholar
Murphy SK, Jirtle RL: Imprinting evolution and the price of silence. Bioessays 2003, 25: 577–588. 10.1002/bies.10277
Article CAS Google Scholar
Wilkinson LS, Davies W, Isles AR: Genomic imprinting effects on brain development and function. Nat Rev Neurosci 2007, 8: 832–843. 10.1038/nrn2235
Article CAS Google Scholar
Nikaido I, Saito C, Mizuno Y, Meguro M, Bono H, Kadomura M, Kono T, Morris GA, Lyons PA, Oshimura M, et al.: Discovery of imprinted transcripts in the mouse transcriptome using large-scale expression profiling. Genome Res 2003, 13: 1402–1409. 10.1101/gr.1055303
Article PubMed Central CAS Google Scholar
Lo HS, Wang Z, Hu Y, Yang HH, Gere S, Buetow KH, Lee MP: Allelic variation in gene expression is common in the human genome. Genome Res 2003, 13: 1855–1862.
Article PubMed Central CAS Google Scholar
Luedi PP, Hartemink AJ, Jirtle RL: Genome-wide prediction of imprinted murine genes. Genome Res 2005, 15: 875–884. 10.1101/gr.3303505
Article PubMed Central CAS Google Scholar
Pant PV, Tao H, Beilharz EJ, Ballinger DG, Cox DR, Frazer KA: Analysis of allelic differential expression in human white blood cells. Genome Res 2006, 16: 331–339. 10.1101/gr.4559106
Article PubMed Central CAS Google Scholar
Ruf N, Dunzinger U, Brinckmann A, Haaf T, Nurnberg P, Zechner U: Expression profiling of uniparental mouse embryos is inefficient in identifying novel imprinted genes. Genomics 2006, 87: 509–519. 10.1016/j.ygeno.2005.12.007
Article CAS Google Scholar
Luedi PP, Dietrich FS, Weidman JR, Bosko JM, Jirtle RL, Hartemink AJ: Computational and experimental identification of novel human imprinted genes. Genome Res 2007, 17: 1723–1730. 10.1101/gr.6584707
Article PubMed Central CAS Google Scholar
Mardis ER: The impact of next generation sequencing technology on genetics. Trends Genet 2008, 24: 133–141. 10.1016/j.tig.2007.12.007
Article CAS Google Scholar
Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 2009, 10: 669–680.
Article PubMed Central CAS Google Scholar
Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10: 57–63. 10.1038/nrg2484
Article PubMed Central CAS Google Scholar
Babak T, DeVeale B, Armour C, Raymond C, Cleary MA, van der Kooy D, Johnson JM, Lim LP: Global survey of genomic imprinting by transcriptome sequencing. Curr Biol 2008, 18: 1735–1741. 10.1016/j.cub.2008.09.044
Article CAS Google Scholar
Watanabe DB, Barlow DP: Random and imprinted monoallelic expression. Genes Cells 1996, 1: 795–802. 10.1046/j.1365-2443.1996.d01-276.x
Article CAS Google Scholar
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM: Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29: 308–311. 10.1093/nar/29.1.308
Article PubMed Central CAS Google Scholar
Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature 2008, 456: 470–476. 10.1038/nature07509
Article PubMed Central CAS Google Scholar
Mudge J, Miller NA, Khrebtukova I, Lindquist IE, May GD, Huntley JJ, Luo S, Zhang L, van Velkinburgh JC, Farmer AD, et al.: Genomic Convergence Analysis of Schizophrenia: mRNA sequencing reveals altered synaptic vesicular transport in post-mortem cerebellum. PLoS One 2008, 3: e3625. 10.1371/journal.pone.0003625
Article PubMed Central Google Scholar
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 1995, 57: 289–300.
Google Scholar
Morison IM, Paton CJ, Cleverley SD: The imprinted gene and parent-of-origin effect database. Nucleic Acids Res 2001, 29: 275–276. 10.1093/nar/29.1.275
Article PubMed Central CAS Google Scholar
Lewis A, Reik W: How imprinting centres work. Cytogenet Genome Res 2006, 113: 1–4.
Article Google Scholar
Willing MC, Deschenes SP, Slayton RL, Roberts EJ: Premature chain termination is a unifying mechanism for COL1A1 null alleles in osteogenesis imperfecta type I cell strains. Am J Hum Genet 1996, 59: 799–809.
PubMed Central CAS Google Scholar
Yan H, Yuan W, Velculescu VE, Vogelstein B, Kinzler KW: Allelic variation in human gene expression. Science 2002, 297: 1143. 10.1126/science.1072545
Article CAS Google Scholar
Pastinen T, Sladek R, Gurd S, Sammak A, Ge B, Lepage P, Lavergne K, Villeneuve A, Gaudin T, Brandstrom H, et al.: A survey of genetic and epigenetic variation affecting human gene expression. Physiol Genomics 2004, 16: 184–193.
Article CAS Google Scholar
Pastinen T, Hudson TJ: Cis-acting regulatory variation in the human genome. Science 2004, 306: 647. 10.1126/science.1101659
Article CAS Google Scholar
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res 2004, 32: D493-D496. 10.1093/nar/gkh103
Article PubMed Central CAS Google Scholar

Download references

Acknowledgements

We thank Kevin Coombes, Li Zhang, Pan Tong and Zhifeng Shao for thoughtful discussions and consistent help.

This research was funded by a training fellowship from the Keck Center for Quantitative Biomedical Sciences of the Gulf Coast Consortia, on the Computational Cancer Biology Training Program from the Cancer Prevention & Research Institute of Texas (CPRIT No. RP101489).

Author information

Authors and Affiliations

Shanghai Center for Systems Biomedicine, Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Jiao Tong University, Shanghai, 200240, China
Hua Li
Department of Stem Cell Transplantation and Cellular Therapy, The University of Texas M D Anderson Cancer Center, Houston, TX, 77030, USA
Hua Li & Jeffrey J Molldrem
Division of Biostatistics, The University of Texas School of Public Health at Houston, Houston, TX, 77030, USA
Xiao Su
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
Juan Gallegos
Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
Yue Lu
Center for Clinical and Research Informatics, NorthShore University HealthSystem, Chicago, Il, 60201, USA
Yuan Ji
Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
Shoudan Liang

Authors

Hua Li
View author publications
You can also search for this author in PubMed Google Scholar
Xiao Su
View author publications
You can also search for this author in PubMed Google Scholar
Juan Gallegos
View author publications
You can also search for this author in PubMed Google Scholar
Yue Lu
View author publications
You can also search for this author in PubMed Google Scholar
Yuan Ji
View author publications
You can also search for this author in PubMed Google Scholar
Jeffrey J Molldrem
View author publications
You can also search for this author in PubMed Google Scholar
Shoudan Liang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Shoudan Liang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

HL and SL developed the model of dsPIG. HL carried out the analysis. XS built the R package. JG built the website. YL performed sequence alignment. YJ improved the model. HL and SL wrote the manuscript. SL conceived and directed this study. JJM co-directed this study. All authors read and approved the final manuscript.

Xiao Su, Juan Gallegos contributed equally to this work.

Electronic supplementary material

12859_2012_5445_MOESM1_ESM.tiff

Additional file 1:Figure S1. Distribution of simulation-generated allelic counts vs. observed distribution in real data. Red line stands for the generated distribution; black line stands for the observed distribution. The red text and black text in the upper right green box are summarized statistics for red line and black line, respectively (TIFF 148 KB)

12859_2012_5445_MOESM2_ESM.tiff

Additional file 2:Figure S2. Flowchart showing steps in data simulation and model assessment. In step 2, the differences in data generation are caused by two factors: (i) imprinted genes need to express only one allele at a tissue level while non-imprinted genes don’t, (ii) two alleles expressed from non-imprinted genes need to be sequenced in RNA-Seq with an equal probability, while imprinted genes only have one allele expressed. In this step we also need to assume that sequencing error leads to misread of one nucleotide to the other three with an equal probability. RT-PCR amplification is not shown in the process because we assume that it amplifies both alleles synchronously (for details, see Discussion) (TIFF 373 KB)

12859_2012_5445_MOESM3_ESM.doc

Additional file 3:Table S1. The predicted imprinted genes based on mRNA-Seq data from Group I and Group II samples. Abbreviations: rs#-SNP identification number, Chr-chromosome, Str-strand, SS-sample size. “NA” in the “FDR” column means the FDR could not be estimated based on our 20,000-time simulations (DOC 282 KB)

12859_2012_5445_MOESM4_ESM.doc

Additional file 4:Table S2. The FDR values with respect to different sample sizes and allele frequencies. “ NA” means FDR could not be estimated based on our 20,000-time simulations (DOC 78 KB)

12859_2012_5445_MOESM5_ESM.png

Additional file 5:Figure S3. Distribution of SNPs’ sample sizes in Group I (from 9 diverse tissue samples) and Group II (from 20 cerebellum samples). Group I and Group II had a total of 44007 SNPs and 66294 SNPs with sample size >0, respectively (PNG 33 KB)

12859_2012_5445_MOESM6_ESM.pdf

Additional file 6:Figure S4. Functional enrichment analysis of the 57 genes and the 94 genes in Ingenuity® Pathway Analysis. (a) Comparison of enrichment in Bio Functions between the two gene lists. (b) Comparison of enrichment in Canonical Pathways between the two gene lists. The height of each bar in (a) and (b) represents the logartihm (10-based) transformed p-values calculated from Fisher’s exact test. In (a), the horizontal yellow line is the threshold [i.e., -log₁₀(0.05)] above which bars (p-values) were considered significant; in (b), there is no bar above the threshold which is not shown here. (Figure S4 is located in a separate PDF file: “Fig. S4.pdf”.) (PDF 566 KB)

12859_2012_5445_MOESM7_ESM.png

Additional file 7:Figure S5. Sensitivity and specificity analysis of dsPIG based on the genes with known patterns of allelic expression. (a) Sensitivity analysis based on the validated imprinted genes. (b) Specificity analysis based on the validated non-imprinted genes. In (a) and (b), the solid black lines, which showed the numbers of genes identified by dsPIG as “imprinted”, were based on the mRNA-Seq data of Group I samples, and the dotted black lines were based on Group II samples; the red line is the cut-off (0.2) used in this study to predict imprinted genes. (c) ROC curve for dsPIG based on Group I samples. (d) ROC curve for dsPIG based on Group II samples (PNG 236 KB)

Additional file 8:R package (dsPIG, version 3.0) for UNIX).(GZ 1 MB)

Additional file 9:R package (dsPIG, version 3.0) for Windows).(ZIP 2 MB)

Additional file 10:The instruction and the sample files for the R package of dsPIG.(ZIP 164 KB)

Additional file 11:The annotated R code and C code for dsPIG used in our study.(ZIP 11 KB)

12859_2012_5445_MOESM12_ESM.png

Additional file 12:Figure S6. Simulated (log-transformed) posteriors of genes with biallelic expression in only one sample. Each positive integer (x) on the x-axis (“Sample size”) includes two parts: 1 sample of biallelic expression and (x-1) samples of imprinted expression. Posteriors were calculated by dsPIG. The dashed line stands for the log-transformed prior (0.01). This result was based on 20,000-time simulations with geometric mean as the method of averaging posteriors (PNG 4 KB)

12859_2012_5445_MOESM13_ESM.png

Additional file 13:Figure S7. Distributions of QS in 32 samples (including 3 breast cancer cell line samples). The x-axis is the number of sequencing tags that covered the SNP site, and the y-axis is the QS. Tissue names are located at the lower right side of each plot, where “Cancer” stands for “breast cancer cell line sample” and “C.” stands for “cerebellum sample”. Dashed lines in the 32 samples represent the cut-off (0.9) for QS (PNG 407 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Li, H., Su, X., Gallegos, J. et al. dsPIG: a tool to predict imprinted genes from the deep sequencing of whole transcriptomes. BMC Bioinformatics 13, 271 (2012). https://doi.org/10.1186/1471-2105-13-271

Download citation

Received: 15 March 2012
Accepted: 28 September 2012
Published: 19 October 2012
DOI: https://doi.org/10.1186/1471-2105-13-271

dsPIG: a tool to predict imprinted genes from the deep sequencing of whole transcriptomes

Abstract

Background

Results

Conclusion

Background

Results

Statistical model development

Simulation-based model analysis

Predictions of imprinted genes

Model analysis based on independent test sets

Candidates for experimental validation

Detection of allele-preferred expression

Web-based service and R package for dsPIG

Discussion

Conclusions

Methods

Data collection

Calculation of allelic counts

Definition of QS

Sample clustering

Binomial test

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us