Fig. 2 | BMC Bioinformatics

Fig. 2

From: Robust differential expression analysis by learning discriminant boundary in multi-dimensional space of statistical attributes

The Montgomery dataset shows that real RNA-seq datasets contain complex distributions. Panels (a) Gene ID: 64928, (b) Gene ID: 11244, and (c) Gene ID: 80169 show the read-count distributions of three example genes. The white bars are the histograms of the original data. The solid red and dashed blue curves are the fitted GMM and NB distributions, respectively. See the main text for details of fitting NB and GMM to the original data. The numbers of Gaussian components are 1, 2, and 3 in (a), (b), and (c), respectively. GMM represents distributions with multiple modes better than NB does.

(d) Comparison of correlation coefficients. Y-axis: the correlation coefficient between a gene's read-count distribution and its fitted GMM distribution. X-axis: the correlation coefficient between a gene's read-count distribution and its fitted NB distribution. Each dot represents a gene. The distribution of a gene’s read counts is approximated by a histogram of 20 equal-size bins spanning the range of read-count values. The color of each dot indicates the number of components selected for the fitted GMM by the Bayesian Information Criterion: green (~44%), blue (~50%), and red (~6%) correspond to 1, 2, and 3 components, respectively. About 63.5% of genes lie above the diagonal line, indicating that their distributions are more GMM-like. The distributions of the remaining ~36.5% of genes are more NB-like.

To investigate this observation further, we calculated $N_{GMM}^{NB}$, the number of genes whose GMM fit is significantly better than their NB fit (p-value < 0.05) under the null hypothesis that the distributions of all genes are NB. If all genes are indeed governed by NBs, $N_{GMM}^{NB}$ should be close to the expected number, 11,573 × 0.05 ≈ 579. For each gene, we sampled 2000 datasets from its NB fit, each containing 60 samples. For each dataset, we fit a GMM and an NB and calculated the difference between their fitting scores (i.e., GMM fit score − NB fit score). The score differences across all datasets were collected to approximate the null distribution, from which we calculated the p-value of the score difference between the GMM and NB fits to the original samples. We obtained $N_{GMM}^{NB}$ = 2442 (≫ 579); for 1830 of these genes, the GMM has two or more components. Hence we can deduce that the distributions of a substantial number of genes are not NB-like. Similarly, we calculated $N_{NB}^{GMM}$, the number of genes whose NB fit is significantly better than their GMM fit (p-value < 0.05) under the null hypothesis that the distributions of all genes are GMM. We obtained $N_{NB}^{GMM}$ = 2431 (≫ 579), indicating that the distributions of a substantial number of genes are not GMM-like. Putting these results together, we conclude that neither NB nor GMM dominates the distributions of genes in the Montgomery dataset.
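
The resampling test described in the caption can be sketched roughly as follows. This is a minimal illustration in standard-library Python, not the authors' implementation. The function names are hypothetical, and the NB and GMM fitting steps are left as caller-supplied functions (`sample_null`, `score_diff`) because the caption defers those details to the main text. The p-value is computed here as the plain fraction of null score differences at least as large as the observed one, which is an assumption, and 11,573 is taken to be the number of genes tested.

```python
import random


def bootstrap_null_diffs(sample_null, score_diff, n_datasets=2000,
                         n_samples=60, rng=None):
    """Approximate the null distribution of a score difference.

    sample_null(n_samples, rng) -> one dataset drawn from the null model
        (hypothetical callable, e.g. a sampler for a gene's NB fit).
    score_diff(dataset) -> alternative-fit score minus null-fit score
        (hypothetical callable, e.g. GMM fit score minus NB fit score).
    """
    rng = rng or random.Random(0)
    return [score_diff(sample_null(n_samples, rng))
            for _ in range(n_datasets)]


def empirical_p_value(observed_diff, null_diffs):
    """One-sided p-value: the fraction of null score differences that are
    at least as large as the observed one (assumed definition)."""
    exceed = sum(1 for d in null_diffs if d >= observed_diff)
    return exceed / len(null_diffs)


def count_significant(p_values, alpha=0.05):
    """Number of genes whose p-value falls below the threshold."""
    return sum(1 for p in p_values if p < alpha)


def expected_false_positives(n_genes, alpha=0.05):
    """Expected count of significant genes if the null holds for all genes."""
    return n_genes * alpha


# 11573 * 0.05 = 578.65, i.e. about 579 genes expected under the null.
expected = expected_false_positives(11573, 0.05)
```

In use, one would call `bootstrap_null_diffs` once per gene, compare the gene's observed score difference against the result with `empirical_p_value`, and then compare `count_significant` over all genes with `expected_false_positives`. A count far above the expectation, as with the 2442 and 2431 reported in the caption, suggests that the null model does not describe all genes.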
