l1kdeconv: an R package for peak calling analysis with LINCS L1000 data
© The Author(s). 2017
Received: 28 April 2017
Accepted: 19 July 2017
Published: 27 July 2017
LINCS L1000 is a high-throughput technology that allows gene expression measurement in a large number of assays. However, to fit the measurements of ~1000 genes in the ~500 color channels of LINCS L1000, every two landmark genes are designed to share a single channel. Thus, a deconvolution step is required to infer the expression values of each gene. Any errors in this step can be propagated adversely to the downstream analyses.
We presented a LINCS L1000 data peak calling R package l1kdeconv based on a new outlier detection method and an aggregate Gaussian mixture model (AGMM). Upon the remove of outliers and the borrowing information among similar samples, l1kdeconv showed more stable and better performance than methods commonly used in LINCS L1000 data deconvolution.
Based on the benchmark using both simulated data and real data, the l1kdeconv package achieved more stable results than the commonly used LINCS L1000 data deconvolution methods.
The NIH Common Fund’s Library of Integrated Network-based Cellular Signatures (LINCS) is a rich collection of gene expression data from a variety of human cell lines perturbed with a large battery of drugs and small molecules [7, 12]. So far, data generated by LINCS L1000 technology comprise over 1,000,000 gene expression profiles from 42,553 perturbagens applied to as many as 77 cell lines. A total of 19,811 of the perturbagens (including over 2000 FDA approved & clinical trial drugs) are small molecule compounds applied at different time points and doses. The other perturbagens are genetic perturbations, including knockdown and overexpression of well-selected 4372 genes. In general, triplicated (or more) measurements were performed for each perturbergen, leading to a total of over 400,000 gene expression signatures generated by this technology . Different from other genome wide expression measurement technologies such as microarray and RNA-Seq, L1000 is used in LINCS and only measures about 1000 selected “landmark” genes with only about 500 distinct bead colors  so that every two “landmark” genes share a color. Despite this reduction greatly increasing the throughput of the perturbed assays that can be performed simultaneously and driving down the experimental cost, it induces an additional deconvolution analysis step that is crucial for the accuracy of the measure of the gene expression levels [8, 9].
This peak deconvolution step is necessary because each bead color is associated with two genes rather than one. For each color, deconvolution is supposed to yield a distribution that generally consists of two peaks. The default approach provided by LINCS is the k-medians clustering algorithm which partitions all the data into two distinct components, and the median expression values are assigned as the expression value of the two gene of a color. Liu et al. [8, 9] proposed another deconvolution approach that was based on naïve Gaussian mixture model (GMM) with only two components, and the expression values were inferred from the component means. However, although the two genes with the same color were selected so that their intensity differences are as much as possible, there still exist many cases in which the expressions of a pair of genes are intermixed, making it difficult to infer the expression values by either k-medians or naïve GMM . Thus, it is critical to develop a new method to improve the deconvolution accuracy.
Here, we presented an R package, l1kdeconv, that implements a new peak-calling algorithm based on an aggregate Gaussian mixture model (AGMM) along with an outlier detection method for improving the robustness of AGMM. AGMM is based on the Gaussian mixture model and borrows information from the samples of the same condition to improve the peaking calling accuracy.
Outlier detection in LINCS L1000 data
Gaussian mixture models based on clustering analysis methods in general are sensitive to outliers . To improve the clustering accuracy, we first developed an outlier detection method to remove the outliers before peak calling.
Aggregate Gaussian mixture model
To accurately detect the expression values of a pair of genes, we introduced a novel peak calling method to borrow the information of the samples at the same condition (e.g., the same cell line, the same perturbation and the same treatment duration). This method is an extension of naïve GMM with a constraint that the order of the gene expression values of two genes of the same color is consistent across these similar samples.
Because the two genes assigned to a color were selected in a way such that their intensity difference was maximized [8, 9], this difference between the two genes can be assumed to be in the same direction across the similar samples. In other words, of any given color, the mean intensity of one gene was always greater than the mean intensity of the other gene. This assumption suggested that the information borrowing among the similar samples will make the mean intensity estimates more robust.
Brief overview of the package
The l1kdeconv package contains a novel peak calling algorithm for LINCS L1000 data, using aggregate Gaussian mixture model, an extension of the naïve GMM. Rather than fitting the bead intensities of each color in a single sample using two-component Gaussian mixture model separately, AGMM calls the two peaks of each color across similar samples simultaneously. In addition, the outlier detection method included in l1kdeconv can also enhance the clustering performance of AGMM. Figure 2 shows the framework of the outlier detection and AGMM approach.
The parameters of AGMM can be estimated using the R function optim() by the method of maximum likelihood. The L-BFGS-B method [1, 13] is used to add box constrains. To ensure the convergence of the optimization, k-means is used to compute the rough estimates of the mean intensities, which are used as initial values for the optimization. To make the package user-friendly, detailed help pages and running examples are provided in the package.
The construction of simulation dataset
To make a realistic evaluation of the performance of l1kdeconv, a simulated dataset was created with the key characteristics of LINCS L1000 data using a hierarchical model as described below. First, the empirical true expression values of each pair of genes were extracted by k-medians using the LINCS L1000 data of the A375 cell line treated with luciferase. Since the differences and averages of the empirical true expression values approximately followed a Gamma distribution and a Gaussian distribution respectively, these two distributions were used to resample the differences and the averages of the mean expression values of the two genes of any color. These differences and averages were then converted to the mean expression values of each pair of genes. The bead intensities were simulated from a two-component Gaussian mixture model with these mean expression values as the center of each Gaussian component. Then outliers were added to each color using a uniform distribution, and the number of outliers was generated from a Poisson distribution. Ten samples were created so that AGMM can borrow information across them. In each sample, there were 500 colors. In each color, there were roughly 60 beads and the ratio of the number of the beads for each gene in a color was 2:1.
Comparison of l1kdeconv with k-medians and naïve GMM on the simulated dataset
The number of the true predictions and mean absolute error of k-medians, naïve GMM, AGMM without the outlier detection and AGMM with the outlier detection. AGMM with the outlier detection made more true predictions, and its predictions have the least mean absolute error
AGMM without the outlier detection
AGMM with the outlier detection
# of true predictions
Mean absolute error
Comparison of l1kdeconv with k-medians and naïve GMM on a real dataset
The deconvolution results of k-medians, naïve GMM and AGMM with outlier detection using real data. Each pair of superscripts indicates a peak flipping case. The row index refers to the subfigure label in Fig. 5
AGMM with outlier detection
The l1kdeconv package provides a stable and accurate deconvolution algorithm for LINCS L1000 data. Because the deconvolution is the first step of the analysis of LINCS L1000 data, any significant improvement in this step will have a critical impact on the downstream analyses. The l1kdeconv package has two components --- the outlier detection and aggregated Gaussian mixture model. By filtering the outliers and borrowing information from similar samples, l1kdeconv outperforms k-medians and naïve Gaussian mixture model using the limited number of intensity values.
The package also provides detailed help pages, which makes it user friendly. Furthermore, we standardize the distribution, installation and maintenance of this package. It is available on the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org.
The study was supported by the startup funding to Peng Yu from the ECE department and Texas A&M Engineering Experiment Station/Dwight Look College of Engineering at Texas A&M University and by funding from TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering (CBGSE) at Texas A&M University.
This work was supported by startup funding to P.Y. from the ECE department and Texas A&M Engineering Experiment Station/Dwight Look College of Engineering at Texas A&M University, by funding from TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering (CBGSE) at Texas A&M University, by TEES seed grant, and by Texas A&M University-CAPES Research Grant Program. The open access publishing fees for this article have been covered in part by the Texas A&M University Open Access to Knowledge Fund (OAKFund), supported by the University Libraries and the Office of the Vice President for Research.
Availability of data and materials
The l1kdeconv package is available on the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org. The LINCS L1000 data is available at GEO with the accession number of GSE70138.
Consent to publish
PY conceived the general project and supervised it. ZL and PY were the principal developers. ZL, JL and PY wrote the manuscript. All the authors contributed with ideas, tested the software, read the final manuscript and approved it.
Ethics approval and consent to participate
The authors have no conflicts of interest to declare
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Byrd RH, et al. A Limited Memory Algorithm for Bound Constrained Optimization. Siam Journal on Scientific Computing. 1995;16(5):1190–208.View ArticleGoogle Scholar
- Diedenhofen, B. J. Musch (2015). "cocor: A Comprehensive Solution for the Statistical Comparison of Correlations (vol 10, e0121945, 2015)." PLoS One 10(6).Google Scholar
- Duan, Q., et al. (2014). "LINCS Canvas Browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures." Nucleic Acids Res 42(Web Server issue): W449-W460.Google Scholar
- El-Melegy MT. Model-wise and point-wise random sample consensus for robust regression and outlier detection. Neural Netw. 2014;59:23–35.View ArticlePubMedGoogle Scholar
- Hittner JB, et al. A Monte Carlo evaluation of tests for comparing dependent correlations. Journal of General Psychology. 2003;130(2):149–68.View ArticlePubMedGoogle Scholar
- Hyndman RJ, Koehler AB. Another look at measures of forecast accuracy. International journal of forecasting. 2006;22(4):679–88.View ArticleGoogle Scholar
- Lamb J, et al. The connectivity map: Using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006;313(5795):1929–35.View ArticlePubMedGoogle Scholar
- Liu C, et al. Compound signature detection on LINCS L1000 big data. Mol Biosyst. 2015a;11(3):714–22.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu CL, et al. Compound signature detection on LINCS L1000 big data. Molecular Biosystems. 2015b;11(3):714–22.View ArticlePubMedPubMed CentralGoogle Scholar
- Scott, D. W. (2004). Outlier detection and clustering by partial mixture modeling. COMPSTAT 2004—Proceedings in Computational Statistics, Springer.Google Scholar
- Subramanian, A., et al. (2017). "A Next Generation Connectivity Map: L1000 Platform And The First 1,000,000 Profiles." bioRxiv.Google Scholar
- Wang ZC, et al. Drug-induced adverse events prediction with the LINCS L1000 data. Bioinformatics. 2016;32(15):2338–45.View ArticlePubMedGoogle Scholar
- Xiao YH, Wei ZX. A new subspace limited memory BFGS algorithm for large-scale bound constrained optimization. Applied Mathematics and Computation. 2007;185(1):350–9.View ArticleGoogle Scholar