svapls: an R package to correct for hidden factors of variability in gene expression studies

Background Hidden variability is a fundamentally important issue in the context of gene expression studies. Collected tissue samples may have a wide variety of hidden effects that may alter their transcriptional landscape significantly. As a result their actual differential expression pattern can be potentially distorted, leading to inaccurate results from a genome-wide testing for the important transcripts. Results We present an R package svapls that can be used to identify several types of unknown sample-specific sources of heterogeneity in a gene expression study and adjust for them in order to provide a more accurate inference on the original expression pattern of the genes over different varieties of samples. The proposed method implements Partial Least Squares regression to extract the hidden signals of sample-specific heterogeneity in the data and uses them to find the genes that are actually correlated with the phenotype of interest. We also compare our package with three other popular softwares for testing differential gene expression along with a detailed illustration on the widely popular Golub dataset. Results from the sensitivity analyes on simulated data with widely different hidden variation patterns reveal the improved detection power of our R package compared to the other softwares along with reasonably smaller error rates. Application on the real-life dataset exhibits the efficacy of the R package in detecting potential batch effects from the dataset. Conclusions Overall, Our R package provides the user with a simplified framework for analyzing gene expression data with a wide range of hidden variation patterns and delivering a differential gene expression analysis with substantially improved power and accuracy. The R package svapls is freely available at http://cran.r-project.org/web/packages/svapls/index.html.

variation. This effect is still manageable as composition of the batches are known prior to analyses. But, numerous other factors may still exist that are not detectable from outside, but can potentially affect the subject-specific expression levels of the genes in different ways. They can in turn lead to complex latent expression structures in the entire genomic landscape of the data (e.g., confounded signals between the two groups of samples, correlated expression signals corresponding to a specific group of genes and samples affected by the hidden factors, etc.). The contributed impact of these factors, either acting singly or in consort can induce serious problems in multiple testing of differential expression for the genes. Thus, a number of truly significant genes can pass out undetected while many others may be wrongly flagged as positives. The consequence is a severe reduction in power (sensitivity) of the testing procedure accompanied by a substantially high rate of erroneous discoveries. Most of http://www.biomedcentral.com/1471-2105/14/236 available softwares for differential gene expression analyses either overlook this broadly general issue of hidden variability or consider simple parametric regression approaches (linear regression, mixed effects models, etc.) to address the maladies of residual heterogeneity. However the complexity of problem necessitates the development of a more generalized and efficient technique that can identify these latent effects of variation in the data and adjust for them in order to deliver a more powerful and accurate inference on the actual expression pattern of the genes. This motivated us to construct a methodology [1] that provides an unified framework for handling these widely different types of spurious variability in the data.
We have built an R software svapls that uses the multivariate Non-Linear Iterative Partial Least Squares (NIPALS) algorithm [2] to extract the latent, unwanted effects of variation in a gene expression data and uses them to build an optimal ANCOVA model for detecting the truly differentially expressed genes between two types of samples/tissues. In the next section we describe the important functions in our package along with illustrative examples that explain their practical usage in detail. The following section 'Comparative evaluation with other available software packages' demonstrates its comparatively superior performance with respect to three other popular softwares: sam [3], limma [4] and sva [5] through a sensitivity analysis of two simulated differential gene expression datasets affected by complicated hidden variation patterns. Section 'Application on the Golub data' elucidates an application on a real-life dataset that proves the worth of our software through the adjustment for batch effects and detection of some additional phenotyperelated genes that are deemed to be significant from their annotations in the literature. The manuscript ends with a discussion in Section 'Discussion' .

Brief overview of the package
This R package consists of the three primary functions: fitModel, svpls and hfp. Below we give a brief outline of them. The function applications are demonstrated on a simulated dataset affected by hidden variation (hidden_fac.dat) that is inbuilt as a part of the R package.
• The first function fitModel fits an ANCOVA model to the original log-transformed gene expression data, with a certain number of PLS scores as surrogate variables (specified by n.surr) or the simple ANOVA model [6,7]  • The second function svpls calls the first function fitModel to fit a number of ANCOVA models (specified by pmax) to the data and selects the optimal model as the one with the minimum value of the Akaike's Information Criterion (AIC) [8]. This model is then used to predict the actual pattern of differential expression of the genes over the two sample varieties by performing a multiple hypothesis testing at specified value of the false discovery rate (FDR) [9] (specified by fdr).

Figure 1 Histograms of the unadjusted and adjusted p-values.
This figure exhibits two histograms from an analysis of the data hidden_fac.dat, one for the unadjusted p-values for testing the variety-based differential gene expression (found from the standard ANOVA model) and the other corresponding to the adjusted p-values obtained after correcting the hidden variability in the data by our R package svapls.
• The third function hfp produces a heatmap for the PLS-imputed estimate of the residual expression heterogeneity corresponding to an user-specified set of genes and samples (specified by gen and ind respectively). This enables us to understand how intensely the latent factors from a certain set of subjects affect the true expression levels of a specified set of genes.

Comparative evaluation with other available software packages
In this section we illustrate the application of the R package along with the other three popular software packages through a family of simlulation analyses conducted with two sample sizes 20 and 40 under three different values of the noise-to-signal ratio (η = 0.05, 0.1 and 0.15) controlling the relative intensity of the random error and primary signal variances from low to high [1]. In each simulation study we generate correlated expression measurements on 1000 genes over k subjects, (k = 20, 40) classified equally into two groups 1 and 2. Overall, we consider two different settings: (1) The genes are affected by a highly complex subject-specific confounder (mixture of two normal random variables) with a small variance and (2) The genes are affected by two widely different subject-specific confounders (one mixture of two normal random variables and another mixture of two exponential random variables), both with very high variances [1]. Under both the settings, the first 70 genes are considered to be truly differentially expressed over the two varieties while the rest are chosen as non-significant. The simulation study is based on computation of the average values of two right decision indicators (sensitivity, specificity) and two wrong detection indicators (false discovery rate and false non-discovery rate), evaluated from 1000 http://www.biomedcentral.com/1471-2105/14/236  (Tables 1 and 2 for setting 1 and 3, 4 for setting 2). The obtained results from the two simulation settings clearly reveal the superior sensitivity of svapls compared to other three R packages sam, limma and sva, under most of the combinations of group size and noise-to-signal ratios. This illustrates the efficiency of our R package in discovering a higher proportion of the truly significant genes compared to the existing software packages. The sensitivity of sam is comparable to our method for a higher sample size under setting 1 (Tables 1  and 2) and is very close or marginally better under setting 2 (Tables 3 and 4), but is adversely impacted by its significantly large false discovery rate. Specifically, the sensitivity obtained from our R package becomes almost similar or slightly better than sam as the group size is increased from 10 to 20 (Table 4). Moreover, the specificity rate is the best for svapls closely followed by sva, while sam and limma are less efficient in this context. In addition, the average error rates FDR and FNR are much lower for svapls compared to the other three software packages. Thus, overall our R package is capable of discovering the truly differentially expressed genes with more power along with an efficient control over the wrong detections (non-detections).

Application on the Golub data
Now, we explore the performance of svapls on the pre-processed ALL/AML dataset [10,11]. It contains the log-transformed expression levels of 7129 genes      (4) Children's Cancer Study Group (CCG). This inherent classification in the data can potentially generate significant batch effects that may distort the original expression pattern of the genes. This motivated the implementation of our R package on this dataset. The corrected expression matrix for the first 1000 genes returned from the use of the svpls function on this data demonstrates that the batch effects due to variability in the sample sources have been removed effectively. The haphazard distribution of the samples from the four sources in the corrected gene expression matrix wipes out the additonal effects owing to the observed batchspecific clustering in the original data. In this context svapls fares equally well compared to another popular R package ber for removing batch effects in microarray data [12] (Figure 3). Overall, limma detects 7128 genes followed by 3307 genes from sam, 1015 genes from our svapls and 412 genes from sva. A Venn diagram (Figure 4) represents the extent of overlap between the genes detected by the four softwares. Specifically, limma detects all the genes that are found to be significant from the other three http://www.biomedcentral.com/1471-2105/14/236 softwares. This may be attributable to its high false discovery rate (FDR) as was observed in the simulation study. Interestingly, svapls detected 24 genes that are missed by both sam as well as sva. Among them the genes CD74, TNFRSF1A, LCN2 and GSN deserve special mention. All these genes are either related to some type of cancer or regulate cell growth(or apoptosis). CD74 plays an important role in multiple myeloma and its higher expression induces tumor cell malignancy [13]. An isoform of the tumor necrosis factor TNFRSF1A is associated with the development of Acute Lymphoblastic Leukemia (ALL) in children [14]. Specifically, LCN2 has been found to be connected with Acute Myelogenous Leukemia (AML) [15]. GSN plays a significant role of suppressing tumorigenicity in lung cancer [16] and has a diminuted expression in bladder cancer cells [17].

Discussion
Various hidden sources of variation are found to exist in a gene expression data that cannot be removed by the standard normalization procedures. But, their effect may be substantial enough to change the expression pattern of the genes over two different varieties of samples. The immediate consequence is a large reduction in the detection power of the testing procedure employed to find the truly significant genes, followed by highly elevated error rates. In this project we discuss the development and usage of an R package svapls that can tackle a wide variety of hidden effects in a gene expression analysis and can deliver a more accurate inference on the differential expression variability of the genes between two groups of samples (tissues). We illustrate the superior performance of our R package in comparison to three other popular softwares available for differential gene expression analyses. The high detection power (sensitivity) of svapls along with http://www.biomedcentral.com/1471-2105/14/236 the reasonably small error rates provides it a significantly better edge over the competing softwares. Specifically, sva is outperformed by our package in terms of the sensitivity (power), while sam comes close and performs marginally better in some cases, although its competence is severely marred by the considerably high false discovery rate (FDR) and substantially low specificity rate. In addition the graphical representation of the hidden variation (by the function hfp) from our package enables the user to understand the pattern in which the hidden sources of variability affect the expression signals of any specified subset of genes over a selected group of subjects/samples. This paves the way to more sophisticated analyses of subject-set specific gene expression variability in the data. Application of our package on the Golub data demonstrates its efficacy in removing the significant batch effects from the collected/analyzed samples. Moreover our package detects four additional genes (missed by both sva and sam) that have been found to be connected to Leukemia or some other type of cancer.

Conclusions
The R package svapls can detect a wide variety of hidden factors in a gene expression study and adjust for them appropriately, in order to provide a more accurate inference on the expression pattern of the genes between two different types of tissues. In particular, the superior detection power and small error rate gives our R package a substantially better edge over the competing softwares considered in the analysis.

Operating system and R version
The R package is platform independent and is compatible with all the versions of R same as or higher than 2.0.