Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets
 Fangzhou Yao^{1, 2},
 Jeff Coquery^{2, 3} and
 Kim-Anh Lê Cao^{2}
DOI: 10.1186/1471-2105-13-24
© Yao et al; licensee BioMed Central Ltd. 2012
Received: 5 September 2011
Accepted: 3 February 2012
Published: 3 February 2012
Abstract
Background
A key question when analyzing high throughput data is whether the information provided by the measured biological entities (gene, metabolite expression for example) is related to the experimental conditions, or, rather, to some interfering signals, such as experimental bias or artefacts. Visualization tools are therefore useful to better understand the underlying structure of the data in a 'blind' (unsupervised) way. A well-established technique to do so is Principal Component Analysis (PCA). PCA is particularly powerful if the biological question is related to the highest variance. Independent Component Analysis (ICA) has been proposed as an alternative to PCA as it optimizes an independence condition to give more meaningful components. However, neither PCA nor ICA can overcome both the high dimensionality and noisy characteristics of biological data.
Results
We propose Independent Principal Component Analysis (IPCA) that combines the advantages of both PCA and ICA. It uses ICA as a denoising process of the loading vectors produced by PCA to better highlight the important biological entities and reveal insightful patterns in the data. The result is a better clustering of the biological samples on graphical representations. In addition, a sparse version is proposed that performs an internal variable selection to identify biologically relevant features (sIPCA).
Conclusions
On simulation studies and real data sets, we showed that IPCA offers a better visualization of the data than ICA and with a smaller number of components than PCA. Furthermore, a preliminary investigation of the list of genes selected with sIPCA demonstrates that the approach is well able to highlight relevant genes in the data with respect to the biological experiment.
IPCA and sIPCA are both implemented in the R package mixomics dedicated to the analysis and exploration of high dimensional biological data sets, and on mixomics' web interface.
Background
With the development of high throughput technologies, such as microarray and next generation sequencing data, the exploration of high throughput data sets is becoming a necessity to unveil the relevant information contained in the data. Efficient exploratory tools are therefore needed, not only to assess the quality of the data, but also to give a comprehensive overview of the system, extract significant information and cope with the high dimensionality. Indeed, many statistical approaches fail or perform poorly for two main reasons: the number of samples (or observations) is much smaller than the number of variables (the biological entities that are measured) and the data are extremely noisy.
In this study, we are interested in the application of unsupervised approaches to discover novel biological mechanisms and reveal insightful patterns while reducing the dimension in the data. Amongst the different categories of unsupervised approaches (clustering, model-based and projection methods), we are specifically interested in projection-based methods, which linearly decompose the data into components with a desired property. These exploratory approaches project the data into a new subspace spanned by the components. They allow dimension reduction without loss of essential information and visualization of the data in a smaller subspace.
Principal component analysis (PCA) [1] is a classical tool to reduce the dimension of expression data, to visualize the similarities between the biological samples, and to filter noise. It is often used as a preprocessing step for subsequent analyses. PCA projects the data into a new space spanned by the principal components (PC), which are uncorrelated and orthogonal. The PCs can successfully extract relevant information in the data. Through sample and variable representations, they can reveal experimental characteristics, as well as artefacts or bias. Sometimes, however, PCA can fail to accurately reflect our knowledge of biology for the following reasons: a) PCA assumes that gene expression follows a multivariate normal distribution and recent studies have demonstrated that microarray gene expression measurements follow instead a super-Gaussian distribution [2–5], b) PCA decomposes the data based on the maximization of its variance. In some cases, the biological question may not be related to the highest variance in the data [6].
A more plausible assumption of the underlying distribution of high-throughput biological data is that feature measurements following Gaussian distributions represent noise: most genes conform to this distribution as they are not expected to change at a given physiological or pathological transition [7]. Recently, an alternative approach called Independent Component Analysis (ICA) [8–10] has been introduced to analyze microarray and metabolomics data [2, 6, 11–13]. In contrast to PCA, ICA identifies non-Gaussian components which are modelled as a linear combination of the biological features. These components are statistically independent, i.e. there is no overlapping information between the components. ICA therefore involves high order statistics, while PCA constrains the components to be mutually orthogonal, which involves second order statistics [14]. As a result, PCA and ICA often choose different subspaces where the data are projected. As ICA is a blind source signal separation method, it is used to reduce the effects of noise or artefacts in the signal since, usually, noise is generated from independent sources [10]. In the recent literature, it has been shown that the independent components from ICA were better at separating different biological groups than the principal components from PCA [2, 5–7]. However, although ICA has been found to be a successful alternative to PCA, it faces some limitations due to some instability, the choice of the number of components to extract and high dimensionality. As ICA is a stochastic algorithm, it needs to be run several times and the results averaged in order to obtain robust results [5]. Choosing the number of independent components to extract is a hard outstanding problem. It has been the convention to use a fixed number of components [2]. However, ICA does not order its components by 'relevance'. Therefore, some authors proposed to order them either with respect to their kurtosis values [9], or with respect to their l_{2} norm [2], or by using Bayesian frameworks to select the number of components [15]. In the case of high dimensional data sets, PCA is often applied as a preprocessing step to reduce the number of dimensions [2, 7]. In that particular case, ICA is applied on a subset of data summarized by a small number of principal components from PCA.
In this paper, we propose to use ICA as a denoising process of PCA, since ICA is good at separating mixed signals, i.e. noise vs. no noise. The aim is to generate denoised loading vectors. These vectors are crucial in PCA or ICA as each of them indicates the weights assigned to each biological feature in the linear combination that leads to the component. Therefore, the goal is to obtain independent components that better reflect the underlying biology in a study and achieve better dimension reduction than PCA or ICA.
Independent Principal Component Analysis (IPCA) makes the assumption that biologically meaningful components can be obtained if most noise has been removed in the associated loading vectors.
In IPCA, PCA is used as a preprocessing step to reduce the dimension of the data and to generate the loading vectors. The FastICA algorithm [9] is then applied on the previously obtained PCA loading vectors that will subsequently generate the Independent Principal Components (IPC). We use the kurtosis measure of the loading vectors to order the IPCs. We also propose a sparse variant with a built-in variable selection procedure by applying soft-thresholding on the independent loading vectors [16, 17] (sIPCA).
In the 'Results and Discussion' Section, we first compare the classical PCA and ICA methodologies to IPCA on a simulation study. On three real biological data sets (microarray and metabolomics data sets) we demonstrate the satisfactory sample clustering ability of IPCA. We then illustrate the usefulness of variable selection with sIPCA and compare it with the results obtained from the sparse PCA from [18]. In the 'Methods' Section, we present the PCA, ICA and IPCA methodologies and describe how to perform variable selection with sIPCA.
Results and Discussion
We first performed a simulation study where the loading vectors follow a Gaussian or super-Gaussian distribution. On three real data sets, we compared the kurtosis values of the loading vectors as a way of measuring their non-Gaussianity and ordering the IPCs. The sample clustering ability of each approach is assessed using the Davies-Bouldin index [19]. Finally, the variable selections performed by sIPCA and sPCA are compared on a simulated data set as well as on the Liver Toxicity data set.
Simulation study
 1. Gaussian case. The first two eigenvectors v_{1} and v_{2}, both of length 500, follow a Gaussian distribution.
 2. Super-Gaussian case. In this case the first two eigenvectors follow a mixture of Laplacian and uniform distributions: $v_{1k} \sim \begin{cases} L(0, 25) & k = 1, \dots, 50 \\ U(0, 1) & \text{otherwise,} \end{cases} \quad \text{and} \quad v_{2k} \sim \begin{cases} L(0, 25) & k = 301, \dots, 350 \\ U(0, 1) & \text{otherwise.} \end{cases}$
Table 1. Simulation study: angle (median value) between the simulated and estimated loading vectors simulated with either Gaussian or super-Gaussian distributions.
| Method | Gaussian v_{1} | Gaussian v_{2} | super-Gaussian v_{1} | super-Gaussian v_{2} |
|---|---|---|---|---|
| PCA | 20.48 | 21.61 | 20.47 | 21.62 |
| ICA | 85.70 | 84.39 | 82.13 | 77.77 |
| IPCA | 70.05 | 69.72 | 12.46 | 14.08 |
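Table 2. Mean value of the kurtosis measure of the first 5 loading vectors in the simulation study for PCA, ICA and IPCA.
| Case | Loading | PCA | ICA | IPCA |
|---|---|---|---|---|
| Gaussian case | loading 1 | 0.007 | 0.015 | 0.54 |
| Gaussian case | loading 2 | 0.009 | 0.013 | 0.21 |
| Gaussian case | loading 3 | 0.012 | 0.013 | 0.01 |
| Gaussian case | loading 4 | 0.011 | 0.013 | 0.20 |
| Gaussian case | loading 5 | 0.015 | 0.015 | 0.41 |
| super-Gaussian case | loading 1 | 34.75 | 0.28 | 52.58 |
| super-Gaussian case | loading 2 | 34.16 | 0.43 | 33.81 |
| super-Gaussian case | loading 3 | 0.01 | 0.42 | 0.27 |
| super-Gaussian case | loading 4 | 0.01 | 0.44 | 0.02 |
| super-Gaussian case | loading 5 | 0.02 | 0.47 | 0.25 |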
Tables 1 and 2 seem to suggest that ICA performs poorly in both the Gaussian and super-Gaussian cases, even if we would expect quite the contrary in the super-Gaussian case. In the high dimensional case, PCA is used as a preprocessing step in the ICA algorithm. It is likely that such a step affects the ICA input matrix and that the ICA assumptions are not met. Therefore, the performance of ICA seems to be largely affected by the high number of variables.
PCA gave satisfactory results in both cases. In the super-Gaussian case, PCA is even able to recover some of the super-Gaussian distribution of the loading vectors. However, IPCA is able to recover the loading structure better than PCA in the super-Gaussian case (angles are smaller in Table 1 and the kurtosis value is much higher for the first loading for IPCA). Depending on the (unknown) nature of the data set to be analyzed, it is therefore advisable to assess both approaches.
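The angle criterion reported for the simulation can be sketched as follows. This is a minimal Python illustration, not the code used for the study: the function name `angle_degrees` is hypothetical, and taking the absolute value of the cosine (so that a loading vector and its sign-flipped copy are treated as identical) is an assumption.

```python
import math

def angle_degrees(v_true, v_est):
    """Angle in degrees between a simulated and an estimated loading vector."""
    dot = sum(a * b for a, b in zip(v_true, v_est))
    norm_true = math.sqrt(sum(a * a for a in v_true))
    norm_est = math.sqrt(sum(b * b for b in v_est))
    # Assumption: loading vectors are defined up to sign, so use |cos|.
    cosine = min(1.0, abs(dot) / (norm_true * norm_est))
    return math.degrees(math.acos(cosine))

# A small angle means the estimated vector points in nearly the same
# direction as the simulated one; 90 degrees means they are orthogonal.
print(angle_degrees([1.0, 0.0], [1.0, 1.0]))
```

With this convention the angle lies between 0 and 90 degrees, and smaller values indicate a better recovery of the simulated loading structure.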
Application to real data sets
Liver Toxicity study
In this study, 64 male rats were exposed to non-toxic (50 or 150 mg/kg), moderately toxic (1500 mg/kg) or severely toxic (2000 mg/kg) doses of acetaminophen (paracetamol) in a controlled experiment [20]. In this paper, we considered 50 and 150 mg/kg as low doses, and 1500 and 2000 mg/kg as high doses. Necropsies were performed at 6, 18, 24 and 48 hours after exposure and the mRNA from the liver was extracted. The microarray data are arranged in a matrix of 64 samples and 3116 transcripts.
Prostate cancer study
This study investigated whether gene expression differences could distinguish between common clinical and pathological features of prostate cancer. Expression profiles were derived from 52 prostate tumors and from 50 non-tumor prostate samples (referred to as normal) using oligonucleotide microarrays containing probes for approximately 12,600 genes and ESTs. After preprocessing, the data set contains the expression of 6033 genes (see [21]) for 101 samples, since one normal sample was suspected to be an outlier and was removed from the analysis.
Yeast metabolomic study
In this study, two Saccharomyces cerevisiae strains were used, wild-type (WT) and mutant (MT). They were grown in batch cultures under two different environmental conditions, aerobic (AER) and anaerobic (ANA), in standard mineral media with glucose as the sole carbon source. After normalization and preprocessing, the metabolomic data consist of 37 metabolites and 55 samples that include 13 MT-AER, 14 MT-ANA, 15 WT-AER and 13 WT-ANA samples (see [22] for more details).
Choosing the number of components with the kurtosis measure
As mentioned by [5], one major limitation of ICA is the specification and the choice of the number of components to extract. In PCA, the cumulative percentage of explained variance is a popular criterion to choose the number of principal components, since they are ordered by decreasing explained variance [1]. For the case of high dimensionality, many alternative ad hoc stopping rules have been proposed without, however, leading to a consensus (see [23] for a thorough review). In Liver Toxicity, the first 3 principal components explained 63% of the total variance; in Yeast, the first 2 principal components explained 85% of the total variance. For Prostate, which contains a very large number of variables, the first 3 components only explain 51% of the total variance (7 principal components would be necessary to explain more than 60%). However, from a visualization perspective, a representation with more than 3 components would be difficult to interpret.
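The cumulative percentage of explained variance can be sketched from the singular values of the centered data matrix, since the variance carried by the jth principal component is proportional to the square of its singular value d_j. The Python functions below are illustrative only (their names are hypothetical and they are not part of mixomics), and they assume that all singular values are supplied so that the total variance is complete.

```python
def cumulative_explained_variance(singular_values):
    """Cumulative share of the total variance explained by the first j PCs."""
    squared = [d * d for d in singular_values]   # variance is proportional to d_j^2
    total = sum(squared)
    cumulative, running = [], 0.0
    for value in squared:
        running += value
        cumulative.append(running / total)
    return cumulative

def components_needed(singular_values, target):
    """Smallest number of PCs whose cumulative share reaches `target` (0 to 1)."""
    shares = cumulative_explained_variance(singular_values)
    for j, share in enumerate(shares, start=1):
        if share >= target:
            return j
    return len(singular_values)

# Toy singular values, not taken from any of the data sets above.
print(cumulative_explained_variance([3.0, 2.0, 1.0]))
print(components_needed([3.0, 2.0, 1.0], target=0.9))
```

Such a rule only ranks components by variance, which is why the kurtosis of the loading vectors is considered as a complementary criterion.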
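Table 3. Kurtosis measures of the loading vectors for PCA, ICA and IPCA.
| Dataset | Loading | PCA | ICA | IPCA |
|---|---|---|---|---|
| Liver Toxicity study | loading 1 | 6.588 | 7.697 | 9.700 |
| Liver Toxicity study | loading 2 | 1.912 | 2.737 | 6.982 |
| Liver Toxicity study | loading 3 | 6.958 | 4.799 | 0.672 |
| Prostate cancer study | loading 1 | 1.527 | 0.553 | 1.513 |
| Prostate cancer study | loading 2 | 0.561 | 0.723 | 0.249 |
| Prostate cancer study | loading 3 | 1.176 | 1.640 | 1.509 |
| Yeast metabolomic study | loading 1 | 4.532 | 0.274 | 1.551 |
| Yeast metabolomic study | loading 2 | 12.261 | 0.758 | 1.437 |
| Yeast metabolomic study | loading 3 | 4.147 | 1.677 | 0.475 |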
Sample representation
In Liver Toxicity, IPCA tended to better cluster the low doses together, compared to PCA or ICA (Figure 1). In Prostate (Figure 2), PCA graphical representations showed interesting patterns. Neither the first nor the second component in PCA was relevant to separate the two groups. Instead, it was the third component that could give more insight into the expected biological characteristics of the samples. It is likely that PCA first attempts to maximize the variance of noisy signals, which have a Gaussian distribution, before being able to find the right direction to better differentiate the sample classes. For IPCA, the first component seemed already sufficient to separate the classes (as indicated by the kurtosis value of its associated loading vector in Table 3), while two components were necessary for ICA to achieve a satisfactory clustering. For the Yeast study (Figure 3), even though the first 2 principal components explained 85% of the total variance, it seemed that 3 components were necessary to separate the WT from the MT in the AER samples with PCA, whereas 2 components were sufficient with ICA and IPCA. For all approaches, the WT and MT samples for the ANA group remain mixed and seem to share strong biological similarities.
Cluster validation
Davies-Bouldin index for PCA, ICA and IPCA on the three data sets.
| Dataset | # of components | PCA | ICA | IPCA |
|---|---|---|---|---|
| Liver Toxicity study | 2 components | 1.809 | 1.923 | 1.242 |
| Liver Toxicity study | 3 components | 1.523 | 1.578 | 1.525 |
| Prostate cancer study | 2 components | 4.117 | 1.679 | 1.782 |
| Prostate cancer study | 3 components | 3.312 | 2.316 | 2.315 |
| Yeast metabolomic study | 2 components | 1.894 | 1.788 | 2.338 |
| Yeast metabolomic study | 3 components | 2.119 | 2.139 | 2.037 |
The Davies-Bouldin index (for which lower values indicate better separated clusters) indicated that for the larger data sets (Liver Toxicity and Prostate), IPCA reached its best value with 2 components, whereas PCA needed 3 components to reach its best value. IPCA is therefore able to highlight relevant information in a very small number of dimensions.
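For readers unfamiliar with this criterion, a minimal Python sketch of the usual definition of the Davies-Bouldin index is given below. It is an illustration under assumptions, not the implementation used in the study: `points` stands for the sample coordinates on the retained components, `labels` for the known biological groups, and Euclidean distances are assumed.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index of labelled points (lower means better separation)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    centroids, scatters = {}, {}
    for label, members in clusters.items():
        dim = len(members[0])
        centroid = [sum(p[k] for p in members) / len(members) for k in range(dim)]
        centroids[label] = centroid
        # Within-cluster scatter: mean Euclidean distance to the centroid.
        scatters[label] = sum(math.dist(p, centroid) for p in members) / len(members)

    total = 0.0
    for i in clusters:
        # Worst-case ratio of combined scatter to centroid separation.
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)

# Toy example with two groups on two components (requires at least two groups).
toy_points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
toy_labels = ["low", "low", "high", "high"]
print(davies_bouldin(toy_points, toy_labels))
```

Compact groups that lie far apart give a small index, so comparing the index across methods for the same number of components indicates which representation separates the known groups best.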
Variable selection
We first performed a simulation study to assess whether sIPCA could identify relevant variables. We then applied sIPCA to the Liver Toxicity study. In both cases, we compared sIPCA with the sparse PCA approach (sPCA-rSVD-soft from [18]) that we will subsequently call 'sPCA'.
Simulated example
 1. Gaussian case. The two sparse simulated eigenvectors followed a Gaussian distribution: $v_{1k} \begin{cases} \sim N(0, 1) & k = 1, \dots, 50 \\ = 0 & \text{otherwise,} \end{cases} \quad \text{and} \quad v_{2k} \begin{cases} \sim N(0, 1) & k = 301, \dots, 350 \\ = 0 & \text{otherwise.} \end{cases}$
 2. Super-Gaussian case. In this case, we have $v_{1k} \begin{cases} \sim L(0, 25) & k = 1, \dots, 50 \\ = 0 & \text{otherwise,} \end{cases} \quad \text{and} \quad v_{2k} \begin{cases} \sim L(0, 25) & k = 301, \dots, 350 \\ = 0 & \text{otherwise.} \end{cases}$
Simulation study: average percentage of correctly identified non-zero loadings (standard deviation) when 50 variables are selected on each dimension (each loading vector).
| Method | Gaussian v_{1} | Gaussian v_{2} | super-Gaussian v_{1} | super-Gaussian v_{2} |
|---|---|---|---|---|
| sPCA | 90.30% (3.5) | 72.5% (11.6) | 85.44% (4.3) | 68.22% (10.6) |
| sIPCA | 86.7% (8.3) | 87.7% (8.1) | 80.80% (8.6) | 82.30% (8.4) |
Real example with Liver Toxicity study
Choosing the number of genes to select
Comparison of the sparse loading vectors
Sample representation
Biological relevance of the selected genes
We have seen that the independent principal components indicate relevant biological similarities between the samples. We next assessed whether the selected genes were relevant to the biological study. The genes selected with either sIPCA or sPCA were further investigated using the GeneGo software [26], which can output pathways, process networks, Gene Ontology (GO) processes and molecular functions.
We decided to focus only on the first two dimensions as they were sufficient to obtain a satisfactory clustering of the samples (see previous results). We therefore analyzed the two lists of 50 genes selected with either sIPCA or sPCA for each of these two dimensions. Amongst these 50 genes, between 33 and 39 genes were annotated and recognized by the software.
Genes selected on dimension 1
Both methods selected genes previously highlighted in the literature as having functions in detoxification and redox regulation in response to oxidative stress: on the first dimension, sIPCA selected 2 cytochrome P450 genes and heme oxygenase 1, while sPCA selected 1 cytochrome P450 gene and heme oxygenase 1 (see Additional files 1 and 2). The expression of these genes has been found to be altered in biological pathways perturbed subsequent to incipient toxicity [27–32]. These genes were also previously selected with other statistical approaches by other colleagues on the same study [20].
A Gene Ontology enrichment analysis for each list of genes was performed. GO terms significantly enriched included biological processes related to response to unfolded proteins, protein refolding and protein stimulus, as well as response to chemical stimulus and organic substance (Additional file 3). Although very similar, the sPCA gene list highlighted slightly more genes related to these GO terms than the sIPCA gene selection. The GO molecular functions related to these genes were, however, more enriched with sIPCA: heme and unfolded protein binding as well as oxidoreductase activity (Additional file 4).
Genes selected on dimension 2
The gene lists from dimension two not only highlighted response to unfolded protein and to organic substance, but also cellular carbohydrate biosynthesis process, triglyceride, acylglycerol, neutral metabolic processes as well as catabolic process and glucogenesis. For this dimension, however, it is sIPCA that selected more relevant genes that enriched these terms (Additional file 5).
In terms of pathways, both approaches selected HSP70 and HSP90 genes. The HSP70 gene encodes a member of the heat shock protein 70 family. These proteins play a role in cell proliferation and stress response, which explains the presence of pathways such as oxidative stress [33, 34] (Additional file 6). The HSP90 proteins are highly conserved molecular chaperones that have key roles in signal transduction, protein folding and protein degradation. They play important roles in folding newly synthesized proteins or stabilizing and refolding denatured proteins after stress [35].
Summary
This preliminary analysis demonstrates the ability of sIPCA and sPCA to select genes that were relevant to the biological study. These genes, which are ranked as being 'important' by both approaches, participate in the determination of the components, which are linear combinations of the original variables. Therefore, the expression of these selected genes not only helps to cluster the samples according to the different treatments or biological conditions but also has a biologically relevant meaning for the system under study.
Conclusions
We have developed a variant of PCA called IPCA that combines the advantages of both PCA and ICA. IPCA assumes that biologically meaningful components can be obtained if most noise has been removed from the associated loading vectors. By identifying non-Gaussian loading vectors from the biological data, it better reflects the internal structure of the data compared to PCA and ICA. On simulated data sets, we showed that IPCA outperformed PCA and ICA in the super-Gaussian case, and that the kurtosis value of the loading vectors can be used to choose the number of independent principal components. On real data sets, we assessed the cluster validity using the Davies-Bouldin index and showed that in high dimensional cases, IPCA could summarize the information of the data better or with a smaller number of components than PCA or ICA.
We also introduced sIPCA that allows an internal variable selection procedure. By applying a soft-thresholding penalization on the independent loading vectors, sparse loading vectors are obtained which enable variable selection. We have shown that sIPCA can correctly identify most of the important variables in a simulation study. For one data set, the genes selected with sIPCA and sPCA were further investigated to assess whether the two approaches were able to select genes that were relevant to the system under study. Given these genes, relevant GO terms, molecular functions and pathways were highlighted. This analysis demonstrated the ability of such approaches to unravel biologically relevant information. The expression of these selected genes is also decisive to cluster the samples according to their biological conditions.
We believe that the (s)IPCA approach can be useful, not only to improve data visualization and reveal experimental characteristics, but also to identify biologically relevant variables. IPCA and sIPCA are implemented in the R package mixomics [36, 37] and its associated web interface http://mixomics.qfab.org.
Methods
Principal Component Analysis (PCA)
Let X be the centered n × p data matrix, with n samples and p variables. PCA can be obtained from the singular value decomposition (SVD) of X, X = UDV^{T}, where U is an n × p matrix whose columns are uncorrelated (i.e. U^{T}U = I_{p}), V is a p × p orthogonal matrix (i.e. V^{T}V = I_{p}), and D is a p × p diagonal matrix with diagonal elements d_{j}. We denote u_{j} the columns of U and v_{j} the columns of V. Then u_{j}d_{j} is the jth principal component (PC) and v_{j} is the corresponding loading vector [1]. The PCs are linear combinations of the original variables and the loading vectors indicate the weights assigned to each of the variables in the linear combination. The first PC accounts for the maximal amount of the total variance. Similarly, the jth (j = 2,..., p) PC can explain the maximal amount of variance that is not accounted for by the previous j − 1 PCs. Therefore, most of the information contained in X can be reduced to a few PCs. Plotting the PCs enables a visual representation of the samples projected in the subspace spanned by the PCs. We can expect that the samples belonging to the same biological group, or undergoing the same biological treatment, would be clustered together and separated from the other groups.
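To make the link between loading vectors and principal components concrete, the sketch below approximates the first loading vector v_1 of a small centered matrix and projects the samples on it, using the relation X v_1 = u_1 d_1 that follows from the SVD of X. Power iteration is swapped in here for a full SVD purely to keep the example dependency-free; it is not the method used in mixomics, and the function names are hypothetical.

```python
import math

def first_loading_vector(X, n_iter=200):
    """Approximate the first loading vector of a centered matrix X by power iteration."""
    p = len(X[0])
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit start vector
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in X]            # X v
        v = [sum(s * row[k] for s, row in zip(scores, X)) for k in range(p)]  # X^T (X v)
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v

def principal_component(X, v):
    """Project the samples on a loading vector: X v = u d."""
    return [sum(x * w for x, w in zip(row, v)) for row in X]

# Toy data: 4 samples, 2 variables, columns already centered.
X = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
v1 = first_loading_vector(X)
pc1 = principal_component(X, v1)
d1 = math.sqrt(sum(s * s for s in pc1))  # first singular value
print(v1, pc1, d1)
```

In this toy example the first variable carries most of the variance, so the first loading vector puts almost all of its weight on that variable. Power iteration can stall if the start vector happens to be orthogonal to v_1, which is one reason a proper SVD routine is preferable in practice.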
Limitation of PCA
Sometimes, however, PCA may not be able to extract relevant information and may therefore provide meaningless principal components that do not describe experimental characteristics. The reason is that its linear transformation involves second order statistics (i.e. to obtain mutually orthogonal PCs) that might not be appropriate for biological data. PCA assumes that gene expression data have Gaussian signals, while it has been demonstrated that many gene expression data in fact have 'super-Gaussian' signals [2, 4].
Independent Component Analysis (ICA)
Independent Component Analysis (ICA) was first proposed by [8]. ICA can reduce the effects of noise or artefacts in the data as it aims at separating a mixture of signals into their different sources. By assuming a non-Gaussian signal distribution, ICA models observations as a linear combination of variables, or components, which are chosen to be as statistically independent as possible (i.e. the different components represent different non-overlapping information). ICA therefore involves higher-order statistics [14]. In fact, ICA attempts to recover statistically independent signals from the observations of an unknown linear mixture. Several algorithms such as FastICA, Kernel ICA [38] and ProDenICA [39] were proposed to estimate the independent components. The FastICA algorithm maximizes the non-Gaussianity of each component, while Kernel ICA and ProDenICA minimize mutual information between components. In this article, we used the FastICA algorithm.
ICA assumes that Gaussian distributions represent noise, and therefore aims at identifying non-Gaussian components in the sample space that are as independent as possible. Recent studies have observed that the signal distributions of microarray data are typically super-Gaussian since only a small number of genes contribute heavily to a specific biological process [2, 5].
Two classical quantitative measures of Gaussianity are kurtosis and negentropy.

Kurtosis, also called the fourth-order cumulant, is defined as $K = E\left\{\mathbf{s}_{i}^{4}\right\} - 3,$ (6)
where s_{i} is a row of S, which has zero mean and unit variance, i = 1, ..., n. The kurtosis value equals zero if s_{i} has a Gaussian probability density function (pdf), is positive if s_{i} has a spiky pdf (super-Gaussian, i.e. the pdf is relatively large at zero) and is negative if s_{i} has a flat pdf (sub-Gaussian, i.e. the pdf is rather constant near zero). We are interested in the spiky and flat pdfs (i.e. non-Gaussian pdfs) since non-Gaussianity is regarded as independence [9]. Note that although kurtosis is both computationally and theoretically simple, it can be very sensitive to outliers. The authors in [6] proposed to order the ICs based on their kurtosis value.

In the FastICA algorithm, negentropy is used as it is an excellent measurement of non-Gaussianity. Negentropy equals zero if s_{i} is Gaussian and is positive if s_{i} is non-Gaussian. It is not only easy to compute, but also very robust [9]. However, this measure does not distinguish between super-Gaussianity and sub-Gaussianity.
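The kurtosis measure in (6) can be estimated from a vector by replacing the expectation with an empirical mean. The following Python sketch is an illustration only; standardizing the vector inside the function (with the population variance) is an assumption made so that it can be applied to any input.

```python
import math

def kurtosis(values):
    """Empirical excess kurtosis: mean of the 4th power of standardized values, minus 3."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n  # population variance
    std = math.sqrt(variance)
    return sum(((x - mean) / std) ** 4 for x in values) / n - 3.0

# A spiky vector (mostly zeros, a few large weights) versus a flat one.
spiky = [0.0] * 98 + [5.0, -5.0]
flat = [-1.0, 1.0] * 50
print(kurtosis(spiky), kurtosis(flat))
```

The spiky vector yields a large positive value and the flat one a negative value, which matches the interpretation of super-Gaussian and sub-Gaussian pdfs given above. A single extreme entry can dominate the fourth power, which illustrates the sensitivity to outliers.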
Limitation of ICA
Similar to PCA, ICA also suffers from high dimensionality, which sometimes leads to the inability of the ICs to reflect the (biologically expected) internal structure of the data. Furthermore, since ICA is a stochastic algorithm, it faces the problem of convergence to local optima, leading to slightly different ICs when reanalyzing the same data [40].
Independent Principal Component Analysis (IPCA)
To reduce noise and better reflect the internal structure of the data generated by the biological experiment, we propose a new approach called Independent Principal Component Analysis (IPCA). Rather than denoising the data or the PCs directly, as it is performed in ICA, we propose instead to reduce the noise in the loading vectors. Recall that the PCs, which are then used to visualize the samples and how they cluster together, are a linear combination of the original variables weighted by their elements in the corresponding loading vectors. Thus we will obtain denoised PCs by using ICA as a denoising process of the associated loading vectors.
Summary of the IPCA algorithm.
Algorithm: Principal Component Analysis with Independent loadings (IPCA)
 1. Implement SVD on the centered data matrix X to generate the whitened loading vectors V, and choose the number of components m to reduce the dimension.
 2. Implement FastICA on the loading vectors V and obtain the independent loading vectors S^{T}.
 3. Project the centered data matrix X on the m independent loading vectors s_{j} and get the Independent PCs $\tilde{\mathbf{u}}_{j}, j = 1, \dots, m$.
 4. Order the IPCs by the kurtosis value of their corresponding independent loading vectors.
Extract the loading vectors from PCA
where the columns of V contain the loading vectors. Since the mean of each loading vector is very close to zero, these vectors are approximately whitened and the FastICA algorithm can be applied on the loading vectors.
Dimension reduction
Dimension reduction enables a clearer interpretation without the computational burden. Therefore, only a small number of loading vectors, or, equivalently, a small number of PCs is needed to summarize most of the relevant information. However, there is no globally accepted criterion on how to choose the number of PCs to keep. We have shown that the kurtosis value of the independent loading vectors gives a post hoc indication of the number of independent principal components to be chosen (see 'Results and Discussion' Section). We have experimentally observed that 2 or 3 components were sufficient to highlight meaningful characteristics of the data and to discard much of the noise or irrelevant information.
Apply ICA on the loading vectors
where $\tilde{\mathbf{U}}$ is a (n × m) matrix whose columns contain the IPCs.
Ordering the IPCs
Recall that ICA provides unordered components and that the kurtosis measure indicates the Gaussian characteristic of a pdf. [6] recently proposed to use the kurtosis measure of the ICs to order them. In IPCA, we propose instead to order the IPCs according to the kurtosis value of the m independent loading vectors s_{j} (j = 1, ..., m), as we are mainly interested in loading vectors with a spiky pdf, indicated by a large kurtosis value.
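The projection and ordering steps of IPCA can be sketched as follows, taking the independent loading vectors as given (the SVD and FastICA steps are assumed to have been run beforehand). This is a hedged Python illustration: `order_ipcs` is a hypothetical helper, and the kurtosis is estimated empirically after standardizing each loading vector.

```python
import math

def kurtosis(values):
    """Empirical excess kurtosis of a vector (standardized internally)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / std) ** 4 for x in values) / n - 3.0

def order_ipcs(X, loadings):
    """Project X on each loading vector and sort by decreasing kurtosis.

    X: centered data, one row per sample. loadings: list of vectors s_j of length p.
    Returns a list of (kurtosis, loading vector, IPC) triples.
    """
    results = []
    for s in loadings:
        ipc = [sum(x * w for x, w in zip(row, s)) for row in X]  # X s_j
        results.append((kurtosis(s), s, ipc))
    results.sort(key=lambda item: item[0], reverse=True)
    return results

# Toy input: one sample, six variables, one flat and one spiky loading vector.
X = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
s_flat = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
s_spiky = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
ordered = order_ipcs(X, [s_flat, s_spiky])
print([round(k, 3) for k, _, _ in ordered])
```

The spiky loading vector comes first because its kurtosis is larger, so its IPC would be plotted as the first dimension.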
Sparse IPCA (sIPCA)
Similar to PCA and ICA, the elements in the loading vectors in IPCA indicate which variables are important or relevant to determine the principal components. Therefore, obtaining sparse loading vectors enables variable selection to identify important variables of potential biological relevance, as well as removing noisy variables while calculating the IPCs in the algorithm.
where γ is the threshold and is applied on each element k of the loading vector s_{ j } (k = 1... p, j = 1... m) so as to obtain the sparse loading vector ${\widehat{\mathbf{s}}}_{j}$. The variables whose original weights are smaller in absolute value than the threshold γ will be penalized to have zero weights. A classical method to choose γ is cross-validation. In practice, however, γ has been replaced by the degree of sparsity (i.e., the number of nonzero elements in each loading vector, see following paragraph). In this way, we can control how many variables to select and save some computational time.
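As an illustration, the sketch below assumes that the thresholding is the usual soft-thresholding rule, sign(s)(|s| − γ)_+, which this part of the text does not show explicitly. It also shows one way to derive γ from a requested degree of sparsity. The function names and the toy vector are hypothetical.

```python
import math


def soft_threshold(loading, gamma):
    """Shrink each weight toward zero by gamma; weights with |s| <= gamma become 0."""
    return [
        math.copysign(abs(s) - gamma, s) if abs(s) > gamma else 0.0
        for s in loading
    ]


def gamma_for_sparsity(loading, keep):
    """Pick gamma so that `keep` weights stay nonzero.

    Assumption: the absolute weights are distinct; with ties, fewer than
    `keep` weights may survive.
    """
    magnitudes = sorted((abs(s) for s in loading), reverse=True)
    if keep >= len(magnitudes):
        return 0.0
    return magnitudes[keep]  # largest magnitude that has to be set to zero


# Hypothetical loading vector with p = 6 variables; keep the 2 strongest.
s_j = [0.05, -0.60, 0.10, 0.70, -0.02, 0.30]
gamma = gamma_for_sparsity(s_j, keep=2)
s_hat = soft_threshold(s_j, gamma)
selected = [k for k, w in enumerate(s_hat) if w != 0.0]
```

Specifying `keep` rather than γ mirrors the choice described above: the user controls the number of selected variables directly, and no cross-validation loop is needed.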
Using (s)IPCA
IPCA and sIPCA are implemented in the R package mixOmics, which is dedicated to the analysis of large biological data sets [36, 37]. The use of the approaches is straightforward: the user needs to input the data set and to choose the number of components to keep (usually set to a small value). In the case of the sparse version, the number of variables to select on each sIPCA dimension must also be given. The number of components can be reconsidered afterwards by extracting the kurtosis values of the loading vectors: a sudden drop in these values indicates how many components are enough to explain most of the information in the data.
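One possible way to make the 'sudden drop' criterion concrete is sketched below in Python. This is an illustrative heuristic under our own assumptions (largest gap between consecutive ordered kurtosis values), not the rule implemented in mixOmics, and the function name and values are hypothetical.

```python
def components_before_largest_drop(kurtosis_values):
    """Given kurtosis values sorted in decreasing order, return how many
    components come before the largest drop between consecutive values."""
    if len(kurtosis_values) < 2:
        return len(kurtosis_values)
    drops = [a - b for a, b in zip(kurtosis_values, kurtosis_values[1:])]
    return drops.index(max(drops)) + 1


# Hypothetical kurtosis values of five ordered independent loading vectors.
n_keep = components_before_largest_drop([12.4, 9.8, 1.1, 0.7, 0.4])
```

In this toy case the largest gap lies between the second and third values, so two components would be kept. In practice a visual inspection of the values remains advisable.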
The number of variables to select is still an open issue (as pointed out by many authors working on sparse approaches, [18]), since in such studies we are often limited by the number of samples. Tuning the number of variables to select therefore mostly relies on the biological question. Sometimes an optimal but too short gene selection may not suffice to give a comprehensive biological interpretation, and sometimes the experimental validation might be limited if the gene selection is too large.
In our example, for the sake of simplicity, we have set the same number of variables to select on each dimension.
Simulation studies
In the different simulation studies, we used the following framework (previously proposed by [18]). Σ is the variance-covariance matrix of size 500 × 500, whose first two normalized eigenvectors v_{1} and v_{2}, both of length 500, are simulated for the different cases described in the 'Results and Discussion' Section. The other eigenvectors were drawn from U[0, 1]. A Gram-Schmidt orthogonalization method was applied to obtain the orthogonal matrix V whose columns contain v_{1}, v_{2} and the other eigenvectors. To make the first two eigenvectors dominate, the first two eigenvalues were set to c_{1} = 400 and c_{2} = 300, and c_{ k } = 1 for k = 3,..., 500. Let C = diag{c_{1},..., c_{500}} be the eigenvalue matrix, then Σ = VCV^{ T }. The data are then generated from a multivariate normal distribution N(0, Σ), with n = 50 samples and p = 500 variables.
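The framework can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' code: the shapes of v_{1} and v_{2} below are hypothetical placeholders for the cases described in the 'Results and Discussion' Section, a small p is used to keep the pure-Python run fast, and sampling relies on the identity that x = Σ_k sqrt(c_k) z_k v_k with z_k ~ N(0, 1) has covariance VCV^{T}.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(vectors):
    """Orthonormalize the vectors in order (modified Gram-Schmidt).

    The first vector keeps its direction; later ones lose their
    components along the earlier ones.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        basis.append([wi / norm for wi in w])
    return basis


def simulate(v1, v2, n, rng, c1=400.0, c2=300.0):
    """Return (V, data): V[k] is the k-th eigenvector, data holds n samples of length p."""
    p = len(v1)
    # Remaining eigenvectors start as U[0, 1] draws, as described in the text.
    others = [[rng.random() for _ in range(p)] for _ in range(p - 2)]
    V = gram_schmidt([v1, v2] + others)
    # Square roots of the eigenvalues c_1 = 400, c_2 = 300, c_k = 1 otherwise.
    scales = [math.sqrt(c1), math.sqrt(c2)] + [1.0] * (p - 2)
    data = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        x = [sum(scales[k] * z[k] * V[k][i] for k in range(p)) for i in range(p)]
        data.append(x)
    return V, data


# Hypothetical small example (p = 20 instead of 500, n = 10 instead of 50).
rng = random.Random(0)
p = 20
v1 = [1 / math.sqrt(5)] * 5 + [0.0] * (p - 5)
v2 = [0.0] * 5 + [1 / math.sqrt(5)] * 5 + [0.0] * (p - 10)
V, X = simulate(v1, v2, n=10, rng=rng)
```

With p = 500 the same code applies but the pure-Python orthogonalization becomes slow. A linear algebra library would be the natural choice at that size.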
Davies-Bouldin index
The Davies-Bouldin index is defined as
$DB = \frac{1}{K}\sum_{i=1}^{K}\max_{j \neq i}\frac{\sigma_{i} + \sigma_{j}}{d(c_{i}, c_{j})}$
where c_{ i } is the centroid of cluster i, σ_{ i } is the average distance of all elements in cluster i to centroid c_{ i }, d(c_{ i }, c_{ j }) is the distance between the two centroids, and K is the number of known biological conditions or treatments. Depending on the number of components that were chosen, we applied a 2- or 3-norm distance. Geometrically speaking, we are seeking to minimize the within-cluster scatter (the numerator) while maximizing the between-class separation (the denominator). Therefore, for a given number of components, the approach that gives the lowest index has the best clustering ability.
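A minimal Python sketch of the index is given below. It assumes the usual form of the Davies-Bouldin index, i.e. the average over clusters of the worst-case ratio (σ_i + σ_j)/d(c_i, c_j), and uses the Euclidean distance throughout. The function names and toy clusters are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def davies_bouldin(clusters):
    """clusters: list of K >= 2 clusters, each a list of points
    (for example, sample coordinates on the first 2 or 3 components)."""
    centroids = [centroid(pts) for pts in clusters]
    # sigma_i: average distance of the elements of cluster i to its centroid
    scatter = [
        sum(math.dist(x, c) for x in pts) / len(pts)
        for pts, c in zip(clusters, centroids)
    ]
    K = len(clusters)
    total = 0.0
    for i in range(K):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(K)
            if j != i
        )
    return total / K


# Hypothetical 2-component sample coordinates for two conditions.
well_separated = davies_bouldin([[(0, 0), (0, 2)], [(10, 0), (10, 2)]])
overlapping = davies_bouldin([[(0, 0), (0, 2)], [(2, 0), (2, 2)]])
```

In this toy example the well-separated configuration gives the lower index, in line with the interpretation that a lower value indicates better clustering.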
Declarations
Acknowledgements
We would like to thank Dr Thibault Jombart (Imperial College) for his useful advice. This work was supported, in part, by the Wound Management Innovation CRC (established and supported under the Australian Government's Cooperative Research Centres Program).
References
1. Jolliffe I: Principal Component Analysis. second edition. Springer, New York; 2002.
2. Lee S, Batzoglou S: Application of independent component analysis to microarrays. Genome Biology 2003, 4(11):R76. 10.1186/gb-2003-4-11-r76
3. Purdom E, Holmes S: Error distribution for gene expression data. Statistical applications in genetics and molecular biology 2005, 4: 16.
4. Huang D, Zheng C: Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics 2006, 22(15):1855. 10.1093/bioinformatics/btl190
5. Engreitz J, Daigle B Jr, Marshall J, Altman R: Independent component analysis: Mining microarray data for fundamental human gene expression modules. Journal of Biomedical Informatics 2010, 43: 932–944. 10.1016/j.jbi.2010.07.001
6. Scholz M, Gatzek S, Sterling A, Fiehn O, Selbig J: Metabolite fingerprinting: detecting biological features by independent component analysis. Bioinformatics 2004, 20(15):2447–2454. 10.1093/bioinformatics/bth270
7. Frigyesi A, Veerla S, Lindgren D, Höglund M: Independent component analysis reveals new and biologically significant structures in micro array data. BMC bioinformatics 2006, 7: 290. 10.1186/1471-2105-7-290
8. Comon P: Independent component analysis, a new concept? Signal Process 1994, 36: 287–314. 10.1016/0165-1684(94)90029-9
9. Hyvärinen A, Oja E: Independent Component Analysis: Algorithms and Applications. Neural Networks 2000, 13(4–5):411–430. 10.1016/S0893-6080(00)00026-5
10. Hyvärinen A, Karhunen J, Oja E: Independent Component Analysis. John Wiley & Sons; 2001.
11. Liebermeister W: Linear modes of gene expression determined by independent component analysis. Bioinformatics 2002, 18: 51–60. 10.1093/bioinformatics/18.1.51
12. Wienkoop S, Morgenthal K, Wolschin F, Scholz M, Selbig J, Weckwerth W: Integration of Metabolomic and Proteomic Phenotypes. Molecular & Cellular Proteomics 2008, 7: 1725–1736. 10.1074/mcp.M700273-MCP200
13. Rousseau R, Govaerts B, Verleysen M: Combination of Independent Component Analysis and statistical modelling for the identification of metabonomic biomarkers in H-NMR spectroscopy. Tech rep, Université Catholique de Louvain and Université Paris I 2009.
14. Kong W, Vanderburg C, Gunshin H, Rogers J, Huang X: A review of independent component analysis application to microarray gene expression data. BioTechniques 2008, 45(5):501. 10.2144/000112950
15. Teschendorff A, Journée M, Absil P, Sepulchre R, Caldas C: Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS computational biology 2007, 3(8):e161. 10.1371/journal.pcbi.0030161
16. Jolliffe I, Trendafilov N, Uddin M: A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics 2003, 12: 531–547. 10.1198/1061860032148
17. Donoho D, Johnstone I: Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81: 425–455. 10.1093/biomet/81.3.425
18. Shen H, Huang JZ: Sparse Principal Component Analysis via Regularized Low Rank Matrix Approximation. Journal of Multivariate Analysis 2008, 99: 1015–1034. 10.1016/j.jmva.2007.06.007
19. Davies D, Bouldin D: A cluster separation measure. Pattern Analysis and Machine Intelligence, IEEE Transactions on 1979, (2):224–227.
20. Bushel P, Wolfinger RD, Gibson G: Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Systems Biology 2007, 1.
21. Singh D, Febbo P, Ross K, Jackson D, Manola J, Ladd C, Tamayo P, Renshaw A, D'Amico A, Richie J, Lander E, Loda M, Kantoff P, Golub T, Sellers W: Gene expression correlates of clinical prostate cancer behavior. Cancer cell 2002, 1(2):203–209. 10.1016/S1535-6108(02)00030-2
22. Villas-Boâs S, Moxley J, Åkesson M, Stephanopoulos G, Nielsen J: High-throughput metabolic state analysis: the missing link in integrated functional genomics. Biochemical Journal 2005, 388: 669–677. 10.1042/BJ20041162
23. Cangelosi R, Goriely A: Component retention in principal component analysis with application to cDNA microarray data. Biology Direct 2007, 2(2).
24. Bezdek J, Pal N: Some new indexes of cluster validity. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 1998, 28(3):301–315. 10.1109/3477.678624
25. Bartlett M, Movellan J, Sejnowski T: Face recognition by independent component analysis. Neural Networks, IEEE Transactions on 2002, 13(6):1450–1464. 10.1109/TNN.2002.804287
26. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J, Midori A, Hill D, Issel-Tarver L, Kasarskis A, Lewis S, Matese J, Richardson J, Ringwald M, Rubin G, Sherlock G: Gene Ontology: tool for the unification of biology. Nature genetics 2000, 25: 25–29. 10.1038/75556
27. Bauer I, Vollmar B, Jaeschke H, Rensing H, Kraemer T, Larsen R, Bauer M: Transcriptional activation of heme oxygenase-1 and its functional significance in acetaminophen-induced hepatitis and hepatocellular injury in the rat. Journal of hepatology 2000, 33(3):395–406. 10.1016/S0168-8278(00)80275-5
28. Hamadeh H, Bushel P, Jayadev S, DiSorbo O, Bennett L, Li L, Tennant R, Stoll R, Barrett J, Paules R, Blanchard K, Afshari C: Prediction of compound signature using high density gene expression profiling. Toxicological Sciences 2002, 67(2):232. 10.1093/toxsci/67.2.232
29. Heijne W, Slitt A, Van Bladeren P, Groten J, Klaassen C, Stierum R, Van Ommen B: Bromobenzene-induced hepatotoxicity at the transcriptome level. Toxicological Sciences 2004, 79(2):411. 10.1093/toxsci/kfh128
30. Heinloth A, Irwin R, Boorman G, Nettesheim P, Fannin R, Sieber S, Snell M, Tucker C, Li L, Travlos G, Vansant G, Blackshear P, Tennant R, Cunningham M, Paules R: Gene expression profiling of rat livers reveals indicators of potential adverse effects. Toxicological Sciences 2004, 80: 193. 10.1093/toxsci/kfh145
31. Waring J: Development of a DNA microarray for toxicology based on hepatotoxin-regulated sequences. Environmental health perspectives 2003, 111(6):863.
32. Wormser U, Calp D: Increased levels of hepatic metallothionein in rat and mouse after injection of acetaminophen. Toxicology 1988, 53(2–3):323–329. 10.1016/0300-483X(88)90224-7
33. Flaherty K, DeLuca-Flaherty C, McKay D: Three-dimensional structure of the ATPase fragment of a 70 K heat-shock cognate protein. Nature 1990, 346(6285):623. 10.1038/346623a0
34. Tavaria M, Gabriele T, Kola I, Anderson R: A hitchhiker's guide to the human Hsp70 family. Cell Stress & Chaperones 1996, 1: 23. 10.1379/1466-1268(1996)001<0023:AHSGTT>2.3.CO;2
35. Panaretou B, Siligardi G, Meyer P, Maloney A, Sullivan J, Singh S, Millson S, Clarke P, Naaby-Hansen S, Stein R, Cramer R, Mollapour M, Workman P, Piper P, Pearl L, Prodromou C: Activation of the ATPase activity of hsp90 by the stress-regulated co-chaperone aha1. Molecular cell 2002, 10(6):1307–1318. 10.1016/S1097-2765(02)00785-2
36. Lê Cao KA, González I, Déjean S: integrOmics: an R package to unravel relationships between two omics data sets. Bioinformatics 2009, 25(21):2855–2856. 10.1093/bioinformatics/btp515
37. mixOmics [http://www.math.univ-toulouse.fr/~biostat/mixOmics]
38. Bach F, Jordan M: Kernel Independent Component Analysis. Journal of Machine Learning Research 2002, 3: 1–48.
39. Hastie T, Tibshirani R: Independent Components Analysis through Product Density Estimation. 2002.
40. Himberg J, Hyvarinen A, Esposito F: Validating the independent components of neuroimaging time series via clustering and visualization. Neuroimage 2004, 22(3):1214–1222. 10.1016/j.neuroimage.2004.03.027
41. Zou H, Hastie T, Tibshirani R: Sparse Principal Component Analysis. J Comput Graph Statist 2006, 15(2):265–286. 10.1198/106186006X113430
42. Witten D, Tibshirani R, Hastie T: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009, 10(3):515. 10.1093/biostatistics/kxp008
43. Tibshirani R: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 1996, 58: 267–288.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.