Data sets
The smoker data [15] contain expression levels for M = 7939 probe sets measured for N = 75 subjects on Affymetrix HG-U133A GeneChips (Affymetrix, Inc., USA). The subjects are divided into 34 current smokers, 18 former smokers and 23 individuals that have never been smoking. The expression data are arranged in a matrix X of size N rows by M columns, and the class information is contained in a binary array Y. Each class is represented by a column in Y, where a value of one assigns an object to that class. X is logarithmically transformed prior to the analysis. No other pre-processing or scaling is performed, that is, we use the data directly as provided by Spira et al. [15].
The spike-in data [20] are obtained from a designed experiment where 14 groups of three transcripts each were spiked in at 14 different concentration levels in three replicates. There is thus a total of N = 42 objects in the experiment, corresponding to specific concentration mixtures. Expression levels were measured on a modified Affymetrix HG-U133A array, and concentrations ranging from zero to 512 pM were arranged in a Latin square setup. The 42 spike-in genes were measured in a constant human background, 22300 probe sets in total. The data are pre-processed using the gcrma-package [31], which is part of the Bioconductor project [32]. As groups of three transcripts are spiked in at the same amounts, the response matrix Y contains 14 concentration profiles, one for each group. Several authors have remarked that some of the background genes show consistent differential expression across the experiments [21–23]. Some of those genes contain sequences which match the spike-ins exactly. We therefore adopt the extended spike-in gene-list given by McGee and Chen [22], yielding a total of 64 entries. The list is identical to the one suggested by Lo and Gottardo [23], except the probe set "205397_x_at" is removed. This probe set has identical sequence to the originally included "205398_at" for 5 of 11 probes. Due to memory limitations, there is a need to filter away some genes before the analysis. We therefore include only those normalised probe sets with a standard deviation above 0.4, which results in a total of M = 14.490 variables in X. Potential alternatives to filtering would have been to apply a leave-two-out cross-validation, or to use fewer Y-variables or components in the analysis.
In the liver toxicity study [24], male rats of the inbred strain Fisher 344 were administered non-toxic (50 or 150 mg/kg), moderately toxic (1500 mg/kg) or severely toxic (2000 mg/kg) doses of paracetamol in a controlled experiment. Necropsies were performed at 6, 18, 24 and 48 hours after exposure. RNA analysis was performed on an Agilent platform (Agilent Technologies, USA), and four biological replicates were obtained for each of the 16 design-points. All spots flagged as bad or not found by the image analysis program are removed. In addition, spots are removed if they have total intensity less than 200, foreground to background intensity less than 1.5, spot diameter less then 25 pixels, or more than 50% of pixels saturated. As the background estimates available in the Agilent image analysis report are local backgrounds measured in pixels near the spots, it is doubtful if this is a useful estimate of the background in the spot [33]. Therefore, no background correction is done. The filtered data are transformed to log2-ratios and normalised using global loess to improve colour balance and remove global intensity effects [34]. Ratios for spots that have been removed in the filtering are imputed using k-nearest neighbours with k = 10 [35]. Two histological and ten clinical variables containing markers for liver injury are available for each object. The expression data are arranged in a matrix X of N = 64 objects and M = 7445 expression levels. For analysis by SAM, the two non-toxic levels are pooled in a group of low dose, and the two toxic levels are pooled in a group of high dose. The data are further divided into groups for each of the four time-points, giving a total of 8 groups to be tested for differential expression. In the bilinear analysis, a matrix Y of 14 columns is constructed instead that contains the design information of dose and time as well as the phenotypic responses available. A description of all the variables in Y are given in figure 8. The Y-variables SDH, BILE, AST and ALT are log-transformed in order to get responses which are more closely bell-shaped.
Bridge-PLS regression
PLSR finds linear combinations (scores) of the original features spanning directions of high covariance between X and Y. In the most common PLSR implementation, the data are deflated each time a component is found, by subtraction of that component from the remaining X-data. This ensures that each score vector is orthogonal to all previous components. It also enables better model fit, as several components are allowed in the modelling of each individual response. Cross-validation is usually applied to find the number of components which yields the best predictions [3]. However, the best prediction model does not necessarily contain the most informative components. If a single response is modelled (i.e. if Y has only one column), qualitatively useful information is expected from the vector of covariances between the variables in X and the response [36, 37]. The covariances are given by the so-called loading weights corresponding to the first component, and this vector is therefore well suited for significance analysis. Subsequent loading weights are less interpretable, as they, due to the deflation step, may be influenced by information in X unrelated to Y. In order to obtain informative weights for multiple responses, it is possible to make one PLSR-model for each individual response. Such an approach was taken e.g. in [9], where cell cycle regulated genes were found using the first loading weight from two single-response PLSR-models.
Bridge-PLSR has no deflation step, and the loading weights are used directly in the model rather than calculating new loadings [14]. If the ridge-parameter γ is set to zero, Bridge-PLSR is simply a singular value decomposition of the covariance matrix X ' Y, with a subsequent prediction relation between X and Y. At the other extreme, when γ = 1, Bridge-PLSR becomes an ordinary principal component regression [3]. Any intermediate value of γ corresponds to some weighted average between those two methods. For significance analysis, and for interpretation of components in relation to Y, the optimal value of γ is zero. The loadings will then collectively explain the same Y-information as the set of loading weights from all one-component, individual response PLSR-models. For multiple correlated responses, Kvalheim [38] recommends orthogonalisation of Y before calculating such single-response models. This ensures that the models will be close to orthogonal to each other as long as the predictive relation between X and Y is good. These steps are all circumvented with Bridge-PLSR, as the Bridge-PLSR loadings are orthogonal by default. In addition, a single, predictive Bridge-PLSR model is more useful and appealing to work with than multiple single-response PLSR-models.
When γ = 0, the maximum number of components equals the sum of independent responses in Y. If all these components are used, the residuals will be orthogonal to Y, and the loadings (= loading weights) will together account for all the Y-related information in X. Because only one component is available for each orthogonal response when γ = 0, the predictive ability of Bridge-PLSR is often less than that of an ordinary PLSR. However, a tuning of the γ-parameter may increase the predictive ability of the model. This is expected for instance in cases where Y is large or noisy compared to X. It is also seen from the response surfaces provided in supplementary information that a larger γ-value may stabilise the decomposition such that early local minima are avoided. However, the response surfaces provided are not validated by an external validation and should not be used for optimisation. Any such tuning of parameters should be accompanied for instance by cross model validation to avoid optimistically biased results [19, 39].
For the analysis in this work, γ is set to the small default value of 0.01. It is seen from the response surfaces in supplementary information that this yields the same predictive ability (down to two decimals) as if γ were zero. Also, the components obtained using the two values of γ are qualitatively equal, such that the influence on the significance analysis is negligible. The γ-parameter is given this value only to enable interpretation of more than two components for the smoker data, although these components are not used in the jack-knife. Bridge-PLSR is algorithmically simple, and the models are validated by cross-validation or test sets in a straight forward manner.
Pseudo-code Bridge-PLSR
-
1.
Start with column-centred X and Y
-
2.
Find by SVD Y = U
y
S
y
-
3.
Construct a representation of Y that has the rank of X: G = (1 - γ) U
y
S
y
+ γ I
-
4.
Find by SVD XTG = USVT
-
5.
Loadings P
A
given by the A largest components in U
-
6.
Scores T
A
= XP
A
-
7.
Y-loadings Q
A
= YTT
A
(T
A
)-1
-
8.
Regression coefficients B = P
A
-
9.
Predictive relation , where the offset
Optional extensions
Even though all the Y-related information is accounted for in Bridge-PLSR, the model may also include variance which is orthogonal to Y. Such orthogonal variance is the reason the smoker-classes are not better separated in figure 4. In a perfectly fitted model, all the samples of either class would appear as a single spot in the score plot, such that the three classes formed a triangle. In general, perfect fit is not a goal, as such a model would describe also measurement noise and other artifacts. However, in cases with much irrelevant X-variation, removal of some unwanted structures might improve the analysis. At least three approaches can be taken to this end. Variable selection attempts to remove the least relevant variables from X, improving the subsequent data-analysis [19]. Target rotation of PLSR loading weights toward orthogonalised responses will improve the fit provided the number of initial components is sufficiently high [37, 38]. Target rotation of two Bridge-PLSR components toward two independent responses will give no improvement, as the Bridge-PLSR loadings already span the directions of interest (when γ = 0). A third approach is to use some sort of orthogonal signal correction, a family of methods which separate components from X which are orthogonal to Y [40, 41]. The method of 02-PLS [40] is especially interesting, as this model formulation is very similar to Bridge-PLSR. In a test using the smoker data, target rotation based on five PLSR components gave similar results as an O2-PLS model where three orthogonal components had been removed (not shown).
Removal of some unwanted X-information may increase the predictive ability and find even more relevant genes. However, all the above procedures for improving the modelling are based on the data themselves, which means that degrees of freedom are in effect removed from X. It follows that the risk of overfit increases the more the data are manipulated. As an example, a perfect fit will always result when a target rotation is based on all data, or when all orthogonal components are removed from the model. Much caution and good knowledge of the data are therefore advised when any of the above procedures are implemented. As these methods are useful only for data with much irrelevant X-variation, we have chosen not to include them further in this work. As was demonstrated with the smoker-data, Bridge-PLSR will find the important directions in the data no matter how much irrelevant X-structure is present.
Significance analysis
A cross-validation using an appropriate dimension reduction method is performed, and loadings from each sub-model with left-out objects is obtained. The perturbed loadings are used to estimate the stability of each feature. The optimal number of components A for prediction, and the overall predictability of the model may be estimated as described in [3]. Only components that lead to a significant decrease in the cross-validated prediction error should then be included. In this work, we concentrate more on significance analysis than on optimisation of prediction models, and interpretation of score plots is therefore at least as important when A is found.
Many of the dimension reduction methods yield orthonormal loadings, in which cases the relative importance of the components are reflected by the length of the scores. For the significance analysis, it is important that the loadings themselves are weighted by the variance of their components. All perturbed loadings from the cross-validation also have to be aligned properly. A bilinear model can be written on the form , where C is any invertible matrix of rank A. This means that the components have full rotational freedom, and it is often observed that a component is flipped 180° in one sub-model relative to another. Components of similar eigenvalues may also be sorted differently in different sub-models. A Procrustes rotation of the perturbed models toward the calibration model will correct for these inconsistencies [3]. We perform un-centred and un-scaled rotations of all loadings prior to significance testing.
Each variable (gene) is tested under the null hypothesis that its corresponding loading-parameter is zero in the A-dimensional space spanned by the components. It follows that the choice of dimension reduction is very important for the outcome of the test. Figures 1 and 2 show that components spanning only irrelevant X-variation will yield irrelevant genes. On the other hand, when the components describe some Y-information of interest, we directly find the features important for the different responses (e.g. figures 4 and 5). Hotelling's T2-test is able to assess the significance of parameters in several dimensions simultaneously. We have modified the test to make it more suited for jack-knife of bilinear model parameters. Some of these modifications are described in [3, 12] for the regular one-sample t-test. The variation of the perturbed sub-models around an estimated true model describes the stability for each gene. Traditionally, the mean of the sub-models represents the centre points, but the median or even the full calibration model may also be used. Because the perturbed models are not independent of the full model, the deviations are summed instead of averaged. The latter point makes a difference if the actual value of the T2-score is interpreted, not if the T2-score is used simply as a ranking of the genes. Finally, shrinkage of the gene-specific variances toward a total variance across genes is included [25]. This is to avoid poorly estimated variances due to the small number of objects in microarray experiments. A shrinkage factor of zero means that every gene is evaluated individually, while a factor of one treats all genes as having equal variances. The actual value of the parameter is a choice that must be made on the basis of the number of replicates and the overall data quality. Poor quality data and few replicates may justify increased shrinkage.
The T2-score for gene i ∈ {1,..., M} is given as
(1)
(2)
(3)
The vector p
i
is of length A and represents the true loadings, β ∈ [0, 1] is the shrinkage factor, is a matrix of size K-by-A holding the loadings from the K sub-models and 1 is a vector of ones. Stotis the mean or median of the scaled, gene-specific variance-covariance matrices across all genes. The scaling is taken from [3] and is close to one in most cases. In this work, p
i
is the mean value over sub-models, the coefficient β = 0.1 and the median of covariances is used to find Stot.
FDR estimation
We adapt the definition of the positive false discovery rate [42], which states that
(4)
FP and TP are the numbers of false and true positives, respectively. A q-value ∈ [0, 1] assigned to each gene gives the minimum FDR that results in a rejected hypothesis for that feature. An FDR significance level cutoff is applied to the q-values to decide which features to call significant. The conditional statement that the sum of positive outcomes must be larger than zero can often be disregarded in practice. It is then possible to estimate the q-values by resampling, using the relation
(5)
Every feature i is associated with a score . The denominator holds the total number of genes at least as significant as this based on the calibration model. The number of false positives at the same significance level is not known, and must be estimated. The estimated count for gene i is found by reshuffling and denoted
i
. The rows of Y are shuffled R times to give several matrices , r ∈ {1,..., R}, with random object-information only. A jack-knife based on one of the randomised matrices is expected to give no true positives, especially if the number of significant features is small compared to the total number of features tested. Here, is found as the median number of significant outcomes over R = 1000 shuffled responses, and all features with smaller than a pre-set significance level α are called significant. Estimation of when no Y is available is difficult, because it is not possible to shuffle X for each feature separately without simultaneously disrupting all other systematic structure in the data. For the smoker-data, we have therefore used the same number of significant features for PCA as we found for Bridge-PLSR. The FDR significance level is set to α = 0.05 for the smoker-data and to α = 0.01 for the spike-in data. Because the relevant signal in the paracetamol data set is very strong, a significance level of α = 0.001 is enough to get a large number of significant features for these data.
SAM and Limma analysis
SAM and Limma are used to identify differentially expressed genes in the smoker data set. Both methods are part of the Bioconductor software (Version 1.9) [32], and we use the packages limma and siggenes for Limma and SAM, respectively. Limma is performed with the functions lmFit and eBayes with default settings. We control Benjamini and Hochberg's FDR at a level of α using the function decideTests. SAM is run with the function sam.dstat with 1000 permutations, the parameter delta = 0.8 and the remaining parameters at default setting.