MINT: a multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms

Background Molecular signatures identified from high-throughput transcriptomic studies often have poor reliability and fail to reproduce across studies. One solution is to combine independent studies into a single integrative analysis, additionally increasing sample size. However, the different protocols and technological platforms across transcriptomic studies produce unwanted systematic variation that strongly confounds the integrative analysis results. When studies aim to discriminate an outcome of interest, the common approach is a sequential two-step procedure; unwanted systematic variation removal techniques are applied prior to classification methods. Results To limit the risk of overfitting and over-optimistic results of a two-step procedure, we developed a novel multivariate integration method, MINT, that simultaneously accounts for unwanted systematic variation and identifies predictive gene signatures with greater reproducibility and accuracy. In two biological examples on the classification of three human cell types and four subtypes of breast cancer, we combined high-dimensional microarray and RNA-seq data sets and MINT identified highly reproducible and relevant gene signatures predictive of a given phenotype. MINT led to superior classification and prediction accuracy compared to the existing sequential two-step procedures. Conclusions MINT is a powerful approach and the first of its kind to solve the integrative classification framework in a single step by combining multiple independent studies. MINT is computationally fast as part of the mixOmics R CRAN package, available at http://www.mixOmics.org/mixMINT/ and http://cran.r-project.org/web/packages/mixOmics/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1553-8) contains supplementary material, which is available to authorized users.


S1 PLS-algorithm
Let X and Y be (quantitative) data matrices of size N × P and N × Q, respectively. These matrices can be high-dimensional. PLS regression is an iterative method that constructs successive artificial variables, called components. Each component is a linear combination of X's variables (Y's), in which the vector of weights $a_h$ ($b_h$) is called the loading vector. The PLS-algorithm is described next in a context of regression of Y onto X.
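As an illustration of how components and loading vectors relate, the following is a minimal sketch of one NIPALS-style PLS iteration in plain Python. It is a generic textbook formulation and not necessarily the exact algorithm used by the authors; the function names (`pls_component`, `deflate`) are hypothetical, and X and Y are assumed to be centred lists of rows.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pls_component(X, Y, n_iter=100, tol=1e-10):
    """One PLS component: loadings a, b and components t = X a, u = Y b."""
    Xt, Yt = transpose(X), transpose(Y)
    u = [row[0] for row in Y]              # initialise u with the first column of Y
    a = None
    for _ in range(n_iter):
        a_new = normalise(matvec(Xt, u))   # X loading: direction of X'u
        t = matvec(X, a_new)               # X component
        b = normalise(matvec(Yt, t))       # Y loading: direction of Y't
        u = matvec(Y, b)                   # Y component
        converged = a is not None and max(abs(p - q) for p, q in zip(a, a_new)) < tol
        a = a_new
        if converged:
            break
    return a, b, t, u

def deflate(X, t):
    """Remove from each column of X the part explained by component t."""
    tt = sum(x * x for x in t)
    c = [sum(t_i * row[j] for t_i, row in zip(t, X)) / tt for j in range(len(X[0]))]
    return [[row[j] - t_i * c[j] for j in range(len(row))] for t_i, row in zip(t, X)]
```

Successive components would be obtained by calling `pls_component` again on the deflated matrices; in the regression mode Y is typically deflated with respect to `t` as well.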

S2 Extension of MINT for a regression framework
We assume that the data are partitioned into M groups corresponding to each independent study m: $\{(X^{(1)}, Y^{(1)}), \dots, (X^{(M)}, Y^{(M)})\}$ with $\sum_{m=1}^{M} n_m = N$, where $n_m$ is the number of samples in group m, see Figure S1 for a classification framework. Each variable from the data set $X^{(m)}$ and $Y^{(m)}$ is centered and has unit variance. We write X and Y the concatenation of all $X^{(m)}$ and $Y^{(m)}$, respectively.

Figure S1: Experimental design of MINT, combining M independent studies $X^{(m)}$, $Y^{(m)}$, where $X^{(m)}$ is a data matrix of size $n_m$ observations (rows) × P variables (e.g. gene expression levels, in columns) and $Y^{(m)}$ is a dummy matrix indicating each sample class membership of size $n_m$ observations (rows) × K categories outcome (columns).
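The per-study centering and scaling followed by concatenation can be sketched as below. This is a minimal illustration in plain Python; the function names and the toy values are hypothetical, and the sample standard deviation (denominator n - 1) is an assumption about how unit variance is defined.

```python
import statistics

def scale_study(X_m):
    """Centre each column of one study to mean 0 and scale it to unit variance."""
    cols = list(zip(*X_m))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]   # sample standard deviation (n - 1)
    return [[(x - mu) / sd for x, mu, sd in zip(row, means, sds)] for row in X_m]

def concatenate_studies(studies):
    """Scale each study separately, then stack the rows.

    Also returns the study index of every row, which is needed later to
    build study-specific (partial) components or leave-one-study-out folds.
    """
    X, study_of_row = [], []
    for m, X_m in enumerate(studies):
        scaled = scale_study(X_m)
        X.extend(scaled)
        study_of_row.extend([m] * len(scaled))
    return X, study_of_row

# Two toy studies measured on very different scales (hypothetical values).
study_1 = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
study_2 = [[100.0, 5.0], [300.0, 7.0]]
X, study_of_row = concatenate_studies([study_1, study_2])
```

Scaling within each study, rather than on the concatenated matrix, is what puts the studies on a comparable scale before they are stacked.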
Previously, for a classification framework, Y was a dummy matrix indicating the class membership of each sample. We adapted the MINT algorithm to the regression framework, in which Y is a quantitative data matrix. The MINT regression applies, for example, when modelling a multiple multivariate regression between transcriptomics data sets (X) and clinical parameters (Y).
We added a regularisation parameter $\lambda_2$ to select variables of Y through an $\ell_1$ penalisation. In this case, the function to maximize is that of (3) with an additional $\ell_1$ penalty on $b_h$, where $\lambda_2$ is a non-negative parameter that controls the amount of shrinkage and thus the number of non-zero weights in the global loading vector $b_h$. Both global loading vectors $a_h$ and $b_h$ can be seen in the workflow of the MINT approach (Figure S2). The MINT extension simultaneously addresses aims (1) and (3): it integrates different experiments while selecting the most relevant variables. The MINT pseudo-algorithm in the context of regression is as follows.

Figure S2: Workflow of the MINT algorithm. Black lines represent matrix multiplication; orange dashed lines represent addition and purple dotted lines represent no operation.
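To make the role of the two $\ell_1$ penalties concrete, here is a hedged sketch of how sparse global loadings and study-specific components could be computed for one component. It assumes the generic sparse PLS scheme (alternating soft-thresholding applied to the pooled cross-product of the per-study matrices) and is not a reproduction of the authors' pseudo-algorithm; any study weighting, the deflation step, and the names `mint_sparse_loadings`, `lam1` and `lam2` are assumptions made for illustration.

```python
import math

def soft_threshold(v, lam):
    """Shrink each entry of v towards zero by lam; small entries become 0."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in v]

def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def cross_product(X, Y):
    """Return X'Y as a P x Q list of lists."""
    return [[sum(x_row[j] * y_row[k] for x_row, y_row in zip(X, Y))
             for k in range(len(Y[0]))] for j in range(len(X[0]))]

def mint_sparse_loadings(studies, lam1, lam2, n_iter=50):
    """studies: list of (X_m, Y_m) pairs, each already centred and scaled per study."""
    P, Q = len(studies[0][0][0]), len(studies[0][1][0])
    # Pool the per-study cross-products: M = sum over m of X_m' Y_m
    M = [[0.0] * Q for _ in range(P)]
    for X_m, Y_m in studies:
        C = cross_product(X_m, Y_m)
        M = [[M[j][k] + C[j][k] for k in range(Q)] for j in range(P)]
    b = normalise([1.0] * Q)
    for _ in range(n_iter):
        # lam1 controls the number of non-zero weights in a (variables of X)
        a = normalise(soft_threshold(
            [sum(M[j][k] * b[k] for k in range(Q)) for j in range(P)], lam1))
        # lam2 controls the number of non-zero weights in b (variables of Y)
        b = normalise(soft_threshold(
            [sum(M[j][k] * a[j] for j in range(P)) for k in range(Q)], lam2))
    # Partial components: one score vector per study, built from the global loadings
    partial_t = [[sum(x * w for x, w in zip(row, a)) for row in X_m]
                 for X_m, _ in studies]
    return a, b, partial_t
```

Setting `lam2 = 0` leaves all Y variables in the model, which corresponds to the situation where only variables of X are selected.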
Choosing the regularization parameters $\lambda_h$ and $\gamma_h$ for each of the H PLS-components can be achieved through Cross-Validation (CV). Note that this tuning step can be computationally intensive if feature selection must be achieved on both X and Y data sets concurrently. The computational cost of CV is reduced when tuning a single parameter, as in a classification framework or when X or Y is univariate.
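The difference in tuning cost can be illustrated with a small sketch. It assumes a leave-one-group-out scheme in which each study is held out once (the LOGOCV used elsewhere in this supplement) and a plain grid search; the function names and the candidate grids are hypothetical.

```python
from itertools import product

def logocv_splits(study_of_row):
    """Yield (held_out_study, train_rows, test_rows): each study is left out once."""
    for held_out in sorted(set(study_of_row)):
        train = [i for i, s in enumerate(study_of_row) if s != held_out]
        test = [i for i, s in enumerate(study_of_row) if s == held_out]
        yield held_out, train, test

def n_model_fits(n_studies, grid_x, grid_y=None):
    """Number of model fits needed to tune one component over the candidate grid(s)."""
    grid = list(product(grid_x, grid_y)) if grid_y else list(grid_x)
    return n_studies * len(grid)

# Six samples coming from three studies (hypothetical labels).
study_of_row = [0, 0, 1, 2, 2, 2]
splits = list(logocv_splits(study_of_row))

# Tuning only X: 3 studies x 3 candidate values.
fits_single = n_model_fits(3, [5, 10, 20])
# Tuning X and Y concurrently: 3 studies x (3 x 2) candidate pairs.
fits_joint = n_model_fits(3, [5, 10, 20], [1, 2])
```

The number of fits grows with the product of the two grid sizes when both data sets are tuned concurrently, and this is repeated for each of the H components.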

S3.1 A vs B
A differential expression analysis of A vs B was conducted on each of the three microarray platforms using ANOVA, and revealed an overlap among the platforms of 2717 differentially expressed genes (DEG) with a False Discovery Rate (FDR) of $10^{-4}$ (Benjamini and Hochberg, 1995). This corresponds to 59.6% of all DEG for Illumina, 47.3% for Affymetrix HuGene and 36.5% for Affymetrix Prime (Figure S3). We observed that conducting a differential analysis on the concatenated data from the three microarray platforms without accommodating for batch effects resulted in 2460 DEG, of which only 1669 were part of the common 2717 DEG (61.4% of the common list). The remaining 791 genes were therefore not found DE with an FDR of $10^{-4}$ in at least one study. Thus, the biological effect is most likely confounded with the technical effect for these 791 false positive genes. A Principal Component Analysis (PCA) sample plot confirmed that the major source of variation in the combined data is attributed to platforms, as the samples clustered by platform rather than by outcome class (see Figure S3). Using LOGOCV to choose the optimal number of genes to discriminate the two biological classes with MINT on one component, a single gene, CKS2, was selected. CKS2 was also part of the common list of DEG and was ranked 2 for Illumina, 3 for affyPrime and 362 for affyHugene. Since the biological samples to discriminate are very different, it was not surprising that MINT only selected one gene to achieve the best classification accuracy. However, to further compare the results of our approach to the 2717 common DEG, we manually required MINT to select more genes via the 'keepX' argument that was implemented in the mixOmics package to control the amount of sparsity.
A very high overlap was reported between the genes selected by MINT and the 2717 common genes that are assumed to be true positives: when MINT was asked to select 500 genes, 100% of these were found in the common genes, and this percentage remained high at 84.8% when MINT was asked to select 2717 genes.
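The meta-analysis logic used above (adjust p-values per platform, threshold on the FDR, then intersect the per-platform gene lists) can be sketched as follows. The Benjamini-Hochberg adjustment is the standard step-up procedure; the ANOVA step producing the p-values is not reproduced, and the gene names, p-values and function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def deg_set(p_values_by_gene, fdr):
    """Genes declared differentially expressed at the given FDR on one platform."""
    genes = list(p_values_by_gene)
    adj = benjamini_hochberg([p_values_by_gene[g] for g in genes])
    return {g for g, q in zip(genes, adj) if q < fdr}

def common_deg(per_platform_p_values, fdr):
    """Genes declared DE on every platform (the overlap of the Venn diagram)."""
    return set.intersection(*(deg_set(p, fdr) for p in per_platform_p_values))

# Hypothetical raw p-values for three genes on two platforms.
platform_a = {"g1": 0.001, "g2": 0.5, "g3": 0.002}
platform_b = {"g1": 0.003, "g2": 0.001, "g3": 0.9}
shared = common_deg([platform_a, platform_b], fdr=0.05)
```

Only genes that pass the threshold on every platform end up in the common list, which is why this list is treated as a conservative set of true positives.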

S4 Limitations of common meta-analysis and integrative approaches - breast cancer study
Similarly to the analysis of the human cell types data in Section 2 of the manuscript, we illustrate the shortcomings of (i) a classical meta-analysis and (ii) an integrative analysis using the three independent training datasets described in Table 2. First, a PCA sample plot representation illustrates the need to accommodate for unwanted variation, as both METABRIC studies cluster together but away from the TCGA RNA-seq experiment (Figure S5B). We observed that the unwanted variation accounted for 75% of the total variability of the data. For (i), a DE analysis was performed with ANOVA (FDR < $10^{-6}$). The Venn diagram depicted in Figure S5 highlights a high concordance of DEG between the METABRIC discovery and validation sets but a low concordance between METABRIC and TCGA. The low concordance between the METABRIC sets and TCGA is most likely due to the use of a different commercial platform; conversely, both METABRIC sets used the same platform. Concerning (ii), PLS-DA was not able to discriminate any of the four subtypes of breast cancer, although a diagonal trend can be seen from bottom right with Basal samples to top left with Luminal samples (Figure S5C). The latter result highlights the limitation of integrative analysis in this challenging analytical context.

S5 MINT outperforms state-of-the-art methods
Details on the methods are available in the manuscript and results are provided in Figure S6 for the stem cells data and Figure S7 for the analysis of the breast cancer data in which the PAM50 genes were removed. The classification accuracy results were similar when the PAM50 genes were included (not shown).

Figure S6: Stem cells data. Balanced Error Rate (BER; the lower the error rate, the better the classification performance; top) and classification error rate per class (the higher the classification accuracy the better, bottom), for both training and independent test sets.
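For reference, the Balanced Error Rate reported in Figure S6 is the average of the per-class error rates, so that small classes weigh as much as large ones. A minimal sketch, with a hypothetical function name and toy labels:

```python
def balanced_error_rate(y_true, y_pred):
    """Return (BER, per-class error rates); BER is the mean of the per-class errors."""
    classes = sorted(set(y_true))
    per_class_error = {}
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(1 for i in idx if y_pred[i] != c)
        per_class_error[c] = wrong / len(idx)
    ber = sum(per_class_error.values()) / len(classes)
    return ber, per_class_error

# Unbalanced toy example: 8 Fibroblast samples and 2 hESC samples.
y_true = ["Fibroblast"] * 8 + ["hESC"] * 2
y_pred = ["Fibroblast"] * 8 + ["Fibroblast", "hESC"]
ber, per_class = balanced_error_rate(y_true, y_pred)
```

In this toy example only one sample out of ten is misclassified, yet the BER is 0.25 because half of the minority class is wrong.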

S6 Application to the stem cells data

S6.1 Meta-analysis
Ensembl ID         Symbol
ENSG00000091972    CD200
ENSG00000092421    SEMA6A
ENSG00000121570    DPPA4
ENSG00000154639    CXADR/CAR
ENSG00000178445    GLDC

Table S2: Meta-analysis of the stem cells data. Genes commonly declared as Differentially Expressed with an FDR < $10^{-5}$ for the human cell types of eight studies referenced in Table 1.

S6.2 Signature identified by MINT
The MINT approach was performed with 2 components that were tuned by LOGOCV; 2 and 15 genes were selected on the first two components, respectively (Table S3).

S6.3 Study-specific output of MINT
Figure S9 depicts study-specific graphical outputs for MINT. As displayed in Figure 1E of the manuscript, all experiments gave satisfactory classification accuracies for all cell types except the Bock and Takahashi studies.

Figure S9: Study-specific results of MINT for the three cell types of eight studies referenced in Table 1. The MINT model was obtained with two and fifteen genes on the first two components, respectively (Table S3). [Panel titles: Bock, Briggs, Chung, Ebert. Legend: Outcome (Fibroblast, hESC, hiPS).]

S7 Application to the breast cancer data
Two MINT analyses were performed, with or without the PAM50 genes. In a first analysis (including all genes), MINT selected 30, 572 and 636 genes on each of the three components, respectively, using LOGOCV tuning. Genes selected on component 1 are listed in Table S5.
In the second analysis (excluding PAM50 genes), the MINT approach selected 11, 272 and 253 genes on the first three components that were tuned by LOGOCV. See Table S5 for a summary of the first component genes. The expression levels of the 11 genes selected on the first component highlight a gradient from Basal to Luminal (Figure S10).
In addition to the details provided in the manuscript, other selected genes in the molecular signature that may have biomarker potential are TBC1D9 (Andres et al., 2013, 2014), DNALI1 (Parris et al., 2010), AFF3 (Lefevre et al., 2015) and CCDC170 (Yamamoto-Ibusuki et al., 2015).

Table S5: Genes selected by MINT on the first three components tuned by LOGOCV, for the three independent training sets of the breast cancer data (excluding PAM50 genes) referenced in Table 2.

Figure S10: Breast cancer data. Boxplot of the standardised gene expression for the 11 genes identified by MINT (data is centered and scaled per study), for each of the four subtypes of breast cancer (Table 2).

S8 OCT4 expression
OCT4 is the main known marker for undifferentiated cells, so we questioned why OCT4 was not selected by MINT on the first component. Figure S12 confirms that OCT4 is DE between Fibroblasts and both hESC and hiPSC. However, both LIN28A and CAR are more DE, and this was confirmed by a two-sided t-test (Figure S12). We can thus conclude that OCT4 is a discriminant gene, but less informative than LIN28A and CAR, as those seem better suited candidates to discriminate differentiated cells. Further wet laboratory work would be needed to assess the potential and practicality of either LIN28A or CAR. Figure S12 shows the expression of LIN28A, CAR and OCT4 per study and highlights heterogeneity among the studies.
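A comparison of this kind can be sketched with a two-sample t statistic computed per gene. The text only states that a two-sided t-test was used; the Welch (unequal variance) variant below is an assumption, the p-value step is omitted because it requires the t distribution, and the expression values are hypothetical.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch two-sample t statistic and its approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    se2 = va / na + vb / nb                      # squared standard error of the difference
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical standardised expression values for two genes.
fibroblast_gene_1, stem_gene_1 = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
fibroblast_gene_2, stem_gene_2 = [1.0, 2.0, 3.0], [7.0, 8.0, 9.0]
t1, df1 = welch_t(fibroblast_gene_1, stem_gene_1)
t2, df2 = welch_t(fibroblast_gene_2, stem_gene_2)
```

A larger absolute t statistic at comparable degrees of freedom indicates a stronger separation between the two groups, which is the sense in which one gene can be "more DE" than another.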