 Research Article
 Open access
MVDA: a multiview genomic data integration methodology
BMC Bioinformatics volume 16, Article number: 261 (2015)
Abstract
Background
Multiple high-throughput molecular profiles generated by omics technologies can be collected for the same individuals. Combining these data, rather than exploiting them separately, can significantly increase the power of clinically relevant patient subclassification.
Results
We propose a multiview approach in which the information from different data layers (views) is integrated at the level of the results of each single-view clustering iteration. It works by factorizing the membership matrices in a late integration manner. We evaluated the effectiveness and the performance of our method on six multiview cancer datasets. In all cases, we found statistically significant patient subclasses, identifying novel subgroups not previously emphasized in the literature. Our method performed better than other multiview clustering algorithms and, unlike other existing methods, it can quantify the contribution of each single view to the final results.
Conclusion
Our observations suggest that integration of prior information with genomic features in the subtyping analysis is an effective strategy in identifying disease subgroups. The methodology is implemented in R and the source code is available online at http://neuronelab.unisa.it/amultiviewgenomicdataintegrationmethodology/.
Background
Stratifying patients into distinct subgroups can lead to more accurate diagnostic and treatment strategies. Current methods for patient stratification are usually based on gene expression data and apply clustering algorithms to identify groups of patients with similar expression profiles [1–3]. For example, multivariate gene expression signatures have been shown to discriminate between disease subtypes, such as recurrent and non-recurrent cancer types or tumour progression stages [4]. In addition to gene expression data, other omics data types, such as miRNA (microRNA) expression, methylation or copy number alterations, can be used to improve the accuracy of patient stratification models. For example, somatic copy number alterations provide good biomarkers for cancer subtype classification [5]. Data integration approaches that efficiently identify subtypes among existing samples have recently gained attention. The main idea is to identify groups of samples that share relevant molecular characteristics. Integrating multiple omics data types poses several computational challenges, because the data generally have a small number of samples and each data source requires a different preprocessing strategy. Moreover, integration methods have to cope with redundant data and retrieve the most relevant information contained in the different data sources.
Methods for clustering multiple data layers can be grouped into three main categories, namely early, intermediate, and late integration. Early integration methods directly combine all features into a single dataset [6–8]; intermediate integration methods build joint representations of the data given the views [9]; late integration methods process each individual view separately and subsequently combine the results [10, 11]. Late integration methods are often preferred when combining continuous and discrete data, such as CNV and mRNA. Omics data are high-dimensional and subject to non-Gaussian noise. Therefore, integrating them with early or intermediate integration techniques may lead to highly noisy patterns unless appropriate regularization techniques are used, which, however, lead to a very complex multiview learning process.
A number of data integration approaches for patient subgroup discovery were recently proposed, based on supervised classification, unsupervised clustering or biclustering. These methodologies are called multiview learning [12]. Examples of supervised approaches are [13, 14]. Multiview biclustering has been used for subtyping cocaine users [15]. Finally, multiview clustering methodologies have been used intensively, although only rarely on omics data. Multiview clustering methods applied to biological data include iCluster [16] and SNF [9]. iCluster uses a joint latent-variable model to identify the grouping structure in multi-omics data. SNF, on the other hand, uses a network-based approach to combine different omics data (e.g., mRNA expression, DNA methylation and microRNA expression data) to identify relevant patient subtypes. However, none of these multiview clustering methods quantifies the contribution of the individual data sources to the classification output.
In this study, we propose a new computational framework for multiview clustering that combines dimensionality reduction, variable selection, clustering (for each available data type) and data integration methods to find patient subtypes, as described in Fig. 1.
First, a cluster-based correlation analysis is used to reduce the number of features for each data type (genes, miRNAs, proteins, etc.). Second, a rank-based method is employed to select the features based on their ability to separate patient subtypes. Third, clustering is used to identify patient subtypes independently from each reduced dataset. Fourth, integrative clustering methods are used to find more robust patient subtypes and to assess the contributions of the different data types to the identification of each patient subtype. Detailed information on each step can be found in Additional files 1 and 2. We tested our method on large genomic datasets including different omics data types, such as The Cancer Genome Atlas (TCGA) datasets (http://cancergenome.nih.gov/). Our comparison experiments suggest that our method outperforms other existing integration methods, such as TwKmeans [7] and SNF [9].
Results and Discussion
We developed a novel methodology for cluster analysis of multiple genomic data types.
We compared it with two recently developed methods: SNF [9], an integrative clustering algorithm, and TwKmeans [7], an early integration multiview clustering model. Using datasets from 4 different tumor types (Table 1), we evaluated the cluster impurity error, the Normalized Mutual Information [17] and the cluster stability of all the considered algorithms.
The evaluation metrics computed for each dataset are summarized in Table 3. Our unsupervised method shows a mean error of 27.47 %, a normalized mutual information (NMI) of 28 % and a stability of 85 %. Moreover, the error decreases markedly when prior information is used: our method with prior information reduces the error to 6.30 %. The other methods in the comparison study show a higher mean error, from 30.83 % for SNF to 30.93 % for K-means. They also show a lower NMI (the maximum value reached is 26 %, with Ward’s method) and variable stability, from 51 % for K-means to 96 % for partitioning around medoids (pamk).
The class label and the p-value for each cluster obtained after the integrative step are reported in Fig. 2, where the label indicates the subclass to which the patients in the cluster belong, while the p-value measures the statistical significance of the cluster. In the OXF.BRC.1 dataset, the patients are divided into four classes: LumA, LumB, Her2 and Basal. We observed eight relevant clusters, four of which are subclasses of class LumA (cluster 4, p-value 2.50 ×10^{−4}; cluster 5, p-value 8.71 ×10^{−8}; cluster 6, p-value 2.92 ×10^{−3}; cluster 11, p-value 1.97 ×10^{−3}) and two of which are subclasses of class LumB (cluster 2, p-value 3.93 ×10^{−14}; cluster 10, p-value 5.14 ×10^{−3}). We also report the influence of each data view on the final clusters. Although the clusters are obtained by considering all the genomic data views, the information needed to identify a specific subclass can be more relevant in one data type than in the others. For example, clusters 3, 6 and 11 of the OXF.BRC.1 dataset are all labeled as LumA. miRNA expression contributes 100 % to the definition of cluster 11, gene expression mainly determines cluster 3 (57 %), while for cluster 6 the two views are equally important. This could mean, for example, that patients in cluster 11 are characterized mainly by miRNA expression, while patients in cluster 3 are characterized mainly by gene expression.
As shown in Fig. 3, integrative clustering generally performed better than clustering on each single data view. In the TCGA.BRC dataset, the mean cluster impurity is about 26 % when patients are grouped by gene expression and 43 % when they are grouped by their miRNA expression profiles. Combining the gene and miRNA expression profiles yields an error of 26.50 % in unsupervised mode and 9 % in semisupervised mode. Only in a few cases does the patient grouping based on a single data view perform better than the one obtained with multiple data types.
Figure 4 depicts the comparison between the two integration methods, with and without prior information. The matrix factorization based method reaches the highest stability (about 85 %) in all cases. With respect to cluster impurity, the difference between the two methods is almost always negligible. The greatest difference occurs when passing from the unsupervised to the semisupervised approach: the cluster impurity is about 30 % for unsupervised clustering and about 7 % for semisupervised clustering. Therefore, for more accurate subtyping, semisupervised integration was used, which maintains high stability and reduces the classification error with respect to the known classes. In particular, when patient classes are unbalanced, prior information is needed to improve the prediction.
Since we tested different algorithms at each step of our methodology, we aimed to understand whether a common pipeline could be applied to all the datasets. After executing all the analyses, we observed that the best algorithms for the first and second steps strongly depend on the data. We found that K-means is the best algorithm for step 3 for the TCGA.BRC, OXF.BRC.1 and OXF.BRC.2 datasets (Table 2). At the last step, the matrix factorization approach provided lower errors and greater stability than the general linear integration method on the majority of the datasets. This result corroborates our hypothesis that a late integration approach is preferable, because it allows the best algorithm to be used for each data type.
In order to evaluate the performance of the proposed method, we systematically compared it with the TwKmeans and SNF algorithms (Table 3). We did not compare our method with iCluster, as it has been shown to perform worse than SNF [9], which is included in this study. We confirmed that late integration works more efficiently in integrating different views of genomic data. This is due to the large complexity of the views and the differences between them. When views have different numerical and statistical characteristics, it is more convenient to analyze the single data types individually and then combine the results in a multiview analysis. This becomes increasingly important as the number of views involved in the analysis grows.
Evaluation of genes in breast cancer datasets
We selected a robust set of features from each analyzed dataset in order to find common features (Fig. 5 a) and highlight shared patterns by enrichment analysis (Fig. 5 b). Each list of features was obtained by applying the Borda-count rule across the leave-one-out replicates. The enrichment analysis was performed with the DAVID functional annotation tool [18, 19] and graphically displayed with the R package BACA [20]. Figure 5 b reports a chart indicating unique and common Gene Ontology (GO) terms found by DAVID on the different lists. The three lists of features highlight similar GO annotations, involved for instance in the regulation of kinase activity and the regulation of the cell cycle. The list of genes shared between the three breast cancer datasets can be found in Additional file 3.
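The Borda-count aggregation across leave-one-out replicates can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each replicate yields a ranked list of feature names (best first) and that a feature missing from a list receives zero points. The function name `borda_count` and the toy gene names are hypothetical.

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate ranked lists (best first) into one consensus ranking.

    Assumption: a list of length m gives m points to its first item,
    m - 1 to the second, and so on; items missing from a list get 0.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += m - position
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))

# Toy leave-one-out replicates (hypothetical feature names).
replicates = [
    ["geneA", "geneB", "geneC"],
    ["geneB", "geneA", "geneC"],
    ["geneA", "geneC", "geneB"],
]
consensus = borda_count(replicates)
```

Features that rank consistently high across replicates end up at the top of `consensus`, which is what makes the selected list robust to the removal of a single patient.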
Conclusions
In this study, we proposed a methodology for the analysis of multiple genomic data types aimed at patient subtyping. The methodology is composed of four steps that use state-of-the-art algorithms. Furthermore, we systematically searched for the best algorithm for each step on six benchmark datasets. We performed experiments in a late integration fashion, with two different algorithms. Since we were interested in high accuracy in patient subtyping, we used prior information as a new view in the integration process. We found that integrative clustering outperforms the single view approaches on all the datasets. We also showed that our method is stable by executing the clustering on perturbed datasets, removing one patient at a time, and evaluating the normalized mutual information between all the resulting clusterings.
Methods
The proposed methodology for the analysis of multiview biological datasets takes as input n matrices \(M_{i} \in R^{F_{i} \times P}\) for \(i=1, \ldots, n\), where \(F_{i}\) is the number of features (genes, miRNAs, CNV, methylation, clinical information, etc.) and P is the number of patients, together with a vector cl of class labels, and yields a multiview partitioning \(G = \bigcup _{i=1}^{k}(G_{i})\) of the patients. The multiview integration methods also return a matrix C, where c[i,j] is the contribution of view i to the final multiview cluster j.
The approach consists of four main steps as shown in Fig. 1:

1. Prototype extraction: for each view, the features were filtered by variance and clustered in order to find prototypes, reducing the input data dimension.
2. Prototype ranking: the prototypes found in step 1 were ranked based on their ability to separate the classes.
3. Single view clustering: in each view, the samples were clustered using the prototypes created in steps 1 and 2 as features.
4. Integration: single view clustering results were integrated with a late integration approach, in order to obtain the k final multiview metaclusters.
The late integration methodology can be considered as a further step of the proposed data mining pipeline, in which the clustering results of each single view are unified. This approach offers a number of significant advantages: (i) clustering algorithms can be optimally chosen with respect to each single view; (ii) it can be naturally parallelized; (iii) representation issues are avoided since clustering results are the inputs to the integration algorithms.
Prototype extraction
The features with low variance across the samples were eliminated. The data were then clustered with respect to the patients, and the cluster centroids were selected as the prototype patterns. The centroid of each cluster was chosen as the element most correlated with the other elements in that cluster. Different clustering algorithms were used: Pvclust [21], SOM [22], hierarchical clustering with Ward’s method [23], K-means [24], Partitioning Around Medoids [25] and spectral clustering [26].
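A minimal sketch of the prototype selection rule is given here, under stated assumptions: the cluster assignment of each feature is taken as given (it would come from any of the algorithms listed), "most correlated" is read as the highest mean Pearson correlation with the other cluster members, and the variance threshold is arbitrary. The names `select_prototypes` and `min_variance` are hypothetical.

```python
import math

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_prototypes(features, clusters, min_variance):
    """features: dict name -> values across patients.
    clusters: dict name -> cluster id (output of some clustering algorithm).
    Returns dict cluster id -> prototype feature name."""
    # Step 1: drop features with low variance across patients.
    kept = {n: v for n, v in features.items() if variance(v) >= min_variance}
    members = {}
    for name in kept:
        members.setdefault(clusters[name], []).append(name)
    # Step 2: in each cluster, keep the member with the highest mean
    # correlation to the other members (assumed reading of the rule).
    prototypes = {}
    for cid, names in members.items():
        if len(names) == 1:
            prototypes[cid] = names[0]
            continue
        def mean_corr(n, names=names):
            others = [m for m in names if m != n]
            return sum(pearson(kept[n], kept[m]) for m in others) / len(others)
        prototypes[cid] = max(names, key=mean_corr)
    return prototypes

# Toy view: five features measured on four patients (hypothetical values).
features = {
    "g1": [1, 2, 3, 4], "g2": [1, 2, 4, 3], "g3": [2, 1, 3, 4],
    "g4": [4, 3, 2, 1], "g5": [5.0, 5.0, 5.1, 5.0],
}
clusters = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
prototypes = select_prototypes(features, clusters, min_variance=0.1)
```

In this toy example `g5` is removed by the variance filter, and `g1` becomes the prototype of cluster 0 because it is the feature most correlated with both `g2` and `g3`.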
The idea is to evaluate several popular clustering techniques and compare their behaviour on the different views with that of the hierarchical method, which is the standard algorithm used to cluster genes. As noted in [27], cluster analysis is a complex and interactive process whose results change with its parameters. Therefore, each algorithm was executed for different values of K. For each algorithm and for each K, clustering performance was evaluated according to the following evaluation function:
\[VAL = \frac{1}{4}\left(IC + (1 - EC) + (1 - S) + CG\right)\]
where IC is the complete diameter measure, representing the average sample correlation of the less similar objects in the same cluster; EC is the complete linkage measure, representing the average sample correlation of the less similar objects for each pair of clusters; S is the singleton factor and CG is the compression gain. The evaluation function was defined so that its output value is normalized between 0 and 1. The complete diameter and the complete linkage measures were calculated with the R “clv” package [28]. The number of singletons was normalized to the range (0,1) in order to be comparable with the correlation measures. It was defined as \(S=N/(K-1)\), where N is the number of singletons. The compression gain was defined as \(CG=1-(K/N_{elem})\), where K is the number of clusters and \(N_{elem}\) is the number of elements to be clustered.
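As a worked illustration, the evaluation function can be computed as below. This is a sketch under assumptions: the four terms are combined with equal weights (consistent with the requirement that VAL lies between 0 and 1 and is close to 1 for compact, weakly linked clusterings with few singletons and good compression), and IC and EC are supplied as numbers in [0, 1] rather than computed from data. The function name `val_score` is hypothetical.

```python
def val_score(ic, ec, n_singletons, k, n_elements):
    """Clustering evaluation score in [0, 1] (higher is better).

    ic: complete diameter measure (within-cluster correlation), assumed in [0, 1]
    ec: complete linkage measure (between-cluster correlation), assumed in [0, 1]
    Assumption: the four terms are averaged with equal weights.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    s = n_singletons / (k - 1)   # singleton factor S = N / (K - 1)
    cg = 1.0 - k / n_elements    # compression gain CG = 1 - K / N_elem
    return (ic + (1.0 - ec) + (1.0 - s) + cg) / 4.0

# Example: 100 features grouped into 5 clusters, one of them a singleton.
score = val_score(ic=0.8, ec=0.2, n_singletons=1, k=5, n_elements=100)
```

With these inputs S = 0.25 and CG = 0.95, so the score is (0.8 + 0.8 + 0.75 + 0.95) / 4 = 0.825.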
Each clustering algorithm was executed on n different values of K and the corresponding results were evaluated with the function VAL. Values close to 1 indicate a clustering with similar objects within the clusters, weakly linked clusters, few singletons and a good compression rate. A numeric score was then assigned to each K value by averaging the values of the VAL function over the clustering results obtained with the different algorithms. The K with the highest score was chosen and subsequently used to identify the two clustering algorithms with the highest scores for the selected K value. Algorithm 1 reports the computational procedure followed to fine-tune the K values for the cluster analysis.
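The selection of K and of the two best algorithms described above can be sketched as follows. This is an illustrative reading of the procedure, not a transcription of Algorithm 1; it assumes that every algorithm was evaluated on the same set of K values, and the VAL scores in the table are made-up numbers.

```python
def select_k_and_algorithms(val_table, top=2):
    """val_table: dict algorithm -> dict K -> VAL score.
    Returns (best_k, names of the `top` algorithms with highest VAL at best_k)."""
    ks = sorted({k for scores in val_table.values() for k in scores})
    # Score each K by the mean VAL over all algorithms.
    mean_by_k = {
        k: sum(scores[k] for scores in val_table.values()) / len(val_table)
        for k in ks
    }
    best_k = max(ks, key=lambda k: mean_by_k[k])
    # Rank the algorithms by their VAL at the selected K.
    ranked = sorted(val_table, key=lambda a: val_table[a][best_k], reverse=True)
    return best_k, ranked[:top]

# Hypothetical VAL scores for three algorithms and three values of K.
val_table = {
    "kmeans": {10: 0.60, 20: 0.72, 30: 0.65},
    "ward":   {10: 0.58, 20: 0.70, 30: 0.69},
    "pam":    {10: 0.55, 20: 0.66, 30: 0.64},
}
best_k, best_algorithms = select_k_and_algorithms(val_table)
```

Here K = 20 has the highest mean score, and the two algorithms retained for that K are `kmeans` and `ward`.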
Feature ranking
If the number of prototypes after the first step was still high, a further dimensionality reduction by feature selection was performed. Feature ranking was performed by computing the CAT-score [29] and the mean decrease in accuracy index calculated by Random Forests [30]. The parameters of the RF-based classifiers were fine-tuned with the R package rminer [31], which provides a function that first tunes the hyperparameter(s) of a selected model by bootstrap methods and subsequently builds the corresponding supervised data-mining model. For each ranking, the cumulative sum of the ranking scores was computed and four different cuts based on the cumulative values were taken. The cuts retained all the features needed to preserve 60 %, 70 %, 80 % and 90 % of the cumulative value. An example is shown in section Prototype Extraction of Additional file 1. These different groups of features were used to cluster the patients in each single view, with the same single view clustering algorithms used in the previous step. The number of clusters K was set to the number of classes. For each clustering, the error was calculated as the dispersion obtained in the confusion matrix between class labels and clustering assignments. The clustering algorithm that reached the minimum error for each view was then selected. These clustering results were used as the input to the late integration step.
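The cumulative-score cuts can be illustrated with the short sketch below. It assumes non-negative ranking scores (for example importance values), and the function name `cumulative_cuts` and the prototype scores are hypothetical.

```python
def cumulative_cuts(scores, fractions=(0.6, 0.7, 0.8, 0.9)):
    """scores: dict feature -> non-negative ranking score.
    For each fraction, return the shortest prefix of the ranked features
    whose scores sum to at least that fraction of the total."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    total = sum(scores.values())
    cuts = {}
    for fraction in fractions:
        running, selected = 0.0, []
        for feature in ranked:
            selected.append(feature)
            running += scores[feature]
            # Small tolerance guards against floating-point rounding.
            if running >= fraction * total - 1e-12:
                break
        cuts[fraction] = selected
    return cuts

# Hypothetical ranking scores for five prototypes.
scores = {"p1": 40, "p2": 25, "p3": 15, "p4": 10, "p5": 10}
cuts = cumulative_cuts(scores)
```

Each cut defines one candidate feature set; in the procedure described above, every candidate set is then used for single view clustering and the one giving the lowest error is kept.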
Integration
Two late integration methods were used: the matrix factorization approach [11] and a general model for multiview integration [10]. The first method [11] combines information by factorizing the membership matrices of the single view patient clusterings. The method starts by transposing all the membership matrices and stacking them vertically, obtaining the matrix of clusters \(X \in R^{l \times n}\), where l is the total number of clusters in C, the collection of single view clusterings, and n is the number of objects (patients). The objective is to find the best approximation of X such that
\[X \approx PH, \quad P \geq 0, \; H \geq 0\]
The results of the factorization are two matrices: \(P \in R^{l \times k}\), which projects the clusters onto a new set of k metaclusters, and \(H \in R^{k \times n}\), whose columns can be viewed as the memberships of the original objects in the new set of metaclusters. Based on the values in the projection matrix P, we can calculate a matrix \(T \in R^{v \times k}\), where v is the number of views and \(T_{hf}\) indicates the contribution of the view \(V_{h}\) to the f-th metacluster. Based on the values in P it is also possible to find the optimal value of k for the number of multiview clusters wanted in output. The matrix factorization was run with a range of values for k as input, and the algorithm returns the factorization for the best value of k.
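The stacking and factorization steps can be sketched in plain Python as follows. This is a toy illustration under several assumptions, not the implementation of [11]: the factorization uses the standard multiplicative updates for non-negative matrix factorization, k is fixed rather than selected automatically, and the contribution matrix T is obtained by summing the rows of P that belong to each view and normalising each column to 1 (an assumed normalisation). All function names are hypothetical.

```python
import random

def stack_memberships(clusterings):
    """clusterings: one label list per view (one label per patient).
    Returns (X, view_of_row): one binary row per cluster, stacked over views."""
    rows, view_of_row = [], []
    for view, labels in enumerate(clusterings):
        for cluster in sorted(set(labels)):
            rows.append([1.0 if lab == cluster else 0.0 for lab in labels])
            view_of_row.append(view)
    return rows, view_of_row

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(x, p, h):
    ph = matmul(p, h)
    return sum((xi - yi) ** 2 for xr, yr in zip(x, ph) for xi, yi in zip(xr, yr))

def nmf(x, k, iterations=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF: X ~ P H with P >= 0 and H >= 0."""
    rng = random.Random(seed)
    l, n = len(x), len(x[0])
    p = [[rng.random() + 0.1 for _ in range(k)] for _ in range(l)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        pt = transpose(p)
        num, den = matmul(pt, x), matmul(matmul(pt, p), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(p, matmul(h, ht))
        p = [[p[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(l)]
    return p, h

def view_contributions(p, view_of_row, n_views):
    """T[h][f]: share of column f of P carried by the clusters of view h."""
    k = len(p[0])
    t = [[0.0] * k for _ in range(n_views)]
    for row, view in zip(p, view_of_row):
        for f in range(k):
            t[view][f] += row[f]
    col = [sum(t[v][f] for v in range(n_views)) for f in range(k)]
    return [[t[v][f] / col[f] if col[f] > 0 else 0.0 for f in range(k)]
            for v in range(n_views)]

# Two views clustering the same six patients (hypothetical labels).
views = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]]
X, view_of_row = stack_memberships(views)
P, H = nmf(X, k=2)
T = view_contributions(P, view_of_row, n_views=2)
# Each patient is assigned to the metacluster with the largest entry in H.
meta_clusters = [max(range(2), key=lambda f: H[f][j]) for j in range(len(views[0]))]
```

A semisupervised run would, under the same assumptions, simply append the known class labels as an additional label list in `views` before stacking.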
The second method exploits the intuition that the optimal clustering is the consensus clustering shared by as many views as possible. This can be reformulated as an optimization problem where the optimal clustering is the closest to all the single view clusterings under a certain distance or dissimilarity measure. Clusterings are again represented as membership matrices.
Formally, the model can be described as follows: given a set of clustering membership matrices \(M = [M_{1},\ldots, M_{h}] \in R_{+}^{n \times l}\) and a positive integer k, the optimal clustering membership matrix \(B \in R_{+}^{n \times k}\) and the optimal mapping matrices \(P = [P_{1},\ldots,P_{h}] \in R_{+}^{k \times l}\) are given by the minimization:
\[\min_{P, B} \; GI(M \,\|\, BP)\]
where \(GI(M \,\|\, BP)\) is the generalized Kullback-Leibler divergence such that
\[GI(X \,\|\, Y) = \sum_{i,j} \left( X_{ij} \log \frac{X_{ij}}{Y_{ij}} - X_{ij} + Y_{ij} \right)\]
subject to the constraint that both P and B must be nonnegative and that each row of B must sum to one.
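The objective of the second method can be evaluated with a few lines of code. The sketch below assumes the standard form of the generalized Kullback-Leibler divergence, \(\sum_{ij}(X_{ij}\log(X_{ij}/Y_{ij}) - X_{ij} + Y_{ij})\), with the convention that terms with \(X_{ij}=0\) contribute only \(Y_{ij}\); it only evaluates the objective for given B and P and does not perform the minimization. The matrices are toy values.

```python
import math

def generalized_kl(x, y, eps=1e-12):
    """GI(X || Y) for equally sized non-negative matrices (lists of rows).
    eps avoids division by zero when an entry of Y is 0."""
    total = 0.0
    for x_row, y_row in zip(x, y):
        for a, b in zip(x_row, y_row):
            if a > 0:
                total += a * math.log(a / (b + eps))
            total += b - a
    return total

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

# Three patients, two views with two clusters each (n = 3, l = 4, k = 2).
M = [[1, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 1, 0, 1]]
B = [[1.0, 0.0],      # each row sums to one
     [0.5, 0.5],
     [0.0, 1.0]]
P = [[1, 0, 1, 0],
     [0, 1, 0, 1]]
objective = generalized_kl(M, matmul(B, P))
```

The first and third patients are reconstructed exactly, so the only contribution comes from the second patient, on whom the two views disagree; the objective equals 2 log 2.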
By taking the membership matrix of each of the previous clusterings and applying these two late integration methods, a multiview clustering was obtained. Experiments were performed in two ways: the former uses all the prototypes for classification; the latter uses only the most relevant ones for class separability. Each of these approaches was performed in both unsupervised and semisupervised manners.
The semisupervised approach consists of giving a priori information as input to the late integration techniques, via a membership matrix of patients containing the exact information on their classes. This information is combined with the patient memberships from the single view clusterings and integrated into metaclusters. This approach can be useful mainly when the dataset is composed of unbalanced or underrepresented classes.
Derivation of subclasses
Once the multiview clusters were obtained, a subclass was assigned to each one. For each cluster, the number of objects of each class was calculated and the most represented class was assigned as the cluster label. Then, a p-value was calculated with Fisher’s exact test [32] in order to verify the statistical significance of the subclass.
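The labelling and significance test can be sketched as follows. The sketch assumes a one-sided Fisher's exact test for over-representation of the majority class in the cluster, computed as a hypergeometric tail; the text does not state which alternative was used, so this is an assumption. The function name `label_cluster` and the toy data are hypothetical.

```python
from collections import Counter
from math import comb

def label_cluster(cluster_classes, all_classes):
    """cluster_classes: class labels of the patients in one cluster.
    all_classes: class labels of all patients.
    Returns (majority label, one-sided Fisher exact p-value)."""
    label, overlap = Counter(cluster_classes).most_common(1)[0]
    n_total = len(all_classes)
    n_class = sum(1 for c in all_classes if c == label)
    n_cluster = len(cluster_classes)
    # Hypergeometric tail: probability of an overlap at least as large
    # as the observed one if cluster membership were independent of class.
    upper = min(n_class, n_cluster)
    p_value = sum(
        comb(n_class, x) * comb(n_total - n_class, n_cluster - x)
        for x in range(overlap, upper + 1)
    ) / comb(n_total, n_cluster)
    return label, p_value

# Eight patients, four per class; the cluster holds three LumA and one LumB.
all_classes = ["LumA"] * 4 + ["LumB"] * 4
cluster = ["LumA", "LumA", "LumA", "LumB"]
label, p_value = label_cluster(cluster, all_classes)
```

For this toy cluster the label is LumA and the p-value is 17/70, so such a small cluster would not be considered significant.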
Validation
The method was compared with classical single view clustering algorithms and with early and intermediate integration approaches. For each method, cluster impurity, normalized mutual information (NMI) and cluster stability were evaluated. Cluster impurity was defined as the number of patients in the cluster whose label differs from that of the cluster. Given two clustering solutions \(Cl_{1}\) and \(Cl_{2}\), NMI was computed as the mutual information between the two clusterings normalized by the cluster entropies. The NMI was computed between the clustering results and the real patient classifications.
Since prior information was introduced, the stability of the system was tested with a leave-one-out technique. A separate test was run on the first step to generate a stability index for the prototypes of the obtained clusters. Then, steps 2, 3 and 4 were evaluated jointly to assess the stability of the selected features and to evaluate the robustness of the multiview clustering results. Furthermore, a Borda-count [33] method was used to find the final list of features selected over the leave-one-out experiments for the integration step.
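Cluster impurity can be illustrated with the sketch below. Because the results are reported as percentages, the sketch assumes that the mismatches are summed over all clusters and divided by the total number of patients; the function name `cluster_impurity` and the toy labels are hypothetical.

```python
from collections import Counter

def cluster_impurity(assignments, classes):
    """Fraction of patients whose class differs from the majority class
    of the cluster they were assigned to."""
    by_cluster = {}
    for cluster, cls in zip(assignments, classes):
        by_cluster.setdefault(cluster, []).append(cls)
    mismatches = sum(
        len(members) - Counter(members).most_common(1)[0][1]
        for members in by_cluster.values()
    )
    return mismatches / len(classes)

# Seven patients in two clusters (hypothetical data).
assignments = [0, 0, 0, 1, 1, 1, 1]
classes = ["LumA", "LumA", "LumB", "LumB", "LumB", "LumB", "Basal"]
impurity = cluster_impurity(assignments, classes)
```

Cluster 0 is labelled LumA with one mismatch and cluster 1 is labelled LumB with one mismatch, giving an impurity of 2/7.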
At the end of this process, N different clustering assignments were obtained, one for each removed patient. An N×N matrix M was created, where M(i,j) was the normalized mutual information (NMI) between the clustering obtained removing patient i and the clustering obtained removing patient j. Then the mean of the matrix M was calculated, indicating the stability measure of the method.
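The stability measure can be sketched as below. Two points are assumptions: the NMI is normalised by the geometric mean of the two entropies (one of several common variants), and each pairwise NMI is computed on the patients present in both leave-one-out clusterings. The function names and the toy clusterings are hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Mutual information normalised by sqrt(H(a) * H(b))."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log(c * n / (pa[a] * pb[b])) for (a, b), c in pab.items())
    ha = -sum(c / n * math.log(c / n) for c in pa.values())
    hb = -sum(c / n * math.log(c / n) for c in pb.values())
    if ha == 0 or hb == 0:
        # Degenerate case (a single cluster): treated as agreement only
        # if both clusterings are degenerate (a convention, not from the text).
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)

def stability(loo_clusterings):
    """loo_clusterings[i]: dict patient -> cluster, obtained without patient i.
    Returns the mean of the N x N matrix of pairwise NMI values."""
    n = len(loo_clusterings)
    total = 0.0
    for ci in loo_clusterings:
        for cj in loo_clusterings:
            shared = sorted(set(ci) & set(cj))
            total += nmi([ci[p] for p in shared], [cj[p] for p in shared])
    return total / (n * n)

# Four patients; every leave-one-out run recovers the same two groups.
full = {"a": 0, "b": 0, "c": 1, "d": 1}
loo = [{p: lab for p, lab in full.items() if p != removed} for removed in full]
# Cluster identifiers may be permuted between runs without affecting NMI.
loo[1] = {p: 1 - lab for p, lab in loo[1].items()}
score = stability(loo)
```

Because all runs agree up to a relabelling of the clusters, the stability of this toy example is 1.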
The comparison study involved the following methods:

K-means, hierarchical and Pam single view clustering

TwKmeans, an early integration multiview clustering algorithm

SNF, an intermediate integration multiview clustering algorithm
Experiments with single view clustering algorithms were executed in feature concatenation mode: data from all the views were concatenated and used as a new, larger feature space. These experiments were run both on the most variable features of each view and on the most relevant prototypes found after the first and second steps of our approach. Experiments with TwKmeans were executed on all the features without any manipulation of the initial datasets. Experiments with SNF were executed both using all the features and using all the features that belong to the clusters associated with the relevant prototypes.
Dataset collection and preparation
Six datasets were downloaded from The Cancer Genome Atlas (TCGA) (https://tcgadata.nci.nih.gov/tcga/), the Memorial Sloan Kettering Cancer Center (http://cbio.mskcc.org/) and NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/) (see Table 1).
TCGA.BRC
Breast cancer dataset from the TCGA repository (https://tcgadata.nci.nih.gov/tcga/  Breast invasive carcinoma [BRCA]). The samples in this dataset correspond to breast cancer patients with invasive tumors. Genomic data for two views were downloaded: RNASeq and miRNASeq (Level 3). Because Level 3 data are already preprocessed, only the batch effect was removed, using the ComBat method in the R “sva” package [34]. Patients were subsequently divided into four classes (Her2, Basal, LumA, LumB) using the PAM50 classifier [35, 36].
OXF.BRC.1
Breast cancer dataset from a study performed at Oxford University [37]. Data were downloaded from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). Data were available for two views, mRNA and microRNA expression, under the accession numbers GSE22219 and GSE22220. Patients were divided into four classes (Her2, Basal, LumA, LumB) using the PAM50 classifier [35, 36].
OXF.BRC.2
Breast cancer dataset from a study performed at Oxford University [37]. Data were downloaded from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). Data were available for two views, mRNA and microRNA expression, under the accession numbers GSE22219 and GSE22220. Patients were divided into four classes (Level1, Level2, Level3, Level4) using clinical data retrieved from the same source. See Table 4 for the class definitions.
TCGA.GBM
Glioblastoma dataset from the TCGA repository. The samples in this dataset correspond to glioblastoma patients with invasive tumors. The TCGA website was accessed (https://tcgadata.nci.nih.gov/tcga/  Glioblastoma multiforme [GBM]) and publicly available data for two views were downloaded: gene expression and miRNA expression. Clinical data were also retrieved. The patients were divided into four classes: Classical, Mesenchymal, Neural and Proneural, as described in [38].
TCGA.OVG
Ovarian cancer dataset from the TCGA repository (https://tcgadata.nci.nih.gov/tcga/  Ovarian serous cystadenocarcinoma [OV]). The samples in this dataset correspond to patients affected by ovarian serous cystadenocarcinoma tumors. Publicly available data for three views were downloaded: gene expression, protein expression, and miRNA expression. Clinical data were downloaded in order to classify the patients into three categories by clinical stage: the first class includes stages IA, IB, IC, IIA, IIB and IIC; the second class includes stages IIIA, IIIB and IIIC; the third class includes stage IV.
MSKCC.PRCA
Prostate cancer dataset from a study performed at the Memorial Sloan Kettering Cancer Center (http://cbio.mskcc.org/). The samples in this dataset correspond to prostate cancer tumors from patients. The MSKCC Cancer Genomics data portal (http://cbio.mskcc.org/cancergenomics/prostate/data/) was accessed and data for five views were downloaded: clinical data, gene expression, microRNA expression and copy number variation. Patients were divided into two classes by tumor stage, using the clinical data: class one is Tumor Stage I and class two is Tumor Stage II, III and IV. The patients were classified according to a previous study performed on the same dataset [14].
Abbreviations
miRNA: MicroRNA
CNV: Copy number variation
NMI: Normalized mutual information
GO: Gene Ontology
SOM: Self-organizing map
Pam: Partitioning around medoids
CATscore: Correlation-adjusted t-score
DAVID: Database for annotation, visualization and integrated discovery
TCGA: The Cancer Genome Atlas
MSKCC: Memorial Sloan Kettering Cancer Center
GEO: Gene Expression Omnibus
References
Chang HY, Nuyten DS, Sneddon JB, Hastie T, Tibshirani R, Sørlie T, et al. Robustness, scalability, and integration of a woundresponse gene expression signature in predicting breast cancer survival. Proc National Acad Sci U S A. 2005; 102(10):3738–43.
Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, et al. Gene expression predictors of breast cancer outcomes. The Lancet. 2003; 361(9369):1590–6.
West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc National Acad Sci. 2001; 98(20):11462–7.
Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc National Acad Sci. 2001; 98(19):10869–74.
Vang Nielsen K, Ejlertsen B, Møller S, Trøst Jørgensen J, Knoop A, Knudsen H, et al. The value of top2a gene copy number variation as a biomarker in breast cancer: Update of dbcg trial 89d. Acta Oncologica. 2008; 47(4):725–34.
Kailing K, Kriegel HP, Pryakhin A, Schubert M. Clustering multirepresented objects with noise. In: Advances in Knowledge Discovery and Data Mining. Berlin Heidelberg: Springer: 2004. p. 394–403.
Chen X, Xu X, Huang JZ, Ye Y. Tw (k)means: Automated twolevel variable weighting clustering algorithm for multiview data. Knowl Data Eng IEEE Trans. 2013; 25(4):932–44.
Sa VRD. Spectral Clustering with Two Views. In: ICML workshop on learning with multiple views: 2005. p. 20–27.
Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, HaibeKains B, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nat methods. 2014; 11(3):333–7.
Long B, Yu PS, Zhang Z. A general model for multiple view unsupervised learning. In: Society for Industrial and Applied Mathematics  8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics: 2008. p. 822–33.
Greene D. A Matrix Factorization Approach for Integrating Multiple Data Views. Mach Learn Knowl Discov Databases. 2009; 5781:423–38.
Xu C, Tao D, Xu C. A survey on multiview learning. 2013. arXiv preprint arXiv:1304.5634.
Wasito I, Istiqlal A, Budi I. Data integration model for cancer subtype identification using Kernel Dimensionality ReductionSupport Vector Machine (KDRSVM). In: Computing and Convergence Technology (ICCCT), 2012 7th International Conference On. IEEE: 2012. p. 876–80.
Ray B, Henaff M, Ma S, Efstathiadis E, Peskin ER, Picone M, et al. Information content and analysis methods for multimodal highthroughput biomedical data. Sci Rep. 2014; 4:4411. doi:10.1038/srep04411
Sun J, Bi J, Kranzler HR. Multiview singular value decomposition for disease subtyping and genetic associations. BMC Genet. 2014; 15(1):73.
Shen R, Mo Q, Schultz N, Seshan VE, Olshen AB, Huse J, et al. Integrative subtype discovery in glioblastoma using icluster. PLoS ONE. 2012; 7(4):35236.
Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J Mach Learn Res. 2010; 11:2837–854.
Dennis Jr G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, et al. David: database for annotation, visualization, and integrated discovery. Genome Biol. 2003; 4(5):3.
Huang DW, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, et al. David bioinformatics resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 2007; 35(suppl 2):169–75.
Fortino V, Alenius H, Greco D. Baca: bubble chart to compare annotations. BMC Bioinformatics. 2015; 16(1):37.
Suzuki R, Shimodaira H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics (Oxford, England). 2006; 22(12):1540–2. doi:10.1093/bioinformatics/btl117
Vesanto J, Alhoniemi E. Clustering of the self-organizing map. Neural Netw IEEE Trans. 2000; 11(3):586–600.
Ward JH. Hierarchical Grouping to Optimize an Objective Function. J Am Stat Assoc. 1963; 58(301):236–44. doi:10.1080/01621459.1963.10500845
Hartigan JA, Wong MA. Algorithm AS 136: A K-Means Clustering Algorithm. J R Stat Soc. 1979; 28:100–8. doi:10.2307/2346830
Kaufman L, Rousseeuw PJ. Clustering by means of medoids. In: Data analysis based on the L1-Norm and related methods. North-Holland: 1987. p. 405–416.
Ng AY, Jordan MI, Weiss Y. On spectral clustering: Analysis and an algorithm. Adv Neural Inf Process Syst. 2002; 2:849–56.
Handl J, Knowles J, Kell DB. Computational cluster validation in post-genomic data analysis. Bioinformatics (Oxford, England). 2005; 21(15):3201–12. doi:10.1093/bioinformatics/bti517
Nieweglowski L. Clv: Cluster Validation Techniques. 2013. R package version 0.3-2.1. http://CRAN.R-project.org/package=clv.
Ahdesmäki M, Strimmer K. Feature selection in omics prediction problems using cat scores and false non-discovery rate control. Ann Appl Stat. 2010; 4(1):503–19. doi:10.1214/09-AOAS277 arXiv:0903.2003v4.
Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.
Cortez P. Rminer: Data Mining Classification and Regression Methods. 2014. R package version 1.4. http://CRAN.R-project.org/package=rminer
Fisher RA. On the interpretation of χ2 from contingency tables, and the calculation of P. J R Stat Soc. 1922; 85(1):87–94. http://www.jstor.org/stable/2340521. Accessed 17/06/14.
Lin S. Space oriented rank-based data integration. Stat Appl Genet Mol Biol. 2010; 9(1):1544–6115. doi:10.2202/1544-6115.1534
Leek JT, Johnson WE, Parker HS, Fertig EJ, Jaffe AE, Storey JD. Sva: Surrogate Variable Analysis. R package version 3.14.0.
Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A. 2002; 99(10):6567–72. doi:10.1073/pnas.082099299
Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol Off J Am Soc Clin Oncol. 2009; 27(8):1160–7. doi:10.1200/JCO.2008.18.1370
Buffa FM, Camps C, Winchester L, Snell CE, Gee HE, Sheldon H, et al. microRNA-associated progression pathways and potential therapeutic targets identified by integrated mRNA and microRNA expression profiling in breast cancer. Cancer Res. 2011; 71(17):5635–45. doi:10.1158/0008-5472.CAN-11-0489
Verhaak RGW, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010; 17(1):98–110. doi:10.1016/j.ccr.2009.12.020
Acknowledgements
This work has been supported by the European Commission, under grant agreement FP7-309329 (NANOSOLUTIONS).
The results shown here are in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
DG and RT conceived and supervised the study. AS developed the methods, analysed and interpreted the data, and implemented the software. MF, VF and GR participated in the development of the methods and the analysis of the data. All authors participated in drafting the manuscript. All authors read and approved the final manuscript.
Additional files
Additional file 1
It contains one section for each step of the methodology, reporting the tables and figures with the results for each dataset. (PDF 1495 kb)
Additional file 2
Each sheet refers to one of the datasets analysed and reports the results of the single-view clustering of patients. Clustering errors for each algorithm and each feature cut are also reported. (XLSX 54 kb)
Additional file 3
It contains the gene symbols and descriptions of all genes shared between the three breast cancer datasets, as highlighted by the analysis. (DOCX 14 kb)
Rights and permissions
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Serra, A., Fratello, M., Fortino, V. et al. MVDA: a multiview genomic data integration methodology. BMC Bioinformatics 16, 261 (2015). https://doi.org/10.1186/s12859-015-0680-3
DOI: https://doi.org/10.1186/s12859-015-0680-3