MVDA: a multi-view genomic data integration methodology

Background Multiple high-throughput molecular profiles obtained with omics technologies can be collected for the same individuals. Combining these data, rather than exploiting them separately, can significantly increase the power of clinically relevant patient subclassification. Results We propose a multi-view approach in which the information from different data layers (views) is integrated at the level of the results of each single-view clustering iteration. It works by factorizing the membership matrices in a late-integration manner. We evaluated the effectiveness and the performance of our method on six multi-view cancer datasets. In all cases we found patient subclasses with statistical significance, identifying novel subgroups not previously emphasized in the literature. Our method performed better than other multi-view clustering algorithms and, unlike other existing methods, it is able to quantify the contribution of single views to the final results. Conclusion Our observations suggest that integrating prior information with genomic features in the subtyping analysis is an effective strategy for identifying disease subgroups. The methodology is implemented in R and the source code is available online at http://neuronelab.unisa.it/a-multi-view-genomic-data-integration-methodology/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0680-3) contains supplementary material, which is available to authorized users.

In this section the intermediate results of step one of the methodology are reported. As a preliminary step, for all datasets, features with low variance were eliminated. The variance was evaluated for each feature and the cumulative function of the variances was calculated. The cumulative function was then cut at different levels, as shown in the example in Figure 1.
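The variance-based filtering described above can be sketched as follows. This is an illustrative Python sketch (the authors' implementation is in R); the function and variable names are our own, not from the paper.

```python
# Sketch of variance-based feature filtering: rank features by variance,
# then keep the top ones until a chosen fraction of the cumulative
# variance is reached (the "cut" on the cumulative function).

def variance(values):
    """Sample variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cut_by_cumulative_variance(features, level=0.9):
    """Keep the highest-variance features whose variances account for
    `level` of the total variance across all features."""
    ranked = sorted(features.items(), key=lambda kv: variance(kv[1]), reverse=True)
    total = sum(variance(vals) for _, vals in ranked)
    kept, cum = [], 0.0
    for name, vals in ranked:
        if cum / total >= level:
            break
        kept.append(name)
        cum += variance(vals)
    return kept

features = {
    "gene_a": [1.0, 5.0, 9.0, 2.0],   # high variance -> kept first
    "gene_b": [4.0, 4.1, 4.0, 4.05],  # near-constant -> filtered out
    "gene_c": [0.0, 3.0, 6.0, 1.0],
}
print(cut_by_cumulative_variance(features, level=0.9))  # -> ['gene_a', 'gene_c']
```

The cut level (60%, 70%, 80%, 90% in the paper's Figure 1) trades off how many features survive against how much total variability is retained.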
Features were then clustered by correlation in order to remove feature redundancy and reduce their number.
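The idea of collapsing correlated features into a single representative (a prototype) can be illustrated with a simplified greedy variant. This Python sketch is not the paper's R implementation, which clusters features by correlation with the algorithms of Table 1; here a feature is simply dropped when it is strongly correlated with one already kept.

```python
# Illustrative redundancy removal: keep one representative ("prototype")
# per group of strongly correlated features. Greedy simplification of
# correlation-based feature clustering.

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def prototype_features(features, threshold=0.9):
    """Keep a feature only if it is not strongly correlated
    (|r| >= threshold) with an already-kept feature."""
    kept = {}
    for name, vals in features.items():
        if all(abs(pearson(vals, kv)) < threshold for kv in kept.values()):
            kept[name] = vals
    return list(kept)

features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with f1 -> redundant
    "f3": [4.0, 1.0, 3.0, 2.0],
}
print(prototype_features(features))  # -> ['f1', 'f3']
```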
Here we report the evaluation metrics for each feature-clustering algorithm, as described in the Materials and Methods section of the manuscript. For each dataset, the two algorithms reaching the highest metric values were selected.
In Figures 2, 3, 4, 5 and 6 we can see the behavior of the algorithms as the value of K varies. More details are shown in Table 1.

Figure 1 Feature ranking cut: the figure reports, as an example, the feature ranking cut for the gene expression view of the OXF.BRC.1 dataset. Feature ranking was performed with the Cat-t score method on the prototypes obtained with the Pam algorithm. As can be seen, 53 prototypes were needed to reach 60% of the cumulative ranking score, 71 for 70%, 93 for 80% and 126 for 90%. In this example there were 200 prototypes in total.

Table 1 Results after step 1: here we show the results of the two best algorithms used to cluster elements in each view for each dataset. For each dataset the top 20% of features were selected. N is the number of patients in each dataset. Apart from the Pvclust algorithm, which finds the number of clusters automatically, the optimal value of K was calculated as described in the Materials and Methods section (the optimal values are those in the red lines). For each dataset the two best algorithms, i.e. those maximizing the index (bold values), were selected.

Single-view patient clustering
In this section the summary results of single-view patient clustering for each dataset are shown. The clustering algorithm reaching the minimum impurity error percentage is also reported. Table 2 reports which cut was used to reach these results, together with the algorithm (used in the first step of the methodology) from which the prototypes come.
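A common way to compute the cluster impurity error, sketched below in Python, is the fraction of patients that do not belong to the majority true class of their assigned cluster. This is an assumed definition for illustration; the paper's exact formulation is given in its Materials and Methods section.

```python
from collections import Counter

def impurity_error(cluster_labels, true_classes):
    """Fraction of patients not belonging to the majority true class of
    their assigned cluster (an assumed, commonly used definition)."""
    clusters = {}
    for c, t in zip(cluster_labels, true_classes):
        clusters.setdefault(c, []).append(t)
    misplaced = sum(len(members) - Counter(members).most_common(1)[0][1]
                    for members in clusters.values())
    return misplaced / len(cluster_labels)

# Two clusters, one mislabelled patient in cluster 1:
print(impurity_error([1, 1, 1, 2, 2], ["a", "a", "b", "b", "b"]))  # -> 0.2
```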

Final Results
In this section the final results for all datasets are reported. All the results reported for the integration step refer to features obtained with the leave-one-out process. In particular, Table 3 shows the cluster impurity errors and the cluster stability computed for each dataset for the two integrative methods.

Relevant prototypes for each subclass
For each cluster of patients, a set of features coming from different data types was available. Each cluster was analysed in order to find the features that best characterize it. Two kinds of analysis were performed: the former was based on the correlation between patients in the cluster; the latter was related to the distribution of each variable in the samples of a cluster compared to all the other samples. In the first case, the most relevant features for each cluster were identified by evaluating how much the correlation between patients in a cluster decreases when a feature is removed: the relevance of a feature is directly related to the correlation decrease. One feature at a time was removed and the correlation was re-evaluated; at the end, the features were ranked and the top features for each view were selected. Figure 8 shows the most relevant features for each dataset. In the second case, features were ranked for each cluster according to their distribution. The key idea is that a relevant feature has low variance within the cluster and high variance between clusters, so those features for which the difference between the out-of-cluster variance and the within-cluster variance was highest were considered significant. The features were ordered according to this criterion and, for each cluster, the top key features were observed. An example of the results on the TCGA.BRCA dataset is reported in Figure 7.
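The first, correlation-based analysis can be sketched in Python as a leave-one-feature-out loop (an illustration of the idea, not the paper's R code; all names are our own):

```python
# Feature relevance via leave-one-feature-out: a feature is relevant for a
# cluster if removing it decreases the mean pairwise correlation between
# the patients of that cluster.

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def mean_patient_correlation(patients):
    """Average pairwise Pearson correlation between patient profiles."""
    n = len(patients)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(pearson(patients[i], patients[j]) for i, j in pairs) / len(pairs)

def feature_relevance(patients):
    """Relevance of each feature = decrease of the mean within-cluster
    correlation when that feature is removed."""
    base = mean_patient_correlation(patients)
    scores = []
    for f in range(len(patients[0])):
        reduced = [p[:f] + p[f + 1:] for p in patients]
        scores.append(base - mean_patient_correlation(reduced))
    return scores

# Feature 3 carries the shared signal: removing it destroys the correlation.
cluster = [[1.0, 0.0, 1.0, 9.0],
           [0.0, 1.0, 1.0, 10.0],
           [1.0, 1.0, 0.0, 11.0]]
scores = feature_relevance(cluster)
print(scores.index(max(scores)))  # -> 3
```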

Class characterisation by visualisation
For the inspection of patient characteristics in each class, the distribution of each variable in a cluster was compared with its distribution in the other clusters, using box-plots. A box-plot shows the median expression level (solid horizontal bar), the upper and lower quartile range (shaded grey bar), the highest and lowest non-outlier values (smaller ticks joined by dashed lines), and any outliers (open circles). Because of the large number of features, the box-plots of all the variables cannot be visualized clearly, so the features giving the most information on the differences between clusters were identified. The analysis started from the cluster centroids: features were ranked by their variance between centroids, meaning that the greater the variance, the greater the difference between clusters for that feature.

Figure 9 Box-plots of the TCGA.OVG: the box-plots of the TCGA.OVG dataset were calculated on the multi-view clustering results obtained with the matrix factorization approach in semi-supervised mode. For space and clarity reasons, the box-plots of patients were drawn only on the features with the highest variance between the centroids of the different clusters.
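The centroid-variance ranking used to choose which features to plot can be sketched as follows (illustrative Python, with hypothetical names; the paper's implementation is in R):

```python
# Rank features by the variance of their values across cluster centroids:
# a feature whose centroid values differ widely between clusters is the
# most informative one to show in the box-plots.

def centroid(cluster):
    """Mean profile of a list of patient profiles."""
    return [sum(col) / len(col) for col in zip(*cluster)]

def rank_by_centroid_variance(clusters):
    """Return feature indices sorted by decreasing between-centroid variance."""
    cents = [centroid(c) for c in clusters]
    scores = []
    for f, values in enumerate(zip(*cents)):
        m = sum(values) / len(values)
        scores.append((sum((v - m) ** 2 for v in values) / len(values), f))
    return [f for _, f in sorted(scores, reverse=True)]

# Feature 1 separates the two clusters (5.x vs 1.x); feature 0 does not.
cluster_a = [[0.0, 5.0], [0.2, 5.2]]
cluster_b = [[0.1, 1.0], [0.0, 1.2]]
print(rank_by_centroid_variance([cluster_a, cluster_b]))  # -> [1, 0]
```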

Label assigned to clusters
Tumor stage I,II; Tumor stage III; Tumor stage IV

Figure 10 Box-plots of the MSKCC.PRCA: the box-plots of the MSKCC.PRCA dataset were calculated on the multi-view clustering results obtained with the matrix factorization approach in semi-supervised mode. For space and clarity reasons, the box-plots of patients were drawn only on the features with the highest variance between the centroids of the different clusters.

Label assigned to clusters
Tumor stage T1; Tumor stage T2,T3,T4

Figure 11 Box-plots of the OXF.BRCA.1 and OXF.BRCA.2: the box-plots of the OXF.BRCA.1 and OXF.BRCA.2 datasets were calculated on the multi-view clustering results obtained with the matrix factorization approach in semi-supervised mode. For space and clarity reasons, the box-plots of patients were drawn only on the features with the highest variance between the centroids of the different clusters.

The method has been compared with classical single-view clustering algorithms and with early and intermediate integration approaches.

We calculated the classification error and the normalized mutual information (NMI) for each method, between each clustering result and the real patient classification. Given two clustering solutions Cl1 and Cl2, NMI computes the mutual information between the two clusterings, normalized by the cluster entropies. Since we know how the patients are categorized, we computed the NMI between the clustering results and the real patient classifications.
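The NMI computation can be sketched as follows. Note that several normalizations are in common use (arithmetic mean, geometric mean, maximum of the two entropies); the paper only states that the mutual information is normalized "by the cluster entropies", so the mean-entropy form below is one plausible choice, shown for illustration.

```python
# Normalized mutual information between two clusterings, from counts.
from math import log
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    ca, cb = Counter(a), Counter(b)
    return sum((nij / n) * log(n * nij / (ca[i] * cb[j]))
               for (i, j), nij in joint.items())

def nmi(a, b):
    """MI normalized by the mean of the two cluster entropies
    (one of several common normalizations)."""
    h = (entropy(a) + entropy(b)) / 2
    return mutual_information(a, b) / h if h else 1.0

# Identical partitions (up to label names) give NMI = 1:
print(round(nmi([1, 1, 2, 2], ["x", "x", "y", "y"]), 6))  # -> 1.0
```

NMI is invariant to the cluster label names, which is exactly what is needed when comparing clustering results against the real patient classification.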