Ensemble classifier based on context specific miRNA regulation modules: a new method for cancer outcome prediction
© Zhou et al.; licensee BioMed Central Ltd. 2013
Published: 24 September 2013
Many classifiers constructed from selected gene markers have been proposed to forecast the prognosis of breast cancer patients. However, few of them have been applied in clinical practice because they generalize poorly: the markers selected by one method differ greatly from those obtained by another, and thus such markers often lack discriminative capability on other data sets.
In this work, a new ensemble classifier based on context-specific miRNA regulation modules is proposed to forecast the metastasis risk of cancer patients. First, we defined all of the miRNAs that regulate the same context as a module, which contains the miRNAs and the context they regulate, and applied the CoMi (Context-specific miRNA activity) score to describe a miRNA's effect in a particular context. Then, the miRNA regulation modules with discriminative ability were detected, and each of them was used to build a weak classifier. Finally, using a majority voting strategy, we integrated all weak classifiers into an ensemble classifier, which was applied to forecast the prognosis of cancer patients.
Results on cohorts containing over 1,000 samples showed that the proposed ensemble classifier is superior to three other classifiers based on miRNA expression profiles, mRNA expression profiles and CoMi activity patterns, respectively. Notably, our method also outperforms representative published methods. Moreover, the modules detected from different data sets show great stability (p-value of 6.40e-08). To investigate the biological significance of the selected modules, we carried out case studies, and the results suggest that our method does help to reveal latent mechanisms in breast cancer metastasis.
One context-specific miRNA regulation module can uncover one critical biological process, together with the miRNAs involved in it, that is related to cancer outcome, and several modules together can help to study the biological mechanisms of cancer metastasis. Thus, the classifier built by ensembling multiple classifiers, each constructed from a different context-specific miRNA regulation module, has shown promising performance in terms of both prediction accuracy and generalization.
For breast cancer, many classifiers based on gene signatures have been built to predict the prognosis of patients [1–4], with the aim of ensuring that patients receive appropriate therapy. However, two major problems have occurred in real applications. Firstly, the performance of these classifiers usually declines sharply on data sets different from the one used for their construction; secondly, the published signatures share few genes, which confuses clinicians and makes it hard for them to trust that the signatures are helpful. For instance, two independent studies identified a signature composed of 70 genes and another consisting of 76 genes for forecasting distant metastasis, both of which achieved classification accuracies between 0.6 and 0.7 in their own patient cohorts. However, these two gene sets have only one gene in common. Besides, each of the two gene sets performed badly on the other's data set (accuracy below 0.55). The reason might be that the detected gene sets contain 'passengers' instead of 'drivers', because a large number of passenger signals are buried in the expression profiles of tumor cells. Recently, some researchers have proposed extracting features from functional gene sets to forecast cancer prognosis [7–9]. These gene-set signatures are more stable than the gene signatures; however, they still suffer from low classification accuracy on independent test sets (AUC no more than 0.7).
Because the expression level of a miRNA is not equal to its activity, we proposed in our earlier work a method, called the Context-specific miRNA activity (CoMi activity) estimation method, to estimate a miRNA's activity in a given context (a functional gene set) [11, 12]. The CoMi activity score is calculated as the statistical difference in expression between the target genes and the non-target genes of a miRNA within a particular context (functional gene set). To check whether CoMi activity patterns form a more informative feature space for predicting cancer prognosis, features selected from the CoMi activity patterns were used to construct a classifier that predicts the metastasis risk of cancer patients. As a result, CoMi activity patterns proved to be superior to gene expression profiles in cancer prognosis.
We reasoned that multiple miRNAs may affect the prognosis of cancer patients by regulating a certain biological process, and that several biological processes may jointly affect a patient's prognosis; thus we took a step further in our recent work. Several miRNAs regulating the same biological process (GO term) were defined as a module, and a classifier was then constructed from each discriminative module to forecast the patients' prognosis. After that, the module classifiers with classification capability were integrated into a combined classifier by the majority voting rule. To evaluate the ensemble classifier, we first constructed three classifiers based on miRNA expression profiles (miRNA classifier), mRNA expression profiles (mRNA classifier) and CoMi activity patterns (CoMi classifier), respectively; we then compared the ensemble classifier with these three classifiers and with other representative classifiers reported before. Moreover, the specificity and the stability of the discriminative modules were studied. Finally, we tried to reveal some metastasis mechanisms by investigating the regulatory relations in the selected modules.
To evaluate the methods, we downloaded five normalized breast cancer data sets from NCBI GEO: GSE2034, GSE4922, GSE6532, GSE7390 and GSE11121. In GSE6532, every patient is ER-positive, whereas patients in the other four data sets are either ER-positive or ER-negative. Furthermore, GSE4922 and GSE6532 contain both lymph-node-positive and lymph-node-negative patients, while the other data sets contain only lymph-node-negative patients [12, 17]. We also downloaded another breast cancer data set from TCGA, which contains miRNA and mRNA profiles for 504 samples. The mRNA microarray analysis was performed with Agilent G4502A Genechips, and the miRNA profiles were analyzed on the IlluminaGA miRNASeq platform. In this work, we used the level 3 data for both kinds of profiles.
Table 1. Predictive power of the classifiers on the NCBI data sets (AUC)
Dividing the genes in the given gene list into different groups based on their GOBP term annotations. One group consists of genes related to the same GOBP term.
Based on these steps, the samples' CoMi activity profiles can be obtained by calculating the activity scores of all miRNA-GOBP pairs from the gene expression profiles. The profiles are stored as a two-dimensional array in which each column stands for a patient and each row represents a miRNA-GOBP pair; each element of the array is the activity of the miRNA on the GOBP (Figure 1d).
To reduce noise, the 10% of rows in the matrix with the smallest variance (squared deviation) were discarded; this filtering was based on the CoMi activity profiles in GSE2034. Moreover, if the elements of two rows were identical, only the first row was retained, in order to remove redundancy that may be caused by the prediction tools.
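As an illustration of how such a profile matrix could be assembled, the sketch below scores each miRNA-GOBP pair in each sample by comparing the expression of the miRNA's targets and non-targets inside one GOBP gene group. The Welch t-statistic, the sign convention and all names (`comi_score`, `comi_profile`) are assumptions made for illustration; the statistic actually used is the one defined in the earlier CoMi work [11, 12].

```python
import numpy as np

def comi_score(expr, targets, non_targets):
    """Welch t-statistic between non-target and target expression values.

    expr: 1-D array with one sample's expression, indexed by gene position.
    targets / non_targets: positions of the genes in one GOBP group that
    are / are not predicted targets of the miRNA.
    Sign convention (an assumption): positive when targets are lower.
    """
    t = expr[targets]
    n = expr[non_targets]
    if len(t) < 2 or len(n) < 2:
        return np.nan  # too few genes to estimate a variance
    se = np.sqrt(t.var(ddof=1) / len(t) + n.var(ddof=1) / len(n))
    if se == 0:
        return 0.0  # degenerate case: no spread within either group
    return (n.mean() - t.mean()) / se

def comi_profile(expr_matrix, pairs):
    """Build the activity matrix: rows are miRNA-GOBP pairs, columns are samples.

    expr_matrix: genes x samples array.
    pairs: list of (targets, non_targets) index lists, one per miRNA-GOBP pair.
    """
    n_samples = expr_matrix.shape[1]
    out = np.empty((len(pairs), n_samples))
    for i, (tg, ntg) in enumerate(pairs):
        for j in range(n_samples):
            out[i, j] = comi_score(expr_matrix[:, j], tg, ntg)
    return out
```

Each row of the returned array corresponds to one miRNA-GOBP pair and each column to one patient, matching the layout described above.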
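The two filtering steps can be sketched as follows. This is a minimal illustration that assumes "square deviation" means the per-row variance and that duplicates are rows with exactly equal values; the function name `filter_profile` is hypothetical.

```python
import numpy as np

def filter_profile(profile, drop_fraction=0.10):
    """Drop the lowest-variance rows, then remove exact duplicate rows.

    profile: rows are miRNA-GOBP pairs, columns are samples.
    Returns the filtered matrix and the original indices of the kept rows.
    """
    n_rows = profile.shape[0]
    n_drop = int(n_rows * drop_fraction)
    variances = profile.var(axis=1)
    # Stable sort so that ties are resolved by original row order.
    order = np.argsort(variances, kind="stable")
    keep = np.sort(order[n_drop:])

    seen = set()
    unique_keep = []
    for idx in keep:
        key = profile[idx].tobytes()  # byte pattern of the row values
        if key not in seen:           # keep only the first of identical rows
            seen.add(key)
            unique_keep.append(idx)
    unique_keep = np.array(unique_keep, dtype=int)
    return profile[unique_keep], unique_keep
```

Applying the same kept-row indices to the other data sets would keep the feature space consistent across cohorts.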
One biological process may describe one aspect in cancer prognosis, thus we regard the miRNAs acting on the same GOBP as an entire module (Figure 1e). In the following sections, the name of the GOBP is used as the name of the whole module.
Since multiple miRNAs may regulate a given biological process together and thereby affect the prognosis of patients, some modules may discriminate between the risk groups of cancer patients. Therefore, for each module, a classifier was built to classify the patients into two groups, and only the classifiers with classification capability (AUC ≥ 0.6) were retained (Figure 1f).
We adopted the centroid classifier in our work. The first reason is that the centroid classifier is well suited to microarray data, which are characterized by a large number of features and few samples. The second reason is that the centroid classifier needs no parameter tuning and performs as well as or better than other well-known methods. What is more, the centroid classifier is not prone to overfitting.
If , then is assigned to positive class; otherwise, it is assigned to negative class.
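For orientation, one common formulation of a two-class centroid classifier is sketched below. The weight vector is taken as the difference of the class centroids and the threshold as the midpoint between them; this formulation, the 0/1 label coding and the class name `CentroidClassifier` are assumptions for illustration and may differ in detail from the decision rule used in this work.

```python
import numpy as np

class CentroidClassifier:
    """Two-class centroid classifier (illustrative formulation)."""

    def fit(self, X, y):
        # X: samples x features; y: 1 for positive class, 0 for negative.
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.mu_pos_ = X[y == 1].mean(axis=0)
        self.mu_neg_ = X[y == 0].mean(axis=0)
        # Weight vector points from the negative to the positive centroid.
        self.w_ = self.mu_pos_ - self.mu_neg_
        # Threshold places the boundary halfway between the centroids.
        self.b_ = self.w_ @ (self.mu_pos_ + self.mu_neg_) / 2.0
        return self

    def decision_function(self, X):
        return np.asarray(X, dtype=float) @ self.w_ - self.b_

    def predict(self, X):
        # Positive class when the score is above zero, negative otherwise.
        return (self.decision_function(X) > 0).astype(int)
```

The continuous output of `decision_function` can be used to compute an AUC, while `predict` gives the hard class assignment.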
GSE2034 was used to select the discriminative modules. For each module, we constructed a centroid classifier and used five-fold cross validation to evaluate its classification capability, and we chose the discriminative modules by their AUC (AUC ≥ 0.6). As described above, one module can depict one feature that influences metastasis in cancer patients, so combining all the discriminative modules may reveal a more complete picture of the biological mechanisms behind tumor outcome. Consequently, the weak classifiers were integrated into a combined classifier by the majority voting rule (Figure 1f).
For the purpose of comparison, we also built three classifiers from the TCGA breast cancer data set with miRNA expression, mRNA expression and CoMi activities as features, respectively, using the same centroid classification model. The original data contain both mRNA and miRNA expression profiles, from which the CoMi activity patterns can be computed with our previous method.
To obtain the optimal classifiers, we first ranked the features by the weight vector described above, then constructed centroid classifiers from the top-ranked features, and evaluated them by five-fold cross validation. We varied the number of features from 1 to 200 when building the classifiers, and adopted the classifier with the best performance (measured by AUC) for the comparison.
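The selection and voting steps can be sketched as follows. The threshold of 0.6 comes from the text; the function names, the toy module names other than 'cell adhesion', and the handling of ties (an exact tie goes to the negative class) are assumptions for illustration.

```python
import numpy as np

def select_modules(cv_auc, threshold=0.6):
    """Keep the modules whose cross-validated AUC reaches the threshold.

    cv_auc: dict mapping module name to its five-fold cross-validated AUC.
    """
    return [name for name, auc in cv_auc.items() if auc >= threshold]

def majority_vote(votes):
    """Combine weak classifiers by majority voting.

    votes: (n_classifiers, n_samples) array of 0/1 predictions.
    A sample is called positive when more than half of the weak
    classifiers vote positive; an exact tie is called negative.
    """
    votes = np.asarray(votes)
    positive_share = votes.mean(axis=0)
    return (positive_share > 0.5).astype(int)
```

With an odd number of weak classifiers no tie can occur, so the tie rule only matters when the number of selected modules is even.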
To evaluate our method, three representative approaches to outcome forecasting in breast cancer, yielding four classifiers in total, were compared with it on the same data sets.
The first one is the best-known gene marker classifier in this field. Its authors trained a gene signature composed of 70 genes, which was then used as the marker set, and a classifier was constructed from these 70 genes (denoted as the 70g classifier in this work). In this method, the mean expression vectors of the 70 genes in the two groups (distant metastasis and non-distant metastasis) were calculated as the patterns of the two classes, and each sample was assigned to the group whose pattern it correlated with more strongly, as measured by Pearson's correlation coefficient.
The second one was proposed by Wang et al. In this method, a total of 76 genes were selected as gene markers (the resulting classifier is denoted as 76g). The risk score of each patient was defined as a linear combination of weighted expression values, where each weight is the Cox regression coefficient of the gene [4, 12]. Finally, the patient is classified into the high-risk or low-risk group according to whether the risk score exceeds a threshold.
The last two classifiers use gene set statistics as features. We gathered the functional gene sets from the MSigDB database. A statistic was then calculated for each gene set from the expression levels of its genes in each sample. The Set Centroid and Set Median statistics were used because they were the two best-performing ones [7, 12]. After obtaining the statistics, we selected the optimal sets and used them as features to build a classifier (centroid method) for forecasting an individual's metastasis risk within 5 years. The selection of optimal sets and the construction of the classifier are the same as in the section above.
We adopted a resampling method to test the specificity of the selected modules. First, the same number of modules was randomly selected from all the generated modules. Second, the randomly selected modules were used to build weak classifiers, which were combined into an ensemble classifier on GSE2034; this ensemble was then evaluated on the merged set of the other four NCBI data sets. The process was repeated 10,000 times, and the performances (AUC) of the random ensemble classifiers were used as the null distribution, from which we calculated a p-value that assesses the specificity of the selected modules.
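The assignment rule of the 70g classifier can be sketched as below. This is a minimal illustration of correlation-based assignment to class mean profiles; the function names and the 0/1 label coding are assumptions, and the preprocessing of the original signature study is not reproduced.

```python
import numpy as np

def fit_class_profiles(X, y):
    """Mean expression vector of each class over the signature genes.

    X: samples x genes; y: 1 for distant metastasis, 0 otherwise.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    return X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)

def predict_by_correlation(X, profile_pos, profile_neg):
    """Assign each sample to the class whose profile it correlates with more.

    Note: a sample with constant expression has an undefined Pearson
    correlation (NaN) and falls into the negative class here.
    """
    preds = []
    for x in np.asarray(X, dtype=float):
        r_pos = np.corrcoef(x, profile_pos)[0, 1]
        r_neg = np.corrcoef(x, profile_neg)[0, 1]
        preds.append(1 if r_pos > r_neg else 0)
    return np.array(preds)
```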
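The risk score described above is a weighted sum, which can be sketched as follows. The coefficient values and the threshold are placeholders supplied by the caller; the function names are hypothetical.

```python
import numpy as np

def risk_scores(X, cox_coefs):
    """Weighted sum of expression values, one score per patient.

    X: patients x genes; cox_coefs: one Cox regression coefficient per gene.
    """
    return np.asarray(X, dtype=float) @ np.asarray(cox_coefs, dtype=float)

def classify_by_risk(X, cox_coefs, threshold):
    """Return 1 (high risk) when the score exceeds the threshold, else 0."""
    return (risk_scores(X, cox_coefs) > threshold).astype(int)
```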
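The resampling test can be sketched as below. The `evaluate` callable stands in for the whole pipeline of building an ensemble from a set of modules on GSE2034 and returning its AUC on the merged test set; it and the other names are hypothetical. The add-one correction in the p-value is an assumption, and a plain fraction of null values at least as large as the observed AUC would be an equally plausible reading of the text.

```python
import numpy as np

def resampling_null(all_modules, k, evaluate, n_iter=10_000, seed=0):
    """Null distribution of ensemble performance for k random modules.

    all_modules: list of all generated modules.
    k: number of modules in the real selection.
    evaluate: callable taking a list of modules and returning an AUC.
    """
    rng = np.random.default_rng(seed)
    null = np.empty(n_iter)
    for i in range(n_iter):
        chosen = rng.choice(len(all_modules), size=k, replace=False)
        null[i] = evaluate([all_modules[j] for j in chosen])
    return null

def empirical_p_value(observed, null):
    """Share of random ensembles performing at least as well as observed."""
    null = np.asarray(null)
    return (1 + np.sum(null >= observed)) / (1 + len(null))
```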
Because the two risk groups are severely imbalanced (for instance, GSE11121 contains only 28 high-risk patients compared with 154 low-risk patients), many measures, such as sensitivity (SN), specificity (SP) and accuracy (ACC), cannot adequately characterize the performance of the classifiers.
In this work, the AUC (area under the receiver operating characteristic curve) and MCC (Matthews Correlation Coefficient) are applied as the two main measures to evaluate our classifiers.
A ROC (receiver operating characteristic) curve is created by plotting the sensitivity against one minus the specificity at various threshold settings, and the AUC is the area under the ROC curve, which is widely used to summarize the performance of a binary classifier.
MCC is also used as a main measure of classifier performance in our study, because it is one of the most informative measures when the samples in a data set are seriously imbalanced. The MCC takes into account the true and false positives and negatives, and is described in detail in [12, 26]. Its values range between -1 and 1, with 1 indicating perfectly correct prediction, 0 indicating prediction no better than random and -1 indicating completely opposite prediction.
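The AUC can also be computed without drawing the curve, since it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise formulation, counting ties as one half; the function name `auc_score` is hypothetical.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    y_true: 0/1 labels; scores: classifier outputs, higher = more positive.
    Tied pairs count as half a correct ranking.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-negative score gaps
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return correct / (len(pos) * len(neg))
```

The pairwise matrix needs memory proportional to the number of positive-negative pairs, which is unproblematic for cohorts of a few hundred patients.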
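A short sketch of the standard MCC formula is given below, together with a worked case using the GSE11121 group sizes quoted above. Returning 0 when the denominator vanishes is a common convention and an assumption here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom

# A classifier that calls every GSE11121 patient low-risk
# (28 high-risk, 154 low-risk) looks good on accuracy but not on MCC.
accuracy_all_low = 154 / (154 + 28)
mcc_all_low = mcc(tp=0, tn=154, fp=0, fn=28)
```

In this case the accuracy is about 0.85 while the MCC is 0, which illustrates why accuracy alone is misleading on imbalanced cohorts.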
Table 2. Predictive power of the classifiers on the NCBI data sets (MCC)
Table 1 and Table 2 show the AUC and MCC scores, respectively, of our method and the four other methods on the five data sets (one for training and four for independent testing). Table 1 shows that the two gene-set classifiers are better than the 70g and 76g classifiers (the detailed results of the four published methods are given in Supplementary Tables 4 to 7 in Additional file 1). This result agrees with the previous study. Meanwhile, our method performs better than the others, except on GSE11121, where it is slightly worse than the gene-set classifiers.
As noted above, MCC is the most suitable measure for classifiers applied to imbalanced cohorts such as those in our work, so it was also used as a main measure. The MCC comparison is shown in Table 2. By this measure, our method performs best. In addition, the table shows that all methods except ours performed worse on GSE6532 and GSE4922 than on the other data sets, particularly the methods based on gene signatures. The reason may be that these two cohorts include both lymph-node-positive and lymph-node-negative samples, whereas the others include only lymph-node-negative ones. Nevertheless, even on these two data sets, our method performed stably, which indicates that it is robust.
To sum up, our method is better than the published classifiers, because it has both better classification capability and better generalization.
We investigated the selected modules and found that many of the miRNAs and GOBP terms have been shown to be related to cancer or metastasis. For example, hsa-miR-34a, hsa-miR-34b and the let-7 family, which have been reported to be cancer-related miRNAs, are all included in the selected modules. Furthermore, cell division, DNA repair, apoptosis, regulation of cell cycle, cell death, autophagy and cell migration are all important GO terms related to cancer, and they are also included in the discriminative modules. In addition, the module 'cell adhesion' (124 miRNAs regulating cell adhesion), with an AUC of 0.669, has also been reported to be biologically meaningful.
To validate the specificity of the selected modules, we calculated the significance as described in the Methods section and obtained a p-value of 0.0155, which shows that the selected modules are significantly specific.
As described above, an essential problem of previous studies is that the gene markers extracted from different cohorts lack stability. For instance, the two best-known gene marker sets [3, 4] share only one gene. As a result, the classifiers built on them generalize poorly.
The difference between our work and previous research is that we regard all the miRNAs acting on a biological process as a single marker, each of which captures one feature of the regulatory mechanism in distant metastasis, which results in stability across various cohorts.
The CoMi score can reveal the effect of miRNAs as well as the biological processes they regulate. Therefore, we analyzed the chosen modules to examine whether they can reveal hidden biological mechanisms influencing cancer outcome. As a result, most markers were indeed associated with tumors. In our modules, most biological processes were metastasis-associated, such as apoptosis, autophagy and cell migration. Moreover, we found some biological processes that are related to cancer prognosis, although this relationship has seldom been reported previously.
Interestingly, in most selected modules, the differences in CoMi scores between the two groups are smaller than the most significant ones (data not shown). But when the modules are put together, they show clear discriminative ability. This may indicate that our method concentrates on choosing features that are related to a particular biological process when taken together, rather than features that are individually prognosis-related, so that the chosen modules are more likely to be "drivers" than "passengers". Consequently, the resulting classifier is robust across various data sets.
Since several miRNAs that regulate a biological process can work together to affect the prognosis of cancer patients, and several biological processes may jointly influence cancer prognosis, we proposed to identify markers that contain the miRNAs and the GOBP they regulate, and to build a combined classifier from them to predict cancer prognosis. From the training data set, fifty-five modules were chosen as discriminative ones. Every chosen module was used to build a weak classifier, and all weak classifiers were combined into one integrated classifier by the majority voting rule. The experimental results show that the ensemble classifier performs better than the other methods. Furthermore, the chosen modules have high specificity and are stable across various data sets, which leads to the conclusion that our method is both accurate and robust. The biological analysis also suggests that the chosen modules can reveal hidden metastasis mechanisms in breast cancer.
This work was supported by the National Science Foundation of China (61272274, 60970063), the program for New Century Excellent Talents in Universities (NCET-10-0644), the Ph.D. Programs Foundation of Ministry of Education of China (20090141110026), and the Fundamental Research Funds for the Central Universities (2012211020208).
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 12, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S12.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.