MicroRNA based Pan-Cancer Diagnosis and Treatment Recommendation

Cheerla, Nikhil; Gevaert, Olivier

doi:10.1186/s12859-016-1421-y

Research article
Open access
Published: 13 January 2017

MicroRNA based Pan-Cancer Diagnosis and Treatment Recommendation

BMC Bioinformatics volume 18, Article number: 32 (2017) Cite this article

5860 Accesses
50 Citations
19 Altmetric
Metrics details

Abstract

Background

The current state-of-the-art in cancer diagnosis and treatment is not ideal; diagnostic tests are accurate but invasive, and treatments are “one-size fits-all” instead of being personalized. Recently, miRNA’s have garnered significant attention as cancer biomarkers, owing to their ease of access (circulating miRNA in the blood) and stability. There have been many studies showing the effectiveness of miRNA data in diagnosing specific cancer types, but few studies explore the role of miRNA in predicting treatment outcome.

Methods

Here we go a step further, using tissue miRNA and clinical data across 21 cancers from the ‘The Cancer Genome Atlas’ (TCGA) database. We use machine learning techniques to create an accurate pan-cancer diagnosis system, and a prediction model for treatment outcomes. Finally, using these models, we create a web-based tool that diagnoses cancer and recommends the best treatment options.

Results

We achieved 97.2% accuracy for classification using a support vector machine classifier with radial basis. The accuracies improved to 99.9–100% when climbing up the embryonic tree and classifying cancers at different stages. We define the accuracy as the ratio of the total number of instances correctly classified to the total instances. The classifier also performed well, achieving greater than 80% sensitivity for many cancer types on independent validation datasets. Many miRNAs selected by our feature selection algorithm had strong previous associations to various cancers and tumor progression.

Then, using miRNA, clinical and treatment data and encoding it in a machine-learning readable format, we built a prognosis predictor model to predict the outcome of treatment with 85% accuracy. We used this model to create a tool that recommends personalized treatment regimens.

Both the diagnosis and prognosis model, incorporating semi-supervised learning techniques to improve their accuracies with repeated use, were uploaded online for easy access.

Conclusion

Our research is a step towards the final goal of diagnosing cancer and predicting treatment recommendations using non-invasive blood tests.

Background

MicroRNAs (also known as miRNAs) are a family of non-coding regulatory RNA genes that are involved in RNA silencing and downregulation of gene expression. They accomplish this by binding to mRNA, preventing translation (translational silencing) and speeding up deadenlyation (Poly-A tail breakdown). The first miRNA was discovered in 1993 in the roundworm Caenorhabditis elegans [1]. Since then, our knowledge of miRNA has grown exponentially. MiRNA can function to prevent cell division and drive terminal differentiation. An implication of this is that downregulation of some miRNAs might play a causal role in the generation or maintenance of tumors [2, 3]. Calin et al. [4] first reported miRNA’s role in cancer, after finding frequent deletions and dysregulation of two miRNAs in lymphocytic leukemia. Since then, many studies have established links between miRNA and oncogenesis. MiRNA expression profiles have previously been used to classify cancer and normal tissues and as biomarkers for prognosis [5–8].

In addition to suggesting the presence of cancer, miRNA expression also contains valuable information about the type, severity and prognosis of tumors [9]. This implies that miRNA profiles could potentially be used to recommend tumor-specific treatments instead of the “one-size-fits-all’ treatments that are often recommended by doctors.

In multiple studies, miRNA profiles of tumor tissue and adjacent normal tissue have been used to create predictive models for specific tumors. These studies focused on a relatively small number of samples. Lu et al. [6], after analyzing 217 mammalian miRNAs from 333 tumor samples consisting of tumors of lung, lymphoma and bone, discovered that miRNA expression profiles of tumors cluster based on developmental lineage. They used hierarchical clustering analysis to show that cancers of epithelial origin are clearly differentiated from cancers of the gastrointestinal tract. They further used clustering analysis among the samples of acute lymphocytic leukemia (ALL) to discover that different ALL subtypes tend to cluster together. Rosenfeld, Nitzan et al. [10] studied miRNAs from tumor and metastatic cancer to prove that they could predict the cancer tissue of origin (22 tumor types into 7 different tissue types) using KNN and decision tree classifiers. They achieved 89% classification accuracy during validation. Both these studies are limited by the number of samples (less than 400 samples) and the focus on classifying tissue of origin, where multiple tumors originating from the same tissue are classified as one class. Identifying the actual cancer type (as opposed to the tissue of origin) with as few miRNA as possible helps make it possible to diagnose specific cancer types from miRNA expression levels of circulating tumor cell miRNA in the blood [11, 12].

Although extensive research has been done on expression levels of specific miRNA to predict the prognosis of diseases like cancer, very few researchers explored the next natural topic: the role of miRNA expression profiles from the tumor tissue in predicting how successful a cancer treatment will be. Villaruz et al. [13] studied the effectiveness of miRNA as biomarkers in predicting the outcome of melanoma for patients who took a specific treatment. However, there is not enough previous pan-cancer analysis of predicting treatment effectiveness using miRNA data. With the advent of the TCGA pan-cancer analysis project, vast amounts of molecular and clinical data from tumorous and normal tissues, spanning over 34 different types of cancers, are at the fingertips of researchers [14].

In this study, we used the vast amount of tissue miRNA data available for different cancer types in the TCGA pan-cancer project to build a parsimonious multivariate model that can accurately classify cancer type. Our results show that the support vector machine based model could classify individual cancer types with an overall accuracy of 97.2% with most of the per-cancer sensitivities well above 95%, as well as classifying with above 80% sensitivity for most cancers on an independent validation set. Next, we investigated whether prognosis could be predicted given clinical information, the course of treatment taken and miRNA expression levels at the time of diagnosis. By combining miRNA expression data with clinical information in a machine learning-parseable format, we were able to predict the prognosis with 85% accuracy. Because this result was 8% more accurate than the accuracy without miRNA data, we showed that miRNA data does make a significant impact on prognosis. Finally, we built a web-based tool where a user can upload miRNA expression and clinical data and receive an automated diagnosis. We added semi-supervised learning to our web-based tool, and the tool automatically re-trains the classifiers with new uploaded data.

Methods

Data preprocessing and modeling

We used the R language (version 3.2.3) and CRAN package ‘caret’ version 6.0 for preprocessing and building classifier models [15]. The WEKA 3.6.13 feature suite was used for feature selection algorithms [16]. For both cancer classification and treatment recommendation, we used the normalized miRNA expression profiles from TCGA pan-cancer analysis project [17]. TCGA database has miRNA samples for 34 different tumor types obtained from GPL11154 (Illumina Hiseq 2000) platform. However, some tumor types were discarded due to low sample sizes (less than five samples per tumor type). After this selection, 5229 samples from 21 different tumor and normal tissues remained (see Table 1). These 21 cancer types span across various organs and systems. A few of these cancer types (Lung Squamous Cell Carcinoma, Stomach Adenocarcinoma, Liver Hepatocellular Carcinoma, Thyroid Carcinoma, Kidney Chromophobe and Kidney Renal Papillary Cell carcinoma) have significant number of normal tissue samples as well in the dataset. But, many cancer types either have very few corresponding normal samples or no normal samples. Also, the samples have missing miRNA values. These missing miRNA value estimation was done using a k-Nearest-Neighbor imputation algorithm [18]. A threshold value of 0.2 was used to eliminate miRNA features that were not present in more than 20% of samples. Finally, the miRNA data was log transformed and standardized to zero mean and unit variance.

Table 1 Distribution of the cancer and normal samples in the dataset used to build the predictive model (TCGA cancer classifier)

Full size table

Building a cancer classifier

We built several multi-class classifiers (SVM linear, SVM radial, Random Forest, Linear Discriminant Analysis and Naïve Bayes) and validated them using 10-fold cross validation, repeated ten times. These classifiers are trained to classify 21 different cancer types and the normal tissue type from the miRNA expression values. During the training (for each fold), we used two-stage feature selection algorithm. In the first stage, we used a correlation based feature selection algorithm [19] to remove features that showed no meaningful Pearson correlation with the output. During the second stage, recursive feature elimination algorithm (from the caret package in R) was used to narrow down the feature set to 60 features while picking a strong feature set that would not compromise the classification performance.

The classifier with highest overall accuracy, aggregated across all validation tests, was selected and analyzed for misclassifications. Overall accuracy (or the classifier accuracy) is defined the ratio of all correctly classified instances over the total instances during the testing phase of each iteration in the 10-fold cross validation. Similarly, per-class accuracy is defined as the ratio of the correctly classified instances of a particular class to the total number of instances of that class. Next, we used the kappa statistic -- as a measure of the agreement as compared to the likely agreement due to chance alone -- to evaluate the classifiers. We also measured per class training sensitivities and specificities of the classifier that had the highest overall accuracy and kappa statistic.

Since the feature selection is done during training phase of each of the ten iterations in 10-fold cross validation, features selected could differ between iterations. We took the common features from all the iterations and analyzed the features using the IPA core analysis tool (IPA ® QIAGEN, Redwood City, www.qiagen.com/ingenuity) to find the diseases/functions and networks enriched by this subset. This analysis gives insight into whether our classifier performance is driven by tissue specific miRNAs or cancer/tumor specific miRNAs.

Tissue/Organ/Origin prediction

Studies have shown that miRNA from tissues with similar anatomical locations correlate well [20]. Thus, we further investigated whether miRNA expression values from cancerous tissues also correlate along anatomical locations and if the correlation follows tissue development hierarchies. To formalize this, each cancer site was assigned to leaf nodes in the embryonic development tree [21] (Fig. 2). Then, by walking up the tree, classification was compared at three stages of the developmental tree. For example, stomach adenocarcinoma was classified as a cancer of GI tract at stage 3, a cancer of gut tube origin at stage 2 and a cancer of endoderm origin at stage 1.

Independent validation

Next, for independent validation of our SVM based classifiers, we obtained three miRNA datasets from GEO database [22] and validated the performance using those datasets. Due to the lack of open access datasets for the majority of cancer types in our 21-cancer SVM classifier, we had to use cancer types that our model was not trained on. We used our Stage III classifier, which classifies these cancers by tissue/organ type, with these datasets. The GSE2564 series uploaded by Jun et al. [6] contains miRNA expression values for 200+ cases spanning across 12 different cancer types. The GSE68839 series uploaded by Vergani et al. [23] contains miRNA data for eight melanoma samples. The GSE21847 series uploaded by Montes-Moreno et al. [24] had 29 samples of B-cell Lymphoma. We used log2 normalized expression values from these data sets and further standardized the datasets individually to have zero mean and unit variance.

Prognosis prediction

The TCGA database contains clinical data for some, but not all, patients. This data includes clinical data such as ethnicity, gender, the type of treatments and drugs used, and a description of the outcome (e.g. whether the cancer recurred or not). Out of the 5229 patients, complete clinical information is available for about 710 patients. When selecting the samples for building the treatment recommendation tool, only the samples with the below criteria were retained: (1) contains clinical information: gender, age and ethnicity, the type of treatment received and if the tumor went under remission after 300 days or more. (2) There are at least five samples for each cancer type (3) There are at least five samples for each unique treatment.

This data set is further divided into training and testing with 75–25 split. All the unique treatments from the training samples were extracted. In order to represent the treatment in a form that machine learning models could easily interpret, treatment data items were first converted into a canonical bag-of-strings form of “general treatment, specific method;” and then sorted in alphabetical order and combined (Additional file 1). First, we used the Levenshtein/edit distance to define a similarity metric between patients with different treatments. Next, we mapped these distances into 3-dimensional space. A 3-dimensional point P_k was attached to each treatment k, and an attempt was made to minimize the difference between 3D distances and the “true” (Levenshtein) distance through an optimization formula: \( min\;\left(\sum_{i\ne j}\left( ldist\left(i,\kern0.24em j\right) - dist\left({P}_i,\;{P}_j\right)\right)\right) \). R provides a solution to problems of this form (often called Cox scaling problems) with the package “cmdscale”. This analysis was done on the unique treatment list to obtain three numeric coordinates for each treatment. Then, a support vector machine model was trained on a combination data set featuring miRNA data, clinical data, cancer type, and treatment type (in 3D coordinate form) to predict the clinical outcome (i.e. recurrence or remission status after at least 300 days). The SMOTE algorithm [25] was used to correct class imbalance caused by low number of samples where disease reoccurred compared to remission samples. The model with highest accuracy and per-class sensitivities on the test data set was chosen as the prognosis predictor tool. The prognosis predictor model output is a score (between 0 and 1) that indicates the probability of remission.

Web tool development

Next, we developed a treatment recommendation algorithm based on our prognosis model. The algorithm works as follows:

Given miRNA information and clinical data for a patient, this algorithm cycles through all possible treatments from the unique treatment lists obtained in the previous analysis, and for each treatment, predicts the probability of remission using the prognosis predictor tool. Then, the treatment that results in the highest probability of remission is picked as the recommended treatment.

Then, using the RStudio shiny package (version 0.12) [26], a web application was built. In this web application, our SVM classifier and the treatment recommendation tools were hosted on a server and the web application runs in a browser. Using the user interface for this application, user can upload clinical information using dialog boxes, with miRNA information in a CSV file.

The data will be sent to the servers where the application pre-processes the data, including imputing missing miRNA values (removing miRNAs with more than 20% missing), then uses the saved classification and treatment recommendation models to diagnose cancer type and makes three different treatment recommendations. A semi-supervised learning algorithm with the potential to improve the accuracy of both models was implemented. In this algorithm, the newly uploaded training samples classified with a probability over a threshold p = 0.95 are periodically added to the database, and the models have the opportunity to retrain and adjust based on these samples.

Results and discussion

Cancer diagnosis and classification

Out of the 2588 miRNA features, 2118 miRNA were removed during pre-processing as more than 20% of the samples had missing values for these miRNA leaving a feature set of 470 miRNAs. Most of the missing miRNA values were systemically missing across all cancer types. Using this feature set, different classifiers (Naïve Bayes, Logistic Model Tree, KNN and Linear SVM) were built and the overall accuracy and kappa statistics were compared (Fig. 1).

The radial SVM classifier was chosen as it reported the highest accuracy, 97.2%. This classifier also has the highest Kappa statistic of 0.97. Per-cancer sensitivities and specificities of this classifier were listed in Table 2. Table 3 contains the confusion matrix of this classifier. Most of the sensitivities are above 90%, except for ESCA, esophageal adenocarcinoma and CHOL, cholangiocarcinoma. Further inspection of the confusion matrix in Table 3 revealed that, 11% of CHOL cases were classified as pancreatic adenocarcinoma (PAAD). 12% of the ESCA cases were classified as Stomach Adenocarcinoma (STAD) and 6% of the ESCA cases were classified as liver hepatocellular carcinoma (LIHC). The STAD, CHOL and ESCA cancers are from the gastrointestinal (GI) tract and share a common developmental origin (Fig. 2). Similarly, PAAD or cancers in the pancreas and LIHC are related to the other three cancers due to a similar developmental parent (i.e. gut tube).

Table 2 SVM classifier performance

Full size table

Table 3 Confusion matrix for the SVM classifier

Full size table

Predictably, the results improve dramatically when walking up the embryonic origin tree diagram and classifying at different stages. Table 4 lists the overall sensitivities when classifying at different stages of the embryonic development tree. Sensitivities slowly increase at the higher levels with accuracies reaching to 100% at the germ layer (Stage 1). For example, the ESCA (esophageal cancer) class is detected with 82% sensitivity initially, but when it is combined with other gastrointestinal cancers to form the subtype GI tract, the sensitivity jumps to 0.98 with Stage III classifier. Then, when classified as a gut tube cancer, the sensitivity increases even further with the Stage II classifier. These increases are more than would be expected by chance aggregation alone and hint at a relation between cancer differentiation and the embryonic origin tree diagram.

Table 4 SVM classifier performance at different stages of the embryonic development tree

Full size table

The first level of feature selection yielded 132 predictive miRNA features. The second level yielded 60 miRNA features. Table 2 also lists the per-class performance of the classifier for 132 and 60 features. The classifier built with 60 features achieved an overall accuracy of 95.5% with per-class sensitivities above 95% for many cancer types. Note that, due to the scarcity of the normal samples, all the normal tissue samples were put together under a single class in these classifiers. The per-class sensitivities and specificities of the ‘Normal” class were 95 and 99% respectively, indicating the ability of the classifier to distinguish between the many cancer types from the normal tissues.

These results suggest that miRNA can be used to predict diagnosis with extremely high accuracy, sensitivity, and specificity.

The results from IPA core analysis on the 60 miRNA features used by our classifier are summarized in Additional file 2. As seen from this table, 49 out of the 60 molecules are highly correlated with cancer. The top two disease categories enriched by our miRNA features are metastatic solid tumor (p-value of 3.27E-53) and advanced malignant solid tumor (p-value of 1.61E-50). Similarly, the top three networks enriched by these set of miRNA are related to cancer, organismal injury and abnormalities and gastro-intestinal diseases (Additional file 3: Table S1). IPA computes the p-value for a function/process as the measure of likelihood that the association between that function and set of molecules (in our case, miRNAs) is due to a random chance. The smaller the p-value (or the higher the negative log of the p-value), the higher the likelihood that there is a significant association. P-value was obtained using Benjamini-Hochberg method to account for multiple testing [27].

The high negative log p-values for association between the cancer/tumor progression and the miRNA features used by our classifier suggest that the classifier was able to select and classify these 21 different cancer types using key miRNA features previously known to be associated with many cancer types. The features were predominantly cancer specific (rather than tissue specific), hinting that tumor progression mechanisms differ at the miRNA level between tissues.

We then analyzed the top three miRNAs (ranked by our feature selection algorithm) from our 60 miRNA features. The first one, let-7 family of miRNAs (hsa-let-7i-5p, hsa-let-7d-3p) has been mentioned extensively in the literature as the miRNA precursor that regulates cell cycle progression and cancer [28–30]. Decreased expression of let-7 was known to cause unregulated cell division and tumor formation [31]. The second feature, mir-1-3p (mir-1 family) was studied by Hu et al. [32]; it has been suggested that this miRNA plays a critical role as a tumor activator for human liver hepatocellular carcinoma. Decreased expression of mir-1 could decrease proliferation and induce apoptosis. Mir-10 family (mir-10a and mir-10b) were studied extensively before for their casual role in several cancer types. Lund et al., summarized the previous studies and reported that mir-10 was dis-regulated in several cancer types including Hepatocellular carcinomas, Pancreatic cancer, B-cell Chronic Lymphocytic Leukemia and Melanoma [33]. These are all cancer types our classifier can distinguish. This IPA analysis combined with our literature search indicate that our SVM classifier was able to achieve better performance using miRNAs previously known to cause and promote tumor proliferation.