 Methodology
 Open Access
 Published:
A novel miRNAbased classification model of risks and stages for clear cell renal cell carcinoma patients
BMC Bioinformatics volume 22, Article number: 270 (2021)
Abstract
Background
Clear cell renal cell carcinoma (ccRCC) is the most common subtype of renal carcinoma and patients at advanced stage showed poor survival rate. Despite microRNAs (miRNAs) are used as potential biomarkers in many cancers, miRNA biomarkers for predicting the tumor stage of ccRCC are still limitedly identified. Therefore, we proposed a new integrated machine learning (ML) strategy to identify a novel miRNA signature related to tumor stage and prognosis of ccRCC patients using miRNA expression profiles. A multivariate Cox regression model with three hybrid penalties including Least absolute shrinkage and selection operator (Lasso), Adaptive lasso and Elastic net algorithms was used to screen relevant prognostic related miRNAs. The best subset regression (BSR) model was used to identify optimal prognostic model. Five ML algorithms were used to develop stage classification models. The biological significance of the miRNA signature was analyzed by utilizing DIANAmirPath.
Results
A fourmiRNA signature associated with survival was identified and the expression of this signature was strongly correlated with high risk patients. The high risk patients had unfavorable overall survival compared with the low risk group (HR = 4.523, Pvalue = 2.86e−08). Univariate and multivariate analyses confirmed independent and translational value of this predictive model. A combined ML algorithm identified six miRNA signatures for cancer staging prediction. After using the data balancing algorithm SMOTE, the Support Vector Machine (SVM) algorithm achieved the best classification performance (accuracy = 0.923, sensitivity = 0.927, specificity = 0.919, MCC = 0.843) when compared with other classifiers. Furthermore, enrichment analysis indicated that the identified miRNA signature involved in cancerassociated pathways.
Conclusions
A novel miRNA classification model using the identified prognostic and tumor stage associated miRNA signature will be useful for risk and stage stratification for clinical practice, and the identified miRNA signature can provide promising insight to understand the progression mechanism of ccRCC.
Background
Renal cell carcinoma (RCC) is one of the top ten cancer diagnoses, and it accounts for 3–5% of all new cases in females and males [1]. Recently, there are more than 140,000 death per year associated with RCC [2]. Clear cell renal cell carcinoma (ccRCC) is the most common RCC subtype and it represents 70–80%, of all renal malignant tumors [3]. Despite many advances in effective therapeutic and diagnostic strategies in ccRCC, and the overall survival rate is still poor, particularly for advancedstage ccRCC patients[4]. CcRCC has poor prognosis due to the resistance to chemotherapy and radiotherapy[5]. Late tumor staging is the main risk factor of ccRCC patients [6] and detection of ccRCC patients at earlystage is crucial for better diagnosis and treatment options.
Currently, the development of next generation sequencing (NGS) technology has enabled researchers to explore genetic alterations in tumorigenesis and discovering molecular biomarkers for many cancers[7]. NGS allows to examine the possible contributions of the upstream molecular regulators of gene expression such as miRNAs. MiRNA play critical roles in regulating various physiological and pathological processes, including regulation of cell division, apoptosis, cell maturation, angiogenesis, metastasis, migration, invasion, differentiation of cells, metabolism, and proliferation by negative regulation of gene expression [8,9,10]. Moreover, in various cancers, dysregulated miRNAs can be used as biomarkers [11,12,13]
Statistical and machine learning approaches have been used to predict gene sets as biomarkers for patients with ccRCC [14]. Ng and Taguchi employed the tensor decomposition method to identify miRNA signature in ccRCC [15]. Previously, studies were used miRNA expression profiles of liver and breast cancer patients, followed by a support vector machine (SVM) with genetic algorithm, to predict the early and advanced stages [16, 17]. Recently, miRNA profiles were used to detect lung cancer subtypes [18]. Several studies have reported miRNA biomarkers in ccRCC. For example, a threemiRNA signature including miR21, miR155 and miR584 is associated with survival in ccRCC [11]. Zhao and Bai identified 13miRNA signature associated with overall survival in ccRCC [19]. However, the study of multimiRNA signature models to predict the risks and tumor stages of ccRCC patients are still limited and hence, we aimed to select a small set of miRNAs as signature that can predict risk as well as tumor stages in ccRCC patients using genomic profiles, so that identified miRNA signature can provide promising insight to understand the progression and development mechanism of ccRCC.
In this study, we proposed a computational method for identifying prognosticassociated miRNA signature as well as predicting the early and late tumor stages of ccRCC using miRNA expression profiles. We identified a four miRNA signature associated with the prognosis of ccRCC from highdimensional miRNA expression profiles using multivariate Cox regression with Elasticnet, Lasso and Adaptive lasso penalties followed by best subset regression analysis. The prognostic risk model involving four miRNA signature effectively classified ccRCC patients into high and low risk groups; prognosis was significantly worse in highrisk group when compared with lowrisk groups. Furthermore, we extracted significant miRNAs that can distinguish early and late tumor stages using various machine learning approaches. We identified a six miRNA signature strongly related to tumor stages of ccRCC patients. The five machine learning algorithms were used to evaluate classification performance of a six miRNA signature using independent testing set. Finally, the SVM algorithm achieved the best classification performance when compared with other classifiers.
Results
We proposed a computational method including penalized Cox models and machine learning approach to identify miRNA signature for risk and tumor stage prediction using miRNA profiles, which consists of several steps as described in detail in the “Methods” section. To develop optimal prognostic predictive model for ccRCC patients, combined penalized Cox models (including Elasticnet, Lasso and Adaptive lasso), best subset regression and risk score model were used. Furthermore, a combined machine learning approach was used to prioritize and identify miRNA signatures associated with early and late tumor stages in ccRCC patients. The systematic pipeline of the overall process is shown in Additional file 1: Fig. S1.
Identification of dysregulated miRNAs in ccRCC patients
After TCGAccRCC data quality assessment, preprocessing and normalization, a total of 1046 miRNA expression profiles were used for differentially analysis based on “limma” package in R. We identified 124 differentially expressed miRNAs (DEMs), of which 80 downregulated and 44 upregulated miRNAs in 254 primary tumor tissue samples compared with 71 normal samples, using the criteria of absolute value log2foldchange > 1 and BenjaminHochberg (BH) adjusted Pvalue < 0.05 (Additional file 1: Table S1). These abnormal miRNAs were used for subsequent survival and stage classification model development.
Identification of prognosticassociated miRNAs and development of risk classification model
DEMs that altered in tumor samples are potential prognostic and diagnostic signatures. To identify significant prognostic DEMs, the TCGA cohort of ccRCC patient (n = 252) having survival information were used. Subsequently, three regularized survival methods (including Elasticnet, Lasso, and Adaptive lasso) with tenfold cross validation were implemented to obtain the optimal lambda (λ) values that obtained from the smallest partial likelihood deviances. The estimated optimal penalty parameter λ values for the three algorithms were \(\lambda_{opt}^{Enet}\) = 0.093, \(\lambda_{opt}^{Lasso}\) = 0.056 and \(\lambda_{opt}^{Ad.lasso}\) = 0.015 and these optimal tuning parameters were used to choose informative features (miRNAs) that were associated with patient survival. Elasticnet, Lasso, and Adaptive lasso algorithms identified 13 miRNAs, 11 miRNAs and 6 miRNAs respectively (Table 1). A union of candidate miRNAs selected by the three methods including: miR21, miR223, miR146b, miR30b, miR3613, miR187, miR203, miR5143, miR1292, miR200a, miR508, miR1.2 and miR934 were used for BSR analysis. Then, all subset prognostic models created by the identified 13 miRNAs were assessed using BSR analysis based on the “glmulti” package in R and finally we obtained an optimal miRNA prognostic model with fourmiRNA signatures (including miR30b, miR21, miR187 and miR150 200a) having the smallest AIC value (Additional file 1: Fig. S2). Then, using the regression coefficient obtained from the result of multivariate Cox regression analysis of the fourmiRNA signatures (Additional file 1: Table S2), we developed a risk score (RS) model, which is given by, RS = (0.525 \(\times\) miR30b expression) + (1.485 × miR21 expression) + (0.485 \(\times\) miR187 expression) − (0.320 \(\times\) miR200a expression). Based on this risk score definition, we stratified ccRCC patients into high risk and low risk groups. The Kaplan–Meier curve shows that high risk group related to poor overall survival time relative to low risk group (Fig. 1a). Furthermore, the AUC values of the time dependent ROC curve were 77.40%, 73.00%, 79.33% and 82.73% for a 1year, 3year, 5year and 10year survival (Fig. 1b). These results demonstrated that the time dependent risk prediction based on a fourmiRNA signature can be used for risk assessment for ccRCC patients.
Univariate and multivariate analysis for risk prediction evaluation
In order to further examine whether the identified miRNA signature can be used as independent predictor, we performed univariate and multivariate Cox analysis (Table 2). The results of univariate Cox analysis indicated that age, pathological stages and the miRNA signature are significantly correlated with overall survival. Meanwhile, multivariate analysis showed that pathological stage (Pvalue = 0.048), pathological M stage (Pvalue = 0.035) and four miRNA signature (Pvalue = 4.23e−05) are significantly associated with unfavorable prognosis.
Identification of miRNA signature for classification of tumor stages
In order to identify tumor stage associated miRNA signatures, 124 DEMs expression profiles and the corresponding ccRCC patients with stage information (n = 252) were used. We applied ensemble learning feature selection algorithms including logistic regression (LR), random forest (RF), support vector machine with radial kernel (SVMR) and average neural network (avNNet) model to identify optimal classifier of tumor stages. The maximum Relevance Minimum Redundancy (mRMR) algorithms were used to identify miRNA features having the most correlation with tumor stage and the least correlation with miRNA features themselves. The detail of mRMR algorithm is described in Fig. 2a. The utility of mRMR based ensemble ML algorithm is enhancing informative feature selection by minimizing the bias that might be introduced by single algorithm. We identified top ranked features of 20 miRNAs using four ML methods and the ranking of miRNAs and their corresponding features importance relevance are shown (Additional file 1: Table S3). To determine the common number of miRNA features for tumor stage classification, we performed an overlapping analysis of selected miRNAs by these algorithms and six common miRNAs including miR106b, miR144, miR224, miR91, miR21, miR342 (Fig. 2b) were identified. These identified miRNAs were used to develop a stage prediction model. We also further verified that the identified 6 miRNA features were differentially expressed between early and late stage groups. The boxplot shows for each miRNA, there was a significant expression difference in early and late stage subgroups (Fig. 2c). Interestingly, the identified 6 stageassociated miRNAs are significantly associated with the overall survival of patients with ccRCC (Additional file 1: Fig. S3).
SMOTE sampling data and stage classification performance
In this study, SMOTE sampling was used to solve the class imbalance problem. Balanced dataset obtained from the SMOTE algorithm was then divided into training set and test set (80:20 ratio). The identified six miRNA signature was used to build prediction model based on five ML algorithms—SVMR, LR, Naïve Bayes, avNNet and KNN. The classification performance evaluated after data balancing is shown in Table 3. The accuracy of all models were found to be in the range of 0.62–0.923 with maximum accuracy of 0.923 for SVMR. The highest sensitivity and specificity, MCC and precision were observed for SVMR. To check whether SMOTE data sampling resulted improvement prediction, we used the original data (without SMOTE balancing) and constructed the training set and test set in a 80:20 ratio. Stage prediction and their classification performance delivered by the five ML methods were explored. The best accuracy and sensitively was observed in Naïve Bayes with an accuracy 0.775 and sensitivity 0.778. The details of model performance comparison are shown in (Additional file 1: Table S4). It is noteworthy that SMOTE data balancing algorithm can improve the prediction accuracy of all ML models except Naïve Bayes. The overall model prediction results indicated that the identified sixsignature miRNAs can classify tumor stages of ccRCC patients with reasonable prediction performance using miRNAs expression profiles.
Enrichment analysis and biological roles of identified miRNA signatures
The identified miRNA signatures are linked with the development and progression of various cancer types. Overexpression of miR21 leads to amplified cell proliferation and reduced apoptosis [20] and high expression of miR21 associated with poor survival in lung cancer and ccRCC [11, 21]. Downregulated miR30b5p act as a tumor suppressor to regulate renal cell carcinoma in cell proliferation, metastasis and epithelialtomesenchymal transition by targeting Gprotein subunit α13 [22]. Decreased miR187 in clear cell renal cell carcinoma inhibits cell growth, migration though targeting B7H3 and correlated with lower survival [23]. MiR106p5p upregulation targets several negative regulators of the Wnt/βcatenin pathway [24]. MiR144 promotes RCC development by hampering mTOR expression [25]. Fujii N et al. reported that higher expression of miR224 associated with poor progressionfree survival and overall survival in ccRCC [26]. MiR200a regulates epithelial to mesenchymal transitionassociated with gene expression and regulates prognosis in colorectal cancer [27]. In addition, miR200a consistently decreased in RCC and serve as diagnostic biomarker for the early detection of RCC [28].
Biological roles of the identified miRNAs were assessed by using KEEG pathways and GO annotation analyses via DIANAmirPath. The enriched biological pathways of the upregulated and downregulated identified miRNA signatures are presented in Fig. 3a, b. The upregulated miRNA signatures are enriched in Prion diseases, Lysine degradation, ECMreceptor interaction, Proteoglycans in cancer, Fatty acid elongation,Pathways in cancer, Cell cycle, FoxO signaling pathway, p53 signaling pathway, TGFbeta signaling pathway, Biosynthesis of unsaturated fatty acids, Viral carcinogenesis, signaling pathways regulating pluripotency of stem cells, Renal cell carcinoma and other biological pathways. The detail biological pathways and number of target genes for upregulated miRNA signatures are described in Additional file 1: Table S5. Similarly, downregulated miRNA signatures involved in Fatty acid biosynthesis, Adherens junction, Fatty acid metabolism, Lysine degradation, Pathways in cancer, Viral carcinogenesis, p53 signaling pathway, mRNA surveillance pathway, RNA degradation other KEGG pathways. The detail summary of downregulated miRNA signatures enriched pathways and number of target genes are indicated in Additional file 1: Table S6.
GO analysis also showed that upregulated miRNA signatures participated in cellular nitrogen compound metabolic process, biosynthetic process, mitotic cell cycle, cell death, DNA metabolic process, innate immune response, cell cycle, cell proliferation and others. The detail biological processes and target genes of upregulated miRNA signatures are shown in Additional file 1: Table S7. Furthermore, the biological processes of downregulated miRNAs are enriched in nucleobasecontaining compound catabolic process, macromolecular complex assembly, mitotic cell cycle, cellular protein modification process, catabolic process, biosynthetic process, gene expression, viral process, cellular component assembly, cellular protein metabolic process, small molecule metabolic process, Fcepsilon receptor signaling pathway, response to stress, cell death and other biological process. The detail biological processes and target genes for downregulated miRNA signatures are shown in Additional file 1: Table S8.
Discussion
In this study, we showed an efficient strategy to identify miRNA signatures that can stratify high risk and low risk patients as well as classify early and late tumor stages. Several computational and machine leering algorithms have been developed to explore miRNAassociated diseases [17, 29]. Previous studies also used miRNA profiles to identify biomarkers for risk stratification [11, 30]; however, only a few research works have been conducted to explore miRNA signatures for early tumor stage of ccRCC. Therefore, in our study, we proposed a novel hybrid ML strategy to identify miRNA signatures associated with overall survival and tumor stages classification for ccRCC patients. The major findings and contributions of our work are outlined as follow. Identification of risk and stage predictive miRNA signatures with good predictive performance. The use of multivariate Cox regression with Elasticnet, Lasso and adaptive Lasso penalties followed by optimal subset prognostic model selection strategy identified a four novelmiRNA signature, namely miR30b, miR21, miR187 and miR200a. This signature can effectively stratify high risk and low risk subgroups with good classification performance (time dependent ROC); hence, the fourmiRNA signature may be use as a prognostic biomarker. Furthermore, the proposed combined ML algorithms identified six top ranked miRNAs (miR106b, miR144, miR224, miR91, miR21 and miR342) based on their relative importance and their classification performance were evaluated by five ML methods. The classification performance of the sixstage associated miRNAs revealed effective stratifying ability early stage and late stage. The stage classification performance can be further improved if we applied SMOTE algorithm to prepare balanced early and late tumor stage data. The consistency and validation of the predictive models were assessed using an independent test set. The biological significance of the identified miRNA signatures in ccRCC development and progression were discussed and confirmed by the existed literature. The biological roles of the identified miRNA signatures were examined using enrichment analyses.
Future works: we will attempt to identify more robust features and classification models and apply different balanced data algorithms to improve the efficiency of the classification performance of early and late stage other than LR, Naive Bayes, KNN, and SVM classifiers.
Conclusions
In summary, we have attempted to identify potential miRNA signatures for stratification of risk using integrated statistical approach. An optimal fourmiRNA based prognostic model with the smallest AIC criterion was identified and the fourmiRNA signature can effectively classify ccRCC patients into high and low risk groups. Moreover, using stateoftheart ML algorithms, we identified six top ranked miRNA signatures that can classify early and late stages for patients with ccRCC. These six miRNA signature classified early stage and late stage with reasonable good classification performance. Finally, we explored the biological roles of the identified risk and stageassociated miRNA signatures and these signatures involved important biological pathways. Overall, we expect that our findings provide promising insight to explore the role of miRNAs in ccRCC patients and could help risk and stage classifications.
Methods
Input data
The RNAseq expression profiles and their corresponding clinical data (including TNM stage, survival time, sex, age) for ccRCC were retrieved from the Firebrowse database (http://firebrowse.org/). The clinical information consists of 325 samples (254 tumors and 71 normal tissues), and clinical data consists of 131, 28, 48, and 45 samples of stage I, stage II, stage III and stage IV respectively. Finally, we considered both early stage (stage I and stage II, n = 159) and late stage (stage III and stage IV, n = 93) samples of patients for classification purpose.
Preprocessing and identification of differentially expressed miRNAs
Normalization of the miRNA profiles were carried out by using the 'edgeR' package [19]. Differential analysis of miRNAs was conducted by utilizing the 'LIMMA' package in R [19] and miRNAs that satisfy the criterion of the absolute value of log2fold change > 1 and BenjaminHochberg (BH) adjusted Pvalue < 0.05 were considered as differentially expressed miRNAs (DEMs).
Data standardization for ML modeling
In this calculation, zscaling was used to normalized the count per million (CPM) of miRNA expression values, which is defined by the following equation.
where x is the expression value of miRNA, \(\overline{x}\) is the mean expression values of miRNA of the samples and σ is the standard deviation of expression values of miRNA of the samples and z represents normalized miRNA expression that follow the normal distribution with zero mean and unit standard deviation.
Identification of miRNA signature associated patient survival and construction of risk classification model
Feature selection (FS) is an important process to improve classification performance by avoiding irrelevant/noise features. To select survivalassociated features (miRNAs), Cox model regression model was proposed. Cox regression is defined as:
Here, i denotes for ccRCC patients, and the Xs are the covariates (miRNAs). \(X_{i\beta } = \beta_{1} x_{i1} + \cdots + \beta_{k} x_{ik}\), \(h_{0} \left( t \right)\) is the baseline hazard function at time t, \(\beta = (\beta_{1} , \beta_{2} , \ldots ,\beta_{k} )\) is the vector of regression coefficients, and \(\beta\) denotes the kdimension regression coefficient vector of covariates. However, when we deal with highdimension data such as genomic data, number of penalized Cox models have been proposed (including: Lasso, Adaptive lasso and Elasticnet) to solve the overfitting problem [31,32,33]. For each model, the estimated \(\hat{\beta }\) values are obtained by minimizing the negative loglikelihood function with different penalty functions as follows:
Here \(\delta_{i}\) is an indicator for the uncensored observation, λ is called penalty and \(w_{nj}\) = \(\left {\hat{\beta }_{nj} } \right^{  1}\) is calculated from the initial estimator \(\hat{\beta }_{n}\).
Lasso Cox and adaptive Lasso Cox were used to identify relevant miRNAs associated with survival time by the shrinkage of some of the irrelevant miRNAs regression coefficients to zero. Adaptive Lasso Cox imposes an adaptive weighted penalty term in comparison with the Lasso, model, which further reduce the number of lessrelevant miRNAs in such a way that the resulting coefficient estimates are sparse. Elasticnet is suitable for screening relevant miRNAs when there is a multicollinearity problem in the genomic data [34]. We used (tenfold) crossvalidation of Lasso, Adaptive lasso and Elasticnet algorithms to obtain the predicted optimal λ value for each method. The optimal λ value that minimize the estimated meansquared prediction error and this optimal λ was used to select candidate miRNAs. All these methods were implemented using the ‘glmnet’ [35] package in R. Then, to make use of the strength of each method, we proposed a combined feature selection approach in the study of miRNAdisease association. The union of candidate miRNAs identified by the three algorithms were used to identify the best miRNAs combination that could predict prognosis of ccRCC patients more effectively. More detailed description of selecting the best subset prognostic model is described below:
Let \(l = 1 , 2, \ldots . , k\), where k is the total number of candidate miRNAs identified by Lasso, Adaptive lasso and Elasticnet algorithms.

a)
Construct all possible combinatorial subset model having \(l\) miRNA candidates

b)
Compare all possible models with the Akaike information criterion (AIC)

c)
Select the best subset prognostic model, having the smallest AIC, with \(l\) miRNAs using the ‘glmulti’ package [36] in R.
After identifying the best subset prognostic model, we developed risk score (a linear combination of best subset miRNA expression and coefficients of miRNAs obtained from multivariate analysis) to evaluate risk prediction performance. The risk score (RS) computed as follow,
Then, RS was used to classified ccRCC patients into high risk and low risk group using median RS as cutoff. The time dependent receiver operating characteristics (ROC) was used to estimate the survival time difference between high risk and low risk group using the ‘survivalROC’ package in R.
Identification of miRNA signature associated with tumor stages for early and late tumor stage classification
To select best subset features (miRNAs) that associated with tumor stages, we proposed to use hybrid feature selection methods. Feature selection methods were conducted based on tenfold crossvalidation. The four popular ML algorithms proposed in this study include: LR, RF, SVMR and avNNet, and top ranked miRNAs based on their importance as the best features were identified from highdimensional data. We utilized the “caret” package in R that consists of several complex ML algorithms for classification and prediction problems.
The processing of feature selection using four ML algorithms are briefly discussed as follows.
Ensemble logistic regression (LR) model is a ML model used as a classification model in feature selection to identify features that can distinguish binary samples of patients. Let \({\varvec{x}} \in {\varvec{R}}^{k}\) denotes an observation consists of \(k\) feature miRNA values and let \(y \in \left\{ {  1, + 1} \right\}\) represents the corresponding binary outcome; such as, early and late stage. A LR model is a condition probability distribution (CPD) of the class level \(y\) given the feature (miRNAs) vector \({\varvec{x}}\) is defined by
where \(\user2{w } \in R^{k }\) is the weight vector and \(v \in {\varvec{R}}\) are parameters of LR. Ensemble LR for relevant feature selection is stable with respect to variation of the learning samples, since it uses ttest to rank features, which does not consider dependence between features. LR transfers the strongly correlated features to the LR gradient decent optimization algorithm to increase classification performance [37].
The random forest (RF) algorithm is a nonlinear regression model, which has been used in several regression and classification problems in a variety of computational studies. RF learning is flexible algorithm that minimizes prediction errors and screen the most important features related to class level when feature size is large. It consists of constructing trees by splitting random sample of r features from a set of n features using bootstrapped training data.
where IG denotes impurity of a node n, \(ki\) denotes the features (miRNAs), and j is the fraction of each \(ki\) feature (miRNAs) [38].
Support vector machine (SVM) algorithm has been applied in many biological problems, such as biomarkers selection and classification problems [16, 39]. SVM is nonprobabilistic classifier which maximize the margin of the decision boundary to classify two classes using support vectors to achieve the best classification. The general formulation of the SVM classifier is defined as
Here, w denotes the normal vector of the hyperplane, \(\varphi\) is the classifier parameter, \(x_{i}\) are the variables and n is the number of vectors in the training dataset.
The avNNet model is a type of neural network (NN), that learns nearly infinite number of mapping functions and works like natural human neurons. The inputs (features) connect to class label (staging), the connections are called edges. The input feature connects either forward/backward propagation hidden nodes to compute neurons. The number of hidden layers determines the depth of NN. A feedforward NN having more than one hidden layer is called deep network [40, 41]. The NN is commonly applied to the discovery of biomarkers in cancer studies [42].
where \(\beta_{j}^{r}\) represents the jth neuron in the rth layer is associated with activation in the (r − 1)th layer. The \(\vartheta\) is a vectorising function parameter. The \(w_{jk}^{r}\) denotes a rth weight matrix for each layer r of the jth row and kth column, and \(\alpha_{j}^{r}\) denotes the j bias for each layer r.
Stage classification methods
SVMR, NB, avNNet, KNN and LR were utilized for building predictive and classification models. Each model was constructed by tenfold cross validation to avoid over/under fitting. The cost function was optimized [100–1000 iterations with 100 steps per iteration] to attain accurate classification.
Data balancing
The proportion of early stage patients is approximately twice of late stage patients; this creates the data imbalance problem which leads to biased prediction. Therefore, we performed the data balancing procedure by using the Synthetic Minority Oversampling Technique (SMOTE) algorithm [43] (included in the DMwR package). The SMOTE is a common method to solve data imbalance problem more effectively, prior to applying the ML classifier [42]. The dataset was divided randomly into training set (80%) and the remaining 20% as independent test set. The training set was employed to train ML algorithms in classifying early and late stage of patients based on tenfold cross validation. The test set was used as independent test set and used to assess the classification performance of five ML methods.
Classification performance evaluation metrics
To evaluate the performance of the classification models, we used the following measures: specificity, sensitivity, accuracy (ACC), precision, and Matthews correlation coefficient (MCC). The mathematical formulas of the measures are given below:
where TP, TN, FP, and FN denote true positive, true negative, false positive and false negative respectively.
Functional enrichment analysis
We assessed the biological relevance of the identified prognostic and diagnostic miRNA signatures using the DIANAmirPath [44].
Availability of data and materials
The datasets analyzed during the current study are available in the Firebrowse database, https://gdac.broadinstitute.org/.
Abbreviations
 RCC:

Renal cell carcinoma
 ccRCC:

Clear cell renal cell carcinoma
 ML:

Machine learning
 TCGA:

The Cancer Genome Atlas
 SMOTE:

Synthetic Minority Oversampling Technique
 LASSO:

Least absolute shrinkage and selection operator
 HR:

Hazard ratio
 SVM:

Support Vector Machine
 SVMR:

Support Vector Machine with radial kernel
 MRMR:

Maximum Relevance Minimum Redundancy
 LR:

Logistic regression
 KNN:

Knearest neighbors algorithm
 AVNNET:

Averaging Neural network
 GO:

Gene Ontology
 KEGG:

Kyoto Encyclopedia of Genes and Genomes
References
 1.
Siegel RL, Miller KD, Jemal A. Cancer statistics, 2018. CA Cancer J Clin. 2018;68(1):7–30.
 2.
Ferlay J, et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer. 2015;136(5):E359–86.
 3.
Zhang L, et al. The complex roles of microRNAs in the metastasis of renal cell carcinoma. J Nanosci Nanotechnol. 2013;13(5):3195–203.
 4.
Siegel RL, Miller KD, Jemal A. Cancer statistics, 2016. CA Cancer J Clin. 2016;66(1):7–30.
 5.
Wood CG. Multimodal approaches in the management of locally advanced and metastatic renal cell carcinoma: combining surgery and systemic therapies to improve patient outcome. Clin Cancer Res. 2007;13(2 Pt 2):697s–702s.
 6.
Amin MB, et al. The Eighth Edition AJCC Cancer Staging Manual: continuing to build a bridge from a populationbased to a more “personalized” approach to cancer staging. CA Cancer J Clin. 2017;67(2):93–9.
 7.
Li, Y., et al., Transcriptomic and functional network features of lung squamous cell carcinoma through integrative analysis of GEO and TCGA data.
 8.
Chang RK, et al. MicroRNA expression profiles in nonepithelial ovarian tumors. Int J Oncol. 2018;52(1):55–66.
 9.
Lujambio A, et al. Genetic unmasking of an epigenetically silenced microRNA in human cancer cells. Cancer Res. 2007;67(4):1424–9.
 10.
Macfarlane LA, Murphy PR. MicroRNA: biogenesis, function and role in cancer. Curr Genomics. 2010;11(7):537–61.
 11.
Liang B, Zhao J, Wang X. A threemicroRNA signature as a diagnostic and prognostic marker in clear cell renal cancer: an in silico analysis. PLoS ONE. 2017;12(6):e0180660.
 12.
Jacob H, et al. A fourmicroRNA classifier as a novel prognostic marker for tumor recurrence in stage II colon cancer. Sci Rep. 2018;8(1):6157.
 13.
Asakura K, et al. A miRNAbased diagnostic model predicts resectable lung cancer in humans with high accuracy. Commun Biol. 2020;3(1):134.
 14.
Li F, et al. An improved clear cell renal cell carcinoma stage prediction model based on gene sets. BMC Bioinform. 2020;21(1):232.
 15.
Ng KL, Taguchi YH. Identification of miRNA signatures for kidney renal clear cell carcinoma using the tensordecomposition method. Sci Rep. 2020;10(1):15149.
 16.
Yerukala Sathipati S, Ho SY. Identifying a miRNA signature for predicting the stage of breast cancer. Sci Rep. 2018;8(1):16138.
 17.
Yerukala Sathipati S, Ho SY. Novel miRNA signature for predicting the stage of hepatocellular carcinoma. Sci Rep. 2020;10(1):14452.
 18.
Sherafatian M, Arjmand F. Decision treebased classifiers for lung cancer diagnosis and subtyping using TCGA miRNA expression data. Oncol Lett. 2019;18(2):2125–31.
 19.
Zhao E, Bai X. Nomogram based on microRNA signature contributes to improve survival prediction of clear cell renal cell carcinoma. Biomed Res Int. 2020;2020:7434737.
 20.
Oue N, et al. High miR21 expression from FFPE tissues is associated with poor survival and response to adjuvant chemotherapy in colon cancer. Int J Cancer. 2014;134(8):1926–34.
 21.
Li C, et al. Nonsmall cell lung cancer associated microRNA expression signature: integrated bioinformatics analysis, validation and clinical significance. Oncotarget. 2017;8(15):24564–78.
 22.
Liu W, et al. MiR30b5p functions as a tumor suppressor in cell proliferation, metastasis and epithelialtomesenchymal transition by targeting Gprotein subunit α13 in renal cell carcinoma. Gene. 2017;626:275–81.
 23.
Zhao J, et al. MicroRNA187, downregulated in clear cell renal cell carcinoma and associated with lower survival, inhibits cell growth and migration though targeting B7–H3. Biochem Biophys Res Commun. 2013;438(2):439–44.
 24.
Lu J, et al. miR106b5p promotes renal cell carcinoma aggressiveness and stemcelllike phenotype by activating Wnt/βcatenin signalling. Oncotarget. 2017;8(13):21461–71.
 25.
Xiang C, Cui SP, Ke Y. MiR144 inhibits cell proliferation of renal cell carcinoma by targeting MTOR. J Huazhong Univ Sci Technol Med Sci. 2016;36(2):186–92.
 26.
Fujii N, et al. Extracellular miR224 as a prognostic marker for clear cell renal cell carcinoma. Oncotarget. 2017;8(66):109877–88.
 27.
Pichler M, et al. MiR200a regulates epithelial to mesenchymal transitionrelated gene expression and determines prognosis in colorectal cancer patients. Br J Cancer. 2014;110(6):1614–21.
 28.
Wang C, et al. Circulating miR200a is a novel molecular biomarker for earlystage renal cell carcinoma. ExRNA. 2019;1(1):25.
 29.
You ZH, et al. PBMDA: A novel and effective pathbased computational model for miRNAdisease association prediction. PLoS Comput Biol. 2017;13(3):e1005455.
 30.
Tang J, et al. Identification of miRNAbased signature as a novel potential prognostic biomarker in patients with breast cancer. Dis Markers. 2019;2019:3815952.
 31.
Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16(4):385–95.
 32.
Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101(476):1418–29.
 33.
Zou H, Hastie T. Regularization and variable selection via the Elastic Net. J R Stat Soc Ser B (Stat Methodol). 2005;67(2):301–20.
 34.
Lee S, Lim H. Review of statistical methods for survival analysis using genomic data. Genomics Inform. 2019;17(4):e41.
 35.
Kuhn, M., et al., Package ‘caret’. 2020.
 36.
Calcagno V, de Mazancourt C. glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models. J Stat Softw. 2010;34(12):29.
 37.
Zakharov R, Dupont P. Ensemble logistic regression for feature selection. Berlin: Springer; 2011.
 38.
Liaw A, Wiener M. Classification and Regression by randomForest. R news. 2007;2:18–22.
 39.
Chu F, Wang L. Applications of support vector machines to cancer classification with microarray data. Int J Neural Syst. 2005;15(6):475–84.
 40.
Kriegeskorte N, Golan T. Neural network models and deep learning. Curr Biol. 2019;29(7):R231–6.
 41.
GuangBin H, QinYu Z, CheeKheong S. Extreme learning machine: a new learning scheme of feedforward neural networks. In: IEEE international joint conference on neural networks (IEEE Cat. No.04CH37541). 2004.
 42.
Mirza B, et al. Machine learning and integrative analysis of biomedical big data. Genes (Basel). 2019;10(2):87.
 43.
Chawla NV, et al. SMOTE: synthetic minority oversampling technique. J Artif Int Res. 2002;16(1):321–57.
 44.
Vlachos IS, et al. DIANAmiRPath v3.0: deciphering microRNA function with experimental support. Nucleic Acids Res. 2015;43(W1):W460–6.
Acknowledgements
We would like to thank the peer reviewers for their constructive comments, which have helped improve the manuscript. Our gratitude also goes to Michael Burton, Asia University, proofread the manuscript.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 22 Supplement 10 2021: Selected articles from the 19th Asia Pacific Bioinformatics Conference (APBC 2021): bioinformatics. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume22supplement10.
Funding
Dr. KaLok Ng and Mr. Eskezeia Y. Dessie works are supported by the Taiwan Ministry of Science and Technology (MOST) (Grant Numbers: MOST 1092221E468013 and MOST 1082221E468020). Dr. KaLok Ng work is also supported by Asia University (grant numbers: 106asia06, 106asia09, 107asia02 and 107asia09). Dr. Jeffrey J.P. Tsai work is supproted by MOST (Grant Number: MOST 1092321B468001). Dr. JanGowth Chang is supported by China Medical University Hospital (DMR107209). The publication costs are funded by MOST (Grant Numbers: MOST 1092221E468013 and MOST 1082221E468020). The funding bodies were not involved in the design of the study, in the collection, analysis of the data, or in writing of the manuscript.
Author information
Affiliations
Contributions
EYD conceptualized, draft and reviewed and edited the manuscript. JJPT managed the project and secured the necessary acquisition funding. JGC contributed to guide data analysis, and interpretation and manuscript writing. KLN foresee the research, managed the project, reviewed and edited the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
None to declare.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. Lists of supplementary materials, including workflow, differentially expressed miRNAs, KM plots of ccRCC patients, Cox regression results, and the results of enrichment analysis of the miRNA signatures.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Dessie, E.Y., Tsai, J.J.P., Chang, JG. et al. A novel miRNAbased classification model of risks and stages for clear cell renal cell carcinoma patients. BMC Bioinformatics 22, 270 (2021). https://doi.org/10.1186/s12859021041892
Received:
Accepted:
Published:
Keywords
 Clear cell renal cell carcinoma
 Biomarkers
 MicroRNAs
 Survival analysis
 Regularized cox model
 Machine learning classifiers