Patients
Clinical data from breast cancer patients were downloaded from The Cancer Genome Atlas (TCGA) database. The inclusion criteria for the present study were a histopathological diagnosis of breast cancer or classification as solid tissue normal. In total, 762 breast cancer patients and 138 solid-tissue-normal subjects were included. To address the resulting imbalanced class distribution, we applied the approach described in the "Experimental design and statistical analysis" section.
ML methods
ML methods are tools used to build and evaluate algorithms for prediction and classification. A typical ML workflow consists of four steps: collecting data, choosing a model, training the model, and testing the model. In this study, three groups of algorithms were employed: feature selection, feature extraction, and classification algorithms. These are elaborated on next.
Feature selection algorithms
ML procedures have difficulty dealing with a large number of input features, so data preprocessing is essential for applying ML effectively in real-world scenarios. Feature selection is one of the most common preprocessing procedures [12] and has become a vital element of the ML process for identifying the relevant features or feature subsets needed to meet classification objectives. Beyond identifying relevant features, feature selection helps avoid overfitting and yields faster, more cost-effective models. In the present study, four feature selection procedures, spanning filter and embedded approaches, were employed and compared to select the most valuable features: (1) analysis of variance (ANOVA); (2) Mutual Information (MI); (3) Extra Trees Classifier (ETC); and (4) Logistic Regression (LGR).
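The four approaches above can be sketched with scikit-learn as follows. This is an illustrative example on synthetic data, not the study's actual gene-expression matrix; the dataset dimensions and the choice of ten retained features are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, f_classif,
                                       mutual_info_classif, SelectFromModel)
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data: 200 samples, 50 features.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# (1) ANOVA and (2) Mutual Information: filter methods that score each
# feature against the label independently of any classifier.
X_anova = SelectKBest(f_classif, k=10).fit_transform(X, y)
X_mi = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# (3) Extra Trees and (4) Logistic Regression: embedded methods that rank
# features by model-derived importances or coefficient magnitudes.
# threshold=-inf makes SelectFromModel keep exactly max_features features.
X_etc = SelectFromModel(ExtraTreesClassifier(random_state=0),
                        threshold=-np.inf, max_features=10).fit_transform(X, y)
X_lgr = SelectFromModel(LogisticRegression(max_iter=1000),
                        threshold=-np.inf, max_features=10).fit_transform(X, y)

print(X_anova.shape)  # (200, 10): 10 of the 50 original columns kept
```

Note that all four methods return a subset of the original columns, which is what distinguishes them from the feature extraction approach described next.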
Feature extraction algorithm
We employed feature extraction to convert high-dimensional data to fewer dimensions, thereby reducing the risk of overfitting. Dimensionality-reduction procedures use no labels for feature extraction; they rely only on patterns among the input features. Consistent with previous studies [13, 14], Principal Component Analysis (PCA) outperformed other feature extraction algorithms. PCA is a dimensionality-reduction procedure that generates new features rather than a feature selection procedure: PCA transforms features, whereas feature selection procedures choose features without transforming them. Hence, PCA was used as the feature extractor in this study.
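The transform-versus-select distinction can be seen directly in code. The snippet below is an illustrative sketch on random data (the dimensions and component count are assumptions, not the study's settings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))  # 100 samples, 30 original features

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)  # 30 input features -> 5 components

# Each retained component is a linear combination of ALL 30 original
# features (one row of loadings per component), which is why PCA
# transforms the feature space rather than selecting within it.
print(X_reduced.shape)        # (100, 5)
print(pca.components_.shape)  # (5, 30)
```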
Classifier algorithms
We selected 13 classification algorithms: (1) LGR; (2) Support Vector Machine (SVM); (3) Bagging; (4) Gaussian Naive Bayes (GNB); (5) Decision Tree (DT); (6) Gradient Boosting Decision Tree (GBDT); (7) K-Nearest Neighbors (KNN); (8) Bernoulli Naive Bayes (BNB); (9) Random Forest (RF); (10) AdaBoost; (11) ExtraTrees; (12) Linear Discriminant Analysis (LDA); and (13) Multilayer Perceptron (MLP). Of note, all the feature selection, extraction, and classification procedures were implemented using the scikit-learn package in Python (scikit-learn version 1.0.2, Python version 3.8.3). The cross-combination approach was employed to compare the performance of the feature selection, extraction, and classification procedures: each of the four feature selection procedures and the one feature extraction procedure was combined with all 13 classification procedures, yielding 65 combinations of ML strategies.
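The cross-combination scheme amounts to pairing every dimensionality-reduction step with every classifier in a pipeline and scoring each pair the same way. The sketch below shows a reduced 2 × 3 grid on synthetic data for brevity (the study's grid is 5 × 13 = 65); the dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=40, random_state=0)

reducers = {"ANOVA": SelectKBest(f_classif, k=10),
            "PCA": PCA(n_components=10)}
classifiers = {"LGR": LogisticRegression(max_iter=1000),
               "GNB": GaussianNB(),
               "KNN": KNeighborsClassifier()}

# Every reducer is combined with every classifier; fitting inside a
# Pipeline ensures the reducer is re-fit on each training fold only.
results = {}
for r_name, reducer in reducers.items():
    for c_name, clf in classifiers.items():
        pipe = Pipeline([("reduce", reducer), ("clf", clf)])
        results[(r_name, c_name)] = cross_val_score(pipe, X, y, cv=5).mean()

print(len(results))  # 6 reducer/classifier combinations scored
```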
Experimental design and statistical analysis
In this study, the performance of the ML procedures was estimated with five-fold cross-validation: the original sample is randomly partitioned into five equal-sized subsamples, and the model is alternately trained on four of the parts and tested on the remaining one, so that each part serves once as the test set.
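The splitting scheme just described can be sketched with scikit-learn's stratified K-fold splitter (the synthetic dataset is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# 100 samples with a balanced 50/50 class split, for illustration.
X, y = make_classification(n_samples=100, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, test_idx in cv.split(X, y):
    # In each fold, 4/5 of the samples train and 1/5 test; stratification
    # preserves the class proportions within every fold.
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)  # five (80, 20) train/test splits
```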
A problem with unbalanced classification is that there are too few samples of the minority class for a model to learn the decision boundary effectively. In other words, a dataset is unbalanced if the classification classes are not approximately equally represented. In our study, the dataset was unbalanced: the breast cancer group was larger than the solid tissue normal group, and this skewed distribution could bias the results. Hence, we used the Synthetic Minority Over-sampling Technique (SMOTE) [15] to address this problem. SMOTE is an oversampling method that creates synthetic minority-class samples and can perform better than simple oversampling. It works by selecting minority samples that are close in the feature space, drawing a line between a sample and one of its neighbors, and generating a new sample at a point along that line. As shown in [15], SMOTE achieves better performance when combined with under-sampling of the majority class. Therefore, we first oversampled the minority class (solid tissue normal group) with SMOTE and then undersampled the majority class (breast cancer group). The performance of ML algorithms is commonly evaluated using predictive accuracy, calculated as follows:
$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}}$$
(1)
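The SMOTE interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration of the core idea, not the imbalanced-learn implementation used in practice; the function name, dimensions, and parameter defaults are assumptions for the demonstration.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, seed=None):
    """Generate n_new synthetic samples from minority-class rows X_min."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # k nearest, skipping i itself
        j = rng.choice(neighbours)
        gap = rng.random()  # random point along the segment i -> j
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# 20 minority-class samples with 4 features, for illustration.
X_min = np.random.default_rng(0).normal(size=(20, 4))
X_new = smote_sample(X_min, n_new=10, seed=1)
print(X_new.shape)  # (10, 4): ten synthetic minority samples
```

Because each synthetic point lies on a segment between two real minority samples, the new samples stay inside the minority region of the feature space rather than duplicating existing rows.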
In the context of balanced datasets, it is reasonable to use accuracy as a performance metric; however, this is not appropriate when the dataset is unbalanced. Hence, to evaluate the proposed models' performance, balanced accuracy, defined in Eq. (4), and the area under the curve (AUC) were used as diagnostic indicators.
$${\text{Sensitivity}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(2)
$${\text{Specificity}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}$$
(3)
$${\text{Balanced}}\;{\text{accuracy}} = \frac{{{\text{Sensitivity}} + {\text{Specificity}}}}{2}$$
(4)
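Eqs. (1)–(4) translate directly into code from the confusion-matrix counts. The counts below are illustrative, chosen to show why balanced accuracy is preferred here: on an unbalanced test set a classifier can score high accuracy while missing most of the minority class.

```python
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)           # Eq. (1)
    sensitivity = tp / (tp + fn)                         # Eq. (2)
    specificity = tn / (tn + fp)                         # Eq. (3)
    balanced_accuracy = (sensitivity + specificity) / 2  # Eq. (4)
    return accuracy, balanced_accuracy

# Hypothetical unbalanced test set: 90 positives, 10 negatives.
# The classifier finds most positives but only 2 of the 10 negatives.
acc, bal_acc = metrics(tp=88, tn=2, fp=8, fn=2)
print(round(acc, 2), round(bal_acc, 2))  # 0.9 0.59
```

Accuracy reports 0.90, yet balanced accuracy exposes the poor specificity (2/10), which is exactly the failure mode Eq. (4) guards against on unbalanced data.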