Gene expression and network data
Gene expression dataset. The gene expression dataset for breast cancer metastasis was collected from the Amsterdam Classification Evaluation Suite (ACES) [25]. ACES is a compilation of twelve (12) separate breast cancer studies from NCBI's Gene Expression Omnibus (GEO). The ACES dataset used only arrays from the Affymetrix U133A platform, and patient samples whose GEO IDs appeared in multiple studies were discarded as duplicates. Array quality control was checked for all patient samples within each study using R's arrayQualityMetrics; RLE (Relative Log Expression) and NUSE (Normalized Unscaled Standard Error) analyses were employed to exclude outlier patient samples from further examination. After these preprocessing steps, 1,616 patient samples remained in the final cohort. The expression arrays of these 1,616 patients were normalized together using the justRMA method in R; probe intensities were log-normalized and mean-centered for each sample. Finally, 12,750 gene probes remained in the ACES dataset. Because the expression data were collected from different locations and for different sub-types of breast cancer, the dataset contained batch effects, which were removed using R's ComBat. For details of the dataset and preprocessing steps, see [25].
A patient was assigned the negative (good) class label if the patient remained free from cancer recurrence for at least five (5) years, and the positive (poor) class label if cancer recurred within five (5) years. Detailed information about the studies and the class distributions is provided in Table 8.
Protein-protein interaction network (PPI). The protein-protein interaction network (PPI) was curated from the BioGRID interaction database (version 3.4.149) [30]. An edge was retained only if both of its genes were among the 12,750 genes in the ACES dataset. After discarding self-edges, 180,371 edges over 12,750 nodes (i.e., genes) remained in the PPI network.
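For concreteness, a minimal sketch of this filtering step is shown below; the gene names and the raw interaction list are illustrative stand-ins, not the actual BioGRID file.

```python
# Minimal sketch: restrict a raw interaction list to genes present in the
# expression dataset and drop self-edges. Gene names and pairs are toy data.
aces_genes = {"TP53", "BRCA1", "EGFR"}          # stands in for the 12,750 ACES genes
raw_pairs = [("TP53", "BRCA1"), ("TP53", "TP53"), ("EGFR", "ABCB1")]

ppi_edges = set()
for a, b in raw_pairs:
    if a == b:                                  # discard self-edges
        continue
    if a in aces_genes and b in aces_genes:     # keep only edges within ACES genes
        ppi_edges.add(frozenset((a, b)))        # undirected, deduplicated

print(ppi_edges)                                # {frozenset({'BRCA1', 'TP53'})}
```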
Gene co-expression network (CE). A global gene co-expression network was used as a second network in the evaluation. Pearson correlation coefficients [31] between gene expression profiles in the ACES dataset were computed, and an edge was created between two genes if each was among the other's top-k most co-expressed genes. To keep the network structure (i.e., number of edges and degree distribution) similar to that of the PPI network, the number of neighbors (k) was set to 84. The resulting co-expression network contained 161,042 edges over 12,750 nodes (i.e., genes), and its degree distribution followed a power law, which is typical of real-world networks [32].
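The following sketch illustrates one way to build such a mutual top-k network with NumPy; the function name and the toy expression matrix are our own, and the paper's exact implementation may differ.

```python
import numpy as np

def mutual_topk_edges(expr, k=84):
    """Create an edge (i, j) when gene j is among gene i's k most
    correlated genes AND vice versa. expr is (n_genes, n_samples)."""
    corr = np.corrcoef(expr)                    # Pearson correlations, genes x genes
    np.fill_diagonal(corr, -np.inf)             # exclude self-correlation
    topk = np.argsort(corr, axis=1)[:, -k:]     # each gene's k strongest partners
    neighbor = np.zeros_like(corr, dtype=bool)
    rows = np.repeat(np.arange(corr.shape[0]), k)
    neighbor[rows, topk.ravel()] = True
    mutual = neighbor & neighbor.T              # require the relation in both directions
    return {(i, j) for i, j in zip(*np.nonzero(mutual)) if i < j}

# Toy usage: 10 genes measured over 30 samples, k = 3.
edges = mutual_topk_edges(np.random.rand(10, 30), k=3)
```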
Edge-based feature type. The edge-based feature type utilizes all the edges of a given network (i.e., the PPI or CE network). The expression value of an edge was calculated as the sum of the expression values of the two genes joined by that edge (Fig. 11). The edge-based feature types derived from the PPI and CE networks are termed "PPIEdge" and "CEEdge", respectively.
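A minimal sketch of this edge-feature construction, assuming a samples × genes expression matrix and a list of edges given as gene-index pairs (both illustrative):

```python
import numpy as np

# Sketch: derive edge-based features from a gene expression matrix.
expr = np.random.rand(5, 4)                  # toy data: 5 patients, 4 genes
edges = [(0, 1), (1, 3), (2, 3)]             # toy edges as gene-index pairs

# Each edge feature is the sum of its two genes' expression per sample.
edge_features = np.stack([expr[:, i] + expr[:, j] for i, j in edges], axis=1)
print(edge_features.shape)                   # (5, 3): one column per edge
```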
Prediction performance comparison
To measure the prediction performance of the different feature types, an RF model was built on each feature type for each of the 13 datasets separately. The RF model used 100 estimators (i.e., 100 decision trees) with the other settings left at their defaults in the Sklearn API [33] in Python. Performance was measured using 5-fold cross-validation with 100 repetitions. Three metrics were used to evaluate prediction performance: the area under the receiver-operating characteristic curve (AUC) [34], Cohen's Kappa [35], and the F1-score [36]. The AUC score depends on the predicted probability vector for the test instances (the raw predicted probabilities from the RF model), whereas Kappa and F1-score are computed from the predicted class labels rather than the predicted probabilities. Kappa and F1-score were therefore evaluated under two strategies. In the first, RF's default class predictions were used to compute the metrics. In the second, class labels (i.e., two-class predictions) were assigned using an optimal probability threshold. For this experiment, 200 repetitions of 10-fold cross-validation were executed to evaluate Kappa and F1-score. Each fold was held out as a test set, and the remaining 9 folds were split randomly into training (80%) and validation (20%) sets. The RF model was trained on the training set, and the optimal probability threshold was chosen from the predicted probabilities returned by the trained model for the validation-set instances. The optimal probability threshold was defined as the point on the ROC curve with the shortest distance to the top-left corner, as given in Eq. 1 [37].
$$ distance = \sqrt{(1 - sensitivity)^{2} + (1 - specificity)^{2}} \tag{1} $$
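A hedged sketch of this threshold selection using scikit-learn's roc_curve is shown below; the helper name and the toy validation data are our own.

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(y_val, p_val):
    """Pick the probability cutoff whose ROC point is closest to the
    top-left corner, i.e., minimizes Eq. 1."""
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    # sensitivity = tpr, specificity = 1 - fpr
    dist = np.sqrt((1 - tpr) ** 2 + fpr ** 2)
    return thresholds[np.argmin(dist)]

# Toy usage with hypothetical validation labels and probabilities:
y_val = np.array([0, 0, 1, 1, 1, 0])
p_val = np.array([0.2, 0.4, 0.8, 0.6, 0.9, 0.3])
t = optimal_threshold(y_val, p_val)
y_pred = (p_val >= t).astype(int)            # positive if probability >= threshold
```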
Finally, predicted probabilities were obtained for the test-set instances from the trained RF model, and class labels were assigned using the optimal probability threshold obtained from the validation set: a test instance whose predicted probability was lower than the threshold was assigned the negative class, and the positive class otherwise. Kappa and F1-score were then recomputed to evaluate the prediction performance of each feature type under this second strategy. This procedure introduces no bias or overfitting into the computation of Kappa and F1-score, because the optimal probability threshold was determined on a completely separate validation set rather than on the training or test sets [38].
A linear-kernel Support Vector Machine (SVM) and Logistic Regression (LR) were also employed to evaluate the prediction accuracy of the different feature types. AUC scores from 5-fold cross-validation with 10 repetitions were measured for the comparison. The default settings of the linear SVM from the Sklearn API [33] in Python were used. For the LR model, the inverse regularization strength (the C parameter) was optimized by an inner 5-fold cross-validation on the training set, searching over the set {0.001, 0.01, 0.1, 1, 10, 100}. Additionally, an L1 penalty was used in the LR model to obtain sparse feature coefficients. Finally, the LR model was trained on the training set with the best C parameter found by the inner cross-validation.
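The following sketch shows one plausible realization of this inner-CV search with scikit-learn; the liblinear solver choice (needed to support the L1 penalty) and the toy data are our assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_samples=200, n_features=50,
                                       random_state=0)   # toy training set

# Inner 5-fold CV over C on the training set; L1 penalty for sparsity.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
lr = LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)
search = GridSearchCV(lr, param_grid, cv=5, scoring="roc_auc").fit(X_train, y_train)
best_lr = search.best_estimator_             # refit on the training set with the best C
```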
A paired t-test between the gene- and edge-based feature types was performed to assess the statistical significance of the comparisons. A p-value was computed for the comparison of the feature-type means for each evaluation metric, with the significance threshold set to 0.05. Comparative results are reported whether or not the difference was statistically significant.
Moreover, to evaluate the significance of the edge-based features (gene pairs), an RF model trained on the PPIEdge feature type was compared to two existing network-based methods, Park [10] and Feral [18]. RF models with 1000, 2000, and 5000 estimators (decision trees) were trained on the PPIEdge feature type to obtain optimal predictive accuracy. 10-fold cross-validation was repeated 30 times for this comparison.
Robustness analysis
To measure the robustness of the feature types, we relied on the feature importance scores returned by the RF model. The robustness of a feature type is measured by first drawing random sub-samples from the dataset and identifying the x top-ranked features by importance score in each sub-sample, and then calculating the ratio between the observed number of features common to the two sub-samples and the number of common features expected by chance. A few important details are worth noting in order to achieve statistically valid comparisons.
First, because of the nature of its algorithm, the RF model calculates an importance score only for a randomly sampled subset of features, so a zero importance score simply means the feature was never sampled during model building. While the number of estimators (decision trees) needed to achieve good prediction performance is usually small (≈100 in our case) and is similar for gene-based and edge-based features, the number of estimators needed for identifying top-ranked features and performing robustness analysis is much larger, because the ranking is meaningful only when every feature has been sampled (and scored) at least once. Note also that more estimators are needed for edge features than for gene features to achieve similar coverage, owing to the larger number of edge features.
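The sketch below illustrates this coverage effect on toy data: the fraction of features with non-zero importance grows with the number of trees. The data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=1000, random_state=0)

# Fraction of features that received a non-zero importance (i.e., were
# sampled at least once) increases with the number of trees.
for n_trees in (100, 500, 2000):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X, y)
    coverage = np.mean(rf.feature_importances_ > 0)
    print(n_trees, f"{coverage:.2%}")
```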
Another important issue is that, in order to compare the robustness of edge features and gene features, the number of selected top features needs to be adjusted for each feature type according to the number of features available. This ensures that the same fold enrichment of observed over expected common features corresponds to a similar statistical significance across feature types. As an illustration, assume that we have 10,000 genes and 160,000 edges. If we choose the top 200 genes from two independent sets of patients, we would expect 4 (= 200 × 200 / 10,000) genes in common between the two gene lists by chance. If there were actually 16 genes in common, the fold enrichment would be 16 / 4 = 4, with a p-value of 2.4e-6 (Fisher's exact test based on the hypergeometric distribution). To compare the robustness of gene-based and edge-based features, suppose we also choose the top 200 edge-based features from each dataset. The expected overlap between the two sets of edges is 200 × 200 / 160,000 = 0.25. An actual overlap of one edge would represent the same 4-fold enrichment as in the gene-based features, but it is statistically insignificant (p = 0.22). On the other hand, if we choose edges proportionally (i.e., 200 × 160,000 / 10,000 = 3,200 edges), we would expect 3,200 × 3,200 / 160,000 = 64 common edges by chance, and an actual overlap of 256 edges would represent the same 4-fold enrichment but be far more significant than for the gene-based features (p = 3.4e-78 for edges vs p = 2.4e-6 for genes). In comparison, if we choose 800 edges from each dataset, the expected number of common edges is 4 (the same as for the gene features), and an actual overlap of 16 edges represents a 4-fold enrichment with a significance of 4.1e-6, which is very comparable to the gene-based case.
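The p-values in this example can be reproduced with a hypergeometric tail test, as in the sketch below; the helper function is our own shorthand for the Fisher's exact / hypergeometric computation described above.

```python
from scipy.stats import hypergeom

def overlap_pvalue(n_total, n_top, n_overlap):
    """P(overlap >= n_overlap) when two independent top-n_top lists are
    drawn from n_total features (upper hypergeometric tail)."""
    return hypergeom.sf(n_overlap - 1, n_total, n_top, n_top)

# Numbers from the worked example above:
print(overlap_pvalue(10_000, 200, 16))       # genes: ~2.4e-6
print(overlap_pvalue(160_000, 200, 1))       # same list size for edges: ~0.22
print(overlap_pvalue(160_000, 3_200, 256))   # proportional edge lists: ~3.4e-78
print(overlap_pvalue(160_000, 800, 16))      # matched expected overlap: ~4.1e-6
```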
With the above intuition, we chose different numbers of top features for gene-based and edge-based features, while keeping the expected number of overlaps the same across feature types. Since the expected number of overlaps between two lists of features is calculated as expected # of overlaps = (# of selected top features)² / (total # of features), the number of selected top features is determined by:
$$ X = \sqrt{total\_\#\_of\_features \times expected\_\#\_of\_overlaps} \tag{2} $$
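As a worked illustration of Eq. 2 (the expected-overlap value of 4 is chosen purely for illustration, matching the example above):

```python
import math

# Number of top features per feature type so that the expected chance
# overlap is the same (here, 4) for all three feature types.
expected_overlap = 4
for name, n_total in [("Gene", 12_750), ("CEEdge", 161_042), ("PPIEdge", 180_371)]:
    X = math.sqrt(n_total * expected_overlap)
    print(name, round(X))    # Gene ~226, CEEdge ~803, PPIEdge ~849
```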
Finally, to achieve an unbiased evaluation of robustness, the ACES dataset was randomly partitioned into two disjoint sub-samples with the same number of patients and a similar metastatic/non-metastatic ratio. For each sub-sample, the RF model was trained and top features were picked from the returned feature importance vector. Multiple RF models with different random states (i.e., random seeds) were trained on the same sub-sample to produce importance values for every feature. The number of estimators (decision trees) was set to 300, 500, and 500, and the number of training iterations to 50, 200, and 200, for the Gene, CEEdge, and PPIEdge feature types, respectively. The final importance of a feature was determined as the average of the non-zero values returned by the multiple trained RF models. A feature importance vector was computed for each sub-sample, and the features with the highest importance values were selected. The top X features were selected from the two corresponding sub-samples, and robustness was measured as the ratio of the observed to the expected number of overlaps between the two top-X lists. The splitting of ACES into two sub-samples was repeated 20 times, yielding a vector of 20 overlap ratios for each feature type at a given expectation value (expected number of overlaps).
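A condensed sketch of this procedure is given below, assuming the two stratified sub-samples have already been drawn; the function names and default parameter values are simplifications of the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_features(X, y, x_top, n_trees=300, n_runs=50):
    """Rank features by the average of their non-zero RF importances
    across several differently seeded models; return the top-x indices."""
    imp = np.array([
        RandomForestClassifier(n_estimators=n_trees, random_state=s)
        .fit(X, y).feature_importances_
        for s in range(n_runs)
    ])
    counts = (imp > 0).sum(axis=0)           # how often each feature was scored
    final = np.divide(imp.sum(axis=0), counts,
                      out=np.zeros(X.shape[1]), where=counts > 0)
    return set(np.argsort(final)[-x_top:])

def overlap_ratio(X_a, y_a, X_b, y_b, x_top, n_total):
    """Observed / expected overlap between the top-x lists of two sub-samples."""
    common = top_features(X_a, y_a, x_top) & top_features(X_b, y_b, x_top)
    return len(common) / (x_top ** 2 / n_total)
```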
Gene ontology enrichment analysis
Top genes were selected after bootstrapping 65% sub-samples of the ACES data (drawn with replacement) for 1000 repetitions for each feature type. The number of estimators (i.e., decision trees) for the Gene, CEEdge, and PPIEdge feature types was kept at 1000, 12,630, and 14,123, respectively. The number of estimators for the CEEdge- and PPIEdge-based RF models was scaled linearly with the number of features relative to the gene-based RF model. For instance, the number of estimators for CEEdge was set to (1,000 × 161,042) / 12,750 = 12,630, where 1000 is the number of estimators in the gene-based RF model and 12,750 (161,042) is the number of gene (CEEdge) features. Increasing the number of estimators in the CEEdge- and PPIEdge-based RF models yielded a percentage of non-zero feature importance values similar to that of the gene-based model. The final importance score of a feature (gene or edge) was computed as the average of the non-zero importance values over the 1000 RF repetitions (recall that a zero importance value from an RF model simply means the feature was not sampled during model building). Sorting the final importance values in descending order for each feature type produced the feature ranking vector. The top 500 genes from this ranking were used for the gene ontology enrichment analysis of gene-based features. For the edge feature types, the top 500 genes were selected by pooling genes from the top-ranked edges until 500 distinct genes were obtained, as sketched below. The DAVID Bioinformatics Resources 6.8 [39] web application was used for the gene ontology enrichment analysis.
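A small sketch of the gene-pooling step for edge features (the function name and toy edges are our own):

```python
def pool_genes_from_edges(ranked_edges, n_genes=500):
    """Walk down the edge ranking, collecting each edge's two genes,
    until n_genes distinct genes have been gathered."""
    pooled, seen = [], set()
    for gene_a, gene_b in ranked_edges:          # best-ranked edge first
        for g in (gene_a, gene_b):
            if g not in seen:
                seen.add(g)
                pooled.append(g)
                if len(pooled) == n_genes:
                    return pooled
    return pooled                                # fewer than n_genes available

# Toy usage: edges sorted by importance, highest first.
print(pool_genes_from_edges([("A", "B"), ("B", "C"), ("D", "E")], n_genes=4))
# ['A', 'B', 'C', 'D']
```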