
Table 7 Feature importance for the highest-quality Spark supervised algorithms, Decision Trees (DT) and Random Forest (RF). For the alignment-based, alignment-free, and alignment-based + alignment-free feature combinations, the table reports the entropy, the number of nodes that used each feature when building the Random Forest with Random Undersampling (RUS) pre-processing, and the average impurity decrease of the MLlib 2.0 Random Forest with ROS variants. Each Random Oversampling (ROS) variant is labeled with its resampling size value.

From: Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers
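
The importance measures named in the caption (entropy and average impurity decrease under the Gini criterion) can be illustrated with a minimal sketch. It uses the textbook definitions only; it is not the Spark MLlib or Weka implementation, and the exact quantity behind each table column is not specified on this page. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity=gini):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical toy node: 4 ortholog pairs and 4 non-ortholog pairs,
# separated perfectly by a threshold on a single feature.
parent = ["ortholog"] * 4 + ["non-ortholog"] * 4
left = ["ortholog"] * 4
right = ["non-ortholog"] * 4

gini_gain = impurity_decrease(parent, left, right)              # Gini criterion
entropy_gain = impurity_decrease(parent, left, right, entropy)  # entropy criterion
```

In a forest, a feature's average impurity decrease would be obtained by accumulating such decreases over every node that splits on the feature and averaging over the trees; how each library weights and normalizes that sum is an implementation detail not covered here.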

  RUS + DT-Spark Weka RUS + RF-Spark/Gini Weka RF MLlib 2.0-Spark/Gini (Avg. Impurity Decrease)
Entropy Avg. Impurity Decrease Number of Nodes Normal ROS-100 ROS-130 RUS
Alignment-based Features/Algorithm
nw 0.789 0.520 42 0.809 0.180 0.175 0.171
sw 0.982 0.360 802 0.035 0.642 0.647 0.647
profile3 0.783 0.360 417 0.043 0.167 0.167 0.167
profile5 0.732 0.290 235 0.033 0.004 0.001 0.007
profile7 0.712 0.240 330 0.080 0.008 0.010 0.008
Alignment-free Features
aac 0.624 0.400 1891 0.033 0.173 0.171 0.169
Auto_Geary 0.000 0.310 64 0.000 0.000 0.000 0.000
Auto_Moran 0.000 0.320 75 0.000 0.000 0.000 0.000
Auto_Total 0.000 0.370 1124 0.000 0.000 0.000 0.001
CTD 0.408 0.310 1012 0.070 0.134 0.133 0.137
CTD_C 0.566 0.300 1482 0.071 0.060 0.062 0.066
CTD_D 0.407 0.320 1239 0.074 0.030 0.029 0.033
CTD_T 0.529 0.290 1385 0.076 0.028 0.035 0.036
fcm 0.265 0.310 1010 0.012 0.004 0.021 0.021
2-mers 0.158 0.390 954 0.022 0.003 0.003 0.002
2-mers_don’t care ps-1 0.000 0.320 847 0.000 0.000 0.000 0.000
2-mers_don’t care ps-2 0.000 0.310 768 0.001 0.000 0.000 0.000
2-mers_don’t care ps-3 0.000 0.260 772 0.000 0.000 0.000 0.001
3-mers 0.078 0.370 1523 0.064 0.006 0.005 0.006
3-mers_don’t care ps-1 0.000 0.290 600 0.001 0.000 0.000 0.001
3-mers_don’t care ps-2 0.000 0.270 653 0.001 0.000 0.000 0.001
3-mers_don’t care ps-3 0.000 0.270 602 0.002 0.000 0.000 0.001
length 0.507 0.400 2890 0.353 0.166 0.165 0.154
nandy 0.109 0.260 902 0.009 0.000 0.000 0.001
pseaa10 0.000 0.240 825 0.000 0.000 0.000 0.001
pseaa3 0.611 0.380 1397 0.022 0.205 0.202 0.166
pseaa4 0.609 0.380 1652 0.112 0.155 0.156 0.184
QSO_maxlag_30_weight_01 0.280 0.240 1054 0.075 0.035 0.018 0.020
QSOCN_maxlag_30 0.000 0.250 513 0.001 0.000 0.000 0.001
Alignment-based + Alignment-free Features/Algorithm
nw 0.789 0.280 131 0.786 0.382 0.373 0.374
sw 0.987 0.470 646 0.005 0.135 0.139 0.126
profile3 0.769 0.280 271 0.005 0.098 0.101 0.097
profile5 0.727 0.290 230 0.016 0.168 0.168 0.137
profile7 0.710 0.260 229 0.004 0.083 0.084 0.126
aac 0.623 0.190 230 0.015 0.073 0.071 0.072
Auto_Geary 0.000 0.300 11 0.000 0.000 0.000 0.000
Auto_Moran 0.000 0.270 11 0.000 0.000 0.000 0.000
Auto_Total 0.000 0.510 147 0.001 0.000 0.000 0.000
CTD 0.411 0.360 109 0.005 0.000 0.000 0.000
CTD_C 0.570 0.340 204 0.039 0.032 0.032 0.032
CTD_D 0.411 0.390 151 0.009 0.002 0.001 0.001
CTD_T 0.531 0.320 164 0.001 0.002 0.003 0.004
fcm 0.260 0.300 154 0.005 0.000 0.000 0.001
2-mers 0.155 0.200 81 0.003 0.000 0.000 0.000
2-mers_don’t care ps-1 0.000 0.410 104 0.000 0.000 0.000 0.000
2-mers_don’t care ps-2 0.000 0.410 98 0.000 0.000 0.000 0.000
2-mers_don’t care ps-3 0.000 0.400 82 0.001 0.000 0.000 0.000
3-mers 0.074 0.230 97 0.010 0.000 0.000 0.000
3-mers_don’t care ps-1 0.000 0.390 69 0.000 0.000 0.000 0.000
3-mers_don’t care ps-2 0.000 0.340 49 0.001 0.000 0.000 0.000
3-mers_don’t care ps-3 0.000 0.390 59 0.001 0.000 0.000 0.000
length 0.504 0.230 231 0.059 0.012 0.014 0.014
nandy 0.113 0.320 101 0.001 0.000 0.000 0.001
pseaa10 0.000 0.310 97 0.001 0.000 0.000 0.000
pseaa3 0.613 0.190 142 0.009 0.006 0.007 0.004
pseaa4 0.610 0.210 147 0.001 0.005 0.005 0.009
QSO_maxlag_30_weight = 0.1 0.286 0.270 108 0.020 0.001 0.001 0.000
QSO_maxlag_30 0.000 0.340 47 0.000 0.000 0.000 0.000
  1. nw: global alignment; sw: local alignment; profile: physicochemical profile from matching regions of aligned sequences at different window sizes (3, 5 and 7); aac: amino acid composition; pseaa: pseudo-amino acid composition at λ = 3, 4 and 10; Auto_Geary: Geary’s autocorrelation; Auto_Moran: Moran’s autocorrelation; Auto_Total: total autocorrelation; fcm: four-color maps; nandy: Nandy’s descriptors; CTD: Composition, Distribution and Transition (Total); CTD_C: Composition, Distribution and Transition (Composition); CTD_D: Composition, Distribution and Transition (Distribution); CTD_T: Composition, Distribution and Transition (Transition); k-mers: 2-mers and 3-mers; spaced words: 2-mers and 3-mers with “don’t care positions” = 1, 2 and 3; QSO: Quasi-Sequence-Order (w: weight factor; maximum lag = 30)
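
Three of the alignment-free features defined in the footnote (aac, k-mers and spaced words) can be sketched as follows. This is a minimal illustration under stated assumptions, not the feature-extraction code used in the paper: the function names, the binary pattern encoding of “don’t care positions” (`1` = compared position, `0` = ignored position) and the toy sequence are all hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition: fraction of each standard residue in the sequence."""
    n = len(seq)
    if n == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    counts = Counter(seq)
    return {a: counts[a] / n for a in AMINO_ACIDS}


def kmer_counts(seq, k):
    """Counts of all overlapping contiguous k-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spaced_word_counts(seq, pattern):
    """Counts of spaced words: characters are kept where pattern has '1'
    and skipped where it has '0' (the assumed "don't care" encoding)."""
    keep = [i for i, p in enumerate(pattern) if p == "1"]
    span = len(pattern)
    return Counter(
        "".join(seq[i + j] for j in keep) for i in range(len(seq) - span + 1)
    )


toy = "MKVLAAGMK"                          # hypothetical toy protein fragment
composition = aac(toy)                     # 20 fractions
two_mers = kmer_counts(toy, 2)             # contiguous 2-mers
spaced = spaced_word_counts(toy, "101")    # 2-mers with one ignored position
```

A pair-level feature for ortholog detection would then compare the vectors of two proteins, for example with a distance or similarity score; which comparison the paper applies to each descriptor is not stated in this table.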