
Table 7 Feature importance calculated for the highest-quality Spark supervised algorithms, Decision Trees (DT) and Random Forest (RF). For the alignment-based, alignment-free, and alignment-based + alignment-free feature combinations, the table presents the entropy, the number of nodes that included each feature when building the Random Forest with RUS pre-processing, and the average impurity decrease of the MLlib 2.0 Random Forest with ROS variants. Each Random Oversampling (ROS) pre-processing variant is accompanied by its resampling size value.

From: Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers

 

| Feature | RUS + DT-Spark Weka: Entropy | RUS + RF-Spark/Gini Weka: Avg. Impurity Decrease | RUS + RF-Spark/Gini Weka: Number of Nodes | RF MLlib 2.0-Spark/Gini (Avg. Impurity Decrease): Normal | RF MLlib 2.0-Spark/Gini (Avg. Impurity Decrease): ROS-100 | RF MLlib 2.0-Spark/Gini (Avg. Impurity Decrease): ROS-130 | RF MLlib 2.0-Spark/Gini (Avg. Impurity Decrease): RUS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Alignment-based Features/Algorithm | | | | | | | |
| nw | 0.789 | 0.520 | 42 | 0.809 | 0.180 | 0.175 | 0.171 |
| sw | 0.982 | 0.360 | 802 | 0.035 | 0.642 | 0.647 | 0.647 |
| profile3 | 0.783 | 0.360 | 417 | 0.043 | 0.167 | 0.167 | 0.167 |
| profile5 | 0.732 | 0.290 | 235 | 0.033 | 0.004 | 0.001 | 0.007 |
| profile7 | 0.712 | 0.240 | 330 | 0.080 | 0.008 | 0.010 | 0.008 |
| Alignment-free Features | | | | | | | |
| aac | 0.624 | 0.400 | 1891 | 0.033 | 0.173 | 0.171 | 0.169 |
| Auto_Geary | 0.000 | 0.310 | 64 | 0.000 | 0.000 | 0.000 | 0.000 |
| Auto_Moran | 0.000 | 0.320 | 75 | 0.000 | 0.000 | 0.000 | 0.000 |
| Auto_Total | 0.000 | 0.370 | 1124 | 0.000 | 0.000 | 0.000 | 0.001 |
| CTD | 0.408 | 0.310 | 1012 | 0.070 | 0.134 | 0.133 | 0.137 |
| CTD_C | 0.566 | 0.300 | 1482 | 0.071 | 0.060 | 0.062 | 0.066 |
| CTD_D | 0.407 | 0.320 | 1239 | 0.074 | 0.030 | 0.029 | 0.033 |
| CTD_T | 0.529 | 0.290 | 1385 | 0.076 | 0.028 | 0.035 | 0.036 |
| fcm | 0.265 | 0.310 | 1010 | 0.012 | 0.004 | 0.021 | 0.021 |
| 2-mers | 0.158 | 0.390 | 954 | 0.022 | 0.003 | 0.003 | 0.002 |
| 2-mers_don’t care ps-1 | 0.000 | 0.320 | 847 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2-mers_don’t care ps-2 | 0.000 | 0.310 | 768 | 0.001 | 0.000 | 0.000 | 0.000 |
| 2-mers_don’t care ps-3 | 0.000 | 0.260 | 772 | 0.000 | 0.000 | 0.000 | 0.001 |
| 3-mers | 0.078 | 0.370 | 1523 | 0.064 | 0.006 | 0.005 | 0.006 |
| 3-mers_don’t care ps-1 | 0.000 | 0.290 | 600 | 0.001 | 0.000 | 0.000 | 0.001 |
| 3-mers_don’t care ps-2 | 0.000 | 0.270 | 653 | 0.001 | 0.000 | 0.000 | 0.001 |
| 3-mers_don’t care ps-3 | 0.000 | 0.270 | 602 | 0.002 | 0.000 | 0.000 | 0.001 |
| length | 0.507 | 0.400 | 2890 | 0.353 | 0.166 | 0.165 | 0.154 |
| nandy | 0.109 | 0.260 | 902 | 0.009 | 0.000 | 0.000 | 0.001 |
| pseaa10 | 0.000 | 0.240 | 825 | 0.000 | 0.000 | 0.000 | 0.001 |
| pseaa3 | 0.611 | 0.380 | 1397 | 0.022 | 0.205 | 0.202 | 0.166 |
| pseaa4 | 0.609 | 0.380 | 1652 | 0.112 | 0.155 | 0.156 | 0.184 |
| QSO_maxlag_30_weight_01 | 0.280 | 0.240 | 1054 | 0.075 | 0.035 | 0.018 | 0.020 |
| QSOCN_maxlag_30 | 0 | 0.250 | 513 | 0.001 | 0.000 | 0.000 | 0.001 |
| Alignment-based + Alignment-free Features/Algorithm | | | | | | | |
| nw | 0.789 | 0.280 | 131 | 0.786 | 0.382 | 0.373 | 0.374 |
| sw | 0.987 | 0.470 | 646 | 0.005 | 0.135 | 0.139 | 0.126 |
| profile3 | 0.769 | 0.280 | 271 | 0.005 | 0.098 | 0.101 | 0.097 |
| profile5 | 0.727 | 0.290 | 230 | 0.016 | 0.168 | 0.168 | 0.137 |
| profile7 | 0.710 | 0.260 | 229 | 0.004 | 0.083 | 0.084 | 0.126 |
| aac | 0.623 | 0.190 | 230 | 0.015 | 0.073 | 0.071 | 0.072 |
| Auto_Geary | 0.000 | 0.300 | 11 | 0.000 | 0.000 | 0.000 | 0.000 |
| Auto_Moran | 0.000 | 0.270 | 11 | 0.000 | 0.000 | 0.000 | 0.000 |
| Auto_Total | 0.000 | 0.510 | 147 | 0.001 | 0.000 | 0.000 | 0.000 |
| CTD | 0.411 | 0.360 | 109 | 0.005 | 0.000 | 0.000 | 0.000 |
| CTD_C | 0.570 | 0.340 | 204 | 0.039 | 0.032 | 0.032 | 0.032 |
| CTD_D | 0.411 | 0.390 | 151 | 0.009 | 0.002 | 0.001 | 0.001 |
| CTD_T | 0.531 | 0.320 | 164 | 0.001 | 0.002 | 0.003 | 0.004 |
| fcm | 0.260 | 0.300 | 154 | 0.005 | 0.000 | 0.000 | 0.001 |
| 2-mers | 0.155 | 0.200 | 81 | 0.003 | 0.000 | 0.000 | 0.000 |
| 2-mers_don’t care ps-1 | 0.000 | 0.410 | 104 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2-mers_don’t care ps-2 | 0.000 | 0.410 | 98 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2-mers_don’t care ps-3 | 0.000 | 0.400 | 82 | 0.001 | 0.000 | 0.000 | 0.000 |
| 3-mers | 0.074 | 0.230 | 97 | 0.010 | 0.000 | 0.000 | 0.000 |
| 3-mers_don’t care ps-1 | 0.000 | 0.390 | 69 | 0.000 | 0.000 | 0.000 | 0.000 |
| 3-mers_don’t care ps-2 | 0.000 | 0.340 | 49 | 0.001 | 0.000 | 0.000 | 0.000 |
| 3-mers_don’t care ps-3 | 0.000 | 0.390 | 59 | 0.001 | 0.000 | 0.000 | 0.000 |
| length | 0.504 | 0.230 | 231 | 0.059 | 0.012 | 0.014 | 0.014 |
| nandy | 0.113 | 0.320 | 101 | 0.001 | 0.000 | 0.000 | 0.001 |
| pseaa10 | 0.000 | 0.310 | 97 | 0.001 | 0.000 | 0.000 | 0.000 |
| pseaa3 | 0.613 | 0.190 | 142 | 0.009 | 0.006 | 0.007 | 0.004 |
| pseaa4 | 0.610 | 0.210 | 147 | 0.001 | 0.005 | 0.005 | 0.009 |
| QSO_maxlag_30_weight = 0.1 | 0.286 | 0.270 | 108 | 0.020 | 0.001 | 0.001 | 0.000 |
| QSO_maxlag_30 | 0.000 | 0.340 | 47 | 0.000 | 0.000 | 0.000 | 0.000 |

  1. nw: global alignment; sw: local alignment; profile: physicochemical profile from matching regions of aligned sequences at different window sizes (3, 5 and 7); aac: amino acid composition; pseaa: pseudo-amino acid composition at λ = 3, 4 and 10; Auto_Geary: Geary’s autocorrelation; Auto_Moran: Moran’s autocorrelation; Auto_Total: total autocorrelation; fcm: four-color maps; nandy: Nandy’s descriptors; CTD: Composition, Distribution and Transition (Total); CTD_C: Composition, Distribution and Transition (Composition); CTD_D: Composition, Distribution and Transition (Distribution); CTD_T: Composition, Distribution and Transition (Transition); k-mers: 2-mers and 3-mers; spaced words: 2-mers with “don’t care positions” (ps) = 1, 2 and 3, and 3-mers with “don’t care positions” = 1, 2 and 3; QSO: Quasi-Sequence-Order, with w = weight factor and maximum lag = 30.