
Table 6 Performance measures used to evaluate protein solubility prediction (in alphabetical order)

From: A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli

| # | Name | Abbr. | Formula | Description |
|---|------|-------|---------|-------------|
| 1 | Accuracy | ACC | (TP + TN)/(TP + TN + FP + FN) | The number of correctly classified instances divided by the total number of instances [6]. |
| 2 | Area under ROC curve | AUC | - | Measures the discriminating ability of the model; it ranges from 0.5 for random guessing to 1.0 for a perfect classifier [6]. |
| 3 | Enrichment Factor | EF | [CS/(CS + WS)]/[S/(S + I)] | EF is especially suitable for unbalanced datasets [27]. CS: number of correctly classified soluble proteins; WS: number of soluble proteins wrongly classified as insoluble; S: total number of soluble proteins; I: total number of insoluble proteins. |
| 4 | False Negative | FN | - | The number of positive instances incorrectly predicted as negative [10]. |
| 5 | False Positive | FP | - | The number of negative instances incorrectly predicted as positive [10]. |
| 6 | F-Score | FS | 2 × Precision × Recall/(Precision + Recall) | The harmonic mean of recall and precision [10]. |
| 7 | Gain | GAIN | Precision/(proportion of the given class in the full dataset) | Quantifies how much better the classifier's decisions are than drawing instances at random [6]. |
| 8 | Matthews Correlation Coefficient | MCC | (TP × TN - FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | Indicates the correlation between the classifier's assignments and the actual classes in the two-class case, and it remains a good measure of performance even when classes are unbalanced [6]. MCC ranges from -1 to 1; a large positive value indicates a better prediction [10]. |
| 9 | Precision (Selectivity) | PRC | TP/(TP + FP) or TN/(TN + FN) | The ratio of correctly classified positive (or negative) instances to all instances classified as positive (or negative), for the positive and negative class respectively [6]. |
| 10 | ROC Curve | ROC | TP-rate plotted against FP-rate as the decision threshold is varied from 0 to 1 (e.g., in 0.01 increments) | The receiver operating characteristic curve shows the trade-off between false positives and false negatives when testing a classifier [48]. A larger area under the curve indicates a more robust prediction method [10]. |
| 11 | Recall (Sensitivity, True Positive Rate, TP-rate) | REC | TP/(TP + FN) | The ratio of correctly classified positive instances to all instances of the positive class [6]. |
| 12 | Specificity (True Negative Rate, TN-rate) | SPC | TN/(TN + FP) | The ratio of correctly classified negative instances to all negative instances [6]. |
| 13 | True Positive | TP | - | The number of correctly predicted positives [10]. |
| 14 | True Negative | TN | - | The number of correctly predicted negatives [10]. |

a. TP = True Positive; TN = True Negative; FP = False Positive; FN = False Negative.
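
For readers who want to reproduce these measures, the following is a minimal Python sketch, not code from any of the reviewed tools. It assumes binary labels with soluble proteins as the positive class (1 = soluble, 0 = insoluble); the label vectors are hypothetical, and zero-division guards are omitted for brevity.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = soluble, 0 = insoluble)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    s, i = tp + fn, tn + fp                     # total soluble (S) and insoluble (I)
    prc = tp / (tp + fp)                        # precision for the positive class
    rec = tp / (tp + fn)                        # recall / sensitivity / TP-rate
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "PRC": prc,
        "REC": rec,
        "SPC": tn / (tn + fp),                  # specificity / TN-rate
        "FS": 2 * prc * rec / (prc + rec),      # harmonic mean of precision and recall
        # MCC with the square root in the denominator, as in row 8.
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        # EF (row 3): CS/(CS + WS) over S/(S + I), where CS = TP and WS = FN.
        "EF": rec / (s / (s + i)),
        # Gain (row 7): precision relative to the positive-class prior.
        "GAIN": prc / (s / (s + i)),
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # hypothetical ground-truth labels
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]   # hypothetical classifier output
for name, value in metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

ROC and AUC (rows 2 and 10) need continuous scores rather than hard labels. A sketch of the threshold-sweep construction described in row 10, with trapezoidal integration for the area, might look like this; `scores` are hypothetical predicted probabilities of solubility:

```python
def roc_points(y_true, scores, step=0.01):
    """(FP-rate, TP-rate) pairs as the threshold sweeps from 0 to 1 (row 10)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for k in range(round(1 / step) + 1):
        t = k * step
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        points.append((fp / neg, tp / pos))
    return sorted(points)

def auc(points):
    """Trapezoidal area under the ROC curve; 0.5 = random, 1.0 = perfect."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.7, 0.2]   # hypothetical probabilities
print(f"AUC: {auc(roc_points(y_true, scores)):.3f}")
```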