Table 1 A summary of key components of studies to predict protein solubility (in chronological order)

From: A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli

| # | Paper | Dataset(s) | Feature selection method(s) | Modeling technique(s) | Web server |
|---|---|---|---|---|---|
| 1 | [7] | Bacterial protein sequences with ‘soluble’ and ‘insoluble’ in NCBI are selected randomly. Size: 5692; Soluble: 2448; Insoluble: 3244 | Wrapper: SVM | Support vector machine | - |
| 2 | [10] | HGPD. E. coli: Size: 5100; Soluble: 1774; Insoluble: 3326. Wheat germ: Size: 2939; Soluble: 1941; Insoluble: 998 | Filter: Student’s t-test | Two techniques: Support vector machine; Sequence pattern-based method | ESPRESSO: http://mbs.cbrc.jp/ESPRESSO |
| 3 | [5] | eSol. Size: 1918; Soluble: 886; Insoluble: 1032 | Two methods: 1. Filter: Student’s t-test; 2. Wrapper: Random forest | Random forest | ProS: http://shark.abl.ku.edu/ProS/ |
| 4 | [8] | Four datasets: Sd957; Dataset Chan et al. [18] (Table 1, row 11); Solpro; PROSO II | - | Two methods: Support vector machine; Scoring card method (SCM) | SCM: http://iclab.life.nctu.edu.tw/SCM/ |
| 5 | [4] | eSol. Size: 1600 | - | Four techniques: 1. Support vector machine; 2. Random forest; 3. Conditional inference trees; 4. Rule ensemble | - |
| 6 | [6] | PROSO II | Wrapper | A two-layer model: 1. Layer 1: Parzen window + logistic regression; 2. Layer 2: Logistic regression | PROSO II: http://mips.helmholtz-muenchen.de/prosoII |
| 7 | [22] | eSol. Size: 1625; Soluble: 843; Insoluble: 782 | - | Decision tree | - |
| 8 | [23] | eSol. Size: 2159; Soluble: 1081; Insoluble: 1078 | Wrapper: SVM | Support vector machine | - |
| 9 | [3] | HGPD. E. coli: Size: 7823; Soluble: 2796; Insoluble: 5027. Wheat germ: Size: 3955; Soluble: 2739; Insoluble: 1216 | Filter: Student’s t-test | Random forest | - |
| 10 | [24] | SOLP | Seven methods: 1. Filter: Information gain; 2. Filter: Gain ratio; 3. Filter: Chi squared; 4. Filter: Symmetrical uncertainty; 5. Wrapper: ReliefF; 6. Wrapper: SVM recursive feature elimination (SvmRfe); 7. Embedded: One attribute rule | Support vector machine | - |
| 11 | [16] | 121 genes from different species were expressed in 6 different vectors. Size: 726; Soluble: 231; Insoluble: 236; Non-expressed: 259 | Feature selection package in LIBSVM: Filter (F-score) + Wrapper (SVM) | Support vector machine | - |
| 12 | [20] | A database collected through literature search. Size: 212; Soluble: 52; Insoluble: 160 | N/A | Logistic regression | http://www.biotech.ou.edu/ |
| 13 | [17] | Solpro | Wrapper | A two-layer model: 1. Layer 1: 20 support vector machines; 2. Layer 2: One support vector machine | SOLpro: http://scratch.proteomics.ics.uci.edu |
| 14 | [25] | eSol | Using histogram | Support vector machine | - |
| 15 | [19] | PROSO | Two methods: 1. Wrapper; 2. Filter: Symmetrical uncertainty | A two-layer model: Layer 1: Support vector machine; Layer 2: Naive Bayes | PROSO: http://mips.helmholtz-muenchen.de/proso/ |
| 16 | [26] | Idicula-Thomas 2006 | N/A | Support vector machine | - |
| 17 | [27] | Idicula-Thomas 2006 | Filter: Unbalanced correlation score | Support vector machine | - |
| 18 | [28] | Idicula-Thomas 2005 | Filter: Mann–Whitney test | Discriminant analysis (a heuristic approach to computing a solubility index (SI)) | - |
| 19 | [29] | Genes of C. elegans with one expression vector and one Escherichia coli strain. Size: 4854; Soluble: 1536; Insoluble: 3318 | Filter: Linear correlation coefficient (LCC) | - | - |
| 20 | [30] | TargetDB. Size: 27,000 | Wrapper: Random forest | Decision tree | - |
| 21 | [14] | SPINE. Size: 562 | Wrapper | Decision tree | - |
| 22 | [31] | SPINE. Size: 356; Soluble: 213; Insoluble: 143 | Embedded: Decision tree | Decision tree | - |
| 23 | [18] | Some genes of E. coli were expressed. Size: 100 | N/A | Regression | - |
| 24 | [9] | Some genes of E. coli were expressed. Size: 81 | N/A | Regression | - |
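
The class balance of the datasets in Table 1 varies considerably, which matters when comparing reported accuracies across studies. The following is a minimal Python sketch, not part of the original review, that checks whether the reported counts add up to each reported size and computes the soluble fraction among proteins with a solubility label. The counts are copied from the table; the names `DATASETS`, `soluble_fraction`, and `summarize` are illustrative assumptions, and the "other" column holds the non-expressed count for row 11.

```python
# Counts copied from Table 1 for rows that report soluble/insoluble numbers.
# Fields: (label, reported size, soluble, insoluble, other).
# "other" is the non-expressed count (row 11 only); all names are illustrative.
DATASETS = [
    ("Row 1, NCBI bacterial sequences", 5692, 2448, 3244, 0),
    ("Row 2, HGPD E. coli", 5100, 1774, 3326, 0),
    ("Row 2, HGPD wheat germ", 2939, 1941, 998, 0),
    ("Row 3, eSol", 1918, 886, 1032, 0),
    ("Row 7, eSol", 1625, 843, 782, 0),
    ("Row 8, eSol", 2159, 1081, 1078, 0),
    ("Row 9, HGPD E. coli", 7823, 2796, 5027, 0),
    ("Row 9, HGPD wheat germ", 3955, 2739, 1216, 0),
    ("Row 11, 121 genes in 6 vectors", 726, 231, 236, 259),
    ("Row 12, literature database", 212, 52, 160, 0),
    ("Row 19, C. elegans genes", 4854, 1536, 3318, 0),
    ("Row 22, SPINE", 356, 213, 143, 0),
]


def soluble_fraction(soluble, insoluble):
    """Fraction of soluble proteins among those with a solubility label."""
    return soluble / (soluble + insoluble)


def summarize(datasets):
    """Return (label, counts_match_size, soluble_fraction) for each dataset."""
    summary = []
    for label, size, soluble, insoluble, other in datasets:
        counts_match = (soluble + insoluble + other) == size
        summary.append((label, counts_match, soluble_fraction(soluble, insoluble)))
    return summary


if __name__ == "__main__":
    for label, counts_match, fraction in summarize(DATASETS):
        status = "ok" if counts_match else "MISMATCH"
        print(f"{label}: counts {status}, soluble fraction {fraction:.3f}")
```

A soluble fraction far from 0.5 indicates an imbalanced dataset, in which case plain accuracy can be misleading and a balanced metric may be more informative.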