Table 4 Databases/datasets used to predict protein solubility (in chronological order)

From: A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli

| # | Name | Reference | Size: Total | Size: Soluble | Size: Insoluble | Description | URL |
|---|------|-----------|-------------|---------------|-----------------|-------------|-----|
| 1 | Sd957 | [8] | 957 | 285 | 672 | Built from 3 previous datasets: Idicula-Thomas et al. [28], Diaz et al. [20] and Chan et al. [1]. | http://iclab.life.nctu.edu.tw/SCM/downloads.php |
| 2 | PROSO II | [6] | 82,000 | 41,000 | 41,000 | Built from pepcDB and PDB; the largest dataset so far. It is balanced. | http://mips.helmholtz-muenchen.de/prosoII/img/Suppl_files.zip |
| 3 | HGPD | [33] | 17,821 (as of 9 June 2011) | N/A | N/A | Human full-length cDNA. | http://www.HGPD.jp |
| 4 | eSol | [25] | 30,173 | N/A | N/A | A database of the solubility of the entire ensemble of E. coli proteins, based on the ASKA library. | http://www.tanpaku.org/tp-esol/index.php?lang=en |
| 5 | Solpro (SOLP) | [17] | 17,408 | 8,704 | 8,704 | Collected from 4 different sources: PDB, SwissProt, TargetDB and the dataset of “Idicula-Thomas 2006”. Sequence redundancy is removed at a 25% sequence similarity threshold. It is balanced. | http://download.igb.uci.edu/SOLP.fa |
| 6 | PROSO | [19] | 14,000 | 7,000 | 7,000 | Collected by merging 4 datasets: TargetDB, PDB and the datasets of “Idicula-Thomas 2005” and “Idicula-Thomas 2006”. | - |
| 7 | pepcDB | [34] | N/A | N/A | N/A | It stored target and protocol information contributed by Protein Structure Initiative centres, as well as targets imported from the TargetDB database. It has since been replaced by TargetTrack. | http://pepcdb.rcsb.org |
| 8 | Idicula-Thomas 2006 | [27] | 192 | 62 | 139 | Collected from the literature. | - |
| 9 | Idicula-Thomas 2005 | [28] | 174 | 41 | 133 | Collected from the literature. | - |
| 10 | PDB | [35] | 91,359 (as of 11 June 2013) | N/A | N/A | A repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. | http://www.rcsb.org/pdb/ |
| 11 | SPINE | [16] | N/A | N/A | N/A | N/A | http://spine.nesg.org/user_login.cgi?url=http://spine.nesg.org/front_page.cgi? |
| 12 | TargetDB | [36] | 295,041 (as of 29 March 2013) | N/A | N/A | It provided status information on target sequences and tracked their progress through the various stages of protein production and structure determination. It has since been replaced by TargetTrack. | http://targetdb.rcsb.org |
| 13 | TargetTrack | - | 316,424 (as of 14 June 2013) | N/A | N/A | A target registration database that provides information on the experimental progress and status of targets selected for structural determination by the Protein Structure Initiative and other worldwide high-throughput structural biology projects. | http://sbkb.org/tt |
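
Several descriptions above call a dataset "balanced", meaning it holds equal numbers of soluble and insoluble proteins. The following is a minimal Python sketch, not part of the original table, that computes the soluble fraction for the rows reporting all three sizes and checks whether the soluble and insoluble counts add up to the stated total. The dictionary name `DATASETS` and the function `summarize` are hypothetical helpers introduced for illustration; the numbers are copied from the table as printed.

```python
# Counts (total, soluble, insoluble) copied from the table rows that report all three.
# DATASETS and summarize are illustrative names, not part of any published tool.
DATASETS = {
    "Sd957": (957, 285, 672),
    "PROSO II": (82000, 41000, 41000),
    "Solpro (SOLP)": (17408, 8704, 8704),
    "PROSO": (14000, 7000, 7000),
    "Idicula-Thomas 2006": (192, 62, 139),
    "Idicula-Thomas 2005": (174, 41, 133),
}


def summarize(total, soluble, insoluble):
    """Return (soluble fraction of labelled entries, whether the counts sum to total)."""
    labelled = soluble + insoluble
    return soluble / labelled, labelled == total


if __name__ == "__main__":
    for name, counts in DATASETS.items():
        fraction, consistent = summarize(*counts)
        note = "" if consistent else "  (soluble + insoluble != total as printed)"
        print(f"{name:22s} soluble fraction = {fraction:.3f}{note}")
```

A soluble fraction of 0.5 corresponds to a balanced dataset. The consistency flag only reports whether the printed counts agree with each other; it does not say which figure is the correct one.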