A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli

Habibi, Narjeskhatoon; Mohd Hashim, Siti Z; Norouzi, Alireza; Samian, Mohammed Razip

doi:10.1186/1471-2105-15-134

Table 3 Features used to predict protein solubility

#	Paper	Features
1	[7]	1. 2-level triangle CGR
		2. Entropy of “2-level triangle CGR”
		3. Dipeptide composition based on a different mode of pseudo amino acid composition (PseAAC)
		4. Entropy of “dipeptide composition”
2	[10]	Same as row 9 (Reference [3])
3	[5]	1. Counts of aromatic amino acids
		2. Counts of buried amino acids
		3. Counts of hydrogen bonds
		4. Counts of leucine amino acid
		5. Counts of arginine amino acid
		6. Negative charge
		7. Surface composition of amino acids in intracellular proteins of Mesophiles (percent)
		8. Beta-strand indices for beta-proteins
		9. Flexibility parameter for two rigid neighbours
		10. Net charge
		11. Counts of nitrogen atoms
		12. Long range non-bonded energy per atom
		13. Isoelectric point (pI)
		14. Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water
		15. Ratio of negative charge amino acids
		16. Ratio of net charge of protein
		17. Dependence of partition coefficient on ionic strength
4	[8]	Dipeptide composition (400 features)
5	[4]	1. Reduced features (39 features produced by pepstats):
		a. Molecular weight, number of residues, average residue weight, charge and isoelectric point
		b. For each type of amino acid: number, molar percent and DayhoffStat
		c. For each physicochemical class of amino acid: number, molar percent, molar extinction coefficient (A280) and extinction coefficient at 1 mg/ml (A280)
		2. Dimers (2400 features):
		a. Amino acid dimer frequencies, computed considering gaps of 1–5 amino acids
		3. Complete set
		a. Reduced features + Dimers
6	[6]	1. Amino acid frequencies (18 features): R, N, D, C, Q, E, G, H, I, K, M, F, P, S, T, W, Y, V
		2. Dipeptide frequencies (13 features): AK, CV, EG, GN, GH, HE, IH, IW, MR, MQ, PR, TS, WD
7	[22]	1. Monomers, dimers and trimers using 7 different alphabets (18 features)
		2. Sequence-computed features:
		a. Molecular weight
		b. Sequence length
		c. Isoelectric point
		d. GRAVY index
		3. Features used in the work of Niwa et al. [25]
		4. Combination of all the above features 1–3.
8	[23]	1. Coil
		2. Disorder
		3. Hydrophobicity
		4. Hydrophilicity
		5. β-turn
		6. α-helix
9	[3]	1. Nucleotide sequence information:
		a. 1-mer
		b. Frequencies of 64 codons (3-mer)
		c. GC-contents
		2. Amino acid sequence information:
		a. Polypeptide length
		b. Frequencies of 20 single amino acids (1-mer)
		c. Frequencies of 8 chemical property groups
		d. Frequencies of 5 physical property groups
		e. Repeat of amino acids
		f. Repeat of 8 chemical property groups
		g. Repeat of 5 physical property groups
		3. Amino acid structural information:
		a. Frequencies of single amino acids in surface area
		b. Frequencies of 8 chemical property groups in surface area
		c. Frequencies of 5 physical property groups in surface area
		d. Number of transmembrane regions
		e. Disordered regions:
		i. Number of occurrences
		ii. Length
		iii. Proportion
		f. Secondary structures:
		i. Alpha-helix
		ii. Beta-sheet
		iii. Others
10	[24]	1497 features computed by Protein Feature Server (PROFEAT) [32]:
		1. Group 1:
		a. Amino acid composition
		b. Dipeptide composition
		2. Group 2: Autocorrelation 1
		a. Normalized Moreau-Broto autocorrelation
		3. Group 3: Autocorrelation 2
		a. Moran autocorrelation
		4. Group 4: Autocorrelation 3
		a. Geary autocorrelation
		5. Group 5:
		a. Composition
		b. Transition
		c. Distribution
		6. Group 6: Sequence order 1
		a. Sequence-order-coupling number
		b. Quasi-sequence-order descriptors
		7. Group 7: Sequence order 2
		a. Pseudo amino acid descriptors
11	[1]	1. Nucleotide information:
		a. 1-mer
		b. 2-mer
		c. 3-mer
		d. Sequence length
		e. GC content
		2. Amino Acid information:
		a. Features of Wilkinson and Harrison [9]
		b. Features of Idicula-Thomas et al. [27]
		c. Isoelectric point
		d. Peptide statistics
		3. Codon Adaptation Index
		4. PTMs
12	[20]	1. Molecular weight
		2. Cysteine fraction
		3. Hydrophobicity-related parameters:
		a. Fraction of total number of hydrophobic amino acids
		b. Fraction of largest number of contiguous hydrophobic/hydrophilic amino acids
		4. Aliphatic index
		5. Secondary structure-related properties:
		a. Proline fraction
		b. Alpha-helix propensity
		c. Beta-sheet propensity
		d. Turn-forming residue fraction
		e. Alpha-helix propensity/beta-sheet propensity
		6. Protein–solvent interaction related parameters:
		a. Hydrophilicity index
		b. pI
		c. Approximate charge average
		7. Fractions of: Alanine, Arginine, Asparagine, Aspartate, Glutamate, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Serine, Threonine, Tyrosine, Tryptophan and Valine
13	[17]	1. Frequencies of amino acid monomers, dimers and trimers using 7 different alphabets:
		a. Monomer frequencies
		i. [Natural-20:M]
		ii. [ClustEM-17:M]
		iii. [ClustEM-14:M]
		iv. [PhysChem-7:M]
		v. [BlosumSM-8:M]
		vi. [ConfSimi-7:M]
		vii. [Hydropho-5:M]
		b. Dimer frequencies
		i. [PhysChem-7:D]
		ii. [ClustEM-14:D]
		iii. [ClustEM-17:D]
		iv. [BlosumSM-8:D]
		v. [Natural-20:D]
		vi. [ConfSimi-7:D]
		c. Trimer frequencies
		i. [ClustEM-17:T]
		ii. [Hydropho-5:T]
		iii. [ConfSimi-7:T]
		iv. [ClustEM-14:T]
		v. [Natural-20:T]
		2. Features computed directly:
		a. Sequence length
		b. Turn-forming residues fraction
		c. Absolute charge per residue
		d. Molecular weight
		e. GRAVY index
		f. Aliphatic index
		3. Predicted features using the SCRATCH suite of predictors:
		a. Beta residues fraction (Predicted by SSpro)
		b. Alpha residues fraction (Predicted by SSpro)
		c. Number of domains (Predicted by DOMpro)
		d. Exposed residues fraction (Predicted by ACCpro, using a 25% relative solvent accessibility cut-off)
14	[25]	1. Molecular weight
		2. Isoelectric point (pI)
		3. Ratios of each amino acid content
15	[19]	1. For mono-domain proteins:
		a. Word size 1:
		S, IL, M, F, DE, A, C, G, R
		b. Word size 2:
		R + R, R + C, R + E, R + T, N + Q, N + H, N + L, C + S, Q + A, Q + G, Q + I, E + A, E + G, E + K, E + P, E + V, G + P, H + M, L + Y, K + G, K + K, M + G, S + S, T + I, Y + C, Y + I
		c. Word size 3:
		ST + ST + ST, ST + ST + N, ST + DQE + AH, ST + C + ST, G + M + R, G + K + G, G + P + G,
		G + P + N, M + AH + AH, M + C + Y, DQE + G + R, DQE + R + DQE, DQE + M + ST,
		DQE + Y + N, DQE + AH + IV, K + R + IV, K + K + ST, P + DQE + DQE, P + DQE + C,
		IV + G + IV, L + IV + DQE, N + FW + DQE, N + C + P, AH + ST + ST, AH + K + L, C + FW + Y, C + K + C
		2. For multi-domain proteins:
		a. Word size 1:
		R, D, C, E, G, L, K, M, S, W
		b. Word size 2:
		A + Y, A + V, R + N, R + E, R + S, R + Y, N + A, D + M, C + T, Q + A, Q + E, E + D, E + G, E + T, G + I,
		G + F, G + S, H + C, H + M, H + P, L + G, L + S, K + D, K + G, K + L, K + F, P + L, T + L, T + Y, V + R
		c. Word size 3:
		ST + ST + ST, ST + P + DQE, ST + IV + K, R + DQE + FW, R + DQE + IV, R + IV + FW,
		FW + DQE + FW, M + ST + DQE, M + G + AH, M + FW + DQE, DQE + ST + ST,
		DQE + ST + G, DQE + G + K, DQE + IV + R, DQE + IV + L, P + G + ST, IV + ST + P,
		L + K + FW, AH + ST + IV, AH + G + IV, AH + AH + M
16	[26]	1. Aliphatic index
		2. Frequency of occurrence of residues Cysteine (Cys), Glutamic acid (Glu), Asparagine (Asn) and Tyrosine (Tyr)
		3. Reduced class of conformational similarity [CMQLEKRA]
		4. Reduced classes of hydrophobicity [CFILMVW] and [NQSTY]
		5. Reduced classes of BLOSUM50 substitution matrix [CILMV]
		6. The 18 dipeptide composition: [VC], [AE], [VE], [WF], [YF], [AG], [FG], [WG], [HH], [MI], [HK], [KN], [KP], [ER], [YS], [RV], [KY], [TY]
17	[27]	1. Physicochemical properties (6 features):
		a. Length of protein
		b. Hydropathy index (GRAVY)
		c. Aliphatic index
		d. Instability index
		e. Instability index of N-terminus
		f. Net charge
		2. Mono-peptide frequencies (20 features)
		3. Dipeptide frequencies (400 features)
		4. Reduced alphabet set (20 features)
18	[28]	1. Aliphatic index (AI)
		2. Instability index of the N terminus
		3. Frequency of occurrence of Asn, Thr, and Tyr
		4. Tri-peptide score
19	[29]	1. Signal peptide
		2. GRAVY
		3. Transmembrane helices
		4. Number of Cysteines
		5. Anchor peptide
		6. Prokaryotic membrane lipoprotein lipid attachment site
		7. PDB identity
20	[30]	1. General sequence composition
		2. Clusters of orthologous groups (COG) assignment
		3. Length of hydrophobic stretches
		4. Number of low-complexity regions
		5. Number of interaction partners
21	[16]	1. Single residue composition: I, T, Y
		2. Combined amino acid compositions: KR, DE, DENQ
		3. Predicted secondary structure composition: α and coil
		4. Presence of signal sequence
		5. Amino acid sequence length
		6. Number of amino acids in both short and long low complexity regions (over sequence length)
		7. Normalized low complexity value for both short and long regions (over sequence length)
		8. Minimum GES hydrophobicity score calculated over all amino acids in a 20 residue sequence window
22	[31]	1. Hydrophobe
		2. Cplx: a measure of a short complexity region based on the SEG program.
		3. Gln composition
		4. Asp + Glu composition
		5. Ile composition
		6. Phe + Tyr + Trp composition
		7. Gly + Ala + Val + Leu + Ile composition
		8. His + Lys + Arg composition
		9. Trp composition
		10. Alpha-helical secondary structure composition
23	[18]	Same as row 24 (Reference [9])
24	[9]	1. Charge average approximation (Asp, Glu, Lys and Arg)
		2. Turn-forming residue fraction (Asn, Gly, Pro and Ser)
		3. Cysteine fractions
		4. Proline fractions
		5. Hydrophilicity
		6. Molecular weight (Total number of residues)
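
Several of the sequence-derived features listed in the table are simple to compute from an amino acid sequence alone. The following is a minimal sketch, not the implementation used by any of the cited papers. It illustrates the 20 single amino acid frequencies, the 400-feature dipeptide composition (row 4), the turn-forming residue fraction over Asn, Gly, Pro and Ser (row 24), and the aliphatic index. All function names are hypothetical. The net charge per residue shown here, (Lys + Arg - Asp - Glu) / length, is an assumed simplification and may differ from the charge average approximation of reference [9]. The aliphatic index assumes the commonly used Ikai formulation based on mole fractions of Ala, Val, Ile and Leu.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_fractions(seq):
    """Fraction of each standard residue in the sequence (20 features)."""
    seq = seq.upper()
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Frequency of each of the 400 adjacent residue pairs (gap of 0)."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard letters
            counts[pair] += 1
    total = max(len(seq) - 1, 1)  # number of overlapping pairs
    return {pair: c / total for pair, c in counts.items()}


def turn_forming_fraction(seq):
    """Fraction of turn-forming residues: Asn, Gly, Pro and Ser."""
    seq = seq.upper()
    return sum(seq.count(aa) for aa in "NGPS") / len(seq)


def net_charge_per_residue(seq):
    """Assumed simplification: (Lys + Arg - Asp - Glu) / length."""
    seq = seq.upper()
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return (positive - negative) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from mole fractions of Ala, Val, Ile and Leu."""
    f = amino_acid_fractions(seq)
    return 100 * (f["A"] + 2.9 * f["V"] + 3.9 * (f["I"] + f["L"]))


# Toy sequence, chosen only for illustration.
example = "MKRDEGNPSC"
features = {
    "cys_fraction": amino_acid_fractions(example)["C"],
    "turn_forming_fraction": turn_forming_fraction(example),
    "net_charge_per_residue": net_charge_per_residue(example),
    "aliphatic_index": aliphatic_index(example),
}
features.update(dipeptide_composition(example))
```

Returning plain dictionaries keeps the feature names explicit, so individual groups can be merged into one feature vector or dropped during feature selection before training a classifier.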

ISSN: 1471-2105

BMC Bioinformatics
