From: Applying negative rule mining to improve genome annotation
Feature name | Description | Examples | Algorithm (method used) | Threshold value used | Reference | Number of proteins having items of this type | Total number of items found | Average number of items per protein | Total number of attribute values |
---|---|---|---|---|---|---|---|---|---|
Length | Protein length (number of amino acids) binned over four ranges | Small (<120), Medium (>=120, <1000), Large (>=1000, <1500), eXtraLarge (>=1500) | Direct calculation | Not applicable | None | All (55063) | All (55063) | 1 | 4 |
GC content of the gene | The value of the GC content binned over three ranges | L (<=0.4), M (>0.4, <0.5), H (>=0.5) | Direct calculation | Not applicable | None | 30218* | 30218 | 1 | 3 |
Isoelectric point | The value of the isoelectric point binned over 4 ranges | C (aCid, pI <=5.5), NC (Neutral-aCid, 5.0 < pI <=7.0), NL (Neutral-aLkaline, 7.0 < pI <=9.2), L (aLkaline, pI > 9.2) | Direct calculation | Not applicable | None | All (55063) | All (55063) | 1 | 4 |
Low complexity regions | Percentage of residues predicted to be in low complexity regions binned over three ranges | High (>=10%), Medium (0–10%), None (0%) | SEG | Default SEG parameters | (Wootton, 1994) | All (55063) | All (55063) | 1 | 3 |
Disordered regions | Percentage of residues in disordered regions binned over 4 ranges | High (>=20%), Medium (10–20%), Low (0–10%), 0 (0%) | DisEMBL | Default DisEMBL parameters | (Linding et al., 2003) | All (55063) | All (55063) | 1 | 4 |
Coiled coil regions | Presence of coiled coil regions | COILS:+ | COILS | Default COILS parameters | (Lupas, 1997) | 7809 | 7809 | 1 | 1 |
Structural class derived from secondary structure prediction | Classification of proteins based on the prevalent type of secondary structure | Alpha/beta | Predator | Default Predator parameters | (Frishman and Argos, 1997) | 52711 | 52711 | 1 | 4 |
Transmembrane segments | Presence and number of transmembrane segments | TM (=transmembrane domains are present), 1 TMs, 12 TMs (the number of TM domains) | TMHMM | Default TMHMM parameters | (Krogh et al., 2001) | 12437 | 24874 | 2 | 52 |
Signal peptide | The presence of the signal peptide | SignalP:+ | SignalP | Default SignalP parameters | (Bendtsen et al., 2004) | 8066 | 8066 | 1 | 1 |
Protein localization | Predicted cellular localization | Secretory pathway | TargetP | Default TargetP parameters | (Emanuelsson et al., 2000) | 18186 | 18186 | 1 | 2 |
SCOP superfamilies | Classification of proteins into superfamilies based on their tertiary structure; corresponds to the third level of the SCOP hierarchy | a.47.3 (Cag-Z) | RPS-BLAST | E-Value 1E-10 | (Lo et al., 2002) | 29562 | 37360 | 1.26 | 1096 |
InterPro | Sequence domains found by HMM profile searches: a. primary domains; b. IPR domains | IPR003593 (AAA_ATPase domain), PF02985 (PFAM primary domain, HEAT repeat) | BLASTP | E-Value 1E-10 InterProScan | (Putin et al., 2006) | a. 43829 b. 42627 | a. 142433 b. 83227 | a. 3.25 b. 1.95 | 18106 |
EC numbers | Enzyme Commission Classification of enzymatic activities | Ec1.1.1.1 | BLASTP | E-Value 1E-10 | (Webb, 1992) | 11869 | 15610 | 1.32 | 1753 |
COG | Orthologous groups of genes for prokaryotic and eukaryotic organisms | COG0582 (Integrase), KOG1327 (Copine) | RPS-BLAST | E-Value 1E-10 | (Tatusov et al., 2003; Koonin et al., 2004) | 33930 | 50272 | 1.48 | 8048 |
Keywords | Swiss-Prot or PIR keywords | Ligase | BLASTP | E-Value 1E-10 BLASTP | (Wu et al., 2006; Wu et al., 2002) | 20128 | 80910 | 4.02 | 632 |
Functional categories | Two upper levels of the MIPS Functional Catalog | Fc40.20 (Cell fate: aging) | BLASTP | E-Value 1E-10 | (Ruepp et al., 2004) | 32248 | 438699 | 13.60 | 171 |
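
The "Direct calculation" and percentage-based rows above turn continuous values into a small set of categorical attribute values. A minimal Python sketch of that binning follows, using the thresholds listed for protein length, GC content, and low complexity regions. The function names and the `Feature:Value` item strings are hypothetical, chosen here for illustration only, and the M bin for GC content is assumed to cover values strictly between 0.4 and 0.5. The isoelectric point is left out because the listed pI ranges for C and NC overlap.

```python
from typing import List, Optional


def bin_length(n_residues: int) -> str:
    """Map a protein length (number of amino acids) to one of four bins."""
    if n_residues < 120:
        return "Small"
    if n_residues < 1000:
        return "Medium"
    if n_residues < 1500:
        return "Large"
    return "eXtraLarge"


def bin_gc_content(gc: float) -> str:
    """Map a GC fraction (0..1) to L, M or H.

    Assumption: M covers 0.4 < gc < 0.5, i.e. everything between L and H.
    """
    if gc <= 0.4:
        return "L"
    if gc < 0.5:
        return "M"
    return "H"


def bin_low_complexity(percent: float) -> str:
    """Map the percentage of low-complexity residues to High, Medium or None."""
    if percent >= 10:
        return "High"
    if percent > 0:
        return "Medium"
    return "None"


def protein_items(
    n_residues: int,
    low_complexity_pct: float,
    gc: Optional[float] = None,
) -> List[str]:
    """Build the categorical items for one protein.

    GC content is optional because the table reports it for only a
    subset of proteins (30218 of 55063). The item string format is a
    hypothetical convention, not taken from the source.
    """
    items = [
        f"Length:{bin_length(n_residues)}",
        f"LowComplexity:{bin_low_complexity(low_complexity_pct)}",
    ]
    if gc is not None:
        items.append(f"GC:{bin_gc_content(gc)}")
    return items


if __name__ == "__main__":
    print(protein_items(350, 12.5, gc=0.45))
```

Each protein contributes exactly one item per binned feature, which matches the "Average number of items per protein" of 1 reported for these rows.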