Table 1 Annotation features used in this work

From: Applying negative rule mining to improve genome annotation

   

| Feature name | Description | Examples | Method used: algorithm | Method used: threshold value | Method used: reference | Number of proteins having items of this type | Total number of items found | Average number of items per protein | Total number of attribute values |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Length | Protein length (number of amino acids) binned over four ranges | Small (<120), Medium (>=120, <1000), Large (>=1000, <1500), eXtraLarge (>=1500) | Direct calculation | Not applicable | None | All (55063) | All (55063) | 1 | 4 |
| GC content of the gene | GC content binned over three ranges | L (<=0.4), M (<0.5), H (>=0.5) | Direct calculation | Not applicable | None | 30218* | 30218 | 1 | 3 |
| Isoelectric point | Isoelectric point binned over four ranges | C (aCid, pI <=5.5), NC (Neutral-aCid, 5.0 < pI <=7.0), NL (Neutral-aLkaline, 7.0 < pI <=9.2), L (aLkaline, pI > 9.2) | Direct calculation | Not applicable | None | All (55063) | All (55063) | 1 | 4 |
| Low complexity regions | Percentage of residues predicted to be in low complexity regions, binned over three ranges | High (>=10%), Medium (0–10%), None (0%) | SEG | Default SEG parameters | (Wootton, 1994) | All (55063) | All (55063) | 1 | 3 |
| Disordered regions | Percentage of residues in disordered regions, binned over four ranges | High (>=20%), Medium (10–20%), Low (0–10%), 0 (0%) | DisEMBL | Default DisEMBL parameters | (Linding et al., 2003) | All (55063) | All (55063) | 1 | 4 |
| Coiled coil regions | Presence of coiled coil regions | COILS:+ | COILS | Default COILS parameters | (Lupas, 1997) | 7809 | 7809 | 1 | 1 |
| Structural class derived from secondary structure prediction | Classification of proteins based on the prevalent type of secondary structure | Alpha/beta | Predator | Default Predator parameters | (Frishman and Argos, 1997) | 52711 | 52711 | 1 | 4 |
| Transmembrane segments | Presence and number of transmembrane segments | TM (=transmembrane domains are present), 1 TMs, 12 TMs (the number of TM domains) | TMHMM | Default TMHMM parameters | (Krogh et al., 2001) | 12437 | 24874 | 2 | 52 |
| Signal peptide | Presence of a signal peptide | SignalP:+ | SignalP | Default SignalP parameters | (Bendtsen et al., 2004) | 8066 | 8066 | 1 | 1 |
| Protein localization | Predicted cellular localization | Secretory pathway | TargetP | Default TargetP parameters | (Emanuelsson et al., 2000) | 18186 | 18186 | 1 | 2 |
| SCOP superfamilies | Classification of proteins into superfamilies based on their tertiary structure; corresponds to the third level of the SCOP hierarchy | a.47.3 (Cag-Z) | RPS-BLAST | E-Value 1E-10 | (Lo et al., 2002) | 29562 | 37360 | 1.26 | 1096 |
| InterPro | Sequence domains found by HMM profile searches: a. primary domains; b. IPR domains | IPR003593 (AAA_ATPase domain), PF02985 (PFAM primary domain, HEAT repeat) | BLASTP | E-Value 1E-10 InterProScan | Putin et al. (2006) | a. 43829 b. 42627 | a. 142433 b. 83227 | a. 3.25 b. 1.95 | 18106 |
| EC numbers | Enzyme Commission classification of enzymatic activities | Ec1.1.1.1 | BLASTP | E-Value 1E-10 | (Webb, 1992) | 11869 | 15610 | 1.32 | 1753 |
| COG | Orthologous groups of genes for prokaryotic and eukaryotic organisms | COG0582 (Integrase), KOG1327 (Copine) | RPS-BLAST | E-Value 1E-10 | (Tatusov et al., 2003; Koonin et al., 2004) | 33930 | 50272 | 1.48 | 8048 |
| Keywords | Swiss-Prot or PIR keywords | Ligase | BLASTP | E-Value 1E-10 BLASTP | (Wu et al., 2006), (Wu et al., 2002) | 20128 | 80910 | 4.02 | 632 |
| Functional categories | Two upper levels of the MIPS Functional Catalog | Fc40.20 (Cell fate: aging) | BLASTP | E-Value 1E-10 | (Ruepp et al., 2004) | 32248 | 438699 | 13.60 | 171 |

  1. * For technical reasons GC content values were not available for Arabidopsis thaliana genes at the time of writing.
  2. Features transferred by similarity (type 3, see Methods) are shown in italic.
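
The binned features in Table 1 (length, GC content, low complexity, disorder) can be illustrated with a minimal Python sketch. The function names are hypothetical and not taken from the paper. The thresholds come from the Examples column, but the handling of values that fall exactly on a shared boundary (for instance 10% disorder) is an assumption, since the table writes those ranges only as "0–10%" and "10–20%". The isoelectric point is left out because its printed ranges overlap (pI <=5.5 versus 5.0 < pI).

```python
def bin_length(n_residues: int) -> str:
    """Protein length bins: Small (<120), Medium (<1000), Large (<1500), eXtraLarge."""
    if n_residues < 120:
        return "Small"
    if n_residues < 1000:
        return "Medium"
    if n_residues < 1500:
        return "Large"
    return "eXtraLarge"


def bin_gc_content(gc_fraction: float) -> str:
    """GC content bins: L (<=0.4), M (<0.5), H (>=0.5)."""
    if gc_fraction <= 0.4:
        return "L"
    if gc_fraction < 0.5:
        return "M"
    return "H"


def bin_low_complexity(percent: float) -> str:
    """Low complexity bins: High (>=10%), Medium (above 0%, below 10%), None (0%)."""
    if percent >= 10:
        return "High"
    if percent > 0:
        return "Medium"
    return "None"


def bin_disorder(percent: float) -> str:
    """Disorder bins: High (>=20%), Medium (10-20%), Low (0-10%), 0 (0%).

    Assumption: exactly 10% is assigned to Medium; the table does not say.
    """
    if percent >= 20:
        return "High"
    if percent >= 10:
        return "Medium"
    if percent > 0:
        return "Low"
    return "0"


if __name__ == "__main__":
    # A hypothetical 350-residue protein with 45% GC, 3% low complexity, 25% disorder.
    print(bin_length(350), bin_gc_content(0.45), bin_low_complexity(3.0), bin_disorder(25.0))
```

Each function returns exactly one label per protein, which matches the "Average number of items per protein" value of 1 reported for these features.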