LEAping to conclusions: A computational reanalysis of late embryogenesis abundant proteins and their possible roles

Wise, Michael J

doi:10.1186/1471-2105-4-52

Research article
Open access
Published: 29 October 2003

Michael J Wise

BMC Bioinformatics volume 4, Article number: 52 (2003)

Abstract

Background

The late embryogenesis abundant (LEA) proteins cover a number of loosely related groups of proteins, originally found in plants but now also being found in non-plant species. Their precise function is unknown, though considerable evidence suggests that LEA proteins are involved in desiccation resistance. Using a number of statistically based bioinformatics tools, the classification of a large set of LEA proteins, covering all Groups, is reexamined together with some previous findings. Searches based on peptide composition return proteins with similar composition to different LEA Groups; keyword clustering is then applied to reveal keywords and phrases suggestive of the Groups' properties.

Results

Previous research has suggested that glycine is characteristic of LEA proteins, but it is only highly over-represented in Groups 1 and 2, while alanine, thought characteristic of Group 2, is over-represented in Groups 3, 4 and 6 but under-represented in Groups 1 and 2. However, for LEA Groups 1, 2 and 3 it is shown that glutamine is very significantly over-represented, while cysteine, phenylalanine, isoleucine, leucine and tryptophan are significantly under-represented. There is also evidence that the Group 4 LEA proteins are more appropriately redistributed to Group 2 and Group 3. Similarly, Group 5 is better found among the Group 3 LEA proteins.

Conclusions

There is evidence that Group 2 and Group 3 LEA proteins, though distinct, might be related. This relationship is also evident in the overlapping sets of keywords for the two Groups, emphasising alpha-helical structure and, at a larger scale, filaments, all of which fits well with experimental evidence that proteins from both Groups are natively unstructured, but become structured under stress conditions. The keywords support localisation of LEA proteins both in the nucleus and associated with the cytoskeleton, and a mode of action similar to chaperones, perhaps the cold shock chaperones, via a role in DNA-binding. In general, non-globular and low-complexity proteins, such as the LEA proteins, pose particular challenges in determining their functions and modes of action. Rather than masking off and ignoring low-complexity domains, novel tools and tool combinations are needed which are capable of analysing such proteins in their entirety.

Background

The late embryogenesis abundant (LEA) proteins cover a number of loosely related groups of proteins whose precise function is unknown. While considerable evidence suggests that LEA proteins are involved in desiccation resistance, a variety of mechanisms for achieving this end have been proposed, including protecting cellular structures from the effects of water loss by retention of water, sequestration of ions, direct protection of other proteins or membranes, or renaturation of unfolded proteins [1–4]. LEA proteins are primarily found in plants, where they were originally found in seeds [5–7], and then in other plant tissues. In addition, a number of putative LEA genes have been found in non-plant species, including the eubacteria Haemophilus influenzae and Bacillus subtilis [8], the extremophile Deinococcus radiodurans [9] and the nematodes Caenorhabditis elegans and Aphelenchus avenae [10]. Most of the literature to date on LEA proteins has been in the form of reports on individual LEA proteins, with general surveys appearing some time ago [1, 11, 12]. The somewhat more recent survey by Close [13] of Group 2 LEA proteins also includes a discussion of predicted secondary structure for this Group.

LEA proteins are generally grouped on the basis of their similarity to prototypical LEA proteins from the cotton plant Gossypium hirsutum. In the Dure naming scheme, LEA protein groups are named after particular G. hirsutum cDNA clones, resulting in Group names such as D7, D11, D19, D95 and D113. Many authors since Dure, however, use an assignment to Groups originating with [12], though revised (and to some extent contradictory) assignments also appear in [3] and [4]. There is, however, a consensus only for three LEA protein groups: Group 1 (D19), Group 2 (also known as dehydrins, D11) and Group 3 (D7). Other LEA protein groups from [12] are Group 4 (D113), Group 5 (D29) and Group 6 (D34). Four of the LEA protein groups are also represented by Pfam [14] domain families:

Small Hydrophilic Plant Seed Protein (PF00477) – Group 1
Dehydrin (PF00257) – Group 2
LEA (PF02987) – Group 3
LEA-1 (PF03760) – Group 4

In addition, there are groups which do not appear in the Bray [1] scheme: Lea5 (D73) and Lea14 (D95) [15], although both are represented by Pfam families: Lea5(D73) by LEA-3, PF03242, and Lea14(D95) by LEA-2, PF03168.

Previous work, using just amino acid percentage composition and the Kyte Doolittle hydrophobicity metric, found that LEA proteins are characterised by a preponderance of hydrophilic amino acids together with high glycine content, resulting in their characterisation as "hydrophilins" [16]. Certain LEA protein Groups are also said to be rich in alanine, but deficient in cysteine and tryptophan [3, 4].
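The two measures mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy peptide are assumptions introduced here, not part of the original analysis, and only the scale values are the published Kyte-Doolittle hydropathy values.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values (positive = hydrophobic, negative = hydrophilic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def percent_composition(seq):
    """Return {residue: percentage} for the residues present in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}

def mean_hydropathy(seq):
    """Mean Kyte-Doolittle value over the whole sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

# Made-up toy peptide, used only to exercise the functions (not a real LEA protein).
toy = "MGKEGGQTE"
print(percent_composition(toy).get("G", 0.0))  # glycine percentage
print(mean_hydropathy(toy))                    # negative means hydrophilic overall
```

A whole-sequence mean is the simplest summary; a sliding-window profile would be the natural extension if regional hydrophilicity mattered.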

However, a significant, though often overlooked, feature of LEA proteins is that the majority are low complexity proteins. This is amply demonstrated through the use of the low complexity sequence demarcation tool, 0j.py [17], which was applied first to all the sequences above 40aa in SwissProt and SpTrEMBL (also called Swall) and then to a database of 112 LEA proteins, which will be described shortly. The sequences in the large database returned a median score of 3, with 13% having a score of 0 and 32% a score greater than 3; a low score implies that the protein has high sequence complexity. By contrast, the LEA sequences had a median score of 11.5, and 80% returned a score greater than 3 (equivalent to a p-value of 1.1 × 10^-25).
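The text does not say which test produced this p-value. One plausible reading, sketched below purely as an assumption, is a one-sided binomial test: if 32% of background sequences score above 3, how likely is it that at least 80% of 112 LEA sequences would do so by chance? The function name and the rounding of 80% of 112 to a whole count are choices made here for illustration.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_lea = 112                        # size of the LEA set (from the text)
background_rate = 0.32             # share of background sequences scoring > 3 (from the text)
k_observed = round(0.80 * n_lea)   # 80% of 112 rounded to a count: an assumption

p_value = binomial_upper_tail(k_observed, n_lea, background_rate)
print(f"{p_value:.2e}")
```

Whatever test was actually used, the point is the same: an excess this large over the background rate is extremely unlikely to arise by chance.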

Low complexity sequences pose a particular problem for local alignment tools such as BLAST, which owe much of their discriminative power to scoring schemes based on the extreme value distribution [18]. For example, [19] compares the efficacy of both BLAST and FASTA with an implementation of the Smith-Waterman algorithm, each both with and without the use of scoring schemes based on the extreme value distribution. The benefit of having statistically based scoring schemes is conclusively demonstrated [19]. However, it is well known that low complexity sequences prejudice extreme value distribution based statistical scoring [20]. The standard way of dealing with low complexity regions in the context of database searches is to mask these off in the query sequence using applications such as SEG [21]. When SEG was run across the set of 112 LEA proteins, 11 high complexity sequences were returned unaltered; the remainder were masked to a greater or lesser extent, with 57 having between 30% and 71% of their amino acids masked. The first effect of masking is to reduce the number of amino acids available for alignment. The second effect is to produce an asymmetry: because only the query sequence is masked, not the target (i.e. database) sequences, the result obtained for an alignment between a masked query and a target sequence depends on which sequence is the query and which is the target.
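To make the first effect of masking concrete, the sketch below masks a query with a simple sliding-window Shannon entropy filter. This is a deliberately simplified stand-in and not SEG's actual algorithm; the window length, entropy threshold, function names and toy sequence are all assumptions chosen for illustration.

```python
from collections import Counter
from math import log2

def window_entropy(window):
    """Shannon entropy (bits) of the residue distribution in a window."""
    n = len(window)
    return -sum((c / n) * log2(c / n) for c in Counter(window).values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in every window whose entropy is below threshold by 'x'."""
    masked = list(seq)
    # Sequences shorter than the window are treated as a single window.
    for start in range(0, max(1, len(seq) - window + 1)):
        segment = seq[start:start + window]
        if window_entropy(segment) < threshold:
            for i in range(start, start + len(segment)):
                masked[i] = "x"
    return "".join(masked)

def masked_fraction(masked_seq):
    """Fraction of positions masked; expects a non-empty string."""
    return masked_seq.count("x") / len(masked_seq)

# Toy query: a diverse stretch followed by a repetitive one (made-up sequence).
query = "ACDEFGHIKLMNPQRSTVWY" + "GGEGGEGGEGGEGGEGGE"
masked = mask_low_complexity(query)
print(masked)
print(masked_fraction(masked))
```

Only the query passes through such a filter in a database search, which is where the asymmetry described above comes from: swapping query and target changes which residues are hidden from the alignment.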

The aims of this resurvey were therefore twofold. The first aim was to create a sizable set of the LEA proteins spanning all the Groups and then, using a number of software tools to lessen the impact of low sequence complexity, to reexamine the classification of this diverse set of proteins. In the light of this process, the previous findings are reviewed and expanded. Secondly, searches based on peptide composition were used to reveal proteins with similar composition to different LEA Groups; keyword clustering was then applied to the lists of search hits to suggest keywords and phrases indicative of the Groups' functions. These are the starting point for current and future experimental work.

Results

The Rules Induced by Supervised Learning Application

The input to the supervised machine learning application, Ripper, for each LEA protein was therefore 13 values (3 hydrophobicity, 3 predicted secondary structure and 7 amino acid class) plus the Group to which the protein had been assigned. The output was a set of rules for classifying putative LEA proteins into Groups based on the 13 values. When working on real-world (i.e. noisy) data, all rule induction algorithms attempt to balance accuracy (correct predictions) with conciseness. At one extreme, one could have 100% accuracy by creating a rule for each input protein; at the other extreme, one can achieve maximum conciseness by having a single rule predicting the largest output category, which would in this case mean categorising every input LEA protein as Group 2. Ripper was run several times until the error on the input set was minimised. Extra conditions were then added by hand to the rules to deal with the misclassified proteins until no further rules could be added without generating other misclassifications. The final rule set, which appears in Table 11, should be understood as operating in a top-down, if .. else if, manner.
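The top-down, if .. else if reading of such a rule set can be sketched as an ordered decision list in which the first matching rule wins. The feature names, thresholds and group labels below are placeholders invented for illustration; they are not the rules of Table 11 and not Ripper's output.

```python
import operator

# Comparison operators that a rule condition may use.
OPS = {"<=": operator.le, ">=": operator.ge}

# Placeholder rules for illustration only; NOT the published rule set.
# Each rule is (label, [(feature, operator, threshold), ...]).
RULES = [
    ("Group X", [("pct_helix", ">=", 60.0), ("pct_charged", ">=", 30.0)]),
    ("Group Y", [("mean_hydropathy", "<=", -1.5)]),
]
DEFAULT_LABEL = "Group Z"  # fall-through label, also a placeholder

def classify(features, rules=RULES, default=DEFAULT_LABEL):
    """Return the label of the first rule whose conditions all hold."""
    for label, conditions in rules:
        if all(OPS[op](features[name], value) for name, op, value in conditions):
            return label
    return default

# Hypothetical feature vector; it satisfies both rules, so rule order decides.
example = {"pct_helix": 70.0, "pct_charged": 35.0, "mean_hydropathy": -2.0}
print(classify(example))
```

Because evaluation stops at the first match, the order of the rules is part of the classifier: a protein satisfying two rules is assigned by whichever appears first.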

The reader will have noticed that the table of Group 2 LEA proteins (Table 2) has been partitioned into three subsets; these correspond to the three rules under which Group 2 proteins are classified using the above rule-set. The rules have been labelled 2a, 2b and 2c. Notice that the Group 2 LEA proteins induced by cold stress are predominantly characterised by Rules 2b and 2c (particularly 2c), while the Group 2 proteins which have been shown not to be up-regulated by cold stress and all the canonical LEA proteins are encompassed by Rule 2a.

Table 1 LEA Protein Group 1 (D19) Exemplar(s): LE19_GOSHI
