PRO will be developed by taking a pragmatic approach to populating classes, relations, and annotations. We start with an initial set of types drawn from existing, complementary, curated protein classification resources. Relations between these types are defined following the methodology of the OBO Relation Ontology [22]. Connections to other ontologies are used to formulate annotations of PRO classes. Finally, the results are subjected to manual validation by experts.
An overview of PRO is provided in Figure 1. For brevity, we refer to the protein evolution component as ProEvo and the protein forms component as ProForm.
Protein Evolution component (ProEvo)
The diverse proteins we find today in living organisms can be grouped into protein families, each member of which derives from a common ancestor. Families have built up over time by copying events (speciation or gene duplication), followed by divergence of the copies from each other. This expansion of a protein family can be represented as a bifurcating tree: each bifurcating node represents the copying of an ancestral sequence. These ancestral sequences are now extinct, but they are inferred from the sequences we observe today. Despite the passage of millions of years of divergence, members of each family still share recognizable similarities. It is therefore often possible to infer certain properties of the ancestral protein, such as function, based on the recognizable similarities of its modern descendants.
During the process of protein evolution, there are portions of proteins – called domains – that are usually copied in their entirety, presumably because they represent a minimal functional unit. A protein comprises one or more domains, usually with additional sequences connecting and surrounding them. Note that using our definition of domain, some domains have never combined as modules with another domain (at least as observed thus far). Proteins with similar domain architecture (that is, the same combination of domains in the same order) are said to be homeomorphic. In the case of single-domain proteins, the evolutionary history is identical to (or is a subtree of) that of the domain itself. However, the evolutionary history of multi-domain proteins is more complex: it can only be represented by a single tree as far back as the earliest ancestor that contained the same architecture. Prior to that, one must look to the histories of the constituent domains.
The relationship between a protein and each of its constituent domains can be modeled using the has_part relationship already defined in the OBO Relation Ontology. The relationship is most obvious for multi-domain proteins, but it also holds for single-domain proteins.
One complication is that domains within a multi-domain protein can be lost in one or more lineages (e.g., [23, 24]). In that case, a has_part relationship to the lost domain that holds for the parent class will not hold for the child class. Therefore, we will use a lacks relationship type to describe evolutionary loss in the child lineage [25].
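The interplay between has_part and lacks can be illustrated with a minimal Python sketch. It assumes that a child class inherits the has_part assertions of its parent unless a lacks assertion removes them; the class and domain names are hypothetical placeholders, and the data structure is only one possible way to model the idea, not the PRO implementation.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ProteinClass:
    """A hypothetical ontology class with domain-level relations."""
    name: str
    parent: Optional["ProteinClass"] = None
    has_part: Set[str] = field(default_factory=set)  # domains asserted on this class
    lacks: Set[str] = field(default_factory=set)     # domains lost in this lineage

    def domains(self) -> Set[str]:
        """Domains that hold for this class: inherited plus own, minus losses."""
        inherited = self.parent.domains() if self.parent else set()
        return (inherited | self.has_part) - self.lacks


# Placeholder names; no real family or domain is implied.
parent = ProteinClass("parent_family", has_part={"domainA", "domainB"})
child = ProteinClass("child_subfamily", parent=parent, lacks={"domainB"})

print(sorted(parent.domains()))  # ['domainA', 'domainB']
print(sorted(child.domains()))   # ['domainA']
```

Recording the loss explicitly on the child keeps the parent assertion intact, so other children of the same parent are unaffected.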
The GO molecular function ontology organizes function classes from the general (at the top of the hierarchy) to the specific (at the leaf nodes). In contrast, the hierarchy of ProEvo classes is based on evolutionary relatedness, organized from the distantly related (at the top of the hierarchy) to the more closely related (at the leaf nodes). In many cases the functional and evolutionary classes will overlap. However, consider the case of erythrocyte membrane protein band 4.2 (EPB4.2). This protein is a major component of the red blood cell membrane skeleton [26] that was co-opted from an ancestral protein-glutamine gamma-glutamyltransferase [27], but subsequently lost the ancestral function [28]. In the GO molecular function ontology, the appropriate association for EPB4.2 is "structural constituent of cytoskeleton" (GO:0005200). In PRO, by contrast, its parent is "protein-glutamine gamma-glutamyltransferase." The evolutionary relationship between the human and mouse versions of EPB4.2 and protein-glutamine gamma-glutamyltransferase is represented schematically in Figure 2. The difference in function is not due to gain or loss of specific sections of protein (domains), since all four proteins share end-to-end similarity and common domain architecture. However, two of the residues of the catalytic triad of protein-glutamine gamma-glutamyltransferase [29] are changed in EPB4.2 (data not shown).
Populating ProEvo classes: Resources
Several resources exist that group proteins according to function, sequence or structure-based relatedness. We use four of these resources to guide the initial construction of PRO. Together, these resources represent all of the basic elements of a protein evolutionary ontology outlined above. They provide the set of classes that are most important for one of the primary tasks we wish to accomplish with the evolution component: reliably using experimental data from other organisms to understand human genes. Moreover, each of these four resources has been curated by expert biologists to ensure quality. For clarity, in the description of these resources, we refer to the sets of proteins as "groups" or "families" or "clusters," and the name given to the set as the "class." The section below lists each resource according to the evolutionary relationships for which each approach is most appropriate, from the most distant to the closest.
Structure-based clusters with remote domain homology: SCOP
SCOP (Structural Classification of Proteins) [30] is arranged hierarchically into four levels: class, fold, superfamily and family. Homology (common ancestry) can be asserted for proteins in the same family on the basis of sequence data alone and for proteins in the same superfamily on the basis of three-dimensional (3D) structure data. Proteins in different superfamilies in the same fold group or class have similarities in 3D topology but do not necessarily have a common ancestor. Therefore, only the SCOP superfamily and family data are relevant for the purposes of PRO, with the former defining remote homology (shared ancestry that diverged in the distant past) and the latter defining close homology (shared ancestry that diverged in the more recent past).
Sequence-based clusters with close domain homology: Pfam
Pfam domain families [31] are comparable to SCOP families. However, Pfam contains domain definitions even in the absence of structure information; thus, Pfam represents a superset of SCOP families. Accordingly, we will use Pfam domain families in place of SCOP families to represent the "close" level of evolutionary relatedness for domains.
Clusters of homologous proteins: PIRSF
The PIRSF family classification system provides protein classification from superfamily to subfamily levels in a network structure to reflect the evolutionary relationship between sets of whole proteins and between whole proteins and domains [32]. The primary PIRSF classification unit is the homeomorphic family, whose members are homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture). Basing classification on whole proteins allows annotation of family-specific biological functions, biochemical activities, and sequence features, while an understanding of the domain architecture of a protein provides insight into its general functional and structural properties as well as into complex evolutionary mechanisms.
Functionally-diverged subfamilies: PANTHER
A PANTHER subfamily [33] is defined as a monophyletic group of proteins that have distinct functions as compared to other monophyletic groups in the same protein family. These functional differences can derive from gain and loss of additional domains or from changes in the protein sequence.
Populating ProEvo classes: Mechanism
The initial ProEvo classes will derive from the curated protein clustering resources described above. How one class relates to another therefore reduces to how the corresponding clusters relate to one another, and the problem becomes a mapping exercise. The relationships between SCOP clusters and Pfam clusters already exist, as do the relationships between Pfam, PIRSF, and PANTHER. To facilitate updates and tracking between these initial resources and ProEvo classes, we will use both PRO accessions and IDs, similar to the system used by UniProt [15]. Thus, whenever possible, each PRO class will have an incremented number as its accession (e.g., PRO:00000001) and an ID based on the source database identifier (e.g., PRO:PIRSF000001).
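The dual naming scheme can be sketched in a few lines of Python. The sketch assumes that accessions are sequential integers zero-padded to eight digits (as in the example PRO:00000001) and that the source-derived identifier is formed by prefixing the source database identifier with "PRO:"; the function names and the in-memory counter are hypothetical.

```python
from itertools import count

# Hypothetical in-memory counter; a real system would persist the last value.
_next_number = count(1)


def new_accession() -> str:
    """Return the next sequential accession, zero-padded to eight digits."""
    return f"PRO:{next(_next_number):08d}"


def source_based_id(source_id: str) -> str:
    """Build the source-derived identifier from a source database identifier."""
    return f"PRO:{source_id}"


record = {"accession": new_accession(), "id": source_based_id("PIRSF000001")}
print(record)  # {'accession': 'PRO:00000001', 'id': 'PRO:PIRSF000001'}
```

The accession stays stable for the lifetime of the class, while the source-derived identifier makes it easy to trace a class back to the cluster from which it was populated.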
Updating ProEvo classes
Once the initial mapping is done – and the classes and relationships are verified – the composition of the underlying clusters and how they interrelate will not be of consequence to PRO except as a source of additional nodes. That is, source database changes need not be reflected in the ontology. Consider the example of the hexokinase family of proteins, which includes xylulokinase [34]. Suppose the initial population of PRO classes yields xylulokinase is_a hexokinase, and the source database is subsequently modified such that the original xylulokinase family is renamed to ketopentose kinase after adding ribulokinases. The original relation still holds even though the xylulokinase family no longer exists, so the PRO xylulokinase class does not get deleted. Instead, a new level – ketopentose kinase – could be inserted between xylulokinase and hexokinase.
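The update described in this example can be sketched as follows. The sketch assumes a simplified hierarchy in which each class has a single is_a parent, stored as a child-to-parent dictionary; the helper names are hypothetical.

```python
# Child -> parent is_a links; class names follow the example in the text.
is_a = {"xylulokinase": "hexokinase"}


def insert_level(links: dict, child: str, new_parent: str) -> None:
    """Insert new_parent between child and its current parent."""
    old_parent = links[child]
    links[new_parent] = old_parent
    links[child] = new_parent


def ancestors(links: dict, term: str) -> list:
    """Walk the is_a chain from term up to the root."""
    chain = []
    while term in links:
        term = links[term]
        chain.append(term)
    return chain


insert_level(is_a, "xylulokinase", "ketopentose kinase")
print(ancestors(is_a, "xylulokinase"))  # ['ketopentose kinase', 'hexokinase']
```

Because is_a is transitive, the original assertion that xylulokinase is_a hexokinase still holds after the insertion, and no existing class has to be deleted.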
Protein Forms component (ProForm)
A number of different protein forms can be derived from a single gene. Protein databases typically represent only one reference sequence for a gene product, and do not have separate entries for mutations that can give rise to disease, for different forms that arise through variations in splicing, or for post-translational modifications. For example, cleavage of a signal peptide is needed for protein secretion. Also, specific residues can be covalently modified with a variety of chemical moieties. Some proteins engage in cyclic processes that involve, for example, phosphorylation and dephosphorylation. Distinguishing these various forms of a given gene product is critical for precise annotation. For example, many diseases are not caused by the "normal" protein, but by a genetic variant. Also, a protein can activate a process when in its phosphorylated form, but inhibit that same process when unphosphorylated. Such nuances cannot be captured with the existing ontologies. Therefore, PRO allows for the definition of sequence forms arising from genetic, splice, and translational variation, and from post-translational cleavage and modification.
Relationships between protein forms will be simple and direct and make use of existing relations whenever possible. We will use the OBO Relation Ontology as a source of well-defined relationships, adding further relationships only as needed. For example, it is biologically reasonable to say that the product of a post-translational modification is modified_from the initial protein. However, introducing such a new relationship adds complexity to the system and hinders the possible interconnections with other ontologies. In fact, modified_from is just a more specific way of asserting that "new entity created_from old entity" or "new entity derives_from old entity." The three relations convey essentially the same idea, but only the last, derives_from, is already part of the core set of relations [22]. Note, however, that derives_from does not accurately describe the relationship between two variations of the same gene product, nor does the is_a relation. Therefore, we use a new relation, variant_of, for this situation.
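To make the choice of relations concrete, the following Python sketch renders two minimal OBO-format term stanzas, one using derives_from for a modification product and one using variant_of for a sequence variant. The identifiers and names are hypothetical, and the stanzas contain only the tags needed for illustration.

```python
def term_stanza(term_id: str, name: str, relationships=()) -> str:
    """Render a minimal OBO-format [Term] stanza as text."""
    lines = ["[Term]", f"id: {term_id}", f"name: {name}"]
    lines += [f"relationship: {rel} {target}" for rel, target in relationships]
    return "\n".join(lines)


# Hypothetical identifiers and names, for illustration only.
modified = term_stanza(
    "PRO:00000002",
    "protein X, phosphorylated form",
    [("derives_from", "PRO:00000001")],
)
variant = term_stanza(
    "PRO:00000003",
    "protein X, sequence variant",
    [("variant_of", "PRO:00000001")],
)

print(modified)
print()
print(variant)
```

Reusing derives_from keeps the modification product interoperable with other ontologies that rely on the core relations, while variant_of is reserved for the one case those relations do not cover.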
Populating ProForm classes: Resources
Both the richness and the usefulness of an ontology stem from the diversity and comprehensiveness of its classes. Accordingly, we intend to capture the diverse forms that a protein can take. UniProtKB/Swiss-Prot contains information on sequence variants due to mutation, alternative splicing, or protein cleavage, and on post-translational modification. These data, found within the controlled vocabulary of FT (feature) lines or free text of CC (comment) lines, have been used to populate the appropriate classes. Other sources of data from which information can be computationally extracted include MGI [35] and iProClass [36].
Populating ProForm classes: Mechanism
We have developed a parser to transform information from the sources indicated above into OBO format nodes and relationships. The parser captures experimentally verified biological entities, ignoring those labeled as "by similarity," "potential," or "probable." The parser considers three kinds of entities: isoforms, variants, and cleavage and modification products. For example, post-translational modification nodes are automatically populated from UniProtKB based on the FT field and from iProClass based on the PIR Feature and Post Translational Modification fields. Automatically populated ProForm terms are verified by a curator and edited using OBO-Edit. Additional terms are added as necessary after curator review of the literature.
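The filtering step can be sketched in Python as follows. This is not the parser itself: it assumes that each feature has already been reduced to a (feature key, position, description) tuple, and the records shown are hypothetical examples written in the style of UniProtKB feature lines.

```python
# Qualifiers that mark non-experimental annotations, as listed in the text.
NON_EXPERIMENTAL = ("by similarity", "potential", "probable")


def is_experimental(description: str) -> bool:
    """Return False if the description carries a non-experimental qualifier.

    Simple case-insensitive substring matching is an assumption of this
    sketch; a real parser would need to match the qualifier field exactly.
    """
    text = description.lower()
    return not any(qualifier in text for qualifier in NON_EXPERIMENTAL)


# Hypothetical feature records: (feature key, position, description).
features = [
    ("MOD_RES", 15, "Phosphoserine"),
    ("MOD_RES", 42, "Phosphothreonine (By similarity)"),
    ("CARBOHYD", 88, "N-linked (GlcNAc...) (Potential)"),
]

kept = [feature for feature in features if is_experimental(feature[2])]
print(kept)  # [('MOD_RES', 15, 'Phosphoserine')]
```

Only the retained records would go on to become candidate ProForm nodes for curator review.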