The annotation of subcellular localization to motifs is easier with patterns than profiles
In this work we systematically assigned subcellular localization information to both TP and FP motif sequences. PROSITE motifs are either pattern motifs, which use a regular expression-like syntax, or matrix motifs, which assign a score to each position in the motif sequence. We find that pattern motifs are better annotated than matrix motifs: only 48 matrix motifs have TP and FP sequences with annotated localizations, versus 277 pattern motifs (Figure 2). In addition, pattern motifs allowed for better discrimination between TP and FP localizations. This suggests that sequence patterns are more robust than complex position weight matrices for this type of analysis.
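To illustrate what "regular expression-like syntax" means for pattern motifs, the following minimal sketch translates a PROSITE-style pattern into a Python regular expression. It is not the tool used in this work. The function name `prosite_to_regex` is hypothetical, the example pattern string is assumed to resemble the ER targeting entry rather than being quoted from PROSITE, and rarer syntax (such as anchors inside square brackets) is not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles: '-' separators, 'x' (any residue), [ABC] (allowed residues),
    {ABC} (forbidden residues), (n) / (n,m) repeats, and '<' / '>' anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # [ABC] and single letters are already valid regex fragments.
        repeat = "{" + lo + ("," + hi if hi else "") + "}" if lo else ""
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# Assumed example pattern: a four-residue motif anchored at the C-terminus.
er_regex = re.compile(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>"))
```

A matrix motif has no comparably simple translation, because every position contributes a score and a match is called only when the summed score exceeds a threshold.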
Functionally-related TP sequences are localized to different compartments from FP sequences
We compared the subcellular compartments assigned to TP and FP motif sequences and found that their distributions differed for 78% of the motifs with sufficient localization data (52% of matrices and 82% of patterns) (Figure 2). These results strongly link subcellular localization and function. They suggest that TP sequence motifs typically evolve in the context of particular cellular compartments and are closely tied to these locations. Protein motifs are chosen because of their strong conservation and usually comprise key residues involved in protein function, e.g. the active site of an enzyme or a protein-protein binding site. In some cases the link with localization may be directly related to function, as with a DNA-binding protein that is functionally linked to the nucleus. In other cases, the link with subcellular localization may be related to the local context of the protein partners necessary for function rather than to the function itself.
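As an illustration of how two compartment distributions can be compared for a single motif, the sketch below computes a chi-square statistic on a 2 x K table of TP and FP counts. This section does not state which test was used, so the chi-square choice is an assumption, as are the function name and the made-up counts. For small counts an exact test would be more appropriate.

```python
from collections import Counter

def chi_square_statistic(tp_locations, fp_locations):
    """Return (statistic, degrees_of_freedom) for TP vs FP compartment counts.

    Each argument is a list with one compartment label per annotated sequence.
    """
    tp, fp = Counter(tp_locations), Counter(fp_locations)
    n_tp, n_fp = sum(tp.values()), sum(fp.values())
    if n_tp == 0 or n_fp == 0:
        raise ValueError("both TP and FP need at least one annotated sequence")
    total = n_tp + n_fp
    compartments = sorted(set(tp) | set(fp))
    stat = 0.0
    for c in compartments:
        column_total = tp[c] + fp[c]
        for observed, row_total in ((tp[c], n_tp), (fp[c], n_fp)):
            # Expected count if TP and FP shared one compartment distribution.
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat, len(compartments) - 1

# Made-up counts for illustration only: TP all nuclear, FP elsewhere.
tp_example = ["nucleus"] * 10
fp_example = ["cell membrane"] * 6 + ["mitochondrion"] * 4
statistic, dof = chi_square_statistic(tp_example, fp_example)
```

The statistic would then be compared against the chi-square distribution with `dof` degrees of freedom to decide whether the TP and FP distributions differ for that motif.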
Subcellular distribution of TP motif sequences reflects functional or evolutionary relationships between subcellular compartments
We found a strong tendency for subcellular compartments to be related when we analyzed TP motif sequences associated with multiple localizations. Linked subcellular regions include compartments with significant protein exchange, such as the cytosol and the nucleus, and compartments related by their origin, such as mitochondria and chloroplasts (Figure 4). Our findings are similar to those of other work in which human proteins (not motifs) were classified by their subcellular localization. The same authors also compared binary relations between compartments identified with the PSLT2 subcellular prediction method using yeast sequences. Their results mostly correspond with the binary relationships we identified by analyzing motifs. The exception is the plasma membrane and extracellular compartments. In contrast to their study, we did find these compartments frequently associated, as might be expected of compartments in direct contact.
The linked compartmentalization of motifs could be due to multiple localizations of individual proteins. However, when we repeated the analysis using only proteins with a single subcellular localization, we observed similar relationships between related compartments (Figure 4B). In addition, both the nucleus and the cytosol appear more often on their own than paired with another compartment, while ER and GA motifs tend to share localization with other compartments (Figure 4). This latter observation is not surprising considering the complex relationships between the ER, the GA and other parts of the cell.
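The binary relationships discussed above can be tallied with a simple co-occurrence count. The sketch below is a minimal illustration, assuming a hypothetical input in which each motif maps to a list of compartment sets, one set per TP protein. The motif identifiers and data are invented, and this is not necessarily the procedure used in this work. The `single_only` flag mimics the repeat analysis restricted to proteins with a single localization.

```python
from collections import Counter
from itertools import combinations

def compartment_pair_counts(motif_hits, single_only=False):
    """Count, over motifs, how often two compartments share a motif.

    motif_hits: dict mapping motif id -> list of sets of compartment labels,
                one set per TP protein carrying that motif.
    single_only: if True, ignore proteins annotated with several compartments.
    """
    pair_counts = Counter()
    for proteins in motif_hits.values():
        if single_only:
            proteins = [p for p in proteins if len(p) == 1]
        # All compartments in which this motif is observed.
        compartments = set().union(*proteins)
        # Each unordered pair of compartments is counted once per motif.
        for pair in combinations(sorted(compartments), 2):
            pair_counts[pair] += 1
    return pair_counts

# Invented data for illustration only.
hits = {
    "MOTIF_A": [{"cytosol"}, {"nucleus"}],
    "MOTIF_B": [{"cytosol", "nucleus"}],
    "MOTIF_C": [{"mitochondrion"}, {"chloroplast"}],
}
all_pairs = compartment_pair_counts(hits)
single_pairs = compartment_pair_counts(hits, single_only=True)
```

In this toy input, the cytosol-nucleus pair survives the single-localization filter because MOTIF_A links the two compartments through different proteins, which is the kind of signal that cannot be explained by multi-localized proteins alone.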
The percentage of multi-compartmental proteins has previously been predicted to be at least 16% in humans. We found only 6.8% of proteins in Swiss-Prot annotated as multi-compartmental according to their keywords (Figure 1). This value could be an underestimate due to incomplete annotation. However, the percentage increases greatly (to 24-35%) when we take into account the compartments assigned to motifs (Figure 4), suggesting a high level of multi-compartmentalization of protein motifs.
Biologically, this could suggest a common origin for motifs that appear in multiple compartments. If a new compartment emerges from another, the related proteins (and their motifs) would also be inherited, as occurs with the ER and GA, and with mitochondria and chloroplasts (Figure 4). However, some of our data suggest that a common origin may not always result in the presence of common motifs. Although an endosymbiotic origin was suggested for peroxisomes, recent work based on both experimental evidence and in silico analysis has suggested that they are derived from the ER. It is therefore surprising that we did not find evidence for a binary relationship between peroxisomes and the ER, even though peroxisomes were associated with mitochondria, chloroplasts and the cytosol. However, when more than two compartments were analyzed, peroxisome motif localization was almost equally related to the ER, mitochondrion and chloroplast (Figure 5). In fact, it has been suggested that peroxisome proteins were recruited from eukaryotic compartments such as mitochondria and chloroplasts, which could explain these relationships.
Remarkably, some subcellular regions were more likely to contain motifs linked to multiple compartments than motifs linked to that region alone. For example, for the ER and GA we found 65 and 58 motifs, respectively, that were also associated with other compartments, versus 27 and 18 motifs restricted to a single compartment (Figure 4). Some compartments, especially the ER, showed a high frequency of motifs associated with multiple additional compartments (Figure 5). This is logical given that the ER is a compartment through which a large number of proteins are transported to other destinations. Some organelles, such as the GA and lysosomes, are in permanent dynamic equilibrium with the ER, from which they originate. The ER also establishes contacts with most other intracellular organelles, including mitochondria, chloroplasts, the GA, the cell membrane, the nucleus, and lysosomes, by means of narrow cytoplasmic gaps called membrane contact sites. For example, organelles derived from endosymbiotic prokaryotes are not connected to the secretory pathway by vesicular traffic, meaning that mitochondria and chloroplasts acquire a large proportion of their lipids from the ER by non-vesicular routes. Thus, polar lipid assembly in plants requires tight co-ordination between the chloroplast and the ER and necessitates inter-organelle lipid trafficking.
Identification of possible functional or evolutionary relationships from the subcellular distribution of FP sequences
False positives are motif-containing sequences that have been assigned a known function distinct from that of the motif protein family. If FP motif sequence similarity is due to random sequence variation, with no functional or evolutionary connection with TP sequences, then we would not expect FP sequences to be linked to particular subcellular localizations in the same way as TP sequences. In fact, we identified several cases where FP sequences were strongly linked to specific subcellular compartments. Such a non-random distribution might suggest that the motif has functional significance in FP proteins. This could indicate sequence convergence if the FP sequences arose independently from TP sequences, or functional divergence if they shared a common ancestor.
For example, when we examined DNA-binding Homeobox domain motif proteins with single localizations, all TP sequences were restricted to the nucleus, while most FP sequences were assigned to the cell membrane and the mitochondrion (Additional file 1). It is very unlikely that membrane proteins have a DNA-binding function, but it is also unlikely that they all possess this motif by chance. This may indicate that, during the evolution of membrane proteins, the same motif evolved independently to perform a different function by sequence convergence. In this case, there might be some kind of molecular or structural similarity with the DNA-binding motif. DNA-binding domains have previously been found almost exclusively in nuclear proteins, but this is not the first time that homeobox domains have been linked with functions unrelated to DNA binding. The ceramide synthase protein LASS2 contains a homeodomain that has been implicated in binding to V-ATPase, a proton-translocating pump located in the cytosolic membranes of vacuoles and lysosomes and in the ER membrane.
Our analysis also revealed other possible examples of sequence convergence. The short ER targeting sequence motif, originally identified in proteins retained by the ER, also appears in a large number of nuclear FP proteins. Interestingly, this four amino acid motif always appears at the C-terminal end of both TP and FP sequences. Most of the nuclear sequences identified are fungal H2A histones (Figure 6B), which are not thought to pass through the ER. This strongly suggests that the ER targeting motif in the nuclear sequences has arisen independently through sequence convergence.
We also identified a number of vacuolar FP sequences with the ER targeting motif in their C-terminal domain. It was originally thought that the “Endoplasmic reticulum targeting sequence” permanently retained sequences within the ER, but it is now known that it is required for the retrieval of proteins back to the ER following vesicular transport to other organelles. Thus, it is possible that the motif still has the ability to target proteins to the ER, but that either divergence from the KDEL motif or competing action from other protein sequences has reduced its activity and allowed the proteins carrying it to accumulate in other cellular compartments such as vacuoles. It is even possible that the ER targeting motif does have a functional role in these proteins but that this role has not yet been identified experimentally. Indeed, the C-terminal KDEL sequence is found in some proteins transported by vesicles from the ER to vacuoles via a Golgi-independent route. Determining the actual origin of these FP sequence motifs would require further analysis and/or experimentation, but these cases highlight the value of our methodology in identifying FP sequences of interest for further study.
Systematic analysis of subcellular localization may help interpret motif annotations
The assignment of true and false positives is based on the available evidence, both of the actual function of the motif and of the individual sequences. The PROSITE database is composed of high quality manually-annotated motifs. Inevitably, these annotations need to be revised and updated periodically in response to new experimental evidence. Localization is likely to be an important line of evidence used by annotators when defining protein function for many motifs, especially in the case of motifs whose function is strongly linked to a particular subcellular organelle. This could be seen as a weakness in our approach because our analysis of subcellular localization may be using the same localization data employed by annotators to assign function to sequences. It is true that care must be taken when interpreting results for motifs whose function is strongly linked to localization. However, the previous example of the ER targeting motif highlights the potential difficulties of using localization to assign function. For example, experimental evidence may be incomplete or misleading. We would argue that a systematic summary of the subcellular localization of FP and TP sequences would aid both annotators and end users in interpreting the value of both a motif and the evidence used to assign function to TP and FP sequences.