Automatically extracting functionally equivalent proteins from SwissProt

Background There is a frequent need to obtain sets of functionally equivalent homologous proteins (FEPs) from different species. While orthology usually implies functional equivalence, this is not always true; therefore datasets of orthologous proteins are not appropriate. The information relevant to extracting FEPs is contained in databanks such as UniProtKB/Swiss-Prot, and a manual analysis of these data allows FEPs to be extracted on a one-off basis. However, there has been no resource allowing the easy, automatic extraction of groups of FEPs – for example, all instances of protein C. We have developed FOSTA, an automatically generated database of FEPs annotated as having the same function in UniProtKB/Swiss-Prot, which can be used for large-scale analysis. The method builds a candidate list of homologues and filters out functionally diverged proteins on the basis of functional annotations using a simple text mining approach.
Results Large-scale evaluation of our FEP extraction method is difficult as there is no gold-standard dataset against which the method can be benchmarked. However, a manual analysis of five protein families confirmed a high level of performance. A more extensive comparison with two manually verified functional equivalence datasets also demonstrated very good performance.
Conclusion In summary, FOSTA provides an automated analysis of annotations in UniProtKB/Swiss-Prot that enables groups of proteins already annotated as functionally equivalent to be extracted. Our results demonstrate that the vast majority of UniProtKB/Swiss-Prot functional annotations are of high quality, and that FOSTA can interpret annotations successfully. Where FOSTA is not successful, we are able to highlight inconsistencies in UniProtKB/Swiss-Prot annotation. Most of these would have presented equal difficulties for manual interpretation of annotations. We discuss limitations and possible future extensions to FOSTA, and recommend changes to the UniProtKB/Swiss-Prot format that would facilitate text mining of UniProtKB/Swiss-Prot.

Without specific information about how and why these proteins were annotated by the respective species annotation communities, it is not clear whether the annotations are misleading, or whether the FOSTA results are incorrect. There are only two trypsin proteins of adequate sequence similarity found in Aedes aegypti: TRY5_AEDAE and TRY3_AEDAE. TRY3_AEDAE is equivalent to TRY3_HUMAN and there is no human trypsin-5 protein in UniProtKB/SwissProt version 53.0, so the assignment here appears sensible.
In Lucilia cuprina, two trypsin proteins are of sufficient sequence similarity: TRYA3_LUCCU, which has been identified as the FEP of TRY1_HUMAN, and TRYA4_LUCCU, which has been identified as the FEP of TRY3_HUMAN. This is a difficult assignment to assess, particularly as TRYA3_LUCCU is a fragmented protein. It is worth noting that these five questionable trypsin proteins are derived from insect species: LUCCU, DROME and DROER are flies, AEDAE is a mosquito and MANSE is a moth. It may be that trypsin genes have duplicated and diverged in insect species.
In addition to the trypsin molecules, FOSTA identifies GRAG_MOUSE, VSP1_BOTJR and VSP1M_TRIST as FEPs because they are described as serine proteases, as is TRY1_HUMAN. All mouse proteins explicitly described as trypsin belong to other FOSTA families, with protein prefix matches. There are no trypsin proteins for Bothrops jararacussu (BOTJR) or Trimeresurus stejnegeri (TRIST), but again it is unclear whether the assignment is correct or not.
subfamilies respectively. All proteins assigned to a different subfamily may be misassigned. The UniProtKB/SwissProt family/domain classifications are manually confirmed, which suggests that in the case of DDX51_HUMAN, the candidate FEPs are so similar that FOSTA finds it difficult to discriminate between them. It should be stressed that a manual analysis of UniProtKB/SwissProt entries for this family is no more effective than FOSTA, and that where FOSTA is incorrect in the DDX51_HUMAN assignments, the proteins are fragments and are flagged as potentially unreliable.
The results for human glucose-6-phosphate isomerase (G6PI_HUMAN, [Swiss-Prot:P06744]) appear very robust: 309 FEPs are identified, of which two are fragments. All of these proteins are glucose-6-phosphate isomerases. Only eighteen of the 309 assignments are made on the basis of sequence (where sequence matching is required to differentiate between G6PI1-4 or G6PIA-B proteins), and 287 (92.88%) are protein prefix matches. As already discussed, without an explanation of how these proteins were named, it is not clear whether FOSTA is generating the correct pairs, or whether the sequence matching is misleading.
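As an illustration of what a protein prefix match could look like, the following is a minimal sketch and not the FOSTA implementation. It assumes the standard UniProtKB/Swiss-Prot entry-name layout (a protein mnemonic and a species code joined by an underscore, as in G6PI_HUMAN) and assumes that a prefix match means an identical mnemonic in a different species. The function names and the candidate list are hypothetical; the candidate names other than G6PI_HUMAN are placeholders chosen for the example. The final helper simply reproduces the percentage quoted above.

```python
def split_entry_name(entry_name):
    """Split an entry name such as 'G6PI_HUMAN' into (prefix, species)."""
    prefix, _, species = entry_name.partition("_")
    if not prefix or not species:
        raise ValueError(f"not a PREFIX_SPECIES entry name: {entry_name!r}")
    return prefix, species


def prefix_matches(root, candidates):
    """Return candidates whose protein prefix equals that of the root.

    Assumption: a 'prefix match' is an identical mnemonic in another species.
    """
    root_prefix, root_species = split_entry_name(root)
    matches = []
    for name in candidates:
        prefix, species = split_entry_name(name)
        if prefix == root_prefix and species != root_species:
            matches.append(name)
    return matches


def percentage(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimal places."""
    return round(100.0 * part / whole, 2)


# Placeholder candidate list for illustration only.
candidates = ["G6PI_MOUSE", "G6PI1_PLACE", "G6PIA_PLACE", "TRY1_HUMAN"]
print(prefix_matches("G6PI_HUMAN", candidates))  # ['G6PI_MOUSE']
print(percentage(287, 309))                      # 92.88
```

Under this reading, names such as G6PI1 or G6PIA would not be prefix matches for G6PI, which is consistent with the text's remark that sequence matching is needed to differentiate between them.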

Random samples used to compare FOSTA with Inparanoid
For the random samples used to compare FOSTA with Inparanoid, see Tables 1 and 2.

PIRSF/Hulsen benchmarking: additional statistics
PPV and MCC were included in the main text. Here we include specificity and sensitivity for both datasets (Table 3 for PIRSF and Table 4 for Hulsen) for completeness.
Protein family: the protein family being examined; TO pairings: the number of TO pairs in the Hulsen dataset (including many-to-many orthologous pairings and non-UniProtKB/SwissProt proteins); Refined pairings: the number of one-to-one TO pairings tested after refinement of the Hulsen TO dataset; Basic statistics: the basic counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN); Evaluation statistics: the PPV (positive predictive value, TP/(TP + FP)) and the MCC (Matthews Correlation Coefficient), all rounded to two decimal places.
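The evaluation statistics named here can be computed from the four basic counts. The sketch below uses the PPV formula given in the text and the standard textbook definitions of sensitivity, specificity and MCC. The function name and the example counts are hypothetical and are not taken from Tables 3 or 4.

```python
import math


def evaluation_statistics(tp, fp, tn, fn):
    """Compute PPV, sensitivity, specificity and MCC, rounded to 2 dp.

    Ratios with a zero denominator are returned as NaN.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    stats = {
        "ppv": ratio(tp, tp + fp),            # TP / (TP + FP)
        "sensitivity": ratio(tp, tp + fn),    # TP / (TP + FN)
        "specificity": ratio(tn, tn + fp),    # TN / (TN + FP)
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }
    return {name: round(value, 2) for name, value in stats.items()}


# Hypothetical counts, for illustration only.
print(evaluation_statistics(tp=90, fp=10, tn=80, fn=20))
# {'ppv': 0.9, 'sensitivity': 0.82, 'specificity': 0.89, 'mcc': 0.7}
```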