BLANNOTATOR: enhanced homology-based function prediction of bacterial proteins
© Kankainen et al; licensee BioMed Central Ltd. 2012
Received: 11 July 2011
Accepted: 15 February 2012
Published: 15 February 2012
Skip to main content
© Kankainen et al; licensee BioMed Central Ltd. 2012
Received: 11 July 2011
Accepted: 15 February 2012
Published: 15 February 2012
Automated function prediction has played a central role in determining the biological functions of bacterial proteins. Typically, protein function annotation relies on homology, and function is inferred from other proteins with similar sequences. This approach has become popular in bacterial genomics because it is one of the few methods that is practical for large datasets and because it does not require additional functional genomics experiments. However, the existing solutions produce erroneous predictions in many cases, especially when query sequences have low levels of identity with the annotated source protein. This problem has created a pressing need for improvements in homology-based annotation.
We present an automated method for the functional annotation of bacterial protein sequences. Based on sequence similarity searches, BLANNOTATOR accurately annotates query sequences with one-line summary descriptions of protein function. It groups sequences identified by BLAST into subsets according to their annotation and bases its prediction on a set of sequences with consistent functional information. We show the results of BLANNOTATOR's performance in sets of bacterial proteins with known functions. We simulated the annotation process for 3090 SWISS-PROT proteins using a database in its state preceding the functional characterisation of the query protein. For this dataset, our method outperformed the five others that we tested, and the improved performance was maintained even in the absence of highly related sequence hits. We further demonstrate the value of our tool by analysing the putative proteome of Lactobacillus crispatus strain ST1.
BLANNOTATOR is an accurate method for bacterial protein function prediction. It is practical for genome-scale data and does not require pre-existing sequence clustering; thus, this method suits the needs of bacterial genome and metagenome researchers. The method and a web-server are available at http://ekhidna.biocenter.helsinki.fi/poxo/blannotator/.
The rapid progress in sequencing technology has enabled the generation of unimaginable amounts of bacterial genomic data. The genome sequences of thousands of bacteria have been determined, and many more are in progress . In addition, enormous numbers of sequences have been produced in metagenomic studies exploring the genomic contents of microbial communities by sequencing . The interpretation of this data is necessarily based on computational analysis, and only a minority of the predicted protein-coding sequences are experimentally characterised or tested with functional genomics assays. Functional inferences for the large majority of putative proteins therefore require sophisticated and powerful annotation tools that can predict protein function based on sequence.
Many automated protein function prediction methods describe the biological role of a gene product in terms of single-line description of protein function (DE) or gene ontology (GO). These two widely used annotation schemes are radically different. The first shows a protein function in a human-readable free text format , whereas GO and its directed acyclic graph structure depict the function with a controlled vocabulary that is machine-readable . Several valuable tools have been developed for the prediction of DE or GO annotations, but the most popular tools are based on the concept of homology [5–7]. The premise of this technique is that the functional properties of related sequences are conserved during evolution and that the function of the query protein can be inferred from that of other proteins with similar sequences. Among the many homology-based function prediction solutions [8–15], the simplest involves sequence search with tools like BLAST and PSI-BLAST  and the transmission of function between similar sequences [10, 11]. Other variants include functional inference based on a set of related sequences, although pooling of information is often restricted to GO annotations [12–14]. In a typical genome, a large proportion of gene products can be functionally annotated by these kinds of methods . Predictions are not however always accurate and estimates of the error rate of genome-scale annotation processes vary from 5 to 40% [5, 6]. This approach is also vulnerable to misannotations, which were recently reported to exceed 80% for some protein families in some sequence databases . Furthermore, transfer of correct functional annotations with these methods require high levels of sequence identity. It has been suggested that at least 40-60% identity is, for example, needed to accurately infer enzymatic function [5, 6, 19].
Two other forms of homology-based annotation use protein signatures and phylogenomics. In the first approach, the function is inferred from similarity to motifs and domains obtained from databases, including FIGfams, HAMAP or Pfam [20–22]. A greater coverage over the sequence space is achieved than with BLAST or PSI-BLAST, but these predictions may be overly general, and some resources report functional annotations on the domain rather than the protein level [5, 6]. Phylogenomics directs the annotation process using the evolutionary history of sequences [5, 6, 23]. Commonly, function is inferred from that of orthologs, i.e., sequences related by speciation, and less weight is given to paralogs that have evolved by gene duplication and may have acquired a new function [5, 6]. For example, gene duplications can be inferred from incongruence between gene and species phylogenies [5, 23]. Phylogenomics is ideal to identify the most likely function out of many candidates associated with a set of protein sequences that share sequence similarity with the query protein. However, current implementations are often inefficient for genome-scale analysis, and some methods require a predefined species tree, which is not often at hand when analysing bacterial data.
An attractive alternative to the phylogenomic construction of sequence sets is to cluster sequences according to their annotation similarity. This approach has the advantage that the necessary information can be extracted from public data repositories, and time-consuming all-against-all protein comparisons to generate phylogenetic trees or orthologous sequence sets are not required. The use of annotation similarity has been shown to be beneficial for processing protein function [12, 24]. For example, CLAN combined sequence- and annotation-based clustering and successfully identified inconsistently annotated proteins from SWISS-PROT . On the other hand, in ConFunc, sequences identified by PSI-BLAST are split into groups according to their GO terms, and these sequence groups are used to build up feature-derived sequence profiles from which protein function can be predicted . GO annotations at high levels of precision were created, even at levels of sequence identity equal to or below 30% .
We present a computational method for protein function prediction that relies on the concept of homology to annotate a query sequence with one-line summary descriptions of protein function. This tool, called BLANNOTATOR, splits sequences identified by BLAST into groups according to their DE and GO annotations. The information from database matches that belong to same functional group is then pooled to evaluate DEs. The general scoring scheme we use follows that used in an earlier solution . The idea underlying BLANNOTATOR is that a correctly annotated sequence will occur in the list returned by sequence comparison, but its position(s) in the list may be arbitrary. For example, some protein sequences detected in the search may be related but un- or misannotated, and others may not be related at all. If misleading subject sequences have populated the top of a match list, conventional function prediction methods may fail to recognise the best annotation. In contrast, our method circumvents this problem by building the DE evaluation upon multiple sequence matches with common functional annotations. The concept of annotation-guided sequence clustering borrows from CLAN and ConFunc [24, 12], but unlike these programs, our method assigns sequence similarity search results to groups based on two types of functional information. This produces more comprehensive clusters because the two annotation schemes can complement each other's defects; in the absence of GO annotation, links can be formed via DEs, and misspelled or erroneous DEs can be rescued using GO-based links.
The performance of BLANNOTATOR was assessed by predicting protein functions for a set of 3090 bacterial proteins available in the SWISS-PROT database . While building the evaluation data, we aimed to minimise the number of circularly referenced functional annotations among the test sequences and BLAST hit lists. Accordingly, we discarded hits in which the query sequence matched a sequence that had a newer submission date than the annotation date recorded for the query protein. The DE of the remaining sequence hits was returned to its previous state (i.e., prior to the annotation date) with the help of UniSave . The proportion of character differences between the predicted and correct annotation was then computed to provide an estimate of the annotation quality. Using this data, predictions made by BLANNOTATOR were at minimum 1.6-fold better than those of the five other tested methods. Homology-based transfer of function can be ineffective in the absence of high levels of sequence identity. To simulate this scenario, the SWISS-PROT data were reanalysed after the removal of sequences with greater than 50% identity to the query sequence. The conclusions drawn at this limited level of sequence identity were for the most part consistent, showing that our method also performs well under these restricted conditions and is better than the other homology-based transfer methods we tested.
We analysed the protein-coding sequences of L. crispatus strain ST1  to further demonstrate the functionality of BLANNOTATOR. Human inspection of the resulting function predictions showed that our method provided a valid annotation for ~85% of the originally characterised cases. The original characterisation was based on a manual review of the results of a compendium of bioinformatics approaches . In comparison, RAST- or top BLAST hit-based approaches provided a valid function prediction for ~58% and ~69%. Moreover, BLANNOTATOR's ability to structure BLAST data in clusters and its user-friendly output were found to facilitate the annotation process.
The functionality of BLANNOTATOR was also tested by assigning annotations to the protein-coding sequences of L. crispatus strain ST1 [EMBL: FN692037] . In this example, function prediction was based on BLAST data from which sequence hits with lower than 40% identity and 50% alignment coverage to the query were discarded to remove spurious hits. Our method predicted functions for 1711 of the 2024 protein-coding sequences analysed. To assess the validity of the reported annotations, a human operator (TO) reviewed the predictions. This analysis revealed that our method made a prediction that matched the original for ~85% (1377) of the cases with prior functional annotations. In the remaining cases, the assigned annotation was either biologically distinct or less specific. In addition to our method, we performed function predictions with RAST  and by inference from the top BLAST hit. These two methods generated annotations for 1071 and 1498 test sequences, covering ~53% and ~74% of the proteome. Evaluation of the predictions of the methods showed that RAST and the top hit from BLAST generated fewer acceptable functions than our method. They produced the supported annotation for ~58% and ~69% of the sequences with approved functional annotation, respectively, and defined valid annotations for 104 and 103 protein-coding sequences that our method had failed to assign the correct function, while our method produced acceptable annotations for 530 and 362 protein-coding sequences for which the RAST- or BLAST-based annotation was unsuccessful.
Examples of selected protein-coding sequences for which our method assigned the human curator-supported function are given below.
Multiple and diverse candidate annotations. LCRIS_00064 is an example of a sequence that only BLANNOTATOR managed to characterise similarly to InterProScan . The competing methods failed to assign a function to this sequence, most likely because of the related sequences of LCRIS_00064 were associated with seemingly incoherent functional annotations, and the top candidates did not have valid annotations. For example, matching sequences were described with a set of annotations ranging from '50S ribosomal protein L3' and 'Uracil-DNA glycosylase' to the correct annotation, 'Cell wall-associated hydrolase'. Moreover, the correct functional description was enriched in the less conserved related sequences. Out of the 40 top hits, only seven had the valid functional description, while a third of the accepted hits in total had been described with this particular annotation.
Inference of specific enzymatic activity. An example in which an enzymatic function was recovered by BLANNOTATOR is that of LCRIS_00495. This putative protein had acceptable sequence level similarity to ten sequences, most of which had no functional information. In spite of the limited amount of data that could be used in homology-based transfer, the query protein was defined as an 'Antibiotic biosynthesis monooxygenase' by BLANNOTATOR, similarly to InterProScan . The other methods generated DEs that were less informative: 'Putative uncharacterized protein' and 'putative phosphoglycerate mutase'. Our method also defined an enzymatic function for LCRIS_01619, which was identified to be '1,4-dihydroxy-2-naphthoate octaprenyltransferase', similar to the function that was obtained using COG . The two other methods failed to assign a function to this putative protein.
Transferring meaningful annotations from remote relatives. Only BLANNOTATOR predicted a function for each component of the clustered regularly interspaced short palindromic repeats (CRISPR) system, whereas the competing methods failed to infer functions for LCRIS_01207 and LCRIS_01212, both of which are components of this vital defence mechanism that provides acquired phage resistance in bacteria and archaea . LCRIS_01207 and LCRIS_01212 showed only remote similarity to the CRISPR-associated protein sequences in the database, and BLAST hits associated with meaningful functional information had limited (≤ 50%) identity to the queried protein. For example, LCRIS_01207 had acceptable similarity to 59 sequences annotated as 'CRISPR-associated protein, Cas2', but these sequences displayed a maximum identity of 45%. Thus, if annotations had been predicted from top matches or the analysis of remotely similar sequences had been bypassed, no function could have been assigned to these sequences, leading to a failure to characterise the CRISPR system.
The idea underlying BLANNOTATOR is that the reorganisation of sequence similarity search results can improve homology-based protein function prediction. For example, sequence similarity search tools may find undesired sequences, and the match list may not always be in the ideal order . Commonly, sequence grouping in protein function analysis is performed through the identification of orthologs, which can be performed using any of the many algorithms designed for that purpose . Similar to the principal aim of forming orthologous sequence sets in automated protein function prediction, our annotation-guided clustering and simple multiclass classifier produces sets of proteins that are likely to have common functions. However, our program does this without dealing with complicated phylogenies or requiring time-consuming all-against-all sequence comparisons. Likewise, in our method, database hits associated with differing functional descriptions are not assigned to the same group, which effectively mimics the discarding of paralogous and other non-conserved sequences that occur in phylogenomic techniques [33–35]. Moreover, it was recently reported that sequence similarity could be better than sequence orthology in terms of predicting protein function [36, 37]. Thus, functional inference from orthologous sequence sets may be overkill and may even reduce the quality of function predictions in addition to increasing processing time. In comparison, BLANNOTATOR circumvents this deficiency as its function prediction is based on sets of sequence hits generated using functional information and because the candidate annotation is selected by pooling information over multiple hits in a sequence set.
We scored the DE of sequences identified by BLAST with a method that is based on an earlier proposal . The basic principle of this scoring scheme is that higher scores reflect higher sums of bit scores for the database matches to a given annotation. The scoring scheme we used assumes that each target sequence provides an additional layer of support and is independent of the other matches. These assumptions are suitable for protein function prediction in most scenarios but fail if the identified sequences have a biased functional representation. For example, if a single researcher has identified his/her favourite protein from many closely related organisms and annotated each protein similarly, bias will be introduced to the results, and the significance of the over-studied annotation can be over-estimated. Some computational approaches tackle this kind of bias. For example, ConFunc uses all-against-all sequence comparisons to reduce bias , and CLAN pre-calculates the annotation frequencies in the database to address the error . However, here we have traded the minor loss in quality for a greater gain in performance. Also, this problem can partially be circumvented by the use of a less redundant sequence database, such as UniRef90 or UniRef50 .
In order to assess protein function prediction methods, a well-established test data is required. The ideal evaluation data should contain only proteins and protein functions that have been characterised experimentally. This data should also list all of the alternative functions of the protein, irrespective of the conditions under which the function is active, because a predicted function can confidently be declared erroneous, or correct, only if all of a protein's true functions are known. Moreover, test sequences should include proteins with a variety of functions and varying levels of similarity to other sequences. Finally, this rich source of functional information should not be propagated further to avoid circular referencing. Here, we first evaluated function prediction by characterising a set of bacterial proteins of known function extracted from SWISS-PROT . Our data is not perfect but has several positive aspects. First, function annotations have passed the strictest filter possible and are as complete as they can be because SWISS-PROT has the lowest annotation error level of all of the major sequence databases . Test sequences displayed a wide variety of functions, both enzymatic and non-enzymatic, and they showed similarity to a varying range of other sequences. Moreover, functions for these test sequences were inferred from data that had a minimal number of circularly referenced annotations (see Figure 1). However, our evaluation dataset has its own problems. For example, we could not exclude the possibility that UniProt was not used for function prediction in the first place. If this was the case, there could be an increased chance that the same annotation would be produced again. Functional annotations can also be wrong, incomplete or contain mistakes, although they were selected from the most reliable database. Furthermore, there was no obvious way to regress GO annotations, so newer sets of GO terms had to be used in the annotation process. In addition to our novel test data, a new machine-based strategy was developed for DE comparison and for judging if two annotations are the same. This metric, mLD, computes the fraction of character changes between two annotations and can deal with word order changes. However, similar to most string metrics, it cannot process adequately synonymous terms. Annotations like 'GyrA' and 'DNA gyrase, subunit A' are falsely considered to be different, while erroneous similarity is seen between 'DNA polymerase I' and 'DNA polymerase IV', if these annotations occur in the same match list. Additionally, the method is blind to annotations of differing specificity. For example, 'DNA polymerase' is a good match with the two previous matches.
The annotation performance of BLANNOTATOR and the five other methods in the SWISS-PROT data was assessed using mLD statistics. At first we computed the fraction of predictions that were equal to or smaller than a certain distance from the valid database annotation of the query (see Figure 4). We chose to use this approach instead of the more typical sensitivity and specificity analysis, because it is difficult to draw the line between acceptable and non-acceptable annotations. Is an annotation correct, if it differs from the valid by, for example, three characters? The results shown in Figure 4 indicate that our method was better than the other function-assigning methods tested. In total, 22% of all predictions made by our method matched the one given to the query, and over 50% of the predicted annotations had an mLD equal to or smaller than 0.50. When circularly referenced functions had not been removed, even better performance was observed. The optimal DE was predicted for 37% cases and an mLD of 0.20 or lower was recorded for over half of the predictions in this data (see Figure S4 of Additional file 1). The standardised mLDs also supported the improvement and showed that predictions made by our method were, on average, 0.13-0.43-fold (after removal of circularly referenced annotations) and 0.10-0.46-fold (before the restoration process) better than the expected mean quality of the prediction. In contrast, the second best method improved results by 0.00-0.23-fold, and methods that relied on a single match, often the first method used in genome projects, were worse than a random selection from the BLAST hit list (see Table S5 of Additional file 1). We believe that the success of our method in protein function prediction is a sum of the whole process - the filtering of unspecific and infrequent words, the solving of functional conflicts and the use of the scoring system - and that each step contributes to the outcome. The use of new GO terms might have altered the results slightly, i.e., our method might not have generated the same partitioning as it would have generated on the date that the test protein was annotated. This effect most likely is marginal, if it exists at all, especially because we recorded the quality of the DE and not of the GO annotation. However, to rule out the possibility that the use of new GO terms has given some extra favour to our method, the other methods were also tested by providing them with only the information from the largest sequence hit group (if consisting of more than five matches) sharing a common GO annotation. The performance of the other methods was generally worse than the analysis made without this restriction (see Figure 4), suggesting that no extra benefit was given to our method.
In a genome project, automated protein function prediction is often performed after the removal of dubiously related database matches. This is most commonly achieved by discarding sequence hits below a given sequence identity or alignment coverage to the query sequences. In Figure 3 we depict the effect of sequence filtering on annotation quality in the case of SWISS-PROT data. Contrary to earlier suggestions [4, 5, 19], our data encourage the use of modest identity or alignment coverage filters in function assignment because aggressive thresholds just result in fewer matches, irrespective of their suitability for function prediction. In these data, the highest thresholds yielded the poorest predictions. Requiring perfect identity and alignment coverage, the mean annotation quality of the best-case prediction was approximately 50% worse than in the absence of sequence filtering. A drop of approximately 25% was seen at the 60% identity threshold, indicating that even this threshold may be too high. Higher thresholds for alignment coverage also decreased the performance but had smaller effects than the identity parameter. Only the strictest (100% coverage) filter needs to be avoided. Statistics obtained from the data before it was regressed showed a similar, albeit weaker, trend (see Figure 3). However, under this condition, the mean of the median mLD fluctuated more and was over twice (0.09) the value that was seen after the regression operation (0.04). Our data thus indicate that if BLAST hits must be removed, it should be done with modest sequence identity (0-40%) and alignment coverage (0-80%) filters. Results of these analyses were consistent with those in which BLAST results were filtered using relative bit scores, indicating that the same tendencies applies to that parameter also (see Figures S1 and S2 of Additional file 1). Our findings thus suggest that rather than discarding matching sequences, more focus should be placed on finding the true signal from the complete list of database matches. For example, weak signals could be amplified by analysing sequence sets, which is the main principle of BLANNOTATOR.
In addition to the SWISS-PROT data described above, we assessed our method by analysing the proteome of L. crispatus strain ST1. As the genome was unpublished at the time of analysis , the risk of circular referencing was again abolished. This more realistic setting may also have contained sequencing or gene prediction errors, which are likely to occur in modern genome projects. Moreover, the test set included a set of orphan protein-coding sequences that are specific to this particular organism, as well as more universal bacterial proteins that are required for a bacterium to live and operate. Also, to compensate these defects of mLD, prediction performance in this data was assessed by a human curator. This strategy avoids many of the problems described above, such as the recognition of different levels of specificity and synonymous wordings, but can produce other types of mistakes that are specific to human evaluation. The analysis of the putative proteome of L. crispatus strain ST1  showed that our method performed well also in this data and inferred the preferred functional description for 85% of the originally characterised test sequences. The two other methods tested were less successful. Moreover, these and a set of other methods used in the original analyses had a lower annotation coverage of the putative proteome: the fraction of annotated sequences ranged from 80 to 3% for InterProScan, inference of function from the top BLAST match, COG, RAST, KAAS and HAMAP-scan, in order from best to worst [29, 30, 20, 39, 21]. In fact, our method had greater annotation coverage (85%) than even InterProScan, which is considered one of the most comprehensive and heavily used resources . This can be explained by the effective use of all data. For some protein sequences, like that of LCRIS_01207, the preferred function was inferable only from matches showing low identity to the query. However, when working with sequences with limited levels of identity, the risk of inferring an erroneous function becomes higher, which is why functional information needs to be pooled.
The large number of sequences provided by bacterial genomic projects has enabled the identification of an increasing number of bacterial protein-coding sequences. As only a limited fraction of these newly discovered putative proteins undergo an experimental characterisation, there is a need for sophisticated annotation tools that can infer the biological role of a protein based on its sequence. We describe a computational method for the automated prediction of bacterial protein functions. This tool, BLANNOTATOR, generates accurate one-line summary descriptions of protein functions. Our method is better than other methods that we tested and produces an output that is easily understood. Moreover, it is applicable to genome-scale datasets and can be used to analyse metagenomics data because it does not require phylogenetic information or additional functional genomics experiments. We believe that BLANNOTATOR will be very useful for discovering information about genome and metagenome sequences in the future.
The BLANNOTATOR method is outlined in Figure 5. The function prediction process described here is based on the default settings. Sequence similarity search is performed using BLAST against UniProt [16, 3]. GO terms are extracted from GOA , and GO annotations are created by pooling GO terms over each of the three ontologies and over their ancestor GO terms until the ontological root is reached. By default, all GO terms are accepted irrespective of their evidence type, but the user may choose to use only GO term with certain evidence types. GO annotations containing fewer than three different GO terms are discarded. This effectively removes uninformative annotations consisted of a few GO terms, which lie close to the ontological root, like the ' cellular process'. DEs, originating from the header section of the FASTA-formatted sequence, are obtained from the BLAST hit list. Uninformative and infrequent words are removed from the DEs. The list of uninformative words contains 21 terms noted to occur frequently in protein annotations: 'hypothetical', 'uncharacterized', 'putative', 'contig', 'predicted', 'probable', 'fragment', 'genome', 'protein', 'chromosome', 'possible', 'similar to', 'proteins', 'homolog', 'possible', 'conserved', 'homologous', 'complete', 'shotgun', 'cdna', 'family'. In addition, infrequent words are removed from DEs that also contain frequent words. An infrequent word is defined as a word that occurs only once in the DEs of ten or more BLAST hits. If the test sequence has less than ten matches, infrequent filtering is not applied. If the DE processing creates an empty annotation, the hit is removed from the BLAST match list.
The remaining database hits in the BLAST match list are grouped according to their DE and GO annotations. At first, the annotation-guided clustering creates groups based on the DE information. Database hits are linked if they have the same DE. Hits or sets of hits identified in the previous step (an element) are then linked based on their associated GO annotations. Two elements are linked if their top-scoring database matches with accepted GO annotations (i.e. the sequence hit that has the highest bit score and has at least three GO terms) have identical GO annotations. Alternatively, GO annotations of several hits can be used to form links. In this case, hits within an element are ranked in descending order by their bit score and the GO annotations of every nth hit, read from the top, are merged and used in the comparison. The step of this interval is user-definable. Again, element joining requires identical GO term sets. Pooling of GO terms increases the run time substantially. Here, only the topmost GO annotations were used.
By normalising scores against the sum of the bit scores of all of the BLAST hits for the sequence, instead of normalising it against the sum of bit scores of all of the BLAST hits within the group, we ensure that DEs in different groups are unlikely to return equal scores. The final score of a DE is obtained by taking an average of the word and sentence scores of the DE.
The level of similarity of two DEs was measured using a Levenshtein distance (LD)-based metric. Typically, LD is calculated using a dynamic programming algorithm matching two strings over their full length and giving the minimum number of character differences between the two strings . This means that two annotations with the same set of words but that are in a different order cannot receive the best score. Here, we have modified the computation to deal with changes in word order. Annotations (X and Y) are deconstructed into word (x 1 ... x n and y 1 ... y n ) and consecutive word-pair (x 1,2 ... x n-1,n and y 1,2 ... y n-1,n ) elements. LD is then computed between the elements of × and Y, and the pair (xymin) yielding the smallest LD is recorded. The algorithm then removes xymin and all other elements of × and Y that share the words in xymin and finds a new xymin. The process is repeated until all of the elements of either × or Y have been used. Next, the possible remaining elements of × or Y are scored against an empty string. The mLD between × and Y is then defined as the sum of the LDs of all recorded xymin, divided by the length of × or Y, whichever is longer. Here, the normal LD was calculated with a dynamic programming algorithm in which gaps, substitutions and matches are scored as -1, -1 and +2, respectively.
The mLD ranges from zero (similar) to one (dissimilar) and indicates the fraction of character differences between words of two annotations. The metric is independent of word order and considers annotations like 'DNA gyrase, subunit A' and 'DNA gyrase, A subunit' to be similar because the only difference between them is the order of their words. On the other hand, if a single word differs between annotations, the mLD is approximately 0.25, e.g., in 'Nitric oxide synthase oxygenase' and 'Neuronal nitric oxide synthase'. A value of 0.50 indicates that half of the characters between the annotations differ. This is the level of mLD that was typically seen between the annotations of a query and its BLAST hits in the SWISS-PROT data; the average mLDs for query-subject annotation-pairs listed in the evaluation data before and after the regression operation were 0.61 and 0.48, respectively. An mLD of approximately 0.75 represents a limited textual resemblance, like that of 'Cryptochrome DASH' and ' DNA photolyase'. Because different releases of UniProt reported the enzyme commission codes differently, enzyme codes were trimmed from annotations before their similarity was calculated. If one of the DEs was an empty string (because enzyme codes had been removed) or contained only uninformative words (the same set of words as in BLANNOTATOR), the mLD was set to one.
Figure 1 outlines the process used to build the evaluation data. Entries in the SWISS-PROT database  were downloaded in October 2010 and were considered to contain accurate protein function data. The construction of the evaluation data began by removing entries that were not of bacterial origin or were added to SWISS-PROT prior to 2005. Other derivatives of the primary entry were also tracked, and the primary entry was discarded if any of its derivate entries (given in the accession number-field of the primary entry) did not fulfil our search date criteria. For the remaining set of entries, all older entry versions were extracted, and the first appearance of each DE in each database was recorded. Again, information of alternative entries was taken into account. Older entry versions were retrieved with the help of UniSave . All primary entries containing a DE with the words 'uncharacterized' or 'UPF' (i.e., uncharacterised protein family) were then discarded. Next, we selected all active entries of which the first versions were added directly to the SWISS-PROT database or that had emerged from a TrEMBL entry with a different annotation. The mLD method and a distance threshold of 0.50 were used to judge whether a TrEMBL annotation differed from that given in the active SWISS-PROT entry and had underwent re-annotation. If an older entry was found to carry overly similar annotation, this and all more recent entries were flagged. This filtering step was performed mainly to remove candidate test proteins that most likely had been annotated using information originating from SWISS-PROT and to avoid the selection of test sequences for which the function was from the same source as that of its database matches. This had only a marginal effect on the size of the final dataset (1047 DEs were rejected). Next, duplicate annotations between entries were searched, a random entry was chosen for each different DE (from which the name of the gene, e.g., 'yclE', 'ydjP', 'yfhM' or 'yisY', had been removed) and the annotation date (i.e., the date at which the entry was added to the SWISS-PROT or the date at which the re-annotation was performed) was recorded. Finally, test sequences with significant sequence similarity to other test sequences (sequence identity ≥ 90% and alignment coverage over the length of the query or subject sequence ≥ 90%) were clustered together, and a random entry was selected from the cluster. The rationale behind this data processing step was to discard proteins that were likely to be involved in the same function but were annotated differently because of synonymous wording or misspelling. For example, duplicated genes can have running numbers in their DE that produce two different annotations.
Test sequences were compared against UniProt using BLAST with default settings [3, 16]. BLAST alignments with bit scores smaller than 50 and lengths smaller than 30 were removed, and only the most significant occurrence of each query-subject pair was kept. Matches to the protein sequence itself were also discarded. The number of circularly referenced protein annotations was then minimised by restoring the data to the closest version prior to the characterisation of the query protein. At first, BLAST hits to sequences with a creation data were newer than or identical with the recorded annotation date of the query sequence were removed. The DEs of the remaining matches were reverted to their states preceding the query sequence annotation date with UniSave . In addition to this entire dataset, another version of the dataset was created by removing all BLAST hits with ≥50% sequence identity to the query sequence.
The similarities of DEs in the SWISS-PROT data were quantified in terms of the mLD. A score of 0.00 defined annotation pairs with analogous functional descriptions. GO annotations were also tested. Two GO annotations were considered to match when they contained the same set of GO terms. GO term comparisons were done twice. At first using all GO terms and then using only GO terms with experimental (EXP, IDA, IPI, IMP, IGI and IEP) or computational analysis evidence codes (ISS, ISO, ISA, ISM, IGC, IBA, IBD, IKR, IRD and RCA). The latter set of evidence codes comprises human reviewed predictions. For each test protein, we computed the fraction of BLAST hits with a matching GO or DE annotation. Likewise, fractions were counted from BLAST hits with a matching GO and a matching DE annotation (DE and GO dataset) and from sequences with one matching annotation irrespective to the type of the annotation (DE or GO dataset). The size of the BLANNOTATOR sequence group of each correctly annotated BLAST hit (either GO or DE annotation) was recorded, and the maximum of the values was taken. BLANNOTATOR sequence groups were extracted from runs in which all sequence hits were accepted and undesired GO terms had been removed. The mean of the fractions of all 3090 test sequences was calculated.
The similarity between the query sequence annotations and their subject sequence annotations was recorded in terms of mLD. For each query sequence, the minimum, maximum and median quality values were recorded, and the mean and standard deviation of the mLDs were calculated over test sequences having at least one BLAST hit. These statistics were recalculated after discarding BLAST hits below a given sequence identity and alignment coverage (calculated over the subject sequence) threshold combination. Six thresholds (0%, 20%, 40%, 60%, 80% and 100%) were tested, resulting in the testing of 36 (six times six) combinations.
In addition to BLANNOTATOR, the SWISS-PROT data were analysed with five other methods. The function was inferred from the highest scoring (bit score) non-identical BLAST hit as well as from the top non-identical BLAST match with an annotation that did not contain any uninformative words. The same set of uninformative words that was used for BLANNOTATOR was used here. The three other methods we used were built upon selecting the most common DE in the BLAST hit list, selecting the DE associated with sequences producing the highest sum of bit scores and selecting the DE with the highest word-based score . In addition, the five competing methods were applied to the data so that the information from the largest group of database hits sharing the same GO terms was only given to them. If the largest group of sequence hits sharing the same GO terms contained less than five BLAST hits, the DE was predicted using all matches. Ties were broken randomly.
The similarity between the predicted protein function and the correct annotation in the SWISS-PROT data was measured with the mLD metric, and the means and standard deviations of the mLDs were calculated over the test sequences with at least one BLAST hit. Furthermore, the obtained mLDs were standardised according to the expected mean and standard deviation of the prediction. For a test sequence, given the mLD of its prediction (x) and the mean (μ) and standard deviation (σ) of the mLDs of the DE associated with its BLAST hits, the Z-score was calculated as (x -μ)/σ. The obtained Z-score was multiplied by negative one so that positive Z-scores indicate improvement and negative scores indicate a decrease in quality. The mean and standard deviation of the Z-scores were calculated from test sequences having at least one BLAST hit. Similarly to computing the general statistics, both statistics described above were recalculated after discarding BLAST hits at a given sequence identity and alignment coverage (calculated over the subject sequence) threshold combination. The same 36 combinations were tested.
The protein-coding sequences of the L. crispatus strain ST1 [EMBL: FN692037]  were characterised with BLANNOTATOR, RAST  and inference from the top BLAST match. The sequence similarity search was performed with BLAST against UniProt with default settings, except that alignments were reported for 500 top matches [16, 3]. Dubious hits were removed by discarding BLAST hits with bit scores smaller than 50, alignment lengths smaller than 30 and matches with a less than 40% sequence identity or less than 50% alignment coverage (calculated over the subject sequence) to the query sequence. In cases where a query sequence matched the same subject sequence multiple times, only the best scoring (bit score) query-subject pair was retained. BLANNOTATOR was applied to the BLAST data with default parameters, as described in the first section of the Materials and Methods. RAST was also run with default parameters . For the BLAST-based approach, sequences identified by BLAST were sorted by bit score. Protein function predictions were manually compared against each other and against the original annotations that had been created using a wider repertoire of bioinformatics approaches and had been examined carefully during the original study .
single-line description of protein function
cluster of orthologous sequences
High-quality Automated and Manual Annotation of microbial Proteomes
Basic Local Alignment Search Tool
Kyoto Encyclopedia of Genes and Genomes
KEGG Automatic Annotation Server
Rapid Annotation using Subsystems Technology
Gene Ontology annotation database
modified Levenshtein distance
clustered regularly interspaced short palindromic repeats
This study was funded by Academy of Finland grants #114498 and #109849.