Applying negative rule mining to improve genome annotation

Background Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items. Results Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower. Conclusion Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection.


Background
There are currently over six million amino acid sequences known, and only a quarter of a million have been manually annotated [1]. Moreover, as estimated by [2], merely for 3% of proteins functional annotation is based on experimental evidence. Due to the advent of super-fast and ultra-low-cost DNA sequencing technologies [3] the speed of genome sequencing continues to increase, and sequencing prices continue to plummet. Given the quickly widening gap between the amount of molecular data and the capacity of human experts there is little doubt that electronic annotation by automated software pipelines will be the only source of information about the overwhelming majority of proteins.
In silico annotation generated by bioinformatics methods has the advantage of being efficient and cheap, but at the same time suffers from a notoriously high error level [4,5]. Most of these errors are caused by homology-based annotation transfer where available similarity is not sufficient to warrant the transfer of information from the source to the target sequence, or because the annotation of the source sequence is already wrong. Further complications include the presence of compositionally biased sequence regions, mosaic domain structure of eukaryotic proteins, wrong gene models, and difficulties in recognizing pseudogenes.
The most obvious and direct approach towards improving the reliability and coverage of unsupervised protein annotation entails the development of better bioinformatics tools. Remarkable algorithmic advances of the past decade include more accurate gene prediction techniques (reviewed by [6]), highly sensitive sequence similarity searches based on hidden Markov models [7], protein secondary structure prediction approaching the 80% accuracy barrier (e.g., [8]), novel tools for predicting protein cellular localization [9], improved strategies for annotation transfer by homology (e.g., [10]), and enhanced protein function prediction by phylogenomics methods [2], to name just a few. Automatic annotation efforts also significantly profit from highly curated and actively maintained information resources, such as sequence [1], pathway [11] and interaction [12] databases, functional and structural classifications of proteins [13][14][15], as well as the collections of protein domains and motifs [16,17].
A complementary tactic to improve the quality of protein sequence databases involves retrospective search for errors in the total corpus of already available annotation. Under this approach protein annotation is considered to be a collection of records, one per each gene, containing a varying number of attributes, ranging from just a few minimal descriptors (length, pI) for hypothetical proteins, to dozens of annotation items (motifs, EC numbers, localization, structural folds, etc) for better characterized proteins. Modern data mining techniques can be used to identify statistically significant associations between individual attributes, and then to investigate exceptions from such associations that can potentially point to erroneous assignments.
In our earlier work we applied the formalism of association rule mining to extract associations between annotation items in large molecular sequence databases [18]. Considering a database with multiple entries, with each entry ascribed a finite number of features, association rules [19] are simple implications that can be formulated in the form (A 1 & ... & A n ) => Z, where A 1 ... A n (the lefthand side of the rule, LHS) and Z (right-hand side, RHS) are different features, and the rule means "database entries that possess all features A 1 ... A n are likely to possess feature Z". The rules of this type are thus positive because they model a positive relation between two item sets. Each rule is characterized by its coverage, the number of entries in the database that possess all features A 1 ... A n , its support, the number of entries satisfying both the left and the right sides of the rule simultaneously, and its strength, which is essentially the probability that a given database entry will satisfy the right side of the rule given that it satisfies the left side of the rule.
Our strategy for finding errors in annotation consisted of finding rules with a strength very close, but not equal, to 1.0, which means that such rules have a minor number of exceptions, and then identifying all proteins that constitute exceptions to these rules. Applied to the Swiss-Prot [1] database, this approach yielded 7396, 4956, and 4046 rules with a strength greater than 0.95 and a coverage of over 50 which were not fulfilled exactly once, twice, or three times. In order to test whether exceptions from strong rules actually correspond to annotation errors subsequent releases of the Swiss-Prot database were compared and additional manual verification was conducted. It was indeed found that exceptions to strong rules get corrected substantially more often than the rest of the annotation. For unsupervised annotation automatically generated by the PEDANT genome analysis system [20] the total fraction of exceptions from strong rules classified by manual analysis as errors was as high as 68%. It was also found that most of the errors in the Swiss-Prot database are under-predictions (i.e. absence of features that would be expected based on association rules), consistent with the prudent manual annotation process adopted by Swiss-Prot, while in PEDANT errors are typically caused by over-predictions.
In this work we continue to explore the application of rule mining to correcting annotation errors and investigate the utility of negative association rule mining, which, as the name implies, represents the identification of negative relationships between item sets [19]. A negative association rule has the form A 1 & ... & A n (LHS) => not Z (RHS), with A 1 ... A n and Z being different features, and the rule means 'database entries that possess all features A 1 ... A n are unlikely to possess feature Z'. For negative rules support is the number of database entries satisfying both the LHS and the RHS, i.e. those entries that possess all features A 1 ... A n and do not possess the feature Z. An additional very important parameter used in this work to characterize negative rules is leverage which is defined as the difference of the rule support and the product of supports of its LHS and RHS. Leverage measures the unexpectedness of a rule as the difference of the actual rule frequency and the probability of finding it by chance with the given frequencies of its RHS and LHS.
A negative association rule is thus an implication from the union of several items to an item negation. An example of a trivial biologically relevant negative association rule is "Nuclear localization => not bacterial origin", i.e. every protein annotated as localized in the nucleus cannot have a bacterial origin. As with positive rules, negative rules are not necessarily absolutely strict. For instance, the rule "Operon structure => not eukaryotic origin" has a number of exceptions because bacterial-like operons were described in Ceanorhabditis elegans [21]. Since these exceptions comprise only a small fraction of the annotated genes, this rule may naturally be interpreted as "the majority of genes constituting an operon structure do not originate from eukaryotic organisms". In this specific case exceptions from a strong rule are biologically motivated and do not represent errors. However, in many other cases such exceptions do point to erroneously assigned annotation items. For example, the rule "intracellular transport vesicles => not bacteria" has four exceptions in the PED-ANT annotation caused by erroneous homology-based transfer of the functional category "intracellular transport vesicles" to four bacterial gene products. We present a large-scale evaluation of exceptions from strong negative rules in the PEDANT genome database and assess their utility for detecting and correcting annotation errors. To our knowledge this is the first application of negative association rule mining to molecular biological data.

Extracting item sets from the PEDANT genome database
PEDANT software suite [20] is an automatic annotation pipeline that runs various bioinformatics analyses on each protein sequence and stores the results, appropriately parsed, in a relational database. The associated PEDANT genome database [22] currently contains pre-computed annotation for 468 genomes and a total of more than 1.76 million gene products. Most of these data were calculated using the version 2 of the PEDANT software, but we are currently deploying the all-new version 3 with substantially enhanced capabilities including a new graphical user interface and an extended set of bioinformatics algorithms.
A typical annotation entry extracted from PEDANT for a given gene product has the following form: This line describes the antitoxin of the ChpB-ChpS toxinantitoxin system from Escherichia coli as a protein that has small length (length:S), acidic isoelectric point (Pi:C, less than 5.5), gene with high GC-content, bacterial origin (Bacteria), and low content of disordered regions (do:L), does not possess any low complexity regions (lc:0), has structural class of the 'alpha/beta' type and the PFAM [16] domain PF04014. It belongs to the IPR007159 InterPro [23] family and the b.129.1 SCOP [13] structural superfamily, and it is a homolog of the UniProt [1] proteins annotated with the keyword "DNA-binding". According to the MIPS Functional Catalog [15] (only two upper levels are considered here) the function of this protein is described by the labels fc16 (protein with binding function), fc16.03 (nucleic acid binding), fc40 (cell fate), fc40.01 (cell growth/morphogenesis), fc32 (cell rescue, defense and virulence) and fc32.05 (disease, virulence and defense).
Annotation attributes extracted from PEDANT can generally be subdivided into three types in terms of their intrinsic susceptibility to errors.
• Type 1. Features that are definitely known. This group includes either inherent properties of genes and their products, such as their taxonomic origin, or features that can be unambiguously calculated from primary sequences, such as GC content, length, pI value, percentage of low complexity regions, and so on.
• Type 2. Structural and functional properties of proteins predicted directly from their amino acid sequences by ab initio computational algorithms (secondary structure, disordered regions, coiled coils, transmembrane segments, signal peptides, cellular localization).
• Type 3. Structural and functional properties of proteins derived by similarity searches against previously characterized gene products. These features include sequence domains, keywords, functional categories, enzyme classes, and functional and structural superfamilies.
It is obvious that the features of type 1 are unfaultable and cannot generally contain errors (except for incorrectly predicted gene models, typographical errors, or errors caused by software bugs or human error). Features of type 2 are typically predicted with the accuracy in the order of 70% [24] by machine learning techniques, such as neural networks or support vector machines. If no experimental data for a given feature type is available (e.g. known three dimensional structure, experimentally determined cellular localization), such predictions can only rarely be further improved by human curation. Finally, features of type 3 are transferred from one or several previously annotated gene products to the query protein based on a suffi-  ciently significant degree of similarity. These features constitute the main bulk of protein-associated information available in the databases, and it is precisely this part of protein annotation that is especially prone to errors due to intrinsic limitations of annotation transfer by homology.
We are interested in applying negative association rule mining for identifying errors in the annotation attributes of Type 3 transferred by similarity from other proteins; in the annotation entry above such features are shown in italic. In our dataset there were the total of 848511 similarity-derived features (67% of all features analyzed), more than a half of which were constituted by functional category assignments.

Extracting rules from PEDANT annotation
The annotation set describing 55063 genes in ten PED-ANT genomes served as input data to extract negative association rules using a modified version of the wellestablished Apriori algorithm for association rule mining. The basic Apriori algorithm, described in detail in [25], is designed to find frequent item sets by consecutive expansion of candidate sets at every step. It is based on the simple notion that all subsets of a frequent item set are also frequent. In this work we used a version of this algorithm designed for the efficient negative association rule production [26] implemented and kindly provided by Christian Borgelt (for initial software description see, e.g., [27]). This notation means that proteins possessing FunCat labels 34.11 ("Cellular sensing and response") and 36 ("Interaction with the environment (systemic)") are unlikely to be of small (less than 120 amino acids) length. The LHS items are joined by the "&" symbol and are followed by the RHS (here -a negation of the annotation feature), and the list of numerical attributes of the rule: coverage, coverage count, RHS coverage, RHS coverage count, support, support count, strength, lift, leverage, and leverage count. In addition to "Support count", "Coverage count" and "Strength", important for positive association rule selection [18] (1560, 1558 and 0.999, respectively, in our example), the numerical parameter "Leverage" (or "interest of the rule", as alternative name) is also very important in the case of negative rules. All existing algorithms allow calculating negative association rules effec-tively only using a certain threshold on the minimal leverage or leverage count. Here, if not specified otherwise, all rules with the support and the leverage counts of at least 100 proteins and strength of at least 0.1 were retained for further analysis.

Analysis of taxon specificity
A considerable and arguably the most valuable part of PEDANT annotation involves assignments of functional categories based on the MIPS FunCat [15]. A large fraction of negative association rules included a taxon-specific FunCat label (e.g., fc75.03 -"animal tissue") on one side of the implication and the taxon of protein origin contradicting this specificity (in the given case, All rules of this kind are a direct consequence of the taxonspecific nature of the underlying (here fc75.03) FunCat labels. Some of these rules may have exceptions due to annotation transfer by homology between proteins from different taxonomic groups. We classify such cases as annotation errors according to the general procedure.

Manual verification of the rules
For manual verification of negative association rules we randomly selected a limited sample of protein entries from the PEDANT annotation set that constituted exceptions from rules and could not be corrected by the taxon specific analysis explained in the previous section. Annotation features of these proteins occurring either in the LHS or in the RHS of the rules were subjected to careful manual analysis by an experienced protein annotator according to the established procedures routinely used at MIPS for genome annotation (see, e.g., [28]). These include assessment of similarity hits and predicted protein features as well as in-depth examination of biological literature describing experimental studies. An exception was classified as an error if one of the features in the LHS or RHS of the rule was found to be assigned wrongly to the given protein entry. We then calculated the error rate among all manually analyzed exceptions according to the following formula: (percentage of exceptions classified as annotation errors among all manually verified exceptions * number of exceptions in rules not involving 'taxon specificity' + number of exceptions from 'taxon specific' rules)/overall number of exceptions.

Manual verification of annotation features
We filtered out wrongly assigned taxon-specific FunCat labels and selected randomly a limited sample among all remaining homology-transferred annotation features. The accuracy of the feature assignment was thoroughly verified by an experienced annotator. All verified annotation attributes were divided into 3 categories: true assignments, false assignments, or "not known". The latter category was selected if the evidence for a given assignment was not sufficient to make a judgment, but the feature did not obviously contradict the nature of the protein (e.g., the keyword "Zymogen" in the annotation of the lysosomal Pro-X carboxypeptidase from Arabidopsis thaliana, code "At5g65760"). Features of this category were excluded from further analysis and were not taken into account while estimating the error level. For example, if in a set of 100 features selected for manual verification 40 features were classified as 'errors', 56 as 'correct assignments', and 4 as 'not known', then the final estimate of the error level in this sample was 100*40/(100-4) = 42%.

Statistics of negative association rules in the PEDANT annotation
Application of the Apriori algorithm to the annotation set extracted from PEDANT resulted in 9591 negative rules (see Additional file 1). For example, one of the most trivial rules found was "Bacteria, not Eukaryota, 0.353, 19443, 0.413, 22765, 0.353, 19443, 1.000, 2.419, 0.207, 11404.573". This rule is satisfied in all possible cases and thus its strength is 1.0 with no exceptions. In total there were 2273 such rules calculated. Much more interesting rules in the context of this study are those of strength very close, but not equal to 1.0. These rules have a small number of exceptions that may constitute annotation errors. There were 7318 such rules with 26969 exceptions in total. An example of such rules is "Nuclear protein, not Bacteria, 0.033, 1808, 0.647, 35620, 0.033, 1798, 0.994, 1.537, 0.011, 628.413". This statement which is obvious from the biological point of view nevertheless does not make an absolute rule; in fact out of all 1808 protein entries annotated by the keyword "Nuclear protein" in the PEDANT database only 1798 actually have eukaryotic origin. The ten proteins constituting exceptions from this rule (for example, the oligoribonuclease from Mycobacterium tuberculesis, PEDANT code gi_15609648) simply inherit this keyword from their eukaryotic homologs. In our example, the homolog is human oligoribonuclease (PEDANT code gi_116242694, UniProt code ORN_HUMAN), one of the alternatively spliced isoforms of which is localized to the nucleus. Some aspects of negative rule statistics differ significantly from positive association rules due to vastly different item frequencies. Because annotation items themselves are rare, and most items are in fact extremely rare (e.g., the PFAM domain PF01029 is only found in 12 (0.02%) of proteins analyzed), their negations used in negative association rule mining are unavoidably very frequent. This simple circumstance makes the calculation of negative rules computationally much more challenging compared to positive rules and necessitates the application of much stricter thresholds on the rules of interest. While analyzing rule strength distribution we considered only the rules exhibiting strength higher than 0.1. The number of weaker rules (strength below 0.1) is too high due to the combinatorial explosion caused by random feature combinations, making their analysis computationally prohibitive. However, even in the strength interval 0.1 -1.0 the number of negative rules is several orders of magnitude higher that the number of positive rules. To make the task computationally tractable we additionally imposed a threshold on minimal leverage which effectively helps to select only the most 'non-random' rules (see Methods) and eliminates all rules with the strength below 0.97. The distributions of negative rule strength with different minimal coverage counts are plotted in Figure 1.
The fraction of proteins in the PEDANT database constituting exceptions from strong rules in the strength interval between 0.97 and 1.0 as well as the fraction of relevant (homology-transferred) features participating in such rules is very low. In total, we identified 6875 features (0.8%) in the annotation of 1031 proteins (1.9%) as potential annotation errors.

Distribution of negative association rule strength
(probability that a given database entry will satisfy the right side of the rule given that it satisfies the left side of the rule) Figure 1 Distribution of negative association rule strength (probability that a given database entry will satisfy the right side of the rule given that it satisfies the left side of the rule). Minimal coverage counts (number of entries in the database that possess all features from the left hand side of the rule) used are 100 (blue), 200 (pink), and 500 (green). The threshold for minimal leverage count (difference of the actual rule frequency and the probability to find it by chance with the given frequencies of its RHS and LHS) was set to 100 in all calculations

Analysis of potentially erroneous annotation features Taxon-specific rules
In order to estimate the number of actual annotation errors among exceptions from strong rules the first test was designed to exclude the influence of taxon specificity. It turned out that a very large number of rules combined FunCat labels on one side of the rule with the taxon of the protein origin on the other side (we used here only the highest-level taxons, namely Eukaryota, Bacteria, Archae, and Viruses). There were 3159 rules (33% of all negative rules) with such structure. Because many FunCat labels are taxon-specific (see Methods), these labels should ideally only be present in the annotation of the genes belonging to the corresponding taxa; homology-based transfer of such annotation attributes is highly prone to error. Where a taxonomically specific FunCat label is incompatible with the known gene taxon, it is the FunCat assignment which is guaranteed to be erroneous, since the protein origin is doubtlessly known. This simple test resulted in automatic correction of almost 50% of all exceptions in our set of strong negative rules. Figure 2 shows how the fraction of corrected exceptions among all exceptions depends on the rule strength.

Performance of the method
After the filter for taxonomy-specific FunCat labels was implemented, the set of 6432 rules formed all negative rules for our annotation sample. These rules involved 4687 transferable features (0.6% of all features) in the annotation of 822 proteins (1.5% of all proteins).
To estimate the prevalence of errors among exceptions not corrected by the taxonomy procedure described above we selected randomly a sample of 100 rules and analyzed their exceptions manually. In 96% of examined exceptions at least one of the features constituting the rule was assigned wrongly to the given protein. The overall specificity of the approach was estimated to be as high as 98%: practically all feature combinations associated with exceptions included at least one annotation error.
The specificity of the negative rules is thus much higher than that in the case of positive rules [18] which was estimated to be around 68% based on careful manual verification. By design, exceptions from negative association rules can only reveal over-annotation, i.e. erroneous assignment of attributes to protein entries, while underannotation (missing attributes) cannot be detected. While manually curated databases are typically under-annotated and this approach is not efficient for them (the performance of the method was very low when tested on the Swiss-Prot database, data not shown), over-annotation is a typical problem of many automatic software pipelines, including PEDANT, and the ability to correct this type of errors using negative rule mining is valuable. At the same time, the approach based on exceptions from strong negative rules yields much smaller coverage than positive rule mining. As seen in Figure 3, negative rule mining allows identifying 11 times fewer annotation features (0.6% for negative rules versus 6.7% for positive rules) that participate in incompatible feature combinations. More than Coverage of the negative and positive rule mining approaches What fraction of the annotation features flagged by the negative rules are actual annotation errors? Our approach is designed to flag incompatible feature combinations for subsequent manual inspection rather than to automatically correct annotation errors in an unsupervised fashion. With the exception of taxon-specific rules where FunCat labels incompatible with the taxonomic origin of a protein are guaranteed to be errors, we do not know exactly which feature of a flagged feature combination is wrong. Besides there always exists a chance that all features constituting an exception from a strong negative rule are nevertheless correctly assigned and that the exception is in fact biologically motivated.
It would be desirable to validate our predictions against high-quality manually curated databases such as Swiss-Prot or BRENDA [29], but this, unfortunately cannot be done at sufficiently large scale. As discussed above most of the annotation errors found in PEDANT are over-predictions which can not easily be confirmed by comparison with Swiss-Prot entries as the latter tend to be underannotated. If a certain feature is present in a Swiss-Prot entry, it is almost certainly correct, however, if a feature is absent no statement can generally be made. BRENDA focuses on one aspect of protein annotation -EC numbers -providing detailed classification of enzymes at four hierarchical levels. Correspondingly, there are only few proteins associated with each four digit EC number while association rule mining relies on frequent annotation items. In this study rules were required to have coverage count of at least 100, and only six EC numbers satisfied this condition.
We therefore attempted to estimate the fraction of actual annotation errors among those features flagged as suspicious and, for comparison, in non-flagged features by careful manual inspection as described in Methods. As seen in Table 2, features flagged as suspicious are almost 1.4 fold enriched in actual annotation errors than the unmarked ones.

Conclusion
Based on our assessment it becomes apparent that almost all incompatible feature combinations found by negative association rule mining include at least one wrongly assigned annotation term. The fraction of individual features flagged as suspicious is about 0.6% from the total number of features assigned by PEDANT and it is significantly enriched in annotation errors. Moreover, roughly two thirds of such erroneous assignments are not identified by positive rule mining. We conclude that applying a combination of positive rule mining described earlier [18] and negative rule mining presented creates an opportunity to enhance the fidelity of genome annotation in two alternative ways. First, insights about the sources of annotation errors gained in this investigation can be used to adjust the automatic annotation pipeline in order to minimize generation of these errors in the future. Examples of such possible modifications include taxon-specific homology-based transfer of functional categories and utilization of individualized similarity thresholds for various features. Second, suspicious features can be visually marked for subsequent inspection by the user. While this approach is better suited for manually curated databases where errors actually get corrected by human experts, it is also useful for automatic systems such as PEDANT where users get alerted to specific less trusted annotation items that should be used with caution.
Urban Hafner for technical assistance, and the anonymous reviewers for the valuable suggestions and corrections. This work was conducted in the framework of the BioSapiens Network of Excellence funded by the European Commission FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2003-503265. IIA was partially supported by the Russian Academy of Sciences (program "Molecular and Cellular Biology").