NoGOA: predicting noisy GO annotations using evidences and sparse representation

Background Gene Ontology (GO) is a community effort to represent functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Due to resource limitations, only a small portion of annotations are manually checked by curators, and the others are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently reports that a considerable number of noisy (or incorrect) annotations remain. Given the wide application of annotations, identifying noisy annotations is an important but seldom studied open problem. Results We introduce a novel approach called NoGOA to predict noisy annotations. First, NoGOA applies sparse representation on the gene-term association matrix to reduce the impact of noisy annotations, and uses the sparse representation coefficients to measure the semantic similarity between genes. Second, it preliminarily predicts noisy annotations of a gene based on aggregated votes from the semantic neighborhood genes of that gene. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived in different periods, then weights entries of the association matrix via the estimated ratios and propagates the weights to ancestors of direct annotations using the GO hierarchy. Finally, it integrates the evidence-weighted association matrix and the aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and that removing noisy annotations improves the performance of gene function prediction. Conclusions The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations.
Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1764-z) contains supplementary material, which is available to authorized users.


Workflow of sparse representation
Fig. S1 illustrates how sparse representation is used to compute the semantic similarity between genes from their GO annotations. In Fig. S1, A is a gene-term association matrix that contains 8 genes and 10 GO terms. By minimizing the first equation in Fig. S1, we obtain $\{\gamma_i\}_{i=1}^{8}$, which make $\gamma_i^T \bar{A}_i$ close to $A(i,\cdot)$ while minimizing $\|\gamma_i\|_1$. For example, $\gamma_1$ means that $A(1,\cdot)$ is reconstructed by $0.3074 \times A(3,\cdot)$, $0.3327 \times A(5,\cdot)$ and $0.2069 \times A(7,\cdot)$, so the similarities between $A(1,\cdot)$ and $\{A(i,\cdot)\}_{i=2}^{8}$ are 0, 0.3074, 0, 0.3327, 0, 0.2069 and 0, respectively. Since we do not use the similarity between a gene and itself, we set the diagonal entries of S to 0. Finally, we set $S = (S^T + S)/2$ to ensure that S is symmetric, and store the semantic similarity between genes in S.

Figure S1: A workflow of using sparse representation to compute semantic similarity between genes.
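As a rough illustration of this workflow, the following Python sketch computes a similarity matrix from a toy association matrix. It assumes a Lasso-style objective (squared reconstruction error plus an l1 penalty weighted by a hypothetical parameter `lam`) solved by coordinate descent; the exact objective and solver used by NoGOA are those of Fig. S1 and the main text and may differ. The function names and the toy matrix are illustrative only.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam (proximal operator of the l1 norm)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def sparse_code(target, atoms, lam=0.1, n_iter=100):
    """Coordinate descent for the assumed objective
    min_g 0.5 * ||target - sum_j g[j] * atoms[j]||^2 + lam * ||g||_1."""
    coeffs = [0.0] * len(atoms)
    residual = list(target)  # target minus the current reconstruction
    sq_norms = [sum(v * v for v in atom) for atom in atoms]
    for _ in range(n_iter):
        for j, atom in enumerate(atoms):
            if sq_norms[j] == 0.0:
                continue  # a gene without annotations cannot explain anything
            # Correlation of atom j with the residual that excludes atom j.
            rho = sum(a * (r + coeffs[j] * a) for a, r in zip(atom, residual))
            new_coeff = soft_threshold(rho, lam) / sq_norms[j]
            delta = new_coeff - coeffs[j]
            residual = [r - delta * a for r, a in zip(residual, atom)]
            coeffs[j] = new_coeff
    return coeffs


def semantic_similarity(A, lam=0.1):
    """Reconstruct each row of A from the other rows and symmetrize."""
    n = len(A)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [k for k in range(n) if k != i]  # exclude the gene itself
        coeffs = sparse_code(A[i], [A[k] for k in others], lam)
        for k, c in zip(others, coeffs):
            S[i][k] = c  # the diagonal stays 0
    # S = (S^T + S) / 2
    return [[(S[i][j] + S[j][i]) / 2 for j in range(n)] for i in range(n)]


# Toy gene-term matrix: genes 0 and 1 share the same two terms.
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
S = semantic_similarity(A)
```

In this toy case, genes 0 and 1 reconstruct each other and obtain a large similarity, while genes with disjoint annotations obtain a similarity of 0.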

Performance of noisy annotations prediction
In the main text, we reported the results of eight methods (LF, SR, EC, NtN, NoisyGOA, NtN+EC, NoisyGOA+EC and NoGOA) in predicting the noisy annotations of genes on the archived GOA files of H. sapiens (archived date: May 2016). Here, the results on A. thaliana, S. cerevisiae, H. sapiens, G. gallus, B. taurus and M. musculus (archived date: May 2015 or May 2016) are included in Tables S1-S11. In these tables, the statistically best performer is highlighted in boldface, and significance is checked by a pairwise t-test at the 95% level. These results support the same observations and conclusions as in the main text.

Parameter sensitivity analysis
To study the effectiveness of NoGOA under different input values of θ, we vary θ from 0.1 to 0.9 while fixing α at 0.2. The results are shown in Fig. S2. In most cases, the F1-Score remains stable when θ ∈ [0.1, 0.7] and then decreases when θ ∈ [0.7, 0.9]. These results suggest that the weights for evidence codes whose estimated ratios of noisy annotations are above the threshold τ, and for those whose estimated ratios are below τ, should be specified differently. The sensitivity study of θ again justifies setting different weights for different annotations based on the estimated ratio of noisy annotations per evidence code.

We also study the effectiveness of NoGOA under different specifications of τ, and report the results of NoGOA in predicting noisy annotations when τ is the average (or median) of $r^{m}_{ec}$. We also tested τ = 0. Because annotations for more than half of the evidence codes in the historical GOA files do not disappear in the recent GOA files, the median of $r^{m}_{ec}$ is 0, so the results of NoGOA with τ = 0 are the same as those with τ equal to the median. The results are shown in Tables S12-S17. From these tables, we can see that τ influences the performance of predicting noisy annotations. In most cases, NoGOA performs worse when τ = 0. This is because the evidence codes that have no noisy annotations always have few annotations. NoGOA assigns larger weights to annotations tagged with these evidence codes, and smaller weights to the others. As a result, using 0 as the threshold to separate noisy annotations from true annotations makes the performance gain less obvious.
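The role of τ can be illustrated with a small Python sketch. It assumes that an annotation counts as noisy when its (gene, term) pair appears in the historical file but not in the recent one, and it ignores the restriction to direct annotations and the GO hierarchy. The function names, the tuple layout and the toy identifiers are hypothetical, and the actual weighting by θ is not reproduced here.

```python
from collections import defaultdict
from statistics import mean, median


def noisy_ratio_per_code(old_annots, new_annots):
    """Fraction of historical annotations per evidence code that are
    absent from the recent file. Each annotation is (gene, term, code)."""
    still_present = {(gene, term) for gene, term, _ in new_annots}
    total = defaultdict(int)
    removed = defaultdict(int)
    for gene, term, code in old_annots:
        total[code] += 1
        if (gene, term) not in still_present:
            removed[code] += 1
    return {code: removed[code] / total[code] for code in total}


def split_codes(ratios, tau):
    """Evidence codes with ratio above tau versus those at or below tau."""
    above = sorted(c for c, r in ratios.items() if r > tau)
    at_or_below = sorted(c for c, r in ratios.items() if r <= tau)
    return above, at_or_below


# Toy historical and recent annotation sets (placeholder identifiers).
old = [
    ("g1", "t1", "IDA"),
    ("g2", "t1", "IEA"),
    ("g3", "t2", "IEA"),
    ("g4", "t3", "ISS"),
]
new = [
    ("g1", "t1", "IDA"),
    ("g2", "t1", "IEA"),
    ("g4", "t3", "ISS"),
]

ratios = noisy_ratio_per_code(old, new)
tau_mean = mean(list(ratios.values()))
tau_median = median(list(ratios.values()))
above_mean, below_mean = split_codes(ratios, tau_mean)
```

In this toy data two of the three evidence codes lose no annotations, so the median ratio is 0, which mirrors the observation that τ = 0 and τ equal to the median give the same split.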

Evaluation metrics for gene function prediction
Here, we introduce six evaluation metrics (MacroAvgF1, MicroAvgF1, AvgAUC, AvgPrec, Fmax and Smin) to measure the performance of gene function prediction. The first four metrics are widely used to measure the performance of multi-label classification (Zhang et al., 2014), which aims to predict a set of related labels for an instance instead of a single label. From this perspective, gene function prediction can also be viewed as a multi-label learning problem. Therefore, these four metrics are used to measure the performance of gene function prediction (Fu et al., 2016; Yu et al., 2015). The latter two metrics (Smin and Fmax), along with AvgAUC, are suggested and used in the Second Critical Assessment of protein Function Annotation algorithms (CAFA2) (Jiang et al., 2016). The definitions of these metrics are detailed in (Zhang et al., 2014; Jiang et al., 2016). To be self-contained, we list the definitions of these metrics here.
MacroAvgF1 is the average of the $F_1$ scores of different terms:
$$\text{MacroAvgF1} = \frac{1}{|T|} \sum_{t=1}^{|T|} \frac{2\, p_t\, r_t}{p_t + r_t}$$
where $|T|$ is the number of terms, and $p_t$ and $r_t$ are the precision and recall for GO term t, defined as:
$$p_t = \frac{TP_t}{TP_t + FP_t}, \qquad r_t = \frac{TP_t}{TP_t + FN_t}$$
$TP_t$, $FP_t$ and $FN_t$ are the numbers of true positives, false positives and false negatives with respect to term t. From the definition, it can be observed that MacroAvgF1 first calculates the $F_1$ measure for each term, and then averages over all the terms. MacroAvgF1 is more affected by the performance on specific terms, which are annotated to few genes.
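MicroAvgF1 calculates the $F_1$ measure on the predictions of different terms as a whole:
$$\text{MicroAvgF1} = \frac{2 \sum_{t=1}^{|T|} TP_t}{2 \sum_{t=1}^{|T|} TP_t + \sum_{t=1}^{|T|} FP_t + \sum_{t=1}^{|T|} FN_t}$$
MicroAvgF1 is more affected by the performance on general terms than on specific terms.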
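The contrast between the two averages can be seen in a short Python sketch. It assumes binary gene-by-term matrices given as lists of lists and treats the F1 of a term with no true or predicted annotations as 0; the function names and the toy data are illustrative only.

```python
def term_counts(truth, pred):
    """Per-term (TP, FP, FN) from binary gene-by-term matrices."""
    counts = []
    for t in range(len(truth[0])):
        pairs = [(y[t], p[t]) for y, p in zip(truth, pred)]
        tp = sum(1 for y, p in pairs if y == 1 and p == 1)
        fp = sum(1 for y, p in pairs if y == 0 and p == 1)
        fn = sum(1 for y, p in pairs if y == 1 and p == 0)
        counts.append((tp, fp, fn))
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 = 2pr / (p + r), which equals 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_avg_f1(truth, pred):
    """Average the per-term F1 scores."""
    counts = term_counts(truth, pred)
    return sum(f1_from_counts(*c) for c in counts) / len(counts)


def micro_avg_f1(truth, pred):
    """Pool the counts over all terms, then compute a single F1."""
    counts = term_counts(truth, pred)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


# Term 0 is general (all four genes); term 1 is specific (one gene).
truth = [[1, 0], [1, 0], [1, 1], [1, 0]]
pred = [[1, 0], [1, 0], [1, 0], [1, 0]]  # the specific term is missed

macro = macro_avg_f1(truth, pred)  # 0.5: the missed term halves the score
micro = micro_avg_f1(truth, pred)  # 8/9: dominated by the general term
```

Missing the single annotation of the specific term costs half of the macro average but only about a tenth of the micro average.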
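Average Precision (AvgPrec) evaluates the average fraction of annotated terms ranked at or above a particular term $t \in T_i$ ($T_i$ stores all the GO terms annotated to the i-th gene):
$$\text{AvgPrec} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|T_i|} \sum_{t \in T_i} \frac{|\{t' \in T_i : rank(V(i,t')) \le rank(V(i,t))\}|}{rank(V(i,t))}$$
where N is the number of genes and $rank(V(i,\cdot))$ is a rank function, which ranks the largest entry of $V(i,\cdot) \in R^{|T|}$ as 1 and the smallest entry as $|T|$. The larger the value of AvgPrec, the better the performance is.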
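A minimal Python sketch of this ranking-based metric follows, using the standard multi-label definition of average precision. It assumes that ties are broken by term index and that genes without any annotated term are skipped; both choices, like the function name and the toy data, are assumptions made for illustration.

```python
def avg_prec(scores, truth):
    """Average precision over genes.

    scores: per-gene lists of predicted scores, one entry per term.
    truth:  per-gene binary lists marking the annotated terms.
    """
    total, n_genes = 0.0, 0
    for row_scores, row_truth in zip(scores, truth):
        annotated = [t for t, y in enumerate(row_truth) if y == 1]
        if not annotated:
            continue  # assumption: genes without annotations are skipped
        # Rank 1 goes to the largest score; sorted() is stable on ties.
        order = sorted(range(len(row_scores)), key=lambda t: -row_scores[t])
        rank = {t: pos + 1 for pos, t in enumerate(order)}
        gene_prec = 0.0
        for t in annotated:
            # Annotated terms ranked at or above term t.
            hits = sum(1 for u in annotated if rank[u] <= rank[t])
            gene_prec += hits / rank[t]
        total += gene_prec / len(annotated)
        n_genes += 1
    return total / n_genes if n_genes else 0.0


# One gene, four terms; terms 0 and 3 are annotated.
scores = [[0.9, 0.1, 0.8, 0.3]]
truth = [[1, 0, 0, 1]]
value = avg_prec(scores, truth)  # (1/1 + 2/3) / 2 = 5/6
```

Term 0 is ranked first, and term 3 is ranked third with one unannotated term above it, which gives the two fractions 1/1 and 2/3.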
Average AUC (AvgAUC) is a term-centric evaluation metric: it averages the area under the receiver operating characteristic curve (AUC) of each term across all terms in T. The receiver operating characteristic curve plots the true positive rate (sensitivity) as a function of the false positive rate (1 - specificity) under different classification thresholds. The curve measures the overall quality of the ranking induced by the classifier, rather than the quality at a single threshold in that ranking.
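The term-centric averaging can be sketched in Python as follows. The sketch computes each AUC through its equivalent pairwise form (the fraction of positive-negative gene pairs ranked correctly, with ties counted as one half) and assumes that terms lacking either positive or negative genes are skipped; the function names and the toy data are illustrative.

```python
def term_auc(scores, labels):
    """AUC of one term: the probability that an annotated gene is scored
    above an unannotated gene (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def avg_auc(score_matrix, truth):
    """Average the per-term AUC over all terms where it is defined."""
    aucs = []
    for t in range(len(truth[0])):
        auc = term_auc([row[t] for row in score_matrix],
                       [row[t] for row in truth])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs) if aucs else float("nan")


# Four genes, two terms.
score_matrix = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
truth = [[1, 0], [1, 1], [0, 0], [0, 1]]
value = avg_auc(score_matrix, truth)  # (1.0 + 0.75) / 2 = 0.875
```

The first term is ranked perfectly (AUC 1.0), while one of the four positive-negative pairs of the second term is misordered (AUC 0.75).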
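Fmax is a gene-centric evaluation metric used in the community-based critical assessment of protein function annotation (CAFA) (Radivojac et al., 2013). Fmax is an F-measure computed as:
$$F_{max} = \max_{\theta} \left\{ \frac{2\, p(\theta)\, r(\theta)}{p(\theta) + r(\theta)} \right\}$$
where $p(\theta) = \frac{1}{m(\theta)} \sum_{i=1}^{m(\theta)} p_i(\theta)$ is the precision at threshold $\theta \in [0, 1]$, $p_i(\theta)$ is the precision on the i-th gene, $m(\theta)$ is the number of genes on which at least one prediction was made above the threshold θ, and $r(\theta) = \frac{1}{N} \sum_{i=1}^{N} r_i(\theta)$ is the recall across N genes at threshold θ.
Smin is an evaluation metric used in the second critical assessment of functional annotation (CAFA2) (Jiang et al., 2016) and defined as:
$$S_{min} = \min_{\theta} \sqrt{ru(\theta)^2 + mi(\theta)^2}$$
where the remaining uncertainty $ru(\theta)$ and the misinformation $mi(\theta)$ are
$$ru(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t} ic(t)\, 1(t \notin P_i(\theta) \wedge t \in T_i), \qquad mi(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t} ic(t)\, 1(t \in P_i(\theta) \wedge t \notin T_i)$$
$P_i(\theta)$ denotes the set of terms whose predicted scores are greater than or equal to θ for the i-th gene, $T_i$ denotes the ground-truth set of terms for that gene, $1(\cdot)$ is an indicator function, and $ic(t)$ is the information content of the ontology term t. It is estimated in a maximum likelihood manner as the negative binary logarithm of the conditional probability that the term t is present in a gene's annotations given that all its parent terms are also present.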
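Both CAFA-style metrics can be sketched in a single threshold sweep, following the standard CAFA definitions. The Python sketch below assumes that the information content of each term is supplied as a precomputed list `ic`, that thresholds run from 0 to 1 in steps of 0.01, and that a term is predicted when its score is at least θ. These choices, the function name and the toy data are assumptions rather than the evaluation code used in the paper.

```python
import math


def fmax_smin(scores, truth, ic, thresholds=None):
    """Return (Fmax, Smin) over a sweep of thresholds.

    scores: per-gene lists of predicted scores, one entry per term.
    truth:  per-gene sets of annotated term indices.
    ic:     information content of each term (assumed precomputed).
    """
    if thresholds is None:
        thresholds = [k / 100 for k in range(101)]
    n_genes = len(truth)
    fmax, smin = 0.0, float("inf")
    for theta in thresholds:
        prec_sum, rec_sum, m = 0.0, 0.0, 0
        ru, mi = 0.0, 0.0
        for row_scores, true_terms in zip(scores, truth):
            predicted = {t for t, s in enumerate(row_scores) if s >= theta}
            hits = predicted & true_terms
            if predicted:  # precision only counts genes with predictions
                prec_sum += len(hits) / len(predicted)
                m += 1
            if true_terms:
                rec_sum += len(hits) / len(true_terms)
            ru += sum(ic[t] for t in true_terms - predicted)  # missed terms
            mi += sum(ic[t] for t in predicted - true_terms)  # wrong terms
        ru /= n_genes
        mi /= n_genes
        smin = min(smin, math.sqrt(ru * ru + mi * mi))
        if m:
            p, r = prec_sum / m, rec_sum / n_genes
            if p + r > 0:
                fmax = max(fmax, 2 * p * r / (p + r))
    return fmax, smin


# Two genes, three terms; the second gene ranks a wrong term highest.
scores = [[0.9, 0.2, 0.6], [0.7, 0.4, 0.1]]
truth = [{0, 2}, {1}]
ic = [1.0, 2.0, 0.5]
fmax, smin = fmax_smin(scores, truth, ic)
```

A larger Fmax and a smaller Smin both indicate better predictions; a predictor that separates annotated from unannotated terms at some threshold reaches Fmax = 1 and Smin = 0.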

Removing noisy annotations improves gene function prediction
In the main text, we provide the experimental results of gene function prediction by a majority vote model on PPI networks with and without removing noisy annotations in the archived GOA file of H. sapiens.
Here, we report the corresponding results on the archived GOA files of A. thaliana and S. cerevisiae in Tables S18-S19. From these tables, we can clearly see that removing the noisy annotations improves the performance of gene function prediction.

Table captions (archived GOA files, 2015 and 2016): for each evidence code (IDA, IPI, IMP, IGI, IEP, ISA, ISO, ISM, RCA, IGC, IBA, IBD, IKR, IRD, ISS, NAS, IC), the tables list the number of noisy annotations, the number of noisy annotations predicted by NoGOA, and the number of noisy annotations correctly predicted by NoGOA.