Benchmark set
The Gene Ontology vocabulary was retrieved from Gene Ontology Consortium web resource [28]. At the time of the retrieval, the GO vocabulary consisted of 39,399 distinct terms partitioned into three main categories, structured as directed acyclic graphs (DAG) with a unique root: 26,099 terms of type Biological Process (BP), 9753 of type Molecular Function (MF) and 3547 of type Cellular Component (CC). The literaturebased GO annotations were retrieved from the UniProtGOA web resource [29]. We downloaded all the single Gene Association files, corresponding to set of proteins in different species/knowledge bases (30 overlapping sets). The single files were parsed in order to extract only the associations between pairs of GO terms and PubMed identifiers (PMIDs), discarding the gene product references. We ended up with a benchmark set of 328,357 pairs, consisting of 115,402 distinct PMIDs annotated with 22,839 distinct GO terms (see Additional file 1). The title, abstract and reference information related to the PMIDs in the benchmark set were downloaded in XML format from PubMed [30]. For 1256 out of 115,402 documents (1 %) the information downloaded from PubMed consists of the publication title only. For a total of 45,341 documents (39 %) also the cited references are available. Out of 22,839 GO identifiers in benchmark set, 14,889 are of type BP, 5951 of type MF and 1999 of type CC. The number of GO annotations for each PMID ranges from 1 to 1309 terms, with an average of 2.8. The distribution of annotations in the benchmark set is not uniform. In particular, except for a small subset, the number of annotations per PMID is quite small: 39 % of PMIDs have a single GO annotation and 99 % at most 10 distinct annotations (see Additional file 2).
Our method requires a preprocessing of the underlying Knowledge Base (KB) and a parameter tuning phase (see “Text preprocessing” section and “Tuning of parameters” section below). Both these phases are extremely timeconsuming, which prevents the possibility of performing multiple rounds of cross validation in reasonable time. For this reason, for performance evaluation, we randomly selected a large sample of 15,000 publications from the benchmark set derived from UniProtGOA. Publications in this set are used for testing the classification performances, while the remaining ∼100,000 publications constitute the underlying KB of our method. Almost 6 % of the terms represented in the test set have no associated publication in the KB.
Among these missing terms, there is a higher number of BP terms (67 %) in comparison to MF (25 %) and CC (8 %) terms.
Approach
We use the combination of a publicationcentric and termcentric approach to capture the relation between a scientific publication and a GO term:

1.
Publicationcentric: the relation between a GO term t and a query q is inferred from the query’s similarity with annotated publications in the underlying KB.

2.
Termcentric: the relation between the topics covered by a query q and a GO term t is determined by direct comparison.
The likelihoods obtained from these two approaches are simply combined into a single relatedness score:
$$ \Phi(q,t) = \Phi_{P}(q,t) \cdot \Phi_{T}(q,t), $$
((1))
where Φ
_{
P
} and Φ
_{
T
} are the publicationcentric and the termcentric similarity scores, respectively. The meaning of Eq. 1 is quite simple (see Fig. 1). The Φ
_{
P
} function makes use of a kNN strategy to select a ranked list of GO terms (only terms associated to some publication in the KB can be selected). The Φ
_{
T
} function provides a reweighting of the selected terms.
The details of the preprocessing and tuning phase, as well as the exact definition of the two similarity functions in Eq. 1 are given in the rest of this section.
Text preprocessing
We perform typical preprocessing operations to transform each publication and GO term into a structured representation more manageable for querying. We make use of the quite common Bag of Words (BoW) representation of documents. The BoW representation describes a textual document by means of a feature vector, where each entry indicates the presence or absence of a word. The elements of the vector are weighted in order to balance their relative importance. Other than BoW representation, we experimented with different features that can be automatically extracted from PubMed and GO knowledge bases. In detail, a publication p is indexed by the following information:

\(\mathcal {W}(p)\) (text): BoW built from the abstract and title of the publication p. More generally, this feature represents the BoW built from unstructured text associated to the publication.

\(\mathcal {T}(p)\) (title): BoW built from the title of the publication p.

\(\mathcal {R}(p)\) (references): weighted vector of references (PMIDs). As for the BoW, each entry of the vector indicates the presence (or absence) of a reference within the publication. The references are weighted according to their importance.

\(\mathcal {Y}(p)\) (year): year of publication of p.
All the features above can be easily extracted for documents indexed into PubMed. Some of these features could not be available for some publication (references, for example), particularly for query publications. We assume that at least the \(\mathcal {W}(p)\) feature vector is always nonnull (i.e. contains entries greater than zero). A GO term t is indexed with the following information (with abuse of notation, we use the same symbol when the termspecific and publicationspecific features are related):

\(\mathcal {W}(t)\) (text): BoW built from unstructured text associated to the term t. In this case, the text associated to t includes the term name, synonyms and description (all available in the GO knowledge base), as well as all the titles and abstracts of those publications in the KB associated to term t (these informations could be absent).

\(\mathcal {T}(t)\) (name): BoW built from the name and synonyms of the term t.

\(\mathcal {Y}(t)\) (year): average year of the publications annotated with term t.
The \(\mathcal {W}(t)\) and \(\mathcal {T}(t)\) BoWs are always non empty, while the \(\mathcal {Y}(t)\) feature can be unspecified (if there is no publication in the KB annotated with t).
We use the following procedure to create a pool of words (BoW) appearing in titles and abstracts of publications in the KB, as well as title, synonyms and descriptions of GO terms. Single words occurring in the text are extracted, discarding punctuation. A stopword list is used to filterout all the words that are not enough informative about the text (such as, actually, after, etc). The standard Porter stemming algorithm [31] is used to group words with common stems. As the method works accurately even without filtering or selecting features, which generally are costly steps that require further complex tuning of parameters, we do not perform such task in this paper. However, we use the common tfidf scheme [32], which is one of the several existing techniques [33], to assign a relevance weight to words related to each feature.
The Term Frequency (tf) measures how frequently a word appears into a document:
$$tf(w,x) = \frac{\text{number of times word}\ w\ \text{appears in text}\ x}{\text{total number of words in text}\ x} $$
The Inverse Document Frequency (idf) measures how much important a word is:
$$ {\small{\begin{aligned} &idf(w) \\ &= \log\!{\frac{\text{total number of publications and GO terms}}{\text{number of publications and GO terms containing}\ w}} \end{aligned}}} $$
The tfidf scheme is the product of these two quantities. To automatically extract weighted vectors of references \(\mathcal {R}(p)\), we use the same criteria of the idf measure. Publications indexed in PubMed are identified with a unique PMID and cited references are represented as a list of PMIDs. Thus, the idf measure for references can be easily calculated by:
$$ {\small{\begin{aligned} &{idf}_{\text{ref}}({\text PMID})\\ &\quad=\log\frac{\text{total number of publications with { bibliography}}}{\text{number of publications citing PMID}} \end{aligned}}} $$
With the BoW representation, the similarity between two feature vectors x,y can be measured by means of the commonly used cosine similarity [34]:
$$sim_{cos}(x,y)=\frac{x\cdot y}{\ x\ \y\} $$
Since by definition the BoW vectors have only positive values, the s
i
m
_{
cos
} score is always nonnegative.
Publicationcentric score
The first similarity score between a publication query and a GO term is recovered by comparison with annotated literature, namely the publications in the underlying KB. This is basically a kNN approach. First, the publication query is compared with all the publications in the KB, in order to detect those that appear to be most closely related to the query (Step 1 in Fig. 1). Second, a ranked list of GO annotations for the query is inferred from the recovered publications (Step 2 in Fig. 1).
The comparison score between a query publication q and a target publication p in the KB is given by the product of four different similarity measures:
$$ \phi_{P}(q,p)= \Pi_{i=1}^{4} (1+f_{i}(q,p))^{m_{i}} $$
((2))
where 0≤f
_{
i
}(q,p)≤1 gives a similarity score associated to a single publicationrelated feature and the power m
_{
i
} represents its relative weight. The four similarity functions f
_{
i
}
q
p are defined by:

1.
(Text similarity) Cosine similarity between BoW of the query and target publications:
$$f_{1}(q,p)=sim_{cos}(\mathcal{W}(q),\mathcal{W}(p)) $$

2.
(Title similarity) Cosine similarity between the titlerelated BoW of the query and target publications:
$$f_{2}(q,p)=sim_{cos}(\mathcal{T}(q),\mathcal{T}(p)) $$

3.
(References similarity) Cosine similarity between the weighted vector of references of the query and target publications:
$$f_{3}(q,p)=sim_{cos}(\mathcal{R}(q),\mathcal{R}(p)) $$

4.
(Year similarity) Distance between the publication year of the query and target. We decided to normalize this value with a maximum distance of 50 years:
$${}f_{4}(q,p)\! =\!\! \left\{\! \begin{array}{l l} 0 & \text{if}\ \mathcal{Y}(q)\mathcal{Y}(p)>50\\ (50\,\,\mathcal{Y}(q)\mathcal{Y}(p))/50 & \text{otherwise} \end{array} \right.$$
We remark that the four features described above can be used for the classification only if the query publication is indexed into PubMed. In all the other cases, i.e. when the query consists of unstructured text only, the only available feature is f
_{1}(q,p). Furthermore, also for publications indexed in PubMed some information can be missing (abstract and/or bibliography, for example). However, due to the definition of Eq. 2, the missing information does not affect the calculation of the comparison score. Symmetrically, the comparison score can be easily extended to incorporate more features.
The measure ϕ
_{
P
}(q,p), computed by Eq. 2, is used to rank the publications in the KB according to their similarity with the query. The goldstandard GO annotations of the topK ranked publications are then transferred to the query, by using ϕ
_{
P
}(q,p) as their relative weight. The publicationcentric similarity score between a query q and a GO term t is then given by:
$$ {}\Phi_{P}(q,t) \,=\, \sum_{\text{top}\textit{K} \text{ranked}\ p \in KB} \left\{\! \begin{array}{ll} \phi_{P}(q,p) & \text{if}\ p\ \text{has annotation}\ t \\ 0 & \text{otherwise} \end{array} \right. $$
((3))
Termcentric score
The second similarity score between a publication query and a GO term is based on the direct comparison between a query q and a term t (Step 3 in Fig. 1). Also in this case, the score is given by the the product of different similarity measures:
$$ \Phi_{T}(q,t)= \Pi_{i=1}^{5} (1+g_{i}(q,t))^{n_{i}} $$
((4))
where 0≤g
_{
i
}(q,t)≤1 is the score associated to a single GO termrelated feature and the power n
_{
i
} represents its relative weight. The five similarity functions g
_{
i
}(q,t) are defined by:

1.
(Text similarity) Cosine similarity between the BoW of the query and term:
$$g_{1}(q,t) = sim_{cos}(\mathcal{W}(q),\mathcal{W}(t)) $$

2.
(Title similarity) Cosine similarity between the titlerelated BoW of the query and term:
$$g_{2}(q,t)=sim_{cos}(\mathcal{T}(q),\mathcal{T}(t)) $$

3.
(Name frequency) Percentage of the words in the titlerelated BoW of the term that are contained in the BoW of the query:
$$g_{3}(q,t) = \frac{\mathcal{T}(t) \cap \mathcal{W}(q)}{\mathcal{T}(t)} $$

4.
(Name occurrence) Normalized number of occurrences of the name or synonym of the term in the query. In this case, we seek the full name of the term (and its synonyms) in the entire query text. The number of occurrences c is normalized to 4.
$$g_{4}(q,t) = \left\{ \begin{array}{ll} 1 & \text{if}\ c>4\\ c/4 & \text{otherwise} \end{array} \right. $$

5.
(Year similarity) Distance between the publication year of the query and term:
$${}g_{5}(q,t)\! = \!\left\{\! \begin{array}{l l} 0 & \text{if}\ \mathcal{Y}(q)\mathcal{Y}(t)\!>50\\ (50\,\,\mathcal{Y}(q)\mathcal{Y}(t))/50 & \text{otherwise} \end{array} \right. $$
Tuning of parameters
The computation of Eq. 1 requires a preprocessing of the underlying KB, as well as the tuning of a few parameters. The parameter tuning has been performed over the 100,402 publications in the KB, which has been randomly extracted from the 115,402 publications of the benchmark text set. Moreover we randomly split the KB into two text sets: a training set of 90,402 publications and a validation set of 10,000 publications.
Generally the tuning of parameters requires performing a crossvalidation strategy, however, as previously explained, we do not use such a strategy because of (i) the computational time required for preprocessing a so large data set and because (ii) the amount of data, randomly selected for the validation set, is representative of the entire benchmark text set. In fact, according to the sampling theory [35], in the worst case the representativeness (or sampling) error of this validation set is less than 0.98 %, with a confidence level of 95 %. We also remark that the effectiveness of the method does not strongly depend from the parameter tuning. In fact, without it the drop in classification accuracy over the validation set is less than 2 % (data not shown).
The parameters involved in the tuning process belong to the two functions that calculate the two similarity scores. In particular the computation of the similarity score Φ
_{
P
}(q,t) in Eq. 3 requires the tuning of the power parameters m
_{
i
} in Eq. 2 and the K parameter in Eq. 3. We tested all the combinations of the m
_{
i
} ranging from 0 to 4 each. Note that a 0value leads to not considering the related feature.
The selected power parameters are: m
_{1} = 4,m
_{3} = 3,m
_{2} =m
_{4} =1. While the power parameters seem to give more importance to some similarity scores, such as f
_{1}(q,p), they actually play a balancing role, since the actual values of the four similarity scores follow different distributions. We also remark that, when the input query consists uniquely of unstructured text, the classification relies only on the f
_{1}(q,p) similarity score. In such situations, the power parameter m
_{1} has only a limited effect on the scoring. We also tested the method by varying the number (K) of most similar publications in the KB considered in Eq. 3. We experimented that varying K from 50 to 1000, we obtain the best results with K ranging from 50 to 300, with a maximum peak at K=150.
As for Eq. 3, we tuned the power parameters n
_{
i
} for the Eq. 4 testing all combinations with each value ranging from 0 to 4. The selected parameters are: n
_{1}=4,n
_{3}=2, n
_{2} = n
_{4} = n
_{5} = 1. When the input query consists just of unstructured text, the only power parameters used are n
_{1},n
_{3} and n
_{4}.
We further considered the problem of assigning a confidence threshold (low, medium, high) to the predicted annotations. The aim of such filter is to provide to the user a confidence level, mainly for those queries that have a very low biological content. The confidence score is not considered for the experimental testing in the paper, but it is made available on the online Web application. Further details are reported in Additional file 2.
Evaluation metrics
In a real world scenario, a text mining tool for GO classification should assist a database curator for manual extraction of GO annotations from scientific literature. From this point of view, a classification tool can be useful whether it can provide a small set of annotations that accurately cover the biological content of a scientific publication. Thus, for performance assessment, we particularly focus on the evaluation of the top10 ranked terms only (recall that 99 % of the publications in our benchmark set have at most 10 distinct GO annotations). The performances for the top20 ranked terms are shown in Additional file 2.
Furthermore, in order to make the results more accessible to a heterogeneous community, we adopt different metrics from different domains. In particular, we use specific evaluation metrics from the IR domain, as established at TREC experiments [36], and metrics exploited for performance assessment in biologicallyrelated domains, such as BioCreative [14] and CAFA [5] experiments. The topic covered at CAFA is the evaluation of protein function prediction methods. As in our setting, also at CAFA the performances are assessed over the GO hierarchy. We further introduce a third set of metrics, based on recent approaches for measuring semantic similarity between set of terms/concepts [37].
In the rest of this section, we adopt the following notation. For a given query, we denote with T and P its related set of goldstandard (from GOA) and predicted (from the classifier) GO annotations, respectively. We use lowercase letters, such as t and p, to denote single terms within T and P, respectively. We can assume that predicted terms in P are ranked according to the classifier score. We denote with P
_{
k
} the topk ranked terms in P. With abuse of notation, we use P
_{
s
} to denote also the subset of predicted terms in P with a score greater than or equal to score threshold s. A detailed description of the metrics follows.
TREC metrics
The reciprocal rank metric evaluates the precision at first correctly predicted term by computing the reciprocal of its rank:
$$ {RR}_{k}(T,P)= \frac{1}{\min\{ i \mid T\cap P_{i}\neq \emptyset, 1\leq i\leq k\}}. $$
((5))
The M
R
R
_{
k
} is computed by averaging the R
R
_{
k
} over the entire set of evaluated queries. The recall at rank k metric evaluates the average fraction of relevant GO terms in the topk predictions returned by the classifier:
$$ R_{k}(T,P) = \frac{T\cap P_{k}}{T}. $$
((6))
The R
R
_{
k
} and R
_{
k
} measures are strictly greater than zero only if T∩P
_{
k
}≠∅. For performance evaluation, here we consider only the mean reciprocal rank and recall at the top10 predictions, M
R
R
_{10} and R
_{10}, respectively.
CAFA/BioCreative metrics
These metrics, introduced to reflect the hierarchical nature of GO, are based on a variant of the standard precision/recall measures, called hierarchical precision/recall [38]. The hierarchical measures are some of the metrics adopted at the BioCreative IV experiments. The hierarchicalprecision at rank k is defined by
$$ {hP}_{k}(T,P) = \frac{\mathcal{A}(T)\cap \mathcal{A}(P_{k})}{\mathcal{A}(P_{k})}, $$
((7))
and the hierarchicalrecall at rank k by
$$ {hR}_{k}(T,P) = \frac{\mathcal{A}(T)\cap \mathcal{A}(P_{k})}{\mathcal{A}(T)}, $$
((8))
where \(\mathcal {A}(X)\) denotes the set of all the ancestors of terms in X, recursively propagated up to the root(s) of the GO hierarchy. The set \(\mathcal {A}(X)\) contains also X, i.e. \(X\subseteq \mathcal {A}(X)\). The hierarchicalprecision and hierarchicalrecall can be combined into a single metric, the hierarchical Fmeasure (harmonic mean):
$$ {hF}_{k}(T,P) =\frac{2\cdot {hP}_{k}(T,P) \cdot {hR}_{k}(T,P)}{hP_{k}(T,P) +{hP}_{k}(T,P)} $$
((9))
The h
P
_{
k
}(T,P) measure tends to assign high scores when P contains very generic terms, such as those at the top of the hierarchy. Thus, the h
P
_{
k
}(T,P) measure is not very robust over GO, since it gives high scores even when P consists of just few noninformative terms, such as the three roots of the GO hierarchy. Symmetrically, the h
R
_{
k
}(T,P) measure tends to assign high scores when P contains very specific terms, such as those at the bottom of the hierarchy. Since the GO hierarchy contains many leaf terms, the h
R
_{
k
} measure is more robust than the h
P
_{
k
} over GO. Furthermore, if we choose a fixed k (= 10 or 20), even if T⊂P, h
P
_{
k
}(T,P) would generally provide a poor estimation of the classification capabilities, due to the highly unbalanced number of annotations per publication in our benchmark set. Conversely, the unbalanced number of annotations does not affect the h
R
_{
k
} metric. The h
F
_{
k
} metric is more robust than h
P
_{
k
} but it still suffers from the unbalanced amount of annotations in our dataset. For these reasons, here we consider only the hierarchical recall at rank 10, h
R
_{10}. The results for h
P
_{10} and h
F
_{10} are shown in Additional file 2.
With small modifications, the hierarchical measures have been adopted also at CAFA experiments. In detail, at CAFA, the topranked predictions are selected with respect to a given score threshold s, not a rank k. The hierarchical precision and recall in Eqs. 7 and 8, respectively, just need to be recoded with respect to score thresholds. Note that, in \( \mathcal {A}(P_{s})\) only the predicted terms with score greater than or equal to s are propagate up to the root(s) of the ontology. Furthermore, the hierarchical precision h
P
_{
s
}(T,P) is assumed to be equal to zero if P
_{
s
}=0. For a given dataset D consisting of pairs of true and predicted annotations, and for a given score threshold s, CAFA’s average hierarchical precision at s is defined by
$${hP}_{s}(D) = \frac{1}{m_{s}(D)}\sum_{(T,P)\in D} {hP}_{s}(T,P), $$
where m
_{
s
}(D)={P(T,P)∈D,P
_{
s
}>0} is the number of nonempty predictions at threshold s. Asymmetrically, the average hierarchical recall at threshold s is averaged over all pairs in the dataset, and it is defined by
$${hR}_{s}(D) = \frac{1}{D}\sum_{(T,P)\in D} {hR}_{s}(T,P). $$
These last two measure can be combined into a single metric, the Fmeasure (harmonic mean), at different score thresholds. The main evaluation metric at CAFA, which we also consider here, is the maximum value of the harmonic mean over all thresholds:
$$ {hF}_{max}(D) = \max_{s} \left\{ \frac{2\times {hP}_{s}(D) \times {hR}_{s}(D)}{hP_{s}(D) + {hR}_{s}(D)} \right\}. $$
((10))
Thanks to the choice to adopt a score cutoff, CAFA’s hierarchical Fmeasure is not affected by the unbalanced number of annotations in the benchmark set.
InformationTheoretic metrics
These metrics can be considered the informationtheoretic counterparts of precision/recall, and rely on the information content of individual terms t in the GO hierarchy:
where P
r(t) is the relative frequency of term t with respect to some background distribution. We adopt as background distribution the entire set of goldstandard annotations in our benchmark set. The Resnik’s similarity [39] between two terms t and p is defined as the maximum information content among the common ancestors of t and p:
$${sim}_{Resnick}(t,p)=\max_{a \in \mathcal{A}(\{t\}) \cap \mathcal{A}(\{p\})}\{ic(a)\} $$
The Lin’s similarity [40] between two terms t and p is the normalized version of the Resnick’s similarity:
$${sim}_{Lin}(t,p)=\frac{2\times {sim}_{Resnick}(t,p)}{ic(t)+ic(p)} $$
Following the approach in [37], we can extend the informationtheoretic similarities to sets of annotations. In this way, it is possible to obtain the informationtheoretic counterpart of precision at the topk ranked terms:
$$ {iP}_{k}(T,P)=\frac{1}{\mathcal{L}(P_{k})}\sum_{p\in \mathcal{L}(P_{k})} \max_{t\in \mathcal{L}(T)}\{sim_{Lin}(t,p)\}, $$
((11))
and recall at the topk ranked terms:
$$ {iR}_{k}(T,P)=\frac{1}{\mathcal{L}(T)}\sum_{t\in \mathcal{L}(T)} \max_{p\in \mathcal{L}(P_{k})}\{sim_{Lin}(t,p)\} $$
((12))
where \(\mathcal {L}(X)\) denotes the set of leaf terms in X, i,e. \(\mathcal {L}(X)\subseteq X\) is the largest subset of X such that \(\forall u,v \in \mathcal {L}(X), u\notin \mathcal {A}(v)\). As it happens for the hierarchicalprecision h
P
_{10}, the informationtheoretic precision i
P
_{10} is slightly affected by the unbalanced number of annotations in our benchmark set. Conversely, the i
P
_{1} metric is more robust, since it just estimates the quality of the very top predicted annotation. For these motivations, here we consider only the informationtheoretic precision at the top ranked term (i
P
_{1}) and the informationtheoretic recall at rank 10 (i
R
_{10}), which can be seen as complementary to the two TREC metrics M
R
R
_{10} and R
_{10}. The results for i
P
_{10} are shown in Additional file 2.