Predicting the functions of a protein from its ability to associate with other molecules

Background All proteins associate with other molecules. These associated molecules are highly predictive of the potential functions of proteins. The association of a protein and a molecule can be determined from their co-occurrences in biomedical abstracts. Extensive semantically related co-occurrences of a protein’s name and a molecule’s name in the sentences of biomedical abstracts can be considered as indicative of the association between the protein and the molecule. Dependency parsers extract textual relations from a text by determining the grammatical relations between words in a sentence. They can be used for determining the textual relations between proteins and molecules. Despite their success, they may extract textual relations with low precision. This is because they do not consider the semantic relationships between terms in a sentence (i.e., they consider only the structural relationships between the terms). Moreover, they may not be well suited for complex sentences and for long-distance textual relations. Results We introduce an information extraction system called PPFBM that predicts the functions of unannotated proteins from the molecules that associate with these proteins. PPFBM represents each protein by the other molecules that associate with it in the abstracts referenced in the protein’s entries in reliable biological databases. It automatically extracts each co-occurrence of a protein-molecule pair that represents semantic relationship between the pair. Towards this, we present novel semantic rules that identify the semantic relationship between each co-occurrence of a protein-molecule pair using the syntactic structures of sentences and linguistics theories. PPFBM determines the functions of an un-annotated protein p as follows. First, it determines the set Sr of annotated proteins that is semantically similar to p by matching the molecules representing p and the annotated proteins. Then, it assigns p the functional category FC if the significance of the frequency of occurrences of Sr in abstracts associated with proteins annotated with FC is statistically significantly different than the significance of the frequency of occurrences of Sr in abstracts associated with proteins annotated with all other functional categories. We evaluated the quality of PPFBM by comparing it experimentally with two other systems. Results showed marked improvement. Conclusions The experimental results demonstrated that PPFBM outperforms other systems that predict protein function from the textual information found within biomedical abstracts. This is because these system do not consider the semantic relationships between terms in a sentence (i.e., they consider only the structural relationships between the terms). PPFBM’s performance over these system increases steadily as the number of training protein increases. That is, PPFBM’s prediction performance becomes more accurate constantly, as the size of training proteins gets larger. This is because every time a new set of test proteins is added to the current set of training proteins. A demo of PPFBM that annotates each input Yeast protein (SGD (Saccharomyces Genome Database). Available at: http://www.yeastgenome.org/download-data/curation) with the functions of Gene Ontology terms is available at: (see Appendix for more details about the demo)http://ecesrvr.kustar.ac.ae:8080/PPFBM/.

Constituency parsers performs syntactic analysis in a tree representation of the constituents constituting a sentence and the hierarchy that governs the associations among the constituents. These parsers analyze the structural relationships among constituents in each raw of input corpuses. In constituency parsing, lexical semantics analyze the meaning in the granularity of words, stems, suffixes, and prefixes [20]. Dependency parsers extract textual relations from a text by determining the grammatical relations between words in a sentence. Despite the success of most constituency and dependency parsers, they may extract textual relations with low precision. This is because they do not consider the semantic relationships between terms in a sentence (i.e., they consider only the structural relationships between the terms). Moreover, they may not be well suited for complex sentences and for long-distance textual relations.
A number of systems and approaches that employ NLP parsers have been proposed to parse biomedical texts to infer useful information such protein function and protein-protein interactions. The following is a survey of some of these popular systems. In GOstruct [21,22], a protein p is annotated with functional category of a Gene Ontology (GO) term t, if p and t co-occur frequently in close proximity in PubMed abstracts. The abstracts were fed into a NLP pipeline, where abstracts are split into sentences, protein names are identified using BioNLP UIMA resources [23]. Text-KNN [24] represents a protein by the characteristic terms (i.e., GO terms) found within the biomedical abstracts associated with it. It annotates an un-annotated protein p with the functional categories of proteins represented by characteristic terms similar to p, using a k-nearest neighbor classifier. GOSTRUCT [25,26] presents a system that aims to identify semantic associations among residues and proteins, using dependency graphs. GOSTRUCT can predict protein function from the protein sites mentioned in biomedical abstracts. It categorizes protein sites based on their protein structures determined by the amino acid residues found in biomedical abstracts.
We propose in this paper an information extraction system called PPFBM (Predicating Proteins Functions from their Binding to other Molecules). PPFBM overcomes the limitations of most current constituency and dependency parsers outlined above as follows. It employs novel NLP dependency parsing and information extraction techniques that identify the semantic relationship between each pair of terms in a sentence using novel semantic rules. Moreover, it applies novel model and linguistic computational techniques for extracting the semantic relationship from different structural forms of terms in the sentences of biological texts. That is, PPFBM aims at enhancing the state of the art of biological text mining.
PPFBM analyzes biomedical texts in order to discover protein function information that is difficult to retrieve. Knowledge of protein function is crucial to the identification of gene-disease associations, cellular pathways, and drug design [4,24,[27][28][29][30][31][32][33][34]. Towards this, PPFBM represents each protein by the other molecules associated with it and are found within the biomedical abstracts associated with the protein. This is because the other molecules associate with a protein are highly predictive of the potential functions of the protein [35]. That is, these molecules that strongly associate with a protein are good characteristics and indicators of the functions of the protein. All proteins bind to other molecules and these bindings determine the biological properties of the proteins such as their functions [27].
Not all the co-occurrences of a protein's name and a molecule's name in sentences can be considered as indicative of the association between the protein and the molecule. Therefore, PPFBM automatically extracts from biomedical abstracts each co-occurrence of a protein-molecule pair that represents semantic relationship between the pair. Towards this, we present novel association discovery techniques (i.e., semantic rules) that identify the semantic relationship between each co-occurrence of a proteinmolecule pair using the syntactic structures of sentences and linguistics theories. After extracting the set of molecules, whose occurrences in abstracts represent semantic relationships with a protein, PPFBM selects the subset that is dominant and highly predictive of the protein's functions. It then represents the protein with the selected subset of dominant molecules in the form of textual features.
PPFBM determines the functions of un-annotated protein p as follows. First, it determines the set S r of annotated proteins that is semantically similar to p by matching the dominant molecules representing p and the dominant molecules representing the annotated proteins. Then, it determines the relative significance of the frequency of occurrences of set S r in each abstract associated with a protein annotated with a functional category. Let S FC be the significance of the frequency of occurrences of set S r in biomedical abstracts associated with proteins annotated with the functional category FC. Let S ′ FC be the significance of the frequency of occurrences of set S r in biomedical abstracts associated with proteins annotated with all other functional categories. PPFBM will assign the un-annotated protein p the functional category FC, if S FC is statistically significantly different than S ′ FC . PPFBM locates and identifies the associations that describe semantic relationships between a protein and a molecule co-occurrences using novel dependency parsing and information extraction techniques. These techniques rely, in part, on empirically determined syntactic structures of sentences and linguistics theories. We present semantic search and information retrieval mechanisms to efficiently explore the associations that exist between proteinmolecule pairs in the large amount of biomedical literature associated with proteins.
A demo of PPFBM that annotates each input Yeast protein [36] with the functions of Gene Ontology terms is available at: (see Appendix for details) http://ecesrvr. kustar.ac.ae:8080/PPFBM/

Methods
Representing a protein by a vector of weights Extracting the molecules that associate with annotated training proteins from biological abstracts We select a set of annotated proteins from a reliable biological database such as UniProtKB/Swiss-Prot [28]. The selected set will be used as a training protein dataset for PPFBM. The entry of each training protein in the biological database should have at least one reference to a PubMed abstract. We then retrieve the PubMed abstracts associated with the training proteins and referenced in the entry of the biological database. PPFBM extracts from these abstracts the molecules that associate with each of the selected training proteins. It automatically extracts from the retrieved abstracts each co-occurrence of a pair of protein and molecule that represents semantic relationship between the pair. These molecules will be used as text features to represent the training proteins. Our objective is to represent the training proteins using molecules that are highly predictive of their potential functions [24].
PPFBM is built on top of both ABNER Biomedical Named Entity Recognizer [37,38] and ChEBI (Chemical Entities of Biological Interest) ontology [39]. ChEBI is a manually curated database and ontology that organizes small molecule knowledge [39]. PPFBM access a single ChEBI ontology file to determine ChEBI identifiers/terms. A list of ChEBI identifiers corresponds to small molecules at the leaf level of the ChEBI structural hierarchy. Then, ABNER is used for the identification of relevant named entities in biomedical texts that correspond to the ChEBI terms. Molecules are classified into five classes, RNA, protein, DNA, cell-type, and cell-line. The Co-reference Resolution connects occurrences of same proteins. Some of these occurrences are represented by terms such as "this protein", "it", "they", etc. Also, lexical peculiarities in protein names (such as symbols and numbers) are identified. PPFBM employs a tokenizer and stemmer to align the sequence of words in a sentence and the names of molecules. A molecule's stemmed words are aligned against abstracts. Finally, PPFBM performs a domain analysis to identify the related entities as well as the nature of their relationships.
Representing an annotated training protein by the other molecules that associate with it Each protein p is represented by a vector of weights. That is, we view a protein p as a vector with one component corresponding to a molecule m i that associate with p, together with a weight w (m i , p) on this component in the set of abstracts associated with p. The w(m i , p) represents the statistical significance of the co-occurrences of m i and p based on their semantic relationships in the set of abstracts of PubMed associated with p. That is, w(m i , p) quantifies the likelihood of the association between m i and p based on of their semantic relationship occurrences in the set of abstracts of PubMed associated with p. The co-occurrence of a molecule m i and a protein p in a same sentence may not be necessary an indicative of the association between m i and p. Therefore, the weight of the association between m i and p pair relies, in part, on whether the co-occurrences of the pair are semantically related. That is, the weight w A j m i ; p ð Þ is based, in part, on whether the co-occurrences of the pair in abstract A j are semantically related. A molecule that does not occur in abstracts, its weight is zero. Let w A j m i ; p ð Þ be the weight of the co-occurrences of m i and p based on their semantic relationships in an abstract A j .
The weight w A j m i ; p ð Þis calculated as shown in eq. 1.
As shown in Table 1, let: (1) o 11 and o 12 be the observed frequencies of the co-occurrences of semantically related m i and p pair in abstract A j , (2) o 21 and o 22 be the observed frequencies of the co-occurrences of semantically unrelated m i and p pair in abstract A j , (3) e 11 and e 12 be the theoretical frequencies of the co-occurrences of semantically related m i and p pair in abstract A j , and (4) e 21 and e 22 be the theoretical frequencies of the co-occurrences of semantically unrelated m i and p pair in abstract A j . The operands T A j m i ; p ð Þand T ′ A j m i ; p ð Þin Eq. 1 are calculated as follows: Þis computed by normalizing the sum of the squared deviations of the observed frequencies o 11 and o 12 from the theoretical frequencies e 11 and e 12 in an abstract A j , where m i and p may or may not co-occur in the same sentence. Thus, T A j m i ; p ð Þ is computed as follows: If m i occurs in a different sentence than p, m i and p can be semantically related, if the two sentences are connected by a sentence connector (such as "moreover", "however", "otherwise", "therefore", etc.). In this case, the two sentences are represented by one common Part Of Sentence Tree [40] with one root node.
Þis computed by normalizing the sum of the squared deviations of the observed frequencies o 21 and o 22 from the theoretical frequencies e 11 and e 12 in an abstract A j , where m i and p may or may not co-occur in the same sentence. Thus, T ′ A j m i ; p ð Þis computed as follows: . N: overall observed frequencies.
Each value of o xy in Table 1 is computed using Eq. where: to consider the order of appearance of m i and p pairs in an abstract. This is important because, intuitively, the first appearances of the pair in an abstract should contribute more to the value of f m i ; p than the subsequent appearances of the pair. In this equation, the first appearance of the pair in an abstract contributes much more than the remaining appearances. The constant 0 < c < 1 controls the balance between the initial and subsequent appearances of the pair. This method is preferred for use, if the abstracts are known to be associated with protein p (e.g., they are referenced in the protein's entries in biological databases). This is because: (1) such abstracts usually contain occurrences of different molecules that associate with p, and (2) the molecules that appear first in the abstracts are usually more important to p (therefore, they should contribute more to the value of f m i ; p ). This method may not be as effective for randomly selected abstracts, because some of the molecules that occur in these abstracts may not even associate/bind to p; therefore, ranking molecules based on their order of appearances in these abstracts is useless.

Note 2:
We use f m i ; p ¼ 1 þ log n m i ; p , if we need to: (1) give more diminishing returns as the co-occurrence frequency of m i and p pair increases, and (2) have the cooccurrence of m i and p pair to be very frequent in order for the frequency contribution value to be greater than four. The logarithm used in the formula gives diminishing returns as molecule frequencies increase. This method is preferred for use, if some of the abstracts are known to be associated with p, while the other ones are not. Intuitively, the frequencies of the molecules that associate with p in the former abstracts are much higher than the later ones. This may cause the contribution to the value f m i ; p of the molecules in the later abstracts to be negligible. This method corrects this problem by giving diminishing returns as molecule frequencies in the former abstracts increase.

Note 3:
We use f m i ; p ¼ 1 A m i ; p , if we need to consider: (1) f m i ; p as a local measure of the co-occurrences of m i and p pair, and (2) a rank is a measure of importance. In this case, f m i ; p is a global measure, invertly proportional to the number of abstracts containing the pair in the whole database. This method is preferred for use, if it is expected that the frequencies of molecules are sparsely distributed in the different abstracts (i.e., molecule frequencies are not dense in only some of the abstracts).
, if we need to prevent a co-occurrence of m i and p pair for which A m i ; p ¼ 1 from being regarded as twice as important as another pair for which A m i ; p = 2. The logarithm included in the formula prevents a molecule for which A m i ; p = l from being regarded as twice as important as a molecule for which A m i ; p = 2. This method is preferred for use, if the abstracts have the same size or close sizes.
need to consider only the abstracts that contain cooccurrences of m i and p pair for computing the value of f A m i ; p (i.e., if we want to disregard abstracts that do not contain co-occurrences of the pair).
Example 1: In this example, we describe how w A j m i ; p ð Þin Eq. 2 is computed for protein PA1535. We selected the abstract of the paper Förster et al. [41] as A j (Förster et al. is one of the 12 papers associated with protein PA1535). We describe how the weight of associations between molecules and protein PA1535 are computed based on their semantic relationships in the abstract of Förster et al. That is, we describe how w A Fo :: rster et al:2008 m i ; PA1535 ð Þis computed. The abstract of Förster et al. [41] is shown below: "The atuRABCDEFGH gene cluster is essential for acyclic terpene utilization (Atu) in Pseudomonas aeruginosa.

Running Example:
We illustrate some of the concepts presented in this paper using a running example pertaining to protein PA1535. We illustrate in the running example how the molecules associated with PA1535 can be used as a vector of weights to represent the protein. In Example 1, we present the abstract of Förster et al. [41] and describe how the weight of the cooccurrences of each molecule and protein PA1535 is computed based on their semantic relationships in the abstract. In Example 2, we illustrate how the weights of associations between 10 molecules and protein PA1535 are computed based on their co-occurrences in 12 abstracts associated with protein PA1535. We retrieved the 12 PubMed abstracts associated with protein PA1535 and referenced in the entry of UniProtKB/Swiss-Prot [28]. In Example 3, we illustrate how the beats/looses scores and normalized weights of the 10 molecules that associate with PA1535 are computed based on their co-occurrences in the 12 Abstracts.
The biochemical functions of most Atu proteins have not been experimentally verified; exceptions are AtuC/AtuF, which constitute the two subunits of geranyl-CoA carboxylase, the key enzyme of the Atu pathway. In this study we investigated the biochemical function of AtuD and of the PA1535 gene product, a protein related to AtuD in amino acid sequence. 2D gel electrophoresis showed that AtuD and the PA1535 protein were specifically expressed in cells grown on acyclic terpenes but were absent in isovalerate-or succinate-grown cells. Mutant analysis indicated that AtuD but not the product of PA1535 is essential for acyclic terpene utilization. AtuD and PA1535 gene product were expressed in recombinant Escherichia coli and purified to homogeneity. Purified AtuD showed citronellyl-CoA dehydrogenase activity and high affinity to citronellyl-CoA. AtuD was inactive with octanoyl-CoA, 5-methylhex-4-enoyl-CoA or isovaleryl-CoA. Purified PA1535 gene product revealed high citronellyl-CoA dehydrogenase activity but had significantly lower affinity than AtuD to citronellyl-CoA. Purified PA1535 protein additionally utilized octanoyl-CoA as substrate. To our knowledge AtuD is the first acyl-CoA dehydrogenase with a documented substrate specificity for terpenoid molecule structure and is essential for a functional Atu pathway. Potential other terpenoid-CoA dehydrogenases were found in the genomes of Pseudomonas citronellolis, Marinobacter aquaeolei and Hahella chejuensis but were absent in non-acyclic terpene-utilizing bacteria". ð Þis computed using Eq. 2, where n m i ; p is the number of cooccurrences of m i and PA1535 pairs in the abstract of Förster et al. [41].
Example 2: Table 3 shows the weight of associations between 10 molecules and protein PA1535 based on their co-occurrences in 12 abstracts associated with the protein. Each cell in the table shows the weight of cooccurrences of m i and p based on their semantic relationships in abstract A j (i.e., w A j m i ; p ð Þ) Representing an annotated training protein by only the dominant molecules that associate with it A molecule could be uninformative, if it has only few occurrences in abstracts and/or is assigned a high weight even though it is found in abstracts associated with many other protein classes. Some of these abstracts may contain only a few occurrences of a molecule associated with many proteins annotated with different functional classes. Including uninformative molecules could lead to misclassifying proteins of small function classes into the larger classes and vice versa. To overcome this problem, we should refine the set of molecules representing a protein by excluding the uninformative molecules and keeping only the dominant ones (i.e., the ones that have frequent occurrences in abstracts that are not associated with many other protein classes). Towards this, we assign a score to each molecule m representing a protein p. The score reflects the dominance status of m relative to the other molecules representing p. First, we determine the pairwise beats and looses for each molecule contained in the abstracts associated with the protein p. Molecule m i beats molecule m j , if the number of times that the weights of m i (e.g., Table 3) is greater than that of m j in abstracts. Then, each molecule m is assigned a score, which is the difference between the number of times that m beats the other molecules and the number of times it loses in the abstracts.
Definition 1 -A score of a molecule: Let m i > m j denote: the number of times that the weights of molecule m i is greater than that of m j in abstracts. Let S(m i , p) Table 2 The weight of associations between 10 molecules and protein PA1535 based on their co-occurrences in the abstract of Förster et al. [41] Molecule  denote the score of association between molecule m i and protein p. Given the dominance relation > on the set of molecules V p for protein p, the score S(m i , p) equals: The following are some of the characteristics of the above scoring approach: (1) the overall sum of molecules' scores is zero, and (2) the highest possible score is (n−1) and the lowest possible score is -(n−1), where n is the number of molecules. We also compute w m i ; p ð Þ, the normalized weight of association between molecule m i and protein p in abstracts. We compute w m i ; p ð Þ by summing the positive of the most negative score and each other score and then normalizing the resulting values. Consider for example Table 4. The most negative score is −9. The positive of the most negative score (i.e., + 9) is summed to each score, as follows: (9 + 6 = 15), (9 + 5 = 14), (9 + 8 = 17), (9 + 2 = 11), (9 + 3 = 12), (9 − 6 = 3), (9 − 5 = 4), (9 + 1 = 10), (9 − 9 = 0), and (9 − 1 = 8). Finally, the resulting values are normalized as shown in the last row in Table 4 (i.e., row w m i ; p ð Þ). Example 3: Table 4 show the same 10 molecules presented in Example 2 and Table 3 after calculating the scores of their associations with protein p in the 12 abstracts. The Table illustrates how the score S(m i , p) and normalized weight w m i ; p ð Þ of the associations between the 10 molecules and protein PA1535 are calculated based on the weights shown in Table 3. Consider for example Table 3. AtuD beat citronellyl-CoA in six abstracts, citronellyl-CoA beat AtuD in four abstracts, and the two Table 3 The weight of associations between 10 molecules and protein PA1535 based on their co-occurrences in 12 abstracts molecule AtuD citronellyl-CoA octanoyl-CoA terpenoid-CoA isovaleryl-CoA Docosenoyl-CoA OPC4-CoA Sirodesmin H OPC8-CoA 3-dipole  Table 4 Beats/looses scores and normalized weights of the 10 molecules that associate with protein PA1535 based on their co-occurrences in 12 abstracts, calculated based on their weights shown in Table 3 AtuD The Symbol "+" denotes that molecule m i (column) Beats molecule m j (row) in the Abstracts, while "-" denotes that m i Lost. "0" denotes that m i and m j have the same Number of Beats and Looses. S(m i , p) and w m i ; p ð Þdenote the Score and Normalized Weight, respectively, of Molecule m i in The 12 Abstracts. An Entry is based on Column-Row Order molecules have the same weight in two abstracts. Therefore, the symbol "-"is placed in the entry (citronellyl-CoA, AtuD) of Table 4 to denote that citronellyl-CoA lost to AtuD (an entry is based on column-row order).
Then, the molecules are ordered by their normalized weights. The molecules with the most normalized weights are considered the dominant molecules for the protein p. The remaining molecules will be considered uninformative and will be excluded from the inclusion within the set of molecules representing p. Thus, protein p will be represented by only the dominant molecules as described above. That is, each protein is represented by only the dominant molecules associated with it. From the set V p of molecules associated with p, the subset Ṽ p ⊂ V p is considered the dominant ones for p, if every molecules ∈ Ṽ p satisfies the following: (1) It dominates every molecule m′ ∈ V p , m′ ∉ Ṽ p , (i.e., the normalized weight of m is greater than the normalized weight of each m′). (2) It acquires a normalized weight w (m, p) greater than a threshold β. β is a value lower than the mean normalized weight by the standard error of the normalized mean.
Definition 2 -Dominant molecule: Let V p be the set of molecules for a protein p. Let w m i ; p ð Þbe the normalized weight of a molecule m i ∈ V p associated with p. The subset Ṽ p ⊂ V p of the dominant molecules for p with the maximal weights is given by: , for all m j ∈ V p , and w m i ; p ð Þ> β} We model protein p as a vector Ṽ p , with one component corresponding to a molecule m i , together with w (m i , p) on this component. Thus, where m i is a dominant molecule in the set of abstracts associated with protein p.
Determining whether an annotated protein and a molecule are semantically related in a sentence The co-occurrence of a molecule m i and a protein p in the same sentence may not be an indicative of the association between m i and p. Therefore, the weight of the association between m i and p relies, in part, on whether the cooccurrences of the pair are semantically related. For example, the weights w A j m i ; p ð Þin Table 3 are based, in part, on whether the co-occurrences of the 10 molecules and protein PA1535 in the 12 abstracts are semantically related. In this section, we propose semantic rules that determine whether a co-occurrence of a molecule and a protein in a sentence is semantically related. In each of the next subsections, we propose semantic rules based on linguistics theories and the syntactic structures of sentences.
In each of the next two subsections, we illustrate our proposed rules using sentences extracted from biomedical literature.
In these examples, we show how the semantic relationships between molecules/proteins can be determined using our proposed rules. We divide each sentence into simple sentences using dependency grammar. Each simple sentence is an independent clause, which contains a subject and a predicate. We place each independent clause inside a rectangle for easy reference. In each example, the words that comprise a sentence are tagged as follows: (N) for noun, (V) for verb, (PREP) for preposition, and (PRON) for pronoun.

Sentences containing pronouns defining antecedents
According to linguistics, an antecedent noun is usually related to the subsequent noun(s), if the subsequent noun(s) is connected to the antecedent by a pronoun (such as "which", "who", "it", "whom", and "that") [42]. We propose our first semantic rules based on this linguistic observation, as follows: 1. An antecedent noun is semantically related to a subsequent noun(s), if the two nouns are connected by a pronoun. Towards this, PPFBM replaces each pronoun with the closest noun found under the predecessor independent clause. This conforms to grammar and linguistics, which treat a pronoun as a word that can be substituted by a noun or noun phrase. In Examples 4-8, we strikethrough each pronoun and replace it with the closest noun found under the predecessor independent clause. 2. An explicit or implicit pronoun preceded by a conjunction (i.e., "and" and "or") refers to the subject of closest predecessor independent clause. In Examples 4-8, we strikethrough each pronoun preceded by a conjunction and replace it with the subject of closest predecessor independent clause. In the case of an implicit pronoun preceded by a conjunction, we also replace it with the subject of closest predecessor independent clause.
For the sake of clarification, we perform the following in Examples 4-8: 1. We type the subject of the first independent clause using a different font. 2. We type each noun that replaces a pronoun: (1) in italics, (2) in a different font, and (3) place quotation marks around it. The replacement noun plays the role of the subject of the independent clause that comes immediately after the pronoun.
In Examples 4-8, we demonstrate how these semantic rules conform to the linguistics theory stated above. We determine the semantic relationships between each pair of molecules/proteins. Recall that all nouns (including the replacement nouns) within an independent clause are semantically related.
Example 4: Consider the following sentence: "Coenzymes are the organic molecules Citronellyl-CoA and OPC4-CoA that bind to the active site of the GGPS1 protein". The following is the syntactic structure of the sentence in terms of its constituents of independent clauses.
The pronoun "that" is replaced by the closest noun(s) found under the predecessor independent clause (i.e., the nouns "Citronellyl-CoA" and "OPC4-CoA"), which become the subject nouns of the second independent clause. Therefore, the nouns "Citronellyl-CoA" and "OPC4-CoA" are semantically related to "GGPS1 protein".
Example 5: Consider the sentence: "It is cleaved to release 53 amino-acid molecule, which binds to the protein ADIPOR1 and interacts with the protein BMPR1A". The following is the syntactic structure of the sentence in terms of its constituents of independent clauses.
In the second independent clause, the pronoun "which" is replaced by the closest noun under the predecessor independent clause (i.e., the noun "53 amino-acid molecule"), which becomes the subject of the second independent clause. Therefore, "53 amino-acid molecule" and "protein ADIPOR1" are semantically related. In the third independent clause, the implicit pronoun that follows the conjunction "and" is replaced by the subject noun of the closest predecessor independent clause (i.e., the noun "53 aminoacid molecule"), which becomes the subject of the third independent clause. Therefore, the nouns "53 amino-acid molecule" and "protein BMPR1A" are semantically related.
Example 6: Consider the following sentence: "Protein MshD acetyltransferase is composed of two GNAT domains, and it binds molecule AcCoA". The following is the syntactic structure of the sentence in terms of its constituents of independent clauses.
Since the pronoun "it" follows the conjunction "and", it is replaced by the subject noun of the closest predecessor independent clause (i.e., the noun "Protein MshD acetyltransferase"), which becomes the subject of the second independent clause. Therefore, "Protein MshD acetyltransferase" and "molecule of AcCoA" are semantically related.
Example 7: Consider the following sentence: "Molecule acetyl CoA is a purified recombinant and it catalyzes the hydration of the yeast protein mak3". The following is the syntactic structure of the sentence in terms of its constituents of independent clauses.
Since the pronoun "it" follows the conjunction "and", it is replaced by the subject noun of the closest predecessor independent clause (i.e., the noun "acetyl CoA"), which becomes the subject of the second independent clause. Therefore, molecule "acetyl CoA" and yeast protein "mak3" are semantically related.
Example 8: Consider the following sentence: "Fkh2p binds cooperatively with Mcm1p, which interacts with the Sid2p, which interacts with Blt1p and binds to mob1p". The following is the syntactic structure of the sentence in terms of its constituents of independent clauses.
The subject protein "Fkh2p" is semantically related to the molecule protein "Mcm1p". In the second independent clause, the pronoun "which" is replaced by the closest noun under the predecessor independent clause (i.e., the noun "Mcm1p"), which becomes the subject of the second independent clause. Therefore, the molecule proteins "Mcm1p" and "Sid2p" are semantically related. In the third independent clause, the pronoun "which" is replaced by the closest noun under the predecessor independent clause (i.e., the noun "Sid2p"), which becomes the subject of the third independent clause. Therefore, the molecule proteins "Sid2p" and "Blt1" are semantically related. In the fourth independent clause, the implicit pronoun that follows the conjunction "and" is replaced by the subject noun of the closest predecessor independent clause (i.e., the noun "Sid2p"), which becomes the subject of the fourth independent clause. Therefore, the molecule proteins "Sid2p" and "mob1p" are semantically related.

Sentences containing preposition modifiers
Our second proposed semantic rules are based on the following linguistics observations [43,44]: (1) two independent clauses connected by a preposition modifier (such as "but", "while", and "whereas") are usually unrelated, and (2) all nouns within an independent clause are usually related. The following are our proposed rules, which are based on the above observations: 1. The co-occurrence of a molecule and a protein pair in a sentence is considered semantically unrelated, if the two terms occur in two different independent clauses connected by a preposition modifier. This is because the two terms do not have dependency relationship in this case. 2. The co-occurrence of a molecule and a protein pair within an independent clause (i.e., inside a rectangle in our examples) is considered semantically related.
In Examples 9-11, we demonstrate how these semantic rules conform to the linguistics theory stated previously. In these sentences, we determine the semantic relationship between each pair of molecules/proteins in the sentences.
Example 9: Consider the sentence: "Citronellyl-CoA and OPC4-CoA participate in the catalysis of GGPS1 but OPC8-CoA is a substrate of the reaction of OPCL1". Below is the syntactic structure of the sentence in terms of its constituents of independent clauses: In the first independent clause, the organic molecules "Citronellyl-CoA" and "OPC4-CoA" are semantically related to the protein "GGPS1". In the second independent clause, the molecule "OPC8-CoA" is semantically related to the protein "OPCL1". However, each of "Citronellyl-CoA", "OPC4-CoA", and 'GGPS1" is unrelated to each of "OPC8-CoA" and "OPCL1", because they belong to two different independent clauses connected by the preposition modifier "but".
Example 10: Consider the following sentence: "The sequence of MshD is twice the length of GNAT and it binds CoASH and HSCoA, whereas ARL1 binds SCOCO and Golgin-245". The following is the syntactic structure of the sentence in terms of its constituents of independent clauses.
Since the pronoun "it" follows the conjunction "and", it is replaced by the subject noun of the closest predecessor independent clause (i.e., the noun protein "MshD"), which becomes the subject of the second independent clause. Therefore, the protein "MshD" is semantically related to the molecules "CoASH" and "HSCoA". In the third independent clause, the protein "ARL1" is semantically related to the molecules "SCOCO" and "Golgin-245". However, each of "MshD", "CoASH" and "HSCoA" is unrelated to each of "ARL1", "SCOCO" and "Golgin-245", because the first and second sets of nouns belong to two different independent clauses connected by the preposition modifier "whereas".
Determining the functions of an Un-annotated protein Determining the semantic similarity between an Unannotated protein and the Set of training proteins Each annotated training protein p is represented by a vector Ṽ p of the dominant molecules associated with p in biomedical abstracts. LetṼ˜p′ be the vector of weights representing an un-annotated protein p′. Each component inṼ˜p′ corresponds to a molecule m i that associate with p′, together with a weight w(m i , p′) on this component. w(m i , p′) is determined from the reference works that describe the un-annotated protein p′ and is computed using the same techniques described in previously. Let sim ( p′, p) be the semantic similarity of p′ and an annotated training protein p, computed based on the similarity ofṼ˜p′ and Ṽ p . PPFBM employs the cosinebased semantic similarity measure shown in Eq. 5 for measuring sim (p, p′). After measuring the semantic similarity of p′ and each annotated training protein p, we determine the set S r of annotated training proteins that is semantically similar to p′.
▪ w m i ; p ð Þ: Normalized weight of the semantic relationship association between a molecule m i and an annotated protein p in abstracts associated with p. ▪ w m i ; p ′ À Á : Weight of the semantic relationship associations between a molecule m i and the un-annotated protein p′ in the reference works that describe p′. ▪ Ṽ p : Set of the dominant molecules that have semantic relationship associations with p in biomedical abstracts. ▪Ṽ p ′ : Set of the molecules that have semantic relationship associations with p′ in the reference works describing p′. ▪Ṽ p ∩Ṽ p ′ : Set of the molecules representing both P and p′. ▪ w m i ; p ð Þ: Mean weight of the common molecules representing both p and p′ in the vector representing p, where: Mean weight of the common molecules representing both p and p′ in the vector representing p′, where: Determining the functional category of an Un-annotated protein As described previously, we determine the set S r of annotated training proteins that is semantically similar to the un-annotated protein p′ using Eq. 5. Let S FC be the significance of the frequency of occurrences of set S r in PubMed abstracts associated with proteins annotated with the functional category FC. Let S ′ FC be the significance of the frequency of occurrences of set S r in PubMed abstracts associated with proteins annotated with all other functional categories. The un-annotated protein p′ will be annotated with the functional category FC, if S FC is statistically significantly different than S FC ′ . An abstract is determined to be associated with a protein, if it is referenced in the protein's entry in a reliable biological database such as UniProtKB/Swiss-Prot [28].
PPFBM employs Z-score for determining the significance of the frequency of occurrences of set S r in PubMed abstracts. That is, Z-score is used for determining the significance of the frequency of occurrences of each protein p ∈ S r in each set of PubMed abstracts associated with proteins annotated with the same functional category. The Z-score for a protein p ∈ S r in a set of PubMed abstracts associated with proteins annotated with a functional category FC, is the distance between the raw score for p and the population mean, as shown in Eq. 6:

Results and discussion
We implemented PPFBM in Java, run on Intel(R) Core(TM) i5-4200U processor, with a CPU of 2.30 GHz and RAM of 4 GB, under Windows 8. A demo of PPFBM that annotates each input Yeast protein [36] with the functions of Gene Ontology terms is available at: (see Appendix for more details about the demo) http://ecesrvr.kustar.ac.ae:8080/PPFBM/.
We experimentally evaluated the quality of PPFBM for predicting the functions of proteins by comparing it with GOstruct [21,22] and Text-KNN [24]. The following are brief overviews of the two systems: GOstruct [21,22]: In the framework of GOstruct, a protein p is annotated with the functional category of a Gene Ontology (GO) term t, if p and concepts associated with t co-occur frequently in close proximity in PubMed abstracts. We re-implemented the framework of GOstruct exactly as described in [21,22]. We also contacted some of the coauthors of the two papers to ensure accurate re-implementation of GOstruct. The following is a brief description of the methodology and tools used in the re-implementation. Abstracts are fed into a NLP pipeline, where they are split into sentences, and the co-mentions in these sentences are identified using BioNLP Apache Unstructured Information Management Architecture (UIMA) version 2.4 [23,45]. UIMA creates a pipeline to automatically extract co-mentions of a specific protein and concepts associated with GO terms found within the abstracts. The version of UIMA we used employs LingPipe sentence-detector version 3.9.3 [46] to fragment text it into sentences. LingPipe is trained using Colorado Richly Annotated Full Text (CRAFT) corpus. Tokenization is done using PennBio tokenizer version 0.5 [47], which is distributed with ConceptMapper version August 2008 [48]. Protein names in abstracts are identified by mapping protein mentions to Uni-Prot identifiers using a protein dictionary. GO terms and the concepts associated with them in abstracts are identified by looking up ConceptMapper dictionaries [49]. The co-mentions of a specific protein and concepts associated with GO terms are determined based on sentence spans. That is, co-mentions are mentions of a protein and concepts from the Gene Ontology that co-occur within a sentence. Each protein is represented by a vector. Each component of the vector represents the number of times that the protein co-occurs with concepts associated with a specific GO term. The GOstruct framework is available for download at: http://sourceforge.net/projects/strut/files/ Text-KNN [24]: It represents a protein by the characteristic terms found within the biomedical abstracts associated with it. It annotates an unannotated protein p with the functional categories of proteins represented by characteristic terms similar to p, using a k-nearest neighbour classifier.
We evaluate and compare the prediction accuracy of the three systems by measuring their performance for predicting the functions of each protein P in the dataset using the standard Recall, Precision, and Fvalue metrics shown below: Compiling datasets for the evaluation CAFA challenge dataset We evaluated the systems using the Critical Assessment of Functional Annotation (CAFA) challenge dataset [24,50]. The goal of the CAFA challenge is to evaluate automated protein function prediction algorithms. We used for the evaluation CAFA 2 (2013-2014) dataset. CAFA 2 challenge consisted originally of 100,816 un-annotated proteins at the time of submission deadline on January 20, 2014. By the 17th of February 2015, 26,643 of these proteins have become experimentally annotated and validated. Therefore, we did not follow the exact CAFA set up. Each of the selected proteins has been associated with at least one PubMed abstract according to its entry in UniProtKB database. We used for the evaluation the 26,643 proteins and the 94,846 PubMed abstracts associated with them according to their entries in UniProtKB database.

Saccharomyces Genome Database (SGD)
We also evaluated the three systems using the complete 6086 Saccharomyces Genome Dataset (SGD) [36] as well as the 46,227 PubMed abstracts associated with the 6086 proteins according to their entries in UniProtKB database. SGD is a publicly available resource for the budding yeast Saccharomyces cerevisiae. SGD provides encyclopedic information about the yeast proteins, genome and its genes, and other encoded features. Experimental results on the functions and interactions of the yeast proteins are extracted by high-quality manual curation and are integrated within a well-developed database. This data is combined with high-throughput results. This combined collection of data is integrated with a variety of bioinformatics tools to help in experimental design and analysis and to allow discovery of new biological details. The SGD resource can be considered as a standard for functional description of budding yeast. It can also be considered as a platform from which to investigate related proteins and pathways. The SGD data is freely accessible to researchers and can be downloaded from [36].

Gene ontology dataset
We also evaluated the three systems using Gene Ontology (GO) dataset [51]. The dataset consists of GO terms and the proteins annotated to the functions of these GO terms. We selected a fragment of GO graph containing 70 GO terms from the biological process sub-ontology. We also selected a fragment of GO graph containing 30 GO terms from the molecular function sub-ontology. Table 5 shows the number of proteins selected for the evaluations from these two sub-ontologies (i.e., 62,386 proteins annotated to the functions of GO terms from the biological process sub-ontology and 16,576 proteins annotated to the functions of GO terms from the molecular function sub-ontology). We downloaded the 100 GO terms and the 78,962 proteins annotated to their functions from [51]. We retrieved 577,486 PubMed abstracts associated with the 78,962 proteins based on the entries of these proteins in UniProtKB/ Swiss-Prot database [28]. We selected for the experiments only the proteins that: (1) are associated with at least one PubMed abstract based on their entries in UniProtKB [28], and (2)  Evaluating the performance of the three systems for predicting protein functions through 5-fold cross validation We performed 5-fold cross-validation using the three datasets described previously. Each of the three datasets is divided into five partitions at random (i.e., 5 disjoint subsets). The systems are evaluated through five runs, where in each run a different partition of the dataset is used for testing while the other four partitions are used for training the systems. Each partition is one of the five disjoint subsets of proteins and the PubMed abstracts associated with these proteins. We considered the test proteins as un-annotated, and we measured the Recall, Precision, and F-value of the systems for predicting the functions of these test proteins. As shown in Eq. 6, Z-score is used for determining the significance of occurrence frequency of a test protein in each set of PubMed abstracts associated with training proteins annotated with the same functional category. In the experiments, we considered a frequency of occurrences significant, if its Z-Score is above the threshold "-1.96" standard deviation. The results are shown as follows: Figure 1 show the results of the CAFA dataset [24,50] described previously. That is, Fig. 1 show the results of the experiments using the 26,643 proteins and 94,846 PubMed abstracts associated with them according to their entries in UniProtKB database. The following are the number of correct predictions made by each system: PPFBM : 15,187,GOstruct: 12,789, and Text-KNN: 8261. Figure 2 show the results of the complete Saccharomyces Genome Dataset (SGD) described previously. That is, Fig. 2 show the results of the experiments using the 6086 Yeast proteins and 46,227 PubMed abstracts associated with them according to their entries in UniProtKB database. Table 6 shows a sample of the 6086 proteins and their Biological Process annotations identified by PPFBM. The last column in the Table shows the missing annotations identified by PPFBM. We discovered that 63 % of the proteins have missing annotations based on their published annotations in GO website [51] and UniProtKB/Swiss-Prot database [28]. The following are the number of correct predictions made by each system: PPFBM: 3955, GOstruct: 3,226, and Text-KNN: 2252. Figure 3 show the results of the Gene Ontology (GO) dataset described previously. That is, Fig. 3 show the results of the experiments using the 78,962 proteins and the 577,486 PubMed abstracts associated with them according to their entries in UniProtKB. The following are the number of correct predictions made by each system: PPFBM: 41,060, GOstruct: 33,953, and Text-KNN: 17,372.
As shown in Fig. 4, we also evaluated the three systems using CAFA protein-centric metrics. We followed CAFA [24,50] procedure for plotting precision-recall curve according to a sliding threshold scheme. Only predictions with confidence scores higher than threshold values t (0 < = t < = 1) are selected for the evaluation. We used thresholds distributed evenly in the range [0, 1] at step size 0.01. At each threshold, we calculated the precision and recall for each protein and also the average precision and recall on all the protein dataset. At each threshold t, the Recall rc i (t) and Precision pr i (t) for each protein i are calculated as shown in Eqs. 7 and 8: where: (1) Ti is the set of functional categories that is experimentally determined for protein i, (2) Pi(t) is the set of functional categories predicted by a system for protein i with score greater than or equal to t, (3) f is a functional term in the ontology, and (4) I(·) is the standard indicator function. The overall Recall and Precision for protein i at threshold t are calculated as shown in Eqs 9 and 10.
where m(t) is the number of proteins that have at least one prediction above t and n is the number of all proteins in the dataset. Figure 4 show the results. We also measured the Recall, Precision, and F-value of the systems for predicting the function of each GO term. For each GO term t, we randomly selected a set of training proteins and a set of testing proteins annotated with the function of t. We evaluated the accuracy of the systems for predicting the function of t. The results are shown as follows. Figure 5 shows the accuracy of predicting the functions of each set of GO terms located at the same average depth (level) in the Biological Process ontology. Figure 6 shows the accuracy of predicting the functions of each set of GO terms located at the same depth (level) in the Molecular Function ontology. Tables 7 and 8, show the depth (level) of each GO term in GO Graph and the accuracy of predicting the function of this term.
Evaluating the performance of the three systems for predicting protein functions through cumulativevalidation In this test, we perform ten runs using the GO dataset described previously. The number of training proteins accumulates successively in each run. In each run, 1330 test proteins (i.e., 1000 test proteins from the Biological   Figures 7 and 8 show the performance of each system in each of the ten runs.

Discussion of the results
The impact of the Key concepts employed by PPFBM on prediction results As Figs. 1, 2, 3, 4, 5, 6, 7 and 8 show, PPFBM outperformed GOstruct and Text-KNN. We attribute the performance of PPFBM over the other two systems to the following factors: 1) The first factor is the employment of PPFBM to the concept of dominant molecules to represent proteins. This concept ensures that uninformative molecules are filtered and excluded from representing proteins. A molecule is considered uninformative if it has only few occurrences in abstracts and/or is assigned a high weight even though it is found in abstracts associated with many other protein classes. The poor performance of Text-KNN is attributed, mainly, to the fact that it does not employ a mechanism for filtering and excluding uninformative characteristic terms from representing proteins. 2) The second factor is the employment of PPFBM to the concept of semantic relationship between proteins and molecules in sentences. This concept ensures each cooccurrence of a molecule and protein pair in a sentence is disregarded, if the pair is unrelated grammatically (as described previously). That is, PPFBM considers the co-occurrence of a molecule and protein pair in a sentence as an indicative of their association only if the pair is semantically related. GOstruct and Text-KNN do not consider the concept of semantic relationship. For example, Text-KNN considers the occurrences of a term t in an abstract associated with a protein p 1 as indicative of the association between t and p 1 (if t passes the Z-Score threshold), while it overlooks the contexts in The table shows the average depth (level) of each GO term in the molecular function subontology and the accuracy of predicting the function of this term. R, P, and F DENOTE Recall, Precision, and F-value respectively which t occurs. The term t may be associated with a protein other than p 1 , even though it occurs within an abstract(s) associated with p 1 . Consider for example that t and a protein p 2 are semantically related based on their co-occurrences in the sentences of an abstract(s) associated with p 1 . In this case, t is likely to be associated with p 2 and it may not necessary be associated with p 1 Fig. 7 The Recall, Precision, and F-value for predicting GO Biological Process annotations using a successively accumulating set of training proteins even though it occurs in an abstract(s) associated with p 1. Thus, the occurrences of t in an abstract associated with p 1 may not always be an indicative of the association between t and p 1 . We cannot determine this without checking the contexts in which terms occur within sentences (e.g., checking the semantic relationships between terms in sentences).

The impact of the size of GO annotation terms on prediction results
We analysed the results of the experiments conducted using GO dataset. We observed from the results of predicting the functions of individual GO terms the following. As the number of training proteins annotated with the function of a term T gets larger, PPFBM tends to predict the function of T more accurately. This can be seen in the results shown in Tables 6 and 7. This is because, as the number of training proteins gets larger, PPFBM computes the beats/looses scores of molecules more accurately (recall Table 4). PPFBM may not predict the functions of very small classes accurately (classes with fewer than about 100 training proteins). On the other hand, GOstruct tends to predict more accurately the functions of GO terms annotating very small number of training proteins. This is attributed to the fact that GOstruct orders functions based on their influences and gives higher influences to functions with smaller number of proteins annotated with them. This is disadvantageous to GOstruct, since the size of training proteins gets larger over time as unannotated proteins are assigned functions. As for PPFBM, as the set of training proteins annotated with the function of a GO annotation term T gets larger, the set of dominant molecules representing T becomes more optimized and more accurate. This is because the larger the number of training proteins gets, the more accurate becomes the scores assigned to molecules based on their number of beats and looses (recall Table 4).
The impact of the size of training proteins on prediction results As Figs. 7 and 8 show, PPFBM's performance over the other two systems increases steadily as the number of training protein increases. That is, PPFBM's prediction performance becomes more accurate constantly, as the size of training proteins gets larger. This is because every time a new set of test proteins is added to the current set of training proteins, PPFBM optimizes its prediction performance as follows: 1) It updates and optimizes the set of dominant molecules representing each training protein p in the current set of training proteins. It does so by updating the beats/looses scores and normalized weights (recall Table 4) of the molecules associated with p based on the occurrences of these molecules in the abstracts associated with the test proteins that have recently been added to the current set of training proteins. 2) It optimizes the computation of the significance of occurrence frequency of the set S r of proteins that is semantically similar to an un-annotated protein in PubMed abstracts (as described previously). It does so by updating the number of PubMed abstracts associated with each functional category FC by adding the abstracts associated with the test proteins annotated to FC that have recently been added to the current set of training proteins. This improves the computation of Z-score (recall Eq. 6), which improves the prediction performance of PPFBM. As a result, the accuracy of predicting the functional category FC as the functional category of succeeding un-annotated proteins (e.g., in the coming runs) improves.
Thus, PPFBM's prediction performance improves over time as each previously un-annotated set of protein is assigned functional categories and is associated with abstracts in biomedical databases. As for GOstruct, and Text-KNN, the increment of the size of training proteins has no significant impact on their prediction performance.  [34] and their biological process annotations identified by PPFBM (Continued)