Consolidation of datasets
Both MP and GO are used in MGI, an open source resource that freely publishes its datasets. We collected all data used in this study from the MGI ftp site (ftp://ftp.informatics.jax.org/pub/reports/index.html, retrieved 10 June 2013). First, the set of literature that was annotated for both GO and MP in MGI was gathered and formatted; this set provided our training data set (3662 publications with both GO and MP annotations to the same allele). Then, the set of literature selected for both MP and GO, but annotated only for MP (63,028 publications) was collected from internal reports used by curators in their workflow; this set provided the test set for our derived inference rules.
Data alignment and processing
A base set of MP-GO pairs was generated by matching the set of all collected GO terms to the set of all collected MP terms used to annotate the same allele (we use the allele here as a proxy for full genotype) in the same study, knowledge of which is reflected by shared PubMedID (PMID) and alleleID. The GO terms were filtered to select only those terms from the biological process (BP) subontology, and only those GO annotations with evidence as “inferred from a mutant phenotype” (IMP). These selection criteria provide a defined set of papers restricted formally to those tagged for phenotype-based GO annotation for biological process, provided a first level of quality control on the dataset. There were found to be 81,245 MP-GO pairs sharing PMID/alleleID, and 67,424 unique MP-GO pairs, indicating that many of the MP-GO pairs occurred more than once. The MP, GO and the PMID/alleleID data were modeled as a network with PMID/alleleID nodes connecting MP and GO nodes derived from the same study.
All network visualizations were created using Cytoscape v2.8 [26] and all calculations were performed using Numerical Python (Python 2.7.3, NumPy 1.6.2, SciPy 0.10.1, Matplotlib 1.2) [27]. The network was modeled with an adjacency-like matrix. The term “adjacency matrix” is used loosely here, as the matrices presented in this work are quite different than those traditionally used—the matrices used here are rectangular matrices with the columns representing the annotated terms and rows representing PMID/alleleIDs. This approach is necessary since we wish to track individual studies for which both MP and GO annotations have been made. Two matrices were created, one with all unique PMID/alleleIDs (3662 rows) by all unique GO IDs (2472 columns); the other with all unique PMID/alleleIDs by all unique MP IDs (4978 columns). The network was found to be very sparse—that is, the number of all possible edges far exceeded the number of actual edges (graph density of approximately 6e-04).
Calculation and evaluation of statistical significance
After constructing our networks, we calculated the probability that a connection between an MP and GO node was statistically significant rather than due to chance alone. Some gene function-phenotype connections might be supported by many shared annotations, implying that there is some underlying connection between the gene function and the mutant phenotype, while others might be connected by only a few connections relative to the number other connections and yet also be informative.
From a set of unique integers of size N representing all the PMID/alleleID combinations, the probability that a selection of n at random will include j or more ‘successes’ – that is, a PMID/alleleID that is shared between a particular MP-GO node pair, can be defined as a modification of the cumulative binomial distribution (for values of N large compared to n):
(1)
In our case, N = 3662, and n, the maximum number of PMID/alleleID combinations between any MP-GO node pair, is 250, so the condition of equation 1 is satisfied. Applying this definition, the total probability, ptot, that a particular MPi-GOj association is due to chance is given by:
(2)
where PM(MPi) or PM(GOj) returns the set of all PMID/alleleID nodes that are connected to the MPi or GOj nodes, respectively, and:
is the set of data confirmed PMID/alleleID connections between the MPi and the GOj nodes. The p-value was calculated in this way for each MP-GO pair found in the data and sorted. The efficacy of the p-value was evaluated by using each MP-GO connection as a prediction, or ‘rule’: MPi → GOj —to wit: “if an annotation of a particular allele has been made to the MP term MPi by curation of a particular study then an IMP annotation corresponding to that allele can be made to the GO term GOj.” The resulting true and false positives (sensitivity and specificity) were plotted on the receiver operating characteristic (ROC) curve (Figure 1). As shown, the calculated p-value shows sufficient discriminatory power (AUC = 0.733) to discriminate MP-GO pairs that are more likely to correctly predict a GO node from a given MP term. An alternative scenario was also addressed, where possible MP terms were predicted from alleles with GO annotations (but no MP annotations); it was found that this prediction route does not carry as much predictive value (AUC = .686), further validating our approach to predict GO from MP annotations. We therefore used the calculated p-values as our measure of statistical significance for the GO function prediction from the simple rule.
Generalized patterns lead to extended rules
While the calculation of the p-value is useful in discovering which MP-GO pairs have stronger connections when run against the training sets, we found that even those with the lowest probabilities of being due to chance returned many false positives. Therefore, the need arose to either better identify the true positives of an MP-GO pair, or to better exclude the false positives: that is, to improve the discriminatory power of our rules.
Two types of generalized patterns arose from the network, which we chose to identify as the + (plus) rule and the - (minus) rule. The plus-rule can be qualitatively described as inspecting the connections of an MP-GO pair and examining if there is another MP node that is connected to the true positives of the pair, but excludes all of the false positives (PMID/alleleID nodes that are connected to MP, but not to GO) (Figure 2a). We can define the plus-rule: MP1 AND MP2 → GO1 —to wit: “if an annotation of a particular allele has been made to the MP term MP1 and to the MP term MP2 by curation of a single particular study then an IMP annotation corresponding to that allele can be made to the GO term GO1.” The conditional statement for finding an MP1 and MP2 that follow the pattern of the plus-rule is as follows:
(4)
The minus-rule takes the somewhat opposite approach from the plus-rule; that is, instead of discerning if the intersection of two MP nodes can define the successes, the minus-rule examines if the set difference of two MP nodes can define the successes (Figure 2b). Similarly, we can define the minus-rule: MP1 AND NOT MP2 → GO1 —to wit: “if an annotation of a particular allele has been made to the MP term MP1 but not to the MP term MP2 by curation of a particular study then an IMP annotation corresponding to that allele can be made to the GO term GO1.” The statement used to find an MP1 and MP2 that follow the pattern of the minus-rule is as follows:
(5)
The network was then searched for collections of MP and GO nodes that followed the aforementioned patterns, with 3105 instances of the plus-rule pattern and 234 instances of the minus-rule pattern. Detailed examples of both rules are illustrated in Figure 3. However, the need to differentiate between those rules that were more likely to give results to be due to chance versus those that were less likely was still present. Utilizing the demonstrated efficacy of the cumulative binomial distribution as applied to the undirected graph (equation 1), the p-values for the plus and minus-rules were calculated respectively:
(6)
(7)
(8)
(9)
where S in each case is the set of ‘successes’ and Sc is the set complement. As the network is sparse, the p-values are expected to be small. Arising from the basic plus and minus-rule patterns, three other rule patterns were defined, descriptively designated as plus-plus (++) (Figure 4a), minus-minus (--) (Figure 4b) and plus-minus (+-) (Figure 4c). The mathematical statements used to find rules from three additional rule patterns respectively are as follows:
(10)
(11)
(12)
The statements used to calculate the p-values for these rule patterns are expounded in the appendix. The network was again searched for arrangements of nodes that followed the three other patterns, with 3143, 328, and 634 instances of the plus-plus-rule, minus-minus-rule and plus-minus-rule respectively. While the composite plus- and minus-rules can be combined iteratively in theory, there seems to be a point where the statements used to calculate the probabilities for large combinations of nodes become too large to be feasibly evaluated, and where they have the potential of straying further from biological reality. Therefore, we limited our analysis to these five rules as examples. All rule instances are compiled in Additional file 1.
Evaluation of extended rules
All rules were applied to the literature set of PMID/alleleIDs that are annotated to MP term(s) but not yet annotated to a GO term to predict which GO term would be annotated to the PMID/alleleID (set may be found in Additional file 2). Validation of predictions was performed by the selection of a set of 20 papers for each rule that represented a range of the p-values. The papers were read by a GO scientific curator and the curatorial predictions for functional GO annotation were examined in the context of our rule structures. Companion software for the prediction of GO annotations is provided as Additional file 3.