Argument-predicate distance as a filter for enhancing precision in extracting predications on the genetic etiology of disease
© Masseroli et al; licensee BioMed Central Ltd. 2006
Received: 11 November 2005
Accepted: 08 June 2006
Published: 08 June 2006
Genomic functional information is valuable for biomedical research. However, such information frequently needs to be extracted from the scientific literature and structured in order to be exploited by automatic systems. Natural language processing is increasingly used for this purpose although it inherently involves errors. A postprocessing strategy that selects relations most likely to be correct is proposed and evaluated on the output of SemGen, a system that extracts semantic predications on the etiology of genetic diseases. Based on the number of intervening phrases between an argument and its predicate, we defined a heuristic strategy to filter the extracted semantic relations according to their likelihood of being correct. We also applied this strategy to relations identified with co-occurrence processing. Finally, we exploited postprocessed SemGen predications to investigate the genetic basis of Parkinson's disease.
The filtering procedure for increased precision is based on the intuition that arguments which occur close to their predicate are easier to identify than those at a distance. For example, if gene-gene relations are filtered for arguments at a distance of 1 phrase from the predicate, precision increases from 41.95% (baseline) to 70.75%. Since this proximity filtering is based on syntactic structure, applying it to the results of co-occurrence processing is useful, but not as effective as when applied to the output of natural language processing.
To exploit SemGen predications on the etiology of disease, a gene list was derived from the extracted information after postprocessing filtering and was automatically annotated with GFINDer, a Web application that dynamically retrieves functional and phenotypic information from structured biomolecular resources. Two of the genes in this list are likely relevant to Parkinson's disease but are not associated with this disease in several important databases on genetic disorders.
Information based on the proximity postprocessing method we suggest is of sufficient quality to be profitably used for subsequent applications aimed at uncovering new biomedical knowledge. Although proximity filtering is only marginally effective for enhancing the precision of relations extracted with co-occurrence processing, it is likely to benefit methods based, even partially, on syntactic structure, regardless of the relation.
In an effort to minimize errors due to natural language processing (NLP), we developed and evaluated a procedure for postprocessing extracted genetic information. This processing is applied to the output of SemGen (Semantics for Genetics), an NLP system for extracting semantic predications (or relations) from the text of Medline citations [1, 2].
SemGen applies in the domain of molecular genetics and has several components: Journal Descriptor Indexing (JDI), the MedPost tagger, the SPECIALIST Lexicon, the UMLS Metathesaurus, MetaMap, and ABGene. These components interact to first identify genetic phenomena and disorders and subsequently construct semantic relations among these entities: gene-gene interactions (STIMULATE, INHIBIT, INTERACT_WITH, and their negations) as well as gene-disease associations (ASSOCIATED_WITH, PREDISPOSE, CAUSE, and their negations).
(1) We now show that apoA-II promotes insulin resistance and has diverse effects on fat homeostasis
(2) pron(we) adv(now) verb(show) compl(that) noun(apoA-II) verb(promotes) noun(insulin) noun(resistance) conj(and) aux(has) adj(diverse) noun(effects) prep(on) noun(fat) noun(homeostasis)
(3) [[pron(we)] [adv(now)] [verb(show)] [compl(that)] [head(apoA-II)]NP [verb(promotes)] [mod(insulin) head(resistance)]NP [conj(and)] [aux(has)] [mod(diverse) head(effects)]NP [prep(on) mod(fat) head(homeostasis)]NP]
The syntactic structure in (3) serves as the basis for the next phase, identification of noun phrases expressing relevant semantic concepts: genetic phenomena and disorders. Genetic phenomena are defined broadly as those concepts that may have a bearing on molecular genetics and include genes, proteins, nucleotide sequences, mutations, polymorphisms, and chromosomes. In this study we concentrated exclusively on genes. For each gene name identified, SemGen attempts to provide the corresponding official symbol and Entrez Gene ID, although this is not always possible. Gene symbol resolution is a challenging NLP task, currently under active investigation (for example Morgan et al. 2004, Hou and Chen 2004, Yu et al. 2002). For disorders, SemGen considers a concept having one of the following UMLS semantic types to be relevant: 'Pathologic Function', 'Disease or Syndrome', 'Neoplastic Process', and 'Congenital Abnormality'.
During processing to identify relevant semantic concepts, SemGen examines each noun phrase in structures such as (3) to determine whether it qualifies as a genetic phenomenon or as a disorder. For disorder names, MetaMap is used exclusively to determine whether the phrase maps to a concept in the UMLS Metathesaurus having a relevant semantic type. In considering genetic phenomena, SemGen first calls on MetaMap and the UMLS Metathesaurus; however, since the Metathesaurus is not complete for gene names, the output from ABGene is also consulted.
(4) "NURR1 gene", "RARs", "RAR/RXR", "Retinoid receptors", "retinoid receptor genes", "RXRgamma", "RXR genes"
In processing input from this citation that contains the phrase NURR1 gene, SemGen determines from MetaMap that NURR1 does not occur in the Metathesaurus. The ABGene list (4) for the citation containing the current input sentence is then consulted, and NURR1 is successfully identified as a genetic phenomenon in the sentence containing this phrase.
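The lookup order just described (Metathesaurus first, then the ABGene list for the citation) can be sketched as follows. This is a minimal illustration and not SemGen's code: the function name `classify_gene_phrase` and the two collections passed to it are hypothetical stand-ins for the MetaMap and ABGene components, and exact case-insensitive matching is an assumption made for brevity.

```python
def classify_gene_phrase(phrase, metathesaurus_genes, abgene_names):
    """Return the source that recognizes `phrase` as a gene name, or None.

    Both collections are hypothetical stand-ins: `metathesaurus_genes` for
    gene names assumed to be found through MetaMap, `abgene_names` for the
    list ABGene produced for the citation containing the sentence.
    """
    normalized = phrase.lower()
    # Step 1: try the Metathesaurus (reached through MetaMap in SemGen).
    if normalized in {name.lower() for name in metathesaurus_genes}:
        return "Metathesaurus"
    # Step 2: fall back to the gene names ABGene found in the same citation.
    if normalized in {name.lower() for name in abgene_names}:
        return "ABGene"
    return None


# ABGene list quoted above for the citation that mentions the NURR1 gene.
abgene_list = ["NURR1 gene", "RARs", "RAR/RXR", "Retinoid receptors",
               "retinoid receptor genes", "RXRgamma", "RXR genes"]

# NURR1 is absent from the (here empty) Metathesaurus stand-in, so the
# ABGene list decides.
source = classify_gene_phrase("NURR1 gene", set(), abgene_list)
```
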
(5) [[pron(we)] [adv(now)] [verb(show)] [compl(that)] [genphenom(336|APOA2|apoa-ii|noun(apoA-II)]NP [verb(promotes)] [disorder(mod(insulin) head(resistance)|Insulin Resistance)]NP [conj(and)] [aux(has)] [mod(diverse) head(effects)]NP [prep(on) mod(fat) head(homeostasis)]NP]
(6) There appear to be multiple mechanisms through which JNK might promote diabetes.
Other examples of verbs that indicate semantic relations between a gene and a disorder include "predispose" (for PREDISPOSE) and "influence" and "implicate" (for ASSOCIATED_WITH). Similarly, verbs indicating a relation between two genes are "stimulate" and "upregulate" (for STIMULATE), "block" and "inhibit" (for INHIBIT), and "mediate" (for INTERACT_WITH).
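The verb-to-predicate correspondences listed here amount to a small lookup table. The sketch below records only the verbs named in this paragraph; the names `VERB_INDICATORS` and `predicate_for` are hypothetical, and SemGen's actual indicator rules are not described at this level of detail in the text.

```python
# Verb lemma -> semantic predicate, limited to the verbs named in the text.
VERB_INDICATORS = {
    "predispose": "PREDISPOSE",
    "influence": "ASSOCIATED_WITH",
    "implicate": "ASSOCIATED_WITH",
    "stimulate": "STIMULATE",
    "upregulate": "STIMULATE",
    "block": "INHIBIT",
    "inhibit": "INHIBIT",
    "mediate": "INTERACT_WITH",
}


def predicate_for(verb_lemma):
    """Return the predicate for a verb lemma, or None if it is not listed."""
    return VERB_INDICATORS.get(verb_lemma.lower())
```
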
(7) A clinic-based study of the LRRK2 gene in Parkinson disease yields new mutations.
(8) Saitohin represents a candidate gene for Parkinson's disease.
It should be noted that in both (7) and (8) the preposition is the only syntactic cue that there is a semantic relation between the gene and the disorder. In the molecular genetics domain, we have encountered prepositions as indicators only for the semantic predicate ASSOCIATED_WITH.
PARP-1 and nuclear factor kappa B have both been suggested to play a crucial role in inflammatory disorders.
However, Tag, which inactivates the key tumour suppressor p53, is not known to be involved in the pathogenesis of human insulinoma.
We and others have demonstrated that human obesity syndrome is caused by mutations in the gene MKKS.
In order to exploit extracted semantic predications, we used the GFINDer (Genome Function INtegrated Discoverer) application [12–15]. This is a Web application that enriches lists of gene IDs with controlled functional annotations dynamically retrieved from several biomolecular databanks, including Entrez Gene, Gene Ontology, KEGG, Swiss-Prot, Pfam, and OMIM. Moreover, GFINDer allows computational and statistical analyses on the functional and phenotypic annotations of user-selected groups of genes, aimed at highlighting those annotation categories that are significant in the genes selected.
After establishing a baseline for SemGen processing on the basis of a corpus of Medline citations on diabetes, we evaluated the results of postprocessing this SemGen output with a distance filtering procedure we developed. We also compared these results to those obtained from applying our filtering method to genetic (gene-gene) and etiologic (gene-disease) relations obtained with co-occurrence processing applied to the same corpus on diabetes. Finally, we tested the usefulness of the filtered information obtained from SemGen by looking at genes extracted from text on Parkinson's disease and enhanced with annotations using GFINDer.
In order to test this hypothesis, we implemented a procedure that first kept track of indicator category and argument distance in SemGen output, and then devised a parametric filtering procedure based on these phenomena. Argument distance was computed as the number of phrases intervening between an argument and its indicator.
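Argument distance can be illustrated on the bracketed analysis of the apoA-II sentence given earlier. The sketch below is not the implementation used in the study. It assumes that a phrase adjacent to the indicator has distance 1 (so the number of strictly intervening phrases is the distance minus one), which is one reading of the definition above; the function name `argument_distance` and the flat phrase list are hypothetical.

```python
def argument_distance(arg_index, indicator_index):
    """Distance in phrases between an argument and its indicator.

    Assumption: a phrase adjacent to the indicator has distance 1, so the
    count of strictly intervening phrases is this value minus one.
    """
    if arg_index == indicator_index:
        raise ValueError("argument and indicator must be different phrases")
    return abs(arg_index - indicator_index)


# One entry per bracketed phrase of the apoA-II example sentence.
phrases = ["we", "now", "show", "that", "apoA-II", "promotes",
           "insulin resistance", "and", "has", "diverse effects",
           "on fat homeostasis"]

indicator = phrases.index("promotes")
subject_distance = argument_distance(phrases.index("apoA-II"), indicator)
object_distance = argument_distance(phrases.index("insulin resistance"), indicator)
```

Under this convention both arguments of "promotes" are at distance 1, so the relation would survive even the strictest threshold.
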
SemGen precision for 2,042 genetic (gene-gene) and etiologic (gene-disease) semantic relations
The use of this filtering procedure implies an inverse relationship between precision and amount of information retained (recall): as precision increases, more information is lost. In all cases, argument distance correlates with this trade-off. For example, if gene-gene relations from verb indicators are filtered for arguments at distance 1 from the indicator, precision increases from 41.95% (95% confidence intervals 37.17% to 46.73%) (baseline) to 70.75% (95% confidence intervals 62.09% to 79.41%) (Figure 4, Graph A); however, information retained (recall) drops to 43.60% (95% confidence intervals 36.19% to 51.02%).
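The figures in this example can be checked with a short calculation. The sketch below uses a normal-approximation (Wald) interval; the text does not name the interval method, so this is an assumption. The counts are also inferred rather than taken from the source: 172 correct out of 410 gene-gene relations at baseline, and 75 correct out of 106 retained at distance 1. They were chosen because they agree with the reported percentages and intervals to within rounding.

```python
import math


def wald_interval(successes, total, z=1.96):
    """Proportion with its normal-approximation 95% confidence interval."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


# Inferred counts (an assumption, not stated in the source).
baseline_precision = wald_interval(172, 410)  # correct / all gene-gene relations
filtered_precision = wald_interval(75, 106)   # correct / retained at distance 1
retained_recall = wald_interval(75, 172)      # retained correct / all correct

for label, (p, low, high) in [
    ("baseline precision", baseline_precision),
    ("precision at distance 1", filtered_precision),
    ("recall at distance 1", retained_recall),
]:
    print(f"{label}: {100 * p:.2f}% ({100 * low:.2f}% to {100 * high:.2f}%)")
```
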
Results demonstrate that proximity to verb indicator has a positive effect on accuracy. In addition, semantic class of predicate (gene-gene or gene-disease) appears to influence reliability. With regard to type of indicator, precision is higher for gene-disease relations with verb indicator (B) than for the same relations with preposition indicator (C). Although we did not conduct a formal error analysis to explicate this difference, it is likely dependent on the fact that "in" as an indicator is ambiguous, and hence prone to generating SemGen errors. Another observation is that for verb indicators, precision is higher for gene-disease relations (B) than for gene-gene relations (A). Again, we can only provide an impressionistic explanation. This result is probably due to the tendency for disorder names to be easier to identify than gene names.
Proximity in co-occurrence processing
Results show that precision varies with distance, reaching its best values at moderate to intermediate distances for gene-gene co-occurrences and at low distances for gene-disorder co-occurrence pairs. Filtering at such distances improves results when compared to no filtering. For gene-disease co-occurrences, precision improved by 30.46% in relative terms when considering only co-occurrences with no more than 1 intervening content word: 37.50% vs. 28.74% precision (95% confidence intervals 23.80% to 51.20% vs. 23.10% to 34.39%). For gene-gene co-occurrences, the relative improvement when considering only co-occurrences with up to 6 intervening content words was 32.62%: 13.75% vs. 10.37% precision (95% confidence intervals 8.41% to 19.09% vs. 6.91% to 13.82%). When the number of intervening content words is low (up to 4), precision decreases for gene-gene co-occurrences.
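The improvements quoted here are relative gains over the unfiltered precision, not differences in percentage points (those are 8.76 and 3.38 points). The sketch below makes the arithmetic explicit. Recomputing from the rounded percentages gives values slightly different from those reported; counts that fit the reported percentages (18 of 48 vs. 71 of 247 for gene-disease, 22 of 160 vs. 31 of 299 for gene-gene) come closer, but these counts are inferred and do not appear in the source.

```python
def relative_improvement(filtered, baseline):
    """Relative gain of `filtered` over `baseline`, as a percentage."""
    return 100 * (filtered - baseline) / baseline


# From the rounded percentages reported in the text.
gene_disease_rounded = relative_improvement(37.50, 28.74)
gene_gene_rounded = relative_improvement(13.75, 10.37)

# From inferred counts (an assumption, not stated in the source).
gene_disease_counts = relative_improvement(18 / 48, 71 / 247)
gene_gene_counts = relative_improvement(22 / 160, 31 / 299)
```
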
The effect of proximity filtering on overall precision of co-occurrence processing is not dramatic. In addition, the improved precision results have wide confidence intervals that partially overlap with those of the unfiltered results, suggesting that in some cases the improvement could be due to chance variation. However, these results do not limit the benefits of such filtering for semantic interpretation. As noted above, the effectiveness of proximity processing with semantic interpretation is ultimately determined by the structure of English sentences. Co-occurrence processing is applied without reference to that structure, and hence subsequent proximity processing either has minimal effect or depresses precision (with fewest intervening content words).
Exploiting extracted relations
NLP to support research in molecular biology
SemGen is one of several systems currently being developed to provide access to information in text (entities and relations between them) for molecular biology research (see Blaschke et al. for an overview). Of the systems that identify relations, various approaches (both statistical and rule based) are being pursued. Due to the complexity of the content involved, most systems focus on a particular molecular biology phenomenon. Several applications address protein-protein interactions: Bunescu et al., for example, use machine learning techniques, while Corney et al. employ syntactic patterns for such relations. Blaschke et al. and Blaschke & Valencia also use syntactic templates, enhanced with proximity processing between arguments, to identify protein interactions. Huang et al. also call on syntactic patterns, which they discover with automatic methods, and Temkin and Gilder use a context free grammar to extract protein interactions from text. Hu et al. concentrate on protein phosphorylation using a system similar to SemGen. Regarding function and structure, Gaizauskas et al. employ both syntactic parsing and semantic templates to identify information about protein structure in text. Koike et al. exploit syntactic patterns for gene function, and Daraselia et al. take advantage of a full parse for protein function. Friedman et al. use a semantic grammar to identify molecular pathways, while Santos et al. combine statistical methods with partial and full parsing and concentrate on the Wnt pathway. Blaschke et al. mine gene expression information from Medline citations using a method similar to [31, 32]. Finally, Leroy et al. exploit a shallow parser to identify various relations in molecular biology. Proximity between arguments is also used in their method.
The postprocessing technique we developed selects the extracted semantic relations that are most likely to be correct based on distance of the arguments from the syntactic predicate (indicator). Other methods [31, 32, 41, 42] have employed a related notion, namely distance between arguments participating in a relation, where the relations are identified with templates or shallow parsing. Previous work has not discussed incremental improvements dependent on degree of proximity, nor discussed the recall-precision trade-off, nor compared proximity filtering to unfiltered results.
Although we based our work on SemGen, our filtering process could be applied to relations produced by other NLP methods. The semantic content of the relations is not relevant. The postprocessing technique could be transferred most straightforwardly to those systems that retrieve arguments using rules or patterns, since a verb or preposition (an indicator) is available to interact with argument distance as a predictor of extraction accuracy. When an indicator is not used, as in statistical systems, the technique could be slightly modified to use distance between arguments as the sole predictor of correctness. However, in this case it is not likely to be dramatically effective, as suggested by our experiment with co-occurrence processing.
Argument distance thresholds while postprocessing extracted relations
Effectively exploiting extracted genetic and etiologic relations for subsequent applications depends on maintaining a balance between the highest possible precision and a sufficient level of retained information (Figure 4) for useful applications. For gene-gene relations derived from verbs, for example, this can be obtained by allowing an argument distance of no more than 2 phrases from the indicator (55.88% (95% confidence intervals 49.07% to 62.70%) precision, and 66.28% (95% confidence intervals 59.21% to 73.34%) recall). However, when filtered relations are used for automatic processing, high precision should take precedence over high recall (retained information): for verb indicators, an argument distance of 1 for genetic relations and 2 or 3 for etiologic relations. For supervised applications less strict threshold values can be used.
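The two selection policies described here can be compared on the values reported for gene-gene relations from verb indicators at distances 1 and 2. The sketch below is illustrative only: using the F score as the measure of "balance" is an assumption, and the variable names are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Gene-gene relations from verb indicators; values reported in the text.
reported = {
    1: {"precision": 0.7075, "recall": 0.4360},
    2: {"precision": 0.5588, "recall": 0.6628},
}

# Policy for supervised use: balance precision and recall (here via F score).
best_balance = max(reported, key=lambda d: f_score(**reported[d]))

# Policy for automatic processing: precision takes precedence.
best_precision = max(reported, key=lambda d: reported[d]["precision"])
```

With these two data points the balance-oriented policy selects distance 2 and the precision-first policy selects distance 1, in line with the thresholds recommended above.
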
Considering all extracted relations, whether they include an official gene symbol or not, would increase precision for any distance threshold (Table 1). Such relations could be useful for subsequent supervised applications; however, we limited this study to official gene symbols (and the corresponding Entrez Gene ID) because this allows automatic linking to structured biological data. Subsequent automatic processing based on this information could then unveil hidden biological knowledge [43–45].
Exploiting extracted information
The application example discussed above illustrates that the procedure we propose for filtering the results of automatically extracted gene-gene and gene-disease relations effectively selects useful information. Many of the genes identified, including CYP2D6, LRRK2, MAPT, NR4A2, PARK2, PARK3, PARK7, PINK1, UCHL1 (Figure 6), are associated with Parkinson's disorders in genetic disorder reference databases, such as OMIM, GHR, GAD, and MutPD. Furthermore, the process identified five genes not associated with Parkinson's disease in those databases (APLP2, EN2, IREB2, NGFB, and SLC18A2). By uploading these genes into GFINDer, we were able to highlight their biological process categories in the Gene Ontology (Figure 7). Three of these genes (EN2, IREB2, and NGFB) are currently associated only with high level or general biological process categories that might or might not be related to Parkinson's disease. However, two of these genes (APLP2 and SLC18A2) are associated with low level biological process categories clearly related to Parkinson's disease: "G-protein coupled receptor protein signaling pathway" and "monoamine transport", respectively. Alterations in the biological processes of these categories, which are parent and sibling categories of "dopamine receptor signaling pathway" and "dopamine transport" respectively, may well be involved in Parkinson's disease and suggest interesting avenues for further analysis. In fact, motor symptoms in Parkinson's disease are generally thought to result from deficiency or dysfunction of dopamine or dopaminergic neurons in the substantia nigra.
Although current filtering usefully supports subsequent automatic analysis, there is room for improvement. As a further measure of sentence complexity, an extended multi-parametric filtering could be implemented, which takes into account the total number of arguments on both the left and the right of an indicator. It would also be possible to improve results by exploiting domain knowledge about genes and diseases to support statistical methods for constructing resources expressing functional characteristics such as involvement in biological processes, biochemical pathways, molecular functions, and co-occurring expression in similar tissues. This information could then be consulted to exclude improbable semantic relations.
The genetic and etiologic relations extracted by SemGen from the research literature are normalized semantic descriptions of complex genetic interactions. The filtering method we implemented increases the precision of error-prone NLP by selecting the semantic relations most likely to be correct. This information can then be used for further applications aimed at uncovering new biomedical knowledge.
Establishing a baseline
To establish a baseline, we first evaluated the precision of SemGen in extracting genetic (gene-gene) and etiologic (gene-disease) relations from 5,525 Medline citations on the genetic basis of diabetes retrieved with the PubMed query "diabetes AND (gene OR genes OR genetic)." From these citations SemGen extracted a total of 8,956 genetic and etiologic relations. 2,042 (22.80%) of them were selected from 1,934 sentences and were compared to the original sentences by a genetics domain expert, who classified them as correct or not. The primary consideration in selecting the relations to evaluate was whether all gene names in the relation had been matched to the official gene symbol (in Entrez Gene). The official symbol is required by GFINDer to connect information extracted by SemGen to online resources. 1,372 (of the 8,956 total relations) had gene names matched to Entrez Gene and all such relations were evaluated. Of these, 410 referred to gene-gene relations and 962 to gene-disease predications. 817 relations with an official gene symbol had been derived from a verb indicator by SemGen, and 555 from a preposition. In addition, 670 extracted relations without an official gene symbol and Entrez Gene ID were also randomly selected and assessed (617 gene-gene and 53 gene-disease relations; 643 with verb indicator and 27 with preposition indicator). The results of the evaluation for the 2,042 total semantic relations assessed (1,027 gene-gene and 1,015 gene-disease relations; 1,460 with verb indicator and 582 with preposition indicator) are reported in Table 1.
Filtering SemGen output
As a gold standard for evaluating the filtering strategy to retain the relations most likely to be correct, we used the same 2,042 semantic relations about diabetes on which the baseline was determined. We filtered the gold standard relations at various thresholds of argument distance from the relation indicator by keeping all relations with both argument distances lower or equal to the considered threshold. After each filtering, we grouped the selected relations according to relation type (gene-gene or gene-disease), syntactic category of indicator (preposition or verb), and official gene symbol content (relations with official gene symbol for all extracted genes or not for all). For each group we calculated number of correct and incorrect relations, precision (P) (the percent of semantic relations retained after filtering that are correct), recall (R) or retained information (the percent of the initially correct semantic relations retained after filtering), and F score (F) (the harmonic mean of precision and recall) as follows: P = rc/(rc+ri), R = rc/(rc+dc), F = 2*P*R/(P+R) with rc the retained correct relations, ri the retained incorrect relations, and dc the discarded correct relations. 95% confidence intervals were also computed for each calculated value of P, R, and F.
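The filtering rule and the three measures defined above can be sketched together as follows. The `Relation` record and the function `evaluate_threshold` are hypothetical names, and the five relations at the end are made-up toy data used only to exercise the formulas; they are not taken from the gold standard.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    left_distance: int   # distance of the left argument from the indicator
    right_distance: int  # distance of the right argument from the indicator
    correct: bool        # expert judgment recorded in the gold standard


def evaluate_threshold(relations, threshold):
    """Filter at `threshold` and return (P, R, F) as fractions."""
    # Keep a relation only if BOTH argument distances are within the threshold.
    retained = [r for r in relations
                if r.left_distance <= threshold and r.right_distance <= threshold]
    rc = sum(r.correct for r in retained)            # retained correct
    ri = len(retained) - rc                          # retained incorrect
    dc = sum(r.correct for r in relations) - rc      # discarded correct
    precision = rc / (rc + ri) if retained else 0.0
    recall = rc / (rc + dc) if (rc + dc) else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Toy data (made up for illustration).
toy = [Relation(1, 1, True), Relation(1, 2, True), Relation(1, 1, False),
       Relation(3, 1, True), Relation(2, 4, False)]

p1, r1, f1 = evaluate_threshold(toy, 1)
p2, r2, f2 = evaluate_threshold(toy, 2)
```
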
Mutations in the DJ-1 gene cause early-onset autosomal recessive Parkinson's disease
In evaluating the co-occurrence processing, a sample of 200 sentences (10.34%) was randomly selected from the same 1,934 sentences in the Medline citations on diabetes used to evaluate proximity filtering based on SemGen processing. Then, a domain expert assessed 546 co-occurrences (299 gene-gene and 247 gene-disease) within these 200 sentences and classified them as correct or not.
We subsequently evaluated the effectiveness of distance filtering postprocessing to improve precision of extracted co-occurrences. We first filtered the assessed co-occurrences at incremental distance thresholds of intervening content words between co-occurrences of gene-gene or gene-disorder concepts. Then, as was done for evaluating SemGen output filtering, for each threshold value we calculated the number of correct and incorrect relations, and precision, recall, F score and 95% confidence intervals.
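A distance filter for co-occurrences can be sketched along the same lines. The text does not specify how content words were identified, so the stop list below is hypothetical, as are the function names and the token spans chosen for the two mentions in the DJ-1 example sentence quoted above.

```python
# Hypothetical stop list; the source does not define "content word".
FUNCTION_WORDS = {"a", "an", "the", "in", "of", "for", "on", "to", "and",
                  "or", "is", "are", "was", "were", "by", "with", "that"}


def intervening_content_words(tokens, first_span, second_span):
    """Count content words between two mentions.

    Each span is a (start, end) pair of token indexes with `end` exclusive.
    """
    (_, first_end), (second_start, _) = sorted([first_span, second_span])
    between = tokens[first_end:second_start]
    return sum(1 for token in between if token.lower() not in FUNCTION_WORDS)


def keep_cooccurrence(tokens, first_span, second_span, threshold):
    """Keep the pair if at most `threshold` content words separate the mentions."""
    return intervening_content_words(tokens, first_span, second_span) <= threshold


sentence = ("Mutations in the DJ-1 gene cause early-onset autosomal "
            "recessive Parkinson's disease")
tokens = sentence.split()
gene_span = (3, 5)      # "DJ-1 gene" (assumed mention boundaries)
disease_span = (9, 11)  # "Parkinson's disease" (assumed mention boundaries)

gap = intervening_content_words(tokens, gene_span, disease_span)
```

With these assumed spans the gap consists of "cause", "early-onset", "autosomal", and "recessive", so the pair would be kept at a threshold of 6 but dropped at a threshold of 1.
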
Processing for exploiting extracted relations
A PubMed query on the genetics of Parkinson's disease retrieved 3,871 Medline citations. Initial SemGen processing extracted 1,454 semantic relations from 899 citations. Of these, 365 were gene-gene relations (all with verb indicator; 40 with an official gene symbol and 325 without), and 1,089 were gene-disease relations (454 with verb indicator, 635 with preposition indicator; 233 with an official gene symbol and 856 without). For further processing, we limited the extracted relations to the 85 etiologic relations which included an official gene symbol and Entrez Gene ID and which involved one of the following disorders: Parkinson Disease (74 relations), Parkinsonian Disorders (9), and Autosomal Recessive Parkinsonism (2). One of these relations was subsequently eliminated because it negated a gene-disease association. We then evaluated the performance of distance filtering on the remaining 84 semantic relations by comparing them against the sentences that generated them, filtering them at increasing distance thresholds, and calculating precision, recall, F score, and the corresponding 95% confidence intervals.
We are grateful to Bisharah Libbus for his domain expertise supporting this project. The first author was supported by an appointment to the National Library of Medicine Research Participation Program administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the U.S. Department of Energy and the National Library of Medicine. This project was supported in part by the Intramural Research Program of the NIH, National Library of Medicine.
- Rindflesch TC, Libbus B, Hristovski D, Aronson AR, Kilicoglu H: Semantic relations asserting the etiology of genetic diseases. In Proceedings of the American Medical Informatics Association annual symposium: 8–12 November 2003; Washington, DC. Edited by: Musen MA, Bethesda MD. American Medical Informatics Association; 2003:554–558.
- Libbus B, Kilicoglu H, Rindflesch TC, Mork JG, Aronson AR: Using natural language processing, Locus Link, and the Gene Ontology to compare OMIM to MEDLINE. In Proceedings of the Workshop on Linking the Biological Literature, Ontologies and Databases: Tools for Users: 6 May 2004; Boston, MA. Edited by: Hirschman L, Pustejovsky J. East Stroudsburg, PA: Association for Computational Linguistics; 2004:69–76.
- Humphrey SM: Automatic indexing of documents from journal descriptors: a preliminary investigation. J Am Soc Inf Sci 1999, 50(8):661–674. 10.1002/(SICI)1097-4571(1999)50:8%3c661::AID-ASI4%3e3.0.CO;2-R
- Smith L, Rindflesch T, Wilbur WJ: MedPost: a part-of-speech tagger for biomedical text. Bioinformatics 2004, 20(14):2320–2321. 10.1093/bioinformatics/bth227
- McCray AT, Srinivasan S, Browne AC: Lexical methods for managing variation in biomedical terminologies. In Proceedings of the Annual Symposium on Computer Applications in Medical Care: 5–9 November 1994; Washington, DC. Edited by: Ozbolt JG. Philadelphia, PA: Hanley & Belfus; 1994:235–239.
- Humphreys BL, Lindberg DA, Schoolman HM, Barnett GO: The Unified Medical Language System: An informatics research collaboration. J Am Med Inform Assoc 1998, 5(1):1–11.
- Aronson AR: Effective mapping of medical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the American Medical Informatics Association annual symposium: 3–7 November 2001; Washington, DC. Edited by: Bakken S. Philadelphia, PA: Hanley & Belfus; 2001:17–21.
- Tanabe L, Wilbur WJ: Tagging gene and protein names in biomedical text. Bioinformatics 2002, 18(8):1124–1132. 10.1093/bioinformatics/18.8.1124
- Morgan AA, Hirschman L, Colosimo M, Yeh AS, Colombe JB: Gene name identification and normalization using a model organism database. J Biomed Inform 2004, 37(6):396–410. 10.1016/j.jbi.2004.08.010
- Hou WJ, Chen HH: Enhancing performance of protein and gene name recognizers with filtering and integration strategies. J Biomed Inform 2004, 37(6):448–460. 10.1016/j.jbi.2004.08.006
- Yu H, Hatzivassiloglou V, Rzhetsky A, Wilbur WJ: Automatically identifying gene/protein terms in MEDLINE abstracts. J Biomed Inform 2002, 35(5–6):322–330. 10.1016/S1532-0464(03)00032-7
- Masseroli M, Martucci D, Pinciroli F: GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining. Nucleic Acids Res 2004, (32 Web Server):W293-W300.
- Masseroli M, Galati O, Pinciroli F: GFINDer: genetic disease and phenotype location statistical analysis and mining of dynamically annotated gene lists. Nucleic Acids Res 2005, (33 Web Server):W717-W723. 10.1093/nar/gki454
- Masseroli M, Galati O, Manzotti M, Gibert K, Pinciroli F: Inherited disorder phenotypes: controlled annotation and statistical analysis for knowledge mining from gene lists. BMC Bioinformatics 2005, 6(Suppl 4):S18. 10.1186/1471-2105-6-S4-S18
- GFINDer: Genome Function INtegrated Discoverer web site [http://www.bioinformatics.polimi.it/GFINDer/]
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2005, (33 Database):D54-D58.
- The Gene Ontology™ Consortium: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25(1):25–29. 10.1038/75556
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000, 28(1):27–30. 10.1093/nar/28.1.27
- Gasteiger E, Jung E, Bairoch A: Swiss-Prot: connecting biomolecular knowledge via a protein database. Curr Issues Mol Biol 2001, 3(3):47–55.
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, (32 Database):D138-D141. 10.1093/nar/gkh121
- McKusick VA: Mendelian Inheritance in Man. A Catalog of Human Genes and Genetic Disorders. 12th edition. Baltimore, MD: Johns Hopkins University Press; 1998.
- Nussbaum RL, Polymeropoulos MH: Genetics of Parkinson's disease. Hum Mol Genet 1997, 6(10):1687–1691. 10.1093/hmg/6.10.1687
- Morris HR: Genetics of Parkinson's disease. Ann Med 2005, 37(2):86–96. 10.1080/07853890510007269
- Calne D: A definition of Parkinson's disease. Parkinsonism Relat Disord 2005, 11(Suppl 1):S39-S40. 10.1016/j.parkreldis.2005.01.008
- Mitchell JA, Fun J, McCray AT: Design of Genetics Home Reference: a new NLM consumer health resource. J Am Med Inform Assoc 2004, 11(6):439–447. 10.1197/jamia.M1549
- Becker KG, Barnes KC, Bright TJ, Wang SA: The Genetic Association Database. Nat Genet 2004, 36(5):431–432. 10.1038/ng0504-431
- MutPD – The Parkinson Disease mutation database [http://www.thepi.org/altruesite/files/parkinson/Mutations/new_page_1.html]
- Blaschke C, Hirschman L, Valencia V: Information extraction in molecular biology. Brief Bioinf 2002, 3(2):1–12.
- Bunescu R, Ge R, Kate RJ, Marcotte EM, Mooney RJ, Ramani AK, Wong YW: Comparative experiments on learning information extractors for proteins and their interactions. Artif Intell Med 2005, 33(2):139–155. 10.1016/j.artmed.2004.07.016
- Corney DPA, Buxton BF, Langdon WB, Jones DT: BioRAT: extracting biological information from full-length papers. Bioinformatics 2004, 20(17):3206–3213. 10.1093/bioinformatics/bth386
- Blaschke C, Andrade MA, Ouzounis C, Valencia A: Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology: 6–10 August 1999; Heidelberg, D. Edited by: Lenauer T, Schneider R, Bork P, Brutlag DL, Glasgow JI, Mewes H-W, Zimmer R. San Francisco, CA: Morgan Kaufman Publishers, Inc; 1999:60–67.
- Blaschke C, Valencia A: Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study. Comp Funct Genomics 2001, 2: 196–206. 10.1002/cfg.91
- Huang M, Zhu X, Hao Y, Payan DG, Qu K, Li M: Discovering patterns to extract protein-protein interactions from full texts. Bioinformatics 2004, 20(18):3604–12. 10.1093/bioinformatics/bth451
- Temkin JM, Gilder MR: Extraction of protein interaction information from unstructured text using a context-free grammar. Bioinformatics 2003, 19(16):2046–2053. 10.1093/bioinformatics/btg279
- Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, Wu CH: Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics 2005, 21(11):2759–2765. 10.1093/bioinformatics/bti390
- Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P: Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics 2003, 19(1):135–143. 10.1093/bioinformatics/19.1.135
- Koike A, Niwa Y, Takagi T: Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics 2005, 21(7):1227–1236. 10.1093/bioinformatics/bti084
- Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 2004, 20(5):604–611. 10.1093/bioinformatics/btg452
- Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A: GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 2001, 17(Suppl 1):S74-S82.
- Santos C, Eggle D, States DJ: Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics 2005, 21(8):1653–1658. 10.1093/bioinformatics/bti165
- Blaschke C, Oliveros JC, Valencia A: Mining functional information associated with expression arrays. Funct Integr Genomics 2001, 1: 256–268. 10.1007/s101420000036
- Leroy G, Chen H, Martinez JD: A shallow parser based on closed-class words to capture relations in biomedical text. J Biomed Inform 2003, 36(3):145–158. 10.1016/S1532-0464(03)00039-X
- Swanson DR: Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med 1986, 30(1):7–18.
- Srinivasan P, Libbus B: Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics 2004, 20(Suppl 1):I290-I296. 10.1093/bioinformatics/bth914
- Hristovski D, Peterlin B, Mitchell JA, Humphrey SM: Using literature-based discovery to identify disease candidate genes. Int J Med Inform 2005, 74(2–4):289–298. 10.1016/j.ijmedinf.2004.04.024
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.