Automatic reconstruction of a bacterial regulatory network using Natural Language Processing
© Rodríguez-Penagos et al; licensee BioMed Central Ltd. 2007
Received: 12 January 2007
Accepted: 07 August 2007
Published: 07 August 2007
Manual curation of biological databases, an expensive and labor-intensive process, is essential for high quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using Text-Mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12.
Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually-curated RegulonDB, 45% of which we were able to recreate automatically. From our automated analysis we were also able to find some new interactions from papers not already curated, or that were missed in the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners.
Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages.
Genomics and Systems Biology rely on vast amounts of organized data in order to use the sophisticated bioinformatics tools that allow modeling, analysis and interpretation of biological processes like gene regulation, variation in expression profiles and metabolic pathways . Gene regulation, for example, involves complex interactions between genes, transcriptions factors (TFs), proteins and metabolites that can be visualized as networks of inducing and repressing interactions controlling gene expression. Those interactions triggering gene expression and protein synthesis may control cell development and adaptability to environmental changes. Most of the end result of this kind of biological research materializes in textual publication in peer-reviewed journals, but as such is not directly amenable to computational treatment. Databases compiling this information have been developed for some organisms (for example EcoCyc  for E. coli, or SDG  for S. cerevisiae). However, the manual curation involved in ensuring the reliability of this data is a resource- and labor-intensive endeavor. The ever-expanding literature (represented by the millions of papers in electronic repositories that have come to be known as the bibliome ) can literally overwhelm the ability of researchers to make sense of this flood of information. Thousands of papers must be selected and retrieved from repositories such as NIH's PubMed and Medline, and carefully reviewed by experts in order to extract the facts needed by the research community. The last ten years have seen a proliferation of Natural Language Processing (NLP) techniques to aid in dealing with the explosive growth of useful data. Several good overviews [5–8] of the state-of-the-art in Text-Mining have been published recently, and some of the techniques have reached an acceptable level of maturity that allows them to finally perform reasonably well. Information Extraction (IE) is one of the computational methods that has been used successfully before in other non-scientific fields . In contrast to Information Retrieval, which provides a ranked list of relevant documents, IE not only finds which sources of information are relevant, but also automatically populates databases with information that fulfills the user's needs. Compilation of the extremely complex networks of biochemical interactions that control developmental and functional processes in cells is a prime example of the usefulness of these technologies for biological research, as evidenced, for instance, in the various systems competing in the BioCreative evaluations , or in earlier examples, the PastaWeb , GENIES , iHOP  and BioRat  systems, all of them correlating biological networks to the literature that describe them.
In this work, we show how, and to what extent, a state-of-the-art NLP system can aid the manual curation of transcriptional regulation in Escherichia coli, and how both approaches (manual vs. automatic annotation) can complement each other and enrich curation. Although we don't claim any significant technical advance in the current state-of-the-art in text mining per se, we present a much-needed evaluation of the impact of Human Language Technologies (HLT) on annotation and data-gathering using as a gold standard a biological model (a regulatory network) instead of an annotated corpus. We also suggest a methodology to merge these methodologies into the conventional curatorial workflow.
For the overall architecture of the IE system, we adapted a rule-based pipeline first described by Saric et al. for the STRING-IE system [15, 16]. The pipeline is described in detail in the Methods section. We chose a Language-Engineering approach since in general we were concerned more with accuracy (precision) than with coverage (recall), and wanted to be fairly sure about the regulatory interactions we would be extracting. Rule-based systems are more accurate for well-defined tasks (although they are very labor-intensive to implement), while statistically-based approaches, although less exact, are more robust and tolerant to noisy data and errors.
In order to test the capabilities of the extraction system reported here, various collections of documents were downloaded; involving both abstracts from PubMed or full text articles available from the various journal's websites or from subscription services. We employed different strategies for the triage process, which will be described in the Corpus subsection. In this work we used fairly standard search procedures, either gathering the lists of references from pre-curated databases, or doing various keyword searches using the NIH Entrez facilities.
Document sets (corpora) used in this work
# of docs
size in MB
RegulonDB Network References
Full-text papers from the RegulonDB database references that curators have identified as referring specifically to the regulatory network, as opposed to those referring to other objects from the database.
Full text papers from the complete RegulonDB references that we were able to access and download.
Abstracts from the complete RegulonDB references, as of June of 2006.
RegulonDB search strategies
Corpus generated by using the RegulonDB curator's search strategies, without any subsequent filtering.
Abstracts from references in the 2006 EcoCyc database that describes the genome and the biochemical machinery of E. coli.
Corpus of distinct sentences generated by the STRING-IE team by searching in PubMed for "E. coli" (and synonyms), and two gene/protein names in the same abstract, from 195,000 abstracts.
A markup language for mining bacterial regulatory networks
In contrast, acnB expression is activated by CRP and repressed by ArcA, FruR and Fis from PacnB.
In the "interaction" node, the label from shows the internal name of the final extraction grammar rule that justified retrieval of that sentence, while source refers to the collection where it was found, and pmid refers to the identification number of the paper where this particular sentence was retrieved from:
<interaction ID="596" from="anaph+ev_act_expr_xr" ri_function="repressor" source="RegulonAbs">
<regulator GenProtID="ECK120011345" org="ecoli" type="nxpg"> ArcA </regulator>
<regulated GenProtID="ECK120002193" org="ecoli" type="nxpg"> acnB </regulated>
<sentence pmid="9421904"> In contrast, acnB expression is activated by CRP and repressed by ArcA, FruR and Fis from PacnB. </sentence>
Thus, we identify two entities, one the regulator and the other the regulated one, with unique ID numbers for each, as well as organism and semantic types (here "nnpg" means: either a gene or a protein, as sometimes they cannot be disambiguated fully). Note that this entry represents only one of the four different one-to-one interactions that can be extracted from the sentence. We devised RNM2L because: A) System Biology Markup Language  was considered too cumbersome for the more limited task of representing networks with discrete states, and B) because a more specialized tagging scheme would allow better comparison between networks extracted from different sources or with different systems or methodologies, while at the same time being useful both for NLP optimization and for biological interpretation. In the supplementary material site we provide a defining DTD schema for RNM2L.
Evaluation criteria for the network extraction task
To evaluate the extracted regulatory networks we implemented a benchmarking tool that checked each interaction against the RegulonDB database. We assume the database is a reliable gold standard of the final desired output, in contrast to the more customary evaluation methodologies that use a set of manually-annotated sentences from which a database of relevant information can be elicited. Since our database does not include the actual sentences from which the facts have been extracted, our automatic evaluation does not have available to it the linguistic expression of the facts. This methodology involves, of course, some assumptions about the completeness and exhaustiveness of the RegulonDB database. Since E. coli K-12 is one of the best understood model organisms in the literature, and the RegulonDB is carefully compiled and monitored, we can nevertheless rely on it to be a good baseline for evaluation.
Those interactions that had a regulator gene/protein not found in RegulonDB (either by ID, name or synonym) were filtered out, but overall we kept interactions where the gene/protein was identified as a mutant or coming from another (mainly bacterial) organism. Our evaluation considered two cases, a) where the interaction has both the regulator and the regulated genes/proteins correctly identified but their type of interaction might be undetermined, or b) where, besides a correct identification of the involved entities, the nature of their interaction was also correctly identified (as activation or repression).
To test how previously-known information on the network could support the extraction effort, we created two different networks from each corpus: one where we artificially inserted interactions derived from previous knowledge about operons, heterodimers and the like, and another one where we assumed we didn't have a priori information about those multi-entity objects; for example, from the sentence: "transcription of the E. coli melAB operon is regulated by the MelR protein" the fact that melA and melB are regulated by the MelR protein could be extracted. The non-enriched networks, although less complete, would be more realistic with regard to the true capabilities of the extraction system, when applied to other less studied organisms for which we don't have as much genomic information as for the E. coli proteobacteria.
We calculated precision, recall and F-measure of the extracted samples by considering RegulonDB as the ultimate instance of the bacterial network. We previously subtracted from it all computational predictions, since they would conceivably not be present in experimental papers. Recall (the number of relevant interactions extracted with regard to all possible relevant interactions that should have been retrieved) was estimated using the entries in our reference RegulonDB database, and considering as a single one the cases where there was both activation and repression annotated. This is not an orthodox measure of exhaustiveness in Information Extraction evaluations, but since we were not using annotated corpora it was the closest we would get to knowing how many extractable phrases were extant in our corpus, assuming all were there and available to the system. Because of these assumptions, the actual performance of the system could realistically be significantly better than what this fully-automatic verification might lead to believe. For precision (accuracy, or the ratio of all relevant interactions extracted vs. number of all interactions retrieved), we calculated two values: one (precision 1) where we used all the interactions retrieved (even if they were filtered out because the regulator gene/protein was not found in RegulonDB), and another one (precision 2) in which we only considered the subset of all interactions where the regulator gene/protein was cataloged in the reference database. As in other aspects of these evaluation, we did this in order to understand what would happen in cases where we are dealing with other organisms for which we don't have all the information that we have available for E. coli, for instance, where we don't have an assumption of a completely curated network or not all transcription factors are known.
Final network extraction system evaluation metrics
RegulonDB (as of June 2006)
A biologist reviewed a random sampling of the most comprehensive set of interactions, which our system was able to extract, but that were not found to be in RegulonDB. This exercise allowed us to explore information that: a) was not retrieved during the manual curation process, but should have been, b) was processed incorrectly or c) although basically correct, was not completely relevant to our current purposes (for example, gene regulation other than transcriptional, although RegulonDB plans to add this information in the future). From a random sample of 96 interactions, we found 19 that represented relevant information that was not present in our reference database, but that merited either a closer look at the sources or further analysis to establish if it should be incorporated into RegulonDB. There were multiple reasons for this information being missing, among them: 1) the source papers had not been retrieved for curation or had not been curated yet, or 2) the genes or TFs were mentioned with unusual synonyms, Ids, references or terms, which made their manual curation difficult, 3) or the evidence presented either was deemed insufficient by curators or was presented with high level of hedging ("the molR gene probably regulates the expression of the chlD operon"). If we extrapolate this figure to the 1,545 interactions from the comprehensive non-enriched 2,649 interactions set that could not be matched with the relevant 3,108 RegulonDB entries, around 290 new interactions could be added to the database after manual review. RegulonDB curators will curate this automatically-generated network to see if they can integrate the data into the database, but also will search for other pertinent information, such as site and distances for gene and promoters.
In order to test the linguistic processing of the system, we did a manual review of a random sample of 96 interactions extracted, and we established 81 of them as having a basically correct semantic interpretation of the sentences, and 76 of them as being biologically correct to the point of including the right activation or repression function, for a 84% overall precision. The network that was gathered from all sources allowed us to obtain 45% of the entire human-curated RegulonDB network, while a more limited 700-plus selection of network-related papers (RN) accounted for 33% of that total. The "artificial" addition of multiple-entity objects like operons and two-component protein systems from RegulonDB (information that was available from our reference database, but would not be for organisms not yet curated) increased the size of the global network by 10 percentile points (324 interactions). In most datasets the increase was less significant, and we believe that as a whole the value of the information added with previous domain knowledge was not overly important.
In a more extensive evaluation simulating full curation of NLP-generated networks, a domain expert reviewed 481 interactions that A) were obtained from processing the 12,059 abstracts retrieved using the RegulonDB search strategies and B) were not found in the RegulonDB database. Again, we wanted to test if the extraction system could find relevant information that was missed at the triage or curation stages (a difficult task for human curators). We found 91 interactions that could be added to RegulonDB, while in the rest there was either an error in the inference made from the text, or even if the inference was correct the data could not be added to RegulonDB because of various reasons, among them: the regulation was not transcriptional, did not correspond to E. coli, there was an error in the gene/protein identification module, etc. In a few of the interactions the data seemed correct but the complete papers were unavailable to check them fully before adding them to the database. The 18.91 % of total interactions that was either immediately useful or seemed correct pending further analysis constitutes a reasonable addition to the curated network that seems to be worth the review effort by the curatorial team (potentially, close to 290 new interactions could be added if we consider the combined network obtained from all our corpora). In any retrieval system, the small "tail" of valuable missed information is always harder to come by than obvious cases, and improvement on Recall and Precision is hard. It is clear that a fully automatic processing of these sentences would require complex Artificial Intelligence inference engines customized for biological interpretation. Data that was not included in the RegulonDB network we tested against (like other regulation mechanisms or plasmid genes) could nonetheless provide important relationships associated with the annotation process. By doing a manual sweep of computer-generated curation, new or relevant information can be garnered that complements, expands or confirms human annotation.
Informational density of various corpora
B) # Docs
C) Size (MB)
D) Interactions in RegulonDB
E) Total interactions
F) % In RegulonDB
G) Average doc size (kbs)
How much valuable information is lost in papers that are filtered-out in the initial triage? Is a tightly-focused search-and-selection strategy better than casting a wider net? What is preferable when curating large-scale biological networks, exhaustiveness or precision in the data? Can automatic and manual annotation complement each other so we don't have to expect a trade-off with these two parameters? In this work we have addressed some of these issues by evaluating the use of NLP-based Information Extraction techniques for enhancing curation of a database of regulatory interactions. Saric et al. [15, 16], for instance, presented a regulatory extraction system designed for inter-species versatility, but their evaluations focused more on the computational aspect of their system and not sufficiently on the implementation of such a system within a curatorial workflow. Rodriguez-Esteban et al.  evaluated the results of the GeneWays pathway Information Extraction system using Machine-Learning techniques to simulate the decision-making process of curators when reviewing the output of an IE system, thus framing the question as a classification problem ("correct/incorrect"). One conspicuous difference between their evaluation approach and ours is that we used RegulonDB as a gold standard, while they relied on manually-reviewed training-testing data, albeit with measures of inter-annotator agreement that ensures certain objectivity. Karamanis et al.  reported the use of NLP-based tools to assist curation of the fruit fly database, but their evaluations were based on the average time employed by curators to fill their forms and are thus not really comparable with our own methodologies for testing overall curation, where completion time is not as critical as annotation accuracy.
By going over automatically-curated information from an extensive spectrum of sources, human users can not only verify information already curated by more traditional means but can also encounter information that either escaped manual curation or can help contextualize previously captured information. An accurate pre-annotation of non-curated text is also possible with automatic processing, which can suggest, as well, better search heuristics for finding relevant literature. We believe that a parallel manual/automatic curation process, as shown in Figure 1, can achieve results that represent the best of both worlds. This two-pronged approach can ensure broad coverage of the data without significant loss of precision, since a human expert will still vet all information added to the reference database. At present we are evaluating the integration of NLP-aided tools to enrich and enhance RegulonDB curation. NLP tools will require a different type of curation, devoted initially to revising the false positives and negatives, in order to detect what failed in each case, thus providing feedback to the system developers in addition to the actual gathering of biological information, but not replacing manual curation.
Besides some of the challenges already mentioned in the previous section for a fully-automatic curation system, some of the system's shortcomings have to do with mundane reasons, such as incomplete named-entity dictionaries, imperfect format conversion, word tokenization, reporter-gene occurrences, etc. Incremental enhancement in these areas is time-consuming, but it is also perfectly feasible and does not represent especially difficult technical hurdles. A more complex challenge for automated curation is the interpretation of tables and figures, which help authors present multiple interactions or their evidence in a manner well suited for human readers, but enormously difficult for computer processing.
In this work we have put forward an xml-based markup language (RNM2L) to represent text-mined regulatory networks in a way that is useful both to biologists and to NLP specialists alike. We have also suggested using a combination of manual and automatic literature curation to achieve precision in the annotation without sacrificing completeness. Our experiences with automatic and semi-automatic annotation have shown that Natural Language Processing can be an extremely useful tool for curatorial efforts, although it is still far from being able to do full text interpretation or locate all the information that a trained human expert can. An important conclusion we have drawn is that since Information Extraction systems can as easily handle large amounts of documents as a more limited datasets, using extensive search strategies combined with more focused ones can provide both a satisfactory coverage of the potential information, as well as good precision, without too many false positives to review. A rule-based system will not obtain all the network interactions that a human curator team can, but will not overwhelm them with non-relevant instances, unlike a statistical-based one that can potentially be more comprehensive but also more noisy. Methods for filtering interactions that are based on, for example, the number of sentences attesting the same annotation or on the presence of known transcription factors, are being explored by our group and others , but the amount of interactions and the time needed to review them by curators in order to find overlooked or novel information, is still an interesting open question. Manual curation of the output of automatic processes is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages. By combining the exacting precision of human readers with the tireless abilities of Information Extraction systems to rapidly cover a lot of ground with reasonable accuracy, we believe that genomic data on other organisms less studied than E. coli can be obtained for the high-throughput methods of Systems Biology and Genomics.
In contrast, acnB expression is activated by CRP and repressed by ArcA, FruR and Fis from PacnB.
Chunk-parsed sentences were converted into an intermediate xml format and processed by customized heuristics modules that identify the regulatory interactions and establish, whenever possible, the kind of the interaction extracted (activation or repression). The complete system customized for E. coli can be downloaded from our server .
We thank Jasmin Saric for providing us with some materials used in their previous work, mainly some dictionaries for semantic typing. We acknowledge partial support for this work from NIH grant number 5R01 GM071962 02.
- Karp PD: Pathway databases: a case study in computational symbolic theories. Science 293(5537):2040–4. 2001 Sep 14 2001 Sep 14 10.1126/science.1064621View ArticlePubMedGoogle Scholar
- Keseler IM, Collado-Vides J, et al.: EcoCyc: a comprehensive database resource for Escherichia coli . Nucleic Acids Res 33: D334–7. 2005 Jan 1 2005 Jan 1 10.1093/nar/gki108PubMed CentralView ArticlePubMedGoogle Scholar
- Cherry JM, Ball C, Weng S, Juvik G, Schmidt R, Adler C, Dunn B, Dwight S, Riles L, Mortimer RK, Botstein D: Genetic and physical maps of Saccharomyces cerevisiae . Nature 1997, 387(6632 Suppl):67–73.PubMed CentralPubMedGoogle Scholar
- Grivell L: Mining the bibliome: searching for a needle in a haystack? EMBO Rep 2002, 3(3):200–3. 10.1093/embo-reports/kvf059PubMed CentralView ArticlePubMedGoogle Scholar
- Yandell MD, Majoros WH: Genomics and natural language processing. Nature Reviews – Genetics 2002, 3(8):601–10.PubMedGoogle Scholar
- Krallinger M, Erhardt RA, Valencia A: Text-mining approaches in molecular biology and biomedicine. Drug Discov Today 10(6):439–45. 2005 Mar 15 2005 Mar 15 10.1016/S1359-6446(05)03376-3View ArticlePubMedGoogle Scholar
- Cohen AM, Hersh WR: A survey of current work in biomedical text mining. Brief Bioinform 2005, 6(1):57–71. Review Review 10.1093/bib/6.1.57View ArticlePubMedGoogle Scholar
- Scherf M, Epple A, Werner T: The next generation of literature analysis: integration of genomic analysis into text mining. Brief Bioinform 2005, 6(3):287–97. 10.1093/bib/6.3.287View ArticlePubMedGoogle Scholar
- Grishman R, Sundheim B: Message Understanding Conference – A brief history. Proc of the 16th International Conference on Computational Linguistics, Copenhagen 1996.Google Scholar
- Hirschman L, Yeh A, Blaschke C, Valencia A: Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 2005, 6(Suppl 1):S1. Epub 2005 May 24 Epub 2005 May 24 10.1186/1471-2105-6-S1-S1PubMed CentralView ArticlePubMedGoogle Scholar
- Demetriou G, Gaizauskas R: Utilizing Text Mining Results: The PastaWeb System. Proc of the Association for Computational Linguistics Workshop on Natural Language Processing in the Biomedical Domain, Philadelphia, US 77–84. July 11 2002 July 11 2002Google Scholar
- Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A: GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 2001, 17(Suppl 1):S74-S82.View ArticlePubMedGoogle Scholar
- Hoffmann R, Valencia A: A gene network for navigating the literature. Nature Genetics 2004, 36: 664. 10.1038/ng0704-664View ArticlePubMedGoogle Scholar
- Corney D, Buxton BF, Langdon WB, Jones DT: BioRAT: Extracting Biological Information from Full-length Papers. Bioinformatics 20(17):3206–13. Nov 22 2004 Nov 22 2004 10.1093/bioinformatics/bth386View ArticlePubMedGoogle Scholar
- Saric J, Jensen L, Rojas I: Large-scale Extraction of Gene Regulation for Model Organisms in an ontological context. Silico Biology 2005, 5(1):21–32.Google Scholar
- Saric J, Jensen L, Ouzounova R, Rojas I, Bork P: Extraction of regulatory gene expression networks from PubMed. In Proc of the Conference of the Association of Computational Linguistics. Volume 22. Barcelona, Spain; 2006:645–650.Google Scholar
- Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-Solano F, Santos-Zavaleta A, Martinez-Flores I, Jimenez-Jacinto V, Bonavides-Martinez C, Segura-Salazar J, Martinez-Antonio A, Collado-Vides J: RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res 34: D394–7. 2006 Jan 1 2006 Jan 1 10.1093/nar/gkj156PubMed CentralView ArticlePubMedGoogle Scholar
- Hucka M, Finney A, Bornstein BJ, Keating SM, Shapiro BE, Matthews J, Kovitz BL, Schilstra MJ, Funahashi A, Doyle JC, Kitano H: Evolving a lingua franca and associated software infrastructure for computational systems biology: the Systems Biology Markup Language (SBML) project. System Biology (Stevenage) 2004, 1(1):41–53. 10.1049/sb:20045008View ArticleGoogle Scholar
- RegulonDB Information Extraction site[http://www.ccg.unam.mx/ccg-ie/]
- Rodriguez-Esteban R, Iossifov I, Rzhetsky A: Imitating Manual Curation of Text-Mined Facts in Biomedicine. PLoS Comput Biol 2006, 2(9):e118. 10.1371/journal.pcbi.0020118PubMed CentralView ArticlePubMedGoogle Scholar
- Karamanis N, Lewin I, Sealy R, Drysdaley R, Briscoe E: Integrating Natural Language Processing with Flybase Curation. Proc from Pacific Symposium on Biocomputing, in press.
- Krallinger M, Rodriguez-Penagos C: Integrating literature mining and biological knowledge for regulation extraction. unpublished manuscript 2007.Google Scholar
- Schmid H: Probabilistic part-of-speech tagging using decision trees. In Proc of International Conference on New Methods in Language Processing. Manchester, England; 1994:44–49.Google Scholar
- Abney S: Partial parsing via finite-state cascades. In Proc of the ESSLLI '96 Robust Parsing Workshop. Prague, Czech Republic; 1996:8–15.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.