Content-rich biological network constructed by mining PubMed abstracts
© Chen and Sharp. 2004
Received: 27 May 2004
Accepted: 08 October 2004
Published: 08 October 2004
Skip to main content
© Chen and Sharp. 2004
Received: 27 May 2004
Accepted: 08 October 2004
Published: 08 October 2004
The integration of the rapidly expanding corpus of information about the genome, transcriptome, and proteome, engendered by powerful technological advances, such as microarrays, and the availability of genomic sequence from multiple species, challenges the grasp and comprehension of the scientific community. Despite the existence of text-mining methods that identify biological relationships based on the textual co-occurrence of gene/protein terms or similarities in abstract texts, knowledge of the underlying molecular connections on a large scale, which is prerequisite to understanding novel biological processes, lags far behind the accumulation of data. While computationally efficient, the co-occurrence-based approaches fail to characterize (e.g., inhibition or stimulation, directionality) biological interactions. Programs with natural language processing (NLP) capability have been created to address these limitations, however, they are in general not readily accessible to the public.
We present a NLP-based text-mining approach, Chilibot, which constructs content-rich relationship networks among biological concepts, genes, proteins, or drugs. Amongst its features, suggestions for new hypotheses can be generated. Lastly, we provide evidence that the connectivity of molecular networks extracted from the biological literature follows the power-law distribution, indicating scale-free topologies consistent with the results of previous experimental analyses.
Chilibot distills scientific relationships from knowledge available throughout a wide range of biological domains and presents these in a content-rich graphical format, thus integrating general biomedical knowledge with the specialized knowledge and interests of the user. Chilibot 10.1186/1471-2105-5-147 can be accessed free of charge to academic users.
A comprehensive understanding of the rapidly expanding corpus of information about the genome, transcriptome, and proteome at large scale requires extensive integration with existing knowledge that often pertains to a number of biological disciplines. Despite the existence of specialized databases (e.g. [1, 2]), most of this knowledge is still stored in the form of unstructured free-texts. Different approaches have been developed that automatically retrieve information on molecular interactions from the biomedical literature. Some assume that the co-occurrence of gene/protein names in texts corresponds to a biological relationship [3, 4]. Others assign relationships based on similarities in the texts of abstracts [5–7]. While computationally efficient, these methods do not characterize each interaction (e.g., inhibition versus stimulation, directionality). Furthermore, relationships are supported by minimal documentation, other than PubMed IDs. Natural language processing (NLP) has also been used as the basis of programs designed to retrieve more detailed information about molecular relationships ([8–11], reviewed in [12, 13]). However, many of these programs were built for testing purposes and are not available to the scientific community at large .
Recent analyses of several types of biological networks (e.g. metabolic , proteomic , and transcriptomic  networks) have found that their connectivities followed the power-law distribution, specifying that the probability of any node connecting to "k" other nodes is proportional to 1/kn. These networks are classified as scale-free networks and are in direct contrast to the bell-shaped distributions seen in random networks . Since most nodes in a scale-free network have very few connections, yet a few nodes (i.e., hubs) have a large number of connections, scale-free networks are robust, resisting the random failure of nodes, but vulnerable if hubs fail. To facilitate comparisons to the structure of other biological networks, the connectivity of networks constructed by Chilibot were analyzed and found to follow the power-law distribution characteristic of scale-free topologies.
The overall goal of Chilibot is to generate graphical representations of the relationships among user provided terms (e.g. molecules, concepts, etc). This is achieved by automatically querying the PubMed literature database and extracting information using natural language processing (NLP) techniques.
Chilibot is an Internet-based application . The system has been tested on FreeBSD and Red Hat Linux operating systems. Users interact with the Chilibot server from web-browsers (e.g. Mozilla Firefox, Netscape, or Microsoft Internet Explorer). Batch queries can also be conducted, but only from the server side.
Chilibot dictionary of gene/protein synonyms.
Number of gene symbols collected
The texts (including each title and abstract) are then parsed into units of one sentence, which has been shown to yield higher performance levels than paragraphs or phrases in the identification of relationships from MEDLINE abstracts . Sentences containing both query terms or their synonyms are subjected to part-of-speech (POS) tagging using the TnT tagger , which is followed by shallow parsing using CASS . A set of rules (see Methods) is followed to classify these sentences into one of five categories: stimulatory (interactive), inhibitory (interactive), neutral (interactive), parallel (non-interactive) and abstract co-occurrence only. The overall relationship between each pair of query terms is then specified based on the relationships found in the sentences (see Methods).
Retrieved relationships are visualized using AiSee (AbsInt, Angewandte Informatik GmbH, Germany). Nodes (boxes) are used to represent query terms and lines for relationships. Icons with different shapes and colors are added to the middle of each line to indicate the nature of the relationship, with arrows indicating directionality. Color coding of individual nodes can be used to report the magnitude of change in experimental data, when provided by the user; different shades of green or red represent up- or down-regulation, respectively, and more saturated colors are associated with larger changes. The weight of an interactive relationship, reflecting the number of abstracts obtained from PubMed, is displayed within the icon (figure. 1). The co-ordinates of the graphical elements are used to link the documentation of the relationships and the query terms to the map. Typically, querying a list of 10 terms takes 3–4 minutes, allowing 3 seconds between PubMed connections as requested by NCBI.
Failure to detect known DIP relationships
Undetected relationships (%)
member of protein family (name generalization)
incomplete synonym list
no reference at abstract level
To estimate precision, defined as the fraction of retrieved relationships that are relevant, we randomly selected 100 relationships from the 702 relationships recovered by Chilibot (86 interactive, 11 parallel, and 3 abstract co-occurrence). We manually confirmed that the documentation retrieved by Chilibot contained information about 96 of the targeted relationships, and the remaining four shared symbols with other genes. In the interactive category, directionality was correctly identified in 79.1% and inhibitory/stimulatory properties in 74.4%. The original data used to perform these analyses are available [see additional file 1 and 2].
One of the key features of Chilibot is its capacity to link the relationships represented in the network map directly to their supporting documentation, usually as sentences containing both of the query terms. In addition, each node is linked to its synonym list and to a set of statements demonstrating the use of the term; these statements are selected from abstract texts by an algorithm favoring conclusive statements (see Methods). By providing the literature in a condensed and highlighted form, Chilibot facilitates the rapid comprehension of the relationships by the user.
Chilibot provides several options for customizing the query process and for viewing the identified relationships. Context specific searches restrict the analysis of relationships to a specific subject area, as defined by the user. Internet searches can also be customized (e.g. searching only documents in PDF format) by using Google WebAPI. Specific subsets of relationships contained in an overall relationship map can be reconfigured. For example, the user can customize the relationship map by requesting only those relationships with direct linkage to a specific node, or those that have a requisite number of supporting publications [see additional file 3 and 4 for examples].
Chilibot also identifies key index terms common to the relationship network. To do so, Chilibot uses Medical Subject Headings (MESH) , a controlled vocabulary that indexes the subjects of the documents developed by the National Library of Medicine. Chilibot ranks MESH keywords indexed in the literature that supports the relationship network. The ranking is determined by the frequencies of the keywords, as well as whether the keyword is a major or minor topic of the paper (see Methods). The top ranked keywords, reflecting the subject area(s) shared by the query terms, can serve as a guide for further reading and suggest new Chilibot queries.
Chilibot also has the capability of suggesting new hypotheses based on the retrieved network of relationships. Such hypotheses, originally described by Swanson et al. as "undiscovered public knowledge" , referred to the inference of an interaction between two items A and C, based on knowledge that A affects B and B affects C. This involves software that generates a large list of "B" terms from titles returned by PubMed queries. The user filters these terms, aided by the titles and abstracts. Variations of this method have been designed and tested by others [25, 26]. Taking a similar approach, Chilibot scans the network of retrieved relationships to find pairs of nodes that have no documented relationship, but have connections to a common tertiary node(s). These pairs of nodes are classified as having a "hypothetical relationship". The networks that contain these "hypothetical relationships", including the tertiary node(s), are then provided to the user in graphical format, with links to their documentation.
Retrospective study of the predictive capability of the "hypothetical relationships" generated by Chilibot
PI-3K PKA CAMKII ACTIN ERK TAU PKC AMPA
PKA ACTIN SYNAPTOPHYSIN PKC NMDA TAU AMPA PLC
PKA ACTIN NMDA TAU AMPA PLC
PKA ACTIN CREB PKC PLC
PKA ACTIN SYNAPTOPHYSIN PKC TAU
PKA CAMKII CREB PKC TAU
ZIF268 PKC NMDA PLC
PKA CAMKII CREB
NMDA TAU PLC ACTIN
PKC TAU CAMKII ACTIN
PKC NMDA TAU ACTIN
Based on the literature that is currently available, Chilibot identified new hypothetical relationships, such as those between synaptophysin/CREB and synaptotagmin/CREB. Currently no direct empirical evidence for these relationships is available. However, scanning the 5' untranslated region of the synaptophysin and synaptotagmin genes did show multiple CREB binding sites, providing bioinformatics-based evidence supporting the plausibility of these potential interactions. Although these examples are promising, they are hypothetical relationships. Further review of the scientific literature, such as the sentences provided by Chilibot, is required to clarify the rationale for these hypotheses.
Recent large-scale studies of metabolic , transcriptomic  and proteomic [16, 31] networks, based on analyses of experimental data, have found that their topologies belong within the class of scale-free networks.
The scale-free topology of gene/protein relationships provides another dimension for comparing and prioritizing research targets after large-scale experiments. Currently, genes or proteins with large-fold changes are generally favored for further study . However, by itself, a large-fold change may be insufficient to predict whether such molecules are pivotal in the regulation of important biological processes. For example, in many biological signaling pathways, a small increase in up-stream events (such as the binding of a peptide or hormone to its receptor(s)) is usually associated with a hundred to thousand-fold increase in down-stream events [35, 36] (e.g., activation of mitogen-activated protein kinases or the production of cAMP). Therefore, knowledge of a network's critical nodes (i.e. hubs), which may be predicted by network connectivity , is likely to increase the power and efficiency of identifying potential experimental targets capable of modifying network function.
Chilibot graphically summarizes the relationships amongst a large set of user provided terms by analyzing abstracts retrieved from the PubMed literature database. We have found in our benchmark tests that these retrieved relationships are reliable. We believe that the scientific community will benefit from this literature mining capability along with the many features that Chilibot provides, especially in an era of science when insight can be submerged in an overwhelming sea of data and modularized knowledge.
Flat text file versions of the six databases (HUGO, LocusLink, OMIM, GDB, SwissProt, and SGD) were downloaded from their corresponding ftp sites. Symbol-name pairs were extracted from the corresponding fields using Perl scripts. Names were curated to remove words that are unlikely to be used in texts, such as "partial cDNA", "fragment", etc. In addition, non-alphanumerical characters were converted into spaces. Entries with the same symbol from the six databases were then combined in a case insensitive manner. The final dictionary is stored in the Postgresql relational database.
The NCBI Eutilies, in particular Esearch and Efetch, are used in conjunction with the Perl LWP module to interact with the http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html server. Optimization was necessary because phrase or adjacency searches are not supported by PubMed. Thus, when searching for names with multiple words, it is possible to retrieve abstracts that contain all the relevant words, however the words are used in different places of the abstract. Further, PubMed has an automatic term mapping feature that converts user input according to the MESH translation table. For our purposes, we considered this an undesirable feature. After small scale testing, the query structure we selected places a title and abstract restriction tag ([tiab]) after the name of the query term. This disables the term translation feature and also treats the term as a phrase when possible, according to PubMed documentation. To test the effectiveness of this strategy, we sampled 510 names with lengths ranging from 1 to 11 words. A total of 4584 abstracts were retrieved. We were able to find the query name from 4487 (97.9%) of the abstracts. We thus constructed the pair-wise PubMed query in the following format:
(Term 1 synonym 1 [tiab] OR Term 1 synonym 2 [tiab] OR ...) AND (Term 2 synonym 1 [tiab] OR Term 2 synonym 2 [tiab] OR ...)
Many methods [e.g. [37–40]] have been developed to translate acronyms unambiguously into their full length terminology, since acronyms may have multiple meanings and become a source of false positives [3, 41]. Chilibot provides an option to verify the meaning of acronyms when they are used as the query term. When a relevant acronym first appears, Chilibot retains a phrase immediately preceding the acronym that contains the same number of words as the number of characters in the acronym. The phrase then is compared to all synonyms of the acronym, which are retrieved from the nomenclature database of Chilibot. The abstract is excluded from analysis if less than 30% of the words in the phrase are found in the synonym list.
All the context keywords provided by the user are combined with an "OR" operation. This string is then combined with the pair-wise PubMed queries, using an "AND" operation. The context keywords are not used in subsequent analyses.
A synopsis is a collection of sentences used to annotate the query terms. It is generated from the first 100 sentences that contain the specific query term or its synonyms. These sentences are sorted by a weighting mechanism that favors short, conclusive sentences. Words suggesting a conclusion, such as "suggest", "found", "show", "data" etc weights as +9 points. Starting the sentence with the query term and a verb weights as +5 points. The presence of words suggesting a negative result such as "not", "lack", "fail", "without" is weighted as -3 points. Having more than 30 words also reduces the weight by 3 points. Lastly, having keywords specified by the user adds 5 points to the weight. The 15 sentences with the highest weights are displayed.
Title and abstract texts retrieved via the Efetch utility are first parsed into individual sentences using a Perl script. Only sentences containing both of the query terms or their synonyms are subjected to NLP analysis, which includes POS tagging by the TnT software  and shallow parsing by the CASS software . Testing TnT on a small corpus of 10 PubMed abstracts (2646 words), using the supplied WSJ language model, showed 537 (20.29%) unknown words. Manual inspection identified 150 errors in the assigned POS tags. We then trained the TnT software with the GENIA corpus  (a collection of 2000 PubMed abstracts annotated with POS and other information). Re-analyzing the same 2646 words, using the customized language model, resulted in only 289 (10.92%) unknown words. Manual inspection identified 31 errors. Thus, the language model based on the GENIA corpus was used for all subsequent analyses. CASS software was used without further adjustment.
All sentences containing two query terms (or their synonyms) are classified into one of six categories: stimulatory (interactive), inhibitory (interactive), both stimulatory and inhibitory (interactive), neutral (interactive), parallel (non-interactive) and abstract co-occurrence only. Sentences are classified into interactive or non-interactive relationships based on the presence or absence of a verb phrase between the two query terms. The following exceptions apply: sentences are classified as parallel when the query terms are present in two separate clauses; sentences without a verb phrase between the query terms, but with specific terms indicating interactions such as "interaction", "bind", etc., are classified as interactive; interactive relationships are converted into parallel relationship when there is a negation (such as "not") within the same clause of the verb phrase. The interactive relationship is further classified into stimulatory, inhibitory, or neutral subtypes based on the presence or absence of words describing such relationships, including "activate", "facilitate", "increase", "induce", "stimulate", "enhance", "elevate", "inactivate", "abolish", "attenuate", "block", "decrease", "eliminate", "inhibit", "reduce", "suppress". For interactive relationships, the direction is defined as from the left query term to the right term and is reversed when passive voice is detected. To avoid the influence by spurious mistakes, the overall relationship between two terms is defined as interactive only when more than 20% of the sentences are detected as either stimulatory or inhibitory. Lastly, the co-occurrence type is assigned when the two query terms are located in the same abstract but not the same sentence. We ranked the informativeness of the relationships in the following order: both stimulatory and inhibitory, either stimulatory or inhibitory, neutral interactive, parallel, abstract co-occurrence. The overall relationship between two query terms is classified as the most informative type of relationship.
Network layout is generated using the aiSee software. Each pair of query terms identified as having relationships is specified by nodes and represented by square boxes. The relationships are represented by solid lines. A special node with unique identification (an icon) is inserted into the middle of each line. The icon is either circular or rhomboidal depending on the relationship it represents (see legend of Figure 1). The network map as well as the links from the map to the descriptions of the relationships are obtained by calling the command line interface of aiSee.
After the query session is finished, the user can request Chilibot to suggest hypothetical relationships for any node that is within the retrieved network. For each node requested (NR) by the user, Chilibot scans the retrieved network to find those nodes that are not directly linked to NR, but have connections to the same tertiary nodes as NR. Chilibot then produces a new network map for each of these "hypothetical relationships", while maintaining the links to the supporting documentation. To test the usefulness of these "hypothetical relationships" in predicting future research, a total of 22 terms (ACTIN, ACTININ, AMPA, ARC, ATF, CAMKII, CAMKIV, CREB, ERK, KV4.2, NMDA, PI-3K, PKA, PKC, PLC, SYNAPSIN I, SYNAPTOPHYSIN, SYNAPTOTAGMIN, TAU, TRKA, TRKB, AND ZIF268) were queried together with LTP (long-term potentiation). Retrospective studies were performed by querying these terms again while adding the PubMed date limiting tag "&mindate=1960&maxdate=$maxdate", where the $maxdate equals to 1990, 1995, 2000, respectively.
The MESH Keywords of the abstracts represented by the graph are collected and sorted by their weighted percentage. When the keyword is the major topic of the publication, it is weighted as 3. Otherwise, it is weighted as 1. The weights are then divided by the number of abstracts to obtain the weighted percentage.
Google WebAPI is accessed through Perl scripts. Due to the limitation of the WebAPI, the query terms are searched directly without the expanded synonyms. The URIs of the top 10 hits were retrieved from Google and then the content of these pages was obtained from their individual servers. These pages are then converted into texts, and sentences containing either one of the query terms are presented to the user. Sentences containing both of the query terms are highlighted. Links are also provided to restrict the web search to educational institutions or to files in the portable document format (PDF). Google is a trademark of Google Technology, Inc.
DIP  is a curated protein interaction database. The version of DIP database released on April 18th, 2003 contains 18494 interactions between 7141 proteins. Relationships that originated from large scale genomic or proteomic studies were excluded, reflecting poor reliability of the data  and the low probability that such interactions would be described in textual forms. Proteins with no SwissProt annotation or of yeast origin were also excluded to further reduce the number of relationships to a manageable subset. This selection procedure resulted in a total of 770 relationships.
All none-graphic files are archived with tar and compressed with bzip2 to reduce file size.
This research was supported by PHS DA-03977 (BMS) and by the University of Tennessee Center for the Neurobiology of Brain Disease.
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (10.1186/1471-2105-5-147), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.