Semantically linking molecular entities in literature through entity relationships
© Van Landeghem et al; licensee BioMed Central Ltd. 2012
Published: 26 June 2012
Skip to main content
© Van Landeghem et al; licensee BioMed Central Ltd. 2012
Published: 26 June 2012
Text mining tools have gained popularity to process the vast amount of available research articles in the biomedical literature. It is crucial that such tools extract information with a sufficient level of detail to be applicable in real life scenarios. Studies of mining non-causal molecular relations attribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to enhance integration of text mining results with database facts.
We describe, compare and evaluate two frameworks developed for the prediction of non-causal or 'entity' relations (REL) between gene symbols and domain terms. For the corresponding REL challenge of the BioNLP Shared Task of 2011, these systems ranked first (57.7% F-score) and second (41.6% F-score). In this paper, we investigate the performance discrepancy of 16 percentage points by benchmarking on a related and more extensive dataset, analysing the contribution of both the term detection and relation extraction modules. We further construct a hybrid system combining the two frameworks and experiment with intersection and union combinations, achieving respectively high-precision and high-recall results. Finally, we highlight extremely high-performance results (F-score > 90%) obtained for the specific subclass of embedded entity relations that are essential for integrating text mining predictions with database facts.
The results from this study will enable us in the near future to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed. The recent release of the EVEX dataset, containing biomolecular event predictions for millions of PubMed articles, is an interesting and exciting opportunity to overlay these entity relations with event predictions on a literature-wide scale.
Due to the exponential growth of the biomedical literature, text mining tools have become crucial to process all available information contained in literature databases such as PubMed. Text mining can offer automatically generated summaries to the expert user who needs to retrieve all knowledge on a certain topic or stay up-to-date with recent findings. The level of detail of the extracted information ranges from simple binary interactions, such as protein-protein interactions [1, 2] or gene-disease associations [3, 4], to a more complex event representation [5–7]. All these relations typically involve one or multiple genes or gene products (GGPs).
GGPs are represented by gene symbols or synonyms and can be linked to database identifiers. For instance, Esr-1 refers to Entrez Gene ID 2099. Similarly, the full term human Esr-1 gene can be linked to the same ID. However, a complex noun phrase should not always be resolved to the embedded gene symbol. For example, the phrase Esr-1 inhibitor refers to an entirely different molecular entity.
Understanding complex noun phrases with embedded gene symbols is thus crucial for a correct interpretation of text mining results . Such non-causal relations between a noun phrase and the embedded gene symbol are being referred to as entity relations , or, in previous work, static relations . This type of relationship may also occur between two different noun phrases within one sentence. Typically, such relations hold between two molecular entities without necessary implication of causality or change. Entity relation types include Equivalence, Locus, Protein-Component, Member-Collection and Subunit-Complex.
The REL supporting task [9, 11] of the BioNLP Shared Task (ST) of 2011  was focused on extracting entity relations, contributing to the general goal of the ST to support more fine-grained text predictions. Furthermore, by formally defining these relations, a text mining module is able to establish semantic links between various molecular entities found in text (e.g. inhibitors, promoter constructs, gene families, etc.).
A more detailed explanation of the entity relations and the corresponding datasets is provided in the next section. Additionally, we describe two machine learning frameworks applied to the prediction of such relations. The Turku Event Extraction System (TEES) provides a largely unified extraction approach for all BioNLP ST'11 sub-challenges, with relatively minor adaptations specifically for the REL task. The Ghent Text Mining (GETM) framework on the other hand, contains several novel REL-specific modules, including the deduction and application of semantic similarities between domain terms, measured using latent semantic analysis and a manually annotated corpus.
Further, we show how feature selection techniques, in combination with the GETM framework, can be used to analyse and visualise the most discriminative patterns in the data in a structured fashion, offering valuable insights into the classification challenge. Finally, we analyse the performance and strengths of both frameworks on different datasets, analysing the contribution of both the term detection and the relation extraction modules. We conclude the paper by discussing the usage of entity relations for large-scale integrative data mining tools.
Examples of entity relations
Type of relation
[human interleukin 2 gene]
[TNF-alpha mRNA transcripts]
[Ikaros family members]
[inflammatory cytokine genes] including TNF, IL-1, and IL-6
[alpha globin regulatory element]
[tyrosine] phosphorylation of STAT1
p50 or relA, the two major subunits of [NF-kappaB]
Equivalence (GENIA - E)
Functional (GENIA - E)
Locus (GENIA - E)
Member-Collection (GENIA - E)
Misc (GENIA - E)
Object-Variant (GENIA - E)
Out-of (GENIA - E)
Protein-Component (GENIA - E)
Subunit-Complex (GENIA - E)
Member-Collection (GENIA - NE)
Protein-Component (GENIA - NE)
Subunit-Complex (GENIA - NE)
In the ST data, two types of entity relations are defined. A Subunit-Complex relation holds between a protein complex and its subunits, while a Protein-Component relation is less specific and involves a GGP and its components, such as protein domains or gene promoters. The GENIA relation corpus contains several additional types, including Equivalence and Member-Collection, which expresses a relationship between e.g. a gene family and its members. This corpus is further divided into 'embedded' and 'non-embedded' relations, the first being relations occurring within a noun phrase , and the latter containing broader relations between nominals . All these datasets are available at the GENIA project webpage: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA.
A supervised learning framework is a perfect match to deal with entity relations, as statistical properties can be drawn from the grammatical and lexical structures in the training data, providing a way to generate plausible hypotheses on unseen data. In this section we describe two machine learning frameworks developed for the extraction of entity relations: the winning system of the ST 2011, implemented by Turku University , and the system that achieved second place, by Ghent University .
The Turku Event Extraction System (TEES) is a generalized biomedical relation extraction tool based on a unified, extensible graph representation, where word entities constitute the nodes, and edges between them define the relations (or 'events'). The system consists of a pipeline of three main components based on support vector machines (SVMs) . First, entity nodes are predicted for each word token in a sentence. Then, argument edges are predicted for each pair of such nodes. Finally, the resulting graph is 'pulled apart' by classifying subgraphs as actual events/relations or not. These main steps can be followed by several post-processing steps, such as prediction of speculation and negation, or conversion to the ST file format. For a detailed description of the general system, see .
TEES relies heavily on syntactic dependency parses, represented as graphs of word token nodes and dependency edges. The system uses the Charniak-Johnson parser  with David Mc-Closky's biomodel  trained on the GENIA corpus  and unlabeled PubMed articles. The parse trees produced by the Charniak-Johnson parser are further processed with the Stanford conversion tool , creating a dependency parse . The parse is the main source of features for all the SVM classification steps.
The term detection component classifies each word token in the sentence as being a domain term or not. Multi-token terms are always represented by a single token, their syntactic head. For term detection, features are mostly based on dependency paths, generated up to a depth of three and centered on the candidate token. Word tokens have many attributes that are used as features, such as part-of-speech tags, dependency types, the word itself and its stem using the Porter stemmer . All examples are classified with the SVM multiclass software, using a linear kernel .
The term detection module is optimized in isolation, but this optimal F-score may not always be best for overall system performance. A recall boosting step multiplies the negative class weight with a set value, trading precision for increased recall. This results in more entities being available for the edge detection step, and a higher final F-score.
After the prediction of domain terms, the edge detection component predicts argument edges between the given GGP names and the predicted terms. One potential edge candidate is generated for each GGP-term pair in a sentence, and these are classified as Subunit-Complex, Protein-Component, Member-Collection, Equivalence or negative. For edge detection, features are largely based on the shortest connecting path of dependencies between the two nodes of respectively the GGP and the domain term. All examples are classified with the same SVM software as for term detection. Since entity relations are pairwise, no further processing is required, and the resulting graph can be directly converted to the ST format.
In contrast to TEES, the GETM framework, with its REL-specific modules, has not been published yet (with the exception of a short poster abstract ). In the next sections, we therefor describe the GETM REL-specific modules in a bit more detail.
To fully understand the relationship between a GGP and a domain term, it is necessary to account for the usage of synonyms and lexical variants in human language. To capture this textual variation, two strategies were implemented for creating semantic lexicons, grouping similar words together. The first method takes advantage of manual annotations of semantic categories in 1000 articles. The second method relies on statistical properties of nearly 15,000 articles.
The GENIA term corpus contains manual annotations of various domain terms such as promoters, complexes and other biological entities . These annotations are used to link certain lexical patterns to semantic categories, such as DNA-domain-or-region and protein-family-or-group. The GENIA term corpus consists of the same 1000 abstracts as the combined training and development ST data, and is therefore a very suitable additional data source.
In addition to using the GENIA term corpus, semantic spaces were calculated to deduce the underlying similarities between domain terms. A semantic space can be defined as a mathematical representation of a text corpus, containing high-dimensional vectors that capture the context in which certain words are used. Similar vectors then represent semantically similar words. By applying semantic spaces in this study, we aim at finding clusters of closely related biomolecular concepts, such as complex and heterodimer.
In a first step, a large-scale corpus is collected containing 14,958 PubMed articles concerning the topic of the GENIA corpus: human transcription factor blood cells. Next, all words are transformed to their lowercase variants and the Porter stemming algorithm is used for generalization purposes .
The actual semantic spaces are then built with the open-source S-Space Package . This package contains implementations of several semantic algorithms that have been extensively documented, tested and validated. We have experimented with latent semantic analysis (LSA) , random indexing , HAL  and COALS . By running these semantic algorithms on the nearly 15 thousand articles, we obtain datasets of terms linked to their semantic vectors. In a final step, these semantic vectors are clustered into meaningful groups. Clustering was performed using the Markov Cluster algorithm  with the cosine similarity measure.
Some terms in the clusters can be assigned a gold classification label by looking at the training portion of the GENIA relation corpus. For example, the domain term "complex" is always associated with a Subunit-Complex relation. The number of such gold labels in each cluster is represented by Known and the homogeneity of a cluster (HG) expresses the internal agreement between these labels. The homogeneity HG is multiplied by the number of unlabeled test terms (Unknown) to assess the predictive value of the cluster. The reliability (Reliability) of the cluster further expresses the percentage of known labels versus predicted ones. Clusters with relatively more known labels are deemed to be more reliable, unless the labels are highly contradicting, which would result in less homogeneity and thus a lower score. The final score metric S calculates one score for each cluster, and a clustering result is scored as the sum of all clusters.
The machine learning component of the framework identifies entity relations by analysing lexical and grammatical patterns in sentences containing GGPs.
To capture the lexical information for each sentence containing at least one GGP, bag-of-word (BOW) features are derived. In addition, n-grams, containing n consecutive words, are extracted from the sentence. All lexical information in the feature vectors is stemmed with the Porter algorithm and generalized by blinding the GGP symbol with protx and all other co-occurring GGP symbols with exprotx, while at the same time keeping the content of these symbols as separate features. Furthermore, terms occurring in the semantic lexicons (as described previously) are blinded with the corresponding cluster number or category for additional, generalized features.
For extracting grammatical patterns, the same parsing techniques are employed as used by TEES. Additionally, the semantic mappings are applied to the patterns derived from dependency graphs, and one additional generalization is obtained by substituting words for their part-of-speech tags. A few example features are depicted in Figure 1, with generalized features in italic.
The final feature vectors are classified using an SVM with a radial basis function (RBF) as kernel. The RBF kernel has been evaluated to perform best in this framework which employs several binary, type-specific classifiers in parallel . An optimal parameter setting (C and gamma) for this kernel was obtained by 5-fold cross-validation on the training data.
In the GETM framework, sentences are selected for classification if they contain at least one GGP. When the sentence is classified as containing a certain type of entity relation, it is necessary to also identify the exact domain term that is related to the GGP. To this end, a pattern matching algorithm was designed that applies a rule-based search algorithm within a given window (number of tokens) around the gene symbol. The search algorithm employs dictionaries obtained from the training data in combination with information from the semantic lexicons.
In this section we first present the official ST'11 results. We then analyse these results and the underlying frameworks in more detail by benchmarking on the GENIA relation corpus. A hybrid framework is created and validated on the (hidden) ST test set. Finally, we experiment with further combinations of the frameworks and achieve either high-precision or high-recall results.
Results on the ST data
T ∩ H
T ∪ H
A performance gap of 16 percentage points is measured between the best and second system, and another discrepancy of 9.5 percentage points between the second and third system. The relatively high performance of TEES is remarkable, as this system has not been developed specifically for the detection of entity relations, but rather is able to generalize quite well to different text mining challenges. In contrast, the GETM framework contains specific algorithms designed for the entity relations classification task such as the creation of the semantic lexicons. In this study, we aim at elucidating the performance discrepancy between the first two systems, by analysing whether most errors originate from the term recognition step or from the relation extraction module.
To analyse the performance discrepancy between TEES and the GETM framework, a number of analyses were performed on the GENIA relation corpus. The GENIA relation corpus was chosen for two main reasons. First, its scope is broader and the annotations cover several additional types of entity relations compared to the ST data (Table 2). Secondly, the availability of gold standard domain annotations in the GENIA relation corpus allow for benchmarking the relation extraction module in isolation. This also means that the results obtained here are not directly comparable to the results on the ST data, because the latter corpus does not include gold domain terms.
For the new experiments, TEES has remained unchanged, while the feature generation module of the GETM framework has been modified slightly to benefit from the specific properties of the GENIA relation corpus, optimizing feature sets for the different entity relation types (embedded vs. non-embedded). Due to the available gold standard terms, the GETM framework now classifies sentences with exactly one GGP and one term, rather than just sentences containing one GGP. Consequently, additional features describing the lexical and semantic content of the domain terms are added to the feature vectors.
The classification parameters of TEES have been optimized on the ST development corpus, which corresponds to the GENIA relation test set. For the GETM framework, the best feature set was selected after several analyses on the same data. These settings result in slightly optimistic performance values for both systems, when benchmarking on the GENIA relation data. However, the resulting overfitting only accounts for a few percentage points in F-score, and because these analyses are used for comparison between TEES and GETM, this is not considered to be a problem. This is even more the case because the hidden ST test set is the de facto standard for benchmarking and comparing different systems. The results on this dataset are described in the next section.
Results on the GENIA relation corpus
Another important result emerging from the analysis on the GENIA relation corpus, is the performance discrepancy between embedded and non-embedded types. Global performance reaches around 93-94% F-score for the embedded cases, while the non-embedded relations are predicted with an F-score of 74-76%. The embedded cases are indeed less grammatically complex than the non-embedded ones. Interestingly, they do represent an important sub-challenge of entity relations. When combining text mining results with public databases, automatically tagged GGP symbols need to be resolved to the correct record in the database. GGP symbols are often extracted by named entity recognition software such as BANNER , which applies statistical models for the recognition of GGP symbols in text, and might sometimes tag a whole noun phrase rather than just the embedded GGP name. Embedded relation types formally describe the relationship between e.g. Esr-1 and Esr-1 promoter, thus providing an automatic way of dealing with these strings and enabling a meaningful integration between text and database records. Even taking the previously described effects of overfitting into account, embedded relations can still be predicted with an F-score above 90%.
The recent release of the EVEX dataset, containing biomolecular event predictions for millions of PubMed articles [35, 36], provides an interesting opportunity to overlay these entity relations with event predictions. In this setting, entity relations could provide hubs between events concerning similar molecular entities, they could improve on the level of detail provided by the events  and finally they would be useful for large-scale normalization and integration with external databases such as Entrez Gene from NCBI  or Uniprot .
To test the hypothesis that the GETM framework lags behind because of its term detection module, a hybrid framework was created by combining the term detection module of TEES with the GETM relation detection module. This framework is tested on the official ST test data and it performs almost equally well as the original submission by TEES (1.15 percentage points lower F-score, Table 4). This result clearly shows the huge impact of the term detection module on the final results, as the relation extraction modules perform almost equally well. Apparently, the SVM-based term detection module of TEES performs much better than the rule-based approach implemented in the GETM framework, resulting in a much higher global performance result on the ST data.
Even though the performance of TEES and the hybrid framework are very similar, there is still a considerable variability in the underlying predictions, as the relation extraction component differs significantly. Consequently, we can further experiment with ensemble methods to combine both systems. Considering we only have access to two high-performing systems, the options for creating combinations are limited.
First, the intersection of the two systems was created. Comparing two relations across the different frameworks is straightforward because they use the same GGP occurrences (gold annotations) and the same domain terms (predicted by TEES). The results are shown in Table 4. Obviously, an intersection could never improve on recall compared to the original submission by TEES, but we do find a precision increase of 2.99 and 8.30 percentage points for Protein-Component and Subunit-Component respectively. The resulting F-score is 0.19 percentage points higher. While this is marginally higher than the original submission with TEES, the difference is not statistically significant, and this new framework is more complex as it needs to train two different classifiers. Finally, it is important to note that any machine learning framework can in theory be tuned to achieve either high recall or high precision by applying the well-known precision-recall trade-off .
The union of TEES and the hybrid system was subsequently constructed aiming at higher recall rates while still benefiting from the relatively high precision rates of both systems. However, this approach seems to include many irrelevant false positives (Table 4, last row). Recall rises with 2.99 and 1.84 percentage points for Protein-Component and Subunit-Component respectively, but F-score drops with 1.57 percentage points compared to the original submission with TEES.
We have presented the two best frameworks for the BioNLP ST'11 REL challenge, discussing the application of machine learning techniques, semantic spaces and feature selection for this task. We further analysed the performance discrepancy of 16 percentage points between these two frameworks as observed on the ST data. Benchmarking on a related and more extensive dataset has guided the construction of a hybrid framework which combines the TEES term recognition module with the GETM relation detection module. From these experiments, it became clear that the term detection module has a much higher impact than the relation extraction module on the final performance, and future development efforts in this field should thus focus more on accurate detection of the domain terms.
The extraction of entity relations from text has several interesting applications, such as retrieval and semantic labeling of various molecular concepts within and across articles. Additionally, the prediction of entity relations can also have an impact on the prediction of biomolecular events from text. Finally, knowledge on entity relations is a crucial step for the normalization of biological entities automatically extracted from text with named entity recognition software. We have shown that we can predict the class of embedded entity relations, necessary for such normalization efforts, extremely well (F-score above 90%).
The results obtained in this study will enable us in the near future to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed, exploiting these relations for further refinements and improvements of large-scale text mining efforts.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 11, 2012: Selected articles from BioNLP Shared Task 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S11.
The authors would like to thank the Shared Task organizers for providing the dataset and evaluation framework for this task. Furthermore, we would like to thank Sampo Pyysalo and Tomoko Ohta for their work on the GENIA relation corpus and fruitful discussions.
SVL and TA would like to thank the Research Foundation Flanders (FWO) for funding their research. Additionally, TA is a post doctoral fellow of the Belgian American Education Foundation. SVL, TA, BDB and YvdP acknowledge the support of Ghent University (Multidisciplinary Research Partnership "Bioinformatics: from nucleotides to networks").
JB and TS thank CSC - IT Center for Science Ltd for providing computational resources.