Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research
© Bravo et al.; licensee BioMed Central. 2015
Received: 24 July 2014
Accepted: 19 January 2015
Published: 21 February 2015
Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases.
By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated to a major cause of morbidity worldwide, depression, which are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications.
BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and only a small proportion of this dataset is actually recorded in curated resources (2%), raising several issues on data prioritization and curation. We propose that joint analysis of text mined data with data curated by experts appears as a suitable approach to both assess data quality and highlight novel and interesting information.
KeywordsText mining Information extraction Big data Translational bioinformatics Biocuration Disease Machine learning Corpus development
Due to the increasing size of literature repositories, there is a strong need for tools that firstly, identify and gather the relevant information from the literature, and secondly, place it in the context of current biomedical knowledge. Nowadays, the automatic analysis of the literature by text mining approaches eases the access to information otherwise locked in millions of documents and supports translational research projects .
Despite the advances on biomedical text mining, several challenges remain to be solved in the field, such as the identification of complex relationships between entities of biomedical interest and the exploitation of the extracted information in real-case settings for supporting specific research questions in translational research. This is particularly relevant for researchers interested in human diseases, since they are struggling with the large number of publications in their domain. There is a pressing need for methods that: a) can extract information on human diseases and their genes in a precise manner; b) can be applied to large document repositories; c) allow to integrate the extracted data with other information to aid subsequent analysis and knowledge discovery. In particular, text mining tools that help in the identification of the actionable knowledge from the vast amount of data available in document repositories are key for bridging the gap between bench and bedside .
In the past, most efforts in text mining of relationships have been devoted to the identification of interactions between proteins, both due to the availability of corpora and the push driven by specific text mining challenges . In contrast, less attention has been paid to the identification of relationships involving entities of biomedical interest such as diseases, drugs, genes and their sequence variants. In the last years, however, this trend has changed and there is much more interest in gathering this kind of information [3,4]. There are examples of systems developed for identification of drug-gene interactions [5,6], drug-drug interactions [7,8], drug-indications [9,10], drug-adverse effect [11,12], gene-disease [13,14], and also systems covering different types of relationships .
An important aspect in the field of relation extraction (RE) are the different perspectives that can be used to define the relationships between entities. The relationship between two entities might be unqualified or not specified at the semantic level (e.g. “The LOXL1 gene is associated with exfoliation glaucoma”), or, on the other hand, semantically specified (e.g. “The LOXL1 gene is overexpressed in exfoliation glaucoma”). Moreover, the relationships can also be considered from the perspective of their level of certainty; that is, if the scientific statement is phrased as a fact or proven experimental observation or, alternatively, as a speculation or hypothesis (e.g. “The LOXL1 gene might be associated with exfoliation glaucoma”). Research in the area of discourse analysis has been applied to approach this latter perspective of RE [16-18]. Finally, complex representations of relations are the events as defined by the BioNLP shared tasks, that involve several participant entities, semantically-defined relationship types and their regulators .
A wide range of approaches for RE have been applied in the biomedical field, namely co-occurrence based statistics [20-22], rule-based systems [23,24], machine learning [13,25-27] and NLP-based approaches [28,29]. In particular, supervised learning approaches have shown good performance exploiting both syntactic and semantic information . Most of the studies have focused on kernel based methods to identify associations between entities [31-34]. These methods are able to classify text based on how a relationship between two entities is represented. Different kinds of features, such as word frequencies in the sentences or the relationship between words provided by phrase structure or dependency trees, can be used to represent a relationship between two entities. A common approach involves considering distance criteria like the shortest path between the candidate entities in a parse tree to unravel associations [3,35].
In this paper, we propose the combination of the Shallow Linguistic Kernel (K SL )  with a new kernel that exploits deep syntactic information, the Dependency Kernel (K DEP ), for the identification of relationships between genes, diseases and drugs. The K SL , which uses only shallow syntactic information, was successfully applied to extract adverse drug reactions from clinical reports  and drug-drug interactions . On the other hand, the K DEP exploits the syntactic information of the sentence using the walk-weighted subsequence kernels as proposed by . A major requisite of supervised learning approaches for RE is the availability of annotated corpora for both development and evaluation. Although there are several annotated corpora for identification of PPIs (LLL, AIMed, Bioinfer, HPRD50 and IEPA), manually annotated corpora for other associations are scarce . Our group has developed one of these resources, the EU-ADR corpus, that contains annotations on drugs, diseases, genes and proteins and associations between them . We used this corpus to develop a RE system for the identification of relationships between genes and diseases. In addition, we also evaluated the RE systems for the identification of relationships between diseases, drugs and their targets. As we are particularly interested in identifying associations between genes and diseases, we also developed a new corpus in this domain by a semi-automatic annotation procedure based on the Genetic Association Database (GAD), an archive of human genetic association studies of complex diseases and disorders, as a starting point. The RE module in combination with our previously reported Biomedical Named Entity Recognition or BioNER  constitutes the BeFree system (http://ibi.imim.es/befree/).
In addition to the evaluation of the performance of the RE system based on Precision (P), Recall (R) and F-score (F), that is common practice in the text mining domain, we wanted to assess the ability of the BeFree system to identify useful information in the context of concrete biomedical problems. More specifically, we applied the BeFree system for the extraction of associations between genes and diseases to two real-case scenarios: 1) the search for genes associated to one of the most prevalent diseases, depression, and 2) the population of DisGeNET, a database of gene-disease associations . In the first case study we demonstrate the ability of BeFree to identify useful information related to this particular disease. In the second case study we evaluated the application of BeFree to large-scale data extraction and integration with another knowledge source. This resulted in a large dataset on gene-disease associations (approx. 500,000 associations) that raised issues related with the quality of the extracted information. Therefore, we were faced with another challenge, which is the prioritization of the results obtained by large-scale mining of the biomedical literature. Since manual curation is not possible for this kind of large datasets, we performed a series of analysis on the data in order to gain insight on its quality and provide a discussion on the outcome.
In summary, we present a novel text mining system, BeFree, specifically focused on the identification of associations between drugs, diseases and genes. Another important contribution of this work is the GAD corpus for the evaluation of RE systems for gene-disease associations. We focus on the identification of gene-disease relationships, and analyse the outcome of the two case studies that highlight the value of the extracted information, and finally discuss the impact of this kind of approach for translational research. We address some of the current challenges in the field, such as improving RE for entities of biomedical interest, disambiguation between semantically different entities, integration with existing knowledge bases and exploitation of extracted information in real-case scenarios.
The complete set of gene-disease associations extracted by BeFree, with the supporting statements and information on the provenance, is available in DisGeNET (http://www.disgenet.org/). The corpora used in this study, including the new corpus on gene-disease associations, are available at http://ibi.imim.es/befree/#corpora. The BeFree code is available upon request.
Results and discussion
We then focus on the identification of gene-disease associations. We present the results of the real-life performance of the system and discuss its application for the identification of associations between genes and diseases in two different scenarios: a) the research on the genes involved in depression, one of the major health problems in the world, and b) the population of DisGeNET, a public database of gene-disease associations.
Identification of drug-target, gene-disease and drug-disease relationships
We assessed the performance of the K SL and K DEP kernels on the relationships available in the EU-ADR corpus . We used different combination of features to represent the associations, but only a selection of the better results is shown in the BeFree webpage (http://ibi.imim.es/befree/#supplmaterial) and some of them transcribed here (Additional file 3: Table S2). In the case of the drug-disease associations, the best performance both in terms of F-score and Recall is obtained with the K DEP kernel (Experiment 3: P 70.2%, R 93.2%, F 79.3%), using stems on the v-walk feature, while in terms of Precision the best result is obtained using POS tags on both the e-walk and v-walk features (Experiment 19: P 74.5%, R 71.5%, F 72.3%). The best results obtained by combining both kernels did not improve the performance of the dependency kernel alone (Experiment 75: P 72.0%%, R 84.0%, F 77.0%). Similar results were obtained for the gene-disease association classification, where the K DEP kernel alone achieved the best performance. The best performance in terms of F-score and Recall was obtained using stem or lemma over the v-walk features (Experiments 5 and 3: P 75.1%, R 97.7%, F 84.6%), while the best performance in terms of Precision was obtained when using lemma in v-walk and role in e-walk (Experiment 30: P 83.8%, R 71.0%, F 75.6%). Finally, for target-drug relationship, the highest Precision was obtained by the K DEP kernel with role and POS features over the v-walk and e-walk, respectively (Experiment 21: P 75.2%, R 68.1%, F 70.2%), while the highest Recall was obtained when using a combination of K SL and K DEP (Experiment 80: P 73%, R 98%, F 82.8%). The best classification in terms of F-score is achieved when using different combination of features with both kernels (see for instance Experiment 102: P 74.2%, R 97.4%, F 83.3%). Nevertheless, it is worth mentioning that the K SL kernel, which only uses shallow linguistic information, achieves competitive results in the classification of sentences containing drug-disease, gene-disease and drug-gene associations (F-score: 76.7%, 80.9%, 81.1% respectively).
In order to evaluate the results of BeFree in the context of other approaches, we evaluated the performance of SemRep [28,41] for the identification of the three relationship types using the EU-ADR corpus. SemRep is quite different than BeFree because it has been designed to identify a large variety of semantic predications taking into account the hierarchical relationships between concepts. Nevertheless, we decided to use SemRep for comparison because: 1) it is publicly available, and 2) among all the relationship types covered, some of them can be mapped to the three relationship types covered in this study (gene-disease, drug-target and drug-disease, see Methods). SemRep identified these relationship types with high precision but lower Recall than BeFree, achieveing 96% P, 36% R and 52% F1 for gene-disease associations, 95% P, 39% R and 55% F1 for drug-target and finally, 100% Precision, 40% R and 57% F1 for drug-disease (Additional file 3: Table S2). Thus, compared to SemRep, BeFree achieves more balanced results in terms of P and R for the identification of the three entity types.
Comparison of BeFree performance to previous work
Chun et al. 2006 
Bundschus et al. 2008 
Gurulingappa et al. 2012 
Kang et al. 2014 
Névéol and Lu. 2010 
Xu and Wang 2012 
Xu and Wang 2013 
Hakenberg et al. 2012 
Buyko et al. 2012 
The EU-ADR corpus is a valuable resource because it contains annotations for different types of associations, but its main drawback is its small size. There are only a limited number of corpora for entities of biomedical interest (see http://corpora.informatik.hu-berlin.de/ for a recent update). In order to test the feasibility of using a semi-automatic annotated corpus for biomedical RE, we developed a corpus from the GAD database to have a large benchmark of gene-disease associations in which to train and evaluate gene-disease classifiers. Then, we tested the classifier for gene-disease relationships on the GAD corpus by 10-fold cross-validation. Compared to the gene-disease set from the EU-ADR corpus, the GAD corpus is larger and contains a different ratio of true/false associations. Thus, it is interesting to see how the different combination of kernels and features behave in this benchmark. In addition, it contains a larger fraction of negative sentences, allowing the classification of positive (PA) and negative (NA) sentences pertaining gene-disease associations. Although these annotations are available in the EU-ADR corpus, due to its small size, it was not possible to train a classifier to distinguish between positive and negative sentences. When assessing the classification over the class TRUE, the best results where those obtained with the K SL (1: P 77.8%, R 87.2% F 82.2%). Contrasting with the results obtained on the EU-ADR corpus for gene-disease associations, K DEP alone did not work very well on the GAD corpus, and the combination of both kernels showed an improvement of the performance but was always lower than the ones obtained with K SL alone (see http://ibi.imim.es/befree/#supplbefree). In the scenario of the classification over three classes (PA vs NA vs FALSE), although the best performance in terms of Precision or Recall is obtained with combination of kernels, the best F-score is achieved by the K SL kernel with sparse bigrams, where the Precision and Recall values although not optimal are competitive (2: P 66.0%, R 73.8% F 69.6%). In summary, these results show that a corpus developed by automatic annotation from an expert-curated database on gene-disease associations can produce competitive classifiers, and that the K SL kernel with shallow linguistic information performs quite well in the classification.
Evaluation on real-life case studies
Case study on genetic basis of depression
Depression is a chronic, recurring, life-threatening disease and the second cause of morbidity worldwide, costing billions of dollars per year to the society . It is currently accepted that a variety of genetic, environmentally-driven epigenetic changes and neurobiological factors play a role in the development of depression; however the exact mechanisms that lead to the disease and affect therapy efficacy are still poorly understood. MEDLINE currently indexes more than 100,000 publications on depression, thus it is a good resource to gather information on genetic determinants of this illness. We performed an evaluation on a real-life setting to test the performance of BeFree in identifying genes associated to depression. We evaluated the results in terms of Precision, Recall and F-score of the predictions. Next, we evaluated the quality of extracted information comparing it with what is available in curated resources. We defined a document set of 270 abstracts pertaining to depression that were published during 2012. This document set was processed to identify genes and depression terms with BioNER (see Methods) and the associations between them using gene-disease models trained in EU-ADR and GAD corpora. In both cases, we used the model that in cross-validation achieved the best F-score (for EU-ADR, experiment 3; for GAD, experiment 1 and 2, see http://ibi.imim.es/befree/#supplbefree, Additional file 3: Table S2). From a total number of 830 gene-disease associations predicted by the models, we manually reviewed a subset of 100 selected at random to estimate the performance of each model. In the case of the model trained on the EU-ADR, although the Recall was almost perfect (96.6%), we observe a decrease in the Precision of the classification compared to the cross-validation scenario (59.4%). On the other hand, the model trained on the GAD corpus performed better in terms of Precision (70%) but worst in terms of Recall (59.3%) when compared to the cross-validation scenario. A model trained in the GAD corpus is also able to classify sentences containing gene-disease associations as positive, negative and false with F-score of 53.7% (data not shown). All in all, the model trained in the EU-ADR corpus, despite its small size, performed a better classification of gene-disease sentences in a real case setting (F-score 73.5%).
Large-scale analysis of gene-disease associations from the literature
We applied the BeFree system on a set of 737,712 abstracts pertaining to human diseases and their genes (see Methods for details on document selection) to identify relationships between genes and diseases. Note that our approach for NER takes into account the existing ambiguity in the nomenclature between entities of different semantic types, such as genes and diseases (see Methods for more details). This resulted in 530,347 gene-disease associations between 14,777 genes and 12,650 diseases, which were reported in 355,976 publications. DisGeNET, a database that integrates associations between genes and diseases from several sources, includes 372,465 gene-disease associations at the time of this analysis. Thus, the data identified by BeFree represent a very large dataset on gene-disease associations. Some concerns on the quality of the extracted information could be raised such as 1) errors in the text mining approach, both at the level of NER and RE, and 2) quality of the experimental evidence supporting the association. A simple way to identify both types of error would be to manually curate all the associations, but this is not a feasible task.
Thus, before delivering the data to the public through the DisGeNET knowledge portal, we conducted a series of analysis to learn more about the data and its provenance.
Data analysis and filtering
Integration with DisGeNET and data prioritization
Examples of gene-disease associations from BeFree with low score and supported by a large number of publications
Number of PMIDs (Number of PMIDS provided by BeFree)
Malignant Neoplasms (C0006826)
Breast Carcinoma (C0678222)
Malignant neoplasm of breast (C0006142)
Breast Carcinoma (C0678222)
Malignant neoplasm of breast (C0006142)
Breast Carcinoma (C0678222)
Malignant neoplasm of breast (C0006142)
Ovarian carcinoma (C0029925)
Malignant Neoplasm of Ovary (C1140680)
Ovarian Neoplasm (C0919267)
Characterization of BeFree data
Our results show that a kernel based approach using both morpho-syntactic and dependency information performs competitively for the identification of drug-disease, drug-target and gene-disease relationships from free text. Although the exact combination of features that yield better results depends both on the association type and the corpus used for training the system, the use of shallow linguistic information is enough to produce accurate RE classifiers to recognize these associations. A supervised learning RE system for gene-disease associations trained on different corpora (EU-ADR, GAD) with very different characteristics is able to identify gene-disease associations in real-case scenarios with good performance. As previously suggested by others , a corpus developed by semi-automatic annotation is a good resource for developing a RE system in biomedicine.
The scientific literature is a rich resource for extracting gene-disease associations, even considering only abstracts from a specific subset of MEDLINE.
Only a small proportion of the gene-disease associations discovered by text mining are collected in expert curated databases. There is a pressing need to find alternative strategies to manual curation to review, prioritize and curate these associations and incorporate them into domain-specific databases.
A first and important step is to extract this information and put it in a standardized format to allow its integration with other data sources and their subsequent analysis for different purposes.
Joint analysis of data derived by text mining with data curated by experts appears as a suitable approach to assess data quality and identify novel and interesting information.
A large proportion of the associations are supported by only one publication, raising concerns on data reproducibility but also pointing out novel putative targets for research and innovation.
There are important differences in the use of disease terminologies between database curators and the authors of publications, and also in the level of granularity of disease concepts to describe a disease phenotype. This is a current challenge for large-scale disease data integration that aims to gather a comprehensive coverage of disease and ensure systematic interoperability across biomedical domains.
Biocuration of large data sets, a.k.a. big data, is becoming a bottleneck for biomedical research. Recently, the crowdsourcing approach has attracted interest in the bioinformatic domain and holds promise for biocuration tasks . As more and more scientific groups are extracting knowledge blocked in free text by text mining and exposing it to the public domain, another upcoming question is the meta-curation of such deluge of data. In this regard, the nanopublication concept, based on Semantic Web triple-assertions, is a promising approach to aid the prioritization of associations based on the supporting evidence . This kind of approaches could be applied to large datasets such as the gene-disease associations extracted by BeFree.
In summary, the study presented here highlights the importance of performing several steps of data analysis on large data sets, before using the data for further bioinformatic analysis and even to feed it to the curation pipeline of a database. We suggest that this kind of iterative process of data extraction, analysis and refinement of data extraction methodology should be applied to other approaches aimed at extracting large-scale information from the literature.
One of the main limitations of this work is that, due to the type of annotations currently available in the corpora for gene-disease, drug-disease and drug-target associations, the relationships identified are not semantically typed. For instance, for a drug-target relationship, it is not possible to know if the drug is an agonist, or if it inactivates the target. Availability of the corpora for these relationships annotated at the semantic level will allow the development of a system able to type these relationships. Another limitation is the focus on relationships that are stated at the sentence level, not handling anaphora to detect associations that go beyond the boundaries of a sentence. Future work in both these directions will provide a system with a higher Recall that provides better definition of relationships from the semantic point of view. Another aspect that we will like to explore is the identification of contextual information of the relationship. For example, in the case of gene-disease associations, we would like to know the experimental method used to detect this association (GWAS, exome sequencing, transcriptomic analysis, etc.), information on cell-type or tissue (e.g. adrenal cells), population tested (e.g. caucasic males), etc.
We present the development of the BeFree system, composed of a biomedical named entity recognition system (BioNER, presented in ) and a kernel-based KE.
Kernel based RE
In order to implement a RE for different relationships (drug-target, drug-disease, gene-disease), we propose the combination of the Shallow Linguistic Kernel (K SL ) based on the system originally proposed by  and our Dependency Kernel (K DEP ). Both kernels are described in the next two sections.
Shallow Linguistic Kernel (K SL )
The Shallow Linguistic Kernel (K SL ), developed by  has been successfully applied to PPI, drug-side effects  and drug-drug interaction extraction . Here, we propose its application for the identification of the relationships gene-disease, drug-disease and drug-target. K SL is composed of a linear combination of the kernels K GC and K LC that provide different representations of the association between two candidate entities. The global context kernel (K GC ) is based on the assumption that an association between two entities is more likely to be expressed within one of three patterns (fore-between, between, between-after, see Figure 1d). Three term frequency vectors are obtained based on the bag-of-words approach using trigrams of tokens. Sparse bigrams were included to improve the classification performance, as suggested in the original implementation. The local context kernel (K LC ) uses orthographic and shallow linguistic features (POS, lemma, stem) of the tokens located at the left and right of the candidate entities (window size of 2). Figure 1 shows the features considered by each kernel using an example sentence.
Dependency Kernel (K DEP )
We developed the Dependency Kernel (K DEP ) to train a model to recognize relationships between the entities of interest using walk features . The syntactic dependencies of the words within a sentence can be represented as dependency graphs. Figure 2a shows the dependency graph of an example sentence extracted from MEDLINE as obtained by the Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml). The shortest path between the two candidate entities can be extracted from the dependency graph (highlighted with a solid line in the example, Figure 2b), which includes the Least Common Subsumer (LCS) node (common governor node between the two candidates, subgraph detailed in Figure 2b). In the K DEP , two types of walk features are used, the v-walk feature that is composed of node (i) -edge (i, i+1) -node (i+1) , and the e-walk feature that is composed of edge (i-1, i) -node (i) -edge (i, i+1) (both illustrated in Figure 2c). For the edges we consider the dependency relation type, while for the nodes we consider different features of the token, such as the token itself, its stem, lemma, role (if this token is candidate or not) and part-of-speech (POS) tag.
We used two manually annotated corpora: AIMed for PPIs and EU-ADR for gene-disease, drug-target and disease-drug associations. In addition, we developed a semi-automatically annotated corpus for gene-disease associations based on the GAD database. All datasets were pre-processed with a combination of tools to extract the features required by the RE system. More specifically, sentence boundaries were identified by NLTK (http://www.nltk.org/), tokens and part-of-speech (POS) tags were obtained using UIMA modules (http://www.julielab.de/), lemmas were obtained with Biolemmatizer (http://biolemmatizer.sourceforge.net/) and stems were identified with the Porter’s algorithm. Syntactic dependencies were obtained with the Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml). Both EU-ADR and GAD corpora are publicly available (http://ibi.imim.es/befree/#corpora).
The AIMed corpus is widely used for PPI extraction (ftp://ftp.cs.utexas.edu/pub/mooney/bio-data/). The AImed corpus consists of 225 MEDLINE abstracts, of which 200 abstracts describe interactions between human proteins and 25 do not refer to any interaction. There are 5625 annotated sentences, 1008 containing a true PPI (TRUE) and 4617 not containing a true PPI (FALSE).
The EU-ADR corpus contains annotations of different entities (drugs, diseases, and genes/proteins) and the relationships between them . In particular, it contains annotations of relationships between drug and diseases (drug-disease set), drug and their protein targets (drug-target set) and genes/proteins and their association to diseases (gene-disease set). In addition, each relationship is classified according to its level of certainty as: positive association (PA), negative association (NA), speculative association (SA) and false association (FA). The EU-ADR corpus is composed of 100 MEDLINE abstracts for each relationship set, and its annotation was performed by three experts. In this study we considered the relationships that result from the consensus annotation of two experts. Additional file 1: Table S1 shows the number of relationships for each set.
The Genetic Association Database (GAD) is an archive of human genetic association studies of complex diseases, including summary data extracted from publications on candidate gene and GWAS studies (http://geneticassociationdb.nih.gov/). We use GAD for the development of a corpus on associations between genes and diseases (downloaded on January 21st, 2013). We considered the annotations of relationships between a gene and a disease in a single sentence as a reference set to build this corpus. GAD contains over 130,000 records with different types of information. We selected the records satisfying the following requirements: (i) the association between gene and disease is annotated as positive or negative, (ii) the association is expressed in one sentence and (iii) the Entrez Gene identifier for the gene is provided. Although GAD provides the sentence in which a gene-disease association is stated, there is no information on the exact location of the gene and disease entities in the text. In order to develop a corpus suitable for training a gene-disease RE system, the exact location of the interacting entities in the text is required. To achieve that, we applied our own NER system (BioNER, see below) to identify the gene and disease entities in the text and normalize them to NCBI Gene and UMLS identifiers, respectively. Then, the sentences in which a given gene was found together with a specific disease, and this gene-disease association was annotated by GAD curators as positive or negative, were labelled as TRUE. In order to create a dataset containing false associations (FALSE) between a gene and a disease, that is, a gene and a disease that co-occur in a sentence but are semantically not associated, we selected the sentences with co-occurrences between a disease and a gene found by the BioNER system that were not annotated by GAD curators as gene-disease associations. Additional file 3: Table S2 shows the number of TRUE and FALSE associations that represent the GAD corpus.
Evaluation of Kernel based RE
The performance of each model for association classification was evaluated by sentence-level 10-fold cross validation in each corpus. The classifiers’ performances were assessed using P, R and F-score over the class TRUE. TRUE sentences contain real relationship between the entities analysed, in contrast with FALSE sentences where the two entities co-occur, but there is no semantic relationship between them. In the case of the GAD corpus, we also trained a classifier that distinguishes between positive, negative and false associations, and therefore the performance was assessed over the class positive (PA) and negative (NA) separately. Due to the nature of the annotations available in both corpora used, the focus of this work is on associations that are unqualified (not defined at the semantic level).
Evaluation of SemRep for identification of drug-target, gene-disease and drug-disease relationships
We performed an evaluation of the SemRep system  for identification of drug-target, gene-disease and drug-disease relationships using the EU-ADR corpus. Since the scope and types of associations covered by SemRep are quite different than the ones covered by the EU-ADR corpus, we selected a subset of the association types retrieved by SemRep and mapped them to each of the association types in the EU-ADR corpus. For gene-disease associations, we selected the following SemRep association types: AFFECTS, ASSOCIATED_WITH, AUGMENTS, CAUSES, PREDISPOSES, COEXISTS_WITH, NEG_ASSOCIATED_WITH. For drug-target, the SemRep association types selected were: AFFECTS, ASSOCIATED_WITH, AUGMENTS, DISRUPTS, DISRUPTS(SPEC), INHIBITS, INHIBITS(SPEC), INTERACTS_WITH, NEG_INTERACTS_WITH, PRODUCES, STIMULATES. For drug-disease, we selected these SemRep association types: AFFECTS, ASSOCIATED_WITH(INFER), CAUSES, NEG_AFFECTS, NEG_TREATS, PREDISPOSES, PREVENTS, TREATS, TREATS(INFER), TREATS(SPEC), USES.
We used the batch mode of the SemRep program available at http://skr3.nlm.nih.gov/. For concept recognition we used the 2012AA version of the knowledge sources, that is the most recent version available within SemRep.
Identification of entities
We identified gene and disease mentions in free text using the BioNER system . During the initial phase of the project, we evaluated several NER tools publicly available. Our requirements were that the NER tool had to be able to detect and normalize to database identifiers two types of entities: genes/proteins and diseases. After an initial evaluation, the decision was to develop our own system because none of the tools evaluated work properly for us (a brief description of the tools evaluated is presented in the Suppl. File S1). Another reason for developing our own tool was to be able to regularly update the dictionaries used by the NER to keep the data up-to-date, and to perform curation of the dictionaries to reduce errors. We also invested efforts on the problem of the ambiguities between entities (see below), which we think is not currently addressed by any other tool. We believe that this point is important for subsequent steps in the text-mining workflow, such as the identification of relationships between entities.
BioNER uses gene and disease dictionaries with fuzzy and pattern matching methods to find and uniquely identify these entity mentions in the literature. During initial analysis of the RE results we observed that a source of error was the wrong identification of entities due to ambiguities in the terminologies for diseases and genes. This is particularly problematic in the case of acronyms, where the same token can be used to refer to a disease or a gene. Thus, we introduced a series of modifications on BioNER in order to address ambiguities in the identification of a single entity type (e.g. a gene) and between different entity types (genes and diseases).
Frequently, an acronym appears after the long term is defined in the text. In this case, we compare the list of concept identifiers of both mentions (acronym, long form) to determine if the acronym refers to the long form, using an in-house developed tool. For example, in the sentence “Selective gene targeting using the carcinoembryonic antigen (CEA) promoter is useful in gene therapy for gastrointestinal cancer” (from PMID 11053994), BioNER detects the long form expression “carcinoembryonic antigen” as a gene with NCBI Gene Id 1048, and the acronym “CEA” as four different gene entities (with four NCBI Gene Ids 1087, 5670, 1084 and 1048). The concept identifier in common between the two entities (NCBI Gene Id 1048) is kept as the right annotation. If there is more than one concept identifier in common between the two entities, we look at the similarity of the terms of each concept to select the right identifier. The file gene2pubmed source from Entrez Gene was also used to select the correct identifier in these ambiguous cases. Evaluation of BioNER for gene normalization using the BioCreative II Gene Normalization (BC2GN)  resulted in very low Precision (P: 48.1% R: 80.1% F: 60.1%), which could be improved considerably when applying the above mentioned strategies to handle the ambiguities between genes (P: 74.0%, R: 76.2%, F: 75.0%).
The other type of ambiguity arises when one candidate entity in a sentence can refer to different semantic types (disease and gene). For example the symbol “APC” can refer to the gene “adenomatous polyposis coli” or to the disease “atrial premature complex”. To properly recognize the identity of the mention, we take into account the contextual information of the candidate entity. For instance, to disambiguate a candidate entity to a gene, we look for keywords such as “gene”, “protein”, “factor”, “target”, “biomarker”, etc., whereas to disambiguate a candidate entity to a disease, we look for keywords like “disease”, “disorder”, “condition”, “syndrome”, etc. We also looked at the MeSH Disease annotations of the corresponding abstract to decide if a candidate entity refers to a gene or a disease. We compared the terms of the candidate entity to the terms of the MeSH disease concepts annotated to the abstract using a soft-matching approach, and if a match was found, we annotated the candidate entity as a disease.
In addition, we performed an evaluation of the performance of BioNER in the identification and normalization of disease entities using the Arizona Disease Corpus achieving competitive (P: 72.1% R: 64.4% F: 68.0%) results compared to previous approaches [49,50].
Case study on genetic basis of depression
We defined a PubMed query to retrieve a set of document to depression and published in 2012 as follows:
(“Depression” [Mesh] OR “Depressive Disorder” [Mesh]) AND “genetics” [Subheading] AND (hasabstract[text] AND (“2012” [PDAT]) AND English[lang] AND “humans” [MeSH Terms]) NOT (“Case Reports” [PT] OR “Clinical Trial” [PT] OR “Clinical conference” [PT] OR “Clinical Trial, Phase I” [PT] OR “Clinical Trial, Phase II” [PT] OR “Clinical Trial, Phase III” [PT] OR “Clinical Trial, Phase IV” [PT] OR “Controlled Clinical Trial” [PT] OR “Randomized Controlled Trial” [PT] OR “Meta-Analysis” [PT]).
This query resulted in 270 citations (date of search March 19, 2013). The abstracts were processed with BeFree trained on GAD and EU-ADR corpora to find gene-disease associations.
Case study on large-scale analysis of gene-disease associations from the literature
We defined a PubMed query to retrieve documents pertaining to human diseases and their associated genes published from 1980:
(“Psychiatry and Psychology Category” [Mesh] AND “genetics” [Subheading]) OR (“Diseases Category” [Mesh] AND “genetics” [Subheading]) AND (hasabstract[text] AND (“1980” [PDAT] : “2014” [PDAT]) AND “humans” [MeSH Terms] AND English[lang]).
This query retrieved 737,712 citations (date of search February 25, 2014), which were processed by Befree trained on the EU-ADR corpus to identify relationships between genes and diseases.
For more details on the DisGeNET score visit the DisGeNET web page (http: //www.disgenet.org/).
The complete set of gene-disease associations extracted by BeFree, with the supporting statements and information on the provenance, are available in DisGeNET (http://www.disgenet.org). The corpora used in this study are available at http://ibi.imim.es/befree/#corpora. DisGeNET data is distributed under the Open Database License (http://opendatacommons.org/licenses/odbl/).
We thank Dr. Olga Valverde (UPF) and Dr. Marta Torrens (IMIM) for their expertise in the use case of depression. Funding: The research leading to these results has received support from Instituto de Salud Carlos III-Fondo Europeo de Desarrollo Regional (PI13/00082 and CP10/00524), the Innovative Medicines Initiative Joint Undertaking under grants agreements n°  (eTOX) and n°  (Open PHACTS)], resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind contribution. À.B. and L.I.F received support from Instituto de Salud Carlos III Fondo Europeo de Desarollo Regional (CP10/00524). The Research Programme on Biomedical Informatics (GRIB) is a node of the Spanish National Institute of Bioinformatics (INB).
- Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet. 2012;13:829–39.View ArticlePubMedGoogle Scholar
- Cases M, Furlong LI, Albanell J, Altman RB, Bellazzi R, Boyer S, et al. Improving data and knowledge management to better integrate health care and research. J Intern Med. 2013;274:321–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Hahn U, Cohen KB, Garten Y, Shah NH. Mining the pharmacogenomics literature–a survey of the state of the art. Brief Bioinform. 2012;13:460–94.View ArticlePubMedPubMed CentralGoogle Scholar
- Arighi CN, Wu CH, Cohen KB, Hirschman L, Krallinger M, Valencia A, et al. BioCreative-IV virtual issue. Database. 2014;2014:bau039–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Pakhomov S, McInnes BT, Lamba J, Liu Y, Melton GB, Ghodke Y, et al. Using PharmGKB to train text mining approaches for identifying potential gene targets for pharmacogenomic studies. J Biomed Inform. 2012;45:862–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Xu R, Wang Q. A knowledge-driven conditional approach to extract pharmacogenomics specific drug-gene relationships from free text. J Biomed Inform. 2012;45:827–34.View ArticlePubMedPubMed CentralGoogle Scholar
- Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. Pac Symp Biocomput 2012:410–21.
- Segura-Bedmar I, Martínez P, de Pablo-Sánchez C. Using a shallow linguistic kernel for drug-drug interaction extraction. J Biomed Inform. 2011;44:789–804.View ArticlePubMedGoogle Scholar
- Fiszman M, Demner-Fushman D, Kilicoglu H, Rindflesch TC. Automatic summarization of MEDLINE citations for evidence-based medical treatment: a topic-oriented evaluation. J Biomed Inform. 2009;42:801–13.View ArticlePubMedGoogle Scholar
- Névéol A, Lu Z. Automatic integration of drug indications from multiple health resources. In: Proc ACM Int Conf Heal informatics - IHI ‘10. New York, New York, USA: ACM Press; 2010. p. 666.View ArticleGoogle Scholar
- Gurulingappa H, Mateen-Rajput A, Toldo L. Extraction of potential adverse drug events from medical case reports. J Biomed Semantics. 2012;3:15.View ArticlePubMedPubMed CentralGoogle Scholar
- Kang N, Singh B, Bui C, Afzal Z, van Mulligen EM, Kors JA. Knowledge-based extraction of adverse drug events from biomedical text. BMC Bioinformatics. 2014;15:64.View ArticlePubMedPubMed CentralGoogle Scholar
- Bundschus M, Dejori M, Stetter M, Tresp V, Kriegel H-P. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics. 2008;9:207.View ArticlePubMedPubMed CentralGoogle Scholar
- Ozgür A, Vu T, Erkan G, Radev DR. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics. 2008;24:i277–85.View ArticlePubMedPubMed CentralGoogle Scholar
- Hakenberg J, Voronov D, Nguyên VH, Liang S, Anwar S, Lumpkin B, et al. A SNPshot of PubMed to associate genetic variants with drugs, diseases, and adverse reactions. J Biomed Inform. 2012;45:842–50.View ArticlePubMedGoogle Scholar
- Vincze V, Szarvas G, Farkas R, Móra G, Csirik J. The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinformatics. 2008;9 Suppl 11:S9.View ArticlePubMedPubMed CentralGoogle Scholar
- Kilicoglu H, Bergler S. Recognizing speculative language in biomedical research articles: a linguistically motivated perspective. BMC Bioinformatics. 2008;9 Suppl 11:S10.View ArticlePubMedPubMed CentralGoogle Scholar
- Nawaz R, Thompson P, Ananiadou S. Negated bio-events: analysis and identification. BMC Bioinformatics. 2013;14:14.View ArticlePubMedPubMed CentralGoogle Scholar
- Ananiadou S, Thompson P, Nawaz R, McNaught J, Kell DB. Event-based text mining for biology and functional genomics. Brief Funct Genomics 2014:elu015–.
- Jenssen T-K, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001;28:21–8.PubMedGoogle Scholar
- Hoffmann R, Valencia A. Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics. 2005;21 Suppl 2:ii252–8.View ArticlePubMedGoogle Scholar
- Wang X, Hripcsak G, Markatou M, Friedman C. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. J Am Med Inform Assoc , 16:328–37.
- Fundel K, Küffner R, Zimmer R. RelEx–relation extraction using dependency parse trees. Bioinformatics. 2007;23:365–71.View ArticlePubMedGoogle Scholar
- Caporaso JG, Baumgartner WA, Randolph DA, Cohen KB, Hunter L. Rapid pattern development for concept recognition systems: application to point mutations. J Bioinform Comput Biol. 2007;5:1233–59.View ArticlePubMedGoogle Scholar
- Chun H, Tsuruoka Y, Kim J, Shiba R, Nagata N, Hishiki T, et al. Extraction of gene-disease relations from Medline using domain dictionaries and machine learning. Pac Symp Biocomput 2006;4.
- Liu H, Hunter L, Kešelj V, Verspoor K. Approximate Subgraph Matching-Based Literature Mining for Biomedical Events and Relations. PLoS One. 2013;8:e60954.View ArticlePubMedPubMed CentralGoogle Scholar
- McClosky D, Riedel S, Surdeanu M, McCallum A, Manning CD. Combining joint models for biomedical event extraction. BMC Bioinformatics. 2012;13 Suppl 11:S9.View ArticlePubMedPubMed CentralGoogle Scholar
- Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003;36:462–77.View ArticlePubMedGoogle Scholar
- Friedman C, Liu H, Shagina L, Johnson S, Hripcsak G. Evaluating the UMLS as a source of lexical knowledge for medical language processing. Proc AMIA Symp 2001;189–193.
- Kim J-D, Ohta T, Pyysalo S, Kano Y, Tsujii J. Overview of BioNLP’09 shared task on event extraction. In: BioNLP ‘09 Proc Work Curr Trends Biomed Nat Lang Process Shar Task. Association for Computational Linguistics. 2009. p. 1–9.Google Scholar
- Chowdhury MFM, Lavelli A. Combining tree structures, flat features and patterns for biomedical relation extraction. In EACL ‘12 Proc 13th Conf Eur Chapter Assoc Comput Linguist. Association for Computational Linguistics; 2012:420–429.
- Culotta A, Sorensen J. Dependency tree kernels for relation extraction. In: Proc 42nd Annu Meet Assoc Comput Linguist - ACL ‘04. Morristown, NJ, USA: Association for Computational Linguistics; 2004. p. 423–es.View ArticleGoogle Scholar
- Giuliano C, Lavelli A, Romano L. Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature. In: 11th Conf Eur Chapter Assoc Comput Linguist (EACL ‘06). 2006. p. 401–8.Google Scholar
- Miwa M, Saetre R, Miyao Y, Tsujii J. Protein-protein interaction extraction by leveraging multiple kernels and parsers. Int J Med Inform. 2009;78:e39–46.View ArticlePubMedGoogle Scholar
- Kim S, Yoon J, Yang J, Park S. Walk-weighted subsequence kernels for protein-protein interaction extraction. BMC Bioinformatics. 2010;11:107.View ArticlePubMedPubMed CentralGoogle Scholar
- Kim S, Yoon J, Yang J. Kernel approaches for genic interaction extraction. Bioinformatics. 2008;24:118–26.View ArticlePubMedGoogle Scholar
- Hahn U, Cohen K. Mining the pharmacogenomics literature—a survey of the state of the art. Brief Bioinform. 2012;13(4):460–94.View ArticlePubMedPubMed CentralGoogle Scholar
- Van Mulligen EM, Fourrier-Reglat A, Gurwitz D, Molokhia M, Nieto A, Trifiro G, et al. The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. J Biomed Inform. 2012;45:879–84.View ArticlePubMedGoogle Scholar
- Bravo A, Cases M, Queralt-Rosinach N, Sanz F, Furlong LI. A knowledge-driven approach to extract disease-related biomarkers from the literature. Biomed Res Int. 2014;2014:253128.View ArticlePubMedPubMed CentralGoogle Scholar
- Bauer-Mehren A, Rautschka M, Sanz F, Furlong LI. DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene-disease networks. Bioinformatics. 2010;26:2924–6.View ArticlePubMedGoogle Scholar
- Semantic Knowledge Representation. [http://semrep.nlm.nih.gov/]
- Buyko E, Beisswanger E, Hahn U. The extraction of pharmacogenetic and pharmacogenomic relations--a case study using PharmGKB. Pac Symp Biocomput 2012;376–87.
- Xu R, Wang Q. Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing. BMC Bioinformatics. 2013;14:181.View ArticlePubMedPubMed CentralGoogle Scholar
- Albert PR, Benkelfat C, Descarries L. The neurobiology of depression–revisiting the serotonin hypothesis. I. Cellular and molecular mechanisms. Philos Trans R Soc Lond B Biol Sci. 2012;367:2378–81.View ArticlePubMedPubMed CentralGoogle Scholar
- Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003;4:P3.View ArticlePubMedGoogle Scholar
- Good BM, Su AI. Crowdsourcing for bioinformatics. Bioinformatics. 2013;29:1925–33.View ArticlePubMedPubMed CentralGoogle Scholar
- Mons B, van Haagen H, Chichester C, Hoen P-B ‘t, den Dunnen JT, van Ommen G, et al. The value of data. Nat Genet. 2011;43:281–3.View ArticlePubMedGoogle Scholar
- Smith L, Tanabe L, Ando R, Kuo C, Chung I, Hsu C, et al. Overview of BioCreative II gene mention recognition. Genome Biol. 2008;9:S2.View ArticlePubMedPubMed CentralGoogle Scholar
- Kang N, Singh B, Afzal Z, van Mulligen EM, Kors JA. Using rule-based natural language processing to improve disease normalization in biomedical text. J Am Med Inform Assoc. 2012;20:876–81.View ArticlePubMedPubMed CentralGoogle Scholar
- Leaman R, Miller C. Enabling Recognition of Diseases in Biomedical Text with Machine Learning : Corpus and Benchmark. In: Proc 3rd Int Symp Lang Biol Med. 2009. p. 82–9.Google Scholar
- DisGeNET - a database of gene-disease associations. [http://www.disgenet.org/]
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.