KnowLife: a versatile approach for constructing a large knowledge graph for biomedical sciences
© Ernst et al.; licensee BioMed Central. 2015
Received: 28 August 2014
Accepted: 25 March 2015
Published: 14 May 2015
Biomedical knowledge bases (KB’s) have become important assets in life sciences. Prior work on KB construction has three major limitations. First, most biomedical KB’s are manually built and curated, and cannot keep up with the rate at which new findings are published. Second, for automatic information extraction (IE), the text genre of choice has been scientific publications, neglecting sources like health portals and online communities. Third, most prior work on IE has focused on the molecular level or chemogenomics only, like protein-protein interactions or gene-drug relationships, or solely addresses highly specific topics such as drug effects.
We address these three limitations by a versatile and scalable approach to automatic KB construction. Using a small number of seed facts for distant supervision of pattern-based extraction, we harvest a huge number of facts in an automated manner without requiring any explicit training.
We extend previous techniques for pattern-based IE with confidence statistics, and we combine this recall-oriented stage with logical reasoning for consistency constraint checking to achieve high precision. To our knowledge, this is the first method that uses consistency checking for biomedical relations. Our approach can be easily extended to incorporate additional relations and constraints.
We ran extensive experiments not only for scientific publications, but also for encyclopedic health portals and online communities, creating different KB’s based on different configurations. We assess the size and quality of each KB, in terms of number of facts and precision. The best configured KB, KnowLife, contains more than 500,000 facts at a precision of 93% for 13 relations covering genes, organs, diseases, symptoms, treatments, as well as environmental and lifestyle risk factors.
KnowLife is a large knowledge base for health and life sciences, automatically constructed from different Web sources. As a unique feature, KnowLife is harvested from different text genres such as scientific publications, health portals, and online communities. Thus, it has the potential to serve as a one-stop portal for a wide range of relations and use cases. To showcase its breadth and usefulness, we make the KnowLife KB accessible through the health portal (http://knowlife.mpi-inf.mpg.de).
Keywords: Biomedical text mining; Knowledge base; Relation extraction
Large knowledge bases (KB’s) about entities, their properties, and the relationships between entities, have become an important asset for semantic search, analytics, and smart recommendations over Web contents and other kinds of Big Data [1,2]. Notable projects are DBpedia, Yago, and the Google Knowledge Graph with its public core Freebase (freebase.com).
In the biomedical domain, KB’s such as the Gene Ontology, the Disease Ontology, the National Drug File - Reference Terminology, and the Foundational Model of Anatomy are prominent examples of the rich knowledge that is digitally available. However, each of these KB’s is highly specialized and covers only a relatively narrow topic within the life sciences, and there is very little interlinkage between the KB’s. Thus, in contrast to the general-domain KB’s that power Web search and analytics, there is no way of obtaining an integrated view on all aspects of biomedical knowledge. The lack of a “one-stop” KB that spans biological, medical, and health knowledge hinders the development of advanced search and analytic applications in this field.
In order to build a comprehensive biomedical KB, the following three bottlenecks must be addressed.
Beyond manual curation. Biomedical knowledge is advancing at rates far greater than any single human can absorb. Therefore, relying on manual curation of KB’s is bound to be a bottleneck. To fully leverage all published knowledge, automated information extraction (IE) from input texts is mandatory.
Beyond scientific literature. Besides scientific publications found in PubMed Medline and PubMed Central, there are substantial efforts on patient-oriented health portals such as Mayo Clinic, Medline Plus, UpToDate, Wikipedia’s Health Portal, and there are also popular online discussion forums such as healthboards.com or patient.co.uk. All this constitutes a rich universe of information, but the information is spread across many sources, mostly in textual, unstructured and sometimes noisy form. Prior work on biomedical IE has focused on scientific literature only, and completely disregards the opportunities that lie in tapping into health portals and communities for automated IE.
Beyond molecular entities. IE from biomedical texts has strongly focused on entities and relations at the molecular level; a typical IE task is to extract protein-protein interactions. There is very little work on comprehensive approaches that link diverse entity types, spanning genes, diseases, symptoms, anatomic parts, drugs, drug effects, etc. In particular, no prior work on KB construction has addressed the aspects of environmental and lifestyle risk factors in the development of diseases and the effects of drugs and therapies.
The main body of IE research in biomedical informatics has focused on molecular entities and chemogenomics, like Protein-Protein Interactions (PPI) or gene-drug relations. These efforts have been driven by competitions such as BioNLP Shared Task (BioNLP-ST) and BioCreative. These shared tasks come with pre-annotated corpora as gold standard, such as the GENIA corpus, the multi-level event extraction (MLEE) corpus, and various BioCreative corpora. Efforts such as the Pharmacogenetics Research Network and Knowledge Base (PharmGKB), which curates and disseminates knowledge about the impact of human genetic variations on drug responses, or the Open PHACTS project, a pharmacological information platform for drug discovery, offer knowledge bases with annotated text corpora to facilitate approaches for these use cases.
Most IE work in this line of research relies on supervised learning, like Support Vector Machines [10-13] or Probabilistic Graphical Models [14,15]. The 2012 i2b2 challenge aimed at extracting temporal relations from clinical narratives. Unsupervised approaches [17-20] discover associations between genes and diseases based on the co-occurrence of entities as cues for relations. To further improve the quality of discovered associations, crowdsourcing has also been applied [21,22]. Burger et al. use Amazon Mechanical Turk to validate gene-mutation relations that are extracted from PubMed abstracts. Aroyo et al. describe a crowdsourcing approach to generate gold standard annotations for medical relations, taking into account the disagreement between crowd workers.
Pattern-based approaches exploit text patterns that connect entities. Many of them [25-28] manually define extraction patterns. Kolářik et al. use Hearst patterns to identify terms that describe various properties of drugs. SemRep manually specifies extraction rules obtained from dependency parse trees. Outside the biomedical domain, sentic patterns leverage commonsense and syntactic dependencies to extract sentiments from movie reviews. However, while manually defined patterns yield high precision, they rely on expert guidance and do not scale to large and potentially noisy inputs and a broader scope of relations. Bootstrapping approaches such as [33,34] use a limited number of seeds to learn extraction patterns; these techniques go back to [35,36]. Our method follows this paradigm, but extends prior work with additional statistics to quantify the confidence of patterns and extracted facts.
A small number of projects like Sofie/Prospera [37,38] and NELL  have combined pattern-based extraction with logical consistency rules that constrain the space of fact candidates. Nebot et al.  harness the IE methods of  for populating disease-centric relations. This approach uses logical consistency reasoning for high precision, but the small scale of this work leads to a very restricted KB. Movshovitz-Attias et al.  used NELL to learn instances of biological classes, but did not extract binary relations and did not make use of constraints either. The other works on constrained extraction tackle non-biological relations only (e.g., birthplaces of people or headquarters of companies). Our method builds on Sofie/Prospera, but additionally develops customized constraints for the biomedical relations targeted here.
Most prior work in biomedical Named Entity Recognition (NER) specializes in recognizing specific types of entities such as proteins and genes, chemicals, diseases, and organisms. MetaMap is the most notable tool capable of recognizing a wide range of entities. As for biomedical Named Entity Disambiguation (NED), there is relatively little prior work. MetaMap offers limited NED functionality, while others focus on disambiguating between genes or small sets of word senses.
Most prior IE work processes only abstracts of PubMed articles; few projects have considered full-length articles from PubMed Central, let alone Web portals and online communities. Vydiswaran et al. addressed the issue of assessing the credibility of medical claims about diseases and their treatments in health portals. Mukherjee et al. tapped discussion forums to assess statements about side effects of drugs. White et al. demonstrated how to derive insight on drug effects from query logs of search engines. Building a comprehensive KB from such raw assets has been beyond the scope of these prior works.
We present KnowLife, a large KB that captures a wide variety of biomedical knowledge, automatically extracted from different genres of input sources. KnowLife’s novel approach to KB construction overcomes the following three limitations of prior work.
Beyond manual curation. Using distant supervision in the form of seed facts from existing expert-level knowledge collections, the KnowLife processing pipeline is able to automatically learn textual patterns and harvest a large number of relational facts from such patterns. In contrast to prior work on IE for biomedical data which relies on extraction patterns only, our method achieves high precision by specifying and checking logical consistency constraints that fact candidates have to satisfy. These constraints are customized for the relations of interest in KnowLife, and include constraints that couple different relations. The consistency constraints are available as supplementary material (see Additional file 1). KnowLife is easily extensible, since new relations can be added with little manual effort and without requiring explicit training; only a small number of seed facts for each new relation is needed.
Beyond scientific literature. KnowLife copes with input text at large scale – considering not only knowledge from scientific publications, but also tapping into previously neglected textual sources like Web portals on health issues and online communities with discussion boards. We present an extensive evaluation of 22,000 facts on how these different genres of input texts affect the resulting precision and recall of the KB. We also present an error analysis that provides further insight on the quality and contribution of different text genres.
Beyond molecular entities. The entities and facts in KnowLife go well beyond the traditionally covered level of proteins and genes. Besides genetic factors of diseases, the KB also captures diseases, therapies, drugs, and risk factors like nutritional habits, life-style properties, and side effects of treatments.
In summary, the novelty of KnowLife is its versatile, largely automated, and scalable approach for the comprehensive construction of a KB – covering a spectrum of different text genres as input and distilling a wide variety of facts from different biomedical areas as output. Coupled with an entity recognition module that covers the entire range of biomedical entities, the resulting KB features a much wider spectrum of knowledge and use-cases than previously built, highly specialized KB’s. In terms of methodology, our extraction pipeline builds on existing techniques but extends them, and is specifically customized to the life-science domain. Most notably, unlike prior work on biomedical IE, KnowLife employs logical reasoning for checking consistency constraints, tailored to the different relations that connect diseases, symptoms, drugs, genes, risk factors, etc. This constraint checking eliminates many false positives that are produced by methods that solely rely on pattern-based extraction.
In its best configuration, the KnowLife KB contains a total of 542,689 facts for 13 different relations, with an average precision of 93% (i.e., validity of the acquired facts) as determined by extensive sampling with manual assessment. The precision for the different relations ranges from 71% (createsRisk: ecofactor × disease) to 97% (sideEffect: (symptom ∪ disease) × drug). All facts in KnowLife carry provenance information, so that one can explore the evidence for a fact and filter by source. We developed a web portal that showcases use-cases from speed-reading to semantic search along with richly annotated literature, the details of which are described in the demo paper.
Dictionary. We use UMLS (Unified Medical Language System) as the dictionary of biomedical entities. UMLS is a metathesaurus, the largest collection of biomedical dictionaries, containing 2.9 million entities and 11.4 million entity names and synonyms. Each entity has a semantic type assigned by experts. For instance, the entities IL4R and asthma are of semantic types Gene or Genome and Disease or Syndrome, respectively. The UMLS dictionary enables KnowLife to detect entities in text, going beyond genes and proteins and covering entities about anatomy, physiology, and therapy.
Table 1: KnowLife relations, their type signatures, and number of seeds. Type signatures include Symptom or Disease and Drug or Behavior.
Seed facts. A seed fact R(e_1, e_2) for relation R is a triple presumed to be true based on expert statements. We collected 467 seed facts (see Table 1) from the medical online portal uptodate.com, a highly regarded clinical resource written by physician authors. These seed facts are further cross-checked in other sources to assert their veracity. Example seed facts include isSymptom(ChestPain, MyocardialInfarction) and createsRisk(Obesity, Diabetes).
Overview of KnowLife’s input corpus
Anemia is a common symptom of sarcoidosis.
Eventually, a heart attack leads to arrythmias.
Ironically, a myocardial infarction can also lead to pericarditis.
where myocardial infarction and heart attack are synonyms representing the same canonical entity.
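As an illustration of how synonyms can be resolved to one canonical entity, the following minimal Python sketch performs a greedy longest-match lookup against a tiny hand-made dictionary. The dictionary contents, the canonical names, and the function name find_entities are assumptions made for this example; they stand in for the UMLS dictionary and do not reproduce KnowLife's actual entity recognition module.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-dictionary: surface name -> canonical entity name.
# A real system would load names and synonyms from UMLS instead.
SYNONYMS: Dict[str, str] = {
    "myocardial infarction": "MyocardialInfarction",
    "heart attack": "MyocardialInfarction",
    "pericarditis": "Pericarditis",
    "anemia": "Anemia",
    "sarcoidosis": "Sarcoidosis",
}


def find_entities(text: str,
                  dictionary: Dict[str, str] = SYNONYMS,
                  max_len: int = 3) -> List[Tuple[str, str]]:
    """Return (surface name, canonical entity) pairs in reading order.

    At each token position the longest dictionary name (up to max_len
    tokens) is preferred over shorter ones.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            name = " ".join(tokens[i:i + n])
            if name in dictionary:
                found.append((name, dictionary[name]))
                i += n          # skip past the matched name
                break
        else:
            i += 1              # no name starts at this token
    return found


first = find_entities("Eventually, a heart attack leads to arrythmias.")
second = find_entities(
    "Ironically, a myocardial infarction can also lead to pericarditis.")
```

Both sentences yield the same canonical entity for their first mention, which is what allows patterns from either sentence to support facts about the same entity.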
The goal of the pattern analysis is to identify the most useful seed patterns out of all the pattern candidates gathered thus far. A seed pattern should generalize the over-specific phrases encountered in the input texts, by containing only the crucial words that express a relation and masking out (by a wildcard or part-of-speech tag) inessential words. This way we arrive at high-confidence patterns.
where S_X(R_i) is the set of all entity tuples (e_1, e_2) appearing in any seed fact with relation R_i and C_X(R_i) is the set of all entity tuples (e_1, e_2) appearing in any seed fact without relation R_i. The rationale is that the more strongly a pattern correlates with the seed-fact entities of a particular relation, the more confident we are that the pattern expresses the relation. The patterns with confidence greater than a threshold (set to 0.3 in our experiments) are selected as seed patterns.
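The thresholding step can be sketched in a few lines of Python. The exact confidence formula is not reproduced in this text, so the sketch assumes one plausible ratio form: the share of a pattern's entity tuples that fall into the seed set of the relation, among all of its tuples that fall into either the seed set or the counter-seed set. The function names and the toy data are hypothetical; only the threshold of 0.3 comes from the text.

```python
from typing import Dict, Set, Tuple

EntityPair = Tuple[str, str]


def pattern_confidence(pattern_pairs: Set[EntityPair],
                       seed_pairs: Set[EntityPair],
                       counter_pairs: Set[EntityPair]) -> float:
    """ASSUMED ratio form, not necessarily the paper's exact formula.

    seed_pairs    -- entity tuples of seed facts with the relation
    counter_pairs -- entity tuples of seed facts without the relation
    """
    support = len(pattern_pairs & seed_pairs)
    against = len(pattern_pairs & counter_pairs)
    total = support + against
    return support / total if total else 0.0


def select_seed_patterns(pattern_to_pairs: Dict[str, Set[EntityPair]],
                         seed_pairs: Set[EntityPair],
                         counter_pairs: Set[EntityPair],
                         threshold: float = 0.3) -> Dict[str, float]:
    """Keep the patterns whose confidence exceeds the threshold."""
    selected: Dict[str, float] = {}
    for pattern, pairs in pattern_to_pairs.items():
        conf = pattern_confidence(pairs, seed_pairs, counter_pairs)
        if conf > threshold:
            selected[pattern] = conf
    return selected


# Toy data for the relation isSymptom (illustrative only).
seeds_is_symptom = {("Anemia", "Sarcoidosis"),
                    ("ChestPain", "MyocardialInfarction")}
seeds_other = {("Obesity", "Diabetes"),
               ("Tuberculosis", "Pericarditis"),
               ("Obesity", "Asthma")}
candidates = {
    "is a common symptom of": {("Anemia", "Sarcoidosis"),
                               ("Pain", "CrohnDisease")},
    "is associated with": {("Obesity", "Diabetes"),
                           ("Tuberculosis", "Pericarditis"),
                           ("Obesity", "Asthma"),
                           ("ChestPain", "MyocardialInfarction")},
}
seed_patterns = select_seed_patterns(candidates, seeds_is_symptom, seeds_other)
```

In this toy setting the unspecific pattern "is associated with" co-occurs mostly with tuples of other relations and is dropped, while "is a common symptom of" is kept.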
Examples of seed facts and seed patterns as well as automatically acquired patterns and facts
causes(Tuberculosis, Pericarditis)
which progresses to
causes(Pericarditis, Tamponade)
createsRisk(Obesity, Diabetes)
still progressing to
createsRisk(Wart, Skincarcinoma)
createsRisk(Obesity, Asthma)
children risk factors
createsRisk(WoodDust, Asthma)
createsRisk(Malaria, Stillbirth)
have risk factors
createsRisk(Golf, Tendinitis)
known risk factors
createsRisk(GBvirusC, Hepatitis)
isSymptom(Pain, CrohnDisease)
affects(Hashimoto′s, ThyroidGland)
affects(Pericarditis, Heart)
isSymptom(Anemia, Sarcoidosis)
The pattern analysis stage provides us with a large set of fact candidates and their supporting patterns. However, these contain many false positives. To prune these out and improve precision, the last stage of KnowLife applies logical consistency constraints to the fact candidates and accepts only a consistent subset of them.
We leverage two kinds of manually defined semantic constraints: i) the type signatures of relations (see Table 1) for type checking of fact candidates, and ii) mutual exclusion constraints between certain pairs of relations. For example, if a drug has a certain symptom as a side effect, it cannot treat this symptom at the same time. These rules allow us to handle conflicting candidate facts. The reasoning uses probabilistic weights derived from the statistics of the candidate gathering phase.
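The two kinds of constraints can be illustrated with a small Python sketch. The sideEffect signature follows the type signature given earlier in the text; the treats signature, the entity types, and the candidate facts are assumptions chosen for this example, and the function names are hypothetical. The sketch only detects violations and conflicts; resolving conflicts is left to the weighted reasoning step.

```python
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, str, str]  # (relation, left entity, right entity)

# Type signatures: relation -> (allowed left types, allowed right types).
# The treats signature is an ASSUMPTION for illustration.
SIGNATURES: Dict[str, Tuple[Set[str], Set[str]]] = {
    "sideEffect": ({"Symptom", "Disease"}, {"Drug"}),
    "treats": ({"Drug"}, {"Symptom", "Disease"}),
}

# Hypothetical entity typing; a real system would take types from UMLS.
ENTITY_TYPES: Dict[str, str] = {
    "Aspirin": "Drug",
    "Headache": "Symptom",
    "Nausea": "Symptom",
}


def type_ok(fact: Fact) -> bool:
    """Check a fact candidate against the signature of its relation."""
    relation, left, right = fact
    left_types, right_types = SIGNATURES[relation]
    return (ENTITY_TYPES.get(left) in left_types
            and ENTITY_TYPES.get(right) in right_types)


def mutual_exclusion_conflicts(facts: List[Fact]) -> List[Tuple[Fact, Fact]]:
    """Pairs violating: a drug cannot treat a symptom it causes."""
    fact_set = set(facts)
    conflicts = []
    for relation, left, right in facts:
        # sideEffect(symptom, drug) clashes with treats(drug, symptom).
        if relation == "sideEffect" and ("treats", right, left) in fact_set:
            conflicts.append((("sideEffect", left, right),
                              ("treats", right, left)))
    return conflicts


candidate_facts: List[Fact] = [
    ("treats", "Aspirin", "Headache"),
    ("sideEffect", "Headache", "Aspirin"),
    ("sideEffect", "Nausea", "Aspirin"),
    ("treats", "Headache", "Aspirin"),   # wrong argument order
]
well_typed = [f for f in candidate_facts if type_ok(f)]
conflicting = mutual_exclusion_conflicts(well_typed)
```

Type checking discards the candidate with swapped arguments, and the mutual exclusion rule flags the remaining treats/sideEffect pair for Aspirin and Headache as conflicting.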
To reason with consistency constraints, we follow the framework of , by encoding all facts, patterns, and grounded (i.e., instantiated) constraints into weighted logical clauses. We extend this prior work by computing informative weights from the confidence statistics obtained in the pattern-based stage of our IE pipeline. We then use a weighted Max-Sat solver to reason on the hypothesis space of fact candidates and to compute a consistent subset of clauses with the largest total weight. Because the weighted Max-Sat problem is NP-hard, we resort to an approximation algorithm that combines the dominating-unit-clause technique with Johnson’s heuristic algorithm. Suchanek et al. have shown that this combination empirically gives very good approximation ratios. The complete set of consistency constraints is in the supplementary material (see Additional file 1).
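To make the reasoning step more concrete, the following Python sketch applies a Johnson-style greedy heuristic to a tiny weighted Max-Sat instance. It is a simplified illustration under stated assumptions: the dominating-unit-clause technique is omitted, the clause weights are invented, and the function names are hypothetical. It is not the solver used in KnowLife.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, bool]            # (variable, required truth value)
Clause = Tuple[float, List[Literal]]  # (weight, disjunction of literals)


def johnson_greedy(clauses: List[Clause]) -> Dict[str, bool]:
    """Johnson-style greedy assignment for weighted Max-Sat.

    Each clause starts with modified weight w * 2^(-length). Variables
    are fixed one at a time to the value whose literal carries more
    modified weight among the clauses that are not yet satisfied.
    """
    # Working copy: [modified weight, remaining literals, satisfied flag]
    work = [[w * 2.0 ** -len(lits), list(lits), False] for w, lits in clauses]
    variables = sorted({var for _, lits in clauses for var, _ in lits})
    assignment: Dict[str, bool] = {}
    for var in variables:
        pos = sum(c[0] for c in work if not c[2] and (var, True) in c[1])
        neg = sum(c[0] for c in work if not c[2] and (var, False) in c[1])
        value = pos >= neg
        assignment[var] = value
        for c in work:
            if c[2]:
                continue
            if (var, value) in c[1]:
                c[2] = True                    # clause is now satisfied
            elif (var, not value) in c[1]:
                c[1].remove((var, not value))  # literal became false
                c[0] *= 2.0                    # clause has one literal less
    return assignment


def satisfied_weight(clauses: List[Clause],
                     assignment: Dict[str, bool]) -> float:
    """Total weight of the clauses satisfied by the assignment."""
    return sum(w for w, lits in clauses
               if any(assignment[var] == val for var, val in lits))


T = "treats(Aspirin,Headache)"
S = "sideEffect(Headache,Aspirin)"
toy_clauses: List[Clause] = [
    (0.9, [(T, True)]),               # fact candidate with invented weight
    (0.4, [(S, True)]),               # competing candidate, lower weight
    (10.0, [(T, False), (S, False)])  # mutual exclusion: not both
]
toy_assignment = johnson_greedy(toy_clauses)
toy_weight = satisfied_weight(toy_clauses, toy_assignment)
```

On this instance the heuristic keeps the higher-weighted candidate and rejects the competing one, so the heavily weighted mutual exclusion clause stays satisfied.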
Results and discussion
Evaluation of different text genres
Evaluation of the impact of different components
Impact of different text genres
We first discuss the results obtained from the different text genres: i) scientific (PubMed publications), ii) encyclopedic (Web portals like Mayo Clinic or Wikipedia), iii) social (discussion forums). Table 4 gives, column-wise, the number of facts and precision figures for four different combinations of genres.
Generally, combining genres gave more facts at a lower precision, as texts of lower quality like social sources introduced noise. The combination that gave the best balance of precision and total yield was scientific with encyclopedic sources, with a micro-averaged precision of 0.933 for a total of 542,689 facts. We consider this the best of the KB’s that KnowLife generated.
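The difference between micro- and macro-averaging can be shown with a short Python sketch. The text does not spell out how the micro-average was computed; the sketch assumes the common reading in which each relation's precision is weighted by its number of facts. The relation names and numbers below are invented and are not taken from Table 4.

```python
from typing import Dict, Tuple

# relation -> (number of facts, precision estimated for that relation)
RelationStats = Dict[str, Tuple[int, float]]


def micro_precision(stats: RelationStats) -> float:
    """Precision weighted by the number of facts per relation (assumed)."""
    total_facts = sum(n for n, _ in stats.values())
    return sum(n * p for n, p in stats.values()) / total_facts


def macro_precision(stats: RelationStats) -> float:
    """Unweighted mean of the per-relation precision values."""
    return sum(p for _, p in stats.values()) / len(stats)


# Invented numbers for illustration only.
toy_stats: RelationStats = {
    "relationA": (1000, 0.9),
    "relationB": (100, 0.5),
}
toy_micro = micro_precision(toy_stats)
toy_macro = macro_precision(toy_stats)
```

With these toy numbers the micro-average is dominated by the large relation, whereas the macro-average treats both relations equally.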
The best overall precision was achieved when using encyclopedic texts only. This confirmed our hypothesis that a pattern-based approach works best when the language is simple and grammatically correct. Contrast this with scientific publications, which often exhibit convoluted language, and online discussions, which contain a notable fraction of grammatically incorrect language. In these cases, the quality of patterns degraded and precision dropped. Incorrect facts stemming from errors in the entity recognition step were especially rampant in online discussions, where colloquial language (for example, meds, short for medicines) led to incorrect entities (MEDS, the acronym for Microcephaly, Epilepsy, and Diabetes Syndrome).
The results vary highly across the 13 relations in our experiments. The number of facts depends on the extent to which the text sources express a relation, while precision reflects how decisively patterns point to that relation. Interacts and SideEffect are prime examples: the drugs.com portal lists many side effects and drug-drug interactions via the DOM structure, which boosted the extraction accuracy of KnowLife, leading to many facts at precisions of 95.6% and 96.4%, respectively. Facts for the relations Alleviates, CreatesRisk, and ReducesRisk, on the other hand, mostly came from scientific publications, which resulted in fewer facts and lower precision.
A few relations, however, defied these general trends. Patterns of Contraindicates were too sparse and ambiguous within encyclopedic texts alone and also within scientific publications alone. However, when the two genres were combined, the good patterns reached a critical mass to break through the confidence threshold, giving rise to a sudden increase in harvested facts. For the CreatesRisk and ReducesRisk relations, combining encyclopedic and scientific sources increased the number of facts compared to using only encyclopedic texts, and increased the precision compared to using only scientific publications.
As Table 4 shows, incorporating social sources brought a significant gain in the number of harvested facts, at a trade-off of lowered precision. As  pointed out, there are facts that come only from social sources and, depending on the use case, it is still worthwhile to incorporate them; for example, to facilitate search and discovery applications where recall may be more important. Moreover, the patterns extracted from encyclopedic and scientific sources could be reused to annotate text in social sources, so as to identify existing information.
Number of fact occurrences in text sources
Impact of different components
We used the KnowLife setting with scientific and encyclopedic sources, which, by and large, performed best, as the basis for investigating the impact of different components in the KnowLife pipeline. To this end, we disabled individual components (DOM tree patterns, statistical analysis of patterns, and consistency reasoning), each one separately while all other components remained enabled. This way we obtained insight into how strongly KnowLife depends on each component. Table 5 shows the results of this ablation study.
No DOM tree patterns: When disregarding patterns on the document structure and solely focusing on textual patterns, KnowLife degrades in precision (from 93% to 78%) and sharply drops in the number of acquired facts (from ca. 540,000 to 80,000). The extent of these general effects varies across the different relations. Relations whose patterns are predominantly encoded in document structures – once again Interacts and SideEffect – exhibit the most drastic loss. On the other hand, relations like Affects, Aggravates, Alleviates, and Treats, are affected only to a minor extent, as their patterns are mostly found in free text.
No statistical pattern analysis: Here we disabled the statistical analysis of pattern confidence and the frequent itemset mining for generalizing patterns. This way, without confidence values, KnowLife kept all patterns, including many noisy ones. Patterns that would be pruned in the full configuration led to poor seed patterns; for example, the single word causes was taken as a seed pattern for both relations SymptomOf and Contraindicates. Without frequent itemset mining, long and overly specific patterns also contributed to poor seed patterns. The combined effect greatly increased the number of false positives, so precision dropped (from 93% to 77%). In terms of acquired facts, not scrutinizing the patterns increased the yield (from ca. 540,000 to 720,000 facts).
Relations mainly extracted from DOM tree patterns, such as Interacts and SideEffect, were not much affected. Also, relations like Affects and Diagnoses exhibited only small losses in precision; for these relations, the co-occurrence of two types of entities is often already sufficient to express a relation. The presence of consistency constraints on type signatures also helped to keep the output quality high.
No consistency reasoning: In this setting, neither type signatures nor other consistency constraints were checked. Thus, conflicting facts could be accepted, leading to a large fraction of false positives. This effect was unequivocally witnessed by an increase in the number of facts (from ca. 540,000 to 850,000) accompanied by a sharp decrease in precision (from 93% to 70%).
The relations Interacts and SideEffect were least affected by this degradation, as they are mostly expressed via the document structure of encyclopedic texts, where entity types are implicitly encoded in the DOM tree tags (see Figure 2). Here, consistency reasoning was not vital.
Lessons learned: Overall, this ablation study clearly shows that all major components of the KnowLife pipeline are essential for high quality (precision) and high yield (number of facts) of the constructed KB. Each of the three configurations where one component is disabled suffered substantial if not dramatic losses in either precision or acquired facts, and sometimes both. We conclude that the full pipeline is a well-designed architecture whose strong performance cannot be easily achieved by a simpler approach.
Table 7: Error analysis (number of facts in brackets). Columns: cause of error; percentage based on text genre. Causes of error include Pattern Relation Duality and Swapped left and right-hand entity.
Swapped left and right-hand entity: Compare the following two sentences.
1. Anemia is a common symptom of sarcoidosis.
2. A common symptom of sarcoidosis is anemia.
In both cases, the same pattern is a common symptom of is extracted. In sentence 2, however, an incorrect fact would be extracted since the order in which the entities occur is reversed.
Negation: This error was caused by not detecting negation expressed in the text. The word expressing the negation may occur textually far away from the entities, as in It is disputed whether early antibiotic treatment prevents reactive arthritis, and thus escaped our pattern gathering method. In other cases, the negation phrase requires subtle semantic understanding to tease out, as in Except for osteoarthritis, I think my symptoms are all from heart disease.
Factually Wrong: Although our methods successfully harvested a fact, the underlying text evidence made a wrong statement.
Lessons learned: Overall, this error analysis confirms that scientific and encyclopedic sources contain well-written texts that are amenable to a text mining pipeline. Social sources, with their poorer quality of language style as well as information content, were the biggest contributor in almost all error categories. Errors in entity recognition and disambiguation accounted for close to 60% of all errors; overcoming them will require better methods that go beyond a dictionary and incorporate deeper linguistic and semantic understanding.
Top-20 pairs of inter-connected biomedical areas within KnowLife
The predominant number of facts involves entities of the semantic group Disorders, for two reasons. First, with our choice of relations, disorders appear in almost all type signatures. Second, entities of type clinical finding are covered by the group Disorders, and these are frequent in all text genres. However, this type also includes diverse, non-disorder entities such as pregnancy, which is clearly not a disorder.
To showcase the usefulness of KnowLife, we developed a health portal (http://knowlife.mpi-inf.mpg.de) that allows interactive exploration of the harvested facts and their input sources. The KnowLife portal supports a number of use cases for different information needs. A patient may wish to find out the side effects of a specific drug, by searching for the drug name and browsing the SideEffect facts and their provenance. A physician may want to “speed read” publications and online discussions on treatment options for an unfamiliar disease. Provenance information is vital here, as the physician would want to consider the recency and authority of the sources for certain statements. The health portal also provides a function for on-the-fly annotation of new text from publications or social media, leveraging known patterns to highlight any relations found.
In the future, we plan to improve the entity recognition to accommodate a wider variety of entities beyond those in UMLS. For instance, colloquial usage (meds for medicines) and composite entities (amputation of right leg) are not yet addressed. Entities within UMLS also require more sophisticated disambiguation. For instance, an occurrence of the word stress needs to be disambiguated between the brand name of a drug and the psychological feeling.
Finally, we would like to address the challenge of mining and representing the context of harvested facts. Binary relations are often not sufficient to express medical knowledge. For example, the statement Fever is a symptom of Lupus Flare during pregnancy cannot be suitably represented by a binary fact.
We plan to cope with such statements by extracting ternary and higher-arity relations, with appropriate extensions of both pattern-based extraction and consistency reasoning.
We would like to thank Timo Kötzing and Thomas Ruschel for participating in the evaluation.
- Barbosa D, Wang H, Yu C. Shallow information extraction for the knowledge web. In: Proceedings of International Conference On Data Engineering (ICDE). Washington, DC, USA: IEEE Computer Society: 2013. p. 1264–7.Google Scholar
- Suchanek F, Weikum G. Knowledge harvesting from text and web sources. In: Proceedings of International Conference On Data Engineering (ICDE). Washington, DC, USA: IEEE Computer Society: 2013. p. 1250–3.Google Scholar
- Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web J. 2013; 6(2):167–95.Google Scholar
- Hoffart J, Suchanek F, Berberich K, Weikum G. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. In: Proceedings of Special issue of the Artificial Intelligence Journal. Menlo Park, CA, USA: AAAI Press: 2013. p. 28–61.Google Scholar
- Pyysalo S, Ohta T, Miwa M, Cho H-C, Tsujii J, Ananiadou S. Event extraction across multiple levels of biological organization. Bioinformatics. 2012; 28(18):575–81.View ArticleGoogle Scholar
- Arighi C, Roberts P, Agarwal S, Bhattacharya S, Cesareni G, Chatr-aryamontri A, et al. BioCreative III interactive task: An overview. BMC Bioinformatics. 2011; 12(Suppl 8):4.View ArticleGoogle Scholar
- Kim JD, Ohta T, Tsujii J. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics. 2008; 9(1):10.View ArticlePubMedPubMed CentralGoogle Scholar
- Whirl-Carrillo M, McDonagh E, Hebert J, Gong L, Sangkuhl K, Thorn C,et al. Pharmacogenomics knowledge for personalized medicine. Clinical Pharmacol Ther. 2012; 92(4):414–7.View ArticleGoogle Scholar
- Williams AJ, Harland L, Groth P, Pettifer S, Chichester C, Willighagen EL, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today. 2012; 17(21):1188–98.
- Buyko E, Faessler E, Wermter J, Hahn U. Event extraction from trimmed dependency graphs. In: Proceedings of Workshop on Current Trends in Biomedical Natural Language Processing (BioNLP): Shared Task. Stroudsburg, PA, USA: ACL: 2009. p. 19–27.
- Miwa M, Sætre R, Kim J-D, Tsujii J. Event extraction with complex event classification using rich features. J Bioinformatics Comput Biol. 2010; 8(1):131–46.
- Björne J, Salakoski T. Generalizing biomedical event extraction. In: Proceedings of Workshop on Current Trends in Biomedical Natural Language Processing (BioNLP): Shared Task. Stroudsburg, PA, USA: ACL: 2011. p. 183–91.
- Krallinger M, Izarzugaza JMG, Penagos CR, Valencia A. Extraction of human kinase mutations from literature, databases and genotyping studies. BMC Bioinformatics. 2009; 10(S8):1.
- Rosario B, Hearst MA. Classifying semantic relations in bioscience texts. In: Proceedings of Annual Meeting on Association for Computational Linguistics (ACL). Stroudsburg, PA, USA: ACL: 2004. p. 430.
- Bundschus M, Dejori M, Stetter M, Tresp V, Kriegel HP. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics. 2008; 9(1):207.
- Sun W, Rumshisky A, Uzuner O. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. J Am Med Inform Assoc. 2013; 20(5):806–13.
- Bravo A, Cases M, Queralt-Rosinach N, Sanz F, Furlong L. A knowledge-driven approach to extract disease-related biomarkers from the literature. BioMed Res Int. 2014. article ID: 253128.
- Chun HW, Tsuruoka Y, Kim JD, Shiba R, Nagata N, Hishiki T, et al. Extraction of gene-disease relations from medline using domain dictionaries and machine learning. In: Proceedings of Pacific Symposium of Biocomputing: 2006. p. 4–15.
- Leroy G, Chen H. Genescene: An ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts. J Am Soc Inform Sci Technol. 2005; 56(5):457–68.
- Rindflesch TC, Libbus B, Hristovski D, Aronson AR, Kilicoglu H. Semantic relations asserting the etiology of genetic diseases. In: Proceedings of American Medical Informatics Association (AMIA) Annual Symposium. Bethesda, MD, USA: AMIA: 2003. p. 554–8.
- Good BM, Su AI. Crowdsourcing for bioinformatics. Bioinformatics. 2013; 29(16):1925–33.
- Ranard BL, Ha YP, Meisel ZF, Asch DA, Hill SS, Becker LB, et al. Crowdsourcing–harnessing the masses to advance health and medicine, a systematic review. J General Intern Med. 2014; 29(1):187–203.
- Burger JD, Doughty E, Khare R, Wei C-H, Mishra R, Aberdeen J, et al. Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing. Database. 2014; 2014. article ID: bau094.
- Aroyo L, Welty C. Measuring crowd truth for medical relation extraction. In: AAAI Fall Symposium Series. Menlo Park, CA, USA: AAAI Press: 2013.
- Hunter L, Lu Z, Firby J, Baumgartner W, Johnson H, Ogren P, et al. OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics. 2008; 9(1):78.
- Torii M, Arighi CN, Wang Q, Wu CH, Vijay-Shanker K. Text mining of protein phosphorylation information using a generalizable rule-based approach. In: Proceedings of International Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB). New York, NY, USA: ACM Press: 2013. p. 201–10.
- Müller HM, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biology. 2004; 2(11):309.
- Wattarujeekrit T, Shah P, Collier N. PASBio: predicate-argument structures for event extraction in molecular biology. BMC Bioinformatics. 2004; 5(1):155.
- Kolářik C, Hofmann-Apitius M, Zimmermann M, Fluck J. Identification of new drug classification terms in textual resources. Bioinformatics. 2007; 23(13):i264–72.
- Hearst M. Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics (CoLing). Stroudsburg, PA, USA: ACL: 1992. p. 539–45.
- Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003; 36(6):462–77.
- Poria S, Cambria E, Winterstein G, Huang G-B. Sentic patterns: dependency-based rules for concept-level sentiment analysis. Knowledge-Based Syst. 2014; 69(0):45–63.
- Thomas P, Starlinger J, Vowinkel A, Arzt S, Leser U. GeneView: a comprehensive semantic search engine for PubMed. Nucleic Acids Res. 2012; 40(W1):585–91.
- Xu R, Li L, Wang Q. dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinformatics. 2014; 15(1):105.
- Brin S. Extracting patterns and relations from the World Wide Web. In: Selected Papers from the International Workshop on The World Wide Web and Databases (WebDB). New York, NY, USA: Springer: 1998. p. 172–83.
- Agichtein E, Gravano L. Snowball: Extracting relations from large plain-text collections. In: Proceedings of the Fifth ACM Conference on Digital Libraries (DL). New York, NY, USA: ACM Press: 2000. p. 85–94.
- Suchanek F, Sozio M, Weikum G. SOFIE: A self-organizing framework for information extraction. In: Proceedings of International World Wide Web Conference (WWW). New York, NY, USA: ACM Press: 2009. p. 631–40.
- Nakashole N, Theobald M, Weikum G. Scalable knowledge harvesting with high precision and high recall. In: Proceedings of International Conference on Web Search and Data Mining (WSDM). New York, NY, USA: ACM Press: 2011. p. 227–36.
- Carlson A, Betteridge J, Kisiel B, Settles B, Hruschka Jr ER, Mitchell TM. Toward an architecture for never-ending language learning. In: Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) Conference. Menlo Park, CA, USA: AAAI Press: 2010. p. 1306–13.
- Nebot V, Ye M, Albrecht M, Eom J-H, Weikum G. DIDO: A disease-determinants ontology from Web sources. In: Proceedings of International World Wide Web Conference (WWW). New York, NY, USA: ACM Press: 2011. p. 237–40.
- Movshovitz-Attias D, Cohen WW. Bootstrapping biomedical ontologies for scientific text using NELL. In: Proceedings of Workshop on Biomedical Natural Language Processing (BioNLP). Stroudsburg, PA, USA: ACL: 2012. p. 11–19.
- Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010; 17(3):229–36.
- Harmston N, Filsell W, Stumpf M. Which species is it? Species-driven gene name disambiguation using random walks over a mixture of adjacency matrices. Bioinformatics. 2012; 28(2):254–60.
- Chasin R, Rumshisky A, Uzuner Ö, Szolovits P. Word sense disambiguation in the clinical domain: a comparison of knowledge-rich and knowledge-poor unsupervised methods. J Am Med Inform Assoc. 2014; 21(5):842–9.
- Vydiswaran VGV, Zhai C, Roth D. Gauging the Internet doctor: Ranking medical claims based on community knowledge. In: Proceedings of Workshop on Data Mining for Medicine and Healthcare (DMMH). New York, NY, USA: ACM Press: 2011. p. 42–51.
- Mukherjee S, Weikum G, Danescu-Niculescu-Mizil C. People on drugs: Credibility of user statements in health communities. In: Proceedings of Conference on Knowledge Discovery and Data Mining (KDD). New York, NY, USA: ACM Press: 2014. p. 65–74.
- White RW, Harpaz R, Shah NH, DuMouchel W, Horvitz E. Toward enhanced pharmacovigilance using patient-generated data on the Internet. Clin Pharmacol Ther. 2014; 96(2):239–46.
- Ernst P, Meng C, Siu A, Weikum G. KnowLife: a knowledge graph for health and life sciences. In: Proceedings of International Conference on Data Engineering (ICDE). Washington, DC, USA: IEEE Computer Society: 2014. p. 1254–7.
- Siu A, Nguyen DB, Weikum G. Fast entity recognition in biomedical text. In: Proceedings of Workshop on Data Mining for Healthcare (DMH) at Conference on Knowledge Discovery and Data Mining (KDD). New York, NY, USA: ACM Press: 2013.
- Charikar MS. Similarity estimation techniques from rounding algorithms. In: Proceedings of Symposium on Theory of Computing (STOC). New York, NY, USA: ACM Press: 2002. p. 380–8.
- Broder AZ, Charikar M, Frieze AM, Mitzenmacher M. Min-wise independent permutations. In: Proceedings of Symposium on Theory of Computing (STOC). New York, NY, USA: ACM Press: 1998. p. 327–36.
- Niedermeier R, Rossmanith P. New upper bounds for maximum satisfiability. J Algorithms. 2000; 36(1):63–88.
- Johnson DS. Approximation algorithms for combinatorial problems. J Comput Syst Sci. 1974; 9(3):256–78.
- McCray AT, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. Stud Health Technol Informatics. 2001; 1:216–20.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.