Quantifying and filtering knowledge generated by literature based discovery
© The Author(s) 2017
Published: 31 May 2017
Literature based discovery (LBD) automatically infers missed connections between concepts in literature. It is often assumed that LBD generates more information than can be reasonably examined.
We present a detailed analysis of the quantity of hidden knowledge produced by an LBD system and the effect of various filtering approaches upon this. The investigation of filtering combined with single or multi-step linking term chains is carried out on all articles in PubMed.
The evaluation is carried out using both replication of existing discoveries, which provides justification for multi-step linking chain knowledge in specific cases, and using timeslicing, which gives a large scale measure of performance.
While the quantity of hidden knowledge generated by LBD can be vast, we demonstrate that (a) intelligent filtering can greatly reduce the number of hidden knowledge pairs generated, (b) for a specific term, the number of single step connections can be manageable, and (c) in the absence of single step hidden links, considering multiple steps can provide valid links.
Between 2,000 and 4,000 articles are added to PubMed, the National Library of Medicine’s (NLM) database of publications in biomedicine, every day. This forces researchers to specialize in narrower aspects of their field, so they may miss inferable connections, for example ones that reveal new treatments for diseases (e.g. Swanson automatically discovered a previously unnoticed connection between fish oil and Raynaud disease, via a number of terms such as blood viscosity, platelet aggregation and vascular reactivity, a connection which was later verified). Literature based discovery (LBD) automates the process of finding new connections (hidden connections) between existing knowledge, and thus can be used for disease candidate gene discovery, to find other uses of existing drugs, or for drug side effect prediction.
Time period. Much of the earlier LBD work restricted the knowledge base to a reduced time segment – for example, Gordon and Lindsay restricted publications to the years 1983–1985 when they sought to replicate Swanson’s fish oil – Raynaud disease connection.
Relation. Preiss et al. show that employing more sophisticated definitions of links between terms (relations) greatly reduces the number of hidden knowledge pairs generated without detrimental effect on performance.
Stoplist. For example, Swanson et al. start by removing non-content words and add, semi-automatically, other terms to a growing stoplist (in 2006, this contained 9,500 terms).
Literature reduction. Swanson et al. also carry out term reduction at an earlier stage: they pre-filter the literature on a per sought term basis by subject heading. If the user is seeking term X, hidden knowledge is only generated from abstracts which contain X in their MeSH subject heading and where X is present in the title. (Note that this clearly requires prior knowledge of search terms.)
Term type. Yetisgen-Yildiz and Pratt limit the types of linking and target terms permitted (to categories such as chemicals & drugs or genes & molecular sequence) on the basis that this is the type of link they wish to find.
CUIs vs terms. The Unified Medical Language System Metathesaurus (UMLS) is a large thesaurus which lists millions of biomedical and health related concepts using Concept Unique Identifiers (CUIs). Weeber et al. filter out non-content words by switching from terms to UMLS CUIs. Aside from removing non-content words, switching to CUIs also avoids spurious connections due to term ambiguity. To identify the correct CUIs, they use MetaMap, a publicly available tool which assigns UMLS CUIs to terms, as well as mapping words to multi-word units where appropriate.
Synonym merging. While not carrying out explicit synonym merging, Cameron et al. manually add close terms to the source (A) and target (C) term in closed search (see bottom of Fig. 1) LBD. As both Raynaud disease and Raynaud phenomenon appear separately in UMLS, the hidden knowledge generated will vary if these are treated as one unit.
Relation type. Focusing on one type of discovery, adverse drug reactions, Shang et al. employ only the INTERACTS_WITH and COMPARED_WITH relations within one step of their inference process.
The A−B−C model generates justification(s) for each hidden connection in the form of the linking (B) terms: Raynaud disease and fish oil were found to be connected via the linking term blood viscosity, with one decreasing and the other increasing it. However, hidden knowledge can also be identified by following longer paths of linking terms, i.e. $A \to b_1 \to b_2 \to \cdots \to b_n \to C$. This approach shows promise and has already been explored. For example, Kontostathis and Pottenger investigate paths of linking terms generated by co-occurrence and Wilkowski et al. show the feasibility of the approach on a single A and C pair. However, previous evaluation of this approach has been restricted to small numbers of examples and no large scale evaluation has yet been carried out.
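The chain notation above can be illustrated with a minimal sketch of closed discovery, assuming relations are held as an undirected graph keyed by concept name. The function names `build_graph` and `linking_chains` and the toy relations are illustrative assumptions, not the system evaluated in this paper.

```python
from collections import defaultdict

def build_graph(pairs):
    """Undirected adjacency sets from (concept, concept) relation pairs."""
    graph = defaultdict(set)
    for x, y in pairs:
        graph[x].add(y)
        graph[y].add(x)
    return graph

def linking_chains(graph, a, c, steps):
    """Return chains (b_1, ..., b_steps) of linking terms joining a to c.

    steps=1 is the classic A-B-C model; steps=2 gives A-b_1-b_2-C.
    Chains never revisit a concept and never pass through a or c.
    """
    chains = []

    def extend(node, chain):
        if len(chain) == steps:
            if c in graph[node]:
                chains.append(tuple(chain))
            return
        for nxt in sorted(graph[node]):
            if nxt not in (a, c) and nxt not in chain:
                extend(nxt, chain + [nxt])

    extend(a, [])
    return chains

# Toy relations, for illustration only.
toy = build_graph([
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud disease"),
    ("fish oil", "blood pressure"),
    ("blood pressure", "captopril"),
    ("captopril", "Raynaud disease"),
])
one_step = linking_chains(toy, "fish oil", "Raynaud disease", steps=1)
two_step = linking_chains(toy, "fish oil", "Raynaud disease", steps=2)
print(one_step)  # [('blood viscosity',)]
print(two_step)  # [('blood pressure', 'captopril')]
```

Enumerating chains explicitly keeps the justification for each connection, at the cost of the number of chains growing quickly with the number of steps.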
The novelty of our work lies in a detailed analysis of the quantity of hidden knowledge produced and the effect of various filtering approaches upon this. This thorough investigation of filtering combined with single or multi-step linking term chains is, to our knowledge, the first comprehensive investigation of this type.
Literature based discovery system
We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.
A SemRep processed version of Medline is available from NLM , and we use all positive relations from semmedVER24_2 processed up to 30 June 2014 (which contains 70,364,020 relations) to populate the adjacency matrix M.
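As a rough sketch of how relations might populate an adjacency matrix M and how single step hidden pairs follow from it: the triples below are hypothetical (loosely modelled on the example sentence, not actual SemRep output), relations are assumed to be treated as undirected, and the helper names are illustrative.

```python
def adjacency_matrix(relations):
    """Build a symmetric 0/1 matrix M from (subject, predicate, object) triples."""
    concepts = sorted({c for s, _, o in relations for c in (s, o)})
    index = {c: i for i, c in enumerate(concepts)}
    n = len(concepts)
    M = [[0] * n for _ in range(n)]
    for s, _, o in relations:
        M[index[s]][index[o]] = 1
        M[index[o]][index[s]] = 1  # assumption: direction is ignored
    return concepts, M

def hidden_pairs(concepts, M):
    """Pairs (A, C) joined by at least one linking term B but not directly."""
    n = len(concepts)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if M[i][j]:
                continue  # already directly related, so not hidden
            # Entry (i, j) of M*M counts the linking terms shared by i and j.
            if sum(M[i][k] * M[k][j] for k in range(n)) > 0:
                pairs.append((concepts[i], concepts[j]))
    return pairs

# Hypothetical triples, for illustration only.
triples = [
    ("hemofiltration", "TREATS", "digoxin overdose"),
    ("hyperkalemia", "COMPLICATES", "digoxin overdose"),
]
concepts, M = adjacency_matrix(triples)
print(hidden_pairs(concepts, M))  # [('hemofiltration', 'hyperkalemia')]
```

At the scale of the full relation set a dense list-of-lists matrix would not be practical; a sparse representation would be needed, which this sketch does not attempt.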
Mechanical ventilators TREATS Chronic obstructive lung disease
Common cold PROCESS_OF Rhinovirus
If the two senses of cold were not differentiated, a hidden connection could be found between mechanical ventilators and rhinovirus.
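The effect can be reproduced in a small sketch. The identifiers prefixed with `CONCEPT:` are placeholders standing in for distinct CUIs (they are not real UMLS identifiers), and `one_step_links` is an illustrative helper.

```python
from collections import defaultdict

def one_step_links(pairs, a, c):
    """Linking terms directly related to both a and c."""
    neighbours = defaultdict(set)
    for x, y in pairs:
        neighbours[x].add(y)
        neighbours[y].add(x)
    return sorted(neighbours[a] & neighbours[c])

# Keyed by surface term: both senses collapse onto the string "cold".
by_term = [("mechanical ventilators", "cold"), ("cold", "rhinovirus")]

# Keyed by concept: the two senses get distinct placeholder identifiers.
by_concept = [
    ("mechanical ventilators", "CONCEPT:chronic obstructive lung disease"),
    ("CONCEPT:common cold", "rhinovirus"),
]

term_links = one_step_links(by_term, "mechanical ventilators", "rhinovirus")
concept_links = one_step_links(by_concept, "mechanical ventilators", "rhinovirus")
print(term_links)     # ['cold']
print(concept_links)  # []
```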
While mapping to CUIs clearly reduces the number of incorrect hidden knowledge pairs, it will not eliminate connections via general terms – words such as patient, clinical study or week. The following section discusses a number of filtering techniques.
sy - synonym merging
As with any type of dictionary, the creators decide how to divide up senses, which is termed lumping or splitting in lexicography. As UMLS is composed of terms from multiple source vocabularies, the splitting / lumping decision is not consistent throughout. In general, there is a tendency to split senses and later merge them based on application. Having found separate CUIs for Raynaud disease and Raynaud phenomenon, and noting that Cameron et al. manually augment their selected A and C terms “with related concepts”, we therefore investigate an automatic approach to synonym merging within UMLS.
If the start point, A (e.g. Raynaud Disease), has multiple synonymous CUIs, merging these will result in the generation of more hidden knowledge from A.
If a linking term CUI is equivalent to the CUI of another term, there could be more hidden knowledge generated. Examination of documents supporting a Raynaud disease – fish oil link reveals that some of the expected connections from CUI C0034734 (Raynaud disease) are linked to CUI C0034735 (Raynaud phenomenon) instead. Although the two terms are synonymous for the purposes of evaluating the Raynaud disease – fish oil link, due to their different CUIs, the connection will not be found.
Since synonymous pairs of hidden knowledge (and linking terms) will merge, this will reduce the burden on a user.
Information about synonym classes related to the number of sources supporting the synonymy
Basing classes on 1 source produces a class of 74 elements – this is unlikely to contain synonyms useful for knowledge discovery. Synonym classes with more than 20, or even 10, elements are similarly unlikely to be helpful: we employ synonym classes supported by at least two sources with class size ≤5 which leaves 614 synonym classes and reduces the original 2,868,943 UMLS CUIs to 2,867,188 distinct CUIs (restricting to CUIs that appear in SemRep relations in Medline, this reduces the original 485,538 CUIs to 484,924 distinct CUIs).
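One way such classes could be built is sketched below, under the assumption that synonymy is given as CUI pairs annotated with the source vocabularies asserting it, and that classes are formed by transitive closure. The source names, the union-find construction and the choice of the smallest CUI as representative are assumptions for illustration, not a description of the procedure actually used.

```python
from collections import defaultdict

def synonym_classes(pair_sources, min_sources=2, max_size=5):
    """Group CUIs into synonym classes supported by enough sources."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), sources in pair_sources.items():
        if len(sources) >= min_sources:
            parent[find(a)] = find(b)

    classes = defaultdict(set)
    for cui in list(parent):
        classes[find(cui)].add(cui)
    # Keep only genuine classes that are small enough to be trusted.
    return [m for m in classes.values() if 1 < len(m) <= max_size]

def canonical_map(classes):
    """Map every CUI in a class to one representative (the smallest CUI)."""
    return {cui: min(members) for members in classes for cui in members}

# Toy input: source names and the second pair are hypothetical.
pairs = {
    ("C0034734", "C0034735"): {"SOURCE_1", "SOURCE_2"},
    ("C0000001", "C0000002"): {"SOURCE_1"},
}
mapping = canonical_map(synonym_classes(pairs))
print(mapping)
```

Here only the first pair is merged, since the second is supported by a single source.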
st - semantic type
The UMLS Semantic Network contains a hierarchy of subject categories, semantic types (STs), with at least one assigned to each CUI. Previous work selects a number of STs allowed to act as linking or target terms as these are thought to describe the type of desired hidden knowledge; for example Yetisgen-Yildiz and Pratt allow both linking and target terms to be chemicals & drugs and genes & molecular sequences, but linking terms can also be disorders, physiology and anatomy members.
Obvious. Removes the STs which rarely appeared in useful relations: activities & behaviours, geographic areas, occupations, organizations and procedures.
Manual. Removes 70 STs which were manually selected for exclusion, based on expert opinion of their likelihood of appearing in relations.
Half. Contains the ST supertypes for which at least half of the subtypes were removed in manual: activities & behaviours, geographic areas, occupations, organizations, procedures, anatomy, concepts & ideas, devices, living beings and objects.
Y-Y&P. Contains the STs chemicals & drugs, genes & molecular sequences, disorders, physiology and anatomy members.
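A minimal sketch of an exclusion-style filter such as obvious follows. The CUI names and their semantic type assignments are placeholders, and the policy of dropping a CUI if any of its STs is excluded is an assumption; a CUI with several STs could equally be dropped only when all of them are excluded.

```python
OBVIOUS = {
    "activities & behaviours",
    "geographic areas",
    "occupations",
    "organizations",
    "procedures",
}

def filter_by_semantic_type(pairs, cui_types, excluded):
    """Keep only pairs in which neither CUI carries an excluded ST."""
    def allowed(cui):
        # Assumption: a single excluded ST is enough to drop the CUI.
        return not (cui_types.get(cui, set()) & excluded)
    return [(a, b) for a, b in pairs if allowed(a) and allowed(b)]

# Placeholder CUIs and ST assignments, for illustration only.
cui_types = {
    "CUI_DRUG": {"chemicals & drugs"},
    "CUI_DISEASE": {"disorders"},
    "CUI_PLACE": {"geographic areas"},
}
pairs = [("CUI_DRUG", "CUI_DISEASE"), ("CUI_DRUG", "CUI_PLACE")]
kept = filter_by_semantic_type(pairs, cui_types, OBVIOUS)
print(kept)  # [('CUI_DRUG', 'CUI_DISEASE')]
```

An allow-list such as Y-Y&P would invert the test, keeping a CUI only when one of its STs is in the permitted set.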
clt - common linking terms stoplist
break - breaking common linking term connections
A filtering technique based on a stoplist needs to be quite conservative so it does not remove useful terms and therefore it is likely to leave some unhelpful terms. Breaking common linking term connection is a fundamentally different idea to creating a stoplist: instead of finding frequently appearing terms, this approach bases its decisions on the number of terms a given term is connected to.
When creating the adjacency matrix M, break (discard) all connections to CUI A when C(A) > threshold.
Discard the connection between CUIs A and B when min(C(A), C(B)) > threshold.
(Here C(A) represents the number of CUIs linked to A, and the threshold needs to be determined empirically.)
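The two variants can be contrasted in a short sketch. It assumes that C(·) is computed once on the unfiltered connections and that connections are undirected; the function names and the toy data (using the general terms patient and week as hubs) are illustrative.

```python
from collections import defaultdict

def link_counts(edges):
    """C(X): number of distinct CUIs linked to X."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {cui: len(linked) for cui, linked in neighbours.items()}

def break_terms(edges, threshold):
    """Variant 1: drop every connection touching a CUI with C > threshold."""
    c = link_counts(edges)
    return [(a, b) for a, b in edges
            if c[a] <= threshold and c[b] <= threshold]

def break_pairs(edges, threshold):
    """Variant 2: drop A-B only when min(C(A), C(B)) > threshold."""
    c = link_counts(edges)
    return [(a, b) for a, b in edges if min(c[a], c[b]) <= threshold]

# Toy connections: two heavily linked general terms and three specific ones.
edges = [
    ("patient", "x1"), ("patient", "x2"), ("patient", "x3"),
    ("week", "x1"), ("week", "x2"), ("week", "x3"),
    ("patient", "week"),
]
print(break_terms(edges, threshold=3))       # []
print(len(break_pairs(edges, threshold=3)))  # 6
```

In this toy example the first variant removes everything connected to either hub, whereas the second removes only the hub-to-hub connection and keeps the links from each hub to the more specific concepts.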
Results and discussion
LBD is clearly difficult to evaluate: by virtue of the generated knowledge being new, there is no gold standard for comparison. Two standard techniques for evaluation exist: 1) replication of existing discoveries (e.g. [5, 10, 23]), where discoveries made using previous LBD systems are collected from literature and a new LBD system is employed over the same time segment in an attempt to produce the same discovery, and 2) timeslicing, which allows the generation of precision and recall figures by allowing a gold standard to be automatically created from publications after a cut off date with hidden knowledge generated from publications prior. We present both types of evaluation below.
Replication of existing discoveries
[RD-fsh] Raynaud disease – fish oil [5, 10, 25]; 1960–1985, no direct connections.
[Som-Arg] Somatomedin C – arginine; 1960–1989, 27 direct connections.
[Mig-Mg] Migraine disorders – magnesium; 1980–1984, no direct connections.
[Mg-ND] Magnesium deficiency – neurologic disease; 1960–1994, no direct connections.
[AD-INN] Alzheimer’s disease – indomethacin; 1966–1996, 6 direct connections.
[AD-est] Alzheimer’s disease – estrogen; 1960–1995, 25 direct connections.
[Sc-iPL] Schizophrenia – Calcium-Independent Phospholipase A2; 1960–1997, 1 direct connection.
Table 2 Number of linking terms yielded in replication of existing discoveries (number of linking terms found after a single step; number of linking terms found after two steps)
Table 2 presents the number of linking terms found when each discovery is replicated: the upper part of the table describes the number of linking terms corresponding to single step connections. Since the single step approach sometimes fails to find a connection, two step connections are also sought and are presented in the lower part of the table. The number of linking terms generated for each discovery is clearly linked to the number of connections input. However, despite the connection being inferable using co-occurrence, the reduction to zero hidden links using SemRep combined with filtering is valid. For example, the two linking terms connecting Raynaud disease to fish oil with synonym merging are C0029064 (operating theatre) and C0040426 (set of teeth). Neither clearly supports the connection, and both are therefore justifiably removed by semantic type filtering.
It is interesting that in most filtering cases, the frequently cited Raynaud disease – fish oil connection is not replicated: this was revealed to be due to a combination of synonym failure and the relation employed not extracting the necessary connections. The schizophrenia – Ca2+iPLA2 link is not replicated via a single step as Ca2+iPLA2 appears only a few times in Medline and is seen in only one SemRep relation. In these cases, it is worth examining two step connections:
schizophrenia – Ca2+iPLA2
C0001473 (atpase) – C0020063 (Parathyroid Hormone)
C0001655 (adrenocorticotropic hormone) – C0020063 (Parathyroid Hormone)
C0003779 (Arginine vasopressor) – C0020063 (Parathyroid Hormone)
C0021641 (Regular insulin) – C0020063 (Parathyroid Hormone)
C0021740 (Recombinant Interferon Gamma) – C0020063 (Parathyroid Hormone)
C0033371 (Prolactin preparation) – C0020063 (Parathyroid Hormone)
C0037659 (Somatostatin preparation) – C0020063 (Parathyroid Hormone)
C0040160 (Thyrotrophin product) – C0020063 (Parathyroid Hormone)
C0041249 (tryptophan (Trp)) – C0020063 (Parathyroid Hormone)
As the UMLS definition states, the parathyroid hormone elevates blood Ca2+ levels and thus is related to the PLA2G6 protein, showing all nine connections to be worthy of further consideration.
fish oil – Raynaud disease
C0005823 (blood pressure) – C0006938 (captopril)
C0005848 (blood viscosity) – C0030899 (pentoxyphylline)
C0005848 (blood viscosity) – C0232338 (blood flow function)
C0005848 (blood viscosity) – C0206502 (hemorheology)
Captopril and pentoxyphylline are used in the treatment of Raynaud’s phenomenon, and both Raynaud’s and fish oil are known to affect blood viscosity. The blood viscosity links are also supported by the linking term analysis in .
These findings suggest that multi linking term exploration is worth pursuing when a connection is suspected but is not found via a single step connection.
One of the drawbacks of evaluating LBD by replicating existing discoveries is that it relies on the use of small test sets. Timeslicing is an alternative approach that allows larger test collections to be created automatically. A cutoff date is chosen, hidden knowledge is generated from publications published prior to this date and the resulting pairs are compared to the new knowledge published after the cutoff (as identified by the used relation). This section reports results using the timeslicing approach.
Hidden knowledge is generated from all publications listed in Medline up to the end of 2005 and it is evaluated against a gold standard generated from 2006–2015. The gold standard is created by extracting all SemRep relations from abstracts published after the cutoff and removing any SemRep pairs present in Medline before the cutoff; this leaves 1,193,495 pairs.
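The gold standard construction and the resulting scores can be sketched as follows. Pairs are assumed to be unordered, the F-measure is assumed to be the balanced F1, and the function names and toy pairs are illustrative.

```python
def unordered(pairs):
    """Normalise pairs so that (A, C) and (C, A) count as the same pair."""
    return {tuple(sorted(pair)) for pair in pairs}

def gold_standard(pre_cutoff, post_cutoff):
    """Pairs seen after the cutoff that were not already present before it."""
    return unordered(post_cutoff) - unordered(pre_cutoff)

def evaluate(hidden, gold):
    """Precision, recall and balanced F-measure of hidden pairs against gold."""
    hidden = unordered(hidden)
    true_positives = len(hidden & gold)
    precision = true_positives / len(hidden) if hidden else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Toy data: relation pairs before and after a cutoff, and generated pairs.
pre = [("A", "B"), ("B", "C")]
post = [("C", "A"), ("B", "C"), ("C", "D")]
gold = gold_standard(pre, post)
scores = evaluate([("A", "C"), ("A", "D")], gold)
print(sorted(gold))  # [('A', 'C'), ('C', 'D')]
print(scores)        # (0.5, 0.5, 0.5)
```

Because every generated pair absent from the gold standard counts against precision, a system that generates many pairs will score low precision even when some of those pairs are genuine but not yet published.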
[Table: performance after a single step and performance after two steps]
Even if the hidden knowledge generated is genuine, it may not have been discovered yet within the segment on which the gold standard is based.
Some knowledge will be known but never published as it is considered ‘obvious’. An LBD system will generate such knowledge nonetheless.
While the total amount of hidden knowledge generated may seem unmanageable and not useful, it is supported by manual findings: UMLS includes manually identified relations and, on average, goes through 2 releases per year. There were 29,936,977 instances of relations added between the 2005AA and 2015AA versions of UMLS. Since 29,936,977 × 4 = 119,747,908 (compared with 131,199,050 hidden knowledge pairs generated using break), the quantity of hidden knowledge produced no longer seems unreasonable.
In turn, low precision accounts for the low F-measure; again, this is not unusual for large scale timeslicing results in LBD. The highest F-measure for single step connections (3.22e-03) is achieved by breaking common linking term connections. In this setting the amount of hidden knowledge generated (an average of 2,232 pieces per term) is much more manageable than the amount generated without filtering (34,986 pieces per term).
The very low recall (and precision) of the two step connections is caused by the high recall from the one step connections, which results in a large number of two step connections being generated (see Literature based discovery system).
The large number of two step connections will, in general, make the results unusable for open discovery. However, two step connections provide a good backoff for closed discovery when a link is suspected but cannot be found using a single step. This appears to be more common with rarer, and thus more likely to be specific, concepts (such as Ca2+iPLA2).
The results do raise the question of whether precision and recall are good measures for evaluating large scale LBD systems.
We present an extensive discussion of filtering within literature based discovery, and show that using a more sophisticated definition of relation as well as UMLS CUIs (rather than terms directly) is insufficient in itself to yield usable quantities of hidden knowledge. We explore a number of different approaches and show their effect on both replication and timeslicing evaluations. We find the best performance from the rarely used approach which breaks connections on a term pair basis, rather than removing entire terms.
Based on the results of replication of existing discoveries, we propose that the quantity of hidden knowledge generated for a term A will be proportional to its overall frequency within the corpus, and argue that the high proportion of frequent terms is the cause of the low F-measure found using timeslicing evaluation. A comparison with the number of relations (manually) added to UMLS on each release also suggests a high expected number of hidden knowledge pairs.
We also examine the possibility of generating hidden knowledge between terms A and C using a chain of multi-step linking terms $b_i$, i.e. $A \to b_1 \to \cdots \to b_n \to C$ with no direct connection between A and C. While such an approach clearly generates an unmanageable quantity of data in open mode, its value can be seen when a single step connection fails to be found in closed mode: in this case, some multi step connections may be suggested and we propose using a multi step system as a backoff for failed single connection LBD.
The work described in this paper was funded by the Engineering and Physical Sciences Research Council (EP/J008427/1).
Publication costs for this article were funded by the authors’ institution.
Availability of data and materials
The data used for the experiments reported in this paper can be found at http://kdisc.rcweb.dcs.shef.ac.uk/resources.html.
Both authors conceived of the study; JP performed the experiments and drafted the paper. Both authors read and approved the final manuscript.
The authors declare that they have no competing interests.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 18 Supplement 7, 2017: Proceedings of the Tenth International Workshop on Data and Text Mining in Biomedical Informatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-18-supplement-7.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- NLM: MEDLINE fact sheet. https://www.nlm.nih.gov/pubs/factsheets/medline.html. Accessed: 2017-03-17.
- Swanson DR. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect Biol Med. 1986; 30:7–18.
- DiGiacomo RA, Kremer JM, Shah DM. Fish-oil dietary supplementation in patients with Raynaud’s phenomenon: a double-blind, controlled, prospective study. Am J Med. 1989; 86(2):158–64.
- Hristovski D, Rindflesch T, Peterlin B. Using literature-based discovery to identify novel therapeutic approaches. Cardiovasc Hematol Agents Med Chem. 2013; 11(1):14–24.
- Gordon MD, Lindsay RK. Toward discovery support systems: a replication, re-examination, and extension of Swanson’s work on literature-based discovery of a connection between Raynaud’s and fish oil. J Am Soc Inform Sci. 1996; 47(2):116–28.
- Preiss J, Stevenson M, Gaizauskas R. Exploring relation types for literature-based discovery. J Am Med Inform Assoc. 2015; 22:987–92.
- Swanson DR, Smalheiser NR, Torvik VI. Ranking indirect connections in literature-based discovery: The role of medical subject headings. J Am Soc Inform Sci Technol. 2006; 57(11):1427–39.
- Yetisgen-Yildiz M, Pratt W. Evaluation of literature-based discovery systems. In: Bruza P, Weeber M, editors. Literature-Based Discovery. New York: Springer; 2009. p. 101–13.
- Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004; 32:267–70.
- Weeber M, Vos R, Klein H, de Jong-van den Berg LTW. Using concepts in literature-based discovery: Simulating Swanson’s Raynaud – fish oil and migraine – magnesium discoveries. J Am Soc Inform Sci Technol. 2001; 52(7):548–57.
- Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010; 17(3):229–36.
- Cameron D, Kavuluru R, Rindflesch TC, Sheth AP, Thirunarayan K, Bodenreider O. Context-driven automatic subgraph creation for literature-based discovery. J Biomed Inform. 2015; 54:141–57.
- Shang N, Xu H, Rindflesch TC, Cohen T. Identifying plausible adverse drug reactions using knowledge extracted from the literature. J Biomed Inform. 2014; 52:293–310.
- Kontostathis A, Pottenger WM. A framework for understanding latent semantic indexing (LSI) performance. Inform Process Manage. 2006; 42(1):56–73.
- Wilkowski B, Fiszman M, Miller CM, Hristovski D, Arabandi S, Rosemblat G, Rindflesch TC. Graph-based methods for discovery browsing with semantic predications. In: AMIA Annual Symposium Proceedings. Washington: 2011. p. 1514–1523.
- Godsil C, Royle G. Algebraic Graph Theory. New York: Springer; 2001.
- Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003; 36(6):462–77.
- NLM: Semrep. http://semrep.nlm.nih.gov/. Accessed: 2017-03-17.
- Atkins BTS, Rundell M. The Oxford Guide to Practical Lexicography. Oxford: Oxford University Press; 2008.
- Fung KW, Hole WT, Nelson SJ, Srinivasan S, Powell T, Roth L. Integrating SNOMED CT into the UMLS: An exploration of different views of synonymy and quality of editing. J Am Med Inform Assoc. 2005; 12(4):486–94.
- Preiss J. Excluded semantic types. http://kdisc.rcweb.dcs.shef.ac.uk/data/filtered_semtypes.txt. Accessed: 2017-03-17.
- Preiss J. Seeking informativeness in literature based discovery. In: Proceedings of BioNLP 2014. Baltimore: 2014. p. 112–7.
- Srinivasan P. Generating hypotheses from MEDLINE. J Am Soc Inform Sci Technol. 2004; 55(5):396–413.
- Yetisgen-Yildiz M, Pratt W. A new evaluation methodology for literature-based discovery. J Biomed Inform. 2009; 42(4):633–43.
- Hu X, Zhang X, Yoo I, Zang Y. A semantic approach for mining hidden links from complementary and non-interactive biomedical literature. In: SDM. Bethesda: 2006. p. 200–9.
- Swanson DR. Somatomedin c and arginine: Implicit connections between mutually isolated literatures. Perspect Biol Med. 1990; 33(2):157–86.
- Smalheiser NR, Swanson DR. Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease. Neurosci Res Commun. 1994; 15(1):1–9.
- Smalheiser NR, Swanson DR. Indomethacin and Alzheimer’s disease. Neurology. 1996; 46:583.
- Smalheiser NR, Swanson DR. Linking estrogen to Alzheimer’s disease. Neurology. 1996; 47:809–10.
- Smalheiser NR, Swanson DR. Calcium-independent phospholipase A2 and schizophrenia. Arch Gen Psychiatr. 1997; 55(8):752–3.