Quantifying and filtering knowledge generated by literature based discovery

Background: Literature based discovery (LBD) automatically infers missed connections between concepts in literature. It is often assumed that LBD generates more information than can be reasonably examined.
Methods: We present a detailed analysis of the quantity of hidden knowledge produced by an LBD system and the effect of various filtering approaches upon this. The investigation of filtering combined with single or multi-step linking term chains is carried out on all articles in PubMed.
Results: The evaluation is carried out using both replication of existing discoveries, which provides justification for multi-step linking chain knowledge in specific cases, and timeslicing, which gives a large scale measure of performance.
Conclusions: While the quantity of hidden knowledge generated by LBD can be vast, we demonstrate that (a) intelligent filtering can greatly reduce the number of hidden knowledge pairs generated, (b) for a specific term, the number of single step connections can be manageable, and (c) in the absence of single step hidden links, considering multiple steps can provide valid links.

In the frequently used A-B-C model [2], LBD proposes a hidden connection between two previously unconnected terms, A and C, if there is a document linking A to some term B and the same B is linked to C elsewhere. Clearly, in open discovery where only A is specified (shown at the top of Fig. 1), the quantity of hidden connections suggested rises with input and so LBD systems frequently grossly restrict scale. When system execution and evaluation is not restricted to a toy example, numerous output reductions are put in place, including filtering of terms (whether by discarding uninformative terms or restricting terms to, say, diseases and treatments only), restricting either the time period from which hidden knowledge is generated or the segment of the abstract that knowledge is drawn from (e.g. titles only) and re-ranking of the subsequently produced hidden knowledge, often targeted to the search for a specific discovery or type of discoveries.
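The A-B-C inference described above can be sketched in a few lines of Python. This is a minimal illustration rather than the system evaluated in this paper: the function name `open_discovery`, the undirected treatment of links and the toy data (including the `vasoconstriction` entry) are assumptions made for the example.

```python
from collections import defaultdict

def open_discovery(pairs, a):
    """Return {C: set of linking B terms} for hidden connections from `a`.

    `pairs` are directly observed links; each is treated as undirected
    (an assumption made for this sketch).
    """
    neighbours = defaultdict(set)
    for x, y in pairs:
        neighbours[x].add(y)
        neighbours[y].add(x)

    hidden = defaultdict(set)
    for b in neighbours[a]:
        for c in neighbours[b]:
            # C must differ from A and must not already be linked to A.
            if c != a and c not in neighbours[a]:
                hidden[c].add(b)
    return dict(hidden)

# Toy links, invented for illustration only.
links = [
    ("raynaud disease", "blood viscosity"),
    ("blood viscosity", "fish oil"),
    ("raynaud disease", "vasoconstriction"),
]
print(open_discovery(links, "raynaud disease"))
```

Every additional link attached to a B term adds candidate C terms, which is why the output grows so quickly with the size of the input.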
Without filtering, large scale LBD becomes computationally difficult and the resulting hidden knowledge can be practically unusable. To give an idea of the scale, consider the frequently used approach of taking title word co-occurrence as an indication of relatedness (i.e. requiring one title to contain Raynaud disease and blood viscosity and another to contain blood viscosity and fish oil to propose a connection between Raynaud disease and fish oil): there are over 92,000 distinct words in titles of PubMed articles between 1700 and 2005, giving rise to over 561,000 co-occurring pairs. Following all of these co-occurrences yields far more hidden knowledge than can be explored. Some filtering is therefore required; however, it is crucial that important links or terms are not removed. Previously explored filtering options include:
1. Time period. Much of the earlier LBD work restricted the knowledge base to a reduced time segment: for example, Gordon and Lindsay [5] restricted publications to the years 1983-1985 when they sought to replicate Swanson's [2] fish oil - Raynaud disease connection.
2. Relation. Preiss et al. [6] show that employing more sophisticated definitions of links between terms (relations) greatly reduces the number of hidden knowledge pairs generated without detrimental effect on performance.
3. Stoplist. For example, Swanson et al. [7]
The A-B-C model generates justification(s) for each hidden connection, namely the linking (B) terms: Raynaud disease and fish oil were found to be connected via the linking term blood viscosity, with one decreasing and the other increasing it. However, hidden knowledge can also be identified by following longer paths of linking terms, i.e. A → b_1 → b_2 → ··· → b_n → C. This approach shows promise and has already been explored.
For example, Kontostathi and Pottenger [14] investigate paths of linking terms generated by co-occurrence and Wilkowski et al. [15] show the feasibility of the approach on a single A and C pair. However, previous evaluation of this approach has been restricted to small numbers of examples and no large scale evaluation has yet been carried out.
The novelty of our work lies in a detailed analysis of the quantity of hidden knowledge produced and the effect of various filtering approaches upon this. This thorough investigation of filtering combined with single or multistep linking term chains is, to our knowledge, the first comprehensive investigation of this type.

Literature based discovery system
We use an LBD system which accepts an adjacency matrix M describing relations between pairs of terms in a term collection: the entry m_ij is a positive integer if a relation R is detected between terms t_i and t_j. If t_i and t_j are not directly related anywhere in the document collection, m_ij will be zero. Using graph theory [16], any non-zero terms in […], where norm converts m_ij to 1 if m_ij > 0 and leaves 0 otherwise, represent connections via one linking step. The system can be extended to find connections via any number of linking steps, for example any positive (non-zero and non-negative) terms in […] represent connections via two linking steps. Similarly, connections via three steps can be obtained and so on.
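The matrix formulation can be illustrated with a small pure-Python sketch. The exact matrix expressions are not reproduced in this text, so the code below is one plausible reading rather than the system's definition: it relies on the standard graph theory fact that powers of an adjacency matrix count walks, and reports pairs that first become connected after a given number of linking terms. The helper names `matmul` and `hidden_pairs` and the toy matrix are assumptions.

```python
def norm(m):
    """Convert every positive entry to 1 and leave 0 otherwise."""
    return [[1 if v > 0 else 0 for v in row] for row in m]

def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def hidden_pairs(m, steps=1):
    """Index pairs (i, j) first connected via exactly `steps` linking terms."""
    n = len(m)
    adj = norm(m)
    # Pairs already accounted for: identical terms and direct relations.
    seen = [[1 if i == j or adj[i][j] else 0 for j in range(n)]
            for i in range(n)]
    reach = adj
    new = set()
    for _ in range(steps):
        reach = norm(matmul(reach, adj))  # walks one relation longer
        new = {(i, j) for i in range(n) for j in range(n)
               if reach[i][j] and not seen[i][j]}
        for i, j in new:
            seen[i][j] = 1
    return new

# Toy chain t0 - t1 - t2 - t3; entries are (invented) relation counts.
M = [[0, 2, 0, 0],
     [2, 0, 1, 0],
     [0, 1, 0, 3],
     [0, 0, 3, 0]]
print(sorted(hidden_pairs(M, steps=1)))
print(sorted(hidden_pairs(M, steps=2)))
```

At the scale of Medline a sparse matrix representation would be needed; dense nested lists are used here only to keep the example self-contained.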

Relations
The LBD system described above relies on the existence of a relation between a pair of terms. We base our relations on the output of the SemRep system [17], which uses underspecified syntactic processing and UMLS [9] domain knowledge to extract subject-relation-object triples (such as X-treats-Y or X-affects-Y) from biomedical texts. Building on the output of MetaMap [11], SemRep extracts a number of positive and negative relations as well as positive and negative comparative relations. For example, from the sentence in 1 SemRep extracts the relations in 2 (terms are presented here for ease of understanding; SemRep extracts CUIs rather than terms):
1. We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.
2. Hemofiltration-TREATS-Patients
   Digoxin overdose-PROCESS_OF-Patients
   hyperkalemia-COMPLICATES-Digoxin overdose
   Hemofiltration-TREATS(INFER)-Digoxin overdose
A SemRep processed version of Medline is available from NLM [18], and we use all positive relations from semmedVER24_2 processed up to 30 June 2014 (which contains 70,364,020 relations) to populate the adjacency matrix M.
A clear advantage of SemRep is its output in 'UMLS CUI -relation -UMLS CUI' format: this excludes noncontent words, as CUIs do not exist for these, and ensures that a hidden connection is found via a compatible sense of a term. For example, the top five UMLS senses of cold are: cold temperature, common cold, cold therapy, chronic obstructive lung disease and cold sensation. If SemRep did not yield CUIs, its output would be: If the two senses of cold were not differentiated, a hidden connection could be found between mechanical ventilators and rhinovirus.
While mapping to CUIs clearly reduces the number of incorrect hidden knowledge pairs, it will not eliminate connections via general terms -words such as patient, clinical study or week. The following section discusses a number of filtering techniques.

sy - synonym merging
As with any type of dictionary, the creators must decide how to divide up senses, a choice termed lumping or splitting in lexicography [19]. As UMLS is composed of terms from multiple source vocabularies, the splitting / lumping decision is not consistent throughout. In general, there is a tendency to split senses and later merge them based on application. We found separate CUIs for Raynaud disease and Raynaud phenomenon, and Cameron et al. [12] manually augment their selected A and C terms "with related concepts"; we therefore investigate an automatic approach to synonym merging within UMLS.
Note that merging synonyms will affect the quantity of A and C terms as well as possible linking terms, B:
• If the start point, A (e.g. Raynaud disease), has multiple synonymous CUIs, merging these will result in the generation of more hidden knowledge from A.
• If a linking term CUI is equivalent to the CUI of another term, there could be more hidden knowledge generated. Examination of documents supporting a Raynaud disease - fish oil link reveals that some of the expected connections from CUI C0034734 (Raynaud disease) are linked to CUI C0034735 (Raynaud phenomenon) instead. Although the two terms are synonymous for the purposes of evaluating the Raynaud disease - fish oil link, their different CUIs mean that the connection will not be found.
• Since synonymous pairs of hidden knowledge (and linking terms) will merge, this will reduce the burden on a user.
The synonymous, SY, relation within UMLS is source asserted synonymy, and thus is listed alongside a source. The quality of synonyms in UMLS has been questioned [20], and we evaluate the synonym classes created (these are formed by gathering all synonyms, and their synonyms etc., into disjoint classes) for synonyms asserted by one or more sources. Table 1 displays details of the synonym information broken down by the minimum number of sources supporting each extracted SY relationship (the first column), with the second column representing the number of synonym classes with at least 2 distinct CUIs. The remaining columns describe the synonym classes: the largest synonym class (max), the number of synonym classes containing at least X CUIs (> X) and the mean synonym class size.
Basing classes on 1 source produces a class of 74 elements, which is unlikely to contain synonyms useful for knowledge discovery. Synonym classes with more than 20, or even 10, elements are similarly unlikely to be helpful: we employ synonym classes supported by at least two sources with class size ≤ 5, which leaves 614 synonym classes and reduces the original 2,868,943 UMLS CUIs to 2,867,188 distinct CUIs (restricting to CUIs that appear in SemRep relations in Medline, this reduces the original 485,538 CUIs to 484,924 distinct CUIs).
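The construction of synonym classes (gathering synonyms, and their synonyms, into disjoint classes, then keeping classes supported by at least two sources with at most five members) can be sketched with a union-find structure. This is an illustrative sketch only: the input layout `(cui1, cui2, source)`, the function name `synonym_classes` and the source labels are assumptions, and the example rows are invented rather than taken from UMLS.

```python
from collections import defaultdict

def synonym_classes(sy_rows, min_sources=2, max_size=5):
    """Group CUIs into disjoint synonym classes.

    sy_rows: iterable of (cui1, cui2, source) SY assertions (assumed layout).
    A pair is used only if at least `min_sources` distinct sources assert it;
    classes larger than `max_size` are discarded.
    """
    sources = defaultdict(set)
    for c1, c2, src in sy_rows:
        if c1 != c2:
            sources[frozenset((c1, c2))].add(src)

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair, srcs in sources.items():
        if len(srcs) >= min_sources:
            a, b = pair
            parent[find(a)] = find(b)      # union the two classes

    classes = defaultdict(set)
    for cui in list(parent):
        classes[find(cui)].add(cui)
    return [c for c in classes.values() if 2 <= len(c) <= max_size]

# Invented assertions: the source names are placeholders, and whether UMLS
# actually asserts SY for these CUIs is not claimed here.
rows = [
    ("C0034734", "C0034735", "SOURCE_1"),
    ("C0034735", "C0034734", "SOURCE_2"),
    ("C0000001", "C0000002", "SOURCE_1"),  # single source: ignored
]
print(synonym_classes(rows))
```

Each retained class can then be mapped to a single representative CUI before the adjacency matrix is populated.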

st - semantic type
The UMLS Semantic Network contains a hierarchy of subject categories, semantic types (STs), with at least one assigned to each CUI. Previous work selects a number of STs allowed to act as linking or target terms as these are thought to describe the type of desired hidden knowledge; for example Yetisgen-Yildiz and Pratt [8] allow both linking and target terms to be chemicals & drugs and genes & molecular sequences, but linking terms can also be disorders, physiology and anatomy members.
Rather than restricting CUIs to a small number of STs according to the type of discovery expected (which requires prior knowledge), we explore a number of general ST exclusions:

clt - common linking terms stoplist
Some CUIs correspond to terms which are clearly too general, but their ST also contains useful CUIs and therefore should not be removed. Although UMLS is hierarchically structured, and thus general terms could be expected closer to root nodes, it is composed of multiple hierarchies with different levels of granularity and so an overall threshold is unlikely to be found. The hypothesis that a CUI which frequently acts as a linking term is unlikely to be informative gives rise to an automatic technique for building a stoplist, shown in Fig. 3 [22]. We create our stoplist from the 1865-2000 segment of Medline.
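The underlying hypothesis can be made concrete with a short sketch. The procedure of Fig. 3 is not reproduced in this text, so the code below is not that algorithm: it is a simple stand-in that counts, for each CUI, how many otherwise unconnected pairs it links and places the most frequent linkers on a stoplist. The function name `linking_term_stoplist`, the `top_n` cutoff and the toy data are assumptions.

```python
from collections import Counter, defaultdict

def linking_term_stoplist(pairs, top_n):
    """Return the `top_n` CUIs that most often act as a linking term."""
    neighbours = defaultdict(set)
    for x, y in pairs:
        neighbours[x].add(y)
        neighbours[y].add(x)

    counts = Counter()
    for b, nbrs in neighbours.items():
        ordered = sorted(nbrs)
        for i, a in enumerate(ordered):
            for c in ordered[i + 1:]:
                # b acts as a linking term for (a, c) if they are not
                # directly related themselves.
                if c not in neighbours[a]:
                    counts[b] += 1
    return [cui for cui, _ in counts.most_common(top_n)]

# Toy relations, invented for illustration.
relations = [
    ("patients", "x1"),
    ("patients", "x2"),
    ("patients", "x3"),
    ("x1", "y"),
]
print(linking_term_stoplist(relations, top_n=1))
```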

break - breaking common linking term connections
A filtering technique based on a stoplist needs to be quite conservative so that it does not remove useful terms, and it is therefore likely to leave some unhelpful terms. Breaking common linking term connections is a fundamentally different idea from creating a stoplist: instead of finding frequently appearing terms, this approach bases its decisions on the number of terms a given term is connected to.
Terms A and B are related if a (non-negative) SemRep relation exists between them. An uninformative word, such as study or patient, can be expected to be connected to a large number of CUIs. The hypothesis that highly connected terms are likely to be fairly general (and therefore not useful linking terms) gives rise to the following filtering options:

1. When creating the adjacency matrix, break (discard) all connections to CUI A when C(A) > threshold.
2. Discard the connection between CUIs A and B when min(C(A), C(B)) > threshold.
(Where C(A) represents the number of CUIs linked to A, and the threshold needs to be empirically determined.)
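The two options differ in granularity, which a small sketch makes visible. This is a minimal illustration under stated assumptions: relations are given as an undirected list of CUI pairs, and the function names `break_by_term` and `break_by_pair`, the threshold value and the toy data are all invented for the example.

```python
from collections import defaultdict

def connectivity(pairs):
    """C(X): the number of distinct CUIs linked to each CUI X."""
    nbrs = defaultdict(set)
    for a, b in pairs:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return {cui: len(linked) for cui, linked in nbrs.items()}

def break_by_term(pairs, threshold):
    """Option 1: discard every connection to a CUI with C(.) > threshold."""
    c = connectivity(pairs)
    return [(a, b) for a, b in pairs
            if c[a] <= threshold and c[b] <= threshold]

def break_by_pair(pairs, threshold):
    """Option 2: discard a connection only if min(C(A), C(B)) > threshold."""
    c = connectivity(pairs)
    return [(a, b) for a, b in pairs if min(c[a], c[b]) <= threshold]

# Toy relations: "patients" and "study" are each linked to three CUIs.
relations = [
    ("patients", "study"),
    ("patients", "x1"),
    ("patients", "x2"),
    ("study", "x3"),
    ("study", "x4"),
    ("x1", "x2"),
]
print(break_by_term(relations, threshold=2))
print(break_by_pair(relations, threshold=2))
```

With this toy input, option 1 keeps only the x1 - x2 relation, whereas option 2 removes just the patients - study relation, since that is the only connection where both ends exceed the threshold.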

Results and discussion
LBD is clearly difficult to evaluate: by virtue of the generated knowledge being new, there is no gold standard for comparison. Two standard techniques for evaluation exist: 1) replication of existing discoveries (e.g. [5,10,23]), where discoveries made using previous LBD systems are collected from literature and a new LBD system is employed over the same time segment in an attempt to produce the same discovery, and 2) timeslicing [24], which allows the generation of precision and recall figures by allowing a gold standard to be automatically created from publications after a cut off date with hidden knowledge generated from publications prior. We present both types of evaluation below.

Replication of existing discoveries
Seven separate discoveries which have previously been used for replication experiments were identified from the LBD literature. For each, we give the time segment used in the original discovery, the number of direct connections and the abbreviation used in Table 2. The number of direct connections is the number of documents containing a direct link between the A and B terms, which we remove. Such links can be present for a number of reasons: a) LBD is being employed to suggest alternatives, e.g. alternative treatments, or b) the connection was removed in previous work, for example due to a manual inspection showing that A and B are not related despite co-occurring in the same title.
1. [RD-fsh] Raynaud disease - fish oil [5,10,25]; 1960-1985, no direct connections.
Figure 4 shows the effect that the various filtering algorithms have on the number of SemRep relations remaining within each discovery's segment (the filtered number of direct connections remaining is averaged over the 7 discoveries and the percentage remaining, in comparison to the original, unfiltered set, is presented). Except for the unfiltered, original results, all filtering results include synonym merging; clt and break are added on top of manual semantic type filtering. The graph shows that filtering is an effective way of reducing the number of relations: with the exception of synonym merging alone, all filtering approaches reduce the number of direct relation pairs by at least 50%.
Fig. 4 Percentage of original relations remaining after filtering
Table 2 presents the number of linking terms found when each discovery is replicated: the upper part of the table describes the number of linking terms corresponding to single step connections. Since the single step approach sometimes fails to find a connection, two step connections are sought and are presented in the lower part of the table. The number of linking terms generated for each discovery is clearly linked to the number of connections input. However, despite the connection being inferrable using co-occurrence, the reduction to zero hidden links using SemRep combined with filtering is valid. For example, the two linking terms connecting Raynaud disease to fish oil with synonym merging are C0029064 (operating theatre) and C0040426 (set of teeth). Neither clearly supports the connection, and both are therefore justifiably removed by semantic type filtering.
It is interesting that in most filtering cases, the frequently cited Raynaud disease - fish oil connection is not replicated: this was revealed to be due to a combination of synonym failure and the relation employed not extracting the necessary connections. The schizophrenia - Ca 2+ iPLA2 link is not replicated via a single step as Ca 2+ iPLA2 appears only a few times in Medline and is only seen in one SemRep relation. In these cases, it is worth examining two step connections:

schizophrenia -Ca 2+ iPLA2
The 9 two step connections for schizophrenia (C0036341)-Ca 2+ iPLA2 (C0538273) generated with manual filtering can be seen below: As the UMLS definition states, the parathyroid hormone elevates blood Ca2+ levels and thus is related to the PLA2G6 protein, showing all nine connections to be worthy of further consideration.

3. C0005848 (blood viscosity) - C0232338 (blood flow function)
4. C0005848 (blood viscosity) - C0206502 (hemorheology)
Captopril and pentoxyphylline are used in the treatment of Raynaud's phenomenon, and both Raynaud's and fish oil are known to affect blood viscosity. The blood viscosity links are also supported by the linking term analysis in [10].
These findings suggest that multi linking term exploration is worth pursuing when a connection is suspected but is not found via a single step connection.
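Using multi-step chains only when a single step fails amounts to a backoff search in closed discovery, which can be sketched as follows. This is a hedged illustration, not the system used in the experiments: the function name `linking_chains`, the undirected treatment of relations, the `max_links` limit and the placeholder terms are all assumptions, and A and C are assumed to have no direct connection.

```python
from collections import defaultdict

def linking_chains(pairs, a, c, max_links=2):
    """Closed discovery with backoff.

    Try chains with one linking term first; only if none exist, try two,
    and so on up to `max_links`. Returns (number of linking terms, chains).
    """
    nbrs = defaultdict(set)
    for x, y in pairs:
        nbrs[x].add(y)
        nbrs[y].add(x)

    for n_links in range(1, max_links + 1):
        chains = []
        stack = [(a,)]
        while stack:
            path = stack.pop()
            if len(path) == n_links + 1:
                # Path holds A plus n_links linking terms: can it reach C?
                if c in nbrs[path[-1]]:
                    chains.append(path[1:])
                continue
            for nxt in nbrs[path[-1]]:
                if nxt != c and nxt not in path:
                    stack.append(path + (nxt,))
        if chains:
            return n_links, sorted(chains)
    return None, []

# Placeholder relations: A and C are only connected via b1 and b2.
relations = [("A", "b1"), ("b1", "b2"), ("b2", "C"), ("A", "b3")]
print(linking_chains(relations, "A", "C"))
```

Stopping at the first chain length that yields a result keeps the output small, in line with using two step connections only when a single step connection cannot be found.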

Timeslicing
One of the drawbacks of evaluating LBD by replicating existing discoveries is that it relies on the use of small test sets. Timeslicing is an alternative approach that allows larger test collections to be created automatically. A cutoff date is chosen, hidden knowledge is generated from publications published prior to this date and the resulting pairs are compared to the new knowledge published after the cutoff (as identified by the used relation) [24]. This section reports results using the timeslicing approach.
The results of a timeslicing evaluation are presented in Table 3. These include the total number of pairs of hidden knowledge generated, the number of generated pairs which appear in the gold standard ("correct"), the precision (the percentage of pairs generated that are in the gold standard), recall (the percentage of pairs present in the gold standard that were generated) and the F-measure, a combination of precision and recall. Again, the upper part of the table represents results for single step connections, and the lower part represents both one and two step connections. There is an obvious trade-off between precision and recall: the higher the number of pairs returned, the greater the recall, but the less likely it is that a person can go through the resulting knowledge. The last column lists the average number of pairs of hidden knowledge generated per term seen in the segment. The overall precision can be expected to be low for a number of reasons, including:
1. Even if the hidden knowledge generated is genuine, it may not have been discovered yet within the segment on which the gold standard is based.
2. Some knowledge will be known but never published as it is considered 'obvious'. An LBD system will generate such knowledge nonetheless.
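The measures reported for timeslicing can be computed as in the sketch below. It assumes that hidden knowledge and the gold standard are both available as sets of term pairs, and that the F-measure is the balanced harmonic mean of precision and recall (the text only describes it as a combination of the two); the function name `timeslice_scores` and the toy pairs are invented for illustration.

```python
def timeslice_scores(generated, gold):
    """Return (correct, precision, recall, F) as fractions in [0, 1].

    generated: pairs proposed from publications before the cutoff date.
    gold: pairs first appearing in publications after the cutoff date.
    """
    generated, gold = set(generated), set(gold)
    correct = len(generated & gold)
    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return correct, precision, recall, 0.0
    # Balanced F-measure (an assumption about the combination used).
    f_measure = 2 * precision * recall / (precision + recall)
    return correct, precision, recall, f_measure

# Toy pairs: one of four generated pairs appears in the gold standard.
generated = {("a", "c"), ("a", "d"), ("b", "e"), ("b", "f")}
gold = {("a", "c"), ("g", "h")}
print(timeslice_scores(generated, gold))
```

Because precision divides by the number of generated pairs, a system that proposes millions of pairs scores low even when many of them later appear in the literature.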
While the total amount of hidden knowledge generated may seem unmanageable and not useful, it is supported by manual findings: UMLS includes manually identified relations, and on average goes through 2 releases per year. There were 29,936,977 instances of relations added between the 2005AA and 2015AA versions of UMLS. Since 29,936,977 × 4 = 119,747,908 (compared with 131,199,050 hidden knowledge pairs generated using break), the quantity of hidden knowledge produced no longer seems unreasonable. In turn, low precision accounts for the low F-measure; again, this is not unusual for large scale timeslicing results in LBD [6]. The highest F-measure for single step connections (3.22e-03) is achieved by breaking common linking term connections. In this setting the amount of hidden knowledge generated (an average of 2,232 pieces per term) is much more manageable than the amount generated without filtering (34,986 pieces per term).
The very low recall (and precision) of the two step connections is caused by the high recall from the one step connections, which results in a large number of two step connections being generated (see Literature based discovery system).
The large number of two step connections will, in general, make the results unusable for open discovery. However, two step connections provide a good backoff for closed discovery when a link is suspected but cannot be found using a single step. This appears to be more common with rarer, and thus more likely specific, concepts (such as Ca 2+ iPLA2).
The results do raise the question of whether precision and recall are good measures for evaluating large scale LBD systems.

Conclusions
We present an extensive discussion of filtering within literature based discovery, and show that using a more sophisticated definition of relation as well as UMLS CUIs (rather than terms directly) is insufficient in itself to yield usable quantities of hidden knowledge. We explore a number of different approaches and show their effect on both replication and timeslicing evaluations. We find the best performance from the rarely used approach which breaks connections on a term pair basis, rather than removing entire terms.
Based on the results of replication of existing discoveries, we propose that the quantity of hidden knowledge generated for a term A will be proportional to its overall frequency within the corpus, and argue that the high proportion of frequent terms is the cause of the low F-measure found using timeslicing evaluation. A comparison with the number of relations (manually) added to UMLS on each release also suggests a high expected number of hidden knowledge pairs.
We also examine the possibility of generating hidden knowledge between terms A and C using a chain of multi-step linking terms b_i, i.e. A → b_1 → ··· → b_n → C with no direct connection between A and C. While such an approach clearly generates an unmanageable quantity of data in open mode, its value can be seen when a single step connection fails to be found in closed mode: in this case, some multi-step connections may be suggested and we propose using a multi-step system as a backoff for failed single connection LBD.