Clustering cliques for graph-based summarization of the biomedical research literature
© Zhang et al.; licensee BioMed Central Ltd. 2013
Received: 5 December 2012
Accepted: 29 May 2013
Published: 7 June 2013
Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts).
SemRep is used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications with highly connected arguments filtered by degree centrality. Themes contained in the summary were identified with a hierarchical clustering algorithm based on common arguments shared among cliques. The validity of the clusters in the summaries produced was compared to the Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced with major MeSH headings.
For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10 percentage points higher than the baseline (43% versus 33%). When compared to the reference standard from MeSH headings, the results for recall, precision and F-score were 0.64, 0.65, and 0.65 respectively.
Automatic summarization is emerging as a viable information processing mechanism to help users effectively access the large amount of textual data available online, especially in the biomedical domain. Such processing distils the most important information from source documents to produce an abridged condensate that serves as an informative and indicative summary of a given topic [1, 2]. Summarization is often thought of as a natural language processing task due to the need for in-depth understanding of text to provide a useful summary. The analysis of source text may take various forms. In earlier work this was often limited to textual cues that identify salient information, while more recent research may involve concepts in a domain ontology and semantic relation extraction [4, 5].
Although Semantic MEDLINE shows promise in managing the results of PubMed searches, it produces graphs that are too large and dense when generating summaries from more than a few hundred citations. This characteristic does not accommodate the thousands of citations that may be returned by a PubMed query (for example, nearly 150,000 for “breast cancer”). In earlier work, we exploited the graph theoretic notion of degree centrality to reduce large graphs by retaining only highly connected concepts. The method is effective in presenting readable, focused information to the user. However, it relies on predefined schemas, which must be devised for each topic point of view, and thus limits the applicability of this summarization methodology in covering the thematic diversity seen in the biomedical research literature.
In this paper, we explore a graph-based method to make automatic summarization more robust when confronted with large numbers of MEDLINE citations, without using predefined schemas. This multidocument method is based on a network representation of the semantic predications extracted from citations (titles and abstracts) returned by a PubMed query. Cliques are first identified in this graph and then clustered and labeled to identify several points of view represented in the summary. Since schemas are not used, the method is applicable to any biomedical topic.
The primary contribution of this paper is the application of graph theoretic constructs to semantic predications for automatic summarization in the biomedical domain. We also introduce a novel semantics-based criterion for determining final clusters, which is compared to a silhouette coefficient method. Finally, we evaluate cluster validity and accuracy, as well as the quality of the summary.
SemRep semantic interpretation
The clustering method used for automatic summarization in this study depends on cliques identified in a graph of semantic predications extracted from PubMed citations with SemRep [5, 7], a symbolic rule-based natural language processing system relying heavily on biomedical domain knowledge in the Unified Medical Language System (UMLS). Extraction of predications begins with an underspecified syntactic analysis based on the SPECIALIST Lexicon and MedPost part-of-speech tagger. MetaMap maps simple noun phrases from this structure to Metathesaurus concepts, and “indicator rules” map syntactic elements to UMLS Semantic Network predicates. SemRep extracts 30 predicate types, in the domains of clinical medicine (e.g. TREATS, DIAGNOSES), substance interactions (e.g. INTERACTS_WITH, INHIBITS, STIMULATES), etiology (e.g. CAUSES, PREDISPOSES), and pharmacogenomics (e.g. AFFECTS, AUGMENTS, DISRUPTS). Syntactic processing then identifies arguments (noun phrases mapped to Metathesaurus concepts) for each predicate. As an example, the predications (argument-predicate-argument) below were extracted from the text “Patients with single brain lesion received an extra 3 Gy x 5 radiotherapy.”
Brain - LOCATION_OF - Single lesion
Single lesion - PROCESS_OF - Patients
Radiation therapy - ADMINISTERED_TO - Patients
Automatic summarization condenses source text into an abbreviated version representing salient information. Most methods exploit an extractive process that selects informative text strings from the source and concatenates them into a summary. Fewer attempts have been made to generate an abstractive summary, which processes the source text and represents it using terms not found in the source. Both summarization techniques depend on identification of salient source content, either through informative textual cues, term frequency, or, more recently, graph-based metrics.
Identifying salient source content
Most frequency-based methods provide extractive summaries composed of source sentences containing frequently occurring content units. Nenkova and Vanderwende assessed the contribution of frequency of occurrence to summarization, which is considerable. Reeve et al. [16, 17] further exploit domain ontologies to identify salient information.
Recently, graph structures have been used to represent source content to be summarized. Often, terms or sentences are represented as nodes and relations between them as arcs; however, abstractive representations are also used in graph-based analysis. Graph theory-based metrics have been proposed to identify salient information. Two commonly used metrics are degree centrality and eigenvector centrality, and both are based on connectedness. Degree centrality is determined by the connecting arcs a node has, normalized for the size of the graph, while eigenvector centrality is computed based on the connections a node has along with the connectedness of neighboring nodes. Several studies (e.g. [18-20]) have shown that degree centrality, when compared to other connectedness metrics, performs best for most tasks. LexRank and TextRank have applied connectedness metrics to generate multidocument summaries. In LexRank, for example, nodes represent sentences and arcs represent similarity between them. Node connectedness is used to identify prominent sentences as a summary.
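The degree centrality just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is treated as undirected and unweighted, and the score is normalized by n - 1 (a common convention; the text says only that the score is "normalized for the size of the graph"). The node names are placeholders.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalized degree centrality for each node of an undirected graph.

    Assumption: normalization by (n - 1), the maximum possible degree.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Placeholder graph: node "A" is connected to every other node.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
scores = degree_centrality(edges)
```

Here "A" receives the maximum score because it is linked to all three other nodes, while "D", with a single link, receives the lowest.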
In addition to text, biomedical data can also be represented as a graph, with nodes representing biological entities (e.g. genes or proteins) and edges associations between them. For example, protein-protein interactions can be successfully modeled by a graph. Based on the recognition of cohesive subgroups (such as cliques), gene or protein complexes can be extracted to help predict protein interactions or find gene-disease relations [22, 23].
Cliques in graph theory
Identifying cliques can help find cohesive subgroups in a graphical network. Usually, each node in a clique is, in some way, highly related to every other node. This characteristic makes clique identification a very important approach to uncover meaningful groups from a network, such as protein-complex discovery from protein-protein interaction networks and collaborating groups from co-authorship networks. Zubcsek et al. clustered cliques to identify information communities with UCINET. Taking advantage of node overlap among cliques, Ah-Pine and colleagues proposed a clique-based clustering method to annotate named entities.
Identification of various themes contained in the summary can help users locate specific information they are interested in and link to relevant source documents. Theme identification, also known as topic identification or topic discovery, is the process of assigning one or more labels to text. To discriminate it from the topic of a summary, we refer to this task as “theme identification” in this paper.
Theme identification is particularly important in multi-document summarization. To avoid similar information repetitiously appearing in the summary, Stein et al. grouped their summaries from single documents into clusters and selected a representative passage from each cluster to construct the final summary. Other studies clustered documents before performing summarization in order to help users select clusters of interest.
Clustering is a very powerful data mining technique for identifying and labeling themes in a group of documents, and both k-means and hierarchical clustering are used for this task. For each cluster, features such as keywords, terms, or sentences are chosen as the label (or theme). K-means clustering [30, 31] groups documents into a predefined number of classes. It is often used when the number of classes for the documents is known, in which case the known classification serves as a reference standard to evaluate the final clusters generated. In reality, it may be hard to obtain expert-determined classification for thousands of biomedical documents, and hierarchical clustering is often used instead.
Hierarchical clustering is an unsupervised method that does not require setting a predefined number of clusters for the documents. When using hierarchical clustering to group documents and generate labels for the clusters, the vector space model is often adopted to produce term or keyword vectors, which help indicate similarity among documents [30, 32, 33]. Subsequently, documents are clustered into several subgroups, and terms or keywords that are salient for a given cluster are extracted as the theme (or label) for the cluster.
Table: Topics for training and testing sets (columns: Topic, No. of citations). Topics listed include Anti-inflammatory Agents, Non-steroidal; Hydroxymethylglutaryl-CoA Reductase Inhibitors; and Tumor Necrosis Factors.
Graphical representation for semantic predications
Besides frequency of occurrence of predications, we used two graphical constructs, degree centrality and cliques, to condense the graph into a summary of salient predications. Both degree centrality and clique detection help identify predications with high connectedness, which, along with frequently occurring arcs in the graph, convey information crucial to the summary. For example, in Figure 4, the predication “Deep Brain Stimulation TREATS Parkinson Disease” is identified as highly salient because the nodes representing its arguments have more connections than other nodes, and the frequency of occurrence of the arc between them is higher than that of other arcs.
Eliminate uninformative predications
Before graph theoretic techniques are applied to create a summary, predications with at least one generic argument are eliminated from the graph, which removes uninformative relationships as part of the condensing process of summarization [1, 2]. As noted earlier, arguments are Metathesaurus concepts, and generic arguments (e.g. “Patients”) are identified as occurring higher than an empirically determined cutoff in the UMLS hierarchy. For example, the predication “Pharmaceutical Preparations TREATS Parkinson Disease” is eliminated from the graph, while “Dopamine Agonists TREATS Parkinson Disease” is kept because “Pharmaceutical Preparations” in the former is high in the hierarchy, while both arguments in the second predication are lower.
Identify highly connected nodes
We assume that central nodes in the predication network are likely to represent important contents in the documents being summarized. In our previous study, we found that degree centrality effectively identifies information crucial to summarization for researchers and clinicians. In the current study, we used degree centrality to sort the concepts in the network, and then, based on training data, defined a degree centrality cutoff, which is the mean of the degree centrality scores plus half of their standard deviation. Predications in which both arguments have a degree centrality score above the cutoff are kept, while others are eliminated.
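A minimal sketch of this cutoff follows, reading it as the mean of the scores plus half of their standard deviation. The choice of population (rather than sample) standard deviation is an assumption, and the centrality scores and the placeholder concept "Concept X" are invented for illustration only.

```python
from statistics import mean, pstdev

def centrality_cutoff(scores):
    """Mean of the scores plus half of their standard deviation.

    Assumption: population standard deviation (pstdev); the text does not
    say which variant was used.
    """
    values = list(scores.values())
    return mean(values) + 0.5 * pstdev(values)

def filter_by_centrality(predications, scores):
    """Keep (subject, predicate, object) triples whose two arguments both exceed the cutoff."""
    cutoff = centrality_cutoff(scores)
    return [
        (s, p, o)
        for s, p, o in predications
        if scores.get(s, 0.0) > cutoff and scores.get(o, 0.0) > cutoff
    ]

# Invented scores, for illustration only.
scores = {
    "Parkinson Disease": 0.9,
    "Deep Brain Stimulation": 0.8,
    "Dopamine Agonists": 0.3,
    "Tremor": 0.2,
    "Concept X": 0.1,
}
predications = [
    ("Deep Brain Stimulation", "TREATS", "Parkinson Disease"),
    ("Dopamine Agonists", "TREATS", "Parkinson Disease"),
    ("Concept X", "CAUSES", "Tremor"),
]
kept = filter_by_centrality(predications, scores)
```

With these numbers the cutoff is roughly 0.62, so only the predication whose arguments are both highly connected survives.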
Eliminate predications with lower frequency of occurrence
Since frequency also plays an important role in automatic summarization, we calculated frequency of occurrence for the rest of the predications. The computation for frequency is based on how many citations a predication appears in. (When a predication occurs more than once in a single sentence, we count that occurrence as one.) A formula similar to that for degree centrality (the mean of the frequencies of occurrence plus half of their standard deviation) was adopted and predications with frequency of occurrence below the cutoff were eliminated from the graph.
After the first three steps, the predications remaining were those with high frequency of occurrence and having highly connected arguments; in the next step, cliques were identified in the graph of these predications. The tool used to identify cliques and cluster them in the next step is UCINET 6, a social network analysis package particularly useful for extracting cliques and analyzing overlap. There is other research of relevance to our work. Boyack et al. compare the effectiveness of several algorithms for clustering large numbers of documents, but they do not address details of the semantic content involved. Blondel et al. discuss an efficient algorithm for identifying communities in large networks, but the “content” of these involves only one feature, primary language used in mobile phone networks, rather than the rich expressiveness of SemRep semantic predications. Since our main concern is exploiting semantic predications for the semantic content of documents, for the purpose of automatic summarization, rather than development of clustering algorithms, UCINET is entirely adequate.
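The frequency step can be sketched as follows. The input format, a list of (PMID, predication) pairs, and the PMIDs themselves are hypothetical; the cutoff is read as mean plus half of the (population) standard deviation, which is an assumption about the exact formula.

```python
from statistics import mean, pstdev

def citation_frequency(extractions):
    """Count, for each predication, the number of distinct citations it appears in."""
    citations = {}
    for pmid, predication in extractions:
        citations.setdefault(predication, set()).add(pmid)
    return {pred: len(pmids) for pred, pmids in citations.items()}

def filter_by_frequency(freq):
    """Drop predications whose frequency falls below mean + 0.5 * standard deviation."""
    values = list(freq.values())
    cutoff = mean(values) + 0.5 * pstdev(values)
    return {pred: f for pred, f in freq.items() if f >= cutoff}

p1 = ("Deep Brain Stimulation", "TREATS", "Parkinson Disease")
p2 = ("Dopamine Agonists", "TREATS", "Dyskinetic syndrome")
p3 = ("Brain", "LOCATION_OF", "Single lesion")

# Hypothetical PMIDs; p1 is extracted twice from citation 1 but counted once.
extractions = [(1, p1), (1, p1), (2, p1), (3, p1), (1, p2), (2, p3)]
freq = citation_frequency(extractions)
frequent = filter_by_frequency(freq)
```

Repeated extractions from the same citation collapse into one count because citations are collected in a set before counting.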
The UCINET algorithm to identify cliques is based on the notion of a maximal clique, one that is not contained in any other. Cliques are allowed to overlap, which means that concepts can be members of more than one clique. This feature is important for summarization because it permits certain concepts, which have high degree centrality and are the core of a network (such as the topic of the summary) to appear in several cliques of the graph.
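To make the notion of overlapping maximal cliques concrete, the sketch below enumerates them with the basic Bron-Kerbosch recursion. This is a stand-in chosen for illustration, not a description of UCINET's internal procedure, and the `min_size` parameter and the small example graph are assumptions.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (node, node) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def maximal_cliques(adjacency, min_size=3):
    """Enumerate maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed nodes
        if not p and not x:
            if len(r) >= min_size:
                cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return cliques

# Illustrative graph: two triangles sharing the central concept.
edges = [
    ("Parkinson Disease", "Deep Brain Stimulation"),
    ("Parkinson Disease", "Levodopa"),
    ("Deep Brain Stimulation", "Levodopa"),
    ("Parkinson Disease", "Dopamine Agonists"),
    ("Parkinson Disease", "Dyskinetic syndrome"),
    ("Dopamine Agonists", "Dyskinetic syndrome"),
]
adjacency = build_adjacency(edges)
cliques = maximal_cliques(adjacency)
```

The central concept "Parkinson Disease" belongs to both cliques, which is the overlap behavior the paragraph above relies on.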
A summary of a large number of documents usually includes several themes, or points of view. For example, a summary of breast cancer may include information on chemotherapies, procedures, genetic etiology, etc. In exploiting such a summary, a user may want to focus on any one of these themes. The accessibility of a summary is increased if the different themes are discriminated from each other and overtly represented. Although cliques correlate somewhat with themes, this is not absolute due to the fact that cliques share nodes.
Our approach to identifying themes in a summary exploits clusters of cliques and has two phases. In the first, clustering is based solely on nodes in the clique (which represent arguments in the semantic predications constituting the cliques). In addition to identifying cliques from the predication network, UCINET automatically produces a clique co-membership matrix and a hierarchical clique clustering, which produces several possible solutions, each containing a varying number of clusters.
We then use semantic processing to select the clustering solution that best represents the themes of the summary. The goal is to put cliques with similar themes in the same cluster, while keeping cliques with different themes in separate clusters. The challenge is to determine the best clustering solution by grouping cliques in such a way that the themes of the summary are optimally represented. Generally, the best clustering solution is neither too compact (a single cluster containing all cliques) nor too dispersed, as is the case if every cluster is a singleton having only one clique. When this solution has been selected, further processing determines whether some of the clusters should be collapsed based on semantic similarity.
Visualizing cluster solutions
The tool we used to find and cluster cliques is UCINET, a hierarchical clustering software package originally developed for social network analysis. UCINET produces a clique co-membership matrix in which the (i,j)th entry of the matrix is the number of shared nodes (arguments) in clique i and clique j and the diagonal entries are the size of the cliques. Based on this matrix, UCINET produces an icicle plot composed of solutions to the clique clustering. Although each clique is assigned to a unique cluster, concepts may be in more than one cluster.
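The clique co-membership matrix described above is simple to compute directly. The following sketch uses a hypothetical function name and two illustrative cliques; it is not UCINET output.

```python
def co_membership_matrix(cliques):
    """Entry (i, j) is the number of nodes shared by clique i and clique j.

    Diagonal entries equal the clique sizes, since a clique shares all of
    its nodes with itself.
    """
    return [[len(ci & cj) for cj in cliques] for ci in cliques]

# Two illustrative cliques that overlap in one concept.
example_cliques = [
    frozenset({"Parkinson Disease", "Deep Brain Stimulation", "Levodopa"}),
    frozenset({"Parkinson Disease", "Dopamine Agonists", "Dyskinetic syndrome"}),
]
matrix = co_membership_matrix(example_cliques)
```

The resulting matrix is symmetric, with clique sizes on the diagonal and overlap counts elsewhere.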
Semantic processing for labeling clusters
In our method, determining the best clustering solution is based on semantic similarity of the individual clusters, as represented by theme labels. Our graphs represent semantic predications, and cliques thus contain the arguments of predications (as nodes) along with the predicates connecting them. Identifying similarity of clusters depends on characteristics of the predications contained in the cliques that constitute the clusters. Other approaches have used terms or sentences as theme labels, neither of which provides the greater expressiveness of semantic predications.
Metapredications are used to identify and label the theme for each cluster. A metapredication, whose form is similar to a SemRep predication, is defined as “<Semantic Group> <Predicate Group> <Semantic Group>”. The scope of the semantic group and predicate group is broader than that of the arguments and predicate of a SemRep predication, so a metapredication generalizes the meaning of a cluster of cliques composed of several predications.
inhibits, stimulates, interacts_with
treats, prevents, compared_with*, uses**
associated_with, causes, predisposes
affects, augments, disrupts
<Semantic group> <Predicate group> <Semantic group>
<Anatomy> <Physical> <Anatomy>
<Anatomy> <Physical> <Chemicals & Drugs>
<Anatomy> <Physical> <Disorders>
<Chemicals & Drugs> <Interaction> <Chemicals & Drugs>
<Chemicals & Drugs> <Interaction> <Genes & Molecular Sequences>
<Chemicals & Drugs> <Therapy> <Disorders>
<Chemicals & Drugs> <Therapy> <Chemicals & Drugs> *
<Procedures> <Therapy> <Disorders>
<Procedures> <Therapy> <Chemicals & Drugs> **
<Genes & Molecular Sequences> <Causation> <Disorders>
<Chemicals & Drugs> <Causation> <Disorders>
<Disorders> <Causation> <Disorders>
<Procedures> <Diagnosis> <Disorders>
<Procedures> <Diagnosis> <Chemicals & Drugs> ***
<Disorders> <Affects> <Disorders>
<Chemicals & Drugs> <Affects> <Disorders>
<Chemicals & Drugs> <Affects> <Physiology>
<Disorders> <Comorbidity> <Disorders>
In theme identification, each SemRep predication in a cluster identified in the icicle plot is assigned to a metapredication. For example, the predications “Dopamine Agonists TREATS Parkinson Disease” and “Dopamine Agonists TREATS Dyskinetic syndrome” are assigned to the metapredication “<Chemicals & Drugs> <Therapy> <Disorders>” because the predicate TREATS belongs to the predicate group <Therapy> and the semantic types of the subjects and the objects of these two predications belong to the semantic groups Chemicals & Drugs and Disorders, respectively. Metapredications are then counted and sorted in descending order of frequency of occurrence; the most frequent identifies the theme of the cluster and serves as its label.
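The labeling step can be sketched as a counting exercise. In the sketch below the two lookup tables are small hypothetical stand-ins: a real system would obtain semantic groups from the UMLS, and only the TREATS and COMPARED_WITH entries of the predicate table are confirmed by the text (the CAUSES entry is an assumption).

```python
from collections import Counter

# Hypothetical lookup tables, for illustration only.
PREDICATE_GROUP = {
    "TREATS": "Therapy",
    "COMPARED_WITH": "Therapy",
    "CAUSES": "Causation",  # assumed mapping
}
SEMANTIC_GROUP = {
    "Dopamine Agonists": "Chemicals & Drugs",
    "Levodopa": "Chemicals & Drugs",
    "Parkinson Disease": "Disorders",
    "Dyskinetic syndrome": "Disorders",
}

def to_metapredication(predication):
    """Map a (subject, predicate, object) triple to its metapredication, or None."""
    subject, predicate, obj = predication
    try:
        return (SEMANTIC_GROUP[subject], PREDICATE_GROUP[predicate], SEMANTIC_GROUP[obj])
    except KeyError:
        return None  # concept or predicate not covered by the lookup tables

def label_cluster(predications):
    """Return the most frequent metapredication in a cluster as its theme label."""
    counts = Counter(
        meta for meta in map(to_metapredication, predications) if meta is not None
    )
    return counts.most_common(1)[0][0] if counts else None

cluster = [
    ("Dopamine Agonists", "TREATS", "Parkinson Disease"),
    ("Dopamine Agonists", "TREATS", "Dyskinetic syndrome"),
    ("Levodopa", "CAUSES", "Dyskinetic syndrome"),
]
theme = label_cluster(cluster)
```

How ties between equally frequent metapredications are resolved is not stated in the text; this sketch simply takes whichever `Counter.most_common` returns first.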
Selecting the optimal clustering solution
Semantic theme labels form the basis for selecting the best clustering solution to represent themes for the summary generated by the method. As represented in the icicle plot, the several clustering solutions are arranged hierarchically, so that the solution containing the most clusters is at the top of the plot. In each succeeding row, adjacent clusters may be merged (based on shared nodes in the cliques being clustered), so that the final, bottom row contains fewer clusters than those preceding it. In our method, merging of clusters in succeeding rows is augmented with semantic processing to choose the optimal clustering solution, one in which there are no clusters that could be merged in the succeeding solution (row) based on shared nodes and which have the same theme label.
After theme labels have been computed for clusters in the icicle plot, each successive row of the icicle plot is processed, starting with the row that is likely to require minimum merging. Based on training data, this is the first row from the top that has no more than three singleton clusters (containing only one clique). The current row is compared to the immediately succeeding row, and it is noted whether any separate clusters in the current row are merged in the following row, and further, whether those clusters have the same metapredication theme label. If both conditions are satisfied, the succeeding row is considered to be a better solution than the current row, and the former succeeding row becomes the current row. When a row is encountered for which the succeeding row is not a better solution than the current row, the current row is considered the optimal solution.
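The row-selection procedure can be sketched as follows, under several assumptions: the icicle plot is represented as a list of rows ordered from most to fewest clusters, each row is a list of clusters (sets of clique identifiers), and a cluster's theme label is supplied by a caller-provided function. The clique identifiers, themes, and function names are hypothetical.

```python
from collections import Counter

def select_row(rows, label_of, max_singletons=3):
    """Return the index of the row chosen as the clustering solution."""
    # Start at the first row (from the top) with at most `max_singletons` singletons.
    current = next(
        (i for i, row in enumerate(rows)
         if sum(len(cluster) == 1 for cluster in row) <= max_singletons),
        len(rows) - 1,
    )
    while current + 1 < len(rows):
        advance = False
        for merged in rows[current + 1]:
            # Clusters of the current row that are absorbed into `merged`.
            parts = [c for c in rows[current] if c <= merged]
            if len(parts) > 1 and len({label_of(c) for c in parts}) == 1:
                advance = True  # same-theme clusters get merged: next row is better
                break
        if not advance:
            break
        current += 1
    return current

# Hypothetical theme per clique; a cluster takes its most common clique theme.
clique_theme = {1: "Therapy", 2: "Therapy", 3: "Causation", 4: "Causation", 5: "Interaction"}

def label_of(cluster):
    return Counter(clique_theme[c] for c in sorted(cluster)).most_common(1)[0][0]

rows = [
    [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}), frozenset({5})],
    [frozenset({1, 2}), frozenset({3}), frozenset({4}), frozenset({5})],
    [frozenset({1, 2}), frozenset({3, 4}), frozenset({5})],
    [frozenset({1, 2, 3, 4}), frozenset({5})],
    [frozenset({1, 2, 3, 4, 5})],
]
best = select_row(rows, label_of)
```

In this example the search starts at the second row, advances once because two "Causation" clusters are merged, and stops before the row that would merge a "Therapy" cluster with a "Causation" cluster.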
In a previous study, we evaluated the effectiveness of degree centrality as a condensing mechanism for automatic summarization to answer disease treatment questions in a semantic predication graph. We have also constructed a semantic predication gold standard to support further evaluation. In addition, we have assessed the ability of Semantic MEDLINE, a SemRep-based summarizer, to identify useful drug interventions for evidence-based medical treatment. In this paper, we evaluated two aspects of clustering cliques for automatic summarization: the validity of the clusters produced and the quality of the cluster labeling.
Validity of the clusters
The validity of the clusters was assessed by measuring cluster cohesion and cluster separation. Cohesion measures the purity of the objects within a cluster, i.e. how closely related the objects in a cluster are. Separation measures the isolation of the objects in different clusters, i.e. how distinct a cluster is from other clusters. For our clusters, composed of semantic predications, we evaluated how related the semantic predications in a cluster are to its cluster label (cohesion) and how well-separated semantic predications with different labels are in different clusters.
We compared our system output to a baseline whose clique clusters were determined by the silhouette coefficient, which is often used to determine the appropriate number of clusters in clustering data mining research. To compute the distance between cliques, we used a symmetric matrix in which each cell was the number of nodes shared by the corresponding pair of cliques. Then the average silhouette coefficient (ASC) was calculated for each clustering solution and the solution with the highest ASC served as the baseline. Cohesion, separation, and overall validity were also calculated for the baseline.
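A sketch of the baseline computation is given below. The silhouette formula is the standard one; the conversion from shared-node counts to distances (one minus the overlap relative to the smaller clique) is an assumption, since the text does not state how distances were derived. The example matrix is illustrative.

```python
def shared_to_distance(shared):
    """Assumed conversion: 1 - shared / min(clique sizes); sizes are on the diagonal."""
    n = len(shared)
    return [
        [0.0 if i == j else 1.0 - shared[i][j] / min(shared[i][i], shared[j][j])
         for j in range(n)]
        for i in range(n)
    ]

def average_silhouette(dist, labels):
    """Average silhouette coefficient of a clustering given a distance matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette taken as 0
        a = sum(dist[i][j] for j in own) / len(own)  # mean intra-cluster distance
        b = min(  # mean distance to the nearest other cluster
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

# Illustrative co-membership matrix for four cliques of size 3.
shared = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
dist = shared_to_distance(shared)
good = average_silhouette(dist, [0, 0, 1, 1])
bad = average_silhouette(dist, [0, 1, 0, 1])
```

Scoring each candidate clustering solution this way and keeping the one with the highest average reproduces the selection rule described for the baseline.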
Quality of the cluster labeling
The accuracy of themes annotated by cluster labels is important to the final summary. A cluster with a poor label may be ignored by users even if it links to a group of documents relevant to their information needs. We thus also evaluated the labeling effectiveness of our system. Since it is almost impossible for domain experts to produce class labels for the results of clustering tens of thousands of articles, we constructed a reference standard for evaluation based on the medical subject heading (MeSH) descriptors assigned to source citations that produce predications in the summary clusters. This evaluation was done by comparing arguments extracted from the predications in the cluster to MeSH indexing terms assigned to the citations from which the predications were extracted.
For each citation in MEDLINE, indexers at the National Library of Medicine assign 5 to 15 MeSH descriptors as well as qualifiers (if necessary) to cover the topics of the article; they also indicate those MeSH descriptors reflecting the major points of the article as major MeSH descriptors. Since this indexing procedure is performed by human experts, it is deemed that the MeSH descriptors, especially the major ones, accurately represent the contents of the article. For example, for a citation entitled “Aspirin and antiplatelet agent resistance: implications for prevention of secondary stroke” (PMID: 20932071), the major MeSH descriptors are: Aspirin/pharmacology; Platelet Aggregation Inhibitors/pharmacology; Stroke/prevention & control. In constructing the reference standard, we ignored MeSH qualifiers. For example, MeSH descriptors “Antipsychotic Agents/therapeutic use” and “Antipsychotic Agents/administration & dosage” were counted as one term.
For each cluster in the summary, major MeSH descriptors assigned to citations producing predications in the given cluster were extracted and sorted in descending order of frequency of occurrence. The predication arguments in each cluster were compared to an equal number of the ranked MeSH descriptors, starting with the most frequent descriptor.
In comparing predication arguments to MeSH indexing terms, we exploited Metathesaurus synonymy to match concepts in the graph to MeSH descriptors. For example, the concept “Diabetes Mellitus, Non-Insulin-Dependent” was matched to term “Diabetes Mellitus, Type 2” because the concept is a synonym of the term in MeSH vocabulary. Finally, recall, precision and F-score were calculated.
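The evaluation against MeSH can be sketched as below. Several details are assumptions: descriptor frequency is counted as the number of citations carrying the descriptor, Metathesaurus synonymy is represented by a small hypothetical dictionary, and the function names and example citations are invented for illustration.

```python
from collections import Counter

def strip_qualifier(heading):
    """'Aspirin/pharmacology' -> 'Aspirin' (qualifiers are ignored)."""
    return heading.split("/", 1)[0].strip()

def evaluate_cluster(arguments, major_mesh_per_citation, synonyms=None):
    """Compare cluster arguments with an equal number of top-ranked major MeSH descriptors."""
    synonyms = synonyms or {}
    system = {synonyms.get(arg, arg) for arg in arguments}
    counts = Counter()
    for headings in major_mesh_per_citation:
        # A descriptor is counted once per citation, whatever its qualifiers.
        counts.update({strip_qualifier(h) for h in headings})
    reference = {descriptor for descriptor, _ in counts.most_common(len(system))}
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

arguments = ["Aspirin", "Stroke", "Diabetes Mellitus, Non-Insulin-Dependent"]
synonyms = {"Diabetes Mellitus, Non-Insulin-Dependent": "Diabetes Mellitus, Type 2"}
citations = [
    ["Aspirin/pharmacology", "Platelet Aggregation Inhibitors/pharmacology",
     "Stroke/prevention & control"],
    ["Aspirin/therapeutic use", "Stroke/prevention & control"],
    ["Platelet Aggregation Inhibitors/therapeutic use",
     "Diabetes Mellitus, Type 2/complications"],
]
precision, recall, f_score = evaluate_cluster(arguments, citations, synonyms)
```

In this example two of the three arguments match the three most frequent descriptors, so precision, recall and F-score all come out at two thirds.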
An example of the final summary
Validity of the clusters
Table: Validity of the clusters for system summary (SS) and baseline (BL), by topic. Topics listed include Anti-inflammatory Agents, Non-steroidal; Hydroxymethylglutaryl-CoA Reductase Inhibitors; and Tumor Necrosis Factors.
Quality of the summary theme labeling
Table: System output compared to MeSH indexing, by topic. Topics listed include Anti-inflammatory Agents, Non-steroidal; Hydroxymethylglutaryl-CoA Reductase Inhibitors; and Tumor Necrosis Factors.
Generally, results showed that our method, based on graph theory as well as semantic predications, can produce satisfactory summaries of large numbers of biomedical documents. The validity of clusters determined by semantics was better than that determined by the Silhouette Coefficient, and, further, the summary represented the major salient content of topics. Analysis of the overall validity of clusters showed that system output is 10 percentage points better than the baseline (43% versus 33%). Although the cohesion of the baseline is slightly higher than that of the system summary, the separation of the system summary is significantly better than that of the baseline. The number of clusters determined by the Silhouette Coefficient is greater than the number determined by semantic information, which results in a relatively higher cohesion and lower separation in the baseline.
We used metapredications to calculate cohesion and separation; such predications were constructed from semantic information pertinent to the core meaning of the themes. For example, the drug therapy theme (<Chemicals & Drugs> <Therapy> <Disorders>) expresses predications asserting specific drug therapies (TREATS) and comparison of such therapies (COMPARED_WITH).
Predications that do not belong to these two metapredications are counted as false positives when computing cohesion and separation. A problem arose with the predicate CAUSES, which SemRep uses to express both side effects of drugs (which would be reasonable to include in the drug therapy theme) and etiology of disease (which is not in the scope of this theme). We chose not to include CAUSES in this theme, which caused some legitimate side-effect predications to be considered false positives when evaluating this theme. This decreased cohesion and separation, as well as overall validity for clusters containing the drug therapy theme.
Two issues were encountered in comparing concepts in each cluster to MeSH descriptors to evaluate the summary, both of which caused discrepancy between results and actual quality of the summary in expressing the semantic content of input citations. The first issue is due to indexing policy. For example, concepts referring to body parts represented the major contents in disease location clusters. However, MeSH descriptors in the anatomy category are not normally indexed as major topics. For example, lung (location of lung cancer) and pancreas (location of insulin) were not indexed as major topics.
A second problem encountered in matching predications to MeSH indexing terms involves qualifiers (subheadings). For example, the concept “Toxic effects” in the predication “Anti-inflammatory Agents, Non-steroidal CAUSES Toxic effects” was often extracted from citations that had been indexed with the qualifier “toxicity.” Since only MeSH descriptors were compared in the evaluation, this concept was counted as a false negative.
False positives were largely caused by infelicitous mapping to Metathesaurus concepts. For example, statin has two mapping candidates, “STN gene” and “Hydroxymethylglutaryl-CoA Reductase Inhibitors,” in the Metathesaurus. For most sentences, such as “… patients prescribed a statin with drugs that may increase the risk of myopathy”, “STN gene” was selected due to incorrect word sense disambiguation.
Limitations and future work
Although our system can produce useful summaries for large numbers of MEDLINE citations and cluster the summary into several groups based on the themes, it has limitations. As mentioned in the theme identification section, UCINET uses a hierarchical clustering algorithm to cluster cliques. Hierarchical clustering analysis is very practical in detecting topics for documents because it does not require human intervention to assign the number of clusters in advance, as the k-means clustering algorithm does. Wartena and colleagues used a k-bisecting clustering algorithm, which is based on the k-means algorithm, to cluster frequently occurring keywords in 758 documents taken from 8 Wikipedia categories. They clustered these into 9 categories, one for each Wikipedia category and one additional cluster. In reality, however, it is almost impossible to predefine the number of clusters for the varied topics in the biomedical domain. Lee and colleagues compared supervised and unsupervised methods to detect topics in biomedical texts and found that the performance of supervised topic spotting methods was better. They also found that unsupervised hierarchical clustering was robust and more readily applicable in real-world settings.
The clustering algorithm we used is based on the common concepts shared by the cliques. In other words, the clique-clique proximity matrix used for clustering is constructed on the basis of the similarity of predication arguments contained among the cliques. It ignores the similarity of predicates, which may also contribute to the computation of clique similarity. Although the effectiveness of clustering algorithms is not the focus of this paper, we will explore different clustering algorithms and consider adding predicates to enhance results in our future work.
As shown in Figure 9, the system uses a fixed threshold (shown as line A) to group this topic into eight clusters. By considering the semantic information contained in each cluster, we can determine that the themes of clusters one and two are the same (substance interactions), clusters three and four are both about locations of the substances, while clusters five, seven and eight are all about chemicals as the cause of disorders; finally, cluster six is about chemicals treating disorders. Repetitive themes are clearly produced.
Instead of the fixed threshold, we will explore the use of a dynamic threshold [49] to detect clusters. Compared to a cutoff at a fixed height, a dynamic threshold, which uses different cut heights on different branches of the cluster tree, makes determining the number of clusters more flexible. For example, Van der Spek and Klusener [49] and Langfelder and colleagues [50] used a dynamic tree cut method based on analyzing the shape of the branches of the dendrogram. In the future, we will consider both the shape of the icicle plot and the cluster themes to determine a dynamic threshold, such as cutoff B in Figure 9. By taking the themes of clusters 1 to 8 in Figure 9 into account, dynamic cutoff B selects clustering solutions at different heights, so that clusters that share a label under the fixed threshold (clusters 1 and 2, clusters 3 and 4, and clusters 7 and 8) are merged into three new clusters (cluster 1-2, cluster 3-4 and cluster 7-8). With cutoff B, five clusters (marked in blue under the cutoff line) are produced for the topic TNF. Compared to cutoff A, dynamic cutoff B increases separation by 0.21 (0.52 versus 0.31) and overall validity by 0.14 (0.53 versus 0.39).
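The effect of a theme-driven cutoff can be sketched as follows. This is not the Dynamic Tree Cut algorithm and not an implemented part of the system; it is a simplified illustration in which sibling branches are merged whenever each resolves to a single cluster with the same theme label, which amounts to raising the cut height on that branch only. The tree shape, the `Node` class and the function names are hypothetical; only the theme labels and the eight-to-five reduction follow the description of Figure 9.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Cluster = Tuple[Tuple[int, ...], str]  # (cluster numbers, theme label)

@dataclass
class Node:
    """Dendrogram node above cutoff A; leaves are the fixed-cut clusters."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    members: Tuple[int, ...] = ()
    theme: str = ""

def leaf(number: int, theme: str) -> Node:
    return Node(members=(number,), theme=theme)

def fixed_cut(node: Node) -> List[Cluster]:
    """Cutoff A: every leaf is reported as its own cluster."""
    if node.left is None or node.right is None:
        return [(node.members, node.theme)]
    return fixed_cut(node.left) + fixed_cut(node.right)

def theme_aware_cut(node: Node) -> List[Cluster]:
    """Merge two sibling branches when both are single clusters with one theme."""
    if node.left is None or node.right is None:
        return [(node.members, node.theme)]
    left = theme_aware_cut(node.left)
    right = theme_aware_cut(node.right)
    if len(left) == 1 and len(right) == 1 and left[0][1] == right[0][1]:
        return [(left[0][0] + right[0][0], left[0][1])]
    return left + right

# Assumed tree shape for the eight clusters of the TNF topic.
tree = Node(
    left=Node(
        left=Node(left=leaf(1, "substance interactions"),
                  right=leaf(2, "substance interactions")),
        right=Node(left=leaf(3, "location"), right=leaf(4, "location")),
    ),
    right=Node(
        left=Node(left=leaf(5, "causes disorder"), right=leaf(6, "treats disorder")),
        right=Node(left=leaf(7, "causes disorder"), right=leaf(8, "causes disorder")),
    ),
)
clusters_a = fixed_cut(tree)        # eight clusters, repeated themes
clusters_b = theme_aware_cut(tree)  # five clusters after merging
for members, theme in clusters_b:
    print(members, theme)
```

In this sketch cluster 5 stays separate from cluster 7-8 even though the themes match, because the two are not siblings in the assumed tree; a real dynamic threshold would also need to weigh branch shape, as discussed above.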
We exploited graph-theoretical methods to summarize biomedical documents; using hierarchical clustering, we then grouped the summary into several themes for a given topic based on the semantics contained in the summary. The system summary was compared to a reference standard produced by selecting the same number of the most frequent major MeSH descriptors as the number of concepts in the summary. Recall, precision and F-score were 0.64, 0.65 and 0.65, respectively. The validity of the clusters was compared to a baseline computed with the Silhouette Coefficient method for cohesion, separation and overall validity. The overall validity of the system output clusters was better than that of the Silhouette Coefficient clusters.
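The comparison against the MeSH reference standard can be sketched for a single topic as follows. The sketch assumes that summary concepts and MeSH descriptors are matched by exact string equality and that descriptor frequency is counted once per citation; both are assumptions, as are the function names and the example data, which are not taken from the evaluation.

```python
from collections import Counter

def reference_standard(major_headings_per_citation, size):
    """Select the `size` most frequent major MeSH descriptors."""
    counts = Counter(
        heading
        for headings in major_headings_per_citation
        for heading in set(headings)  # count each descriptor once per citation
    )
    return {heading for heading, _ in counts.most_common(size)}

def precision_recall_f(system, reference):
    """Precision, recall and F-score of system concepts against the reference."""
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical major headings for four citations and a three-concept summary.
citations = [
    ["Tumor Necrosis Factor-alpha", "Arthritis, Rheumatoid"],
    ["Tumor Necrosis Factor-alpha", "Infliximab"],
    ["Tumor Necrosis Factor-alpha", "Arthritis, Rheumatoid", "Infliximab"],
    ["Arthritis, Rheumatoid", "Crohn Disease"],
]
summary_concepts = {"Tumor Necrosis Factor-alpha", "Arthritis, Rheumatoid", "Interleukin-6"}

reference = reference_standard(citations, size=len(summary_concepts))
precision, recall, f_score = precision_recall_f(summary_concepts, reference)
print(round(precision, 2), round(recall, 2), round(f_score, 2))
```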
The first and fourth authors were supported by an appointment to the National Library of Medicine Research Participation Program administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the U.S. Department of Energy and the National Library of Medicine. This study was supported in part by the Intramural Research Program of the National Institutes of Health, National Library of Medicine. The first author was also supported by the Youth Fund on Humanities and Social Sciences of the Ministry of Education of China (grant number 13YJC870030).
The fourth author also gratefully acknowledges funding from the Lundbeckfonden through the Center for Integrated Molecular Brain Imaging (Cimbi.org), Otto Mønsteds Fond, Kaj og Hermilla Ostenfelds Fond, and the Ingeniør Alexandre Haynman og hustru Nina Haynmans Fond.
- Sparck Jones K: Automatic summarising: the state of the art. Information Processing and Management. 2007, 43: 1449-1481. 10.1016/j.ipm.2007.03.009.
- Mani I: Automatic summarization. 2001, Amsterdam: John Benjamins.
- Yoo I, Hu X, Song I: A coherent graph-based semantic clustering and summarization approach for biomedical literature and a new summarization evaluation method. BMC Bioinformatics. 2007, 8 (Suppl 9): S4. 10.1186/1471-2105-8-S9-S4.
- Bundschus M, Dejori M, Stetter M, Tresp V, Kriegel HP: Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics. 2008, 9: 207. 10.1186/1471-2105-9-207.
- Rindflesch TC, Fiszman M, Libbus B: Semantic Interpretation for the Biomedical Research Literature. Medical Informatics: Knowledge Management and Data Mining in Biomedicine. Edited by: Chen H, Fuller S, Friedman C, Hersh W. 2005, New York: Springer, 399-422.
- Fiszman M, Rindflesch TC, Kilicoglu H: Abstraction summarization for managing the biomedical research literature. Proceedings of the HLT-NAACL Workshop on Computational Lexical Semantics. 2004, 76-83.
- Rindflesch TC, Fiszman M: The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J. Biomed. Inform. 2003, 36 (6): 462-77. 10.1016/j.jbi.2003.11.003.
- Kilicoglu H, Fiszman M, Rodriguez A, Shin D, Ripple AM, Rindflesch TC: Semantic MEDLINE: A web application to manage the results of PubMed searches. Proceedings of the Third International Symposium for Semantic Mining in Biomedicine. 2008, 69-76.
- Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004, 32: D267-270. 10.1093/nar/gkh061.
- Fiszman M, Demner-Fushman D, Kilicoglu H, Rindflesch TC: Automatic Summarization of MEDLINE Citations for Evidence-based Medical Treatment: A Topic-oriented Evaluation. J. Biomed. Inform. 2009, 42 (5): 801-813. 10.1016/j.jbi.2008.10.002.
- Zhang H, Fiszman M, Shin D, Miller CM, Rosemblat G, Rindflesch TC: Degree centrality for semantic abstraction summarization of therapeutic studies. J. Biomed. Inform. 2011, 44 (5): 830-838. 10.1016/j.jbi.2011.05.001.
- McCray AT, Srinivasan S, Browne AC: Lexical methods for managing variation in biomedical terminologies. Proceedings of the Annual Symposium on Computing Applications in Medical Care. 1994, 235-9.
- Smith L, Rindflesch TC, Wilbur WJ: MedPost: a part-of-speech tagger for biomedical text. Bioinformatics. 2004, 20 (14): 2320-2321. 10.1093/bioinformatics/bth227.
- Aronson AR, Lang FM: An overview of MetaMap: historical perspective and recent advances. J. Am. Med. Inform. Assoc. 2010, 17 (3): 229-236.
- Nenkova A, Vanderwende L: Microsoft Research Technical Report. The impact of frequency on summarization. 2005, MSR-TR-2005-101. [http://www.cs.bgu.ac.il/~elhadad/nlp09/sumbasic.pdf]
- Reeve LH, Han H, Brooks AD: The use of domain-specific concepts in biomedical text summarization. Information Processing and Management. 2007, 43 (6): 1765-1776. 10.1016/j.ipm.2007.01.026.
- Reeve LH, Han H, Nagori S, Yang JC, Schwimmer TA, Brooks AD: Concept frequency distribution in biomedical text summarization. Proceedings of the 15th ACM International Conference on Information and Knowledge Management. 2006, Arlington, 604-611.
- Erkan G, Radev DR: LexRank: graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research. 2004, 22: 457-479.
- Zhang X, Cheng G, Qu Y: Ontology summarization based on RDF sentence graph. Proceedings of the 16th International Conference on World Wide Web. 2007, New York, USA, 707-716.
- Ozgür A, Vu T, Erkan G, Radev DR: Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics. 2008, 24 (13): i277-285. 10.1093/bioinformatics/btn182.
- Mihalcea R, Tarau P: TextRank: bringing order into texts. Proceedings of the conference on Empirical Methods in Natural Language Processing. 2004, Barcelona, Spain, 404-411.
- Matsunaga T, Yonemori C, Tomita E, Muramatsu M: Clique-based data mining for related genes in a biomedical database. BMC Bioinformatics. 2009, 10: 205. 10.1186/1471-2105-10-205.
- Yu H, Paccanaro A, Trifonov V, Gerstein M: Predicting interactions in protein networks by completing defective cliques. Bioinformatics. 2006, 22 (7): 823-829. 10.1093/bioinformatics/btl014.
- Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics. 2009, 25 (15): 1891-1897. 10.1093/bioinformatics/btp311.
- Liu X, Bollen J, Nelson ML, Van de Sompel H: Co-authorship networks in the digital library research community. Information Processing & Management. 2005, 41 (6): 1462-1480. 10.1016/j.ipm.2005.03.012.
- Zubcsek PP, Chowdhury I, Katona Z: Information communities: the network structure of communication. [http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1753903]
- Ah-Pine J, Jacquet G: Clique-based clustering for improving named entity recognition system. Proceedings of the 12th conference of the European chapter of the ACL. 2009, Athens, Greece, 51-59.
- Pons-Porrata A, Berlanga-Llavori R, Ruiz-Shulcloper J: Topic discovery based on text mining techniques. Information Processing and Management. 2007, 43 (3): 752-768. 10.1016/j.ipm.2006.06.001.
- Stein GC, Strzalkowski T, Wise GB: Interactive, text-based summarization of multiple documents. Computational Intelligence. 2000, 16 (4): 606-613. 10.1111/0824-7935.00131.
- Naud A, Usui S: Exploration of a collection of documents in neuroscience and extraction of topics by clustering. Neural Netw. 2008, 21 (8): 1205-1211. 10.1016/j.neunet.2008.05.009.
- Yang J, Cohen AM, Hersh W: Automatic summarization of mouse gene information by clustering and sentence extraction from MEDLINE abstracts. AMIA Annual Symposium Proceedings. 2007, Chicago, USA, 831-835.
- Yamamoto Y, Takagi T: Biomedical knowledge navigation by literature clustering. J. Biomed. Inform. 2007, 40 (2): 114-130. 10.1016/j.jbi.2006.07.004.
- Lee M, Wang W, Yu H: Exploring supervised and unsupervised methods to detect topics in biomedical text. BMC Bioinformatics. 2006, 7: 140. 10.1186/1471-2105-7-140.
- Kan M, McKeown KR, Klavans JL: Proceedings of the first Document Understanding Conference. Domain-specific informative and indicative summarization for information retrieval. 2001, 19-26.
- Borgatti SP, Everett MG, Freeman LC: UCINET for windows: software for social network analysis. 2002, Harvard, MA: Analytic Technologies.
- Lerch F, Sydow J, Provan KG: Cliques within clusters- multi-dimensional network integration and innovation activities. 2006, Norway: Paper presentation at the 22nd EGOS colloquium.
- Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, et al: Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS One. 2011, 6 (3): e18029. 10.1371/journal.pone.0018029.
- Blondel VD, Guillaume J, Lambiotte R, Lefebvre E: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment. 2008, 10: P10008.
- Norusis MJ: Cluster Analysis. PASW Statistics 18 Statistical Procedures Companion. Edited by: Norusis MJ. 2010, New Jersey: Prentice Hall, 361-391.
- Everett MG, Borgatti SP: Analyzing clique overlap. Connections. 1998, 21 (1): 49-61.
- Kruskal JB, Landwehr JM: Icicle plots: Better displays for hierarchical clustering. The American Statistician. 1983, 37 (2): 162-168.
- McCray AT, Burgun A, Bodenreider O: Aggregating UMLS semantic types for reducing conceptual complexity. Proceedings of Medinfo. 2001, 10 (Pt 1): 216-220.
- Kilicoglu H, Rosemblat G, Fiszman M, Rindflesch TC: Constructing a semantic predication gold standard from the biomedical literature. BMC Bioinformatics. 2011, 12: 486. 10.1186/1471-2105-12-486.
- Rousseeuw PJ: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987, 20 (1): 53-65.
- Tan P, Steinbach M, Kumar V: Cluster Analysis: Basic concepts and algorithms. Introduction to Data Mining. Edited by: Tan P, Steinbach M, Kumar V. 2005, Boston: Addison-Wesley, 487-568.
- Batagelj V, Mrvar A: Pajek - Analysis and Visualization of Large Networks. Graph Drawing Software. Edited by: Jünger M, Mutzel P. 2003, Berlin: Springer, 77-103.
- Goodwin J, Cohen T, Rindflesch T: Discovery by scent: Closed literature-based discovery system based on the information foraging theory. First International Workshop on the Role of the Semantic Web in Literature-Based Discovery, in conjunction with IEEE International Conference on Bioinformatics and Biomedicine. 2012, Philadelphia, USA.
- Wartena C, Brussee R: Topic detection by clustering keywords. Proceedings of the 19th International Conference on Database and Expert Systems Application 2008. 2008, Turin, Italy, 54-58.
- Van der Spek P, Klusener S: Applying a dynamic threshold to improve cluster detection of LSI. Science of Computer Programming. 2011, 76 (12): 1261-1274. 10.1016/j.scico.2010.12.004.
- Langfelder P, Zhang B, Horvath S: Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics. 2008, 24 (5): 719-720. 10.1093/bioinformatics/btm563.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.