Evaluating the use of different positional strategies for sentence selection in biomedical literature summarization

Background: The position of a sentence in a document has traditionally been considered an indicator of the relevance of the sentence, and it is therefore frequently used by automatic summarization systems as an attribute for sentence selection. Sentences close to the beginning of the document are supposed to deal with the main topic and are thus selected for the summary. This criterion has been shown to be very effective when summarizing some types of documents, such as news items. However, this property is not likely to be found in other types of documents, such as scientific articles, where other positional criteria may be preferred. The purpose of the present work is to study the utility of different positional strategies for biomedical literature summarization.
Results: We have evaluated three different positional strategies: (1) awarding the sentences at the beginning of the document, (2) preferring those at the beginning and end of the document, and (3) weighting the sentences according to the section in which they appear. To this end, we have implemented two summarizers, one based on semantic graphs and the other based on concept frequencies, and used ROUGE metrics to evaluate the summaries they produce when combined with each of the positional strategies above. Our results indicate that it is possible to improve the quality of the summaries by weighting the sentences according to the section in which they appear (≈17% improvement in ROUGE-2 for the graph-based summarizer and ≈20% for the frequency-based summarizer), and that the sections containing the most salient information are Methods and Material and Results and Discussion.
Conclusions: Traditional positional criteria that award sentences at the beginning and/or the end of the document are not helpful when summarizing scientific literature. In contrast, a more appropriate strategy is to weight sentences according to the section in which they appear.


Introduction and Motivation
The amount of biomedical literature being published has grown rapidly in recent years, making it difficult for researchers to find the information they need. In this context, automatic text summarization techniques may help alleviate the information overload problem. First, automatic summaries may be useful in anticipating the contents of the original documents, so that users may decide which of the documents to read further. As stated in [1], even in the presence of the author's abstract, there are two main reasons for wanting to generate text summaries from a full-text: (1) the abstract, which is usually limited to around 200 words, may be missing relevant content, and (2) there is not a single ideal summary, but rather, the ideal summary depends on the user's information needs. Moreover, automatic summaries have been shown to improve indexing and categorization of biomedical literature when used as substitutes for the articles' abstracts [2,3], since they help to filter out irrelevant and noisy information.
*Correspondence: laura.plaza@uam.es. Universidad Autónoma de Madrid, C/Francisco Tomás y Valiente, 11, 28049 Madrid, Spain.
Text summarization refers to the process of generating a brief summary of one or several documents [4]. Summaries may be extractive or abstractive. Extractive summaries are created by identifying salient textual units (i.e., sentences or paragraphs) in the sources, while abstractive summaries are built by paraphrasing the information in the original documents. In other words, while extractive summarization is mainly concerned with what the summary content should be, abstractive summarization puts the emphasis on the form [5]. Although human summaries are typically abstracts, most existing systems produce extracts, largely because extractive summarization has been shown to yield better results than abstractive summarization [6]. This is due to the difficulties that the abstraction process entails: it usually involves identifying the most prevalent concepts in the source, representing them semantically in an appropriate way, performing a minimum level of inference, and rewriting the summary through Natural Language Generation techniques.
http://www.biomedcentral.com/1471-2105/14/71
Extractive methods typically construct summaries based on a superficial analysis of the text. The most popular approaches include statistical techniques and graph-based methods (see [4,7] for a more detailed study of summarization techniques). Graph-based methods represent the text as a graph, where the nodes correspond to words, sentences or even concepts, and the edges represent various types of syntactic and semantic relations among them. Different clustering methods are then applied to identify salient nodes within the graph and to extract the sentences for the summary.
Statistical approaches are based on simple heuristics to rank the sentences for the summary, such as the position of the sentences in the document [8-12], the frequency of their terms [9,10,13-17], the presence of certain cue words [14,18,19], and the word overlap between the sentences and the document title and headings [11,14]. Despite their simplicity, these features are commonly used in the most recent works on extractive summarization, usually in combination with other more complex approaches, such as graph-based or template-based ones [20,21].
The position of sentences in a document has traditionally been considered an important factor in finding the sentences that are most related to the central topic of the document, and has been used in different NLP tasks [8,22,23]. Baxendale [8] examined 200 paragraphs and found that in 85% of them the topic sentence was the first one and in 7% it was the last one. Other works argue that the sentences close to the beginning and/or the end of the document are supposed to deal with the main theme of the document, and so more weight is assigned to them [11,12]. This criterion has been shown to be very effective when summarizing some types of documents, such as news items, where the information is placed following the inverted pyramid structure (i.e., the most important information is placed first and, as the article continues, the less important details are presented) [24]. However, as stated in [5,22], even though texts generally follow a predictable discourse structure, and the sentences of greater topic centrality tend to occur in certain specifiable locations, this structure varies significantly across domains, and the importance of the sentence position must be evaluated ad hoc for each domain and type of document.
Regarding summarization of biomedical literature, our previous work [20] showed that using a positional function that attaches greater relevance to sentences close to the beginning and end of the document together with a semantic graph-based summarization approach decreases performance compared with not using any positional information. This result was not surprising because scientific papers are not (a priori) expected to present the core information at the beginning and end of the document. In contrast, the first sentences in scientific papers usually introduce the motivation of the study, whereas the last sentences provide conclusions and future work. The most important information is expected to be found in the middle sentences, as part of the method, results, and discussion sections. Therefore, it seems that a more appropriate positional criterion would be one that gives priority to sentences belonging to such central sections. This intuition, however, needs to be empirically evaluated.
Following this idea, in the present work we study if the use of different positional criteria may be of help when summarizing scientific biomedical articles. In particular, three strategies are examined: (1) awarding the sentences at the beginning of the document, (2) preferring those at the beginning and end of the document, or (3) weighting the sentences according to the section (or section group) in which they appear. To this end, we have implemented two different summarizers, one based on semantic graphs and the other based on concept frequencies, and evaluated the summaries they produce when combined with each of the positional strategies above. Our results show that it is possible to improve the quality of the summaries that are generated by weighting the sentences according to the section in which they appear and giving priority to sentences from the Method and Material and Results and Discussion sections of the article. We believe these results to be of great interest since they may guide NLP tasks involving extraction of salient information in biomedical literature.
The paper is structured as follows. We first present some related work in biomedical summarization. We next describe several positional strategies for sentence selection, along with the two summarizers. We evaluate the summaries generated by both summarizers using the different positional strategies and present the evaluation results. These results are then discussed. We finally draw the main conclusions of the study and outline future work.

Background
Even though the first works in automatic text summarization date from the middle of the last century [13], research in biomedical summarization has started only recently. Biomedical summarization works typically adapt existing methods from domain-independent summarization to deal with the highly specialized biomedical terminology. To this end, they make use of external knowledge sources to represent the texts as sets of domain concepts and relations. This produces a richer representation than the one provided by traditional term-based models and results in better quality summaries.
A pioneering work in biomedical summarization is found in [25]. The authors propose the use of semantic predications provided by SemRep [26] and information from the Unified Medical Language System (UMLS) [27] to extract biomedical entities and relations, and to generate semantic-level abstracts, which are presented in graphical format. Ling et al. [28] focus on a narrower domain, genomics, and present a gene summary system that ranks sentences according to three features: the relevance of six gene aspects, such as the DNA sequence, the relevance of the documents from which the sentences are taken, and the position of the sentences in the document. Reeve et al. [1] use the frequency of the UMLS Metathesaurus concepts found in the text and adapt the lexical chaining approach [29] to deal with concepts instead of terms. Their system is used to produce single-document extracts of biomedical articles.
More sophisticated is the work of Yoo et al. [30] for multi-document summarization. They represent a corpus of documents as a graph, where the nodes are the MeSH [31] descriptors found in the corpus and the edges represent hypernymy and co-occurrence relations between them. They cluster the MeSH concepts in the corpus to identify sets of documents dealing with the same topic and then generate a summary from each document cluster. BioSquash [32] is a question-oriented extractive system for biomedical multi-document summarization. It constructs a semantic graph that contains concepts of three types: ontological concepts (general ones from WordNet [33] and specific ones from the UMLS), named entities and noun phrases.
More recent is the work of Shang et al. [34], where the aim is to combine information retrieval techniques with information extraction methods to generate text summaries of sets of documents describing a certain topic. To do this, they use SemRep to extract relations among UMLS Metathesaurus concepts and a relation-level retrieval method to select the relations most relevant to a given query concept. Finally, they extract the most relevant sentences for each topic based on the previous ranking of relations and the location of the sentences in different sections of the document. However, no details are given about how the location scores are calculated.

Methods
In this section, we first present the different positional strategies for sentence selection. We next describe the two summarizers (one based on semantic graphs and the other based on concept frequencies) that have been developed to test such positional strategies.

Positional strategies for sentence selection
In order to test our hypothesis that the position of a sentence in the different sections of the document is an indication of the importance of the sentence for inclusion in a summary, and that traditional positional strategies are not appropriate for summarizing biomedical literature, we have defined the following positional criteria:

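• Beginning of the document (Begin-Pos): sentences close to the beginning of the document are given greater weight.
• Beginning and end of the document (Begin-End-Pos): sentences close to the beginning and end of the document are given greater weight; the score of a sentence S_j under this criterion is denoted Begin-End-Pos(S_j).
• Section in the document (Section-Pos): We consider the following section classes or clusters: (1) Introduction, (2) Background (Related Work), (3) Methods and Material, (4) Results and Discussion, and (5) Conclusions and Future Work. The score of a sentence S_j is a weighted sum over these classes:
$$\text{Section-Pos}(S_j) = \gamma \cdot \text{Intro}(S_j) + \delta \cdot \text{Back}(S_j) + \theta \cdot \text{M\&M}(S_j) + \sigma \cdot \text{R\&D}(S_j) + \pi \cdot \text{C\&F}(S_j)$$
where Intro(S_j), Back(S_j), M&M(S_j), R&D(S_j), and C&F(S_j) are equal to 1 if the sentence S_j belongs to each of the five section groups, respectively, and 0 otherwise. The values of γ, δ, θ, σ, and π vary between 0 and 1, and need to be empirically determined.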
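The Section-Pos criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the section labels, the `SECTION_WEIGHTS` dictionary and the `section_pos` function are hypothetical names, the pairing of θ and σ with the Methods and Results groups is assumed from the order in which the symbols are listed, and the weights are simply the best-performing configuration reported in the Results section.

```python
# Illustrative sketch only: names and data layout are assumptions,
# not the authors' implementation.

# Weight of each section group, tuned empirically in [0, 1]. The values below
# follow the best configuration reported in the Results section.
SECTION_WEIGHTS = {
    "introduction": 0.2,                 # gamma
    "background": 0.0,                   # delta
    "methods_and_material": 1.0,         # theta (assumed symbol order)
    "results_and_discussion": 1.0,       # sigma (assumed symbol order)
    "conclusions_and_future_work": 0.2,  # pi
}


def section_pos(section, weights=SECTION_WEIGHTS):
    """Return the Section-Pos score of a sentence located in `section`.

    Each sentence belongs to exactly one section group, so the weighted sum
    of the five 0/1 indicators reduces to the weight of that group.
    """
    indicators = {name: 1 if name == section else 0 for name in weights}
    return sum(weights[name] * indicators[name] for name in weights)


# Toy sentences paired with the (hypothetical) label of their section group.
sentences = [
    ("Summaries help researchers cope with information overload.", "introduction"),
    ("Each sentence is mapped to UMLS concepts.", "methods_and_material"),
    ("Scores are compared against a baseline.", "results_and_discussion"),
]
scores = [section_pos(section) for _, section in sentences]
```

With these weights, sentences from the Background group receive a positional score of zero, so they can only enter a summary through the non-positional part of the sentence score.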

Graph-based summarizer
We use the graph-based summarization method presented in [20]. This method is based on the representation of the document as a conceptual graph, using the UMLS [27] as the knowledge source, and the use of a degree-based clustering algorithm for detecting salient concepts within the graph. The original summarizer has been modified to incorporate more advanced positional strategies in the sentence selection step. The system architecture is illustrated in Figure 1.
The method consists of four main steps, which are briefly explained below (see [20] for a detailed explanation):
• The first step, concept identification, is to map the document to concepts from the UMLS Metathesaurus and semantic types from the UMLS Semantic Network. We run the MetaMap [35] program over the body section of the document to obtain the Metathesaurus concepts that are found within the text. MetaMap is invoked using the word sense disambiguation option (-y flag). This flag implements the Journal Descriptor Indexing (JDI) methodology described in [36]. UMLS concepts belonging to very general semantic types are discarded, since they have been found to be excessively broad and do not contribute to summarization. These types are Quantitative concept, Qualitative concept, Temporal concept, Functional concept, Idea or concept, Intellectual product, Mental process, Spatial concept and Language.
• The second step, document representation, is to construct a graph-based representation of the document. To do this, we first extend the UMLS concepts with their complete hierarchy of hypernyms (is a relations) and merge the hierarchies of all the concepts in the same sentence to construct a sentence graph. The upper levels of these hierarchies are removed, since they represent concepts with excessively broad meanings. Next, all the sentence graphs are merged into a single document graph. This graph is extended with two further types of relations: relations between concepts in the UMLS Metathesaurus and relations between semantic types in the UMLS Semantic Network (see [37] for a description of the different relationships in the UMLS). Finally, each edge is assigned a weight in [0, 1] as shown in equation 4. The weight of an edge e representing an is a relation between two vertices, v_i and v_j (where v_i is a parent of v_j), is calculated as the ratio of the depth of v_i to the depth of v_j from the root of their hierarchy. The weight of an edge representing any other relation (i.e., associated with and other related) between pairs of leaf vertices is always 1.
$$\text{weight}(e, v_i, v_j) = \beta, \qquad \beta = \begin{cases} \dfrac{\text{depth}(v_i)}{\text{depth}(v_j)} & \text{if } e \text{ represents an is a relation} \\ 1 & \text{otherwise} \end{cases} \qquad (4)$$
To illustrate this process, Figure 2 shows the document graph for a text fragment from [38].
• The third step, topic recognition, consists of clustering the UMLS concepts in the document graph using a degree-based clustering method similar to that used by [30]. The aim is to construct sets of concepts strongly related in meaning, based on the assumption that each of these clusters represents a different topic in the document. We first compute the salience of each vertex in the graph as the sum of the weights of the edges that are linked to it. Next, the vertices are ranked according to their salience. The n vertices with the highest salience are labeled as hub vertices. The clustering algorithm then groups the hub vertices into hub vertex sets (HVS). These can be interpreted as sets of strongly connected concepts and will represent the centroids of the final clusters. The remaining vertices (i.e., those not included in the HVS) are iteratively assigned to the cluster to which they are most connected. The output of this step is, therefore, a number of clusters of UMLS concepts, each cluster represented by the set of most highly connected concepts within it (the so-called HVS). For instance, the top five salient concepts in the document graph represented in Figure 2 are: cells, LRF, cJUN, c-fos, and growth.
• The last step, sentence selection, consists of computing the similarity between each sentence graph and each cluster, and selecting the sentences for the summary based on these similarities. To compute sentence-to-cluster similarity, we use a non-democratic vote mechanism so that each vertex of a sentence assigns a vote to a cluster if the vertex belongs to its HVS, half a vote if the vertex belongs to the cluster but not to its HVS, and no votes otherwise. The similarity between the sentence graph and the cluster is computed as the sum of the votes assigned by all the vertices in the sentence graph to the cluster. A single score for each sentence is calculated as the sum of its similarity to each cluster adjusted to the cluster's size (equation 5).
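Finally, this semantic similarity is normalized to the [0,1] interval and combined with each of the positional criteria explained in the previous section using a linear function (see equation 6). The N sentences with the highest scores are then selected for the summary.
$$\text{Score}(S_j) = \alpha \times \text{Sem-Sim}(S_j) + \beta \times \text{Position}(S_j) \qquad (6)$$
Here Sem-Sim(S_j) is the normalized semantic similarity of the sentence S_j and Position(S_j) is its score under the chosen positional criterion. α and β can be assigned different weights between 0 and 1. Their optimal values must be determined empirically.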
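The edge weighting, salience, voting and final scoring steps described above can be sketched as follows. This is a minimal illustration, not the implementation from [20]: all function names and the toy data are hypothetical, dividing by the cluster size is an assumed reading of "adjusted to the cluster's size", normalization by the maximum score is an assumed way of mapping similarities to [0,1], and the values of `alpha` and `beta` are arbitrary. Concept identification and clustering are taken as given.

```python
from collections import defaultdict


def is_a_edge_weight(depth_parent, depth_child):
    """Weight of an 'is a' edge: depth of the parent over depth of the child."""
    return depth_parent / depth_child


def vertex_salience(edges):
    """Salience of each vertex: sum of the weights of its incident edges."""
    salience = defaultdict(float)
    for u, v, weight in edges:
        salience[u] += weight
        salience[v] += weight
    return dict(salience)


def sentence_cluster_similarity(sentence_vertices, cluster, hvs):
    """Vote-based similarity: 1 per HVS vertex, 0.5 per other cluster vertex."""
    votes = 0.0
    for vertex in sentence_vertices:
        if vertex in hvs:
            votes += 1.0
        elif vertex in cluster:
            votes += 0.5
    return votes


def semantic_similarity(sentence_vertices, clusters):
    """Sum of the similarities to every cluster, each divided by the cluster size.

    Dividing by the size is an assumption about how the score is "adjusted".
    """
    return sum(
        sentence_cluster_similarity(sentence_vertices, cluster, hvs) / len(cluster)
        for cluster, hvs in clusters
    )


def combined_scores(semantic, positional, alpha, beta):
    """Linear combination of max-normalized semantic scores and positional scores."""
    top = max(semantic) or 1.0  # avoid dividing by zero when every score is 0
    return [alpha * s / top + beta * p for s, p in zip(semantic, positional)]


# Toy data (hypothetical): weighted edges, two clusters given as (cluster, HVS),
# three sentences given as sets of concepts, and one positional score each.
edges = [("cell", "growth", 1.0), ("cell", "protein", 0.5), ("protein", "gene", 0.75)]
clusters = [({"cell", "growth", "protein"}, {"cell"}), ({"gene", "disease"}, {"gene"})]
sentence_graphs = [{"cell", "growth"}, {"gene"}, {"protein"}]
positional = [0.2, 1.0, 1.0]

salience = vertex_salience(edges)
semantic = [semantic_similarity(s, clusters) for s in sentence_graphs]
scores = combined_scores(semantic, positional, alpha=1.0, beta=0.1)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the first two sentences tie on semantic similarity, and the positional term breaks the tie in favour of the second one.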

Concept frequency-based summarizer
The second summarizer is based on the frequency of UMLS concepts in the document. It consists of four steps:
• The first step, concept identification, is to map the document to concepts from the UMLS Metathesaurus and semantic types from the UMLS Semantic Network using MetaMap, as explained for the graph-based summarizer.
• Concept frequency representation: Following Luhn's theory, we assume that the more times a word (or concept) appears in a document, the more relevant the sentences that contain it become. In this way, if [...]

α and β can be assigned different weights between 0 and 1. Their optimal values must be determined empirically.
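One simple way to read Luhn's assumption is sketched below. The scoring rule (a sentence scores the sum of the document-level frequencies of its concepts) is an assumption chosen only for illustration, not necessarily the rule used by the summarizer; the function names and the toy concept lists, which stand in for MetaMap output, are hypothetical.

```python
from collections import Counter


def concept_frequencies(sentence_concepts):
    """Count how often each concept occurs in the whole document."""
    counts = Counter()
    for concepts in sentence_concepts:
        counts.update(concepts)
    return counts


def frequency_scores(sentence_concepts):
    """Score each sentence as the sum of the document frequencies of its concepts.

    Assumed scoring rule, used only to illustrate Luhn's idea.
    """
    counts = concept_frequencies(sentence_concepts)
    return [sum(counts[c] for c in concepts) for concepts in sentence_concepts]


# Toy document: one list of (hypothetical) concepts per sentence.
document = [
    ["gene", "protein", "cell"],
    ["gene", "cell"],
    ["patient"],
]
scores = frequency_scores(document)
```

Under this rule, sentences that mention several frequent concepts rank highest, which also favours long sentences; a length normalization could be added if that bias is unwanted.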

Evaluation methods
The most common approach to evaluating automatically generated summaries of a document (also known as peers) is to compare them against manually created summaries (called reference or model summaries) and measure the similarity between their content. The more content that is shared between the peer and reference summaries, the better the peer summary is assumed to be. To the authors' knowledge, no corpus of model summaries exists for biomedical articles. For this reason, in this work we use a collection of 100 biomedical scientific articles randomly selected from the PMC Open Access Subset [39]. When collecting the articles, we made sure that they contain at least the following five main sections: Introduction, Background, Methods, Results and Discussion, and Conclusions and Future Work. The abstracts of the articles are used as model summaries, since they condense the most relevant content of the articles and have been written manually. The ROUGE metrics [40] are used to quantify the content similarity between the automatic summaries and the reference ones. ROUGE is a commonly used evaluation method for summarization that uses the proportion of n-grams shared between a peer and one or more reference summaries to compute a value within [0,1]. Higher values of ROUGE are preferred, since they indicate a greater content overlap between the peer and the model. The following ROUGE metrics are used: ROUGE-2 and ROUGE-SU4. ROUGE-2 counts the number of bigrams that are shared by the peer and reference summaries and computes a recall-related measure. Similarly, ROUGE-SU4 measures the overlap of skip-bigrams (i.e., pairs of words in their sentence order, allowing for gaps between them), using a maximum skip distance of 4. Both ROUGE-2 and ROUGE-SU4 have shown high correlation with the human judgments gathered from the Document Understanding Conferences [41].
However, it must be noted that ROUGE metrics present two important limitations: (1) they depend on the length of the peer summaries (i.e., the longer the peer is with respect to the model, the higher the ROUGE scores are expected to be), and (2) since they use lexical matching instead of semantic matching, peer summaries that are worded differently but convey the same semantic information may be assigned different ROUGE scores. Thus, these metrics should only be used in a comparative fashion on the same dataset and should not be interpreted as absolute measures.
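To make the bigram-recall idea concrete, the following is a simplified sketch of a ROUGE-2-style recall score. It is not the official ROUGE toolkit: tokenization is plain whitespace splitting, there is no stemming or stop-word handling, only a single reference is supported, and the function names and example sentences are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of lowercase word bigrams in `text`."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2_recall(peer, reference):
    """Fraction of reference bigrams that also occur in the peer summary."""
    ref = bigrams(reference)
    if not ref:
        return 0.0
    overlap = bigrams(peer) & ref  # multiset intersection keeps clipped counts
    return sum(overlap.values()) / sum(ref.values())


reference = "the graph based summarizer selects salient sentences"
peer_a = "the graph based summarizer selects sentences"
peer_b = "a summarizer based on graphs picks salient sentences"

score_a = rouge_2_recall(peer_a, reference)  # 4 of the 6 reference bigrams match
score_b = rouge_2_recall(peer_b, reference)  # only 1 of the 6 matches
```

The second peer conveys much the same information as the reference but shares a single bigram with it, which illustrates the lexical-matching limitation noted above.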
Automatic summaries are generated by selecting sentences until each summary reaches the same number of sentences as its corresponding model summary (i.e., the article's abstract). We generate summaries using both the graph-based and the frequency-based summarizers and the three positional strategies for sentence selection, assigning different weights to the different parameters of the summarizers. For these experiments, different combinations of values for α, β, γ, δ, θ, σ, and π were tested. However, for the sake of brevity, only the combinations that produced the best ROUGE scores are presented. A Wilcoxon Signed Ranks Test at a 95% confidence level is used to test the statistical significance of the results.

Results
We first evaluate the adequacy of the Begin-Pos positional strategy. The results of evaluating the automatic summaries generated when different weights are assigned to this criterion are shown in Table 1. As may be seen from this table, giving greater weight to sentences at the beginning of the document improves the quality of the automatic summaries compared with not using positional information, but only when the weight assigned to the position of the sentences is low (from 0.1 to 0.2). The improvement achieved is higher for the frequency-based summarizer than for the graph-based one, and it is statistically significant only for the former.

Table 1 ROUGE scores for the summaries generated using the Begin-Pos strategy
Significance is calculated with respect to the non-positional information baseline (β = 0.0), and shown using the following convention: * = p<.05 and no star indicates non-significance. The best scores per summarizer are shown in bold.
We next evaluate the effect of the Begin-End-Pos criterion, which attaches greater weight to sentences close to the beginning and end of the document. It may be seen from Table 2 that this strategy does not benefit the quality of the graph-based summaries, regardless of the weights assigned to the different criteria, but slightly improves the performance of the concept-frequency based summarizer when β is set to 0.1. However, once again, the improvement achieved is not significant.
We finally examine the summaries generated when different weights are assigned to sentences depending on the section in which they appear (Section-Pos strategy). These results are shown in Table 3. For both summarizers, there exists a combination of weights that produces significantly better summaries compared with the non-positional summaries (β = 0.0). In particular, the best ROUGE scores are reported when the Introduction and the Conclusions and Future Work sections are given a weight of 0.2, no weight is assigned to sentences from the Related Work section, and the sentences from the Methods and Material and the Results and Discussion sections are given a weight of 1.0. This configuration allows for improvements of over 17% in ROUGE-2 for the graph-based summarizer and over 20% for the frequency-based summarizer.
It is also worth mentioning that the experiments showed that weights for the Background and Conclusions and Future Work sections (i.e., δ and π values) above 0.1 produce very poor summarization results, and that γ values (i.e., the weight for the Introduction section) above 0.2 decrease performance as well. In contrast, the best results are reported when the Methods and Material and the Results and Discussion sections are assigned high weights.
Finally, Table 4 compiles the best results for each positional strategy and summarizer. For comparison purposes, this table also shows the ROUGE scores for the summaries generated using LexRank [42]. LexRank is the best-known graph-based method for summarization. It models documents as undirected graphs in which each node corresponds to a sentence, represented by its TF-IDF vector, and the edges are labeled with the cosine similarity between the sentences. It may be seen from Table 4 that the best ROUGE scores are obtained when the graph-based summarizer is combined with information about the position of the sentences in the different document sections. These scores are significantly better than those of both the frequency-based summarizer and LexRank.
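The sentence graph that LexRank starts from can be sketched in plain Python as follows. This covers only the graph construction described above, not the subsequent ranking of the nodes; the TF-IDF variant (raw term frequency times log(N/df), with sentences playing the role of documents), the function names and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (a dict) per tokenized sentence."""
    n = len(sentences)
    doc_freq = Counter(term for tokens in sentences for term in set(tokens))
    vectors = []
    for tokens in sentences:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / doc_freq[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(sentences):
    """Undirected graph stored as {(i, j): cosine similarity} for i < j."""
    vectors = tfidf_vectors(sentences)
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    }


# Toy sentences (hypothetical), already tokenized by whitespace.
sentences = [
    "gene expression in tumor cells".split(),
    "tumor cells show altered gene expression".split(),
    "patients received a placebo".split(),
]
graph = similarity_graph(sentences)
```

Unlike the concept-based graph used by our summarizer, the nodes here are whole sentences and the edges rely purely on shared surface terms, so sentences with no words in common are left unconnected (similarity 0).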

Discussion
The results in Tables 1, 2, and 3 confirm our hypothesis that traditional positional strategies are not appropriate when summarizing biomedical scientific articles, as opposed to summarizing other types of documents, such as news items. Our experiments have shown that awarding sentences close to the end of the document decreases summarization performance compared with not using any positional information, while awarding sentences close to the beginning of the document only improves the quality of the summaries for the frequency-based summarizer.
In contrast, we have found that it is possible to improve summarization by taking into account the section of the article in which the different sentences appear, and attaching greater relevance to sentences from the appropriate sections. In particular, it has been found that sentences from the Methods and Material and Results and Discussion sections are more relevant for inclusion in the summary, since they are more closely related to the main topic of the document than sentences in other sections, such as Introduction and Conclusions and Future Work. Sentences in the Related Work section seem to be secondary, and therefore are not usually included in a summary. These results confirm what may be observed in the abstracts of the articles. If we examine such abstracts, we see that the information considered important by the authors of the articles is mostly related to the method and results being described in the article, while the remaining sections are given less credit.
An interesting finding is that, in general, the frequency-based approach benefits more from the information about the position of sentences in the document than the graph-based one. We think this is because the frequency of the concepts alone is not enough to capture the salience of the sentences, and so the use of the positional criteria helps to bias the selection of sentences toward the most relevant information.
In contrast, the graph-based method better captures the importance of the different sentences, and thus produces better quality summaries even when no positional information is used. However, it still presents some limitations that will be addressed in future work. The first limitation has to do with the coverage of the UMLS. While general clinical terms are quite well covered, other vocabulary, especially that related to genomics, is not well supported [43]. As a result, automatic summaries of genetic and proteomic articles present low ROUGE values.
The second limitation has to do with the accuracy of the MetaMap mappings and the ambiguity in the UMLS Metathesaurus. Even using the -y disambiguation option, the precision of the JDI algorithm is reported to be around 0.78 when evaluated against a set of 45 ambiguous terms from the NLM-WSD corpus [44]. This precision, however, is expected to be lower for genomic entities, such as protein and gene names, where ambiguity is more frequent.
Third, our summarization and evaluation methods assume that all users have the same information needs, and that these needs are reflected in the authors' abstracts. However, different users may have different interests. In future work, we plan to extend the summarizer to produce query-based summaries that take into account the readers' information needs as specified in a user's query. To this end, the similarity of each sentence in the document to the user's query may be computed and used as a feature for sentence selection.

Conclusions
This work explores the utility of the position of the sentences as a feature for automatic summarization of scientific articles. Toward this goal, we have developed two different summarizers, one based on semantic graphs and the other using concept frequencies, which implement three different positional strategies: the first gives more importance to sentences at the beginning of the article, the second prefers sentences both at the beginning and end, and the third weights sentences according to the section in which they appear. The summaries generated are evaluated and compared with non-positional summaries.
Overall, the results suggest that it is possible to improve summarization by taking into account the section of the article in which the different sentences appear, and attaching greater relevance to sentences from the appropriate sections. In contrast, traditional strategies that attach greater weights to sentences at the beginning and end of the document are not suitable when summarizing biomedical scientific articles. We believe that our results are of great interest since they may guide NLP tasks involving extraction of salient information in biomedical literature. As future work, we plan to investigate the importance of the specific location of sentences within the different sections of the article. For instance, the last sentences of the Introduction section may be more relevant than the first sentences in the same section, since they usually anticipate the content of the document.