MeSH indexing based on automatically generated summaries

Background MEDLINE citations are manually indexed at the U.S. National Library of Medicine (NLM) using as reference the Medical Subject Headings (MeSH) controlled vocabulary. For this task, the human indexers read the full text of the article. Due to the growth of MEDLINE, the NLM Indexing Initiative explores indexing methodologies that can support the task of the indexers. Medical Text Indexer (MTI) is a tool developed by the NLM Indexing Initiative to provide MeSH indexing recommendations to indexers. Currently, the input to MTI is MEDLINE citations, title and abstract only. Previous work has shown that using full text as input to MTI increases recall, but decreases precision sharply. We propose using summaries generated automatically from the full text for the input to MTI to use in the task of suggesting MeSH headings to indexers. Summaries distill the most salient information from the full text, which might increase the coverage of automatic indexing approaches based on MEDLINE. We hypothesize that if the results were good enough, manual indexers could possibly use automatic summaries instead of the full texts, along with the recommendations of MTI, to speed up the process while maintaining high quality of indexing results. Results We have generated summaries of different lengths using two different summarizers, and evaluated the MTI indexing on the summaries using different algorithms: MTI, individual MTI components, and machine learning. The results are compared to those of full text articles and MEDLINE citations. Our results show that automatically generated summaries achieve similar recall but higher precision compared to full text articles. Compared to MEDLINE citations, summaries achieve higher recall but lower precision. Conclusions Our results show that automatic summaries produce better indexing than full text articles. Summaries produce similar recall to full text but much better precision, which seems to indicate that automatic summaries can efficiently capture the most important contents within the original articles. The combination of MEDLINE citations and automatically generated summaries could improve the recommendations suggested by MTI. On the other hand, indexing performance might be dependent on the MeSH heading being indexed. Summarization techniques could thus be considered as a feature selection algorithm that might have to be tuned individually for each MeSH heading.


Background
MEDLINE® citations are manually indexed using the Medical Subject Headings (MeSH)® controlled vocabulary. This indexing is performed by a relatively small group of highly qualified indexing contractors and staff at the US National Library of Medicine (NLM). MeSH indexing consists of reviewing the full text of each article, rather than an abstract or summary, and assigning descriptors that represent the central concepts that are discussed.
Indexers assign descriptors from the MeSH vocabulary of 26,581 main headings (2012), which are often referred to as MeSH Headings (MHs). Main heading descriptors may be further qualified by selections from a collection of 83 topical Subheadings (SHs). In addition there are 203,658 Supplementary Concepts (formerly Supplementary Chemicals) which are available for inclusion in MEDLINE records.
Since 1990, there has been a steady and sizeable increase in the number of articles indexed for MEDLINE, because of both an increase in the number of indexed journals and, to a lesser extent, an increase in the number of in-scope articles in journals that are already being indexed. The http://www.biomedcentral.com/1471-2105/14/208 NLM expects to index over one million articles annually within a few years [1].
In the face of a growing workload and dwindling resources, NLM has undertaken the Indexing Initiative to explore indexing methodologies that can help ensure that MEDLINE and other NLM document collections maintain their quality and currency and thereby contribute to NLM's mission of maintaining quality access to the biomedical literature.
The NLM Indexing Initiative has developed the Medical Text Indexer (MTI) [2][3][4], which is a support tool for assisting indexers as they add MeSH indexing to MED-LINE. Given a MEDLINE citation with only the title and abstract, MTI will deliver a ranked list of MHs, as shown in Figure 1. This includes not only MHs but also related SHs. MTI and its current relation to MESH indexing are described in more detail in the Methods section.
Even though indexers have access to the full text during indexing time, MTI has to rely solely on title and abstract since full text is not yet available for automatic processing. Most of the research in MEDLINE indexing with MeSH has been performed on MEDLINE titles and abstracts. We would like to explore the possibility of extending MTI to full text or other more suitable representations to understand the problems of dealing with larger representations, both in efficiency and performance. In previous work, full text has been used with the MTI tool [5]. Despite the decrease in precision, indexing based on full text provides a potential increase in recall.
In this work, we propose exploring the use of automatically generated summaries from full text articles as an intermediary step to identifying the salient pieces of information for indexing using several algorithms; i.e. MTI, individual MTI components and machine learning. To this end, we have considered summaries of different lengths generated automatically from the full text as surrogates for full text articles in automatic indexing. Summaries provide more information than title and abstract, which might improve the coverage provided by the automatic indexing approaches at the expense of some loss in precision. In addition, as the summaries contain salient information from the full text article, it may reduce the number of false positives that automatic indexing systems like MTI currently generate based on MEDLINE citations. As soon as more full text articles are available for automatic processing, they might be considered within the MTI system. This article is organized as follows. First, related work in indexing and automatic summarization is presented. Then, MTI is described, along with the two systems used for generating the automatic summaries. We later present the evaluation setup and discuss the results of several experiments. We finally draw conclusions and outline future work.

Related work
In this section, we present some previous work in biomedical text indexing and automatic summarization. We also present some related work on the use of automatic summaries as an intermediate step in text categorization and indexing.

Biomedical text indexing
In addition to the NLM Indexing Initiative developments, MeSH indexing has received attention from other research groups. We find that most of the methods fit either into pattern matching methods which are based on a reference terminology (like Unified Medical Language System (UMLS)® or MeSH) and machine learning approaches which learn a model from examples of previously indexed citations.
Among the pattern matching methods we find the MetaMap component of MTI and an information retrieval approach by Ruch [6]; in his system the categories are the documents and the query is the text to be indexed. http://www.biomedcentral.com/1471-2105/ 14/208 Pattern matching considers only the inner structure of the terms but not the terms with which they co-occur. This means that if a document is related to a MeSH heading but does not appear in the text being indexed, it will not be suggested. Machine learning based on previously indexed citations might help to overcome this problem.
A growing body of work approaches retrieval of MED-LINE citations as a classification task. For example, MScanner classifies all MEDLINE citations as relevant to a set of positive examples submitted by a user or not [7], and Kastrin et al. [8] determine the likelihood of MED-LINE citations, topical relevance to genetics research. The large body of related work provides valuable insights with respect to classification of MEDLINE citations and feature selection methods.
Machine learning methods tend to be ineffective with a large number of categories; MeSH contains more than 26k. Small scale studies with machine learning approaches exist [9,10], but the presence of a large number of categories has forced machine learning approaches to be combined with information retrieval methods designed to reduce the search space. For instance, PRC (PubMed Related Citations) [11] and a k-NN approach by Trieschnigg et al. [12] look for similar citations in MED-LINE and predict MeSH headings by a voting mechanism on the top-scoring citations.
In previous work, full text has been used within the context of MeSH indexing using the MTI tool [5]. This research shows that there is a potential contribution from the full text which usually is not available for title and abstract. However, in most of the previous work, including work at the NLM Indexing Initiative project, indexing is performed on titles and abstracts. This is due to the fact that, due to license restrictions, the full text of the articles is not available. Even if some of these articles might become available from open source journals, the indexing is performed before these articles are available. We would like to evaluate the performance of the current indexing tools so that they are ready when full text becomes commonly available for indexing.

Summarization of biomedical text
Text summarization is the process of generating a brief summary of one or several documents by selection or generalization of what is important in the source [13]. Extractive summarization systems identify salient sentences from the original documents to build the summaries by using a number of techniques. In the biomedical domain, the most popular approaches include statistical techniques and graph-based methods (see [14] for an extensive review of biomedical summarization).
Statistical approaches are based on simple heuristics such as the position of the sentences in the document [15], the frequency of terms [16,17], the presence of certain cue words [17] or the word overlap between sentences and the document title and headings [17]. Graph-based methods represent the text as a graph, where the nodes correspond to words or sentences, and the edges represent various types of syntactic and semantic relations among them. Different clustering methods are then applied to identify salient nodes within the graph and to extract the sentences for the summary [18,19].
Biomedical terminology is highly specialized and presents some peculiarities, such as lexical ambiguity and the frequent use of acronyms and abbreviations, that make automatic summarization different from that in others domains [20]. To capture the meaning of the text and work at the semantic level, most approaches use domain-specific knowledge sources, such as the UMLS or MeSH [21][22][23]. Moreover, biomedical articles usually follow the IMRaD structure (Introduction, Method, Results and Discussion), which allows summarization systems to exploit the documents' structure to produce higher quality summaries.
Examples of recent biomedical summarization approaches are described next. Reeve et al. [21] use UMLS concepts to represent the text and discover strong thematic chains of UMLS semantic types, and apply this to single document summarization. BioSquash [24] is a question-oriented multi-document summarizer for biomedical texts. It constructs a graph that contains concepts of three types: ontological concepts, named entities, and noun phrases. Fiszman et al. [25] propose an abstractive approach that relies on the semantic predications provided by SemRep [26] to interpret biomedical text and on a transformation step using lexical and semantic information from the UMLS to produce abstracts from biomedical scientific articles. Yoo et al. [22] describe an approach to multi-document summarization that uses MeSH descriptors and a graphbased method for clustering articles into topical groups and producing a multi-document summary of each group.
Finally, it is worth mentioning that, considering their intended application, the automatic summaries may be an end in themselves (i.e., they aim to substitute the original documents) or a means to improve the performance of other NLP tasks. Automatic summaries, for instance, have been shown to improve categorization of biomedical literature when used as substitutes for the articles' abstracts [27]. The next section explores this issue in detail.

Using automatic summaries for text indexing and categorization
Automatic summarization has shown to be of use as an intermediate step in other Natural Language Processing tasks, especially text categorization, when the http://www.biomedcentral.com/1471-2105/14/208 automatic summaries are used as substitutes for the original documents.
Shen et al. [28], for instance, improve accuracy of a web page classifier by using summarization techniques. Since web pages typically present noisy content, automatic summaries may help to extract relevant information and to avoid bias for the classification algorithm.
Similarly, Kolcz et al. [29] use automatic summarization as a feature selection function that allows to reduce the size of the documents within a categorization. In this context, the authors tested a number of simple summarization strategies and concluded that automatic summarization may be of help when categorizing short newswire stories.
In Lloret et al. [30], the use of text summarization in the classification of user-generated product reviews is investigated. In particular, the authors study whether it is possible to improve the rating-inference task (i.e., the task of identifying the author's evaluation of an entity with respect to an ordinal-scale based on the author's textual evaluation of the entity) by using summaries of different lengths instead of the original full-text user reviews.
In the biomedical domain, however, the use of automatic summaries in text categorization has been less exploited, and only a few preliminary works have been published [27].

Methods
In this section, we first present the Medical Text Indexer developed as part of the NLM Indexing Initiative. Then, we describe the summarization methods used to generate the automatic summaries.

The medical text indexer
The Medical Text Indexer (MTI) [2][3][4] is a support tool for assisting indexers as they add MeSH indexing to MEDLINE. Figure 1 shows a diagram of the MTI system. MTI has two main components: MetaMap [31] and the PubMed® Related Citations (PRC) algorithm [11]. MetaMap indexing (MMI) analyzes citations and annotates them with UMLS concepts. The mapping from UMLS to MeSH follows the Restrict-to-MeSH [32] approach which is based primarily on the semantic relationships among UMLS concepts. The PRC algorithm is a modified k-Nearest Neighbors (k-NN) algorithm which relies on document similarity to assign MeSH headings (MHs). PRC attempts to increase the recall of MTI by proposing indexing candidates for MHs which are not explicitly present in the title and abstract of the citation but which are used in similar contexts.
In a process called Clustering and Ranking, the output of MMI and PRC are merged by linear combination of their indexing confidence. The ranked lists of MeSH headings produced by all of the methods described so far must be clustered into a single, final list of recommended indexing terms. The task here is to provide a weighting of the confidence or strength of belief in the assignment, and rank the suggested headings appropriately.
Once all of the recommendations are ranked and selected, a Post-Processing step validates the recommendations based on the targeted end-user. The purpose of this step is to comply with the indexing policy at the NLM and to incorporate indexer feedback. This step applies a set of rules triggered by either recommended headings (e.g. if the Pregnancy heading is recommended add the Female heading) or by terms from the text (e.g if the term cohort appears in text, add the heading Cohort Studies). In addition, commonly occurring MHs called Check Tags (CTs) are added based on: triggers from the text, recommended headings, and a machine learning algorithm for the most frequently occurring Check Tags [33,34]. Check Tags are a special class of MeSH Headings considered routinely for every article, which cover species, sex, human age groups, historical periods and pregnancy [35]. Finally, MTI performs subheading attachment [36] to individual headings and for the text in general.
Indexers can use MTI suggestions for the citations that they are indexing. MTI usage has grown steadily to the point where indexers request MTI results almost 2,500 times a day representing about 50% of indexing throughput [37]. In addition, the users can access the MTI why tool to examine the evidence for the MTI suggestions in the MEDLINE citation they are indexing, providing a better understanding of the proposed indexing terms. Currently, there are a set of 23 journals indexed for which MTI is used as first line indexer. This means that the suggestions by MTI for these journals are considered as good as the ones provided by a human indexer and subject to the normal manual review process. MTI is available as well as a web service [38] and requires UTS (UMLS Terminology Services) credentials.

Summarization methods
Two summarizers are implemented and used for the experiments: the first is based on semantic graphs and the second is based on concept frequencies. Each summarizer is described below.

Graph-based summarization
We use the graph-based summarization method presented in Plaza et al. [23], which we briefly explain here for completeness (see [23] for additional details). The method consists of the following four main steps: • The first step, concept identification, is to map the document to concepts from the UMLS Metathesaurus and semantic types in the UMLS Semantic Network. We first run the MetaMap program over the http://www.biomedcentral.com/1471-2105/14/208 text in the body section of the document. MetaMap returns the list of candidate mappings, along with their score. To accurately select the correct mapping when MetaMap is unable to return a single best-scoring mapping for a phrase because of a text ambiguity problem, we use the AEC (Automatic Extracted Corpus) [39] disambiguation algorithm to decide. This algorithm was shown to behave better than other WSD methods in the context of a text summarization task (see [40]). UMLS concepts belonging to very general semantic types are discarded since they have been found to be excessively broad and do not contribute to summarization. • The second step, document representation, is to construct a graph-based representation of the document. To do this, we first extend the disambiguated UMLS concepts with their complete hierarchy of hypernyms (is a relations). Then, we merge the hierarchies of all the concepts in the same sentence to construct a sentence graph. The two upper levels of these hierarchies are removed, since they represent concepts with excessively broad meanings. Next, all the sentence graphs are merged into a single document graph. This graph is extended with two further relations (other related from the Metathesaurus and associated with from the Semantic Network) to obtain a more complete representation of the document. Finally, each edge is assigned a weight in [0, 1]. The weight of an edge e representing an is a relation between two vertices, v i and v j (where v i is a parent of v j ), is calculated as the ratio of the depth of v i to the depth of v j from the root of their hierarchy. The weight of an edge representing any other relation (i.e., associated with and other related ) between pairs of leaf vertices is always 1. • The third step, topic recognition, consists of clustering the UMLS concepts in the document graph using a degree-based clustering method similar to PageRank [41]. The aim is to construct sets of concepts strongly related in meaning, based on the assumption that each of these clusters represents a different topic in the document. We first compute the salience or prestige of each vertex in the graph, as the sum of the weights of the edges that are linked to it. Next, the nodes are ranked according to its salience. The n vertices with a highest salience are labeled as hub vertices. The clustering algorithm then groups the hub vertices into hub vertex sets (HVS). These can be interpreted as sets of strongly connected concepts and will represent the centroids of the final clusters. The remaining vertices (i.e., those not included in the HVS) are iteratively assigned to the cluster to which they are more connected. The output of this step is, therefore, a number of clusters of UMLS concepts, each cluster represented by the set of most highly connected concepts within it (the so-called HVS). • The last step, sentence selection consist of computing the similarity between each sentence graph and each cluster, and selecting the sentences for the summary based on these similarities. To compute sentence-to-cluster similarity, we use a nondemocratic voting mechanism [22] so that each vertex of a sentence assigns a vote to a cluster if the vertex belongs to its HVS, half a vote if the vertex belongs to it but not to its HVS, and no votes otherwise. The similarity between the sentence graph and the cluster is computed as the sum of the votes assigned by all the vertices in the sentence graph to the cluster. Finally, a single score for each sentence is calculated, as the sum of its similarity to each cluster adjusted to the cluster's size (Equation 1). The N sentences with highest scores are then selected for the summary.

Concept frequency-based summarization
The second summarization method is a statistical summarizer which is mainly based on the frequency of the UMLS concepts in the document, but also considers other well-accepted heuristics for sentence selection, such as the similarity of the sentences with the title and abstract sections and their position in the document. The method consists of five steps: • The first step, concept identification, is to map the document to concepts from the UMLS Metathesaurus and semantic types in the UMLS Semantic Network. MetaMap is run over the text in the body, abstract and title sections. As with the graph-based summarizer, ambiguity is resolved using the AEC algorithm. Again, concepts belonging to very general semantic types are discarded. • Term frequency representation: Following Luhn's theory [16], we assume that the more times a word (or concept) appears in a document, the more relevant become the sentences that contain this word. In this way, if {C 1 , C 2 , ..., C n } is the set of n Metathesaurus concepts that appear in the body of a document d, and • Similarity with the title and abstract: We next compute the similarity between each sentence in the body of the document and the title and abstract, respectively. The title given to a document by its author is intended to represent the most significant information in the document, and thus it is frequently used to quantify the relevance of a sentence. Similarly, the abstract is expected to summarize the important content of the document. We compute these similarities as the proportion of UMLS concepts in common between the sentence and the title/abstract, as shown in Equations 2 and 3. Title • Sentence position: The position of the sentences in the document has been traditionally considered an important factor in finding the sentences that are most related to the topic of the document [15]. In some types of documents, such as news items, sentences close to the beginning of the document are expected to deal with the main theme of the document, and therefore more weight is assigned to them. However, Plaza et al. [23] showed that this is not true for biomedical scientific papers. In contrast, it was found that a more appropriate criterion would be that which attaches greater importance to sentences belonging to the central sections of the article. For that reason, in this work we calculate a Position(S j ) score according to Equation 4, where the functions Intro(S j ), MRD(S j ), and Concl(S j ) are equal to 1 if the sentence S j belongs to the Background section, to the Methods, Results and discussion section, and to the Conclusions section, respectively, and 0 otherwise.
The values of σ , ρ, and θ vary between 0 and 1, and need to be empirically determined (see section Evaluation method). • The last step, sentence selection, consists of extracting the most important sentences for the summary.
Having computed the four different weights for each sentence (its CF-score, its similarity with the title and abstract sections, and its positional score), the final score for a sentence Score(S j ) is calculated according to Equation 5. Finally, the N sentences with highest score are extracted for the summary, where N depends on the desired compression rate.
α, β, γ , and δ can be assigned different weights between 0 and 1, depending on whether we would like to give more importance to one attribute or another.
Their optimal values need to be empirically determined (see section Evaluation method).

Evaluation method
This section presents the evaluation methodology, including the test collection, the summarization parametrization, and the evaluation of the indexing process.

Evaluation data set
We use a collection of 1413 biomedical scientific articles randomly selected from the PMC Open Access Subset [42]. This subset contains more than 436, 000 articles from a range of biomedical journals; they are in XML format, which allows us to easily identify the title, abstract, and the different sections. Moreover, the full texts of the articles in the PMC Open Access Subset are available for research purposes, so that we can run our summarizers and the MTI program over them. When collecting the articles, we made sure that they contain separate title, abstract, and body sections, and that they are assigned MeSH descriptors. It is also worth noting that the average length of the articles' body is 178 sentences. The shortest article is 16 sentences while the longest one is 835 sentences.

Summaries parametrization
We generated automatic summaries using the two summarizers explained in the previous sections, and using different compression rates (i.e., 15%, 30% and 50%). The text in the tables and figures were not taken into account when building the summaries.
For assigning values to the parameters of the summarizers, different combinations that arise from varying each parameter in [0,1] at intervals of 0.1, have been tested using a set of 150 biomedical articles different from those used in the experimentation. The combination of weights that, according to ROUGE metrics [43], produced the best summaries, was finally selected (i.e., α = 0.5, β = 0.1, γ = 0.2, δ = 0.2, σ = 0.2, ρ = 0.7, and θ = 0.1).
ROUGE is a commonly used evaluation method for summarization which uses the proportion of n-grams between a peer and one or more reference summaries to compute a value within [0,1]. Higher values of ROUGE http://www.biomedcentral.com/1471-2105/14/208 are preferred, since they indicate a greater content overlap between the peer and the model. The 1.2 version of ROUGE is used and the ROUGE-2 and ROUGE-SU4 metrics are used for evaluation. ROUGE-2 counts the number of bigrams that are shared by the peer and reference summaries and computes a recall-related measure. Similarly, ROUGE-SU4 measures the overlap of skip-bigrams. As model summaries, we use the articles' abstracts. Even though using more than one single reference summary would report more accurate results, previous experiments have shown that, when the size of the evaluation collection is large enough, using a single reference summaries produces reliable results [44].

Indexing evaluation
The evaluation of the indexing process is carried out by comparing the MeSH headings recommendations by the different indexing methods (i.e., MTI, individual MTI components, and machine learning) on the different types of documents (i.e., full text articles, titles and abstracts, and automatic summaries of different lengths) and the actual indexing of the articles by the MEDLINE indexers for the 1413 articles in the evaluation collection, and using text categorization measures: precision (P), recall (R), and F-measure (F 1 ). See Additional file 1: Evaluation benchmark.

Results and discussion
The following sections present and discuss the results of the experimental evaluation. Even though the evaluation is performed by comparing to previously indexed citations, as presented in the previous section, interannotation agreement between human indexers is not available. Previous work by Funk and Reid [45] have compared the consistency of indexing using doubly annotated MEDLINE citations, showing several MeSH branches with higher consistency, being the Check Tags the most consistent one. In addition to the overall results, we have shown results per MeSH heading branch. Table 1 shows the performance of the MTI indexing on different types of documents (i.e., full text articles, MEDLINE citations (titles and abstracts), and automatic summaries of different lengths). The micro and macro average measures in this table show that in both cases, the summaries perform better than full text. The best F 1 is obtained when the MEDLINE citations are used to discover indexing terms, while the worst F 1 is reported by the full text articles, the difference being more than 12 percentage points in F 1 . MEDLINE citations show the highest precision, while full text has the highest recall. The poor performance of MTI on the full text of the articles is mainly due to a very low precision (0.375 versus 0.596 for MEDLINE citations), while achieving a recall only slightly better than that of the MEDLINE citations. The high recall of the full text is expected since it contains more details than the summaries or MEDLINE citations.

Overall results
Regarding the use of automatic summaries, it is observed that the graph-based method (Gr-sum) produces better F 1 than the concept frequency-based summarizer (CF-sum). Graph-based summaries are more precise. However, recall is higher for the frequency-based summaries. The reason seems to be that, on average, frequency-based summaries are longer than graph-based ones, since the frequency-based summarizer tends to select longest sentences. Among the summaries, the ones at the 15% compression rate present the lowest recall but the highest precision, so achieving a higher F 1 for micro average. On the other hand, F 1 is slightly higher for macro average.
As expected, as the summary length increases, recall improves but precision worsens, and this is true for both types of automatic summaries. The best F 1 is obtained by shorter summaries, and this is due to the fact that, when the summary length grows, the improvement in recall is not enough to compensate for the loss of precision. Increasing the length of the summaries means adding non-central or secondary contents, so that the probability of MTI recommending incorrect MeSH headings is greater.
The automatic summaries produced by the graph-based method using a 15% compression rate attain indexing results close to those of the MEDLINE citations, the difference in F 1 being approximately 3 percentage points. The recall is higher for the automatic summaries than for the MEDLINE citations, but the precision is lower in the former than in the later. However, it must be taken into account that the summaries are generated automatically, and that it is expected that some important content is missing, which affects precision adversely.
We find as well that the difference between micro and macro average is large in terms of precision for full text. This means that there are very frequent terms with low precision but high recall. Table 2 shows the top terms ranked by the number of positive index entries. In both cases, full text shows a large recall compared to MEDLINE citations but with a much lower precision.

MTI components results
MTI components are combined and tuned using MED-LINE, since it is the target source of documents, providing an advantage compared to summaries and full text. This includes as well the set of additional rules added to either comply with indexing policies or address indexers feedback. We have performed several experiments that include using the individual components of MTI: MMI and PRC. MMI implements a dictionary http://www.biomedcentral.com/1471-2105/14/208  Table 3. F 1 results of MMI and PRC are lower compared to MTI results, which is due to the combination of complementary methods performed by MTI and to the ad-hoc filtering rules in the final step of MTI. MMI shows higher recall compared to PRC but both lower precision and recall compared to MTI. PRC shows higher precision compared to the other approaches but with a much lower recall, contributing to the MeSH headings suggested by MMI.
Except for PRC, the other indexing methods show the same behavior, the MEDLINE citations seem to perform better compared to the full text and the summaries. The automatically built summaries have better performance compared to full text.

Term ranking per document results
The indexing algorithms deliver the MeSH terms in decreasing order of relevance. This means that we could evaluate the ranking of the indexing algorithms. Ranking results are available in Table 4 and in an additional file. Average results of the ranking of MeSH terms per document have been obtained using the trec eval evaluation tool. We show the MAP (mean average precision), precision at 0 recall and precision@5. See Additional file 2: Evaluation of MeSH term ranking per document.
MTI and MMI already deliver ranked results. In the case of PRC, the frequency of the MeSH headings for the top 10 retrieved citations is used. Again, except for PRC, results obtained with MEDLINE citations seem to be better than the results obtained with the full text and the summaries. Summaries seem to perform better than full text, except for PRC.

Machine learning results
Summarization has been used as a feature selection algorithm in other categorization tasks, e.g. categorizing web pages [46]. We could consider the automatically built summaries as a method to perform feature selection on the full text articles. In this setup, MED-LINE abstracts are the human produced summaries of the articles.
We have compared the results of these three representations with MTI, MMI, PRC and two machine learning algorithms. We have included learning algorithms like SVM with linear kernel and AdaBoostM1, both from the WEKA package [47]. Precision, recall and F 1 are averaged over 10-fold cross validation. Since the number of available MeSH headings is quite large (over 26k), we have   Table 5 shows the average performance of the learning algorithms. Overall, it seems that, when both SVM and AdaBoost are used, full text performs better compared to summaries and MEDLINE citations.
This performance might be due to the capabilities of the full text to provide disambiguation features that other methods, like MMI, are not using, similar to the increased performance of PRC on full text. In contrast to other works, summaries do not offer better performance compared to full text. On the other hand, further tuning of the set of parameters for the summarization process might improve summary performance [48]. From the learning algorithms, SVM seems to perform better compared to AdaBoost in most of the considered MeSH headings.
Globally, results for SVM and AdaBoost are better than MMI and PRC. This has been already seen in previous work with learning algorithms and very frequent MeSH headings. On the other hand, it has been shown [48] that less frequent MeSH headings have poorer performance compared to other approaches due to the scarcity of training data for those headings.

Results by MeSH branch
MeSH terms are organized in a tree structure. The top nodes of this tree define broad topics within the medical domain. Each branch is identified by a letter, and Table 6 contains the list of top-level branch codes from 2012 MeSH. A MeSH heading can be assigned to more than one branch, so in the analysis its contribution is added to all the branches it belongs to. As an example, Cohort Studies appears under the E (Analytical, Diagnostic and Therapeutic Techniques and Equipment) and N (Health Care) branches. We have used this MeSH structure to group the results by tree branches, according to the MeSH headings in those branches. The idea is that, for instance, the indexing of terms in branch C (Disease) will be different to the indexing of terms in branch G (Phenomena and Processes). See Additional file 4: Average results per MeSH 2012 top level branch code. Comparing both summary types and MeSH branches, we observe, as above, that graph-based summaries achieve higher precision but lower recall compared to the frequency-based summaries. We find that the larger differences between the two types of summaries occur in the B, M, N and Z branches.
In the case of the B (Organisms) and M (Named Groups) branches, terms like Humans, Mice, and Animals are most frequent terms in the results of each method. This result is similar to the one observed in full text articles. These terms belong to a special category denominated Check Tags (CTs) [49]. Recall that CTs are a special class of MeSH headings considered routinely for every article, which cover species, sex and human age groups, historical periods and pregnancy. The indexing for the most common CTs are derived from machine learning methods [33]. Summaries and full text seem to follow a different term distribution as the one expected by the trained methods. The result is a higher recall with lower precision. http://www.biomedcentral.com/1471-2105/14/208 In the case of the N (Health care) branch, terms like Cohort Studies are predicted by forced rules. These rules are encoded into MTI to comply with the indexing policy at the NLM and are supposed to improve the quality of indexing based on indexer feedback. Terms like cohort indexes the citation with the MeSH heading Cohort Studies, which seem to be more frequent in frequency-based summaries.
In the case of the Z (Geographicals) branch, the difference is larger, but becomes more similar as the size of the summary increases. The Z branch presents the highest recall but the lowest precision in the full text. On the other hand, the summaries do not exhibit this behavior. Examples of high recall but low precision in full text are: United States, (1g/l glucose: Gibco Laboratories, Grand Island, NY, USA), PMID "20473639", and Germany, Rapid DNA ligation kit was from Roche (Mannheim, Germany), PMID "19609521". In these cases, the country was mentioned as a reference in the full text. Neither the MEDLINE citation or the summaries contain mentions to them. If we compare the summaries to MEDLINE citations, the trend is higher recall but lower precision. Only the M branch (Named Groups) shows a slight advantage in favor of MEDLINE citations. The M branch contains a limited number of MeSH headings and some of them overlap with the Check Tags for which we have trained learning algorithms.
Comparing the recall of the summaries and the full text we find that, as expected, in most of the cases the full text has a higher recall. However, we have identifiedtwo MeSH branches for which the summaries achieve higher recall compared to full text. The branches are A (Anatomy) and D (Chemicals and Drugs). We find that terms in these branches are identified using the Related Citations which predicts the MeSH heading if there is enough evidence in similar documents. In this case, the summaries seem to be more similar to previously indexed citations.

Conclusions
This paper explores the use of different types of automatic summaries for the task of obtaining MeSH descriptors of biomedical articles. To this end, we compare the results obtained by different indexing algorithms (i.e., MTI, individual MTI components, and different machine learning techniques) when applied on (1) summaries of different lengths generated with two different summarization methods (2) full text articles and (3) MEDLINE citations.
Our results show that automatic summaries produce better indexing than full text articles. Summaries produce similar recall to full text but much better precision, which seems to indicate that automatic summaries can efficiently capture the most important contents within the original articles. Compared to MEDLINE abstracts, they allow for higher recall but lower precision. With respect to the different types of summaries, the best results are obtained by a graph-based method with a compression rate of 15%.
There are several reasons for the lower precision of summaries and full text compared to MEDLINE citations. In many cases, it is the use of specific techniques which were tuned for MEDLINE citations. This tuning provides a higher recall in summaries and full text due to the higher probability of triggering the rules. We have evaluated indexing without the forced rules and machine http://www.biomedcentral.com/1471-2105/14/208 learning algorithms. Without these rules, both the precision and recall dropped. A revision of the forced rules for the summaries and full text might improve the indexing performance.
Furthermore, it must be noted that summarization algorithms are tuned based on ROUGE. Tuning of the summarization algorithms based on MeSH indexing could also provide better performance.
Even with full text, the indexing recall is still low in some cases. We have looked into frequent example terms, and one of the reasons for low recall is that in some cases the terms are not explicitly mentioned in the citations or appear with a different term, e.g., synonym not covered by MeSH or the UMLS. The PRC and machine learning algorithms try to address this problem.
In previous work, machine learning has been evaluated on some of the MeSH headings and MEDLINE with mixed results [33,50]. We have contributed by comparing the performance of machine learning algorithms with different document representations on frequent MeSH headings. In our experiments, full text outperforms both summaries and MEDLINE citations. On the other hand, indexing performance might be dependent on the MeSH heading [48] being indexed. Summarization techniques could thus be considered as a feature selection algorithm [51] that might have to be tuned individually for each MeSH heading.