A structural SVM approach for reference parsing

Background: Automated extraction of bibliographic data, such as article titles, author names, abstracts, and references, is essential to the affordable creation of large citation databases. References, typically appearing at the end of journal articles, can also provide valuable information for extracting other bibliographic data. Therefore, parsing individual references to extract author, title, journal, year, etc. is sometimes a necessary preprocessing step in building citation-indexing systems. The regular structure in references enables us to consider reference parsing a sequence learning problem and to study structural Support Vector Machine (structural SVM), a recently developed structured learning algorithm, for parsing references.
Results: In this study, we implemented structural SVM and used two types of contextual features to compare structural SVM with conventional SVM. Both methods achieve above 98% token classification accuracy and above 95% overall chunk-level accuracy for reference parsing. We also compared SVM and structural SVM to Conditional Random Field (CRF). The experimental results show that structural SVM and CRF achieve similar accuracies at the token and chunk levels.
Conclusions: When only basic observation features are used for each token, structural SVM achieves higher performance than SVM since it utilizes the contextual label features. However, when the contextual observation features from neighboring tokens are combined, SVM performance improves greatly, and is close to that of structural SVM after adding the second order contextual observation features. The comparison of these two methods with CRF using the same set of binary features shows that both structural SVM and CRF perform better than SVM, indicating their stronger sequence learning ability in reference parsing.


Background
Bibliographic references, typically cited at the end of scientific articles, provide much valuable information. Parsing these references is an essential step for building citation-indexing systems. Many well-known citation-indexing systems, such as CiteSeer [1], ISI Web of Knowledge [2] and Google Scholar [3], have presumably implemented complex reference parsing algorithms, though detailed reports about their algorithms and performance have not been found in the literature. As the authors of CiteSeer mention in [4], the reliable parsing of references may still be considered an open problem. MEDLINE®, the flagship database of the U.S. National Library of Medicine, contains over 18 million citations to the medical journal literature and is a critical source of information for biomedical research and clinical medicine. With the rapid increase of journal literature indexed by MEDLINE every year, it is essential to have automated methods to extract bibliographic data, including article titles, author names, affiliations, abstracts, and many others.
While references are not included in MEDLINE citations, they are indispensable for detecting several other items. For example, creating the Comment-On/Comment-In field for MEDLINE (identifying pairs of articles, with one article commenting on the other) requires matching references to the citing text [5]. In addition, assigning Medical Subject Heading (MeSH) terms [6], an essential step in indexing the article, may also benefit from analyzing the MeSH terms assigned to the cited articles, which requires parsing the references to those articles. Reliable reference parsing is therefore an important step for automatically creating citations for MEDLINE.
In this work, our goal is to extract the following 7 entities from the references: Citation Number (<N>), Author Names (<A>), Article Title (<T>), Journal Title (<J>), Volume (<V>), Pagination (<P>) and Publication Year (<Y>). All remaining words in the reference are labeled as Other (<O>). The notation inside each parenthesis is the abbreviated entity label.
In the large number of journals (over 5,200) indexed for MEDLINE, references are formatted in a large variety of ways, some of which are shown in Table 1. In each example, the original reference is followed by the ground-truth labeling. Most of the references cite "normal" journal articles, but a small number cite books, e.g., (f), and international standards, e.g., (g). Some references omit Citation Numbers, e.g., (c), and among those that do have them, the Citation Number may be either a single number or an author-year chunk, e.g., (a) and (b). There is also some variation in the way Author Names are expressed: initials followed by last names, e.g., (a); last name followed by initials, e.g., (d); not all authors listed, e.g., (e); the first author and the remaining authors in different formats, e.g., (c); and occasionally an anonymous author, e.g., (g). Most Journal Titles are significantly abbreviated, and most Paginations consist only of digits, but (d) is an example where Pagination contains non-digit characters. There are also many variations in the use of commas, spaces, semicolons or periods to separate different entities, and in character capitalization. This wide variability makes reliable reference parsing a challenging task.
Early research in reference parsing involved rule-based methods, which usually depend on knowledge that is manually crafted and based on a domain expert's observation. This domain knowledge is organized as templates or hierarchical frameworks, which summarize the recognizable patterns formed by the data or the surrounding text, and the rules associated with those recognizable patterns. After the knowledge representation is built, various algorithms can be used to match the text to the knowledge representation, and to extract data according to the rules. These matching algorithms include template mining [7,8], INFOMAP [9,10] and BLAST (Basic Local Alignment Search Tool), a tool originally designed for gene sequence alignment [11].
Rule-based methods can be very successful when the references are from a small or moderate number of journals. This is because journal publishers usually require authors to strictly follow predefined citation styles, and conduct careful editorial checking and correction before publishing. However, when a large number of journals are involved, it can be very challenging to build a sound knowledge representation due to the large variety of, and sometimes conflicting, citation styles. Rule-based methods also require domain experts to design the rules and maintain them over time, and therefore lack adaptability and are difficult to tune.
Table 1. Examples of references following different styles in medical journal articles.
Machine learning approaches have recently attracted increased attention because they automatically learn the knowledge from training samples and therefore exhibit good adaptability. For example, Parmentier and Belaïd have developed a concept network to hierarchically represent and recognize structured data from bibliographic citations [12]. Besagni et al. took a bottom-up approach based on Part-of-Speech (PoS) tagging [13]. Basic tags, which are easily recognized, are first grouped into homogeneous classes. Confusing tokens are then classified by either a set of PoS correction rules or a structure model generated from well-detected tokens.
Reference parsing is essentially a sequence processing task, and therefore statistical sequence models that have been successful in information retrieval, e.g., Hidden Markov Model (HMM) and Conditional Random Field (CRF), have also been studied for parsing references. For example, Takasu applied HMM for metadata extraction from erroneous references [14]. Another frequently adopted machine learning method for information extraction is the Support Vector Machine (SVM) classifier. Okada et al. combined SVM and HMM for bibliographic component extraction [15]. In our previous research, we developed an SVM-based method and compared it with one based on CRF [16].
Since collecting ground-truth training samples can be labor-intensive, unsupervised approaches have also been proposed. For example, Cortez et al. proposed an unsupervised approach, called FLUX-CiM, which is based on a frequency-tuned lexicon and includes four stages: blocking, matching, binding and joining [17].
There are also a few reference parsing libraries available online. These include ParsCit [18] and FreeCite [19].
As pointed out in a recent article, despite over a decade of research, reference parsing is still an unsolved task for several reasons, including data-entry errors, the wide variability of citation formats, lack of (or enforcement of) standards, large-scale citation data, and so on [4].
In this paper, we describe an extension of our previous work on reference parsing, reported in [16]. We adopted the recently proposed structural SVM method and compared it to conventional SVM. Our experiments on 1800 ground-truth labeled references show that the structural SVM method achieves over 98% token-level accuracy and over 95% chunk-level accuracy. In addition, we compared SVM and structural SVM to Conditional Random Field (CRF), another state-of-the-art sequence learning method. We observe that structural SVM and CRF achieve about the same accuracies at the token and chunk levels. Both methods show the advantage of stronger sequence learning ability over SVM.

Mathematical description of structural SVM
Structural Support Vector Machine (structural SVM), introduced by Tsochantaridis et al., is a supervised learning method designed for predicting complex structured outputs, such as sequences, trees and graphs [20]. Given a training sample of input-output pairs $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$ drawn from an unknown distribution, structural SVM addresses the general problem of learning a mapping $f: \mathcal{X} \to \mathcal{Y}$ from input patterns $x \in \mathcal{X}$ to discrete outputs $y \in \mathcal{Y}$ that has low prediction error. The idea is to learn a discriminant function $F: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ from which we can derive a prediction by maximizing $F$ over $\mathcal{Y}$ given a specific input $x$, i.e., $f(x; w) = \arg\max_{y \in \mathcal{Y}} F(x, y; w)$. Here $F(x, y; w) = w^T \Psi(x, y)$ is a linear combination of some joint feature representations of inputs and outputs, where $w$ is a parameter vector and $\Psi$ is a feature vector relating $x$ and $y$. The flexibility in designing $\Psi$ allows structural SVM to model problems as diverse as natural language parsing, multiclass classification, sequence learning, etc.
Training the parameter vector w in structural SVM generalizes the maximum-margin principle in traditional SVM, leading to a quadratic optimization problem similar to multi-class SVM [20,21].
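For orientation, the n-slack formulation with margin rescaling, as presented in [20,21], can be written as follows. The loss term $\Delta(y_i, y)$ and the $1/n$ scaling of $C$ follow that standard presentation and are assumptions about the exact variant used in this work.

```latex
\begin{aligned}
\min_{w,\ \xi \ge 0}\quad & \frac{1}{2}\lVert w \rVert^{2} + \frac{C}{n}\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad & \forall i,\ \forall y \in \mathcal{Y}\setminus\{y_i\}:\quad
w^{T}\Psi(x_i, y_i) - w^{T}\Psi(x_i, y) \;\ge\; \Delta(y_i, y) - \xi_i
\end{aligned}
```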
The constraints are built upon the condition that, given a training sample $(x_i, y_i)$, the value of $w^T \Psi(x_i, y_i)$ for the correct prediction $y_i$ should be greater than those for all other incorrect predictions $y$. Each training sample is associated with $|\mathcal{Y}| - 1$ constraints, which share the same slack variable $\xi_i$. The introduction of $\xi_i$ allows structural SVM to learn a large soft margin with small misclassification errors, which makes structural SVM applicable to classification problems where the classes are not strictly separable even in a high-dimensional feature space. The objective function is penalized by adding non-zero slack variables $\xi_i$, each of which measures the degree of misclassification of a sample $x_i$. Therefore, the optimization becomes a trade-off between a large margin and a small error penalty. $\sum_i \xi_i$ gives an upper bound for the empirical risk on the training set, and the constant $C$ is a regularization parameter that controls the trade-off between training error minimization and margin maximization. Training structural SVM is computationally expensive due to the large number of margin constraints. Using an equivalent 1-slack reformulation of the n-slack structural SVM, Joachims et al. proposed a "1-slack cutting-plane" method which significantly reduces the computation time, thereby making training on large databases feasible [21]. Both SVM and structural SVM are discriminative models. They learn linear separating hyperplanes with maximum margin between classes. Structural SVM conducts global optimization on the whole structure, while SVM optimizes locally on individual tokens. Structural SVM is more general than SVM in its capability of learning interdependent and structured outputs. It has shown promising results for building highly complex, but still accurate, discriminative models in the areas of classification with taxonomies, protein sequence alignment, and natural language context-free grammar parsing.
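The 1-slack reformulation mentioned above can be sketched in the margin-rescaling form given in [21]. The notation $\bar{y}_i$ for a candidate labeling of sample $i$ and the loss $\Delta$ are taken from that presentation and are assumptions about the exact variant used in these experiments. A single slack variable $\xi$ is shared by all constraints, and each constraint combines one candidate labeling per training sample.

```latex
\begin{aligned}
\min_{w,\ \xi \ge 0}\quad & \frac{1}{2}\lVert w \rVert^{2} + C\,\xi \\
\text{s.t.}\quad & \forall (\bar{y}_1, \ldots, \bar{y}_n) \in \mathcal{Y}^{n}:\quad
\frac{1}{n}\, w^{T}\sum_{i=1}^{n}\bigl[\Psi(x_i, y_i) - \Psi(x_i, \bar{y}_i)\bigr]
\;\ge\; \frac{1}{n}\sum_{i=1}^{n}\Delta(y_i, \bar{y}_i) - \xi
\end{aligned}
```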

Feature extraction
A reference is first preprocessed and segmented into individual word tokens based on spaces and punctuation marks such as commas, periods, semicolons, brackets, etc. We then extract 14 binary features and one normalized position feature from each token. They are briefly explained in Table 2. The first three are dictionary features, which are collected by looking up a candidate word in Author Name, Article Title, and Journal Title dictionaries. We built these dictionaries from 10 years of MEDLINE data; they contain about 236,748 Author Name words, 108,484 Article Title words, and 6,909 Journal Title words. The remaining 12 features provide further important information to help identify different entities.
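The token-level feature extraction described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the tiny dictionaries, the separator set used by `tokenize`, and the particular non-dictionary features are hypothetical stand-ins, since the actual 15 features are those listed in Table 2.

```python
import re

# Hypothetical miniature dictionaries; the real ones were built from MEDLINE data.
AUTHOR_WORDS = {"mccarthy", "smith"}
TITLE_WORDS = {"parsing", "analysis"}
JOURNAL_WORDS = {"med", "j", "nature"}


def tokenize(reference):
    """Split a reference on whitespace and a few separators (an assumed set)."""
    parts = re.split(r"[\s,;:()\[\]]+", reference)
    return [p for p in parts if p]


def token_features(tokens, i):
    """Return an illustrative feature vector for token i.

    Layout (assumed): three dictionary lookups, three surface-form flags,
    and the normalized position of the token within the reference.
    """
    tok = tokens[i]
    core = tok.strip(".").lower()
    return [
        float(core in AUTHOR_WORDS),                              # author dictionary
        float(core in TITLE_WORDS),                               # title dictionary
        float(core in JOURNAL_WORDS),                             # journal dictionary
        float(core.isdigit()),                                    # digits only
        float(re.fullmatch(r"(19|20)\d{2}", core) is not None),   # looks like a year
        float(tok[:1].isupper()),                                 # starts with a capital
        i / max(len(tokens) - 1, 1),                              # normalized position
    ]


tokens = tokenize("12. T.J. McCarthy et al., J. Med. 2006;12:34-56.")
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Stripping the trailing period before the dictionary lookup is a design choice of this sketch, so that abbreviated journal words such as "Med." still match.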
Features from neighboring tokens are very informative as they exploit the contextual dependencies between tokens. There are two kinds of contextual features: the observation features extracted from the neighboring tokens and the labels assigned to those tokens. We call the first "contextual observation features" and the second "contextual label features". Since structural SVM is implemented as a sequence learning algorithm in reference parsing, the joint feature representation function $\Psi(x, y)$ includes two kinds of features: state transition features and observation features extracted from individual tokens within a sequence. State transition features utilize contextual label information to model the dependencies between adjacent labels. Because these feature representations are similar to those of Hidden Markov Models, the structural SVM designed specifically for sequence labeling is sometimes called SVM^HMM. In addition to contextual label features, we also combine contextual observation features from neighboring tokens for sequence classification.
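To make the two parts of the joint feature representation concrete, the following pure-Python sketch builds $\Psi(x, y)$ for a first-order chain and finds the highest-scoring label sequence with Viterbi decoding. It is not the SVM^HMM implementation; the function names, the integer label encoding, and the layout of the weight vector (observation weights first, then transition weights) are assumptions made for illustration.

```python
def joint_feature_map(X, y, n_labels):
    """Psi(x, y): per-label sums of observation features, then label-bigram counts."""
    n_feats = len(X[0])
    emit = [[0.0] * n_feats for _ in range(n_labels)]
    trans = [[0.0] * n_labels for _ in range(n_labels)]
    for t, label in enumerate(y):
        for k, value in enumerate(X[t]):
            emit[label][k] += value
        if t > 0:
            trans[y[t - 1]][label] += 1.0
    return [v for row in emit for v in row] + [v for row in trans for v in row]


def score(w, X, y, n_labels):
    """Discriminant value w^T Psi(x, y)."""
    return sum(wi * fi for wi, fi in zip(w, joint_feature_map(X, y, n_labels)))


def viterbi(w, X, n_labels):
    """Return the label sequence maximizing w^T Psi(x, y) over all sequences."""
    n_feats = len(X[0])

    def w_trans(prev, cur):
        return w[n_labels * n_feats + prev * n_labels + cur]

    def local(t, label):
        return sum(w[label * n_feats + k] * v for k, v in enumerate(X[t]))

    best = [local(0, c) for c in range(n_labels)]
    back = []
    for t in range(1, len(X)):
        pointers, new_best = [], []
        for cur in range(n_labels):
            prev = max(range(n_labels), key=lambda p: best[p] + w_trans(p, cur))
            pointers.append(prev)
            new_best.append(best[prev] + w_trans(prev, cur) + local(t, cur))
        back.append(pointers)
        best = new_best
    path = [max(range(n_labels), key=lambda c: best[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example: 3 tokens, 2 observation features, 2 labels.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_no_transitions = [1.0, 0.0, 0.0, 1.0] + [0.0, 0.0, 0.0, 0.0]
w_sticky_labels = [1.0, 0.0, 0.0, 1.0] + [0.0, -5.0, -5.0, 0.0]
path_a = viterbi(w_no_transitions, X, 2)
path_b = viterbi(w_sticky_labels, X, 2)
```

In the toy example, the second weight vector penalizes label changes, so the decoder prefers a constant label sequence even though the middle token's observation features favor the other label. This is the effect that the state transition features contribute.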

Results and discussion
We randomly selected 600 references for training and 1800 references for testing from 1000 HTML articles collected from the top 100 journals cited in the MEDLINE 2006 database. We manually labeled these 2400 references. There are 18003 words in the training references and 53622 words in the testing references. The basic unit in reference parsing is a single word, also called a token. The algorithm performance is evaluated at two levels. One is the token level, i.e., the accuracy of labeling individual tokens. The other is the chunk level, i.e., the percentage of the entity chunks correctly identified, where an entity chunk is the set of consecutive words having the same entity label. For example, in Table 1 (e), the Citation Number chunk is a single word "12" and the Author chunk is "T.J. McCarthy et al.", consisting of four words. The total number of words and chunks for each of the 8 entities in the testing references are shown in Table 3. The number of words for Citation Number (742) is larger than the number of chunks (627) because of author-year style Citation Numbers, which have more than one word.
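The two evaluation levels can be sketched as follows. This is a minimal illustration rather than the evaluation script used in the experiments, and it assumes that a ground-truth chunk counts as correctly identified only when the prediction contains a chunk with exactly the same boundaries and label.

```python
from itertools import groupby


def chunks(labels):
    """Return (start, end, label) for each run of identical consecutive labels."""
    spans, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        spans.append((start, start + length, label))
        start += length
    return spans


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted label equals the ground truth."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def chunk_accuracy(gold, pred):
    """Fraction of ground-truth chunks reproduced exactly (assumed criterion)."""
    gold_chunks = chunks(gold)
    pred_chunks = set(chunks(pred))
    return sum(c in pred_chunks for c in gold_chunks) / len(gold_chunks)


gold = ["N", "A", "A", "A", "A", "T", "T", "J", "V", "P"]
pred = ["N", "A", "A", "A", "T", "T", "T", "J", "V", "P"]
tok_acc = token_accuracy(gold, pred)    # 9 of 10 tokens are correct
chk_acc = chunk_accuracy(gold, pred)    # 4 of 6 chunks are correct
```

In this example a single mislabeled token at the Author/Title boundary invalidates two chunks, which illustrates why chunk-level accuracy is the stricter of the two measures.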

Evaluation of structural SVM
For our experiments, we use the SVM^HMM library, an implementation of structural SVM for sequence labeling [22], with a linear kernel, since other kernels, e.g., the radial basis function (RBF), can be extremely computation intensive. To compare this with SVM, we use LibSVM [23], a library developed at National Taiwan University, for word classification. Here a linear kernel function is also adopted to facilitate a fair comparison. All the meta-parameters in both SVM and structural SVM are determined with cross-validation on training samples. We extract 15 observation features, including 14 binary features and one normalized position feature, from each token. For both SVM and structural SVM, we use 3 sets of features: observation features from the token itself (15 features), observation features from the token and its two neighbors (45 features), and observation features from the token and its four neighbors (75 features). We call the observation features extracted from the neighboring tokens contextual observation features. Specifically, observation features from the immediate left and right neighbors are named the first order contextual observation features; observation features from the left two and right two neighbors are referred to as the second order contextual observation features, and so on. In structural SVM, contextual labels from neighboring tokens are also utilized to explore the dependencies between adjacent tokens. Tables 4, 5, and 6 show the overall token classification accuracies and chunk-level accuracies obtained by SVM and structural SVM for the extraction of 8 entities from the references.
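The three feature sets differ only in how many neighboring tokens contribute their observation features. A minimal sketch of this windowing is given below; the function name `add_context`, the left-to-right ordering of the concatenated blocks, and the zero padding at the start and end of a reference are assumptions, as the handling of boundary tokens is not specified here.

```python
def add_context(features, order):
    """Concatenate each token's features with those of `order` neighbors per side.

    features: list of equal-length per-token feature lists for one reference.
    order: 0 for the token itself, 1 for first order context, 2 for second order.
    Neighbors that fall outside the reference contribute zero vectors (assumed).
    """
    n_feats = len(features[0])
    pad = [0.0] * n_feats
    windowed = []
    for i in range(len(features)):
        row = []
        for offset in range(-order, order + 1):
            j = i + offset
            row.extend(features[j] if 0 <= j < len(features) else pad)
        windowed.append(row)
    return windowed


# Four tokens with 15 placeholder features each.
base = [[float(i + 1)] * 15 for i in range(4)]
sizes = [len(add_context(base, order)[0]) for order in (0, 1, 2)]
```

With 15 features per token, orders 0, 1 and 2 yield vectors of length 15, 45 and 75, matching the three feature sets described above.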
We first use only the 15 observation features from the token itself. Since SVM does not use contextual features, it provides a baseline performance by analyzing only the token itself. As expected, the performance is relatively low: the token classification accuracy is 93.03% and the overall chunk accuracy is only 79.12%. Although structural SVM does not use contextual observation features, it does use the contextual label features. The overall accuracies at token-level and chunk-level are 98.41% and 95.35%, respectively, which are much better than those of the SVM method. This clearly indicates the value of contextual label features in structural SVM.
We then add the observation features from the immediate left and right neighbors (the first order contextual observation features). The corresponding token classification accuracy and overall chunk-level accuracy of SVM increase significantly, to 98.20% and 94.27%. This indicates that the first order contextual observation features are very important for SVM classification. After adding observation features from the next neighbor on each side (the second order contextual observation features), the corresponding token-level and chunk-level accuracies increase to 98.65% and 95.59%. This indicates that the second order contextual observation features are still helpful, but less so than the first order ones. For the structural SVM method, when the first order contextual observation features are added, the overall accuracies at token-level and chunk-level increase to 98.91% and 96.81%, respectively. This improvement is smaller than the one observed for the SVM method, which may imply that the contextual observation features and contextual label features share redundant discriminative information. After including the second order contextual observation features, there is virtually no performance gain for the structural SVM method, even though it uses extra contextual label features.

Comparison with Conditional Random Field (CRF)
We also compared our methods to CRF, another state-of-the-art sequence learning algorithm [24]. Because only binary features can be used in CRF models, we removed the normalized position feature from the feature vector used in the previous evaluation. We then repeated some experiments using the same set of binary features in the SVM, structural SVM and CRF methods; see Tables 7 and 8. Compared to the numbers in Tables 4, 5 and 6, the accuracies for both SVM and structural SVM drop a little due to the absence of the normalized position feature. Structural SVM achieved 98.99% token classification accuracy, higher than that of SVM (97.84%) and CRF (98.91%). However, CRF obtained 96.93% overall chunk-level accuracy, higher than that of structural SVM. Since both structural SVM and CRF are sequence learning methods, we do observe that they achieve higher overall token-level and chunk-level accuracies than SVM in reference parsing.
Table 3. The total number of words and chunks for each of the 8 entities in references for evaluation.
The accuracies in the CRF experiments are a little different from those reported in [16]. That is because in [16], a large number of additional word features were extracted from each token and used in the classification. Adding those word features significantly increases the feature dimensionality, which causes difficulties in training SVM and structural SVM. On the other hand, adding those thousands of word features in CRF improves accuracy only slightly, indicating that word features contribute little. We therefore use only the first 14 binary features described in Table 2 for a fair comparison.

Conclusions
We have compared SVM and structural SVM as methods for parsing references that appear in medical journal articles. One important difference between the two methods is that SVM uses only the contextual observation features, while structural SVM uses these as well as contextual label features. Although SVM performance improves greatly and is close to that of structural SVM when the second order contextual observation features are used, structural SVM achieves higher overall token-level and chunk-level accuracies than the SVM method. Both methods achieve above 98% token classification accuracy and an overall chunk-level accuracy of over 95%. We also find that structural SVM and CRF achieve similar performance, and that both perform better than SVM, showing the advantage of their stronger sequence learning ability.
Reference parsing is considered a sequence learning problem due to the strong regular internal structure in each reference. Additionally, we note that references cited in any one article generally follow the same style. Further exploiting this consistency in consecutive references to improve the performance of reference parsing will be the subject of future work.