A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.

(NP (NP (ADJP-1 enhanced) CYP2C9 production) and (NP (ADJP-1 *P*) (NML 11,12 EET) production))

It becomes more complicated when there are multiple shared premodifiers, where a placeholder *P* has to be postulated for each premodifier:

(NP (NP (ADJP-1 cultured) (NML-2 rat) pancreatic acinar cells) and (NP (ADJP-1 *P*) (NML-2 *P*) hepatocytes))

When both the head and the premodifier are coordinated structures, creating a node for each entity is simply impossible. For example, in "the N- and K-ras cells and tumors", there are four entities: "N-ras cells", "K-ras cells", "N-ras tumors" and "K-ras tumors". The best effort from the Penn BioIE project yields this structure:

(NP the (NML (NML (NML-1 (NML N-(NML-2 *P*)) and (NML K-(NML-2 ras))) cells) and (NML (NML-1 *P*) tumors)))

Instead of explicitly representing all four entities, this structure explicitly represents only "N-ras and K-ras cells" and "N-ras and K-ras tumors". To derive all four entities from the parse tree, one has to resort to an inference mechanism like the rule we have outlined above. If this inference mechanism has to be applied anyway, it makes sense to apply it consistently in all cases and to simplify the representation of coordinated nominal structures. Therefore we will represent the above three examples as follows:

(NP enhanced (NML (NML CYP2C9 production) and (NML (NML 11,12 EET) production)))
(NP cultured rat (NML (NML pancreatic acinar cells) and (NML hepatocytes)))
(NP the (NML N- and K-ras) (NML cells and tumors))
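The inference step mentioned above can be illustrated with a minimal Python sketch. The rule itself is not reproduced in this section, so the sketch assumes the simplest reading: every shared premodifier is distributed over every coordinated head, which amounts to a cross-product. The function name `expand_entities` and its input format are hypothetical and are not part of the CRAFT guidelines or tooling.

```python
from itertools import product


def expand_entities(premodifier_slots, heads):
    """Distribute shared premodifiers over coordinated heads.

    premodifier_slots: one list per premodifier position; a list holds
        several alternatives when that premodifier is itself coordinated.
    heads: the coordinated head strings.
    Assumption: entities are the full cross-product of the alternatives.
    """
    return [" ".join(combo) for combo in product(*premodifier_slots, heads)]


# "the N- and K-ras cells and tumors": coordinated premodifier and head.
print(expand_entities([["N-ras", "K-ras"]], ["cells", "tumors"]))

# "cultured rat pancreatic acinar cells and hepatocytes": two shared premodifiers.
print(expand_entities([["cultured"], ["rat"]],
                      ["pancreatic acinar cells", "hepatocytes"]))
```

Under this assumption, the first call yields the four entities listed in the text, and the second yields "cultured rat pancreatic acinar cells" and "cultured rat hepatocytes".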

A note about NML
NML is a label that is added in the Penn BioIE addendum to represent sub-NP structures. We will adopt this label in our annotation. Although there are rules about where NML should be used, the addendum does not explicitly state the distinction between NML and NP. This section attempts to clarify that distinction.
PTB allows flat NP structures when there is a strictly right-branching structure in which each daughter of the NP forms a constituent with everything to its right. For example,

(NP primary liver cancer)

is an implicit representation of

(NP primary (NODE liver (NODE cancer)))

The purpose of the flat structure is to improve readability when the structure is completely predictable. Any non-right-branching structure has to be explicitly represented with nodes labeled NML.

In CRAFT, as in PTB2, NP adjunction is represented by NP rather than NML when there are no nominal sub-constituents:
(NP (NP loss) (PP of (NP hybridization)))
(NP (NP the patient) (VP seen (NP *) (NP-TMP yesterday)))

When this larger NP is a premodifier, it is labeled NML:

(NP the (NML (NML guanine) (PP to (NP cytosine))) transformation)

It is important to note that NML is not a "full" NP, but rather a piece of an NP. For instance, NML can never contain a determiner. This distinction is tricky because, particularly with proper-name NMLs, material marked NML could also stand on its own as an NP; in the particular context in which we find it, however, it is a constituent that is only part of an NP. For example, in the citation "(Anthony Nicholson, the Jackson Laboratory, personal communication)," "the Jackson Laboratory" is an NP on its own:

(NP the Jackson Laboratory)

However, as a modifier, as in "The Jackson Laboratory foundation stock", it is annotated as an NML:

(NP the (NML Jackson Laboratory) foundation stock)
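The implicit right-branching reading of a flat NP, described earlier in this note, can be made explicit with a short Python sketch. It follows the "primary liver cancer" example and reuses the placeholder label NODE from the text; the function name `right_branching` is hypothetical, and the sketch only illustrates the convention rather than any CRAFT tool.

```python
def right_branching(tokens, label="NODE"):
    """Return the implicit right-branching bracketing of a flat NP.

    Each token forms a constituent with everything to its right, so the
    structure is built from the rightmost token leftward.
    """
    if not tokens:
        raise ValueError("an NP needs at least one token")
    if len(tokens) == 1:
        return f"(NP {tokens[0]})"
    inner = f"({label} {tokens[-1]})"
    # Wrap the remaining non-initial tokens around the growing constituent.
    for token in reversed(tokens[1:-1]):
        inner = f"({label} {token} {inner})"
    return f"(NP {tokens[0]} {inner})"


print(right_branching(["primary", "liver", "cancer"]))
```

Because the result is fully determined by the token order, nothing is lost by leaving such NPs flat, which is the readability argument made above.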

Adding other coordinated structures.
A separate yet related issue in deciding how to handle shared modifiers or heads is the representation of coordination structures, because coordination is the main mechanism for sharing modifiers or heads. In the PTB II guidelines as well as the Penn BioIE addendum, certain coordination structures are left flat for legibility. For example, if the coordinated elements are single tokens that share some premodifiers, then the structure is left flat:

(NP combined washings and brushings)
(NP the dogs and cats)
(NP 11 dogs and cats)

In CRAFT, we explicitly show the scope of "the" (which modifies both "dogs" and "cats") by putting an NML node around "dogs and cats":

(NP the (NML dogs and cats))

In this way, we more closely align the annotation of these single-token coordinated heads with existing PTB2a policy regarding the use of NML in multi-token coordinated phrases with shared premodifiers:

(NP the (NML (NML grey cats) and (NML brown dogs)))
(NP the (NML (NML pupil) and (NML optic nerve)))

Leaving such coordination structures flat means fewer keystrokes or mouse clicks for annotators, but it often poses problems for users who are accustomed to using the Treebank data as is, without having to read the bulky guidelines to learn about the linguistic nuances. For example, when converting phrase structure to dependency structure, one would have to recognize the coordinate structure to identify the head.
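As an illustration of the change for single-token coordinated heads, the following Python sketch rewrites a flat NP into the form with an explicit NML node. It assumes the input is a list of (part-of-speech tag, word) pairs and handles only the pattern discussed here: shared premodifiers, then head, conjunction, head. The function name `wrap_coordinated_heads` and the input format are hypothetical, and leaving an NP without premodifiers flat is an assumption of the sketch, not a stated CRAFT policy.

```python
def wrap_coordinated_heads(tagged_tokens):
    """Wrap single-token coordinated heads of a flat NP in an NML node."""
    tags = [tag for tag, _ in tagged_tokens]
    words = [word for _, word in tagged_tokens]
    if "CC" not in tags:
        # No coordination: keep the flat NP unchanged.
        return "(NP " + " ".join(words) + ")"
    cc = tags.index("CC")
    # Supported pattern only: ... head CC head, with the CC second to last.
    if cc == 0 or cc != len(words) - 2:
        raise ValueError("unsupported coordination pattern")
    premodifiers = words[:cc - 1]
    coordination = " ".join(words[cc - 1:])
    if not premodifiers:
        # Assumption: nothing is shared, so no NML node is added.
        return "(NP " + coordination + ")"
    return "(NP " + " ".join(premodifiers) + " (NML " + coordination + "))"


flat = [("DT", "the"), ("NNS", "dogs"), ("CC", "and"), ("NNS", "cats")]
print(wrap_coordinated_heads(flat))
```

With the NML node in place, a converter to dependency structure can locate the coordination by label instead of having to infer it from a flat sequence of daughters.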
This has implications for how to represent coordination structures for other categories as well, if we want to be consistent across all categories. The same argument from a user's perspective applies to coordinated VPs. In the PTB, shared adjuncts of coordinated VPs are left at the conjunction level:

(S (NP-SBJ-1 the company) (VP expects (S (NP-SBJ-1 PRO*) (VP to (VP (VP obtain (NP regulatory approval)) and (VP complete (NP transaction)) (PP by (NP year-end)))))))

Notice that the PP "by year-end" is shared by both VPs, "obtain regulatory approval" and "complete transaction", but is attached at the same level as those two VPs to form a flat structure. Like the flat coordinated nominal structures, this can be a problem when the user tries to convert the tree into a dependency structure. Our proposal is to add a layer of VP so that the PP modifier and the coordinated VP are at different levels of attachment. This makes it easier for the user to detect coordination structures and determine what the head is for such structures. In the revised tree, the added VP is the node that immediately dominates the two coordinated VPs and the conjunction:
(S (NP-SBJ-1 the company) (VP expects (S (NP-SBJ-1 PRO*) (VP to (VP (VP (VP obtain (NP regulatory approval)) and (VP complete (NP transaction))) (PP by (NP year-end)))))))

Shared modifiers can also occur at the S level. When two clauses share a modifier, the modifier is attached at the coordination level, which poses similar problems for the conversion from phrase structure to dependency structure. To be consistent with our treatment of coordinated NPs and VPs, we adapt the PTB treatment here as well:
(S (SBAR-ADV Although X) (S either (S Y) or (S Z)))

In some cases, the shared modifier is attached at the level of the first clause in the PTB even if it has scope over both coordinated clauses. [This is no longer true for current PTB projects: modifiers shared across sentences are now left loose at the coordination level. The tree below has been corrected accordingly, which makes this example exactly parallel to the "Although X" example above.]

(S (PP-TMP After (NP (NP puncture) (PP of (NP (NP coagulated blood) (PP from (NP the corpora cavernosa)))))) (S (NP-SBJ-1 urine retention) (VP developed)) and (S (NP-SBJ-1 a suprapubic catheter) (VP had (S (NP-SBJ-1 *) (VP to (VP be (VP introduced (NP-1 *) (ADVP temporarily) (PP-PRP for (NP urine drainage)))))))))

Our proposal is to adjoin the temporal PP to the coordinated S structure as follows:

(S (PP-TMP After (NP (NP puncture) (PP of (NP (NP coagulated blood) (PP from (NP the corpora cavernosa)))))) (S (S (NP-SBJ-1 urine retention) (VP developed)) and (S (NP-SBJ-1 a suprapubic catheter) (VP had (S (NP-SBJ-1 *) (VP to (VP be (VP introduced (NP-1 *) (ADVP temporarily) (PP-PRP for (NP urine drainage))))))))))

The following is a summary of the changes and modifications in annotation style to explicitly annotate all coordination structures:

i) single-token coordinated heads inside NP with shared premodifier(s), old style:

Section 2: Headings, Titles, and Captions
In Penn Treebank II, sentence fragments are labeled FRAG at the top level. Since the data we are dealing with are journal articles and books, we use more informative labels for things that would have been tagged FRAG under the PTB II guidelines. These labels mark elements that are important in structuring a journal article or a book. CRAFT has created TITLE, HEADING, and CAPTION node labels to denote these sections of journal articles. These new node labels expand upon the -HLN tag used in newswire treebanks. Fragments are still labeled FRAG in CRAFT as they would be in other treebanks, but get this additional node on top.

(PP of (NP SOX1))) (PP in (NP (NP the Development (PP of (NP VS Neurons)))))))

These nodes require the same internal structure as other main-text nodes; however, TITLE, HEADING, and CAPTION nodes have only one daughter. In cases where titles, headings, or captions are not complete sentences, FRAG may be used to make a single constituent of the daughter nodes.

Since citations are pervasive in journal articles and books, we are adding a CIT tag for inline citations. The internal structures for citations are flat:

(CIT Shelton et al., 1983)

CIT applies only to author references that occur inside parentheses. All other, nonparenthetical references are bracketed as normal text.
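The two structural constraints stated in this section (a single daughter under TITLE, HEADING, and CAPTION, and flat CIT nodes) lend themselves to an automatic check. The Python sketch below is a minimal illustration: the parser handles only simple bracketed trees of the kind shown in these guidelines, the names `parse_tree` and `violations` are hypothetical, and the TITLE example is made up for demonstration rather than taken from the corpus.

```python
def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        if pos >= len(tokens) or tokens[pos] in ("(", ")"):
            raise ValueError("missing node label")
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a terminal word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


SINGLE_DAUGHTER = {"TITLE", "HEADING", "CAPTION"}


def violations(tree):
    """List breaches of the single-daughter and flat-CIT constraints."""
    label, children = tree
    found = []
    if label in SINGLE_DAUGHTER and len(children) != 1:
        found.append(f"{label} has {len(children)} daughters, expected 1")
    if label == "CIT" and any(isinstance(c, tuple) for c in children):
        found.append("CIT should be flat")
    for child in children:
        if isinstance(child, tuple):
            found.extend(violations(child))
    return found


print(violations(parse_tree("(CIT Shelton et al., 1983)")))
print(violations(parse_tree("(TITLE (NP Role) (PP of (NP SOX1)))")))
```

In the made-up TITLE example, wrapping the two daughters in a single FRAG node, as the text suggests for titles that are not complete sentences, would remove the reported violation.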