The full text articles in XML format from the PubMed Central Open Access collection were made available to participating systems at http://www.biocreative.org/resources/corpora/biocreative-iii-corpus/
System assessment method
A total of ten UAG members (including the chair) participated in the system assessment. The systems were tested against the same set of articles (five articles in total). One of these articles was common to all members and was used for training, so that they could familiarize themselves with their assigned system. For this purpose, an article previously curated by all group members was selected (PMC2613882, the subject of Table 2). Each system was primarily assessed by two members, with each member curating a different set of two articles that were novel to them. The exception to this procedure was MyMiner, which was inspected separately because it was not originally designed to meet the specifications of the IAT task. All systems were assessed remotely. The UAG members curated the articles using the assigned system: they obtained the raw output from the system, went over the gene list it provided, added any missing genes, corrected mis-assigned organisms, and identified central genes. Once the initial assisted-curation task was complete, curators were permitted to use and comment on other systems. Note that testing had some limitations, including the assignment of only two curators per system and the small number of articles processed, owing to time constraints (only 2 weeks) and to the number of UAG members who took part in the testing (not all were available). UAG members recorded the time spent curating with the assigned system. These times could not be reliably compared in all cases, because some UAG members timed the validation of central genes only, while others timed the validation of all genes. However, in one case we can provide some preliminary information based on a comparison with the time spent on manual, unassisted curation (see case 1 in the Results section).
For performance assessment the precision and recall for the gene normalization task were calculated as follows:
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
Where,
TP: true positives, i.e. number of genes correctly identified and linked to the correct database object.
FP: false positives, i.e. number of gene mentions that are incorrectly identified, including cases of gene mentions with incorrect database link (mis-assignment of species), and non-gene mentions (mentions that are not genes but are detected as such by the systems and/or curators).
FN: false negatives, i.e. number of missed genes (not detected by systems and/or curators).
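As a minimal sketch of the two formulas above, the following Python function computes precision and recall from raw counts. The function name, the example counts and the convention of returning 0.0 when a denominator is zero are illustrative assumptions; they are not part of the task definition.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) computed from raw TP, FP and FN counts."""
    # Assumption: report 0.0 when nothing was predicted or nothing was expected.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for one article, not taken from the assessment.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f}")
```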
Further information about the IAT task is available at http://www.biocreative.org/tasks/biocreative-iii/iat/.
Systems description
Team 65 - ODIN (Simon Clematide and Fabio Rinaldi)
URL: http://www.ontogene.org/odin/ (Figure 2)
The ODIN system is being developed within the scope of the OntoGene project, as a collaboration between the OntoGene group at the University of Zurich and the NITAS/TMS group (Text Mining Services) of Novartis Pharma AG. The purpose of the system is to allow a human annotator/curator to leverage the results of a text mining system in order to enhance the speed and effectiveness of the annotation process.
Methods: The OntoGene system takes as input a document in plain text or in one of the supported XML-based formats (including PubMed Central) and processes it with a custom NLP pipeline, which includes named entity recognition and relation extraction. Entities which are currently supported include proteins, genes, experimental methods, cell lines, and species. Entities detected in the input document are disambiguated with respect to a reference database (UniProt [18], Entrez Gene [17], NCBI taxonomy [34], PSI-MI ontology). Since ODIN was primarily intended as a document inspector for annotation purposes, the retrieval function was added only experimentally and does not rank its results.
Interface: The annotated documents are handed back to the ODIN interface (as pure XML documents), which allows multiple display modalities, plus various selection and modification options. The curator can view the whole document with in-line annotations highlighted, or can browse the extracted entities and be pointed back to the mentions within the document. All entity annotations are editable. Different entity views are supported, with sorting capabilities according to different criteria (entity type, confidence score, etc.). Selective display of text units (e.g. sentences) containing entities of interest is supported. Rapid disambiguation can be achieved through manual organism selection. Additionally, extensive logging functionalities are provided, which may be integrated in the document itself for document revision purposes. More details on ODIN are available in additional file 1.
Team 68 - GeneView (Philippe E. Thomas and Ulf Leser)
URL: http://bc3.informatik.hu-berlin.de/ (Figure 3)
GeneView is a tool for gene-centric searching, ranking, and visualization of scientific full text articles.
Methods: GeneView initially performs a series of pre-processing steps on each corpus that should be indexed: Full text articles are parsed and indexed using Lucene. Gene names are identified and normalized to Entrez Gene IDs using the BioCreative III version of GNAT [35, 36]. This version of GNAT has been improved to deal more efficiently with full texts and allows for a more general species-specific disambiguation of gene names. In addition, single nucleotide polymorphisms are identified using MutationFinder [37]. All recognized entities are added to the Lucene index, together with the section type they were found in and their entity type. This structure allows for a very fast, section-specific search for entities, words, or phrases, and is also used for section specific article ranking.
To find articles that are most relevant for a given gene, the gene index and the sections in which the gene appears are taken into account, as suggested in [38]. Approximately 2,000 different section boost settings have been evaluated using the NCBI Gene2Pubmed mapping as gold standard. Precision of each setting has been estimated using 10 randomly selected genes and their top 20 query results. On this subset the team achieved an overall precision of 72.2%. Using the best section-specific boosting, precision increased by 3.5%. This setting reflects the team's assumption that sections like Title, Abstract and Result are of higher importance than other sections. Surprisingly, the incorporation of figure and table captions decreased the quality of ranking.
Interface: The HTML-based display of an article encompasses the full text itself, with highlighting of all identified entities, and a count-based summary of detected entities. Users can access entity-specific information, integrated from a number of public data sources, by a single mouse click. As the importance of genes mentioned in the article depends on a specific user's needs, GeneView allows personalization of the ranking function. By default, genes are ranked by their total number of occurrences in the article, but users can exclude sections from this calculation.
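The default ranking with optional section exclusion can be sketched as follows. This is an illustration only and not GeneView's code: the function name, the (gene, section) input format and the gene and section labels are all hypothetical.

```python
from collections import Counter


def rank_genes(mentions, excluded_sections=frozenset()):
    """Rank genes by mention count, ignoring mentions in excluded sections.

    `mentions` is an iterable of (gene_id, section_name) pairs, one pair per
    recognized mention (hypothetical input format).
    """
    counts = Counter(
        gene for gene, section in mentions if section not in excluded_sections
    )
    # most_common() sorts by descending count.
    return counts.most_common()


# Hypothetical mentions from a single article.
mentions = [
    ("geneA", "Abstract"),
    ("geneA", "Results"),
    ("geneB", "References"),
    ("geneB", "References"),
    ("geneB", "References"),
]
print(rank_genes(mentions))
print(rank_genes(mentions, excluded_sections={"References"}))
```

In this toy input, excluding the reference list changes which gene ranks first, which is the kind of personalization the interface is described as offering.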
The processing time for a query is currently less than one second. To further assist users in assessing the relevance of an article and the genes it contains, GeneView also identifies all genes co-occurring with a given query in any of the articles in the corpus. Each such gene is tested for positive association using a one-sided χ2 test. The five most significantly associated entities are then displayed by GeneView at the top of the search results page.
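One way such a co-occurrence test could be set up is sketched below, under stated assumptions: each candidate gene is summarized by a 2x2 table of article counts, the χ2 statistic has one degree of freedom, and the one-sided p-value is obtained by halving the two-sided p-value when the observed co-occurrence exceeds its expectation. The function names and the table layout are hypothetical, and the text does not say whether GeneView applies a continuity correction, so none is used here.

```python
import math


def positive_association_p(both: int, query_only: int, gene_only: int, neither: int) -> float:
    """One-sided p-value for positive association in a 2x2 table of article counts."""
    n = both + query_only + gene_only + neither
    row1 = both + query_only   # articles matching the query
    col1 = both + gene_only    # articles mentioning the candidate gene
    row2, col2 = n - row1, n - col1
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of association
    expected_both = row1 * col1 / n
    chi2 = n * (both * neither - query_only * gene_only) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    two_sided = math.erfc(math.sqrt(chi2 / 2.0))
    if both > expected_both:
        return two_sided / 2.0
    return 1.0 - two_sided / 2.0


def top_associated(tables: dict, k: int = 5) -> list:
    """Return the k gene ids with the smallest one-sided p-values."""
    return sorted(tables, key=lambda gene: positive_association_p(*tables[gene]))[:k]
```

Here `tables` maps a gene identifier to its four counts in the order (both, query only, gene only, neither); the default `k=5` mirrors the five entities shown on the results page.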
Team 78 - University of Iowa (Sanmitra Bhattacharya and Padmini Srinivasan)
URL: http://siena.cs.uiowa.edu/~biocreative/ (Figure 4)
The system for the IAT task [39] was developed based on the corresponding BioCreative III gene normalization system [40].
Methods: The gene and protein mentions were identified in the full text using ABNER [41] and LingPipe [42] while the species mentions were identified using LINNAEUS [43]. The initial gene list was filtered using a stop list of terms (e.g. ‘antigen’, ‘Ab’, etc.) and shorthand gene names were expanded to constituent terms. Also the LINNAEUS species dictionary was modified to include genera of model organisms (e.g. Arabidopsis for Arabidopsis thaliana, ID: 3702) and common species strains (e.g. Saccharomyces cerevisiae S288c, ID: 559292). Gene and species entities were then associated if they appeared within fixed character windows and the resulting pairs were searched on the Entrez Gene database. The first Entrez Gene hit obtained from a search is returned as the unique identifier for a particular gene mention.
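The window-based association step can be illustrated with a short sketch. The `Mention` type, the function name, the window size of 200 characters and the example offsets are assumptions for illustration; the text does not state the window sizes the team used or how competing species within a window were resolved, so this sketch simply returns every pair that falls within the window.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    text: str
    start: int  # character offset of the mention in the full text


def pair_within_window(genes, species, window=200):
    """Return (gene, species) text pairs whose start offsets differ by at most `window`."""
    pairs = []
    for gene in genes:
        for sp in species:
            if abs(gene.start - sp.start) <= window:
                pairs.append((gene.text, sp.text))
    return pairs


# Hypothetical mentions and offsets.
genes = [Mention("BRCA1", 100), Mention("TP53", 1000)]
species = [Mention("Homo sapiens", 150), Mention("Mus musculus", 1500)]
print(pair_within_window(genes, species))
```

Each resulting pair would then be used as a query against Entrez Gene, as described above.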
User Interface - The interface of the system for the IAT task is simple and intuitive. Users can select inputs for either the indexing or the retrieval subtask. For the indexing subtask, the full text of a user-selected article is displayed in the left frame of the web page. In the right frame the gene names, species names, normalized NCBI Taxonomy IDs, normalized Entrez Gene IDs and frequency counts of the gene names corresponding to the article are displayed. The results are pre-sorted by the frequency count, which is based on the count of the gene names as identified by the gene name taggers. However, users may sort the results on individual fields. Selecting an individual gene or species name in the right frame highlights it in yellow in the full text. The species identifiers and normalized Entrez Gene IDs have linkouts to the corresponding records in the NCBI Taxonomy database and the Entrez Gene database, respectively. For the retrieval part of the task, the system displays a sortable list of PMCIDs with the frequency of the selected gene mention for each article. Each PMCID in the list has a link to the full text of the article.
Team 89 - University of Wisconsin (Shashank Agarwal and Feifan Liu)
URL: http://autumn.ims.uwm.edu:8080/biocreative3iat/ (Figure 5)
Team 89 developed a demonstration system, GeneIR, that performs both gene indexing and gene-oriented document retrieval.
Methods: For gene normalization, a machine learning system was developed. The system used an existing named entity recognition tool (BANNER) to identify gene mentions and employed an information retrieval based method to map those mentions to their candidate genes in the Entrez Gene database. To further disambiguate the candidate genes, several learning algorithms were explored. A variety of features, such as the mention of the gene’s species in the article, the presence of part or all of the gene’s genetic sequence in the article, and the similarity between the gene’s GO [44] and GeneRIF [17] annotations and the article, were used for model training.
For article retrieval, all articles in the data source were indexed by different fields such as the article’s title, abstract, full text, figure legends and references, which offers flexible support for different retrieval strategies as well as interface functions. To account for gene name variations (for example, BRCA1 vs BRCA-1), a gene name variation generator was implemented. For a gene name query, the system matches the name and its variations against the index for article retrieval. For a gene ID query, the system obtains the gene's symbol and synonyms and uses them, along with their variations, as the query to retrieve relevant documents.
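A gene name variation generator of the kind mentioned above could be sketched as follows. This is not the team's implementation: the function name and the rule set (re-joining alphabetic and numeric parts with no separator, a hyphen or a space) are assumptions chosen to cover the BRCA1 vs BRCA-1 example, and the sketch only handles ASCII letters and digits.

```python
import re


def name_variants(name: str) -> set:
    """Generate separator variants of a gene name, e.g. BRCA1, BRCA-1 and BRCA 1."""
    # Split the name into runs of letters and runs of digits, dropping separators.
    parts = re.findall(r"[A-Za-z]+|\d+", name)
    # Re-join the parts with each separator (assumed rule set).
    return {separator.join(parts) for separator in ("", "-", " ")}


print(sorted(name_variants("BRCA-1")))
```

For a gene ID query, the same function could be applied to the symbol and to each synonym before the index is searched.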
Interface: A user interface that provided two search boxes was developed: one to obtain articles based on a gene name or a gene's Entrez Gene ID, the other to obtain all the normalized genes from an article with a given PMC ID. From the gene results or article results, one could view other genes in an article or other articles containing a specific gene, respectively. When viewing the gene normalizations from an article, the genes can be sorted by centrality (default), presence in title and abstract, or the frequency with which they appear in the article. To determine the centrality of a gene, a machine learning classifier was trained that makes use of features such as the presence of the gene’s mention in the title or abstract, the frequency of the gene’s mention in the article, and the popularity of the gene in the public resources GO and GeneRIF. The interface allows users to view all genes or an individual gene highlighted in the article, and to manually add or delete genes for a given article. The displayed gene list can be downloaded as a tsv (tab separated values) file.
Team 93 - The GNSuite system (Rune Sætre and Naoaki Okazaki)
URL: http://www.idi.ntnu.no/~satre/biocreative/IAT
http://www-tsujii.is.s.u-tokyo.ac.jp/satre/biocreative/IAT/ (Figure 6)
Methods: The GNSuite service is running on two servers in different parts of the world for efficiency and stability. The GNSuite web-based interface is used to present pre-processed input from the underlying parsing, protein recognition and DB identifier assignment systems. Eighteen thousand full text articles are indexed by GNSuite, and more than eighteen million abstracts from PubMed by MEDIE [45].
The system accepts several sources of input, such as MEDIE, GNSuite, and LINNAEUS. This can easily be extended with other systems that provide stand-off annotations, since each system is presented in a separate tab in the user interface. All underlying results are integrated to improve recall. A web service [46] is used to find and highlight alternative names for the recognized genes and species in the text. See the BioCreative III Gene Normalization article for more details on the GNSuite sub-system (look for Team 93 in the GN article in this BC-III issue).
Interface: The GNSuite front page shows PMC and PubMed identifiers for all the available full text articles (sorted, and grouped into several pages). The number of normalized genes found in the title/abstract/full text for each article is also shown.
A “gene table” tab summarizes and ranks the recognized genes based on the combined input from all the underlying systems. This list of genes for all articles can be sorted by relevance scores based on frequency, confidence, whether they appear in the title or abstract, etc. At the top of each article’s individual visualization page (Figure 6) is a summary table with all the genes and the number of mentions in the article. The user can click on any gene symbol to see the entry in Entrez Gene, and all the recognized gene names are highlighted in the text. The user can jump from one gene occurrence to the next by clicking on the gene name, either in the abstract or in the full text. The gene table can be manipulated both manually and automatically, and can be stored to a local file on the user’s computer.
Team 61 - MyMiner (David Salgado and Martin Krallinger)
URL: http://myminer.armi.monash.edu.au (Figure 7)
The MyMiner project proposes a set of tools (1) that facilitate individual and community-based annotation initiatives, through a free and user-friendly interface that performs the most common tasks in manual literature curation and dataset creation; (2) that aim to improve performance of predictive systems, by enhancing the quality of manually annotated sets of documents required for the development of text-mining applications; and (3) that simplify the transfer of unexploited knowledge encoded into textual format within scientific documents into computer-usable information. MyMiner has been instrumental for the creation of a muscle-dedicated database and during the BioCreative III PPI project to classify scientific documents, gene ontology terms and disease descriptions, to detect and normalise bio-entities (e.g. genes and proteins) embedded in text and to detect protein-protein interactions.
Methods: The MyMiner system works with any input text and thus was not tailored to the specific format of the set of articles proposed by the task organizers. It is based on a general three-column tabulated input format that allows MyMiner to be used by users with limited computer skills. The recognition of bio-entities is based on the integration of the named-entity recognition tool ABNER, which automatically tags mentions of proteins, genes, cell lines and cell types. LINNAEUS is used to recognize species. In order to generate a ranked collection of database links from an entity-tagged text, MyMiner proposes a list of database identifiers per bio-entity mention. The UniProt query scoring mechanism is used for proteins and genes [47]. In this case, the protein mentions that are either automatically or manually tagged are used as direct queries within MyMiner to retrieve a ranked set of hits. Alternatively, organism query filters can be applied. The main features that influence the scoring/ranking mechanism are: (1) how often the term (i.e. the selected gene/protein mention) occurs in a given UniProt entry (not normalizing with respect to the document size, to avoid over-weighting sparsely annotated records); (2) weighting depending on the field of the record in which the term was detected (e.g. higher weights are returned for hits against the protein name fields as opposed to a referenced publication field); (3) weighting depending on whether the record has been reviewed or not, scoring higher those records that have been reviewed (as they are generally more reliable); and (4) weighting depending on how comprehensively annotated a record is, to deliberately bias the system towards well-annotated entries, which in general are also more likely to be the actual hit given an input article. Ajax requests are executed to query remote databases such as the NCBI taxonomy, UniProt and OMIM [48] databases, using web service protocols or similar. Results of these queries are processed and displayed “on the fly” on the web page.
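To make the four ranking features listed above more concrete, the toy scoring function below combines them in the simplest possible way. It is not UniProt's actual scoring mechanism, which is not specified here: the `Record` type, the field names, every weight and the way the factors are combined are hypothetical values chosen only to show the direction of each effect.

```python
from dataclasses import dataclass

# Hypothetical field weights: name fields count more than publication fields.
FIELD_WEIGHTS = {"protein_name": 3.0, "gene_name": 2.0, "publication": 0.5}
REVIEWED_BONUS = 1.5       # hypothetical multiplier for reviewed records
ANNOTATION_WEIGHT = 0.01   # hypothetical weight per annotation on the record


@dataclass
class Record:
    accession: str
    fields: dict            # field name -> field text
    reviewed: bool
    annotation_count: int


def score(term: str, record: Record) -> float:
    """Toy relevance score of a database record for a tagged mention."""
    needle = term.lower()
    # Features (1) and (2): raw, unnormalized term counts weighted by field.
    base = sum(
        FIELD_WEIGHTS.get(name, 1.0) * text.lower().count(needle)
        for name, text in record.fields.items()
    )
    if base == 0:
        return 0.0
    if record.reviewed:                                   # feature (3)
        base *= REVIEWED_BONUS
    return base + ANNOTATION_WEIGHT * record.annotation_count  # feature (4)


def rank(term: str, records: list) -> list:
    """Return accessions of matching records, best score first."""
    scored = [(score(term, r), r.accession) for r in records]
    return [acc for s, acc in sorted(scored, key=lambda x: -x[0]) if s > 0]
```

With these placeholder weights, a reviewed record that matches in its protein name field outranks an unreviewed record that matches only in a publication field, even if the latter contains the term more often.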
Interface: The MyMiner application combines several standard web languages and techniques such as PHP, JavaScript and Ajax to enhance user interactivity. MyMiner is composed of four main application interfaces: “File labelling”, “Entity tagging”, “Entity linking”, and “Compare file”. MyMiner user interfaces offer options and tools to resolve a variety of limitations and bottlenecks identified in each task. To make this system flexible and interactive, automatically generated tags can be corrected, edited or removed. Entities are highlighted using CSS and JavaScript. When a tag is defined, a corresponding CSS style is dynamically created. Upon user actions, such as text selection and tagging, HTML tags are added using Document Object Model manipulation functions in JavaScript. Each module provides an export option to save results. The time spent processing a document is recorded and available in the export file. To enhance the user-friendliness of the interfaces, a common display layout has been adopted and conserved between applications. The text area that contains the text or document to be analysed is located at the top of the page. Options and tools are placed below the main curation zone.
MyMiner applications relevant to the IAT task - The “Entity tagging” module allows the automatic tagging of entities of biological interest in a document. It enables the manual correction and editing of those terms to overcome potential tagging errors and facilitates user interaction. Moreover, the user can add new terms, and specify relations between terms using a matrix check box. Such relations might be useful for the extraction of annotations, e.g. protein-protein interactions or protein functions.
The “Entity Linking” module facilitates the identification of database links for proteins, species and diseases mentioned in a document. Biological terms are first automatically detected and displayed in a list that can be manually edited to add new terms or to remove incorrectly identified ones. MyMiner then links each identified gene/protein to UniProtKB identifiers. A check box allows the selection of the most appropriate identifiers from the list of potential candidates. A short description is provided for each term to help validate those candidates. Species and diseases are mapped to NCBI taxonomy and OMIM database identifiers, respectively. Help sections and tutorial movies are provided. A feedback form is also available to send comments and suggestions.