Species identification for gene name normalization

Background

Protein interaction networks are expensive to construct experimentally. Researchers therefore usually turn to the literature or to domain-specific databases for knowledge on currently known interactions. Yet manually collecting this knowledge from scientific papers is labor intensive and should be automated to the extent possible. An important step toward automation is identifying gene and protein names (termed entities). Once identified, gene names must be mapped to database identifiers to connect them to structured knowledge. One particular problem in this step is homonymy, i.e., identical names referring to different genes in different species.

Methods

We present different approaches for assigning species labels to MEDLINE abstracts. We use (1) as a baseline, the most frequent species among the MeSH terms of the corresponding journal; (2) the prediction of a binary classifier (SVM) for each species; (3) species names found by the tools Ali Baba [1] or LINNAEUS [2]; (4) the species of a normalized protein mention found by GNAT [3]. For evaluation, we use two sources of gold-standard document-level annotations: the MeSH terms from MEDLINE, and the species from UniProt and the E. coli-specific RegulonDB [4] via protein-MEDLINE references.
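
The journal baseline (1) can be illustrated with a minimal sketch. It assumes that each training document is available as a pair of journal name and set of species labels; the function names and the input layout are hypothetical and are not taken from the original work.

```python
from collections import Counter, defaultdict


def train_journal_baseline(training_docs):
    """Learn the most frequent species label per journal.

    training_docs: iterable of (journal, species_labels) pairs, where
    species_labels is a set of species names (assumed input layout).
    """
    counts = defaultdict(Counter)
    for journal, species_labels in training_docs:
        counts[journal].update(species_labels)
    # Journals without any species label are left out of the model.
    return {
        journal: label_counts.most_common(1)[0][0]
        for journal, label_counts in counts.items()
        if label_counts
    }


def predict_species(model, journal, default=None):
    """Return the journal's most frequent species, or default if unseen."""
    return model.get(journal, default)
```

Such a baseline ignores the abstract text entirely, so every abstract from the same journal receives the same label.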

Results

Measurements on a random set of 200,000 abstracts from MEDLINE are summarized in Table 1. For MeSH term prediction, the text-based methods (Ali Baba, LINNAEUS, GNAT) show stable performance across species, while the classification methods, which rely on training data, perform worse for species with lower prior probability. For the most frequent species, human, the bag-of-words SVM overcomes the difficulty of missing explicit species mentions by learning other clues. Using UniProt as the gold standard, learning methods produce substantially higher recall, indicating that molecular biology papers mention their focus organisms more explicitly. There is considerable disagreement between gold-standard databases; e.g., only 85.7% of the papers referenced from a comprehensive E. coli-specific database are annotated as E. coli by MeSH. Possible reasons include incompleteness of MeSH annotations or consideration of orthologs in RegulonDB.

Table 1 Comparison of methods for document-level species annotation
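
As an illustration of how a document-level comparison against a gold standard might be computed, the following sketch derives precision and recall for one species. The dictionary layout (document identifier mapped to a set of species labels) and the function name are assumptions made for this example, not a description of the evaluation code used in the study.

```python
def precision_recall(gold, predicted, species):
    """Document-level precision and recall for a single species.

    gold, predicted: dicts mapping a document id to a set of species labels.
    Documents missing from either dict are treated as having no labels.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | predicted.keys():
        in_gold = species in gold.get(doc_id, set())
        in_pred = species in predicted.get(doc_id, set())
        if in_gold and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_gold:
            fn += 1
    # Report 0.0 when a ratio is undefined (no predictions or no gold labels).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under this reading, the share of RegulonDB-referenced papers that MeSH annotates as E. coli corresponds to the recall obtained when MeSH labels are passed as `predicted` and RegulonDB labels as `gold`.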

Conclusion

We conclude that there is no one-size-fits-all method for identifying species in abstracts. For less frequent species, methods that identify direct species mentions work best. The advantage of using indirect clues could only be realized for the most frequent species, human, suggesting that machine learning methods should be applied after better balancing of the training data. We also showed that using MeSH term queries to filter papers considerably limits recall.
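
One common way to balance training data for a per-species binary classifier is to weight each class inversely to its frequency. The sketch below is only one possible scheme and is not the authors' method; it uses the formula n_samples / (n_classes * class_count), and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
```

With such weights, errors on the rare positive class (abstracts about an infrequent species) cost more during training than errors on the abundant negative class.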

References

  1. Plake C, Schiemann T, Pankalla M, Hakenberg J, Leser U: AliBaba: PubMed as a graph. Bioinformatics 2006, 22(19):2444–2445. 10.1093/bioinformatics/btl408

  2. Gerner M, Nenadic G, Bergman C: LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 2010, 11: 85. 10.1186/1471-2105-11-85

  3. Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G: Inter-species normalization of gene mentions with GNAT. Bioinformatics 2008, 24(16):126–132. 10.1093/bioinformatics/btn299

  4. Salgado H, Santos-Zavaleta A, Gama-Castro S, Peralta-Gil M, Penaloza-Spinola M, Martinez-Antonio A, Karp P, Collado-Vides J: The comprehensive updated regulatory network of Escherichia coli K-12. BMC Bioinformatics 2006, 7: 5. 10.1186/1471-2105-7-5

Acknowledgements

Domonkos Tikk was supported by the Alexander-von-Humboldt Foundation.

Author information

Corresponding author

Correspondence to Illés Solt.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Solt, I., Tikk, D. & Leser, U. Species identification for gene name normalization. BMC Bioinformatics 11, P5 (2010). https://doi.org/10.1186/1471-2105-11-S5-P5

Keywords

  • MeSH
  • MeSH Term
  • Protein Interaction Network
  • Term Query
  • Frequent Species