MimoSA: a system for minimotif annotation
- Jay Vyas†1,
- Ronald J Nowling†1,
- Thomas Meusburger2,
- David Sargeant2,
- Krishna Kadaveru1,
- Michael R Gryk1,
- Vamsi Kundeti3,
- Sanguthevar Rajasekaran3 and
- Martin R Schiller1, 2
© Vyas et al; licensee BioMed Central Ltd. 2010
- Received: 19 February 2010
- Accepted: 16 June 2010
- Published: 16 June 2010
Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature, which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature.
We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner (MnM) database, literature tracking, and annotation of new minimotifs. MimoSA enables the visualization, organization, selection and editing functions of minimotifs and their attributes in the MnM database. For the literature components, MimoSA provides paper status tracking and scoring of papers for annotation through a freely available machine learning approach, which is based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database.
MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to dynamically rank papers with respect to context.
- Negative Training
- PubMed Abstract
- Paper Ranking
- Paper Scoring
- Short Peptide Sequence
Minimotifs are short peptide sequences that are the recognition elements for many protein functions. These short sequences are responsible for protein interaction interfaces involving other proteins (or molecules) in cells, trafficking proteins to specific cellular compartments, or serving as the basis for enzymes to post-translationally modify the minimotif sequence. At present, many minimotif instances and consensus sequences are collected into a wide-ranging set of relatively small databases such as MnM, ELM, Domino, PepCyber, and ScanSite [1–5]. Most databases focus on specific subsets of minimotifs. For example, Phospho-ELM has merged with PhosphoBase as a database that focuses on instances of phosphorylation on proteins [6]. Likewise, ScanSite largely concentrates on protein interaction minimotifs for a small subset of domains. In addition to these databases, recent years have seen increased publication rates of high throughput studies that generate minimotif data. Despite this growth in information, many of the reported minimotif attributes have yet to be integrated into any database.
The goal of the MnM project is to integrate well-structured data for a set of defined attributes of minimotifs in a single, non-redundant data repository with high accuracy. The number of reports of minimotifs in the literature has continued to grow since the late 1980s, recently with more rapid growth due to high throughput functional peptide screens. Previously, we showed that the several thousand minimotifs in MnM can be discretized into a structured syntax which can be directly enforced and modeled in a relational database [1, 7]. Through this process, we recognized the need for a system that manages minimotif annotation, which would help identify papers, reduce the time required for manual annotation, reduce errors, duplications and ambiguities, and aid in maintenance of the database.
Currently, there are no bioinformatics tools designed for annotating minimotifs from the literature. Most reported annotation methodologies concentrate mainly on genome- and proteome-scale data [8–10]. A proposed stratification of annotation efforts refers to sequence-based annotation as the first dimension of genome annotation, which defines components [11]. The second dimension can be considered those annotations that focus on component interactions. This is exemplified by the human kinome and other types of functional annotations in the SwissProt and Entrez Gene databases [12, 13]. Annotation of minimotifs can be considered a second dimension annotation.
In considering whether to design a novel minimotif annotation system or adapt an existing annotation system used for another purpose, we identified a number of requirements to facilitate accurate, non-redundant, and efficient annotation of minimotif literature. We wanted the system to interface with a relational database that enforces controlled vocabularies from external databases and eliminates duplication. The system should be able to read, write, and edit entries in a database. The system should display papers that have been and are yet to be annotated, as well as support database-driven machine learning that scores papers for minimotif content, paper sorting, and paper filtering. The system should also have the capability to track annotations from multiple annotators. Finally, the system should be capable of accepting the fine-grained information content of minimotifs, in a structured and comprehensive manner.
Despite advances in management and mining of scientific literature, no tool existed that met our requirements for accurately annotating minimotif data. For example, each of the existing annotation tools such as MIMAS, Textpresso and Biorat only addresses a subset of the above requirements [14–16].
In this paper, we describe MimoSA, a Minimotif System for Annotation designed for managing and facilitating minimotif annotation. MimoSA allows for minimotif-centric analysis of PubMed abstracts and annotation of minimotifs. MimoSA's contents are entirely database driven, thus enabling its adaptation as an annotation tool for other information spaces that require extraction of information from the primary literature.
We present the generalizable architecture and implementation of MimoSA, an application that allows minimotif annotations to be entered, reviewed, edited, approved by multiple users, and disseminated through the publicly available MnM web application. We also describe a generalizable paper-scoring algorithm and its implementation for ranking papers that contain minimotifs. By embedding this methodology into MimoSA, PubMed abstracts can be scored and associated papers can be ranked based on the presence of minimotif information content.
MimoSA was developed in Java (http://java.sun.com) and interfaced with a MySQL database (http://www.mysql.com) using the Hibernate object-relational mapper (http://www.hibernate.org). MimoSA was built to interface with the MnM relational database, which has been expanded to include the ability to store information about papers to be annotated and the relationships between minimotif annotations and their source papers. The graphical user interface (GUI) was developed using Swing (http://java.sun.com/docs/books/tutorial/uiswing). Supporting applications used for offline data processing were also developed in Java. These applications identify new keywords and terms used to highlight text in the abstract display window and download content and metadata from PubMed for papers added into the system. For these features, we have relied extensively on the PubMed Application Programming Interface (API) and Remote Procedure Call (RPC) library.
Unlike other annotation and text mining systems, the data artifacts produced by MimoSA are accessible by an API, which is syntax-driven and strongly typed. This allows for high-precision annotation of articles that is not coupled to any one data repository. Thus, MimoSA may easily be configured, for example, to save annotations to an XML document or text file by simply modifying the data access layer implementation.
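The layering described above can be illustrated with a minimal sketch. It is written in Python rather than MimoSA's Java, it is not taken from the MimoSA source, and every name in it (`MinimotifAnnotation`, `AnnotationStore`, `TextFileStore`, `XmlStore`, `annotate`) as well as the simplified field list and the placeholder values is hypothetical. It only shows how application code that depends on an abstract data access interface can switch between a text file and an XML document without other changes.

```python
import io
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MinimotifAnnotation:
    # Hypothetical, simplified fields; the real MnM model has more attributes.
    sequence: str
    source_accession: str
    position: int
    pubmed_id: str


class AnnotationStore(ABC):
    """Abstract data access layer: callers never see the storage format."""

    @abstractmethod
    def save(self, annotation: MinimotifAnnotation) -> None: ...


class TextFileStore(AnnotationStore):
    """Writes one tab-separated line per annotation to a text stream."""

    def __init__(self, stream):
        self.stream = stream

    def save(self, annotation: MinimotifAnnotation) -> None:
        values = asdict(annotation).values()
        self.stream.write("\t".join(str(v) for v in values) + "\n")


class XmlStore(AnnotationStore):
    """Collects annotations as child elements of an in-memory XML tree."""

    def __init__(self):
        self.root = ET.Element("annotations")

    def save(self, annotation: MinimotifAnnotation) -> None:
        node = ET.SubElement(self.root, "minimotif")
        for name, value in asdict(annotation).items():
            node.set(name, str(value))

    def to_string(self) -> str:
        return ET.tostring(self.root, encoding="unicode")


def annotate(store: AnnotationStore, annotation: MinimotifAnnotation) -> None:
    # Application code depends only on the abstract interface.
    store.save(annotation)


# Usage with placeholder values (not real accession numbers or PubMed IDs).
example = MinimotifAnnotation("PxxP", "ACC000000", 12, "00000000")
buffer = io.StringIO()
annotate(TextFileStore(buffer), example)
xml_store = XmlStore()
annotate(xml_store, example)
```

Swapping `TextFileStore` for `XmlStore` changes only the object passed to `annotate`, which is the property the data access layer is meant to provide.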
The generality of the MimoSA application enables its adaptation to other databases and other knowledge domains. This was a consideration made during the development of MimoSA, so as to more broadly enable adaptation to other bioinformatics projects.
MimoSA prototype design
The primary function of MimoSA is to support the process of annotating functional minimotifs and their metadata from the primary literature. Secondary functions include minimizing user errors and data redundancy, improving annotation efficiency through techniques such as automated motif/activity/target suggestions, and aiding in the identification of papers containing minimotif content through a machine learning-based ranking system. MimoSA features distinct components and algorithms, which streamline these processes.
The components of MimoSA can be broken up into three functional categories: MnM database management tools, minimotif annotation tools, and paper management tools. Descriptions of each component follow.
The minimotif annotation tools consist of the Minimotif Annotation Form, the Abstract Viewer, and the Protein Sequence Validator. Multiple forms can be displayed at once. On the Minimotif Annotation Form, there is a "clone" function, which opens a new instance of the form pre-filled with all of the minimotif-syntactical attributes except the minimotif's sequence and position. This is intended to facilitate more efficient annotation of high-throughput papers for minimotif discovery (e.g. phage display), where several attributes of a minimotif are varied in a controlled fashion, thus generating a broad landscape of similar minimotifs with subtle variations [17, 18].
Another component that assists annotators is the Protein Sequence Validation function (Fig. 3B). Once an accession number has been entered, the protein sequence is automatically retrieved from a local version of public databases such as NCBI and displayed in the Protein Sequence Window. Once loaded, the position of the minimotif in the protein sequence is bolded. This ensures that the minimotif is indeed present in the selected protein.
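The check performed by the Protein Sequence Validation function can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not MimoSA's Java implementation: positions are assumed to be 1-based, the function names are hypothetical, and asterisks stand in for the bolding used in the Protein Sequence Window.

```python
def minimotif_at_position(protein_sequence: str, minimotif: str, position: int) -> bool:
    """Return True if `minimotif` occurs in `protein_sequence` starting at
    the 1-based residue `position` (assumed convention)."""
    if position < 1 or not minimotif:
        return False
    start = position - 1
    window = protein_sequence[start:start + len(minimotif)]
    return window.upper() == minimotif.upper()


def mark_minimotif(protein_sequence: str, position: int, length: int) -> str:
    """Wrap the minimotif residues in asterisks, a plain-text stand-in for bolding."""
    start = position - 1
    end = start + length
    return (protein_sequence[:start] + "*" + protein_sequence[start:end]
            + "*" + protein_sequence[end:])
```

For a made-up sequence such as `MKTAYIAKQR`, `minimotif_at_position("MKTAYIAKQR", "AYIA", 4)` confirms that the annotated position is consistent with the retrieved sequence before the entry is saved.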
Paper tracking status definitions
For papers that have not yet been received.
Reviewed no minimotifs: For papers that were reviewed and do not contain minimotifs.
reviewed for some minimotifs: For papers that were reviewed, but for which not all of the minimotifs from the paper have been annotated.
reviewed for all minimotifs: For papers that were reviewed and for which all minimotifs have been annotated.
For papers with questionable interpretation that require discussion by the annotation group.
no electronic version: For papers for which an electronic version is not available.
minimotif present but not annotated: For papers that have a minimotif, but have not yet been annotated.
Modification of the minimotif miner data model and syntax
In order to better exploit MimoSA's functionality and facilitate unambiguous and accurate annotation, we recognized that some changes to the model we previously presented were required. Our minimotif syntax defined the motif source as the protein that contains the minimotif. However, a consensus minimotif definition such as [RK]xx[RK] can have multiple occurrences in a minimotif source, so we needed to specify a position for the first minimotif residue relative to the protein sequence start position in the corresponding sequence file specified by a protein sequence accession number. Another change we considered is that experiments that contribute to minimotif definitions may use either peptides or full length proteins. We think it is important to specify this as an attribute since the two sources represent very different chemical entities. Finally, we have started using PSI-MOD and GO controlled vocabularies for indicating activities and post-translational modifications of minimotifs.
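To see why a start position is needed, the following Python sketch lists every occurrence of a consensus such as [RK]xx[RK] in a sequence. It is an illustration only: it assumes that a lowercase "x" means "any residue", that sequences are written in uppercase one-letter codes, and that positions are 1-based. The function names are hypothetical and the test sequence is made up.

```python
import re
from typing import List


def consensus_to_regex(consensus: str) -> str:
    """Translate a consensus such as "[RK]xx[RK]" into a regular expression,
    assuming lowercase "x" stands for any residue."""
    return consensus.replace("x", ".")


def find_positions(consensus: str, protein_sequence: str) -> List[int]:
    """Return the 1-based start position of every match, including overlaps."""
    # A lookahead matches without consuming residues, so overlapping
    # occurrences are all reported.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [match.start() + 1 for match in pattern.finditer(protein_sequence)]
```

In the made-up sequence `ARAAKRGGKA` the consensus matches at positions 2 and 6, so the consensus and the source protein alone would not identify a single minimotif instance.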
Identification of papers with minimotif content
The MnM database contains many papers that were previously annotated for minimotif content, but many more papers have yet to be annotated. PubMed contains well over 19 million abstracts of scientific papers. Only those papers that have minimotif content are useful for annotation. Our first approach to pare down the paper list used keyword searches to identify papers that were likely to contain minimotif content; however, this approach was not efficient. Therefore, we developed new strategies and an efficiency metric for the evaluation and comparison of these strategies (see Additional File 1).
We initially evaluated six general strategies: Keywords/Medical Subject Headings (MeSH), date restriction, forward and reverse citations, authors with affiliations, and minimotif regular expressions. A detailed description of the strategies and results is presented in Additional File 1. These strategies were evaluated using a Minimotif Identification Efficiency (MIE) score, which is defined as the percentage of papers that contain minimotifs. Collectively, these strategies provided a list of approximately 120,000 abstracts, of which ~30% were expected to contain minimotifs based on extrapolation.
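As a small worked example of the MIE score defined above, the Python sketch below computes the percentage of reviewed papers that contain minimotifs and extrapolates it to a larger list. The function names are hypothetical, and the extrapolation shown (applying the sample percentage to the full list) is an assumption about how such an estimate could be made, not a description of the procedure in Additional File 1.

```python
from typing import Sequence


def mie_score(contains_minimotif: Sequence[bool]) -> float:
    """Percentage of reviewed papers that contain minimotifs."""
    if not contains_minimotif:
        raise ValueError("at least one reviewed paper is required")
    return 100.0 * sum(bool(flag) for flag in contains_minimotif) / len(contains_minimotif)


def expected_positive_papers(mie: float, total_papers: int) -> int:
    """Extrapolate a sample MIE score to a larger paper list (assumed approach)."""
    return round(mie / 100.0 * total_papers)
```

With an MIE of 30, a list of 120,000 abstracts would be expected to hold about 36,000 papers with minimotif content.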
Design and training of the TextMine algorithm that scores papers for minimotif content
We wanted to score and rank these papers as a means to better identify the ~30% that contain minimotifs and develop a strategy for scoring all PubMed papers that can be used for future maintenance of the MnM database. To rank papers for minimotif content, we designed the Paper Scoring (PS) algorithm and trained the algorithm using structured data for defined paper sets in the MnM database.
The basic problem of interest can be stated as follows: given a research article (or an abstract), automatically rank the article by its likelihood of containing a minimotif. We used a subset of papers as a training set for training the PS algorithm. Each article in a research article collection A, which is used for training, is read by hand and given a score of either 0, indicating that the paper does not contain minimotifs, or 1, indicating that the paper has at least one minimotif. A similar algorithm has been employed to characterize unknown microorganisms [19]. A crucial difference between the PS algorithm and that of Goh, et al., is that the PS algorithm provides an ordering of the papers instead of using a filter threshold.
The workflow for this phase consists of the following steps. We start with disjoint sets P, N, and T of abstracts, which are positive, negative, or not reviewed for minimotif content, respectively. Let W be the ordered term vector found by taking all significant words from the documents of sets P and N, that is, excluding words like "the", "of", and "new" that have no discriminatory value between P and N. For each word w in W and each article a in P, we divide the number of instances of w by the size of a: this is the enrichment of w in a. Then, we sum these enrichments over all articles in P and divide by the size of P to obtain the overall enrichment of w in P, denoted wp. We repeat this over set N to obtain wn, and subtract wn from wp to arrive at a "score" for word w, which ranges from -1 to 1. Higher values indicate more positive association with minimotif content. We now have a vector of decimal "scores", which has the same dimension as W, with one entry per term in the term vector. Call this vector S.
1. Scan through the paper (or abstract) to count how many times each word w of W occurs in this article.
2. Construct a vector v of all values from (1) in which the order corresponds with S.
3. Compute the correlation between v and S and obtain a Pearson's correlation coefficient pc for each paper. If X and Y are any two random variables, then the Pearson's correlation coefficient between X and Y is computed as
\[ \rho_{X,Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \, \sigma_Y} \]
where $\mu_X$ is the expected value of X, $\mu_Y$ is the expected value of Y, $\sigma_X$ is the standard deviation of X, and $\sigma_Y$ is the standard deviation of Y.
Thus, we have now computed a "score" of the article, which is the Pearson's correlation coefficient between the scored words from the training set W and respective enrichments of those words in the article n.
The Paper Scoring (PS) algorithm's pseudo code is provided in Additional File 1. The correlation coefficients for the lexemes range from -1.000 to 1.000. This score positively correlates with the presence of minimotif content, as expected.
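The training and scoring steps described above can be sketched in Python as follows. This is a minimal illustration written from the prose description, not the TextMine source or the pseudo code in Additional File 1. The tokenizer, the stop-word list, and all function names are assumptions, and an undefined correlation (zero variance) is arbitrarily reported as 0. Both training sets are assumed to be non-empty.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical stop-word list; the words TextMine actually excludes are not listed here.
STOP_WORDS = {"the", "of", "new", "a", "and", "in"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def mean_enrichment(word: str, docs: List[List[str]]) -> float:
    """Average over documents of (occurrences of word) / (document length)."""
    return sum(doc.count(word) / len(doc) for doc in docs if doc) / len(docs)


def train_word_scores(positive: List[str], negative: List[str]) -> Dict[str, float]:
    """Score each word as its enrichment in P minus its enrichment in N."""
    pos = [tokenize(text) for text in positive]
    neg = [tokenize(text) for text in negative]
    vocabulary = sorted({w for doc in pos + neg for w in doc})
    return {w: mean_enrichment(w, pos) - mean_enrichment(w, neg) for w in vocabulary}


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: treat an undefined correlation as no signal
    return cov / (sd_x * sd_y)


def score_article(text: str, word_scores: Dict[str, float]) -> float:
    """Correlate the article's word counts (v) with the trained word scores (S)."""
    counts = Counter(tokenize(text))
    words = list(word_scores)
    v = [float(counts[w]) for w in words]
    s = [word_scores[w] for w in words]
    return pearson(v, s)
```

Sorting unreviewed abstracts by `score_article` in descending order then gives a ranking rather than a thresholded filter, in line with the ordering behaviour described for the PS algorithm.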
Paper ranking and evaluation of the paper scoring algorithm
The algorithm above is packaged as an independent application, TextMine, which can be used in conjunction with MimoSA (or as a standalone open source Java application that can be integrated with any annotation or analysis pipeline). For the test set, we selected 91 new articles, which we determined to either have or not have minimotif content and were disjoint from the training sets. The basis for all testing of the TextMine application was derived from correlations of TextMine scores to this set.
The TextMine website and package provide a test data set that reproduces our analysis for a set of test papers. The current version of MimoSA, utilized for MnM annotation, uses scores from TextMine calculated for 120,000 abstracts for paper sorting.
Paper scoring algorithm and training set size
Larger training set sizes (negative, positive) modestly improve algorithm performance
For use in testing TextMine's performance relative to the size of the training set, the application package includes an iteration module, which allows for specification of the sizes of positive and negative training sets (this iteration package generated the data in Table 2). We recorded the performance for incrementally increased training set sizes, and noted that as the number of either positive or negative training documents increased, a modest performance improvement was observed. The performance of the algorithm is determined by the correlation coefficient between the calculated scores, between -1 and 1, and an actual score of either 0 or 1.
The table indicates that large increases in the number of positive training articles were comparable to small increases in the number of negative training articles, ultimately showing that both had modest increases in value with set size. A positive correlation coefficient between positive or negative training size and the algorithm performance was observed (0.52 and 0.46, respectively). The correlation score between TextMine scores and the training set scores showed modest increases with size (ranging from 0.59 to 0.66 when using 40 negative and 400 positive abstracts).
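The performance measure used here, the correlation between calculated scores and the hand-assigned 0/1 labels of the test papers, can be sketched in Python as below. The function name `scoring_performance` is hypothetical, and the numbers in the usage note are made up for illustration; they are not results from the TextMine test set.

```python
import math
from typing import Sequence


def scoring_performance(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pearson correlation between calculated scores and actual 0/1 labels."""
    if len(scores) != len(labels) or len(scores) < 2:
        raise ValueError("scores and labels must have the same length (at least 2)")
    n = len(scores)
    mean_s = sum(scores) / n
    mean_l = sum(labels) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, labels))
    sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in scores))
    sd_l = math.sqrt(sum((l - mean_l) ** 2 for l in labels))
    if sd_s == 0 or sd_l == 0:
        raise ValueError("correlation is undefined when either input is constant")
    return cov / (sd_s * sd_l)
```

For example, made-up scores of 0.8, 0.6, -0.2 and -0.5 for papers labelled 1, 1, 0 and 0 give a strongly positive value, while the same scores against reversed labels give a negative one.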
Because the general utility of this algorithm far exceeds the field of minimotif annotation, we have released TextMine as a stand-alone application that is cross-platform and database-independent.
We have built an application that facilitates annotation of minimotifs from the primary literature, which we are currently using to populate a more comprehensive MnM minimotif database. The application scores a set of papers for minimotif content. In principle, the TextMine score can be used to score all PubMed abstracts for minimotif content and can be used in the future for maintaining the database. As text mining algorithms increase in proficiency and scope, it may be possible to use a large, MimoSA-curated set of minimotif-containing papers as a training set for automatically detecting minimotif definition sentences and phrases in papers by machine learning approaches.
The implementation of the paper scoring algorithm as a SQL stored procedure in MimoSA automates its execution and is amenable to further machine learning development. A static algorithm would have required a word or file list as input and required manual merging of results into the database. One limitation of the TextMine application is that it does not directly control for type biasing. That is, depending on the training set, we expect that there is some risk of "weighting" words heavily to bias toward previously seen content types. Instead of controlling for this automatically, TextMine outputs the scores of all calculated words so as to enable user inspection of how their training set influences the algorithm. This allows for informed adjustments to the training set on a case-by-case basis.
Although MimoSA was developed primarily for minimotif annotation, the PS algorithm for scoring content in papers has broader applications. In consideration of its potential use, we have implemented it as a separate program, TextMine. For other annotation purposes, correlation scores for individual words are calculated from a training set of articles already known to either contain or not contain the desired information. This results in a rank order for several thousand words. For each single article, the PS algorithm then calculates a Pearson's correlation coefficient between two large linear sets: the score of each word in the aforementioned dictionary, and the corresponding enrichment of that word in the article's title and abstract. Despite the broad range of semantic methodologies for communication of peptide minimotif information, we still observed significant differentiation of the paper rankings when applied to the minimotif content papers.
The MimoSA application interfaces with a normalized model of minimotif function, facilitating non-redundant annotation of minimotifs. The MimoSA user interface combines a set of features that facilitate annotation, including the browsing, sorting, creation, and modification of minimotif annotation entries. Additional features include interactive paper selection, a database-driven Minimotif Annotation Form and literature browser, minimotif attribute-based markup and highlighting of abstracts, the display of minimotif positions in protein sequences, and minimotif publication scoring and status tracking. MimoSA also features an adaptive, database-driven paper-ranking strategy that can be used to rank papers for minimotif content, which, when combined with the paper tracking module, represents an adaptive approach to literature scoring and content rating. The layered architecture, generalizable data model of minimotif functionality, and database-driven application components enable MimoSA to be readily adapted for other molecular annotation projects.
Project name: Minimotif System for Annotation (MimoSA)
Project home page: mimosa.bio-toolkit.com, textmine.bio-toolkit.com
Operating system(s): Platform independent
Programming language: Java
Other requirements: MySQL 5.0 or higher, Java Virtual Machine 1.6 or higher
License: Open Source
Any restrictions to use: This paper must be referenced in any publication that uses MimoSA or TextMine, or any application that is developed based on these core applications.
We thank the National Institutes of Health for funding (GM079689, AI078708 to MRS and GM083072 to MRG). We would like to thank members of the Minimotif Miner team for suggestions in preparation of this manuscript.
- Rajasekaran S, Balla S, Gradie P, Gryk MR, Kadaveru K, Kundeti V, Maciejewski MW, Mi T, Rubino N, Vyas J, Schiller MR: Minimotif miner 2nd release: a database and web system for motif search. Nucleic Acids Res 2009, 37: D185-D190. 10.1093/nar/gkn865
- Balla S, Thapar V, Luong T, Faghri T, Huang CH, Rajasekaran S, del Campo JJ, Shin JH, Mohler WA, Maciejewski MW, Gryk M, Piccirillo B, Schiller SR, Schiller MR: Minimotif Miner, a tool for investigating protein function. Nat Methods 2006, 3: 175–177. 10.1038/nmeth856
- Gong WM, Zhou DH, Ren YL, Wang YJ, Zuo ZX, Shen YP, Xiao FF, Zhu Q, Hong AL, Zhou X, Gao XL, Li TB: PepCyber : PPEP: a database of human protein-protein interactions mediated by phosphoprotein-binding domains. Nucleic Acids Res 2008, 36: D679-D683. 10.1093/nar/gkm854
- Puntervoll P, Linding R, Gemund C, Chabanis-Davidson S, Mattingsdal M, Cameron S, Martin DMA, Ausiello G, Brannetti B, Costantini A, Ferre F, Maselli V, Via A, Cesareni G, Diella F, Superti-Furga G, Wyrwicz L, Ramu C, McGuigan C, Gudavalli R, Letunic I, Bork P, Rychlewski L, Kuster B, Helmer-Citterich M, Hunter WN, Aasland R, Gibson TJ: ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 2003, 31: 3625–3630. 10.1093/nar/gkg545
- Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res 2003, 31: 3635–3641. 10.1093/nar/gkg584
- Diella F, Gould CM, Chica C, Via A, Gibson TJ: Phospho.ELM: a database of phosphorylation sites - update 2008. Nucleic Acids Res 2008, 36: D240-D244. 10.1093/nar/gkm772
- Vyas J, Nowling RJ, Maciejewski MW, Rajasekaran S, Gryk MR, Schiller MR: A proposed syntax for Minimotif Semantics, version 1. Bmc Genomics 2009, 10: 360. 10.1186/1471-2164-10-360
- Reeves GA, Talavera D, Thornton JM: Genome and proteome annotation: organization, interpretation and integration. J R Soc Interface 2009, 6: 129–147. 10.1098/rsif.2008.0341
- Sherman BT, Huang dW, Tan Q, Guo Y, Bour S, Liu D, Stephens R, Baseler MW, Lane HC, Lempicki RA: DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics 2007, 8: 426. 10.1186/1471-2105-8-426
- Kawaji H, Hayashizaki Y: Genome annotation. Methods Mol Biol 2008, 452: 125–139.
- Reed JL, Famili I, Thiele I, Palsson BO: Towards multidimensional genome annotation. Nature Reviews Genetics 2006, 7: 130–141. 10.1038/nrg1769
- Braconi QS, Orchard S: The annotation of both human and mouse kinomes in UniProtKB/Swiss-Prot: one small step in manual annotation, one giant leap for full comprehension of genomes. Mol Cell Proteomics 2008, 7: 1409–1419. 10.1074/mcp.R700001-MCP200
- Boeckmann B, Blatter MC, Famiglietti L, Hinz U, Lane L, Roechert B, Bairoch A: Protein variety and functional diversity: Swiss-Prot annotation in its biological context. C R Biol 2005, 328: 882–899. 10.1016/j.crvi.2005.06.001
- Cohen AM, Hersh WR: A survey of current work in biomedical text mining. Brief Bioinform 2005, 6: 57–71. 10.1093/bib/6.1.57
- Muller HM, Kenny EE, Sternberg PW: Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004, 2: e309. 10.1371/journal.pbio.0020309
- Gattiker A, Hermida L, Liechti R, Xenarios I, Collin O, Rougemont J, Primig M: MIMAS 3.0 is a Multiomics Information Management and Annotation System. BMC Bioinformatics 2009, 10: 151. 10.1186/1471-2105-10-151
- Songyang Z, Shoelson SE, Mcglade J, Olivier P, Pawson T, Bustelo XR, Barbacid M, Sabe H, Hanafusa H, Yi T, Ren R, Baltimore D, Ratnofsky S, Feldman RA, Cantley LC: Specific Motifs Recognized by the Sh2 Domains of Csk, 3Bp2, Fps Fes, Grb-2, Hcp, Shc, Syk, and Vav. Mol Cell Biol 1994, 14: 2777–2785.
- Kaushansky A, Gordus A, Chang B, Rush J, MacBeath G: A quantitative study of the recruitment potential of all intracellular tyrosine residues on EGFR, FGFR1 and IGF1R. Molecular Biosystems 2008, 4: 643–653. 10.1039/b801018h
- Goh CS, Gianoulis TA, Liu Y, Li J, Paccanaro A, Lussier YA, Gerstein M: Integration of curated databases to identify genotype-phenotype associations. BMC Genomics 2006, 7: 257. 10.1186/1471-2164-7-257
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.