The PPI affix dictionary (PPIAD) and BioMethod Lexicon: importance of affixes and tags for recognition of entity mentions and experimental protein interactions
© Krallinger et al; licensee BioMed Central Ltd. 2010
Published: 06 October 2010
Substantial text mining efforts are being devoted to detecting protein mentions and protein-protein interaction (PPI) relations in scientific articles [1, 2]. In this context, the BioCreative challenge showed that the correct identification of the individual interactor proteins is still a challenging task, especially when using full text articles. A systematic analysis of the particularities of protein mentions in the context of interaction descriptions was nonetheless missing. Experimental biologists often use specific fusion proteins or protein tags such as -GST, -His, -Myc, FLAG-, antibodies or fluorescent protein (GFP, YFP, CFP and RFP) tags to detect and visualize interactions. These tags are often mentioned as affixes of the target proteins in the literature. The importance of affixes in biomedical text mining has been addressed in the case of affixal negation expressions [3] and of general posttranslational modifications of proteins [4], and can be observed in trigger verbs used for interaction extraction [5].
We carried out a detailed study on the presence of common affixes belonging to interactor protein mentions in full text sentences considered by database curators as evidential support for experimentally characterized physical protein interactions. Furthermore, we tried to determine whether specific affixes might be useful to detect PPI relevant articles and to correlate affix mentions with particular interaction detection methods.
Within a total set of 6,300 interaction evidence sentences, 1,946 (31%) mentioned at least one interaction-relevant affix, which shows that affixes are a relatively common feature of interaction descriptions. A statistical analysis of associations between affix classes and interaction detection method annotations (Chi-square test) revealed strong associations between some affix classes and interaction methods, such as: MI:0096 - AF_21 (MI: pull down and PPIAD: gst_pull_down_tag), MI:0676 - AF_6 (tandem affinity purification and Tandem_Affinity_Purification_tag), MI:0018 - AF_10 (two hybrid and Gal4_tag), MI:0006 - AF_4 (anti bait coimmunoprecipitation and Antibody_tag), MI:0055 - AF_15 (fluorescent resonance energy transfer and yellow_fluorescent_protein_tag), MI:0809 - AF_15 (bimolecular fluorescence complementation and yellow_fluorescent_protein_tag) or MI:0007 - AF_22 (anti tag coimmunoprecipitation and hemagglutinin_tag). These associations could help to detect experimentally validated interactions and even to associate some of them with potential interaction detection methods.
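The kind of association test mentioned above can be sketched for a single affix class and a single detection method as a 2x2 contingency table. This is only a minimal illustration: the function name `chi_square_2x2` and the counts are hypothetical, and the text does not state whether the original analysis used per-pair 2x2 tables, a larger table, or any correction.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table.

    a: sentences with the affix class and with the method annotation
    b: sentences with the affix class, without the method annotation
    c: sentences without the affix class, with the method annotation
    d: sentences with neither
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("each row and column must have a non-zero total")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts, for illustration only (not taken from the study).
stat = chi_square_2x2(a=30, b=10, c=20, d=40)

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
significant = stat > 3.841
```

When many affix class and method pairs are tested in this way, some adjustment for multiple comparisons would normally be needed before calling an association strong.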
To determine whether interaction affix mentions might be exploited for finding PPI-relevant papers, we examined the distribution of affix mentions across relevant and non-relevant full text articles from the BioCreative II.5 training and test sets. Some of the affix classes were more frequently linked to PPI-relevant articles, which indicates that they could be exploited as additional features for an article selection task.
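One simple way to look at such a distribution is to compare, for each affix class, the fraction of relevant and of non-relevant articles that mention it. The sketch below assumes each article has already been reduced to a relevance label and a set of affix class identifiers; the function name `affix_class_rates` and the toy corpus are hypothetical and do not reproduce the BioCreative II.5 data.

```python
from collections import Counter


def affix_class_rates(articles):
    """Map each affix class to (rate in relevant, rate in non-relevant articles).

    articles: iterable of (is_relevant, affix_classes) pairs, where
    affix_classes is the set of affix class IDs found in that article.
    """
    totals = Counter()
    hits = {True: Counter(), False: Counter()}
    for is_relevant, classes in articles:
        totals[is_relevant] += 1
        hits[is_relevant].update(set(classes))  # count each class once per article

    rates = {}
    for cls in sorted(set(hits[True]) | set(hits[False])):
        rel = hits[True][cls] / totals[True] if totals[True] else 0.0
        non = hits[False][cls] / totals[False] if totals[False] else 0.0
        rates[cls] = (rel, non)
    return rates


# Toy corpus with made-up assignments; only the class IDs come from the text.
toy = [
    (True, {"AF_21", "AF_4"}),
    (True, {"AF_21"}),
    (False, {"AF_4"}),
    (False, set()),
]
rates = affix_class_rates(toy)
```

Classes whose rate is clearly higher among relevant articles are the natural candidates for binary or count features in an article classifier.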
Additional analysis is required at the level of identifying interactor proteins and interaction pairs through these affixes. However, it is clear that dictionary look-up based strategies for detecting protein mentions need to handle affixes in order to identify the correct interactor from a mention string. For detecting interaction pairs, affix mentions can serve as a criterion in cases where other strategies are unable to retrieve interactions for co-mentioned entities or fail to determine whether the interaction has been experimentally proven. The difficulties encountered by affix-based PPI extraction relate to recall, when one of the interactors does not display a valid affix, and to precision, when only some of the affix-mentioning proteins actually interact (Fig. 1B). To address these issues, a manual collection of 799 affix-relevant interaction expression patterns has been constructed (Fig. 1C). Additional materials and the PPIAD are available at: http://www.cse.iitm.ac.in/~ashishvt/research/PPIAD/.
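Affix handling in a dictionary look-up can be sketched as peeling known tags off a mention string before matching the remaining core name. This is a minimal sketch under stated assumptions: the tag list contains only the tags named in the text and is not the PPIAD itself, and the names `split_affixes`, `lookup` and the dictionary identifiers are hypothetical.

```python
import re

# Illustrative tag list taken from the tags named in the text; not the PPIAD.
TAGS = ["GST", "His", "Myc", "FLAG", "GFP", "YFP", "CFP", "RFP"]
_TAG = "|".join(TAGS)
_PREFIX = re.compile(rf"^(?:{_TAG})-", re.IGNORECASE)
_SUFFIX = re.compile(rf"-(?:{_TAG})$", re.IGNORECASE)


def split_affixes(mention):
    """Return (core_name, affixes) after peeling tag prefixes and suffixes."""
    affixes = []
    core = mention.strip()
    changed = True
    while changed:
        changed = False
        for pattern in (_PREFIX, _SUFFIX):
            match = pattern.search(core)
            # Never strip the whole string: a bare tag is not a tagged protein.
            if match and len(core) > len(match.group(0)):
                affixes.append(match.group(0).strip("-"))
                core = core[:match.start()] + core[match.end():]
                changed = True
    return core, affixes


def lookup(mention, dictionary):
    """Try the full mention first, then the affix-stripped core name."""
    key = mention.strip().lower()
    if key in dictionary:
        return dictionary[key], []
    core, affixes = split_affixes(mention)
    return dictionary.get(core.lower()), affixes


# Hypothetical dictionary with made-up identifiers.
protein_dict = {"rad51": "ID_RAD51", "c-myc": "ID_CMYC"}
```

Trying the unmodified mention first matters because some tag names are also part of real protein names: with the dictionary above, `lookup("c-Myc", protein_dict)` matches the full name, whereas stripping first would reduce it to `c`. The recovered affixes can be kept alongside the match as evidence that the mention occurs in an experimental context.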
This work was supported by the European Commission FP6 NoEs ENFIN LSHG-CT-2005-518254, by grant BIO2007-66855 from the Spanish Ministerio de Ciencia e Innovación, and by the Spanish National Bioinformatics Institute (http://www.inab.org/).
- Smith L, et al.: Overview of BioCreative II gene mention recognition. Genome Biol 2008, 9(Suppl 2):S2. 10.1186/gb-2008-9-s2-s2
- Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A: Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol 2008, 9(Suppl 2):S4. 10.1186/gb-2008-9-s2-s4
- Sanchez-Graillet O, Poesio M: Negation of protein–protein interactions: analysis and extraction. Bioinformatics 2007, 23(13):i424–i432. 10.1093/bioinformatics/btm184
- Narayanaswamy M, Ravikumar KE, Vijay-Shanker K: Beyond the clause: extraction of phosphorylation information from medline abstracts. Bioinformatics 2005, (Suppl 1):i319–27. 10.1093/bioinformatics/bti1011
- Hakenberg J, Plake C, Royer L, Strobelt H, Leser U, Schroeder M: Gene mention normalization and interaction extraction with context models and sentence motifs. Genome Biol 2008, 9(Suppl 2):S14. 10.1186/gb-2008-9-s2-s14
This article is published under license to BioMed Central Ltd.