- Methodology article
- Open Access
Improving classification in protein structure databases using text mining
© Koussounadis et al; licensee BioMed Central Ltd. 2009
- Received: 27 November 2008
- Accepted: 05 May 2009
- Published: 05 May 2009
The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. One of the main bottlenecks in the processing of new entries is the evaluation of 'borderline' cases by human curators with reference to the literature, and better tools for helping both expert and non-expert users quickly identify relevant functional information from text are urgently needed. A text based method for protein classification is presented, which complements the existing sequence and structure-based approaches, especially in cases exhibiting low similarity to existing members and requiring manual intervention. The method is based on the assumption that textual similarity between sets of documents relating to proteins reflects biological function similarities and can be exploited to make classification decisions.
An optimal strategy for the text comparisons was identified using an established gold standard enzyme dataset. Filtering the abstracts with a machine learning approach, to discriminate sentences containing functional, structural and classification information relevant to the protein classification task, improved performance. Testing this classification scheme on a dataset of 'borderline' protein domains that lack significant sequence or structure similarity to classified proteins showed that although, as expected, the structural similarity classifiers perform better on average, there is a significant benefit in incorporating text similarity in logistic regression models, indicating significant orthogonality in this additional information. Coverage increased significantly, especially at low error rates, which is important for routine classification tasks: 15.3% for the combined structure and text classifier compared to 10% for the structural classifier alone, at a 10⁻³ error rate. Finally, when only the highest scoring predictions were used to infer classification, an extra 4.2% of correct decisions were made by the combined classifier.
We have described a simple text based method to classify protein domains that demonstrates an improvement over existing methods. The method is unique in incorporating structural and text based classifiers directly and is particularly useful in cases where inconclusive evidence from sequence or structure similarity requires laborious manual classification.
- Support Vector Machine
- Support Vector Machine Model
- Matthews Correlation Coefficient
- Text Similarity
- Protein Classification
Advances in structural biology have increased the rate at which new protein structures are being determined, creating a need for automated methods for protein domain classification. The main computational tools for classification in protein structure databases such as CATH and SCOP remain sequence and structural comparisons. In the processing of CATH, for example, a high degree of structural similarity often warrants the direct inheritance of the classification of the matched known domains. Although it may be possible to classify protein domains purely on the basis of clear sequence and structural similarity, there are many cases that exhibit 'borderline' or low similarity to existing members and require laborious manual classification. This manual classification usually requires study of the relevant literature, so classification of these 'borderline' domains may benefit from automated literature analysis. To address this need, text mining based methods may complement the existing molecular computational approaches, especially in cases where the evidence from sequence and structural similarity is inconclusive.
Support Vector Machines (SVMs) are among the more recent machine learning approaches in wide use today; they are grounded in statistical learning theory, minimise the risk of error and offer good generalisation performance. SVMs are capable of overcoming problems commonly associated with high dimensionality, such as overfitting, and have been shown to perform well in document classification tasks. In such applications, text documents are represented as vectors according to the bag-of-words model, whereby each word represents a dimension in a high-dimensional space. These vectors are not only high-dimensional but also sparse, as each document typically contains a small subset of the very large set of words present in the corpus vocabulary. SVMs learn a separating hyperplane that divides two sets of vectors such that the risk of misclassification is minimised, and are particularly suitable for this kind of high-dimensional, sparse data.
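As a minimal sketch of the bag-of-words representation described above (Python is used purely for illustration, and the example documents are invented):

```python
from collections import Counter

def bag_of_words(doc):
    """Represent a document as a sparse term-frequency vector."""
    return Counter(doc.lower().split())

docs = [
    "the enzyme catalyses the reaction",
    "the crystal structure of the enzyme was determined",
]
vocab = sorted(set().union(*(bag_of_words(d) for d in docs)))

# Each document contains only a small subset of the corpus vocabulary,
# so its vector over `vocab` is mostly zeros (sparse).
vectors = [[bag_of_words(d)[t] for t in vocab] for d in docs]
```

In a realistic corpus the vocabulary runs to tens of thousands of terms, so each document vector is overwhelmingly zeros, which is exactly the regime where SVMs perform well.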
There are several applications of SVMs to document classification. PreBIND is an information extraction system based on SVM technology for the detection of protein-protein interactions in the literature. Stapley et al. used SVMs to infer the sub-cellular location of proteins from text sources. Rice et al. developed a machine learning approach to mine protein function predictions from text. Another SVM based document classification algorithm was implemented to assign documents to nine main categories corresponding to chapters of the WormBook. The system was optimised for the Caenorhabditis elegans corpus using rules, but the same classification engine can be applied to other domains.
Several approaches already exist that incorporate text mining methods and other bioinformatics tools. Indeed, the combination of sensitive sequence similarity searches with functional annotations has been successfully implemented for the functional prediction of proteins, based on experimental knowledge of remote homologues . Other attempts combine functional information with a variety of similarity search methods. SAWTED (Structure Assignment With Text Description) uses textual descriptions from several fields of UniProt records to enhance the detection of remote homologues from PSI-BLAST results . ProFAT combines PSI-BLAST sequence similarity searches with fold recognition and text mining to refine hits according to their function .
Various methods are available for automatic functional annotation of proteins from text, and several of these have been evaluated in the second task of BioCreAtIvE. Couto et al. developed an unsupervised method for recognizing biological properties in free text (FiGO). The system splits text into sentences and scores the evidence content based on the nomenclature of a genomic ontology that structures the properties. GOAnnotator is able to link Gene Ontology (GO) terms in uncurated annotations of UniProt entries with evidence text from documents related to those entries. Other methods for automatic identification of GO terms in free text include the approach by Ruch, which is based on pattern matching and text categorisation ranking, and that of Gaudan et al., which considers not only the presence of words from GO terms occurring in text, but also the proximity between words and their specificity based on their information content.
We have developed a novel text based approach for protein classification, based on the text similarity of documents related to proteins, with a view to supporting the curation of protein structure databases. It is assumed that textual similarity between sets of documents relating to proteins mirrors structural and functional relationships and can therefore be used to make protein classification assignments. Such documents usually contain a description of the protein's function and structure. In addition, they may mention specific characteristics, related or homologous proteins and their classification. In general, they report on functional and structural details, and several of the words used in these descriptions are specific to each protein or protein class. The method exploits the presence of such protein-specific words (features) in document classification. Documents related to proteins that belong to the same class, and therefore have similar function and structure, are expected to be more similar than documents related to proteins in different classes. A classification assignment for a query protein can then be inferred by assessing text similarity between a set of documents relating to the unclassified protein and sets of documents relating to classified proteins. For instance, if documents related to an unclassified protein are similar to documents related to a classified protein, the two proteins may belong to the same class, or in this case, the same protein superfamily.
The documents can be abstracts or full text articles, although only abstracts have been used here due to the large variety of access control methods employed on journal full texts. An SVM model was developed to discriminate sentences in the abstracts containing functional, structural and classification information that are relevant to the protein classification task. The SVM model was used to remove text that is irrelevant to the classification task, thus reducing noise and increasing the prevalence of informative terms, allowing for more accurate classification predictions. The SVM model presented is novel and optimised for the discrimination of sentences in text that contain useful information for protein classification.
The text similarity algorithm was first optimised using a previously described gold standard dataset. Several conditions were tested, such as the inclusion of additional text from related articles and from UniProt and Protein Data Bank annotations, as well as the filtering of the abstracts using the SVM model. The optimal conditions were then applied in text comparisons within a much larger collection of documents relating to proteins already classified at the superfamily (H) level of the CATH database. Although the structural similarity classifier performed best, text similarity was useful in a logistic regression model that combined the two classifiers.
The method is unique in incorporating structural similarity searches with text mining directly. In a dataset of 'borderline' proteins, which lack clear structural and sequence similarity to classified proteins, the combination of the structural and text similarity classifiers improved coverage by up to 50% at low error rates (10⁻³) compared to the structural classifier alone. Additionally, it made an extra 4.2% of correct classification decisions when only the highest scoring predictions were used to infer classification. The method is therefore useful for the challenging task of classifying such 'borderline' cases, which usually require manual curation involving time-consuming study of the relevant literature.
Application to CATH database
The CATH database is a hierarchical classification of protein domain structures in PDB. There are four major levels in this hierarchy: Class, Architecture, Topology (Fold family) and Homologous superfamily. The latter level groups together domains where evidence from sequence, structure and function similarity suggest they have evolved from a common ancestor. Classification is currently guided by structure and sequence similarity measures such as CATHEDRAL , SSAP , profile HMMs  and manual procedures such as literature analysis. The SSAP structure comparison method uses a double dynamic programming algorithm to align two protein structures.
In CATH, an existing classification is inherited if a protein domain displays "clear" structural and sequence similarity (over 35% sequence identity and/or SSAP score over 80) with a classified domain. However, if there is no domain in the database that fulfils these requirements, the classification is performed manually by considering the results of structural and sequence similarity and analysis of relevant literature. The literature often contains references to the function of the protein and even which evolutionary family it is thought to belong to. It is this information that can be highly useful for classification and lends itself to a text mining approach.
Combined structure and text classifier outperforms structural similarity in protein classification of 'borderline' cases in CATH
Table: Performance of the structure and text classifiers and of the logistic regression models (including SSAP + TEXT) on 'borderline' proteins from CATH.
Table: Coefficients and Wald tests for the logistic regression model.
Figure: Comparative classifier performance in protein classification.
Table: Number of classification matches (CATH superfamily true positives) at various false positive rates in the 'borderline' DC1.1993 dataset, for the SSAP, TEXT and SSAP + TEXT classifiers.
The data in Figures 2A and 2B and Table 3 demonstrate that at low false positive rates, 10⁻⁶ and 10⁻⁵, all classifiers display low coverage. At a higher false positive rate (10⁻³), the SSAP, TEXT and SSAP+TEXT classifiers detect 10%, 4.21% and 15.33% of the true matches, respectively. The SSAP+TEXT classifier consistently outperforms the structural classifier used on its own, especially in the low error range, with more than double the number of matches found at error rates below 10⁻⁴. The TP-FP breakeven point is the point at which the number of true positives equals the number of false positives; for SSAP+TEXT on the test set, it is reached after 2098 matches are detected (12.51%). Both SSAP and TEXT predict more false than true positives at all cutoffs in the 'borderline' dataset.
Aside from the ability of each classifier to separate true from false positives across the range of scores, we were also interested in the power of each method to assign correct structural classifications to proteins of the 'borderline' set in CATH using the top scoring hit. Only the best scoring domain comparison was considered for each of the 2436 domains of the DC1.1993 set (tophit), in order to simulate classification tasks. The combined SSAP+TEXT classifier increased coverage by 4.2%, capturing 1694 (69.5%) correct classifications relative to 1591 (65.3%) by SSAP alone.
Our results demonstrate that although much work is required to address fully-automated classification of proteins that lack clear structural or sequence similarity, the combination of text and structure similarity can improve classification performance for 'borderline' cases. When text and structural similarity scores were compared for their classification power in a set of simulated 'borderline' cases of the CATH protein structure database, structural similarity scores performed best in the H (superfamily) level of CATH. This may be expected for structural databases, where structural relationships primarily define class membership. Although the SSAP scores perform better than text similarity in classification as single predictors, the performance improved considerably when structure and text similarity classifiers were combined in a logistic regression model.
On a dataset of 'borderline' cases, a significant improvement in coverage, by up to 50% at low error rates, was observed. For equivalent coverage, the combined classifier achieves a significant reduction in the number of false positives compared to SSAP. For instance, when SSAP captures 1000 true matches, it also includes 1612 false positives, while the SSAP+TEXT classifier includes 559, a 65% reduction. Moreover, the combined classifier made an additional 4.2% of correct classification predictions based only on the top scoring hits, compared to SSAP alone. The combined classifier is therefore a better starting point than SSAP on its own when classifying a 'borderline' protein. Although the error rate is not sufficiently low to enable fully automatic classification of 'borderline' cases, the combined classifier can have practical application in the support and evaluation of manual protein domain classification assignments. Further development of the text similarity algorithm, using a more elaborate weighting scheme combined with more efficient text retrieval and information extraction, is required to improve performance further.
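The combination of the two scores in a logistic regression model can be sketched as follows. The intercept and coefficients below are illustrative placeholders, not the fitted values of the model described here:

```python
import math

def combined_score(ssap, text_sim, b0=-9.0, b_ssap=0.09, b_text=0.05):
    """Logistic model combining a structural (SSAP) score and a text
    similarity score into a single probability of a true superfamily
    match. The coefficients are placeholders for illustration only;
    in practice they would be fitted on labelled domain comparisons."""
    z = b0 + b_ssap * ssap + b_text * text_sim
    return 1.0 / (1.0 + math.exp(-z))
```

Because the logistic model is monotone in each input, a pair with moderate structural similarity can still be ranked highly when its text similarity is strong, which is the source of the coverage gain reported above.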
The text similarity algorithm was benchmarked and optimised using a manually curated dataset of enzyme superfamilies before being applied to a large dataset of 'borderline' cases in CATH. Removal of sentences semantically irrelevant to protein classification using an SVM model improved the performance of the text classifier, while reducing the dimensionality of the feature space of the CATH document reference set by 32.7% and speeding up calculations. Although all terms were used in this implementation, performance may improve further with feature selection based on thresholding metrics, such as document frequency, information gain, mutual information or chi-square, and with more involved weighting schemes. For example, the weights of a specific set of highly informative terms could be boosted; for the CATH database reference set, these terms would include selected names and synonyms of the superfamilies and their protein members, and function annotations from UniProt. Performance should also benefit from a more sophisticated term generation scheme that accounts for synonymy.
All words of the abstracts, including title and authors, were used in our method. Abstracts typically contain only a brief description of the protein structure; a more detailed and extensive description of important features may only be given in the main article, so full text is expected to yield better results. Irrelevant sentences can be removed by classifying each sentence of the full text article using the SVM filter. The inclusion of additional abstracts from related articles in the text related to each protein of the reference set improved performance: abstracts related to the main reference usually contain additional terms that are likely to be characteristic of each class. As full text articles are not always available in public databases, this is an alternative way to gather additional relevant text for each protein and mitigate the lack of full text.
A broader selection of texts, including abstracts of relevant articles sourced from UniProt and GenBank records, may provide more comprehensive text sources for the query proteins. Annotations from PDB and UniProt may also be used instead of, or in addition to, abstracts for the query proteins. In this implementation, abstracts were used as the text relating to the query proteins of the test set because they are more readily available. Moreover, it is common for annotations to contain technical terms (for instance '3D structure'), which may bias the text similarity classifier. Annotations also lack consistency and reliability, because not all proteins are fully annotated yet, while several annotations are propagated on the basis of high homology rather than experimental observation.
Structural database entries are protein domains; however, domains lack a one-to-one relationship with PubMed abstracts. The method assumes that each abstract carries information on all domains of its related protein, whereas in reality certain domains within a protein frequently attract greater interest, and text, than others. Inclusion of additional abstracts or full text articles related to the query protein will increase the likelihood that information relevant to all domains of the protein is present in the text.
Combination of the text similarity algorithm with other structural similarity algorithms, such as CE, DALI and MSDFold, via logistic regression or machine learning based methods may provide a useful performance benefit. The same methodology, with little alteration, could also be used to improve fold recognition and structure prediction results. Moreover, the text similarity algorithm may prove valuable in protein classification tasks where a more accurate function classifier is not available. For example, it may be useful in enzyme classification: in the Enzyme Commission classification system, enzymes are classified according to the chemical reaction they catalyse, so text similarity may be a more appropriate classifier than structure or sequence similarity for this database.
Finally, we demonstrated the practical application of a text based classifier in protein structure database curation. The model resulting from its combination with the structural classifier is superior to the structural classifier alone, thus providing an improved way to classify 'borderline' proteins in the CATH protein structure database. Although in the context of full automation the improvements might at first sight look relatively moderate, it is important to bear in mind that in the principal application domain, the textual results are intended to be used by manual curators or users of a structure classification server as a guide to manual classification. Put another way, we need the textual results to be confirmatory of the structural similarity results rather than entirely novel. The fact that our text classification scheme reproduces around 88% of the purely structure-based AUC, and in combination increases the AUC by a small but significant amount, shows that we are indeed extracting the most relevant texts and that some of these texts are key to making better informed decisions on superfamily membership. A user of a hybrid system would therefore be provided with a highly relevant shortlist of texts from which he or she can make an informed decision as to the correct superfamily classification of the protein being analysed. In other words, the fact that we can improve automatic classification is far less important than the fact that we are able to select the relevant texts, which can be further analysed by the user in semi-automatic classification. Consequently, we are planning to build this text classification functionality into the user interfaces of our existing fold recognition web servers: both the sequence-based GenTHREADER and the structure-based CATHEDRAL server.
Overall, our findings show that combining a structural similarity measure with a text mining approach is useful, and demonstrate the value of text based approaches in protein classification.
In summary, a text based classifier was developed and implemented for the classification of proteins in the CATH database. Although structural similarity scores perform better than text in the classification of proteins in structure databases, we showed that combining the structure and text classifiers in a logistic regression model provides a more powerful classifier, significantly increasing coverage, especially at low error levels, compared to using structural similarity alone. The benefit is particularly useful in cases where structural similarity is not high enough to be conclusive.
We found that, for 'borderline' matches with SSAP scores below 80, which are notoriously difficult to classify, it is preferable to use the combined structure and text similarity classifier than SSAP alone. This result should be useful in the development of servers which aim to classify proteins automatically and reliably.
Text similarity algorithm
The value of c is high when the compared documents share several rare words. For ease of calculation, c was transformed to the range 0–100.
Lucene: Lucene is a high performance, scalable search engine library written entirely in Java. It is available as open source software under the Apache License and offers powerful features including text indexing, searching and term vector generation. It provides a selection of analysers, including the Simple, Standard and Stop analysers used in this project; analysers vary in how they generate terms from text, remove stop words, lowercase, and so on.
R: ROC curves were plotted, and AUC, F-measure and Matthews correlation coefficient calculated, using R's ROCR library. Logistic regression models were generated using the Design package for R.
MCC is a measure of overall classifier accuracy. A value of 0 indicates random performance, whilst a value of 1 implies perfect classification. The F-measure is the weighted harmonic mean of precision (prec) and recall (rec). For α = 0.5, which is the value of α used in the assessment of this method, the mean is balanced.
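Both measures can be computed directly from the confusion-matrix counts. A minimal sketch (Python for illustration; the assessment itself used R's ROCR):

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient: 0 indicates random
    performance, 1 indicates perfect classification."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f_measure(tp, fp, fn, alpha=0.5):
    """Weighted harmonic mean of precision and recall;
    alpha = 0.5 gives the balanced measure used in this assessment."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 1.0 / (alpha / prec + (1.0 - alpha) / rec)
```

With alpha = 0.5 the F-measure reduces to the familiar 2PR/(P+R), weighting precision and recall equally.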
Datasets: REGS352 and PDB145
REGS352 and PDB145 datasets
Another set of abstracts (PDB145) was derived from the primary references (JRNL field) of the PDB structures that correspond to the REGS352 sequences. The primary references of the 282 PDB files described in the gold dataset of enzyme superfamilies were checked in order to remove PDB files without references published in MEDLINE, and files with identical references to avoid duplicate references appearing in the set. The filtered set consists of 145 PDB files, each with a unique reference, and their distribution across the enzyme families and superfamilies of the gold set is shown in Table 4.
Document retrieval and analysis
Abstracts relating to a protein were downloaded from PubMed using a set of Perl scripts. The title and authors with their affiliations were also collected with the body of the abstract. The primary reference for each protein was either specified by a PMID contained in the corresponding Genbank or UniProt annotation, or, if a structure for the protein was available, in the JRNL field of the relevant PDB file. Additional abstracts were collected from the 'Related Articles' hyperlink of PubMed.
To perform text comparisons, the documents were converted into IDF weighted term vectors using one of Lucene's analysers. The cosine of the angle between the normalised text vectors was used to assess the similarity between text documents and was calculated according to equation (2).
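A minimal sketch of IDF weighting and cosine similarity over sparse term vectors (Python for illustration; the actual implementation used Lucene's term vector generation, and the example documents here are invented):

```python
import math
from collections import Counter

def idf_vectors(docs):
    """Convert tokenised documents into IDF-weighted sparse term vectors."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts); scaling
    by 100 would reproduce the 0-100 score range used in this work."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    denom = norm(u) * norm(v)
    return dot / denom if denom else 0.0

# Invented example documents, already tokenised.
docs = [["kinase", "atp", "phosphorylation"],
        ["kinase", "membrane", "receptor"],
        ["ribosome", "rna", "translation"]]
vecs = idf_vectors(docs)
```

Note that a term occurring in every document receives an IDF of zero, which is the sense in which shared rare words dominate the score.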
Using machine learning to screen for informative sentences
Example sentences used in training of the SVM model
Sequence analysis showed that pENO2 shares 75.6% nucleotide and 89.5% deduced amino acid sequence identity with pENO1 and is encoded by a distinct gene.
The packing of the octameric enzyme in this crystal form is unusual, because the asymmetric unit contains three subunits.
Cys-592, which is essential for enzymatic activity, is located in the above-mentioned histidine-rich region.
From the significant sequence similarity between intradiol enzymes, it has been shown that intradiol enzymes evolved from a common ancestor.
Two 2,3-dihydroxybiphenyl (23DHBP) dioxygenase genes, bphC1 and etbC involved in the degradation of polychlorinated biphenyl(s) (PCBs) have been isolated and characterized from a strong PCB degrader, Rhodococcus sp.
A thermostable hydantoinase of Bacillus stearothermophilus NS1122A: cloning, sequencing, and high expression of the enzyme gene, and some properties of the expressed enzyme.
A catechol 2,3-dioxygenase gene in chromosomal DNA of P. putida KF715 was cloned and its nucleotide sequence analyzed.
The K+ ion activates the enzyme 100-fold with an activation constant of 6 mM, well below the physiologic concentration of K+ in E. coli.
A putative regulator and its possible recognition site was suggested on the basis of homology data.
The enzyme has a subunit Mr of 33,500 +/- 2000 by SDS/polyacrylamide-gel electrophoresis.
Example sentences used in testing of the SVM model
These homologous proteins, designated the "enolase superfamily", include enolase as well as more metabolically specialized enzymes: mandelate racemase, galactonate dehydratase, glucarate dehydratase, muconate-lactonizing enzymes, N-acylamino acid racemase, beta-methylaspartate ammonia-lyase, and o-succinylbenzoate synthase.
GlucD is a member of the mandelate racemase (MR) subgroup of the enolase superfamily, the members of which catalyze reactions that are initiated by abstraction of the alpha-proton of a carboxylate anion substrate.
The structure of Neurospora crassa 3-carboxy-cis, cis-muconate lactonizing enzyme, a beta propeller cycloisomerase.
The corresponding cDNA was amplified from a library of lobster muscle cDNA, and a sequence corresponding to residues 27–398 was determined.
The values for kcat were reduced 4.5 × 10³-fold for (R)-mandelate and 2.9 × 10⁴-fold for (S)-mandelate; the values for kcat/Km were reduced 3 × 10⁴-fold.
Protein, family and superfamily names were removed from the positive and negative examples to avoid unfavourable training of the model on these terms. SVMs were used to learn the features of the training set and classify new unseen sentences. Binary and frequency feature representations, as well as a range of values of the c parameter of SVMlight from 0.1 to 2, were tested to identify the optimal SVM model. The frequency representation gave better results and was selected over binary; the optimal c value on our dataset was 0.25. Leave-one-out cross-validation of the optimal SVM model estimated recall at 83.45% and precision at 82.88%. The SVM model was used to classify each sentence in the abstracts as relevant or not relevant, and only relevant sentences were used in the text comparisons.
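The difference between the binary and frequency representations can be illustrated as follows (the vocabulary and sentence are hypothetical; the actual feature vectors were generated for SVMlight):

```python
from collections import Counter

def features(sentence, vocab, binary=False):
    """Map a sentence to a feature vector over a fixed vocabulary.
    binary=True records presence/absence of each term; the default
    records term frequency, the representation that performed best here."""
    counts = Counter(sentence.lower().split())
    return [min(counts[t], 1) if binary else counts[t] for t in vocab]

# Hypothetical vocabulary and example sentence for illustration.
vocab = ["enzyme", "structure", "superfamily"]
sentence = "The enzyme cleaves the enzyme substrate"
```

The frequency representation preserves how often an informative term recurs within a sentence, which the binary representation discards.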
Text similarity algorithm optimisation
Table: Classifier performance in the enzyme dataset (N = 352, 5 classes; conditions tested include Dp20 – DX33 -Ann and Dp20 – Ann).
(1) SVM filtering
The SVM model was used to filter the abstracts of the PDB145 reference set for informative sentences. The SVM filtered reference set performed better (F measure: 0.56) than the set of intact abstracts (0.53).
(2) Retrieval of related abstracts
Perl scripts were implemented to download 19 additional abstracts for each primary reference from the 'Related Articles' hyperlink of PubMed. This PubMed function identifies other publications in MEDLINE that resemble the primary reference; retrieval is based on evaluating the similarity between two documents over all vocabulary terms using a probabilistic model. The inclusion of the 19 related articles in the text related to each protein of the reference set improved performance (AUC 0.74 for 20 abstracts versus 0.64 for 1 abstract), presumably as a result of additional class-specific words present in the related abstracts.
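Today the same 'Related Articles' neighbours would typically be retrieved through the NCBI E-utilities ELink service rather than scripts against the web interface. A hedged sketch follows; the endpoint and the `pubmed_pubmed` link name are taken from current NCBI documentation, not from the authors' original Perl scripts, and the actual fetching and XML parsing are omitted:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def related_articles_url(pmid):
    """Build an NCBI E-utilities ELink query for the PubMed
    neighbours ('Related Articles') of a given PMID."""
    params = {"dbfrom": "pubmed", "db": "pubmed",
              "linkname": "pubmed_pubmed", "id": str(pmid)}
    return EUTILS + "?" + urlencode(params)
```

Fetching this URL returns the neighbour PMIDs ranked by the same probabilistic similarity model described above, from which the top-scoring abstracts can then be downloaded.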
(3) Text analysis
Three types of text analysis were tested, using Lucene's Standard, Stop and Simple analysers. All analysers lowercase text, but only the first two remove stop words, and the analysers differ in how they split text into terms. The Standard analyser has rules for acronyms, hostnames, etc., which are retained intact as single terms, while the Stop and Simple analysers split words at special characters. The Standard analysis produces a large number of uniquely found, non-informative terms compared with the shorter, fewer and more abundant terms generated by the Stop analyser. Performance improved when words were split at special characters as well as whitespace (AUC 0.74) rather than at whitespace only (0.70). It also benefited from the removal of 'stop' words, common words that occur at similar frequency across the reference set. There was no improvement when the Porter stemming algorithm was applied (results not shown); however, other stemmers such as the Krovetz stemmer may be more suitable.
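The tokenisation behaviours compared above can be approximated in a few lines (an illustrative re-implementation, not Lucene itself; the stop-word list is a small invented subset):

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "and", "in", "is", "was"}  # invented subset

def simple_split(text):
    """Whitespace-only tokenisation: compound terms such as
    '2,3-dioxygenase' are kept intact as single terms."""
    return [w.lower() for w in text.split()]

def stop_analyse(text):
    """Split at non-alphanumeric characters and drop stop words,
    approximating the behaviour of Lucene's Stop analyser."""
    return [w for w in re.split(r"[^a-z0-9]+", text.lower())
            if w and w not in STOP_WORDS]
```

Splitting at special characters breaks compound chemical names into shorter, more frequent terms that are shared across related abstracts, which is consistent with the AUC improvement reported above.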
(4) Inclusion of annotations
A set of Perl scripts was used to retrieve annotations from the PDB (Title, Classification fields) and UniProt (Protein names, Synonyms, Function, Keywords fields) databases and concatenate them with the abstracts of the reference set. The inclusion of annotations from the UniProt and PDB databases produced a negligible improvement over the performance of the plain 20 abstracts (MCC 0.51 with annotations versus 0.50 without). The method was also tested using UniProt annotations instead of abstracts in the test set: when the relevant UniProt fields (Protein names, Synonyms, Function, Keywords) were used instead of abstracts as text relating to the query proteins, performance improved (F measure 0.58, versus 0.56 for abstracts). However, abstracts were used in this implementation because annotations are not immediately available in the databases, in contrast to abstracts, which usually accompany structural biology papers describing protein structures upon their release.
CATH abstracts retrieval
A reference set of abstracts was assembled to assess the ability of the method to classify domains at the homologous superfamily level of CATH. CATH v3.1 contains structural classifications for domains from 30028 PDB files. Using the PubMed IDs (PMIDs) given in the JRNL field of the PDB files of protein domains already classified at the superfamily level of CATH, a total of 15154 unique abstracts, together with their related articles from PubMed and annotations from UniProt and PDB, were downloaded, analysed, indexed and filtered for informative sentences using the SVM model. This document set (the textCATH set) provides primary abstracts for 27069 PDB files (90.1%), since several PDB files share the same primary reference. The textCATH set included 139171 terms and covered 1806 of the 2090 superfamilies in CATH. The textCATH reference corpus is available for download from our website http://bioinf.cs.ucl.ac.uk/downloads/textCATH.
An additional reference set was compiled from the intact abstracts, without any filtering, in order to assess the effect of the SVM model on the performance of the system. This set was also analysed using the Stop analyser and contained 204904 terms.
'Borderline' cases dataset
In order to compare the performance of the structure and text similarity measures for protein classification at the superfamily level of CATH, an all-versus-all structural comparison using SSAP was performed on the protein domains corresponding to the 15154 abstracts of the textCATH set. The majority of these protein domains have homologues in the database displaying clear structural similarity (SSAP > 80). These 'easy' pairs were ignored in the benchmark in order to focus on protein domains without clear homologues: only domains whose matches in the database displayed SSAP scores below 80 and sequence identity below 30% were retained, simulating 'borderline' classification assignments. As a result, there were 2436 'borderline' protein domains with a total of 6207493 comparisons, of which 33577 were true matches (0.54%). The set of 1993 unique primary references contained in the PDB files of these domains comprises the DC1.1993 set.
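The selection of 'borderline' domains can be sketched as follows (the data layout is hypothetical; SSAP itself is a separate structural alignment program whose scores are assumed precomputed):

```python
def borderline_domains(all_vs_all):
    """all_vs_all: dict mapping each domain to a list of
    (matched_domain, ssap_score, seq_identity_percent) tuples."""
    borderline = []
    for domain, matches in all_vs_all.items():
        # Retain only domains with no clear homologue: every match must fall
        # below SSAP 80 and below 30% sequence identity
        if all(ssap < 80 and seq_id < 30 for _, ssap, seq_id in matches):
            borderline.append(domain)
    return borderline
```

A single clear homologue (SSAP > 80 or identity > 30%) is enough to exclude a domain, since such a case would not need manual intervention.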
Logistic regression was used to combine the similarity scores:

ln(p/(1 − p)) = β0 + β1x1 + β2x2 + ... + βixi

where p is the probability of a classification match, and x1, x2,..., xi are the explanatory, independent variables, which in this case are the SSAP structural similarity and the text similarity (TEXT) scores. The resulting score is thus defined as the natural logarithm of the odds.
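A minimal sketch of the combined score, assuming placeholder coefficients (the article fits these by logistic regression on the training comparisons; the values below are illustrative assumptions, not the fitted values):

```python
import math

# Placeholder coefficients -- NOT the values fitted in the article
B0, B_SSAP, B_TEXT = -8.0, 0.06, 2.5

def log_odds(ssap, text):
    # Combined classifier score: the natural logarithm of the odds of a match
    return B0 + B_SSAP * ssap + B_TEXT * text

def match_probability(ssap, text):
    # Invert the logit to recover the match probability p
    return 1.0 / (1.0 + math.exp(-log_odds(ssap, text)))
```

Because the score is linear in the inputs, text similarity can raise a weak structural match above a decision threshold, which is the mechanism by which the combined classifier gains coverage over structure alone.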
For validation purposes, the set of comparisons was split into a training set using the comparisons of 1000 randomly selected abstracts and a test set comprising the comparisons of the remaining 993 abstracts. The training and test sets included 1224 and 1212 domains from the DC1.1993 'borderline' cases dataset, respectively. There were 3130887 (3076606) domain pair comparisons in the training (test) set, of which 16812 (16765) were domain pairs in the same CATH superfamily or true matches and the remaining 3114075 (3059841) were non-homologous pairs or non-matches.
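The abstract-level split, which keeps all comparisons belonging to one abstract in the same partition, can be sketched as (the dictionary layout is a hypothetical stand-in for the comparison data):

```python
import random

def split_by_abstract(comparisons, n_train=1000, seed=0):
    """comparisons: dict mapping abstract ID -> its list of domain comparisons.
    Sampling whole abstracts avoids leaking a document between train and test."""
    ids = sorted(comparisons)          # deterministic base order
    random.Random(seed).shuffle(ids)   # seeded shuffle for reproducibility
    train = {a: comparisons[a] for a in ids[:n_train]}
    test = {a: comparisons[a] for a in ids[n_train:]}
    return train, test
```

Splitting by abstract rather than by individual comparison is what makes the test-set error estimates honest: no text seen in training reappears at test time.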
The predictive ability of the models was assessed using the Nagelkerke R2 measure. The R2 statistic is a measure of effect size and indicates how useful the explanatory variables are in predicting the response variable.
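Nagelkerke's R2 rescales the Cox–Snell R2 so that its maximum attainable value is 1; given the log-likelihoods of the null and fitted models over n cases it can be computed as:

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's R^2 from null and fitted model log-likelihoods."""
    # Cox-Snell R^2: 1 - (L_null / L_model)^(2/n), in log-likelihood form
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Maximum attainable Cox-Snell value, 1 - L_null^(2/n)
    max_attainable = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_attainable
```

A model no better than the null gives 0, and a model that fits perfectly approaches 1, which is what makes the statistic comparable across the structure-only and combined models.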
AK was funded by BBSRC grant BBC5072531 and the BioSapiens Network of Excellence (funded by the European Commission within its FP6 programme, under the thematic area Life Sciences, Genomics and Biotechnology for Health), grant number: LSHG-CT-2003-503265. We are grateful to Michael Sadowski, Anna Lobley and Yvonne Edwards for useful discussions.
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH – A Hierarchic Classification of Protein Domain Structures. Structure 1997, 5:1093–1108. 10.1016/S0969-2126(97)00260-8
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 1995, 247:536–540.
- Vapnik VN: The Nature of Statistical Learning Theory. New York: Springer; 1995.
- Joachims T: Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning. Heidelberg: Springer-Verlag; 1998:137–142.
- Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, et al.: PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 2003, 4:11. 10.1186/1471-2105-4-11
- Stapley BJ, Kelley LA, Sternberg MJ: Predicting the sub-cellular location of proteins from text using support vector machines. Pac Symp Biocomput 2002:374–385.
- Rice SB, Nenadic G, Stapley BJ: Mining protein function from text using term-based support vector machines. BMC Bioinformatics 2005, 6(Suppl 1):S22. 10.1186/1471-2105-6-S1-S22
- Chen D, Muller H-M, Sternberg PW: Automatic document classification of biological literature. BMC Bioinformatics 2006, 7:370. 10.1186/1471-2105-7-370
- Miaczynska M, Christoforidis S, Giner A, Shevchenko A, Uttenweiler-Joseph S, Habermann B, Wilm M, Parton RG, Zerial M: APPL proteins link Rab5 to nuclear signal transduction via an endosomal compartment. Cell 2004, 116:445–456. 10.1016/S0092-8674(04)00117-5
- MacCallum RM, Kelley LA, Sternberg MJE: SAWTED: Structure assignment with text description – Enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics 2000, 16:125–129. 10.1093/bioinformatics/16.2.125
- Bradshaw CR, Surendranath V, Habermann B: ProFAT: a web-based tool for the functional annotation of protein sequences. BMC Bioinformatics 2006, 7:466. 10.1186/1471-2105-7-466
- Blaschke C, Leon EA, Krallinger M, Valencia A: Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics 2005, 6(Suppl 1):S16. 10.1186/1471-2105-6-S1-S16
- Couto FM, Silva MJ, Coutinho PM: Finding genomic ontology terms in text using evidence content. BMC Bioinformatics 2005, 6(Suppl 1):S21. 10.1186/1471-2105-6-S1-S21
- Couto FM, Silva MJ, Lee V, Dimmer E, Camon E, Apweiler R, Kirsch H, Rebholz-Schuhmann D: GOAnnotator: linking protein GO annotations to evidence text. Journal of Biomedical Discovery and Collaboration 2006, 1:19. 10.1186/1747-5333-1-19
- Ruch P: Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics 2006, 22:658–664. 10.1093/bioinformatics/bti783
- Gaudan S, Jimeno Yepes A, Lee V, Rebholz-Schuhmann D: Combining evidence, specificity, and proximity towards the normalization of Gene Ontology terms in text. EURASIP Journal on Bioinformatics and Systems Biology 2008:342746.
- Brown SD, Gerlt JA, Seffernick JL, Babbitt PC: A gold standard set of mechanistically diverse enzyme superfamilies. Genome Biology 2006, 7(1):R8. 10.1186/gb-2006-7-1-r8
- Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, et al.: The Universal Protein Resource (UniProt). Nucleic Acids Research 2005, 33:D154–D159. 10.1093/nar/gki070
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Research 2000, 28:235–242. 10.1093/nar/28.1.235
- Redfern OC, Harrison A, Dallman T, Pearl FM, Orengo CA: CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Computational Biology 2007, 3(11):e232. 10.1371/journal.pcbi.0030232
- Taylor WR, Orengo CA: Protein structure alignment. Journal of Molecular Biology 1989, 208:1–22. 10.1016/0022-2836(89)90084-3
- Reid AJ, Yeats C, Orengo CA: Methods of remote homology detection can be combined to increase coverage by 10% in the midnight zone. Bioinformatics 2007, 23(18):2353–2360. 10.1093/bioinformatics/btm355
- Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology 1998, 284:1201–1210. 10.1006/jmbi.1998.2221
- Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 1998, 11:739–747. 10.1093/protein/11.9.739
- Holm L, Sander C: Mapping the protein universe. Science 1996, 273:595–603. 10.1126/science.273.5275.595
- Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 2004, D60:2256–2268. 10.1107/S0907444904026460
- The PSIPRED Protein Structure Prediction Server [http://bioinf.cs.ucl.ac.uk/psipred]
- The CATHEDRAL server [http://www.cathdb.info/cgi-bin/CathedralServer.pl]
- Wilbur WJ, Yang YM: An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts. Computers in Biology and Medicine 1996, 26:209–222. 10.1016/0010-4825(95)00055-0
- The R Project for Statistical Computing [http://www.r-project.org]
- Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics 2005, 21:3940–3941. 10.1093/bioinformatics/bti623
- Harrell FE Jr: Regression Modeling Strategies: with Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer; 2001.
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 2003, 31(1):365–370. 10.1093/nar/gkg095
- Joachims T: Making large-scale SVM learning practical. In Advances in Kernel Methods – Support Vector Learning. Edited by Schölkopf B, Burges CJC, Smola AJ. Cambridge, MA: MIT Press; 1999:41–56.
- Lin J, Wilbur WJ: PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics 2007, 8:423. 10.1186/1471-2105-8-423
- Porter MF: An algorithm for suffix stripping. Program 1980, 14:130–137.
- Krovetz R: Viewing morphology as an inference process. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Pittsburgh; 1993:191–203.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.