Improving classification in protein structure databases using text mining

BMC Bioinformatics

Table 1 Performance of structure, text classifiers and logistic regression models in 'borderline' proteins from CATH

Performance of structural similarity, as measured by the SSAP algorithm, and of text similarity (TEXT) as classifiers of proteins at the homologous superfamily level in DC1.1993, the dataset of 'borderline' cases in CATH, using textCATH as the reference set. Classification performance was assessed with the area under the ROC curve (AUC) and the Matthews correlation coefficient (MCC) on the whole set (a), the training set (b), and the test set (c). Nagelkerke's R² is a measure of the variance accounted for by the variables of the logistic regression models. Model L.R. stands for the model likelihood ratio chi-square, which is the difference between the null and residual deviance. Logistic regression models were trained on a random subset of comparisons from 1000 abstracts and tested on the remaining 993 abstracts from the 'borderline' cases dataset DC1.1993.
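As a rough illustration of the four quantities named in this caption, the following sketch computes them from their standard textbook definitions. It is not the authors' code: the function names and the toy inputs are assumptions made for illustration, and the Nagelkerke formula assumes ungrouped binary outcomes, so that each deviance equals -2 times the corresponding log-likelihood.

```python
import math


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


def model_lr(null_deviance, residual_deviance):
    """Model likelihood ratio chi-square: null minus residual deviance."""
    return null_deviance - residual_deviance


def nagelkerke_r2(null_deviance, residual_deviance, n):
    """Nagelkerke's R²: Cox-Snell R² rescaled by its maximum value.

    Assumes deviance = -2 * log-likelihood (ungrouped binary data).
    """
    cox_snell = 1.0 - math.exp(-(null_deviance - residual_deviance) / n)
    max_cox_snell = 1.0 - math.exp(-null_deviance / n)
    return cox_snell / max_cox_snell


# Toy inputs, invented purely for illustration (not values from Table 1).
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
print(auc(toy_scores, toy_labels))
print(mcc(tp=5, tn=5, fp=0, fn=0))
print(model_lr(100.0, 60.0))
print(nagelkerke_r2(100.0, 60.0, n=100))
```

The pairwise AUC is quadratic in the number of comparisons, which is fine for a sketch; a rank-based formulation would be preferable for a dataset the size of DC1.1993.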

ISSN: 1471-2105