Outline of the text-mining protocol
The TM procedure was tested on 579 protein-protein complexes (bound X-ray structures purged at the 30% sequence identity level) from the Dockground resource (http://dockground.compbio.ku.edu) [38]. The basic stage of the procedure consists of two major steps: information retrieval and information extraction [37] (Fig. 1). The abstracts are retrieved from PubMed using the NCBI E-utilities tool (http://www.ncbi.nlm.nih.gov/books/NBK25501), requiring that either the names of both proteins (AND-query) or the name of one protein in a complex (OR-query) be present in the abstract. The text of the retrieved abstracts is then processed for the residue names. The structures of the individual proteins are used to filter the pool of the extracted residues by: (i) correspondence of the name and the number of the extracted residues to those in the Protein Data Bank (PDB) file, and (ii) presence of the extracted residue on the surface of the protein. Several NLP-based approaches (semantic similarity to generic and specialized keywords, parse tree analysis with or without SVM enhancement) were further applied for additional filtering of the residues extracted from the abstracts retrieved by the OR-queries. Performance of the TM protocol for a particular PPI, for which N residue-containing abstracts were retrieved, is evaluated as
$$ P_{\mathrm{TM}} = \frac{\sum_{i=1}^{N} N_i^{\mathrm{int}}}{\sum_{i=1}^{N} \left( N_i^{\mathrm{int}} + N_i^{\mathrm{non}} \right)}, $$
(1)
where \( N_i^{\mathrm{int}} \) and \( N_i^{\mathrm{non}} \) are the numbers of interface and non-interface residues, respectively, that are mentioned in abstract i for this PPI and not filtered out by a specific algorithm (if all residues in an abstract are purged, the abstract is excluded from the \( P_{\mathrm{TM}} \) calculation). It is convenient to compare the performance of two residue-filtering algorithms in terms of
$$ \Delta N\left(P_{\mathrm{TM}}\right) = N_{tar}^{X_1}\left(P_{\mathrm{TM}}\right) - N_{tar}^{X_2}\left(P_{\mathrm{TM}}\right), $$
(2)
where \( N_{tar}^{X_1}\left(P_{\mathrm{TM}}\right) \) and \( N_{tar}^{X_2}\left(P_{\mathrm{TM}}\right) \) are the numbers of targets with a given \( P_{\mathrm{TM}} \) value yielded by algorithms X1 and X2, respectively. The values N(0) and N(1), i.e. the numbers of targets with \( P_{\mathrm{TM}} = 0 \) and \( P_{\mathrm{TM}} = 1 \), capture the general shape of the \( P_{\mathrm{TM}} \) distribution. Thus, the effectiveness of an algorithm can be judged by its ability to reduce N(0) (all false positives) and increase N(1) (all true positives). In this study, advanced residue filtering algorithms are applied to the pool of residues extracted by the OR-queries with the basic residue filtering; thus, X2 will hereafter refer to the basic algorithm. Negative values of ΔN(0) and positive values of ΔN(1) indicate successful purging of irrelevant residues from the mined abstracts.
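As an illustration of Eqs. 1 and 2, the following minimal Python sketch computes the per-target performance value and the difference in target counts between two filtering algorithms. The function names, the input layout (a list of per-abstract interface/non-interface counts) and the numerical tolerance are assumptions made for this example only and are not part of the original protocol.

```python
def p_tm(abstracts):
    """Eq. 1 for one target.

    abstracts: list of (n_int, n_non) counts per abstract, taken AFTER
    residue filtering (hypothetical input layout).
    Returns None when no residues survive in any abstract.
    """
    # Abstracts whose residues were all purged are excluded, as in the text.
    kept = [(n_int, n_non) for n_int, n_non in abstracts if n_int + n_non > 0]
    total = sum(n_int + n_non for n_int, n_non in kept)
    if total == 0:
        return None
    return sum(n_int for n_int, _ in kept) / total


def n_targets(p_values, value, tol=1e-9):
    """Number of targets whose P_TM equals `value` (within a tolerance)."""
    return sum(1 for p in p_values if p is not None and abs(p - value) <= tol)


def delta_n(p_values_x1, p_values_x2, value):
    """Eq. 2: difference in target counts at a given P_TM value."""
    return n_targets(p_values_x1, value) - n_targets(p_values_x2, value)
```

With X2 taken as the basic filtering, a call such as `delta_n(advanced, basic, 0.0)` returning a negative number and `delta_n(advanced, basic, 1.0)` returning a positive number corresponds to the successful purging described above.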
Selection of keywords
Generic keywords semantically closest to the PPI-specific concept keywords (see Results) were found using the Perl module QueryData.pm. The Perl modules lesk.pm, lin.pm and path.pm were used to calculate the Lesk [39, 40], Lin [41] and Path [42, 43] similarity scores, respectively, between the tokens (words) in a residue-containing sentence and the generic keywords. These Perl modules, provided by WordNet [44, 45] (http://wordnet.princeton.edu), were downloaded from http://search.cpan.org. The score thresholds for the residue filtering were set to 20, 0.2, and 0.11 for the Lesk, Lin and Path scores, respectively.
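One possible way of applying these thresholds is sketched below in Python. The similarity scores are assumed to be precomputed (e.g. by the Perl modules named above). The rule used here, retaining a sentence when its best token-keyword score reaches the threshold, is an assumption of this sketch, since the aggregation over tokens is not specified in the text.

```python
# Thresholds quoted in the text for the three similarity measures.
THRESHOLDS = {"lesk": 20.0, "lin": 0.2, "path": 0.11}


def passes_similarity_filter(token_scores, measure):
    """Return True if a residue-containing sentence is retained.

    token_scores: precomputed similarity scores between the sentence
    tokens and the generic keywords (hypothetical input).
    Assumption: the maximum score is compared with the threshold.
    """
    scores = list(token_scores)
    return bool(scores) and max(scores) >= THRESHOLDS[measure]
```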
The keywords relevant to the PPI binding site (PPI+ive words) and the keywords that may represent only the fact of interaction (PPI-ive words) (Table 3) were selected by manual analysis of the parse trees for 500 sentences from 208 abstracts on studies of 32 protein complexes.
Scoring of residue-containing and context sentences
The parse tree of a sentence was built by the Perl module of the Stanford parser [46, 47] (http://nlp.stanford.edu/software/index.shtml) downloaded from http://search.cpan.org. The score of a residue in the sentence was calculated as
$$ S_{\mathrm{X}} = \sum_i \frac{1}{d_{\mathrm{X}i}^{+}} - \sum_j \frac{1}{d_{\mathrm{X}j}^{-}}, $$
(3)
where \( d_{\mathrm{X}i}^{+} \) and \( d_{\mathrm{X}j}^{-} \) are the parse-tree distances between a residue and PPI+ive word i and PPI-ive word j in that sentence, respectively. Distances were calculated by edge counting in the parse tree. An example of a parse tree of a residue-containing sentence with two interface residues having a score of 0.7 is shown in Additional file 1: Figure S1.
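The edge-counting score of Eq. 3 can be sketched in Python as follows. The parse tree is represented here as a plain list of undirected edges between node labels, which is an assumption of this example (the original protocol used the Stanford parser output). The node labels `RES`, `POS` and `NEG` are hypothetical placeholders for a residue, a PPI+ive word and a PPI-ive word.

```python
from collections import deque


def tree_distance(edges, source, target):
    """Number of edges on the path between two nodes of a tree."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:  # breadth-first search from the source node
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    raise ValueError("nodes are not connected in the tree")


def residue_score(edges, residue, positive_words, negative_words):
    """Eq. 3: sum of inverse distances to PPI+ive words minus
    sum of inverse distances to PPI-ive words."""
    plus = sum(1.0 / tree_distance(edges, residue, w) for w in positive_words)
    minus = sum(1.0 / tree_distance(edges, residue, w) for w in negative_words)
    return plus - minus


# Toy tree: RES and POS share a noun phrase; NEG sits in the verb phrase.
toy_edges = [("S", "NP"), ("S", "VP"), ("NP", "RES"), ("NP", "POS"), ("VP", "NEG")]
toy_score = residue_score(toy_edges, "RES", ["POS"], ["NEG"])  # 1/2 - 1/4 = 0.25
```

In the toy tree the residue is two edges from the positive keyword and four edges from the negative one, so nearby PPI+ive words dominate the score, which is the behavior Eq. 3 is designed to capture.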
An add-on value to the main \( S_{\mathrm{X}} \) score (Eq. 3) from the context sentences (the sentences immediately preceding and following the residue-containing sentence) was calculated either as the simple presence or absence of keywords in these sentences, or as a score similar to the \( S_{\mathrm{X}} \) score but computed between the keywords and the root of the sentence in the parse tree.
SVM model
The feature vector for the SVM model was constructed from the \( S_{\mathrm{X}} \) score(s) of the residue-containing sentence and the keyword scores of the context sentences (see above). In addition, the scores accounting for the presence of protein names in the sentence
$$ S_{\mathrm{prot}} = \begin{cases} 0, & \text{if no protein names are in the sentence} \\ 1, & \text{if the name of only one protein is in the sentence} \\ 2, & \text{if the names of both proteins are in the sentence} \end{cases} $$
(4)
were also included, separately for the residue-containing, preceding, and following sentences. The SVM model was trained and validated (in a 50/50 random split) on a subset of 1921 positive (containing an interface residue) and 3865 negative (containing non-interface residues only) sentences using the program SVMLight with linear, polynomial and RBF kernels [48,49,50]. The sentences were chosen in the order of abstract appearance in the TM results.
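A minimal Python sketch of Eq. 4 and of the assembly of a feature vector is given below. Case-insensitive substring matching of protein names and the particular ordering of the features are assumptions of this sketch. The protein names in the usage note are illustrative only.

```python
def s_prot(sentence, names_a, names_b):
    """Eq. 4: 0, 1 or 2 depending on how many of the two proteins
    are named in the sentence.

    names_a, names_b: lists of name variants for each protein
    (hypothetical input; matching is a simple substring test).
    """
    text = (sentence or "").lower()
    has_a = any(name.lower() in text for name in names_a)
    has_b = any(name.lower() in text for name in names_b)
    return int(has_a) + int(has_b)


def feature_vector(s_x, ctx_prev, ctx_next, sentences, names_a, names_b):
    """Concatenate sentence scores with the three S_prot values.

    sentences: (preceding, residue-containing, following) sentence texts;
    a missing context sentence may be passed as None or "".
    The feature order is an assumption of this sketch.
    """
    previous, current, following = sentences
    return [
        s_x,
        ctx_prev,
        ctx_next,
        s_prot(current, names_a, names_b),
        s_prot(previous, names_a, names_b),
        s_prot(following, names_a, names_b),
    ]
```

For example, `s_prot("Asp39 of barstar contacts barnase.", ["barnase"], ["barstar"])` evaluates to 2, whereas a sentence naming neither protein evaluates to 0.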
The SVM performance was evaluated in the usual terms of precision P, recall R, accuracy A, and F-score [51]:
$$ P = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \quad R = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \quad A = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FN}+\mathrm{TN}+\mathrm{FP}}, \quad F = 2\frac{P\times R}{P+R}, $$
(5)
where TP, FP, TN, and FN are, respectively, the numbers of correctly identified interface residues, incorrectly identified interface residues, correctly identified non-interface residues, and incorrectly identified non-interface residues in the validation set. The results (Additional file 1: Figures S2-S7) showed that the best performance was achieved using the RBF kernel with gamma = 16. Thus, this model was incorporated in the TM protocol (Fig. 1).
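For reference, the measures of Eq. 5 can be computed from the four counts as in the following Python sketch. Returning 0.0 when a denominator vanishes is a convention chosen for this example. The counts in the usage note are invented for illustration and are not results of this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Eq. 5: precision, recall, accuracy and F-score from raw counts.

    Zero denominators yield 0.0 (a convention assumed in this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, accuracy, f_score
```

With the illustrative counts TP = 80, FP = 20, TN = 90 and FN = 10, the function gives P = 0.8, R = 80/90, A = 0.85 and F = 160/190.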
Text mining constraints in docking protocol
TM constraints were incorporated in the docking protocol, and the docking success rates were assessed by benchmarking. The basic TM tool [37] with OR-queries was used to mine residues for 395 complexes from the Dockground unbound benchmark set 4. The set consists of the unbound crystallographically determined protein structures and the corresponding co-crystallized complexes (bound structures). Binary combinations of OR and AND queries were generated [37]. The original publication on the crystallographically determined complex was left out, according to the PMID in the PDB file. Because of the frequent discrepancy in the residue numbering and the chain IDs between the bound and the unbound structures, the residues were matched to the ones in the bound protein. The residues were ranked for each interacting protein using a confidence score ranging from 1 (low) to 10 (high). The AND-query residues were given preference over the OR-query ones for the basic TM protocol, according to our ranking scheme [37]. The confidence score was calculated as
$$ f(R)=\min \left(10,\sum \limits_{i=1}^{N_R}{a}_i\right), $$
(6)
where \( N_R \) is the number of abstracts mentioning residue R, \( a_i = 1 \) if abstract i was retrieved by the OR-query only, and \( a_i = 2 \) if the abstract was retrieved by the AND-query. For each protein, the top five residues were used as constraints in GRAMM docking [52]. The constraints were utilized by adding an extra weight to the docking score if the identified residue was at the predicted interface. The maximum value of 10 reflects the difference between the low confidence (f = 1) and the high confidence (f = 10) constraints, while alleviating the effect of possible residue overrepresentation in published abstracts (very high f values).
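The confidence score of Eq. 6 and the selection of the top five residues can be sketched in Python as follows. The input layout (for each residue, a list of "AND"/"OR" labels for the abstracts mentioning it) and the alphabetical tie-breaking are assumptions of this example.

```python
QUERY_WEIGHT = {"OR": 1, "AND": 2}  # a_i values from Eq. 6


def confidence(abstract_queries):
    """Eq. 6: capped sum of per-abstract weights for one residue.

    abstract_queries: one "AND" or "OR" label per abstract mentioning
    the residue (hypothetical input layout).
    """
    return min(10, sum(QUERY_WEIGHT[q] for q in abstract_queries))


def top_constraints(residue_to_queries, k=5):
    """Rank residues of one protein by confidence and keep the top k.

    Ties are broken by residue label, an assumption of this sketch.
    """
    scored = [(res, confidence(q)) for res, q in residue_to_queries.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]
```

A residue mentioned in seven AND-query abstracts would reach a raw sum of 14 and is capped at 10, which illustrates how the cap limits the influence of heavily cited residues.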
For the NLP score, the confidence ranking scheme was modified such that the range is preserved between 1 and 10 and the AND-query residues are given higher precedence than the OR-query residues. The NLP was used for re-ranking within each category as
$$ f(R) = \begin{cases} 10, & \text{if for some } i,\ a_i \text{ is retrieved in AND query and passes NLP} \\ 8, & \text{if } a_i \text{ is retrieved in AND query} \\ 6, & \text{if any } a_i \text{ retrieved in OR query passes NLP} \\ \max\left(5,\ \text{count of abstracts containing } R\right), & \text{otherwise} \end{cases} $$
(7)
The residues at the co-crystallized interface were used as reference. Such residues were defined by a 6 Å atom-atom distance cutoff across the interface. The reference residue pairs were ranked according to the Cα-Cα distance. The top three residue-residue pairs were used in docking with the highest confidence score of 10, to determine the maximum possible success rate for the protein set.
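The selection of the reference residue pairs can be illustrated by the Python sketch below. Atoms are given as simple tuples rather than parsed from a PDB file, and the pairs are ranked from the shortest to the longest Cα-Cα distance. Both choices are assumptions of this example.

```python
import math
from itertools import product


def interface_pairs(atoms_a, atoms_b, cutoff=6.0):
    """Residue pairs with at least one atom-atom distance within the cutoff.

    atoms_a, atoms_b: lists of (residue_id, atom_name, (x, y, z)) for the
    two proteins of the complex (hypothetical input layout, in angstroms).
    """
    pairs = set()
    for (res_a, _, xyz_a), (res_b, _, xyz_b) in product(atoms_a, atoms_b):
        if math.dist(xyz_a, xyz_b) <= cutoff:
            pairs.add((res_a, res_b))
    return pairs


def top_reference_pairs(atoms_a, atoms_b, cutoff=6.0, top=3):
    """Rank interface pairs by C-alpha distance (ascending) and keep the top ones."""
    ca_a = {res: xyz for res, name, xyz in atoms_a if name == "CA"}
    ca_b = {res: xyz for res, name, xyz in atoms_b if name == "CA"}
    candidates = [
        pair for pair in interface_pairs(atoms_a, atoms_b, cutoff)
        if pair[0] in ca_a and pair[1] in ca_b
    ]
    candidates.sort(key=lambda p: (math.dist(ca_a[p[0]], ca_b[p[1]]), p))
    return candidates[:top]
```

Note that a pair can qualify through side-chain atoms even when its Cα-Cα distance exceeds 6 Å, which is why the cutoff test and the ranking use different distances.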