A tree-based conservation scoring method for short linear motifs in multiple alignments of protein sequences

Background: The structure of many eukaryotic cell regulatory proteins is highly modular. They are assembled from globular domains, segments of natively disordered polypeptides and short linear motifs. The latter are involved in protein interactions and the formation of regulatory complexes. The function of such proteins, which may be difficult to define, is the aggregate of the subfunctions of the modules. It is therefore desirable to predict linear motifs efficiently and with some degree of accuracy, yet sequence database searches return results that are not significant.
Results: We have developed a method for scoring the conservation of linear motif instances. It requires only primary sequence-derived information (e.g. multiple alignment and sequence tree) and takes into account the degenerate nature of linear motif patterns. In our benchmark, the method accurately scores 86% of the known positive instances, while distinguishing them from random matches in 78% of the cases. The conservation score is implemented as a real-time application designed to be integrated into other tools. It is currently accessible via a Web Service or through a graphical interface.
Conclusion: The conservation score improves the prediction of linear motifs by discarding those matches that are unlikely to be functional because they have not been conserved during the evolution of the protein sequences. It is especially useful for instances in non-structured regions of proteins, where a domain-masking filtering strategy is not applicable.

The IUPred score was calculated for all the randomly chosen instances and the 182 known positive instances located in unstructured regions of proteins (IUPs). The plot of the TPR against the FPR for the IUPred score is a diagonal that corresponds to a uniform distribution of the score for the instances in the two sets (known positives in IUPs and randomly chosen instances). Moreover, this indicates that the set of putative negatives is non-biased (Figure S1).
Figure S1. ROC curves of IUPred and CS scores. Points in each ROC curve indicate the proportion of wrongly scored known negatives (false positive rate, FPR) versus the fraction of correctly scored known positives (true positive rate, TPR). Each point is calculated for a certain threshold inside the score range, [0, 1] for the CS (all models) and the IUPred score. An ideal scoring method would be one arriving at the upper left corner, i.e. TPR=1 and FPR=0. The diagonal indicates a random scoring scheme.
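The way each ROC point in Figure S1 is obtained can be illustrated with a minimal sketch. The function name `roc_points`, the example scores and the choice of calling an instance positive when its score is at or above the threshold are assumptions made for illustration; they are not taken from the benchmark itself.

```python
def roc_points(positive_scores, negative_scores, thresholds):
    """Return one (FPR, TPR) pair per threshold.

    Assumption: an instance is called positive when its score is >= the
    threshold; the direction of the comparison is not stated in the text.
    """
    points = []
    for t in thresholds:
        # Fraction of known positives scored at or above the threshold.
        tpr = sum(s >= t for s in positive_scores) / len(positive_scores)
        # Fraction of known negatives (random instances) scored likewise.
        fpr = sum(s >= t for s in negative_scores) / len(negative_scores)
        points.append((fpr, tpr))
    return points


# Hypothetical scores in [0, 1], for illustration only.
known_positives = [0.9, 0.8, 0.4]
random_instances = [0.7, 0.2, 0.1]
thresholds = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

for fpr, tpr in roc_points(known_positives, random_instances, thresholds):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

When the two score sets follow the same distribution, TPR and FPR rise together as the threshold is lowered, which gives the diagonal described for the IUPred score.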
Because not all of the 356 known positives could be used for the test described above, we compared the ROC curves calculated with the two known positive sets: all instances, and only the instances found in intrinsically unstructured regions of proteins (IUPs) (Figure S2). For all three models, the two curves differ only in the plateau region, beyond the ideal threshold at which maximum sensitivity is reached at the lowest FPR. Indeed, there is no statistically significant difference between the CS distributions of the two sets: the P-values of the Kolmogorov-Smirnov test range from 0.11 to 0.77, depending on the model used to calculate the CS.
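For readers unfamiliar with the two-sample Kolmogorov-Smirnov test, the statistic it is based on can be sketched as below. This is an illustrative stand-alone implementation, not the code used for the analysis; the function name `ks_statistic` and the example scores are hypothetical. Only the statistic D is computed here. Turning D into a P-value requires the Kolmogorov distribution, which statistics libraries provide (for example `scipy.stats.ks_2samp`).

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D: the largest vertical distance between
    the empirical cumulative distribution functions of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, so that ties are
        # consumed together before the two ECDFs are compared.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical conservation scores, for illustration only.
cs_all_instances = [0.1, 0.4, 0.7]
cs_iup_instances = [0.2, 0.5, 0.9, 1.0]
print(ks_statistic(cs_all_instances, cs_iup_instances))
```

A small D (and hence a large P-value) means the two empirical distributions are close, which is the situation reported for the two known positive sets.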

C Comparison with Dinkel and Sticht score
Recently, another conservation score for ranking predicted motif instances has been proposed [2]. This method retrieves homologous sequences by a BLAST search and sorts them according to their similarity to the query sequence. It then looks for the presence of the predicted instance in the pairwise alignments with the homologous sequences. The method takes into account the variability of the motif pattern when assessing the presence of an instance in a homologous sequence. Each presence is weighted according to the sequence similarity between the homologous sequence and the query. A conservation score is then calculated by averaging the weighted presences over a given number of homologous sequences (average conservation score, ACS). The final conservation score for the instance is the maximum average conservation score (MCS) obtained using different numbers of homologues.
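The ACS and MCS calculation described above can be sketched as follows. This is a simplified reading of the description, not the implementation of [2]: the function names, the example motif pattern and the identity weighting (weight equal to the similarity itself) are assumptions, since the exact weighting function is not given in this text.

```python
import re


def motif_present(pattern, aligned_segment):
    """True if the gap-stripped aligned segment matches the motif pattern.

    Matching against the regular expression, rather than the query's exact
    residues, is how the variability of the pattern is taken into account.
    """
    return re.search(pattern, aligned_segment.replace("-", "")) is not None


def average_conservation_scores(homologues, weight=lambda similarity: similarity):
    """ACS over the k most similar homologues, for k = 1..n.

    homologues: list of (similarity, present) pairs, similarity in [0, 1].
    weight: placeholder weighting (identity); an assumption, see lead-in.
    """
    ranked = sorted(homologues, key=lambda h: h[0], reverse=True)
    scores, total = [], 0.0
    for k, (similarity, present) in enumerate(ranked, start=1):
        if present:
            total += weight(similarity)
        scores.append(total / k)
    return scores


def maximum_conservation_score(homologues):
    """MCS: the largest ACS over all numbers of homologues considered."""
    scores = average_conservation_scores(homologues)
    return max(scores) if scores else 0.0


# Hypothetical pattern, aligned segments and similarities.
pattern = "R..[ST]"
hits = [(0.9, "RKAS"), (0.5, "KAAQ"), (0.7, "R-AAT")]
homologues = [(sim, motif_present(pattern, seg)) for sim, seg in hits]
print(average_conservation_scores(homologues))
print(maximum_conservation_score(homologues))
```

Because the maximum is taken over k, the MCS can never be lower than the ACS computed over the full set of homologues.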
In general, the method of Dinkel and Sticht follows a logic similar to that of the CS, in particular to the EXC CONT model. The main differences from the CS method are the absence of a "closed" homologous sequence set, in the sense that the number of homologues considered depends on the maximisation of the ACS, and the use of pairwise alignments instead of multiple sequence alignments. In spite of the general resemblance between the two methods, Dinkel and Sticht report a recovery rate (i.e. sensitivity) of 75% for their method [2]. This result differs from the 0.83 sensitivity found for the CS EXC CONT model. The most natural explanation for this would be the difference between our known positives set and theirs. Dinkel and Sticht used all 675 ELM instances (567 after filtering for sequence redundancy). For the CS benchmark, instead, only the 356 ELM instances linked to experimental evidence were used. In order to test this hypothesis, we implemented their method and calculated the MCS of our known positive and known negative instances.
The resulting ROC curves show that the MCS and the CS EXC CONT model have similar sensitivity (Figure S3). Indeed, both methods take into account the degenerate nature of linear motif patterns. This seems to be important when scoring instances that are known to be functional (for further discussion see the Testing section and Figure 5 in the main article).
Nevertheless, there is a difference between the two methods in the high TPR and low FPR region of the ROC curve (see inset of Figure S3). There, the CS EXC CONT model reaches the same TPR as the MCS at a lower FPR cost. The difference in FPR ranges from 0.05 to 0.10. It is possible that the two main differences between the methods explained above are responsible for the overscoring of some of the randomly chosen instances. The maximisation of the ACS might not be ideal in a prediction framework where it is necessary to distinguish between functional and non-functional instances, because it could "overscore" random matches. Moreover, the use of pairwise alignments could increase the probability of finding the random instances in the homologous sequences and therefore of scoring them as conserved. This is less likely to happen in a multiple sequence alignment.
For a final test, we combined both methods. We constructed the set of homologous sequences according to our approach (for a detailed description see the section Step 1: Homologous sequence set definition, in the main article). Using the corresponding multiple sequence alignments, we calculated the Dinkel and Sticht ACS by averaging the weighted presences over the sequences in the homologue set. The best performance is achieved when the strong points of both methods are combined (red line in Figure S3).