A weighted string kernel for protein fold recognition

Background: Alignment-free methods for comparing protein sequences have proved to be viable alternatives to approaches that first rely on an alignment of the sequences to be compared. Much work, however, needs to be done before those methods provide reliable fold recognition for proteins whose sequences share little similarity. We recently proposed an alignment-free method based on the concept of string kernels, SeqKernel (Nojoomi and Koehl, BMC Bioinformatics, 2017, 18:137). In that study, we showed that while SeqKernel performs better than standard alignment-based methods, its applications are potentially limited because of biases due mostly to sequence-length effects.

Methods: In this study, we propose improvements to SeqKernel that follow two directions. First, we developed a weighted version of the kernel, WSeqKernel. Second, we expanded the concept of string kernels into a novel framework for deriving information on amino acids from protein sequences.

Results: Using a dataset that only contains remote homologs, we show that WSeqKernel performs remarkably well in fold recognition experiments. We show that, with the appropriate weighting scheme, we can remove the length effects on the kernel values. WSeqKernel, just like any alignment-based sequence comparison method, depends on a substitution matrix. We show that this matrix can be optimized so that sequence similarity scores correlate well with structure similarity scores. Starting from no information on amino acid similarity, we show that we can derive a scoring matrix that echoes the physico-chemical properties of amino acids.

Conclusion: We have made progress in characterizing and parametrizing string kernels as alignment-free methods for comparing protein sequences, and we have shown that they provide a framework for extracting sequence information from structure.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1795-5) contains supplementary material, which is available to authorized users.

In this section, we give the details of this procedure: we describe the calculation of the derivatives of the objective function, as well as the algorithm used to perform the optimization.
Objective function. Let L be a set of pairs of proteins. We denote by N the cardinality of L.
For each pair n of proteins in L, we compute the alignment between their structures using STRUCTAL [1] and record its SAS score, $Y_n$. This score is an input to the procedure. In parallel, we compute the kernel between their sequences using SeqKernel, and record it as $X_n$. $X_n$ is a non-linear function of the elements of the kernel $K_1$. Our objective is to optimize the degree of linear dependence between the two variables Y and X. We quantify this linear dependence using Pearson's correlation coefficient, which we denote as P:

$$ P = \frac{\sum_{n=1}^{N} (X_n - \bar{X})(Y_n - \bar{Y})}{\sqrt{\sum_{n=1}^{N} (X_n - \bar{X})^2 \sum_{n=1}^{N} (Y_n - \bar{Y})^2}} \qquad (S1) $$

where $\bar{X}$ and $\bar{Y}$ are the means of X and Y over the N pairs. P takes values between -1 and 1.
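As a concrete reference, the objective of Equation S1 can be sketched in a few lines of Python (an illustrative stand-alone snippet, not the paper's C++ implementation):

```python
import math

def pearson(X, Y):
    """Pearson correlation coefficient P between kernel values X and SAS scores Y (Equation S1)."""
    N = len(X)
    mx, my = sum(X) / N, sum(Y) / N
    num = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    den = math.sqrt(sum((x - mx) ** 2 for x in X) * sum((y - my) ** 2 for y in Y))
    return num / den

# Perfectly anti-correlated toy data: P = -1
print(pearson([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]))  # -1.0
```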
Derivatives of the objective function. The kernel matrix $K_1$ is a 20 × 20 symmetric, positive definite matrix. As such, it contains 210 independent variables. Let $K_1(i,j)$ be one of those variables, where i ∈ [1, 20] and j ∈ [i, 20]. We compute the derivative of the objective function P with respect to $K_1(i,j)$ using simple rules for computing derivatives.
Let us first rewrite $P = P_1/P_2$, with

$$ P_1 = \sum_{n=1}^{N} (X_n - \bar{X})(Y_n - \bar{Y}), \qquad P_2 = \sqrt{\sum_{n=1}^{N} (X_n - \bar{X})^2 \sum_{n=1}^{N} (Y_n - \bar{Y})^2} \qquad (S2) $$

Then

$$ \frac{\partial P}{\partial K_1(i,j)} = \frac{1}{P_2} \frac{\partial P_1}{\partial K_1(i,j)} - \frac{P_1}{P_2^2} \frac{\partial P_2}{\partial K_1(i,j)} \qquad (S3) $$

where:

$$ \frac{\partial P_1}{\partial K_1(i,j)} = \sum_{n=1}^{N} (Y_n - \bar{Y}) \frac{\partial X_n}{\partial K_1(i,j)}, \qquad \frac{\partial P_2}{\partial K_1(i,j)} = \frac{\sum_{n=1}^{N} (Y_n - \bar{Y})^2}{P_2} \sum_{n=1}^{N} (X_n - \bar{X}) \frac{\partial X_n}{\partial K_1(i,j)} \qquad (S4) $$

We therefore need to compute the $\frac{\partial X_n}{\partial K_1(i,j)}$ for all pairs of proteins n in L. Let S and T be the two sequences corresponding to the pair n. Then $X_n = K_3(S,T)$, and $\frac{\partial X_n}{\partial K_1(i,j)} = \frac{\partial K_3(S,T)}{\partial K_1(i,j)}$. For any sequences S and T, $K_3(S,T)$ is a weighted sum of kernels $K_3^k(S,T)$, where k is the length of the k-mers considered. $K_3^k(S,T)$ is itself a sum of kernels $K_2^k$ computed over all pairs of k-mers of length k in S and T. Let $u^k$ and $v^k$ be two such k-mers, with $K_2^k(u^k, v^k) = \prod_{l=1}^{k} K_1(u_l^k, v_l^k)$. Then

$$ \frac{\partial K_2^k(u^k, v^k)}{\partial K_1(i,j)} = \sum_{l:\,\{u_l^k, v_l^k\}=\{i,j\}} \; \prod_{m \neq l} K_1(u_m^k, v_m^k) \qquad (S5) $$

and 0 otherwise, i.e. when no position l aligns amino acids i and j. The terms $\frac{\partial X_n}{\partial K_1(i,j)}$ are then computed by summing all derivatives given by Equation S5, over all k-mers of size k in S and T, and over all k ∈ [1, k_max].
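The derivative in Equation S5 is easy to validate numerically. The sketch below (illustrative Python, with a toy 4-letter alphabet standing in for the 20 amino acids) compares the analytic derivative of the k-mer kernel $K_2^k$ with a symmetric finite-difference perturbation of $K_1(i,j)$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = 4  # toy alphabet size (20 amino acids in the paper)
M = rng.uniform(0.5, 1.5, (A, A))
K1 = (M + M.T) / 2  # a symmetric input kernel on single letters

def K2(u, v, K1):
    """k-mer kernel: product of K1 over the aligned letters of u and v."""
    p = 1.0
    for a, b in zip(u, v):
        p *= K1[a, b]
    return p

def dK2_dK1(u, v, i, j, K1):
    """Derivative of K2(u, v) with respect to the symmetric entry K1(i, j) (Equation S5)."""
    d = 0.0
    for l in range(len(u)):
        if {u[l], v[l]} == {i, j}:
            prod = 1.0
            for m in range(len(u)):
                if m != l:
                    prod *= K1[u[m], v[m]]
            d += prod
    return d

# Finite-difference check: perturb K1(i, j) (and K1(j, i), to keep K1 symmetric)
u, v, i, j = (0, 1, 2, 1), (1, 2, 3, 2), 1, 2
h = 1e-6
Kp = K1.copy()
Kp[i, j] += h
if i != j:
    Kp[j, i] += h
fd = (K2(u, v, Kp) - K2(u, v, K1)) / h
print(abs(fd - dK2_dK1(u, v, i, j, K1)) < 1e-5)  # True
```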
Maintaining the kernel K_1 positive definite. Direct minimization of the objective function defined by Equation S1 is likely to fail, as there is no guarantee that the matrix $K_1$ stays positive definite. One option to circumvent this problem is to consider the Cholesky factorization of $K_1$. Indeed, any positive definite real matrix $K_1$ can be decomposed as

$$ K_1 = L L^T \qquad (S6) $$

where L is a lower triangular matrix. Conversely, if $K_1$ is a matrix that can be written as $LL^T$ for some invertible lower triangular matrix L, then $K_1$ is positive definite. The latter provides a framework for enforcing positive definiteness, namely we set the parameters of the optimization to be the matrix L, the Cholesky factor of the kernel $K_1$. We do need to impose that the matrix L remains invertible, which is achieved by preventing any of the diagonal coefficients L(i,i) from becoming zero. The change of variables from $K_1$ to L is fully defined once we have a means to compute one set of variables from the other, and once we can transfer derivatives. The former is given by Equation S6. In practice, we only compute the factorization once; this factorization is in fact trivial, as our input kernel $K_1$ is the identity matrix, and the corresponding L is also the identity matrix. The latter is given by the chain rule. Notice first that:

$$ K_1(k,l) = \sum_{m=1}^{20} L(k,m) L(l,m) \qquad (S7) $$

Second, we apply the chain rule:

$$ \frac{\partial P}{\partial L(i,j)} = \sum_{k=1}^{20} \sum_{l=1}^{20} \frac{\partial P}{\partial K_1(k,l)} \frac{\partial K_1(k,l)}{\partial L(i,j)} \qquad (S8) $$

where the derivatives $\frac{\partial K_1(k,l)}{\partial L(i,j)}$ are directly computed from Equation S7.
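This parametrization can be illustrated with a small numerical check: building $K_1 = LL^T$ guarantees positive definiteness, and the derivative $\partial K_1(k,l)/\partial L(i,j)$ implied by Equation S7 agrees with finite differences. A Python sketch (illustrative only, using a 4 × 4 toy matrix in place of the 20 × 20 kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
A = 4  # toy size; 20 in the paper
L = np.tril(rng.uniform(0.5, 1.5, (A, A)))  # lower triangular, strictly positive diagonal

K1 = L @ L.T  # Equation S6: symmetric positive definite by construction
print(np.all(np.linalg.eigvalsh(K1) > 0))  # True

def dK1_dL(k, l, i, j, L):
    """Derivative of K1(k, l) = sum_m L(k, m) L(l, m) with respect to L(i, j)."""
    d = 0.0
    if k == i:
        d += L[l, j]
    if l == i:
        d += L[k, j]
    return d

# Finite-difference check of the chain-rule ingredient used in Equation S8
i, j, k, l = 2, 1, 2, 3
h = 1e-6
Lp = L.copy()
Lp[i, j] += h
fd = ((Lp @ Lp.T)[k, l] - K1[k, l]) / h
print(abs(fd - dK1_dL(k, l, i, j, L)) < 1e-6)  # True
```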
An algorithm to optimize the kernel K_1. Combining all the elements described above (i.e., the choice of the objective function, the computation of its derivatives, and the enforcement of positive definiteness of the kernel matrix) leads to Algorithm 1 for optimizing $K_1$.
Algorithm 1 OptimK1, an algorithm for refining a substitution matrix based on structure comparison data

Input: L, list of pairs of proteins; $Y_n$, corresponding SAS values; $k_{max}$, largest k-mer considered

(1) Initialize: set $K_1^0 = I$ (the identity matrix); compute $L^0$, the Cholesky factor of $K_1^0$.
for t = 1, . . . until convergence do
    (2) Compute $K_1^t = L^{t-1} [L^{t-1}]^T$
    (3) For all pairs n in L, compute $X_n$, the kernel value based on $K_1^t$, and its derivatives $\partial X_n / \partial K_1^t(i,j)$ with respect to the elements of $K_1^t$.
    (4) Compute the objective function $P^t$ using Equation S1, and its derivatives with respect to the elements of $K_1^t$.
    (5) Convert the derivatives with respect to the elements of $K_1^t$ into derivatives with respect to the elements of $L^{t-1}$, using Equation S8.
    (6) Check for convergence: if $|P^t - P^{t-1}| < TOL$, stop.
    (7) Update $L^{t-1} \rightarrow L^t$.
end for

Output: the optimized matrix $K_1^{opt}$, the corresponding kernel values X for the protein pairs in L, and the correlation coefficient P.
The time-consuming part of this algorithm is the computation of all kernel values and their derivatives for the protein pairs in L (step (3)). We note, however, that this part of the algorithm is embarrassingly parallel. The Cholesky factorization in step (1) is trivial when $K_1^0$ is the identity matrix $I_{20}$, in which case $L^0$ is the same identity matrix. For a more general starting matrix $K_1^0$, we have used a C++ version of the Cholesky decomposition code proposed by Healy [2]. The optimization itself is iterative. There are many ways in which the parameters L can be updated in step (7), such as steepest descent or conjugate gradients, or even second-order Hessian-based methods such as Newton's method. We have used an intermediate between those two approaches, L-BFGS-B [3]. This method builds an estimate of the Hessian as it proceeds along the optimization, saving the time that the computation of the full Hessian would require. It converges faster than steepest descent or conjugate gradients. This version also accepts bounds on the variables, which allowed us to enforce that the elements of the diagonal of the matrix L remain strictly positive.
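For readers reimplementing this step, the bound setup for a bound-constrained optimizer such as L-BFGS-B can be sketched as follows. This is a hypothetical NumPy fragment (the paper's implementation is in C++); `pack`, `unpack`, and `EPS` are names introduced here for illustration. The 210 free parameters are the lower-triangular entries of L, with the 20 diagonal entries bounded below by a small positive constant:

```python
import numpy as np

A = 20      # amino acid alphabet size
EPS = 1e-6  # lower bound keeping diag(L) strictly positive, so L stays invertible

# Indices of the 210 free parameters: the lower triangle of L
rows, cols = np.tril_indices(A)

def pack(L):
    """Flatten the lower triangle of L into an optimization vector."""
    return L[rows, cols]

def unpack(x):
    """Rebuild the lower-triangular matrix L from the optimization vector."""
    L = np.zeros((A, A))
    L[rows, cols] = x
    return L

# Bounds, one per parameter: diagonal entries of L >= EPS, off-diagonal entries free
bounds = [(EPS, None) if r == c else (None, None) for r, c in zip(rows, cols)]

x0 = pack(np.eye(A))  # start from K1 = I, whose Cholesky factor is also I
print(len(x0), len(bounds))  # 210 210
```

The vector `x0` and the `bounds` list are then passed to the optimizer, which updates `x` at each iteration; `unpack(x)` recovers L, from which $K_1 = LL^T$ is rebuilt.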

Supplemental Results
Optimizing the input amino acid kernel K_1. We extracted 793 proteins from CATH2833 whose lengths are between 120 and 180 residues. We performed an all-against-all comparison of the structures of those proteins using STRUCTAL [1]. We randomly selected a subset of 13177 pairs of proteins in this dataset such that the SAS scores of their structural alignments are between 0 and 20. The objective of the optimization procedure is to find a kernel matrix $K_1$ that maximizes the Pearson's correlation coefficient between those SAS scores and the corresponding kernel values obtained with SeqKernel (see Equation S1). We used the identity matrix as the starting point of the optimization. Two sets of optimizations were performed, with two values of $k_{max}$, namely 2 and 10. Results are presented in Figure S1.

Figure S1: Optimizing the input kernel K_1 for SeqKernel. The matrix $K_1$ is optimized to maximize the Pearson's correlation coefficient P between STRUCTAL SAS scores and the corresponding kernel values for 13177 pairs of proteins from CATH2833 (see text for details). Two versions of SeqKernel are considered, with the maximum size $k_{max}$ of the k-mers included in the calculation of the kernel set to 2 and 10, respectively. Results for $k_{max} = 2$ are shown in panels A, B, and C, while the equivalent results for $k_{max} = 10$ are shown in panels D, E, and F. In panels A and D, P is plotted against the iteration number during the optimization. The SeqKernel scores prior to optimization (panels B and E) and after optimization (panels C and F) are plotted against the STRUCTAL scores.