Volume 11 Supplement 8
Exploiting physico-chemical properties in string kernels
© Toussaint et al; licensee BioMed Central Ltd. 2010
Published: 26 October 2010
String kernels are commonly used for the classification of biological sequences, both nucleotide and amino acid sequences. Although string kernels are already very powerful, they have a major shortcoming when applied to amino acid sequences: they ignore an important piece of information when comparing amino acids, namely physico-chemical properties such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is scarce. So far, only very few approaches have aimed at combining these two ideas.
We propose new string kernels that combine the benefits of physico-chemical descriptors for amino acids with the ones of string kernels. The benefits of the proposed kernels are assessed on two problems: MHC-peptide binding classification using position specific kernels and protein classification based on the substring spectrum of the sequences. Our experiments demonstrate that the incorporation of amino acid properties in string kernels yields improved performances compared to standard string kernels and to previously proposed non-substring kernels.
In summary, the proposed modifications, in particular the combination with the RBF substring kernel, consistently yield improvements without affecting the computational complexity. The proposed kernels therefore appear to be the kernels of choice for any protein sequence-based inference.
Data sets, code and additional information are available from http://www.fml.tuebingen.mpg.de/raetsch/suppl/aask. Implementations of the developed kernels are available as part of the Shogun toolbox.
String kernels are a powerful and popular tool for machine learning in computational biology. They have been successfully applied to numerous problems, ranging from protein remote homology detection [1–3] and gene identification [4–6] to sub-cellular location prediction [7, 8] and drug design [9, 10]. The different kernel formulations commonly exploit the sequential structure of the sequences and, by doing so, can effectively eliminate implausible features, leading to improved results. When using string kernels on protein sequences, one key disadvantage is that prior knowledge about the properties of individual amino acids (AAs), e.g., their size, hydrophobicity, or secondary structure preference, cannot be easily incorporated. While these properties can be learned implicitly by the machine learning methods if the training data sets are large enough, it would be advantageous to include this information in the sequence representation. The goal of this work is to combine the benefits of string kernels with those of physico-chemical descriptors for AAs. The main idea is to replace the comparison of substrings, which is carried out during kernel computation, with a term that takes the AA properties into account. While this seems quite simple at first sight, it is less so when considering k-mers instead of single AAs. The key insight is how to compute the kernels such that the beneficial properties of sequence kernels are not lost. In particular, we require that either the use of uninformative descriptors (e.g., each AA corresponds to a unit vector) or a particular choice of kernel parameters reduces the new kernel to the original string kernel.
String kernels for sequence classification
Kernels that have been proposed for classifying nucleic acid and amino acid sequences can be divided into two main classes: (a) kernels describing the sequence content of sequences of varying length and (b) kernels for identifying localized signals within sequences of fixed length. The first class is typically used for classifying whole protein or mRNA sequences, while the second class is typically used to recognize a specific site in a window of fixed length sliding over a sequence.
Kernels describing ℓ-mer content
where x[i : i + ℓ] is the substring of length ℓ of x at position i.
Here, we consider all pairs of substrings at any position in each of the two input sequences. This formulation has the benefit that it makes the comparison between the substrings more explicit, which is needed in the derivation of the extensions.
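The relation between the count-based view of the spectrum kernel and the view over all pairs of positions can be checked on a toy example. The following is a minimal sketch in Python; the function names are hypothetical, the sequences are arbitrary, and the code is not taken from the Shogun implementation.

```python
from collections import Counter


def lmers(x, l):
    """All substrings x[i:i+l] of length l, ordered by start position i."""
    return [x[i:i + l] for i in range(len(x) - l + 1)]


def spectrum_counts(x, y, l):
    """Spectrum kernel as a dot product of l-mer count vectors."""
    cx, cy = Counter(lmers(x, l)), Counter(lmers(y, l))
    return sum(n * cy[s] for s, n in cx.items())


def spectrum_pairwise(x, y, l):
    """Same value, written as a sum over all position pairs (i, j),
    where each pair contributes 1 if the two substrings are identical."""
    return sum(1 for s in lmers(x, l) for t in lmers(y, l) if s == t)


if __name__ == "__main__":
    print(spectrum_counts("ACDACD", "CDACA", 2))
    print(spectrum_pairwise("ACDACD", "CDACA", 2))
```

The pairwise form is slower, but it isolates the substring comparison `s == t`, which is the term that a similarity based on AA properties can replace.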
Kernels for localized signals
where is the weighting of the substring lengths. The WD kernel is quite related to the spectrum kernel formulation in (4), where we consider only the ℓ-mers occurring at the same position, i.e., where i = j. The oligo kernel is similar in spirit but it also compares substrings at different positions.
Incorporation of knowledge on AA properties
In this work we propose modifications to existing string kernels that supplement the kernels’ beneficial properties by incorporating prior knowledge on physico-chemical and other properties of AAs. Previous work on incorporating prior knowledge has either focused on using physico-chemical properties for single AAs, i.e., ignoring the sequential nature of the sequences (e.g., [14, 15]), or taken advantage of Blast or PSI-Blast profiles for improving spectrum kernels [2, 3, 16]. We propose a complementary approach of employing physico-chemical or other information to refine the similarity between two substrings used in most existing string kernels. We illustrate the usefulness of these modifications for both classes of string kernels on two problems: (a) the prediction of MHC-binding peptides as an example for localized signals and (b) protein fold classification as an example for ℓ-mer content.
Using the feature representation corresponding to this kernel, we can now recognize sequences of AAs that have certain properties (e.g., first AA: hydrophobic, second AA: large, third AA: positively charged, etc.): the kernel induces one feature for every combination of products of features involving exactly one AA property per substring position. For instance, when considering products of the form (x_{1,1} + x_{1,2} + … + x_{1,n}) · (x_{2,1} + x_{2,2} + … + x_{2,n}) · (x_{3,1} + x_{3,2} + … + x_{3,n}), we get n³ different monomials, each of which uses exactly one of the n features from each of the three groups. There are no monomials x_{i,j} x_{i,k} for any i = 1,…,3 and j, k = 1,…,n.
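The monomial structure described above can be made concrete with a small numerical check. The sketch below assumes that the substring kernel is a product over positions of per-position dot products of AA descriptors (one way in which such a feature space arises; the exact kernel definition is not reproduced here). The function names and toy descriptor values are hypothetical.

```python
from itertools import product
from math import prod


def monomial_features(groups):
    """Explicit feature map: one monomial per choice of exactly one
    property from each position (n**k features for k positions)."""
    return [prod(choice) for choice in product(*groups)]


def product_kernel(gx, gy):
    """Implicit computation: product over positions of the
    per-position dot products of the descriptor vectors."""
    return prod(sum(a * b for a, b in zip(u, v)) for u, v in zip(gx, gy))


if __name__ == "__main__":
    # Toy 3-mers with n = 2 made-up descriptors per position.
    x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
    y = [[2.0, 1.0], [1.0, 1.0], [-1.0, 4.0]]
    fx, fy = monomial_features(x), monomial_features(y)
    print(len(fx))  # number of monomials, n**3
    print(sum(a * b for a, b in zip(fx, fy)), product_kernel(x, y))
```

Both computations give the same value, but the implicit form needs only k · n multiplications instead of enumerating all n^k monomials.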
Both kernels induce a considerably richer feature space, which can be beneficial for accurate classification of sequences.
AA substring kernel for localized string kernels
For σ → 0 and an encoding Ψ with Ψ(a) = Ψ(b) if and only if a = b, the WD-RBF kernel corresponds to the WD kernel: the RBF AA substring kernel will be one only if the substrings are identical, otherwise it will be zero.
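The limit behaviour for small σ can be illustrated numerically. The sketch below rests on several assumptions that are not confirmed by the text above: the RBF AA substring kernel is taken to be exp(−‖Ψ(s) − Ψ(t)‖² / (2σ²)) on concatenated per-residue descriptors, the substring-length weighting is taken to be the commonly used WD weighting 2(d − ℓ + 1)/(d(d + 1)), and one-hot vectors stand in for uninformative descriptors. All function names are hypothetical.

```python
from math import exp

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(aa):
    """Uninformative descriptor: each AA maps to a unit vector."""
    return [1.0 if aa == b else 0.0 for b in AMINO_ACIDS]


def encode(sub, psi):
    """Concatenate the per-residue descriptor vectors of a substring."""
    return [v for aa in sub for v in psi(aa)]


def exact_match(s, t):
    """Substring comparison used by the plain WD kernel."""
    return 1.0 if s == t else 0.0


def make_rbf(psi, sigma):
    """Assumed form of the RBF AA substring kernel."""
    def rbf(s, t):
        d2 = sum((a - b) ** 2 for a, b in zip(encode(s, psi), encode(t, psi)))
        return exp(-d2 / (2.0 * sigma ** 2))
    return rbf


def wd(x, y, degree, substring_kernel):
    """Weighted sum over substring lengths 1..degree of position-wise
    substring comparisons (only substrings at the same position i)."""
    if len(x) != len(y):
        raise ValueError("WD-type kernels need sequences of equal length")
    total = 0.0
    for l in range(1, degree + 1):
        beta = 2.0 * (degree - l + 1) / (degree * (degree + 1))  # assumed weighting
        for i in range(len(x) - l + 1):
            total += beta * substring_kernel(x[i:i + l], y[i:i + l])
    return total


if __name__ == "__main__":
    x, y = "ACDEFGHIK", "ACDQFGHIK"  # arbitrary toy nonamers
    print(wd(x, y, 3, exact_match))
    print(wd(x, y, 3, make_rbf(one_hot, 0.05)))
```

With a small σ and an injective encoding, the two printed values agree up to numerical precision; with informative descriptors and a larger σ, substrings that differ only in similar AAs would contribute values between zero and one.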
Relation to non-substring-based kernels
Please note that here we use the full sequence and do not separately consider subsequences. Both kernels consider higher-order correlations between properties of the sequence at arbitrary positions in the sequence. Hence, the sequential nature of the sequences is not fully taken into account, which is particularly important for long sequences.
AA substring kernel for ℓ-mer content string kernels
As before, for σ → 0, the above formulation is identical to the original spectrum kernel. A drawback of this approach is, however, that one now has to compute the substring comparisons for every pair of occurring substrings. Hence, the computational complexity, O(|x| · |x′|), is much higher than for the original spectrum kernel and makes this kernel impractical.
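The quadratic cost and the two limits of σ can be seen in a direct implementation. This is a minimal sketch under stated assumptions: the RBF AA substring kernel is taken to be exp(−‖Ψ(s) − Ψ(t)‖² / (2σ²)), one-hot vectors are used as a hypothetical uninformative encoding, and the function names are illustrative rather than part of any library.

```python
from math import exp

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def psi(sub):
    """Hypothetical encoding: concatenated one-hot vectors per residue."""
    return [1.0 if aa == b else 0.0 for aa in sub for b in ALPHABET]


def spectrum_rbf(x, y, l, sigma):
    """Sum of RBF substring similarities over ALL position pairs (i, j).
    This needs (|x| - l + 1) * (|y| - l + 1) substring comparisons."""
    fx = [psi(x[i:i + l]) for i in range(len(x) - l + 1)]
    fy = [psi(y[j:j + l]) for j in range(len(y) - l + 1)]
    total = 0.0
    for u in fx:
        for v in fy:
            d2 = sum((a - b) ** 2 for a, b in zip(u, v))
            total += exp(-d2 / (2.0 * sigma ** 2))
    return total


if __name__ == "__main__":
    x, y = "ACDACD", "CDACA"  # arbitrary toy sequences
    print(spectrum_rbf(x, y, 2, 0.05))  # small sigma: counts identical pairs
    print(spectrum_rbf(x, y, 2, 1e6))   # large sigma: counts all pairs
```

For small σ only identical substring pairs contribute, which recovers the spectrum kernel value; for very large σ every pair contributes approximately one, so the kernel degenerates to the number of position pairs.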
The mismatch kernel
Rather than simply counting similar substrings, this feature representation accounts for the degree of similarity: similar substrings contribute more strongly than dissimilar ones. This strategy is particularly beneficial when allowing many mismatches.
For σ → ∞ it corresponds to the mismatch feature map (16) since the RBF AA substring kernel will be one for all substring pairs.
The profile kernel
The second term determines whether the substring is within the mutation neighbourhood and should be counted, and the first term determines the contribution of the substring based on AA similarities. This kernel can be computed as efficiently as the original profile kernel. Since the elements in the neighbourhood are weighted based on AA property similarity, the kernel may be able to take advantage of larger neighbourhoods.
The profile kernel is similar to the profile-based direct kernels described in [16], and similar ideas to incorporate AA properties can be applied there as well. The profile and mismatch kernels have, however, the advantage that they allow for an efficient computation using the data structures proposed in [2, 22]. These data structures are unfortunately not applicable to the profile kernel formulations in [16].
We evaluate the performance of the proposed kernels on two problems: the kernels for localized signals on MHC-peptide binding classification, and the kernels describing ℓ-mer content on protein classification. For MHC-peptide binding experiments we utilized the IEDB benchmark data set from Peters et al. [23]. It contains quantitative binding data (IC50 values) of nonameric peptides with respect to various MHC alleles. Peptides with IC50 values greater than 500 were considered non-binders, all others binders. Protein classification data was taken from the supplementary material of . This commonly used data set comprises 7,329 protein domains from 54 families. Corresponding profile information was taken from [http://cbio.mskcc.org/leslielab/software/string-kernels].
A wide range of physico-chemical descriptors of AAs has been published. Many of them can be obtained from the amino acid index database (AAIndex) [24]. Within this work we use three sets of descriptors: (1) five descriptors derived from a principal component analysis of 237 physico-chemical properties taken from the AAIndex (pca), (2) three descriptors representing hydrophobicity, size, and electronic properties (zscale), and (3) 20 descriptors corresponding to the respective entries of the Blosum50 substitution matrix [25] (blosum50).
Evaluation of string kernels for localized signals
Performance analysis. Preliminary experiments on three human MHC alleles (A*2301, B*5801, A*0201) were carried out to analyze the performance of the different kernels WD (5), RBF (12), poly (11), WD-RBF (10), and WD-poly (as WD-RBF, but with polynomial substring kernel), combined with different encodings (pca, zscale, blosum50). The alleles were chosen to comprise a small data set (A*2301, 104 examples) as well as a medium (B*5801, 988 examples) and a large (A*0201, 3,089 examples) data set from the IEDB benchmark [23]. Performances of the WD kernel and the WD-RBF kernel with blosum50 encoding were subsequently analyzed on all 35 human MHC alleles contained in the IEDB benchmark. We used two times nested 5-fold cross-validation, i.e. two nested cross-validation loops, to (1) perform model selection over the kernel and regularization parameters (inner loop) and (2) estimate the prediction performance (outer loop) (see, e.g., page S30 of the supporting online material of ). Performance is measured by averaging the area under the ROC curve (auROC).
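The two nested cross-validation loops can be sketched as follows. This is an illustrative skeleton only, not the evaluation code used for the experiments: `evaluate` is a hypothetical user-supplied callback that trains on one index set with the given parameters and returns a score (for example the auROC) on the other, and the fold assignment shown here is one simple choice among many.

```python
import random


def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k disjoint folds."""
    idx = list(indices)
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def all_but(folds, held_out):
    """Indices of every fold except the one at position held_out."""
    return [i for f, fold in enumerate(folds) if f != held_out for i in fold]


def nested_cv(n, param_grid, evaluate, k=5, seed=0):
    """Outer loop: performance estimate. Inner loop: model selection.
    evaluate(train_idx, test_idx, params) must return a score."""
    rng = random.Random(seed)
    outer = k_folds(range(n), k, rng)
    outer_scores = []
    for o, test in enumerate(outer):
        train = all_but(outer, o)
        inner = k_folds(train, k, rng)

        def inner_score(params):
            scores = [evaluate(all_but(inner, v), val, params)
                      for v, val in enumerate(inner)]
            return sum(scores) / len(scores)

        # Parameters are chosen without ever touching the outer test fold.
        best = max(param_grid, key=inner_score)
        outer_scores.append(evaluate(train, test, best))
    return sum(outer_scores) / len(outer_scores)
```

In the setting above, `param_grid` would hold combinations of kernel parameters (such as σ) and the regularization parameter, so that the reported score is not biased by model selection.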
Learning curve analysis. The performance dependence on the amount of training data was analyzed on allele A*0201 in 100 runs of two times nested 5-fold cross-validation to average over different data splits to reduce random fluctuations of the performance values. Performance is measured by averaging the area under the ROC curve (auROC). In each run, thirty percent of the available data was used for testing. From the remaining data training sets of different sizes (20, 31, 50, 80, 128, 204, 324, 516, 822, 1,308) were selected randomly.
Evaluation of string kernels describing ℓ-mer content
Mismatch kernel. For the comparison of the mismatch kernel and the mismatch-RBF kernel, protein classification data and experimental setup were taken from the supplementary material of . The ROC50 score, i.e. the area under the ROC curve computed up to the first 50 false positives, is used as performance measure.
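The ROC50 score can be computed from a ranked list of predictions. The sketch below follows the definition given above; the normalization to the range [0, 1] (dividing by the number of positives times the number of false positives considered) and the handling of ties (broken by sort order) are assumptions, and the function name is hypothetical.

```python
def roc_n(scores, labels, n=50):
    """Area under the ROC curve up to the first n false positives,
    normalized to [0, 1]. labels are truthy for positives.
    Ties in the scores are not treated specially."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, y in ranked if y)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    tp = fp = area = 0
    for _, y in ranked:
        if y:
            tp += 1
        else:
            fp += 1
            area += tp  # true positives ranked above this false positive
            if fp == n:
                break
    return area / (n_pos * min(n, n_neg))
```

With fewer than n negatives in the data, this reduces to the ordinary area under the ROC curve; a perfect ranking always yields a score of one.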
Profile kernel. For the comparison of the profile kernel and the profile-RBF kernel, protein classification data and experimental setup were taken from the supplementary material of . Corresponding PSI-blast profiles were taken from [27]. The ROC50 score is used as performance measure.
All SVM computations were performed using the Matlab interface of the freely available large-scale machine learning toolbox Shogun [28]. All kernels used are implemented as part of the toolbox.
Results and discussion
The main goal of this work is the methodological improvement of existing string kernels by incorporation of prior knowledge on AA properties. In order to analyze the benefits of the proposed modifications we conducted performance comparisons between the original and the modified string kernels.
String kernels for localized signals
Performances of kernels utilizing sequential structure and/or AA properties on three MHC alleles
Finally, we compare our results with the ones obtained using a multi-task learning (MTL) method for MHC classification described in . Here, the authors used two kernels, one to define the similarity between examples and one to define the similarity between tasks. They report an auROC of 90.3% using two string kernels. When using the WD-RBF for computing the similarity between the examples, we can slightly improve upon their performance to 90.5% (data splits and model selection as in ). Hence, the AA property-enhanced kernels once more have a slight, but consistent advantage over the base-line kernels. Besides the performance improvement, the modified WD kernel allows, at least theoretically, for the extraction of biological insights: employing an analysis method analogous to  individual patterns of AA properties that are relevant for the classification can be extracted.
String kernels describing ℓ-mer content
To show that the modification of kernels describing the ℓ-mer content of sequences also has desirable properties, we chose the problem of protein remote homology detection. Here, the task is to classify proteins into folds, super-families or families based on their sequence. This problem has previously been tackled in a series of papers [11, 21, 22], which suggested the spectrum kernel, followed by the mismatch kernel and finally the profile kernel. The profile kernel already uses AA similarities based on blast or PSI-blast profiles, which leads to significant improvements. Here, we would like to illustrate that using the AA property-enhanced versions of these kernels can still lead to an improvement. We chose the family classification task for this analysis since it was considered in all of the previous studies mentioned.
Comparison of kernels for ℓ-mer content with their AA-property enhanced counterparts.
Spectrum (ℓ = 5)
Spectrum-RBF (ℓ = 5, σ = 1)
Mismatch (ℓ = 5, m = 1)
Mismatch-RBF (ℓ = 5, m = 1, σ = 1)
Profile (ℓ = 5, τ = 7.5)
Profile-RBF (ℓ = 5, τ = 7.5, σ = 100)
In summary, in our experiments we can observe that the newly proposed kernels lead to consistently better performances than the string kernels on AA sequences as well as the non-substring kernels.
We have proposed new kernels that combine the benefits of physico-chemical descriptors for amino acids with the ones of string kernels. String kernels are powerful and expressive, yet one needs sufficiently many examples during training to learn relationships between amino acids in the very high dimensional space induced by the string kernel. Standard kernels based on physico-chemical descriptors of amino acids, on the other hand, cannot exploit the sequential structure of the input sequences and implicitly generate many more features, numerous of which will be biologically implausible. Here, one also needs more examples to learn which subset of features is needed for accurate discrimination, especially for longer protein sequences.
We could show that the proposed modifications of the WD kernel yield significant improvements in the prediction of MHC-binding peptides. As expected, the improvement is particularly strong when data is less abundant. For protein remote homology detection AA property-enhanced kernels can also lead to significant performance improvements. For the most sophisticated kernels using blast or PSI-blast profiles, however, information about the similarities of AAs can already be derived from the profiles and the improvement is marginal.
Overall, our experiments demonstrate that the proposed kernels indeed lead to a better performance than string kernels and non-substring kernels. These improvements are not major, but consistent. It has to be noted that a big difference between the previously proposed kernels and the proposed kernels cannot be expected: The proposed kernels essentially work on subsets of the features of previously proposed kernels and the improvements that we observe mainly come from the SVM’s degraded performance when including uninformative features (which typically is not very pronounced).
In summary, the proposed modifications, in particular the combination with the RBF AA substring kernel, consistently yield improvements without seriously affecting the computing time (except for the Spectrum-RBF kernel). In all formulations, the original string kernel formulation can be recovered by appropriately choosing σ. Hence, when σ is included in model selection, the performance of the proposed kernels should be at least as good as the original string kernels. We therefore believe that the proposed kernels should be preferred over the original formulations for any protein sequence classification task.
List of abbreviations used
MHC: major histocompatibility complex
SVM: support vector machine
This work was partly supported by Deutsche Forschungsgemeinschaft (SFB 685, project B1).
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S8.
1. Saigo H, Vert JP, Ueda N, Akutsu T: Protein homology detection using string alignment kernels. Bioinformatics 2004, 20(11):1682–9. 10.1093/bioinformatics/bth141
2. Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C: Profile-based string kernels for remote homology detection and motif extraction. Proceedings IEEE Computational Systems Bioinformatics Conference 2004.
3. Weston J, Leslie C, Ie E, Zhou D, Elisseeff A, Noble WS: Semi-supervised protein classification using cluster kernels. Bioinformatics 2005, 21(15):3241–3247. 10.1093/bioinformatics/bti497
4. Rätsch G, Sonnenburg S, Srinivasan J, Witte H, Müller KR, Sommer RJ, Schölkopf B: Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Comput Biol 2007, 3(2):e20. 10.1371/journal.pcbi.0030020
5. Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A, Krüger N, Sonnenburg S, Rätsch G: mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res 2009, 19(11):2133–43. 10.1101/gr.090597.108
6. Schultheiss SJ, Busch W, Lohmann JU, Kohlbacher O, Rätsch G: KIRMES: kernel-based identification of regulatory modules in euchromatic sequences. Bioinformatics 2009, 25(16):2126–33. 10.1093/bioinformatics/btp278
7. Roth V, Fischer B: Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics 2007, 8(Suppl 2):S12. 10.1186/1471-2105-8-S2-S12
8. Ong CS, Zien A: An Automated Combination of Kernels for Predicting Protein Subcellular Localization. In Proceedings of the 8th Workshop on Algorithms in Bioinformatics (WABI). Lecture Notes in Bioinformatics, Springer; 2008:168–179.
9. Jacob L, Vert JP: Efficient peptide-MHC-I binding prediction for alleles with few known binders. Bioinformatics 2008, 24(3):358–66. 10.1093/bioinformatics/btm611
10. Röttig M, Rausch C, Kohlbacher O: Combining structure and sequence information allows automated prediction of substrate specificities within enzyme families. PLoS Comput Biol 2010, 6:e1000636. 10.1371/journal.pcbi.1000636
11. Leslie C, Eskin E, Noble WS: The Spectrum Kernel: A String Kernel For SVM Protein Classification. In Proceedings of the Pacific Symposium on Biocomputing 2002, 564–575.
12. Rätsch G, Sonnenburg S: Accurate Splice Site Detection for Caenorhabditis elegans. In Kernel Methods in Computational Biology. Edited by: Schölkopf B, Tsuda K, Vert JP. MIT Press; 2004:277–298.
13. Meinicke P, Tech M, Morgenstern B, Merkl R: Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics 2004, 5:169.
14. Shen B, Bai J, Vihinen M: Physicochemical feature-based classification of amino acid mutations. Protein Eng Des Sel 2008, 21:37–44. 10.1093/protein/gzm084
15. Pfeifer N, Kohlbacher O: Multiple Instance Learning Allows MHC Class II Epitope Predictions Across Alleles. In Algorithms in Bioinformatics. Volume 5251. Lecture Notes in Computer Science, Springer; 2008:210–221.
16. Rangwala H, Karypis G: Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics 2005, 21(23):4239–4247. 10.1093/bioinformatics/bti687
17. Venkatarajan M, Braun W: New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. Journal of Molecular Modeling 2001, 7:445–453. 10.1007/s00894-001-0058-5
18. Ong CS, Zien A: An Automated Combination of Kernels for Predicting Protein Subcellular Localization. In Proceedings of the 8th Workshop on Algorithms in Bioinformatics (WABI). Lecture Notes in Bioinformatics, Springer; 2008:186–179.
19. Schölkopf B, Burges CJC, Smola AJ (Eds): Advances in Kernel Methods: Support Vector Learning. Cambridge, MA, USA: MIT Press; 1999.
20. Tung CW, Ho SY: POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties. Bioinformatics 2007, 23(8):942–949. 10.1093/bioinformatics/btm061
21. Leslie C, Eskin E, Cohen A, Weston J, Noble WS: Mismatch string kernels for discriminative protein classification. Bioinformatics 2004, 20(4):467–476. 10.1093/bioinformatics/btg431
22. Leslie C, Eskin E, Weston J, Noble W: Mismatch String Kernels for Discriminative Protein Classification. Bioinformatics 2004, 20(4):467–476. 10.1093/bioinformatics/btg431
23. Peters B, Bui HH, Frankild S, Nielsen M, Lundegaard C, Kostem E, Basch D, Lamberth K, Harndahl M, Fleri W, Wilson SS, Sidney J, Lund O, Buus S, Sette A: A Community Resource Benchmarking Predictions of Peptide Binding to MHC-I Molecules. PLoS Comput Biol 2006, 2(6):e65. 10.1371/journal.pcbi.0020065
24. Kawashima S, Ogata H, Kanehisa M: AAindex: Amino Acid Index Database. Nucleic Acids Research 1999, 27:368–369. 10.1093/nar/27.1.368
25. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America 1992, 89(22):10915–10919. 10.1073/pnas.89.22.10915
26. Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA, Chen H, Frazer KA, Huson DH, Schölkopf B, Nordborg M, Rätsch G, Ecker JR, Weigel D: Common Sequence Polymorphisms Shaping Genetic Diversity in Arabidopsis thaliana. Science 2007, 317(5836):338–342. 10.1126/science.1138632
27. The Leslie Lab - Software - String Kernels. [http://cbio.mskcc.org/leslielab/software/string-kernels]
28. Sonnenburg S, Rätsch G, Henschel S, Widmer C, Behr J, Zien A, de Bona F, Binder A, Gehl C, Franc V: The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research 2010, 11:1799–1802.
29. Sonnenburg S, Zien A, Philips P, Rätsch G: POIMs: positional oligomer importance matrices - understanding support vector machine-based signal detectors. Bioinformatics 2008, 24(13):i6–14. 10.1093/bioinformatics/btn170
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.