A weighted q-gram method for glycan structure classification

Background Glycobiology pertains to the study of carbohydrate sugar chains, or glycans, in a particular cell or organism. Many computational approaches have been proposed for analyzing these complex glycan structures, which are chains of monosaccharides. The monosaccharides are linked to one another by glycosidic bonds, which can take on a variety of comformations, thus forming branches and resulting in complex tree structures. The q-gram method is one of these recent methods used to understand glycan function based on the classification of their tree structures. This q-gram method assumes that for a certain q, different q-grams share no similarity among themselves. That is, that if two structures have completely different components, then they are completely different. However, from a biological standpoint, this is not the case. In this paper, we propose a weighted q-gram method to measure the similarity among glycans by incorporating the similarity of the geometric structures, monosaccharides and glycosidic bonds among q-grams. In contrast to the traditional q-gram method, our weighted q-gram method admits similarity among q-grams for a certain q. Thus our new kernels for glycan structure were developed and then applied in SVMs to classify glycans. Results Two glycan datasets were used to compare the weighted q-gram method and the original q-gram method. The results show that the incorporation of q-gram similarity improves the classification performance for all of the important glycan classes tested. Conclusion The results in this paper indicate that similarity among q-grams obtained from geometric structure, monosaccharides and glycosidic linkage contributes to the glycan function classification. This is a big step towards the understanding of glycan function based on their complex structures.


Background
Glycobiology pertains to the study of carbohydrate sugar chains, or glycans, in a particular cell or an organism. Glycans consist of monosaccharides, which are linked to other monosaccharides by a glycosidic bond, which can take on a variety of conformations. Thus a single monosaccharide may be linked to at most four other monosaccharide (children), in addition to its parent monosaccharide. Glycans are usually found on proteins and lipids, linked from its root monosaccharide and branching out. The structure at the leaves are understood to be important for various biological functions [1]. These often consist of many branches and can become rather complex. Although major glycan motifs (often found at the leaves) are known to be biomarkers for specific diseases, the mechanism of this glycan recognition is still not well understood. Bioinformatics methods for glycobiology have recently developed rapidly due to the availability of glycan structure databases provided by major institutions including KEGG and the CFG (Consortium for Functional Glycomics). One of the important bioinformatics techniques applied to glycans is support vector machines (SVMs) for the extraction of species-specific glycan substructures [2]. This method has been further applied to the extraction of leukemia-specific glycan motifs in human [3]. The application of tree kernels for glycan classification [4] was then developed at the same time as the q-gram distribution kernel for diseasespecific glycan motif extraction [5].
These methods were all groundbreaking in the sense that glycan structures have complex tree structures and that no one had ever tried, let alone were successful, in finding biologically important glycan motifs from glycan structure data alone. However, glycobiology is still a difficult topic, since these methods have not been further improved nor applied to actual problems in glycobiology. One of the reasons for this lay in the fact that, just as amino acids have physico-chemical properties that allow them to be grouped with one another, the components of glycans (monosaccharides and glycosidic linkages) also have such properties which have been ignored in the previous methods.
Thus, in this paper, we attempt to incorporate such similarity measures in glycan kernels in order to increase the classification accuracy of glycan structures, and thus improve glycan function prediction. Namely, we focused on the q-gram kernel due to the large variety of features that can be handled with a fairly simple kernel. We incorporated similarity measures as resolved by the KCaM (KEGG Carbohydrate Matcher) algorithm [6], a dynamic programming method for tree structure alignment. We then compared the accuracy of our new method compared to the previous methods and show that the prediction can be improved by using our proposed weighted q-gram method.

Methods
We apply the Support Vector Machine (SVM) to classify glycans based on different kernels constructed from their tree structures. We first describe the original q-gram method and then our proposed weighted q-gram method. We constructed three kernels for the weighted q-gram method: Linkage (LK) kernel, KCaM (KM) kernel and Linkage KCaM (LKM) kernel. Linkage kernel considers the similarity of glycosidic linkage, their layers and the tree structure among q-grams. In previous work, the concept of layers was used to indicate the distance of a particular monosaccharide from the root. We also make use of this concept in our kernels by defining the layer of a monosaccharide and its linkage towards its parent as the number of linkages between the monosaccharide and the root. KCaM kernel considers the similarity of tree structure among different q-grams. Linkage KCaM kernel further improves KCaM kernel by incorporating glycosidic linkage and layer similarity of q-grams.
The q-gram method Suppose we are given n glycans G = {g 1 ,..., g n }. A q-gram is defined as a subtree with q nodes isomorphic to a path, in which every node has no more than two adjacent nodes, for q ≥ 1. We denote the set of all q-grams existing in these n glycans to be q-gram set where d q is the total number of q-grams in the set Φ q . Let Q-gram set Φ = {Φ q } g Q , where Q is a set of integers, and we usually take Q = {2,...,9}. A q-gram is also called as Q-gram if q Q.
Then Q-gram set Φ is the set of all the Q-grams existing in the n glycans. For each glycan g s , the q-gram representation is a column vector , where x is q means the number of ith q-gram φ q i in the glycan g s , while the Q-gram representation is the column vector x s , which is the concatenation of the vectors x s q , where d is the total number of Q-grams in the set Φ. We note that d = ∑ q Q d q . The feature space Γ q using q-grams is spanned by the column vectors of X q , and the feature space Γ Q using Q-grams is spanned by the column vectors of X. The feature space Γ Q is the direct sum of feature spaces Γ q for all q Q, i.e., Γ Q = ⊕ q Q Γ q . The q-gram similarity between glycan g s and g t can be represented as the inner product of their feature vectors, Thus the q-gram kernel K q can be simply obtained by The Q-gram kernel K Q can then be obtained by K Q = X T X. We note that it is easy to see that K Q = ∑ q Q K q .
Weighted q-gram method We note that the q-gram method does not consider the similarity among the q-grams in Φ q . Suppose the similarity among the q-grams in Φ q is represented as a weight matrix W q , where W i j w q ij q ( , ) = is the similarity score between the ith q-gram and jth q-gram. We represent the similarity between the two glycan g s and g t as follows:

= ∑
Thus the weighted q-gram kernel can be obtained by The weighted q-gram method is indeed a generalization of q-gram method. The q-grams and p-grams are considered to have no similarity if p is not equal to q. One can then obtain the weighted Q-gram kernel by If we add a weight a q on each q-gram, we can get double weighted Q-gram kernel K awQ = ∑ q Q a q K wq . Here W q is required to be a positive definite matrix since the kernel matrix should be positive definite. Let W S S q q T q = . The feature space Γ wq for weighted q-gram method is spanned by the column vectors of S q X q , and the feature space for the weighted Q-gram method is Γ wQ = ⊕ q Q Γ wq .

q-gram similarity
In this section, we discuss some methods to obtain the similarity among the q-grams.

Linkage kernel
Each q-gram is composed of several monosaccharides and bonds. We represent the ith q-gram by φ σ For each q, the q-grams with different structure shape is considered to be totally different.
Now we consider two q-grams, φ σ We denote the similarity between them by S q (i, j). As mentioned above, if s i and s j is not same, then S q (i, j) = 0. The similarity between the q-grams with the same structure shape depends on the similarity of the layers, monosaccharides, and bonds between them. We represent the similarity between q-grams φ q i and φ q j by For the similarity between the layers, it is easy to obtain the similarity function by using the distance of the layers. We then define the similarity After the linkage similarity S q is constructed for all q-grams, the linkage kernel can be obtained similarly as discussed in the previous section. This method of measuring the similarity among q-grams is denoted as the LK method. For the monosaccharides and bonds, since there are too many possibilities of combinations, in the database, we only choose the most frequent ones, which appear most in mammals. In the KEGG glycan database, the most frequent monosaccharides (including other chemical compounds) and the bonds are shown in Table 1. The similarity among monosaccharides can be obtained from the chemical structure comparison method SIMCOMP developed by Hattori et al. [7,8]. The bonds similarity is set by their chemical meanings and is shown in Table 2.

KCaM kernel and Linkage KCaM kernel
KCaM is a tool implementing a polynomial-time dynamic programming algorithm to align glycan tree structures. From KCaM we can obtain the similarity score for two glycan structures in the range of 0 to 100. Here we apply KCaM to the q-gram structures to obtain the similarity scores among the q-grams. We denote this method of measuring q-gram similarity as the KM method.
We note that KCaM also does not consider the similarity of the monosaccharides and bonds among the glycan structures and the layers of q-grams. However, a method of computing the score matrices of monosaccharides and linkages have been developed, which can be utilized by KCaM [9]. Thus we incorporated the linkage similarity and layers into the score matrix and developed a new KCaM method called Linkage KCaM (LKM) method. We then use the alignment scores of all q-grams obtained by the Linkage KCaM as the their similarity scores.
Let the alignment score matrix obtained by KCaM or Linkage KCaM be S q . Here S q (i, j) is the alignment score between the ith and jth q-grams. Then the weight matrix W q can be represented using S q . Since the kernel

Results and discussion
Glycan data We used two sets of glycan data to evaluate the classification performance using different kernels we constructed. Glycan structure data was retrieved from the KEGG/GLYCAN database [10] and their annotations are Occurrence gives the times each chemical bond and monosaccharides appears in glycan database. Percent means the percentage of each chemical bond or monosaccharides in glycan database.  from CarbBank/CCSD database [11]. The first data set includes glycan structures that are known to be related to leukemic cells, as was originally analyzed by [3]. We also tested our method on another data set of glycans related to cystic fibrosis as was previously used by [5]. The data used here is summarized in Table 3. The statistical analysis for all q-grams in the glycan database shows the distribution of each monosaccharide and chemical bond in Table 1. The most frequent chemical bond in the glycan database is b1 -4, and the most frequent monosaccharide is GlcNAc. The similarity among monosaccharides is calculated by the software SIMCOMP [7,8], and the similarity among chemical bond is obtained by their chemical meanings, which is shown in Table 2.

Results
We evaluated the performance of our weighted q-gram method by comparing it with the original q-gram method on glycan data. SVM was performed in the environment of Matlab version 2007a. Table 4 shows the Area Under the ROC curve(AUC) using SVM classifier for the two datasets. The AUC values in Table 4 are the average of 5-fold cross validation performed 50 times. We used two ways to combine the information from different q-gram representations. First is the Q-gram method, which represents each glycan by the vector of {2,...,9}-grams. The other is the recently developed multiple kernel learning [12], which generates an optimal weighting of each representation for each classification task. In Table 4, for the traditional q-gram method, KCaM kernel method and Linkage KCaM kernel method, the results using Q-gram and multiple kernel method are both reported. For the linkage kernel method, we only reported the results for Linkage 2-gram and Linkage 3-gram since when q is larger than 3, the q-gram similarity calculation becomes exponential.
Note that Linkage kernel considers the linkage similarity among q-grams, KCaM kernel considers the structure similarity among q-grams, while linkage KCaM kernel considers both. The left of Table 4 reports the results of the classification for leukemic cells. The result shows that all the three kernels for weighted q-gram methods preforms better than traditional q-gram. In particular, for smaller values of q, the weighted kernels obtain higher accuracy, supporting the fact that it has been shown that trimer (tri-saccharide) structures have been most effective in discriminating between glycans related to leukemic cells compared to non-leukemic blood components. The right of Table 4 reports the results of cystic fibrosis. The AUCs of all the methods indicate that KM kernel performs best for all the methods. This implies that q-gram structure similarity contributes to cystic fibrosis-related glycan classification. In contrast to the leukemic cell data set, the result that linkage and Linkage KCam method performs worse than traditional q-gram method for lower values of q indicates that the q-gram similarity obtained from comparing their glycosidic linkages may not be helpful for the classification of cystic fibrosis. This result is supported by the fact that it has been shown in previous work that the sulfate-GlcNAc dimeric structure, which does not contain any glycosidic bond data, is a potential glycan biomarker. Interestingly, the best performance of KCaM kernel especially for higher values of q indicate that larger structures may be overlooked as potential biomarkers, thus illustrating the effectiveness of weighted q-gram method in this classification task. In conclusion, the results for two glycan data sets both show the improved  prediction accuracy of the weighted q-gram method for glycan biomarker prediction.

Discussion
In this paper, we focused on glycan classification by considering the similarity among q-gram. In the traditional q-gram method, similarity among sub-structures of glycans (q-gram) is not considered. Thus it assumes that q-gram similarity contributes nothing in glycan classification. We proposed a weighted q-gram method based on three kernels including linkage kernel, KCaM kernel and linkage KCaM kernel, and then compared the performance of weighted q-gram method with that of the traditional q-gram method using two glycan data sets.
The results show that the consideration of similarity among q-grams contributes to higher accuracy of classification. Further research may focus on the biological properties of the glycan structure features that contribute to the increase in performance of these kernels, thus aiding the understanding of the mechanisms behind glycan structure recognition.

Conclusion
We proposed a new approach in this paper to classify glycans based on their tree structures. The method attempts to involve similarity among q-grams in glycan classification. Three kernels (Linkage, KCaM and Linkage KCaM) are constructed for weighted q-gram methods.
The experimental results showed its effective role in classification of glycan functions.