Volume 11 Supplement 1

## Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010)

- Research
- Open Access
- Published:

# A weighted *q*-gram method for glycan structure classification

*BMC Bioinformatics***volume 11**, Article number: S33 (2010)

## Abstract

### Background

Glycobiology pertains to the study of carbohydrate sugar chains, or glycans, in a particular cell or organism. Many computational approaches have been proposed for analyzing these complex glycan structures, which are chains of monosaccharides. The monosaccharides are linked to one another by glycosidic bonds, which can take on a variety of comformations, thus forming branches and resulting in complex tree structures. The *q*-gram method is one of these recent methods used to understand glycan function based on the classification of their tree structures. This *q*-gram method assumes that for a certain *q*, different *q*-grams share no similarity among themselves. That is, that if two structures have completely different components, then they are completely different. However, from a biological standpoint, this is not the case. In this paper, we propose a weighted *q*-gram method to measure the similarity among glycans by incorporating the similarity of the geometric structures, monosaccharides and glycosidic bonds among *q*-grams. In contrast to the traditional *q*-gram method, our weighted *q*-gram method admits similarity among *q*-grams for a certain *q*. Thus our new kernels for glycan structure were developed and then applied in SVMs to classify glycans.

### Results

Two glycan datasets were used to compare the weighted *q*-gram method and the original *q*-gram method. The results show that the incorporation of *q*-gram similarity improves the classification performance for all of the important glycan classes tested.

### Conclusion

The results in this paper indicate that similarity among *q*-grams obtained from geometric structure, monosaccharides and glycosidic linkage contributes to the glycan function classification. This is a big step towards the understanding of glycan function based on their complex structures.

## Background

Glycobiology pertains to the study of carbohydrate sugar chains, or glycans, in a particular cell or an organism. Glycans consist of monosaccharides, which are linked to other monosaccharides by a glycosidic bond, which can take on a variety of conformations. Thus a single monosaccharide may be linked to at most four other monosaccharide (children), in addition to its parent monosaccharide. Glycans are usually found on proteins and lipids, linked from its root monosaccharide and branching out. The structure at the leaves are understood to be important for various biological functions [1]. These often consist of many branches and can become rather complex. Although major glycan motifs (often found at the leaves) are known to be biomarkers for specific diseases, the mechanism of this glycan recognition is still not well understood. Bioinformatics methods for glycobiology have recently developed rapidly due to the availability of glycan structure databases provided by major institutions including KEGG and the CFG (Consortium for Functional Glycomics). One of the important bioinformatics techniques applied to glycans is support vector machines (SVMs) for the extraction of species-specific glycan substructures [2]. This method has been further applied to the extraction of leukemia-specific glycan motifs in human [3]. The application of tree kernels for glycan classification [4] was then developed at the same time as the *q*-gram distribution kernel for disease-specific glycan motif extraction [5].

These methods were all groundbreaking in the sense that glycan structures have complex tree structures and that no one had ever tried, let alone were successful, in finding biologically important glycan motifs from glycan structure data alone. However, glycobiology is still a difficult topic, since these methods have not been further improved nor applied to actual problems in glycobiology. One of the reasons for this lay in the fact that, just as amino acids have physico-chemical properties that allow them to be grouped with one another, the components of glycans (monosaccharides and glycosidic linkages) also have such properties which have been ignored in the previous methods.

Thus, in this paper, we attempt to incorporate such similarity measures in glycan kernels in order to increase the classification accuracy of glycan structures, and thus improve glycan function prediction. Namely, we focused on the *q*-gram kernel due to the large variety of features that can be handled with a fairly simple kernel. We incorporated similarity measures as resolved by the KCaM (KEGG Carbohydrate Matcher) algorithm [6], a dynamic programming method for tree structure alignment. We then compared the accuracy of our new method compared to the previous methods and show that the prediction can be improved by using our proposed weighted *q*-gram method.

## Methods

We apply the Support Vector Machine (SVM) to classify glycans based on different kernels constructed from their tree structures. We first describe the original *q*-gram method and then our proposed weighted *q*-gram method. We constructed three kernels for the weighted *q*-gram method: Linkage (LK) kernel, KCaM (KM) kernel and Linkage KCaM (LKM) kernel. Linkage kernel considers the similarity of glycosidic linkage, their layers and the tree structure among *q*-grams. In previous work, the concept of *layers* was used to indicate the distance of a particular monosaccharide from the root. We also make use of this concept in our kernels by defining the layer of a monosaccharide and its linkage towards its parent as the number of linkages between the monosaccharide and the root. KCaM kernel considers the similarity of tree structure among different *q*-grams. Linkage KCaM kernel further improves KCaM kernel by incorporating glycosidic linkage and layer similarity of *q*-grams.

### The *q*-gram method

Suppose we are given *n* glycans *G* = {*g*_{1},⋯, *g*_{
n
}}. A *q*-gram is defined as a subtree with *q* nodes isomorphic to a path, in which every node has no more than two adjacent nodes, for *q* ≥ 1. We denote the set of all *q*-grams existing in these *n* glycans to be *q*-gram set where *d*_{
q
}is the total number of *q*-grams in the set Φ_{
q
}. Let *Q*-gram set Φ = {Φ_{
q
}}_{g∈Q}, where *Q* is a set of integers, and we usually take *Q* = {2,⋯,9}. A *q*-gram is also called as *Q*-gram if *q* ∈ *Q*. Then *Q*-gram set Φ is the set of all the *Q*-grams existing in the *n* glycans. For each glycan *g*_{
s
}, the *q*-gram representation is a column vector , where means the number of *i* th *q*-gram in the glycan *g*_{
s
}, while the *Q*-gram representation is the column vector **x**_{
s
}, which is the concatenation of the vectors , where *q* ∈ *Q*. Let and *X* = [**x**_{
1
},⋯,**x**_{
n
}] ∈ *R*^{d×n}, where *d* is the total number of *Q*-grams in the set Φ. We note that *d* = ∑_{q∈Q}*d*_{
q
}. The feature space Γ_{
q
}using *q*-grams is spanned by the column vectors of *X*_{
q
}, and the feature space Γ_{
Q
}using *Q*-grams is spanned by the column vectors of *X*. The feature space Γ_{
Q
}is the direct sum of feature spaces Γ_{
q
}for all *q* ∈ *Q*, i.e., Γ_{
Q
}= ⊕_{q∈Q}Γ_{
q
}. The *q*-gram similarity between glycan *g*_{
s
}and *g*_{
t
}can be represented as the inner product of their feature vectors,

Thus the *q*-gram kernel *K*_{
q
}can be simply obtained by The *Q*-gram kernel *K*_{
Q
}can then be obtained by *K*_{
Q
}= *X*^{T}*X*. We note that it is easy to see that *K*_{
Q
}= ∑_{q∈Q}*K*_{
q
}.

### Weighted *q*-gram method

We note that the *q*-gram method does not consider the similarity among the *q*-grams in Φ_{
q
}. Suppose the similarity among the *q* -grams in Φ_{
q
}is represented as a weight matrix *W*_{
q
}, where is the similarity score between the *i* th *q*-gram and *j* th *q*-gram. We represent the similarity between the two glycan *g*_{
s
}and *g*_{
t
}as follows:

Thus the weighted *q*-gram kernel can be obtained by . The weighted *q*-gram method is indeed a generalization of *q*-gram method. The *q*-grams and *p*-grams are considered to have no similarity if *p* is not equal to *q*. One can then obtain the weighted *Q*-gram kernel by

If we add a weight *α*_{
q
}on each *q*-gram, we can get double weighted *Q*-gram kernel *K*_{
αwQ
}= ∑_{q∈Q}*α*_{
q
}*K*_{
wq
}. Here *W*_{
q
}is required to be a positive definite matrix since the kernel matrix should be positive definite. Let . The feature space Γ_{
wq
}for weighted *q*-gram method is spanned by the column vectors of *S*_{
q
}*X*_{
q
}, and the feature space for the weighted *Q*-gram method is Γ_{
wQ
}= ⊕_{q∈Q}Γ_{
wq
}.

*q*-gram similarity

In this section, we discuss some methods to obtain the similarity among the *q*-grams.

#### Linkage kernel

Each *q*-gram is composed of several monosaccharides and bonds. We represent the *i* th *q*-gram by , where *l*^{i}is the layer of this *q*-gram, *σ*^{i}represents the structure shape of this *q*-gram, is an ordered set of monosaccharides, is an ordered set of bonds.

For each *q*, the *q*-grams with different structure shape is considered to be totally different.

Now we consider two *q*-grams, and . We denote the similarity between them by *S*_{
q
}(*i*, *j*). As mentioned above, if *σ*^{i}and *σ*^{j}is not same, then *S*_{
q
}(*i*, *j*) = 0. The similarity between the *q*-grams with the same structure shape depends on the similarity of the layers, monosaccharides, and bonds between them. We represent the similarity between *q*-grams and by

where *S*^{l}(*l*^{i}, *l*^{j}) is the similarity of the layers of the two *q*-grams, is the similarity of the monosaccharides and , and is the similarity of the bonds and .

For the similarity between the layers, it is easy to obtain the similarity function by using the distance of the layers. We then define the similarity

After the linkage similarity *S*_{
q
}is constructed for all *q*-grams, the linkage kernel can be obtained similarly as discussed in the previous section. This method of measuring the similarity among *q*-grams is denoted as the LK method. For the monosaccharides and bonds, since there are too many possibilities of combinations, in the database, we only choose the most frequent ones, which appear most in mammals. In the KEGG glycan database, the most frequent monosaccharides (including other chemical compounds) and the bonds are shown in Table 1. The similarity among monosaccharides can be obtained from the chemical structure comparison method SIMCOMP developed by Hattori et al. [7, 8]. The bonds similarity is set by their chemical meanings and is shown in Table 2.

#### KCaM kernel and Linkage KCaM kernel

KCaM is a tool implementing a polynomial-time dynamic programming algorithm to align glycan tree structures. From KCaM we can obtain the similarity score for two glycan structures in the range of 0 to 100. Here we apply KCaM to the *q*-gram structures to obtain the similarity scores among the *q*-grams. We denote this method of measuring *q*-gram similarity as the KM method.

We note that KCaM also does not consider the similarity of the monosaccharides and bonds among the glycan structures and the layers of *q*-grams. However, a method of computing the score matrices of monosaccharides and linkages have been developed, which can be utilized by KCaM [9]. Thus we incorporated the linkage similarity and layers into the score matrix and developed a new KCaM method called Linkage KCaM (LKM) method. We then use the alignment scores of all *q*-grams obtained by the Linkage KCaM as the their similarity scores.

Let the alignment score matrix obtained by KCaM or Linkage KCaM be *S*_{
q
}. Here *S*_{
q
}(*i*, *j*) is the alignment score between the *i* th and *j* th *q*-grams. Then the weight matrix *W*_{
q
}can be represented using *S*_{
q
}. Since the kernel should be positive definite, we revise *W*_{
q
}to Thus we have the kernel

## Results and discussion

### Glycan data

We used two sets of glycan data to evaluate the classification performance using different kernels we constructed. Glycan structure data was retrieved from the KEGG/GLYCAN database [10] and their annotations are from CarbBank/CCSD database [11]. The first data set includes glycan structures that are known to be related to leukemic cells, as was originally analyzed by [3]. We also tested our method on another data set of glycans related to cystic fibrosis as was previously used by [5]. The data used here is summarized in Table 3. The statistical analysis for all *q*-grams in the glycan database shows the distribution of each monosaccharide and chemical bond in Table 1. The most frequent chemical bond in the glycan database is *b* 1 - 4, and the most frequent monosaccharide is GlcNAc. The similarity among monosaccharides is calculated by the software SIMCOMP [7, 8], and the similarity among chemical bond is obtained by their chemical meanings, which is shown in Table 2.

## Results

We evaluated the performance of our weighted *q*-gram method by comparing it with the original *q*-gram method on glycan data. SVM was performed in the environment of Matlab version 2007a. Table 4 shows the Area Under the ROC curve(AUC) using SVM classifier for the two datasets. The AUC values in Table 4 are the average of 5-fold cross validation performed 50 times. We used two ways to combine the information from different *q*-gram representations. First is the *Q*-gram method, which represents each glycan by the vector of {2,⋯,9}-grams. The other is the recently developed multiple kernel learning [12], which generates an optimal weighting of each representation for each classification task. In Table 4, for the traditional *q*-gram method, KCaM kernel method and Linkage KCaM kernel method, the results using *Q*-gram and multiple kernel method are both reported. For the linkage kernel method, we only reported the results for Linkage 2-gram and Linkage 3-gram since when *q* is larger than 3, the *q*-gram similarity calculation becomes exponential.

Note that Linkage kernel considers the linkage similarity among *q*-grams, KCaM kernel considers the structure similarity among *q*-grams, while linkage KCaM kernel considers both. The left of Table 4 reports the results of the classification for leukemic cells. The result shows that all the three kernels for weighted *q*-gram methods preforms better than traditional *q*-gram. In particular, for smaller values of *q*, the weighted kernels obtain higher accuracy, supporting the fact that it has been shown that trimer (tri-saccharide) structures have been most effective in discriminating between glycans related to leukemic cells compared to non-leukemic blood components. The right of Table 4 reports the results of cystic fibrosis. The AUCs of all the methods indicate that KM kernel performs best for all the methods. This implies that *q*-gram structure similarity contributes to cystic fibrosis-related glycan classification. In contrast to the leukemic cell data set, the result that linkage and Linkage KCam method performs worse than traditional *q*-gram method for lower values of *q* indicates that the *q*-gram similarity obtained from comparing their glycosidic linkages may not be helpful for the classification of cystic fibrosis. This result is supported by the fact that it has been shown in previous work that the sulfate-GlcNAc dimeric structure, which does not contain any glycosidic bond data, is a potential glycan biomarker. Interestingly, the best performance of KCaM kernel especially for higher values of *q* indicate that larger structures may be overlooked as potential biomarkers, thus illustrating the effectiveness of weighted *q*-gram method in this classification task. In conclusion, the results for two glycan data sets both show the improved prediction accuracy of the weighted *q*-gram method for glycan biomarker prediction.

## Discussion

In this paper, we focused on glycan classification by considering the similarity among *q*-gram. In the traditional *q*-gram method, similarity among sub-structures of glycans (*q*-gram) is not considered. Thus it assumes that *q*-gram similarity contributes nothing in glycan classification. We proposed a weighted *q*-gram method based on three kernels including linkage kernel, KCaM kernel and linkage KCaM kernel, and then compared the performance of weighted *q*-gram method with that of the traditional *q*-gram method using two glycan data sets. The results show that the consideration of similarity among *q*-grams contributes to higher accuracy of classification. Further research may focus on the biological properties of the glycan structure features that contribute to the increase in performance of these kernels, thus aiding the understanding of the mechanisms behind glycan structure recognition.

## Conclusion

We proposed a new approach in this paper to classify glycans based on their tree structures. The method attempts to involve similarity among *q*-grams in glycan classification. Three kernels (Linkage, KCaM and Linkage KCaM) are constructed for weighted *q*-gram methods. The experimental results showed its effective role in classification of glycan functions.

## References

- 1.
Varki A, Cummings R, Esko J, Freeze H, Hart G, Etzler M:Essentials of Glycobiology. second edition. New York: Cold Spring Harbor Laboratory Press; 2008. [http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=glyco2]

- 2.
Hizukuri Y, Yamanishi Y, Hashimoto K, Kanehisa M: Extraction of Species-Specific Glycan Substructures. Genome Informatics 2004, 15: 69-81.

- 3.
Hizukuri Y, Yamanishi Y, Nakamura O, Yagi F, Goto S, Kanehisa M: Extraction of leukemia specific glycan motifs in humans by computational glycomics. Carbohydr Res 2005, 340: 2270-2278. 10.1016/j.carres.2005.07.012

- 4.
Yamanishi Y, Bach F, Vert JP: Glycan classification with tree kernels. Bioinformatics 2007, 23(10):1211-1216. 10.1093/bioinformatics/btm090

- 5.
Kuboyama T, Hirata K, Aoki-Kinoshita K, Kashima H, Yasuda H: A Gram Distribution Kernel Applied to Glycan Classification and Motif Extraction. Genome Informatics 2006, 17(2):25-34.

- 6.
Aoki K, Yamaguchi A, Okuno Y, Akutsu T, Ueda N, Kanehisa M, Mamitsuka H: Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 2003, 14: 134-143.

- 7.
Hattori M, Okuno Y, Goto S, Kanehisa M: Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. Journal of the American Chemical Society 2003, 125: 11853-11865. 10.1021/ja036030u

- 8.
Hattori M, Okuno Y, Goto S, Kanehisa M: Heuristics for Chemical Compound Matching. Genome Informatics 2003, 14: 144-153.

- 9.
Aoki K, Mamitsuka H, Akutsu T, Kanehisa M: A score matrix to reveal the hidden links in glycans. Bioinformatics 2005, 21(8):1457-1463. 10.1093/bioinformatics/bti193

- 10.
Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita K, Ueda N, Hamajima M, Kawasaki T, Kanehisa M: KEGG as a glycome informatics resource. Glycobiology 2006, 16(5):63R-70R. 10.1093/glycob/cwj010

- 11.
Doubet S, Albersheim P: CarbBank. Glycobiology 1992, 2(6):505-507. 10.1093/glycob/2.6.505

- 12.
Bach F, Thibaux R, Jordan M: Computing regularization paths for learning multiple kernels. Adv Neural Inform Process Syst 2005, 17: 73-80.

## Acknowledgements

The authors would like to thank the two anonymous referees and the editor for their helpful suggestions and corrections. Research supported in part by HKRGC Grant No. 7017/07P, HKUCRGC Grants, HKU Strategy Research Theme fund on Computational Sciences, Hung Hing Ying Physical Research Sciences Research Grant, National Natural Science Foundation of China Grant No. 10971075 and Guangdong Provincial Natural Science Grant No. 9151063101000021.

This article has been published as part of *BMC Bioinformatics* Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.

## Author information

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors' contributions

LL and KFA conceived the idea and designed the research. TY processed the data. LL performed the research and analyzed the results. LL, KFA and WC wrote the paper. All authors read and approved the final manuscript.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

#### Published

#### DOI

### Keywords

- Support Vector Machine
- Cystic Fibrosis
- Monosaccharide
- Glycosidic Bond
- Glycosidic Linkage