Support vector machine for classification of meiotic recombination hotspots and coldspots in Saccharomyces cerevisiae based on codon composition
© Zhou et al; licensee BioMed Central Ltd. 2006
Received: 06 November 2005
Accepted: 26 April 2006
Published: 26 April 2006
Meiotic double-strand breaks occur at relatively high frequencies in some genomic regions (hotspots) and relatively low frequencies in others (coldspots). Hotspots and coldspots are receiving increasing attention in research into the mechanism of meiotic recombination. However, predicting hotspots and coldspots from DNA sequence information is still a challenging task.
We present a novel method for classification of hot and cold ORFs located in hotspots and coldspots respectively in Saccharomyces cerevisiae, using support vector machine (SVM), which relies on codon composition differences. This method has achieved a high classification accuracy of 85.0%. Since codon composition is a fusion of codon usage bias and amino acid composition signals, the ability of these two kinds of sequence attributes to discriminate hot ORFs from cold ORFs was also investigated separately. Our results indicate that neither codon usage bias nor amino acid composition taken separately performed as well as codon composition. Moreover, our SVM based method was applied to the full genome: We predicted the hot/cold ORFs from the yeast genome by using cutoffs of recombination rate. We found that the performance of our method for predicting cold ORFs is not as good as that for predicting hot ORFs. Besides, we also observed a considerable correlation between meiotic recombination rate and amino acid composition of certain residues, which probably reflects the structural and functional dissimilarity between the hot and cold groups.
We have introduced a SVM-based novel method to discriminate hot ORFs from cold ones. Applying codon composition as sequence attributes, we have achieved a high classification accuracy, which suggests that codon composition has strong potential to be used as sequence attributes in the prediction of hot and cold ORFs.
Meiotic recombination occurs more frequently in some regions of the eukaryotic genome than in others, with variations of several orders of magnitude observed in frequencies of meiotic exchange per unit physical distance . In Saccharomyces cerevisiae, meiotic recombination is initiated by double-strand DNA breaks (DSBs) [1, 2]. Some genomic regions in which meiotic DSBs occur at relatively high frequencies are called hotspots, and by contraries, the regions with relatively low frequencies are called coldspots . Although observations concerning individual hotspots and coldspots have given clues as to the mechanism of recombination initiation, the prediction of hotspots and coldspots from DNA sequence information is very limited . So, several global mapping studies have been performed to map DSB sites on chromosomes in yeast to determine whether they share common DNA sequences and/or structural elements [2–5]. It was found that, in S. cerevisiae, the position of hotspots were nonrandomly associated with certain transcriptional profiles and some feature of chromosome structure related to GC-richness regions, while coldspots may associate with the centromeres and telomeres [2, 6]. Further analysis showed that, in yeast, there is a significant correlation between codon usage bias and recombination rate, and the similar phenomenon was also observed in some other organisms, such as Drosophila melanogaster, mouse and human, which may be interpreted by biased genetic conversion during meiosis and/or Hill-Robertson interference [6–11].
Anyway, more mechanistic studies will be still critical in predicting hotspots and coldspots and defining corresponding operational rules . Although experimental techniques can be applied for this purpose, they are laborious and time-consuming and therefore become infeasible for large numbers of genomic sequences . Hence efficient and reliable computational methods for discriminating hotspots from coldspots are required.
An advanced method to this task employs statistical learning theory, typically the support vector machine (SVM), which is a type of supervised machine learning algorithm. Lin et al. have investigated the ability of SVM to discriminate ribosomal protein coding genes from all other genes of known function based on codon usage in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae . Codon usage was also formerly used by Friedel et al. as sequence attributes for separation of mixed plant-pathogen EST collections using SVM with high classification accuracy . In this paper, we present a novel method for prediction of hot and cold ORFs located in hotspots and coldspots respectively in S. cerevisiae using SVM based on codon composition differences. Our method can accurately differentiate hot ORFs from cold ORFs, which suggests that codon composition is a satisfying sequence attribute. Moreover, our SVM based method was applied to the full genome: We predicted the hot/cold ORFs from all selected ORFs in yeast genome by using cutoffs of recombination rate and found that the performance of our method for predicting cold ORFs is not as good as that for predicting hot ORFs. Besides, in this study, we also detected a significant correlation between meiotic recombination rate and amino acid composition of certain residues in proteins encoded by the ORFs located in recombination hot or cold spots.
Results and discussion
Classification of hot and cold ORFs based on different sets of attributes
The performance of SVM for differentiating the hot ORFs from the cold ORFs. The SVM models were trained using codon composition, codon bias and amino acid composition as sequence attributes respectively.
It was formerly suggested that the information contained in codon composition is representative of both codon usage bias and amino acid composition . So, we evaluated these two attribute types separately on the same random training and test splits as above. In our study, codon usage bias and amino acid composition were measured by relative synonymous codon usage (RSCU) and amino acid use frequency (FAAU) respectively (see Methods section for details). Using only RSCU values as sequence attributes resulted in an average classification accuracy of 82.5%, while only considering amino acid composition for prediction, the SVM performed more poorly with an accuracy of 65.8% (see Table 1).
Compared with the original codon use frequency (FCU), neither RSCU nor FAAU contributed any new information for classification and these additional attribute sets taken separately did not have the same excellent performance as codon composition. And, according to McNemar's test, the FCU based method shows significant difference with respect to the other two methods (P < 0.05). Therefore, we can conclude here that codon composition has strong potential to be used as sequence attributes in the classification of hot and cold ORFs.
Contribution of amino acid composition for classification
The performance of SVM with exclusion of certain amino acids. The SVM models were trained using codon composition, codon bias and amino acid composition as sequence attributes respectively, excluding His, Arg, Asn and Ser from the training datasets.
The difference in amino acid composition between the proteins coded by hot and cold ORFs probably reflects the structural and functional dissimilarity in these two groups. Gerton et al. have observed several correlations between hot or cold ORFs and gene functions: there was a very significant over-representation of the hotspot ORFs in the metabolism and ionic homeostasis functional classes, as well as an overrepresentation of coldspot ORFs in the categories of transport facilitation and intracellular transport, which may be due to an association of certain categories of genes with a particular chromatin structure that is favorable or unfavorable for initiating meiotic recombination . At the same time, several previous studies have shown that proteins with similar function may share a similar amino acid composition [14–16]. Therefore, it is understandable that there is a skewed amino acid composition between the proteins coded by hot and cold ORFs. However, why amino acid composition is associated with the variation in meiotic recombination rate has not been solved unambiguously.
Contribution of codon bias for classification
In S. cerevisiae, it was thought that the codon usage bias strongly correlates with gene expression . In this study, we used the codon adaptation index (CAI) to measure the gene expression level . High CAI genes are presumed to be highly expressed while low CAI genes are presumed to be lowly expressed. As observed in previous studies, we found that there is a significant negative correlation between the first axis and CAI (r = -0.92, P < 0.00001), which means that the first axis can discriminate genes with different expression level . This kind of correlation between codon usage and gene expression reflects the nature selection acting at translational level.
Besides, a considerable positive correlation was also found between the second axis and recombination rate (r = 0.39, P < 0.00001) in yeast. This kind of correlation was also observed in D. melanogaster, which has been explained by two proposed models [7–9]. In the first model, it is proposed that the reduction of codon bias in the regions with limited recombination is consistent with Hill-Robertson interference [7, 19]. The effectiveness of nature selection on one target due to stochastic effects of selection on linked targets is expected to be reduced by this conflict . Computer simulation also suggests that the effect of genetic linkage should be particularly damaging in the case of weak selection, including the selection acting on codon usage [8, 9, 19, 20]. The second model proposes that the correlation between codon usage pattern and recombination rate is caused by recombination-related mutational bias [8, 9]. There is a positive correlation between G+C content and recombination rate in the organisms, such as D. melanogaster, S. cerevisiae, mouse and human [2, 6–11], which probably can be interpreted by biased genetic conversion during meiosis .
In fact, there is a significant positive correlation between the frequency of G+C at the synonymous third codon position (GC3S) and the second axis (r = 0.96, P < 0.00001). At the same time, the recombination rate for each ORF is greatly positively correlated with GC3S in yeast (r = 0.40, P < 0.00001). Given that GC3S may reflect regional base compositional bias, the second model we mentioned above might explain the association between codon usage and recombination rate much better. In other words, biased genetic conversion between parental chromosomes during meiosis might mainly account for the correlation between codon usage variation and recombination in S. cerevisiae.
In addition, we have mentioned above that there are correlations between hot or cold ORFs and gene function. And it has also been reported that there is a relationship between gene function and codon usage pattern in eukaryotic organisms [23, 24], which may partially account for the correlation between codon bias and recombination rate in yeast.
The performance of SVM only trained on the key codons. The SVM models were trained using FCU and RSCU values of the codons listed in Figure 2 as sequence attributes respectively. These codons are thought to play important roles in classification.
Applying the SVM based method to full genome analysis
Our above analysis is only focused on the ORFs located in hot and cold spots which were detected by Gerton et al. Now, we will take all the ORFs in S. cerevisiae genome into account, including the neutral ORFs.
Prediction of meiotic recombination hot/cold spots in eukaryotic genomes based on computational technique is a challenging problem, partially because of current limited scale of published experimental data and few models to represent the training data. In this paper, we have introduced a SVM-based novel method to discriminate hot ORFs from cold ones. Using codon composition as sequence attributes, we have achieved a high classification accuracy. Since codon composition is a fusion of codon usage bias and amino acid composition signals, the ability of these two kinds of sequence attributes for classification was also investigated separately. Our results indicate that neither codon usage bias nor amino acid composition taken separately performed as well as codon composition.
Moreover, our SVM based method was also applied to the full genome: We tried to predict the hot/cold ORFs from all selected ORFs in yeast genome by using cutoffs of recombination rate. We found that the FCU based model still behaved better than the models based on RSCU and FAAU. However, the performance of our method for predicting cold ORFs is not as good as that for predicting hot ORFs.
In addition, we also observed a considerable correlation between meiotic recombination rate and amino acid composition of positively charged residue His and Arg and polar residue Asn and Ser in proteins encoded by the ORFs located in hot/cold spots, which probably reflects the structural and functional dissimilarity between the hot and cold groups.
Gerton et al. have estimated relative recombination rates for most of the S. cerevisiae loci using DNA microarrays at a resolution of about 2–3 kb . In S. cerevisiae, the average length of intergenic regions is about 500 bp. Therefore, although most hotspots are intergenic rather than intragenic, ORFs are used to locate hot/cold spots by Gerton et al. They detected 303 hot ORFs clustered into 177 hotspots whose recombination rates ranked in the top 12.5% and 49 cold ORFs clustered into 40 coldspots whose recombination rate ranked in the bottom 12.5% . We extracted the hot/cold ORFs (one of the cold ORFs listed in Gerton's paper doesn't exist) and other neutral ORFs from GenBank database. The corresponding recombination data were obtained from  which was generated by Gerton et al. The relative recombination rate of each ORF is determined by the median array value of seven microarray experiments.
Codon composition was measured by codon use frequency (FCU). Each ORF was represented by 61-dimensional vector with respect to the 61 sense codons. The FCU value of the jth codon for the ith amino acid was calculated thus:
FCU ij = obs ij /Total
where obsij is the observed number of the jth codon for the ithamino acid and Total is the total number of codons in the ORF. FCU is inherently the fusion of both codon usage bias and amino acid composition signals .
To examine synonymous codon usage without the confounding influence of amino acid composition of different gene samples, the values of relative synonymous codon usage (RSCU) of different codons in each sequence were also calculated. The RSCU value of the jth codon for the ith amino acid was calculated by
where obsij is the observed number of the jth codon for the ithamino acid which has ni type of synonymous codons. This attribute is independent of amino acid usage and it is obvious that RSCU value close to 1.0 indicates a lack of bias for the corresponding codon .
As well as the definition of FCU, amino acid use frequency (FAAU) was used to examine the amino acid composition of the corresponding protein for the selected ORF. The FAAU value of the ith amino acid was calculated as below:
FAAU i = obs i /Total
where obsi is the observed number of amino aicd i in protein sample, and Total is the total number of the twenty kinds of amino acids in the protein.
Support vector machine
Support vector machine (SVM) is a machine learning technique based on statistical theory. The principle of SVM is to find a maximum margin hyperplane for classification. If it is not possible in the given space, the instances are mapped to a higher dimensional space using the kernel function. Kernel function allows one to work in a higher dimensional space without computing all elements, which reduces computational complexity and connects the input space and the higher dimensional space directly. SVM will then choose a maximum soft margin separating hyperplane in this higher dimensional space, which separates the training instances by their classes. The classification of a test sample will then be determined by a sign function which is defined by the parameters of the hyperplane. The instances closest to the hyperplane are called support vectors and are vital for training . Using SVM requires careful attention to the kernel function and the magnitude of the trade-off between accuracy and generalization .
In this paper, we used SVMlight version 6.01 for data training and classifying. To ensure that the parameter estimation and model generation of SVM is completely independent of the test data, the original data set of hot and cold ORFs was divided into two parts: set A and B. Set A was used as a separate validation set to optimize the parameters of SVM, which contains one third of the original sequences (117 ORFs). The remainder sequences were put into set B for performance evaluation (234 ORFs). When classifying hot and cold ORFs, the SVM models were trained with all the hot ORFs with positive labels and all the sequences from coldspots with negative labels. Ten-fold cross-validation was used for both parameter and accuracy estimation. Ten-fold cross-validation is to divide the dataset of hot and cold ORFs randomly into ten subsets and then alternately using one subset for testing and the other nine sets for training. In our study, ORFs from the same hot or cold spot were placed in one subset. In other words, we didn't distribute the ORFs from the same spot to different subsets.
We used different kernel functions in our experiments, including linear function, polynomial function and radial basis function. To obtain SVM classifier with optimal performance, the penalty parameter C and the parameters of kernel function are tuned by the standard grid search method based on set A. In this study, the value of parameter C was optimized to 100. When employing FCU as sequence attributes, the best results were obtained using the radial basis function kernel with γ = 200; while employing RSCU and FAAU as sequence attributes, the best performance was obtained using the polynomial kernel of five and three degree respectively, which implies that the FCU based classification problem is much more non-linear than the problems based on RSCU and FAAU.
Measurement of SVM performance
The performance of the SVM model in distinguishing hot ORFs from cold ORFs was evaluated by ten-fold cross-validation using sensitivity, specificity, overall performance (OP) and accuracy. These indices are determined thus:
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
OP = (Sensitivity + Specificity)/2
Accuracy = (TP + TN)/(TP + TN + FP + FN)
where TP, TN, FP and FN are the number of true positives, true negatives, false positives and false negatives respectively.
When we examined the ability of our SVM based method to identifying hot or cold ORFs from the full genome, true positive rate and false positive rate are used for performance evaluation.
True positive rate = TP/(TP + FN)
False positive rate = FP/(TN + FP)
McNemar's test was also applied to decide whether significant differences exist between the performances of any two SVM models based on different attribute types. This test is performed by summarizing the classification errors of the two algorithms and has a lower Type I error (the probability of incorrectly detecting a difference when no difference exists) .
Principal component analysis
To assess the statistical difference between the hot and cold ORFs, principal component analysis (PCA) was used to investigate the major trends in codon usage variation among genes. The RSCU values of each ORF were plotted in a multidimensional space of 59 axes (excluding Met, Trp and stop codons) and PCA identified a series of new orthogonal axes accounting for the greatest variation among genes. The analysis yielded the coordinate of each ORF on each new axis, and the fraction of the total variation was accounted for by each axis.
This research work is a part of Project 60121101 supported by National Natural Science Foundation of China.
- Lichten M, Goldman AS: Meiotic recombination hotspots. Annu Rev Genet 1995, 29: 423–444. 10.1146/annurev.ge.29.120195.002231View ArticlePubMedGoogle Scholar
- Gerton JL, DeRisi J, Shroff R, Lichten M, Brown PO, Petes TD: Global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae . Proc Natl Acad Sci USA 2000, 97: 11383–11390. 10.1073/pnas.97.21.11383PubMed CentralView ArticlePubMedGoogle Scholar
- Baudat F, Nicolas A: Clustering of meiotic double-strand breaks on yeast chromosome III. Proc Natl Acad Sci USA 1997, 94: 5213–5218. 10.1073/pnas.94.10.5213PubMed CentralView ArticlePubMedGoogle Scholar
- Klein S, Zenvirth D, Dror V, Barton AB, Kaback DB, Simchen G: Patterns of meiotic double-strand breakage on native and artificial yeast chromosomes. Chromosoma 1996, 105: 276–84. 10.1007/s004120050185View ArticlePubMedGoogle Scholar
- Zenvirth D, Arbel T, Sherman A, Goldway M, Klein S, Simchen G: Multiple sites for double-strand breaks in whole meiotic chromosomes of Saccharomyces cerevisiae . EMBO J 1992, 11: 3441–3447.PubMed CentralPubMedGoogle Scholar
- Kliman RM, Irving N, Santiago M: Selection conflicts, gene expression, and codon usage trends in yeast. J Mol Evol 2003, 57: 98–109. 10.1007/s00239-003-2459-9View ArticlePubMedGoogle Scholar
- Kliman RM, Hey J: Reduced natural selection associated with low recombination in Drosophila melanogaster . Mol Biol Evol 1993, 10: 1239–1258.PubMedGoogle Scholar
- Marais G, Mouchiroud D, Duret L: Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes. Proc Natl Acad Sci USA 2001, 98: 5688–5692. 10.1073/pnas.091427698PubMed CentralView ArticlePubMedGoogle Scholar
- Marais G, Piganeau G: Hill-Robertson interference is a minor determinant of variations in codon bias across Drosophila melanogaster and Caenorhabditis elegans genome. Mol Biol Evol 2002, 19: 1399–1406.View ArticlePubMedGoogle Scholar
- Perry J, Ashworth A: Evolutionary rate of a gene affected by chromosomal position. Curr Biol 1999, 9: 987–989. 10.1016/S0960-9822(99)80430-8View ArticlePubMedGoogle Scholar
- Fullerton SM, Bernardo Carvalho A, Clark AG: Local rates of recombination are positively correlated with GC content in the human genome. Mol Biol Evol 2001, 18: 1139–1142.View ArticlePubMedGoogle Scholar
- Friedel CC, Jahn KH, Sommer S, Rudd S, Mewes HW, Tetko IV: Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage. Bioinformatics 2005, 21: 1383–1388. 10.1093/bioinformatics/bti200View ArticlePubMedGoogle Scholar
- Lin K, Kuang Y, Joseph JS, Kolatkar PR: Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae : lessons from supervised machine learning in functional genomics. Nucleic Acids Res 2002, 30: 2599–2607. 10.1093/nar/30.11.2599PubMed CentralView ArticlePubMedGoogle Scholar
- Karlin S, Blaisdell BE, Bucher P: Quantile distributions of amino acid usage in protein classes. Protein Eng 1992, 5: 729–38.View ArticlePubMedGoogle Scholar
- Cedano J, Patrick A, PerezPons J, Querol E: Relation between amino acid composition and cellular location of proteins. J Mol Biol 1997, 266: 594–600. 10.1006/jmbi.1996.0804View ArticlePubMedGoogle Scholar
- Nakashima H, Nishikawa K: Discrimination of intracellular and extracellular proteins using amino acid compositions and residue-pair frequencies. J Mol Biol 1994, 238: 54–61. 10.1006/jmbi.1994.1267View ArticlePubMedGoogle Scholar
- Sharp PM, Tuohy T, Mosurski K, Codon usage in yeast: Cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res 1986, 14: 5125–5143.PubMed CentralView ArticlePubMedGoogle Scholar
- Sharp PM, Li WH: The codon adaptation index – a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 1987, 15: 1281–1295.PubMed CentralView ArticlePubMedGoogle Scholar
- Comeron JM, Kreitman M, Aguade M: Natural selection on synonymous sites is correlated with gene length and recombination rate in Drosophila . Genetics 1999, 151: 239–249.PubMed CentralPubMedGoogle Scholar
- McVean GAT, Charlesworth B: The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics 2000, 155: 929–944.PubMed CentralPubMedGoogle Scholar
- Bennetzen JL, Hall BD: Codon selection in yeast. J Biol Chem 1982, 257: 3026–3031.PubMedGoogle Scholar
- Sharp PM, Cowe E: Synonymous codon usage in Saccharomyces cerevisiae . Yeast 1991, 7: 657–678. 10.1002/yea.320070702View ArticlePubMedGoogle Scholar
- Ma JM, Zhou T, Gu WJ, Sun X, Lu ZH: Cluster analysis of the codon use frequency of MHC genes from different species. Biosystems 2002, 65: 199–207. 10.1016/S0303-2647(02)00016-3View ArticlePubMedGoogle Scholar
- Richard JE, Lin K, Tan T: A functional significance for codon third bases. Gene 2000, 245: 291–298. 10.1016/S0378-1119(00)00042-1View ArticleGoogle Scholar
- Global Mapping of Meiotic Recombination Hotspots and Coldspots in the yeast Saccharomyces cerevisiae[http://derisilab.ucsf.edu/hotspots/]
- Sharp PM, Li WH: Codon usage in regulatory genes in Escherichia coli does not reflect selection for 'rare' codons. Nucleic Acids Res 1986, 14: 7737–7749.PubMed CentralView ArticlePubMedGoogle Scholar
- Vapnik VN: Statistical Learning Theory. New York: John Wiley & Sons; 1998.Google Scholar
- Dietterich DT: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 1998, 10: 1895–1923. 10.1162/089976698300017197View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.