Sequence-based identification of recombination spots using pseudo nucleic acid representation and recursive feature extraction by linear kernel SVM
© Li et al.; licensee BioMed Central Ltd. 2014
Received: 9 July 2014
Accepted: 29 September 2014
Published: 20 November 2014
Identification of the recombination hot/cold spots is critical for understanding the mechanism of recombination as well as the genome evolution process. However, experimental identification of recombination spots is both time-consuming and costly. Developing an accurate and automated method for reliably and quickly identifying recombination spots is thus urgently needed.
Here we propose a novel approach that fuses features from pseudo nucleic acid composition (PseNAC), including NAC, n-tier NAC and pseudo dinucleotide composition (PseDNC). A recursive feature extraction procedure based on a linear kernel support vector machine (SVM) was then used to rank the integrated features and extract the optimal ones. An SVM was trained on these optimal features to identify recombination spots. To evaluate the performance of the proposed method, a jackknife cross-validation test was carried out on a benchmark dataset. The overall accuracy of this approach was 84.09%, which was 0.37% to 3.79% higher than that of state-of-the-art tools.
Comparison results suggested that linear kernel SVM is a useful vehicle for identifying recombination hot/cold spots.
Meiotic recombination is a vital biological process in diploid organisms, which can be described by two processes: meiosis and recombination. During the former, the genome is divided into two gametes for sexual reproduction, while during the latter, diverse gametes combine to form new genetic variations. Initiated by double-strand breaks (DSB), recombination provides opportunities for the natural exchange of genetic material. By segregating advantageous and deleterious genes, it optimizes genotypes and accelerates the evolution of sexually reproducing organisms.
Identification of recombination spots is pivotal in understanding the mechanism of the main driving force in the genome evolution process. Recombination usually occurs in regions of 1 to 2.5 kilobases. To find out whether these regions share DNA sequence and structural features, many global mapping studies have been performed to map DSB sites on chromosomes [2, 3]. Genomic regions with relatively high recombination frequencies are known as hotspots, while those with relatively low frequencies are known as coldspots. Studies showed that most hotspots were located in intergenic regions. Hotspot positions were also associated with particular chromatin structures and sequence features, such as GC-rich regions, repeats, consensus DNA motifs and dinucleotide bias. Identifying the recombination hot/cold spots is crucial for understanding the mechanism of recombination as well as the genome evolution process. Since experimental methods are time-consuming and laborious, they may fail to cope with large numbers of genomic sequences. Thus, efficient and accurate computational approaches for identifying recombination hot/cold spots are required.
The computational approaches for identifying recombination hot/cold spots consist of the following three components: i) feature extraction for sample representation; ii) optimal feature selection; iii) algorithm selection for classification. Finding proper features to represent the sequences is the first step towards building a novel model. In the past, several kinds of features have been used to identify hotspots. For example, K-mer frequencies of the nucleotide sequence were used as features to predict hotspots in the IDQD model. However, one of the most important problems in this model, as well as in computational proteomics, is that it neglects the global sequence-order effect. In order to keep considerable sequence-order information of samples in a discrete model, Chou et al. proposed the concept of pseudo amino acid composition (PseAAC) [4–6], which has been applied to many prediction tasks in computational proteomics [7–10], such as prediction of protein S-nitrosylation sites, protein quaternary structural attributes, protein subcellular locations, membrane protein types, etc. To identify recombination spots, Chen et al. further proposed the concept of pseudo dinucleotide composition (PseDNC) to represent DNA sequences. Inspired by their model, here we propose the concept of pseudo nucleic acid composition (PseNAC) to represent DNA sequences.
Feature selection is another critical step in classification. By decreasing the model's complexity, the selection of optimal features can reduce the risk of overfitting and improve efficiency. Commonly used feature selection techniques can be grouped into three categories: filter, wrapper and embedded methods [11, 12]. Filter methods, such as Euclidean distance, the t-test and χ²-statistics, eliminate poorly informative features according to a feature relevance score before any classification algorithm is applied. Wrapper and embedded methods often provide better results than filter methods because they evaluate features as subsets and interact with the respective classification algorithm. Unlike wrapper methods, which depend on a given but separate classification algorithm, embedded methods perform both tasks, feature selection and classifier construction, together. Thus embedded methods, such as SVM-RFE, are computationally less intensive than wrapper methods.
Many different prediction algorithms have been developed in computational biology, such as support vector machine (SVM), discriminant algorithm, neural network algorithm, k-nearest neighbor algorithm (k-NN), naive Bayes, random forest classifier and increment of diversity (ID) [14–19]. Among them, SVM has proven to be very powerful in many classification tasks owing to its efficiency in analyzing large numbers of samples and its adaptability to new data [20–22].
In the current work, an SVM-based model was developed to further improve the prediction of recombination spots from the pseudo nucleic acid composition (PseNAC) of DNA sequences, including NAC, n-tier NAC and PseDNC. Before being passed to the SVM classifier, crucial features were selected by a powerful feature selection tool, SVM-RFE, for reliable and accurate identification of recombination spots. In the jackknife test, our method showed improved prediction performance compared with existing methods.
Results and discussion
Before optimizing the regularization parameter C in LIBSVM, it should be noted that the dimension of the initial feature vector increases exponentially with the number of contiguous residues in each component. For example, the dimension of the feature vector was 4^2 = 16 for components of two contiguous residues, whereas it was 4^10 = 1,048,576 for components of ten contiguous residues. However, the larger the number of contiguous residues, the higher the proportion of redundant information in the feature vector. Because of this redundancy and the limits of computing power, we fixed the maximum number of contiguous residues at five.
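The exponential growth described above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name kmer_dimension and the variable total_up_to_5 are introduced here for illustration and do not come from the original work, and the total below counts only the NAC and n-tier NAC components, not the PseDNC terms.

```python
# The DNA alphabet {A, C, G, T} has four letters, so there are 4**k
# distinct components of k contiguous residues.
ALPHABET_SIZE = 4


def kmer_dimension(k: int) -> int:
    """Return the number of distinct k-residue components (4**k)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return ALPHABET_SIZE ** k


# Dimension of each tier from k = 1 (NAC) to k = 10.
dims = {k: kmer_dimension(k) for k in range(1, 11)}
for k, d in dims.items():
    print(f"k = {k:2d}: {d:>9,d} features")

# Combined size of the 1st- to 5th-tier components (PseDNC terms excluded).
total_up_to_5 = sum(kmer_dimension(k) for k in range(1, 6))
print("NAC to 5th-tier NAC:", total_up_to_5)
```

Capping the tier at five keeps the composition part of the vector in the low thousands, whereas the tenth tier alone would already exceed one million dimensions.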
Comparison with other methods
A comparison of the proposed method with the existing methods
In this study, an SVM-based model was constructed for the identification of recombination hot/cold spots by selecting the optimal features from pseudo nucleic acid composition, i.e., NAC, 2nd ~ 5th tier NACs and PseDNC. The overall accuracy was 84.09% on this benchmark dataset, indicating that this approach performed well in identifying recombination spots. This supports the assumption that pseudo nucleic acid composition can better reflect the features of a DNA sequence through a discrete model and thereby improve the prediction results for recombination spot identification. In addition, the recursive feature extraction method adopted here was very powerful and effective in obtaining the optimal features from high-dimensional feature vectors. Therefore, it improved the final prediction performance and accelerated the computing procedure. The good performance of our predictor in identifying recombination spots suggests that our method can serve as a useful tool for such prediction tasks. Since user-friendly and publicly accessible web-servers represent the future direction for developing more useful methods, models or predictors, we will make efforts in our future work to provide a web-server for the method presented in this study.
Admittedly, some challenges remain to be solved in recombination spot identification. Although our method incurs a somewhat high computational cost for feature ranking, it effectively captures the key features that improve the identification of recombination spots. In addition, we focused only on the identification of recombination spots, an important step in meiotic recombination. Future work will aim to clarify the relationship between the optimal features selected by this approach and the mechanism of meiotic recombination. Given its good performance in identifying recombination spots, we will also apply our method to other novel pattern recognition tasks, e.g., prediction of facial features from DNA, DNA methylation level, sparse protein-DNA binding landscapes and small RNA targets, networks and interaction domains.
There is no human or animal experiment in this work.
In this study, the dataset for identifying recombination spots was taken from Liu et al., and contains 490 recombination hotspots and 591 recombination coldspots. It has been widely used as a benchmark dataset for identifying recombination spots.
where f (AA) was the normalized occurrence frequency of AA in the DNA sequence, f (AC) was that of AC, f (AG) was that of AG, and so on up to f (TT). In order to capture more local sequence information, components of three, four, five, etc. contiguous residues, i.e., the 3rd, 4th, 5th, etc. tier NACs, were also incorporated into the PseNAC, which similarly gave 4^3, 4^4, 4^5, … features for each DNA sequence. Although the local sequence-order information among the most contiguous residues of a DNA sequence was thereby considered, the global sequence-order information was still not reflected. To address this issue, the pseudo dinucleotide composition, i.e., PseDNC, was introduced here.
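The NAC and n-tier NAC components can be illustrated with a short Python sketch. It is not the authors' implementation: the function names kmer_composition and tiered_nac are hypothetical, and it is assumed here that each frequency is normalized by the number of valid sliding windows and that windows containing characters other than A, C, G and T are skipped.

```python
from itertools import product


def kmer_composition(seq: str, k: int) -> dict:
    """Normalized occurrence frequency of every k-residue component.

    Assumption: frequencies are divided by the number of valid windows;
    windows with characters outside A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def tiered_nac(seq: str, max_k: int = 5) -> list:
    """Concatenate the 1st- to max_k-th tier compositions into one vector."""
    vector = []
    for k in range(1, max_k + 1):
        comp = kmer_composition(seq, k)
        vector.extend(comp[kmer] for kmer in sorted(comp))
    return vector


example = "ACGTACGTAA"                       # toy sequence, not from the dataset
dinuc = kmer_composition(example, 2)         # f(AA), f(AC), ..., f(TT)
features = tiered_nac(example, max_k=5)      # NAC plus 2nd to 5th tier NACs
print(len(dinuc), len(features))
```

Each tier is a sliding-window count over the sequence, so the vector captures only local order; the PseDNC terms would be appended to this vector to add the global sequence-order information.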
The normalized values for the six DNA dinucleotide physical structures
Feature extraction by SVM-RFE
In the previous step, the NAC, n-tier NAC and PseDNC of each DNA sequence were merged into a single feature vector. A recursive feature selection approach, SVM-RFE, was then applied to select a group of important features for reliable identification of recombination spots. By iteratively training a linear kernel SVM, the SVM-RFE algorithm produced a ranking list of all features, each time removing only the one feature with the lowest influence on the predictions of the SVM model [26, 27]. The first item in the ranking list was the most relevant feature for identifying recombination spots, and the last item was the least relevant. Finally, the top K features in the ranking list were selected to build an SVM model.
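The ranking procedure can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: a small batch subgradient-descent linear SVM (the hypothetical helper train_linear_svm) stands in for LIBSVM, the squared weight of each feature is used as the elimination criterion (the usual choice for SVM-RFE with a linear kernel), and the learning rate, epoch count and toy data are invented for the example.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Fit (w, b) by batch subgradient descent on
    0.5 * ||w||^2 + C * sum(hinge loss). A stand-in for LIBSVM."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)              # gradient of the regularization term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # sample violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, C=1.0):
    """Return feature indices ranked from most to least relevant."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y, C=C)
        # Remove only the one feature with the smallest squared weight.
        worst = min(range(len(remaining)), key=lambda i: w[i] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]           # last eliminated = most relevant


# Toy data (labels +1 / -1): only the first feature separates the classes.
X = [[2.0, 0.1, 0.0], [2.0, -0.1, 0.0], [-2.0, 0.1, 0.0], [-2.0, -0.1, 0.0]]
y = [1, 1, -1, -1]
ranking = svm_rfe(X, y)
K = 1                                 # hypothetical number of features to keep
top_k = ranking[:K]
print(ranking, top_k)
```

Because one SVM is retrained for every removed feature, the cost grows with the number of features, which is consistent with the computational burden of feature ranking noted in the discussion.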
The SVM classifier
SVM is a universal approximator. It is a supervised learning model for analyzing data and recognizing patterns. SVM is attractive for biological sequence analysis because of its ability to handle large input spaces, large datasets and noise. Thus it has been widely used in bioinformatics applications [28–32]. The basic idea behind SVM is to represent a sample as a point in a high-dimensional feature space and then assign it to a category based on the optimal separating hyperplane. In this study, the SVM implementation was based on the package LIBSVM 3.17 [34, 35]. Since the SVM-RFE algorithm was based on a linear kernel SVM, the linear kernel function was also used to obtain the best classification hyperplane. Thus only one free parameter, the regularization parameter C, needed to be optimized. Its value was determined by a grid search. Finally, the SVM module predicted whether a DNA sequence was a recombination hotspot or coldspot using the top features and the optimal value of parameter C.
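A grid search for C scored by the jackknife (leave-one-out) test can be sketched as below. This is an assumption-laden illustration rather than the authors' procedure: the linear SVM is again a simple subgradient-descent stand-in for LIBSVM, the search range C_GRID is hypothetical because the original grid is not given, and the toy data are invented.

```python
def train_linear_svm(X, y, C, lr=0.01, epochs=200):
    """Batch subgradient descent on 0.5*||w||^2 + C*sum(hinge loss).
    A stand-in for LIBSVM with a linear kernel."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (e.g. hotspot) or -1 (e.g. coldspot)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def jackknife_accuracy(X, y, C):
    """Leave-one-out accuracy: each sample is held out exactly once."""
    correct = 0
    for i in range(len(X)):
        w, b = train_linear_svm(X[:i] + X[i + 1:], y[:i] + y[i + 1:], C)
        if predict(w, b, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


def grid_search_C(X, y, grid):
    """Return (best_C, best_accuracy) over the candidate values in grid."""
    scores = {C: jackknife_accuracy(X, y, C) for C in grid}
    best_C = max(scores, key=scores.get)   # ties resolve to the first C
    return best_C, scores[best_C]


# Toy data restricted to the selected top features (values are invented).
X = [[2.0, 1.0], [1.5, -0.5], [2.5, 0.5],
     [-2.0, 0.5], [-1.5, -1.0], [-2.5, 1.0]]
y = [1, 1, 1, -1, -1, -1]
C_GRID = [2 ** e for e in range(-5, 6, 2)]  # hypothetical search range
best_C, best_acc = grid_search_C(X, y, C_GRID)
print(best_C, best_acc)
```

With a linear kernel the search is one-dimensional, so every candidate value of C can be scored with a full jackknife pass before the final model is trained.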
Assessment of prediction performance
This work was partially supported by grants from the National Natural Science Foundation of China (No. 81302134 and No. 31100953), the Program for Changjiang Scholars and Innovative Research Team in University (IRT 13050 to HY), the Shanghai Leading Academic Discipline Project (No. S30405), the Innovation Program of Shanghai Municipal Education Commission (No. 12YZ088) and the Program of Shanghai Normal University (DZL121).
- Chen W, Feng PM, Lin H, Chou KC: iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 2013, 41 (6): e68. 10.1093/nar/gks1450.
- Liu G, Liu J, Cui X, Cai L: Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. J Theor Biol. 2012, 293: 49-54.
- Chou KC: Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001, 43 (3): 246-255. 10.1002/prot.1035.
- Xu Y, Wen X, Shao XJ, Deng NY, Chou KC: iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. Int J Mol Sci. 2014, 15 (5): 7594-7610. 10.3390/ijms15057594.
- Xu Y, Ding J, Wu LY, Chou KC: iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One. 2013, 8 (2): e55844. 10.1371/journal.pone.0055844.
- Xiao X, Min JL, Wang P, Chou KC: iCDI-PseFpt: identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints. J Theor Biol. 2013, 337: 71-79.
- Jia C, Lin X, Wang Z: Prediction of protein S-nitrosylation sites based on adapted normal distribution bi-profile Bayes and Chou's pseudo amino acid composition. Int J Mol Sci. 2014, 15 (6): 10410-10423. 10.3390/ijms150610410.
- Sun XY, Shi SP, Qiu JD, Suo SB, Huang SY, Liang RP: Identifying protein quaternary structural attributes by incorporating physicochemical properties into the general form of Chou's PseAAC via discrete wavelet transform. Mol Biosyst. 2012, 8 (12): 3178-3184. 10.1039/c2mb25280e.
- Li L, Yu S, Xiao W, Li Y, Li M, Huang L, Zheng X, Zhou S, Yang H: Prediction of bacterial protein subcellular localization by incorporating various features into Chou's PseAAC and a backward feature selection approach. Biochimie. 2014, 104: 100-107.
- Han GS, Yu ZG, Anh V: A two-stage SVM method to predict membrane protein types by incorporating amino acid classifications and physicochemical properties into a general form of Chou's PseAAC. J Theor Biol. 2014, 344: 31-39.
- Liang Y, Liu C, Luan XZ, Leung KS, Chan TM, Xu ZB, Zhang H: Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics. 2013, 14: 198. 10.1186/1471-2105-14-198.
- Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics. 2007, 23 (19): 2507-2517. 10.1093/bioinformatics/btm344.
- Fernandez-Lozano C, Fernandez-Blanco E, Dave K, Pedreira N, Gestal M, Dorado J, Munteanu CR: Improving enzyme regulatory protein classification by means of SVM-RFE feature selection. Mol Biosyst. 2014, 10 (5): 1063-1071. 10.1039/c3mb70489k.
- De Santis M, Rinaldi F, Falcone E, Lucidi S, Piaggio G, Gurtner A, Farina L: Combining optimization and machine learning techniques for genome-wide prediction of human cell cycle-regulated genes. Bioinformatics. 2014, 30 (2): 228-233. 10.1093/bioinformatics/btt671.
- Ofer D, Linial M: NeuroPID: a predictor for identifying neuropeptide precursors from metazoan proteomes. Bioinformatics. 2014, 30 (7): 931-940. 10.1093/bioinformatics/btt725.
- Peng J, Lu J, Shen Q, Zheng M, Luo X, Zhu W, Jiang H, Chen K: In silico site of metabolism prediction for human UGT-catalyzed reactions. Bioinformatics. 2014, 30 (3): 398-405. 10.1093/bioinformatics/btt681.
- Huang C, Yuan J: Using radial basis function on the general form of Chou's pseudo amino acid composition and PSSM to predict subcellular locations of proteins with both single and multiple sites. Biosystems. 2013, 113 (1): 50-57. 10.1016/j.biosystems.2013.04.005.
- Liao B, Li Y, Jiang Y, Cai L: Using multi-instance hierarchical clustering learning system to predict yeast gene function. PLoS One. 2014, 9 (3): e90962. 10.1371/journal.pone.0090962.
- Wang J, Kou Z, Duan M, Ma C, Zhou Y: Using amino acid factor scores to predict avian-to-human transmission of avian influenza viruses: a machine learning study. Protein Pept Lett. 2013, 20 (10): 1115-1121. 10.2174/0929866511320100005.
- Dou Y, Yao B, Zhang C: PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. Amino Acids. 2014, 46 (6): 1459-1469. 10.1007/s00726-014-1711-5.
- Matsuta Y, Ito M, Tohsato Y: ECOH: an enzyme commission number predictor using mutual information and a support vector machine. Bioinformatics. 2013, 29 (3): 365-372. 10.1093/bioinformatics/bts700.
- Li L, Cui X, Yu S, Zhang Y, Luo Z, Yang H, Zhou Y, Zheng X: PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations. PLoS One. 2014, 9 (3): e92863. 10.1371/journal.pone.0092863.
- Qiu WR, Xiao X, Chou KC: iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int J Mol Sci. 2014, 15 (2): 1746-1766. 10.3390/ijms15021746.
- Chou KC, Shen HB: Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-Nearest Neighbor classifiers. J Proteome Res. 2006, 5 (8): 1888-1897. 10.1021/pr060167c.
- Goni JR, Perez A, Torrents D, Orozco M: Determining promoter location based on DNA structure first-principles calculations. Genome Biol. 2007, 8 (12): R263. 10.1186/gb-2007-8-12-r263.
- Wei X, Ai J, Deng Y, Guan X, Johnson DR, Ang CY, Zhang C, Perkins EJ: Identification of biomarkers that distinguish chemical contaminants based on gene expression profiles. BMC Genomics. 2014, 15: 248. 10.1186/1471-2164-15-248.
- Ota K, Oishi N, Ito K, Fukuyama H: A comparison of three brain atlases for MCI prediction. J Neurosci Methods. 2014, 221: 139-150.
- Li L, Zhang Y, Zou L, Li C, Yu B, Zheng X, Zhou Y: An ensemble classifier for eukaryotic protein subcellular location prediction using gene ontology categories and amino acid hydrophobicity. PLoS One. 2012, 7 (1): e31057. 10.1371/journal.pone.0031057.
- Karsenty S, Rappoport N, Ofer D, Zair A, Linial M: NeuroPID: a classifier of neuropeptide precursors. Nucleic Acids Res. 2014, 42 (Web Server issue): W182-W186.
- Fletez-Brant C, Lee D, McCallion AS, Beer MA: kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets. Nucleic Acids Res. 2013, 41 (Web Server issue): W544-W556.
- O'Fallon BD, Wooderchak-Donahue W, Crockett DK: A support vector machine for identification of single-nucleotide polymorphisms from next-generation sequencing data. Bioinformatics. 2013, 29 (11): 1361-1366. 10.1093/bioinformatics/btt172.
- Li LQ, Zhang Y, Zou LY, Zhou Y, Zheng XQ: Prediction of protein subcellular multi-localization based on the general form of Chou's pseudo amino acid composition. Protein Pept Lett. 2012, 19 (4): 375-387. 10.2174/092986612799789369.
- Zou L, Nan C, Hu F: Accurate prediction of bacterial type IV secreted effectors using amino acid composition and PSSM profiles. Bioinformatics. 2013, 29 (24): 3135-3142. 10.1093/bioinformatics/btt554.
- Jagga Z, Gupta D: Supervised learning classification models for prediction of plant virus encoded RNA silencing suppressors. PLoS One. 2014, 9 (5): e97446. 10.1371/journal.pone.0097446.
- Panwar B, Arora A, Raghava GP: Prediction and classification of ncRNAs using structural information. BMC Genomics. 2014, 15: 127. 10.1186/1471-2164-15-127.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.