Diagnostic prediction of complex diseases using phase-only correlation based on virtual sample template
© Wang et al.; licensee BioMed Central Ltd. 2013
- Published: 9 May 2013
Complex diseases induce perturbations to interaction and regulation networks in living systems, resulting in dynamic equilibrium states that differ among diseases and from normal states. Thus identifying gene expression patterns corresponding to different equilibrium states is of great benefit to the diagnosis and treatment of complex diseases. However, it remains a major challenge to deal with the high dimensionality and small sample size of the complex disease gene expression datasets currently used for discovering gene expression patterns.
Here we present a phase-only correlation (POC) based classification method for recognizing the type of complex diseases. First, a virtual sample template is constructed for each subclass by averaging all samples of that subclass in a training dataset. Then the label of a test sample is determined by measuring the similarity between the test sample and each template. This novel method can detect the similarity of overall patterns emerging from the differentially expressed genes or proteins while ignoring small mismatches.
The experimental results obtained on seven publicly available complex disease datasets, including microarray and protein array data, demonstrate that the proposed POC-based disease classification method is effective for diagnosing complex diseases and robust with regard to the number of initially selected features, and that its recognition accuracy is better than or comparable to other state-of-the-art machine learning methods. In addition, the proposed method does not require parameter tuning or data scaling, which can effectively reduce the occurrence of over-fitting and bias.
- Support Vector Machine
- Independent Component Analysis
- Feature Extraction Method
- Inverse Discrete Fourier Transform
- Dynamic Equilibrium State
Classification and diagnostic prediction of complex diseases such as cancers and neurodegenerative diseases using genomic or proteomic data can improve the quality of pathological diagnosis and help develop personalized treatment of these diseases. Although great efforts have been made in this field, early and precise diagnosis of complex diseases, followed by effective treatment, remains a great challenge. For example, histological methods cannot precisely distinguish between the subtypes of some cancers, on which the development of effective therapies depends. The molecular mechanisms of many neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD) are not fully understood, and diagnosis of these diseases relies on medical history evaluation combined with physical and neurological assessments [3, 4], often after irreversible brain damage or mental decline has already occurred.
The rationale for classification and diagnostic prediction of complex diseases using genomic or proteomic data is the assumption that complex diseases induce perturbations to the interaction and regulation networks of living systems, resulting in dynamic equilibrium states that differ among diseases and from normal states. Thus identifying the gene expression patterns corresponding to different equilibrium states is key to the success of these approaches. Many pattern recognition methods based on machine learning, such as k-nearest neighbor (KNN), support vector machines (SVM) [5–7], probabilistic neural networks (PNN) [8–10], the naive Bayes model (NBM) and random forest (RF) [4, 12], have been extensively explored for the classification and diagnostic prediction of complex diseases. These supervised learning methods are usually called model-based because a classification model must be constructed from a training set before it can be used to predict the label of a test sample. For model-based methods, however, feature extraction and feature selection techniques play a vital role in improving classification performance because gene expression profile (GEP) datasets are high-dimensional and contain few samples.
One example of feature extraction is the use of independent component analysis (ICA) to extract independent components from GEP data and thereby reduce sample dimensionality [7, 14, 15]. Other feature extraction methods such as principal component analysis (PCA), linear discriminant analysis (LDA) and locally linear discriminant embedding (LLDE) are also extensively applied to the dimensionality reduction of GEP data. Although such methods can achieve satisfactory classification performance, the extracted features are difficult to interpret biomedically. One example of gene selection is the Classification to Nearest Centroids (ClaNC) method for class-specific gene selection, which was proposed to determine a gene subset of a given size that maximizes classification accuracy. Although such methods have biomedical meaning, a great number of gene subsets can have the same predictive performance, which makes the choice among candidate gene subsets arbitrary. In fact, each method has its drawbacks, and many factors such as normalization, small sample size, noisy data, improper evaluation methods, and too many model parameters can lead to over-fitting of the constructed model, biased results and false discoveries [19–22]. Even so, "microarrays remain a useful technology to address a wide array of biological problems and the optimal analysis of these data to extract meaningful results still pose many bioinformatics challenges." Therefore, with the increasing accumulation of GEP and protein microarray data, it is still necessary to design more effective and more biomedically interpretable methods for recognizing complex disease types, which is also a requirement of clinical application.
For potential clinical applications, a candidate classification model should be evaluated on three aspects: accuracy, interpretability and practicality. Accordingly, a novel method should satisfy three requirements. 1) A good model should be simple and have no or few parameters to be tuned. If parameters are necessary, the model should be robust with regard to the variation of these parameters. 2) The obtained model should achieve the best or near-optimal disease classification performance compared with the relevant state-of-the-art methods, since no classification method always outperforms all others in all circumstances [23, 24]. 3) The obtained model should be clearly interpretable from a biomedical perspective, which requires that the intrinsic signatures of the sample set be used in designing the classification model.
Previous studies suggest that each complex disease type or subtype corresponds to a dynamic equilibrium state of the disease-induced genomic interaction and regulation network, and that different samples in the same state have similar gene expression profiles. Thus analyzing the similarity of gene expression profiles can in principle be used to distinguish different disease types or subtypes. A gene expression profile, which comprises the expression levels of numerous genes, can be likened to a digital image that consists of the luminance of pixels. In fact, both microarray and protein array data originate from digital images. We therefore suggest that it is reasonable to apply some image processing methods to analyze genomic and proteomic data. Based on this idea, we recently proposed two correlation filter-based tumor classification methods, namely, minimum average correlation energy (MACE) and optimal tradeoff synthetic discriminant function (OTSDF), to identify the overall pattern of differentially expressed genes (DEGs) corresponding to the tumor subtypes. Although the two methods perform well in classifying tumor subtypes, they have some drawbacks: 1) the two methods are sensitive to the data scaling method used to standardize the data; 2) although the template synthesized for each subtype in the frequency domain can be used to characterize the corresponding subtype, the biomedical significance of the synthesized template itself is not obvious enough. Thus it is highly desirable to explore other correlation methods that recognize disease types well but lack the weaknesses of the MACE and OTSDF-based disease classification methods.
Our further experiments indicate that phase-only correlation (POC) may be such a method. Like the MACE and OTSDF filters, POC uses a fast frequency-domain approach to estimate the degree of similarity between two samples. In recent years POC has also been extensively applied to image recognition [28, 29] and the identification of seismic events. In this study, we present a novel POC-based method for complex disease classification based on virtual sample templates using genomic or proteomic data. First, we construct one template for each subclass of a training set. Sample matching can then be performed by cross-correlating a test sample with each template using POC and analyzing the resulting correlation output. By comparing the peaks of the correlation outputs, the test sample is assigned to the class whose template has the highest similarity to it.
Complex disease datasets
The summary of the seven complex disease datasets (table body not recovered; the only surviving platform entry is Invitrogen ProtoArray v5.0).
Both protein and DNA microarray data can be represented with matrices. Thus we use DNA microarray data as an example to describe the design of our method. Let $G = \{g_1, \ldots, g_N\}$ denote a set of $N$ genes, and $S = \{s_1, \ldots, s_M\}$ denote a set containing $M$ samples, where $s_i = (s_{i,1}, \ldots, s_{i,N})^T$ denotes the gene expression column vector of the $i$-th sample on all features. Each sample is assigned a label $l_i \in \{c_1, \ldots, c_K\}$ denoting the $k$-th subclass set $S_k$, $k = 1, \ldots, K$, where $K$ is the total number of subclasses, $k$ is the index of the subclass with the label $c_k$, and $M_k$ represents the number of samples with the same label $c_k$.
Flowchart of analysis
1) The entire sample set is randomly split into two disjoint parts: a training set and a test set. We then select a certain number of DEGs or differentially expressed proteins (DEPs) using the Kruskal-Wallis rank sum test (KWRST) method .
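2) A virtual sample template for each subclass of the training set is constructed by averaging all samples in the subclass. The $j$-th component of the virtual sample template $t_k$ for subclass $k$ is $t_{k,j} = \frac{1}{M_k} \sum_{i:\, l_i = c_k} s_{i,j}$, the mean expression of the $k$-th subclass in the training set for feature $g_j$, where $M_k$ is the number of training samples in subclass $k$ and $s_{i,j}$ is the expression of feature $g_j$ in sample $i$. Thus the concept of a virtual sample template is the same as that of the centroid in nearest-centroid classification.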
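Steps 1) and 2) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: samples are stored as rows of `X` (the text uses column vectors), features are ranked by a Kruskal-Wallis H statistic computed without tie correction, and the function names `kruskal_h`, `select_features` and `build_templates` are hypothetical.

```python
import numpy as np

def rank_with_ties(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def kruskal_h(values, labels):
    """Kruskal-Wallis H statistic for one feature (no tie correction)."""
    ranks = rank_with_ties(values)
    n = len(values)
    total = 0.0
    for c in np.unique(labels):
        r = ranks[labels == c]
        total += r.sum() ** 2 / len(r)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)

def select_features(X, y, n_features):
    """Indices of the n_features columns with the largest H statistic."""
    h = np.array([kruskal_h(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-h, kind="mergesort")[:n_features]

def build_templates(X, y):
    """One virtual sample template (class mean) per subclass."""
    classes = np.unique(y)
    templates = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, templates
```

In practice, feature selection would be run on the training set only and the same column indices applied to the test samples. A library routine such as `scipy.stats.kruskal` could replace the hand-written statistic.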
If we represent a sample as a square matrix instead of a vector, we can analyze a sample set using two-dimensional POC (2D POC) to identify disease types. The flowchart of the 2D POC analysis method is very similar to that of 1D POC shown in Figure 1. The only difference is that the 1D DFT and 1D IDFT in Figure 1 are replaced with the 2D DFT and 2D IDFT, respectively. A sample vector can easily be converted into a square matrix, provided that its length is a square number.
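The conversion to a square matrix and the 2D variant can be sketched as follows. The row-major reshaping order, the small constant `eps` that guards against division by zero, and the function names are assumptions made for illustration; the text does not specify them.

```python
import math
import numpy as np

def to_square(v):
    """Reshape a vector whose length is a square number into an n x n matrix."""
    v = np.asarray(v, dtype=float)
    n = math.isqrt(len(v))
    if n * n != len(v):
        raise ValueError("vector length must be a square number")
    return v.reshape(n, n)  # row-major order (an assumption)

def poc_2d(a, b, eps=1e-12):
    """2D phase-only correlation surface between two equally sized matrices."""
    cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    # Keep only the phase of the cross spectrum, then transform back.
    return np.real(np.fft.ifft2(cross / (np.abs(cross) + eps)))
```

For two identical matrices the surface is close to a unit impulse at the origin, so its peak is close to 1.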
The 1D POC function $r_{fg}(n)$ between two samples $f(n)$ and $g(n)$ is computed from the phase components of their discrete Fourier transforms, and its value has a range from 0 to 1. The correlation peak value of $r_{fg}(n)$ provides a measure of the similarity between the two samples. Usually, the larger the peak value is, the more similar the two samples are, and vice versa. The peak value decreases as the noise in the test sample and the constructed templates increases. Thus high-level noise in samples may degrade the accuracy of prediction.
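A minimal sketch of template matching with 1D POC is given below, assuming the standard definition of POC as the inverse DFT of the normalized cross-power spectrum. The function names, the `eps` guard and the choice to take the maximum of the correlation output as the peak are illustrative assumptions rather than details confirmed by the text.

```python
import numpy as np

def poc_1d(f, g, eps=1e-12):
    """1D phase-only correlation function between two equal-length vectors."""
    cross = np.fft.fft(f) * np.conj(np.fft.fft(g))
    # Discard the magnitude so that only phase information is correlated.
    return np.real(np.fft.ifft(cross / (np.abs(cross) + eps)))

def classify_by_templates(sample, classes, templates):
    """Assign the sample to the class whose template gives the highest POC peak."""
    peaks = [float(poc_1d(sample, t).max()) for t in templates]
    return classes[int(np.argmax(peaks))], peaks
```

Because the magnitude spectrum is discarded, multiplying a sample by a positive constant leaves the correlation output unchanged, which is consistent with the claim that the method does not need data scaling.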
For comparison with the template-based POC method, we also design a POC1DKNN method that uses 1D POC to measure the similarity between two samples and applies 5-nearest neighbor (5NN) classification to predict the label of a test sample.
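The POC1DKNN variant might look like the sketch below, in which the POC peak serves as the similarity score and the label is decided by a majority vote among the five most similar training samples. Tie-breaking and the function names are assumptions; the text only states that 1D POC and 5NN are combined.

```python
from collections import Counter
import numpy as np

def poc_peak(f, g, eps=1e-12):
    """Peak of the 1D phase-only correlation between two vectors."""
    cross = np.fft.fft(f) * np.conj(np.fft.fft(g))
    return float(np.real(np.fft.ifft(cross / (np.abs(cross) + eps))).max())

def poc_knn_predict(x, X_train, y_train, k=5):
    """Majority vote among the k training samples with the highest POC peak."""
    sims = np.array([poc_peak(x, s) for s in X_train])
    nearest = np.argsort(-sims, kind="mergesort")[:k]
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]
```

Unlike the template-based method, this variant compares the test sample with every training sample, so its cost grows with the size of the training set.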
Although the proposed method has no parameters, the number of pre-selected features and the division of the data into training and test sets can still affect classification performance. To obtain objective results, the Balance Division Method (BDM) is used to divide each original dataset into a balanced training set and a test set. In the BDM, $Q$ samples from each subclass of the original dataset are randomly selected and used as the training set, while the remaining samples are used as the test set. For example, if we set $Q$ to 5 for the SRBCT dataset, which has four subclasses, then 5 samples per subclass are randomly selected; that is, $5 \times 4 = 20$ samples are used as the training set and the remaining samples are assigned to the test set. Because 2D POC requires the number of selected features to be a square number, we select a square number of features using KWRST to evaluate the performance of the POC method, and we vary this number because the number of genes or proteins related to complex diseases is unknown and likely differs from one disease to another.
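The balanced division described above can be sketched with the standard library alone. The function name `balanced_split`, the `seed` argument and the class sizes in the usage lines are hypothetical; only the rule of drawing the same number of samples from every subclass comes from the text.

```python
import random

def balanced_split(labels, q, seed=None):
    """Draw q samples per subclass for training; the rest form the test set."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train = []
    for lab, idxs in by_class.items():
        if len(idxs) < q:
            raise ValueError(f"subclass {lab!r} has fewer than {q} samples")
        train.extend(rng.sample(idxs, q))
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train), test

# Hypothetical four-class label vector (the class sizes are made up).
labels = [0] * 8 + [1] * 7 + [2] * 9 + [3] * 6
train_idx, test_idx = balanced_split(labels, q=5, seed=0)
```

With four subclasses and five samples drawn from each, the training set always holds 20 samples regardless of the subclass sizes.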
Visualization of experimental results
Comparison with MACE method
Comparison with other model-based methods
For PNN, there is a smoothing parameter $\sigma$ to be tuned. To determine its optimal value, 5-fold cross-validation (5-fold CV) is performed with $\sigma$ taking values from 0 to 1 in steps of 0.1 on each training set, which is obtained by randomly dividing the original dataset using BDM. The optimal $\sigma$ is the one with the best 5-fold CV performance. For SVM, the radial basis function (RBF) kernel is used. There are two parameters, $C$ and $\gamma$, to be tuned. We use 5-fold CV on the training set to determine the optimal combination of $C$ and $\gamma$ by screening all combinations over a grid of candidate values. Because SVM requires data scaling, each dataset is standardized to zero mean and unit variance. Therefore, to obtain a fairer comparison, this data scaling pre-processing step is performed before classification.
Comparison with feature extraction-based methods
Because the datasets are high-dimensional, feature extraction is often used to reduce their dimensionality before classification, and it plays a crucial role in simplifying the classification model and improving classification performance. Here we compare our method with five dimensionality reduction methods, i.e., PCA, LDA, ICA, LLDE, and locality preserving projections (LPP), which are extensively applied to the classification of complex diseases. Our previous study suggests that the prediction accuracy depends less on the classification method when the number of extracted features is small enough. Thus we adopt the simplest classification method, k-nearest neighbor (KNN) with correlation distance, to classify disease samples, and fix $k$ at 5.
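The baseline classifier used for the feature extraction comparison can be sketched as follows, assuming that correlation distance means one minus the Pearson correlation coefficient and that ties are broken by the order of the votes. The function names are hypothetical, and the dimensionality reduction step itself is not shown.

```python
from collections import Counter
import numpy as np

def correlation_distance(u, v):
    """One minus the Pearson correlation coefficient of two vectors."""
    u = u - u.mean()
    v = v - v.mean()
    return 1.0 - float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

def knn_predict(x, X_train, y_train, k=5):
    """Majority vote among the k training samples closest to x."""
    dists = np.array([correlation_distance(x, s) for s in X_train])
    nearest = np.argsort(dists, kind="mergesort")[:k]
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]
```

Correlation distance ignores the offset and positive scale of a sample, so it compares the shape of two expression profiles rather than their absolute levels.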
Permutation tests with POC1D and POC2D on the six datasets.
Data scaling or normalization is a very important pre-processing step for the many machine learning algorithms that are sensitive to the numeric ranges of attributes. Several scaling methods are widely used, such as the Z-score, which transforms data to zero mean and unit variance, and 0-1 scaling, which transforms data into the range between 0 and 1. It is currently difficult to predict which data scaling method is best for a given dataset, and there is no clear standard criterion for evaluating the various scaling methods. Besides, information such as dynamic ranges might be lost during data scaling. Therefore the proposed method has an advantage over methods that demand a scaling step because it does not require data scaling.
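For reference, the two scaling methods named above can be written as follows. Scaling each feature (column) across samples is an assumption, as are the function names; constant columns would cause a division by zero and are not handled in this sketch.

```python
import numpy as np

def z_score(X):
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def zero_one(X):
    """Scale each column linearly into the range [0, 1]."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return (X - lo) / (hi - lo)

# Rows are samples and columns are features (hypothetical values).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
Z = z_score(X)
U = zero_one(X)
```

Both transforms change the relative spacing of values between features, which is one way that dynamic range information can be lost.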
In the present study, we construct the template of each subtype using the means of the data points in the training dataset. The results demonstrate that this approach is reasonable and achieves good performance. Nevertheless, there are certainly other ways to construct templates. For example, using medians instead of means is another possible approach that might be more suitable for data that are not normally distributed. For the present study, we tested medians but did not find a significant difference from means (data not shown). Thus only the results using means are reported.
A POC-based method is reported as a new technique for identifying similar gene expression signatures among differentially expressed genes or proteins. By measuring the similarity between a test sample and the virtual sample templates constructed on the training set for each subclass, the label of the test sample can be easily determined. Applying the POC-based classification method to six complex disease datasets shows that this novel method is feasible, efficient and robust. Compared with five state-of-the-art classification algorithms and five feature extraction-based methods, the proposed method achieves optimal or near-optimal classification accuracy.
Our methods can detect the similarity of the overall pattern while ignoring small mismatches between a given test sample and the templates because correlation filters are based on an integration operation. Compared with the MACE and OTSDF methods, POC is not sensitive to data scaling methods. The experimental results show that the POC-based method can achieve satisfactory results even without scaling the data. Moreover, POC has no parameters to be tuned, so this method can easily avoid the over-fitting problem as well as the effects of the curse of dimensionality. One possible drawback of this novel method is that high-level noise in the template can suppress the output peak. Our future work will focus on exploring novel methods for constructing more representative templates to further improve predictive accuracy.
We sincerely thank Dr. Yi-Hai Zhu (University of Rhode Island) for the discussion on the application of phase-only correlation method. This work was supported in part by the National Institutes of Health (NIH) Grant P01 AG12993 (PI: E. Michaelis) and the National Science Foundation of China (grant nos. 60973153, 61133010, 31071168, 60873012).
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 8, 2013: Proceedings of the 2012 International Conference on Intelligent Computing (ICIC 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S8.
1. Karley D, Gupta D, Tiwari A: Biomarkers: the future of medical science to detect cancer. Molecular Biomarkers & Diagnosis. 2011, 2 (5): 118.
2. Chan WC, Armitage JO, Gascoyne R, Connors J, Close P, Jacobs P, Norton A, Lister TA, Pedrinis E, Cavalli F: A clinical evaluation of the International Lymphoma Study Group classification of non-Hodgkin's lymphoma. Blood. 1997, 89 (11): 3909-3918.
3. Han M, Nagele E, DeMarshall C, Acharya N, Nagele R: Diagnosis of Parkinson's Disease Based on Disease-Specific Autoantibody Profiles in Human Sera. PLoS One. 2012, 7 (2).
4. Nagele E, Han M, DeMarshall C, Belinka B, Nagele R: Diagnosis of Alzheimer's Disease Based on Disease-Specific Autoantibody Profiles in Human Sera. PLoS One. 2011, 6 (8).
5. Wang L, Zhu J, Zou H: Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics. 2008, 24 (3): 412-419. 10.1093/bioinformatics/btm579.
6. Wang SL, Wang J, Chen HW, Zhang BY: SVM-based tumor classification with gene expression data. Advanced Data Mining and Applications, Proceedings. 2006, 4093: 864-870. 10.1007/11811305_94.
7. Huang DS, Zheng CH: Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics. 2006, 22 (15): 1855-1862. 10.1093/bioinformatics/btl190.
8. Wang SL, Li XL, Zhang SW, Gui J, Huang DS: Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction. Comput Biol Med. 2010, 40 (2): 179-189. 10.1016/j.compbiomed.2009.11.014.
9. Huang DS: A constructive approach for finding arbitrary roots of polynomials by neural networks. IEEE T Neural Networ. 2004, 15 (2): 477-491. 10.1109/TNN.2004.824424.
10. Huang DS: Radial basis probabilistic neural networks: Model and application. International Journal of Pattern Recognition and Artificial Intelligence. 1999, 13 (7): 1083-1101. 10.1142/S0218001499000604.
11. Demichelis F, Magni P, Piergiorgi P, Rubin MA, Bellazzi R: A hierarchical Naive Bayes Model for handling sample heterogeneity in classification problems: an application to tissue microarrays. BMC Bioinformatics. 2006, 7.
12. Boulesteix AL, Porzelius C, Daumer M: Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics. 2008, 24 (15): 1698-1706. 10.1093/bioinformatics/btn262.
13. Zheng CH, Zhang L, Ng VT, Shiu SC, Huang DS: Molecular pattern discovery based on penalized matrix decomposition. IEEE/ACM Trans Comput Biol Bioinform. 2011, 8 (6): 1592-1603.
14. Zheng CH, Chen Y, Li XX, Li YX, Zhu YP: Tumor classification based on independent component analysis. International Journal of Pattern Recognition and Artificial Intelligence. 2006, 20 (2): 297-310. 10.1142/S0218001406004673.
15. Huang DS, Mi JX: A new constrained independent component analysis method. IEEE T Neural Networ. 2007, 18 (5): 1532-1535.
16. Sharma A, Paliwal KK: Cancer classification by gradient LDA technique using microarray gene expression data. Data Knowl Eng. 2008, 66 (2): 338-347. 10.1016/j.datak.2008.04.004.
17. Li B, Zheng CH, Huang DS, Zhang L, Han K: Gene expression data classification using locally linear discriminant embedding. Computers in Biology and Medicine. 2010, 40 (10): 802-810. 10.1016/j.compbiomed.2010.08.003.
18. Dabney AR: Classification of microarrays to nearest centroids. Bioinformatics. 2005, 21 (22): 4148-4154. 10.1093/bioinformatics/bti681.
19. Ransohoff DF: Rules of evidence for cancer molecular-marker discovery and validation. Nature Reviews Cancer. 2004, 4 (4): 309-314. 10.1038/nrc1322.
20. Ransohoff DF: Bias as a threat to the validity of cancer molecular-marker research. Nature Reviews Cancer. 2005, 5 (2): 142-149. 10.1038/nrc1550.
21. Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute. 2003, 95 (1): 14-18. 10.1093/jnci/95.1.14.
22. Wood IA, Visscher PM, Mengersen KL: Classification based upon gene expression data: bias and precision of error rates. Bioinformatics. 2007, 23 (11): 1363-1370. 10.1093/bioinformatics/btm117.
23. Rocke DM, Ideker T, Troyanskaya O, Quackenbush J, Dopazo J: Papers on normalization, variable selection, classification or clustering of microarray data. Bioinformatics. 2009, 25 (6): 701-702. 10.1093/bioinformatics/btp038.
24. Wolpert DH, Macready WG: Coevolutionary free lunches. IEEE T Evolut Comput. 2005, 9 (6): 721-735. 10.1109/TEVC.2005.856205.
25. Chen LN, Liu R, Liu ZP, Li MY, Aihara K: Detecting early-warning signals for sudden deterioration of complex diseases by dynamical network biomarkers. Sci Rep-UK. 2012, 2.
26. Wang SL, Zhu YH, Jia W, Huang DS: Robust Classification Method of Tumor Subtype by Using Correlation Filters. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2012, 9 (2): 580-591.
27. Horner JL, Gianino PD: Phase-Only Matched Filtering. Applied Optics. 1984, 23 (6): 812-816. 10.1364/AO.23.000812.
28. Ito K, Nakajima H, Kobayashi K, Aoki T, Higuchi T: A fingerprint matching algorithm using phase-only correlation. IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences. 2004, E87A (3): 682-691.
29. Shibahara T, Aoki T, Nakajima H, Kobayashi K: A high-accuracy stereo correspondence technique using 1D band-limited phase-only correlation. IEICE Electron Expr. 2008, 5 (4): 125-130. 10.1587/elex.5.125.
30. Moriya H: Phase-only correlation of time-varying spectral representations of microseismic data for identification of similar seismic events. Geophysics. 2011, 76 (6): WC37-WC45. 10.1190/geo2011-0021.1.
31. Armstrong SA, Staunton JE, Silverman LB, Pieters R, de Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet. 2002, 30 (1): 41-47. 10.1038/ng765.
32. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.
33. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 2001, 7 (6): 673-679. 10.1038/89044.
34. Liang WS, Reiman EM, Valla J, Dunckley T, Beach TG, Grover A, Niedzielko TL, Schneider LE, Mastroeni D, Caselli R: Alzheimer's disease is associated with reduced expression of energy metabolism genes in posterior cingulate neurons. Proceedings of the National Academy of Sciences of the United States of America. 2008, 105 (11): 4441-4446. 10.1073/pnas.0709259105.
35. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America. 1999, 96 (12): 6745-6750. 10.1073/pnas.96.12.6745.
36. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP: Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences of the United States of America. 2001, 98 (26): 15149-15154. 10.1073/pnas.211566398.
37. Deng L, Ma JW, Pei J: Rank sum method for related gene selection and its application to tumor diagnosis. Chinese Science Bulletin. 2004, 49 (15): 1652-1657.
38. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America. 2002, 99 (10): 6567-6572. 10.1073/pnas.082099299.
39. Wang SL, You HZ, Lei YK, Li XL: Performance Comparison of Tumor Classification Based on Linear and Non-linear Dimensionality Reduction Methods. Advanced Intelligent Computing Theories and Applications. 2010, 6215: 291-300. 10.1007/978-3-642-14922-1_37.
40. Ojala M, Garriga GC: Permutation Tests for Studying Classifier Performance. Journal of Machine Learning Research. 2010, 11: 1833-1863.
41. Chua SW, Vijayakumar P, Nissom PM, Yam CY, Wong VVT, Yang H: A novel normalization method for effective removal of systematic variation in microarray data. Nucleic Acids Research. 2006, 34 (5).
42. Gold DL, Wang J, Coombes KR: Inter-gene correlation on oligonucleotide arrays - How much does normalization matter?. Am J Pharmacogenomic. 2005, 5 (4): 271-279.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.