- Research article
- Open Access
Comparative study of classification algorithms for immunosignaturing data
BMC Bioinformatics volume 13, Article number: 139 (2012)
High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data and selecting informative features, then classifies test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways where co-regulation of transcriptional units may make many genes appear completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable for other technologies with different binding characteristics. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random sequence peptides. It relies on many-to-many binding of antibodies to the random sequence peptides. Each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear which classification algorithm is optimal for analyzing this new type of data.
We characterized several classification algorithms to analyze immunosignaturing data. We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Using a wide variety of assessment criteria, we found ‘Naïve Bayes’ far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy.
The ‘Naïve Bayes’ algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties.
Serological diagnostics have received increasing scrutiny recently [1, 2] due to their potential to measure antibodies rather than low-abundance biomarker molecules. Antibodies avoid the biomarker dilution problem and are recruited rapidly following infection, chronic or autoimmune episodes, or exposure to cancer cells. Serological diagnostics using antibodies have the potential to reduce medical costs and may be one of the few methods that allow for true presymptomatic detection of disease. For this reason, our group has pursued immunosignaturing for its ability to detect diseases early and with a low false positive rate. The platform consists of a peptide microarray with either 10,000 or 330,000 peptides per assay. This microarray is useful with standard mathematical analysis, but for a variety of reasons, certain methods of classification enable the best accuracy [3, 4]. Classification methods differ in their ability to handle high or low numbers of features, the feature selection method, and the features’ combined contribution to a linear, polynomial, or complex discrimination threshold. Expression microarrays are ubiquitous and relevant to many biological studies, and have been used often when studying classification methods. However, immunosignaturing microarrays may require that we change our underlying assumptions as we determine the suitability of a particular classifier.
To frame the question of classification suitability, we examine a basic classification algorithm, Linear Discriminant Analysis (LDA). LDA is widely used in analyzing biomedical data in order to classify two or more disease classes [5–8]. One of the most commonly used high-throughput analytical methods is the gene expression microarray. Probes on an expression microarray are designed to bind to a single transcript, splice variant or methyl variant of that transcript. These one-to-one interactions provide relative transcript numbers and cumulatively help to define high-level biological pathways. LDA uses these data to define biologically relevant classes based on the contribution of differentially expressed genes. This method often uses statistically identified features (gene transcripts) that are different from one condition to another. LDA can leverage coordinated gene expression to make predictions based on a fundamental biological process. The advantage of this method is that relatively few features are required to make sweeping predictions. When features change sporadically or asynchronously, the discriminator predictions are adversely affected. This causes low sensitivity in exchange for occasionally higher discrimination. Tree-based methods use far more features to obtain a less biased but less sensitive view of the data. These methods can partition effects even if the effect sizes vary considerably. This approach can be more useful than frequentist approaches where it is important to maintain partitions in discrete groups.
Immunosignaturing has its foundations in both phage display and peptide microarrays. Most phage display methods that use random-sequence libraries also use fairly short peptides, on the order of 8–11 amino acids. Epitope microarrays use peptides in the same size range, but typically far fewer total peptides, on the order of hundreds to thousands. Each of these methods assumes that a single antibody binds to a single peptide, which is either detected by selection (phage display) or by fluorescent secondary antibody (epitope microarray). Immunosignaturing uses long 20-mer random-sequence peptides that have potentially 7 or more possible linear epitopes per peptide. Although immunosignaturing must make do with only 10,000 to ~300,000 peptides, the information content derived from partial binding makes these data useful in ways quite different from phage display [11–15].
The complexity in analysis arises from the many-to-many relationship between peptide and antibody (Figure 1). This relationship imposes a particular challenge for classification because a simple one-to-one relationship between probe and target, idiomatic for gene expression microarrays, allows a coherent contribution of many genes that behave coordinately based on biological stimuli. That idiom is broken for immunosignaturing microarrays, where each peptide may bind a number of different antibodies and every antibody might bind a number of peptides. Unless disease-specific antibodies find similar groups of peptides across individuals, very little useful information is available to the classifier. The aim of this work is to assess the performance of various classification algorithms on immunosignaturing data.
We have considered 17 diverse data mining classification methods. For feature selection, we used a simple t-test when we examined two classes, and a fixed-effects 1-way ANOVA for multiple classes with no post-hoc stratification. We have assessed these algorithms’ ability to handle increasing numbers of features by providing four different sets of peptides with increasing p-value cutoffs. The four levels range from 10 (minimum) to >1000 (maximum) peptides. Each algorithm is thus tested under conditions that highlight either synergistic or antagonistic effects as the feature numbers increase.
A peptide microarray described previously [11–15] was used to provide data for analysis. Two different sets of 10,000 random peptide sequences are tested. The two peptide sets are non-overlapping and are known as CIM10Kv1 and CIM10Kv2. Peptides are printed as in .
Samples consist of sera, plasma or saliva; each produces a suitable distribution of signals upon detection with an anti-human secondary IgG-specific antibody. Samples are added to the microarray at 1:500 dilutions in sample buffer (1xPBS, 0.5% Tween20, 0.5% Bovine Serum Albumin (Sigma, St. Louis, MO)). IgG antibodies are detected through a biotinylated secondary anti-human IgG antibody (Novus anti-human IgG (H + L), Littleton, CO), which binds the primary. Fluorescently labeled streptavidin is used to label the secondary antibodies, and the arrays are scanned with an Agilent ‘C’ laser scanner in single-color mode. 16-bit images are processed using GenePix Pro 8, which provides the tabular information for each peptide as a continuous value ranging from 0–65,000. Four unique data sets have been used in this analysis, 2 run on the CIM10Kv1 and 2 on the CIM10Kv2. Each individual sample was run in duplicate; replicates with >0.8 Pearson correlation coefficient were considered for analysis.
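The duplicate-replicate filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `pearson` and `passes_replicate_qc` are hypothetical, and the text does not say whether the correlation was computed on raw or transformed intensities.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def passes_replicate_qc(rep_a, rep_b, threshold=0.8):
    """Keep a sample only if its two replicate arrays correlate above threshold.

    The 0.8 default follows the cutoff quoted in the text; everything else
    here is an assumption made for illustration.
    """
    return pearson(rep_a, rep_b) > threshold
```

Here `rep_a` and `rep_b` would each hold the per-peptide intensities of one replicate array, in the same peptide order.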
Center for Innovations in Medicine, Biodesign Institute, Arizona State University has an existing IRB 0912004625, which allows analysis of blinded samples from collaborators.
Type 1 diabetes data set: This dataset contains 80 serum samples (41 controls and 39 children with type 1 diabetes, ages 6 to 13). These samples were tested on the CIM10Kv1 microarrays.
Alzheimer’s disease data set: This dataset contains 23 samples (12 controls and 11 Alzheimer’s disease subjects). These were tested on the CIM10Kv2 microarrays.
Antibodies dataset: This dataset contains 50 samples in 5 groups of monoclonal antibodies, arbitrarily arranged. All monoclonals were raised in mouse and use the same secondary detection antibody. Samples were run on the CIM10Kv1 microarrays.
Asthma dataset: This dataset consists of 47 unique serum samples from patients in 4 distinct classes, corresponding to the household environment. Condition A consists of 12 control subjects who had no environmental stimuli. Condition B consists of 12 subjects who had stimuli but no asthma-related symptoms. Condition C consists of 11 subjects who had no stimuli but had clinical asthma. Condition D consists of 12 subjects who had both stimuli and clinical asthma. Samples were tested on the CIM10Kv2 microarrays. The Asthma dataset was analyzed first by considering all four conditions using ANOVA, in order to study the combined effect of stimuli and asthma on subjects, and then by pairwise comparison of condition A vs. B, A vs. C, and B vs. D.
Data preprocessing, normalization and feature selection
The 16-bit TIFF images from the scanned microarrays were imported into GenePix Pro 6.0 (Molecular Devices, Santa Clara, CA). Raw tabular data were imported into Agilent’s GeneSpring 7.3.1 (Agilent, Santa Clara, CA). Data were median normalized per array and log10 transformed. For feature selection between two classes we used a Welch-corrected t-test with multiple-testing correction (FWER = 5%). For multiple groups (Antibodies and Asthma datasets) we used 1-way fixed-effects ANOVA.
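The preprocessing and two-class feature selection described above can be sketched in a few lines. This is an illustration under stated assumptions, not the GeneSpring procedure: the order "divide by the array median, then take log10" is assumed, the function names are hypothetical, and only the Welch t statistic and its degrees of freedom are computed. Converting these to p-values and applying the FWER correction would need a t-distribution routine that is not shown here.

```python
import math
from statistics import mean, median, variance


def normalize_array(intensities):
    """Median-normalize one array, then log10-transform (assumed order).

    Intensities must be strictly positive for the logarithm to be defined.
    """
    med = median(intensities)
    return [math.log10(v / med) for v in intensities]


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Each group needs at least two values and the groups must not both have
    zero variance.
    """
    na, nb = len(group_a), len(group_b)
    va = variance(group_a) / na  # squared standard error, group A
    vb = variance(group_b) / nb  # squared standard error, group B
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def rank_features(class_a, class_b):
    """Rank peptide indices by decreasing |t|.

    class_a and class_b are lists of samples; each sample is a list of
    normalized values in the same peptide order.
    """
    scores = []
    for j in range(len(class_a[0])):
        t, _ = welch_t([s[j] for s in class_a], [s[j] for s in class_b])
        scores.append((abs(t), j))
    return [j for _, j in sorted(scores, reverse=True)]
```

Taking the first 10, 50, 200 or 1000 indices from `rank_features` would give nested feature sets similar in spirit to the four stringency levels used in this study.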
Data mining classification algorithms
Four distinct peptide feature sets are chosen for the comparison study. For each analysis, peptides are selected by t-test or ANOVA across biological classes, with 4 different p-value cutoffs. Cutoffs were selected to obtain roughly equivalent sized feature sets to assess the ability of each algorithm to process sparse to rich feature sets. Once the significant features were collected, data were imported into WEKA for classification. The algorithms themselves spanned a wide variety of classifiers including Bayesian, regression based methods, meta-analysis, clustering, and tree based approaches.
We obtained accuracy from each analysis type using leave-one-out cross-validation. We obtained a list of t-test or ANOVA-selected peptides at each stringency level. The highest stringency uses peptides with p-values in the range of 10^-5 to 10^-10 and contains the least ‘noise’. The less-stringent second set uses p-values approximately 10-fold higher than the most stringent. The third contains the top 200 peptides and the fourth contains ~1000 peptides at p < 0.05. Although different numbers of peptides are used for each dataset, each peptide set yields the same general ability to distinguish the cognate classes. The WEKA default parameter settings were used for every algorithm to avoid bias and overfitting. These default parameters are taken from the cited papers listed below for each algorithm. Brief details of the default parameters and algorithms are listed below.
Naïve Bayes: Probabilistic classifier based on Bayes’ theorem. Numeric estimator precision values are chosen based on analysis of the training data. In the present study, a normal distribution was used for numeric attributes rather than a kernel estimator.
Bayes net: Probabilistic graphical model that represents random variables and conditional dependencies in the form of a directed acyclic graph. A Simple Estimator algorithm has been used for finding conditional probability tables for Bayes net. A K2 search algorithm was used to search network structure [18, 19].
Logistic Regression (Logistic R.): A generalized linear model that uses logistic curve modeling to fit the probabilistic occurrence of an event. A Quasi-Newton method is used for optimization. A ridge value of 1x10^-8 is used in the log-likelihood calculation.
Simple Logistic: Classifier for building linear logistic regression models. ‘LogitBoost’ with simple regression functions is used for fitting the logistic model. Automatic attribute selection is obtained by cross validation of the optimal number of ‘LogitBoost’ iterations. The heuristic stop parameter is set at 50. The maximum number of iterations for LogitBoost has been set to 500.
Support Vector Machines (SVM): A non-probabilistic binary linear classifier that constructs one or more hyperplanes that can be used for classification. For training support vector classifiers, John Platt’s sequential minimal optimization algorithm was used, which replaces all missing values. Multiclass problems are solved using pairwise classification. The complexity parameter is set to 1. Epsilon for round-off error is set to 1x10^-12. PolyKernel is set as the kernel. The tolerance parameter is set to 0.001 [24, 25].
Multilayer Perceptron (MLP): A supervised learning technique with a feed forward artificial neural network through back-propagation that can classify non-linearly separable data [26, 27]. The learning rate is set to 0.3 and the momentum applied when updating weights is set to 0.2. The validation threshold used to terminate the validation testing is set to 20.
K nearest neighbors (KNN): Instance based learning or lazy learning, which trains the classifier function locally by majority vote of its neighboring data points. The Linear NN Search algorithm is used as the search algorithm [28, 29]. K is set to 3.
K Star: Instance based classifier that uses a similarity function from the training set to classify the test set. Missing values are averaged by column entropy curves and the global blending parameter is set to 20.
Attribute Selected Classifier (ASC): The ‘Cfs subset’ evaluator is used during the attribute selection phase to reduce the dimension of training and test data. The ‘BestFirst’ search method is invoked, after which the J48 tree classifier is used.
Classification via clustering (K means): The simple k-means clustering method is used, where k is set to the number of classes in the data set. Euclidean distance was used for evaluation with 500 iterations.
Classification via Regression (M5P): Regression is a method used to evaluate the relationship between dependent and independent variables through an empirically determined function. The M5P base classifier is used, which combines a conventional decision tree with the possibility of linear regression at the nodes. The minimum number of instances per leaf node is set to 4.
Linear Discriminant Analysis (LDA): Prevalent classification technique that identifies the combination of features that best characterizes classes through linear relationships. Prior probabilities are set to uniform and the model is assumed to be homoscedastic.
Hyper Pipes: Simple, fast classifier that counts internally defined attributes for all samples and compares the number of instances of each attribute per sample. Classification is based on simple counts. Works well when there are many attributes.
VFI: Voting feature interval classifier is a simple heuristic attribute-weighting scheme. Intervals are constructed for numeric attributes. For each feature per interval, class counts are recorded and classification is done by voting. Higher weight is assigned to more confident intervals. The strength of the bias towards more confident features is set to 0.
J48: Java implementation of the C4.5 algorithm. Based on Hunt’s algorithm, pruning takes place by replacing an internal node with a leaf node. Top-down decision tree/voting algorithm. 0.25 is used for the confidence factor. No Laplace method is used for tree smoothing.
Random Trees: A tree is grown from data that has K randomly chosen attributes at each node. It does not perform pruning. The K parameter is set to zero, which in WEKA selects log2(number of attributes) + 1 attributes per node. There is no depth restriction. The minimum total weight per leaf is set to 1.
Random Forest (R. Forest): Like Random Tree, the algorithm constructs a forest of random trees with locations of attributes chosen at random. It uses an ensemble of unpruned decision trees, each grown on a bootstrap sample of the training data. There is no restriction on the depth of the tree; the number of trees used is 100.
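Of the classifiers listed above, Naïve Bayes with a normal-distribution estimator is simple enough to sketch in full, together with the leave-one-out cross-validation used to score every algorithm. The code below is a minimal illustration and not WEKA's implementation: the function names are hypothetical, and the variance floor `min_var` is an assumption added to avoid division by zero (WEKA handles numeric precision differently).

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples, labels, min_var=1e-9):
    """Estimate a prior and per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, min_var)
            for col, m in zip(columns, means)
        ]
        model[y] = (n / len(samples), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log-posterior for sample x."""
    best_label, best_score = None, -math.inf
    for y, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods simply add.
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = y, score
    return best_label


def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and count correct calls."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train_gaussian_nb(train_x, train_y)
        correct += predict(model, samples[i]) == labels[i]
    return correct / len(samples)
```

Training touches each feature once per class and stores only a mean and a variance for it, which is consistent with the low run times reported for Naïve Bayes in this study.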
CPU time was calculated for every algorithm at the four different significance levels. This time was measured on a standard PC (Intel dual core, 2.2 GHz, 3 GB RAM) that was completely dedicated to WEKA. To measure CPU time, open source jar files from WEKA were imported to Eclipse, where the function ‘time ()’ was invoked prior to running the classification; the measurement includes the time required for cross validation. Most Windows 7 services were switched off; the times reported are an average of 5 different measurements.
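The averaging of five timed runs can be sketched as below. This is a Python analogue for illustration only; the measurements in this study were taken in Java inside Eclipse, and the function name `average_cpu_time_ms` is hypothetical.

```python
import time


def average_cpu_time_ms(task, repeats=5):
    """Run task() `repeats` times and return the mean CPU time in milliseconds.

    time.process_time() counts CPU time of the current process only, so time
    spent sleeping or waiting on other processes is excluded.
    """
    timings = []
    for _ in range(repeats):
        start = time.process_time()
        task()
        timings.append((time.process_time() - start) * 1000.0)
    return sum(timings) / len(timings)
```

In use, `task` would be a zero-argument callable that builds the model and runs the full cross-validation, so that both are included in the reported time.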
Overall performance accuracy of classification algorithms over all data sets
For each dataset, accuracies are measured at four levels (top 10, 50, 200, 1000 peptides) at various levels of significance. An overall average performance measure is calculated for each algorithm for a given data set. Table 1 shows the overall average percentage score for each algorithm, calculated by averaging accuracy, specificity, sensitivity and area under the ROC curve across all levels of significance. Scores >90% are marked in bold. The MLP algorithm did not finish at the last level of significance due to huge memory requirements, so its score is averaged over the first three levels. For the type 1 diabetes, Alzheimer’s and Antibodies datasets, >6 algorithms scored >90% average score. Overall, Naïve Bayes had the highest average score (90.4%) and was always among the top 3 algorithms across all datasets.
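The four measures that are averaged into a single score can be computed for a two-class problem as sketched below. The equal weighting of the four measures and the function names are assumptions made for illustration; the area under the ROC curve is computed here through its rank interpretation (the probability that a positive sample receives a higher score than a negative one).

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, tn, fp, fn) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def auc(y_true, scores, positive):
    """Area under the ROC curve via pairwise comparison; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_score(y_true, y_pred, scores, positive):
    """Mean of accuracy, sensitivity, specificity and AUC, as a percentage.

    Both classes must be present in y_true, otherwise a ratio is undefined.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    measures = [accuracy, sensitivity, specificity, auc(y_true, scores, positive)]
    return 100.0 * sum(measures) / len(measures)
```

Here `scores` holds the classifier's confidence that each sample belongs to the `positive` class, and `y_pred` holds the hard class calls.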
Performance accuracy of classification algorithms at different levels of significance over all data sets
For each data set, different levels of significance are chosen to measure the performance accuracy of each algorithm. These levels contain approximately equal number of peptides for each data set. The first level contains 10 peptides selected from the t-test (lowest p value) and hence contains the least noise. Next, approximately 50 peptides, 200 peptides and 1000 peptides were chosen for the other three levels.
Tables 2, 3, 4, 5, 6, 7 and 8 show 4 different performance measures (accuracy, specificity, sensitivity and area under ROC curve) at different levels of significance over 7 datasets. For the Asthma dataset, we considered all conditions A-D together, then performed the pairwise comparisons of condition A and B, condition A and C, and condition B and D at three different levels of significance. Measures >90% are marked in bold. For the diabetes dataset, 9 algorithms achieved >90% score. For the Alzheimer’s and Antibodies datasets, 6 algorithms achieved >90% score. Naïve Bayes scored 100% in all 4 measures at the first level of significance in the Alzheimer’s dataset and scored 91.5% average score on the Antibodies dataset. For the Asthma datasets, the highest score was <80%. Only Naïve Bayes had >90% specificity for more than one level of significance. For two conditions in the Asthma datasets, Naïve Bayes and VFI scored >90% average score.
Comparative analysis of worst time performance of classification algorithms over data sets
The amount of time taken by each algorithm to build the model and perform cross validation was measured. Table 9 shows the time in milliseconds for each algorithm at the lowest level of significance, when the number of peptides nears 1000. Random Tree was the fastest, at ~1000 milliseconds (average) to complete the task, while MLP was the worst, as it did not finish due to high memory requirements. Random Tree, Hyper Pipes, Naïve Bayes, VFI and KNN were the five fastest algorithms; each took less than ~4000 milliseconds to complete classification of >1,000 peptides. Logistic Regression, Attribute Selected Classifier and MLP were among the slowest algorithms, taking more than 20 minutes to perform classification of >1,000 peptides. The absolute ranking for every algorithm was consistent per dataset; only three datasets have been considered to measure time performance.
Comparative analysis of time performance of classification algorithms at different levels of significance over three data sets
For each level of significance, time was measured for each algorithm to build the model and for cross validation. At the highest level of significance (about 10 peptides), every algorithm was fast enough to complete the task in under 25 seconds. Execution times increased as the level of significance was lowered due to the higher number of features and increased difficulty in constructing the model. Table 10 shows the time performance of the classification algorithms at various levels of significance.
We have explored several disparate classifiers using a relatively new type of microarray data: immunosignaturing data. The tested algorithms come from a broad family of approaches to classify data. We chose algorithms from Bayesian, regression, trees, multivariate and meta analysis and we believe we have sampled sufficiently that the results are relevant. From Table 2 we found that Naïve Bayes had a higher average performance than all other algorithms tested. Naïve Bayes achieved > 90% average for two-class datasets where there is a clear distinction between the two classes. For the multi-class Antibodies dataset, where there is a clear difference between different types of antibodies, Naïve Bayes scored 88% average accuracy and was ranked third, close to the 93.3% accuracy of random forest. On the Asthma dataset, containing four classes, none of the algorithms were able to achieve more than 75% accuracy. This matches the biological interpretation very well. Naïve Bayes outperformed all algorithms for speed and accuracy, achieving 77.7% average score overall. Naïve Bayes was one of the top five fastest algorithms, ~500 times faster than logistic regression. A summary of all algorithms’ performance measures and time is given below and described in Table 11. Distance metrics have been defined to assess performance measures for all algorithms compared to the highest scoring algorithm on a given dataset.
Naïve Bayes: Naïve Bayes performed best overall with > 90% overall average score. It was always among the top 3 algorithms in all 7 comparisons. It ranked first 5 out of 7 times when comparing all datasets. It was on average just 0.3% behind the rank 1 algorithm in overall comparison. It is 2X slower than the fastest algorithm due to its mathematical properties. It would be feasible to perform large-scale classification studies using Naïve Bayes.
Multilayer Perceptron (MLP): It ranked second with an overall score of 87.3% and was very close to SVM. The overall score is biased since MLP did not finish for the level containing ~1000 peptides, and hence its score was averaged from just the three levels. It was the slowest algorithm and infeasible for large-scale classification.
Support Vector Machines (SVM): Although it ranked third, it was not significantly different from the MLP in terms of performance measures. It was 700X faster than MLP and achieved >90% measured accuracy 3 times. Both MLP and SVM were <5% behind the rank 1 algorithm on average.
VFI: VFI ranked fourth in overall performance measures and was among the top 5 fastest algorithms due to its voting method. Four times it obtained >90% average overall accuracy and ranked 2nd twice.
Hyper Pipes: Hyper pipes ranked fifth overall in performance measures and was among the fastest of the tested algorithms, likely due to its inherently simplistic ranking method. It was <8% from first place 6 times.
Random Forest: Random forest ranked sixth in overall performance measures and performed better on datasets having multiple classes (Antibodies and Asthma). It was 21 times slower than the fastest algorithm due to bootstrapping.
Bayes net: Ranked in the middle for overall accuracy and time. It scored >90% overall measures twice. It was slower than the Naïve Bayes due to construction of networks in the form of an acyclic graph and it is relatively inefficient compared to Naïve Bayes due to the change in network topology during assessment of probability.
K means: K-means ranked eighth in overall performance measures and was 34X slower than the fastest algorithm in time performance due to the multiple iterations required to form clusters. It performed far better for 2 classes compared to multiple classes because guaranteed convergence, scalability and linear separation boundaries are more easily maintained.
Logistic Regression: Logistic regression ranked ninth in overall accuracy. It was >90% three times. It was among the worst in time performance, being ~1000 times slower than the fastest algorithm, as it needs to regress on a high number of features. It is efficient for small numbers of features and sample sizes > 400.
Simple Logistic: It ranked tenth in overall performance measures and ranked first on the diabetes dataset. It ranked second in multiclass Asthma dataset. It was slow in time performance due to LogitBoost iterations.
K nearest neighbors: It performed well on the two-class datasets but did not perform as well on the multi-class datasets. It achieved >90% performance only on the rather difficult Diabetes dataset. This may be related to evenly defined but diffuse clusters related to the subtle differences between the Asthma patients.
K star: It achieved >90% only on the Diabetes dataset and was 9 times slower than the fastest algorithm. This algorithm may also be sensitive to the even and diffuse clusters described by this dataset.
M5P: It did not perform well on either time performance or accuracy. It never achieved >90% average score and was 22 times slower than the fastest algorithm due to formation of a comprehensive linear model for every interior node of the unpruned tree.
J48: Among the top 5 fastest algorithms due to rapid construction of trees. It was >20% behind the rank 1 algorithm on average; its lower performance may be due to formation of empty or insignificant branches, which often leads to overtraining.
Random Trees: It was the fastest algorithm since it builds trees of height log(k), where k is the number of attributes; however, it achieves poor accuracy since it performs no pruning.
Attribute Selected Classifier (ASC): One of the slowest algorithms as it had to evaluate attributes prior to classification. It underperformed in performance measures due to the C4.5 classifier limitations that prevent overtraining.
Linear Discriminant Analysis (LDA): Its performance accuracy decreased as the number of features increased due to its inability to deal with highly variant data. It was slow (>500X slower than the fastest algorithm) since it tries to optimize class distinctions while the variance-covariance matrix grows dramatically as the number of features increases.
The comparisons in this article provide a glimpse into how existing classification algorithms handle data with intrinsically different properties than traditional microarray expression data. Immunosignaturing provides a means to quantify the dispersion of serum (or saliva) antibodies that result from disease or other immune challenge. Unlike most phage display or other panning experiments, fewer but longer random-sequence peptides are used. Rather than converging to relatively few sequences, the immunosignaturing microarray provides data on the binding affinity of all 10,000 peptides with high precision. Classifiers in the open-source program WEKA were used to determine whether any algorithm stood out as being particularly well suited for these data. The 17 classifiers that were tested are readily available and represent some of the most widely used classification methods in biology. However, they also represent classifiers that are diverse at the most fundamental levels. Tree methods, regression, and clustering are inherently different; the grouping methods are quite varied and top-down or bottom-up paradigms address data structures in substantially different ways. Given this, we present and interpret the results from our tests, which we believe will be applicable to any dataset with target-probe interactions similar to immunosignaturing microarrays.
From the comparisons above, Naïve Bayes was the superior analysis method in all aspects. Naïve Bayes assumes a feature-independent model, which may account for its superior performance. Its performance depends on the degree of correlation among the attributes in the dataset; for immunosignaturing, the number of attributes can be quite large. In gene expression data, where genes are connected by gene regulatory networks, there is a direct and significant correlation between hub genes and dependent genes. This relationship affects the performance of Naïve Bayes by limiting its efficiency through multiple containers of similarly connected features [39–41]. In peptide-antibody arrays, where the signals that arise from the peptides are multiplexed signals of many antibodies attaching to many peptides, there is no direct correlation between peptides, but there is a general trend. Moreover, there is a competition of antibodies attaching to a single peptide, which makes it difficult for multiple mimotopes to show significant correlation with each other. Thus, the 10,000 random peptides have no direct relationships to each other; each contributes partially to defining the disease state. This makes the immunosignaturing technology a better fit for the assumption of strong feature independence employed by the Naïve Bayes technique, and the fact that reproducible data can be had at intensity values down to 1 standard deviation above background enables enormous numbers of informative, precise, and independent features. Presence or absence of a few high- or low-binding peptides on the microarray will not impact the binding affinity for any other peptide, since the kinetics ensure that the antibody pool is not limiting. This is important when building microarrays with >300,000 features per physical assay, as in our newest microarray. More than 90% of the peptides on either microarray demonstrate a normal distribution of binding signals. This is important since the feature selection methods used in this analysis (t-test and one-way ANOVA) and the Naïve Bayes classifier all assume a normal distribution of features.
The Naïve Bayes approach requires relatively little training data, which makes it a very good fit for the biomarker field. The sample sizes usually range from N = 20-100 for the training set. Naïve Bayes has other advantages as well: it can train well on a small but high feature data set and still yield good prediction accuracy on a large test set. Any microarray with more than a few thousand probes succumbs to the issue of dimensionality. Since Naïve Bayes independently estimates each distribution instead of calculating a covariance or correlation matrix, it escapes relatively unharmed from problems of dimensionality.
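The dimensionality argument above can be made concrete by counting the parameters that must be estimated. The following is an illustrative calculation assuming a normal-distribution Naïve Bayes model with d features, compared against a model that estimates a full d-by-d covariance matrix (as a discriminant analysis would); the symbol d is introduced here for illustration.

```latex
\text{Naive Bayes (per class):}\quad
\underbrace{d}_{\text{means}} + \underbrace{d}_{\text{variances}} = 2d
\qquad\qquad
\text{Full covariance matrix:}\quad \frac{d(d+1)}{2}

d = 1000:\qquad 2d = 2000,
\qquad \frac{d(d+1)}{2} = 500{,}500
```

With training sets of N = 20-100 samples, a sample covariance matrix over 1000 features has rank at most N - 1 and is therefore singular, whereas each of the 2d Naïve Bayes parameters is a one-dimensional estimate that uses every sample in the class.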
The data used here for evaluating the algorithms were generated using an array with 10,000 different features, almost all of which contribute information. We have arrays with >300,000 peptides per assay (current microarrays are available from http://www.peptidearraycore.com), which should provide for less sharing between peptide and antibody, effectively spreading out antibodies over the peptides with more specificity. This presumably will allow antibody populations to be resolved in finer detail. This expansion may require a classification method that is robust to noise, irrelevant attributes and redundancy. Naïve Bayes has an outstanding edge in this regard: it is robust to noisy data because such data points are averaged out when estimating conditional probabilities. It can also handle missing values by ignoring them during model building and classification. It is highly robust to irrelevant and redundant attributes because, if an attribute Xi is irrelevant, P(Xi|Class) is nearly the same for every class, so the class-conditional probability for Xi has no significant impact on the overall computation of the posterior probability. Naïve Bayes will arrive at a correct classification as long as the correct classes are even slightly more predictable than the alternatives. Here, class probabilities need not be estimated very well, which corresponds to the practical reality of immunosignaturing: signals are multiplexed due to competition, affinity, and other technological limitations of spotting, background and other biochemical effects that exist between antibody and mimotope.
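The robustness to irrelevant attributes can be checked numerically. The sketch below assumes Gaussian class-conditional densities with known parameters; the function names and all numbers are illustrative assumptions, not values from this study. An attribute whose distribution is identical in both classes multiplies every class score by the same factor, so it cancels when the posterior is normalized.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def posterior(x, params, priors):
    """Naive Bayes posterior P(class | x); params[c] lists (mean, var) per feature."""
    log_scores = {}
    for c, prior in priors.items():
        log_scores[c] = math.log(prior) + sum(
            gaussian_log_pdf(v, m, var) for v, (m, var) in zip(x, params[c])
        )
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: u / total for c, u in unnorm.items()}

priors = {"control": 0.5, "disease": 0.5}

# One informative feature: the class means differ.
informative = {"control": [(1.0, 0.25)], "disease": [(2.0, 0.25)]}

# Same feature plus an irrelevant one with identical parameters in both classes.
with_irrelevant = {
    "control": [(1.0, 0.25), (5.0, 1.0)],
    "disease": [(2.0, 0.25), (5.0, 1.0)],
}

p_without = posterior([1.4], informative, priors)
p_with = posterior([1.4, 7.3], with_irrelevant, priors)
print(p_without["control"], p_with["control"])  # equal up to rounding
```

In practice the parameters are estimated from data, so an irrelevant attribute will show small class differences by chance and cancel only approximately.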
As the immunosignaturing technology is increasingly used for large-scale experiments, it will result in an explosion of data. We need an algorithm that is accurate, can process enormous amounts of data with low memory overhead, and is fast enough for model building and evaluation. One aim for next-generation immunosignaturing microarrays is to monitor the health status of a large population on an ongoing basis. The number of selected attributes will no longer be limited in such a scenario. For risk evaluation, complex patterns must be normalized against themselves at regular intervals. This time-based analysis would require a conditional probabilistic argument along with the capacity to predict risk accurately at low computational cost. The running time of Naïve Bayes grows with a very small slope as attributes are added, allowing it to process a large number of attributes.
Immunosignaturing is a novel approach that aims to detect complex patterns of antibodies produced in acute or chronic disease. This complex pattern is obtained using random peptide microarrays in which 10,000 random peptides are exposed to antibodies in sera, plasma, or saliva. Antibody binding to the peptides is not one-to-one but a more complicated, multiplexed process. The resulting data appear numerically, distributionally, and statistically the same as gene expression microarray data, but are fundamentally quite different: the relationships between attributes, and the functionality of those attributes, are not the same. Hence, traditional classification algorithms used for gene expression data might be suboptimal for analyzing immunosignaturing results. We investigated 17 different classification algorithms spanning Bayesian, regression, tree-based, and meta-analysis approaches and compared their leave-one-out cross-validated accuracy using various numbers of features. We found that the Naïve Bayes classification algorithm outperforms the majority of the classification algorithms in classification accuracy and in time performance, which is not the case for expression microarrays. We also discussed its assumptions, simplicity, and fitness for immunosignaturing data. More than most, these data provide access to the information found in antibodies. Deconvoluting this information has been a barrier to using antibodies as biomarkers. Pairing immunosignaturing with Naïve Bayes classification may open up the immune system to a more systematic analysis of disease.
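For readers unfamiliar with the leave-one-out cross-validation used to compare the algorithms, a minimal sketch follows. The harness accepts any classifier as a pair of functions; the nearest-centroid classifier shown here is only a short stand-in and is not one of the algorithms evaluated in this study. All names and the toy data are hypothetical.

```python
def loocv_accuracy(X, y, train, predict):
    """Hold out each sample once, train on the rest, and score the held-out prediction."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = train(X_train, y_train)
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)

def train_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroid(model, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

# Toy data (made up): two well-separated classes with two features each.
X = [[1.0, 1.2], [0.8, 1.1], [1.1, 0.9],
     [5.0, 5.2], [4.9, 4.8], [5.1, 5.0]]
y = ["A", "A", "A", "B", "B", "B"]

accuracy = loocv_accuracy(X, y, train_centroids, predict_centroid)
print(accuracy)
```

With N samples the classifier is trained N times, so training cost matters when many algorithms and feature counts are compared.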
Consent was obtained for every sample in this manuscript, and the study was approved by the ASU IRB under protocol number 0912004625, entitled "Profiling Human Sera for Unique Antibody Signatures". Human subjects were consented by the retrieving institution, and a materials transfer agreement was signed between the Biodesign Institute and the collaborating institute. The collaborating institutes' protocols were current, and each human subject signed an approved consent form and released their sera.
Haab BB: Methods and applications of antibody microarrays in cancer research. Proteomics 2003, 3: 2116–2122.
Whiteaker JR, Zhao L, Zhang HY, Feng L-C, Piening BD, Anderson L, Paulovich AG: Antibody-based enrichment of peptides on magnetic beads for mass-spectrometry-based quantification of serum biomarkers. Anal Biochem 2007, 362: 44–54.
Reimer U, Reineke U, Schneider-Mergener J: Peptide arrays: from macro to micro. Curr Opin Biotechnol 2002, 13: 315–320.
Merbl Y, Itzchak R, Vider-Shalit T, Louzoun Y, Quintana FJ, Vadai E, Eisenbach L, Cohen IR: A systems immunology approach to the host-tumor interaction: large-scale patterns of natural autoantibodies distinguish healthy and tumor-bearing mice. PLoS One 2009, 4: e6053.
Braga-Neto UM, Dougherty ER: Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004, 20: 374–380.
Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER: Optimal number of features as a function of sample size for various classification rules. Bioinformatics 2004, 21: 1509–1515.
Sima C, Attoor S, Braga-Neto U, Lowey J, Suh E, Dougherty ER: Impact of error estimation on feature selection. Pattern Recognit 2005, 38: 2472–2482.
Braga-Neto U, Dougherty E: Bolstered error estimation. Pattern Recognit 2004, 37: 1267–1281.
Cwirla SE, Peters EA, Barrett RW, Dower WJ: Peptides on phage: a vast library of peptides for identifying ligands. Proc Natl Acad Sci U S A 1990, 87: 6378–6382.
Nahtman T, Jernberg A, Mahdavifar S, Zerweck J, Schutkowski M, Maeurer M, Reilly M: Validation of peptide epitope microarray experiments and extraction of quality data. J Immunol Methods 2007, 328: 1–13.
Boltz KW, Gonzalez-Moa MJ, Stafford P, Johnston SA, Svarovsky SA: Peptide microarrays for carbohydrate recognition. Analyst 2009, 134: 650–652.
Brown J, Stafford P, Johnston S, Dinu V: Statistical Methods for Analyzing Immunosignatures. BMC Bioinforma 2011, 12: 349.
Halperin RF, Stafford P, Johnston SA: Exploring antibody recognition of sequence space through random-sequence peptide microarrays. Mol Cell Proteomics 2011, 10: M110.000786.
Legutki JB, Magee DM, Stafford P, Johnston SA: A general method for characterization of humoral immunity induced by a vaccine or infection. Vaccine 2010, 28: 4529–4537.
Restrepo L, Stafford P, Magee DM, Johnston SA: Application of immunosignatures to the assessment of Alzheimer's disease. Ann Neurol 2011, 70: 286–295.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explor Newsl 2009, 11: 10–18.
John GH, Langley P: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. In Estimating Continuous Distributions in Bayesian Classifiers. Morgan Kaufmann, San Mateo; 1995:338–345.
Friedman N, Geiger D, Goldszmidt M: Bayesian Network Classifiers. Mach Learn 1997, 29: 131–163.
Yu J, Chen X: Bayesian neural network approaches to ovarian cancer identification from high-resolution mass spectrometry data. Bioinformatics 2005, 21(Suppl 1):i487-i494.
Friedman J, Hastie T, Tibshirani R: Additive logistic regression: a statistical view of boosting. Ann Stat 2000, 28: 337–407.
Cessie SL, Houwelingen JCV: Ridge Estimators in Logistic Regression. J R Stat Soc Ser C (Appl Stat) 1992, 41: 191–201.
Landwehr N, Hall M, Frank E: Logistic Model Trees. Mach Learn 2005, 59: 161–205.
Platt J: Fast Training of Support Vector Machines using Sequential Minimal Optimization. MIT Press; 1998.
Hastie T, Tibshirani R: Classification by Pairwise Coupling. MIT Press; 1998.
Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK: Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Comput 2001, 13: 637–649.
Chaudhuri BB, Bhattacharya U: Efficient training and improved performance of multilayer perceptron in pattern classification. Neurocomputing 2000, 34: 11–27.
Gardner MW, Dorling SR: Artificial neural networks (the multilayer perceptron) – a review of applications in the atmospheric sciences. Atmos Environ 1998, 32: 2627–2636.
Aha DW, Kibler D, Albert MK: Instance-based learning algorithms. Mach Learn 1991, 6: 37–66.
Weinberger K, Blitzer J, Saul L: Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 2009, 10: 207–244.
Cleary J, Trigg L: Proceedings of the 12th International Conference on Machine Learning. In K*: An Instance-based Learner Using an Entropic Distance Measure. Morgan Kaufmann; 1995:108–114.
Hall MA: Correlation-based Feature Subset Selection for Machine Learning, PhD Thesis, University of Waikato. Hamilton, New Zealand; 1998.
Hartigan JA: Statistical theory in clustering. J Classif 1985, 2: 63–76.
Quinlan JR: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. In Learning with continuous classes. World Scientific; 1992:343–348.
Witten IH, Frank E, Hall MA: Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition. Morgan Kaufmann, San Francisco; 2011.
Güvenir HA: Voting features based classifier with feature construction and its application to predicting financial distress. Expert Syst Appl 2010, 37: 1713–1718.
Salzberg SL: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Mach Learn 1994, 16: 235–240.
Quinlan J: Bagging, Boosting and C4. AAAI/IAAI 1996, 5: 1.
Breiman L: Random Forests. Mach Learn 2001, 45: 5–32.
Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Raffeld M, et al.: Gene-Expression Profiles in Hereditary Breast Cancer. New England J Med 2001, 344: 539–548.
Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 2004, 20: 2429–2437.
Liu H, Li J, Wong L: A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Inform 2002, 13: 51–60.
Stafford P, Brun M: Three methods for optimization of cross-laboratory and cross-platform microarray expression data. Nucleic Acids Res 2007, 35: e72.
We are thankful to Dr. Abner Notkins (NIH, National Institute of Dental and Craniofacial Research) for providing the type 1 diabetes samples; Dr. Lucus Restrepo for providing the Alzheimer’s dataset; Dr. Bart Legutki and Dr. Rebecca Halperin for providing the Antibodies dataset; the University of Arizona, Department of Pharmacy and Pharmacology (Serine Lau, Donata Vercelli, Marilyn Halonen) for providing the Asthma samples; the Peptide Microarray Core (Dr. Zbigniew Cichacz) for providing the Asthma datasets; Pradeep Kanwar for implementing the time function in Java; and Valentin Dinu for invaluable discussion regarding algorithm selection. This work was supported by an Innovator Award from the DoD Breast Cancer Program to SAJ.
US Patent ‘Compound Arrays for Sample Profiling’: 61218890; US Patent ‘Naïve Bayes Classification for Immunosignaturing’: M12-104L. SAJ is cofounder of HealthTell Diagnostics, which owns the patent to immunosignaturing.
MK completed the analysis of all data and wrote the original manuscript draft. PS completed all revisions and consulted on analysis. SAJ and PS co-invented immunosignaturing; SAJ funded the project. All authors read and approved the revised manuscript.
About this article
Cite this article
Kukreja, M., Johnston, S.A. & Stafford, P. Comparative study of classification algorithms for immunosignaturing data. BMC Bioinformatics 13, 139 (2012). https://doi.org/10.1186/1471-2105-13-139
- Random peptide microarray
- Data mining
- Classification algorithms
- Naïve Bayes