Open Access

Comparative study of classification algorithms for immunosignaturing data

  • Muskan Kukreja1,
  • Stephen Albert Johnston1 and
  • Phillip Stafford1Email author
BMC Bioinformatics201213:139

DOI: 10.1186/1471-2105-13-139

Received: 20 January 2012

Accepted: 15 May 2012

Published: 21 June 2012

Abstract

Background

High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data, selecting informative features, and classifies test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways where co-regulation of transcriptional units may make many genes appear as being completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable when other technologies with different binding characteristics exist. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random sequence peptides. It relies on many-to-many binding of antibodies to the random sequence peptides. Each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear what is the optimal classification algorithm for analyzing this new type of data.

Results

We characterized several classification algorithms to analyze immunosignaturing data. We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Using a wide variety of assessment criteria, we found ‘Naïve Bayes’ far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy.

Conclusions

‘Naïve Bayes’ algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties.

Keywords

Immunosignature Random peptide microarray Data mining Classification algorithms Naïve Bayes

Background

Serological diagnostics have received increasing scrutiny recently [1, 2] due to their potential to measure antibodies rather than low-abundance biomarker molecules. Antibodies avoid the biomarker dilution problem and are recruited rapidly following infection, chronic, or autoimmune episodes, or exposure to cancer cells. Serological diagnostics using antibodies have the potential to reduce medical costs and may be one of the few methods that allow for true presymptomatic detection of disease. For this reason, our group has pursued immunosignaturing for its ability to detect the diseases early and with a low false positive rate. The platform consists of a peptide microarray with either 10,000 or 330,000 peptides per assay. This microarray is useful with standard mathematical analysis, but for a variety of reasons, certain methods of classification enable the best accuracy [3, 4]. Classification methods differ in their ability to handle high or low numbers of features, the feature selection method, and the features’ combined contribution to a linear, polynomial, or complex discrimination threshold. Expression microarrays are quite ubiquitous and relevant to many biological studies, and have been used often when studying classification methods. However, immunosignaturing microarrays may require that we change our underlying assumptions as we determine the suitability of a particular classifier.

In order to establish the question of classification suitability, we examine a basic classification algorithm, Linear Discriminant Analysis (LDA). LDA is widely used in analyzing biomedical data in order to classify two or more disease classes [58]. One of the most commonly used high-throughput analytical methods is the gene expression microarray. Probes on an expression microarray are designed to bind to a single transcript, splice variant or methy variant of that transcript. These one-on-one interactions provide relative transcript numbers and cumulatively help to define high-level biological pathways. LDA uses these data to define biologically relevant classes based on the contribution of differentially expressed genes. This method often uses statistically identified features (gene transcripts) that are different from one condition to another. LDA can leverage coordinated gene expression to make predictions based on a fundamental biological process. The advantage of this method is that relatively few features are required to make sweeping predictions. When features change sporadically or asynchronously, the discriminator predictions are adversely affected. This causes low sensitivity in exchange for occasionally higher discrimination. Tree-based methods use far more features to obtain a less biased but less sensitive view of the data. These methods can partition effects even if the effect sizes vary considerably. This approach can be more useful than frequentist approaches where it is important to maintain partitions in discreet groups.

Immunosignaturing has its foundations in both phage display and peptide microarrays. Most phage display methods that use random-sequence libraries also use fairly short peptides, on the order of 8–11 amino acids [9]. Epitope microarrays use peptides in the same size range, but typically far fewer total peptides, on the order of hundreds to thousands [10]. Each of these methods assumes that a single antibody binds to a single peptide, which is either detected by selection (phage display) or by fluorescent secondary antibody (epitope microarray). Immunosignaturing uses long 20-mer random-sequence peptides that have potentially 7 or more possible linear epitopes per peptide. Although immunosignaturing must make do with only 10,000 to ~300,000 peptides, the information content derived from partial binding makes these data useful in ways quite different from phage display [1115].

The complexity in analysis arises from the many-to-many relationship between peptide and antibody (Figure 1). This relationship imposes a particular challenge for classification because a simple one-to-one relationship between probe and target, idiomatic for gene expression microarrays, allows a coherent contribution of many genes that behave coordinately based on biological stimuli. That idiom is broken for immunosignaturing microarrays, where each peptide may bind a number of different antibodies and every antibody might bind a number of peptides. Unless disease-specific antibodies find similar groups of peptides across individuals, very little useful information is available to the classifier. The aim of this work is to assess the performance of various classification algorithms on immunosignaturing data.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-13-139/MediaObjects/12859_2012_Article_5303_Fig1_HTML.jpg
Figure 1

One-to-one correspondence found in gene expression microarrays is not observed for the immunosignaturing arrays. We propose that a single peptide may bind numerous antibodies, and have shown that a single antibody can bind hundreds of different peptides.

We have considered 17 diverse data mining classification methods. For feature selection, we used a simple t-test when we examined two classes, and a fixed-effects 1-way ANOVA for multiple classes with no post-hoc stratification. We have assessed these algorithms’ ability to handle increasing numbers of features by providing four different sets of peptides with increasing p-value cutoff. The four levels include from 10 (minimum) to >1000 (maximum) peptides. Each algorithm is thus tested under conditions that highlight either synergistic or antagonistic effects as the feature numbers increase.

Methods

Technology

A peptide microarray described previously [1115] was used to provide data for analysis. Two different sets of 10,000 random peptide sequences are tested. The two peptide sets are non-overlapping and are known as CIM10Kv1 and CIM10Kv2. Peptides are printed as in [1].

Sample processing

Samples consist of sera, plasma or saliva – each produces a suitable distribution of signals upon detection with an anti-human secondary IgG-specific antibody. Samples are added to the microarray at 1:500 dilutions in sample buffer (1xPBS, 0.5% Tween20, 0.5% Bovine Serum Albumin (Sigma, St. Louis, MO)), IgG antibodies are detected through a biotinylated secondary anti human IgG antibody (Novus anti-human IgG (H + L), Littleton, CO), which binds the primary. Fluorescently labeled streptavidin is used to label the secondary antibodies and scanned with an Agilent ‘C’ laser scanner in single-color mode. 16-bit images are processed using GenePix Pro 8, which provides the tabular information for each peptide in a continuous value ranging from 0–65,000. Four unique data sets have been used in this analysis, 2 run on the CIM10Kv1 and 2 on the CIM10Kv2. Each individual sample was run in duplicate; replicates with >0.8 Pearson correlation coefficient were considered for analysis.

Datasets

Center for Innovations in Medicine, Biodesign Institute, Arizona State University has an existing IRB 0912004625, which allows analysis of blinded samples from collaborators.
  1. a.)

    Type 1 diabetes data set: This dataset contains 80 sera samples (41 controls and 39 type 1 diabetes children ages 6 to 13). These samples were tested on the CIM10kV1microarrays.

     
  2. b.)

    Alzheimer’s disease data set: This dataset contains 23 samples (12 controls and 11 Alzheimer’s disease subjects). These were tested on the CIM10kV2 microarrays.

     
  3. c.)

    Antibodies dataset: This dataset contains 50 samples and has 5 groups monoclonal antibodies, arbitrarily arranged. All monoclonals were raised in mouse, and use the same secondary detection antibody. Samples were run on the CIM10kV1 microarrays.

     
  4. d.)

    Asthma dataset: This dataset consists of 47 unique samples containing serum from patients with 4 distinct classes, corresponding to the household environment. Condition A consists of 12 control subjects who had no environmental stimuli. Condition B consists of 12 subjects who had stimuli but no asthma-related symptoms. Condition C consists of 11 subjects who had no stimuli but with clinical asthma. Condition D consists of 12 subjects who have both stimuli and clinical asthma. Samples were tested on the CIM 10 kV2 microarrays. Asthma datasets were been analyzed by considering all four conditions using ANOVA in order to study the combined effect of stimuli and asthma on subjects and then by considering pair wise comparison of condition A vs. B, A vs. C, and B vs. D.

     

Data preprocessing, normalization and feature selection

The 16-bit tiff images from the scanned microarrays were imported into GenePix Pro 6.0 (Molecular Devices, Santa Clara, CA). Raw tabular data were imported into Agilent’s GeneSpring 7.3.1 (Agilent, Santa Clara, CA). Data were median normalized per array and log10 transformed. For feature selection we used Welch-corrected T-test with multiple tested (FWER = 5%). For multiple groups (Antibody and Asthma datasets) we used 1-way fixed-effects ANOVA.

Data mining classification algorithms

Four distinct peptide features are chosen for the comparison study. For each analysis, peptides are selected by t-test or ANOVA across biological classes, with 4 different p-value cutoffs. Cutoffs were selected to obtain roughly equivalent sized feature sets to assess the ability of each algorithm to process sparse to rich feature sets. Once the significant features were collected, data was imported into WEKA [16] for classification. The algorithms themselves spanned a wide variety of classifiers including Bayesian, regression based methods, meta-analysis, clustering, and tree based approaches.

We obtained accuracy from each analysis type using leave-one-out cross-validation. We obtained a list of t-test or ANOVA-selected peptides at each stringency level. The highest stringency uses peptides with p-values in the range of 10-5 to 10-10 and contains the least ‘noise’. The less-stringent second set uses p-values approximately 10-fold higher than the most stringent. The third contains the top 200 peptides and the forth contains ~1000 peptides at p < 0.05. Although different numbers of peptides are used for each dataset, each peptide set yields the same general ability to distinguish the cognate classes. The WEKA default setting of parameters were used for every algorithm to avoid bias and over fitting. These default parameters are taken from the cited papers listed below for each algorithm. Brief details of default parameters and algorithms are listed
  1. I.

    Naïve Bayes: Probabilistic classifier based on Bayes theorem. Numeric estimator precision values are chosen based on analysis of the training data. In the present study, normal distribution was used for numeric attributes rather than kernel estimator [17].

     
  2. II.

    Bayes net: Probabilistic graphical model that represents random variables and conditional dependencies in the form of a directed acyclic graph. A Simple Estimator algorithm has been used for finding conditional probability tables for Bayes net. A K2 search algorithm was used to search network structure [18, 19].

     
  3. III.

    Logistic Regression (Logistic R.): A generalized linear model that uses logistic curve modeling to fit the probabilistic occurrence of an event[20]. The Quasi-Newton method is used to search for optimization. 1x108 has been used for ridge values in the log likelihood calculation [21].

     
  4. IV.

    Simple Logistic: Classifier for building linear logistic regression models. For fitting the logistic model ‘LogitBoost’, simple regression functions are used. Automatic attribute selection is obtained by cross validation of the optimal number of ‘LogitBoost’ iterations [22]. Heuristic stop parameter is set at 50. The number of maximum iterations for LogitBoost has been set to 500.

     
  5. V.

    Support Vector Machines (SVM): A non-probabilistic binary linear classifier that constructs one or more hyper planes to be can be used for classification. For training support vector classes, John Platt’s sequential minimal optimization algorithm was used which replaces all missing values [23]. Here multiclass problems are used using pair-wise classification. The complexity parameter is set to 1. Epsilon for round off error is set to 1x10*-12. PolyKernel is the set to be kernel. The tolerance parameter is set to 0.001 [24, 25].

     
  6. VI.

    Multilayer Perceptron (MLP): A supervised learning technique with a feed forward artificial neural network through back-propagation that can classify non-linearly separable data [26, 27]. The learning rate is set to 0.3 and momentum applied during updating weights is set to 0.2. The validation threshold use to terminate the validation testing is set to 20.

     
  7. VII.

    K nearest neighbors (KNN): Instance based learning or lazy learning which trains the classifier function locally by majority note of its neighboring data points. Linear NN Search algorithm is used for search algorithm [28, 29]. K is set to 3.

     
  8. VIII.

    K Star: Instance based classifier that uses similarity function from the training set to classify test set. Missing values are averaged by column entropy curves and global blending parameter is set to 20 [30].

     
  9. IX.

    Attribute Selected Classifier (ASC): ‘Cfs subset’ evaluator is used during the attribute selection phase to reduce the dimension of training and test data. The ‘BestFit’ search method is invoked after which J48 tree classifier is used [31].

     
  10. X.

    Classification via clustering (K means): Simple k means clustering method is used where k is set to the number of classes in the data set [32]. Euclidean distance was used for evaluation with 500 iterations.

     
  11. XI.

    Classification via Regression (M5P): Regression is a method used to evaluate the relationship between dependent and independent variables through an empirically determined function. The M5P base classifier is used which combines conventional decision tree with the possibility of linear regression at the nodes. The minimum number of instances per leaf node is set to 4 [33].

     
  12. XII.

    Linear Discriminant Analysis (LDA): Prevalent classification technique that identifies the combination of features that best characterizes classes through linear relationships. Prior probabilities are set to uniform and the model as homoscedastic.

     
  13. XIII.

    Hyper Pipes: Simple, fast classifier that counts internally defined attributes for all samples and compares the number of instances of each attribute per sample. Classification is based on simple counts. Works well when there are many attributes [34].

     
  14. XIV.

    VFI: Voting feature interval classifier is a simple heuristic attribute-weighting scheme. Intervals are constructed for numeric attributes. For each feature per interval, class counts are recorded and classification is done by voting. Higher weight is assigned to more confident intervals. The strength of the bias towards more confident features is set to 0 [35].

     
  15. XV.

    J48: Java implementation of C4.5 algorithm. Based on the Hunt’s algorithm, pruning takes place by replacing internal node with a leaf node. Top-down decision tree/voting algorithm [36]. 0.25 is used for the confidence factor. No Laplace method for tree smoothing [37].

     
  16. XVI.

    Random Trees: A tree is grown from data that has K randomly chosen attributes at each node. It does not perform pruning. K-value (log2 (number of attributes) + 1) is set at zero. There is no depth restriction. The minimum total weight per leaf is set to 1 [34].

     
  17. XVII.

    Random Forest (R. Forest): Like Random Tree, the algorithm constructs a forest of random trees [38] with locations of attributes chosen at random. It uses an ensemble of unprune decision trees by a bootstrap sample using training data. There is no restriction on the depth of the tree; number of tress used is 100.

     

Time performance

CPU time was calculated for every algorithm at the four different significance levels. This time was measured on a standard PC (Intel dual core, 2.2 GHz 3 Gb RAM) that was completely dedicated to WEKA. To measure CPU time, open source jar files from WEKA were imported to Eclipse where the function ‘time ()’ was invoked prior to running the classification including the time required for cross validation. Most Windows 7 services were switched off; the times reported were an average of 5 different measurements.

Results

Overall performance accuracy of classification algorithms over all data sets

For each dataset, accuracies are measured at four levels (top 10, 50,200, 1000 peptides) at various levels of significance. Overall average performance measure is calculated for each algorithm for a given data set. Table 1 shows the overall average percentage score for each algorithm calculated by averaging accuracy, specificity, sensitivity and area under ROC curve under all levels of significance. Scores >90% are marked in bold. MLP algorithm did not finish due to huge memory requirements on last level of significance and is averaged based on first three levels of significance. For type 1 diabetes, Alzheimer’s and antibodies dataset, >6 algorithms scored >90% average score. Overall, Naïve Bayes had the highest average score (90.4%) and was always among top 3 algorithms among all datasets.
Table 1

Overall performance measure of classification algorithms on datasets

Algorithms

T1D

Az

Ab

Asthma

A & B

A & C

B & D

Avg.

Rank

Naïve Bayes

92.0

93.4

91.5

77.7

90.8

93.5

93.6

90.4

1

MLP

90.1

92.7

90.2

71.1

84.7

92.7

89.3

87.3

2

SVM

91.6

88.0

90.7

71.3

86.1

88.4

93.1

87.0

3

VFI

90.5

92.2

75.5

62.6

87.7

93.4

92.7

84.9

4

Hyper Pipes

89.8

89.7

81.3

62.3

82.0

86.6

87.8

82.8

5

R. Forest

91.5

82.4

93.3

62.8

80.6

81.4

81.1

81.9

6

Bayes Net

90.3

87.7

92.5

53.9

80.2

83.2

85.1

81.8

7

K-means

88.3

91.8

80.7

59.6

77.8

83.3

83.6

80.7

8

Logistic R.

90.6

93.3

60.4

50.7

81.5

84.8

90.7

78.9

9

SLR

92.2

71.8

90.1

72.2

65.0

68.5

84.7

77.8

10

KNN

91.4

81.5

52.5

55.8

87.5

75.7

89.0

76.2

11

K star

81.9

90.7

89.4

53.5

64.3

68.8

70.7

74.2

12

M5P

85.1

58.7

83.2

60.0

75.2

73.4

79.6

73.6

13

J48

80.3

69.7

78.4

48.7

70.6

68.4

76.7

70.4

14

Random Tree

83.8

71.7

76.2

52.9

69.3

60.8

75.0

70.0

15

ASC

76.8

70.0

77.9

43.1

72.0

63.1

76.7

68.5

16

LDA

69.7

52.0

89.1

70.8

62.8

69.7

52.6

66.7

17

T1D: Type 1 diabetes datasets, Az: Alzehemer’s dataset, Ab: Antibodies dataset. Table showing algorithms overall performance in each datasets based on average score. Score >90% are marked in bold. Naïve Bayes scored the overall highest average score of 90.4%.

Performance accuracy of classification algorithms at different levels of significance over all data sets

For each data set, different levels of significance are chosen to measure the performance accuracy of each algorithm. These levels contain approximately equal number of peptides for each data set. The first level contains 10 peptides selected from the t-test (lowest p value) and hence contains the least noise. Next, approximately 50 peptides, 200 peptides and 1000 peptides were chosen for the other three levels.

Tables 2, 3, 4, 5, 6, 7, 8 shows 4 different performance measures (accuracy, specificity, sensitivity and area under ROC curve) at different levels of significance over 7 datasets. For the Asthma dataset, we considered all conditions A-D together, then performed the pair-wise comparisons of condition A and B, condition A and C, and condition B and D at three different levels of significance. Measures >90% are marked in bold. For the diabetes dataset, 9 algorithms achieved >90% score. For Alzheimer’s and the Antibodies dataset, 6 algorithms achieved >90% score. Naïve Bayes scored 100% in all 4 measures at the first level of significance in the Alzheimer’s dataset and scored 91.5% average score on the Antibodies dataset. For the Asthma datasets, the highest score was <80%. Only Naïve Bayes had >90% specificity for more than one level of significance. For two conditions in Asthma datasets, Naïve Bayes and VFI scored >90% average score.
Table 2

Performance measures of data mining algorithm at different levels of significance over Type 1 diabetes dataset

SIGNIFICANCE

p < 5 x 10-13

p < 5 x 10-10

p < 5 x 10-7

p < 5 x 10-4

 

Algorithm

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc

Sp

Sn

AUC

Avg.

SLR

87.5

85.0

89.7

0.93

92.5

90.2

94.9

0.97

92.5

92.0

92.0

0.96

92.5

90.0

94.9

0.96

92.2

Naïve Bayes

90.0

85.4

95.0

0.97

91.3

90.2

92.3

0.98

92.5

90.2

95.0

0.96

89.0

85.4

92.3

0.92

92.0

SVM

88.8

82.9

94.9

0.89

90.0

82.9

97.4

0.90

93.8

90.2

97.4

0.93

93.8

92.7

94.9

0.94

91.6

R. Forest

87.5

87.8

87.2

0.96

92.5

90.2

94.9

0.97

91.5

87.8

94.9

0.97

88.8

85.4

92.3

0.94

91.5

KNN

92.5

90.2

94.9

0.95

95.0

92.7

97.4

0.96

90.0

85.4

94.9

0.93

85.0

80.5

89.7

0.90

91.4

Logistic. R

86.3

87.8

84.6

0.82

92.5

90.2

94.9

0.97

92.5

92.7

97.4

0.97

87.5

92.7

82.1

0.92

90.6

VFI

87.5

82.9

92.3

0.95

92.5

90.2

94.9

0.97

88.8

85.4

92.3

0.95

87.5

82.9

92.3

0.92

90.5

Bayes Net

91.3

90.2

92.3

0.97

90.0

85.4

94.9

0.98

90.0

85.4

94.9

0.95

83.8

78.0

89.7

0.89

90.3

MLP

80.0

80.5

79.5

0.89

91.3

90.2

92.3

0.98

93.8

90.2

97.4

0.99

dnf

dnf

dnf

dnf

90.1*

Hyper Pipes

87.5

90.2

84.6

0.96

91.3

90.2

92.3

0.97

90.0

90.2

89.7

0.95

83.8

92.7

74.4

0.92

89.8

K-means

91.3

82.9

100

0.92

90.0

82.9

97.4

0.90

86.3

78.0

94.9

0.87

85.0

75.6

94.9

0.85

88.3

M5P

88.8

85.4

92.3

0.94

85.0

80.5

89.7

0.94

81.3

78.0

84.6

0.87

78.8

73.2

84.6

0.85

85.1

Random Tree

85.0

87.8

82.1

0.85

78.8

75.6

82.1

0.79

87.5

85.4

89.7

0.88

83.8

85.4

82.1

0.84

83.8

K star

87.5

87.8

87.2

0.96

91.3

85.4

97.4

0.98

90.0

85.4

94.9

0.97

53.8

100

5.1

0.54

81.9

J48

86.3

85.4

87.2

0.79

81.3

82.9

79.5

0.83

78.8

82.9

74.4

0.72

80.0

85.4

74.4

0.73

80.3

ASC

86.3

85.4

87.2

0.79

80.0

82.9

76.9

0.80

80.0

87.8

71.8

0.78

66.3

80.5

51.3

0.55

76.8

LDA

88.8

82.9

94.9

0.96

91.3

85.4

97.4

0.95

40.0

96.7

15.8

0.68

21.3

94.4

0.0

0.48

69.7

Acc: Accuracy, Sp: Specificity, Sn: Sensitivity, AUC: Area under ROC curve, Avg: Average score in % for each algorithms, dnf: “Did Not Finish”, * denotes Avg. from 3 significance levels. Measures >90% are marked in bold.

Table 3

Performance measures of data mining algorithm at different levels of significance over Alzheimer’s dataset

SIGNIFICANCE

p < 5 x 10-5

p < 5 x 10-4

p < 5 x 10-3

p < 5 x 10-2

 

Algorithm

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc

Sp

Sn

AUC

Avg.

Naïve Bayes

100

100

100

1.00

91.3

82.0

100

0.96

91.3

82.0

100

0.96

86.5

91.0

84.0

0.94

93.4

Logistic. R

95.0

90.0

100

0.99

95.7

90.0

100

0.97

91.3

90.0

91.7

0.90

91.3

90.0

91.7

0.90

93.3

MLP

91.3

90.9

91.7

0.97

95.6

90.9

100

0.97

87.0

90.9

83.3

0.97

dnf

dnf

dnf

dnf

92.7*

VFI

91.3

90.9

91.7

0.87

95.7

90.9

100

0.92

91.3

81.8

100

0.89

91.3

81.8

100

1.00

92.2

KNN

91.3

90.9

91.7

0.93

95.6

90.9

100

0.93

86.9

90.9

83.3

0.95

91.3

90.9

91.7

0.92

91.8

K-means

82.6

100

66.7

0.83

91.3

90.9

100

0.91

95.7

90.9

100

0.96

91.3

81.8

100

0.90

90.7

Hyper Pipes

91.3

81.8

100

0.98

95.7

90.9

100

0.97

91.3

81.8

100

0.95

73.9

81.8

66.7

0.90

89.7

SVM

87.0

90.9

83.3

0.87

95.7

90.9

100

0.95

82.6

81.8

83.3

0.83

87.0

81.8

91.7

0.87

88.0

Bayes Net

91.3

81.8

100

0.96

91.3

90.9

91.7

0.95

87.0

81.8

91.7

0.86

78.3

81.8

75.0

0.84

87.7

R. Forest

86.9

81.8

91.7

0.94

82.6

81.8

83.3

0.93

73.9

72.7

75.0

0.89

72.6

81.8

75.0

0.84

82.4

K star

95.7

90.9

100

0.98

91.3

90.9

91.7

0.94

78.2

81.8

75.0

0.86

56.5

18.2

91.7

0.64

81.5

SLR

86.9

81.8

91.7

0.96

73.9

72.7

75.0

0.82

60.9

63.6

58.3

0.80

52.2

54.5

50.0

0.69

71.8

Random Tree

78.3

72.7

83.3

0.78

60.9

54.5

66.7

0.61

73.9

63.6

83.3

0.74

73.9

81.8

66.7

0.74

71.7

ASC

73.9

63.6

83.3

0.61

68.9

63.6

58.3

0.56

73.9

81.8

66.7

0.75

78.2

63.9

91.7

0.61

70.0

J48

73.9

63.6

83.3

0.61

60.9

63.6

58.3

0.56

73.9

81.8

70.0

0.75

78.3

63.6

91.7

0.61

69.7

M5P

69.5

54.5

83.3

0.80

52.2

45.5

58.3

0.73

56.5

45.5

66.7

0.43

56.5

36.4

75.0

0.44

58.7

LDA

69.6

72.7

66.7

0.81

34.8

40.0

75.0

0.45

34.8

0.0

100

0.30

30.4

100

0.0

0.52

52.0

Acc: Accuracy, Sp: Specificity, Sn: Sensitivity, AUC: Area under ROC curve, Avg: Average score in % for each algorithms, dnf: Did not Finish”, * denotes Avg. from 3 significance levels. Measures >90% are marked in bold.

Table 4

Performance measures of data mining algorithm at different levels of significance over Antibodies dataset

SIGNIFICANCE

p < 5 x 10-8

p < 5 x 10-7

p < 5 x 10-6

p < 5 x 10-5

 

Algorithm

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc

Sp

Sn

AUC

Avg.

R. Forest

90.0

93.0

90.0

0.96

90.0

91.0

90.0

0.97

92.0

94.0

92.0

0.96

94.0

96.0

94.0

0.97

93.3

Bayes Net

88.0

92.0

88.0

0.96

88.0

91.0

88.0

0.96

94.0

95.0

94.0

0.95

92.0

95.0

92.0

0.96

92.5

Naïve Bayes

88.0

94.0

88.0

0.96

88.0

94.0

88.0

0.96

88.0

94.0

88.0

0.96

88.0

94.0

88.0

0.96

91.5

SVM

80.0

86.6

80.0

0.86

86.0

89.9

86.0

0.89

94.0

96.6

97.0

0.95

96.0

96.9

96.0

0.96

90.7

MLP

80.0

89.8

80.0

0.91

86.0

89.9

86.0

0.96

94.0

96.6

94.0

0.99

dnf

dnf

dnf

dnf

90.2*

SLR

84.0

91.6

84.0

0.89

86.0

83.2

86.0

0.92

90.0

93.5

90.0

0.97

92.0

95.0

92.0

0.96

90.1

KNN

82.0

90.7

82.0

0.92

84.0

88.7

84.0

0.94

86.0

91.2

86.0

0.95

92.0

96.4

92.0

0.95

89.4

Logistic R.

72.0

85.3

72.0

0.92

84.0

90.1

84.0

0.93

92.0

96.4

92.0

0.98

90.0

96.1

90.0

0.98

89.1

M5P

80.0

91.5

80.0

0.92

76.0

87.4

76.0

0.90

78.0

89.4

78.0

0.91

74.0

85.4

74.0

0.89

83.2

Hyper Pipes

64.0

83.6

64.0

0.90

72.0

84.9

72.0

0.90

80.0

87.5

80.0

0.92

80.0

87.1

80.0

0.93

81.3

K star

88.0

93.4

88.0

0.94

94.0

97.2

94.0

0.95

82.0

91.8

82.0

0.93

20.0

90.2

20.8

0.68

80.7

J48

80.0

92.5

80.0

0.86

72.0

87.0

72.0

0.87

70.0

87.6

70.0

0.79

64.0

86.1

64.0

0.77

78.4

ASC

82.0

91.7

82.0

0.87

72.0

82.9

72.0

0.82

70.0

87.8

70.0

0.76

64.0

88.5

64.0

0.75

77.9

Random Tree

72.0

90.3

72.0

0.81

64.0

82.1

64.0

0.73

68.0

87.7

68.0

0.78

74.0

89.7

74.0

0.82

76.2

VFI

72.0

88.5

72.0

0.86

64.0

91.9

64.0

0.85

58.0

94.7

58.0

0.86

52.0

94.5

52.0

0.89

75.5

LDA

68.0

84.5

68.0

0.88

40.0

81.1

40.0

0.71

42.0

89.7

48.8

0.54

20.0

88.4

25.0

0.58

60.4

K means

46.0

68.7

46.0

0.57

46.0

68.7

46.0

0.57

40.0

68.1

40.0

0.54

40.0

68.1

40.0

0.54

52.5

Acc: Accuracy, Sp: Specificity, Sn: Sensitivity, AUC: Area under ROC curve, Avg: Average score in % for each algorithms, dnf: Did not Finish”, * denotes Avg. from 3 significance levels. Measures >90% are marked in bold.

Table 5

Performance measures of data mining algorithm at different levels of significance over Asthma dataset 4 classes

SIGNIFICANCE

p < 5 x 10-5

p < 5 x 10-4

p < 5 x 10-3

p < 5 x 10-2

 

Algorithm

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc

Sp

Sn

AUC

Avg.

Naïve Bayes

61.7

87.2

61.7

0.82

68.1

89.3

68.1

0.86

72.3

90.8

72.3

0.87

70.2

90.0

70.2

0.86

77.7

SLR

57.5

85.8

57.4

0.80

57.4

85.6

57.4

0.81

72.3

90.7

72.3

0.85

55.3

86.1

55.3

0.76

72.2

SVM

55.3

86.2

55.3

0.77

55.3

86.2

55.3

0.77

61.7

87.2

61.7

0.82

66.0

87.6

66.0

0.81

71.3

MLP

55.3

86.1

55.3

0.82

53.2

84.6

53.2

0.80

63.8

87.8

63.8

0.88

dnf

dnf

dnf

dnf

71.1*

Logistic R.

48.9

87.0

48.9

0.78

53.2

84.4

53.2

0.79

59.6

86.4

59.6

0.84

68.0

89.2

68.1

0.86

70.8

R. Forest

48.9

86.9

48.9

0.77

48.9

86.9

48.9

0.77

46.8

81.1

46.8

0.75

40.4

80.0

40.4

0.71

62.8

VFI

48.9

82.8

48.9

0.66

48.9

82.9

48.9

0.67

51.0

83.6

51.1

0.69

46.8

81.9

46.8

0.77

62.6

Hyper Pipes

51.1

83.4

51.1

0.72

53.2

84.0

53.2

0.70

46.8

71.8

46.8

0.74

42.6

80.3

42.0

0.75

62.3

M5P

48.9

82.8

48.9

0.79

55.3

86.1

55.3

0.81

42.5

81.0

42.6

0.68

27.6

75.8

27.7

0.57

60.0

KNN

42.5

87.1

42.6

0.69

46.8

86.6

46.8

0.67

44.6

88.0

44.7

0.69

36.2

79.7

36.2

0.67

59.6

K means

40.4

81.9

40.4

0.60

46.8

82.2

46.8

0.65

42.6

80.7

42.6

0.62

34.0

78.0

34.0

0.56

55.8

Bayes Net

38.3

79.3

38.3

0.56

36.2

77.8

36.2

0.56

44.7

81.4

44.7

0.63

36.2

77.6

36.2

0.60

53.9

K star

48.9

83.0

48.9

0.70

38.3

79.4

38.3

0.63

36.2

79.4

36.2

0.62

23.4

76.4

23.4

0.49

53.5

Random Tree

29.8

76.6

29.8

0.53

40.4

80.2

40.4

0.60

38.3

79.5

38.3

0.59

40.4

80.2

40.4

0.60

52.9

LDA

53.2

84.4

53.2

0.80

27.7

80.0

32.5

0.57

8.5

86.5

16.7

0.56

14.9

83.6

23.3

0.53

50.7

J48

27.7

75.4

27.7

0.52

27.7

75.9

27.7

0.49

42.6

80.8

42.6

0.58

31.9

77.1

31.9

0.52

48.7

ASC

27.7

76.0

27.7

0.52

19.2

71.8

19.1

0.46

29.8

76.7

29.8

0.52

21.2

74.8

21.3

0.45

43.1

Acc: Accuracy, Sp: Specificity, Sn: Sensitivity, AUC: Area under ROC curve, Avg: Average score in % for each algorithms, dnf: Did not Finish”, * denotes Avg. from 3 significance levels. Measures >90% are marked in bold.

Table 6

Performance measures of data mining algorithm at different levels of significance on A & B conditions

SIGNIFICANCE

p < 5 x 10-4

p < 5 x 10-3

p < 5 x 10-2

 

Algorithm

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Avg.

Naïve Bayes

87.5

83.3

91.7

0.84

91.7

83.3

100

0.97

91.7

83.3

100

0.96

90.8

VFI

79.2

75.0

83.3

0.93

91.7

83.3

100

0.95

87.5

75.0

100

0.90

87.7

K means

87.5

83.3

91.7

0.88

91.7

83.3

100

0.92

83.3

75.0

91.7

0.83

87.5

SVM

83.3

83.3

83.3

0.83

87.5

91.7

83.3

0.87

87.5

83.3

91.7

0.88

86.1

MLP

79.2

83.3

75.0

0.70

91.7

91.7

91.7

0.95

dnf

dnf

dnf

dnf

84.7*

Hyper Pipes

83.3

75.0

91.7

0.91

83.3

83.3

83.3

0.93

70.8

83.3

58.3

0.88

82.0

Logistic R.

66.7

83.3

50.0

0.76

95.8

91.7

100

0.92

79.2

83.3

75.0

0.85

81.5

Random Forest

79.2

83.3

75.0

0.91

79.2

75.0

83.3

0.86

79.2

75.0

83.3

0.78

80.6

Bayes Net

83.3

75.0

91.7

0.87

83.3

83.3

83.3

0.83

75.0

75.0

75.0

0.67

80.2

KNN

75.0

83.3

66.7

0.85

75.0

91.7

58.3

0.90

75.0

91.7

58.3

0.84

77.8

M5P

75.0

83.3

66.7

0.74

75.0

75.0

75.0

0.79

75.0

75.0

75.0

0.74

75.2

ASC

62.5

66.7

58.3

0.65

79.2

83.3

75.0

0.85

70.8

75.0

66.7

0.76

72.0

J48

62.5

66.7

58.3

0.65

79.2

83.3

75.0

0.85

66.7

75.0

58.3

0.72

70.6

Random Tree

70.8

75.0

66.7

0.70

70.8

75.0

66.7

0.70

66.7

66.7

66.7

0.67

69.3

SLR

70.8

75.0

66.7

0.80

66.7

75.0

58.3

0.77

50.0

50.0

50.0

0.60

65.0

K star

66.7

91.7

41.7

0.83

58.3

100

46.7

0.83

50.0

0.0

100

0.50

64.3

LDA

79.2

83.3

75.0

0.84

61.2

64.5

54.5

0.52

29.2

14.3

100

0.56

62.8

Acc: Accuracy, Sp: Specificity, Sn: Sensitivity, AUC: Area under ROC curve, Avg: Average score in % for each algorithms, dnf: Did not Finish”, * denotes Avg. from 3 significance levels. Measures >90% are marked in bold.

Table 7

Performance measures of data mining algorithm at different levels of significance on A & C conditions

SIGNIFICANCE

p < 5 x 10-4

p < 5 x 10-3

p < 5 x 10-2

 

Algorithm

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Avg.

Naïve Bayes

91.3

91.7

91.0

0.94

96.0

100

90.9

0.99

91.3

100

81.8

0.95

93.5

VFI

95.6

100

90.0

0.97

95.6

100

90.0

0.97

87.0

83.3

90.0

0.95

93.4

MLP

86.9

91.7

81.8

0.97

95.6

100

90.9

0.98

dnf

dnf

dnf

dnf

92.7*

SVM

95.6

100

90.9

0.96

95.7

100

90.9

0.96

73.9

75.0

72.7

0.74

88.4

Hyper Pipes

95.7

100

90.9

0.99

82.6

91.7

72.7

0.90

78.2

83.3

72.7

0.83

86.6

Logistic R.

86.0

91.7

81.8

0.96

95.7

100

90.9

0.92

69.6

83.3

54.5

0.76

84.8

KNN

91.3

100

81.8

0.92

91.3

100

81.8

0.94

65.2

66.7

63.6

0.72

83.3

Bayes Net

95.7

100

90.9

0.99

82.6

83.3

81.8

0.92

69.6

66.7

72.7

0.64

83.2

Random Forest

87.0

83.3

90.9

0.93

82.6

83.3

81.8

0.91

69.5

66.7

72.7

0.75

81.4

K means

69.6

83.3

54.5

0.69

95.7

100

90.9

0.95

60.9

63.6

63.6

0.63

75.7

M5P

91.3

91.7

90.9

0.86

65.2

58.3

72.7

0.72

65.2

58.3

72.7

0.56

73.4

LDA

91.3

100

81.8

0.97

65.2

71.7

58.6

0.77

17.4

25.0

100

0.52

69.7

K star

73.9

91.7

54.5

0.93

78.2

100

54.5

0.82

47.8

0.0

100

0.50

68.8

SLR

87.0

83.3

90.9

0.89

73.9

75.0

72.7

0.74

43.5

41.7

45.5

0.45

68.5

J48

69.6

66.7

72.7

0.76

69.6

58.3

81.8

0.77

60.9

58.3

63.6

0.66

68.4

ASC

65.6

66.7

72.7

0.76

69.6

66.7

72.7

0.76

47.8

66.7

27.3

0.49

63.1

Random Tree

73.9

91.7

54.5

0.73

73.9

66.7

81.8

0.74

34.8

33.3

36.4

0.35

60.8

Acc: Accuracy, Sp: Specificity, Sn: Sensitivity, AUC: Area under ROC curve, Avg: Average score in % for each algorithms, dnf: Did not Finish”, * denotes Avg. from 3 significance levels. Measures >90% are marked in bold.

Table 8

Performance measures of data mining algorithm at different levels of significance on B & D conditions

SIGNIFICANCE

p < 5 x 10-4

p < 5 x 10-3

p < 5 x 10-2

 

Algorithm

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Acc.

Sp

Sn

AUC

Avg.

Naïve Bayes

91.7

100

83.3

0.95

91.7

91.7

91.7

0.92

95.8

91.7

100

0.98

93.6

SVM

91.7

100

83.3

0.92

91.7

91.7

91.7

0.92

95.8

100

91.7

0.96

93.1

VFI

87.5

100

75.0

0.93

91.7

100

83.3

0.94

95.8

100

91.7

1.00

92.7

Logistic R.

79.1

83.3

75.0

0.92

100

100

100

1.00

87.5

91.7

83.3

0.97

90.7

MLP

87.5

91.7

83.3

0.94

87.5

83.3

91.7

0.96

dnf

dnf

dnf

dnf

89.3*

K means

87.5

91.7

83.3

0.88

91.4

91.7

91.7

0.92

87.5

83.3

91.7

0.88

89.0

Hyper Pipes

87.5

83.3

91.7

0.89

91.7

91.7

91.7

0.87

83.3

75.0

91.7

0.90

87.8

Bayes Net

83.3

83.3

83.3

0.89

87.5

91.7

83.3

0.86

83.3

83.3

83.3

0.84

85.1

SLR

83.3

83.3

83.3

0.88

79.2

66.7

91.7

0.90

87.5

100

75.0

0.89

84.7

KNN

79.2

75.0

83.3

0.80

83.3

83.3

83.3

0.83

87.5

91.7

83.3

0.90

83.6

Random Forest

83.3

83.3

83.3

0.83

79.2

83.3

75.0

0.84

79.2

83.3

75.0

0.81

81.1

M5P

87.5

91.7

83.3

0.88

79.2

83.3

75.0

0.73

75.0

83.3

66.7

0.69

79.6

ASC

91.7

100

83.3

0.83

75.0

83.3

66.7

0.61

70.8

75.0

66.7

0.64

76.7

J48

91.7

100

83.3

0.83

75.0

83.3

66.7

0.61

70.8

75.0

66.7

0.64

76.7

Random Tree

83.3

91.7

75.0

0.83

70.8

66.7

75.0

0.71

70.8

66.7

75.0

0.71

75.0

K star

70.8

66.7

75.0

0.83

79.2

75.0

83.3

0.82

58.3

100

16.7

0.58

70.7

LDA

62.5

72.3

60.9

0.75

50.0

65.0

48.0

0.71

20.8

42.6

18.6

0.45

52.6

Acc: Accuracy, Sp: Specificity, Sn: Sensitivity, AUC: Area under ROC curve, Avg: Average score in % for each algorithms, dnf: Did not Finish”, * denotes Avg. from 3 significance levels. Measures >90% are marked in bold.

Comparative analysis of worst time performance of classification algorithms over data sets

The amount of time taken by each algorithm to build the model and perform cross validation was measured. Table 9 shows the time in milliseconds for each algorithm at the lowest level of significance when the number of peptides nears 1000. Random Tree was the fastest, at ~1000 milliseconds (average) to complete the task, while MLP was the worst which did not finish due to high memory requirements. Random tree, Hyper Pipes, Naïve Bayes, VFI and KNN were the five fastest algorithms; each took less than ~4000 milliseconds to complete classification of >1,000 peptides. Logistic Regression and Attribute Selected Classifier, MLP were among the slowest algorithms taking more than 20 minutes to perform classification of >1,000 peptides. The absolute ranking for every algorithm was consistent per dataset; only three datasets have been considered to measure time performance.
Table 9

Worst case time performance (in ms) of classification algorithms

Data set

Diabetes

Alzheimer’s

Antibodies

Avg. (in ms)

Rank

Random Tree

1809

491

1478

1260

1

KNN

3016

607

910

1511

2

Hyper Pipes

2486

602

2180

1756

3

Naïve Bayes

4780

1158

2480

2806

4

VFI

7440

1357

3000

3932

5

J48

16581

1385

11731

9899

6

K star

25974

2348

6341

11555

7

SVM

10496

2722

29008

14076

8

R. Forest

50087

8032

21452

26524

9

M5P

50290

8563

23452

27435

10

Bayes Net

55672

9031

25000

29901

11

K-means

85955

12405

29658

42672

12

SLR

632840

48215

605365

428806

13

LDA

658668

869523

632983

720391

14

Logistic R.

1589092

1146783

1315256

1350377

15

ASC

5444533

2465021

4565896

4158483

16

MLP

dnf

dnf

dnf

NA

17

Table showing time performance in milliseconds over >1000 peptides for three datasets. Random Tree, KNN, Hyper Pipes and VFI were among the fastest. MLP were among the slowest with dnf: “Did not finish”. Time measurements less than 10 seconds are marked in bold.

Comparative analysis of time performance of classification algorithms at different levels of significance over three data sets

For each level of significance, time was measured for each algorithm to build the model and for cross validation. At the highest level of significance (about 10 peptides), each algorithm were fast enough to complete the task in under 25 seconds. Execution times increased as the level of significance was lowered due to the higher number of features and increased difficulty in constructing the model. Table 10 shows classification algorithms time performance at various levels of significance.
Table 10

Time performance (in ms) of classification algorithms on datasets

 

Diabetes dataset

Alzheimer’s dataset

Antibodies dataset

p value<

5x10-13

5x10-10

5x10-7

5x10-4

5x10-5

5x10-4

5x10-3

5x10-2

5x10-8

5x10-7

5x10-6

5x10-5

R. Tree

337

408

571

1809

184

200

218

491

250

265

608

1478

KNN

265

333

585

3016

130

156

239

607

187

234

414

910

Hyper Pipes

226

274

630

2486

119

259

423

602

281

312

736

2180

Naïve Bayes

250

456

1120

4780

182

340

500

1158

265

362

892

2480

VFI

299

561

1384

7440

187

337

623

1357

280

368

1379

3000

J48

415

833

3718

16581

166

256

712

1385

468

880

3011

11731

K star

468

1387

4150

25974

187

260

666

2349

299

562

2340

6341

SVM

3313

3635

5304

10496

1054

1108

1389

2722

18297

18372

23712

29009

R. Forest

5717

11889

18254

50087

952

1852

4843

8032

5004

6749

13848

21452

M5P

701

2583

7717

50290

290

524

2324

8563

2632

4711

12033

23452

Bayes Net

718

2087

5653

55672

334

662

4996

9031

733

1140

3394

25000

K means

2618

6651

11876

85955

593

1123

7212

12405

850

908

3442

29658

SLR

11215

26380

79308

632840

1330

3413

22625

48215

17389

20649

89107

605365

LDA

683

1044

7994

658668

402

699

35568

869523

1512

2018

17373

632983

Logistic R.

1204

2592

24687

1589092

629

1651

48659

1146783

1654

9379

255103

1315256

ASC

864

3504

32836

5444533

518

1859

36849

2465021

1217

1763

25496

4565896

MLP

23759

314076

4572305

dnf

2057

30342

2789485

dnf

22916

156905

3277395

dnf

Table showing time performance in milliseconds on all level of significance for three datasets. MLP were among the slowest with dnf: “Did not finish”. Time measurements less than 10 seconds are marked in bold.

Results summary

We have explored several disparate classifiers using a relatively new type of microarray data: immunosignaturing data. The tested algorithms come from a broad family of approaches to classify data. We chose algorithms from Bayesian, regression, trees, multivariate and meta analysis and we believe we have sampled sufficiently that the results are relevant. From Table 2 we found that Naïve Bayes had a higher average performance than all other algorithms tested. Naïve Bayes achieved > 90% average for 2 classes datasets where there is a clear distinction between two classes. For the multi-class the Antibodies dataset, where there is a clear difference between different types of antibodies, Naïve Bayes scored 88% average accuracy and was ranked third, close to the 93.3% accuracy of random forest. On the Asthma dataset, containing four classes, none of the algorithms were able to achieve more than 75% accuracy. This matches the biological interpretation very well. Naïve Bayes outperformed all algorithms for speed and accuracy, achieving 77.7% average score overall. Naïve Bayes was one of the top five fastest algorithms, ~500 times faster than the logistic regression. A summary of the all algorithms performance measures and time is given in below and described in Table 11. Distance metrics have been defined to access performance measures for all algorithms compared to the highest scoring algorithm on a given dataset.
  1. I.

    Naïve Bayes: Naïve Bayes performed best overall with > 90% overall average score. It was always among the top 3 algorithms in all 7 comparisons. It ranked first 5 out 7 times when comparing all datasets. It was on an average just 0.3% behind the rank 1 algorithm in overall comparison. It is 2X slower than the fastest algorithm due to its mathematical properties. It would be feasible to perform large-scale classification studies using Naïve Bayes.

     
  2. II.

    Multilayer Perceptron (MLP): It ranked second with overall score of 87.3% and was very close to SVM. The overall score is biased since MLP did not finish for level containing ~1000 peptides and hence scored was averaged from just the three levels. It was the slowest algorithm and infeasible to perform large-scale classification.

     
  3. III.

    Support Vector Machines (SVM): Although it ranked third, it was not significantly different from the MLP in terms of performance measures. It was 700X faster than MLP and achieved >90% measured accuracy 3 times. Both MLP and SVM were <5% behind the rank 1 algorithm on average.

     
  4. IV.

    VFI: VFI ranked fourth in overall performance measures and was the among top 5 fastest algorithms due to its voting method. Four times it obtained >90% average overall accuracy and ranked 2nd twice.

     
  5. V.

    Hyper Pipes: Hyper pipes ranked fifth overall in performance measures and was among the fastest of the tested algorithms, likely due to its inherently simplistic ranking method. It was <8% from first place 6 times.

     
  6. VI.

    Random Forest: Random forest ranked sixth in overall performance measures and performed better on datasets having multiple classes (Antibodies and Asthma). It was 21 times slower than the fastest algorithm due to bootstrapping.

     
  7. VII.

    Bayes net: Ranked in the middle for overall accuracy and time. It scored >90% overall measures twice. It was slower than the Naïve Bayes due to construction of networks in the form of an acyclic graph and it is relatively inefficient compared to Naïve Bayes due to the change in network topology during assessment of probability.

     
  8. VIII.

    K means: K-means ranked eighth in overall performance measures and was 34X slower than the fastest algorithm in time performance due to the multiple iterations required to form clusters. It performed far better for 2 classes compared to multiple classes because guaranteed convergence, scalability and linear separation boundaries are more easily maintained.

     
  9. IX.

    Logistic Regression: Logistic regression ranked ninth in overall accuracy. It was >90% three times. It was among the worst in time performance, being ~1000 times slower than the fastest algorithm as it needs to regress on high number of features. It is efficient for small numbers of features and sample sizes > 400.

     
  10. X.

    Simple Logistic: It ranked tenth in overall performance measures and ranked first on the diabetes dataset. It ranked second in multiclass Asthma dataset. It was slow in time performance due to LogitBoost iterations.

     
  11. XI.

    K nearest neighbors: It performed well on the 2 classes dataset but didn’t perform as well for multi class datasets. It was >90% performance for only rather difficult Diabetes dataset. This may be related to evenly defined but diffuse clusters related to the subtle differences between the Asthma patients.

     
  12. XII.

    K star: It performed >90% for only the Diabetes dataset and was 9 times slower than the fastest algorithm. This algorithm may also be sensitive to the even and diffuse clusters described by this dataset.

     
  13. XIII.

    M5P: It did not perform well on either time performance or accuracy. It never achieved >90% average score and was 22 times slower than the fastest algorithm due to formation of comprehensive linear model for every interior node of the unpruned tree.

     
  14. XIV.

    J48: Top 5 fastest algorithm due to rapid construction of trees. It was >20% behind from the rank 1 algorithm on an average; its lower performance may possibly be due to formation of empty/insignificant branches which often leads to overtraining.

     
  15. XV.

    Random Trees: It was the fastest algorithm since it builds trees of height log(k) where k is the number of attributes, however it achieves poor accuracy since it performs no pruning.

     
  16. XVI.

    Attribute Selected Classifier (ASC): One of the slowest algorithms as it had to evaluate attributes prior to classification. It underperformed in performance measures due to the C4.5 classifier limitations that prevent overtraining.

     
  17. XVII.

    Linear Discriminant Analysis (LDA): Its performance accuracy decreased as the number of features increased due to its inability to deal with highly variant data. It was slow (>500X slower than the fastest algorithm) since it tries to optimize class distinctions but the variance covariance matrix increases dramatically as the number of features increased.

     
Table 11

Summary of performance and time measures of classification algorithms

 

# Rank 1

# Rank 2

# >90%

Distance

Time

Naïve Bayes

5

1

6

−0.3

2X

MLP

0

0

4

−3.4

7615X

SVM

0

1

3

−3.6

11X

VFI

0

2

4

−5.7

3X

Hyper Pipes

0

0

0

−7.9

1X

R. Forest

1

0

2

−8.8

21X

Bayes Net

0

1

2

−8.8

24X

K-means

0

0

1

−9.9

34X

Logistic R.

0

1

3

−11.8

1072X

SLR

1

1

2

−12.9

340X

KNN

0

0

1

−14.4

1X

K star

0

0

1

−16.5

9X

M5P

0

0

0

−17.0

22X

J48

0

0

0

−20.2

8X

Random Tree

0

0

0

−20.7

1X

ASC

0

0

0

−22.1

3300X

LDA

0

0

0

−24.0

572X

#Rank 1, Rank 2: No. of times algorithm ranked 1st and 2nd on 7 datasets, # > 90%: No. of times algorithm scored overall average score >90% on 7 datasets, Distance: magnitude an algorithm trails behind on average from the Rank 1 for the datasets (5% or less distance are marked in bold). Time: performance slower with respective to fastest algorithm. Time performances slower by 5 folds to fastest algorithm are marked in bold.

Discussion

The comparisons provided in this article provide a glimpse into how existing classification algorithms handle data with intrinsically different properties than traditional microarray expression data. Immunosignaturing provides a means to quantify the dispersion of serum (or saliva) antibodies that result from disease or other immune challenge. Unlike most phage display or other panning experiments, fewer but longer random-sequence peptides are used. Rather than converging to relatively few sequences, the immunosignaturing microarray provides data on the binding affinity of all 10,000 peptides with high precision. Classifiers in the open-source program WEKA were used to determine whether any algorithm stood out as being particularly well suited for these data. The 17 classifiers, which were tested, are readily available and represent some of the most widely used classification methods in biology. However, they also represent classifiers that are diverse at the most fundamental levels. Tree methods, regression, and clustering are inherently different; the grouping methods are quite varied and top-down or bottom-up paradigms address data structures in substantially different ways. Given this, we present and interpret the results from our tests, which we believe will be applicable to any dataset with target-probe interactions similar to immunosignaturing microarrays.

From the comparisons above, Naïve Bayes was the superior analysis method in all aspects. Naïve Bayes assumes a feature independent model, which may account for its superior performance. It relies on the degree of correlation of the attributes in the dataset; for immunosignaturing, the number of attributes can be quite large. In gene expression data, where genes are connected by gene regulatory networks, there is a direct and significant correlation between hub genes and dependent genes. This relationship affects the performance of Naïve Bayes by limiting its efficiency through multiple containers of similarly - connected features [3941]. In peptide-antibody arrays, where the signals that arise from the peptides are multiplexed signals of many antibodies attaching to many peptides, there is no direct correlation between peptides, but there is a general trend. Moreover, there is a competition of antibodies attaching to a single peptide, which makes it difficult for multiple mimotopes to show significant correlation with each other. Thus, the 10,000 random peptides have no direct relationships to each other each contributes partially to defining the disease state. This makes the immunosignaturing technology a better fit for the assumption of strong feature independence employed by the Naïve Bayes technique, and the fact that reproducible data can be had at intensity values down to 1 standard deviation above background enables enormous numbers of informative, precise, and independent features. Presence or absence of a few high- or low-binding peptides on the microarray will not impact the binding affinity for any other peptide, since the kinetics ensures that the antibody pool is not limiting. This is important when building microarrays with >300,000 features per physical assay, as in our newest microarray. More than 90% of the peptides on either microarray demonstrate normal distribution for binding signals. This is important since feature selection methods used in this analysis (t-test and one way ANOVA) and the Naïve Bayes classifier all assume normal distribution of features.

The Naïve Bayes approach requires relatively little training data, which makes it a very good fit for the biomarker field. The sample sizes usually range from N = 20-100 for the training set. Naïve Bayes has other advantages as well: it can train well on a small but high feature data set and still yield good prediction accuracy on a large test set. Any microarray with more than a few thousand probes succumbs to the issue of dimensionality. Since Naïve Bayes independently estimates each distribution instead of calculating a covariance or correlation matrix, it escapes relatively unharmed from problems of dimensionality.

The data used here for evaluating the algorithms were generated using an array with 10,000 different features, almost all of which contribute information. We have arrays with >300,000 peptides per assay (current microarrays are available from http://www.peptidearraycore.com) which should provide for less sharing between peptide and antibody, effectively spreading out antibodies over the peptides with more specificity. This presumably will allow resolving antibody populations with finer detail. This expansion may require a classification method that is robust to noise, irrelevant attributes and redundancy. Naïve Bayes has an outstanding edge in this regard as it is robust to noisy data since such data points are averaged out when estimating conditional probabilities. It can also handle missing values by ignoring them during model building and classification. It is highly robust to irrelevant and redundant attributes because if Yi is irrelevant then P (Class|Yi) becomes uniformly distributed. This is due to that fact that the class conditional probability for Xi has no significant impact on the overall computation of posterior probability. Naïve Bayes will arrive at a correct classification as long as the correct classes are even slightly more predictable than the alternative. Here, class probabilities need not be estimated very well, which corresponds to the practical reality of immunosignaturing: signals are multiplexed due to competition, affinity, and other technological limitation of spotting, background and other biochemical effects that exist between antibody and mimotope.

Time efficiency

As the immunosignaturing technology is increasingly used for large-scale experiments, it will result in an explosion of data. We need an algorithm that is accurate and can process enormous amounts of data with low memory overhead and fast enough for model building and evaluation. One aims for next-generation immunosignaturing microarrays is to monitor the health status of a large population on an on-going basis. The number of selected attributes will no longer be limited in such a scenario. For risk evaluation, complex patterns must be normalized against themselves at regular intervals. This time analysis would require a conditional probabilistic argument along with the capacity of accurately predicting the risk with low computational cost. The slope of Naïve Bayes on time performance scale is extremely small, allowing it to process a large number of attributes.

Conclusion

Immunosignaturing is a novel approach which aims to detect complex patterns of antibodies produced in acute or chronic disease. This complex pattern is obtained using random peptide microarrays where 10,000 random peptides are exposed to antibodies in sera/plasma/saliva. Antibody binding to the peptides is not one-to-one but a more complicated and multiplexed process. The quantity and appearance of this data appears numerically, distributionally, and statistically the same as gene expression microarray data, but is fundamentally quite different. The relationships between attributes and functionality of those attributes are not the same. Hence, traditional classification algorithms used in gene expression data might be suboptimal for analyzing immunosignaturing results. We investigated 17 different kinds of classification algorithm spanning Bayesian, regression, tree based approaches and meta-analysis and compared their leave-one-out cross-validated accuracy values using various numbers of features. We found that the Naïve Bayes classification algorithm outperforms the majority of the classification algorithms in classification accuracy and in time performance, which is not the case for expression microarrays [42]. We also discussed its assumptions, simplicity, and fitness for immunosignaturing data. More than most, these data provide access to the information found in antibodies. Deconvoluting this information was a barrier to using antibodies as biomarkers. Pairing immunosignaturing with Naïve Bayes classification may open up the immune system to a more systematic analysis of disease.

Ethics statement

Consent was obtained for every sample in this manuscript and was approved by ASU IRB according to protocol number 0912004625 entitled "Profiling Human Sera for Unique Antibody Signatures". Humans were consented by the retrieving institution and a materials transfer agreement was signed between the Biodesign Institute and the collaborating institute. The collaborating institutes' protocols were current and each human subject signed an approved consent form and released their sera.

Declarations

Acknowledgements

We are thankful to Dr. AbnerNotkins, NIH, National Institute of Dental and Craniofacial Research for providing type 1 diabetes sample, DrLucusRestrepo for providing Alzheimer’s dataset. Dr Bart Legutki and Dr Rebecca Halperin for providing Antibodies dataset, University of Arizona, department of Pharmacy and Pharmacology (Serine Lau, Donata Vercelli, Marilyn Halonen) for providing Asthma samples, Peptide Microarray Core (DrZbigniewCichacz) for providing Asthma datasets, PradeepKanwar for implementing the time function in JAVA and ValentinDinu for invaluable discussion regarding algorithm selection. This work was supported by an Innovator Award from the DoD Breast Cancer Program to SAJ.

Authors’ Affiliations

(1)
Center for Innovations in Medicine, Biodesign Institute, Arizona State University

References

  1. Haab BB: Methods and applications of antibody microarrays in cancer research. Proteomics 2003, 3: 2116–2122.View ArticlePubMed
  2. Whiteaker JR, Zhao L, Zhang HY, Feng L-C, Piening BD, Anderson L, Paulovich AG: Antibody-based enrichment of peptides on magnetic beads for mass-spectrometry-based quantification of serum biomarkers. Anal Biochem 2007, 362: 44–54.PubMed CentralView ArticlePubMed
  3. Reimer U, Reineke U, Schneider-Mergener J: Peptide arrays: from macro to micro. Curr Opin Biotechnol 2002, 13: 315–320.View ArticlePubMed
  4. Merbl Y, Itzchak R, Vider-Shalit T, Louzoun Y, Quintana FJ, Vadai E, Eisenbach L, Cohen IR: A systems immunology approach to the host-tumor interaction: large-scale patterns of natural autoantibodies distinguish healthy and tumor-bearing mice. PLoS One 2009, 4: e6053.PubMed CentralView ArticlePubMed
  5. Braga-Neto UM, Dougherty ER: Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004, 20: 374–380.View ArticlePubMed
  6. Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER: Optimal number of features as a function of sample size for various classification rules. Bioinformatics 2004, 21: 1509–1515.View ArticlePubMed
  7. Sima C, Attoor S, Brag-Neto U, Lowey J, Suh E, Dougherty ER: Impact of error estimation on feature selection. Pattern Recognit 2005, 38: 2472–2482.View Article
  8. Braga-Neto U, Dougherty E: Bolstered error estimation. Pattern Recognit 2004, 37: 1267–1281.View Article
  9. Cwirla SE, Peters EA, Barrett RW, Dower WJ: Peptides on phage: a vast library of peptides for identifying ligands. ProcNatlAcadSci U S A 1990, 87: 6378–6382.View Article
  10. Nahtman T, Jernberg A, Mahdavifar S, Zerweck J, Schutkowski M, Maeurer M, Reilly M: Validation of peptide epitope microarray experiments and extraction of quality data. J Immunol Methods 2007, 328: 1–13.View ArticlePubMed
  11. Boltz KW, Gonzalez-Moa MJ, Stafford P, Johnston SA, Svarovsky SA: Peptide microarrays for carbohydrate recognition. Analyst 2009, 134: 650–652.View ArticlePubMed
  12. Brown J, Stafford P, Johnston S, Dinu V: Statistical Methods for Analyzing Immunosignatures. BMC Bioinforma 2011, 12: 349.View Article
  13. Halperin RF, Stafford P, Johnston SA: Exploring antibody recognition of sequence space through random-sequence peptide microarrays. Mol Cell Proteomics 2011, 10: 110–000786.View Article
  14. Legutki JB, Magee DM, Stafford P, Johnston SA: A general method for characterization of humoral immunity induced by a vaccine or infection. Vaccine 2010, 28: 4529–4537.View ArticlePubMed
  15. Restrepo L, Stafford P, Magee DM, Johnston SA: Application of immunosignatures to the assessment of Alzheimer's disease. Ann Neurol 2011, 70: 286–295.View ArticlePubMed
  16. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD ExplorNewsl 2009, 11: 10–18.View Article
  17. John GH, Langley P: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. In Estimating Continuous Distributions in Bayesian Classifiers. Morgan Kaufmann, San Mateo; 1995:338–345.
  18. Friedman N, Geiger D, Goldszmidt M: Bayesian Network Classifiers. Mach Learn 1997, 29: 131–163.View Article
  19. Yu J, Chen X: Bayesian neural network approaches to ovarian cancer identification from high-resolution mass spectrometry data. Bioinformatics 2005, 21(Suppl 1):i487-i494.View ArticlePubMed
  20. Friedman J, Hastie T, Tibshirani R: Additive logistic regression: a statistical view of boosting. Ann Stat 2000, 28: 337–407.View Article
  21. Cessie SL, Houwelingen JCV: Ridge Estimators in Logistic Regression. J R Stat SocSer C (Appl Stat) 1992, 41: 191–201.
  22. Landwehr N, Hall M, Frank E: Logistic Model Trees. Mach Learn 2005, 59: 161–205.View Article
  23. Platt J: Fast Training of Support Vector Machines using Sequential Minimal Optimization. MIT Press, Book Fast Training of Support Vector Machines using Sequential Minimal Optimization. City; 1998.
  24. Hastie T, Tibshirani R: Classification by Pairwise Coupling. MIT Press, Book Classification by Pairwise Coupling. City; 1998.
  25. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK: Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Comput 2001, 13: 637–649.View Article
  26. Chaudhuri BB, Bhattacharya U: Efficient training and improved performance of multilayer perceptron in pattern classification. Neurocomputing 2000, 34: 11–27.View Article
  27. Gardner MW, Dorling SR: Artificial neural networks (the multilayer perceptron),Äî a review of applications in the atmospheric sciences. Atmos Environ 1998, 32: 2627–2636.View Article
  28. Aha DW, Kibler D, Albert MK: Instance-based learning algorithms. Mach Learn 1991, 6: 37–66.
  29. Weinberger K, Blitzer J, Saul L: Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 2009, 10: 207–244.
  30. Cleary J, Trigg L: Proceedings of the 12th International Conference on Machine Learning. In K*: An Instance-based Learner Using an Entropic Distance Measure. Morgan Kaufmann, ; 1995:108–114.
  31. Hall MA: Correlation-based Feature Subset Selection for Machine Learning, PhD Thesis, University of Waikato. Hamilton, New Zealand; 1998.
  32. Hartigan JA: Statistical theory in clustering. J Classif 1985, 2: 63–76.View Article
  33. Quinlan JR: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. In Learning with continuous classes. World Scientific, ; 1992:343–348.
  34. Witten IH, Eibe F, Hall MA: Data Mining: Practical Machine Learning Tools and Techniques. Thirdth edition. Morgan Kaufmann, San Francisco; 2011.
  35. Güvenir HA: Voting features based classifier with feature construction and its application to predicting financial distress. Expert SystAppl 2010, 37: 1713–1718.View Article
  36. Salzberg SL: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Mach Learn 1994, 16: 235–240.
  37. Quinlan J: Bagging, Boosting and C4. AAAI/IAAI 1996, 5: 1.
  38. Breiman L: Random Forests. Mach Learn 2001, 45: 5–32.View Article
  39. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Raffeld M, et al.: Gene-Expression Profiles in Hereditary Breast Cancer. New England J Med 2001, 344: 539–548.View Article
  40. Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 2004, 20: 2429–2437.View ArticlePubMed
  41. Liu H, Li J, Wong L: A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Inform 2002, 13: 51–60.PubMed
  42. Stafford P, Brun M: Three methods for optimization of cross-laboratory and cross-platform microarray expression data. Nucleic Acids Res 2007, 35: e72.PubMed CentralView ArticlePubMed

Copyright

© Kukreja et al.; licensee BioMed Central Ltd. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.