Skip to main content
  • Research article
  • Open access
  • Published:

Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery

Abstract

Background

Molecular profiling generates abundance measurements for thousands of gene transcripts in biological samples such as normal and tumor tissues (data points). Given such two-class high-dimensional data, many methods have been proposed for classifying data points into one of the two classes. However, finding very small sets of features able to correctly classify the data is problematic as the fundamental mathematical proposition is hard. Existing methods can find "small" feature sets, but give no hint how close this is to the true minimum size. Without fundamental mathematical advances, finding true minimum-size sets will remain elusive, and more importantly for the microarray community there will be no methods for finding them.

Results

We use the brute force approach of exhaustive search through all genes, gene pairs (and for some data sets gene triples). Each unique gene combination is analyzed with a few-parameter linear-hyperplane classification method looking for those combinations that form training error-free classifiers. All 10 published data sets studied are found to contain predictive small feature sets. Four contain thousands of gene pairs and 6 have single genes that perfectly discriminate.

Conclusion

This technique discovered small sets of genes (3 or less) in published data that form accurate classifiers, yet were not reported in the prior publications. This could be a common characteristic of microarray data, thus making looking for them worth the computational cost. Such small gene sets could indicate biomarkers and portend simple medical diagnostic tests. We recommend checking for small gene sets routinely. We find 4 gene pairs and many gene triples in the large hepatocellular carcinoma (HCC, Liver cancer) data set of Chen et al. The key component of these is the "placental gene of unknown function", PLAC8. Our HMM modeling indicates PLAC8 might have a domain like part of lP59's crystal structure (a Non-Covalent Endonuclease lii-Dna Complex). The previously identified HCC biomarker gene, glypican 3 (GPC3), is part of an accurate gene triple involving MT1E and ARHE. We also find small gene sets that distinguish leukemia subtypes in the large pediatric acute lymphoblastic leukemia cancer set of Yeoh et al.

Background

Transcriptional profiling studies can produce data in the form of abundance measurements for genes in samples assigned to one of two classes. A recent exemplar employed cDNA microarrays to assay 6605 clones from normal liver and liver cancer (hepatocellular carcinoma) tissues [1]. Given such two-class high-dimensional data, one analytical task is identifying a "small" subset of features able to discriminate between the classes. Tools that solve this problem would accelerate development of novel and/or improved molecular targets for diagnosis, prognosis, and therapy [2]. For example, enunciating genes able to distinguish liver cancer from normal samples could assist investigations into the etiology and treatment of liver cancer.

Existing classification and feature selection techniques can be employed to ascertain the cardinality of a feature subset yielding a classifier that generalizes well, i.e., one which makes zero (or few) errors in assigning the class of an unseen data point. Frequently, application of these approaches to a data set results in the definition of one discriminatory subset with tens to hundreds of features and requiring similar numbers of free parameters. This work focuses on subsets smaller than those produced by existing algorithms: all subsets of one-, two-, (and sometimes three-) features that can be separated by a linear surface without error. A multiplicity of error-free linear classifiers constructed from few features could facilitate the creation of cost-effective clinical tests and guide further basic research.

Here, an m-feature classifier is defined as a decision surface for m-dimensional data points where the m features are a subset of P a priori features, m P. The potential number of these classifiers is equivalent to choosing m items out P, i.e., . This number increases when different types of decision boundaries are permissible for each value of m. The scope of the problem can be reduced and simplified if only m-feature linear classifiers (m-LCs) are considered. This restriction of neglecting non-linear decision surfaces is reasonable because hyperplanes can be calculated efficiently, and Support Vector Machines with linear kernels are sufficient for classification problems associated with profiling data (see for example [36]). Recent work by Bo [7] and Kim [8] demonstrate the utility of looking for small feature sets. Bo and Jonassen surveyed a number of classifier discovery methods including linear hyperplanes. They showed that accurate two-gene classifiers exist in real world data sets and that they perform well. They only analyzed 2 data sets, did not report computer runtimes nor consider single genes or gene triples in their analysis. Kim et al employed a heuristic, Monte Carlo-based strategy to discover 2- and 3-LCs for a real-world, 3226-dimensional, two-class transcriptional profiling data set [8]. This sophisticated method computes noise tolerant hyperplanes using an analytic spherical model. However, 140 hours on a supercomputer cluster were required to identify at least 11 pairs of genes, each of which separates the data. Thus, although brute force exhaustive search provides a comprehensive and systematic method for finding all small discriminatory feature subsets, the strategy is expensive computationally and largely untenable for m ≥ 3 (the problem size grows combinatorially). The high dimensionality of transcriptional profiles and the logistical issues associated with exhaustive enumeration of all 1-, 2- and 3-LCs have lead to the prevailing assumption that such searches are both too expensive and unlikely to be informative. Thus, while some recent studies have made use of Kim's method [912], most profiling studies neither consider nor report small sized discriminatory feature subsets.

Here, a relatively inexpensive method for calculating maximal margin hyperplanes, LIKNON [6, 13], is utilized to rapidly find error-free m-LCs in ten published transcriptional profiling data sets that assayed samples from liver, human breast, ovary, lung, skin, gastrointestinal tract, bone marrow, brain, and prostate. The number of free parameters in this method is m + 1, hence is very small relative to the size of the data, greatly reducing the problem of over-fitting to the training data (see Random Data section). It seems plausible that the existence of single genes and gene pairs with the ability to form perfect linear classifiers may be a widespread phenomenon. To demonstrate the biological utility of the strategy, the gene pairs and triples discovered in the aforementioned LiverCancer data set were examined and found to yield new and unanticipated scientific insights. Overall, the results indicate the importance of ascertaining, as a matter of routine, the presence (or absence) of small distinguishing feature subsets.

Results

All results are available through the web site [14].

The 10 published cancer data sets examined here are listed in Table 1. They range in size from small (few genes and/or a small class) to large (many genes with large classes), using a variety of microarray technologies. For each data set, all single and pairs of genes were tested (for some data sets gene triples were tested) using the LIKNON technique. All gene sets that formed zero training error classifiers were saved and are available via the web site. Table 2 lists the number of such gene sets. Many data sets have single genes or pairs that form such perfect linear classifiers which is interesting as most original reports did not note their presence. Computer runtimes are given in Table 3. A rough time estimate is 1 second per million pairs per sample. Evaluating all pairs requires checking about 0.5 * n2 pairs (a triangular matrix). So a 2000 gene, 30 sample set would need about seconds.

As expected, small data sets are found to have many thousands of gene pairs while large data sets have few. The gene sets discovered in the two large data sets (LiverCancer and YeohALL) are likely biologically relevant and are discussed later. The thousands of pairs found on small data sets are mostly due to the small sample size of the data, and would likely not maintain their perfect classification upon addition of further patient samples. Generalization performance was estimated with a LOO methodology (see Methods) and is often reasonably good.

Table 1 Published transcriptional profiling data sets reexamined in this study. For each set, the Table gives the abbreviation; dimensionality of the data points investigated after pre-processing of the features assayed in the original study; number, description and abbreviation for samples assigned to a category.
Table 2 Number of 1-, 2- and 3-LCs for the data sets described in Table 1. For the two-class partitionings shown, the Table gives the number of perfect linear classifiers (given P initial features) that can be constructed using one, two and for some sets 3, genes. The 3 gene results report the total number of triples, and in parenthesis the number saved for later analysis.
Table 3 Wallclock run times for 2-LCs (pairs) for some of the data sets listed in Table. The P-LC program is written in C and uses double precision arithmetic. "Total time" (seconds) and "Time estimator" (seconds per 106 evaluations per sample) are execution times for the software on a 1.8 GHz AMD Athlon computer running Linux in an unloaded network configuration. An average time estimate is roughly 1 second per 106 evaluations per sample.

However, even gene sets found in small data sets can be interesting depending on the end use. For use in a medical diagnostic it is desirable that the gene set be highly accurate on large number of patient samples so the test result error rate is low. If used to guide basic research, even a poor error rate could indicate a productive research direction.

The results are easily visualized for single and pairs of genes by a scatter plot. For each sample the expression values for the gene(s) set the x-y position and the point is labeled with the sample class as shown in the example plot Figure 1. The separating plane is drawn between the two clouds of points and one can see the amount of separation between the two classes (which is twice the "margin") a particular set of genes pair yields. A larger margin means the data are more well separated hence is more resilient to noise and more likely biologically relevant.

Figure 1
figure 1

Example plot using two hypothetical genes. Each data point is labeled with the class, and the separating plane is computed to be positioned halfway between the two classes. In this example there is a large separation between the two classes and perfect separation is achieved and no data point is close to the plane.

Discussion

Many of the genes sets found accurately classify the data with large separation between the classes. It is exciting to consider the possibilities for medical diagnostics if some small gene set is found to accurately and reproducibly indicate a disease state. While general machine learning principles suggest that having more features (genes in this case) is desirable in order to make more noise resistant classifiers, this is data dependent and gene pairs with a training error rate of 1/90 as found in the LiverCancer data could be perfectly acceptable. Small gene sets, even if not accurate enough for medical purposes, can indicate fruitful new research directions. Like other classification methods that produce large numbers of genes, considering the corpus of all genes found can provide insight to the underlying biology.

Experimental design and construction of a data set profoundly influences the presence of small sized classifiers. For example, the BRCA sets are labeled according to their known BRCA1/BRCA2 mutation status. If some of the measured genes reflect this mutation status we would apriori expect to find some (possibly many) small feature sets, and not finding any could indicate errors in the class labels (a sample is mislabeled).

In data where the class sizes are small, most of the small gene sets will prove sensitive to noise. Thus they are not likely to perform well as predictors on new samples, or in different experiment setups. Classifiers made from large data sets are more likely to be reproducible and perform well in other situations. Classifiers with large margin and zero LOO error are more likely to indicate real biological effects that would hold true on new patient samples.

The GIST and BreastBRCA experiments are examples of both of the above conditions, and both lead to very large numbers of pairs. Both have small class sizes and have pre-disposed differences between the classes. The GIST experiment compares cancers from different tissue types which means there will be a very strong signal from just the tissue differences rather than just the cancers alone. The BreastBRCA experiment has the pre-disposition of being split along BRCA status lines. In both cases the number of patient samples and class sizes is small and 100,000+ pairs are found. We suspect there are some biologically real pairs hidden in the large background noise due to small sample sizes. Random data tests (see Methods) indicate that 30+ samples with more even class sizes are needed in order to reduce the random chance noise to a very small level (Table 4).

Table 4 Random data tests. The experimentally determined largest number of pairs found from 20 runs on random data. Each class has half the number of samples (number positives = number of negatives). The total number of genes is 2000. The second column is the number of pairs found for this 2000 gene set size, third column is the observed probability a single pair will be a perfect classifier ("observed" / 20002/2).

Gene sets that perform well across independent experimental data sets also likely indicate real biological effects. However, cross-experiment array comparisons are difficult and would be much easier and more broad if experimenters used more common clones and references.

Is the cost of such computation worth it? Certainly it is for single genes and pairs. Modern computers are powerful enough to solve these size problems in a few minutes. Triples needs tens to hundreds of hours on a single computer, which is still tractable. Quadruples are beyond single computer tractability. However, this type of algorithm is trivially parallelized over a standard network of computers leading to linear speed up. Each computer would be instructed to examine a given part of the search space and thereafter be independent from all the rest. A super computer or dedicated computer cluster is not required. It seems possible that the occurrence of such small sized classifiers is a common characteristic of microarray data, thus making the effort of searching for small gene sets worth the computational cost. The technique is not restricted to RNA/DNA transcript microarray data as used here. It can be applied to protein microarray, mass-spectra, or any data with similar characteristics.

Data set discussion

We can't discuss all the gene sets found in all data sets: there are too many. Here we discuss results from the 2 large data sets that we think produce highly biologically relevant results. The supplement [see Additional file 1] contains further discussion and the web site [14] provides access to all the results and plots.

Liver cancer

The large liver cancer data set contains 2 classes (tumor and normal) with 181 patient samples and measurements for 6605 genes. We examined this data set in more detail than the others as it is large and any results found are likely to be biologically relevant. The original data [1] was re-normalized using the Intensity/local then Spatial/local methods as implemented in the BioConductor R package [15] (this is the best performing method as outlined in, Wei Wu, unpublished 2004). With the original data example labelings (105 tumor, 76 normal), there are no pairs found. However, normal LIKNON [6] discovers a 23 gene classifier with 2 of the tumor examples strongly mis-classified (patient samples 108 and 109 in the raw data table from [1]). This suggests these 2 samples might have some sort of problem with them (contamination with too much normal tissue, a very different type of cancer, different tumor stage, etc) or are simply mislabeled. Relabeling these 2 tumor samples to be normal results in 4 gene pairs being found. Going further we wished to look for gene triples, yet using all 6605 genes would lead to excessive runtimes, so we applied a variation filter to reduce the number of genes. This variation filter (requiring a variation of at least 3.6 in log2 values for each gene) reduced the number of genes examined from 6605 down to 1956, which yielded 43 gene triples (using the original labelings) and 9496 gene triples (with the 2 "outlier" genes relabeled). Only 291 of these triples were saved for further analysis. All 4 gene pairs, 35 of the 43 and 229 of the 291 triples include the recently annotated gene PLAC8 (IMAGE:491644) a "placenta specific gene of unknown function".

The 4 genes in pairs with PLAC8 are

  • IMAGE:669379 GLCCI1 glucocorticoid induced transcript 1

  • IMAGE:590591 ADCY6 adenylate cyclase 6

  • IMAGE:260259 Transcribed sequence with moderate similarity to protein sp:P39188 (H.sapiens)

  • IMAGE:756490 BCAT2 branched chain aminotransferase 2, mitochondrial

Figure 2 shows the plot of the pair PLAC8 and BCAT2. There is good separation between the tumor and normal samples, except for the 2 "outlier" examples. Most of the triples found when the "outlier" examples are relabeled have larger margins than those found with the original data labels. The top triple with the original labeling is

Figure 2
figure 2

Liver cancer pair PLAC8 verses BCAT2. The two misclassified samples 108 and 109 are shown as squares. There are 3 other genes that form such pairs with PLAC8.

  • IMAGE:1472735 MT1E metallothionein 1E (functional) Hs74170

  • IMAGE:784593 ARHE ras homolog gene family, member E Hs6838

  • IMAGE:878564 GPC3 glypican 3 Hs119651

and is shown in Figure 3. GPC3 is a recently noted HCC cancer marker [16], where it was elevated in 6 out of 7 patient samples. Here this gene triple makes no errors on all 181 patient samples. The top triple using the relabeled examples is

Figure 3
figure 3

Liver cancer 3D plot of MT1E, ARHE and GPC3. These 3 genes form a perfect classifier although the margin is small. Red are cancer samples. The web site contains an interactive plot.

  • IMAGE:78353 RNAHP RNA helicase-related protein Hs8765

  • IMAGE:491644 PLAC8

  • IMAGE:667883 PHLDA1 pleckstrin homology-like domain, family A, member 1 Hs82101

In all, the 291 triples make use of 168 of the genes. The fact that triples exist when using the original labelings argues that the "outlier" examples are correctly labeled as tumor, albeit maybe a different type of tumor (or in a different development stage). That these small gene sets exist in such a large data set argues that they are biologically relevant. These 4 gene pairs only make 2 errors out of the 181 patient samples (the two "outlier" samples are the errors), which is an error rate of 2/181 = l/90. The triples found with the original labelings make no errors. Based on such data, one can imagine a simple few-gene diagnostic test based on these pairs and triples.

It is perhaps not surprising that a placental gene is associated with liver cancer. Both are blood organs, and cancers often recapitulate early development stages, of which the fast growing placenta might be an example. In addition, mitochondria related genes have been associated with cancer progression.

For this data, normal LIKNON was a useful aid in identifying outlying data examples. The two outlying tumor samples in this data set could represent a rare tumor type or development state. Using such aids during the experiments would allow such samples to be identified in a timely manner for further investigation.

Modeling of PLAC8

PLAC8 is noted to have a match to the PFAM model pfam04749.5 DUF614. A search of the PDB database using a SAM HMM model [17] created from the PFAM alignment finds a hit to 1P59 (gi|34811270|pdb|1P59|A) which is a Non-Covalent Endonuclease Iii-Dna Complex from bacillus stearothermophilus. The PFAM model locates to the C terminus of the 1P59 crystal structure. The sequence of 1P59 is some 85 amino acids longer than PLAC8, and the alignment hit aligns the last 110 amino acids or so. Visualizing the 3D structure of 1P59 with the alignment hit in PLAC8 colored silver in the RASMOL tool (Figure 4), shows that this hit forms a distinct mostly helical domain at the C terminus.

Figure 4
figure 4

1P59 crystal structure. Shown with the alignment hit to the liver cancer possible biomarker PLAC8 highlighted in strands at the top. Alignment generated from PFAM model pfam 04749.5DUF 614 using the SAM HMM system and displayed in RASMOL.

YeohALL

The YeohALL data set is a large multi-class pediatric acute lymphoblastic leukemia cancer set. The original data contains 7 classes, we use only 6 of them (we do not use their "other" class). For each of the 6 classes we compare each class against all others combined, thus asking the question can each class be distinguished from all the others. There are 248 patient samples and the 6 classes are T, E2A, BCR, TEL, MLL and Hyperdiploid > 50. Both the T vs the rest and E2A vs the rest splittings have single genes that perfectly separate T or E2A from the others. Five out of the 6 splittings have gene pairs, only Hyperdiploid vs the rest does not have any small gene sets. We applied a variation filter to reduce the original 12625 Affymetrix probes down to 4196 genes (the filter level used requires a variation of at least 10000 within each gene). We only discuss the top results for the T and E2A splittings here, see the supplement [see Additional file 1] for more details. Our results reinforce many of the findings in Yeoh's work that there are single genes that accurately classify the T and E2A leukemia subtypes and extends it by identifying accurate 2 gene classifiers.

T vs the rest

The only single gene separating T vs the rest is the same one identified by Yeoh, "38319_at CD3D antigen, delta polypeptide (TiT3 complex) Hs95327". This gene has a high value in T and a low value otherwise (Figure 5) and clearly separates the data. There are 1169 pairs making use of 681 different genes, the top pair is "37039_at HLA-DRA major histocompatibility complex, class II, DR alpha Hs409805", and "1105_s_at M12886 HUMTCBYY Human T-cell receptor active beta-chain mRNA" (Figure 6). In T, HUMTCBYY is generally high and HLA-DRA is low. These and many other of the genes in the top pairs are identified by Yeoh as being significantly differentially expressed in T. However they did not identify any gene pairs as accurate classifiers.

Figure 5
figure 5

From the YeohALL data, T vs the rest, the best single gene CD3D. This gene perfectly separates the classes. Plus signs are T subtype samples.

Figure 6
figure 6

From the YeohALL data, T vs the rest, the best pair HLA-DRA and HUMTCBYY. Each gene alone provides some classification power, but when linearly combined form a perfect classifier, albeit with a small margin. Plus signs are the T subtype samples.

E2A vs the rest

The 4 single probe sets for E2A are

  • "33355_at PBX1 pre-B-cell leukemia transcription factor 1 Hs408222"

  • "1287_at ADPRT ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase) Hs177766"

  • "430_at NP nucleoside phosphorylase Hs75514"

  • "32063_at PBX1 pre-B-cell leukemia transcription factor 1 Hs408222"

(note that there are two probes for PBX1 both giving the same result, so there are really only 3 genes) All of these are high in E2A and lower in the rest. Probe 33355_at for PBX1 is shown in Figure 7 where it clearly separates the classes and was the only single gene identified in Yeoh's original work. The other 3 probes barely separate the data and likely failed Yeoh's stringent cross-validation criteria. These 4 probes are in Yeoh's significantly differentially expressed list.

Figure 7
figure 7

From the YeohALL data, E2A vs the rest, the best single gene PBX1. This gene perfectly separates the classes with a wide margin and has higher values in E2A. Plus signs are E2A subtype samples.

There are 386 pairs, the best pair is "35125_at RPS6 ribosomal protein S6 Hs408073" and "35974_at LRMP lymphoid-restricted membrane protein Hs124922". Figure 8 shows that this pair separates the data well. LRMP is identified in Yeoh's lists (but not RPS6), as are many of the genes in the other top pairs. These and the other top pairs often highly separate the data indicating they might be biologically relevant and resilient to noise. These and all the rest of the results are available through the web site.

Figure 8
figure 8

From the YeohALL data, E2A vs the rest, the best gene pair RPS6 and LRMP. Plus signs are the E2A subtype samples. LRMP by itself is a reasonable indicator of E2A status, but when combined with RPS6 can perfectly separate the data.

Conclusion

Small sets of genes (single genes, pairs of genes, and triples of genes) able to accurately classify two-class microarray data occur in many real-world data sets. These small sets could portend simple medical diagnostics and point to important research targets. The many small sized gene sets found here were not noted previously in the literature, and seemingly went unnoticed. Many members of the pairs discovered here have known associations with cancer and indicate the possibility of simple, accurate medical diagnostic tests based on such results.

Exhaustively examining all pairs in thousand gene size datasets is easily tractable on modern computer hardware. All triples is harder, needing a few days, but this is only computer time, and powerful computers are cheap and plentiful. Given that the compute time is small enough and the results possibly important, we conclude that examining microarray data for single genes, gene pairs and maybe gene triples should be done routinely. When performed along with acquiring the biological data, the results can be used as a quality check on the experimental process.

The gene of unknown function, PLAC8 appears to have a role in Liver cancer, and based on our HMM modeling might have a domain similar to part of the crystal structure of 1P59. We find that there are 4 genes that when paired with PLAC8 form a classifier with the low error rate of 1/90. In addition there are many gene triples, often including PLAC8, that form zero error classifiers and might be good biomarkers. We find the previously identified HCC biomarker gene, glypican 3 (GPC3), is part of an accurate gene triple involving MT1E and ARHE. We also find small gene sets able to accurately distinguish leukemia subtypes in the large pediatric acute lymphoblastic leukemia cancer set of Yeoh et al.

Methods

Transcriptional profiling data sets

The existing transcriptional profiling data sets investigated here are summarized in Table 1. We generally chose to minimally process the data. cDNA microarray data were log transformed; Affymetrix data were used as is. Only the LiverCancer data set was subjected to advanced re-normalization procedures. The one data set with a few missing values (LungStanford) had them set to the appropriate class average value. If there were too many genes (operationally defined as more than about 5000), a variation filter was used to remove genes that didn't vary enough across all samples. The filterings (if used) were applied once, before LIKNON analysis, solely to reduce the number of genes and hence the runtimes. They were not used to adjust for "good" results. When three or more categories of samples had been defined by the original authors, two-class data sets were produced by partitioning these categories. Table 1 provides statistics for the final data matrices used as input to LIKNON for determining m-LCs.

m-feature linear classifier (m-LC)

Consider a two-class data set composed of N data points, . Each data point is a P-dimensional vector of features, x n P, assigned to one of two classes, y n {+1, -1}. Assume that the classes are linearly separable. A classifier for such data is a hyperplane parameterized by a weight vector, w p, and an offset from the origin, b . A hyperplane, (w, b), can be used to predict the class of a data point x Pby computing sign(wTx + b). If this value is positive, x is identified with the +1 class, otherwise it belongs to the -1 class. Data points that define the hyperplane, positive half-space (+1 class), and negative half-space (-1 class) are the sets {x|wTx = b}, {x|wTx >b}, and {x|wTx <b} respectively. For two-class profiling data, an error free m-LC is a maximal margin hyperplane based on m of these P features, which assigns the class of every data point correctly. The number of free parameters in such models is m + 1, so in this setting is quite low (2–4), which is much smaller than the number of samples (for the smallest data sets is 14, on the largest data this is 248). By optimizing of the choice of m this might be over fitting the data, but in the Random Data section we show that we are finding many more features than chance alone would account for.

Given data points specified by P features, the potential number of m-LCs is equivalent to the combinatorial problem of choosing m items out of P, i.e., . In this work all single genes (m = 1) and pairs (m = 2) and in some cases triples (m = 3) were evaluated using a linear sparse hyperplane method that has been described elsewhere (LIKNON) [6]. LIKNON determines a maximal margin hyperplane (for example in Figure 1) that separates the data classes. Maximal margin means that the hyperplane is positioned halfway between the two classes. Gene sets that linearly separate the data are recorded, otherwise they are rejected. If a single gene is a classifier, it is not used during the pair checks as it would always form a classifier with any other gene.

Acceptance, generalization and error performance

This work accepts only perfect (no training error) linear hyperplane classifiers from the above method. This criteria for accepting only perfect classification is very stringent. We first thought that this would lead to few result sets being found, and that allowing non-perfect classification would need be done to find more classifiers. But this loosening turned out not to be necessary.

These models have only 2–4 free parameters. Thus the problem of over-fitting the model to the data during training is not a large issue and we don't perform any stringent generalization tests such as multi-round cross validation. Others interested in evaluating particular gene sets should perform such tests.

Generalization performance of classifiers was evaluated using a Leave-One-Out procedure. In LOO testing, one data point is removed and a classifier re-learned using the rest of the data. The resultant classifier is tested on the held out data point and if it is in error, a LOO error is counted. However for these m-LC's where m is small, LOO error is not a good performance measure. In general, only somewhat isolated points lying close to the classifier decision boundary will be found in error during LOO testing. Thus LOO error rates tend to be very low, 0 – 3 out of the total number of points. The small magnitude of this error count allows one to be misled thinking the classifier is "good because the LOO error is low". The LOO error is included in the results available from the web site.

In the end, classifiers with a larger separation (margin) between the classes are able to tolerate more noise without errors. Thus larger margin classifiers are more desirable for use on future data points.

Implementation

The program is an adaptation of the lp_solve version of LIKNON [6] implemented in C. A modern desktop workstation (1.8 Ghz Athlon) is able to evaluate many thousands of pairs per second using this LIKNON method. Table 3 lists run times for some of the data sets. Evaluating all pairs requires about n2/2 (equivalent to filling in a triangular matrix), and all triples about n3/6. A data set with P = 3000 features necessitates the evaluation of 4.5 × 106 pairs and 4.5 × 109 triples. An approximate average time is 1 second/106 pairs/sample. Thus, evaluating all pairs for these data sizes requires only a few minutes whereas all triples needs many hours.

Random data

How many such pairs occur by chance? Theory suggests [18] that the probability of a 2 class data set of N items with a in one class, being linearly separable in 1 dimension is and in 2 dimensions is approximately . Experiments were performed where a data matrix containing random numbers from a uniform distribution was analyzed by LIKNON. The results are shown in Table 4. When the number of samples is lower than about 30, or one class is very small relative to the other class, then the chance of finding a pair of random genes that form a perfect classifier is large enough to easily measure. The smallest real data set examined in this work (the GIST set) has 1987 genes in 19 samples with 13 in one class and 6 in the other. Experiments with 10 random data sets of this size shows that 1500 pairs would be expected on average with a maximum found of 2706, and no single genes, Table 5. Also in Table 5 are results of experiments where the real data and class labels are used, but the class labels are randomly shuffled. The label shuffling results are worse than random data only for the smallest two data sets (GIST and BreastBRCA BRCA1 vs BRCA2). The real GIST data has more than 137000 pairs, some 50 times more than found in random data and 30 times more than when labels are randomly shuffled. It also has 74 single genes, where the random data yields none. The three result sets that are closest to these random results are Cutaneous (596 vs 62), BRCA Breast BRCA1 & BRCA2 verses Sporadic splitting (2114 vs 1286) and BRCA Breast BRCA1 verses BRCA2 (143574 vs 53900).

Table 5 Random data simulations of real data sets. This table compares the results found from the real data (Real column) to two different types of random data. The Random column contains the experimentally determined largest number of pairs found from 10 simulation runs using a random data matrix (drawn from a uniform distribution) where the number of genes and class sizes is the same as the indicated for the real data. The Label Shuffled column contains the experimentally determined largest number of pairs found from 30 simulation runs where the class labels were randomly shuffled. In the samples column, the number in parenthesis is the number of positive samples. The numbers after the slash are the number of single genes found. Label shuffling leads to more pairs found "by chance" only for the smaller data sets. The small data sets have large numbers of pairs expected "by chance".

Web site

The web site [14] contains the data and results for this work.

References

  1. Chen X, Cheung S, So S, Fan S, Barry C, Higgins J, Lai K, Ji J, Dudoit S, Ng I, Van De Rijn M, Botstein D, Brown P: Gene expression patterns in human liver cancers. Mol Biol Cell 2002, 13(6):1929–1939. 10.1091/mbc.02-02-0023.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  2. Liotta L, Ferrari M, Petricoin E: Clinical proteomics: Written in blood. Nature 2003, 425: 905. 10.1038/425905a

    Article  CAS  PubMed  Google Scholar 

  3. Brown M, Grundy W, Lin D, Cristianini N, Sugnet C, Furey T, Ares M Jr, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natil Acad Sci U S A 2000, 97: 262–267. 10.1073/pnas.97.1.262

    Article  CAS  Google Scholar 

  4. Moler E, Chow M, Mian I: Analysis of molecular profile data using generative and discriminative methods. Physiological Genomics 2000, 4: 109–126.

    CAS  PubMed  Google Scholar 

  5. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci 2001, 98: 15149–15154. 10.1073/pnas.211566398

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  6. Grate L, Bhattacharyya C, Jordan M, Mian I: Simultaneous relevant feature identification and classification in high-dimensional spaces. In Workshop on Algorithms in Bioinformatics (WABI 2002). Edited by: Guigó R, D G. Springer; 2002:1–9.

    Chapter  Google Scholar 

  7. Bo T, Jonassen I: New feature subset selection procedures for classification of expression profiles. Genome Biol 2002, 3(4):research0017.1–0017.11. 10.1186/gb-2002-3-4-research0017

    Article  Google Scholar 

  8. Kim S, Dougherty E, Barrera J, Chen Y, Bittner M, Trent J: Strong Feature sets from small samples. Journal of Computational Biology 2002, 9: 127–146. 10.1089/10665270252833226

    Article  CAS  PubMed  Google Scholar 

  9. Bomprezzi R, Ringner M, Kim S, Bittner M, Khan J, Chen Y, Elkahloun A, Yu A, Bielekova B, Meltzer P, Martin R, McFarland H, Trent J: Gene expression profile in multiple sclerosis patients and healthy controls: identifying pathways relevant to disease. Hum Mol Genet 2003, 12(17):2191–2199. 10.1093/hmg/ddg221

    Article  CAS  PubMed  Google Scholar 

  10. Kobayashi T, Yamaguchi M, Kim S, Morikawa J, Ogawa S, Ueno S, Suh E, Dougherty E, Shmulevich I, Shiku H, Zhang W: Microarray reveals differences in both tumors and vascular specific gene expression in de novo CD5+ and CD5- diffuse large B-cell lymphomas. Cancer Res 2003, 63(1):60–66.

    CAS  PubMed  Google Scholar 

  11. Morikawa J, Li H, Kim S, Nishi K, Ueno S, Suh E, Dougherty E, Shmulevich I, Shiku H, Zhang W, Kobayashi T: Identification of signature genes by microarray for acute myeloid leukemia without maturation and acute promyelocytic leukemia with t(15;17)(q22;q12)(PML/RARalpha). Int J Oncol 2003, 23(3):617–625.

    CAS  PubMed  Google Scholar 

  12. Kim S, Dougherty E, Shmulevich L, Hess K, Hamilton S, Trent J, Fuller G, Zhang W: Identification of combination gene sets for glioma classification. Mol Cancer Ther 2002, 1(13):1229–1236.

    CAS  PubMed  Google Scholar 

  13. Bhattacharyya C, Grate L, Jordan M, Ghaoui L, Mian I: Robust sparse hyperplane classifiers: application to uncertain molecular profiling data. Journal of Computational Biology 2004, 11(6):1073–1089. 10.1089/cmb.2004.11.1073

    Article  CAS  PubMed  Google Scholar 

  14. Web site for this paper[http://www.soe.ucsc.edu/~leslie/pairs_web/]

  15. BioConductor R package[http://www.bioconductor.org]

  16. Sung Y, Hwang S, Park M, Farooq M, Han I, Bae H, Kim J, Kim M: Glypican-3 is overexpressed in human hepatocellular carcinoma. Cancer Science 2003, 94: 259–262.

    Article  CAS  PubMed  Google Scholar 

  17. Hughey R, Krogh A: Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS 1996, 12: 95–107.

    CAS  PubMed  Google Scholar 

  18. Cover T: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers 1965, EC-14(3):326–334. [Reprinted in Artificial Neural Networks: Concepts and Theory, IEEE Computer Society Press, 1992].

    Article  Google Scholar 

  19. Gruvberger S, Ringnér M, Chen Y, Panavally S, Saal L, Borg A, Fernö M, Peterson C, Meltzer P: Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Research 2001, 61: 5979–5984.

    CAS  PubMed  Google Scholar 

  20. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Raffeld M, Yakhini Z, Ben-Dor A, Dougherty E, Kononen J, Bubendorf L, Fehrle W, Pittaluga S, Gruvberger S, Loman N, Johannsson O, Olsson H, Wilfond B, Sauter G, Kallioniemi OP, Borg A, Trent J: Gene-Expression profiles in hereditary breast cancer. New England Journal of Medicine 2001, 344: 539–548. 10.1056/NEJM200102223440801

    Article  CAS  PubMed  Google Scholar 

  21. Jazaeri A, Yee C, Sotiriou C, Brantley K, Boyd J, Liu E: Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers. Journal of the National Cancer Institute 2002, 94: 990–1000.

    Article  CAS  PubMed  Google Scholar 

  22. Garber M, Troyanskaya O, Schluens K, Petersen S, Thaesler Z, Pacyana-Gengelbach M, van de Rijn M, Rosen G, Perou C, Whyte R, Altman R, Brown P, Botstein D, Petersen I: Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci 2001, 98: 13784–13789. 10.1073/pnas.241500798

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  23. Beer D, Kardia S, Huang C, Giordano A, Levin TJ, Misek D, Lin L, Chen G, Gharib T, Thomas D, Lizyness M, Kuick R, Hayasaka S, Taylor J, Iannettoni M, Orringer M, Hanash S: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine 2002, 8: 816–824.

    CAS  PubMed  Google Scholar 

  24. Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, Ben-Dor A, Sampas N, Dougherty E, Wang E, Marincola F, Gooden C, Lueders J, Glatfelter A, Pollock P, Carpten J, Gillanders E, Leja D, Dietrich K, Beaudry C, Berens M, Alberts D, Sondak V: Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 2000, 406: 536–540. 10.1038/35020115

    Article  CAS  PubMed  Google Scholar 

  25. Allander S, Nupponen N, Ringner M, Hostetter G, Maher G, Goldberger N, Chen Y, J C, Elkahloun A, Meltzer P: Gastrointestinal Stromal Tumors with KIT mutations exhibit a remarkably homogeneous gene expression profile. Cancer Research 2001, 61: 8624–8628.

    CAS  PubMed  Google Scholar 

  26. Yeoh E, Ross M, Shurtleff S, Williams W, Patel D, Mahfouz R, Behm F, Raimondi S, Relling M, Patel A, Cheng C, Campana D, Wilkins D, Zhou X, Li J, Liu H, Pui C, Evans W, Naeve C, Wong L, Downing J: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 2002, 1: 133–143. 10.1016/S1535-6108(02)00032-6

    Article  CAS  PubMed  Google Scholar 

  27. Welsh J, Sapinoso L, Su A, Kern S, Wang-Rodriguez J, Moskaluk C, Frierson J Jr, Hampton G: Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Research 2001, 61: 5974–5978.

    CAS  PubMed  Google Scholar 

Download references

Acknowledgements

The author acknowledges I Saira Mian for her support of this work. This work was supported by the National Institute on Aging, National Institute of Environmental Health Sciences, U.S. Department of Energy and California Breast Cancer Research Program.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Leslie R Grate.

Additional information

Authors' contributions

LG carried out all work outlined in this article. LG wrote and approved the manuscript.

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

Reprints and permissions

About this article

Cite this article

Grate, L.R. Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery. BMC Bioinformatics 6, 97 (2005). https://doi.org/10.1186/1471-2105-6-97

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-6-97

Keywords