 Research article
 Open Access
 Published:
SED, a normalization free method for DNA microarray data analysis
BMC Bioinformatics volume 5, Article number: 121 (2004)
Abstract
Background
Analysis of DNA microarray data usually begins with a normalization step where intensities of different arrays are adjusted to the same scale so that the intensity levels from different arrays can be compared with one other. Both simple total array intensitybased as well as more complex "local intensity level" dependent normalization methods have been developed, some of which are widely used. Much less developed methods for microarray data analysis include those that bypass the normalization step and therefore yield results that are not confounded by potential normalization errors.
Results
Instead of focusing on the raw intensity levels, we developed a new method for microarray data analysis that maps each gene's expression intensity level to a high dimensional space of SEDs (Signs of Expression Difference), the signs of the expression intensity difference between a given gene and every other gene on the array. Since SED are unchanged under any monotonic transformation of intensity levels, the SED based method is normalization free. When tested on a multiclass tumor classification problem, simple Naive Bayes and Nearest Neighbor methods using the SED approach gave results comparable with normalized intensitybased algorithms. Furthermore, a high percentage of classifiers based on a single gene's SED gave good classification results, suggesting that SED does capture essential information from the intensity levels.
Conclusion
The results of testing this new method on multiclass tumor classification problems suggests that the SEDbased, normalizationfree method of microarray data analysis is feasible and promising.
Background
DNA microarray technology is now playing an increasingly important role in biomedical research. Microarray technology gives one the opportunity to measure gene expression levels of thousands to tens of thousands of genes simultaneously, in order to study the differential gene expression pattern between different developmental stages, diseases states and samples treated with drugs or other compounds. Before comparing data from different arrays to address these biological questions, however, a "much more mundane but indispensable " normalization step [1] is currently used in most microarray analyses.
Because of the slight difference in RNA quantities, imaging settings and other variables, even in very controlled experiments the intensity levels from different arrays are of different scales and need to be normalized before they can be compared with each other. Various normalization methods have been developed and some are widely used. The simplest method is total intensity based normalization [2]; this approach scales intensity levels of every gene by a constant factor so that total intensities of all the arrays are the same. "Spikedin" based normalization methods scale intensity based on spikedin standards [3]. Nonlinear normalization methods use local regression to scale intensities to compensate for the intensitydependent differences between arrays [4–6].
For most current applications, these normalization methods seem to be adequate. However, the residual left by a less than perfect normalization procedure is another source of nonbiological variation that is usually nondesirable, especially when the differences in expression levels are expected to be small [7]. In addition, if the goal is metaanalysis of multiple sets of microarray data [8, 9] systematic differences between experiments may result in a normalization artifact. We were therefore interested in developing an approach to analyse microarray data without first performing a normalization step. Our approach was partly inspired by nonparametric statistical methods [10]. For example, nonparametric methods that use ranks [11, 12] to compare microarray results, in addition to being distribution free, have the additional advantage of being normalization free.
DNA microarray technology has been used widely in biomedical studies. One interesting application is in the area of molecular classification; one popular use is in the comparison of tumor samples. Since clinical and histopathological classification is sometimes difficult and laborintensive, the use of genome wide expression patterns to classify tumor samples has recently become a very active research area [13–16]. Although some tumors appear to be amenable to classification using microarray data [17, 18], general multiple tumor classification using microarray data has proved to be an interesting and challenging task for several reasons: the general difficulties inherent in multiclass classification problems, the small number of samples available, and the inherent biological variation between specimens, etc. We decided to use multiclass tumor classification as a test case to illustrate the power of our approach. We compared our results for a multiclass tumor classification problem with more conventional approaches published by Ramaswamy et al. [19] and Yeang CH et al. [20]. These authors compared the accuracies of using kNearest Neighbors (kNN, 60–70%), Weighted Voting (WV, 60–70%) and Support Vector Machine (SVM, 80%) algorithms in a multiclass tumor classification problem and concluded that SVM is a more powerful machine learning algorithm for this application.
Results
Normalization Free approach to microarray data analysis
Generally, measurements on single microarrays give a realvalued intensity level x_{i} (1<= i <= N) for each gene i on the array, where N is the total number of genes on the array. Without first doing some type of normalization, the intensity level of gene i from array A, x_{i}^{A}, cannot be directly compared with the intensity level of gene i from array B, x_{i}^{B}. In this study, we sought an alternative quantity or quantities that can be directly compared between different arrays without compromising important biological information. One obvious candidate is r_{i}, the rank of intensity level of gene i on the array. However, we felt that rank is not an adequate measure because information about relative expression level is not represented explicitly. Instead, we decided to use the following measures.
Let
s_{ij} = 1 if(x_{i}  x_{j} > 0)

0
otherwise (1)
, where 1<= i, j <= N. Basically, s_{ij} is the sign of the intensity difference of gene i and j on a single microarray and therefore will remain unchanged under any monotonic transformation of x. Therefore, instead of computing with the absolute expression level of a gene, its relative level to all the other genes on the microarray is used. For each gene i, instead of one real valued x_{i}, the approach uses s_{i} = (s_{i1}, ..., s_{ij}, ..., s_{iN}), a binary vector of size N. For ease of reference, we will simply refer to this value as the SED (Signs of Expression Difference) of gene i; and the entire matrix (s_{ij}) the SED of the array. Given (x_{i}), s_{ij} is simply and uniquely defined but (s_{ij}) does not uniquely determine x_{i} so some information is lost by only using (s_{ij}) instead of (x_{i}).
Since r_{i} = Σ_{j = 1,...,N} s_{ij} is the rank of gene i in terms of intensity levels, rank information is preserved in (s_{ij}). What is lost in the transformation from (x_{i}) > (s_{ij}) is just the intensity differences between the closest ranked genes, which in most cases are small, considering that microarray data are generally considered "very noisy". It was our major goal to demonstrate that (s_{ij}) has indeed captured important components of the information from (x_{i}).
Instead of directly using the intensity levels x, and its derivatives such as the mean μ, the standard deviation σ and the signal to noise ratio (S2N) between two sample groups A and B, [(μ^{A} _{+} μ^{B})/(σ^{A}+ σ^{B})], we will use (s_{ij}) to compare gene expression differences between arrays. Since we expect measurement variations within an array will be less than those between arrays and we take the signs of relative expression differences to get SED, we expect the SED will be less "noisy". However, the value of any single s_{ij} may still vary between technical and biological replicates. One would expect more s_{ij} would change values randomly if the technical replicates was done on arrays that were fabricated in a different run than arrays from same run, for example. Biological variations are expected to be even more frequent. However, we hypothesize that we can perform statistical analysis on the SED, which contains tens of thousands of s_{ij} for a single gene, and minimize the impact of such noises.
We will also consider (sp_{ij}), a natural generalization of the SED concept. Here, sp_{ij} is the probability of x_{i} > x_{j}. In other words, imagining one can get a large number, n, of either technical or biological replicates of the sample of interest, then sp_{ij} = m/n as n > 8, where m = Σ_{K = 1,...,n} s_{ij}^{K} and s^{K} is for replicate K. We will call (sp_{i}) = (sp_{i1}, ..., sp_{ij}, ..., sp_{iN}) the SED probabilities of gene i. Note that in calculating both SED and SED probabilities, only intensity comparisons within arrays are involved and therefore forego the normalization step.
For example, if gene i is more highly expressed in sample A than B we would expect that more s_{ij}^{A} than s_{ij}^{B} would be 1 instead of 0 and the overall (very loosely defined) sp_{ij}^{A} would be larger than sp_{ij}^{B}. Since rank can be calculated from SED, any rank based method can be expanded to use SED. A gene i's SED can be viewed in two different perspectives. On one hand, it provides information about gene i's expression level relative to every other gene on the array, and therefore can be used to examine gene i's expression patterns between samples. On the other hand, it also provides information about the expression levels of all the other genes on the array, using the gene i as a control, in essence. Therefore, SED can be used to study questions either at the gene level or at the array level. In this paper, we focus on solving a simpler problem at the array level where it is not necessary to decide whether the expression level of an individual gene is increased or decreased between array A and B and by how much. Rather, it is focused on whether the overall expression patterns are different at all between array A and B.
Multiclass classification of tumor samples
To test whether (s_{ij}) and (sp_{ij}) extracts most of the information from (x_{i}), we used these values in a test case of a multi class classification problem described by Ramaswamy et al. and Yeang et al [19, 20]. Two algorithms were used to classify each of the 144 tumor samples into one of 14 tumor classes. One is the Naive Bayes (NB) classifier [21] using SED probabilities. The other is the Nearest Neighbor (NN) classifier using SED. In the NB method, to classify a sample T, we first calculate (sp_{ij}^{C}) for each class C (1 <= C <= 14) using training samples, i.e. the 144 samples with T taken away (for details see methods). Then sample T is classified according to:
score(T, C)_{=} Σ_{i = 1,...N} Σ_{j = 1,...,N} log(p_{ij}), (2)
where p_{ij} = sp_{ij}^{C} if s_{ij}^{T} = 1

1
 sp_{ij}^{C} otherwise
T is simply classified to the class C that has the maximum score.
In the NN method, we compute instead, for each training sample t,
matches(T, t) = Σ_{i = 1,...N} Σ_{j = 1,...,N} δ(s_{ij}^{T}, s_{ij}^{t}), (3)
where δ(x, y) = 1 if x = y.

0
otherwise
Then T is classified to class C of the sample t that has the maximum matches.
If one is to give a statistical interpretation of these scores, one can simply view (sp_{ij}) as defining a multibinomial probability model. In addition, one could consider each s_{ij} as a draw from a binomial distribution with probability p_{ij} = sp_{ij}. Then, the score(T, C) is simply a logarithm of the probability p to get all the s_{ij} exactly the same as s_{ij}^{T} under the probability model C where p_{ij} = sp_{ij}^{C}. (Since each class defines a different probability model, score(T,C) for difference class C, in theory, should not be directly compared. Instead, a P value should be calculated from the probability model for Pr(p<exp(Score(T, C))) and used to evaluate the closeness of sample T to each class C. For simplicity, we are not considering such issues here.)
When the NB algorithm was applied to the 144 samples, the accuracy obtained was about 63%; the NN algorithm performed slightly better and gave an accuracy of 70%.
Feature selection
Depending on the algorithm, a better classification result can sometimes be obtained by using a subset of genes [22, 23]. We were interested to know whether feature selection helps to increase accuracy in our approach. Within our framework, it is easier to treat (i,j) pairs as selection units. We therefore filtered out (i,j) pairs where the variance of sp_{ij} across the 14 tumor classes was less than a predetermined value and left the rest of the algorithm unchanged. We reasoned that the filteredout part of the matrix has less discriminating power across the tumor spectrum and might add noise due to the small sample sizes used. Using the NB algorithm, the best result achieved was 70% with a cutoff σ^{2} = 0.06, while the NN method with σ^{2} = 0.05 gave 77%. Table 1 lists the accuracies achieved using different feature cutoffs.
"Single gene's SED" based classifier
The above procedure utilized the entire SED matrix as a classifier. In other words, all the relations between genes were considered in the classification. To determine whether the inclusion of the whole matrix was actually required to achieve the current accuracy, we investigated the efficacy of using single gene's SED as classifiers.
In these cases, we define a classifier based on gene i and its relative expression to every other gene. Therefore, the score_{i}(T, C) = Σ_{j = 1,...,N} log(p_{ij}), where p_{ij} is as same as mentioned previously. Similarly, matches_{i}(T, t) = Σ_{j = 1,...,N} δ(s_{ij}^{T}, s_{ij}^{t}) defines a classifier based on gene i's SED. Fig. 1 shows a display of the cumulative frequency of single gene SED based classifiers versus accuracy for all the 16063 classifiers. In general, single genebased classifiers performed worse than the whole genomebased classifiers, as expected. Nevertheless, most of the classifiers performed reasonably well, compared with just using single gene expression levels. About 80% of the single gene based classifiers resulted in an accuracy between 40% and 60%, while about half of the classifiers had an accuracy greater than 50%. These results suggest that there is a lot of redundant information in the SEDs and SED probabilities and that our method should be reasonably robust. We then investigated the number of genes that are required to achieve the current accuracy. Fig. 2 shows the combined results for classifiers using only a subset of genes. Our results suggest that a subset of genes (~200) is sufficient for predictions and that the prediction accuracy is stable after 1000 genes.
Different classification accuracy between tumor classes
From the analyses described above, we noticed that there was a significant difference in accuracy between different tumor classes. For 3 classes (LY, LE, CNS) we obtained either 100% or close to 100% accuracy (see Table 2 for detail). Since these happen to correspond to the 3 classes with more than double the number of training samples than the other classes, we tested whether this high accuracy is due to the larger sample sizes by using only 8 training samples for every tumor classes. The results were essentially the same, indicating that sample size is not the issue. On the other hand, there are classes where we obtained very poor results; these often happen to be the same classes where SVM in [19, 20] performed poorly as well (see Table 2).
We were interested in exploring the possible reasons for misclassification. In Fig. 3, a scatterplot of the SED Match scores (without feature selection) between 8 OV samples and 8 CNS samples is displayed. The OV and CNS class were selected since one is "very hard" and the other is "very easy" to classify. Without trying to be statistically correct, the plot does suggest that samples of the OV class are in general "farther away" from each other, compared with those from the CNS class. As this may be one of the reasons that the OV class is harder to classify, algorithms that take this kind of information into account may perform better than the simple ones we have presented here.
Discussion
Although we used the multiclass tumor classification problem as our test case, our major goal was to illustrate the feasibility of the normalization free SED approach, and not in sample classification per se. Therefore, we chose the algorithms NB and NN for their simplicity and not for their performance in solving this specific problem. The performance of a classifier depends, in this case, mainly on the power of its algorithm, and the data representation it used. From a machine learning perspective, one can simply view the intensity > SED transformation as a change in data representation, a mapping from the gene's attribute, intensity x, to some features SED. It was our goal to demonstrate that the new features (SED and SED probability), in addition to being normalization free, still convey the essential information in the original attribute, the intensity x.
Since the data representation is quite different between the intensity x (a real valued quantity) and SED (a binary valued vector of rather large size N), it is difficult to directly compare the two. No obvious yet nontrivial algorithms work with both representations; even if there were such an algorithm it is not clear that it would be the right one to use for comparison as it might well be the case that different data representation works best with different algorithms. Here, we have limited the presentation to some empirical results with SED representation, which are comparable with results using several different algorithms that are based directly on raw intensity [19, 20]. Our classification results are close to those obtained with WV and kNN methods, which are based on directly focusing on intensity levels. Previous results using SVM were significantly better, but we feel the differences are due more to the power of the algorithm [24] than the way information is coded. In fact, slightly more accurate results are obtained with modification of algorithms that directly manipulate intensity levels [25, 26]. We do not imply that the algorithms (NB and NN) we chose are better than other alternatives (and we do not have empirical evidence pointing either way). Instead, we fully expect more sophisticated algorithms would work better with the SED approach as well.
Certainly, SED probability is more information rich than SED. We expect that an SED probability based analysis would perform better than the simple binary valued SED. In this paper, we mainly tested the SED. SED probability is only used for a group of samples, not for single samples. If one limits oneself to use only raw data, then for single arrays one can only get SED. However, if some assumptions about the patterns of gene expression levels can be made, one can certainly get an estimation of SED probability even for a single array. For example, as in some nonlinear normalization algorithms, if one assumes that the variation of expression levels are similar for genes with similar expression levels, then one can estimate SED probability from a probability model. Also, the magnitude of the intensity difference can also be used to help such an estimation. Alternatively, as more and more microarray data become available, one can use other similar samples to get an estimation of a prior SED probability, and then use a Bayesian approach to estimate the sample's SED probability.
The obvious disadvantage of our SED based approach is that for each gene expression level, one is not dealing with a single real number but instead a vector of size N, where N is in the tens of thousands. This could significantly increase both computing time and memory requirement (however, see methods for details) On the other hand, it also has certain advantages: 1) It is free of normalization noise. Since it is generally believed that biological variation is larger than technical variation and normalization noise is just another source of technical variation, the benefit here is only of a limited scope. However, it may be important when the expression level difference one is interested in is small. 2) In addition to being normalization free, SED and SED probability also have the advantage in being distribution free, and therefore could perform better if the intensity levels were nonnormal. 3) SED and SED probability are easier to interpret. SED values can easily be checked against raw intensity levels according to Eq. (1). While SED probability is one step further away from intensity levels, one could still have an intuitive sense of it and make comparisons between different experiments. It would be much harder to have a real grasp of the absolute gene expression level, except that it is "high" or "low" or somewhere "in between"; it is certainly harder to compare between experiments intuitively.
We have only tested the SED approach on datasets that are from the same chip format. Data from different chip formats or complete different technology platforms, of course, would be harder to compare. But they are also challenge for normalization based method. It would certainly be interesting to compare SED and normalization based method under these more challenging conditions.
If this normalizationfree approach (SED) proves to retain the essential biological information in general, its application may be extended to metaanalysis where different datasets could be integrated and intervalidated. The method could also be used when the number of arrays is a limiting factor for experiments. For example, one could take advantage of the massive amount of public array data, obtain prior distribution of SED probabilities from datasets with similar conditions, and analyze new data within a Bayesian framework. If the performance of the nearest neighbor method in general is anywhere close to what we demonstrated in the multi tumor classifications here (as is clear from Eq. (3) and Fig. 3, the nearest neighbor method, without feature selection at least, allows direct sample vs. sample comparison. Note also that the samples in the multi tumor problems are from different biological specimens, therefore, large betweensamplevariation is expected), it might be used as a microarray database query method, i.e., to find similar microarray results in the database that are "similar" to one's own, independent of array annotations.
It might also be worth noting that the SED approach could easily be applied to other kinds of comparative data analysis for samples with very large numbers of "noisy" attributes. The SED approach may also perform better when betweensamplevariation is large, especially if such variation contains some rather uninteresting technical measurement errors that would not affect withinsamplevariations.
Conclusions
We have proposed a new approach to analyze microarray data and tested the method on a set of publicly available datasets. The results were comparable to those obtained with some widely used normalization based algorithms. We hope that we have demonstrated that this normalization free method is feasible and promising. We think the SED based, normalization free approach could be used to complement the more popular normalization based approaches in microarray data analysis.
Methods
Microarray data for multiple tumor samples were downloaded from http://www.broad.mit.edu/cancer/software/genepattern/datasets/. Naive Bayes and Nearest Neighbour Classifiers were implemented in the Java programming language. Ad hoc analysis was done with perl scripts. Graphics were generated using the R computing environment.
Naive Bayes method
Because of the uneven and relatively small sample sizes for each tumor class (mostly 8 but up to 24), extra care was taken in computing sp_{ij}. Assuming a prior probability of 0.5, sp_{ij} was estimated by Bayesian posterior probability (m+1)/(n+2) where n is the total number of samples in the class and m is the total number of samples where x_{i} > x_{j}. For classes that were overrepresented (sample size > 8), the threshold of sp_{ij} was set to [0.125, 0.875], since the NB method is sensitive to the extreme values of sp, and samples can be overpredicated without thresholds.
In addition, several alternatives were tested to demonstrate that our results were reasonably robust and not sensitive to the particular choices we made:

1)
To examine the influence of the sample size, in a separate analysis the sample size of ME, LE, CNS class was artificially reduced to 8, i.e. only the first 8 samples were used to calculate sp_{ij} with no significant change of results observed;

2)
Since sp_{ij} depends on the sample size n for each tumor class, we have applied a "sample replacement" strategy in addition to the usual "takeonesampleout" approach for crossvalidation, i.e. when one sample is taken out as the test sample, another sample from the same class is duplicated to take its place to keep the sample size constant. Essentially the same results were obtained. Results reported are from the sample replacement runs.
In Feature Filtering, the variance of sp_{ij} between all 14 classes was calculated as:
σ^{2} = (Σ_{C = 1,...,14,}sp_{ij}^{C} * sp_{ij}^{C})/14  ((Σ_{C = 1,...,14} sp_{ij}^{C})/14)^{2} (4)
with the test sample taken out, and used as the criterion for feature exclusion.
Nearest neighbor method
Feature filter was done as in the Naive Bayes Method.
Software implementation and availability
The analysis was done on a computer (Pentium M 1.5 GHz) operating under Microsoft XP. Both Naïve Bayes and Nearest Neighbor Classifiers are implemented in Java. Since SED can be easily calculated from the raw intensities only the later are kept in memory and SED are computed from the intensities on an asneeded bases. Memory needed to analysis the 144 samples is less than 64 MB.
The most computationally intensive algorithm that we tried is the Nearest Neighbor method without any feature selections and it takes about 10 sec to calculate SED score for one pair of tumor samples with about 16000 genes.
A Java program named SED (including source code) to perform nearest neighbor analysis of microarray samples is freely available by contacting author at hw14@columbia.edu.
References
Quackenbush J: Microarray data normalization and transformation. Nat Genet 2002, 32(Suppl):496–1. 10.1038/ng1032
Affymetrix: Affymetrix GeneChip Expression Analysis Technical Manual Affymetrix Inc., Santa Clar, CA 2003.
Hill AA, Brown EL, Whitley MZ, TuckerKellogg G, Hunter CP, Slonim DK: Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls. Genome Biol 2001, 2: RESEARCH0055. 10.1186/gb2001212research0055
Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–93. 10.1093/bioinformatics/19.2.185
Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH: Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res 2001, 29: 2549–57. 10.1093/nar/29.12.2549
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 2002, 30: e15. 10.1093/nar/30.4.e15
Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 2003, 34: 267–73. 10.1038/ng1180
Moreau Y, Aerts S, De Moor B, De Strooper B, Dabrowski M: Comparison and metaanalysis of microarray data: from the bench to the computer desk. Trends Genet 2003, 19: 570–7. 10.1016/j.tig.2003.08.006
Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM: Metaanalysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res 2002, 62: 4427–33.
Hollander M, Wolfe DA: Nonparametric Statistical Methods 2 Edition John Wiley & Sons, New York 1999.
Troyanskaya OG, Garber ME, Brown PO, Botstein D, Altman RB: Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics 2002, 18: 1454–61. 10.1093/bioinformatics/18.11.1454
Liu WM, Mei R, Di X, Ryder TB, Hubbell E, Dee S, Webster TA, Harrington CA, Ho MH, Baid J, Smeekens SP: Analysis of high density expression microarrays with signedrank call algorithms. Bioinformatics 2002, 18: 1593–9. 10.1093/bioinformatics/18.12.1593
Cleator S, Ashworth A: Molecular profiling of breast cancer: clinical implications. Br J Cancer 2004, 90: 1120–4. 10.1038/sj.bjc.6601667
Fu LM, FuLiu CS: Multiclass cancer subtype classification based on gene expression signatures with reliability analysis. FEBS Lett 2004, 561: 186–90. 10.1016/S00145793(04)001759
Meyerson M, Franklin WA, Kelley MJ: Molecular classification and molecular genetics of human lung cancers. Semin Oncol 2004, 31(1 Suppl 1):4–19.
Bertucci F, Salas S, Eysteries S, Nasser V, Finetti P, Ginestier C, CharafeJauffret E, Loriod B, Bachelart L, Montfort J, Victorero G, Viret F, Ollendorff V, Fert V, Giovaninni M, Delpero JR, Nguyen C, Viens P, Monges G, Birnbaum D, Houlgatte R: Gene expression profiling of colon cancer by DNA microarrays and correlation with histoclinical parameters. Oncogene 2004, 23: 1377–91. 10.1038/sj.onc.1207262
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling. Nature 2000, 403: 503–11. 10.1038/35000501
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–7. 10.1126/science.286.5439.531
Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A 2001, 98: 15149–54. 10.1073/pnas.211566398
Yeang CH, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin RM, Angelo M, Reich M, Lander E, Mesirov J, Golub T: Molecular classification of multiple tumor types. Bioinformatics 2001, 17(Suppl 1):S316–22.
Mitchell T: Machine Learning. McGrawHill 1997.
Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P: 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 2000, 1: RESEARCH0003. 10.1186/gb200012research0003
LyonsWeiler J, Patel S, Bhattacharya S: A classificationbased machine learning approach for the analysis of genomewide expression data. Genome Res 2003, 13: 503–12. 10.1101/gr.104003
Vapnik V: Statistical learning theory John Wiley & Sons, New York 1998.
Shedden KA, Taylor JM, Giordano TJ, Kuick R, Misek DE, Rennert G, Schwartz DR, Gruber SB, Logsdon C, Simeone D, Kardia SL, Greenson JK, Cho KR, Beer DG, Fearon ER, Hanash S: Accurate molecular classification of human cancers based on gene expression using a simple classifier with a pathological treebased framework. Am J Pathol 2003, 163: 1985–95.
Peng S, Xu Q, Ling XB, Peng X, Du W, Chen L: Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Lett 2003, 555: 358–62. 10.1016/S00145793(03)012754
Acknowledgements
The authors wish to thank Dr. Randall D. Little for his careful reading and extensive editing of the manuscript. This work is supported by grant R44 AG02224201 from NIH National Institute of Aging of the National Institutes of Health, USA
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
H.W. conceived of the SED study and performed implementations. H.H. refined the approach and provided additional statistical insight on SED. Both authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Wang, H., Huang, H. SED, a normalization free method for DNA microarray data analysis. BMC Bioinformatics 5, 121 (2004). https://doi.org/10.1186/147121055121
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121055121
Keywords
 Feature Selection
 Intensity Level
 Chip Format
 Near Neighbor
 Microarray Data Analysis