# Noise-injected neural networks show promise for use on small-sample expression data

Jianping Hua^{1}, James Lowey^{1}, Zixiang Xiong^{2} and Edward R Dougherty^{1, 2}

**7**:274

https://doi.org/10.1186/1471-2105-7-274

© Hua et al; licensee BioMed Central Ltd. 2006

**Received:** 03 January 2006

**Accepted:** 31 May 2006

**Published:** 31 May 2006

## Abstract

### Background

Overfitting the data is a salient issue for classifier design in small-sample settings. This is why selecting a classifier from a constrained family of classifiers, ones that do not possess the potential to too finely partition the feature space, is typically preferable. But overfitting is not merely a consequence of the classifier family; it is highly dependent on the classification rule used to design a classifier from the sample data. Thus, it is possible to consider families that are rather complex but for which there are classification rules that perform well for small samples. Such classification rules can be advantageous because they facilitate satisfactory classification when the class-conditional distributions are not easily separated and the sample is not large. Here we consider neural networks, from the perspectives of classical design based solely on the sample data and from noise-injection-based design.

### Results

This paper provides an extensive simulation-based comparative study of noise-injected neural-network design. It considers a number of different feature-label models across various small sample sizes using varying amounts of noise injection. Besides comparing noise-injected neural-network design to classical neural-network design, the paper compares it to a number of other classification rules. Our particular interest is with the use of microarray data for expression-based classification for diagnosis and prognosis. To that end, we consider noise-injected neural-network design as it relates to a study of survivability of breast cancer patients.

### Conclusion

The conclusion is that in many instances noise-injected neural network design is superior to the other tested methods, and in almost all cases it does not perform substantially worse than the best of the other methods. Since the amount of noise injected is consequential, the effect of differing amounts of injected noise must be considered.

## Background

### Classifier complexity and overfitting

The small-sample problems with microarray-based classification have long been recognized [1]. The potential number of features (variables) upon which a classifier can be based is extremely large, the potential features consisting of all the gene-expression levels measured on a microarray (20,000 or more), and the sample size being the number of microarrays in the study (usually less than 100 and often less than 50). When the number of features is large in comparison to the sample size, classifier design is hampered by the designed classifier tending to overfit the sample data, which means that the designed classifier may provide good discrimination for the sample data but not for the general population from which the sample data have been drawn.

Classifier design involves choosing a classifier from a family of classifiers on the basis of the data by means of some algorithm. In this paper we restrict our attention to the case of two classes. Classification involves a *feature vector* **X** = (*X*_{1}, *X*_{2},..., *X*_{d}) on *d*-dimensional Euclidean space ℜ^{d} composed of random variables (*features*), a binary random variable *Y*, and a function (*classifier*) ψ: ℜ^{d} → {0, 1} to serve as a predictor of *Y*. The values, 0 or 1, of *Y* are treated as class *labels*. The error of ψ is the probability, *P*(ψ(**X**) ≠ *Y*), that the classification is erroneous. Classifier error depends on the probability distribution, *f*_{x, y}(**x**, *y*), called the *feature-label distribution*, of the feature-label pair (**X**, *Y*), in particular, the *class conditional distributions*, *f*_{x|0}(**x**) and *f*_{x|1}(**x**). Since in practice we do not know the class conditional distributions, a classifier is designed from sample data.

A classifier is optimal in a family *G* of classifiers if its error, ε_{G}, is minimal among all classifiers in *G*. Since a designed classifier depends on the particular sample, it is random relative to random sampling. We would like the expected error, ε_{G, n}, of the designed classifier, *n* denoting sample size, to be close to ε_{G}. If *G* and *H* are families of classifiers such that *G* ⊂ *H*, then ε_{H} ≤ ε_{G}; however, for designed classifiers, it may be that ε_{H, n} > ε_{G, n}. That is, the designed classifier may partition the feature space well relative to the sample data but not relative to the full distribution. This phenomenon, called *overfitting*, is widespread in studies with small sample sizes. To mitigate overfitting, one can choose from smaller classifier families whose classifiers partition the feature space more coarsely. Using *G* instead of *H*, where *G* ⊂ *H*, reduces the *design cost*, ε_{G, n} - ε_{G}, relative to ε_{H, n} - ε_{H} at the expense of introducing a *constraint cost*, ε_{G} - ε_{H}. For a fixed sample size *n*, consider a collection of classifier families, *G*_{1} ⊂ *G*_{2} ⊂ *G*_{3} ⊂ .... The increasing size of the families means increasing classifier complexity. While the smaller families greatly reduce design cost, their constraint is excessive, thus creating a situation in which the expected errors of the designed classifiers fall as we utilize increasingly large families but then begin to increase when the design cost grows too much.
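The trade-off can be written out explicitly. Using only the quantities defined above, and as a worked restatement rather than a new result, the expected error of a classifier designed from *G* splits into the optimal error in the larger family *H*, the constraint cost and the design cost:

```latex
\varepsilon_{G,n}
  = \varepsilon_{H}
  + \underbrace{(\varepsilon_{G} - \varepsilon_{H})}_{\text{constraint cost}}
  + \underbrace{(\varepsilon_{G,n} - \varepsilon_{G})}_{\text{design cost}},
\qquad
\varepsilon_{G,n} < \varepsilon_{H,n}
\iff
\varepsilon_{G} - \varepsilon_{H}
  < (\varepsilon_{H,n} - \varepsilon_{H}) - (\varepsilon_{G,n} - \varepsilon_{G}).
```

In words, constraining the family to *G* pays off exactly when the constraint cost is smaller than the saving in design cost.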

While overfitting is often thought of as applying to the complexity of functional structure of a classifier, it also applies to the number of features composing a classifier. A feed-forward neural network classifier with one hidden layer, *d* features and *k* hidden nodes, is an operator on *d*-dimensional Euclidean space ℜ^{d}, as is a linear classifier. Their difference in complexity is that the linear classifier partitions ℜ^{d} into two classes via a hyperplane, whereas the neural net more finely partitions the space, thereby reflecting greater complexity. Another way to increase complexity is to increase the number of features. In this way a linear classifier on *d* + 1 features is more complex than a linear classifier on *d* variables because the former reduces to the latter by setting one of the variables equal to 0. In this vein, if we consider a sequence, *x*_{1}, *x*_{2}, ..., *x*_{d}, ..., of features, we often first observe a decrease in expected error as *d* increases and then subsequently an increase in error for increasing *d*. While this description is idealized and the situation can be more complex, it describes the *peaking phenomenon*.

*n* and the number *d* of features. The surface gives the average error of designed LDA classifiers in terms of *d* and *n* based on two Gaussian class conditional distributions possessing the same covariance matrix. The features are slightly correlated and we see that peaking occurs with very few features for sample sizes 30 and below, but then exceeds 30 features for sample sizes above 90. A much more serious situation for LDA is presented in Fig. 2, where the situation is the same except that the features are highly correlated. With a sample size of 80, large for most microarray studies, the optimal number of features is 3. Even with a sample size of 200, the optimal number of features is only 8.

In our preceding examples, we knew the distributions and were able to order the features so as not to have to consider all possible feature sets; in practice, the features are not ordered, and a good feature set must be found from among all possible feature subsets. This involves the use of a feature-selection algorithm, which is part of the classification rule. Feature selection yields classifier constraint, not a reduction in the dimensionality of the feature space relative to design. For instance, if there are *d* features available for linear discriminant analysis (LDA), when used directly, then the classifier family consists of all hyperplanes in *d*-dimensional space. But, if a feature-selection algorithm reduces the number of variables to *m* < *d* prior to application of LDA, then the classifier family consists of all hyperplanes in *d*-dimensional space confined to *m*-dimensional subspaces. The dimensionality of the classification rule has not been reduced, but the new classification rule (feature selection plus LDA) is constrained.

A standard way of measuring classifier complexity is via the *VC* (*Vapnik-Chervonenkis*) *dimension* of a family of classifiers [3]. The VC dimension is defined in the Methods Section; here we note that it provides a measure of the degree to which a classifier can separate points, the greater the separation ability the higher the VC dimension and the lower the separation ability the lower the VC dimension. High VC-dimension classifiers have a greater ability to discriminate complex class interaction, at the cost of a greater potential to overfit, than do low VC-dimension classifiers. The VC dimension of a linear classifier with *d* features is *d* + 1, whereas the VC dimension of a feed-forward neural network with one hidden layer, *d* features and *k* hidden nodes, exceeds *d*(*k* - 1) [4]. Depending on the number of nodes in the hidden layer, this can greatly exceed the VC dimension of a linear classifier.
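To give a feel for the gap, the two expressions quoted above can be evaluated for the network sizes used later in this study (5-8-2 and 10-15-2, i.e. *k* = 8 and *k* = 15 hidden nodes). This is only arithmetic on the stated formulas; the function names are hypothetical and introduced for illustration.

```python
def linear_vc_dimension(d: int) -> int:
    """VC dimension of a linear classifier on d features, as stated in the text."""
    return d + 1


def nn_vc_lower_value(d: int, k: int) -> int:
    """The value d(k - 1) that, per the text, the VC dimension of a
    one-hidden-layer network with d features and k hidden nodes exceeds."""
    return d * (k - 1)


# Network sizes taken from the Methods section: 5-8-2 and 10-15-2.
for d, k in [(5, 8), (10, 15)]:
    print(
        f"d={d}, k={k}: linear VC = {linear_vc_dimension(d)}, "
        f"network VC > {nn_vc_lower_value(d, k)}"
    )
```

For the two structures this gives 6 versus more than 35, and 11 versus more than 140, both comparable to or larger than the sample sizes typical of microarray studies.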

The potential for overfitting is exhibited by a classical bound on the expected design error. For sample size *n*, the expected design cost of a classifier chosen from a family *G* via the empirical-error rule, which chooses the classifier in *G* that makes the fewest errors on the sample data, can be bounded via the VC dimension of *G*,

$$\varepsilon_{G,n} - \varepsilon_{G} \le c_{0}\sqrt{\frac{V_{G}\log n}{n}}, \qquad (1)$$

where *V*_{G} is the VC dimension of *G* and *c*_{0} is a constant independent of *G* and *n* [4]. To make the bound small the sample size must greatly exceed the VC dimension. While providing a cautionary warning concerning high VC-dimension classifiers, the bound of Eq. 1 is not the end of the story when it comes to classifier design. First of all, the bound applies to all possible distributions, and therefore can be very loose. Second, and directly to the point of the present study, the design error depends on the classification rule, not only the family from which the classifier is chosen.

Given a very large sample, one might pose the rule of thumb that a more complex distribution requires a more complex classifier. The overfitting problem strikes at this rule of thumb when samples are insufficiently large. The rule is not reversed; rather, it breaks down. The VC dimension, or any other measure of classifier complexity, can only provide a loose guideline in the context of a warning against overfitting. The efficacy of a classifier design strategy depends not only on the complexity (VC dimension) of the classifier family, but also on the complexity of the feature-label distribution, or the difficulty of the classification problem. The latter can be rigorously approached by defining a measure of distributional complexity and calculating classifier performance as a function of distributional complexity [5]. In that case, when samples are small, as expected we observe that simple classifiers work best for low-complexity distributions, but we also often observe that, when a complex classifier performs well, the distribution is also of low complexity and a simple classifier could just as well do the job. This is why in microarray studies one should use high-complexity classifiers with caution.

The point of this paper is that one should not give up on complex classifiers, in particular, neural networks. As we will see in the current investigation of neural-network design, using an appropriate design strategy can yield good results in a small-sample setting, even for a high VC-dimension classifier. Specifically, neural networks can give competitive results for small samples, so long as they are properly trained. The aim of the present study is to examine over a range of models the degree to which variously trained neural networks can provide competitive results for the kinds of sample sizes used in many microarray studies, our emphasis being on training via noise injection.

### Noise injection

Sietsma and Dow [6] found that injecting noise into the sample data can lead to neural networks with improved performance, meaning that the designed neural networks can have smaller misclassification errors than those designed without noise injection. Since then there has been substantial research on two aspects of noise injection, theoretical proof and implementation. Holmström and Koistinen [7] showed that noise-injection-based design is asymptotically consistent as the sample size goes to infinity provided that the noise is chosen correctly. Moreover, rather than relying on heuristic choices, they developed rigorous methods to find the distribution of noise to be injected through Gaussian kernel density estimation. By using a Taylor expansion, Matsuoka [8] claimed that injecting noise into the neural network is equivalent to adding a regularization term to the error cost function. Following a similar but more rigorous approach, a neural network based on a regularized error function was introduced by Bishop [9] to avoid the randomness and increasing computation time brought by the injected random noise. The regularization term in the cost function penalizes a fast-changing input-output relationship, and is believed to prevent the neural network from overfitting to the individual sample points. However, An [10] pointed out that the second derivative term of the cost function, which is omitted in Bishop's derivation, is a perturbation that can be either positive or negative, and can be ignored only when the network function fits well or is very smooth. Grandvalet and Cano [11] also found that the error perturbation cannot be omitted, and Grandvalet [12] found a bound for the perturbation as a function of the amount of noise injected. He therefore claimed that noise injection could be beneficial. In the meantime, Reed *et al*. [13] pointed out that noise injection has effects similar to those of weight decay, another common method used in neural network training in order to improve performance.
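The regularization view can be made concrete with a second-order Taylor expansion. As a sketch, assume a scalar network output *y*(**x**), a target *t*, a squared-error cost, and isotropic input noise **ξ** with zero mean and covariance η²**I**; these symbols are illustrative and are not taken from [8-10]. Dropping terms beyond second order in the noise:

```latex
\begin{aligned}
y(\mathbf{x}+\boldsymbol{\xi})
  &\approx y(\mathbf{x})
   + \boldsymbol{\xi}^{\top}\nabla y(\mathbf{x})
   + \tfrac{1}{2}\,\boldsymbol{\xi}^{\top}\nabla^{2} y(\mathbf{x})\,\boldsymbol{\xi},\\[4pt]
E_{\boldsymbol{\xi}}\!\left[\bigl(y(\mathbf{x}+\boldsymbol{\xi}) - t\bigr)^{2}\right]
  &\approx \bigl(y(\mathbf{x}) - t\bigr)^{2}
   + \eta^{2}\,\lVert\nabla y(\mathbf{x})\rVert^{2}
   + \eta^{2}\,\bigl(y(\mathbf{x}) - t\bigr)\,\operatorname{tr}\nabla^{2} y(\mathbf{x}).
\end{aligned}
```

The term η²‖∇*y*‖² is the penalty on a fast-changing input-output relationship. The last term carries the sign of the residual *y* - *t*, so it can be positive or negative, and it is negligible only when the residual or the curvature of the network function is small, which is the situation described by An [10].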

Recently, the study of noise injection has moved to improving the training scheme (classification rule). Proposed methods include applying the noise not to the input data but to the hidden nodes [14]; use of elliptic rather than spherical shaped Gaussian noise, and basing the noise distribution on the data [15]; noise designed according to the local data distribution [16]; and utilizing a regularized cost function in conjunction with noise injection to alleviate the effects caused by inappropriate variance of the injected noise [17].

## Results and discussion

In this study, we have conducted an extensive simulation-based comparative study of noise-injected neural-network (NINN) design by taking advantage of contemporary high-performance computing, a 512-node Beowulf cluster, which was not available when many neural-network training procedures were proposed. We consider a number of different feature-label models across various small sample sizes using varying amounts of noise injection. Besides comparing NINN design to classical standard neural-network (SNN) design, the paper compares it to a number of other classification rules: LDA, QDA, the strong feature classifier (SFC), the Gaussian kernel (GK) classifier, and the 3-nearest-neighbor (3NN) classifier. Since our applications concern microarray data for expression-based classification for diagnosis and prognosis, we also consider NINN design as it relates to a study of survivability of breast cancer patients. The models and patient data are described in the Methods Section, as are the classification rules used in the study.

Corresponding to each experimental case determined by the model, the degree of correlation, and the number of features, there is a graph of the errors of the classifiers for the sample sizes considered. The complete set of these graphs appears in Additional file 1, with some of them also appearing in the paper to support the discussion. In these comparison graphs, the NINN results correspond to the largest amount of noise injection available (see the Methods Section for details). The amount of noise injected is important. For each situation there is a graph showing the errors for different amounts of noise for each sample size considered, the full set again being given in Additional file 1. A large amount of noise may be required to achieve good noise-injected design and this entails a high computational cost, which can be prohibitive when considering a large number of feature sets. In the case of linear classification, avoidance of this computational burden motivated the introduction of SFC, which produces the effect of noise injection but determines the classifier analytically without the computational cost of introducing random noise. Let us proceed to consider the different models.

### Linear model

*n* = 10, LDA and SFC perform equivalently, with QDA in the process of catching up as the sample size increases, as expected. Among the neural nets, NINN performs best and its performance is very close to that of LDA. As is the case throughout the experiments, SNN does poorly. For 10 uncorrelated features, the performance of all the classifiers improves, but now we see the advantage of SFC for very small samples. It outperforms LDA for *n* < 40, after which it stabilizes, whereas LDA continues to improve. Most interestingly relative to the current study, for *n* ≥ 20, NINN outperforms LDA. For both 5 and 10 features, GK is not far behind NINN and 3NN is not competitive with NINN except for very small sample sizes.

For slightly correlated features, the situation is mostly analogous to the uncorrelated case, except that overall performance is worse, as expected. The main difference is the performance decline of SFC relative to LDA and NINN, although SFC still performs better than QDA and SNN. Matters are quite different for 10 features. SFC outperforms LDA and NINN for all sample sizes, SFC and NINN significantly outperforming LDA for smaller sample sizes. The key point is that LDA is only performing slightly better with 10 features than with 5. As seen in previous cases, 3NN performs best for very small samples.

When we go to highly correlated features, there is severe performance degradation. For 5 features, NINN bests LDA for *n* ≤ 20, and their performance is essentially the same for *n* ≥ 30. GK is not far behind. The more interesting observations occur with 10 features. Except for *n* ≤ 20, when 3NN is better, NINN provides the best performance, and is much better than LDA for the smaller sample sizes. NINN is also better than GK, although the latter outperforms LDA for *n* ≤ 70. The problem with LDA is that it has suffered the peaking phenomenon: it performs worse with 10 features than with 5. Early peaking for LDA with highly correlated features has been previously observed [2]. Note that in this highly correlated case, as with the previous cases, 3NN performs well for very small samples, but does not improve much thereafter as the sample size increases.

### Low-curvature nonlinear model

*n* ≤ 60) owing to its lower complexity. From the perspective here, the key point is that NINN performs best among all classifiers for *n* ≤ 60 and second to QDA (and only slightly worse) for *n* ≥ 70. Note that GK and SFC are not far behind NINN. An interesting phenomenon occurs with 10 features: SFC performs best for *n* ≤ 90. Along with this, LDA significantly outperforms QDA. As for NINN, it is close behind SFC, especially for *n* ≥ 40.

For slightly correlated features, there are some large differences in the performance comparisons. In this case, with 5 features SFC performs relatively poorly. QDA overtakes LDA earlier and then does much better as the sample size increases. However, once again NINN does well, having the best performance for 20 ≤ *n* ≤ 60, and only being bested by QDA for *n* ≥ 70. Note also that GK performs close to NINN for all *n*. Similar statements hold for 10 features, an exception being that QDA never overtakes NINN.

With highly correlated features, peaking plays a critical role. It is particularly severe for LDA and QDA, with error rates higher for 10 features than for 5. This is in agreement with the results of a previous study, as shown in Fig. 2 [2]. It even occurs for NINN and GK. Otherwise, there are many similarities to the slightly correlated case, with QDA overtaking LDA and, as in the 5-feature slightly correlated case, NINN outperforming QDA for the smaller sample sizes, this time in both the 5- and 10-feature cases. An interesting point regarding neural networks is that, for 5 features, GK is not far behind NINN, as we have witnessed before.

### High-curvature nonlinear model

*n* ≥ 50, and much better than it for *n* ≥ 60, whereas in the low-curvature model, QDA does not outperform NINN until *n* ≥ 70, and then not very much. For 10 features in the uncorrelated case, NINN again outperforms SNN across all sample sizes, the gap closing at sample size 100. As in the 5-feature case, NINN is beaten by GK. Whereas QDA never surpasses NINN in the low-curvature model, it surpasses NINN in the high-curvature model for *n* ≥ 50. Analogous considerations apply to the different relative performance of NINN compared to GK and QDA in the other five high-curvature models.

Relative to overfitting the sample data, the key difference between the low-curvature and high-curvature nonlinear models is the increased curvature in the high-curvature model. Noise injection has the effect of smoothing the decision boundary and this smoothing has greater benefit when the decision boundary is less curved. As another effect of high curvature, note the strikingly poor performance of 3NN.

### Equal-mean model

In this model, QDA is optimal for the feature-label distribution with the decision boundary being a hypersphere. The main point to be made here is that the comparisons are similar to the high-curvature nonlinear model, the key factor being the high curvature of the decision boundary in the QDA-optimal model. Another noteworthy observation is that the performance of NINN does not always improve monotonically with the amount of noise injected. In certain cases, the classification error first decreases and then increases once the noise injection surpasses a certain amount. This phenomenon is most prominent in the case of 10 highly correlated features.

### XOR model

The results for the XOR model are fairly consistent and clear across all six cases, 5 and 10 features, and uncorrelated, slightly correlated, and highly correlated models. NINN, GK, and 3NN have very close performances and they perform significantly better than SNN. If anything can be said concerning the relationship between NINN, GK, and 3NN, it is that 3NN is slightly, though not significantly, better than NINN and GK for small sample sizes, with the situation reversing for larger sample sizes.

### Bimodal model

### Patient data

When we observe the 5-feature results for the patient data, we see a striking similarity with those for slightly correlated features in the nonlinear model. Ignoring the fact that QDA provides the best performance for larger samples in the nonlinear model, for which it is optimal relative to the feature-label distribution, in both cases NINN performs best across the full range of sample sizes, and even more so with the patient data. In both cases, LDA and GK are similar and trail NINN. 3NN is a little better in the real patient data while SFC is a little worse. Note that while we have compared the patient-data results to those of the slightly correlated nonlinear model, similar correspondences exist between the patient data and the highly correlated nonlinear model, which is only reasonable since, as pointed out previously, there are many similarities between the slightly and highly correlated nonlinear models.

The salient point regarding using 10 features for the patient data is peaking, as it is in the highly correlated nonlinear model. For instance, for *n* = 40, LDA performs worse with 10 features than with 5 features and QDA performs much worse with 10 features than with 5. NN shows no improvement with 10 features compared with 5 features. NINN shows a slight improvement with 10 features, indicating later peaking. Although NINN pretty much flattens out when *n* > 20, it again has the best performance across all sample sizes. The most prominent differences between the real patient data and highly correlated nonlinear model are QDA and GK. Poor performance owing to peaking is particularly evident with GK.

## Conclusion

Although neural networks have high VC dimension and can therefore suffer from overfitting the sample data, their performance is highly dependent on the training procedure (classification rule) employed. This paper has demonstrated that in many instances noise-injected neural network design is superior to classical neural-network design and to the other tested methods, and in almost all cases it does not perform substantially worse than the best of the other methods. This conclusion has importance for the design of classifiers for diagnosis and prognosis based on gene-expression data because sample sizes are often limited and, unless the class conditional distributions are easily discriminated, say by a linear classifier, a higher-complexity classifier must be employed.

## Methods

Simulations have been conducted using both synthetic data and real patient data. For synthetic data, we have considered six different distribution models for data generation.

▪ **Linear model**: The class-conditional distributions are Gaussian, with *S*_{0}~ *N*(**μ**_{0}, **Σ**_{0}) and *S*_{1}~ *N*(**μ**_{1}, **Σ**_{1}), where **μ**_{0} = (0, 0,..., 0), **μ**_{1} = (1, 1,..., 1) and **Σ**_{0} = **Σ**_{1} = **Σ**. The corresponding Bayes classifier is linear and the Bayes decision boundary is a hyperplane and can be found via linear discriminant analysis (LDA).

▪ **Low-curvature nonlinear model**: The class-conditional distributions are Gaussian with unequal covariance matrices: **μ**_{0} = (0, 0,..., 0), **μ**_{1} = (1, 1,..., 1), and 2**Σ**_{0} = **Σ**_{1} = 2**Σ**. The Bayes classifier is nonlinear, the Bayes decision boundary is quadratic and can be found via quadratic discriminant analysis (QDA), and the boundary possesses low curvature in comparison to the high-curvature nonlinear model to be described next.

▪ **High-curvature nonlinear model**: This model and the following equal-mean model have been used in [7] and [18]. The class-conditional distributions are Gaussians: **μ**_{0} = (0, 0,..., 0), **μ**_{1} = (2.32, 0,..., 0) and 4**Σ**_{0} = **Σ**_{1} = 4**Σ**. The Bayes decision boundary is quadratic, it is found via QDA, and it possesses higher curvature than the decision boundary for the low-curvature nonlinear model.

▪ **Equal-mean model**: The class-conditional distributions are Gaussians. Both classes share the same mean vector, **μ**_{0} = **μ**_{1} = (0, 0,..., 0), with 4**Σ**_{0} = **Σ**_{1} = 4**Σ**. The Bayes decision boundary is a hypersphere determined by QDA, the covariance structure is the same as in the high-curvature nonlinear model, the decision boundary has high curvature in comparison to the low-curvature nonlinear model, and this model is more difficult than the high-curvature nonlinear model.

▪ **XOR model**: The class-conditional distributions of both classes are mixtures of two equi-probable Gaussians. The covariance matrices of the classes are identical, i.e., **Σ**_{00} = **Σ**_{01} = **Σ**_{10} = **Σ**_{11} = **Σ**. The mean vectors of class *S*_{0} are **μ**_{00} = (1, -1, 1,...), **μ**_{01} = (-1, 1, -1,...). The mean vectors of class *S*_{1} are **μ**_{10} = (1, 1, 1,...), **μ**_{11} = (-1, -1, -1,...). The Bayes decision boundaries are two perpendicular hyperplanes.

▪ **Bimodal model**: The class-conditional distribution of class *S*_{0} is Gaussian, centered at **μ**_{0} = (0, 0,...,0), and the class-conditional distribution of class *S*_{1} is a mixture of two equi-probable Gaussians, centered at **μ**_{10} = (1, 1,..., 1) and **μ**_{11}= (-1, -1,..., -1). The covariance matrices of the classes are identical, i.e., **Σ**_{0} = **Σ**_{10} = **Σ**_{11} = **Σ**. The Bayes decision boundaries are two parallel hyperplanes.

Throughout the simulation, we assume that the two classes have equal prior probability.

We assume the covariance matrix **Σ** has a spherical structure such that every two features possess the same correlation, namely,

$$\mathbf{\Sigma} = \sigma^{2}\begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}, \qquad (2)$$

where *σ* is the standard deviation of each feature, and *ρ* the correlation between features. If *ρ* = 0, then all features are uncorrelated. As *ρ* increases, the correlation among features increases. We consider three representative covariance matrices by setting *ρ* equal to 0, 0.125 and 0.5, and referring to these as *uncorrelated features*, *slightly correlated features* and *highly correlated features*, respectively. Note that in [7] and [18], no correlation among features is considered when they discuss the nonlinear and equal-mean models. As for the feature size, two feature sizes, 5 and 10, are tested in our simulation. By considering all the possible combinations of the distribution model, feature size and covariance matrix, there are altogether 36 different situations.
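The data-generation scheme can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' simulation code: the function names are hypothetical, *σ* = 1 is an assumed default (its value is not stated here), and each class is described by a list of equi-probable component means so that the same routine covers the Gaussian and the mixture models.

```python
import numpy as np


def equicorrelated_cov(d, rho, sigma=1.0):
    """Covariance with sigma^2 on the diagonal and rho * sigma^2 elsewhere."""
    return sigma**2 * ((1.0 - rho) * np.eye(d) + rho * np.ones((d, d)))


def sample_model(n, d, rho, means0, means1, scale1=1.0, sigma=1.0, rng=None):
    """Draw n labelled points with equal class priors.

    means0 / means1: lists of component means (equi-probable mixture per class).
    Class 0 uses covariance Sigma; class 1 uses scale1 * Sigma.
    """
    rng = np.random.default_rng() if rng is None else rng
    chol = np.linalg.cholesky(equicorrelated_cov(d, rho, sigma))
    y = rng.integers(0, 2, size=n)                  # equal prior probabilities
    noise = rng.standard_normal((n, d)) @ chol.T    # rows ~ N(0, Sigma)
    noise[y == 1] *= np.sqrt(scale1)                # class 1 ~ N(0, scale1 * Sigma)
    x = np.empty((n, d))
    for label, means in ((0, means0), (1, means1)):
        idx = np.flatnonzero(y == label)
        comps = rng.integers(0, len(means), size=idx.size)  # mixture component
        x[idx] = np.asarray(means, dtype=float)[comps] + noise[idx]
    return x, y


d = 5
zeros, ones = np.zeros(d), np.ones(d)
alt = np.array([(-1.0) ** j for j in range(d)])     # (1, -1, 1, ...)
rng = np.random.default_rng(0)

# Linear model with slightly correlated features (rho = 0.125).
x_lin, y_lin = sample_model(50, d, 0.125, [zeros], [ones], rng=rng)
# Low-curvature nonlinear model: Sigma_1 = 2 * Sigma_0.
x_low, y_low = sample_model(50, d, 0.0, [zeros], [ones], scale1=2.0, rng=rng)
# XOR model with highly correlated features (rho = 0.5).
x_xor, y_xor = sample_model(50, d, 0.5, [alt, -alt], [ones, -ones], rng=rng)
```

The high-curvature and equal-mean models follow by passing `scale1=4.0` with class-1 mean (2.32, 0,..., 0) or the zero vector, and the bimodal model by passing `[zeros]` and `[ones, -ones]`.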

The patient data come from a microarray cancer-classification study that analyzes a large number of microarrays prepared with RNA from breast tumor samples of 295 patients [19]. Using a previously established 70-gene prognosis profile [20], a prognosis signature based on gene-expression that correlates well with patient survival data and other existing clinical measures is proposed in [19]. Of the 295 microarrays, 115 belong to the 'good-prognosis' class and the other 180 belong to the 'poor-prognosis' class.

Seven classifiers are considered in the study: standard neural network (SNN) [21], neural network designed with noise injection (NINN) [7], the Gaussian kernel (GK) classifier [4], LDA [21], QDA [21], the strong-feature classifier (SFC) [22], and the 3-nearest-neighbor (3NN) classifier [4]. We include LDA and QDA because with full distributional knowledge they are optimal in the linear and nonlinear models, respectively, with QDA also optimal in the equal-mean model. We include SFC because, by replacing each data point with a spherical Gaussian distribution and then finding the classifier via a Wiener-filter methodology, it provides an analytic form of noise injection that is much more computationally efficient than the addition of random noise. Owing to the kinds of problems they are meant to address, we apply LDA and SFC only in the linear and nonlinear models, and we apply QDA in the linear, nonlinear, and equal-mean models. We include 3NN for comparison purposes because it has been used extensively in expression-based analysis and, as we will see, performs relatively well on very small samples.

We use feed-forward layered networks with one hidden layer. For 5 and 10 features, the network structures are 5-8-2 and 10-15-2, respectively, meaning there are 8 and 15 nodes in the hidden layer, respectively. Since the classifiers are binary, there are two units in the output layer. Normalization of input data and initialization of the parameters inside the network are done according to [21]. Error back-propagation is used to calculate the derivatives of the cost function, which is minimized by using the Levenberg-Marquardt method [23]. The source code of the authors' neural network implementation is available on request.

For the GK classifier, assume there are *n* training sample points, (x_{i}, *y*_{i}), *i* = 1, 2,..., *n*, with sample point x_{i} a *d*-dimensional vector and *y*_{i} its label. Then for a testing point x, the posterior probability of its label *y* being *k* is estimated by

where *I*() is the indicator function, and *h*_{i} is the smoothing factor, which is chosen in the same way as in the noise-injection procedure described below. The GK classifier picks the *k* with the larger posterior probability as the predicted label.
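A rough sketch of a Gaussian-kernel posterior estimate of this kind is given below. It assumes a common normalized form in which training point x_{i} contributes a weight exp(−‖x − x_{i}‖² / (2*h*_{i}²)) to its own class and the class weights are divided by their total; this exact kernel form and normalization are assumptions, and the function names are hypothetical.

```python
import numpy as np


def gk_posterior(x, X, y, h):
    """Estimate P(y = k | x) for k = 0, 1 with per-point Gaussian kernels.

    Assumption: unnormalized spherical Gaussian weights with per-point
    smoothing factors h[i]; the estimate is the share of total weight
    carried by training points of class k.
    """
    sq_dist = np.sum((X - x) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * h ** 2))
    total = w.sum()
    return np.array([w[y == k].sum() / total for k in (0, 1)])


def gk_predict(x, X, y, h):
    """Pick the label with the larger estimated posterior."""
    return int(np.argmax(gk_posterior(x, X, y, h)))


# Tiny illustrative data set: two well-separated pairs of points.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_demo = np.array([0, 0, 1, 1])
h_demo = np.ones(4)
print(gk_predict(np.array([0.0, 0.5]), X_demo, y_demo, h_demo))
```

For a test point very far from all training points the weights can underflow to zero, so a practical version would need to guard the division.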

Given the *n* sample points (x_{1}, *y*_{1}), (x_{2}, *y*_{2}), ..., (x_{n}, *y*_{n}), the amount of noise injected is measured by the ratio between the noise-injected sample size and the original sample size. In our simulation, we ensure that the same number of noise points is generated for each training sample point. For instance, if the noise injection amount is *k*, then our sampling procedure is performed in the following manner:

- 1) Pick x_{i} from the training sample;
- 2) Draw a point z from the standard normal distribution, z ~ *N*(**0**, **1**);
- 3) Generate a noise point by x = x_{i} + *h*_{i}z, where *h*_{i} is the smoothing factor for point x_{i}, and is given by [7]

  $$h_i = \left(8^{d/4-1}\, d^{d/2}\, \Gamma(d/2)\right)^{2/(d+4)} n^{-1/(d+4)} \left(\sigma_0 I(y_i = 0) + \sigma_1 I(y_i = 1)\right), \qquad (4)$$

  where *I*() is the indicator function, and *σ*_{0} and *σ*_{1} are the estimated standard deviations of class 0 and class 1, respectively;
- 4) Repeat steps 2 and 3 *k* times to generate *k* noise points around x_{i};
- 5) Repeat steps 1 through 4 for *i* = 1, 2, ..., *n* to generate *kn* noise points.
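The sampling procedure above can be sketched in NumPy as follows. The smoothing factor follows equation (4) as written; how the per-class standard deviations *σ*_{0} and *σ*_{1} are estimated from *d*-dimensional data is not specified in the text, so the sketch assumes a single scalar per class (the square root of the mean per-feature sample variance). The function names are hypothetical and this is not the authors' code.

```python
import math

import numpy as np


def smoothing_factors(X, y):
    """Per-point smoothing factor h_i following equation (4)."""
    n, d = X.shape
    const = (8 ** (d / 4 - 1) * d ** (d / 2) * math.gamma(d / 2)) ** (2 / (d + 4))
    scale = const * n ** (-1 / (d + 4))
    sigma = np.empty(2)
    for c in (0, 1):
        # Assumption: one scalar spread per class, taken as the root of
        # the mean per-feature sample variance of that class.
        sigma[c] = np.sqrt(X[y == c].var(axis=0, ddof=1).mean())
    return scale * sigma[y]


def inject_noise(X, y, k, rng):
    """Generate k noise points around each of the n training points."""
    n, d = X.shape
    h = smoothing_factors(X, y)
    Z = rng.standard_normal((n, k, d))            # k standard-normal draws per point
    X_noise = X[:, None, :] + h[:, None, None] * Z
    # Only the noise points are returned; the original points are left out.
    return X_noise.reshape(n * k, d), np.repeat(y, k)


rng = np.random.default_rng(0)
X_train = rng.standard_normal((20, 5))
y_train = np.array([0, 1] * 10)
X_noisy, y_noisy = inject_noise(X_train, y_train, k=4, rng=rng)
print(X_noisy.shape, y_noisy.shape)
```

Drawing all *kn* perturbations in one array is only a vectorized shortcut for the loop over *i* and *k*; each noise point inherits the label of the training point it was generated around.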

To test the effects of different amounts of noise injection, for each sample size *n* we allow *k* = 2^{b}, *b* = 0, 1,..., *B*, where *B* is the largest integer such that *kn* = 2^{B}*n* ≤ 5120. We cap the sample size after noise injection at 5120 to limit computation, owing to the slow convergence of neural-network training. Note that the original sample points are not used for the final training of the network, so for noise-injection amount 2^{0} = 1 the training set is simply a perturbed copy of the original data. When comparing with other classifiers, the results with the largest amount of noise, 2^{B}, are used.
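As a worked illustration of this rule, the short sketch below lists the admissible noise-injection amounts for each training sample size from 10 to 100. The helper name `noise_amounts` is hypothetical; the cap of 5120 is the one stated in the text.

```python
MAX_TOTAL = 5120  # cap on the noise-injected sample size, from the text


def noise_amounts(n, max_total=MAX_TOTAL):
    """Return [2**0, ..., 2**B], with B the largest integer such that 2**B * n <= max_total."""
    # 2**B <= max_total / n holds exactly when 2**B <= max_total // n,
    # so B is the position of the top bit of the integer quotient.
    B = (max_total // n).bit_length() - 1
    return [2 ** b for b in range(B + 1)]


for n in range(10, 101, 10):
    ks = noise_amounts(n)
    print(n, ks[-1], ks[-1] * n)
```

For example, *n* = 10 allows amounts up to 512 (5120 noise points), whereas *n* = 100 allows amounts up to 32 (3200 noise points).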

For synthetic data, each classifier is applied independently to each situation. For each situation, the simulation generates *n* training points (*n*/2 points for each class) according to the distribution model, feature size, and covariance matrix of that situation. The trained classifier is applied to 200 independently generated test points from the same distribution. This procedure is repeated 10,000 times for all classifiers and, for NINN, for all possible noise-injection amounts. The training sample size varies from 10 to 100 in steps of 10. The entire simulation is repeated for different training sample sizes, feature sizes, and situations.

For patient data, we apply all seven classifiers and estimate the error by using a hold-out method. For a sample size of *n*, *n* sample points are drawn without replacement from the 295 data points. Out of the 70 genes, *d* are selected based on the *n* training points in the following manner: for each gene, we calculate the difference of the mean expression values between the two classes, normalize this value by the sum of the corresponding standard deviations, and then select the *d* genes with the largest normalized differences [24]. To allow a straightforward comparison between the results for the patient data and the synthetic data, we choose the same feature sizes as in the synthetic-data simulation, i.e., *d* = 5 and 10. The classifier trained on the *n* points is tested on the 295 − *n* points not drawn. This procedure is repeated 5000 times and the error rates are averaged to obtain an estimate of the sample-based classification error. As discussed in [25] for this hold-out procedure applied to a single data set, the observations in the hold-out experiment are not fully independent, so we limit *n* to under 80 to reduce the impact of observation correlation.
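The hold-out procedure with training-set-only gene selection can be sketched as below. Several pieces are assumptions made for illustration: the absolute value of the mean difference is used for ranking, a nearest-mean rule stands in for the seven classifiers of the study, and randomly generated data with the same shape and class sizes (295 × 70, 115 versus 180) replace the patient data. All function names are hypothetical.

```python
import numpy as np


def select_genes(X, y, d):
    """Rank genes by |mean0 - mean1| / (std0 + std1); return the top d indices."""
    X0, X1 = X[y == 0], X[y == 1]
    score = np.abs(X0.mean(axis=0) - X1.mean(axis=0)) / (
        X0.std(axis=0, ddof=1) + X1.std(axis=0, ddof=1))
    return np.argsort(score)[::-1][:d]


def nearest_mean_fit(X, y):
    # Placeholder classifier, not one of the classifiers in the study.
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)


def nearest_mean_predict(model, X):
    m0, m1 = model
    d0 = np.sum((X - m0) ** 2, axis=1)
    d1 = np.sum((X - m1) ** 2, axis=1)
    return (d1 < d0).astype(int)


def holdout_error(X, y, n, d, repeats, rng):
    """Average test error over repeated draws of n training points."""
    errors = []
    for _ in range(repeats):
        perm = rng.permutation(len(y))
        train, test = perm[:n], perm[n:]      # draw n points without replacement
        genes = select_genes(X[train], y[train], d)  # selection sees training data only
        model = nearest_mean_fit(X[train][:, genes], y[train])
        pred = nearest_mean_predict(model, X[test][:, genes])
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))


# Synthetic stand-in data; the first five genes are made informative.
rng = np.random.default_rng(0)
y_all = np.array([0] * 115 + [1] * 180)
X_all = rng.standard_normal((295, 70))
X_all[y_all == 1, :5] += 4.0
err = holdout_error(X_all, y_all, n=40, d=5, repeats=20, rng=rng)
print(err)
```

Selecting the genes inside the loop, from the *n* drawn points only, keeps the held-out points out of the feature-selection step, which is the point of the procedure described above.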

### VC dimension

To define the VC dimension of a classifier family *C*, for each classifier ψ ∈ *C* consider the set {(**x**, *y*): ψ(**x**) ≠ *y*} of all points in the feature-label space ℜ^{d} × {0, 1} for which the value of the classifier at point **x** does not equal the value of the label at **x**. Let $\mathcal{A}_{C}$ be the collection of these sets for all ψ ∈ *C*. If {**z**_{1}, **z**_{2},..., **z**_{m}} is a set of *m* points in ℜ^{d} × {0, 1}, let *N*_{C}(**z**_{1}, **z**_{2},..., **z**_{m}) be the number of distinct subsets of {**z**_{1}, **z**_{2},..., **z**_{m}} created by intersection with sets in $\mathcal{A}_{C}$. The *m*^{th} *shatter coefficient* of *C*, denoted by ξ($\mathcal{A}_{C}$, *m*), is the maximum value of *N*_{C}(**z**_{1}, **z**_{2},..., **z**_{m}) over all point sets {**z**_{1}, **z**_{2},..., **z**_{m}}. If *N*_{C}(**z**_{1}, **z**_{2},..., **z**_{m}) = 2^{m}, then *C* is said to *shatter* {**z**_{1}, **z**_{2},..., **z**_{m}}. Hence ξ($\mathcal{A}_{C}$, *m*) = 2^{m} means there is at least one set of *m* points for which all subsets of the set can be constructed by intersection with sets in $\mathcal{A}_{C}$. The shatter coefficient of *C* measures the extent to which the sets in $\mathcal{A}_{C}$ can separate points for various point-set sizes. The largest integer *m* for which ξ($\mathcal{A}_{C}$, *m*) = 2^{m} is called the *Vapnik-Chervonenkis* (*VC*) *dimension* of *C*. If ξ($\mathcal{A}_{C}$, *m*) = 2^{m} for all *m*, then the VC dimension is ∞. For a comprehensive discussion of the VC dimension and its implications, see [4].
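As a small illustration of these definitions, the sketch below counts the labelings that a very simple family can produce on a given point set. The family is hypothetical and chosen only for the example: one-sided threshold rules on the real line, ψ_{t}(*x*) = 1 if *x* > *t* and 0 otherwise. The sketch counts classifier outputs rather than intersections with error sets; for fixed labels the two counts coincide, because the error pattern is the output pattern with the labels flipped in.

```python
def count_dichotomies(points):
    """Number of distinct labelings of `points` produced by rules x > t."""
    pts = sorted(points)
    # One representative threshold below all points, one between each
    # pair of neighbours, and one above all points covers every case.
    cuts = [pts[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    cuts += [pts[-1] + 1.0]
    return len({tuple(int(x > t) for x in points) for t in cuts})


one_point = count_dichotomies([0.3])         # both labelings of one point
two_points = count_dichotomies([0.3, 1.2])   # the labeling (1, 0) is unreachable
print(one_point, two_points)
```

A single point is shattered (2 = 2^{1} labelings), but two distinct points yield only 3 of the 4 possible labelings, so this family has VC dimension 1.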

## Declarations

### Acknowledgements

This research was supported in part by the National Science Foundation under Grant CCF-0514644.

## References

1. Dougherty ER: **Small sample issues for microarray-based classification.** *Comparative and Functional Genomics* 2001, **2:** 28–34. doi:10.1002/cfg.62
2. Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER: **Optimal number of features as a function of sample size for various classification rules.** *Bioinformatics* 2005, **21:** 1509–1515. doi:10.1093/bioinformatics/bti171
3. Vapnik V, Chervonenkis A: **On the uniform convergence of relative frequencies of events to their probabilities.** *Theor Prob Appl* 1971, **16:** 264–280. doi:10.1137/1116025
4. Devroye L, Györfi L, Lugosi G: *A Probabilistic Theory of Pattern Recognition*. New York, Springer Verlag; 1996.
5. Attoor SN, Dougherty ER: **Classifier performance as a function of distributional complexity.** *Pattern Recognition* 2004, **37:** 1641–1651. doi:10.1016/j.patcog.2003.10.013
6. Sietsma J, Dow RJF: **Creating artificial neural networks that generalize.** *Neural Networks* 1991, **4:** 67–79. doi:10.1016/0893-6080(91)90033-2
7. Holmström L, Koistinen P: **Using additive noise in back-propagation training.** *IEEE Trans Neural Networks* 1992, **3:** 24–38. doi:10.1109/72.105415
8. Matsuoka K: **Noise injection into inputs in back-propagation learning.** *IEEE Trans Syst Man and Cybern* 1992, **22:** 436–440. doi:10.1109/21.155944
9. Bishop CM: **Training with noise is equivalent to Tikhonov regularization.** *Neural Computation* 1995, **7:** 108–116.
10. An G: **The effects of adding noise during backpropagation training on a generalization performance.** *Neural Computation* 1996, **8:** 643–674.
11. Grandvalet Y, Canu S: **Comments on "Noise injection into inputs in back-propagation learning".** *IEEE Trans Syst Man and Cybern* 1995, **25:** 678–681. doi:10.1109/21.370200
12. Grandvalet Y, Canu S, Boucheron S: **Noise injection: theoretical prospects.** *Neural Computation* 1997, **9:** 1093–1108. doi:10.1162/neco.1997.9.5.1093
13. Reed R, Marks RJ II, Oh S: **Similarities of error regularization, sigmoid gain scaling, target smoothing, and training with jitter.** *IEEE Trans Neural Networks* 1995, **6:** 529–538. doi:10.1109/72.377960
14. Hammadi NC, Ito H: **Improving the performance of feedforward neural networks by noise injection into hidden neurons.** *J Intell Robot Syst* 1998, **21:** 103–115. doi:10.1023/A:1007965819848
15. Grandvalet Y: **Anisotropic noise injection for input variables relevance determination.** *IEEE Trans Neural Networks* 2000, **11:** 1201–1212. doi:10.1109/72.883393
16. Skurichina M, Raudys S, Duin RPW: **K-Nearest neighbors directed noise injection in multilayer perceptron training.** *IEEE Trans Neural Networks* 2002, **11:** 504–511. doi:10.1109/72.839019
17. Seghouane A, Moudden Y, Fleury G: **Regularizing the effect of input noise injection in feedforward neural networks training.** *Neural Comput & Applic* 2004, **13:** 248–254. doi:10.1007/s00521-004-0411-6
18. Kohonen T, Barna G, Chrisley R: **Statistical pattern recognition with neural networks: Benchmarking studies.** *San Diego: Proc IEEE Int Conf Neural Networks* 2001, **11:** 61–68.
19. van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: **A gene-expression signature as a predictor of survival in breast cancer.** *New Eng J Med* 2002, **347:** 1999–2009. doi:10.1056/NEJMoa021967
20. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: **Gene expression profiling predicts clinical outcome of breast cancer.** *Nature* 2002, **415:** 530–536. doi:10.1038/415530a
21. Duda R, Hart P, Stork DG: *Pattern Classification*. 2nd edition. New York, Wiley; 2001.
22. Kim S, Dougherty ER, Barrera J, Chen Y, Bittner ML, Trent JM: **Strong feature sets from small samples.** *Journal of Computational Biology* 2002, **9:** 127–146. doi:10.1089/10665270252833226
23. Press WH, Teukolsky SA, Vetterling WT, Flannery BP: *Numerical Recipes in C*. 2nd edition. New York, Cambridge University Press; 2002.
24. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: **Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring.** *Science* 1999, **286:** 531–537. doi:10.1126/science.286.5439.531
25. Braga-Neto U, Dougherty ER: **Is cross-validation valid for small-sample microarray classification?** *Bioinformatics* 2004, **20:** 374–380. doi:10.1093/bioinformatics/btg419

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.