The results presented in this paper show that SVMs offer classification performance advantages over RFs in diagnostic and prognostic classification tasks based on microarray gene expression data. We emphasize that in clinical applications of such models, because patient populations are typically very large, even very modest differences in performance (e.g., on the order of 0.01 AUC/RCI or even less) can result in very substantial differences in total clinical outcomes (e.g., number of life-years saved) [13].

The reasons for the superior classification performance of one universal approximator classifier over another in a domain where the generative functions are unknown are not trivial to decipher [2, 14]. As a starting point, we provide two plausible explanations supported by theory and a simulation experiment (in Additional File 2). We note that prior research has established that linear decision functions capture the underlying distributions in microarray classification tasks very well [15, 16]. In the following two paragraphs we first demonstrate that for such functions SVMs may be less sensitive to the choice of input parameters than RFs, and then explain why SVMs model linear decision functions more naturally than RFs.

The simulation experiment described in Additional File 2 demonstrates a high degree of sensitivity of RFs to the values of the input parameters *mtry* (i.e., number of genes randomly selected at each node) and *ntree* (i.e., number of trees), even in the case of a linear decision function, when modelling of a complicated decision surface is not required. The experiment shows that the choice of RF parameters creates large variation in classifier performance, whereas the choice of the main SVM parameter has only minor effects on the error. In practical analysis of microarrays this means that finding the RF with optimal error for a dataset may involve extensive model selection, which in turn opens up the possibility of overfitting given the small sample sizes of validation datasets.
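The kind of sensitivity scan described above can be sketched as follows. This is a minimal illustration and not the experiment of Additional File 2: the synthetic data generator, the sample and gene counts, and the parameter grids are all assumptions chosen for brevity, and scikit-learn's `max_features` and `n_estimators` stand in for *mtry* and *ntree*, with `C` of a linear-kernel `SVC` taken as the main SVM parameter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def make_linear_data(n_samples=100, n_genes=200, n_informative=10, seed=0):
    """Simulate data whose labels follow a linear rule (assumed toy setup)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_genes))
    w = np.zeros(n_genes)
    w[:n_informative] = rng.standard_normal(n_informative)
    y = (X @ w > 0).astype(int)  # linear generative decision function
    return X, y


def cv_error(model, X, y, folds=5):
    """Mean cross-validated misclassification rate."""
    return 1.0 - cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()


def sensitivity_scan(X, y):
    """Record CV error for each RF (mtry, ntree) pair and each SVM C value."""
    rf_errors = {}
    for mtry in (2, 14, 100):        # assumed grid for mtry
        for ntree in (10, 100):      # assumed grid for ntree
            rf = RandomForestClassifier(
                n_estimators=ntree, max_features=mtry, random_state=0
            )
            rf_errors[(mtry, ntree)] = cv_error(rf, X, y)
    svm_errors = {}
    for C in (0.01, 1.0, 100.0):     # assumed grid for the SVM penalty
        svm_errors[C] = cv_error(SVC(kernel="linear", C=C), X, y)
    return rf_errors, svm_errors


X, y = make_linear_data()
rf_errors, svm_errors = sensitivity_scan(X, y)
print("RF error range: ", min(rf_errors.values()), max(rf_errors.values()))
print("SVM error range:", min(svm_errors.values()), max(svm_errors.values()))
```

Comparing the spread of the two error ranges gives a rough view of parameter sensitivity on one simulated dataset; the outcome depends on the assumed data and grids, so it should not be read as a reproduction of the reported results.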

A second plausible explanation is that the decision trees used as base learners in the RF algorithm cannot exactly represent many linear decision functions with a finite number of splits. Specifically, if the generative linear decision function is not orthogonal to the coordinate axes, then a decision tree of infinite size is required to represent this function without error [17]. The voted decision function in RFs approximates linear functions by rectangular partitioning of the input space, and this "staircase" approximation can capture a linear function exactly only when the number of decision trees can grow without bound (assuming that each tree is of finite size). SVMs, on the other hand, use linear classifiers and thus can model such functions naturally, using a small number of free parameters (i.e., bounded by the available sample size).
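The geometric argument can be illustrated with a small numerical sketch. It does not fit a tree or a forest; it hand-constructs an axis-parallel staircase (an assumption made for illustration) for the oblique boundary x2 = x1 on the unit square and measures the fraction of grid points the staircase misclassifies. The function name `staircase_error` and the grid resolution are hypothetical choices.

```python
def staircase_error(n_steps, grid=400):
    """Fraction of a grid on the unit square misclassified by a staircase.

    True rule: class 1 when x2 > x1 (an oblique linear boundary).
    Staircase rule: x1 is cut into n_steps equal bins, and within each bin
    a single threshold on x2 (the bin midpoint) is applied, which is the
    kind of rule that axis-parallel splits can express.
    """
    wrong = 0
    for i in range(grid):
        x1 = (i + 0.5) / grid
        threshold = (int(x1 * n_steps) + 0.5) / n_steps
        for j in range(grid):
            x2 = (j + 0.5) / grid
            if (x2 > x1) != (x2 > threshold):
                wrong += 1
    return wrong / (grid * grid)


for n_steps in (1, 2, 4, 8, 16):
    print(n_steps, staircase_error(n_steps))
```

For this particular construction the misclassified area is 1/(4k) for k steps, so the error shrinks as steps are added but reaches zero only in the limit. A linear classifier, by contrast, represents the boundary x2 = x1 exactly with a single weight vector.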

We note that regardless of the specific reasons why RFs may have larger error on average in this domain, it is still important to be aware of the empirical performance differences when considering which classifier to use for building molecular signatures. It may take several years before the precise reasons for the differences in empirical error are thoroughly understood, and in the meantime practitioners should take note of the empirical advantages and disadvantages of the methods.

Data analysts should also be aware of a limitation of RFs imposed by their embedded random gene selection. For an RF classification model to avoid the trap of large variance, one has to use a large number of trees and build the trees from a large number of genes. The exact values of these parameters depend on both the complexity of the classification function and the number of genes in the microarray dataset. Therefore, in general, it is advisable to optimize these parameters by nested cross-validation that accounts for the variability of the random forest model (e.g., the selected parameter configuration is the one that performs best on average over multiple validation sample sets).
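A nested cross-validation of this kind might be sketched as follows with scikit-learn. The synthetic dataset, the parameter grid, the fold counts, and the use of AUC as the selection metric are assumptions made for illustration and are not the protocol used in this study. The inner loop selects the configuration with the best mean score over the inner validation folds, and the outer loop estimates the performance of the whole selection procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Assumed toy data standing in for a microarray dataset.
X, y = make_classification(
    n_samples=80, n_features=100, n_informative=10, random_state=0
)

# Assumed grid: n_estimators plays the role of ntree, max_features of mtry.
param_grid = {"n_estimators": [50, 200], "max_features": ["sqrt", 0.2]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop: pick the configuration with the best mean AUC over inner folds.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: each outer training fold reruns the full inner search, so the
# outer test folds are never used for parameter selection.
outer_auc = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print("Outer-fold AUC values:", outer_auc)
print("Mean outer AUC:", outer_auc.mean())
```

To account further for the randomness of the forest itself, the inner search could be repeated over several `random_state` values and the scores averaged before a configuration is selected; this is omitted here to keep the sketch short.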

Finally, it is worth mentioning the work of Segal [18], who questioned Breiman's empirical demonstration of the claim that random forests do not overfit as the number of trees grows [2]. In short, Segal showed that there exist data distributions for which the maximal unpruned trees used in random forests do not perform as well as trees with a smaller number of splits and/or smaller node size. Thus, the application of random forests in general requires careful tuning of the relevant classifier parameters. These observations may suggest future improvements to RF-related analysis protocols.