SignS is a web-based application (and underlying R package) that provides four methods for gene selection with survival data: the method by Gui and Li [2], the approach of Dave et al. [1], a method that uses random forests with conditional inference trees [3, 31], and a boosting method with component-wise univariate Cox models [3, 32]. Few methods explicitly attempt to perform gene selection with survival data while preserving the identity of the individual genes and allowing the recovery of highly correlated genes. Moreover, there are few comparisons among the available methods, except those from [2, 8, 33–37]. In this context, we chose to implement these four very different approaches. The available comparisons indicate that penalization methods, especially those based on the L1 penalty, such as the Lasso and LARS, tend to perform well and return results with relatively few genes, thus enhancing interpretation [33, 35, 36]. The method of Gui and Li can approximate the Lasso or LARS estimates while selecting more relevant genes [2]. On the other hand, relatively heuristic and simple approaches, such as those based on clustering and the idea of signatures, can sometimes perform remarkably well compared to sophisticated penalization approaches [36]. The method of Dave et al. [1] is one such method that explicitly attempts to return signatures for survival data. Finally, ensemble approaches are currently gaining popularity. The recent review in [34] found that random forest-based methods, as in [31], can yield the best survival time predictions and, thus, we have also included this method. An alternative ensemble approach is boosting, as in [3, 32, 38]; it has the advantage of providing explicit variable selection and being computationally efficient, and has been shown to be competitive for at least some microarray data sets [3]. We have parallelized all the algorithms, providing significant decreases in user wall time (see below).
Algorithms
Briefly, the steps of the method by Dave et al. [1] are:

A) Genes are filtered by p-value using a gene-by-gene Cox model.

B) The retained genes are divided into two groups, those with a positive coefficient in the Cox model and those with a negative coefficient, and hierarchical clustering is carried out on each of these two groups separately.

C) A potential signature is a group of genes (a cluster) such that the minimal correlation between any two genes in the signature is above a (user-selected) threshold, and such that the cluster has between a minimum and a maximum number of genes (again, parameters set by the user).

D) The numeric value of a component (a signature in the sense of [1]) is the average expression level of all the genes in that component (i.e., for each sample, we compute the value of a signature as the average value, for that sample, of all the genes in that signature).

E) Finally, we carry out variable selection, using the best two-signature model as the starting point; the variables used are the signature values computed in step D. Variable selection is carried out on the signatures, not on individual genes, and no gene can belong to multiple signatures.

The main advantages of the approach of [1] are that it is easy to understand, that it explicitly attempts to return sets of correlated genes (signatures), and that the user is both forced to be explicit about, and allowed to choose, parameters with a relatively straightforward interpretation (such as the minimum correlation of genes within a signature, or the maximum and minimum number of genes in a signature). Thus, the method of [1] is ideal for exploratory analysis, which is further enhanced by our additions, in particular the assessment of stability of solutions and functional annotation via IDconverter and PaLS (see "Results" section).
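For illustration, the following minimal sketch (our own code, not SignS's implementation) shows the mechanics of steps C and D applied to one of the two gene groups from step B, assuming an expression matrix X with genes in rows and samples in columns; all function and argument names here are illustrative. With complete-linkage clustering on the distance 1 - correlation, cutting the tree at height 1 - min.cor guarantees that every pair of genes within a cluster has correlation of at least min.cor.

```r
## Minimal sketch of steps C) and D); all names (candidate.signatures,
## min.cor, etc.) are illustrative and not part of SignS.
candidate.signatures <- function(X, min.cor = 0.5, min.size = 2, max.size = 50) {
  d <- as.dist(1 - cor(t(X)))          # gene-gene distance = 1 - correlation
  hc <- hclust(d, method = "complete")
  ## with complete linkage, cutting at height 1 - min.cor ensures that all
  ## within-cluster pairwise correlations are >= min.cor
  cl <- cutree(hc, h = 1 - min.cor)
  sigs <- split(rownames(X), cl)
  Filter(function(g) length(g) >= min.size && length(g) <= max.size, sigs)
}

## D) per-sample signature value: average expression of the signature's genes
signature.values <- function(X, signatures)
  sapply(signatures, function(g) colMeans(X[g, , drop = FALSE]))
```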
In contrast, the approach of Gui and Li [2] has two parameters, but they rarely need to be tuned, as they can instead be chosen by cross-validation. The complete method, including the dimensional reduction and the ability to capture correlated genes, follows from the penalization used (penalization is a general approach, with other examples being the Lasso and ridge regression, used to obtain estimates of the regression coefficients in high-dimensional spaces such as these, with few subjects and thousands of features). The solutions of this method depend on two parameters: a threshold parameter τ and an infinitesimal increment Δν. The threshold parameter τ, constrained between 0 and 1, is related to the amount of penalization, and larger values lead to a larger number of coefficients in the Cox model being set to 0 (i.e., to the selection of a smaller number of genes). The infinitesimal increment Δν affects the rate of change of the coefficients at each iteration of the algorithm. Note that, operationally, we can instead choose a sufficiently small Δν and modify the number of iterations. [2] use cross-validation to choose the optimal parameters: a set of τ values is decided in advance, and cross-validation is used to select the Δν (or, alternatively, the number of iterations) that minimizes the cross-validated partial likelihood (CVPL). The CVPL is a function of the difference in the log Cox partial likelihoods for the models with and without the i-th group of subjects (see the sketch below). Once the optimal parameters are decided, it is immediate to obtain the fitted coefficients for the genes and the scores for subjects not in the sample. Thus, if we choose a sufficiently small Δν, no parameters need to be chosen by the user, as these are selected via cross-validation.
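As an illustration of the CVPL computation, the following sketch (our own code, under our own naming, not the code of SignS or [2]) accumulates, over folds, the difference between the log partial likelihood on all the data and on the data without the fold, with the coefficients estimated without that fold; "fit.fun" stands for any routine that returns a full-length coefficient vector (e.g., threshold gradient descent at a given τ and number of iterations).

```r
## Breslow log partial likelihood at a fixed linear predictor eta
## (ignoring the finer handling of tied event times)
log.pl <- function(time, status, eta) {
  sum(sapply(which(status == 1), function(j)
    eta[j] - log(sum(exp(eta[time >= time[j]])))))
}

## Cross-validated partial likelihood; "fit.fun" is a hypothetical fitting
## routine returning one coefficient per column of X
cvpl <- function(X, time, status, fit.fun, folds) {
  sum(sapply(unique(folds), function(i) {
    out <- folds == i
    beta <- fit.fun(X[!out, , drop = FALSE], time[!out], status[!out])
    eta <- as.vector(X %*% beta)
    log.pl(time, status, eta) - log.pl(time[!out], status[!out], eta[!out])
  }))
}
```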
The approach of Hothorn et al. [3, 31] for using random forests employs conditional inference trees as base learners. A conditional inference tree is built by recursively splitting the data into two groups according to the values of the covariates. First, we test the global null hypothesis of independence between any of the variables and the response. If this hypothesis is rejected, we select the variable with the strongest association with the response and implement a binary split on that variable. We continue recursively (in each branch of the tree) until the global null is no longer rejected. The tests are carried out in a conditional inference framework (for details see [31]). A forest is built by fitting a conditional inference tree to each of many (in our case, 500) bootstrap samples of the data. The responses returned by these forests are Kaplan-Meier curves. In contrast to the previous two methods, no variable (gene) selection is performed here although, implicitly, the construction of each tree involves choosing the best variable for a split. Following [34], by default we fit the forests to the best 200 genes, as determined from gene-wise Cox models, but the actual number of genes used is a parameter that can be chosen by the user of the application.
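The following toy example (ours, not part of SignS) shows this machinery on the "lung" data set shipped with the survival package; the key point is that "treeresponse" applied to a forest of conditional inference trees returns one Kaplan-Meier curve per observation.

```r
library(party)
library(survival)

lung2 <- na.omit(lung[, c("time", "status", "age", "ph.ecog", "ph.karno")])

## a single conditional inference tree for censored data
tr <- ctree(Surv(time, status) ~ ., data = lung2)

## a forest of 500 such trees; predictions are Kaplan-Meier curves
fo <- cforest(Surv(time, status) ~ ., data = lung2,
              controls = cforest_unbiased(ntree = 500))
km <- treeresponse(fo, newdata = lung2)
plot(km[[1]])   # predicted survival curve for the first subject
```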
The last method included in SignS was developed by Hothorn and colleagues [3, 38] and uses boosting to analyze survival data when the number of variables is much larger than the number of samples. Boosting is a general iterative optimization method that, at each iteration, fits the predictor variable that most reduces the appropriate loss function and updates the estimates. The algorithm used, L2 Boosting, is equivalent to refitting residuals multiple times [32]. For survival data, [3, 38] use component-wise univariate Cox models. Boosting requires choosing the number of boosting iterations but, as shown in [32], we can use cross-validation to select this parameter. This method performs variable selection, similarly to the Lasso and other L1-penalty methods [32]: the genes selected are those with non-zero coefficients at the optimal number of boosting iterations.
Implementation and parallelization
The method of [1] is available in both the web-based application and the underlying R code. The method of [2], as originally proposed, is only available in the R code because it is too slow for routine use in the web-based application; following a suggestion by J. Gui, the web application provides an implementation where only one threshold is used (see below). The core statistical functionality of both methods is written in R [39]. Where possible, computations have been parallelized using MPI (specifically, the LAM [Local Area Multicomputer] implementation of MPI) via the R packages Rmpi [40], by H. Yu, and papply [41], by D. Currie. For the web-based application, the CGI (Common Gateway Interface), initial data validation, the setting-up and closing of the parallel infrastructure (booting and halting the LAM/MPI universes), and the fault-tolerance and crash-recovery mechanisms are written in Python.
The implementation of the approach in [1] follows closely the description of their method in the supplementary material to their paper. Our main departures from their implementation are: a) we do not split the data into two halves, but instead use cross-validation to assess model selection; b) it is unclear how the authors selected the initial two-signature model, and we choose the one with the largest likelihood, which in this case is identical to using the Akaike Information Criterion (AIC), since all two-signature models have the same number of parameters; c) it seems, from the supplementary material, that the authors used p-values in their forward variable selection, whereas we use AIC, generally a preferred criterion for model selection [42] (i.e., at each step we re-evaluate, using AIC as the criterion, whether any variable removed in a previous step should be incorporated, or any variable previously introduced into the model should be eliminated).
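Step c) corresponds to a bidirectional stepwise search under AIC and can be reproduced, for instance, with "stepAIC" from the MASS package; the sketch below uses hypothetical names (a data frame "sig" with columns time, status and signature values S1 to S5) and is not SignS's internal code.

```r
library(MASS)
library(survival)

## start from the best two-signature model (hypothetical data frame "sig")
start <- coxph(Surv(time, status) ~ S1 + S2, data = sig)

## bidirectional search: at each step, additions and removals are both
## re-evaluated using AIC as the criterion
final <- stepAIC(start,
                 scope = list(lower = ~ 1,
                              upper = ~ S1 + S2 + S3 + S4 + S5),
                 direction = "both")
```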
For this algorithm, we have parallelized computations over cross-validation runs, after experimenting with alternative parallelization schemes. Parallelizing the initial computation of gene-by-gene Cox models leads to decreases in speed because, in our setup, the communication costs are larger than the decrease in computing time; for instance, with a data set of 60000 genes and 160 arrays, the non-parallelized algorithm takes about 49 seconds, in contrast to the 78 seconds of a parallelized algorithm that distributes the computational load evenly over 62 CPUs. The next step, clustering genes with positive and negative coefficients independently, could be split into two parallel tasks within each run, but the final step (stepwise model selection via AIC) is inherently sequential. Thus, we minimize communication costs and simplify further modifications of the algorithm if we parallelize at the cross-validation level. With this scheme, the total user wall time is the sum of the computing time of two runs of the algorithm (a first run with all the data, and another corresponding to one of the cross-validation folds), plus the time invested in communication and data handling. In contrast, the user wall time of the purely sequential algorithm would be the sum of 11 runs of the algorithm (one with all the data, and one for each of the cross-validation runs).

For the method of [2], our code is based on the original code by the authors, with many modifications for speed improvement and parallelization. First, several common operations in the code were implemented using faster (sequential) code (e.g., using "crossprod" instead of the naive X'y, vectorizing loops, rewriting expressions to use fewer steps, etc.; see the example below). Next, the code was parallelized. Taking into account the number of nodes we had available, and the number of nodes that can often be used in off-the-shelf computing clusters, we parallelized the computations that search for the best τ. Following [2], we explore the six possible τ values 0, 0.2, 0.4, 0.6, 0.8, 1.0 and select the one that minimizes the CVPL, using 10-fold cross-validation. Thus, we can parallelize the search for the best τ into 60 independent computations (10 searches at each of the candidate τ values). Notice that parallelizing at this level yields increased speed even if no global/double/full cross-validation is performed. The speed-ups achieved with the code changes are discussed below. Because of speed issues with this method, the web-based application does not explore a range of τ values, but instead uses a single one, chosen by the user. J. Gui pointed out to us that, since their approach can only include genes, not remove them, small τ thresholds can lead to the selection of many genes, some of which can be false positives; he suggested using only one or a few large τ thresholds, and skipping cross-validation over the entire τ range, if time is at a premium (as in the web-based application). By default, therefore, the web-based application uses a τ threshold of 0.9.
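To give a flavor of the sequential speed-ups, the following self-contained snippet (toy sizes, not the SignS code) compares the naive computation of X'y with "crossprod", which performs the same operation in a single BLAS call without materializing the transpose.

```r
X <- matrix(rnorm(160 * 5000), nrow = 160)   # toy: 160 arrays x 5000 genes
y <- rnorm(160)

system.time(for (i in 1:200) a <- t(X) %*% y)        # naive version
system.time(for (i in 1:200) b <- crossprod(X, y))   # faster equivalent
all.equal(a, b)                                      # same result
```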
Adding other algorithms: random forests and boosting
The above two methods were implemented either almost from scratch or after extensive modifications of the original code, including careful tuning of the parallelization. To show how SignS can be extended with other algorithms, I have included the two additional methods in a much simpler way, which also provides an example of how further methods can be incorporated. In both cases, there are R packages that implement each method, so essentially all we need to do is write several wrapper functions that call the necessary pre-existing functions and provide output in a way that can be used by SignS.
First, I have written some convenience functions that make it easier to produce formatted HTML with the results from the fits, and to obtain figures ("print.selected.genes", "print.cv.results", "writeForPaLS", "print.validation.results", "kmplots", "kmplots.validation"). These convenience functions will be used with the two new methods, and could be used directly, or with very minor modifications, if other algorithms are implemented.
Next, we use the pre-existing code. The random forests method is available from the R package "party" [31], which can be downloaded from the R repositories. Likewise, the boosting method is available from the "mboost" R package [43]. For random forests, all we need to do is wrap the existing function ("cforest") together with the other elements needed for SignS to operate. Following [34], we perform a preliminary gene selection step and, after the model is fitted, we obtain predictions for each subject to assess predictive performance (see below). Gene selection is carried out with a straightforward function, very similar to the one used in the method by Dave et al. to select genes. Subject prediction (either for cross-validation or with validation data) is based on the expected survival time, as explained in pp. 109 ff. of [44]; the mean survival time can be computed (although it is a biased estimate) even if survival at the largest observed time is much larger than 0.5. These three parts of the algorithm (gene selection, function "geneSelect"; the random forest itself, the original "cforest" function in package "party"; prediction, function "cf.mean.survtime") are wrapped in a single function ("my.cforest"), sketched below.
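One possible shape of such a wrapper is the following; the actual SignS function signatures are not reproduced here, so all names and arguments other than the party and survival functions are assumptions, and the restricted-mean computation is one simple way of obtaining the expected survival time from a Kaplan-Meier curve.

```r
library(party)
library(survival)

## X: expression matrix, samples in rows, named gene columns (assumed layout)
my.cforest.sketch <- function(X, time, status, ngenes = 200) {
  ## a) gene selection: keep the ngenes genes with the smallest gene-wise
  ##    Cox model p-values (cf. the "geneSelect" step)
  p <- apply(X, 2, function(g)
    summary(coxph(Surv(time, status) ~ g))$sctest["pvalue"])
  keep <- order(p)[seq_len(min(ngenes, ncol(X)))]

  ## b) fit the forest on the selected genes
  dat <- data.frame(time = time, status = status, X[, keep, drop = FALSE])
  forest <- cforest(Surv(time, status) ~ ., data = dat,
                    controls = cforest_unbiased(ntree = 500))

  ## c) expected survival time from each predicted Kaplan-Meier curve
  ##    (cf. "cf.mean.survtime"): the area under the curve up to the
  ##    last observed time (a restricted, biased estimate)
  km <- treeresponse(forest, newdata = dat)
  pred <- sapply(km, function(s)
    sum(diff(c(0, s$time)) * c(1, s$surv[-length(s$surv)])))

  list(forest = forest, genes = colnames(X)[keep], predictions = pred)
}
```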
Next, we write another wrapper that calls this function repeatedly for cross-validation ("my.cforest.cv"). It is at the cross-validation level that the algorithm is parallelized, so the 10 calls to cforest, one per cross-validation fold, are run concurrently. The parallelization is straightforward since we use the "papply" function (from the R package "papply"; see above).
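A sketch of this cross-validation wrapper (again with illustrative names, building on the sketch above): the fold-wise fits are independent, so "papply" can run them concurrently, falling back to a sequential "lapply" when no MPI slaves are available.

```r
library(papply)

my.cforest.cv.sketch <- function(X, time, status, nfolds = 10) {
  folds <- sample(rep(seq_len(nfolds), length.out = nrow(X)))
  ## over MPI, large shared objects can be broadcast once to the slaves
  ## via papply's papply_commondata argument
  papply(as.list(seq_len(nfolds)), function(i)
    my.cforest.sketch(X[folds != i, , drop = FALSE],
                      time[folds != i], status[folds != i]))
}
```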
Similar steps are followed for the boosting method. We define a function ("my.glmboost") that carries out all the needed steps: it calls the "glmboost" function in the mboost package for the initial boosting fit, and then the "cvrisk" function to select the best number of boosting iterations. The selected genes are the ones with non-zero coefficients at that best number of boosting iterations, and subject predictions (glmboost's linear predictors) are obtained using the model with that best number of boosting iterations. As above, we also write another wrapper function ("my.glmboost.cv") that will be used for cross-validating the predictive capacity of the approach and that is parallelized using "papply".
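A sketch of this second wrapper follows (our code; only "glmboost", "cvrisk", "boost_control" and "mstop" are actual mboost functions, and note that cvrisk resamples via the bootstrap by default, with k-fold cross-validation available through its "folds" argument).

```r
library(mboost)
library(survival)

## X: expression matrix, samples in rows (assumed layout)
my.glmboost.sketch <- function(X, time, status, mstop.max = 500) {
  fit <- glmboost(x = X, y = Surv(time, status), family = CoxPH(),
                  control = boost_control(mstop = mstop.max))
  cv <- cvrisk(fit)              # resample to choose the number of iterations
  fit <- fit[mstop(cv)]          # truncate the fit to the optimal mstop
  list(genes = names(coef(fit)), # coef() returns the non-zero coefficients
       predictions = predict(fit))  # linear predictors for each subject
}
```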
All the above code is available in the "SignS2.R" file (from the repositories, under SignS2/R) and is called from the "f1.R" file. The "f1.R" file is the actual R script, whereas "SignS2.R" is the underlying R package (by building a simple package, we not only make R namespace usage and management cleaner, but also ease the task of loading the code on multiple nodes when using parallelization). Finally, for the web-based application to operate, the Python code needs to be modified so that the user can select the newly implemented method and pass the appropriate parameters (for random forests, only the number of genes to pre-select).
Crash recovery and fault tolerance
In distributed applications, partial failure (e.g., failure of one or more of the computing nodes) is unavoidable [45–47]. In our installation, we follow several complementary strategies to provide fault tolerance and crash recovery.
Shared storage space that uses RAID 50 provides protection against hard-disk failure, as well as access to results and data from nodes other than the one where computations started. Redundancy and load-balancing of the web service are achieved with Linux Virtual Server with heartbeat and mon [48]. This setup ensures redundancy if one of the master nodes fails, and monitors the server nodes so that no HTTP requests are sent to non-responding nodes.
In addition to problems in the web service and hard disks, three sources of partial failure can affect an ongoing calculation: network problems (which disrupt the LAM/MPI universe), MPI (or Rmpi) errors, and a crash of one of the nodes running a slave MPI job. All of these are recoverable errors and, therefore, there is no need to stop the complete calculation (forcing the user to relaunch the process) or to halt indefinitely. Instead, SignS provides mechanisms to continue an ongoing calculation after any of the above sources of failure. First, the web-based application periodically examines MPI and Rmpi logs and existing LAM/MPI daemons to determine whether any of the above problems have occurred. If a problem is found, a new LAM/MPI universe is booted. Before booting the new LAM/MPI universe, a script determines which nodes are currently alive and can run MPI processes and, if necessary, modifies the LAM/MPI configuration files accordingly. Once the new LAM/MPI universe is successfully created, a new R process is launched. The R code includes checkpoints so that calculations are not restarted from the beginning but continued from the point where they stopped (see the sketch below).
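The checkpointing idea can be sketched as follows (illustrative code, not the actual SignS implementation): each completed stage saves its result to disk, so a freshly launched R process reloads finished stages instead of recomputing them.

```r
## Illustrative checkpointing helper: run "fun" unless a checkpoint for this
## stage already exists, in which case reload its saved result.
run.stage <- function(name, fun) {
  chk <- paste0("checkpoint_", name, ".RData")
  if (file.exists(chk)) {
    load(chk)                   # restores "result" saved by a previous run
  } else {
    result <- fun()
    save(result, file = chk)    # persist before moving to the next stage
  }
  result
}

## usage (hypothetical stage): finished stages are skipped after a restart
## cox.fits <- run.stage("gene.cox.models", function() fit.all.cox.models())
```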
Note that errors in our R code, since they are not recoverable, are functionally equivalent to completion of the calculations. The application monitors R logs and currently running R processes and, if any errors are detected, the calculation is aborted immediately, a message is returned to the user, and the problem is logged to allow for prompt fixing.
To ensure that the application is working properly, a test is launched every hour from a machine external to the cluster. The test is one from the functional testing suite (see above) and verifies both the user interface and the parallelization infrastructure. If there are any errors, an email is immediately sent to the author and other system administrators.