Random forest methodology for model-based recursive partitioning: the mobForest package for R
© Garge et al.; licensee BioMed Central Ltd. 2013
Received: 11 December 2012
Accepted: 27 March 2013
Published: 11 April 2013
Recursive partitioning is a non-parametric modeling technique, widely used in regression and classification problems. Model-based recursive partitioning is used to identify groups of observations with similar values of parameters of the model of interest. The mob() function in the party package in R implements the model-based recursive partitioning method. This method produces predictions based on single tree models. Predictions obtained through single tree models are very sensitive to small changes to the learning sample. We extend the model-based recursive partitioning method to produce predictions based on multiple tree models constructed on random samples achieved either through bootstrapping (random sampling with replacement) or subsampling (random sampling without replacement) on learning data.
Here we present an R package called “mobForest” that implements bagging and random forests methodology for model-based recursive partitioning. The mobForest package constructs a large number of model-based trees, and the predictions are aggregated across these trees, resulting in more stable predictions. The package also includes functions for computing predictive accuracy estimates and plots, residuals plot, and variable importance plot.
The mobForest package implements a random forest type approach for model-based recursive partitioning. The R package along with its source code is available at http://CRAN.R-project.org/package=mobForest.
Keywords: Random forests; Model-based recursive partitioning; Ensemble; R
Recursive partitioning is a non-parametric modeling technique, widely used in regression and classification problems. Recursive partitioning methods like Random Forests™ are able to deal with a large number of predictor variables even in the presence of complex interactions. “Classification and regression trees” (CART) is one of the most commonly used recursive partitioning methods; it can select, from among a large number of variables, those that are most important in explaining the outcome variable. The basic idea of the CART algorithm is to sequentially split the data to identify groups of observations with similar values of the response variable. During each step, a number of bivariate association models are run using every suspected predictor variable, and the one that has the strongest association with the response variable is selected. Then the data is split into two or more subgroups based on the optimal cutpoint in the selected predictor. Thus the selected predictor becomes a partitioning variable. For a binary predictor the split is unambiguous, but for a continuous one the best split is used and the strength of association is usually adjusted for multiple choices (Strobl et al. 2009). The subgroups formed by such a split are often called nodes or “leaves”. The partitioning of the data continues until a stopping condition is met, such as a) nodes contain observations of only one class, b) no predictor variable shows strong association within a given node, c) the number of observations within a node is less than the specified minimum threshold.
Model-based recursive partitioning identifies groups of observations with similar model trends (between another predictor variable and the response variable). This is different from partitioning that identifies groups of observations that show similar values of the response variable. For example, a linear regression could be used to model the efficacy of treatments considered in a study. However, the treatment effects as well as the intercept parameter of this model may be different for different subgroups of patients. In this example, the model of interest relates treatment and clinical response, but the model parameters can be different for different subgroups. “Model-based recursive partitioning” partitions the feature space to identify subgroups of patients with similar treatment effects and predicts clinical response based on the estimated treatment effects within different subgroups. The mob() function implemented in the “party” package in R allows one to perform model-based recursive partitioning. This function takes the model of interest and partition variables (covariates specifying the feature space that are used as splitting variables in a model-based tree) as input arguments and returns a tree with fitted models in each terminal node.
Regardless of the choice of recursive partitioning method (model-based or CART), single tree models can be unstable to small changes in learning data. In other words, a slight change in the learning sample can produce substantially different tree structures, thereby inducing high variability in predictions obtained across trees. Therefore, ensemble methods like “bagging” and “random forests” - random selection of features (sets of predictor variables) - are commonly used to build a large number of tree models and aggregate predictions across the diverse set of trees to obtain stable predictions [6-9]. Both methods, bagging and random forests, construct trees on random samples of learning data. Random sampling is achieved either through bootstrapping (random sampling with replacement) or subsampling (sampling without replacement). Bagging involves fitting trees to each random sample while considering the complete set of predictor variables during the process of splitting a tree node. “Random forests” produces a more diverse set of trees because at each level of a tree, a random subset of predictor variables is considered from which one might be selected for splitting the node. This allows a tree model to incorporate useful but weaker predictors that otherwise would be dominated by stronger predictors.
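The two resampling schemes and the per-node sampling of predictor variables can be sketched as follows. This is an illustrative Python sketch, not the package's R code; the function names draw_learning_sample and candidate_variables are hypothetical, and the rounding of the subsample size is an assumption.

```python
import random


def draw_learning_sample(n, replace=False, fraction=0.632, rng=None):
    """Return (in_bag, oob) index lists for building one tree.

    replace=True  -> bootstrapping: n draws with replacement.
    replace=False -> subsampling: a fraction of the n cases, without replacement.
    Cases never drawn form the out-of-bag (OOB) set.
    """
    if rng is None:
        rng = random.Random()
    if replace:
        in_bag = [rng.randrange(n) for _ in range(n)]
    else:
        # Rounding to the nearest integer is an assumption of this sketch.
        in_bag = rng.sample(range(n), int(round(fraction * n)))
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob


def candidate_variables(variables, mtry, rng=None):
    """Random forests: draw mtry candidate variables for one node split.

    Bagging corresponds to mtry == len(variables), i.e. all variables.
    """
    if rng is None:
        rng = random.Random()
    return rng.sample(list(variables), mtry)
```

With replace=False every case appears at most once in the learning sample, so the in-bag and OOB sets together cover the data exactly once; with replace=True some cases repeat and the OOB set is whatever was never drawn.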
The main objective of this paper is to introduce the mobForest R package, which implements both bagging and random forest (random variable selection) methodology for model-based recursive partitioning. The mobForest package is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=mobForest. This package computes predictions on multiple model-based trees that are constructed through random forest methodology. Predictions are aggregated across trees to produce stable predictions. The package provides functions to compute predictive accuracy estimates on individual trees and the complete mobForest. Predictive performance is computed on out-of-bag (OOB) cases - cases not used in a tree building process. The metrics implemented to compute predictive performance are “pseudo R2” and mean square error (MSE) for a continuous outcome and “proportion of correctly classified” (PCC) for a binary outcome. The pseudo R2 predictive accuracy metric is defined as the proportion of total variation in outcome explained by the tree model (or forest). Both pseudo R2 and PCC range between 0 and 1. The mobForest package computes variable importance scores and provides functions to draw variable importance and predictive performance plots. This package can use multiple cores/processors for parallel computation. The parallel package, which supports parallel computing in R, is utilized for building trees on multiple cores/processors simultaneously. The computation time is greatly reduced if the analysis is run on a multi-core machine.
Overview of functions available in mobForest package
The mobForest package contains functions for constructing model-based trees incorporating random forest methodology, computing predictions, predictive accuracy estimates, residuals plot, and variable importance plot. The detailed description of all the implemented functions is provided in the manual posted at CRAN (http://CRAN.R-project.org/package=mobForest). Here we outline the most important functions.
The main function used to develop model-based trees incorporating random forest methodology is called mob_rf_tree(). The mob_rf_tree() function is a modified version of the mob() function implemented in the party package in R. The source code of the mob() function was modified such that a random subset of partitioning variables is selected during the process of splitting a tree node. From this subset, the variable associated with the highest “parameter instability” is selected as the splitting variable.
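The modified split step can be sketched as below. This is an illustrative Python sketch rather than the code of mob_rf_tree(): the function name choose_split_variable is hypothetical, the parameter instability tests themselves are not implemented (their p values are assumed to be supplied), and "highest parameter instability" is taken here to mean the smallest p value.

```python
import random


def choose_split_variable(p_values, mtry, alpha=1.0, rng=None):
    """Pick the splitting variable for one node.

    p_values: dict mapping each partitioning variable to the p value of its
              parameter instability test in this node (assumed to be given).
    mtry:     number of partitioning variables sampled as candidates.
    alpha:    the node is split only if the best candidate's p value is
              below alpha.
    Returns the chosen variable name, or None if the node is not split.
    """
    if rng is None:
        rng = random.Random()
    # Random forest step: only a random subset of variables is considered.
    candidates = rng.sample(sorted(p_values), mtry)
    # Smallest p value = strongest evidence of parameter instability.
    best = min(candidates, key=lambda name: p_values[name])
    return best if p_values[best] < alpha else None
```

Because only mtry variables compete at each node, a variable with a moderate p value can still be chosen whenever the dominant variables happen not to be sampled.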
Setting up forest controls
Before starting the analysis, users are advised to specify the parameters that control forest growth. The parameters can be set using the mobForest_control() function, which returns an object of S4 class mobForestControl containing forest controls. The parameters include:
ntree: number of trees to be constructed in mobForest (default = 300).
mtry: number of input variables randomly sampled as candidates at each node (default is one-third of the number of partitioning variables).
replace: TRUE/FALSE. replace = TRUE performs bootstrapping. replace = FALSE (default) performs sampling without replacement.
fraction: proportion of observations to draw without replacement (default is 0.632). This parameter is relevant only if replace = FALSE.
mob.control: Object, implemented in party package, used to set up control parameters for building model-based trees. A few important parameters in this object include:
○ alpha: A node is considered for splitting if the p value for any partitioning variable in that node falls below alpha (default 1).
○ Bonferroni: logical. Should p values be Bonferroni corrected? (default FALSE).
○ minsplit: integer. The minimum number of observations (sum of the weights) in a node (default 20).
The main function mobForestAnalysis()
The mobForest package provides one main function called mobForestAnalysis() that takes all the necessary parameters as input arguments to start the mobForest analysis for model-based recursive partitioning. The input arguments to this function are kept similar to those in mob() from the party package, so users familiar with that function have an easy transition to using mobForestAnalysis(). mobForestAnalysis() takes the following input parameters:
formula: an object of class formula specifying the model that will be fit within each node. This should be of type y ~ x1 + … + xk where the variables x1, x2, …, xk are predictor variables and y represents an outcome variable. In this paper, this model will be referred to as the node model.
PartitionVariables: A character vector specifying the partition variables used to build trees within the mobForest.
Data: input dataset that is used for constructing trees in mobForest. Learning samples and out-of-bag (OOB) samples are created from this data (using bootstrapping or subsampling). The mobForest is constructed using learning samples and validated on out-of-bag samples.
mobForest.controls: object of class mobForestControl returned by mobForest_control(), that contains parameters controlling the construction of mobForests.
model: model of class StatModel used for fitting observations in the current node; it is used in the same manner as in mob(). This parameter allows fitting a linear model or generalized linear model with formula y ~ x1 + … + xk. The value “linearModel” fits a linear model. The value “glinearModel” fits a Poisson or logistic regression model, depending on the “family” parameter (explained next): if “family” is specified as binomial(), logistic regression is performed; if it is specified as poisson(), Poisson regression is performed.
family: a description of the error distribution and link function to be used in the model; it is used in the same manner as in mob(). This parameter needs to be specified if a generalized linear model is considered. The allowed values are binomial(), for logistic regression, and poisson(), for Poisson regression.
newTestData: A data frame representing test data for independent validation of mobForest model. This data is not used in the tree building process.
processors: the number of processors/cores on your computer. By default only one core is used for computations. If a computer has more than one core then increasing this variable to a value less than or equal to the number of cores will allow the package to exploit the multi-core parallelism and produce results relatively faster.
The function mobForestAnalysis() returns an object of class mobForestOutput which stores results from random forest analysis. This object stores predicted values, predictive accuracy estimates, residuals and variable importance scores produced during the analysis. This object is passed as an input argument to other functions to extract the relevant results.
After constructing a model-based tree on a learning set using the function mob_rf_tree(), the predicted values for each subject are computed using the function treePredictions(). This function, called within mobForestAnalysis(), takes a dataset and a tree model as input arguments and returns fitted values of the response variable for each observation within the dataset. Based on the characteristics of each observation, the treePredictions() function traverses the tree model to the appropriate terminal node and obtains model parameters to compute fitted values of the response variable. If the model of interest is logistic regression, then the fitted values are predicted probabilities of a classification (in one category). The mobForest package summarizes predictions obtained across multiple model-based trees. The fitted values are averaged across the tree models (for each subject) and can be obtained using the function getPredictedValues(), which is an S4 method of class mobForestOutput. This function returns fitted values averaged on OOB data only, on the complete data, or on new test data (supplied as the newTestData argument to mobForestAnalysis()). The getPredictedValues() function takes three input arguments.
mobForestOutput object - returned by mobForestAnalysis()
OOB = TRUE/FALSE: OOB = TRUE (default) returns predictions across tree model on out-of-bag data (combined across all trees). OOB = FALSE returns predictions on complete data.
newdata = TRUE/FALSE. If newdata = TRUE, the OOB parameter is ignored and the predictions on the new test data, supplied as the newTestData argument to mobForestAnalysis(), are returned. newdata = FALSE (default) ignores any new test data and returns predictions based on the OOB parameter.
The function getPredictedValues() returns a matrix with 3 columns. The first column contains the average predicted value for each subject across all the tree models. The predictions are averaged on OOB data, complete data or new test data (as per the input parameter specification). The second column contains the standard deviation of predictions, for each subject, across all the tree models. The third column contains residuals - the difference between the observed outcome and the expected prediction - for each subject across the tree models. The residuals are reported only when linear or Poisson regression is considered as the node model.
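The tree traversal and the three-column summary described above can be sketched as follows. This is an illustrative Python sketch, not the code of treePredictions() or getPredictedValues(): the nested-dictionary tree layout, the function names, and the restriction to a linear node model are all assumptions made for the example.

```python
import statistics


def predict_tree(tree, row):
    """Walk from the root to a terminal node, then apply its linear node model.

    Inner nodes: {"var": name, "cut": value, "left": node, "right": node}.
    Terminal nodes: {"coef": (intercept, {predictor_name: slope})}.
    """
    node = tree
    while "coef" not in node:
        branch = "left" if row[node["var"]] <= node["cut"] else "right"
        node = node[branch]
    intercept, slopes = node["coef"]
    return intercept + sum(beta * row[name] for name, beta in slopes.items())


def summarize_predictions(per_tree_predictions, observed):
    """Return one (mean, sd, residual) tuple per subject.

    per_tree_predictions: one dict per tree mapping subject index to that
    tree's prediction; for OOB summaries a tree only contains the subjects
    that were out-of-bag for it.
    """
    summary = []
    for i, y in enumerate(observed):
        preds = [p[i] for p in per_tree_predictions if i in p]
        if not preds:
            summary.append((None, None, None))  # subject never predicted
            continue
        mean_pred = statistics.mean(preds)
        # The sample SD is undefined for one prediction; None is this sketch's choice.
        sd_pred = statistics.stdev(preds) if len(preds) > 1 else None
        summary.append((mean_pred, sd_pred, y - mean_pred))
    return summary
```

Each tuple corresponds to one row of the three-column result: the averaged prediction, its spread across trees, and the observed outcome minus the averaged prediction.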
Predictive accuracy and error estimates
Predictive performance is also estimated at “forest level” - after aggregating OOB predictions across all the trees and then computing R2 and MSE.
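The three accuracy metrics named in this paper can be sketched as below. This is an illustrative Python sketch under stated assumptions: the pseudo R2 is written as one minus the ratio of residual to total sum of squares and truncated at 0 (an assumption consistent with the stated 0 to 1 range), and the 0.5 probability cutoff used for PCC is hypothetical. The same functions apply either to the OOB predictions of a single tree or to OOB predictions aggregated across all trees.

```python
def mse(observed, predicted):
    """Mean square error between observed and predicted outcomes."""
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)


def pseudo_r2(observed, predicted):
    """Proportion of total variation in the outcome explained by the predictions.

    Truncation at 0 is an assumption of this sketch, so that predictions
    worse than the overall mean score 0 rather than a negative value.
    """
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return max(0.0, 1.0 - ss_res / ss_tot)


def pcc(observed, predicted_prob, cutoff=0.5):
    """Proportion of binary outcomes correctly classified.

    A predicted probability at or above the (hypothetical) cutoff counts
    as class 1, otherwise class 0.
    """
    hits = sum(int(prob >= cutoff) == y for y, prob in zip(observed, predicted_prob))
    return hits / len(observed)
```

Tree-level estimates come from calling these functions once per tree on that tree's OOB cases; the forest-level estimate comes from calling them once on the per-subject averages of the OOB predictions.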
Predictive accuracy function
The function PredictiveAccuracy() (S4 method of class mobForestOutput) can be used to extract predictive accuracy estimates over OOB cases and/or new test data. It takes three input arguments:
mobForestOutput object - returned by mobForestAnalysis()
newdata = TRUE/FALSE. If newdata = TRUE, R2 (or PCC) and MSE are obtained for the new test data supplied as the newTestData argument to mobForestAnalysis(). newdata = FALSE (default) ignores any new test data and returns R2 (or PCC) and MSE estimates based on OOB predictions and complete dataset predictions summarized across all trees.
plot = TRUE (default). This allows the user to view the distribution of R2 (or PCC) and MSE estimates for OOB cases across all the trees, the overall R2 (or PCC) and MSE estimates when OOB predictions are aggregated across all the trees, and the overall R2 (or PCC) and MSE estimates when predictions on new test data are aggregated across all the trees. plot = FALSE produces no plot.
PredictiveAccuracy() returns a list containing: a) OOB R2 (or PCC) estimates across all the trees, b) MSE estimates on OOB data across all the trees, c) overall R2 (or PCC) estimate when OOB predictions are aggregated across all trees, d) overall MSE estimate when OOB predictions are aggregated across all trees, e) R2 (or PCC) estimates on complete data across all the trees, f) MSE estimates on complete data across all the trees, g) overall R2 (or PCC) estimate when complete-data predictions are aggregated across all the trees, h) overall MSE estimate when complete-data predictions are aggregated across all the trees, i) the node model and partition variables used, j) if newdata = TRUE, overall R2 (or PCC) and MSE estimates when predictions on new test data are aggregated across all the trees.
Variable importance
Variable importance assessment is a process of ranking variables in the predictor set according to their importance in producing accurate predictions. The “permutation accuracy importance” method [1, 3, 10] is used to compute importance scores for each variable. To determine the importance of a variable m, values of m in the OOB cases are randomly permuted and PCCp (proportion of OOB cases correctly classified when a binary outcome is considered) or MSEp (for a continuous outcome) is obtained through variable-m-permuted OOB data. Next, MSEp is subtracted from MSEk (or PCCp is subtracted from PCCk), which was obtained using the original un-permuted OOB data. The average of this number over all the trees in the forest is the raw importance score for variable m. One can invoke the functions getVarimp() and varimplot() (S4 methods of class mobForestOutput) to produce variable importance scores and a variable importance plot.
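The permutation scheme can be sketched for the binary-outcome case as follows. This is an illustrative Python sketch, not the code behind getVarimp(): the function names, the representation of each tree as a (predict, OOB rows, OOB outcomes) triple, and the 0.5 probability cutoff are assumptions made for the example.

```python
import random


def pcc(observed, predicted_prob, cutoff=0.5):
    """Proportion correctly classified; the 0.5 cutoff is an assumption."""
    hits = sum(int(prob >= cutoff) == y for y, prob in zip(observed, predicted_prob))
    return hits / len(observed)


def permutation_importance(trees, variable, rng=None):
    """Raw importance score of one variable for a binary outcome.

    trees: list of (predict, oob_rows, oob_outcomes) triples, where predict
           maps a row (dict of variable values) to a predicted probability.
    For each tree the variable is permuted among that tree's OOB cases and
    the PCC on permuted data is subtracted from the PCC on the original
    data; the differences are then averaged over all trees.
    """
    if rng is None:
        rng = random.Random()
    differences = []
    for predict, rows, outcomes in trees:
        pcc_original = pcc(outcomes, [predict(row) for row in rows])
        shuffled = [row[variable] for row in rows]
        rng.shuffle(shuffled)
        permuted_rows = [dict(row, **{variable: value})
                         for row, value in zip(rows, shuffled)]
        pcc_permuted = pcc(outcomes, [predict(row) for row in permuted_rows])
        differences.append(pcc_original - pcc_permuted)
    return sum(differences) / len(differences)
```

A variable that no tree uses leaves every prediction unchanged after permutation, so its score is exactly zero; a variable the trees rely on loses accuracy when permuted and receives a positive score.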
Residual plot
One can invoke the function residualPlot() (S4 method of class mobForestOutput) to produce the following diagnostic plots.
residuals vs. predicted outcomes for OOB cases: this plot should produce a distribution of points randomly scattered around 0, regardless of the size of the fitted value.
histogram of OOB residuals: this plot is expected to be roughly normal with mean 0.
It should be noted that the above diagnostic plots are typical when the fitted values are obtained through linear regression but not when logistic or Poisson regression is considered as a node model. Therefore, mobForest package produces the above residual plots only when linear regression is considered. For logistic or Poisson models, a message is printed saying “Residual Plot not produced when logistic of Poisson regression is considered as the node model”.
Results and discussion
where Y represents “fifty.reduce”, X represents baseline percent drinking days (bpdrkday), Tj is a 0/1 dummy variable representing the jth treatment, β0 represents the intercept term of the regression model, β1 represents the baseline effect, β2,…,β8 represent the treatment effects for treatments T2,…,T8 (with treatment group 1 as the reference category), and e represents the residuals. We used the mobForest package to estimate the treatment effects for different groups of patients, partitioned through model-based recursive trees, and to summarize outcome predictions across a large number of trees.
We also fit the learning data using a logistic regression model containing all the parameters in equation (5) plus the best subset of partition variables, selected through stepwise regression analysis with a forward selection procedure. Four of the top 5 important variables obtained through mobForest analysis were also selected in the final model obtained using stepwise regression. These variables include “likely.to.d”, “focdsrci”, “action”, and “pssscore”. The AUC estimate for predictions obtained through stepwise regression was 0.71 on the training dataset and 0.60 on the validation dataset. Therefore, mobForest showed better performance than the stepwise regression method.
The R package mobForest implements the random forest method for model-based recursive partitioning. This package combines predictions obtained across a diverse set of trees to produce stable predictions. The mobForest package provides functions for producing predictive performance plots, variable importance plots and residual plots using data contained in the mobForest object. The package uses multiple cores/processors to perform parallel computations. The parallel package, which supports parallel computing in R, is utilized for faster computation. The mobForest package supports linear, Poisson and logistic regression models for use in model-based random forest type analysis.
Availability and requirements
Project name: Alcohol dependence study
Project Home page:
Operating system: Windows platform
Programming Language: R
License: GPL (≥2)
This study was funded in part by a grant from NIAAA (number 1RC4AA020096-01, Bobashev - PI).
- Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.
- Breiman L, Friedman J, Stone CJ, Olshen RA: Classification and regression trees. 1984, Chapman & Hall/CRC.
- Strobl C, Malley J, Tutz G: An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol Methods. 2009, 14 (4): 323-348.
- Zeileis A, Hothorn T, Hornik K: Model-based recursive partitioning. J Comput Graph Stat. 2008, 17 (2): 492-514. 10.1198/106186008X319331.
- R Core Team: R: a language and environment for statistical computing. 2012, Vienna, Austria: R Foundation for Statistical Computing.
- Breiman L: Bagging predictors. Mach Learn. 1996, 24 (2): 123-140.
- Bauer E, Kohavi R: An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Mach Learn. 1999, 36 (1-2): 105-139.
- Buhlmann P, Yu B: Analyzing bagging. Ann Stat. 2002, 30 (4): 927-961.
- Dietterich TG: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach Learn. 2000, 40 (2): 139-157. 10.1023/A:1007607513941.
- Genuer R, Poggi JM, Tuleau-Malot C: Variable selection using random forests. Pattern Recogn Lett. 2010, 31 (14): 2225-2236. 10.1016/j.patrec.2010.03.014.
- Anton RF, O’Malley SS, Ciraulo DA, Cisler RA, Couper D, Donovan DM, Gastfriend DR, Hosking JD, Johnson BA, LoCastro JS: Combined pharmacotherapies and behavioral interventions for alcohol dependence: the COMBINE study: a randomized controlled trial. JAMA. 2006, 295 (17): 2003-2017. 10.1001/jama.295.17.2003.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.