Overview of functions available in the mobForest package
The mobForest package contains functions for constructing model-based trees that incorporate random forest methodology and for computing predictions, predictive accuracy estimates, residual plots, and variable importance plots. All implemented functions are described in detail in the manual posted on CRAN (http://CRAN.R-project.org/package=mobForest). Here we outline the most important functions.
Tree modeling
The main function used to develop model-based trees incorporating random forest methodology is mob_rf_tree(). It is a modified version of the mob() function implemented in the party package in R. The source code of mob() was modified so that a random subset of partitioning variables is selected each time a tree node is split. From this subset, the variable associated with the highest “parameter instability” [4] is selected as the splitting variable.
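The node-splitting rule described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the package's R code. The function name choose_split_variable, the example variables, and the use of the smallest instability-test p-value as a stand-in for "highest parameter instability" are assumptions.

```python
import random

def choose_split_variable(p_values, mtry, rng):
    """Pick a splitting variable from a random subset of candidates.

    p_values maps each partitioning variable to the p-value of its
    parameter-instability test (assumption: a smaller p-value means
    stronger instability).
    """
    # Draw mtry candidate variables at random, without replacement.
    candidates = rng.sample(sorted(p_values), mtry)
    # Among the candidates only, keep the most unstable variable.
    best = min(candidates, key=lambda name: p_values[name])
    return best, candidates

# Hypothetical instability p-values for four partitioning variables.
p_values = {"age": 0.20, "dose": 0.001, "sex": 0.64, "bmi": 0.03}
split_var, candidates = choose_split_variable(p_values, mtry=2, rng=random.Random(1))
print(split_var, candidates)
```

Because only the sampled candidates compete, a globally strong variable such as "dose" is not chosen at every node, which is what decorrelates the trees in the forest.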
Setting up forest controls
Before starting the analysis, users are advised to specify the parameters that control forest growth. These parameters can be set using the mobForest_control() function, which returns an object of S4 class mobForestControl containing the forest controls. The parameters include:
- ntree: number of trees to be constructed in the mobForest (default = 300).
- mtry: number of input variables randomly sampled as candidates at each node (default is one-third of the number of partitioning variables).
- replace: TRUE/FALSE. replace = TRUE performs bootstrapping. replace = FALSE (default) performs sampling without replacement.
- fraction: fraction of observations to draw without replacement (default is 0.632). This parameter is relevant only if replace = FALSE.
- mob.control: object, implemented in the party package, used to set up control parameters for building model-based trees. A few important parameters in this object include:
  ○ alpha: a node is considered for splitting if the p value for any partitioning variable in that node falls below alpha (default 1).
  ○ Bonferroni: logical. Should p values be Bonferroni corrected? (default FALSE).
  ○ minsplit: integer. The minimum number of observations (sum of the weights) in a node (default 20).
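The effect of the replace and fraction controls on how learning and out-of-bag samples are drawn can be sketched as follows. This is an illustrative Python sketch, not the package's implementation. The function name draw_learning_sample and the rounding of fraction × n to the nearest integer are assumptions.

```python
import random

def draw_learning_sample(n, replace=False, fraction=0.632, rng=None):
    """Return (learning, oob) row indices for one tree.

    replace=True  -> bootstrap sample of size n (drawn with replacement).
    replace=False -> subsample of about fraction * n rows (no replacement).
    Rows never drawn for the learning sample form the out-of-bag set.
    """
    rng = rng or random.Random()
    indices = list(range(n))
    if replace:
        learning = [rng.choice(indices) for _ in range(n)]
    else:
        size = int(round(fraction * n))  # rounding rule is an assumption
        learning = rng.sample(indices, size)
    in_bag = set(learning)
    oob = [i for i in indices if i not in in_bag]
    return learning, oob

sub_learning, sub_oob = draw_learning_sample(100, rng=random.Random(0))
boot_learning, boot_oob = draw_learning_sample(100, replace=True, rng=random.Random(0))
print(len(sub_learning), len(sub_oob), len(boot_learning), len(boot_oob))
```

With the defaults, each tree is grown on roughly 63% of the rows and evaluated on the remaining rows, which is close to the share of distinct rows a bootstrap sample contains on average.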
The main function mobForestAnalysis()
The mobForest package provides one main function, mobForestAnalysis(), that takes all the necessary parameters as input arguments to start the mobForest analysis for model-based recursive partitioning. The input arguments to this function are kept similar to those of mob() from the party package, so users familiar with that function can move easily to mobForestAnalysis(). mobForestAnalysis() takes the following input parameters:
- formula: an object of class formula specifying the model that will be fit within each node. This should be of type y ~ x1 + … + xk, where x1, x2, …, xk are predictor variables and y is the outcome variable. In this paper, this model is referred to as the node model.
- PartitionVariables: a character vector specifying the partition variables used to build trees within the mobForest.
- Data: input dataset used for constructing trees in the mobForest. Learning samples and out-of-bag (OOB) samples are created from this data (using bootstrapping or subsampling). The mobForest is constructed using the learning samples and validated on the out-of-bag samples.
- mobForest.controls: object of class mobForestControl, returned by mobForest_control(), that contains the parameters controlling the construction of the mobForest.
- model: model of class StatModel used for fitting the observations in the current node; it is used in the same manner as in mob(). This parameter allows fitting a linear model or a generalized linear model with formula y ~ x1 + … + xk. The value linearModel fits a linear model. The value glinearModel fits a Poisson or logistic regression model, depending on the family parameter (explained next): family = binomial() performs logistic regression and family = poisson() performs Poisson regression.
- family: a description of the error distribution and link function to be used in the model; it is used in the same manner as in mob(). This parameter needs to be specified only if a generalized linear model is considered. The allowed values are binomial() for logistic regression and poisson() for Poisson regression.
- newTestData: a data frame representing test data for independent validation of the mobForest model. This data is not used in the tree building process.
- processors: the number of processors/cores to use. By default only one core is used for computations. If the computer has more than one core, increasing this value (up to the number of available cores) allows the package to exploit multi-core parallelism and produce results faster.
The function mobForestAnalysis() returns an object of class mobForestOutput which stores results from random forest analysis. This object stores predicted values, predictive accuracy estimates, residuals and variable importance scores produced during the analysis. This object is passed as an input argument to other functions to extract the relevant results.
Predictions
After a model-based tree has been constructed on a learning set using mob_rf_tree(), the predicted values for each subject are computed using the function treePredictions(). This function, called within mobForestAnalysis(), takes a dataset and a tree model as input arguments and returns the fitted value of the response variable for each observation in the dataset. Based on the characteristics of each observation, treePredictions() traverses the tree model to the appropriate terminal node and uses that node's model parameters to compute the fitted value. If the node model is logistic regression, the fitted values are the predicted probabilities of belonging to one category. The mobForest package summarizes predictions obtained across multiple model-based trees. The fitted values are averaged across the tree models (for each subject) and can be obtained using the function getPredictedValues(), which is an S4 method of class mobForestOutput. This function returns fitted values averaged on OOB data only, on the complete data, or on new test data (supplied as the newTestData argument to mobForestAnalysis()). The getPredictedValues() function takes three input arguments:
- mobForestOutput object: returned by mobForestAnalysis().
- OOB: TRUE/FALSE. OOB = TRUE (default) returns predictions averaged across the tree models on out-of-bag data (combined across all trees). OOB = FALSE returns predictions on the complete data.
- newdata: TRUE/FALSE. If newdata = TRUE, the OOB parameter is ignored and the predictions on the new test data, supplied as the newTestData argument to mobForestAnalysis(), are returned. newdata = FALSE (default) returns predictions according to the OOB parameter.
The function getPredictedValues() returns a matrix with 3 columns. The first column contains the average predicted value for each subject across all the tree models. The predictions are averaged on OOB data, complete data, or new test data (as per the input parameter specification). The second column contains the standard deviation of the predictions, for each subject, across all the tree models. The third column contains the residuals, i.e. the difference between the observed outcome and the average prediction, for each subject. The residuals are reported only when linear or Poisson regression is considered as the node model.
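The three-column summary described above can be sketched as follows. This Python sketch is illustrative only and is not the package's R code. The function name summarize_predictions, the input layout, the use of the sample standard deviation, and the handling of subjects that are never out-of-bag are all assumptions.

```python
from statistics import mean, stdev

def summarize_predictions(tree_predictions, oob_masks, observed, oob_only=True):
    """Summarize per-subject predictions across trees.

    tree_predictions[k][i]: prediction of tree k for subject i.
    oob_masks[k][i]: True when subject i was out-of-bag for tree k.
    Returns one (mean, standard deviation, residual) tuple per subject.
    """
    rows = []
    for i, y in enumerate(observed):
        # Keep only the trees for which subject i was OOB, unless all
        # trees are requested (complete-data predictions).
        values = [
            preds[i]
            for preds, mask in zip(tree_predictions, oob_masks)
            if mask[i] or not oob_only
        ]
        if not values:
            rows.append((None, None, None))  # subject was never out-of-bag
            continue
        avg = mean(values)
        sd = stdev(values) if len(values) > 1 else 0.0
        rows.append((avg, sd, y - avg))  # residual = observed - mean prediction
    return rows

# Three hypothetical trees and three subjects.
tree_predictions = [[1.0, 2.0, 3.0], [1.2, 2.4, 2.0], [0.8, 1.6, 4.0]]
oob_masks = [[True, False, True], [True, True, False], [False, True, True]]
observed = [1.0, 2.5, 3.0]

oob_rows = summarize_predictions(tree_predictions, oob_masks, observed)
all_rows = summarize_predictions(tree_predictions, oob_masks, observed, oob_only=False)
print(oob_rows)
```

Switching oob_only plays the role of the OOB argument: with oob_only=False every tree contributes to every subject, which corresponds to predictions on the complete data.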
Predictive accuracy and error estimates
Metrics
The OOB cases provide a fair means of estimating performance and error from the training data alone. Predictive accuracy estimates are computed differently for a logistic regression model than for a linear or Poisson regression model. When a linear or Poisson regression model is considered, the predictive accuracy metric R2k is defined as the proportion of total variation in the outcome variable explained by the kth tree on OOB cases. For a logistic regression model, the predicted probabilities for OOB cases are converted into classes (yes/no, high/low, etc., as specified) using the probability cutoff specified by the end user (default 0.5), and the predictive accuracy PCCk is defined as the proportion of OOB cases correctly classified by the kth tree model. Both PCCk and R2k range between 0 and 1. For a linear regression model, R2k is a function of the “sum of squared errors” (SSEk) and the “total sum of squares” (SSTOk) on the OOB data of the kth tree. It is computed as
$$R^2_k = 1 - \frac{SSE_k}{SSTO_k} \tag{1}$$
where
$$SSE_k = \sum_{x=1}^{n} (y_x - \hat{y}_x)^2 \tag{2}$$
$$SSTO_k = \sum_{x=1}^{n} (y_x - \bar{y})^2 \tag{3}$$
and $y_x$ is the observed outcome for the xth OOB case, $\hat{y}_x$ is the predicted outcome for the xth OOB case, $n$ is the number of OOB cases (the cases not used in building the kth tree), and $\bar{y}$ is the mean observed outcome of the OOB cases. It should be noted that R2k can be negative when SSEk is greater than SSTOk. In such situations, we force R2k to zero. The other metric used for measuring predictive accuracy is the “mean square error” (MSE) [10]. MSEk is defined as the MSE estimate on OOB cases for the kth tree model and is calculated as follows.
$$MSE_k = \frac{1}{n} \sum_{x=1}^{n} (y_x - \hat{y}_x)^2 \tag{4}$$
Predictive performance is also estimated at the “forest level”: the OOB predictions are first aggregated across all the trees, and R2 and MSE are then computed from the aggregated predictions.
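The tree-level metrics described above can be sketched in a few lines. This is an illustrative Python sketch, not the package's R code. The function names are hypothetical, MSE is taken as SSE divided by the number of OOB cases, and classifying a case as positive when its probability is strictly greater than the cutoff is an assumption.

```python
def sse(observed, predicted):
    """Sum of squared errors between observed and predicted outcomes."""
    return sum((y - p) ** 2 for y, p in zip(observed, predicted))

def r_squared(observed, predicted):
    """R2 = 1 - SSE/SSTO, forced to zero when SSE exceeds SSTO.

    Assumes the observed outcomes are not all identical (SSTO > 0).
    """
    y_bar = sum(observed) / len(observed)
    ssto = sum((y - y_bar) ** 2 for y in observed)
    return max(0.0, 1.0 - sse(observed, predicted) / ssto)

def mse(observed, predicted):
    """Mean square error over the supplied cases."""
    return sse(observed, predicted) / len(observed)

def pcc(observed_classes, probabilities, cutoff=0.5):
    """Proportion of cases correctly classified at a probability cutoff."""
    predicted = [1 if p > cutoff else 0 for p in probabilities]
    hits = sum(int(o == c) for o, c in zip(observed_classes, predicted))
    return hits / len(observed_classes)

# Hypothetical OOB outcomes and predictions for one tree.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 1.5, 3.5, 3.5]
print(r_squared(y_obs, y_hat), mse(y_obs, y_hat))
print(pcc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```

Applying the same functions to the per-subject predictions averaged over all trees, instead of to one tree's OOB predictions, gives the forest-level estimates.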
Predictive accuracy function
The function PredictiveAccuracy() (an S4 method of class mobForestOutput) can be used to extract predictive accuracy estimates over OOB cases and/or new test data. It takes three input arguments:
- mobForestOutput object: returned by mobForestAnalysis().
- newdata: TRUE/FALSE. If newdata = TRUE, R2 (or PCC) and MSE are obtained for the new test data supplied as the newTestData argument to mobForestAnalysis(). newdata = FALSE (default) returns only the R2 (or PCC) and MSE estimates based on OOB predictions and complete-data predictions summarized across all trees.
- plot: TRUE/FALSE. plot = TRUE (default) lets the user view the distribution of R2 (or PCC) and MSE estimates for OOB cases across all the trees, the overall R2 (or PCC) and MSE estimates when OOB predictions are aggregated across all the trees, and the overall R2 (or PCC) and MSE estimates when predictions on new test data are aggregated across all the trees. plot = FALSE produces no plot.
PredictiveAccuracy() returns a list containing: a) OOB R2 (or PCC) estimates across all the trees, b) MSE estimates on OOB data across all the trees, c) the overall R2 (or PCC) estimate when OOB predictions are aggregated across all the trees, d) the overall MSE estimate when OOB predictions are aggregated across all the trees, e) R2 (or PCC) estimates on complete data across all the trees, f) MSE estimates on complete data across all the trees, g) the overall R2 (or PCC) estimate when complete-data predictions are aggregated across all the trees, h) the overall MSE estimate when complete-data predictions are aggregated across all the trees, i) the node model and partition variables used, and j) if newdata = TRUE, the overall R2 (or PCC) and MSE estimates when predictions on new test data are aggregated across all the trees.
Variable importance
Variable importance assessment is the process of ranking variables in the predictor set according to their importance in producing accurate predictions. The “permutation accuracy importance” method [1, 3, 10] is used to compute importance scores for each variable. To determine the importance of a variable m, the values of m in the OOB cases are randomly permuted, and PCCp (proportion of OOB cases correctly classified when a binary outcome is considered) or MSEp (for a continuous outcome) is computed on the OOB data with variable m permuted. Next, MSEp is subtracted from MSEk (or PCCp is subtracted from PCCk), which was obtained using the original un-permuted OOB data. The average of this difference over all the trees in the forest is the raw importance score for variable m. One can invoke the functions getVarimp() and varimplot() (S4 methods of class mobForestOutput) to produce variable importance scores and a variable importance plot.
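The permutation scheme described above can be sketched for the binary-outcome case, where the score is the average of PCCk minus PCCp over the trees. This Python sketch is illustrative only and is not the package's R code. The function names, the representation of a tree as a (predict, OOB rows, OOB labels) triple, and the toy data are assumptions.

```python
import random

def pcc(observed, predicted):
    """Proportion of cases correctly classified."""
    return sum(int(o == p) for o, p in zip(observed, predicted)) / len(observed)

def permutation_importance(trees, variable, rng):
    """Average drop in OOB accuracy after permuting one variable.

    trees: list of (predict, oob_rows, oob_labels), where predict maps a
    row (a dict of variable values) to a predicted class.
    """
    drops = []
    for predict, rows, labels in trees:
        baseline = pcc(labels, [predict(r) for r in rows])
        # Shuffle the chosen variable across the OOB rows of this tree,
        # leaving every other variable untouched.
        shuffled = [r[variable] for r in rows]
        rng.shuffle(shuffled)
        permuted_rows = [{**r, variable: v} for r, v in zip(rows, shuffled)]
        permuted = pcc(labels, [predict(r) for r in permuted_rows])
        drops.append(baseline - permuted)  # PCCk - PCCp
    return sum(drops) / len(drops)

# Toy data: the class depends on "dose" only; "noise" is irrelevant.
rows = [{"dose": d, "noise": d % 3} for d in range(1, 11)]
labels = [int(r["dose"] > 5) for r in rows]

def predict(row):
    return int(row["dose"] > 5)

# Two hypothetical trees sharing one rule but with different OOB subsets.
trees = [(predict, rows[:6], labels[:6]), (predict, rows[4:], labels[4:])]
dose_score = permutation_importance(trees, "dose", random.Random(0))
noise_score = permutation_importance(trees, "noise", random.Random(0))
print(dose_score, noise_score)
```

Permuting a variable that the trees never use leaves the predictions unchanged, so its score is exactly zero, while permuting an informative variable can only lower the accuracy of a rule that was perfect on the original OOB data.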
Residual plot
One can invoke the function residualPlot() (an S4 method of class mobForestOutput) to produce the following diagnostic plots.
- residuals vs. predicted outcomes for OOB cases: this plot should show points randomly scattered around 0, regardless of the size of the fitted value.
- histogram of OOB residuals: this plot is expected to be roughly normal with mean 0.
It should be noted that the above diagnostic plots are appropriate when the fitted values are obtained through linear regression, but not when logistic or Poisson regression is considered as the node model. Therefore, the mobForest package produces these residual plots only when linear regression is considered. For logistic or Poisson models, a message is printed saying “Residual Plot not produced when logistic of Poisson regression is considered as the node model”.