### Random Forest regression

We call \mathcal{D} the data set comprising *N* unrelated individuals or samples genotyped at *P* biallelic markers. For each individual, the markers are arranged in a data vector *x*_{i} = (*x*_{i1}, *x*_{i2}, ..., *x*_{iP}), for *i* = 1, ..., *N*. Depending on the coding scheme, different genetic models can be applied. For instance, assuming an additive genetic model, each *x*_{ij} represents the count of minor alleles recorded at the *j*-th locus: a minor-allele homozygote is coded 2, a heterozygote 1, and a major-allele homozygote 0. The associated quantitative trait for each subject is assumed to be a *Q*-dimensional real-valued vector, which we denote as *y*_{i} = (*y*_{i1}, *y*_{i2}, ..., *y*_{iQ}), with *i* = 1, ..., *N*. In imaging genetics study designs, for instance, it is common that the sample size *N* is much smaller than min{*P*, *Q*}.
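
Under the additive coding described above, a genotype maps to its minor-allele count; a minimal illustrative sketch (the allele labels are hypothetical):

```python
# Illustrative sketch (not part of the paper's implementation): encoding
# biallelic genotypes under an additive model as minor-allele counts.

def additive_code(genotype, minor_allele):
    """Count copies of the minor allele in a two-letter genotype string."""
    return sum(1 for allele in genotype if allele == minor_allele)

# At a locus where 'a' is the minor allele:
# 'aa' (minor homozygote) -> 2, 'Aa' (heterozygote) -> 1, 'AA' -> 0
codes = [additive_code(g, "a") for g in ("aa", "Aa", "AA")]
```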

The RF algorithm builds an ensemble of regression trees, each one independently learned on a bootstrapped version of \mathcal{D}. The required number of trees in the forest, Ntree, is a user-defined parameter. The training data set for each tree is obtained by randomly sampling *N* subjects from \mathcal{D} with replacement. The tree-building process introduces a second layer of randomness: at each node, a random subset of Mtry candidate SNPs is selected from the *P* available SNPs, which reduces the correlation among trees. In each tree, the best split at a node is determined by evaluating a split function for each value of a candidate SNP, and then selecting the SNP that maximises this function (see also Equation 2). To reduce bias, the trees are grown to maximal depth without pruning, or until a minimum node sample size has been reached; by default, we set this value to 5 for univariate traits and 20 for multivariate traits. We only consider binary trees, although in principle multi-way splits could also be accommodated with minor changes.
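
The two layers of randomness, the per-tree bootstrap and the per-node draw of Mtry candidate SNPs, can be sketched as follows (a simplified illustration, not the Java implementation):

```python
# Sketch of the two randomisation steps described above (simplified):
# a bootstrap sample of N subjects per tree, and a random subset of
# `mtry` candidate SNPs examined at each node.
import random

def bootstrap_indices(n, rng):
    """Sample n subject indices with replacement (one tree's training set)."""
    return [rng.randrange(n) for _ in range(n)]

def candidate_snps(p, mtry, rng):
    """Draw mtry candidate SNPs without replacement from the P predictors."""
    return rng.sample(range(p), mtry)

rng = random.Random(0)
boot = bootstrap_indices(10, rng)     # subjects used to grow one tree
cands = candidate_snps(100, 5, rng)   # SNPs examined at one node
```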

For each tree, all the subjects in \mathcal{D} that do not become part of the bootstrap sample used for training are collected together to form an *out-of-bag* (OOB) sample, which is used as a testing set. Approximately 63.2% of the subjects in \mathcal{D} are utilised as training data, while the remaining subjects are OOB samples. Each OOB sample is used to obtain an estimate of the prediction error (PE) for its tree, and these estimates are then averaged across all trees to provide an overall estimate [22].
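
The 63.2% figure follows from the probability that a subject appears at least once in a bootstrap sample of size *N*, which is 1 - (1 - 1/*N*)^{N} and tends to 1 - 1/e ≈ 0.632; a quick numerical check:

```python
# Why ~63.2% of subjects are in-bag: the chance a given subject appears
# at least once in a bootstrap sample of size n is 1 - (1 - 1/n)**n,
# which converges to 1 - 1/e as n grows.
import math

def in_bag_fraction(n):
    return 1.0 - (1.0 - 1.0 / n) ** n

frac = in_bag_fraction(1000)
limit = 1.0 - math.exp(-1.0)   # ~0.632
```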

Although RF is deemed relatively insensitive to the choice of Ntree and Mtry, in practice, for large-scale GWAS involving a massive number of predictors, and possibly multivariate responses, these parameters must be tuned to achieve optimal predictive performance and to increase the statistical power of the algorithm to detect the true causative SNPs. As the number of trees in the forest increases, the OOB error rate is expected to converge to a theoretical prediction error according to the law of large numbers [22]. It is therefore important to select a sufficiently large number of trees to guarantee optimal performance and a stable ranking.

### Split functions for multivariate traits

The node splitting rule determines how each tree in the forest is built, and depends on the particular predictive task at hand. For each node *j*, two operations are performed: (a) every allowable split on each SNP is examined; (b) the best of these splits is selected, and the left and right daughter nodes are created. The initial node is the root node, which contains the entire data set \mathcal{D}, and the two operations above are then applied repeatedly to each daughter node until no more splits can be obtained. During this process the value of a split function *ϕ*(*j*) is computed for every split at node *j*. In regression tasks with both univariate and multivariate responses, sum-of-squares functions are commonly used [23]. In what follows, we let \mathcal{D}(j) denote the subset of samples associated with node *j*, and {\mathcal{M}}_{j} the set of Mtry candidate SNPs that are available to split node *j*. Furthermore, the mean response vector observed in \mathcal{D}(j) is denoted \bar{y}(j). With this notation in place, the total sum of squares at node *j* is

SS(j)=\sum_{i\in\mathcal{D}(j)}\left(y_{i}-\bar{y}(j)\right)^{T}V^{-1}(\Theta,j)\left(y_{i}-\bar{y}(j)\right)

(1)

where V(\Theta,j) is the *Q* × *Q* covariance matrix estimated from \mathcal{D}(j), and depends on an unknown parameter vector \Theta. A fully parametrised covariance matrix requires *Q*(*Q* + 1)/2 parameters. With low sample sizes, a more parsimonious model is generally preferable, so that \Theta has only a few unknown parameters. It is standard procedure to set the covariance matrices at the daughter nodes equal to the estimated covariance matrix at the parent node, in order to guarantee that the split function remains positive [23]. This procedure has the additional benefit of reducing the number of computations. We call *Standard RF* the algorithm that uses the splitting criterion (1).
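
As an illustration, the sum of squares of Eq. (1) takes a particularly simple form under the diagonal-covariance model discussed above; the function below is a simplified stand-in, not the paper's Java implementation, and assumes the per-trait variances are supplied.

```python
# Sketch of SS(j) from Eq. (1) with a diagonal covariance matrix:
# V^{-1} then simply divides each squared deviation by its trait variance.

def node_ss(responses, variances):
    """sum_i (y_i - ybar)^T V^{-1} (y_i - ybar) for V = diag(variances)."""
    n = len(responses)
    q = len(responses[0])
    ybar = [sum(y[k] for y in responses) / n for k in range(q)]
    return sum((y[k] - ybar[k]) ** 2 / variances[k]
               for y in responses for k in range(q))

# Two subjects, Q = 2 traits with variances 1 and 4:
ss = node_ss([[1.0, 2.0], [3.0, 6.0]], [1.0, 4.0])
```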

When a SNP is selected as a candidate to split node *j* into two daughter nodes, the total sums of squares computed at the left and right daughter nodes are *SS*(*j*)_{l} and *SS*(*j*)_{r}, respectively. A suitable split function in this case measures the reduction in the sum of squares due to the split, and is given by

\phi(j)=SS(j)-SS(j)_{l}-SS(j)_{r}.

(2)

Every candidate SNP is tested to split the node, and the one with the highest *ϕ*(*j*) is selected. Once the best split has been found, the daughter nodes become new parent nodes and the covariance matrices are estimated again.
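
A minimal sketch of this exhaustive split search, assuming (for brevity) a univariate trait and additively coded genotypes; the function and SNP names are illustrative only.

```python
# Toy best-split search maximising phi = SS - SS_left - SS_right (Eq. 2),
# with a univariate sum of squares and 0/1/2 genotype codes.

def ss(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(geno, traits, candidates):
    """Return (snp, threshold, gain) with the largest reduction in SS."""
    total = ss(traits)
    best = (None, None, 0.0)
    for snp in candidates:
        for thr in (0, 1):          # binary splits "code <= thr" vs "> thr"
            left = [t for g, t in zip(geno[snp], traits) if g <= thr]
            right = [t for g, t in zip(geno[snp], traits) if g > thr]
            if not left or not right:
                continue
            gain = total - ss(left) - ss(right)
            if gain > best[2]:
                best = (snp, thr, gain)
    return best

snp, thr, gain = best_split(
    {"rs1": [0, 0, 2, 2], "rs2": [0, 1, 0, 1]},   # hypothetical SNPs
    [1.0, 1.0, 5.0, 5.0],                         # univariate trait
    ["rs1", "rs2"],
)
```

Here `rs1` separates the low and high trait values perfectly, so it yields the whole node sum of squares as gain.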

Parsimonious parametrisations of the covariance matrices are required in order to keep the computational burden low. In high-dimensional settings, and especially when *N* is much smaller than *Q*, it is commonplace to assume that the covariance matrices are diagonal [15, 24]. For instance, typical whole-brain imaging genetics studies may involve a few hundred thousand brain-wide measurements, while the sample size remains in the order of a few hundred [19]. In this situation, the total sum of squares of Eq. (1) can be alternatively expressed in terms of the *N* × *N* squared inter-point Euclidean distances between all *N* samples [25]. Rewriting Eq. (1) in this equivalent form gives

SS(j)=\frac{1}{2N(j)}\sum_{i\in\mathcal{D}(j)}\sum_{l\in\mathcal{D}(j)}d_{E}^{2}(y_{i},y_{l}),

(3)

where *N*(*j*) indicates the sample size at node *j*. This strategy provides an equivalent but computationally more efficient way of evaluating the split function of Eq. (2): once the pairwise distances have been computed, the evaluation of each *SS*(*j*) term has a cost of O(*N*(*j*)^{2}) instead of O(*N*(*j*) × *Q*). We call *Distance-based RF* the algorithm that uses the splitting criterion (3).
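
The equivalence between the mean-centred form of Eq. (1) (with an identity covariance) and the pairwise-distance form of Eq. (3) can be checked numerically; this is a standalone sketch, not the package's code.

```python
# Check that sum_i |y_i - ybar|^2 equals (1 / 2n) * sum_{i,l} |y_i - y_l|^2,
# i.e. Eq. (1) with V = I matches the distance form of Eq. (3).

def ss_mean_form(ys):
    n, q = len(ys), len(ys[0])
    ybar = [sum(y[k] for y in ys) / n for k in range(q)]
    return sum((y[k] - ybar[k]) ** 2 for y in ys for k in range(q))

def ss_distance_form(ys):
    n = len(ys)
    d2 = lambda a, b: sum((ak - bk) ** 2 for ak, bk in zip(a, b))
    return sum(d2(yi, yl) for yi in ys for yl in ys) / (2 * n)

ys = [[1.0, 0.0], [2.0, 2.0], [4.0, 1.0]]   # three Q = 2 responses
```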

### Measure of variable importance for SNP ranking

One of the attractive features of RF for GWAS is its ability to perform SNP ranking by computing a measure of variable importance [5]. A commonly used and computationally simple procedure for SNP ranking consists in monitoring the value of the split function *ϕ*(*j*) every time a particular SNP has been selected, in each tree. This score, which we will refer to as the information gain importance score, usually produces rankings that are comparable with other variable importance measures, including those that rely on computationally intensive permutation procedures [22]. In the context of genetics studies, SNPs with the highest importance score are preferred candidates for further exploration. In the literature, this approach has been successfully used as a prescreening step to prioritise predictive SNPs [26].
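
Accumulating the information gain importance score amounts to summing *ϕ*(*j*) over every split at which a SNP was chosen, across all trees; a toy sketch (hypothetical SNP ids):

```python
# Sketch of the information-gain importance score: accumulate phi(j)
# for the SNP chosen at every split, over the whole forest.
from collections import defaultdict

def accumulate_importance(splits):
    """splits: iterable of (snp_id, phi) pairs collected across all trees."""
    scores = defaultdict(float)
    for snp, phi in splits:
        scores[snp] += phi
    return dict(scores)

scores = accumulate_importance([("rs1", 3.0), ("rs2", 1.0), ("rs1", 2.0)])
ranked = sorted(scores, key=scores.get, reverse=True)   # best candidates first
```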

### Hadoop implementation

RF implementations generally build trees sequentially. However, a sequential approach is highly inefficient, especially when each tree involves a large number of SNPs, and many trees are needed in order to obtain reliable measures of SNP importance and estimates of prediction error. RF can be easily parallelised because all trees are independently learned from randomised versions of the data. We describe here a parallel version of the RF regression algorithm that we have implemented using the MapReduce programming model for deployment on large Hadoop clusters. Broadly speaking, the approach consists in letting each node in the cluster build a certain number of trees in the forest, and then letting the system collect and aggregate the partial results from all trees in the ensemble, in an automated and fault-tolerant fashion.

The MapReduce model involves three phases: the *map* phase, the *shuffle* phase and the *reduce* phase. Both the *map* and *reduce* phases take key-value pairs as input and produce key-value pairs as output. The *shuffle* phase, handled by the MapReduce library, distributes the output of the *map* phase evenly to the input of the *reduce* phase. The *map* phase runs a user-defined *mapper* function on a set of key-value pairs [*k*_{j}, *v*_{j}] taken as input, and generates a set of intermediate key-value pairs. In the map phase of our application, each input key corresponds to a unique tree ID, and its value is NULL since we load the full data set to build the trees. A user-defined number of mappers, nmap, are executed, whereby each mapper function learns one or more decision trees from bootstrapped versions of the data set. The output of the map phase consists of three types of information: (1) a sample identifier (key) and its predicted value, which is used to estimate the OOB error rate in the reduce phase; (2) a SNP identifier (key) and the decrease in sum of squares (value), which is used to obtain the SNP importance scores in the reduce phase; (3) a sample-pair identifier (key) and its proximity (value), which is used to produce the final proximity matrix extracted from the RF. All these outputs from the mappers are sorted, shuffled, and copied to the reducers by Hadoop. An illustration of this initial process is given in Figure 1. The Hadoop job initially distributes the data set to each map task using the *DistributedCache* mechanism, which copies the read-only files onto the slave nodes before launching a job. In our implementation, each map task loads the full version of the data set, which is then bootstrapped and used to learn each tree. In this illustration, Ntree = 3, and each one of the three mappers builds one tree. When the number of required trees is larger than the total number of mappers, each mapper builds more than one tree.
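
The three mapper output types can be sketched as follows; the record layout and key names here are illustrative assumptions, not the exact wire format of the implementation.

```python
# Toy mapper in the spirit described above (hypothetical key names):
# after growing its tree(s), a mapper emits keyed records of three kinds
# for the shuffle and reduce phases.

def mapper(tree_id, oob_predictions, split_gains, proximities):
    """Yield (key, value) records: OOB prediction per sample, decrease in
    sum of squares per SNP, and proximity per sample pair."""
    for sample, pred in oob_predictions.items():
        yield ("sample", sample), pred
    for snp, gain in split_gains.items():
        yield ("snp", snp), gain
    for pair, prox in proximities.items():
        yield ("pair", pair), prox

records = list(mapper(0, {7: 1.5}, {"rs1": 3.0}, {(3, 7): 1}))
```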

The output from all the mapper functions, consisting of key-value pairs, is then sorted, shuffled and copied to the *reduce* tasks, which receive their input in the form of [*k*_{j}, [*v*_{j1}, *v*_{j2}, ...]] pairs, where the first element is a SNP or sample identifier and the second element is a list of values associated with that SNP or sample. The reduce tasks run a user-defined reducer function, and generate an output, again in the form of key-value pairs, to be saved on file. The reduce tasks compute the information gain importance score by summing up all the *ϕ*(*j*) evaluations obtained by the individual trees in the map phase. Again, the computations are equally distributed across reducers. For instance, in the bottom part of Figure 1, each reducer generates a partial list of key-value pairs containing a SNP id and its information gain importance score. These lists are saved on file and eventually joined to obtain the final output.
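
A matching toy reducer, again with illustrative record shapes: importance contributions for a SNP are summed, while a sample's OOB predictions are averaged before the error estimate is formed.

```python
# Toy reducers operating on [key, [v1, v2, ...]] inputs, as described above.
# These are simplified stand-ins for the user-defined reducer functions.

def reduce_importance(key, values):
    """Sum the phi(j) contributions for one SNP across all trees."""
    return key, sum(values)

def reduce_oob(key, values):
    """Average the OOB predictions collected for one sample."""
    return key, sum(values) / len(values)

k, total = reduce_importance(("snp", "rs1"), [3.0, 2.0, 1.0])
s, avg = reduce_oob(("sample", 7), [1.0, 3.0])
```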

This parallel RF regression algorithm has been implemented in Java. It can be run in standalone, pseudo-distributed, and fully distributed modes. The current version, which supports biallelic markers, gives users the option to calculate the OOB error, variable importance scores, and the sample proximity matrix. Both the standard and distance-based splitting criteria are available, and the Euclidean distance featuring in (3) could easily be replaced by other distances. For the standard RF version, both the Euclidean and Mahalanobis distances are currently available. An installation and user guide is available from the web site.

### The Alzheimer's Disease Neuroimaging Initiative cohort

This work was motivated by experimental data produced by the Alzheimer's Disease Neuroimaging Initiative (ADNI). Alzheimer's Disease (AD) is a moderately to highly heritable condition, and a growing list of genetic variants have been associated with greater susceptibility to develop early- and late-onset AD [27]. We have obtained genotypes for 253 unrelated subjects comprising 99 AD patients and 154 elderly healthy controls (CN). Genotyping was performed using the Human610-Quad BeadChip, which includes 620,901 SNPs [28]. Subjects are unrelated, all of European ancestry, and passed screening for evidence of population stratification using the procedure described in [19]. For this study, we include only autosomal SNPs, and additionally exclude SNPs with a genotyping rate < 95%, a Hardy-Weinberg equilibrium p-value < 5 × 10^{-7}, and a minor allele frequency < 0.1. Missing genotypes were imputed as in [29].
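
The SNP inclusion criteria above can be expressed as a simple predicate; this sketch is illustrative, with the thresholds being those quoted in the text.

```python
# Sketch of the SNP quality-control filter described above: a SNP is kept
# only if it is autosomal and clears all three thresholds.

def passes_qc(call_rate, hwe_p, maf, autosomal):
    return (autosomal
            and call_rate >= 0.95     # genotyping rate at least 95%
            and hwe_p >= 5e-7         # Hardy-Weinberg equilibrium p-value
            and maf >= 0.1)           # minor allele frequency

keep = passes_qc(call_rate=0.99, hwe_p=1e-3, maf=0.25, autosomal=True)
drop = passes_qc(call_rate=0.99, hwe_p=1e-3, maf=0.05, autosomal=True)
```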

For each subject in this study, longitudinal brain scans at 6, 12 and 24 months after the initial screening were available. Our multivariate quantitative trait provides a measure of structural change observed in the brain relative to baseline over the three time points. More specifically, each individual phenotype vector *y*_{i} consists of 148,023 slope coefficients, one for each voxel, quantifying the temporal rate of linear brain tissue loss over time, and therefore providing a localised imaging signature of the disease. A more detailed description of the preprocessing steps and the procedure used to extract these imaging signatures can be found in [30].
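
Each trait entry is a least-squares slope of tissue change against time, fitted independently per voxel; a minimal sketch, assuming the three follow-up times in months and illustrative measurement values:

```python
# Sketch of the per-voxel trait: the ordinary least-squares slope of a
# voxel's measurements against scan time (values here are made up).

def slope(times, values):
    """OLS slope of values on times for one voxel."""
    n = len(times)
    tbar = sum(times) / n
    vbar = sum(values) / n
    num = sum((t - tbar) * (v - vbar) for t, v in zip(times, values))
    den = sum((t - tbar) ** 2 for t in times)
    return num / den

# Tissue change at months 6, 12 and 24 relative to baseline:
b = slope([6.0, 12.0, 24.0], [0.0, -1.0, -3.0])   # negative: tissue loss
```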