- Methodology article
- Open Access
Multi-model inference using mixed effects from a linear regression based genetic algorithm
- Koen Van der Borght^{1, 2},
- Geert Verbeke^{2, 3} and
- Herman van Vlijmen^{1}
https://doi.org/10.1186/1471-2105-15-88
© Van der Borght et al.; licensee BioMed Central Ltd. 2014
- Received: 11 October 2013
- Accepted: 21 March 2014
- Published: 27 March 2014
Abstract
Background
Different high-dimensional regression methodologies exist for the selection of variables to predict a continuous variable. To improve the variable selection when clustered observations are present in the training data, an extension towards mixed-effects modeling (MM) is needed, but may not always be straightforward to implement.
In this article, we developed such an MM extension (GA-MM-MMI) for the automated variable selection by a linear regression based genetic algorithm (GA) using multi-model inference (MMI). We exemplify our approach by training a linear regression model for prediction of resistance to the integrase inhibitor Raltegravir (RAL) on a genotype-phenotype database, with many integrase mutations as candidate covariates. The genotype-phenotype pairs in this database were derived from a limited number of subjects, with multiple data points from the same subject, and with an intra-class correlation of 0.92.
Results
In generating the RAL model, we took computational efficiency into account by optimizing the GA parameters one by one, and by using tournament selection. To derive the main GA parameters we used 3 times 5-fold cross-validation. The number of integrase mutations to be used as covariates in the mixed effects models was 25 (chrom.size). A GA solution was found when R^{2}_{MM} > 0.95 (goal.fitness). We tested three different MMI approaches to combine the results of 100 GA solutions into one GA-MM-MMI model. When evaluating the GA-MM-MMI performance on two unseen data sets, a more parsimonious and interpretable model was found (GA-MM-MMI TOP18: mixed-effects model containing the 18 most prevalent mutations in the GA solutions, refitted on the training data) with better predictive accuracy (R^{2}) in comparison to GA-ordinary least squares (GA-OLS) and Least Absolute Shrinkage and Selection Operator (LASSO).
Conclusions
We have demonstrated improved performance when using GA-MM-MMI for selection of mutations on a genotype-phenotype data set. As we largely automated setting the GA parameters, the method should be applicable on similar datasets with clustered observations.
Keywords
- Variable selection
- Linear regression
- Genetic algorithm
- Mixed-effects model
- Multi-model inference
Background
In recent studies, classical regression methods for prediction of a continuous variable from a large number of covariates have been extended for the training of a model when the data set is hierarchical in nature [1–4]. In this article we extend our genetic algorithm (GA) variable selection methodology in [5] to allow for clustering in the data. We compare the performance of multi-model inference (MMI) using restricted maximum likelihood (REML) mixed-effects modeling [6, 7] (MM) with ordinary least squares regression [8] (OLS) and compare GA-MMI with the commonly used penalized regression method Least Absolute Shrinkage and Selection Operator [9] (LASSO). We also show how to optimally set the GA parameters.
As an example, the training of a linear regression model for prediction of Raltegravir (RAL) resistance (“phenotype”) from mutations in the HIV integrase region (“genotype”) is worked out. The data sets used for training and testing were described in more detail in [5]. The training set consisted of n = 991 clonal genotype-phenotype measurements, from multiple clones derived from 153 clinical isolates (on average 5 to 6 clones per isolate) and repeated measurements (on average 3) from 28 site-directed mutants (in-vitro lab created clones with a designed mutational pattern), and the number of candidate mutations for selection was p = 322. Two test sets were used: the first consisted of population data of 171 clinical isolates (test set 1), the second consisted of 67 integrase site-directed mutants containing most of the known RAL resistance associated mutational patterns [10] (test set 2). As it was found in [5] that a second order model did not significantly outperform a first order model, we did not consider interaction terms.
The paper is organized as follows. We begin by recalling the Simple Genetic Algorithm for variable selection in OLS linear regression. Then, we introduce GA-MM as an extension for clustered data. Finally, we introduce MMI for estimation of the model parameters, combining the results from multiple GA-MM (or GA-OLS) runs, followed by a short section on how we applied the LASSO method for comparison. In the remainder of the paper, we illustrate our methodology on an example for the predictive modeling of RAL resistance. For this example, we describe in detail how we optimized the GA parameter settings, and we report the results of comparing GA-MM-MMI with LASSO and GA-OLS-MMI. When nominating one ‘best’ model, from all models evaluated in the comparison, we chose the GA-MM-MMI TOP18 model as a sparse model with high biological relevance (17 out of 18 integrase mutations in this model have been confirmed to be associated with resistance [5]), and having better predictive accuracy than LASSO and GA-OLS-MMI models with equal number of mutations selected. Throughout the text of this article GA related terminology is written in italic.
Methods
GA-OLS
Table 1 Simple genetic algorithm
Step | Description |
---|---|
1 | Initialize a random population of pop.size individuals, goto step 4. |
2 | Select the more fit individuals to form a new population. |
3 | Modify genetic material of the individuals in this new population by applying genetic operators: mutation and cross-over. |
4 | Evaluate fitness of the population. If no solution found goto step 2, else end. |
In step 3 of Table 1, the mutation genetic operator alters a gene (replacing it with another gene from the pool of candidate genes) in a chromosome with probability Pm. The crossover genetic operator re-combines the genotypes of two individuals. The probability of an individual to be selected for crossover is Pc. The key in the optimization is to keep a good balance between selective pressure (Table 1 step 2) and genetic diversity (Table 1 step 3). The GA run is completed when an individual is found with fitness > goal fitness. When no solution is found within a maximum number of generations (max.generations), the GA run is halted. For step 2 of Table 1, we used tournament selection as detailed in Section II (Results and discussion). Also, elitism is used, meaning that the best chromosome (highest R^{2}) is passed through to the next generation, with a probability Pe.
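The two genetic operators in step 3 of Table 1 can be sketched as follows. This is a minimal illustration, not the GALGO implementation: it assumes that a chromosome is a list of candidate-variable indices, that mutation replaces one randomly chosen gene with probability Pm, and that crossover is one-point. The function names `mutate` and `crossover` are hypothetical.

```python
import random


def mutate(chromosome, n_candidates, pm, rng):
    """With probability pm, replace one randomly chosen gene by an unused candidate.

    Assumes n_candidates is larger than the chromosome length, so that at
    least one unused candidate index is available.
    """
    result = list(chromosome)
    if rng.random() < pm:
        position = rng.randrange(len(result))
        unused = [g for g in range(n_candidates) if g not in result]
        result[position] = rng.choice(unused)
    return result


def crossover(parent_a, parent_b, rng):
    """One-point crossover: swap the tails of two equally long chromosomes.

    Note: a child may contain a gene twice if both parents share it; a real
    implementation would need to handle such duplicates.
    """
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


rng = random.Random(0)
print(mutate([0, 1, 2], n_candidates=10, pm=1.0, rng=rng))
print(crossover([0, 1, 2, 3], [4, 5, 6, 7], rng))
```

In a full GA run, crossover would be applied to individuals selected with probability Pc, and mutation to the resulting chromosomes, before the fitness evaluation in step 4.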
The running of the GA is done multiple times to generate a set S of solutions. A ranking by importance can then be made for all variables based on their frequency in S.
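The ranking by frequency can be illustrated with a short sketch. It assumes that each GA solution is available as a list of variable names; the function name `rank_by_frequency` and the variable labels in the example are hypothetical.

```python
from collections import Counter


def rank_by_frequency(solutions):
    """Rank variables by the fraction of GA solutions in which they occur.

    Ties are broken alphabetically so that the ranking is reproducible.
    """
    counts = Counter(v for solution in solutions for v in set(solution))
    n_solutions = len(solutions)
    ranking = [(v, c / n_solutions) for v, c in counts.items()]
    return sorted(ranking, key=lambda item: (-item[1], item[0]))


# Toy set S of three GA solutions with made-up variable labels.
S = [["mutA", "mutB"], ["mutA", "mutC"], ["mutA", "mutB", "mutD"]]
for variable, frequency in rank_by_frequency(S):
    print(variable, round(frequency, 2))
```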
GA-MM
Although OLS parameter estimates are known to be unbiased when neglecting the correlation structure [6], in this article we want to evaluate whether using a mixed model for the GA models, using a random subject effect in addition to the fixed effects (variables as in the OLS model), can improve the interpretability or performance of the final linear regression model, derived with MMI (next section).
The GA-MM methodology makes use of the Simple Genetic Algorithm (Table 1), completely analogous to GA-OLS, producing a ranking of variables by their frequency in a set S of GA solutions. However, unlike for OLS, there is no single commonly used definition of the R^{2} statistic for mixed models [16, 17]. Several definitions have been suggested that all have different interpretations in the presence of correlated errors. Here, we used the marginal R^{2}_{MM} definition from [18], quantifying the variance explained by the fixed effects. As new data will originate from other subjects than those used for the training of the model, the random effects cannot be used for prediction. In [1] it has also been described that conditional R^{2} (variance explained by the entire model, including the random effects) should not be used for fixed-effect variable selection. For us, the main motivation for using R^{2}_{MM} was that the MM can be fitted using REML, resulting in better estimates for the variance components, needed in the estimation of the fixed effects, especially in models with many fixed effects [7].
If $x_{k} \notin M$: $\beta_{k} \equiv 0$.
Here, $\sigma_{\alpha}^{2}$ is the between-cluster variance, and $\sigma_{\epsilon}^{2}$ is the within-cluster variance.
The intra-class correlation $\mathit{ICC}=\frac{\sigma_{\alpha}^{2}}{\sigma_{\alpha}^{2}+\sigma_{\epsilon}^{2}}$ for the model without fixed effects was 0.92, showing very strong within-cluster correlation, and suggesting that accounting for this correlation may improve the performance of our model.
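As a small worked illustration, the ICC and a marginal R^{2} can be computed from the variance components. The sketch assumes the marginal definition of [18], in which the variance of the fixed-effects predictions is divided by the sum of that variance, the between-cluster variance and the within-cluster variance. The function names and the numerical values are hypothetical and chosen only to reproduce an ICC of 0.92.

```python
def intraclass_correlation(var_between, var_within):
    """ICC: between-cluster variance as a fraction of the total variance."""
    return var_between / (var_between + var_within)


def marginal_r2(var_fixed, var_between, var_within):
    """Marginal R^2 (assumed definition from [18]): share of the total
    variance that is explained by the fixed effects alone."""
    return var_fixed / (var_fixed + var_between + var_within)


# Made-up variance components, for illustration only.
print(intraclass_correlation(var_between=9.2, var_within=0.8))
print(marginal_r2(var_fixed=19.0, var_between=0.5, var_within=0.5))
```

Because the random effects do not enter the numerator, a model can only reach a high marginal R^{2} through its fixed effects, which is the property needed when predictions are made for new subjects.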
GA-MMI
In [19, 20] it has been described that, when the number of samples in the training data is small, making inference from a single best model, e.g., produced with stepwise regression, leads to the inclusion of noise variables. Here, we used MMI to combine the information from the GA solutions into a final model for making predictions. As a GA run is stopped as soon as the goal fitness (calculated in section VI (Results and discussion)) is achieved (Table 1, step 4), GA solutions were ‘equally fit’. Thus, we used equal weighting of the GA solutions in the MMI. In [6] it was shown that for stepwise regression using an information criterion for selection – as we used in [5] for deriving a consensus model from the GA ranking of variable frequencies – one should for MM use the biased ML estimators. An advantage of using MMI in combination with GA-MM is that REML can still be used. Thus, using MMI, we could make a fair comparison between GA-OLS and GA-MM.
- 1. Refitting for a TOP selection of the GA ranking: from the GA ranking, the variables with highest frequencies were retained for the final model, which was then refitted using OLS/MM.
- 2. Averaging of parameter estimates $\hat{\beta}_{k}$ using all GA solutions ($\hat{\beta}_{k} \equiv 0$ if $x_{k}$ is not in the GA solution) (MMI1): $\bar{\beta}_{k}=\frac{\sum_{s=1}^{|S|}\hat{\beta}_{ks}}{|S|},$
- 3. Averaging of parameter estimates $\hat{\beta}_{k}$ using only the GA solutions where $\hat{\beta}_{k} \ne 0$ (MMI2): $\bar{\beta}_{k}=\frac{\sum_{s=1}^{|S|}\hat{\beta}_{ks}}{\left|\left\{M \in S \mid x_{k} \in M\right\}\right|}.$
For the model averaging in 2 and 3, parameters $\hat{\beta}_{k}$ were (re-)fitted using OLS/MM for all m variables present at least once in a GA solution, or only for a TOP selection of variables in the GA ranking.
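The difference between the two averaging schemes can be made concrete with a short sketch. It assumes that the fitted coefficients of each GA solution are already available as a dictionary from variable name to estimate; the fitting step itself (OLS or MM) is not shown, and the function name `average_estimates` and the toy values are hypothetical.

```python
def average_estimates(solutions):
    """Return (MMI1, MMI2) averaged coefficients with equal model weights.

    MMI1 divides the summed estimates by the number of GA solutions |S|,
    treating a variable that is absent from a solution as having estimate 0.
    MMI2 divides by the number of solutions that contain the variable.
    """
    n_solutions = len(solutions)
    totals, counts = {}, {}
    for coefficients in solutions:
        for name, beta in coefficients.items():
            totals[name] = totals.get(name, 0.0) + beta
            counts[name] = counts.get(name, 0) + 1
    mmi1 = {name: total / n_solutions for name, total in totals.items()}
    mmi2 = {name: total / counts[name] for name, total in totals.items()}
    return mmi1, mmi2


# Toy example: x1 is in every solution, x2 and x3 in one solution each.
S = [{"x1": 1.0, "x2": 0.5}, {"x1": 2.0}, {"x1": 3.0, "x3": 0.3}, {"x1": 2.0}]
mmi1, mmi2 = average_estimates(S)
print(mmi1["x2"], mmi2["x2"])  # 0.125 0.5
```

A rarely selected variable such as x2 is shrunk towards zero by MMI1 but keeps its full estimate under MMI2, so MMI2 gives more weight to variables that appear in only a few solutions.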
LASSO
LASSO [9] is a regularization method that performs variable selection by constraining the size of the coefficients, also called shrinkage. By applying an L1 absolute value penalty, regression coefficients are ‘shrunk’ towards zero, forcing some of the regression coefficients to be exactly zero. Using the R package glmnet 1.9-3 [21], for the described example in this paper we performed variable selection using the LASSO technique on the clonal genotype-phenotype database, returning a LASSO ranking of variables (solution path) as selected by decreasing the amount of penalty applied. Besides using the shrinkage coefficients for variable estimation (default LASSO), we also applied OLS and MM to the LASSO selected variables (post-LASSO [22]).
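How an L1 penalty sets coefficients exactly to zero can be seen in the special case of an orthonormal design, where the LASSO estimate is the soft-thresholded OLS estimate (with the penalty scaled accordingly). The sketch below shows only this special case and is not how glmnet computes its solution path; the function name `soft_threshold` and the example values are hypothetical.

```python
def soft_threshold(beta_ols, penalty):
    """Shrink an OLS estimate towards zero by `penalty`, truncating at zero."""
    if beta_ols > penalty:
        return beta_ols - penalty
    if beta_ols < -penalty:
        return beta_ols + penalty
    return 0.0


# Made-up OLS estimates: small effects are removed, large ones are shrunk.
for beta in (2.0, -0.3, 0.4, -2.0):
    print(beta, soft_threshold(beta, penalty=0.5))
```

Lowering the penalty lets more variables enter the model, which is what produces the ranking of variables along the solution path.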
Results and discussion
GA parameter settings
We optimized the GA parameters one by one in the order described below (I to VI), taking computational efficiency into account (see Additional file 1). Tournament selection was used as the selection method to form a new population of more fit individuals. GA parameters Pm and Pc were optimized together using a meta-GA. Pe and pop.size were fixed in advance and were not optimized. Pe was set to conserve the best chromosome in three consecutive generations, followed by a generation where the probability of keeping the best chromosome was set to 20%. Pop.size was set equal to 20. To set the main GA parameters (max.generations, chrom.size, and num.runs), we used cross-validation (Additional file 1 point 7).
For running the GA, we used the R package GALGO 1.0.11 [23]. After inspection of the R^{2}_{CV} results, with the exception of goal.fitness, we took the same optimized GA parameter values for GA-OLS and GA-MM (for the model comparison): pop.size = 20, chrom.size = 25, Pm = 0.1, Pc = 0.6, Pe = (1,1,1,0.2), max.generations = 500, tournament.size = 10, num.solutions = 100, goal.fitness.ols = 0.957, and goal.fitness.mm = 0.95. Additional file 2 contains the R code we used to derive these settings and to run GA-MM-MMI.
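The settings above, and the way the elitism vector Pe = (1,1,1,0.2) can be read, are sketched below. This is not GALGO code: the dictionary simply restates the reported values, and the sketch assumes that the Pe vector is cycled over the generations, with generations counted from 0. The names `GA_SETTINGS`, `elitism_probability` and `keep_elite` are hypothetical.

```python
import random

# Reported GA settings, restated as a plain dictionary.
GA_SETTINGS = {
    "pop.size": 20,
    "chrom.size": 25,
    "Pm": 0.1,
    "Pc": 0.6,
    "Pe": (1, 1, 1, 0.2),
    "max.generations": 500,
    "tournament.size": 10,
    "num.solutions": 100,
    "goal.fitness.ols": 0.957,
    "goal.fitness.mm": 0.95,
}


def elitism_probability(generation, schedule):
    """Probability of keeping the best chromosome in this generation.

    Assumption: the schedule repeats, so three generations with certain
    elitism are followed by one generation with probability 0.2.
    """
    return schedule[generation % len(schedule)]


def keep_elite(generation, schedule, rng):
    """Draw whether the best chromosome is passed on unchanged."""
    return rng.random() < elitism_probability(generation, schedule)


rng = random.Random(1)
print([elitism_probability(g, GA_SETTINGS["Pe"]) for g in range(8)])
print(keep_elite(0, GA_SETTINGS["Pe"], rng))
```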
I. Meta-GA for selection of Pm and Pc
Meta-GA optimization of Pm and Pc
GA | CHOSEN | PRE-SET | BEING OPTIMIZED |
---|---|---|---|
GA-OLS | pop.size = 20; Pe = (1,1,1,0.2) | chrom.size = 15; max.generations = 100; num.runs = 1; goal.fitness = 1; tournament.size = 10 | Pm ∈ [0,1]; Pc ∈ [0,1] |
metaGA | pop.size_{meta} = 20; num.generations_{meta} = 100; Pm_{meta} = 0.01; Pc_{meta} = 0.9; Pe_{meta} = 0.4 | | |
Crossover was a fairly weak genetic operator, as can be seen from the red band in Figure 1. In contrast, the mutation genetic operator was a strong operator and was best taken in the range [0.1, 0.4]. The meta-GA converged at (Pm,Pc) = (0.258,0.372). For further evaluation in Section II, we also selected (0.1,0.6) and (0.2,0.6), located in the largest dark red area in Figure 1 (R^{2} > 0.91).
II. Tournament selection
Tournament selection [15, 25] is a selection method to bias the selection towards the more fit individuals. Pop.size tournaments are organized with k randomly selected chromosomes. The winner of a tournament is the chromosome with the best fitness (highest R^{2}). The pop.size tournament winners become the new population. Selection pressure, the degree to which better individuals are favoured, is increased when the tournament size is increased, as the winner from a large tournament will, on average, have a higher fitness than the winner of a small tournament.
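The procedure described above can be sketched in a few lines. The sketch assumes that the k contenders of a tournament are drawn without replacement from the current population and that fitness values (e.g., R^{2}) have already been computed; the function name `tournament_selection` and the toy values are hypothetical.

```python
import random


def tournament_selection(population, fitness, tournament_size, rng):
    """Run len(population) tournaments; each winner enters the new population."""
    new_population = []
    for _ in range(len(population)):
        # Draw k distinct contenders and keep the one with the highest fitness.
        contenders = rng.sample(range(len(population)), tournament_size)
        winner = max(contenders, key=lambda i: fitness[i])
        new_population.append(population[winner])
    return new_population


rng = random.Random(42)
population = ["chrom_a", "chrom_b", "chrom_c", "chrom_d"]
fitness = [0.10, 0.90, 0.50, 0.30]
print(tournament_selection(population, fitness, tournament_size=2, rng=rng))
```

With a tournament size of 1 the selection is purely random, while a tournament size equal to pop.size always returns the single best chromosome, which illustrates how selection pressure grows with the tournament size.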
GA parameter settings to evaluate tournament.size and ( Pm , Pc )
GA | CHOSEN | PRE-SET | BEING OPTIMIZED |
---|---|---|---|
GA-OLS | pop.size = 20; Pe = (1,1,1,0.2) | chrom.size = 15; max.generations = 100; num.runs = 10; goal.fitness = 1 | tournament.size ∈ {1,…,20}; (Pm,Pc) ∈ {(0.1,0.6); (0.2,0.6); (0.258,0.372); (0.05,0.7)} |
III. Maximum number of generations
GA parameter settings to evaluate max.generations
GA | CHOSEN | PRE-SET | BEING OPTIMIZED |
---|---|---|---|
GA-OLS, GA-MM | pop.size = 20; Pe = (1,1,1,0.2); Pm = 0.1; Pc = 0.6; tournament.size = 10 | chrom.size = 15; num.runs = 10; goal.fitness = 1 | max.generations ∈ {100,200,300,400,500} |
IV. Chromosome size
GA parameter settings to evaluate chrom.size
GA | CHOSEN | PRE-SET | BEING OPTIMIZED |
---|---|---|---|
GA-OLS, GA-MM | pop.size = 20; Pe = (1,1,1,0.2); Pm = 0.1; Pc = 0.6; tournament.size = 10; max.generations = 400 | num.runs = 10; goal.fitness = 1 | chrom.size ∈ {5,10,15,20,25,30} |
V. Number of GA runs
GA parameter settings to evaluate num.runs
GA | CHOSEN | PRE-SET | BEING OPTIMIZED |
---|---|---|---|
GA-OLS, GA-MM | pop.size = 20; chrom.size = 25; Pe = (1,1,1,0.2); Pm = 0.1; Pc = 0.6; tournament.size = 10; max.generations = 400 | goal.fitness = 1 | num.runs ∈ {1,10,20,50,100,500} |
VI. Goal fitness
which equals goal.fitness = 0.957 for GA-OLS and goal.fitness = 0.95 for GA-MM. For the calculation we used the R package tolerance 0.5.2 [27].
GA-OLS vs. GA-MM: variable selection
GA-OLS and GA-MM variable selection were performed on the clonal genotype-phenotype database using the GA parameters as specified in the above sections. The percentage of runs that failed reaching the goal.fitness with max.generations = 500 was 16% and 23.1% for GA-OLS and GA-MM, respectively.
GA-OLS and GA-MM variable selection vs. LASSO
When we compared the GA-OLS ranking with the GA-MM ranking (Figures 8 and 9), GA-OLS gave a relatively low ranking to, e.g., 140A and 155S, which favours GA-MM in terms of interpretability.
GA-OLS-MMI vs. GA-MM-MMI vs. LASSO: R^{2} performance on test set 1 and test set 2
R^{2} performance on test set 1
Variable selection: variable estimation | LASSO: Coef (shrinkage) | LASSO: OLS | LASSO: MM | GA-OLS: OLS | GA-OLS: MM | GA-OLS: MMI1-OLS | GA-OLS: MMI1-MM | GA-OLS: MMI2-OLS | GA-OLS: MMI2-MM | GA-MM: OLS | GA-MM: MM | GA-MM: MMI1-OLS | GA-MM: MMI1-MM | GA-MM: MMI2-OLS | GA-MM: MMI2-MM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TOP15 variables | 0.815 | 0.816 | 0.827 | 0.830 | 0.833 | 0.832 | 0.838 | 0.827 | 0.831 | 0.834 | 0.835 | 0.834 | 0.839 | 0.829 | 0.832 |
TOP18 variables | 0.816 | 0.818 | 0.827 | 0.830 | 0.831 | 0.833 | 0.838 | 0.825 | 0.829 | 0.832 | 0.835 | 0.835 | 0.839 | 0.829 | 0.832 |
TOP21 variables | 0.819 | 0.825 | 0.835 | 0.821 | 0.825 | 0.836 | 0.839 | 0.819 | 0.824 | 0.824 | 0.826 | 0.834 | 0.838 | 0.819 | 0.824 |
TOP24 variables | 0.820 | 0.822 | 0.824 | 0.819 | 0.824 | 0.837 | 0.840 | 0.818 | 0.824 | 0.820 | 0.821 | 0.834 | 0.837 | 0.817 | 0.821 |
TOP27 variables | 0.827 | 0.817 | 0.818 | 0.827 | 0.829 | 0.839 | 0.841 | 0.822 | 0.827 | 0.814 | 0.820 | 0.835 | 0.838 | 0.814 | 0.819 |
TOP30 variables | 0.828 | 0.812 | 0.817 | 0.821 | 0.822 | 0.838 | 0.840 | 0.819 | 0.823 | na^{c} | na^{c} | na^{c} | na^{c} | na^{c} | na^{c} |
ALL m variables ^{ a } | 0.826 | 0.795 | 0.811 | na^{b} | na^{b} | 0.840 | 0.841 | 0.701 | 0.725 | na^{b} | na^{b} | 0.838 | 0.839 | 0.713 | 0.725 |
Number of variables in the ALL models: m = 51 (LASSO), m = 193 (GA-OLS), m = 200 (GA-MM).
R^{2} performance on test set 2
Variable selection: variable estimation | LASSO: Coef (shrinkage) | LASSO: OLS | LASSO: MM | GA-OLS: OLS | GA-OLS: MM | GA-OLS: MMI1-OLS | GA-OLS: MMI1-MM | GA-OLS: MMI2-OLS | GA-OLS: MMI2-MM | GA-MM: OLS | GA-MM: MM | GA-MM: MMI1-OLS | GA-MM: MMI1-MM | GA-MM: MMI2-OLS | GA-MM: MMI2-MM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TOP15 variables | 0.667 | 0.734 | 0.712 | 0.707 | 0.707 | 0.708 | 0.708 | 0.709 | 0.710 | 0.709 | 0.702 | 0.705 | 0.696 | 0.706 | 0.698 |
TOP18 variables | 0.690 | 0.731 | 0.713 | 0.721 | 0.718 | 0.716 | 0.714 | 0.722 | 0.719 | 0.768 | 0.770 | 0.742 | 0.742 | 0.747 | 0.750 |
TOP21 variables | 0.742 | 0.760 | 0.765 | 0.736 | 0.730 | 0.722 | 0.717 | 0.732 | 0.726 | 0.777 | 0.775 | 0.746 | 0.744 | 0.751 | 0.752 |
TOP24 variables | 0.745 | 0.771 | 0.768 | 0.732 | 0.728 | 0.720 | 0.716 | 0.727 | 0.723 | 0.762 | 0.761 | 0.743 | 0.740 | 0.748 | 0.749 |
TOP27 variables | 0.767 | 0.788 | 0.788 | 0.721 | 0.725 | 0.720 | 0.717 | 0.732 | 0.726 | 0.770 | 0.768 | 0.744 | 0.741 | 0.758 | 0.755 |
TOP30 variables | 0.777 | 0.789 | 0.787 | 0.768 | 0.772 | 0.731 | 0.729 | 0.747 | 0.743 | na^{c} | na^{c} | na^{c} | na^{c} | na^{c} | na^{c} |
ALL m variables ^{ a } | 0.787 | 0.770 | 0.776 | na^{b} | na^{b} | 0.733 | 0.729 | 0.741 | 0.733 | na^{b} | na^{b} | 0.747 | 0.745 | 0.754 | 0.749 |
Number of variables in the ALL models: m = 51 (LASSO), m = 193 (GA-OLS), m = 200 (GA-MM).
On test set 1, using MM for the variable estimation had a slightly better R^{2} performance than using OLS, for all models considered. Note that this was not the case in the cross-validation (section V) where OLS R^{2}_{CV} performance was higher, possibly due to the inclusion of multiple clinical isolates from the same patient. However, as patient information was not given for the training set, we could not take this into account. For the TOP15/TOP18 models containing the smallest number of variables, the best performance was seen for GA-MM-MMI1 (R^{2} = 0.839). For the TOP21 to ALL models with more variables considered, the best performance was seen for GA-OLS-MMI1 (R^{2} = 0.839-0.841). When estimating ALL GA-OLS/GA-MM variables, the worst performance was seen for MMI2 (R^{2} = 0.701-0.725) where noise variables were clearly overweighted. For LASSO, the best R^{2} performance on test set 1 was obtained using MM for the variable estimation for the TOP15 to TOP24 selection of variables (R^{2} = 0.824-0.835). For LASSO TOP27 to ALL, the best R^{2} performance was obtained using the LASSO shrinkage coefficients (R^{2} = 0.826-0.828).
On test set 2, for the sparse models the best performance was observed for LASSO-OLS TOP15 (R^{2} = 0.734), GA-MM-MM TOP18 (R^{2} = 0.770), and GA-MM-OLS TOP21 (R^{2} = 0.777). For the TOP21 to ALL models, the best performance was seen for LASSO (R^{2} = 0.771-0.789). In contrast to the results for test set 1, the MMI2 R^{2} performance was now found to be higher than for MMI1, for the GA-OLS/MM models. The reason is that while test set 1 consisted of clinical samples, with 82.5% not containing any of the primary RAL resistance mutations [5], test set 2 consisted of site-directed mutants containing most of the known resistance patterns but lacking any noise variables as found in clinical samples. Nevertheless, on test set 2, the GA-MM R^{2} values were found to be better than for GA-OLS, confirming that a better selection of variables as made by GA-MM (cf. the above two sections) led to a better performance on unseen data.
Therefore, on the example training set in this article we would favour the GA-MM-MMI TOP18 model. Based on the performance on test set 2, for the MMI variable estimation re-fitting using MM may be preferred over MMI1-MM.
Conclusions
In this article, we extended our GA variable selection methodology to mixed models which account for clustering in the data. Using cross-validation, we optimized the GA parameter settings taking also computational efficiency into account. For the worked-out example, all settings could be taken equal for GA-OLS and GA-MM, with exception of goal.fitness for which we used a marginal R^{2} definition. The model parameters for prediction could then be estimated using MMI-MM (REML) on the GA solutions obtained from 100 GA runs. When testing LASSO, GA-OLS and GA-MM on two unseen data sets, all methods had good performance. When imposing a parsimony restriction for better interpretability of the model, the GA-MM-MMI TOP18 model had better predictive accuracy (R^{2}) than GA-OLS and LASSO.
In summary, we believe that GA-MM-MMI is a direct approach to derive a sparse and interpretable model for making predictions with good accuracy on small data sets with clustered observations and a large number of candidate variables, where the chance of overfitting with standard regression techniques is high.
Availability and requirements
Project name: GA-MM-MMI.
Project home page: http://sourceforge.net/projects/ga-mm-mmi.
Operating system: Platform independent.
Programming language: R ≥ 2.15.2, perl, MATLAB.
Other requirements: requires galgo 1.0.11 [23].
License: GNU GPL.
Any restrictions to use by non-academics: none.
Declarations
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments to improve the manuscript. Financial support from the IAP research network #P7/06 of the Belgian Government (Belgian Science Policy) is gratefully acknowledged.
Authors’ Affiliations
References
- Orelien JG, Edwards LJ: Fixed-effect variable selection in linear mixed models using R^{2} statistics. Comput Stat Data An. 2008, 52: 1896-1907. 10.1016/j.csda.2007.06.006.
- Schelldorfer J, Bühlmann P, Van de Geer S: Estimation for high-dimensional linear mixed-effects models using l1-penalization. Scand J Stat. 2011, 38: 197-214. 10.1111/j.1467-9469.2011.00740.x.
- Taylor JD, Verbyla AP, Cavanagh C, Newberry M: Variable selection in linear mixed models using an extended class of penalties. Aust N Z J Stat. 2012, 54: 427-449. 10.1111/j.1467-842X.2012.00687.x.
- Hajjem A, Bellavance F, Larocque D: Mixed-effects random forest for clustered data. J Stat Comput Simul. 2012, [http://dx.doi.org/10.1080/00949655.2012.741599]
- Van der Borght K, Verheyen A, Feyaerts M, Van Wesenbeeck L, Verlinden Y, Van Craenenbroeck E, van Vlijmen H: Quantitative prediction of integrase inhibitor resistance from genotype through consensus linear regression modeling. Virol J. 2013, 10: 8. 10.1186/1743-422X-10-8.
- Verbeke G, Molenberghs G: Linear Mixed Models for Longitudinal Data. 2000, New York: Springer
- Zuur AF, Ieno EN, Walker N, Saveliev AA, Smith GM: Mixed Effects Models and Extensions in Ecology with R. 2009, New York: Springer
- Kutner MH, Nachtsheim CJ, Neter J, Li W: Applied Linear Statistical Models. 2004, New York: McGraw-Hill
- Tibshirani R: Regression Shrinkage and Selection via the Lasso. J Royal Stat Soc B. 1996, 58: 267-288.
- FDA: Isentress (raltegravir) drug label. 2009, [http://www.accessdata.fda.gov/drugsatfda_docs/label/2009/022145s004lbl.pdf]
- Affenzeller M, Winkler S, Wagner S, Beham A: Genetic Algorithms and Genetic Programming – Modern Concepts and Practical Applications. 2009, Boca Raton: CRC Press
- Butz MV: Rule-Based Evolutionary Online Learning Systems – A Principled Approach to LCS Analysis and Design. 2006, Berlin: Springer
- Hopgood AA: Intelligent Systems for Engineers and Scientists. 2001, Boca Raton: CRC Press
- Sivanandam SN, Deepa SN: Introduction to Genetic Algorithms. 2008, Heidelberg: Springer
- Michalewicz Z: Genetic Algorithms + Data Structures = Evolution Programs. 1996, New York: Springer
- Edwards LJ, Muller KE, Wolfinger RD, Quaqish BF, Schabenberger O: An R^{2} statistic for fixed effects in the linear mixed model. Stat Med. 2008, 27: 6137-6157. 10.1002/sim.3429.
- Kramer M: R^{2} statistics for mixed models. Proceedings of the 17th annual Kansas State University Conference on Applied Statistics in Agriculture: 25-27 April 2005. 2005, Manhattan, Kansas: Kansas State University, 148-160.
- Nakagawa S, Schielzeth H: A general and simple method for obtaining R^{2} from generalized linear mixed-effects models. Methods Ecol Evol. 2013, 4: 133-142. 10.1111/j.2041-210x.2012.00261.x.
- Flack VF, Chang PC: Frequency of selecting noise variables in subset regression analysis: a simulation study. Am Stat. 1987, 41: 84-86.
- Lukacs PM, Burnham KP, Anderson DR: Model selection bias and Freedman’s paradox. Ann Inst Stat Math. 2010, 62: 117-125. 10.1007/s10463-009-0234-4.
- Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010, 33: 1. R package version 1.9-3. [http://CRAN.R-project.org/package=glmnet]
- Belloni A, Chernozhukov V: Least squares after model selection in high-dimensional sparse models. Bernoulli. 2013, 19: 521-547. 10.3150/11-BEJ410.
- Trevino V, Falciani F: GALGO: an R package for multivariate variable selection using genetic algorithms. Bioinformatics. 2006, 22: 1154-1156. 10.1093/bioinformatics/btl074. R package version 1.0.11. [http://biptemp.bham.ac.uk/vivo/galgo/AppNotesPaper.htm]
- Tenorio F: gaoptim: Genetic Algorithm optimization for real-valued problems. 2013, R package version 1.0. [http://CRAN.R-project.org/package=gaoptim]
- Miller BL, Goldberg DE: Genetic algorithms, tournament selection, and the effects of noise. Complex Systems. 1995, 9: 193-212.
- Krishnamoorthy K, Mathew T: Statistical Tolerance Regions: Theory, Applications, and Computation. 2009, Hoboken, NJ: John Wiley & Sons
- Young DS: An R package for estimating tolerance intervals. J Stat Softw. 2010, 36: 5. R package version 0.5.2. [http://CRAN.R-project.org/package=tolerance]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.