
The C1C2: A framework for simultaneous model selection and assessment

Abstract

Background

There has been recent concern regarding the inability of predictive modeling approaches to generalize to new data. Some of the problems can be attributed to improper methods for model selection and assessment. Here, we have addressed this issue by introducing a novel and general framework, the C1C2, for simultaneous model selection and assessment. The framework relies on a partitioning of the data in order to separate model choice from model assessment in terms of the data used. Since the number of conceivable models is in general vast, it was also of interest to investigate the employment of two automatic search methods, a genetic algorithm and a brute-force method, for model choice. As a demonstration, the C1C2 was applied to simulated and real-world datasets. A penalized linear model was assumed to reasonably approximate the true relation between the dependent and independent variables, thus reducing the model choice problem to a matter of variable selection and choice of penalizing parameter. We also studied the impact of assuming prior knowledge about the number of relevant variables on model choice and generalization error estimates. The results obtained with the C1C2 were compared to those obtained by employing repeated K-fold cross-validation for choosing and assessing a model.

Results

The C1C2 framework performed well at finding the true model in terms of choosing the correct variable subset and producing reasonable choices for the penalizing parameter, even in situations when the independent variables were highly correlated and when the number of observations was less than the number of variables. The C1C2 framework was also found to give accurate estimates of the generalization error. Prior information about the number of important independent variables improved the variable subset choice but reduced the accuracy of generalization error estimates. Using the genetic algorithm worsened the model choice but not the generalization error estimates, compared to using the brute-force method. The results obtained with repeated K-fold cross-validation were similar to those produced by the C1C2 in terms of model choice; however, the accuracy of the generalization error estimates was lower.

Conclusion

The C1C2 framework was demonstrated to work well for finding the true model within a penalized linear model class and for accurately assessing its generalization error, even for datasets with many highly correlated independent variables, a low observation-to-variable ratio, and deviations from model assumptions. A complete separation of the model choice and the model assessment in terms of the data used for each task improves the estimates of the generalization error.

Background

A common task in computational biology/bioinformatics and computational chemistry/chemometrics (hereafter abbreviated BBCC) is to model a dependent variable from a set of independent variables; this gives insight into the workings of the process being modeled and enables prediction of future observations. Typical examples include analyzing potential drug activity through proteochemometrics and quantitative structure-activity relationship (QSAR) modeling [1-3], discovering gene regulatory binding-site modules [4], and predicting clinical outcomes of cancer from gene expression data [5]. However, recent articles have indicated that predictive modeling approaches have not fully met expectations for solving real problems. This issue has for instance been discussed in the fields of QSAR [6] and microarray gene expression data modeling [7, 8]. While some of the problems may be attributed to incorrect use and interpretation of the models, others can be ascribed to improper model selection and assessment. Our aim here is to address the latter issue by introducing the C1C2, a general framework for model choice and assessment.

Let $D = \{X_n, y_n\}$ be a dataset, where $X_n = (x'_1, ..., x'_n)'$ is an $n \times p_X$ matrix whose $i$th row, $x_i$, is the value of a $p_X$-vector of independent variables associated with $y_i$, the $i$th row of the $n \times 1$ matrix $y_n = (y_1, ..., y_n)'$. A statistical model, $\mathcal{M}_l$, can be used to characterize the relation between $X_n$ and $y_n$. In general, given the dataset $D$, $\mathcal{M}_l$ must be chosen from a set of models, $\mathcal{M} = \{\mathcal{M}_1, ..., \mathcal{M}_m, ..., \mathcal{M}_M\}$, according to some criterion (typically the minimization of a loss function). The most common way to select $\mathcal{M}_l$ from $\mathcal{M}$ in BBCC is to use K-fold cross-validation; that is, the dataset $D$ is split into $K$ mutually exclusive subsets, $D_1, ..., D_k, ..., D_K$, of approximately equal size and $\mathcal{M}_l^0(D)$ is picked to minimize the function:

$$C^0(\mathcal{M}_m) = \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n_{D_k}} L\big(y_i, \hat{y}_i(x_i, \mathcal{M}_m, D_{-k})\big), \qquad (1)$$

where $L$ is a loss function and $D_{-k}$ denotes that the $k$th subset was excluded from $D$ during the model fitting; $n_{D_i}$ is the number of observations in subset $D_i$; and $m = 1, ..., M$, where $M$ is the number of models in $\mathcal{M}$. This model selection method may seem straightforward and intuitive; however, it neglects the fact that all the data at hand is used to make the model choice. Thus, we no longer have an independent test set with which to assess the chosen model. The result is that, typically, $C^0(\mathcal{M}_l)$ underestimates the generalization error (see for instance [9]), defined as the expected prediction error over an independent test sample. This problem has been highlighted in relatively recent works [9, 10], but was noted as early as 1974 [11]. To obtain a more accurate generalization error estimate, the model selection process must be separated from the model assessment in terms of the data that is used. Ideally, if data were abundant and easily produced, we would set aside a large test dataset and use it to assess – but not to choose! – the model $\mathcal{M}_l$, and subsequent model refinements could be assessed with new, unseen data. In practice, this is often impossible since BBCC data is typically scarce and expensive to produce. The luxury of large independent test sets can thus rarely be afforded. To tackle this problem, Freyhult et al. [9] suggested using a K-fold cross-validatory assessment of an H-fold cross-validatory model choice, $\mathcal{M}_l^{\dagger}$, as a way of simultaneously choosing $\mathcal{M}_l$ and assessing its performance, thereby separating the model selection from its assessment.
The model is assessed by the function

$$C^{\dagger} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n_{D_k}} L\big(y_i, \hat{y}_i(x_i, \mathcal{M}_l^{\dagger}(D_{-k}), D_{-k})\big), \qquad (2)$$

where $\mathcal{M}_l^{\dagger}(D_{-k})$ is the cross-validatory choice of $\mathcal{M}_l$ based on $D_{-k}$; that is, the model $\mathcal{M}_l$ that minimizes the function

$$C^{\dagger}_{D_{-k}}(\mathcal{M}_m) = \frac{1}{n_{D_{-k}}} \sum_{h=1}^{H} \sum_{j=1}^{n_{D_h}} L\big(y_j, \hat{y}_j(x_j, \mathcal{M}_m, D_{-hk})\big), \qquad (3)$$

where $D_{-hk}$ denotes the dataset $D$ with the $k$th and $h$th subsets omitted. In the present work we build on and generalize this idea into the C1C2 framework. In general, the number of models in $\mathcal{M}$ is huge, so it is infeasible to go through even a small subset of them manually. Hence, for a framework such as the C1C2 to be useful in practice, automated methods for searching the model space $\mathcal{M}$ are necessary; in this sense the C1C2 is similar to the automatic modelling approaches taken in, for instance, [12-14]. Here, the specific use of the C1C2 is demonstrated by applying it together with two search methods to simulated and real-world datasets. The results are compared to those obtained by employing the function (1) for model selection and assessment. In the interest of clarity, we have restricted our attention to the study of model choice and assessment within a linear model class, $\mathcal{M}^{ridge}$ (defined below), for a quadratic loss function. We discuss the results of the demonstrations, the pros and cons of the generality of the C1C2, and set out some directions for further research.
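To make the difference between the naive criterion (1) and the nested scheme (2)-(3) concrete, the following minimal sketch (Python; not part of the original work – a ridge fit over a small grid of penalty values merely stands in for a generic model set $\mathcal{M}$) computes the optimistic $C^0$ score of the winning model and the nested K×H estimate in which the model is re-chosen on every outer training set:

```python
import numpy as np

# Illustrative sketch only; not the authors' implementation.
def kfold_indices(n, K, rng):
    """Random partition of {0,...,n-1} into K roughly equal folds."""
    idx = rng.permutation(n)
    return np.array_split(idx, K)

def fit_ridge(X, y, lam):
    """Ridge fit (toy stand-in for a model in M)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_loss(X, y, lam, folds):
    """Mean quadratic loss over the given folds."""
    losses = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        beta = fit_ridge(X[train], y[train], lam)
        losses.append((y[test] - X[test] @ beta) ** 2)
    return np.concatenate(losses).mean()

rng = np.random.default_rng(0)
n, p = 60, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=n)
lambdas = [0.01, 0.1, 1.0, 10.0]        # a toy model set
K, H = 5, 5

# Naive selection + assessment on the same folds, eq. (1):
outer = kfold_indices(n, K, rng)
c0 = {lam: cv_loss(X, y, lam, outer) for lam in lambdas}
best = min(c0, key=c0.get)
print("C0 of chosen model (optimistic):", c0[best])

# Nested scheme, eqs. (2)-(3): choose on D_{-k} only, assess on D_k.
outer_losses = []
for test in outer:
    train = np.setdiff1d(np.arange(n), test)
    inner = kfold_indices(len(train), H, rng)
    inner_scores = {lam: cv_loss(X[train], y[train], lam, inner) for lam in lambdas}
    lam_k = min(inner_scores, key=inner_scores.get)   # model chosen without D_k
    beta = fit_ridge(X[train], y[train], lam_k)
    outer_losses.append((y[test] - X[test] @ beta) ** 2)
print("Nested CV assessment:", np.concatenate(outer_losses).mean())
```

The point of the nested loop is that the observations in $D_k$ never influence which model is assessed on $D_k$.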

Results

Algorithm

Let C1, C2 ∈ C = {C1, ..., C_J}, where C is a set of model assessment criteria and C1, C2 represent two specific criteria (i.e. C1 = C_i, C2 = C_j, i, j = 1, ..., J). Further, let S ∈ S, where S is a set of search methods; let L ∈ L, where L is a set of loss functions; let G denote a sequence of data processing steps (e.g. mean-centering, transformations, whitening, etc.) and let G' contain the results of G applied to $D_{-k}$ (the roles of G and G' are exemplified in the discussion following the pseudocode below); let R be a positive integer and K an integer between 1 and n, where n is the number of observations. The C1C2 procedure is outlined with the following pseudocode:

Initiate $\mathcal{M}$, G, L, C1, C2, R, and K.

for (r in 1,..., R) {

   a. Partition data, $D = \{D_k\}$, $k = 1, ..., K$.

   for (k in 1,..., K) {

      b. Apply G to $D_{-k}$. Save results in G'.

      c. Search $\mathcal{M}$ using the data $D_{-k}$ and C2 as objective function. Assume $\mathcal{M}_l$ is found to maximize (or minimize) C2. Save $\mathcal{M}_l$.

      d. Apply G to $D_k$ using G'.

      e. Assess $\mathcal{M}_l$ using C1 and $D_k$. Save assessment result.

   }

}

The data partitioning in (a) separates data for the model choice from data for the model assessment. Note that the partitioning is dependent on the choice of C1 and does not necessarily need to be done in a cross-validation fashion. For instance, the choice C1 = ".632 estimator" [15, 16] partitions the data by independently sampling n rows from D with replacement and lets the observations not included among the sampled observations constitute the test set. The output from the C1C2 is also dependent on the choice of C1; for example, the choice C1 = C2 = Bayesian Information Criterion (BIC, see [17] and Methods) would not give a direct estimate of the generalization error, but rather an assessment of model overfitting. To clarify the roles of G and G', we give the following example: let G only contain a processing step that scales to unit variance. In (b), G is applied to $D_{-k}$ and the standard deviation of each column of $D_{-k}$ is saved in G'. In (d), G' is applied to $D_k$; that is, the columns in $D_k$ are scaled using the standard deviations calculated in (b). This treatment of G ensures that $D_k$ indeed constitutes an independent test set. The 'for loop' over r is introduced to enable calculation of confidence intervals for estimates and, by averaging the estimates over the R repetitions, it permits reduction of the variance in parameter and error (or overfitting) estimates by a factor of 1/R.
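As a minimal illustration of steps (a)-(e) and of how G and G' keep $D_k$ independent, the sketch below (Python; the names and the simple hold-out criterion used as C2 are our own placeholders, not the authors' Java library P) runs the loops over r and k, fits the preprocessing on $D_{-k}$ only, chooses a model on $D_{-k}$, and assesses it on $D_k$:

```python
import numpy as np

# Illustrative skeleton of the C1C2 pseudocode; not the authors' implementation.
def preprocess_fit(X_train):
    """Step (b): apply G (centering and scaling statistics) to D_{-k}; save G'."""
    return X_train.mean(axis=0), X_train.std(axis=0, ddof=1)

def preprocess_apply(X_any, g_prime):
    """Steps (b)/(d): transform data using only the statistics saved in G'."""
    mean, sd = g_prime
    return (X_any - mean) / sd

def search_models(X_train, y_train, lambdas, rng):
    """Step (c): placeholder search of the model space; C2 here is a hold-out error on D_{-k}."""
    n_tr, p = X_train.shape
    val = rng.choice(n_tr, size=max(1, n_tr // 4), replace=False)
    fit = np.setdiff1d(np.arange(n_tr), val)
    best_lam, best_score = None, np.inf
    for lam in lambdas:
        beta = np.linalg.solve(X_train[fit].T @ X_train[fit] + lam * np.eye(p),
                               X_train[fit].T @ y_train[fit])
        score = ((y_train[val] - X_train[val] @ beta) ** 2).mean()
        if score < best_score:
            best_lam, best_score = lam, score
    # refit the chosen model on all of D_{-k} before assessment
    beta = np.linalg.solve(X_train.T @ X_train + best_lam * np.eye(p), X_train.T @ y_train)
    return best_lam, beta

rng = np.random.default_rng(1)
n, p = 80, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 1] - X[:, 4] + rng.normal(size=n)
R, K = 3, 5
lambdas = [0.01, 0.1, 1.0]
assessments = []

for r in range(R):
    folds = np.array_split(rng.permutation(n), K)               # step (a): partition D
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        g_prime = preprocess_fit(X[train])                       # step (b): G fitted on D_{-k} only
        Xtr = preprocess_apply(X[train], g_prime)
        lam, beta = search_models(Xtr, y[train], lambdas, rng)   # step (c): choose M_l
        Xte = preprocess_apply(X[test], g_prime)                 # step (d): G' applied to D_k
        assessments.append(((y[test] - Xte @ beta) ** 2).mean()) # step (e): C1 on D_k

print("C1 estimate of the generalization error:", round(float(np.mean(assessments)), 3))
```

Any real search method (step c) and criterion C2 can be slotted into search_models without changing the surrounding loop.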

Figure 1 gives a graphical view of the C1C2 framework. We emphasize that the generality of the framework allows C1, C2, and S to be chosen to fit the problem at hand. Adequate choices of C1, C2, and S make the model selection and assessment more accurate and faster, which we will discuss below.

Figure 1

The C1C2 framework. The data partitioning in step (a) in the pseudocode separates the model choice from its assessment, which is highlighted in purple in the figure. The left side of the figure relates to steps (b) to (d) in the pseudocode, and the right side to step (e); i.e. the left side relates to choosing the model and saving the parameter estimates, and the right side to assessing the model and saving the assessment results.

Datasets used in the testing

Both the simulated and the real data used for evaluating a new method or algorithm should reflect typical dataset properties found in real-world application domains. Examples of such properties in BBCC are multicollinearity, a large number of independent variables relative to the number of observations, and binary and categorical independent variables.

Simulated data

We simulated datasets as follows:

Let $\Delta = (\delta_i)$, $i = 1, ..., p$, where

$$\delta_i = \begin{cases} 1 & \text{iff } x_i \text{ is in the model} \\ 0 & \text{else,} \end{cases}$$

represent a subset of $X_n$, and let $\beta_p(\Delta) = (\beta_i(\delta_i))$, $i = 1, ..., p$, where

$$\beta_i(\delta_i) = \begin{cases} \beta_i \neq 0 & \text{iff } \delta_i = 1 \\ 0 & \text{else,} \end{cases}$$

be a vector of regression coefficients. The data matrix $X_n$ was sampled from a twenty-dimensional multivariate normal distribution. Thus, $x_i \sim N_{20}(0_{20}, \Sigma_{20})$, $i = 1, ..., n$, where $0_{20}$ is a twenty-dimensional vector of zeroes and $\Sigma_{20}$ is either the identity matrix $I_{20}$ or a covariance matrix, $S_{20}$, estimated from a real in-house QSAR dataset that originated from HIV protease inhibitors. The HIV QSAR dataset contains highly correlated independent variables, resulting in an $S_{20}$ with many large absolute values in the off-diagonal elements. Three $\delta_i$ in $\Delta$ were chosen to be nonzero and equal to one; the positions were chosen at random to be 11, 14, and 18 (but remained fixed throughout the experiment for evaluation purposes). The corresponding regression coefficients, $\beta_{11}(\delta_{11})$, $\beta_{14}(\delta_{14})$, and $\beta_{18}(\delta_{18})$, were, for simplicity, all set equal to 1. The variables 11, 14, and 18 were slightly correlated, with an estimated covariance matrix

$$\Sigma_{(11,14,18)} = \begin{pmatrix} 4.07 & 15.28 & -0.01 \\ 15.28 & 3423.20 & -0.33 \\ -0.01 & -0.33 & 0.01 \end{pmatrix}.$$

Datasets were generated assuming that $y_n$ followed a linear model according to $y_n = X_n \beta_p(\Delta) + \varepsilon_n$, where $\varepsilon_i \sim N(0, 1.5)$. Four datasets were simulated in order to evaluate the C1C2's performance in settings where $n < p_X$, where $n > p_X$, and where the observations were or were not sampled from an orthogonal multivariate normal distribution, according to the following schema:

  1. n = 15, $\Sigma_{20} = I_{20}$

  2. n = 200, $\Sigma_{20} = I_{20}$

  3. n = 15, $\Sigma_{20} = S_{20}$

  4. n = 200, $\Sigma_{20} = S_{20}$

The simulated datasets are available in CSV format from Additional files 1, 2, 3, 4.
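A minimal sketch of how one of these datasets can be generated is given below (Python; not from the original work – the random seed is arbitrary, N(0, 1.5) is read as a variance of 1.5, and the identity covariance stands in for settings 1-2, since the estimated $S_{20}$ is only available through the additional files):

```python
import numpy as np

# Illustrative sketch of the simulation scheme; assumptions noted in comments.
rng = np.random.default_rng(2008)
p = 20
n = 200                                   # settings 2/4; use n = 15 for settings 1/3

# Sigma_20: identity for settings 1-2; settings 3-4 use S_20, a covariance matrix
# estimated from an HIV protease QSAR set (not reproduced here).
Sigma = np.eye(p)

Delta = np.zeros(p, dtype=int)
Delta[[10, 13, 17]] = 1                   # variables 11, 14, 18 (1-based) are in the model
beta = Delta.astype(float)                # the three nonzero coefficients are all 1

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
eps = rng.normal(loc=0.0, scale=np.sqrt(1.5), size=n)   # reading N(0, 1.5) as variance 1.5
y = X @ beta + eps
print(X.shape, y.shape)
```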

The Selwood dataset

This is a real dataset, made available from a website [18] and originally published in 1990 [19]. It is a widely studied dataset (see [20, 21] and references therein). It contains one dependent variable, 53 independent variables, and 31 observations. The 53 independent variables correspond to numerical descriptions of molecules (antifilarial antimycin analogues) designed to capture their physicochemical properties. The dependent variable is the in vitro antifilarial activity of the molecules. This dataset exhibits extremely strong correlations between the independent variables and contains real-valued, binary, and categorical independent variables. It is known from previous studies that this dataset contains nonlinearities, but that decent models can be found using linear methods.

Testing

To demonstrate the use of the C1C2, it was applied to the simulated and real-world datasets described above (hereafter referred to as "the datasets"). Below we describe the choices for R, K, $\mathcal{M}$, G, L, C1, C2, and S and the motivation for each selection.

Choice of R and K

The larger the choice of R, the higher the accuracy in the estimate of the generalization error; the choice of R is thus constrained by time and is dependent on the size of the dataset and the computational complexity of the choices of $\mathcal{M}$, G, L, C1, C2, and S. The choice of K is a trade-off between bias and variance; the larger the K, the more variance and the less bias in the estimates of the generalization error [22]. R was here set to 12 and K to 5.

Choice of $\mathcal{M}$

We make the assumption that a normal linear model forms a reasonable approximation of the data. This model is given by $y_n = X_n \beta_p + \varepsilon_n$, where the subscripts denote the number of rows in a matrix, $\beta_p$ are regression coefficients, and $\varepsilon_n \sim N(0, \sigma^2 I)$. Further, because $n < p$ and the columns in $X_n$ are highly correlated in some of the datasets, we decided to use the ridge estimator, $\hat{\beta}_p^{ridge}$ (see Methods), of the regression coefficients $\beta_p$. Let $\mathcal{M}^{ridge}$ be the linear class of models given by $y_n = X_n \beta_p + \varepsilon_n$, where $\beta_p$ is estimated by $\hat{\beta}_p^{ridge}$. We thus choose $\mathcal{M} = \mathcal{M}^{ridge}$. The problem of model choice within $\mathcal{M}^{ridge}$ reduces to the problem of variable selection, i.e. choosing which subset of the p columns in $X_n$ to include in the model, and the problem of choosing the ridge penalizing parameter $\lambda$ (see Methods). Hence, letting $\Delta = (\delta_i)$, $i = 1, ..., p$ (see simulated data above), represent a subset of $X_n$, we want to choose $\Delta$ and $\lambda$ using the C1C2 framework. A choice of $\Delta$ and $\lambda$ for given values of r and k will be termed "an estimate" of $\Delta$ and $\lambda$, respectively, and be denoted $\hat{\Delta}$ and $\hat{\lambda}$. Averages of estimates over the K folds and the R repeats in the C1C2 are denoted $\bar{\hat{\Delta}}$ and $\bar{\hat{\lambda}}$, respectively.
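The Methods section defining $\hat{\beta}_p^{ridge}$ is not reproduced here; assuming the standard ridge form $\hat{\beta}^{ridge} = (X'X + \lambda I)^{-1} X' y$, a candidate model in $\mathcal{M}^{ridge}$ (a variable subset $\Delta$ together with a penalty $\lambda$) can be fitted as in this sketch (Python, illustrative only):

```python
import numpy as np

def fit_ridge_subset(X, y, delta, lam):
    """Fit y = X_Delta * beta + eps by ridge on the columns selected by delta.

    Sketch assuming the standard ridge estimator; not the authors' code.
    delta : 0/1 vector of length p (the variable-subset part of the model choice)
    lam   : ridge penalizing parameter (the other part of the model choice)
    Returns a full-length coefficient vector with zeros for excluded variables.
    """
    cols = np.flatnonzero(delta)
    Xs = X[:, cols]
    beta_sub = np.linalg.solve(Xs.T @ Xs + lam * np.eye(len(cols)), Xs.T @ y)
    beta = np.zeros(X.shape[1])
    beta[cols] = beta_sub
    return beta

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 20))             # n < p, as in simulated settings 1 and 3
y = X[:, 10] + X[:, 13] + X[:, 17] + rng.normal(size=15)
delta = np.zeros(20, dtype=int)
delta[[10, 13, 17]] = 1
print(fit_ridge_subset(X, y, delta, lam=0.5)[[10, 13, 17]])
```

Each candidate examined by the search methods below is exactly such a $(\Delta, \lambda)$ pair.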

Choice of G

As the columns in the Selwood dataset are measured in different units and on different scales, we chose to let G consist of mean-centering and scaling-to-unit-variance processing steps.

Choice of L

We use the standard quadratic loss function given by:

$$L\big(y_i, \hat{y}_i(x_i, \mathcal{M}_m)\big) = \big(y_i - \hat{y}_i(x_i, \mathcal{M}_m)\big)^2.$$

Choice of C1 and C2

Others [12, 13] have suggested choosing C1 = C2 = cross-validation. Here, we choose C1 = cross-validation and C2 = BIC. Hence, in this demonstration we assess a model choice $\mathcal{M}_l^{ridge} \in \mathcal{M}^{ridge}$ according to:

$$C^1 = \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n_{D_k}} L\big(y_i, \hat{y}_i(x_i, \mathcal{M}_l^{ridge,BIC}(D_{-k}), D_{-k})\big), \qquad (4)$$

where $\mathcal{M}_l^{ridge,BIC}(D_{-k})$ is the $\mathcal{M}_l^{ridge}$ chosen according to BIC based on $D_{-k}$; that is, the value of $\mathcal{M}_l^{ridge}$ that optimizes the function:

$$C^2_{D_{-k}}(\mathcal{M}_m^{ridge}) = \log P\big(D_{-k} \mid \hat{\beta}^{ridge}, \mathcal{M}_m^{ridge}\big) - \frac{df}{2} \log n_{D_{-k}}, \qquad (5)$$

m = 1, ..., M, where M is the number of models in $\mathcal{M}^{ridge}$. In (5), df is the number of free parameters in the model $\mathcal{M}_m^{ridge}$ (note that this is not equal to the number of parameters in the model $\mathcal{M}_m^{ridge}$; see for instance [16]). The choice of C1 is motivated by our wish to obtain a direct estimate, $\hat{\varepsilon}_{gen}$, of the generalization error, $\varepsilon_{gen}$, of our model choice. Provided that the assumptions behind BIC are fulfilled, the choice C2 = BIC has several advantages over C2 = cross-validation, including a reduction of bias in parameter estimates [22], a reduction of variance by the Rao-Blackwell type relation derived in [23], and a drastic reduction of the computational cost of the procedure.
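Equation (5) leaves two quantities to be made concrete: the Gaussian log-likelihood of $D_{-k}$ at the ridge fit and the number of free parameters df. The sketch below is one possible reading (our assumptions, not necessarily the authors' implementation: the residual variance is plugged into the likelihood, and df is taken as the trace of the ridge hat matrix, a common effective-degrees-of-freedom choice for penalized linear fits):

```python
import numpy as np

def ridge_bic(X, y, lam):
    """One reading of C2 in eq. (5) for a single candidate (the subset already applied to X):
    Gaussian log-likelihood at the ridge fit minus (df/2) * log(n). Illustrative sketch only."""
    n, p = X.shape
    A = X.T @ X + lam * np.eye(p)
    beta = np.linalg.solve(A, X.T @ y)
    resid = y - X @ beta
    sigma2 = (resid ** 2).mean()                          # plug-in residual variance (assumption)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # log P(D_{-k} | beta_hat)
    df = np.trace(X @ np.linalg.solve(A, X.T))            # effective degrees of freedom (assumption)
    return loglik - 0.5 * df * np.log(n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=50)
for lam in (0.01, 1.0, 100.0):
    print(lam, ridge_bic(X, y, lam))      # larger is better; maximized in step (c)
```

In step (c), such a score would be evaluated on $D_{-k}$ for every candidate $(\Delta, \lambda)$ and the maximizer saved as $\mathcal{M}_l^{ridge,BIC}(D_{-k})$.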

Choice of S

A genetic algorithm (GA) was chosen as a search method because it is very easy to adapt to different situations and is in general effective for nondeterministic polynomial-time hard combinatorial problems, such as the problem of estimating $\Delta$ [24]. A trial solution in the GA is here a varying-length chromosome that contains a real-valued number representing $\lambda$ and a vector of integers representing the indices of the $\delta_i$ in $\Delta$ that are nonzero. The fitness function is our choice of C2. For the simulated datasets, we also chose to run the C1C2 with a brute-force search method: for each $\lambda \in \{0, 0.01, 0.02, ..., 10\}$ an exhaustive search in the $\Delta$ space (i.e. an all-subset regression) was performed. This enabled comparisons between the GA method and a search method guaranteed to find the optimal model (given a specific objective function and the resolution and limits of $\lambda$).
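The brute-force search amounts to scoring every subset of columns at every $\lambda$ on the grid; a toy sketch is given below (Python; the C2 stand-in, the coarse $\lambda$ grid, and the small number of variables are our simplifications – with p = 20 and the full grid, the sweep over all 2^20 subsets used in the demonstration is computationally heavy). A GA would instead evolve chromosomes encoding a $\lambda$ value plus a list of selected indices, using the same score as its fitness function.

```python
import numpy as np
from itertools import combinations

# Illustrative brute-force search; the scoring function is a stand-in for C2.
def score_c2(X, y, cols, lam):
    """BIC-like score of a ridge fit on the selected columns (to be maximized)."""
    Xs = X[:, cols]
    n, q = Xs.shape
    A = Xs.T @ Xs + lam * np.eye(q)
    beta = np.linalg.solve(A, Xs.T @ y)
    sigma2 = ((y - Xs @ beta) ** 2).mean()
    df = np.trace(Xs @ np.linalg.solve(A, Xs.T))
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1) - 0.5 * df * np.log(n)

def brute_force(X, y, lambdas):
    """Exhaustive search over all nonempty subsets Delta and all lambda on the grid."""
    p = X.shape[1]
    best = (None, None, -np.inf)
    for size in range(1, p + 1):
        for cols in combinations(range(p), size):
            for lam in lambdas:
                s = score_c2(X, y, list(cols), lam)
                if s > best[2]:
                    best = (cols, lam, s)
    return best

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 6))
y = X[:, 1] + X[:, 4] + rng.normal(scale=0.5, size=40)
cols, lam, s = brute_force(X, y, lambdas=np.arange(0.0, 2.01, 0.25))
print("chosen subset:", cols, "lambda:", lam)
```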

Some remarks regarding the demonstration

To enable comparisons with the estimates of $\Delta$, $\lambda$, and $\varepsilon_{gen}$ obtained with repeated K-fold cross-validation, the demonstration described above was repeated with the function (1) used as a criterion for model choice and for assessing the model. Note that, since the C1C2 includes the 'for loop' over r, (1) was repeated R = 12 times, each time with a new, independent data partitioning. This was done to facilitate an impartial comparison between the two methods.

The demonstration of the C1C2 framework can be compared with the work of, for instance, Nicolotti and Carotti [20], where a genetic algorithm was employed to estimate $\Delta$. In contrast to that approach, the C1C2 framework completely separates model choice and assessment, whereby, in general, more accurate generalization error estimates are achieved. Further, the use of specific ad hoc objective functions is avoided by choosing C2 to be a formally derived model selection criterion, and simultaneous estimation of model parameters other than $\Delta$ (for example, the ridge parameter $\lambda$ in the demonstration) can be afforded. Typically, in works that have employed a search method for estimating $\Delta$, a given number of nonzero $\delta_i$ in $\Delta$ is assumed (see for instance [20, 25]). Therefore it was of interest to investigate the effect of this assumption on producing good estimates of $\Delta$ and $\varepsilon_{gen}$. This can be tested for the simulated datasets in the demonstration, where the number of nonzero $\delta_i$ is known. The C1C2 was therefore applied to the simulated datasets both with and without an assumption about the number of nonzero $\delta_i$. For simplicity (however somewhat unrealistically), we assumed the correct number of nonzero $\delta_i$.

Results of the testing

Simulated datasets

The four simulated datasets, in combination with the use of either the C1C2 or repeated K-fold cross-validation for model choice and assessment, the GA or the brute-force search method, and either with or without the assumption of prior knowledge of the number of nonzero $\delta_i$, constitute a two-level, five-factor, full factorial experimental design. The C1C2 and the repeated K-fold cross-validation were applied four times to each factor combination, thus providing four replicates of the whole demonstration for the simulated data. The design can be analyzed within the normal linear model

$$w_{iv} = \gamma_0 + \gamma_1 z_{1i} + \gamma_2 z_{2i} + \gamma_3 z_{3i} + \gamma_4 z_{4i} + \gamma_5 z_{5i} + \eta_i, \quad i = 1, ..., 128, \qquad (6)$$

where $z_{ji}$, $j = 1, 2, 3, 4, 5$, are factors corresponding to C1C2 or repeated K-fold cross-validation model choice and assessment, brute-force or GA search, $\Sigma_{20} = I_{20}$ or $\Sigma_{20} = S_{20}$ in the multivariate normal distribution from which the data was sampled, 200 or 15 observations, and assuming three nonzero $\delta_i$ or no such assumption, respectively. In (6), i runs from 1 to 128 as there are 32 factor combinations in four replicates. $w_{iv}$, $v = 1, 2, 3$, are response variables defined as follows: the Euclidean norm $w_{i1} = \|\bar{\hat{\Delta}} - \Delta\|_i$ was used to measure how well $\Delta$ on average was estimated; $w_{i2} = \bar{\hat{\lambda}}_i$ was used as a response variable in the $\lambda$ case (as the correct choice of $\lambda$ is not known); and $w_{i3} = \overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}_i$ was used to measure how well the generalization error $\varepsilon$ on average was estimated. Here $\hat{\varepsilon}_{gen}$ denotes the estimate of $\varepsilon_{gen}$ for given values of r and k; $\tilde{\varepsilon}_{gen} = \frac{1}{N} \sum_{j=1}^{N} (y_{j,ext} - \hat{y}_{j,ext}(x_{j,ext}, \mathcal{M}_l))^2$ denotes the generalization error estimate obtained by using the corresponding choice of model, $\mathcal{M}_l$, to predict the response values in a large (N = 500,000) external test set, generated in the same way as the dataset used for choosing the model and estimating $\hat{\varepsilon}_{gen}$; and the bar denotes the average over the R·K individual estimates. The generalization error can be decomposed into three parts: an irreducible error (corresponding to the error added when simulating the data), the squared bias, and the variance. The latter two are dependent on the model choice, and consequently the generalization error is dependent on the model choice. We here assume that the large-sample estimate of the generalization error, $\tilde{\varepsilon}_{gen}$, closely represents the true generalization error, $\varepsilon$, for a given model choice.
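For concreteness, the following sketch builds such a two-level, five-factor design with four replicates and fits model (6) by ordinary least squares (Python; the ±1 factor coding, the synthetic response values, and the use of plain least squares instead of R's lm are our own assumptions):

```python
import numpy as np
from itertools import product

# Illustrative sketch of the factorial analysis; response values are synthetic.
# Five two-level factors coded as +/-1: method (C1C2 vs repeated CV), search (brute vs GA),
# covariance (I20 vs S20), sample size (200 vs 15), assumption (known #nonzero vs not).
combos = np.array(list(product([-1, 1], repeat=5)))       # 32 factor combinations
Z = np.repeat(combos, 4, axis=0)                           # 4 replicates -> 128 runs
design = np.column_stack([np.ones(len(Z)), Z])             # intercept gamma_0 plus 5 slopes

rng = np.random.default_rng(0)
w = 1.0 + 0.3 * Z[:, 0] - 0.2 * Z[:, 3] + rng.normal(scale=0.1, size=len(Z))  # synthetic response

gamma, *_ = np.linalg.lstsq(design, w, rcond=None)
print("estimated gamma_0..gamma_5:", np.round(gamma, 3))
```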

The results for choosing a model $\mathcal{M}_l^{ridge} \in \mathcal{M}^{ridge}$ for the simulated datasets are available in Additional file 5, where $\|\bar{\hat{\Delta}} - \Delta\|$, $\bar{\hat{\lambda}}$, and $\overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}$ are tabulated for each factor combination and replicate. The parameter estimates for fitting the model (6) using $\|\bar{\hat{\Delta}} - \Delta\|$, $\bar{\hat{\lambda}}$, and $\overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}$ as response variables are shown in Tables 1, 2, and 3, respectively. All fitted models were highly significant ($F_{5,122}$ = 26.1, p-value < 2.2 × 10⁻¹⁶ with $\|\bar{\hat{\Delta}} - \Delta\|$ as response; $F_{5,122}$ = 47.7, p-value < 2.2 × 10⁻¹⁶ with $\bar{\hat{\lambda}}$ as response; and $F_{5,122}$ = 12.1, p-value = 1.6 × 10⁻⁹ with $\overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}$ as response); residual plots showed no large deviations from the assumptions of normality of the error distribution (an asymptotic normal distribution of the response variables is warranted by the central limit theorem), homoscedasticity, and independent errors (data not shown). A few outliers were however observed, probably resulting from "unfortunate" data partitions.

Table 1 Coefficient estimates of model (6) with $w_{i1} = \|\bar{\hat{\Delta}} - \Delta\|_i$, i = 1, ..., 128, as a response variable.
Table 2 Coefficient estimates of model (6) with $w_{i2} = \bar{\hat{\lambda}}_i$, i = 1, ..., 128, as a response variable.
Table 3 Coefficient estimates of model (6) with $w_{i3} = \overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}_i$, i = 1, ..., 128, as a response variable.

Selwood dataset

Applying the C1C2 to the Selwood dataset yielded $\hat{\Delta}$ that on average contained 11.4 ± 3.5 nonzero $\hat{\delta}_i$. The 11 most frequently picked variables were the partial atomic charges ATCH1, ATCH3, and ATCH6; the electrophilic superdelocalizability ESDL3; the van der Waals volume VDWVOL; the surface area SURF_A; the principal moments of inertia MOFI_Y and MOFI_Z; the principal ellipsoid axis PEAX_X; the partition coefficient LOGP; and the sum of F substituent constants SUM_F (see [19] for more details about the variables). The estimation of $\lambda$ by $\bar{\hat{\lambda}}$ was 2.50 ± 0.09 and the estimation of the generalization error by $\bar{\hat{\varepsilon}}_{gen}$ was 0.42 ± 0.038, where $\bar{\hat{\varepsilon}}_{gen}$ is an average over the R·K estimates $\hat{\varepsilon}_{gen}$ produced in the C1C2.

Applying repeated K-fold cross-validation for model choice and assessment to the Selwood dataset gave on average 14.1 ± 4.8 selected variables. The 14 most frequently picked variables included the same 11 variables picked by the C1C2 (see above) plus DIPMOM (the dipole moment), ATCH7 (partial charge of atom 7), and DIPV_Y (the dipole moment vector in the Y-direction). The estimation of $\lambda$ by $\bar{\hat{\lambda}}$ was 3.01 ± 0.22 and the estimation of the generalization error by $\bar{\hat{\varepsilon}}$ was 0.35 ± 0.041, where $\bar{\hat{\varepsilon}}$ is an average over the R·K estimates $\hat{\varepsilon}$ produced in the repeated K-fold cross-validation.

Implementation

Computer programs to implement the C1C2 were written in Java (Sun Microsystems [26]) as part of the library P, which will serve as the data analysis plugin for Bioclipse [27]. P is available under the GSPL license from the website [28]; it is open source and free for academic use. It has a modular architecture that enables plugging in new features, including modeling methods, model selection criteria, and search procedures. P relies on a modified version of the JGAP library (available from the website [29]) for the genetic algorithm computations (the modifications are available under the LGPL license from the website [30]). The R package pvclust [31, 32] was used for the cluster analysis (see Discussion).
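
The plug-in architecture described above can be pictured as a set of small interfaces that the framework wires together. The sketch below is purely illustrative: the interface and type names are hypothetical and are not taken from P's actual API; it only shows how modeling methods, selection criteria, and search procedures could be exchanged independently.

```java
// Hypothetical plugin-style interfaces; illustrative only, not P's actual API.
interface FittedModel {
    double predict(double[] x);
}

interface ModelingMethod {
    FittedModel fit(double[][] x, double[] y, boolean[] variableMask, double penalty);
}

interface SelectionCriterion {
    double score(FittedModel model, double[][] x, double[] y); // e.g. BIC or an inner CV error
}

/** A candidate model description: which variables are active and which penalty is used. */
class Candidate {
    final boolean[] variableMask;
    final double penalty;
    Candidate(boolean[] variableMask, double penalty) {
        this.variableMask = variableMask;
        this.penalty = penalty;
    }
}

interface SearchProcedure {
    // Searches over variable masks and penalty values, returning the candidate that
    // optimizes the supplied criterion (e.g. a GA or a brute-force search).
    Candidate search(double[][] x, double[] y, ModelingMethod method, SelectionCriterion criterion);
}
```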

Discussion

Simulated datasets

The model (6) fitted to $w = \|\bar{\hat{\Delta}} - \Delta\|$ (see Table 1) showed a relatively clear significant difference (at the 90% level) in the average Δ estimates depending on whether the data came from a multivariate normal distribution with $\Sigma_{20} = I_{20}$ or $\Sigma_{20} = S_{20}$. Furthermore, we observed significant positive effects on the average Δ estimates of using more observations and of assuming knowledge about the number of nonzero $\delta_i$. All these findings were expected: highly correlated variables should give worse estimates of Δ, whereas more observations and trustworthy prior knowledge should give better estimates. A significant improvement in the average Δ estimates was observed when using the brute-force search compared to the GA; on average, the GA selected slightly more variables than needed and than the brute-force method did. No clear significant difference could be seen between using the C1C2 and using repeated K-fold cross-validation.

It can be shown that the optimal choice of λ (in terms of minimized expected generalization error) tends to zero as the number of observations tends to infinity, and that it decreases with a decreasing number of variables (see for instance [33]) and with decreasing correlations between the independent variables. The model (6) fitted with $w = \bar{\hat{\lambda}}$ as the response (see Table 2) showed that the average estimated λ was significantly smaller for data from a multivariate normal distribution with $\Sigma_{20} = I_{20}$ than with $\Sigma_{20} = S_{20}$, when more observations were used, and when prior knowledge about the number of nonzero $\delta_i$ in Δ was assumed. Although the true value of λ is not known, these results are consistent with theory and provide evidence that both the C1C2 and repeated K-fold cross-validation gave reasonable estimates of λ in the demonstration. However, the average λ estimates were not equal to zero for all orthogonal datasets, presumably due to the stochastic nature of the GA and to the errors in $y_n$. No significant differences were observed between the GA and the brute-force search methods, or between the C1C2 and repeated K-fold cross-validation.

Fitting model (6) with $w = \overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}$ as the response (see Table 3) showed that the average error estimates were significantly worse when a given number of nonzero $\delta_i$ was assumed, and that no significant difference was observed between using the GA and the brute-force method, or between datasets with correlated and uncorrelated independent variables. These findings might seem confusing, given that the assumption of a given number of nonzero $\delta_i$, the use of the brute-force search method, and uncorrelated independent variables all improved model selection. They can be explained by the fact that, without an assumption of a given number of nonzero $\delta_i$, when using the GA to search the model space, and when the independent variables were correlated, more nonzero $\delta_i$ were on average selected (see Table 1), which improves the chances of also selecting the correct ones. This implies that it is worse to estimate at least one $\delta_i = 1$ with $\hat{\delta}_i = 0$ than to estimate all $\delta_i = 1$ with $\hat{\delta}_i = 1$ and at least one $\delta_i = 0$ with $\hat{\delta}_i = 1$. This makes sense, because the former models are incorrect, whereas the latter contain the true model but are inefficient due to their unnecessarily large size. The average error estimates were significantly improved with a large number of observations and when the C1C2, rather than repeated K-fold cross-validation, was employed to produce the estimates (see Fig. 2 and Table 3). The latter result seems to contradict the finding that there was no clear difference between the average Δ estimates produced with the C1C2 and those obtained with repeated K-fold cross-validation (see above). It can, however, be explained by studying the R·K individual Δ estimates, where a clear positive effect (at the 99% level) was observed when using the C1C2 compared to repeated K-fold cross-validation. The individual Δ estimates were thus worse when repeated K-fold cross-validation was used, resulting in worse generalization error estimates, even though the average Δ estimates from the two methods were almost the same. This observation is supported by the wider confidence intervals of the average Δ estimates produced with repeated K-fold cross-validation (see Additional file 5).
The finding that the C1C2 produces more accurate generalization error estimates than repeated K-fold cross-validation is consistent with the results presented in, for instance, [9] and provides evidence that a complete separation of the data used for model choice from the data used for model assessment is necessary to obtain better estimates of the generalization error.

Figure 2

Generalization errors obtained with the C1C2 and repeated K-fold cross-validation. The figure shows $\overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}$, where the $\hat{\varepsilon}_{gen}$ were produced using the C1C2 (blue) and repeated K-fold cross-validation (red), for all other factor combinations in model (6). The plot is based on $\overline{|\hat{\varepsilon}_{gen} - \tilde{\varepsilon}_{gen}|}$ pooled over the four replicates for each method. The bars show the 95% confidence interval, calculated from the pooled results (the confidence intervals are only shown in one direction to avoid cluttering). The factor combinations in model (6) are coded as: ga – the GA search method was used, bf – the brute-force search method was used, uncor – orthogonal independent variables in the dataset, cor – correlated independent variables in the dataset, 15n – 15 observations in the dataset, 200n – 200 observations in the dataset, all – no assumption regarding the number of nonzero $\delta_i$, 3 – three $\delta_i = 1$ were assumed.

Selwood dataset

The result of estimating Δ was, as expected, less clear when applying the C1C2 and repeated K-fold cross-validation to real-world data; the Selwood dataset is particularly difficult to model due to the extremely high correlations between variables (many variable pairs have correlation coefficients > 0.95), the low observation-to-variable ratio, and deviations from the linearity and homoscedasticity assumptions (see [20]). On average, 11.4 out of the 53 variables were selected by the C1C2 and 14.1 by repeated K-fold cross-validation. Interestingly, the 11 variables most frequently picked by the C1C2 form a proper subset of the 14 variables most frequently selected by K-fold cross-validation. Hierarchically clustering the 14 variables most frequently chosen by K-fold cross-validation (which thus include the 11 variables selected most often by the C1C2), using the absolute correlation as a distance measure, revealed three distinct clusters and one subcluster (see Fig. 3). Good models (in terms of estimated generalization error) for the Selwood dataset can be obtained by selecting LOGP together with one variable from the blue subcluster (PEAX_X, MOFI_Y, MOFI_Z, VDWVOL, and SURF_A) and one variable from the green cluster (ESDL3, ATCH1, ATCH3, ATCH6, ATCH7, and SUM_F). LOGP appears to be sufficiently different from the rest of the variables in the red cluster to improve model performance. The variables within the blue and green clusters, respectively, are highly correlated, and it is sufficient to include one variable from each cluster in a model.

Figure 3

Cluster dendrogram of the 14 variables selected from the Selwood dataset using repeated K-fold cross-validation. Three distinct clusters can be noted (shown in red, green, and yellow rectangles). One subcluster can be seen within the red cluster (shown in a blue rectangle). The red and green numbers are p-values of a given cluster; they indicate how well the cluster is supported by the data (see [31] for details). +Additional variables selected by repeated K-fold cross-validation compared to the C1C2.

Models containing LOGP and one variable each from the green and blue clusters have high predictive power and comply with the QUICK rules for credible predictive models proposed previously [21]. Furthermore, these models have been found credible in the work of others [20, 25, 34]. The C1C2 chose 11 variables belonging to the green, red, and blue clusters, whereas K-fold cross-validation chose an additional three variables: ATCH7 in the green cluster, and DIPMOM and DIPV_Y belonging to the third distinct (yellow) cluster. Variables from the yellow cluster do not improve the internal predictive ability when models containing LOGP and one variable from each of the green, blue, and yellow clusters are tested on the whole Selwood dataset (data not shown); this result is supported by the work of Nicolotti and Carotti (see Table 1 in [20]). More variables in the Selwood dataset were thus on average selected with repeated K-fold cross-validation than with the C1C2 (the difference was significant at the 80% level, tested by a one-sided Welch's t-test), including two that do not seem to improve the predictive ability of the models. The generalization error estimates obtained with K-fold cross-validation were lower than those obtained with the C1C2 (significant at the 70% level). Although these differences are not highly significant, it is tempting to conclude that the models selected by repeated K-fold cross-validation in this particular case are more prone to overfitting and that this is reflected in the lower generalization error estimates.

The C1C2

We have here introduced the C1C2 framework for simultaneous model choice and assessment. The main idea is a complete separation of the choice of a model and its assessment in terms of the data used for each task. The C1C2 was applied to the problem of choosing a model $\mathcal{M}_l^{ridge} \in \mathcal{M}^{ridge}$. Previously, others have described methods that, within the linear model, tackle the problems of regression coefficient shrinkage and variable selection simultaneously, for example the lasso [35]. However, the C1C2 framework is general and easily applied to other settings. For instance, different choices of C2 are favorable in different situations: Akaike's information criterion (AIC) [36] is known to be consistent within the linear model if the true model is very complex, whereas BIC is favorable within linear models of finite dimension [37], and cross-validation is preferable in situations where the degrees of freedom of a model are difficult to define, and so forth. The search method can also be tailored to the problem at hand; for instance, brute-force methods are advantageous for small problems, whereas GAs are faster and thus applicable to larger problems. Moreover, if required, $\mathcal{M}$ can be chosen to contain nonlinear models, L can be chosen to be exponential in order to increase the penalty on outliers, and instead of using the search method to produce an estimate of Δ (as we did in the demonstration) we can let G contain a dedicated variable selection method. The cost of this generality is uncharacterized convergence rates (in finite time) of the parameter estimates, which is coupled to the need to employ a general search method (such as a GA) rather than solving standard convex problems. Running the C1C2 R times enables averaging of estimates and calculation of confidence intervals, but raises the problem of choosing which of the R models to use for interpretation and future predictions. A potential remedy is, instead of choosing a single model, to employ all chosen models in a stacking-like scheme (see [38] for details on stacking). Testing this idea and further testing of the C1C2 for other choices of $\mathcal{M}$, G, L, C1, C2, and S will be pursued in future research.
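
To make the idea of separating choice data from assessment data concrete, the sketch below outlines one way such a nested scheme could look. It is a minimal illustration under our own assumptions (an outer K-fold split used only for assessment, and an inner model-choice step that sees only the remaining data), not a literal transcription of the C1C2 algorithm; the interface names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of separating model choice from model assessment.
// Assumptions (not from the article): "choose" stands in for the search/criterion step,
// and the held-out fold is used only for assessing the chosen model.
public class ChoiceAssessmentSplit {

    interface Chooser { int[] choose(List<Integer> trainRows); }              // returns chosen variable indices
    interface Assessor { double assess(int[] chosen, List<Integer> testRows); } // e.g. squared error on held-out rows

    /** Returns K generalization error estimates, each computed on data never used for model choice. */
    static double[] run(int n, int k, Chooser chooser, Assessor assessor, long seed) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) rows.add(i);
        Collections.shuffle(rows, new java.util.Random(seed));

        double[] errors = new double[k];
        for (int fold = 0; fold < k; fold++) {
            List<Integer> test = new ArrayList<>();
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (i % k == fold) test.add(rows.get(i)); else train.add(rows.get(i));
            }
            int[] chosen = chooser.choose(train);         // model choice sees only the training part
            errors[fold] = assessor.assess(chosen, test); // assessment sees only the held-out part
        }
        return errors; // repeating the whole procedure R times yields R*K estimates, as in the article
    }
}
```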

Conclusion

We have presented evidence suggesting that the C1C2 works well in terms of choosing the correct model and producing good estimates of the generalization error. It was demonstrated to perform well within a penalized linear model class, even for "difficult" datasets with highly correlated independent variables, a low observation-to-variable ratio, and deviations from the model assumptions (see Table 4 for a summary of the findings in the demonstrations). However, more research is needed to fully assess the method's performance for more general, for instance nonlinear, models and to provide theoretical insight into frameworks such as the C1C2. The C1C2 is general, and reasonable choices of $\mathcal{M}$, G, L, C1, C2, and S help in achieving estimates that are as unbiased and have as low a variance as possible, at as low a computational cost as possible. A framework that completely separates model choice from model assessment in terms of the data used, like the C1C2, should always be employed for model selection and assessment in order to avoid positive bias in the generalization error estimates and, ultimately, to avoid drawing false conclusions and using dubious models to direct further research.

Table 4 Summary of the demonstrations of the C1C2.

Methods

Bayesian Information Criterion (BIC)

Suppose we have a set of candidate models, $\mathcal{M}$, with corresponding model parameters, $\theta_m$, and we wish to choose the best model in $\mathcal{M}$. Assuming we have a prior distribution, $P(\theta_m \mid \mathcal{M}_m)$, for the parameters of each model $\mathcal{M}_m \in \mathcal{M}$, the posterior probability of a given model is:

$$P(\mathcal{M}_m \mid D) \propto P(\mathcal{M}_m) \cdot P(D \mid \mathcal{M}_m) \propto P(\mathcal{M}_m) \cdot \int P(D \mid \theta_m, \mathcal{M}_m)\, P(\theta_m \mid \mathcal{M}_m)\, d\theta_m .$$

To choose a model in a Bayesian setting, the posterior odds, given by:

$$\frac{P(\mathcal{M}_l \mid D)}{\sum_{m=1}^{M} P(\mathcal{M}_m \mid D)} = \frac{P(\mathcal{M}_l)\, P(D \mid \mathcal{M}_l)}{\sum_{m=1}^{M} P(\mathcal{M}_m)\, P(D \mid \mathcal{M}_m)} \qquad (7)$$

are formed for all models $\mathcal{M}_l \in \mathcal{M}$, and the model $\mathcal{M}_l$ that maximizes equation (7) is picked. If all models in $\mathcal{M}$ are given equal prior probabilities, the problem of choosing the model $\mathcal{M}_l$ reduces to calculating the integrals $P(D \mid \mathcal{M}_m)$. It can be shown [39, 40] that the Bayesian Information Criterion (BIC) approximates the logarithm of this integral with an O(1) error term, that is:

$$\log P(D \mid \mathcal{M}_m) = BIC + O(1),$$

where

$$BIC = \log P(D \mid \hat{\theta}_m, \mathcal{M}_m) - \frac{df_m}{2} \log n .$$

In the latter expression, $\hat{\theta}_m$ is the maximum likelihood estimate, $df_m$ is the number of free parameters in model $\mathcal{M}_m$ (note that this is in general not equal to the number of parameters in the model), and n is the number of observations [17]. Thus, BIC is an approximation to the Bayes solution, but it is valid outside the Bayesian context, because the leading terms in the approximation do not depend on the prior densities of the model parameters, $\theta_m$. BIC is, as opposed to nonparametric approaches such as cross-validation, model based and therefore relies on the assumptions made in the modeling. BIC is derived under the assumption that the data come from a distribution in the exponential family (see [41] for more about the assumptions behind BIC and a comparison with Akaike's Information Criterion).
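
As an illustration, the sketch below computes this BIC score for a Gaussian linear (or ridge) fit, with the log-likelihood expressed through the residual sum of squares at the maximum likelihood variance estimate. The helper names and the toy numbers are our own assumptions, not code from the article.

```java
// Minimal sketch: BIC = logLik(theta_hat) - (df/2) * log(n) for a Gaussian model.
// The parameterization via the residual sum of squares (RSS) is an illustrative assumption.
public class BicExample {

    /** Gaussian log-likelihood at the ML estimates, expressed via the residual sum of squares. */
    static double gaussianLogLik(double rss, int n) {
        double sigma2 = rss / n; // ML estimate of the error variance
        return -0.5 * n * (Math.log(2.0 * Math.PI * sigma2) + 1.0);
    }

    /** BIC as defined above; df is the (effective) number of free parameters of the fit. */
    static double bic(double rss, double df, int n) {
        return gaussianLogLik(rss, n) - 0.5 * df * Math.log(n);
    }

    public static void main(String[] args) {
        // Toy numbers only: compare two fits on the same data (n = 50).
        System.out.println("small model: " + bic(12.0, 3.0, 50));
        System.out.println("large model: " + bic(11.5, 8.0, 50));
        // Under this sign convention, the fit with the larger BIC value is preferred.
    }
}
```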

Ridge regression

The ordinary least squares (OLS) estimator of the regression coefficients β in the standard linear model is efficient (i.e. it has the minimum possible variance) within the class of linear and unbiased estimators. However, when the independent variables are correlated, the variance of the OLS estimator is generally high. In these situations, ridge regression [42] can yield improved parameter estimates by minimizing a penalized residual sum of squares, given by $RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$. Finding the minimum of this expression gives the ridge solution $\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y$, where I is the p × p identity matrix. The solution thus adds a positive constant to the diagonal of $X^T X$ before inversion; this makes the problem nonsingular even if $X^T X$ is not of full rank. While this introduces bias into the coefficient estimates, the variance is often greatly reduced.

Note that $\hat{\beta}^{ridge}$ is a linear function of y, so it is straightforward to define the effective degrees of freedom of the ridge regression fit, df(λ), as:

$df(\lambda) = \mathrm{tr}\left[ X (X^T X + \lambda I)^{-1} X^T \right]$ [16]. The degrees of freedom of the fit are needed for carrying out model selection according to, for instance, BIC. Linearity in y also enables easy implementation (no quadratic programming is required, as is, for instance, necessary with the lasso).
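
The sketch below computes the ridge solution and its effective degrees of freedom directly from these formulas. It assumes the Apache Commons Math linear algebra classes for the matrix operations (our own choice for illustration, not the library used in the article) and omits centering and scaling of the data.

```java
import org.apache.commons.math3.linear.*;

// Minimal ridge regression sketch: beta_hat = (X'X + lambda*I)^-1 X'y and
// df(lambda) = tr[X (X'X + lambda*I)^-1 X'], following the formulas above.
public class RidgeExample {

    static RealMatrix penalizedInverse(RealMatrix x, double lambda) {
        int p = x.getColumnDimension();
        RealMatrix xtx = x.transpose().multiply(x);
        RealMatrix penalized = xtx.add(MatrixUtils.createRealIdentityMatrix(p).scalarMultiply(lambda));
        return new LUDecomposition(penalized).getSolver().getInverse();
    }

    /** Ridge coefficient estimates for a given penalty lambda. */
    static RealVector ridgeCoefficients(RealMatrix x, RealVector y, double lambda) {
        return penalizedInverse(x, lambda).operate(x.transpose().operate(y));
    }

    /** Effective degrees of freedom of the ridge fit. */
    static double effectiveDf(RealMatrix x, double lambda) {
        return x.multiply(penalizedInverse(x, lambda)).multiply(x.transpose()).getTrace();
    }

    public static void main(String[] args) {
        RealMatrix x = MatrixUtils.createRealMatrix(new double[][] {
            {1.0, 0.9}, {0.8, 1.1}, {1.2, 1.0}, {0.7, 0.6}}); // toy data with correlated columns
        RealVector y = MatrixUtils.createRealVector(new double[] {2.1, 1.9, 2.4, 1.2});
        double lambda = 2.5;
        System.out.println("beta_ridge = " + ridgeCoefficients(x, y, lambda));
        System.out.println("df(lambda) = " + effectiveDf(x, lambda));
    }
}
```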

Genetic algorithm (GA)

A GA (see [43] for more details) is a stochastic search technique for finding exact or approximate solutions to optimization and search problems. A typical genetic algorithm is defined by a genetic representation of a solution (normally termed a chromosome in the GA context), that is, a vector $w_t^i$ specifying the numerical representation of the i-th chromosome at generation t, and an objective (or fitness) function, $f(w_t^i)$, that evaluates the fitness of a chromosome. The GA is initiated by setting up a random population containing a number of trial chromosomes. New solutions are generated by mutation or recombination of existing solutions and are selected for the next generation with a probability given by $p(w_t^i) = f(w_t^i) / \sum_{j=1}^{a} f(w_t^j)$. The process continues through a number of generations until an optimal or acceptable solution has been found. Genetic algorithms of this type can be shown to converge with probability one to the globally optimal solution as t → ∞.
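
The sketch below shows a bare-bones GA of this kind for a binary variable-selection chromosome, with fitness-proportional (roulette-wheel) selection and bit-flip mutation. The population size, mutation rate, and toy fitness function are placeholders of our own; the code is not taken from the JGAP-based implementation used in the article.

```java
import java.util.Random;

// Bare-bones GA sketch: binary chromosomes, fitness-proportional selection, bit-flip mutation.
// All parameter values and the toy fitness function are illustrative assumptions.
public class SimpleGa {
    static final Random RNG = new Random(42);

    /** Toy fitness rewarding chromosomes with few active bits (stand-in for a model selection criterion). */
    static double fitness(boolean[] chromosome) {
        int active = 0;
        for (boolean bit : chromosome) if (bit) active++;
        return 1.0 / (1.0 + active);
    }

    /** Roulette-wheel selection: pick a chromosome with probability proportional to its fitness. */
    static boolean[] select(boolean[][] population, double[] fitness, double totalFitness) {
        double r = RNG.nextDouble() * totalFitness;
        double cumulative = 0.0;
        for (int i = 0; i < population.length; i++) {
            cumulative += fitness[i];
            if (cumulative >= r) return population[i].clone();
        }
        return population[population.length - 1].clone();
    }

    static void mutate(boolean[] chromosome, double rate) {
        for (int i = 0; i < chromosome.length; i++) if (RNG.nextDouble() < rate) chromosome[i] = !chromosome[i];
    }

    static double bestFitness(boolean[][] population) {
        double best = Double.NEGATIVE_INFINITY;
        for (boolean[] c : population) best = Math.max(best, fitness(c));
        return best;
    }

    public static void main(String[] args) {
        int popSize = 20, length = 10, generations = 50;
        boolean[][] population = new boolean[popSize][length];
        for (boolean[] c : population) for (int i = 0; i < length; i++) c[i] = RNG.nextBoolean();

        for (int t = 0; t < generations; t++) {
            double[] fit = new double[popSize];
            double total = 0.0;
            for (int i = 0; i < popSize; i++) { fit[i] = fitness(population[i]); total += fit[i]; }
            boolean[][] next = new boolean[popSize][];
            for (int i = 0; i < popSize; i++) {
                boolean[] child = select(population, fit, total);
                mutate(child, 0.05);
                next[i] = child;
            }
            population = next;
        }
        System.out.println("best fitness after search: " + bestFitness(population));
    }
}
```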

Brute force search

A brute force search systematically tests an exhaustive list of all possible candidates for the solution to a given search or optimization problem and checks whether each candidate satisfies the problem's statement.
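
For the variable-selection setting considered here, a brute-force search amounts to enumerating every subset of the p candidate variables and scoring each one, which is only feasible for small p. The sketch below does exactly that; the scoring function is a placeholder of our own (in practice it would be a selection criterion such as BIC evaluated on a ridge fit).

```java
import java.util.function.ToDoubleFunction;

// Brute-force enumeration of all 2^p variable subsets (feasible only for small p).
// The scoring function is a placeholder for a real model selection criterion.
public class BruteForceSearch {

    /** Returns the subset (as a bitmask) that maximizes the supplied score. */
    static long bestSubset(int p, ToDoubleFunction<boolean[]> score) {
        long bestMask = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (long mask = 0; mask < (1L << p); mask++) {
            boolean[] subset = new boolean[p];
            for (int i = 0; i < p; i++) subset[i] = ((mask >> i) & 1L) == 1L;
            double s = score.applyAsDouble(subset);
            if (s > bestScore) { bestScore = s; bestMask = mask; }
        }
        return bestMask;
    }

    public static void main(String[] args) {
        // Toy score: prefer subsets of exactly three variables (stand-in for a real criterion).
        long best = bestSubset(6, subset -> {
            int active = 0;
            for (boolean b : subset) if (b) active++;
            return -(double) Math.abs(active - 3);
        });
        System.out.println("best subset bitmask: " + Long.toBinaryString(best));
    }
}
```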

Availability and requirements

References

  1. Kontijevskis A, Prusis P, Petrovska R, Yahorava S, Mutulis F, Mutule I, Komorowski J, Wikberg JES: A look inside HIV resistance through retroviral protease interaction maps. PLoS Computational Biology 2007, 3(3).

  2. Wikberg JES, Lapinsh M, Prusis P: Proteochemometrics: A tool for modelling the molecular interaction space. In Chemogenomics in Drug Discovery - A Medicinal Chemistry Perspective. Edited by: Kubinyi H, Müller G. Weinheim , Wiley-VCH; 2004:289–309.

  3. Hansch C: A Quantitative Approach to Biochemical Structure-Activity Relationships. Accounts of Chemical Research 1969, 2: 232–239. 10.1021/ar50020a002

  4. Hvidsten TR, Wilczynski B, Kryshtafovych A, Tiuryn J, Komorowski J, Fidelis K: Discovering regulatory binding-site modules using rule-based learning. Genome Res 2005/06/03 edition. 2005, 15(6):856–866. 10.1101/gr.3760605

  5. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530–536. 10.1038/415530a

  6. Johnson SR: The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy). J Chem Inf Model 2008, 25–26. 10.1021/ci700332k

  7. Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy. The Lancet 2005, 365(9458):488–492. 10.1016/S0140-6736(05)17866-0

  8. Ntzani EE, Ioannidis JPA: Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment. The Lancet 2003, 362(9394):1439–1444. 10.1016/S0140-6736(03)14686-7

  9. Freyhult E, Prusis P, Lapinsh M, Wikberg JES, Moulton V, Gustafsson MG: Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling. BMC Bioinformatics 2005, 6(50):1–14.

  10. Golbraikh A, Tropsha A: Beware of q2! J Mol Graph Model 2002/02/23 edition. 2002, 20(4):269–276. 10.1016/S1093-3263(01)00123-1

  11. Stone M: Cross-Validatory Choice and Assessment of Statistical Predictions. In Journal of the Royal Statistical Society Series B (Methodological). Volume 36. Royal Statistical Society; 1974:111–147.

  12. Cartmell J, Enoch S, Krstajic D, Leahy DE: Automated QSPR through Competitive Workflow. J Comput Aided Mol Des 2005, 19(11):821–833. 10.1007/s10822-005-9029-8

  13. Cartmell J, Krstajic D, Leahy DE: Competitive Workflow: novel software architecture for automating drug design. Curr Opin Drug Discov Devel 2007, 10(3):347–352.

  14. Obrezanova O, Gola JM, Champness EJ, Segall MD: Automatic QSAR modeling of ADME properties: blood-brain barrier penetration and aqueous solubility. J Comput Aided Mol Des 2008, 22(6–7):431–440. 10.1007/s10822-008-9193-8

  15. Efron B, Tibshirani R: An Introduction to the Bootstrap. New York , Chapman & Hall/CRC. ; 1993.

  16. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. In Springer series in statistics. New York , Springer-Verlag; 2001:533.

  17. Schwarz G: Estimating the Dimension of a Model. In The Annals of Statistics. Volume 6. Institute of Mathematical Statistics; 1978:461–464. 10.1214/aos/1176344136

  18. The Selwood dataset[http://www.ndsu.edu/qsar_soc/resource/datasets/selwood.htm]

  19. Selwood DL, Livingstone DJ, Comley JCW, O'Dowd AB, Hudson AT, Jackson P, Jandu KS, Rose VS, Stables JN: Structure-activity relationships of antifilarial antimycin analogs: a multivariate pattern recognition study. Journal of Medicinal Chemistry 1990, 33(1):136–142. 10.1021/jm00163a023

  20. Nicolotti O, Carotti A: QSAR and QSPR studies of a highly structured physicochemical domain. J Chem Inf Model 2006/01/24 edition. 2006, 46(1):264–276. 10.1021/ci050293l

  21. Todeschini R, Consonni V, Mauri A, Pavan M: Detecting “bad” regression models: multicriteria fitness functions in regression analysis. Analytica Chimica Acta 2004, 515(1):199–208.

  22. Burman P: A Comparative Study of Ordinary Cross-Validation, v-Fold Cross-Validation and the Repeated Learning-Testing Methods. In Biometrika. Volume 76. Biometrika Trust; 1989:503–514.

  23. Efron B: The Estimation of Prediction Error: Covariance Penalties and Cross-Validation. In Journal of the American Statistical Association. Volume 99. American Statistical Association; 2004:619–632. 10.1198/016214504000000692

  24. Amaldi E, Kann V: On the Approximability of Minimizing Nonzero Variables Or Unsatisfied Relations in Linear Systems. Theoretical Computer Science 1997, 209: 237–260.

  25. Kubinyi H: Variable Selection in QSAR Studies. II. A Highly Efficient Combination of Systematic Search and Evolution. QSAR & Combinatorial Science 1994, 13(4):393–401. 10.1002/qsar.19940130403

  26. Java - The Source for Java Developers[http://java.sun.com/]

  27. Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JE: Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007/02/24 edition. 2007, 8: 59. 10.1186/1471-2105-8-59

  28. P[http://www.genettasoft.com/p/P.zip]

  29. JGAP - Java Genetic Algorithms Package[http://jgap.sourceforge.net/]

  30. JGAPm[http://www.genettasoft.com/p/JGAPm.zip]

  31. Shimodaira H: Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. Annals of Statistics 2004, 32: 2616–2641. 10.1214/009053604000000823

  32. pvclust[http://www.is.titech.ac.jp/~shimo/prog/pvclust/]

  33. Skurichina M: Stabilizing Weak Classifiers - Regularization and Combining Techniques in Discriminant Analysis. Volume PhD. Vilnius State University; 2001.

  34. Cho SJ, Hermsmeier MA: Genetic Algorithm Guided Selection: Variable Selection and Subset Selection. J Chem Inf Comput Sci 2002, 42(4):927–936. 10.1021/ci010247v

  35. Tibshirani R: Regression Shrinkage and Selection via the Lasso. In Journal of the Royal Statistical Society Series B (Methodological). Volume 58. Royal Statistical Society; 1996:267–288.

  36. Akaike H: A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974, 19(6):716–723. 10.1109/TAC.1974.1100705

  37. Shao J: An asymptotic theory for linear model selection. Statistica Sinica 1997, 7: 221–264.

  38. Wolpert D: Stacked Generalization. Neural Networks, 1992, 5: 241–259. 10.1016/S0893-6080(05)80023-1

  39. Kass RE, Wasserman L: A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion. In Journal of the American Statistical Association. Volume 90. American Statistical Association; 1995:928–934. 10.2307/2291327

  40. Wasserman L: Bayesian model selection and model averaging. In Mathematical Psychology Symposium. Bloomington, Indiana ; 1999.

  41. Kuha J: AIC and BIC - Comparisons of Assumptions and Performance. Sociological Methods & Research 2004, 33(2):188–229. 10.1177/0049124103262065

  42. Hoerl AE, Kennard RW: Ridge Regression: Biased Estimation for Nonorthogonal Problems. In Technometrics. Volume 12. American Statistical Association; 1970:55–67. 10.2307/1267351

  43. Goldberg DE: Genetic Algorithms in Search, Optimization and Machine Learning. Boston , Addison-Wesley Longman Publishing Co., Inc. ; 1989:372.

Acknowledgements

Supported by the Swedish VR (04X-05975). We extend our gratitude to three anonymous reviewers who helped to improve this paper.

Author information

Corresponding author

Correspondence to Martin Eklund.

Additional information

Authors' contributions

ME devised and implemented the proposed C1C2 framework. OS was involved in program design and aided with the implementation. JESW supervised the project. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2008_2345_MOESM1_ESM.csv

Additional file 1: The first of the four simulated datasets used in the demonstrations. This dataset with 15 observations was sampled from a multivariate normal distribution with mean 0 20 and covariance matrix I 20 (see article for details). (CSV 6 KB)

12859_2008_2345_MOESM2_ESM.csv

Additional file 2: The second of the four simulated datasets used in the demonstrations. This dataset with 200 observations was sampled from a multivariate normal distribution with mean 0 20 and covariance matrix I 20 (see article for details). (CSV 76 KB)

12859_2008_2345_MOESM3_ESM.csv

Additional file 3: The third of the four simulated datasets used in the demonstrations. This dataset with 15 observations was sampled from a multivariate normal distribution with mean 0 20 and covariance matrix S 20 (see article for details). (CSV 6 KB)

12859_2008_2345_MOESM4_ESM.csv

Additional file 4: The fourth of the four simulated datasets used in the demonstrations. This dataset with 200 observations was sampled from a multivariate normal distribution with mean 0 20 and covariance matrix S 20 (see article for details). (CSV 77 KB)

12859_2008_2345_MOESM5_ESM.xls

Additional file 5: C1C2 and repeated K-fold cross-validation estimates of Δ, λ, and ε for the simulated data. The factor combinations in model (6) are coded as: c1c2 – the C1C2 was used, k-fold – repeated K-fold cross-validation was used, ga – the GA search method was used, bf – the brute-force search method was used, 15n – 15 observations in the dataset, 200n – 200 observations in the dataset, uncor – orthogonal independent variables in the dataset, cor – correlated independent variables in the dataset, all – no assumption regarding the number of nonzero $\delta_i$, 3 – three $\delta_i = 1$ were assumed. All values are means ± 95% confidence intervals, assuming normal distributions for $\|\bar{\hat{\Delta}} - \Delta\|$, $\bar{\hat{\lambda}}$, and $\overline{|\hat{\varepsilon}_{gen,i} - \tilde{\varepsilon}_{gen,i}|}$. (XLS 49 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Eklund, M., Spjuth, O. & Wikberg, J.E. The C1C2: A framework for simultaneous model selection and assessment. BMC Bioinformatics 9, 360 (2008). https://doi.org/10.1186/1471-2105-9-360
