Cross-validated stepwise regression for identification of novel non-nucleoside reverse transcriptase inhibitor resistance associated mutations

Background Linear regression models are used to quantitatively predict drug resistance, the phenotype, from the HIV-1 viral genotype. As new antiretroviral drugs become available, new resistance pathways emerge and the number of resistance associated mutations continues to increase. To accurately identify which drug options are left, the main goal of the modeling has been to maximize predictivity and not interpretability. However, we originally selected linear regression as the preferred method for its transparency as opposed to other techniques such as neural networks. Here, we apply a method to lower the complexity of these phenotype prediction models using a 3-fold cross-validated selection of mutations. Results Compared to standard stepwise regression we were able to reduce the number of mutations in the reverse transcriptase (RT) inhibitor models as well as the number of interaction terms accounting for synergistic and antagonistic effects. This reduction in complexity was most significant for the non-nucleoside reverse transcriptase inhibitor (NNRTI) models, while maintaining prediction accuracy and retaining virtually all known resistance associated mutations as first order terms in the models. Furthermore, for etravirine (ETR) a better performance was seen on two years of unseen data. By analyzing the phenotype prediction models we identified a list of forty novel NNRTI mutations, putatively associated with resistance. The resistance association of novel variants at known NNRTI resistance positions: 100, 101, 181, 190, 221 and of mutations at positions not previously linked with NNRTI resistance: 102, 139, 219, 241, 376 and 382 was confirmed by phenotyping site-directed mutants. Conclusions We successfully identified and validated novel NNRTI resistance associated mutations by developing parsimonious resistance prediction models in which repeated cross-validation within the stepwise regression was applied. Our model selection technique is computationally feasible for large data sets and provides an approach to the continued identification of resistance-causing mutations.

Results: Compared to standard stepwise regression we were able to reduce the number of mutations in the reverse transcriptase (RT) inhibitor models as well as the number of interaction terms accounting for synergistic and antagonistic effects. This reduction in complexity was most significant for the non-nucleoside reverse transcriptase inhibitor (NNRTI) models, while maintaining prediction accuracy and retaining virtually all known resistance associated mutations as first order terms in the models. Furthermore, for etravirine (ETR) a better performance was seen on two years of unseen data. By analyzing the phenotype prediction models we identified a list of forty novel NNRTI mutations, putatively associated with resistance. The resistance association of novel variants at known NNRTI resistance positions: 100, 101, 181, 190, 221 and of mutations at positions not previously linked with NNRTI resistance: 102, 139, 219, 241, 376 and 382 was confirmed by phenotyping site-directed mutants.
Conclusions: We successfully identified and validated novel NNRTI resistance associated mutations by developing parsimonious resistance prediction models in which repeated cross-validation within the stepwise regression was applied. Our model selection technique is computationally feasible for large data sets and provides an approach to the continued identification of resistance-causing mutations.
Background Linear regression models have been shown to be accurate in predicting drug susceptibility from the HIV-1 viral genotype, by calculating the inhibitory concentration 50% (IC 50 ) log Fold-Change (FC) phenotype as a linear combination of parameters, which are mutations [1][2][3] and interaction terms (mutation pairs) [1]. The coefficients of these parameters are named resistance weight factors (RWF), and they quantify the effect on the log FC of the mutations and mutation pairs. To generate models that are able to make predictions for future genotypes, ideally only resistance associated mutations are selected for the models. As it is not feasible to explore all possible subsets of mutations, stepwise regression is used to incrementally generate a series of regression models by addition or removal of mutations in each step. Different performance criteria exist to select one final linear regression model from this series [4,5]. In [1] standard stepwise regression was applied, selecting mutations based on significance with the F-test using predefined p-values. However, since correction for multiple significance-testing is not taken into account and because p-value thresholds are arbitrary, other selection criteria are preferred. Information criteria exist that balance accuracy and parsimony by penalizing for the number of parameters in the models. Information criteria commonly used are Akaike Information Criterion (AIC [6]) and Schwarz Bayesian Criterion (SBC [7]) with the penalty in SBC being more severe than in AIC. Although standard stepwise regression is a fast method in generating a model, the model found may be too complex by containing redundant information. Therefore, techniques are required that increase the stability of subset selection in linear regression. In [8] bootstrap aggregation ('bagging') was presented where an averaged prediction is made using multiple models generated on random re-samples of the original data set with replacement ('bootstrap') [9]. In [10] the bootstrap was integrated in the automatic selection procedure itself as parameters were sequentially added according to the proportion of bootstrap models in which they were selected. We investigated whether cross-validation [4,5] could be used as a less computationally intensive resampling technique than bootstrapping to reduce the complexity of the linear regression models while maintaining accuracy and adding to interpretability by generating only one regression model. As such, we aimed for an improvement in reliability of information extracted from the models, in this case the identification of novel mutations that cause resistance to anti-HIV drugs. In this article we use VirtualPhenotype™-LM (Virco, Beerse, Belgium) [1] as a reference prediction model. Since June 2006, VirtualPhenotype™-LM has been a linear regression model that predicts the log FC based on mutations (first-order terms) and mutation pairs (second-order interaction terms accounting for synergistic and antagonistic effects). We propose a more robust selection procedure by making two major changes to this reference approach. First, models were developed directly in second-order: interaction terms could be selected as soon as the constituting mutations were both present as first-order effect. Second, mutations or interaction terms were selected by repeatedly applying 3-fold cross-validation. In this article we refer to this method as 3F. Repeated cross-validation was presented as a way to reduce variability in the prediction error estimate [11]. Moreover, as shown in [12,13], repeated multi-fold cross-validation leads to better model selection when increasing the size of the validation set. To evaluate the generalizability of the models, the prediction error was calculated on genotypes in an unseen data set with available measured phenotypes.
After generating linear models with reduced complexity we were able to effectively identify novel mutations associated with NNRTI resistance. The individual contribution to resistance of these mutations was experimentally validated by making site-directed mutants and determining in vitro resistance levels.

Reverse Transcriptase Inhibitors
For the reverse transcriptase inhibitors (RTI) a 3F model with lower complexity (using less mutations and less interaction terms) than the reference was found for AZT, 3TC, d4T, ABC, FTC, NVP, EFV and ETR (Table  1). For the nucleoside reverse transcriptase inhibitors (NRTI) class of drugs the reduction in interaction terms and mutations used in 3F versus reference was 20.3% and 11.9%, respectively. For the NNRTI class of drugs the 3F method was even more effective in reducing the complexity: the reduction in interaction terms and mutations was 38.3% and 26%, respectively. For all RTI with exception of AZT the 3F performance on unseen data equalled the reference (3F average squared error within 1% of the reference) or was better than the reference (for ETR: 3F average squared error 1.9% lower than in the reference) (Table 1). Moreover, the 3F model performance as compared to reference was maintained in subsets of unseen data samples, including the subset with one or more mutations included only in the reference model. Although the AZT 3F model had a lower SBC value on the training set than the reference, the averaged squared error on the unseen data was 1.5% higher in 3F. The reduction in the AZT 3F model compared to the reference of 28.7% interaction terms and 17.1% mutations thus resulted in 3F underfitting. Nevertheless, for AZT the concordance in susceptibility calls on the unseen data between 3F and reference was 94.67% and there were no 'major' discordances (fully susceptible or maximal response by reference but fully resistant or minimal response by 3F or vice versa) between the two approaches. In contrast, the ETR reference model had a lower SBC value on the training set than the 3F model but used 2.2 times more interaction terms and 1.3 times more mutations than the 3F model, thus implying reference overfitting. Here, the concordance in susceptibility calls between the two approaches was 90.56%, with one major discordance (see additional file 1: Comparison of susceptibility call between 3F and Reference on unseen data for ATV, AZT and ETR). An increase in the ratio of interaction terms versus single terms in a 3F RT inhibitor model had no significant impact on the 3F performance compared with the reference (P = 0.3673). In total, 172 and 196 different mutations were used as single terms in the NRTI and NNRTI 3F models, respectively ( Figure 1, and see additional file 2: Complexity and performance of 3F and Reference models on genotype-phenotype data sequenced at Virco up to September 2006). The latter include all of the 44 NNRTI resistance-associated mutations identified in [14], with the exception of 179G, 227C, and 236L. Seventy-two RT mutations were present as single terms in both NRTI and NNRTI 3F models, including the highly prevalent mutations 103N and 184V. Three interaction terms were common between the AZT 3F model and one or more of the NNRTI 3F models: 103N&181C, 103N&184V and 215Y&219E. Some rare variants at RT position 181 were identified with high RWF and relatively high standard error: 181F, 181G and 181S ( Figure 1). Linear discriminant analysis (LDA [5,15]) was conducted to compare the predicted FC distribution of genotypes with a mutation from the list compiled in [14] to the genotypes having the wild-type amino acid at the corresponding position. Mutations with the highest impact (LDA F1 [16]) on NNRTI resistance are shown in Figure  2. Ranking by impact (LDA F1) was largely similar when comparing the 3F models to the reference, and the RWF were also very similar in both approaches. However, some clear differences between 3F and reference were observed. For example, for EFV, 190Q (the mutation with the highest RWF in reference and 3F) had more impact in 3F (F1 = 0.133 vs. 0.089). For ETR, mutation 179F had more impact in 3F than in the reference, although the mutation was not present as firstorder effect in the 3F model ( Figure 2). To detect novel NNRTI resistance-associated mutations a similar LDA was conducted for the remaining 100 RT mutations from the list of 124 found as first-order effect in the  Figure 1 RT mutations as first order effect in the 3F linear regression models. RT mutations with their regression coefficient (RWF in log Fold Change) and standard error (STDERR) in the 3F linear models. Mutations with |RWF| ≥ 0.5 are labeled. RWF stands for Resistance Weight Factor (terminology adopted from VirtualPhenotype™-LM).
NNRTI 3F models but not in the NRTI 3F models. We ranked these novel mutations by their impact on NNRTI resistance, and retained the top 40 for further analysis as described in the next section.

Novel resistance-associated mutations
Forty novel NNRTI resistance-associated mutations were found satisfying the following criteria: LDA cutoff > 0 (mutation contributes to resistance) for all of the NNRTIs, RWF > 0 in the 3F model of at least one of the NNRTIs, and RWF ≮ 0 in any 3F model of NVP, EFV or ETR ( Table 2). The mutation with the highest impact (LDA F1) was 179M and occurred relatively rarely (31 times) in the LDA data set containing approximately 79,000 genotypes. Novel mutations that were more frequent with an impact > 0.01 were 376S, 138A and 357T. Mutation 138A has recently been recognized as an ETR resistance associated mutation [17]. Site-directed mutants were made for a selection of the novel mutations. Their resistance contribution was analyzed by looking at the mutant's FC relative to a drug-specific biological cutoff (BCO), used to separate viruses that are susceptible from those that show signs of resistance [18]. When comparing ETR to the other two NNRTIs in Table 2 in most cases the LDA F1 value for ETR was higher than for EFV and NVP (Table 2: max(F1)). However, the in vitro effect of the site-directed single mutations was, with exception of 138A, always below the BCO for ETR, while this was not the case for EFV and NVP. SDMs 100V, 190T, 101A and 101D had measured FC values above the BCO for NVP (> 6.0) and EFV (> 3.3). SDM 139R had an FC value above the BCO for NVP only. For the SDM 138A, the highest FC values measured were 4.9, (NVP, below BCO), 3.6, (EFV, above BCO) and 3.5, (ETR, above BCO (> 3.2)). Other mutations found with elevated FC for at least one of the NNRTIs were 221L, 219H, 219D, 376S, 102L, 101N, 234I, 382T, 139K and 241M.
As these novel mutations frequently co-occur with known resistance-associated mutations, we also studied their effect in genetic background containing such mutations. The Virco database of clinical isolates was searched for the most appropriate genetic backgrounds to test each of the novel mutations for their effect on NNRTI resistance [19]. For the mutations 139R, 219D and 219H we looked at their contributions to resistance in combination with the highest-impact NNRTI resistance mutants 103N and 181C ( Figure 2, and see additional file 3: Linear Discriminant Analysis (LDA) for 103N and 181C). The following site-directed mutants     Figure 2 Impact of known NNRTI resistance associated mutations. LDA on reference (blue) and 3F (red) predicted phenotypes. Mutations shown are from the list of 44 known non-nucleoside RT inhibitor resistance associated mutations [14] having F1 > 0, ranked by F1.  Frequency of mutation (not within a mixture) in LDA data set. f n 11 is the number of samples with amino acid mutation having a predicted phenotype above the LDA cutoff. g n 01 is the number of samples with wild type amino acid having a predicted phenotype above the LDA cutoff. h Cutoff in log Fold-Change (taking the wild-type and mutation frequency percentages as prior probabilities in the LDA can result in cutoff values outside of the range of the predicted phenotypes). were tested for all NNRTIs: 103N, 103N+181C, 139R +103N+181C, 219D+103N+181C and 219H+103N+181C (Table 3). While for NVP and EFV all of the above combinations were resistant, for ETR the single mutation 103N and the combination 103N+181C were susceptible with mean FC values of 0.9 and 3.0, respectively. The limited impact (3F LDA F1: 0.09) of 103N on ETR resistance was thus confirmed by the site-directed mutant and adding 181C did not result in a FC above the BCO. Remarkably, all the above triple combinations were found to be resistant for ETR, thereby clearly demonstrating the contribution to NNRTI resistance of the novel mutations. For ETR a 2.2, 6.3 and 4.6-fold increase in FC compared to 103N+181C was seen when adding 139R, 219D and 219H, respectively. For EFV a 1.8-2.6fold increase was seen as well. For NVP the contribution to resistance of the novel mutations in combination with 103N+181C could not be confirmed due to IC 50 values larger than the maximum assay concentration (Table 3).
Data for the novel mutations 102L, 138A, 139K, 139R, 179Y, 181F, 188F, 221L and 234I tested in different genetic backgrounds can be found in additional file 4: Site directed mutants of novel mutations tested for NVP, EFV and ETR, extending Table 2 with results of combinations. Mutation 181G was not in the list of novel mutations as it was always found in presence of other amino acids at position 181 (a 'mixture') in the LDA data set. A sitedirected mutant, containing 181G only, was made and found to be resistant to NVP (FC: 30.1-38.3) and having FC values of 1.1-1.6 for EFV and 0.9-1.5 for ETR (Table  3).

Discussion
In order to quantitatively predict drug resistance from the HIV-1 viral genotype we used 3-fold cross-validation for the stepwise selection of model mutations or mutation pairs. Importantly, we applied a random division in three parts before each removal step (see additional file 5: K Fold cross-validated stepwise regression using same or different random division before each removal step: ETR model). Thus, in the 3F method robustness was achieved by removing model parameters that were not consistently selected under different random divisions. As a consequence, the 3F method can be considered a greedy stepwise approach where the probability of a parameter being removed increases as the model size increases. As we had to generate models for several drugs this computational approach was a practical one for our purpose. Nevertheless, to make the model selection less greedy, one might consider repeating the crossvalidation at each step multiple times with a different random division in three parts, to improve the estimation of the prediction error.
As our goal was to reduce the complexity and maintain the accuracy of the reference models, we did not investigate defining a stop criterion for the 3F method independent from the reference. Instead, we retained all 3F models with better performance (AIC/SBC) than the reference on the genotype-phenotype data, holding out the two last months. The 3F model with best performance on the hold-out set was then selected as the final model, and the 3F model parameters were recalculated on the genotype-phenotype data including the hold-out set. Ultimately, in deciding between different approaches, the evaluation of performance on future observations remains the first priority. Therefore we tested the performance of the reference and 3F method on a large unseen genotype-phenotype data set sequenced at Virco between September 2006 and December 2008. For the RT inhibitors the 3F method maintained the accuracy of the reference models while reducing complexity of the linear regression models. We also evaluated the performance of the 3F method for protease inhibitors (see additional file 2: Complexity and performance of 3F and Reference models on genotype-phenotype data sequenced at Virco up to September 2006). An equal performance between 3F and reference was seen for the most recent protease inhibitors TPV and DRV, and with both methods the models were less complex than for the other PIs. For the older protease inhibitors the performance of the reference approach was better than the 3F approach. This suggests that protease inhibitors for which resistance patterns have become more complex over time also require more complex regression models, thus increasing the importance of interaction terms. This is in conflict with the constraint in the 3F method that allowed only interaction terms of which both first order effects were already present in the model during the stepwise regression.
As genotyping was done up to RT amino acid (AA) position 400, the 3F linear models of the RTIs, with exception of 3TC and FTC, also contained connection domain mutations (AA 289-423 of RT). Two out of eight of the connection domain mutations described in [20] to be associated with AZT resistance, were found to increase resistance as single term in the AZT 3F model: 348I and 360V. In [21], 376S was found to be associated with an increased risk of virological failure to NVP-based therapy in NNRTI-naïve patients and we could confirm the in vitro effect on NNRTI resistance by site-directed mutagenesis. In [22], resistance predictions for AZT, NVP and EFV by current genotypic drug resistance interpretation systems were found to be noninferior when predicting from short RT sequences (AA 41-238 for RT) compared to full RT sequences. Nevertheless, our analysis suggests that there are 12 novel NNRTI resistance associated connection domain mutations, including 376S and 382T, that could provide more accurate predictions of NNRTI drug resistance.
Mutations found as a single term in both NRTI and NNRTI 3F models consisted of mutations at NRTI positions associated with NNRTI hypersusceptibility [23][24][25], including 215Y, found to be ETR hypersusceptible in the reference and 3F models, of connection domain mutations, including 348I, and of NNRTI resistance associated mutations from the list compiled in [14]. Our 3F models suggest that these latter mutations might have an effect on NRTI resistance as well (two have already been confirmed to have such an effect: 100I [26] and 181C [27]). In this study we were interested in novel NNRTI resistance associated mutations and we considered RT mutations found in the NNRTI 3F models only.
By making site-directed mutants we compared the resistance effect of mutations as first-order effect in the models with the in vitro effect. These results show that despite the reduction in model complexity for the 3F approach, one must still be cautious when equating linear model coefficients with in vitro effect. For example, mutation 181F has only a small in vitro effect for NVP whereas in the 3F regression model this mutation has a high first-order coefficient. A likely explanation for this is that in the same 3F model interaction terms of 181F with 98G, 103N, 106M, 190A (all known resistance key mutations [14]) with a strong predicted resensitizing effect are present as well, canceling out the resistance contribution of 181F as first-order effect. Specifically, we could confirm the resensitizing effect of the combination 181F+103N by comparing the resistance levels of SDM 181F+103N and SDM 103N. (see additional file 4: Site directed mutants of novel mutations tested for NVP, EFV and ETR). The systematic co-occurrence of 181F with known resistance mutations probably resulted in a high ranking for mutation 181F in the list of novel NNRTI resistance associated mutations.
As another example for ETR 179F is present as a firstorder effect with a large resistance coefficient in the reference model whereas in the 3F model 179F is not present as a first-order effect. Inspecting this further we found the interaction term 179F&181C to be present with a strong resistance effect in the 3F model, but absent in the reference model. In this case the 3F method was in line with in vitro selection experiments done in [28], where the synergistic effect of 181C and 179F on the decreased susceptibility of ETR was already described. Moreover, the one genotype with measured resistant phenotype that resulted in a major discordance for ETR between 3F (Minimal Response) and reference (Maximal Response) contained exactly the 179F&181C combination (see additional file 1: Comparison of susceptibility call between 3F and Reference on unseen data for ATV, AZT and ETR). Most of the novel derived NNRTI resistance associated mutations had the highest impact (LDA F1 value) on resistance for ETR but a relatively low in vitro effect in comparison with NVP and EFV. Co-occurrence of multiple resistance associated mutations is needed to cause an elevated ETR FC, as exemplified by the SDMs containing two known resistance associated mutations and one novel mutation (e.g. 139R+103N +181C) with FC values above the BCO for ETR.

Conclusions
By applying repeated 3-fold cross-validation within the stepwise regression, we could lower the complexity of linear regression models for predicting drug resistance while retaining performance on unseen data. The described 3F method thus proves to be a tractable approach when interpretation of the linear model is an objective. As the 3F method worked particularly well for the non-nucleoside reverse transcriptase inhibitors, we derived a list of forty novel NNRTI resistance associated mutations. For a selection of the novel mutations we confirmed their in vitro contribution to resistance by site-directed mutagenesis, individually or in combination with known resistance mutations. As most of these novel mutations were found at relative low frequency in patient samples, carefully following up on drug resistance in patients with viruses carrying these mutations may provide more insight in understanding how the virus escapes the current antiretroviral treatment, as well as in the design of novel drugs.

Genotype-phenotype database
The size of the data sets of genotypes (AA 1-99 of Protease (PR) and AA 1-400 of RT) with an available measured phenotype (Antivirogram, Virco) ranged from approximately 20,000 to 56,000 samples for all protease and reverse transcriptase inhibitors. Phenotypes were measured as the log FC in IC 50 of a sample relative to a wild-type laboratory reference strain.

Reference Linear Models
VirtualPhenotype™Linear Models were calculated for all drugs on samples sequenced at Virco up to July 2006 and recalculated 2 months later (samples up to September 2006). Two stepwise regression procedures were used to develop the models [1]. First, a model was calculated using single mutations only. In the second stepwise regression procedure, all possible interaction terms containing the mutations present in the first-order model were also considered for selection, to account for synergistic and antagonistic effects. Cutoff values for p-values were used to determine which mutations or interaction terms could enter the model (SLE) or stay in the model (SLS) (Figure 3). The minimal number of occurrences in the database for mutations or interaction pairs to be considered in the model was 10 for FTC, ETR, ATV, TPV and DRV and 20 for all other drugs, limiting the number of mutations considered to approximately 300 for PR inhibitors and 1000 for RT inhibitors. Mutations occurring in mixtures (ambiguous sequencing results) were weighted accordingly, mixtures of more than four amino acids were not considered. No validation set was used in selection of the reference models.

3F Linear Models
In the 3F method three fold cross-validated prediction error sum of squares (CVPRESS) was used as selection criterion for parameter inclusion, instead of the significance levels (SLS and SLE) used in the reference approach. The motivation for this choice of fold is described in the next section. During 3-Fold cross-validation the data set is randomly split in three equal-sized parts, and a model is generated three times on two of the three parts and validated/tested on the remaining part. In this repetitive modeling each part is used only once for validation so that every genotype is treated once as unseen. A mutation was selected based on the performance (prediction error) calculated for the 'unseen' genotypes. We applied this cross-validation directly in each step of the selection procedure in evaluating which parameter to add or which parameter(s) to remove from the model. The initial search space consisted of all individual mutations. As soon as two mutations at different positions were selected, the cross-term of both could enter the model (we did not force this hierarchy constraint in the removal step, meaning that mutations could be removed as first-order effect while still present in an interaction term). The search space was thus dynamically enlarged with interaction terms as more single mutations were added (Figure 3). Models were calculated on the July 2006 data set. The same minimal counts (10/20)  second order model Figure 3 Reference and 3F methodology: schematic overview.
In the 3F method cross-validated prediction error (CVPRESS) was used instead of significance levels (p-values) in the reference approach. In the reference approach two stepwise regression procedures were used: all possible mutation pairs were made from mutations in the first order model and candidate for entry in the second order model. In the 3F method, the initial search space consisted of all individual mutations. Mutation pairs could only enter the model if both mutations in the pair were already selected for the model. already in concept selected for less interaction terms. The CVPRESS selection criterion can be specified directly in the SAS GLMSELECT procedure [29]. Therefore CVPRESS can be applied in the same way as selection criteria like AIC or SBC.

Randomized stepwise selection
By using CVPRESS as selection criterion, a choice of fold K had to be made. We selected ETR as drug for evaluating different K. In short, the results of this evaluation were as follows (see additional file 5: K Fold cross-validated stepwise regression using same or different random division before each removal step: ETR model). Keeping the same random division throughout the stepwise regression, K = N (Leave-One-Out, also called PRESS [4]) was found too costly to compute [5,30]. Moreover, as the N training sets were almost similar, the improvement on the reference was rather a result of the hierarchy constraint for introduction of interaction terms than of the cross-validation itself. Nevertheless, choosing K < N automatically leads to selection bias. For K = 3, an apparent decrease in the ratio of (SBC CV -SBC) over the number of model parameters was observed as the model size increased, suggesting overfitting. Therefore, to generate better generalizing models, we decided to alter the random division before each removal step in the stepwise regression. Moreover, we now found K = 3 to be the best choice of fold in generating a model with lower SBC than the reference, and using the lowest number of parameters. Choosing K = 2, the required performance, set by the reference, could not be achieved using a limited number of random divisions, due to an increase of the number of parameter removals. Consequently, for K = 2, also the difference between SBC CV and SBC was found to be the largest, indicating that parameter selection became more difficult when only half of the samples was used for training of the model. The randomized stepwise selection procedure can now be described. An initial forward stepwise regression was executed until two mutations at different positions were present in the model. After that the stepwise selection procedure amounted to the execution of multiple backward-forward regression steps cycles, with each cycle consisting of a backward/removal step followed by a forward/addition step. To avoid overfitting, a different random division in three parts was used in each of these cycles. As a consequence of this randomization, it was possible that the latest parameter added in the forward step of cycle i was directly removed in the backward step of cycle i+1, for instance when a model parameter was only present in one of the three parts of the new random division. This was in fact the reason why we decided to alter the random division before each backward step and not before each forward step. Crossvalidation thus penalized infrequent interaction terms to get/stay in the model but was nevertheless less stringent than applying a minimal count rule as was done in the reference method.

3F model selection
The Virco genotype-phenotype data set between July and September 2006 was used as validation/test set. The 3F models were then selected as follows (Table 4): i) select all models with a better performance than the reference model on the training set with either AIC or SBC and ii) from the models selected in i), select the model that has the best performance (average squared error) on the test set. The selected 3F models were then refitted on all samples (including the validation set) to actually compare them to the September 2006 reference models.
AIC is defined as n ln(SSE/n) + 2p. SBC is defined as n ln(SSE/n) + ln(n)p. In the above definitions SSE is the Sum of Squared Errors, n is the number of genotypesphenotypes and p is the number of parameters in the linear regression model [4].

Linear Discriminant Analysis
To rank RT mutations by their impact on NNRTI resistance, a linear discriminant analysis (LDA) was done on approximately 79,000 predicted phenotypes calculated from genotypes sequenced at Virco between September 2006 and December 2008. Note that the LDA was done on the calculated phenotypes and not on the measured phenotypes in order to incorporate as many recent genotypes as possible, especially since some of the novel NNRTI mutations are infrequent in occurrence ( Table  2). In order to analyze the impact of a mutation at position i on resistance, a division of genotypes into two groups was made. The first group contained all genotypes containing the wild-type (HXB2) amino acid at position i and the second group the genotypes with the amino acid mutation. Genotypes containing a mixture of amino acids at the mutated position were not considered. A contingency table was then made to analyze how well the two groups could be separated using the LDA cutoff applied on the calculated phenotypes. The following metrics were calculated on the two-by-two contingency table: precision, recall and F1. Precision is defined as n 11 n 11 + n 01 . Recall is defined as n 11 /N 1 . n 11 is the number of samples with the amino acid mutation having a calculated phenotype above the LDA cutoff. n 01 is the number of samples with the wild-type amino acid having a calculated phenotype above the LDA cutoff. N 1 is the number of samples with the amino acid mutation. As precision (positive predictive value) of the LDA discrimination in log FC of the mutated group from the wild type group at a position i can be high, while recall (sensitivity) is low or vice versa, we used the F1 metric, trading off precision and recall, for the ranking of mutations for their impact on resistance. F1 is defined as (2 × p × r) (p + r) and thus equally weights precision (p) and recall (r). Ranking by impact on resistance (F1) was done for the known NNRTI resistance-associated mutations. For novel mutations, exclusively present as first-order effect in the 3F NNRTI linear regression models (thus absent in 3F nucleoside reverse transcriptase linear regression models), ranking for being associated with resistance was done using F1 if p + r > 0 and by LDA cutoff otherwise. LDA analysis was done for both the reference and 3F calculated phenotypes calculated using the September 2006 models.

Site-Directed Mutants
Site-directed mutants were created at Eurofins Medigenomix GmbH (Ebersberg, Germany) using the linear reaction method. In this method, the template DNA is linearly amplified using a mutagenesis-grade high-fidelity DNA polymerase which extends the mutagenic primers containing the desired mutation, incorporating the mutation of interest into the newly synthesized strands. The unique primer design allows replication of only the parental strand. Final treatment with Dpn I ensures the digestion of only dam-methylated parental strands. The resulting mutagenic strands were then transformed in ultracompetent cells and cultured on an agar plate. Single colonies were sequenced to ensure the availability of the correct mutation in the strand. A colony of a correct mutation containing strand was cultured and the purified plasmid shipped to Virco. Starting from this plasmid, the Protease -Reverse transcriptase region (AA 1-99 of PR and AA 1-400 of RT) was amplified and transfected into 293T cells and recombined with the deletion backbone by homologous recombination [31]. The cultivated virus was then grown against a standard set of anti-HIV drugs.

Additional material
Additional file 1: Comparison of susceptibility call between 3F and Reference on unseen data for ATV, AZT and ETR. For ATV, AZT and The number of different random divisions used in the stepwise regression in the selected 3F model. f For the nucleoside RT inhibitors the number of random divisions needed was less than 100, with exception of AZT and TDF. g For the non-nucleoside RT inhibitors most random divisions were needed for ETR. h ATV was the only drug for which more than 1000 different random divisions were needed.
Van der Borght et al. BMC Bioinformatics 2011, 12:386 http://www.biomedcentral.com/1471-2105/12/386 ETR a concordance analysis in susceptibility calls was done using vircoTYPE 4.2 clinical cutoffs [34,35]. AZT and ETR were the RTIs with difference in average squared error between 3F and reference larger than 1.0%. ATV was the drug for which most random divisions were needed to generate the 3F model (Table 4), and one of the PIs with averaged squared error > 7% higher in 3F than in the reference.
Additional file 2: Complexity and performance of 3F and Reference models on genotype-phenotype data sequenced at Virco up to September 2006. Complexity of the 3F models for the NRTIs, NNRTIs and PIs, and performance on training and test set. The 296 RT mutations found as single term in the RTI 3F models are listed as i) single terms exclusively found in NRTI 3F models, ii) single terms exclusively found in NNRTI 3F models and iii) single terms found in both NRTI and NNRTI 3F models.
Additional file 3: Linear Discriminant Analysis (LDA) for 103N and 181C. 3F LDA F1 impact on resistance of 103N is largest for NVP: 0.75, then for EFV: 0.63 and then for ETR: 0.09. 3F LDA F1 impact on resistance of 181C is largest for ETR: 0.56, then for EFV: 0.19 and then for NVP: 0.11. LDA cutoff (blue line) is shown to discriminate between samples with wild type at position 103/181 and samples with mutation 103N/181C for which the density histograms are shown. Frequency of wild type (not within a mixture) in LDA data set was 62,010 and 72,643 for positions 103 and 181, respectively. Frequency of mutation (not within a mixture) in LDA data set was 12,012 and 5043 for 103N and 181C, respectively.
Additional file 4: Site Directed Mutants of novel mutations tested for NVP, EFV and ETR. Fold Change (FC) was calculated as the IC 50 of the site-directed mutant divided by the IC 50 of a wild-type laboratory reference strain. All SDMs were measured three times (unless indicated otherwise) and FCs for each of the three measurements are shown. SDMs used as genetic background for evaluating the contribution to resistance of the novel mutations, are given at the top of the file. Noteworthy, the in vitro drug resistance interaction mechanism of the novel mutation and the known NNRTI resistance associated mutations was not always additive: 181F contributed to resensitization to EFV of the 103N mutated virus, 179Y contributed to resensitization to NVP and EFV of the 190A mutated virus.
Additional file 5: K Fold cross-validated stepwise regression using same or different random division before each removal step: ETR model. Different choices of fold K were evaluated for the ETR model. The goal was to find a linear regression model with better SBC than the reference and at the same time using less parameters. (A) When keeping the same random division during the stepwise regression, selection bias resulted in more overfitting, when lowering K. By altering the random division before each removal step, for K = 3 the reference goal SBC was reached with the lowest number of parameters. (B) The difference between SBC CV and SBC (calculated as n ln(CVPRESS/SSE)) was found to be larger when lowering the number of folds K, in case a different random division was used before each removal step. (C) When lowering K, using a different random division before each removal step resulted in more parameter removals. Whereas for K = 3, a model with the required SBC was found using 700 backward-forward cycles, for K = 2, the model size did not increase fast enough during the stepwise procedure as 2000 backward-forward cycles were not sufficient to reach the goal SBC.