A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator

Doherty, Trevor; Dempster, Emma; Hannon, Eilis; Mill, Jonathan; Poulton, Richie; Corcoran, David; Sugden, Karen; Williams, Ben; Caspi, Avshalom; Moffitt, Terrie E.; Delany, Sarah Jane; Murphy, Therese M.

doi:10.1186/s12859-023-05282-4

Research
Open access
Published: 01 May 2023

A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator

Trevor Doherty^1,2,
Emma Dempster³,
Eilis Hannon³,
Jonathan Mill³,
Richie Poulton⁴,
David Corcoran⁵,
Karen Sugden⁷,
Ben Williams⁷,
Avshalom Caspi^6,7,
Terrie E. Moffitt^6,7,
Sarah Jane Delany⁸^na1 &
…
Therese M. Murphy¹^na1

BMC Bioinformatics volume 24, Article number: 178 (2023) Cite this article

2653 Accesses
6 Citations
4 Altmetric
Metrics details

This article has been updated

Abstract

Background

The field of epigenomics holds great promise in understanding and treating disease with advances in machine learning (ML) and artificial intelligence being vitally important in this pursuit. Increasingly, research now utilises DNA methylation measures at cytosine–guanine dinucleotides (CpG) to detect disease and estimate biological traits such as aging. Given the challenge of high dimensionality of DNA methylation data, feature-selection techniques are commonly employed to reduce dimensionality and identify the most important subset of features. In this study, our aim was to test and compare a range of feature-selection methods and ML algorithms in the development of a novel DNA methylation-based telomere length (TL) estimator. We utilised both nested cross-validation and two independent test sets for the comparisons.

Results

We found that principal component analysis in advance of elastic net regression led to the overall best performing estimator when evaluated using a nested cross-validation analysis and two independent test cohorts. This approach achieved a correlation between estimated and actual TL of 0.295 (83.4% CI [0.201, 0.384]) on the EXTEND test data set. Contrastingly, the baseline model of elastic net regression with no prior feature reduction stage performed less well in general—suggesting a prior feature-selection stage may have important utility. A previously developed TL estimator, DNAmTL, achieved a correlation of 0.216 (83.4% CI [0.118, 0.310]) on the EXTEND data. Additionally, we observed that different DNA methylation-based TL estimators, which have few common CpGs, are associated with many of the same biological entities.

Conclusions

The variance in performance across tested approaches shows that estimators are sensitive to data set heterogeneity and the development of an optimal DNA methylation-based estimator should benefit from the robust methodological approach used in this study. Moreover, our methodology which utilises a range of feature-selection approaches and ML algorithms could be applied to other biological markers and disease phenotypes, to examine their relationship with DNA methylation and predictive value.

Peer Review reports

Background

Epigenetic biomarkers such as those derived from 5-methyl cytosine (DNA methylation) can help to address important questions across a myriad of biological fields. DNA methylation-based estimators and classifiers are models developed using statistical and machine learning (ML) methods that utilise DNA methylation data for the estimation of a range of variables including biological age, telomere length (TL), disease diagnosis, smoking status and body mass index (BMI). As illustration of their utility—instead of using chronological age, an imperfect surrogate measure of the ageing process, DNA methylation-based age estimates that indicate biological aging of a person can be used to investigate the impact of stress factors on individuals of the same chronological age [1]. Similarly, many studies utilise self-reported smoking information whose inaccuracies can propagate through medical research studies [2, 3]. An accurate epigenetic estimator of smoking history can serve to ameliorate this issue. Furthermore, the importance of epigenetic mechanisms such as DNA methylation has become evident in the pathogenesis of various diseases, with DNA methylation markers emerging as potential clinical biomarkers [4]. Recently, an expanding body of work involving DNA methylation-based ML and deep learning approaches and their utility to estimate and predict a range of quantitative traits including chronological age [5,6,7,8,9,10,11,12,13,14,15,16], epigenetic smoking scores [17,18,19,20] and body mass index (BMI) [21] have increasingly become apparent.

Current limitations in the accuracy and efficacy of developed estimators include the fact that each estimator’s performance is data set dependent, leading to variability in the markers selected across different estimators of the same traits. Furthermore, studies that utilise DNA methylation-based data for purposes such as disease classification or estimation of a trait sometimes only explore a single feature-selection method or, at best, a relatively limited range. Data sets generated from high-throughput DNA methylation arrays measure methylation levels at CpG sites along the DNA sequence, thus providing features for statistical and ML models. They typically contain extremely high numbers of features in combination with small sample sizes, and can suffer from the curse of dimensionality [22]. Feature-selection methods can return a subset of variables that may reduce the effects of noise or irrelevant features while still providing useful prediction results [23]. Additionally, they can reduce computation time and potentially improve predictive performance [24, 25]. They are often employed to reduce the high dimensionality of input datasets, mitigate collinearity, and, in conjunction with a learning algorithm, estimate the quantitative trait of interest.

There are different approaches to feature-selection. Filter methods, a commonly used approach for high dimensional data, usually perform feature ranking based on statistical or information theoretic measures which generate a score that captures the amount of information each independent variable has about the dependent variable [25]. These fast methods are independent of the predictive model but can suffer from selecting redundant features. Choosing the best filter approach depends very much on the level of computational resources available to the researcher [25]. Wrapper methods in contrast utilize performance metrics of the predictive model to select the best feature subsets while embedded methods include feature-selection in the process of the modelling algorithm’s execution [26, 27].

These feature-selection methods naturally lead to dimension reduction but there are other methods which can also achieve this. Principal component analysis (PCA) is the dominant feature transformation technique, capturing most of the variance in the data by projecting the data into a reduced feature space. This transformative method, which is commonly used in ML, yields a set of orthogonal variables (principal components) that are linear combinations of the original variables [28]. In this sense, it has less utility for identifying specific features e.g., for acquisition of an explicit biological signature. However, its ability to tackle multicollinearity (the presence of strong relationships between variables in a data set), which can impact the performance of statistical and ML-based models, makes it a potentially powerful technique for development of an estimator primarily sought for its predictive capability [29]. It has recently been employed in [30] to reduce noise in CpG-level DNA methylation data and improve reliability of epigenetic clocks.

Using telomere length as a vehicle to explore a robust methodology, our study wishes to build on previous ML studies in the epigenomics field by evaluating a range of feature-selection methods and ML algorithms to identify a novel DNA methylation-based estimator of TL and investigate its association with other health-related demographics. Telomere length—DNA repeat structures which are located at the ends of each chromosome and have a crucial role in maintaining genomic stability—has emerged as a promising biomarker for biological age [31,32,33]. In recent years an association between TL and epigenetic processes has been hypothesised. Recently, Lu et al. [34] developed a DNA methylation-based estimator of TL (DNAmTL) which utilised 140 cytosine-phosphate-guanine dinucleotides (CpGs) using a regression-based ML approach (i.e. elastic net [35]), highlighting the power of ML methods to develop robust DNA methylation-based estimators.

In this study, we first reviewed the literature to ascertain feature-selection and learning algorithms commonly used in the epigenomics field for the estimation of quantitative traits. Due to the popularity in the published literature of elastic net penalised regression for epigenetic aging signatures and its use in the previously reported DNA methylation-based TL estimator [34], this approach forms the baseline algorithm in our study. Many studies also utilise some form of initial feature-selection in advance of applying elastic net, therefore we also investigated applying a range of feature-selection methods as an initial step. Previous studies have utilised association tests corrected for multiple testing, using false discovery rate (FDR) thresholds [6,7,8]. The FDR has demonstrated ability to detect true positives while controlling Type I errors at a designated level with methods such as the Benjamini and Hochberg step-up procedure (BH) [36] and q-value [37] arguably the most widely used and cited approaches for FDR control in practice [38]. In addition to association tests, studies have employed thresholding of correlation between CpGs and the quantitative target feature using Pearson’s correlation coefficient [6, 13, 39,40,41], and mutual information [42,43,44].

Additionally, ensemble ML algorithms such as random forest can be used to obtain a set of ranked features and is among the methods investigated in this study. Several comparative studies demonstrated that Support Vector Regression (SVR) performed well as an alternative to elastic net and other regression-based approaches [8, 41, 45], and we also explore this method along with several other regression algorithms in conjunction with feature-selection. In total, we develop and evaluate thirteen TL estimators, each adopting a different feature-selection and regression algorithm methodology. In addition to identifying a novel DNA methylation-based estimator of TL, we aim to develop a robust methodology utilising ML algorithms which could be applied to other biological markers and disease phenotypes, to examine their relationship with DNA methylation.

Methods

Data description

The data is comprised of 3 cohorts (Dunedin, EXTEND and TWIN), which have measures of both TL and Illumina DNA methylation array data (summarised in Table 1). The Dunedin data set pertains to a cohort (n = 1037) born in Dunedin, New Zealand between April 1972 and March 1973 as described elsewhere [46]. Assessments occurred at a range of ages, most recently at age 45 when 938 of the living 999 study members took part. We have used data from two sweeps of the study i.e., ages 26 and 38. With two time points, most participants contributed two samples and, after preprocessing, 1631 samples were utilised. The DNA methylation data was derived from whole blood and measured using the Illumina Infinium HumanMethylation450 BeadChip [47] (Illumina, CA, USA). TL was measured using a validated quantitative PCR (qPCR) method [48], as previously described [49].

Table 1 Summary details—data sets

Full size table

The validation EXTEND data set (n = 192) is a subset of the Exeter 10,000 epidemiological cohort as described in detail elsewhere [50]. The second validation TWIN data set (n = 178) involves a cohort previously recruited from a multi-centre collaborative project aimed at identifying DNA methylation differences in MZ twin pairs discordant for schizophrenia as described elsewhere [51]. The same laboratory measured TL by a validated qPCR method [52] in both the EXTEND and TWIN validation cohorts, as described previously [50]. DNA methylation levels were measured in the same labs using the Illumina Infinium methylation array platform for both the EXTEND and TWIN validation cohorts. For all 3 data sets the methylumi package [53] was used to extract signal intensities for each CpG probe and perform initial quality control, with data normalization and pre-processing using the wateRmelon package as described previously [54]. All experiments were performed in accordance with national guidelines and regulations as well as the Declaration of Helsinki. The Dunedin study participants gave written informed consent, and study protocols were approved by the NZ-HDEC (Health and Disability Ethics Committee). For both the EXTEND and TWIN studies, informed consent was obtained from participants and ethical approval was obtained from University of Exeter Medical School Research Ethics Board.

DNA methylation levels at CpG sites formed the input features for models while the target variable was TL. For consistency across datasets relative telomere lengths were re-calculated for each cohort separately as described previously [49]. Briefly, reaction efficiencies (E) were calculated from the standard curve slope using E = 10(1/-slope). Next, relative quantities (RQs) were calculated using RQ = EΔCt where ∆Ct is the difference between the average Ct value of the within-plate mean Ct, and the Ct value of the individual sample. Sample RQ values were calculated for each reaction separately (T and S). The telomere length relative to the amount of single-copy transcript was calculated using the ratio RQ(T)/RQ(S). Finally, relative telomere lengths were adjusted based on plate ID via linear regression to control for plate-to-plate variation for each cohort. Any CpG columns containing missing values were not utilised in the development of the final estimators, with the initial set of CpG sites restricted to 417,690 common to all three data sets.

Modelling overview

For the application of the regression algorithms and the feature-selection methods used in this paper, the Python software package scikit-learn (version 0.23.2) was utilised [55]. Stage 1 of our analysis involved the comparison of a range of models that utilised different feature-selection methods in conjunction with elastic net regression i.e., DNA methylation-based TL estimators. A nested cross-validation process, which includes hyperparameter tuning, was utilised for model comparison which has been suggested as a more appropriate approach when assessing a model, providing a reliable estimate of the true error [56, 57], and avoiding information leakage from test data into the training process. A range of performance metrics are reported in this study. These include the Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Pearson correlation coefficient between predicted and actual TL (described in Section 1 of Additional file 1).

The nested cross-validation (CV) process to assess model performance was employed on the Dunedin data set. The data was first split into train and test data sets in 80%:20% proportions respectively i.e., DS_train and DS_test. Each feature-selection method (or feature transformation method in the case of PCA) was applied wholly to DS_train, which yielded a subset of features. Using only these features, threefold cross-validation was then applied to DS_train using elastic net in order to choose the best model hyperparameters (via a random search) based on the mean absolute error (MAE) metric. The next step involved constructing an elastic net model on DS_train (using the discovered feature subset and best parameters). This model (i.e., the estimator) was then used to predict the instances contained in DS_test, yielding a range of pertinent performance scores (MAE, MAPE and correlation coefficient between predicted and actual TL) to indicate estimator efficacy. This constitutes one iteration of the fivefold outer cross-validation that is part of the 5 × 3 nested CV process, and it is therefore repeated for each of the other outer cross-validation iterations. The nested CV process was conducted for each of the 9 investigated estimators (the baseline using elastic net and 8 feature selection/transformation methods followed by elastic net). The estimators were then compared by considering the combination of the aforementioned performance scores. The feature-selection methods and their application in this first stage of the analysis are outlined in “Feature-selection methods” Section.

The next stage of the analysis (Stage 2) involved using the same 9 approaches from Stage 1 to construct DNA methylation-based estimators from the full Dunedin data set. These estimators were then tested on the two sets of independent data (EXTEND and TWIN). Unlike the performance results from the first stage which were generated from nested CV analysis on a single data set, the results of this second stage of analysis represent estimator performance on data originating from different laboratories and populations from which the estimators were trained. As such, this extended our model comparison process from nested CV on a single data set to evaluation on 2 independent data sets.

To obtain feature subsets from the Dunedin data, the 9 approaches were applied to the full Dunedin data set, in each case returning an N feature subset. With these N features, tenfold cross-validation, with a grid search, was conducted on the Dunedin data to find a good set of model hyperparameters. Using the N features and best identified parameters, estimators were then constructed from the Dunedin data and used to make predictions on the 2 independent validation sets. The majority of subjects in the Dunedin data set contributed 2 samples for analysis. Therefore, all data partitioning (such as within the cross-validation process) was implemented such that, where 2 samples pertain to a subject, both samples were contained within the same partition. This prevents information leakage, ensuring subject independence across data partitions. An overview of the process is presented in Fig. 1.

Elastic net regression and feature-selection methods

Elastic net regression

Given its common usage in DNA methylation-based age prediction studies [5, 7, 8, 10] and its application in a recent TL estimator [34], elastic net regression was used as a baseline model for TL estimation from DNA methylation data in this study. Regularized models, like elastic net regression, facilitate the selection of predictive CpGs among correlated markers when the ratio of features to samples is very large [58]. With too much freedom, a model can be prone to over-fitting due to too many available features. One solution to this is regularisation where extra terms are introduced into the objective function that penalise extreme values of regression coefficients and, thus, encourages them to take small values unless absolutely necessary. Elastic net regression is an embedded feature-selection approach, and its algorithm includes dimensionality reduction or intrinsic feature-selection. This may be a reason for its popularity in this domain.

In elastic net regression, the input variables, X, and output variable, y, are represented by a least squares relationship, given as: y = β₀ + βX + e, where β₀ is the intercept, and β and e are the vectors of regression coefficients and residuals respectively. Elastic net employs a mixture of both the l₁ and l₂ penalties and can be represented as:

$$\left( {1 + \frac{{\lambda_{2} }}{n}} \right)argmin_{\beta } \left| {\left| {y - X\beta ||_{2}^{2} + \lambda_{2} } \right|} \right|\beta ||_{2}^{2} + \lambda_{1} ||\beta ||_{1}$$

where $\|y - X\beta_{2}^{2}\|$ is the l₂-norm loss function (i.e., the residual sum of squares), $\|\beta_{2}^{2}\|$ is the l₂-norm penalty on $\beta$, λ₂ is a regularisation (complexity) parameter, $\beta_{1}$ is the l₁-norm penalty on β and λ₁ is a regularisation parameter. Setting ${\upalpha } = \frac{{\lambda_{2} }}{{\lambda_{1} + \lambda_{2} }}$, the elastic net estimator is shown to be equivalent to the minimisation of: argmin_β $\|y - X\beta_{2}^{2}\|$, subject to the penalty P_β(α) = (1–α)$||\beta ||_{1}$ + $\left| {\left| {y - X\beta ||_{2}^{2} + {\upalpha }} \right|} \right|\beta ||_{2}^{2}$ ≤ s, for some s [59]. The parameter α is known as the mixing parameter as it defines how the l₁ and l₂ regularisation is mixed. The elastic net model was tuned using the regularisation parameters, λ, and the mixing parameter α. Over a grid search of parameter values, the parameter pairing which returned the minimum MAE corresponded to the optimal model from the cross-validation parameter tuning process. The parameter ranges used in the grid search process were [1 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³, 1 × 10⁻², 1 × 10⁻¹, 0.25, 0.5, 1, 10, 100] and [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99, 1] for the λ and α parameters respectively.

Feature-selection methods

The feature-selection methods used in this work include a number of filter feature-selection approaches which typically require some user-specified threshold that determines the set of features used. For the F-test approach, the false discovery rate is set at a threshold value—we use both 0.05 and 0.01 as these are typically used in this domain. Utilising these thresholds will return a set of features.

The other filter feature-selection approaches used provide a ranking of the features. Using Pearson and mutual information, the correlation strengths and actual information gain between each independent variable and the dependent variable respectively can be ranked. With support vector regression, the regression coefficients can be ranked, while the decision tree ensemble approach (random forest) returns a feature importance score that allows feature ranking.

The reduced feature set can be selected using a threshold of some sort on this ranking, e.g., by choosing those with scores higher than a threshold score or by specifying a number of features to use. We select the number of features to use for each method by determining the set that performs best on the training data. We chose to investigate models with successively larger feature subset sizes, in an effort to find the model that corresponded to the error minima/correlation maxima. In Stage 1 of the analysis (nested CV), elastic net regression models were created after initial feature selection using each of the four ranking feature-selection methods, i.e., for ranked feature subsets of 50, 100, 150, 200, 250, and in steps of 250 thereafter up to 20,000 features. Nested CV performance scores were then plotted against feature subset size to yield curves for examination of the effect of increasing feature subset size, and from which an optimal point (feature subset size) could be selected for the stage 2 and stage 3 analyses.

The other feature-selection approaches used in this study include the embedded feature-selection techniques of elastic net, which is our baseline approach, and gradient boosting which is a decision tree ensemble technique which includes regularisation. Both these techniques yield a unique feature subset from the application of the algorithm. We also used PCA which is a feature transformation approach. The principal components were derived from training data and used to project both training and test data into the training PCA subspace. More detail is given on the individual feature selection/transformation methods in Table 2, while Table 3 outlines commonly used feature-selection approaches from a range of studies that utilise DNA methylation data.

Table 2 Summary of feature-selection techniques used in this study

Full size table

Table 3 Commonly used feature-selection approaches in DNA methylation-based studies

Full size table

Other regression-based learning algorithms

In the third stage of analysis (Stage 3), estimators were constructed by combining a range of different regression algorithms with the feature subset utilised in the estimator judged to be the most promising from the Stage 2 analysis. These included partial least squares regression, multi-layer perceptron, least angle regression and support vector regression—these models had relatively low computational overhead, unlike some other algorithms which would require long run times, given that feature set sizes can still be relatively large after the feature-selection stage. The same training, feature-selection, cross-validation and testing methodology was used as described for the Stage 2 analysis.

Statistical analysis

Pearson’s correlation coefficient was used to assess the strength of linear association between actual and estimated TL in all cohorts. In addition, Pearson’s correlation was calculated between the four TL measures (actual TL, MI-EN TL, PCA-EN TL and DNAmTL) and age in the EXTEND and TWIN cohort—where MI-EN TL and PCA-EN TL are the estimators that utilised mutual information-based and principal component analysis-based feature reduction in advance of elastic net regression respectively. The strength of correlation (Pearson) between age acceleration (obtained by regressing DNAmAge [5] on age) and both actual and predicted TL was also assessed.

Where confidence intervals (CI) were calculated for reported correlations, we utilised 83.4% intervals per the recommendation by Knol et al. [73]. In the assessment of effect modification, the overlap of 95% CI is regularly used, wherein a type I error probability of 0.05 is often mistakenly assumed. That is, the chance of finding a non-overlapping 95% CI is assumed to be 0.05 under the null hypothesis that there is no statistically significant difference in a set of observations. The authors recommend an adjusted confidence level (83.4%) to arrive at a Type I error probability of 0.05. We utilised the online correlation coefficient confidence interval calculator [74].

To avoid age confounding potential relationships between actual and estimated TL measures and age-related traits, age-adjusted measures were generated for the EXTEND data. This was achieved by regressing actual TL, MI-EN TL and PCA-EN TL on age, with the raw residuals of this process being defined as TLadjAge, MI-ENadjAge and PCA-ENadjAge respectively. Accordingly, Pearson’s correlation was assessed between these age-adjusted measures and estimated blood cell counts for the EXTEND data set. The same analysis was performed for the TWIN data set (not adjusted for age). Furthermore, repeated measures correlation was used to examine the correlation between actual TL/MI-EN TL/PCA-EN TL and actual blood cell counts available for the Dunedin cohort. This method accounts for repeated measures for the same individuals at ages 26 and 38—addressing non-independence among observations by statistically adjusting for inter-individual variability using analysis of covariance [75].

Multiple linear regression models were used to explore a range of biological correlates i.e., the association between variables such as TL (both actual and estimated) and participant traits (e.g., age and sex), while controlling for confounders. Tests were two-sided and the statistical significance level was defined as p < 0.05. Actual TL was z-score transformed before conducting the regression analysis to allow for easier interpretation of coefficients [76].

The estimator previously developed by Lu et al. (DNAmTL) [34] was compared with the estimators constructed in our study. The ability of the compared estimators was assessed via MAE, MAPE, and the Pearson correlation coefficient between predicted and actual TL, on both our independent data sets (EXTEND and TWIN).

Results and discussion

Nested cross-validation analysis on Dunedin data set

A comparison of nested CV MAE on the Dunedin data set for the 9 feature-selection models investigated in this study is shown in Fig. 2. Models utilise elastic net regression following application of each feature-selection/transformation method, with the exception of the baseline model which applies elastic net regression with no prior feature-selection stage.

Figure 2 shows that the model that utilises the Pearson correlation-based feature rankings performed relatively well at a lower number of features. On closer inspection of the graph, a feature set size of approximately 750 is close to the minimum error observed for that model. We will utilise this feature set size in stage 2 of our analysis where models are constructed based on the full Dunedin data set and tested on the 2 independent data cohorts. The model that utilises linear SVR feature rankings can be seen to progressively improve up to approximately 13,750 features—however we elected to utilise a more parsimonious model with 8500 features which corresponds to one of the best MAE values (Fig. 2) and a near maximum correlation between predicted and actual TL (Fig. 3). The model that implements PCA in advance of elastic net regression yields the lowest error across all models. Although PCA is not strictly a feature-selection technique but rather a feature transformation method, we were interested to assess its performance, given that multicollinearity is a known challenge in high-dimensional data when applying statistical methods [29, 77]. The results suggest that the orthogonality of the transformed variables (principal components) mitigated, to some extent, the issue of correlated predictors.

In addition to assessing the MAE, it is useful to also consider the Pearson correlation coefficient between predicted and actual TL and previous studies report this when considering the efficacy of TL estimation [34, 78]. For the same models shown in Fig. 2, Fig. 3 shows the Pearson correlation coefficient between the predicted and actual TL. In Fig. 3, the Pearson correlation-based feature ranking method with elastic net again indicates relatively good performance at a lower feature set size, with a peak observed at approximately 750 features. It is notable that performance is seen to increase for this model as the number of features approaches 20,000. However, it is important to consider that, in the field of epigenetics, there can be an advantage in discovering relatively smaller feature subsets (biological signatures) given that this facilitates both model interpretation and downstream biological interpretation of DNA methylation estimators of biological traits. However, larger feature numbers may have more utility to estimate telomere length in population-based studies.

Given that Pearson’s correlation indicates how strongly two variables are linearly related, two methods that stand out as performing well relative to the others are the two F-tests with FDR control—exhibiting higher Pearson’s correlation (r > 0.3) between actual and predicted TL (Fig. 3). Interestingly, hypothesis testing with FDR control is known to be a popular method in DNA methylation-based age estimation studies. Again, the application of principal component analysis before the elastic net regression stage yields a relatively good result (r ≈ 0.3). It is notable that the baseline method of elastic net regression without any prior feature-selection stage performs poorly in this analysis (r < 0.2), suggesting that a feature-selection stage in advance of elastic net regression is beneficial and should be explored as part of the estimator discovery process. When seeking an optimal DNA methylation-based estimator, the methods and algorithms used to develop this will naturally vary due to the heterogeneity and diversity of datasets—currently most DNA methylation datasets are too small and not sufficiently representative to yield a general-purpose estimator. This highlights the importance of a robust development methodology, as presented here, in the pursuit of a DNA methylation-based estimator of traits.

The average number of features remaining after both the initial feature selection/transformation and elastic net regression stages for the models constructed using nested CV is shown in Table 4. Additionally, the bar charts in Figs. 4 and 5 show the extent of the feature reduction process for each of the tested models. The left y-axes scales are logarithmic due to the large differences in features shown. Consider an example, the Pearson correlation-based feature-selection model was trained with the top ranked 750 features. The decision to use 750 features was based on observation of minima and peaks in Figs. 2 and 3 respectively from the nested CV analysis stage i.e., we effectively chose the best model from the models tested over the range 50–20,000 Pearson correlation ranked features as described in “Feature-selection methods” Section. In Fig. 4, for the Pearson correlation model, the blue bar indicates the 750 features retained after the initial feature-selection stage, with the adjacent orange bar indicating that approximately 70 features remain after application of the elastic net regression algorithm to the 750-feature data set.

Table 4 Features remaining in models after both feature-selection stages for the nested CV analysis (Stage 1)

Full size table

It is notable that the baseline model drastically reduces the initial 417,690 features down to approximately 30 features (i.e., the average number of features that remained in the models constructed from the 5 training sets of the nested CV process). As with Fig. 4, Fig. 5 shows the same feature reduction information, however, the right y-axis denotes the Pearson correlation between predicted and actual TL. In general, based on this metric, models that retained relatively lower numbers of features yielded the lowest correlation scores.

To investigate how the number of features presented to the elastic net algorithm relates to the number of features selected by it, we plot this relationship over successively larger feature set sizes for each of the 4 models that utilise explicit feature rankings (Fig. 6). Interestingly, models that utilise feature rankings derived from the linear SVR and random forest learning algorithms show an essentially monotonically increasing pattern i.e., in general as the number of features input to the elastic net algorithm increases, so too does the number of features selected by it. In contrast, the Pearson correlation-based and mutual information-based methods display significant fluctuations in the number of features retained by the elastic net algorithm.

In the case of Pearson correlation, feature subsets of similar size (containing predominantly the same ranked features) passed to the elastic net algorithm can, in some cases, result in substantially different feature set sizes being selected by the elastic net algorithm. The radically different behaviour of the four feature-selection methods and the fluctuating interplay of feature-selection method and regression algorithm further underscores the importance of exploring a range of methods and applying a robust methodology when developing a DNA methylation-based estimator.

Validation on EXTEND and TWIN independent data

Estimators, constructed on the full Dunedin data set, were comprised of specific sets of features after training. These estimators were tested on the 2 independent validation cohorts. Analogous to Table 4, the number of features remaining after both any initial feature-selection stage and the elastic net stage is shown in Table 5.

Table 5 Features remaining after both feature-selection stages for models in the Stage 2 analysis

Full size table

Next, the constructed estimators were tested on both the EXTEND and TWIN data sets. Table 6 denotes the performance measures (MAE, MAPE and Pearson correlation coefficient between predicted and actual TL) of the 9 investigated models, indicating their ability to predict TL in both validation data sets.

Table 6 Performance scores for models constructed on the Dunedin data and tested on validation data sets

Full size table

For the EXTEND data set, the model that applies PCA in advance of elastic net regression (PCA-EN TL) achieves the lowest MAE and MAPE, and the highest correlation between predicted and actual TL. Interestingly, PCA-EN TL was also shown to perform well in the nested CV analysis where models were trained and tested on the Dunedin data set. In this case, the model achieved the lowest MAE (Fig. 2) and a relatively high correlation score (Fig. 3). Contrastingly, the baseline model (elastic net regression with no prior feature-selection) performed less well in general on the data sets utilised in our study. On the EXTEND and TWIN independent validation sets, the model achieves the third and fourth best correlations respectively (between predicted and actual TL). The baseline model performed particularly poorly in the nested CV analysis—Fig. 2 shows that it is among the worst performers in terms of the MAE score, while the model's estimated TL are the least correlated with actual values (Fig. 3). Other notable results include the model that utilised mutual information in advance of elastic net regression (MI-EN TL). This model achieved a correlation of 0.203 between the predicted and actual TL, although the model was seen to perform poorly on the TWIN data set, reporting a small negative correlation (r = − 0.067). Of note, no models performed particularly well on the TWIN data.

The use of filter feature-selection techniques in advance of applying elastic net has the disadvantage that user intervention is necessary in most cases. For example, with an F-test, the false discovery rate must be specified, while for the methods that yield explicit feature rankings, the user must explore models with varying feature set sizes to ascertain which model may be optimal. A key aspect of the methodology adopted in our study is that the data used to test performance has not been used for feature-selection—training and testing data must be independent to avoid information leakage from test data into training sets [79].

In contrast, a benefit of utilising embedded methods like elastic net and the gradient boosting implementation used in this study (both without any prior feature-selection step) is that they automatically yield a reduced feature set—which adds to their attractiveness as options. It is important to note that while PCA with elastic net yields the overall best estimator from our tested candidates, the nature of PCA precludes explicit selection of features, yielding instead transformed features that are linear combinations of original features (CpGs). As such, PCA is not commonly used as an initial feature reduction step when developing DNA methylation-based signatures. This limitation, however, is mitigated if the primary motivation for developing an estimator is for predictive purposes where, based on the results of our study, PCA has the potential to yield better estimates of TL than its competitors. A recent study on epigenetic clocks, a widely used aging biomarker derived from DNA methylation data, found that CpG measures can be unreliable due to technical noise. The authors applied PCA in advance of elastic net regression, minimising random noise from CpGs and extracting shared systematic variation in DNA methylation. Technical variance was reduced while preserving relevant biological variance, with the PC-based clocks achieving equivalent or improved prediction of outcomes [30].

Beyond DNA methylation-based estimators of chronological age and TL, studies utilise DNA-methylation data for purposes such as classification of cancer and other diseases [42, 43, 71, 80,81,82]. Some studies utilise a single feature-selection technique, while others combine several approaches. Taken together, the robust methodology of comparing and evaluating a range of feature selection and reduction techniques, as demonstrated in our study, could serve to potentially enhance the efficacy and value of DNA methylation-based classifiers and estimators.

Feature-selection with other regression algorithms for estimator development

In addition to elastic net, several other regression algorithms were explored with an initial feature-selection stage. We chose the feature-selection approaches with the best correlation between predicted and actual TL from the validation analysis for both the EXTEND and TWIN data sets for comparison. These were mutual information (r = 0.203) and F-test (0.01 FDR) (r = 0.119) respectively (Table 6). Although the estimator that utilised PCA achieved a better correlation (r = 0.295) on the EXTEND data, PCA is not strictly a feature-selection method but rather transforms the original variables into new orthogonal variables. The results of estimators utilising other regression algorithms can be seen in Table 7. Notable results include that the estimator MI-SVR TL obtained a similar correlation between predicted and actual TL (r = 0.181) as the MI-EN TL model (Table 6) and that the MI-PLS TL estimator achieves the best MAE and MAPE of all tested estimators on the EXTEND data but a relatively low correlation. However, none of the models that used alternative regression approaches achieved a higher correlation than MI-EN TL for the EXTEND data.

Table 7 Performance scores for models constructed on Dunedin data and tested on EXTEND and TWIN data

Full size table

Comparison with previously developed DNAmTL estimator

The novel 140-CpG DNA methylation-based TL estimator described previously by Lu et al. [34], (DNAmTL), is compared with the range of estimators developed in our study. We utilised both validation sets (EXTEND and TWIN) for this purpose. Predictions (estimates) from each estimator were assessed via their correlation with the actual values of TL in both data sets (Table 6)—an analogous correlation analysis was conducted in [34, 78]. For the EXTEND and TWIN data sets, a comparison of some of our best performing TL estimators and the DNAmTL estimator of Lu et al. [34] is shown in Figs. 7 and 8. Regarding the EXTEND data, the estimator which uses PCA in advance of elastic net regression (PCA-EN TL) achieves the highest correlation coefficient between predicted and actual TL (0.295 (83.4% CI [0.201, 0.384])). The next highest correlation (of our developed models) was achieved by the mutual information with elastic net estimator (MI-EN TL), with a correlation of 0.203 (83.4% CI [0.105, 0.297]). Comparing with the DNAmTL estimator of Lu et al. [34], this achieved a correlation of 0.216 (83.4% CI [0.118, 0.310]) (114 of the 140 DNAmTL CpG measures were available in EXTEND).

There is considerable variation observed in the correlations reported across the estimators (Table 6), suggesting that the choice of feature-selection method is important, given the specific nature of individual data sets. From observation of Fig. 7 (relating to the EXTEND cohort), the PCA-EN TL estimator’s 83.4% confidence interval does not overlap with either Boost-EN TL or F-test-0.05-EN TL—thus indicating a statistically significant difference in the correlation coefficients at the 5% level of significance [73, 83]. There is a marginal overlap between PCA-EN TL and LSVR-EN TL and relatively modest overlaps between PCA-EN TL and both the RF-EN TL and r-EN TL estimators. Where there is overlap in confidence intervals, results should be interpreted with caution.

Of note, the confidence interval for DNAmTL overlaps with all compared estimators in Fig. 7 while, as stated above, several estimators share no or a marginal overlap with PCA-EN TL—supporting the inference that PCA-EN TL performs best on this data set. Figure 8 outlines the correlations between predicted and actual TL for a range of the investigated estimators on the TWIN cohort, with a marked decline evident in estimator performance for both our set of developed models and DNAmTL. The wide confidence intervals suggest that the sample sizes of test sets (n = 192 and n = 178 for EXTEND and TWIN respectively) may be sub-optimal for assessing differences, therefore apparent trends where intervals overlap should be considered tentatively [84, 85].

Previously, the DNAmTL estimator developed by Lu et al. [34] was tested on a range of independent data sets. Four of these data sets contained TL measured using the gold standard Southern blot telomere restriction fragment method (TRF), with the DNAmTL estimator achieving correlations ranging r = 0.41–0.5. However, on 3 data sets with quantitative polymerase chain reaction (qPCR) TL measures (i.e., similar to the TL measurement method used in our study), the correlations were observed to be variable (r = − 0.01, 0.08 and 0.38). On applying the DNAmTL estimator to the Dunedin data set, a correlation of 0.242 (83.4% CI [0.209, 0.274]) between predicted and actual TL values was found (with 115 of 140 DNAmTL CpGs available in Dunedin data). Additionally, the DNAmTL estimator achieves correlations of 0.216 (83.4% CI [0.118, 0.310]) and 0.092 (83.4% CI [− 0.012, 0.194]) on the EXTEND and TWIN data sets respectively.

The observed differences in the correlations between predicted and actual TL for TRF and qPCR data, and the substantial variation in the case of qPCR may be due to known lab-to-lab variation of qPCR TL assays and/or be due to assay reproducibility [78]. The qPCR method is frequently used to measure TL due to its high-throughput and small DNA requirements; however, due to its sensitivity to pre-analytic factors such as DNA extraction or storage, its reliability is limited [86]. This suggests that utilising qPCR-based TL as a ground truth may not lead to as accurate or generalisable a DNA methylation-based TL estimator. As a measure of external validity, we assessed the correlation of TL across both time points in the Dunedin data (i.e., between individuals at ages 26 and 38) for both the actual measured TL and our best developed estimator PCA-EN TL. The correlations were observed to be 0.669 (p = 6.83e−97) and 0.261 (p = 5.58e−13) respectively. The strong correlation of the actual TL across time points supports the expectation that TL shows a high degree of intra-individual consistency over time [87,88,89,90].

On the whole, performance of the estimators was substantially lower when tested on the TWIN data set with the estimator that applied the F-test (0.01 FDR) in conjunction with elastic net achieving the highest correlation (r = 0.119 (83.4% CI [0.015, 0.221])) between predicted and actual TL. Comparatively, the DNAmTL estimator yielded a correlation of 0.092 (83.4% CI [− 0.012, 0.194]) on the TWIN data set, with all 140 CpG sites being available for use in the estimator. TL is highly heritable [91,92,93,94,95,96], therefore it would typically be expected to see strong correlations between TL in twin pairs. To further validate the TWIN data set, the correlations for actual TL, MI-EN TL, PCA-EN TL and DNAmTL across twin pairs were assessed and found to be 0.384 (83.4% CI [0.254, 0.5], p = 2.87e−04), 0.856 (83.4% CI [0.812, 0.890], p = 1.62e−25), 0.828 (83.4% CI [0.777, 0.869], p = 1.53e−22) and 0.823 (83.4% CI [0.769, 0.866], p = 4.03e−22) respectively. These results support the reliability of our DNA methylation-based TL estimators.

We compared the features selected by Lu et al.’s DNAmTL estimator [34] and our MI-EN TL estimator, both of which yielded relatively small feature sets (140 and 407 features respectively). Only two features (CpGs) were found to be common to both estimators. This highlights the challenge of finding a DNA methylation-based estimator that would generalise well on all data sets. Theoretically, to acquire such a level of generalisation, one would require very large data sets, which at the present time are difficult to obtain. However, in the future, aggregation of data sources may be possible which could yield a highly generalisable estimator, wherein applying a robust methodology such as that demonstrated in this paper, should further enhance the ultimate estimator.

Recently, Higgins Chen et al. [30] developed principal components-based epigenetic estimators of aging by first reducing the feature set to 78,464 CpGs present in a wide range of DNA methylation data sets. The authors implemented PCA with centering but not scaling of the beta values and excluded the final PC before applying elastic net regression. Guided by their methods and available codebase (https://github.com/MorganLevineLab/PC-Clocks), we applied this approach to our Dunedin training data set (using 76,567/78,464 CpGs present in our data) to construct a new estimator and tested it on both our independent data sets. The new estimator achieved Pearson correlations between predicted and actual TL of 0.194 and − 0.007 for the EXTEND and TWIN data sets respectively. Comparatively, our PCA-based estimator (PCA-EN TL) achieved correlations of 0.295 and 0.074 respectively. A primary focus of the Higgins-Chen et al. [30] study was reliability and replicability of epigenetic signatures. Possibly, moderate stronger correlations were attained by our PCA-EN TL estimator in our datasets as PCA was applied to the entire CpG feature set without initial filtering, thus allowing summarisation of all information.

Correlation of actual TL and estimators MI-EN TL and PCA-EN TL

Plots of actual TL versus two of our best performing DNA methylation-based estimators (MI-EN TL and PCA-EN-TL) are shown in Fig. 9 for the Dunedin, EXTEND and TWIN data sets. As expected, higher correlations are observed when the estimators are applied to the Dunedin training data (r = 0.487 and r = 0.498) with the highest correlation in the validation cohorts given by the PCA-EN TL estimator applied to the EXTEND data (r = 0.295).

Predicted TL from MI-EN TL and PCA-EN TL estimators more strongly correlated with age than TL

In the EXTEND validation data set MI-EN TL and PCA-EN TL showed stronger negative correlations with chronological age (r = − 0.506 (83.4% CI [− 0.577, − 0.427]) and r = − 0.565 (83.4% CI [− 0.63, − 0.493]) respectively) than did actual TL (r = − 0.218 (83.4% CI [− 0.312, − 0.12])), as shown in Additional file 1: Figure S1. Comparatively for the TWIN data, both MI-EN TL and PCA-EN TL showed similar negative correlations with chronological age (r = − 0.512 (83.4% CI [− 0.585, − 0.431], p = 1.33e−07) and r = − 0.520 (83.4% CI [− 0.592, − 0.44], p = 7.9e−08) respectively), in comparison to a slight positive correlation between chronological age and actual TL (r = 0.128, 83.4% CI [0.024, 0.229], p = 0.219). Considering the Lu et al. estimator’s TL predictions, these were shown to have correlations of − 0.7 (83.4% CI [− 0.748, − 0.645], p = 1.28e−29) and − 0.692 (83.4% CI [− 0.743, − 0.633], p = 1.1e−14) with age in the EXTEND and TWIN data sets respectively. Additional file 1: Figure S1 compares the correlations in both data sets for each TL measure. Next, a multiple regression model was constructed using the EXTEND data to explore the relationship between age and TL. Our analysis showed that the MI-EN TL, PCA-EN TL and DNAmTL estimators, were associated with much more significant p-values (p = 8.49e−14, p = 6.91e−17 and p = 4.64e−29 respectively) than actual relative TL (p = 2.83e−03), after adjusting for sex, current smoking status and other confounders (Additional file 1: Table S1). Our findings are consistent with the Lu et al. [34] study, where their DNAmTL estimator was observed to achieve more significant associations with age than actual TL did with age.

Previously, an epigenetic clock, DNAmAge, was developed [5] in the form of a multi-tissue predictor of age that utilises DNA methylation data. We examined age acceleration, generated by regressing DNAmAge on age, and plotted it against actual and estimated TL for comparison (Additional file 1: Figure S2). We observed no correlation (r = − 0.029, p = 0.69) between actual TL and the age acceleration measure in EXTEND data, as shown previously ([50, 97,98,99]). Stronger, but small, negative correlations were observed between age acceleration and both our TL estimators MI-EN TL (r = − 0.143, p = 0.048) and PCA-EN TL (r = − 0.161, p = 0.026). Similarly, Lu et al. [34] reported a Pearson correlation of r = − 0.2 between the DNAmAge age acceleration measure and their age-adjusted TL estimator, DNAmLTLadjAge.

Effect of sex

Previously studies have shown that women have longer TL than men, when considering groups of the same age [100]. All three of the multiple regression models outlined in Additional file 1: Table S1 in the supplementary information indicate that age-adjusted relative TL, age-adjusted MI-EN relative TL and age-adjusted PCA-EN relative TL were longer for women than men. The p-value for age-adjusted relative TL (p = 0.045, n = 192) was much less significant than those for age-adjusted MI-EN relative TL and age-adjusted PCA-EN relative TL (p = 3.61e−07 and p = 1.18e−03 respectively). Comparatively, Lu et al. [34] reported that age-adjusted TL and age-adjusted DNAmTL were longer in females than males with the p-value for age-adjusted LTL (p = 2.15E−04) being much less significant than that for age-adjusted DNAmTL (p = 1.14E−15).

MI-EN TL and PCA-EN TL association with imputed blood cell composition

The following imputed blood cell counts were analysed for the validation cohort EXTEND (B cells, naïve CD4+ T, naïve CD8+ T, exhausted cytotoxic CD8+ T cells (CD8 positive, CD28 negative, CD45R negative), plasma blasts, natural killer (NK) cells, monocytes, and granulocytes). The blood cell composition imputation of B cell, NK cells, monocytes and granulocytes were imputed using the Houseman method [101], while remaining cells were imputed based on the Horvath method [102] as described previously [50].

The abundance of naïve CD8+ T cells and memory CD8+ T cells has previously been shown to correlate with actual TL [103], and a previously developed estimator (DNAmTL) was found to be significantly correlated with several imputed measures of leucocytes e.g. naïve CD8+ T cells (r = 0.42, p = 2.2E−151) [34]. We found similar correlations for naïve CD8+ T cells in the EXTEND cohort (r = 0.395, 83.4% CI [0.307, 0.477], p = 1.5E−08, r = 0.268, 83.4% CI [0.172, 0.359], p = 1.8E−04 and r = 0.465, 83.4% CI [0.382, 0.54], p = 1.1E−11) for PCA-EN TL, MI-EN TL and DNAmTL respectively. Additionally, MI-EN TL was significantly correlated with several other imputed blood cell measures i.e., CD8pCD28nCD45RAn cells (r = − 0.171, 83.4% CI [− 0.267, − 0.072], p = 1.8E−02), CD4T cells (r = 0.191, 83.4% CI [0.092, 0.286], p = 8.1E−03) and natural killer cells (r = − 0.16, 83.4% CI [− 0.256, − 0.061], p = 2.7E−02). Similarly, PCA-EN TL was significantly correlated with CD8pCD28nCD45RAn cells (r = − 0.16, 83.4% CI [− 0.256, − 0.061], p = 2.6E−02), plasma blasts (r = 0.196, 83.4% CI [0.098, 0.291], p = 6.6E−03), natural killer cells (r = − 0.159, 83.4% CI [− 0.255, − 0.06], p = 2.7E−02), monocytes (r = − 0.207, 83.4% CI [− 0.301, − 0.109], p = 4.0E−03) and granulocytes (r = 0.26, 83.4% CI [0.164, 0.351], p = 2.7E−04). Lu et al. DNAmTL [34] was found to be significantly correlated with CD4+ T cells (r = 0.313, 83.4% CI [0.219, 0.401], p = 9.6E−06), CD4T cells (r = 0.314, 83.4% CI [0.221, 0.402], p = 9.4E−06), natural killer cells (r = − 0.17, 83.4% CI [0.071, 0.266], p = 1.9E−02), CD8pCD28nCD45RAn cells (r = − 0.213, 83.4% CI [− 0.307, − 0.115], p = 3.0E−03) and monocytes (r = − 0.214, 83.4% CI [− 0.308, − 0.116], p = 2.8E-03).

A range of strong correlations were observed in the TWIN data set between various TL measures (not adjusted for age) and blood cell concentrations. These included actual TL, MI-EN TL, PCA-EN TL and DNAmTL being significantly correlated with CD8.naive cells with correlations of 0.169 (83.4% CI [0.066, 0.269], p = 2.43E−02), 0.456 (83.4% CI [0.369, 0.535], p = 1.65E−10), 0.617 (83.4% CI [0.548, 0.678], p = 4.48E−20) and 0.685 (83.4% CI [0.625, 0.737], p = 5.47E−26) respectively. Additionally, CD4T cells were significantly correlated with MI-EN TL (r = 0.384, 83.4% CI [0.291, 0.47], p = 1.26E−07), PCA-EN TL (r = 0.456, 83.4% CI [0.369, 0.535], p = 1.6E−10) and DNAmTL (r = 0.457, 83.4% CI [0.37, 0.536], 1.44E−10). MI-EN TL, PCA-EN TL and DNAmTL generally exhibited considerably stronger correlations with imputed blood cell composition than actual TL—further details can be viewed in the Supplementary Information (Additional file 1: Tables S2, S3, S4 and S5).

MI-EN TL and PCA-EN TL association with actual blood cell composition in Dunedin cohort

Actual blood cell counts (as opposed to imputed) were available for the Dunedin study. Moderate negative correlations were observed for basophil count, with the estimated telomere lengths from MI-EN TL and PCA-EN TL achieving the highest correlations (r = − 0.41, p = 1.8E−05 and r = − 0.38, p = 6.7E−05 respectively), compared to r = − 0.24 (p = 0.014) for the actual TL. In the case of eosinophils, actual TL showed similar negative correlation (r = − 0.15, p = 9.3E−5) to both PCA-EN TL and MI-EN TL (r = − 0.19, p = 3.6E−07 and r = − 0.10, p = 1.2E−02 respectively).

There was a significant small correlation of r = − 0.09 (p = 0.013) between actual monocyte count and measured TL for the Dunedin data, while stronger and more significant negative correlations were observed with both MI-EN TL and PCA-EN TL (r = − 0.15, p = 5.4E−05 and r = − 0.16, p = 3.1E−05). Imputed monocyte blood cell counts are available for the EXTEND cohort for comparison. Lower correlations may be expected in this case as correlations generated for actual blood cell counts in the Dunedin data involve TL estimates from models trained on the Dunedin data itself. Of the correlations between imputed monocyte count and actual TL, MI-EN TL and PCA-EN TL, PCA-EN TL showed the strongest correlation (r = − 0.21, p = 4.0E−03) with imputed monocyte count. Additional file 1: Figure S3 shows plots of monocyte counts versus actual TL, MI-EN TL and PCA-EN TL for both the Dunedin and EXTEND data sets. Blood cell composition correlation details for EXTEND, TWIN and Dunedin data can be viewed in Additional file 1: Tables S2, S3, S4, S5 and S6.

It is interesting to note that despite the Lu et al. DNAmTL estimator [34] and our MI-EN TL estimator having only two features in common, they are both shown to correlate with many of the same biological entities i.e., blood cell composition, age and sex. This suggests that our developed estimators may have captured these biological properties by finding features (CpGs) that identify them best in our data, whereas a predominantly different signature may achieve this in another data set.

Conclusions

In summary, we have outlined a robust methodology that utilises feature-selection approaches and ML algorithms in the development of a DNA methylation-based TL estimator. We have shown through results on independent data that differences in the efficacy of developed estimators exist, primarily due to the inherently varying combination of methods and algorithms with relatively small heterogenous data sets. Consistent with research in ML and deep learning, as greater volumes of high quality representative big data become available, it will be possible to develop more robust and accurate estimators of biological traits, such as TL, that generalise well to new unseen data. Interestingly, different estimators with extensively different feature sets, correlate with many of the same biological properties, which suggests that an estimator has a choice of CpGs with which to represent traits.

An interesting outcome of our work is that PCA, a technique not traditionally used for feature reduction of high dimensional data in DNA methylation studies, performs well in comparison to a range of more typically utilised approaches. This suggests that it may have utility in the development of DNA methylation-based estimators or classifiers where prediction is paramount, with identification of underlying features (CpG sites) not of primary interest. Such estimators, however, may have less clinical utility for the development of DNA methylation-based biomarkers of disease. Furthermore, the methodology adopted herein, that compares and assesses candidate estimators of TL, could easily be applied when developing estimators of other biological markers and disease phenotypes, to examine their relationship with DNA methylation and potentially improve their predictive value.

Availability of data and materials

Source code and scripts are available in the GitHub repository https://github.com/trevordoherty/DNA-methylation-based-Telomere-Length-estimator. The Dunedin Study datasets reported in the current article are not publicly available due to a lack of informed consent and ethical approval for public data sharing. The Dunedin study datasets are available on request by qualified scientists. Requests require a concept paper describing the purpose of data access, ethical approval at the applicant’s university and provision for secure data access (https://moffittcaspi.trinity.duke.edu/research-topics/dunedin). We offer secure access on the Duke, Otago and King’s College campuses. For the TWIN study, data is freely available in the supplemental files of the previously published article [51]. The EXTEND study data is deposited in the Gene Expression Omnibus (GEO) database (accession number: GSE113725). For further information on data availability, please contact the corresponding author.

Change history

07 June 2023
In the XML Figs has been corrected to Fig. The article has been updated to rectify the error.

Abbreviations

CV:: Cross-validation
CpG:: Cytosine–phosphate–guanine
DNA:: Deoxyribonucleic acid
FDR:: False discovery rate
MAE:: Mean absolute error
MAPE:: Mean absolute percentage error
ML:: Machine learning
PCA:: Principal component analysis
qPCR:: Quantitative polymerase chain reaction
SVR:: Support vector regression
TL:: Telomere length
TRF:: Telomere restriction fragment

References

Horvath S, Raj K. DNA methylation-based biomarkers and the epigenetic clock theory of ageing. Nat Rev Genet. 2018;19(6):371–84.
Article CAS PubMed Google Scholar
Benowitz NL, et al. Prevalence of smoking assessed biochemically in an urban public hospital: a rationale for routine cotinine screening. Am J Epidemiol. 2009;170(7):885–91.
Article PubMed PubMed Central Google Scholar
Hsieh SJ, et al. Biomarkers increase detection of active smoking and secondhand smoke exposure in critically ill patients. Crit Care Med. 2011;39(1):40.
Article PubMed PubMed Central Google Scholar
Ballestar E, Sawalha AH, Lu Q. Clinical value of DNA methylation markers in autoimmune rheumatic diseases. Nat Rev Rheumatol. 2020;16(9):514–24.
Article CAS PubMed PubMed Central Google Scholar
Horvath S. DNA methylation age of human tissues and cell types. Genome Biol. 2013;14(10):3156.
Article Google Scholar
Bocklandt S, et al. Epigenetic predictor of age. PLoS ONE. 2011;6(6):e14821.
Article CAS PubMed PubMed Central Google Scholar
Hannum G, et al. Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol Cell. 2013;49(2):359–67.
Article CAS PubMed Google Scholar
Choi H, Joe S, Nam H. Development of tissue-specific age predictors using DNA methylation data. Genes. 2019;10(11):888.
Article CAS PubMed PubMed Central Google Scholar
Zhu T, et al. CancerClock: a DNA methylation age predictor to identify and characterize aging clock in pan-cancer. Front Bioeng Biotechnol. 2019;7:388.
Article PubMed PubMed Central Google Scholar
Horvath S et al. DNA methylation aging and transcriptomic studies in horses. Biorxiv, 2021.
Horvath S, et al. Epigenetic clock for skin and blood cells applied to Hutchinson Gilford Progeria syndrome and ex vivo studies. Aging (Albany NY). 2018;10(7):1758.
Article CAS PubMed Google Scholar
Boroni M, et al. Highly accurate skin-specific methylome analysis algorithm as a platform to screen and validate therapeutics for healthy aging. Clin Epigenet. 2020;12(1):1–16.
Article Google Scholar
Weidner CI, et al. Aging of blood can be tracked by DNA methylation changes at just three CpG sites. Genome Biol. 2014;15(2):1–12.
Article Google Scholar
Galkin F, et al. DeepMAge: a methylation aging clock developed with deep learning. Aging Dis. 2020;12(5):1252.
Article Google Scholar
Belsky DW, et al. DunedinPACE, a DNA methylation biomarker of the pace of aging. Elife. 2022;11:e73420.
Article CAS PubMed PubMed Central Google Scholar
de Lima Camillo LP, Lapierre LR, Singh R. A pan-tissue DNA-methylation epigenetic clock based on deep learning. npj Aging. 2022;8(1):1–15.
Article Google Scholar
Bollepalli S, et al. EpiSmokEr: a robust classifier to determine smoking status from DNA methylation data. Epigenomics. 2019;11(13):1469–86.
Article CAS PubMed Google Scholar
Joehanes R, et al. Epigenetic signatures of cigarette smoking. Circ: Cardiovasc Genet. 2016;9(5):436–47.
CAS PubMed Google Scholar
Sugden K, et al. Establishing a generalized polyepigenetic biomarker for tobacco smoking. Transl Psychiatry. 2019;9(1):1–12.
Article Google Scholar
Rauschert S, et al. Machine learning-based DNA methylation score for fetal exposure to maternal smoking: development and validation in samples collected from adolescents and adults. Environ Health Perspect. 2020;128(9):097003.
Article PubMed PubMed Central Google Scholar
Hamilton OK, et al. An epigenetic score for BMI based on DNA methylation correlates with poor physical health and major disease in the Lothian Birth Cohort. Int J Obes. 2019;43(9):1795–802.
Article Google Scholar
Bellman R. Curse of dimensionality. Adaptive control processes: a guided tour. Princeton, NJ, 1961;3(2).
Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3(Mar):1157–82.
Google Scholar
Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.
Article CAS PubMed Google Scholar
Bommert A, et al. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Anal. 2020;143:106839.
Article Google Scholar
Alkuhlani A, Nassef M, Farag I. A comparative study of feature selection and classification techniques for high-throughput DNA methylation data. In International conference on advanced intelligent systems and informatics; 2016. Springer.
Jović A, Brkić K, Bogunović N. A review of feature selection methods with applications. in 2015 38th international convention on information and communication technology, electronics and microelectronics (MIPRO); IEEE. 2015.
Cunningham P. Dimension reduction, in machine learning techniques for multimedia. Springer; 2008. p. 91–112.
Garg A, Tai K. Comparison of statistical and machine learning methods in modelling of data with multicollinearity. Int J Model Ident Control. 2013;18(4):295–312.
Article Google Scholar
Higgins-Chen AT, et al. A computational solution for bolstering reliability of epigenetic clocks: implications for clinical trials and longitudinal tracking. Nat Aging. 2022;2(7):644–61.
Article PubMed PubMed Central Google Scholar
Benetos A, et al. Telomere length as an indicator of biological aging: the gender effect and relation with pulse pressure and pulse wave velocity. Hypertension. 2001;37(2):381–5.
Article CAS PubMed Google Scholar
Pearce EE, et al. Telomere length and epigenetic clocks as markers of cellular aging: a comparative study. GeroScience. 2022;44(3):1861–9.
Article CAS PubMed PubMed Central Google Scholar
Yadav S, Maurya PK. Correlation between telomere length and biomarkers of oxidative stress in human aging. Rejuvenation Res. 2022;25(1):25–9.
Article CAS PubMed Google Scholar
Lu AT, et al. DNA methylation-based estimator of telomere length. Aging (Albany NY). 2019;11(16):5895.
Article CAS PubMed Google Scholar
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc: Ser B (Stat Methodol). 2005;67(2):301–20.
Article Google Scholar
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol). 1995;57(1):289–300.
Google Scholar
Storey JD. A direct approach to false discovery rates. J R Stat Soc: Ser B (Stat Methodol). 2002;64(3):479–98.
Article Google Scholar
Korthauer K, et al. A practical guide to methods controlling false discoveries in computational biology. Genome Biol. 2019;20(1):1–21.
Article Google Scholar
Koch CM, Wagner W. Epigenetic-aging-signature to determine age in different tissues. Aging (Albany NY). 2011;3(10):1018.
Article CAS PubMed Google Scholar
Bekaert B, et al. Improved age determination of blood and teeth samples using a selected set of DNA methylation markers. Epigenetics. 2015;10(10):922–30.
Article PubMed PubMed Central Google Scholar
Karir P, Goel N, Garg VK. Human age prediction using DNA methylation and regression methods. Int J Inf Technol. 2019;12:373–81.
Google Scholar
Cai Z, et al. Classification of lung cancer using ensemble-based feature selection and machine learning methods. Mol BioSyst. 2015;11(3):791–800.
Article CAS PubMed Google Scholar
Xu W, et al. Integrative analysis of DNA methylation and gene expression identified cervical cancer-specific diagnostic biomarkers. Signal Transduct Target Ther. 2019;4(1):1–11.
CAS Google Scholar
Chen L, et al. Identification of DNA methylation signature and rules for SARS-CoV-2 associated with age. Front Biosci-Landmark. 2022;27(7):204.
Article Google Scholar
Xu C, et al. A novel strategy for forensic age prediction by DNA methylation and support vector regression model. Sci Rep. 2015;5(1):1–10.
Article Google Scholar
Poulton R, Moffitt TE, Silva PA. The Dunedin multidisciplinary health and development study: overview of the first 40 years, with an eye to the future. Soc Psychiatry Psychiatr Epidemiol. 2015;50(5):679–93.
Article PubMed PubMed Central Google Scholar
Bibikova M, et al. High density DNA methylation array with single CpG site resolution. Genomics. 2011;98(4):288–95.
Article CAS PubMed Google Scholar
Cawthon RM. Telomere measurement by quantitative PCR. Nucl Acids Res. 2002;30(10):e47–e47.
Article PubMed PubMed Central Google Scholar
Shalev I, et al. Exposure to violence during childhood is associated with telomere erosion from 5 to 10 years of age: a longitudinal study. Mol Psychiatry. 2013;18(5):576–81.
Article CAS PubMed Google Scholar
Crawford B, et al. DNA methylation and inflammation marker profiles associated with a history of depression. Hum Mol Genet. 2018;27(16):2840–50.
Article CAS PubMed PubMed Central Google Scholar
Hannon E, et al. An integrated genetic-epigenetic analysis of schizophrenia: evidence for co-localization of genetic associations and differential DNA methylation. Genome Biol. 2016;17(1):1–16.
Article Google Scholar
O’Callaghan NJ, Fenech M. A quantitative PCR method for measuring absolute telomere length. Biol Proced Online. 2011;13(1):1–10.
Article Google Scholar
Davis SDP et al. methylumi: Handle Illumina methylation data; 2015.
Pidsley R, et al. A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics. 2013;14(1):1–10.
Article Google Scholar
Pedregosa F, et al. Scikit-learn: machine learning in Python. J Mac Learn Res. 2011;12(Oct):2825–30.
Google Scholar
Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinform. 2006;7(1):91.
Article Google Scholar
Krstajic D, et al. Cross-validation pitfalls when selecting and assessing regression and classification models. J Cheminform. 2014;6(1):1–15.
Article Google Scholar
Dugué P-A et al. DNA methylation–based measures of biological aging, In Epigenetics in human disease, Elsevier; 2018. p. 39–64.
Ogutu JO, Schulz-Streeck T, Piepho H-P. Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions. In BMC proceedings; Springer. 2012.
Benesty J et al. Pearson correlation coefficient, In Noise reduction in speech processing, Springer; 2009. p. 1–4.
Brank J et al. Feature selection using support vector machines. WIT Trans Inf Commun Technol; 2002:28.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Article Google Scholar
Quraishi B, et al. Identifying CpG sites associated with eczema via random forest screening of epigenome-scale DNA methylation. Clin Epigenet. 2015;7(1):1–11.
Article Google Scholar
Cunningham P, Kathirgamanathan B, Delany SJ, Feature selection tutorial with python examples. arXiv preprint http://arxiv.org/abs/2106.06437; 2021.
Gambella C, Ghaddar B, Naoum-Sawaya J. Optimization problems for machine learning: a survey. Eur J Oper Res. 2021;290(3):807–28.
Article Google Scholar
Chen T, Guestrin C. Xgboost: a scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 2016.
Brownlee J. Feature importance and feature selection with xgboost in python. Machine Learning Mastery; 2016. https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python. Accessed 15 Oct 2021.
DeHan C. BoostARoota. 2017.
Abdi H, Williams LJ. Principal component analysis. Wiley Interdiscip Rev: Comput Stat. 2010;2(4):433–59.
Article Google Scholar
Xu Z, Taylor JA. Genome-wide age-related DNA methylation changes in blood and other tissues relate to histone modification, expression and cancer. Carcinogenesis. 2014;35(2):356–64.
Article CAS PubMed Google Scholar
Everson TM, et al. DNA methylation loci associated with atopy and high serum IgE: a genome-wide application of recursive random forest feature selection. Genome Med. 2015;7(1):1–16.
Article Google Scholar
Baur B, Bozdag S. A feature selection algorithm to compute gene centric methylation from probe level methylation data. PLoS ONE. 2016;11(2):e0148977.
Article PubMed PubMed Central Google Scholar
Knol MJ, Pestman WR, Grobbee DE. The (mis) use of overlap of confidence intervals to assess effect modification. Eur J Epidemiol. 2011;26(4):253–4.
Article PubMed PubMed Central Google Scholar
Correlation Confidence Interval Calculator. Statistics Kingdom, 2022.
Bakdash JZ, Marusich LR. Repeated measures correlation. Front Psychol. 2017;8:456.
Article PubMed PubMed Central Google Scholar
Verhulst S. Improving comparability between qPCR-based telomere studies. Wiley; 2020.
Book Google Scholar
Algamal ZY, Lee MH. Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst Appl. 2015;42(23):9326–32.
Article Google Scholar
Pearce EE, et al. DNA-methylation-based telomere length estimator: comparisons with measurements from flow FISH and qPCR. Aging (Albany NY). 2021;13(11):14675.
Article CAS PubMed Google Scholar
Kelleher J, Mac Namee B, Arcy AD’. Machine learning for predictive data analytics. Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies; 2015. p. 1–19.
Li M, et al. Identification and validation of novel DNA methylation markers for early diagnosis of lung adenocarcinoma. Mol Oncol. 2020;14(11):2744–58.
Article CAS PubMed PubMed Central Google Scholar
Raweh AA, Nassef M, Badr A, Feature selection and extraction framework for DNA methylation in cancer. Int J Adv Comp Science & Appl.;2017:8(7).
Halla-Aho V, Lähdesmäki H. Probabilistic modeling methods for cell-free DNA methylation based cancer classification. BMC Bioinform. 2022;23(1):1–24.
Article Google Scholar
Austin PC, Hux JE. A brief note on overlapping confidence intervals. J Vasc Surg. 2002;36(1):194–5.
Article PubMed Google Scholar
Foody GM. Sample size determination for image classification accuracy assessment and comparison. Int J Rem Sens. 2009;30(20):5273–91.
Article Google Scholar
Duro DC, Franklin SE, Dubé MG. A comparison of pixel-based and object-based image analysis with selected machine learning algorithms for the classification of agricultural landscapes using SPOT-5 HRG imagery. Rem Sens Environ. 2012;118:259–72.
Article Google Scholar
Dagnall CL, et al. Effect of pre-analytic variables on the reproducibility of qPCR relative telomere length measurement. PLoS ONE. 2017;12(9):e0184098.
Article PubMed PubMed Central Google Scholar
Chen W, et al. Longitudinal versus cross-sectional evaluations of leukocyte telomere length dynamics: age-dependent telomere shortening is the rule. J Gerontol Ser A: Biomed Sci Med Sci. 2011;66(3):312–9.
Article CAS Google Scholar
Baragetti A, et al. Telomere shortening over 6 years is associated with increased subclinical carotid vascular damage and worse cardiovascular prognosis in the general population. J Intern Med. 2015;277(4):478–87.
Article CAS PubMed Google Scholar
Müezzinler A, Zaineddin AK, Brenner H. A systematic review of leukocyte telomere length and age in adults. Ageing Res Rev. 2013;12(2):509–19.
Article PubMed Google Scholar
Ehrlenbach S, et al. Influences on the reduction of relative telomere length over 10 years in the population-based Bruneck study: introduction of a well-controlled high-throughput assay. Int J Epidemiol. 2009;38(6):1725–34.
Article PubMed Google Scholar
Kim J-H, et al. Heritability of telomere length across three generations of Korean families. Pediatr Res. 2020;87(6):1060–5.
Article PubMed Google Scholar
Bischoff C, et al. The heritability of telomere length among the elderly and oldest-old. Twin Res Hum Genet. 2005;8(5):433–9.
Article PubMed Google Scholar
Broer L, et al. Meta-analysis of telomere length in 19 713 subjects reveals high heritability, stronger maternal inheritance and a paternal age effect. Eur J Hum Genet. 2013;21(10):1163–8.
Article CAS PubMed PubMed Central Google Scholar
Hjelmborg JB, et al. The heritability of leucocyte telomere length dynamics. J Med Genet. 2015;52(5):297–302.
Article CAS PubMed Google Scholar
Honig LS, et al. Heritability of telomere length in a study of long-lived families. Neurobiol Aging. 2015;36(10):2785–90.
Article CAS PubMed PubMed Central Google Scholar
Jeanclos E, et al. Telomere length inversely correlates with pulse pressure and is highly familial. Hypertension. 2000;36(2):195–200.
Article CAS PubMed Google Scholar
Breitling LP, et al. Frailty is associated with the epigenetic clock but not with telomere length in a German cohort. Clin Epigenet. 2016;8(1):1–8.
Article Google Scholar
Marioni RE, et al. The epigenetic clock and telomere length are independently associated with chronological age and mortality. Int J Epidemiol. 2016;45(2):424–32.
Article PubMed PubMed Central Google Scholar
Belsky DW, et al. Eleven telomere, epigenetic clock, and biomarker-composite quantifications of biological aging: do they measure the same thing? Am J Epidemiol. 2018;187(6):1220–30.
PubMed Google Scholar
Dalgård C, et al. Leukocyte telomere length dynamics in women and men: menopause vs age effects. Int J Epidemiol. 2015;44(5):1688–95.
Article PubMed PubMed Central Google Scholar
Houseman EA, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinform. 2012;13(1):1–16.
Article Google Scholar
Horvath S, et al. An epigenetic clock analysis of race/ethnicity, sex, and coronary heart disease. Genome Biol. 2016;17(1):1–23.
Article Google Scholar
Chen BH, et al. Leukocyte telomere length, T cell composition and DNA methylation age. Aging (Albany NY). 2017;9(9):1983.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We thank the Dunedin Study members, unit research staff, and study founder Phil Silva. We are grateful to all participants of EXETER 10000. We would like to thank the Irish Centre for High End Computing (https://www.ichec.ie/) for the use of their HPC infrastructure and support, and the Applied Intelligence Research Centre (Technological University Dublin) for provision of computing cluster resources.

Funding

This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6183 to TMM, SJD and TD. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. The authors would like to acknowledge funding support for the project from the Brain and Behaviour Research Foundation (BBF) through a NARSAD Young Investigator Grant to TMM and ED. This project was also supported by the National Institute for Health Research (NIHR) Exeter Clinical Research Facility. The NIHR Exeter Clinical Research Facility is a partnership between the University of Exeter Medical School College of Medicine and Health, and Royal Devon and Exeter NHS Foundation Trust. The views expressed are those of the authors and not necessarily those of the BBF, NHS, SFI, the NIHR or the Department of Health and Social Care. The Dunedin Multidisciplinary Health and Development Research Unit was supported by the New Zealand Health Research Council and the New Zealand Ministry of Business, Innovation and Employment, and also by the National Institutes of Health National Institute of Aging (Grant No. R01AG032282), Medical Research Council (Grant No. MR/P005918), and Jacobs Foundation.

Author information

Sarah Jane Delany and Therese M. Murphy contributed equally to this work.

Authors and Affiliations

School of Biological, Health and Sports Sciences, Technological University Dublin, Dublin, Ireland
Trevor Doherty & Therese M. Murphy
SFI Centre for Research Training in Machine Learning, Technological University Dublin, Dublin, Ireland
Trevor Doherty
University of Exeter Medical School, University of Exeter, Exeter, UK
Emma Dempster, Eilis Hannon & Jonathan Mill
Department of Psychology, University of Otago, Dunedin, 9016, New Zealand
Richie Poulton
Center for Genomic and Computational Biology, Duke University, Durham, NC, 27708, USA
David Corcoran
Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
Avshalom Caspi & Terrie E. Moffitt
Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
Karen Sugden, Ben Williams, Avshalom Caspi & Terrie E. Moffitt
School of Computer Science, Technological University Dublin, Dublin, Ireland
Sarah Jane Delany

Authors

Trevor Doherty
View author publications
You can also search for this author in PubMed Google Scholar
Emma Dempster
View author publications
You can also search for this author in PubMed Google Scholar
Eilis Hannon
View author publications
You can also search for this author in PubMed Google Scholar
Jonathan Mill
View author publications
You can also search for this author in PubMed Google Scholar
Richie Poulton
View author publications
You can also search for this author in PubMed Google Scholar
David Corcoran
View author publications
You can also search for this author in PubMed Google Scholar
Karen Sugden
View author publications
You can also search for this author in PubMed Google Scholar
Ben Williams
View author publications
You can also search for this author in PubMed Google Scholar
Avshalom Caspi
View author publications
You can also search for this author in PubMed Google Scholar
Terrie E. Moffitt
View author publications
You can also search for this author in PubMed Google Scholar
Sarah Jane Delany
View author publications
You can also search for this author in PubMed Google Scholar
Therese M. Murphy
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

TMM conceived the study and with SJD supervised the project. TD performed all statistical and machine learning analysis. TD, SJD and TMM contributed to the computational interpretation of the results and were involved in the overall manuscript conceptualisation. TMM, JM, ED, EH, AC, TEM and BW provided telomere length and/or DNA methylation data for the study. AC, TEM and RP conceptualised and designed the Dunedin longitudinal cohort study. RP, DC, KS, BW, AC and TEM conceptualised data collection protocols and created variables for the Dunedin Cohort. TD, TMM and SJD drafted the manuscript. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Trevor Doherty.

Ethics declarations

Ethics approval and consent to participate

All experiments were performed in accordance with national guidelines and regulations as well as the Declaration of Helsinki. The Dunedin study participants gave written informed consent, and study protocols were approved by the NZ-HDEC (New Zealand Health and Disability Ethics Committee). For both the EXTEND and TWIN studies, informed consent was obtained from participants and ethical approval was obtained from the University of Exeter Medical School Research Ethics Board.

Consent for publication

Not applicable.

Competing interests

The authors confirm no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Performance metrics information, correlation plots, multiple linear regression results and blood cell count correlation tables.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Doherty, T., Dempster, E., Hannon, E. et al. A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator. BMC Bioinformatics 24, 178 (2023). https://doi.org/10.1186/s12859-023-05282-4

Download citation

Received: 29 March 2022
Accepted: 11 April 2023
Published: 01 May 2023
DOI: https://doi.org/10.1186/s12859-023-05282-4

A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator

Abstract

Background

Results

Conclusions

Background

Methods

Data description

Modelling overview

Elastic net regression and feature-selection methods

Elastic net regression

Feature-selection methods

Other regression-based learning algorithms

Statistical analysis

Results and discussion

Nested cross-validation analysis on Dunedin data set

Validation on EXTEND and TWIN independent data

Feature-selection with other regression algorithms for estimator development

Comparison with previously developed DNAmTL estimator

Correlation of actual TL and estimators MI-EN TL and PCA-EN TL

Predicted TL from MI-EN TL and PCA-EN TL estimators more strongly correlated with age than TL

Effect of sex

MI-EN TL and PCA-EN TL association with imputed blood cell composition

MI-EN TL and PCA-EN TL association with actual blood cell composition in Dunedin cohort

Conclusions

Availability of data and materials

Change history

07 June 2023

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Supplementary Information

Additional file 1.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us