Latent class distributional regression for the estimation of non-linear reference limits from contaminated data sources

Background: Medical decision making based on quantitative test results depends on reliable reference intervals, which represent the range of physiological test results in a healthy population. Current methods for the estimation of reference limits focus either on modelling the age-dependent dynamics of different analytes directly in a prospective setting or on extracting independent distributions from contaminated data sources, e.g. data with latent heterogeneity due to unlabeled pathologic cases. In this article, we propose a new method to estimate indirect reference limits with non-linear dependencies on covariates from contaminated datasets by combining the framework of mixture models and distributional regression.

Results: Simulation results based on mixtures of Gaussian and gamma distributions suggest accurate approximation of the true quantiles that improves with increasing sample size and decreasing overlap between the mixture components. Due to the high flexibility of the framework, initialization of the algorithm requires careful considerations regarding appropriate starting weights. Estimated quantiles from the extracted distribution of healthy hemoglobin concentration in boys and girls provide clinically useful pediatric reference limits similar to solutions obtained using different approaches which require more samples and are computationally more expensive.

Conclusions: Latent class distributional regression models represent the first method to estimate indirect non-linear reference limits from a single model fit, while the general scope of applications can be extended to other scenarios with latent heterogeneity.

Background
While direct methods based on samples from healthy reference individuals are considered the gold standard for reference interval determination [3], they require comprehensive and careful definition of the inclusion criteria and recruitment of an appropriate reference population. Together with the recommendations that each laboratory should establish its own reference intervals due to potential transferability problems and conduct periodical reviews of the resulting estimates, this gold standard is an enormous and unmet challenge for most laboratories [4].
This task becomes even more demanding considering that many analytes vary greatly with respect to different covariates of the patient [5]. In practice, this problem is often solved by splitting the population into subgroups and determining separate intervals. While this approach seems reasonable for categorical features such as sex, deciding how to define the cutpoints that differentiate between multiple age groups is substantially more challenging. Discretization inevitably leads to discontinuities in the reference limits and thus to increased uncertainty near these cutpoints. Early work by Virtanen et al. [6] already proposed regression-based estimation to address this issue, but the models applied are rather limited and not sufficient to describe the full conditional distribution of most analytes.
Given that age-dependent variations are particularly pronounced during childhood, however, it can be difficult to reach sample sizes big enough to establish reliable intervals over the whole age range, as the recruitment of children for medical studies is subject to strict regulations [7]. Alternative solutions therefore rely on the retrospective use of existing data from laboratory databases to estimate "indirect" reference intervals [4]. However, these databases are contaminated in the sense that it is rarely possible to reconstruct the analyte-specific health status for most entries, as the sampled individuals have not undergone a screening process comparable to the strict inclusion criteria mentioned above, but laboratory testing was performed as part of a diagnostic workup. As a consequence, the retrospective estimation of reference intervals requires additional precautions in order to avoid corruption by pathological samples. While several adequate methods based on statistical procedures have been developed to decompose the unlabeled empirical distribution, they mainly focus on estimating reference intervals independently from covariates.
In this article, we suggest a new approach to derive reference limits from clinical laboratory databases using mixtures of Gaussian location and scale models via the framework of generalized additive models for location, scale and shape (GAMLSS) [8]. The general framework of GAMLSS and other variants of distributional regression have a long history in the construction of reference growth charts for children [9][10][11] and are particularly recommended by the WHO for this task [12]. However, these classical approaches rely on data from reference populations and hence are not directly suitable for our data situation with latent classes (for a Markov-switching GAMLSS approach for latent states see [13]).
By incorporating mixtures of GAMLSS to simultaneously model the non-linear dependence structure of both mean and variance parameter for the unknown healthy and non-healthy components, our new approach allows for a very flexible solution to estimate the underlying distributions of reference limits from heterogeneous but unlabeled data sources. In contrast to recent contributions addressing these issues by splitting the data into multiple overlapping windows and subsequently interpolating the individually calculated reference limits [14,15], our solution provides an integrated approach that only requires a single fit to the data. As a consequence, latent class distributional regression represents the first approach capable of simultaneously accounting for both non-linear dependencies of covariates with respect to multiple distribution parameters and unlabeled data sources in this setting. We demonstrate our approach by estimating continuous hemoglobin reference intervals for boys and girls in a heavily contaminated real-world dataset from a tertiary care center.

Simulation study
To investigate whether the suggested latent class distributional regression models are able to approximate the true underlying non-linear components of an unlabeled dataset, we examined the performance of our approach using an adaptable simulation scenario described in this section. With the regressor variable x standard uniformly distributed, all responses were sampled from a Gaussian mixture with two components. The first component in each scenario is considered to be the distribution of main interest, i.e. the distribution of what will later be attributed to the "healthy" part of the sample, with mean and standard deviation both depending on x. In order to evaluate the performance of the algorithm under varying conditions, an adjustable specification was considered for the second component to account for different degrees of overlap and the resulting distinctiveness of the true distributions. This was achieved by an additional spacing variable c entering the mean formula of the second component, whose variance is not affected by c and hence the same for all simulation settings. Consequently, the smaller c, the larger the overlap, with identical expected values but $\eta^{(i)}_{1\sigma} < \eta^{(i)}_{2\sigma}$ for all i if c = 0. In order to further diversify the simulation setup, we additionally considered different mixture weights α to reflect different amounts of contamination in the data, with α_1 always denoting the more dominant "healthy" component.

Figure 1 shows an exemplary simulated dataset, where it can be seen that the locations of the two distributions slowly diverge and their variances increase for larger values of x. Revisiting the details of the true components given above, the exponent of both variance terms is actually even linearly increasing with respect to x. However, we decided to use non-linear prediction functions for all four distribution parameters, standard deviations included: first, because we felt that this form of linearity might not only be difficult to detect but would also rarely be considered appropriate in most practical applications, and second, because it allowed us to simultaneously investigate how the algorithm deals with over-specification of the model.

Parameter estimation for the different components of the mixture distribution was achieved using the maximum likelihood approach implemented in the gamlss package for R [16,17], wrapped inside an EM algorithm as described in Algorithm 1 with the convergence threshold ε set to 0.001. Both non-linear terms for each component were estimated using cubic B-splines with three degrees of freedom. A reproducible capsule of the R code used for data generation and analysis is publicly available on Code Ocean [18].
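The following R sketch only illustrates the structure just described; the numeric coefficients and the function name simulate_mixture are placeholders and not the specification used for our simulations: two Gaussian components whose locations diverge with x, log-linear standard deviations, and a spacing constant c acting solely on the mean of the second component.

```r
# Illustrative data generator with placeholder coefficients (not the paper's values)
simulate_mixture <- function(n, c, alpha1) {
  x   <- runif(n)                          # standard uniform covariate
  z   <- rbinom(n, 1, alpha1)              # 1 = "healthy" component, 0 = second component
  mu1 <- 10 + 2 * x                        # placeholder mean of component 1
  mu2 <- mu1 + c * x                       # spacing c shifts only the mean of component 2
  s1  <- exp(0.1 + 0.5 * x)                # log-linear standard deviations ...
  s2  <- exp(0.3 + 0.5 * x)                # ... with component 2 more dispersed even for c = 0
  y   <- ifelse(z == 1, rnorm(n, mu1, s1), rnorm(n, mu2, s2))
  data.frame(x = x, y = y, z = z)
}

dat <- simulate_mixture(n = 1000, c = 10, alpha1 = 0.7)
```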
While many applications of mixture models are initialized using, e.g., random assignment of observations, doing so resulted in many unreasonable solutions, especially for higher values of c. This was a result of the algorithm quickly running into local maxima due to its high flexibility, especially with respect to the variance components. We therefore addressed this issue by providing more informative initial weights based on the cumulative distribution function derived from a somewhat 'naive' application of a Gaussian GAMLSS that ignores the fact that the data is sampled from a mixture. Overall, our results are based on 24,000 independently generated datasets, i.e. 1000 repetitions for each unique combination of c ∈ {5, 10, 15, 20}, α_1 ∈ {0.6, 0.7, 0.8} and n ∈ {500, 1000}. In 4.43% of all runs the algorithm encountered problems before reaching the targeted convergence threshold of ε = 0.001. About 43% of these issues occurred when c was set to 5, which is not that surprising considering the substantial overlap between both components in these rather extreme scenarios. Interestingly, a feasible solution in most of these cases was to rerun the algorithm with randomly assigned component memberships as starting weights, as this resulted in one component quickly taking over and explaining large parts of the data, while the second addressed the most extreme outliers. Although the affected runs did not necessarily result in obviously wrong estimates of the desired quantiles when using the naive model for initialization, we decided to omit them from the aggregated results to base the comparisons on identical conditions with respect to the initialization strategy. For a more detailed breakdown, see Table 3 in the "Appendix".
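A minimal sketch of the naive initialization described above, assuming for illustration that pathologic values tend to lie above the healthy ones; the variable names and the particular mapping from CDF values to starting weights are assumptions, not the exact implementation.

```r
library(gamlss)    # gamlss(), NO family, pNO()
library(splines)   # bs() for cubic B-splines

# Naive Gaussian GAMLSS ignoring the mixture structure (dat, y, x are placeholders)
naive <- gamlss(y ~ bs(x, df = 3), sigma.formula = ~ bs(x, df = 3),
                family = NO, data = dat,
                control = gamlss.control(trace = FALSE))

# CDF value of each observation under the naive fit
u <- pNO(dat$y, mu = fitted(naive, "mu"), sigma = fitted(naive, "sigma"))

# Assumed mapping: observations in the upper tail of the naive fit start with a
# low weight for the "healthy" component (the direction is an assumption).
w_init <- 1 - u
```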
Along with the results of the latent class distributional regression model (LCDR), we additionally report the performance of two standard GAMLSS fits. The first is the "naive" fit to the full dataset used for the initialization of the LCDR, which ignores the true nature of the data and should therefore perform rather poorly. The second is a GAMLSS fit only to those observations in each simulated dataset that are in fact sampled from the "healthy" first component. This model can be considered the gold standard that could have been achieved with a prospective study design. The main objective of all three models is the estimation of $Q_{1,0.95}(x)$, i.e. the 95% quantile of the first component. Tables 1 and 2 show the aggregated results of the three models, i.e. latent class distributional regression (LCDR), "gold standard" (GOLD) and "naive" fit (NAIVE), for each scenario with n = 500 and n = 1000, respectively. With the predictor variable x standard uniformly distributed, we calculate the integrated error $\mathrm{IE} = \int_0^1 \big(\hat{Q}_{1,0.95}(x) - Q_{1,0.95}(x)\big)\,dx$ as a measure of bias together with the integrated squared error $\mathrm{ISE} = \int_0^1 \big(\hat{Q}_{1,0.95}(x) - Q_{1,0.95}(x)\big)^2\,dx$ as a measure of general deviation from the true quantile for each run.
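As a rough illustration of these two measures, the integrals over [0, 1] can be approximated by averaging over the standard uniformly distributed x values of a simulated dataset; in the sketch below, m1 denotes a fitted Gaussian GAMLSS for the "healthy" component and q_true() the known true 95% quantile function, both of which are assumed names rather than part of the published code.

```r
# Monte Carlo approximation of IE and ISE over the uniform covariate values;
# m1 is a fitted gamlss object for component 1, q_true() the true quantile function.
q_hat <- fitted(m1, "mu") + qnorm(0.95) * fitted(m1, "sigma")  # estimated 95% quantile
ie    <- mean(q_hat - q_true(dat$x))                           # integrated error (bias)
ise   <- mean((q_hat - q_true(dat$x))^2)                       # integrated squared error
```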
As should be expected, the LCDR is mostly situated between the two GAMLSS fits. With increasing values of c, the underlying components are easier to differentiate and LCDR performance generally increases, while the naive model deteriorates. Note that the results for the gold standard models only vary with respect to different c because of the runs removed due to non-convergence of the LCDR-algorithm on the same dataset.
While outperforming the naive approach in every setup, the LCDR shows a tendency to underestimate the true 95%-quantiles for very low c. This can be traced back to initializing the models in a way that they start far away from each other. If the overlap of the mixture components is very large, it is harder for the algorithm to bring the components close enough together while simultaneously trying to estimate their shape, whereas smaller overlap means more favorable and convenient starting positions.
The integrated squared error further reflects this issue. With increasing sample size, however, the average ISE of the LCDR improves considerably. Figures 2 and 3 show the quantiles estimated from the first 100 generated datasets for four corresponding scenarios. While a few estimated limits deviate heavily from the true quantile in the settings with n = 500, especially at the boundaries of x, such obvious errors are no longer present when the sample size is doubled and the true quantile is approximated much better. Of course, this could have easily been prevented by using linear predictors for the variance parameters of the components or by rerunning the algorithm with a new set of initial weights in the most obvious cases. Nevertheless, sample size plays an important role in avoiding local maxima caused by overfitting random patterns in the data.
In an additional simulation study involving a mixture of gamma distributions, the algorithm performs similarly well as in the Gaussian settings. However, the importance of a sufficiently large sample is also evident in these scenarios. The corresponding results are available in full detail on Code Ocean [18].

Age-dependent hemoglobin reference intervals
We apply our model to the estimation of pediatric hemoglobin reference intervals. All laboratory tests were performed in the context of patient care in the Department of Pediatrics and Adolescent Medicine at the University Hospital Erlangen, Germany. We used test results from all children, irrespective of health status or speciality unit, including intensive care and oncology units.

Table 1 Simulation results for n = 500
Reported are the means and standard deviations of the integrated error and the integrated squared error over all simulation runs for latent class distributional regression (LCDR), "gold standard" (GOLD) and "naive" fit (NAIVE)

Table 2 Simulation results for n = 1000
Reported are the means and standard deviations of the integrated error and the integrated squared error over all simulation runs for latent class distributional regression (LCDR), "gold standard" (GOLD) and "naive" fit (NAIVE)

After removing samples taken at subsequent visits of the same person, n_f = 60424 observations of girls and n_m = 75464 observations of boys aged between 1 and 18 years are available. The model is estimated as a mixture of two Gaussian distributions conditional on age and sex using the predictors

$\eta_{m\mu} = \beta_{m0\mu} + \beta_{m1\mu} \cdot I_{\text{female}}(\text{sex}) + h_{m1\mu}(\text{age}) \cdot I_{\text{female}}(\text{sex}) + h_{m2\mu}(\text{age}) \cdot I_{\text{male}}(\text{sex})$

$\eta_{m\sigma} = \beta_{m0\sigma} + \beta_{m1\sigma} \cdot I_{\text{female}}(\text{sex}) + h_{m1\sigma}(\text{age}) \cdot I_{\text{female}}(\text{sex}) + h_{m2\sigma}(\text{age}) \cdot I_{\text{male}}(\text{sex})$

for both components m ∈ {1, 2}, where I(·) is the indicator function. As in the simulations, we used B-splines with three degrees of freedom for all non-linear model terms. Figure 4 shows 5000 randomly selected data points from the full dataset. The shaded areas represent the estimated distribution of healthy hemoglobin concentrations, enclosed by the 2.5% and 97.5% quantiles depicted as red or blue solid lines. Fitting the models separately to the data from boys and girls with age as the single non-linear predictor led to almost identical quantiles, depicted by dashed lines. The solid black lines show the corresponding quantiles based on estimates from an alternative approach that has been evaluated extensively but requires splitting of the dataset into age groups and subsequent interpolation of the separately calculated limits.
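In gamlss syntax, the sex-specific age effects of one mixture component could be specified along the following lines; hgb, age, sex, w1 and dat are placeholder names and the call is a hedged sketch rather than the exact implementation used for the analysis.

```r
library(gamlss)
library(splines)

# One component's weighted Gaussian GAMLSS with a sex-specific intercept and
# separate cubic B-splines of age for girls and boys, for both mu and sigma.
m <- gamlss(hgb ~ sex + bs(age, df = 3):sex,
            sigma.formula = ~ sex + bs(age, df = 3):sex,
            family = NO, weights = w1, data = dat,
            control = gamlss.control(trace = FALSE))
```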

Discussion
In this article, we propose an algorithm that extends the framework of latent class regression models to the setting of distributional regression. We use the resulting model to estimate age-specific hemoglobin reference limits from laboratory databases containing unlabeled samples from both healthy children and children with a wide range of diseases. While the framework is by no means limited to these specific scenarios and can be used for a large variety of research questions associated with unobserved heterogeneity, the proposed approach clearly presents a valuable contribution to the field of reference interval estimation, where methods capable of simultaneously accounting for both non-linear dependencies of covariates with respect to multiple distribution parameters and unlabeled data sources are still lacking. The algorithm performs very promisingly in our simulation study, even in the most difficult scenarios with extreme overlap between the components.

Our simulation study was motivated by the actual application; however, further research is warranted to extend the approach to more complex scenarios that have not been considered so far. First, we restricted our simulations to scenarios with two components, one for the healthy and another one for the pathologic samples. This is perfectly fine if the majority of pathologic values are either higher or lower compared to those considered healthy, as is the case in our application. For other analytes, it may be necessary to specify an additional pathologic distribution to cover deviations from the healthy range in both directions. Moreover, our simulations so far only focused on mixtures of Gaussian and gamma distributions, which may not always represent the best choices in all situations. However, both aspects can generally be addressed by increasing the overall number of components M and using different probability distribution functions for those components. Although the simulation study addresses only scenarios with a single predictor variable, the very similar solutions of the joint and separate models for boys and girls regarding pediatric hemoglobin reference limits further offer a promising perspective for extending the approach towards scenarios with more than one covariate.

Currently, the algorithm uses unpenalized B-splines to estimate the non-linear model terms. The degree of smoothness of the effects therefore has to be specified in advance, which will not always be a straightforward decision in practice. A natural expansion of the proposed algorithm is hence the introduction of a penalty to reduce the risk of overfitting (one conceivable route is sketched below). However, the model is already very flexible in its current implementation, which makes the solutions depend strongly on the initial observation weights. As a consequence, the stability of the suggested solutions should be further evaluated in practice by using multiple sets of different starting weights or bootstrapped datasets. Since straying into local maxima is strongly determined by the extent to which the first iterations pick up misleading patterns, an important extension would hence be the search for ways to absorb these early missteps. Reducing the dependence on the initialization could then extend the applicability of the model to situations in which theoretically meaningful assumptions about the class memberships are not available in advance.
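One conceivable, untested route towards such a penalty within the gamlss framework would be to replace the unpenalized B-spline bases with the package's penalized B-spline smoother pb(), which selects the effective degrees of freedom internally; the snippet below is a hedged sketch with placeholder variable names, not part of the proposed algorithm.

```r
library(gamlss)

# Penalized B-splines instead of fixed-df bs() terms for one weighted component;
# y, x, w1 and dat are placeholder names.
m1_pen <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x),
                 family = NO, weights = w1, data = dat,
                 control = gamlss.control(trace = FALSE))
```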
In the context of indirect reference interval estimation, introducing model constraints that prevent crossing of the location parameters could further guide the algorithm to avoid theoretically unreasonable solutions, especially if the model involves separate components for low and high pathologic measurements.
The mixture weights α represent another current limitation: they are estimated as constant with respect to the values of the predictor variable(s). It is not unlikely, however, that the mixture ratio of the components depends on external factors as well. Regarding analytes sampled from children, for example, the proportion of healthy samples may decrease with age, as routine examinations become less frequent and many of the samples obtained in everyday care are taken from children who visit the hospital for reasons that may also affect the analyte of interest. Approaches that are able to estimate dynamic mixture weights could therefore be a valuable extension of the model framework. Further possible improvements of our approach involve the implementation of lower bounds for certain distribution parameters, such as the Gaussian variance parameter, to avoid potential problems with respect to unbounded likelihoods.

Conclusion
Latent class distributional regression models provide a practical and theoretically sound framework suitable for, but not limited to, the estimation of reference limits from contaminated databases. Because the model components can be specified very flexibly, applying the approach requires careful consideration of meaningful initial weight vectors and of the degrees of freedom for the non-linear effects. Moreover, these efforts should be supported by a sample that is sufficiently large in relation to the desired degree of flexibility.

Finite mixture models
Applying a (parametric) statistical model to a dataset usually requires the assumption of a probability density function that describes the distribution of the variable of interest. In the context of the retrospective estimation of reference limits, however, it is reasonable to assume that measurements taken from patients with specific diseases are differently distributed than those taken from patients considered healthy. If there is not enough information about the origin of the samples to either separate them before the analysis or adjust the model formulation accordingly, this latent heterogeneity means that the use of a single density function does not sufficiently describe the entire distribution of the analyte in the database. As a result, using the quantiles of the estimated distribution as reference limits would most likely overestimate the quantiles of the healthy population, as many pathologic measurements are either substantially higher or lower and thus tend to give more weight to one or both tails of the composite empirical distribution.
In these situations, given the number of components M, a weighted sum of m = 1, ..., M probability density functions $f_m(x^{(i)}, \theta_m)$ with corresponding parameter vectors $\theta_m$ can be used to construct a finite mixture distribution

$f(x^{(i)}, \psi) = \sum_{m=1}^{M} \alpha_m f_m(x^{(i)}, \theta_m),$

where $\psi = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M)$ contains all unknown weights and parameters to be estimated. With $\alpha_m > 0$ and $\sum_{m=1}^{M} \alpha_m = 1$, $f(x^{(i)}, \psi)$ is a convex combination of all $f_m(x^{(i)}, \theta_m)$ and thus a probability density function itself.
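As a toy numerical illustration with arbitrarily chosen weights and parameters, a two-component Gaussian mixture constructed this way integrates to one and is therefore itself a valid density:

```r
# f(x) = 0.7 * N(x; 0, 1) + 0.3 * N(x; 3, 2); values are arbitrary
f_mix <- function(x) 0.7 * dnorm(x, mean = 0, sd = 1) + 0.3 * dnorm(x, mean = 3, sd = 2)
integrate(f_mix, lower = -Inf, upper = Inf)  # approximately 1
```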
Finite mixture distributions have quite a long history [19,20], are comprehensively described in the statistical literature [21][22][23] and continue to provide useful approaches to a wide range of applications (e.g. [24,25]). The most common strategy to estimate the unknown parameters is the expectation maximization algorithm [26]. A well known limitation of EM algorithms, however, is that they become easily trapped in local maxima [27]. Other limitations involve identifiability problems for certain mixtures of density functions or when the number of components is misspecified [28]. In order to address this issue, several approaches based on bootstrapping methods have been suggested [29,30].

Generalized additive models for location, scale and shape
While the assumed distribution of an outcome y is usually described by multiple parameters, the main focus of most regression problems lies on using a set of given input variables X to model only the mean. Generalized additive models for location, scale and shape (GAMLSS) [8] extend the original framework of generalized additive models (GAM) [31] by defining an additional additive predictor for each of up to four distribution parameters $\theta_k$, k = 1, ..., 4:

$g_k(\theta_k) = \eta_{\theta_k} = \beta_{0\theta_k} + \sum_{j=1}^{p_k} h_{j\theta_k}(x_{kj}).$

Here, $g_k(\cdot)$ denotes a known monotonic link function and $x_{k1}, \ldots, x_{kp_k}$ the $p_k$ (possibly different) input variables for each predictor $\eta_{\theta_k}$, with $h_{j\theta_k}(\cdot)$ describing the shape of the effect of $x_{kj}$ on $\theta_k$. The four parameters are also referred to more specifically as $(\theta_1, \theta_2, \theta_3, \theta_4) = (\mu, \sigma, \nu, \tau)$. Set up properly, this model class therefore allows modelling patterns in the data that are, for instance, responsible for differences in the residual variance of Gaussian regression or the dispersion parameter of negative binomial regression. The model parameters are usually estimated via (penalized) maximum likelihood by means of either a Newton-Raphson or Fisher scoring algorithm. Later, the use of gradient boosting algorithms was proposed to allow for inference in high-dimensional settings via implicit regularization [32].
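For illustration, a GAMLSS with a four-parameter response distribution is specified by supplying one formula per parameter; the Box-Cox t family used here is a common choice for growth-chart-type data, and all variable names are placeholders rather than part of our model.

```r
library(gamlss)
library(splines)

# Additive predictors for all four parameters (mu, sigma, nu, tau) of the
# Box-Cox t distribution; the response y must be positive for this family.
fit4 <- gamlss(y ~ bs(x, df = 3),
               sigma.formula = ~ bs(x, df = 3),
               nu.formula    = ~ 1,
               tau.formula   = ~ 1,
               family = BCT, data = dat)
```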

Latent class distributional regression
In order to simultaneously address both highly dynamic dependence structures of the distribution of interest and the problem of unlabeled observations from possibly multiple sources, we propose to combine the GAMLSS framework with the concept of mixture models. With a pre-specified number of components M, this results in a latent class distributional regression model

$f(y^{(i)}, \xi^{(i)}, \alpha) = \sum_{m=1}^{M} \alpha_m f_m(y^{(i)}, \theta_m^{(i)}),$

with mixture weights $\alpha = (\alpha_1, \ldots, \alpha_M)$ and $\xi^{(i)} = (\theta_1^{(i)}, \ldots, \theta_M^{(i)})$ containing all observation- and component-specific distribution parameters. A schematic overview of the steps performed for model estimation is shown in Algorithm 1. After providing initial values for all individual observation weights $w_m^{(i,0)}$, the algorithm starts by fitting suitable GAMLSS for each of the M components, with the data re-weighted each time using the corresponding vector $w_m^{(0)}$. In the same step, the initial mixture weights $\alpha_m^{(0)}$ are computed. Subsequently, the parameters $\theta_m^{(i,0)}$ of the initial model fit are used to derive updates for the observation weights by computing the likelihood of each observation belonging to the separate component models. With the new weights $w_m^{(i,1)}$ available, the data is re-weighted accordingly, and refitting the models for each component again leads to new sets of parameter estimates $\theta_m^{(i,1)}$. Cycling through these alternating updates steadily increases the log-likelihood of the full mixture model and eventually converges to a solution where additional iterations no longer bring any substantial improvement.
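A minimal R sketch of this alternating scheme for a two-component Gaussian mixture is given below; it follows the steps of Algorithm 1, but all names, defaults and implementation details are simplifying assumptions rather than the implementation available on Code Ocean [18].

```r
library(gamlss)    # gamlss(), NO family, dNO()
library(splines)   # bs() for cubic B-splines

# Two-component EM-type sketch: y response, x covariate, w_init starting
# weights for component 1 in (0, 1); names are illustrative.
fit_lcdr <- function(y, x, w_init, max_iter = 200, eps = 0.001) {
  dat <- data.frame(y = y, x = x, w1 = w_init, w2 = 1 - w_init)
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    ## M-step: weighted Gaussian GAMLSS fit for each component
    m1 <- gamlss(y ~ bs(x, df = 3), sigma.formula = ~ bs(x, df = 3),
                 family = NO, weights = w1, data = dat,
                 control = gamlss.control(trace = FALSE))
    m2 <- gamlss(y ~ bs(x, df = 3), sigma.formula = ~ bs(x, df = 3),
                 family = NO, weights = w2, data = dat,
                 control = gamlss.control(trace = FALSE))
    alpha1 <- mean(dat$w1)                        # mixture weight of component 1
    ## component densities evaluated at the observed responses
    d1 <- dNO(dat$y, mu = fitted(m1, "mu"), sigma = fitted(m1, "sigma"))
    d2 <- dNO(dat$y, mu = fitted(m2, "mu"), sigma = fitted(m2, "sigma"))
    ## E-step: posterior membership probabilities become the new weights
    dat$w1 <- alpha1 * d1 / (alpha1 * d1 + (1 - alpha1) * d2)
    dat$w2 <- 1 - dat$w1
    loglik <- sum(log(alpha1 * d1 + (1 - alpha1) * d2))
    if (abs(loglik - loglik_old) < eps) break     # convergence threshold epsilon
    loglik_old <- loglik
  }
  list(comp1 = m1, comp2 = m2, alpha1 = alpha1, weights = dat$w1, loglik = loglik)
}
```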