 Research article
 Open access
 Published:
Detecting contaminated birthdates using generalized additive models
BMC Bioinformatics volume 15, Article number: 185 (2014)
Abstract
Background
Erroneous patient birthdates are common in health databases. Detection of these errors usually involves manual verification, which can be resource intensive and impractical. By identifying a frequent manifestation of birthdate errors, this paper presents a principled and statistically driven procedure to identify erroneous patient birthdates.
Results
Generalized additive models (GAM) enabled explicit incorporation of known demographic trends and birth patterns. With false positive rates controlled, the method identified birthdate contamination with high accuracy. In the health data set used, of the 58 actual incorrect birthdates manually identified by the domain expert, the GAMbased method identified 51, with 8 false positives (resulting in a positive predictive value of 86.0% (51/59) and a false negative rate of 12.0% (7/58)). These results outperformed linear timeseries models.
Conclusions
The GAMbased method is an effective approach to identify systemic birthdate errors, a common data quality issue in both clinical and administrative databases, with high accuracy.
Background
Birthdate information is ubiquitous in clinical, administrative, and research databases. It is one of the most common pieces of information for identifying individuals. Unfortunately, also common in these databases are missing or incorrect birthdates. For example, we have previously identified 1.5% of birthdates as incorrect in a statewide public health service data set with more than 20,000 patient records [1]. Similarly, in a study of 1112 hospital admissions 2% of patients were registered with incorrect names or birthdates [2]. These errors are typically more common in paediatric databases [3].
In most databases, birthdates are stored as datetime values. Compared to a numeric data field, there are a number of reasons datetime columns may be more prone to contamination during data collection and transfer. Firstly, a datetime can be communicated in more than one geographical standard format: The date “January 2nd, 2010” can be entered into a database as “20100102” (ISO8601 standard), “01/02/2010” (North American format), “02/01/2010” (British/Australian format), or even “02Jan10” (default format in Oracle 9i). If a registration clerk uses a format inconsistent with the database, incorrect birthdates will be generated. These format inconsistencies frequently result in truncated dates or NULL values in the database. Secondly, suboptimal userinterface design can induce incorrect birthdates. For example, some user interfaces remember a date previously entered and use it to autocomplete the input textbox. If a registration clerk does not check carefully, an incorrect value may be repeatedly entered. As data are often entered from paper record, direct validation is often difficult. Finally, patients themselves may give an incorrect birthdate, considering the emergency nature of many patient encounters. For example, a patient may confuse age with the year of birth (e.g. age 53 being turned into 1953 as the year of birth) or round the birthdate to the first date of the birth year (e.g., “June 25, 1910” being turned into “January 1, 1910”). Parents or caregivers of young or elderly patients admitted may also be prone to birthdates errors.
As birthdate information is central in patient identification, incorrect birthdates may have severe consequences, for example transfusion of blood or marrow to the wrong patient [4]. Incorrect birthdates may also cause records to be overwritten or duplicated, which may eventually incur significant financial loss to the hospital [3]. Finally incorrect birthdate information can affect derived variables. If an incorrect birthdate occurs repeatedly, then agebased risk stratification based on the data would be invalid [5].
Similar to other data quality issues, the best solution for reducing incorrect birthdates is to have good quality assurance at the data collection stage [6]. However, as health care data is often collected in busy clinical environments, data quality can be less that optimal. It is therefore critical to identify and correct the incorrect birthdates in existing data. Given the volume and complexity of patients seen in both the public and private health systems, it is no longer possible to manually check complete patient databases. With the rapid growth of electronic health records, tools are needed to automatically identify likely erroneous birthdates.
Despite the prevalence of incorrect birthdates and the potential cost and adverse health outcomes that can result, few tools are available to identify birthdate errors. Often incorrect birthdates are identified in an adhoc fashion, using rules specific to a given database and its use. For example, one common type of birthdate errors is that the NULL representing a missing birthdate is replaced by a date representing zero. The detection of such an error often relies on prior knowledge of the zero encoding [7].
If incorrect birthdates were generated in a completely random fashion, it would be impossible to identify them. However in most databases, the majority of the incorrect birthdates are introduced by a few sources of common contamination. In the previous zeroencoding example, the missing birthdates are often replaced by a zero date when data is transferred from one owner to another. The replaced value depends on the software involved and therefore is not random. For example, Microsoft products often use “December 30, 1899” as the zero datetime, and Unix (ISO 8601) uses “January 1st, 1970 UTC” as the zero datetime.
More generally, when birthdates in a database are contaminated systematically, we expect the database to acquire artefacts in the birthdate distribution, in particular overrepresentation of certain birthdates. Based on this observation, we propose an effective approach to identify systemic contamination of birthdates.
The importance of planned data cleaning has long been recognized in the research community. When the database size is not an issue, interactive data cleaning is always a good starting point [8]. From the perspective of data warehousing, Rahm and Do proposed a taxonomy of data cleaning problems [9]. They first identified two levels of data cleaning problems: Schema level and Instance level. Then for each level, they defined different scopes and problems. Birthdate contamination fits into the “Misspelling” problem within the “Attribute” scope at the Instance level. In a wider context, Van den Broeck and colleagues defined data cleaning as a 3stage process: Screening/Detecting, Diagnosis, and Editing [5]. This paper deals with the Detecting stage of data cleaning. In philosophy, our proposed method fits into the general strategy of checking for inconsistencies reflected in frequency distributions and strange patterns. However, because the prevalence of birthdate contamination, its special characteristics, and the threats it poses to data quality, separate attention is needed for detecting birthdate contamination, in particular in a live large database environment.
Methods
Distribution of birthdates in a database
For the purpose of this paper, we consider only databases that identify one or more groups of individuals—such as patients or clinicians. Many clinical or administrative databases fall into this category including electronic health records (EHR), emergency department information systems, databases of controlled drug prescriptions, and medical claims databases. With a clear identification of individuals in such a database, each date d defines a set of N(d) individuals born on that day.
In an ideal situation where the actual numbers of births and deaths for every day are available, for example in a region with a wellmaintained birth and death registry and limited population migration, the actual number of births in the population for every date can be estimated. But with any given database, it is useful to distinguish 1) the general population P_{1} that includes everyone living in a region at a certain time period, 2) the “at risk” population P_{2} that includes everyone who in theory could be included in the database (e.g., males to a prostate cancer database), and 3) the group of people P_{3} who are actually in the database. Individuals in a database P_{3} can be regarded as a sample of the “at risk” population P_{2}. Consequently the number N(d) can be regarded as realization of a birthdate distribution defined on P_{2}.
In an ideal situation where the actual numbers of births and deaths for every day are available, for example in a region with a wellmaintained birth and death registry and limited population migration, the actual number of births in the population for every date can be estimated. But with any given database, it is useful to distinguish Consider an event that an individual was born at time t and later included in P_{3}. We assume that such an event follows a Poisson point process with a timevarying intensity function μ(t). In a Poisson point process with intensity function μ(t), for any time interval A, the number of events N_{ A } in A, follows a Poisson distribution \mathit{p}\left({\mathit{N}}_{\mathit{A}}=\mathit{k}\right)=\frac{\mathit{\lambda}{\left(\mathit{A}\right)}^{\mathit{k}}{\mathit{e}}^{\mathit{\lambda}\left(\mathit{A}\right)}}{\mathit{k}!}, where \mathit{\lambda}\left(\mathit{A}\right)=\underset{\mathit{t}\in \mathit{A}}{\int}\mathit{\mu}\left(\mathit{t}\right)\mathit{dt}. Hence for a date d, the number of individuals with the birthdate d follows a Poisson distribution defined by
Here k is the number of individuals born on day d and λ(d) = ∫_{t ∈ d}μ(t)dt is the aggregated intensity for date d and equals the expected value of N(d).
In certain situations with larger data variance, a negative binomial distribution can be used instead. That requires one more overdispersion parameter to fit. In this application, it is difficult to assess overdispersion; hence we prefer the simpler Poisson model assumption.
Let μ_{ d } be the mean of μ(t) on day d and l be the length of a day. Then λ(d) = μ_{ d } ⋅ l. If μ(t) changes slowly, sequence < λ(d) > can be regarded approximately as the result of sampling μ(t) daily and then multiplying the sample with the constant l. The sequence < λ(d) > is determined by the birthdate distribution of the general population P_{1} and the representation of P_{1} in the database.
The birthdate distribution of the general population P_{1} is determined on a large scale by the age profile of the population and on a small scale by seasonal and weekly fluctuations in child births. The age profile of the population (as in [10]) is the aggregated result of changes in birth and death rates. For example, the postWWII baby boom has contributed to the aging population structure in US, Canada, and Australia [11, 12]. As such changes are often driven by long lasting demographic forces such as economic development, war, and progress of medical science [13, 14], they manifest as slow and smooth variation in the sequence < λ(d)>. In contrast, seasonal and weekly variations in child births (as in [15]) act on a shorter temporal scale. More recently, most child births occur in a hospital environment and scheduled Caesarean section or labour induction are more frequent. These factors lead to more births on weekdays than weekends, which produces cyclic dips in the sequence < λ(d) > [16, 17].
How the general population is represented in a patient database is mostly determined by the nature of the disease. For example, a database of prostate cancer patients should contain only males; a database of skin cancer patients probably contains more Caucasians than patients of other races. In terms of the birthdate distribution, one common consideration is that a disease may pose higher risk to certain age groups. For example, individuals born before year 1930 may be disproportionately represented in a database of chronic noncancer pain. Like the population age profile, the nature of the disease acts on a larger temporal scale and it should not affect the smoothness of the sequence < λ(d)>. For example, a person born on March 2nd, 1980 and a person born on March 3rd, 1980 should have very similar chances of being included in a database, assuming all other conditions are equal.
Finally for most hospitals, a patient database grows out of paper records. As a hospital serves more patients, its database may cover a larger portion of the atrisk population P_{2}. However, it is reasonable to assume that any such change in the size of the database will affect different age groups proportionally, and is independent of the shape of < λ(d) >.
To summarize, the birthdate distribution of a database can be modelled using a function λ(d) that is generally smooth but contains weekly dips due to weekend reduction in births.
Telltale indicators of birthdate contamination: discrepancy between the expected and observed counts
When birthdates in a database are systematically contaminated, for example when a missing value is consistently replaced with the zero datetime, incorrect birthdates may be repeatedly generated. In such cases, the frequency of an incorrect birthdate d will be higher than other dates in the database. That is, the observed number of patients with birthdate d, denoted N(d), will be larger than the expected number λ(d). Therefore identifying incorrect birthdates can be achieved through identifying the date d whose person count N(d) is well above the expected count λ(d). Of course this observation is not new. For example, the difference between N(d) and λ(d) is called within deviation by Dasu and Johnson [18]. To the authors’ knowledge, however, no previous attempt has been made to statistically model this discrepancy between N(d) and λ(d) for error detection.
The deviation of N(d) from the expected count λ(d) can be measured in terms of the tail probability of the Poisson distribution. This provides a way to rank outliers across all birthdates. The top N(d)s with the smallest tail probabilities can be returned to the data cleaning staff for further confirmation and investigation. Alternatively, a small positive number c can be used as a threshold: N(d) is then labelled as an outlier if
In many applications, only a small number of the most likely erroneous birthdates will be manually checked. If required, the expected number of birthdate errors in database can be estimated from the results of manually checking small samples of the database [19].
A generalized additive model for birthdate distribution
The expected birthdate distribution λ(d) can be estimated from the data < N(d) > by assuming smoothness of the function λ(d), with proper handling of the reduction in births on weekends. In the previous section, we have seen that explicit modelling of the mechanisms shaping the age profile would be very complex. Nevertheless, both longterm and seasonal variations in births/deaths can be recovered by smoothing of counts N(d) in a relatively short temporal window. The reduction in weekend births can be modelled by a multiplicative factor that only affects Saturdays, Sundays, and public holidays. As the reduction is a gradual development driven by increasing hospital births, elective Caesarean section, and labour induction, the multiplicative factor can also be modelled by a smooth function. Finally, it is worth noting that government policies to encourage fertility can also lead to more births on a particular day [20], but such events are very rare and should be explicitly modelled on a casebycase basis.
Different techniques can be used to smooth the raw data < N(d) > to recover the function λ(d). But to explicitly model the dayofweek effect, we use an additive model in which λ(d) is modelled as the product of an overall smooth function and a multiplicative factor for weekends. (A justification for the weekend multiplicative factor is that the weekend births are moved from the weekend to the preceding week). We assume that
Here the log link function is used because λ(d) can only be positive; I(⋅) is the indicator function and s_{1}(d) and s_{2}(d) are smooth functions of d. The function s_{1}(d) models both longterm and seasonal variations in the birthdate distribution; s_{2}(d) models the gradual change in the reduced weekend births. Public holidays are similar to weekends; it only requires extra bookkeeping in the modelling process.
Following the standard practice in generalized additive models, the smoothing terms in Equation (2) are represented by regression splines [21]. That is, \phantom{\rule{0.25em}{0ex}}\widehat{\mathit{s}}\left(\mathit{x}\right)=\sum _{\mathit{i}=1}^{\mathit{q}}{\mathit{b}}_{\mathit{i}}\left(\mathit{x}\right){\mathit{\beta}}_{\mathit{i}}, where b_{ i }(x) are a set of basis functions. Common type of basis functions include Bsplines and more familiar cubic splines. These basis functions are sections of polynomials that join at a number of knot locations. To avoid manually selecting knots for the regression splines, thinplate splines can be used [22]. With n data points, a thinplate spline representation is as follows.
Here d is the dimension of the function domain (in our case d = 1), m is the order the smoothness penalty, ϕ_{ j } are \mathit{M}=\left(\begin{array}{c}\hfill \mathit{m}+\mathit{d}+1\hfill \\ \hfill \mathit{d}\hfill \end{array}\right) basis functions that spans the null space of a penalty energy function, and η_{ md } is a radialbasis function. δ = (δ_{1}, δ_{2}, …, δ_{ n })^{T} and α = (α_{1}, α_{2}, …, α_{ M })^{T} are parameters with the constraint \sum _{\mathit{i}=1}^{\mathit{n}}{\mathit{\delta}}_{\mathit{i}}{\mathit{\varphi}}_{\mathit{j}}\left({\mathit{x}}_{\mathit{i}}\right)=0 for each j. Further details of thinplate splines can be found in [23]. Just like other kernel methods, the computation of thinplate splines has a complexity of O(n^{3}). For ease of computation, a lower rank approximation of δ is used. Such reducedrank thinplate splines [23] are used in our application. The rank of thinplate splines controls the smoothness of the function, and is selected by generalized crossvalidation criterion (GCV) [24]. To capture seasonal variations in birthdate distribution, the maximum rank for the candidate splines should be set to a sufficiently large number. We recommend a maximum rank of 2 m, where m is the number of years covered by the birthdates.
Dasu and Johnson [18] generated a histogram similar to Figure 1 to help detect inadvertent censoring caused by a default date. Graphical features of the histogram such as spikes and Vshaped valleys were identified as indicators for missed or censored data. However, the presence and location of such graphical features are determined by visual inspection. In contrast, here with a GAM, the deviation of particular counts in the histogram can be quantified for probability based decisions.
Results
A public health application
We use a health administration database to demonstrate the fitting of GAM and the identification of outliers. A drug regulatory authority maintains a database of drug dispensing records, which covers around 40,000 individuals with birthdates ranging from the 1890s to 2010s. The data is integrated from separate data collections from more than 3000 offsite pharmacies. As each pharmacy may choose different software to manage their data, quality assurance at the data collection stage is limited. For example, one system replaces the null datetime value with a default date, which changes whenever the system is updated. This has resulted in a typical downstream database with multiple sources of contamination. Our task was to identify those birthdates that may have been incorrectly introduced by the faulty data management software. As the data were transcribed from doctor’s prescriptions filled at offsite (community) pharmacies, they were geographically and functionally independent of the data custodians. These handwritten paper scripts were archived after data entry. Geographically locating, accessing and then physically combing through these handwritten documents is theoretically possible (with generous resources), but was impractical for the purposes of this study.
Figure 1 shows the number N(d) of individuals for each birthdate d. For evaluation purposes, most incorrect birthdates have already been manually identified by a domain expert. There were 58 errors in total. The authors have no knowledge of the criteria used by the domain expert to identify the errors and how complete and accurate they are. Nevertheless, Figure 1 shows at least three potential sources of database errors. First on the far left, errors 18991230 and 19000101 were likely to have been introduced by incorrect handling of the zero datetime in software systems. The error 19110101 may also have resulted from the confusion of year 1911 and year 2011. Second in the middle, errors 19500101, 19600101, 19630214, 19700101, 19790101, 19800101 may result from incorrect selfreported birthdates. Finally on the right of the figure, a group of incorrect birth dates were introduced by the pharmacy management system that mixed birthdates with drug dispensing dates.
Outliers detected through generalized additive model
We fit an additive model in the form of Equation (2) to the birthdate counts in Figure 1. The gam function from the mgcv Rpackage [23] was used to estimate the intensity function λ(d). The smoothness was determined through GCV with an upper bound of 200 for the degree of freedom. Figure 2 shows the estimated intensity function and the outliers above the pointwise 99.99th percentile.
The detection resulted in 8 false positives and 7 false negatives. The false positives can be classified into two categories: 5 in the middle between years 1957 to 1980; 3 on the far right between years 2006 and 2007. From visual inspection, the 5 false positives in the middle are indeed outliers. We conjecture that these birthdates were caused by multiple identities of some patients. When data from multiple pharmacies were linked, patient identities were consolidated through names and addresses. Mismatch can happen during the consolidation process and the patient’s birthdate would be overrepresented. The 3 false positives on the right are also curious. They seem to suggest that there are dozens of very young (~6 years old) patients on controlled drugs. This warrants further investigation.
The 7 false negatives include 2 in the middle (19790101 and 19800101), 4 on the right (all in year 2001), and 1 on the right (20110101). The false negative 20110101 has too few patients to justify a statistical decision. The 4 false negatives in year 2001 coincide with a large number of incorrect birthdates that have been correctly identified. We believe that after the correctly identified errors have been cleaned, these 4 false negatives will likely be found by rerunning the identification procedure.
Because the sequence < N(d) > forms a time series, nonparametric smoothing [25] as a common way to find outliers in time series can also be used. However, Figure 2 shows three advantages of our modelbased approach over nonparametric smoothing:

1)
Explicit background knowledge such as the weekend effect is simply modelled. The reduced weekend birth is reflected in the increased variation of the green line in the right part of Figure 2. If the smoothing was done with standard nonparametric methods, such background knowledge is not easy to be incorporated—it would be difficult to select a smoothing parameter (bandwidth) that works well with the changing variation in the signal. Nonparametric models are not good at handling the nonnormality of count data.

2)
Percentile estimation (the green line in Figure 2) can be easily extracted from the parametric Poisson model. Most nonparametric smoothing methods assume Gaussian distribution of residuals, which is not appropriate for percentile estimation of count data.

3)
Compared with direct smoothing, the modelbased approach is robust at the two ends of the birthdate range, where the data are sparse and N(d)s for most days equal zero. In nonparametric smoothing, consecutive zeros at the two ends will result in both zero signal estimate and zero variance estimate, which renders every nonzero N(d) an outlier. For example, in nonparametric smoothing based on the mean absolute deviation (MAD), the median of all neighbouring points of d is used to estimate the underlying signal and the median of all absolute deviation measures the variance [25]. For a sequence (0, 0, 0, 900, 0, 1, 0), the median is 0 and the median of all absolute deviation is also 0. Hence both 900 and 1 will be estimated to be outliers. In contrast, with a Poisson distribution, the probability P(k ≥ 1λ(d)) can still be large with even a low intensity λ. This will greatly minimize the likelihood of false positives at the two ends of the age spectrum.
The estimated Poisson intensity function λ(d) (the blue line in Figure 2) reflects both the general age profile of the patient population and shortterm variations in demographics. Figure 3 shows the segment of the curve for weekdays between year 1925 and 1970. The curve suggests a seasonal pattern of child birth. The dip around 1933–1934 suggests the effect of the ‘Great Depression’ on child birth in Australia. The elevated counts at the centre echo the baby boom in Australia following the second world war, which Salt defined to be the period between 1946 and 1961 [12]. The latter two features of the estimated function λ(d) are consistent with the official statistics [26].
The effect of reduced weekend birth (term s_{2}(d) in Equation (2)) is shown in Figure 4. It is consistent with the trend that weekend births have been significantly reduced due to elective Caesarean section. It also suggests that elective Caesarean section has been gaining popularity in the past 50 years.
Comparison with outlier detection based on a standard timeseries model
As the sequence < N(d) > is also a time series, an autoregressive integrated moving average (ARIMA) model is a natural alternative to a generalized additive model. An ARIMA(p, n, q) model contains p autoregressive terms, n nonseasonal differences, and q moving average terms. It has the following mathematical expression: \left(1{\sum}_{\mathit{i}=1}^{\mathit{p}}{\mathit{\varphi}}_{\mathit{i}}{\mathit{L}}^{\mathit{i}}\right){\left(1\mathit{L}\right)}^{\mathit{n}}\mathit{N}\left(\mathit{d}\right)=\left(1+{\sum}_{\mathit{i}=1}^{\mathit{q}}{\mathit{\theta}}_{\mathit{i}}{\mathit{L}}^{\mathit{i}}\right){\mathit{\epsilon}}_{\mathit{d}}, where L is the lag operator, ϕ_{ i } and θ^{i} are autoregressive and moving average parameters, respectively. When a time series is season with m points per season, a seasonal ARIMA (SARIMA) model can be used. In a SARIMA(p, n, q)(P, N, Q)_{ m } model, additional P autoregressive terms, N differences, and Q moving average terms are used to model the seasonality of the time series. More details of ARIMA models can be found in most time series text books (e.g., [27]).
Although ARIMA models may be more familiar to many people, they are not ideal for modelling the birthdate distribution. First, an ARIMA model assumes normality of data, which is not always appropriate as N(d) is a count. In particular, outlier detection based on normal percentiles may not be as accurate compared with Poisson percentiles. Next, it is not easy to incorporate other covariates in an ARIMA model and assess their effects on the counts. Most importantly, from a linear time series model like ARIMA, it is impossible to infer pointwise tail probabilities (see the green line in Figure 2). If an overall tail probability estimate is used, incorrect birthdates at the two ends of the age range will be missed (see the third advantage of the GAM described previously).
We fit a seasonal ARIMA (SARIMA) model on the sequence < N(d) > adjusting for the weekly cycle. It results in a highorder model ARIMA(0, 1, 5)x(0, 0, 2)_{ 7 }. By looking above the 99.99^{th} percentile in the residuals, we identified 69 outlier birthdates (See Figure 5). We compared these birthdates with the birthdates identified by the domain experts. A total of 36 birthdate errors were correctly matched, with 11 false negatives and 22 false positives, mostly in the period between year 1940 and 1980, in which many patients were born. Because no pointwise percentile can be inferred, this method tends to miss the birthdate errors near the left end of the birthdate range, where the data is sparse. For example, the date 19000101 was identified by the GAM model (see Figure 2), but was missed by the ARIMA model. A comparison of sensitivity and specificity of the two methods are shown in Table 1.
The GAMbased method showed better performance in terms of sensitivity and specificity. The area under the curve (AUC) for SARIMAbased outlier detection was 0.991; AUC for GAMbased outlier detection was 0.998 (pvalue 0.28).
Age profile of young oxycodone user
Finally, we show how identification of incorrect birthdates can prevent serious misinterpretation of data.
Oxycodone is an opioid analgesic used for pain management. A patient using oxycodone for an extended period is likely to develop dependence to the medication. Therefore in many places, including the state of Queensland in Australia, an oxycodone treatment episode longer than two months requires a report to the regulatory authority. Here we use the drug dispensing database to understand the age distribution of longterm users of oxycodone.
The age distribution for younger patients is shown in Figure 6^{a}. The figure shows unusual clusters of patients of age 5 or 10. These clusters are worrying as the main reason for Oxycodone prescription is chronic noncancer pain, which is rare among children.
We applied the GAMbased identification method to find birthdate errors. The 99.99th percentile was used as the cutoff. The records with identified birthdate errors were then removed from the analysis. The corrected age distribution based on the cleaned subset is highlighted in red. Although the new distribution still shows several young patients whose presence in the database is worth further investigation, it is more consistent with the common understanding that young people rarely have chronic pain condition that warrants large quantities of opioids.
Discussion
In current dataquality practice, data artefacts are typically identified by simple crosstabulation or visual inspection [6]. However, visual inspection is only feasible if the database contains only a limited number of distinct values. Visual inspection also relies on arbitrary cutoffs and it is difficult to bound the false negative rate. The GAM model effectively addressed these two problems.
As the importance of data quality becomes more widely recognized, the proposed method provides a new tool for data quality improvement. The method can be implemented to run automatically in most commercial databases and the results can be easily incorporated into regular reports on data. For many downstream databases, data quality reports allows one to identify quality problems at the earliest time possible and contain contamination of the upstream data.
Our method could be further improved with external data, in particular from wellcurated sources. Population wide birth and death statistics are often available from government organisations. For example, the Australian Bureau of Statistics provides the numbers of births by year and month. Potentially such data can augment a model solely based on the counts in a database. A way to use the population wide statistics is to devise a Bayesian prior distribution based on the external data.
In extreme situations, an incorrect birthdate may induce a very large peak in the birthdate sequence, similar to the examples in the top right of Figure 1. These outliers may bias the estimation of the intensity curve itself, reducing the model’s power to detect other less extreme outliers. This is often known as the masking effect in the outlier detection literature [28]. In cases of extreme outliers, one option is to refit the GAM model with the identified outliers removed. Alternatively, a robust GAM fitting procedure [29] can be used. As robust GAM fitting requires intense computation, further work is needed so that it can be applied in a large database with a wide age range.
In view of the limitations discussed above, we see two lines of future research that will generate immediate benefits to a broad range of applications. First, when an external data source is available, it would benefit from concrete techniques for using the external data in birthdate distribution modeling. External data may provide two types of anxillary information: the general shapes of the age distribution and the degree of variance from one day to another. A technique to incorporate such information would generate better models. Second, when a robust smoothing procedure is required, it would benefit from efficient algorithmic implementation. As multicore and cluster systems are becoming more common, parallel algorithms that tailor towards such computing facilities may be desirable.
Conclusions
Birthdates are the most commonly collected domain of health information. Their accuracy is critical to effective health service delivery. Our results demonstrate that a GAM model achieves efficacy and flexibility in detecting incorrect birthdates caused by systematic contamination. The GAM model described in this paper provides a solution that works across a diverse range of health databases.
Endnote
^{a}To tell whether a patient use oxycodone for an extended period of time, a simple way is to calculate the total quantity of the medication consumed in a year. World Health Organization (WHO) has guidelines of the defined daily dose (DDD [24]) for an individual: The DDD for oxycodone is 75 mg per day. Hence if the annual consumption of a patient exceeds the DDD equivalent quantity of two months (4500 mg), then we assume that the patient is a longterm oxycodone user.
References
Luo W, Gallagher M, O'Kane D, Connor J, Dooris M, Roberts C, Mortimer L, Wiles J: Visualising a statewide patient data collection: a case study to expand the audience for healthcare data. The Fourth Australasian Workshop on Health Informatics and Knowledge Management. Edited by: Maeder A, Hansen D. 2010, Darlinghurst, Australia: Australian Computer Society, Inc, 4552.
Harvei S, Tretli S, Langmark F: Quality of prostate cancer data in the cancer registry of Norway. Eur J Cancer. 1996, 32 (1): 104110. 10.1016/09598049(95)005013.
Just B, Proffitt K: Do you know who's who in your EHR?. Healthc Financ Manage. 2009, 63 (8): 68
Goldman M, RémyPrince S, Trepanier A, Decary F: Autologous donation error rates in Canada. Transfusion. 1997, 37 (5): 523527. 10.1046/j.15372995.1997.37597293885.x.
Van den Broeck J, Cunningham SA, Eeckels R, Herbst K: Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med. 2005, 2 (10): e26710.1371/journal.pmed.0020267.
Arts DGT, De Keizer NF, Scheffer GJ: Defining and improving data quality in medical registries: a literature review, case study, and generic framework. J Am Med Inform Assoc. 2002, 9 (6): 600611. 10.1197/jamia.M1087.
Barateiro J, Galhardas H: A survey of data quality tools. DatenbankSpektrum. 2005, 14 (15–21): 48
Raman V, Hellerstein JM: Potter's wheel: An interactive data cleaning system. VLDB. 2001, San Francisco: Morgan Kaufmann Publishers Inc., 381390.
Rahm E, Do HH: Data cleaning: Problems and current approaches. IEEE Data Eng Bull. 2000, 23 (4): 313.
Cohen JE: Human population: the next half century. Science. 2003, 302 (5648): 11721175. 10.1126/science.1088665.
Easterlin RA: The American Baby Boom In Historical Perspective. Population, Labor Force, and Long Swings in Economic Growth: The American Experience. Edited by: Easterlin RA. 1968, Cambridge: UMI
Salt B: The big shift. 2004, Victoria: Hardie Grant Publishing
Hauser PM: Aging and worldwide population change. 1976, London: Handbook of aging and the social sciences, 5886.
Ensor T, Cooper S, Davidson L, Fitzmaurice A, Graham W: The impact of economic recession on maternal and infant mortality: lessons from history. BMC Public Health. 2010, 10 (1): 72710.1186/1471245810727.
Lam DA, Miron JA: Seasonality of births in human populations. Biodemography Soc Biol. 1991, 38 (1–2): 5178.
Berresford GC: The uniformity assumption in the birthday problem. Math Mag. 1980, 53 (5): 286288. 10.2307/2689391.
Lerchl A, Reinhard SC: Where are the Sunday babies? II. Declining weekend birth rates in Switzerland. Naturwissenschaften. 2008, 95 (2): 161164. 10.1007/s0011400703054.
Dasu T, Johnson T: Exploratory data mining and data cleaning. 2003, John Wiley & Sons, 442
Herzog TN, Scheuren F, Winkler WE: Data quality and record linkage techniques. 2007, New York: Springer Verlag
Gans JS, Leigh A: Born on the first of July: An (un) natural experiment in birth timing. Journal Public Econ. 2009, 93 (1–2): 246263.
Wahba G: Spline models for observational data. 1990, Philadelphia: Society for Industrial Mathematics
Duchon J: Splines minimizing rotationinvariant seminorms in Sobolev spaces. Constructive theory of functions of several variables. 1977, Berlin – Heidelberg: Springer, 85100.
Wood SN: Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J Royal Stat Society: Series B (Statistical Methodology). 2011, 73 (1): 336. 10.1111/j.14679868.2010.00749.x.
Wood SN: Generalized additive models: an introduction with R. 2006, Boca Raton: CRC Press
Gather U, Fried R, Lanius V: Robust Detail‒Preserving Signal Extraction. 2006, Weinheim: Wiley Online Library
Australian Bureau of Statistics: Family Formation: Trends in fertility. 1996, http://www.abs.gov.au/ausstats/abs@.nsf/2f762f95845417aeca25706c00834efa/711797fd722c35c7ca2570ec0073d3b4!OpenDocument,
Chatfield C: The analysis of time series: an introduction. 2003, CRC press
Andrews DF, Pregibon D: Finding the outliers that matter. J Royal Stat Society Series B (Methodological). 1978, 40 (1): 8593.
Alimadad A, SalibianBarrera M: An OutlierRobust Fit for Generalized Additive Models With Applications to Disease Outbreak Detection. J Am Stat Assoc. 2011, 106 (494): 719731. 10.1198/jasa.2011.tm09654.
Acknowledgements
The staff of the Drugs of Dependence Unit at Queensland Health provided data and technical assistance. This work was partially funded by a University of Queensland internal grant and a Linkage grant from the Australian Research Council. JPC is supported by a National Health and Medical Research Council of Australia Career Development Fellowship (APP1031909).
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
WL and JW conceived of the study. WL carried out the experiment and analyzed the results. All authors participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
About this article
Cite this article
Luo, W., Gallagher, M., Loveday, B. et al. Detecting contaminated birthdates using generalized additive models. BMC Bioinformatics 15, 185 (2014). https://doi.org/10.1186/1471210515185
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471210515185