Distribution of birthdates in a database
For the purpose of this paper, we consider only databases that identify one or more groups of individuals—such as patients or clinicians. Many clinical or administrative databases fall into this category including electronic health records (EHR), emergency department information systems, databases of controlled drug prescriptions, and medical claims databases. With a clear identification of individuals in such a database, each date d defines a set of N(d) individuals born on that day.
In an ideal situation where the actual numbers of births and deaths for every day are available, for example in a region with a well-maintained birth and death registry and limited population migration, the actual number of births in the population for every date can be estimated. But with any given database, it is useful to distinguish 1) the general population P1 that includes everyone living in a region at a certain time period, 2) the “at risk” population P2 that includes everyone who in theory could be included in the database (e.g., males to a prostate cancer database), and 3) the group of people P3 who are actually in the database. Individuals in a database P3 can be regarded as a sample of the “at risk” population P2. Consequently the number N(d) can be regarded as realization of a birthdate distribution defined on P2.
In an ideal situation where the actual numbers of births and deaths for every day are available, for example in a region with a well-maintained birth and death registry and limited population migration, the actual number of births in the population for every date can be estimated. But with any given database, it is useful to distinguish Consider an event that an individual was born at time t and later included in P3. We assume that such an event follows a Poisson point process with a time-varying intensity function μ(t). In a Poisson point process with intensity function μ(t), for any time interval A, the number of events N
A
in A, follows a Poisson distribution , where . Hence for a date d, the number of individuals with the birthdate d follows a Poisson distribution defined by
(1)
Here k is the number of individuals born on day d and λ(d) = ∫t ∈ dμ(t)dt is the aggregated intensity for date d and equals the expected value of N(d).
In certain situations with larger data variance, a negative binomial distribution can be used instead. That requires one more over-dispersion parameter to fit. In this application, it is difficult to assess over-dispersion; hence we prefer the simpler Poisson model assumption.
Let μ
d
be the mean of μ(t) on day d and l be the length of a day. Then λ(d) = μ
d
⋅ l. If μ(t) changes slowly, sequence < λ(d) > can be regarded approximately as the result of sampling μ(t) daily and then multiplying the sample with the constant l. The sequence < λ(d) > is determined by the birthdate distribution of the general population P1 and the representation of P1 in the database.
The birthdate distribution of the general population P1 is determined on a large scale by the age profile of the population and on a small scale by seasonal and weekly fluctuations in child births. The age profile of the population (as in [10]) is the aggregated result of changes in birth and death rates. For example, the post-WWII baby boom has contributed to the aging population structure in US, Canada, and Australia [11, 12]. As such changes are often driven by long lasting demographic forces such as economic development, war, and progress of medical science [13, 14], they manifest as slow and smooth variation in the sequence < λ(d)>. In contrast, seasonal and weekly variations in child births (as in [15]) act on a shorter temporal scale. More recently, most child births occur in a hospital environment and scheduled Caesarean section or labour induction are more frequent. These factors lead to more births on weekdays than weekends, which produces cyclic dips in the sequence < λ(d) > [16, 17].
How the general population is represented in a patient database is mostly determined by the nature of the disease. For example, a database of prostate cancer patients should contain only males; a database of skin cancer patients probably contains more Caucasians than patients of other races. In terms of the birthdate distribution, one common consideration is that a disease may pose higher risk to certain age groups. For example, individuals born before year 1930 may be disproportionately represented in a database of chronic non-cancer pain. Like the population age profile, the nature of the disease acts on a larger temporal scale and it should not affect the smoothness of the sequence < λ(d)>. For example, a person born on March 2nd, 1980 and a person born on March 3rd, 1980 should have very similar chances of being included in a database, assuming all other conditions are equal.
Finally for most hospitals, a patient database grows out of paper records. As a hospital serves more patients, its database may cover a larger portion of the at-risk population P2. However, it is reasonable to assume that any such change in the size of the database will affect different age groups proportionally, and is independent of the shape of < λ(d) >.
To summarize, the birthdate distribution of a database can be modelled using a function λ(d) that is generally smooth but contains weekly dips due to weekend reduction in births.
Tell-tale indicators of birthdate contamination: discrepancy between the expected and observed counts
When birthdates in a database are systematically contaminated, for example when a missing value is consistently replaced with the zero date-time, incorrect birthdates may be repeatedly generated. In such cases, the frequency of an incorrect birthdate d will be higher than other dates in the database. That is, the observed number of patients with birthdate d, denoted N(d), will be larger than the expected number λ(d). Therefore identifying incorrect birthdates can be achieved through identifying the date d whose person count N(d) is well above the expected count λ(d). Of course this observation is not new. For example, the difference between N(d) and λ(d) is called within deviation by Dasu and Johnson [18]. To the authors’ knowledge, however, no previous attempt has been made to statistically model this discrepancy between N(d) and λ(d) for error detection.
The deviation of N(d) from the expected count λ(d) can be measured in terms of the tail probability of the Poisson distribution. This provides a way to rank outliers across all birthdates. The top N(d)s with the smallest tail probabilities can be returned to the data cleaning staff for further confirmation and investigation. Alternatively, a small positive number c can be used as a threshold: N(d) is then labelled as an outlier if
In many applications, only a small number of the most likely erroneous birthdates will be manually checked. If required, the expected number of birthdate errors in database can be estimated from the results of manually checking small samples of the database [19].
A generalized additive model for birthdate distribution
The expected birthdate distribution λ(d) can be estimated from the data < N(d) > by assuming smoothness of the function λ(d), with proper handling of the reduction in births on weekends. In the previous section, we have seen that explicit modelling of the mechanisms shaping the age profile would be very complex. Nevertheless, both long-term and seasonal variations in births/deaths can be recovered by smoothing of counts N(d) in a relatively short temporal window. The reduction in weekend births can be modelled by a multiplicative factor that only affects Saturdays, Sundays, and public holidays. As the reduction is a gradual development driven by increasing hospital births, elective Caesarean section, and labour induction, the multiplicative factor can also be modelled by a smooth function. Finally, it is worth noting that government policies to encourage fertility can also lead to more births on a particular day [20], but such events are very rare and should be explicitly modelled on a case-by-case basis.
Different techniques can be used to smooth the raw data < N(d) > to recover the function λ(d). But to explicitly model the day-of-week effect, we use an additive model in which λ(d) is modelled as the product of an overall smooth function and a multiplicative factor for weekends. (A justification for the weekend multiplicative factor is that the weekend births are moved from the weekend to the preceding week). We assume that
(2)
Here the log link function is used because λ(d) can only be positive; I(⋅) is the indicator function and s1(d) and s2(d) are smooth functions of d. The function s1(d) models both long-term and seasonal variations in the birthdate distribution; s2(d) models the gradual change in the reduced weekend births. Public holidays are similar to weekends; it only requires extra bookkeeping in the modelling process.
Following the standard practice in generalized additive models, the smoothing terms in Equation (2) are represented by regression splines [21]. That is, , where b
i
(x) are a set of basis functions. Common type of basis functions include B-splines and more familiar cubic splines. These basis functions are sections of polynomials that join at a number of knot locations. To avoid manually selecting knots for the regression splines, thin-plate splines can be used [22]. With n data points, a thin-plate spline representation is as follows.
Here d is the dimension of the function domain (in our case d = 1), m is the order the smoothness penalty, ϕ
j
are basis functions that spans the null space of a penalty energy function, and η
md
is a radial-basis function. δ = (δ1, δ2, …, δ
n
)T and α = (α1, α2, …, α
M
)T are parameters with the constraint for each j. Further details of thin-plate splines can be found in [23]. Just like other kernel methods, the computation of thin-plate splines has a complexity of O(n3). For ease of computation, a lower rank approximation of δ is used. Such reduced-rank thin-plate splines [23] are used in our application. The rank of thin-plate splines controls the smoothness of the function, and is selected by generalized cross-validation criterion (GCV) [24]. To capture seasonal variations in birthdate distribution, the maximum rank for the candidate splines should be set to a sufficiently large number. We recommend a maximum rank of 2 m, where m is the number of years covered by the birthdates.
Dasu and Johnson [18] generated a histogram similar to Figure 1 to help detect inadvertent censoring caused by a default date. Graphical features of the histogram such as spikes and V-shaped valleys were identified as indicators for missed or censored data. However, the presence and location of such graphical features are determined by visual inspection. In contrast, here with a GAM, the deviation of particular counts in the histogram can be quantified for probability based decisions.