A first-order model
Figures 2 and 3 show histograms of the number of trimmed TRAV, TRAJ, TRGV and TRGJ nucleotides obtained from 212 TRAV-TRAJ and 220 TRGV-TRGJ junction sequences that were analysed by IMGT/V-QUEST+JCTA and whose results were agreed upon by experts.
As potentially more nucleotides are trimmed in the "true process" than appear to have been trimmed according to the tool "output," we would like to transform the "output" data into "true process" data.
Another factor to take into consideration is the quantity of data at zero (except for TRGJ), which does not match the relatively smooth form of the tool "output" data distributions (see Figures 2 and 3). This may be evidence of a two-step process: either the trimming process is activated, or not. If activated, it follows some as yet unknown law; if not, no trimming occurs. Obviously, if the unknown law can itself take the value zero, the fraction of data at zero has two sources (either the first process is not activated, or it is activated and the second process gives the value zero). Thankfully, as will be shown under the following first-order model, probabilistically transforming the "output" distribution towards the "true process" distribution (under the hypotheses of the model) does not cause further complications. Indeed, the transformed masses (i.e., fractions of the total number of data found at each possible data value) above zero do not depend on the original fraction at zero. This means that performing maximum likelihood estimation of the parameters of a two-step process is well-defined on the transformed data.
Recovering an estimation of the true process probability distribution
Here we introduce a mathematical result that allows us to recover an estimation of the true process probability distribution of the number of trimmed V nucleotides. This result is almost (but not entirely) valid for the true process probability distribution of the number of trimmed J nucleotides. The potential problem is that IMGT/V-QUEST+JCTA selects the J gene after the V gene (see Methods and [25] for more details), thus there is a non-zero chance that 5'J-REGION nucleotides will accidentally be included in the V gene prediction when there has been no N region nucleotide addition. After reanalyzing the data, we found that in the TRAV-TRAJ dataset, this happened at most 3 times and thus was rare enough to be ignored. However, for the TRGV-TRGJ data, this potentially happened quite often, so estimated probability distribution results for the TRGJ trimming process must be used with caution.
Let ℙ {B = k} mean 'the probability that k 3'V-REGION nucleotides are trimmed under the (unknown) true trimming process distribution f_B'. We want to estimate this for k ≥ 0. Let ℙ {F = i} mean 'the probability that i nucleotides appear to have been trimmed.' That is, the random variable F represents the 3'V-REGION trimming distribution of the tool "output." We do not know the distribution f_F of F exactly, but through our datasets we have an empirical estimate of it.
The goal is to use this empirical estimate of f_F to estimate f_B. To begin, Theorem 1 [see Additional file 1] shows that under some simple hypotheses (the 'first-order' model), there is an explicit link between the law of the observed 3'V-REGION tool "output" trimming distribution and the "true" (or more correctly, "bias-corrected": technically, it is "true" only if the hypotheses of the first-order model hold in general) process distribution. Indeed, for any k ≥ 1 we find:

ℙ {B = k} = (4/3) ℙ {F = k} - (1/3) ℙ {F = k + 1},

and for k = 0 we find:

ℙ {B = 0} = ℙ {F = 0} - (1/3) ℙ {F = 1}.
We call this the (4/3, 1/3) rule. Supposing the first-order hypotheses are correct, we would have for example that the bias-corrected probability that 5 V nucleotides were trimmed is equal to (4/3) the probability the tool "output" gives 5 trimmed nucleotides minus (1/3) the probability it gives 6 trimmed nucleotides. We see indeed that under these hypotheses, transformed fractions of data at each data value above zero do not depend on the original fraction of data at zero.
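The (4/3, 1/3) rule can be applied directly to an empirical "output" histogram. A minimal sketch in Python (the function name transform_first_order and the example probabilities are ours, purely illustrative; the mass at zero is taken as the remainder required for normalisation, consistent with the fact that the masses above zero do not depend on it):

```python
def transform_first_order(p_F):
    """Apply the (4/3, 1/3) rule to an empirical tool-"output" pmf p_F,
    where p_F[i] estimates the probability that i nucleotides appear
    to have been trimmed; returns the bias-corrected pmf p_B."""
    n = len(p_F)
    p_B = [0.0] * n
    for k in range(1, n):
        nxt = p_F[k + 1] if k + 1 < n else 0.0  # mass beyond the support is zero
        p_B[k] = (4.0 / 3.0) * p_F[k] - (1.0 / 3.0) * nxt
    # The mass at zero is fixed by normalisation: it is the only entry
    # that depends on the original fraction of data at zero.
    p_B[0] = 1.0 - sum(p_B[1:])
    return p_B

# Illustrative pmf (not data from the paper):
p_F = [0.3, 0.2, 0.2, 0.15, 0.1, 0.05]
p_B = transform_first_order(p_F)
```

Note that the transformed masses above zero use only neighbouring "output" masses, as the rule prescribes.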
We remark that it is unlikely that the probabilities of appearance of the A, C, G and T nucleotides are exactly equal (= 1/4, as assumed in the first-order model), whether in the N region, the 3'V-REGION or the 5'J-REGION. A second-order model, giving much more freedom to the possible A, C, G and T frequencies (each frequency taking some value between 1/6 and 1/3), can be found in Supplementary Data [see Additional file 1]. In brief, we find that the first-order model approximates the more general second-order model well. Thus, for simplicity, the first-order result can be used in place of the second-order result to form hypotheses on trimming processes.
Testing the transformed V and J trimming distributions
Under the hypotheses of the first-order model, we transformed the TRA and TRG tool "output" data following the law f_F into probability distributions following the law f_B.
Remarking that, apart from the mass at zero, these transformed results often resembled Poisson laws, we attempted to formally test this. More precisely, we supposed that we were dealing with a Bernoulli process (with parameter p unknown) followed by a Poisson process (parameter λ unknown) if the Bernoulli process gave a success. This meant a density function of:

f(0) = (1 - p) + p exp(-λ), and f(k) = p exp(-λ) λ^k / k! for k ≥ 1.
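This two-step density can be written out and sanity-checked directly. A minimal sketch (the function name bp_pmf is ours; the mass at zero collects both a Bernoulli failure and a Bernoulli success followed by a Poisson zero):

```python
import math

def bp_pmf(k, p, lam):
    """Pmf of the two-step Bernoulli-Poisson model: with probability
    1 - p the trimming process is not activated (so k = 0); with
    probability p the number trimmed follows a Poisson(lam) law."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return (1.0 - p) * (k == 0) + p * poisson
```

Summing bp_pmf over k confirms the masses form a probability distribution for any 0 ≤ p ≤ 1 and λ > 0.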
Maximum likelihood was then performed in order to simultaneously estimate the parameters p and λ, this being necessary to subsequently test the hypothesis that we are dealing with a two-step Bernoulli-Poisson process having parameters p and λ.
Given data x_1, x_2, ..., x_n, it is easy to show that maximum likelihood estimation gives the equations g(λ) = (1 - exp(-λ))C - mλ = 0 and p = m/(n(1 - exp(-λ))) to be solved, where m is the number of the x_i > 0 and C is the sum of the values of the x_i > 0. As m and C are thus constants given any dataset, solving g(λ) = 0 for λ then allows us to solve for p in the second equation. Upon performing the first-order transformation, we found (m, C) = (517/3, 708), (580/3, 3286/3), (152, 1682/3), (670/3, 4238/3) for the TRAV, TRAJ, TRGV and TRGJ datasets, respectively.
To see that g(λ) = 0 has a unique solution λ > 0 (and thus p also), we first remark that for each of these datasets, m, C > 0 with C > m, so that g'(λ) = C exp(-λ) - m satisfies limλ→0 g'(λ) = C - m > 0 and limλ→∞ g'(λ) = -m < 0, while g''(λ) = -C exp(-λ) < 0 for λ > 0. Since g'(λ) is continuous for λ > 0, by the intermediate value theorem there exists at least one λ* > 0 such that g'(λ*) = 0, and since g' is strictly decreasing, this λ* is unique. As g(0) = 0, g then strictly increases on (0, λ*) and strictly decreases to -∞ afterwards, so g has exactly one root λ > 0, which can easily be found numerically for each given m, C > 0. Indeed, we find (p, λ) = (0.83, 4.04), (0.92, 5.65), (0.71, 3.59), (1, 6.31) for the TRAV, TRAJ, TRGV and TRGJ datasets, respectively.
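Since g is positive just to the right of zero (when C > m) and negative for large λ, the root can be bracketed and found by bisection. A minimal stdlib-only sketch (the function name is ours), checked against the TRAV values (m, C) = (517/3, 708) with dataset size n = 212:

```python
import math

def solve_bernoulli_poisson_mle(m, C, n, lo=1e-9, hi=50.0):
    """Solve g(lam) = (1 - exp(-lam)) * C - m * lam = 0 for the unique
    positive root by bisection, then recover p = m / (n * (1 - exp(-lam)))."""
    def g(lam):
        return (1.0 - math.exp(-lam)) * C - m * lam
    a, b = lo, hi  # g(a) > 0 when C > m; g(b) < 0 for large enough b
    for _ in range(200):
        mid = 0.5 * (a + b)
        if g(a) * g(mid) <= 0:
            b = mid
        else:
            a = mid
    lam = 0.5 * (a + b)
    # Cap the estimate at 1 so that p remains a probability
    # (the uncapped value can slightly exceed 1, as for TRGJ).
    p = min(1.0, m / (n * (1.0 - math.exp(-lam))))
    return p, lam
```

With (m, C, n) = (670/3, 4238/3, 220), the same routine reproduces the TRGJ boundary estimate p = 1.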
Figure 4 shows the transformed distributions (blue) and the corresponding theoretical predictions (pink) for the Bernoulli-Poisson distribution f in each of the four cases. We tested the four empirical distributions against the theoretical Bernoulli-Poisson distribution f using Pearson's χ2 test. The null hypothesis is that the distribution follows f with parameters (p, λ). In order to keep within the assumptions of the test, the data were re-binned into n = 8, 10, 8 and 9 bins for the TRAV, TRAJ, TRGV and TRGJ trimming distributions, respectively. As shown in [28], since the parameters (p, λ) were initially estimated using maximum likelihood, the number of degrees of freedom lies somewhere between n - 1 - r and n - 1, where r is the number of parameters estimated using maximum likelihood; here, r = 2.
We found χ2 = 7.97, 11.93, 7.27 and 31.62 for the TRAV, TRAJ, TRGV and TRGJ trimming distributions, respectively. For TRAV, TRAJ and TRGV, we find that at all standard values of statistical significance (p = 0.05, 0.01, 0.005), the null hypothesis is not rejected, and thus it is plausible that the empirical results follow a Bernoulli-Poisson-type law. However, for TRGJ, the null hypothesis is rejected at all of the same values of statistical significance. Thus, as it stands, the Bernoulli-Poisson law hypothesis would seem unlikely for the TRGJ trimming process.
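The rejection decisions above can be reproduced from the reported χ2 statistics with a short p-value computation. A minimal stdlib-only sketch (the helper names are ours; the regularized lower incomplete gamma is computed by its standard series expansion, which suffices for the statistics and degrees of freedom involved here):

```python
import math

def lower_reg_gamma(a, x, tol=1e-12):
    """Regularized lower incomplete gamma P(a, x), via the series
    P(a, x) = x^a e^{-x} / Gamma(a) * sum_{j>=0} x^j / (a (a+1) ... (a+j))."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    denom = a
    while term >= tol * total:
        denom += 1.0
        term *= x / denom
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_sf(x, df):
    """Survival function (p-value) of the chi-squared distribution with df degrees of freedom."""
    return 1.0 - lower_reg_gamma(df / 2.0, x / 2.0)
```

For TRAV (χ2 = 7.97, 8 bins), the p-value exceeds 0.05 for both the conservative df = n - 1 - r = 5 and df = n - 1 = 7; for TRGJ (χ2 = 31.62, 9 bins), the p-value falls below 0.005 even at the most favourable df = 8.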