Ranking genomic features using an information-theoretic measure of epigenetic discordance

Background: Establishment and maintenance of DNA methylation throughout the genome is an important epigenetic mechanism that regulates gene expression, and its disruption has been implicated in human diseases such as cancer. It is therefore crucial to know which genes, or other genomic features of interest, exhibit significant discordance in DNA methylation between two phenotypes. We have previously proposed an approach for ranking genes based on methylation discordance within their promoter regions, determined by centering a window of fixed size at their transcription start sites. However, that method cannot identify statistically significant genomic features, nor can it handle features of variable length or with missing data.

Results: We present a new approach for computing the statistical significance of methylation discordance within genomic features of interest in single and multiple test/reference studies. We base the proposed method on a well-articulated hypothesis testing problem that produces p- and q-values for each genomic feature, which we then use to identify and rank features based on the statistical significance of their epigenetic dysregulation. We employ the information-theoretic concept of mutual information to derive a novel test statistic, which we can evaluate by computing Jensen-Shannon distances between the probability distributions of methylation in a test and a reference sample. We design the proposed methodology to simultaneously handle biological, statistical, and technical variability in the data, as well as variable feature lengths and missing data, thus enabling its widespread use on any list of genomic features. This is accomplished by estimating, from reference data, the null distribution of the test statistic as a function of feature length using generalized additive regression models. Differential assessment, using normal/cancer data from healthy fetal tissue and pediatric high-grade glioma patients, illustrates the potential of our approach to greatly facilitate the exploratory phases of clinically and biologically relevant methylation studies.

Conclusions: The proposed approach provides the first computational tool for statistically testing and ranking genomic features of interest based on observed DNA methylation discordance in comparative studies that accounts, in a rigorous manner, for biological, statistical, and technical variability in methylation data, as well as for variability in feature length and for missing data.

Electronic supplementary material: The online version of this article (10.1186/s12859-019-2777-6) contains supplementary material, which is available to authorized users.


Test statistic as distance metric
We would like the test statistic $T(q_1, q_2)$, which we use to distinguish between two phenotypes $q_1$ and $q_2$ based on their methylation states within a genomic region of interest, to satisfy the following properties: (i) $T(q_1, q_2) \geq 0$, for every $q_1$ and $q_2$ (non-negativity), (ii) $T(q_1, q_2) > 0$ if and only if $q_1 \neq q_2$ (positive definiteness), and (iii) $T(q_1, q_2) = T(q_2, q_1)$, for every $q_1$ and $q_2$ (symmetry). Non-negativity can always be satisfied by subtracting from a test statistic its minimum value. Positive definiteness is required to make sure that the test statistic takes its minimum value only when the two phenotypes are the same. Symmetry ensures that the test statistic is the same irrespective of the order in which we compare two phenotypes.
In addition, we would like the test statistic to satisfy the following property: (iv) $T(q_1, q_2) + T(q_1, q_3) \geq T(q_2, q_3)$, for every $q_1$, $q_2$, and $q_3$ (triangle inequality). To see why, let us assume that DNA methylation within a genomic region does not carry, on the average, sufficient information for distinguishing a phenotype $q_1$ from a phenotype $q_2$ as well as from a phenotype $q_3$. In this case, we also expect DNA methylation within the genomic region not to carry, on the average, sufficient information for distinguishing $q_2$ from $q_3$. Specifically, we expect that $T(q_1, q_2) \approx 0$ and $T(q_1, q_3) \approx 0$ implies $T(q_2, q_3) \approx 0$, which is always true when $T$ satisfies the triangle inequality.
Although the test statistic $\frac{1}{K} \sum_{k=1}^{K} [\mathrm{JSD}(k)]^2$ we discussed in the Main paper satisfies properties (i)-(iii), it does not satisfy property (iv). However, the test statistic given by Eq. (3) of the Main paper is a distance metric and, therefore, satisfies all four properties. This is a consequence of the fact that $T$ is the Euclidean norm of the vector $(1/\sqrt{K})[\mathrm{JSD}(1), \mathrm{JSD}(2), \ldots, \mathrm{JSD}(K)]$ of JSDs within GUs with data that overlap the genomic region, the fact that the JSD is a distance metric [2], and of the following lemma.
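Before turning to the lemma, the contrast between the two statistics can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes that the JSD is the square root of the base-2 Jensen-Shannon divergence, and it uses three toy phenotypes (hypothetical data) with K = 2 GUs, each described by a probability distribution over two methylation levels.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Kullback-Leibler divergence in bits; terms with u_i = 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against negative round-off


def mean_squared_jsd(sample_a, sample_b):
    """(1/K) * sum_k JSD(k)^2 over the K genomic units of two samples."""
    jsds = [js_distance(p, q) for p, q in zip(sample_a, sample_b)]
    return sum(d * d for d in jsds) / len(jsds)


def t_statistic(sample_a, sample_b):
    """Euclidean norm of (1/sqrt(K)) * [JSD(1), ..., JSD(K)]."""
    return math.sqrt(mean_squared_jsd(sample_a, sample_b))


# Toy phenotypes (hypothetical): one distribution per GU, K = 2 GUs each.
q1 = [[0.5, 0.5], [0.5, 0.5]]
q2 = [[1.0, 0.0], [1.0, 0.0]]
q3 = [[0.0, 1.0], [0.0, 1.0]]

# Property (iv): stat(q1, q2) + stat(q1, q3) >= stat(q2, q3).
sq_lhs = mean_squared_jsd(q1, q2) + mean_squared_jsd(q1, q3)
sq_rhs = mean_squared_jsd(q2, q3)
t_lhs = t_statistic(q1, q2) + t_statistic(q1, q3)
t_rhs = t_statistic(q2, q3)

print("mean squared JSD satisfies (iv):", sq_lhs >= sq_rhs)
print("T satisfies (iv):", t_lhs >= t_rhs)
```

In this example the mean squared JSD violates property (iv), whereas its square root, T, satisfies it, in line with the statement above.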
Lemma. Suppose that $d_n$, $n = 1, 2, \ldots, N$, are distance metrics and let $d(x, y) = [d_1(x_1, y_1), d_2(x_2, y_2), \ldots, d_N(x_N, y_N)]$, where $x$ and $y$ are vectors with components $x_n$, $y_n$, $n = 1, 2, \ldots, N$. Moreover, let $\|\cdot\|$ be an absolute norm (i.e., a norm that is invariant to taking the moduli $|\cdot|$ of the components of its argument, which includes the Euclidean norm). Then $\|d(x, y)\|$ is a distance metric.
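The claim of the lemma can be spot-checked numerically. The sketch below is only an illustration under stated assumptions, not part of the proof: it takes N = 3, picks three well-known metrics on the real line as hypothetical choices for the $d_n$, and counts triangle-inequality violations of $\|d(x, y)\|$ over random points for three absolute norms.

```python
import itertools
import math
import random


# Three metrics on the real line (illustrative choices for d_1, d_2, d_3).
def abs_diff(a, b):
    return abs(a - b)


def bounded(a, b):
    return abs(a - b) / (1.0 + abs(a - b))


def root(a, b):
    return math.sqrt(abs(a - b))


METRICS = [abs_diff, bounded, root]

# Absolute norms: each depends only on the moduli of the components.
NORMS = {
    "l1": lambda v: sum(abs(c) for c in v),
    "l2": lambda v: math.sqrt(sum(c * c for c in v)),
    "linf": lambda v: max(abs(c) for c in v),
}


def d_vector(x, y):
    """d(x, y) = [d_1(x_1, y_1), ..., d_N(x_N, y_N)]."""
    return [d(a, b) for d, a, b in zip(METRICS, x, y)]


def count_violations(norm, points, tol=1e-12):
    """Count ordered triples with ||d(x, z)|| > ||d(x, y)|| + ||d(y, z)||."""
    count = 0
    for x, y, z in itertools.permutations(points, 3):
        lhs = norm(d_vector(x, z))
        rhs = norm(d_vector(x, y)) + norm(d_vector(y, z))
        if lhs > rhs + tol:  # tolerance absorbs floating-point round-off
            count += 1
    return count


random.seed(0)
points = [[random.uniform(-5.0, 5.0) for _ in METRICS] for _ in range(12)]
violation_counts = {name: count_violations(norm, points) for name, norm in NORMS.items()}
print(violation_counts)
```

A count of zero for every norm is consistent with the lemma; a random check of this kind cannot replace the proof that follows.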
Proof. We must show that $\|d(\cdot, \cdot)\|$ is non-negative, positive definite, symmetric, and satisfies the triangle inequality. Since a norm is always non-negative, $\|d(x, y)\| \geq 0$, for every $x$, $y$, which proves non-negativity. Since a norm is also positive definite, we have that $\|d(x, y)\| > 0$ if and only if $d(x, y) \neq 0$, whereas, from the non-negativity of $d(x, y)$ and the positive definiteness of the $d_n$ metrics, we have that $d(x, y) = 0$ if and only if $x_n = y_n$, for $n = 1, 2, \ldots, N$, which proves positive definiteness. The symmetry property $\|d(x, y)\| = \|d(y, x)\|$ follows from the symmetry $d_n(x_n, y_n) = d_n(y_n, x_n)$, $n = 1, 2, \ldots, N$, of the $d_n$ metrics. This leaves us to show the triangle inequality.
By using the fact that a norm satisfies the triangle inequality, we have that

$$\|d(x, y) + d(y, z)\| \leq \|d(x, y)\| + \|d(y, z)\| \tag{S1}$$

for any $x$, $y$, $z$. Moreover, any absolute norm is monotonic (see Theorem 2 in [1]). This means that $\|a\| \leq \|b\|$ for two vectors $a$ and $b$ such that $|a_n| \leq |b_n|$, for $n = 1, 2, \ldots, N$. Note now that the distance metrics $d_n$ satisfy the triangle inequality, in which case $|d_n(x_n, y_n) + d_n(y_n, z_n)| \geq |d_n(x_n, z_n)|$, for $n = 1, 2, \ldots, N$, which, together with the monotonicity property of $\|\cdot\|$, implies

$$\|d(x, z)\| \leq \|d(x, y) + d(y, z)\|.$$

This result, together with Eq. (S1), shows that $\|d(\cdot, \cdot)\|$ satisfies the triangle inequality. ♠

Note finally that $0 \leq T \leq 1$, since the JSD is always a number between 0 and 1 [2].

Figure S1. Boxplots of genome-wide distributions of JSD, MML, and NME values in the normal fetal brain, H3.3-WT, and K27M mutant samples considered in this paper. The JSD values show small methylation discordances associated with the fetal brain samples (which are due to biological, statistical, and technical variability), thus confirming their appropriateness as normal controls. Moreover, the JSD demonstrates a global increase in methylation discordance within the tumor samples, accompanied by global hypomethylation (MML) and gain in methylation entropy (NME) in most samples.

Figure S2. α-centile curves, calculated for different values of α from the estimated logitSST-based null PDF $f_0(t; s)$ within promoter regions, drawn over a scatter plot of 104,694 observed pairs $(t_k, s_k)$ of null T statistic values $t_k$ and promoter region sizes $s_k$. The percentage α of empirically observed data points that fall below a centile curve agrees well with the corresponding α value, indicating that $f_0(t; s)$ is consistent with the data.

Figure S4.

Figure S5. α-centile curves, calculated for different values of α from the estimated logitSST-based null PDF $f_0(t; s)$ within bivalent domains, drawn over a scatter plot of 7,446 observed pairs $(t_k, s_k)$ of null T statistic values $t_k$ and bivalent domain sizes $s_k$. The percentage α of empirically observed data points that fall below a centile curve agrees well with the corresponding α value, indicating that $f_0(t; s)$ is consistent with the data. The Q-Q plot (green marks) of the logitSST-estimated quantile residuals against the corresponding true quantile residuals is very close to the diagonal (red) line, suggesting a close agreement of the logitSST-estimated null PDF of the T statistic to its true distribution.