The CC-PROMISE method may be used to integrate any two forms of quantitative high-dimensional molecular data with multiple endpoints of diverse data types (quantitative, qualitative, censored time-to-event, etc.). Here, we present the method in terms of integrating methylation and RNA expression data as a concrete example. The same approach applies to other pairs of data types, such as miRNA and mRNA expression data.
Setting and notation for data
Suppose methylation and gene expression data have been collected for each of i=1,…,n subjects. Let g=1,…,G index the genes for which methylation and expression data are available. For each gene g, let \(l_{g}=1,\ldots,L_{g}\) index the loci of markers for which methylation data are collected. Note that the subscript g of \(l_{g}\) and \(L_{g}\) is clear by context. Thus, the subscript g will be omitted from \(l_{g}\) and \(L_{g}\) for simplicity of notation. Let \(m_{gli}\) represent the methylation of locus l of gene g for subject i. Also, let \(f_{g}=1,\ldots,F_{g}\) index the features of gene g for which expression data are available. The subscript g of \(f_{g}\) and \(F_{g}\) will be omitted for simplicity of notation. Let \(x_{gfi}\) represent the expression of feature f of gene g for subject i. Also, suppose that we have collected data on endpoints k=1,…,K for each subject. Let \(y_{ki}\) represent the value of endpoint k for subject i. A glossary of the mathematical notation is available in Additional file 1.
Associate each methylation marker with expression feature
For each gene g, it is often interesting to explore the association of each methylation marker with each expression feature. For each gene g, let \(r_{gfl}\) represent the observed sample correlation and \(\rho_{gfl}\) represent the true population correlation of the expression \(x_{gf1},x_{gf2},\ldots,x_{gfn}\) of feature f with the methylation \(m_{gl1},m_{gl2},\ldots,m_{gln}\) of locus l. Also, let \(p_{gfl}\) be the p-value testing the null hypothesis \(H_{0}:\rho_{gfl}=0\) that the true correlation \(\rho_{gfl}\) is zero.
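As a minimal sketch of this step, the following Python code computes the sample correlation for one locus-feature pair. It assumes Pearson's correlation (the text does not specify which correlation is used) and hypothetical data; the function names `pearson_r` and `correlation_t_statistic` are introduced here for illustration only.

```python
import math

def pearson_r(x, m):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(m) or n < 3:
        raise ValueError("need two sequences of equal length n >= 3")
    mean_x = sum(x) / n
    mean_m = sum(m) / n
    sxy = sum((a - mean_x) * (b - mean_m) for a, b in zip(x, m))
    sxx = sum((a - mean_x) ** 2 for a in x)
    smm = sum((b - mean_m) ** 2 for b in m)
    if sxx == 0 or smm == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * smm)

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0 (undefined when |r| = 1).

    Under H0 with bivariate normal data it is referred to a
    t distribution with n - 2 degrees of freedom.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical data for one gene g and n = 6 subjects:
# expression of feature f and methylation of locus l.
x_gf = [5.1, 4.8, 6.0, 5.5, 4.2, 6.3]
m_gl = [0.80, 0.85, 0.40, 0.55, 0.90, 0.35]

r_gfl = pearson_r(x_gf, m_gl)                        # strongly negative here
t_stat = correlation_t_statistic(r_gfl, len(x_gf))
```

The p-value \(p_{gfl}\) would then follow from the reference distribution of the t statistic, or from a permutation procedure if the distributional assumption is in doubt.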
Associate each endpoint with each expression feature
For each gene g, it is also interesting to explore the association of each expression feature with each endpoint. Thus, for each endpoint k and each expression feature f of gene g, compute a statistic \(a_{kgf}\) that measures the association of the expression \(x_{gf1},x_{gf2},\ldots,x_{gfn}\) with the endpoint \(y_{k1},y_{k2},\ldots,y_{kn}\). Well-established methods may be used to compute the association statistic. For example, assuming that the expression data are continuous quantitative values, one may use Spearman’s correlation to measure association of expression with a continuous quantitative endpoint, Kendall’s τ to measure association with an ordinal endpoint, ANOVA to measure association with a categorical endpoint, and Cox regression modeling to measure association with a censored time-to-event endpoint. We typically use rank-based statistics for endpoint associations due to their well-established robustness against outliers and other forms of noise in the data. We also use rank-based statistics in the example application below. Nevertheless, our framework allows for other methods to be utilized as appropriate for specific applications. The statistical significance (p-value) may be computed using those classical methods or via a permutation algorithm described in subsection “Compute permutation p-values”. It is important that the association statistics be represented on a common scale for many of the subsequent analyses described below.
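For a continuous endpoint, one rank-based choice named above is Spearman’s correlation, which already lies on a correlation-like scale from -1 to +1. The sketch below is a plain-Python illustration with hypothetical data; the helper names `average_ranks` and `spearman` are assumptions made for this example, not part of any published implementation.

```python
import math

def average_ranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_rx, mean_ry = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_rx) * (b - mean_ry) for a, b in zip(rx, ry))
    var_x = sum((a - mean_rx) ** 2 for a in rx)
    var_y = sum((b - mean_ry) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: expression of feature f of gene g and endpoint k.
x_gf = [5.1, 4.8, 6.0, 5.5, 4.2, 6.3]
y_k = [0.9, 1.1, 2.5, 1.8, 0.7, 2.9]

a_kgf = spearman(x_gf, y_k)  # association statistic on a -1..+1 scale
```

The same function could be applied to methylation values in place of expression values, which keeps both sets of association statistics on a common scale.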
Associate each endpoint with each methylation marker
For each gene g, the association of each endpoint with each methylation marker is performed in a very similar manner as described immediately above. For each endpoint k and each methylation marker l of gene g, compute a statistic \(a_{kgl}\) that measures the association of the methylation \(m_{gl1},m_{gl2},\ldots,m_{gln}\) with the endpoint \(y_{k1},y_{k2},\ldots,y_{kn}\). Again, classical methods may be used here and all association statistics should be represented on a similar scale.
Define the most interesting statistical evidence
Association statistics can be represented on a correlation-like scale such that values of -1, 0, and +1 respectively indicate a negative deterministic relationship, no association, and a positive deterministic relationship between two variables. On this scale, values of ±1 clearly indicate deterministic associations that are typically of greatest biological interest. Thus, the values ±1 may be considered the most interesting statistical evidence for any particular statistic that measures association on a correlation-like scale. Consequently, the result \(a_{kgf}=\pm 1\) on a correlation-like scale is the most interesting statistical evidence for the association of each endpoint k with the expression of feature f of gene g.
In many applications, biological and mathematical reasoning may be used to define the most interesting statistical evidence for the vector \(a_{gf}=\{a_{1gf},a_{2gf},\ldots,a_{Kgf}\}\) of statistics that measure the association of each endpoint k=1,…,K with the expression of feature f of gene g. As described above, \(a_{kgf}=\pm 1\) is the most interesting statistical evidence for each endpoint k. Therefore, the most interesting statistical evidence for \(a_{gf}\) must be the set of \(2^{K}\) vectors of length K with entries ±1. By symmetry, the constraint \(a_{1gf}=1\) is imposed to reduce consideration to a subset of \(2^{K-1}\) vectors. Now, suppose that prior knowledge about the endpoints indicates that only one of the remaining \(2^{K-1}\) vectors is biologically interesting or plausible. For example, in the application of subsection “Acute myeloid leukemia example”, all three endpoints measure sensitivity of leukemia cells to the chemotherapeutic agent cytarabine. Thus, the most interesting statistical evidence for that application is observing that expression of feature f of gene g has a deterministic positive (or negative) association with drug sensitivity. With these biological and mathematical considerations, we let \(\lambda=\{\lambda_{1},\lambda_{2},\ldots,\lambda_{K}\}\) represent the most interesting statistical evidence for the vector \(a_{gf}\) for each feature f of each gene g.
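The counting argument above can be checked with a short enumeration. This sketch lists the \(2^{K}\) candidate sign vectors and the \(2^{K-1}\) that remain once the first entry is fixed at +1; the function names and the choice K=3 are illustrative assumptions.

```python
from itertools import product

def sign_vectors(num_endpoints):
    """All vectors of length K with entries +1 or -1 (2**K of them)."""
    return [list(v) for v in product((1, -1), repeat=num_endpoints)]

def sign_vectors_first_positive(num_endpoints):
    """Candidates that satisfy the symmetry constraint: first entry is +1."""
    return [v for v in sign_vectors(num_endpoints) if v[0] == 1]

K = 3  # hypothetical number of endpoints
all_candidates = sign_vectors(K)                 # 2**K vectors
constrained = sign_vectors_first_positive(K)     # 2**(K-1) vectors

# With three endpoints that all measure drug sensitivity in the same
# direction, prior knowledge would single out one candidate:
lam = [1, 1, 1]
```

Prior biological knowledge, not the enumeration, determines which single vector serves as λ; the code only makes the size of the candidate set concrete.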
Analogous logic indicates that either +λ or −λ is the most interesting statistical evidence for the vector of statistics \(a_{gl}=\{a_{1gl},a_{2gl},\ldots,a_{Kgl}\}\) that measure association of methylation with the endpoints. Again, insisting on biological plausibility imposes a constraint on the sign of λ. In particular, the findings are plausible only if the methylation-expression association, methylation-endpoint associations, and expression-endpoint associations are concordant. Thus, \(\text{sign}(r_{gfl})\lambda\) is the most interesting statistical evidence for the vector \(a_{gl}\) that measures the association of each endpoint k with the methylation at each locus l of gene g.
Associate all endpoints with each expression feature
To explore the association of each expression feature f of each gene g with all endpoints, we compute the projection onto the most interesting evidence (PROMISE) statistic as
$$ t_{gf} = \sum\limits_{k=1}^{K} \lambda_{k} a_{kgf}. $$
(1)
The magnitude of the PROMISE statistic \(t_{gf}\) measures the evidence indicating that the associations with the individual endpoints align with predefined most interesting statistical evidence. The sign of the PROMISE statistic indicates the direction of the vector of the associations relative to that of the most interesting statistical evidence. Overall, the PROMISE statistic measures the discrepancy between the observed associations and the global null (all associations are zero) along the direction of the most interesting statistical evidence. The statistical significance of the PROMISE statistic is determined by computing a permutation p-value as described in subsection “Compute permutation p-values”.
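Equation (1) is a dot product of λ with the vector of endpoint association statistics. The following sketch illustrates it with hypothetical values; `promise_statistic` is a name chosen for this example.

```python
def promise_statistic(lam, assoc):
    """PROMISE statistic: sum over endpoints k of lambda_k * a_k."""
    if len(lam) != len(assoc):
        raise ValueError("lam and assoc must have one entry per endpoint")
    return sum(l * a for l, a in zip(lam, assoc))

# Hypothetical most interesting evidence for K = 3 endpoints.
lam = [1, 1, 1]

# Associations that all align with lam give a large statistic.
aligned = [0.6, 0.5, 0.7]
t_aligned = promise_statistic(lam, aligned)   # approximately 1.8

# Associations of similar size but mixed direction largely cancel.
mixed = [0.6, -0.5, 0.1]
t_mixed = promise_statistic(lam, mixed)       # approximately 0.2
```

The two hypothetical vectors show why the magnitude reflects alignment with λ and not just the size of the individual associations.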
Associate all endpoints with each methylation marker
Similarly, to explore the association of methylation marker l of each gene g with all endpoints, we compute the PROMISE statistic
$$ t_{gl} = \sum\limits_{k=1}^{K} \lambda_{k} a_{kgl} $$
(2)
with an analogous interpretation. Likewise, significance is determined by computing a permutation p-value as described in subsection “Compute permutation p-values”.
Associate all endpoints with each pair of one methylation marker and one expression feature
Next, to explore the association of all endpoints with each pair (l,f) of a methylation marker l and expression feature f of each gene g, we compute the combined PROMISE statistic
$$ t^{\star}_{glf} = t_{gf}+\text{sign}(r_{gfl})t_{gl}. $$
(3)
This statistic measures the discrepancy between the observed association statistics and the global null (all associations are zero) along the vector defining the most interesting statistical evidence. The sign measures direction in terms of the most interesting statistical evidence and the magnitude measures cumulative weight of the evidence against the global null. Statistical significance is determined by computing permutation p-values as described in subsection “Compute permutation p-values”. Here, we choose an additive formula to define (3) for simplicity of calculation and interpretation in terms of the rejection regions depicted in Fig. 1. Future research may find that other mathematical definitions of a combined statistic have better performance than the additive formula in some applications.
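The additive formula (3) can be sketched in a few lines. The values below are hypothetical, and the function names are chosen for illustration; the example shows how the sign of the methylation-expression correlation determines whether the two PROMISE statistics reinforce or offset each other.

```python
def sign(value):
    """Return +1, 0, or -1 according to the sign of value."""
    return (value > 0) - (value < 0)

def combined_promise(t_gf, t_gl, r_gfl):
    """Combined statistic: t_gf + sign(r_gfl) * t_gl."""
    return t_gf + sign(r_gfl) * t_gl

# Hypothetical PROMISE statistics for one feature and one locus.
t_gf = 1.8    # expression aligns positively with the endpoints
t_gl = -1.5   # methylation aligns negatively with the endpoints

# Concordant case: methylation and expression are negatively correlated,
# so the two lines of evidence add up (approximately 3.3).
t_concordant = combined_promise(t_gf, t_gl, r_gfl=-0.9)

# Discordant case: a positive correlation makes them offset (approximately 0.3).
t_discordant = combined_promise(t_gf, t_gl, r_gfl=0.9)
```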
Gene-level analyses
Subsections “Associate each methylation marker with expression feature” through “Associate all endpoints with each pair of one methylation marker and one expression feature” describe analyses performed at the level of individual expression features and individual methylation markers. To perform a gene-level analysis for each gene g, we first perform canonical correlation analysis (CCA) on the matrix \(M_{g}\) of the methylation at all loci \(l_{g}=1,\ldots,L_{g}\) and the matrix \(X_{g}\) of the expression of each feature \(f_{g}=1,\ldots,F_{g}\). CCA computes the canonical correlation coefficient \(\tilde {r}_{g}\) and formally tests the null hypothesis that the canonical correlation is zero. In this way, CCA performs a gene-level analysis that is analogous to the simple feature-level correlation analysis of subsection “Associate each methylation marker with expression feature”.
CCA also computes a summary score for expression and a summary score for methylation that may be used to perform gene-level analyses analogous to those described in subsections “Associate each endpoint with each expression feature” through “Associate all endpoints with each pair of one methylation marker and one expression feature”. Given the matrix \(X_{g}\) of expression values for each subject i=1,…,n and each expression feature \(f_{g}=1,\ldots,F_{g}\) and the matrix \(M_{g}\) of methylation values for each subject i=1,…,n and each methylation marker \(l_{g}=1,\ldots,L_{g}\), CCA determines the linear combinations of the columns of the matrices that are maximally correlated. As a result, CCA obtains the expression matrix linear combination value \(\tilde {x}_{gi}\) and the methylation matrix linear combination value \(\tilde {m}_{gi}\) for each subject i=1,…,n. These linear combination scores are variables that can be evaluated using the methods of subsections “Associate each endpoint with each expression feature” through “Associate all endpoints with each pair of one methylation marker and one expression feature”. In particular, these analyses can be performed by substituting the expression score values \(\tilde {x}_{gi}\) for the individual feature expression values \(x_{gfi}\) into the framework of subsections “Associate each endpoint with each expression feature” and “Associate all endpoints with each expression feature”, substituting the methylation score values \(\tilde {m}_{gi}\) for the individual marker methylation values \(m_{gli}\) into the framework of subsections “Associate each endpoint with each methylation marker” and “Associate all endpoints with each methylation marker”, and finally substituting the canonical correlation \(\tilde {r}_{g}\) for the simple correlation \(r_{gfl}\) into the framework of subsection “Associate all endpoints with each pair of one methylation marker and one expression feature”.
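For reference, the first canonical pair used above can be written out explicitly. The weight vectors \(\hat{u}_{g}\) and \(\hat{v}_{g}\) are notation introduced here for illustration and do not appear elsewhere in the text; the display assumes that \(X_{g}\) is arranged as an n×F matrix and \(M_{g}\) as an n×L matrix, with one row per subject.

$$ (\hat{u}_{g},\hat{v}_{g}) = \underset{u,\,v}{\arg\max}\ \text{corr}(X_{g}u,\, M_{g}v), \qquad \tilde{r}_{g} = \text{corr}(X_{g}\hat{u}_{g},\, M_{g}\hat{v}_{g}) $$

$$ \tilde{x}_{gi} = \sum\limits_{f=1}^{F} \hat{u}_{gf}\, x_{gfi}, \qquad \tilde{m}_{gi} = \sum\limits_{l=1}^{L} \hat{v}_{gl}\, m_{gli} $$

Under these assumptions, each gene contributes one expression score and one methylation score per subject, so the feature-level formulas apply with a single "feature" and a single "marker" per gene.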
Compute permutation p-values
The statistical significance of the PROMISE statistic is determined by a permutation procedure. The assignment of endpoint data to the molecular data is permuted and the test statistic recomputed many times. The p-value is given by the proportion of permutation repetitions that yield a PROMISE statistic with magnitude greater than or equal to that of the observed PROMISE statistic. An adaptive permutation procedure [8] is used to reduce computing time without compromising the statistical rigor of the results. Briefly, let \(t_{0}\) be the value of the observed PROMISE statistic, let b index permutation repetitions, and let \(t_{b}\) be the PROMISE statistic observed from permutation b. (In this section, other subscripts are omitted for simplicity of notation because the same permutation procedure may be used to compute permutation p-values for any of the PROMISE statistics defined above.) In each permutation b, the adaptive permutation procedure notes whether \(|t_{b}|\geq|t_{0}|\) or \(|t_{b}|<|t_{0}|\). The procedure continues until \(B_{0}\) permutations obtain \(|t_{b}|\geq|t_{0}|\) or a total of \(B_{1}\) permutations are performed. This allows the permutation procedure to terminate early for genes that clearly are not statistically significant. For example, if \(B_{0}=100\) of the first 200 permutations obtain \(|t_{b}|\geq|t_{0}|\), the procedure stops to report a p-value of \(\frac {100}{200} = 0.50\) instead of continuing for 10,000 permutations to report a clearly nonsignificant p-value to four decimal places. In applications that involve exploring the association of many genes with the endpoints, the adaptive permutation procedure can reduce computing time by 99% because typically the vast majority of genes do not have a strong association with the endpoint. The user may select the early-stopping count \(B_{0}\) and the maximum number of permutations \(B_{1}\) to obtain the desired computational efficiency and statistical rigor as described by [8]. We use \(B_{0}=100\) and \(B_{1}=10{,}000\) in the simulations and application described below.
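The stopping rule can be sketched as follows. This is an illustrative reading of the procedure as described in this subsection, not the implementation of [8]; the function name, its arguments, and the toy statistic are assumptions made for the example. Endpoint data are shuffled as whole per-subject records, so that with several endpoints all K values for a subject stay together.

```python
import random

def adaptive_permutation_pvalue(statistic, molecular, endpoint,
                                b0=100, b1=10000, rng=None):
    """Return (p_value, permutations_performed).

    statistic(molecular, endpoint) computes the test statistic.
    Stops once b0 permutations satisfy |t_b| >= |t_0|, or after b1
    permutations in total, whichever comes first.
    """
    rng = rng or random.Random()
    t0 = abs(statistic(molecular, endpoint))
    shuffled = list(endpoint)      # one record per subject
    exceed = 0
    performed = 0
    while exceed < b0 and performed < b1:
        rng.shuffle(shuffled)      # reassign endpoint records to subjects
        performed += 1
        if abs(statistic(molecular, shuffled)) >= t0:
            exceed += 1
    return exceed / performed, performed

def toy_statistic(x, y):
    """Hypothetical stand-in statistic: sum of centered cross-products."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

# A perfectly monotone relationship: the procedure runs all b1 permutations.
x = [float(i) for i in range(12)]
y = [float(i) for i in range(12)]
p_strong, n_strong = adaptive_permutation_pvalue(
    toy_statistic, x, y, b0=10, b1=200, rng=random.Random(1))

# A statistic that never changes: every permutation ties the observed
# value, so the procedure stops after b0 permutations.
p_null, n_null = adaptive_permutation_pvalue(
    lambda m, e: 1.0, x, y, b0=5, b1=200, rng=random.Random(0))
```

In the second call the procedure stops after 5 permutations with a p-value of 1.0, which mirrors the early termination for clearly nonsignificant genes described above.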
Conceptual comparison of PROMISE with list-overlap approaches
A very widely used method for integrated data analysis simply identifies genes that appear on multiple lists of the most significant hits from different data analyses. In other words, the analysis identifies the overlap across multiple lists of the most significant genes. This type of list overlap approach is popular because it is simple and thus can be used in a very broad spectrum of applications. It has been used with success in several applications.
However, list overlap approaches have several statistical and practical limitations. Each list includes a set of genes that exceed an arbitrary threshold for a test statistic or p-value. It can be unclear what statistical properties (false positive and false negative rates) are obtained for various thresholds for those lists. In many cases, there may be no overlap across lists even when each list has a very liberal threshold that allows for a large false positive rate. Additionally, the genes will appear in a different order on each list which often makes it unclear how to derive a final ranking of the genes by strength of empirical evidence.
The PROMISE method overcomes these limitations and brings many additional advantages over list overlap approaches. The PROMISE method provides one comprehensive p-value for each gene or feature tested. In this way, the genes are ranked by a common criterion with a clear statistical interpretation in terms of a false discovery rate. Also, with one PROMISE p-value per gene, the problem of finding no genes occurs only when no gene meets the chosen significance threshold for the PROMISE p-value. Furthermore, as previously described [7, 8] and illustrated in Fig. 1, PROMISE provides better statistical power to identify genes with effects on multiple endpoints than do list overlap approaches. Finally, as shown in the simulation studies below, CC-PROMISE provides similar benefits in the integrated analysis of two forms of high-dimensional molecular data with multiple endpoints.