Estimating causal effects with a non-paranormal method for the design of efficient intervention experiments
- Reiji Teramoto^{1}Email author,
- Chiaki Saito^{1} and
- Shin-ichi Funahashi^{1}
https://doi.org/10.1186/1471-2105-15-228
© Teramoto et al.; licensee BioMed Central Ltd. 2014
Received: 12 March 2014
Accepted: 24 June 2014
Published: 30 June 2014
Abstract
Background
Knockdown or overexpression of genes is widely used to identify genes that play important roles in many aspects of cellular functions and phenotypes. Because next-generation sequencing generates high-throughput data that allow us to detect genes, it is important to identify genes that drive functional and phenotypic changes of cells. However, conventional methods rely heavily on the assumption of normality and they often give incorrect results when the assumption is not true. To relax the Gaussian assumption in causal inference, we introduce the non-paranormal method to test conditional independence in the PC-algorithm. Then, we present the non-paranormal intervention-calculus when the directed acyclic graph (DAG) is absent (NPN-IDA), which incorporates the cumulative nature of effects through a cascaded pathway via causal inference for ranking causal genes against a phenotype with the non-paranormal method for estimating DAGs.
Results
We demonstrate that causal inference with the non-paranormal method significantly improves the performance in estimating DAGs on synthetic data in comparison with the original PC-algorithm. Moreover, we show that NPN-IDA outperforms the conventional methods in exploring regulators of the flowering time in Arabidopsis thaliana and regulators that control the browning of white adipocytes in mice. Our results show that performance improvement in estimating DAGs contributes to an accurate estimation of causal effects.
Conclusions
Although the simplest alternative procedure was used, our proposed method enables us to design efficient intervention experiments and can be applied to a wide range of research purposes, including drug discovery, because of its generality.
Keywords
Non-paranormal Gaussian assumption Causal effect Intervention-calculus Directed acyclic graph Machine learning Causal inference Experiment designBackground
Intervention experiments, e.g., knockdown or overexpression, are commonly conducted to identify genes that determine cell fates such as differentiation [1], induction of pluripotency [2], and direct reprogramming [3]. Those experiments are now indispensable in biological and medical research. Although intervention experiments identify a causal gene responsible for a phenotype of interest, they are time-consuming and cost-intense. Therefore, it is very important to prioritize and focus on causal genes with high confidence. However, it is difficult to infer causal effects only from observational data. This task coincides with inferring causal effects that are established in Statistics. Note that in this problem setting, a causal effect is different from a regression-type effect of association [4]. In fact, previous studies suggested that representative regression methods such as lasso and elastic net are not appropriate for our goal [4–6].
Recently, there has been much progress to address this problem by employing the intervention-calculus when the directed acyclic graph (DAG) is absent (IDA) [4–6] for the design of efficient intervention experiments. IDA combines two methods: (1) inference the unknown underlying DAG model from observational data by the PC-algorithm [7] and (2) estimating causal effects on the basis of DAG using intervention-calculus; furthermore, it provides estimated lower bounds of total causal effects from observational data. The PC-algorithm is computationally feasible and appropriate for high-dimensional settings. In addition, it has the desirable property to achieve high computational efficiency as a function of sparseness of the true underlying DAG model.
In spite of these advantages, the PC-algorithm requires the Gaussianity assumption to use partial correlations to test conditional independence, and this assumption is not necessarily true in real data sets. Because the normality assumption is restrictive and conclusions inferred under this assumption could be misleading, it is desirable to relax the Gaussian assumption.
On the other hand, non-paranormal methods that use a semiparametric Gaussian copula have been proposed for estimating sparse undirected graphs and exhibit significant improvement in the performance because the normality assumption is relaxed [8, 9]. The main idea of the non-paranormal method is to exploit the nonparametric correlation coefficient instead of Pearson’s correlation coefficient for estimation. Although this is the simplest alternative procedure, the non-paranormal graphical model could be a viable alternative for the Gaussian graphical model.
Consequently, we present non-paranormal IDA (NPN-IDA), which uses nonparametric partial correlations to test conditional independencies in the PC-algorithm for intervention-calculus. In our method, the Gaussian assumption in the PC-algorithm is naturally relaxed by using nonparametric partial correlation. Although the non-paranormal method has been successfully applied to estimating undirected graphs in previous studies, we show that it works well for estimating DAGs in the PC-algorithm. Next, we applied our method to Arabidopsis thaliana microarray data and mouse microarray data to demonstrate that NPN-IDA outperforms IDA in exploring regulators of the flowering time in A. thaliana and regulators that control the browning of white adipocytes in mice.
In summary, the three main contributions of this work are: (1) introduction of a non-paranormal method for inference of the unknown underlying DAG model from observational data in the expansive framework of the PC-algorithm, (2) combination of the non-paranormal method and the PC-algorithm significantly improves the performance in estimating DAGs on synthetic data, and (3) NPN-IDA is effective in exploring regulators that control specific phenotypes of interest.
Methods
We first introduce the IDA procedure. IDA consists of (1) inference of the unknown underlying DAG model from observational data by PC-algorithm and (2) estimation of causal effects based on the DAG using intervention-calculus. Then, we introduce the non-paranormal method for PC-algorithm. Finally, we present the combination of the non-paranormal method for PC-algorithm and estimating causal effects as NPN-IDA algorithm.
Inference DAGs with the PC-algorithm
where Φ is the standard normal distribution function and α is a tuning parameter, which can be interpreted as the significance level of a single partial correlation test. Choosing an appropriate value for α is difficult but, for example, can be done with the Bayesian information criterion.
Estimating causal effects using intervention-calculus
for any value of ${\mathit{x}}_{\mathit{i}}^{\text{'}}$.
Non-paranormalmethod for NPN-IDA
For Gaussian copula distributions, the following important lemma connects Spearman’s correlation coefficient ${\mathit{r}}_{\mathit{ij}}^{\mathit{s}}$ to the underlying Pearson’s correlation coefficient [8, 9].
Lemma 1.[10] By assuming X ∼ NPN(f, ∑ ^{0}), we have ${\sum}_{\mathit{ij}}^{0}=2sin\left(\frac{\mathit{\pi}}{6}{\mathit{r}}_{\mathit{ij}}^{\mathit{s}}\right)$.
Results
Experimental settings
where gaps indicate the pairs of vertex that are not linked.
SHD is the number of edge insertions, deletions, and flips to transfer the estimated DAG into the true DAG [7]. A large SHD indicates a poor fit, whereas a small SHD indicates a good fit.
The second purpose was to evaluate the performance of NPN-IDA when applied to predicting regulators of the flowering time as phenotype of interest using an A. thaliana microarray data set and regulators that control the browning of white adipocytes in mice using mouse microarray data. These two data sets are entirely different in terms of species and target variables. Therefore, if we obtain the consistent results thorough performance comparison of the methods, the consequence will be solid. We implemented the R language and employed the pcalg package for the PC-algorithm and IDA [11].
Performance evaluation of the NPN-PC algorithm
Parameter setting for performance evaluation of the NPN-PC algorithm
Parameters | Set values |
---|---|
α | 10^{−6},10^{−5}, 10^{−4}, 10^{−3}, 10^{−2} |
p | 100, 200 |
n | 50, 100 |
cp | 0.005, 0.01 |
Mixing rate | 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 |
Performance comparison between PC-algorithm and NPN-PC algorithm
p | n | Mixing rate | TPR | FPR | TDR | SHD | ||||
---|---|---|---|---|---|---|---|---|---|---|
PC | NPN-PC | PC | NPN-PC | PC | NPN-PC | PC | NPN-PC | |||
100 | 50 | 0.1 | 0.360 | 0.373 | 0.0021 | 0.0003 | 0.468 | 0.876 | 28.72 | 19.40 |
0.2 | 0.341 | 0.395 | 0.0035 | 0.0002 | 0.326 | 0.899 | 35.74 | 18.42 | ||
0.3 | 0.316 | 0.372 | 0.0044 | 0.0002 | 0.263 | 0.898 | 40.08 | 18.10 | ||
0.4 | 0.325 | 0.387 | 0.0046 | 0.0003 | 0.266 | 0.886 | 42.58 | 19.88 | ||
0.5 | 0.308 | 0.393 | 0.0047 | 0.0002 | 0.257 | 0.908 | 43.30 | 18.92 | ||
0.6 | 0.326 | 0.399 | 0.0049 | 0.0002 | 0.252 | 0.890 | 42.88 | 18.68 | ||
0.7 | 0.310 | 0.409 | 0.0051 | 0.0002 | 0.238 | 0.893 | 45.14 | 18.80 | ||
0.8 | 0.297 | 0.405 | 0.0052 | 0.0002 | 0.221 | 0.898 | 44.64 | 18.04 | ||
0.9 | 0.319 | 0.432 | 0.0052 | 0.0002 | 0.237 | 0.903 | 44.64 | 17.74 | ||
1 | 0.313 | 0.416 | 0.0051 | 0.0002 | 0.241 | 0.904 | 44.74 | 18.24 | ||
100 | 0.1 | 0.487 | 0.625 | 0.0031 | 0.0001 | 0.440 | 0.956 | 31.38 | 13.94 | |
0.2 | 0.451 | 0.623 | 0.0048 | 0.0002 | 0.331 | 0.954 | 42.30 | 15.04 | ||
0.3 | 0.457 | 0.653 | 0.0058 | 0.0001 | 0.282 | 0.963 | 46.04 | 13.60 | ||
0.4 | 0.442 | 0.643 | 0.0063 | 0.0002 | 0.265 | 0.942 | 48.84 | 14.14 | ||
0.5 | 0.425 | 0.667 | 0.0064 | 0.0001 | 0.249 | 0.965 | 49.50 | 12.90 | ||
0.6 | 0.449 | 0.642 | 0.0064 | 0.0002 | 0.264 | 0.950 | 49.32 | 13.92 | ||
0.7 | 0.454 | 0.681 | 0.0064 | 0.0002 | 0.274 | 0.951 | 50.16 | 13.50 | ||
0.8 | 0.421 | 0.665 | 0.0064 | 0.0002 | 0.257 | 0.957 | 49.78 | 13.68 | ||
0.9 | 0.437 | 0.692 | 0.0064 | 0.0001 | 0.254 | 0.965 | 48.80 | 12.30 | ||
1 | 0.401 | 0.672 | 0.0064 | 0.0002 | 0.239 | 0.950 | 50.00 | 12.66 | ||
200 | 50 | 0.1 | 0.291 | 0.315 | 0.0011 | 0.0002 | 0.564 | 0.894 | 105.72 | 86.50 |
0.2 | 0.268 | 0.326 | 0.0018 | 0.0002 | 0.430 | 0.892 | 120.72 | 85.08 | ||
0.3 | 0.256 | 0.323 | 0.0021 | 0.0002 | 0.386 | 0.906 | 127.88 | 85.28 | ||
0.4 | 0.257 | 0.317 | 0.0023 | 0.0002 | 0.366 | 0.903 | 130.22 | 84.86 | ||
0.5 | 0.255 | 0.324 | 0.0023 | 0.0002 | 0.360 | 0.897 | 132.10 | 86.32 | ||
0.6 | 0.258 | 0.345 | 0.0024 | 0.0002 | 0.347 | 0.885 | 129.06 | 81.48 | ||
0.7 | 0.249 | 0.335 | 0.0024 | 0.0002 | 0.345 | 0.908 | 131.80 | 82.98 | ||
0.8 | 0.243 | 0.335 | 0.0024 | 0.0002 | 0.343 | 0.888 | 134.52 | 85.38 | ||
0.9 | 0.250 | 0.338 | 0.0024 | 0.0002 | 0.344 | 0.896 | 133.46 | 83.98 | ||
1 | 0.249 | 0.340 | 0.0024 | 0.0002 | 0.350 | 0.903 | 133.42 | 83.94 | ||
100 | 0.1 | 0.424 | 0.542 | 0.0018 | 0.0001 | 0.534 | 0.956 | 110.60 | 68.40 | |
0.2 | 0.387 | 0.565 | 0.0027 | 0.0001 | 0.415 | 0.968 | 131.66 | 66.50 | ||
0.3 | 0.374 | 0.558 | 0.0031 | 0.0001 | 0.377 | 0.964 | 140.38 | 68.48 | ||
0.4 | 0.365 | 0.567 | 0.0032 | 0.0001 | 0.368 | 0.968 | 145.72 | 68.72 | ||
0.5 | 0.372 | 0.580 | 0.0033 | 0.0001 | 0.357 | 0.963 | 142.34 | 64.52 | ||
0.6 | 0.360 | 0.576 | 0.0034 | 0.0001 | 0.349 | 0.967 | 146.92 | 66.90 | ||
0.7 | 0.363 | 0.577 | 0.0033 | 0.0001 | 0.354 | 0.965 | 146.84 | 67.20 | ||
0.8 | 0.363 | 0.581 | 0.0034 | 0.0001 | 0.346 | 0.968 | 146.18 | 65.90 | ||
0.9 | 0.372 | 0.585 | 0.0034 | 0.0001 | 0.361 | 0.964 | 145.62 | 67.04 | ||
1 | 0.369 | 0.594 | 0.0034 | 0.0001 | 0.356 | 0.972 | 145.32 | 65.72 |
Performance evaluation of the NPN-IDA algorithm
Exploring regulators of the flowering time in A. thaliana
AUC values under the different conditions of correlation screening on exploring regulators of the flowering time in Arabidopsis thaliana
Correlation screening | IDA | NPN-IDA |
---|---|---|
500 | 0.737 | 0.755 |
1000 | 0.696 | 0.776 |
1500 | 0.605 | 0.680 |
2000 | 0.608 | 0.662 |
Exploring the regulators that control the browning of white adipocytes in mice
Known regulators of the differentiation of WAT to BAT
Correlation screening | Gene symbol |
---|---|
2000 | Ppargc1a, Tbx15, Tfam, Bmp7 |
3000 | Ppargc1a, Tbx15, Tfam, Bmp7 |
4000 | Foxc2, Ppargc1a, Tbx15, Tfam, Bmp7 |
AUC values under the different conditions of correlation screening on exploring the regulators that control the browning of white adipocytes in mice
Correlation screening | Known regulators | IDA | NPN-IDA |
---|---|---|---|
2000 | 4 | 0.706 | 0.726 |
3000 | 4 | 0.580 | 0.701 |
4000 | 5 | 0.648 | 0.665 |
Discussion
From Table 2, it appears that the NPN-PC algorithm consistently outperforms the PC-algorithm with regard to all performance metrics, TPR, FPR, TDR, and SHD. Furthermore, as the mixing rate increases, the performance values of the PC-algorithm decrease. This result clearly shows that the PC-algorithm does not work when the Gaussianity assumption is not true. In contrast, the NPN-PC algorithm works well when mixing rate is high. In other words, the NPN-PC algorithm does not require the Gaussianity assumption of data distribution. In terms of FPR, it appears that the FPR of the NPN-PC algorithm is strictly suppressed under all set values. In the NPN-PC algorithm, all performance metrics improved when the number of samples was 100 compared to when the number of samples was 50. From these observations, it can be concluded that the NPN-PC algorithm is robust and does not depend on data distribution, unlike the PC-algorithm. This characteristic is very attractive from the practical point of view.
From Table 3 and Figure 8, NPN-IDA consistently outperformed IDA on exploring regulators of the flowering time in A. thaliana. In particular, when the correlation screenings were set to 1000 and 1500, the difference in the AUCs between NPN-IDA and IDA increased. When the correlation screenings were set to 500, both NPN-IDA and IDA worked well. This result indicates that the known regulators were sufficiently recovered against the flowering time within genes that have high absolute correlation coefficients. Therefore, although we do not know whether the unknown regulators have high absolute correlation coefficients against the flowering time, it would be a good strategy from a practical perspective to explore novel regulators from genes that have high absolute correlation coefficients against the flowering time.
From Table 5 and Figure 9, NPN-IDA consistently outperformed IDA on exploring regulators that control the browning of white adipocytes in mice. In particular, when the correlation screenings were set to 3000 and 4000, the difference in the AUCs between NPN-IDA and IDA increased. These results suggest that NPN-IDA successfully enriches known regulators at the top ranks when the number of available genes increases. Consequently, our two experiments clearly demonstrated that NPN-IDA performs better than IDA.
p-values of the Shapiro-Wilk test of target variables and known regulators
Variable | p-value |
---|---|
Flowering time | 1.09e-7 |
DWF4 | 3.62e-1 |
FLC | 1.81e-3 |
FRI | 2.81e-1 |
RPA2B | 1.24e-4 |
SOC1 | 6.86e-1 |
PDH-E1 BETA | 7.95e-2 |
ATGH9B5 | 1.69e-4 |
LRR | 1.21e-4 |
OTLD1 | 6.84e-1 |
Ucp1 | 1.50e-21 |
Foxc2 | 6.01e-22 |
Ppargc1a | 8.59e-21 |
Tbx15 (merck-NM_00923_at) | 1.75e-21 |
Tbx15 (merck-NM_01154_a_at) | 6.31e-20 |
Tfam | 5.12e-08 |
Bmp7 | 3.09e-13 |
Although the method presented here performs significantly better than IDA, one might consider the difference in the performance as too small. However, within the known regulators of flowering time in A. thaliana, four genes (PDH-E1 BETA, ATGH9B5, LRR, and OTLD1) were experimentally validated using the IDA-based method [6]. Therefore, one should take into account that some of the known regulators were obtained on the basis of the Gaussian assumption. Thus, it is possible that the improvement achieved with NPN-IDA is greater than the experimental results shown in this study.
In summary, the two main results of our experimental study are: (1) the NPN-PC algorithm works better than the PC-algorithm in estimating DAGs on synthetic data, and (2) NPN-IDA performs better than does IDA in predicting regulators of the flowering time in A. thaliana and regulators of the differentiation of WAT to BAT in mice on the basis of microarray data. From these two results, we conclude that the performance to estimate DAGs contributes to the accurate prediction of causal effects.
From a practical point of view, because regulators that control the browning of white adipocytes have not been identified only from microarray data of WAT so far, it might be worthwhile to demonstrate this possibility for the first time using our novel method.
For further performance enhancement, we consider that NPN-IDA could be embedded into CStaR (causal stability ranking) [6] in the future. CStaR uses IDA with stability selection [6, 15] and the performance was greatly improved compared to plain IDA. By simply replacing IDA with NPN-IDA in the estimation process for causal effects, it would be easy to combine NPN-IDA with CStaR.
Conclusions
We presented NPN-IDA, which uses nonparametric partial correlations, to test conditional independencies in the PC-algorithm for intervention-calculus. To reveal the contribution of estimating DAGs, we evaluated the NPN-PC algorithm to estimate DAGs with the non-paranormal method. The two main results of our experimental study are: (1) the NPN-PC algorithm works better than the PC-algorithm in estimating DAGs on synthetic data, and (2) NPN-IDA performs better than IDA does in predicting regulators of the flowering time in A. thaliana and regulators of the differentiation of WAT to BAT in mice. Considering these two results, we concluded that the performance to estimate DAGs contributes to the accurate prediction of causal effects.
Thus, the simplest alternative procedure we used for our proposed method enables us to design efficient intervention experiments and can be applied to a wide range of research purposes, including drug discovery and medicine, because of its generality.
Availability of supporting data
The microarray data sets used in this study are deposited in Additional file 2.
Declarations
Acknowledgements
We are grateful to Tatsumi Yamazaki, the president of Forerunner Pharma Research, for his support and encouragement throughout the study. We also thank Takumi Nagata and Kenji Yoshida for their valuable discussions.
Authors’ Affiliations
References
- Ahfeldt T, Schinzel RT, Lee YK, Hendrickson D, Kaplan A, Lum DH, Camahort R, Xia F, Shay J, Rhee EP, Clish CB, Deo RC, Shen T, Lau FH, Cowley A, Mowrer G, Al-Siddiqi H, Nahrendorf M, Musunuru K, Gerszten RE, Rinn JL, Cowan CA: Programming human pluriopotent stem cells into white and brown adipocytes. Nat Cell Biol. 2012, 14: 209-219.View ArticlePubMed CentralPubMedGoogle Scholar
- Takahashi K, Tanabe K, Ohnuki M, Narita M, Ichisaka T, Tomoda K, Yamanaka S: Induction of pluripotent stem cells from adult fibroblasts by defined factors. Cell. 2007, 13: 861-872.View ArticleGoogle Scholar
- Ring KL, Tong LM, Balestra ME, Javier R, Andrew-Zwilling Y, Li G, Walker D, Zhang WR, Kreitzer AC, Huang Y: Direct reprogramming of mouse and human fibroblasts into multipotent neural stem cells with a single factor. Cell Stem Cell. 2012, 11: 100-2109.View ArticlePubMed CentralPubMedGoogle Scholar
- Maathuis MH, Kalisch MK, Buhlmann P: Estimating high-dimensional intervention effects from observational data. Ann Stat. 2009, 37: 3133-3164.View ArticleGoogle Scholar
- Maathuis MH, Colombo D, Kalisch MK, Buhlmann P: Predicting causal effects in large-scale systems from observational data. Nat Methods. 2010, 7: 247-248.View ArticlePubMedGoogle Scholar
- Stekhoven DJ, Moraes I, Sveinbjornsson G, Hennig L, Maathuis MH, Buhlmann P: Causal stability ranking. Bioinformatics. 2012, 28: 2819-2823.View ArticlePubMedGoogle Scholar
- Kalisch M, Buhlmann P: Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J Machine Learning Res. 2007, 8: 613-636.Google Scholar
- Liu H, Laffery J, Wasserman L: The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J Machine Learning Res. 2009, 10: 2295-2328.Google Scholar
- Liu H, Han F, Yuan M, Laffery J, Wasserman L: The Nonparanomal SKEPTIC. Proceedings of the 29th International Conference of Machine Learning. 2012, Edinburg, Scotland, UK, 1415-1422.Google Scholar
- Kruskal WH: Ordinal measures of association. J Am Stat Assoc. 1958, 53: 814-861.View ArticleGoogle Scholar
- Kalisch M, Maechler M, Colombo D, Maathuis MH, Buhlmann P: Causal inference using graphical models with R package pcalg. J Stat Software. 2012, 47: 1-26.View ArticleGoogle Scholar
- Muise ES, Souza S, Chi A, Tan Y, Zhao X, Liu F, Dallas-Yang Q, Wu M, Sarr T, Zhu L, Guo H, Li Z, Li W, Hu W, Jiang G, Paweletz CP, Hendrickson RC, Thompson JR, Mu J, Berger JP, Mehmet H: Downstream signaling pathways in mouse adipose tissues following acute in vivo administration of fibroblast growth factor 21. PLoS One. 2013, 8: e73011-View ArticlePubMed CentralPubMedGoogle Scholar
- Lo KA, Sun L: Turning WAT into BAT: a review on regulators controlling the browning of white adipocytes. Biosci Rep. 2013, 33: 711-719.View ArticleGoogle Scholar
- Shapiro SS, Wilk MB: An analysis of variance test for normality (complete samples). Biometrika. 1965, 52: 591-611.View ArticleGoogle Scholar
- Meinshausen N, Buhlmann P: Stability selection. J Roy Stat Soc B. 2010, 72: 417-473.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.