Compositional zero-inflated network estimation for microbiome data

Background: The estimation of microbial networks can provide important insight into the ecological relationships among the organisms that comprise the microbiome. However, there are a number of critical statistical challenges in the inference of such networks from high-throughput data. Since the abundances in each sample are constrained to have a fixed sum and there is incomplete overlap in microbial populations across subjects, the data are both compositional and zero-inflated.
Results: We propose the COmpositional Zero-Inflated Network Estimation (COZINE) method for inference of microbial networks which addresses these critical aspects of the data while maintaining computational scalability. COZINE relies on the multivariate Hurdle model to infer a sparse set of conditional dependencies which reflect not only relationships among the continuous values, but also among binary indicators of presence or absence and between the binary and continuous representations of the data. Our simulation results show that the proposed method is better able to capture various types of microbial relationships than existing approaches. We demonstrate the utility of the method with an application to understanding the oral microbiome network in a cohort of leukemic patients.
Conclusions: Our proposed method addresses important challenges in microbiome network estimation, and can be effectively applied to discover various types of dependence relationships in microbial communities. The procedure we have developed, which we refer to as COZINE, is available online at https://github.com/MinJinHa/COZINE.

S1 Additional simulations

S1.1 K-minimal and H-K networks
Here we provide additional simulation studies for AR(1) graphs: (1) K-minimal network where the structure is only determined by the non-zero structure of K, and G and H are set to be diagonal matrices; and (2) H-K network where all true edges have a nonzero entry in K, with half of those edges represented in H as well, and G is set to be a diagonal matrix. All other steps of the simulation setup were kept the same as in the G-minimal, G-K and G-H-K simulations as described in the simulation section.
Based on 25 synthetic datasets, we compared the SpiecEasi-MB, SpiecEasi-GLASSO, COZINE and Ising methods using ROC analysis (Figure S1). COZINE outperformed all other methods in both simulation settings. When the network structure is solely determined by continuous interactions (K-minimal network), SpiecEasi is in theory the optimal choice, as the COZINE model is over-parametrized by including G and H. However, COZINE estimated the network structure more accurately than both SpiecEasi approaches. This result suggests that the group lasso penalty utilized by COZINE, which induces the same inclusion status for an edge across $g_{ij}$, $h_{ij}$, $h_{ji}$ and $k_{ij}$, can correctly detect edges encoded in any combination of the four parameters.
When the network structure also encodes a dependence of the mean levels of abundance on the presence of other species (H-K network), COZINE achieves the highest accuracy among the four methods.
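The ROC comparison can be illustrated with a minimal sketch. It assumes each method yields one score per candidate edge (for example, the largest penalty value at which the edge is still selected) and that the true edge set is known from the simulation. The function name `edge_auc`, the scoring convention, and the toy values are hypothetical and are not taken from the COZINE code.

```python
from itertools import combinations

def edge_auc(scores, true_edges, p):
    """AUC for edge recovery: chance that a true edge outranks a non-edge.

    scores     -- dict mapping (i, j) with i < j to an edge score (assumed input)
    true_edges -- set of (i, j) pairs with i < j in the simulated graph
    p          -- number of nodes
    """
    pos, neg = [], []
    for pair in combinations(range(p), 2):
        s = scores.get(pair, 0.0)  # edges never selected get score 0
        (pos if pair in true_edges else neg).append(s)
    if not pos or not neg:
        raise ValueError("need at least one true edge and one non-edge")
    # Mann-Whitney form of the AUC; ties count one half
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example on 4 nodes with three true edges (hypothetical values)
truth = {(0, 1), (1, 2), (2, 3)}
toy_scores = {(0, 1): 0.9, (1, 2): 0.7, (2, 3): 0.2, (0, 2): 0.4}
auc = edge_auc(toy_scores, truth, p=4)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of candidate edges, so for large p a rank-based implementation would be preferable; the version here is kept explicit for readability.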

S1.2 High-dimensional networks
To compare the performance of COZINE with other methodologies in a high-dimensional and more highly zero-inflated setting, we conducted an additional simulation study on the G-H-K scale-free network with p = 1000 nodes. The data generation follows the same procedure as for the low-dimensional settings, as described in detail in the "Simulation study" section.
Our simulation study is based on 25 replicated datasets with a sample size of n = 200. These data have an average proportion of zero values of 0.75. COZINE showed the highest accuracy in estimating the true network structure, with an AUC value of 0.766 (Figure S2). The average MCC value for COZINE was 0.33, which was significantly higher than the values for the other three methods: 0.08, 0.05 and 0.04 for SpiecEasi-MB, SpiecEasi-GLASSO and the Ising model, respectively.
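For reference, the Matthews correlation coefficient (MCC) for an estimated edge set can be computed as in the following sketch. It assumes undirected edges stored as pairs (i, j) with i < j; the function name `edge_mcc` and the toy edge sets are hypothetical.

```python
from math import sqrt

def edge_mcc(est_edges, true_edges, p):
    """MCC of an estimated undirected edge set against the true edge set.

    Both sets hold pairs (i, j) with i < j; p is the number of nodes.
    """
    n_pairs = p * (p - 1) // 2           # all candidate edges
    tp = len(est_edges & true_edges)     # correctly detected edges
    fp = len(est_edges - true_edges)     # spurious edges
    fn = len(true_edges - est_edges)     # missed edges
    tn = n_pairs - tp - fp - fn          # correctly absent edges
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when the coefficient is undefined
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy example on 4 nodes (hypothetical values)
truth = {(0, 1), (1, 2), (2, 3)}
estimate = {(0, 1), (1, 2), (0, 3)}
mcc = edge_mcc(estimate, truth, p=4)
print(f"MCC = {mcc:.3f}")
```

Unlike the AUC, the MCC is evaluated at a single selected network, so it reflects the tuning parameter choice as well as the ranking of edges.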

S2 Performance comparison on real microbiome data
Since the true microbial interactions are not known for the case study data, we evaluated performance by measuring the stability of the estimated networks. We compared six methods, including four partial correlation-based methods (COZINE, Ising, SpiecEasi-GLASSO and SpiecEasi-MB) and two correlation-based methods: SparCC (Friedman et al., 2008) and CCLasso (Fang et al., 2015). SpiecEasi-MB was the most sensitive to perturbation of the data, showing the lowest level of stability.
We next evaluated performance in terms of assortativity, that is, the tendency of genera in the same branch of the taxonomic tree to be linked within the co-occurrence networks estimated by the six approaches (Table S1). COZINE showed the most significant assortative mixing overall, with the highest coefficients at the Class and Order levels among the methods compared.
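The assortativity evaluation can be sketched as follows. The text does not state which coefficient was used; this sketch assumes Newman's assortativity coefficient for a categorical vertex attribute (here, a taxonomic rank). The function name and the toy taxa are hypothetical.

```python
from collections import Counter

def assortativity(edges, label):
    """Newman's assortativity coefficient for a categorical vertex label.

    edges -- iterable of undirected edges (u, v)
    label -- dict mapping each vertex to its category (e.g. taxonomic class)
    """
    mix = Counter()
    for u, v in edges:
        # count each undirected edge in both directions
        mix[(label[u], label[v])] += 1
        mix[(label[v], label[u])] += 1
    total = sum(mix.values())
    if total == 0:
        raise ValueError("network has no edges")
    # fraction of edge ends joining vertices with the same label
    same = sum(c for (a, b), c in mix.items() if a == b) / total
    # fraction of edge ends attached to each label
    ends = Counter()
    for (a, _), c in mix.items():
        ends[a] += c
    expected = sum((c / total) ** 2 for c in ends.values())
    if expected == 1.0:
        raise ValueError("all edge ends share one label; coefficient undefined")
    return (same - expected) / (1.0 - expected)

# Toy network: two genera per class, one cross-class edge (hypothetical)
taxon_class = {"g1": "ClassA", "g2": "ClassA", "g3": "ClassB", "g4": "ClassB"}
toy_edges = [("g1", "g2"), ("g3", "g4"), ("g2", "g3")]
r = assortativity(toy_edges, taxon_class)
print(f"assortativity = {r:.3f}")
```

A coefficient near 1 indicates that edges mostly join taxa in the same category, a value near 0 indicates mixing no different from random, and negative values indicate a preference for cross-category edges.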

S3 Inference of OTU network
We also applied COZINE to our case study data defined at the OTU level. The data include counts for 2029 OTUs and show a high level of sparsity, with 95% of values equal to zero. Our method took 3.03 hours on a Linux server (2.93 GHz, 96 GB RAM). We found 3058 edges among 518 OTUs (vertices), which is 0.15% of all possible edges among the 2029 OTUs. Figure S4 displays the co-occurrence network at the OTU level, with the degree distribution of vertices as an inset. As seen from the degree distribution and the topological structure, the network shows the scale-free property, with hubs that have a large number of edges. Four prominent hub nodes had degree greater than 100; they are listed in Table S2.
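The reported edge density follows from the counts given above, as the short check below shows. The helper `hubs` and its toy edge list are hypothetical and only illustrate how hub nodes could be extracted from an edge list by a degree threshold.

```python
from collections import Counter
from math import comb

n_otus = 2029
n_edges = 3058
possible = comb(n_otus, 2)               # number of distinct OTU pairs
density_pct = 100 * n_edges / possible   # percentage of possible edges present
print(f"{possible} possible edges, density = {density_pct:.2f}%")

def hubs(edges, min_degree):
    """Return {node: degree} for nodes with degree greater than min_degree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {node: d for node, d in deg.items() if d > min_degree}

# Toy edge list (hypothetical); the case study uses a threshold of 100
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
toy_hubs = hubs(toy_edges, min_degree=2)
```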