Microbiome characterization of longitudinal preterm infant stool samples from two independent, historical NICU cohorts
We developed and validated our ML NEC risk assessment system using data from two large studies of preterm infants from which stool microbiota DNA sequences and basic clinical details—including NEC outcomes—were available [27, 31].
The designs of the two studies were conceptually similar. Warner et al. [31] prospectively collected serial stool samples and recorded a panel of clinical data from VLBW preterm infants cared for at three U.S. hospitals. NEC cases were identified based on clinical and radiographic features and were matched to between one and four control infants who did not develop NEC. Case-control matching was based on gestational age at delivery and the birth medical center. All stools from birth through disease onset were analyzed for NEC-affected cases, whereas stools between birth and 74 days of life were analyzed for controls (the majority of control patients had samples collected up to day of life 60, see Fig. 1A). To characterize the microbiota in collected stools, Warner et al. used Roche 454 sequencing of 16S ribosomal RNA variable region amplicons.
Olm et al. [27] also prospectively collected preterm infant stool and recorded clinical data over the first months of life, then performed post-hoc case-control matching based on gestational age at delivery and the calendar date at birth (to control for potential seasonal fluctuation). Infants for this study were all cared for in the NICU at University of Pittsburgh Medical Center Magee-Womens Hospital. There was no overlap between the patient populations of the two studies. As in the Warner et al. study, NEC-affected infant stool was analyzed between birth and disease onset, with samples preferentially selected for sequencing if they were the last collected before NEC diagnosis (Fig. 1B). In this study, control stool samples selected for analysis were matched in time to case samples. Instead of targeted sequencing of the 16S ribosomal RNA coding region, Olm et al. performed whole-genome shotgun sequencing on an Illumina HiSeq 2500 platform, permitting reconstruction of complete or near-complete bacterial genomes. Compared with targeted 16S sequencing, whole-genome sequencing permits more precise bacterial taxonomic classification and reconstruction of the metabolic pathways present in the bacterial population, but requires more sequencing and bioinformatics resources [39, 40].
We obtained authorized access to Warner et al. patient metadata through the National Center for Biotechnology Information’s database of genotypes and phenotypes (NCBI dbGaP accession phs000247.v5.p3) and used metadata published by Olm et al. Patient details, which included NEC outcomes, were matched with publicly available next-generation sequencing files, which we downloaded from the NCBI’s Sequence Read Archive. Because the two studies used different sample collection protocols and next-generation sequencing strategies, and reported different demographic and clinical information, we took several steps to ensure that the data we used from the two sources were normalized and structured identically.
FASTQ files with 16S ribosomal RNA variable region sequences (Warner et al. dataset) and whole-genome shotgun sequences (Olm et al. dataset) were characterized with Kraken2, a flexible microbiota classification software package that matches short genetic sequences (k-mers) to the lowest common ancestor of the genomes containing them in order to make a taxonomic assignment [41]. We chose Kraken2 for its speed, its resource efficiency, and the fact that it produces identically formatted microbiota classification files regardless of whether its input consists of 16S or whole-genome sequences. This uniformity is well-suited for downstream ML analyses, permitting a single model to accept data from distinct study protocols.
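For illustration, the sketch below shows one way such a uniform count table could be assembled from Kraken2’s standard six-column report files (percentage, clade read count, direct read count, rank code, NCBI taxonomy ID, name). The directory name and the choice to key taxa by rank, ID, and name are our illustrative assumptions, not part of either source study’s pipeline.

```python
import csv
from pathlib import Path

def parse_kraken2_report(path):
    """Read one Kraken2 report (default tab-separated, six-column format)
    into a {(rank_code, taxid, name): clade_read_count} mapping."""
    counts = {}
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            _pct, clade_reads, _direct_reads, rank, taxid, name = row[:6]
            counts[(rank, int(taxid), name.strip())] = int(clade_reads)
    return counts

# One entry per sample, identically structured whether the underlying reads
# were 16S amplicons or whole-genome shotgun sequences.
reports = sorted(Path("kraken2_reports").glob("*.report"))  # hypothetical path
count_table = {p.stem: parse_kraken2_report(p) for p in reports}
```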
After removing all non-bacterial taxa (fungi and archaea) from the two collections of Kraken2 output files, we were left with bacterial DNA fragment counts for stool microbiota from 3595 samples taken from 261 preterm infants (2895 samples from 161 infants from the Warner study, 700 samples from 100 infants from Olm et al.), 75 of whom developed NEC. The taxonomic distribution of the bacteria identified through this analysis is presented in Fig. 1C. Trends identified in both source studies reemerged after our reclassification of the stool microbiota using Kraken2. Specifically, we found that infants from the Warner et al. cohort who developed NEC had a trend toward decreased Negativicutes abundance and increased Proteobacteria abundance relative to unaffected controls. We also found an increase in Klebsiella pneumoniae strain 242_2 among the Olm et al. NEC cases preceding disease onset, which those authors described in their original report (Additional file 1: Fig. S1).
Both Warner et al. and Olm et al. reported demographic and clinical information about enrolled neonates and their families. These metadata ranged from maternal information such as age, parity, and biometric data, to details about the birth—such as mode of delivery—to neonatal metadata such as birthweight, age at the time of each sample collection, and Apgar scores. The collected metadata categories differed between the two studies, however. For our ML analysis, we elected to use a limited metadata collection representing the intersection of the two studies (Table 1).
Pre-processing of compositional microbiome feature counts normalizes data and prunes uninformative taxa
Microbiota classification data generated through next-generation sequencing have characteristics that present challenges to ML-based analyses. First, stool microbiota datasets are sparse: the majority of potential bacteria will not be found in any given sample, resulting in large numbers of “zero count” organisms [36, 42]. Furthermore, microbiota classification data have a characteristic known as compositionality. Briefly, because the total number of DNA reads that can be generated by a next-generation sequencing instrument is finite, any next-generation sequencing-based taxonomic classification is necessarily based on a subsample of the total environmental diversity. As a result, a microbiota community member characterized as absent (i.e., a zero in the count table) may actually be present at a density below the instrument’s limit of detection. In addition, analyses that ignore the compositional nature of the data can yield spurious correlations between taxon abundances [34,35,36, 43, 44].
A common approach for handling the uncertain zeros in a compositional dataset is to apply a pseudocount, whereby zeros are replaced by a uniform small number, such as two-thirds of the limit of detection [45, 46]. For our dataset, zeros were replaced with 0.66, two-thirds of the smallest possible read count of 1. Next, we applied a centered log-ratio transformation to make the compositional data more tractable to statistical analyses. In the centered log-ratio approach, each taxon count within a sample is transformed by taking the logarithm of its ratio to the geometric mean of all taxon counts in that sample, \(\mathrm{clr}(x_i) = \ln\left(x_i / g(\mathbf{x})\right)\), where \(g(\mathbf{x})\) denotes the geometric mean of the sample’s counts \(\mathbf{x}\) [47, 48]. This transformation adjusts sparse datasets so that the frequency distributions are not tightly clustered close to zero; it shifts and widens narrow frequency distributions, accentuating subtle differences that may be lost in the non-transformed data (see Fig. 2).
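Both steps can be expressed compactly. The sketch below is a minimal NumPy implementation of the pseudocount replacement and centered log-ratio transform described above, applied to a toy count matrix rather than the study data.

```python
import numpy as np

def clr_transform(counts, pseudocount=0.66):
    """Replace zeros with the pseudocount (two-thirds of the minimum
    possible read count of 1), then apply the centered log-ratio:
    clr(x_i) = ln(x_i) - mean(ln(x)), which is equivalent to dividing
    each count by the sample's geometric mean before taking logs."""
    x = np.where(counts == 0, pseudocount, counts)
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)

toy_counts = np.array([[120, 0, 3, 877],
                       [0, 15, 0, 985]])
print(clr_transform(toy_counts))
```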
Another special characteristic of microbiota data is its hierarchically structured feature space (bacteria can be classified at kingdom, phylum, class, order, family, genus, and species levels), which can be exploited for more efficient and accurate modeling through taxonomy dimensionality reduction [49, 50]. To reduce the dimensionality of our taxonomic microbiota data, we removed any node that was redundant with its parent, using a Pearson correlation threshold of 0.7. This reduced the number of taxa from 3702 to 2282 for the Warner et al. dataset and from 9626 to 6221 for the Olm et al. dataset. Information gains were also calculated for each node of the taxonomy tree using the NEC target label, and any node with an information gain of zero was discarded. This process allowed us to prune the feature set further, to 362 taxa and 706 taxa for the Warner and Olm datasets, respectively. For a list of taxa included in the final analyses, after pre-processing, see Additional file 2: Data S1.
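A sketch of this two-stage pruning is shown below, assuming a samples-by-taxa CLR matrix and a parent_of mapping that encodes the taxonomy tree. We use scikit-learn’s mutual_info_classif as a stand-in estimator for the information-gain calculation, so details differ from the study’s exact implementation.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_classif

def prune_taxa(X, y, parent_of, corr_threshold=0.7):
    """X: samples-by-taxa CLR matrix; y: per-sample NEC outcome labels;
    parent_of: {column index: parent column index, or None for roots}.
    Returns the column indices surviving both pruning steps."""
    # Step 1: drop taxa whose profile is redundant with their parent.
    keep = [j for j in range(X.shape[1])
            if parent_of.get(j) is None
            or pearsonr(X[:, j], X[:, parent_of[j]])[0] <= corr_threshold]
    # Step 2: drop taxa carrying no information about the NEC label
    # (mutual information here stands in for information gain).
    mi = mutual_info_classif(X[:, keep], y, random_state=0)
    return [j for j, m in zip(keep, mi) if m > 0]
```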
Pre-processing highlighted inherent differences between the two datasets. The Warner et al. study, which used relatively low-resolution 16S sequencing, generated a sparser, lower-complexity data landscape after centered log-ratio transformation than the Olm et al. study, in which whole genome sequencing generated greater phylogenetic differentiation and more nuances in the data landscape (Fig. 2). However, the hierarchical feature reduction step removed low-information taxa from both datasets, resulting in similar feature frequency landscapes from the two studies. A significant strength of the NEC prediction system we describe here is that it performed well with different degrees of data complexity, suggesting that it can be successfully paired with various technical strategies for microbiota characterization.
An attention-based multiple instance learning paradigm permits discrimination between affected and unaffected infants using pre-processed microbiome data and basic sample metadata
The field of ML offers a wide range of algorithms to build models from data. In most cases, ML requires training from a set of discrete data points, which are termed instances. Each instance is described by a set of feature values in some domain, such as text for emails, pixel intensity for images, or vital signs for patients. Often, each instance is tagged with its own label (for example, spam or non-spam email). For any new ML task, a crucial step is to identify the learning paradigm and the representation needed to formalize the problem, a choice that depends on the data and the goal.
MIL is a form of ML where the training instances are arranged in sets, called bags, and the label is provided for the entire bag [51, 52]. There are two main approaches to solving MIL problems. In instance-level approaches, predictions are made for each instance and aggregated to obtain the bag-level label. In embedding-level approaches, instances are mapped into a vector embedding space and fed to a final classifier. Embedding-level approaches are preferred when instance-level labels are unknown, as is the case for NEC prediction, where no single stool sample collected prior to disease onset can be definitively classified as NEC-affected or unaffected.
We recognized MIL as an appropriate framework for NEC prediction from serial stool sample microbiota analysis. In adapting a MIL approach to predicting NEC, we model each patient as a bag of clinical samples, which are the instances. The instances have feature values that consist of microbiota data and associated clinical details (Table 1). The task is to assign a NEC risk classification to the patient (Fig. 3A) based on the unlabeled instances.
We used a recently developed embedding-level strategy in which initial embedding of the features is performed by a neural network that feeds into an attention-gated MIL pooling algorithm [53]. The instance embeddings are aggregated using the attention weights, and the resulting bag representation is passed to a fully connected layer with a sigmoid activation function that produces a bag probability. This model assumes that the label \(Y\) of a bag \(X\) follows a Bernoulli distribution with parameter \(\theta(X) \in [0, 1]\) and is trained by maximizing the log-likelihood (see Additional file 3: Fig. S2).
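A minimal PyTorch sketch of this architecture, following the gated-attention formulation of [53], is shown below. The layer sizes and the single-bag (unbatched) interface are illustrative choices, not the study’s exact configuration.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Embedding-level MIL with gated attention pooling (after [53])."""
    def __init__(self, n_features, embed_dim=64, attn_dim=32):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(n_features, embed_dim), nn.ReLU())
        self.attn_V = nn.Sequential(nn.Linear(embed_dim, attn_dim), nn.Tanh())
        self.attn_U = nn.Sequential(nn.Linear(embed_dim, attn_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(attn_dim, 1)
        self.classify = nn.Sequential(nn.Linear(embed_dim, 1), nn.Sigmoid())

    def forward(self, bag):                  # bag: (n_instances, n_features)
        h = self.embed(bag)                  # per-instance embeddings
        a = self.attn_w(self.attn_V(h) * self.attn_U(h))  # gated attention
        a = torch.softmax(a, dim=0)          # instance weights sum to 1
        z = (a * h).sum(dim=0)               # attention-weighted bag embedding
        return self.classify(z), a           # bag probability, attention weights

# Maximizing the Bernoulli log-likelihood is equivalent to minimizing binary
# cross-entropy between the bag probability and the bag label:
# loss = nn.functional.binary_cross_entropy(prob, label)
```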
The attention-based pooling module of our MIL system assigns attention weights to each instance in a bag. Those instances with the highest attention weights are the most significant contributors to the final bag label. This allows identification of maximally predictive stool samples (Fig. 3B). We used attention weight data to tune dynamic NEC risk scores and to interpret the microbiota taxonomic features that were most informative to our model.
We used stratified sampling to partition our two datasets into training and testing sets, conducting five trials of this partitioning to avoid sampling bias. Within each trial, we used a cross-validation strategy whereby we trained our system on a portion of the dataset and repeatedly tested the resulting model on the subset withheld during training; the final model was then applied to the test set, and we averaged results across all trials. We compared our system to two instance-level MIL approaches, mi-SVM and MI-SVM [54], and to MILBoost, a MIL variant of AdaBoost that has mostly been used for object detection in images [55]. Finally, we compared to logistic regression. The PyTorch Python package [56] was used to implement the attention-based MIL method, and the Scikit-learn Python package was used to implement the SVM and logistic regression analyses [57].
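For concreteness, the fragment below sketches the bag-level stratified partitioning with scikit-learn; the 80/20 split proportion and the seed-per-trial scheme are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

# Bags are patients, so we partition patient IDs (never individual samples)
# and stratify on the patient-level NEC label to preserve the case:control
# ratio. Counts mirror the combined cohorts: 75 NEC cases among 261 infants.
patient_ids = list(range(261))          # toy IDs; real IDs come from metadata
nec_labels = [1] * 75 + [0] * 186
for trial in range(5):                  # five independent partitions
    train_ids, test_ids = train_test_split(
        patient_ids, stratify=nec_labels, test_size=0.2, random_state=trial)
```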
In repeated trials of our attention-based MIL approach, we demonstrated areas under the receiver operating characteristic (ROC) curve of 0.86–0.92 for both datasets, suggesting a good balance of sensitivity and specificity. Precision-recall characteristics of the experimental system also exceeded those of the alternative methods, with precision-recall AUC values around 0.75 (Fig. 4). The majority of our model’s predictive performance came from the microbiome data: excluding the clinical and demographic details in Table 1 from the analysis resulted in AUC values that were not significantly different from those obtained when these metadata were included (\(p\ge 0.35\) for Kolmogorov–Smirnov comparisons between ROC and precision-recall curves with and without clinical data). Pre-processing as described above was essential, however. Dimensionality reduction through hierarchical feature and information gain-based pruning allowed our models to converge, which was not possible when they were trained with the complete microbial dataset, and centered log-ratio transformation increased the ROC and precision-recall AUCs (Additional file 4: Fig. S3). By contrast, the predictive accuracy of the alternative approaches was barely above chance, with ROC curves demonstrating only limited ability to differentiate NEC from non-NEC cases (Fig. 4).
A novel “growing bag” method permits determination of a NEC risk score from serial analysis of neonatal stool samples
When supplied with historical data spanning early infancy, our MIL-based system effectively distinguishes unaffected preterm newborns from those who developed NEC. However, a more useful clinical tool would be able to make dynamic predictions over time.
In order to generate dynamic NEC predictions using our system, we developed an approach that we call the growing bag analysis. Here, the model is initially trained on the full set of complete bags except for the bag of interest, representing the patient whose longitudinal NEC prediction is under study. That patient’s instances are then arranged in sequential order and the trained model is asked to generate iterative NEC predictions where each new prediction represents the previous bag to which one new instance has been added (i.e. the growing bag). The process can be repeated for every patient to produce NEC confidence scores as a function of time (Fig. 3C).
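Assuming the AttentionMIL sketch above (or any model exposing the same interface), the growing bag replay for one held-out patient might look like the following; the function name and interface are ours, for illustration.

```python
import torch

def growing_bag_scores(model, instances):
    """Replay one held-out patient's samples in chronological order,
    re-scoring the bag each time a new instance is added. `model` must
    have been trained with this patient's bag excluded."""
    model.eval()
    scores = []
    with torch.no_grad():
        for t in range(1, len(instances) + 1):
            prob, _attn = model(instances[:t])   # bag grown to t instances
            scores.append(float(prob))
    return scores  # NEC confidence as a function of sample number
```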
Plotting the confidence-over-time functions for patients in our historical datasets revealed an unexpected pattern: the system classified the majority of preterm newborns as NEC-positive at birth. Over time, however, the two groups diverged, with NEC-negative patients showing a sudden drop in the system’s confidence that they would develop the disease. Most of the affected patients did not show this inflection; their prediction remained NEC-positive throughout the observation period (Fig. 5).
We next developed a transformation that converts the MIL model’s confidence score into a risk score, which is also influenced by the attention weight assigned to each new instance. The algorithm incorporates the neonate’s age and any past risk scores, and produces a new score that could be tracked over time. Using this approach on each patient from the Warner et al. and Olm et al. datasets, we observed divergence of risk scores between the NEC-affected and unaffected patients (Fig. 6A). Importantly, we performed risk score calculations for each patient within the two datasets using a MIL model that had never encountered that patient’s data. In this way, we were able to simulate what the risk score would have been, over time, for each patient in our two datasets if our system had been used during the course of their NICU admission.
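We do not reproduce the exact transformation here. Purely as an illustration of its recursive shape (new evidence weighted by the newest instance’s attention weight, with an age term and dependence on the previous score), a toy update could look like the following; age_scale and the blending rule are hypothetical.

```python
def update_risk(prev_risk, confidence, attention_weight, day_of_life,
                age_scale=30.0):
    """Toy recursive risk update, NOT the study's actual transformation:
    the model's confidence is blended into the running risk score with a
    weight set by the newest instance's attention weight, damped early in
    life (age_scale is a hypothetical constant)."""
    w = attention_weight * min(day_of_life / age_scale, 1.0)
    return (1.0 - w) * prev_risk + w * confidence
```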
In order to determine a maximally predictive risk score threshold, we evaluated a range of cutoff values, comparing their relative sensitivities and specificities for NEC. Results of this analysis are shown in Additional file 5: Fig. S4. We determined that a risk score of 0.35 had optimal performance when applied to our dataset, maximizing accuracy with an average lead-time (i.e. time before disease onset that the risk score rose above threshold) of 8.3 days. Figure 6B, C show cumulative correct predictions in each dataset as a function of time, demonstrating statistically significant differences in prediction rates between affected and unaffected patients. Figure 6D shows the relative proportion of true and false positives and negatives—as well as lead-times for risk scores >0.35—for all patients in this study. Figure 6E shows test characteristics for our system. Across the two studies, using a risk cutoff of 0.35, we found an overall sensitivity of 86% and a specificity of 90%.
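The cutoff search itself is straightforward; a minimal sketch follows, assuming a patient counts as screen-positive if their risk score ever exceeds the candidate threshold.

```python
import numpy as np

def sweep_thresholds(risk_curves, labels, thresholds=np.linspace(0, 1, 101)):
    """risk_curves: list of per-patient risk-score time series;
    labels: per-patient NEC outcomes. Returns (threshold, sensitivity,
    specificity) tuples for each candidate cutoff."""
    peaks = np.array([max(curve) for curve in risk_curves])
    y = np.asarray(labels, dtype=bool)
    results = []
    for t in thresholds:
        pred = peaks > t
        sens = (pred & y).sum() / y.sum()
        spec = (~pred & ~y).sum() / (~y).sum()
        results.append((t, sens, spec))
    return results
```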
Interpretation of the growing bag model reveals bacterial taxa that drive accurate NEC predictions
We sought to interpret the major drivers of the prediction inflections revealed through the growing bag analysis, reasoning that interpretability data from our model would add to the validity of our approach if they highlighted bacterial taxa that past work has shown correlate with NEC. We also speculated that interpretability data might point to previously unstudied bacterial taxa that play a role in NEC pathogenesis, and could therefore be useful for generating new hypotheses about how NEC develops.
To interpret NEC predictions from our system, we took advantage of the attention weight module within the MIL architecture. We identified the highest attention weight instance from each positive bag in our two datasets, extracted these instances, and removed all clinical metadata so that only microbiome data remained in the feature set. Next, we performed random forest analysis on these high-attention instances to identify the microbial features that covary with attention weight (Fig. 3D). We analyzed each level of the bacterial taxonomic tree independently and performed five independent rounds of training and testing.
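A sketch of this step with scikit-learn follows; the forest size and the toy input shapes are placeholders, not the study’s settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_top: microbiome features of the single highest-attention instance per
# bag, with clinical metadata dropped; y: the corresponding NEC outcomes.
rng = np.random.default_rng(0)
X_top, y = rng.random((60, 20)), rng.integers(0, 2, 60)  # toy stand-ins
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_top, y)
# Rank features by their contribution to the forest's splits.
ranked = sorted(enumerate(rf.feature_importances_), key=lambda p: -p[1])
```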
Using only the highest attention weight instances, random forest analysis was able to discriminate NEC-affected from non-affected patients, yielding ROC curve areas of 0.86 and 0.79 for test sets from the Warner and Olm datasets, respectively (Additional file 6: Fig. S5).
Ranked importance values for the highest-influence features, shown in Fig. 7, demonstrate congruence between our results (shown for bacterial phylum, class, and family levels) and those reported by Warner et al. and Olm et al. As in those studies, we found that the phyla Firmicutes and Proteobacteria, the class Gammaproteobacteria, and the family Enterobacteriaceae were at or near the top of the importance-ranked feature lists for each taxonomic level we evaluated. Complete feature importance rankings from both studies are provided in Additional file 7: Data S2.
Interestingly, some features emerged as important in one study dataset but not the other. For example, Flavobacteriaceae appeared on the list of important families from the Olm dataset, but not from the Warner dataset (Fig. 7). This may reflect technical differences in how the two datasets were generated (for example, the divergent next-generation sequencing methods), such that sensitivity for these taxa was higher in one study than the other. Alternatively, these differences may represent intrinsic aspects of the separate study populations. NEC pathogenesis may be geographically—as well as individually—heterogeneous, and interpretation differences between the two datasets may reflect this heterogeneity.
In addition to identifying previously suspected bacterial contributors to NEC pathogenesis, our analyses also highlight additional potentially important taxa. The classes Clostridia and Actinobacteria, for instance, ranked third and fourth in importance (after Bacilli and Gammaproteobacteria), raising the possibility that these less-explored taxa may be worth investigating in future studies.