eNetXplorer: an R package for the quantitative exploration of elastic net families for generalized linear models

Background
Regularized generalized linear models (GLMs) are popular regression methods in bioinformatics, particularly useful in scenarios with fewer observations than parameters/features, or when many of the features are correlated. In both ridge and lasso regularization, feature shrinkage is controlled by a penalty parameter λ. The elastic net introduces a mixing parameter α to tune the shrinkage continuously from ridge to lasso. Selecting α objectively, and determining which features contributed significantly to prediction after model fitting, remain practical challenges given the paucity of available software to evaluate performance and statistical significance.
Results
eNetXplorer builds on top of glmnet to address these issues for linear (Gaussian), binomial (logistic), and multinomial GLMs. It provides new functionality to empower practical applications by using a cross-validation framework that assesses the predictive performance and statistical significance of a family of elastic net models (as α is varied) and of the corresponding features that contribute to prediction. The user can select the quality metrics used to quantify the concordance between predicted and observed values, with defaults provided for each GLM. Statistical significance of each model (as defined by α) is determined by comparison to a set of null models generated by random permutations of the response; the same permutation-based approach is used to evaluate the significance of individual features. In the analysis of large and complex biological datasets, such as transcriptomic and proteomic data, eNetXplorer provides summary statistics, output tables, and visualizations to help assess which subset(s) of features have predictive value for a set of response measurements, and to what extent those subsets can be expanded or reduced via regularization.
Conclusions
This package presents a framework and software for exploratory data analysis and visualization. By making regularized GLMs more accessible and interpretable, eNetXplorer guides the process of generating hypotheses based on features significantly associated with biological phenotypes of interest, e.g. to identify biomarkers for therapeutic responsiveness. eNetXplorer is also generally applicable to any research area that may benefit from predictive modeling and feature identification using regularized GLMs. The package is available under the GPL-3 license at the CRAN repository, https://CRAN.R-project.org/package=eNetXplorer.
Electronic supplementary material
The online version of this article (10.1186/s12859-019-2778-5) contains supplementary material, which is available to authorized users.


About
The R package eNetXplorer is available under GPL-3 license at the CRAN repository. The package source is located at https://CRAN.R-project.org/package=eNetXplorer.

Installation
To install to your default directory, type:

install.packages("eNetXplorer")

eNetXplorer's Workflow
In order to describe eNetXplorer's workflow, this section presents the analysis pipeline applied to a synthetic dataset; to further illustrate eNetXplorer's features, real datasets distributed with the package are described in the sections below.
First, we create a function to generate data for a Gaussian (linear regression) model, which consists of:
• an input numerical matrix with n_inst instances or observations (as rows) and n_pred predictors or features (as columns); and
• an input numerical vector of length n_inst with the observed response.
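Such a generator can be sketched as follows; this is a hypothetical implementation (not code from the package), with the function name, dimensions, and noise levels chosen as assumptions to match the narrative, where features 1-5 form a correlated block that is informative for the response.

```r
# Hypothetical data generator: features 1-5 share a latent signal that also
# drives the response, so they form a correlated, informative block.
generate_gaussian_data <- function(n_inst = 50, n_pred = 60, seed = 123) {
  set.seed(seed)
  predictor <- matrix(rnorm(n_inst * n_pred), nrow = n_inst, ncol = n_pred)
  latent <- rnorm(n_inst)                        # shared latent signal
  predictor[, 1:5] <- predictor[, 1:5] + latent  # informative correlated block
  response <- latent + rnorm(n_inst, sd = 0.5)   # response driven by the block
  colnames(predictor) <- paste0("feat_", seq_len(n_pred))
  list(predictor = predictor, response = response)
}

data <- generate_gaussian_data()  # n_pred > n_inst, as in the text
```

With these choices the number of features exceeds the number of observations, which is the regime that motivates regularization in the discussion below.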

We also examine the correlation between predictors and the response; by design, the block of features 1-5 carries the largest positive correlation with the response. Since the number of features is larger than the number of observations, we need a regularized regression model. But which model? Ridge will fit a model where all predictors (including the non-informative ones) have non-zero contributions. Lasso, at the other extreme, will exclude most features and provide a minimal model representation. The elastic net allows us to scan the regularization path from ridge to lasso via the mixing parameter alpha. However, the following open questions remain:
• Which alpha represents the top-performing regularized model?
• What is the model-level statistical significance across alpha?
• What is the feature-level statistical significance of a given alpha-model?
• How does feature-level statistical significance change across alpha?
To address these questions, eNetXplorer generates an ensemble of null models (based on random permutations of the response) for a family of regularized models spanning ridge to lasso. First, we load the eNetXplorer package:

library(eNetXplorer)
Next, we run eNetXplorer on the dataset we just generated. The call to eNetXplorer with default parameters is:

fit_def = eNetXplorer(x = data$predictor, y = data$response, family = "gaussian")

Results can be made more precise by increasing the number of cross-validation runs (n_run) and the number of null-model response permutations per run (n_perm_null), as well as by choosing a smaller step in the path of alpha models:

fit = eNetXplorer(x = data$predictor, y = data$response, family = "gaussian",
                  alpha = seq(0, 1, by = 0.1), n_run = 1000, n_perm_null = 250, seed = 123)

The function summary generates a brief report on the results; for each alpha, it displays the optimal lambda (obtained by maximizing a quality function over out-of-bag instances), the corresponding maximum value of the quality function, and the model significance (p-value based on comparison to permutation null models).

summary(fit)
## Call:
## eNetXplorer(x = data$predictor, y = data$response, family = "gaussian",
##     alpha = seq(0, 1, by = 0.1), n_run = 1000, n_perm_null = 250, seed = 123)

Here, we observe that more regularized models (alpha = 0.5-0.8) favor features 5, 8 and 2 (at significance p-value < 0.1); towards lasso (alpha = 0.9, 1), feature 5 is the only one selected at p-value < 0.05-0.1. As expected, these models are more stringent on feature selection than less regularized (i.e. smaller alpha) models. It is also interesting to observe that lasso-like models remove feature redundancies: from the block of correlated features 1-5, only feature 5 is picked up as representative. This example illustrates some important characteristics of mixed-regularization model families:
• Less regularized (smaller alpha) models promote redundancy; they benefit from information borrowed across significantly correlated predictors; they provide larger signatures, which are potentially more robust and resilient under measurement noise; and they offer more opportunities for systems-level interpretation (e.g. downstream pathway analysis in the context of genomics).
• More regularized (larger alpha) models promote sparsity; they tend to pick just one predictor out of a set of correlated ones; they may facilitate interpretation with high-dimensional datasets and/or in the absence of systems-level annotations; and they provide smaller signatures, which may be more useful in certain contexts (e.g. biomarker panels).
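Feature-level significance across alpha can also be inspected graphically. A sketch of such a call is shown below; the plot.type and stat names are assumptions based on the package documentation and should be checked against ?plot.eNetXplorer before use.

```r
# Caterpillar plot of feature coefficients for one alpha-model (here alpha = 0.6),
# with permutation-based significance annotations.
# "featureCaterpillar" and stat = "coef" are assumed plot arguments.
plot(fit, alpha.index = which(fit$alpha == 0.6),
     plot.type = "featureCaterpillar", stat = "coef")
```

Repeating the call for several values of alpha.index shows how the selected feature set shrinks as the model moves from ridge-like to lasso-like regularization.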
In order to gather more details regarding a particular solution, we plot the quality function across the range of values for the regularization parameter lambda:

[Figure: quality function vs lambda for alpha = 0.2; QF = Pearson correlation]
If so desired, eNetXplorer allows end-users to extend the number of lambda values (via nlambda) and/or extend their range while keeping the lambda density uniform in log scale (via nlambda.ext).
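A sketch of these calls follows; the "lambdaVsQF" plot type and the exact semantics of nlambda/nlambda.ext are assumptions based on the package documentation, so treat them as illustrative rather than definitive.

```r
# Quality function vs lambda for the alpha = 0.2 model
plot(fit, alpha.index = which(fit$alpha == 0.2), plot.type = "lambdaVsQF")

# Refit with a denser lambda path (nlambda) and an extended lambda range
# (nlambda.ext); argument semantics per the package documentation.
fit_ext = eNetXplorer(x = data$predictor, y = data$response, family = "gaussian",
                      nlambda = 200, nlambda.ext = 1000, seed = 123)
```

Extending the lambda path is useful when the quality function has not clearly peaked within the default range.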
There may be outlier instances that require further examination; we generate a scatterplot of the response vs out-of-bag predictions across all instances:

plot(fit, alpha.index = which.max(fit$model_QF_est), plot.type = "measuredVsOOB")

Naturally, model performance across alpha is strongly dependent on the structure of the input dataset. In the case example above, we purposefully generated a block of informative correlated predictors to highlight the characteristics of less regularized (smaller alpha) models and their ability to leverage borrowed information.

[Figure: model performance vs alpha]
Here, we observe that the quality function appears to increase monotonically with alpha; the maximum corresponds to the lasso solution, alpha=1.
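This performance-vs-alpha view can be generated with the package's summary plot; the "summary" plot type below is an assumption based on the package documentation.

```r
# Quality function (with permutation-based significance) across the alpha family;
# plot.type = "summary" is assumed from the package documentation.
plot(fit, plot.type = "summary")
```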

Datasets
The eNetXplorer package provides two datasets of biological interest:
• H1N1_Flow, comprising longitudinal cell population frequencies and titer responses upon H1N1 vaccination; and
• Leukemia_miR, which contains microRNA (miR) expression data from cell lines and primary (patient) samples classified by different acute leukemia phenotypes, as well as normal control samples sorted by cell type.

H1N1_Flow
The H1N1_Flow dataset comprises data from a cohort of healthy subjects vaccinated against the influenza virus H1N1 (Tsang et al. 2014). Using five different 15-color flow cytometry stains for T-cell, B-cell, dendritic cell, and monocyte deep-phenotyping, 113 cell population frequencies were measured longitudinally pre-(days -7, 0) and post-vaccination (days 1, 7, 70) on a cohort of 49 healthy human subjects (F=31, M=18, median age=24). Cell populations were manually gated and expressed as percent of parent. Samples and cell populations were filtered independently for each timepoint; samples were excluded if the median of the fraction of viable cells across all five tubes was <0.7, while cell populations were excluded if >80% of samples had <20 cells. Data were log10-transformed and pooled across all timepoints, then adjusted for age, gender, and ethnicity effects. For each timepoint, a numerical matrix of predictors is provided with subjects as rows and cell populations as columns. The response is the adjusted maximum fold change (adjMFC) of serum titers at day 70 relative to baseline, as defined in (Tsang et al. 2014). Two versions of the serum titer response are provided in the package; one as a numerical vector and the other one as a categorical vector discretized into low ("0"), intermediate ("1") and high ("2") response classes. A metadata file with cell population annotations is also provided.
To load the dataset:

data(H1N1_Flow)
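As an illustration, the discretized titer response could be modeled from the cell population frequencies with a multinomial GLM. The object names inside H1N1_Flow below are hypothetical placeholders; check the package documentation for the actual names.

```r
data(H1N1_Flow)
# Hypothetical names for a timepoint's predictor matrix and the discretized
# low/intermediate/high response classes ("0", "1", "2").
fit_h1n1 = eNetXplorer(x = H1N1_Flow$predictor_day0,
                       y = H1N1_Flow$response_class,
                       family = "multinomial")
```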

Leukemia_miR
The Leukemia_miR dataset comprises human microRNA (miR) expression data of 847 miRs from 80 acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) samples (cell lines and primary patient samples), as well as normal controls sorted by cell type. To load the dataset:

data(Leukemia_miR)
To filter the full dataset:
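The exact filtering step depends on the structure of the loaded objects; the sketch below is hypothetical, and the object and column names are assumptions rather than the package's actual API.

```r
data(Leukemia_miR)
# Hypothetical: keep primary (patient) samples only, dropping cell lines,
# using an assumed sample-annotation field.
keep <- Leukemia_miR$sample_type == "primary"
expr_filt <- Leukemia_miR$expression[keep, ]
```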

Summary
• eNetXplorer addresses key questions regarding model regularization and feature selection based on permutation-based statistical significance tests. Results are provided in the form of standard plots, summary statistics and output tables.
• As illustrated with synthetic datasets generated from covariance matrices of normally distributed features and responses, eNetXplorer workflows are generally applicable to any dataset of n_inst observations across n_pred predictors or features that aims to explain a set of n_inst response values.
• Regularization models are particularly useful in scenarios involving a large number of features (even larger than the number of observations) and/or sets of significantly correlated features. These scenarios are typical of datasets generated by current technologies in molecular and cellular biology, but applications to other data-rich environments are certainly possible.