Fizzy: feature subset selection for metagenomics

Background: Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- and β-diversity. Feature subset selection, a sub-field of machine learning, can also provide unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can obtain the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we used information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between age groups in the human gut microbiome. Results: We have developed a new Python command line tool, compatible with the widely adopted BIOM format, that implements information-theoretic subset selection methods for microbial ecologists working with biological data formats. We demonstrate the software tool's capabilities on publicly available datasets. Conclusions: We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.


Background
There is an immense amount of sequence data being collected from next-generation sequencers. Sequences from bacterial communities are collected from whole genome shotgun (WGS) or amplicon sequencing runs, and the analysis of such data allows researchers to study the functional or taxonomic composition of a sample. Microbial ecologists represent the composition in the form of an abundance matrix, which usually holds counts of operational taxonomic units (OTUs), but can also hold counts of gene or metabolic pathway occurrences if the data are collected from WGS. Furthermore, collections of metagenomic samples contain different factors, or phenotypes, such as environmental pH and salinity values, or a health-related status [1,2].
In this work, we introduce software tools for microbial ecology researchers that implement feature subset selection routines for biological data formats. Prior to feature selection, we assume that the raw sequences from the environmental samples have already been classified into operational taxonomic units (OTUs), or functional features. The raw OTU counts are stored in a matrix X ∈ ℕ₊^{K×M}, where ℕ₊ is the set of positive natural numbers, K is the number of OTU clusters, and M is the number of samples collected. The M samples carry a significant amount of meta-data describing each sample, which is where we obtain the phenotypes. While there may be many different meta-data fields, we focus on only one field at a time. For example, a sample may contain the sex, age, and height of the person from whom the sample was collected, and the analysis would use only one of those fields; that is, we could use X to build a predictive model of sex. Both the data matrix and meta-data can be found for hundreds of publicly available datasets through pioneering projects such as MG-RAST [3], KBase [4], the Human Microbiome Project [5], and the Earth Microbiome Project [6].
*Correspondence: gailr@ece.drexel.edu. Department of Electrical & Computer Engineering, Drexel University, 3141 Chestnut St., Philadelphia, PA 19104, USA. Full list of author information is available at the end of the article.
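Concretely, the data layout described above can be pictured as follows. This is a toy sketch with invented counts; the meta-data field names (sex, age, height_cm) are illustrative, not fields mandated by any format:

```python
import numpy as np

# Toy abundance matrix: K = 4 OTU clusters (rows), M = 3 samples (columns),
# entries are raw OTU counts, so X has shape (K, M).
X = np.array([[12, 0,  3],
              [ 5, 9,  1],
              [ 0, 2, 14],
              [ 7, 7,  7]])

# Per-sample meta-data; an analysis uses exactly one field at a time.
metadata = [
    {"sex": "F", "age": 34, "height_cm": 162},
    {"sex": "M", "age": 51, "height_cm": 180},
    {"sex": "F", "age": 28, "height_cm": 171},
]

# Choose one phenotype (here: sex) as the outcome to predict from X.
y = np.array([m["sex"] for m in metadata])
print(X.shape, list(y))   # (4, 3) ['F', 'M', 'F']
```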
A natural question to ask about studies with multiple phenotypes is: "which OTUs or functions are important for differentiating the phenotypes?" Answering such a question can be useful for understanding which conditions are driving, or being affected by, differences in composition and function across samples. Subset selection is the process of taking a high-dimensional dataset and reducing the size of the feature set such that the reduced subset contains only relevant features [7]. Subset selection can produce a feature subset that not only removes irrelevant features (i.e., features that do not carry information about the phenotype), but also excludes redundant features (i.e., features that carry the same information as other features). This reduction of the feature set offers rapid insight into the differences between multiple populations in a metagenomic study and can be performed as a complementary analysis to β-diversity methods such as PCoA. Feature selection has been performed previously by tools such as Random Forests [8] and LEfSe [9], but is usually tied to a classification type or effect size.

Information-theoretic subset selection
One of the fundamental quantities in information theory that has been widely adopted for feature subset selection with filters is mutual information, which is given by:

I(X; Y) = Σ_{x∈𝒳} Σ_{y∈𝒴} p_{X,Y}(x, y) log[ p_{X,Y}(x, y) / (p_X(x) p_Y(y)) ]   (1)

where p_X(x) is the marginal distribution over the random variable X, p_Y(y) is the marginal distribution over Y, and p_{X,Y}(x, y) is the joint probability distribution over X and Y. The supports of the random variables X and Y are denoted 𝒳 and 𝒴. The mutual information can be used as a scoring function for determining the set of features F that carry the most information about an outcome Y. A simple algorithm for feature selection with a filter is a greedy forward selection search that seeks to maximize a feature scoring function J, which is shown in Fig. 1. The search initializes the relevant feature set F to be empty; then, for k iterations, an objective function J is maximized. For example, this objective function could be written as

J(X) = I(X; Y) − α Σ_{X′∈F} I(X; X′) + β Σ_{X′∈F} I(X; X′ | Y)   (2)

where α, β ≥ 0. The first term in the expression captures the relevancy of the variable X. The next two terms measure the redundancy and conditional redundancy of X with the relevant feature set F, respectively. Note that the sign of the conditional redundancy term is positive, to reward features that are jointly informative about the class variable Y. The feature that maximizes this expression is added to the relevant feature set F and removed from the set of candidate features. Simply using mutual information as the objective function (α = β = 0) is a fast way for microbial ecologists to examine the relative importance of taxa in a study collected from environmental samples; however, mutual information alone will not capture inter-feature dependencies. Other objective functions, such as joint mutual information (JMI) [10], with α = β = 1/|F|, or mRMR [11], with α = 1/|F| and β = 0, capture some of the inter-feature dependencies.
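To make the greedy search concrete, below is a minimal, self-contained Python sketch of the filter in Fig. 1. It is illustrative rather than Fizzy's actual FEAST-backed implementation: mutual information is estimated from empirical frequencies of discrete (e.g., binned count) data, and all function names are our own.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete vectors."""
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * np.mean(y == yv)))
    return mi

def conditional_mi(x, z, y):
    """I(X; Z | Y) for discrete vectors: condition on each value of Y."""
    return sum(np.mean(y == yv) * mutual_information(x[y == yv], z[y == yv])
               for yv in np.unique(y))

def greedy_select(X, y, k, alpha=1.0, beta=1.0):
    """Greedy forward search maximizing, for each candidate feature Xj,
    J(Xj) = I(Xj;Y) - alpha*sum_f I(Xj;Xf) + beta*sum_f I(Xj;Xf|Y)."""
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for j in remaining:
            score = mutual_information(X[:, j], y)
            for f in selected:                      # redundancy terms
                score -= alpha * mutual_information(X[:, j], X[:, f])
                score += beta * conditional_mi(X[:, j], X[:, f], y)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy check: feature 0 is the phenotype itself, feature 1 is an exact copy,
# feature 2 is noise. A pure mutual-information ranking (alpha = beta = 0)
# picks the informative feature first and then its redundant copy.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = np.column_stack([y, y, rng.integers(0, 4, 200)])
print(greedy_select(X, y, k=2, alpha=0.0, beta=0.0))   # [0, 1]
```

With α > 0 the second pick is penalized for its redundancy with feature 0, which is exactly the behavior the redundancy term in the objective is designed to encourage.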
Our recent work developed Neyman-Pearson Feature Selection (NPFS), which automatically detects the relevant features in a dataset using a generic scoring function J [12,13]. NPFS is highly parallelizable, which allows it to be quite effective for very large datasets. NPFS works by mapping random subsamples of the original dataset to a scoring function, which makes a prediction about which features are relevant. The same number of features is selected from each sub-dataset; then, in a reduction phase, NPFS applies the Neyman-Pearson test to detect feature importance. In this setting, NPFS can detect the number of important OTUs simply by guessing k in Fig. 1 for the scoring function and letting the hypothesis test detect the features that appear to be important. NPFS was found to improve traditional methods of feature selection while remaining highly parallelizable.
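The mechanics can be sketched as follows. This is a simplified, serial stand-in for NPFS rather than the parallel implementation shipped with Fizzy: the base scoring function here is a plain correlation ranking (instead of a FEAST objective), and the Neyman-Pearson critical value comes from a naive binomial-CDF loop.

```python
import math
import numpy as np

def topk_by_correlation(X, y, k):
    """Stand-in base scoring function: top-k features by |corr(Xj, y)|."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

def binom_critical(B, p, alpha=0.01):
    """Smallest c such that P(Binomial(B, p) > c) <= alpha."""
    cdf = 0.0
    for c in range(B + 1):
        cdf += math.comb(B, c) * p**c * (1 - p)**(B - c)
        if 1.0 - cdf <= alpha:
            return c
    return B

def npfs(X, y, k, n_bootstraps=50, alpha=0.01, seed=0):
    """NPFS-style reduction: a feature is declared relevant if it is selected
    across bootstraps significantly more often than the null rate p = k/K."""
    rng = np.random.default_rng(seed)
    M, K = X.shape
    counts = np.zeros(K, dtype=int)
    for _ in range(n_bootstraps):
        idx = rng.integers(0, M, M)              # bootstrap sample of rows
        for j in topk_by_correlation(X[idx], y[idx], k):
            counts[j] += 1
    c_crit = binom_critical(n_bootstraps, k / K, alpha)
    return [j for j in range(K) if counts[j] > c_crit]

# Toy check: feature 0 equals the outcome; the other nine are pure noise.
# The guessed k = 2 overshoots, but the hypothesis test prunes the noise.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 150).astype(float)
X = np.column_stack([y] + [rng.normal(size=150) for _ in range(9)])
print(npfs(X, y, k=2))   # feature 0 should be declared relevant
```

Because each bootstrap is independent, the loop over bootstraps is the part that the real NPFS distributes across cores.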

Subset selection via regularization
Section 'Information-theoretic subset selection' introduced a greedy algorithm and tools from information theory that can be used to select features deemed important by the scoring function. Now we present feature selection from an embedded perspective. Let y be a vector in {±1}^M containing a binary outcome (e.g., control or stimulus) and X be the abundance matrix. Predictions are made on y with X^T θ, where θ ∈ ℝ^K. If many of the entries of θ are zero, then we can view the inner product of θ with X as a form of feature selection. To encourage sparsity in the solution for θ, Tibshirani presented lasso, which adds a penalty on the ℓ1-norm of θ [14]. Formally, lasso is given by:

θ̂ = argmin_{θ∈ℝ^K} ‖y − X^T θ‖₂² + λ‖θ‖₁   (3)

where λ > 0, and ‖·‖₁ and ‖·‖₂ are the ℓ1- and ℓ2-norms, respectively. For lasso to be effective at feature selection, it is assumed that K ≫ M, which is typically an acceptable assumption with 16S and metagenomic data because there are usually only a few samples and a large number of features.
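As a sketch of what the embedded approach looks like in practice, the snippet below uses scikit-learn's LassoCV (the same library Fizzy's lasso module builds on, as described under Software implementations) on synthetic data. The matrix sizes and coefficients are invented, a continuous response is used for simplicity, and the samples-by-features matrix passed to scikit-learn corresponds to X^T in the notation above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for an abundance study with K >> M: M = 60 samples,
# K = 200 features, where only features 0 and 1 carry signal about y.
rng = np.random.default_rng(0)
A = rng.poisson(5.0, size=(60, 200)).astype(float)   # samples x features = X^T
y = 2.0 * A[:, 0] - 3.0 * A[:, 1] + rng.normal(scale=0.5, size=60)

# Cross-validation sweeps a data-driven grid of regularization values,
# mirroring how Fizzy chooses lambda.
model = LassoCV(cv=5).fit(A, y)

# Nonzero coefficients are the selected features.
selected = np.flatnonzero(model.coef_)
print(model.alpha_ > 0, 0 in selected, 1 in selected)
```

The sparsity induced by the ℓ1 penalty is what turns the fitted model into a feature selector: every feature whose coefficient survives at the cross-validated λ is "selected".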

Software implementations
Fizzy is a suite of subset selection tools that takes the Biom standard format [15] as input, owing to the format's acceptance as a standard by the Genomic Standards Consortium (http://gensc.org). Commonly used software for analyzing data from microbial ecology, such as Qiime [16], requires a Biom file containing the 16S data and a map file containing the meta-data of the samples within the Biom file. However, Fizzy allows users to store the meta-data in the Biom file directly, thus avoiding the requirement for both a Biom and a map file.
The Fizzy software suite implements information-theoretic subset selection, NPFS, and lasso. The core of Fizzy is based on the FEAST C feature selection library [17], which is used to implement all of the information-theoretic methods. FEAST was selected for two primary reasons: (i) the library contains a large selection of information-theoretic feature selection objective functions, and (ii) the run-time of FEAST is typically faster than that of other feature selection libraries because it is written in a compiled language. We implemented a Python interface for FEAST to use within Fizzy, which is available to the public¹. The Fizzy tool requires that a Biom format OTU table (sparse or dense), a mapping file in tab-delimited (TSV) format, a metagenomic phenotype column in the map file, and an output file path be specified. Furthermore, Fizzy allows the user to specify the number of taxonomic units to select as well as the feature selection objective function. The current implementation of Fizzy has nine subset selection objective functions, all based on information theory (see Brown et al. [17] for the mathematical details of the objective functions). We also provide an implementation of the NPFS module, which can infer the number of relevant features given any subset selection method in FEAST [12]. Since NPFS works on top of a generic scoring function, we denote the scoring function used with NPFS as NPFS-SF, where SF is a scoring function such as MIM, mRMR, or JMI. NPFS has a parallel implementation in which the user can control the number of cores used by the program. The lasso implementation within Fizzy uses Scikit-Learn [18]. The regularization parameter for lasso is found using cross-validation and a grid search, where the values swept over the grid are determined from the data. The λ that minimizes the cost function is chosen for the final model.

Benchmark data sets
We benchmarked Fizzy using data collected from the American Gut (AG) Project [19] and Qin et al.'s study of IBD patients [1] (both datasets are publicly available). The gut samples from the AG Project study are filtered into a separate Biom file for Fizzy, and the diet type of the individual is used as the metagenomic phenotype. Diet was discriminated based on whether people's diets included terrestrial animals: omnivores included those who ate chicken and/or red meat, while vegetarians included those who ate seafood but no terrestrial animals. Qin et al.'s data are sampled from the gut, and we use IBD and control as the metagenomic phenotype. The data used in our experiments have been made publicly available².

Results and discussion
We compared five algorithms on the American Gut Project data set: JMI (Table 1), NPFS-JMI (Table 2), Random Forest Classifiers (RFC) (Table 3), LEfSe (Table 4), and lasso (no table, because only one feature was selected; see below). The regularization parameter for lasso, λ in (3), was chosen to be 1.188 × 10⁻³ after performing cross-validation. JMI is implemented in Fizzy, lasso is available through our implementation, and NPFS-JMI is our novel method; these are compared to currently popular methods such as RFC (used in [16,20]) and LEfSe.
The algorithms were run on 2.9k+ samples collected from the AG Project, and features were selected using the diet type as the predictor variable. The diets are broken down into omnivore and vegetarian, where sub-categories of omnivore and vegetarian (e.g., omnivore but does not eat red meat) are simply categorized as omnivore. Table 1 shows the top-ranking OTUs selected for differentiating omnivores versus vegetarians in the AG Project data. (In the tables, the number followed by "F" indicates the order in which the method selected the OTU, and the "GGID" column contains the Greengenes OTU ID from the taxonomic classification.) Both Bacteroides and Prevotella were detected among the variables selected by Fizzy (note that Prevotella is not shown in the table because it was not ranked within the top 15 OTUs); both have been hypothesized to be important differentiators of diet [21]. This effect was also observed when we evaluated only vegans and omnivores. NPFS detected 27 OTUs of the Prevotella genus, and when examining the largest differences, the relative abundances were larger for the vegetarians, which coincides with results in the literature [22]. Differences between the JMI and NPFS-JMI OTU rankings could be due to a large cluster of features that carry similar relevance, which, combined with the bootstrapping in NPFS, could rank them in a different order.
We also compare Fizzy to Qiime's random forests [8], because random forests within Qiime have become a commonly used benchmark in microbial ecology, as well as to LEfSe [9]. The top-ranked features for the random forests are found in Table 3. As with feature selection approaches such as mRMR and JMI, a threshold for the number of features to select must be chosen in advance. We find some overlap between the results of Fizzy (using JMI) and the random forests: the Bacteroides genus was detected as relevant several times by both. Bacteroides has been found to be an indicator of diet [23][24][25]. However, LEfSe returns different subsets of features than the proposed methods or the random forests (see Table 4).

Table 4
List of the largest differences in abundance between omnivores and vegetarians in the 16S data collected from the American Gut Project using LEfSe. Note that LEfSe does not return the Greengenes IDs.

Figure 2 shows the largest differences between the omnivores and vegetarians among the top 500 OTUs selected by JMI. The labels on the x-axis identify each OTU in the form "(F#) GGID", where the plotted difference is ×10⁻³, (F#) is the order in which JMI ranked the feature, GGID is the Greengenes ID, and a negative value means that the average relative abundance was higher in the vegetarians. Lasso selected only one OTU (of the family Ruminococcaceae) after cross-validation and a sweep of the regularization parameter; decreasing the regularization parameter could lead to more OTUs being selected, at the cost of a larger error rate. It is interesting to observe that features 3 (F127) and 9 (F458) have opposing signs, yet they belong to the same family. We hypothesize that this result can be explained by different species having different responses to environmental conditions. The top Pfams that maximize the mutual information for the MetaHit data set are shown in Table 5. It is known that in IBD patients the expression of the ABC transporter protein (PF00005, the first feature MIM selected for classifying IBD vs. non-IBD samples) is decreased, which limits protection against various luminal threats [26]. The feature selection for IBD also identified glycosyl transferase (PF00535), whose alteration is hypothesized to result in the recruitment of bacteria to the gut mucosa and increased inflammation [27,28], and acetyltransferase (PF00583), whose genotype plays an important role in the pathogenesis of IBD and is useful in the diagnosis and treatment of IBD [29].
It is not surprising that the ABC transporter (PF00005) is also selected for obesity, as it is known to mediate fatty acid transport associated with obesity and insulin-resistant states [30], as are ATPases (PF02518), which catalyze dephosphorylation reactions to release energy. Figure 3 shows the evaluation time of six feature selection algorithms and the number of features they select, evaluated on data collected from Caporaso et al. [31]. Both lasso and NPFS-MIM select the size of the relevant set automatically, which is why each is represented as a single point. An interesting observation is that lasso selects very few features (NPFS-MIM selects nearly triple that number). It should be noted, though, that lasso is capable of capturing more feature interdependencies than the current information-theoretic approaches presented in Fizzy. Furthermore, Qiime's RFC implementation is quite a bit slower than NPFS-MIM, but, as with lasso, the RFC can capture larger groups of feature interdependencies than the information-theoretic implementations. MIM, as expected, has the fastest evaluation time because there is no calculation of redundancy, and the approaches that use redundancy (JMI and mRMR) take significantly longer to run. For fairness of comparison, note that the evaluation time of NPFS can increase when choosing base subset selection objective functions that incorporate a redundancy term.

Conclusions
Feature subset selection provides an avenue for rapid insight into the taxonomic or functional differences that can be found between different metagenomic or 16S phenotypes in an environmental study. We have presented information-theoretic feature subset selection methods, NPFS, and lasso for biological data formats, implemented in Python and compatible with the formats used by the Qiime software package.