Fatty Acid Methyl Ester data
The Bacillus fatty acid methyl ester (FAME) data set of Slabbinck et al. [5] was used. Gas chromatographic FAME profiles were generated after growing the bacteria, as described in the protocol of the Sherlock Microbial Identification System of MIDI Inc. (Newark, DE, USA). This protocol defines a standard for growth and culturing of bacterial strains, and reproducibility and interpretability of the profiles are only possible when working under the described conditions. Specifically, the protocol recommends 24 h of growth on trypticase soy broth agar at a temperature of 28°C. Following gas chromatographic analysis, the use of a calibration mix and the TSBA 50 peak naming table, the resulting FAME profiles are standardized by calculating relative peak areas for each named peak with respect to the total named area. Our data set covers 71 identified fatty acids (or features) and 74 validly published Bacillus species (or classes). This number is about one half of the Bacillus species published by IJSEM as of March 2008 [41]. The total number of FAME profiles (or data instances) is 961. The resulting data set was exported from the joint FAME database of the Laboratory of Microbiology (Ghent University, Belgium) and the BCCM™/LMG Bacteria collection.
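The standardization step above can be sketched as follows; the peak names and raw areas are hypothetical, and the function is an illustrative stand-in for the MIDI software's own computation:

```python
def relative_peak_areas(named_peaks):
    """Standardize a FAME profile: each named peak's area divided by the
    total named area, expressed as a percentage. Illustrative sketch only;
    'named_peaks' maps fatty acid names to raw peak areas."""
    total = sum(named_peaks.values())
    return {name: 100.0 * area / total for name, area in named_peaks.items()}

# hypothetical raw areas for three named peaks
profile = {"16:0": 1200.0, "15:0 anteiso": 2400.0, "17:0 iso": 400.0}
print(relative_peak_areas(profile)["16:0"])  # 30.0
```

By construction, the relative areas of each profile sum to 100, which makes profiles of different total intensity comparable.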
Random Forests
In 2001, Breiman [42] proposed a new machine learning technique consisting of an ensemble of classification trees, better known as Random Forests. A Random Forest (RF) classifier can be defined as a collection of decision trees in which each tree casts a unit vote for the most popular class for each input. Each tree in the forest is grown on a bootstrap sample drawn randomly, with replacement, from the original data set; such a sample has the same size as the original data set but contains, on average, about two-thirds of the distinct data points. Aggregating the classifiers built upon the bootstrap samples is called bagging or bootstrap aggregation. The data points absent from a tree's bootstrap sample are called 'out-of-bag', and when this set is used to evaluate the accuracy of the grown tree, the resulting prediction error is called the out-of-bag error. Randomly sampling data sets to grow the different trees of the forest is one of the two sources of randomness in RFs. The second lies in the random split selection: when M features are present in the original data set, m features (m << M) are sampled randomly from the M features to split each node of the tree, and the best split based on these m features is chosen [42, 43]. Note that m is held constant while the forest is grown. Each tree is maximally extended and no pruning occurs. In our study, a grid search was performed to optimize the number of trees and the number of split variables. All numbers of features were considered for split variable selection, and forest sizes from 1000 to 4000 trees in steps of 250 trees were considered for tuning the number of trees.
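The grid described above can be enumerated as in the following sketch; `parameter_grid` is a hypothetical helper, not the authors' code:

```python
def parameter_grid(n_features):
    """All candidate split-variable counts m (1..M) crossed with forest
    sizes from 1000 to 4000 trees in steps of 250 (illustrative sketch
    of the grid search described above)."""
    tree_counts = range(1000, 4001, 250)
    return [(m, t) for m in range(1, n_features + 1) for t in tree_counts]

grid = parameter_grid(71)  # 71 FAME features in this data set
print(len(grid))  # 71 m values x 13 tree counts = 923 combinations
```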
Evaluation of the classification accuracy of a RF classifier is based on the different out-of-bag data sets. For each data point i of the data set, all trees for which i was out-of-bag are considered, and i is put down each of these trees. Class j is defined as the class that received the most votes over all trees for which data point i was out-of-bag. The proportion of times that j differs from the true class, averaged over all data points, is the out-of-bag error estimate. A RF classifier consisting of weakly correlated trees with high individual strength yields optimal generalization and high accuracy. Moreover, from the law of large numbers and the tree structure, it follows that the generalization error of RFs converges to a limiting value, implying that a RF classifier does not overfit as the number of trees in the forest grows large [42]. Hastie et al. [44] likewise remark that increasing the number of trees does not cause overfitting, provided that not all features are used as split variables and the number of noise variables remains limited.
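The out-of-bag error estimate can be illustrated with a minimal sketch; the vote lists and labels below are toy data, not output of Breiman's software:

```python
from collections import Counter

def oob_error(oob_votes, true_labels):
    """Out-of-bag error: for each data point, take the majority class over
    the votes of all trees for which the point was out-of-bag, and count
    disagreements with the true label (illustrative sketch)."""
    errors = 0
    for votes, truth in zip(oob_votes, true_labels):
        majority = Counter(votes).most_common(1)[0][0]
        if majority != truth:
            errors += 1
    return errors / len(true_labels)

# toy example: 4 data points, votes collected whenever each was out-of-bag
votes = [["A", "A", "B"], ["B", "B"], ["A", "B", "B"], ["A"]]
print(oob_error(votes, ["A", "B", "A", "B"]))  # 0.5
```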
Modeling was performed by the RFs software available at the website of Leo Breiman [43].
Divisive clustering
A divisive clustering algorithm builds a top-down cluster hierarchy or tree, starting from a single cluster that corresponds to the entire data set (i.e. the root of the tree). In every iteration of the algorithm, one cluster is selected and split into two new clusters. If the distances between all points in the clusters are considered, this results in a traditional unsupervised clustering procedure. In contrast, supervised divisive clustering also takes the class labels of the respective data points into account, and calculates distances only between data points with differing class labels. As a consequence, the final number of clusters in supervised clustering equals the number of classes present in the original data set [45, 46].
Popular divisive clustering strategies are single linkage, complete linkage, average linkage, Ward linkage, etc., and various metrics can be applied as distance measure within these strategies. In our initial study, for the construction of a tree using solely FAME data, the distance metric was based on the performance of binary classifiers trained for all possible splits of the data. In this setting, cluster distances are not simply computed from FAME profiles; the class labels are also taken into account. As the first splitting criterion, the area under the ROC curve (AUC) was considered, and in case of ties, the splitting was refined by accounting for the average linkage of the probability estimates of both classes. The rationale behind this second splitting criterion is that the classifier corresponding to the largest average distance between the probability estimates should be preferred over other classifiers. The Euclidean distance was considered as metric on the probability estimates.
For each level in the divisive top-down setting, a RF classifier is built for all possible two-group combinations of the considered species or classes. As such, each combination corresponds to a two-class classification task. For each node or level, this results in 2^(n-1) - 1 combinations, with n the number of classes considered. Note that, when considering four classes, the combination of classes 1 and 2 automatically excludes the combination of classes 3 and 4. The divisive clustering stops when only two-class clusters are retained. To speed up the divisive clustering and classification process, no grid search and no cross-validation were considered. Specifically, the initial FAME data set was randomly split into a stratified training set and test set, with one-third of the data set used for testing. For each combination, the corresponding profiles for training and testing were sampled from these two subsets. The forest size was optimized using the default number of split variables (m = √M, with M the number of features). After fixing the forest size resulting in the lowest error rate, the optimal number of split variables was selected among √M/2, 2√M and M, again corresponding to the lowest error rate. Ultimately, a rooted tree was constructed with equal branch lengths and the different nodes were labeled with the corresponding AUC value. The resulting tree was visualized with the treeing method of the TaxonGap software [47].
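The 2^(n-1) - 1 two-group combinations per node can be enumerated as in the following sketch; the class labels are placeholders, and complementary splits are counted once by always keeping the first class in the left group:

```python
from itertools import combinations

def two_group_splits(classes):
    """All 2^(n-1) - 1 ways to split a set of n classes into two non-empty
    groups (illustrative sketch; e.g. {1,2} vs {3,4} and {3,4} vs {1,2}
    count as one split)."""
    first, rest = classes[0], classes[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = (first,) + combo
            right = tuple(c for c in rest if c not in combo)
            if right:  # skip the trivial split with an empty group
                splits.append((left, right))
    return splits

splits = two_group_splits(("c1", "c2", "c3", "c4"))
print(len(splits))  # 2^(4-1) - 1 = 7
```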
As an initial proof-of-concept, 15 Bacillus species were selected from the original data set. Selection was based on classes with a reasonable sample size and on classes that are taxonomically closely related to each other, e.g. species of the Bacillus cereus and Bacillus subtilis groups. The first selection criterion was chosen to avoid heavily imbalanced data subsets. The following species with respective numbers of profiles were selected: Bacillus aquimaris (12), Bacillus atrophaeus *s (21), Bacillus cereus *c (62), Bacillus coagulans (32), Bacillus drentensis (38), Bacillus fumarioli (28), Bacillus galactosidilyticus (12), Bacillus licheniformis *s (74), Bacillus megaterium (28), Bacillus mycoides *c (11), Bacillus patagoniensis (12), Bacillus pumilus *s (57), Bacillus sporothermodurans (17), Bacillus subtilis *s (64) and Bacillus thuringiensis *c (12). Species annotated with '*c' belong to the Bacillus cereus group, while species annotated with '*s' belong to the Bacillus subtilis group [41, 48, 49]. The species of these two groups are expected to cluster together.
To further speed up the clustering process, computations were performed in parallel on an Intel Blade cluster.
Phylogenetic analysis
The SILVA database was used for 16S rRNA gene sequence selection. SILVA provides quality-checked and aligned 16S rRNA gene sequences. For each type strain of each species present in our data set, a 16S rRNA gene sequence was selected. If multiple 16S rRNA gene sequences were available for a type strain, the final sequence was selected based on best quality and longest sequence length. In SILVA, quality is assessed in three ways: pintail quality for sequence anomaly detection, sequence quality and alignment quality [15]. A list of the selected accession numbers can be found in Additional file 4.
Sequence distance calculation was performed with PHYLIP, the PHYLogeny Inference Package version 3.68, using the program Dnadist [35, 36]. The Jukes-Cantor evolution model was used for correcting the nucleotide distances. This DNA sequence evolution model assumes an equal and independent change rate for each nucleotide, so that substitution of one nucleotide by any of the three other nucleotides occurs with equal probability. All other parameters were kept at their defaults, except for the input format. Based on the resulting distance matrix, a NJ tree and a UPGMA tree were created using the PHYLIP program Neighbor [33, 34, 37]. Default parameter settings were used, except for a randomized input order of the species. Phylogenetic trees were created for the species present in both the full data set and the 15 species data set. All trees were visualized with the iTOL webtool version 1.5 [40].
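The Jukes-Cantor correction applied by Dnadist can be sketched with the standard formula d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites between two aligned sequences:

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor corrected evolutionary distance from the observed
    proportion p of differing sites; under the model, every substitution
    is equally likely. (Sketch of the correction, not PHYLIP's code.)"""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(round(jukes_cantor(0.10), 4))  # 0.1073
```

The corrected distance always exceeds the raw proportion p, because it accounts for multiple substitutions at the same site.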
Phylogenetic learning
Based on the 16S rRNA gene phylogenetic trees, a classification scheme with a hierarchical class structure was developed. In this setting, a rooted phylogenetic tree is regarded as a directed acyclic graph with bifurcating nodes. The main idea is similar to that of binary tree classifiers [26–28]. However, in contrast to our study, these authors inferred a tree from the data used for classification, whereas we considered phylogenetic information (16S rRNA gene) for tree inference and used FAME data solely for classification. We call this approach phylogenetic learning. As a simple, naïve approach, a two-class RF classifier was trained at each node of the 16S rRNA gene phylogenetic tree, based on a subset of the FAME data set. Herein, only the profiles belonging to that part of the tree were taken for training and testing. The two branches of the node defined the two groups of the binary classification task, and at each node, positive and negative dummy class labels were created.
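The top-down identification of a profile along such a hierarchy can be sketched as below; the tree structure (a nested dict) and the rule-based classifiers are hypothetical stand-ins for the per-node RF classifiers:

```python
def classify_along_tree(node, profile):
    """Walk a rooted binary tree of two-class classifiers: at each internal
    node, the classifier picks the left or right branch; the walk stops at
    a leaf, which names the predicted species. (Illustrative sketch; 'node'
    is a hypothetical dict, not the authors' data structure.)"""
    while "classifier" in node:
        branch = node["classifier"](profile)  # returns "left" or "right"
        node = node[branch]
    return node["species"]

# toy two-level hierarchy with rule-based stand-ins for the RF classifiers
tree = {
    "classifier": lambda p: "left" if p[0] > 0.5 else "right",
    "left": {"species": "B. cereus"},
    "right": {
        "classifier": lambda p: "left" if p[1] > 0.5 else "right",
        "left": {"species": "B. subtilis"},
        "right": {"species": "B. pumilus"},
    },
}
print(classify_along_tree(tree, (0.2, 0.8)))  # B. subtilis
```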
Given the tree hierarchy, classifiers constructed on terminal nodes and a certain number of parent nodes can become biased due to a small training set size. Herein, terminal nodes are regarded as nodes splitting into two leaves. As a consequence of the small sample size for certain species, splitting the data set into a training set and a test set was not an option. We overcame this issue by using cross-validation techniques, as explained in the next paragraph. Furthermore, the classification performance could easily be evaluated, since each profile was presented to the classification hierarchy and its path was fixed. In the case of an incorrectly classified profile at a specific node in the tree, propagation along the true path stopped, and the corresponding profile was further identified along the predicted path. Therefore, the path and ultimate predicted class of each profile could be determined, and a multi-class confusion matrix could be generated for statistical analysis (discussed in the next subsection). As an interesting feature, this method offers the possibility to investigate where misclassification mostly occurs along the phylogenetic tree. Hence, the misclassification distance for each species could be estimated by averaging the correct path length of each incorrectly classified profile. This implies that, for such a profile, the correct path length was incremented each time the corresponding classifier identified the true branch.
Incrementing continued until the considered profile was incorrectly classified. Note that the node resulting in misclassification also incremented the path length and that the path length was also incremented when ultimate identification occurred in the correct leaf. In the latter case, the correct path length equals the maximal path length. For each class, the average correct path length was plotted against the maximal path length. An example is visualized in Figure 6.
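The counting rule for the correct path length can be sketched as follows; the boolean outcome lists are toy inputs, not results from the study:

```python
def correct_path_length(node_outcomes):
    """node_outcomes: booleans along a profile's true path, True when the
    classifier at that node identified the true branch. Every visited node
    increments the length, including the node at which misclassification
    occurred; a fully correct path yields the maximal path length.
    (Illustrative sketch of the counting rule described above.)"""
    length = 0
    for correct in node_outcomes:
        length += 1  # visited node always increments the path length
        if not correct:
            break    # propagation along the true path stops here
    return length

# profile misclassified at the third node of a depth-5 true path
print(correct_path_length([True, True, False, True, True]))  # 3
# profile identified in the correct leaf (maximal path length)
print(correct_path_length([True] * 5))  # 5
```

Averaging these lengths over the profiles of a species gives the average correct path length plotted against the maximal path length.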
Cross-validation and statistical analysis
In machine learning studies, data sets are typically split into a training set and a test set, where the latter usually contains one-third to one-half of the data. This test set is used as a hold-out set for estimating the final classifier error rate. To prevent over- and underfitting, parameter optimization during the training phase should not be performed on the test set, but on a separate validation set [45]. To this end, cross-validation can be used: the training set is split into k parts and, in each step of the k-fold cross-validation, a different part serves as validation set to retrieve the optimal parameter combination tuned during training on the k - 1 other parts [45]. In relation to our problem setting, two important issues should be taken into account: our data set is not particularly large, and some classes correspond to a (very) small number of profiles. Due to the small class sizes present in our data set, we chose to use the test set also as validation set, even though, for larger sample sizes, it is better experimental practice to perform a cross-validation on the training set. In the case of our FAME data set, two additional problems arise: classes are imbalanced, meaning that a different number of profiles is present for each species, and, as mentioned above, many species contain only a small number of profiles. To counter a possible effect of imbalance on the classifier performance, the true error rate can be estimated by stratifying training and test sets [38]. For the second problem, classification becomes problematic when two-class classifiers are built on small data sets; this can be addressed by performing cross-validation for performance estimation [39]. A three-fold stratified cross-validation was performed for both the hierarchical classification and the flat multi-class classification.
To prevent overfitting, the number of folds was set equal to the minimum number of profiles over all bacterial species, which is three in our case. Accordingly, the stratification proportion equals one-third. Given the identical nature of the probability estimates resulting from each RF model, we chose to aggregate all test sets into a joint test set for performance evaluation, a method better known as pooling [50]. Finally, for the pooled test set, an average of the error estimate for each class in a one-versus-all setting was calculated, next to the average and standard deviation of the error estimates over all classes. The statistics calculated were the AUC, accuracy, sensitivity, precision and F-score.
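The stratified fold assignment can be sketched with a simple round-robin scheme; this is an illustrative approximation, not the exact splitting procedure used in the study:

```python
from collections import defaultdict

def stratified_folds(labels, k=3):
    """Assign each profile index to one of k folds so that every class is
    spread as evenly as possible across the folds (round-robin sketch of
    the stratified three-fold cross-validation described above)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

labels = ["A"] * 6 + ["B"] * 3  # imbalanced toy classes
folds = stratified_folds(labels, k=3)
print([sorted(f) for f in folds])  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```

Each fold then holds one-third of every class; pooling simply concatenates the held-out predictions of the three folds before the statistics are computed.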
Besides the calculation of global performance measures, the performance at class level was also compared between flat multi-class classification and phylogenetic learning. The comparison is visualized in a cumulative plot (see Figure 5). Initially, flat multi-class classification and the corresponding classification results of each class were considered. A threshold on the metric was varied in steps of 0.01, with sensitivity and F-score as the metrics analyzed. The corresponding thresholds are plotted along the X-axis.
For each threshold, the classes with sensitivity or F-score values smaller than or equal to the threshold were selected. Subsequently, for each threshold and, thus, for each selected set of classes, the corresponding metric values obtained by phylogenetic learning were evaluated. The number of phylogenetic learning metric values larger than the corresponding values from flat multi-class classification is plotted against the left Y-axis. This number, expressed as a percentage of the corresponding class set size, is plotted against the right Y-axis.
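The construction of the cumulative comparison can be sketched as follows, with hypothetical per-class scores standing in for the real classification results:

```python
def cumulative_comparison(flat_scores, phylo_scores):
    """For each threshold t (steps of 0.01), select the classes whose flat
    multi-class score (e.g. sensitivity or F-score) is <= t, then count how
    many of them score strictly higher under phylogenetic learning. Returns
    (threshold, count, percentage) triples; a sketch of the cumulative plot,
    not the authors' plotting code."""
    results = []
    for i in range(101):
        t = i / 100  # thresholds 0.00, 0.01, ..., 1.00
        selected = [c for c in flat_scores if flat_scores[c] <= t]
        if not selected:
            continue
        wins = sum(1 for c in selected if phylo_scores[c] > flat_scores[c])
        results.append((round(t, 2), wins, 100.0 * wins / len(selected)))
    return results

flat = {"sp1": 0.40, "sp2": 0.70, "sp3": 0.90}   # hypothetical per-class scores
phylo = {"sp1": 0.60, "sp2": 0.65, "sp3": 0.95}
print(cumulative_comparison(flat, phylo)[0])  # (0.4, 1, 100.0)
```

The counts correspond to the left Y-axis of the plot and the percentages to the right Y-axis.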