The algorithm of DectICO
We propose DectICO to select an optimal feature set from all ICO components dynamically and to train classifiers with feature matrices extracted from those feature sets. The algorithm scheme is described in Fig. 1. We let \( n_k \) (0 ≤ k ≤ N) denote the size of the feature set to be selected in round k, in descending order (\( n_0 > n_1 > \dots > n_N \)), where \( n_0 \) is the size of the entire ICO. The maximal number of feature-selection rounds is denoted by N and is defined by the user. \( S_0 \) is defined as the entire feature set and consists of all components of the ICO. For each round, the selected feature set of size \( n_k \) is denoted by \( S_k \) (0 < k ≤ N), which is a subset of \( S_{k-1} \). The feature matrix extracted from the training data with \( S_k \) is defined as \( F_k \) (0 < k ≤ N), whose rows are the feature vectors extracted from the training samples; \( F_0 \) denotes the feature matrix extracted from the training samples with \( S_0 \). In addition, \( a_k \) (0 < k ≤ N) denotes the leave-one-out cross-validation (LOOCV) accuracy of the classifier trained with \( F_k \) in each round.
The entire workflow of DectICO contains the following steps:
1. Input: a set of training samples given as raw sequencing reads, and a predefined feature-selection ladder \( \{n_0 \dots n_k\} \) (0 < k ≤ N) in descending order.
2. Initialization: set \( S_0 \) as the entire ICO and extract the feature matrix \( F_0 \) from the training set. Initialize k = 1.
3. Recursion: select the feature set \( S_k \) from \( S_{k-1} \) by kpls and extract \( F_k \) from the training samples with \( S_k \). Train a classifier with \( F_k \) and record its LOOCV accuracy \( a_k \). Let k = k + 1 and repeat this step until k = N (see the sketch after this list).
4. Output: select the optimal classifier, namely the one with the minimal number of feature components \( n_k \) and the maximal LOOCV accuracy \( a_k \).
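As a rough illustration of this recursion, the sketch below re-ranks the currently selected features in each round and keeps the top \( n_k \) of them. It is only a minimal sketch: plain PLS weights from scikit-learn stand in for the kernelPLS ranking, and scikit-learn's SVC stands in for Libsvm; neither is the authors' Perl/Matlab implementation.

```python
# Minimal sketch of the DectICO selection loop (illustrative only).
# Plain PLS weights stand in for the kernelPLS ranking and SVC for Libsvm;
# both substitutions are simplifying assumptions.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def dectico_sketch(F0, y, ladder):
    """F0: samples x ICO-components matrix, y: class labels,
    ladder: descending feature-set sizes [n_1, ..., n_N]."""
    selected = np.arange(F0.shape[1])            # S_0: the entire ICO
    results = []                                 # (n_k, S_k, a_k) per round
    for n_k in ladder:
        F_prev = F0[:, selected]                 # feature matrix for S_{k-1}
        pls = PLSRegression(n_components=2).fit(F_prev, y)
        weights = np.abs(pls.x_weights_).sum(axis=1)           # per-feature importance
        selected = selected[np.argsort(weights)[::-1][:n_k]]   # S_k, subset of S_{k-1}
        F_k = F0[:, selected]
        a_k = cross_val_score(SVC(), F_k, y, cv=LeaveOneOut()).mean()  # LOOCV accuracy
        results.append((n_k, selected.copy(), a_k))
    # optimal round: maximal LOOCV accuracy, ties broken by fewer features
    return max(results, key=lambda r: (r[2], -r[0]))
```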
DectICO employs the kpls algorithm to select features from the entire ICO dynamically. Therefore, the key characteristics of our algorithm are its use of the ICO and dynamic feature selection. We compared classification performance between our ICO-based method and a sequence-composition-based method, as well as between dynamic and non-dynamic feature selection.
Non-dynamic feature selection methods select a feature set from the entire feature set only once for each size of selected set. Furthermore, feature selection by kpls is based on the weights of all features as computed by kpls, and the weight ranking obtained on the entire feature set is not necessarily suitable for every size of selected set. Dynamic feature selection overcomes this problem by selecting from the feature set repeatedly, updating the weights of the selected feature set in each round (a short sketch contrasting the two strategies follows). DectICO is implemented in Perl and Matlab and was built using kernelPLS [36] and Libsvm [38]; it runs on both Windows and Linux. Source code is available at https://github.com/dingxiao8715/DectICO. Although DectICO is not designed for speed or low memory consumption, we also gathered statistics on its runtime and required RAM (see Additional file 1: Tables S6-S9). The results indicate that DectICO has acceptable runtime and memory consumption.
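The difference between the two strategies can be made concrete with a small sketch. Here `rank_features` is a hypothetical placeholder for the kpls weight computation (a simple label-correlation ranking is used purely for illustration); the point is only that the non-dynamic variant ranks once, while the dynamic variant re-ranks the currently selected features in every round.

```python
# Contrast of non-dynamic vs. dynamic selection (illustrative only).
import numpy as np

def rank_features(F, y):
    # placeholder ranking: absolute correlation of each feature with the labels
    return np.abs(np.corrcoef(F.T, y)[:-1, -1])

def non_dynamic(F0, y, ladder):
    order = np.argsort(rank_features(F0, y))[::-1]   # weights computed once
    return [order[:n] for n in ladder]               # same ranking reused for every size

def dynamic(F0, y, ladder):
    selected = np.arange(F0.shape[1])
    sets = []
    for n in ladder:
        w = rank_features(F0[:, selected], y)        # weights recomputed each round
        selected = selected[np.argsort(w)[::-1][:n]]
        sets.append(selected.copy())
    return sets
```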
Intrinsic correlation of oligonucleotides
The sequence feature itself is the most important element of alignment-free classification methods. There are two kinds of sequence features: sequence composition and sequence correlation. Sequence composition measures the content of different components in a DNA sequence, such as single bases or oligonucleotides, and it is widely used in genome analysis. In contrast, sequence correlation represents the relationship among different components in genomes, which carries deeper information about the genomes. In this study, we investigated classification performance for both the ICO and a sequence-composition-based method.
As a kind of sequence correlation, the ICO represents the correlation between two consecutive parts of oligonucleotides of fixed length [32]. Given an oligonucleotide of length k (i.e., a k-mer), we can separate it into two consecutive parts i and j of lengths m and n respectively (m ranges from 1 to k-1 and n = k-m). The ICO (m, n) for a genomic sequence S is defined as a descriptor that indicates the correlation between any consecutive parts i and j within S. Let A and B be the sets of all oligonucleotides of length m and n, respectively. The numbers of components in sets A and B depend on the lengths of i and j. For example, when we evaluate ICO (1, 3), i represents an arbitrary single base such as A or C, j represents an arbitrary trimer such as ACT or GAT, A contains the four bases {A, C, T, G}, and B contains all 64 trimers {AAA, AAC, …, GGG}. According to this rationale, the ICO for a genomic sequence S based on the k-mer is a combination of k-1 types of ICO, i.e., ICO (1, k-1), ICO (2, k-2), …, ICO (k-1, 1). For example, the ICO based on 4-mers contains ICO (1, 3), ICO (2, 2), and ICO (3, 1). In general, the ICO (m, n) consists of two sections: the first section describes the correlation between the two consecutive oligonucleotides (or bases) i and j, and the second section represents the average mutual information between them. The definition of the first section is:
$$ {f}_{ij}=\frac{p_{ij}}{p_i{p}_j} $$
where \( p_{ij} \) represents the probability of occurrence of the junction between the two oligonucleotides i and j, and \( p_i \) and \( p_j \) represent the probabilities of occurrence of i and j, respectively, in a sequence.
The second section of the ICO vector is based on information theory. We proposed this section to help explore deeper relationships between two oligonucleotides. The definition for this section is:
$$ I(i)={\displaystyle \sum_{j\in B}{p}_{j/i}{ \log}_2\left(\frac{p_{ij}}{p_i{p}_j}\right)}\kern0.5em \left(i\in A\right) $$
where A and B are the sets of all oligonucleotides of length m and n, respectively. I(i) represents the average mutual information about i acquired from j. \( p_i \), \( p_j \), and \( p_{ij} \) are the same as in the above equation, and \( p_{j/i} \) represents the conditional probability of the occurrence of j given that i is fixed. The performance of the ICO in distinguishing genomes is detailed in [32].
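To make the two sections concrete, the sketch below estimates \( f_{ij} \) and I(i) for a single sequence from simple relative frequencies of m-mers, n-mers, and their (m+n)-mer junctions. It is only an illustrative sketch under that estimation assumption; the exact probability estimation and any smoothing used in [32] and DectICO may differ, and the conditional probability \( p_{j/i} \) is approximated here as \( p_{ij}/p_i \).

```python
# Illustrative sketch of computing ICO(m, n) features for one sequence.
# Probabilities are simple relative frequencies; the estimation used in
# the original ICO work may differ.
from collections import Counter
from itertools import product
from math import log2

def ico_features(seq, m, n):
    k = m + n
    bases = "ACGT"
    cnt_i  = Counter(seq[p:p+m] for p in range(len(seq) - m + 1))
    cnt_j  = Counter(seq[p:p+n] for p in range(len(seq) - n + 1))
    cnt_ij = Counter(seq[p:p+k] for p in range(len(seq) - k + 1))
    tot_i, tot_j, tot_ij = sum(cnt_i.values()), sum(cnt_j.values()), sum(cnt_ij.values())

    f, I = {}, {}                       # first section f_ij, second section I(i)
    for i in ("".join(t) for t in product(bases, repeat=m)):
        p_i = cnt_i[i] / tot_i
        info = 0.0
        for j in ("".join(t) for t in product(bases, repeat=n)):
            p_j  = cnt_j[j] / tot_j
            p_ij = cnt_ij[i + j] / tot_ij
            f[i + j] = p_ij / (p_i * p_j) if p_i and p_j else 0.0
            if p_ij and p_i and p_j:
                p_j_given_i = p_ij / p_i                 # approximation of p_{j/i}
                info += p_j_given_i * log2(p_ij / (p_i * p_j))
        I[i] = info
    return f, I      # together they form the ICO(m, n) part of the feature vector

# e.g. f, I = ico_features("ACGTACGTTGCAACGT", m=1, n=3)   # ICO(1, 3)
```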
It is noteworthy that we calculate the feature vector for each metagenomic sample as a whole, treating the sample as one integrated sequence set, instead of extracting a feature vector for each read in the sample and then computing the average feature vector.
The ICO and composition (each component represents the occurrence frequency of every oligonucleotide in a metagenomic sample) vectors are not of the same magnitude; therefore, we employ a simple normalization method described below:
$$ v_i' = \frac{v_i - v_{\min}}{v_{\max} - v_{\min}} \quad \left(i = 1 \dots n\right) $$
We assume the feature vector has n dimensions, and denote the original and normalized vector elements by \( v_i \) and \( v_i' \), respectively. \( v_{\max} \) and \( v_{\min} \) represent the maximum and minimum values among these components, respectively.
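A minimal sketch of this min-max scaling follows, assuming it is applied to each feature vector independently (the original implementation may scale differently):

```python
import numpy as np

def min_max_normalize(v):
    """Scale a feature vector to [0, 1] as v' = (v - v_min) / (v_max - v_min)."""
    v = np.asarray(v, dtype=float)
    v_min, v_max = v.min(), v.max()
    if v_max == v_min:                 # constant vector: avoid division by zero
        return np.zeros_like(v)
    return (v - v_min) / (v_max - v_min)
```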
Kernel partial least squares
We employ the kpls algorithm [36] for feature selection; it was first proposed for selecting features from microarray gene expression data for cancer sample classification. Kpls is based on pls [39] and the theory of Reproducing Kernel Hilbert Spaces [40]. In the following, we briefly introduce the basic kpls algorithm.
Pls is one of a broad class of methods for modeling relations between sets of observed features by means of latent variables called components [41]. To describe the algorithm conveniently, we denote by X a data matrix with N samples and by \( \underline{y} \) the class vector of the samples. The basic goal of pls is to obtain a low-dimensional approximation of the data matrix X such that the approximation is as close as possible to the given vector \( \underline{y} \). Namely, pls seeks a k × 1 vector \( \underline{w} \) satisfying \( \left\Vert \underline{w}\right\Vert =1 \) that maximizes \( \operatorname{cov}\left(X\underline{w},\underline{y}\right) \). \( X\underline{w} \) is denoted by \( \underline{t} \) and is called the component of X with respect to \( \underline{y} \). The approximation errors of X and \( \underline{y} \) are defined as \( E=X-\underline{t}{\underline{p}}^T \) and \( f=\underline{y}-q\underline{t} \), respectively, where \( \underline{p} \) is a k × 1 vector minimizing \( \left|X-\underline{t}{\underline{p}}^T\right| \) and q is a scalar minimizing \( \left|\underline{y}-q\underline{t}\right| \). Here \( \underline{p} \) and q are called the loadings of \( \underline{t} \) with respect to X and \( \underline{y} \), respectively. This process can be repeated until the required halting condition is satisfied (a minimal sketch of one such iteration is given below). A more detailed description of the algorithm can be found in [42].
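As a rough illustration of one such iteration (a sketch under the description above, not necessarily the exact formulation of [39] or [42]): with X and \( \underline{y} \) mean-centered, the covariance is maximized by taking \( \underline{w} \propto X^T\underline{y} \), after which X and \( \underline{y} \) are deflated by their approximations.

```python
# Minimal sketch of pls component extraction (PLS1-style), assuming X and y
# are mean-centered; illustrative only.
import numpy as np

def pls_components(X, y, n_components):
    X, y = X.astype(float).copy(), y.astype(float).copy()
    comps = []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)          # ||w|| = 1, maximizes cov(Xw, y)
        t = X @ w                       # component of X with respect to y
        p = X.T @ t / (t @ t)           # loading of t with respect to X
        q = (y @ t) / (t @ t)           # loading of t with respect to y
        X -= np.outer(t, p)             # E = X - t p^T
        y -= q * t                      # f = y - q t
        comps.append((w, t, p, q))
    return comps
```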
However, in real biological applications, linear relationships often fail to fully capture all the information among feature vectors extracted from biological data. Kernel methods approach this problem by projecting the data onto a high-dimensional feature space, and are commonly used for revealing the intrinsic relationships hidden in the raw data. The kernel version of pls uses a nonlinear transformation Φ(·) to map the feature matrix into a higher-dimensional kernel space K, i.e., \( \Phi : x_i \in X^{N \times k} \rightarrow \Phi(x_i) \in K \). However, instead of calculating the explicit mathematical expression of the nonlinear mapping, we only need to state the entire algorithm in terms of dot products between pairs of inputs and substitute the kernel function K(·,·) for those dot products. A detailed description of kpls can be found in [36].
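To make the kernel substitution concrete, the sketch below computes a Gaussian (RBF) kernel matrix, one common choice of K(·,·); entry (a, b) plays the role of the dot product between Φ(x_a) and Φ(x_b) without ever computing Φ explicitly. The specific kernel used by kernelPLS [36] is not restated in this section, so the RBF form is an illustrative assumption.

```python
# Illustrative RBF kernel matrix: K[a, b] replaces the dot product
# <Phi(x_a), Phi(x_b)> without computing Phi explicitly.
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.clip(sq_dists, 0.0, None))
```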
Description of the datasets
We conducted our experiments on three actual collections of metagenomes: two containing deep Illumina-based metagenomes, and one low-coverage metagenomic dataset sampled using 454 FLX Titanium technology. The first deep dataset was derived from the metagenomic project "A human gut microbial gene catalog established by deep metagenomic sequencing", which was obtained from the faecal samples of 124 European individuals and contains 25 IBD samples and 99 control samples [43]. The second deep dataset was derived from the metagenomic project "BGI Type 2 Diabetes study", which was also obtained from faecal samples, in this case from 145 Chinese individuals living in southern China, and includes 71 T2D samples and 74 control samples [44]. The low-coverage dataset was from the metagenomic study "Southampton Asthma metagenomics", which was obtained from both sputum and bronchoalveolar lavage samples of 55 individuals and includes 66 asthma samples and 22 control samples [45]. Detailed information on the three metagenomic datasets is given in the supplement (Additional file 1: Table S2).
Verification experiment
Our work in this paper focuses on verifying the stability and generality of the DectICO algorithm, and on comparing the classification performance of our proposed method with existing metagenomic sample classification methods. Therefore, we conducted two kinds of experiments, which we define as the stability test and the generality test according to their purposes.
In practical applications of metagenomic sample classification, different researchers usually sample from different individuals for a specific disease. Consequently, multiple classifiers targeting the same disease will be trained on different samples. The similarity among the performances of these classifiers reflects the stability of the metagenomic classification algorithm used. Therefore, we consider a classification algorithm stable if the classifiers trained on a given kind of metagenomic data with different training sets have similar classification performance. Our stability test was designed to verify the stability of a classification algorithm. Initially, different groups of diseased and control samples of a fixed sample size are randomly selected from all of the samples 20 times, and then classifiers are trained on these 20 training sets. The stabilities of classification algorithms can then be compared using the cross-validation accuracies of the 20 classifiers (a minimal sketch of this resampling procedure is given below).
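The resampling protocol of the stability test can be summarized as follows. `train_and_loocv` is a hypothetical stand-in for the full DectICO pipeline (feature selection, training, and LOOCV), and the group sizes should be taken from Additional file 1: Table S3 rather than from values chosen here.

```python
# Sketch of the stability test: 20 random training sets of fixed size,
# one classifier per set, stability judged from the spread of accuracies.
import numpy as np

def stability_test(diseased_ids, control_ids, n_dis, n_ctrl, train_and_loocv,
                   n_repeats=20, seed=0):
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_repeats):
        train_dis  = rng.choice(diseased_ids, size=n_dis,  replace=False)
        train_ctrl = rng.choice(control_ids,  size=n_ctrl, replace=False)
        accuracies.append(train_and_loocv(train_dis, train_ctrl))
    return np.mean(accuracies), np.std(accuracies)
```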
The acquisition of diseased samples for a specific disease can be a limiting factor. A classifier trained on such a limited set of labeled samples should nevertheless classify all, or most, of the unlabeled samples accurately; that is, the classifiers produced by a classification algorithm should generalize well. Our generality test was designed to evaluate the generality of our method. Initially, we randomly select a group of diseased and control samples from all samples, and a classifier is trained on this training set. Next, 20 testing sets of the same sample size are randomly selected from the remaining samples. The classification accuracies on these testing sets are then obtained with the trained classifier. The generality of our proposed method can be assessed from the differences between these classification accuracies. The numbers of samples in the training and testing sets for the three metagenomic datasets are described in Additional file 1 (Table S3).
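The generality test can be sketched in the same spirit: a single classifier is trained once, and 20 random testing sets of equal size are drawn from the remaining samples. `train_classifier` and `evaluate` are hypothetical placeholders for the DectICO training step and the accuracy (or F1) computation.

```python
# Sketch of the generality test: one classifier, 20 random testing sets
# drawn from the samples not used for training.
import numpy as np

def generality_test(all_ids, train_ids, n_test, train_classifier, evaluate,
                    n_repeats=20, seed=0):
    rng = np.random.default_rng(seed)
    model = train_classifier(train_ids)                  # single training set
    rest = np.setdiff1d(all_ids, train_ids)              # remaining samples
    scores = []
    for _ in range(n_repeats):
        test_ids = rng.choice(rest, size=n_test, replace=False)
        scores.append(evaluate(model, test_ids))         # e.g. accuracy or F1-measure
    return scores
```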
Note that the numbers of diseased and control samples in the testing sets of our generality test are unbalanced (Additional file 1: Table S3). Therefore, we used the F1-measure to evaluate the classification performance on the testing sets (a small computation sketch follows). The F1-measure is defined in Additional file 1.
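For reference, the standard form of the F1-measure (the harmonic mean of precision and recall) is shown below; it is assumed, rather than confirmed here, that this matches the definition given in Additional file 1.

```python
# Standard F1 score from true positives, false positives and false negatives;
# assumed to match the definition in Additional file 1.
def f1_measure(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```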