AFITbin: a metagenomic contig binning method using aggregate l-mer frequency based on initial and terminal nucleotides

Darabi, Amin; Sobhani, Sayeh; Aghdam, Rosa; Eslahchi, Changiz

doi:10.1186/s12859-024-05859-7

Research
Open access
Published: 16 July 2024



BMC Bioinformatics volume 25, Article number: 241 (2024)


Abstract

Background

Using next-generation sequencing technologies, scientists can sequence complex microbial communities directly from the environment. Significant insights into the structure, diversity, and ecology of microbial communities have resulted from the study of metagenomics. The assembly of reads into longer contigs, which are then binned into groups of contigs that correspond to different species in the metagenomic sample, is a crucial step in the analysis of metagenomics. It is necessary to organize these contigs into operational taxonomic units (OTUs) for further taxonomic profiling and functional analysis. For binning, which is synonymous with the clustering of OTUs, the tetra-nucleotide frequency (TNF) is typically utilized as a compositional feature for each OTU.

Results

In this paper, we present AFIT, a new l-mer statistic vector for each contig, and AFITBin, a novel method for metagenomic binning based on AFIT and a matrix factorization method. To evaluate the performance of the AFIT vector, the t-SNE algorithm is used to compare species clustering based on AFIT and TNF information. In addition, the efficacy of AFITBin is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning methods such as MetaBAT 2, MaxBin 2.0, CONCOCT, MetaCon, SolidBin, BusyBee Web, and MetaBinner. To further analyze the performance of the proposed AFIT vector, we compare the barcodes of the AFIT vector and the TNF vector.

Conclusion

The results demonstrate that AFITBin shows superior performance in taxonomic identification compared to existing methods, leveraging the AFIT vector for improved results in metagenomic binning. This approach holds promise for advancing the analysis of metagenomic data, providing more reliable insights into microbial community composition and function.

Availability

A Python package is available at: https://github.com/SayehSobhani/AFITBin.


Introduction

Metagenomics is a fascinating field of study that investigates the genetic components found in environmental samples containing diverse microbial communities. Due to advancements in sequencing technologies, it has become more accessible and affordable to sequence microbes extracted from these environmental samples directly [1]. The process of obtaining genetic data involves simple steps of sampling, sequencing, and analysis. These samples contain a diverse community of microbes that resemble a laboratory microbe colony. However, in the study of metagenomics, genetic information is obtained by taking samples from environments such as the intestine or soil, which can consist of hundreds or thousands of unknown species [2].

After impurities are removed from these samples and the DNA strands are filtered, isolated, and fragmented, the samples are sequenced into reads [3, 4]. Metagenomic binning can be performed on reads before they are assembled into contigs. However, since reads are significantly shorter sequences, binning is typically performed after the contigs have been constructed. The resulting contigs belong to different microbes. Therefore, further metagenomic analysis requires algorithms that classify these contigs so that contigs related to the same organism are placed in the same class. Because the samples may contain unknown microbial species, an unsupervised clustering method is required [5, 6].

Clustering, which is a major component of metagenomic binning, is the process of grouping contigs, scaffolds, or genes according to their genetic characteristics, such as oligonucleotide frequency (referred to as l-mers) or coverage. The three types of current approaches for retrieving bins from metagenomic assemblies are: (i) nucleotide composition-based, (ii) differential abundance-based, and (iii) nucleotide composition and abundance-based [7,8,9].

Composition-based approaches rely on variations in oligonucleotide frequency, specifically tetra-nucleotide frequency (TNF). In contrast, differential abundance-based approaches rely on the coverage of contigs across diverse samples in which the abundance of organisms varies. Composition and abundance-based methods focus on making a combined distance matrix by combining analyses based on nucleotide composition and differential abundance [10,11,12]. Composition-based approaches employ the notion that each taxonomic unit (species, genus, etc.) has a unique nucleotide composition and conducts binning by comparing nucleotide content, principally oligonucleotide frequency, and guanine-cytosine (GC) content [8]. The majority of composition-based approaches have been applied to communities with genotypes exhibiting distinctive nucleotide composition patterns, including low GC content and stable oligonucleotide frequency [13]. In order to make the operation more computationally feasible, the sequence composition information is translated into numerical feature vectors. The most prevalent attributes are the normalized frequencies of oligonucleotides of a particular length. According to studies, the frequency of oligonucleotides varies between and within species. The TNF vector is one of the most common types of oligonucleotide frequencies. TNF is a 256-dimensional vector that represents the frequency of all 4-mers. Additionally, the GC content is used, as studies have shown that GC content varies between species [14]. Even though it has not been conclusively demonstrated, it is likely that this method will fail in communities with diverse oligonucleotide compositions [7].

Abundance-based approaches assume that either the distribution of sequenced reads in a single sample follows the Lander-Waterman model [15] or that the coverage profiles of contigs from the same genomes should be highly correlated across multiple samples [6, 8, 16, 17].

Methods based on composition and abundance combine the two previously mentioned approaches into one. It has been demonstrated that by combining composition and coverage information, which indicates species abundance, additional information can be extracted from metagenomic data, resulting in more accurate binning. Coverage information is computed by a contig’s coverage, which is the average number of reads per base from the sample within the contig. Expectation-maximization (EM) algorithms, probabilistic models, and principal component analysis (PCA) are utilized by these methods [18]. CONCOCT [19] uses PCA to reduce dimensionality and the Gaussian mixture model to cluster contigs into bins. MetaCon [20] discovers various distributions of l-mers based on the probabilities of l-mers in each contig; it uses the information contained in long contigs to guide the formation of clusters. MaxBin 2.0 [21] employs the Lander-Waterman model [15] and the EM algorithm to perform iterative clustering. Based on the frequency and abundance of l-mers, MetaBAT 2 [22] computes probabilistic distances between pairs of contigs. SolidBin [23] utilizes spectral clustering coupled with further biological information using a semi-supervised approach. BusyBee Web [24] bins contigs utilizing a bootstrapped supervised binning method. This binning method is for contigs that are 500 bp or longer by default. Due to restricted computing resources and the fact that BusyBee Web is a web-based application, specific limits on the size of the input data have been imposed. MetaBinner [25] employs single-copy gene data for k-means clustering algorithm to produce distinct component binning outcomes. Subsequently, it combines the outcomes of component binning using a two-stage ensemble strategy based on MetaWRAP [26] and UniteM (https://github.com/dparks1134/UniteM). 
Some of the above-mentioned methods, however, remove short contigs (e.g., shorter than 1000 bp) because composition and coverage properties are not reliable for short contigs. Also, most binning methods depend on similarity metrics between contigs based on k-mer frequency distributions.

In this paper, we present AFITBin, a novel method for contig binning that can bin millions of contigs derived from numerous samples. We demonstrated that AFITBin generates high-quality bins by employing contig coverage and a newly proposed composition vector that calculates the repetition frequency of substrings based on their initial and terminal base pairs. Similar to MetaCon, AFITBin clusters contigs in two stages: first, by eliminating short contigs and creating clusters, and subsequently, by assigning short contigs to their clusters. Then, we compared AFITBin’s performance to the methods mentioned earlier.

Method

AFIT: a novel composition vector

Because a DNA molecule consists of two strands, one of which is the reverse complement of the other, we group all 2-mers into ten classes, $T_i$, $1\le i\le 10$, each containing a 2-mer and its reverse complement. Four 2-mers, AT, CG, GC, and TA, are palindromes (they are equal to their own reverse complements). Therefore, six classes are of size two and four are of size one (Fig. 1). This study proposes a vector for calculating the repetition frequency of substrings of length between 2 and 10 using only their initial and terminal nucleotides. A substring of size l is in class $T_i$ if the 2-mer formed by its initial and terminal nucleotides belongs to $T_i$. For a contig Z, we consider all its substrings of size l, $2\le l\le 10$. Let $a_{l,i}$ denote the number of substrings of size l that belong to class $T_i$. The AFIT vector corresponding to contig Z is defined as follows:

$$\begin{aligned} AFIT_Z = (a_{2,1}, a_{2,2}, \ldots , a_{10,10}) \end{aligned}$$

By this method, we aggregate a contig’s l-mer frequencies based on their initial and terminal nucleotides to a vector (AFIT vector) of size 90.
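The construction can be sketched in a few lines of Python. This is a minimal illustration rather than the package's implementation: the function names, the lexicographic ordering of the ten classes, the use of raw counts without normalization, and the skipping of ambiguous bases are all assumptions made here.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def build_class_index():
    """Map each of the 16 2-mers to one of 10 classes (2-mer + reverse complement)."""
    representative = {}
    for first, last in product("ACGT", repeat=2):
        pair = first + last
        rev_comp = COMPLEMENT[last] + COMPLEMENT[first]
        representative[pair] = min(pair, rev_comp)
    # Assumed ordering: classes sorted by their lexicographically smaller member.
    order = {key: i for i, key in enumerate(sorted(set(representative.values())))}
    return {pair: order[key] for pair, key in representative.items()}


CLASS_INDEX = build_class_index()


def afit_vector(contig, min_len=2, max_len=10):
    """Return the 90 counts a_{l,i} for l = 2..10 and the 10 classes."""
    contig = contig.upper()
    vector = []
    for l in range(min_len, max_len + 1):
        counts = [0] * 10
        for start in range(len(contig) - l + 1):
            # Only the initial and terminal nucleotides of the l-mer matter.
            pair = contig[start] + contig[start + l - 1]
            idx = CLASS_INDEX.get(pair)  # None for ambiguous bases such as N
            if idx is not None:
                counts[idx] += 1
        vector.extend(counts)
    return vector
```

For the toy contig `ACGT`, the 2-mers AC and GT fall into the same class, so that class receives a count of 2 at l = 2, while the single 4-mer contributes one count to the AT class at l = 4.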

The rationale for this approach is to overcome limitations of traditional methods that rely on k-mer frequencies when binning metagenomic data. Typically, the short lengths of reads and contigs in metagenomic problems constrain the use of k-mers. Constructing a composition vector from k-mer frequencies enlarges the vector, particularly for larger k, which leads to sparsity in the composition vector of each contig. Specifically, in metagenomic data analysis, numerous k-mers may be absent from a given contig or occur with very low frequencies. Consequently, these zero or low-frequency components may not contribute significantly to the overall pattern or composition of the DNA sequence. To tackle this challenge, the AFIT approach aggregates k-mer information into equivalence classes based on the initial and terminal nucleotides. Its primary objective is to reduce the number of zero or low-frequency components within the vector. Reducing both the sparsity and the overall size of the composition vector yields a more appropriate representation. Additionally, the AFIT vector encapsulates information on substring compositions ranging from 2 to 10 nucleotides. By grouping similar information, it aims to improve the representation of the structural characteristics and composition of the DNA sequence. It is worth noting that a similar approach was used for the prediction of mRNA sub-cellular localization [27], which suggests that the proposed AFIT vector provides useful information on DNA sequences.

Model setup

In this section, we discuss AFITBin’s approach to the metagenomic contig binning challenge. As previously stated, most binning methods depend on similarity metrics between contigs based on l-mers frequency distributions. In this paper, we presented a new approach for generating the composition vector that outperforms prior methods based on l-mer counts, such as the TNF vector. Figure 2 depicts the AFITBin processing pipeline. Each step will be explained in detail in the following subsections. AFITBin utilizes two distinct genomic features, the AFIT vector and the coverage distance of contigs, in order to obtain the genomic bins. This is achieved through matrix factorization and solving an optimization problem. In two phases, this algorithm assigns contigs to their predetermined bins. First, contigs shorter than a predetermined length threshold are set aside, while the remaining contigs are assigned to bins based on the methodology explained in this section. Subsequently, the shorter contigs are assigned to their respective bins using a slightly different approach.

Let N be the number of contigs. In this step, the composition matrix is constructed, with each row holding the 90-dimensional AFIT vector of one contig, obtained as described in the previous section. The dimension of this feature matrix is $N\times 90$, and it is denoted in this paper by $A_{N\times 90}$. The coverage distance between two contigs is calculated using the mean and the variance of the coverage of each contig. It is assumed that the distribution of this data is approximately normal. Similar to [28], we consider the non-shared area under the normal distribution curves of the two contigs to be their coverage distance. As a result, we build a symmetric coverage matrix $C_{N\times N}=[c_{ij}]$, where $c_{ij}$ represents the coverage distance between contigs i and j. AFITBin consists of three steps:
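As a hedged illustration of the coverage distance, the sketch below integrates the pointwise minimum of two normal densities numerically and returns one minus that overlap. Interpreting the "non-shared area" as `1 - overlap` (so the distance lies in [0, 1]) is an assumption of this sketch, as are the function names, the integration range of eight standard deviations, and the midpoint rule; both variances must be positive.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def coverage_distance(mean_a, var_a, mean_b, var_b, steps=20000):
    """One minus the overlapping area of two normal densities (assumed scaling)."""
    std_a, std_b = math.sqrt(var_a), math.sqrt(var_b)
    lo = min(mean_a - 8 * std_a, mean_b - 8 * std_b)
    hi = max(mean_a + 8 * std_a, mean_b + 8 * std_b)
    dx = (hi - lo) / steps
    overlap = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx  # midpoint rule
        overlap += min(normal_pdf(x, mean_a, std_a), normal_pdf(x, mean_b, std_b)) * dx
    return max(0.0, 1.0 - overlap)
```

Two contigs with identical coverage statistics get a distance close to 0, and contigs whose coverage distributions barely overlap get a distance close to 1.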

Step 1: estimate the number of OTUs

The number of OTUs (bins) is a required input for AFITBin. To determine the number of OTUs in a dataset, we follow an approach similar to the method presented in [29]: the k-means algorithm [30] is initialized with a small number of bins, and k is increased until at least 40 percent of the bins remain empty. To use the k-means algorithm, a distance between contigs is required. The distance between two contigs i and j is defined as $\frac{1}{2}(d_{i,j}+c_{i,j})$, where $d_{i,j}$ denotes the Euclidean distance between the ith and jth rows of matrix A.
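The combined distance itself is straightforward to compute. The following sketch uses plain Python lists for readability; the function name is hypothetical, and the k-means search over k is deliberately left out because the text does not specify its details.

```python
import math


def combined_distance(A, C):
    """Return the N x N matrix with entries (d_ij + c_ij) / 2.

    A: list of N composition vectors (rows of the AFIT matrix).
    C: N x N coverage-distance matrix.
    """
    n = len(A)
    return [
        [0.5 * (math.dist(A[i], A[j]) + C[i][j]) for j in range(n)]
        for i in range(n)
    ]
```

For example, two contigs whose composition vectors are 5 apart in Euclidean distance and whose coverage distance is 1 end up with a combined distance of 3.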

Step 2: obtaining composition binning index and contig affiliation matrix

To prevent errors caused by insufficient compositional information, the rows and columns corresponding to contigs shorter than a threshold length are removed from matrices A and C to obtain the reduced matrices X and Y with m rows. We set the threshold to 1200 bp, following the methods used in [31, 32]. Assume that the input contigs are associated with k distinct bins, where k is obtained in the previous step. Using formula 1, we want to factor X into two matrices, $H_{m\times k}=[h_{ij}]$ and $W_{k\times 90}$, where W holds the composition index of each bin and H encodes the bin membership of each contig. As a result, each row of H is a one-hot vector: if contig i belongs to bin j, $h_{ij}=1$; otherwise, $h_{ij}=0$.

$$\begin{aligned} X = H \times W \end{aligned}$$

(1)

In order to obtain the matrices W and H, we solved the subsequent optimization problem shown in formula 2:

$$\begin{aligned} \arg \min _{W, H}\Vert X-HW\Vert ^2 \end{aligned}$$

(2)

where

$$\begin{aligned} H \in \{0, 1\}^{m\times k}, ~\text {and}~ \Vert H_n\Vert =1 ~\text {for every row}~ n \end{aligned}$$

where $H_n$ denotes the nth row of H and $\Vert \cdot \Vert$ denotes the Frobenius norm. Note that this optimization problem is an NP-hard integer programming problem that requires substantial computational effort to solve. To make it tractable, we relax the binary restriction on H. The problem is therefore modified to the following minimization problem:

$$\begin{aligned} \arg \min _{W, H}\Vert X-HW\Vert ^2 + \alpha \sum _{n=1}^{m}\Vert H_n\Vert ^2+\beta \Vert Y*(HH^T)\Vert ^2 \end{aligned}$$

(3)

where $H^T$ is the transpose of H, $*$ denotes the element-wise multiplication of two matrices, and $\alpha$ and $\beta$ are hyperparameters. The last term accounts for a binning error based on the coverage distance between two contigs. To solve this second minimization problem, the Conjugate Gradient method is used, and matrices W and H are optimized alternately at each iteration. After calculating the matrix H, contig i is assigned to bin j if and only if $h_{ij}$ is the maximum among all $h_{ir}$, $1\le r \le k$.
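To make the relaxed objective concrete, the sketch below evaluates it for given matrices and performs the final arg-max assignment. It does not implement the Conjugate Gradient optimization; the helper names are hypothetical, matrices are small lists of lists for readability, and the $\alpha$ term is read as the sum of squared row norms of H, which is an interpretation made here.

```python
def matmul(P, Q):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def frobenius_sq(M):
    """Squared Frobenius norm of a matrix."""
    return sum(value * value for row in M for value in row)


def relaxed_objective(X, H, W, Y, alpha, beta):
    """||X - HW||^2 + alpha * sum_n ||H_n||^2 + beta * ||Y * (H H^T)||^2."""
    HW = matmul(H, W)
    residual = [[x - hw for x, hw in zip(rx, rhw)] for rx, rhw in zip(X, HW)]
    H_transposed = [list(col) for col in zip(*H)]
    gram = matmul(H, H_transposed)  # entry (i, j) is large when i and j share a bin
    penalty = [[y * g for y, g in zip(ry, rg)] for ry, rg in zip(Y, gram)]
    return frobenius_sq(residual) + alpha * frobenius_sq(H) + beta * frobenius_sq(penalty)


def assign_bins(H):
    """Assign each contig (row of H) to the bin with the largest entry."""
    return [max(range(len(row)), key=row.__getitem__) for row in H]
```

With a perfectly one-hot H, an exact factorization X = HW, and zero coverage distance inside every bin, only the $\alpha$ term remains, so the objective equals $\alpha$ times the number of contigs.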

Step 3: assigning short contigs to bins

In the previous step, short contigs are set aside, long contigs are binned, and a composition feature vector is assigned to each bin. In this step, each short contig is assigned to one of the obtained bins. To achieve this, for every short contig v and bin B, we compute two scores. Let W(B) denote the row of W corresponding to bin B. The first score is defined as:

$$\begin{aligned} S_1(B,v)=d(W(B), AFIT_v). \end{aligned}$$

The second score is defined as:

$$\begin{aligned} S_2(B,v)=\frac{\sum _{b\in B} d(Y(v),Y(b))}{|B|} \end{aligned}$$

where b is a contig in B and d is the Euclidean distance. The contig v is then assigned to the bin B(v) if:

$$\begin{aligned} S_1(B(v),v)+S_2(B(v),v)=\min \{S_1(B,v)+S_2(B,v) \mid B \text { is one of the obtained bins}\}. \end{aligned}$$
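The two scores and the final minimization can be sketched as follows. The function and parameter names are hypothetical, and treating Y(v) and Y(b) as per-contig coverage vectors is an assumption of this sketch, since the text does not spell out their exact form.

```python
import math


def assign_short_contig(afit_v, cov_v, W, bin_coverages):
    """Return the index of the bin minimizing S1 + S2 for a short contig v.

    afit_v: AFIT vector of v.
    cov_v: coverage vector of v (assumed form of Y(v)).
    W: one composition vector per bin.
    bin_coverages: for each bin, the coverage vectors of its member contigs.
    """
    best_bin, best_score = None, math.inf
    for b, (w_row, members) in enumerate(zip(W, bin_coverages)):
        if not members:
            continue  # S2 is undefined for an empty bin
        s1 = math.dist(w_row, afit_v)
        s2 = sum(math.dist(cov_v, cov_b) for cov_b in members) / len(members)
        if s1 + s2 < best_score:
            best_bin, best_score = b, s1 + s2
    return best_bin
```

A short contig whose composition and coverage both resemble the second bin is assigned to it, because both scores are small for that bin.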

Datasets

In this section, we briefly describe the datasets used to evaluate our method. The Sharon dataset [33] is a real dataset used for evaluating various metagenomic analyses. This dataset consists of 18 fecal samples collected from a newborn infant at eleven distinct intervals. These samples were sequenced on an Illumina machine, and the resulting reads are accessible in the NCBI Sequence Read Archive database with the accession number SRA052203. The researchers who gathered this dataset assembled the reads into 2,329 contigs and, after analyzing the contigs, assigned them to 33 distinct microbial species. The UC Berkeley Genetic Information Database contains these contigs and identified variants (https://ggkbase.berkeley.edu/carrol/organisms).

Another dataset used to evaluate metagenomic binning methods is the CAMI [34] challenge dataset. Diverse datasets of varying complexity have been collected for this challenge to evaluate various metagenomic tools and analysis techniques. The three datasets used in this study, CAMI-Low, CAMI-Medium, and CAMI-High, contain one sample, two samples, and five samples, respectively. The public can access these datasets via the CAMI Challenge website (https://data.cami-challenge.org/participate).

As a simulated dataset, Strain and Species were considered in this paper, which were simulated by the authors of CONCOCT [19] using a microbial community from the Human Microbiome Project. The authors assembled contigs using reads that were analyzed in various Human Microbiome Project samples [35]. The coverage sequences were then constructed by comparing the constructed contigs to the reads.

Evaluation criteria

We utilize the precision, recall, and F-score metrics to assess AFITBin’s performance. Precision evaluates the accuracy of the classification, whereas recall evaluates its completeness. Therefore, the F-score, which is the harmonic mean of precision and recall, can be utilized to evaluate the performance of binning methods [36].

Assume n is the number of species in a metagenomic dataset and k is the number of bins returned by the binning method. The matrix $M_{k\times n}=[m_{ij}]$ is defined so that the entry $m_{ij}$ represents the number of contigs associated with species j that are placed in bin i by the binning method. The metrics are defined as follows:

$$\begin{aligned} \text {Precision}= & {} \frac{\sum _{i=1}^k\ \max _j m_{ij}}{\sum _{i=1}^k\sum _{j=1}^n m_{ij}} \end{aligned}$$

(4)

$$\begin{aligned} \text {Recall}= & {} \frac{\sum _{j=1}^n\ \max _i m_{ij}}{\sum _{i=1}^k\sum _{j=1}^n m_{ij}} \end{aligned}$$

(5)

$$\begin{aligned} \text {F-score}= & {} \frac{2*\text {Precision}*\text {Recall}}{\text {Precision}+\text {Recall}} \end{aligned}$$

(6)
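Equations 4 to 6 translate directly into code. In this small sketch the function name is hypothetical; rows of `M` index bins and columns index species, as the summation limits in Eqs. 4 and 5 imply.

```python
def binning_scores(M):
    """Precision, recall, and F-score from a k x n bin-by-species count matrix."""
    total = sum(sum(row) for row in M)
    # Precision: each bin is credited with its most frequent species.
    precision = sum(max(row) for row in M) / total
    # Recall: each species is credited with the bin holding most of its contigs.
    recall = sum(max(col) for col in zip(*M)) / total
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

If one species is split evenly across two bins while another sits alone in a third bin, for instance `M = [[5, 0], [5, 0], [0, 10]]`, every bin is pure, so precision is 1.0, but recall drops to 0.75 because half of the first species is separated from its largest bin.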

Results

AFITBin is compared to MaxBin 2.0, MetaBAT 2, CONCOCT, MetaCon, SolidBin, BusyBee Web, and MetaBinner, all of which are described in the previous sections. The parameters $\alpha$ and $\beta$, described in the previous section, are set to 2 and $\frac{3}{4}$, respectively, and the number of bins used by AFITBin for each dataset is shown in Table 1.

Table 1 This table shows the number of bins for datasets Strain, Species, Sharon, CAMI-Low, CAMI-Medium, and CAMI-High for AFITBin


Performance on simulated datasets

First, we compare the performance of AFITBin on the simulated datasets Strain and Species to the aforementioned algorithms. The Strain dataset contains 9,417 contigs related to 20 distinct microorganisms, which were assembled from the sequenced reads of 64 separate samples.

As shown in Table 2, AFITBin has the highest F-score for the Strain dataset, 0.91, compared to other methods. The second-highest reported F-score is 0.90, which belongs to MetaCon. The highest precision belongs to AFITBin which is 0.92, while its recall is 0.91, which is the second highest recall.

The results of AFITBin are then compared to those of other methods on the Species dataset. The sequenced reads from 64 different samples were used to put together the 37,628 contigs that belong to 101 different microorganisms in the Species dataset. In this dataset, similar to the previous one, the best F-score and precision belong to AFITBin, while its recall is the second best. For both datasets, we saw an increase in the F-score of at least 2 percent, which is a considerable improvement. It is especially promising because other methods have already reached a high F-score near the maximum, indicating that even minor improvements have significant value.

Performance on the Sharon dataset

In addition, AFITBin is assessed using the Sharon dataset. As explained previously, Sharon is a real dataset from a well-studied microbial experiment in which the species involved have been thoroughly examined. The sequenced reads of 18 distinct samples were used to assemble 2,329 contigs associated with 33 distinct microorganisms. As shown in Table 2, AFITBin performs comparably to the best-performing methods on the Sharon dataset across all evaluation criteria. On this dataset, AFITBin and MetaCon share the highest F-score of 0.82.

Performance on the CAMI datasets

AFITBin is evaluated further on CAMI datasets of varying complexity. As shown in Table 2, AFITBin achieves better binning results than other methods on CAMI-Low and CAMI-Medium. The CAMI-Low dataset contains 1,949 contigs from 40 distinct microbes. The reads from a single sample were used to assemble the contigs in this dataset. Even though AFITBin does not report the highest precision, it increased recall from 0.48 to 0.59, as shown in Table 2. AFITBin obtains the highest F-score among all methods, increasing it from 0.56 to 0.62.

CAMI-Medium is made up of 63,447 contigs from 132 different microorganisms. These contigs were assembled from sequenced reads from two different samples. Table 2 demonstrates that AFITBin classifies the contigs of this dataset with a higher F-score than other methods, increasing the F-score from 0.42 to 0.43. CAMI-High is made up of 42,038 contigs that represent 132 different microorganisms, assembled from the sequenced reads of five samples. AFITBin outperforms the other methods in terms of recall and has an F-score of 0.44, which is greater than the F-scores of MetaCon, CONCOCT, MaxBin 2.0, and BusyBee Web, but not those of MetaBAT 2, SolidBin, and MetaBinner.

Table 2 The overall performance of MaxBin 2.0 (M-Bin2), MetaBAT 2 (M-Bat2), CONCOCT (CCT), MetaCon (M-Con), SolidBin (S-Bin), BusyBee Web (B-Bee), MetaBinner (Mt-Bin), and AFITBin (A-Bin) based on precision, recall, and F-score, on real and simulated datasets


While AFITBin’s precision falls below the best precision obtained by other methods on certain datasets, the methods exhibiting high precision on these datasets often display very low recall. For instance, BusyBee Web reaches a precision of 0.85 on the CAMI-High dataset but a recall of only 0.21. This lower recall implies that BusyBee Web generates significantly fewer accurate bins than expected, resulting in a low F-score. Focusing solely on precision or recall therefore does not give a comprehensive picture of algorithm performance; the F-score, the harmonic mean of precision and recall, provides a more balanced evaluation. Moreover, no single method consistently outperforms all others across every dataset. However, as shown in Table 2, AFITBin exhibits superior overall performance compared to the other methods. Although our method did not achieve the best result on the CAMI-High dataset, AFITBin performed best on the five remaining datasets. The modest yet consistent improvement demonstrated by AFITBin is significant, particularly considering the complexity of the metagenomic binning challenge.

To examine the results further, we use CheckM [37], a tool widely used in metagenomics to assess the quality of microbial genomes reconstructed from metagenomic data. It evaluates the completeness and contamination levels of genomes, providing insight into the reliability of genomic bins. Table 3 shows the performance of two binning methods, AFITBin and MetaBinner, as assessed by CheckM. AFITBin yielded genomic bins with high completeness levels and low contamination rates, whereas MetaBinner showed slightly lower completeness levels and marginally higher contamination rates. It is noteworthy that in [25], MetaBinner was shown to surpass all other methods when assessed using CheckM. In conclusion, the CheckM results highlight AFITBin’s potential as a robust binning method.

Table 3 Comparison of AFITBin and MetaBinner using CheckM on the CAMI-Low dataset


Comparison of AFIT and TNF vectors

To determine the significance of the AFIT vector, we considered several distinct strategies. We begin by comparing TNF and AFIT vectors using contigs from species in the Sharon dataset. For each species, we construct AFIT and TNF vectors based on its contigs. Next, we select a pair of species and consider the AFIT and TNF vectors of their contigs. Then, using the t-Distributed Stochastic Neighbor Embedding (t-SNE) [38] algorithm, these contigs are plotted in two-dimensional space, once based on TNF and once based on AFIT. t-SNE is a dimensionality reduction method whose main purpose is to visualize high-dimensional data in a lower-dimensional space, often in two or three dimensions. It is particularly effective for exploring and understanding patterns in complex datasets. The primary goal of t-SNE is to reduce the dimensionality of data points while maintaining their pairwise similarities or distances as much as possible. It accomplishes this by converting the similarities between data points in the high-dimensional space into conditional probabilities, where similar points have higher probabilities of being picked as neighbors.

Figure 3 illustrates the outcome of the t-SNE algorithm for two species, Finegoldia magna and Leuconostoc citreum, using AFIT and TNF vectors. It shows that contigs belonging to the same species are clustered well with AFIT vectors but not with TNF vectors. In the supplementary Figs. S1–S9, we compare the performance of the AFIT and TNF vectors in clustering different pairs of species, where the right panel shows the performance of the TNF vector and the left panel shows the performance of the AFIT vector. Our approach and findings align with similar methodologies [39], supporting the use of t-SNE to compare AFIT and TNF vectors across different species pairs in metagenomic studies.

Recent binning methods have little direct application of Euclidean distance. In the second approach, for additional evaluation, we compare the Euclidean distance obtained by the AFIT vector and the TNF vector for contigs of the same species (intra) and contigs from different species (inter). For this comparison, the Euclidean distance was calculated once between AFIT vectors and once between TNF vectors for more than 2,000 contigs from 33 different species. Table 4 displays the mean and standard deviation of the distances between contigs of the same species and contigs of different species using the AFIT and TNF vectors. This table demonstrates that, compared to the TNF vector, the AFIT vector is better able to differentiate between contigs of the same and different species.
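The intra/inter comparison can be reproduced for any composition vector with a few lines of Python. This is a sketch with hypothetical names; it uses the population standard deviation, which is an assumption, and it expects at least one intra-species pair and one inter-species pair.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def intra_inter_stats(vectors, species):
    """Mean and standard deviation of Euclidean distances within and between species.

    vectors: one composition vector (AFIT or TNF) per contig.
    species: species label of each contig, in the same order.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        distance = math.dist(vectors[i], vectors[j])
        (intra if species[i] == species[j] else inter).append(distance)
    return {
        "intra": (mean(intra), pstdev(intra)),
        "inter": (mean(inter), pstdev(inter)),
    }
```

A composition vector separates species well when the intra-species mean is clearly smaller than the inter-species mean relative to the two standard deviations.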

Table 4 The mean ($\mu$) and standard deviation ($\sigma$) of the Euclidean distance for both the AFIT vector and the TNF vector, for contigs from the same species (intra) and contigs from different species (inter)


To further assess the effectiveness of the AFIT vector, we followed the methodology outlined by Zhou et al. [40] and plotted barcodes of several genomes using the AFIT and TNF vectors.

We investigated the AFIT vector’s consistency across bacterial genomes. Our analysis revealed notable consistency in frequency patterns from the start to the end of bacterial genomes, even when comparing strains from the same species. Specifically, we plotted barcodes using the AFIT vector for two strains of Escherichia coli (E. coli): E. coli O10:H32 and E. coli O100:H21. The resulting plots, shown in Fig. 4, indicate remarkable consistency throughout the entire genomes. Additionally, the barcodes generated using AFIT for the two strains are strikingly similar.

Furthermore, we extended our evaluation to the genome of Apis cerana, generating its barcode using both the AFIT and TNF vectors. Our observations shown in Fig. 5 illustrated that the barcode derived from TNF vectors lacked informativeness due to nearly identical values across all components, predominantly near zero. Conversely, the barcode created with the AFIT vector offered significant information. It demonstrated consistency across each component while highlighting discernible differences between the values of various components. This emphasized the utility and effectiveness of the AFIT vector in capturing distinctive genome characteristics.

Discussion

Binning metagenomic contigs is a crucial step in metagenomic studies. Metagenomic binning is the process of grouping reads or contigs and assigning them to specific species. Using features extracted from read or contig sequences, binning methods attempt to group these sequences. Typically, these features are compositional features, coverage features, or both. In this paper, we introduced a novel composition vector, AFIT, and the AFITBin binning method, which utilizes both AFIT vector and coverage data. To evaluate the significance of the new composition vector, we not only investigate the performance of AFITBin by comparing its results on different datasets with those of well-known methods, but we also demonstrate how well this vector distinguishes contigs of different species compared to an earlier method.

In addition, we evaluated the performance of several state-of-the-art binning algorithms on both simulated and real datasets. With the AFIT composition vector, AFITBin achieved strong precision, recall, and F-score, outperforming the majority of the compared methods on most datasets. No single method, however, performed best on every dataset. On some datasets AFITBin performed best, whereas on others it performed similarly to the best-performing method. Moreover, the AFIT vector has fewer dimensions than the earlier composition vector, which reduces the computational cost of the binning algorithm and simplifies the design of an appropriate binning algorithm. We attribute these results largely to the AFIT vector.
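As a reminder of how precision, recall, and F-score are typically computed for a binning result, the sketch below uses a common contig-count convention: precision credits each bin with its dominant species, and recall credits each species with the bin that holds most of its contigs. The exact formulas used in this study are those of the Evaluation criteria section; the function name, the dictionary-based inputs, and the treatment of unbinned contigs are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def binning_scores(true_species, predicted_bin):
    """Precision, recall and F-score for a binning result.

    Both arguments map contig id -> label. Contigs missing from
    predicted_bin are treated as unbinned (an assumption): they are
    excluded from precision but still count against recall.
    """
    per_bin = defaultdict(Counter)      # bin -> species -> contig count
    per_species = defaultdict(Counter)  # species -> bin -> contig count
    for contig, species in true_species.items():
        bin_id = predicted_bin.get(contig)
        if bin_id is None:
            continue
        per_bin[bin_id][species] += 1
        per_species[species][bin_id] += 1

    binned = sum(sum(c.values()) for c in per_bin.values())
    total = len(true_species)
    precision = (sum(max(c.values()) for c in per_bin.values()) / binned
                 if binned else 0.0)
    recall = (sum(max(c.values()) for c in per_species.values()) / total
              if total else 0.0)
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Toy example: two species with three contigs each, one contig left unbinned.
truth = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B", "c6": "B"}
bins = {"c1": 1, "c2": 1, "c3": 2, "c4": 2, "c5": 2}
precision, recall, f_score = binning_scores(truth, bins)
```

In the toy example, bin 2 mixes one contig of species A with two of species B, which lowers precision, and the unbinned contig c6 lowers recall only.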

Conclusion

We demonstrated that incorporating the AFIT vector into our binning algorithm allows AFITBin to handle datasets with diverse characteristics. In addition, the AFIT vector combines information from earlier oligonucleotide (l-mer) frequency feature vectors, which contributes to its improved performance. Substantial work nevertheless remains to improve metagenomic binning. As previously stated, most binning methods fall short in determining the actual number of species to be binned. AFITBin currently uses the k-means clustering algorithm to predict the number of species, and errors in this estimate can lead to incorrectly binned contigs and reduce the precision of the final binning result. Developing a reliable method for predicting the actual number of bins is therefore a promising direction for extending the approach presented in this paper.
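The sensitivity to the estimated number of bins can be illustrated with a generic model-selection sketch. The code below is not AFITBin's estimation procedure (described in Step 1 of the Method section); it swaps in a widely used alternative, choosing the k that maximizes the mean silhouette score of a k-means clustering. All function names, the farthest-point initialization, and the toy points are assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette coefficient; singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                             # cohesion
        b = min(sum(d) / len(d) for d in dists.values())    # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def estimate_k(points, k_max):
    """Return the k in [2, k_max] with the highest mean silhouette."""
    return max(range(2, k_max + 1),
               key=lambda k: mean_silhouette(points, kmeans(points, k)))


# Three well-separated toy groups standing in for contig feature vectors.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
best_k = estimate_k(points, 5)
```

On real contig features the groups overlap far more than in this toy example, so the selected k can be off by several bins. Underestimating k merges species into one bin and lowers precision, while overestimating it splits species across bins and lowers recall.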

Availability of data and materials

The code of AFITBin is freely available at https://github.com/SayehSobhani/AFITBin. The Sharon dataset (accession number SRA052203) is available at UC Berkeley’s Genetic Information Database (https://ggkbase.berkeley.edu/carrol/organisms). The CAMI challenge data, including the CAMI-Low, CAMI-Medium, and CAMI-High datasets, are accessible through the CAMI Challenge website (https://data.cami-challenge.org/participate). The genomes of E. coli O10:H32 (GCF_013282315.1) and E. coli O100:H21 (GCF_015571795.1) are available at NCBI.

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Amin Darabi and Sayeh Sobhani have contributed equally to this work.

Authors and Affiliations

Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
Amin Darabi, Sayeh Sobhani & Changiz Eslahchi
School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
Sayeh Sobhani, Rosa Aghdam & Changiz Eslahchi
Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, 53715, USA
Rosa Aghdam

Contributions

A.D., S.S., R.A., and C.E. conceived and designed the novel method for this study. A.D. provided the data; A.D. and S.S. ran the novel method and the well-known methods on the data. All authors were involved in writing, reading, and approving the final manuscript.

Corresponding author

Correspondence to Changiz Eslahchi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Darabi, A., Sobhani, S., Aghdam, R. et al. AFITbin: a metagenomic contig binning method using aggregate l-mer frequency based on initial and terminal nucleotides. BMC Bioinformatics 25, 241 (2024). https://doi.org/10.1186/s12859-024-05859-7

Received: 27 August 2023
Accepted: 09 July 2024
Published: 16 July 2024
DOI: https://doi.org/10.1186/s12859-024-05859-7
