BMC Bioinformatics BioMed Central Methodology article Binning sequences using very sparse labels within a metagenome

Background
In metagenomic studies, a process called binning is necessary to assign contigs that belong to multiple species to their respective phylogenetic groups. Most of the current methods of binning, such as BLAST, k-mer and PhyloPythia, involve assigning sequence fragments by comparing sequence similarity or sequence composition with already-sequenced genomes that are still far from comprehensive. We propose a semi-supervised seeding method for binning that does not depend on knowledge of completed genomes. Instead, it extracts the flanking sequences of highly conserved 16S rRNA from the metagenome and uses them as seeds (labels) to assign other reads based on their compositional similarity.

Results
The proposed seeding method is implemented on an unsupervised Growing Self-Organising Map (GSOM), and called Seeded GSOM (S-GSOM). We compared it with four well-known semi-supervised learning methods in a preliminary test, separating random-length prokaryotic sequence fragments sampled from the NCBI genome database. We identified the flanking sequences of the highly conserved 16S rRNA as suitable seeds that could be used to group the sequence fragments according to their species. S-GSOM showed superior performance compared to the semi-supervised methods tested. Additionally, S-GSOM may also be used to visually identify some species that do not have seeds. The proposed method was then applied to simulated metagenomic datasets using two different confidence threshold settings and compared with PhyloPythia, k-mer and BLAST. At the reference taxonomic level Order, S-GSOM outperformed all k-mer and BLAST results and showed comparable results with PhyloPythia for each of the corresponding confidence settings, where S-GSOM performed better than PhyloPythia in the ≥ 10 reads datasets and comparable in the ≥ 8 kb benchmark tests.

Conclusion
In the task of binning using semi-supervised learning methods, results indicate S-GSOM to be the best of the methods tested. Most importantly, the proposed method does not require knowledge from known genomes and uses only very few labels (one per species is sufficient in most cases), which are extracted from the metagenome itself. These advantages make it a very attractive binning method. S-GSOM outperformed the binning methods that depend on already-sequenced genomes, and compares well to the current most advanced binning method, PhyloPythia.


COP K-Means [1]
COP K-Means stands for COnstraint-Partitioning K-Means. It uses labelled instances as constraints to restrict the K-Means clustering process. All pairs of labelled instances are marked as either 'must-link' or 'cannot-link', which are the constraints for two instances having the same or different labels respectively. During the clustering process, a feasible partition will be produced to satisfy all these constraints.
Since a different initialization of the K-centres will result in a different solution, we take the best solution in 100 runs with random K-centre initialization.
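The way pairwise constraints follow from the labelled instances can be sketched as below. This is a minimal illustration, not the implementation used for the experiments; the function names and the dictionary-based data layout are assumptions made for the example.

```python
from itertools import combinations


def build_constraints(labels):
    """Derive pairwise constraints from {instance_index: class_label}.

    Two labelled instances with the same label form a 'must-link' pair,
    two with different labels form a 'cannot-link' pair.
    """
    must_link, cannot_link = set(), set()
    for (i, label_i), (j, label_j) in combinations(sorted(labels.items()), 2):
        if label_i == label_j:
            must_link.add((i, j))
        else:
            cannot_link.add((i, j))
    return must_link, cannot_link


def _partner(pair, point):
    """Return the other member of `pair` if `point` is in it, else None."""
    a, b = pair
    if a == point:
        return b
    if b == point:
        return a
    return None


def violates(point, cluster, assignment, must_link, cannot_link):
    """True if putting `point` into `cluster` breaks a constraint, given the
    clusters already chosen for other points (`assignment` is a dict)."""
    for pair in must_link:
        other = _partner(pair, point)
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for pair in cannot_link:
        other = _partner(pair, point)
        if other is not None and other in assignment and assignment[other] == cluster:
            return True
    return False
```

During the assignment step of K-Means, a point would be given to the nearest centre for which `violates` returns False; if no such centre exists, that run has no feasible partition.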

Seeded K-Means [2]
Instead of using the labelled instances as constraints to restrict the clustering process, Seeded K-Means uses them to initialise the K-centres. When there is more than one labelled instance in a class, their average is taken as the initial centre. Since the initialisation of this algorithm is determined by the labelled data, multiple random initialisations are not required.
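The seeding step can be sketched as follows. This is an illustrative example only; the function name `seeded_centres` and the list-of-lists representation of the feature vectors are assumptions, not part of the original method description.

```python
def seeded_centres(points, labels):
    """Return {class_label: initial centre} for Seeded K-Means.

    points: list of equal-length feature vectors (lists of floats)
    labels: dict {instance_index: class_label} for the labelled instances
    The centre of a class is the mean of its labelled instances.
    """
    sums, counts = {}, {}
    for idx, cls in labels.items():
        vec = points[idx]
        acc = sums.setdefault(cls, [0.0] * len(vec))
        for d, value in enumerate(vec):
            acc[d] += value
        counts[cls] = counts.get(cls, 0) + 1
    return {cls: [s / counts[cls] for s in acc] for cls, acc in sums.items()}
```

The returned centres would then replace the random initialisation of an ordinary K-Means run.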

Constrained K-Means [2]
This algorithm is essentially a combination of COP K-Means and Seeded K-Means. It uses the labelled instances both to initialise the K-centres and to restrict the clustering process when finding a feasible partition.
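A compact sketch of this idea is given below, under the simplifying assumption that the only constraints are the labels themselves: labelled instances seed the centres and are never moved out of their class. The function name and data layout are hypothetical and chosen for illustration.

```python
def constrained_kmeans(points, labels, max_iter=100):
    """Cluster `points`, seeding centres from `labels` and keeping labelled
    instances fixed in their own class throughout.

    points: list of equal-length feature vectors
    labels: dict {instance_index: class_label}; every class needs >= 1 label
    Returns a list holding the assigned class of each point.
    """
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    classes = sorted(set(labels.values()))
    # Seeding: the initial centre is the mean of a class's labelled instances.
    centres = {c: mean([points[i] for i, lab in labels.items() if lab == c])
               for c in classes}
    assignment = None
    for _ in range(max_iter):
        # Assignment: labelled points are clamped to their class, all other
        # points go to the nearest centre.
        new = [labels[i] if i in labels
               else min(classes, key=lambda c: sq_dist(p, centres[c]))
               for i, p in enumerate(points)]
        if new == assignment:
            break
        assignment = new
        # Update: each class always retains at least its labelled members.
        centres = {c: mean([p for p, a in zip(points, assignment) if a == c])
                   for c in classes}
    return assignment
```

Because the labelled instances fix both the starting centres and part of every partition, a single run suffices in this sketch.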

Multi-class Transductive Support Vector Machine (TSVM)
Support Vector Machine (SVM) is a supervised two-class classification algorithm.
Transductive Support Vector Machine (TSVM) is a semi-supervised two-class learning algorithm. SVM finds a hyper-plane that is placed at the maximum margin between the two labelled classes. In TSVM, the hyper-plane is additionally adjusted so that the boundary lies in a region where the unlabelled instances have low density. We used the publicly available program SVM-Light (http://svmlight.joachims.org) in this paper. SVM-Light implements the TSVM as described by Joachims [3].
Clustering sequence fragments from several genomes is a multi-class problem. We employ the multi-class architecture proposed by Bruzzone et al. [4] to implement the multi-class TSVM. The architecture is shown in Figure 1.

Figure 1: The multi-class TSVM architecture

Clustering Performance Measures
F-measure calculation for clustering evaluation
The F-measure was introduced by van Rijsbergen [5] in the field of information retrieval to evaluate the effectiveness of a retrieval system. It has become a traditional clustering evaluation measure for cases in which the actual class labels are available for evaluation.
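The calculation of the F-measure is clearer when presented with a contingency table of actual classes against predicted clusters, where:
• $n_{i\cdot}$ is the number of samples in actual class $A_i$
• $n$ is the total number of samples in the dataset
• $F(A_i)$ is the F-measure for class $A_i$
Finally, the weighted F-measure for the clustering result is calculated from:

$$F = \sum_{i} \frac{n_{i\cdot}}{n}\, F(A_i)$$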
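A minimal sketch of the weighted F-measure follows. It assumes the common convention that the F-measure of a class is the best harmonic mean of precision and recall over all predicted clusters, and that each class is weighted by its share of the samples; the function name and the nested-list table layout are illustrative choices.

```python
def weighted_f_measure(table):
    """Weighted F-measure of a clustering.

    table[i][j] = number of samples of actual class i in predicted cluster j.
    Assumes every actual class (row) contains at least one sample.
    """
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    total = 0.0
    for row in table:
        row_total = sum(row)
        best = 0.0
        for n_ij, col_total in zip(row, col_totals):
            if n_ij == 0:
                continue  # no overlap, so the F value for this pair is 0
            precision = n_ij / col_total
            recall = n_ij / row_total
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight the class by the fraction of samples it contains.
        total += row_total / n * best
    return total
```

A perfect clustering yields 1, and the value falls as classes are split across clusters or merged into one.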

Adjusted Rand Index calculation for clustering evaluation
The Rand Index was proposed in [6] to evaluate clustering quality when the correct labels are available for the evaluation. Later, the Adjusted Rand Index was introduced in [7] to correct for the fact that the expected value of the Rand Index is non-zero even for randomly assigned clusters. The Adjusted Rand Index has become a popular clustering evaluation measure for cases in which the actual class labels are available for evaluation.
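The following contingency table is used to demonstrate the calculation of the Adjusted Rand Index:

        P_1    …   P_j    …   P_k    Sum
A_1     n_11   …   n_1j   …   n_1k   n_1.
…       …          …          …      …
A_m     n_m1   …   n_mj   …   n_mk   n_m.
Sum     n_.1   …   n_.j   …   n_.k   n

• $n_{ij}$ is the number of samples of actual class $A_i$ that were assigned to predicted class $P_j$
• $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column sums
• m is the number of actual classes
• k is the number of predicted classes
• n is the total number of samples in the dataset

The Adjusted Rand Index is calculated using the following equation:

$$\mathrm{ARI} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{k}\binom{n_{ij}}{2} - \left[\sum_{i=1}^{m}\binom{n_{i\cdot}}{2}\sum_{j=1}^{k}\binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_{i=1}^{m}\binom{n_{i\cdot}}{2} + \sum_{j=1}^{k}\binom{n_{\cdot j}}{2}\right] - \left[\sum_{i=1}^{m}\binom{n_{i\cdot}}{2}\sum_{j=1}^{k}\binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}$$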
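The Adjusted Rand Index can be computed directly from such a contingency table. The sketch below uses the standard pair-counting form of the index; the function name and the nested-list table layout are assumptions made for the example.

```python
from math import comb


def adjusted_rand_index(table):
    """Adjusted Rand Index from a contingency table.

    table[i][j] = number of samples in actual class i and predicted class j.
    Undefined (division by zero) when both partitions are trivial, e.g. a
    single actual class and a single predicted class.
    """
    n = sum(sum(row) for row in table)
    # Pairs of samples that share both the actual and the predicted class.
    sum_cells = sum(comb(v, 2) for row in table for v in row)
    # Pairs that share the actual class / the predicted class.
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (maximum - expected)
```

A perfect match between the two partitions gives 1, whereas agreement no better than chance gives values around 0 and can be slightly negative.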

Binning Algorithms
Since the binning results of the algorithms presented in this paper were generated by Mavromatis et al. [8], detailed explanations and settings of the algorithms can be found in the supplementary material of that paper. For completeness, these algorithms are briefly described here.

BLAST
This method is a sequence-based binning method. It used fgenesb to predict the genes in the metagenomic sequences, and the predicted genes were compared by BLAST against reference databases of NCBI protein sequences. The sequences were then assigned to taxonomic classes according to the distribution of the BLAST hits of their predicted genes.

k-mer [9]
This is a composition-based binning method. It calculates the oligonucleotide frequencies of the sequence fragments, compares them with a reference set of finished genomes, and assigns each fragment to the taxonomic family of the best matching isolate bin.

PhyloPythia [10]
PhyloPythia is also a composition-based binning method. It assigns the sequences to clades using a classifier. There is one p-value setting (confidence setting), which is determined by the distance of the sample sequences from the hyperplane of the classifier. For example, a higher p-value (e.g. p = 0.85) means that only samples having a large enough distance to the hyperplane of a specific clade will be assigned to that clade.

Result evaluation procedures
In the following, the exact data extraction and evaluation procedure is described and the URLs of the required files are listed at the end of this section.
1. Use the ContigReads.txt (A) and the FASTA format contig files (B) to obtain the contigs that are >= 8 kb (wildcard bases 'N' are not counted towards the sequence length) or that consist of >= 10 reads.
2. Use the contig_assignment.txt (C) to find all taxonomic levels (as the true class) for each eligible contig, ranging from domain down to strain level. Let us call this the ActualContigTaxa file. E.g. in the case of "simLC, >= 10 reads, phrap", there are 482 contigs in the ActualContigTaxa file. In this dataset, at the order level, there are 11 different actual taxa (as shown in Table 8).
3. We use the predicted taxa of contigs (D) for the different binning methods to match the contig_ID in the ActualContigTaxa file. Additionally, we use the NCBI database to determine the taxonomy of the predicted taxon names. For example, if a contig is predicted to be of taxon Rhizobiales, the NCBI database shows that this taxon is at the order level and also gives the names of all its higher ranks (kingdom, phylum, class). In the same example as above, 478 out of 482 contigs are assigned a predicted taxon by "gen PhyloPythia p=0.5". However, some contigs can be predicted to be of a taxon that is not in the list of actual taxa.
4. The contigs that were predicted at or below the specified taxonomic level of comparison are identified. The number of contigs for each taxon is counted and placed into a contingency table with the actual taxa as the rows and the predicted taxa as the columns. E.g. in the above example, out of the 482 contigs in the simLC dataset that have >= 10 reads, 369 contigs were predicted at the order level or below. However, among these 369 contigs, 1 contig has an order-level taxonomic label that is not one of the 11 actual taxa, so it was treated as unassigned. Therefore, only 368 contigs were binned. The exact contig counts for each actual taxon and each predicted taxon are shown in the following contingency table:
Ax represents an actual taxon in the dataset. The corresponding taxon name can be found in Table 8.
Px has the same name as the corresponding Ax but represents the predicted taxon.
The UnAsgn column is the number of contigs that are not assigned by this method at the given taxonomic level.
Sp is the specificity for the corresponding class.
Sn is the sensitivity for the corresponding class.
Since only P1, P3, P4 and P9 contain predicted contigs, the number of bins is 4.
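
The counting in steps 1 to 4 can be sketched as below. This is an illustrative outline only, not the scripts used for the evaluation: the function names, the dictionary inputs and the example taxon names are assumptions, and parsing of the actual input files (A) to (D) is omitted.

```python
from collections import Counter


def effective_length(sequence):
    """Contig length with wildcard bases 'N' excluded (step 1)."""
    return sum(1 for base in sequence.upper() if base != "N")


def contingency(actual, predicted):
    """Count contigs per (actual taxon, predicted taxon) pair (steps 2 to 4).

    actual:    {contig_id: actual taxon at the comparison level}
    predicted: {contig_id: predicted taxon at that level}; contigs that a
               method did not place at or below the level are simply absent
    A predicted taxon that is not among the actual taxa counts as 'UnAsgn'.
    """
    known = set(actual.values())
    counts = Counter()
    for contig_id, true_taxon in actual.items():
        guess = predicted.get(contig_id)
        if guess not in known:
            guess = "UnAsgn"
        counts[(true_taxon, guess)] += 1
    return counts


def number_of_bins(counts):
    """Number of predicted taxa (columns) that received at least one contig."""
    return len({pred for (_, pred), c in counts.items()
                if pred != "UnAsgn" and c > 0})
```

For instance, with four contigs of which one is predicted outside the actual taxa and one is not predicted at all, both end up in the UnAsgn column and only the remaining predicted taxa count as bins.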