Metagenomic analysis through the extended Burrows-Wheeler transform

Background: The development of Next Generation Sequencing (NGS) has had a major impact on the study of genetic sequences. Among the problems that researchers in the field have to face, one of the most challenging is the taxonomic classification of metagenomic reads, i.e., identifying the microorganisms that are present in a sample collected directly from the environment. The analysis of environmental samples (metagenomes) is particularly important for determining the microbial composition of different ecosystems, and it is used in a wide variety of fields: for instance, metagenomic studies in agriculture can help in understanding the interactions between plants and microbes, while in ecology they can provide valuable insights into the functions of environmental communities.
Results: In this paper, we describe a new lightweight alignment-free and assembly-free framework for metagenomic classification that compares each unknown sequence in the sample to a collection of known genomes. We take advantage of the combinatorial properties of an extension of the Burrows-Wheeler transform, and we sequentially scan the required data structures, so that we can analyze unknown sequences of large collections using little internal memory. The tool LiME (Lightweight Metagenomics via eBWT) is available at https://github.com/veronicaguerrini/LiME.
Conclusions: In order to assess the reliability of our approach, we run several experiments on NGS data from two simulated metagenomes among those provided in benchmarking analyses and on a real metagenome from the Human Microbiome Project. The experimental results on the simulated data show that LiME is competitive with the widely used taxonomic classifiers. It achieves high levels of precision and specificity, e.g., 99.9% of the positive control reads are correctly assigned and the percentage of classified reads of the negative control is less than 0.01%, while keeping a high sensitivity. On the real metagenome, we show that LiME is able to deliver classification results comparable to those of Magic-BLAST. Overall, the experiments confirm the effectiveness of our method and its high accuracy even in negative control samples.


Background
Interest in metagenomics has been sparked by the recent advances in "next-generation" DNA sequencing (NGS) technologies [1]. Metagenomics refers to the sequencing of microbial DNA collected directly from the environment, without isolation and lab cultivation of individual species. The analysis of environmental samples (i.e., metagenomes) is particularly important for determining the microbial composition of different ecosystems, and it is used in several fields: for example, metagenomic studies in agriculture are being used to explore the relations between microbes and plants and to detect crop diseases. It is of fundamental importance to identify with precision the microorganisms that are present in a metagenomic sample, by comparing the biological sequences therein and assigning them to a specific taxon.
Metagenomic tools can be preliminarily classified into two categories: reference-based (supervised) [2][3][4][5][6][7] and reference-free (unsupervised) [8,9]. In the first case, the metagenomic analysis needs reference genomes, and the goal is to match sequences, typically reads or assembled contigs, against a database of microbial genomes in order to identify the taxon of each sequence. In the second case, the aim is to subdivide the reads, or the sequences assembled from metagenomic reads, into discrete units, without the need of references, so that sequences clustered together represent the individual populations that comprise the microbial community. This latter approach is known as reference-free binning [8].
In this paper, we focus on the first approach which provides taxonomic classification of individual sequences from a metagenomic sample. A variety of strategies have been used for the matching step of the reference-based metagenomics tools. The following partial list gives a few examples: aligning reads, mapping k-mers, using complete genomes, aligning marker genes only or translating the DNA and aligning to protein sequences (see also [10]). One can split these methods into alignment-based and assembly-based approaches or alignment-free and assembly-free approaches [11].
Alignment-based classifiers proceed by aligning metagenome sequences against the genomes in a reference database. The most well-known and commonly used tool for DNA search and alignment is BLAST (Basic Local Alignment Search Tool) [12]. BLAST consists of a set of algorithms that attempt to find a short fragment of a query read that aligns with a fragment of a genome stored in the reference database.
Nevertheless, as reference databases and NGS sequencing datasets have grown in size, this alignment strategy has become computationally expensive, and alignment-free methods have been developed to reduce the computational complexity of the task. In particular, there exist many statistical/computational tools for metagenomic classification (e.g., [2][3][4][13]); many of these approaches use exact matching of short words of length k (k-mers) rather than alignment, and are often based on hashing, indexing and counting k-mers.
Recent surveys (for instance [14,15]) offer a thorough benchmarking analysis comparing the majority of the state-of-the-art tools. A recent evaluation of tools for metagenome classification [14] presents the most widely used tools tested on complex and realistic datasets. According to this benchmarking analysis, Kraken [2] and CLARK [3] turn out to be the top-performing tools in terms of both similarity to the correct answer and the fraction of reads classified. Note that these two tools are both k-mer based. An alternative approach to fixed k-mers is to use spaced seeds, i.e., patterns in which some fixed positions are allowed to be wild-cards. CLARK-S [5] is a newer version of CLARK that uses spaced seeds and achieves higher sensitivity with the same high precision. The main drawback of these approaches is that they are extremely memory consuming: the memory usage of CLARK-S, as well as of CLARK and Kraken, is high, and the results obtained by running the lightweight version CLARK-L are indicated to be a "draft, or first order approximation" of those obtained by running CLARK or CLARK-S. Recently, a new version of Kraken, named Kraken 2 [7], has been introduced. It improves upon Kraken by reducing memory usage by 85% and by allowing the use of greater amounts of reference genomic data while maintaining high accuracy. There are, however, other efficient alignment-free methods not based on k-mer counting, for instance Centrifuge [4], which is based on a read-mapping strategy and uses the FM-index [16] to store and index the genome database.
Another recent tool based on mapping is TaxMaps [6]: it uses a database compression algorithm that eliminates redundancy by performing a lowest common ancestor (LCA) pre-assignment and collapse for k-mers of length greater than a specified read length; this allows GEM mapper [17] to conduct non-exact searches in the same manner as it would against the original database, resulting in compression that, for the purpose of taxonomic classification, is lossless. Finally, a novel member of the BLAST family of programs has been introduced recently: Magic-BLAST [18] is a new tool for mapping DNA or RNA next-generation sequencing (NGS) runs against a whole genome.
Our contributions - We propose a new alignment-free, mapping-free and assembly-free method for comparing sequences (cf. [11,19,20]), which is combinatorial by nature and allows us to use little internal memory compared to other approaches, since the internal memory usage mainly depends on the number of reads that one wants to examine at the same time.
Our method is based on an extension of the Burrows-Wheeler Transform (shortly eBWT) to a collection of sequences [21]. The eBWT has been used in several application contexts, such as circular pattern matching (cf. [22]) and alignment-free methods for comparing sequences (cf. [21,[23][24][25][26]). Different distance measures have been defined and successfully applied to several biological datasets, such as mitochondrial DNA genomes [21,23], expressed sequence tags [27] and proteins [24].
Usually, when the eBWT is applied to a collection S of sequences, its output string ebwt(S) is enriched with another data structure, called the document array da(S): a different color is assigned to each element of S, and each symbol of ebwt(S) is associated with a color in da(S) [28]. In other words, the array da(S) contains a sequence of colors that depends on how the suffixes of the sequences in S are intermixed in the sorted list. In [21,23], the authors define a class of dissimilarity measures that, by using the eBWT, formalize the intuitive idea that the greater the number of substrings shared by two sequences u, v ∈ S, the more their colors are intermixed in da(S), and the smaller the "distance" between u and v.
In this paper, we present a tool for the metagenomic classification task, called LiME (Lightweight Metagenomics via eBWT), that takes as input the (extended) Burrows-Wheeler Transform enhanced with the document array (DA) and the longest common prefix (LCP) array [29]. These are fundamental data structures in string processing and can be built independently from our tool (e.g. [30][31][32][33][34][35]).
The contribution of this paper is twofold, theoretical as well as practical. The theoretical result is the introduction of a novel combinatorial approach, which is alignment-free, mapping-free and assembly-free, that can be the basis of new biological sequence analysis methods. Our approach takes advantage of the combinatorial property of the Burrows-Wheeler Transform (BWT) [36], already exploited by BWT-based compressors (see, for instance [37][38][39]): the output of BWT shows a local similarity (occurrences of a given letter that precede the same context tend to occur in clusters) [40][41][42][43].
From a practical viewpoint, unlike other approaches, LiME does not need to build ad hoc data structures for the database of reference genomes and keep them in internal memory, as it only needs to run sequential scans on the input data structures, which may consist of simple text files or may be in compressed form. To the best of our knowledge, LiME is also the first metagenomic classifier that runs a many-to-many pairwise comparison and is able to produce a similarity matrix, by comparing all unknown sequences in the sample versus all known genomes in the collection at the same time, in order to assign the correct reference to each read. In the future, such a matrix could also be used for other metagenomic applications, for instance reference-free (unsupervised) approaches [9] or all-vs-all comparison of metagenomes [44].
Moreover, the experiments show that our tool has a very high precision and a high specificity: it is able to correctly assign a read to a genome, while establishing that random reads must not be assigned to any genome. Our tool can take paired-end read collections as input; we recall that a paired-end read is a pair of two reads (or mates), such that the second read occurs in the genome at a known distance after the first one. LiME is able to process the mates individually while still exploiting the pairing information, when the input is a paired-end read collection.
This work is an extended version of a paper that appeared in [45], where two of the present authors introduced a new similarity measure based on the clustering effect of the eBWT. The main idea consists in computing the similarity between a read and a genome by identifying and analyzing the blocks of consecutive symbols in the output of the eBWT and their related colors in the DA. The previous strategy did not take into account the symbols of the International Union of Pure and Applied Chemistry (IUPAC) code [46], which were simply treated as placeholders. Moreover, in the preliminary version, several reads were marked as ambiguous, and thus not classified, leaving a more in-depth analysis of them for further work.
The additional contributions over those in the conference paper [45] are: i) we modify the analysis of the clusters by using two different similarity measure definitions: in the first variant, we count how many matches of symbols there are between each read and each genome in any cluster, taking into account also the IUPAC ambiguity codes (an ambiguity code can match itself or any letter associated with its code); whereas, in the second variant, only the presence of such symbols in clusters is considered, distinguishing only which input sequence they belong to; ii) we modify the read classification by performing a more in-depth analysis of the ambiguous reads (note that the classification is divided into three phases, and only the first phase is in common with [45]); iii) we are able not only to classify the reads at a specific taxonomic rank, such as genome or any level between species and phylum, but also to classify reads at several taxonomic levels by taking advantage of the taxonomic lineage; iv) we implement a multi-threading version of our tool, exploiting the fact that it allows a certain degree of parallelization: indeed, as stated in [45], the analyses of clusters are independent of each other, so each thread can handle distinct parts of the input files, reading them through local buffers. We also extend the experiments by showing the performance of our tool on a real dataset.
The validation is performed by using both simulated metagenomes among those provided in the benchmarking analysis [14] and a real metagenome from the Human Microbiome Project (HMP).
Concerning the simulated datasets, they reproduce the size, complexity and characteristics of real metagenomic samples, containing around 20 million sequences (for the positive control) in addition to a negative control set of about 5 million randomly shuffled reads, which mimic sequences from unknown organisms that are likely to appear in metagenomic samples. The experimental results on the simulated data show that LiME is competitive with the state-of-the-art tools. It achieves high levels of precision and specificity, e.g., 99.9% of the positive control reads are correctly assigned and the percentage of classified reads of the negative control is less than 0.01%, while keeping a high sensitivity. Thus, on simulated datasets LiME achieves a high F1 score, which is the harmonic mean of sensitivity and precision. Recall that a classifier obtaining a high F1 score achieves both high precision and high sensitivity. Moreover, LiME is able to deliver classification accuracy comparable to that of Magic-BLAST, TaxMaps and CLARK-S, yet requiring less computational cost.
On the real metagenome, the accuracy of classification cannot be evaluated, since the "ground truth" for a real metagenome is not available; thus, we choose to compare the classification of LiME with the other tools' classification results. By comparing the experimental results, we observe that LiME agrees with Magic-BLAST on the classification of more reads than the other tools do. Moreover, the number of reads that are classified by LiME but not by Magic-BLAST is greater than the corresponding number for CLARK-S, Centrifuge, and Kraken 2, although LiME classifies more reads than CLARK-S, Centrifuge and Kraken 2.
Overall, the experiments confirm the effectiveness of our method and its high accuracy even in negative control samples.

Method
In this section, we present a new eBWT-based strategy to tackle the problem of metagenomic classification that is assembly-, mapping- and alignment-free and uses a small amount of internal memory, in the sense that we keep the input data structures in external memory and only the similarity matrix between the reads and the genomes in internal memory. So, the space in internal memory depends only on the number of sequences that one wants to examine at the same time, and it does not depend on the size (number of symbols) of the input.

Preliminaries and materials
Let S be a string (or sequence) of length n over an alphabet Σ, with σ = |Σ|. We denote the i-th symbol of S by S[i]. Let S = {S_1, S_2, . . . , S_m} be a collection of m strings. We assume that each string S_i ∈ S of length n_i is followed by a special symbol S_i[n_i + 1] = $_i, which is lexicographically smaller than any other character in S and does not appear elsewhere in S; for implementation purposes, we may simply use a unique end-marker $ for all strings in S. The alphabet Σ is the biological alphabet {A, C, G, T} enhanced with the degenerate base symbols (IUPAC code [46]) and the end-marker symbol.
A substring of any S ∈ S is denoted by S[i, j] = S[i] · · · S[j], with S[1, j] being called a prefix and S[i, n + 1] a suffix of S. A range is delimited by a square bracket if the corresponding endpoint is included.
The Burrows-Wheeler Transform (BWT) [36] is a well-known and widely used reversible string transformation that can be extended to a collection of strings. Such an extension, known as eBWT or multi-string BWT, is a reversible transformation whose output string (denoted by ebwt(S)) is a permutation of the symbols of all strings in S [21] (see also [30,31,34,47,48]). The length of ebwt(S) is denoted by N, and ebwt(S)[i] is the symbol that (circularly) precedes the i-th smallest suffix, according to the lexicographic sorting of the suffixes of all strings in S. In this case, we say that the suffix S_j[k, n_j + 1] is associated with the position i in ebwt(S) and with the color j ∈ {1, 2, . . . , m} in the DA. Then, the output string ebwt(S) is enhanced with the document array of S (denoted by da(S)) of length N, where da(S)[i] = j if ebwt(S)[i] is a symbol of the string S_j ∈ S. In other words, a different color is assigned to each sequence. See Fig. 1 for an example.
The longest common prefix (LCP) array [29] of S is the array lcp(S) of length N + 1, such that lcp(S)[i], with 2 ≤ i ≤ N, is the length of the longest common prefix between the suffixes associated with the positions i and i − 1 in ebwt(S), and lcp(S)[1] = lcp(S)[N + 1] = 0 by default.

Fig. 1 The required data structures for our running example.

The set S will be omitted if it is clear from the context. Moreover, for clarity of description, we denote by L(S) the sorted list of the suffixes in S, distinguishing the end-marker symbols $, although we do not need them for our computation.
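To make these definitions concrete, the following sketch builds ebwt(S), da(S) and lcp(S) for a toy collection by naive suffix sorting. This is only an illustration of the definitions, not how the data structures are built in practice (see the Preprocessing section); the function name and the use of distinct control characters as end-markers $_1 < $_2 < . . . are our own choices.

```python
def ebwt_da_lcp(strings):
    """Return (ebwt, da, lcp) for a collection of strings.

    Each string gets its own end-marker (chr(0), chr(1), ...), which sorts
    before A, C, G, T. Suffixes of all strings are sorted lexicographically;
    ebwt[i] is the symbol circularly preceding the i-th smallest suffix,
    da[i] is the 1-based index ("color") of the string it comes from, and
    lcp[i] is the longest common prefix length of suffixes i and i-1.
    """
    terminated = [s + chr(i) for i, s in enumerate(strings)]
    suffixes = []  # (suffix text, color j, start position k)
    for j, t in enumerate(terminated, start=1):
        for k in range(len(t)):
            suffixes.append((t[k:], j, k))
    suffixes.sort()
    # k - 1 = -1 for k = 0 wraps around to the end-marker (circular precedence)
    ebwt = [terminated[j - 1][k - 1] for _, j, k in suffixes]
    da = [j for _, j, _ in suffixes]

    def lcp_len(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    lcp = [0] + [lcp_len(suffixes[i - 1][0], suffixes[i][0])
                 for i in range(1, len(suffixes))]
    return ebwt, da, lcp
```

For instance, on the collection {TA, AT} the colors in da alternate exactly where the two strings share suffixes, which is the intermixing the similarity measures below build on.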
We call u-occurrence any substring u that occurs in any sequence of S.

Remark 1
Recall that ebwt(S) is implicitly associated with L(S): all the suffixes in S starting with the same substring u, with |u| = k, must be consecutive entries in L(S), say at positions h, . . . , j, and the symbols of S that are followed by u-occurrences coincide with ebwt(S)[h, j].

Remark 2
Let t be the total number of u-occurrences in S, with |u| = k. Then there exist k − 1 substrings (i.e., all the suffixes of u that are not equal to u) which appear at least t times in S.
The substring CGT appears exactly once in sequences S_1 and S_2. The two suffixes of S_1 and S_2 starting with CGT-occurrences occupy consecutive positions, precisely 16 and 17 in Fig. 1, and lcp[17] = 4. Moreover, according to Remark 2, we may note that the number of GT-occurrences is 2 and the number of T-occurrences is 4.
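Remark 1 can be checked directly on a toy collection (hypothetical data, not the sequences of Fig. 1): in the sorted list of all suffixes, those starting with a given u-occurrence always form one consecutive range.

```python
def suffix_range(strings, u):
    """Sort all suffixes of the collection and return the (start, end)
    index range (end exclusive) of the suffixes starting with u.
    The internal assertion is Remark 1: the hits are consecutive."""
    suffixes = sorted(s[k:] for s in strings for k in range(len(s)))
    hits = [i for i, suf in enumerate(suffixes) if suf.startswith(u)]
    assert hits == list(range(hits[0], hits[0] + len(hits)))  # consecutive
    return hits[0], hits[0] + len(hits)
```

The size of the returned range is the number of u-occurrences in the collection, so Remark 2 can be checked by comparing the ranges of u and of its proper suffixes.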

Preprocessing
We recall that our tool takes as input the files containing ebwt(S), da(S) and lcp(S) of the collection S of reads and reference genomes.
Let S = {S_1, . . . , S_m} be the input collection of biological sequences comprising r reads and g genomes, where m = r + g. More in detail, S_i ∈ S is a read if 1 ≤ i ≤ r, and S_j ∈ S is a genome if r + 1 ≤ j ≤ m. For simplicity, we denote by R the subset of reads and by G the subset of genomes.
The construction of ebwt(S), da(S) and lcp(S) can be performed in internal or external memory. We can compute ebwt(G) and da(G) with algorithm GAP [30] or eGSA [47]. Then, for each new experiment with a read collection R, we can compute ebwt(R) and da(R) by using the algorithm BCR [31,32] and merge them with ebwt(G) and da(G) to compute ebwt(S) and da(S) by using the algorithm eGAP [34], which gives lcp(S) as a by-product with no additional costs.
Notice that the construction method used does not affect the classification, so one can use other algorithms to build ebwt(S), da(S) and lcp(S); this is a good feature, since more efficient tools may appear in the literature.

LiME
In this section, we describe how to compute a similarity value between a short sequence and a reference genome, and then we describe how to handle the reverse complement strand of the sequence and the mates of paired-end read collections.
We outline the method that we introduced to classify short reads by assigning each of them to a specific taxon, according to the similarity scores computed for every genome in G through sequential scans of ebwt(S), da(S) and lcp(S).
Our notion of similarity between sequences exploits the underlying properties of the eBWT: (i) the clustering effect, i.e., the fact that this transformation tends to group together symbols that occur in similar contexts in the input string collection; (ii) the fact that if a substring u occurs in one or more sequences, then the suffixes of the collection starting with u-occurrences are likely to be close in the sorted list of suffixes. In other words, the greater the number of substrings shared by two sequences, the more similar they are.
Roughly speaking, we consider the symbols of S followed by the same substrings (i.e., contexts), which are clustered together in ebwt(S), and match one-to-one the symbols belonging to R with the symbols belonging to G.
The overall scheme of the LiME algorithm can be sketched as follows:
Step 1. By reading da(S) and lcp(S) sequentially, we detect α-clusters, i.e., the blocks of ebwt(S) containing symbols belonging both to R and to G and whose associated suffixes in L(S) share a common context (u-occurrence) of a minimum length α.
Step 2. We analyze the α-clusters in order to evaluate a degree of similarity between any read and any genome in S by using two different approaches: (a) by reading both da(S) and ebwt(S), and (b) by reading only da(S).
Step 3. We perform the read assignment: every read is assigned to a specific taxon either at a given taxonomic level (if it is specified) or at any level by taking advantage of the taxonomic lineage of taxa.
In other words, we discard the blocks of ebwt(S) whose associated suffixes do not share a prefix of length at least α, or that contain symbols belonging to only one set (R or G). Setting a minimum LCP value α filters out blocks corresponding to short random contexts (u-occurrences) that are statistically not significant, while requiring symbols from both sets is appropriate for the next step's computations. Moreover, since the genomes in G are (usually) long sequences, the parameter α must be chosen smaller than the read length (the length of the sequences in R), so that the set C_α of α-clusters is not empty.
Step 1 can be performed by a sequential scan over lcp(S) and da(S), allowing us to keep the input data structures in external memory and to use only a small amount of internal memory to detect the α-clusters. Since the clusters do not overlap, we implemented LiME so that it performs this step in parallel, by dividing the data structures appropriately.
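A minimal sketch of this sequential scan, under the conventions used in this paper (colors 1, . . . , r are reads and colors r + 1, . . . , m are genomes); the 0-based indexing and the function name are our own illustrative choices:

```python
def alpha_clusters(lcp, da, alpha, r):
    """Yield (start, end) index pairs (end exclusive) of alpha-clusters.

    lcp[i] is the LCP between the suffixes at positions i and i-1
    (0-based here, with lcp[0] = 0); da[i] is the color of position i;
    colors <= r are reads (set R), colors > r are genomes (set G).
    A block is kept only if its suffixes share a prefix of length >= alpha
    and it contains symbols of both R and G.
    """
    n = len(da)
    i = 0
    while i < n:
        # grow a maximal run where consecutive suffixes share >= alpha symbols
        j = i + 1
        while j < n and lcp[j] >= alpha:
            j += 1
        if j - i > 1:  # at least two suffixes in the block
            colors = da[i:j]
            if any(c <= r for c in colors) and any(c > r for c in colors):
                yield (i, j)
        i = j
```

Because a cluster ends exactly where the LCP drops below α, a single left-to-right pass suffices, and disjoint chunks of lcp/da can be scanned by different threads as described above.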
Note that we compute the similarity between a read S_i ∈ R and a genome S_j ∈ G by analyzing the data structures of the entire set of sequences S, not only those of the two sequences S_i and S_j. This is possible according to the following remark.
Remark 3 Let (pS, pE) be an α-cluster of ebwt(S) that contains at least one symbol of S_i and at least one symbol of S_j. Other symbols that belong to sequences in S apart from S_i and S_j may also appear in ebwt(S)[pS, pE]. Nevertheless, we can implicitly get a new cluster (pS′, pE′) by deleting from ebwt(S) all symbols not belonging to S_i and S_j, and by the properties of the LCP array, it is easy to verify that (pS′, pE′) forms an α-cluster of ebwt({S_i, S_j}).
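Remark 3 amounts to a simple filtering of the cluster, sketched below on plain Python lists (the interface is illustrative):

```python
def restrict_cluster(ebwt_block, da_block, i, j):
    """Keep only the symbols of S_i and S_j (with their colors) from an
    alpha-cluster of ebwt(S); by Remark 3, the result is an alpha-cluster
    of ebwt({S_i, S_j})."""
    return [(sym, col) for sym, col in zip(ebwt_block, da_block)
            if col in (i, j)]
```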
Step 2: build the similarity matrix -We compute the |R| × |G| similarity matrix by analyzing the α-clusters detected in the previous step. The idea in this step is to use the information from the clustering effect in ebwt(S) and the mixed colors in da(S). Also in this step, we perform sequential scans of the input data structures.
During the second step, we analyze each α-cluster of ebwt(S) by using one of the two approaches described below; then, we measure the degree of similarity between each sequence in R and each genome in G.
In the first approach, whose related measure we call the α_eBWT-similarity, we consider the symbols that appear in the α-clusters by making an exact correspondence between the symbols belonging to R and G, taking into account the ambiguity of the IUPAC codes. For each α-cluster, we first count the number of occurrences of symbols of each read that can be associated with occurrences of the same symbol of each genome appearing in the α-cluster, and then we count the ambiguity symbols of each read (resp. genome) that match a genome (resp. read) symbol associated with their code. Then, we sum these two values.
Definition 2 Let C_α be the set of all the α-clusters associated with ebwt(S). We define the α_eBWT-similarity between two sequences S_i ∈ R and S_j ∈ G as the quantity

S_α^eBWT(S_i, S_j) = (1 / n_i) Σ_{x ∈ C_α} ( Σ_{a ∈ Σ} min{occ_a(i, x), occ_a(j, x)} + δ_IUPAC(i, j, x) ),

where occ_a(i, x) (resp. occ_a(j, x)) is the number of a-symbols belonging to S_i (resp. S_j) in the α-cluster x, and δ_IUPAC(i, j, x) is the sum of the number of matchings between the remaining IUPAC ambiguity symbols of S_i (resp. S_j) and any remaining symbol of S_j (resp. S_i) associated with their code.
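The following sketch computes the per-cluster sum of Definition 2 for one read/genome pair, assuming the clusters have already been restricted to the two sequences (cf. Remark 3). For simplicity the δ_IUPAC term is omitted here, and the normalization by the read length is our reading of the fact that similarity values are later compared against a threshold β ∈ [0, 1); the function name is illustrative.

```python
from collections import Counter

def ebwt_similarity(clusters, read_len):
    """clusters: one (read_symbols, genome_symbols) pair per alpha-cluster,
    restricted to a single read S_i and genome S_j.

    Counts, per cluster, the one-to-one matches between equal symbols of
    the two sequences (the delta_IUPAC credit for compatible ambiguity
    codes is omitted in this sketch)."""
    total = 0
    for read_syms, genome_syms in clusters:
        rc, gc = Counter(read_syms), Counter(genome_syms)
        # min(rc[a], gc[a]) = number of a-symbols that can be paired up
        total += sum(min(rc[a], gc[a]) for a in rc)
    return total / read_len  # assumed normalization to [0, 1]
```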
In the second approach, whose related measure we call the α_DA-similarity, we only consider the number of symbols from S_i ∈ R and S_j ∈ G appearing in the α-clusters; that is, we disregard the correspondence between the symbols a ∈ Σ from S_i and S_j in the α-cluster. In order to do this, we sequentially scan the corresponding positions of the α-clusters in da(S), counting the colors of S_i and S_j.
Definition 3 Let C_α be the set of all the α-clusters associated with ebwt(S). We define the α_DA-similarity between two sequences S_i ∈ R and S_j ∈ G as the quantity

S_α^DA(S_i, S_j) = (1 / n_i) Σ_{x ∈ C_α} min{occ(i, x), occ(j, x)},

where occ(i, x) (resp. occ(j, x)) is the number of symbols belonging to S_i (resp. S_j) in the α-cluster x.

Both approaches of Step 2 allow us to analyze the α-clusters independently: indeed, we can analyze the clusters in C_α one by one through |C_α| iterations, or in parallel, exploiting the fact that α-clusters are non-overlapping pairs of indices.
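Analogously, a sketch of Definition 3, which only needs the per-cluster color counts obtained from da(S); normalization by the read length is again our assumption, and the interface is illustrative:

```python
def da_similarity(color_counts, read_len):
    """color_counts: one (occ_i, occ_j) pair per alpha-cluster, where
    occ_i (resp. occ_j) is how many positions of the cluster carry the
    color of the read S_i (resp. the genome S_j) in da(S)."""
    return sum(min(occ_i, occ_j) for occ_i, occ_j in color_counts) / read_len
```

Note that this variant never touches ebwt(S) itself, which is why Step 2(b) only reads da(S).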
Single-read and paired-end read collections - LiME is designed to work either with single-read collections or with paired-end reads of any insert size. In addition, both the original sequences and their reverse complements must be considered in order to keep reads properly oriented (the original reads' strand directions are unknown). Paired-end reads (and their reverse complements) are initially treated as two single reads, so that they can overlap and align in several ways. Then, our tool exploits the pairing information during the classification step, which we describe in the next paragraph. Therefore, in the case of single-read collections, we keep two read sets, R and R^RC, with the forward and reverse complement reads. After performing the procedures of Step 1 and Step 2 for the data structures of both R ∪ G and R^RC ∪ G, to perform classification we use the information coming either from R ∪ G or from the reverse complement R^RC ∪ G. On the other hand, if we have a paired-end read collection made of sets R_1 and R_2, the available information to perform the read classification comes from both paired-end reads and their reverse complements: thus, after performing the first two steps for the four input data structures (R_1 ∪ G, R_1^RC ∪ G, R_2 ∪ G and R_2^RC ∪ G), we obtain the four similarity matrices M_α^1F, M_α^1RC, M_α^2F and M_α^2RC, and to classify the pair of reads, we recollect the pairing information. Actually, our theoretical approach also holds if one uses a unique data structure for both mates and their reverse complements (without replicating the genomes), or if one uses the reverse complement of the genome set (rather than of the read set).
Step 3: classification - The last step consists in assigning a unique provenance to each read in the input collection. In this step, one can specify a minimum taxonomic level (e.g., genome, species, genus or higher) that will be the minimum rank considered in the classification by LiME. Indeed, LiME can classify at higher levels by taking advantage of the taxonomic lineage of taxa.
To perform the assignment, we need to take into account both strands (forward and reverse complement) or, if a paired-end read collection is available, both strands of each paired-end read.
Here, we consider the case where the input collection is paired-end (we have |R_1| = |R_2| = r). Thus, for any index i, with 1 ≤ i ≤ r, the entry R_i of a paired-end collection is associated with four different reads, each one belonging to one of the sets R_1, R_1^RC, R_2, R_2^RC. Recall that, as output of Step 2, we obtained the four similarity matrices M_α^1F, M_α^1RC, M_α^2F and M_α^2RC, each one of size r × g, where g is the size of the genome set G.
We remark that if we have a single-read collection, we can proceed similarly, skipping the information about the mates R_2 and R_2^RC. To assign each entry R_i to a taxonomic rank, we have to analyze the i-th row of each similarity matrix, denoted by M_α^1F[i], M_α^1RC[i], M_α^2F[i] and M_α^2RC[i]. Our classification strategy proceeds according to the difficulty in classifying each entry R_i, by using first separately and then jointly the information belonging to the individual mates.
We set a threshold value β (0 ≤ β < 1), and we attempt to classify R_i only if there is at least one similarity value in the four rows greater than β. The parameter β may affect the classification accuracy: it is easy to see that decreasing the value of β implies a larger number of reads that may be classified (increasing sensitivity), at the cost of a larger probability of error (decreasing precision).
For each R_i, in a first phase, we compute the maximum value M_i^x, with x = 1F, 1RC, 2F, 2RC, in each row M_α^x[i], and we take the maximum M_i of these values, i.e., M_i = max_x M_i^x. In general, we only consider the genomes in G = {G_1, . . . , G_g} that approximately reach, with the entry R_i, a similarity value equal to M_i (i.e., with a tolerance of 0.02). If M_i ≤ β, the entry R_i is said to be not classified.
Otherwise, let I_i be the set of indices k such that G_k approximately reaches, with the entry R_i, the maximum value M_i. If we have only one genome G_k (with k ∈ I_i) corresponding to the maximum value M_i, we are able to assign R_i to G_k, or simply to its corresponding taxon at the specified rank. More generally, R_i is classified to a taxon T if all the genomes G_k (with k ∈ I_i) that reach M_i belong to the same taxonomic unit T within the specified minimum taxonomic rank. In the case there exist several genomes G_k (with k ∈ I_i) belonging to distinct taxonomic units, R_i is said to be ambiguous, and we proceed with the last classification phase, in which we join the paired-end information for a re-examination of ambiguous reads, as follows.
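The first classification phase can be sketched as follows; the dictionary-based interface and the names are illustrative (only the 0.02 tolerance and the role of β come from the text above):

```python
def classify(rows, beta, taxon_of, tol=0.02):
    """rows: dict x -> list of g similarity values for one read entry R_i,
    with x in {'1F', '1RC', '2F', '2RC'} (the four matrix rows);
    taxon_of[j] is the taxonomic unit of genome j at the chosen rank.

    Returns ('not classified', None), ('classified', taxon), or
    ('ambiguous', set_of_candidate_taxa)."""
    m_i = max(max(row) for row in rows.values())  # M_i over the four rows
    if m_i <= beta:
        return ("not classified", None)
    # genomes approximately reaching M_i (tolerance 0.02)
    candidates = {j for row in rows.values()
                  for j, v in enumerate(row) if v >= m_i - tol}
    taxa = {taxon_of[j] for j in candidates}
    if len(taxa) == 1:
        return ("classified", taxa.pop())
    return ("ambiguous", taxa)
```

Ambiguous entries would then be re-examined jointly with their mate, as described in the next paragraph.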

Re-examination of ambiguous reads - In this phase, for each ambiguous entry R_i, we consider jointly the similarity values reached by the read and its mate: the new maximum value M_i is computed over the rows associated with both R_i and its mate. The procedure then follows as above. However, if the entry R_i still remains ambiguous, then we are not able to classify it: eventually, we can either leave R_i as ambiguous, if the taxonomic level of classification is fixed, or classify it at higher ranks by taking advantage of the taxonomic lineage of taxa.
Note that in both re-examinations, we take the strand that gives the maximum similarity score, rather than using the summed score of both strands, as it might create a bias in the read classification.

Results
In this section, we describe and test our prototype tool, named LiME [49], implemented in C++. It is arranged as a pipeline that runs the three steps described in the previous section. We distinguish two approaches according to the similarity measure used: LiME eBWT when S eBWT α is used, and LiME DA when S DA α is used.
In order to evaluate the performance of LiME, we have designed a series of tests on various paired-end read datasets, both simulated and real. Moreover, to guarantee a fair evaluation among tools, in these experiments we used customized uniform reference databases, with the aim of eliminating any confounding effects of differences between default databases (as specified in [50]).
We compare LiME with five well-known, recently introduced tools that allow building customized databases: CLARK-S [5], which uses spaced k-mers rather than simple k-mers; Centrifuge [4], which adapts the data structures of read-mapping algorithms based on the BWT and the FM-index [16] (such as Bowtie [51], which provides very fast alignment with a relatively small memory footprint); Kraken 2 [7], which is based on a probabilistic, compact hash table mapping minimizers to lowest common ancestor taxa; taxMaps [6], a taxonomic mapping tool in which the reads are mapped in single-end mode to an indexed database using the GEM mapper [17]; and Magic-BLAST [18], a tool for mapping DNA or RNA sequences against a whole genome or transcriptome. In order to align reads, Magic-BLAST uses a BLAST database that can be customized based on a desired set of sequences. However, unlike nucleotide-nucleotide BLAST (BLASTN), Magic-BLAST is aware of paired-end sequencing, and the best alignments are selected based on the alignment quality of the pair.

On simulated datasets
Dataset description. For the experiments on simulated data, we have used the collections made available by the benchmarking analysis carried out in [14]. Indeed, the authors of [14] evaluate the most widely used tools for metagenome classification by testing them on complex and realistic datasets, which have been designed ad-hoc for this benchmarking analysis and made publicly available [52].
In particular, we validate our approach by using two sets of metagenomes, randomly selected among those provided by Lindgreen et al. [14]: the two paired-end read datasets setA2 and setB2 reproduce the size, complexity and characteristics of real metagenomic samples, containing around 20 million sequences of length 100 belonging to 17 different phyla. Some phyla are included in equal proportions, whereas others vary more substantially between the two sets (see [14], Supplementary Table S1).
Moreover, to test the reliability of the tools, each dataset has been enhanced with a set of simulated negative control sequences that mimic sequences from "unknown" organisms (i.e., organisms whose genomes are not present in the reference database), which are likely to appear in metagenomic samples; see [14] for further details. Each of the negative control subsets added to setA2 and setB2 includes around 5 million randomly shuffled reads.
We point out that the original datasets, downloadable from [14], are not exactly the datasets setA2 and setB2 we use for our evaluations [53]. In fact, we first removed a group of reads associated with the phylum of Eukaryotes, whose species provenance was not specified in [14]. Second, since we use a custom database and up-to-date taxonomy data (such as taxonomy ids and accession numbers) downloaded from the NCBI website [54], we preferred not to include in setA2 and setB2 the groups of reads associated with 3 genomes whose entries in the NCBI database have been indicated as expiring.
The reference database G we use for these experiments comprises 930 genomes from 686 species belonging to 17 phyla as indicated in ( [14], Supplementary Table S2).

Validation step.
As the provenance of the simulated reads is known, we denote by TP (true positives) the number of reads correctly classified (i.e., assigned to their right species), by FP (false positives) the number of reads erroneously classified, and by FN (false negatives) the number of reads left unassigned. In our experiments, we consider only read assignments at species level, thus all non-random reads assigned to higher taxonomic levels (i.e., levels to which more than one species may correspond) are counted in FN. From these quantities we calculate the quality metrics: sensitivity SEN = TP/(TP + FN), precision PREC = TP/(TP + FP), and the F1 score, i.e., the harmonic mean of sensitivity and precision, F1 = 2 · PREC · SEN/(PREC + SEN). For the simulated negative control sequences, which do not belong to any known species, we set TN (true negatives) as the number of random reads that are correctly not mapped to any species, and calculate the specificity SPEC = TN/t, where t is the total number of random shuffled reads.
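These metrics follow directly from the four counts; a minimal sketch (the function name and interface are ours, not part of LiME):

```python
def metrics(tp, fp, fn, tn=None, t=None):
    """Compute SEN, PREC, F1 and (optionally) specificity as defined above.

    tn and t are only meaningful for the negative control: tn is the number
    of random reads correctly left unmapped, t the total number of random reads.
    """
    sen = tp / (tp + fn)                 # sensitivity
    prec = tp / (tp + fp)                # precision
    f1 = 2 * prec * sen / (prec + sen)   # harmonic mean of PREC and SEN
    spec = tn / t if tn is not None and t else None  # specificity
    return sen, prec, f1, spec
```

For example, 8 correct assignments, 2 wrong ones and 2 unassigned reads give SEN = PREC = F1 = 0.8.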
Experiments. Our tool is able to classify the reads at several taxonomic levels, such as genome, species or phylum. For the experiments reported in Figs. 2 and 3, we set the taxonomic rank of classification to species. To run Magic-BLAST, Centrifuge, Kraken 2 and taxMaps, we use the default values provided for paired-end reads. CLARK-S can run with default values only, and its results are filtered by using the recommended option -highconfidence (i.e., assignments with confidence score < 0.75 or gamma score < 0.03 are discarded).
For LiME, we set the minimum length of the common context α = 16, since the length of each paired-end read is 100, and we provide results for minimum similarity scores β = 0.25.
We additionally processed these datasets with different parameters (see the "Discussion" section, and Figs. 4 and 5) and show that, for fixed α, the greater the value of β, the lower the sensitivity and the higher the precision.
In Figs. 2 and 3, we report the classification results for the simulated datasets setA2 and setB2. It is possible to observe that both LiME eBWT and LiME DA , together with Magic-BLAST and Kraken 2, show higher precision than taxMaps, CLARK-S and Centrifuge on both datasets. Moreover, both LiME eBWT and LiME DA , together with CLARK-S, Centrifuge and taxMaps, achieve higher sensitivity than Kraken 2 and Magic-BLAST on both datasets. The specificity and the F1 score achieved by both LiME eBWT and LiME DA are comparable or superior to those of the other classifiers.
Finally, there is experimental evidence that taking into account the eBWT symbols (LiME eBWT ) rather than only the provenance of the symbols (LiME DA ) leads to a higher precision at the cost of a lower sensitivity.

Fig. 2 Comparison between LiME and other sequence classification tools: Sensitivity, Precision, F1-score and Specificity evaluated at the species rank on the dataset setA2. Full results are available in Additional file 1: Table S1

Fig. 3 Comparison between LiME and other sequence classification tools: Sensitivity, Precision, F1-score and Specificity evaluated at the species rank on the dataset setB2

Fig. 4 Comparison of accuracy on dataset setA2: sensitivity, precision, F1-score and Specificity evaluated at the species rank using various parameter values. Full results are available in Additional file 1: Table S1

On real datasets
Dataset description. For the experiments on real data, we have used a paired-end read collection from the Human Microbiome Project (HMP). The real metagenome SRR1804065 is a DNA stool sample of a female participant generated by Illumina sequencing, which has been previously studied in [9,55]. The paired-end reads have length 100 bps, and their total number in the original dataset was 21,873,781. Since the "ground truth" is not available for a real metagenome and a large number of reads may belong to unknown species, we first filtered this dataset by using BLAST: we mapped the reads against the whole nucleotide sequence database with a sequence identity of 98%, and discarded the pairs of reads that do not map to any genome. The resulting dataset SRR1804065 comprises 5,654,624 paired-end reads.
The reference database G we use for this experiment comprises 3,423 genomes from 1,499 species belonging to 42 phyla.
Validation step. Because a real metagenomic dataset comes from a real sequencing experiment, we cannot be certain of the true taxonomic origin of each individual read, and consequently we cannot report sensitivity and precision for this dataset. However, we can evaluate the concordance between the reads' classifications performed by the tested tools at the species level. First, we report for each classifier the following quantities: (a) the number of reads that are assigned at the species level, (b) the number of reads with a classification at taxonomic levels higher than species, (c) the number of reads that are not classified. Furthermore, with the purpose of comparing our method to the state-of-the-art tools, we represent the reads classified at the species level as set elements. Then, to measure the similarity of classification between two tools, we compute the similarity between the two sets of classified reads by using the Jaccard similarity coefficient.
The Jaccard coefficient J(A, B) is defined as the size of the intersection divided by the size of the union of the two sets A and B, i.e., |A ∩ B|/|A ∪ B|.
For our purposes, the set A (resp. B) is the set of reads classified by a tool (resp. by a competitor) at the species level, and we denote by t the total number of reads classified by joining the two classifications, i.e., t = |A ∪ B|. On the one hand, we can consider the identifiers of the reads, so that the intersection I id = A ∩ B is given by those reads that are classified by both tools at the species level. On the other hand, we can consider the species assigned to those reads, so that the intersection I as is given by those reads that are assigned by both tools to the same taxonomic unit. In order to quantify the concordance between the two tools on the basis of the Jaccard coefficient, we calculate the two agreement rates r id = |I id |/t and r as = |I as |/t.
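Under the read-identifier interpretation just described, the two agreement rates can be sketched as follows; the dictionary-based interface is illustrative, not part of any of the tested tools.

```python
def agreement_rates(assign_a, assign_b):
    """Concordance between two classifiers at the species level.

    assign_a / assign_b: read id -> assigned species, for the reads that a
    tool (resp. a competitor) classified at the species level.
    Returns (r_id, r_as): fraction of reads classified by both tools, and
    fraction assigned by both to the same species, over t = |A ∪ B|.
    """
    ids_a, ids_b = set(assign_a), set(assign_b)
    t = len(ids_a | ids_b)   # all reads classified by either tool
    i_id = ids_a & ids_b     # reads classified by both tools
    i_as = {r for r in i_id if assign_a[r] == assign_b[r]}
    return len(i_id) / t, len(i_as) / t
```

For instance, if two tools classify reads {r1, r2, r3} and {r1, r2, r4} and agree on the species of r1 only, then t = 4, r id = 0.5 and r as = 0.25.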
Experiments. Also for the real metagenome, in order to guarantee a fair evaluation, we use a customized reference database for all the tested tools.
As for the simulated reads, we classify the reads at the species level. In fact, although our tool can classify reads at different taxonomic levels, we fixed the taxonomic level in order to fairly compare our tool to the others. For Magic-BLAST we use the default values provided for paired-end reads, and for CLARK-S we use the recommended option -highconfidence to filter the classification results. Centrifuge, Kraken 2 and taxMaps use default values, while LiME uses the parameters α = 16 and β = 0.25, as for the simulated reads. Table 1 reports, for each tool, the number of reads that are assigned to a particular species, the number of reads that are assigned to higher taxonomic levels, and the number of unclassified reads. Note that CLARK-S classifies reads only at a given taxonomic level, thus its number of reads assigned to higher ranks is 0.
On the real metagenome, our approach achieves the highest number of classified reads at the species level: as shown in Table 1, both LiME eBWT and LiME DA classify a number of reads larger than CLARK-S, Centrifuge, Kraken 2, Magic-BLAST and taxMaps, while the smallest number of unclassified reads is reached by Centrifuge.
In Table 2, we show numerically how much the classification results differ between pairs of tools. In particular, we perform a direct comparison between our tool and the other tools; for completeness, we also report the comparison between Magic-BLAST and the other tools, since we built the customized reference database by means of BLAST. However, we do not report direct comparisons between other pairs of tools, as they are out of the scope of this work. More precisely, in Table 2, we compare the classification results of both LiME eBWT and LiME DA to those of CLARK-S, Centrifuge, Magic-BLAST, Kraken 2 and taxMaps. Comparing LiME eBWT and Magic-BLAST, not only is the percentage of individual reads classified by both tools equal to 95.2% of all classified reads, which is the highest value in Table 2, but 93.6% of all classified reads are also assigned to the same taxon by LiME eBWT and Magic-BLAST. All the other percentages reported in Table 2 are strictly lower.
Analogously to the experiments on simulated data, our strategy with the first approach (LiME eBWT ) classifies a smaller number of reads than with the second approach (LiME DA ); however, the agreement rates r id and r as in Table 2 between Magic-BLAST and our strategy are higher with the first approach. Thus, in Fig. 6, we report a more in-depth analysis of the classification differences between LiME eBWT and the other tools only; similar observations can be deduced for LiME DA by looking at Table 2. In Fig. 6, we evaluate the classification concordance among LiME eBWT , Magic-BLAST and one other tool at a time. In each picture, the sets of classified reads are represented by a Venn diagram: the reads lying in the intersection of two/three sets are those classified by both/all tools. The number of individual reads classified by all three tools is greatest when comparing LiME eBWT , Magic-BLAST and Kraken 2 (see Fig. 6c), and smallest when comparing LiME eBWT , Magic-BLAST and CLARK-S (see Fig. 6a). Moreover, we notice that the highest number of reads classified by one single tool only is registered for CLARK-S (see Fig. 6a).
In Fig. 6, we additionally report the percentage of reads with a concordant classification between two/among three tools: the assignments performed by LiME eBWT are closest to those of Magic-BLAST (as shown in Table 2), without however largely differing from those performed by the other tools. In fact, according to Fig. 6a, d, among the reads that are classified by all three tools, around 98% are assigned to the same taxon by all of them. Overall, for the real metagenome the agreement rates among the tested tools are high, and LiME's classification meets that of state-of-the-art classifiers such as CLARK-S, Centrifuge, Kraken 2 and taxMaps.

Discussion
In this section we analyze the impact of the parameters α and β and discuss the data structures and the resource usage.
Parameter sweeps. In Figs. 4 and 5, we analyze the parameters of our method: the minimum length α of the context in Step 1 and the threshold value β in Step 3. The parameter sweep covered values in the interval [16, 22] for α and in the interval [0.15, 0.40] for β.
We note that, in both approaches, the parameter sweeps give approximately the same, near-optimal levels of accuracy (see Additional file 1: Table S2, for further details). The experimental results confirm that, for fixed α, the parameter β affects the classification accuracy: indeed, decreasing the value of β yields a higher sensitivity at the cost of a lower precision. So, as β grows, the sensitivity curve decreases, while the precision curve increases, reaching almost 100%. This suggests that the performance is not overly sensitive to particular parameter settings. From the analysis of the parameter sweep results, and from the experimental results reported in the comparisons with other tools, α = 16 and β = 0.25 appears to be a good sensitivity-precision trade-off.

Fig. 6 Venn diagrams of the reads classified at the species level: (a) related to LiME eBWT , Magic-BLAST and CLARK-S, (b) related to LiME eBWT , Magic-BLAST and Centrifuge, (c) related to LiME eBWT , Magic-BLAST and Kraken 2, (d) related to LiME eBWT , Magic-BLAST and taxMaps. The black numbers inside each circle, if summed, give the number of reads classified by the corresponding tool. The percentage under each number denotes the percentage of identically classified reads between two/among three tools (see Additional file 1: Table S5)

Observations on the data structures. We recall that our tool takes as input the files containing ebwt(S), da(S) and lcp(S) of the collection S of reads and reference genomes. This task can be achieved using, for example, one of the following tools: BCR [32,56], eGSA [47,57], gSACA-K [58,59], GAP [30] or eGAP [34,60]. It is interesting to note that the data structures for the genome database can be built once, stored, and reused: for each new experiment, we only need to build the data structures for the read collection and merge the two. More precisely, as the set G of genomes is the same for each experiment, we can build the data structures of G only once, by using eGSA for instance, and keep them stored.
Then, for each new experiment, we can build the data structures for each read collection R, for instance by using BCR (a tool for very large collections of short reads), and finally merge the two data structures by using eGAP, obtaining those for the entire collection S. Alternatively, by exploiting the mathematical properties of the permutation associated with the eBWT and the LCP array, we could use BCR [31,32] to build the data structures for G once and then update them with the symbols from R (hence, without constructing the eBWT of the reads), in order to obtain the data structures for S. However, finding the best method for building our data structures is beyond the scope of this paper.
We also remark that the data structures used by our strategy are intrinsically dynamic: by taking advantage of the inherent properties of the eBWT and LCP array (see, in particular, [31], Remark 3.6), the collection S can be modified by inserting or removing single sequences. This allows modifying the read dataset by inserting or removing a group of reads, or updating the reference database with a newly arranged set of genomes, while modifying the α-clusters accordingly.
Moreover, we note that in the recent literature there are several papers with the aim of introducing new lightweight and parallel computational strategies for building the data structures we use in our tool, see for instance [33,35].
In this paper, we use the algorithm eGAP [34,60] to merge ebwt(R) and da(R) (built by the BCR tool) with ebwt(G) and da(G) (built by the eGSA tool) and obtain ebwt(S) and da(S); the array lcp(S) is obtained as a by-product of the algorithm. In particular, by using the option -trlcp (see [60]), one can compute lcp(S) with values truncated at k, where k is the length of the longest string in the read collection; in this case the merging procedure performs O(N × k) sequential scans. For our purposes, any k > α suffices, so k can be chosen to minimize the number of steps.

Resources
The construction of the required data structures is independent of our method, so one can use any strategy according to the resources available, preferring, for example, a tool that works in internal memory rather than in external memory, or vice versa. In addition, such data structures also allow efficient compression of the original files (FASTA or FASTQ), see for instance [61,62]. Note that the other tools (Centrifuge, CLARK-S, Kraken 2, taxMaps and Magic-BLAST) build ad-hoc data structures for the database used in the classification. For instance, CLARK-S needs about 120 GB of RAM for the database used for its classification of the simulated datasets, while Centrifuge, Kraken 2 and taxMaps need about 10 GB of RAM. Moreover, some tools require some information to be specified at database-building time: for instance, CLARK-S requires setting the rank level and Kraken 2 requires setting the k-mer length. Thus, they do not have the same flexibility in choosing how to build these data structures, nor the same independence from the parameter sweeps, as LiME.
Classification experiments were performed by using a DELL PowerEdge R630 machine, 24-core machine with Intel(R) Xeon(R) CPU E5-2620 v3 at 2.40 GHz, with 128 GB of shared memory. The system is Ubuntu 14.04.2 LTS.
Regarding classification time, Kraken 2 is the fastest tool, while Magic-BLAST and taxMaps are the slowest (taxMaps took about 13 hours to classify the simulated dataset setA2). Full results for the dataset setA2 are available in Additional file 1: Table S3.
Observe that, for the classification of the dataset setA2, CLARK-S is the tool that requires the most internal memory, while the internal memory used by LiME is about 19,034 MB for the similarity matrix and 786 MB for the other auxiliary data structures.
Moreover, we observe that LiME, unlike the other methods, processes all the reads at the same time, so that in the current implementation we need to keep the whole similarity matrix in internal memory. If one wanted to run our tool on a system without enough RAM to store the similarity matrix, one could either store that matrix in an external file, or build the list of the clusters for all the reads and then consider one read at a time, analyzing only the clusters that contain symbols of that read. In the latter case, the required time for Step 2 and Step 3 of LiME eBWT is a few milliseconds and the maximum resident set size is only 16,240 KB. Full results for the dataset setA2 are available in Additional file 1: Table S3.
So, on the basis of the available resources, one could choose the number of reads to be compared simultaneously, in order to get the desired time-memory trade-off.
Finally, since our method scans sequentially the required data structures, we could analyze unknown sequences of very large collections by using mainly external memory and reduce the internal memory usage.
The multi-threading version of our tool exploits the fact that our strategy allows a certain degree of parallelization, since the analyses of distinct clusters are independent of each other: each thread can handle distinct parts of the input files by reading them through local buffers (Additional file 1: Table S4).
Moreover, we noticed that LiME DA is faster than LiME eBWT . This experimental evidence is easily understood, as LiME DA saves time by scanning only the DA, rather than both the DA and the eBWT, during the cluster analysis. More specifically, with respect to our three-step method, LiME DA saves time in the second step (the first and the third steps are the same for both approaches). However, the execution time of the third step largely depends on the number of non-null similarity scores calculated during the second step: thus, the higher the sensitivity, the more the execution time of the third step is likely to increase.

Conclusion
In this paper, we present a versatile, alignment-free and lightweight tool for metagenomic classification, named LiME, which, by sequentially scanning fundamental string data structures (eBWT, LCP and DA), allows us to efficiently identify the genome to which each read belongs.
Our method is based on two possible approaches: the first, LiME eBWT , takes into account the distinct symbols in the eBWT that precede a common context shared by the read and the genome being compared; the second, LiME DA , takes into account only the colors in the DA of the read and the genome being compared. Experiments on both simulated and real datasets corroborate our intuition that the first approach is more precise and less sensitive than the second, while both keep high precision and sensitivity.
Moreover, we compare LiME (in both approaches) with the recently introduced state-of-the-art tools Magic-BLAST, CLARK-S, Centrifuge, Kraken 2 and taxMaps.
In the experiments, we focused our attention on species-level classification, but LiME can also work at higher taxonomic levels, such as genus, family, class or phylum. Further experiments at the phylum level on the simulated datasets show that the relative phylum abundance estimated by LiME matches the dataset composition designed in [14] with very high precision. More precisely, LiME eBWT on setA2 (resp. setB2) achieves 95.55% (resp. 95.95%) sensitivity, and thus an F1 score of 97.66% (resp. 97.91%). In particular, we obtain only 129,470 (resp. 40,895) ambiguous reads and 824,036 (resp. 779,515) unclassified reads, and we correctly classify 20,478,036 (resp. 19,419,539) reads in setA2 (resp. setB2).
For the real metagenome dataset, as the "ground truth" is not available and a large number of reads may belong to unknown species, we first filtered the downloaded dataset by using BLAST and extracted information to build a customized reference database. Since Magic-BLAST is based on BLAST, we may consider the results of Magic-BLAST as a benchmark for the classification results. We can thus observe that only 42,788 reads of the considered real dataset classified by Magic-BLAST are not classified by LiME eBWT : indeed, our tool classifies 95.2% of the same reads as the aligner Magic-BLAST. By contrast, CLARK-S (resp. Centrifuge, Kraken 2 and taxMaps) fails to classify 408,776 (resp. 185,489, 116,964 and 180,819) reads that are, on the contrary, classified by Magic-BLAST. Furthermore, our tool assigns 93.6% of the reads to the same taxa as the aligner Magic-BLAST does.
Finally, we observe that an LCP-interval [63] is a particular α-cluster (pS, pE) in which at least one index i, with pS < i ≤ pE, has an LCP value equal to α. Moreover, there exist several methods that are based on clustering eBWT symbols, with [62,64,65] or without [21,23] the LCP array. Unlike these methods, which are based on partitioning the LCP values, we do not impose any constraint on the α-cluster size.
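As an illustration of the relation between LCP values and α-clusters, the following sketch detects the maximal intervals of positions whose suffixes share a common context of length at least α. The indexing convention (a cluster (pS, pE) requires lcp[i] ≥ α for pS < i ≤ pE) is an assumption inferred from the description above, not LiME's actual implementation.

```python
def alpha_clusters(lcp, alpha):
    """Return the maximal intervals (pS, pE) such that lcp[i] >= alpha
    for every pS < i <= pE; positions pS..pE of the eBWT then correspond
    to suffixes sharing a context of length at least alpha."""
    clusters, start = [], None
    for i in range(1, len(lcp)):
        if lcp[i] >= alpha:
            if start is None:
                start = i - 1        # the cluster opens at the previous position
        else:
            if start is not None:    # an lcp value below alpha closes the cluster
                clusters.append((start, i - 1))
                start = None
    if start is not None:            # a cluster may extend to the last position
        clusters.append((start, len(lcp) - 1))
    return clusters
```

For instance, with lcp = [0, 3, 5, 1, 4, 4, 0] and α = 3, the sketch reports the clusters (0, 2) and (3, 5); note that no constraint is imposed on the cluster size, matching the remark above.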
In conclusion, we believe that our tool can be useful in a variety of applications both in metagenomics and in genomics.