 Proceedings
 Open access
 Published:
Ensemble analysis of adaptive compressed genome sequencing strategies
BMC Bioinformatics volume 15, Article number: S13 (2014)
Abstract
Background
Acquiring genomes at singlecell resolution has many applications such as in the study of microbiota. However, deep sequencing and assembly of all of millions of cells in a sample is prohibitively costly. A property that can come to rescue is that deep sequencing of every cell should not be necessary to capture all distinct genomes, as the majority of cells are biological replicates. Biologically important samples are often sparse in that sense. In this paper, we propose an adaptive compressed method, also known as distilled sensing, to capture all distinct genomes in a sparse microbial community with reduced sequencing effort. As opposed to group testing in which the number of distinct events is often constant and sparsity is equivalent to rarity of an event, sparsity in our case means scarcity of distinct events in comparison to the data size. Previously, we introduced the problem and proposed a distilled sensing solution based on the breadth first search strategy. We simulated the whole process which constrained our ability to study the behavior of the algorithm for the entire ensemble due to its computational intensity.
Results
In this paper, we modify our previous breadth first search strategy and introduce the depth first search strategy. Instead of simulating the entire process, which is intractable for a large number of experiments, we provide a dynamic programming algorithm to analyze the behavior of the method for the entire ensemble. The ensemble analysis algorithm recursively calculates the probability of capturing every distinct genome and also the expected total sequenced nucleotides for a given population profile. Our results suggest that the expected total sequenced nucleotides grows proportional to log of the number of cells and proportional linearly with the number of distinct genomes. The probability of missing a genome depends on its abundance and the ratio of its size over the maximum genome size in the sample. The modified resource allocation method accommodates a parameter to control that probability.
Availability
The squeezambler 2.0 C++ source code is available at http://sourceforge.net/projects/hyda/.
The ensemble analysis MATLAB code is available at
Introduction
Progress in DNA amplification techniques [1] and high throughput cell cultivation methods [2, 3] allow capturing of genomes at singlecell resolution. However, deep sequencing and assembly of all of the cells in a sample is prohibitively costly since there are millions sometimes billions of cells. The good news is that, to capture all distinct genomes, deep sequencing of every cell should not be necessary as the majority of cells are biological replicates. For instance, the number of detected distinct species in the human gut was estimated to be in the order of 1,000, while the number of microbial cells in a human body, most of which reside in the gut, is in the order of 100 trillion [4]. We call this effect the sparsity of distinct genomes in a sizeable microbial population. Biologically important samples are often sparse in that sense. We use sparsity to capture all of the genomes in a sample.
During the last decade, the rich field of compressed sensing in nonadaptive [5–7] and adaptive (distilled sensing and refinements) forms [8, 9] has been developed to reduce the cost of sampling and reconstruction of sparse signals [10, 11]. In the general form of the problem, both adaptive and nonadaptive methods reduce the number of sensing in comparison to the nonadaptive naive sensing, and make it proportional to the number of distinct events times log of the data size [12]. However while adaptive compressed methods may seem more cumbersome than their non adaptive counterparts, adaptive methods often improve detection and estimation performance [12].
Our problem is an instance of a larger class of problems called element distinctness, which is a popular problem in massive data analysis with numerous applications and different variants including (i) finding if there are duplicates in a list, (ii) calculating the number of distinct elements (support size), and (iii) estimating the distribution of distinct element populations [13]. In the vast majority of element distinctness problems, the complexity of deciding if two elements are identical is of O(element size). We distinguish between different classes of the problem based on the size of an element in comparison with the size of the entire population.
In some of the problems such as estimation of the number of distinct words an author knows (e.g. Shakespeare) [14], the size of an element is very small in comparison with the size of the problem. In some others, hash functions are used to reduce the element size. Such variants of the problem have been investigated deeply to find optimal algorithms both in time and space. For data stream analysis, if n is the size of the language of elements, then the space complexity of optimal probabilistic (1 ± ∈)approximation algorithm is O(∈^{−2} + log n) and its time complexity is O(n) [13]. In such algorithms, each element is either completely sensed or not sensed at all, i.e., no partial sensing of an element. However, there are important variants of the problem in which each element is complex, such as the case of whole genome sequences. The contributions of this paper and its predecessor [15] is (i) introduction of this other class of problems with large element size with respect to the sample size, and (ii) the first adaptive compressed method to the best of our knowledge to solve an instance of this problem in the form of finding all distinct genomes with reduced sequencing effort in a sparse microbial sample.
Assume that the distinction of two cells is based on the differences between their genomes. Therefore, the complexity of pairwise distinction is a function of the lengths of the DNA sequences, each in the order of 1,000,000 − 10,000,000 basepairs for a bacterial cell and 3,300,000,000 basepairs for a human cell, with an average size of m. A sample contains n cells, for instance 10,000,000 cells, where n and m are in the same order. In this problem, there are two types of cost: (i) wetlab cost related to sequencing, i.e., reading the DNA sequence digitally, and (ii) computational cost of genome assembly and comparison, i.e., digital reconstruction of the whole genome sequences from the sequencer output. The output reads of a sequencer are short randomly sampled subsequences of the genomic sequence which cover the genome multiple times. The number of reads that contain a genomic location is called the coverage. In the assembly, the reads are concatenated to reconstruct the whole genome. Sequence assembly is a challenging task due to sequencing errors and repetitive elements. To compare a number of sequenced read data sets, a coassembly software such as HyDA [16] is used. The output of HyDA provides us with measures to compare the extent of similarity between the underlying genomes from which the read data sets are derived.
Wetlab cost includes the monetary cost which is linearly proportional to the total number of basepairs sequenced. If m is the average genome size and c is the necessary coverage, then the cost is O(nmc) for the exhaustive sequencing of all cells. Computational cost includes the space complexity and time complexity of assembly and comparison. If the assembly is done using the de Bruijn graph [17], the time complexity is O(nmc log m) and space complexity is O(nm). For instance, a typical realworld scenario involves c = 20, m = 5,000,000 bps, and n = 10,000, 000 for which the exhaustive wetlab, time, and space cost complexities would be respectively O(10^{15}), O(10^{16}), and O(10^{13}). The exhaustive approach is not tractable even for a small population. Hence, sublinear algorithms are needed to solve the problem.
We propose an adaptive compressed method, also known as distilled sensing. Our ultimate goal is to reduce the nm factor in wetlab, time, and space complexities to sm log n in which s is the number of distinct genomes in the community. We cannot use the algorithms and analyses given for the classical compressive sensing approach since our sparsity is unordered set sparsity rather than ordered sparsity by time or space. As opposed to group testing in which the number of distinct events is often constant and sparsity is equivalent to rarity of an event, sparsity in our case means scarcity of distinct events in comparison to the data size. It is also important to note that we do not have positional access (a.k.a. random access in the computer science literature) to the DNA sequence, which limits the use of many dimensionality reduction techniques [18].
We previously defined the problem and proposed a distilled sensing solution based on the breadth first search strategy [15]. To evaluate the performance of our algorithm, we simulated the whole process including genome amplification by MDA [19], sequencing by Illumina, (co)assembly by HyDA [16], and comparison. We proposed an adaptive resource allocation method to determine the amount of sampling of each genome in each round, which is related to the one proposed by [8, 9]. Due to the computational intensity of each of those processes, we were able to demonstrate the power of our approach for a few instances of the problem, but the behavior of the algorithm for the entire ensemble is yet to be studied.
In this paper, we give a new algorithm based on the depth first search strategy and modify our previous breadth first search resource allocation and set selection. Since simulating the entire process is timeconsuming, we provide a dynamic programming algorithm to analyze the behavior of the method over the entire ensemble. Our algorithm recursively calculates the probability of capturing every distinct genome and also the expected total sequenced nucleotides for a given population profile. It is important to note that even though the population is known for the ensemble analysis algorithm, the actual sensing algorithm works without that knowledge. That is our sensing algorithm can be applied to any population, even without knowing the profile. To have a clear view of the effect of each parameter on the expected cost, we assume that our model is error free at this stage. The results in this paper may lead to theoretical solutions and analysis with more complete model assumptions in the near future.
Method
Our method consists of two parts: (i) wetlab process, and (ii) computational process. On the wetlab side, we are assuming to have a high throughput device which is capable of isolating each cell in the sample, cultivate it, then extract the DNAs of each cultivated cell and amplify them. This device should also be capable of sampling customized amount from selected amplified DNAs, pool them, and prepare them for sequencing. If we would like to sequence more than one pool of samples in the same run, the device should uniquely barcode each pool before sending the samples for sequencing. Although there is currently no such device, one can envision automated microfluidic devices in near future based on the technologies already developed for separation, cultivation, DNA extraction, amplification, and barcoding [2, 3].
The output of sequencing is a library of reads which will be demultiplexed based on the barcodes. Therefore, for each pool of sampled amplicons which is sent for sequencing, a read data set is obtained. All the read data sets at each round are coassembled (with HyDA [16]). In the coassembler, to each read data set a unique color is assigned. All the colors are assembled on a single de Bruijn graph. The output is a list of contigs and their colored average coverages. This provides us with a measure to compare the similarities between the assembly of different colors. Based on those similarities, we decide if any two assemblies could potentially be from the same genome.
In the naïve exhaustive approach, each isolated cell is sampled and deeply sequenced. Based on the similarity measures provided by the coassembler, distinct genomes are then identified. In the adaptive method, at each round a number of collections of cells are selected. For each collection, the amount to be sampled from each cell is computed based on the analysis in the previous rounds. The output read data sets are analysed and the next round of sampling is calculated. We describe the details of the sampling collections and size in this section. First, let's clarify the assumptions for our model.
Model assumptions
The definition of distinct genomes may vary in different applications. We, instead of phenotypic notions like species or strains, use a quantifiable genomic measure to determine the distinction of genomes. We define two genome sequences to be distinct if the ratio of their differences over the whole genome size is above a threshold, called τ. That threshold is input by the user and controls a tradeoff between sensitivity and specificity [15].
Let C = {C_{1}, C_{2}, . . . , C_{ n }} comprise the input community of cells. As described earlier, we are given a device that can sense each cell C_{ i } partially at random, and the cost of a sensing is proportional to the sensing size, i.e., the number of nucleotides sampled. As the sensing size increases, the reconstructed genome of C_{ i } after the assembly converges to completion. To introduce appropriate notations, let I ⊆C be a subset of the community. Let be the sensing of the aggregated cells in I, which is the superposition of all sensing taken from the cells in I, i.e., the aggregated read data set or equivalently the resulting assembly. The key observation is that if there are enough replicates of a particular distinct genome in I, then that distinct genome can be completely captured from the superposition of partial sensing of the replicates provided that the partial sensing are random and unbiased.
Comparison of assemblies of two sets
Let I_{1}, I_{2} ⊆C be two subcollections of the input community, and A_{1}, A_{2} the corresponding aggregate sensing. If all of the distinct cells represented in I_{1} are also represented in I_{2}, then we say that A_{1} subsumed by A_{2} (A_{1} ≤ A_{2}). In ideal world with no errors and genome variations, A_{1} is called subsumed in A_{2} if A_{1} is a subset of A_{2}. However, in real world, while two genomes are considered similar (from the same type), they may have some variations like single nucleotide polymorphisms. In addition, errors, noise, and contaminations in sequencing and assembly make the situation harder to handle just by pure mathematical subset definition. To address this issue, the subset definition is relaxed to ignore those differences between two assemblies that are less than a threshold, τ. Therefore, subsumption [15] is defined as follows
In this equation, D_{τ} (A_{1}, A_{2}) quantifies the differences in assembly of A_{1} with respect to A_{2} which is more than τ and is defined as follows
in which A_{1}\A_{2} = {b ∈ A_{1}b ∉ A_{2}},  ·  denotes the total assembly size. In other words, τ is the maximum differences tolerated between two genomes which are considered similar. Parameter τ is user defined and τ gives a tradeoff between specificity and sensitivity of the algorithm to distinguish between two distinct genomes [15].
Search strategies
Our algorithm aims to assemble all of the distinct genomes represented in C and identify at least one cell per distinct genome. The objective is to minimize the total number of bases required to be sequenced. To reach this goal a search tree is created and explored iteratively to find the leaves which are the sequenced and assembled species. In the first iteration, the set of deeply sequenced and completely assembled distinct genomes, I, and its aggregated sensing, A, is empty. The algorithm divides the n cells C_{1}, . . . , C_{n} into two sets I_{1,1} = {C_{1}, . . . , C_{[n/2]}} and I_{1,2} = {C_{[n/2⌋+1}, . . . , _{ C }n}. Denote {\overline{I}}^{1}=\left\{{I}_{1,1},{I}_{1,2}\right\}. In each iteration i, I_{ i,j }'s are subsets of C and are chosen based on the results in the iteration i − 1. The search tree of I_{ i,j } to find leaves can be traversed by different methods. Here we choose two methods, breadth first search (BFS) and depth first search (DFS). In the BFS strategy, in each iteration i, all I_{ i,j }'s are explored at the same time, while in DFS, nodes (I_{ i,j }'s) are explored sequentially in time and analyzed one after the other.
In the recursive call on {\overline{I}}^{i}=\left\{{I}_{i,1},...,{I}_{i,mi}\right\}, the set of cells is sensed according to the resource allocation policy. Then, the aggregated sensing, A_{ i,j }, for each I_{ i,j } is obtained by sequencing and assembly. Those I_{ i,j }'s that contain a single cell, i.e., I_{ i,j } = 1, are leaves, and if they are fully assembled, they will be added to the list of deeply sequenced cells. In other words, if the corresponding assembly is reliable, i.e., c_{ i,j } ≥ M_{ l }, for a given constant M_{ l }, I_{ i,j } will be popped from {\overline{I}}^{i} and pushed to I. In addition, A_{ i,j } will be added to A.
For the BFS search strategy to find the optimum path to continue, a subset of {\overline{I}}^{i} with minimum number of cells is chosen that covers all of the assembly. In other words, the minimum assemblyset cover {\overline{I}}_{cover}\subseteq {\overline{I}}^{i} with minimum number of cells is found for which \overline{A}\cup A is subsumed in {\overline{A}}_{cover}\cup A, i.e.,
Second line can be derived from the first line because \left({\overline{A}}_{cover}\cup A\right)\subseteq \left(\overline{A}\cup A\right). In these notions {\overline{A}}_{cover} and \overline{A} are the the resulting superposition of partial sensing and equivalently the corresponding assemblies of all cells represented in {\overline{I}}_{cover} and {\overline{I}}^{i}, respectively. The search of the subtrees rooted at {I}_{i,j}\notin {\overline{I}}_{cover} are terminated, and the next level set {\overline{I}}^{i+1}:={\overline{I}}_{cover}.
For the DFS strategy, the minimum set cover is calculated gradually during several iterations. Since in the DFS, in each iteration {\overline{I}}^{i} only includes two subsets, and the number of cells in both subsets are (almost) equal, the minimum set cover can be calculated based on the greedy algorithm. The I_{ i,j } with maximum assembly size has the highest priority to be in the minimum set cover. Therefore, {\overline{I}}_{cover}:=\left\{{I}_{i,j}\right\}, and the second I_{ i,j }′ will be pushed to the stack \overline{W}, which is the waiting list of the untraversed nodes in the tree. If A_{ cover } is subsumed in A, then A_{ cover } will be emptied and the last element will be popped from \overline{W} and pushed to{\overline{I}}_{cover}. This will continue until A_{ cover } is not subsumed in A. In the end, the next level set {\overline{I}}^{i+1}:={\overline{I}}_{cover}.
For both search strategies, all subsets in {\overline{I}}^{i+1} will be divided to two almost equal size subsets, which concludes iteration i. This algorithm will continue until {\overline{I}}^{i} and \overline{W} are empty. Figures 1 and 2 depict examples of the DFS and BFS strategies on 10 cells with 3 distinct genomes shown in different colors.
Resource allocation
Resource allocation policy determines the size of partial sensing from each cell in each step. This is done with two objectives: (i) the amount of sensing from each element is such that with a given probability all of the distinct genomes present in I_{ i,j } can be reconstructed almost completely from the superposition of partial sensing, and (ii) the total sensing size in the whole algorithm is minimized.
Assume the input set of cells I_{ i,j }is obtained from splitting I_{ i }−1,k (for clarity we call it I_{ parent }). Let t_{parent}, a_{parent}, and c_{ parent } be total nucleotides sampled, the assembly size, and the average coverage of I_{ parent } and {c}_{i,j}^{\prime} and {a}_{i,j}^{\prime} be the intended coverage and assembly size of I_{ i,j }. We assume that there is a constant minimum coverage M_{ u }, such that if the coverage is above M_{ u }, then the resulting assembly covers the entire genome, i.e., does not have any gaps. We would like the actual coverage c_{ i,j } after the sequencing and assembly to be at least M_{ u }, so we let {c}_{i,j}^{\prime}={M}_{u} as a surrogate. Hence, the total nucleotides t_{ i,j }to be sampled and sequenced from I_{ i,j } is estimated by
In this equation a′ should be estimated from c_{ parent } and a_{ parent } and may differ from the actual value a obtained after sequencing and assembly. If I_{ i,j } is a leaf, i.e., I_{ i,j }  = 1, then the algorithm does a deep sequencing of the single cell in I_{ i,j }. In that case, the algorithm repeats the resource allocation and sequencing until a sufficient actual coverage is reached, i.e., {c}_{i,j}\ge {M}_{u}.
DFS versus BFS
Although the two search strategies are similar, they have differences in several aspects:

In DFS, the number of cell subcollections to be explored in each round is fixed. For instance, it is two in the current implementation. In BFS, this number is dynamic. For instance, in the first round of the example in Figure 2, it is two and in the third round is four. From another point of view, the number of rounds in BFS is fixed, [log_{2}n], whereas that in DFS is variable depending on the setup. The number of rounds and the number of subcollections are both desired to be minimum because each incurs a cost: each round incurs a setup cost for sample preparation and runningandstopping the sequencer and each subcollection requires a unique barcode and incurs the cost of barcoding and sequencing the base pairs in the barcode. Hence, there is a tradeoff between the two costs that determines the most suitable algorithm. The optimal algorithm should consider both costs.

In the process of choosing the minimum set cover, sets are compared to determine the subsumption relationships. In DFS, one side of the comparison is always A in which each single cell is deeply sequenced and completely assembled. Therefore, our information about that side is almost complete and has minimal error. The resource allocation of the algorithm is applied such that A_{ i,j }'s for all i and j are completely sequenced on average to reduce unwanted missing of some distinct genomes. If the resulting coverage of A_{ i,j }is low, i.e., {c}_{i,j}\le {M}_{i} for a constant M_{ i }, then the data is considered unreliable and I_{ i,j }is treated as if it contains a new distinct genome. In that case, I_{ i,j }would be divided into two groups and explored further in the following rounds with increased requested resources. Although A_{ i,j }'s have deep coverage on average, there is a nonzero probability of missing small and less abundant distinct genomes. We explore these probabilities in the Results section. In BFS, the missing probability is more since on both sides of the ≤ relation, cells are deeply sequenced not individually but on average, which increases the error and intrinsic noise in comparisons.
The mentioned differences between the two strategies, do not result in considerable performance priority of one over the other method. Each are proper for a specific condition. In this paper, we compare the two methods using casebycase setups. We show that the total nucleotides required is almost the same in both methods. We could not do the ensemble analysis for BFS using dynamic programming since the exploration of different subcollections in one round are coupled. For instance, the four subcollections to be compared in the third round of Figure 2 come from two parent subcollections in the second round. That creates an inevitable coupling between the parents in the second round. Without dynamic programming, exploring all of the permutations of cells to provide ensemble analysis is intractable. Therefore, we provide the ensemble analysis only for the DFS algorithm.
Implementation
The pseudocode of the algorithm is given in 1, 2, and 3. COMPRESSEDSEARCH is the main function and SELECTNEXTLEVELSETS and Subsumed are two subfunctions of the algorithm. This algorithm has been implemented in the tool squeezambler 2.0.
squeezambler 2.0 versus squeezambler 1.0
The tool squeezambler 1.0 has been implemented based on the BFS algorithm given in [19]. There are three main differences between the BFS algorithm implemented in squeezambler 1.0 and the one in squeezambler 2.0:

In the recursion, the method to choose subsets passed to the next level is different in the two implementations. In squeezambler 1.0, every subset that is subsumed in another one is eliminated from further analysis. However, this is not the optimum method to choose next level sets. In squeezambler 2.0, a collection of subsets is chosen, which will cover the whole assembly with minimum number of cells.

The resource allocation in squeezambler 1.0 was design for those sequencing technologies that have nonuniform coverage. That resource allocation results in assembly gaps and in some cases causes missing some of the distinct genomes. The resource allocation in squeezambler 2.0 is modified such that it reduces the random missing of the distinct genomes and let us predict the probability of missing genomes. This probability is analysed in the Results section. We have added the resource allocation described in this paper to squeezambler 1.0. The new version is called squeezambler 1.1.

The parameter τ in squeezambler 1.0 is variable and is dependent on the number of cells involved in each round. As the number of cells increases, τ decreases. Reduced τ increases the number of base pairs required to be sequenced when error appears in the reads. In squeezambler 2.0 τ is set to be fixed in the whole algorithm.
Ensemble analysis
We provide the ensemble analysis on the DFS algorithm to calculate the expected total number of sequenced base pairs and the probability of capturing every distinct genome over the entire ensemble of n! permutations of C. To reduce the complexity, instead of exhaustively trying multiple permutations we developed a dynamic programming algorithm to calculate the results. In this analysis, the simulation of the entire process is replaced by black boxes which are mathematical models of the behaviors of the process. To ease finding decoupled effects of different parameters in the algorithm, we do not consider the sequencing and assembly errors in our model. Another assumption, again in the interest of other important sparsityrelated parameters, is to consider uniformity of coverage. This assumption is not far from reality. With the advancement of automated microfluidic cell separation and cultivation devices [2, 3], the genome can be captured from cultivated cells and sequenced with close to uniform coverage. This is different from the assumption we made in [15] for which a genome was amplified from a single cell using multiple displacement amplification and suffered from highly uneven coverage after sequencing. Although we are assuming uniform coverage distribution in this work as opposed to in [15], this is only a convenient choice that does not change the algorithm. This assumption is reflected only in (5).
Given the uniformity of coverage, we assume
in which c is the sequencing coverage, a is the total assembly size, g is the genome size, and M_{ u } is a constant that defines the minimum coverage to obtain a complete assembly of the whole genome. For more advanced models, see [20]. In the ensemble analysis, we treat (co) assembly as a black box oracle that knows g and M_{ u }, the input to which is the total sequenced nucleotides and the output of which are a and c. That is based on the assumption that ac is the total sequenced nucleotides, i.e. there are no sequencing errors. In this case the resource allocation formula in (4) will be reduced to {t}_{i,j}={M}_{u}{a}_{i,j}^{\prime}. In the worst case, all of the distinct genomes in I_{ parent } are also represented in Ii,j which means {a}_{i,j}^{\prime}\ge {a}_{parent}. Therefore, the total nucleotides can be estimated by
which is twice the lower bound as a safe margin.
Dynamic programming algorithm
The dynamic programming algorithm can be divided into three main functions Cost (Algorithm 4), ALLOCATESEQUENCEASSEMBLEORACLE (Algorithm 5), and Subsumed (Algorithm 6). Algorithm 4 is the main dynamic programming, and its subroutines are presented in Algorithms 5 and 3. Let s be the number of distinct genomes in C. A distinct genome profile p = (p_{1}, p_{2}, . . . , p_{ s }) ∈ (N ∪ {0}) ^{s} is a population vector. In the root of the search tree, \sum _{j=1}^{s}{p}_{j}=n, where n is the total number of cells. The vector of deeply sequenced and assembled distinct genomes before exploration of the current node is denoted by I = (I_{1}, I_{2}, . . . , I_{s}) where I_{ j } ∈ {0, 1} and I_{ j } = 1 means that the distinct genome j has been sequenced and completely assembled. Throughout the algorithms, \overline{a}=\left({a}_{1},\phantom{\rule{2.77695pt}{0ex}}{a}_{2},\phantom{\rule{2.77695pt}{0ex}}...,\phantom{\rule{2.77695pt}{0ex}}{a}_{s}\right) is the assembly size profile per distinct genome in the current node, and \left\right\overline{a}{}_{1} is the total assembly size. Denote the assembly size of the parent search node by a_{ parent }, total sequenced nucleotides by t, and the expected total sequenced nucleotides by E[t]. We denote the probabilities of capturing distinct genomes by P = (q_{1}, q_{2}, . . . , q_{2s}), in which q_{ j } is the probability of the vector of deeply sequenced and assembled distinct genomes I ∈ {0, 1} ^{s} where j=I\u27e8\left({2}^{0},{2}^{1},....,{2}^{s1}\right)\u27e9+1, upon complete exploration of the current node in the search tree. Above, j is the decimal representation of I treated as binary (in reverse order) plus one. For example, for s = 3 and I = (0, 1, 1) (in short I = 011), j is the decimal representation of reverse of 011 plus one which is equal to 7. Therefore, q_{7} (or in another notation q_{011}) is the probability that after exploration of the current node the distinct genomes 2 and 3 are recognized by the algorithm, while distinct genome 1 is missed.
The COST function requires the genome population profile for the current node, p, the set of distinct genomes already deeply sequenced and assembled, I, and the result of the assembly of the parent node a_{ parent } as input parameters. The output of this function is the estimated cost E[t] and the probabilities of capturing genomes P after exploring the current node. At first (line 5), an oracle will decide on the sampling size of the current node and the resulting coverage and assembly size based on the formulations given in (4) and (5). In line 7, using the function SUBSUMED, the node is then compared with I to see if it includes any new distinct genome. If there is no new distinct genome (line 41), the probability of capturing the corresponding genomes is set to 1 and the function exits. Otherwise, the node will be explored further.
If the node includes only one cell, then that node is a leaf, and it will be sequenced and assembled deeply (line 9) and the corresponding capturing probability is set to 1. In the case of a node with more than one cell, the collection of cells will be divided into two groups. The expected cost and capturing probability, starting from the current node, is calculated over the ensemble of all of the possible divisions of the node into v and w between lines 19 and 39. The ensemble parameters are averaged over all divisions (v, w) by calculating their probability of occurrence (line 36). For a given division, v is explored followed by w. For each u ∈ {v, w} (line 24) and each combination of already captured distinct genomes with nonzero probability (lines 2628), the expected cost t′ and capturing probability profile P′′′ are recursively calculated using the COST function (line 29). These parameters are averaged over all nonzero probability profiles (lines 30, 31).
Results
DFS versus BFS
To compare the performance of DFS and BFS algorithms, we tested the algorithms on simulated data. We have selected 6 distinct genomes from human gut microbiome [4] to generate 3 setups (see Tables 1 and 2). The genomes were amplified and sequenced using ART Illumina sequencing simulator [21]. Reads are 100 bp long, uniformly covering the whole genome. The assembler used in the paper is HyDA coassembler [16]. To allocate resources, the relationship between the coverage and the assembly size of this setup is calculated using ART and HyDA over the 6 distinct genomes. The result is depicted in Figure 3. We selected M_{ u } = M_{ l } = 15 and τ = 0.1.
We have compared the performance of four different methods: (i) naïve method of sequencing each cell deeply, (ii) BFS compressed method using squeezambler 1.1, (iii) BFS compressed method using squeezambler 2.0, and (iv) DFS compressed method using squeezambler 2.0. The results are summarized in Table 3. See that squeezambler 2.0 BFS outperforms squeezambler 1.1 BFS in total sequencing base pairs. The BFS and DFS algorithms of squeezambler 2.0 have close performances in terms of the total sequenced base pairs. However, for the case of 31 cells and 6 distinct genomes, BFS missed one distinct genome. This is one of the examples that shows the genome missing probability is slightly more in BFS. Overall, both algorithms have comparable performances. Comparison of the performance of naïve algorithm and compressed methods shows that as the number of cells increases, the total sequenced base pairs increases linearly for the naïve algorithm and sublinear for the compressed methods. Although for the case of 31 cells, the naïve method outperforms the compressed method, at 140 cells the compressed method shows its strength.
Ensemble analysis
In this section, we selected M_{ u } = 5, M_{ l } = 0.3, and τ = 0.2 except in Table 5. Bacterial genome sizes were considered to be within 1  12 Mbps. In the current version of the program implemented in MATLAB, we computed the results for a small number of cells and distinct genomes, i.e., < 200 cells and < 11 distinct genomes. However, we expect to be able to run the algorithm for a larger number of cells and distinct genomes with an advanced implementation in C++. We tried to decouple the effect of different parameters in the analysis, namely τ in the algorithm, and the number of cells and species in the input. We would like to test whether the expected total cost complexity is O(s) for a fixed n and O(log n) for a fixed s and population profile. This is a first step to show the expected total cost is O(s log n) in the future. Therefore, the results also provide intuition for a potential thorough theoretical analysis of the expected cost and capturing probability.
Expected cost
To investigate the growth rate of E[t] for different number of cells n and compare it with the naïve cellbycell sequencing, we ran our program for the profiles of \frac{n}{8}\left(3,2,3\right) where n = 8, . . . , 192 is the total number of cells with genome sizes (4, 12, 2) Mbps. Figure 4 depicts the results. The total sequenced nucleotides in the naïve case is M_{ u } × max genome size × n = 60n Mbps. The genome of length 2 Mbps may not be captured with at most 3% probability; the other two genomes are always captured. Figure 4 suggests that E[t] grows almost linearly with log n whereas the naïve cost grows linearly with n. Hence, E[t] = O(log n) for fixed number of distinct genomes s and population profile.
To characterize the behavior of E[t] for different number of distinct genomes s, we plotted E[t] versus s = 1, . . . , 10 for n = 32, 64 in Figure 5. For each n, the best and worst population profiles in terms of expected cost were considered. The best case corresponds to roughly uniform n/s cells per distinct genome and the worst corresponds to n − s + 1 cells from one distinct genome and one cell per every other s − 1 distinct genomes (experimentally verified). The genome size was fixed at 4 Mbps for all distinct genomes to factor out the effect of genome sizes and τ = 0.2. Capturing probability of all distinct genomes in all cases was 1. Comparing the experimental curves with the linearly interpolated cost curves in Figure 5 suggests that the upper bound of cost (worst case) for each fixed n is O(s). In other words, the worst case cost grows almost linearly with the number of distinct genomes. Hence, for all population profiles, E[t] = O(s) for fixed number of cells n.
Capturing probability
We ran our program for a number of setups with 3 and 4 distinct genomes. The number of cells n was fixed at 40 in all experiments. Genome sizes varied between 1 and 12 Mbps. The total naïve singlecell sequencing is the minimum coverage M_{ u } times the number of cells, which is 40, times the maximum genome size. In this case, the naïve total cost is 2.4 Gbps. Tables 4 and 5 show the setups and their expected sequenced nucleotides E[t] and P, the capturing probability. Recall P = (q_{00···0}, q_{10···0}, . . . , q_{11···1}), in which q_{ I } for I ∈ {0, 1}^{s} is the joint probability of capturing distinct genome j if I_{ j } = 1 and missing it if I_{ j } = 0 for j = 1, 2, . . . , s.
In Table 4 with constant τ = 0.2, that genome whose length is 2 Mbps in the first row may not be captured with probability 0.62% because 2/12 < 0.2 = τ. We see a similar effect in the subsequent rows. Those genomes whose lengths are 1 and 2 Mbps may not be captured with nonzero probability. As the abundances of short genomes increase, the probabilities of missing them decrease. This suggests that those genomes whose lengths are shorter than τ times the largest genome size may not be captured with nonzero probability.
To further investigate, we ran our program on other setups with varying τ, which are presented in Table 5. For the profile (1, 1, 2, 4) and τ = 0, 0.1, all of the genomes are captured with probability 1. For τ = 0.2, there is a nonzero probability of missing the genomes of length 1 and 2 Mbps. For τ = 0.4, the genome of size 4 Mbps joins the other two short genomes, and the probability of missing it becomes nonzero. Comparing the profiles (1, 1, 2, 4) and (1, 1, 1, 5) for τ = 0.4, the probability of missing the genome of size 2 Mbps, i.e., q_{1101}, increases significantly from 0.8% to 23% as a result of decrease in its abundance, whereas the probability of not capturing the genome of size 1 Mbps, i.e., q1110, decreases from 4.1% to 0.8% as its abundance increases. This suggests that the missing probability depends on abundance, potentially relative to the abundance of the largest genome in the population.
Conclusion
We presented an adaptive compressed algorithm for sequencing and de novo assembly of distinct genomes in a bacterial community. We used the characteristics of sparsity of distinct genomes in a community of cells to decrease the amount of nucleotides needed to be sequenced. Using a dynamic programming algorithm to analyze the ensemble behavior of the algorithm, we showed that the expected cost is O(log(number of cells in the community)) for fixed genome population pro files and O(number of distinct genomes) for fixed number of cells. Furthermore, we showed that for a nonzero threshold τ, those genomes whose sizes relative to the maximum genome size in the community is less than τ may go undetected with a nonzero probability. This probability depends on the abundance of the corresponding genome. These results shed light on our future path towards theoretical analysis of our algorithm and further tree exploration strategies.
Declarations
This work and publication has been supported by Wayne State University.
Algorithms
Algorithm 1 Compressed Sequencing
1: Input: C = {C_{1}, C_{2} , . . . , C_{ n }}
2: Output: A, I
3: I_{1,1} ← {C_{1} , . . . , C[n/2]}
4: I_{1,2} ← {C[n/2]+1 , . . . , C_{ n }}
5: \overline{I} ← {I_{1,1}, I_{1,2}} ▷ \overline{I} is the list of the subsets to be analysed in the subsequent round
6: \overline{W} ← {} ▷ \overline{W} is the waiting list of the subsets assembled but not ready to be analysed immediately
7: i ← 1 ▷ i is the sequencing round index
8: while EITHER \overline{I} OR \overline{W} IS NOT EMPTY do
9: \overline{t} ← RESOURCEALLOCATE (\overline{I}, a_{ parent }, c_{ parent }, M_{ u }) ▷\u0100 \overline{t}=\left\{{t}_{i,1},....,{t}_{i,\leftI\right}\right\}; based on Equ 4
10: \overline{A},\phantom{\rule{2.77695pt}{0ex}}\overline{a},\overline{c} ← SEQUENCEANDASSEMBLE (\overline{t}, C) ▷ \overline{A}=\left\{{A}_{i,1},...,{A}_{i,\leftI\right}\right\}; \overline{c}=\left\{{c}_{i,1},...,{c}_{i,\leftI\right}\right\}; A_{ i,j }, c_{ i,j } are the assembly set and the average coverage of I_{ i,j }, respectively.
11: \overline{I},\overline{W} ← SELECTNEXTLEVELSETS(\overline{I}, \overline{W}, \overline{A}, \overline{c}, F ) ▷ F : DFS or BFS flag
12: i ← i + 1
13: end while
Algorithm 2 selectNextLevelSets
1: Input: \overline{I}, \overline{W}, \overline{A}, \overline{c}, F
2: Output: {\overline{I}}_{new},{\overline{W}}_{new}
3: \overline{L} = {} ▷ list of subsets with low quality assemblies
4: A_{ L } = {} ▷ assemblies of subsets in \overline{L}
5: for j = 1 . . . \left\overline{I}\right do
6: if c_{ i,j } < M_{ l } then
7: MOVE I_{ i,j } FROM \overline{I} TO \overline{L} ▷ move all low coverage assembled I_{ i,j } TO \overline{L}
8: MOVE A_{ i,j } FROM \overline{A} TO A_{ L }
9: else
10: if I_{ i,j } = 1 then
11: MOVE I_{ i,j } FROM \overline{I} TO I ▷ move all single cell assembled I_{ i,j } TO I
12: MOVE A_{ i,j } FROM \overline{A} TO A
13: end if
14: end if
15: end for
16: if F is BFS then
17: FIND THE MINIMUM SET COVER A_{ new } CORRESPONDING TO {\overline{I}}_{new}\subseteq \overline{I} FOR WHICH Dτ ((A ∪ A_{ L } ∪ A), ({\overline{A}}_{new}∪ A_{ L } ∪ A)) ≥ AND \left{\overline{I}}_{new}\right IS MINIMUM ▷ Equ. 3
18: {\overline{I}}_{new}\leftarrow {\overline{I}}_{new}\cup \overline{L}
19: {\overline{W}}_{new}=\left\{\right\}
20: else
21: if F is DFS then
22: {\overline{W}}_{new}=\overline{W}
23: POP ALL I_{ i,j }'s FROM {\overline{I}}_{new} AND PUSH TO {\overline{W}}_{new} EXCEPT ONE SUBSET WITH THE MAXIMUM ASSEMBLY SIZE
24: while \left{\u012a}_{new}\right=0 AND \left{\overline{W}}_{new}\right>0 do
25: I_{ new } ← POP LAST SUBSET IN {\overline{W}}_{new}
26: A_{ new } ← ASSEMBLY OF I_{ new }
27: if NOT SUBSUMED (A_{ new }, A) then
28: PUSH I_{ new } TO {\overline{I}}_{new}
29: end if
30: end while
31: end if
32: end if
33: DIVIDE ALL I_{ i,j }'s IN {\overline{I}}_{new} TO TWO SETS
Algorithm 3 SUBSUMED
1: Input: A_{ new }, A
2: Output: r ∈ { true, false }
3: ▷ low quality assembly; explore the node further.
4: if c ≤ M_{ l } then
5: r ← false
6: else
7: D ← τ − A_{ new } \A/A_{ new } ▷ Equ 2
8: if D < 0 then ▷Equ 1
9: r ← false
10: else
11: r ← true
12: end if
13: end if
Algorithm 4 COST  ensemble analysis main function
1: Input: p, I, a_{ parent }
2: Output: E[t] and \mathbb{\text{P}}
3:
4: \mathbb{\text{P}} = (q_{1}, q_{2}, . . . , q_{2s}) ← 0
5: t, \overline{a}, c ← ALLOCATESEQUENCEASSEMBLEORACLE(p, a_{ parent }) ▷ t total nt, \overline{a}=\left({a}_{1},...,{a}_{s}\right),c coverage
6: E[t] ← t
7: if not SUBSUMEDENSEMBLE(\overline{a}, c, I) then
8: if p_{1} = 1 then ▷ leaf base case
9: while c < 2M_{ u } do ▷ ensures the complete assembly for leaves
10: t, \overline{a}, c ← ALLOCATESEQUENCEASSEMBLEORACLE(p, \left\right\overline{\alpha}{}_{1})
11: E[t] ← E[t] + t
12: end while
13: k ← arg max p_{ j }
14: I^{new} ← I
15: {I}_{k}^{new}\leftarrow 1
16: j ← 〈I^{new}, (2^{0}, 2^{1}, . . . , 2^{s−1})〉 + 1
17: q_{ j } = 1 ▷ updating \mathbb{\text{P}}
18: else ▷ recursion
19: for v + w = p, v, w ∈ (N ∪ {0})^{s} do
20: t ← 0
21: \mathbb{\text{P}}′ ← \mathbb{\text{P}} ▷ \mathbb{\text{P}}′ = \left({q}_{1}^{\prime},\phantom{\rule{2.77695pt}{0ex}}{q}_{2}^{\prime},...,{q}_{2s}^{\prime}\right)
22: j ← 〈I, (2^{0} , 2^{1} , . . . , 2^{s−1})〉 + 1
23: {q}_{j}^{\prime}\leftarrow 1
24: for u ∈ {v, w} do ▷ v is explored followed by w
25: \mathbb{\text{P}}′′ ← 0
26: for b A BINARY VECTOR OF LENGTH s do
27: j ← 〈b, (2^{0}, 2^{1}, . . . , 2^{s−1})〉 + 1
28: if {q}_{j}^{\prime}>0 then ▷ average over all already captured distinct genome profile with nonzero probability
29: t′, \mathbb{\text{P}}′′′ ← COST(u, b, \left\right\overline{a}{}_{1})
30: t\leftarrow t+t\prime {q}_{j}^{\prime}
31: \mathbb{P}\prime \leftarrow \mathbb{P}\prime \prime +{q}_{j}^{\prime}\mathbb{P}\prime \prime \prime
32: end if
33: end for
34: ℙ′ ← ℙ′′ ▷ for w, q′'s are updated
35: end for
36: \pi \leftarrow \prod _{j=1}^{s}\left(\begin{array}{c}\hfill \begin{array}{c}\hfill {p}_{j}\hfill \\ \hfill {v}_{j}\hfill \end{array}\hfill \end{array}\right)/\left(\begin{array}{c}\hfill \left\rightp{}_{1}\hfill \\ \hfill \left\rightv{}_{1}\hfill \end{array}\right) ▷ probability of (v, w)
37: \mathbb{\text{E}}\left[t\right]\leftarrow \mathbb{\text{E}}\left[t\right]+\pi t▷ average over all possible (v, w)
38: \mathbb{\text{P}}\leftarrow \mathbb{\text{P}}+\mathbb{\text{P}}\prime ▷ average over all possible (v, w)
39: end for
40: end if
41: else ▷ all distinct genomes represented in p have already been sequenced
42: j ← 〈I, (2^{0}, 2^{1}, . . . , 2^{s−1})〉 + 1
43: q_{ j } = 1 ▷ updating \mathbb{\text{P}}
44: end if
Algorithm 5 ALLOCATESEQUENCEASSEMBLEORACLE  ensemble analysis
1: Input: p, a_{ parent }
2: Output: t, \overline{a}, and c
3:
4: t ← 2M_{ uaparent } ▷ Equ 6
5: t_{ pc } ← t/p_{1} ▷ total sequenced nt per cell
6: {\overline{t}}_{pdg}\leftarrow {t}_{pc}\cdot p ▷ total sequenced nt per distinct genome
7: \overline{a}, c ← ORACLE({\overline{t}}_{pdg}) ▷ oracle decides on the assembly size and coverage based on Equ 5
Algorithm 6 SUBSUMEDENSEMBLE  ensemble analysis
1: Input: \overline{a}, c, I
2: Output: r ∈ { true, false }
3:
4: if c ≤ Ml then ▷ low quality assembly; explore the node further.
5: r ← false
6: else
7: x ← (¬I, \overline{a}) ▷ exclusive part of assemblies, ¬ is bitwise not, based on Equ 2
8: if τ − x/\tau x/\left\right\overline{a}\left\right<0 then ▷ Equ 1
9: r ← false
10: else
11: r ← true
12: end if
13: end if
References
Chitsaz H, YeeGreenbaum JL, Tesler G, Lombardo MJ, Dupont CL, Badger JH, Novotny M, Rusch DB, Fraser LJ, Gormley NA, SchulzTrieglaff O, Smith GP, Evers DJ, Pevzner PA, Lasken RS: Efficient de novo assembly of singlecell bacterial genomes from shortread data sets. Nature Biotech. 2011, 29 (10): 915921. 10.1038/nbt.1966.
Zengler K, Toledo G, Rapp´e M, Elkins J, Mathur EJ, Short JM, Keller M: Cultivating the uncultured. Proc Nat Acad Sci. 2002, 99 (24): 1568115686. 10.1073/pnas.252630999.
Fitzsimons MS, Novotny M, Lo CC, Dichosa AE, YeeGreenbaum JL, Snook JP, Gu W, Chertkov O, Davenport KW, McMurry K: Nearly finished genomes produced using gel microdroplet culturing reveal substantial intraspecies genomic diversity within the human microbiome. Genome research. 2013, 23 (5): 878888. 10.1101/gr.142208.112.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto JM, Hansen T, Le Paslier D, Linneberg A, Nielsen HB, Pelletier E, Renault P, SicheritzPonten T, Turner K, Zhu H, Yu C, Li S, Jian M, Zhou Y, Li Y, Zhang X, Li S, Qin N, Yang H, Wang J, Brunak S, Dore J, Guarner F, Kristiansen K, Pedersen O, Parkhill J, Weissenbach J, Bork P, Ehrlich SD, Wang J, Antolin M, Artiguenave F, Blottiere H, Borruel N, Bruls T, Casellas F, Chervaux C, Cultrone A, Delorme C, Denariaz G, Dervyn R, Forte M, Friss C, van de Guchte M, Guedon E, Haimet F, Jamet A, Juste C, Kaci G, Kleerebezem M, Knol J, Kristensen M, Layec S, Le Roux K, Leclerc M, Maguin E, Minardi RM, Oozeer R, Rescigno M, Sanchez N, Tims S, Torrejon T, Varela E, de Vos W, Winogradsky Y, Zoetendal E: A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010, 464 (7285): 5965. 10.1038/nature08821.
Candès EJ, Tao T: Decoding by linear programming. IEEE Transactions on Information Theory. 2005, 51 (12): 42034215. 10.1109/TIT.2005.858979.
Candès EJ, Tao T: Nearoptimal signal recovery from random projections: Universal encoding strategies?. IEEE Transactions on Information Theory. 2006, 52 (12): 54065425.
Donoho DL: Compressed sensing. Information Theory. IEEE Transactions. 2006, 52 (4): 12891306.
Haupt J, Castro RM, Nowak R: Distilled sensing: Adaptive sampling for sparse detection and estimation. IEEE Transactions on Information Theory. 2011, 57 (9): 62226235.
Wei D, Hero AO: Multistage adaptive estimation of sparse signals. IEEE Statistical Signal Processing Workshop (SSP). 2012, 153156.
Erlich Y, Gordon A, Brand M, Hannon GJ, Mitra PP: Compressed genotyping. Information Theory. IEEE Transactions. 2010, 56 (2): 706723.
Stobbe P, Krause A: Learning fourier sparse set functions. Journal of Machine Learning Research  Proceedings Track. 2012, 22: 11251133.
Malloy ML, Nowak RD: Nearoptimal adaptive compressed sensing. Signals, Systems and Computers (ASILOMAR). 2012, 19351939. 2012 Conference Record of the Forty Sixth Asilomar Conference On IEEE
Kane DM, Nelson J, Woodruff DP: Proceedings of the Twentyninth ACM SIGMODSIGACTSIGART Symposium on Principles of Database Systems. An optimal algorithm for the distinct elements problem. 2010, 4152. ACM
Efron B, Thisted R: Estimating the number of unseen species: How many words did shakespeare know?. 1976, 63 (3): 435447.
Taghavi Z, Movahedi NS, Drăghici S, Chitsaz H: Distilled singlecell genome sequencing and de novo assembly for sparse microbial communities. Bioinformatics. 2013, 29 (19): 23952401. 10.1093/bioinformatics/btt420.
Movahedi NS, Forouzmand E, Chitsaz H: De novo coassembly of bacterial genomes from multiple single cells. IEEE Conference on Bioinformatics and Biomedicine. 2012, 561565.
Compeau PEC, Pevzner PA, Tesler G: How to apply de bruijn graphs to genome assembly. Nature biotechnology. 2011, 29 (11): 987991. 10.1038/nbt.2023.
Johnson WB, Lindenstrauss J: Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics. 1984, 26 (189206): 1
Taghavi Z, Draghici S: MDAsim:A multiple displacement amplification simulator. IEEE Conference on Bioinformatics and Biomedicine. 2012, 575578.
Motahari AS, Bresler G, Tse DNC: Information theory of DNA shotgun sequencing. Information Theory IEEE Transactions on. 2013, 59 (10): 62736289.
Huang W, Li L, Myers JR, Marth GT: ART: a nextgeneration sequencing read simulator. Bioinformatics. 2012, 28 (4): 593594. 10.1093/bioinformatics/btr708.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The author declares that they have no competing interests.
Authors' contributions
Z.T. designed and implemented the algorithms, ran the experiments, and wrote the manuscript.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 9, 2014: Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBSeq 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S9.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Taghavi, Z. Ensemble analysis of adaptive compressed genome sequencing strategies. BMC Bioinformatics 15 (Suppl 9), S13 (2014). https://doi.org/10.1186/1471210515S9S13
Published:
DOI: https://doi.org/10.1186/1471210515S9S13