Coverage statistics for sequence census methods
 Steven N Evans^{1, 2},
 Valerie Hower^{1}Email author and
 Lior Pachter^{1, 3}
https://doi.org/10.1186/1471210511430
© Evans et al; licensee BioMed Central Ltd. 2010
Received: 23 April 2010
Accepted: 18 August 2010
Published: 18 August 2010
Abstract
Background
We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic LanderWaterman model, we consider the effect of the length distribution of fragments. We also introduce a coding of the shape of the coverage depth function as a tree and explain how this can be used to detect regions with anomalous coverage. This modeling perspective is especially germane to current highthroughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions.
Results
Under the mild assumptions that fragment start sites are Poisson distributed and successive fragment lengths are independent and identically distributed, we observe that, regardless of fragment length distribution, the fragments produced in a sequencing experiment can be viewed as resulting from a twodimensional spatial Poisson process. We then study the successive jumps of the coverage function, and show that they can be encoded as a random tree that is approximately a GaltonWatson tree with generationdependent geometric offspring distributions whose parameters can be computed.
Conclusions
We extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in highthroughput sequence census based experiments. Our approach leads to explicit determinations of the null distributions of certain test statistics, while for others it greatly simplifies the approximation of their null distributions by simulation. Our focus on fragments also leads to a new approach to visualizing sequencing data that is of independent interest.
Background
The classic "LanderWaterman model" [1] provides statistical estimates for the read depth in a whole genome shotgun (WGS) sequencing experiment via the Poisson approximation to the Binomial distribution. Although originally intended for estimating the redundancy when mapping by fingerprinting random clones, the LanderWaterman model has served as an essential tool for estimating sequencing requirements for modern WGS experiments [2]. Furthermore, although it makes a number of simplifying assumptions (e.g. fixed fragment length and uniform fragment selection) that are violated in actual experiments, extensions and generalizations [3–9] have continued to be developed and applied in a variety of settings.
The advent of "highthroughput sequencing", which refers to massively parallel sequencing technologies has greatly increased the scope and applicability of sequencing experiments. With the increasing scope of experiments, new statistical questions about coverage statistics have emerged. In particular, in the context of sequence census methods, it has become important to understand the shape of coverage functions.
Sequence census methods [10] are experiments designed to assess the content of a mixture of molecules via the creation of DNA fragments whose abundances can be used to infer those of the original molecules. The DNA fragments are identified by sequencing, and the desired abundances inferred by solution of an inverse problem. An example of a sequence census method is ChIPSeq. In this experiment, the goal is to determine the locations in the genome where a specific protein binds. An antibody to the protein is used to "pull down" fragments of DNA that are bound via a process called chromatin immunoprecipitation (abbreviated by ChIP). These fragments form the "mixture of molecules" and after purifying the DNA, the fragments are determined by sequencing. The resulting sequences are compared to the genome, leading to a coverage function that records, at each site, the number of sequenced fragments that contained it. As with many sequence census methods, "noise" in the experiment leads to random sequenced fragments that may not correspond to bound DNA, and therefore it is necessary to identify regions of the coverage function that deviate from what is expected in the "null" situation when only noise is present. Finding peaks that are extreme requires a definition of "extreme" in the sense of some test statistic taking a large value as well as a probability model for the coverage process that leads to the null distribution of the test statistic and hence to means for calibrating what values of the test statistic are improbably large in the null regime. The height of a peak is one obvious statistics, but we hope to get more discriminating procedures by also considering a suitably defined numerical summary of the shape of a peak. Indeed, the shapebased methods presented here have been used to develop a peakcallerTPICfor the ChIPSeq assay [11].
The purpose of this paper, however, is not to develop methods for data analysis, but rather to present a null model for the shape of a coverage function that is of general utility. That is, we propose a definition for the shape of a coverage function in terms of the topology of a tree. We describe a random instance assuming that fragments are selected at random from a genome, with lengths of fragments given by a known distribution. We indicate how our description can be used to either compute analytically or approximate via simple Monte Carlo simulation the distributions of quantities of interest in a data analysis.
Methods
In this section, we use some specialized mathematical terminology and notation that the reader may be unfamiliar with. We feel it is important to include this in order to make our statements rigorous and mathematically correct. We will give the definitions of some of the concepts and a general idea of others, but first we set some notation. The symbols ℝ,ℤ, and ℤ_{≥0} stand for the real numbers, integers, and nonnegative integers (respectively), and the elements of a set can be listed inside curly braces, for instance A = {1,2,3}.
The shape of a fragment coverage function
We begin by explaining what we mean by a coverage function. Given a genome of length N, a coverage function is a function f : {1, ..., N} → ℤ_{≥0}. The interpretation of this function is that f(i) is the number of sequenced fragments obtained from a sequencing experiment that cover position i in the genome. Because N is very large, we work with the set ℝ and redefine a coverage function as f : ℝ → ℤ_{≥0}, which simplifies our analysis. We next introduce an object that describes a sequence coverage function's shape. Our approach is motivated by recent applications of topology including persistent homology [12, 13] and the use of critical points in shape analysis [14–16]. For a given coverage function f : ℝ → ℤ_{≥0}, we will define a rooted tree, which is a particular type of directed graph with all the directed edges pointing away from the root. This tree T_{ f }is based on the upperexcursion sets off : U_{ h }: = {(x,f(x))f(x) ≥ h},h ∈ ℤ_{≥0} and keeps track of how the sets U_{ h }evolve as h decreases. Long paths in T_{ f }represent features of the coverage function that persist through many values of h.
Specifically, for each h ∈ ℤ_{≥0}, let C_{ h }denote the set of connected components of the upperexcursion set U_{ h }. That is, each element of C_{ h }is an interval I such that f(x) ≥ h for all x ∈ I and if J is another interval for which I ⊂ J and J ≠ I (so that J strictly contains I), then f(y) <h for some y ∈ J. We define the rooted tree T_{ f }= (V,E) as follows

Vertices in V correspond to the connected components in the sets C_{ h }, with h ranging over all nonnegative integers.

(i, j) ∈ E provided their corresponding connected components ${c}_{i}\in {C}_{{h}_{i}}$ and ${c}_{j}\in {C}_{{h}_{j}}$ with h_{ i }<h_{ j }satisfy h_{ i }= h_{ j }1 and c_{ j }⊂ c_{ i }.
The equivalence classes under this relation are in 1:1 correspondence with the connected components in the upperexcursion sets of f_{[a,b]}. One equivalence class is {0, 2n}, and if {i_{1}, ..., i_{ p }} is an equivalence class with 0 <i_{1} <i_{2} < ... <i_{ p }then ${x}_{{i}_{1}1}={x}_{{i}_{1}}1,$, whereas ${x}_{{i}_{q}1}={x}_{{i}_{q}}+1$ for 2 ≤ q ≤ p Conversely, any index i with x_{i1}= x_{ i }1 is the minimal element of an equivalence class. We use the minimal element of each equivalence class as its representative. Thus, we can view the vertices of ${T}_{f{}_{[a,b]}}$ as the set {0} ∪ {ix_{i1}= x_{ i }1} Two indices i_{1} <i_{2} are adjacent in ${T}_{f{}_{[a,b]}}$ provided ${x}_{{i}_{2}}={x}_{{i}_{1}}+1$ and ${x}_{k}\ge {x}_{{i}_{1}}$ for i_{1} ≤ k ≤ i_{2}. Figure 1 gives an example of a coverage function together with its lattice path excursion (0, 1, 2, 3, 4, 3, 2, 3, 4, 5, 4, 3, 2, 3, 2, 1, 0) and rooted tree. The minimal elements of each equivalence class in Figure 1B are depicted with red squares.
Planar Poisson processes from sequencing experiments
In order to model random coverage along the genome thought of as a continuum, we adopt the perspective of the LanderWaterman model and use a Poisson process to give random starting locations for the fragments. Specifically, we suppose that the left endpoints of the fragments form stationary Poisson point process on ℝ with intensity ρ.
At each point of the Poisson point process we lay down an interval that has that point as its left endpoint. The lengths of the successive intervals are independent and identically distributed with common distribution μ. We will use the notation X for a coverage function built from this process and X_{ t }for the height at a point t.
 1.
N(A) has the Poisson distribution with parameter Γ(A), and
 2.
If A _{1}, ..., A _{ k }are disjoint Borel subset of ℝ^{2}, then N(A _{1}), ..., N(A _{ k }) are independent random variables.
The following theorem is a theoretical statement about our null model for random fragment placement and is a consequence of [[21], Proposition 12.3]. The theorem and the work that follows from it will allow us to access the shape of random fragment placement by giving a description we can simulate.
Theorem 1. The collection {(t_{ j },l_{ j })} of points obtained as described above is a nonhomogeneous Poisson process with mean measure ρm ⊗ μ. Here m is Lebesgue measure (that is, length measure) on ℝ.
where the last line follows from an integrationbyparts. Thus, $\mathbb{E}[{X}_{{t}_{0}}]$] is the product of the intensity ρ and the mean length of a fragment.
Remark: The average height E [${X}_{{t}_{0}}$] can be computed without the use of Theorem 1. We include the derivation above as a first illustration that properties of the coverage function can be understood in terms of the twodimensional Poisson process.
Fragment lengths have a general distribution
To use the shape of fragment coverage in a data analysis, one needs to understand the distribution of the shape when fragments are laid down according to the null model described above. In particular, one is interested in the probability of seeing shapes associated with trees that have a height exceeding some high level. One way of doing this would be to first simulate a very long stretch of the twodimensional Poisson process, determine the coverage function, construct the trees for peaks that exceed a high level, compute our shape statistic for each tree, and then record the empirical distribution of the resulting values. However, peaks that exceed high levels occur very infrequently and so we would need to simulate infeasibly long stretches of the Poisson process in order to determine the probabilities we are interested in with reasonable accuracy. Thus, in this section we propose a Markov approximation that lets us start at high levels (rather than wait for them to appear in simulations of the Poisson process). The corresponding trees are distributed as GaltonWatson trees with generationdependent geometric offspring distributions and these are easy to simulate. In the Results and Discussion section, we compare this approximation to that obtained by simulating the Poisson process for fixed length fragments.
which occurs with probability ${\left(\frac{\rho {\displaystyle {\int}_{T}^{\infty}\mu}((u,\infty ))du}{\rho {\displaystyle {\int}_{0}^{\infty}\mu}((u,\infty ))du}\right)}^{k}$.
which is the probability of n failures before the first success in a sequence of independent Bernoulli trials where the probability of success equals 1p(k). The utility of Equation 1 is that it allows one to (approximately) simulate trees for peaks that exceed a high level under the null model, making it possible to compare trees built from actual data to those formed by random fragment placement.
for 2 ≤ n ≤ H  1 with Y_{1} = 1, ${Y}_{2}=\frac{1}{p(1)}$. We may solve inductively for Y_{ H }and obtain $r(1,H)=\frac{1}{{Y}_{H}}$. The quantity r(1,H) gives the probability that a tree corresponding to a single lattice path excursion away from 0 and coming from the null model is at least as tall as height H. Note that this type of tree comes from a block where the coverage function rises from 0 and then back againoften referred to as an island or contig. This probability can be used to do an initial "filtering" of peaks in a data analysis: one first concentrates on peaks that exceed some height that is calibrated using a knowledge of r(1,H) and then computes the shape statistic and associated pvalues for just those peaks. As an example, Figure 5 in the Results and Discussion section shows r(1,H) plotted for the fixed fragment length.
Results and Discussion
Fragment lengths have the exponential distribution
Claim 1. The process X is a stationary, timehomogeneous Markov process.
This means that probability that an interval covers t_{2} knowing that it covers t_{1} is the same as the probability that an interval starting at t_{1} covers t_{2}. Thus, the probability that ${X}_{{t}_{2}}=k$ given X_{ t }for at t ≤ t_{1} only depends on the value of ${X}_{{t}_{1}}$.Indeed, in terms of time, $\mathbb{P}\{{X}_{{t}_{2}}=k{X}_{{t}_{1}}={k}^{\prime}\}$ depends only on t_{2} t_{1.}
Note that as the exponential distribution is the only distribution with the memoryless property, we lose the Markov property when μis not exponential.
Fragments have a fixed length
The "shape" we have proposed for coverage functions was motivated by persistence ideas from topological data analysis (TDA). In the context of TDA, our setting is very simple (1dimensional), however unlike what is typically done in TDA, we have provided a detailed probabilistic analysis that can be used to construct a null hypothesis for coveragebased test statistics. For example, computing test statistics [27] based on the trees constructed from coverage functions and comparing those to the statistics expected from the GaltonWatson trees has been used to determine protein binding sites in ChIPSeq assay [11]. It should be interesting to perform similar analyses with highdimensional generalizations for which we believe many of our ideas can be translated. There are also biological applications, for example in the analysis of ChipSeq experiments [11], as previously mentioned.
Conclusions
We believe that the study of sequence coverage functions that we have initiated may be of use in the analysis of many sequence census methods. The number of proposed protocols used in such methods has exploded in the past two years, as a result of dramatic drops in the price of sequencing. For example, in January 2010, the company Illumina announced a new sequencer, the HiSeq 2000, that they claim "changes the trajectory of sequencing" and can be used to sequence 25 Gb per day. Although technologies such as the HiSeq 2000 were motivated by human genome sequencing a surprising development has been the fact that the majority of sequencing is in fact being used for sequence census experiments [10]. The vast amounts of sequence being produced in the context of complex sequencing protocols, means that a detailed probabilistic understanding of random sequencing is likely to become increasingly important in the coming years.
Declarations
Acknowledgements
SNE is supported in part by NSF grant DMS0907630 and VH is funded by NSF fellowship DMS0902723. We thank Adam Roberts for his help in making Figure 6.
Authors’ Affiliations
References
 Lander E, Waterman M: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 1988, 2: 231–239. 10.1016/08887543(88)900079View ArticlePubMedGoogle Scholar
 Weber J, Myers E: Human wholegenome shotgun sequencing. Genome Research 1997, 7: 401–409.PubMedGoogle Scholar
 Wendl M, Barbazuk WB: Extension of LanderWaterman theory for sequencing ltered DNA libraries. BMC Bioinformatics 2005, 6: 245. 10.1186/147121056245View ArticlePubMedPubMed CentralGoogle Scholar
 Wendl M: A general coverage theory for shotgun DNA sequencing. Journal of Computational Biology 2006, 13: 1177–1196. 10.1089/cmb.2006.13.1177View ArticlePubMedGoogle Scholar
 Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC: A WholeGenome Assembly of Drosophila. Science 2000, 287(5461):2196–2204. 10.1126/science.287.5461.2196View ArticlePubMedGoogle Scholar
 Holst L: Random arcs on the circle. Journal of Mathematical Sciences 1984, 25(3):1231–1233. 10.1007/BF01084801Google Scholar
 Sharon I, Pati A, Markowitz V, Pinter R: A Statistical Framework for the Functional Analysis of Metagenomes. Research in Computational Molecular Biology 2009, 496–511. full_textView ArticleGoogle Scholar
 Arratia R, Lander ES, Tavare S, Waterman MS: Genomic mapping by anchoring random clones: A mathematical analysis. Genomics 1991, 11(4):806–827. 10.1016/08887543(91)90004XView ArticlePubMedGoogle Scholar
 Schbath S: Coverage Processes in Physical Mapping by Anchoring Random Clones. Journal of Computational Biology 1997, 4: 61–82. 10.1089/cmb.1997.4.61View ArticlePubMedGoogle Scholar
 Wold B, Myers R: Sequence census methods for functional genomics. Nature Methods 2008, 5: 19–21. 10.1038/nmeth1157View ArticlePubMedGoogle Scholar
 Hower V, Evans SN, Pachter L: Shapebased peak identification for ChIPSeq. ArXiv eprints 2010.Google Scholar
 Carlsson G: Topology and data. Bull Amer Math Soc (N.S.) 2009, 46(2):255–308. 10.1090/S027309790901249XView ArticleGoogle Scholar
 Zomorodian A, Carlsson G: Computing persistent homology. Discrete Comput Geom 2005, 33(2):249–274. 10.1007/s004540041146yView ArticleGoogle Scholar
 Biasotti S, Giorgi D, Spagnuolo M, Falcidieno B: Reeb graphs for shape analysis and applications. Theoretical Computer Science 2008, 392(1–3):5–22. 10.1016/j.tcs.2007.10.018View ArticleGoogle Scholar
 de Berg M, van Kreveld M: Trekking in the Alps without freezing or getting tired. Algorithmica 1997, 18(3):306–323. 10.1007/PL00009159View ArticleGoogle Scholar
 Edelsbrunner H, Harer J, Zomorodian A: Hierarchical MorseSmale complexes for piecewise linear 2manifolds. Discrete Comput Geom 2003, 30: 87–107.View ArticleGoogle Scholar
 Carr H, Snoeyink J, Axen U: Computing contour trees in all dimensions. Comput Geom 2003, 24(2):75–94. 10.1016/S09257721(02)000937View ArticleGoogle Scholar
 Evans SN: Probability and real trees, Volume 1920 of Lecture Notes in Mathematics. Berlin: Springer; 2008.Google Scholar
 Grimmett GR, Stirzaker DR: Probability and random processes. third edition. New York: Oxford University Press; 2001.Google Scholar
 Daley DJ, VereJones D: An introduction to the theory of point processes. Springer Series in Statistics, New York: SpringerVerlag; 1988.Google Scholar
 Kallenberg O: Foundations of modern probability. second edition. Probability and its Applications (New York), New York: SpringerVerlag; 2002.View ArticleGoogle Scholar
 Fearn DH: GaltonWatson processes with generation dependence. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. IV: Biology and health. Berkeley, Calif.: Univ. California Press; 1972:159–172.Google Scholar
 Good IJ: The joint distribution for the sizes of the generations in a cascade process. Proc Cambridge Philos Soc 1955, 51: 240–242. 10.1017/S0305004100030115View ArticleGoogle Scholar
 Harris TE: The theory of branching processes. Dover Phoenix Editions, Mineola, NY: Dover Publications Inc; 2002.Google Scholar
 Jagers P: GaltonWatson processes in varying environments. J Appl Probability 1974, 11: 174–178. 10.2307/3212594View ArticleGoogle Scholar
 Hansen K, Brenner S, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research 2010.Google Scholar
 Matsen F: A geometric approach to tree shape statistics. Systematic Biology 2006, 4: 652–661. 10.1080/10635150600889617View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.