Statistics for approximate gene clusters
- Katharina Jahn^{1, 2}Email author,
- Sascha Winter^{3},
- Jens Stoye^{1, 2} and
- Sebastian Böcker^{3}
https://doi.org/10.1186/1471-2105-14-S15-S14
© Jahn et al.; licensee BioMed Central Ltd. 2013
Published: 13 December 2013
Abstract
Background
Genes occurring co-localized in multiple genomes can be strong indicators for either functional constraints on the genome organization or remnant ancestral gene order. The computational detection of these patterns, which are usually referred to as gene clusters, has become increasingly sensitive over the past decade. The most powerful approaches allow for various types of imperfect cluster conservation: Cluster locations may be internally rearranged. The individual cluster locations may contain only a subset of the cluster genes and may be disrupted by uninvolved genes. Moreover cluster locations may not at all occur in some or even most of the studied genomes. The detection of such low quality clusters increases the risk of mistaking faint patterns that occur merely by chance for genuine findings. Therefore, it is crucial to estimate the significance of computational gene cluster predictions and discriminate between true conservation and coincidental clustering.
Results
In this paper, we present an efficient and accurate approach to estimate the significance of gene cluster predictions under the approximate common intervals model. Given a single gene cluster prediction, we calculate the probability to observe it with the same or a higher degree of conservation under the null hypothesis of random gene order, and add a correction factor to account for multiple testing. Our approach considers all parameters that define the quality of gene cluster conservation: the number of genomes in which the cluster occurs, the number of involved genes, the degree of conservation in the different genomes, as well as the frequency of the clustered genes within each genome. We apply our approach to evaluate gene cluster predictions in a large set of well annotated genomes.
Keywords
Background
Gene order-based analysis of whole genomes has become an important field in comparative genomics. It is well known that genomes evolve, not only at the level of nucleotide sequence, but also by means of large-scale rearrangements, such as inversions and transpositions, as well as changes of the gene content. Focusing on this higher-level structure, genomes are usually modeled as sequences of integers with genes belonging to the same gene family represented by the same integer. If no selective pressure was acting on whole genome evolution, gene order and gene content would randomize over time. In reality, particularly in bacterial genomes, we observe low overall conservation of gene order between species, contrasted by a small number of well-conserved segments. These are often referred to as gene clusters. Such local aberrations from genome randomization provide evidence for various biological phenomena and are of high interest in functional and evolutionary genomics [1–9].
When comparing a large number of genomes, the identification of these structures can be a challenging task since conservation patterns may be highly variable across species. Due to micro-rearrangements, gene order can also vary between cluster occurrences. Gene insertions and losses, or mis-annotations may lead to cluster occurrences interrupted by genes that do not belong to the cluster, or containing only a subset of the clustered genes. Moreover, a gene cluster may be present only in a subset of the genomes under study. (A gene cluster displaying all these features, except for mis-annotations, is shown in Additional File 1) Taking these effects into account in different ways, several models of gene clusters and algorithms for their detection have been suggested [10–17].
However, it may as well be the case that seemingly conserved patterns occur merely by chance. To estimate the likeliness of such events, appropriate statistics are needed to quantify the probability of finding gene clusters by chance. Such statistics have been developed for some gene cluster models, in particular $\phantom{\rule{0.1em}{0ex}}r$-windows [18–20], segmental homologies ($k\phantom{\rule{0.1em}{0ex}}$-clumps) [21], and max-gap clusters [22–24]. Other methods solve the problem of assigning significance to predicted gene clusters in an ad-hoc manner, including C-Hunter [25], OrthoCluster [26], MCMuSeC [16], CYNTENATOR [27], and i-ADHoRe [28]. In this paper we consider the gene cluster model of approximate common intervals [15]. This can be easily applied to a variety of use cases, as it offers a combination of parameter flexibility and efficiency of computation, even for very large data sets. The variant reference gene clusters [17] proves especially useful. We provide a statistical test for evaluating gene cluster predictions against the null hypothesis of random gene order. For our background model, we consider, for each genome $\phantom{\rule{0.1em}{0ex}}G$, a random string $\phantom{\rule{0.1em}{0ex}}S$ of the same length where each character (gene) has a probability proportional to its frequency in $\phantom{\rule{0.1em}{0ex}}G$. For multi-chromosomal genomes, or in cases where the (unfinished) genome sequence consists of multiple contigs, we do the same for each chromosome/contig individually. We then estimate p-values, that is, the probabilities of gene clusters of the observed quality or higher being found in the random genomes by chance.
Since the random genomes are drawn independently, we can proceed as follows: For each genome, we compute the likelihood of a gene cluster occurring in the corresponding random genome. These are the individual p-values for each genome. Next, we demonstrate how to combine p-values from individual genomes into one p-value for the gene cluster. The problem of multiple testing is then considered by applying a false discovery rate control to minimize this effect. Finally, we demonstrate that there is excellent concurrence between our calculated p-values and empirically determined p-values and that the method is able to recognize known gene clusters from large genomic data sets.
Methods
Preliminaries
The corresponding index interval $\left[a,\phantom{\rule{2.77695pt}{0ex}}b\right]$ is termed a location of $C\subseteq \mathrm{\Sigma}$ if and only if $C=\mathcal{C}\mathcal{S}\left(S\left[a,b\right]\right).$ An interval $\left[a,b\right]$ is left-maximal if $a=1$ or $S\left[a-1\right]\notin \mathcal{C}\mathcal{S}\left(S\left[a,b\right]\right)$; it is right-maximal if $b=\phantom{\rule{2.77695pt}{0ex}}\left|S\right|$ or $S\left[b+\mathsf{\text{1}}\right]\notin \mathcal{C}\mathcal{S}\left(S\left[a,b\right]\right);$ and it is maximal if it is both left- and right-maximal.
Given that ${S}_{\mathsf{\text{1}}},\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.,\phantom{\rule{2.77695pt}{0ex}}{S}_{k}$ are genomes, the above character set $\phantom{\rule{0.1em}{0ex}}C$ corresponds to a gene cluster with perfectly conserved gene content.
This constitutes a metric and therefore meets all intuitive notions of a distance measure, such as validity of the triangle inequality. In the context of gene clusters, it corresponds to a simple summation over the genes deleted and the genes inserted into a cluster occurrence. Like earlier character set based gene cluster models [29], it disregards recurrences of genes within the same cluster occurrence.
Based on this distance notion, we extend the concept of character set locations towards approximate conservation: Given an integer $\delta \ge 0$, we define an interval $\left[a,b\right]$ in a string $S$ as a $\delta $-location of character set $C,$ if and only if, $D\left(C,\mathcal{C}\mathcal{S}\left(S\left[a,b\right]\right)\right)\le \delta $, and $\mathcal{C}\cap \mathcal{C}\mathcal{S}\left(S\left[a,\phantom{\rule{2.77695pt}{0ex}}b\right]\right)\ne \varnothing $.
Let ${S}_{\mathsf{\text{1}}},\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.,\phantom{\rule{2.77695pt}{0ex}}{S}_{k}$ be genomes over a gene alphabet $\mathrm{\Sigma}$. Let $s\ge \mathsf{\text{2}}$ be the minimum cluster size, $\delta \ge 0$ a distance threshold and ${k}^{\prime}$ a quorum parameter with $\mathsf{\text{2}}\le {k}^{\prime}\le k$. A reference gene cluster for parameters $s,\delta $ and ${k}^{\prime}$ is a set of genes $C\subseteq \mathrm{\Sigma}$ with $\left|C\right|\phantom{\rule{2.77695pt}{0ex}}\ge s$ such that $C$ has an exact occurrence in one of the genomes and $\delta \text{-}\mathsf{\text{locations}}$ in at least ${k}^{\prime}-1$ other genomes. In other words, there exist $i,a,b$ such that $C=CS\left({S}_{i}\left[a,b\right]\right)$, and $J\subseteq \left\{\mathsf{\text{1}},...,k\right\}-\left\{i\right\}$ with $\left|J\right|\phantom{\rule{2.77695pt}{0ex}}\ge {k}^{\prime}-1$ such that each ${S}_{j}$ has a $\delta $-location of $C$ for all $j\in J$.
In [17] we studied the following problem: Given genomes ${S}_{\mathsf{\text{1}}},\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.,\phantom{\rule{2.77695pt}{0ex}}{S}_{k}$ and parameters $s,\delta ,{k}^{\prime},$ find all reference gene clusters $C\subseteq \mathrm{\Sigma}$ in ${S}_{\mathsf{\text{1}}},\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}},{S}_{k},$ and for each reference gene cluster, output all its optimal $\delta \text{-}\mathsf{\text{locations}}\mathsf{\text{.}}$ (A $\delta \text{-}\mathsf{\text{location}}$ is not optimal if it is a subinterval of a $\delta \text{-}\mathsf{\text{location}}$ that has a smaller distance to $C.$) We introduced an efficient algorithm that runs in $O\left({k}^{\mathsf{\text{2}}}{n}^{\mathsf{\text{2}}}{\delta}^{\mathsf{\text{2}}}+{k}^{\mathsf{\text{2}}}{n}^{\mathsf{\text{2}}}\right)$ time using $O\left(k{n}^{\mathsf{\text{2}}}\right)$ space, where $\phantom{\rule{0.1em}{0ex}}n$ is the length of the largest genome [17]. The algorithm is exact, meaning that it is guaranteed to find all reference gene clusters and their optimal occurrences as specified by the search parameters.
The above definitions do not take into account multi-chromosomal genomes, or genomes that were not completely assembled and still consist of several contigs. However, it is simple to extend these definitions, as well as the remainder of this paper, to the multi-chromosomal or multi-contig case. For example, we may assume that the different chromosomes/contigs of one genome are concatenated in a single string, separated by symbols $\$\notin \mathrm{\Sigma}$. We can then assume that neither a gene cluster nor an interval is allowed to contain the character $\$$. Further details will be omitted, aside from saying that some of the complexity bounds mentioned below actually improve for multi-chromosomal and multi-contig genomes.
Significance of a gene cluster for one genome
which can be calculated with high accuracy, using mathematical library functions for $f\left(x\right):=\mathsf{\text{exp(}}x\mathsf{\text{)}}-\mathsf{\text{1}}$ and $g\left(x\right):=\mathsf{\text{log(}}x+\mathsf{\text{1)}}$.
Exact computation using dynamic programming
where ${p}_{0}\left(l,L\right)$ is the probability of drawing a random string of length $\phantom{\rule{0.1em}{0ex}}L$ with $\phantom{\rule{0.1em}{0ex}}l$ characters from $C$ and $L-l$ characters from $\stackrel{\u0304}{C};{p}^{-}\left(l,d\text{\_}\right)$ is the probability that, for a random string ${S}_{{l}^{}}$ of length $l$ over the alphabet $C$, $\mathcal{C}\mathcal{S}\left({S}_{l}\right)$ is missing ${d}_{-}$ characters from $C$; and ${p}^{+}\left(L-l,d-d\text{\_}\right)$ is the probability that, for a random string ${S}_{{l}^{\prime}}$ of length ${l}^{\prime}=L-l$ over the alphabet $\stackrel{\u0304}{C},\mathcal{C}\mathcal{S}\left({S}_{{l}^{\prime}}\right)$ contains ${d}_{+}=d-d\text{\_}$ different characters. If all these values are known, we can compute the desired probability using (4) in time $O\left(dL\right)$. In practice, we found that ${p}^{-}\left(l,{d}_{-}\right)$ in (4) decreases rapidly with increasing $l$. To this end, we can stop the summation, as well as the computation of ${p}^{-}\left(l,{d}_{-}\right)$ and ${p}^{+}\left(L-l,{d}_{+}\right)$, as soon as $\sum _{d\text{\_}=0}^{d}{p}_{0}\left(l,L\right).\phantom{\rule{2.77695pt}{0ex}}{p}^{-}\left(l,d\text{\_}\right).\phantom{\rule{2.77695pt}{0ex}}{p}^{+}\left(L-l,\phantom{\rule{2.77695pt}{0ex}}{d}_{+}\right)$ no longer contributes to the sum.
holds. It remains to be shown how to compute ${p}^{-}\left(l,{d}_{-}\right)$ for missing characters and ${p}^{+}\left(L-l,{d}_{+}\right)$ for additional characters.
Missing characters
We initialize $p\left[z,0,0\right]=\mathsf{\text{1}},\phantom{\rule{2.77695pt}{0ex}}p\left[z,0,h\right]=0,p\left[z,l,0\right]=0$, and $p\left[0,l,h\right]=0$ for all $z$ and all $l,h>0$. The two missing probabilities are computed using a binomial distribution. In the end, ${p}^{-}(l,{d}_{-})=p[Z,l,h]$ is the probability that in a random string of length $l$ over alphabet $C$, exactly $h$ different characters from $C$ have been generated. Computation takes $O\left(\left|C\right|h{l}^{\mathsf{\text{2}}}\right)$ time and $O\left(hl\right)$ space, as only the values for $z=Z$ need to be stored.
Additional characters
The recurrence introduced in equation (5), can also be used to compute ${p}^{+}\left(l,{d}_{+}\right)$. We need to set $h:={d}_{+}$, and exchange the roles of $C$ and $\stackrel{\u0304}{C}.$ Unfortunately the latter modification has a strong impact on the practical runtime, due to the linear dependence now being on the much larger $\left|\stackrel{\u0304}{C}\right|$ (compared to the rather small $\left|C\right|$ for ${p}^{-}\left(l,{d}_{-}\right)$). However, we can mitigate this by pooling genes based on the frequency of their occurrences in the original genome. The genes within each pool have the same occurrence probability in the random genome and need not be distinguished in our calculations. It is sufficient to know for each pool how many of its genes originate from $C$, and $\stackrel{\u0304}{C}$ respectively.
The initializations are the same as for the previous recurrence. We can use the binomial distribution to compute the value of $\phantom{\rule{0.1em}{0ex}}\mathbb{P}$ (no gene of ${F}_{z}^{\stackrel{\u0304}{C}}$ rolled with $\phantom{\rule{0.1em}{0ex}}l$ dice) and the same type of recurrence as in equation (5) to compute the second probability, $\phantom{\rule{0.1em}{0ex}}\mathbb{P}$($\ell $ genes of ${F}_{z}^{\stackrel{\u0304}{C}}$, ${h}^{\prime}$ different ones, rolled with $\phantom{\rule{0.1em}{0ex}}l$ dice). In the end, ${p}^{+}\left(l,{d}_{+}\right)=p\left[f,l,h\right]$ is the probability that in a random string of length $\phantom{\rule{0.1em}{0ex}}l$ over alphabet $\stackrel{\u0304}{C},$ exactly $\phantom{\rule{0.1em}{0ex}}h$ different characters from $\stackrel{\u0304}{C}$ have been generated.
Due to the second summation, the asymptotic time complexity of the recurrence becomes $O\left(f{h}^{\mathsf{\text{2}}}{l}^{\mathsf{\text{2}}}\right)$. We observe that, in practice $f,$ the number of different gene occurrence frequencies is very small compared to $\stackrel{\u0304}{C}$ and is typically in the size range of large gene clusters. Also, the vast majority of genes occur only once in a genome, with pool sizes dropping quickly for larger occurrence frequencies. Since ${h}^{\prime}$ is bounded, not only by h but also by $\left|{F}_{f}^{\stackrel{\u0304}{C}}\right|,$ the quadratic dependence on $\phantom{\rule{0.1em}{0ex}}h$ is unlikely to be relevant in practice.
Next, we show how the values of $\phantom{\rule{0.1em}{0ex}}\mathbb{P}$($\ell $ genes of ${F}_{z}^{\stackrel{\u0304}{C}}$, ${h}^{\prime}$ different ones, rolled with $l$ dice) can efficiently be computed. Ideally, we would like to precompute these values once for every genome, providing constant-time lookup during computation of p-values for the different gene clusters. At first sight, these probabilities seem to be specific for each cluster, as the gene pools need to be restricted to their complement $\stackrel{\u0304}{C}$. However, we know that all genes in a pool have the same occurrence probability, therefore it is sufficient to compute the above values for all residual pools, after removing a certain number - not a certain set - of genes. For small pools, which are in the majority, not much extra work is required to do this for all possible residual sizes. For large pools exact computations may, in practice, be too costly. In this case, we suggest working with pools based on the complete alphabet $\mathrm{\Sigma}$. Due to their size, the removal of a few elements has little influence on the conditional probabilities of the remaining elements.
In practice, we use a faster, but less exact approach: We replace the more accurate estimation of ${p}^{+}\left(l,{d}_{+}\right)$ with a much faster preprocessing that leads to almost identical results. To this end, we compute a global ${P}^{+}\left(l,{d}_{+}\right)$ for the complete alphabet, $\stackrel{\u0304}{C}:=\mathrm{\Sigma},$ during preprocessing. This is achieved using recurrence (5). We then assume that, for any cluster $C$ with additional character probabilities ${p}^{+}\left(l,{d}_{+}\right)$, we have ${p}^{+}\mathsf{\text{(}}l,{d}_{+}\mathsf{\text{)}}\approx {P}^{+}\left(l,{d}_{+}\right)$. In doing so, we ignore the fact that the gene cluster $\phantom{\rule{0.1em}{0ex}}C$ removes some of the genes from the pool of potential additional genes. Clearly, this computation can be carried out very quickly, as we have to compute ${P}^{+}\left(l,{d}_{+}\right)$ only once for each genome. Depending on the distribution of occurrence probabilities, the above approximation can distort the results significantly. We account for this by setting ${p}^{+}\left(l,{d}_{+}\right)\approx {{\left(\mathsf{\text{1}}-\mathbb{P}\left(C\right)\right)}^{d}}^{+}\cdot {P}^{+}\left(l,{d}_{+}\right)$, thereby taking into account the occurrence probabilities of the genes in $C.$
Significance of a gene cluster in multiple genomes
Thus far, we have concentrated on the probability of observing gene cluster occurrences in a single genome. To estimate the significance of observing a gene cluster in multiple genomes, we need to combine the individual probabilities into a single p-value. This gives the probability of observing a gene cluster $C,$ at least as well conserved in the randomized genomes as in the original genome set.
The base cases are $M\left[0,0\right]=\mathsf{\text{1}}$, and $M\left[0,\phantom{\rule{2.77695pt}{0ex}}d\right]=0$ for $d>0$. $\phantom{\rule{0.1em}{0ex}}\mathbb{P}$(best $\delta \text{-}\mathsf{\text{locations}}$ in ${S}_{i}$ has distance ${d}^{\prime}$ to $C$) equals $\phantom{\rule{0.1em}{0ex}}\mathbb{P}$($C$ has a ${d}^{\prime}$-location in ${S}_{i}$) − $\phantom{\rule{0.1em}{0ex}}\mathbb{P}$($\phantom{\rule{0.1em}{0ex}}C$ has a $\left({d}^{\prime}-\mathsf{\text{1}}\right)$-location in ${S}_{i}$). These probabilities can be computed with equation (2). Summing over all $M\left[k,\phantom{\rule{2.77695pt}{0ex}}d\right]$, $0\le d\le {d}_{\mathsf{\text{obs}}}$, gives the desired p-value. i. e. the probability that $C$ is at least as well conserved in the randomized genomes as in the original dataset. This computation takes time $O\left({k}^{\mathsf{\text{2}}}{\delta}^{\mathsf{\text{2}}}\right)$, as $\phantom{\rule{0.1em}{0ex}}i$ and $\phantom{\rule{0.1em}{0ex}}d$ are bounded by $k$ and ${d}_{\mathsf{\text{obs}}}$, respectively, and we have ${d}_{\mathsf{\text{obs}}}\le k\cdot \delta $.
When a gene cluster is observed only in a subset of the studied genomes, it becomes tricky to define the meaning of "at least as well conserved": Is a gene cluster better conserved if it occurs in many genomes at a low quality, or in fewer genomes but at higher quality? We suggest that the latter should be given preference. Otherwise, there is the risk of systematically preferring gene clusters that occur in a large number of genomes, yet only incompletely in the form of one or two genes, over clusters that are very well conserved but only in a small number of genomes. We believe that the latter are the more interesting cases. Therefore we say a gene cluster $C$ with $\delta $-locations in ${k}^{\prime}$ out of $k$ genomes and sum distance ${d}_{\mathsf{\text{obs}}}$ is conserved at the same or better quality in the randomized genomes, if it has $\delta \text{-}\mathsf{\text{locations}}$ in at least ${k}^{\prime}$ of them, and the best ${k}^{\prime}$ $\delta \text{-}\mathsf{\text{locations}}$ (from $\phantom{\rule{0.1em}{0ex}}k$ different genomes) have a sum distance of at most ${d}_{\mathsf{\text{obs}}}$ to $C$. Unfortunately the recurrence used in equation (7) cannot be extended to compute the corresponding probabilities. We need to track the sum of the ${k}^{\prime}$ smallest distances below $\delta $, after processing the first $i\le k$ genomes. This value cannot be computed in a simple recursive manner, as there is no optimal substructure underlying the problem: The sum, after processing $i$ genomes, depends not only on the sum before the genome was added and the distance for the new genome, but also on the previous number of distances below $\delta $ and the values of these distances. Only with all of this information is it possible to decide whether or not the distance encountered for the new genome needs to be added to the sum. In the absence of an efficient dynamic programming approach, we need to sum probabilities over all ${{k}^{\delta}}^{+\mathsf{\text{1}}}$ distance vectors. This becomes infeasible for larger $\delta $.
In order to avoid exponential running time, we studied a simpler approach where we use a fixed distance threshold ${\delta}^{\prime}$ for all genomes. This can be either the original threshold $\delta $, or the largest entry in d_{obs}, i. e. the largest of the distances between $C$ and its best $\delta \text{-}\mathsf{\text{location}}$ in each genome.
The base cases are $Q\left[0,\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{1}}\right]=\mathsf{\text{1}}-{p}_{\mathsf{\text{1}}},\phantom{\rule{2.77695pt}{0ex}}Q\left[\mathsf{\text{1}},\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{1}}\right]={p}_{\mathsf{\text{1}}}$, and $Q\left[i,j\right]=0$ for all $i>j$. The likelihood of finding at least ${k}^{\prime}$ ${\delta}^{\prime}$-locations in $\phantom{\rule{0.1em}{0ex}}k$ genomes is then the sum over all $Q\left[i,\phantom{\rule{2.77695pt}{0ex}}k\right]$ with ${k}^{\prime}\le i\le k.$ Computation requires $O\left({k}^{\mathsf{\text{2}}}\right)$ time. This method will be referred to as "global distance bound".
Unfortunately, the above approach has the following problem: Consider two (otherwise identical) gene clusters, both found in three genomes. The first cluster is found with distances 0, 1 and 4 in the three genomes; the second cluster with distances 3, 4 and 4. Common sense tells us that the first gene cluster is more significant, because it is less likely to occur by chance. However, the approach described above will come up with identical likelihoods, as, in both cases we have ${\delta}^{\prime}=\mathsf{\text{4}}$. In fact, for the first cluster it may be beneficial to exclude the last occurrence, as this may lead to a smaller p-value; we omit the details.
To ensure that gene clusters of this type are evaluated differently while keeping the computational complexity reasonable, we resort to the following simplification: Recall that ${\delta}^{\prime}\le \delta $ is the maximum distance of the cluster to any occurrence. For genomes where we do not detect a $\delta $-location, we use the single-genome p-value with distance threshold ${\delta}^{\prime}$; for genomes other than the reference genome that contain a $\delta \text{-}\mathsf{\text{location}}$ (and, hence, a ${\delta}^{\prime}$-location), we use the single-genome p-value with distance threshold given by the distance of the detected interval in this genome. In the above example, for the first cluster, we use distance threshold 0 for the first genome, 1 for the second genome, and 4 for the third genome; for the second cluster we use distance threshold 3 for the first genome and 4 for the remaining two. This method will be referred to as "individual distance bounds".
False discovery rate control
where ${n}_{\mathsf{\text{1}}},...,{n}_{k}$ are the genome lengths. This is a conservative estimation and comes at the cost of increasing the probability of producing false negatives. That is, gene clusters that should be regarded as significant may be declared non-significant after the FDR correction. In addition, equation (9) actually overestimates the number of gene clusters tested as we do not take into consideration certain gene clusters that appear multiple times in a genome (and should be tested only once). We do not use the more powerful Šidák correction [31], as independence of the different tests cannot be guaranteed. FDR-corrected p-values can be larger than one, which solely indicates that this correction is conservative.
Evaluation of p-value accuracy for a single genome
We now evaluate the accuracy of the p-value estimation we introduced above for a single genome. Recall that we use two simplifications to keep this task computationally feasible. First, we assume that the intervals within a genome do not overlap, in order to gain statistical independence of the probabilities $q\left(L,\delta \right)$ in equation (2). Second, we employ a heuristic approach to deal with additional characters in cluster occurrences. To show that our calculations are still sufficiently accurate, we compare our estimated p-values with p-values derived empirically from large-scale simulations of random genomes.
Two simulation studies were performed, one with biological data and one with simulated data. To reduce simulation time, we performed the first study on four, somewhat small, bacterial genomes with just 600 to 850 genes each, namely Buchnera aphidicola APS, Ureaplasma urealyticum, Mycoplasma pneumoniae, and Borrelia burgdorferi. We downloaded these genomes from the NCBI database [32] and assigned homology families, using GHOST FAM [33] with standard parameters. Ten billion random sequences were then generated, with the same length and character frequencies as the original genomes.
Our reference gene cluster detection algorithm [17] was applied to the four original genomes, with three different parameter settings $\left(\delta =\mathsf{\text{1}},\phantom{\rule{2.77695pt}{0ex}}s=\mathsf{\text{4}}\right)$, $\left(\delta =\mathsf{\text{3}},\phantom{\rule{2.77695pt}{0ex}}s=\mathsf{\text{6}}\right)$, and $\left(\delta =\mathsf{\text{5}},\phantom{\rule{2.77695pt}{0ex}}s=\mathsf{\text{9}}\right)$. The quorum parameter was set to ${k}^{\prime}=\mathsf{\text{2}}$ in each case. This search returned 459 gene clusters. We computed for each gene cluster $C$, the p-value of each occurrence, excluding the reference interval. This was done with equation (2), by setting the threshold to ${\delta}^{\prime}$, the symmetric set distance between $C$ and the character set of the occurrence. Next, the p-value for a genome with randomized gene order containing a ${\delta}^{\prime}$-location of $C$ was empirically determined. To this end, all random genomes corresponding to the genome where the occurrence was observed were searched for intervals with a symmetric set distance of, at most, ${\delta}^{\prime}$ to $\phantom{\rule{0.1em}{0ex}}C$. The number of genomes with at least one such occurrence was divided by the total number of tested genomes. This gives an empirical estimate of the true p-value. When no occurrences were found, we omitted the pair from further analysis, as this shows the empirical p-value to be out of the scope of the current evaluation. This resulted in the omission of 277 gene clusters. The maximum p-value calculated for any such omitted pair was 9.23 · 10^{−11}. Therefore all of the omitted values fall into the 95% Agresti-Coull confidence interval [34] of finding no occurrences, that has an upper bound of 4.64 · 10^{−10}. (Note that when searching for gene clusters, what matters in p-value estimation is the order of magnitude. For example, p-values 2.7 · 10^{−380} and 5.4 · 10^{−380} differ by a factor of two; still, we would regard a gene cluster with either of these p-values as highly significant.)
where the ${y}_{i}$ are the observations (log empirical p-values), the ${x}_{i}$ are the predictions (log calculated p-values), and $\overline{y}$ is the mean of the observations. For the bacterial genomes, we achieve ${R}^{\mathsf{\text{2}}}=0.\mathsf{\text{9951661}}$. The observed deviations for small probabilities must be attributed to the fact that, for these reference gene clusters, ten billion random genomes are not enough to give a good estimate of the true value. It appears very likely that the empirical p-value is inaccurate, rather than the calculated p-value.
We argue that these results strongly indicate that our calculated p-values are very close to the true values; hence, although equation (2) does not take into account statistical dependencies, our calculations are still highly accurate.
In the following calculations, we use ${x}_{m}=\mathsf{\text{1}}$, so that each gene appears at least once. The bacterial genomes we use later on for our evaluation approximately follow a Pareto distribution with $\alpha =\mathsf{\text{2}}.\mathsf{\text{8}}$ (Additional file 1).
For each random genome we uniformly draw its length $n\in \left[\mathsf{\text{125}}0,\mathsf{\text{175}}0\right]$. To select the character content of the genome, we repeat the following: We choose the next character and draw the number of occurrences of this character in the random genome using the above Pareto distribution. We repeat this until all $\phantom{\rule{0.1em}{0ex}}n$ positions of the random genome are filled, discarding surplus copies of the last added character. In this manner, we generated ten genome contents.
To generate a random gene order, we could concatenate the genes and shuffle the resulting string. To speed up computations, we proceeded in a slightly different way: Instead of generating a random genome and then searching for reference gene clusters, we simply assume a gene cluster $C$ to be present. To obtain useful p-values we combined different strategies to construct $C$: (1) nine clusters are chosen with two to ten genes using the most commonly occurring genes; (2) nine clusters are chosen with two to ten genes using the rarest genes, usually occurring only once; (3) 82 clusters are chosen with two to ten genes, randomly selected. For each of these 100 gene clusters we randomly choose a maximum allowed distance $\delta \in \left[0,\left|C\right|-\mathsf{\text{2}}\right]$. As mentioned earlier, for higher values of $\phantom{\rule{0.1em}{0ex}}\delta $ exact p-values can be easily computed, simply by testing whether a single gene from $C$ is present in the genome.
We then proceeded as described above, comparing the calculated p-value to an empirical p-value obtained from one billion random draws. From the 10 · 100 = 1000 gene clusters tested, we omitted 249 p-value pairs where no occurrences were empirically observed. The maximum calculated p-value for any such omitted pair is 2.45 · 10^{−9}, while the upper bound of the 95% Agresti-Coull confidence interval for observing no occurrence is 4.64 · 10^{−9}. To further increase the accuracy of the empirical estimation of the 49 clusters that were found between one and ten times only, we did an additional nine billion random draws. For five clusters, we still observed less than ten occurrences; the largest calculated p-value among these clusters is 5.40 · 10^{−10}, the upper bound of the 95% Agresti-Coull interval for observing nine occurrences is 1.74 · 10^{−9}. As for these cases the empirically determined p-values are presumably inaccurate, we excluded them from our analysis.
Results
To evaluate our method, we used a dataset of 119 bacterial genomes from the RefSeq database [32]. A complete list of the genomes can be found in Additional File 1. This dataset has previously been used by Ling et al. [16] for the evaluation of the MCMuSeC software; we removed 14 plasmids. A partitioning of the genes of these genomes into homology families is available based on COG (Clusters of Orthologous Groups) numbers [35]. We deliberately ran our analysis on this set of well-described bacteria to facilitate evaluation of our cluster predictions based on the annotations provided in the database.
Top 20 gene cluster predictions.
distance to ref. | ||||||||
---|---|---|---|---|---|---|---|---|
ID | G | GN | min | max | avg | p-score | corr. p-score | description |
1 | 9 | 108 | 2 | 5 | 2.8 | 1314.43 | 1307.49 | 30S/50S ribosomal subunit |
2 | 7 | 114 | 0 | 3 | 1.6 | 1258.12 | 1251.18 | 30S/50S, rpoA, infA |
3 | 6 | 91 | 0 | 2 | 0.7 | 1031.47 | 1024.83 | ATP synthase |
4 | 9 | 57 | 0 | 5 | 1.4 | 896.31 | 890.57 | NADH dehydrogenase |
5 | 8 | 108 | 3 | 5 | 4.1 | 716.68 | 711.29 | 30S/50S ribosomal subunit |
6 | 8 | 88 | 0 | 5 | 4.2 | 569.88 | 564.63 | phosphate ABC transporter |
7 | 8 | 93 | 0 | 5 | 4.1 | 486.80 | 481.67 | infB, rfbA, nusA, hypothetical protein |
8 | 8 | 79 | 3 | 5 | 4.6 | 367.33 | 362.27 | putative/peptide ABC transporter |
9 | 8 | 62 | 3 | 5 | 4.4 | 294.41 | 289.40 | sugar ABC transporter |
10 | 8 | 65 | 2 | 5 | 4.1 | 290.24 | 285.24 | N-acetylmuramoyl, cell division |
11 | 4 | 33 | 0 | 0 | 0.0 | 272.99 | 267.55 | succinate dehydrogenase |
12 | 8 | 51 | 3 | 5 | 4.9 | 221.73 | 216.79 | pdhA/B/C |
13 | 8 | 48 | 2 | 5 | 4.9 | 216.54 | 211.62 | ATP-dependent (Clp) protease, trigger factor |
14 | 8 | 58 | 0 | 5 | 4.2 | 216.12 | 211.20 | 50S L31, prfA, thrA/B/C, rho, hemK |
15 | 8 | 50 | 4 | 5 | 4.9 | 213.61 | 208.70 | hisA/C/F/H |
16 | 6 | 32 | 0 | 2 | 1.7 | 200.11 | 194.80 | dnaA/N, gyrA/B, recF |
17 | 6 | 27 | 1 | 2 | 1.7 | 194.39 | 189.10 | carA/B, pyrC/B/R |
18 | 8 | 67 | 4 | 5 | 5.0 | 192.56 | 187.69 | elongation factor Tu, G; 30S S7 |
19 | 8 | 29 | 4 | 5 | 4.5 | 190.62 | 185.75 | sulfate ABC transporter |
20 | 8 | 44 | 2 | 5 | 4.3 | 181.13 | 176.28 | argB/C/D/G/H/F/R |
We have also computed p-values using the "maximum distance bound" method, see Additional file 1. The best scoring cluster, in both cases, is part of the 30S/50S ribosomal subunit, with nine conserved genes. However, using the "maximum distance bound" method, the best scoring occurrence of this cluster is only found in 66 genomes, with a p-value of 1.19 · 10^{−946}, while the occurrence in 108 genomes only receives a p-value of 1.43 · 10^{−714} due to its maximum distance of only five. The p-value for the 66 genomes occurrence using the "individual distance bounds" method is 2.37 · 10^{−947}, while the 108 genomes occurrence receives a p-value of 3.24 · 10^{−1308}.
None of the clusters in the above experiments have a corrected p-value anywhere close to 0.05, the typical threshold used to discriminate significant from non-significant observations. To get some insight into this "grey area" where no confident predictions can be made, we studied gene cluster predictions with a corrected p-value close to 0.05. We obtained these in three runs of our program, $\left(s=\mathsf{\text{3}},\phantom{\rule{2.77695pt}{0ex}}\delta =\mathsf{\text{1}},\phantom{\rule{2.77695pt}{0ex}}{k}^{\prime}=\mathsf{\text{9}}\right)$, $\left(s=\mathsf{\text{4}},\phantom{\rule{2.77695pt}{0ex}}\delta =\mathsf{\text{4}},\phantom{\rule{2.77695pt}{0ex}}{k}^{\prime}=\mathsf{\text{8}}\right)$ and $\left(s=\mathsf{\text{6}},\phantom{\rule{2.77695pt}{0ex}}\delta =\mathsf{\text{7}},\phantom{\rule{2.77695pt}{0ex}}{k}^{\prime}=\mathsf{\text{7}}\right)$. For each setting we collected all predictions with a p-value > 0.05 (corrected p-score < -1.3), and also the same number of predictions with the biggest p-values < 0.05. The complete list of these 84 predictions can be found in Additional file 1. We compared the predictions with known E. coli operons that we obtained from the RegulonDB database [36]. As can be seen in Additional file 1 most of the 84 predictions above and below the threshold contain at least one operon. A more formal analysis of these findings is hard to obtain with the limited data available. What appears to be a false positive prediction based on RegulonDB, may in fact be an unknown gene cluster, or one that is not well enough confirmed to appear in the database. Also a non-significant prediction, that is in fact a confirmed operon, does not mean that our p-values are too strict. Statistical significance by itself is simply not a necessary condition for a biological gene cluster.
Conclusion and outlook
In this paper, we presented the first statistical model to estimate the significance of gene cluster predictions under an approximate common interval-based gene cluster model. The underlying p-value calculations integrate all parameters that influence the quality of gene cluster conservation. Namely, the number of genomes in which the gene cluster is detected, the size of the gene cluster, the degree of conservation within the different occurrences, as well as the genome-wide frequencies of the genes involved. To keep the computation time feasible, we had to make some simplifying assumptions in the p-value calculation, but we have experimentally shown that our estimations are remarkably close to empirically derived p-values. An analysis of a set of well annotated genomes has proven that our method is able to re-discover known, highly conserved gene clusters with p-values clearly showing that such conservation did not occur by chance. The gene cluster at position 20 in our output list (functional annotation: arginine biosynthesis) still receives a highly significant p-value of 5.20 · 10^{−177}. We also studied clusters with low significance and observed known operons with p-values below the significance threshold of 0.05, as well as unknown clusters with significant p-values. However, due to the limited data available it was not possible to distinguish between false and true positives.
Although the simplifying assumptions seem to have little effect in practice, it would be an interesting next step to study more accurate models in the future. In particular, the assumption that the probabilities of observing a cluster are statistically independent in overlapping intervals could be omitted. This could be achieved by accounting for such dependencies in our calculations, for example by employing the inclusion/exclusion principle. Alternatively our present p-value approximation might be amenable to control by the Chen-Stein method [37].
Another strong assumption in our model is that any two genomes show fully randomized gene order, unless evolutionary pressure prohibits it. This assumption may be violated if we analyze genomes of closely related species or strains. In this case, significances will be overrated by our p-value estimation. For the dataset analyzed in this study at least, this problem is less severe than one might expect. Even strains of the same species show a relatively high amount of random shuffling (Additional file 1). Nevertheless, by lowering the quorum parameter to ${k}^{\prime}=\mathsf{\text{5}}$, we observe that many conserved regions are detected, where all or most species are Mycobacteria. In cases where more closely related strains are analyzed, we suggest several workarounds in Additional file 1. In the future, it will be an interesting problem to include incomplete randomization into our statistical model.
Declarations
Acknowledgements
K.J. and S.W. were supported by DFG grants ST 431/5-1 and BO 1910/8-1, respectively.
Declarations
We acknowledge support for the Article Processing Charge by the Deutsche Forschungsgemeinschaft and the Open Access Publication Funds of Bielefeld University Library.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 15, 2013: Proceedings from the Eleventh Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S15.
Authors’ Affiliations
References
- Tamames J, Casari G, Ouzounis C, Valencia A: Conserved Clusters of Functionally Related Genes in Two Bacterial Genomes. J Mol Evol. 1997, 44: 66-73. 10.1007/PL00006122.View ArticlePubMedGoogle Scholar
- Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci. 1998, 23 (9): 324-8. 10.1016/S0968-0004(98)01274-2.View ArticlePubMedGoogle Scholar
- Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The Use of Gene Clusters to Infer Functional Coupling. Proc Natl Acad Sci USA. 1999, 96 (6): 2896-2901. 10.1073/pnas.96.6.2896.PubMed CentralView ArticlePubMedGoogle Scholar
- Huynen M, Snel B: Gene and context: integrative approaches to genome analysis. Adv Protein Chem. 2000, 54: 345-379.View ArticlePubMedGoogle Scholar
- Wolf YI, Rogozin IB, Kondraskov AS, Koonin EV: Genome Alignment, Evolution of Procaryotic Genome Organization, and Prediction of Gene Function Using Genomic Context. Genome Res. 2001, 11: 356-372. 10.1101/gr.GR-1619R.View ArticlePubMedGoogle Scholar
- Yanai I, Mellor J, DeLisi C: Identifying functional links between genes using conserved chromosomal proximity. Trends Genet. 2002, 18 (4): 176-179. 10.1016/S0168-9525(01)02621-X.View ArticlePubMedGoogle Scholar
- Rogozin IB, Makarova KS, Wolf YI, Koonin EV: Computational Approaches for the Analysis of Gene Neighborhoods in Prokaryotic Genomes. Brief Bioinform. 2004, 5 (2): 131-149. 10.1093/bib/5.2.131.View ArticlePubMedGoogle Scholar
- Price M, Huang K, Alm E, Arkin A: A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 2005, 33 (3): 880-10.1093/nar/gki232.PubMed CentralView ArticlePubMedGoogle Scholar
- Homma K, Fukuchi S, Nakamura Y, Gojobori T, Nishikawa K: Gene Cluster Analysis Method Identifies Horizontally Transferred Genes with High Reliability and Indicates that They Provide the Main Mechanism of Operon Gain in 8 Species of $\gamma $-Proteobacteria. Mol Biol Evol. 2007, 24 (3): 805-813.View ArticlePubMedGoogle Scholar
- Heber S, Stoye J: Algorithms for Finding Gene Clusters. Proc. of Workshop on Algorithms in BioInformatics, WABI 2001 Lect. Notes Comput. Sci. 2001, Springer, Berlin, 2149: 254-265.Google Scholar
- Béal MP, Bergeron A, Corteel S, Raffinot M: An Algorithmic View of Gene Teams. Theor Comput Sci. 2004, 320 (2-3): 395-418. 10.1016/j.tcs.2004.02.036.View ArticleGoogle Scholar
- He X, Goldwasser MH: Identifying Conserved Gene Clusters in the Presence of Homology Families. J Comp Biol. 2005, 12: 638-656. 10.1089/cmb.2005.12.638.View ArticleGoogle Scholar
- Rahmann S, Klau GW: Integer Linear Programs for Discovering Approximate Gene Clusters. Proc. of Workshop on Algorithms in BioInformatics, WABI 2006, Lect. Notes Comput. Sci. 2006, Springer, Berlin, 4175: 298-309.Google Scholar
- Ling X, He X, Xin D, Han J, Han J: Efficiently identifying max-gap clusters in pairwise genome comparison. J Comput Biol. 2008, 15 (6): 593-609. 10.1089/cmb.2008.0010.View ArticlePubMedGoogle Scholar
- Böcker S, Jahn K, Mixtacki J, Stoye J: Computation of median gene clusters. J Comp Biol. 2009, 16 (8): 1085-1099. 10.1089/cmb.2009.0098.View ArticleGoogle Scholar
- Ling X, He X, Xin D: Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics. 2009, 25 (5): 571-10.1093/bioinformatics/btp027.View ArticlePubMedGoogle Scholar
- Jahn K: Efficient computation of approximate gene clusters based on reference occurrences. J Comput Biol. 2011, 18 (9): 1255-1274. 10.1089/cmb.2011.0132.View ArticlePubMedGoogle Scholar
- Durand D, Sankoff D: Tests for Gene Clustering. J Comp Biol. 2003, 10: 453-482. 10.1089/10665270360688129.View ArticleGoogle Scholar
- Raghupathy N, Hoberman R, Durand D: Two plus two does not equal three: statistical tests for multiple genome comparison. J Bioinform Comput Biol. 2008, 6: 1-22. 10.1142/S0219720008003242.View ArticlePubMedGoogle Scholar
- Raghupathy N, Durand D: Gene Cluster Statistics with Gene Families. Mol Biol Evol. 2009, 26 (5): 957-968. 10.1093/molbev/msp002.PubMed CentralView ArticlePubMedGoogle Scholar
- Calabrese PP, Chakravarty S, Vision TJ: Fast identification and statistical evaluation of segmental homologies in comparative maps. Bioinformatics. 2003, 19 (Suppl 1): i74-i80. 10.1093/bioinformatics/btg1008.View ArticlePubMedGoogle Scholar
- Hoberman R, Sankoff D, Durand D: The Statistical Analysis of Spatially Clustered Genes under the Maximum Gap Criterion. J Comp Biol. 2005, 12 (8): 1083-1102. 10.1089/cmb.2005.12.1083.View ArticleGoogle Scholar
- Wang X, Shi X, Li Z, Zhu Q, Kong L, Tang W, Ge S, Luo J: Statistical inference of chromosomal homology based on gene colinearity and applications to Arabidopsis and rice. BMC Bioinformatics. 2006, 7: 447-10.1186/1471-2105-7-447.PubMed CentralView ArticlePubMedGoogle Scholar
- Parida L: Gapped Permutation Pattern Discovery for Gene Order Comparisons. J Comp Biol. 2007, 14: 45-55. 10.1089/cmb.2006.0103.View ArticleGoogle Scholar
- Yi G, Sze SH, Thomas MR: Identifying clusters of functionally related genes in genomes. Bioinformatics. 2007, 23 (9): 1053-1060. 10.1093/bioinformatics/btl673.View ArticlePubMedGoogle Scholar
- Zeng X, Pei J, Vergara IA, Nesbitt MJ, Wang K, Chen N: OrthoCluster: A New Tool for Mining Synteny Blocks and Applications in Comparative Genomics. Proc of Extending Database Technology, EDBT 2008. 2008, 656-667.Google Scholar
- Rödelsperger C, Dieterich C: CYNTENATOR: progressive gene order alignment of 17 vertebrate genomes. PLoS One. 2010, 5: e8861-10.1371/journal.pone.0008861.PubMed CentralView ArticlePubMedGoogle Scholar
- Proost S, Fostier J, Witte DD, Dhoedt B, Demeester P, de Peer YV, Vandepoele K: i-ADHoRe 3.0-fast and sensitive detection of genomic homology in extremely large data sets. Nucleic Acids Res. 2012, 40 (2): e11-10.1093/nar/gkr955.PubMed CentralView ArticlePubMedGoogle Scholar
- Didier G, Schmidt T, Stoye J, Tsur D: Character Sets of Strings. J Discr Alg. 2007, 5: 330-340. 10.1016/j.jda.2006.03.021.View ArticleGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological). 1995, 289-300.Google Scholar
- Šidák Z: Rectangular Confidence Regions for the Means of Multivariate Normal Distributions. J Am Stat Assoc. 1967, 62: 626-633.Google Scholar
- Pruitt K, Tatusova T, Browen GR, Maglott D: NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012, 40: D130-135. 10.1093/nar/gkr1079.PubMed CentralView ArticlePubMedGoogle Scholar
- Schmidt T, Stoye J: Gecko and GhostFam - Rigorous and Efficient Gene Cluster Detection in Prokaryotic Genomes. Comparative Genomics, Methods in Molecular Biology. Edited by: Bergman N. 2007, Humana Press, 2: 165-182.Google Scholar
- Agresti A, Coull BA: Approximate Is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician. 1998, 52 (2): 119-126. [http://www.jstor.org/stable/2685469]Google Scholar
- Tatusov R, Fedorova N, Jackson J, Jacobs A, Kiryutin B, Koonin E, Krylov D, Mazumder R, Mekhedov S, Nikolskaya A: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41-10.1186/1471-2105-4-41.PubMed CentralView ArticlePubMedGoogle Scholar
- Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muñiz-Rascado L, García-Sotelo JS, Weiss V, Solano-Lira H, Martínez-Flores I, Medina-Rivera A, Salgado-Osorio G, Alquicira-Hernández S, Alquicira-Hernández K, López-Fuentes A, Porrón-Sotelo L, Huerta AM, Bonavides-Martínez C, Balderas-Martínez YI, Pannier L, Olvera M, Labastida A, Jiménez-Jacinto V, Vega-Alvarado L, del Moral-Chávez V, Hernández-Alvarez A, Morett E, Collado-Vides J: RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Research. 2013, 41 (Database): 203-213.View ArticleGoogle Scholar
- Arratia R, Goldstein L, Gordon L: Two Moments Suffice for Poisson Approximations: The Chen-Stein Method. The Annals of Probability. 1989, 17: 18-Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.