# Statistical power of phylo-HMM for evolutionarily conserved element detection

- Xiaodan Fan
^{1}, - Jun Zhu
^{2}, - Eric E Schadt
^{2}and - Jun S Liu
^{1}Email author

**8**:374

**DOI: **10.1186/1471-2105-8-374

© Fan et al.; licensee BioMed Central Ltd. 2007

**Received: **10 May 2007

**Accepted: **05 October 2007

**Published: **05 October 2007

## Abstract

### Background

An important goal of comparative genomics is the identification of functional elements through conservation analysis. Phylo-HMM was recently introduced to detect conserved elements based on multiple genome alignments, but the method has not been rigorously evaluated.

### Results

We report here a simulation study to investigate the power of phylo-HMM. We show that the power of the phylo-HMM approach depends on many factors, the most important being the number of species-specific genomes used and evolutionary distances between pairs of species. This finding is consistent with results reported by other groups for simpler comparative genomics models. In addition, the conservation ratio of conserved elements and the expected length of the conserved elements are also major factors. In contrast, the influence of the topology and the nucleotide substitution model are relatively minor factors.

### Conclusion

Our results provide for general guidelines on how to select the number of genomes and their evolutionary distance in comparative genomics studies, as well as the level of power we can expect under different parameter settings.

## Background

One of the most fundamental problems of molecular biology is to annotate all functional elements in the genome. Because of evolutionary pressure, many of the functional elements are believed to be located within regions that evolve more slowly than the overall genome [1, 2]. Regions that evolve more slowly are called evolutionarily conserved elements. Conservation analysis by comparing genomes of related species is a powerful approach for identifying functional elements like protein/RNA coding regions and transcriptional regulatory elements [3–8].

In comparative genomics, one often starts with an alignment of multiple orthologous sequences from several species. The power of a conservation analysis is usually measured by its sensitivity and specificity of detecting conserved elements from given alignments. Several papers have been published recently dealing with the power evaluation of phylogenetic models in comparative genomics studies [9, 10]. Both Eddy [9] and McAuliffe et al. [10] evaluated the power of such studies assuming a symmetric star topology and the Jukes-Cantor nucleotide substitution model, whose simple structure makes it possible to evaluate the power of the method analytically. McAuliffe et al. considered the hypothesis testing problem for the conservation of a single nucleotide site, whereas Eddy considered the classification problem for whether a whole DNA block is conserved or not. Both papers reported the theoretical power mainly as a function of the number of comparative genomes and the branch length from the center of the star topology to each species. However, the approaches they employed are not able to accommodate the substitution rate variation and correlations among nearby nucleotides along a genomic sequence. More realistic models are needed to account for the spatial rate variation and correlation in order to more accurately reflect the nature of real data [11]. Various models have been proposed for this task, including the hidden Markov model [12, 13], Ising chains [14, 15] and Markov random fields [16, 17]. In particular, Felsenstein and Churchill [13] used HMMs to model the substitution rate correlation along the genome. Since then, this model has been adapted to many evolution related problems [7, 18–20].

In this paper we evaluate the power of the phylogenetic hidden Markov model (phylo-HMM), which models the substitution rate variation using an HMM. As introduced collectively in ref. [7, 12, 13] to improve phylogenetic modeling, Phylo-HMM is a generative probability model for aligned multiple orthologous sequences. It models the molecular evolution in both the space dimension along the genome and the time dimension along branches of the phylogenetic tree. Along the genome of the common ancestor, an HMM is used to describe the change from one site to the next. Along each branch of the phylogenetic tree, a continuous-time Markov process is used to model the evolutionary process. Siepel et al. [7] implemented a two-state phylo-HMM to perform a genome-wide detection of evolutionarily conserved elements. They showed that this model can yield biologically meaningful results, but did not evaluate the statistical power of the method. In their study, the transition matrix of the HMM in the phylo-HMM model is assumed to be known and unchanged along the genome. This is equivalent to assuming that the expected length and coverage (defined as the percentage) of conserved elements are known and that the conservation ratio is homogeneous along the genome, which is unrealistic [21].

The goal of this paper is to systematically assess the ability (or power) of the phylo-HMM to identify conserved genomic regions. Because this model is too complex for a theoretical analysis and because the availability of real data with experimentally verified conserved elements is still very limited, we adopt a simulation approach. Given that different regions along the genome vary in terms of their neutral substitution rates, the expected lengths and coverage of the conserved elements [21], we used sequence data from promoter regions of four selected species (human, mouse, rat, and dog) to first estimate the set of key parameters of the phylo-HMM. We call this set of estimated parameter values the *baseline*. We simulated sequences by varying one or a subset of parameters around the *baseline*. Using these simulated sequences, we explore several fundamental questions in comparative genomics with respect to this model, including 1) does the topology of the phylogenetic tree critically impact statistical power?, 2) is the accuracy of the predictions significantly reduced if we simplify the nucleotide substitution model?, 3) is it always beneficial to select more distantly related species for this analysis, thereby increasing the branch lengths of the phylogenetic tree?, 4) how many comparative genomes do we need in order to detect conserved elements of a certain length with satisfactory power?, and 5) is spatial variation of the conservation ratio important? Assuming that the underlying phylo-HMM model is a reasonable approximation of reality, our results reveal a number of insights. First, we demonstrate that the most important factors of a comparative genomics analysis appear to include the number of genomes used and the evolutionary distance between pairs of species, consistent with results for simpler models reported by others. Further, we show that the conservation ratio of conserved elements and the expected length of the conserved elements are also major factors. Therefore, our results show that it is important to accurately characterize the inter-species (spatial) variation in the phylo-HMM model. In contrast, the influence of the topology and the nucleotide substitution model are relatively minor.

## Methods

### Phylo-HMM model

*(μ, ν)*, for the two-state HMM are derived from the data, and then re-parameterized as

*(P, L)*, which have a more intuitive explanation, with

*L*=

*1/μ*representing the expected length of a conserved element (i.e., a segment of contiguous conserved sites), and

*P*=

*ν/(μ*+

*ν)*representing the expected coverage of conserved elements (the density of the conserved sites). The phylogenetic models for nonconserved and conserved states are denoted as

*ψ*

_{ n }=

*(Q, π, τ, β)*and

*ψ*

_{ c }=

*(Q, π, τ, ρβ)*, respectively. Here

*π*is the vector of background (equilibrium) probabilities for the four nucleotide bases;

*τ*is the tree topology of the corresponding phylogeny;

*β*is a vector of non-negative real numbers representing branch lengths of the tree, which are measured by the expected number of substitutions per site;

*ρ*is the conservation ratio representing the ratio between the substitution rate in conserved regions versus that in nonconserved regions; and

*Q*is a 4-by-4 substitution rate matrix for the continuous-time Markov process.

Many parametric forms are available for the substitution rate matrix *Q* [22, 23]. Here we used the parametric forms in the PAML package [24]. The JC model [25] is the simplest one among all models implemented in PAML; it assumes a uniform base composition and a uniform rate for all types of substitutions. F81 [26] assumes that the substitution rate is proportional to the frequency of the target nucleotide. HKY [27] is a more realistic rate matrix because it accounts for non-uniform base composition and transition/transversion rate bias. REV, a generalization of HKY, is the most general model and only requires that the nucleotide substitution process be a reversible Markov process [28].

### Parameter inference and posterior probability

*τ*of the selected species for comparative analysis is known. The background distribution

*π*can be first estimated by the relative frequencies of the four bases of all the sequences in consideration and treated as known. All other parameters, denoted by

*θ*= (

*μ, ν, Q, β, ρ*), will be inferred from the data by their maximum likelihood estimates (MLEs). The complete likelihood can be written as:

where *K* is the total number of columns in the alignment, *x*_{
i
}is the observed nucleotide vector in the *i*-th column, *z*_{
i
}∈ {*c*, *n*} is the hidden state of the *i*-th column, (*b*_{
c
}, *b*_{
n
}) = (*ν/(μ* + *ν*), *μ*/(*μ* + *ν*)) is the initial emission probability of the HMM, and ${a}_{{z}_{i-1}{z}_{i}}$ is the transition probability as illustrated in Figure 1. With the help of the standard forward/backward procedure for HMM [29] to sum over all possible *Z*, we can use the EM algorithm [30] to get the MLE of *θ*, denoted as $\widehat{\theta}$. Based on $\widehat{\theta}$, the forward-summation-backward-sampling method [31] can then be used to compute the posterior probability for a given hidden state to be conserved, i.e., *P*(z_{
i
}= *c*|*X*, $\widehat{\theta}$). By applying a threshold to the posterior probability that a given site is conserved, we finally classify all sites in the alignment as either conserved or nonconserved. Another possible approach is to estimate *θ* using a Bayesian method via the Gibbs sampler [31].

### The Baseline

*baseline*parameters of the phylo-HMM from the promoter sequences of dog, human, mouse, and rat, whose evolutionary relationship is represented by the phylogenetic tree in Figure 2. The promoter sequence of a gene is defined as the region 2000 bp upstream to 500 bp downstream of the annotated transcription start site of that gene. MLAGAN [32] was used to align the promoter regions of genes in each

*orthologous gene cluster*, which is defined as a set of 4 genes, each from a different species, of which any pair are reciprocal-best BLAST hits of each other (filtered by blastn threshold e-value = 0.001). All gaps in the alignment were treated as missing data [7]. If the average branch length of the phylogenetic tree for the conserved sites of a gene's promoter region was estimated to be greater than one substitution per site, the corresponding promoter alignment was considered unreliable and, thus, removed from our simulation studies. After filtering out such alignments, 8533 orthologous gene clusters remained. Using the REV model for the substitution rate matrix, we fitted the phylo-HMM to each of these 8533 alignments separately. The

*baseline*is defined by setting the parameters (

*π, μ, ν, Q, β, ρ*) to be the median values of fitted parameters from these 8533 alignments.

### Simulation scheme

Given the phylogenetic tree and all model parameter values (*π, μ, ν, Q, β, ρ*), we first simulated the *true* state sequence from the two-state HMM, considering sequences 2500 nucleotides in length. Then, we simulated the ancestral DNA sequence by independently choosing a nucleotide for each site of the state sequence according to *π*. To simulate the ungapped alignment of the modern-day genomes, we followed the phylogenetic tree from its root for each site. Recursively, we sampled a nucleotide for each branch point (or leaf) of the tree according to the Markov transition kernel *Q* and the length of the branch that links to the previously-generated branch point. Assuming that the true topology and the true substitution model type were known, we computed the posterior probability for each site being conserved from the simulated alignments. The predicted state sequence was produced by applying a threshold to the posterior probability. The sensitivity of the method was defined as the proportion of true conserved sites correctly predicted, and the specificity was defined as the proportion of predicted conserved sites that were truly conserved. The Receiver Operating Characteristic (ROC) curve was plotted to illustrate the tradeoffs between the sensitivity and specificity at different thresholds.

The above simulation-prediction procedure was repeated 1000 times under each topology and parameter setting unless otherwise specified. The median sensitivity and specificity at each threshold were used to draw a ROC curve. The inter-quartile range (i.e., the range from the 1^{st} to 3^{rd} quartiles) was used to reflect the variation of the sensitivity and specificity measures. The size of this range is related to the length of the alignment. We used the median and the inter-quartile range instead of the mean and the standard deviation because the former measures are more robust and naturally bounded by 0 and 1. We used the bootstrap method [33] to obtain the 95% confidence interval of the median. The size of the bootstrap confidence interval is dependent of the number of simulations.

## Results and discussion

### The Baseline

The baseline setting

Parameter | Value | ||||
---|---|---|---|---|---|

expected length | 50 | ||||

expected coverage | 0.25 | ||||

background probability | A | C | G | T | |

0.24034 | 0.25518 | 0.26097 | 0.24351 | ||

A | C | G | T | ||

substitution rate matrix | -1.02446 | 0.23112 | 0.58662 | 0.20672 | A |

0.21767 | -0.99759 | 0.20665 | 0.57327 | C | |

0.54024 | 0.20206 | -0.95781 | 0.21551 | G | |

0.20402 | 0.60073 | 0.23095 | -1.03570 | T | |

un-rooted branch length | dog | human | mouse | rat | middle |

0.56496 | 0.33361 | 0.09604 | 0.10016 | 0.35807 | |

conservation ratio | 0.32 |

### Power comparison of phylo-HMM and the PID method

We simulated alignments from phylo-HMM with the *baseline* setting. These alignments were then used by phylo-HMM to infer the hidden state sequence conditional on the true topology and the true substitution model type, i.e., REV. The power of the phylo-HMM on these simulated alignments was compared with the PID method, which is a simple but widely used local conservation measure in comparative genomics [35–40]. Given an alignment, the PID value of each site was calculated as the percent of identical columns (i.e., completely conserved across all comparative species) within a window centered at the current column. To maximize the performance of the PID method, we set the window size as the true expected length of conserved elements, which is 50 + 1 for the *baseline*. Then the ROC curves were generated by varying thresholds for the PID values.

^{st}– to 3

^{rd}- quartiles of the sensitivity are 0.87 and 0.93, respectively. However, the PID method only achieved a median sensitivity of 0.73 at the same specificity level. The corresponding 1

^{st}– to 3

^{rd}- quartiles of the sensitivity are 0.66 and 0.78, respectively. The inter-quartile range for the estimate is also larger than that of phylo-HMM. The 95% bootstrap confidence interval for the median estimate is very narrow as shown in Figure 3, which implies that the variation of the median estimate is small. Therefore, the power of PID method is significantly smaller than that of phylo-HMM method for the

*baseline*. This is generally true if the alignment is generated from the phylo-HMM model with a non-star-topology tree.

### Influence of branch length at different locations in the tree

It has been suggested that the power of a comparative genomics method to distinguish conserved sites from nonconserved ones increases with the total evolutionary distance in the phylogenetic tree [5, 41]. However, it is difficult to obtain a reliable orthologous alignment if the genomes under consideration are too divergent. Even if the true orthologous alignment is available, some recent studies have shown that the power decreases with increasing total branch length in some instances [9, 10]. Here we investigated the effect of varying branch lengths on phylo-HMM power.

### Influence of the tree topology

Figure 5 also shows that the performance of the unbalanced topologies is much worse than the balanced ones. This is because the genomes with very short branches are so close to each other that there is little difference among them. In other words, the clustering effect of these genomes decreases the effective number of genomes in the topology, thus reducing its power. By the same logic, since the clustering effect for the star topology is more serious than for the binary trees, the unbalanced star topology performed much worse than the unbalanced binary trees. The unbalanced depth-first binary tree performed better than the breadth-first one because genomes are more widely dispersed in the depth-first tree. Although not conclusive, these results suggest that the branch lengths should be balanced among all genomes, and if the branch lengths are balanced, the topology is not that important in choosing species for comparative genomics studies. The power of simpler topologies like the symmetric star topology can be used to approximate the power of more complex, but balanced topologies with equal total branch length and the same number of genomes.

### Influence of the number of genomes

### Effect of different substitution rate models

Several parametric forms are available to model the nucleotide substitution process. Simpler models are obviously easier to handle. For example, the analytical form of the ROC curve is available for the JC model under certain conditions [9, 10]. Although models with more free parameters, such as HKY and REV, appear more realistic, conducting a proper statistical inference for these models is difficult, and one can actually lose information or over-fit the data if an improper analysis is done. Therefore, one question is whether we can use simpler models to capture the essential characteristics of the power for the more complex models. We performed two experiments to investigate this problem.

*baseline*, with the exception of the substitution rate matrix for each model type. For JC, we used the uniform background nucleotide probability. For HKY, we set kappa = 4. For simulated sequences from each model type, we used the corresponding true model type to infer the state sequence. We went through the simulation-prediction procedure 1000 times to get the ROC curve for each model type. The results are depicted in Figure 7(A) and show that the simpler models (JC and F81) performed slightly better than the more complex models (HKY and REV), which agrees with our intuition that simpler models are easier to solve. At the parameter values set by mimicking these 8533 alignments, however, the observed differences are small.

The second experiment was aimed at characterizing the effects of simplifying the substitution model. The REV model has five free parameters, whereas the JC model has no free parameters. We simulated alignments from the REV model using the baseline parameters, and then inferred the state sequence under the simpler JC model. This perhaps represents a more realistic situation in which the substitution model in the analysis is a simplification of the "true" substitution model. Figure 7(B) shows that the ROC curve for using the true model type is located within the 95% bootstrap confidence interval of the ROC curve for using the simplified JC model, which justifies the use of the simple JC model to study the properties of more complex models such as REV.

### Power comparison for different HMM parameters

Evolutionarily conserved elements within different regions have different sizes and densities. For example, conserved elements in coding regions are much longer than those in promoter regions, and the coverage of conserved elements (i.e., the percentage of sites that are conserved) in coding regions is higher than those in intergenic regions. Also, conserved elements in promoter regions of genes rich in cis-regulatory modules may be longer than those for other genes. Furthermore, promoter regions of different genes may have different coverage of conserved elements. Binding sites for different transcription factors have different lengths as well. All of these variations are modeled by the HMM parameters *μ* and *ν*, or correspondingly, expected coverage (*P*) and expected element length (*L*) in the phylo-HMM method.

*P*= 0.25, to achieve a median specificity of 0.94, the median sensitivity for expected element length

*L*= 10, 30, 50, 70, 90 is 0.30, 0.80, 0.91, 0.94, and 0.97, respectively. The relationship between the median sensitivity and

*1/L*is approximately linear with a negative slope. The effect of the expected coverage for fixed expected element length is rather complicated because the corresponding ROC curves cross over. For example, when the expected element length

*L*= 30, to achieve a low specificity like 0.8, the sensitivity for expected coverage

*P*= 0.45 is higher than that for

*P*= 0.05. To achieve a high specificity like 0.95, the sensitivity for expected coverage

*P*= 0.45 is lower than that for

*P*= 0.05. The variation of the ROC curves for smaller expected coverage is greater than for bigger ones in the range we considered (

*P*= 0.05, 0.15, 0.25, 0.35, 0.45). The area under the ROC curve of a larger expected coverage is greater than that of a smaller expected coverage. In this sense, a larger expected coverage of conserved elements makes them easier to detect.

### Power comparison for different conservation ratio

*ρ*), which is defined as the ratio of the average substitution rate of conserved sites over that of nonconserved sites, is one of the major factors determining the power of phylo-HMM. Figure 9 shows the power we can expect for a given conservation ratio under other baseline settings. For fixed specificity, the relationship between sensitivity and the conservation ratio is approximately a sigmoid type function: 1/(1 +

*e*

^{13(ρ-0.6)}). The power decreases dramatically with increasing conservation ratio, especially when the conservation ratio is around 0.6. This poses a problem for the two-state phylo-HMM model, which assumes a uniform conservation ratio along a given alignment. A promoter region could be bound by several types of transcription factors, and each of these could have a different conservation ratio compared to the nonconserved background. In this case, the power evaluation from a uniform conservation ratio is questionable. This problem can be alleviated by introducing multiple rate classes to form a multiple-state phylo-HMM.

### Ability to recover the true alignment and its influence on the power

All simulations discussed so far have been based on the assumption that we can get the true alignment. Therefore, the usefulness of the above results is questionable in more realistic situations where the true alignment is unknown. We again used simulation to check what alignment accuracy we could achieve under different phylo-HMM parameter settings. We simulated sequences that were 2500 base pairs long by varying one or a subset of phylo-HMM parameters around their baseline values. We then used MLAGAN [32] to align the simulated sequences, assuming that we knew the topology of the true evolutionary tree. Real alignment problems are plagued by many different types of noise in the input sequences (e.g., insertions and deletions), which inevitably decreases the alignment accuracy. In order to evaluate the alignment accuracy determined by phylo-HMM parameters, we directly fed the sequences simulated based on the phylo-HMM to MLAGAN without adding additional insertions or deletions. The alignment accuracy was measured by the column score [42], which is defined as the number of identical columns between the true alignment and the recovered alignment. We further divided the number of identical columns by 2500 to normalize the column score to the [0, 1] interval.

Besides MLAGAN, we also evaluated the alignment accuracy using TBA [43] and MAVID [44], which have been used for the recently published ENCODE consortium alignments [45]. Details are shown in additional file 1. Among these three aligners, TBA is essentially a local alignment algorithm, which avoids incorporating distant sequences into the output alignment. MAVID produces global alignments by assuming that all sites in the alignment evolve at the same speed. MLAGAN produces global alignments for sequences containing conserved blocks, which is the exact scenario we are dealing with. The alignment accuracies are similar for the results of the three aligners when the branch lengths are short (See supplementary Figures 1 and 2 in additional file 1). These results agree with those reported by Kumar and Filipski [46]. Nevertheless, the power of phylo-HMM is still quite robust with respect to the alignment quality.

## Conclusion

We used simulations to investigate the statistical power of phylo-HMM under a number of diverse situations. Among all factors studied, the number of species-specific genomes used, evolutionary distance, conservation ratio, and expected length of the conserved element are the major factors affecting the power of phylo-HMM. If conditions allow, it is better to select species such that every branch length in the phylogenetic tree is between 0.6 to 1 substitutions per site. To achieve a desired power, the number of genomes required in the analysis scales inversely with the mean branch length if the mean branch length is small. The conservation ratio and the expected length of the conserved element are uncontrollable factors. For a fixed specificity, the relationship between the median sensitivity and one over the expected length of the conserved element is approximately linear with a negative slope. We also found that the influence of the topology and the nucleotide substitution model were relatively minor. This justifies selecting species with a simpler topology, like the symmetric star topology, and approximating complex substitution models with less complex, easier to manipulate ones, like the JC model.

Our analyses of the power of phylo-HMM models were carried out under a number of simplifying assumptions that may influence the results reported and the degree to which they will reflect what can be expected from real data. First, for the simulations carried out to evaluate power, we assumed that all sequences were generated from the phylo-HMM model and that we could get the true ungapped alignment. In reality, it may be difficult to locate the orthologous sequences. Even if we can get the orthologous alignment, gaps are inevitable and it is perhaps inappropriate to treat them as missing data. Second, the real nucleotide substitution process may be more complicated than the models we studied here. For example, content dependent substitution is possible. Finally, the assumption of two evolutionary rates, one for functional elements and the other for neutral background, is also unrealistic and may affect the power estimate we obtained. However, these issues notwithstanding, the general guidelines established by our analysis should still hold qualitatively.

Several problems are worthy of further study. First, a goodness-of-fit test for phylo-HMM on real data remains to be validated. Second, in the current phylo-HMM model, once the conservation state of a site is determined for the common ancestor, it is fixed for all species and is not allowed to change over the course of evolution. More flexible models that allow for differences among the conservation states are required for different species. Third, the conservation annotation conducted by the current phylo-HMM model represents just one step toward the goal of functional annotation. A more ambitious approach would be to directly model the functional elements like transcription factor binding sites. An integrative approach to align the sequences by modeling evolution events and perform the conservation analysis at the same time is also desired. This may be feasible within the start topology JC framework.

## Declarations

### Acknowledgements

XF and JSL were supported by the NIH grant R01-GM078990. All simulations were performed using the Linux cluster at Rosetta Inpharmatics LLC, a wholly owned subsidiary of Merck & Co. Inc.

## Authors’ Affiliations

## References

- Wolfe KH, Sharp PM, Li WH: Mutation rates differ among regions of the mammalian genome. Nature 1989, 337(6204):283–285. 10.1038/337283a0View ArticlePubMed
- Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 2003, 299(5611):1391–1394. 10.1126/science.1081331View ArticlePubMed
- Hardison RC, Oeltjen J, Miller W: Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res 1997, 7(10):959–966.PubMed
- Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 2003, 423(6937):241–254. 10.1038/nature01644View ArticlePubMed
- Cooper GM, Brudno M, Green ED, Batzoglou S, Sidow A: Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res 2003, 13(5):813–820. 10.1101/gr.1064503PubMed CentralView ArticlePubMed
- Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G: Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol 2005, 3(1):e7. 10.1371/journal.pbio.0030007PubMed CentralView ArticlePubMed
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15(8):1034–1050. 10.1101/gr.3715005PubMed CentralView ArticlePubMed
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 2005, 434(7031):338–345. 10.1038/nature03441PubMed CentralView ArticlePubMed
- Eddy SR: A model of the statistical power of comparative genome sequence analysis. PLoS Biol 2005, 3(1):e10. 10.1371/journal.pbio.0030010PubMed CentralView ArticlePubMed
- McAuliffe JD, Jordan MI, Pachter L: Subtree power analysis and species selection for comparative genomics. Proc Natl Acad Sci U S A 2005, 102(22):7900–7905. 10.1073/pnas.0502790102PubMed CentralView ArticlePubMed
- Yang Z, Goldman N, Friday A: Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol Biol Evol 1994, 11(2):316–324.PubMed
- Yang Z: A space-time process model for the evolution of DNA sequences. Genetics 1995, 139(2):993–1005.PubMed CentralPubMed
- Felsenstein J, Churchill GA: A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol 1996, 13(1):93–104.View ArticlePubMed
- Koshi JM, Goldstein RA: Models of natural mutations including site heterogeneity. Proteins 1998, 32(3):289–295. 10.1002/(SICI)1097-0134(19980815)32:3<289::AID-PROT4>3.0.CO;2-DView ArticlePubMed
- Wagner H, Baake E, Gerisch T: Ising quantum chain and sequence evolution. J Stat Phys 1999, 92: 1017–1052. 10.1023/A:1023048711599View Article
- Schadt EE, Sinsheimer JS, Lange K: Applications of codon and rate variation models in molecular phylogeny. Mol Biol Evol 2002, 19(9):1550–1562.View ArticlePubMed
- Schadt E, Lange K: Codon and rate variation models in molecular phylogeny. Mol Biol Evol 2002, 19(9):1534–1549.View ArticlePubMed
- Thorne JL, Goldman N, Jones DT: Combining protein evolution and secondary structure. Mol Biol Evol 1996, 13(5):666–673.View ArticlePubMed
- Husmeier D, Wright F: Detection of recombination in DNA multiple alignments with hidden Markov models. J Comput Biol 2001, 8(4):401–427. 10.1089/106652701752236214View ArticlePubMed
- Pedersen JS, Hein J: Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics 2003, 19(2):219–227. 10.1093/bioinformatics/19.2.219View ArticlePubMed
- Hellmann I, Prufer K, Ji H, Zody MC, Paabo S, Ptak SE: Why do human diversity levels vary at a megabase scale? Genome Res 2005, 15(9):1222–1231. 10.1101/gr.3461105PubMed CentralView ArticlePubMed
- Whelan S, Lio P, Goldman N: Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet 2001, 17(5):262–272. 10.1016/S0168-9525(01)02272-7View ArticlePubMed
- Schadt EE, Sinsheimer JS, Lange K: Computational advances in maximum likelihood methods for molecular phylogeny. Genome Res 1998, 8(3):222–233.View ArticlePubMed
- Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 1997, 13(5):555–556.PubMed
- Jukes TH, Cantor CR: Evolution of protein molecules. In Mammalian protein metabolism. Edited by: Munro HN. New York , Academic Press; 1969:21–123.View Article
- Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 1981, 17(6):368–376. 10.1007/BF01734359View ArticlePubMed
- Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 1985, 22(2):160–174. 10.1007/BF02101694View ArticlePubMed
- Tavaré S: Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences 1986, 17: 57–86.
- Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 1989, 77: 257–286. 10.1109/5.18626View Article
- Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977, 39(1):1–38.
- Liu JS: Monte Carlo Strategies in Scientific Computing. New York , Springer-Verlag; 2001:28–31.
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 2003, 13(4):721–731. 10.1101/gr.926603PubMed CentralView ArticlePubMed
- Efron B, Tibshirani R: An Introduction to the Bootstrap. London , Chapman and Hall; 1993.View Article
- Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory clusters in Drosophila. Genome Res 2003, 13(4):579–588. 10.1101/gr.668403PubMed CentralView ArticlePubMed
- Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, Dubchak I: VISTA : visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 2000, 16(11):1046–1047. 10.1093/bioinformatics/16.11.1046View ArticlePubMed
- Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker--a web server for aligning two genomic DNA sequences. Genome Res 2000, 10(4):577–586. 10.1101/gr.10.4.577PubMed CentralView ArticlePubMed
- Dermitzakis ET, Reymond A, Lyle R, Scamuffa N, Ucla C, Deutsch S, Stevenson BJ, Flegel V, Bucher P, Jongeneel CV, Antonarakis SE: Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature 2002, 420(6915):578–582. 10.1038/nature01251View ArticlePubMed
- Nobrega MA, Ovcharenko I, Afzal V, Rubin EM: Scanning human gene deserts for long-range enhancers. Science 2003, 302(5644):413. 10.1126/science.1088328View ArticlePubMed
- Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S: Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res 2004, 14(3):451–458. 10.1101/gr.1327604PubMed CentralView ArticlePubMed
- Ovcharenko I, Loots GG, Hardison RC, Miller W, Stubbs L: zPicture: dynamic alignment and visualization tool for analyzing conservation profiles. Genome Res 2004, 14(3):472–477. 10.1101/gr.2129504PubMed CentralView ArticlePubMed
- Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VV, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED: Comparative analyses of multi-species sequences from targeted genomic regions. Nature 2003, 424(6950):788–793. 10.1038/nature01858View ArticlePubMed
- Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 1999, 27(13):2682–2690. 10.1093/nar/27.13.2682PubMed CentralView ArticlePubMed
- Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 2004, 14: 708–715.PubMed CentralPubMed
- Bray N, Pachter L: MAVID: constrained ancestral alignment of multiple sequences. Genome Res 2004, 14(4):693–699. 10.1101/gr.1960404PubMed CentralView ArticlePubMed
- Thomas DJ, Rosenbloom KR, Clawson H, Hinrichs AS, Trumbower H, Raney BJ, Karolchik D, Barber GP, Harte RA, Hillman-Jackson J, Kuhn RM, Rhead BL, Smith KE, Thakkapallayil A, Zweig AS, Haussler D, Kent WJ, Consortium TENCODEP: The ENCODE Project at UC Santa Cruz. Nucleic Acids Res 2007, 35: D663-D667. 10.1093/nar/gkl1017PubMed CentralView ArticlePubMed
- Kumar S, Filipski A: Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res 2007, 17: 127–135. 10.1101/gr.5232407View ArticlePubMed

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.