Combining comparative genomics with de novo motif discovery to identify human transcription factor DNA-binding motifs
© Mao and Zheng; licensee BioMed Central Ltd 2006
Published: 12 December 2006
As more and more genomes are sequenced, comparative genomics approaches provide a methodology for identifying conserved regulatory elements that may be involved in gene regulations.
We developed a novel method to combine comparative genomics with de novo motif discovery to identify human transcription factor binding motifs that are overrepresented and conserved in the upstream regions of a set of co-regulated genes. The method is validated by analyzing a well-characterized muscle specific gene set, and the results showed that our approach performed better than the existing programs in terms of sensitivity and prediction rate.
The newly developed method can be used to extract regulatory signals in co-regulated genes, which can be derived from the microarray clustering analysis.
Transcription factors (TFs) regulate the expression of genes by interacting with cis-regulatory elements in DNA sequences in response to internal and external stimuli. Genomic comparisons have shown that most of these cis-regulatory elements are located in the conserved non-coding region of the genome [1, 2]. Over the past decade, many bioinformatics tools have been developed to detect de novo DNA motifs bound by TFs that are overrepresented in the promoters of a group of co-regulated genes. Despite the tremendous efforts in algorithm development, discovering human regulatory motifs from a set of co-regulated promoter sequences remains very challenging . In this study, we developed a novel approach to identify human TF DNA-binding motifs for a set of co-regulated genes by combining comparative genomics with de novo motif discovery. This approach restricted the motif search in the human promoter regions that are conserved across multiple species.
Most of the current programs combining comparative genomics with de novo motif discovery use human-mouse orthologous sequences [4–7] or human-mouse-rat orthologs  to obtain the conserved promoter regions. Our approach is the first to use an 8-species (human, chimp, mouse, rat, dog, chicken, fugu and zebrafish) genome comparison to derive the human conserved regions. The method takes the advantage that the ability to detect TF binding sites improves with both the number of comparison species and the evolution distance between species [9, 10].
The motif discovery algorithm in our approach is a modification of the original Weeder program [11, 12], which is based on the exhaustive oligomer enumeration technique. Current programs that combine comparative genomics with de novo motif discovery are based on a greedy search algorithm, Gibbs sampling or expectation maximization techniques [4–8, 13, 14]. These techniques are all heuristic. Depending on the initial configuration, these heuristic algorithms might be trapped in local maxima . In contrast, due to the exhaustive characteristic of the Weeder algorithm, a single run is sufficient to identify the specified number of most over-represented motifs. In addition, in a recent assessment comparing the performance of various sequence-based motif discovery programs , the original Weeder outperformed the other programs, which included the Gibbs sampling and expectation maximization based algorithms, in most measurements. We modified the original Weeder program to incorporate conservation information derived from the comparative genomics. The modified program was implemented in C under Linux.
Effects of masking methods
A stringent masking method (SMM) and a window-based masking method (WBMM) were developed to extract conserved upstream regions of human genes from a multiple-species genome alignment. Both methods masked the non-conserved nucleotides and thus reduced the size of sequence space to be searched for regulatory motifs. However, the methods may also eliminate the true TF binding sites as the number of species required to have the same base as human in a multiple alignment column, t (see Method), approaches 7. We used the muscle data set to assess the effect of masking methods on the size of searching space and on the percentage of true binding sites retained. The muscle specific gene set was experimentally verified to be regulated by transcription factors Myf, SRF, Mef2, Sp1, Tef and NVL . Fourteen Myf binding sites and 7 Mef2 binding sites determined by experiments were mapped to the upstream sequences of the human genes. For the assessment, an experimentally determined binding site was considered to be retained if a sequence of at least 6 consecutive bases within the binding site was unmasked. Such a binding site can be sampled by a 6-bp motif.
Motif discovery for the muscle gene set
Comparison of motifs identified by different programs for the muscle genes 1,2,3,4,5.
masking repeats 6
Our de novo motif discovery algorithm in combination with the masking method outperformed CompareProspector and Toucan according to the motif discovery results for the muscle gene set. Both CompareProspector and our approach using WBMM with t set to 4 identified four TF regulatory motifs (Table 1), however, our approach exhibited higher sensitivity and higher prediction rates than CompareProspector (Figure 2). Our approach correctly predicted 14 out of the 34 true binding sites for the muscle genes which substantially exceed the 8 true binding sites identified by CompareProspector. In comparison, Toucan only identified two TF DNA-binding motifs. Both the combined sensitivity and PPR for Toucan are lower than our approach using WBMM with t set to 4. Both CompareProspector and Toucan biased motif searches in the human-mouse conserved regions, and they both employ Gibbs sampling technique for the motif discovery. In contrast, our approach requires motifs to be conserved over multiple species and uses exhaustive oligomer enumeration technique to discover motifs. It is likely that both the properties of the masking method and motif discovery technique contribute to the superior performance of our approach.
Deciphering human regulatory motifs is crucial for understanding the regulatory mechanisms that control gene expression in response to various stimuli. Recent advances in genomics have enabled large scale investigation of gene regulation by microarray technology, which can identify genes with similar expression patterns by clustering analysis. These gene clusters with similar expression patterns are likely to be regulated by common transcription factors. It is feasible to apply the approach developed in this study to analyze the gene clusters for the extraction of regulatory signals.
Extracting conserved upstream regions of human genes
Multiple alignments of 5000 bp sequences upstream of annotated transcription starts of human RefSeq genes to the genomes of the following 7 species, chimp, mouse, rat, dog, chicken, fugu and zebrafish, using the program Multiz  were downloaded from UCSC Genome Browser. Two methods were applied to extract conserved upstream regions, respectively. The first one would be referred to as stringent masking method (SMM). The method retained a base in a human gene's upstream sequence if the base was located in the non-repeat region and conserved in at least t out of the other 7 species in a multiple-alignment column, where t is an integer and specified by a user. The method replaced a base with the letter 'N', otherwise . The second method, referred to as window-based masking method (WBMM), utilized less stringent conservation criteria. It first assigned a score for each base in a human gene's upstream sequence multiple-aligned with the other 7 species. If a base was in the non-repeat region and conserved in at least t of the seven species, it was assigned 1; otherwise, a base was assigned 0. The score for a gap inserted into the human gene's upstream sequence for the purpose of multiple-alignment was also assigned 0. Then a window-based value (WBV) for a base in the aligned human upstream sequence was calculated as the summation of scores over a user-specified window size centered at that base. The summation excluded the score of that base. After WBV was calculated, a base was retained if it met either of the following two conditions: (i) the score for the base is 1; (ii) the base is in the non-repeat region and the WBV for the base exceeds a certain threshold specified by a user. Otherwise, the base was masked by 'N'. Eleven bp was chosen as the window size and the threshold for WBV was set at 7 in this study. For both methods, gaps appearing in the multiple-alignment were retained as part of the upstream sequence of a human gene.
De novo motif discovery
The algorithm discovers over-represented motifs in a set of DNA sequences upstream of co-regulated genes using the exhaustive oligomer enumeration technique. The algorithm is a modification of the Weeder program [3, 11, 12] to cope with the masked bases and gaps in upstream sequences. In the algorithm, an oligomer m with length l is defined as a sequence of l conserved bases. Neither the letter 'N' used to replace a nucleotide nor a gap can appear in m. A match to the m that allows for e mutations is also an l- bp oligomer that has at most e mismatches with respect to m. Let S = S1...S k be the set of masked upstream sequences, n1...n k are the number of l-bp oligomers in the corresponding sequence, and N is given as:
A sequence specific score for the oligomer m allowing for e mutations is given as :
where A represents the set of sequences each of which has at least one match to the m, e i (0 ≤ e i ≤ e) is the minimal number of mismatches to m among all matching oligomers in the i th sequence, H i (m, e i ) is the number of oligomers with exact e i mismatches with respect to m in the i th sequence, and B (m, e i ) is the background frequency for m with e i mutations computed using the genome-wide regulatory regions of the human species .
In addition, a general score for m(e) was also calculated :
where Tot(m, e) is the total number of oligomers matching m(e) in all input sequences, B(m, e) is the background frequency for m(e).
A final score for m(e) is given as:
Score(m, e) = seq(m, e) + Glo(m, e) (4)
The algorithm used (4) to calculate scores for all oligomers of the same length that occur in the upstream sequences and ranked them based on their scores. Putative TF DNA-binding motifs were selected from the top ranked oligomers. In applying the algorithm to discover DNA-binding motifs for the muscle specific gene set (22 genes), both strands of the genes' upstream sequences were searched. The algorithm enumerated all 6, 8, 10, and 12-bp oligomers in the upstream sequences. The number of mutations allowed is 1 for 6-mer, 2 for 8-mer, 3 for 10-mer and 4 for 12-mer, corresponding to the large mode in the original Weeder program . Only oligomers that were positively scored and for which matching oligomers can be found in more than 50% of the number of input sequences were retained. Among the retained oligomers, the twenty highest scored oligomers, if any, at the lengths of 6, 8, 10 and 12 bp, respectively, were further selected. Among them, only oligomers satisfying both of the following conditions were reported to the user as discovered motifs: (i) An oligomer ranks in the top ten; (ii) An oligomer is both horizontally and vertically redundant, as defined by the Weeder program , by comparing the pattern of the oligomer with the other oligomers ranked in the top twenty. Thus, for the muscle gene set, at most 40 motifs were reported. In analyzing the muscle gene set, parameters of the algorithm were set as: -O HS -R 50 -S -M -T 20.
After redundant motifs which also ranked in the top ten were reported, they were used to search for putative TF binding sites in the masked upstream sequences. Assuming one of the reported motifs is an l-bp oligomer allowing for e mutations, all matching occurrences in the masked upstream sequences were first collected to construct the position specific weight matrix (PSWM). This PSWM was then used to compute a score for each matching occurrence . All the matching occurrences with scores above a certain threshold were output as the putative TF binding sites. In this study, the threshold for a 6-mer is set at 100% (i.e. exact match), and the thresholds for 8- 10- and 12-mers are 85%.
SMM or WBMM reduced the size of sequence space to be searched for regulatory motifs. Masking percentage (MP) is introduced to measure such reduction, and it is calculated as:
MP = 1 - N6 - mer/(4995 · k) (5)
where N6-meris calculated using (1) for 6-bp oligomers, 4995 is the number of 6-mers that would be found in an unmasked 5000-bp sequence containing no gaps, and k is the number of input sequences.
Using CompareProspector and Toucan to discover DNA-binding motifs
For comparison, CompareProspector and Toucan were also applied to identify TF regulatory motifs for the muscle genes. CompareProspector is a motif discovery program that extends Gibbs sampling by biasing the search in promoter regions conserved across species . In applying CompareProspector to discover human regulatory motifs, the alignment of human-mouse orthologs were extracted from the 8-species multiple alignment and were used to calculate window (20 bp) percent identity values required by the program. Motif lengths were set at 6, 8, 10 and 12 bp, respectively. For each motif length, the top ten motifs were reported. The parameters for using CompareProspector are '-T 0.8 -Z 0.5 -D 1'. Toucan also used the Gibbs sampling technique, in combination with comparative genomics, to identify DNA-binding motifs [7, 14]. The motif(s) identified by Toucan for the muscle genes were taken from Aerts et al .
Evaluating program performance
Sensitivity (Sn) and positive prediction rate (PPR) were computed for a discovered motif. Sn gives the fraction of true binding sites (i.e. experimentally established binding sites) that are predicted by a motif, whereas PPR reflects specificity and is calculated as the fraction of sites predicted by a motif that are true binding sites . The mean of Sn and PPR for a discovered motif is defined as average performance (AP) of the motif. In this study, a motif predicted site is considered to be a true hit if it overlaps with a true binding site by at least 5 bp.
We would like to express our gratitude to Drs. Giulio Pavesi and Graziano Pesole for providing us the source code of the original Weeder program. This work is partly supported by grants GC-3609-04-43766CM to D. Watson, and to A. Kraft, and by a Hollings Cancer Center/Medical University of South Carolina Department of Defense grant "Translational Research on Cancer Control and Related Therapy" (Subcontract GC-3319-05-4498CM) to W. J. Zheng.
This article has been published as part of BMC Bioinformatics Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S4.
- Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 2003, 423(6937):241–254. 10.1038/nature01644View ArticlePubMedGoogle Scholar
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 2005, 434(7031):338–345. 10.1038/nature03441PubMed CentralView ArticlePubMedGoogle Scholar
- Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al.: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005, 23(1):137–144. 10.1038/nbt1053View ArticlePubMedGoogle Scholar
- Wang T, Stormo GD: Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics 2003, 19(18):2369–2380. 10.1093/bioinformatics/btg329View ArticlePubMedGoogle Scholar
- Prakash A, Blanchette M, Sinha S, Tompa M: Motif discovery in heterogeneous sequence data. Pac Symp Biocomput 2004, 348–359.Google Scholar
- Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S: Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res 2004, 14(3):451–458. 10.1101/gr.1327604PubMed CentralView ArticlePubMedGoogle Scholar
- Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, Moor BD: Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucl Acids Res 2003, 31(6):1753–1764. 10.1093/nar/gkg268PubMed CentralView ArticlePubMedGoogle Scholar
- Sinha S, Blanchette M, Tompa M: PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 2004, 5: 170. 10.1186/1471-2105-5-170PubMed CentralView ArticlePubMedGoogle Scholar
- Gertz J, Riles L, Turnbaugh P, Ho SW, Cohen BA: Discovery, validation, and genetic dissection of transcription factor binding sites by comparative and functional genomics. Genome Res 2005, 15(8):1145–1152. 10.1101/gr.3859605PubMed CentralView ArticlePubMedGoogle Scholar
- Eddy SR: A model of the statistical power of comparative genome sequence analysis. PLoS Biol 2005, 3(1):e10. 10.1371/journal.pbio.0030010PubMed CentralView ArticlePubMedGoogle Scholar
- Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001, 17(Suppl 1):S207–214.View ArticlePubMedGoogle Scholar
- Pavesi G, Mereghetti P, Mauri G, Pesole G: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res 2004, 32(Web Server):W199–203.PubMed CentralView ArticlePubMedGoogle Scholar
- Li X, Wong WH: Sampling motifs on phylogenetic trees. Proc Natl Acad Sci U S A 2005, 102(27):9481–9486. 10.1073/pnas.0501620102PubMed CentralView ArticlePubMedGoogle Scholar
- Aerts S, Van Loo P, Thijs G, Mayer H, de Martin R, Moreau Y, De Moor B: TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis. Nucleic Acids Res 2005, 33(Web Server):W393–396. 10.1093/nar/gki354PubMed CentralView ArticlePubMedGoogle Scholar
- Hu J, Li B, Kihara D: Limitations and potentials of current motif discovery algorithms. Nucl Acids Res 2005, 33(15):4899–4913. 10.1093/nar/gki791PubMed CentralView ArticlePubMedGoogle Scholar
- Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE: Human-mouse genome comparisons to locate regulatory sites. Nat Genet 2000, 26(2):225–228. 10.1038/79965View ArticlePubMedGoogle Scholar
- Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al.: Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner. Genome Res 2004, 14(4):708–715. 10.1101/gr.1933104PubMed CentralView ArticlePubMedGoogle Scholar
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, et al.: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431(7004):99–104. 10.1038/nature02800PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.