Paircomp, FamilyRelationsII and Cartwheel: tools for interspecific sequence comparison
© Brown et al; licensee BioMed Central Ltd. 2005
Received: 18 November 2004
Accepted: 24 March 2005
Published: 24 March 2005
Comparative sequence analysis is an effective and increasingly common way to identify cis-regulatory regions in animal genomes.
We describe three tools for comparative analysis of pairs of BAC-sized genomic regions. Paircomp is a tool that does windowed (ungapped) comparisons of two sequences and reports all matches above a set threshold. FamilyRelationsII is a graphical viewer for comparisons that enables interactive exploration of several different kinds of comparisons. Cartwheel is a Web site and compute-cluster management system used to execute and store comparisons for display by FamilyRelationsII. These tools are specialized for the discovery of cis-regulatory regions in animal genomes. All tools and their source code are freely available at http://family.caltech.edu/.
These tools have been shown to effectively identify regulatory regions in echinoderms, mammals, and nematodes.
Comparative sequence analysis is fast becoming a standard method for discovering cis-regulatory modules. The technique relies on the signatures of conservation left by functional genomic regions as the background sequence evolves. It is often the only way to computationally discover cis-regulatory modules in animal genomes when definite knowledge of upstream regulators is lacking, and it can serve as an excellent complement to experimental techniques.
Paircomp, FamilyRelationsII (FRII), and Cartwheel form an integrated system for comparing two BAC-sized (~100 kb) genomic sequences, viewing the comparison, manipulating thresholds and views, and extracting the results. These tools and their predecessors, seqcomp and FamilyRelations, have been used extensively in the years since we first made them available. However, the addition of Cartwheel, a Web server system for performing, storing, and revisiting analyses, makes this combined toolkit considerably more useful to the experimental biologist.
The first analysis done with FamilyRelations was a comparison of the otx region between two sea urchins; 11 of the 17 conserved blocks were shown to drive expression of a reporter. Kirouac and Sternberg showed that features conserved between C. elegans and C. briggsae encode functional regulatory regions. Romano and Wray used FamilyRelations to show that primary sequence identity was conserved in only part of the previously identified endo16 cis-regulatory region when the L. variegatus sequence was used as a partner to the S. purpuratus sequence. Leung et al. used FRII to verify that regions bound by NFKB were conserved between mouse and human. Most recently, Revilla-i-Domingo et al. identified a small conserved region in the delta genomic locus as a cis-regulatory element responsible for localized expression of delta in S. purpuratus. Similar analyses of the regulation of gatae, krox, wnt8, brachyury, tbrain, foxa and deadringer in S. purpuratus are forthcoming from this lab. While most published use of FRII and Cartwheel has been in sea urchins and nematodes, users have reported that the tools accurately identify regulatory regions in vertebrates and plants.
FRII and Cartwheel are specialized for identifying conservation within relatively small genomic regions, and can be used for comparing BAC sequences between organisms for which no whole genome assembly exists (e.g. S. purpuratus/L. variegatus). The exhaustive "dot-plot"-style search algorithm used (described below) assumes nothing about the relative positioning or orientation of regulatory regions and can be used to detect rearrangements that might be missed by a global alignment algorithm. Because of these features, FRII and Cartwheel are particularly useful in targeted searches for regulatory regions.
In this paper, we present these effective tools for comparative sequence analysis to the wider biological community.
Paircomp is a program for doing windowed comparisons of two sequences. It is an expanded reimplementation of the seqcomp program. Paircomp contains several algorithms for doing exhaustive fixed-width-window sequence comparisons, optimized for different parameters. The default algorithm uses a sliding window to do a "rolling comparison" and runs in O(N×M) time for two sequences of lengths N and M. Paircomp is written in C++ and has a Python interface.
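The rolling comparison can be illustrated with a short Python sketch. This is not paircomp's code or its Python interface: the function name `windowed_matches` and its return format are hypothetical, and only the forward strand is handled. The sketch walks each diagonal of the N×M comparison matrix, counts identities in the first window, and then updates that count in constant time as the window slides by one base.

```python
def windowed_matches(top, bot, windowsize, threshold):
    """Return (top_start, bot_start, n_identical) for every ungapped window
    whose fraction of identical bases is >= threshold.

    Hypothetical illustration of a rolling windowed comparison;
    forward strand only.
    """
    n, m, w = len(top), len(bot), windowsize
    matches = []
    if w <= 0 or n < w or m < w:
        return matches

    # Each diagonal is identified by diag = bot_pos - top_pos.
    for diag in range(-(n - w), (m - w) + 1):
        i = max(0, -diag)            # first position on this diagonal in top
        j = i + diag                 # corresponding position in bot
        length = min(n - i, m - j)   # number of cells on this diagonal

        # Count identities in the first window of the diagonal.
        ident = sum(top[i + k] == bot[j + k] for k in range(w))
        start = 0
        while True:
            if ident / w >= threshold:
                matches.append((i + start, j + start, ident))
            if start + w >= length:
                break
            # Roll the window: drop the base leaving, add the base entering.
            ident -= top[i + start] == bot[j + start]
            ident += top[i + start + w] == bot[j + start + w]
            start += 1
    return matches
```

For example, `windowed_matches("ACGTACGT", "ACGTTCGT", 4, 1.0)` reports two perfect 4 bp windows, pairing positions 0 and 4 of the first sequence with position 0 of the second. Because each matrix cell is visited a constant number of times after the first window on a diagonal, the work grows roughly with N×M.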
FamilyRelationsII (FRII) is a graphical viewer for sequence analyses. It is a C++ reimplementation of the original Java/Jython FamilyRelations. FRII uses the cross-platform FLTK windowing toolkit to present a common interface on Windows, Mac OS X, and Linux/X11.
Cartwheel is a server-side system that presents a uniform interface for job coordination and execution. It has several components, including a Web interface through which users can establish analyses; a remote interface for programs to retrieve analysis data; and a batch job queueing system based on a method of parallel processing known as a Linda tuple space. All of the components are built on top of a PostgreSQL database. Cartwheel is written in Python and provides libraries in Python, Java, and C++ for remote access.
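For readers unfamiliar with the Linda model, the following in-memory Python sketch shows the basic idea: producers place tuples into a shared space, and workers remove tuples that match a pattern. It is an illustration only and is not Cartwheel's implementation, which is built on PostgreSQL; the class name `TupleSpace`, the method names, and the use of `None` as a wildcard are all assumptions made for this example.

```python
import threading


class TupleSpace:
    """Minimal in-process Linda-style tuple space (hypothetical sketch)."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Linda 'out': add a tuple to the space and wake any waiting workers."""
        with self._cond:
            self._tuples.append(tuple(tup))
            self._cond.notify_all()

    @staticmethod
    def _matches(pattern, tup):
        # None in the pattern acts as a wildcard for that field.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def take(self, pattern, timeout=None):
        """Linda 'in': remove and return the first tuple matching pattern.

        Blocks until a match appears; returns None if a wait times out.
        """
        with self._cond:
            while True:
                for idx, tup in enumerate(self._tuples):
                    if self._matches(pattern, tup):
                        return self._tuples.pop(idx)
                if not self._cond.wait(timeout):
                    return None


# A queue of analysis jobs: any idle worker may claim the next matching job.
space = TupleSpace()
space.out(("job", 1, "paircomp"))
space.out(("job", 2, "blast"))
claimed = space.take(("job", None, "blast"))
```

The appeal of this design for batch queueing is that workers need no knowledge of each other or of the submitter; they only agree on the shape of the tuples.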
A technical history of the design decisions made in the implementation of these tools has been published online (article "Python in Bioinformatics").
FRII is freely available for download in a binary distribution for Mac OS X and Windows; FRII will also run under most UNIX distributions but must be compiled individually. The Center for Computational Regulatory Genomics at Caltech maintains a public Cartwheel server. A tutorial for FRII is available online, and an example homework assignment for an undergraduate class is also available. The source code for paircomp, FRII and Cartwheel and all their components is freely available under the L/GPL through the above Web sites. Paircomp, FamilyRelationsII and Cartwheel are Copyright © 2001–2004 the California Institute of Technology.
Results and discussion
Several different classes of algorithms are available for comparing two genomic sequences. Windowed comparisons do an exhaustive comparison of two sequences with a fixed-width window, and record strict (ungapped) sequence identity within that window [2, 12]. Local alignment algorithms such as BLAST search for common "words" of DNA in a pair of sequences and build a gapped alignment around these words. These gapped alignments are often scored by overall length, so that e.g. a 500 bp match at 90% is ranked higher than a 200 bp match at 90%. Global alignment algorithms such as AVID and LAGAN seek to build a start-to-end gapped alignment of syntenic genomic regions. Windowed comparisons and local alignment algorithms usually search for matches in both forward and reverse complement directions, while global alignment algorithms typically try to build an alignment without inversions. Implementations of all three strategies for genomic comparisons have been publicly available for some time: Dotter and seqcomp implement windowed comparisons [2, 12]; PipMaker uses a local alignment algorithm, blastz [16, 17]; and Vista relies on a global alignment generated by AVID. All three comparison strategies have been successful at finding regulatory regions [1, 19].
Of the three general classes of algorithms, we chose to use windowed comparisons in our search for cis-regulatory modules. Our decision was based on several criteria. First, these comparisons report matches based solely on strict sequence identity with no gapping, unlike alignment algorithms. This is a good ab initio requirement when comparing sequences in search of cis-regulatory modules, whose evolution is still poorly understood; in particular, binding sites could be sensitive to indels, which are somewhat elided in gapped alignments. Moreover, we had no a priori expectation for the locations, sizes, or degrees of similarity of conserved regions, necessitating an exhaustive search strategy that did not bias scores based on the length or position of matches. And, finally, from a user-interface perspective the parameters for paircomp – windowsize and threshold – are simple and intuitively linked to the results. Our success with this basic approach means that we have not needed to move to alternative algorithms.
Paircomp is a standalone program that executes windowed comparisons (see Methods). It searches for matches in both the forward and reverse complement directions. Paircomp runs within Cartwheel; the results are stored in a database and communicated to FRII.
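Searching the reverse complement direction amounts to comparing one sequence against the reverse complement of the other and then translating window coordinates back to the forward strand. The sketch below shows that bookkeeping under our own naming; `reverse_complement` and `forward_start` are hypothetical helpers, not functions exported by paircomp.

```python
# Translation table for complementing DNA; other characters (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def forward_start(rc_start, windowsize, seq_len):
    """Map the start of a window found in the reverse-complemented sequence
    back to the start of the same window on the forward strand."""
    return seq_len - rc_start - windowsize


seq = "AACGT"
rc = reverse_complement(seq)
# The 3 bp window starting at position 0 of rc covers forward positions 2..4.
fwd = forward_start(0, 3, len(seq))
```

With this mapping, a single forward-strand comparison routine can serve both orientations: it is run once on the original pair and once with one sequence reverse complemented.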
Cartwheel is a Web site through which analyses are executed and from which analyses are loaded into FamilyRelationsII. It provides an easy-to-use interface through which to establish a set of analyses on a pair of sequences. Cartwheel also allows the annotation of sequences with a variety of features; features can be uploaded to Cartwheel in the standard GFF format. A tutorial for setting up pairwise comparisons is available online.
FamilyRelationsII, or FRII, displays comparisons of BAC-sized genomic sequences (~100 kb in length). It is a graphical program that runs directly from a desktop and loads data from the Cartwheel server. From within FRII, users can zoom in to look more closely at features, alter scoring thresholds for comparisons, change the color of features, and turn on or off the display of specific analyses. FRII can also display closeup views of comparisons and alignments against DNA and protein sequence.
Figure 2 shows a dot-plot view of an expanded region of the comparison, centered on the first exon of the α-otx transcript. In addition to the exon itself, there is patchy conservation throughout the region; again, this is typical of many comparisons. This view also shows that all of the elements are collinear on scales of ~10 kb.
In both the dot-plot and pairwise mapping view, multiple comparisons done with different parameters can be displayed in different colors. The threshold for the matches shown can be adjusted until the desired view is obtained, and sequence can be exported from any of the views via a pop-up menu.
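One way to picture interactive threshold adjustment is as a filter over matches that were computed once at a permissive threshold, so that tightening the threshold requires no new comparison. The Python sketch below illustrates this idea only; the `Match` record and `filter_matches` function are hypothetical and do not describe FRII's internal data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One windowed match (hypothetical record layout)."""
    top_start: int
    bot_start: int
    windowsize: int
    n_identical: int

    @property
    def identity(self):
        # Fraction of identical bases within the window.
        return self.n_identical / self.windowsize


def filter_matches(matches, threshold):
    """Keep only the matches at or above the requested identity threshold."""
    return [m for m in matches if m.identity >= threshold]


# Matches stored from a 20 bp comparison run at an 80% threshold.
stored = [
    Match(100, 250, 20, 16),   # 80% identity
    Match(101, 251, 20, 18),   # 90% identity
    Match(400, 900, 20, 20),   # 100% identity
]
# Raising the displayed threshold to 90% hides the weakest match.
shown = filter_matches(stored, 0.9)
```

Note that this only works in one direction: matches below the threshold used for the original comparison were never stored, so lowering the threshold further would require rerunning the analysis.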
FRII also searches for motifs specified in IUPAC notation, in which, for example, W represents A or T. This feature allows users to search for matches to known "consensus" binding sites for transcription factors. Searches are either stored on the Cartwheel server and displayed as individual features on FRII views, or executed directly in FRII. One particularly convenient feature is the ability to ask for motifs that have mismatches in up to 5 positions; this lets users search for weaker matches to known consensus sequences.
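A motif search of this kind can be sketched in a few lines of Python. The code below is an illustration under our own assumptions and is not FRII's implementation: the function name `find_motif` and its return format are hypothetical, and only the forward strand is scanned. Each IUPAC letter is expanded to the set of bases it allows, and a position counts as a mismatch when the sequence base falls outside that set.

```python
# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_motif(seq, motif, max_mismatches=0):
    """Return (start, n_mismatches) for each forward-strand site that matches
    the IUPAC motif with at most max_mismatches mismatched positions."""
    seq = seq.upper()
    allowed = [IUPAC[code] for code in motif.upper()]
    hits = []
    for start in range(len(seq) - len(allowed) + 1):
        mismatches = 0
        for offset, bases in enumerate(allowed):
            if seq[start + offset] not in bases:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches; abandon this site
        else:
            # Loop finished without exceeding the mismatch budget.
            hits.append((start, mismatches))
    return hits


# Search for a GATA-like consensus, WGATAR, allowing no mismatches.
exact = find_motif("TTGATAAGGATTA", "WGATAR")
```

Raising `max_mismatches` admits progressively weaker sites, at the cost of more background hits, so in practice the setting would be chosen with the motif length in mind.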
FRII displays a variety of analyses. In addition to paircomp windowed comparisons, FRII displays and manipulates Vista-style comparisons, BLAST and blastz comparisons, BLAST database searches, cDNA and protein comparisons, and the results of several different gene finders (genscan, geneid, and hmmgene [20–22]). All of these analyses may be executed directly on the Cartwheel server, excepting only Vista comparisons using the (default) AVID alignment program. The data for Vista comparisons must be uploaded from the results returned by the Vista Web site; however, Vista-style comparisons with the LAGAN global alignment tool are executed directly on Cartwheel.
Discovering and analyzing regulatory regions
We set up two to three paircomp analyses at the following windowsizes and thresholds: 10 bp/90%; 20 bp/80%; 50 bp/60%.
We match the cDNA or protein of interest against both regions, to determine where the coding regions lie.
We also compare the RefSeq database from NCBI against both regions, to find other genes in the region.
We load these analyses into FRII and zoom in to a view that includes as much intergenic sequence around the gene as is possible without also including other genes. We then adjust the thresholds on the 20 bp and 50 bp analyses until we obtain a roughly collinear pattern of conserved blocks. Typical values for these thresholds are 80–100% for a 20 bp windowed comparison, and 60–80% for a 50 bp windowed comparison.
We use the closeup view to extract the conserved blocks, and design PCR primers to isolate all of the contiguous blocks of conserved sequence. We then individually subclone or fuse them into a GFP reporter construct together with a basal promoter. These constructs are then introduced into the sea urchin by microinjection and analyzed for appropriate spatiotemporal expression.
In our experience, we have always been able to identify the relevant enhancer elements using this procedure. A similar procedure in which putatively negative elements are fused with a ubiquitous driver of expression often identifies necessary repressive elements. One caveat of these procedures is that for some genes, e.g. transcription factors, there are often many conserved regions that appear to do nothing. These may be regulatory regions that affect expression at times or in places that are not under consideration, or could be other genomic features not relevant to gene regulation.
Paircomp, FamilyRelationsII, and Cartwheel are an effective, easy-to-use set of tools for analyzing conservation in BAC-sized genomic regions. Over 100 people are currently using them, and they have been effective in finding regulatory regions in a variety of organisms. In this paper we have described the tools and provided an introduction for biologists who wish to use them.
Availability and requirements
See Implementation, above, for information on server-side software.
Project name: FamilyRelationsII
Project home page: http://family.caltech.edu/
Operating systems: Mac OS X, Windows NT/XP, UNIX/Linux (X Windows)
Programming language: C++
No restrictions placed on use.
Tristan De Buysscher and Madeleine Price, under the supervision of Dr. Barbara Wold, developed the original seqcomp and contributed to FamilyRelations. Ramon Cendejas and Kevin Berney aided in the development of features and helped exercise the Cartwheel server; a complete list of contributors to FamilyRelationsII and Cartwheel can be found on the Cartwheel Web site, under Developers. We especially thank Carolina Livi, Pei-Yun Lee, Dr. Ellen Rothenberg and Dr. Erich Schwarz for extensive user-interface testing over the years. Dr. Ellen Rothenberg and Dr. Erich Schwarz both contributed significantly to discussions of new features; in addition, Sagar Damle, Tracy Teal and Dr. Erich Schwarz gave many helpful comments on this paper. We also thank two anonymous reviewers for their comments. CTB is supported by National Institutes of Health Grant GM61005, and the Beckman Institute Center for Computational Regulatory Genomics is supported by National Institutes of Health Grant RR15044.
1. Cooper GM, Sidow A: Genomic regulatory regions: insights from comparative sequence analysis. Curr Opin Genet Dev 2003, 13(6):604–610. 10.1016/j.gde.2003.10.001
2. Brown CT, Rust AG, Clarke PJ, Pan Z, Schilstra MJ, De Buysscher T, Griffin G, Wold BJ, Cameron RA, Davidson EH, Bolouri H: New computational approaches for analysis of cis-regulatory networks. Dev Biol 2002, 246(1):86–102. 10.1006/dbio.2002.0619
3. Yuh CH, Brown CT, Livi CB, Rowen L, Clarke PJ, Davidson EH: Patchy interspecific sequence similarities efficiently identify positive cis-regulatory elements in the sea urchin. Dev Biol 2002, 246(1):148–161. 10.1006/dbio.2002.0618
4. Kirouac M, Sternberg PW: cis-Regulatory control of three cell fate-specific genes in vulval organogenesis of Caenorhabditis elegans and C. briggsae. Dev Biol 2003, 257(1):85–103. 10.1016/S0012-1606(03)00032-0
5. Romano LA, Wray GA: Conservation of Endo16 expression in sea urchins despite evolutionary divergence in both cis and trans-acting components of transcriptional regulation. Development 2003, 130(17):4187–4199. 10.1242/dev.00611
6. Leung TH, Hoffmann A, Baltimore D: One nucleotide in a kappaB site can determine cofactor specificity for NF-kappaB dimers. Cell 2004, 118(4):453–464. 10.1016/j.cell.2004.08.007
7. Revilla-i-Domingo R, Minokawa T, Davidson EH: R11: a cis-regulatory node of the sea urchin embryo gene network that controls early expression of SpDelta in micromeres. Dev Biol 2004, 274(2):438–451. 10.1016/j.ydbio.2004.07.008
8. PyZine online magazine [http://www.pyzine.com/Issue006/index.html]
9. FamilyRelations Web site [http://family.caltech.edu/]
10. Caltech Cartwheel server, "Woodward" [http://woodward.caltech.edu/canal/]
11. FamilyRelations tutorial [http://family.caltech.edu/tutorial/]
12. Sonnhammer EL, Durbin R: A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 1995, 167(1–2):GC1–10. 10.1016/0378-1119(95)00714-8
13. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410. 10.1006/jmbi.1990.9999
14. Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res 2003, 13(1):97–102. 10.1101/gr.789803
15. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 2003, 13(4):721–731. 10.1101/gr.926603
16. Elnitski L, Riemer C, Petrykowska H, Florea L, Schwartz S, Miller W, Hardison R: PipTools: a computational toolkit to annotate and analyze pairwise comparisons of genomic sequences. Genomics 2002, 80(6):681–690. 10.1006/geno.2002.7018
17. Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker – a web server for aligning two genomic DNA sequences. Genome Res 2000, 10(4):577–586. 10.1101/gr.10.4.577
18. Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I: VISTA: computational tools for comparative genomics. Nucleic Acids Res 2004, 32(Web Server):W273–279.
19. Yi TM, Walsh K, Schimmel P: Rabbit muscle creatine kinase: genomic cloning, sequencing, and analysis of upstream sequences important for expression in myocytes. Nucleic Acids Res 1991, 19(11):3027–3033.
20. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268(1):78–94. 10.1006/jmbi.1997.0951
21. Parra G, Blanco E, Guigo R: GeneID in Drosophila. Genome Res 2000, 10(4):511–515. 10.1101/gr.10.4.511
22. Krogh A: Using database matches with HMMGene for automated gene detection in Drosophila. Genome Res 2000, 10(4):523–528. 10.1101/gr.10.4.523
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.