Non-coding sequence retrieval system for comparative genomic analysis of gene regulatory elements
© Doh et al; licensee BioMed Central Ltd. 2007
Received: 14 September 2006
Accepted: 15 March 2007
Published: 15 March 2007
Completion of the human genome sequence, along with those of other species, allows for greater understanding of the biochemical mechanisms and processes that govern healthy as well as diseased states. The large size of the genome sequences has made them difficult to study using traditional methods. Many studies focus on protein-coding sequences; however, not much is known about the function of non-coding regions of the genome. It has been demonstrated that parts of the non-coding region play a critical role as gene regulatory elements. Enhancers that regulate transcription processes have been found in intergenic regions. Furthermore, regulatory elements found in non-coding regions are observed to be highly conserved across different species. However, the analysis of these regulatory elements is not as straightforward as it may first seem. A centralized resource that allows quick and easy retrieval of non-coding sequences from multiple species and is capable of handling multi-gene queries is critical for the analysis of non-coding sequences. Here we describe the development of a web-based non-coding sequence retrieval system.
This paper presents a Non-Coding Sequences Retrieval System (NCSRS). The NCSRS is a web-based bioinformatics tool that performs fast and convenient retrieval of non-coding and coding sequences from multiple species related to a specific gene or set of genes. This tool compiles resources from multiple sources into one convenient, easy-to-use web-based interface. No software installation is necessary; the user needs only internet access to use this tool.
The unique features of this tool will be very helpful for those studying gene regulatory elements that exist in non-coding regions. The web-based application can be accessed on the internet at: http://cell.rutgers.edu/ncsrs/.
While annotation efforts and gene prediction methods have begun the process of identifying protein-coding genes, robust high-throughput methods for detecting functional non-protein-coding elements remain elusive. Only about 2 percent of the human or mouse genome consists of protein-coding DNA sequences [2, 3]. The vast majority of the genome consists of non-coding sequences (NCS). It has been shown that gene regulatory elements (GREs) reside in the NCS [4, 5]. GREs have been broadly placed into two major functional groups: promoters and enhancers. Promoters are sequences that direct the precise locations of transcription start sites. Enhancers, repressors, silencers, and similar elements are sequences that bind gene regulatory proteins and influence the transcriptional activity of a gene. GREs can be located upstream, downstream, or even internal to the target gene. GREs therefore act as switches that turn gene expression on or off and as modulators that increase or decrease expression. Traditionally, NCS have not received as much attention from investigators as protein-coding sequences, and GREs are generally poorly defined, mostly only as sequence motifs. Research is now focusing increasingly on non-coding sequences and specifically on the search for NCS with regulatory functions. Identifying functional NCS and understanding their mechanism of operation will shed new light on the regulation of transcription, DNA replication, chromosome pairing, and chromosome condensation [2, 6]. For a full understanding and eventual control of biological function, not only must the genes involved in a particular function be identified, but the regulatory elements that trigger and control the biochemical pathways determining each gene's expression must also be well understood. However, searching for functional GREs within the NCS that comprise roughly 98% of the genome is not a simple task. The size and scope of this search bring with them many intellectual and experimental challenges that span computational biology and comparative functional genomics.
There are two commonly used methods for identifying functional GREs. The first uses gene expression analysis and the second uses comparative genomics. DNA microarray gene expression profiling is capable of evaluating thousands of genes across various experimental conditions. Bioinformatics approaches are used to cluster genes that show similar patterns of expression. Once genes with similar patterns of expression are identified, their upstream sequences are searched for over-represented or conserved sequence motifs [8, 9]. The sequence alignment algorithms employed by comparative genomic methods are powerful in identifying conserved sequences in non-coding regions located in and around homologous genes (genes that share a common ancestry) from diverse species. Homologous genes usually have the same function and may also have similar regulatory elements that control this function. Functional regions (protein-coding regions along with regulatory regions) experience selective pressure against change and therefore show a higher level of sequence conservation across a wide range of species than non-functional regions. Ideally, non-functional sequences diverge through evolutionary drift while selective pressure leaves functional regions highly similar [1, 10–17]. DNA sequence comparison of human and mouse orthologous genes has indicated that conserved NCS are significantly enriched in regulatory sequence regions [4, 18, 19]. Once putative regulatory elements have been identified by sequence comparison, confirmation of biological function depends upon experimental assays. Transcriptional regulatory regions have been identified in genes from human, mouse, Fugu fish, Caenorhabditis elegans, Drosophila, and yeast [5, 6, 20–24]. The power of comparative genomics analysis is enhanced significantly when genomic sequences are available from a number of related species that have diverged sufficiently, because this reduces the chance of conservation among non-functional elements. Comparing multiple genomes helps to determine which conserved elements are more likely to be functional [5, 21, 25, 26].
Whether the focus is on genes with similar expression patterns or those expected to have the same function, both analysis methods require the retrieval of NCS for the identification of functional sequence elements. Currently, these sequences are retrieved manually from a wide range of sources. No efficient method is available for retrieving NCS quickly and systematically from a single source. To facilitate the analysis of NCS for functional regulatory elements, we present here a web-based non-coding sequences retrieval system (NCSRS) that performs the automated retrieval of non-coding sequences from the genomes of different species. A previously developed application for retrieving NCS, Retrieval of Regulative Regions (RRE), parses annotation and homology data from NCBI to identify NCS. This parser requires local installation as well as local copies of the desired genomes and annotation files. A web-based version is also available, but it offers only a few genomes, and RRE uses annotation data from NCBI only. The NCSRS requires no installation or local management of genome sequence databases and uses annotation information from both NCBI and Ensembl. Currently, NCSRS has 15 genomes (containing over 85 gigabytes of DNA sequence data) with sufficient annotations available for NCS retrieval.
Annotation and sequence information
Table 1. Statistics of gene annotation for Ensembl and NCBI. Sources: Ensembl as of 12/06/06; NCBI taxonomy browser and Unigene as of 12/06/06. Recoverable column header: total Unigene clusters.
Table 2. Statistics of homology prediction for human, mouse, and chicken, from Ensembl (mart 41) and Homologene (release 53). Column headers: baseline species for homology search; species of homologous genes; total number of genes.
Table 3. Analysis of known and predicted genes for chicken, rat, mouse, and human from Ensembl Mart v.41. Recoverable row label: RefSeq known to Ensembl.
The RefSeq annotation data (RefGene.txt, RefLink.txt) are obtained for each genome from the FTP site of the Human Genome Browser at UCSC, as are some of the genome sequences. The remaining genome sequences, along with the Ensembl data (gene, transcript, and structure files for each genome), are obtained from Ensembl's FTP site. Both sets of gene annotations are downloaded and then processed to serve as the basis for building a genome map that contains the location of each gene and all its exons. Ensembl includes 5' and 3' UTRs along with coding exons when counting the total number of exons. 5' and 3' UTRs are also listed on the Ensembl website as exons and are included in the downloadable annotation files as exons. Following this convention, NCSRS also includes 5' and 3' UTRs in its list of exons. Where multiple transcripts are available, the transcript with the greatest number of exons is used. Because not all of the transcript information is used to define sequences in the intron region, an exon from an unused transcript variant may not be shared with the utilized transcript. In this case, the unshared exon would be "hidden" from the annotation and returned as part of the non-coding sequence. However, "hidden" exons are not the norm, so this does not arise in the majority of cases. Furthermore, the effect is limited to the intragenic region and does not affect the upstream and downstream sequences.
In this paper, non-coding regions refer to all sequences that are not exons. The definition of exons follows the convention used by Ensembl, which defines exons as the set of 5' UTRs, coding exons, and 3' UTRs. Ensembl includes 5' and 3' UTRs in its total exon counts both on its website and in the downloadable annotation files [29, 33]. While it is arguable that UTRs can be considered non-coding regions, the Ensembl convention is maintained for the sake of consistency and clarity.
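The transcript selection rule and the "hidden" exon case described above can be illustrated with a minimal sketch. It is not the NCSRS code (which is written in PHP and Perl); the function names, the tuple representation of exons, and the tie-breaking rule are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: an exon is (start, end), 1-based inclusive.
Exon = Tuple[int, int]


def pick_transcript(transcripts: Dict[str, List[Exon]]) -> str:
    """Return the ID of the transcript with the most exons.

    Ties are broken by taking the first ID in sorted order (an assumption;
    the paper does not say how ties are handled).
    """
    return max(sorted(transcripts), key=lambda tid: len(transcripts[tid]))


def hidden_exons(transcripts: Dict[str, List[Exon]], chosen: str) -> List[Exon]:
    """Exons of unused transcripts that overlap no exon of the chosen one."""
    kept = transcripts[chosen]
    hidden: Set[Exon] = set()
    for tid, exons in transcripts.items():
        if tid == chosen:
            continue
        for start, end in exons:
            overlaps = any(start <= k_end and k_start <= end for k_start, k_end in kept)
            if not overlaps:
                hidden.add((start, end))
    return sorted(hidden)


# Illustrative coordinates only, not taken from any real annotation.
transcripts = {
    "T1": [(100, 200), (400, 500), (800, 900)],
    "T2": [(100, 200), (600, 650)],
}
chosen = pick_transcript(transcripts)
print(chosen, hidden_exons(transcripts, chosen))
```

In this toy case the exon at 600-650 belongs only to the unused transcript, so it would be returned as part of an intron of the chosen transcript.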
The RefSeq annotation information is the basis for the homologous gene prediction system called Homologene. This system is implemented in the NCSRS by using the single "homologene.data" file, which lists the sets of homologous genes for all genes annotated by RefSeq. Ensembl has its own homology prediction method, whose output is organized in a set of multiple files. The Ensembl annotation information is the basis for its homologous gene prediction; therefore, when Ensembl's annotation information is used, Ensembl's homology prediction results are used as well. Simple analysis of the data generated by the two homology prediction methods shows that both systems have good coverage for the genomes most commonly used for research (see Table 2). As the annotations improve, the predictions will also improve in accuracy and percent coverage.
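As a rough illustration of how a flat list of homology groups can be used to look up the homologs of a gene, the sketch below groups rows by their group identifier. The column layout (group ID, taxonomy ID, gene ID, gene symbol, protein GI, protein accession, tab-separated) is an assumption about the "homologene.data" format and should be checked against the file actually downloaded; the rows are placeholders, not real records, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows, not real HomoloGene records. Assumed tab-separated columns:
# group ID, taxonomy ID, gene ID, gene symbol, protein GI, protein accession.
SAMPLE = (
    "1\t9606\t1001\tGENEA\t111\tNP_000001.1\n"
    "1\t10090\t2001\tGenea\t222\tNP_000002.1\n"
    "1\t9031\t3001\tGENEA\t333\tNP_000003.1\n"
    "2\t9606\t1002\tGENEB\t444\tNP_000004.1\n"
)


def load_groups(handle):
    """Map each homology group ID to a list of (taxonomy ID, gene ID, symbol)."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        group_id, tax_id, gene_id, symbol = row[:4]
        groups[group_id].append((int(tax_id), int(gene_id), symbol))
    return groups


def homologs(groups, gene_id):
    """Return the other members of the group that contains gene_id."""
    for members in groups.values():
        if any(g == gene_id for _, g, _ in members):
            return [m for m in members if m[1] != gene_id]
    return []


groups = load_groups(io.StringIO(SAMPLE))
print(homologs(groups, 1001))
```

A production version would index genes by ID once instead of scanning every group per query; the linear scan is kept here for brevity.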
Using the annotation information, all annotated genes are sorted by chromosomal position, and the start and stop locations of all coding regions are identified. Once the coding regions are known, the locations of the non-coding regions can be determined from the genomic position information, or mapping, of a specified gene and its flanking genes. The non-coding sequences are simply those located between adjacent identified exons. The information for each gene's non-coding regions is then written to a new set of files that are used by the NCSRS. This non-coding region annotation defines the end points of each intergenic and intragenic region.
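The mapping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the NCSRS implementation: coordinates are taken to be 1-based and inclusive, all genes are assumed to lie on one chromosome, and the function names and the handling of overlapping or nested genes are choices made here for the example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive (an assumption)


def introns(exons: List[Interval]) -> List[Interval]:
    """Intragenic non-coding regions: gaps between adjacent exons of one gene."""
    ordered = sorted(exons)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


def intergenic(genes: Dict[str, List[Interval]]) -> List[Tuple[str, str, Interval]]:
    """Intergenic regions between flanking genes on one chromosome.

    Genes are sorted by position. The furthest gene end seen so far is
    tracked so that overlapping or nested genes do not create false gaps.
    """
    spans = sorted(
        (min(s for s, _ in ex), max(e for _, e in ex), name)
        for name, ex in genes.items()
    )
    if not spans:
        return []
    regions = []
    furthest_end, furthest_name = spans[0][1], spans[0][2]
    for start, end, name in spans[1:]:
        if start > furthest_end + 1:
            regions.append((furthest_name, name, (furthest_end + 1, start - 1)))
        if end > furthest_end:
            furthest_end, furthest_name = end, name
    return regions


# Illustrative coordinates only.
genes = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(1000, 1100), (1500, 1600)],
}
print(introns(genes["geneA"]))
print(intergenic(genes))
```

Each intergenic region is reported together with its two flanking genes, so the same record can serve as the downstream region of one gene and the upstream region of the next.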
Retrieval of non-coding sequences
Both annotation systems, RefSeq and Ensembl, are works in progress, and with subsequent releases the annotations will improve in scope and accuracy. New genome assemblies will also continue to be released for new and existing genomes. As the sequence and annotation information that serves as the basis for this system is refined, the sequences generated by this system will improve. Therefore, it is critical that the genomes and annotation information are kept up to date. The NCSRS will be updated automatically on a weekly basis to ensure that the most recent information is always available.
Hardware and software
The NCSRS uses a single computer that acts as both the server and the database. There is also a development computer, which is used for updating, designing new applications, and troubleshooting. The main server uses dual Intel® Xeon® processors at 3.0 GHz, with 4 GB of RAM and 500 GB of hard disk space, and runs Apache 2.2 as its web server. The scripts and programs used by NCSRS for building and accessing the databases are written predominantly in PHP and Perl.
We have developed a web-based sequence retrieval system that quickly and easily extracts non-coding sequences associated with a specific user-defined gene set from a single genome or from multiple genomes. The NCSRS efficiently delivers non-coding sequences for specified genes or gene sets through a user-friendly interface at a single site. This system eliminates the need to manually sift through genome sequences and look for annotation information from multiple sources. This will help eliminate human errors as well as increase throughput for those investigating gene regulatory elements. The system also allows the user to specify the gene or set of genes for retrieval while maintaining a simple user interface, enabling users to apply their expert knowledge without spending a lot of time learning how to use the system. Another option that is important for those seeking to elucidate functional NCS is masking. Repetitive sequence elements found in the genome can cause sequence alignment algorithms to predict conserved elements that do not have gene regulatory function. For this reason, an option is available to mask repeated sequences so that alignment algorithms can ignore them.
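One common way to realize such a masking option is sketched below. It assumes that the input sequence is soft-masked, meaning repeats are written in lowercase, which is a convention used by many genome distributions; the paper does not state how NCSRS represents or applies masking, so the function names and approach here are illustrative assumptions.

```python
def hard_mask(sequence: str) -> str:
    """Replace soft-masked (lowercase) bases with N; leave other bases as-is."""
    return "".join("N" if base.islower() else base for base in sequence)


def masked_fraction(sequence: str) -> float:
    """Fraction of bases that are lowercase, i.e. flagged as repeats."""
    if not sequence:
        return 0.0
    return sum(base.islower() for base in sequence) / len(sequence)


# Illustrative sequence only: the six lowercase bases stand for a repeat.
seq = "ACGTacgtacGGCC"
print(hard_mask(seq))
print(round(masked_fraction(seq), 3))
```

Replacing repeats with N keeps all coordinates unchanged, so conserved regions found in the masked sequence map directly back to the unmasked one.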
This system has great flexibility in its potential applications. An important and unique feature is that a user who intends to apply this tool to a comparative genomics approach can obtain the sequences for multiple species by simply selecting the "pull all orthologs" option. Once the sequences are returned, a multiple sequence analysis can be performed for each set of homologous gene sequences. The system's ability to return sequences from multiple genomes in one run greatly increases its efficiency and speed. Furthermore, it has been shown that increasing the number of genomes used in alignment analysis increases the signal-to-noise ratio and, if specific genomes are selected carefully, increases the likelihood of correctly predicting functionality. If microarray data are the basis for analysis, the system's ability to handle multiple genes in a single query allows the user to input multiple genes with similar expression patterns at one time and retrieve the desired sequences. It is even possible to combine the two approaches and obtain sequences for all homologous genes of a set of genes with similar expression profiles. This allows the system to be used by those who seek to learn more about genome-wide networks through their analysis.
NCSRS combines a number of available genomic resources (a total of over 85 GB of sequence) and applies them to the specific task of identifying and retrieving non-coding sequences in an up-to-date web-based application that is easy to use and requires no maintenance by the user. The unique features of this tool will be very helpful for those studying gene regulatory elements that exist in non-coding regions. Future work will include integrating NCSRS with a program that analyzes non-coding sequences using a multi-sequence alignment algorithm and identifies highly conserved regions [10–15]. This pipeline will be designed to rank potential gene regulatory elements according to their likelihood of functionality, using sequence motif information for known transcription factor binding sites. Ultimately, these two tools will be seamlessly integrated with the NCSRS into a gene regulatory element finder pipeline. This will allow future experimental work and resources to focus verification on those conserved elements that have a theoretical basis for regulatory function, and will therefore increase overall efficiency. The proposed pipeline will serve as the initial selection process for targeting experimental verification.
Availability and requirements
Project name: NCSRS
Project home page: http://cell.rutgers.edu/ncsrs/
Operating system(s): Platform independent
Programming language: Perl, PHP
Any restrictions to use by non-academics: None
We wish to thank the many "NCSRS" users for their constructive comments. This work was supported in part by the Charles and Johanna Busch Fund.
- Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC: Cross-species sequence comparisons: a review of methods and available resources. Genome Res 2003, 13(1):1–12. 10.1101/gr.222003
- Makalowski W: The human genome structure and organization. Acta Biochim Pol 2001, 48(3):587–598.
- The ENCODE Project Consortium: The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004, 306:636–640. 10.1126/science.1105136
- Nobrega MA, Ovcharenko I, Afzal V, Rubin EM: Scanning human gene deserts for long-range enhancers. Science 2003, 302(5644):413. 10.1126/science.1088328
- Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G: Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol 2005, 3(1):e7. 10.1371/journal.pbio.0030007
- Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 2003, 423(6937):241–254. 10.1038/nature01644
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, 
Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kasprzyk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860–921. 10.1038/35057062
- Bucher P: Regulatory elements and expression profiles. Curr Opin Struct Biol 1999, 9(3):400–407. 10.1016/S0959-440X(99)80054-2
- Roth FP, Hughes JD, Estep PW, Church GM: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 1998, 16(10):939–945. 10.1038/nbt1098-939
- Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D: Ultraconserved elements in the human genome. Science 2004, 304(5675):1321–1325. 10.1126/science.1098119
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 2003, 13(4):721–731. 10.1101/gr.926603
- Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I: VISTA: computational tools for comparative genomics. Nucleic Acids Res 2004, 32(Web Server issue):W273–9. 10.1093/nar/gkh458
- Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, Dubchak I: VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 2000, 16(11):1046–1047. 10.1093/bioinformatics/16.11.1046
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res 2003, 13(1):103–107. 10.1101/gr.809403
- Stojanovic N, Florea L, Riemer C, Gumucio D, Slightom J, Goodman M, Miller W, Hardison R: Comparison of five methods for finding conserved sequences in multiple alignments of gene regulatory regions. Nucleic Acids Res 1999, 27(19):3899–3910. 10.1093/nar/27.19.3899
- Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green ED, Hardison RC, Miller W: MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res 2003, 31(13):3518–3524. 10.1093/nar/gkg579
- Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker--a web server for aligning two genomic DNA sequences. Genome Res 2000, 10(4):577–586. 10.1101/gr.10.4.577
- Levy S, Hannenhalli S, Workman C: Enrichment of regulatory signals in conserved non-coding genomic sequence. Bioinformatics 2001, 17(10):871–877. 10.1093/bioinformatics/17.10.871
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 2005, 434(7031):338–345. 10.1038/nature03441
- Bergman CM, Kreitman M: Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res 2001, 11(8):1335–1345. 10.1101/gr.178701
- Cliften PF, Hillier LW, Fulton L, Graves T, Miner T, Gish WR, Waterston RH, Johnston M: Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res 2001, 11(7):1175–1186. 10.1101/gr.182901
- Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 2004, 428(6983):617–624. 10.1038/nature02424
- Loots GG, Locksley RM, Blankespoor CM, Wang ZE, Miller W, Rubin EM, Frazer KA: Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 2000, 288(5463):136–140. 10.1126/science.288.5463.136
- Thacker C, Marra MA, Jones A, Baillie DL, Rose AM: Functional genomics in Caenorhabditis elegans: An approach involving comparisons of sequences from related nematodes. Genome Res 1999, 9(4):348–359.
- Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 2003, 299(5611):1391–1394. 10.1126/science.1081331
- Dubchak I, Brudno M, Loots GG, Pachter L, Mayor C, Rubin EM, Frazer KA: Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res 2000, 10(9):1304–1306. 10.1101/gr.142200
- Lazzarato F, Franceschinis G, Botta M, Cordero F, Calogero RA: RRE: a tool for the extraction of non-coding regions surrounding annotated genes from genomic datasets. Bioinformatics 2004, 20(16):2848–2850. 10.1093/bioinformatics/bth287
- Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001, 29(1):137–140. 10.1093/nar/29.1.137
- Curwen V, Eyras E, Clarke L, Mongin E, Searle SMJ, Clamp M: The Ensembl automatic gene annotation system. Genome Res 2004, 14(5):942–950. 10.1101/gr.1858004
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The Human Genome Browser at UCSC. Genome Res 2002, 12(6):996–1006. 10.1101/gr.229102. Article published online before print in May 2002
- UCSC FTP[ftp://hgdownload.cse.ucsc.edu/goldenPath/]
- Ensembl FTP[ftp://ftp.ensembl.org/pub/current_mart/]
- Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Bane J, Graf S, Haide S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, Meidl P, Overduin B, Parker A, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Severin J, Slater G, Smedley D, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wood M, Cox T, Curwen V, Durbin R, Fernandez-Suarez XM, Flicek P, Kasprzyk A, Proctor G, Searle S, Smith J, Ureta-Vidal A, Birney E: Ensembl 2007. Nucleic Acids Research 2006, 00(Database issue):D1-D8.
- Homologene FTP[ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current]
- Entrez Gene[http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene]
- UCSC Genome Browser[http://genome.ucsc.edu/]
- Myers CL, Robson D, Wible A, Hibbs MA, Chiriac C, Thessfeld CL, Dolinski K, Troyanskaya OG: Discovery of biological networks from diverse functional genomic data. Genome Biol 2005, 6(13):R114. 10.1186/gb-2005-6-13-r114
- Hallikas O, Palin K, Sinjushina N, Rautiainen R, Partanen J, Ukkonen E, Taipale J: Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity. Cell 2006, 124(1):47–59. 10.1016/j.cell.2005.10.042
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.