Kerfuffle: a web tool for multi-species gene colocalization analysis
© Aboukhalil et al.; licensee BioMed Central Ltd. 2013
Received: 17 August 2012
Accepted: 11 January 2013
Published: 17 January 2013
The evolutionary pressures that underlie the large-scale functional organization of the genome are not well understood in eukaryotes. Recent evidence suggests that functionally similar genes may colocalize (cluster) in the eukaryotic genome, pointing to a role for chromatin-level gene regulation in shaping the physical distribution of coordinated genes. However, few of the bioinformatic tools currently available allow for a systematic study of gene colocalization across several evolutionarily distant species. Furthermore, most tools require the user to input manually curated lists of gene position information, DNA sequences or gene homology relations between species. With the growing number of sequenced genomes, there is a need for new comparative genomics tools that can address the analysis of multi-species gene colocalization.
Kerfuffle is a web tool designed to help discover, visualize, and quantify the physical organization of genomes by identifying significant gene colocalization and conservation across the assembled genomes of available species (currently up to 47, from humans to worms). Kerfuffle only requires the user to specify a list of human genes and the names of other species of interest. Without further input from the user, the software queries the e!Ensembl BioMart server to obtain positional information and discovers homology relations in all genes and species specified. Using this information, Kerfuffle performs a multi-species clustering analysis, presents downloadable lists of clustered genes, performs Monte Carlo statistical significance calculations, estimates how conserved gene clusters are across species, plots histograms and interactive graphs, allows users to save their queries, and generates a downloadable visualization of the clusters using the Circos software. These analyses may be used to further explore the functional roles of gene clusters by interrogating the enriched molecular pathways associated with each cluster.
Advances in genomics and DNA sequencing technology have fueled growing interest in the large-scale physical and functional organization of chromosomes. Several studies have shown that the genomes of many disparate species may have chromosome regions containing clusters of functionally related genes [1–3]. It is well known that operons, ubiquitous in prokaryotes, allow multiple genes to be transcribed at once into a polycistronic mRNA. The extent to which genes colocalize in eukaryotes and the extent to which gene clusters are conserved across species are largely unknown. In eukaryotes, operons are rare; however, there is evidence to suggest that genes within the same biological pathway may be clustered more than expected from random rearrangements, possibly because of co-regulation. For example, the Hox genes are tandem duplicate genes organized into clusters, playing a pivotal role in defining the body plan of organisms. Further, the order of the genes within a Hox cluster defines the sequence in which these genes are expressed. While these examples rely on positional clustering, other mechanisms may also lead to gene clusters; for example, clustered genes could be coregulated because (1) their promoters are bound by the same transcription factors; (2) they share regulatory elements (e.g. bidirectional promoters); or (3) the transcription of a gene can change local chromatin accessibility for its neighbors.
Between evolutionarily distinct species, we expect to find random genomic rearrangements that do not conserve gene clusters, unless colocalization is beneficial to the organism. It is possible that colocalization is acted upon by natural selection, conserving the gene clusters across large evolutionary time-scales, although it remains unclear what structural, regulatory, and functional factors are responsible for the colocalization [1, 7, 8]. A recent study found that the genomes of a number of different species were arranged into neighborhoods of functionally related genes that were not necessarily orthologous.
If functionally related genes cluster for mechanistic purposes, then those clustered genes would be expected to colocalize in other species as well. Quantifying this conservation is difficult because orthologous genes have accumulated random changes since their last speciation event, often carry different names in other species, often have different, gained or lost functional attributes, and may not all be required in other species. To address these related research questions, there is a pressing need for computational tools that can overcome the onerous task of querying the growing list of available assembled genomes, analyzing the spatial ordering of genes across many species to identify whether they form clusters, and assessing the conservation of these clusters across other species.
One such clustering tool is C-Hunter, a command-line program that clusters genes by genome position and GO categories. While C-Hunter is capable of identifying clusters of genes within a species, it does not incorporate an analysis of conserved clustering across multiple species, and is not intended as a tool to query a general set of genes that do not share GO terms. Other tools, such as CGCV, allow for clustering across many species but require the user to input DNA sequences instead of gene names; the web tool then performs BLAST searches to find orthologous genes, which adds significant overhead to run-time.
There are related tools that identify regions of synteny, such as EnsemblCompara, i-ADHoRe, MCScanX, Cinteny, OrthoClusterDB and Syntenator. These tools are useful for identifying homologous genomic regions between species, but do not include an automated approach for evaluating gene clustering and its conservation across species. Because it provides a web API and pre-computed homology results, we chose to use EnsemblCompara in Kerfuffle.
To the best of our knowledge, there are currently no tools available for efficiently verifying whether a given list of genes from one species forms clusters and whether these clusters are conserved across other species. To this end, we have developed and implemented a publicly available web tool, Kerfuffle, that efficiently computes various summary statistics of gene clustering across most genomes in the e!Ensembl database, compares the significance of clustering with shuffled null models, and graphically displays the results. The main advantage of Kerfuffle is that it only requires a user to specify human gene names and species of interest. In addition, orthologous gene searches are automated using pre-computed homology from e!Ensembl servers, a relative statistic is used to quantify cluster conservation, and the online program permits server-side saving of results for each registered user for later analysis. Furthermore, Kerfuffle can generate a visualization of the clusters using the Circos software. This comprehensive platform is an important step in furthering our understanding of genome organization and its evolution.
Because Kerfuffle is a web application, it requires no compilation or installation, is accessible from anywhere, is supported by most web browsers (tested on Firefox, Chrome, Internet Explorer and Safari), and can be improved on our end without requiring the user to download an update.
If the user inputs the names of human genes and wishes to analyze the genomes of several other species, Kerfuffle will find the corresponding homologs in each of those genomes on its own, with no further user input. Kerfuffle is also flexible in the way it accepts user input. The user may enter genes in a textbox one by one or, alternatively, upload a file that contains a list of genes separated by line breaks. However, we recognize that it is difficult for users to keep track of the many file formats they use. Thus, if the uploaded file is a comma- or tab-delimited file with multiple columns, Kerfuffle will ask the user to specify the column in which the gene names are found.
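The column-selection behaviour described above can be illustrated with a minimal Python sketch. This is not Kerfuffle's own code: the function name `read_gene_names`, the delimiter guess based on the first line, and the zero-based `column` argument are all assumptions made for illustration.

```python
import csv
import io


def read_gene_names(text, column=0):
    """Extract gene names from uploaded text (hypothetical helper).

    Assumption: a tab in the first line means tab-delimited input;
    otherwise the text is parsed as comma-delimited, which also covers
    a plain one-gene-per-line list.
    """
    lines = text.splitlines()
    first = lines[0] if lines else ""
    delimiter = "\t" if "\t" in first else ","
    genes = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        # Skip short rows and empty cells rather than failing.
        if len(row) > column and row[column].strip():
            genes.append(row[column].strip())
    return genes


if __name__ == "__main__":
    print(read_gene_names("GRIN1\nGRIN2A\n"))
    print(read_gene_names("1\tGABRA1\n2\tGABRB2\n", column=1))
```

In this sketch a header row, if present, is returned like any other row, so a caller would need to drop it explicitly.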
To aid in recurring analyses, we recommend that users create a free Kerfuffle account so that their results and queried genes are saved in our databases. On the back-end, the query results obtained from e!Ensembl are temporarily stored in text files and purged every week, unless users decide to save their results to their account, in which case the results remain on the server until the users delete them.
For demonstration purposes, Kerfuffle provides an option to load a known list of 477 functionally related synapse genes. The source of these genes may be found in the AmiGO database. Colocalization of these genes is supported by a recent publication that demonstrated clustering of genes associated with GABAergic circuit assembly in the cerebellar cortex of young mice. All analyses and images presented here were generated from these genes. The list of genes is available on the Kerfuffle webpage for analysis, located under the “Upload gene list” button and labeled “Example: Synapse genes.”
Once the analysis is launched, the local machine will concurrently send asynchronous requests to our web server, one request for each species. For each request, our web server will connect to the e!Ensembl BioMart database and download necessary genomic information, after which a colocalization analysis is performed.
Once the analysis is completed, the results are displayed in the “Consecutive gene pairs” (Figure 1A) and “Cluster size distribution” (Figure 1B) graphs. We define the consecutive gene pair distance distribution (Figure 1A) as the histogram of distances between consecutive genes, for all the input genes. The null distribution in Figure 1A is determined through the following procedure. First, we randomly distribute the genes uploaded to Kerfuffle across the genome. Second, the distance distribution is found for those genes. Finally, this process is iterated many times, with the number of iterations depending on the user’s minimum target p-value, and all random distance distributions are then averaged. This average approximates the null model, in which the listed genes do not cluster more than expected at random. The p-values are determined through a similar process. At each random iteration, the counts for each distance in the distribution from the real genomic positions (from the uploaded genes) are compared with the counts from the randomly distributed genes. The p-values are calculated as the frequency with which the random permutation counts surpass the real data counts.
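The Monte Carlo procedure described above can be sketched in Python as follows. This is an illustrative reading rather than Kerfuffle's implementation, and it rests on several assumptions: distance is measured as the number of intervening genes between consecutive input genes on the same chromosome, the genome is modelled by a hypothetical `chrom_sizes` mapping of chromosome name to gene count, and a random count that equals the real count is treated as surpassing it (a conservative choice). All function and parameter names are made up for this sketch.

```python
import random
from collections import Counter


def consecutive_distances(positions):
    """Histogram of intervening-gene counts between consecutive genes.

    positions: iterable of (chromosome, order_index) pairs.
    """
    by_chrom = {}
    for chrom, idx in positions:
        by_chrom.setdefault(chrom, []).append(idx)
    counts = Counter()
    for idxs in by_chrom.values():
        idxs.sort()
        for a, b in zip(idxs, idxs[1:]):
            counts[b - a - 1] += 1
    return counts


def monte_carlo(observed, chrom_sizes, n_iter=1000, max_dist=10, seed=0):
    """Return (real histogram, averaged null histogram, p-values)."""
    rng = random.Random(seed)
    # Every gene slot in the modelled genome.
    universe = [(c, i) for c, n in chrom_sizes.items() for i in range(n)]
    real = consecutive_distances(observed)
    null_sum, exceed = Counter(), Counter()
    for _ in range(n_iter):
        # Place the same number of genes at random positions.
        rand = consecutive_distances(rng.sample(universe, len(observed)))
        for d in range(max_dist + 1):
            null_sum[d] += rand[d]
            if rand[d] >= real[d]:  # assumption: ties count as surpassing
                exceed[d] += 1
    null_mean = {d: null_sum[d] / n_iter for d in range(max_dist + 1)}
    p_values = {d: exceed[d] / n_iter for d in range(max_dist + 1)}
    return real, null_mean, p_values


if __name__ == "__main__":
    genes = [("chr1", i) for i in range(5)]  # five adjacent genes
    real, null_mean, p = monte_carlo(genes, {"chr1": 1000, "chr2": 1000}, n_iter=200)
    print(real[0], null_mean[0], p[0])
```

With a fixed number of iterations, the smallest non-zero p-value such a scheme can report is 1/`n_iter`, which is consistent with the iteration count depending on the minimum target p-value.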
Below the p-value table, the cluster size distribution plot (Figure 1B) is displayed with an option to download the data in a text file. Clusters used in the cluster size distribution plot and the conservation analysis below are defined as in Figure 2. Namely, a set of ordered genes $G_1, G_2, \ldots, G_n$ is said to colocalize, i.e. form a cluster, if the total number of intervening genes is less than or equal to the specified parameter $d$. Mathematically, if $x(G_i)$ is the positional order of gene $G_i$, then we require that $x(G_n) - x(G_1) - n + 1 \le d$. In Kerfuffle, the default value of the parameter $d$ is 2.
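As a worked illustration of this criterion, the Python sketch below checks the inequality for a set of positional orders and groups genes on one chromosome into clusters. The left-to-right greedy grouping is an assumption made for illustration; the text defines when a set of genes forms a cluster but not how Kerfuffle enumerates clusters, and the function names are hypothetical.

```python
def intervening(orders):
    """Number of non-listed genes between the first and last listed gene.

    orders: sorted positional orders x(G_1) < ... < x(G_n).
    """
    return orders[-1] - orders[0] - len(orders) + 1


def is_cluster(orders, d=2):
    """True if at least two genes satisfy x(G_n) - x(G_1) - n + 1 <= d."""
    orders = sorted(orders)
    return len(orders) >= 2 and intervening(orders) <= d


def greedy_clusters(orders, d=2):
    """Group genes left to right, closing a cluster once adding the next
    gene would exceed d intervening genes (assumed strategy)."""
    clusters, current = [], []
    for x in sorted(set(orders)):
        if current and intervening(current + [x]) > d:
            if len(current) >= 2:  # keep clusters of size 2 or larger
                clusters.append(current)
            current = []
        current.append(x)
    if len(current) >= 2:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    print(is_cluster([3, 5, 6]))                    # one intervening gene
    print(greedy_clusters([1, 2, 4, 10, 11, 30]))
```

For example, genes at orders 1, 2 and 4 have 4 - 1 - 3 + 1 = 1 intervening gene and so form a cluster at the default d = 2, whereas adding the gene at order 10 would give 6 intervening genes.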
In the “Summary” tab, the user can launch an analysis comparing human clusters to those of other species, as well as plot distance histograms between consecutive gene pairs for all species selected (Figure 1C). To find the orthologs of the input human genes, Kerfuffle fetches data from e!Ensembl’s EnsemblCompara resource. Once the comparison has been run, the results can be found under the “Compare” tab. Further, to quantify cluster conservation between chosen species, we define a “conservation score”, Score(S, T), which conveys the similarity of clusters among species.
In the definition of Score(S, T), $S_i$ and $T_i$ refer to the sets of genes in cluster $i$ in species $S$ and $T$, respectively, and $N_x$ refers to the total number of clusters in species $x$. All clusters were chosen as size 2 or larger. The intersection between $S_i$ and $T_j$ is defined as the set of genes common to cluster $i$ in species $S$ and cluster $j$ in species $T$. We normalize the size of the intersection by the size of the cluster $S_i$, hence calculating the score relative to species $S$. The inner sum increases if the genes found in cluster $j$ of species $T$ are also found in cluster $i$ of species $S$, while the outer sum averages those scores over each cluster $i$ in species $S$. Thus, Score(S, T) is a statistic that increases as the same clusters are observed and remain intact in species $T$ relative to $S$. Our default setting for this analysis sets $S$ = Human.
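A minimal Python sketch of one reading of this description is given below. It assumes the score takes the form $\frac{1}{N_S}\sum_{i=1}^{N_S}\sum_{j=1}^{N_T} |S_i \cap T_j| / |S_i|$, which is inferred from the verbal description (an inner sum over clusters of $T$, normalization by $|S_i|$, and an outer average over clusters of $S$) and should be checked against the published equation. It also assumes that genes in species $T$ are labelled by their orthologs in $S$, so that set intersection is meaningful; the function name is hypothetical.

```python
def conservation_score(clusters_s, clusters_t):
    """Score(S, T) under the assumed form described in the lead-in.

    clusters_s, clusters_t: lists of gene-name collections, with genes in
    T labelled by their ortholog in S (assumption).
    """
    if not clusters_s:
        return 0.0
    total = 0.0
    for s in clusters_s:
        s = set(s)
        # Inner sum: genes of cluster s recovered in each cluster of T,
        # normalized by the size of s.
        total += sum(len(s & set(t)) for t in clusters_t) / len(s)
    # Outer sum: average over the clusters of S.
    return total / len(clusters_s)


if __name__ == "__main__":
    human = [{"A", "B", "C"}, {"D", "E"}]
    mouse = [{"A", "B"}, {"D", "E", "F"}]
    print(conservation_score(human, mouse))
    print(conservation_score(human, human))
```

In the usage example, the first human cluster contributes 2/3 and the second contributes 1, so the score relative to human is their average, 5/6; comparing a species with itself gives 1 when its clusters do not overlap.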
Under the table of clustered genes in each species tab (currently only human, mouse, rat and Drosophila), a Circos image may be generated for visualization of gene clusters. These images may be downloaded and saved in either .png or .svg format. Figure 3 shows a Circos plot of the clustered genes from our online example of synapse genes. The sizes of the clusters are represented by a green histogram located at the genomic start and stop of the clustered genes, pointing radially inwards. We have attempted to optimize the output for visualization of gene names (pointing radially outwards) while keeping all genes on the image. However, some gene names may run off the Circos image because it is impossible to know a priori how many genes will sit next to each other in any given colocalization analysis.
Finally, Kerfuffle makes it easy to compare cluster conservation across all species. The user may click on the “Summary” tab and run a comparative analysis by clicking “Go.” Under the newly generated tab “Compare”, the Score(S,T) statistic is calculated and displayed demonstrating the degree of conservation of the clusters in each species relative to humans. Below these results, the consecutive distance distribution for each species is simultaneously displayed (Figure 1C).
Once the user becomes familiar with the analyses performed, the default parameters (discussed previously) may be changed at the bottom of the website below the “Analyze” button. Further, the uploaded genes may be dynamically removed by clicking on the “x” next to each gene, or new genes may be added to the list, after which the analysis will need to be re-run to reflect the changes. The gene list may also be reset without disturbing the analyses performed, and the whole set of analyses may be reset using the “New” button at the top of the page. If the user has any difficulties, we have created a comprehensive Frequently Asked Questions (FAQ) section which covers the capabilities of the whole website and answers many of the more common questions. Finally, once a user account is created, all results can be saved online for later analysis.
Future developments will include increased investigative options, such as changing the type of genes investigated (currently set to protein-coding only) and incorporation of other gene name schemes (such as RefSeq IDs). Currently, our default conservation score sets human as the relative species, i.e. for all calculations, S = Homo sapiens. While our implementation is human-centric, the end-user may wish to use another relative species, and we expect to add this option in future implementations. Other features, such as the identification of common clusters across species, will be added, along with functionality to improve our pathway investigations. Currently, we link to the multi-gene pathway search on the KEGG website. In later developments, our webpage will determine the similar pathways and display them along with the clusters. Finally, Circos uses specific karyotype files, which define the genome of the species investigated. Our current Circos implementation uses only the default available files: human, mouse, rat, and Drosophila. In future developments, we will generate karyotype files for any available genome, making visualization of clusters available for a much wider range of species.
This software is intended for the end-user to quickly and efficiently obtain genomic organizational information about a set of user-defined functionally related genes. The software discovers clusters in each species selected and determines the significance of those clusters while allowing for interactive and visual exploration of genomic structure.
Since speciation is expected to lead to differences in genomic organization if that organization is random, we investigate relative cluster conservation between species using a measure we define as Score(S,T). Once the analysis is performed, the user may compare species and determine the degree of cluster conservation.
The optional parameters make the investigations customizable and allow the user to optimize run-time. An account may also be created where all investigations may be saved for later use. Further, our website has an extensive FAQ section which may help guide the user.
□ Project name: Kerfuffle
□ Project home page: http://atwallab.org/kerfuffle
□ Operating system(s): Platform independent
□ Other requirements: Web browser (supported browsers: Firefox, Chrome, Internet Explorer or Safari)
□ License: GNU GPL
□ Any restrictions to use by non-academics: no license needed.
Centered at Cold Spring Harbor Laboratory, RA is a graduate student in the Watson School of Biological Sciences, BF is a postdoc in the Atwal Lab, and GA is the Atwal lab head.
PHP: PHP Hypertext Preprocessor
HTML: Hypertext Markup Language
BLAST: Basic Local Alignment Search Tool
KEGG: Kyoto Encyclopedia of Genes and Genomes
We gratefully thank Peter Andrews for technical assistance with the web server and Ying Cai for useful comments on the web tool.
This work was supported by the Starr Fellowship and the CSHL Watson School of Biological Sciences NIH training grant [5T32GM065094 to R.A.], the Simons Foundation [to B.F. and G.S.A.] and the Starr Cancer Consortium [I3-A123 to G.S.A.].
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.