Kerfuffle: a web tool for multi-species gene colocalization analysis

Background: The evolutionary pressures that underlie the large-scale functional organization of the genome are not well understood in eukaryotes. Recent evidence suggests that functionally similar genes may colocalize (cluster) in the eukaryotic genome, suggesting the role of chromatin-level gene regulation in shaping the physical distribution of coordinated genes. However, few of the bioinformatic tools currently available allow for a systematic study of gene colocalization across several evolutionarily distant species. Furthermore, most tools require the user to input manually curated lists of gene position information, DNA sequence or gene homology relations between species. With the growing number of sequenced genomes, there is a need to provide new comparative genomics tools that can address the analysis of multi-species gene colocalization.
Results: Kerfuffle is a web tool designed to help discover, visualize, and quantify the physical organization of genomes by identifying significant gene colocalization and conservation across the assembled genomes of available species (currently up to 47, from humans to worms). Kerfuffle only requires the user to specify a list of human genes and the names of other species of interest. Without further input from the user, the software queries the e!Ensembl BioMart server to obtain positional information and discovers homology relations in all genes and species specified. Using this information, Kerfuffle performs a multi-species clustering analysis, presents downloadable lists of clustered genes, performs Monte Carlo statistical significance calculations, estimates how conserved gene clusters are across species, plots histograms and interactive graphs, allows users to save their queries, and generates a downloadable visualization of the clusters using the Circos software. These analyses may be used to further explore the functional roles of gene clusters by interrogating the enriched molecular pathways associated with each cluster.
Conclusions: Kerfuffle is a new, easy-to-use and publicly available tool to aid our understanding of functional genomics and comparative genomics. This software allows for flexibility and quick investigations of a user-defined set of genes, and the results may be saved online for further analysis. Kerfuffle is freely available at http://atwallab.org/kerfuffle, is implemented in JavaScript (using jQuery and jsCharts libraries) and PHP 5.2, runs on an Apache server, and stores data in flat files and an SQLite database.


Background
Advances in genomics and DNA sequencing technology have fueled growing interest in the large-scale physical and functional organization of chromosomes. Several studies have shown that genomes of many disparate species may have chromosome regions containing clusters of functionally related genes [1][2][3]. It is well known that operons, ubiquitous in prokaryotes, allow multiple genes to be transcribed at once into a polycistronic mRNA. The extent to which genes colocalize in eukaryotes and the extent to which gene clusters are conserved across species are largely unknown. In eukaryotes, operons are rare [4]; however, there is evidence to suggest that genes within the same biological pathway may be clustered more so than expected by random rearrangements, possibly because of co-regulation [5]. For example, the Hox genes are tandem duplicate genes organized into clusters, playing a pivotal role in defining the body plan of organisms. Further, the order of the genes within a Hox cluster defines the sequence in which these genes are expressed [6]. While these examples rely on positional clustering, other mechanisms may also lead to gene clusters; for example, clustered genes could be coregulated because (1) their promoters are bound by the same transcription factors; (2) they share regulatory elements (e.g. bidirectional promoters); and (3) the transcription of a gene can change local chromatin accessibility for its neighbors.
Between evolutionarily distinct species, we expect to find random genomic rearrangements that do not conserve gene clusters, unless colocalization is beneficial to the organism. It is possible that colocalization is acted upon by natural selection, conserving the gene clusters across large evolutionary time-scales, although it remains unclear what structural, regulatory, and functional factors are responsible for the colocalization [1,7,8]. A recent study found that the genomes of a number of different species are arranged into neighborhoods of functionally related genes that are not necessarily orthologous [9].
If functionally related genes cluster for mechanistic purposes, then those clustered genes would be expected to colocalize in other species as well. Quantifying this conservation is difficult: orthologous genes have accumulated random changes since their last respective speciation event, often carry different names in other species, often have altered, expanded or reduced functional attributes, and may not all be required in other species. To address these related research questions, there is a pressing need for computational tools that can overcome the onerous task of querying the growing list of available assembled genomes, analyzing the spatial ordering of genes across many species to identify whether they form clusters, and assessing the conservation of these clusters across other species.
One such clustering tool is C-Hunter, a command-line program that clusters genes by genome position and GO categories [10]. While C-Hunter is capable of identifying clusters of genes within a species, it does not incorporate an analysis of conserved clustering across multiple species, and is not intended as a tool to query a general set of genes that don't share GO terms. Other tools, such as CGCV, allow for clustering across many species but require the user to input DNA sequences instead of gene names [11]; subsequently, the web tool performs BLAST searches to find orthologous genes, which adds significant overhead to run-time.
There are related tools which identify regions of synteny, such as EnsemblCompara [12], i-ADHoRe [13], MCScanX [14], Cinteny [15], OrthoClusterDB [16] and Syntenator [17]. These tools are useful for identifying homologous genomic regions between species, but do not include an automated approach for evaluating gene clustering and its conservation across species. We chose to use EnsemblCompara in Kerfuffle because it provides a web API and pre-computed homology results.
To the best of our knowledge, there are currently no tools available for efficiently verifying whether a given list of genes from one species forms clusters and whether these clusters are conserved across other species. To this end, we have developed and implemented a publicly available web tool, Kerfuffle, that efficiently computes various summary statistics of gene clustering across most genomes in the e!Ensembl database [18], compares significance of clustering with shuffled null models, and graphically displays the results. The main advantage of Kerfuffle is that it only requires a user to specify human gene names and species of interest. In addition, orthologous gene searches are automated utilizing pre-computed homology from e!Ensembl servers, a relative statistic is used to quantify cluster conservation, and the online program permits server-side saving of results for each registered user for later analysis. Furthermore, Kerfuffle can generate a visualization of the clusters using the Circos software [19]. This comprehensive platform is an important step in furthering our understanding of genome organization and its evolution.

Implementation
Because Kerfuffle is a web application, it requires no compiling or installation, is accessible from anywhere, is supported by most web browsers (tested on Firefox, Chrome, Internet Explorer and Safari), and allows us to improve the software on our end without requiring the user to download an update.
The back-end runs on PHP 5.2 on an Apache server. The front-end was built primarily in HTML and JavaScript, with two JavaScript libraries to enhance user experience: jsCharts for plotting graphs and jQuery for Ajax effects. With the list of genes input by the user, Kerfuffle queries the e!Ensembl BioMart database and retrieves the gene name, ID, chromosomal position and homology information for every selected gene in each selected species. To improve speed, all the species are queried at once in parallel, and the results are displayed on screen as soon as they are processed by Kerfuffle.
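The query pattern described above (one request per species, results shown as each finishes) can be illustrated with a minimal Python sketch. This is not Kerfuffle's code, which issues Ajax requests from the browser; `fetch_species` and `query_all` are hypothetical names, and the fetch function is a stand-in that performs no network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_species(species, genes):
    """Hypothetical stand-in for one per-species query; returns (species, gene count)."""
    # A real implementation would contact a remote annotation service here.
    return species, len(genes)


def query_all(species_list, genes, fetch=fetch_species):
    """Run one query per species concurrently and yield each result as it completes."""
    with ThreadPoolExecutor(max_workers=max(1, len(species_list))) as pool:
        futures = [pool.submit(fetch, sp, genes) for sp in species_list]
        for future in as_completed(futures):
            yield future.result()


if __name__ == "__main__":
    for species, n_genes in query_all(["human", "mouse", "zebrafish"], ["GENE1", "GENE2"]):
        print(species, n_genes)
```

Because results are yielded in completion order rather than submission order, a slow species does not delay the display of the others.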
With the gene positions, Kerfuffle groups the genes into clusters based on their colocality using a clustering algorithm written in C++ to improve performance. Once done, Kerfuffle displays the set of gene clusters and uses the jsCharts library to draw the relevant plots and histograms (Figure 1). The plot of Figure 1A is interactive: hovering on a point reveals its x and y coordinates, and clicking on the point reveals all the gene pairs that are separated by a distance x. Furthermore, p-values are calculated to estimate statistical significance. To ensure that requests for small p-values do not slow down the webpage, the calculations are performed in the background and the results appear in a table once they are done.
If the user inputs the names of human genes and wishes to analyze the genomes of several other species, Kerfuffle will find the corresponding homologs in each of those genomes on its own, with no further user input. Kerfuffle is also flexible in the way it accepts user input. The user may enter genes in a textbox one by one or, alternatively, upload a file that contains a list of genes separated by line breaks. However, we recognize that it is difficult for users to keep track of the many file formats they use. Thus, if the uploaded file is a comma- or tab-delimited file with multiple columns, Kerfuffle will ask the user to specify the column in which the gene names are found. To aid in recurring analyses, we recommend that users create a free Kerfuffle account so that their results and queried genes are saved in our databases. On the back-end, the query results obtained from e!Ensembl are temporarily stored in text files and purged every week, unless users decide to save their results to their account, in which case the results remain on the server until the users delete them.
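As an illustration of the input handling described above, the following Python sketch extracts gene names from a chosen column of a comma- or tab-delimited file. It is a minimal sketch, not Kerfuffle's PHP implementation; the function name `read_gene_column` and the rule for guessing the delimiter (a tab on the first line means tab-delimited) are assumptions.

```python
import csv
import io


def read_gene_column(text, column=0):
    """Return the non-empty entries of one column from comma- or tab-delimited text."""
    first_line = text.splitlines()[0] if text.strip() else ""
    # Assumed heuristic: a tab on the first line means the file is tab-delimited.
    delimiter = "\t" if "\t" in first_line else ","
    genes = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) > column and row[column].strip():
            genes.append(row[column].strip())
    return genes
```

A single-column file with one gene per line is handled by the same function with `column=0`. Header rows are not detected, so a header entry would be returned along with the gene names.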

Results and discussion
For demonstrative purposes, Kerfuffle provides an option to input a known list of 477 functionally related synapse genes. The source of these genes may be found in the AmiGO database [20]. Colocalization of these genes is supported by a recent publication that demonstrated clustering of genes associated with GABAergic circuit assembly in the cerebellar cortex of young mice [21]. All analyses and images presented here were generated using these genes. The list of genes is available on the Kerfuffle webpage for analysis, located under the "Upload gene list" button labeled "Example: Synapse genes."

Multi-species colocalization analyses
Kerfuffle allows the user to specify a list of gene names and select up to 47 species (the current maximum) for which the analysis will be performed. The gene identifiers supported by Kerfuffle are e!Ensembl gene names or WikiGene names, and the orthologs of the input human genes are obtained from EnsemblCompara, which uses maximum likelihood phylogenetic trees for homology prediction. Default analysis parameters are provided, although customization is allowed; parameters include: (1) d, the maximum number of total intervening genes (or gaps) allowed in a cluster.
Once the analysis is launched, the local machine concurrently sends asynchronous requests to our web server, one request for each species. For each request, our web server connects to the e!Ensembl BioMart database and downloads the necessary genomic information, after which a colocalization analysis is performed. Once the analysis is completed, the results are displayed in the "Consecutive gene pairs" (Figure 1A) and "Cluster size distribution" (Figure 1B) graphs. We define the consecutive gene pair distance distribution (Figure 1A) as the histogram of distances between consecutive genes, for all the genes input.
The null distribution in Figure 1A is determined through the following procedure. First, we randomly distribute the genes uploaded to Kerfuffle across the genome. Second, the distance distribution is found for those genes. Finally, this process is iterated many times, with the number of iterations dependent on the user's minimum target p-value, and all random distance distributions are averaged. This average approximates the null model, i.e. that the list of genes does not cluster more than expected at random. The p-values are determined through a similar process. Upon each random iteration, the counts for each distance in the distribution from the real genomic positions (from uploaded genes) and from the randomly distributed genes are compared. The p-values are calculated as the frequency with which the random permutation counts surpass the real data counts.
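The Monte Carlo procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Kerfuffle's implementation: the genome is treated as a single chromosome of `n_total` gene slots, gene positions are positional orders (integers), the function names are hypothetical, and a random count that equals the observed count is treated as meeting it (`>=`), which is a conservative choice made here.

```python
import random
from collections import Counter


def distance_counts(positions):
    """Histogram of gaps between consecutive genes, given their positional orders."""
    ordered = sorted(positions)
    return Counter(b - a for a, b in zip(ordered, ordered[1:]))


def monte_carlo(positions, n_total, n_iter=1000, seed=0):
    """Return (mean null counts per distance, p-value per observed distance)."""
    rng = random.Random(seed)
    observed = distance_counts(positions)
    null_sum = Counter()
    reached = Counter()
    for _ in range(n_iter):
        # Place the same number of genes at random slots (assumed single chromosome).
        shuffled = distance_counts(rng.sample(range(n_total), len(positions)))
        null_sum.update(shuffled)
        for dist, count in observed.items():
            if shuffled[dist] >= count:
                reached[dist] += 1
    null_mean = {dist: total / n_iter for dist, total in null_sum.items()}
    p_values = {dist: reached[dist] / n_iter for dist in observed}
    return null_mean, p_values
```

With `n_iter` iterations, the smallest nonzero p-value that can be resolved is `1 / n_iter`, which is why the number of iterations depends on the target p-value.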
Below the p-value table, the cluster size distribution plot (Figure 1B) is displayed with an option to download the data in a text file. Clusters used in the cluster size distribution plot and the conservation analysis below are defined as in Figure 2. Namely, a set of ordered genes $G_1, G_2, \ldots, G_n$ is said to colocalize, i.e. form a cluster, if the number of total intervening genes is less than or equal to the specified parameter $d$. Mathematically, if $x(G_i)$ is the positional order of gene $G_i$, then we require that $x(G_n) - x(G_1) - n + 1 \le d$. In Kerfuffle, the default value of the parameter $d$ is 2.
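The cluster criterion above can be applied with a simple left-to-right scan. The Python sketch below is one possible greedy reading of that criterion, not the C++ algorithm used by Kerfuffle (which is not described here); the function name `find_clusters`, the restriction to a single chromosome, and the `min_size` parameter are assumptions.

```python
def find_clusters(orders, d=2, min_size=2):
    """Greedily group positional orders into clusters with at most d intervening genes."""
    clusters, current = [], []
    for x in sorted(orders):
        # If x joined the current cluster it would have n = len(current) + 1 genes,
        # and x(G_n) - x(G_1) - n + 1 intervening genes.
        if current and x - current[0] - (len(current) + 1) + 1 <= d:
            current.append(x)
        else:
            if len(current) >= min_size:
                clusters.append(current)
            current = [x]
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    print(find_clusters([3, 4, 6, 20, 21, 30], d=2))
```

For the positional orders 3, 4, 6, 20, 21, 30 and d = 2, the genes at 3, 4 and 6 form one cluster (one intervening gene at position 5), those at 20 and 21 form a second, and the gene at 30 remains unclustered.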

Comparative analysis
In the "Summary" tab, the user can launch an analysis comparing human clusters to those of other species, as well as plot distance histograms between consecutive gene pairs for all species selected (Figure 1C). To find the orthologs of the input human genes, Kerfuffle fetches data from e!Ensembl's EnsemblCompara resource [12]. Once selected, the results can be found under the "Compare" tab. Further, to quantify cluster conservation between chosen species, we define a "conservation score" which conveys similarity of clusters among species.
To quantify the conservation of gene clusters in species T relative to those found in species S, we use a conservation score, Score(S, T), built from the following quantities. $S_i$ and $T_i$ refer to the sets of genes in cluster $i$ in species S and T, respectively, and $N_x$ refers to the total number of clusters in species $x$. All clusters were chosen as size 2 or larger. The intersection between $S_i$ and $T_j$ is defined as the set of common genes between cluster $i$ in species S and cluster $j$ in species T. We normalize the size of the intersection by the size of the cluster $S_i$, hence calculating the score relative to species S. The inner sum increases if the genes found in cluster $j$ of species T are also found in cluster $i$ of species S, while the outer sum averages those scores over each cluster $i$ in species S. Thus, Score(S, T) is a statistic which increases as the same clusters are observed and remain intact in species T relative to S. Our default setting for this analysis sets S = Human.
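One form of the score that is consistent with the description above, stated here as an assumption rather than as the authors' exact expression, is $\mathrm{Score}(S,T) = \frac{1}{N_S} \sum_{i=1}^{N_S} \sum_{j=1}^{N_T} \frac{|S_i \cap T_j|}{|S_i|}$. The Python sketch below computes this assumed form; the function name is hypothetical, and it further assumes that genes in species T have already been mapped to the identifiers of their orthologs in species S and that each gene belongs to at most one cluster per species.

```python
def conservation_score(clusters_s, clusters_t):
    """Assumed Score(S, T): mean over S clusters of the fraction of genes also clustered in T."""
    if not clusters_s:
        return 0.0
    total = 0.0
    for s_i in clusters_s:
        s_set = set(s_i)
        # Inner sum over T clusters of |S_i intersect T_j|, normalized by |S_i|.
        shared = sum(len(s_set & set(t_j)) for t_j in clusters_t)
        total += shared / len(s_set)
    return total / len(clusters_s)


if __name__ == "__main__":
    human = [["A", "B"], ["C", "D"]]
    other = [["A", "B"]]
    print(conservation_score(human, other))
```

Under this assumed form the score lies between 0 (no clustered gene of S is clustered in T) and 1 (every clustered gene of S is also clustered in T), and it is not symmetric in S and T.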

Using the software
Before the analysis is started, the user must select the species of the gene names used, upload, type, or paste the genes to the Kerfuffle server, select the species to investigate, and click "Analyze." While the analysis is performed, the program displays a "Summary" tab which reports, for each species, the number of genes found from the specified list, as well as invalid or missing gene names. After the completion of the analysis for each species, a tab is generated next to the summary tab that presents a report of discovered clusters. This report includes a plot of the distribution of distances between consecutive gene pairs (Figure 1A), a histogram of cluster sizes (Figure 1B), and a genomic visualization of the clusters (Figure 3). The graphs and the data points used to plot them are downloadable. At the top of each species tab, the genes in a cluster are presented along with a clickable link, which searches the KEGG pathway database [22] for commonality in the cluster. Further down the tab, the interactive analysis plots are generated: clicking on a data point for a given consecutive distance displays all the consecutive gene pairs separated by that distance, as well as the histogram counts over the graph. To assess the significance of the clustering, we overlay a plot of the expected distribution under random gene shuffling, i.e. if gene colocalization were random. Deviation from the null distribution is also quantified as a p-value table generated using a permutation test, as previously discussed. Note that the null distribution curve in the "Consecutive gene pairs" graph (Figure 1A) may sometimes appear to be linear, as opposed to the expected exponential. The analyses may be reset using the "New" button at the top of the page. If the user has any difficulties, we have created a comprehensive Frequently Asked Questions (FAQ) section which covers the capabilities of the whole website and answers many of the more common questions. Finally, once a user account is created, all results can be saved online for later analysis.

Performance
To evaluate the performance of our web tool, we ran several queries using gene sets of varying size and different numbers of species (Figure 4). We find that a typical query of ~500 genes in 5 species completes in ~25 seconds (or ~3 minutes when querying all 47 species). Overall, for a given number of species, the running time increases exponentially with the number of input genes (Figure 4). However, even a query of 5,000 genes (an unusually high number of genes) in all 47 species completes in less than 10 minutes. Hence, our server is well suited to ensure that queries are handled expediently. Although there is no limit on how many genes a user can input, we recommend that users do not exceed 10,000 genes in order to maintain a reasonable running time, as well as the usefulness of results (too many genes increases the likelihood of finding clusters).

Future developments
Future developments will include increased investigative options, such as changing the type of genes investigated (currently set to protein-coding only) and incorporation of other gene name schemes (such as RefSeq IDs). Currently, our default conservation score sets human as the relative species, i.e. for all calculations, S = Homo sapiens. While our implementation is human-centric, the end user may wish to use another relative species. We expect to add this option in future implementations.
Other features, such as the identification of common clusters in the species, will be added, while other functionality will be included to improve our pathway investigations. Currently, we link to the KEGG website, a multi-gene pathway search. In later developments, our webpage will determine the similar pathways and display them along with the clusters. Finally, Circos uses specific karyotype files, which define the genome of the species investigated. Our current Circos implementation, however, uses only the default available files: human, mouse, rat, and Drosophila. In future developments, we will generate karyotype files for any available genome, making visualization of clusters available for a much wider range of species.

Conclusions
This software is intended for the end-user to quickly and efficiently obtain genomic organizational information about a set of user-defined functionally related genes. The software discovers clusters in each species selected and determines the significance of those clusters while allowing for interactive and visual exploration of genomic structure.
Because speciation is expected to lead to differences in genomic organization if that organization is random, we investigate relative cluster conservation between species using a measure we define as Score(S, T). Once the analysis is performed, the user may compare species and determine the degree of cluster conservation.
The optional parameters make the investigations customizable and allow the user to optimize run-time. An account may also be created where all investigations may be saved for later use. Further, our website has an extensive FAQ section which may help guide the user.