The Drosophila Gene Expression Tool (DGET) for expression analyses

Background Next-generation sequencing technologies have greatly increased our ability to identify gene expression levels, including at specific developmental stages and in specific tissues. Gene expression data can help researchers understand the diverse functions of genes and gene networks, as well as help in the design of specific and efficient functional studies, such as by helping researchers choose the most appropriate tissue for a study of a group of genes, or conversely, by limiting a long list of gene candidates to the subset that are normally expressed at a given stage or in a given tissue. Results We report DGET, a Drosophila Gene Expression Tool (www.flyrnai.org/tools/dget/web/), which stores and facilitates search of RNA-Seq based expression profiles available from the modENCODE consortium and other public data sets. Using DGET, researchers are able to look up gene expression profiles, filter results based on threshold expression values, and compare expression data across different developmental stages, tissues and treatments. In addition, at DGET a researcher can analyze tissue or stage-specific enrichment for an inputted list of genes (e.g., ‘hits’ from a screen) and search for additional genes with similar expression patterns. We performed a number of analyses to demonstrate the quality and robustness of the resource. In particular, we show that evolutionary conserved genes expressed at high or moderate levels in both fly and human tend to be expressed in similar tissues. Using DGET, we compared whole tissue profile and sub-region/cell-type specific datasets and estimated a potential source of false positives in one dataset. We also demonstrated the usefulness of DGET for synexpression studies by querying genes with expression profile similar to the mesodermal master regulator Twist. Conclusion Altogether, DGET provides a flexible tool for expression data retrieval and analysis with short or long lists of Drosophila genes, which can help scientists to design stage- or tissue-specific in vivo studies and do other subsequent analyses.


Background
The application of next-generation sequence technologies to RNA analysis has opened the door to relatively rapid, large-scale analyses of gene expression. 'Standard' RNA-seq analysis, for example, can provide a snapshot of gene expression in specific cell types or tissues [17], and related technologies such as Ribo-seq [11] provide more refined views, such as a snapshot of what genes are actively transcribed in a given cell or tissue. For Drosophila, efforts such as the modENCODE project [1,2,7,12] have provided a baseline overview of expression under standard laboratory conditions for various cultured cell types, developmental stages, and tissues, as well as treatment conditions. Moreover, studies such as those investigating expression in subregions of the fly gut [6,10] are providing increasingly detailed views of the baseline expression levels of various genes in various tissues, cell types and sub-regions. Altogether, these RNA-seq data resources provide helpful starting points for analysis of other gene lists.
Resources such as FlyBase [5] make it possible to quickly view modENCODE data for a given gene and make these data generally accessible to the community. The value of these data to the community can be further increased by facilitating search of lists of genes. For example, for gene lists originating from whole-animal or cultured cell studies, or for studies based on a list of orthologs of genes from another species, it can be very helpful to get a picture of what stages or tissues normally express those genes, as that will help focus stageor tissue-specific in vivo studies and other subsequent analyses. We implemented DGET to help scientists retrieve modENCODE expression data in batch mode. DGET also hosts other relevant RNA-Seq datasets published in individual studies, such as profiles of specific sub-regions and cell types of the Drosophila gut [6,10]. Here, we describe DGET and perform a number of analyses to demonstrate the quality and robustness of the resource.

Expression pattern analysis
Human protein expression data were retrieved from proteinatlas.org and tissue-specific genes were selected using the file "ProteinAtlas_Normal_tissue_vs14." Proteins with high or medium expression levels with a reliability value of "supportive" were selected. Proteins expressed in a broad range of tissues (i.e., more than 5 tissues) were filtered out. DIOPT vs5 was used to map genes from human to Drosophila [9]. 'Ortholog pair rank' was added at recent DIOPT release 5.2.1 (http://www.flyrnai.org/DRSC-ORH.html# versions). Drosophila genes with high or moderate rank were selected. The high/moderate rank mapping include the gene pairs that are best score in either forward or reverse mapping (and DIOPT score >1) as well as gene pairs with DIOPT score >3 if not best score either way.

Implementation
DGET was implemented using php and JavaScript with MySQL database for data storage. It is hosted on a server provided by the Research IT Group (RITG) at Harvard Medical School. The MySQL database is also hosted on a server provided by RITG. Plotting of heatmaps for svg download is done in R using the gplot heatmap function. Website bar charts are drawn using the 3C.js plotting package. The php symfony framework scaffold is used to create DGET webpages and forms.

Results and discussion
Database content and features of the user interface (UI) The DGET database contains processed RNA-Seq data from the modENCODE consortium [1,2,7,12], as released by FlyBase [5], as well as other published datasets we obtained from specific studies [3,6,10]. The DGET UI has two tabs ( Fig. 1).
At the "Search Gene Expression" tab, users can enter a list of genes or choose one of the predefined gene classes from GLAD [8], e.g., kinases, then specify the datasets to be displayed. There are two search options, "look at expression" and "enrichment analysis." The results page for "look at expression" displays expression values in a heatmap format. At this results page, users have the option to download the relevant expression values; download the heatmap; and further filter the list by defining a cutoff, limit to specific dataset(s), or filtering out genes, for example with less than 1 RPKM value based on carcass and/or digestion system expression of 1 day adult. We used an RPKM cutoff of 1 because this is considered the cutoff for 'no or extremely low expression' at FlyBase. The results page for an enrichment analysis displays the distribution of genes at different expression levels using a bar graph and heatmap. The cutoff values for different levels are defined based on FlyBase guidelines (Fig. 1a).
Using the "Search Similar Genes" tab, users can enter a gene of interest and search for other genes with similar expression pattern based on Pearson correlation score. Users have the options to download the list of genes with similar expression patterns, a heatmap, and a normalized heatmap. Using the "Build Network" tab, users can enter a list of genes and build synexpression network based on the correlation of expression using the dataset and Pearson correlation cutoff specified by the user (Fig. 1b).

Expression pattern of Drosophila regulatory genes
When genome-scale screening is not practical to do, a common approach is to select a specific subset of genes to start with, such as a group of genes with related activities. The most frequently screened sub-sets of genes are important regulatory genes including genes that encode kinases, phosphatases, transcription factors, or canonical signal transduction pathways components. Our expectation is that these regulatory genes, which appear to be re-used in many contexts, will be expressed in many tissues. To test this, we analyzed the expression patterns of several Drosophila regulatory gene classes defined by GLAD, Gene List Annotation for Drosophila [8]. These included canonical signal transduction pathway genes, kinases, phosphatases, transcription factors, secreted proteins, and receptors. The percentages of expressed genes were calculated across all tissues profiled using a RPKM of 1 or above as a cutoff for expressed versus not expressed (Fig. 2). About 70-90% of the genes categorized as encoding canonical signal transduction pathway components, kinases, phosphatases, or transcription factors are expressed in each of the major tissue categories profiled, whereas only 30-60% of receptor or secreted proteins are detected in any given tissue.

Correlation of expression with confidence in an ortholog relationship
It is well established that the evolutionary conservation of proteins correlates with conservation at the level of biological and/or biochemical functions. Drosophila is a model organism of particular interest for which a wide variety of molecular genetic tools are readily available. Particularly, Drosophila models have been developed for a number of human diseases [13]. According to DIOPT, 9705 of 13,902 protein-coding genes in Drosophila are predicted to have human ortholog(s) [9]. Using DGET we analyzed the expression levels of the subset of Drosophila genes for which there is evidence that they are conserved in the human genome. Specifically, we analyzed subsets of genes scoring as putative human orthologs of fly genes at different levels of confidence, as defined by the orthologous gene prediction tool developed at the DRSC, Drosophila RNAi Screening Center [9]. This tool, DIOPT (DRSC Integrative Ortholog Prediction Tool), integrates the ortholog predictions from 14 different algorithms and assigns a 'DIOPT score' or (See figure on previous page.) Fig. 1 The DGET user interface. a On the "Search Gene Expression" page, users can input a gene list by pasting Drosophila gene or protein symbols or IDs into the text box, or by uploading a file. The specific identifiers accepted are FlyBase Gene Identifier (FBgn), gene symbol, CG number, and full gene name. Users can choose to look at expression patterns or perform an enrichment analysis of the inputted list as compared with the underlying RNA-Seq data. b At the "Search Similar Genes" page, users can enter a gene symbol (or other accepted identifier) to find genes with similar expression patterns. At the "Build Network" page, users can enter a list of genes to build the synexpression network based on the dataset and Pearson correlation cut-off specified Fig. 2 Expression patterns of genes in major Drosophila regulatory gene groups count of algorithms that predict a given pair-wise orthologous relationship. We found a strong correlation of percent expressed genes with DIOPT score (Fig. 3). For example, for genes that have a high-confidence ortholog relationship (DIOPT score of 7 or above), almost all are expressed across all tissues. By contrast, for genes for which DIOPT analysis suggests that there is no evidence of a human ortholog (i.e., none of the 10 ortholog algorithms queried with DIOPT predict an ortholog), only 20-50% are expressed in each of the major tissue categories profiled. We suspect that this correlation is driven by essential genes, which are more conserved evolutionarily. We also note that gene set enrichment for the set of high-confidence orthologs indicates that "kinases" and "nucleotide binding" among the top 20 enriched sets, indicating that the set of regulatory genes analyzed above has overlap with this set.
We next analyzed the 418 Drosophila essential genes identified by Spradling et al. [15] using a largescale single P-element insertion fly stock collection. The proportions of essential genes expressed at detectable levels in various tissues are very similar to the genes with DIOPT score 7-10 ( Fig. 3, light purple and dark purple bars) with a Pearson correlation coefficient equal to 0.92.
Expression patterns of Drosophila orthologs of human genes that are highly expressed in specific tissues Next, we asked whether genes conserved between human and Drosophila are also expressed in similar patterns. We used the tissue-based human proteome annotation available at the Human Protein Atlas (HPA) (www.proteinatlas.org) [16], as the source for tissue-specific expression, and retrieved the set of human genes that are expressed in specific tissues. Next, we mapped these human genes to Drosophila orthologs using DIOPT [9], filtering out low rank ortholog pairs (see Methods), and analyzed the expression patterns of these high-confidence orthologs in Drosophila tissues using DGET (Fig. 4). The results of our analysis using all annotated proteins without a filter did not clearly demonstrate conservation of expression patterns. However, an analysis limited to genes expressed at high or moderate levels (as annotated by HPA) from high confident annotation (i.e., excluding HPA "reliability" value of Fig. 3 Relationship between expression levels and gene conservation. Drosophila genes that are conserved in the human genome at different confidence levels (i.e., different DIOPT scores) were analyzed by DGET. We found that across all tissues, expression levels correlate with confidence in the ortholog relationship. That is, in general, genes with higher DIOPT scores vs. human genes have higher expression levels. Genes with DIOPT scores of 7-10 (light purple bars) have similar expression patterns as compared with Drosophila essential genes (dark purple bars); i.e., in both cases, the genes are likely to be expressed in many or all tissues "uncertain"), indicates that gene expression patterns are conserved in similar tissues in Drosophila. For example, as a group, orthologs of genes highly expressed in the human cerebellum, cerebral cortex, lateral ventricle or hippocampus are highly expressed in the Drosophila central nervous system (CNS) or head, at both larval and adult stages, and orthologs of genes highly expressed in human testis are also highly expressed in the Drosophila testis. Moreover, orthologs of genes from some organs of the human digestive system, such as stomach, duodenum or small intestine, are also highly expressed in the Drosophila digestive system. To further compare the expression patterns of genes expressed in the human and Drosophila, digestive systems, we analyzed the Drosophila gut sub-region data from Dutta et al. [6] (Fig. 5). Orthologs of genes highly expressed in the human salivary gland and esophagus are highly expressed in the R1 upstream region, and orthologs of genes highly expressed in the human rectum, colon or appendix are more biased towards expression in the R5 downstream region. Fly orthologs of genes highly expressed in the human stomach, duodenum and small intestine are detected throughout the samples corresponding to R1 to R5.

Mining information from distinct but related fly gut gene expression data sets
We next sought to compare the results of whole-gut profiling with results from profiling of specific subregions or cell types with the goal of identifying genes only expressed in specific sub-populations. Our rationale for the analysis was to determine the likelihood that genes expressed in a sub-population are missed in expression analysis of an entire organ. This type of false negative analysis should provide helpful information for interpreting results of whole-organ or whole-tissue studies. Thus, we compared the whole gut profiling data obtained by modENCODE consortium for 20 day old adult flies [12] with data generated by profiling sub-regions of the midgut in 16-20 day old adult flies [10]. Whole gut profiling indicates that 9109 genes are expressed in the gut of 20 day old adult flies (RPKM cutoff value of 1). Among the 4790 protein-coding genes not detected as expressed in the whole-gut study, 136 genes are detected in at least 3 sub-regions of the gut (RPKM ≥ 3). These genes are either false negative in whole gut profiling or false positive in sub-region profiling. We next did a gene set enrichment analysis with these 136 genes and found that stress response genes, such as heat-shock genes Fig. 4 Comparison of gene expression patterns in humans and Drosophila. High-confidence Drosophila orthologs of genes that are highly expressed in the small intestine, ovary, testis, cerebellum, cerebral cortex, or other tissues were analyzed using DGET. For at least some tissues, we see a correlation between genes highly expressed in specific human tissues (e.g., cerebellum, testis) and the expression of orthologs in cognate tissue sample(s) available for Drosophila (e.g., CNS or head, testis) (Hsp70Aa, Hsp70Ab, Hsp70Ba, Hsp70Bbb) are enriched (P value = 3.05E-07). This suggests that the sample used for sub-region profiling was associated with some level of stress. Comparing the list of 136 genes with the Drosophila essential gene list, we found only one overlapping gene. In addition, only 23 of the 136 genes have DIOPT score 7-10 when mapping to human genes. Thus, a small fraction of these genes might be false negative with regards to whole tissue profiling while the majority of the genes are likely to be false positives not normally present in the gut under non-stress conditions.

Synexpression analysis for the transcription factor twist
Expression profiling is a powerful approach to identify functionally related genes, as genes showing synexpression often operate in similar pathways and/or processes (see for example [4]). We tested DGET for its usefulness for synexpression studies by querying genes with expression profiles similar to the mesodermal master regulator Twist. DGET preferentially retrieved Twist target genes with cell line data as well as development data. For example, among the top 27 genes that share similar expression with Twist in cell lines (Pearson correlation co-efficiency cut off = 0.7), 11 of them are Twist target genes based on Chip-on-chip data [14], and 8 of the 11 genes are high-confidence ( Table 1). The enrichment p-value for Twist target genes is 8.70E-04 overall and 3.05E-05 for high-confidence targets. In addition, we also queried genes that have an expression profile opposite that of Twist (i.e., negative correlation) with the idea of identifying potential repressed gene targets. No genes have a strong negative correlation. However, 7 genes show a weak negative correlation with Twist and none of them overlaps with Chip-on-chip data ( Table 1).
We observed a less significant enrichment with development data (p-value 5.00E-02 for all Twist target genes and p-value of 2.70E-03 for high-confidence targets), likely reflecting the diversity of cell types present in the developmental data and that not enough cells express twist. Thus, DGET will be very powerful when applied to RNA-seq data sets from single cell or groups of homogeneous cell populations.

Conclusions
In summary, DGET makes it possible to retrieve and compare Drosophila gene expression patterns generated by various groups using RNA-Seq. The tool can help scientists design experiments based on expression and analyze experiment results. The backend database for DGET is designed to easily accommodate the addition of new high quality RNA-Seq datasets as they become available. Finally, although the anatomy of human and Drosophila are quite different, by using DGET, we demonstrate that expression patterns of genes that are conserved and highly expressed are conserved between human and Drosophila in many matching tissues, underscoring the utility of the Drosophila model to understand the role of human genes with unknown functions.  Twist targets as defined in [14]