- Research article
- Open Access
The Drosophila Gene Expression Tool (DGET) for expression analyses
BMC Bioinformatics volume 18, Article number: 98 (2017)
Next-generation sequencing technologies have greatly increased our ability to identify gene expression levels, including at specific developmental stages and in specific tissues. Gene expression data can help researchers understand the diverse functions of genes and gene networks, as well as help in the design of specific and efficient functional studies, such as by helping researchers choose the most appropriate tissue for a study of a group of genes, or conversely, by limiting a long list of gene candidates to the subset that are normally expressed at a given stage or in a given tissue.
We report DGET, a Drosophila Gene Expression Tool (www.flyrnai.org/tools/dget/web/), which stores and facilitates search of RNA-Seq based expression profiles available from the modENCODE consortium and other public data sets. Using DGET, researchers are able to look up gene expression profiles, filter results based on threshold expression values, and compare expression data across different developmental stages, tissues and treatments. In addition, at DGET a researcher can analyze tissue or stage-specific enrichment for an inputted list of genes (e.g., ‘hits’ from a screen) and search for additional genes with similar expression patterns. We performed a number of analyses to demonstrate the quality and robustness of the resource. In particular, we show that evolutionary conserved genes expressed at high or moderate levels in both fly and human tend to be expressed in similar tissues. Using DGET, we compared whole tissue profile and sub-region/cell-type specific datasets and estimated a potential source of false positives in one dataset. We also demonstrated the usefulness of DGET for synexpression studies by querying genes with expression profile similar to the mesodermal master regulator Twist.
Altogether, DGET provides a flexible tool for expression data retrieval and analysis with short or long lists of Drosophila genes, which can help scientists to design stage- or tissue-specific in vivo studies and do other subsequent analyses.
The application of next-generation sequence technologies to RNA analysis has opened the door to relatively rapid, large-scale analyses of gene expression. ‘Standard’ RNA-seq analysis, for example, can provide a snapshot of gene expression in specific cell types or tissues , and related technologies such as Ribo-seq  provide more refined views, such as a snapshot of what genes are actively transcribed in a given cell or tissue. For Drosophila, efforts such as the modENCODE project [1, 2, 7, 12] have provided a baseline overview of expression under standard laboratory conditions for various cultured cell types, developmental stages, and tissues, as well as treatment conditions. Moreover, studies such as those investigating expression in sub-regions of the fly gut [6, 10] are providing increasingly detailed views of the baseline expression levels of various genes in various tissues, cell types and sub-regions. Altogether, these RNA-seq data resources provide helpful starting points for analysis of other gene lists.
Resources such as FlyBase  make it possible to quickly view modENCODE data for a given gene and make these data generally accessible to the community. The value of these data to the community can be further increased by facilitating search of lists of genes. For example, for gene lists originating from whole-animal or cultured cell studies, or for studies based on a list of orthologs of genes from another species, it can be very helpful to get a picture of what stages or tissues normally express those genes, as that will help focus stage- or tissue-specific in vivo studies and other subsequent analyses. We implemented DGET to help scientists retrieve modENCODE expression data in batch mode. DGET also hosts other relevant RNA-Seq datasets published in individual studies, such as profiles of specific sub-regions and cell types of the Drosophila gut [6, 10]. Here, we describe DGET and perform a number of analyses to demonstrate the quality and robustness of the resource.
Processed modENCODE data were retrieved from FlyBase (ftp://ftp.flybase.net/releases/FB2015_05/precomputed_files/genes/gene_rpkm_report_fb_2015_05.tsv.gz). Data published by Marianes and Spradling  were retrieved from NCBI Gene Expression Omnibus at (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47780). Data published by Dutta et al.  were retrieved from the flygut-seq website (http://flygutseq.buchonlab.com/resources). Data retrieved were mapped to FlyBase identifiers from release 2015_5 and formatted for upload into the FlyRNAi database .
Expression pattern analysis
Human protein expression data were retrieved from proteinatlas.org and tissue-specific genes were selected using the file “ProteinAtlas_Normal_tissue_vs14.” Proteins with high or medium expression levels with a reliability value of “supportive” were selected. Proteins expressed in a broad range of tissues (i.e., more than 5 tissues) were filtered out. DIOPT vs5 was used to map genes from human to Drosophila . ‘Ortholog pair rank’ was added at recent DIOPT release 5.2.1 (http://www.flyrnai.org/DRSC-ORH.html#versions). Drosophila genes with high or moderate rank were selected. The high/moderate rank mapping include the gene pairs that are best score in either forward or reverse mapping (and DIOPT score >1) as well as gene pairs with DIOPT score >3 if not best score either way.
Results and discussion
Database content and features of the user interface (UI)
The DGET database contains processed RNA-Seq data from the modENCODE consortium [1, 2, 7, 12], as released by FlyBase , as well as other published datasets we obtained from specific studies [3, 6, 10]. The DGET UI has two tabs (Fig. 1).
At the “Search Gene Expression” tab, users can enter a list of genes or choose one of the predefined gene classes from GLAD , e.g., kinases, then specify the datasets to be displayed. There are two search options, “look at expression” and “enrichment analysis.” The results page for “look at expression” displays expression values in a heatmap format. At this results page, users have the option to download the relevant expression values; download the heatmap; and further filter the list by defining a cutoff, limit to specific dataset(s), or filtering out genes, for example with less than 1 RPKM value based on carcass and/or digestion system expression of 1 day adult. We used an RPKM cutoff of 1 because this is considered the cutoff for ‘no or extremely low expression’ at FlyBase. The results page for an enrichment analysis displays the distribution of genes at different expression levels using a bar graph and heatmap. The cutoff values for different levels are defined based on FlyBase guidelines (Fig. 1a).
Using the “Search Similar Genes” tab, users can enter a gene of interest and search for other genes with similar expression pattern based on Pearson correlation score. Users have the options to download the list of genes with similar expression patterns, a heatmap, and a normalized heatmap. Using the “Build Network” tab, users can enter a list of genes and build synexpression network based on the correlation of expression using the dataset and Pearson correlation cutoff specified by the user (Fig. 1b).
Expression pattern of Drosophila regulatory genes
When genome-scale screening is not practical to do, a common approach is to select a specific subset of genes to start with, such as a group of genes with related activities. The most frequently screened sub-sets of genes are important regulatory genes including genes that encode kinases, phosphatases, transcription factors, or canonical signal transduction pathways components. Our expectation is that these regulatory genes, which appear to be re-used in many contexts, will be expressed in many tissues. To test this, we analyzed the expression patterns of several Drosophila regulatory gene classes defined by GLAD, Gene List Annotation for Drosophila . These included canonical signal transduction pathway genes, kinases, phosphatases, transcription factors, secreted proteins, and receptors. The percentages of expressed genes were calculated across all tissues profiled using a RPKM of 1 or above as a cutoff for expressed versus not expressed (Fig. 2). About 70–90% of the genes categorized as encoding canonical signal transduction pathway components, kinases, phosphatases, or transcription factors are expressed in each of the major tissue categories profiled, whereas only 30–60% of receptor or secreted proteins are detected in any given tissue.
Correlation of expression with confidence in an ortholog relationship
It is well established that the evolutionary conservation of proteins correlates with conservation at the level of biological and/or biochemical functions. Drosophila is a model organism of particular interest for which a wide variety of molecular genetic tools are readily available. Particularly, Drosophila models have been developed for a number of human diseases . According to DIOPT, 9705 of 13,902 protein-coding genes in Drosophila are predicted to have human ortholog(s) . Using DGET we analyzed the expression levels of the subset of Drosophila genes for which there is evidence that they are conserved in the human genome. Specifically, we analyzed subsets of genes scoring as putative human orthologs of fly genes at different levels of confidence, as defined by the orthologous gene prediction tool developed at the DRSC, Drosophila RNAi Screening Center . This tool, DIOPT (DRSC Integrative Ortholog Prediction Tool), integrates the ortholog predictions from 14 different algorithms and assigns a ‘DIOPT score’ or count of algorithms that predict a given pair-wise orthologous relationship. We found a strong correlation of percent expressed genes with DIOPT score (Fig. 3). For example, for genes that have a high-confidence ortholog relationship (DIOPT score of 7 or above), almost all are expressed across all tissues. By contrast, for genes for which DIOPT analysis suggests that there is no evidence of a human ortholog (i.e., none of the 10 ortholog algorithms queried with DIOPT predict an ortholog), only 20–50% are expressed in each of the major tissue categories profiled. We suspect that this correlation is driven by essential genes, which are more conserved evolutionarily. We also note that gene set enrichment for the set of high-confidence orthologs indicates that “kinases” and “nucleotide binding” among the top 20 enriched sets, indicating that the set of regulatory genes analyzed above has overlap with this set.
We next analyzed the 418 Drosophila essential genes identified by Spradling et al.  using a large-scale single P-element insertion fly stock collection. The proportions of essential genes expressed at detectable levels in various tissues are very similar to the genes with DIOPT score 7–10 (Fig. 3, light purple and dark purple bars) with a Pearson correlation coefficient equal to 0.92.
Expression patterns of Drosophila orthologs of human genes that are highly expressed in specific tissues
Next, we asked whether genes conserved between human and Drosophila are also expressed in similar patterns. We used the tissue-based human proteome annotation available at the Human Protein Atlas (HPA) (www.proteinatlas.org) , as the source for tissue-specific expression, and retrieved the set of human genes that are expressed in specific tissues. Next, we mapped these human genes to Drosophila orthologs using DIOPT , filtering out low rank ortholog pairs (see Methods), and analyzed the expression patterns of these high-confidence orthologs in Drosophila tissues using DGET (Fig. 4). The results of our analysis using all annotated proteins without a filter did not clearly demonstrate conservation of expression patterns. However, an analysis limited to genes expressed at high or moderate levels (as annotated by HPA) from high confident annotation (i.e., excluding HPA “reliability” value of “uncertain”), indicates that gene expression patterns are conserved in similar tissues in Drosophila. For example, as a group, orthologs of genes highly expressed in the human cerebellum, cerebral cortex, lateral ventricle or hippocampus are highly expressed in the Drosophila central nervous system (CNS) or head, at both larval and adult stages, and orthologs of genes highly expressed in human testis are also highly expressed in the Drosophila testis. Moreover, orthologs of genes from some organs of the human digestive system, such as stomach, duodenum or small intestine, are also highly expressed in the Drosophila digestive system. To further compare the expression patterns of genes expressed in the human and Drosophila, digestive systems, we analyzed the Drosophila gut sub-region data from Dutta et al.  (Fig. 5). Orthologs of genes highly expressed in the human salivary gland and esophagus are highly expressed in the R1 upstream region, and orthologs of genes highly expressed in the human rectum, colon or appendix are more biased towards expression in the R5 downstream region. Fly orthologs of genes highly expressed in the human stomach, duodenum and small intestine are detected throughout the samples corresponding to R1 to R5.
Mining information from distinct but related fly gut gene expression data sets
We next sought to compare the results of whole-gut profiling with results from profiling of specific sub-regions or cell types with the goal of identifying genes only expressed in specific sub-populations. Our rationale for the analysis was to determine the likelihood that genes expressed in a sub-population are missed in expression analysis of an entire organ. This type of false negative analysis should provide helpful information for interpreting results of whole-organ or whole-tissue studies. Thus, we compared the whole gut profiling data obtained by modENCODE consortium for 20 day old adult flies  with data generated by profiling sub-regions of the midgut in 16–20 day old adult flies . Whole gut profiling indicates that 9109 genes are expressed in the gut of 20 day old adult flies (RPKM cutoff value of 1). Among the 4790 protein-coding genes not detected as expressed in the whole-gut study, 136 genes are detected in at least 3 sub-regions of the gut (RPKM ≥ 3). These genes are either false negative in whole gut profiling or false positive in sub-region profiling. We next did a gene set enrichment analysis with these 136 genes and found that stress response genes, such as heat-shock genes (Hsp70Aa, Hsp70Ab, Hsp70Ba, Hsp70Bbb) are enriched (P value = 3.05E-07). This suggests that the sample used for sub-region profiling was associated with some level of stress. Comparing the list of 136 genes with the Drosophila essential gene list, we found only one overlapping gene. In addition, only 23 of the 136 genes have DIOPT score 7–10 when mapping to human genes. Thus, a small fraction of these genes might be false negative with regards to whole tissue profiling while the majority of the genes are likely to be false positives not normally present in the gut under non-stress conditions.
Synexpression analysis for the transcription factor twist
Expression profiling is a powerful approach to identify functionally related genes, as genes showing synexpression often operate in similar pathways and/or processes (see for example ). We tested DGET for its usefulness for synexpression studies by querying genes with expression profiles similar to the mesodermal master regulator Twist. DGET preferentially retrieved Twist target genes with cell line data as well as development data. For example, among the top 27 genes that share similar expression with Twist in cell lines (Pearson correlation co-efficiency cut off = 0.7), 11 of them are Twist target genes based on Chip-on-chip data , and 8 of the 11 genes are high-confidence (Table 1). The enrichment p-value for Twist target genes is 8.70E-04 overall and 3.05E-05 for high-confidence targets. In addition, we also queried genes that have an expression profile opposite that of Twist (i.e., negative correlation) with the idea of identifying potential repressed gene targets. No genes have a strong negative correlation. However, 7 genes show a weak negative correlation with Twist and none of them overlaps with Chip-on-chip data (Table 1).
We observed a less significant enrichment with development data (p-value 5.00E-02 for all Twist target genes and p-value of 2.70E-03 for high-confidence targets), likely reflecting the diversity of cell types present in the developmental data and that not enough cells express twist. Thus, DGET will be very powerful when applied to RNA-seq data sets from single cell or groups of homogeneous cell populations.
In summary, DGET makes it possible to retrieve and compare Drosophila gene expression patterns generated by various groups using RNA-Seq. The tool can help scientists design experiments based on expression and analyze experiment results. The backend database for DGET is designed to easily accommodate the addition of new high quality RNA-Seq datasets as they become available. Finally, although the anatomy of human and Drosophila are quite different, by using DGET, we demonstrate that expression patterns of genes that are conserved and highly expressed are conserved between human and Drosophila in many matching tissues, underscoring the utility of the Drosophila model to understand the role of human genes with unknown functions.
The Drosophila gene expression tool
DRSC integrative ortholog prediction tool
Drosophila RNAi screening center
Model organism ENCyclopedia of DNA elements
Boley N, Wan KH, Bickel PJ, Celniker SE. Navigating and mining modENCODE data. Methods. 2014;68(1):38–47.
Cherbas L, Willingham A, Zhang D, Yang L, Zou Y, Eads BD, Carlson JW, Landolin JM, Kapranov P, Dumais J, Samsonova A, Choi JH, Roberts J, Davis CA, Tang H, van Baren MJ, Ghosh S, Dobin A, Bell K, Lin W, Langton L, Duff MO, Tenney AE, Zaleski C, Brent MR, Hoskins RA, Kaufman TC, Andrews J, Graveley BR, Perrimon N, Celniker SE, Gingeras TR, Cherbas P. The transcriptional diversity of 25 Drosophila cell lines. Genome Res. 2011;21(2):301–14.
Clough E, Barrett T. The gene expression omnibus database. Methods Mol Biol. 2016;1418:93–110.
Dequeant ML, Fagegaltier D, Hu Y, Spirohn K, Simcox A, Hannon GJ, Perrimon N. Discovery of progenitor cell signatures by time-series synexpression analysis during Drosophila embryonic cell immortalization. Proc Natl Acad Sci U S A. 2015;112(42):12974–9.
dos Santos G, Schroeder AJ, Goodman JL, Strelets VB, Crosby MA, Thurmond J, Emmert DB, Gelbart WM, FlyBase C. FlyBase: introduction of the Drosophila melanogaster release 6 reference genome assembly and large-scale migration of genome annotations. Nucleic Acids Res. 2015;43(Database issue):D690–7.
Dutta D, Dobson AJ, Houtz PL, Glasser C, Revah J, Korzelius J, Patel PH, Edgar BA, Buchon N. Regional cell-specific transcriptome mapping reveals regulatory complexity in the adult Drosophila midgut. Cell Rep. 2015;12(2):346–58.
Graveley BR, Brooks AN, Carlson JW, Duff MO, Landolin JM, Yang L, Artieri CG, van Baren MJ, Boley N, Booth BW, Brown JB, Cherbas L, Davis CA, Dobin A, Li R, Lin W, Malone JH, Mattiuzzo NR, Miller D, Sturgill D, Tuch BB, Zaleski C, Zhang D, Blanchette M, Dudoit S, Eads B, Green RE, Hammonds A, Jiang L, Kapranov P, Langton L, Perrimon N, Sandler JE, Wan KH, Willingham A, Zhang Y, Zou Y, Andrews J, Bickel PJ, Brenner SE, Brent MR, Cherbas P, Gingeras TR, Hoskins RA, Kaufman TC, Oliver B, Celniker SE. The developmental transcriptome of Drosophila melanogaster. Nature. 2011;471(7339):473–9.
Hu Y, Comjean A, Perkins LA, Perrimon N, Mohr SE. GLAD: an online database of gene list annotation for Drosophila. J Genomics. 2015;3:75–81.
Hu Y, Flockhart I, Vinayagam A, Bergwitz C, Berger B, Perrimon N, Mohr SE. An integrative approach to ortholog prediction for disease-focused and other functional studies. BMC Bioinformatics. 2011;12:357.
Marianes A, Spradling AC. Physiological and stem cell compartmentalization within the Drosophila midgut. Elife. 2013;2:e00886.
Michel AM, Baranov PV. Ribosome profiling: a Hi-Def monitor for protein synthesis at the genome-wide scale. Wiley Interdiscip Rev RNA. 2013;4(5):473–90.
mod EC, Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, Lin MF, Washietl S, Arshinoff BI, Ay F, Meyer PE, Robine N, Washington NL, Di Stefano L, Berezikov E, Brown CD, Candeias R, Carlson JW, Carr A, Jungreis I, Marbach D, Sealfon R, Tolstorukov MY, Will S, Alekseyenko AA, Artieri C, Booth BW, Brooks AN, Dai Q, Davis CA, Duff MO, Feng X, Gorchakov AA, Gu T, Henikoff JG, Kapranov P, Li R, MacAlpine HK, Malone J, Minoda A, Nordman J, Okamura K, Perry M, Powell SK, Riddle NC, Sakai A, Samsonova A, Sandler JE, Schwartz YB, Sher N, Spokony R, Sturgill D, van Baren M, Wan KH, Yang L, Yu C, Feingold E, Good P, Guyer M, Lowdon R, Ahmad K, Andrews J, Berger B, Brenner SE, Brent MR, Cherbas L, Elgin SC, Gingeras TR, Grossman R, Hoskins RA, Kaufman TC, Kent W, Kuroda MI, Orr-Weaver T, Perrimon N, Pirrotta V, Posakony JW, Ren B, Russell S, Cherbas P, Graveley BR, Lewis S, Micklem G, Oliver B, Park PJ, Celniker SE, Henikoff S, Karpen GH, Lai EC, MacAlpine DM, Stein LD, White KP, Kellis M. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330(6012):1787–97.
Perrimon N, Bonini NM, Dhillon P. Fruit flies on the front line: the translational impact of Drosophila. Dis Model Mech. 2016;9(3):229–31.
Sandmann T, Girardot C, Brehme M, Tongprasit W, Stolc V, Furlong EE. A core transcriptional network for early mesoderm development in Drosophila melanogaster. Genes Dev. 2007;21(4):436–49.
Spradling AC, Stern D, Beaton A, Rhem EJ, Laverty T, Mozden N, Misra S, Rubin GM. The Berkeley Drosophila Genome Project gene disruption project: single P-element insertions mutating 25% of vital Drosophila genes. Genetics. 1999;153(1):135–77.
Uhlen M, Fagerberg L, Hallstrom BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson A, Kampf C, Sjostedt E, Asplund A, Olsson I, Edlund K, Lundberg E, Navani S, Szigyarto CA, Odeberg J, Djureinovic D, Takanen JO, Hober S, Alm T, Edqvist PH, Berling H, Tegel H, Mulder J, Rockberg J, Nilsson P, Schwenk JM, Hamsten M, von Feilitzen K, Forsberg M, Persson L, Johansson F, Zwahlen M, von Heijne G, Nielsen J, Ponten F. Proteomics. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419.
Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
The authors would like to thank the members of the DRSC, Transgenic RNAi Project (TRiP), and Perrimon lab for helpful suggestions and discussions.
Work at the DRSC is supported by NIGMS R01 GM067761, NIGMS R01 GM084947, and ORIP/NCRR R24 RR032668. S.E.M. is additionally supported in part by NCI Cancer Center Support Grant NIH 5 P30 CA06516 (E. Benz, PI). N.P. is an Investigator of the Howard Hughes Medical Institute.
Availability of data and materials
YH designed and tested the application, implemented the back-end of the application, performed the analysis and drafted the manuscript. AC implemented the user interface and contributed to the back-end of the application. NP provided critical input on key features and the analysis as well as edited the manuscript. SEM provided oversight and critical input on key features and the analysis, and helped draft the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
About this article
Cite this article
Hu, Y., Comjean, A., Perrimon, N. et al. The Drosophila Gene Expression Tool (DGET) for expression analyses. BMC Bioinformatics 18, 98 (2017). https://doi.org/10.1186/s12859-017-1509-z
- Expression profile