Online GESS was developed as a Java web application. The Twitter Bootstrap 3 front-end framework and the jQuery JavaScript library were used to develop the web pages. At the back end, Online GESS contains reference sequences corresponding to 3’UTRs (the region thought to be the most sensitive to miRNA-like off-target effects), 5’UTRs, coding sequences (CDS) or full-length transcripts (including non-coding RNAs) in the human, mouse and Drosophila genomes. The human and mouse sequences are obtained from the NCBI RefSeq database. Although these sequences are derived from GenBank records, RefSeq records are non-redundant and have gone through additional levels of validation, annotation and manual curation. Transcript sequences, as well as CDS and UTR annotations, are retrieved. The Drosophila transcript sequences are obtained from FlyBase (flybase.org) [13], a comprehensive database of Drosophila information that is curated by experts to ensure quality and includes sequences, gene annotation, mutant alleles and publications. Because curation and annotation of reference sequences are ongoing efforts, we have implemented a mechanism for synchronizing reference sequences with each new RefSeq and FlyBase release [14, 15].
After a user uploads annotated screen results (i.e. sequences of active and, if available, inactive RNAi reagents) in Excel, comma-separated values or tab-delimited text format, Online GESS extracts the seed sequences from the active and inactive RNAi reagent sequences and searches the transcript sequences for perfect matches. If a set of inactive RNAi reagents is not provided, the program creates a theoretically inactive set by replacing the first nucleotide of each seed region with the complementary nucleotide. The program then calculates the frequencies of matches among active and inactive RNAi reagents and identifies transcripts whose seed matches are significantly enriched among active RNAi reagents, using Fisher’s exact test and the Yates chi-square test. When the sample size is small, the p-value from Fisher’s exact test is selected; otherwise, the p-value from the Yates chi-square test is used. Transcripts are then ranked by the selected p-value, and these ranks are used in the multiple hypothesis correction. Three multiple hypothesis correction methods are applied: the Bonferroni, Bonferroni step-down and Benjamini & Hochberg algorithms, listed in order from most to least stringent. Detailed information about the GESS algorithm and analysis methods can be found in the original publication [11].
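The steps described above can be illustrated with a minimal Python sketch. This is not the Online GESS implementation (which is a Java application); all function names are hypothetical. It assumes that sequences are guide (anti-sense) strands, that transcripts use the DNA alphabet, that a seed "match" means the transcript contains the reverse complement of the seed, that the enrichment test is one-sided, and that "small sample size" means an expected cell count below 5. None of these details are specified in the text.

```python
from math import comb, erfc, sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def seed_of(guide, start=2, length=7):
    """Seed = nucleotides start..start+length-1 (1-based) from the 5' end of the guide."""
    return guide.upper().replace("U", "T")[start - 1:start - 1 + length]

def synthetic_inactive_seed(seed):
    """Complement the first seed nucleotide to derive a theoretically inactive seed."""
    return seed[0].translate(COMPLEMENT) + seed[1:]

def target_site(seed):
    """Assumption: the transcript site is the reverse complement of the seed."""
    return seed.translate(COMPLEMENT)[::-1]

def count_sites(site, transcript):
    """Count perfect matches, including overlapping ones."""
    n, i = 0, transcript.find(site)
    while i != -1:
        n += 1
        i = transcript.find(site, i + 1)
    return n

def contingency(active_seeds, inactive_seeds, transcript, min_matches=1):
    """2x2 table: (active hit, active miss, inactive hit, inactive miss)."""
    def hits(seeds):
        return sum(count_sites(target_site(s), transcript) >= min_matches for s in seeds)
    a, c = hits(active_seeds), hits(inactive_seeds)
    return a, len(active_seeds) - a, c, len(inactive_seeds) - c

def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher exact p-value: P(active hits >= a) under the hypergeometric null."""
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / comb(n, col1)

def yates_chi2_p(a, b, c, d):
    """Chi-square test with Yates continuity correction (1 degree of freedom)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2 / denom
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df

def select_p(a, b, c, d, min_expected=5):
    """Assumed rule of thumb: use Fisher when any expected cell count is below 5."""
    n = a + b + c + d
    if n == 0:
        return 1.0
    expected = [r * k / n for r in (a + b, c + d) for k in (a + c, b + d)]
    if min(expected) < min_expected:
        return fisher_enrichment_p(a, b, c, d)
    return yates_chi2_p(a, b, c, d)

# Usage with made-up sequences (not real screen data).
active = [seed_of("UACGUUAGCAAGGCUAGCUAA")]
inactive = [synthetic_inactive_seed(s) for s in active]
table = contingency(active, inactive, "GGGCTAACGTGGG")
p_value = select_p(*table)
```

In a full analysis the table would be built once per transcript, and the selected p-values would then be ranked and corrected for multiple testing.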
User interface
The Online GESS application functions as an interface for submitting data and setting parameters for GESS analysis. The output files are sent by email if their size is 15 MB or less. For larger files, a link to download the resulting files is emailed to the user. The output files remain available for download for 48 hours.
User input
To perform a GESS analysis, the user has to provide siRNA or shRNA information in one of the required formats (a tab- or comma-separated text file or an Excel file). There are two possible layouts for input si/shRNA files. The first requires the sequences of both active and inactive siRNAs/shRNAs, as well as their corresponding phenotype/activity information (see example file at http://www.flyrnai.org/gess/ActiveAndInactiveSiRNAs.txt). The second layout includes only the sequences of active siRNAs/shRNAs, and phenotype/activity information is not needed (all reagents are assumed to be active; see example file at http://www.flyrnai.org/gess/ActivesiRNAs.txt). The user then indicates the layout of their input file by selecting “Input file contains both active and inactive RNAi reagents” or “Input file contains only active RNAi reagents”. The user also needs to indicate whether the input sequences represent the sense (passenger) or anti-sense (guide) strands of the reagents. In addition, the user has to indicate the reagent type, siRNA or shRNA. If shRNA is selected, the user can trim the sequences by one to three nucleotides, because the sequences provided by the source of an shRNA library may not reflect the actual mature siRNA strands generated by the expected canonical Dicer cleavage.
The next step is to specify a reference database. As described above, Online GESS has built-in reference databases for the human, mouse and Drosophila genomes. The user can choose one of the three species and then specify the transcript region(s) to search against. The options are 3’UTR (the preferred region for GESS analysis), 5’UTR, CDS, full transcripts of protein-coding genes, or full transcripts of all genes including non-coding RNAs. The user can also choose to upload a custom database file, which should contain FASTA-formatted sequences (see example file at http://www.flyrnai.org/gess/customDatabase.txt). For a custom reference database, the program searches for seed matches along the full length of the sequences provided. If the user would like to restrict the search to a specific sub-region within a custom reference set, such as 3’UTRs (thought to be the major site of miRNA activity), the user is responsible for uploading only the 3’UTR sequences.
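As an illustration of how a custom FASTA database might be read, the following minimal Python sketch maps each header (comment) line to its full sequence. The function name `parse_fasta` is hypothetical and the code is not taken from Online GESS; it only assumes the standard FASTA convention of a `>` header line followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Return {header: sequence}; sequence lines under one header are concatenated."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}

# Usage with a made-up two-record database.
example = ">tx1 geneA 3'UTR\nacgt\nACGT\n\n>tx2\nGGCC\n"
database = parse_fasta(example)
```

Because the whole sequence under each header is searched, a file meant to restrict the analysis to 3’UTRs must contain only 3’UTR sequences.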
The final step prior to submitting data for processing is to specify any optional parameters. The GESS interface allows users to specify the length of the seed sequence, the minimum number of seed matches to be found in the target sequence, the strand of the RNAi sequence, and a statistical threshold value. Currently, the default settings are a 7-nucleotide seed sequence (nucleotides 2–8 from the 5’ end of the anti-sense sequences provided by the user), a minimum of one seed match using the anti-sense strand of the RNAi reagent only, and a p-value threshold of 0.05 before multiple hypothesis testing correction. The user has the option to perform a control test in which the seed sequence of each active and inactive reagent is randomly scrambled. This gives a sense of how strong outliers arising at random can be, and thus more confidence that the significant results are not due to chance. To do this, the user needs to run a parallel test by making the corresponding selection under “Advanced Options” in the user interface. This provides a new set of results that the user can compare with the results obtained for the experimental set. It is important to note that the program generates only one set of results at a time. Hence, to include a control test in the overall analysis, the control test has to be submitted and run separately.
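The scrambled-seed control can be sketched as follows. This is an illustrative Python example, not the Online GESS code; `scramble_seed` is a hypothetical name, and the sketch assumes that only the seed positions (nucleotides 2–8 by default) are shuffled while the rest of the reagent sequence is left unchanged.

```python
import random

def scramble_seed(guide, rng, start=2, length=7):
    """Shuffle the seed nucleotides in place; flanking sequence is untouched."""
    i = start - 1
    seed = list(guide[i:i + length])
    rng.shuffle(seed)
    return guide[:i] + "".join(seed) + guide[i + length:]

# Usage with a made-up guide sequence; a fixed RNG seed makes the control repeatable.
rng = random.Random(0)
guide = "TACGTTAGCAAGGCTAGCTAA"
control = scramble_seed(guide, rng)
```

Shuffling preserves the nucleotide composition of each seed, so differences between the experimental and control runs reflect seed order rather than base content.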
Online GESS pre-processes the input files and detects mis-formatted records, such as lines missing sequence information, before the analysis starts. If more than 25% of the records are mis-formatted, the error type (see help page at http://www.flyrnai.org/gess/help.jsp) and a few examples are displayed to the user. This feature enables the user to identify errors in their files immediately and fix them. If fewer than 25% of the records fail pre-processing, the tool continues the analysis and ignores the mis-formatted records. The user is then informed by email about the number of RNAi reagents that were ignored and their location in the file.
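The pre-processing rule can be illustrated with a short Python sketch. It is not the Online GESS implementation; `preprocess` is a hypothetical name, and the validity check (a non-empty string of A, C, G, T or U) is an assumption. The text does not say what happens when exactly 25% of records are mis-formatted; in this sketch the analysis continues in that case.

```python
import re

VALID_SEQUENCE = re.compile(r"^[ACGTU]+$", re.IGNORECASE)

def preprocess(sequences, max_bad_fraction=0.25):
    """Return (valid sequences, 1-based positions of ignored records).

    Raises ValueError when more than max_bad_fraction of records are invalid.
    """
    good, bad_positions = [], []
    for position, seq in enumerate(sequences, start=1):
        seq = (seq or "").strip()
        if VALID_SEQUENCE.match(seq):
            good.append(seq)
        else:
            bad_positions.append(position)
    if sequences and len(bad_positions) / len(sequences) > max_bad_fraction:
        raise ValueError(
            f"{len(bad_positions)} of {len(sequences)} records are mis-formatted, "
            f"e.g. at positions {bad_positions[:3]}"
        )
    return good, bad_positions

# Usage: one bad record out of five (20%) is ignored and reported.
good, ignored = preprocess(["ACGU", "", "ACGT", "GGCC", "TTAA"])
```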
Output files
A GESS analysis generates two output files. The first file lists the transcripts identified by seed region match to active RNAi reagents and their enrichment scores. By default, this file contains results for all tested transcripts. If the user is not interested in the full list, the results can be limited to significant transcripts by choosing “Only Significant Transcripts” under advanced options. When using a built-in database, each transcript is indicated by its RefSeq accession number, along with a corresponding gene symbol from NCBI or FlyBase. If a custom database is provided, the comment lines from the FASTA file are displayed. This first file also reports the number of active RNAi reagents that have seed matches to a given sequence, the seed match frequency of active reagents, and the p-values from both Fisher’s exact test and the Yates chi-square test. It further reports the p-value selected for multiple hypothesis correction and the adjusted p-values, as calculated using the Bonferroni, Bonferroni step-down and Benjamini & Hochberg methods. Finally, the corrected p-value thresholds, as well as the statistical significance status of each transcript according to each algorithm, are reported in this file. The second file contains the transcript identifiers and a list of the active RNAi reagents that match them. This file contains only the transcripts with p-values ≤ 0.05. If the analysis fails during input file processing, an email notification is sent to the user (see help page for detailed explanation, http://www.flyrnai.org/gess/help.jsp).
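For reference, the three corrections named above can be expressed as adjusted p-values in a few lines of Python. This is a generic textbook sketch, not the Online GESS code; it assumes that "Bonferroni step-down" refers to the Holm procedure, and the function names are hypothetical.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def bonferroni_step_down(pvals):
    """Holm procedure: the p-value at 0-based rank r is scaled by (m - r), kept monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

def benjamini_hochberg(pvals):
    """BH procedure: the p-value at 1-based rank k is scaled by m / k, kept monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

# Usage with made-up p-values for four transcripts.
pvals = [0.01, 0.04, 0.03, 0.005]
adj_bonf = bonferroni(pvals)
adj_holm = bonferroni_step_down(pvals)
adj_bh = benjamini_hochberg(pvals)
```

For every transcript the Bonferroni-adjusted value is at least as large as the step-down value, which in turn is at least as large as the Benjamini & Hochberg value, consistent with the stated order of stringency.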
Run time
The run time of a GESS analysis depends on the size of the input files, but in most cases the analysis is complete within a couple of minutes. For example, in our tests it took two minutes to analyze 10,000 siRNAs against about 68,450 3’UTRs annotated for human genes in the RefSeq database (v61).
Testing
We compared Online GESS to the standalone MATLAB version using supplementary data from a spindle assembly checkpoint screen provided in the original GESS publication [11]. The original publication used transcripts from Ensembl as the reference, whereas Online GESS uses transcripts from RefSeq by default. For a direct comparison, we uploaded to Online GESS a custom database of the Ensembl 3’UTR sequences provided in the original publication [11]. We then ran Online GESS using the same parameters as those used in the original publication (a 7mer seed match from either strand) [11] and obtained the same results. Next, we ran another Online GESS analysis with the same parameters using our built-in database of human 3’UTR sequences (by default, this was the current RefSeq release, i.e. v61). The results were the same at the gene level; that is, MAD2 was the only significant outlier. The only differences we observed between the results obtained with the standalone MATLAB version and Online GESS were at the transcript level (not at the gene level) and are attributable to differences in the underlying reference data.