Figure 1 provides a diagram of the overall GraphOmics functionalities. An initial data loading step is performed to get measurements of entities into GraphOmics. As part of data loading, the Reactome database is used to map the biological entities (transcripts, proteins and metabolites) in the uploaded data onto Reactome reactions and pathways. Once data loading is completed, users can perform various global analyses, including differential analyses (for example with uni-variate statistical tests), pathway activity enrichment, principal component analyses (PCA) and clustering. To assist in data interpretation, mapped results are shown in multiple interactive tables that are linked to each other. Selecting an entry in one table will filter entries in other related tables. Groups of related entities can also be created and analysed within GraphOmics.
Overall system design
GraphOmics is a Web-based system developed using open-source technologies. The client (browser) side is built upon HTML and JavaScript, while charting functionalities are provided through libraries such as D3 and Plotly. The server side runs on the Django 2 Web framework and the Python 3 programming language. Common statistical methods such as t-tests and PCA are implemented using the NumPy and SciPy libraries in Python, while differential analyses using DESeq2 [13] and limma [14] are provided through R. A SQLite database is used to store relational data. A local copy of the Reactome knowledge base [12] is downloaded and accessed from the Django Web application through a Neo4j graph database.
Data uploading
To begin analysis in GraphOmics, users upload their transcript, protein or metabolite data to the system. Uploaded measurements should be provided as matrices in comma-separated values (CSV) format, where the rows are the entity IDs, the columns are the samples, and the entries are the measurements. To facilitate mapping, GraphOmics requires each row to be labelled with the appropriate ID for the omics type. These are Ensembl IDs for transcripts, UniProt IDs for proteins, and KEGG or ChEBI IDs for metabolites. There is no limit on the computational allowance or on the size of the measurement CSV that can be uploaded; however, in our experience about 100 to 200 samples is a reasonable size, beyond which GraphOmics may become slow to use.
GraphOmics also requires information on the assignment of samples to experimental groups. Users can specify this by including in the measurement CSV a second row that begins with the label ‘group’, where the column values are the group assignments. This information can also be provided in a separately uploaded design CSV, where the first ‘sample’ column specifies the sample name and the second ‘group’ column the grouping information. Other experimental conditions can be included as additional columns in the file.
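As an illustration of the layout described above, the following is a minimal sketch of reading a measurement CSV whose second row carries the group labels. It is not the GraphOmics loader: the function name `parse_measurements`, the `id` header cell, the sample names and the entity IDs are all hypothetical and chosen only for the example.

```python
import csv
import io


def parse_measurements(text):
    """Split a measurement CSV into sample groups and per-entity values.

    Assumes the first row holds sample names, an optional second row
    starts with 'group', and every other row is an entity ID followed
    by one measurement per sample.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    samples = rows[0][1:]
    body = rows[1:]
    groups = {}
    if body and body[0][0].strip().lower() == "group":
        groups = dict(zip(samples, body[0][1:]))
        body = body[1:]
    measurements = {
        row[0]: dict(zip(samples, (float(value) for value in row[1:])))
        for row in body
    }
    return groups, measurements


# Hypothetical input: IDs and values are made up for illustration.
example = """id,s1,s2,s3,s4
group,control,control,treated,treated
ENSG_EXAMPLE_1,10.0,11.0,20.0,21.0
ENSG_EXAMPLE_2,5.0,5.5,4.0,4.5
"""

groups, measurements = parse_measurements(example)
```

Here `groups` maps each sample name to its experimental group, and `measurements` maps each entity ID to its per-sample values.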
Differential analysis results from outside GraphOmics can also be included during upload. This takes the form of additional fold-changes and statistical significance (p values) columns in the measurement CSV. Here the column names take the format of FC_[group1]_vs_[group2] for fold change information, and padj_[group1]_vs_[group2] for p values, with [group1] and [group2] referring to the different experimental groups. For more details on the input format, please refer to Supplementary S1.
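The column-naming convention above can be recognised with a regular expression. The sketch below is illustrative only; the helper name `parse_comparison_column` is hypothetical, and the pattern assumes that group names do not themselves contain the substring `_vs_`.

```python
import re

# Pattern for the convention described above:
# FC_[group1]_vs_[group2] and padj_[group1]_vs_[group2].
COLUMN_PATTERN = re.compile(r"^(FC|padj)_(.+?)_vs_(.+)$")


def parse_comparison_column(name):
    """Return (kind, group1, group2), or None for any other column name."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


# Hypothetical header of a measurement CSV with uploaded DE results.
header = ["s1", "s2", "FC_treated_vs_control", "padj_treated_vs_control"]
comparisons = [c for c in map(parse_comparison_column, header) if c]
```

Columns that do not match the pattern (here the sample columns `s1` and `s2`) are treated as ordinary measurements.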
Omics integration
Horizontal integration of the uploaded data is performed through an automated mapping procedure written in Cypher (the graph query language used in Neo4j). This retrieves the connections from transcripts, proteins and metabolites to reactions and pathways of the given species in Reactome, constructing a network graph of the entities, reactions and pathways involved in the dataset. Entities in this network graph are connected to one another: transcripts are linked to the proteins they encode, proteins and compounds are linked to the reactions they are involved in, and reactions are linked to the pathways that contain them.
Mapping is done using Reactome based on the species that users selected during data upload. The list of species is currently limited to the 84 species that Reactome supports (database version 77) at the time of writing. Mapping coverage in GraphOmics could grow as Reactome is regularly updated to incorporate more species and biological entities. Once mapping is completed, the results are stored in the SQLite database and presented to users in the Linked Data Browser. GraphOmics also uses other databases such as Ensembl, UniProt, ChEBI and KEGG; these are not used for mapping, but to retrieve additional contextual information about selected entities in the Info Panel.
Linked data browser
The Data Browser is the primary interface in GraphOmics that facilitates linked exploration of the integrated data. Instead of presenting an often-massive network graph, the main components of the Data Browser are five interactive tables: one for each supported omics type (transcripts, proteins and metabolites) as well as for reactions and pathways (Fig. 2).
Users interact with the Data Browser by navigating through the tables. Clicking an entity in the Data Browser selects it, and multiple entities can be selected in this manner. Selections from one table will filter entries in other tables, such that only connected items are shown according to the links between entities. As more entities of different omics types are added to the current selection, the number of entities displayed across tables is reduced to meet the filtering criteria.
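The filtering behaviour can be pictured with a small sketch. This is only one plausible reading of the linked-table logic, not the GraphOmics implementation: it assumes that a reaction stays visible only if it is linked to every selected entity, and that an entity stays visible if it shares at least one visible reaction. The function name `filter_tables` and all IDs are hypothetical.

```python
def filter_tables(entity_reactions, selected):
    """Return (visible_reactions, visible_entities) for a selection.

    entity_reactions maps an entity ID to the set of reaction IDs it is
    linked to. With an empty selection everything is shown.
    """
    if not selected:
        all_reactions = set().union(*entity_reactions.values())
        return all_reactions, set(entity_reactions)
    # Assumption: keep only reactions linked to every selected entity.
    visible_reactions = set.intersection(
        *(entity_reactions[entity] for entity in selected)
    )
    # Assumption: keep entities that share at least one visible reaction.
    visible_entities = {
        entity
        for entity, reactions in entity_reactions.items()
        if reactions & visible_reactions
    }
    return visible_reactions, visible_entities


# Toy links between entities and reactions (made up for illustration).
links = {
    "transcript_A": {"R1", "R2"},
    "protein_A": {"R1", "R2"},
    "protein_B": {"R1"},
    "metabolite_X": {"R2", "R3"},
}

one = filter_tables(links, ["transcript_A"])
two = filter_tables(links, ["transcript_A", "metabolite_X"])
```

In this toy example, adding `metabolite_X` to the selection narrows the visible reactions from `R1` and `R2` to `R2` alone, and `protein_B` drops out of the visible entities.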
In this manner, users can explore the data starting from a global view where all entities are shown, and successively narrow down to more specific entities that are related to the selected items. This ‘drill-down’ interactivity in the Data Browser could help reveal the relationships among biological entities of interest and their reactions and pathways across omics.
In the case where users explore the data with no particular features in mind, GraphOmics allows users to perform differential analyses to highlight significantly changing entities, as well as pathway activity analyses to highlight potentially interesting biological processes. This generates an initial list of significantly changing features, which can be ranked by fold changes and p values in the Data Browser. Significant features can then be explored in relation to active pathways (from pathway analysis), and in relation to clustering with other significant features in the integrated Clustergrammer views. This provides a starting point for hypothesis generation.
Contextual information panel
Selected entries in the Data Browser are also associated with contextual information under each table (Fig. 3). This includes plots of the measurements of that entity across conditions as well as links to external databases (Fig. 3a, b). For transcripts, the Harmonizome Web service [15] is used to retrieve an additional description of the gene, as well as links to Ensembl and GeneCards. For proteins, the name, catalytic activity, pathways, gene ontology terms, and links to UniProt and SWISS-MODEL of the currently selected proteins are displayed. For compounds, the KEGG and ChEBI IDs, formula and SMILES string, links to the respective databases, and compound structures are retrieved. For reactions and pathways, a descriptive summary is displayed by querying Reactome (Fig. 3c). An interactive pathway viewer utilising the Reactome Pathway Diagram Viewer (DiagramJS) is also available (Fig. 3d). Measured values of transcripts, proteins and metabolites can be overlaid on top of the interactive pathway diagrams.
Ranking and filtering
All interactive tables in the Data Browser allow entities to be ranked and sorted according to their fold changes and p values. This can be used to explore the most significantly changing entities across omics that are differentially expressed (DE). In conjunction with linked interactions, the interface allows users to easily navigate through the top DE entities from one omics and inspect whether they are linked to DE entities from other omics. Entities are also connected to pathways, which can be subjected to enrichment analysis within GraphOmics. In this manner, users can easily rank DE entities and determine which enriched pathways they are connected to. Additionally, the Query Builder in GraphOmics allows complex queries to be defined on the data (Fig. 4). From the Query Builder, a query can be defined using comparison operators to filter entities by their p values and fold changes. Queries spanning multiple omics data can also be defined by concatenating the constituent single-omics queries (performing a logical AND operation on them).
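A query of this kind can be sketched as a list of conditions joined by a logical AND. The code below is a simplified illustration rather than the Query Builder itself; the function name `run_query`, the field names `padj` and `FC`, and the thresholds are assumptions made for the example.

```python
import operator

# Comparison operators a condition may use.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def run_query(rows, conditions):
    """Return the IDs of rows that satisfy every (field, op, value) condition."""
    return {
        row["id"]
        for row in rows
        if all(
            OPERATORS[op](row[field], value)
            for field, op, value in conditions
        )
    }


# Hypothetical table rows with adjusted p values and fold changes.
rows = [
    {"id": "protein_A", "padj": 0.01, "FC": 2.5},
    {"id": "protein_B", "padj": 0.20, "FC": 3.0},
    {"id": "protein_C", "padj": 0.03, "FC": 1.1},
]

hits = run_query(rows, [("padj", "<=", 0.05), ("FC", ">=", 2.0)])
```

Concatenating further conditions to the list narrows the result, since every condition must hold for a row to be returned.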
Creating and analysing groups
GraphOmics allows any set of entities that have been selected by users to be saved as a selection group. These groups can later be loaded for future use. A group of related entities (for instance the top DE entities, the members of a cluster, or some pathways of interest) can be defined, saved and loaded for future analysis. Selection groups can be easily visualised and plotted. For transcriptomics data, gene ontology analysis can be performed using the Python package GOATools [16] to discover enriched GO terms associated with a group. Interactive heatmaps and clustering analysis using Clustergrammer can also be performed on any group. Finally, users can annotate groups on the GraphOmics platform for reporting purposes.
Global analysis of multi-omics data
Differential expression analysis
A common task in omics data analysis is to find entities that are differentially expressed (DE) across different experimental conditions. If users have performed their own DE analysis, the statistical significance (p values) of entities can be uploaded as part of the data loading process. Otherwise, from the Inference page in GraphOmics, users can execute standard uni-variate t-tests (with the Benjamini-Hochberg procedure for controlling the false discovery rate). Widely used methods such as DESeq2 and limma can also be run as an option. The resulting p values from DE analyses are shown in the interactive tables of the Data Browser, alongside the entity names and measured values.
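For reference, the Benjamini-Hochberg adjustment mentioned above can be written in a few lines. This is a generic sketch of the standard procedure in plain Python, not the code used by GraphOmics, and the function name `benjamini_hochberg` is our own.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values
    # never increase as the raw p values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p values from four uni-variate tests.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

Each raw p value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone.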
Interactive clustering and heatmap
Heatmap visualisation is performed using Clustergrammer [17], a Web component that integrates interactive heatmap and hierarchical clustering to visualise high-dimensional biological data. Clustergrammer provides many interactive features to explore a hierarchically clustered heatmap, including navigational features such as zooming and panning, as well as filtering features to search and select entities.
The interactivity of Clustergrammer makes it suitable for integration with GraphOmics as it works in concert with the Data Browser. Each omics type (transcripts, proteins and metabolites) in the Data Browser is associated with a Clustergrammer component (Fig. 5). Clustergrammer was modified such that selecting entities in the Data Browser also performs the same selection in the corresponding Clustergrammer component, and vice versa.
Clustergrammer integration means users can generate a heatmap and perform cluster analysis for any selections in the Data Browser. For instance, this includes the ability to display the heatmap of entities in a pathway (or in several pathways), or to discover the clusters of proteins and metabolites linked to top DE transcripts. The interaction also goes the other way, such that selecting a cluster in Clustergrammer also selects its member entities in the Data Browser. This allows users to examine the DE members of a cluster and their connections to reactions and pathways.
Principal component analysis
PCA can be used to assess the global similarity of samples across different conditions. In GraphOmics, a PCA analysis is created from the Inference page by selecting the omics type and the number of components to use. The results from PCA analysis include plots of the projected samples for the first two principal components, as well as a scree plot showing the percentage of variance explained by the different components. The latter plot can be examined to determine how many components to retain for analysis.
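The quantities shown in a scree plot are simple to compute once the variance of each component is known. The sketch below assumes those variances are already available (for example from a PCA routine) and uses a hypothetical 80% cumulative target as the retention rule; neither the function names nor the target come from GraphOmics.

```python
def explained_variance(component_variances):
    """Return (percentages, cumulative percentages) per component."""
    total = sum(component_variances)
    percentages = [100.0 * v / total for v in component_variances]
    cumulative = []
    running = 0.0
    for percentage in percentages:
        running += percentage
        cumulative.append(running)
    return percentages, cumulative


def components_to_retain(component_variances, target_percent=80.0):
    """Smallest number of leading components reaching the cumulative target."""
    _, cumulative = explained_variance(component_variances)
    for count, value in enumerate(cumulative, start=1):
        if value >= target_percent:
            return count
    return len(component_variances)


# Hypothetical component variances, sorted from largest to smallest.
variances = [4.0, 3.0, 2.0, 1.0]
percentages, cumulative = explained_variance(variances)
retain = components_to_retain(variances)
```

A cumulative-variance target is only one of several common heuristics; looking for an ‘elbow’ in the scree plot is another.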
Pathway activity analysis
Enrichment of a pathway often suggests relevant biochemical activities happening in that pathway. In GraphOmics, pathway activity analysis can be performed by considering a single omics dataset separately, or from multiple omics datasets at once. To prioritise changing pathways in single omics data, we developed a Python library named PALS [18] that presents a unified wrapper to the following algorithms: Over-representation Analysis (ORA); Gene Set Enrichment Analysis (GSEA) [19]; and Pathway Level Analysis of Gene Expression (PLAGE) [20]. Originally developed for metabolomics, PALS was extended in GraphOmics to be able to also deal with transcript and protein data.
The three pathway ranking methods in PALS represent diverse approaches to enrichment analysis. ORA is widely used to assess the probability of over-representation of DE entities in a pathway using the hypergeometric test. GSEA is considered a ‘second-generation’ method that takes into account the correlation between sets of entities to assess DE pathways. Finally, PLAGE is a method based on singular value decomposition which was found to be the best performing [21], achieving the highest detection of changing pathways.
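To make the ORA step concrete, the hypergeometric tail probability can be computed directly from binomial coefficients. This is a textbook sketch, not the PALS implementation; the function name `ora_p_value` and the counts in the example are hypothetical.

```python
from math import comb


def ora_p_value(total, pathway_size, de_total, de_in_pathway):
    """Probability of seeing at least de_in_pathway DE entities in a pathway.

    total: number of measured entities in the background
    pathway_size: number of background entities in the pathway
    de_total: number of DE entities in the background
    de_in_pathway: number of DE entities observed in the pathway
    """
    upper = min(pathway_size, de_total)
    # Sum the hypergeometric probabilities for k = de_in_pathway .. upper.
    favourable = sum(
        comb(pathway_size, k) * comb(total - pathway_size, de_total - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return favourable / comb(total, de_total)


# Hypothetical counts: 20 entities, 5 in the pathway, 4 DE, 3 DE in the pathway.
p_value = ora_p_value(total=20, pathway_size=5, de_total=4, de_in_pathway=3)
```

A small p value indicates that the pathway contains more DE entities than would be expected if DE entities were drawn at random from the background.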
From the Inference page, users can choose to run any of these methods on the GraphOmics server. For any of the pathway ranking methods, the p values of significantly changing pathways are collected and displayed with pathway names in the Data Browser. This allows pathways to be ranked, sorted and filtered in the same manner as entities.
Multi-omics pathway activity analysis
GraphOmics offers a way to perform pathway analysis separately on each omics and to integrate the results at the end. The separate pathway analysis results, obtained from the different omics datasets, can be combined with an AND operator in the Query Builder. For instance, from the Query Builder users can easily filter pathways that are significantly changing based on the transcriptomics AND proteomics AND metabolomics measurements.
For a different approach that considers multiple omics data together during analysis, users can run the Reactome Analysis Service, which offers a high-performance multi-omics over-representation analysis using the Reactome server [22]. The IDs of DE entities (across multiple omics) are selected according to a user-defined threshold on the p values, which defaults to \(\le 0.05\). The collected IDs of DE entities are sent to the Reactome Analysis Service, which performs pathway analysis through ORA on the Reactome server. An analysis token is returned, and the results of DE pathways and their p values are retrieved in GraphOmics and displayed on the Data Browser for sorting and filtering. Note that Reactome will delete a submitted analysis on their server after a period of inactivity (7 days). In this case, users could resubmit the analysis from GraphOmics to Reactome to generate updated Reactome links that work.
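The selection of DE entity IDs prior to submission can be sketched as follows. Only the thresholding step is shown; the call to the Reactome Analysis Service is deliberately omitted, and the function name `collect_de_ids`, the table layout and the IDs are assumptions made for illustration.

```python
def collect_de_ids(p_values_by_omics, threshold=0.05):
    """Return the sorted IDs whose p value is <= threshold in any omics table."""
    selected = set()
    for table in p_values_by_omics.values():
        selected.update(
            entity_id for entity_id, p in table.items() if p <= threshold
        )
    return sorted(selected)


# Hypothetical p values keyed by omics type and entity ID.
p_values = {
    "transcripts": {"transcript_A": 0.01, "transcript_B": 0.30},
    "proteins": {"protein_A": 0.05, "protein_B": 0.06},
    "metabolites": {"metabolite_X": 0.002},
}

de_ids = collect_de_ids(p_values)
```

The default threshold mirrors the \(\le 0.05\) default described above, so an entity with a p value of exactly 0.05 is included.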
Exporting of results
GraphOmics allows users to export the mapping results of all entities, as well as their corresponding secondary information (reactions and pathways, p values, fold-changes). For tabular results, this can be accomplished by clicking on the Export button in the respective tables of the Data Browser. Results from interactive heatmap and clustering could also be exported by clicking on the ‘Take snapshot’ button in each Clustergrammer component.