A MATLAB tool for pathway enrichment using a topology-based pathway regulation score
© Ibrahim et al.; licensee BioMed Central Ltd. 2014
Received: 30 June 2014
Accepted: 22 October 2014
Published: 4 November 2014
Handling the vast amount of gene expression data generated by genome-wide transcriptional profiling techniques is a challenging task, demanding an informed combination of pre-processing, filtering and analysis methods if meaningful biological conclusions are to be drawn. For example, a range of traditional statistical and computational pathway analysis approaches have been used to identify over-represented processes in microarray data derived from various disease states. However, most of these approaches tend not to exploit the full spectrum of gene expression data, or the various relationships and dependencies. Previously, we described a pathway enrichment analysis tool created in MATLAB that yields a Pathway Regulation Score (PRS) by considering signalling pathway topology, and the overrepresentation and magnitude of differentially-expressed genes (J Comput Biol 19:563-573, 2012). Herein, we extended this approach to include metabolic pathways, and described the use of a graphical user interface (GUI).
Using input from a variety of microarray platforms and species, users are able to calculate PRS scores, along with a corresponding z-score for comparison. Further pathway significance assessment may be performed to increase confidence in the pathways obtained, and users can view Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway diagrams marked-up to highlight impacted genes.
The PRS tool provides a filter in the isolation of biologically-relevant insights from complex transcriptomic data.
Increasingly, high-throughput transcriptional profiling techniques (microarrays or, increasingly, RNAseq) inform modern life-science research. Such techniques provide a molecular “camera” taking genome-wide “snap-shots” of genetic activity. However, the effective analysis of microarray data presents a number of challenges, in particular handling the large number of genes that are studied simultaneously.
Over-representation Analysis (ORA): This scores a pathway by considering the proportion of differentially-expressed genes (DEGs) observed in each pathway relative to the proportion of all microarray DEGs. This is used by several pathway analysis tools, including GenMAPP , GoMiner , Onto-Express  and FatiGo .
Functional Class Scoring (FCS): FCS gives a score to each gene in a pathway based on its expression, from which a pathway-score is calculated based on the scores of all the genes in the pathway. A number of FCS methods have been implemented through standalone tools such as GSEA , SigPathway , and SAFE , or web tools such as T-profiler , Gazer  and GeneTrail .
Pathway Topology (PT)-based approaches: These approaches exploit the topology of pathways by giving weights to pre-defined connections between genes, which inform pathway scoring. Several topology-based approaches have been described in the literature over the past few years. According to Mitrea et al , PT-based approaches differ in the way they translate pathway topology information into a pathway score. Some methods use only the topology data of differentially-expressed genes (DEGs) in the enrichment score (for example MetaCore  and EnrichNet ), whereas others (including SPIA  and GANPA ) use expression data of DEGs along with the topology data. Alternatively, some methods use expression data derived from all microarray genes, whether they change between conditions or not, for example PathOlogist , DEGraph , and ACST . Importantly, some PT-based tools use only signalling pathway descriptions, such as Pathway-Express , NetGSA , ScorePAGE , TAPPA  MetPA , and Clipper .
Previously, we proposed a new pathway enrichment method, in which both pathway topology and the magnitude of gene expression changes informed the creation of a Pathway Regulation Score (PRS) . Specifically, by combining fold-change data for those transcripts exceeding a significance threshold, and by taking into account the potential of altered gene expression to impact upon downstream transcription, we identified those pathways most relevant to the pathophysiological process under investigation. Our approach addressed a number of issues that potentially compromise enrichment methods. We took steps to mitigate the influence of errors in ID mapping, and to reduce the bias introduced by highly-redundant pathways (i.e. multiple instances of the same gene). Topology methods also have to handle loops effectively, so we used a search algorithm derived from graph theory to resolve this problem. We also felt that arbitrarily dividing processes into either up- or down-regulated was artificial as changes in gene expression are likely to be distributed throughout pathways, thus ours was an overall impact assessment.
Herein, we described the implementation of our PRS approach as a standalone tool that provides end users with the option of importing data from different microarray platforms and species. The tool yields both PRS and z-scores, provides statistical analysis, and allows browsing of pathways with impacted genes highlighted in different colours. An enhancement from our original report is that users are able to enrich both signalling and metabolic pathways.
Preprocessing microarray data
We did not re-engineer a filter to normalise data from a variety of platforms, rather users must first preprocess transcriptomic data using one of the myriad existing tools. Data must be in the form of a simple Excel spreadsheet, in which the first column should be probe ID, and the following columns normalised replicated expression values from the control and test conditions. Additional information regarding species, sample numbers, fold-change and t-test thresholds, normalisation method and platform is required.
Representing pathways as graphs had an additional advantage as it reduced redundancy in that genes were only represented once in any pathway graph. A Depth-First Search (DFS) algorithm, derived from Graph Theory was used to ensure that loops were only counted once.
Our method assigned weights to all significant nodes (i.e. DEGs) in a pathway to reflect their topological strength (specifically the number of significant downstream nodes that are pointed to, either directly or via other significant nodes as described previously ). A PRS was calculated on the basis of fold-change value and weighting of all significant nodes in the pathway and normalized for pathway size. We also calculated a z-score  (with an improvement over earlier implementations in that this was performed after removing redundant genes from pathway descriptions). The software outputs two lists of pathways ranked according to PRS and z-score, saved as both Excel and .mat files for later analysis.
Pathway significance assessment
We then went on to establish the probability of achieving scores at least as high as the PRS score by chance using a non-parametric permutation method. Initially, fold-change values for all expressed microarray genes were permuted. These values were then mapped back onto pathways, and a PRS recalculated. This process was repeated n times, where n is provided by the user through the interface (typically n = 1000). The statistical significance (p-value) of each pathway score was estimated by a comparison between the observed score and the n random scores generated. To achieve more reliable statistical significance evaluation, p-values were adjusted for multiple-test correction by a False Discovery Rate (FDR) method based on a threshold provided by the user. This is described in more detail in our original report .
Visualizing enriched pathways
UML for modelling and software description
Herein, we used Unified Modelling Language (UML) to describe, model and visualize the structure and functions of our method by diagrams. There are 14 types of diagrams classified in three categories in UML 2.0 , however, in this paper we used only two: class and sequence diagrams. Class diagrams represent static structures or main objects in the software. Figure 4 shows the key classes at the pathway analysis stage. The class “Analysis” is the main class, which provides an interface to run all the services provided by the tool. It has four main attributes:
▪ MicroarrayObject: an object of the class “Microarray_Dataset” built by calling initialiseMicroarray() function (see Additional file 2). This holds the normalised gene expression data, and a list of all genes with their fold-change values.
▪ kgmlObject: an object of the class “KGML_Parser” built by calling the parseKGML() function (see Additional file 3). This holds the static structure of all pathways as a list of objects of “KGML_Path” class that is defined by KGML format. An object of “KGML_Path” represents the structure of one KEGG pathway and is composed of entriesList, reactionsList, and relationsList (see Additional file 1).
▪ PathList: this is a list of objects of the class “Pathway” which is created by calling CreatePathListFromKegg() function (see Additional file 4). This object ultimately holds a list of pathways enriched with reference to a given microarray dataset.
▪ rankedPaths: this object is created by calling the rankPaths() function. It holds the same list of pathways defined by PathsList, but they are ranked in descending order based on PRS values.
Conversion of pathways into graphs by the convertPath2Graph() function, which requires the usage of kgmlObject that holds a list of entries, relations and reactions of all pathways.
Using information stored in kgmlObject and PathsList for each graph (see Figure 4), a list of nodes is created (where each node represents one or more genes from the original pathway) and a list of children for each node.
Removal of redundant genes, which may be represented many times in the same pathway. Two functions are designed to deal with node redundancy: checkNodeRedundancy() and handleNodeRedundancy().
After building a graph for each pathway, graphs are weighted by calling the createWeightedGraphs() function, which uses the DFS algorithm to traverse the nodes of each graph and assign a weight for each significant node taking into account the loops in the graph.
A pathway regulation score (PRS) is assigned to each weighted graph using the weights of the significant nodes in the graph and other parameters.
We implemented all these classes, functions, and DFS algorithm using MATLAB R2010a.
Results and discussion
Top ten pathways ranked by PRS (T2D and pancreatic islets dataset)
Arachidonic acid metabolism
Cytokine-cytokine receptor interaction
TGF-beta signalling pathway
Complement and coagulation cascades
PPAR signaling pathway
Pathways in cancer
Type II diabetes mellitus
MAPK signaling pathway
Fatty acid metabolism
Top ten pathways ranked by Z-score (T2D and pancreatic islets dataset)
Arachidonic acid metabolism
TGF-beta signaling pathway
Complement and coagulation cascades
PPAR signaling pathway
Cytokine-cytokine receptor interaction
Fatty acid metabolism
Intestinal immune network for IgA production
Cell adhesion molecules (CAMs)
Staphylococcus aureus infection
The rapid development of high-throughput genomic technologies and the deposition of their output in open-access databases has produced huge amounts of biolo-gical data. Mining and interpreting these data has driven innovation in the field of computational biology, leading to the emergence of sophisticated tools to produce reliable, meaningful and testable results. This is important as these kinds of experiments are expensive, and new tools are likely to add value to pre-existing analysis.
In this paper, we address two areas; firstly, the extension of our PRS enrichment algorithm  to include both metabolic and signalling pathways; and secondly, to provide a detailed description of a GUI that facilitates array analysis by both PRS and z-scoring. The improved tool handles a number of challenges, notably in ID mapping, redundancy in pathway descriptions and statistical significance assessment. Unlike z-scoring, the PRS algorithm takes into account the topology of a pathway (the relationships between genes) and the magnitude of gene expression changes to identify impacted pathways. For these reasons, we argue that PRS enrichment yields more biologically-relevant insights compared to those provided by the standard hypergeometric method. It was not feasible to compare performance to other PT methods as the additional preprocessing steps taken to reduce redundancy in KEGG descriptions are not easily implemented in other methods without considerable re-engineering. The behaviour of signalling and metabolic pathways is, of course, distinct. However, as our approach was to assess transcriptional changes in a pathway, rather than to predict an effect on the function of a pathway, we felt it was reasonable to evaluate impact on signalling and biochemical pathways using a single method. In this way, we were able to detect biochemical pathways known to be perturbed in metabolic disease. A key tenet of this kind of analysis is that biomedical scientists are guided in the subsequent investigation of targets revealed by transcriptional profiling studies. Unfortunately, there is no unambiguous statistical test that allows investigators to be certain that any pathway highlighted is worthy of further study (and considerable expense). The use of permutation-based approaches are commonly used to determine the likelihood of an enrichment score being achieved by chance, and by adjusting P values by FDR can increase investigators’ confidence that a result is meaningful.
In summary, we suggest that providing researchers with a choice of analysis tools, informed by distinct rationales, will allow evidence to be combined or contrasted in order to facilitate more informed decision making.
Availability and requirements
Project name: PRS_software.
Operating system(s): Platform independent.
Programming language: MATLAB.
Other requirements: MATLAB 2010a or higher. If MATLAB is not installed on your PC, you need to install the MCR (Matlab Compiler Runtime) environment first and then run the PRS tool.
Restrictions for use: None.
MAI conceived the method, generated the code and performed the testing. SJ provided guidance in the use of the DFS algorithm and assisted with statistical analysis. MAC provided invaluable insights during the development process. KL developed the algorithm in collaboration with MAI and assisted with the biological analysis. All authors were involved in preparing the manuscript, and all approved the final draft.
We wish to thank Dr Madhumita Das, a Master's student in our group, for her feedback on the tool. We also would like to thank the Buckingham Institute of Translational Medicine for funding this research.
- Glazko GV, Emmert-Streib F: Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics. 2009, 25 (18): 2348-2354. 10.1093/bioinformatics/btp406.View ArticlePubMed CentralPubMedGoogle Scholar
- Khatri P, Sirota M, Butte AJ: Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Comput Biol. 2012, 8 (2): e1002375-10.1371/journal.pcbi.1002375.View ArticlePubMed CentralPubMedGoogle Scholar
- Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet. 2002, 31 (1): 19-20. 10.1038/ng0502-19.View ArticlePubMedGoogle Scholar
- Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 2003, 4 (4): R28-10.1186/gb-2003-4-4-r28.View ArticlePubMed CentralPubMedGoogle Scholar
- Khatri P, Draghici S, Ostermeier GC, Krawetz SA: Profiling Gene Expression Using Onto-Express. Genomics. 2002, 79 (2): 266-270. 10.1006/geno.2002.6698.View ArticlePubMedGoogle Scholar
- Al-Shahrour F, Díaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics. 2004, 20 (4): 578-580. 10.1093/bioinformatics/btg455.View ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005, 102 (43): 15545-10.1073/pnas.0506580102.View ArticlePubMed CentralPubMedGoogle Scholar
- Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA. 2005, 102 (38): 13544-10.1073/pnas.0506577102.View ArticlePubMed CentralPubMedGoogle Scholar
- Barry WT, Nobel AB, Wright FA: Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics. 2005, 21 (9): 1943-1949. 10.1093/bioinformatics/bti260.View ArticlePubMedGoogle Scholar
- Boorsma A, Foat BC, Vis D, Klis F, Bussemaker HJ: T-profiler: scoring the activity of predefined groups of genes using gene expression data. Nucleic Acids Res. 2005, 33: W592-W595. 10.1093/nar/gki484.View ArticlePubMed CentralPubMedGoogle Scholar
- Kim S-B, Yang S, Kim S-K, Kim SC, Woo HG, Volsky DJ, Kim S-Y, Chu I-S: GAzer: gene set analyzer. Bioinformatics. 2007, 23 (13): 1697-1699. 10.1093/bioinformatics/btm144.View ArticlePubMedGoogle Scholar
- Backes C, Keller A, Kuentzer J, Kneissl B, Comtesse N, Elnakady YA, Muller R, Meese E, Lenhof H-P: GeneTrail-advanced gene set enrichment analysis. Nucleic Acids Res. 2007, 35: W186-W192. 10.1093/nar/gkm323.View ArticlePubMed CentralPubMedGoogle Scholar
- Mitrea C, Taghavi Z, Bokanizad B, Hanoudi S, Tagett R, Donato M, Voichiţa C, Drăghici S: Methods and approaches in the topology-based analysis of biological pathways. Front Physiol. 2013, 4: 278-10.3389/fphys.2013.00278.View ArticlePubMed CentralPubMedGoogle Scholar
- MetaCore™: , [http://thomsonreuters.com/metacore/]
- Glaab E, Baudot A, Krasnogor N, Schneider R, Valencia A: EnrichNet: network-based gene set enrichment analysis. Bioinformatics. 2012, 28 (18): i451-i457. 10.1093/bioinformatics/bts389.View ArticlePubMed CentralPubMedGoogle Scholar
- Amin K: Pathway-express: A Bioinformatics Tool for Pathway Level Analysis Using Gene Expression Data. 2007Google Scholar
- Fang Z, Tian W, Ji H: A network-based gene-weighting approach for pathway analysis. Cell Res. 2012, 22 (3): 565-580. 10.1038/cr.2011.149.View ArticlePubMed CentralPubMedGoogle Scholar
- Greenblum SI, Efroni S, Schaefer CF, Buetow KH: The PathOlogist: an automated tool for pathway-centric analysis. BMC Bioinformatics. 2011, 12 (1): 133-10.1186/1471-2105-12-133.View ArticlePubMed CentralPubMedGoogle Scholar
- Jacob L, Neuvial P, Dudoit S: Gains in power from structured two-sample tests of means on graphs. 2010Google Scholar
- Mieczkowski J, Swiatek-Machado K, Kaminska B: Identification of Pathway Deregulation - Gene Expression Based Analysis of Consistent Signal Transduction. PLoS One. 2012, 7 (7): e41541-10.1371/journal.pone.0041541.View ArticlePubMed CentralPubMedGoogle Scholar
- Khatri P, Voichita C, Kattan K, Ansari N, Khatri A, Georgescu C, Tarca AL, Draghici S: Onto-Tools: new additions and improvements in 2006. Nucleic Acids Res. 2007, 35: W206-W211. 10.1093/nar/gkm327.View ArticlePubMed CentralPubMedGoogle Scholar
- Shojaie A, Michailidis G: Analysis of Gene Sets Based on the Underlying Regulatory Network. J Comput Biol. 2009, 16 (3): 407-426. 10.1089/cmb.2008.0081.View ArticlePubMed CentralPubMedGoogle Scholar
- Rahnenführer J, Domingues FS, Maydt J, Lengauer T: Calculating the Statistical Significance of Changes in Pathway Activity From Gene Expression Data. Stat Appl Genet Mol Biol. 2004, 31: 1544-6115.Google Scholar
- Gao S, Wang X: TAPPA: topological analysis of pathway phenotype association. Bioinformatics. 2007, 23 (22): 3100-3102. 10.1093/bioinformatics/btm460.View ArticlePubMed CentralPubMedGoogle Scholar
- Xia J, Wishart DS: MetPA: a web-based metabolomics tool for pathway analysis and visualization. Bioinformatics. 2010, 26 (18): 2342-2344. 10.1093/bioinformatics/btq418.View ArticlePubMedGoogle Scholar
- Martini P, Sales G, Massa MS, Chiogna M, Romualdi C: Along signal paths: an empirical gene set approach exploiting pathway topology. Nucleic Acids Res. 2012, 41 (1): e19-10.1093/nar/gks866.View ArticlePubMed CentralPubMedGoogle Scholar
- Ibrahim MA, Jassim S, Cawthorne MA, Langlands K: A Topology-Based Score for Pathway Enrichment. J Comput Biol. 2012, 19 (5): 563-573. 10.1089/cmb.2011.0182.View ArticlePubMedGoogle Scholar
- Kyoto Encyclopaedia of Genes and Genomes, data retrieved May 2012 from , [http://www.genome.jp/kegg/]
- Cheadle C, Vawter MP, Freed WJ, Becker KG: Analysis of Microarray Data Using Z Score Transformation. J Mol Diagn. 2003, 5 (2): 73-81. 10.1016/S1525-1578(10)60455-2.View ArticlePubMed CentralPubMedGoogle Scholar
- Unified Modeling Language™ (UML®): , [http://www.uml.org/]
- Taneera J, Lang S, Sharma A, Fadista J, Zhou Y, Ahlqvist E, Jonsson A, Lyssenko V, Vikman P, Hansson O, Parikh H, Korsgren O, Soni A, Krus U, Zhang E, Jing X-J, Esguerra JLS, Wollheim CB, Salehi A, Rosengren A, Renström E, Groop L: A Systems Genetics Approach Identifies Genes and Pathways for Type 2 Diabetes in Human Islets. Cell Metab. 2012, 16 (1): 122-134. 10.1016/j.cmet.2012.06.006.View ArticlePubMedGoogle Scholar
- Persaud SJ, Muller D, Belin VD, Kitsou-Mylona I, Asare-Anane H, Papadimitriou A, Burns CJ, Huang GC, Amiel SA, Jones PM: The Role of Arachidonic Acid and Its Metabolites in Insulin Secretion From Human Islets of Langerhans. Diabetes. 2007, 56 (1): 197-203. 10.2337/db06-0490.View ArticlePubMedGoogle Scholar
- Yaney GC, Corkey BE: Fatty acid metabolism and insulin secretion in pancreatic beta cells. Diabetologia. 2003, 46 (10): 1297-1312. 10.1007/s00125-003-1207-4.View ArticlePubMedGoogle Scholar
- McGarry JD: Banting lecture 2001 Dysregulation of fatty acid metabolism in the etiology of type 2 diabetes. Diabetes. 2002, 51 (1): 7-18. 10.2337/diabetes.51.1.7.View ArticlePubMedGoogle Scholar
- Sugden MC, Holness MJ: Potential Role of Peroxisome Proliferator-Activated Receptor-α in the Modulation of Glucose-Stimulated Insulin Secretion. Diabetes. 2004, 53 (1): S71-S81. 10.2337/diabetes.53.2007.S71.View ArticlePubMedGoogle Scholar
- Kim H-S, Hwang Y-C, Koo S-H, Park KS, Lee M-S, Kim K-W, Lee M-K: PPAR-γ Activation Increases Insulin Secretion through the Up-regulation of the Free Fatty Acid Receptor GPR40 in Pancreatic β-Cells. PLoS One. 2013, 8 (1): e50128-10.1371/journal.pone.0050128.View ArticlePubMed CentralPubMedGoogle Scholar
- Prentki M, Nolan CJ: Islet cell failure in type 2 diabetes. J Clin Invest. 2006, 116 (7): 1802-1812. 10.1172/JCI29103.View ArticlePubMed CentralPubMedGoogle Scholar
- Tomas A, Yermen B, Min L, Pessin JE, Halban PA: Regulation of pancreatic β-cell insulin secretion by actin cytoskeleton remodelling role of gelsolin and cooperation with the MAPK signalling pathway. J Cell Sci. 2006, 119 (10): 2156-2167. 10.1242/jcs.02942.View ArticlePubMedGoogle Scholar
- Tanizawa Y, Riggs AC, Chiu KC, Janssen RC, Bell DS, Go RPC, Roseman JM, Acton MT, Permutt MA: Variability of the pancreatic islet beta cell/liver (GLUT 2) glucose transporter gene in NIDDM patients. Diabetologia. 1994, 37 (4): 420-427. 10.1007/BF00408481.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.