PathMAPA: a tool for displaying gene expression and performing statistical tests on metabolic pathways at multiple levels for Arabidopsis

Background To date, many genomic and pathway-related tools and databases have been developed to analyze microarray data. In published web-based applications to date, however, complex pathways have been displayed with static image files that may not be up-to-date or are time-consuming to rebuild. In addition, gene expression analyses focus on individual probes and genes with little or no consideration of pathways. These approaches reveal little information about pathways that are key to a full understanding of the building blocks of biological systems. Therefore, there is a need to provide useful tools that can generate pathways without manually building images and allow gene expression data to be integrated and analyzed at pathway levels for such experimental organisms as Arabidopsis. Results We have developed PathMAPA, a web-based application written in Java that can be easily accessed over the Internet. An Oracle database is used to store, query, and manipulate the large amounts of data that are involved. PathMAPA allows its users to (i) upload and populate microarray data into a database; (ii) integrate gene expression with enzymes of the pathways; (iii) generate pathway diagrams without building image files manually; (iv) visualize gene expressions for each pathway at enzyme, locus, and probe levels; and (v) perform statistical tests at pathway, enzyme and gene levels. PathMAPA can be used to examine Arabidopsis thaliana gene expression patterns associated with metabolic pathways. Conclusion PathMAPA provides two unique features for the gene expression analysis of Arabidopsis thaliana: (i) automatic generation of pathways associated with gene expression and (ii) statistical tests at pathway level. The first feature allows for the periodical updating of genomic data for pathways, while the second feature can provide insight into how treatments affect relevant pathways for the selected experiment(s).


Background
Novel technologies in genome research, such as Affymetrix GeneChips and DNA microarrays, have generated large amounts of data. These data have led to a more thorough understanding of gene function, regulation, interaction, and pathways. However, analysis of microarray data at the individual gene level is not sufficient for a thorough understanding of how a pathway/network is disturbed in the discovery of new drugs or in the modification of the existing plant variety. It is important to develop userfriendly tools that allow biologists to associate gene expression with pathways, test the treatment effects at the pathway level, and extract a comprehensive overview of experimental effects from the data.
Despite the importance of understanding biological pathways through microarray data, most microarray databases lack information on pathways. For example, SMD [1], a well-known microarray database for gene expression analysis in multi-organisms, does not feature pathway components. The Kyoto Encyclopedia of Genes and Genomes (KEGG) [2] and the National Center for Genome Research (NCGR, http://www.ncgr.org/) have pathway databases but they do not have microarray data analysis or statistical test components associated with pathways. A number of software packages for microarray data analysis or integration of pathway information does exist [3][4][5][6][7]. Their functions include querying pathway information, with some databases having the capability to overlay gene expression data on pathways. These packages focus on E. coli, yeast, mouse, human, or other organisms instead of plants, though. For example, GenMAPP 1.0 [6] provides useful tools to display and test gene expression, and interactively modify pathways. However, the pathways in GenMAPP are not relevant for plants. In addition, GenMAPP statistically tests gene expression at the individual enzyme level rather than the pathway level, and it uses local data files instead of a large database. Because plants and animals have evolved independently from unicellular eukaryotes and represent highly contrasting life forms, the analysis of plant genomes can elucidate fundamental principles of biology relevant to a variety of species, including humans as well as principles unique to plants.
Several plant-specific databases have been developed. For Arabidopsis thaliana, a model organism extensively studied by plant biologists [8], TAIR [9] provides a comprehensive database for many types of information on Arabidopsis. TAIR recently introduced a pathway component that can overlay expression patterns on known biological pathways. There are other Arabidopsis databases with more specific emphases, e.g. CATMA [10].
Despite the availability of web-based pathway databases, one current limitation of these databases is their reliance on static pathway image files. With the exception of NCGR, which provides partially automatic pathway graph generation features, the pathway graphs of these databases are built manually with other tools [2][3][4][5]9]. As our knowledge of genes evolves, gene information will need to be updated and older pathway images will become outdated. Static images cannot reflect any change and it is time-consuming to rebuild such detailed new data into pathways. Another limitation of the existing databases is that the analysis of gene expression data focuses on individual genes instead of pathways. An exception is Pathway Processor [7] that offers limited test for yeast on the basis of fold change only. PathMAPA transcends these limitations. It serves as a database to upload and retrieve Arobidopsis microarray data, automatically generates pathway graphs and visualizes pathways associated with gene expression. It displays and statistically tests gene expressions at the probe, locus and enzyme levels respectively for nearly 100 pathways. PathMAPA can also be used to compare gene expressions for individual experiments or across multiple experiments.

System Architecture
PathMAPA uses a combination of Apache and Tomcat as its web server and is accessible through IE 5.0 or higher and Netscape 7 or higher from multiple platforms. The Java Plug-In can be easily installed for most platforms with the link provided. PathMAPA uses Oracle as its database management system, and Java/JSP as its programming language. Statistical components are called from Rproject through an API interface except for the normalization component that is coded in C++. The system is periodically updated. Architecturally, the system is a three-tier application ( Figure 1). Clients access the web site through the intranet/internet. The web server runs JSP, Java Servlets, and Java Beans. The Oracle database is accessed through JDBC. SQL Plus and SQL Loader are also used to optimize the performance of data processes. Functionally, PathMAPA consists of two components: Database and Pathway. The database part will not be discussed in this paper; the pathway part provides statistical and visualization tools to integrate microarray data with metabolic pathways (Figure 2).

Microarray Data
Microarray data files generated through GenePix http:// axon.com/GN_GenePixSoftware.html, exported Affymetrix http://www.affymetrix.com/ data and other types of exported microarray files can be populated into the Path-MAPA database for analysis. The owners of these files have the option to modify the experimental information after the data have been uploaded into the server. PathMAPA provides data normalization through the "Upload" link in the Pathway component, with normalization taking place after uploading but before populating into the database. For GenePix data, PathMAPA normalizes the median-ofratios column. Meanwhile, the user has the option to populate the original data into the database without normalization. An additional copy of the original text file is available for users to download.

Pathway visualization with gene expression
Several methods have been used to build pathway graphs in the literature. Most of the existing pathway software uses static images [2][3][4]9,11]. Static visualization cannot reflect any updated pathway information unless the static image is rebuilt [12]. Another approach is the semidynamic method. For example, Pathway Processor [7] puts several generated gene expression values at the same position on a corresponding KEGG static image when it visualizes pathways. This method is limited by pre-built image space. A third approach is automatic generation through standard graph layout algorithms. Examples have used algorithms for circular, orthogonal or planar drawing, and force-directed layout heuristics [13][14][15]. Becker et al. [16] used a combination of circular, hierarchical, and force-directed graph layout algorithms to compute the position of the graph elements representing main compounds and reactions. The resulting graphs are satisfactory when this method is applied for relatively regular pathways, such as circular or hierarchical pathways. However, a large number of irregular pathways exist. In addition, there is the added complexity of variations in enzyme number and gene expression for each step of a biochemical reaction. As for pathway analysis, pathway scores and distance functions have been proposed to analyze gene expression data in the context of pathways [17,18].
PathMAPA uses JSP/JavaBeans to retrieve database information for the automatic construction of pathways (Figure 3). It uses computational methods only for a small number of regular pathways, such as the circular Citrate cycle. The difference between our method and the automatic generation methods discussed above is that we draw elements for most of the complicated pathways with data stored in database tables instead of using a computational approach. The information retrieved from the database is obtained by joining a group of tables/views ( Figure  4).
Each pathway has a unique identifier. One pathway includes a list of enzymes with each enzyme including a list of genes (or loci). Each gene (or locus) may have a list of probes (accession ids) in the experiment. This relationship excludes any unrelated genes in the computation of gene expressions or test of pathways. There is a many-tomany relationship between EC_numbers. The combination of the EC_number, gene_id and accession_id columns is unique for the gene_probe table. The combination of the pathway_id, ec_number, gene_id and repeat_id columns is unique for the path_enzyme_gene table. In the query to compute the mean value of gene expression, a pathway_id is used to limit the EC_number into a particular pathway. Then a column repeat_id is used to group the repeated EC_numbers within a pathway. In each step of a pathway, such as from Ribulose 5phosphate to Ribulose 1,5 bisphosphate of Figure 3, we treat Ribulose 5-phosphate, enzyme 2.7.1.19, the reaction arrow, the gene expression value and ATP-ADP as one unit. The path_enzyme_gene table (Figure 4) includes the pathway column (pathway_id), the relative order number of the unit (sequence), the reaction direction (direction), the x and y positions of Ribulose 5-phosphate (x_position, y_position), the next unit order number (to_element), the EC number (ec_number), the enzyme The system architecture Figure 1 The system architecture. name (enzyme_name), and additional energy/chemicals like ATP-ADP involved in the reaction (addition_from, addition_to). The information for one unit is filled once for each EC_number appearance in a pathway. The compound name (Ribulose 5-phosphate) is obtained from the compound_name column by joining the path_enzyme_gene table with the pathway_compound table through the pathway_id and compound_id columns. Chemicals that are input to or output from a pathway are handled by filling in the addition_from and addition_to columns. If there is only input like H 2 O, or output like CO 2 , the addition_to column will be filled in with "input" or "output". The program can detect this and draw an up or down arrow. Additional code to handle very specific items for each pathway is very limited. The data storage for constructing pathway graphs is less than 100 M(G)B. Most of the data can be integrated from pub-lic sources, and it is simple to implement. The modules are scalable, and consist of classes that are repeatedly used for generation or visualization of different components of pathways.

Display of gene expressions on pathways at different levels for an individual experiment and across experiments
PathMAPA displays gene expressions for a given pathway and experiment at the enzyme, locus, and probe levels, respectively. The gene expressions at different levels are generated automatically from a database upon a web client's request. The value is computed by joining the table expr_header, expr_detail, gene_probe and path_enzyme_gene ( Figure 4). The microarray files are selected and joined for computation once the user has selected an experiment. Gene expressions are computed as Calvin Cycle Pathway associated with gene expressions the average of the MedianOfRatio column by grouping at a user-selected gene expression level (probe, gene or enzyme). Pathways overlaid with gene expression levels provide investigators with an overview on whether gene expressions are up or down regulated, and how the gene expressions are distributed with each enzyme. Gene expressions are displayed on the pathway graph only at the enzyme level because a pathway describes biochemical reactions with enzymes. Since multiple loci may correspond to the same enzyme in many cases, expressions of these loci allow investigators to compare multiple loci within one enzyme. Furthermore, since more than one probe may exist on a microarray that corresponds to the same locus, gene expressions from multiple probes corresponding to the same locus may provide variation across these probes. In addition to displaying gene expression data for individual experiments, PathMAPA allows investigators to view expression data for a specific pathway across a group of experiments. Currently, gene expressions for multiple experiments are displayed in text format because of the limited space of the pathway graph. Gene expression comparisons across multiple experiments provide the investigator with an overview of whether gene expression patterns are related to experimental conditions. This is especially useful for investigators with timecourse or multiple treatment level experiments. The computation of gene expressions with current microarray technology is limited in that it is currently impossible to separate gene expressions from two different reactions catalyzed by the same enzyme that has the same probes of the microarray experiments because the RNA from the sample is a mixture. The RNA from different transcriptions cannot be separated from the mixed sample either. However, as there are a limited number of genes for which this is the case, microarray technology plays an important role in biological research.
The major tables used to compute gene expression and generate pathway graphs.

Display of automatic generated pathway images
PathMAPA automatically generates pathway image upon the web client's request in contrast with KEGG, MetaCyc, EcoCyc and TAIR [2][3][4][5]9] that rely on a large number of static images (gif files). The web servers of these databases respond to a client's request to view a pathway by sending a gif image file to client computer. PathMAPA uses Java code to automatically generate the images displayed at the users' computer rather than manually built image files. The algorithm involved can be summarized as: (1) query database to compute gene expression; (2) query database to retrieve selected pathway elements and coordinate information; (3) use Java code to draw the graph. More details in solving the complexity of pathway graphs have been described in the section "Pathway visualization with gene expression". The method used by PathMAPA is quite efficient in building images and reflects updated information through easy modification of data in database. As of yet the user cannot change image on the client side, but this feature will be added soon. On the machines we have tested, the generation of a pathway graph usually takes around 5 seconds. The pathway image provides two additional features: (1) it displays the full enzyme name as a tool-tip when the user's mouse is over any Enzyme Commission (EC) number and (2) if the user clicks on an EC-number spot, a new page is opened, from which the user can explore more details. Any change of the enzyme-geneprobe, substrate-reaction-compound or experimentparameter-data will reflect on the pathway graph once the data are updated in the database table. It would take more time to re-build static images. Because biologists periodically modify some of the pathway gene lists, it is essential to be able to update pathway data when a pathway is studied in detail. The pathway construction method used in PathMAPA combines the flexibility of reusable code with the simplicity of database storage.

Statistical test for pathways
One unique advantage of PathMAPA over with TAIR is its ability to analyze pathways as a unit. For each statistical procedure, we use a Java Server Page (JSP) or Java Bean to generate data with JDBC and pass that data to the statistical component through an Application Programming Interface (API) call. The analysis result page is then displayed using JSP. Currently, Fisher's exact test [19,20] is used to test whether a given pathway is affected in a specific experiment. The enzymes, genes and probes involved in each pathway are restricted in their biological relationships as described in the implementation section. There are no unrelated genes involved in the statistical test. The Fisher's test considers the number of loci in each pathway, the number of loci with "altered" gene expression, and the total number of loci in the microarray. We have implemented a T-test across replicates to separate significant and non-significant probes. The cutoff value is determined by the P-value of the probe from the T-tests. The gene expression of a probe is considered to be significantly altered under the experimental condition if its Pvalue is less than the given threshold. We have also implemented a cutoff value method based on fold change, such as 2.0 fold [7], as another option. However, this method does not consider the variation in gene expressions because a mean value that is greater than 2 may not be statistically significant. The significance of gene expressions depends on both the mean value and the variance when sample size and significant level are given [21]. Path-MAPA can identify up-regulated and down-regulated pathways depending on whether the mean gene expression in a significant pathway is greater or less than that of the loci that are significant but are not in the pathway. Each significant pathway is marked either up-regulated or down-regulated in the output and the resulting text file can be downloaded ( Table 1). Note that if genes on a microarray are only a small portion of the whole genome The integrated pathway test graph showing the light effect on gene expressions across experiments or a small portion of the pathway, the results may not be meaningful even though the Fisher's exact test can still be performed.
PathMAPA also plots a 3-D graph ( Figure 5) for visualization of multiple experiments. The input data for plotting the graph are generated from the "Pathway Test" menu.
The results show up-regulated (red), down-regulated (green) and non-regulated (grey) at different days (F1, F2, F3). The integrated graph provides the investigator with information on how related pathways are affected under different treatments across experiments. The resulting graph can be downloaded to the user's computer.

Search of pathway elements across pathways
PathMAPA allows a biologist to quickly identify pathways to which a specific locus or probe belong. It also provides tools to search across pathways for EC-Number, GeneBank Accession ID, and Gene Ontology ID. Investigators can simply enter the item on the web page and click a button. Currently, PathMAPA contains around 100 plant pathways that can be used for queries.  Table 1. Then, after we clicked on the "Result Graph" menu, uploaded the three result files and completed the processing, we obtained the result graph ( Figure 5). When the plants were treated under white light condition versus dark condition, one of the most significant changes was the photosynthesis pathway, which can be designated as an up-regulated pathway. The graph shows that gene expression of photosynthesis was up-regulated at 1.5 days and 5 weeks after planting but not at the very young stage (the same day of planting). These results are consistent with biological knowledge that photosynthesis is activated under white light condition. On the other hand, the gene expressions of other pathways, such as nitrogen metabolism, are not on this significance list.

Identification of enzymes activated
The Calvin cycle pathway is used to demonstrate how PathMAPA can help investigators to identify individual enzymes in a pathway affected under an experimental condition. We can click on the "Select Experiment" and "Enzyme Test" links to conduct a T-test. All enzymes in the pathway are significant at the 0.05 level ( Table 2). This is consistent with the pathway test discussed above. We can display the Calvin cycle graph by using the "Select Experiment" and "Graph Model: EC" menus and selecting "Carbon Fixation". The overview of the gene expression for the pathway is displayed (Figure 3). The enzymes with the two highest changes are Sedoheptulose-bisphosphatase (3.1.3.37) and Ribulose 1,5-bisphosphate carboxylase/ oxygenase (4.1.1.39). The latter enzyme, also called Rubisco, is the most extensively studied enzyme in photosynthesis. Rubisco works either as a carboxylase or oxygenase in photosynthesis and photorespiration [22,23]. The reaction catalyzed by Rubisco is the key step that fixes carbon dioxide, and the high gene expression

Identification of loci involved
An enzyme like Rubisco may consist of several small subunits. We can examine these subunits by using the "Select Experiment", "Text Model: Locus", and "Text Model: Probe" links to display pages containing the gene expressions of loci or probes. By clicking the "Select Experiment", "Gene Test" and "Carbon fixation" links, statistical test results for all genes in the Calvin cycle can be obtained (

Conclusions
PathMAPA can automatically generate and display biological pathways integrated with gene expression data for Arabidopsis thaliana. It is a useful bioinformatics tool for performing statistical test at the pathway, enzyme and gene levels and studying enzyme, locus and probe details within a pathway. It can also be used to identify pathways to which a set of elements belongs as well as to compare gene expressions across experiments. PathMAPA facilitates the exploration of microarray data, the investigation of pathways, and the generatation of insights from microarray data. S Up This is an example to identify genes that are affected by experimental treatment for the Calvin cycle pathway. S means significant and NS means non-significant. Significance level is 0.05.