ExonMiner: Web service for analysis of GeneChip Exon array data
- Kazuyuki Numata†1,
- Ryo Yoshida2,
- Masao Nagasaki1,
- Ayumu Saito1,
- Seiya Imoto1 and
- Satoru Miyano1
© Numata et al; licensee BioMed Central Ltd. 2008
Received: 18 March 2008
Accepted: 26 November 2008
Published: 26 November 2008
Some splicing isoform-specific transcriptional regulations are related to disease. Detecting disease-specific splice variations is therefore the first step toward finding disease-specific transcriptional regulations. The Affymetrix Human Exon 1.0 ST Array measures exon-level expression profiles that are suitable for finding differentially expressed exons on a genome-wide scale. However, the exon array produces datasets that are too large to handle and analyze on a personal computer.
We have developed ExonMiner, the first all-in-one web service for the analysis of exon array data, to detect transcripts that have significantly different splicing patterns in two cells, e.g. normal and cancer cells. ExonMiner can perform the following analyses: (1) data normalization, (2) statistical analysis based on two-way ANOVA, (3) finding transcripts with significantly different splice patterns, (4) efficient visualization based on heatmaps and barplots, and (5) meta-analysis to detect exon-level biomarkers. We implemented ExonMiner on a supercomputer system in order to perform genome-wide analysis of the more than 300,000 transcripts in exon array data, which has the potential to reveal aberrant splice variations in cancer cells as exon-level biomarkers.
ExonMiner is well suited for the analysis of exon array data and requires no software installation other than an internet browser. All users need to do is access the ExonMiner URL http://ae.hgc.jp/exonminer. Users can analyze a full exon array dataset within hours using a statistical analysis with a sound theoretical basis that finds aberrant splice variants as biomarkers.
It has been reported that some splicing isoform-specific transcriptional regulations are related to disease [1, 2]. The first step toward finding disease-specific transcriptional regulations is to detect disease-specific splice variations. However, conventional microarrays, which produce gene-level information, are not suitable for this purpose. In contrast, the Affymetrix Human Exon 1.0 ST Array measures exon-level expression profiles that are suitable for finding differentially expressed exons on a genome-wide scale. The exon array measures the transcript levels of more than 1,000,000 exons belonging to 300,000 transcripts, using about 6,500,000 probes.
We have developed a supercomputer-based web service named ExonMiner to analyze exon array datasets and detect genes that are spliced into different isoforms in the two types of cells being compared, e.g. normal and cancer cells. Several noncommercial standalone applications exist for analyzing exon array data: IGB is an application for visualizing exon array data, while ExACT and Affymetrix Expression Console focus mainly on normalizing exon array data. Bioconductor (exonmap) focuses on annotation as well as normalization; an advantage of exonmap is that users can apply other statistical tools implemented in R. These applications are well organized, but they focus on data normalization, and other software is needed for further analysis. ExonMiner, in contrast, is an all-in-one web service on a supercomputer system: users can analyze the more than 300,000 transcripts on the exon array through data normalization, two-way ANOVA, visualization of the results, and detection of exon-level biomarkers. In our experiments with a colon cancer dataset of 20 exon arrays, run under various usage conditions of our system, the shortest computation took four hours and the longest finished within one day. The average computational time for the colon cancer example was about eight hours.
We have implemented ExonMiner on the Super Computer System (https://supcom.hgc.jp/english/) of the Human Genome Center, Institute of Medical Science, University of Tokyo, and created a GUI that gives easy access to all of ExonMiner's analysis tools. An illustrative example of colon cancer exon array data analysis is shown on the web site. ExonMiner has five advantages: (1) a statistical analysis framework, (2) complete analysis of all transcripts, (3) effective visualization with heatmap and barplot images, (4) a sophisticated and easy-to-use web interface, and (5) useful hyperlinks to major public databases, e.g. PubMed and NetAffx.
As shown in later sections, the method implemented in ExonMiner requires more computational time than other software because of the nonparametric test based on bootstrapping. For example, computing accurate p-values for the statistical tests that find aberrant splice variations requires more than 1,000 bootstrap samples, which costs about 1,000 times the computation of a usual ANOVA test with a Gaussian error model. We therefore need high-performance parallel computing on the Super Computer System. More advanced methods that may be implemented in ExonMiner in the future will possibly require even more computational resources, so the Super Computer System provides a flexible computational basis that suits our purpose.
Before performing statistical analysis, we apply a normalization method to the raw exon array data. ExonMiner can remove a bias related to the GC content of each probe. The probes are categorized according to their GC content, and the GC-content-specific bias is removed from the probes in each category. ExonMiner offers two types of control for data normalization: one is the median value of each GC category, and the other is based on antigenomic background probes. The antigenomic background probes are also grouped into GC categories, and their median values are computed. The probe intensities in each GC category are then transformed by subtracting the corresponding control value. If the user chooses the median values of the GC categories as the control, the median of the probe intensities in each GC category becomes equal to one.
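The GC-category normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not ExonMiner's code: the function names are hypothetical, the toy probes are made up, and the subtraction is assumed to happen on a log2 scale, which is one reading under which a category median of zero in log space corresponds to a median ratio of one on the raw scale.

```python
import math
from collections import defaultdict
from statistics import median


def gc_count(sequence):
    """Number of G or C bases in a probe sequence."""
    return sum(base in "GC" for base in sequence.upper())


def normalize_by_gc(probes, background=None):
    """Subtract a GC-category control value from log2 probe intensities.

    probes     -- list of (sequence, intensity) pairs
    background -- optional list of (sequence, intensity) pairs standing in
                  for antigenomic background probes; if omitted, the probes
                  themselves supply the category medians
    Assumption: the correction is done on the log2 scale.
    """
    control_source = probes if background is None else background
    by_category = defaultdict(list)
    for sequence, intensity in control_source:
        by_category[gc_count(sequence)].append(math.log2(intensity))
    control = {gc: median(values) for gc, values in by_category.items()}

    normalized = []
    for sequence, intensity in probes:
        gc = gc_count(sequence)
        if gc not in control:
            continue  # hypothetical policy: skip categories without a control
        normalized.append((sequence, math.log2(intensity) - control[gc]))
    return normalized


# Toy probes (made-up sequences and intensities).
toy_probes = [("GGCC", 8.0), ("GCGC", 16.0), ("CCGG", 32.0), ("ATAT", 4.0)]
for sequence, value in normalize_by_gc(toy_probes):
    print(sequence, value)
```

Passing a `background` list switches the control from the category medians of the probes themselves to the medians of the supplied background probes, mirroring the two control types in the text.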
Two-way Analysis of Variance
Concept and Model
To detect aberrant splice variations with ExonMiner, the user needs to prepare at least two exon arrays from a pair of cells. In our illustrative example, one exon array measures the exon profile of a colon cancer cell and the other that of a normal cell, so that aberrant splice variants in colon cancer can be found by comparison with the normal cell. For this purpose, we use two-way analysis of variance (ANOVA). Suppose that a gene (transcript cluster) is composed of $m$ exonic regions (exon clusters), and that $x_{ijk}$ is the background-corrected intensity of the $k$th probe ($k = 1, \dots, n_{ij}$) on the $i$th exon ($i = 1, \dots, m$) of the transcript, i.e. the transcript has $m$ exonic regions and each exonic region is covered by $n_{ij}$ probes. The index $j$ denotes the type of cell, e.g. $j = 1$ for the normal cell and $j = 2$ for the cancer cell. If we observe $x_{ijk} \approx c$ for all $i$, $j$ and $k$, the transcript shows neither transcriptional changes nor splicing variations across cell types ($j = 1, 2$). If we observe $x_{i1k} \approx c_1$ and $x_{i2k} \approx c_2$ ($c_1 \neq c_2$) for all $i$ and $k$, the transcript is differentially expressed between the two cells; this information is equivalent to that of usual microarray expression data such as cDNA microarrays and GeneChip. If, on the other hand, $x_{ijk} \approx c_1$ and $x_{i'jk} \approx c_2$ hold for any $j$ and $k$ and for $i \neq i'$, where $c_1 \neq c_2$ and $c_1, c_2 > 0$, the transcript has splice variations, but these occur in both cell types alike. Finally, if the two cells show different splice patterns, we call them aberrant splice variations. We capture this information with the two-way ANOVA model. For ANOVA in exon array data analysis, see also [8–10].
For detecting transcripts that show aberrant splice variations, we use the two-way ANOVA model defined by
$$x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \delta_{ijk}.$$
Statistical tests for detecting alternative splicing, differential expression, and aberrant splice variations
The estimates of the $\gamma_{ij}$ can capture the presence of aberrant splice variations. Under the ANOVA model, the probe fluctuations are decomposed into three orthogonal effects: the exon effect ($\alpha_i$), the overall gene effect ($\beta_j$) and the effect of specific splice variations ($\gamma_{ij}$). The statistical significance of each effect can be evaluated by the following three tests:
Test 1 (Detection of the exon effect):
$H_0$: $\alpha_i = 0$ for all $i$.
$H_a$: $\alpha_i \neq 0$ for at least one $i$.
Test 2 (Detection of the overall gene effect):
$H_0$: $\beta_1 = \beta_2$.
$H_a$: $\beta_1 \neq \beta_2$.
Test 3 (Detection of the effect of specific splice variations):
$H_0$: $\gamma_{ij} = 0$ for all $i$ and $j$.
$H_a$: $\gamma_{ij} \neq 0$ for one or more pairs $(i, j)$.
In ExonMiner, users can choose a parametric or a nonparametric test to assess the significance of each effect.
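For reference, the textbook sums of squares behind the three tests can be written as follows. This is a sketch under a simplifying assumption that the text does not make, namely a balanced layout in which every exon and cell type combination has the same number of probes $n$. Here $J = 2$ is the number of cell types, and a dot in a subscript denotes averaging over that index.

```latex
\begin{aligned}
SS_{\alpha} &= nJ \sum_{i=1}^{m} \left(\bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot\cdot\cdot}\right)^2,
  & df_{\alpha} &= m - 1, \\
SS_{\beta} &= nm \sum_{j=1}^{J} \left(\bar{x}_{\cdot j\cdot} - \bar{x}_{\cdot\cdot\cdot}\right)^2,
  & df_{\beta} &= J - 1, \\
SS_{\gamma} &= n \sum_{i=1}^{m} \sum_{j=1}^{J}
  \left(\bar{x}_{ij\cdot} - \bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot j\cdot} + \bar{x}_{\cdot\cdot\cdot}\right)^2,
  & df_{\gamma} &= (m - 1)(J - 1), \\
SS_{\delta} &= \sum_{i=1}^{m} \sum_{j=1}^{J} \sum_{k=1}^{n}
  \left(x_{ijk} - \bar{x}_{ij\cdot}\right)^2,
  & df_{\delta} &= mJ(n - 1), \\
F_{\gamma} &= \frac{SS_{\gamma} / df_{\gamma}}{SS_{\delta} / df_{\delta}} .
\end{aligned}
```

Under a Gaussian noise model, $F_{\gamma}$ is compared with an $F$ distribution with $(df_{\gamma}, df_{\delta})$ degrees of freedom for Test 3; the statistics for Tests 1 and 2 are formed in the same way from $SS_{\alpha}$ and $SS_{\beta}$.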
To detect aberrant splice variants as biomarkers, we need to check whether the detected aberrant splice variants are common in the targeted disease. For this purpose, we establish a statistical testing procedure based on meta-analysis. Suppose that we have $G$ pairs of exon array datasets, i.e. normal and tumor exon expression data measured from $G$ patients. Applying the whole-transcript analysis based on two-way ANOVA to the $G$ paired exon array datasets yields, for each effect, e.g. the effect of specific splice variations ($\gamma_{ij}$), a set of p-values $p_{hg}$ ($h = 1, \dots, r$) across patients $g = 1, \dots, G$, where $r$ denotes the total number of transcripts analyzed. Intuitively, a transcript with a small p-value is strongly associated with tumor formation. However, some observed aberrant splice variants could be caused by inter-individual differences among the analyzed samples. Our goal is to discover "universal biomarkers", i.e. aberrant splice variations that are shared by most individuals in a particular diagnostic category.
Following this direction, we develop the statistical technique within the framework of meta-analysis based on the normal inversion method.
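For each patient, we consider the two-way ANOVA model
$$x_{ijk}^{(g)} = \mu^{(g)} + \alpha_i^{(g)} + \beta_j^{(g)} + \gamma_{ij}^{(g)} + \delta_{ijk}^{(g)}$$
for $g = 1, \dots, G$. Given these models, the statistical hypothesis test of each effect, for example the effect of specific splice variations, is formulated as
Test 4 (Detection of universal specific splice variations):
$H_0$: $\gamma_{ij}^{(g)} = 0$ for all $i$, $j$ and $g$.
$H_a$: $\gamma_{ij}^{(g)} \neq 0$ for one or more tuples $(i, j, g)$.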
We now show an actual example of the meta-analysis. In Yoshida et al., a colon cancer exon array dataset was analyzed with an early version of ExonMiner. In that analysis, Test 3 of the ANOVA gave the following p-values for the gene MUC17 (Accession ID: NM_001040105) in ten patients:
$p_{h1} = 0.313$; $p_{h2} = 0.0005$; $p_{h3} = 0.0005$; $p_{h4} = 0.8964$; $p_{h5} = 0.8201$;
$p_{h6} = 0.0002$; $p_{h7} = 0.6549$; $p_{h8} = 0.0179$; $p_{h9} = 0.0522$; $p_{h10} = 0.1664$.
These p-values are transformed into z-scores as:
$z_{h1} = 0.487$; $z_{h2} = 3.291$; $z_{h3} = 3.291$; $z_{h4} = -1.261$; $z_{h5} = -0.916$;
$z_{h6} = 3.540$; $z_{h7} = -0.399$; $z_{h8} = 2.099$; $z_{h9} = 1.624$; $z_{h10} = 0.968$.
The integrated z-score is 4.023 and the integrated p-value is obtained as $2.86 \times 10^{-5}$.
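The MUC17 numbers can be checked with a short Python sketch. The formulas used here are an inference from the reported values rather than a quotation from the text: each z-score is taken as the standard normal quantile of one minus the p-value, and the integrated z-score as the sum of the z-scores divided by the square root of the number of patients (a Stouffer-type combination). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def integrate_p_values(p_values):
    """Combine per-patient p-values by normal inversion.

    Assumed formulas (inferred from the reported numbers):
      z_g = Phi^{-1}(1 - p_g),  Z = sum(z_g) / sqrt(G),  p = 1 - Phi(Z)
    """
    z_scores = [STANDARD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    z_combined = sum(z_scores) / sqrt(len(z_scores))
    p_combined = 1.0 - STANDARD_NORMAL.cdf(z_combined)
    return z_scores, z_combined, p_combined


# Test 3 p-values for MUC17 in ten patients, as listed in the text.
muc17_p_values = [0.313, 0.0005, 0.0005, 0.8964, 0.8201,
                  0.0002, 0.6549, 0.0179, 0.0522, 0.1664]

z_scores, z_combined, p_combined = integrate_p_values(muc17_p_values)
print([round(z, 3) for z in z_scores])
print(round(z_combined, 3), p_combined)
```

With these inputs the combined z-score agrees with the reported 4.023 to within rounding, and the combined p-value lands close to the reported $2.86 \times 10^{-5}$; small differences are expected because the listed p-values are themselves rounded.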
In the colon cancer example, we compute q-values from the integrated p-values of the meta-analysis. The list of genes identified as having aberrant splice variations, including exon skipping/retaining, is controlled at a 10% False Discovery Rate (FDR), which corresponds to a q-value < 0.1. For MUC17 above, the q-value is 0.0345, so it is judged significant. The computation of the q-value is described in Yoshida et al.
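The q-value computation used for ExonMiner is described in Yoshida et al. and is not reproduced here. As a stand-in that illustrates the general idea of FDR control, the sketch below applies the Benjamini-Hochberg adjustment, a different and widely used procedure, to a list of integrated p-values. The function name and the toy p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Stand-in for a q-value computation; not the method of Yoshida et al.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda index: p_values[index])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone in the rank of the raw p-values.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Toy integrated p-values for four hypothetical transcripts.
toy_p_values = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(toy_p_values)
selected = [i for i, value in enumerate(adjusted) if value < 0.1]
print(adjusted, selected)
```

Selecting transcripts whose adjusted value falls below 0.1 plays the same role as the q-value < 0.1 cut-off in the text, namely a list with an expected false discovery rate of about 10%.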
By using exon array data with ExonMiner, it is possible to detect alternative splicing such as exon skipping/retaining and alternative usage of donor and acceptor splice sites. However, since the exon array does not have junction probes, a custom array with junction probes or a PCR-based method is needed for further analysis to determine the exact patterns of splice isoforms.
Users are required to upload their exon array data. We provide an FTP service for data upload, because FTP makes it easy to upload a large dataset. To increase the security level, we issue a one-time account and password for the FTP service. Note that this one-time account and password are different from the login account and password of ExonMiner.
Statistical analysis engine
ExonMiner performs ANOVA for each transcript. To test the significance of each effect in the ANOVA described in the previous section, we implemented two types of tests: one is based on a Gaussian noise model and performs an F-test; the other is a nonparametric approach using the bootstrap method. The nonparametric approach computes the test statistics repeatedly and therefore needs an enormous amount of computation. For this reason, we implemented the ANOVA program in Fortran and optimized it for the high-performance computing described in a later section.
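To make the cost of the nonparametric test concrete, the following Python sketch runs a bootstrap test for the interaction effect (Test 3) on one toy transcript. It is an illustration under stated assumptions, not ExonMiner's Fortran program: the text does not spell out the resampling scheme, so a common residual bootstrap under the additive null model is used here, a balanced layout is assumed, and the function names and data are made up.

```python
import random
from statistics import mean


def cell_summaries(data):
    """Cell means, exon means, cell-type means and grand mean.

    data[i][j] holds the probe intensities of exon i in cell type j;
    a balanced layout (equal list lengths) is assumed.
    """
    m, n_types = len(data), len(data[0])
    cell = [[mean(data[i][j]) for j in range(n_types)] for i in range(m)]
    exon = [mean(cell[i]) for i in range(m)]
    ctype = [mean(cell[i][j] for i in range(m)) for j in range(n_types)]
    return cell, exon, ctype, mean(exon)


def interaction_f(data):
    """F statistic for the exon by cell-type interaction."""
    m, n_types, n = len(data), len(data[0]), len(data[0][0])
    cell, exon, ctype, grand = cell_summaries(data)
    ss_int = n * sum((cell[i][j] - exon[i] - ctype[j] + grand) ** 2
                     for i in range(m) for j in range(n_types))
    ss_err = sum((x - cell[i][j]) ** 2
                 for i in range(m) for j in range(n_types) for x in data[i][j])
    if ss_err == 0.0:
        return float("inf")  # degenerate resample without residual spread
    df_int = (m - 1) * (n_types - 1)
    df_err = m * n_types * (n - 1)
    return (ss_int / df_int) / (ss_err / df_err)


def bootstrap_interaction_pvalue(data, n_boot=1000, seed=0):
    """Residual bootstrap p-value for the interaction effect."""
    rng = random.Random(seed)
    m, n_types, n = len(data), len(data[0]), len(data[0][0])
    cell, exon, ctype, grand = cell_summaries(data)
    f_observed = interaction_f(data)
    residuals = [x - cell[i][j]
                 for i in range(m) for j in range(n_types) for x in data[i][j]]
    exceed = 0
    for _ in range(n_boot):
        # Simulate data from the additive (no interaction) fit plus
        # residuals resampled with replacement.
        simulated = [[[exon[i] + ctype[j] - grand + rng.choice(residuals)
                       for _ in range(n)]
                      for j in range(n_types)] for i in range(m)]
        if interaction_f(simulated) >= f_observed:
            exceed += 1
    return f_observed, (exceed + 1) / (n_boot + 1)


# Toy transcript: three exons, two cell types, four probes per cell.
# The third exon is strongly reduced in the second cell type.
toy = [
    [[10.0, 10.2, 9.8, 10.1], [10.1, 9.9, 10.0, 10.2]],
    [[8.0, 8.1, 7.9, 8.2], [8.1, 8.0, 7.8, 8.1]],
    [[9.0, 9.1, 8.9, 9.2], [4.0, 4.2, 3.9, 4.1]],
]
f_observed, p_value = bootstrap_interaction_pvalue(toy, n_boot=1000, seed=1)
print(f_observed, p_value)
```

Each bootstrap replicate recomputes the full ANOVA statistic, so 1,000 replicates cost roughly 1,000 times one parametric test, and the smallest attainable p-value is 1/(n_boot + 1). Repeating this for several hundred thousand transcripts is what motivates the parallel implementation.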
The exon expression pattern of each transcript needs to be shown visually. We have developed two image generators that produce heatmap and barplot images optimized for exon array data. These images are generated with R, using a graphics library that we developed ourselves.
We use a MySQL database server to manage user information and probe annotation information. To keep the system highly secure, user login information is encrypted before it is stored in the MySQL database. Because the probe annotation information is kept in the MySQL database, users do not need to consult other databases. Thus ExonMiner is an all-in-one web service.
High performance computing on supercomputers
Since ANOVA for the full set of transcripts needs high-performance computing, we perform the ANOVA computations in parallel on our supercomputer system. The system has eight Sun Fire 15K servers, and up to 700 CPUs can be used for parallel statistical computation through Sun Grid Engine.
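Because the ANOVA of one transcript does not depend on any other transcript, the work divides naturally into independent batches. The sketch below shows only that partitioning step in Python; the function name is hypothetical, and the actual job layout that ExonMiner submits to Sun Grid Engine is not described in the text.

```python
def split_into_batches(transcript_ids, n_workers):
    """Split IDs into at most n_workers contiguous, near-equal batches."""
    ids = list(transcript_ids)
    n_batches = min(n_workers, len(ids))
    if n_batches == 0:
        return []
    base, extra = divmod(len(ids), n_batches)
    batches, start = [], 0
    for batch_index in range(n_batches):
        # The first `extra` batches take one additional transcript.
        size = base + (1 if batch_index < extra else 0)
        batches.append(ids[start:start + size])
        start += size
    return batches


# 300,000 placeholder transcript IDs spread over 700 workers.
batches = split_into_batches(range(300_000), 700)
print(len(batches), max(len(batch) for batch in batches))
```

Each batch could then be handed to one CPU, with the per-transcript results collected afterwards; contiguous batches keep the bookkeeping simple, although other assignments would work equally well.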
In ExonMiner, PHP scripts handle the connections between front-end users and our supercomputer system and dynamically generate images by running the visualization engine described above according to user input. The PHP scripts generate HTML pages in a uniform style, which increases usability.
Results and discussions
Overview of ExonMiner
Create user account
FTP for data upload
To upload your data, you need to use FTP. To use the FTP service in ExonMiner, you first need to obtain a one-time account and password for FTP.
Note that the FTP account is different from the login account. Using the one-time password, you can upload CEL (TEXT) files archived with ZIP via FTP. ExonMiner supports CEL files in TEXT format (such a CEL file is recognized as version 4); usual CEL files, however, are in BINARY format (such a CEL file is recognized as version 3). To convert BINARY CEL files (version 3) into TEXT format (version 4), the "CEL File Conversion Tool" provided by Affymetrix Inc. is available.
Description: you can add a brief description of your analysis. Giving each analysis a name may help you organize your analyses.
Select probe levels: you can select the level of expression information in the exon array. Transcript Level: transcripts are divided into three levels (core, extended and full) according to the quality of their information sources. In the same way, you can choose the Probe Level and the Exon Resolution.
Select GFF: you can select chromosomes. Only transcripts on the selected chromosomes are analyzed, which can reduce computational time.
Select which CEL file is a patient or a control: you add the outcome information to each CEL file you have uploaded by FTP.
Preprocessing data (background correction): you select the type of normalization method. GC-content: the median values of the probe groups with the same GC content are used as control values. Antigenomic background: the median values of the antigenomic background probes with the same GC content are used as control values.
Preprocessing data (GC-content threshold): probes with high GC content may act as noise, so you can remove them. By default, probes with a GC content of 20 or more are removed. To use all probes for the analysis, choose 26 as the cut-off.
Analysis type (model): you select one of the following three analysis types. Don't analyze: ExonMiner does not perform ANOVA; only visualization and sequence information are available. Parametric analysis: a Gaussian distribution is assumed as the noise model. Nonparametric analysis: ExonMiner does not assume any distribution for the noise model; a bootstrap test is applied to compute p-values.
Analysis type (threshold for the number of probes): ExonMiner ignores probesets or exon clusters with a small number of probes in order to stabilize the results of the ANOVA. This option sets the cut-off.
Nonparametric analysis options: this option specifies the number of bootstrap samples used in the nonparametric ANOVA.
Visualization of the results
Availability and requirements
Project name: ExonMiner
Project home page: http://ae.hgc.jp/exonminer/
Anonymous accounts (no e-mail address for registration is needed): http://ae.hgc.jp/exonminer/anonymous.html
Operating systems: any OS (that has an internet browser application)
Programming language: PHP, R, Fortran, Perl, Ruby, MySQL
ExonMiner is an all-in-one web service well suited for the analysis of exon array data. Since it requires no software installation other than an internet browser, all users need to do is access the ExonMiner URL http://ae.hgc.jp/exonminer. ExonMiner performs not only visualization of exon array data but also data normalization and user-customized statistical analysis that is hard to run on a single computer. With the support of the supercomputers of the Human Genome Center, Institute of Medical Science, University of Tokyo, users can analyze a full exon array dataset within hours and obtain meta-analysis results that identify aberrant splice variants as biomarkers.
The authors would like to thank the three reviewers for their constructive comments and suggestions, which improved the quality of the paper considerably. The authors also wish to thank Affymetrix Japan Inc. for permission to link to their web site, NetAffx, and for their helpful suggestions. ExonMiner was supported by the Human Genome Center, Institute of Medical Science, University of Tokyo.
- Richard DJ, Schumacher V, Royer-Pokora B, Roberts SGE: Par4 is a coactivator for a splice isoform-specific transcriptional activation domain in WT1. Genes Dev 2001, 15(3):328–339.
- Gruber FX, Hjorth-Hansen H, Mikkola I, Stenke L, Johansen T: A novel Bcr-Abl splice isoform is associated with the L248V mutation in CML patients with acquired resistance to imatinib. Leukemia 2006, 20:2057–2060.
- Affymetrix Expression Console [http://www.affymetrix.com/products/software/specific/expression_console_software.affx]
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
- Okoniewski MJ, Yates T, Dibben S, Miller CJ: An annotation infrastructure for the analysis and interpretation of Affymetrix exon array data. Genome Biol 2007, 8(5):R79.
- Yoshida R, Numata K, Imoto S, Nagasaki M, Doi A, Ueno K, Miyano S: Computational genome-wide discovery of aberrant splice variations with exon expression profiles. Proc IEEE 7th International Symposium on Bioinformatics & Bioengineering 2007, 715–722.
- Yoshida R, Numata K, Imoto S, Nagasaki M, Doi A, Ueno K, Miyano S: A statistical framework for genome-wide discovery of biomarker splice variations with GeneChip Human Exon 1.0 ST Arrays. Genome Informatics 2006, 17(1):88–99.
- Gardina PJ, Clark TA, Shimada B, Staples MK, Yang Q, Veitch J, Schweitzer A, Awad T, Sugnet C, Dee S, Davies C, Williams A, Turpaz Y: Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics 2006, 7:325.
- Affymetrix: Alternative transcript analysis methods for exon arrays. Affymetrix Whitepaper 2005.
- CEL File Conversion Tool [http://www.affymetrix.com/Auth/support/developer/downloads/Tools/CelFileConversion.ZIP]
- Schuler GD, Epstein JA, Ohkawa H, Kans JA: Entrez: molecular biology database and retrieval system. Methods Enzymol 1996, 266:141–162.
- Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res 2003, 31(1):82–86.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.