SplicerAV: a tool for mining microarray expression data for changes in RNA processing
© Robinson et al; licensee BioMed Central Ltd. 2010
Received: 23 September 2009
Accepted: 25 February 2010
Published: 25 February 2010
Over the past two decades more than fifty thousand unique clinical and biological samples have been assayed using the Affymetrix HG-U133 and HG-U95 GeneChip microarray platforms. This substantial repository has been used extensively to characterize changes in gene expression between biological samples, but has not been previously mined en masse for changes in mRNA processing. We explored the possibility of using HG-U133 microarray data to identify changes in alternative mRNA processing in several available archival datasets.
Data from these and other gene expression microarrays can now be mined for changes in transcript isoform abundance using a program described here, SplicerAV. Using in vivo and in vitro breast cancer microarray datasets, SplicerAV was able to perform both gene and isoform specific expression profiling within the same microarray dataset. Our reanalysis of Affymetrix U133 plus 2.0 data generated by in vitro over-expression of HRAS, E2F3, beta-catenin (CTNNB1), SRC, and MYC identified several hundred oncogene-induced mRNA isoform changes, one of which revealed a previously unknown mechanism of EGFR family activation. Using clinical data, SplicerAV predicted 241 isoform changes between low and high grade breast tumors, with changes enriched among genes coding for guanyl-nucleotide exchange factors, metalloprotease inhibitors, and mRNA processing factors. Isoform changes in 15 genes were associated with aggressive cancer across the three breast cancer datasets.
Using SplicerAV, we identified several hundred previously uncharacterized isoform changes induced by in vitro oncogene over-expression and revealed a previously unknown mechanism of EGFR activation in human mammary epithelial cells. We analyzed Affymetrix GeneChip data from over 400 human breast tumors in three independent studies, making this the largest clinical dataset analyzed for en masse changes in alternative mRNA processing. The capacity to detect RNA isoform changes in archival microarray data using SplicerAV allowed us to carry out the first analysis of isoform specific mRNA changes directly associated with cancer survival.
The key postulate that one gene encodes one polypeptide chain (one enzyme) has been overhauled with the discovery that one gene can generate multiple RNA transcripts (and indirectly many different polypeptide chains) through a process referred to as alternative mRNA processing . Alternative processing defines a range of events, including alternative splicing and alternative polyadenylation, which result in distinct mRNA species. Recent deep sequencing studies indicate that 94% of all protein coding genes generate multiple mRNA transcripts  and mutations affecting mRNA splicing are responsible for an estimated 15-60% of human genetic diseases [3, 4]. Functional consequences of alternative processing have been shown across a wide variety of biological processes (reviewed by [5–7]) including drug metabolism, stem cell renewal, neurologic disease, autoimmune disease, and especially cancer. Despite the importance of alternative processing in cancer, current understanding of its global regulation remains sparse  and limits the ability to fully harness alternative processing as a tool in cancer prognosis, diagnosis, and treatment.
Attempts to obtain a genome scale understanding of alternative processing in cancer have focused on large-scale characterizations of changes in alternative processing between normal tissue and cancer. Bioinformatic analyses have identified a large number of transcript isoforms found only within cancer tissue [9–11]. The recent use of splicing sensitive microarrays has allowed quantification of changes in alternative processing between individual samples (reviewed in ). These arrays have been used to detect changes in alternative processing between normal human tissues and in breast, brain, colon, prostate, and bladder carcinomas [12–16] using various splicing algorithms (reviewed in ). Large-scale clinical analyses of changes in alternative processing, however, remain sparse, and there are no high-throughput analyses of changes in mRNA processing associated with poor patient prognosis. Such studies require years of patient follow-up and have not been reported using the new splicing arrays.
In contrast, public repositories such as the Gene Expression Omnibus (GEO) currently contain conventional gene expression data from hundreds of thousands of unique biological or clinical samples. Data previously generated by the microarray community provide an untapped source of potential insight into the regulation of alternative mRNA processing in human cancer. Although the exact value of these data is not known, it is likely that well over a billion dollars have been invested in reagents, facility, and personnel costs over the past two decades.
The first commercially available high-density gene expression microarrays were invented over a decade ago by Affymetrix to quantify expression changes in tens of thousands of genes in a single experiment, but were not intended to detect isoform specific mRNA changes resulting from alternative processing. Two of the most commonly used human expression microarrays, the Affymetrix U95 and U133 series, use individual probesets to report expression of many genes. Each probeset is composed of 11 individual 25 nt oligomers that interrogate a subsequence of the target gene. Both platforms, however, contain thousands of genes whose expression is assayed by more than one probeset. The use of multiple probesets, which often interrogate non-overlapping regions of the target gene, was originally intended to provide a robust assay of gene expression. We and others have previously observed that discrepancies between fold-changes in probesets interrogating the same gene can represent isoform-specific changes in mRNA levels [20–22]. Such isoform changes can result from alternative transcription start sites, alternative mRNA processing, or changes in mRNA isoform stability.
Methods that detect isoform-specific mRNA changes have been developed for splicing microarrays such as the Affymetrix Human Exon 1.0 ST (reviewed in ), but have not been developed for or applied to conventional gene expression microarrays. In fact, it has been suggested in such reviews that "detection of disease-relevant splicing differences may be entirely missed in gene-level expression profiling studies" . Although it may be possible in theory to apply such methods to conventional gene expression microarrays, to our knowledge this has not been done. To fully investigate the potential to detect isoform-specific mRNA changes in conventional gene expression microarray data, we elected to develop a novel method, SplicerAV, which we have applied to conventional Affymetrix gene expression microarray data.
SplicerAV related probeset features of commonly used Affymetrix microarrays
Unique Annotated Genes
Genes w/Mult Probesets
Fraction of genes w/mult probesets
Avg. Probesets per gene
U133 Plus 2.0
Mouse 430A 2
SplicerAV is a program created to systematically assess the likelihood of changes in alternative processing evidenced by discrepancies in probeset behavior using a Gaussian mixture model of mRNA transcript regulation. A beta version of this program, which lacked biological modifiers and the ability to generate estimates of statistical significance, was initially used to identify differential regulation of transcript isoforms by TCERG1 . SplicerAV can be applied to any expression microarray platform with multiple probesets interrogating the same gene, without the need for detailed transcript annotation. The program provides a non-computationally intensive algorithm capable of analyzing probeset-summary level datasets for evidence of changes in alternative mRNA processing. We provide here a description of SplicerAV, which has been developed to provide a rigorous statistical model and incorporate biologically motivated modifications with the goal of assisting biologists in identifying alternative processing events most amenable for in-depth study from conventional gene expression microarray data.
In this study SplicerAV's unique value in detecting previously overlooked changes in mRNA processing is demonstrated using publicly available Affymetrix U133 gene expression datasets. SplicerAV was used to uncover previously uncharacterized isoform specific changes in epidermal growth factor receptor (EGFR) caused by in vitro HRAS over-expression . In a separate analysis, SplicerAV was used to identify changes in alternative mRNA processing associated with poor patient prognosis in over 400 breast tumors. Here we demonstrate SplicerAV's ability to examine archival data, performing the largest analysis of alternative mRNA processing in human cancer to date and the only high-throughput analysis of changes in alternative mRNA processing associated with human cancer prognosis.
Results and Discussion
In the first step, changes in probeset expression levels are summarized by calculating their average log2 fold changes and corresponding t-statistics. These metrics are taken from conventional gene expression analysis. Probesets targeting the same gene are then grouped together and each probeset is assigned a weight. Individual probeset weights are calculated using a combination of that probeset's t-statistic, number of observations, and comparison with other probesets targeting the same gene (see methods).
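The summarization step can be sketched as follows. This is a minimal illustration, not the SplicerAV implementation: the function names, the probeset and gene identifiers, and the intensities are hypothetical, and Welch's form of the t-statistic is an assumption, since the text does not specify which t-statistic is used.

```python
from math import sqrt
from statistics import mean, variance

def summarize_probeset(treated, control):
    """Return (mean log2 fold change, Welch t-statistic) for one probeset.

    Inputs are log2 intensities, one value per sample. Each group needs at
    least two samples and some variability, otherwise the standard error is
    undefined or zero.
    """
    diff = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return diff, diff / se

def group_by_gene(summaries, probeset_to_gene):
    """Collect per-probeset summaries under the gene each probeset targets."""
    genes = {}
    for probeset, stats in summaries.items():
        genes.setdefault(probeset_to_gene[probeset], {})[probeset] = stats
    return genes

# Hypothetical log2 intensities (treated, control) for two probesets of one gene.
data = {
    "ps_1": ([9.1, 9.0, 8.9], [8.0, 8.1, 7.9]),
    "ps_2": ([6.0, 6.2, 6.1], [6.9, 7.1, 7.0]),
}
summaries = {ps: summarize_probeset(t, c) for ps, (t, c) in data.items()}
by_gene = group_by_gene(summaries, {"ps_1": "GENE_X", "ps_2": "GENE_X"})
```

In this toy example the two probesets of the same gene move in opposite directions, which is the kind of discordance the later steps are designed to evaluate.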
Once these weights are assigned, each gene is evaluated for evidence of alternative processing using a Gaussian mixture model. In the Gaussian mixture model used by SplicerAV, probesets interrogating a transcriptionally activated gene are predicted to detect the same proportional increase in expression. For example, probesets targeting an mRNA that doubles in abundance would be expected to double in intensity (Figure 1B). Conversely, probesets targeting an mRNA that is down-regulated by half would be expected to be reduced by half (Figure 1C). Multiple probesets targeting a gene that is alternatively processed or undergoes isoform specific mRNA regulation would be expected to report discordant changes in probeset intensities (Figure 1D).
SplicerAV uses the chip annotation file ("platform_annot.csv" for Affymetrix arrays) to determine which probesets interrogate the same gene. For most microarray platforms the gene symbol provides an appropriate annotation scheme; however, any provided annotation (Transcript cluster ID, WormBase, FlyBase, Ensembl, etc.) can be used.
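Grouping probesets by a shared annotation can be sketched as below. This is an illustrative sketch only: the column names "Probe Set ID" and "Gene Symbol" and the use of "---" for a missing symbol are assumptions about the annotation file layout, the function name is hypothetical, and any leading comment lines in a real annotation file would need to be skipped before parsing.

```python
import csv
import io
from collections import defaultdict

def probesets_by_gene(annotation_file, id_col="Probe Set ID", gene_col="Gene Symbol"):
    """Map each annotation key to its probesets, keeping multi-probeset genes only."""
    groups = defaultdict(list)
    for row in csv.DictReader(annotation_file):
        key = row[gene_col].strip()
        if key and key != "---":  # skip probesets without an annotation (assumed marker)
            groups[key].append(row[id_col])
    # Only genes interrogated by more than one probeset can be assessed.
    return {gene: ids for gene, ids in groups.items() if len(ids) > 1}

# Hypothetical annotation content standing in for the real file.
example = io.StringIO(
    '"Probe Set ID","Gene Symbol"\n'
    '"ps_1","GENE_X"\n'
    '"ps_2","GENE_X"\n'
    '"ps_3","GENE_Y"\n'
    '"ps_4","---"\n'
)
multi = probesets_by_gene(example)
```

Passing a different `gene_col` would group probesets by another annotation scheme, such as a transcript cluster or Ensembl identifier.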
Probeset Annotation & Filtering
Our analyses used the default probeset annotation provided by Affymetrix. This annotation contains probesets that in some cases target multiple exons or are poorly annotated [24–26]. Re-defining probeset definition, for example using exon-based definitions of probesets, may improve the ability of SplicerAV to detect changes in mRNA processing [24, 25]. However, using the standard annotation provided by Affymetrix makes our findings here directly comparable to the vast majority of expression analyses conducted using the U133 series of arrays, allowing reference to specific probeset IDs and enabling us to directly analyze summarized expression datasets deposited in GEO. Additionally, many Affymetrix microarray expression datasets deposited in GEO do not contain CEL files  and cannot be re-analyzed using custom annotation.
The use of standard Affymetrix annotation also allows us to make presence/absence probeset detection calls using previously validated methods . As described above, SplicerAV detects discrepancies in fold changes between probesets targeting the same gene, using these discrepancies to infer changes in alternative mRNA processing. Nevertheless, such discrepancies can also reflect the presence of negative strand matching probesets (NSMPs) or probesets that do not produce signal above background, which can be caused by low transcript levels or non-functional probes. NSMPs hybridize or detect RNAs transcribed in the opposite direction of the annotated gene; they do not reflect the expression of the target transcript and are identified and removed by SplicerAV using information available in standard Affymetrix annotation files . Probesets that do not produce signal can also falsely suggest isoform specific mRNA changes. These probesets are removed by SplicerAV if they are not expressed above background (P < .05) in either treatment or control groups using the Presence-Absence calls with Negative Probesets (PANP) algorithm .
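The two filters described above can be sketched as a single predicate. This is a hedged illustration: the record fields (`negative_strand`, `p_treated`, `p_control`) are hypothetical names, and the group-level detection p-values are assumed to have been derived beforehand (for example from PANP output); the sketch does not reproduce the PANP algorithm itself.

```python
def keep_probeset(record, alpha=0.05):
    """Decide whether a probeset should enter the isoform analysis."""
    if record["negative_strand"]:
        return False  # NSMPs do not report the annotated transcript
    # Keep the probeset if it is detected above background in at least one group.
    return record["p_treated"] < alpha or record["p_control"] < alpha

# Hypothetical probeset records.
records = {
    "ps_1": {"negative_strand": False, "p_treated": 0.001, "p_control": 0.20},
    "ps_2": {"negative_strand": True, "p_treated": 0.001, "p_control": 0.001},
    "ps_3": {"negative_strand": False, "p_treated": 0.40, "p_control": 0.30},
}
kept = [ps for ps, rec in records.items() if keep_probeset(rec)]
```

Here only `ps_1` survives: `ps_2` is removed as a negative strand matching probeset and `ps_3` because it is not detected in either group.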
These modifiers do not affect the p-value generated by SplicerAV, but allow the program to preferentially rank predicted changes in alternative processing that generate less complicated hypotheses, are larger in magnitude, reflect changes in expression which are qualitatively different, and are less likely to reflect probesets targeting non-transcribed regions or probesets that do not linearly reflect changes in transcript abundance. Genes that exhibit statistically significant discordant probeset behavior and are given a positive splice score represent ideal candidates for experimental investigation of isoform specific regulation.
SplicerAV generates several additional outputs with each file. These include a file containing assessment of statistically significant expression changes for all probesets, a log file containing all user set parameters and comparisons made, and a FASTA file for each gene. These FASTA files contain the target sequences of all probesets targeting that gene, allowing quick and easy mapping to known and predicted mRNA sequences using the UCSC genome browser (http://genome.ucsc.edu). All genomic analyses in this study were performed using the March 2006 release of the human genome (hg18).
SplicerAV Index Generation
To perform analyses of isoform changes within individual samples we derived an index of relative isoform abundance predicted by SplicerAV. High-throughput analyses of alternative processing have previously defined "splice index" as a quantitative measure to compare isoform abundances between individual samples. The splice index of a probeset equals its expression relative to other probesets targeting the same gene . Using SplicerAV we defined a modified version of the splice index, referred to as the SplicerAV index. SplicerAV assumes a Gaussian mixture model, whereby all probesets are classified as belonging to one of two groups based on similarity of expression changes. The group of probesets exhibiting the largest increases in expression are referred to as the "A" (up) group and the group of probesets exhibiting the largest decreases in expression are referred to as the "B" (down) group (see examples of SplicerAV output in additional files 1, 2, 3, 4, 5, and 6). The SplicerAV index of a probeset equals its expression relative to the average expression of probesets in the opposite group. For example, the SplicerAV index of a probeset in the "A" group would be calculated by subtracting the average expression of the "B" group from that probeset's log2 expression value. In our analysis, SplicerAV indexes of probesets in the "A" group were defined as increased in aggressive cancers, while indexes of probesets in the "B" group were defined as decreased in aggressive cancers. Pre-specified hypotheses generated in training datasets made unidirectional significance tests appropriate in independent validation datasets.
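The index calculation described above can be sketched for a single sample as follows. The function name, probeset identifiers, and expression values are hypothetical, and an unweighted average of the opposite group is assumed.

```python
from statistics import mean

def splicerav_index(sample_log2, group_a, group_b):
    """Per-probeset index: log2 expression minus the mean of the opposite group."""
    mean_a = mean(sample_log2[p] for p in group_a)
    mean_b = mean(sample_log2[p] for p in group_b)
    index = {p: sample_log2[p] - mean_b for p in group_a}
    index.update({p: sample_log2[p] - mean_a for p in group_b})
    return index

# Hypothetical log2 expression values for one tumor sample.
sample = {"ps_1": 9.0, "ps_2": 8.6, "ps_3": 6.0}
idx = splicerav_index(sample, group_a=["ps_1", "ps_2"], group_b=["ps_3"])
```

Because the values are on a log2 scale, each index is a log ratio of one probeset to the opposite isoform group, so it can be compared between samples regardless of the overall expression level of the gene.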
SplicerAV was implemented in Perl, with a typical run time of 3-5 minutes on a standard personal computer and has not been tested using other operating systems. The program will only assess changes in alternative mRNA processing for genes interrogated by multiple probesets, the number of which varies widely by microarray platform. To explore the potential for SplicerAV to identify novel changes in mRNA isoform abundance in breast cancer, we applied SplicerAV to several publicly available, archival Affymetrix HG-U133 plus 2.0 datasets.
SplicerAV predicts oncogene induced changes in alternative processing of splicing factors
Studies of SRC , HRAS [31, 32], and E2F family binding sites  have demonstrated isolated roles of these oncogenes in affecting alternative mRNA processing. Nonetheless, prior to this study no large-scale examination of changes in alternative mRNA processing had been undertaken for any of these oncogenes. We examined an oncogene over-expression microarray dataset published by Nevins and colleagues  (GEO accession GSE3151) to demonstrate SplicerAV's ability to detect oncogene driven changes in alternative processing. In this experiment, activated HRAS, SRC, E2F3, activated β-catenin (CTNNB1), MYC, or green fluorescent protein (GFP) was over-expressed in human primary mammary epithelial cells. The Affymetrix U133 plus 2.0 microarray platform was used to assay gene expression in seven to ten replicates of each condition. Probeset level intensities were estimated using the Robust Multichip Averaging (RMA) procedure .
SplicerAV predicts oncogene-induced changes in isoform specific mRNA levels.
Unique Expressed Genes
SplicerAV Predictions (P < .01)
Alt. Processed Genes
Genes with Splice Score > 0
Significant Gene Ontologies
mRNA splicing (12)
Complement med immunity (3)
G-protein mediated signaling (10)
Transcription Elongation (2)
mRNA splicing (7)
mRNA processing factors (4)
Cell surface receptor signal (10)
G-protein mediated signaling (6)
Mesoderm development (6)
Cell structure and motility (11)
pre-mRNA splicing (5)
Granulocyte-mediate immunity (2)
Gene isoform changes receiving both a significant p-value and a positive splice score indicate ideal candidates for further experimental study ("Genes with Splice Score > 0" column; Table 2). HRAS and SRC over-expression resulted in 212 and 119 such events, respectively, while MYC over-expression resulted in only 12 (Table 2). One gene, Programmed Cell Death Protein 5 (PDCD5), underwent the same change in alternative processing upon over-expression of each of the five oncogenes (see additional files 1, 2, 3, 4, and 5). PDCD5 switched from an alternative isoform (mRNA AK293486) to the major isoform (mRNA BC015519), which encodes 37 isoform specific C-terminal amino acids required for PDCD5 nuclear entry and activation of apoptosis. Gene ontology (GO) analysis of isoform specific changes revealed a common selection for genes involved in mRNA splicing (see methods). Over-expression of each oncogene other than MYC resulted in significant (p ≤ .05) enrichment of isoform specific changes in mRNA splicing, pre-mRNA splicing, or mRNA processing factors (Table 2). HRAS and SRC over-expression resulted in predicted isoform changes in 12 (p = .009) and seven (p = .05) factors involved in mRNA splicing, respectively. Both HRAS and E2F3 isoform specific changes were enriched for G-protein mediated signaling (p = .04; p = .0009) and roles in immune function (p = .02; p = .01). Sixty-seven genes were predicted to undergo isoform changes in common between two or more oncogenes. Messenger RNA processing factors (5 genes, p = .008; WDR33, HNRPC, SF3A1, SNRPA1, TRA2A) and mRNA splicing factors (8 genes, p = .0003; HNRPC, HNRPD, TARDBP, HNRPH1, SF3A1, HNRPA2B1, SNRPA1, TRA2A) were the most significant molecular function and biological process represented by these genes.
HRAS over-expression results in isoform specific EGFR mRNA regulation
Probesets 1 and 2, which target a region common to all four isoforms, reported highly concordant (R2 = .95) expression levels across all 55 samples in the dataset (Figure 3C). Probesets targeting different transcript regions (1 and 3) reported poorly or even inversely correlated expression levels (R2 = .36, Figure 3D). Because of this "outlier" behavior, these probesets would be discarded during conventional microarray expression analysis; however, SplicerAV data suggest that this behavior reflects isoform-specific regulation of EGFR expression.
EGFR isoform A (AShort) appeared to be the primary transcript upregulated by HRAS over-expression, as evidenced by highly correlated expression of the probesets targeting the common and AShort isoforms (probesets 1 and 6; R2 = .87). HRAS over-expression caused a robust decrease in the probeset targeting the long 3'UTR of EGFR (probeset 7; ALong) that was not correlated with expression of the common transcript region (Figure 3F, R2 = .01). In contrast, common and ALong expression levels were well correlated in non-HRAS samples (R2 = .70). These data suggest a HRAS-specific shortening of the isoform A 3'UTR.
We hypothesize that these HRAS-induced isoform changes promoted EGFR activation via several mechanisms. HRAS increased overall isoform A transcript levels, as evidenced by significant increases in probesets interrogating common regions of the gene (probesets 1 & 2). At the same time, HRAS over-expression resulted in selection of a shorter 3' UTR, which removes known miRNA binding sites present in the ALong UTR and likely increased translation of EGFR mRNAs . Widespread 3'UTR shortening to escape miRNA regulation has been observed previously in proliferating cells . EGFR isoforms B & D code for a truncated intracellular domain, which if translated could dimerize with and inhibit activation of both EGFR and HER2 . The observed down-regulation of these isoforms is predicted to promote EGFR1 and HER2 activation . It should be noted, however, that the corresponding truncated receptors have not been observed. Soluble isoforms composed of the extracellular domain occur naturally and suppress ligand-dependent EGFR signaling and oncogenic transformation in a dominant negative manner . Our data indirectly address expression levels of the soluble isoforms, which appear to be unchanged.
Our data suggest that HRAS acts through several isoform-specific mechanisms to promote EGFR family signaling. EGFR signaling plays known roles in cell survival, proliferation, adhesion, migration, and differentiation . Both EGFR and HER2 are currently therapeutic targets in breast cancer . Our analysis here suggests that modified regulation of alternative mRNA processing could be used as a novel means of EGFR inhibition, similar to that shown recently for HER2 using splice site switching oligonucleotides .
SplicerAV predicted isoform changes exhibit low overlap with gene expression changes
Using the same gene expression dataset, SplicerAV was able to predict a number of previously unappreciated changes in isoform specific mRNA regulation. Genes predicted to undergo isoform changes exhibited small overlap with genes predicted to undergo expression changes by conventional analysis, consistent with previous findings in the field [1, 47, 48]. HRAS and SRC over-expression resulted in the most extensive gene expression and isoform changes. Of the 212 genes predicted to undergo ideal isoform changes (significant p-value and positive splice score) in HRAS over-expression, only 8 genes (3.8%) were also among the top 212 most significant changes by conventional expression analysis (data not shown). Of the top 119 predicted isoform changes in SRC over-expression, none were in the top 119 most significant expression changes. This low degree of overlap suggests that the results obtained via SplicerAV are largely orthogonal to those of conventional gene expression analyses, and it raises the possibility of combining traditional gene expression signatures with SplicerAV isoform-based signatures to improve signature performance.
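The overlap comparison above amounts to intersecting the N isoform-change genes with the top N genes from a conventional expression ranking. A minimal sketch, using a hypothetical function name and placeholder gene names:

```python
def top_n_overlap(isoform_genes, expression_ranked):
    """Count isoform-change genes that also rank in the top N by expression.

    N is the number of isoform-change genes, as in the comparison above.
    expression_ranked is ordered from most to least significant.
    """
    n = len(isoform_genes)
    shared = set(isoform_genes) & set(expression_ranked[:n])
    return len(shared), len(shared) / n

# Placeholder gene lists for illustration only.
isoform_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
expression_ranked = ["GENE_Q", "GENE_B", "GENE_R", "GENE_S", "GENE_A"]
count, fraction = top_n_overlap(isoform_genes, expression_ranked)
```

Applied to the HRAS figures reported above, 8 shared genes out of 212 gives 8/212, or about 3.8%.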
SplicerAV predicts isoform changes in high vs. low grade breast tumors
Our analysis of oncogene regulated isoform expression demonstrated the ability to generate novel insights into cancer biology. We next determined if similar insights could be obtained from the analysis of alternative processing in clinical tumor samples. Breast cancer has been extensively studied using high-throughput analyses of gene expression at the transcriptome level (Reviewed in ). In contrast, high-throughput analysis of alternative mRNA processing in breast cancer has been addressed in only a handful of studies [12, 47]. We explored the ability of SplicerAV to detect changes in alternative processing between low and high grade breast tumors in archival expression data.
GO analysis of 241 genes predicted to undergo isoform changes between grade I and grade III breast tumors (GUYT).
Guanyl-nucleotide exchange factor
RAB3IP, RAPGEF2, GAPVD1, CD47, TRIO, ARHGEF7, AKAP13
RNF130, TTC3, UBE3B, PML, TRIM26, RBCK1, MIB1, ZNF294, ZUBR1, TRIAD3
mRNA processing factor
SYNCRIP, WDR33, SFRS8, SFRS15, TAF15, SF1, SF3B1, SFPQ, PRP6
DNAL1, NF2, KIF5C, DYNC1H1
RAB3IP, RAPGEF2, GAPVD1, CD47,
mRNA splicing factor
TAF15, SFRS8, SF1, SF3B1, SFPQ, PRP6
Tyrosine protein kinase receptor
TEK, TPR, IGF1R, PDGFRA
SplicerAV predicted isoform changes are associated with breast cancer survival
SplicerAV probeset groupings of genes identified in the GUYT training set were used to create individual sample level indexes of relative isoform abundance. We tested an association of these SplicerAV indexes in two independent validation datasets to examine whether specific isoform changes observed in high grade tumors were also associated with poor patient prognosis (see methods). Previous datasets generated by Miller  (GSE3494) and Pawitan  (GSE1456) have independently profiled breast tumor gene expression using the Affymetrix U133 A and B microarrays (probeset intensities were estimated using MAS5 ). These studies include patient outcome, providing the opportunity to test for an association of isoform changes with survival in ER positive tumors.
Isoform changes in gene expression significantly associated with patient outcomes in both validation datasets.
Association with Survival
Few studies have performed high-throughput examination of alternative processing in clinical tumor samples [12, 13] and to our knowledge no prior studies have examined changes in alternative mRNA processing directly associated with cancer patient survival. This study examined isoform specific mRNA levels in over 400 human clinical samples, providing support for the use of changes in alternative processing as potential prognostic markers in cancer.
ARHGEF7 & EIF4E2 isoform changes are associated with breast cancer survival
Whether or not individual probesets could demonstrate a consistent association with survival differed by gene. Although individual probeset behavior may represent an alternative processing event, only through comparison with other probesets for that gene can SplicerAV uncover these relevant and predictive isoforms that would go unnoticed in conventional analyses.
Combining isoform changes from multiple genes improves prediction of breast cancer survival
Similar to our in vitro analyses of oncogene over-expression, we observed low overlap between gene expression and SplicerAV changes. Of the 241 isoform changes predicted by SplicerAV in the GUYT training set that were later tested for an association with poor prognosis, only one gene (0.4%), BTD, was also among the top 241 differentially expressed genes. The orthogonality of candidate gene lists identified by SplicerAV and conventional methods suggests that these two methods detect different biological processes and may provide independent value in generating molecular classifiers. SplicerAV can generate both conventional and isoform specific gene expression analyses, and therefore provides two non-redundant datasets from one experiment.
Traditional analyses of gene expression data have considered the probeset as the basic unit of expression. Under this paradigm, the presence of multiple probesets has been viewed largely as a nuisance. Current approaches dealing with the issue of multiple probesets have used either probeset location or the mean, median, or largest probeset expression change to distill multiple probesets into a single gene level expression value. Each of these approaches would have yielded a different readout of EGFR expression changes in HRAS over-expression, making conventional interpretation inadequate for such genes. Software has even been developed whose sole purpose is the removal of discordant probeset expression values for probesets targeting the same gene .
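The disagreement between collapsing strategies is easy to see with a toy example. The sketch below uses hypothetical log2 fold changes and a hypothetical function name; it illustrates the general point and does not use the EGFR values from this study.

```python
from statistics import mean, median

def collapse(fold_changes):
    """Reduce several probeset log2 fold changes to one gene-level value, three ways."""
    return {
        "mean": mean(fold_changes),
        "median": median(fold_changes),
        "largest": max(fold_changes, key=abs),  # largest change in magnitude
    }

# Hypothetical log2 fold changes for three probesets on one gene,
# two reporting an increase and one reporting a decrease.
summary = collapse([1.2, 1.0, -0.9])
```

The three summaries give roughly 0.43, 1.0, and 1.2, and none of them conveys that one probeset moved in the opposite direction.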
We propose that for genes with multiple probesets, isoform specific expression changes may be a more appropriate means of interpreting standard microarray expression data than the current one gene = one probeset paradigm. Previous algorithms [54, 55] have examined the possibility of investigating changes in alternative processing using single probeset level data. These methods have relied on custom chips, or would not have detected events predicted by SplicerAV in this paper because such methods do not examine events spanning multiple probesets. SplicerAV provides a systematic means by which to detect and interpret inconsistent probeset behavior within the same gene, a situation where an oversimplified perspective may be obscuring relevant and important biological changes.
This study marks the first en masse analysis of mRNA isoform changes in existing conventional expression microarray data. We have shown here that re-analyzing such data using a different paradigm can uncover novel biological insights and potential prognostic markers.
The combination of material, personnel, and clinical costs of obtaining gene expression microarray data has resulted in a massive archive of these data accumulated over the past two decades. Many previously created datasets, particularly clinical datasets, are unique and cannot be reproduced. Numerous private and public repositories of microarray expression data exist, with the largest public repository, Gene Expression Omnibus, containing over 50,000 data samples from the Affymetrix U133 and U95 series alone. In this paper we demonstrate the utility of SplicerAV, the first program used to analyze these existing data en masse for isoform specific changes that can result from alternative mRNA processing.
SplicerAV algorithm details
At this step, individual probeset weights are raised to a user specified power (Wt_scale, default = 2), which allows preferential focus on more significant probeset changes in expression at the cost of removing information from less reliable probesets and reducing the power of significance tests.
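The effect of the weight exponent can be sketched as follows. The function name and example weights are hypothetical, and the final normalization to a sum of one is an assumption added here so that the relative contributions are easy to compare; the text does not state whether SplicerAV normalizes weights.

```python
def scale_weights(weights, wt_scale=2.0):
    """Raise probeset weights to the power wt_scale, then normalize to sum to 1."""
    powered = [w ** wt_scale for w in weights]
    total = sum(powered)
    return [p / total for p in powered]

# Hypothetical raw weights for three probesets targeting one gene.
raw = [0.9, 0.5, 0.1]
scaled = scale_weights(raw)          # default exponent of 2
unscaled = scale_weights(raw, 1.0)   # exponent of 1 leaves relative weights unchanged
```

With an exponent of 2, the share of the least reliable probeset drops from about 6.7% to under 1%, which illustrates the trade-off described above: more focus on strong changes, less information retained from weak ones.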
Xprbset = the log2 fold expression change of that probeset
μA = the weighted average log2 fold change in expression for probesets assigned to group A
μB = the weighted average log2 fold change in expression for probesets assigned to group B
μSingle = the weighted average log2 fold change in expression for all probesets targeting the gene
σA, σB, and σsingle for groups A, B, and all probesets are determined by expectation maximization, bounded by a minimum value of 10% to prevent over-fitting by the model. The value of 10% was chosen as a conservative limit based on empirical observations of summarized significant log2 fold probeset changes, which consistently exhibited standard deviations (σ) above 10% across analyzed datasets (data not shown).
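The comparison between a one-group and a two-group description of a gene can be sketched as below. This is a simplified stand-in, not the SplicerAV algorithm: it replaces expectation maximization with an exhaustive search over hard splits of the sorted fold changes, ignores mixing proportions, and assumes the 10% floor corresponds to 0.10 in log2 fold-change units. All function names are hypothetical.

```python
from math import log, pi, sqrt

SIGMA_FLOOR = 0.10  # assumed to be in log2 fold-change units

def weighted_fit(x, w):
    """Weighted mean and floored weighted standard deviation."""
    total = sum(w)
    mu = sum(wi * xi for xi, wi in zip(x, w)) / total
    var = sum(wi * (xi - mu) ** 2 for xi, wi in zip(x, w)) / total
    return mu, max(sqrt(var), SIGMA_FLOOR)

def weighted_loglik(x, w, mu, sigma):
    """Weighted Gaussian log-likelihood of the fold changes."""
    return sum(
        wi * (-0.5 * log(2 * pi * sigma ** 2) - (xi - mu) ** 2 / (2 * sigma ** 2))
        for xi, wi in zip(x, w)
    )

def best_two_group_split(x, w):
    """Return (log-likelihood ratio, group A values, group B values).

    Requires at least two probesets with positive weights.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ws = [w[i] for i in order]
    mu_s, sd_s = weighted_fit(xs, ws)
    single = weighted_loglik(xs, ws, mu_s, sd_s)
    best = None
    for cut in range(1, len(xs)):
        lo_x, lo_w, hi_x, hi_w = xs[:cut], ws[:cut], xs[cut:], ws[cut:]
        mu_b, sd_b = weighted_fit(lo_x, lo_w)  # "B" (down) group
        mu_a, sd_a = weighted_fit(hi_x, hi_w)  # "A" (up) group
        two = (weighted_loglik(lo_x, lo_w, mu_b, sd_b)
               + weighted_loglik(hi_x, hi_w, mu_a, sd_a))
        if best is None or two - single > best[0]:
            best = (two - single, hi_x, lo_x)
    return best

# Hypothetical genes: one with discordant probesets, one with concordant probesets.
llr_disc, group_a, group_b = best_two_group_split([1.0, 1.1, -0.9], [1.0, 1.0, 1.0])
llr_conc, _, _ = best_two_group_split([1.0, 1.05, 0.95], [1.0, 1.0, 1.0])
```

Without the floor, a group containing a single probeset would have a standard deviation of zero and an unbounded likelihood, so any gene could be split "perfectly"; the floor keeps the two-group model from winning on such degenerate fits.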
SplicerAV incorporates biologically motivated modifiers to alter the relative ranking of potential changes in alternative processing to suit the final objectives of the user. These modifiers can be adjusted by the user and do not affect the p-values reported by SplicerAV. The specified form and magnitude of these biologically motivated modifiers were empirically derived through analysis of several datasets.
Multiple Probeset Modifier
Expression Cutoff Modifier
Gene Ontology Analyses
Gene ontology (GO) analyses compared genes with SplicerAV predicted isoform changes (p < .01, splice score > 0) to a reference set of all genes evaluated for isoform changes in each condition using PANTHER [56, 57]. Non-overlapping GO categories with more than one gene were reported.
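As a rough illustration of the kind of over-representation test that underlies a GO analysis against a reference set, the sketch below computes an upper-tail binomial p-value. It is not PANTHER's implementation, the function name is hypothetical, and the counts in the example are invented for illustration.

```python
from math import comb

def binomial_enrichment_p(k, n, category_in_ref, ref_size):
    """P(X >= k) for X ~ Binomial(n, p), with p the category's share of the reference set.

    k: genes from the test list that fall in the category
    n: size of the test list
    """
    p = category_in_ref / ref_size
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Invented example: 12 of 212 listed genes fall in a category that
# contains 150 of 8000 reference genes (about 4 expected by chance).
p_value = binomial_enrichment_p(12, 212, 150, 8000)
```

Using all genes evaluated for isoform changes as the reference set, as described above, matters because only genes with multiple probesets can appear in the test list at all.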
GEO: Gene Expression Omnibus
NSMP: Negative Strand Matching Probeset
PANP: Presence-Absence calls with Negative Probesets
MLR: Maximum Likelihood Ratio
GFP: Green Fluorescent Protein
We thank Joe Nevins, Holly Dressman, Joe Lucas, and Erich Huang for helpful comments on the manuscript and Ashley Chi, Sayan Mukherjee, Uwe Ohler, and Alexander Hartemink for their suggestions and advice during the development of SplicerAV. We acknowledge funding from the NIH grants 5R01-GM63090 (MGB) and 1R01-CA127727 (MGB), the DOD grant GRANT00412169 (TJR) (Predoctoral Traineeship Award), and the SPORE grant 5P50-CA068438-10 (MD).
- Blencowe BJ: Alternative splicing: new insights from global analyses. Cell 2006, 126(1):37–47. 10.1016/j.cell.2006.06.023
- Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature 2008, 456(7221):470–476. 10.1038/nature07509
- Krawczak M, Reiss J, Cooper DN: The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Hum Genet 1992, 90(1–2):41–54. 10.1007/BF00210743
- Lopez-Bigas N, Audit B, Ouzounis C, Parra G, Guigo R: Are splicing mutations the most frequent cause of hereditary disease? FEBS Lett 2005, 579(9):1900–1903. 10.1016/j.febslet.2005.02.047
- Garcia-Blanco MA, Baraniak AP, Lasda EL: Alternative splicing in disease and therapy. Nat Biotechnol 2004, 22(5):535–546. 10.1038/nbt964
- Venables JP: Unbalanced alternative splicing and its significance in cancer. Bioessays 2006, 28(4):378–386. 10.1002/bies.20390
- Cooper TA, Wan L, Dreyfuss G: RNA and disease. Cell 2009, 136(4):777–793. 10.1016/j.cell.2009.02.011
- Takeda J, Suzuki Y, Nakao M, Barrero RA, Koyanagi KO, Jin L, Motono C, Hata H, Isogai T, Nagai K, et al.: Large-scale identification and characterization of alternative splicing variants of human gene transcripts using 56,419 completely sequenced and manually annotated full-length cDNAs. Nucleic Acids Res 2006, 34(14):3917–3928. 10.1093/nar/gkl507
- Venables JP, Klinck R, Koh C, Gervais-Bird J, Bramard A, Inkel L, Durand M, Couture S, Froehlich U, Lapointe E, et al.: Cancer-associated regulation of alternative splicing. Nat Struct Mol Biol 2009, 16(6):670–6. 10.1038/nsmb.1608
- Xu Q, Lee C: Discovery of novel splice forms and functional analysis of cancer-specific alternative splicing in human expressed sequences. Nucleic Acids Res 2003, 31(19):5635–5643. 10.1093/nar/gkg786
- He C, Zhou F, Zuo Z, Cheng H, Zhou R: A global view of cancer-specific transcript variants by subtractive transcriptome-wide analysis. PLoS ONE 2009, 4(3):e4732. 10.1371/journal.pone.0004732
- Andre F, Michiels S, Dessen P, Scott V, Suciu V, Uzan C, Lazar V, Lacroix L, Vassal G, Spielmann M, et al.: Exonic expression profiling of breast cancer and benign lesions: a retrospective analysis. Lancet Oncol 2009, 10(4):381–390. 10.1016/S1470-2045(09)70024-5
- Gardina PJ, Clark TA, Shimada B, Staples MK, Yang Q, Veitch J, Schweitzer A, Awad T, Sugnet C, Dee S, et al.: Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics 2006, 7: 325. 10.1186/1471-2164-7-325
- Xi L, Feber A, Gupta V, Wu M, Bergemann AD, Landreneau RJ, Litle VR, Pennathur A, Luketich JD, Godfrey TE: Whole genome exon arrays identify differential expression of alternatively spliced, cancer-related genes in lung cancer. Nucleic Acids Res 2008, 36(20):6535–6547. 10.1093/nar/gkn697
- Thorsen K, Sorensen KD, Brems-Eskildsen AS, Modin C, Gaustadnes M, Hein AM, Kruhoffer M, Laurberg S, Borre M, Wang K, et al.: Alternative splicing in colon, bladder, and prostate cancer identified by exon array analysis. Mol Cell Proteomics 2008, 7(7):1214–1224. 10.1074/mcp.M700590-MCP200
- Cheung HC, Baggerly KA, Tsavachidis S, Bachinski LL, Neubauer VL, Nixon TJ, Aldape KD, Cote GJ, Krahe R: Global analysis of aberrant pre-mRNA splicing in glioblastoma using exon expression arrays. BMC Genomics 2008, 9: 216. 10.1186/1471-2164-9-216
- Laajala E, Aittokallio T, Lahesmaa R, Elo LL: Probe-level estimation improves the detection of differential splicing in Affymetrix exon array studies. Genome Biol 2009, 10(7):R77. 10.1186/gb-2009-10-7-r77
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, et al.: NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 2009, (37 Database):D885–890. 10.1093/nar/gkn764
- Fodor SP, Rava RP, Huang XC, Pease AC, Holmes CP, Adams CL: Multiplexed biochemical assays with biological chips. Nature 1993, 364(6437):555–556. 10.1038/364555a0
- Pearson JL, Robinson TJ, Munoz MJ, Kornblihtt AR, Garcia-Blanco MA: Identification of the cellular targets of the transcription factor TCERG1 reveals a prevalent role in mRNA processing. J Biol Chem 2008, 283(12):7949–7961. 10.1074/jbc.M709402200
- Stalteri MA, Harrison AP: Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips. BMC Bioinformatics 2007, 8: 13. 10.1186/1471-2105-8-13
- D'Mello V, Lee JY, MacDonald CC, Tian B: Alternative mRNA polyadenylation can potentially affect detection of gene expression by affymetrix genechip arrays. Appl Bioinformatics 2006, 5(4):249–253. 10.2165/00822942-200605040-00007
- Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A, et al.: Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006, 439(7074):353–357. 10.1038/nature04296
- Ferrari F, Bortoluzzi S, Coppe A, Sirota A, Safran M, Shmoish M, Ferrari S, Lancet D, Danieli GA, Bicciato S: Novel definition files for human GeneChips based on GeneAnnot. BMC Bioinformatics 2007, 8: 446. 10.1186/1471-2105-8-446
- Lu J, Lee JC, Salit ML, Cam MC: Transcript-based redefinition of grouped oligonucleotide probe sets using AceView: high-resolution annotation for microarrays. BMC Bioinformatics 2007, 8: 108. 10.1186/1471-2105-8-108
- Yu H, Wang F, Tu K, Xie L, Li YY, Li YX: Transcript-level annotation of Affymetrix probesets improves the interpretation of gene expression data. BMC Bioinformatics 2007, 8: 194. 10.1186/1471-2105-8-194
- Warren P, Taylor D, Martini PGV, Jackson J, Bienkowska J: PANP - a New Method of Gene Detection on Oligonucleotide Expression Arrays. Proc 2007 IEEE 7th International Symposium on BioInformatics & BioEngineering, Cambridge, USA 2007, 108–115.
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res 2002, 12(6):996–1006.
- Srinivasan K, Shiue L, Hayes JD, Centers R, Fitzwater S, Loewen R, Edmondson LR, Bryant J, Smith M, Rommelfanger C, et al.: Detection and measurement of alternative splicing using splicing-sensitive microarrays. Methods 2005, 37(4):345–359. 10.1016/j.ymeth.2005.09.007
- Neel H, Gondran P, Weil D, Dautry F: Regulation of pre-mRNA processing by src. Curr Biol 1995, 5(4):413–422. 10.1016/S0960-9822(95)00082-0
- Chandler LA, Ehretsmann CP, Bourgeois S: A novel mechanism of Ha-ras oncogene action: regulation of fibronectin mRNA levels by a nuclear posttranscriptional event. Mol Cell Biol 1994, 14(5):3085–3093.
- Chandler LA, Bourgeois S: Posttranscriptional down-regulation of fibronectin in N-ras-transformed cells. Cell Growth Differ 1991, 2(8):379–384.
- Darville MI, Rousseau GG: E2F-dependent mitogenic stimulation of the splicing of transcripts from an S phase-regulated gene. Nucleic Acids Res 1997, 25(14):2759–2765. 10.1093/nar/25.14.2759
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249–264. 10.1093/biostatistics/4.2.249
- Yao H, Xu L, Feng Y, Liu D, Chen Y, Wang J: Structure-function correlation of human programmed cell death 5 protein. Arch Biochem Biophys 2009, 486(2):141–149. 10.1016/j.abb.2009.03.018
- Ullrich A, Coussens L, Hayflick JS, Dull TJ, Gray A, Tam AW, Lee J, Yarden Y, Libermann TA, Schlessinger J, et al.: Human epidermal growth factor receptor cDNA sequence and aberrant expression of the amplified gene in A431 epidermoid carcinoma cells. Nature 1984, 309(5967):418–425. 10.1038/309418a0
- Kashles O, Yarden Y, Fischer R, Ullrich A, Schlessinger J: A dominant negative mutation suppresses the function of normal epidermal growth factor receptors by heterodimerization. Mol Cell Biol 1991, 11(3):1454–1463.
- Suzuki Y, Yoshitomo-Nakagawa K, Maruyama K, Suyama A, Sugano S: Construction and characterization of a full length-enriched and a 5'-end-enriched cDNA library. Gene 1997, 200(1–2):149–156. 10.1016/S0378-1119(97)00411-3
- Reiter JL, Threadgill DW, Eley GD, Strunk KE, Danielsen AJ, Sinclair CS, Pearsall RS, Green PJ, Yee D, Lampland AL, et al.: Comparative genomic sequence analysis and isolation of human and mouse alternative EGFR transcripts encoding truncated receptor isoforms. Genomics 2001, 71(1):1–20. 10.1006/geno.2000.6341
- Jaksik R, Polanska J, Herok R, Rzeszowska-Wolny J: Calculation of reliable transcript levels of annotated genes on the basis of multiple probe-sets in Affymetrix microarrays. Acta Biochim Pol 2009, 56(2):271–7.
- Weiss GJ, Bemis LT, Nakajima E, Sugita M, Birks DK, Robinson WA, Varella-Garcia M, Bunn PA Jr, Haney J, Helfrich BA, et al.: EGFR regulation by microRNA in lung cancer: correlation with clinical response and survival to gefitinib and EGFR expression in cell lines. Ann Oncol 2008, 19(6):1053–1059. 10.1093/annonc/mdn006
- Sandberg R, Neilson JR, Sarma A, Sharp PA, Burge CB: Proliferating cells express mRNAs with shortened 3' untranslated regions and fewer microRNA target sites. Science 2008, 320(5883):1643–1647. 10.1126/science.1155390
- Basu A, Raghunath M, Bishayee S, Das M: Inhibition of tyrosine kinase activity of the epidermal growth factor (EGF) receptor by a truncated receptor form that binds to EGF: role for interreceptor interaction in kinase regulation. Mol Cell Biol 1989, 9(2):671–677.
- Adamson ED, Wiley LM: The EGFR gene family in embryonic cell activities. Curr Top Dev Biol 1997, 35: 71–120. 10.1016/S0070-2153(08)60257-4
- Browne BC, O'Brien N, Duffy MJ, Crown J, O'Donovan N: HER-2 signaling and inhibition in breast cancer. Curr Cancer Drug Targets 2009, 9(3):419–438. 10.2174/156800909788166484
- Wan J, Sazani P, Kole R: Modification of HER2 pre-mRNA alternative splicing and its effects on breast cancer cells. Int J Cancer 2009, 124(4):772–777. 10.1002/ijc.24052
- Li C, Kato M, Shiue L, Shively JE, Ares M Jr, Lin RJ: Cell type and culture condition-dependent alternative splicing in human breast cancer cells revealed by splicing-sensitive microarrays. Cancer Res 2006, 66(4):1990–1999. 10.1158/0008-5472.CAN-05-2593
- Zhang C, Li HR, Fan JB, Wang-Rodriguez J, Downs T, Fu XD, Zhang MQ: Profiling alternatively spliced mRNA isoforms for prostate cancer classification. BMC Bioinformatics 2006, 7: 202. 10.1186/1471-2105-7-202
- Sotiriou C, Pusztai L: Gene-expression signatures in breast cancer. N Engl J Med 2009, 360(8):790–800. 10.1056/NEJMra0801289
- Loi S, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, Gillet C, Ellis P, Ryder K, Reid JF, et al.: Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC Genomics 2008, 9: 239. 10.1186/1471-2164-9-239
- Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, et al.: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci USA 2005, 102(38):13550–13555. 10.1073/pnas.0506230102
- Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, et al.: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res 2005, 7(6):R953–964. 10.1186/bcr1325
- Affymetrix: Microarray Suite User Guide, Version 5. Affymetrix; 2001. [http://www.affymetrix.com/support/technical/manuals.affx]
- Fan W, Khalid N, Hallahan AR, Olson JM, Zhao LP: A statistical method for predicting splice variants between two groups of samples using GeneChip expression array data. Theor Biol Med Model 2006, 3: 19. 10.1186/1742-4682-3-19
- Hu GK, Madore SJ, Moldover B, Jatkoe T, Balaban D, Thomas J, Wang Y: Predicting splice variant from DNA chip expression data. Genome Res 2001, 11(7):1237–1245. 10.1101/gr.165501
- Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 2003, 13(9):2129–2141. 10.1101/gr.772403
- Thomas PD, Kejariwal A, Guo N, Mi H, Campbell MJ, Muruganujan A, Lazareva-Ulitsky B: Applications for protein sequence-function evolution data: mRNA/protein expression analysis and coding SNP scoring tools. Nucleic Acids Res 2006, (34 Web Server):W645–650. 10.1093/nar/gkl229
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.