A random effects model for the identification of differential splicing (REIDS) using exon and HTA arrays

Van Moerbeke, Marijke; Kasim, Adetayo; Talloen, Willem; Reumers, Joke; Göhlmann, Hinrick W. H.; Shkedy, Ziv

doi:10.1186/s12859-017-1687-8

Methodology Article
Open access
Published: 25 May 2017

Marijke Van Moerbeke ORCID: orcid.org/0000-0002-3097-5621¹,
Adetayo Kasim²,
Willem Talloen³,
Joke Reumers³,
Hinrick W. H. Göhlmann³ &
Ziv Shkedy¹

BMC Bioinformatics volume 18, Article number: 273 (2017)

Abstract

Background

Alternative gene splicing is a common phenomenon in which a single gene gives rise to multiple transcript isoforms. The process is strictly guided and involves a multitude of proteins and regulatory complexes. Unfortunately, aberrant splicing events do occur which have been linked to genetic disorders, such as several types of cancer and neurodegenerative diseases (Fan et al., Theor Biol Med Model 3:19, 2006). Therefore, understanding the mechanism of alternative splicing and identifying the difference in splicing events between diseased and healthy tissue is crucial in biomedical research with the potential of applications in personalized medicine as well as in drug development.

Results

We propose a linear mixed model, Random Effects for the Identification of Differential Splicing (REIDS), for the identification of alternative splicing events. A decision regarding alternative splicing is based on two scores, an exon score and an array score. The model makes it possible to distinguish a differentially expressed gene from a differentially spliced exon. The proposed model was applied to three case studies concerning both exon and HTA arrays.

Conclusion

The REIDS model provides a workflow for the identification of alternative splicing events that relies on the established linear mixed model. The model can be applied to different types of arrays.

Background

Alternative splicing (AS) was considered to be an uncommon phenomenon until microarray and high-throughput sequencing technology enabled whole genome expression profiling [1]. More than 90% of human genes exhibit multiple transcript isoforms due to exon enrichment or depletion in mRNA transcription [2–4]. Since transcript isoforms of a single gene have been observed to vary between tissues and even between developmental stages, alternative splicing has been proposed as a primary driver of evolution and phenotypic complexity in mammals [5–7]. Aberrant splice variants, however, have been linked to cancers such as mammary tumorigenesis and ovarian cancer [8]. Although the underlying relationship between aberrant splicing events and cancer is often not (yet) established, the potential exists to develop new diagnostic and therapeutic interventions when more insights are gained [9]. Therefore, a better understanding of the mechanism of alternative splicing and identification of the differences in splicing events between diseased and healthy tissues is considered crucial in cancer and other medical research [10]. By measuring the relative amounts of distinct splice forms, one can test whether a new splice form really constitutes an important fraction of a gene’s transcripts in at least some cell types. This type of research could reveal patterns of regulation across a large number of different tissues [11]. Several alternative splicing detection methods have been proposed with the development of RNA sequencing (RNASeq) [12] and microarray platforms such as the Affymetrix Exon ST arrays [13] and the Human Transcriptome Arrays 2.0 [14]. Recent studies emphasize the complementary nature of RNASeq and microarrays; combined, the strengths of each technology might compensate for the reported weaknesses of the other. 
The primary advantage of RNASeq is its potential to explore the entire diversity of the transcriptome, while the microarray has the ability to measure lower abundance transcripts [15]. Since RNASeq is not able to properly account for low abundance transcripts and its competitive detection, the resulting library diversity will be limited [16, 17]. The limited diversity can be resolved by relying on the technology of exon and HTA arrays. Methods for alternative splicing detection using RNASeq include MATS, DEXSeq and Cufflinks [18–20]. However, these have been shown to be insufficient [21]. Alternative splicing has been studied with microarray platforms as well, resulting in a variety of methods. The Microarray Detection of Alternative Splicing (MiDAS) method employs gene-level normalized exon intensities in an ANOVA model based on a Splicing Index (SI) [13, 22]. The SI method normalizes the exon level expression intensities by their corresponding gene level intensities, and compares these normalized intensities between sample groups. Another ANOVA based method is the so-called Analysis Of Splice VAriation (ANOSVA) [23], which fits a linear model to the observed data with the aim of identifying non-zero interaction terms between the sample groups and the exons. However, it has been argued that the ANOSVA method performs poorly [13]. The Probe Level Alternative Transcript Analysis (PLATA) method is based on the normalization of probe level intensities: first, the probe-wise intensities are normalized using gene level summarized values; afterwards, the group averages of these normalized intensities are compared by considering all measurements across probes and arrays as independent [24]. The probe level SI estimation procedure for detecting differential splicing (PECA-SI method) detects alternative splicing based on a probe level splicing index instead of the exon level index used by MiDAS [25]. 
PECA-SI outperforms other existing methods except for Finding Isoforms using Robust Multichip Arrays (FIRMA) [25, 26]. In contrast to other methods, FIRMA formulates alternative splicing identification as an outlier detection problem. It is based on the residuals of the Robust Multichip Analysis (RMA) [27]. A recent method is Robust Alternative Splicing Analysis for Human Transcriptome Arrays (RASA) [28], which was applied to HTA arrays and uses exon junction information in the identification of alternative splicing. In this paper, we propose a new modelling approach for the detection of AS, namely the Random Effects for the Identification of Differential Splicing (REIDS) model. This model identifies splicing events based on a set of two scores: an array score, which is used to identify samples containing an alternatively spliced exon, and an exon score, which is used to prioritize spliced exons. The array scores have an intuitive interpretation as the deviation of the exon from the overall gene expression. The REIDS method was compared with FIRMA, the existing preferred method for alternative splicing detection, using simulated data and two real-life exon array studies. A third case study based on HTA illustrates how the REIDS method enables the disentanglement of differentially expressed genes and differentially spliced exons. The data and the proposed random effects model are introduced in the Methods section. Next, the model is applied to three case studies in the Results section. The paper ends with a discussion and conclusion. Illustrations are based on the R packages biomaRt and GenomeGraphs [29]. REIDS is currently bundled in a package publicly available on R-Forge.

Methods

Data

Three data sets are used to illustrate the proposed random effects model for the identification of alternative splicing.

The tissue data

The tissue data was obtained with the GeneChip® Human Exon 1.0 ST array. The array is a whole genome array containing only perfect match (PM) probes, with a small number of generic mismatch probes for the purposes of background correction. A probe set identifies an exon using four perfect match probes. There are no probes which span exon-exon junctions [30]. The data set consists of triplicates from 11 tissues, 33 arrays in total. This data set was also used to illustrate the FIRMA method [26] and is publicly available on the Affymetrix website.

The colon cancer data

The colon cancer data was also generated with the GeneChip® Human Exon 1.0 ST array and contains 10 paired tumor-normal cancer samples. The data was analyzed before [9, 26] and is publicly available on the Affymetrix website.

The HTA data

The Human Transcriptome Array (HTA) is a recent microarray platform of Affymetrix. It is an expansion of the Human Exon array containing 10 probes per probe set. In addition, the HTA array covers exon-exon junctions, each of which is supported by four probes. The data was provided by Janssen Pharmaceutica, Belgium and contains measurements on seven tissues with three replicates each. An annotation file connecting the exon level to the gene level was taken from the Brainarray website [31]. As the provided CDF file does not yet annotate the junctions on the array, exon junctions are not considered in this paper.

Models for the detection of alternative splicing

In this section we present the REIDS model for the detection of alternative splicing.

Finding Isoforms using Robust Multichip Arrays (FIRMA)

We begin with a brief description of the FIRMA model. The FIRMA algorithm for the detection of alternative splicing events relies on the RMA preprocessing approach [26, 27]. The algorithm consists of background correction, normalization and summarization of probe level data into gene level data, with one value per combination of gene and array. The gene level summarization is done by fitting an additive model on probe intensities:

$$ Y_{ij} = c_{i}+p_{j}+\epsilon_{ij}. $$

(1)

Here, $Y_{ij}$ denotes the log2-transformed intensity of probe $j$ on array $i$. The parameter $p_{j}$ denotes the average value of probe $j$, $c_{i}$ represents the summarized gene level intensity of array $i$, while the residual of probe $j$ of array $i$ is denoted by $\epsilon_{ij}$. The unknown parameters in the model are estimated using a median polish algorithm to ensure robust estimates of the summarized gene level intensities against outlying probes. The RMA model for summarization at the gene level can be extended to summarization at the exon level:

$$ Y_{ijk} = c_{i}+e_{k}+d_{ik}+p_{j}+\epsilon_{ijk}. $$

(2)

The effect of exon $k$ is denoted by $e_{k}$, while $d_{ik}$ represents the interaction between array $i$ and exon $k$, and $\epsilon_{ijk}$ is the residual of probe $j$, which belongs to exon $k$, in array $i$. Since the probes are nested within exons, the exon effect $e_{k}$ is absorbed into the probe effect $p_{j}$. When the interaction between the exon and the array is ignored, the information about alternative splicing is left to be absorbed into the residual [26]. This is a crucial point since it implies that alternatively spliced exons will have substantially higher residuals for some arrays than for others, which motivates the definition of the FIRMA score as

$$ F_{ik}=\underset{j}{\operatorname{median}}\,\left(\epsilon_{ijk}/s\right). $$

(3)

Here, probe $j$ is assumed to belong to exon $k$ $(j=1,\ldots,n_{k})$ and $s$ is the MAD (Median Absolute Deviation), which allows comparisons across genes. An exon is declared alternatively spliced whenever $F_{ik}$ is large [26].
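As an illustration of the steps above, the following minimal Python sketch fits the additive model by median polish and turns the residuals into scores in the spirit of Eq. (3). It is not the FIRMA implementation: the function names (`median_polish`, `firma_scores`, `toy_gene`), the number of sweeps, the toy data, and the choice to compute $s$ as the MAD of all residuals of the gene are assumptions made for this example.

```python
import random
from statistics import median


def median_polish(y, n_iter=10):
    """Fit y[i][j] = row[i] + col[j] + resid[i][j] by alternating median sweeps."""
    resid = [row[:] for row in y]
    row_eff = [0.0] * len(y)
    col_eff = [0.0] * len(y[0])
    for _ in range(n_iter):
        # Sweep the median out of every array (row).
        for i, row in enumerate(resid):
            m = median(row)
            row_eff[i] += m
            resid[i] = [r - m for r in row]
        # Sweep the median out of every probe (column).
        for j in range(len(col_eff)):
            m = median(row[j] for row in resid)
            col_eff[j] += m
            for row in resid:
                row[j] -= m
    return row_eff, col_eff, resid


def firma_scores(y, exon_of_probe):
    """Return {(array index, exon): median of the scaled residuals of that exon}."""
    _, _, resid = median_polish(y)
    flat = [r for row in resid for r in row]
    centre = median(flat)
    # Assumption of this sketch: s is the MAD of all residuals of the gene.
    s = median(abs(r - centre) for r in flat)
    scores = {}
    for i, row in enumerate(resid):
        for k in sorted(set(exon_of_probe)):
            scores[(i, k)] = median(
                r / s for r, e in zip(row, exon_of_probe) if e == k)
    return scores


def toy_gene(seed=1):
    """Hypothetical gene: 4 arrays, exons A, B, C with 4 probes each."""
    rng = random.Random(seed)
    exon_of_probe = [k for k in "ABC" for _ in range(4)]
    probe_eff = [rng.uniform(6.0, 10.0) for _ in exon_of_probe]
    array_eff = [0.0, 0.5, -0.5, 0.2]
    y = [[c + p + rng.uniform(-0.1, 0.1) for p in probe_eff]
         for c in array_eff]
    for j, k in enumerate(exon_of_probe):
        if k == "C":
            y[3][j] -= 3.0  # exon C is depleted on the last array only
    return y, exon_of_probe
```

Calling `firma_scores(*toy_gene())` should give the score with the largest magnitude for exon C on the last array, because the depletion is not captured by the additive fit and therefore ends up in the residuals.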

The REIDS model

The alternative splicing detection problem can be formulated as a variance decomposition problem in a random effects model. The underlying assumption is that the between array variability of an alternatively spliced exon will be higher than the within array variability among the exons of the same gene. Similar to FIRMA, we define a linear model for the probe intensities:

$$ Y_{ijk} = p_{j}+d_{ik}+\epsilon_{ijk}. $$

(4)

The background noise is assumed to follow a normal distribution, $\epsilon_{ijk}\sim N(0,\sigma^{2})$, and it captures the within array variability $(\sigma^{2})$ across all exons of the same gene. In contrast to the FIRMA model, the parameter $d_{ik}$ is decomposed into an average gene intensity per array $i$, $c_{i}$, and an exon specific deviation from this average gene intensity, $b_{ik}$,

$$ d_{ik} = c_{i}+b_{ik}, $$

(5)

where $b_{ik}\sim N(0,\mathbf{D})$. The covariance matrix $\mathbf{D}$ is a $K\times K$ diagonal matrix containing the between array variabilities $\left(\tau^{2}_{k}\right)$ for each exon. The model formulation in Eqs. (4) and (5) can be combined into a single model consisting of both the fixed effects ($p_{j}$ and $c_{i}$) and the random effects ($b_{ik}$). The combined mixed effects model is given by:

$$ Y_{ijk} = p_{j}+c_{i}+b_{ik}+\epsilon_{ijk}, $$

(6)

in which the random effects $b_{ik}\sim N(0,\mathbf{D})$ are assumed to be independent of the background noise $\epsilon_{ijk}\sim N(0,\sigma^{2})$. Figure 1 illustrates the mean structure of the REIDS model presented in (5) for a scenario in which the gene is not differentially expressed and the $k$th exon is alternatively spliced. The exon is related to four probes. This results in four probe effects $p_{1}$, $p_{2}$, $p_{3}$ and $p_{4}$, which represent an average of the probe values across all arrays. The array effects in the REIDS model, $c_{1a}, c_{1b}, \ldots, c_{2b}$, are used to measure the differences between the arrays. The deviation of the probes from the gene level will be captured by a random effect per sample, $b_{1ak}, b_{1bk}, \ldots, b_{2ck}$, which are, as mentioned above, assumed to follow a normal distribution with variance $\tau^{2}_{k}$. The remaining variation of probe $j$ of exon $k$ in array $i$ is captured by the error term $\epsilon_{ijk}$. Hence, the model splits the total variability of the probe intensities of an exon $k$ into the variability which can be accounted for by the arrays, $\tau^{2}_{k}$, and the remaining variability, $\sigma^{2}$.

REIDS Scores for Quantification of Alternative Splicing

The advantage of a mixed model formulation for alternative splicing detection is the existence of a standard score for every exon in every sample which quantifies the trade-off between signal and noise. We refer to this score as the exon score. The exon score for the $k$th exon in a gene is defined as:

$$\rho_{k}= \tau^{2}_{k}/\left(\sigma^{2}+\tau^{2}_{k}\right). $$

It intuitively follows from this definition that an equity threshold for the exon score is 0.5, the value at which the between array variability $\tau^{2}_{k}$ equals the within array variability $\sigma^{2}$. Note that this threshold can be adapted depending on the amount of signal in a microarray data set. Given that exon $k$ has been identified to have substantial variation between the arrays, the estimated random effects $b_{ik}$ per array per exon can be used as array scores to quantify the degree of alternative splicing per array. Arrays enriched or depleted with exon $k$ will have array scores that deviate from zero. It should be noted that the array scores are expected to be correlated with the FIRMA scores for an alternatively spliced exon, as both the random effects of the REIDS model and the residuals of the FIRMA model will be large. The combination of an exon score and an array score enables us to differentiate between differential expression of a gene and differential splicing of an exon. Four scenarios can be distinguished, for which illustrations can be found in Section 2 of Additional file 1.

The first scenario describes a gene that is not differentially expressed between the arrays and has no alternatively spliced exons. This implies that exon intensities are similar across all arrays. In this case it is expected that $\tau ^{2}_{1}=\ldots =\tau ^{2}_{K}=\tau ^{2}$ and $\tau^{2} \ll \sigma^{2}$. As a consequence, the exon score $\rho_{k}$ will be low and the exons should not be identified by the model.
The second scenario consists of a non-differentially expressed gene that contains an alternatively spliced exon $k$ and non-alternatively spliced exons $k-$. For the alternatively spliced exon, it is expected that $\tau ^{2}_{k} > \tau ^{2}_{k-}$ with $\tau ^{2}_{k} \gg \sigma ^{2}$ and $\tau ^{2}_{k-} \ll \sigma ^{2}$. The exon score for this probe set $k$ will be high. As an acceptable $\rho_{k}$ is present, a test on the array scores can be conducted in order to identify biologically induced splicing associated with the experimental conditions or tissue types.
The third scenario corresponds to a differentially expressed gene with no alternatively spliced exons. Again it is expected that $\tau ^{2}_{1}=\ldots =\tau ^{2}_{K}=\tau ^{2}$. Since there is a natural difference between the gene levels of the arrays here, it will be the case that $\tau^{2} \gg \sigma^{2}$ and that the exon scores are high. A test on the array scores will conclude the absence of alternatively spliced exons since the scores will not be associated with experimental conditions or tissue types.
The fourth scenario is a differentially expressed gene with an alternatively spliced exon. For the alternatively spliced exons, the same reasoning applies as when the gene is not differentially expressed. The non-alternatively spliced exons will show enough signal in the exon score, but a test between the array scores will show no association with experimental conditions or tissue types.
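The way the two scores combine across these scenarios can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the outcome labels and the example variances are hypothetical, and the flag `arrays_differ` stands in for the outcome of whatever test is applied to the array scores.

```python
def exon_score(tau2, sigma2):
    """Exon score rho_k = tau_k^2 / (sigma^2 + tau_k^2)."""
    return tau2 / (sigma2 + tau2)


def call_exon(tau2, sigma2, arrays_differ, threshold=0.5):
    """Combine the exon score with the outcome of a test on the array scores.

    arrays_differ: True if the array scores are associated with the
    experimental conditions or tissue types (hypothetical input flag).
    """
    rho = exon_score(tau2, sigma2)
    if rho <= threshold:
        # Scenario 1: between array variability is small relative to noise.
        return "no signal"
    if arrays_differ:
        # Spliced exon in scenarios 2 and 4.
        return "alternatively spliced"
    # Scenario 3, and the non-spliced exons of scenario 4.
    return "signal without splicing"


# Hypothetical variance components, one exon per line.
print(call_exon(tau2=0.05, sigma2=1.0, arrays_differ=False))
print(call_exon(tau2=4.0, sigma2=1.0, arrays_differ=True))
print(call_exon(tau2=4.0, sigma2=1.0, arrays_differ=False))
```

The point of the sketch is that a high exon score alone is not enough to call an exon: only the additional association of the array scores with the groups separates splicing from differential expression of the whole gene.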

Estimation of the Model Parameters

The parameters of the proposed mixed effects model are estimated within the Bayesian framework with vague proper priors since the full conditional posterior distributions for the parameters of interest are known. Let $\mathbf{D}$ be a $K\times K$ diagonal covariance matrix of $\tau ^{2}_{1},\tau ^{2}_{2},\cdots,\tau ^{2}_{K}$ for which an Inverse-Wishart prior was assumed, i.e., $\mathbf{D}\sim \text{Inverse-Wishart}(\psi,\mathbf{\Omega})$. An inverse gamma prior was specified for $\sigma^{2}$, i.e., $1/\sigma^{2}\sim \text{Gamma}(\alpha,\beta)$. The full conditional posterior distributions for the parameters of interest are given by

$$P\left(\mathbf{b_{i}}|\mathbf{p},\mathbf{c},\mathbf{D},\sigma^{2}\right) = N_{K}\left(\mathbf{\Phi},\mathbf{\Upsilon}^{-1}\right). $$

Here, $\mathbf{\Upsilon}=\mathbf{D}^{-1}+\sigma^{-2}\mathbf{n}_{i}$, where $\mathbf{n}_{i}$ is a $K$ vector of the number of probes per exon. Further, $\mathbf{\Phi}=\mathbf{\Upsilon}^{-1}\mathbf{\Theta}'$, where $\mathbf{\Theta}$ is a $K$ vector of $\sigma ^{-2}\sum _{j,k} (\log_{2}(PM_{ijk(j)})-p_{k}-c_{i})$. Next, the full conditional posterior distribution for $\mathbf{D}$, the matrix of the between array variability, is

$$P\left(\mathbf{D}|\mathbf{b},\mathbf{p},\mathbf{c},\sigma^{2}\right) = \text{Inverse-Wishart}(\psi + n,\mathbf{\Omega}+\mathbf{b}'\mathbf{b}), $$

where $\mathbf{\Omega}$ is the $K\times K$ identity matrix, $n$ is the number of arrays, and $\psi$ is set to the number of exons. Finally, the full conditional distribution for $1/\sigma^{2}$ is

$$P(1/\sigma^{2}|\mathbf{b},\mathbf{p},\mathbf{c},\mathbf{D}) = \text{Gamma}(\alpha+0.5N,\eta), $$

where $\eta = \beta +0.5\sum _{i,j,k}(Y_{ijk(j)}-p_{k}-c_{i}-b_{ik}z_{ik})^{2}$ with $\alpha=\beta=0.0001$, and $N$ is the number of observations over all arrays, exons and probes. Using a Gibbs sampler, we generate posterior samples for the parameters by iteratively sampling from their full conditional posterior distributions, conditioning on the parameter values sampled at the immediately preceding iteration. The posterior point estimates and the credible intervals for the parameters are based on the MCMC chains after discarding the burn-in parts.
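The sampling scheme can be sketched for a single gene in plain Python. This is a simplified illustration and not the REIDS package code. The following are assumptions of the sketch: flat priors on $p_{j}$ and $c_{i}$, independent inverse-gamma draws for each $\tau^{2}_{k}$ as a stand-in for the Inverse-Wishart step, the same hyperparameter value for all gamma priors, and hypothetical function names and simulated data.

```python
import random
from statistics import median


def _avg(values):
    values = list(values)
    return sum(values) / len(values)


def gibbs_exon_scores(y, exon_of_probe, n_iter=1000, burn_in=400,
                      hyper=1e-4, seed=0):
    """Posterior mean exon score per exon; y[i][j] = log2 intensity of
    probe j on array i, exon_of_probe[j] = exon label of probe j."""
    rng = random.Random(seed)
    n, n_probes = len(y), len(y[0])
    exons = sorted(set(exon_of_probe))
    probes = {k: [j for j, e in enumerate(exon_of_probe) if e == k]
              for k in exons}
    # Starting values: probe means, array medians, no exon deviations.
    p = [_avg(y[i][j] for i in range(n)) for j in range(n_probes)]
    c = [median(y[i][j] - p[j] for j in range(n_probes)) for i in range(n)]
    b = [{k: 0.0 for k in exons} for _ in range(n)]
    tau2 = {k: 1.0 for k in exons}
    sigma2 = 1.0
    kept = {k: [] for k in exons}
    for it in range(n_iter):
        # b_ik | rest ~ N(Phi, 1/Upsilon), Upsilon = 1/tau2_k + n_k/sigma2.
        for i in range(n):
            for k in exons:
                ups = 1.0 / tau2[k] + len(probes[k]) / sigma2
                theta = sum(y[i][j] - p[j] - c[i] for j in probes[k]) / sigma2
                b[i][k] = rng.gauss(theta / ups, ups ** -0.5)
        # Fixed effects under flat priors (assumption of this sketch).
        for j in range(n_probes):
            k = exon_of_probe[j]
            m = _avg(y[i][j] - c[i] - b[i][k] for i in range(n))
            p[j] = rng.gauss(m, (sigma2 / n) ** 0.5)
        for i in range(n):
            m = _avg(y[i][j] - p[j] - b[i][exon_of_probe[j]]
                     for j in range(n_probes))
            c[i] = rng.gauss(m, (sigma2 / n_probes) ** 0.5)
        # Re-centre c so the overall level sits in p (fitted values unchanged).
        shift = _avg(c)
        c = [ci - shift for ci in c]
        p = [pj + shift for pj in p]
        # tau2_k | rest: inverse-gamma stand-in for the Inverse-Wishart step.
        for k in exons:
            rate = hyper + 0.5 * sum(b[i][k] ** 2 for i in range(n))
            tau2[k] = 1.0 / rng.gammavariate(hyper + 0.5 * n, 1.0 / rate)
        # 1/sigma2 | rest ~ Gamma(alpha + N/2, eta); gammavariate takes a scale.
        sse = sum((y[i][j] - p[j] - c[i] - b[i][exon_of_probe[j]]) ** 2
                  for i in range(n) for j in range(n_probes))
        sigma2 = 1.0 / rng.gammavariate(hyper + 0.5 * n * n_probes,
                                        1.0 / (hyper + 0.5 * sse))
        if it >= burn_in:
            for k in exons:
                kept[k].append(tau2[k] / (tau2[k] + sigma2))
    return {k: _avg(v) for k, v in kept.items()}


def simulate_gene(n_per_group=10, seed=1):
    """Toy gene: exons A, B, C (4 probes each); exon C is enriched by
    2 log2 units in the first group only (hypothetical values)."""
    rng = random.Random(seed)
    exon_of_probe = [k for k in "ABC" for _ in range(4)]
    probe_eff = [rng.uniform(6.0, 10.0) for _ in exon_of_probe]
    y = []
    for i in range(2 * n_per_group):
        extra = 2.0 if i < n_per_group else 0.0
        y.append([probe_eff[j] + (extra if k == "C" else 0.0)
                  + rng.gauss(0.0, 0.22)
                  for j, k in enumerate(exon_of_probe)])
    return y, exon_of_probe
```

With `gibbs_exon_scores(*simulate_gene())`, one would expect the exon score of exon C to lie well above the 0.5 threshold and above the scores of exons A and B, which corresponds to the second scenario described earlier. A full implementation would also retain the draws of $b_{ik}$ to obtain the array scores.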

Identification of Alternative Splicing Events

There are two main types of alternative splicing detection: (1) detection of sample-specific alternative splicing and (2) detection of differential splicing between two or more experimental conditions. Figure 2 illustrates the flexible framework of the mixed model and how it can be used to investigate either sample-specific alternative splicing or differential splicing between experimental groups. First, the REIDS method is applied to each gene to obtain array and exon scores, after which the probe sets are prioritized according to their exon scores. Probe sets with exon scores greater than a pre-specified threshold $(0<\rho<1)$ are retained for further investigation. The exon scores directly reflect the heterogeneity between samples and, consequently, a probe set with a high exon score implies enrichment or depletion of the exon in some of the samples. A prioritized probe set is considered to be expressed in a subset of samples if the array scores for some samples are further away from zero compared to the other samples, or if the samples have the maximum array score for exon enrichment or the minimum array score for exon depletion. For the detection of differential splicing between two or more experimental conditions, the exon scores also reflect heterogeneity between arrays. This does not imply that such heterogeneity is associated with experimental conditions. Heterogeneity between arrays captured by exon scores is a necessary but not a sufficient criterion for differential splicing detection. We recommend using the array scores as input to a t-test for independent arrays, or a paired t-test for paired arrays, to test whether the array scores are significantly different between experimental conditions. Other relevant tests might also be performed, as the framework is flexible and allows many types of downstream analyses. Finally, the prioritized exons are ranked according to their corresponding p-values or t-statistics.
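The prioritization and testing steps for two experimental groups can be sketched as follows. The function names, the input layout (one dictionary of exon scores and one of array scores per probe set) and the default threshold are assumptions of this example. The sketch ranks by the absolute t-statistic only, since computing p-values would require a t-distribution function outside the Python standard library.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y):
    """Pooled-variance t-statistic for two independent groups of arrays."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1.0 / nx + 1.0 / ny))


def paired_t(x, y):
    """Paired t-statistic; x[i] and y[i] come from the same subject."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / sqrt(variance(d) / len(d))


def rank_probe_sets(exon_scores, array_scores, groups, threshold=0.5):
    """Keep probe sets whose exon score exceeds the threshold, then rank
    them by the absolute t-statistic of their array scores."""
    first = groups[0]
    ranked = []
    for probe_set, rho in exon_scores.items():
        if rho <= threshold:
            continue  # not enough between array heterogeneity
        scores = array_scores[probe_set]
        g1 = [s for s, g in zip(scores, groups) if g == first]
        g2 = [s for s, g in zip(scores, groups) if g != first]
        ranked.append((probe_set, two_sample_t(g1, g2)))
    return sorted(ranked, key=lambda item: -abs(item[1]))
```

For paired designs such as tumor-normal pairs, `paired_t` would replace `two_sample_t` inside `rank_probe_sets`.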

Exclusion of Non-Informative Probe Sets

Alternative splicing detection is known to suffer from a large number of false positives when many probes in a probe set are non-informative. Therefore, filtering has been recommended as a step prior to alternative splicing detection [9, 26]. A non-informative probe set can be defined by a lack of coherence among its probes. By evaluating the intra-probe set correlation, a non-responsive probe set can be identified as such and excluded prior to alternative splicing detection based on informative calls. The concept of informative or non-informative calls was introduced for arrays by applying a factor analysis model to calculate a score of informativeness based on signal to noise ratio [32]. We used a mixed model framework for Informative/Non-Informative calls (I/NI calls) to identify and exclude non-responsive probe sets based on an intra-probe set correlation as a filtering score [33].

Results

In this section we present the analysis of the three case studies presented in the “Background” section. All data sets are pre-processed using the R package aroma.affymetrix [34]. The raw .CEL files are background corrected with the RMA background correction, normalized with quantile normalization and log2-transformed [27], resulting in probe level intensities to which first the I/NI calls and then the REIDS model are applied. For the first case study, the tissue data, we illustrate the method on three genes for which several probe sets were identified as alternatively spliced. For the second case study, the colon cancer data, we present the results for 24 validated genes. The third case study, the HTA data, shows examples of the four scenarios described above.
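To make the normalization step concrete, the sketch below shows quantile normalization followed by a log2 transformation in plain Python. It is not the aroma.affymetrix implementation: the function names are hypothetical, the RMA background correction is omitted, and ties are broken by position, which is a simplification.

```python
from math import log2


def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    arrays[i] holds the background corrected intensities of array i;
    all arrays are assumed to have the same number of probes.
    """
    n_arrays = len(arrays)
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    rank_means = [sum(s[r] for s in sorted_arrays) / n_arrays
                  for r in range(n_probes)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=a.__getitem__)
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def preprocess(arrays):
    """Quantile normalize, then log2-transform the probe intensities."""
    return [[log2(v) for v in a] for a in quantile_normalize(arrays)]


# Two hypothetical arrays with three probes each.
print(quantile_normalize([[2.0, 8.0, 4.0], [16.0, 4.0, 8.0]]))
```

After normalization every array contains the same set of values and differs only in which probe holds which value, so that remaining differences between arrays reflect the ordering of probe intensities and not array-wide intensity shifts.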