A random effects model for the identification of differential splicing (REIDS) using exon and HTA arrays

Background Alternative gene splicing is a common phenomenon in which a single gene gives rise to multiple transcript isoforms. The process is strictly guided and involves a multitude of proteins and regulatory complexes. Aberrant splicing events nevertheless occur, and they have been linked to genetic disorders such as several types of cancer and neurodegenerative diseases (Fan et al., Theor Biol Med Model 3:19, 2006). Understanding the mechanism of alternative splicing and identifying differences in splicing events between diseased and healthy tissue is therefore crucial in biomedical research, with potential applications in personalized medicine as well as in drug development.
Results We propose a linear mixed model, Random Effects for the Identification of Differential Splicing (REIDS), for the identification of alternative splicing events. A decision regarding alternative splicing is based on two scores, an exon score and an array score. The model makes it possible to distinguish a differentially expressed gene from a differentially spliced exon. The proposed model was applied to three case studies concerning both exon and HTA arrays.
Conclusion The REIDS model provides a workflow for the identification of alternative splicing events that relies on the established linear mixed model. The model can be applied to different types of arrays.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1687-8) contains supplementary material, which is available to authorized users.


Introduction
The four scenarios are presented in Section 4.3 for the HTA data. Supplementary Figures 2 and 3 illustrate a differentially expressed gene with an alternatively spliced and a non-alternatively spliced exon, respectively, for the tissue data. Supplementary Figure 2 visualizes probe set 2736397, which was mapped to a transcript cluster of the differentially expressed PDLIM5 gene. Among the significant probe sets, this probe set ranks third by exon score, with a score of 0.92. The gene is differentially expressed when the heart, muscle, thyroid and prostate tissues are taken as group 1 and the remaining tissues as group 2. Probe set 2736397 is clearly present in group 1 and depleted in group 2. The density plot of the array scores shows a bimodal distribution that separates the groups of interest from each other. Supplementary Figure 3 shows probe set 3023189 of the differentially expressed gene FLNC, which was selected for the tissue data. The probe set has an exon score of 0.12, which implies that the signal-to-noise ratio within the transcript cluster is relatively small. The gene itself is observed to be differentially expressed between the groups of interest.
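To make the reading of an exon score as a signal-to-noise measure concrete, the following is a minimal sketch. It assumes the score is the share of exon-level (signal) variance in the sum of exon-level and residual (noise) variance. This formula, the function name `exon_score` and the variance values are assumptions chosen for illustration; they are not quoted from this text.

```python
def exon_score(signal_var: float, noise_var: float) -> float:
    """Share of exon-level variance in the total variance (assumed definition)."""
    if signal_var < 0 or noise_var < 0:
        raise ValueError("variances must be non-negative")
    total = signal_var + noise_var
    if total == 0:
        raise ValueError("at least one variance must be positive")
    return signal_var / total


# Hypothetical variance components, chosen only for illustration.
high = exon_score(signal_var=2.3, noise_var=0.2)    # signal dominates
low = exon_score(signal_var=0.12, noise_var=0.88)   # noise dominates
print(round(high, 2), round(low, 2))
```

Under this assumed definition, a score close to 1 means the exon-level variation dominates the noise, and a score close to 0 means the exon behaves like the rest of its transcript cluster.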

Supplementary
Further, the exon levels lie close to their respective gene levels. This implies that these exons are present in all samples and that their deviation from the gene level reflects only natural variation. This is visible in the density plots of the array scores, where we expect a unimodal distribution for a non-alternatively spliced probe set. Supplementary Figure 4 presents an example of a non-differentially expressed gene that has an alternatively spliced exon for the tissue data. We consider the same grouping of the data as before. We selected probe set 3132962, which has an exon score of 0.70. The probe set differs significantly between the groups of interest and is ranked 319th. It has been mapped to a transcript cluster of the gene ANK1. It is observed that the gene ANK1 is not differentially expressed between the groups. Probe set 3132962, however, shows a large deviation from the gene level for the heart, muscle, thyroid and prostate tissues. This implies that this probe set has a higher inclusion in these tissues than in the other tissues, where the exon level matches the gene level. The density plots show a clear bimodal distribution, which represents a distinction between the groups of interest.
The last example in this section, presented in Supplementary Figure 5, consists of a non-differentially expressed gene and a non-alternatively spliced exon. For the tissue data, probe set 4054185 belongs to a transcript cluster of the PPP1R2P3 gene. It has an exon score of 0.17 and it is not significantly different between the groups of the tissue data.
Table 1.
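The separation of array scores between the two tissue groups can also be summarized numerically. The sketch below computes Welch's t statistic for two sets of array scores. The choice of Welch's test is an assumption, since the text does not state which test was applied, and the function name `welch_t` and the score values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group1, group2):
    """Welch's t statistic for two independent samples with unequal variances."""
    n1, n2 = len(group1), len(group2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    # Standard error of the difference in means, using sample variances.
    se = sqrt(variance(group1) / n1 + variance(group2) / n2)
    if se == 0:
        raise ValueError("both groups have zero variance")
    return (mean(group1) - mean(group2)) / se


# Hypothetical array scores: clearly positive in group 1, near zero in group 2.
group1_scores = [1.9, 2.1, 2.0, 1.8]
group2_scores = [0.1, -0.2, 0.0, 0.2, -0.1]
t_stat = welch_t(group1_scores, group2_scores)
print(t_stat)
```

A large absolute statistic corresponds to the bimodal pattern in the density plots, while a statistic near zero corresponds to the unimodal pattern expected for a non-alternatively spliced probe set.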

KIF1B
The KIF1B gene has been found to be differentially expressed in 32 cancer studies covering several cancer types and tissues. The gene has been reported to be alternatively spliced in heart, muscle and thyroid (1). The probe sets with the highest exon scores are 2319718 and 2319121, with scores higher than 0.80; both are identified as alternatively spliced by the REIDS model.

PDE4DIP
Differential expression of the PDE4DIP gene has been reported in more than 70 cancer experiments. The gene contains 10 probe sets with exon scores higher than 0.7, of which probe set 2432028 shows the most significant p-value when testing the array scores between the tissue groups.

TPM3
The TPM3 gene was observed to be differentially expressed in 18 cancer experiments. Five probe sets deviate most significantly from their respective gene levels between the tissue groups. These are the probe sets 2436538, 2436539, 2436564, 2436565 and 2436566, with exon scores of 0.84, 0.79, 0.84, 0.87 and 0.87, respectively.

TNNI1
Four experiments have reported differential expression of the gene TNNI1. Probe set 2450832 passes the threshold for the exon score with a value of 0.60 and is identified as alternatively spliced.

MAP4
The MAP4 gene was found to be differentially expressed in 19 cancer studies. It has a probe set, 2673022, with an exon score higher than 0.9 that shows a significant difference in the array scores between the groups of interest.

PDLIM5
Differential expression of the gene PDLIM5 was reported by 87 cancer studies. Probe set 2736397 has an exon score of 0.92 and is identified as alternatively spliced.

PALLD
The PALLD gene has been seen to be differentially expressed in 75 cancer experiments. Two probe sets, 2751072 and 2751068, have been identified as alternatively spliced with exon scores 0.69 and 0.67 respectively.

SYNPO2L
Four cancer experiments have reported gene SYNPO2L to be differentially expressed. Probe set 3294673 has an exon score of 0.69 and is the most significant probe set identified as alternatively spliced.

SORBS1
The gene SORBS1 was shown by 51 experiments involving cancer to be differentially expressed. Three significant probe sets have been identified with exon scores higher than 0.80.

FHL1
The FHL1 gene was reported to be differentially expressed in 128 cancer-related experiments. Probe set 3992432, with an exon score of 0.72, is the most significant probe set identified as alternatively spliced.
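The first selection step used throughout this list, keeping probe sets whose exon score passes a threshold, can be sketched as follows. The exon scores are those quoted in the text. The threshold value of 0.5 and the helper name `candidates` are assumptions, since the text does not state the cut-off that was used. The second step, testing the array scores between the tissue groups, is not covered here.

```python
# Exon scores quoted in the text, keyed by probe set ID.
exon_scores = {
    "2736397": 0.92,  # PDLIM5
    "3023189": 0.12,  # FLNC
    "3132962": 0.70,  # ANK1
    "4054185": 0.17,  # PPP1R2P3
    "2450832": 0.60,  # TNNI1
    "2751072": 0.69,  # PALLD
    "2751068": 0.67,  # PALLD
    "3294673": 0.69,  # SYNPO2L
    "3992432": 0.72,  # FHL1
}

THRESHOLD = 0.5  # assumed cut-off; the text does not state the value used


def candidates(scores, threshold):
    """Return (probe set, score) pairs above the threshold, highest score first."""
    kept = [(probe_set, s) for probe_set, s in scores.items() if s > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


ranked = candidates(exon_scores, THRESHOLD)
for probe_set, score in ranked:
    print(probe_set, score)
```

With this assumed cut-off, the FLNC and PPP1R2P3 probe sets are filtered out, in line with their description in the text as non-alternatively spliced.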

The Colon Cancer Data
In this section we discuss the comparison between the FIRMA and the REIDS model in more detail. In the comparison of the array scores and the FIRMA scores for the probe set ENSE00001668645, the FIRMA scores, as before, seem to have a higher variability than the REIDS array scores.
Supplementary Figure 19: Left panel: a heatmap of the FIRMA scores of the DOCK10 gene. Right panel: a heatmap of the array scores of the DOCK10 gene.
We illustrate the annotation of the probe sets of the MYO18A gene which was mentioned in the main manuscript as an example of a DE gene with an AS exon. Supplementary Figure 20 illustrates the alternative splicing of the probe set ENSE00001297204.