 Methodology article
 Open Access
 Published:
Finite mixture clustering of human tissues with different levels of IGF1 splice variants mRNA transcripts
BMC Bioinformatics volume 16, Article number: 289 (2015)
Abstract
Background
This study addresses a recurrent biological problem, that is to define a formal clustering structure for a set of tissues on the basis of the relative abundance of multiple alternatively spliced isoforms mRNAs generated by the same gene. To this aim, we have used a modelbased clustering approach, based on a finite mixture of multivariate Gaussian densities. However, given we had more technical replicates from the same tissue for each quantitative measurement, we also employed a finite mixture of linear mixed models, with tissuespecific random effects.
Results
A panel of human tissues was analysed through quantitative realtime PCR methods, to quantify the relative amount of mRNA encoding different IGF1 alternative splicing variants. After an appropriate, preliminary, equalization of the quantitative data, we provided an estimate of the distribution of the observed concentrations for the different IGF1 mRNA splice variants in the cohort of tissues by employing suitable kernel density estimators. We observed that the analysed IGF1 mRNA splice variants were characterized by multimodal distributions, which could be interpreted as describing the presence of several subpopulation, i.e. potential tissue clusters. In this context, a formal clustering approach based on a finite mixture model (FMM) with Gaussian components is proposed. Due to the presence of potential dependence between the technical replicates (originated by repeated quantitative measurements of the same mRNA splice isoform in the same tissue) we have also employed the finite mixture of linear mixed models (FMLMM), which allowed to take into account this kind of withintissue dependence.
Conclusions
The FMM and the FMLMM provided a convenient yet formal setting for a modelbased clustering of the human tissues in subpopulations, characterized by homogeneous values of concentrations of the mRNAs for one or multiple IGF1 alternative splicing isoforms.
The proposed approaches can be applied to any cohort of tissues expressing several alternatively spliced mRNAs generated by the same gene, and can overcome the limitations of clustering methods based on simple comparisons between splice isoform expression levels.
Background
Alternative splicing, which can be detected in more than 90 % of multiexon genes [1], is considered a leading process giving rise to cellular and tissues diversity in higher eucariotes [2]. It has been estimated that there are, on average, at least seven alternative splicing events per multiexon human gene [1]; considering that each tissue can potentially express a peculiar set of splice isoforms, it is easy to deduce that splicing complexity can profoundly contribute to tissue diversity.
In this study, we employed a finite mixture model (FMM) to cluster a cohort of tissues on the basis of the relative abundance of multiple alternatively spliced mRNAs. We show how this clustering approach, based on a probabilistic model, can overcome the limitations of clustering methods based on simple comparisons between splice isoform expression levels.
FMM have been used for several decades in the analysis of highdimensional complex data, and are now experiencing a progressive popularity, due to the increase in the computing power [3, 4]. The classical areas of application for FMMs are social sciences and economics, but several applications of FMMs are extending to genetics, biology, natural sciences, psychology, medicine [4].
FMMs are probabilistic models that can be applied to complex data whenever the experimental observations are drawn from several subpopulations which are not known a priori [5]. While the clustering context is certainly a major area of application, FMMs may also be employed to produce a semiparametric (kerneltype) estimate for a multimodal distribution, with the feature of being more parsimonious than standard kernelbased estimators [5]. For this reason, FMMs can be fruitfully employed also in the presence of high dimensional data.
In this manuscript we aim at solving a recurrent problem in molecular biology that is to define an appropriate clustering structure for a set of tissues that display differential concentrations in multiple alternatively spliced mRNA variants generated by the transcription of the same gene. To better explain how a suitable solution to this problem can be found, we describe the analysis of a paradigmatic case, where we aim at clustering a cohort of human tissues on the basis of the relative abundance of IGF1 mRNA splicing variants. However, since the data measurement consist in three technical replicates for each observed splice isoform in each observed tissue, a dependence issue may arise, and standard finite mixture models for clustering cannot be the appropriate choice. For this reason, in this study we have also employed the method described by Celeux et al. [6] to cluster tissues in presence of replicated data.
In human tissues, the transcription from a single IGF1 gene can generate multiple splicing isoforms (summarized in Fig. 1; for an exhaustive review see [7–9]) which encode multiple proteins, comprising variable amino and carboxyterminal amino acid sequences.
In particular, splicing in the 3′region of the IGF1 transcripts give rise to different mRNA splicing variants (Fig. 1) and as a consequence, this generates an important molecular variability in the carboxyterminus of the mature IGF1 peptide that represents the socalled “Epeptide” [10]. The different Epeptides might modulate the biological actions, the stability or the bioavailability of the IGF1 protein [10, 11]. For this reason, different IGF1 mRNA splicing variants could be preferentially abundant in some tissues, and scarce or even absent in other.
In spite of the great importance of IGF1 in physiology and pathology [8, 11, 12], no comprehensive quantitative studies or detailed analyses of the relative abundance of IGF1 splice isoforms mRNAs in panels of human tissues have been reported so far.
In human, alternative splicing in the 3′region of the IGF1 transcripts give rise to three different mRNA splicing variants: IGF1 Ea, IGF1 Eb and IGF1 Ec (Fig. 1) [11, 13]. IGF1 is bona fide expressed by all human tissues; however, it remains unknown whether all human tissues are also able to splice all the three splice variants Ea, Eb and Ec.
IGF1 Ea is considered the “standard isoform” expressed locally in muscle [12, 13], and the predominant splicing variant in liver [14]; for this reason, this particular isoform is often referred to as a “local muscle specific” IGF1 isoform (mIGF1) or “muscle/liver type” isoform [14–16]. The isoform IGF1 Ec has been detected in muscles subjected to damage or exercise, and it is often defined as a “mechano growth factor” (MGF) [14–16]. However it is not clear whether IGF1 Ec is also significantly spliced in nonmuscular tissues or its presence can alter the relative balance of the two other splicing isoforms IGF1 Ea and IGF1 Eb.
To gain a better understanding of IGF1 alternative splicing, we performed a quantitative analysis of the IGF1 mRNA splice variants in human tissues. In a panel of human tissues, after a preliminary data equalization (which allows us to compare relative mRNAs levels of the different IGF1 variants across the different tissues), an exploratory quantitative analysis on IGF1 mRNA splice variants was performed. This analysis showed that IGF1 Ea, Eb and Ec were spliced in all human tissues, and that their concentrations is characterized by multimodal distributions. This was clearly appreciated by employing kernel density estimation to smooth the observed empirical frequency distributions.
We then proposed a finite mixture model (FMM) with multivariate Gaussian kernel to formally cluster human tissues into subgroups with homogeneous values of concentration for the IGF1 splice variants. Since this modelbased clustering approach does not allow to consider repeated measurements, and since the analysed quantitative data contains technical replicates for each measurement, we have also applied the method of Finite mixture of linear mixed models (FMLMM) by Celeux et al. [6], which has been defined to produce a finite mixture clustering of repeated measurements. The two proposed approaches show substantial similarities in the obtained results, with a clear cut biological interpretation.
The approaches presented in this work can be regarded as a general framework for the formal clustering of any cohort of tissues (or cells) expressing alternatively spliced mRNAs, and can overcome the limitations of clustering methods based on the simple comparison between splice variants levels.
Results and discussion
Data preprocessing and exploratory analysis on IGF1 splice variants concentration in human tissues
A panel of 20 human tissues was analysed through quantitative realtime PCR, using specific IGF1 splicing isoforms assays, to detect the mRNAs encoding IGF1 Ea, IGF1 Eb and IGF1 Ec splice variants (Fig. 1).
Data derived from realtime PCR analysis (the comparative C_{T} method was used to obtain a quantitative measure of each IGF1 splice variant mRNA, see Methods) were equalized by means of the “compositional ratios” (or “concentration ratios”, see Methods), referred to as Ratio 1, Ratio 2, Ratio 3, as shown in Table 1 and Fig. 2. This transformation allows to compare the relative mRNA concentrations for the IGF1 splicing isoforms across the different tissues in the panel. However, it also results in permanent loss of the original amplitude (magnitude) in the expression levels of each isoform in each tissue.
Additionally, the formal constraint [Ea]+[Eb]+[Ec]=1 holds, i.e. the sum of the relative concentrations for the three IGF1 splicing isoforms is constant (and equal to one) in all the analysed tissues. The constraint is not only a simple theoretical assumption: in this respect, we used a IGF1_{PAN} specific TaqMan assay that simultaneously detects all the human IGF1 isoforms (see Methods), and we performed preliminary ad hoc tests to verify the null hypothesis that [IGF1_{PAN}]= [Ea]+[Eb]+[Ec]; our empirical findings suggested that we have not enough empirical evidence to reject the null hypothesis (data not shown). In other words, according to experimental observations, in each human tissue the sum of relative concentrations for the three mRNA IGF1 splicing variants Ea, Eb and Ec, is not substantially different from the total IGF1 mRNA pool, and this result is consistent with the theoretical assumptions.
A preliminary exploratory data analysis was performed on the equalized data; we report some summary statistics for the compositional ratios in Table 1. We then summarized the distribution of the concentration ratios, by looking at the withintissues variability, as shown by the boxplots in Fig. 2a, b and c, where the observed tissues are ordered by increasing mean values.
According to the findings in Table 1 and Fig. 2, we could deduce the following:

i)
all the human tissues we have analysed are able to splice the three IGF1 mRNA isoforms Ea, Eb and Ec;

ii)
all human tissues in the cohort can splice, on a relative scale, a greater amount of the variant Ea, a smaller amount of the variant Eb, and a minimal amount of the variant Ec.
Therefore, it emerges a general concept that should be taken into consideration when we perform a quantitative analysis of mRNA splice variants: when we refer to “high values” or “low values” of expression for one specific splice mRNA variant, it is always relatively to the values of concentration for the other splice variants in the same tissue or in a specific cohort of tissues. For example, in our specific cohort, high values of concentration for the isoform Eb will be always lower to Ea, even with respect to low values of concentration for isoform Ea (this can be evinced by looking at Table 1). Also, in our cohort, high values of concentration for the isoform Ec will be lower to Eb, even with respect to low values of concentration for isoform Eb (Table 1).
The boxplots in Fig. 2 do not provide a formal clustering of tissues. However, they are very interesting from a general observational perspective, and they can suggest which are the tissues showing more marked expression levels for a specific splice variant. For example, the tissues [Liver A and Liver B] seem to contain high concentrations of variant Ec, while the tissues [Brain A, Brain B and Brain C] seem to contain low concentrations of variant Ec, relatively to the cohort of all the observed tissues (Fig. 2c).
Kernel density estimates of the distribution of IGF1 splicing isoforms mRNA concentrations in human tissues
We employed kernel density estimates (KDE) to get a further step in the characterization of the distribution of the observed values for the three IGF1 splicing variants Ea, Eb and Ec mRNA concentrations in the cohort of human tissues (Figs. 3, 4 and 5).
The quality of a kernel estimate heavily depends on the value of the bandwidth [17]. By using a small value, we may have a very local (i.e. extremely variable) estimate, while a too high value may result in oversmoothing. For this reason, we have employed different choices for the bandwidth. In Figs. 3, 4 and 5, that correspond to KDE for the variants Ea, Eb and Ec, respectively, panel A refers to the Scott rule of thumb [18], panel B to the Silverman rule of thumb [17], panel C to the unbiased crossvalidation based bandwidth [19], while panel D refers to the Sheather and Jones plugin method [20] (see also the section Methods and Refs. [21] and [22] for a critical discussion on density estimation).
As it can be easily evinced by looking at Figs. 3, 4 and 5, the kernel density estimates suggest the presence of at least two subpopulations in each of the analysed IGF1 splicing variants, thus resulting in multimodal density estimates (Figs. 3, 4 and 5).
To be more specific, with respect to the human splice isoform IGF1 Ea, we may observe that at least two tissue subpopulations can be recognized in the analysed cohort: the first one has higher frequency and, likely, represents what we can roughly call the “IGF1 Ea mRNA standard population”, while the second one is more on the left side of the range of the observed values and shows a lower frequency (Fig. 3). On the basis of the boxplot shown in Fig. 2, we may speculate that the latter subpopulation likely includes those tissues having a concentration of the splice isoform IGF1 Ea lower than the average.
Opposite conclusions can be inferred by looking at the kernel density estimates for variants IGF1 Eb and IGF1 Ec, which are displayed in Figs. 4 and 5.
Some asymmetries in the higher frequency kernel component for the expression levels of the splice isoforms IGF1 Ea and IGF1 Ec can also be easily noticed (Figs. 3 and 5). These asymmetries may indicate either the presence of a third subpopulation, not clearly distinguishable from the highfrequency component, or, rather, a departure from symmetry of the populationspecific density (Figs. 3 and 5).
However, as we have already remarked on the grounds of Fig. 2, the empirical evidence we obtain from Figs. 3, 4 and 5 is mainly observational, and cannot lead to any formal inference on the existence of tissue subgroups that show different values for the observed concentration of a given IGF1 splice variants.
Finite mixture models to classify tissues with different levels of IGF1 splicing isoforms
As presented in the previous section, in human tissues the analysed levels of IGF1 mRNA splice variants Ea, Eb and Ec are characterized by multimodal distributions, which could be interpreted as being composed by several subpopulations.
In this context the finite mixture models (FMM) with Gaussian kernel could provide a formal, convenient, modelbased method to cluster the cohort of human tissues in several, but potentially overlapping, subpopulations with homogeneous values of mRNA concentration for one or more IGF1 splicing isoform. For the empirical analysis, we employed simultaneously all the technical replicates (usually triplicates, sometimes duplicates when the experimental triplicates were not available) for each human IGF1 mRNA splice isoform (Ea, Eb and Ec), in each different tissue (X _{irk}, see the Methods section).
According to the FMM, the population of tissues is divided in G components (i.e. in 2 or more “subpopulations”, or “clusters”, or “subgroups”) [5]; each component is graphically identified by an elliptical shape, which usually represents a (multivariate) Gaussian density; in this case, given the unit constraint, we have fitted the model to the couple (Ea, Eb), so that the clusterspecific distribution we are considering is a bivariate Gaussian density.
The clusters (components of the finite mixture) are thus characterized by a center, defined by the corresponding mean value, and by the elliptical contour, summarized by the covariance matrix. By definition, the latter may be parameterized as a function of 3 parameter sets, which describe scale, shape and direction, respectively [23]. Each of these parameter sets may be constant, variable or, whenever appropriate, independent, leading to 10 different association structures for each choice of the number of components [23].
To choose the appropriate number of components (that is clusters) in the population, the model was fitted for a progressively increasing number of clusters (G = 2, …,10) for each of the different structures for the clusterspecific covariance matrix. The solution corresponding to the lowest Bayesian Information Criterion score (BIC score) [24] has been retained. In our analysis the best model (i.e. the model with the lowest BIC score) corresponds to G = 3 components with an association structure defined by the acronym “EEV”, standing for Equal scale, Equal shape and Variable orientation across the components [25].
After fitting the model, we may define a formal clustering of the tissues by allocating each tissue to the cluster with the highest probability of component membership, see Methods. This procedure is often referred to as MAP (maximum a posteriori) clustering.
In our FMM there is a clear distinction in the center of the distribution for clusters [#1 and #2] when compared to cluster #3 (Table 2, and Fig. 6a). Components #1 and #2 are both characterized by a higher concentration of isoform IGF1 Ea when compared to component #3 (Table 2); the latter component is characterized by a relatively higher concentration of the isoform IGF1 Eb, even if it should be observed that in any tissue the concentration of the isoform IGF1 Eb is lower when compared to the concentration of the isoform IGF1 Ea (as already discussed, see Tables 1 and 2).
Components #1 and #2 differ mainly as a function of the orientation of the ellipse, that is the orientation of the corresponding covariance matrix estimates, showing a different correlation between the observed values for IGF1 Ea and IGF1 Eb, which is higher (in absolute value) for component #1 with respect to component #2 (Table 2, and Fig. 6a).
We assigned each tissue to a cluster by using a MAP criterion, as introduced above; the resulting classification with an estimate of the classification entropy, is shown in Table 3. It is worth considering that by allocating each tissue replicate to a specific component, we force the value of the posterior probability (which ranges from 0 to 1) to be 0 or 1 and, therefore, we insert some approximation error in our model. For some tissue replicate, this produces a higher variability due to the uncertainty in the allocation, which is higher when the tissue replicate is characterized by values of posterior probability of component membership that are very similar across the different components.
To give an exemplification on how uncertainty is generated by allocation of tissue replicates to specific components, let us assume we have two components, namely cluster A and cluster B, and that according to MAP allocation we can assign a tissue replicate to cluster A. Suppose now the posterior probabilities for the tissue are equal to 0.99 for cluster A and 0.01 for cluster B, respectively. In this case, the tissue can be allocated to cluster A with a very low uncertainty. However, if the posterior probabilities are equal to 0.51 and 0.49 for clusters A and B, respectively, the tissue can still be allocated to cluster A, but the uncertainty of the classification is higher.
In this work, modelbased clustering has been performed using the mclust library [25] developed for the opensource software R; in this context, classification uncertainty is measured by the Shannon entropy. Table 3 also describes the average uncertainty in the MAP classification for the human tissues data (see the column referring to entropy in Table 3). A higher classification uncertainty can therefore be associated to points located between components #2 and #3 when compared to points located between components #1 and #3. Looking in more details at these points, we observe that the highest values of uncertainty are associated to the following tissues: Brain B, Brain C, Fetal kidney and Skeletal muscle B, and range between 0.16 and 0.32 (with a maximum value of 1).
As shown in Table 3 each tissue was allocated to the specific component with a score of 3 replicates out of 3 (or 2 out of 2 if the third technical replicate was not available for experimental reasons). Only the Skeletal muscle B (which, as discussed, was characterized by a relatively high allocation uncertainty) was associated to a partial allocation discrepancy: in fact, two experimental replicates were assigned to component #1, whereas the remaining technical replicate was assigned to the component #2 (Table 3).
Finite mixture of linear mixed models to classify tissues with different levels of IGF1 splicing isoforms accounting for dependence between replicates
As we have noticed in the previous section, using FMM to analyze quantitative realtime PCR data was very effective, and only in one case a technical replicate was not assigned by the FMM to the same component as the other two replicates (Table 3).
FMM clustering approach is based on considering the technical replicates (originated by repeated quantitative measurements of the concentration of same mRNA splice isoform in the same tissue) as independent units, increasing the sample size. To take into account potential dependence between technical replicates we adopted a further clustering strategy, i.e. the finite mixture of linear mixed models (FMLMM) [6]. This method, developed by Celeux et al. [6] can be considered as a particular specification of the method proposed by Basford and McLachlan [26] for the analysis of three way data (in this case, tissues by isoforms by replicates). However, application of FMLMM to cluster tissues in presence of replicated data required a specific implementation (mclust does not allow for this extension) and generated some less favourable effects: i) a reduced sample size (in fact the replicates were not treated as statistical independent units anymore) and, ii) each individual (tissuespecific) sequence became a KRdimensional sequence, where K denotes the number of splice variants and R the number of replicates. Consequently, this reduced the number of possible models to be fitted, due to joint increase in the number of parameters and the decrease in the sample size.
According to FMLMM, the population of tissues is divided into G components; each component is now identified by a regression model with a componentspecific intercept (playing the role of overall mean), a tissuespecific effect that accounts for technical variability and dependence between replicates from the same tissue [6]. Altogether these different sources of variation define an elliptical shape, which represents a (multivariate) Gaussian density in a higher dimension (2 by R, where R is the number of replicates), with all the sources of variation (included the measurement error) concurring to define the covariance matrix (see the Methods section).
To choose the appropriate number of components (that is the appropriate number of clusters) in the population, we have considered the solution corresponding to the lowest Bayesian Information Criterion score (BIC score) [24]. However, it should be noted that given the reduced sample size, and given the highly parameterized model specification, we had the chance to estimate the model only for G = 2,3 and the best model, i.e. the model with the lowest BIC corresponded to G = 3 components. As described in the previous paragraph for FMM, after fitting the model with FMLMM, we may define a MAP clustering of the tissues by allocating each tissue to the cluster with the highest probability of component membership (see Methods).
As for FMM, FMLMM shows a clear distinction in the centers of the distributions corresponding to clusters [#1 and #2] when compared to cluster #3 (Tables 45 and Fig. 6b). The latter component is characterized by a relatively higher concentration of the isoform IGF1 Eb (Table 4), even if it should be observed (as already discussed, see Tables 1 and 2) that in any tissue the concentration of the isoform IGF1 Eb is lower when compared to the concentration of the isoform IGF1 Ea. In the context of FMLMM, however, the difference between components #1 and #2 is not due to a difference in the orientation of the covariance matrix as seen for the FMM (Table 2 and Table 5). Rather, the two components #1 and #2 seem to share the same orientation, with higher values in the second component (Table 4 and Fig. 6b).
We assigned each tissue to a cluster, by using a MAP criterion as introduced above; the resulting classification is shown in Table 6, with an estimate of the classification entropy. For some tissue, the allocation produces a higher variability due to the uncertainty in the allocation, which is higher when the tissue presents similar values for posterior probability across the different components. Looking in more details at these points, we observe that in FMLMM the highest values of uncertainty are associated to the following tissues: Fetal lung, Fetal skeletal Muscle, Kidney, Lung. These values are not consistent with the entropy values we have derived from the application of FMM (Table 3). Moreover, the very low entropies we could often observe in FMLMM (Table 6) are partially due to the higher number of parameters respect to the sample size. Fig. 6a and b shows the partition of tissues obtained adopting the FMM and the FMLMM, respectively. Table 7 reports the same information under the shape of a confusion matrix.
In our perspective, for the resolution of this specific biological problem the mclust library [25] still represents the methods of choice: is a straightforward analysis tool which can be easily handled by practitioners, geneticists and biological researchers. Additionally, though more appropriate clustering methods might be proposed for the case of replicated measurements, FMM still remains, at least to our opinion, a quite reliable and flexible approach.
A comprehensive analysis of the abundance of IGF1 mRNA splice variants in human tissues
From the FMM posterior classification listed in Table 3 we could finally infer a formal clustering structure for the human tissues, based on the relative abundance of the mRNAs for the IGF1 alternative splicing isoforms.
As shown in Table 3, each tissue was allocated to the specific component with a score of 3 replicates out of 3, or 2 out of 2 when the third technical replicate was not available for experimental reasons (a recurring circumstance in realtime PCR measures). Only the Skeletal muscle B (which, as discussed previously, was characterized by a relatively high allocation uncertainty) was associated to a partial allocation discrepancy: in fact, two experimental replicates were assigned to component #1, whereas the remaining technical replicate was assigned to the adjacent component #2. However, this may be explained by all three replicates being on the tails of the corresponding distributions (Fig. 6a).
The majority of the human tissues in the panel were allocated by FMM as belonging to component #1 (Table 3). Components #1 and #2 are characterized by a higher concentration of the isoform IGF1 Ea with respect to component #3 (Fig. 6a and Table 2). As remarked above, components #1 and #2 differ mainly as a function of the orientation of the ellipse, with a different correlation between the observed values for IGF1 Ea and IGF1 Eb (Table 2). Liver A and Liver B (mRNA derived from two independent groups of donors) and Uterus were assigned to component #2 (Fig. 6a and Table 2). Remarkably, we observed from the boxplots in Fig. 2c that Liver A, Liver B and Uterus are characterized by the highest relative concentration of IGF1 Ec. Therefore we can speculate that the component #2 is characterized by a peculiar relatively higher concentration of the isoform IGF1 Ec.
The tissues allocated by FMM to component #3 are characterized by a relatively higher concentration of the isoform IGF1 Eb, when compared to components #1 and #2, i.e. Pancreas and Heart B (Fig. 2b and Table 2).
It is interesting to observe that most of the results of the allocation are consistent when we consider the same tissue, but pools of RNA collected from independent groups of donors: for example Brain A, Brain B and Brain C are all allocated to component #1; Liver A and Liver B are both allocated to component #2. However Heart A and Heart B are assigned to two highly separated subgroups, i.e. to components #1 and #3, respectively. This means that pools of RNA from the same tissue, but collected from different individuals (in which tissues are possibly subjected to different pathophysiological circumstances) may sometimes exhibit radically different IGF1 splice isoforms balance. It has been reported, for example, that IGF1 Ec increased during postinfarcted myocardium remodelling [27].
To better understand the rationale behind the clusterization obtained by means of FMM, we consider tissues Brain A, Brain B and Brain C, which are three pools of RNAs from three independent groups of human donor individuals. Looking at the boxplots in Fig. 2 we may roughly derive that these tissues are characterized by “a low concentration of IGF1 Ec”. However, such cluster does not formally exist, since, due to the unit constraint, we have considered Ea and Eb only. Therefore, Brain A, Brain B and Brain C are formally allocated to component #1, that is a cluster characterized by a high concentration of isoform IGF1 Ea and, in particular, by a relatively lower concentration of IGF1 Eb and by a relatively lower concentration of IFG1 Ec.
We can summarize the results of our FMM study on human tissues as follows:

all the human tissues in the panel were able to splice the isoforms IGF1 Ea, IGF1 Eb and IGF1 Ec.

the fulfilment of the formal equation [Ea]+[Eb]+[Ec]=1 and the compositional ratios we have calculated have helped to compare the relative mRNA concentrations for the different IGF1 splice variants within each single tissue and across the different tissues.

in all the human tissues the relationship [Ea] > [Eb] > [Ec] holds, at least as an empirical evidence. Specifically, every tissue showed a lower concentration of the isoform IGF1 Eb when compared to IGF1 Ea; at the same time, in every tissue the concentration of the isoform IGF1 Ec was lower than the concentration of the isoform IGF1 Eb.

by defining and applying a FMM we have been able to formally cluster the analyzed tissues into three subgroups: i) tissues with relatively high concentration of IGF1 Ea and relatively lower concentration of IGF1 Ec (Cluster # 1); ii) tissues with relatively high concentration of IGF1 Ea and IGF1 Ec (Cluster # 2); iii) tissues with relatively high concentration of IGF1 Eb (Cluster # 3).
As discussed so far, we implemented FMLMM as a further clustering strategy to take into account the potential dependence between technical replicates. When considering the clustering results from this latter approach, we observed that the separation between [component #1, component #2] and [component #3] is still well retained, and this result is consistent with FMM (Fig. 6). The peculiar difference of FMLMM when compared with FMM is the relative composition of component #1 and component #2: while in the FMM the prior probabilities (π) for component #1 and component #2 were, respectively, 0.68 and 0.22 (Table 2), for the FMLMM the corresponding prior probabilities were 0.32 and 0.57 (Table 4), with six tissues moving from one component to the other (Tables 3 and 6).
Additionally most of the results of the allocation for FMLMM are consistent when we consider pools of RNA collected from independent human donors: for example Brain A, Brain B and Brain C were all allocated by FMLMM to component #2; Liver A and Liver B were both allocated to component #1; Skeletal muscle A and B were both allocated to component #2 (Table 6).
To better understand the rationale associated with the FMLMM clustering, we can observe that:

the tissues allocated by FMLMM to component #3 (Pancreas and Heart B) are characterized by a peculiar relatively high concentration of the isoform IGF1 Eb when compared to tissues allocated in the components #1 and #2 (Fig. 2 and Table 6).

Liver A and Liver B were allocated by FMLMM to component #1, and show an average concentration of isoform IGF1 Ea and Eb, and a particular high concentration of IFG1 Ec (Fig. 2 and Table 6). However, the other sample units belonging to components #1 cannot be clearly distinguished from tissues allocated to component #2; in this respect, the FMLMM approach is, at least to some extent, less efficient than FMM, where the three components were appreciably better characterized.
Roughly speaking, FMLMM component #1 and component #2 may be considered as describing two different tails of the same distribution, the only relevant difference being represented by Liver A and Liver B, and this poses the question of whether we definitively need for FMLMM a 3 component solution, or we can rather use a 2 component solution.
Further statistical validations and biological experiments are needed to assess unequivocally whether components #1 and #2 belong from two distinct clusters (as suggested by FMM) or belong to one single cluster.
Conclusions
In this study we propose two approaches to address the general biological problem of clustering a group of tissues on the basis of the relative abundance of multiple spliced mRNA, generated by the transcription of the same gene. We present a paradigmatic case, the quantitative analysis by realtime PCR of IGF1 mRNA splice variants in human tissues.
The starting point of this study is the appropriate preliminary equalization of the quantitative data, followed by summary statistics and nonparametric estimators based on the observed distributions (kernel density estimates).
The clustering approach proposed in this work is based on a finite mixture model (FMM) with Gaussian components, which is an efficient model based clustering approach, available in several opensource statistical computer packages.
In order to take into account potential dependence between technical replicates from the same tissue, we propose a further clustering strategy, the finite mixture of linear mixed models (FMLMM), which required a specific implementation.
In our analysis, both FMM and FMLMM provided a convenient yet formal setting for modelbased clustering of a cohort of human tissues expressing three spliced mRNAs. These approaches can be applied by the biologist to any cohort of tissues expressing several alternatively spliced mRNAs, and can overcome the limitations of classification methods based on the simple comparisons between splice variants expression levels.
Methods
Human cDNAs and RNAs
All the human samples, namely cDNA and RNAs, used in this study have been purchased from Clontech and Ambion; according the suppliers specifications, all human cDNAs and RNA samples used were obtained from normal donors, essentially free from diseases. Pancreas cDNA (derived from total RNA pooled from 15 male/female Caucasians individuals), heart (A) cDNA (whole heart) (derived from total RNA from 3 male Caucasians), lung cDNA (derived from total RNA from 1 male Caucasians), liver (A) cDNA (derived from total RNA from 1 male Caucasians), placenta cDNA (derived from total RNA from 11 female Caucasians), kidney cDNA (derived from total RNA from 4 male/female Caucasians), skeletal muscle (A) cDNA (derived from total RNA from 4 male/female Caucasians), brain (A) cDNA (whole brain) (derived from total RNA from 4 male Caucasians) were all purchased from Clontech (Human MTC Panel I cDNAs). Heart (B) cDNA (whole heart) was generated by Reverse Transcription (RT) (see below) from human heart total RNA (RNA from Clontech, pooled from 3 male Caucasians); liver (B) cDNA was generated by RT from human liver total RNA (RNA from Clontech, pooled from 3 Asian males); skeletal muscle (B) cDNA was generated by RT from human skeletal muscle total RNA (RNA from Clontech, pooled from 1 Caucasian male); uterus cDNA was generated by RT from human uterus total RNA (RNA from Clontech, pooled from 8 female Caucasians); brain (B) cDNA (whole brain) was generated by RT from brain total RNA (RNA from Clontech, pooled from 4 male Asians); Brain (C) cDNA (whole brain) was generated by RT from human brain RNA (FirstChoiche human brain total RNA from Ambion, from 1 male Caucasian). Fetal lung cDNA, Fetal skeletal muscle cDNA, Fetal heart cDNA, Fetal thymus cDNA, Fetal spleen cDNA, Fetal kidney cDNA, were all generated by RT from human fetal total RNA purchased from Clontech.
Reverse Transcription (RT)
Firststrand cDNAs from each total RNA was generated using a Reverse Transcription kit with random primers (High Capacity cDNA Reverse Transcription kit; Applied Biosystems) starting from 1 μg of total RNA as described in [28].
Realtime PCR
Newly synthesized cDNAs were diluted 5fold in DNasefree water and 5 % of the cDNA (or an equivalent quantity of the purchased cDNAs) was then used in each realtime PCR assay. Each single determination was performed in technical triplicate, using a 7500 Fast RealTime PCR (Applied Biosystems), and the TaqMan Universal Master Mix II (Applied Biosystems).
The comparative C_{T} method was used to make a quantitative measure of each IGF1 splice variant and of the housekeeping gene mRNA. In order to assure a valid ∆∆C_{T} calculation, only empirically validated TaqMan assay with an experimental efficiency approaching 100 % were used. For each assay, the efficiency of the reference gene (GAPDH) amplification and the efficiency of each IGF1 splice variant amplification, were experimentally determined and compared within a progressive template dilution. The validation experiments for each TaqMan assay were performed according the general recommendation described in the Applied Biosystems User Bulletin (ABI Prism 7700 Sequence Detection System #2) and each assay was considered validated when the efficiency of the housekeeping gene amplification and the target gene amplification were approximately equal.
TaqMan assays
All the TaqMan assays were purchased from Applied Biosystems (ABI). Refer to Fig. 1 to deduce the exon organization of the different splice isoforms and the corresponding NCBI RefSeq transcripts detected by each TaqMan assay used. Human, IGF1 “PAN” TaqMan assay (ABI Hs01547656_m1) able to detect the human IGF1 transcripts corresponding to NCBI RefSeq NM_001111284, NM_001111283, NM_000618, NM_001111285. Human, IGF1 Ea TaqMan assay (ABI Hs01547657_m1) able to detect the human IGF1 transcripts corresponding to NCBI RefSeq. NM_001111284, NM_000618. Human, IGF1 Eb TaqMan assay (ABI Hs01555481_m1) able to detect the human IGF1 transcripts corresponding to NCBI RefSeq NM_001111283, NM_001111285. Human IGF1 Ec TaqMan assay (ABI Hs03986524_m1) able to detect the human IGF1 transcript corresponding to NCBI RefSeq NM_ NM_001111283. Human GAPDH TaqMan assay (ABI Hs02758991_g1): reference (housekeeping) gene able to detect the transcripts for human glyceraldehyde3phosphate dehydrogenase (NCBI reference sequence of the transcripts detected by the assay: NM_002046, NM_001256799).
Statistical analysis
All statistical analyses, including graphical representation, summary statistics, exploratory statistics, kernel density estimation, finite mixture clustering and finite mixture of linear mixed models have been performed using the open source software R, release 3.0.1 (codename: “Good sport”).
Kernel density estimation
The kernel density estimation (KDE) of a density function refers to nonparametric approaches to estimate the probability density function of random variable X based on the empirical distirbution of an observed sample, say x = (x _{1}, …, x _{ n }), whose elements are assumed to be independent and identically distributed accoding to an unknown density f(⋅θ). The aim is at providing an estimate of f by using the information stored in the observed sample x. The kernel density estimator for f is defined by:
where K _{ h }(⋅) is the kernel, that is a nonnegative function that integrates to one, defined over a zeromean random variable, and h is a parameter, often referred to as the bandwidth, that controls the smoothing of the data. Intuitevely, one may choose a small h to provide the best fit to the observed data; however, there is always a tradeoff between the bias of the estimator and its variance. By using a small bandwidth, we may have a very local (i.e. extremely variable) estimate, while adopting a too high value may result in oversmoothing the observed distribution. For this reason, in the present analysis, we used 4 different bandwidths, based on the Scott [18] and Silverman [17] rules of thumb, on unbiased crossvalidation, see Bowman [19] and on the Sheather & Jones plugin [20], respectively. See Sheather [22] for a comparison of the methods, refere to Sheather [21] for a thourough discussion on density estimation. A range of kernel can be used; due to its conveninet mathematical properties, the Gaussian kernel is often used; in the present context, we have used the Gaussian and the Epanechnikov kernels, but for sake of space and given the estimates do not substantially difffer, we report only those based on Gaussian kernel.
Finite mixture of multivariate Gaussian densities
To provide a formal clustering of human tissues in this study, we have decide to adopt a twostep strategy: first, we have used a finite mixture model with Gaussian components to cluster tissues in homogeneous classes, considering the technical replicates for each tissue as independent. Then, we have used a finite mixture of linear mixedeffects models, see Celeux et al. [6] and Ng et al. [29] to take into account of possible clusters in the sample while allowing for dependence between technical replicates corresponding to the same tissue. We will provide a brief description of both methods below.
Modelbased clustering via finite mixtures
We assume that each tissue belong to one of G discrete classes, say P _{ g }, g = 1, …, G, that is the population of tissues is partitioned in G disjoint classes, P = ∪ _{ g = 1} ^{G} P _{ g }. Tissues belonging to a given class have a classspecific density or probability mass function; the density function f(⋅) is of the same parametric form across classes, but is parameterized by a classspecific parameter vector, say θ _{ g }, g = 1, …, G. Clearly, we may extend this approach allowing for the classspecific densities to have a classspecific parametric form, but this is quite well beyond the scope of the present work. Finite mixture models are given the choice of a suitable density function f(⋅), and for a given splice isoform l = 1, 2, the finite mixture model specifies the marginal density for a set of r = 1,.., R replicates referring to the ith tissue as follows:
leading to the following observed data likelihood:
which can be considered as a specific version of the modelbased clustering approach introduced by Basford and McLachlan (1985) for threeway data [26]. Here, the terms π _{ g }, g = 1, …, G represent the prior probability that a tissue belongs to the gth class, g = 1, …, G, while the vector x _{ i } includes the set of measures technical replicates x _{ ir }, r = 1, …, R, and, for each replicate, the values for the splice isoforms, l = 1,2. The assumption that:
mimics the local independence assumption; it says that, conditional on belonging to the class P _{ g } the replicates corresponding to the same tissue are independent. However, due to the lack of general software routines to estimate the former model, we have further assumed that the technical replicates from the same tissues are independent. This oviously simplifies the analysis and allows the use of standard software, as it will beclear in the following. In this case, the likelihood reduce to
This assumption has also the result that the sample size is increased from n to nR. We can postulate the existence of a set of binary indicator variables, defined as:
indicating the component a tissue replicate belongs to. Should these be observed, the estimation problem would simply reduce to a multiclass problem, where the classspecific parameter θ _{ g } is to be estimated only on tissues belonging to the gth component, g = 1, …, G. However, the indicator variables are unobserved, and the number G of components is also unknown and must be estimated together with other model parameters. Thus, in the following, we will refer to the couple (x _{ ir }, z _{ ir }) as the complete data, and base our inferences on the following procedure. Assuming that the Gdimensional (latent) variable z _{ ir } has a multinomial distribution:
we derive the following density for the complete data:
Assuming that the tissue and the replicates are independent on each other, we have the following expression for the likelihood function:
Therefore, the loglikelihood function for the complete data can be written as:
where f _{ irg } = f(x _{ ir }θ _{ g } ). Since the latent vector z _{ ig } is unobservable, we use the EM algorithm to perform maximum likelihood estimation;this is an iterative algorithm, where two steps can be identified. In the first step, the Estep, of the algorithm, we define the loglikelihood for observed data by taking the expectation of the loglikelihood for complete data over the unobservable class indicator vector z _{ ir } given the observed data x _{ ir } and the current maximum likelihood estimates of model parameters, say θ ^{(t)}. In other words, at the tth step, we replace z _{ ig } by its conditional expectation:
which represents the posterior probability that the irth tissue replicate belongs to the gth component, given the observed data. The conditional expectation of the complete loglikelihood given the observed data x is expressed by the function:
If we use, a multivariate Gaussian density with componentspecific parameter set given by θ _{ g } = (μ _{ g }, Σ _{ g }) as classspecific density, we obtain the following expression:
where the superscript T denote matrix transpose, and under the constraint ∑_{ g } π _{ g } = 1. In the Mstep of the algorithm, we maximize Q ^{(t)}(⋅) with respect to θ _{ g }, and obtain the following ML estimates for the parameters of the Gaussian density in the gth component:
while te prior estimates are:
which are well known results from ML in finite mixtures. Solving these equations for given weights w _{ irg } ^{(t)} and updating the weights for given parameter estimates θ ^{(t)} defines an EM algorithm. The E and M steps are alternated repeatedly until convergence, which is obtained with a sequence of loglikelihood values which is bounded from above. Thus, we reach a solution for a given G, which can be used to estimate model parameters when we move to the solution for G + 1 classes. The corresponding solutions may be compared using penalized likelihood criteria, such as AIC o BIC, since the standard conditions do not holt to use the standard LRT statistic (the corresponding distribution under the null hypothesis is not the standard one). Once the EM algorithm has reached its end, the tissue replicate can be allocated to the comonent having the highest posterior probability w _{ irg } ^{(t)} ; in this way we obtain the clustering we are looking for, and compare estimated parameter values corresponding to the different classes, to characterize the obtained results. To avoid the well know sensitivity of the results obtained by the EM algorithm from the choice of the starting values for model parameters and/or posterior probabilities, we employed, prior to the EM algorithm, a shortlength CEM algorithm [30] which is quite flexible and robust to the presence of extreme values. This procedure often outperforms the one based on the comparison (in terms of penalized likelihood criteria) of results obtained using several random starts for the basic EM algorithm.
Finite mixtures of linear mixedeffects models
As previously stated, in order to improve the quality of the clustering results, we should take into account the technical variability deriving from the replicated measurements taken on the same tissue. In this case, linear mixedeffects models (LMMs) are defined to model repeated measures for a given statistical unit (in our case human tissue), and allow to separate measurement error variation from betweenunits variation (variability of the same splice variant between different tissues) and withinunits variation (between technical replicates from the same tissue). To propose a modelbased clustering tool for technical replicates, the standard finite mixture model (FMM) previously discussed, has to be modified to the LMMs context. The idea was developed by Celeux et al. [6], where a mixture of LMMs is proposed in order to account for the variability in measurements while performing clustering.
In details, let x _{ i } be the observed 2Rdimensional vector, representing the values for the ith tissue, corresponding to the R technical replicated measurements on the 2 splice variants. To take into account replicated measurements in the finite mixture framework, Celeux et al. [6] simply added the assumption that the replicated measurements of a statistical unit (in our case tissue) belong to the same mixture component. In this way, we are able to model the covariance structure between the rth and r 'th technical replicate on the ith tissue from the lth splice isoform, cov(x _{ ilr }, x _{ ilr '}). Thus, it is assumed that the observed 2Rdimensional vectors x _{1}, …, x _{ n } are drawn from a mixture of G components with unknown proportions π _{1}, …, π _{ G } (∑_{ g } π _{ g } = 1), and that, conditional on belonging to the gth component of the mixture, each vector x _{ i } referring to the ith tissue can be specified by the following linear mixedeffect model:
where V is a known (2R × 2) design matrix defined by
with 1 _{ R } and 0 _{ R } representing unit and null vectors of size R, α _{ g } is a 2dimensional vector of fixed effects modelling the conditional mean of x _{ i } in the gth component, u _{ ig } represents a zeromean 2dimensional vector of random effects shared by the replicates corresponding to the same tissue capturing the technical variability, while e _{ ig } represents the measurement error vector. The random effects u _{ ig } and the measurement error vectors e _{ ig } are assumed to be mutually independent. Furthermore, the distributions of e _{ ig } are and u _{ ig } are assumed to be multivariate Gaussian, MVN(0, Ω _{ g }) and MVN(0, Φ _{ g }), respectively.
It is worth to notice that Celeux et al. [6] specified the covariance matrix Φ _{ g } as diagonal; however, in our study, we have considered a nondiagonal covariance matrix since we allow for correlation between measurements corresponding to different splice variants on the same tissue. Moreover, Celeux et al. [6] specified the covariance matrix Ω _{ g } to be equal to σ _{ g } ^{2} I _{2R }, with I _{2R } being a 2R × 2R identity matrix; in order to allow the gth componentvariance to be different between the 2 splice variants, we assumed that Ω _{ g } = diag(Vσ _{ g } ^{2} ), where σ _{ g } ^{2} = (σ _{ g1} ^{2} , σ _{ g2} ^{2} )'.
Given these assumptions, the componentspecific mean vector and covariance matrix of x _{ i } are
and
respectively. In other words, the model does not only assumes dependence between two technical replicates (on the same tissue and for the same splice variant), but also between measurements corresponding to different splice variants. Ng et al. [29] further extended this model, adding another random effect which induces dependency among statistical units from the same class.
The estimation of the model parameters of a finite mixture of linear mixedeffects model can be obtained by maximum likelihood (ML) via the EM algorithm. The missing data are of two types: the indicator variables z _{ i } indicating the component a tissue belongs to and the randon effects u _{ ig } (i = 1, …, n) for each class.
Let π = (π _{1}, …, π _{ G }) be the mixing proportions, θ _{ g } = (α _{ g }, Φ _{ g }, σ _{ g } ^{2} ) be the parameter vector associated to each component P _{ g } g = 1, …, G, and θ = (α _{1}, …, α _{ g }, Φ _{1}, …, Φ _{ G }, σ _{1} ^{2} , …, σ _{ G } ^{2} ). The loglikelihood associated to the complete data (x, z) is given by
Conditionally on the gth component, x _{ i } is a realization from a MVN(α _{ g } V, VΦ _{ g } V '), therefore, f(x _{ i }, u _{ ig }θ _{ g }) is a Gaussian distribution with mean \( {\boldsymbol{\upmu}}_g=\left(\begin{array}{c}\hfill {\boldsymbol{\upalpha}}_g\mathbf{V}\hfill \\ {}\hfill \mathbf{0}\hfill \end{array}\right) \) and covariance matrix \( {\sum}_g=\left(\begin{array}{ll}\mathbf{V}{\boldsymbol{\Phi}}_g\mathbf{V}\hbox{'}+{\boldsymbol{\Omega}}_g\hfill & \mathbf{V}{\boldsymbol{\Phi}}_g\hfill \\ {}{\boldsymbol{\Phi}}_g\mathbf{V}\hbox{'}\hfill & {\boldsymbol{\Phi}}_g\hfill \end{array}\right). \)
At the tth iteration, the E step, we compute the expectation of the completedata loglikelihood conditional on the observed data and the current value of the parameter θ ^{(t)}, π ^{(t)} which is given by
with
where
denotes the posterior probability that x _{ i } arises from the gth component. In the Mstep of the algorithm, we maximize Q ^{(t)}(⋅) with respect to θ _{ g } and π _{ g }. Using the conditional expectation of the sufficient statistics. Since u _{ ik }' u _{ ik } and (x _{ ik } − Vu _{ ik }), we obtain the following estimates for g = 1, …, G:
where V _{ h } is the hth column vector of V(h = 1, 2), u _{ ig } ^{(t)} is a flexible estimator of u _{ ig } based on its posterior mean conditional on the observed data, given by
and γ _{ g } ^{(t)} corresponds to the trace of the current conditional covariance matrix cov(e _{ ig }x), and, finally,
The EM algorithm to obtain the ML parameter estimates given the number of classes G has been implemented by using an adaptation of the Emmixwire program for the R environment; the original script is freely available at http://www.maths.uq.edu.au/~gjm/mix_soft/EMMIXWIRE/index.html. The number of components has to be chosen according to the BIC.
Abbreviations
 IGF1:

Insulinlike growth factor 1
 FMM:

Finite mixture model
 FMLMM:

Finite mixture of linear mixed models
 KDE:

Kernel density estimation
 AIC:

Akaike Information criterion
 BIC:

Bayesian Information Criterion
 MAP:

Maximum a posteriori
 Sk. muscle:

Skeletal muscle
 UTR:

Untranslated region
 Chr:

Chromosome
References
 1.
Pan Q. Deep surveying of alternative splicing complexity in the human transcriptome by highthroughput sequencing. Nat Genet. 2008;40:1413–5.
 2.
Nilsen TW. Expansion of the eukaryotic proteome by alternative splicing. Nature. 2010;463:457–63.
 3.
McLachlan GJ, Do KA, Ambroise C. Analyzing microarray gene expression data. Hoboken: Wiley; 2005.
 4.
Schlattmann P. Medical applications of finite mixture models. Berlin: Springer; 2009.
 5.
McLachlan GJ, Peel D. Finite mixture models. New York: Wiley; 2000.
 6.
Celeux G, Martin O, Lavergne C. Mixture of linear mixed models for clustering gene expression profiles from repeated microarray experiments. Stat Model. 2005;5:243–67.
 7.
Oberbauer AM. The regulation of IGF1 gene transcription and splicing during development and aging. Front Endocrinol (Lausanne). 2013;4:39.
 8.
Scicchitano BM, Rizzuto E, Musaro A. Counteracting muscle wasting in aging and neuromuscular diseases: the critical role of IGF1. Aging (Albany NY). 2009;1:451–7.
 9.
Adamo ML, Neuenschwander S, LeRoith D, Roberts Jr CT. Structure, expression, and regulation of the IGFI gene. Adv Exp Med Biol. 1993;343:1–11.
 10.
Shimatsu A, Rotwein P. Mosaic evolution of the insulinlike growth factors. Organization, sequence, and expression of the rat insulinlike growth factor I gene. J Biol Chem. 1987;262:7894–900.
 11.
Barton ER. The ABCs of IGFI isoforms: impact on muscle hypertrophy and implications for repair. Appl Physiol Nutr Metab. 2006;31:791–7.
 12.
Musaro A, McCullagh K, Paul A, Houghton L, Dobrowolny G, Molinaro M, et al. Localized Igf1 transgene expression sustains hypertrophy and regeneration in senescent skeletal muscle. Nat Genet. 2001;27:195–200.
 13.
Shavlakadze T, Winn N, Rosenthal N, Grounds MD. Reconciling data from transgenic mice that overexpress IGFI specifically in skeletal muscle. Growth Horm IGF Res. 2005;15:4–18.
 14.
Yang S, Alnaqeeb M, Simpson H, Goldspink G. Cloning and characterization of an IGF1 isoform expressed in skeletal muscle subjected to stretch. J Muscle Res Cell Motil. 1996;17:487–95.
 15.
Hameed M, Orrell RW, Cobbold M, Goldspink G, Harridge SD. Expression of IGFI splice variants in young and old human skeletal muscle after high resistance exercise. J Physiol. 2003;547:247–54.
 16.
McKoy G, Ashley W, Mander J, Yang SY, Williams N, Russell B, et al. Expression of insulin growth factor1 splice variants and structural genes in rabbit skeletal muscle induced by stretch and stimulation. J Physiol. 1999;516(Pt 2):583–92.
 17.
Silverman BW. Density estimation for statistics and data analysis. London: Chapman and Hall; 1986.
 18.
Scott DW. Multivariate density estimation: theory, practice, and visualization. Hoboken: Wiley; 1992.
 19.
Bowman A. (1984) An alternative method of crossvalidation for the smoothing of density estimates. Biometrika. 1984;71:353–60.
 20.
Sheather SJ, Jones MC. A reliable databased bandwidth selection method for kernel density estimation. J R Stat Soc Ser B. 1991;53:683–90.
 21.
Sheather SJ. Density Estimation. Stat Sci. 2004;19:588–97.
 22.
Sheather SJ. The performance of six popular databased bandwidth selection methods on some real data sets (with discussion). Comput Stat. 1992;7:225–81.
 23.
Fraley C, Raftery AE. Modelbased clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002;97:611–31.
 24.
Schwarz G. Estimating the dimension of a model. Ann Statist. 1978;6:461–4.
 25.
Fraley C, Raftery AE. MCLUST version 3 for R: normal mixture modeling and modelbased clustering. technical report no. 504. Seattle: Department of Statistics, University of Washington; 2006.
 26.
Basford KE, McLachlan GJ. The mixture method of clustering applied to threeway data. J Classif. 1985;2:109–25.
 27.
Stavropoulou A, Halapas A, Sourla A, Philippou A, Papageorgiou E, Papalois A, et al. IGF1 expression in infarcted myocardium and MGF E peptide actions in rat cardiomyocytes in vitro. Mol Med. 2009;15:127–35.
 28.
Pelosi M, Marampon F, Zani BM, Prudente S, Perlas E, Caputo V, et al. ROCK2 and its alternatively spliced isoform ROCK2m positively control the maturation of the myogenic program. Mol Cell Biol. 2007;27:6163–76.
 29.
Ng SK, McLachlan GJ, Wang K, BenTovim Jones L, Ng SW. A Mixture model with randomeffects components for clustering correlated geneexpression profiles. Bioinformatics. 2006;22:1745–52.
 30.
Biernacki C, Celeux G, Govaert G. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput Stat Data Anal. 2003;41:561–75.
Acknowledgments
We thank Prof. Claudio Sette for scientific discussions and Maria Grazia Berardinelli for technical support. This work was supported by Telethon (grant N. GGP13013; GGP14066), AFM, PRIN, and ASI.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MP and MA conceived the study and its design, performed the statistical analysis, and wrote the manuscript. AM, FM and EP participated in the design of the study, participated in the statistical analysis, helped to draft the manuscript, and revised critically the manuscript for important intellectual content. AM and MP coordinated the study and carried out the molecular analysis. All authors read and approved the final manuscript.
Michele Pelosi and Marco Alfò contributed equally to this work.
Michele Pelosi and Marco Alfò joint first Authors.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Pelosi, M., Alfò, M., Martella, F. et al. Finite mixture clustering of human tissues with different levels of IGF1 splice variants mRNA transcripts. BMC Bioinformatics 16, 289 (2015) doi:10.1186/s1285901506897
Received:
Accepted:
Published:
Keywords
 Finite mixture model
 Finite mixture of linear mixed models
 IGF1
 Alternative splicing
 Multiple alternatively spliced mRNA
 Tissue profiling
 Biostatistics
 Gene expression levels