 Research
 Open access
 Published:
A clustering procedure for threeway RNA sequencing data using data transformations and matrixvariate Gaussian mixture models
BMC Bioinformatics volume 25, Article number: 90 (2024)
Abstract
RNA sequencing of timecourse experiments results in threeway count data where the dimensions are the genes, the time points and the biological units. Clustering RNAseq data allows to extract groups of coexpressed genes over time. After standardisation, the normalised counts of individual genes across time points and biological units have similar properties as compositional data. We propose the following procedure to suitably cluster threeway RNAseq data: (1) preprocess the RNAseq data by calculating the normalised expression profiles, (2) transform the data using the additive log ratio transform to map the composition in the Dpart Aitchison simplex to a \(D1\)dimensional Euclidean vector, (3) cluster the transformed RNAseq data using matrixvariate Gaussian mixture models and (4) assess the quality of the overall cluster solution and of individual clusters based on cluster separation in the transformed space using densitybased silhouette information and on compactness of the cluster in the original space using cluster maps as a suitable visualisation. The proposed procedure is illustrated on RNAseq data from fission yeast and results are also compared to an analogous twoway approach after flattening out the biological units.
Background
RNA sequencing (RNAseq) generates huge amounts of information about the transcriptome of the organism studied. In timecourse RNAseq experiments gene expression is observed over time with thousands of genes measured simultaneously using only a small number of samples. The comparison of several experimental conditions induces a threeway data structure. Examples of threeway data are experiments under different process conditions, knockout experiments or the exploration of different strains.
RNAseq data provide readouts in form of count data, i.e., read counts per gene. Specific characteristics of the distribution of these counts are nonnormality and a dependence of the variance on the mean. The raw count data may be modelled using the Poisson distribution [1] or the NegativeBinomial distribution to account for overdispersion [2]. Instead of a direct analysis of the raw count data, however, in general several preprocessing steps are applied before analysing the RNAseq data [3], e.g., to correct for different sequencing depths, library sizes and gene lengths. Typically, logfold changes of differential expression are calculated and the number of genes is reduced by filtering low normalised counts and low differential expression estimates [4]. For these tasks several R [5] and Bioconductor [6] packages are available, e.g., TCC [7], DESeq2 [8] or edgeR [9].
Timecourse gene expression data are analysed in different ways. One possible approach is clustering to find groups of coexpressed genes [10,11,12,13]. Different methods and algorithms are used for clustering including kmeans [14], hierarchical clustering [15], biclustering [16, 17], as well as modelbased clustering [18]. Modelbased clustering methods embed the clustering problem within a statistical framework and the mixture models used may be adapted in a flexible way to the data structure and clustering aims by specifying suitable models for the components of the mixture.
Using the raw counts, finite mixtures of Poisson as well as NegativeBinomial distributions have been considered for clustering RNAseq data [19]. Mixtures of multivariate Poissonlognormal distributions have been proposed for clustering transcriptome sequencing data [1]. Modelbased clustering was also extended to threeway data [20] by proposing to use matrixvariate distributions for the components. In this way, the experiments may be assumed to be independent whereas time points are assumed to be dependent. Taking the threeway structure of RNAseq data under several experimental conditions into account, matrixvariate Poissonlognormal distributions were used as components [21]. An alternative method to modelbased clustering for grouping threeway timeseries gene expression data results from extending biclustering to triclustering [22, 23].
Modelbased clustering can also be used after applying suitable preprocessing methods to the data. In a twodimensional setting, a possible approach is to preprocess the RNAseq data by calculating socalled normalised expression profiles and then use data transformations such as the arcsine or logit transformation before clustering the data using Gaussian mixture models [24]. However, both the arcsine transformation as well as the logit transformation have the drawback to be rank deficient and the resulting data are therefore not suitable for modelbased clustering with component distributions assuming full rank.
Noting, however, that in fact these normalised expression profiles have similar properties as compositional data [25], one may resort to methods developed for compositional data to analyse the RNAseq data after normalisation. Compositional data are usually modelled using standard statistical methods after suitably transforming the data with support on the simplex to \(\mathbb {R}^D\) or \(\mathbb {R}^{D1}\) where the data transformations are supposed to then facilitate the use of statistical methods based on Gaussian distributions or relying on the Euclidean distance. Aitchison [26] proposed a number of classes of transformednormal distributions for data on the simplex.
The additive log ratio (ALR) transform is particularly suitable when working with timecourse gene expression data as the first time point can be used as a reference. Finite mixture of matrixvariate Gaussian distributions can then be fitted to perform modelbased clustering with a suitable model being selected based on a statistical information criterion. A final partition can be obtained by assigning observations to the component with the maximum aposteriori probability.
For assessing the quality of a clustering several methods have been proposed [27], e.g., the separation between clusters or the compactness of a given cluster. Silhouette width [28] compares the average distance of one observation to members of its own cluster and the average distance to members of its second closest cluster. A modified silhouette width based on posterior probabilities was also developed for modelbased unsupervised learning approaches, the densitybased silhouette information (dbsi) [29]. The dbsi value of an observation is based on the cluster with the largest posterior probability and the cluster with the second largest posterior probability. In the context of classification, [30] propose a class map to investigate the classspecific performance which takes into account farness from the class as well as the predicted class probabilities. Similarly, we propose in the clustering context a cluster map which takes into account cluster separation in the transformed space and compactness in the original space.
In this work we propose the following fourstep procedure for clustering threeway RNAseq data:

Step 1:
Preprocessing RNAseq data where first normalised expression profiles of the genes across time points for a biological unit and experiment are obtained, averages are taken across biological replicates and finally differentially expressed genes are identified to reduce the number of observations.

Step 2:
Transforming RNAseq data using ALR on the normalised expression profiles which have similar properties as compositional data.

Step 3:
Modelbased clustering of the transformed threeway RNAseq data using finite mixtures of matrixvariate normal distributions.

Specify the variancecovariance structure of the components taking into account the experimental design and clustering aims.

Select the number of components based on the integrated completed likelihood (ICL; [31]) which takes goodnessoffit as well as cluster separation into account.


Step 4:
Postprocessing of the cluster solution for validation based on both, the normalised gene profiles as well as the transformed data, and external additional information. Assess the quality of the partition by inspecting the densitybased silhouette information (dbsi) based on the posterior probabilities of the individual genes alone as well as in combination with the distance from the cluster center as a measure of compactness in a cluster map. Complement these evaluations with a gene set analysis.
We illustrate this approach on a publicly available fission yeast dataset [32] which has a threeway structure. The data consist of global transcription profiles of two strains, the fission yeast wild type and atf21 mutant strains, over an osmotic stress time course. The experiment aimed at the identification of genes affected by the knockout of the atf21 gene. We assess the cluster solution obtained using the proposed procedure also in combination with biological knowledge about functional annotation from the PomBase database [33], the scientific resource for fission yeast. Finally, the proposed threeway clustering approach is compared to a classical twoway approach after flattening out the biological units.
Material and methods
Preprocessing RNAseq data
RNAseq count data are normalised to account for the varying library size (i.e., the total number of sequenced reads in a sample) and the varying gene lengths [3]. Normalisation of the raw read counts enables the comparison across samples and genes. A comprehensive evaluation of normalisation methods for RNAseq data is given in [34].
We extend the idea of obtaining normalised expression profiles proposed for the twodimensional data setting [24] to threeway data. The normalised expression profile for gene \(i = 1, \ldots , n\), time point \(t = 1, \ldots , T\) and experiment \(j = 1, \ldots , J\) is given by
with \(x_{itj}\) the raw read counts of gene i at time point t in experiment j. These normalised expression profiles are calculated separately for each gene and experiment and give the proportion of reads for gene i in experiment j with respect to the total reads for gene i in experiment j across all time points T while accounting for the scaling normalisation factors \(s_{tj}\) which are time point and experiment specific. A constant c is added to avoid potential issues due to zeros. This constant often takes the value 1 but other values, e.g., half of the minimum of all raw counts, are also possible [35]. The scaling normalisation factors \(s_{tj}\) can be calculated using, e.g., the DESeq2 normalisation [8].
For each gene profile of a specific gene i and experiment j, the individual contributions of each time point sum up to one. Therefore, the profile of gene i in experiment j denoted by \(\varvec{p}_{ij} = (p_{itj})_{t=1,\ldots ,T}\) has similar properties as compositional data. In the dataset of the empirical illustration and, more generally, in timecourse experiments, the relative change to time point T0 in gene expression over time is of main interest, implying that T0 represents a natural reference time point.
Transforming RNAseq data
Compositional data are assumed to be made up of the relative parts of a whole with all parts being strictly positive [26]. They follow a vectorspace structure on the simplex based on log ratios between the compositional parts rather than the usual Euclidean geometry. The Ddimensional simplex is defined as
with \(\kappa\) an arbitrary constant which can be set to 1 without loss of generality. The geometrical structure of compositions is referred to as Aitchison geometry [36, 37]. For compositional data, interest lies in the relative proportions of the components measured, absolute quantities and units are irrelevant. Compositional data require that all entries are strictly positive. In case there are components equal to zero in the observed data, one needs to deal with these zeroes – usually viewed as rounded zeroes – and add some small positive constant to ensure this requirement [35].
To model compositional data, one can either use a distribution with support on the simplex, e.g., the Dirichlet distribution, or use transformednormal distributions which map compositional data with support on the simplex to \(\mathbb {R}^D\) or \(\mathbb {R}^{D1}\). A number of transformations inducing transformednormal distributions were proposed for data on the simplex [26]: the additive log ratio transform (also known as logistic normal [38, 39]), the centred log ratio transform, and the isometric log ratio transform.
Transformations
The additive log ratio transform is given by
where \(x_D\) is an arbitrary component which usually, however, has a specific meaning. ALR leads to a nonorthogonal coordinate system.
The centerd log ratio transform is given by
where \(m(\varvec{x})\) is the geometric mean of \(\varvec{x}\). CLR represents a mapping of \(S^D \rightarrow \mathbb {R}^D\), i.e., the resulting matrix is rank deficient implying that the empirical variancecovariance matrix of the data is singular. CLR coefficients cannot directly be associated with an orthogonal coordinate system. Hence, an alternative transformation building on CLR was proposed [26], the isometric log ratio transform:
where \(\Psi\) is an orthonormal basis in the hyperplane. There are infinitely many ways to define such an orthonormal basis system, e.g., the use of pivot coordinates [40]. ILR coordinates represent a mapping of \(S^D \rightarrow \mathbb {R}^{D1}\) and also correspond to an isometry, i.e., all metric concepts on the simplex are maintained.
Applying the ALR and ILR compositional data transformations facilitates the use of statistical methods based on Gaussian distributions or relying on the Euclidean distance on the transformed data. For timecourse RNAseq data the ALR transformation is clearly preferable as differential expression can easily be interpreted relative to T0. By contrast, the interpretation of the ILR coordinates is not straightforward. We thus use the ALR transformation with T0 as reference in our proposed workflow.
Illustrating the ALR transformation
In the following we illustrate how clusters on the 3dimensional simplex transform to Gaussian clusters in 2dimensional Euclidean space based on the ALR transformation using two different mixture distributions with Gaussian components in Euclidean space. Figure 1 illustrates compositional data in the original space and in the transformed space after applying the ALR transformation. The compositional data take values on the 3dimensional simplex which are visualised in ternary diagrams on the left. The data are transformed using the ALR transformation and then visualised again using a 2dimensional scatter plot on the right. The visualised datasets contain either four or five Gaussian clusters in the transformed space. Details about the data generating process can be found in the Additional file 1.
In the first dataset (top row), three clusters are dominated by one component in the compositional vector and therefore these observations are placed in each of the corners of the ternary diagram. The fourth cluster has similar weights for component x2 and x3 and slightly lower weights for component x1 in the compositional vector and therefore its observations are placed approximately in the middle of the ternary diagram. The volume of this pink cluster in the middle is rather comparable to the volumes of the orange and blue clusters on the simplex; however, the volume of this cluster is in comparison much smaller after ALR transformation in the Euclidean space. Also, the pink cluster has a spherical shape on the simplex, but an ellipsoid structure in Euclidean space.
In the second dataset (bottom row), the two greenish clusters in the bottom left and top corner of the simplex differ only slightly in their volumes. In Euclidean space, however, their volumes differ considerably. The orange cluster in the middle has a large volume on the simplex compared to the clusters with observations in the corners or on the edges, while in Euclidean space it is a very compact cluster in the center. This behaviour is even more pronounced for the purple cluster which is spread over a large area of the simplex whereas it has a compact ellipsoid structure in Euclidean space compared to the dark green cluster. In contrast, the orange and light green clusters have the same volume in Euclidean space whereas the light green cluster is much more compact on the simplex.
Figure 1 illustrates that clusters that are very compact in Euclidean space can have a large volume on the simplex. In addition clusters which are dominated by one component might have a larger volume in Euclidean space while being very concentrated in one corner of the simplex. This suggests the need to allow for different volumes and shapes of the clusters in Euclidean space which is not possible using kmeans and thus requires the use of mixtures of Gaussian distributions where the variancecovariance matrix structure can be specified to allow for different shapes and volumes across clusters as well as allow for dependence between time points and independence between experiments.
Modelbased clustering
The standard form of a finite mixture model [41] is
where K is the total number of components, \(\pi _k\) is the positive component size of the kth component with \(\sum _{k=1}^K \pi _k = 1\) and \(\varvec{\theta }_k\) is the parameter vector corresponding to the kth component distribution \(f_k(.;\varvec{\theta }_k)\).
Matrixvariate distributions for the components offer a natural way to model threeway data. This results in a finite mixture model with components distributed as matrixnormal [20]:
implying that conditional on component membership, \(\varvec{Y} = (y_{jt})_{j=1,\ldots ,J; t=1,\ldots ,\mathcal {T}}\) follows a \(J \times \mathcal {T}\)dimensional matrixnormal distribution \((\mathcal{M}\mathcal{N}_{J \times \mathcal {T}})\) with J the number of experiments and \(\mathcal {T}\) the number of time points. Please note that \(\mathcal {T} = T  1\) is used here to indicate that the number of time points differ between the original space and the transformed space. The parameters of the matrixnormal distribution are the \(J \times \mathcal {T}\) mean matrix \(\varvec{M}\) and the \(J \times J\) and \(\mathcal {T} \times \mathcal {T}\) variancecovariance matrices \(\varvec{\Sigma }\) and \(\varvec{\Psi }\). \(\varvec{\Sigma }\) measures the variability along rows (experiments) and \(\varvec{\Psi }\) measures the variability along columns (time points). A suitable constraint needs to be imposed on \(\varvec{\Sigma }\) and \(\varvec{\Psi }\) to ensure identifiability.
The matrixnormal distribution is related to the multivariate normal distribution via
with \(\text {vec}()\) the vectorisation operator and \(\otimes\) the Kronecker product. Comparing these distributions indicates that much fewer parameters need to be estimated in the case of the matrixnormal distribution (e.g., for \(J=2\) and \(\mathcal {T}=5\) we get a block matrix of size \(10 \times 10\) with \(2\cdot 3/2 +5\cdot 6/2=18\) parameters as opposed to \(10\cdot 11/2=55\) parameters in a general variancecovariance matrix). As a result of this more parsimonious component specification, more groups of different size, volume and shape can be expected to be found.
The following specifications are imposed on the parameters of the components to obtain suitable cluster solutions. A general matrix with no constraints is assumed for the mean parameter. The variancecovariance matrix measuring variability between the experiments, \(\varvec{\Sigma }\), is assumed to be a diagonal matrix as no dependence structure is expected between them, i.e., the experiments are assumed to be independent. Additionally, in order to allow for clusters of different volumes and shapes the values in the diagonal of \(\varvec{\Sigma }\) are allowed to vary between dimensions and components. This specification ensures that noise clusters with large volume as well as compact clusters consisting of very similar expression patterns can be identified simultaneously. This represents a major advantage over kmeans clustering where only global restrictions across all components are possible. Between the time points some correlation structure is assumed. One can either specify a full correlation structure, i.e., \(\varvec{\Psi }\) is a general correlation matrix, or restrict \(\varvec{\Psi }\) assuming an autoregressive model of order 1 (AR1), resulting in a more parsimonious parameterisation which assumes conditional independence between time points more than one time point apart conditional on intermediate time points [42]. Other component distributions could also be considered, e.g., the tdistribution of skewed distributions, to allow for more flexible shapes of the clusters. However, using normal distributions has the advantage that the symmetry and the light tails imply that all observations in the cluster might well be represented by the mean, in particular if the cluster is compact.
To select the number of components, a penalised goodnessoffit criterion such as the Bayes information criterion (BIC) or the ICL is typically used. These criteria are determined by
where \(\ell (.\hat{\varvec{\theta }}_K)\) is the log likelihood evaluated at \(\hat{\varvec{\theta }}_K\), which is the maximum likelihood estimate of \(\varvec{\theta }_K\), \(\nu _K\) is the number of free parameters in the mixture model with K components and n is the number of genes. The difference between BIC and ICL is the penalty factor, i.e., the entropy which is added for the ICL and which measures the ability of the Kcomponent model to provide a well separated partition of the data. A simulation study on artificial data showed that BIC and ICL had an excellent performance for selecting the number of clusters for mixtures of matrixvariate Gaussian distributions [20]. This is in line with the literature indicating that the BIC in general performs well in modelbased clustering despite not satisfying the regularity conditions [18]. These criteria may also be used to select among different models or transformations [24, 43].
We use the ICL in our proposed workflow because this criterion aims not only at selecting a solution which provides a good fit to the data, but it also takes cluster separation into account. Hence, the selected solution is supposed to provide better results in a clustering context, in particular given that biologists perform cluster analysis to find groups of coexpressed genes where the functionality and the coregulation is still unclear. Therefore, rather than identifying the true number of underlying clusters in a dataset, it is of interest to identify small compact groups of genes with similar expression patterns across experiments.
Postprocessing of the cluster solution
The selected mixture model can be used to obtain a partition of the data by assigning each observation to the component where the posterior probability of belonging to the component given the observation is maximum. Different methods have been proposed to assess cluster solutions using internal as well as external criteria.
Silhouette information [28] allows to evaluate the quality of a partition based on a distance metric. The silhouette width is determined as the average distance of one observation \(\varvec{y}_i = (y_{itj})_{t = 1, \ldots , \mathcal {T}, j = 1, \ldots J}\) to the other members of its own cluster \(a(\varvec{y}_i)\) and the minimum average distance to members of its second closest cluster \(b(\varvec{y}_i)\).
To avoid the need of specifying a suitable distance metric, an alternative silhouette information was developed based on posterior probabilities which is thus applicable to assess the partition obtained from a mixture model [29]. The densitybased silhouette information (dbsi) of \(\varvec{y}_i\) is defined as
where \(k_0\) is the cluster with the largest posterior probability \(\tau _k(\varvec{y}_i)\) and \(k_1\) is the group index for the second largest posterior of observation i. In order to avoid numerical instabilities, the posterior probabilities were winsorized to be between 0.00001 and 0.99999 in the numerical implementation. A large value of the dbsi for a given data point is an indicator that this data point firmly belongs to its assigned cluster whereas a small dbsi value indicates that there is some ambiguity in assigning this data point to one cluster. The dbsi metric allows to assess the separation between clusters in the transformed data space.
The dbsi information plot visualizes for each gene the dbsi value obtained grouped by cluster, with the values within each cluster being in decreasing order. This plot provides insights into the cluster sizes as well as the separation of the clusters. Based on this plot, one can assess how well the fitted mixture model allows to classify the observations into the clusters. For each cluster, one can also assess how easily which proportion of the genes can be assigned to this cluster.
The dbsi information plot focuses on cluster quality in the transformed space based on the mixture model. We complement this information by combining the dbsi values with a measure of closeness to the cluster center in the original space based on the Euclidean distance. We build on a twodimensional visualisation method recently proposed for a detailed diagnostic of the quality of a classification procedure where true class memberships are known (class map; [30]). In this diagnostic plot, the probability of the best alternative class is plotted for each observation against the “farness” from the given class.
A similar approach can also be used in an unsupervised setting. To allow for a joint assessment of the solution in the original as well as the transformed space, we propose a cluster map which is obtained by plotting the dbsi from the mixture model fitted to the data in the transformed space against the distance to the cluster center in the original space with observations split into different facets based on cluster membership. The distance to the cluster center is determined using a scaled version of the Euclidean distance such that the values are in [0, 1]:
where \(\varvec{p}_i\) is observation \(\varvec{y}_i\) in the original space and \(c(\varvec{p}_i)\) is the transformed estimated mean of the component with the largest posterior. The distance indicates how compact a cluster is in the original space, whereas dbsi allows to assess cluster separation in the transformed space based on the mixture model.
The cluster map plot consists of scatter plots of distance versus dbsi values for each of the clusters in facets. To better identify the regions of points and facilitate comparison across clusters, we suggest to add convex hulls for the observations of each cluster. From a biological point of view we are interested in well separated and compact clusters allowing for easy interpretation of the functionality of the contained genes. This implies that an ideal cluster would be one where observations assigned to the cluster have high dbsi values and a low distance to the cluster center and hence, all observations are located in the top left corner of this scatter plot.
Overall, the dbsi information plot as well as the cluster map scale well in the number of experiments, the number of time points and the number of clusters and are therefore well suited for highdimensional data containing many observations. Traditional visualisations of the clusters in the transformed and original space can become quite challenging as the dimensionality, especially the number of experiments, increases.
In addition to evaluating cluster solutions based on internal criteria, cluster solutions on genomic data are also typically assessed using external information, e.g., taking into account functional groupings of the genes such as gene ontology (GO) terms. Individual clusters are evaluated regarding gene set enrichment of functional groups. For fission yeast, the gene association file containing GO term information is available from PomBase [33].
Software implementation
The procedure can be implemented combining several R packages. Package coseq [24, 44] implements calculation of the normalised expression profiles (Step 1). Package robCompositions [25] implements data transformations of compositional data including the ALR and the inverse ALR (Step 2). Package MatTransMix [45] implements the matrixvariate normal mixture model (Step 3), assuming a general variancecovariance matrix \(\varvec{\Psi }\). The restricted autoregressive version of order 1 (AR1) is fitted with an implementation of the expectationmaximisation algorithm using code available at the GitHub project page. For fitting the restricted version, the best solution obtained using MatTransMix with the general specification for \(\varvec{\Psi }\) is used for initialisation. The restricted version is thus obtained in a subsequent refinement step given the solution for a general variancecovariance matrix \(\varvec{\Psi }\). To fit mixture models to a twodimensional version after flattening out the biological units, packages mclust [46] and Rmixmod [47] can be used for maximum likelihood estimation of finite mixtures of multivariate normal distributions.
Code for the proposed workflow including also the visualisation methods for postprocessing the cluster solution (Step 4) is available at the GitHub project page. The repository contains the R code for the newly developed cluster map, the dbsi plot and scripts for reproducing the entire data analysis performed in this study.
Results
Analysing the fission yeast dataset
We applied the proposed workflow to a dataset collected in a study where global changes in transcript and protein levels in the fission yeast stress response were investigated [32]. The fission yeast data can be downloaded from the Gene Expression Omnibus and is also available in the Bioconductor package fission. The data comprises global transcription profiles of two strains, the fission yeast wild type (WT) and the atf21 mutant (Mut) strains, over an osmotic stress time course following treatment with 1 M sorbitol at 0, 15, 30, 60, 120 and 180 minutes. Strandspecific single end sequencing of total RNA was performed in biological triplicates on the Applied Biosystems SOLiD 5500xl Genetic Analyzer System. In total there are \(n = 7039\) genes, \(T = 6\) time points and \(J = 2\) different experimental units for 3 biological replicates. The yeast samples were exposed to oxidative stress, and half of the samples contained a deletion of the gene atf21 (at locus SPNCRNA.1164). One of the goals of the fission yeast data analysis is to find groups of coexpressed genes over time by additionally taking into account the information about the different experiments.
The raw read counts of each of the six experiments were used to calculate the normalised expression profiles using the coseq package. After taking the means over the biological replicates, the normalised expression profiles were transformed to ALR coordinates using the robCompositions package. In Fig. 2, the data associated with each of these preprocessing steps are visualised for the knockout gene atf21. The raw counts are shown in the top left panel for all six experiments. In the top right panel the normalised expression profiles are given which sum up to one for each experiment and all values are therefore in the range between zero and one. In the bottom panels the corresponding mean profiles (left) and ALR coordinates (right) are displayed.
A differential expression analysis for gene filtering was performed using DESeq2. Following the DESeq2 workflow [48], a design formula was used that models the strain difference at time point T0 = 0, the difference over time, and any strainspecific differences over time. Differentially expressed genes between the strains at any two time points were selected by taking only genes with an adjusted pvalue below 0.01 and an absolute log2 fold change larger than 1. These thresholds correspond to the defaults usually employed and yield a subset of 769 differentially expressed genes. Selecting different thresholds would result in a different set of differentially expressed genes. Given that clustering aims at suitably partitioning the set of given objects, clearly a different set would also induce a different cluster solution. In this subset, the most abundant GO terms are GO:0005634 (nucleus, 361 genes), GO:0005829 (cytosol, 284 genes), GO:0005737 (cytoplasm, 147 genes) and GO:0005515 (protein binding, 111 genes).
Modelbased clustering of the transformed threeway data was performed using R package MatTransMix. Models with the number of components between 1 and 20 were fitted using a diagonal variancecovariance matrix for the experiments and a full correlation matrix over time. ICL selects the 10 components solution for the ALR transformed data. This is in line with the BIC which would also point to the 10 components solution. These solutions are used to initialise the fitting of the restricted version based on AR1 to account for temporal correlation. For the restricted versions, both ICL and BIC also suggest to use the solution with 10 components given that the criteria show a considerable decrease up to 10 components and are quite flat afterwards. Based on these criteria, these AR1 restricted solutions are preferable to the full versions. In the following, we consider the 10component solution obtained with the AR1 restriction.
Figure 3 provides the dbsi information plot for the cluster solution obtained, with the clusters sorted by their average silhouette width. The cluster sizes range from 26 to 181 genes indicating there is not a single cluster which would contain the majority of genes. Clearly, in general, clusters with high average dbsi values are those which contain only few genes. Clusters 1 and 2 contain only a rather small number of genes and show a very good separation for all genes contained resulting in high average dbsi values. By contrast, cluster 10 has the smallest average dbsi value and therefore the worst separation. The overall average dbsi is 0.59 which is a moderate value for the dbsi. Results thus indicate that in the transformed space some clusters were identified where the observations can be clearly assigned to these clusters, whereas for other clusters assignment is rather ambiguous.
In order to get a more detailed view on the quality of this cluster solution, we inspect the cluster map in Fig. 4. For the cluster solution obtained, we can see that even though clusters 1 and 2 have very similar average dbsi values, they differ considerably in the distance to their cluster centers. Cluster 2 is a rather compact cluster as indicated by distance, whereas cluster 1 shows the overall largest average distance of the corresponding genes to their cluster center. Cluster 10 on the other hand has a very small average dbsi value but is a very compact cluster in the original space. The same is true for cluster 5.
We complement the evaluation of the cluster solution and assessment of the quality of specific clusters by also inspecting the gene expression patterns of the individual clusters both in the transformed and original data space. Some selected clusters are visualised in Fig. 5 with the third dimension, i.e., the experiments wild type (WT) and mutant (Mut), shown side by side. This implies that the xaxis represents the dimension of time as well as the experiments. Note that such a visualisation is easily possible for this analysis because of the rather lowdimensional nature of the threedimensional dataset. In the top panel, the ALR transformed data are given for the WT and MuT experiment next to each other. In the bottom panel the corresponding mean profiles in the original space are shown. Cluster 1 which contains genes with good cluster separation but bad compactness is given on the left. As expected, the gene expression profiles vary a lot in their magnitude. Cluster 2 on the other hand is a very compact cluster with good cluster separation. Similar, clusters 5 is again a very compact cluster, whereas cluster 7 is less compact. Cluster 7 also contains gene atf21 which is shown in Fig. 2. This cluster contains 87 genes which show the highest gene expression 30 min after start of the experiment. The corresponding genes show a very similar expression pattern in the wild type strain (left) and the mutant (right). The composition of the known functionality of these genes is very similar to the global functionality, i.e., GO:0005634 (nucleus, 42 genes), GO:0005829 (cytosol, 46 genes), GO:0005737 (cytoplasm, 14 genes) and GO:0005515 (protein binding, 15 genes).
To highlight the advantages of the threeway clustering approach we also investigate clustering results obtained for the transformed data when using twoway clustering methods as well as kmeans. For twoway clustering the fission yeast data, we flatten out the third dimension and use a traditional twoway clustering with package mclust. We impose an unconstrained variancecovariance specification to the component distributions because we want to allow for varying volume, orientation and shape across clusters. Flatting out the third dimension, i.e., the experimental conditions, yields a dataset with 769 genes and 10 variables after ALR transformation. Comparing models fitted with 1 to 20 components, a cluster solution with 4 components was selected by ICL (where BIC would have selected 5 components).
The dbsi information plot as well as the cluster map plot of the twoway clustering solution are given in Fig. 6. The dbsi information plot indicates that in general, the dbsi values are comparable for the twoway and threeway clustering approaches. Cluster 1 is a very large cluster containing a lot of genes with high dbsi values indicating good cluster separation and also some genes with very bad cluster separation. Clusters 2 and 3 are characterised by containing observations where the dbsi values vary considerably. Cluster 4 is generally a cluster with very low dbsi values. The cluster map plot on the right also allows to inspect the compactness of the clusters in the original space. This shows, that in particular cluster 1, which has a good cluster separation according to the dbsi values, contains a lot of genes with large distance to its cluster center. This can also be seen in Fig. 7. Cluster 2 on the other hand is a very compact cluster where the corresponding genes have only a small distance to the cluster center and the gene expression profiles are easily interpretable.
The adjusted Rand index [49] between the threeway cluster solution and the twoway cluster solution is 0.15. This indicates only a rather low congruence between the partitions, most likely due to the difference in number of clusters. The contingency table of the two cluster solutions is given in Table 1 on the left. Good agreement between the cluster solutions is inherent for cluster 9 of the threeway clustering and cluster 4 in the twoway clustering (where only 3 genes were put into different clusters). All genes of cluster 5 in the threeway clustering are contained in cluster 2 of the twoway clustering. However, cluster 2 of the twoway clustering also contains cluster 8 of the threeway clustering and some more genes. This comparison shows that while there is clearly some congruence between several clusters obtained with the two methods there is also a lot of difference between the partitions.
Finally, also the kmeans algorithm [14] was used for clustering the fission yeast data. The flattend out data to two dimensions was used. Note that this also corresponds to a threeway version of kmeans which implies isotropic clusters. The number of clusters were selected based on the maximum of the averaged Silhouette width [15] which resulted in three clusters. The comparison of the threeway clustering and the kmeans clustering solution gives an adjusted Rand index of 0.14 (see Table 1 on the right). In this case an even stronger hierarchical nesting of the clusters of the threeway clustering in the kmeans partition can be discerned. All observations in clusters 3, 5, 7 and 9 of the threeway clustering are completely contained in one single cluster of the kmeans solution. For kmeans clustering, however, no posterior probabilities are available and therefore no dbsi plot and cluster map can be used to investigate the cluster solution in more detail.
The comparison of the threeway clustering solution with the twoway and kmeans clustering solutions clearly indicates the advantage of using the threeway clustering approach which results in a more parsimonious while still flexible parametrisation of each of the components in the mixture model and hence allows to identify more clusters and thus for a more finegrained analysis.
Simulation study
We investigate the performance of the proposed threeway clustering approach in a simulation study using artificial data and compare results to those obtained using twoway clustering and kmeans clustering. We focus in particular on the ability to determine a suitable number of clusters based on the criteria considered in the application on the fission yeast dataset as well as on the congruence of the partition obtained with the true classification of the observations based on the adjusted Rand index.
We proceed as follows. We generate 100 artificial threeway datasets with the same structure as the fission yeast dataset, i.e., 2 experiments, 5 time points, and 769 genes. As data generating process we use the mixture model estimated to the fission yeast dataset in the threeway clustering procedure, i.e., the 10component mixture of matrixvariate Gaussian distributions imposing the AR1 restriction on the columnwise covariance matrix. We draw class assignments based on the component sizes and—given class assignment—we draw observations from the matrixvariate Gaussian distribution with mean and variancecovariance matrices estimated for the component.
For each dataset, we use four different methods to cluster these observations: threeway clustering using a full columnwise covariance matrix, threeway clustering using AR1, twoway clustering and kmeans clustering. For the three modelbased approaches we select the suitable number of clusters using the ICL and consider the BIC as well to assess its performance. For kmeans we use the maximum average Silhouette width as criterion to select the number of clusters as well as fixed the number of clusters to 10, the true number of clusters.
The adjusted Rand index between the true cluster memberships and the eight different clustering solutions are given on the left of Fig. 8. The results clearly indicate the superiority of the threeway clustering to obtain similar partitions to the true classification than twoway clustering or kmeans clustering. Threeway clustering using AR1 slightly outperforms classical threeway clustering using a full columnwise covariance matrix. For kmeans only assuming that the number of clusters is known leads to some reasonable congruence with the true classification, while results are poor if the number of clusters are selected based on the Silhouette width criterion. Regarding the use of either ICL or BIC to select the number of clusters in a modelbased clustering context, the simulation study results show that this choice has only a very minor impact on the performance. Figure 8 on the right shows the number of clusters selected by the procedures. Clearly for threeway clustering, the selected number of clusters are quite close to the true number regardless of if the full or the AR1 restricted variants are considered. Twoway clustering and kmeans select consistently a much lower number of clusters which is also in line with the results in the application on the fission yeast dataset.
The results of the simulation study clearly indicate that the threeway clustering procedure performs well and using the ICL (but also the BIC) for selecting the number of clusters is a reasonable choice. Regarding the use of twoway clustering and kmeans one might expect to clearly underestimate the true number of clusters present in a dataset.
Conclusions
In this work we proposed a new workflow for analysing threeway RNAseq data, i.e., where genes are investigated over time under different experimental conditions. The fourstep procedure consists of (1) preprocessing RNAseq data, (2) transforming RNAseq data, (3) modelbased clustering using matrixvariate component distributions and (4) postprocessing the cluster solution obtained. For preprocessing, we propose to calculate normalised expression profiles over time which have similar properties as compositional data. After applying the ALR data transformation, one may assume a matrixvariate normal distribution for the clusters in the data. We thus propose to fit a finite mixture model with components distributed as matrixnormal. A new visualisation method was developed for postprocessing of the cluster solution. The cluster map visualises the densitybased silhouette information (dbsi) calculated from the posterior probabilities in the transformed space and the distance of a gene to its cluster center in the original space. The proposed workflow was applied to a dataset from fission yeast which has a threeway structure. We also compared results from the threeway clustering with a twoway clustering approach and kmeans clustering after flattening out the experiments. Results indicate that the threeway approach allows for a more detailed view on the data and encourages the detection of groups of genes with similar temporal expression patterns over time across the different experiments.
The application focused on threevariate data consisting only of two experiments in one dimension. In the future we want to extend our approach to datasets with more than two experiments. In such situations it might be worthwhile to investigate the use of a more parsimonious specification of the mean vectors based on a regression model where the covariates characterise the experimental units. Additionally, it would also be interesting to compare the presented threeway clustering approach to triclustering.
Availibility of data and materials
The dataset used is available in the Bioconductor package fission. R code for the implemented workflow is available at the GitHub project page: https://github.com/theresascharl/RNAseq_clustering
Abbreviations
 RNAseq:

RNA sequencing
 ALR:

Additive log ratio
 BIC:

Bayes information criterion
 ICL:

Integrated completed likelihood
 AR1:

Autoregressive of order 1
 dbsi:

Densitybased silhouette information
 WT:

Wild type
 CLR:

Centered log ratio
 ILR:

Isometric log ratio
 Mut:

Mutant
 GO:

Gene ontology
References
Silva A, Rothstein SJ, McNicholas PD, Subedi S. A multivariate Poissonlog normal mixture model for clustering transcriptome sequencing data. BMC Bioinform. 2019;20(1):394. https://doi.org/10.1186/s1285901929160.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:106. https://doi.org/10.1186/gb20101110r106.
Korpelainen E, Tuimala J, Somervuo P, Huss M, Wong G. RNAseq data analysis: a practical approach. 1st ed. New York: Chapman and Hall/CRC; 2014. https://doi.org/10.1201/b17457.
Bourgon R, Gentleman R, Huber W. Independent filtering increases detection power for highthroughput experiments. Proc Natl Acad Sci USA. 2010;21(107):9546–51. https://doi.org/10.1073/pnas.0914005107.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2023.
Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M. Orchestrating highthroughput genomic analysis with Bioconductor. Nat Methods. 2015;12(2):115–21. https://doi.org/10.1038/nmeth.3252.
Sun J, Nishiyama T, Shimizu K, Kadota K. TCC: An R package for comparing tag count data with robust normalization strategies. BMC Bioinform. 2013;14:219. https://doi.org/10.1186/1471210514219.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15:550. https://doi.org/10.1186/s1305901405508.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. https://doi.org/10.1093/bioinformatics/btp616.
Androulakis IP, Yang E, Almon RR. Analysis of timeseries gene expression data: methods, challenges, and opportunities. Annu Rev Biomed Eng. 2007;9:205–28. https://doi.org/10.1146/annurev.bioeng.9.060906.151904.
Nueda MJ, Tarazona S, Conesa A. Next maSigPro: updating maSigPro Bioconductor package for RNAseq time series. Bioinformatics. 2014;30(18):2598–602. https://doi.org/10.1093/bioinformatics/btu333.
Scharl T, Voglhuber I, Leisch F. Exploratory and inferential analysis of gene cluster neighborhood graphs. BMC Bioinform. 2009;10(1):288. https://doi.org/10.1186/1471210510288.
Srivastava H, Ferrell D, Popescu GV. NetSeekR: a network analysis pipeline for RNAseq time series data. BMC Bioinform. 2022;23:54. https://doi.org/10.1186/s12859021045541.
Hartigan JA, Wong MA. Algorithm AS136: a \(k\)means clustering algorithm. Appl Stat. 1979;128:100–8.
Kaufman L, Rousseeuw PJ. Finding groups in data. New York: Wiley; 1990.
Hartigan JA. Direct clustering of a data matrix. J Am Stat Assoc. 1972;67(337):123–9. https://doi.org/10.2307/2284710.
Pontes B, Giráldez R, AguilarRuiz JS. Biclustering on expression data: a review. J Biomed Inform. 2015;57:163–80. https://doi.org/10.1016/j.jbi.2015.06.028.
Fraley C, Raftery AE. Modelbased clustering, discriminant analysis and density estimation. J Am Stat Assoc. 2002;97(458):611–31. https://doi.org/10.1198/016214502760047131.
Si Y, Liu P, Li P, Brutnell TP. Modelbased clustering for RNAseq data. Bioinformatics. 2013;30(2):197–205. https://doi.org/10.1093/bioinformatics/btt632.
Viroli C. Finite mixtures of matrix normal distributions for classifying threeway data. Stat Comput. 2011;21(4):511–22. https://doi.org/10.1007/s112220109188x.
Silva A, Qin X, Rothstein SJ, McNicholas PD, Subedi S. Finite mixtures of matrix variate Poissonlog normal distributions for threeway count data. Bioinformatics. 2023;39(5):btad167. https://doi.org/10.1093/bioinformatics/btad167.
Amar D, Yekutieli D, MaronKatz A, Hendler T, Shamir R. A hierarchical Bayesian model for flexible module discovery in threeway timeseries data. Bioinformatics. 2015;31(12):17–26. https://doi.org/10.1093/bioinformatics/btv228.
Jung I, Jo K, Kang H, Ahn H, Yu Y, Kim S. TimesVector: a vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes. Bioinformatics. 2017;33(23):3827–35. https://doi.org/10.1093/bioinformatics/btw780.
Rau A, MaugisRabusseau C. Transformation and model choice for RNAseq coexpression analysis. Brief Bioinform. 2018;19(3):425–36. https://doi.org/10.1101/065607.
Filzmoser P, Hron K, Templ M. Applied compositional data analysis: with worked examples in R. Springer series in statistics. Switzerland: Springer; 2018. https://doi.org/10.1007/9783319964225.
Aitchison J. The statistical analysis of compositional data. J R Stat Soc Ser B (Methodol). 1982;44(2):139–77. https://doi.org/10.1111/j.25176161.1982.tb01195.x.
Hennig C. Clustering strategy and method selection. In: Hennig C, Meila M, Murtagh F, Rocci R, editors. Handbook of cluster analysis. 1st ed. New York: Chapman and Hall/CRC; 2015. https://doi.org/10.1201/b19706.
Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65. https://doi.org/10.1016/03770427(87)901257.
Menardi G. Densitybased silhouette diagnostics for clustering methods. Stat Comput. 2011;21:295–308. https://doi.org/10.1007/s1122201091690.
Raymaekers J, Rousseeuw PJ. Silhouettes and quasi residual plots for neural nets and treebased classifiers. J Comput Graph Stat. 2022;31(4):1332–43. https://doi.org/10.1080/10618600.2022.2050249.
Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell. 2000;22(7):719–25. https://doi.org/10.1109/34.865189.
Leong HS, Dawson K, Wirth C, Li Y, Connolly Y, Smith DL, Wilkinson CRM, Miller CJ. A global noncoding RNA system modulates fission yeast protein levels in response to stress. Nat Commun. 2014;5:3947. https://doi.org/10.1038/ncomms4947.
Harris MA, Rutherford KM, Hayles J, Lock A, Bähler J, Oliver SG, Mata J, Wood V. Fission stories: using PomBase to understand Schizosaccharomyces pombe biology. Genetics. 2021;220(4):222. https://doi.org/10.1093/genetics/iyab222.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Gall CL, Schaëffer B, Crom SL, Guedj M, Jaffrézic F. French StatOmique consortium: a comprehensive evaluation of normalization methods for illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2013;6(14):671–83. https://doi.org/10.1093/bib/bbs046.
PawlowskyGlahn V, Buccianti A. Compositional data analysis: theory and applications. Chichester: Wiley; 2011. https://doi.org/10.1002/9781119976462.ch17.
PawlowskyGlahn V, Egozcue JJ. Geometric approach to statistical analysis on the simplex. Stoch Environ Res Risk Assess. 2001;15:384–98. https://doi.org/10.1007/s004770100077.
Egozcue JJ, PawlowskyGlahn V, MateuFigueras G, BarceloVidal C. Isometric logratio transformations for compositional data analysis. Math Geol. 2003;35:279–300. https://doi.org/10.1023/A:1023818214614.
Blei DM, Lafferty JD. A correlated topic model of Science. Ann Appl Stat. 2007;1(1):17–35. https://doi.org/10.1214/07AOAS114.
Russo M, Singer BH, Dunson DB. Multivariate mixed membership modeling: inferring domainspecific risk profiles. Ann Appl Stat. 2022;16(1):391–413. https://doi.org/10.1214/21AOAS1496.
Fišerová E, Hron K. On the interpretation of orthonormal coordinates for compositional data. Math Geosci. 2011;43:455–68. https://doi.org/10.1007/s110040119333x.
McLachlan GJ, Peel D. Finite mixture models. New York: Wiley; 2000. https://doi.org/10.1002/0471721182.
Anderlucci L, Viroli C. Covariance pattern mixture models for the analysis of multivariate heterogeneous longitudinal data. Ann Appl Stat. 2015;9(2):777–800. https://doi.org/10.1214/15AOAS816.
Thomas I, Frankhauser P, Biernacki C. The morphology of builtup landscapes in Wallonia (Belgium): a classification using fractal indices. Landsc Urban Plan. 2008;84(2):99–115. https://doi.org/10.1016/j.landurbplan.2007.07.002.
GodichonBaggioni A, MaugisRabusseau C, Rau A. Clustering transformed compositional data using \(k\)means, with applications in gene expression and bicycle sharing system data. J Appl Stat. 2017;46:47–65. https://doi.org/10.1080/02664763.2018.1454894.
Zhu X, Sarkar S, Melnykov V. MatTransMix: an R package for matrix modelbased clustering and parsimonious mixture modeling. J Classif. 2022;39:147–70. https://doi.org/10.1007/s00357021094019.
Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 2016;8(1):289–317. https://doi.org/10.32614/rj2016021.
Lebret R, Iovleff S, Langrognet F, Biernacki C, Celeux G, Govaert G. Rmixmod: the R package of the modelbased unsupervised, supervised, and semisupervised classification Mixmod library. J Stat Softw. 2015;67(6):1–29. https://doi.org/10.18637/jss.v067.i06.
Love MI, Kim SAV, Huber W. RNAseq workflow: genelevel exploratory analysis and differential expression [version 2; peer review: 2 approved]. F1000Research. 2016;4:1070. https://doi.org/10.12688/f1000research.7035.2.
Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218. https://doi.org/10.1007/BF01908075.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
TS implemented the workflow, carried out the analysis of the data and wrote the manuscript. BG contributed to implementing the workflow and writing the manuscript. TS and BG conceptualised and developed the workflow and methodology. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable. The manuscript does not have any direct human involvement or human clinical/medical data which is not publicly available.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Artificial data used in Fig. 1.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Scharl, T., Grün, B. A clustering procedure for threeway RNA sequencing data using data transformations and matrixvariate Gaussian mixture models. BMC Bioinformatics 25, 90 (2024). https://doi.org/10.1186/s12859024057176
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859024057176