Two-way learning with one-way supervision for gene expression data

Background A family of parsimonious Gaussian mixture models for the biclustering of gene expression data is introduced. Biclustering is accommodated by adopting a mixture of factor analyzers model with a binary, row-stochastic factor loadings matrix. This particular form of factor loadings matrix results in a block-diagonal covariance matrix, which is a useful property in gene expression analyses, specifically in biomarker discovery scenarios where blood can potentially act as a surrogate tissue for other less accessible tissues. Prior knowledge of the factor loadings matrix is useful in this application and is reflected in the one-way supervised nature of the algorithm. Additionally, the factor loadings matrix can be assumed to be constant across all components because of the relationship desired between the various types of tissue samples. Parameter estimates are obtained through a variant of the expectation-maximization algorithm and the best-fitting model is selected using the Bayesian information criterion. The family of models is demonstrated using simulated data and two real microarray data sets. The first real data set is from a rat study that investigated the influence of diabetes on gene expression in different tissues. The second real data set is from a human transcriptomics study that focused on blood and immune tissues. The microarray data sets illustrate the biclustering family’s performance in biomarker discovery involving peripheral blood as surrogate biopsy material. Results The simulation studies indicate that the algorithm identifies the correct biclusters, most optimally when the number of observation clusters is known. Moreover, the biclustering algorithm identified biclusters comprised of biologically meaningful data related to insulin resistance and immune function in the rat and human real data sets, respectively. Conclusions Initial results using real data show that this biclustering technique provides a novel approach for biomarker discovery by enabling blood to be used as a surrogate for hard-to-obtain tissues. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1564-5) contains supplementary material, which is available to authorized users.


Background
With the introduction of personalized medicine, the discovery of novel biomarkers via "omics" research plays a critical role in its advancement [1]. A biomarker is defined as "a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention" [2]. The behaviour of a biomarker is expected to vary among individuals, thereby allowing treatment to be "personalized" depending on that individual's (predicted) response. The ideal diagnostic tool *Correspondence: monica.wong@math.mcmaster.ca 1 Department of Mathematics and Statistics, McMaster University, L8S 4L8 Hamilton, ON, Canada Full list of author information is available at the end of the article is minimally invasive, leading researchers to investigate the use of peripheral blood cells as surrogate biopsy material, since blood is more easily accessible. The assumption is that the molecular profile of peripheral blood reflects a global overview of the physiological events occurring in different tissues throughout the body [3].
When gene expression microarrays are used for biomarker discovery, the subset of identified genes acts as the set of biomarkers [4,5]. Returning to the idea of peripheral blood as surrogate material, a gene that exhibits correlated expression profiles in blood and a given tissue (but not other tissues) may be a biomarker of interest. In this scenario, the genes act as the observations and the blood and tissues (the samples) act as the variables. A data point in the microarray data set is thus an intensity value corresponding to a specific gene in a sample. One popular way of identifying these subsets of correlated genes across the blood and the given tissue is via clustering techniques [6].
One-way clustering methods can be restrictive in certain applications. It is not always the case that the groups of patterns found in the observations are homogeneous across all the variables; rather, it could be the case that only a subset of the variables possesses these groupings. With gene expression data, if the samples are the variables and the researcher hypothesizes that there exists homogeneous groups of samples, this would be useful information for the algorithm to have. A popular example is the discovery of leukaemia tumour subtypes based on gene expression [7]. Consequently, biclustering techniques have been developed to address this recurring issue. Biclustering, first explored by Hartigan [8], clusters both rows and columns simultaneously and results in biclusters.
Biclustering is a useful technique when the researcher suspects biclusters of variables and observations in the data, but does not understand what properties of the variables define the biclusters. For instance, in the leukaemia tumour subtype analysis, researchers initially would not have known the classes of each tumour sample (see [7] for a discussion). Here, biclustering could help to reveal these subtypes more efficiently, as done by Kluger et al. [9] for example. However, researchers may desire that the observations within biclusters satisfy a particular relationship among the variables; the biclustering method would then be one-way supervised. This technique is particularly relevant for the blood biomarker discovery application mentioned earlier. One-way supervision is effective because the researcher specifically requires a prominent relationship between the samples of blood and the samples of the tissue of interest with respect to the expression profile of a subset of genes. Additionally, the researcher explicitly requires that the expression profiles of that same subset of genes to have no relationship between the previously mentioned samples and the rest of the samples in the data set. In this way, the resulting biclusters would contain a subset of genes that is strictly correlated within blood and the tissue of interest only.

Model-based clustering
Cluster analysis identifies homogeneous groups that are relevant within a population. It is an unsupervised technique because it does not utilize existing labels to find the best homogeneous groups among a set of observations, which reflects common real-life scenarios because observations are not usually accompanied by hints about their true groupings with respect to the variables. Some popular clustering techniques include methods such as hierarchical clustering [10], k-means clustering [11], and model-based clustering (see [12] for an in-depth discussion).
In model-based clustering, group membership is estimated using a parametric finite mixture model, which can be denoted where π g ∈ (0, 1], such that G g=1 π g = 1, is the mixing proportion for component g, f x (x | θ g ) is the density of a multivariate random variable X with parameters θ g , and ϑ = (π 1 , . . . , π G , θ 1 , . . . , θ G ). Frequently, the finite Gaussian mixture model is used because of its mathematical tractability. This density is given by where φ(x | μ g , g ) is the density of a multivariate Gaussian random variable X with mean μ g and covariance matrix g . An overview of model-based clustering is given by McNicholas [13].

Parsimonious Gaussian mixture models
The factor analysis model [14], assumes that a pdimensional random vector X i can be modelled using a q-dimensional vector of latent factors U i , where q < p. Factor analysis allows for a decrease in the number of parameters, which is useful in high-dimensional data cases. The model can be written as where is a p × q matrix of factor loadings, the latent factors U i ∼ N(0, I q ) are independent, and the errors i ∼ N(0, ) are independently distributed and independent of the U i , where is a diagonal noise matrix with dimensions p × p. Thus, X i ∼ N(μ, + ). In the mixture of factor analyzers (MFA) model, different factor analysis models are allowed in different regions of the feature subspace, using the density of a Gaussian mixture model with covariance structure g = g g + [15] or g = g g + g [16]. The mixture of probabilistic principal components analysis (PPCA) model [17] is a special case of the MFA model from [16] in that it adds the assumption that the noise matrix is isotropic so that g = ψ g I p . The parsimonious Gaussian mixture model (PGMM) family [18] allows combinations of the constraints: g = , g = , and g = ψ g I p within the MFA model, resulting in a family of eight models.

Model-based biclustering
A recent review of biclustering on expression data by Pontes et al. [19] classifies the methods using various taxonomies. One taxonomy is based on bicluster structure, specifically whether or not the genes and/or samples must be assigned to a bicluster (exhaustivity) and whether or not the genes and/or samples can be assigned to multiple biclusters (exclusivity). When considering blood biomarker discovery, an implicit property of the biomarker is that its expression profile is highly correlated between the blood and tissue of interest, and distinct from the rest of the tissues; indicating a unique biomarker for that tissue. Thus, the researcher would be interested in samples that are assigned to one bicluster only, in other words, non-overlapping column-exclusive biclusters. Examples of existing biclustering methods that adopt this property are plaid models developed by Lazzeroni and Owen [20], biclustering via Gibbs sampling developed by Sheng et al. [21], and Bayesian biclustering developed by Gu and Liu [22]. These are also examples of non metric-based probabilistic biclustering methods, based on another taxonomy provided in the review. The reader is referred to the review paper by Pontes et al. [19] for a structured and detailed discussion on the available biclustering methods.
Under the probabilistic framework, Martella et al. [23] propose a modified MFA technique for high-dimensional data for simultaneously clustering observations and variables. Variable cluster membership is represented by a binary row-stochastic matrix, which can be estimated viâ where j = 1, . . . , p, h, l = 1, . . . , q, g = 1, . . . , G, and Q is the expected complete-data log-likelihood. We have X i = μ g + g U ig + ig with probability π g . In this case, U ig ∼ N(0, I q ) and X i ∼ N(μ g , g g + g ). This particular form of factor loadings matrix results in a block-diagonal covariance matrix which is especially suitable in the biclustering framework because it models the grouped nature of the variables. Additionally, it results in the non-overlapping biclusters that are useful in blood biomarker discovery. Constraining or not constraining the covariance parameters across clusters leads to a family of four models. This family will be referred to as MFABC from this point forward. The remainder of this paper describes a one-way supervised biclustering technique and its application to simulated and real data.

Covariance structure
To accommodate biclustering we set the factor loadings matrix to be binary row-stochastic. To allow for supervision along the variable dimension, we provide the structure of this matrix to the algorithm. In our gene expression analysis case, the variables are the samples, thus we are setting a relationship between the samples in the data set and providing it to the algorithm during initialization and take it as constant. Constraints can be imposed or not on g , g , and g = ψ g I p to create a family of eight oneway-supervised Gaussian mixture models for biclustering (Table 1), which will be referred to as OSGaBi (one-way supervised Gaussian biclustering) hereafter.

Parameter estimation
This section provides the mathematical details required to compute the parameter estimates for the eight members of the OSGaBi family, with a focus on the CUU model because it is the most appropriate model for the application presented previously and later in the Application section. The expectation-maximization (EM) algorithm [24] is an iterative procedure for computing the maximum likelihood estimates (MLE) when data are incomplete or treated as such. The EM algorithm is based on the complete-data, which consist of both observed and missing data. The algorithm begins with the expectation step (E-step), where the expected value of the complete-data log-likelihood (Q) is computed conditional on the current parameter estimates. In the maximization step (M-step), The nomenclature, covariance structure, and number of covariance parameters for each member of the OSGaBi family. C, constrained; U, unconstrained Q is maximized with respect to the model parameters. These two steps are repeated until convergence. The alternating expectation-conditional maximization (AECM) algorithm [25] incorporates a series of conditional maximization (CM) steps instead of a single M-step and also allows for different specification of the completedata at each stage. This algorithm is used for parameter estimation for the MFA model, the PGMM family, and the MFABC family. It will also be used for the OSGaBi family.
For convenience, the following notation is adopted. We denote the observed data as x and the unobserved latent parameters as U i = (U i1 , . . . , U iG ). We denote the missing group memberships as z i , where In the first cycle of the AECM algorithm, (x i , z i ) are the complete-data, where i = 1, . . . , n. During the CMstep, π g and μ g are updated. During the E-step, the z ig are replaced by their expected values leading to the calculation of the expected value of the complete-data log-likelihood, Q 1 . In the CM-step, Q 1 is maximized to givê where n g = n i=1ẑ ig . In the second cycle of the AECM algorithm, (x i , z i , U i ) are the complete-data. During this CM-step, g is updated. During this E-step, z ig are replaced byẑ ig and U ig and U ig U ig are replaced by respectively, where β g = ( + g ) −1 for model CUU, to allow for the calculation of Q 2 . In the CMstep, the maximization of Q 2 is specific for each model. Considering the CUU model, where C is a constant with respect to the unknown param- The following score function is obtained when differentiating Q 2 with respect to g : Now, setting S( ,ˆ g ) = 0 and solving gives the estimatê The parameter estimates for the remaining seven models are derived similarly and are provided in the Additional file 1 titled OSGaBi_MWong_appendix.pdf.
When running the AECM algorithm, utilizing the Woodbury identity [26] avoids inverting any non-diagonal p × p matrices that may be singular for p n. Suppose an n × n matrix A, an n × q matrix H, a q × q matrix C, and a q × n matrix V. The Woodbury identity states that Specifically for the AECM algorithm, setting H = , V = , A = , and C = I q results in Now, instead of inverting the p × p covariance matrix on the left side of Eq. 8, only the diagonal and q × q matrices on the right side need to be inverted. With gene expression data where q p, this identity provides a major computational advantage. Another useful identity is for calculating the determinant of the covariance matrix in the AECM algorithm:

Component membership
The predicted biclustering for each member of the OSGaBi family is given by the maximum a posteriori (MAP) classification for the observations and the classifications originally provided for the variables. That is, the posterior predicted component membership of observation (i.e., gene) i is the value of g for whichẑ ig is greatest. In the biological sense, this will identify which gene belongs to which subset, implying that the genes in each subset are related in some way. Component membership of variable (i.e., sample) j is already provided as g at the beginning of the algorithm, specifically In the biological view, we know a priori that a certain set of samples are/should be related to each other, which is uncorrelated to another set of samples. A concrete example of how component membership is applied in microarray gene expression analysis is presented in the Application section.

Convergence and model selection
Convergence of the AECM algorithm is determined using the Aitken's acceleration [27] to estimate the asymptotic maximum of the log-likelihood at each iteration of the AECM algorithm for a specific number of components and a specific number of factors, as described in [28]. The Aitken's acceleration at iteration t is where l corresponds to the respective log-likelihood. The asymptotic estimate of the log-likelihood at iteration t + 1 is [29]. The stopping criterion l (t+1) ∞ − l (t) < [30], where = 0.1, is used and provided that the difference is positive [13]. The Bayesian information criterion (BIC) [31] is used to choose the best member of the proposed OSGaBi family with respect to the model and number of components, G.

Simulation studies
Simulation studies were carried out to validate the proposed biclustering algorithm. The adjusted Rand index (ARI) [32] was used to evaluate the performance of the algorithm in recovering biclusters from the simulated data. Specifically, z i was compared withẑ i after convergence was attained. Model selection was done via the BIC as previously described, although it can be noted that the integrated completed likelihood (ICL) [33] and Akaike information criterion (AIC) [34] were used as comparison and produced the same outcomes. The parameters and resulting data sets for the following simulation studies are found in the (Additional file 1: Supplementary files).
Simulated data were generated with G = 2, 3, and 4 clusters for observations and q = 2 clusters for variables. This resulted in 4, 6, and 8 biclusters, respectively. Four cases were examined: low, medium, and high variance coupled with good cluster separation, and high variance coupled with relatively close clusters. For each case, 100 data sets were generated, where each set had p = 8 variables and n = 200, 300, 400 observations (for G = 2, 3, 4, respectively) and was randomly generated from the same Gaussian distribution. Examples of heatmaps for each of the CUC cases visibly indicate that there are distinct biclusters in the simulated data (Fig. 1). To reflect the oneway supervised nature of the algorithm, the true was provided. Twenty random starts were used for each run of the algorithm. Table 2 presents the results from these four simulation studies for the CUU and CUC models. It lists the average number of components selected, the most frequently chosen model, and the average ARI when fitting G = 2, . . . , 10 observation clusters. Because the algorithm was sometimes overfitting for the number of components based on the model it chose, another analysis was included to show the average ARI when the number of clusters was known (i.e., G = 2, 3, 4, depending on the case). These results are shown in the last column. The CUU and CUC models are focused on because they are the most probable cases in real-life scenarios, and additionally, they are the models most frequently selected when the number of clusters was known (results not shown).
For completeness, simulation studies were conducted on the remaining six OSGaBi models using simulated data with medium variance and good cluster separation and with the same properties as that used for the CUU and CUC models. As the true g or was provided, it implied that for models with unconstrained (i.e., models UUU, UUC, UCU, and UCC), G was known because the for each component would have been provided. Table 3 presents the average ARI, most frequently chosen model, and average number of clusters selected from this simulation study for each of the remaining six models when fitting G = 2, . . . , 10 clusters for observations. Because the algorithm was once again sometimes overfitting for the number of components based on the model selected, this final analysis was included to show the algorithm's performance when the number of clusters was fixed to G = g known , where g known represents the number of observation clusters the data was generated from. The last column of the table presents the corresponding average ARI when fixing G for the CCU and CCC models.
It is important to note that although the algorithm was overfitting for the number of components based on the model selected, the majority of the time the original components were simply being broken into smaller components. A classification table from one of the simulation results illustrates the very common occurrence (Table 4). In this specific result, Cluster 1 was broken up into three components by the algorithm, resulting in a  Average ARI, most frequently chosen model, and the number of observation clusters selected for the CUU and CUC models using simulated data with low, medium, and high variance (var) with good cluster separation (sep), and high variance with relatively close clusters when fitting G = 2, . . . , 10 observation clusters using 100 data sets and 20 random starts. The last column presents the ARI when fixing G = g known , where g known represents the number of observation clusters the data was generated from. Values in brackets represent the respective standard deviation total of four components. The final column of Tables 2 and  3 provide further evidence because once the algorithm is provided the correct number of components, the ARI become perfect or near perfect.

Rat data
We present the biclustering results from Affymetrix oligonucleotide array data from a nutritional and pharmaceutical intervention in diabetic rats . This study consisted of five male lean control rats and five male Zucker diabetic fatty (ZDF) rats, which are genetically predisposed to developing diabetes. Details regarding the original rat study are described in Beaudoin et al. [35]. From each animal, tissue was extracted from various tissue depots, including the liver and red tibialis anterior (red TA, a type of muscle). Blood was also extracted, resulting in a total of 30 samples. RNA was extracted from these samples and used for the subsequent microarray gene expression analysis. Pre-processed data can be found on Gene Expression Omnibus (GEO) [36], accession number GSE93402 (blood), GSE93403 (liver), and GSE93406  Average ARI, most chosen model, and the average number of observation clusters selected for the remaining six OSGaBI models using simulated data with medium variance and good cluster separation when fitting G = 2, . . . , 10 observation clusters using 100 data sets and 20 random starts. The last column presents the ARI when fixing G = g known , where g known represents the number of observation clusters the data was generated from, for the CCU and CCC models. Values in brackets represent the respective standard deviation (red TA). After pre-processing using the affy and oligo packages [37,38] respectively for R Bioconductor [39,40] respectively, n = 8801 genes remained. We worked with the top 2000 differentially expressed genes between the red TA and liver (p < 0.01). For this analysis, we set the genes as the observations (n = 2000) and the samples as the variables (p = 30). The goal of the biclustering analysis was to identify sets of genes within the blood that possess similar expression profiles within the distinct tissues. Thus, we aimed to Although the algorithm was overfitting for the number of components based on the model selected, the majority of the time the original components were simply being broken into smaller components match biclusters containing genes that had similar expression profiles that were unique for blood and a specific tissue type. We focus here on genes with similar expression profiles between blood and liver. Downstream, these candidate genes can be tested to determine if they can function as blood biomarkers of metabolic status in individuals in different contexts (i.e., response to interventions, different disease states, etc.); however, this subsequent analysis goes beyond the scope of the present article. We constrained the structure of g because we knew the relationships required among the three sample types. Specifically, we wanted correlated expression between blood and liver only, implying that expression between blood and red TA were uncorrelated and expression between liver and red TA were uncorrelated as well. The other (extraneous) relationship characterized by the block-diagonal covariance matrix was the correlated nature of the expression strictly among the liver samples. Consequently, q = 2 for the number of variable clusters (i.e., the two relationships described previously). Sample types were constant across all components, i.e., g = , and thus we limited the algorithm to fit model CUU and CUC. These two -constrained models were chosen based on the results from the simulation studies previously mentioned. We normalized the data and fitted the range of G = 2, . . . , 30.
The BIC selected a CUU model with G = 19 observation clusters for the blood-liver analysis. As seen from the heatmaps before and after biclustering and subsequent rearranging, there were definitive biclusters in the data (Fig. 2). We inputted the gene lists for each of the 19 biclusters into the online functional annotation tool DAVID (Database for Annotation, Visualization and Integrated Discovery) [41,42] to elucidate potential biological processes that were dominant in each bicluster. DAVID functional annotation results indicated that the largest proportions of genes in the blood-liver biclusters had roles in protein metabolic and modification processes, carboxylic metabolic process, oxaloacid metabolic process, and intracellular signal transduction (biological processes as defined by the Gene Ontology Consortium, [43]), all biological processes of which have previously been shown to have an involvement in diabetes and obesity, and some processes within the liver [44][45][46][47][48][49][50]. These processes accounted for approximately 20-43% of the genes in the various biclusters and were all statistically significantly enriched (p < 0.05). There is also a general inference that insulin resistance occurs at different times in insulin sensitive tissues such as muscle and liver [51,52]; therefore, it is not surprising that the expression profiles between the liver and red TA were not similar. Additionally, it has been previously established that the peripheral blood transcriptome reflects changes in various tissues throughout the body [3], a property that is illustrated in the biclusters of interest.

Human data
The second data set we analyzed is another Affymetrix oligonucleotide array retrieved from GEO, accession number GSE1133. The original study aimed to profile 79 human and 61 mouse tissues in terms of their transcriptomes under normal conditions [53]. Here, we focus on the human arrays, specifically the tissues related to the immune system (20 tissue types) and the brain (16 tissue types), and also whole blood, for a total of 37 tissue types. Each tissue had two replicates, giving a total of 74 samples. After pre-processing using the same methods described for the rat data and removing genes without Entrez gene identifiers, n = 3867 genes remained, of which 2148 genes were differentially expressed between brain and immune tissues (p < 0.01). Similar to the rat data, we set the genes Fig. 2 Heatmaps of the rat data. Heatmap of the rat data before biclustering (a). The red, yellow, and blue bars along the columns and represent liver, blood, and red TA samples, respectively. Heatmap of the rat data after biclustering and subsequently rearranging the rows so that the observation clusters were contiguous (b). Black and grey bars along the columns and rows simply represent the presence of the clusters and do not indicate relationships between them. G = 19 for the observation clusters and along the columns, q 1 = q 2 = 2 as the observations (n = 2148) and the samples as the variables (p = 74).
The goal of this biclustering analysis was to identify sets of genes within the blood that possess similar expression within the distinct groups of tissues. Thus, we aimed to match biclusters containing genes that had similar expression that were unique for blood and a specific group of tissues. We focus here on genes with similar expression between blood and immune tissues. Subsequent work can involve determining which of these candidate genes can function as blood biomarkers of normal immune function in individuals.
Similar to the rat data, we constrained the structure of g because we knew the relationships required among the three sample types. Specifically, we wanted correlated expression only between blood and immune tissues. This implied that expression between blood and brain tissues were uncorrelated, and expression between immune and brain tissues were uncorrelated as well. The other (extraneous) relationship characterized by the block-diagonal covariance matrix was the correlated nature of the expression strictly among the immune tissue samples. Consequently, q = 2 for the number of variable clusters (i.e., the two relationships described previously). Samples were constant across all components, i.e., g = , and thus we again limited the algorithm to fit model CUU and CUC. We normalized the data and fitted models in the range of G = 2, . . . , 30.
The BIC selected a CUU model with G = 10 observation clusters for the blood-immune analysis. As seen from the heatmaps before and after biclustering and subsequent rearranging, there were again definitive biclusters in the data (Fig. 3). DAVID functional annotation results indicated that the largest portion of genes in each bicluster had roles in the nucleobasecontaining small molecule metabolic process, macromolecule metabolic process, microtubule-based process, microtubule cytoskeletal organization, response to DNA damage stimulus, and transmembrane transport; all biological processes that have been linked to immune responses [54][55][56][57]. These processes accounted for anywhere between 4 to 51% of the genes in the various biclusters, and were all statistically significantly enriched (p < 0.05). Furthermore, blood acts as a transporter for the immune system by transporting immune cells throughout the body, thus blood can provide an extensive view of the immune status of an individual [58]. This property is reflected in the biclusters of interest because there is a Fig. 3 Heatmaps of the human data. Heatmap of the human data before biclustering (a). The red, yellow, and blue bars along the columns and represent immune tissues, whole blood, and brain tissues respectively. Heatmap of the human data after biclustering and subsequently rearranging the rows so that the observation clusters were contiguous (b). Black and grey bars along the columns and rows simply represent the presence of the clusters and do not indicate relationships between them. G = 10 for the observation clusters and along the columns, q 1 = q 2 = 2 correlation among the expression between the blood and the immune tissues.

Discussion
One specific taxonomy of biclustering methods for gene expression data aims to retrieve non-overlapping biclusters characterized by one specific sample type (in this case, "sample" could refer to a type of treatment, tissue, disease state, etc.) along the variable dimension, as reviewed in Pontes et al. [19]. This is useful in applications such as disease subtype discovery, where the focus is to elucidate the various disease subtypes based on the genes. Conversely, in blood biomarker discovery, knowledge of the types of samples a priori is required and the focus is on the relationships between those sample types based on the genes, which is where one-way supervised biclustering is able to play a role. Two inherent criteria of blood biomarkers are that there is 1) a correlation between blood and the tissue of interest and 2) no correlation between those two sample types and other tissues. The second criteria is enforced by including other tissues into the biclustering analysis so that the condition can be set in conjunction with the first criteria. The relationships required are satisfied through the use of one-way supervision to explicitly determine the relationship between blood and the various tissues. To the best of our knowledge, biclustering methods currently available under the taxonomy of non-overlapping biclusters do not provide the option of one-way supervision along the variable dimension to aid in applications such as blood biomarker discovery.
Another advantage of approaching tissue-specific blood biomarker discovery through the use of biclustering is the ability to identify groups of genes that are potentially related to each other through their biological pathways. Commonly, correlation analysis between blood and a tissue is conducted using the available gene list in its entirety, e.g. [59], consequently not revealing any information about genes related by biological pathways that a cluster analysis would provide. In our OSGaBi family, setting the variable clusters labels and subsequently biclustering conditional on this information allows us to handle this limitation of correlation analysis.
Simulation study results show that models with values of G that are too high are sometimes selected, and this problem becomes more pronounced for high variance. While the BIC has been shown to be unreliable in higher dimensions, e.g. [60] -and this may suggest that further research on an optimal model selection criteria for this family of biclustering models is warranted -it is quite possible that the selection of larger values of G is simply a result of lack of concentration around the modes at higher variances. The inclusion of results for fixed G follows [61] and [62], who carried out mixture model analysis of gene expression data by treating G as fixed and known.
Note that, in [23] where the binary row-stochastic factor loadings matrix is a property of their MFABC family, the authors report simulation results but do not mention the model selection criterion or the range of number of observation clusters fitted; therefore, it is not known if the authors treated G as fixed. Conversely, the authors mention the use of the BIC and AIC for model selection in their real data study with gene expression data, supporting the use of the BIC for our analyses until the optimal model selection criteria is determined.
Future work will also aim to compare performance of the OSGaBi family to that of other model-based biclustering algorithms capable of detecting non-overlapping clusters and allowing for one-way supervision. Current methods are available for the former (as mentioned previously), but do not allow for the latter criteria. This limitation in the existing methods makes it difficult to compare the genes that are found in the biclusters to those found using the OSGaBi family since they do not always correspond to the intended subset of variables.
We have presented biclustering results using the OSGaBi family on two real microarray gene expression data sets. The first one was a previously unpublished rat microarray gene expression data set, where identified biclusters corresponded to genes whose expression profiles were correlated between liver and blood (and not between red TA and blood, or liver and red TA). Identified biclusters were enriched in genes related to biological processes known to play a role in insulin resistance and obesity in a tissue-specific manner. The second data set was a subset of a microarray gene expression data set from the GEO database that aimed to profile the human transcriptome under normal conditions. In this analysis, identified biclusters corresponded to genes whose expression correlated between immune tissues and blood (and not between brain tissues and blood, or immune and brain tissues). Identified biclusters contained genes related to biological processes previously associated with the immune system. Although further biological experimental analysis and interpretation need to be conducted to determine the best candidate gene(s) in both preliminary analyses, the initial results show promise in using the OSGaBi biclustering family for discovering novel blood biomarkers to act as surrogate tissue material in the maintenance of health and the prevention of disease.

Conclusions
A family of parsimonious Gaussian mixture models for the biclustering of gene expression data has been proposed. These models work in a one-way-supervised fashion in that the variable labels are known. The binary and row-stochastic factor loadings matrix results in a block-diagonal covariance matrix, which can be a useful property in biclustering applications for dictating the relationships between the variables. A promising application for our method is in the discovery of novel peripheral blood biomarkers for use as surrogate biopsy material.