Identifying gene-specific subgroups: an alternative to biclustering

Background Transcriptome analysis aims at gaining insight into cellular processes through discovering gene expression patterns across various experimental conditions. Biclustering is a standard approach to discover genes subsets with similar expression across subgroups of samples to be identified. The result is a set of biclusters, each forming a specific submatrix of rows (e.g. genes) and columns (e.g. samples). Relevant biclusters can, however, be missed when, due to the presence of a few outliers, they lack the assumed homogeneity of expression values among a few gene/sample combinations. The Max-Sum SubMatrix problem addresses this issue by looking at highly expressed subsets of genes and of samples, without enforcing such homogeneity. Results We present here the K-CPGC algorithm to identify K relevant submatrices. Our main contribution is to show that this approach outperforms biclustering algorithms to identify several gene subsets representative of specific subgroups of samples. Experiments are conducted on 35 gene expression datasets from human tissues and yeast samples. We report comparative results with those obtained by several biclustering algorithms, including CCA, xMOTIFs, ISA, QUBIC, Plaid and Spectral. Gene enrichment analysis demonstrates the benefits of the proposed approach to identify more statistically significant gene subsets. The most significant Gene Ontology terms identified with K-CPGC are shown consistent with the controlled conditions of each dataset. This analysis supports the biological relevance of the identified gene subsets. An additional contribution is the statistical validation protocol proposed here to assess the relative performances of biclustering algorithms and of the proposed method. It relies on a Friedman test and the Hochberg’s sequential procedure to report critical differences of ranks among all algorithms. Conclusions We propose here the K-CPGC method, a computationally efficient algorithm to identify K max-sum submatrices in a large gene expression matrix. Comparisons show that it identifies more significantly enriched subsets of genes and specific subgroups of samples which are easily interpretable by biologists. Experiments also show its ability to identify more reliable GO terms. These results illustrate the benefits of the proposed approach in terms of interpretability and of biological enrichment quality. Open implementation of this algorithm is available as an R package.

This document presents the evolution of the Gene Ontology [1] (GO) enrichment as a function of K, the maximal number of biclusters to find. Our approach, K-CPGC, the six biclustering algorithms (CCA, ISA, QUBIC, Plaid, Spectral and xMOTIFs) and the 17 Saccharomyces cerevisae datasets [2] we consider are detailed in the main manuscript. A subset of genes and a subset of samples are defined for each bicluster identified. An enrichment step provides a list of GO terms and FDR (false discovery rate) corrected p-values [3] for each subset of genes. The enrichment procedure is performed using the clusterProfiler R package [4].
We compare algorithms on the number of enriched biclusters. According to the methodology proposed in [5,6,7,8], a specific bicluster is considered enriched if there is at least one GO term with an FDR corrected p-value below 5%. As a supplementary evaluation, we compare algorithms on the number of GO terms enriched (with an FDR corrected p-value below 5%).
Fixing parameter K is required for appropriate evaluation of algorithms. Choosing a value, however, depends on the considered algorithm and dataset. In practice, this is not a critical choice since the analyst can start with K = 1 and use the proposed gene enrichment analysis to check whether the successive biclusters and GO terms returned by increasing K are still significantly enriched. Consequently, we report here the evolution of the number of enriched biclusters and the evolution of the number of enriched GO terms as a function of K. Value of parameter K ranges from 1 up to no significant improvement.
Each dataset is separately normalized by subtracting a threshold θ to all matrix entries before using K-CPGC. The threshold θ is set to the 75th percentile of expression values, specifically to each dataset, in the main manuscript. We consider such a threshold as representative of the objective of capturing high expression patterns. We also examine the 65th and 85th percentiles of expression values to complement analyzes of this document. We report as K-CPGC 0 xx the results of our approach after normalization through subtraction of the xxth percentile. Results reported as K-CPGC in the main manuscript correspond to results reported as K-CPGC 0 75 in this document. Figure 1 presents the evolution of the number of enriched biclusters as a function of K. We observe a rapidly growing number of enriched biclusters up to the chosen value in the main manuscript (K = 10). Our approach, with its three different θ values, essentially identifies more enriched biclusters than other approaches in the first part of the graph. Nevertheless, CCA produces more enriched biclusters in the long run. We explain the absence of important improvements in our approach by 1 the size of the discovered gene subsets. Indeed, biclusters are identified only when there is some signal remaining in the matrix. Therefore, identifying larger subsets of genes is penalized, as the remaining signal depends on the previously identified (and masked) biclusters. Table 1 presents the average size of gene subsets for K = 10 and K = 40. We observe that the difference between CCA and the three variants of K-CPGC increases with K. A data analyst would consider smaller values of K given the size of the datasets, the size of the identified subsets of genes and the results from Figure 1.  Figure 2 presents the number of different enriched GO terms identified by each approach for increasing values of K. We observe a rapidly growing number of enriched GO terms up to the chosen value in the main manuscript (K = 10), for most approaches, as in Figure 1. It confirms the ability of K-CPGC to quickly identify a large part of the relevant signal. It is furthermore noticeable that K-CPGC 0 65 and K-CPGC 0 75 identify more GO terms than all others approaches for any value of the parameter K.
Interpretations of both graphs can be reunited by observing that: 1 K should be large enough to observe differences regarding the enrichment, 2 K should, however, be small enough to ensure that there still is a signal to be found, regarding the number of biclusters and the number of GO terms.
Author details