 Research
 Open access
 Published:
RUBic: rapid unsupervised biclustering
BMC Bioinformatics volume 24, Article number: 435 (2023)
Abstract
Biclustering of biologically meaningful binary information is essential in many applications related to drug discovery, like protein–protein interactions and gene expressions. However, for robust performance in recently emerging large health datasets, it is important for new biclustering algorithms to be scalable and fast. We present a rapid unsupervised biclustering (RUBic) algorithm that achieves this objective with a novel encoding and search strategy. RUBic significantly reduces the computational overhead on both synthetic and experimental datasets shows significant computational benefits, with respect to several stateoftheart biclustering algorithms. In 100 synthetic binary datasets, our method took \(\sim 71.1\) s to extract 494,872 biclusters. In the human PPI database of size \(4085\times 4085\), our method generates 1840 biclusters in \(\sim 48.6\) s. On a central nervous system embryonic tumor gene expression dataset of size 712,940, our algorithm takes 101 min to produce 747,069 biclusters, while the recent competing algorithms take significantly more time to produce the same result. RUBic is also evaluated on five different gene expression datasets and shows significant speedup in execution time with respect to existing approaches to extract significant KEGGenriched biclustering. RUBic can operate on two modes, base and flex, where base mode generates maximal biclusters and flex mode generates less number of clusters and faster based on their biological significance with respect to KEGG pathways. The code is available at (https://github.com/CMATERJUBIOINFO/RUBic) for academic use only.
Introduction
Specialised gene clusters participate in specific cellular processes under a subset of conditions. Automatic identification of such geneclusters and conditionsubsets is usually known as biclustering and has immense value in such applications of bioinformatics (proteomics and genomics) like studying protein–protein interactions, automated drug discovery. The concept of biclustering algorithms was first introduced by Hartigan et al. [1]. Cheng et al. [2] first applied biclustering of both genes and conditionsubsets for knowledge discovery from expression data. A nodedeletion algorithm was presented by Cheng et al. [2] to find bicluster in expression data that have low mean squared residue scores. It found similarity based on a subset of attributes, clustering of genes and conditions, and overlapped bicluster that provides a better representation for genes with multiple functions or regulated by many factors. Their method allowed discovery of similar genes based on a subset of attributes, i.e., simultaneous clustering of genes and conditions. Subsequently, biclustering techniques have been widely applied in gene expression data analysis. In contrast to biclustering algorithms, such as the recently developed graph clustering algorithms [3,4,5], which can be applied to complex biological datasets, they have constraints related to unannotated data and dataset size.
In the field of biclustering within the realm of bioinformatics, recent years have seen the rise of numerous advanced computational methods. Many of these methods leverage the capabilities of machine learning. These approaches can be broadly categorized into two main groups: graphbased and nongraphbased biclustering techniques. Moreover, it’s worth noting that within these categories, individual algorithms are further subdivided based on their compatibility with different types of input data, with some designed to operate on raw expression data and others tailored for binary data. Tanay et al. [6] proposed a method that combines graph theory with statistical modeling and has polynomialtime complexity. Yang et al. [7] presented a probabilistic algorithm called FLOC for finding k possibly overlapping biclusters used graph theory to address the orderpreserving submatrix problem by identifying submatrices that preserve order relationships among genes and conditions. Other graph based approaches are ISA [8, 9], Samba [6], OPSM [10] , spectral biclustering [11], spectral biclustering [11] and xMotif [12] where xMotif utilizes graph theory to identify conserved gene expression motifs by representing genes and conditions as nodes in a graph and finding significant subgraphs that correspond to biclusters.
Some of the nongraph based biclustering methods introduced greedy heuristic idea or machine learning. Cheng et al. [13] proposed a greedy version of an existing biclustering algorithm. Santamaria et al. [14] developed BicOverlapper for visualizing biclusters from geneexpression matrices, whereas Uitert et al. [15] designed an algorithm to extract biclusters from sparse, binary datasets. Madeira et al. [16] presented the ‘eCCCBiclustering’ algorithm that mines coherent biclusters with approximate expression patterns. The FABIA tool [17] uses a multiplicative model and Bayesian techniques, whereas the DeBi algorithm [18] uses a frequent item setbased data mining approach to determine homogeneous biclusters. Sill et al. [19] offered a sparse singular value decomposition (SSVD) approach to control Type I error rates and discover stable biclusters. Huang et al. [20] presented a biclustering method based on evolutionary learning and applied it in a search space created by the conditions. Ayadi et al. [21] introduced a heuristic algorithm, ‘BicFinder’, to estimate the coherence of a given bicluster. Huang et al. [22] proposed biclustering for mining to find technical trading patterns that combine indicators from historical financial data series. Prelic et al. [23] proposed a fast and exact model biclustering algorithm called Bimax, considered a benchmark reference for most biclustering methods. Bimax achieved similar scores as the best biclustering techniques then and is still helpful in identifying potentially relevant groundtruth biclusters as a preprocessing step. Later, these chosen biclusters can be used as input for more accurate biclustering methods to speed up processing time and increase bicluster quality.
One common challenge faced by many of these biclustering algorithms is their inability to guarantee finding the global optima, potentially leading to the omission of some optimal biclusters. An algorithm known for its benchmark status in this regard is Bimax, introduced by Prelic et al. [23]. Bimax, designed for binary data models, has exhibited substantial performance improvements compared to other biclustering methods, including ISA [8, 9], Samba [6], OPSM [10], CC [2], and xMotif [12]. However, xMotif tend to identify large biclusters representing gene groups with stable expression levels, potentially lacking interesting patterns such as coregulation. ISA, Samba, and OPSM have demonstrated comparable performance, with Samba excelling in managing increased regulatory complexity but being more sensitive to noise than ISA. Bimax serves as a valuable preprocessing step to identify potentially relevant groundtruth biclusters, which can then be utilized as input for more accurate biclustering methods to enhance processing speed and bicluster quality. Although initially designed for binary data, Bimax has been extended to handle realvalued data, leading to the development of BiBit [24] and its successor, QUBIC [25], and QUBIC2 [26]. BiBit, an extension of Bimax, introduces a preprocessing step to transform data into binary format. QUBIC2, in contrast, employs a probabilistic model to directly handle realvalued data, showcasing superior accuracy and efficiency compared to BiBit and other stateoftheart biclustering algorithms on both synthetic and real datasets. A more recent entrant, ARBic [27], outperforms QUBIC and QUBIC2 in discovering highquality biclusters with varying dimensions and shapes while claiming increased robustness to noise and outliers. Both QUBIC2 and ARBIC algorithms involve essentially a twostep procedure: initial seed point selection and subsequent expansion to the biclustering, separated into 2 separate stages. The seed selection process for QUBIC2 and ARBIC relies on additional graph creation from all data points, leading to increased computational demands.
In this work, we present the Rapid Unsupervised Biclustering (RUBic) algorithm which demonstrates significantly enhanced speed in comparison to recently developed approaches, while effectively extracting all original biclusters from both synthetic and biological datasets. The aim of reducing computational overhead with respect to stateoftheart algorithms [24, 26, 27] was achieved by introducing a novel encoding and search strategy. The methodology presented in this article necessitates significantly fewer computational steps compared to BiBit [24], while producing the same maximal biclusters within a notably reduced timeframe, denoted as the base mode. Additionally, through the utilization of flexmode tuning in RUBic, biologically significant clusters were generated, surpassing those obtained by the most recently developed ARBic [27] and QUBIC2 [26] algorithms, as confirmed by KEGG pathway [28] annotations. Thus the contribution of this work is twofold:

1
A new biclustering algorithm (RUBic) that is significantly faster with reduced computational load (shown both mathematically and experimentally) than the recent competitors and hence establishing a new stateoftheart that has great potential for robust scalability in large healthcare datasets.

2
Unlike existing methods, RUBic can operate in two modes, the default mode produces the maximal set of biclusters like its competitors but faster, and a flex mode where conditional priors can be provided for biologically relevant subsets (validated with KEGG pathways) and hence has great potential in genomics/proteomics applications like automated drug discovery.
Methods
The developed RUBic method is designed to extract maximal biclusters from binary datasets. It takes input parameters as the binary input matrix (A) as well as the minimum number of rows \((r_{min})\) and columns \((c_{min})\) which are allowed in the final biclusters. Let the input binary matrix be \(A=(R,C)\), where R and C are two finite sets of rows and columns, respectively. Let \(l:R \times C \rightarrow {\{0,1\}}\) be a binary function for the matrix A. Let the binary value \(l(r,c); r \in R,c \in C\), be denoted by \(\{b_{rc}\}\) and the corresponding decimal value be represented by \([[d_{rc}]], d_{rc} \in D; D={\{0,\ldots ,15\}}.\)The binary matrix \(A=(R,C)\), can be decomposed into N sets of M bits, such that \(A={\{r_1,\ldots ,r_N\}},\) with \(r_i={\{b_{i1},\ldots ,b_{iM}\}},\) where \(b_{ij} \in \{0,1\}, N =R\) and \(M=C.\)
Let, \(B=\{B_1,\ldots ,B_k\}\) be the set of biclusters, such that \(B_i=\{I_i,J_i\}\), is composed of the pair of nonempty sets, where \(I_i \subseteq R\) and \(J_i \subseteq C\). For any given bicluster \(B_i\), a set of columns \(J_i =\{c_1,\ldots ,c_K\}\) is called a template if for every \(c_k \in J_i\) and for every pair of rows \((r,r^{'}) \in I_i\), we have \(\{r,c_{k} \} \bigwedge \{r',c_{k}\}=1\). The bicluster \(B_i=\{I_i,J_i\}\) is called a maximal bicluster if and only if it is not entirely contained in any other bicluster. In another way, the biclusters will be maximal submatrices created from a template obtained by the application of the bitwise AND operator \((\bigwedge )\) to a pair of seed rows. Now, the problem addressed by the developed biclustering algorithm can be formally defined as follows: given a binary data matrix A, we want to identify a set of biclusters \(B=\{B_1,\ldots ,B_k\}\), where, \(B_i=(I_i,J_i)\) such that each bicluster \(B_i\) satisfies some particular features of homogeneity. Note that, the exact characteristics of homogeneity that a bicluster follow may differ from one approach to another approach. In our work, the homogeneity characteristic is governed by: (1) encoding of the template from a pair of seed rows by bitwise AND operations, and (2) efficient searching of the remaining rows, having common homogeneity characteristic, for possible inclusion in the bicluster. The overall working principle of RUBic is depicted in Fig. 1.
Encoding strategy
Each row, \(r_i=\{b_{i1},\ldots ,b_{iM}\}\), is first transformed to decimal form by converting every fourbit consecutive binary number to decimal number. So each row becomes \({\hat{r}}_i=\{d_{i1},\ldots ,d_{iP} \}\),where \(,P=M/4\) and \(d_{ij} \in D; D=\{0,\ldots ,15\}.\) Now we define \(V=\{V_{i}\}, \forall i \in D,\) where \(V_0=\{0\}\) and \(V_i=\{x [[i \bigwedge x]]=x\},\forall i, x=1,\ldots ,15\) (see Fig. 2a for details). For example, in case of \(V_5=\{1,4,5\}\), all binary equivalent of decimal numbers in D having 2nd and/or 4th bit (from right) as 0 are included in \(V_5\). Note that, binary equivalent of 5 is 0101 and bitwise AND \((\bigwedge )\) with all other binary combinations whose 2nd and/or 4th bits as 0 are 0001, 0100 and 0101, i.e., 1,4 and 5 in decimal number system. So, from definition, \(V_i\) contains those nonzero decimal numbers (x) whose fourbit binary numbers return same number x after bitwise \(\bigwedge\) operation with i.
Searching strategy
Let us assume that the seed row pair \({\hat{r}}_i\) and \({\hat{r}}_j\) creates a bicluster \(B_{new}=\{I_{new},J_{new}\}\), where, \(({\hat{r}}_i, {\hat{r}}_j ) \in {I_{new}}\) and \((j>i)\), as a result of applying the operator \(\bigwedge\). We can also write, \(B_{new}=\{I_{new},J_{new}\} =\{\{{\hat{r}}_i, {\hat{r}}_j\},\{\rho _{ij}\}\},\) where \(\rho _({ij})\) is new template having number of nonzero terms greater than equals to \(r_{min}\) and \(\rho _{ij} = \bigcup _{k=0}^{P1}[[d_{ik} \bigwedge d_{jk}]]\); where, \(P=M/4\). Let \(\Vert \rho _{ij} \Vert\) be the count of number of nonzero columns in \(\rho _{ij}\). Note that, we proceed only if \(\rho _{ij}\) is unique (new) and \(\Vert \rho _{ij} \Vert \ge c_{min}\).
Now, for each of the remaining rows \(({\hat{r}}_l\)), in \(I_{new}=R\{{\hat{r}}_i, {\hat{r}}_j\},\) a decision is to be taken on the homogeneity criterion on, whether or not to retain \({\hat{r}}_l\) in \(I_{new}\). This is an exhaustive checking, and usually requires bitwise AND operations to compare every row with the template. In our work, we have proposed a novel mechanism by which we can columnwise search the content of every row in the set V, and decide whether or not to retain the row under consideration. If any column of a row fails the inclusioncondition, the complete row is excluded from the subsequent checking. More specifically, let \(d_{kl}\) be the kth column decimal value of \({\hat{r}}_l\), then the row \({\hat{r}}_l\) is excluded from \(I_{new}\) if \(d_{k\rho } \notin V_{d_{kl}}\) (please refer to Fig. 2b for illustration). Finally, if all the columns of any given row pass the inclusioncondition, then the row is included in \(I_{new}.\) We refer it as new bicluster \(B_{new}=\{\{I_{new} \} \cup \{({\hat{r}}_i, {\hat{r}}_j \},J_{new} \}=\{\{I_{new}\} \cup \ \{({\hat{r}}_i, {\hat{r}}_j \},{\rho _{ij} }\}\) if and only if \({I_{new}} \cup \{{\hat{r}}_i, {\hat{r}}_j \} \ge r_{min}\). This procedure is continued for other all possible pair of seed rows to generate all possible maximal biclusters. Please refer to the complete Algorithm 1 for details.
Computational complexity
To assess the computational complexity of the newly developed RUBic algorithm, we have observed that the computational overhead is maximum during the searching process. Once we compute the template, we need to search the selected column(s) of every other row for its possible inclusion in the bicluster. Let us assume that for the 1st column of the template, N comparisons are required to select the row(s) and only \(K_{1}\) rows are not selected due to above mentioned constraint (see Fig. 2a, b and Algorithm 1). Consequently we can exclude \(K_{1}\) rows in 1st column comparison, then only \((NK_{1})\) rows are selected for 2nd column comparison. Likewise, if \(K_{2}\) rows are excluded during 2nd column comparison, then \((NK_{1}K_{2})\) rows are to be processed during 3rd column comparison. This procedure is continued up to the Mth column, where (\(NK_{1}K_{2}\cdots K_{(M1)}\) ) comparisons are required to select the row(s) in the last step. Therefore, total number of comparisons which are required for the selection of rows for one bicluster can be formulated as:
where \(C=( K_{1}+K_{2}+\cdots +K_{(M1)}\)) is a large constant and M is number of columns and N is number of rows. Now let us derive the best case complexity where in 1st column comparison, all the rows except one are excluded as per our designed criteria. So, in first step, it takes N comparisons and in each remaining \((M1)\) steps there is only requirement of one comparison. So total number of comparison required in best case is: \(T_{best}=N+(M1).\) In the worst case scenario, every column of each row satisfies our designed criteria. So it requires maximum \(T_{worst}=N \times M\) number of comparisons in the worst case.
However, for the average case, the probability \((P_i)\) of successful search for each decimal number \((0, \ldots ,15)\) of any row in V is different. For (0, 15) this probability, \(P_0=P_{15}=\frac{1}{16}\), because they occur only once in their respective sets \(V_0\) and \(V_{15}\). For the numbers 1, 2, 4 and 8, \(P_1=P_2=P_4=P_8=\frac{8}{16}\). Likewise, \(P_3=P_5=P_6=P_9=P_{10}=P_{12}=\frac{4}{16}\) and finally, \(P_7=P_{11}=P_{13}=P_{14}=\frac{2}{16}\). Therefore, the probability of successful search of any number \((0 \ldots 15)\) of any row in V may be estimated as: \(\frac{1}{16} \sum _{(i=0)}^{15}P_i =\frac{66}{16 \times 16} \cong 1/4.\)
So, on average, in each column comparison, 25% of the rows are selected from analyses in subsequent columns. Therefore, if N comparisons are required in the 1st column comparison, \(\frac{N}{4}\) comparisons are required in the 2nd column, \(\frac{N}{4^{2}}\) comparisons in the 3rd column and so on. Therefore, total number of average case comparisons \(T_{average}\) is estimated as follows:
where \(C^{'}\) is a constant and k is sufficiently large. From the above formulation we can say the average case complexity of the developed RUBic algorithm is O(N), i.e., linearly proportional with the number of rows in the input vector.
Experimental results
To assess the performance of the developed RUBic algorithm, we have evaluated its behaviour in terms of the execution time and quality of the biclusters generated on both synthetic and experimental biological datasets. Our biclustering algorithm is able to execute in two modes, basemode and flexmode depending on the two basic parameters and an additional filtering by symmetric clusters removal. At basemode, we set the \(r_{min}=2\) and \(c_{min}=2\) to extract optimal clustering which provides maximal biclusters. We have executed our basemode RUBic on the synthetic data and shows the performance improvement with stateoftheart. However, we have evaluated our algorithm with two biological datasets in flexmode on (a) protein–protein interaction dataset and (b) real gene expression dataset.
Synthetic data analysis
In the current experiment, three sets of synthetic datasets were used to assess the performance of the developed algorithm and compare it with the existing stateoftheart methods. We compared the performance of the basemode RUBic with the existing fast model BiBit, as both generate maximal clusters. However, we excluded the very recent methods ARBic and QUBIC2 from this comparison as they are not designed for maximal cluster generation. Moreover, ARBic and QUBIC2 are unable to execute with \(r_{min}=2\) and \(c_{min}=2\) values, which causes inconsistency in clustering.
In the first experiment, synthetic binary matrices of size \(200\times 200\) are used from the BiBit data repository (Match_score_density_200\(\times 200\)_csv) [24], with varying density and overlapping characteristics. There are ten groups of \(200 \times 200\) matrices in which each group contains 10 matrices with varying density of 1’s from 5 to 50% with step wise increment of 5%. We have evaluated the average number of extracted biclusters in these ten groups of matrices and also estimated the average time of execution in each ten groups. In our experimental setup, the RUBic algorithm took \(\sim 71.1\) s to process 100 synthetic matrices of this dataset, whereas BiBit [24] took \(\sim 3.4\) h to complete the same task (see Table 1). It may be noted that the average time taken to extract a bicluster is decreasing with increasing density, in case of RUBic. RUBic takes 0.29 ms on average over all the bicluster produced whereas BiBit [24] takes 21.23 ms for the same purpose. Altogether, RUBic generated exactly the same number of biclusters in comparison to the 494, 872 biclusters generated by BiBit [24]. We have also evaluated the cluster quality by estimating the match scores [23] between the biclusters generated by the two methods. The average of maximum match scores of all biclusters generated by RUBic with respect to the BiBit [24] and viceversa are exactly the same. Overall, same biclusters are extracted by RUBic, in much lesser time. Detailed results on this synthetic dataset are compiled in the Table 1.
In the second synthetic data setup, to examine the scalability of the developed algorithm, we have worked over ten groups of matrices from size \(50\times 50\) to \(500\times 500\) with an increase of 50 rows and 50 columns in each group. Each group is having 10 matrices with the density of 1’s varying from 10 to 100% with an increment of 10% in each step. This dataset is also taken from the BiBit data repository (Performance_test_csv) [24] and the performance of the RUBic method over these ten groups of matrices is shown in Fig. 3. Table 2 shows the detailed results on the average time of execution over these ten groups of matrices. We can observe how the number of biclusters varies over these ten groups as the dimensionality increases. We have also plotted the average time of execution per bicluster over these groups to give an idea of the performance variation of RUBic with respect to scalability.
In the third category of synthetic dataset (Match_score_csv) [24], 11 different matrices of sizes \(100\times 100\) to \(110\times 110\) with varying degree of overlapping are implanted; groundtruth biclusters of size \(10\times 10\) to \(20\times 20\) are used. We have used the average bicluster relevance which measures up to what level the mined biclusters represent ‘true’ biclusters which we have already introduced in the dataset. We have used the average module recovery to know how far each of the true biclusters is recovered. Both of the measures are normalized between (0, 1), where higher value reveals that the set of generated biclusters is same as to true set of biclusters and 0 represents that the sets are disjoint [23]. The objective of this experiment was to evaluate average bicluster relevance and module recovery of the two methods, RUBic and BiBit [24], under consideration. Table 3 shows that both the methods extract the exactly the same biclusters with similar bicluster relevance and module recovery scores.
Biological data analysis
We have also assessed the performance of the RUBic method on the latest biological datasets, the Homo sapiens protein–protein interaction (PPI) dataset [29]. Initially it contained 6, 247 number of protein–protein interactions. We have removed selfinteractions from the dataset which reduces it to 5823 number of interactions only. However, it only contains 4085 number of unique proteins and we mapped it into a binary matrix of size \(4085\times 4085\) where 1’s represent interaction and 0’s represent noninteraction among the corresponding proteins. We have extracted the data for extracting biclusters with minimum number of row as 2 and minimum number of column as 2. We have observed that our method RUBic generates 1840 maximal biclusters in \(\sim 48.6\) s. Then the same dataset is used for BiBit [24] and it generates the same 1, 840 number of biclusters in \(\sim 251.6\) s. We have also worked over the central nervous system (CNS) embryonic tumor gene expression dataset [30] of size \(7129 \times 40\) were considered. We have also executed both modes of RUBic (base and flex). In this biological dataset, RUBicbase generates 747,069 maximal biclusters in \(\sim 101\) min, whereas, BiBit [24] produces the same number of biclusters in \(\sim 56\) h (see Table 4).
Performance on expression datasets
To compare RUBic with other algorithms on real datasets with a large number of columns (\(>500\)), we evaluated five datasets from E. coli, yeast, and human tissues. To ensure consistency between the algorithms, we conducted experiments using RUBicflex algorithm with two parameter (\(r_{min}\), \(c_{min}\)) values synchronized with both ARBic and QUBIC2. We have executed all the algorithms using the optimal parameters specified in their respective publications on these five datasets [30]. First, all the gene expression values are discretized into binary matrix using Mixture of Left Truncated Gaussian Distributions model as described in [26]. On these binary patters, RUBicflex is employed to extract the significant biclusters. It has been observed that in all 5 datasets, RUBicflex shown a significant performance improvement in terms of total execution time speedup (see Table 5) and average time per clusters compared to ARBic and QUBIC2. However, in average time per cluster, RUBic also surpass ARBic and QUBIC2 in all datasets with only exception of E. coli Colombos as depicted in Fig. 4. In all these datasets, ARBic generates significantly less number of clusters, where in E. coli Colombos and Yeast DREAM5 dataset, RUBic produces less clusters compared to QUBIC2 and rest three datasets RUBic generates higher number of clusters with less execution time. The detailed statistics of generated clusters and corresponding execution time (ET) are reported in Table 5.
As the true biclusters in real datasets are unknown, we evaluated each bicluster identified by each algorithm by using KEGG [28] biological pathwaybased analysis on E. coli Colombos to demonstrate the efficacy of our proposed approach, RUBic. To evaluate the significance of our algorithm, we incorporated all the pathways as compared to ARBic. We found that the match score of each cluster with the KEGG enriched pathway clusters was similar for all three approaches, ARBic, QUBIC2, and RUBic. The heatmap representation of the match score per KEGG enriched cluster is shown in Fig. 5A. In most of the pathways, based gene annotations are wellmapped with the biclusters found in RUBic. Each cell represents average matching scores of top 50 biclustering solutions with respect to KEGG enriched clusters (see Fig. 5A).
For example, in the ecoli_colombos dataset for the KEGG pathway‘Twocomponent system’ eco02020 (see expression heatmap Fig. 5B), we found a significant number of overlapping genes (34) with RUBic clusters where in QUBIC2 and ARBic extracts 18 and 29 enriched genes (correlation hits), respectively with a maximal matching cluster. Figure 5C represents the heatmap for the expression level correlation matrix from resulting biclusters of RUBicFlex, QUBIC2 and ARBic. Redboxed cells within the heatmap represent KEGG(eco02020) enriched genes present in the respective clusters, whereas Fig. 5D shows the enriched gene count in each clustering approach and the total number of genes within those clusters.
Conclusion
In this work we propose a novel algorithm for fast extraction of biclusters from binary datasets. We have evaluated its performance on both synthetic and biological datasets and compared the results with the existing stateoftheart. We also estimated the best case, worst case, and average case complexity of the developed method and attempt to show that the average case computational complexity is almost linearly proportional to the increase in the size of the dataset. Our approach can operate in two modes: basemode and flexmode, resulting in two types of clustering solutions  maximal biclusters and biologically plausible biclusters.
The performance evaluation was carried out on three carefully selected benchmark synthetic datasets using basemode and flexmode to assess the effects of varying density, degree of overlap, and dimensionality on the developed method. Two biological datasets were also considered. The first one was the human PPI dataset of low density of interactions in a large square matrix \(4085 \times 4085\) used to assess the performance of the method. The second dataset was a large CNS gene expression data used to test the performance of the method. In the case of the CNS gene expression dataset, RUBic generated a large number of biclusters in less than 2 h, which took \(\sim 56\) h by the BiBit algorithm [24]. This observation highlights the robustness of our method with respect to the scalability of the dataset.
The performance of RUBic was also evaluated on five gene expression datasets, two from E. coli, two from yeast, and one from human tissues. Our RUBicflex mode biclustering showed a significant performance improvement in terms of total execution time speedup compared to the most recent and popular biclustering strategies ARBic and QUBIC2. Finally, the extracted clusters were evaluated and validated with match scores of resultant clusters with KEGG enriched pathway clusters for biological significance.
Thus, not only is our RUBic algorithm faster than its competitors, it also produces an optimal set of biologically relevant biclusters, and established a new state of the art in across several benchmark synthetic, PPI and gene expression datasets, and hence is expected to have generalised applications across proteomic and genomic interactions, including high impact tasks of automated drug repurposing and drug discovery.
Availability of data and materials
The different public datasets used in this work are referred to in their respective publications, as described in the "Biological data analysis" Sect. .
Code availability
The code is available at (https://github.com/CMATERJUBIOINFO/RUBic) for academic use only.
References
Hartigan JA. Direct clustering of a data matrix. J Am Stat Assoc. 1972;67(337):123–9.
Cheng Y, Church GM. Biclustering of expression data In: Ismb. 2000;8:93–103.
Hu L, Yang Y, Tang Z, He Y, Luo X. Fcanmopso: an improved fuzzybased graph clustering algorithm for complex networks with multiobjective particle swarm optimization. IEEE Trans Fuzzy Syst (2023)
He T, Chan KC. Misaga: an algorithm for mining interesting subgraphs in attributed graphs. IEEE Trans Cybern. 2017;48(5):1369–82.
He T, Chan KC. Discovering fuzzy structural patterns for graph analytics. IEEE Trans Fuzzy Syst. 2018;26(5):2785–96.
Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002;18(suppl1):136–44.
Yang J, Wang H, Wang W, Yu P. Enhanced biclustering on expression data. In: Third IEEE symposium on bioinformatics and bioengineering, 2003. Proceedings., 2003. IEEE. pp. 321–327.
Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nat Genet. 2002;31(4):370–7.
Ihmels J, Bergmann S, Barkai N. Defining transcription modules using largescale gene expression data. Bioinformatics. 2004;20(13):1993–2003.
BenDor A, Chor B, Karp R, Yakhini Z. Discovering local structure in gene expression data: the orderpreserving submatrix problem. In: Proceedings of the sixth annual international conference on computational biology, pp 49–57 (2002)
Liu B, Wan C, Wang L. An efficient semiunsupervised gene selection method via spectral biclustering. IEEE Trans Nanobiosci. 2006;5(2):110–4.
Murali T, Kasif S. Extracting conserved gene expression motifs from gene expression data. In: Pacific symposium on biocomputing. Pacific Symposium on Biocomputing, pp 77–88 (2003)
Cheng KO, Law NF, Siu WC, Liew AWC. Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization. BMC Bioinform. 2008;9(1):1–28.
Santamaría R, Therón R, Quintales L. Bicoverlapper: a tool for bicluster visualization. Bioinformatics. 2008;24(9):1212–3.
Uitert Mv, Meuleman W, Wessels L. Biclustering sparse binary genomic data. J Comput Biol. 2008;15(10):1329–45.
Madeira SC, Oliveira AL. A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series. Algorithms Mol Biol. 2009;4(1):1–39.
Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, et al. Fabia: factor analysis for bicluster acquisition. Bioinformatics. 2010;26(12):1520–7.
Serin A, Vingron M. Debi: discovering differentially expressed biclusters using a frequent itemset approach. Algorithms Mol Biol. 2011;6(1):1–12.
Sill M, Kaiser S, Benner A, KoppSchneider A. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics. 2011;27(15):2089–97.
Huang Q, Tao D, Li X, Liew A. Parallelized evolutionary learning for detection of biclusters in gene expression data. IEEE/ACM Trans Comput Biol Bioinform. 2011;9(2):560–70.
Ayadi W, Elloumi M, Hao JK. Bicfinder: a biclustering algorithm for microarray data analysis. Knowl Inf Syst. 2012;30:341–58.
Huang Q, Wang T, Tao D, Li X. Biclustering learning of trading rules. IEEE Trans Cybern. 2014;45(10):2287–98.
Prelić A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006;22(9):1122–9.
RodriguezBaena DS, PerezPulido AJ, Aguilar Ruiz JS. A biclustering algorithm for extracting bitpatterns from binary datasets. Bioinformatics. 2011;27(19):2738–45.
Li G, Ma Q, Tang H, Paterson AH, Xu Y. QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Res. 2009;37(15):101–101.
Xie J, Ma A, Zhang Y, Liu B, Cao S, Wang C, Xu J, Zhang C, Ma Q. QUBIC2: a novel and robust biclustering algorithm for analyses and interpretation of largescale RNASeq data. Bioinformatics. 2020;36(4):1143–9.
Liu X, Yu T, Zhao X, Long C, Han R, Su Z, Li G. ARBic: an allround biclustering algorithm for analyzing gene expression data. NAR Genom Bioinform. 2023;5(1):009.
Kanehisa M, Furumichi M, Sato Y, Kawashima M, IshiguroWatanabe M. KEGG for taxonomybased analysis of pathways and genomes. Nucleic Acids Res. 2023;51(D1):587–92.
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl1):449–51.
Saelens W, Cannoodt R, Saeys Y. A comprehensive evaluation of module detection methods for gene expression data. Nat Commun. 2018;9(1):1090.
Acknowledgements
The authors are grateful to Dr. Ben Macarthur and Dr. Chris Harbron of the TuringRoche partnership for their valuable advice throughout.
Funding
This project did not receive any direct funding for the research work, however the publications charges are covered by TC’s funder, the TuringRoche strategic partnership.
Author information
Authors and Affiliations
Contributions
All the authors contributed to the conception and design of the study. BKS and AKH were responsible for data collection and analysis and for writing the original draft. SB and TC were responsible for the verification, review and editing of the manuscript. Supervision and project administration were done by SB and TC. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interest
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sriwastava, B.K., Halder, A.K., Basu, S. et al. RUBic: rapid unsupervised biclustering. BMC Bioinformatics 24, 435 (2023). https://doi.org/10.1186/s12859023055343
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859023055343