BicPAMS: software for biological data analysis with pattern-based biclustering

The Erratum to this article has been published in BMC Bioinformatics 2017 18:162

Abstract

Background

Biclustering has been largely applied for the unsupervised analysis of biological data, being recognised today as a key technique to discover putative modules in both expression data (subsets of genes correlated in subsets of conditions) and network data (groups of coherently interconnected biological entities). However, given its computational complexity, only recent breakthroughs on pattern-based biclustering enabled efficient searches without the restrictions that state-of-the-art biclustering algorithms place on the structure and homogeneity of biclusters. As a result, pattern-based biclustering provides the unprecedented opportunity to discover non-trivial yet meaningful biological modules with putative functions, whose coherency and tolerance to noise can be tuned and made problem-specific.

Methods

To enable the effective use of pattern-based biclustering by the scientific community, we developed BicPAMS (Biclustering based on PAttern Mining Software), software that: 1) makes available state-of-the-art pattern-based biclustering algorithms (BicPAM (Henriques and Madeira, Alg Mol Biol 9:27, 2014), BicNET (Henriques and Madeira, Alg Mol Biol 11:23, 2016), BicSPAM (Henriques and Madeira, BMC Bioinforma 15:130, 2014), BiC2PAM (Henriques and Madeira, Alg Mol Biol 11:1–30, 2016), BiP (Henriques and Madeira, IEEE/ACM Trans Comput Biol Bioinforma, 2015), DeBi (Serin and Vingron, AMB 6:1–12, 2011) and BiModule (Okada et al., IPSJ Trans Bioinf 48(SIG5):39–48, 2007)); 2) consistently integrates their dispersed contributions; 3) further explores additional accuracy and efficiency gains; and 4) makes available graphical and application programming interfaces.

Results

Results on both synthetic and real data confirm the relevance of BicPAMS for biological data analysis, highlighting its essential role for the discovery of putative modules with non-trivial yet biologically significant functions from expression and network data.

Conclusions

BicPAMS is the first biclustering tool offering the possibility to: 1) parametrically customize the structure, coherency and quality of biclusters; 2) analyze large-scale biological networks; and 3) tackle the restrictive assumptions placed by state-of-the-art biclustering algorithms. These contributions are shown to be key for an adequate, complete and user-assisted unsupervised analysis of biological data.

Software

BicPAMS and its tutorial are available at http://www.bicpams.com.

Background

The biclustering task has been shown to be essential for improving the status-quo understanding of biological systems, being of particular relevance for expression data analysis (to discover putative transcription modules given by subsets of genes correlated in subsets of conditions [1]) and network data analysis (to unravel functionally coherent nodes [2]). Such relevance is further evidenced by the high number of recent surveys on biclustering algorithms for biological data analysis [3–6]. However, and as an attempt to minimize the complexity of the biclustering task, state-of-the-art biclustering algorithms [1, 7–10] place restrictions on the coherency, quality and structure of biclusters. These restrictions prevent the recovery of complete biclustering solutions and generally lead to the exclusion of non-trivial yet relevant biclusters. Furthermore, state-of-the-art biclustering algorithms generally rely on searches that cannot offer guarantees of optimality [11, 12].

Pattern-based biclustering emerged in recent years as an attempt to address these limitations [13]. Patterns coherently observed on a subset of rows, columns or nodes reveal homogeneous subspaces. In this context, pattern-based biclustering algorithms rely on widely-researched principles for efficiently mining distinct patterns (including frequent itemsets, association rules or sequential patterns) in large databases as the means to identify these subspaces in real-valued matrices or weighted graphs.

The major benefits of pattern-based approaches for biclustering are: 1) scalable searches with optimality guarantees [11]; 2) possibility to discover biclusters with parameterizable coherency strength and coherency assumption (including constant, additive, plaid and order-preserving plaid assumptions) [11, 12, 14]; 3) flexible structures of biclusters (arbitrary positioning of biclusters) and searches (non-fixed number of biclusters) [15, 16]; 4) robustness to noise and missing values [11] by introducing the possibility to assign multiple symbols or ranges of values to a single data element; 5) easy extension for labeled data analysis using discriminative patterns [11]; 6) applicability to sparse matrices and network data [2, 17]; 7) well-defined statistical tests to assess/enforce the statistical significance of biclusters [18], and 8) easy incorporation of constraints to guide the search [11].

Furthermore, results on biological data show their unique ability to retrieve non-trivial yet meaningful biclusters with high biological significance [2, 11, 14].

To integrate these dispersed contributions, BicPAMS (Biclustering based on PAttern Mining Software) is proposed to discover biclusters with customizable structure, coherency and quality, yet powerful default behavior. BicPAMS makes available earlier pattern-based biclustering algorithms (including BicPAM [11], BiModule [16] and DeBi [15]), well suited for expression data analysis. Furthermore, BicPAMS implements recent contributions that guarantee the applicability of biclustering towards network data (BicNET [17]), the discovery of order-preserving and plaid models (BicSPAM [12] and BiP [14]) and the incorporation of domain knowledge [19].

This work is organized as follows. The remaining part of this section provides the background on pattern-based biclustering. The “Implementation” section describes the behavior of BicPAMS, covering the allowed inputs, parameters and visualization options. The “Results” section provides empirical evidence of the role of BicPAMS in unraveling non-trivial and relevant putative modules from biological data. Finally, the major implications are highlighted.

Definition 1

Given a real-valued matrix (or network) A with a set of rows (or nodes) \(X=\{x_{1},..,x_{n}\}\), a set of columns \(Y=\{y_{1},..,y_{m}\}\) and elements \(a_{ij}\) relating row \(x_{i}\) and column \(y_{j}\) (or relating nodes \(x_{i}\) and \(x_{j}\)): the biclustering task aims to identify a set of biclusters \(\mathcal{B}=\{B_{1},..,B_{p}\}\), where each bicluster \(B_{k}=(I_{k},J_{k})\) is defined by a subset of rows \(I_{k}\subseteq X\) and columns \(J_{k}\subseteq Y\) (or two subsets of nodes) satisfying specific criteria of homogeneity and statistical significance.

The placed homogeneity criteria determine the structure, coherency and quality of a biclustering solution, while the statistical significance criteria guarantee that the probability of a bicluster occurring deviates from expectations. The structure of a biclustering solution is defined by the number, size, shape and positioning of biclusters. A flexible structure has a non-fixed number of arbitrarily positioned biclusters. The coherency of a bicluster is determined by the form of correlation among its data elements (coherency assumption) and by the allowed deviations per element against the perfect correlation (coherency strength). The quality of a bicluster is defined by the type and degree of tolerated noise. Figure 1 shows biclusters with different coherency assumptions for an illustrative symbolic dataset.

Fig. 1

Symbolic pattern-based biclusters with varying coherency assumptions

Definition 2

Given a matrix A, let the elements in a bicluster \(a_{ij}\in B\) have coherency across rows (patterns on rows) given by \(a_{ij}=k_{j}+\gamma_{i}+\eta_{ij}\), where \(k_{j}\) is the value expected for column \(y_{j}\), \(\gamma_{i}\) is the adjustment for row \(x_{i}\), and \(\eta_{ij}\) is the noise factor (determining the quality of the bicluster). Coherency across columns is identically defined over the transposed matrix, \(A^{T}\). Let \(\bar{A}\) be the amplitude of values in A. Given A, coherency strength is a real value \(\delta\in[0,\bar{A}]\), such that \(a_{ij}=k_{j}+\gamma_{i}+\eta_{ij}\) and \(\eta_{ij}\in[-\delta/2,\delta/2]\).

Definition 3

The properties of the \(a_{ij}\) elements define the coherency assumption: constant when \(\gamma_{i}=0\) and additive otherwise. A multiplicative assumption is observed when \(a_{ij}\) is better described by \(k_{j}\gamma_{i}+\eta_{ij}\). Symmetries can be accommodated on rows, \(a_{ij}\times c_{i}\) where \(c_{i}\in\{1,-1\}\). An order-preserving assumption is observed when the values along the subset of columns induce the same linear ordering per row. A plaid assumption considers the cumulative effects associated with elementary contributions from multiple biclusters on areas where their columns and rows overlap.
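The coherency assumptions of Definition 3 can be illustrated with a few checks on a noise-free submatrix. The following is a minimal sketch, not code from BicPAMS: the function names are hypothetical, noise is ignored (\(\eta_{ij}=0\)), and ties in the order-preserving check are broken by column index as an assumption.

```python
def is_constant(rows, tol=1e-9):
    """Constant assumption: every row repeats the same pattern (gamma_i = 0)."""
    base = rows[0]
    return all(abs(x - b) <= tol for r in rows for x, b in zip(r, base))


def is_additive(rows, tol=1e-9):
    """Additive assumption: each row equals the first row plus a row-specific shift."""
    base = rows[0]
    for r in rows:
        shifts = [x - b for x, b in zip(r, base)]
        if max(shifts) - min(shifts) > tol:
            return False
    return True


def is_order_preserving(rows):
    """Order-preserving assumption: all rows induce the same ordering of columns.

    Ties are broken by column index (an assumption made for this sketch).
    """
    orderings = {
        tuple(sorted(range(len(r)), key=lambda j: (r[j], j))) for r in rows
    }
    return len(orderings) == 1


additive_rows = [[1, 0, 1, 0], [3, 2, 3, 2]]
ordered_rows = [[1, 5, 3], [2, 9, 4]]
print(is_constant(additive_rows), is_additive(additive_rows))
print(is_additive(ordered_rows), is_order_preserving(ordered_rows))
```

A constant bicluster is also additive (all shifts are zero), and an additive bicluster is also order-preserving, which is why stricter assumptions yield biclusters contained in those found under more flexible ones.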

Definition 4

Given a bicluster \(B=(I,J)\), the bicluster pattern \(\varphi_{B}\) is the set of expected values (\(k_{j}\)) in the absence of noise (\(\eta_{ij}=0\)) and adjustments (\(\gamma_{i}=0\)) according to a fixed ordering of columns: \(\{k_{j}\mid y_{j}\in J\}\); while its support, \(|I|\), is the number of rows satisfying the pattern.

Consider the bicluster \((I_{2},J_{2})=(\{x_{1},x_{2}\},\{y_{1},y_{2},y_{4},y_{5}\})\) in \(\mathbb{N}_{0}^{+}\) from Fig. 1 with an additive coherency assumption across rows. This bicluster can be described by \(a_{ij}=k_{j}+\gamma_{i}\) with the pattern \(\varphi=\{k_{1}=1,k_{2}=0,k_{4}=1,k_{5}=0\}\), supported by two rows with additive adjustments \(\gamma_{1}=5\) and \(\gamma_{2}=1\).
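The additive example can be reproduced numerically. The sketch below rebuilds the bicluster values from the stated pattern and adjustments (the values are derived from the model \(a_{ij}=k_{j}+\gamma_{i}\), not read from Fig. 1) and then recovers the pattern again. The helper name is hypothetical, and the decomposition assumes the normalization \(\min_{j} k_{j}=0\), since the pattern and adjustments are otherwise only defined up to a constant.

```python
def decompose_additive(rows):
    """Split noise-free additive rows into (pattern k_j, adjustments gamma_i).

    Assumes the normalization min_j k_j = 0, so gamma_i is the row minimum.
    """
    gammas = [min(r) for r in rows]
    patterns = [[v - g for v in r] for r, g in zip(rows, gammas)]
    if any(p != patterns[0] for p in patterns):
        raise ValueError("rows do not share a common additive pattern")
    return patterns[0], gammas


# Pattern over columns y1, y2, y4, y5 and adjustments for rows x1, x2.
pattern = [1, 0, 1, 0]
adjustments = [5, 1]

# a_ij = k_j + gamma_i
rows = [[k + g for k in pattern] for g in adjustments]
print(rows)
print(decompose_additive(rows))
```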

Pattern-based Biclustering. The recently exploited synergies between biclustering and pattern mining paved the way for a new class of algorithms, generally referred to as pattern-based biclustering algorithms [13]. Pattern-based biclustering algorithms are natively prepared to efficiently find exhaustive solutions of biclusters and offer the unique possibility to affect their structure, coherency and quality [13]. This behavior justifies the increasing attention paid in recent years to this class of biclustering algorithms by the bioinformatics community for biological data exploration [11, 12, 14–17, 20].

Let \(\mathcal{L}\) be a set of items. In the scope of pattern mining research [21], a pattern is a frequent composition of items P, either an itemset (\(P\subseteq \mathcal{L}\)), association rule (\(P: P^{1}\rightarrow P^{2}\) where \(P^{1}\subseteq \mathcal{L}\wedge P^{2}\subseteq \mathcal{L}\)) or sequence (\(P=P^{1}\ldots P^{n}\) where \(P^{i}\subseteq \mathcal{L}\)). Given a set of observations \(D=\{P_{1},..,P_{n}\}\), let a full-pattern be a pair \((P,\Phi_{P})\), where P is a pattern and \(\Phi_{P}\) is the set of observations in D containing P. Let a closed pattern be a pattern without supersets with the same support (\(\forall_{P'\supset P}|\Phi_{P'}|<|\Phi_{P}|\)).
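To make the notions of full-pattern and closed pattern concrete for itemsets, the following sketch enumerates all closed frequent itemsets of a toy set of observations by brute force. It is an illustration under stated assumptions only: the function name and the `min_support` threshold are hypothetical, and the exponential enumeration stands in for the efficient searches discussed in the text.

```python
from itertools import combinations


def closed_full_patterns(observations, min_support):
    """Return {pattern: supporting observation indices} for closed frequent itemsets."""
    items = sorted(set().union(*observations))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            pattern = frozenset(candidate)
            support = frozenset(
                i for i, obs in enumerate(observations) if pattern <= obs
            )
            if len(support) >= min_support:
                frequent[pattern] = support
    # A pattern is closed when no proper superset has the same support.
    return {
        p: s
        for p, s in frequent.items()
        if not any(p < q and len(t) == len(s) for q, t in frequent.items())
    }


D = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
for pattern, support in closed_full_patterns(D, min_support=2).items():
    print(sorted(pattern), sorted(support))
```

In this toy example the itemset {a} is frequent but not closed, because its superset {a, b} occurs in exactly the same observations.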

Given a real-valued matrix A, pattern-based biclustering relies on mappings from A into D and on pattern mining methods able to discover all closed full-patterns, which are used to derive all maximal biclusters satisfying certain coherency (e.g. \(\eta_{ij}<\epsilon\)) and structure criteria (e.g. \(|\mathcal{B}|>p, |I_{k}|>\theta,(\bigcup_{k} B_{k}\cap A)>\tau\)). A maximal bicluster with regard to specified homogeneity criteria is a bicluster that cannot be extended with additional rows or columns while still satisfying the target criteria. See [22] for a detailed formal view on pattern-based biclustering.
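One way to picture the mapping from A into D for the constant assumption is to turn each row of an already discretized matrix into a set of (column, symbol) items. The sketch below assumes this itemization and uses hypothetical helper names; it derives the maximal bicluster associated with a seed pattern by closing the pattern over its supporting rows.

```python
def to_transactions(matrix):
    """Map each row of a symbolic matrix to a set of (column, symbol) items."""
    return [frozenset(enumerate(row)) for row in matrix]


def closure(transactions, pattern):
    """Largest item set shared by all transactions that contain the pattern."""
    supporting = [t for t in transactions if pattern <= t]
    return frozenset.intersection(*supporting) if supporting else frozenset(pattern)


def bicluster_from_pattern(transactions, pattern):
    """Rows and columns of the maximal constant bicluster seeded by the pattern."""
    closed = closure(transactions, frozenset(pattern))
    rows = [i for i, t in enumerate(transactions) if closed <= t]
    cols = sorted(j for j, _ in closed)
    return rows, cols


A = [
    [1, 0, 1, 2],
    [1, 0, 1, 0],
    [2, 0, 1, 2],
    [1, 0, 1, 1],
]
T = to_transactions(A)
# Seed: symbol 1 in column 0; the closure adds columns 1 and 2.
print(bicluster_from_pattern(T, {(0, 1)}))
```

The closed pattern cannot gain a column without losing a supporting row, which mirrors the definition of a maximal bicluster given above.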

In this context, a pattern-based biclustering solution is optimal with regards to certain coherency, quality and structure criteria. The optimality of pattern-based biclustering algorithms is linked with their exhaustive and unrestricted behavior, contrasting with peer greedy and stochastic biclustering algorithms.

The major potentialities of pattern-based biclustering against alternative biclustering approaches include the possibility to: perform efficient searches with guarantees of optimality [12]; discover biclusters with parameterizable coherency assumption and strength [11, 12]; guarantee robustness to noise, missing values and discretization problems through the possibility of assigning or imputing multiple values or symbols to a single data element [11]; discover structures with a non-fixed number and positioning of biclusters possibly characterized by plaid effects [14, 16]; annotate biclusters with a measure of their statistical significance [18]; extend their applicability towards network data and sparse data matrices [2, 17]; and incorporate domain knowledge from user expectations, knowledge repositories and literature in the form of constraints to guarantee a focus on biologically relevant and non-trivial biclusters [22].

Related work. Following Madeira and Oliveira’s taxonomy [1], biclustering algorithms can be categorized according to their homogeneity criteria (determined by the underlying merit function) and type of search (defined by whether the merit function is applied within a greedy [7, 23], exhaustive [10, 11] or stochastic [9] algorithmic setting). Hundreds of algorithms were proposed in the last decade to discover biclusters satisfying specific forms of homogeneity, as shown by recent surveys on biclustering algorithms for biological data analysis [3–6]. As a result, some of the most visible algorithms have been made publicly available through different software packages, such as BicAT [24] (endnote 1), biclust [25] (endnote 2), Expander [10] (endnote 3) or BicOverlapper [26] (endnote 4). However, the available biclustering algorithms (regardless of whether they are provided or not with adequate interfaces) assume very specific forms of homogeneity and therefore do not support the enumerated benefits of pattern-based biclustering approaches. Table 1 synthesizes the inherent properties of the state-of-the-art pattern-based biclustering algorithms and how they tackle the problems of peer biclustering algorithms. Despite their inherent benefits, they are not yet accessible through adequate graphical or application programming interfaces (GUI/API), and their contributions remain dispersed, so it is still uncertain whether they can be consistently integrated.

Table 1 Recent breakthroughs on pattern-based biclustering: algorithms and tackled limitations

Implementation

BicPAMS (Biclustering based on PAttern Mining Software) is the first tool consistently combining state-of-the-art pattern-based biclustering algorithms and making them available within usable interfaces (GUI and API). Figures 2 and 3 provide snapshots of the graphical interface of BicPAMS (where parameters P1 to P20 can be used to determine the desirable properties of the output). First, BicPAMS is described according to the possibilities to parameterize the coherency, structure and quality of its outputs, and the principles to guarantee the efficiency of the underlying searches. We also visit additional contributions of BicPAMS associated with the exploration of potentialities inherent to the integration of pattern-based biclustering algorithms. Second, we cover implementation details associated with the behavior of BicPAMS and the provided interfaces.

Fig. 2

BicPAMS: sound and parameterizable behavior (annotations in purple)

Fig. 3

BicPAMS: textual and visual display of results

Pattern-based biclustering with BicPAMS

Coherency of biclusters. As highlighted in Table 1, BicPAMS allows the search for a parameterizable coherency assumption [P3]: constant overall, constant, multiplicative, additive, symmetric or order-preserving. BicPAMS also provides the possibility to robustly select the desirable coherency strength δ (such that \(\eta_{ij}\in[-\delta/2,\delta/2]\)). This is done by fixing the length of the alphabet of discretization \(|\mathcal{L}|\) [P4], where \(\delta\propto 1/|\mathcal{L}|\). Furthermore, it allows for the inclusion or exclusion of symmetries [P9] in order to effectively deal with both symbolic and real-valued datasets whose values span either positive and negative ranges or strictly positive ranges. Finally, BicPAMS also offers the possibility to select the coherency orientation: whether coherency is verified on rows or columns [P16].
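The relation \(\delta\propto 1/|\mathcal{L}|\) can be illustrated with an equal-width discretization, under which each symbol covers a range of width \(\bar{A}/|\mathcal{L}|\). This is a sketch under that assumption only; BicPAMS exposes discretization as a parameter [P7], and the function below is hypothetical rather than part of its API.

```python
def discretize(matrix, alphabet_size):
    """Equal-width discretization; returns (symbolic matrix, bin width).

    The bin width plays the role of the coherency strength delta:
    a larger alphabet gives narrower bins and therefore stricter coherency.
    """
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    width = (hi - lo) / alphabet_size
    if width == 0:
        return [[0 for _ in row] for row in matrix], 0.0

    def symbol(v):
        # The maximum value falls on the upper edge, so clamp it to the last bin.
        return min(int((v - lo) / width), alphabet_size - 1)

    return [[symbol(v) for v in row] for row in matrix], width


A = [[0.0, 2.0], [4.0, 8.0]]
print(discretize(A, 4))
print(discretize(A, 2))
```

Doubling the alphabet size from 2 to 4 halves the bin width in this example, so fewer values are treated as equivalent.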

Structure of biclusters. BicPAMS relies on the iterative application of dedicated pattern mining searches to guarantee that biclustering can be performed in the presence of a meaningful stopping criterion [P12], such as the minimum number of (dissimilar) biclusters or the minimum percentage of the elements in the original dataset covered by the found biclusters. The minimum number of rows [P12] (support) or columns [P13] of biclusters can be optionally inputted to guide the search. Different pattern representations can be used to affect the structure [P15]: simple (all coherent biclusters), closed (all maximal biclusters), and maximal (flattened biclusters with a high number of columns). Furthermore, BicPAMS makes available post-processing options with parameterizable criteria to merge and extend biclusters against the inputted homogeneity criteria and to filter biclusters against prespecified dissimilarity criteria [P19,P20].
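Two of the structure-related notions above, coverage of the dataset as a stopping criterion and dissimilarity-based filtering, can be sketched as follows. The overlap measure used here (share of a bicluster's elements already present in a kept, larger bicluster) is an assumption chosen for illustration, and the function names are hypothetical.

```python
def cells(bicluster):
    """Set of (row, column) elements covered by a bicluster (rows, cols)."""
    rows, cols = bicluster
    return {(i, j) for i in rows for j in cols}


def coverage(biclusters, n_rows, n_cols):
    """Fraction of matrix elements covered by at least one bicluster."""
    covered = set()
    for b in biclusters:
        covered |= cells(b)
    return len(covered) / (n_rows * n_cols)


def filter_similar(biclusters, max_overlap):
    """Keep larger biclusters first; drop those mostly contained in a kept one."""
    kept = []
    for b in sorted(biclusters, key=lambda b: len(cells(b)), reverse=True):
        area = cells(b)
        if all(len(area & cells(k)) / len(area) <= max_overlap for k in kept):
            kept.append(b)
    return kept


found = [([0, 1, 2], [0, 1]), ([0, 1], [0, 1]), ([3], [2, 3])]
print(coverage(found, n_rows=4, n_cols=4))
print(filter_similar(found, max_overlap=0.5))
```

An iterative search could stop once `coverage` reaches a target fraction or once the number of biclusters surviving `filter_similar` reaches a target count.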

Quality of biclusters. BicPAMS provides multiple strategies to guarantee robustness to noise. The user can calibrate the desirable level of tolerance to noise through: 1) post-processing procedures by specifying the allowed percentage of noisy elements within a bicluster [P5]; and 2) multi-item assignments by activating the possibility to assign a parameterizable number of symbols per element based on its original value [P8]. Similarly, BicPAMS guarantees robustness to missing values [P10] by providing imputation methods and enabling the discovery of biclusters with an upper bound on the allowed amount of missing values (particularly relevant when biclustering network data).
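The idea behind multi-item assignments can be sketched as follows: a value lying close to a discretization boundary receives both neighboring symbols, so that small amounts of noise do not break a pattern. The equal-width bins, the `noise_fraction` parameter and the function name are assumptions made for this illustration and do not reflect the exact procedure or defaults used by BicPAMS.

```python
def multi_item_symbols(value, lo, hi, alphabet_size, noise_fraction=0.2):
    """Return one or two symbols for a value under equal-width discretization.

    A second symbol is added when the value lies within `noise_fraction`
    of a bin width from the boundary shared with a neighboring bin.
    """
    width = (hi - lo) / alphabet_size
    position = (value - lo) / width
    main = min(int(position), alphabet_size - 1)
    offset = position - main  # location inside the bin, 0 = lower edge
    symbols = [main]
    if offset < noise_fraction and main > 0:
        symbols.append(main - 1)
    elif offset > 1 - noise_fraction and main < alphabet_size - 1:
        symbols.append(main + 1)
    return sorted(symbols)


for v in (3.0, 3.8, 4.1, 0.1, 8.0):
    print(v, multi_item_symbols(v, lo=0.0, hi=8.0, alphabet_size=4))
```

Values in the middle of a bin, or at the outer edges of the value range, keep a single symbol, so the number of items per row grows only where boundary effects are plausible.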

Efficiency. BicPAMS also relies on enhanced pattern mining searches able to explore efficiency gains from the biclustering task, inputted constraints and desirable structures [P17,P18]. BicPAMS supports frequent itemset mining and association rule mining (including Apriori-based, vertical or dedicated frequent pattern-growth searches [21]), as well as sequential pattern mining (including state-of-the-art and dedicated searches [27]). New searches based on annotated pattern-based trees (F2G search [28]) and diffsets are implemented within BicPAMS to surpass the problems associated with bitset-based searches, as well as searches able to seize efficiency gains from item-indexable properties (IndexSpan [29]). These searches are integrated with heuristics, guaranteeing an effective pruning of the search space in the presence of constraints such as minimum number of columns.

BicPAMS also makes available data structures to deal with sparse data [17] that guarantee a heightened time-and-memory efficiency in the presence of network data. Finally, the application programming interface (API) of BicPAMS can be used to explore additional efficiency gains from non-optimal searches (mining approximate patterns) and the application of pattern mining within distributed/partitioned data settings.

Synergies. BicPAMS provides the unprecedented possibility to consistently integrate the previously described options, thus combining the contributions of BicPAM [11], BicNET [17], BicSPAM [12], BiP [14], DeBi [15] and BiModule [16]. Furthermore, BicPAMS can incorporate background knowledge according to the contributions made available in BiC2PAM [19], such as the possibility to remove uninformative elements. The API further supports the specification of constraints and the integrative biclustering analysis of experimental data with annotations derived from knowledge repositories.

In this context, although BicPAMS offers an environment with a substantial number of parameters, it makes available default and dynamic parameterizations that are suitable for the majority of data contexts (see Table 4).

Furthermore, BicPAMS explores efficiency gains from particular combinations of parameters. This is, for instance, the case when BicPAMS is applied with multiple coherency or quality criteria at a time. In this context, the search benefits from new heuristics (based on the principle that biclusters with stricter coherency or quality are contained in biclusters with more flexible coherency or quality) and the joint application of pre- and postprocessing procedures.

On how to use BicPAMS

Input and output. BicPAMS supports the loading of input data according to a wide variety of tabular and network data formats (see Tables 2 and 3). Upon running BicPAMS, when the stopping criterion is met, a success message is displayed, enabling the visualization of the output. Both graphical and textual presentations (heatmaps and signal signatures) of the found biclusters are provided. Biclusters can be filtered, sorted and exported to be stored in knowledge bases or visualized in alternative software.

Table 2 BicPAMS: input data, major parameters, and output models
Table 3 Additional parameters of BicPAMS along the mapping, mining and closing steps

Figure 4 provides an illustrative application of BicPAMS for an inputted dataset (either in network or matrix format), showing the outputted biclusters for varying coherency assumptions. For this analysis, we assumed \(|\mathcal {L}|=4\), fixed discretization ranges and the assignment of multi-items for an adequate tolerance to noise.

Fig. 4

Illustrative application of BicPAMS: input data and output biclusters

Graphical interface (GUI). The desktop interface can be used to soundly parameterize pattern-based biclustering algorithms, as well as to visualize their output. Figures 2 and 3 provide illustrative snapshots. Soundness is guaranteed by: performing automatic form checks, disabling inconsistent fields when specific parameters are selected, and adequately displaying possible causes of errors (such as timeout alerts for heavy requests or data format inconsistencies).

Console, API and source-code. As an alternative to the graphical interface, BicPAMS makes available a console to facilitate its invocation within language-independent scripts, as well as a Java API, the respective source code and the accompanying documentation. The API is essential to: extend the behavior of pattern-based biclustering algorithms for other tasks (such as classification and indexation), and adapt the current behavior to guarantee an optimum ability to handle biological data with specific regularities. Detailed scenarios showing advanced possibilities made available in the API of BicPAMS are provided on the software’s webpage.

Parameters. The behavior of BicPAMS can be controlled along its three major stages. First, parameters along the pre-processing stage include: coherency strength (given by the number of items \(|\mathcal {L}|\) [P4]), normalization [P6], discretization [P7], imputation [P10], non-informative elements [P11], and the noise range η ij for multi-item assignments [P8]. Second, parameters along the mining step include: coherency assumption [P3] and orientation [P16], stopping criteria (such as minimum number of dissimilar biclusters) [P13], expectations (such as minimum number of columns) [P14], pattern representation [P15], and algorithmic choice [P17]. BicPAMS supports the parameterization of two post-processing procedures: maximum degree of noisy or missing elements per bicluster (using merging procedures [29]) [P5/P19] and dissimilarity criteria (using filtering procedures [11]) [P20]. The API further provides the possibility to specify a desirable minimum homogeneity threshold to extend or reduce the target biclusters according to a parameterizable merit function [11].

Tables 2 and 3 provide an in-depth description of each of these parameters, showing their default values and how they can be modified according to the properties of the input data and desirable outputs. For an exhaustive exploration of biological data without a priori knowledge of the desirable outputs, BicPAMS can be iteratively applied with varying coherency assumptions, coherency strength (\(|\mathcal{L}|\in \{3,4,5\}\)) and quality ({60%, 80%, 100%}). Table 4 discusses the default and data-driven parameterizations provided by BicPAMS, showing their adequacy for exploratory yet robust biological data analysis.
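The iterative exploration suggested above amounts to a sweep over a small parameter grid. The sketch below only enumerates the configurations and delegates the actual run to a caller-supplied function; `run_biclustering` and the dictionary keys are hypothetical placeholders, not names from the BicPAMS console or API, and the list of coherency assumptions is an illustrative subset.

```python
from itertools import product

COHERENCY_ASSUMPTIONS = ["constant", "additive", "multiplicative", "order-preserving"]
ALPHABET_SIZES = [3, 4, 5]        # coherency strength, |L|
QUALITY_LEVELS = [0.6, 0.8, 1.0]  # tolerated quality: 60%, 80%, 100%


def parameter_grid():
    """All combinations of coherency assumption, alphabet size and quality."""
    return [
        {"coherency": c, "alphabet_size": a, "quality": q}
        for c, a, q in product(COHERENCY_ASSUMPTIONS, ALPHABET_SIZES, QUALITY_LEVELS)
    ]


def explore(data, run_biclustering):
    """Apply a caller-supplied biclustering function once per configuration."""
    return [(cfg, run_biclustering(data, **cfg)) for cfg in parameter_grid()]


print(len(parameter_grid()))
```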

Table 4 Default and dynamic/data-driven parameterizations of BicPAMS

Scalability/limits. Although biclustering is inherently a computationally complex task, BicPAMS is natively prepared to analyze large-scale matrices/networks (>1 Gb) and, under strict optimality criteria, data with more than one million entries (200 Mb). BicPAMS provides the possibility to select data partitioning procedures. Assuming coherency across rows (patterns on rows), partitioning procedures should be applied when (|X|>20000|Y|>λ×1000) is satisfied for the constant assumption (λ=1) or remaining coherency assumptions (λ=0.1). In this context, BicPAMS is able to efficiently analyze expression data with more than 20000 genes (magnitude of the human genome) in hundreds of conditions, as well as sparse biological networks with over 20000 nodes.

Testing cases. Synthetic data (resembling biological data), gene expression data and biological networks with >100 Mb are provided with BicPAMS for testing purposes. In BicPAMS webpage, we provide study cases using both synthetic and real data with varying properties to illustrate the multifaceted potentialities of the software.

Results

Additional file 1 provides extensive experiments that extend the already available assessments of pattern-based biclustering algorithms [2, 11, 12, 14–16] towards new synthetic and real data and new performance views (including metrics of completeness, precision and accuracy). In these experiments, the performance of 15 distinct biclustering algorithms was for the first time compared for data contexts with varying size, regularities, and amount and type of noise. The gathered results confirm the enumerated advantages of BicPAMS, including its unique ability to efficiently find exhaustive and flexible solutions with superior robustness to noise.

A brief analysis follows of some of the results gathered from applying BicPAMS to discover regulatory modules in expression and network data. Additional file 1 extends these analyses (concerning both the functional enrichment and transcriptional regulation of the discovered modules) and demonstrates the relevance and completeness of pattern-based biclustering outputs against the outputs produced by alternative state-of-the-art biclustering algorithms. The biological relevance of the biclusters was assessed through the over-represented functional terms, using a hypergeometric test followed by Bonferroni correction. We considered a term to be highly enriched if it has a corrected p-value below 0.01.
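The enrichment criterion can be sketched with a one-sided hypergeometric test followed by Bonferroni correction. This is a generic illustration using only the Python standard library, not the code used to produce the reported results; the function names and the toy numbers are assumptions.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes from `population`,
    of which `annotated` carry the functional term."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Toy example: 6000 genes, 40 annotated with a term, bicluster of 100 genes, 8 hits.
p = hypergeom_pvalue(population=6000, annotated=40, sample=100, hits=8)
corrected = bonferroni([p, 0.2, 0.03])
print(corrected[0] < 0.01)
```

Under this scheme a term counts as highly enriched for a bicluster when its corrected p-value falls below 0.01, matching the threshold stated above.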

Case studies on expression data analysis. Three gene expression datasets were used: the dlbcl dataset (660 genes, 180 conditions) gathering human responses to chemotherapy [30], the hughes dataset (6300 genes, 300 conditions) to analyze nucleosome occupancy [31], and the gasch dataset (6152 genes, 176 conditions) with yeast responses to environmental stimuli [32]. The goal is to discover coherent expression patterns corresponding to known and putative transcriptional modules associated with the experimental goal (such as elicited immune responses in dlbcl and stress responses in gasch). Table 5 provides a functional enrichment analysis of pathways, cell lines, transcription factors (TFs) and gene ontology (GO) terms associated with 182 pattern-based biclusters found in the dlbcl dataset. BicPAMS was applied with a constant coherency assumption, multi-item assignments, 80% quality, 50% dissimilarity and 3 iterations. The number of genes per bicluster varies between 89 and 166. The enrichment was computed using the Enrichr web tool [33] against terms from the following databases: KEGG, WIKI, Reactome and BioCarta, human PPIs, Gene Ontology, NCI-60 and cancer cell line Encyclopedia, Human Gene Atlas and MSigDB. Each database annotates groups of genes with dedicated terms. Table 5 shows that all the biclusters found are associated with dissimilar sets of coherent terms. The analysis of enriched pathways, TFs, cancer cell lines, target cells, oncogenic signatures and GO terms confirms that the discovered biclusters are associated with meaningful and well-defined putative cellular responses to chemotherapy. Similar analyses were conducted for the hughes and gasch data, revealing an identical average number of enriched pathways per bicluster and a significantly higher average number of enriched TFs and GO terms per bicluster.

Table 5 Analysis of the highly enriched terms (p-value below 0.01 after correction using Enrichr [33]) for the 182 pattern-based biclusters found with BicPAMS in the dlbcl dataset (human cellular responses to chemotherapy) against multiple repositories: pathway databases (KEGG, WIKI, Reactome and BioCarta), human PPIs, GO, NCI-60 and cancer cell line Encyclopedia, Human Gene Atlas and MSigDB

Table 6 lists a compact subset of biological processes with significantly enriched terms for pattern-based biclusters found in the dlbcl, hughes and gasch datasets. The analysis of the enriched transcription factors (TFs) of these modules confirms their role in regulating cellular responses to chemotherapy (human) [34] and stress conditions (yeast) [35]. Table 7 further shows the properties of an illustrative subset of pattern-based biclusters with high biological significance as verified by the number of highly enriched terms after Bonferroni correction. These biclusters could not be identified by peer biclustering methods due to the presence of noise-tolerant patterns with multiple expression levels (B1, B2 and B5) and non-constant coherency assumptions (B3, B6, B8). Additional file 1: Tables S6–S9 further stress the relevance of discovering biclusters with plaid, order-preserving and symmetric assumptions.

Table 6 Illustrative set of terms highly enriched in BicPAMS biclusters
Table 7 Illustrative set of biologically relevant biclusters with different properties

Figure 5 plots four pattern-based biclusters from gasch data with distinct coherent responses of genes to heat shock at different points in time. These biclusters rely on constant, multiplicative, additive and symmetric assumptions, each associated with noise-tolerant patterns with five expression levels (\(|\mathcal {L}|=5\)). Understandably, alternative state-of-the-art biclustering algorithms are not able to discover identical biclusters due to the restrictive assumptions they place on the underlying homogeneity criteria.

Fig. 5

Pattern-based biclusters retrieved from gasch data following a constant assumption with symmetries (a), a multiplicative assumption with symmetries (b), and an additive assumption (c and d)

Case studies on network data analysis. Four biological networks were extracted from the DryGIN [36] and STRING v10 [37] databases (Table 8). The goal is to discover putative functional modules given by non-trivial yet coherently interconnected subsets of biological entities. Table 9 illustrates some of the highly enriched biclusters discovered by BicPAMS over the biological networks in Table 8, gathering modules with varying: tolerance to noise (0–15% noisy interactions per bicluster), amount of missing values (0–20% missing interactions per bicluster), coherency assumptions (dense/differential, constant and order-preserving) and coherency strength (\(D_{1}\)–\(D_{4}\) biclusters with \(\mathcal{L}=\{-2,-1,1,2\}\), \(Y_{1}\)–\(Y_{4}\) and \(H_{1}\)–\(H_{2}\) with \(\mathcal{L}=\{1,2,3\}\), \(Y_{5}\) and \(H_{3}\) with \(\mathcal{L}=\{1,2,3,4\}\)). The biclusters were discovered using multi-item assignments to guarantee their robustness to noise. The results show that all biclusters have highly enriched terms, and the enriched terms per bicluster were also found to be taxonomically related (see Additional file 1). These results further suggest that the found modules are characterized by cohesive putative biological functions. Table 10 characterizes some of the enriched pattern-based biclusters, reinforcing the role of BicPAMS in finding modules with varying shape, coherency and quality that are non-trivial yet biologically meaningful, as shown by the number of enriched terms after correction.

Table 8 Biological networks used to experimentally assess BicPAMS
Table 9 Biological role of a subset of BicPAMS’ modules with varying properties
Table 10 Relevance and exclusivity of BicPAMS’ solutions: properties of some of the found modules in DryGIN

Conclusions

BicPAMS consistently integrates the state-of-the-art contributions from pattern-based biclustering within graphical, scripting and application programming interfaces for the analysis of biological data. BicPAMS is essential for the user-assisted unsupervised exploration of biological data, as it overcomes the restrictions commonly placed by peer biclustering algorithms and provides the unprecedented possibility to parameterize the properties of the biclustering solutions. Specifically, BicPAMS makes it possible to customize the coherency (including coherency assumption, orientation and strength), quality (including tolerance to noise and missing values), structure and statistical significance (including minimum number of rows and/or columns) of the output biclusters. BicPAMS is applicable to dense or sparse, symbolic or real-valued data, and can optionally incorporate domain knowledge. To guarantee the usability of this parametrically rich environment, default parameterizations and simple guidelines (according to the properties of the input data and the desired output) are provided. BicPAMS further supports multiple data formats and output representations, and verifies the soundness of requests. Empirical evidence shows that BicPAMS is able to efficiently and effectively discover non-trivial yet coherent biclusters that are robust to noise and biologically significant.

Availability and requirements

Project name: BicPAMS

Project home page: http://www.bicpams.com

Operating system(s): All (cross–platform)

Programming language: Java

Other requirements: Java v7 or later

Licence: GNU General Public License

Endnotes

1 http://www.tik.ee.ethz.ch/sop/bicat/

2 http://cran.r-project.org/web/packages/biclust

3 http://acgt.cs.tau.ac.il/expander

4 http://vis.usal.es/bicoverlapper/

Abbreviations

BicNET: Biclustering networks (algorithm)

BiC2PAM: Biclustering with constraints using pattern mining (algorithm)

BicPAM: Biclustering using pattern mining (algorithm)

BicPAMS: Biclustering based on pattern mining software

BicSPAM: Biclustering using sequential pattern mining (algorithm)

BiModule: Biclustering modules (algorithm)

BiP: Biclustering plaid models (algorithm)

DeBi: Differentially expressed biclustering (algorithm)

DryGIN: Data repository of yeast genetic interactions

GI: Gene interactions

PPI: Protein-protein interactions

STRING: Search tool for the retrieval of interacting genes/proteins

TF: Transcription factor

References

  1. Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinforma. 2004; 1:24–45.
  2. Henriques R, Madeira SC. BiC2PAM: constraint-guided biclustering for biological data analysis with domain knowledge. Alg Mol Biol. 2016; 11:23.
  3. Freitas AV, Ayadi W, Elloumi M, Oliveira J, Hao J-K. Survey on biclustering of gene expression data. In: Biological Knowledge Discovery Handbook. John Wiley & Sons, Inc: 2013. p. 591–608. doi:10.1002/9781118617151.ch25.
  4. Eren K, Deveci M, Küçüktunç O, Çatalyürek ÜV. A comparative analysis of biclustering algorithms for gene expression data. Brief Bioinform. 2013; 14(3):279–92.
  5. Charrad M, Ahmed MB. Simultaneous clustering: a survey. In: Pattern Recognition and Machine Intelligence (PReMI), Moscow, Russia. Berlin, Heidelberg: Springer Berlin Heidelberg: 2011. p. 370–375. doi:10.1007/978-3-642-21786-9_60.
  6. Sim K, Gopalkrishnan V, Zimek A, Cong G. A survey on enhanced subspace clustering. DAMI. 2013; 26(2):332–97. http://dx.doi.org/10.1007/s10618-012-0258-x.
  7. Cheng Y, Church GM. Biclustering of expression data. In: IC on Intelligent Systems for Molecular Biology. AAAI Press: 2000. p. 93–103.
  8. Ben-Dor A, Chor B, Karp R, Yakhini Z. Discovering local structure in gene expression data: the order-preserving submatrix problem. J Comput Biol. 2003; 10(3-4):373–384.
  9. Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, Bijnens L, Göhlmann HWH, Shkedy Z, Clevert DA. FABIA: factor analysis for bicluster acquisition. Bioinformatics. 2010; 26(12):1520–7.
  10. Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinf. 2002; 18:136–44.
  11. Henriques R, Madeira S. BicPAM: Pattern-based biclustering for biomedical data analysis. Alg Mol Biol. 2014; 9:27.
  12. Henriques R, Madeira S. BicSPAM: Flexible biclustering using sequential patterns. BMC Bioinforma. 2014; 15:130.
  13. Henriques R, Antunes C, Madeira SC. A structured view on pattern mining-based biclustering. Pattern Recogn. 2015; 48(12):3941–3958. doi:10.1016/j.patcog.2015.06.018.
  14. Henriques R, Madeira SC. Biclustering with flexible plaid models to unravel interactions between biological processes. IEEE/ACM Trans Comput Biol Bioinforma. 2015; 12(4):738–752.
  15. Serin A, Vingron M. DeBi: Discovering differentially expressed biclusters using a frequent itemset approach. AMB. 2011; 6:1–12.
  16. Okada Y, Fujibuchi W, Horton P. A biclustering method for gene expression module discovery using closed itemset enumeration algorithm. IPSJ Trans Bioinf. 2007; 48(SIG5):39–48.
  17. Henriques R, Madeira SC. BicNET: efficient biclustering of biological networks to unravel non-trivial modules. In: Algorithms in Bioinformatics (WABI), Atlanta, GA, USA, Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg: 2015. p. 1–15. doi:10.1007/978-3-662-48221-6_1.
  18. Henriques R. Learning from high-dimensional data using local descriptive models. PhD thesis. Lisboa: Instituto Superior Tecnico, Universidade de Lisboa; 2016.
  19. Henriques R, Madeira SC. Pattern-based biclustering with constraints for gene expression data analysis. In: Progress in Artificial Intelligence: 17th Portuguese Conference on Artificial Intelligence (EPIA), Coimbra, Portugal. Proceedings. Cham: Springer International Publishing: 2015. p. 326–339. doi:10.1007/978-3-319-23485-4_34.
  20. Martinez R, Pasquier C, Pasquier N. GenMiner: mining informative association rules from genomic data. In: BIBM. Washington, DC: IEEE Computer Society: 2007. p. 15–22.
  21. Han J, Cheng H, Xin D, Yan X. Frequent pattern mining: current status and future directions. Data Min Knowl Discov. 2007; 15:55–86.
  22. Henriques R, Madeira SC. BicNET: Flexible module discovery in large-scale biological networks using biclustering. Alg Mol Biol. 2016; 11:1–30.
  23. Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci. 2000; 97(22):12079–84.
  24. Barkow S, Bleuler S, Prelić A, Zimmermann P, Zitzler E. BicAT: a biclustering analysis toolbox. Bioinformatics. 2006; 22(10):1282–3.
  25. Kaiser S, Leisch F. A Toolbox for Bicluster Analysis in R. Technical Report Number 028, Department of Statistics, University of Munich. 2008. http://www.stat.uni-muenchen.de.
  26. Santamaría R, Therón R, Quintales L. BicOverlapper 2.0: visual analysis for gene expression. Bioinformatics. 2014; 30(12):1785. doi:10.1093/bioinformatics/btu120.
  27. Mabroukeh NR, Ezeife CI. A taxonomy of sequential pattern mining algorithms. ACM Comput Surv. 2010; 43:3:1–3:41.
  28. Henriques R, Madeira SC, Antunes C. F2G: efficient discovery of full-patterns. In: ECML/PKDD IW on New Frontiers to Mine Complex Patterns. Prague: Springer-Verlag: 2013.
  29. Henriques R, Antunes C, Madeira SC. Methods for the efficient discovery of large item-indexable sequential patterns. In: New Frontiers in Mining Complex Patterns (Held in Conjunction with ECMLPKDD), Selected Papers. Cham: Springer International Publishing: 2014. p. 100–116. doi:10.1007/978-3-319-08407-7_7.
  30. Rosenwald A, DLBCL Team. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002; 346(25):1937–47.
  31. Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C. A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet. 2007; 39(10):1235–44.
  32. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000; 11(12):4241–57.
  33. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, Koplev S, Jenkins SL, Jagodnik KM, Lachmann A, McDermott MG, Monteiro CD, Gundersen GW, Ma’ayan A. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016; 44(W1):W90. doi:10.1093/nar/gkw377.
  34. Lee AP, Yang Y, Brenner S, Venkatesh B. TFCONES: a database of vertebrate transcription factor-encoding genes and their associated conserved noncoding elements. BMC Genomics. 2007; 8:441.
  35. Teixeira MC, Monteiro PT, Guerreiro JF, et al. The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae. Nucleic Acids Res. 2014; 42(Database issue):D161–D166. doi:10.1093/nar/gkt1015.
  36. Koh JLY, Ding H, Costanzo M, Baryshnikova A, Toufighi K, Bader GD, Myers CL, Andrews BJ, Boone C. DRYGIN: a database of quantitative genetic interaction networks in yeast. Nucleic Acids Res. 2010; 38(suppl 1):D502–7.
  37. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, Huerta-Cepas J, Simonovic M, Roth A, Santos A, Tsafou KP, Kuhn M, Bork P, Jensen LJ, von Mering C. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2014; 43(D1):D447. doi:10.1093/nar/gku1003.

Funding

This work was supported by Fundação para a Ciência e Tecnologia under the project Neuroclinomics2 PTDC/EEI-SII/1937/2014, Inesc-ID plurianual with reference UID/CEC/50021/2013, and the research grant SFRH/BD/75924/2011 to RH.

Availability of data and materials

BicPAMS software, data and tutorial can be accessed at http://www.bicpams.com.

Authors’ contributions

RH and SCM designed the proposed algorithms. RH implemented the algorithms. FLF and RH produced the software interfaces. RH drafted the manuscript. All authors revised and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable. The manuscript does not report new studies involving any animal or human data or tissue.

Author information

Correspondence to Rui Henriques or Sara C. Madeira.

Additional information

The original version has been revised

An erratum to this article is available at http://dx.doi.org/10.1186/s12859-017-1573-4.

Additional file

Additional file 1

Supplementary material – experimental assessments of BicPAMS on synthetic and real data accessible at http://www.bicpams.com/appendix. (PDF 685 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Keywords

  • Association Rule
  • Application Programming Interface
  • Pattern Mining
  • Coherency Strength
  • Merit Function