Table 3 Additional parameters of BicPAMS along the mapping, mining and closing steps

From: BicPAMS: software for biological data analysis with pattern-based biclustering

Mapping Options (includes P4 from Table 2)

P6 Normalization

Depending on the properties of the input data, the user can either normalize data per Row, Column or for the Overall data elements or ignore normalization by selecting the None option. Both outliers and missing values are handled separately.

 

P7 Discretization

Real-valued data needs to be discretized to apply pattern-based biclustering (see noise handling to understand how BicPAMS guarantees robustness to discretization drawbacks). The user can select the cut-off points of a Gaussian distribution (default) or fixed ranges of values (equal-sized intervals after excluding outliers). Note that fixed ranges can lead to an imbalanced distribution of items. The user can bypass this option for symbolic data by selecting the None option.
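The Gaussian option can be illustrated with a minimal sketch. It assumes that the cut-off points are the quantiles splitting a fitted normal distribution into equiprobable ranges, which is one common reading of this option and not necessarily what BicPAMS implements. The function names `gaussian_cutoffs` and `discretize` are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def gaussian_cutoffs(n_items):
    """Cut-off points splitting a standard normal into n_items equiprobable ranges.

    Assumption: equiprobable ranges; the source only says "cut-off points
    of a Gaussian distribution".
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf(i / n_items) for i in range(1, n_items)]


def discretize(values, n_items):
    """Map real values to items 0..n_items-1 using Gaussian cut-off points.

    Values are standardized first; a constant input (zero deviation)
    is not handled in this sketch.
    """
    mu, sigma = mean(values), pstdev(values)
    cutoffs = gaussian_cutoffs(n_items)
    return [bisect_right(cutoffs, (v - mu) / sigma) for v in values]


items = discretize([1.0, 2.0, 3.0, 4.0, 5.0], n_items=3)
```

With three items, the five values above are mapped to `[0, 0, 1, 2, 2]`.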

 

P8 Noise Handler

Multi-item assignments can be considered to handle deviations from the expected values within a bicluster caused by noise or discretization issues. By selecting this option, two items are assigned to elements with a value near a discretization boundary (a value \(c \in [a,b]\) such that \(\min(b-c,\, c-a)/(b-a) < 25\%\)). In this context, a data element becomes associated with a varying number of items, thus increasing the size of the data for analysis.
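The boundary rule can be sketched as follows. This is an illustration under stated assumptions rather than the BicPAMS implementation: the helper `assign_items` is hypothetical, `boundaries` lists all range limits including the outermost ones, and a value near the outer limit of the first or last range keeps a single item because no neighbouring item exists.

```python
from bisect import bisect_right


def assign_items(value, boundaries, tolerance=0.25):
    """Return the one or two items assigned to value.

    boundaries: sorted limits b0 < b1 < ... < bk; item i covers [b_i, b_{i+1}].
    A second item is added when the relative distance to the nearest
    boundary, min(b - c, c - a) / (b - a), is below tolerance.
    """
    n_items = len(boundaries) - 1
    # Index of the range [a, b] containing value, clamped to valid items.
    item = min(max(bisect_right(boundaries, value) - 1, 0), n_items - 1)
    a, b = boundaries[item], boundaries[item + 1]
    items = [item]
    if (value - a) / (b - a) < tolerance and item > 0:
        items.append(item - 1)  # close to the lower boundary
    elif (b - value) / (b - a) < tolerance and item < n_items - 1:
        items.append(item + 1)  # close to the upper boundary
    return sorted(items)


limits = [0.0, 1.0, 2.0, 3.0]
central = assign_items(0.5, limits)  # far from both boundaries
near_upper = assign_items(0.9, limits)  # within 25% of the boundary at 1.0
```

Here `central` holds the single item `[0]`, while `near_upper` holds both neighbouring items `[0, 1]`.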

 

P9 Symmetries

This option is dynamically selected if the input data is composed of positive and negative values (as it naturally affects the properties of the outputted biclusters). When using symmetric ranges, additive (multiplicative) models should be parameterized with an odd (even) number of items to guarantee consistent shifts (scales).

 

P10 Missings Handler

The user can specify what happens in the presence of missing values. Since BicPAMS is natively prepared to analyze sparse data, the Remove option (default) simply signals the algorithms to exclude missings from the searches. Alternatively, the Replace option uses WEKA’s imputation methods to fill missings (the error of imputations can be minimized by simultaneously activating a noise handler). We suggest the use of the Remove option for network data and other meaningfully sparse datasets, since BicPAMS is able to discover biclusters with missing interactions (see Quality parameter).

 

P11 Remove Uninformative Elements

This option supports the possibility to remove uninformative data elements. Zero Entries can be selected to remove the {0}-items, while the Differential option is used to focus on items with high absolute value (e.g. {-3,-2,2,3} when \(|\mathcal{L}|=6\)). Uninformative elements may correspond to: 1) weak interactions in networks, 2) unchanged expression, 3) healthy evaluations from clinical data, among others.
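A minimal sketch of both options on an already discretized matrix is given below. The function `remove_uninformative`, the use of `None` to mark removed elements, and the `min_abs=2` threshold (inferred from the {-3,-2,2,3} example) are assumptions for illustration, not BicPAMS parameters.

```python
def remove_uninformative(matrix, mode="zero", min_abs=2):
    """Replace uninformative items with None so that searches can skip them.

    mode="zero": drop the {0}-items.
    mode="differential": keep only items with |item| >= min_abs
    (min_abs is a hypothetical threshold).
    """
    def keep(item):
        if item is None:
            return False
        if mode == "zero":
            return item != 0
        if mode == "differential":
            return abs(item) >= min_abs
        raise ValueError(f"unknown mode: {mode}")

    return [[item if keep(item) else None for item in row] for row in matrix]


# Items from a symmetric alphabet of six items: {-3, -2, -1, 1, 2, 3}.
discretized = [[-3, -1, 2], [1, 3, -2]]
focused = remove_uninformative(discretized, mode="differential")
```

In this example, the items -1 and 1 are replaced by `None`, and the remaining items are left untouched.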

Mining Options (includes P3, P15 and P16 from Table 2)

P12 Stopping Criteria

The search algorithm runs until any of the available stopping criteria is met. The available options are: 1) minimum number of biclusters before merging (default), 2) minimum covered area by the discovered biclusters (as a percentage of the elements of the input data matrix or network), and 3) minimum support threshold (minimum number of rows per bicluster specified as a fraction of overall rows). The value associated with the selected option should be additionally specified. We suggest the definition of a high number of biclusters (>50) as the default option, in order to guarantee an adequate exploration of the input dataset.

 

P13 Minimum ♯Columns

The minimum number of columns per bicluster can be optionally inputted to promote efficiency and align the outputs with user expectations. A good rule of thumb for fixing this value is to use the square root of the number of columns (interactions per node) of the input matrix (network), e.g. 10 columns for a matrix with 100 columns.

 

P14 ♯Iterations

The default behavior of BicPAMS relies on two iterations. For data with large coherent regions that may prevent the discovery of smaller (yet relevant) regions, the number of iterations can be increased to guarantee their discovery. On every new iteration, 25% of the most selected data elements (from the biclusters discovered in the previous iteration) are removed to guarantee a focus on new regions. Three iterations already guarantee an adequate space exploration for hard data settings.
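The masking between iterations can be sketched as follows. The sketch assumes that "most selected" means the elements appearing in the largest number of biclusters and that the 25% is taken over the elements covered by at least one bicluster; the source does not state either point. The function `elements_to_mask` and the (rows, columns) representation of a bicluster are hypothetical.

```python
from collections import Counter
from itertools import product


def elements_to_mask(biclusters, fraction=0.25):
    """Pick the most frequently covered (row, column) elements for removal.

    biclusters: iterable of (rows, columns) pairs from the previous iteration.
    fraction: share of the covered elements to remove (assumed interpretation).
    """
    counts = Counter()
    for rows, cols in biclusters:
        counts.update(product(rows, cols))
    n_remove = int(len(counts) * fraction)
    # most_common sorts by descending count; ties keep first-seen order.
    return {element for element, _ in counts.most_common(n_remove)}


previous = [((0, 1), (0, 1)), ((1, 2), (1, 2))]
masked = elements_to_mask(previous)
```

The two biclusters above cover seven distinct elements, so one element is masked: `(1, 1)`, the only element that belongs to both.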

 

P17 Pattern Miner

The available pattern mining algorithms are dynamically provided based on the selected coherency assumption and pattern representation. Sequential pattern miners (SPM) are provided for order-preserving models: PrefixSpan and IndexSpan (an optimized algorithm able to explore gains in efficiency from the item-indexable properties) are made available for simple pattern representations, while BIDE+ is provided for closed pattern representations. Frequent itemset miners (FIM) are selected for the remaining coherency assumptions. AprioriTID, F2G (pattern-growth method for data with a large extent of coherent areas) and Eclat (vertical method for data with a high number of columns) are made available for simple pattern representations. CharmDiffsets, AprioriTID and CharmTID are made available for closed pattern representations, while CharmMFI with diffsets is provided for maximal pattern representations.

 

P18 Scalability

This option specifies whether data partitioning principles are applied to guarantee the scalability of the searches (suggested only for data larger than 100 Mb).

Closing Options (includes P5)

P19 Merging

Different merging procedures are made available (according to [29]): heuristic (default option) for an efficient quasi-exhaustive merging; and combinatorial and multi-support FIM alternatives for an exhaustive yet more costly postprocessing step.

 

P20 Filtering

Filtering is essential to guarantee compact solutions (applied after merging). A bicluster is filtered out if it does not have enough Dissimilar Elements, Dissimilar Rows or Dissimilar Columns against a larger bicluster. For example, with a filtering threshold of 20% dissimilar elements, biclusters sharing more than 80% of their elements with a larger bicluster are removed.
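The Dissimilar Elements criterion can be sketched as follows. This is a simplified illustration, not the BicPAMS procedure: the names `elements` and `filter_biclusters` are hypothetical, a bicluster is represented as a (rows, columns) pair, and each bicluster is compared only against the larger biclusters that have already been kept.

```python
from itertools import product


def elements(bicluster):
    """Set of (row, column) elements covered by a bicluster."""
    rows, cols = bicluster
    return set(product(rows, cols))


def filter_biclusters(biclusters, min_dissimilarity=0.20):
    """Drop biclusters sharing more than 1 - min_dissimilarity of their
    elements with a larger, already kept bicluster."""
    ordered = sorted(biclusters, key=lambda b: len(elements(b)), reverse=True)
    kept = []
    for candidate in ordered:
        covered = elements(candidate)
        max_shared = (1 - min_dissimilarity) * len(covered)
        redundant = any(
            len(covered & elements(larger)) > max_shared for larger in kept
        )
        if not redundant:
            kept.append(candidate)
    return kept


large = ((0, 1, 2, 3), (0, 1, 2))  # 12 elements
nested = ((0, 1), (0, 1, 2))       # 6 elements, all inside `large`
distinct = ((5, 6), (0, 1))        # 4 elements, no overlap
compact = filter_biclusters([nested, large, distinct])
```

In this example `nested` shares all of its elements with `large` and is removed, so `compact` contains `large` and `distinct`.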