Recent developments in data collection, storage, and processing technologies have allowed several fields of expertise to explore new ways of describing their activities through the data they produce. These data are then analyzed to extract meaningful information that supports decision-making or conclusions about the cases under study. In recent years, three-way data (observations \(\times\) features \(\times\) contexts), also referred to as three-dimensional, triadic, tensor, or cubic data, gained popularity due to their capacity to describe events that are related across several dimensions (three, in this case) and whose properties evolve along them. Applications can be found across several domains, such as biological, medical, social, financial, or geophysical data analysis [1].
In biology, three-way gene expression data, represented as gene-sample-time [2, 3, 4], are used to study how genes are expressed during the progression of a disease or treatment, and to unravel complex biological and physical processes that influence their evolution.
Three-way data can also capture behaviors and trends common to several individuals, representing how communities function and respond together. Notable examples can be found in medical data analysis, where temporal patient data (patient-feature-time) [5] are used to describe patient profiles and disease progression patterns during follow-up. Alternatively, in social data [6], individuals' preferences (individual-feature-time) and interactions (individual-individual-time) are collected to improve the content provided (recommendations), serving communities of users who share similar tastes.
In the financial domain, these data are used to study trading and stock investing with the aim of improving profits. Stock-ratio data [7] relate stock prices and their respective financial ratios during a time interval and can be used to identify groups of stocks whose performance on different indicators can influence their prices.
To foster knowledge discovery from three-way data, further advances are needed in triclustering [1], a subspace clustering technique proposed to enable the search for patterns that correlate subsets of observations showing similarities on a specific subset of features, and whose values repeat or evolve coherently across a third dimension, generally time or space.
Several triclustering algorithms have already been proposed [1], based on different approaches, able to find different types of patterns with distinct structures while tolerating noise and/or missing values. These approaches range from heuristic-based to exhaustive methods, balancing the complexity of the task (NP-hard [1]) against the number of patterns they can find. In this context, a key task during the development of a new algorithm is evaluating how good the found solutions are, where a triclustering solution is a set of triclusters. This evaluation is usually performed by testing the new method on available data and checking the quality of the found triclusters using a predefined set of metrics that evaluate different properties, such as homogeneity, size, or statistical significance.
Real datasets are generally used in these tests, but this procedure has significant limitations. Since there is no prior knowledge about the type of patterns expected to be found, there is no ground truth, that is, a known baseline solution that can be compared with the algorithm's output to assess its effectiveness beyond its efficiency. This means that each new algorithm can find different groups of triclusters, outputting a triclustering solution with distinct size and characteristics, which makes it difficult to establish objective and independent criteria to evaluate them.
Synthetic datasets are one way to overcome this limitation. These data can be customized and generated with specific properties defined by the author, together with a set of planted triclusters (the triclustering solution) with known structures, and then used to assess algorithms' performance against a ground truth.
Despite the inherent advantages of generating triclustering data, to our knowledge, no three-way data generator is available that allows the generation of triclustering solutions. Therefore, each author has to generate their own data. This task is critical and can be time consuming, and even assuming synthetic data are generated correctly, their properties can be, and usually are, biased towards the triclustering algorithm under evaluation. Furthermore, such data are then used to compare the new algorithm with the state-of-the-art, in turn proposed and evaluated using other data, compromising the validity of some comparisons and making them unfair, even if experimentally correct. Several authors [2, 3, 5, 6] generated specific data to test their algorithms.
In this context, we propose a new synthetic data generator, G-Tric, able to generate three-way datasets with planted triclusters (a triclustering solution), where the user can define several properties regarding both the dataset and the planted solution. Concerning dataset properties, the generator can create numeric or symbolic data, with default or custom alphabets, using backgrounds that follow predefined statistical distributions, and allowing a predefined amount of noise, missing values, and errors. Regarding solution properties (the planted triclusters), the user can define: (1) how many triclusters should be planted (solution size) and how their structure is defined using statistical distributions; (2) what type of patterns should be planted; (3) what the overlapping properties of the triclusters are; and (4) what amount of noise, missing values, and errors is allowed in each tricluster. We also ensure that users can generate datasets of varying sizes without worrying about scalability issues.
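To make these options concrete, the sketch below lists the kinds of properties a user might specify; all names and values are illustrative assumptions for exposition, not G-Tric's actual parameter names.

```python
# Hypothetical parameter set for a generated dataset; names are illustrative
# assumptions, not G-Tric's actual API.
config = {
    # dataset properties
    "dims": (100, 50, 20),             # observations x features x contexts
    "dtype": "numeric",                # or "symbolic", with a default or custom alphabet
    "background": ("uniform", 0, 10),  # distribution filling non-tricluster cells
    "noise": 0.05,                     # fraction of noisy background elements
    "missing": 0.02,                   # fraction of missing values
    "errors": 0.01,                    # fraction of erroneous values
    # solution properties (the planted triclusters)
    "n_triclusters": 5,                    # solution size
    "tricluster_dims": ("normal", 8, 2),   # per-dimension size distribution
    "pattern": "additive",                 # constant/additive/multiplicative/order-preserving
    "overlap": {"max_triclusters": 2, "plaid": "additive"},
    "tricluster_quality": {"noise": 0.03, "missing": 0.0, "errors": 0.0},
}
```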
Besides enabling the easy generation of customized three-way data with triclustering solutions, the proposed generator makes it possible to benchmark existing algorithms, studying their efficiency under certain conditions, or their effectiveness in finding different types of patterns, by allowing the creation of several datasets with an extensive range of characteristics. This provides the unprecedented opportunity to comprehensively assess the strengths and limitations of state-of-the-art and new triclustering algorithms, promoting advances in three-way data analysis. To this end, we provide an initial set of generated benchmark datasets, which can then be extended using the software.
The paper is organized as follows. The rest of this section defines the triclustering task and its associated properties, such as coherence, quality, and evaluation methods. Section "Related work" reviews the state-of-the-art concerning synthetic data generation. Section "Implementation" briefly discusses the software architecture, presents a possible representation for the problem, and describes and exemplifies each feature of the generator. Section "Results" presents the set of generated datasets, identifying the kinds of problems they describe and their associated properties. Finally, section "Conclusions" draws conclusions. As supplementary material (Additional file 1), we further provide a guide mapping each property the user can define when creating a new dataset to the corresponding option in the interface. Built around a toy example, this guide can also serve as a tutorial.
Triclustering task
Definitions
Definition 1
A three-way dataset, also termed three-dimensional dataset, D, is characterized by n observations \(X = \{x_1, \ldots , x_n\}\), m features \(Y = \{y_1, \ldots , y_m\}\) and p contexts \(Z = \{z_1, \ldots , z_p\}\). Analogous to 2D data-matrices, the data in 3D datasets can be real-valued or symbolic. Each element, \(a_{ijk}\), relates an observation \(x_i\), an attribute \(y_j\) and a context \(z_k\) [1].
This kind of dataset allows the representation of temporal data when contexts correspond to time points. If one value of a particular dimension is fixed, whether an observation, a feature, or a context, a 2D data matrix, called a slice, is obtained. Figure 1 shows an illustrative dataset D represented as a set of slices along the context dimension.
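As a minimal sketch of Definition 1, a three-way dataset can be represented as a NumPy array whose axes correspond to observations, features, and contexts (an assumed representation chosen for illustration; G-Tric is not tied to it):

```python
import numpy as np

n, m, p = 100, 50, 20                   # |X| observations, |Y| features, |Z| contexts
rng = np.random.default_rng(42)
D = rng.uniform(0, 10, size=(n, m, p))  # real-valued dataset: a_ijk = D[i, j, k]

slice_z1 = D[:, :, 0]   # fixing context z_1 yields an n x m slice
slice_y6 = D[:, 5, :]   # fixing feature y_6 yields an n x p slice
```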
Definition 2
Given the 3D dataset D, a tricluster, \(T = (I,J,K)\), is a subspace of the original dataset, where \(I \subseteq X\), \(J \subseteq Y\) and \(K \subseteq Z\) are subsets of observations, features and contexts, respectively [1].
Definition 3
In this context, the triclustering task consists in finding a set of triclusters \(\mathcal {T} = \{T_1, \ldots , T_q\}\), such that each \(T_i \in \mathcal {T}\) satisfies specific properties, such as homogeneity and statistical significance, as defined below [1]. Figure 2 shows the dataset with the set of triclusters resulting from a triclustering task (a triclustering solution) highlighted. Figure 3 shows, in detail, the four triclusters in this triclustering solution.
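Continuing the previous sketch, a tricluster (Definition 2) is simply the subarray of D indexed by the cross-product of I, J, and K, which NumPy's np.ix_ builds directly:

```python
I = [2, 7, 11]           # subset of observations
J = [0, 3]               # subset of features
K = [4, 5, 6]            # subset of contexts
T = D[np.ix_(I, J, K)]   # the |I| x |J| x |K| subspace (I, J, K) of D
```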
Coherence
The types of patterns that the triclustering task is able to find are defined by the type of coherence that the desired subspaces can express. These subspaces can be correlated according to the following assumptions (a generation sketch is given after the list):
(1) Constant: subspaces that exhibit constant [symbolic data (Eq. 1)] or approximately constant [real-valued data (Eq. 2)] values.
$$\begin{aligned} a_{ijk}&= c, \end{aligned}$$
(1)
$$\begin{aligned} a_{ijk}&= c + \eta _{ijk}, \end{aligned}$$
(2)
where \(a_{ijk}\) is the value of observation i, feature j and context k, c is the common value (seed) and \(\eta _{ijk}\) corresponds to noise. Figure 3a shows an example of a constant tricluster.
(2) Additive: where each element is correlated through the sum of a factor from each dimension,
$$\begin{aligned} a_{ijk} = c + \alpha _i + \beta _j + \gamma _k + \eta _{ijk}, \end{aligned}$$
(3)
where \(\alpha _i\), \(\beta _j\) and \(\gamma _k\) are contributions from observation \(x_i\), feature \(y_j\) and context \(z_k\). The assumption can be fully additive, when \(\alpha _i \ne 0\), \(\beta _j \ne 0\) and \(\gamma _k \ne 0\), or partially additive otherwise [1]. Figure 3c shows an example of a fully additive tricluster with \(c = 2\), \(\alpha _i = \gamma _k = \{1,2,3\}\), and \(\beta _j = \{1,2\}\).
(3) Multiplicative: when the tricluster can be obtained through a product of terms from each dimension,
$$\begin{aligned} a_{ijk} = c \cdot \alpha _i \cdot \beta _j \cdot \gamma _k + \eta _{ijk}, \end{aligned}$$
(4)
where \(\alpha _i\), \(\beta _j\) and \(\gamma _k\) are contributions from observation \(x_i\), attribute \(y_j\) and context \(z_k\). The assumption can also be fully or partially multiplicative. Figure 3b shows an example of a fully multiplicative tricluster, where \(c = 1\), \(\alpha _i = \beta _j = \{1,2,3\}\), and \(\gamma _k = \{1,2\}\).
(4) Order preserving: when, instead of looking at the actual tricluster's values, the goal is to find linear orderings across a specific dimension. For example, a matrix is order preserving across columns if there is a permutation of that dimension under which each row has an increasing sequence of values [8]. In a three-dimensional dataset, this ordering is expected to be maintained across different contexts. An order-preserving tricluster across columns is shown in Fig. 3d.
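The sketch below illustrates Eqs. (1)-(4) by building triclusters under the constant, additive, and multiplicative assumptions and checking the order-preserving property; it is a minimal illustration of the definitions above, not G-Tric's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_tricluster(c, alpha, beta, gamma, kind="additive", noise_sd=0.0):
    """Build an |I| x |J| x |K| tricluster following Eqs. (1)-(4)."""
    a = np.asarray(alpha, float)[:, None, None]   # observation contributions
    b = np.asarray(beta, float)[None, :, None]    # feature contributions
    g = np.asarray(gamma, float)[None, None, :]   # context contributions
    if kind == "constant":          # Eqs. (1)-(2): every element equals c (+ noise)
        base = np.full((len(alpha), len(beta), len(gamma)), float(c))
    elif kind == "additive":        # Eq. (3): c + alpha_i + beta_j + gamma_k
        base = c + a + b + g
    elif kind == "multiplicative":  # Eq. (4): c * alpha_i * beta_j * gamma_k
        base = c * a * b * g
    else:
        raise ValueError(kind)
    eta = rng.normal(0.0, noise_sd, base.shape) if noise_sd > 0 else 0.0
    return base + eta

def is_order_preserving(sub):
    """True if all (observation, context) rows share one column ordering."""
    ranks = sub.argsort(axis=1).argsort(axis=1)   # column ranks per row/context
    return bool((ranks == ranks[0:1, :, 0:1]).all())

# Fully multiplicative example of Fig. 3b: c = 1, alpha = beta = {1,2,3}, gamma = {1,2}
t = make_tricluster(1, [1, 2, 3], [1, 2, 3], [1, 2], kind="multiplicative")
```

Note that a noise-free additive or multiplicative tricluster with positive contributions and strictly increasing \(\beta _j\) is also order preserving across columns, since every row follows the ordering induced by \(\beta _j\).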
Plaid effects
Unlike other clustering methods, and similarly to biclustering [9, 10], in triclustering an element (observation, feature or context) can be part of more than one cluster (tricluster), or of none. When a set of elements belongs simultaneously to a group of triclusters, we are in the presence of overlapping triclusters. Overlapping regions between two or more triclusters can be described in accordance with a plaid assumption. Under this assumption, the value of an element that participates in multiple triclusters is a function of the expected value from each tricluster. As such, elements are defined using the cumulative effects of the overlapping triclusters, considering two assumptions: additive (5) or multiplicative (6), where each element is computed by the sum or the product of the individual contributions of each tricluster.
$$\begin{aligned} a_{ijk} = c + \sum _{t = 0}^{q} \theta _{ijkt} \rho _{it} \kappa _{jt} \tau _{kt}, \end{aligned}$$
(5)
$$\begin{aligned} a_{ijk} = c + \prod _{t = 0}^{q} \theta _{ijkt} \rho _{it} \kappa _{jt} \tau _{kt}, \end{aligned}$$
(6)
where \(\theta _{ijkt}\) defines the contribution from tricluster \(B_t = (I_t, J_t, K_t)\), and \(\rho _{it}\), \(\kappa _{jt}\) and \(\tau _{kt}\) are binary values indicating whether observation i, attribute j and context k are present in tricluster t. Figure 2 shows an overlapping example, with an additive plaid assumption, between the red and blue triclusters. Figure 3 reveals the individual contributions of each tricluster in the overlapped region.
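A minimal sketch of the additive plaid assumption (Eq. 5), assuming each tricluster's contribution is given as a dense block of \(\theta\) values and membership is encoded by the index sets themselves (so \(\rho\), \(\kappa\), and \(\tau\) are implicit):

```python
import numpy as np

def plaid_additive(shape, c, triclusters):
    """Compose overlapping triclusters additively, following Eq. (5).

    `triclusters` is a list of (I, J, K, theta) tuples, where theta holds the
    |I| x |J| x |K| contributions of that tricluster (an assumed encoding).
    """
    data = np.full(shape, float(c))     # background value c
    for I, J, K, theta in triclusters:
        data[np.ix_(I, J, K)] += theta  # rho/kappa/tau select exactly these cells
    return data                         # a multiplicative plaid (Eq. 6) would
                                        # combine contributions by product instead
```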
Quality
The triclustering task should be able to tolerate predefined levels of noise (deviations from the expected values) within the data under study, as well as missing data (values of a particular observation that are not available) or errors (values whose deviation is higher than that of noisy elements, caused, for example, by incorrect measurements). The higher the amount of these defective values, the lower the quality of the data, impacting the desired coherence of a given subspace.
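As an illustration of these quality notions (not G-Tric's exact procedure), noise, missing values, and errors can be injected into a dataset at user-defined rates; the deviation magnitudes below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def degrade(data, noise=0.05, missing=0.02, errors=0.01,
            noise_sd=0.5, error_sd=5.0):
    """Perturb a copy of `data` with small deviations (noise), NaNs (missing
    values), and large deviations (errors) on random fractions of elements."""
    out = data.astype(float).copy()
    total = out.size

    def pick(frac):
        idx = rng.choice(total, size=int(frac * total), replace=False)
        return np.unravel_index(idx, out.shape)

    out[pick(noise)] += rng.normal(0.0, noise_sd, int(noise * total))
    out[pick(errors)] += rng.normal(0.0, error_sd, int(errors * total))
    out[pick(missing)] = np.nan
    return out
```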
Evaluation
After obtaining a triclustering solution, it is necessary to evaluate its correctness and quality in order to compare solutions. This analysis should be carried out using metrics that capture different perspectives, serving two goals: single solution evaluation and comparison between solutions.
Single solution evaluation
The first perspective tries to assess how good a found solution is by evaluating its quality across different performance views. A solution can be evaluated either by extrinsic methods, where a known ground truth exists, or by intrinsic methods, where there is no prior information about the subspaces that may be present in the dataset under study.
Regarding extrinsic metrics, generally, a set of known triclusters is planted in the dataset, and each algorithm is supposed to find them. The solutions found are then compared with the planted subspaces; the higher the number of shared elements, the better the solution. This comparison can be made using metrics based on the F-measure [11] and Jaccard-based scores [6], for example, or the 3D Revised Match Score (RMS3) proposed by Henriques and Madeira [1].
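For instance, a simple Jaccard-based match between found and planted triclusters can be computed over the sets of elements they cover (a sketch of the general idea; RMS3 and the F-measure variants are defined differently):

```python
def cells(I, J, K):
    """Set of (i, j, k) elements covered by a tricluster."""
    return {(i, j, k) for i in I for j in J for k in K}

def jaccard(found, planted):
    a, b = cells(*found), cells(*planted)
    return len(a & b) / len(a | b)

def match_score(found_solution, planted_solution):
    """Average best match of each planted tricluster (one possible
    aggregation; scoring conventions vary across papers)."""
    return sum(max(jaccard(f, p) for f in found_solution)
               for p in planted_solution) / len(planted_solution)
```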
A solution's intrinsic quality can be computed by evaluating its coherence, calculating its degree of intra- and inter-plane homogeneity, that is, the correlation between values across two or more dimensions, using metrics such as MSR [12], Pearson [2] and Spearman correlations [13], which can be extended to a third dimension, or the Mutual Information Score [14].
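For example, the 2D mean squared residue can be extended to three dimensions by using plane and overall means; the formulation below is one possible extension (published definitions vary):

```python
import numpy as np

def msr_3d(sub):
    """One possible 3D mean squared residue: residues of an additive model
    fitted with per-observation, per-feature, and per-context means."""
    mi = sub.mean(axis=(1, 2), keepdims=True)  # observation means
    mj = sub.mean(axis=(0, 2), keepdims=True)  # feature means
    mk = sub.mean(axis=(0, 1), keepdims=True)  # context means
    m = sub.mean()                             # overall mean
    residue = sub - mi - mj - mk + 2 * m
    return float((residue ** 2).mean())
```

Under this definition, a perfectly additive tricluster (Eq. 3) without noise yields a residue of zero.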
In addition, metrics that consider the statistical significance of a tricluster can be beneficial to distinguish true triclusters from random patterns in the dataset. This would allow reducing the number of false positives (triclusters that appear by chance) and false negatives (real triclusters that are excluded from the solution) [1]. This evaluation can be done through methods that analyze deviations between the observed data and the underlying data distributions [15], thresholding methods [16], or size expectations collected from randomized data [17]. However, these tests are limited by the allowed homogeneity criteria and the assumptions placed on the underlying data.
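A simple randomization sketch of the last idea: estimate how often a subspace of the same dimensions is at least as homogeneous in shuffled data, reusing msr_3d from the previous sketch (an illustration only, not the tests of [15, 16, 17]):

```python
import numpy as np

rng = np.random.default_rng(2)

def empirical_p_value(data, I, J, K, score_fn, n_perm=1000):
    """Fraction of shuffled datasets in which the same subspace is at least
    as homogeneous (lower score) as the observed tricluster."""
    observed = score_fn(data[np.ix_(I, J, K)])
    flat = data.ravel().copy()
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flat)                  # random permutation of all elements
        permuted = flat.reshape(data.shape)
        if score_fn(permuted[np.ix_(I, J, K)]) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)       # add-one smoothing avoids zero p-values
```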
Another view on the quality of the solution is to evaluate how insightful and meaningful the triclusters found are to the problem at hand. How actionable are the triclustering results? For instance, in biological domains [3, 5, 13], functional annotations and gene ontologies are used to extract meaning from the sets of genes found and to understand why they are correlated.
Comparison between solutions
Comparing triclustering algorithms is also key to identifying their relative strengths and weaknesses. Relevant comparisons can be made by studying the inherent structure and coherence of the produced solutions, establishing a comparison framework and combining it with background or prior information about the planted solutions, or by assessing the solutions' actionability and relevance to the problem to which the algorithms are applied.
In this context, the extrinsic metrics discussed above for evaluating one triclustering solution can be extended to compare solutions produced by different algorithms. The intrinsic scores achieved by different algorithms can also be compared side by side. According to Horta and Campello [18], in order to perform a correct analysis of the quality of each biclustering solution, the evaluation metrics should respect eight properties intended to favor algorithms that can clearly distinguish different patterns, in particular, the ability to retrieve maximal subspaces, that is, subspaces not included in a larger subspace, without adding noise to increase the subspace's area. The authors studied 14 similarity measures and verified that the ones respecting most of the properties were the Clustering Error (CE) [19] and a soft-clustering measure, CSI. Since triclustering is an extension of biclustering, both metrics could be extended and used to evaluate triclustering solutions.
Authors proposing triclustering algorithms have also defined comparison metrics to test the performance of the developed algorithms against the existing state-of-the-art. Bhar et al. [20] used a set of metrics, such as TQI, Affirmation Score, Coverage and SBD, to perform the comparisons. Gutiérrez-Avilés et al. [21] proposed a new metric, TRIQ, to evaluate triclustering algorithms by combining correlation measures, graphic validation, and functional annotations, thereby combining the coherence expressed with the relevance of the found subspaces to the problem.
Triclustering algorithms should also be evaluated on their ability to tolerate noise. The Adjusted Rand Index [13] and the Jaccard Similarity Coefficient [6] have been considered to this end. Furthermore, given the complexity and often large size of the three-way data to be analysed, efficiency and scalability should also be of great concern. Thus, the ability to handle different dataset sizes, memory consumption, and execution time constitute additional important criteria when deciding which algorithm is better.
Related work
The generation of synthetic data is advantageous for testing specific properties of algorithms. Real data are sometimes difficult to obtain, and it is impossible to control the peculiarities they exhibit. In this context, unsupervised learning tasks, including pattern mining and (subspace) clustering, frequently resort to synthetic generators, as shown below, to produce custom data describing distinct problems, facilitating development, analysis, and comparisons.
In pattern mining, Omari and Conrad [22] proposed a generator to create datasets consisting of transactions that record purchases, with an associated timestamp, to study customer buying habits. Generators are also useful for context-aware recommender systems, as they allow the creation of sets of user actions annotated with context information that describes them [23]. Machine learning techniques, such as image recognition, also benefit from these tools, one example being a generator based on generative adversarial networks that produces image-based datasets with demographic parity [24]. Statistical learning methods can likewise be trained on synthetic data produced using the Bayes framework [25]. Other domains, such as software testing [26] or the development of anonymization techniques [27], also make use of synthetic data.
In clustering, some tools have also been proposed to facilitate algorithm evaluation. One of them, proposed by Pei and Zaiane [28], enables the creation of datasets with planted clusters based on the user's requirements, such as the number of points, the number of clusters, the size, shapes, and locations of clusters, and the density level of either cluster data or noise/outliers, with the goal of supporting clustering and outlier detection analysis. In biclustering, several benchmark contributions were made through the generation of synthetic data, some of them providing the tools needed to replicate them. In this context, BiMax [29] produces datasets with biclusters with varying degrees of noise and overlapping, but is not capable of producing dynamic structures with different sizes and coherencies. BiBench [30] was proposed to address some of the limitations of BiMax by allowing the generation of datasets with different sizes and different numbers of biclusters with shift and scale patterns (additive and multiplicative). However, BiBench assumes only constant values across columns, preventing the generation of observations with different, yet correlated, values; it does not consider order-preserving patterns, and its biclusters have fixed dimensions. BiGen was later proposed by Henriques [31] to correct these limitations, allowing the generation of biclusters of both symbolic and numeric natures, with varying sizes (each dimension is described through a statistical distribution), with more and different patterns, and with parameterizable overlapping and quality (noise and missing values) settings. This is the data generator used to evaluate all the algorithms made available in the BicPams software [32].
Concerning triclustering, several algorithms were developed and tested on synthetic data produced by their authors [1]. Unfortunately, none of them made the respective generators available. RSM and CubeMiner [33] were evaluated using IBM's Quest Data Generator, even though this generator is more suitable for pattern mining datasets, since it generates sets of transactions. Three-way data generators are scarce, and, to the best of our knowledge, there is no generator producing three-way data with planted triclusters that could be used to foster research on triclustering algorithms and three-way data analysis. For that reason, G-Tric used BiGen [31] as the basis for a generator able to create 3D datasets with planted triclusters that interact with each other and have varying properties. Besides the introduction of a third dimension, G-Tric also adds new features to BiGen, such as allowing the definition of a pattern for each dimension (in BiGen, a specific coherency was applied to one dimension while the other was filled with non-constant elements) and dividing the quality parameters into two sets, one for the dataset's background and the other for the planted subspaces (unlike BiGen's global definitions). In G-Tric, the background elements can follow a discrete distribution, where each symbol/element follows a user-defined probability. Moreover, the overlapping settings were extended so that the user can choose how many triclusters can overlap and the number of elements they can share.