Sparse data embedding and prediction by tropical matrix factorization

Background: Matrix factorization methods are linear models with limited capability to model complex relations. In our work, we use the tropical semiring to introduce non-linearity into matrix factorization models. We propose a method called Sparse Tropical Matrix Factorization (STMF) for the estimation of missing (unknown) values in sparse data. Results: We evaluate the efficiency of the STMF method on both synthetic data and biological data in the form of gene expression measurements downloaded from The Cancer Genome Atlas (TCGA) database. Tests on unique synthetic data showed that the STMF approximation achieves a higher correlation than non-negative matrix factorization (NMF), which is unable to recover the patterns effectively. On real data, STMF outperforms NMF on six out of nine gene expression datasets. While NMF assumes normally distributed data and tends toward the mean value, STMF can better fit extreme values and distributions. Conclusion: STMF is the first work that uses the tropical semiring on sparse data. We show that, in certain cases, semirings are useful because they capture a structure that is different from, and often simpler to interpret than, the one obtained with standard linear algebra.

For example, the observation that a person likes all Star Wars movies irrespective of actors or directors shows that using the max operator instead of the sum might produce a better model.
We develop a method for the prediction of missing (unknown) values, called Sparse Tropical Matrix Factorization (STMF). We evaluate its performance on the prediction of gene expression measurements from The Cancer Genome Atlas Research Network (TCGA) database. We show that the newly defined operations can discover patterns, which cannot be found with standard linear algebra.

Related work
Matrix factorization is a data embedding model that gives a more compact representation of the data and simultaneously finds a latent structure. The most popular example is non-negative matrix factorization (NMF) [5], where the factorization is restricted to matrices with non-negative entries. This non-negativity of the resulting factor matrices makes the results easier to interpret. One application of matrix factorization methods is recommender systems, where users and items are represented in a lower-dimensional latent space [6]. Binary matrix factorization (BMF) [7,8] is a variant rooted in NMF where the factor matrices are binary, while probabilistic non-negative matrix factorization (PMF) [9] models the data as a multinomial distribution. MMDNMF [10] is a supervised NMF method, which minimizes the maximum within-class distance and maximizes the minimum between-class distance. Integrative approaches, which use standard linear algebra to simultaneously factorize multiple data sources and improve predictive accuracy, are reviewed in [11]. Multi-omic and multi-view clustering methods like MultiNMF [12], Joint NMF [13], PVC [14], DFMF [15], MDNMF [16] and iONMF [17] can be used for data fusion of multiple data sources.
Lately, the subtropical semiring (max, ·) has gained interest in the field of machine learning, since it can discover interesting patterns [18,19]. By taking the logarithm of the subtropical semiring, we obtain the tropical semiring (max, +) [20]. Although these two semirings are isomorphic, factorization in the tropical semiring works differently than factorization in the subtropical semiring. The Cancer algorithm [20] works with continuous data, performing subtropical matrix factorization (SMF) on the input matrix. Its two main components are iteratively updating the rank-1 factors one-by-one and approximating the max-times reconstruction error with a low-degree polynomial.
The Latitude algorithm [4] combines NMF and SMF, where factors are interpreted as NMF features, SMF features, or mixtures of both. This approach gives good results in cases where the underlying data generation process is a mixture of the two processes. In [21], the authors used the subtropical semiring as part of a recommender system; their method can be viewed as a particular kind of neural network. Le Van et al. [22] presented a single generic framework based on the concept of semiring matrix factorization and applied it to two tasks: sparse rank matrix factorization and rank matrix tiling.
De Schutter and De Moor [23] presented a heuristic algorithm to compute a factorization of a matrix in the tropical semiring, which we denote as Tropical Matrix Factorization (TMF). They use it to determine the minimal system order of a discrete event system (DES). In recent decades, interest in this research area has grown, and DES are modeled as max-plus-linear (MPL) systems [24,25].
In contrast to TMF where approximation error is reduced gradually, convergence is not guaranteed in the Cancer algorithm. Both Cancer and TMF return factors that encode the most dominant components in the data. However, by their construction, they cannot be used for prediction tasks in different problem domains, such as predicting gene expression. In contrast with the NMF method and its variants, which require non-negative data, TMF can work with negative values.
Hook [26] reviewed algorithms and applications of linear regression over the max-plus semiring, while Gärtner and Jaggi [27] constructed a tropical analogue of support vector machines (SVM) which, unlike the classical SVM, can classify data into more than two classes. Zhang et al. [28] establish a connection between neural networks and tropical geometry. They show that the linear regions of feedforward neural networks with rectified linear unit activation correspond to vertices of polytopes associated with tropical rational functions; therefore, to understand specific neural networks, we need to understand the relevant tropical geometry. Since one goal in biology is not just to model the data but also to understand the underlying mechanisms, matrix factorization methods can give a more straightforward interpretation than neural networks. The GCN-MF framework [29] uses matrix factorization to combine embeddings and a Graph Convolutional Network (GCN) using standard linear algebra. Its authors state that matrix factorization only utilizes the linear relationships between entities: when the data is more complex, a matrix factorization method cannot identify non-linear relationships, whereas deep learning uses non-linear functions and layer compositions and can thus learn more complex data patterns. In our work, instead of introducing deep learning, we address the issue of non-linearity with the tropical semiring.
In our work, we answer the question posed by the authors of Cancer [20]: can tropical factorization be used, in addition to data analysis, in other data mining and machine learning tasks, e.g. matrix completion? We propose STMF, a method based on TMF that can simultaneously predict missing values, i.e. perform matrix completion. In Table 1 we compare the methods most relevant to our work. To the best of our knowledge, STMF is the only method that performs prediction tasks in the tropical semiring. STMF introduces non-linearity into matrix factorization models, which enables the discovery of the most dominant patterns and leads to a more straightforward visual interpretation compared to other methods for missing value prediction.

Tropical semiring and factorization
Now, we give some formal definitions regarding the tropical semiring. The (max, +) or tropical semiring $\mathbb{R}_{\max}$ is the set $\mathbb{R} \cup \{-\infty\}$, equipped with max as addition ($\oplus$) and + as multiplication ($\otimes$). For example, $2 \oplus 3 = 3$ and $1 \otimes 1 = 2$. The subtropical or (max, ×) semiring is defined on the same set $\mathbb{R} \cup \{-\infty\}$; its addition (max) coincides with that of the tropical semiring, but its multiplication is the standard multiplication (×). Throughout the paper, the symbols + and − refer to the standard operations of addition and subtraction. The tropical semiring can be used for optimal control [30], asymptotics [31], discrete event systems [32] or solving decision problems [33]. Another example is the well-known game Tetris, which can be linearized using the (max, +) semiring [34].
Let $\mathbb{R}_{\max}^{m \times n}$ denote the set of all $m \times n$ matrices over the tropical semiring. For $A \in \mathbb{R}_{\max}^{m \times n}$ we denote by $A_{ij}$ the entry in the i-th row and the j-th column of matrix A. The sum of matrices $A, B \in \mathbb{R}_{\max}^{m \times n}$ is $A \oplus B \in \mathbb{R}_{\max}^{m \times n}$ with entries $(A \oplus B)_{ij} = \max(A_{ij}, B_{ij})$, and the product of $A \in \mathbb{R}_{\max}^{m \times p}$ and $B \in \mathbb{R}_{\max}^{p \times n}$ is $A \otimes B \in \mathbb{R}_{\max}^{m \times n}$ with entries $(A \otimes B)_{ij} = \max_{k}(A_{ik} + B_{kj})$. Matrix factorization over the tropical semiring is a decomposition of the form $R = U \otimes V$, where $R \in \mathbb{R}_{\max}^{m \times n}$, $U \in \mathbb{R}_{\max}^{m \times r}$, $V \in \mathbb{R}_{\max}^{r \times n}$ and $r \in \mathbb{N}_0$. Since for small values of r such a decomposition may not exist, we state the tropical matrix factorization problem as: given a matrix R and a factorization rank r, find matrices U and V such that

$$\|R - U \otimes V\| \text{ is minimized.} \qquad (1)$$

To implement a tropical matrix factorization algorithm, we need to know how to solve tropical linear systems. Methods for solving linear systems over the tropical semiring differ substantially from methods that use standard linear algebra [34].
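As a small illustration of these operations (not part of the STMF implementation; the function names below are ours), the tropical matrix sum and product can be computed with NumPy as follows:

```python
import numpy as np

def trop_add(A, B):
    """Tropical sum: entry-wise maximum."""
    return np.maximum(A, B)

def trop_mul(A, B):
    """Tropical product: (A ⊗ B)_ij = max_k (A_ik + B_kj)."""
    m, p = A.shape
    p2, n = B.shape
    assert p == p2
    C = np.full((m, n), -np.inf)
    for k in range(p):
        # outer "sum" of column k of A and row k of B, folded in with max
        C = np.maximum(C, A[:, [k]] + B[[k], :])
    return C

A = np.array([[2.0, 3.0], [0.0, 1.0]])
B = np.array([[1.0, -1.0], [4.0, 2.0]])
print(trop_mul(A, B))  # [[7. 5.] [5. 3.]], e.g. entry (0,0) = max(2+1, 3+4) = 7
```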
We define the ordering in the tropical semiring as $z \preceq w$ if and only if $z \oplus w = w$ for $z, w \in \mathbb{R}_{\max}$; it induces an entry-wise ordering on vectors and matrices over the tropical semiring. For $A \in \mathbb{R}_{\max}^{m \times n}$ and $c = [c_k] \in \mathbb{R}_{\max}^{m}$ the system of linear inequalities $A \otimes x \preceq c$ always has solutions, and we call the solutions of $A \otimes x \preceq c$ the subsolutions of the linear system $A \otimes x = c$. The greatest subsolution $x = [x_1\ x_2\ \ldots\ x_n]^T$ of $A \otimes x = c$ can be computed by

$$x_i = \min_{k}(c_k - A_{ki}), \quad i = 1, 2, \ldots, n. \qquad (2)$$

We will use (2) in a column-wise form to solve matrix equations.
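A minimal sketch of Eq. (2), assuming all entries of A and c are finite (the function name is ours):

```python
import numpy as np

def greatest_subsolution(A, c):
    """Greatest x with (A ⊗ x)_i = max_j (A_ij + x_j) <= c_i for all i,
    given entry-wise by x_j = min_i (c_i - A_ij)."""
    # c[:, None] - A has shape (m, n); take the minimum over rows (index i)
    return np.min(c[:, None] - A, axis=0)

A = np.array([[0.0, 2.0], [1.0, 0.0]])
c = np.array([3.0, 2.0])
x = greatest_subsolution(A, c)   # x = [min(3-0, 2-1), min(3-2, 2-0)] = [1, 1]
# verify the subsolution property: max_j (A_ij + x_j) <= c_i
assert np.all(np.max(A + x[None, :], axis=1) <= c + 1e-12)
```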
TMF starts with an initial guess for the matrix U in (1), denoted by $U_0$, and then computes V as the greatest subsolution of $U_0 \otimes X = R$. The authors then iterate by selecting and adapting an entry of U or V and recomputing the other factor as the greatest subsolution of the corresponding linear system.

Our contribution
In our work, we implement and modify TMF so that it can be applied in data mining tasks. We propose a sparse version of TMF, which can work with missing values.
In Sparse Tropical Matrix Factorization (STMF), which is available at https://github.com/Ejmric/STMF, we update the factor matrices U and V based on a selected given entry of the input data matrix R to predict the missing values in R. In Algorithm 1 we present the pseudocode of STMF: for each given entry (i, j) of R, we first update U and V based on the element from the i-th row of the left factor U (ULF, see Algorithm 2). If this update does not improve the approximation of R, we then update U and V based on the element from the j-th column of the right factor V (URF, see Algorithm 3).
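The following sketch illustrates our reading of the main loop of Algorithm 1 for a single given entry (i, j); the helper names ulf, urf and b_norm are placeholders for Algorithms 2 and 3 and the norm used in the paper, and their exact signatures are assumptions:

```python
import numpy as np

def trop_prod(U, V):
    """(max, +) product, R_ij = max_k (U_ik + V_kj)."""
    return np.max(U[:, :, None] + V[None, :, :], axis=1)

def b_norm(R, R_hat, mask):
    """Approximation error measured over the known entries only (mask == True)."""
    return np.abs(R[mask] - R_hat[mask]).sum()

def stmf_step(R, U, V, mask, i, j, ulf, urf):
    """One pass of the main loop for a given entry (i, j): try the left-factor
    update (ULF) first and keep it only if the error drops; otherwise try the
    right-factor update (URF).  ulf/urf stand in for Algorithms 2 and 3."""
    err = b_norm(R, trop_prod(U, V), mask)
    U1, V1 = ulf(R, U, V, i, j)                  # adapt factors via row i of U
    if b_norm(R, trop_prod(U1, V1), mask) < err:
        return U1, V1
    U2, V2 = urf(R, U, V, i, j)                  # adapt factors via column j of V
    if b_norm(R, trop_prod(U2, V2), mask) < err:
        return U2, V2
    return U, V                                   # keep factors; error never grows
```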
Algorithms ULF and URF differ from the corresponding TMF versions in the way they solve linear systems. Since some of the entries of matrix A may not be given, we define the (min, +) matrix multiplication $\otimes^*$ as

$$(A \otimes^* B)_{ij} = \min_{k}(A_{ik} + B_{kj}),$$

for matrices $A \in \mathbb{R}_{\max}^{m \times p}$ and $B \in \mathbb{R}_{\max}^{p \times n}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$. The newly defined operator $\otimes^*$ can be seen as a generalization of Eq. (2), and it is used for solving linear systems while skipping unknown values. We assume that at least one element in each row/column is known.
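A minimal sketch of the masked (min, +) product, assuming missing entries are encoded as np.nan (the encoding and function name are ours):

```python
import numpy as np

def min_plus_masked(A, B):
    """(min, +) product that skips unknown entries:
    (A ⊗* B)_ij = min over k, taken only over pairs where both A_ik and
    B_kj are known (non-nan)."""
    m, p = A.shape
    _, n = B.shape
    C = np.full((m, n), np.inf)
    for k in range(p):
        S = A[:, [k]] + B[[k], :]   # nan wherever either factor is unknown
        C = np.fmin(C, S)           # np.fmin ignores nan values
    return C
```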
Among different matrix initialization strategies, we obtained the best performance with the Random Acol strategy [15,35]. Random Acol computes each column of the initialized matrix U as an element-wise average of a random subset of columns of the data matrix R. It is widely used for initialization in matrix factorization methods, since it gives better insight into the original data matrix than simple random initialization.
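A sketch of Random Acol, assuming missing entries are encoded as np.nan and using an illustrative subset size p (both are our assumptions, not taken from [15,35]):

```python
import numpy as np

def random_acol_init(R, rank, p=5, seed=None):
    """Random Acol initialization: each column of U is the element-wise average
    of p randomly chosen columns of R; missing values (nan) are ignored."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = np.empty((m, rank))
    for k in range(rank):
        cols = rng.choice(n, size=min(p, n), replace=False)
        U[:, k] = np.nanmean(R[:, cols], axis=1)
    return U
```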
In contrast to Cancer, where convergence is not guaranteed, the update rules of STMF, like those of TMF, gradually reduce the approximation error. This is ensured by the fact that the factor matrices U and V are updated only when the approximation error $\|R - U \otimes V\|_b$ decreases, so the error is monotonically non-increasing.

Distance correlation
It is well known that Pearson and Spearman correlation coefficients can misrepresent non-linear relationships [36]. Since real data often exhibit non-linearity, we choose the so-called distance correlation. Distance correlation [37] is a straightforward measure of association that uses the distances between observations as part of its calculation, and it is a better alternative for detecting a wide range of relationships between variables.
Let X and Y be matrices with n rows each, and let A and B be their matrices of pairwise Euclidean distances, each centered by subtracting the row and column means and adding the grand mean. After this centering, the distance covariance is defined as

$$\mathcal{V}_{XY}^2 = \frac{1}{n^2}\sum_{i,j=1}^{n} A_{ij} B_{ij},$$

and the distance correlation as

$$\mathrm{dcor}(X, Y) = \frac{\mathcal{V}_{XY}}{\sqrt{\mathcal{V}_X \mathcal{V}_Y}},$$

where $\mathcal{V}_X$ and $\mathcal{V}_Y$ denote the distance variances of X and Y. Distance correlation is zero if and only if the two corresponding variables are independent.
Distance correlation cannot be used to compare individual rows of X and Y, because it requires the entire matrices to be centered first. In such cases we use the Euclidean norm between the rows of the centered original data and the rows of the centered approximated data.
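For reference, the distance correlation above can be computed directly from the double-centered Euclidean distance matrices; the following sketch uses the standard biased sample estimator and is not taken from the STMF code:

```python
import numpy as np
from scipy.spatial.distance import cdist

def _double_center(D):
    """Subtract the row and column means of a distance matrix and add the grand mean."""
    return D - D.mean(axis=0, keepdims=True) - D.mean(axis=1, keepdims=True) + D.mean()

def distance_correlation(X, Y):
    """Distance correlation between two matrices with the same number of rows."""
    A = _double_center(cdist(X, X))
    B = _double_center(cdist(Y, Y))
    dcov2 = (A * B).mean()      # squared distance covariance
    dvar_x = (A * A).mean()     # squared distance variance of X
    dvar_y = (B * B).mean()     # squared distance variance of Y
    denom = np.sqrt(dvar_x * dvar_y)
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0
```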

Synthetic data
We create two types of synthetic datasets of rank 3: one smaller, of size 200 × 100, and five larger, of size 500 × 300. Each synthetic dataset is generated as the (max, +) product of two random non-negative matrices sampled from a uniform distribution over [0, 1).
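A sketch of this generation procedure (the function name and seeds are ours):

```python
import numpy as np

def synthetic_tropical_matrix(m, n, rank=3, seed=None):
    """Low-rank (max, +) matrix R = U ⊗ V with U, V drawn uniformly from [0, 1)."""
    rng = np.random.default_rng(seed)
    U = rng.random((m, rank))
    V = rng.random((rank, n))
    # (max, +) product: R_ij = max_k (U_ik + V_kj)
    return np.max(U[:, :, None] + V[None, :, :], axis=1)

R_small = synthetic_tropical_matrix(200, 100, rank=3, seed=0)
R_large = synthetic_tropical_matrix(500, 300, rank=3, seed=1)
```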

Real data
We download the preprocessed TCGA data [11] for nine cancer types, where for each cancer type three types of omic data are present: gene expression, methylation and miRNA data. We transpose the data sources so that in each data source the rows represent patients and the columns represent features. The first step of data preprocessing is to take the subset of patients for which all three data sources are available. In our experiments we use only the gene expression data. After filtering the patients, we substitute each gene expression value x in the original data with log2(x + 1). With this log-transformation, we make the gene expression data conform more closely to the normal distribution, and by adding one, we reduce the bias of zeros. We also perform polo clustering, an optimal linear leaf ordering [38], to re-order the rows and columns of the preprocessed data matrix. Polo clustering results in a more interpretable visualization of the factor matrices.
Next, we use feature agglomeration to merge similar genes by clustering [39]. We use Ward linkage and split the genes into 100 clusters (see Additional file 1: Figure S24), with the center of each cluster representing a meta-gene. With this approach, we minimize the influence of non-informative, low-variance genes on distance calculations and reduce the computational requirements.
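The preprocessing steps described above can be sketched as follows, assuming X is a patients × genes array of raw expression values (the variable names are ours):

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

# log-transform gene expression values (patients x genes)
X_log = np.log2(X + 1)

# merge similar genes into 100 meta-genes with Ward linkage;
# fit_transform pools each cluster into its mean, i.e. a meta-gene
agglo = FeatureAgglomeration(n_clusters=100, linkage="ward")
X_meta = agglo.fit_transform(X_log)   # shape: (n_patients, 100)
```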
For Breast Invasive Carcinoma (BIC), we do not perform feature agglomeration, since a list of 50 genes, called PAM50 [40], classifies breast cancers into one of five subtypes: LumA, LumB, Basal, Her2, and Normal [41,42]; this results in a BIC data matrix of size 541 × 50. These five subtypes differ significantly in the expression of only a few genes in the BIC data, which leads to a silhouette score [43] close to zero (see Additional file 1: Figure S22). The sizes of the final nine datasets are listed in Table 2.

Performance evaluation
Since STMF is the first method in the tropical semiring that performs matrix completion, we choose NMF as a baseline because it is the original matrix factorization method using standard linear algebra, while the other methods in Table 1 that use standard linear algebra are extensions of NMF. Additionally, we report running time (Table 3) and distance correlation (Table 4) for PMF, an advanced variant of NMF suitable for performing the prediction task on a single dataset. Experiments were performed for varying values of the factorization rank. Experiments on the smaller synthetic dataset were run 10 times with 500 iterations each, on the larger synthetic datasets 50 times with 500 iterations each, and on real data five times with 500 iterations. For both synthetic and real data, we mask 20% of the entries uniformly at random as missing and use them as a test set to evaluate the methods; we assume that in a typical dataset, data are missing uniformly at random. The remaining 80% constitute the training set. We choose the rank based on the approximation error on the training data, which is a fair choice for both STMF and NMF, since both methods then have the same number of parameters.
We compute the distance correlation and Euclidean norm between the original and approximated data matrix to evaluate the predictive performance.
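A sketch of the evaluation protocol, assuming missing test entries are encoded as np.nan (the encoding and function name are ours):

```python
import numpy as np

def train_test_mask(R, test_fraction=0.2, seed=None):
    """Hide a uniformly random fraction of the entries as a test set; the
    training matrix keeps them as nan."""
    rng = np.random.default_rng(seed)
    test = rng.random(R.shape) < test_fraction
    R_train = R.copy()
    R_train[test] = np.nan
    return R_train, test

R_train, test_mask = train_test_mask(R, test_fraction=0.2, seed=0)
# fit STMF / NMF / PMF on R_train, then score predictions on R[test_mask],
# e.g. with RMSE and the distance correlation defined above
```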

Results
First, we use synthetic data to verify the correctness of the STMF algorithm. We use the smaller dataset to show that STMF can discover the tropical structure, and the larger datasets to show how the order of rows and columns affects the result. We then apply STMF to real data to compare the performance and interpretability of the models obtained with STMF and NMF. STMF proves to be more robust to the choice of sampling strategy for missing values [44]. As the results show, STMF achieves a smaller prediction root-mean-square error (RMSE) and a higher distance correlation (Fig. 2). Experiments on synthetic data show that changing the execution order of URF and ULF in the computation of STMF does not affect the result of the algorithm. The result of STMF does, however, depend on the order of matrix entries. We apply different permutation techniques to order the columns and rows of the five large synthetic datasets (see Additional file 1: Figure S19). The top three strategies sort the columns by increasing value of their minimum, maximum, and mean, respectively (Fig. 3). Moreover, in four out of five datasets, the best results were obtained by ordering the columns in increasing order of their minimum value (see Additional file 1: Figure S20). This strategy represents the first step of the STMF method (Algorithm 1).

Figure 4 shows the results on the BIC matrix with PAM50 genes and 541 patients. Our findings confirm that STMF captures some extreme values. We see that STMF successfully recovers large values, while NMF has the largest error where gene expression values are high. Note that NMF tends towards the mean value. Half of the original data is close to zero (plotted in dark blue), which is why NMF cannot successfully predict high (yellow) values. Approximation matrices for all other datasets are available in Additional file 1, Section 2.

In Fig. 5a we see that NMF has a smaller approximation error than STMF, but a larger prediction error. Thus, NMF better fits the data, but STMF is less prone to overfitting, since its prediction error is smaller. On the other hand, in Fig. 5b, STMF has better distance correlation and silhouette score values; the silhouette score for PMF is shown in Additional file 1: Figure S23. STMF can therefore find clusters of patients with the same subtype better than NMF, which tends to describe every patient by the mean values in the data. Similar graphs for all other datasets are available in Additional file 1, Section 2.

Real data
The rank-three factor matrices of the BIC matrix (see Fig. 4) are illustrated in Fig. 6, where we denote STMF's factor matrices by U_STMF, V_STMF and NMF's factor matrices by U_NMF, V_NMF. We see that these factor matrices are substantially different. The basis factor V_STMF (first and third row) is visually more similar to the original matrix than any other factor alone. Factor V_STMF detects low and high values of gene expression, while factor V_NMF detects high values in the first two columns (second and third row, respectively) and low values in the remaining columns (first row). The coefficient factors U_STMF and U_NMF contribute to a good approximation of the original matrix. Factor matrices for all other datasets are available in Additional file 1, Section 2.
To see which part of the data is explained by which factorization rank, we define a latent matrix $R^{(i)}$ as the reconstruction obtained using only the i-th latent component of the approximation matrix, where $i \in \{1, \ldots, r\}$ and r is the factorization rank. $R^{(i)}$ can be seen as a projection in the direction of the i-th factor. For example, the matrix $R^{(1)}_{STMF}$ in Fig. 7a is the (max, +) product of the first column of U_STMF and the first row of V_STMF (Fig. 6), i.e. the sums of each pair of elements. In the case of NMF, the product uses standard multiplication instead of the sum (see Fig. 7b). Computing the element-wise maximum of all $R^{(i)}_{STMF}$ yields $R_{STMF}$, while the element-wise sum of all $R^{(i)}_{NMF}$ yields $R_{NMF}$ (a small sketch below illustrates this). In this way, we see which latent matrix $R^{(i)}$ explains which part of the data. On the BIC matrix, we see that both STMF and NMF describe most of the data with the first latent matrix (Fig. 7). Latent matrices for all other datasets are available in Additional file 1, Section 2.

In Table 4 we present the results of the experiments on the nine datasets listed in Table 2. We see that STMF outperforms NMF on six out of nine datasets, while NMF achieves better results on the LUSC, SKCM and SARC datasets. When we add PMF, the probabilistic version of NMF, to the comparison, it outperforms STMF and NMF on five datasets, but there is no statistically significant difference between the three methods according to the critical difference (CD) method of Demšar [45] (see Additional file 1: Figure S21).
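Returning to the latent matrices defined above, the following sketch shows how the rank-1 components combine under the two algebras (the variable and function names are ours):

```python
import numpy as np

def latent_matrices_stmf(U, V):
    """Rank-1 tropical components R^(i) = U[:, i] ⊗ V[i, :] (entry-wise sums);
    their element-wise maximum recovers the full STMF approximation."""
    return [U[:, [i]] + V[[i], :] for i in range(U.shape[1])]

def latent_matrices_nmf(U, V):
    """Rank-1 NMF components U[:, i] V[i, :]; their element-wise sum recovers R_NMF."""
    return [np.outer(U[:, i], V[i, :]) for i in range(U.shape[1])]

# R_STMF = np.maximum.reduce(latent_matrices_stmf(U_stmf, V_stmf))
# R_NMF  = np.sum(latent_matrices_nmf(U_nmf, V_nmf), axis=0)
```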
Solving linear systems using $\otimes^*$ emphasizes the low (blue) and high (yellow) gene expression values of patients in Fig. 4. In this way, STMF can, in some cases, recover the original data better, while NMF's results are diluted. However, a limitation of STMF compared to NMF is its computational efficiency (Table 3).
In Fig. 8 we plot the distribution of the Euclidean norm of the difference between the centered original data and the centered approximations of rank r (chosen in Table 4) for the different datasets. Even when we use another metric, the Euclidean norm computed for each row (patient) separately, the results still show that STMF outperforms NMF, as shown in Table 4 using the distance correlation.
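A sketch of this per-row comparison; the exact centering used in the paper is not restated here, so, as an assumption, each matrix is centered by its column means:

```python
import numpy as np

def per_row_pattern_error(R, R_hat):
    """Euclidean norm, per row (patient), between the column-centered original
    matrix and the column-centered approximation."""
    Rc = R - R.mean(axis=0, keepdims=True)
    Rc_hat = R_hat - R_hat.mean(axis=0, keepdims=True)
    return np.linalg.norm(Rc - Rc_hat, axis=1)
```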
In Fig. 9 we explore the difference between the original, approximated and centered BIC dataset. For every row (patient), the x-axis shows the Euclidean norm of the difference between the row in the original and the approximated matrix, which can be interpreted as the accuracy of the approximated values. The y-axis shows the Euclidean norm of the difference between the corresponding rows in the centered original and the centered approximated matrix, which can be interpreted as the error in reconstructing the original pattern. We see that for each row (patient) STMF's value on the y-axis is smaller than NMF's value, indicating that STMF better approximates the original patterns. The rows of STMF's approximation in Fig. 4 with predominantly low values have large approximation errors (y-axis) in Fig. 9, while approximating the original pattern comparably to NMF. We see that NMF has two clusters of patients with large values on the y-axis, denoted by red stars and red circles. These are the rows (patients) where NMF's predicted pattern differs from the original pattern significantly more than STMF's prediction does, while at the same time NMF achieves a smaller approximation error than STMF. In Fig. 10 we plot the patients corresponding to these two clusters and compare the approximations with the original data. We can see that NMF cannot model the high (yellow) values in the first few columns, while its low (blue) values are larger (light blue) than in the original matrix, around half of which is plotted in dark blue. A comparison with the Pearson and Spearman correlations is shown in Additional file 1: Figure S25, where STMF achieves a higher Pearson correlation but a lower Spearman correlation. The clusters of patients are also visible in both figures using these two correlations, confirming the results in Fig. 9. Plots for all other datasets are available in Additional file 1, Section 2.

Conclusion
Standard linear algebra is used in the majority of data mining and machine learning tasks. Utilizing different types of semirings has the potential to reveal previously undiscovered patterns. The motivation for using the tropical semiring in matrix factorization methods is that the resulting factors capture the most dominant features that are specific and essential to each factor; in that way, the factors are easier to interpret. We propose a method called STMF, which can work with missing values; we implement it by extending the TMF algorithm to handle unknown values. Results show that NMF cannot successfully recover the patterns in specific synthetic data, while the STMF approximation achieves a higher correlation value. Results on TCGA data show that STMF outperforms NMF in the prediction task. Also, the results obtained by NMF tend toward the mean value, while the approximations obtained by STMF better express extreme values. Our approach identifies strong patterns that aid the visual interpretation of the results; in this way, we can discover sharp, high-variance components in the data. To the best of our knowledge, STMF is the first work using the tropical semiring on sparse (biomedical) data.
A limitation of our STMF method is its inability to embed and predict truly novel examples (i.e., new incomplete rows or columns in the data matrix). Developing an approach similar to the one we have shown for NMF [17] to address this important task deserves further research.