 Research
 Open Access
 Published:
MHSNMF: multiview hessian regularization based symmetric nonnegative matrix factorization for microbiome data analysis
BMC Bioinformatics volume 21, Article number: 234 (2020)
Abstract
Background
With the rapid development of highthroughput technique, multiple heterogeneous omics data have been accumulated vastly (e.g., genomics, proteomics and metabolomics data). Integrating information from multiple sources or views is challenging to obtain a profound insight into the complicated relations among microorganisms, nutrients and host environment. In this paper we propose a multiview Hessian regularization based symmetric nonnegative matrix factorization algorithm (MHSNMF) for clustering heterogeneous microbiome data. Compared with many existing approaches, the advantages of MHSNMF lie in: (1) MHSNMF combines multiple Hessian regularization to leverage the highorder information from the same cohort of instances with multiple representations; (2) MHSNMF utilities the advantages of SNMF and naturally handles the complex relationship among microbiome samples; (3) uses the consensus matrix obtained by MHSNMF, we also design a novel approach to predict the classification of new microbiome samples.
Results
We conduct extensive experiments on two realword datasets (Threesource dataset and Human Microbiome Plan dataset), the experimental results show that the proposed MHSNMF algorithm outperforms other baseline and stateoftheart methods. Compared with other methods, MHSNMF achieves the best performance (accuracy: 95.28%, normalized mutual information: 91.79%) on microbiome data. It suggests the potential application of MHSNMF in microbiome data analysis.
Conclusions
Results show that the proposed MHSNMF algorithm can effectively combine the phylogenetic, transporter, and metabolic profiles into a unified paradigm to analyze the relationships among different microbiome samples. Furthermore, the proposed prediction method based on MHSNMF has been shown to be effective in judging the types of new microbiome samples.
Background
With the rapid development of biotechnique, such as highthrough sequencing technique, plenty of multiple omics data (e.g. metagenomics, metabolomics and so on) have generated in microbiome study. These resources pave the way for researchers to explore and understand the structure and functions of microbiome community. In addition, it helps to reveal the relationships between microbiota and host environment, microbes and diseases. In order to further dissect the structure and functions of microbiome, many microbiome projects including Human Microbiome Plan (HMP) [1], Integrative Human Microbiome Plan (iHMP) [2], and Metagenomics of the Human Intestinal Gut (MetaHIT) [3] have been launched and accumulated large amounts of microbiome data. By some analysis tools, these data can be computationally represented as the phylogenetic profile or functional composition profile of microbiome [4]. Although some approaches have been designed to analyze the difference and connections among different microbiome samples, they only considered one kind of biological profile data. Thus, the conclusions obtained from these approaches may be onesided and incorrect. In order to draw a reasonable conclusion, integrating multiple omics data from different biological scenarios to jointly analyze latent patterns becomes a feasible way.
However, to the best of our knowledge, there have been few approaches to simultaneously combine multiple biological profiles into a paradigm to study the underlying microbiome structure shared by different representations. Hence, it is urgent and necessary to design novel data integration methods or tools to explore the complicated relationship among microorganisms.
As a kind of clustering method, nonnegative matrix factorization (NMF) has drawn great public attention, recently. In text mining, image processing, bioinformatics fields and so on, many new data integration methods based on NMF have emerged. Greene et.al proposed a joint nonnegative matrix factorization algorithm by concatenating the features of all the views to form a new representation, and then it was factorized into two low rank matrices, one of which was used to cluster indicator [5]. Liu et.al proposed the MultiNMF algorithm by searching a common consensus solution across different views [6]. Zhang et.al developed a novel NMF framework (CSMF) to reveal the common and specific patterns obtained from multiple interrelated biological scenarios [7]. All these methods could obtain good performance when data distribution satisfies certain conditions, e.g. linear relationship. However, the realworld data often owns complicated structure and nonlinear relation. For example, the interactions among microbes are easily influenced by the food intake, host environment or other species, particularly for the intestinal flora, and thus the relationship among microbes may be delicate and complicated. Traditional approaches based on NMF are not sufficient for revealing the latent relations hidden in multiple biological data profiles.
In order to improve the clustering performance, Laplacian graph which makes use of the geometric information of the original data was introduced into the NMF framework. Cai et.al proposed a graph regularization based nonnegative matrix factorization approach (GNMF) for data clustering and obtained good performance [8]. Jiang et.al proposed a new joint nonnegative matrix factorization algorithm with robust Laplacian graph (LJNMF) to cluster microbiome data [4] and achieved better clustering performance. Chen et.al proposed a novel comodule mining framework based on Trifactor nonnegative matrix factorization (NetNMF) to identify heterogeneous biological modules [9] and easily extended to Laplacian case with prior knowledge. Even though Laplacian can boost the performance, Kim et.al pointed that Laplacian regularization possibly leaded poor extrapolating power because Laplacian regularization always biased the solution towards a constant function [10]. Compared to Laplacian regularization, Hessian can not only effectively exploit the local geometry information of original data, but also extrapolate beyond data points [11].
To solve the above problems, in this paper we propose a novel multiview Hessian regularization based symmetric nonnegative matrix factorization algorithm (MHSNMF) to integrate multiple biological profiles into an unified framework to analyze the potential clustering patterns across all view. MHSNMF utilizes the local geometrical information of different views and automatically assigns corresponding weights for each view in each iteration process. We conduct large amounts of experiments on two real datasets and the experimental results show that the proposed MHSNMF algorithm outperforms other integrating approaches, suggesting its underlying application in microbiome data analysis.
The contributions of this study lie in: (1) an effective integration method to explore the difference among distinct microbiome samples with multiple views has been proposed. The experimental results show that it outperforms the stateofart algorithms in terms of AC and NMI; (2) highorder information of the original data is exploited to reveal the underlying clustering patterns across different views; (3) a novel approach based on the consensus matrix obtained from MHSNMF is proposed to predict the classification of new microbiome samples. The extended experiments demonstrate the effectiveness of the proposed method. Figure 1 demonstrates the flowchart of MHSNMF algorithm.
The rest of this paper is organized as below: in next section, a brief view of SNMF and multiview clustering is provided, and then multiview Hessian regularization based SNMF algorithm is also proposed. Next extensive experiments results and the comparisons with other methods are presented. At last, the conclusion and next research plans are given.
Methods
Symmetric nonnegative matrix factorization
Nonnegative matrix factorization (NMF), which has been widely used in many fields including text clustering, image recognition, bioinformatics, has drawn great attention. In NMF, the data matrix V is factorized the production of two low rank matrices W and H. Each column V_{.i} in original matrix V can be approximated as the linear combination of basis vectors W_{.j}, the coefficients are the corresponding elements of H_{.i}. Hence, when data owns linear structure, NMF can achieve better performance. However, the real world data distribution is usually complex and hard to dissect the relations among different objects, and especially for the microbial data. Symmetric nonnegative matrix factorization (SNMF) views the data samples as vertices in graph and minimizes certain objective function of graph cuts [12]. SNMF can adopt multiple metrics to character the similarities between two nodes, including inner kernel, Gaussian kernel, correlation coefficient methods and so on.
The objective function of SNMF is defined as:
where ‖∗‖_{F} is the Frobenius norm of matrix, \( A\in {R}_{+}^{n\times n} \) is the similarity matrix, and \( H\in {R}_{+}^{n\times k} \) is the factorized lowrank matrix, k is the degree of factorization. A_{ij} denotes the similarity between i ‐ th and j ‐ th node.
Eq. 1 iteratively updates H using the following rule [11, 13]:
Once the similarity matrix A was established, the low rank solution H would be easily obtained. For text data, the cosine function is used to compute the similarity between two documents. For microbiome data, the Gaussian kernel function can be used to measure the similarity between different microbiome samples:
where V_{i} denotes the i ‐ th data point in original matrix. σ_{i} is the Euclidean distance between V_{i} and its k ‐ th neighbor. We set k to be 7 as suggested in [14]. Note that the selfsimilarity of the nodes is eliminated in all cases.
Next, we construct the sparse graph for microbiome samplesample similarity network; the edge weight can be redefined as
where N(i) is the neighborhood of node i. In our study, we set the number of the neighbors to be 12 empirically.
Furthermore, the obtained weight matrix W_{ij} is normalized to
where D is the diagonal matrix and \( {D}_{ii}={\sum}_{j=1}^n{W}_{ij} \).
Multiview symmetric nonnegative matrix factorization
Given multiview dataset \( \left\{{V}^1,{V}^2,\cdots, {V}^{n_v}\right\} \),the corresponding similarity matrices are represented as \( \left\{{A}^1,{A}^2,\cdots, {A}^{n_v}\right\} \), where n_{v} denotes the number of views. Inspired by the study [6], Multiview symmetric nonnegative matrix factorization (Multiview SNMF) can be formulated as
where H^{∗} denotes the consensus matrix toward that the solutions of all views. \( {Q}^v= Diag\left(1/\sum \limits_{i=1}^m{H}_{i,1}^v,1/\sum \limits_{i=1}^m{H}_{i,2}^v,\cdots, 1/\sum \limits_{i=1}^m{H}_{i,k}^v\right) \) is an auxiliary matrix which guarantees that the clustering solution of each view is comparable. γ^{v} is the weight of the v ‐ th view and simultaneously keeps a balance between the SNMF reconstruction error and regularization term (the second term of Eq. 6). In the study, we set γ^{v} s to be equal for all views considering the convenience of computation.
Multiview SNMF follows the basic hypothesis that there exists an underlying consensus structure in all views. This is reasonable because each view describes partial truth of the unknown; however, these limited cognitions are essential components toward objective truth.
Hessian regularization
Given a smooth manifold M ⊂ R^{n}, at each point p the tangent space is defined as T_{p}(M) ⊂ R^{n} ⋅ N_{p} denotes the neighborhood of p. For each point p^{'} ∈ N_{p}, there is a unique closest point v^{'} ∈ T_{p}(M) such that the implied mapping p^{'} → v^{'} is smooth. In order to obtain the Hessian of function f : M ↦ R, an orthogonal coordinate system of T_{p}(M) is needed to define. This can be achieved by the d largest eigenvectors of N_{p} corresponding to the orthogonal basis of T_{p}(M). Hence, in the tangent space f(p) can be represented as g(x) : T_{p}(M) ↦ R. In this way, the Hessian of f at point p can be defined as
The previous studies point that the Frobenius form of Hessian matrix is invariant to coordinate changes [10]. Hence, the total Hessian is obtained to measure the average curviness of f along the manifold M as follows
Hessian regularization (HR) steers the solution varying smoothly along the manifold. Compared with Laplacian regularization, Hessian fits the data perfectly and owns stronger extrapolating capability to unseen data [15]. Next, we summarize the computation process of Hessian as follows.

(1)
For each sample v_{i}, finding its k nearest neighbors N_{i} and then construct the neighborhood matrix V^{i} with rows consisting of the centralized samples v_{j} = v_{j} − v_{i} for each v_{j} ∈ N_{i}.

(2)
Conducting SVD on V^{i} so that V^{i} = UDS^{T}. The first d columns of U gives the tangent coordinates of points in N_{i}.

(3)
Constructing the matrix M^{i} = [1, U_{.1}, U_{.2}, ⋯, U_{.d}, U_{11}, U_{11}, ⋯, U_{dd}], where 1 denotes one vector, followed by the first d columns of U and d × (d + 1)/2 columns consisting of various cross products and squares of these d columns. Then, performing the GramSchmidt process on M^{i} and yielding \( \hat{M^i} \). The last d × (d + 1)/2 columns of \( \hat{M^i} \) are extracted to form B^{i} ⋅ B^{i} is the hessian matrix of the tangent space formed by the k nearest neighbors of the ith sample.

(4)
Thus, a symmetric Hessian matrix can be obtained by summing up all point’s Hessian energy:
where l is the data point on the manifold, i denotes the i ‐ th data point in N_{l}.
In contrast to Laplacian regularization (LR), HR can make full use of the intrinsic geometric information of the data manifold. It can not only well fit the training data, but also predict the unseen data points [16]. In this paper, we use multiple Hessian matrices obtained from different data presentations to well maintain the structural consistence in process of dimension reduction, just like with Laplacian.
Multiview hessian regularization based symmetric nonnegative matrix factorization
According to the analyses above, we propose a novel data integrating method, called Multiview Hessian based symmetric nonnegative matrix factorization (MHSNMF). MHSNMF combines the advantages of SNMF and Hessian regularization, and can take full advantage of the local geometric structure information of the original data. Hence, MHSNMF theoretically owns more preferable performance.
The objective function of MHSNMF can be formulated as
where B^{v} denotes the Hessian matrix derived from the v ‐ th view, tr(·) denotes the trace of matrix. α^{v} is the coefficient of B^{v}, β is the regularization parameter and is used to tune the smooth of solution.
The optimal problem of MHSNMF contains three steps: (1) updating H^{v} given fixed consensus matrix H^{∗} and graph coefficient α^{v}; (2) updating H^{∗} given fixed H^{v} and graph coefficient α^{v}; (3) finding the optimal graph coefficients α^{v} s given fixed H^{v} and H^{∗}. The optimizations of these three subproblems are presented below.

(1)
Fixing H^{∗} and α^{v}, computing H^{v}
Given fixed H^{∗} and α^{v}, only considering terms that are relevant to H^{v} at this step, the Eq. 10 can be reduced to
To minimize Eq. 11, we can solve the optimal problem with Lagrange method [6, 17]. Introducing the Lagrange multiplier ψ, Lagrange function can be written as
For simplicity A, H, Q is substituted for A^{v}, H^{v}, Q^{v}, respectively.
Taking the partial derivative of L with respect to H gives
Using KKT condition, we can obtain the following updating rule

(2)
Fixing H^{v} and α^{v}, updating H^{∗}
This subproblem is similar to (1), the objective function can be rewritten as
where \( B=\sum \limits_{v=1}^{n_v}{\alpha}^v{B}^v \), \( {\alpha}^v>0,\sum \limits_v{\alpha}^v=1 \).
The rule of iteration for H^{∗} is given
where B = B^{+} − B^{−}. It shouldn’t be difficult to see that H^{∗} remains nonnegative after each iteration.

(3)
Fixing H^{v} and H^{∗}, learning α^{v}
This subproblem can be formulated as
When tr((H^{∗})^{T}B^{i}H^{∗}) the minimum one among distinct views, the solution w.r.t α is α^{i} = 1 and α^{j} = 0 corresponding to other views. It means that only one view takes effect and the complement information carried by multiple views cannot be utilized effectively.
In this study, we employ a trick [18, 19] to avoid this problem. We substitute (α^{v})^{r} for α^{v}, r > 1. In this case, each graph has a particular contribution to the consensus matrix. The Eq. 17 can be rewritten as
To solve Eq. 18, we introduce Lagrange multiplier λ and consider the constraint \( \sum \limits_v{\alpha}^v=1 \) and then obtain the Lagrange function
Taking the partial derivative of L(α, λ) with respect to α^{v} and λ set them to zero
Finally, a closed solution of α^{v} can be given
From Eq. 21 we can see that α^{v} is always nonnegative because Hessian matrix B^{v} is SDP.
Table 1 gives the pseudocode of the proposed MHSNMF.
Datasets and evaluation metrics
Datasets
In this paper, two public multiview datasets are used to verify the performance of the proposed MHSNMF algorithm.

(1)
Threesource text story dataset. The dataset was collected from three online news sources: BBC, Reuters and the Guardian. One hundred sixtynine stories were reported in all three sources. Each of them was manually classified into one of the six topical labels: business, entertainment, politics, sport, health and technology. These roughly correspond to the principal section headings used across these three sources. To facilitate comparisons using the AC and NMI metrics, only the main topic for each story was considered. More details can be found in [20]. Table 2 describes the detailed statistical information.

(2)
Human microbiome dataset (HMP). This dataset includes three compositional profiles: phylogenetic, metabolic and transporter profiles from HMP site. It consists of 637 samples drawn from seven body sites including one vagina (posterior fornix), one gut (stool), one nasal (anterior nares), one skin (retroauricular crease), and three oral sties (supragingvial plaque, tongue dorsum and buccal mucosa). The phylogenetic profile which contains the microorganism relative abundances was estimated by software MetaPhlAn at species level (710 × 637). For functional profile, the transporter profile (4941 × 637) and the metabolic profile (295 × 637) are investigated by filtering out those with low variances (see Table 3 for the detailed statistical summary) [4]. All the data can be available from HMP site: http://hmpdacc.org/ [21].
Evaluation metrics
In the following experiments, two frequently used metrics are applied to evaluate the clustering performance of MHSNMF, i.e. accuracy (AC) and normalized mutual information (NMI). Generally speaking, higher AC or NMI indicates the better clustering performance. More details were described in [22].
Results and discussion
Experimental results
In this section, we conduct extensive experiments to elucidate the effectiveness of the proposed MHSNMF approach. Some baseline algorithms below are compared:

Single view (BSSV and WSSV). Running standard SNMF on each view, BSSV is the most informative view that has the best clustering quality; WSSV refers to the worst view.

MultiNMF. Iteratively fusing the coefficient matrices learnt from different views to form a consensus clustering solution. In the fusion process, coefficient matrix from each view is normalized to guarantee that they are comparable and meaning [6].

Cotraining spectral clustering (Cotraining SC). Performing multiview spectral clustering with cotraining paradigm [23] to update iteratively the graph structure of one view by using the discriminative eigenvectors obtained from the other view.

Similarity network fusion (SNF). Constructing similarity network for each view and then iteratively fusing these networks so that global and local information from different views can be shared and interchanged. More details can be obtained from [24].

LJNMF. Fixing a common coefficient matrix across different views and then performing joint nonnegative matrix factorization as shown in [4].

CSMF. Extracting common and specific patterns from multiple data generated under interrelated biological scenarios via nonnegative matrix factorization [7].

NetNMF. Utilizing Trifactor NMF to construct two layer modular networks. For each biological network, the samples were reordered according to the obtained features modules. At last, the optimal clustering performance is recorded [9].

MHSNMF. This is the proposed algorithm. In the experiments, we used NNDSVD method to enhance the initiation stage of MHSNMF [25]. The parameter selection will be discussed later.
Table 4 shows the clustering results of different algorithms on these two datasets. From this table, we can see that MHSNMF outperforms the baseline and the stateofart algorithms in terms of AC and NMI.
As we can see, on these two realistic dataset MHSNMF achieves much improvement in terms of AC and NMI compared with other algorithms. One of the possible reasons is that MHSNMF takes advantage of the local geometry information reserved in the data to satisfy the manifold consistency assumption well. The proposed MHSNMF algorithm can effectively find the latent consensus clustering solution across different views.
Parameter tuning
There are two types of parameters in the proposed MSNMF algorithm: γ^{v} and β. γ^{v} is the regularization parameter for the v ‐ th view. On one hand, γ^{v} reflects each view’s relative importance among all views, on the other hand, it also indicates the strength which we want to impose on the regularization constraint. Considering the convenience of computation, we set γ^{v} s to be equal for each view. β is the graph regularization parameter. In our experiment the values of β are tuned from the candidate set {10^{−4}, 5 × 10^{−4}, 10^{−3}, 5 × 10^{−3}, 10^{−2}, 0.05, 0.1, 0.5, 1} and γ^{v} is set to vary in the set {10^{−3}, 5 × 10^{−3}, 10^{−2}, 0.05, 0.1, 0.5, 1} for all the datasets. Besides, in computing Hessian the size of neighborhood is set to be 30.
Figure 2 shows how the performance of MHSNMF varies with changes of parameters γ^{v} and β on these two datasets. As Fig. 2 shown, MHSNMF obtains the best performance when γ equals to 0.1 and β equals to 0.5 on threesource data. Moreover, for other values of β MHSNMF still owns stable and reliable performance. On HMP dataset, MHSNMF performs relatively stable when γ equals to 0.05 and β varies during the set {10^{−4}, 5 × 10^{−4}, 10^{−3}, 5 × 10^{−3}, 10^{−2}, 0.05, 0.1}.
Convergence curve and the performance
According to the iterative rules (Eqs. 14, 16 and 21), the objective function value progressively grows smaller and it is convergent. Figure 3 shows the convergence curves along with the accuracy value on these two datasets, respectively. The results below are obtained when γ is set to be 0.05 and β is set to 0.01. As we can see that MHSNMF will converge after a few iterations. Interestingly, on threesource data the performance curve shows some shocks in the iterative process. One of the possible reasons is that the clustering solutions obtained from multiple views may not be misaligned for each cluster. This is beyond the scope of this paper.
As Fig. 3 shown, on HMP dataset the performance of MHSNMF achieves the optimal value 95.28%/91.76% in terms of AC/NMI after around 250 iterations. It is worth noting that MHSNMF converges very fast regardless of Threesource or HMP data. This suggests the effectiveness and efficiency of MHSNMF for clustering multiview omics data.
Parameter study
In this subsection, extensive experiments are conducted on HMP data to further validate the performance of MHSNMF w.r.t the number of neighbors p and knn in computing Hessian and constructing affinity graphs, respectively. Figure 4 demonstrates how the accuracy varies with changes in the number of neighbors.
As Fig. 4 shown, the accuracy of MHSNMF achieves the best value when p is set to be 12. Meanwhile, the performance of MHSNMF is stable for the various values of knn. For other values of p, in most cases AC doesn’t vary significantly with the changes of knn, which demonstrates the number of neighbors in computing Hessian cannot have a remarkable impact on the performance of MHSNMF on HMP dataset. This is important to study the microbiome data. We can set a fixed knn value in computing Hessian for the convenience of computation. This study also offers a new reference for multiple heterogeneous omics data fusion.
Analysis on HMP data
To further explore the structures and functions of human microbiome, we apply the proposed MHSNMF algorithm to HMP data and find that it is very useful. Classical multidimensional scaling (MDS) is used on the consensus matrix H^{∗} to describe the relationships among microbiome samples in three dimensional space. Figure 5 reveals clear clustering patterns derived from the consensus matrix. This supports Jeffery et al.‘s argument that the change at the species level of human microbiome is irrelevant to the discrete clusters (enterotype), but it is continuous [26].
As Fig. 5 shown, MHSNMF clearly identifies different clusters corresponding to microbiome samples from seven different body sites. Theses samples from anterior nares (red), gut (cyan) and posterior fornix (yellow) are well separated, particularly for gut microbiome samples. One possible reason is that gut microbiome has more complicated composition and spatial distance relative to other sites. We can also find that samples from three oral sites (buccal mucosa, plaque, tongue dorsum) may have overlapped with each other. This might be because these three sites are all from oral cavity. Therefore, theses samples may have similar microbiome composition and diversity.
Other application
Besides clustering, MHSNMF has also other potential application, for instance, predicting the classification of new samples via consensus matrix H^{∗} obtained from multiple views. When applied it to HMP data with multiple views, the Eq. (10) can also be understood as finding a consensus basis H^{∗} (similar to basis matrix in NMF), such that in the space spanned by H^{∗} the presentation of new microbiome samples can also reflect their structure information. Therefore, we can express a new microbiome sample x_{new} as h by solving the following optimization problem:
Where, \( S={V}_{tr}^i\ast {x}_{new} \), \( {V}_{tr}^i \) is training set from the view, the second term is L_{2} regularization term.
We can use closeness of h to the rows of H^{∗} to decide how likely the new microbiome sample should belong to which body site. For example, one can predict the class of a new microbiome sample according to knn method.
To evaluate our approach, we recollect and extend human microbiome samples to 653 cases, and then separate HMP data (phylogenetic profile and metabolic profile) into training set and test set by randomly selecting 70% samples from each body site as training set and the remaining samples as test set. We firstly learn a consensus matrix H^{∗} from phylogenetic profile and metabolic profile samples in training set, and then predict the classification of phylogenetic (or metabolic) samples in the test set.
To verify that the consensus H^{∗} computed by the proposed MHSNMF algorithm indeed well represents the geometric structure, we also compare several baseline approaches. One is to learn the matrix H^{i} only by single view SNMF, the remaining steps for making predictions are the same as MHSNMF. The other two methods based on subspace learning are Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLSR) [27]. We use the consensus matrix H^{∗} to predict the classification of new samples from each view. The experimental results are shown in Table 5.
As Table 5 shown, MHSNMF obtains much improvement in accuracy compared with three baselines methods on HMP data. It should be noted that CCA fails to utilize the complementary information from multiple views and cannot find the underlying subspace shared by multiple biological compositional profiles. One possible reason is that the objective of CCA is to find the maximum linear correlation between two feature profiles data. Therefore, CCAbased methods may be not suitable for data with nonlinear structure, such as microbiome data. In contrast, by adopting graph and Hessian regularization framework to learn the consensus matrix H^{∗} across all views, MHSNMF succeeds in capturing such knowledge.
Conclusions
In this paper, we introduced a novel multiview Hessian regularization based symmetric nonnegative matrix factorization algorithm (MHSNMF) for multiple omics data integration task. On human microbiome data, the proposed MHSNMF algorithm can effectively combine the phylogenetic, transporter, and metabolic profiles into a unified paradigm to analyze the relationships among different microbiome samples. Experimental results demonstrate MHSNMF has the latent application in multiple biological profiles data analysis. Furthermore, the prediction method based on MHSNMF has shown to be effective in judging the types of new microbiome samples.
To our best knowledge, the interactions among microorganisms are complicated owning to the influences from host environment, diet and other species, particularly for the intestinal flora. Dissecting and exploring the structure and functions of intestinal microbiota is an essential step toward understanding the occurrence and development of microbiotarelated disease. In the future, combining the phylogenetic information of species into the microbial interaction network to analyze functional modules is our next consideration.
Availability of data and materials
The datasets generated or analyzed during the current study are available in the GitHub repository, https://github.com/chonghua1983/MHSNMF.
Abbreviations
 HMP:

Human Microbiome Plan
 iHMP:

Integrative Human Microbiome Plan
 MetaHIT:

Metagenomics of the Human Intestinal Gut
 NMF:

Nonnegative Matrix Factorization
 GNMF:

Graph Regularized Nonnegative Matrix Factorization
 SNMF:

Symmetric Nonnegative Matrix Factorization
 MHSNMF:

Multiview Hessian Regularization based Symmetric Nonnegative Matrix Factorization
 SVD:

Singular Value Decomposition
 AC:

Accuracy
 NMI:

Normalized Mutual Information
 BSSV:

Best Single view
 WSSV:

Worst Single View
 MultiNMF:

Multiview Nonnegative Matrix Factorization
 Cotraining SC:

Cotraining spectral clustering
 SNF:

Similarity network fusion
 LJNMF:

Joint Nonnegative Matrix Factorization with Laplacian
 CSMF:

Common and Specific Matrix Factorization
 NetNMF:

Two Layers Network based Nonnegative Matrix Factorization
References
 1.
Turnbaugh PJ, Ley RE, Hamady M, Fraserliggett CM, Knight R, Gordon JI. The human microbiome project. Nature. 2007;449(7164):804–10.
 2.
Consortium IHN. The integrative human microbiome project: dynamic analysis of microbiomehost omics profiles during periods of human health and disease. Cell Host Microbe. 2014;16(3):276.
 3.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
 4.
Jiang X, Hu X, Xu W. Microbiome data representation by joint nonnegative matrix factorization with Laplacian regularization. IEEE/ACM Trans Comput Biol Bioinformatics. 2017;14(2):353–9.
 5.
Greene D, Cunningham P. A matrix factorization approach for integrating multiple data views. In: European conference on machine learning; 2009. p. 423–38.
 6.
Liu J, Wang C, Gao J, Han J. Multiview clustering via joint nonnegative matrix factorization. In: Proceedings of the 2013 SIAM International Conference on Data Mining; 2013. p. 252–60.
 7.
Zhang L, Zhang S. Learning common and specific patterns from data of multiple interrelated biological scenarios with matrix factorization. Nucleic Acids Res. 2019;47(13):6606–17.
 8.
Cai D, He X, Han J, Huang TS. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell. 2011;33(8):1548–60.
 9.
Chen J, Zhang S. Discovery of twolevel modular organization from matched genomic data via joint matrix trifactorization. Nucleic Acids Res. 2018;46(12):5967–76.
 10.
Kim KI, Steinke F, Hein M. Semisupervised regression using hessian energy with an application to semisupervised dimensionality reduction. In: Neural information processing systems; 2009. p. 979–87.
 11.
Ma Y, Hu X, He T, Jiang X. Hessian regularization based symmetric nonnegative matrix factorization for clustering gene expression and microbiome data. Methods. 2016;111:80–4.
 12.
Kuang D, Ding CHQ, Park H. Symmetric nonnegative matrix factorization for graph clustering. In: Siam international conference on data mining; 2012. p. 106–17.
 13.
Long B, Zhang Z, Yu PS. Coclustering by block value decomposition. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. New York: ACM; 2005. p. 635–40. https://doi.org/10.1145/1081870.1081949.
 14.
Zelnikmanor L, Perona P. Selftuning spectral clustering. In: Advances in neural information processing systems; 2005. p. 1601–8.
 15.
Donoho D, Grimes C. Hessian eigenmaps: locally linear embedding techniques for high dimensional data. Proc Natl Acad Sci. 2003;100(10):5591–6.
 16.
Liu W, Tao D. Multiview hessian regularization for image annotation. IEEE Trans Image Process. 2013;22(7):2676–87.
 17.
Ma Y, Hu X, He T, Jiang X. Clustering and integrating of heterogeneous microbiome data by joint symmetric nonnegative matrix factorization with laplacian regularization. IEEE/ACM Trans Comput Biol Bioinformatics. 2017;PP(99):1–1. https://doi.org/10.1109/TCBB.2017.2756628.
 18.
Wang M, Hua XS, Yuan X, Song Y, Dai LR. Optimizing multigraph learning: towards a unified video annotation scheme. In: ACM International Conference on Multimedia; 2007. p. 862–71.
 19.
Xia T, Tao D, Mei T, Zhang Y. Multiview spectral embedding. IEEE Trans Syst Man Cybernetics Part B. 2010;40(6):1438–46.
 20.
Greene D. A matrix factorization approach for integrating multiple data views. In: European conference on machine learning and knowledge discovery in databases; 2009. p. 423–38.
 21.
Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, Chinwalla AT, Creasy HH, Earl AM, Fitzgerald MG, Fulton RS. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486(7402):207–14.
 22.
Xu W, Liu X, Gong Y. Document clustering based on nonnegative matrix factorization. In: International ACM sigir conference on research and development in information retrieval; 2003. p. 267–73.
 23.
Blum A, Mitchell TM. Combining labeled and unlabeled data with cotraining. In: Conference on learning theory; 1998. p. 92–100.
 24.
Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibekains B, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–7.
 25.
Boutsidis C, Gallopoulos E. SVD based initialization: a head start for nonnegative matrix factorization. Pattern Recogn. 2008;41(4):1350–62.
 26.
Jeffery IB, Claesson MJ, O'toole PW, Shanahan F. Categorization of the gut microbiota: enterotypes or gradients? Nat Rev Microbiol. 2012;10(9):591.
 27.
Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet GR, Levy R, Vasconcelos N. A new approach to crossmodal multimedia retrieval. In: Proceedings of the 18th ACM international conference on multimedia. New York: ACM; 2010. p. 251–60. https://doi.org/10.1145/1873951.1873987.
Acknowledgements
The authors are grateful to all of the reviewers and editors of this manuscript.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 21 Supplement 6, 2020: Selected articles from the 15th International Symposium on Bioinformatics Research and Applications (ISBRA19): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume21supplement6
Funding
This study is supported by the National Natural Science Foundation of China (No.61532008), the Key Technology R&D Program of Henan Province (202102310561) and the Key Research Projects of Henan Higher Education Institutions (No.20B520002).
Author information
Affiliations
Contributions
YM developed the algorithms, coimplemented the experiments and helped to draft the manuscript. JZ coimplemented the experiment used in the paper and YM contributed to the writing of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ma, Y., Zhao, J. & Ma, Y. MHSNMF: multiview hessian regularization based symmetric nonnegative matrix factorization for microbiome data analysis. BMC Bioinformatics 21, 234 (2020). https://doi.org/10.1186/s1285902003555w
Received:
Accepted:
Published:
Keywords
 Symmetric nonnegative matrix factorization
 Hessian regularization
 Multiview clustering
 Human microbiome