Interpretable deep learning methods for multiview learning

Background Technological advances have enabled the generation of unique and complementary types of data or views (e.g. genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. Results We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) to learn nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines deep learning flexibility with the statistical benefits of data and knowledge-driven feature selection, giving interpretable results. Deep neural networks are used to learn view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, therefore, encouraging selection of related variables. iDeepViewLearn is tested on simulated and three real-world data for classification, clustering, and reconstruction tasks. For the classification tasks, iDeepViewLearn had competitive classification results with state-of-the-art methods in various settings. For the clustering task, we detected molecular clusters that differed in their 10-year survival rates for breast cancer. For the reconstruction task, we were able to reconstruct handwritten images using a few pixels while achieving competitive classification accuracy. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning. Conclusion iDeepViewLearn is an innovative deep learning model capable of capturing nonlinear relationships between data from multiple views while achieving feature selection. It is fully open source and is freely available at https://github.com/lasandrall/iDeepViewLearn.


Background
Multiview learning has garnered considerable interest in biomedical research, thanks to advances in data collection and processing.Here, for the same individual, different sets of data or views (e.g., genomics, imaging) are collected, and the main interest lies in learning low-dimensional representation(s) common to all views or specific to each view that together explain the overall dependency structure among the different views.Downstream analyses typically use the learned representations in supervised or unsupervised algorithms.For example, if a categorical outcome is available, then the learned low-dimensional representations could be used for classification.If no outcome is available, the low-dimensional representations could be used in clustering algorithms to cluster the samples.

Existing methods
The literature on multiview learning is not scarce.Linear and nonlinear methods have been proposed to associate multiview data.For example, canonical correlation analysis (CCA) methods have been proposed to maximize the correlation between linear projections of two views [1,2].The kernel version of CCA (KCCA) has also been proposed to maximize the correlation between nonlinear functions of the views while restricting these nonlinear functions to reside in reproducing kernel Hilbert spaces [3,4].Deep learning methods, which offer more flexibility than kernel methods, have been proposed to learn flexible nonlinear representations of two or more views, via deep neural networks (DNNs).Examples of such methods include Deep CCA [5], Deep generalized CCA [for three or more views] [6] and DeepIMV [7].
Despite the success of DNN and kernel methods, their main limitation is that they do not yield interpretable findings.In particular, if these methods are applied to our motivating data, it will be difficult to determine the genes and CpG sites that contribute the most to the dependency structure in the data.This is important for interpreting the results of downstream analysis that use these methods and for determining key molecules that discriminate between those who died from breast cancer and those who did not.
Few interpretable deep-leaning methods for multiview learning have been proposed in the literature.In [8], a data integration and classification method (MOMA) was proposed for multiview learning that uses the attention mechanism for interpretability.Specifically, MOMA builds a module (e.g.gene set) for each view and uses the attention mechanism to identify modules and features relevant to a certain task.
In [9], a deep learning method was proposed to jointly associate data from multiple views and discriminate subjects that allows for feature ranking.The authors considered a homogeneous ensemble approach for feature selection that allowed the ranking of features based on their contributions to the overall dependency among views and the separation of classes within a view.It is noteworthy that variable selection in MOMA and Deep IDA is data driven, and the algorithm for MOMA is applicable to two views, which is very restrictive.

Our approach
In this article, we propose a deep learning framework to associate data from two or more views while achieving feature selection.Similar to deep generalized CCA [deep GCCA] [6] and unlike deep CCA [5], we learn low-dimensional representations that are common to all views.However, unlike deep GCCA, we assume that each view can be approximated by a nonlinear function of the shared low-dimensional representations.We use deep neural networks to model the nonlinear function and construct an optimization problem that minimizes the difference between the observed and the nonlinearly approximated data, while imposing a regularization penalty on the reconstructed data.This allows us to reconstruct each view using only the relevant variables in each view.As a result, the proposed method allows the selection of variables in the views and enhances our ability to identify features from each view that contribute to the association of the views.The results of our motivating data and simulations with small sample sizes suggest that the proposed method may be a useful method for small-sample-size problems compared to other deep learning methods for associating multiple views.Beyond the data-driven approach to feature selection, we also consider a knowledge-based approach to identify relevant features.In the statistical learning literature, the use of prior information (e.g., biological information in the form of variable-variable interactions) in variable selection methods has the potential to identify correlated variables with greater ability to produce interpretable results and improve prediction or classification estimates [10,11].As such, we use the normalized Laplacian of a graph to model bilateral relationships between variables in each view and to encourage the selection of variables that are connected.
In summary, we have three main contributions.First, we propose a deep learning method for learning nonlinear relationships in multiview data that is capable of identifying relevant features that contribute the most to the association among different views.Our approach can accommodate more than two views, in contrast to MOMA, which requires significant code modifications by users, for the same purpose.Second, we extend this method to incorporate prior biological information to yield more interpretable findings, distinguishing it from existing interpretable deep learning methods for multiview learning, such as MOMA and Deep IDA.To the best of our knowledge, this is one of the first nonlinear-based methods for multiview learning to do so.Third, we provide an efficient implementation of the proposed methods in Pytorch and interface them in R to increase the reach of our algorithm.
The remainder of the paper is organized as follows.In section "Methods", we introduce the proposed method.In section "Simulation experiments", we conduct simulation studies to assess the performance of our methods compared to several existing linear and nonlinear methods.In section "Real-world experiments", we apply our method to the Holm breast cancer study for classification and clustering; we further consider two additional applications: brain lower grade glioma (LGG) data, to demonstrate the use of our method for three views; and MNIST handwriting data, to demonstrate that handwritten digits can be reconstructed with few pixels while maintaining competitive classification accuracy.

Model formulation
Assume that d = 1, . . ., D different types of data or views are available from n individu- als and organized in D matrices X (1) ∈ R n×p (1)  , . . ., X (D) ∈ R n×p (D)  .For example, for the same set of n individuals in our motivating study, the matrix X (1) consists of gene expres- sion levels and X (2) consists of CpG sites.Denote an outcome variable by y , if available.In our motivating study, y is an indicator variable of whether or not an individual died from breast cancer.We wish to model complex nonlinear relationships between these views via an informative joint low-dimensional nonlinear embedding of the original high-dimensional data.
For the sake of clarity, we outline a linear framework which our nonlinear model emulates.Assume that there is a joint embedding (or common factors) Z ∈ R n×K of the D views that drives the observed variation across the views so that each view is written as a linear function of the joint embedding plus some noise: X (d) = ZB (d) T + E (d) .Here, K is the number of latent components and B (d) ∈ R p (d) ×K is the loading matrix for view d, each row corresponding to the coefficients K for a specific variable.E (d) is a matrix of errors incurred by approximating X (d) with ZB (d) T .Let z i ∈ R K be the ith row in Z .The common factors z i represent K different driving factors that predict all variables in all views for the ith subject, thus inducing correlations between views.When we write X (d) ≈ ZB (d) T for d = 1, . . ., D , we assume that there is an "intrinsic" space R K so that each sample is represented as z ∈ R K .For each d = 1, . . ., D , x (d) is an instance in X (d) , and B (d) maps a low-dimensional representation z to this x (d) , i.e., restricting the map- pings to be linear.Now, we generalize these mappings to be nonlinear, parameterized by neural networks.
For d = 1, . . ., D , let G d denote the neural network that generalizes B (d)T for the view d.As typical neural networks, each of the G d 's is composed of multilayer affine mapping followed by nonlinear activation, i.e., of the form where σ denotes the nonlinear activation applied element-wise, and W i ' s for i = 1, . . . ,L denote the affine mappings.We prefer to state the affine layers in abstract form, as we can have different types of layer.In this paper, we use G d consisting of fully-connected and convolutional layers to reconstruct numerical data and images, respectively.
For simplicity, assume that each layer of the dth view network, except the first layer, has c d units.Let the size of the input layer (first layer) be K, where K is the number of latent components.The output of the first layer for the dth view is a function of the shared low-dimensional representation, Z , and is given by h is a matrix of biases, and σ : R −→ R is a nonlinear mapping.The output of the second layer for the dth view is the h 2 ∈ R n×c d matrix of biases.The final output layer for the dth view is given by G , and the subscript K d denotes the Kth hidden layer for the view d.G d (Z) is a function of the weights and biases of the network.
Our first goal is to approximate each view with a nonlinear embedding of the joint low-dimensional representation in an interpretable manner, i.e., X (d) ≈ G d (Z) .To achieve interpretability, MOMA used the attention mechanism to choose important features.In the statistical learning literature, regularization techniques (e.g., lasso [12], elastic net [13], SCAD [14]) are oftentimes used for variable selection to promote interpretability.We also propose a regularization approach for interpretability.Specifically, we assume that some variables in X (d) are irrelevant and are not needed in the approxi- mation of X (d) .Thus, the columns of G d (Z) corresponding to the unimportant variables in X (d) should be made zero or nearly zero in the nonlinear approximation of X (d) .To achieve this, we adopt the ℓ 2,1 norm from [15] to promote column-wise sparsity for fea- tures, where ℓ 2,1 is denoted as follows: where X ij is the ijth element in X .Given these assumptions, we propose to solve the following optimization problem: find the parameters of the neural network (weight matrices, biases) defining the neural network G d , and the shared low-dimensional representation together ensure that we select a subset of columns from X (d) to approximate X (d) .d 's are regularization parameters that could be selected by k-fold cross-validation, where k = 5 throughout this paper.
Although � • � 2,1 helps promote column-wise sparsity, we did observe that the columns of G d (Z) were not exactly zero across all samples but were shrunk towards zero for noise variables, perhaps as a result of our use of an automatic differentiation function.Thus, we proceed as follows to select/rank features.Once we have learned the latent code Z and neural networks G 1 , G 2 ,..., G D , we use this information to obtain reconstructed data for the different views, that is, to obtain G d (Z) for the view d.We then calculate the column-wise l 2 norm of G d (Z) , and choose the top r% columns with the largest column norms as important features for the corresponding view.It is imperative that the variables in each view be on the same scale in order to use this ranking procedure.Thus, we standardize each variable to have mean zero and variance one.We save the indices of important features as I 1 , I 2 ,...,I D , and we denote the new datasets with the selected indi- ces as X ′ (1) , X ′(2) , ..., X ′(D) .
Compared to existing deep learning methods for associating multiple views (e.g., deep generalized CCA), our formulation ( 1) is unique because we learn the shared lowdimensional representation, Z , while also selecting important variables in each view that drive the association among views.Similarly to Deep GCCA, and unlike CCA and Deep CCA that only learn linear transformations of each view, we learn Z , which is inde- pendent of the views and allows one to reconstruct all the view-specific representations simultaneously.Figure 1 is a schematic representation to train the neural network and select features.After that, downstream analyses can use the learned Z in classification, regression, and clustering algorithms, as shown in Fig. 2.

Network-based feature selection
We consider a knowledge-based approach to identify potentially relevant variables that drive the dependency structure among views (see Fig. 1).In particular, we use prior knowledge about variable-variable interactions (e.g., protein-protein interactions) in (1) min W (1) ,...,W (D) ,b (1) the estimation of G d (Z) .Incorporating prior knowledge about variable-variable interac- tions can capture complex bilateral relationships between variables.It has the potential to identify functionally meaningful variables (or networks of variables) within each view for improved prediction performance, as well as aid in interpretation of variables.
There are many databases for obtaining information on variable-variable relationships.One of such database for protein-protein interactions is the Human Protein Fig. 1 Feature selection.We train a deep learning model that takes all the views, estimates a shared low-dimensional representation Z that drives the variation across the views, and obtains nonlinear reconstructions ( G 1 (Z),...,G D (Z) ) of the original views.We impose sparsity constraints on the reconstructions allowing us to identify a subset of variables for each view ( I 1 , ...,I D ) that approximate the original data Fig. 2 Reconstruction and downstream analysis.We train a deep learning model to obtain a common low-dimensional representation Z ′ that is based on the features selected in Algorithm 1, we obtain nonlinear approximations ( R 1 (Z),...,R D (Z)) , and we perform downstream analyses using estimated Z ′ Reference Database (HPRD) [16].We capture the variable-variable connectivity within each view in our deep learning model using the normalized Laplacian [17] obtained from the graph underlying the observed data.Let G (d) = (V (d) , E (d) , W (d) ) , d = 1, 2, . . . ,D be a graph network given by a weighted undirected graph.V (d) is the set of vertices cor- responding to the p (d) variables (or nodes) for the d-th view.Let E (d) = {u ∼ v} if there is an edge of variable u to v in the dth view.W (d) is the weight of an edge for the d- th view that satisfies w(u, v) = w(v, u) ≥ 0 .Denote r v as the degree of vertex v within each view; r v = u w(u, v) .The normalized Laplacian of G (d) for the d-th view is where L is the Laplacian of a graph defined as and T is a diagonal matrix with r v as the (u,v)-th entry.Given L (d) , we solve the problem: The normalized Laplacian L (d) is used as a smoothing operator to smooth the columns in G d (Z) so that the variables connected in the d-th view are encouraged to be selected together.

Prediction of shared low-dimensional representation and downstream analyses
In this section, we would like to predict the low-dimensional representation shared from the test data X (d) test , d = 1, . . ., D , (i.e., Z test ) and use this information to predict an outcome, y , if available.The schematic graph is shown in Fig. 2. Note that y can be continuous, binary, or multiclass.We discuss our approach to predict the shared lowdimensional representation, Z test .After getting important features of X (d) , the subscript K d denotes the Kth hidden layer for view d, and the first hidden layer is given as W (1) ,...,W (D) ,b (1) τ for any τ ∈ R � =0 because R d and Z ′ are optimized simultaneously.The lack of control of the scaling of the learned representation Z ′ can lead to robustness problems in downstream analysis, so we add additional constraints on Z ′ in equation (4).However, since it is likely that the shape of Z ′ is not the same as the latent code learned in the testing stage due to the dif- ferent number of samples, we put constraints on each row of Z ′ (we assume that the number of latent components in the training and testing data is the same) as �z ′ i � 2 ≤ 1 where z ′ i means the i-th row vector of Z ′ , that is, the latent code of the ith sample.Finally, the optimization problem is as follows: Here, z ′ test i is the ith row vector in Z test and X test refers to the testing dataset with column indices I d , i.e.,,only the columns that are selected as important are used to esti- mate , the subscript K d denotes the Kth hidden layer for view d, h , and b , and ).Now, when predicting an outcome, the low-dimensional representations Z ′ and Z ′ test become training and testing data, respectively.For example, to predict a binary or multiclass outcome, we train a support vector machine (SVM) [18] classifier with the training data Z ′ and the outcome data y , and we use the learned SVM model and the testing data Z ′ test to obtain the predicted class membership, y test .We compare y test with y test and we estimate the classification accuracy.For continuous outcome, one can implement a nonlinear regression model and then compare the predicted and true outcomes using a metric such as the mean squared error (MSE).For unsupervised analyses, such as clustering, an existing clustering algorithm, such as K-means clustering, can be trained on Z ′ .Figure 2 is a schematic representation of the prediction algorithm and the down- stream analyses proposed.
We provide our optimization approach in the Additional file 1.Our algorithm is divided into three stages.The first stage is the Feature Selection stage.In this stage, we solve the optimization problem (1) or (3) to obtain features that are highly ranked.The second stage is the Reconstruction and Training stage using selected features.Here, we solve the optimization problem (4).Our input data are the observed data with the selected features (that is, the top r or r% features in each view), X ′ (1) . . .X ′ (D) .At convergence, we obtain the reconstructed data R d (Z ′ ) , and the learned shared low-dimensional representations Z ′ based only on the top r or r% variables in each view.Downstream (5) analyses such as classification, regression, or clustering could be carried out on these shared low-dimensional representations learned.The third stage is the prediction stage, if an outcome is available.Here, we solve the optimization problem ( 6) for the learned shared low-dimensional representation ( Z ′ test ) corresponding to the test views ( X test ).This can be used to obtain prediction estimates (e.g.testing classification via an SVM model).

Simulation experiments
We conducted simulation studies to assess the performance of iDeepViewLearn for varying data dimensions, as the relationship between views becomes more complex and when prior information on variable-variable relationships is available or not.Please refer to the Additional file 1 for more simulation setup and results.

Set-up when there is no prior information on variable-variable interactions
We consider two different simulation scenarios to demonstrate both the variable selection and classification performance of the proposed method.In the first scenario, we simulate data with linear relationships among the views and within a view (see Additional file 1).In the second scenario, we simulate the data to show nonlinear relationships.In each scenario, there are D = 2 views and within each view there are two distinct classes.In all scenarios, we generate 20 Monte Carlo training, tuning, and testing sets.We train the models on the training set, choose optimal hyper parameters using the tuning set, and obtain classification performance using the testing set.We evaluate the proposed and existing methods using the following criteria: i) test accuracy, and ii) feature selection.For feature selection, we evaluate the methods ability to select the true signals and ignore noise variables.We use true positive rates (TPR), false positive rates (FPR), and F-measure as metrics to evaluate the variable selection performance.In Scenario 1, the first 20 variables are important, and in Scenario Two, the top 10% of the variables in both views are signals.

Nonlinear simulations
We consider three different settings for this scenario.Each setting has K = 2 classes, but they vary in dimension.In each setting, 10% of the variables in each view are signals and the first five signal variables in each view are related to the remaining signal variables in a nonlinear way (see Fig. 3).We generate data for View 1 as follows: X ] is a matrix of ones and zeros, 1 is a matrix of ones, 0 is a matrix of zeros, and E 1 ∼ N (0, 1) .Each of the first five signal variables in X 1 ∈ R n×p (1) is obtained from θ = θ + 0.5U (0, 1) , where θ is a vector of n evenly spaced points between 0 and 3π .The next 45 signal variables (or col- umns) in X 1 ∈ R n×p (1) are simulated from cos(θ ) plus random noise from a standard nor- mal distribution.The remaining 0.9p (1) variables (or columns) in X 1 are generated from the standard normal distribution.We generate data for View 2 as: X where E 2 ∼ N (0, 1) .The first five columns (or variables) of X 2 ∈ R n×p (2) are simu- lated from exp(0.15θ ) • sin(1.5θ) .The next 0.1p (2) − 5 variables are simulated from exp(0.15θ) • cos(1.5θ ) .The remaining 0.9p (2) variables (or columns) in X 2 are generated from the standard normal distribution.The class labels y ) or (6000, 4500).Figure 3 shows the structure of the nonlinear rela- tionships between signal variables in View 1(First left panel), signal variables in View 2( Second left panel), and signal varibles between Views 1 and View 2 (Middle to Last panel), with black circles denoting data from Class 1 and red triangles data from Class 2.

Competing Methods and Results
We compare the proposed method, iDeepViewLearn, with linear and nonlinear methods for associating data from multiple views.For linear methods, we consider the sparse canonical correlation analysis [Sparse CCA] proposed in [2].For the nonlinear methods, we compare with deep canonical correlation analysis (Deep CCA) [5] and MOMA [8].We note that MOMA is a joint integration and classification method and as such does not require further training a classification method such as SVM, after training MOMA.However, per reviewer comment, we add a comparison where we use the important features chosen by MOMA to train and test an SVM classifier; we call this MOMA + SVM.For Sparse CCA and Deep CCA, we use the estimated canonical variates in SVM for classification performance since these two methods are unsupervised.We also compare the proposed method that integrates the two views with our method on stacked data, and SVM and random forest [19] on stacked data as well, to explore the benefits of multiview learning.Of note, by stacking the data, we do not appropriately model the dependency structure among views as one assumes that the views are not correlated, contrary to the assumption for data integration.We perform Sparse CCA with the SelpCCA R package provided by the authors on GitHub.We performed Deep CCA and MOMA using PyTorch codes provided by the authors.We pair Deep CCA with the teacher-student framework (TS) [20] to rank variables, and compare the TS feature selection approach with the proposed method.We follow the variable-ranking approach in MOMA to rank variables.We report the classification and variable selection results in Table 1 for nonlinear simulations (see results of linear settings and the network structures in Additional file 1).We implemented the proposed method in the training data, selected the top 10% variables for each view, learned a new model with these selected variables, and obtained test errors with the test data.The misclassification rates for the proposed method were lower or competitive compared to all the competitors.We observed a decreasing misclassification rate with increasing sample sizes for all the methods; nevertheless, the proposed method produced lower or competitive test errors even when the sample size was smaller than the dimension of the variables.In terms of variable selection, the TS framework applied to Deep CCA yielded suboptimal results; MOMA and random forest rank the important features based on their influence on the classification performance, and the two methods usually select unimportant features when the sample size is small; iDeepViewLearn and Sparse CCA can always achieve nearly perfect performance for feature selection in the nonlinear simulations.The performance of iDeepViewLearn on the stacked data was similar, although it had slightly higher classification errors, when compared to iDeepViewLearn that holistically integrates the views; thus we recommended against stacking data and implementing the proposed method, but rather using the method that integrates the two views as we have proposed.The results of the linear simulations mimic those of the nonlinear simulations.

Set-up when there is prior information on variable-variable interactions
Here, X (1) However, E i ∼ N (0, i ), i = 1, 2 , i is a diagonal block matrix with two blocks that represent signal and noise variables.The first block is a 50 × 50 covariance matrix that Table 1 Nonlinear settings: randomly select combinations of hyper-parameters to search over TPR-1; true positive rate for X (1) .Similar for TPR-2.FPR-1; false positive rate for X (1) .Similar for FPR-2; F-1 is the F measure for X (1) .Similar for F-2.The highest F-1/2 is in.(The mean error of two views is reported for MOMA; MOMA + SVM means selecting features using MOMA and training an SVM on the selected features)

Method
Error (%) TPR-1 TPR- captures the relationship among these 50 variables.Let G be the true graph structure for these variables.The second block is the identity matrix.We use bdgraph.sim in the BDGraph R package [21] to simulate three different types of networks for the first 50 variables: Scale-free, Lattice, and Cluster, and to obtain the adjacency matrix corresponding to the graph structure G.We then use rgwish from the same R package to generate a precision matrix distributed according to the G−Wishart distribution, W G (b, D) , with parameters b = 3 and D = I with respect to the graph structure G.We obtain the covariance matrix from the precision matrix.Figure 4 shows the variable-variable relationships among these 50 variables for the different network structure.Figure 4 [left panel], variable two is connected to more variables, so we consider variable 2 as a hub variable.We set W = [1 H , 0 p (1) −H ] , H to denote the variables directly connected to vari- able 2, and p (1) − H (similarly p (2) − H ) denote the remaining variables.By defining W this way, we assume that only the variables directly connected to the hub variable are signals and contribute to the nonlinear relationship between the two views.For the Lattice network (Middle panel), all variables in the network except variable 50 contribute to the nonlinear relationships among the views.For the cluster network, only two clusters (circled) are signals.

Competing methods and results
We explore the proposed method with and without the use of network information.In addition to competitors in the nonlinear simulations, we further compare the proposed method with Fused CCA [11].Fused CCA is a sparse canonical correlation analysis method that uses variable-variable information to guide the estimation of the canonical variates and the selection of variables that contribute most to the association between two views.We implemented Fused CCA using the R code accompanying the manuscript.We followed Fused CCA with SVM for classification.We implemented the proposed method on the training data with and without the network information, selected the top ranked variables (21 variables for Scale-free network, 49 variables for Lattice network, and 33 for Cluster network) for each view, learned a new model with these selected variables, and obtained test errors with the testing data.We also compared the top-ranked variables with the true signal variables and the estimated true positive rates (TPR), false positive rates (FPR), and F-score.From Table 2, the classification performance of our method that does not incorporate prior knowledge is comparable to our method that Fig. 4 Network structure for the first 50 variables in X (1) and X (2) .Left: scale-free network; Middle: Lattice; Right: Cluster.For the Scale-free network, we consider variable 2 has a hub variable.Variable 2 and the variables directly connected to it are considered as signal variables.For the Lattice network, all variables except variable 50 are considered as signals.For the Cluster network, the circled clusters are considered as signals does, in all settings.The fused CCA result for Scale-free Setting 2 was based on 19 out of the 20 simulation replicates due to a computational error.The fused CCA result for Lattice Setting 2 was not available due to time constraints.Like the scenario with no prior information, the misclassification rates for the proposed method (with or without prior information) were lower or competitive, especially for the Scale-free and Lattice networks, when compared to the association-based methods.Furthermore, the proposed method was superior to MOMA and SVM on stacked views across all settings and network type, for both classification and variable selection accuracy.For random forest, it can achieve very comparable prediction and feature selection performance with our methods when there are sufficient training data points; however, our iDeepViewLearn's performance outperforms random forest when the sample size is limited when considering both classification and variable selection performance.In summary, the classification and variable selection accuracy from both the linear and nonlinear simulations, and when we use or do not use prior information, suggest that the proposed methods are capable of ranking signal variables as high and ignoring noise variables.The proposed methods are also capable of producing competitive or better classification performance among all settings.In particular, we notice that random forest can achieve comparable classification and variable selection accuracies with our iDeep-ViewLearn when the number of training samples is relatively large, but the feature selection performance of random forest is usually suboptimal in situations where the sample TPR-1; true positive rate for X (1) .Similar for TPR-2.FPR; false positive rate for X (2) .Similar for FPR-2; F-1 is the F measure for X (1) .Similar for F-2.The highest F-1/2 is in red.(The mean error of two views is reported for MOMA; MOMA + SVM means combining the feature selection part of MOMA and SVM) size is less than the number of variables, as shown in Tables 1 and 2. These findings are encouraging to us since in a typical setting of high-dimensional and biomedical problems, the sample size is smaller than the number of variables.

Real-world experiments
In this section, we consider three real-world applications to show the effectiveness of the proposed method across different tasks and settings.We first applied the proposed method to integrate gene expression and methylation data from the Holm breast cancer study [22] for classification and clustering tasks with two views.We next applied the proposed method to data pertaining to brain lower grade glioma (LGG) to demonstrate the use of our method for classification tasks with three views.Finally, we applied our method on a MNIST handwriting data, for a reconstruction task, demonstrating that handwriting digits can be reconstructed with few pixels while maintaining competitive classification accuracy.The details of all the datasets used in this section are shown in Table 3.

Evaluation of data from holm breast cancer study
Breast cancer is the most common cancer among women worldwide, accounting for 12.5% of new cases and is one of the leading causes of death in women [23].Research shows that breast cancer is a multi-step process that involves both genetic and epigenetic changes.Epigenetic factors such as DNA methylation and histone modification lead to breast tumorigenesis by silencing critical tumor suppressor and growth regulator genes [24].Identifying methylated sites correlated with gene expression data could shed light on the genomic architecture of breast cancer.Our work is motivated by a molecular subtyping study conducted in [22], which used gene expression and DNA methylation data to identify methylation patterns in breast cancer.For completeness, we describe the data here.Raw methylation profiles from 189 breast cancer samples were extracted using the Beadstudio Methylation Module (Illumina).There were 1452 CpG sites (corresponding to 803 cancer-related genes).β-values were stratified into three groups: 0, 0.5, and 1.The value of 1 corresponded to hypermethylation.Relative methylation levels were obtained from raw values by centering the stratified β values in all samples.Furthermore, relative gene expression levels of 179 of 189 breast cancer tumors were obtained using oligonucleotide arrays for 511 probes.The number of samples with data on gene expression and methylation for our analysis is n = 179 .The first view, corresponding to gene expression data, had 468 variables (genes), and the second view, corresponding to methylation data, had 1452 variables (CpG sites).The methylation data were filtered to include the most variable methylated sites by removing CpG sites with a standard deviation less than 0.3 between samples; this resulted in 334 CpG sites (corresponding to 249 cancer-related genes).In addition to molecular data, data on whether an individual died from breast cancer or not is also available.The goal of our analysis is to perform an integrative analysis of the methylation and gene expression data to model nonlinear associations between CpG sites and genes through a joint non-linear embedding that drives the overall dependency structure in the data.Importantly, we wish to identify a subset of CpG sites and genes that contribute to the dependency structure and could be used to discriminate between those who survived and those who did not survive breast cancer.Further, we wish to explore the use of the joint nonlinear embedding in molecular clustering.

Goal 1: Model nonlinear relationships between methylation and gene expression data and identify CpG sites and genes that can discriminate between those who died and those who did not die from breast cancer
For the first goal, we split the data into three sets of approximately equal size.We used 2/3rd of the data to train the model and we used the remaining 1/3rd to test our models.We implemented the proposed method on the training set, selected the top 10% and 20% highly ranked variables in each view, learned new models with these selected features, used the test data and the models learned to predict the test outcomes, and obtained test errors.We repeated the process 20 times, obtained the highly ranked variables for each run, and estimated the average test errors.We compared the proposed method with SVM, random forest, Deep CCA, Sparse CCA, MOMA and MOMA + SVM based on average test errors.
Average misclassification rates and genes and CpG sites selected Table 4 gives the average test errors for the methods.On the basis of the high classification errors across the methods, it seems that separating those who died from breast cancer from those who did not die using methylation and gene expression data is a difficult problem.We investigate the use of iDeepViewLearn on the stacked data, and we notice that, similar to the nonlinear simulations, there are small gaps present compared to the results when we integrate the data.Hence, we still recommend using our method as proposed when there are two or more views, and not apply on stacked data.The average test error for the proposed method based on the top 10% or 20% CpG sites and genes is comparable to that of the other methods.Our proposed method allows us to obtain insight into the genes and CpG sites that drive the classification accuracy.
For this purpose, we explored the "stable" genes and CpG sites that potentially discriminate between people who died and those who did not die from breast cancer.We consider a variable to be "stable" if it is ranked in the top 20% at least 16 times ( ≥ 80% ) of the 20 resampled datasets.Table 5 shows the genes selected and how often they were selected.Genes DAB2, DCN, HLAF, MFAP4, MMP2, PDGFRB, TCF4 and TMEFF1  were consistently selected in the top 20% in all resampled data sets.There is support in the literature for the potential role of some of these genes in a variety of human cancers.Disabled homolog 2 [or DAB adaptor protein 2] (DAB2) is a protein-coding gene that is often deleted or silenced in several human cancer cells.The decorin gene (DCN) is a protein coding gene that encodes the protein decorin.Research on different human cancers (e.g.breast, prostate, bladder) has shown that DCN expression levels in cancerous cells are significantly reduced from expression levels in normal tissues or are often completely silenced in tumor tissues [25].Research suggests that individuals expressing lower levels of DCN in cancer tend to have poorer outcomes compared to individuals expressing higher levels of DCN.In our data, the mean expression levels of DCN for those who survived were not statistically different (based on the Anova test) from those who did not.We observed statistically significant differences in mean expression levels of PDGFR and BIRC5 for the two groups (p-value < 0.05 from ANOVA test), as shown in Fig. 5.The median expression values of these genes were higher in individuals who died of breast cancer.The platelet-derived growth factor receptor alpha (PDGFRA) gene is a protein encoding gene that encodes the PDGFRA protein.The PDGFRA protein is involved in important biological processes such as cell growth, division, and survival.Mutated forms of the PDGFRA gene and protein have been found in some types of cancer.The BIRC5 gene is a protein encoder gene that encodes the baculoviral IAP repeat containing protein 5 in humans.This protein is believed to play an important role in the promotion of cell division (proliferation) and in the prevention of cell apoptosis (death) [26,27].
Table 6 shows the CpG sites selected at least 16 times in the top 20% of the highly ranked CpG sites.The CpG sites FGF2_P229_F, IL1RN_E42_F, RARA_P1076_R, TFF1_P180_R, TGFB3_E58_R, WNT2_P217_F were consistently selected in the top 20% of highly ranked CpG sites across all resampled datasets.Figure 6 shows the relative methylation levels of the CpG sites that were consistently selected or were Fig. 5 All genes except BIRC5 were consistently selected in the top 20% of highly-ranked genes across the twenty resampled datasets.BIRC5 was selected 19 times (out of 20) in the top 20% highly-ranked genes.Genes PDGFRB and BIRC5 have mean expression levels that are statistically significantly different between individuals that died from breast cancer and those that survived significantly different between those who survived and those who died.The mean methylation levels for the CpG sites IL1RN_E42_F and TGFB3_E58_R were statistically different between those who died and those who survived breast cancer (p-value < 0.05 from Anova test).In particular, the mean relative methylation levels for IL1RN_E42_F and TGFB3_E58_R were lower in those who died from breast cancer compared to those who did not.The interleukin 1 receptor antagonist (IL1RN) gene is a protein-coding gene that encodes the interleukin-1 receptor antagonist protein, a member of the interleukin 1 cytokine family.IL1RN is an anti-inflammatory molecule And Two Follistatin Like Domains 2 Fig. 6 All CpG sites were consistently selected in the top 20% of highly-ranked CpG sites across the twenty resampled datasets.The mean methylation levels of IL1RN_E42_F and TGFB3_E58_R are statistically different between individuals that died from breast cancer and those that survived that modulates the biological activity of the pro-inflammatory cytokine, interleukin-1 [28].IL1RN has been implicated in several cancers.

Gene Ontology and Pathway Enrichment Analyses:
We use an online enrichment tool, ToppGene Suite [29], to explore the biological relationships of these "stable" genes and CpG sites.We took these genes from the gene expression data and genes corresponding to the CpG sites as input for ToppFun in ToppGene Suite.Some of the biological processes enriched with gene ontology (GO) included vasculature development, tissue development, angiogenesis, and tube morphogenesis (see Tables 7 and 8).Some of the biological processes significantly enriched in our list of methylation include tissue morphogenesis, epithelial tube morphogenesis, and tube development.All these biological processes are essential in cell development, and aberrations or disruptions in these processes could result in cancer.Tables 9 and 10 show the top 10 pathways that are enriched in our list of methylated genes and genes, respectively.Some of these pathways included cancer and pathways related to extracellular matrix orgnaization (ECM).ECM is a complex collection of proteins and plays a key role in cell survival, cell proliferation, and   differentiation [30].ECM is involved in tumor progression, dissemination, and response to therapy [30,31].

Goal 2: Model nonlinear relationships between methylation and gene expression data and derive molecular clusters
We demonstrate the use of the estimated shared low-dimensional representation and the reconstructed methylation and gene expression data in molecular clustering.For this purpose, we applied the proposed method (without Laplacian) to all data to identify the top 20% genes and CpG sites that could be used to nonlinearly approximate the origi- nal views.Then we obtained the shared low-dimensional representation ( Z ′ ), and the reconstructed views ( R 1 (Z ′ ) , and R 2 (Z ′ ) ) based only on the top 20% genes and CpG sites.We perform K-means clustering on Z ′ , R 1 (Z) and R 2 (Z) .We set the number of clusters to 4, which is within the number of clusters investigated in the original article [22].We compared the number of clusters detected with several variables related to breast cancer, including estrogen receptor (ER) status, progesterone receptor (PgR) status, survival time, and survival status for ten years.We obtained Kaplan-Meier (KM) curves to compare the survival curves for the identified clusters.We also fitted a Cox regression model to compare the estimated hazard ratios for 10-year survival.Finally, we performed an enrichment analysis of the top 20% genes and CpG sites.
Figure 8A shows the KM curves for the clusters detected using the low-dimensional shared representation (first panel) and the reconstructed gene expression (middle panel) and methylation data (right panel).From the KM plots, the 10-year survival curves for the clusters detected using the shared low-dimensional representation or the reconstructed methylation data are significantly different (p-value = 0.041 and 0.032, respectively, based on a log-rank test to compare survival curves).As reported in Table 11, In particular, the proportion of individuals in Cluster 3 with ER/PgR negative tumors was higher, the 10-year survival rate was lower (only 40% of participants from Cluster 0 survived while 69% of participants from Cluster 1 survived, Fig. 8B), and the average survival time was shorter compared to those in Cluster 1 (Fig. 8C).Furthermore, the estimated unadjusted risk ratio for 10-year survival for those in Cluster 3 compared to those in Cluster 0 was 1.409 (Fig. 8D 95%CI: 1.686-2.894,p-value = 0.04), suggesting that being in Cluster 3 reduces your survival rate by a factor of 1.41 at each point during 10-year follow-up compared to Cluster 0. This effect persisted even after adjusting for age or age and ER status.Significantly enriched pathways, as shown in Fig. 7, from our gene list and CpG sites (genes corresponding to the top 20% CpG sites) include ECM, inflammatory response pathway, and pathways in cancer.

Evaluation of brain lower grade glioma data
We applied our method to data pertaining to brain lower grade glioma (LGG) to identify molecules that discriminate between levels of LGG grade (grade 2 vs 3 gliomas).We obtained data from the Board GDAC Firehose of the Cancer Genome Atlas Program (TCGA). 1 We used three types of omics data: methylation, miRNA, and mRNAseq, following the analysis in [32].Only patients with all available omics and classifications of grade were included in our analyses, giving a total sample size of 510, with 246 patients classified as grade 2 and 264 patients as grade 3. We used the LGG dataset to demonstrate that the proposed method can be used to associate three views, select important biomarkers, and predict patient grade category.Data cleaning and data preprocessing were carried out on each view of data to remove features with low potential for discrimination.For all views, we first removed features with missing measures.Due to the limited number of features left in the miRNA view after removing missing values, future preprocessing was conducted only on the DNA methylation view and the mRNAseq view.Unsupervised preprocessing was applied to remove features whose variance was less than 0.001 for DNA methylation measures and 0.1 for mRNAseq measures, following the thresholds used in [32].The data were then divided into training ( n = 410 ) and testing (n = 100 ) sets and supervised preprocessing was conducted on the training set.Logistic regression was fitted for each feature in the DNA methylation view and the mRNAseq view.The p-values were adjusted by the Benjamini-Hochberg procedure, and the features with adjusted p-values < 0.05 were kept in the dataset.After data cleaning and preprocessing, the number of features for DNA methylation, miRNA, and mRNAseq was 9691, 235, and 7603 respectively.We applied the proposed approach to the training dataset, where we selected the important features from each type of omics data.Subsequently, we used these selected features to make predictions for the patient's grade category in the testing dataset, as shown in Table 12.We used cross-validation to tune hyper-parameters based on the training set.Our proposed method was compared with Deep Generalized Canonical Correlation Analysis (Deep GCCA) [33] with PyTorch implementation. 2We added the teacher-student network (TS) [20] for feature selection, and implemented SVM for classification; Deep IDA [9]; Features selected from Deep IDA with SVM for classification; SIDA [10], and SVM and Random Forest on stacked data.The classification performance of our method is comparable to other methods (Table 12).
In Fig. 9, we show the overlaps of features selected by the methods.We used the top 100 features of each view selected by the proposed method.We compare the top 100 features selected by the TS network with Deep GCCA and the top 50 features selected by the TS network with Deep IDA.SIDA selected 46, 29, and 304 features for each omics, respectively.We presented the overlaps between the selected genes across the four methods matched from NCBI. 3 The overlaps between 2 or more methods of DNA methylation were COL11A2 and FBLN2.The overlaps between 3 or more methods for miRNA view were MIR379, MIR409, MIR29C, MIR129-1, MIR20B, MIR30E, MIR92A2, MIR222, MIR24-2, MIR767, MIR128-2, MIR105-2, and MIR17.The overlaps between 2 or more methods for mRNAseq view were NCAPH, LY86-AS1, HSFX2, and SLC25A41.

Evaluation of shear transformed MNIST data
We apply our method to the MNIST dataset [34].The MNIST handwritten image dataset consists of 70,000 images of handwritten digits divided into training and testing sets of 60,000, and 10,000 images, respectively.The digits have been size-normalized and centered in a fixed-size image.Each image is 28 × 28 pixels and has an associated label that denotes which digit the image represents (0-9).We make good use of a shear mapping to generate a second view of these handwritten digits.A shear mapping is a linear map that displaces each point in a fixed direction by an amount proportional to its signed distance from the line that is parallel to that direction and goes through the origin.Figure 10 shows two image plots of a digit for views 1 and 2.  We used the MNIST dataset to demonstrate the ability of the proposed method to reconstruct handwritten images using a few pixels.In particular, neural networks G d consist of convolutional layers instead of fully connected layers, since they reconstruct images.We apply the proposed method to the training dataset, select 20% and 30% of the pixels based on our variable ranking criteria and reconstruct the images using only the selected pixels.We also learn a new model with these pixels, we use the learned model and the testing data to classify the test digits, and we obtain the test errors.Figure 10 shows the reconstructed images based on the top 20% and 30% pixels.The digits are apparent even with only 30% of the pixels.From Table 13, the classification performance using the top 30% of the pixels is comparable to Deep CCA and SVM, which use all pix- els.Even when only 20% of the pixels were selected and used to reconstruct the images, the classification performance of our method was competitive.

Discussion
We have presented iDeepViewLearn, short for Interpretable Deep Learning Method for Multiview Learning, to learn nonlinear relationships in data from multiple sources.iDeepViewLearn combines the flexibility of deep learning with the statistical advantages of data-and knowledge-driven feature selection to yield interpretable results.In particular, iDeepViewLearn learns low-dimensional representations of the views that are common to all the views and assumes that each view can be approximated by a nonlinear function of the shared representations.Deep neural networks are used to model the nonlinear function and an optimization problem that minimizes the difference between the observed data and the nonlinearly transformed data are used to reconstruct the original data.A regularization penalty is imposed on the reconstructed data in the optimization problem, permitting us to reconstruct each view only with relevant variables.Beyond the data-driven approach for feature selection, we also consider a knowledge-based approach to identify relevant features.We use the normalized Laplacian of a graph to model bilateral relationships between variables in each view and to encourage the selection of connected variables.
We have developed a user-friendly algorithm in Python 3, specifically PyTorch, and interfaced it with R to increase the reach of our method.Extensive simulations with varying data dimensions and complexity revealed that iDeepViewLearn outperforms several other linear and nonlinear methods for integrating data from multiple views, even in high-dimensional scenarios where the sample size is typically smaller than the number of variables.
When iDeepViewLearn was applied to methylation and gene expression data related to breast cancer, we observed that iDeepViewLearn is capable of achieving meaningful biological insights.We identified several CpG sites and genes that better discriminated people who died from breast cancer and those who did not.The biological processes of the gene ontology enriched in the top-ranked genes and methylated CpG sites included processes essential to cell proliferation and death.The enriched pathways included cancer and others that have been implicated in tumor progression and response to therapy.Using the shared low-dimensional representations of gene expression and methylation data from our method, we detected four molecular clusters that differed in their 10-year survival rates.The enrichment analysis of highly ranked genes and genes corresponding to the CpG sites selected by our method showed a strong enrichment of pathways and biological processes, some related to breast cancer and others that could be further explored for their potential role in breast cancer.We also applied iDeepViewLearn to DNA methylation, miRNA, and mRNASeq data pertaining to Brain Lower Grade Glioma (LGG) and found our method to be competitive in discriminating between LGG categories, demonstrating the ability of our methods to be used for more than two views.We further applied iDeepViewLearn to handwritten image data and we were able to reconstruct the digits with about 30% pixels while also achieving competitive classification accuracy.For more applications, e.g., drug repositioning [35][36][37], we leave them for future work.A limitation of our work is that the number (or proportion) of top-ranked features needs to be specified in advance.

Conclusion
In conclusion, we have developed deep learning methods to learn nonlinear relationships in multiview data that are able to identify features likely driving the overall association in the views.The simulations and real data applications are encouraging, even for scenarios with small to moderate sample sizes, thus we believe the methods will motivate other applications.

Fig. 3
Fig. 3 Structure of nonlinear relationships between (First left panel) signal variables in View 1; (Second left panel) signal variables in View 2; (Middle panel)-(Fifth panel) signal variables between Views 1 and 2. Black circle: Class 1; Red triangle: Class 2

Fig. 7 Fig. 8
Fig. 7 Top 10 significant pathways using highly-ranked genes (Top Panel) and genes corresponding to highly-ranked CpG sites (Bottom Panel)

Fig. 9
Fig.9 Venn diagrams of features selected by the proposed method, and the three comparison methods that are capable of feature selection.The left, middle, and right panels correspond to DNA methylation, miRNA, and mRNAseq, respectively.The percentages represent the proportion of the total selected features from the four methods 3 https:// www.ncbi.nlm.nih.gov/.

Fig. 10
Fig.10 An example of shear transformed MNIST dataset.For the subject with label "0" and "9", view 1 observation is on the left and view 2 observation is on the right.Notably, we show the grayscale images with color only for better visualization

Table 3
Summary of datasets for each analysis

Table 4
Breast cancer data: SVM is based on stacked raw data with two views Deep CCA + SVM is a training SVM based on the last layer of Deep CCA.iDeepViewLearn with selected top 10% features reconstructs the original views with only 10% of the features and obtains a test classification error based on a shared low- dimensional representation trained on data with only 10% of the features.Similar to iDeepViewLearn with selected top 20% .(The mean error of two views is reported for MOMA; MOMA + SVM means combining the feature selection part of MOMA and SVM)

Table 5
Frequency of Genes selected at least 16 times in the top 20% across 20 resampled datasets

Table 6
Frequency of Genes selected at least 16 times in the top 20% across 20 resampled datasets

Table 7
Top 10 Gene Ontology (GO) Biological Processes enriched with ToppFun in ToppGene Suite

Table 8
Genes corresponding to CpG sites.Top 10 Gene Ontology (GO) Biological Processes enriched with ToppFun in ToppGene Suite

Table 9
Genes corresponding to CpG sites.Top 10 Pathways enriched with ToppFun in ToppGene Suite

Table 10
Genes selected.Top 10 Pathways enriched with ToppFun in ToppGene Suite PgR, overall survival time, and 10-year survival event.Individuals in Cluster 3 seemed to have worse survival outcomes compared to individuals in Cluster 0.

Table 11
Characteristics of the patients.Continuous variables are tested based on regular ANOVA with equal variance assumption, and categorical variables are tested based on the Chi-square test

Table 12
LGG dataset: SVM and random forest are based on stacked views.Deep IDA + SVM means selecting features from Deep IDA and training an SVM classifier on these features.iDeepViewLearn with selected top 50 features obtains a classification error based on a shared low-dimensional representation trained on data with the selected top 50 features.Similar for iDeepViewLearn with selected top 100 features

Table 13
MNIST dataset: SVM is based on stacked views.Deep CCA + SVM is a training SVM based on the last layer of Deep CCA.iDeepViewLearn with selected top 20% pixels obtains a classification error based on a shared low-dimensional representation trained on data with the selected 20% of the pixels.Similar for iDeepViewLearn with selected top 30%