Supervised Regularized Canonical Correlation Analysis: integrating histologic and proteomic measurements for predicting biochemical recurrence following prostate surgery

Background: Multimodal data, especially imaging and non-imaging data, is being routinely acquired in the context of disease diagnostics; however, computational challenges have limited the ability to quantitatively integrate imaging and non-imaging data channels with different dimensionalities and scales. To the best of our knowledge, relatively few attempts have been made to quantitatively fuse such data to construct classifiers, and none have attempted to quantitatively combine histology (imaging) and proteomic (non-imaging) measurements for making diagnostic and prognostic predictions. The objective of this work is to create a common subspace, called a metaspace, that simultaneously accommodates both the imaging and non-imaging data (and hence data corresponding to different scales and dimensionalities). This metaspace can be used to build a meta-classifier that produces better classification results than a classifier that is based on a single modality alone. Canonical Correlation Analysis (CCA) and Regularized CCA (RCCA) are statistical techniques that extract correlations between two modes of data to construct a homogeneous, uniform representation of heterogeneous data channels. In this paper, we present a novel modification to CCA and RCCA, Supervised Regularized Canonical Correlation Analysis (SRCCA), that (1) enables the quantitative integration of data from multiple modalities using a feature selection scheme, (2) is regularized, and (3) is computationally cheap. We leverage this SRCCA framework towards the fusion of proteomic and histologic image signatures for identifying prostate cancer patients at risk of 5 year biochemical recurrence following radical prostatectomy.
Results: A cohort of 19 grade- and stage-matched prostate cancer patients, all of whom had radical prostatectomy, 10 of whom had biochemical recurrence within 5 years of surgery and 9 of whom did not, was considered in this study. The aim was to construct a lower dimensional fused metaspace comprising both the histological and proteomic measurements obtained from the site of the dominant nodule on the surgical specimen. In conjunction with SRCCA, a random forest classifier was able to identify prostate cancer patients who developed biochemical recurrence within 5 years, with a maximum classification accuracy of 93%.
Conclusions: The classifier performance in the SRCCA space was found to be statistically significantly higher compared to the fused data representations obtained not only from CCA and RCCA, but also from two other statistical techniques, Principal Component Analysis and Partial Least Squares Regression. These results suggest that SRCCA is a computationally efficient and highly accurate scheme for representing multimodal (histologic and proteomic) data in a metaspace and that it could be used to construct fused biomarkers for predicting disease recurrence and prognosis.


Background
With the plentitude of multi-scale, multi-modal, disease pertinent data being routinely acquired for diseases such as breast and prostate cancer, there is an emerging need for powerful data fusion (DF) methods to integrate the multiple orthogonal data streams for the purpose of building diagnostic and prognostic meta-classifiers for disease characterization [1]. Combining data derived from multiple sources has the potential to significantly increase classification performance relative to performance trained on any one modality alone [2]. A major limitation in constructing integrated meta-classifiers that can leverage imaging (histology, MRI) and non-imaging (proteomics, genomics) data streams is having to deal with data representations spread across different scales and dimensionalities [3].
For instance, consider two different data streams $F_A(x)$ and $F_B(x)$ describing the same object $x$. If $F_A(x)$ and $F_B(x)$ correspond to the same scale or resolution and also have the same dimensionality, then one can envision concatenating the two data vectors into a single unified vector $[F_A(x), F_B(x)]$, which could then be used to train a classifier. However, when $F_A(x)$ and $F_B(x)$ correspond to different scales, resolutions, and dimensionalities, it is not immediately obvious how one would go about combining the different types of measurements to build integrated classifiers to make predictions about the class label of $x$. For instance, directly aggregating data from very different sources without accounting for differences in the number of features and relative scaling can not only lead to the curse of dimensionality (too many features and not enough corresponding samples [4]), but can also lead to classifier bias towards the modality with more attributes. A possible solution is to first project the data streams into a space where the scale and dimensionality differences are removed: a metaspace allowing for a homogeneous, fused, multi-modal data representation.
DF methods try to overcome these obstacles by creating such a metaspace, on which a proper meta-classifier can be constructed. Methods leveraging embedding techniques have been proposed to try and fuse such heterogeneous data for the purpose of classification and prediction [2,3,5-7]. However, all of these DF techniques have their own weaknesses in creating an appropriate representation space that can simultaneously accommodate multiple imaging and non-imaging modalities. Generalized Embedding Concatenation [5] is a DF scheme that relies on dimensionality reduction (DR) methods to first eliminate the differences in scales and dimensionalities between the modalities before fusing them. However, these DR methods face the risk of extracting noisy features which degrade the metaspace [8]. Other variants of the embedding fusion idea, including Consensus embedding [6] and Boosted embedding [3], have yielded promising results, but come at a high computational cost. Consensus embedding attempts to combine multiple low dimensional data projections via a majority voting scheme, while the Boosted embedding scheme leverages the Adaboost classifier [9] to combine multiple weak embeddings. In the case of weighted multi-kernel embedding using graph embedding [7] and support vector machine classifiers [2], insufficient training data can lead to overfitting and inaccurate weights for the various kernels, which can lower the performance of the meta-classifier [10].
CCA is a statistical DF technique that extracts linear correlations, by using cross-covariance matrices, between 2 data sources, X and Y. It capitalizes on the knowledge that the different modalities represent different sets of descriptors for characterizing the same object. For this reason, the mutual information that is most correlated between the two modalities will provide the most meaningful transformation into a metaspace. In recent years, CCA has been used to fuse heterogeneous data such as pixel values of images and the text attached between these images [11], assets and liabilities in banks [12], and audio and face images of speakers [13].
Regularized CCA (RCCA) is an improved version of CCA which, in the presence of insufficient training data, prevents overfitting by using a ridge regression optimization scheme [14]. Denote $p$ and $q$ as the number of features in $X$ and $Y$, and $n$ as the sample size. When $n \ll p$ or $n \ll q$, the features in $X$ and $Y$ tend to be highly collinear. This leads to ill-conditioned matrices $C_{xx}$ and $C_{yy}$, which denote the covariance matrix of $X$ with itself and of $Y$ with itself, such that their inverses are no longer reliable, resulting in an invalid computation of CCA and an unreliable metaspace [15]. The condition placed on the data to guarantee that $C_{xx}$ and $C_{yy}$ will be invertible is $n \geq p + q + 1$ [16]. However, that condition is usually not met in the bioinformatics domain, where samples ($n$) are usually limited and modern technology has enabled very high dimensional data streams to be routinely acquired, resulting in very high dimensional feature sets ($p$ and $q$). This creates a need for regularization, which works by adding small positive quantities to the diagonals of $C_{xx}$ and $C_{yy}$ to guarantee their invertibility [17]. RCCA has been used to study expressions of genes measured in liver cells and compare them with concentrations of hepatic fatty acids in mice [18]. However, the regularization process required by RCCA is computationally very expensive. Both CCA and RCCA also fail to take complete advantage of class label information, when available [19].
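The effect of adding a small positive quantity to the diagonal of a covariance matrix can be illustrated with a minimal sketch. The random data, the dimensions (19 samples, 25 features) and the value of the regularization parameter `lam_x` are assumptions chosen only for illustration and are not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 19, 25                      # fewer samples than features (n < p)
X = rng.standard_normal((n, p))    # synthetic data, for illustration only
Xc = X - X.mean(axis=0)            # center each feature
C_xx = Xc.T @ Xc / (n - 1)         # p x p sample covariance, rank at most n - 1

lam_x = 0.1                        # hypothetical regularization parameter
C_reg = C_xx + lam_x * np.eye(p)   # add a small positive value to the diagonal

rank_before = np.linalg.matrix_rank(C_xx)         # rank deficient
rank_after = np.linalg.matrix_rank(C_reg)         # full rank
min_eig_after = np.linalg.eigvalsh(C_reg).min()   # bounded below by lam_x
```

Because every eigenvalue of the regularized matrix is shifted up by `lam_x`, the matrix becomes invertible even when the sample covariance itself is singular.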
In this paper, we present a novel efficient Supervised Regularized Canonical Correlation Analysis (SRCCA) DF algorithm that is able to incorporate a supervised feature selection scheme to perform regularization. Mainly, it makes better use of labeled information that in turn allows for significantly better stratification of the data in the metaspace. While SRCCA is more expensive than the overfitting-prone CCA, it provides the needed regularization while also being computationally cheaper than RCCA. SRCCA first produces an embedding of the most correlated data in both modalities via a low dimensional metaspace. This representation is then used in conjunction with a classifier (K-Nearest Neighbor [20] and Random Forest [21] are used in this study) to create a highly accurate meta-classifier.
Along with CCA and RCCA, SRCCA is compared with 2 other low dimensional data representation techniques: Principal Component Analysis (PCA) and Partial Least Squares Regression (PLSR). PCA [22] is a linear DR method that reduces high dimensional data to dominant orthogonal eigenvectors that try to represent the maximal amount of variance in the data. PLSR [23] is a DR method that uses one modality as a set of predictors to try to predict the other modality. Tiwari et al. [24] employed PCA in conjunction with a wavelet based representation of different MRI protocols to build a fused classifier to detect prostate cancer in vivo. PLSR has been used with heterogeneous multivariate signaling data collected from HT-29 human colon carcinoma cells stimulated to undergo programmed cell death to uncover aspects of biological cue-signal-response systems [25].
In this work, we apply SRCCA to the problem of predicting biochemical recurrence in prostate cancer (CaP) patients, following radical prostatectomy, by fusing histologic imaging and proteomic signatures. Biochemical recurrence is commonly defined as a detectable elevation of Prostate Specific Antigen (PSA), a key biomarker for CaP [26][27][28]. However, the nonspecificity of PSA leads to over-treatment of CaP, resulting in many unnecessary treatments, which are both stressful and costly [29][30][31][32][33]. Even the most widely used prognostic markers such as pathologist assigned Gleason grade [34], which attempts to capture the morphometric and architectural appearance of CaP on histopathology, has been found to be a less than perfect predictor of biochemical recurrence [35]. Additionally, Gleason grade has been found to be subject to inter-, and intra-observer variability [36][37][38]. While some researchers have proposed quantitative, computerized image analysis approaches [1,39,40] for modeling and predicting Gleason grade (a number that goes from 1 to 5 based on morphologic appearance of CaP on histopathology), it is still not clear that an accurate, reproducible grade predictor from histology will also be accurate in predicting biochemical recurrence and long term patient outcome [41].
Recent studies have shown that proteomic markers can be used to predict aggressive CaP [42,43]. Techniques such as mass spectrometry hold promise in their ability to identify protein expression profiles that might be able to distinguish more aggressive from less aggressive CaP and identify candidates for biochemical recurrence [44][45][46]. However, more and more, it is becoming apparent that a single prognostic marker may not possess sufficient discriminability to predict patient outcome which suggests that the solution might lie in an integrated fusion of multiple markers [47]. This then begs the question as to what approaches need to be leveraged to quantitatively fuse imaging and non-imaging measurements to build an integrated prognostic marker for CaP recurrence. The overarching goal of this study is to leverage SRCCA to construct a fused quantitative histologic, proteomic marker, and a subsequent meta-classifier, for predicting 5 year biochemical recurrence in CaP patients following surgery.
Our main contributions in this paper are:
• A novel data fusion algorithm, SRCCA, that builds an accurate metaspace representation that can simultaneously represent and accommodate two heterogeneous imaging and non-imaging modalities.
• Leveraging SRCCA to build a meta-classifier to predict risk of 5 year biochemical recurrence in prostate cancer patients following radical prostatectomy by integrating histological image and proteomic features.
The organization of the rest of the paper is as follows: In the methods section, we first review the 4 statistical methods, PCA, PLSR, CCA and RCCA. Next, we introduce our novel algorithm, Supervised Regularized Canonical Correlation Analysis (SRCCA). We then discuss the DF algorithm for metaspace creation and the computational complexities for CCA, RCCA and SRCCA. In the Experimental Design section, we briefly discuss the prostate cancer dataset considered in this study and the subsequent proteomic and histologic feature extraction schemes before moving on to the experiments performed on the dataset where we try to determine the ability of PCA, PLSR, CCA, RCCA and SRCCA to identify patients at risk for biochemical recurrence following surgery. The results are discussed in the subsequent section and the concluding remarks are presented at the end of the paper.

Review of PCA and PLSR
Principal Component Analysis (PCA) and Partial Least Squares Regression (PLSR) are common statistical methods used to analyze multi-modal data and they are briefly discussed in the following sections. However, further information, explaining how these two methods can be viewed as special cases of the generalized eigenproblem, can be found in [48].

Principal Component Analysis (PCA)
PCA [22] constructs a low dimensional subspace of the data by finding a series of linear orthogonal bases called principal components. Each component seeks to explain the maximal amount of variance in the dataset. Denote two multidimensional variables, $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, where $p$ and $q$ are the number of features in $X$ and $Y$ and $n$ is the overall number of samples. PCA is usually performed on the data matrix $Z \in \mathbb{R}^{n \times (p+q)}$ obtained by concatenating the individual modalities [24]. $\bar{Z} \in \mathbb{R}^{n \times (p+q)}$ is then obtained by subtracting the means of all features for a certain sample from its original feature value in $Z$ so that the resultant $\bar{Z}$ has rows with a 0 mean. $\bar{Z}$ is further decomposed using singular value decomposition into [22]:

$$\bar{Z} = U E V^{T},$$

where $E \in \mathbb{R}^{n \times (p+q)}$ is a rectangular diagonal matrix of singular values, and $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{(p+q) \times (p+q)}$ hold the corresponding singular vectors. The squared diagonal entries of $E$ are the eigenvalues associated with each principal component and indicate how much variance of the original $\bar{Z}$ is captured by that component. Using these eigenvalues as a ranking, the top $d$ embedding components can be chosen to best represent the original data in a lower dimensional subspace.
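A minimal sketch of PCA on concatenated modalities via singular value decomposition is given below. It assumes the conventional centering of each feature (column), synthetic data, and a hypothetical helper name `pca_embed`; it is an illustration rather than the implementation used in this study.

```python
import numpy as np

def pca_embed(X, Y, d):
    """Project the concatenated data onto its top d principal components."""
    Z = np.hstack([X, Y])                 # n x (p + q) concatenated data matrix
    Zc = Z - Z.mean(axis=0)               # assumption: center each feature (column)
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    explained = s**2 / np.sum(s**2)       # fraction of variance per component
    scores = U[:, :d] * s[:d]             # sample coordinates in the subspace
    return scores, explained[:d]

rng = np.random.default_rng(0)
X = rng.standard_normal((19, 25))         # synthetic stand-in for one modality
Y = rng.standard_normal((19, 51))         # synthetic stand-in for the other
scores, explained = pca_embed(X, Y, d=3)
```

The singular values returned by `np.linalg.svd` are already sorted in decreasing order, so the first `d` columns correspond to the components explaining the most variance.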

Partial Least Squares Regression (PLSR)
PLSR [49] is a statistical technique that generalizes PCA and multiple regression. The general underlying model behind PLSR is [23]:

$$X = T P^{T} + E, \qquad Y = T C^{T} + F,$$

where $T \in \mathbb{R}^{n \times l}$ is a score matrix, $P \in \mathbb{R}^{p \times l}$ and $C \in \mathbb{R}^{q \times l}$ are loading matrices for $X$ and $Y$, and $E \in \mathbb{R}^{n \times p}$ and $F \in \mathbb{R}^{n \times q}$ are the error terms. PLSR is an iterative process that works by continually improving the approximation of the matrices $T$, $P$ and $C$ [50].

Review of CCA and RCCA

Canonical Correlation Analysis (CCA)
CCA [51] is a way of using cross-covariance matrices to obtain a linear relationship between the two multidimensional variables, $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$. CCA obtains two directional vectors $w_x \in \mathbb{R}^{p \times 1}$ and $w_y \in \mathbb{R}^{q \times 1}$ such that $Xw_x$ and $Yw_y$ will be maximally correlated. It is defined as the optimization problem [11]:

$$\rho = \max_{w_x, w_y} \frac{w_x^{T} C_{xy} w_y}{\sqrt{w_x^{T} C_{xx} w_x \; w_y^{T} C_{yy} w_y}},$$

where $C_{xy} \in \mathbb{R}^{p \times q}$ is the covariance matrix of the matrices $X$ and $Y$, $C_{xx} \in \mathbb{R}^{p \times p}$ is the covariance matrix of the matrix $X$ with itself and $C_{yy} \in \mathbb{R}^{q \times q}$ is the covariance matrix of the matrix $Y$ with itself. The solution to CCA reduces to the solution of the following two generalized eigenvalue problems [52]:

$$C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^{2} C_{xx} w_x, \qquad C_{yx} C_{xx}^{-1} C_{xy} w_y = \lambda^{2} C_{yy} w_y,$$

where $C_{yx} = C_{xy}^{T}$, $\lambda^{2}$ is the generalized eigenvalue, $\lambda$ is the corresponding canonical correlation, and $w_x$ and $w_y$ are the corresponding generalized eigenvectors. CCA can further produce exactly $\min\{p, q\}$ orthogonal embedding components (sets of $Xw_x$ and $Yw_y$) which can be sorted in order of decreasing correlation, $\lambda$.
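The eigenvalue formulation of CCA can be sketched as follows. This is a minimal illustration on synthetic data with more samples than features, so that no regularization is needed; the function name `cca` and the data are assumptions and do not reproduce the source code cited in this paper.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlations and weights, sorted by decreasing correlation."""
    n, p, q = X.shape[0], X.shape[1], Y.shape[1]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    C_xx = Xc.T @ Xc / (n - 1)
    C_yy = Yc.T @ Yc / (n - 1)
    C_xy = Xc.T @ Yc / (n - 1)
    B = np.linalg.solve(C_yy, C_xy.T)             # C_yy^{-1} C_yx
    M = np.linalg.solve(C_xx, C_xy) @ B           # C_xx^{-1} C_xy C_yy^{-1} C_yx
    evals, vecs = np.linalg.eig(M)                # eigenvalues are squared correlations
    order = np.argsort(-evals.real)[: min(p, q)]
    corr = np.sqrt(np.clip(evals.real[order], 0.0, 1.0))
    W_x = vecs.real[:, order]
    W_y = B @ W_x                                 # w_y is proportional to C_yy^{-1} C_yx w_x
    return corr, W_x, W_y

rng = np.random.default_rng(0)
latent = rng.standard_normal(200)                 # shared signal seen by both modalities
X = np.column_stack([latent + 0.5 * rng.standard_normal(200) for _ in range(3)])
Y = np.column_stack([latent + 0.5 * rng.standard_normal(200) for _ in range(2)])
corr, W_x, W_y = cca(X, Y)
r1 = np.corrcoef(X @ W_x[:, 0], Y @ W_y[:, 0])[0, 1]   # matches corr[0]
```

The empirical correlation of the first pair of projections agrees with the square root of the largest eigenvalue, which is the sense in which the eigenvalue represents the canonical correlation.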

Regularized Canonical Correlation Analysis (RCCA)
RCCA [53,54] corrects for noise in $X$ and $Y$ by first assuming that $X$ and $Y$ are contaminated with noise, $N_x \in \mathbb{R}^{n \times p}$ and $N_y \in \mathbb{R}^{n \times q}$. We assume that the noise vectors in the $p$ columns of $N_x$ and the $q$ columns of $N_y$ are Gaussian, independent and identically distributed. For this reason, all covariances between the $p$ columns of $N_x$ and the $q$ columns of $N_y$ will be 0 except the covariance of a particular column vector with itself. This variance of each column of $N_x$ and $N_y$ is labeled $\lambda_x$ and $\lambda_y$, and these are called the regularization parameters. The matrix $C_{xy}$ will not be affected, but the matrices $C_{xx}$ and $C_{yy}$ become $C_{xx} + \lambda_x I_x$ and $C_{yy} + \lambda_y I_y$. The solution to RCCA now becomes the solution to these generalized eigenvalue problems [52]:

$$C_{xy} (C_{yy} + \lambda_y I_y)^{-1} C_{yx} w_x = \lambda^{2} (C_{xx} + \lambda_x I_x) w_x, \qquad C_{yx} (C_{xx} + \lambda_x I_x)^{-1} C_{xy} w_y = \lambda^{2} (C_{yy} + \lambda_y I_y) w_y.$$

The regularization parameters next have to be chosen. For $i \in \{1, 2, \ldots, n\}$, let $w_x^{i}$ and $w_y^{i}$ denote the weights calculated from RCCA when samples $X_i$ and $Y_i$ are removed. $\lambda_x$ and $\lambda_y$ are varied in a certain range $\theta_1 \leq \lambda_x, \lambda_y \leq \theta_2$ and chosen via a grid search [55] optimization of the following cost function [18]:

$$(\lambda_x, \lambda_y) = \arg\max_{\theta_1 \leq \lambda_x, \lambda_y \leq \theta_2} \; \mathrm{corr}\left(\{X_i w_x^{i}\}_{i=1}^{n}, \{Y_i w_y^{i}\}_{i=1}^{n}\right),$$

where $\mathrm{corr}(\cdot, \cdot)$ refers to Pearson's correlation coefficient [56]. The above cost function essentially measures the change in the produced $w_x^{i}$ and $w_y^{i}$ when a sample $i$ is omitted and seeks the optimal $\lambda_x$ and $\lambda_y$ where this change is minimized. $\lambda_x$ and $\lambda_y$ are chosen using the embedding component with the highest $\lambda$ and then adjusted for the remaining dimensions [18].
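The leave-one-out grid search for the regularization parameters can be sketched as follows. This is a minimal illustration under several assumptions: synthetic data, a coarse grid of 4 values per parameter, only the first embedding component, held-out samples centered with the training means, and hypothetical helper names (`rcca_first`, `loo_score`, `rcca_grid_search`).

```python
import numpy as np

def rcca_first(X, Y, lam_x, lam_y):
    """First pair of RCCA weight vectors, plus the training means."""
    n, p, q = X.shape[0], X.shape[1], Y.shape[1]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    C_xx = Xc.T @ Xc / (n - 1) + lam_x * np.eye(p)   # regularized covariance of X
    C_yy = Yc.T @ Yc / (n - 1) + lam_y * np.eye(q)   # regularized covariance of Y
    C_xy = Xc.T @ Yc / (n - 1)
    B = np.linalg.solve(C_yy, C_xy.T)
    evals, vecs = np.linalg.eig(np.linalg.solve(C_xx, C_xy) @ B)
    w_x = vecs[:, np.argmax(evals.real)].real
    if w_x[np.argmax(np.abs(w_x))] < 0:              # resolve the sign ambiguity
        w_x = -w_x
    return w_x, B @ w_x, mx, my

def loo_score(X, Y, lam_x, lam_y):
    """Pearson correlation of the held-out projections over all n splits."""
    n = X.shape[0]
    proj_x, proj_y = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                     # remove sample i
        w_x, w_y, mx, my = rcca_first(X[keep], Y[keep], lam_x, lam_y)
        proj_x[i] = (X[i] - mx) @ w_x
        proj_y[i] = (Y[i] - my) @ w_y
    return np.corrcoef(proj_x, proj_y)[0, 1]

def rcca_grid_search(X, Y, grid):
    """Return (best score, lam_x, lam_y) over all ordered pairs in the grid."""
    best = (-np.inf, None, None)
    for lam_x in grid:
        for lam_y in grid:
            score = loo_score(X, Y, lam_x, lam_y)
            if score > best[0]:
                best = (score, lam_x, lam_y)
    return best

rng = np.random.default_rng(0)
n, p, q = 19, 25, 10
latent = rng.standard_normal((n, 1))                 # shared synthetic signal
X = latent + rng.standard_normal((n, p))
Y = latent + rng.standard_normal((n, q))
grid = np.linspace(0.001, 0.2, 4)                    # coarse grid for illustration
best_score, best_lx, best_ly = rcca_grid_search(X, Y, grid)
```

Each grid point requires one factorization per held-out sample, which is the source of the additional factor of $n$ in the cost of RCCA.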
Extending RCCA to SRCCA
Supervised Regularized Canonical Correlation Analysis (SRCCA) chooses $\lambda_x$ and $\lambda_y$ using a supervised feature selection method (t-test, Wilcoxon Rank Sum Test and Wilks Lambda Test are used in this study). Denote $\omega_1$ and $\omega_2$ as class 1 and class 2, and $\mu_1$ and $\mu_2$, $\sigma_1^{2}$ and $\sigma_2^{2}$, $n_1$ and $n_2$ as the means, variances, and sample sizes of $\omega_1$ and $\omega_2$. The data in the metaspace, $Xw_x$ or $Yw_y$, can be split using its labels into the $n_1$ samples that belong to $\omega_1$ and the $n_2$ samples that belong to class $\omega_2$, where $n_1 + n_2 = n$. These two partitions can then be used to calculate the discrimination level between the samples of the two classes in the metaspace representation. In this study, we implement RCCA with the t-test (SRCCA TT), the Wilcoxon Rank Sum Test (SRCCA WRST) and the Wilks Lambda Test (SRCCA WLT) to try to choose more appropriate regularization parameters, $\lambda_x$ and $\lambda_y$, that can more successfully stratify the samples in the metaspace compared to the parameters chosen by RCCA. Similar to RCCA, for SRCCA, $\lambda_x$ and $\lambda_y$ are chosen using the embedding component with the most discriminatory score as chosen by the feature selection schemes below and then adjusted for the remaining dimensions.

SRCCA TT
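The t-test [57] is a parametric test that assumes the distributions of the two samples are normal and tests whether these distributions have the same means. The t-score, which measures the number of standard deviations by which the means of the $n_1$ samples of $\omega_1$ and the $n_2$ samples of $\omega_2$ are separated, is maximized using a grid search algorithm as:

$$(\lambda_x, \lambda_y) = \arg\max_{\theta_1 \leq \lambda_x, \lambda_y \leq \theta_2} \frac{|\mu_1 - \mu_2|}{\sqrt{\dfrac{\sigma_1^{2}}{n_1} + \dfrac{\sigma_2^{2}}{n_2}}}.$$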
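A minimal sketch of the t-score for a single embedding component is shown below. The function name `t_score`, the use of sample variances, and the toy values are assumptions for illustration; in SRCCA TT such a score would be evaluated for the metaspace produced at each candidate pair of regularization parameters, and the pair with the largest score retained.

```python
from math import sqrt
from statistics import mean, variance

def t_score(values, labels):
    """Absolute two-sample t-score of 1-D metaspace projections (labels 1 or 2)."""
    g1 = [v for v, c in zip(values, labels) if c == 1]
    g2 = [v for v, c in zip(values, labels) if c == 2]
    # standard error of the difference between the two class means
    se = sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))
    return abs(mean(g1) - mean(g2)) / se

values = [0.1, 0.3, 0.2, 1.1, 1.3, 1.2]   # toy projections, not study data
labels = [1, 1, 1, 2, 2, 2]
score = t_score(values, labels)
```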

SRCCA WRST
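The Wilcoxon Rank Sum Test [58] sorts the samples of both classes together in order from lowest value to highest value. It then uses their respective ranks within the population to calculate the discriminatory score, which is maximized using a grid search algorithm as:

$$(\lambda_x, \lambda_y) = \arg\max_{\theta_1 \leq \lambda_x, \lambda_y \leq \theta_2} \left[ \sum_{i=1}^{n_2} b_i - \frac{n_2 (n_2 + 1)}{2} \right],$$

where $b_i$ represents the rank of the sample $i \in \omega_2$ with respect to the rest of the samples.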
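The rank-based score can be sketched as follows, assuming it takes the standard Mann-Whitney form (the rank sum of the second class minus its minimum possible value) and that tied values receive average ranks. Both the function name `rank_sum_score` and the toy values are hypothetical.

```python
def rank_sum_score(values, labels):
    """Rank sum of class 2 minus its minimum possible value n2 (n2 + 1) / 2."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1            # ranks are 1-based; ties share the average
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n2 = sum(1 for c in labels if c == 2)
    r2 = sum(r for r, c in zip(ranks, labels) if c == 2)
    return r2 - n2 * (n2 + 1) / 2

values = [0.1, 0.3, 0.2, 1.1, 1.3, 1.2]       # toy projections, not study data
labels = [1, 1, 1, 2, 2, 2]
score = rank_sum_score(values, labels)        # perfectly separated classes
```

For perfectly separated classes, as in this toy example, the score reaches its maximum possible value of $n_1 n_2$.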

SRCCA WLT
In an ideal metaspace representation, samples from each class will be grouped together while the samples from different classes will be grouped separately. The WLT [59] capitalizes on this knowledge and calculates the ratio of the within class variance of both samples to the total variance of both samples combined. Wilks Lambda ($\Lambda$) is minimized using a grid search algorithm as:

$$(\lambda_x, \lambda_y) = \arg\min_{\theta_1 \leq \lambda_x, \lambda_y \leq \theta_2} \Lambda, \qquad \Lambda = \frac{n_1 \sigma_1^{2} + n_2 \sigma_2^{2}}{n \sigma^{2}},$$

where $\sigma^{2}$ denotes the variance of all $n$ samples combined.

Data Fusion in the context of CCA, RCCA and SRCCA
DF is performed as described in Foster et al. [60]. When $Xw_x$ and $Yw_y$ are maximally correlated, each modality represents similar information, and thus either $Xw_x$ or $Yw_y$ can be used to represent the original two modalities in the metaspace. Moreover, $X$ and $Y$ are both descriptors of the same object and thus the most relevant information is the data that exists and is correlated in both modalities. Thus, a high correlation of $Xw_x$ and $Yw_y$ is indicative that meaningful data, measuring the object of interest, is being added to the metaspace. In order of decreasing $\lambda$, the top $d$ embedding components, up to $r = \min\{p, q\}$, can be chosen to represent the two modalities in a metaspace. However, the lower embedding components will have a lower $\lambda$, and thus a lower correlation between $Xw_x$ and $Yw_y$, which might imply that non-relevant data is being added to the metaspace. To avoid this issue, a threshold, $\lambda_0$, can be selected such that only embedding components with $\lambda \geq \lambda_0$ will be included in the metaspace.
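The Wilks Lambda ratio and the thresholding of embedding components described above can be sketched as follows. The sketch assumes population (biased) variances so that the ratio equals the within class sum of squares over the total sum of squares; the function names `wilks_lambda` and `select_components` and all numeric values are hypothetical.

```python
from statistics import pvariance

def wilks_lambda(values, labels):
    """Within class sum of squares divided by the total sum of squares."""
    g1 = [v for v, c in zip(values, labels) if c == 1]
    g2 = [v for v, c in zip(values, labels) if c == 2]
    within = len(g1) * pvariance(g1) + len(g2) * pvariance(g2)
    total = len(values) * pvariance(values)
    return within / total

def select_components(correlations, threshold, d_max):
    """Indices of at most d_max components whose correlation meets the threshold."""
    ranked = sorted(range(len(correlations)), key=lambda k: -correlations[k])
    return [k for k in ranked[:d_max] if correlations[k] >= threshold]

values = [0.1, 0.3, 0.2, 1.1, 1.3, 1.2]       # toy projections, not study data
labels = [1, 1, 1, 2, 2, 2]
wl = wilks_lambda(values, labels)             # small value: well separated classes

# hypothetical canonical correlations for four embedding components
kept = select_components([0.999, 0.5, 0.995, 0.97], threshold=0.99, d_max=3)
```

A value of the ratio close to 0 indicates tight, well separated classes, which is why this score is minimized rather than maximized.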

Computational Complexity
Given $r = \min\{p, q\}$, CCA has a computational complexity of $r!$ (based on the source code in [61]). The regularization algorithm requires a grid search process for each ordered pair $(\lambda_x, \lambda_y)$. Assume $v$ potential $\lambda_x$ and $\lambda_y$ sampled evenly between $\theta_1$ and $\theta_2$. RCCA requires a training/testing cross-validation strategy at each ordered pair $(\lambda_x, \lambda_y)$ to find the optimal $\lambda_x$ and $\lambda_y$. It will require CCA to be performed on the order of $n$ times at each of the $v$ intervals, leading to a complexity of $v n r!$. SRCCA only requires a CCA factorization once at each of the $v$ intervals, leading to a complexity of $v r!$.
The computational complexities for each of the CCA schemes are summarized in Table 1. Table 1 indicates that SRCCA is on the order of $n$ times faster than RCCA. However, SRCCA is also more complex than CCA and will have a longer execution time.
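The relative cost of the three schemes can be made concrete by counting CCA factorizations, following the counts stated above. The helper name `cca_factorizations` is hypothetical, and the values $v = 200$ and $n = 19$ are the ones reported for this study.

```python
def cca_factorizations(method, v, n):
    """Number of CCA factorizations, following the counts stated in the text."""
    if method == "CCA":
        return 1            # a single factorization, no regularization search
    if method == "SRCCA":
        return v            # one factorization per grid interval
    if method == "RCCA":
        return v * n        # about n leave-one-out factorizations per grid interval
    raise ValueError(f"unknown method: {method}")

v, n = 200, 19
counts = {m: cca_factorizations(m, v, n) for m in ("CCA", "RCCA", "SRCCA")}
speedup = counts["RCCA"] / counts["SRCCA"]   # SRCCA needs n times fewer factorizations
```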

Experimental Design Data Description
A total of 19 prostate cancer patients at the Hospital at the University of Pennsylvania were considered for this study. All patient identifiers are stripped from the data at the time of acquisition. The data was deemed to be exempt for review by the internal review board at Rutgers University and the protocol was approved by the University of Pennsylvania internal review board. Hence, the data was deemed eligible for use in this study. All of these patients had been found to have prostate cancer on needle core biopsy and subsequently underwent radical prostatectomy. 10 of these patients had biochemical recurrence within 5 years following surgery (BR) and the other 9 did not (NO BR). The 19 patient studies were randomly chosen from a larger cohort of 110 patient studies at the University of Pennsylvania all of whom had been stage and grade matched (Gleason score of 6 or 7) and had undergone gland resection. Of these 110 cases, 55 had experienced biochemical recurrence within 5 years while the other 55 had not. The cost of the mass spectrometry to acquire the proteomic data limited this study to only 19 patient samples. Following gland resection, the gland was sectioned into a series of histological slices with a meat cutter. For each of the 19 patient studies, a representative histology section on which the dominant tumor nodule was observable was identified. Mass Spectrometry was performed at this site to yield a protein expression vector. The representative histologic sections were then digitized at 40 × magnification using a whole slide digital scanner.
In the next two sections, we briefly describe the construction of the proteomic and histologic feature spaces. Subsequently we describe the strategy for combination of quantitative image descriptors from the tumor site on the histological prostatectomy specimen and the corresponding proteomic measurements obtained from the same tumor site, via mass spectrometry. The resultant meta-classifier, constructed in the fused meta-space, is then used to distinguish the patients at 5 year risk of biochemical recurrence following radical prostatectomy from those who are not.

Proteomic Feature Selection
Prostate slides were deparaffinized and rehydrated essentially as described in [62]. Tumor areas previously defined on a serial H&E section were collected by needle dissection, and formalin cross-links were removed by heating at 99°C. The FASP (Filter-Aided Sample Preparation) method [63] was then used for buffer exchange and tryptic digest. After peptide purification on C-18 StageTips [64], samples were analyzed using nanoflow C-18 reverse phase liquid chromatography/tandem mass spectrometry (nLC-MS/MS) on an LTQ Orbitrap mass spectrometer. A top-5 data-dependent methodology was used for MS/MS acquisition, and data files were processed using the Rosetta Elucidator proteomics package, which is a label-free quantitation package that uses extracted ion chromatograms to calculate protein abundance rather than peptide counts. A high dimensional feature vector was obtained, denoted $\phi_P \in \mathbb{R}^{19 \times 953}$, characterizing each patient's protein expression profile following surgery. This data underwent quantile normalization, $\log_2$ transformation, and mean and variance normalization on a per-protein basis.
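The three normalization steps can be sketched as follows. This is an illustration on synthetic abundances and assumes that quantile normalization is applied across samples (rows), that abundances are strictly positive, and that the sample standard deviation is used; the function names are hypothetical and this is not the processing code of the proteomics package used in the study.

```python
import numpy as np

def quantile_normalize(A):
    """Rows are samples; force every sample to share the same value distribution."""
    order = np.argsort(A, axis=1)
    ranks = np.argsort(order, axis=1)              # rank of each entry within its row
    reference = np.sort(A, axis=1).mean(axis=0)    # mean of the k-th smallest values
    return reference[ranks]

def preprocess(A):
    """Quantile normalization, log2 transform, then per-protein standardization."""
    Q = quantile_normalize(A)
    L = np.log2(Q)                                 # assumes strictly positive abundances
    return (L - L.mean(axis=0)) / L.std(axis=0, ddof=1)

rng = np.random.default_rng(0)
A = rng.lognormal(mean=0.0, sigma=1.0, size=(19, 40))   # synthetic abundances
Q = quantile_normalize(A)
P = preprocess(A)
```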

Quantitative Histologic Feature Extraction
In prostate whole-mount histology, denoted $\phi_H \in \mathbb{R}^{19 \times 151}$ (Figure 1 (a), (f)), the objects of interest are the glands (shown in Figure 1 (b), (g)), whose shape and arrangement are highly correlated with cancer progression [1,39,65,66]. Prior to extracting image features, we employ an automatic region-growing gland segmentation algorithm presented by Monaco et al. [67]. The boundaries of the interior gland lumen and the centroids of each gland allow for extraction of 1) morphological and 2) architectural features from histology, as described briefly below. More extensive details on these methods are in our other publications [5,39,68].
Glandular Morphology: The set of 100 morphological features [1] (denoted $\phi_M \in \mathbb{R}^{19 \times 100}$) consists of the average, median, standard deviation, and min/max ratio for attributes such as gland area, maximum area, area ratio, and estimated boundary length (see Table 2).
Architectural Feature Extraction: 51 architectural image features, which have been shown to be predictors of cancer [69] (denoted $\phi_A \in \mathbb{R}^{19 \times 51}$), were extracted in order to quantify the arrangement of glands present in the section (see Table 2). Voronoi diagrams, Delaunay Triangulation and Minimum Spanning Trees were constructed on the digital histologic image using the gland centroids as vertices, the gland centroids having previously been identified via the scheme in [68].

Table 1. Computational complexity of CCA, RCCA and SRCCA.
Method    Computational complexity
CCA       $r!$
RCCA      $v n r!$
SRCCA     $v r!$
$r = \min\{p, q\}$, which represents the number of features in the lower dimensional modality, $n$ is the sample size and $v$ is the number of evenly spaced values over which $\lambda_x$ and $\lambda_y$ will be chosen in the range $[\theta_1, \theta_2]$.
H}. $\phi_P$ was reduced to 25 features as ranked by the t-test, with a p-value cutoff of $p = .05$, using a leave-one-out validation strategy. For CCA, $\phi_P$ and $\phi_J$ were used as the two multidimensional variables, $X$ and $Y$, as described above. For RCCA and SRCCA, $\phi_P$ and $\phi_J$ were used in a manner similar to CCA, except that they were tested with regularization parameters $\lambda_x$ and $\lambda_y$ evenly spaced from $\theta_1 = .001$ to $\theta_2 = .2$ with $v = 200$.
The top $d = 3$ embedding components (which were experimentally found to meet the criterion of $\lambda_0 = .99$ for all SRCCA variants on all 3 multimodal combinations) were produced from CCA, RCCA, SRCCA TT, SRCCA WRST, and SRCCA WLT. The classification accuracies were determined with the classifiers K-Nearest Neighbor, denoted via $\phi_{KNN}$ [20], with $K = 1$, and Random Forest, denoted via $\phi_{RF}$ [21], with 50 trees. Both of these classifiers were used because of their high computational speed. Accuracies were determined using leave-one-out validation, which was implemented because of the small sample size. In this process, 18 samples were used for the initial feature pruning, for determining the optimal regularization parameters and for training the classifier, while the remaining sample was used as the testing set for evaluating the classifier. This procedure was repeated until every sample had been used in the testing set.
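The leave-one-out evaluation of a 1-nearest-neighbor classifier can be sketched as follows. The sketch assumes that a $d = 3$ dimensional embedding has already been computed and uses synthetic, well separated classes; in the procedure described above, the feature pruning and the selection of the regularization parameters are also repeated within each fold using only the 18 training samples. The function name `loo_1nn_accuracy` is hypothetical.

```python
import numpy as np

def loo_1nn_accuracy(E, y):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on embedding E."""
    n = len(y)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i                    # hold out sample i
        dists = np.linalg.norm(E[train] - E[i], axis=1)
        pred = y[train][np.argmin(dists)]            # label of the closest training sample
        correct += int(pred == y[i])
    return correct / n

rng = np.random.default_rng(0)
# synthetic d = 3 embedding: 10 samples of one class and 9 of the other
E = np.vstack([rng.normal(0.0, 1.0, (10, 3)), rng.normal(10.0, 1.0, (9, 3))])
y = np.array([1] * 10 + [2] * 9)
accuracy = loo_1nn_accuracy(E, y)
```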

Experiment 2 -Comparing SRCCA with PCA and PLSR
In addition to the steps performed in Experiment 1, metaspaces were also produced with PCA and PLSR. $\phi_P$ and $\phi_J$ were concatenated and PCA was then performed on this new data matrix. For PLSR, a regression of $\phi_J$ on $\phi_P$ was performed. Similarly, using the top $d = 3$ embedding components produced from PCA, PLSR, SRCCA TT, SRCCA WRST, and SRCCA WLT, the classification accuracies of $\phi_{KNN}$, with $K = 1$, and $\phi_{RF}$, with 50 trees, were determined using leave-one-out validation.
In addition, we denote as $a_1(i)$ the classification accuracy obtained by the DF scheme $i$, where $i \in$ {PCA, PLSR, CCA, RCCA}, and as $a_2(j)$ the accuracy obtained by the DF scheme $j$, where $j \in$ {SRCCA TT, SRCCA WRST, SRCCA WLT}. A two paired Student's t-test was employed to identify whether there were statistically significant improvements in the 3 SRCCA variants by comparing the classification accuracies with the null hypothesis:

$$H_0: a_1(i) = a_2(j)$$

for all $i \in$ {PCA, PLSR, CCA, RCCA} and for all $j \in$ {SRCCA TT, SRCCA WRST, SRCCA WLT}.
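The paired comparison of accuracies can be sketched as follows. The function name `paired_t` is hypothetical and the accuracy values are invented for illustration; they are not results from this study.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a1, a2):
    """Paired t statistic for matched accuracy pairs (positive when a2 > a1)."""
    diffs = [b - a for a, b in zip(a1, a2)]
    # mean difference divided by its standard error
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

a1 = [0.60, 0.55, 0.65]    # hypothetical accuracies of a baseline DF scheme
a2 = [0.80, 0.78, 0.82]    # hypothetical accuracies of an SRCCA variant
t_stat = paired_t(a1, a2)
```

The statistic would then be compared against a t distribution with one fewer degrees of freedom than the number of pairs to obtain a p-value.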

Experiment 4 -Computational consideration for RCCA and SRCCA
We measured the 3 individual single run completion times for RCCA and SRCCA to fuse $(\phi_P, \phi_M)$, $(\phi_P, \phi_A)$, and $(\phi_P, \phi_H)$, with the null hypothesis that the completion times of RCCA and SRCCA are equal. These experiments were performed on a quad-core computer with a clock speed of 1.8 GHz, and the programs were written on the MATLAB(R) platform.

Experiment 1
Across both classifiers for d = 3, the 3 SRCCA variants, SRCCA TT , SRCCA WRST , SRCCA WLT , had a combined median classification accuracy of 80% compared to 60% for CCA and 42% for RCCA. SRCCA also performed better in all 36 of 36 direct comparisons with CCA and RCCA (see Tables 3 and 4). The higher classification accuracy results indicate that SRCCA produces a metaspace, where the samples are more stratified, compared to CCA and RCCA. This also seems to indicate that the supervised scheme of choosing regularization parameters, by the 3 SRCCA variants, is a more appropriate scheme for classification purposes compared to the ridge regression scheme used by RCCA.
These results, which seem to suggest that SRCCA outperforms the other two CCA based approaches for this dataset, CCA and RCCA, are observable in the embedding plots of Figure 2, which show the metaspace produced by CCA, RCCA, SRCCA TT, SRCCA WRST and SRCCA WLT with $d = 2$ components. It may be seen that because CCA lacks regularization, the corresponding covariance matrices are singular and lack inverses. For this reason, in Figure 2 the embedding components are not orthogonal but are highly correlated with each other and yield the same information. RCCA overcomes this regularization problem but still does not produce the same level of discrimination between patient classes compared to the 3 variations of SRCCA. Note that SRCCA TT, SRCCA WRST and SRCCA WLT chose similar regularization parameters, $\lambda_x$ and $\lambda_y$, and have similar embedding plots.

Experiment 2
We see that SRCCA TT , SRCCA WRST , and SRCCA WLT outperform PCA and PLSR in all 36 direct comparisons (see Tables 5 and 6). Across both classifiers for d = 3, PCA and PLSR have median classification accuracies of 64% and 61%, respectively. Although these are higher than the accuracies for CCA and RCCA, they are still much lower than the 80% for SRCCA TT , SRCCA WRST , and SRCCA WLT . These results seem to indicate that SRCCA TT , SRCCA WRST , and SRCCA WLT create a more appropriate metaspace not only than CCA and RCCA, but also than PCA and PLSR.
Classification accuracies obtained for fusing (j P , j M ), (j P , j A ), and (j P , j H ), with CCA, RCCA, SRCCA TT , SRCCA WRST , and SRCCA WLT using the top d = 3 components, using j KNN with K = 1 neighbor and leave-one-out validation to identify patients at the risk of biochemical recurrence from those who are not.
Classification accuracies obtained for fusing (j P , j M ), (j P , j A ), and (j P , j H ), with CCA, RCCA, SRCCA TT , SRCCA WRST , and SRCCA WLT using the top d = 3 components, using j RF with 50 trees and leave-one-out validation to identify patients at the risk of biochemical recurrence from those who are not.

Experiment 3
In Tables 7 and 8 we see that the maximum and median accuracies of j KNN and j RF for the 3 SRCCA variants, for fusion of (j I , j J ), were much higher than the corresponding values for PCA, PLSR, CCA or RCCA. We also see that SRCCA WLT attains a maximum classifier accuracy of 93.16% (see Table 7). In Tables 9 and 10, the 3 SRCCA variants are statistically significantly better than PCA, PLSR, CCA or RCCA even at the p = .001 level using either classifier, j KNN or j RF . We further see that SRCCA WLT tends to marginally outperform SRCCA TT and SRCCA WRST . However, given the small sample size, it is difficult to draw any definitive conclusions about which of SRCCA TT , SRCCA WRST , or SRCCA WLT might be the better SRCCA variant.
In Figures 3 and 4, we see the classification accuracies of the 7 DF methods, PCA, PLSR, CCA, RCCA, SRCCA TT , SRCCA WRST , and SRCCA WLT , over a range of d ∈ {1, 2, ..., 10} embedding components for the fusion (j P , j H ). Importantly, we see that SRCCA TT , SRCCA WRST , and SRCCA WLT all outperform PCA, PLSR, CCA and RCCA for a majority of the embedding dimensions, across both the j KNN and j RF classifiers.

Experiment 4
Figure 5 reveals that the completion time of SRCCA is significantly lower than that of RCCA. Although the difference in these times is visibly apparent, a p-value of 1.9 × 10⁻³, even with just 3 samples, indicates that SRCCA appears to be statistically significantly faster than RCCA. Note that the canonical factorization stage is the most time consuming part of the algorithm. The feature selection stage, in comparison, is not as time consuming. SRCCA TT , SRCCA WRST , and SRCCA WLT (whose results are reported in Figure 5) all have similar execution times.
Figure 2 2-dimensional representation of (j P , j A ). 2-dimensional representation of (j P , j A ) using (a) CCA, (b) RCCA, (c) SRCCA TT , (d) SRCCA WRST and (e) SRCCA WLT where the X and Y axes are the two most significant embedding components produced by the 3 different algorithms. CCA (a) suffers from lack of regularization, RCCA (b) is regularized but does not produce the best metaspace while the three variations of SRCCA (c)(d)(e) result in the best embedding components in terms of classification accuracy distinguished via best fit ellipses with one outlier.

Conclusions
In this paper, we presented a novel data fusion (DF) algorithm called Supervised Regularized Canonical Correlation Analysis (SRCCA) that, unlike CCA and RCCA, is (1) able to fuse with a feature selection (FS) scheme, (2) regularized, and (3) computationally cheap. We demonstrate how SRCCA can be used for quantitative integration and representation of multi-scale, multimodal imaging and non-imaging data. In this work we leveraged SRCCA for the purpose of constructing a fused quantitative histologic-proteomic classifier for predicting which prostate cancer patients are at risk for 5 year biochemical recurrence following surgery. We have demonstrated that SRCCA is able to (1) produce a metaspace, where the samples are more stratified than the metaspace produced by CCA or RCCA, (2) better identify patients at the risk of biochemical recurrence compared to Principal Component Analysis (PCA), Partial Least Squares Regression (PLSR), CCA or RCCA, (3) perform regularization, all the while being statistically significantly faster compared to RCCA.
Maximum classification accuracies obtained for fusing (j P , j M ), (j P , j A ), and (j P , j H ), with PCA, PLSR, CCA, RCCA, SRCCA TT , SRCCA WRST , and SRCCA WLT across d ∈ {1, 2, ..., 10} components, using two classifiers, j KNN , with K = 1, and j RF , with 50 trees, and leave-one-out validation to identify patients at the risk of biochemical recurrence from those who are not.
Figure 3 Classification accuracies of (j P , j H ) across dimensions d ∈ {1, 2, ..., 10} using the classifier j KNN . Accuracies were obtained for fusing (j P , j H ), with PCA, PLSR, CCA, RCCA, SRCCA TT , SRCCA WRST , and SRCCA WLT across d ∈ {1, 2, ..., 10} components, using j KNN , with K = 1, and leave-one-out validation to identify patients at the risk of biochemical recurrence from those who are not.
While the fused prognostic classifier for predicting biochemical recurrence in this work appears to be promising, we also acknowledge the limitations of this work: (1) As previously mentioned, the cost of mass spectrometry limited this study to only 19 datasets. By using a minimum sample size derivation model [70,71], we were able to determine that our fused SRCCA classifier would yield an accuracy of 93%, more than 95% of the time, if our dataset were expanded to 56 studies. We intend to evaluate our classifier on such a cohort in the future. (2) Ideally, a randomized cross validation strategy should have been employed for the training and evaluation of the classifier. Unfortunately, this was also limited by the size of the cohort. While both parametric and non-parametric feature selection strategies were employed in this work, the availability of a larger dataset for classification in conjunction with SRCCA would allow for employment of parametric selection strategies, assuming that the underlying distribution can be estimated. For small sample datasets, a non-parametric feature selection strategy might be more appropriate. In future work, we also plan to apply SRCCA to data fusion for other imaging and non-imaging datasets in other problem domains and applications.