### Algorithm

In this section we first explain a simple two-step procedure, based on whitening and PCA, for finding the aspects shared by the sources, and then show how the same fusion solution can equivalently be derived from the result of applying a generalized CCA to the collection. The two-step procedure provides the intuition for the approach: First *remove* the within-data variation, and then *capture* all the variation that is still remaining. The connection to CCA then demonstrates how the procedure provides a solution to the issue of combining the separate components CCA gives.

Denote a collection of *p* data sets by {**X**_{1},...,**X**_{*p*}}, where each **X**_{*i*} is an *m* × *n*_{*i*} matrix such that *m* ≫ *N*, and *N* = ∑_{*i*} *n*_{*i*}. The rows of the matrices correspond to the same objects in each set, while the columns correspond to features that need not be the same in the data sets. For example, in traditional expression analyses the rows would be genes and the columns would be conditions, treatments, time points, etc. For notational simplicity, we assume zero-mean data.

In the first step, each data set is whitened to remove all within-data correlations, and the data are scaled so that all dimensions have equal variance. The whitened version of a data matrix **X**_{*i*} is given by **Z**_{*i*} = **X**_{*i*}**W**_{*i*}, where **W**_{*i*} is the whitening matrix. The whitening matrix is simply **W**_{*i*} = **C**_{*ii*}^{-1/2}, where **C**_{*ii*} is the covariance matrix of **X**_{*i*}.
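As a concrete illustration, the whitening step can be sketched in a few lines of NumPy. This is our own minimal sketch, not the authors' implementation; the function name `whiten` is ours, and the inverse matrix square root is computed via an eigendecomposition of the covariance:

```python
import numpy as np

def whiten(X):
    """Whiten a zero-mean data matrix X (rows: objects, columns: features).

    Returns Z = X W with W = C^{-1/2}, so that the covariance of Z is the
    identity matrix, i.e. all within-data correlations are removed.
    """
    X = X - X.mean(axis=0)                   # enforce the zero-mean assumption
    C = np.atleast_2d(np.cov(X, rowvar=False))
    evals, evecs = np.linalg.eigh(C)         # C is symmetric positive definite
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # inverse matrix square root
    return X @ W, W

# After whitening, all dimensions are uncorrelated and have unit variance.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))  # correlated data
Z, W = whiten(X)
print(np.allclose(np.cov(Z, rowvar=False), np.eye(4)))  # True
```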

After each data set has been whitened, the next step is to find the shared variation in them. This is done by principal component analysis (PCA) on the columnwise concatenated whitened data sets. Since all the within-data structure PCA could extract has been removed, it can only find variation shared by at least two of the data sets, and the maximum variance directions it searches for correspond to the highest between-data correlations.

Formally, applying PCA to the columnwise concatenation **Z** = [**Z**_{1},...,**Z**_{*p*}] of the whitened data sets yields the factorization

**C**_{*Z*} = **V Λ V**^{*T*},     (1)

where the orthonormal matrix **V** contains the eigenvectors, **Λ** is a diagonal matrix of projection variances, and **C**_{*Z*} is the covariance matrix of **Z**.

Projecting **Z** onto the first *d* eigenvectors **V**_{*d*}, corresponding to the *d* largest eigenvalues, gives the *d* principal components, which form the optimal *d*-dimensional representation in terms of the shared variance. The whole data collection becomes integrated into

**P**_{*d*} = **ZV**_{*d*},     (2)

where **P**_{*d*} is of size *m* × *d* and contains a *d*-dimensional feature vector for each of the analyzed objects. The idea is then simply to use this new representation for any further analysis, which can be carried out with any method that operates on vectorial data. The whole procedure can be seen as fusing the collection of data sets into a single set, while at the same time reducing the total dimensionality of the data to find the most reliable shared effects.
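The whole two-step procedure, whitening each source, concatenating, running PCA, and keeping the top *d* projections, can be sketched as follows. This is an illustrative sketch under the stated zero-mean assumption; the function name `fuse` is our own:

```python
import numpy as np

def fuse(datasets, d):
    """Fuse data sets sharing rows (objects) into one m x d representation.

    Each data set is whitened (W_i = C_ii^{-1/2}) and the whitened sets are
    concatenated columnwise; PCA on the concatenation then captures only
    between-set variation, since within-set structure has been removed.
    """
    Zs = []
    for X in datasets:
        X = X - X.mean(axis=0)
        C = np.atleast_2d(np.cov(X, rowvar=False))
        evals, evecs = np.linalg.eigh(C)
        W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # whitening matrix
        Zs.append(X @ W)
    Z = np.hstack(Zs)                                  # m x N concatenation
    CZ = np.atleast_2d(np.cov(Z, rowvar=False))
    evals, V = np.linalg.eigh(CZ)                      # eq. (1)
    order = np.argsort(evals)[::-1]                    # largest variance first
    Vd = V[:, order[:d]]
    return Z @ Vd                                      # eq. (2): P_d = Z V_d

# Toy example: two views of a shared latent signal plus view-specific noise.
rng = np.random.default_rng(1)
s = rng.standard_normal((300, 1))                      # shared component
X1 = np.hstack([s, rng.standard_normal((300, 2))])
X2 = np.hstack([-s, rng.standard_normal((300, 3))])
P = fuse([X1, X2], d=2)
print(P.shape)  # (300, 2)
```

The first fused component recovers the shared signal *s* (up to sign), while the view-specific noise dimensions are pushed to the trailing components.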

As mentioned in the *Background* section, the above two-step procedure is equivalent to running CCA on the collection and summing the separate components from each source. The connection is derived here for two data sets. The proof extends easily to several data sets, for one of the many alternative generalizations of CCA.

CCA is a method for finding linear projections of two sets of variables such that the correlation between the projections is maximal. CCA is often formulated as a generalized eigenvalue problem

$$\begin{bmatrix} \mathbf{C}_{11} & \mathbf{C}_{12} \\ \mathbf{C}_{21} & \mathbf{C}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{u}_{1} \\ \mathbf{u}_{2} \end{bmatrix} = \lambda \begin{bmatrix} \mathbf{C}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{u}_{1} \\ \mathbf{u}_{2} \end{bmatrix}, \qquad (3)$$

where **C**_{*ij*} denotes the (cross-)covariance of **X**_{*i*} and **X**_{*j*}. The eigenvalues *λ* of the solution appear as pairs 1 + *ρ*_{1}, 1 - *ρ*_{1},...,1 + *ρ*_{*p*}, 1 - *ρ*_{*p*}, 1,...,1, where *p* = min(*n*_{1}, *n*_{2}), and (*ρ*_{1},...,*ρ*_{*p*}) are the canonical correlations. The canonical weights corresponding to the canonical correlations are $\mathbf{u}^{i} = [\mathbf{u}_{1}^{i}; \mathbf{u}_{2}^{i}]$, *i* = 1,...,*p*.
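The pairing of the eigenvalues with the canonical correlations can be verified numerically. In the sketch below (our own illustration, not from the paper), the generalized problem is solved as a standard eigenvalue problem of $\mathbf{D}^{-1}\mathbf{C}$, and the canonical correlations are obtained independently as the singular values of $\mathbf{C}_{11}^{-1/2}\mathbf{C}_{12}\mathbf{C}_{22}^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n1, n2 = 1000, 3, 4
s = rng.standard_normal((m, 1))                    # one shared signal
X1 = np.hstack([s + 0.5 * rng.standard_normal((m, 1)),
                rng.standard_normal((m, n1 - 1))])
X2 = np.hstack([s + 0.5 * rng.standard_normal((m, 1)),
                rng.standard_normal((m, n2 - 1))])

X = np.hstack([X1, X2])
C = np.cov(X, rowvar=False)                        # full covariance
C11, C22 = C[:n1, :n1], C[n1:, n1:]
C12 = C[:n1, n1:]
D = np.zeros_like(C)
D[:n1, :n1], D[n1:, n1:] = C11, C22                # block-diagonal part

# Generalized eigenvalues of C u = lambda D u.
lam = np.sort(np.linalg.eigvals(np.linalg.inv(D) @ C).real)

# Canonical correlations via SVD of C11^{-1/2} C12 C22^{-1/2}.
def inv_sqrt(A):
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(w ** -0.5) @ Q.T

rho = np.linalg.svd(inv_sqrt(C11) @ C12 @ inv_sqrt(C22), compute_uv=False)

# Eigenvalues come in pairs 1 +/- rho_i, padded with ones.
expected = np.sort(np.concatenate(
    [1 + rho, 1 - rho, np.ones(n1 + n2 - 2 * len(rho))]))
print(np.allclose(lam, expected))  # True
```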

In conventional use of CCA we are usually interested in the correlations, the canonical weights **u**^{*i*}, and the canonical scores, defined as the projections of **X**_{1} and **X**_{2} onto the corresponding canonical weights. Next we show how the combined data set (2) can be obtained from the canonical scores, thus providing a way of using CCA to find a single representation that captures the dependencies.

For a single component, (1) can be equivalently written as

$$\begin{bmatrix} \tilde{\mathbf{C}}_{11} & \tilde{\mathbf{C}}_{12} \\ \tilde{\mathbf{C}}_{21} & \tilde{\mathbf{C}}_{22} \end{bmatrix} \mathbf{v} = \alpha \mathbf{v},$$

where *α* is the variance, **v** is the corresponding principal component, and $\tilde{\mathbf{C}}_{ij}$ denotes the (cross-)covariance of **Z**_{*i*} and **Z**_{*j*}. Due to the whitening, the $\tilde{\mathbf{C}}_{11}$ and $\tilde{\mathbf{C}}_{22}$ are identity matrices. We can alternatively write $\tilde{\mathbf{C}}_{12} = \mathbf{W}_{1}^{T}\mathbf{C}_{12}\mathbf{W}_{2}$ and $\tilde{\mathbf{C}}_{21} = \mathbf{W}_{2}^{T}\mathbf{C}_{21}\mathbf{W}_{1}$, leading to

$$\begin{bmatrix} \mathbf{0} & \mathbf{W}_{1}^{T}\mathbf{C}_{12}\mathbf{W}_{2} \\ \mathbf{W}_{2}^{T}\mathbf{C}_{21}\mathbf{W}_{1} & \mathbf{0} \end{bmatrix} \mathbf{v} = (\alpha - 1)\mathbf{v},$$

where **Iv** has been subtracted from both sides. Equivalently,

$$\begin{bmatrix} \mathbf{0} & \mathbf{C}_{12} \\ \mathbf{C}_{21} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{W}_{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{W}_{2} \end{bmatrix} \mathbf{v} = (\alpha - 1) \begin{bmatrix} \mathbf{W}_{1}^{T} & \mathbf{0} \\ \mathbf{0} & \mathbf{W}_{2}^{T} \end{bmatrix}^{-1} \mathbf{v}. \qquad (4)$$

Let us denote $\begin{bmatrix} \mathbf{W}_{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{W}_{2} \end{bmatrix}$ by diag[**W**_{1}, **W**_{2}], and multiply the right hand side of (4) by the identity matrix **I** = diag[**W**_{1}, **W**_{2}]^{-1} diag[**W**_{1}, **W**_{2}]. On the right side of the equation we then have the term diag[**W**_{1}^{T}, **W**_{2}^{T}]^{-1} diag[**W**_{1}, **W**_{2}]^{-1} = diag[**C**_{11}, **C**_{22}] based on the definition of the whitening matrix, and thus (4) can be written as

$$\begin{bmatrix} \mathbf{0} & \mathbf{C}_{12} \\ \mathbf{C}_{21} & \mathbf{0} \end{bmatrix} \mathbf{u} = (\alpha - 1) \begin{bmatrix} \mathbf{C}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_{22} \end{bmatrix} \mathbf{u}, \qquad (5)$$

where **u** = diag[**W**_{1}, **W**_{2}]**v**. Adding $\mathrm{diag}[\mathbf{C}_{11}, \mathbf{C}_{22}]\mathbf{u}$ to both sides gives an equation structurally identical to (3). Both methods thus lead to the same eigenvalues, i.e. *λ* = *α*, and the eigenvectors are related by the linear transformation **u** = diag[**W**_{1}, **W**_{2}]**v**.

The combined representation (2) of *d* dimensions can be written in terms of the canonical scores as

$$\mathbf{P}_{d} = \mathbf{Z}\mathbf{V}_{d} = [\mathbf{X}_{1}, \mathbf{X}_{2}]\,\mathrm{diag}[\mathbf{W}_{1}, \mathbf{W}_{2}]\,\mathbf{V}_{d} = [\mathbf{X}_{1}, \mathbf{X}_{2}] \begin{bmatrix} \mathbf{U}_{1,d} \\ \mathbf{U}_{2,d} \end{bmatrix} = \mathbf{X}_{1}\mathbf{U}_{1,d} + \mathbf{X}_{2}\mathbf{U}_{2,d},$$

where **U**_{1,*d*} and **U**_{2,*d*} are the first *d* canonical directions of the two data sets.
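This equivalence is easy to check numerically. The sketch below (our own illustration) computes the fused representation both ways, whitening followed by PCA, and summing canonical scores from a direct SVD-based CCA solution, and confirms that the columns agree up to sign and scale:

```python
import numpy as np

def inv_sqrt(A):
    """Inverse matrix square root via eigendecomposition (A symmetric PD)."""
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(w ** -0.5) @ Q.T

rng = np.random.default_rng(3)
m = 500
s = rng.standard_normal((m, 2))                          # shared signals
X1 = np.hstack([s, rng.standard_normal((m, 2))])
X2 = np.hstack([s + 0.3 * rng.standard_normal((m, 2)),
                rng.standard_normal((m, 3))])
X1 -= X1.mean(axis=0); X2 -= X2.mean(axis=0)

C11 = np.cov(X1, rowvar=False); C22 = np.cov(X2, rowvar=False)
C12 = (X1.T @ X2) / (m - 1)
W1, W2 = inv_sqrt(C11), inv_sqrt(C22)                    # whitening matrices

# Route 1: whitening followed by PCA (the two-step procedure).
Z = np.hstack([X1 @ W1, X2 @ W2])
evals, V = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(evals)[::-1]
P = Z @ V[:, order[:2]]                                  # P_d = Z V_d, d = 2

# Route 2: CCA via SVD of W1 C12 W2, then summing the canonical scores.
U, rho, Vt = np.linalg.svd(W1 @ C12 @ W2)
S = X1 @ (W1 @ U[:, :2]) + X2 @ (W2 @ Vt.T[:, :2])       # X1 U_1d + X2 U_2d

# The two fused representations agree up to sign and scale per column.
for j in range(2):
    print(round(abs(np.corrcoef(P[:, j], S[:, j])[0, 1]), 6))  # ~1.0
```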

CCA can be generalized to more than two data sets in several ways [8], and the two-step procedure described here is equivalent to the one formulated as solving a generalized eigenvalue problem **Cu** = *λ***Du**, where **C** is the covariance matrix of the columnwise concatenation of the **X**_{*i*} and **D** is a block-diagonal matrix having the dataset-specific covariance matrices **C**_{*ii*} on its diagonal. Here **u** is a row-wise concatenation of the canonical weights corresponding to the different data sets. The proof follows along the same lines as for two data sets, and again the combined data set for any *d* dimensions can be written in terms of the generalized CCA results as

$$\mathbf{P}_{d} = \sum_{i=1}^{p} \mathbf{X}_{i}\mathbf{U}_{i,d},$$

where each **U**_{*i,d*} contains the *d* eigenvectors corresponding to the *d* largest eigenvalues.

In summary, the simple linear preprocessing method of whitening followed by PCA equals computing the generalized CCA on a collection of data sets and summing the canonical scores of the data sets. In practice it does not matter which way the result is obtained, but the two-step procedure illustrates more clearly why this kind of approach is useful for data integration. Furthermore, it is not limited to linear projections, and the same motivation could be extended to different kinds of models. In practice, however, implementing the first step might be difficult in more complex models.

#### Choice of dimensionality

The dimensionality of the projection can be chosen to be fixed, such as two or three for visualization, or alternatively an "optimal" dimensionality can be sought. In this section we introduce our suggestion for optimizing the dimensionality. Intuitively, the dimensionality should be high enough to preserve most of the shared variation and yet low enough to avoid overfitting. The first few components contain most of the reliable shared variation among the data sets, while the last components may actually represent just noise, and thus dropping some of the dimensions makes the method more robust.

The maximum dimensionality is the sum of the dimensionalities of the data sets, but in practice a considerably smaller dimensionality is often sufficient, and in fact leads to a better representation due to suppression of noise. Note also that in the case of two data sets the number of unique projections is only the minimum of the two data dimensionalities.

In a nutshell, we increase the dimensionality one at a time, testing with a randomization test whether the new dimension captures shared variation. To protect against overfitting, all estimates of captured variation are computed on a validation set, i.e., on data that has not been used when computing the components (dimensions). The randomization test essentially compares the shared variance along the new dimension to the shared variance we would obtain under the null hypothesis of mutually independent data sets. When the shared variance no longer differs significantly from the null hypothesis, the final dimensionality has been reached.

To compute the shared variance of the original data, we divide the data into training data, **X**^{*t*}, and validation data, **X**^{*v*}. The two-step procedure described in the *Algorithm* subsection is applied to the training data to compute the eigenvectors **V**^{*t*} and the whitening matrix **W**^{*t*}, where **W**^{*t*} is a block-diagonal matrix containing the whitening matrices for each matrix in the training data. The fused representation for the validation data is computed as **P**^{*v*} = **X**^{*v*}**W**^{*t*}**V**^{*t*}, where **X**^{*v*} is the columnwise concatenation of the validation data matrices. The variance in the fused representation is now our estimate of the shared variance. We average the estimate over 3 different splits into training and validation sets.

To compute the shared variance under the null hypothesis, random data sets are created from the multivariate normal distribution with a diagonal covariance matrix whose diagonal values equal the columnwise variances of the original data. The shared variance for the random data is computed in the same way as described above. We repeat the process for 50 randomly created data sets.

The shared variance in the original data is then compared to the distribution of shared variances under the null hypothesis, starting from the first dimension. When a dimension no longer differs significantly (we used a 2% significance level), we have arrived at the "optimal" dimensionality and the remaining dimensions are discarded.
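The selection procedure can be sketched end-to-end as follows. This is our own simplified illustration (a single train/validation split rather than the 3 averaged splits; function names are ours), using the constants stated in the text, 50 null replicates and a 2% level:

```python
import numpy as np

def block_diag(blocks):
    """Block-diagonal matrix from a list of square matrices."""
    n = sum(B.shape[0] for B in blocks)
    M = np.zeros((n, n))
    i = 0
    for B in blocks:
        k = B.shape[0]
        M[i:i + k, i:i + k] = B
        i += k
    return M

def train_fusion(datasets):
    """Two-step procedure on training data: whitening matrix + eigenvectors."""
    Ws = []
    for X in datasets:
        C = np.atleast_2d(np.cov(X, rowvar=False))
        w, Q = np.linalg.eigh(C)
        Ws.append(Q @ np.diag(w ** -0.5) @ Q.T)        # W_i = C_ii^{-1/2}
    W = block_diag(Ws)                                 # W^t, block-diagonal
    Z = np.hstack(datasets) @ W
    evals, V = np.linalg.eigh(np.cov(Z, rowvar=False))
    return W, V[:, np.argsort(evals)[::-1]]            # V^t, variance-ordered

def shared_variances(train_sets, valid_sets):
    """Per-dimension variance of the fused validation representation."""
    W, V = train_fusion(train_sets)
    P = np.hstack(valid_sets) @ W @ V                  # P^v = X^v W^t V^t
    return P.var(axis=0)

def choose_dimensionality(datasets, n_null=50, alpha=0.02, seed=0):
    rng = np.random.default_rng(seed)
    m = datasets[0].shape[0]
    half = m // 2
    train = [X[:half] for X in datasets]               # one split for brevity
    valid = [X[half:] for X in datasets]
    observed = shared_variances(train, valid)
    null = []                # null: independent normal data, matched variances
    for _ in range(n_null):
        rand = [rng.standard_normal(X.shape) * X.std(axis=0) for X in datasets]
        null.append(shared_variances([R[:half] for R in rand],
                                     [R[half:] for R in rand]))
    null = np.array(null)
    d = 0
    while d < observed.size:
        # one-sided randomization p-value for dimension d
        p = (1 + np.sum(null[:, d] >= observed[d])) / (1 + n_null)
        if p > alpha:
            break
        d += 1
    return d

# Toy data with one genuinely shared direction (centered before use):
rng = np.random.default_rng(5)
s = rng.standard_normal((400, 1))
X1 = np.hstack([s, rng.standard_normal((400, 3))]); X1 -= X1.mean(axis=0)
X2 = np.hstack([s, rng.standard_normal((400, 4))]); X2 -= X2.mean(axis=0)
d = choose_dimensionality([X1, X2])
print("chosen dimensionality:", d)  # small, typically 1
```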

Note that assuming normally distributed data under the null hypothesis is consistent with the assumptions implicitly made by CCA. The underlying task is to capture all statistical dependencies in the new representation, and finding correlations (as CCA does) is equivalent to that only for normally distributed data. For considerably non-normal data the choice of dimensionality may not be optimal, but neither is the method itself. It is therefore advisable to transform the data so that it roughly follows a normal distribution (for instance by taking the logarithm of gene expression values).