We proposed a method to compare associations found between two high-dimensional data sets, for two groups of samples. Our method models associations between features in one data set and sets of covariates in the other, thus facilitating the search for markers that are related to a set of features.

Our method uses all data at the same time in the model, thus being less susceptible to small sample sizes in (at least) one of the groups, compared with separate analysis per group. In addition, we consider associations between probes and gene sets. These characteristics make for a powerful method that finds robust differences in associations.

To the best of our knowledge, we are the first to suggest an integrated analysis method for comparing association patterns between two groups of samples. The most closely related method seems to be the one proposed by Artmann et al. [27], as they consider two high-dimensional data sets and a grouping factor. Artmann et al. [27] essentially proposed to first look for differential behavior between features in the two groups of samples, per data set separately. Subsequently, results are combined via meta-analysis. Their proposed method is in the context of microRNA and mRNA analysis, so their meta-analysis involves connecting microRNAs to a set of possible targets. So there is no actual joint analysis of the two data sets, but rather a combination of results of two separate analyses.

Our method does not require that features are differentially valued between the two groups of samples. Indeed, we argue that this is not necessary and, in our view, not even desirable. We have in fact made sure that dSIM results cannot be driven merely by differences in the distribution of the dependent data between the two sample groups. In our view, such differences in a single data set would not necessarily mean different associations between sample groups, and should not be seen by the method as such either. Furthermore, the lack of differential values does not rule out differential associations, so requiring differential values would restrict the results too much.

This makes biological sense. Take for example the setup in the TCGA data example. It is entirely possible that DNA copy number behaves in the same way, and displays the same distribution, in both the ER-positive and ER-negative groups, and yet is associated differently with gene expression sets in those groups. This could happen, for instance, because another molecular mechanism comes into play. We feel that an open-minded analysis should be able to pick up those effects.

A unique feature of dSIM is that it corrects for the baseline association before looking for differential associations between groups of samples. Thus, the association effects common to both groups are eliminated, and only those association patterns left over in the residuals are analyzed. In subsection ‘Correcting for the baseline association’, we describe the steps taken by dSIM to correct for the baseline association using ridge regression. The rationale behind the use of the ridge penalty is that the metric for the global test statistic is the same as that of ridge. Therefore, the entire method makes use of the same metric.

The residual effects studied by dSIM are relatively small as they are obtained after removing the large association effects during baseline association correction. These effects may weaken further due to overfitting of the ridge penalized model, making them harder to detect. Hence, for detecting these weak effects, it is important to minimize the loss of information during baseline association correction. We ensure this by optimizing the tuning parameter *λ* for ridge penalization using leave-one-out cross-validation. Then the leave-one-out cross-validated predictions, corresponding to the selected *λ* value, are used to get the residuals. Unlike the traditional way of using the fitted predictions, we use the cross-validated predictions, as leaving out a sample during cross-validation makes the estimated coefficients unbiased towards that sample. This, in turn, avoids overfitting of the model when predicting the outcome for the left out sample, hence minimizing the loss while obtaining the residuals.
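The cross-validated residual step described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the dSIM implementation. The function name `loo_ridge_residuals`, the candidate grid `lambdas`, the use of mean squared leave-one-out error as the selection criterion, and the assumption that `X` and `y` are already centered (so no intercept is fitted) are ours. The sketch relies on the standard identity that, for a fixed *λ* > 0, the leave-one-out residual of sample *i* equals its fitted residual divided by 1 - *h*_{ii}, where *h*_{ii} is the *i*-th diagonal element of the ridge hat matrix, so no explicit refitting is needed.

```python
import numpy as np


def loo_ridge_residuals(X, y, lambdas):
    """Select lambda by leave-one-out CV and return the LOO residuals.

    Assumptions (not taken from the paper): X (n x p) and y (n,) are
    already centered, no intercept is fitted, and every lambda is > 0.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    # Dual (kernel) form, convenient when p is much larger than n.
    K = X @ X.T
    best = None
    for lam in lambdas:
        # Ridge hat matrix H = K (K + lam I)^{-1}; K and its inverse commute.
        H = np.linalg.solve(K + lam * np.eye(n), K)
        fitted_resid = y - H @ y
        # Exact leave-one-out residuals for fixed lambda.
        loo_resid = fitted_resid / (1.0 - np.diag(H))
        score = np.mean(loo_resid ** 2)
        if best is None or score < best[0]:
            best = (score, lam, loo_resid)
    _, best_lam, best_resid = best
    return best_lam, best_resid
```

Calling `loo_ridge_residuals(X, y, [0.1, 1.0, 10.0])` returns the selected penalty and the corresponding cross-validated residuals, which would then play the role of the residuals passed on to the differential association test.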

The fact that we test a large number of copy number probes simultaneously requires choosing an appropriate multiple testing correction method. As the dSIM p-values are generated using permutation testing, the traditional methods for controlling the FWER, such as Bonferroni, Holm or Hommel, are not very useful. Firstly, they are too conservative and secondly, they do not take into account the dependence structure of the data [28, 29]. The less stringent FDR control methods such as Benjamini-Hochberg also do not fully exploit the information concerning dependence in the dataset, and hence suffer from loss of power. Therefore, we use Meinshausen’s multiple testing correction method, which not only takes into account the dependence structure of the permuted p-values, but is also more powerful if the effects are small and spread over a larger region (as in copy number data). Since we consider copy number as dependent data, using Meinshausen’s procedure ensures detection of subtle yet significant effects, as in the case of chromosome arm 1q. Another option is the multiple testing correction method of Westfall and Young [30]. Like Meinshausen’s approach, it takes into account the dependence structure of the data and is useful for working with permuted p-values when gene expression is considered as the dependent data.
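To illustrate how a permutation-based correction can exploit dependence between tests, the following sketch computes single-step maxT adjusted p-values in the spirit of Westfall and Young [30]. It is neither Meinshausen’s procedure nor the code used in this paper. The function name `maxt_adjusted_pvalues`, the input layout (one row of test statistics per permutation), the convention that larger statistics are more extreme, and the "+1" correction that avoids zero p-values are all assumptions made for illustration.

```python
import numpy as np


def maxt_adjusted_pvalues(t_obs, t_perm):
    """Single-step maxT adjusted p-values (illustrative sketch).

    t_obs  : shape (m,), observed test statistics, larger = more extreme.
    t_perm : shape (B, m), the same statistics recomputed under B
             permutations, with each permutation applied to all m tests
             jointly so that their dependence is preserved.
    """
    t_obs = np.asarray(t_obs, dtype=float)
    t_perm = np.asarray(t_perm, dtype=float)
    # Largest statistic across all tests within each permutation.
    max_per_perm = t_perm.max(axis=1)
    # For each test, count permutations whose maximum reaches its observed value.
    exceed = (max_per_perm[:, None] >= t_obs[None, :]).sum(axis=0)
    # The +1 terms count the observed data as one of the permutations.
    return (1 + exceed) / (1 + t_perm.shape[0])
```

Because each observed statistic is compared with the permutation distribution of the maximum over all tests, strongly correlated tests are penalized less than under a Bonferroni-type correction, which treats them as if they were independent.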

In this paper we focus on analyzing copy number regulated gene expression, where the gene sets are defined on the basis of their genomic locations. The method can be easily extended to analyze other types of genomic data and gene sets as well. For example, there could be interest in finding group-specific interaction effects between microRNA and mRNA expression, while looking at pathway-specific genes. It is also possible to invert the model given in (1) to have copy number as independent data and gene expression as dependent data. However, we should indicate that the ridge penalty does not exploit the inherent spatial correlation structure found in the natural ordering of the copy number probes. In the case where copy number data is the independent data, one possible option would be to use a fused lasso penalty instead of ridge.

Another interesting extension of this model is to consider more than two groups of samples. The extension involves complex steps, as the number of degrees of freedom goes up from one to *n*_{G} - 1, where *n*_{G} is the number of groups. As the association patterns are then compared between more than two groups, the number of interaction terms (*γ*) in the model increases. One possible way to compare multiple groups is to use the method proposed in this paper for performing pairwise comparisons. This problem is beyond the scope of this paper and will be dealt with elsewhere.

Our method, dSIM, is based on a linear model in which we assume that there is a linear relationship between the two genomic datasets. Consequently, non-linear associations present between the datasets may not be detected by dSIM. We also assume the distribution of the errors to be normal. In cases with non-normal random errors, one can consider transforming the data before using dSIM. Another assumption made by dSIM is in using the global test for testing the null hypothesis *H*_{0}: *θ*^{2} = 0, where *θ*^{2} is the variance of the distribution for {*γ*_{k}}. The permutation test not only tests for *H*_{0}: *θ*^{2} = 0 but also for *δ* = 0, where *δ* is the intercept. The correct size of the test is therefore not guaranteed if *θ*^{2} = 0 but *δ* ≠ 0. However, because we use a global test for the null hypothesis *H*_{0}: *θ*^{2} = 0 internally in the permutation, it is unlikely that the permutation procedure has any serious power against *H*_{0}: *δ* = 0. In practice, dSIM is not sensitive to those copy number variations that do not affect gene expression levels.

In summary, we developed a method to test whether associations between copy number and gene expression differ between two groups of samples. Through several simulation studies, we showed the robustness of dSIM under various conditions. Application of dSIM to the TCGA and NKI breast cancer datasets highlights the importance of having all samples together in the model.