 Research
 Open access
 Published:
Spatial rankbased multifactor dimensionality reduction to detect gene–gene interactions for multivariate phenotypes
BMC Bioinformatics volume 22, Article number: 480 (2021)
Abstract
Background
Identifying interaction effects between genes is one of the main tasks of genomewide association studies aiming to shed light on the biological mechanisms underlying complex diseases. Multifactor dimensionality reduction (MDR) is a popular approach for detecting gene–gene interactions that has been extended in various forms to handle binary and continuous phenotypes. However, only few multivariate MDR methods are available for multiple related phenotypes. Current approaches use Hotelling’s T^{2} statistic to evaluate interaction models, but it is well known that Hotelling’s T^{2} statistic is highly sensitive to heavily skewed distributions and outliers.
Results
We propose a robust approach based on nonparametric statistics such as spatial signs and ranks. The new multivariate rankbased MDR (MRMDR) is mainly suitable for analyzing multiple continuous phenotypes and is less sensitive to skewed distributions and outliers. MRMDR utilizes fuzzy kmeans clustering and classifies multilocus genotypes into two groups. Then, MRMDR calculates a spatial ranksum statistic as an evaluation measure and selects the best interaction model with the largest statistic. Our novel idea lies in adopting nonparametric statistics as an evaluation measure for robust inference. We adopt tenfold crossvalidation to avoid overfitting. Intensive simulation studies were conducted to compare the performance of MRMDR with current methods. Application of MRMDR to a real dataset from a Korean genomewide association study demonstrated that it successfully identified genetic interactions associated with four phenotypes related to kidney function. The R code for conducting MRMDR is available at https://github.com/statpark/MRMDR.
Conclusions
Intensive simulation studies comparing MRMDR with several current methods showed that the performance of MRMDR was outstanding for skewed distributions. Additionally, for symmetric distributions, MRMDR showed comparable power. Therefore, we conclude that MRMDR is a useful multivariate nonparametric approach that can be used regardless of the phenotype distribution, the correlations between phenotypes, and sample size.
Background
Many attempts have been made to identify susceptible genes that influence the risk of complex diseases such as autism, hypertension, and diabetes [1,2,3]. Analyzing a single locus is not enough to understand the pathophysiology of complex diseases and results in the socalled missing heritability problem. To overcome this problem, several studies have sought to identify gene–gene interactions (GGIs) or geneenvironmental interactions [4,5,6].
As a nonparametric modelfree approach, multifactor dimensionality reduction (MDR) has been widely applied for detecting GGIs [5]. For binary phenotypes, such as those analyzed in case–control studies, MDR divides highdimensional genotype combinations into a onedimensional variable with two groups (highrisk and lowrisk), according to whether the ratio of cases to controls exceeds a threshold. Then it finds the interaction model that best predicts the disease status. Balanced accuracy can be used for an evaluation measure [7]. To prevent overfitting, kfold crossvalidation (CV) can be applied. Crossvalidation consistency (CVC), or the number of times each singlenucleotide polymorphism (SNP) combination is chosen as best, is obtained during the kfold CV process. The SNP combination with the highest CVC is defined as the final best interaction model [8]. MDR has several advantages: i) the dimensions of the data are effectively reduced, ii) no specific genetic model is assumed, and iii) highorder interactions can be identified, even if there are no significant main effects [9, 10].
Since its introduction, many studies have been conducted to broaden the scope of application of MDR. According to Gola et al. [4], about 800 MDRrelated studies were found as of February 2014 when searching PubMed and Google Scholar. For discrete traits, loglinear models MDR and robust MDR have been proposed to handle data with empty or sparse cells [11, 12]. Odds ratio MDR was proposed, replacing the naïve classifier with the odds ratio [9] and optimal MDR replaced the fixed threshold with a datadriven threshold using an ordered combinatorial partitioning algorithm [13]. As a method of dealing with continuous traits, generalized MDR (GMDR) was proposed. GMDR can handle both dichotomous and continuous phenotypes and can adjust for covariates [14]. Quantitative MDR (QMDR) for continuous traits uses the sample mean of each genotype combination as a classifier, reducing the computing time with comparable performance [15]. Recently, clusterbased MDR (CLMDR) has been proposed as a method that is less sensitive to outliers and distributional assumptions [16]. For survival time with censored data, SurvMDR was proposed, which uses the logrank test statistic to classify the cells of a multifactor combination [17]. CoxMDR and accelerated failure time MDR are extended versions of GMDR for the survival phenotype based on Cox regression and the accelerated failure time model, respectively [18, 19]. Recently, Kaplan–Meier MDR was also developed, which uses the median Kaplan–Meier survival time as a classifier [20].
As described above, many studies have been conducted to identify genetic interactions associated with single phenotype, but only a few studies have been done on methodologies to treat multiple phenotypes. For complex diseases, several correlated traits are often measured together. For example, weight, the waisthip ratio, and the body mass index (BMI) can be jointly measured as obesityrelated traits. The power to detect associations between genes and these traits is expected to increase if the multivariate approach is adopted [21]. Therefore, more research on multivariate methods detecting GGIs is needed. Recently, multivariate generalized MDR (GEEGMDR) extended GMDR to the multivariate case by constructing generalized estimation equation models [22]. GEEGMDR provides stable results, but it does not always show higher power than GMDR [23]. Multivariate QMDR (multiQMDR) extended QMDR using principal component analysis scores and Hotelling’s T^{2} statistic as a classifier and an evaluation measure instead [23]. MultiQMDR gives a high CVC and stable results, but it is prone to outliers or influencing points. More recently, multivariate clusterbased MDR (multiCMDR) has been proposed [24]. MultiCMDR applies fuzzy kmeans clustering to discriminate between high and lowrisk groups and uses Hotelling’s T^{2} statistic for evaluation. MultiCMDR is less sensitive to outliers and nonsymmetric distributions.
While MDR is a nonparametric approach, all these methods use parametric test statistics as evaluation measures based on a multivariate normal distribution or exponential family distribution. Instead of using parametric approach, this study considers nonparametric evaluation measures for testing the equality of two multivariate populations. Various methods based on multivariate ranks or distances between the pairs of individual observations have been studied [25]. Note that signs and ranks in a univariate case are based on the ordering of the data. Unfortunately, however, there is no natural ordering of the data for a multivariate case. To extend the concept of rank to a multivariate case, several principles have been considered. First, the methods using interdirection were introduced [26, 27]. Interdirection is a measure based on the angular distance between two observation vectors relative to the rest of the data [28]. Second, the tests based on data depth were proposed [29,30,31]. A data depth measures how deep a multivariate sample lies in the data cloud [32]. Any function which provides a reasonable centraloutward ordering of points in multidimensional space can be considered as a depth function [33]. Third, multivariate extensions using spatial signs and ranks were also studied [34, 35]. Affine invariant methods based on spatial sign and rank vectors for various multivariate problems were proposed [34]. More recently, for highdimensional data, a nonparametric multivariate test using spatial signs [36] and a spatial ranks test for two samples were proposed [37]. Among various approaches and measures, we chose the spatial ranksum statistic as the nonparametric evaluation measure in this study, because it is one of the most widely statistics and implemented with R program.
We propose a new nonparametric multivariate approach for identifying genetic interactions. We call the proposed method multivariate rankbased MDR (MRMDR). During classification process, MRMDR utilizes the fuzzy kmeans clustering analysis with a noise cluster as in multiCMDR. For the evaluation process, MRMDR calculates the spatial ranksum statistic as an evaluation measure and selects the best interaction model with the largest statistic. The tenfold CV method is adopted and the final best interaction model is determined by maximum CV consistency.
This manuscript is organized as follows. We first start with an introduction of the spatial rank statistic. The algorithm of the proposed MRMDR method is then described in detail. We then present the results of intensive simulation studies to investigate the performance of the proposed method. Our method is compared with multiQMDR and multiCMDR. We applied the proposed MRMDR method to data on four multivariate phenotypes related to kidney function obtained from the Korean Genome and Epidemiology Study (KoGES): blood urea nitrogen (BUN), serum creatinine, urinary albumin, and urinary red blood cell (RBC) levels. MRMDR successfully identified genetic interactions associated with these four phenotypes.
Methods
Nonparametric multivariate rank test
We first introduce the multivariate nonparametric test used for evaluation. To detect a twosample location shift in univariate analysis, the twosample ttest is popularly used when the response variable is normally distributed. The Mann–Whitney test based on the rank sum is well known as a nonparametric counterpart of the twosample univariate ttest. Various robust univariate nonparametric tests have been developed for the twosample location problem [38].
For multivariate analysis, a classical approach such as Hotelling’s T^{2} is a popular parametric approach. T^{2} has optimal power under the assumption of a multivariate normal distribution. However, it performs poorly in the case of heavytailed distributions and is highly sensitive to outliers [39]. As an alternative, we consider a nonparametric approach based on spatial signs and ranks. We start with the definition of spatial sign and spatial rank.
Let \({\mathbf{Y}} = ({\mathbf{y}}_{{\mathbf{1}}} ,...,{\mathbf{y}}_{{\mathbf{n}}} )^{\prime }\) be an n × p data matrix with n observations and p variables. The spatial sign function and spatial rank function are defined as follows [28].
where ave_{j} means the average taken over all observations for j = 1,…n, y is the Euclidean distance of y from 0, and y and y_{j} are pvariate vectors [28]. The observed spatial signs s_{i} and observed centered spatial ranks r_{i} are defined as
and
respectively for i, j = 1, …, n. Here, \({\mathbf{s}}_{ij} = {\mathbf{S}}({\mathbf{y}}_{i}  {\mathbf{y}}_{j} )\), ave{r_{i}} = 0.
To make an affineinvariant test statistic, we can apply the spatial sign function to the transformed data points. The test statistic \(T({\mathbf{y}}_{1} ,...,{\mathbf{y}}_{n} )\) is said to be affineinvariant if \(T({\mathbf{y}}_{1} ,...,{\mathbf{y}}_{n} ) = T(D{\mathbf{y}}_{1} ,...,D{\mathbf{y}}_{n} )\) for every p × p nonsingular matrix D and for every pvariate dataset \({\mathbf{y}}_{1} ,...,{\mathbf{y}}_{n}\) [34]. Affineinvariant spatial signs and ranks are obtained by transforming y_{i} to A_{y}y_{i},
where A_{y} is Tyler’s transformation, which makes the spatial sign covariance matrix proportional to the identity matrix, that is, \(ave\{ {\mathbf{r}}_{i}^{{*{\prime }}} {\mathbf{r}}_{i}^{*} \} = [c_{y}^{2} /p]I_{p}\), where \(c_{y}^{2} = ave\{ {\mathbf{r}}_{i}^{*} ^{2} \}\). A_{y} can be obtained during the iterative process and chosen so that \(trace(A_{y}^{\prime } A_{y} ) = p\) [34, 40, 41]. The ranks \({\mathbf{r}}_{i}^{*}\) lie in the unit psphere. The direction of \({\mathbf{r}}_{i}^{*}\) roughly points outward from the center of the data cloud and its length tells how far away this point is from the center of the data cloud [42].
Next, for the twosamples location problem, let \({\mathbf{Y}}_{1} = ({\mathbf{y}}_{1} ,...,{\mathbf{y}}_{{{\mathbf{n}}_{1} }} )^{\prime }\) and \({\mathbf{Y}}_{2} = ({\mathbf{y}}_{{{\mathbf{n}}_{1} + 1}} ,...,{\mathbf{y}}_{{{\mathbf{n}}_{1} + {\mathbf{n}}_{2} }} )^{\prime }\) be two independent samples with p variables that have the cumulative distribution functions \(F({\mathbf{x}}  {{\varvec{\upmu}}})\) and \(F({\mathbf{x}}  {{\varvec{\upmu}}}  {{\varvec{\Delta}}})\), respectively. To test the null hypothesis of no differences in location, H_{0}:Δ:0, the multivariate version of Mann–Whitney test statistic U^{2} can be used. For the combined sample Y = [Y_{1}:Y_{2}], the affinetransformed spatial signs of the transformed differences \({\mathbf{s}}_{ij}^{*} = {\mathbf{S}}(A_{y} ({\mathbf{y}}_{i}  {\mathbf{y}}_{j} ))\) and spatial ranks \({\mathbf{r}}_{i}^{*} = ave_{j} \{ {\mathbf{S}}_{ij}^{*} \}\) are obtained. The multivariate Mann–Whitney test statistic U^{2} is given by
where \({\overline{\mathbf{r}}}_{i}^{*}\) is the samplewise mean vector of the spatial ranks \({\mathbf{r}}_{i}^{*}\) and \(c_{y}^{2} = ave\{ {\mathbf{r}}_{i}^{*} ^{2} \}\) [34]. The pvalue is obtained by \(E_{\eta } (I(U_{\gamma }^{2} \ge U^{2} ))\), where \(E_{\eta } ( \cdot )\) represents expectation value where \(\eta = (\eta_{1} ,...,\eta_{N} )\) is uniformly distributed over N! permutations of \(\left( {1, \ldots ,N} \right)\) and \(U_{\gamma }^{2}\) is the value of the test statistic of a permuted sample. As the dimension p increases and the distribution becomes heaviertailed, the performance of U^{2} improves relative to Hotelling’s T^{2} statistic [34].
We investigated the effect of Tyler’s transformation on y_{i} through an empirical study using a toy example. Two multivariate distributions of the response variables were considered: a bivariate normal distribution and a bivariate gamma distribution. The correlations between two response variables were set to 0.4 and 0.8. The original data were transformed to spatial signs, and then spatial centered ranks were obtained by averaging the spatial signs of differences. Figure 1 shows the spatial signs and ranks with and without Tyler’s transformation. Spatial signs are represented as directions from the origin with unit length, and thus all the spatial signs lie on a circle of radius 1. The spatial signs with Tyler's transformation tend to be more evenly arranged around the circumference than the spatial signs without Tyler's transformation. The spatial ranks tend to spread evenly, as if they are uniformly distributed within a circle, for the Tyler’s transformation case. Note that the spatial ranks before and after Tyler’s transformation appear different when the correlation is high (r = 0.8); however, the change due to the transformation is negligible if the correlation is moderate (r = 0.4).
MRMDR procedure
There are many variations of MDR methods for finding GGIs. However, most MDR approaches have two core procedures: how to classify the cells into two groups and how to evaluate the interaction models. The proposed MRMDR adopts fuzzy clustering technique in the classification process as in multiCMDR [24]. For the evaluation process, MRMDR uses a multivariate spatial rank test statistic. Figure 2 shows the flow of the MRMDR procedure. The detailed algorithm of MRMDR is as follows.
Step 0. Data

Suppose there are n^{*} samples, with p SNP datapoints and q continuous phenotypes.

Standardize all the phenotypes to have a mean of zero and a unit variance. Let \({\mathbf{Y}}_{i} = (y_{i1} ,y_{i2} , \ldots ,y_{iq} )^{T}\) be the standardized phenotype vector and \({\mathbf{X}}_{i} = (x_{i1} ,x_{i2} , \ldots ,x_{ip} )^{T}\) be the genotype vector for the ith subject, respectively.
Step 1. Global process: clustering

Perform fuzzy kmeans clustering with k = 2 using phenotype information. We add a noise cluster such that noisy data points may be assigned to the noise class [43]. Remove all the samples in the noise cluster. The remaining samples have membership degrees for each of the two clusters. Denote these two clusters as C_{1} and C_{2}.

Define the global ratio θ as
$$\theta = {{\sum\limits_{i = 1}^{n} {D_{i1} } } \mathord{\left/ {\vphantom {{\sum\limits_{i = 1}^{n} {D_{i1} } } {\sum\limits_{i = 1}^{n} {D_{i2} } }}} \right. \kern\nulldelimiterspace} {\sum\limits_{i = 1}^{n} {D_{i2} } }},$$where n is the number of remaining samples after deleting noise samples and D_{ik} is the membership degree of the ith subject in cluster C_{k}, (k = 1, 2) [24].

For crossvalidation (CV), split the samples randomly into 10 subgroups of equal size. Let nine sets of samples be the training dataset and the remaining dataset be the test dataset used for evaluating the model.
Step 2. Local process: classification

To find mthorder GGIs, select a set of m SNPs from a pool of SNPs.

Calculate the local ratio θ_{j} for the jth genotype combination in the training set,
$$\theta_{j} = {{\sum\limits_{i = 1}^{{n_{j} }} {D_{ij1} } } \mathord{\left/ {\vphantom {{\sum\limits_{i = 1}^{{n_{j} }} {D_{ij1} } } {\sum\limits_{i = 1}^{{n_{j} }} {D_{ij2} } }}} \right. \kern\nulldelimiterspace} {\sum\limits_{i = 1}^{{n_{j} }} {D_{ij2} } }},\quad \left( {j = 1, \ldots ,3^{m} } \right),$$where D_{ijk} is the membership degree of the ith subject with the jth genotype combination in cluster C_{k}.

Label each genotype combination either “H” if θ_{j} ≥ θ, or “L” if θ_{j} < θ.
Step 3. Local process: evaluation

Obtain spatial ranks of Y_{i} for the combined samples from two clusters for i = 1, 2, …, n.
$$\begin{aligned} r_{i}^{*} & = ave_{j} \{ {\mathbf{S}}(A_{y} ({\mathbf{y}}_{i}  {\mathbf{y}}_{j} )\} \\ & = \frac{1}{n}\sum\limits_{j = 1}^{n} {\frac{{A_{y} ({\mathbf{y}}_{i}  {\mathbf{y}}_{j} )}}{{\sqrt {(A_{y} ({\mathbf{y}}_{i}  {\mathbf{y}}_{j} ))^{2} } }}} \\ \end{aligned}$$where S(·)is a sign function and A_{y} is Tyler’s transformation.

Calculate the multivariate Mann–Whitney test statistic:
$$U^{2} = \frac{p}{{c_{y}^{2} }}\sum\limits_{k = 1}^{2} {n_{k} }^{{_{{}} }} {\overline{\mathbf{R}}}_{k} ^{2}$$where p is the number of variables, n_{k} is the number of samples in cluster C_{k}, \({\overline{\mathbf{R}}}_{k}\) is the mean vectors of the spatial ranks for cluster C_{k}, and \(c_{y}^{2} = ave\{ {\mathbf{R}}_{k} ^{2} \}\).

Obtain U^{2} for every m SNP combination. Choose the SNP combination with the largest U^{2} statistic as the best mthorder interaction model in the training data.
Step 4. Selection process: best interaction model

Repeat step 2 and 3 for each fold and count the number of specific SNP combinations chosen for the best model. We call this crossvalidation consistency (CVC).

Select the model with the largest CVC as best final interaction model.

Derive the final rank sum statistic for the best model by averaging all statistics for the test data.

Calculate the empirical pvalue by a permutation test.
Results
Simulation analysis
To compare the performance of the proposed MRMDR with other existing methods, we conducted simulations for various situations. We considered 20 SNPs, including twoway diseasecausal SNPs. For each of the combinations of seven different heritability values (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and two minor allele frequency (MAF) values (0.2, 0.4), five different interaction models (M1M5) were considered. Typically, penetrance rate is defined as the proportion of individuals with a given genotype who exhibit the phenotype associated with that genotype. However, it is not appropriate for continuous phenotypes and there is no clear definition of continuous phenotype. For QMDR, the penetrance for continuous phenotypes was defined as a function of mean [15]. Similarly, we set the penetrance rate as a function of mean of the response variable in each genotype. Thus, a total of 70 epistatic models with various penetrance functions were generated, as done by Velez et al. [7]. All the models had little marginal effect. For the phenotype distribution, we considered a bivariate normal distribution and a bivariate gamma distribution. The correlations of the bivariate phenotypes also varied (0, 0.25, 0.5). Sample sizes of 100, 200, and 400 were considered. Finally, a total of 1000 replicated data sets were generated for 1260 (= 70 × 2 × 3 × 3) combinations.
We used both a Tyler’stransformed version (MRMDR_transformed) and an untransformed version (MRMDR_ untransformed). For the purpose of comparison, multiCMDR and multiQMDR methods were also used. All three summary statistics of the multiQMDR—the first principal component (FPC), weighted sum of principal components (WPC), and weighted squared sum of principal components (WSPC)—were included. To compare the data with the univariate approach, QMDR was also performed for each phenotype separately. Tenfold CV was considered. A summary of the simulation scheme is shown in Table 1.
Since the epistatic models given by Velez et al. [7] were devised only for the discrete phenotype, we modified them to handle continuous phenotypes. Let \({\mathrm{SNP}}_{1}\) and \({\mathrm{SNP}}_{2}\) be the two true causal SNPs, \({\mathbf{Y}} = (Y_{1} ,Y_{2} )^{T}\) the continuous bivariate phenotypes, and f_{ij} the penetrance function for the ith genotype of \({\mathrm{SNP}}_{1}\) and jth genotype of \({\mathrm{SNP}}_{2}\) in [7]. For the bivariate normal distribution, \({\mathbf{Y}} = (Y_{1} ,Y_{2} )^{T}\) was generated by
where \({{\varvec{\upmu}}}_{ij} = \left( {\begin{array}{*{20}c} {f_{ij} } \\ {f_{ij} } \\ \end{array} } \right)\) and \({{\varvec{\Sigma}}} = \left( {\begin{array}{*{20}c} 1 & \rho \\ \rho & 1 \\ \end{array} } \right)\). We used the mvrnorm() function of the MASS package in R. Three different values of ρ (0, 0.025, and 0.5) were considered, as mentioned above.
For the skewed asymmetrically distributed phenotypes, we used the copulabased multivariate gamma model [44]. A copulabased bivariate gamma distribution is given by
where \(c(u,{{\varvec{\Sigma}}}) = {{\varvec{\Sigma}}}^{\frac{1}{2}} \exp \left[ {  \frac{{{\tilde{\mathbf{u}}}^{{\prime }} ({{\varvec{\Sigma}}}^{  1}  {\rm I}){\tilde{\mathbf{u}}}}}{2}} \right],\tilde{u} = (\Phi^{  1} (u_{1} ), \, \Phi^{  1} (u_{2} ))^{{\prime }}\), and \(u_{k} = {\mathbf{F}}_{k} (y_{k} )\) for k = 1, 2. Here f_{k} and F_{k} are the marginal probability density function and cumulative distribution function of the kth phenotype, respectively, and Φ^{−1} is the inverse of the cumulative distribution of the standard normal distribution function. In this simulation, we set
for k = 1, 2. We used the mvdc(), rMvdc(), normalCopula() functions in the copula package in R.
The power was estimated by the hit ratio, which is the proportion of times that each method correctly chose the causal SNP pairs (\({\mathrm{SNP}}_{1}\) and \({\mathrm{SNP}}_{2}\)) as the best model among all possible twoway interaction models out of each set of 1000 datasets. Figure 3 shows the hit ratios of the eight different methods for the bivariate normal distribution. The power of the multiQMDR using FPC (multiQMDR_FPC) was slightly higher than that of the proposed MRMDR when there was no correlation between phenotypes. However, the difference between multiQMDR_FPC and MRMDR decreased as the correlation became stronger. The performances of the transformed one (MRMDR_transformed) and untransformed (MRMDR_untransformed) one were almost the same. MultiQMDR with WSPC (multiQMDR_WSPC) showed lower power even than the QMDR method.
Figure 4 shows the hit ratios for a bivariate gamma distribution. The proposed MRMDR outperformed the other methods for all ranges of heritability. There was little difference between the performance of the two versions of MRMDR, and the differences between them were less than 0.01 in all situations. The power of multiCMDR was the next highest. It should be noted that multiCMDR uses the fuzzy clustering approach to split data as in MRMDR. The gap between MRMDR and other methods became larger as the sample size decreased or the correlation became stronger. MultiQMDRFPC and multiQMDR using the WPC (multiQMDRWPC) showed lower power than MRMDR and multiCMDR, but higher power than QMDR. The performance of multiQMDRWSPC was still poor, although the difference was less than in bivariate normal distribution.
Through these simulation studies, we demonstrated that the proposed MRMDR outperformed the other existing methods for all ranges of heritability when the phenotypes were asymmetrically distributed. However, when the phenotypes are symmetrically distributed, as in a normal distribution, all methods yielded similar performance.
Three additional simulations were conducted to find out the further properties of the proposed method. The robustness of fuzzy kmeans clustering, the effect of noise cluster size, and the effect of outliers were investigated. First, to explore the robustness of the fuzzy kmeans clustering in MRMDR algorithm, we performed a comparison study to investigate the effect of logtransformation on phenotypes which were generated from the bivariate gamma distribution. The power of MRMDR after log transformation was obtained for each seven heritabilities. We set the correlation coefficient between two phenotypes 0.25 and the sample size 200. The average power of MRMDR for five interaction model (M1M5) for 1000 random samples are given in Table 2. The power slightly reduced after logtransformation. Therefore, we can conclude that the fuzzy kmeans clustering is robust to the skewed distribution and does not require any further transformation of original data.
Secondly, we investigate the efficiency according to the size of noise cluster since the large size of the noise cluster can be a source of loss of efficiency. The size of noise cluster depends on the noise distance δ, which needs to be chosen in advance. If δ is too large, outliers are not removed and are classified as a real cluster. On the other hand, if δ is too small, many observations can be misplaced into the noise cluster. Still, the estimation of the optimal value of δ is an openproblem [45]. In our approach, we used an iterative procedure to determine the value of δ optimally, which is the default of FKM.noise procedure in R.
To check the efficiency, we compared the performance with various values of δ. The results are shown in Table 3. For both bivariate normal and bivariate gamma distribution, the average power of MRMDR for five interaction models (M1M5) are given in Table 3. This new simulation result shows that the power using the iterative δ yielded the highest hit ratios for all heritabilities. Note that in this setting outliers were not generated. When there were outliers, a smaller noise cluster would have been created.
Thirdly, we have conducted an additional simulation to investigate the effects of outliers. The power of MRMDR for the datasets with or without outliers was obtained for each seven heritabilities. We set the correlation coefficient between two phenotypes 0.25 and the sample size 200 as in Table 3. The power for five interaction models (M1M5) for 1000 random samples were obtained. For the datasets with outliers, about 5% of the data were replaced by outliers. For both phenotypes, outliers were generated by adding three times of IQR for the randomly chosen 5% samples. The results are summarized in Table 4. In the presence of outliers, the power tends to decrease for all methods. The differences were the smallest in MRMDR, indicating the minimum power loss of MRMDR. As a result, MRMDR showed much higher power than other methods in the presence of outlying observations.
Real data analysis
We illustrate the proposed MRMDR method by analyzing data from the KoGES [46]. The data were collected from two recruitment areas. Each region is a cohort representing city (Ansan) and rural areas (Anseong). After standard quality control procedures for the subjects and SNPs, a total of 8842 participants and 327,872 SNPs were used for this analysis.
We considered four phenotypes related to kidney function: BUN, serum creatinine, urinary albumin levels, and urinary RBC levels. Traditionally, BUN and serum creatinine levels have been used as surrogate markers of kidney function deterioration [47]. The amounts of albumin and RBC in urine also could be good indicators of kidney disease. Although there have been some studies on associations between genes and kidneyrelated phenotypes, few studies have performed GGI analyses for these phenotypes [48, 49]. In the case of albumin, the urine test is known to be more accurate than in the case of blood test, so the urine test result is used here. However, urine tests are conducted only on a relatively small number of subjects, which resulted in missing observations in phenotypes. For this high rate of missing data, imputation for phenotypes is not appropriate. We removed observations with at least one missing phenotype value, and 3267 samples remained.
A linear model was separately fit to each phenotype with covariate adjustments for sex, age and recruitment area. Finally, 205 SNPs were selected that had significant marginal effects (p < 1 × 10^{−7}). Residuals were used for the analysis to control for covariate effects. The largest correlation coefficient was 0.32, which was the correlation between BUN and creatinine. The distributions of the phenotypes were heavily skewed. Figure 5 shows scatter plots and box plots of the phenotypes.
We applied the proposed MRMDR method to identify the best first and secondorder interaction models. Table 5 shows the center of the global clusters from fuzzy kmeans clustering during step 1. Cluster centers were determined by the means of each phenotype across samples belonging to the cluster. The clusters differed greatly for BUN and serum creatinine. There seemed to be no difference in RBC levels between the two clusters. Since the higher values of BUN and creatinine indicate abnormal kidney function, we can interpret the first cluster as a lowrisk group and the second cluster as a highrisk group.
We selected the SNP pair with the largest CVC as the best interaction model for each order. The test score was the average of spatial rank sum statistics calculated from the test data set. Empirical pvalues were obtained by comparing the test score with the statistic from the permuted dataset generated by shuffle phenotype vectors. If the score calculated with the permuted data exceeded our score, that case was counted. Then pvalue was calculated as a/b, where a is the number of cases with a permuted score higher than the original score and b is the total number of trials.
List of the interaction models that had the highest training score at least once during the tenfold CV process by MRMDR is shown in Table 6. For the firstorder interaction, rs41476549 had the highest CVC and was selected as the most relevant to the four phenotypes. rs790410 was selected as the best once during 10 CV processes but showed the highest score. All SNPs except rs17168600 had a pvalue lower than 0.05. For the secondorder interaction model, the pair of rs41476549 and rs1117105 showed the highest CVC, while the pair of rs790410 and rs961413 yielded the highest test score. However, the CVC value was low in most cases. Among the selected SNPs, rs16862782 on chromosome 3 has been reported to be related to BUN in Korean and to serum urea measurement in European [47, 50]. rs4686914 on chromosome 3 is also known to be related to the kidney function in European and East Asian [47, 51]. Both SNPs are mapped to LINC01991 gene. rs17586294 maps to TUBBP11 gene. To the best of our knowledge, there are no studies that have analyzed kidneyrelated GGI in a multivariate manner. Therefore, none of the interactions of the selected pair of SNPs have ever been reported.
We also applied multiCMDR and multiQMDR methods for comparison. Only multiQMDR using FPC was considered for comparison, because it showed the highest power among three types of multiQMDRs. There are some similarities between the results of MRMDR and multiCMDR. However, the results are totally different for multiQMDR, which are expected to be caused by some outlying observations. For example, the pair of rs17014894 and rs10517358 was chosen as the best interaction model. However, this pair suffers from sparsity and outliers. In particular, there are four cells with zero counts and one cell with one count having an outlying observation.
Figure 6 shows the box plots of the phenotypes after removing the noise cluster for the genotype combinations of the best model selected by MRMDR. The distribution of each phenotype was different depending on the genotype combination, suggesting that there was an interaction effect.
To evaluate pure interaction effects for continuous phenotypes, we adopted the classical linear model. For the selected SNP combinations, we fit multivariate analysis of variance (MANOVA) model including two main effects and the interaction effect. The SNP effects were tested for the interaction effect only (H_{01}:β_{12} = 0) and for the total effects including both main and interaction effects (H_{02}:β_{1} = β_{12=}0 and H_{03}:β_{2} = β_{12=}0). The results nine SNP pairs are summarized in Table 7. None of the selected SNP pairs showed significant interaction effects (p > 0.05), while some pairs showed significant total effects. This is because most MDR methods tend to choose a model with a large total effect ignoring pure interaction effects. However, our further empirical study showed that the introduction of noise cluster by fuzzy kmeans increased the power of detecting interaction effects. The same MANOVA model were fit to the trimmed data after removing the noise cluster by MRMDR. As expected, several significant interaction effects were successfully identified after trimming. Although MANOVA requires a Gaussian assumption, the process removing the noise cluster in the proposed method had the effect of yielding more robust and reliable MANOVA results. Among the nine SNP pairs with nonsignificant interactions, four pairs showed significant interaction effects. The results are summarized in the last three columns of Table 7.
Discussion
The proposed MRMDR method is a nonparametric approach assuming no particular genetic model. To assign samples to two risk groups, MRMDR uses the fuzzy clustering technique, as in the multiCMDR method. MRMDR uses the spatial rank sum statistic as evaluation measure for comparing two groups whereas Hotelling’s T^{2} statistic is used in multiCMDR and multiQMDR. Therefore, robust results can be expected in MRMDR for skewed distributions or those with outliers.
When calculating the spatial rank statistic, a datadriven transformation matrix was needed to make the statistic invariant. It is known that an affineinvariant test performs better than noninvariant angle coordinatewise sign tests when there is significant deviation from spherical symmetry caused by the presence of correlations among observed variables. Moreover, the affineinvariant test outperforms Hotelling’s T^{2}test for heavytailed nonnormal multivariate distributions [52]. As can be seen in our toy example, the invariant statistic differed from the untransformed statistic, especially in the presence of a high correlation between phenotypes.
The problem with using Tyler's transformation statistic is that it takes much longer to calculate than the untransformed statistic. However, the change of the spatial ranks due to the Tyler’s transformation is trivial even when the correlation is moderate (r = 0.4), as seen in the toy example. Moreover, the simulation results showed that there was little improvement in performance compared to untransformed versions of MRMDR. This phenomenon was also observed in the case of negative correlation. Therefore, we recommend using the untransformed MRMDR version if the absolute value of correlation between phenotypes is not too high (e.g., r< 0.5).
To apply the proposed method for GWAS data, we considered a filtering procedure to reduce the number of SNPs to be investigated. We selected SNPs with significant marginal effects and investigated the interactions. Since MDR is an exhaustive method, this kind of filtering is needed. However, this filtering process can lead to ignoring gene–gene interactions for the SNPs with weak marginal effects.
Conclusions
In this paper, we proposed the MRMDR method to detect the best interaction model for multivariate quantitative traits. MRMDR is based on the fuzzy clustering technique and spatial ranksum statistic. Intensive simulation studies comparing MRMDR with several current methods showed that the performance of MRMDR was outstanding for skewed distributions. The difference in power between the MRMDR and other methods increased as the sample size became smaller and the correlation became stronger. Additionally, for symmetric distributions, MRMDR showed comparable power. Therefore, we conclude that MRMDR is a useful multivariate nonparametric approach that can be used regardless of the phenotype distribution, the correlations between phenotypes, and sample size.
Availability of data and materials
The Korea Association Resource (KARE) project data will be publicly distributed by the Distribution Desk of Korea Biobank Network (https://koreabiobank.re.kr/). The data request should be made directly to Distribution Desk of Korea Biobank Network. Any inquiries should be sent to admin@koreabiobank.re.kr.
Abbreviations
 GGI:

Gene–gene interaction
 MDR:

Multifactor dimensionality reduction
 CV:

Cross validation
 CVC:

Cross validation consistency
 SNP:

Single nucleotide polymorphism
 BMI:

Body mass index
 BUN:

Blood urea nitrogen
 RBC:

Red blood cell
 MAF:

Minor allele frequency
 FPC:

First principal component
 WPC:

Weight sum of principal components
 WSPC:

Weighted squared sum of principal components
 MANOVA:

Multivariate analysis of variance
References
Grove J, Ripke S, Als TD, Mattheisen M, Walters RK, Won H, Pallesen J, Agerbo E, Andreassen OA, Anney R, et al. Identification of common genetic risk variants for autism spectrum disorder. Nat Genet. 2019;51(3):431–44.
McCarthy MI, Zeggini E. Genomewide association studies in type 2 diabetes. Curr Diab Rep. 2009;9(2):164–71.
Levy D, Ehret GB, Rice K, Verwoert GC, Launer LJ, Dehghan A, Glazer NL, Morrison AC, Johnson AD, Aspelund T, et al. Genomewide association study of blood pressure and hypertension. Nat Genet. 2009;41(6):677–87.
Gola D, Mahachie John JM, van Steen K, Konig IR. A roadmap to multifactor dimensionality reduction methods. Brief Bioinform. 2016;17(2):293–308.
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. 2001.
Ritchie MD, Van Steen K. The search for gene–gene interactions in genomewide association studies: challenges in abundance of methods, practical considerations, and biological interpretation. Ann Transl Med. 2018;6(8):157.
Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol. 2007;31(4):306–15.
Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006;241(2):252–61.
Chung Y, Lee SY, Elston RC, Park T. Odds ratio based multifactordimensionality reduction method for detecting gene–gene interactions. Bioinformatics. 2007;23(1):71–6.
Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene–gene and geneenvironment interactions. Bioinformatics. 2003;19(3):376–82.
Lee SY, Chung Y, Elston RC, Kim Y, Park T. Loglinear modelbased multifactor dimensionality reduction method to detect gene gene interactions. Bioinformatics. 2007;23(19):2589–95.
Gui J, Andrew AS, Andrews P, Nelson HM, Kelsey KT, Karagas MR, Moore JH. A robust multifactor dimensionality reduction method for detecting gene–gene interactions with application to the genetic analysis of bladder cancer susceptibility. Ann Hum Genet. 2011;75(1):20–8.
Hua X, Zhang H, Zhang H, Yang Y, Kuk AYC. Testing multiple gene interactions by the ordered combinatorial partitioning method in case–control studies. Bioinformatics. 2010;26(15):1871–8.
Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, Elston RC, Li MD. A generalized combinatorial approach for detecting genebygene and genebyenvironment interactions with application to nicotine dependence. Am J Hum Genet. 2007;80(6):1125–37.
Gui J, Moore JH, Williams SM, Andrews P, Hillege HL, van der Harst P, Navis G, Van Gilst WH, Asselbergs FW, GilbertDiamond D. A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene–gene interactions for quantitative traits. PLoS ONE. 2013;8(6):e66545–e66545.
Lee Y, Park M, Park T, Kim H. Gene–gene interaction analysis for quantitative trait using clusterbased multifactor dimensionality reduction method. Int J Data Min Bioinform. 2018;20(1):66.
Gui J, Moore JH, Kelsey KT, Marsit CJ, Karagas MR, Andrew AS. A novel survival multifactor dimensionality reduction method for detecting gene–gene interactions with application to bladder cancer prognosis. Hum Genet. 2011;129(1):101–10.
Lee S, Kwon MS, Oh JM, Park T. Gene–gene interaction analysis for the survival phenotype based on the Cox model. Bioinformatics. 2012;28(18):i582–8.
Oh S, Lee S. An extension ofmultifactor dimensionality reduction method for detecting gene–gene interactions with the survival time. J Korean Data Inf Sci Soc. 2014;25(5):1057–67.
Park M, Lee JW, Park T, Lee S. Gene–gene interaction analysis for the survival phenotype based on the Kaplan–Meier median estimate. Biomed Res Int. 2020;2020:5282345.
Oh S, Huh I, Lee SY, Park T. Analysis of multiple related phenotypes in genomewide association studies. J Bioinform Comput Biol. 2016;14(5):1644005.
Choi J, Park T. Multivariate generalized multifactor dimensionality reduction to detect gene–gene interaction. BMC Syst Biol. 2013;6:66.
Yu W, Kwon MS, Park T. Multivariate quantitative multifactor dimensionality reduction for detecting gene–gene interactions. Hum Hered. 2015;79(3–4):168–81.
Kim H, Jeong HB, Jung HY, Park T, Park M. Multivariate clusterbased multifactor dimensionality reduction to identify genetic interactions for multiple quantitative phenotypes. Biomed Res Int. 2019;2019:4578983.
Anderson MJ. A new method for nonparametric multivariate analysis of variance. Austral Ecol. 2001;26:32–46.
Randles RH, Peters D. Multivariate rank tests for the twosample location problem. Commun Stat. 1990;19(11):4225–38.
Dawn Peters RHR. A multivariate signedrank test for the onesample location problem. J Am Stat Assoc. 1990;85(410):552–7.
Sirkiä S, Taskinena S, Oja H. Symmetrised Mestimators of multivariate scatter. J Multivar Anal. 2007;98(8):1611–29.
Liu RY, Kesar S. A quality index based on data depth and multivariate rank tests. J Am Stat Assoc. 1993;88:252–60.
Liu RY, Kesar S. Ordering directional data: concepts of data depth on circles and spheres. Ann Stat. 1992;20(3):1468–84.
Yijun Zuo XH. On the limiting distributions of multivariate depthbased rank sum statistics and related tests. Ann Stat. 2006;34(6):2879–96.
Liu RY, Parelius JM, Kesar S. Multivariate analysis by data depth: descriptive statistics, graphics and inference, (with discussion and a rejoinder by Liu andSingh). Ann Stat. 1999;27(3):783–858.
Vencálek O. Concept of data depth and its applications. Mathematica. 2001;50(2):111–9.
Oja H, Randles RH. Multivariate nonparametric tests. Stat Sci. 2004;19(4):598–605.
Choi K, Marden J. An approach to multivariate rank tests in multivariate analysis of variance. J Am Stat Assoc. 1997;92(440):1581–90.
LanWang BP, Li R. A highdimensional nonparametric multivariate test for mean vector. J Am Stat Assoc. 2015;110(512):1658–69.
Chakraborty A, Chaudhuri P. Tests for highdimensional data based on means, spatial signs and spatial ranks. Ann Stat. 2017;45(2):771–99.
Fried R, Dehling H. Robust nonparametric tests for the twosample location problem. Stat Methods Appl. 2011;20(4):409–22.
Sirkiä S, Taskinen S, Nevalainen J, Oja H. Multivariate nonparametrical methods based on spatial signs and ranks: the R package SpatialNP. J Stat Softw. 2007;6:66.
Tyler D. A distributionfree mestimator of multivariate scatter. Ann Stat. 1987;15:66.
Oja H. Multivariate nonparametric methods with R: an approach based on spatial signs and ranks. Springer; 2010.
Liu RY. Data depth: robust multivariate analysis, computational geometry, and applications, vol. 72. American Mathematical Society; 2006.
Dave RN. Characterization and detection of noise in clustering. Pattern Recogn Lett. 1991;12(11):657–64.
Stitou Y, Lasmar NE, Berthoumieu Y. Copulas based multivariate Gamma modeling for texture classification. 2012.
Cimino MGCA, Frosini G, Lazzerini B, Marcelloni F. On the noise distance in robust fuzzy Cmeans. In: Proceedings of world academy of science, engineering and technology; 2005. p. 1. ISSN 13076884.
Kim Y, Han BG. group tK: Cohort Profile: the Korean Genome and Epidemiology Study (KoGES) Consortium. Int J Epidemiol. 2016;46(2):e20–e20.
Lee J, Lee Y, Park B, Won S, Han JS, Heo NJ. Genomewide association analysis identifies multiple loci associated with kidney diseaserelated traits in Korean populations. PLoS ONE. 2018;13(3):e0194044.
Freedman BI, Skorecki K. Gene–gene and geneenvironment interactions in apolipoprotein L1 geneassociated nephropathy. Clin J Am Soc Nephrol. 2014;9(11):2006–13.
Tin A, Köttgen A. Genomewide association studies of CKD and related traits. Clin J Am Soc Nephrol. 2020;6:66.
SinnottArmstrong NTY, Amar D, Mars N, Benner C, Aguirre M, Venkataraman GR, Wainberg M, Ollila HM, Kiiskinen T, Havulinna AS, Pirruccello JP, Qian J, Shcherbina A, FinnGen F, Rodriguez F, Assimes TL, Agarwala V, Tibshirani R, Hastie T, Ripatti S, Pritchard JK, Daly MJ, Rivas MA. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat Genet. 2021;53(2):185–94.
Thio CHL RA, van der Most PJ, Kamali Z, Vaez A, Smit JH, Penninx BWJH, Haller T, Mihailov E, Metspalu A, Damman J, de Borst MH, van der Harst P, Verweij N, Navis GJ, Gansevoort RT, Nolte IM, Snieder H; Lifelines Cohort Study group. Genomewide association scan of serum urea in European populations identifies two novel loci. Am J Nephrol. 2019;ss49(3):193–202.
Chakraborty B, Chaudhuri P, Oja H. Operating transformation retransformation on spatial median and angle test. Stat Sin. 1998;8(3):767–84.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2013M3A9C4078158, NRF2021R1A2C1007788).
Author information
Authors and Affiliations
Contributions
Conceptualization, M.P.; methodology, M.P. and T.P.; software, H.J. and J.L.; formal analysis, H.J. and J.L.; writing—original draft preparation, M.P.; writing—review and editing, T.P. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The study was reviewed and approved by the Institutional Review Board of Seoul National University (IRB No. E1908/001004). The patients/participants provided their written informed consent to participate in this study. All methods were carried out in accordance with relevant guidelines and regulations (Declaration of Helsinki).
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Park, M., Jeong, HB., Lee, JH. et al. Spatial rankbased multifactor dimensionality reduction to detect gene–gene interactions for multivariate phenotypes. BMC Bioinformatics 22, 480 (2021). https://doi.org/10.1186/s1285902104395y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902104395y