Skip to main content

Detecting gene-gene interactions for complex quantitative traits using generalized fuzzy classification

Abstract

Background

Quantitative traits or continuous outcomes related to complex diseases can provide more information and therefore more accurate analysis for identifying gene-gene and gene- environment interactions associated with complex diseases. Multifactor Dimensionality Reduction (MDR) is originally proposed to identify gene-gene and gene- environment interactions associated with binary status of complex diseases. Some efforts have been made to extend it to quantitative traits (QTs) and ordinal traits. However these and other methods are still not computationally efficient or effective.

Results

Generalized Fuzzy Quantitative trait MDR (GFQMDR) is proposed in this paper to strengthen identification of gene-gene interactions associated with a quantitative trait by first transforming it to an ordinal trait and then selecting best sets of genetic markers, mainly single nucleotide polymorphisms (SNPs) or simple sequence length polymorphic markers (SSLPs), as having strong association with the trait through generalized fuzzy classification using extended member functions. Experimental results on simulated datasets and real datasets show that our algorithm has better success rate, classification accuracy and consistency in identifying gene-gene interactions associated with QTs.

Conclusion

The proposed algorithm provides a more effective way to identify gene-gene interactions associated with quantitative traits.

Background

With the advent of the genomic era, doctors can utilize genetic data to analyze the mechanisms of diseases and customize medical treatment. Diseases are usually associated with genetic variants, mainly single nucleotide polymorphisms (SNPs) or simple sequence length polymorphic markers (SSLPs), which are already a valuable source for mapping complex diseases and complex genetic traits [1]. Searching for genetic factors that influence complex traits and complex diseases is both a goal and a challenge for modern geneticists.

In recent years, the field has been revolutionized by using genome-wide association studies (GWASs) to assess the statistical associations of genetic variants with many important common diseases [2]. A single-locus approach, where each variant is tested individually for association with a specific phenotype is used by most of these studies. However research limited to individual gene effects will make a large proportion of the heredity of complex diseases and complex traits unexplained [3, 4]. Gene-gene and gene-environment interactions play an important role in genetic association studies of complex diseases and complex traits [5]. If a genetic factor functions primarily through interaction with other genetic factors or environmental factors, the effect might be missed if the gene is examined individually without allowing for its interactions with these other unknown factors.

A variety of methods have been proposed to identify gene-gene interactions existing in complex diseases. These methods include regression modeling [6,7,8,9,10], data reduction [11,12,13,14], genetic programming [15], neural networks [16, 17], pattern mining [18, 19] and machine learning approaches, such as random forest [20], support vector machine [21] and ensemble learning [22].

These methods are mainly used in a case control study to identify interactive SNPs for predicting a binary disease status and have achieved great success. Among these methods, the Multifactor Dimensionality Reduction (MDR) method, was proposed as a nonparametric and model-free data reduction approach for identifying interactions without significant main effects and has been successfully applied to identify gene-gene interactions in many common complex diseases [13, 23, 24]. In the analysis of binary traits, MDR reduces high dimension of multi-locus genotype combinations to one dimension of two groups: high risk group and low risk group, thus avoids the problem of sparse data combinations and models with too many parameters. Each genotype combination is classified as either high risk or low risk according to its ratio of cases and controls. The set of genetic markers which has best classification performance is then selected as having the strongest association with the trait. Although MDR has been extended in many directions, it is mainly applied in binary traits.

However in many cases, continuous outcomes or quantitative traits such as body weight, tumor size, blood pressure can provide more accurate analysis.

Some efforts have been made to extend MDR to quantitative traits (QTs). The Combinatorial Partitioning Method (CPM) [25] was proposed to identify partitions of multi-locus genotypes for predicting variation in quantitative trait levels. The Restricted Partition Method (RPM) detects multi-locus genotypes as predictors of a quantitative trait by a partitioning of genotypes into subgroups. The Generalized MDR (GMDR) [26] extends MDR to continuous phenotypes and includes covariate adjustment. In Model based MDR (MB-MDR) [27], MDR is extended to continuous outcomes by using parametric regression.

There are also methods based on information theory. In [28], a method built on two information-theoretic metrics, the k-way interaction information (KWII) and phenotype-associated information (PAI) is developed for analyzing the gene-gene and gene-environmental interactions associated with quantitative traits. In [29], as an extension of the usefulness of information gain, a nonparametric evaluation method of conditional entropy of a quantitative phenotype associated with a given genotype is proposed. In [30], an entropy-based statistic which asymptotically follows a χ2 distribution is proposed to test genetic epistasis. This approach can test genetic epistasis with high efficiency in a case-only design.

CPM searches over the state space made up of all possible sets of genotypic partitions of all the m-locus genotypes to identify m loci that divide corresponding genotypes into k partitions that are most similar within and most dissimilar between partitions for the mean of a quantitative trait. The number of k sets of genotypic partitions is a Stirling number of the second kind:

$$ S\left({g}_{M,}k\right)=\frac{1}{k!}\sum \limits_{i=0}^{k-1}{\left(-1\right)}^i\left(\begin{array}{c}k\\ {}i\end{array}\right){\left(k-i\right)}^{g_M} $$
(1)

Where gM is the size of the set of m-locus genotypes. A permutation test is used to estimate P values for the R2 for each of the k sets of genotypic partitions.

The RPM tries to find the most reasonable partition for evaluation to decrease most of the computational burden associated with the CPM. However a permutation test is used for all possible m-locus classifiers.

MB-MDR is implemented in R (https://www.r-project.org/) but is only used on one-way and two-way interaction models [30]. G-MDR still requires the outcome in the data file to be dichotomous [30].

KWII needs to compute the entropies of all subsets of m loci. Although the computation of the PAI requires only individual and joint entropies, making it computationally far more tractable than the KWII, the hill climbing algorithm it employs will miss many interactions which have small main effects.

In Quantitative MDR (QMDR) [31], to exploit continuous outcomes to make the analysis more accurate, a test statistic, rather than the balanced accuracy, is used to determine the best interaction model. This is a computationally efficient algorithm. However this method still classified the outcome into two groups: high and low level groups, which results in the loss of the large variability of the quantitative outcome.

Also there are few methods applied to ordinal categorical traits. Ordinal categorical traits such as the obesity classification based on body mass index (e.g., normal, pre-obese, mild obese and severe obese), the diabetes diagnosis based on glucose level (e.g., normal, impaired glucose tolerance and diabetes) are common in many genetic association studies. These traits are also derived from quantitative traits. In Ordinal MDR (OMDR) [32], MDR is extended to analyze gene-gene interaction for ordinal traits and tau-b [33], a common ordinal association measure, is used to replace balanced accuracy to evaluate interactions. However the tau-b measure only measures the degree of tendency of positive association between true categories of an ordinal trait and predicted categories and doesn’t consider the difference between true categories and predicted categories.

In order to better use the information contained in the quantitative trait, we first classify the quantitative outcome into several (greater than two) ordinal levels. Then an extended MDR is used to identify gene-gene interactions on this converted ordinal categorical trait. Rather than using balanced accuracy or common ordinal association measures, such as tau-b, we use a generalized fuzzy classification method to select the set of genetic markers as having the strongest association with the trait. Usually for each prediction of a category, its accuracy value is either 1, if the prediction is right, or 0, if the prediction is wrong. However for quantitative or ordinal traits, when the prediction is wrong, the closeness of a quantitative value to the true category is different. To reflect such difference, member functions of fuzzy sets could be employed to compute accuracy in classification. Since the range of a member function is between 0 and 1, to better describe the difference of quantitative values to a category, we extend its range to {− 1, 1} when it is used in fuzzy classification.

In this paper, a new kind of member functions which have an extended output range from − 1 to 1 are proposed to be used in fuzzy classification first. Then Generalized Fuzzy Quantitative MDR (GFQMDR) algorithm which is an improvement of Fuzzy Quantitative trait based Ordinal MDR (FQOMDR) in [34] is given to strengthen identification of gene-gene interactions associated with QTs. This algorithm first transforms a quantitative trait into an ordinal trait and then select best sets of SNPs as having strong association with the trait using such kind of member functions in the extended MDR. To test the performance of the proposed algorithm, we use it to identify five different interaction models in simulated data and compare success rates with three other methods. We also use it in two real data sets to select SNPs having strong association with the trait and compare balanced test accuracy and consistency with the same three other methods.

Methods

Traditional member functions

The degree of membership of different values to a fuzzy set can be computed using a membership function whose range is between 0 and 1.

Take QTs as an example. Usually we can divide them into three intervals or levels: high (H), average (A) and low (L) associated with three fuzzy sets. Here as an example, we use equal length intervals and associate them with three fuzzy sets using linear member functions, as shown in Fig. 1.

Fig. 1
figure1

The linear membership functions of high(H), average(A) and low(L) levels of a QT

Let Qmin and Qmax denote the maximum and minimum values that a QT takes on in all samples in a dataset. B1 and B2 are upper borders of the low level and the average level respectively. P1, P2 and P3 are the middle positions of the low level, average level and high level respectively can be derived as follows:

$$ {P}_1=\frac{Q_{\mathrm{min}}+{B}_1}{2} $$
(2)
$$ {P}_2=\frac{B_1+{B}_2}{2} $$
(3)
$$ {P}_3=\frac{B_2+{Q}_{\mathrm{max}}}{2}. $$
(4)

Then member functions for L, A and H levels in Fig. 1 can be expressed as:

$$ {\mu}_{L1}(x)=\left\{\begin{array}{l}1,\kern4.5em \mathrm{if}\kern0.5em x<={P}_1\\ {}\frac{P_2-x}{P_2-{P}_1},\kern1.5em \mathrm{if}\kern0.5em {P}_1<x<={P}_2\\ {}0,\kern4em \mathrm{otherwise}\end{array}\right. $$
(5)
$$ {\mu}_{A1}(x)=\left\{\begin{array}{l}\frac{x-{P}_1}{P_2-{P}_1},\kern1.5em \mathrm{if}\kern0.5em {P}_1<=x<={P}_2\\ {}\frac{P_3-x}{P_3-{P}_2},\kern1.5em \mathrm{if}\kern0.5em {P}_2<x<={P}_3\\ {}0,\kern4em \mathrm{otherwise}\end{array}\right. $$
(6)
$$ {\mu}_{H1}(x)=\left\{\begin{array}{l}0,\kern4em \mathrm{if}\kern0.5em x<={P}_2\\ {}\frac{x-{P}_2}{P_3-{P}_2},\kern1.5em \mathrm{if}\kern0.5em {P}_2<x<={P}_3\\ {}1,\kern4em \mathrm{otherwise}\end{array}\right. $$
(7)

Generalized fuzzy classification using extended member functions

Membership functions of fuzzy sets can also be used as an accuracy measure in fuzzy classification. For example, when different values are classified to the high level, we can get different accuracies between 0 and 1 from μH1(x). However when selecting a best classifier composed of a set of SNPs to classify a QT, such a range could not fully show differences among different classifiers. For example, if there are both 500 samples in genotypes that are classified as the high level for two classifiers, for classifier 1 there are 300 samples located at P3, 200 samples located at P2 and 100 samples located at P1 in genotypes that are classified as the high level, for classifier 2 there are 300 sample located at P3, 100 samples located at P2 and 200 samples located at P1 in genotypes that are classified as high levels, then the accuracies of the high level for these two classifiers would be the same: 0.6. However classifier 1 is obviously a better classifier to classify the high level. To reflect such difference, we extend the range of member functions from {0, 1} to {− 1, 1} when they are used in fuzzy classification to select the best classifier.

Such a linear extended member function is illustrated in Fig. 2 and can be expressed as:

$$ {\mu}_{L2}(x)=\left\{\begin{array}{l}1,\kern4.5em \mathrm{if}\kern0.5em x<={P}_1\\ {}\frac{P_2-x}{P_2-{P}_1},\kern1.5em \mathrm{if}\kern0.5em {P}_1<x<={P}_3\\ {}\hbox{-} 1,\kern3.5em \mathrm{otherwise}\end{array}\right. $$
(8)
$$ {\mu}_{A2}(x)=\left\{\begin{array}{l}\frac{x-{P}_1}{P_2-{P}_1},\kern2em \mathrm{if}\kern0.5em x<={P}_2\\ {}\frac{P_3-x}{P_3-{P}_2},\kern2em \mathrm{otherwise}\end{array}\right. $$
(9)
$$ {\mu}_{H2}(x)=\left\{\begin{array}{l}\hbox{-} 1,\kern3.5em \mathrm{if}\kern0.5em x<={P}_1\\ {}\frac{x-{P}_2}{P_3-{P}_2},\kern1.5em \mathrm{if}\kern0.5em {P}_1<x<={P}_3\\ {}1,\kern4.5em \mathrm{otherwise}\end{array}\right. $$
(10)
Fig. 2
figure2

The extended linear membership functions of high(H), average(A) and low(L) levels of a QT

It can also be regarded as a transformation of the member function in Fig. 1 as follows:

$$ {\mu}_{L2}(x)=\left\{\begin{array}{l}{\mu}_{L1(x)},\kern2.5em \mathrm{if}\kern0.5em x<={P}_2\\ {}\frac{P_2-x}{P_2-{P}_1},\kern1.5em \mathrm{if}\kern0.5em {P}_2<x<={P}_3\\ {}\hbox{-} 1,\kern3.5em \mathrm{otherwise}\end{array}\right. $$
(11)
$$ {\mu}_{A2}(x)=\left\{\begin{array}{l}\frac{x-{P}_1}{P_2-{P}_1},\kern1.5em \mathrm{if}\kern0.5em x<={P}_1\kern0.5em \\ {}{\mu}_{A1(x)},\kern2.5em \mathrm{if}\kern0.5em {P}_1<=x<={P}_3\\ {}\frac{P_3-x}{P_3-{P}_2},\kern1.5em \mathrm{otherwise}\end{array}\right. $$
(12)
$$ {\mu}_{H2}(x)=\left\{\begin{array}{l}\hbox{-} 1,\kern3.5em \mathrm{if}\kern0.5em x<={P}_1\\ {}\frac{x-{P}_2}{P_3-{P}_2},\kern1.5em \mathrm{if}\kern0.5em {P}_1<x<={P}_2\\ {}{\mu}_{H1(x)},\kern2.5em \mathrm{otherwise}\end{array}\right. $$
(13)

MDR algorithm

In order to detect high- dimensional gene-gene interaction, MDR reduces genotype combinations at multiple loci into a single class variable taking values of either high risk or low risk categories, then tests association between a binary trait or disease with this new one dimensional variable.

The MDR method proceeded as follows. The 10-fold cross validation is used. A set of m genetic factors is selected and their possible combinations or cells are represented in m dimensional space. For example, for two diallelic loci, each has three genotypes and there are nine two-locus-genotype combinations. Then the ratio of the number of cases to the number of controls is estimated within each cell, which is then labeled either as “high-risk”, if the cases:controls ratio is equal or greater than some threshold, or otherwise as “low-risk”. Thus all cells are allocated to either high risk group or low risk group, which reduces the n-dimensional model into a one dimensional model. The procedure is repeated for each possible n-factor combination. The training balanced accuracy of the two groups is used to select the best classifier. Balanced accuracy is defined as the arithmetic mean of sensitivity and specificity:

$$ \left(\mathrm{sensitivity}+\mathrm{specif}\ \mathrm{icity}\right)/2=\left(\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FN}\right)+\mathrm{TN}/\left(\mathrm{TN}+\mathrm{FP}\right)\right)/2 $$
(14)

where TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. The prediction error of the selected best classifier can be estimated using the remaining one-tenth of the data to get the testing balanced accuracy. The process is repeated for all ten training sets and testing sets and for each of the selected m-locus classifiers, the number of cross-validation replicates in which it is chosen as the best classifier (cross-validation consistency) is recorded. The m-locus classifier that has the maximum testing balanced accuracy and highest cross-validation consistency is selected as the final best m-locus classifier, where cross-validation consistency is used as a tie-break.

For an ordinal categorical trait with J levels, an m dimensional cell is labeled as one of J groups as follows. Let 1, 2, ..., J be J levels or categories for an ordinal trait. For any combination of m SNPs, let n+j be the number of individuals in class j, nij be the number of individuals with the ith multi-locus genotype in category j, where i = {1, 2,...,3m} and j = 1, 2,..., J. Then the ith m-locus genotype will be labeled as category c(i) as follows:

$$ c(i)=\underset{j\in \left\{1,\dots, J\right\}}{\arg \max}\left(\frac{n_{ij}}{n_{+j}}\right) $$

GFQMDR algorithm

GFQMDR extends MDR to analyze quantitative traits by first converting them to ordinal traits. Then Instead of evaluating each classifier using balanced accuracy or common ordinal association measures, it uses generalized fuzzy classification based on extended member functions to evaluate each classifier and select the best one as having the strongest association with the trait. The procedure of GFQMDR is as follows:

  1. 1.

    Divide the range of a quantitative trait into J intervals and label them as categories 1, 2,…,J respectively.

  2. 2.

    Partition the dataset into L subsets for L-fold cross-validation (CV). Use one of the L subsets as a testing set and the rest as a training set.

  3. 3.

    For each m-way interaction derived from m SNPs or SSLPs, let nij be the number of individuals belonging to category j with the ith multi-locus genotype in the training set, n+j be the total number of individuals belonging to category j in the training set, where i = {1, 2,...,3m} and j = 1, 2,..., J. Then all individuals with the ith multi-locus genotype will be assigned into the category c(i) by the classifier corresponding to the m given SNPs as follows:

    $$ c(i)=\underset{j\in \left\{1,\dots, J\right\}}{\arg \max}\left(\frac{n_{ij}}{n_{+j}}\right) $$
    (15)

where nij and n+j are real numbers, nij is computed using the extended linear member function, n+j, the size of class j, is computed using the traditional linear member function.

  1. 4.

    Compute the training balanced accuracy for each m-way interaction:

    $$ \frac{1}{J}\sum \limits_{i=1}^{3^m}\frac{n_{i,c(i)}}{n_{+c(i)}} $$
    (16)

where ni,c(i), the number of individuals with the ith multi-locus genotype which really belong to the class they are classified to, is computed using the extended linear member function.

  1. 5.

    Select k classifiers that have best training balanced accuracies and compute their testing balanced accuracies.

  2. 6.

    Repeat steps 3–5 on all L CV dataset.

  3. 7.

    Since multiple gene-gene interactions associated with a QT is common in complex traits, multiple candidates of m-way gene-gene interactions are selected as having the maximum testing balanced accuracy and highest generalized cross-validation consistency based on top-K selection (GCVCK or simplified as GCVC) [34], where general cross-validation consistency is used as a tie-break.. The GCVCK is calculated as follows:

    $$ {\mathrm{GCVC}}^K={\sum}_{l=1}^L{I}_l\ \mathrm{where}\ {I}_l=\left\{\begin{array}{l}1,\kern1.25em \mathrm{if}\ \mathrm{the}\ \mathrm{MDR}\ \mathrm{classifier}\ \mathrm{is}\ \mathrm{identified}\ \mathrm{as}\ \mathrm{one}\ \\ {}\kern1.5em \mathrm{of}\ \mathrm{top}\hbox{-} K\ \mathrm{classifier}\mathrm{s}\ \mathrm{at}\ {1}^{\mathrm{th}}\mathrm{CV}\ \mathrm{dataset}\\ {}0,\kern1.25em \mathrm{otherwise}\end{array}\right. $$
    (17)
  4. 8.

    To lower type I error, compute P values of selected candidates of m-way gene-gene interactions based on 1000 permutations and select candidates having P values smaller than α (α is a prescribed threshold) as final identified gene-gene interactions.

Results

Experiments on simulation data

Experimental setup

The simulation experiment is designed to study the success rate of the proposed method and compare it with that of MDR, OMDR and Fuzzy Quantitative MDR (FQMDR) which uses fuzzy classification based on traditional member functions.

Five different interaction models were used for the ordinal trait transferred from a quantitative trait (Fig. 3) [32]. For each model, one pair of SNPs was simulated as a causal factor among all possible combinations.

Fig. 3
figure3

Models of two way interactions for ordinal traits. White, light grey, dark grey represent normal, low risk, high risk of an ordinal trait respectively. (Figure is from [25])

The program gs 2.0 [35, 36] can quickly generate a large number of samples with genotype data based on real data that share similar local linkage disequilibrium (LD) patterns as those in human populations. It can be used to implement various interaction models. So we first use gs 2.0 to generate simulated genotype data.

Since the outcome is binary status (case or control), we derive continuous outcome from the penetrance functions (the penetrance function denotes the probability of being a case for each genotype combination.) of the five models as follows:

Let fij be the element from the ith row and jth column of a penetrance function for two interacting SNPs, the QT is generated from the following normal distribution:

$$ \mathrm{y}\mid \mathrm{SNP}1=i,\mathrm{SNP}2=j\sim \mathrm{N}\left({f}_{ij},{\sigma}^{\ast}\right) $$
(18)

where fij and σ* are the mean and variance of the normal distribution respectively. Then the QT is transferred to an ordinary trait with three categories. Since the QT obeys a normal distribution, we use the following classification. Let μ, σ be the mean value and variance of the quantitative trait, any quantitative trait value smaller than μ-σ/2 is classified as low category; any value between μ-σ/2 and μ + σ/2 is classified as middle category; any value larger than μ + σ/2 is classified as high category.

We use two different minor allele frequencies (MAF = 0.2 and 0.4), five different variances (σ* = 0.1, 0.2, 0.3, 0.4 and 0.5) and three different sample size (n = 200, 400, 800) with fixed SNP number (100 SNPs) and penetrance functions (0.01, 0.25, 0.5 for white, light grey, dark grey in Fig. 3. respectively) to create simulated datasets. For each interaction model, 100 replicated datasets were generated. Varying variances with fixed penetrance functions is equivalent to varying penetrance functions with fixed variances.

Hit ratio which is defined as the proportion of replicates with which the true causal SNPs are detected as the best SNPs among all possible same number of SNPs is used to measure the success rate. Here the best SNPs are also selected by using step 8 of the GFQMDR algorithm with α set as 0.01.

To test the type I error rate, the null datasets with no causal pair of SNPs were simulated for different sample sizes (n = 200, 400 and 600) and different SNP numbers (m = 10, 15, 20). Permutation P values of the identified strongest interaction pair of SNPs were calculated by permuting trait values of each dataset 1000 times. The ratio of the permutation 푃 values smaller than the significance level 훼=0.01 in 1000 replicates is calculated as the type I error rate. The number of the permutation ensured its accuracy to one decimal place when expressed in percent.

To demonstrate the power of the proposed algorithm to detect multiple gene-gene interactions associated with a QT, we use a combination of two models to simulate two set of two interacting SNPs associated with a QT. We use two combinations. One is a combination of model 1 and model 2, another is a combination of model 3 and model 4.

Let f1ij be the element from the ith row and jth column of the penetrance functions for the first set of two interacting SNPs and f2kl be the element from the kth row and lth column of the penetrance functions for the first set of two interacting SNPs, the QT is generated from the following normal distribution:

$$ \mathrm{y}\mid \mathrm{SNP}1=i,\mathrm{SNP}2=j,\mathrm{SNP}1=k,\mathrm{SNP}2=l\sim \mathrm{N}\left({w}_1\ {f}_{1 ij}+{w}_2\ {f}_{2 ij},{\sigma}^{\ast}\right) $$

where w1f1ij + w2f2ij and σ* are the mean and variance of the normal distribution respectively, w1 and w2 are the weights. Then the QT is transferred to an ordinary trait with three categories as in the first experiment.

We use two different MAFs (0.2 and 0.4), three different variances (σ* = 0.1, 0.2 and 0.3) and three ratio of weights with fixed sample size (n = 200) and fixed SNP number (100 SNPs) and penetrance functions (0.01, 0.25, 0.5 for white, light grey, dark grey in Fig. 3. respectively) to create simulated datasets. For each interaction model, 100 replicated datasets were generated. Hit ratio is also used to measure the success rate with α set as 0.01 in step 8 of the GFQMDR algorithm here.

Experiment results

Experiment results of five models are shown in Tables 1, 2, 3, 4 and 5.

Table 1 Hit ratios (%) for model 1
Table 2 Hit ratios (%) for model 2
Table 3 Hit ratios (%) for model 3
Table 4 Hit ratios (%) for model 4
Table 5 Hit ratios (%) for model 5

The performance of GFQMDR is better than other three methods in general. It is also observed that the performances of FQMDR and OMDR are better than that of MDR and the performance of FQMDR is slightly better than that of OMDR.

For the type I error rate, results given in Table 6 show that GFQMDR has type I error rate tightly gathering around 1% with a range from 0.7 to 1.3%, better than three other methods. Therefore GFQMDR controls type I error rate better.

Table 6 Type I Error Rate with the Significance Level 훼 of 0.01 from Datasets with 1000 Replicates

Tables 7 and 8 show results for the third experiment. All methods identify both models with relatively high ratios when two weights are similar and identify the model having higher weight with high ratios and the model having lower weight with low ratios when two weights are different. On the whole, GFQMDR identifies model 2, 3 and 4 with higher hit ratios than three other methods, but identifies model 1 with lower hit ratios than FQMDR and OMDR.

Table 7 Hit ratios (%) for model 1 and model 2
Table 8 Hit ratios (%) for model 3 and model 4

Experiments on real data

Experimental setup

We use two real datasets to show applications and performance of the proposed method.

One is high density lipoprotein and atherosclerosis data of 294 female F2 mice.

Atherosclerosis is a complex disease related to both environmental and genetic factors. Since the QTL for a trait are located in homologous regions in mice and humans, analysis of mouse atherosclerosis can facilitate genetic analysis of human atherosclerosis.

Female B6 mice have low plasma high-density lipoprotein (HDL) levels and are susceptible to atherosclerosis while female 129 mice have high plasma HDL levels and are relatively resistant. F2 mice are derived from intercross of (B6 × 129) F1 progeny produced by the mating of C57BL/6 J (B6) and 129S1/SvImJ (129) mice. This dataset contains genotypes of 111 SSLPs, HDL concentration and size of aortic fatty streak measurements for 294 female F2 mice fed a high-fat diet for 14 weeks [37]. The data were downloaded from the Center for Genome Dynamics at the Jackson Laboratory https://phenome.jax.org/projects/Ishimori1. Here HDL concentrations and size of aortic fatty streak (AFS) measurements are two quantitative traits of interest. The atherosclerotic aortic fatty streak lesion size variable was logarithmically transformed (base 10).

Another is Ultra-violet (UV) Light-Induced Immunosuppression Data. F1 backcross mice are derived from a backcross between low susceptibility BALB/c female mice and high susceptibility (BALB/c × C57BL/6) F1 male mice. This dataset contains 64 markers, sex and UV light-induced percent immunosuppression (PI) of a contact hypersensitivity response of 134 F1 backcross mice. The data were acquired from the Center for Genome Dynamics at the Jackson Laboratory https://phenome.jax.org/projects/Clemens1. UV light-induced percent immuno- suppression is the quantitative trait of interest.

For missing values of SSLP, we set them to the majority value of that SSLP; for missing values of QTs, we set them to the mean value of that quantitative trait.

All three QTs are divided into three equal length intervals since better performance can be achieved in this way. For HDL concentrations, three intervals are defined as high concentration, middle concentration and low concentration states respectively; for size of AFS, three intervals are defined as large size, middle size and small size states respectively; for UV light-induced percent immunosuppression three intervals are defined as high percent immunosuppression, middle percent immunosuppression and low percent immunosuppression states respectively.

Experiment results

The GFQMDR method is used to select multiple best 2-way, 3-way and 4-way interactions in the above real datasets associated with HDL, AFS and PI respectively and α is set as 0.01 in step 8 of the GFQMDR algorithm here.

The performance of the GFQMDR method is evaluated in maximum testing balanced classification accuracy (MTSBCA) on ten CVs and corresponding CVC, where CVC is used as a tie break, and compared with that of FQMDR, OMDR and MDR methods. Balanced accuracy using the extended linear member function, balanced accuracy using the traditional linear member function, tau-b and balanced accuracy are used to select multiple sets of best interaction SNPs in each CV in GFQMDR, FQMDR, OMDR and MDR methods respectively. We choose multiple best sets of SNPs for each of 2-way, 3-way and 4-way interactions.

We set k to 3, i.e. for each CV of a specific QT, we choose three best sets of SNPs of a fixed order. MTSBCA1 through MTSBCA3 are used to represent three sets of SNPs which have largest MTSBCAs in the descending order and GCVC1 through GCVC3 are corresponding GCVCs which are used as a tie break.

From Tables 9, 10 and 11, we can see that the performance of GFQMDR is better than that of FQMDR, OMDR and MDR in most cases. Figure 4 shows that AMTSBCA1 with GFQMDR is higher than that with three other methods for each of the four QTs except for PI with OMDR, AMTSBCA2 with EFQMDR is higher than that of three other methods.

Table 9 Comparison of MTSBCA and GCVC of PI classifiers among EFQMDR, FQMDR, OMDR and MDR when k = 3
Table 10 Comparison of MTSBCA and GCVC of HDL classifiers among EFQMDR, FQMDR, OMDR and MDR when k = 3
Table 11 Comparison of MTSBCA and GCVC of AFS classifiers among EFQMDR, FQMDR, OMDR and MDR when k = 3
Fig. 4
figure4

Comparison of AMTSBCA1 (average maximum testing balanced classification accuracy of a trait), AMTSBCA2(average maximum testing balanced classification accuracy of all traits) among GFQOMDR, FQMDR, OMDR and MDR

After computing P values of these classifiers, we found that HDL has P values 0 for all three classifiers of two way and three way classifiers for GFQMDR and FQMDR, 0, 0.002, 0.002 for three classifiers of two way classifiers for OMDR, 0, 0.001, 0.001 for three classifiers of two way classifiers for MDR, 0.001, 0.003, 0.004 for three classifiers of three way classifiers for OMDR. In other cases, P values are all above 0.01. For P values below 0.01, GFQMDR and FQMDR have lower P values than OMDR and MDR, indicating stronger gene-gene interactions.

For cases where P values are below 0.01, we set k to bigger values and identify more classifiers with P values below 0.01, but some of them are identified due to linkage disequilibrium among causal snps and non-causal snps.

In summary the performance of the proposed algorithm is better than that of FQMDR, OMDR and MDR.

Discussion

In step 3 and step 4 of GFQMDR Algorithm, when computing the size of each category in a particular cell, an extended linear member function is used; when computing the total size of each category in all cells, a traditional linear member function is used. The reason is that when deciding the label or category of a particular cell, the difference among different categories when being tried to assign to that cell can be reflected by the size of different categories in that cell, rather than the total size of different categories in all cells. Such a difference can be better reflected by an extended linear member function. Experiments also show much better performance when using the extended linear member function and the traditional linear member function in different cases.

In GFQMDR Algorithm, fuzzification is not only applied to the computation of training and testing accuracies, but also applied to the classification of each cell or genotype combination, while in [34], fuzzification is only applied to the computation of training and testing accuracies.. Experiments show better performance of such a double fuzzification than that of a single fuzzification in either the computation of training and testing accuracies or the classification of each cell or genotype combination.

It’s a complex problem to divide QTs into meaningful intervals. Usually deviation is used to divide QTs as in simulated data, but the condition is the data obey approximately some kind of normal distribution. If not, dividing QTs into equal length intervals is a simple and acceptable choice if it has a better performance.

To make our method more computationally efficient, the GENIE software package which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis [38] could be used.

Alternative methods could be to use fuzzy balanced accuracy based on traditional member function of fuzzy sets, or balanced signed accuracy where 1 is used to denote that the predicted category is the same as the true category, 0 to denote that the predicted category is close to the true category, − 1 to denote that the predicted category is far from the true category. However our experiments show that the performance of our algorithm is better than that of the above two methods.

When multiple sets of causal snps exist, the performance of our proposed method depends on the sizes of influence of different sets of causal snps. When the sizes are similar, they are easier to be identified, whereas the sizes are quite different, the set with bigger size will be easily identified.

Mathematical analysis is further needed to explain the better performance of the generalized fuzzy classification based on extended member functions. This will be our future work.

To apply our method, the continuous trait should be divided into J intervals first. To get the optimal J, we can try different number of intervals. If for J intervals, its performance is better than J-1 intervals and J + 1 intervals, J intervals could be approximately considered as optimal. If the performance is increasingly better when J increases, we can set an upper bound. In this paper, we only try three intervals for simplicity. We intend to try more intervals in our future work.

We would also try testing the proposed method with data in dbGAP or other human data that we can get a hold on in our future work.

Conclusions

In this study, a new method to identify gene-gene interactions for complex quantitative traits is proposed based on generalized fuzzy classification. To better use the information contained in a quantitative trait, it is first divided into several (greater than two) ordinal levels. Then a new ordinal association measure, fuzzy balanced accuracy based on generalized fuzzy classification is employed to select best sets of SNPs as having the strongest association with the trait in our proposed GFQMDR algorithm. Experimental results on simulated datasets and real datasets show that our algorithm has better performance in identifying gene-gene interactions associated with a complex quantitative trait.

Abbreviations

AFS:

Aortic fatty streak

CV:

Cross validation

FQMDR:

Fuzzy quantitative MDR

GFQMDR:

Generalized fuzzy quantitative MDR

GWAS:

Genome-wide association study

HDL:

High-density lipoprotein

LD:

Linkage disequilibrium

MAF:

Minor allele frequencies

MDR:

Multifactor dimensionality reduction

OMDR

Ordinal MDR

PI:

Percent immunosuppression

QT:

Quantitative trait

SNP:

Single nucleotide polymorphism

References

  1. 1.

    Collins FS, Guyer MS, Chakravarti A. Variations on a theme: cataloging human DNA sequence variation. Science. 1997;278(5343):1580–1.

    CAS  Article  Google Scholar 

  2. 2.

    WTCC Consortium. Genome-wide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature. 2007;447(7145):661–78.

    Article  Google Scholar 

  3. 3.

    Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005;37(4):413–7.

    CAS  Article  Google Scholar 

  4. 4.

    Franke B, Neale BM, Faraone SV. Genome-wide association studies in ADHD. Hum Genet. 2009;126(1):13–50.

    CAS  Article  Google Scholar 

  5. 5.

    Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.

    CAS  Article  Google Scholar 

  6. 6.

    Cordell HJ. Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Hum Mol Genet. 2002;11(20):2463–8.

    CAS  Article  Google Scholar 

  7. 7.

    Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L. Sequence analysis using logic regression. Genet Epidemiol. 2001;21(1):S626–31.

    Article  Google Scholar 

  8. 8.

    Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol. 2005;28(2):157–70.

    Article  Google Scholar 

  9. 9.

    Millstein J, Conti DV, Gilliland FD, Gauderman WJ. A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet. 2006;78(1):15–27.

    CAS  Article  Google Scholar 

  10. 10.

    Park MY, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9(1):30–50.

    Article  Google Scholar 

  11. 11.

    Zhang H, Bonney G. Use of classification trees for association studies. Genet Epidemiol. 2000;19(4):323–32.

    CAS  Article  Google Scholar 

  12. 12.

    Nelson MR, Kardia SLR, Ferrell RE, Sing CF. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001;11(3):458–70.

    CAS  Article  Google Scholar 

  13. 13.

    Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69(1):138–47.

    CAS  Article  Google Scholar 

  14. 14.

    Culverhouse R, Klein T, Shannon W. Detecting epistatic interactions contributing to quantitative traits. Genet Epidemiol. 2004;27(2):141–52.

    Article  Google Scholar 

  15. 15.

    Nunkesser R, Bernholt T, Schwender H, Ickstadt K, Wegener I. Detecting high-order interactions of single nucleotide polymorphisms using genetic programming. Bioinformatics. 2007;23(24):3280–8.

    CAS  Article  Google Scholar 

  16. 16.

    Motsinger AA, Lee SL, Mellick G, Ritchie MD. GPNN: power studies and applications of a neural network method for detecting gene–gene interactions in studies of human disease. BMC Bioinformatics. 2006;7(5):1–10.

    Google Scholar 

  17. 17.

    Motsinger-Reif AA, Dudek SM, Hahn LW, Ritchie MD. Comparison of approaches for machinelearning optimization of neural networks for detecting gene–gene interactions in genetic epidemiology. Genet Epidemiol. 2008;32(4):325–40.

    Article  Google Scholar 

  18. 18.

    Li Z, Zheng T, Califano A, Floratos A. Pattern-based mining strategy to detect multi-locus association and gene× environment interaction. BMC Proc. 2007;1(S1):S16.

    Article  Google Scholar 

  19. 19.

    Long Q, Zhang Q, Ott J. Detecting disease-associated genotype patterns. BMC Bioinformatics. 2009;10(S1):S75.

    Article  Google Scholar 

  20. 20.

    Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, et al. Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol. 2005;28(2):171–82.

    Article  Google Scholar 

  21. 21.

    Chen S, Sun J, Dimitrov L, Turner AR, Adams TS, Meyers DA, et al. A support vector machine approach for detecting genegene interaction. Genet Epidemiol. 2008;32(2):152–67.

    Article  Google Scholar 

  22. 22.

    Zhang Z, Zhang S, Wong M, Wareham NJ, Sha Q. An ensemble learning approach jointly modelling main and interaction effects in genetic association studies. Genet Epidemiol. 2008;32(4):285–300.

    Article  Google Scholar 

  23. 23.

    Moore JH. Computational analysis of gene–gene interactions using multifactor dimensionality reduction. Expert Rev Mol Diagn. 2004;4(6):795–803.

    CAS  Article  Google Scholar 

  24. 24.

    Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, et al. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006;241(2):252–61.

    Article  Google Scholar 

  25. 25.

    Nelson M, Kardia S, Ferrell R, Sing C. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001;11(3):458–70.

    CAS  Article  Google Scholar 

  26. 26.

    Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, Elston RC, et al. A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence. Am J Hum Genet. 2007;80(6):1125–37.

    CAS  Article  Google Scholar 

  27. 27.

    Calle ML, Urrea V, Malats N, Van Steen K. MB-MDR: Model based multifactor dimensionality reduction for detecting interactions in highdimensional genomic data. Ann Hum Genet. 2008;75:1–14.

  28. 28.

    Chanda P, Sucheston L, Liu S, Zhang A, Ramanathan M. Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits. BMC Genomics. 2009;10(1):509–30.

    Article  Google Scholar 

  29. 29.

    Yee J, Kwon MS, Jin S, Park T, Park M. Detecting genetic interactions for quantitative traits using m-spacing entropy measure. Biomed Res Int. 2015;2015(2):523641.

    PubMed  PubMed Central  Google Scholar 

  30. 30.

    Kang G, Yue W, Zhang J, Cui Y, Zuo Y, Zhang D. An entropy-based approach for testing genetic epistasis underlying complex diseases. J Theor Biol. 2008;250(2):362–74.

    CAS  Article  Google Scholar 

  31. 31.

    Gui J, Moore JH, Williams SM, Andrews P, Hillege HL, van der Harst P, et al. A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits. PLoS One. 2013;8(6):e66545.

    CAS  Article  Google Scholar 

  32. 32.

    Kim K, Kwon MS, Oh S, Park T. Identification of multiple gene-gene interactions for ordinal phenotypes. BMC Med Genet. 2013;6(Suppl 2):S9.

    Google Scholar 

  33. 33.

    Agresti A, Kateri M. Categorical data analysis. Berlin Heidelberg: Springer; 2011.

    Google Scholar 

  34. 34.

    Zhou X, Chan KCC. An effective approach to identify gene-gene interactions for complex quantitative traits using generalized fuzzy accuracy.In: IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, Chiang Mai; 2016.

  35. 35.

    Li J, Chen Y. Generating samples for association studies based on HapMap data. BMC Bioinformatics. 2008;9(1):1–13.

    Article  Google Scholar 

  36. 36.

    Chen Y, Li J. Generation of synthetic data and experimental designs in evaluating interactions for association studies. J Bioinforma Comput Biol. 2012;10(1):1240005.

    Article  Google Scholar 

  37. 37.

    Ishimori N, Li R, Kelmenson PM, Korstanje R, Walsh KA, Churchill GA, et al. Quantitative trait loci analysis for plasma hdl-cholesterol concentrations and atherosclerosis susceptibility between inbred mouse strains c57bl/6j and 129s1/svimj. Arterioscler Thromb Vasc Biol. 2004;24(1):161–6.

    CAS  Article  Google Scholar 

  38. 38.

    Chikkagoudar S, Wang K, Li M. GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores. BMC Res Notes. 2011;4(1):158.

    CAS  Article  Google Scholar 

Download references

Acknowledgements

The authors would like to thank all the guest editors and anonymous reviewers for their constructive comments on the manuscript.

Availability of data and materials

The supporting datasets and the GFQMDR software are available in https://github.com/zhouxd66/zhouxd66/blob/master/GFQMDR%20programs.zip.

Author information

Affiliations

Authors

Contributions

XDZ and KCCC developed the method, designed the experiments, interpreted the results and wrote the manuscript. XDZ implemented the method and performed the experiments. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiangdong Zhou.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Zhou, X., Chan, K.C.C. Detecting gene-gene interactions for complex quantitative traits using generalized fuzzy classification. BMC Bioinformatics 19, 329 (2018). https://doi.org/10.1186/s12859-018-2361-5

Download citation

Keywords

  • Quantitative traits
  • Gene-gene interactions
  • Multifactor dimensionality reduction
  • Ordinal traits
  • Fuzzy accuracy