 Methodology article
 Open access
 Published:
Detecting genegene interactions for complex quantitative traits using generalized fuzzy classification
BMC Bioinformatics volume 19, Article number: 329 (2018)
Abstract
Background
Quantitative traits or continuous outcomes related to complex diseases can provide more information and therefore more accurate analysis for identifying genegene and gene environment interactions associated with complex diseases. Multifactor Dimensionality Reduction (MDR) is originally proposed to identify genegene and gene environment interactions associated with binary status of complex diseases. Some efforts have been made to extend it to quantitative traits (QTs) and ordinal traits. However these and other methods are still not computationally efficient or effective.
Results
Generalized Fuzzy Quantitative trait MDR (GFQMDR) is proposed in this paper to strengthen identification of genegene interactions associated with a quantitative trait by first transforming it to an ordinal trait and then selecting best sets of genetic markers, mainly single nucleotide polymorphisms (SNPs) or simple sequence length polymorphic markers (SSLPs), as having strong association with the trait through generalized fuzzy classification using extended member functions. Experimental results on simulated datasets and real datasets show that our algorithm has better success rate, classification accuracy and consistency in identifying genegene interactions associated with QTs.
Conclusion
The proposed algorithm provides a more effective way to identify genegene interactions associated with quantitative traits.
Background
With the advent of the genomic era, doctors can utilize genetic data to analyze the mechanisms of diseases and customize medical treatment. Diseases are usually associated with genetic variants, mainly single nucleotide polymorphisms (SNPs) or simple sequence length polymorphic markers (SSLPs), which are already a valuable source for mapping complex diseases and complex genetic traits [1]. Searching for genetic factors that influence complex traits and complex diseases is both a goal and a challenge for modern geneticists.
In recent years, the field has been revolutionized by using genomewide association studies (GWASs) to assess the statistical associations of genetic variants with many important common diseases [2]. A singlelocus approach, where each variant is tested individually for association with a specific phenotype is used by most of these studies. However research limited to individual gene effects will make a large proportion of the heredity of complex diseases and complex traits unexplained [3, 4]. Genegene and geneenvironment interactions play an important role in genetic association studies of complex diseases and complex traits [5]. If a genetic factor functions primarily through interaction with other genetic factors or environmental factors, the effect might be missed if the gene is examined individually without allowing for its interactions with these other unknown factors.
A variety of methods have been proposed to identify genegene interactions existing in complex diseases. These methods include regression modeling [6,7,8,9,10], data reduction [11,12,13,14], genetic programming [15], neural networks [16, 17], pattern mining [18, 19] and machine learning approaches, such as random forest [20], support vector machine [21] and ensemble learning [22].
These methods are mainly used in a case control study to identify interactive SNPs for predicting a binary disease status and have achieved great success. Among these methods, the Multifactor Dimensionality Reduction (MDR) method, was proposed as a nonparametric and modelfree data reduction approach for identifying interactions without significant main effects and has been successfully applied to identify genegene interactions in many common complex diseases [13, 23, 24]. In the analysis of binary traits, MDR reduces high dimension of multilocus genotype combinations to one dimension of two groups: high risk group and low risk group, thus avoids the problem of sparse data combinations and models with too many parameters. Each genotype combination is classified as either high risk or low risk according to its ratio of cases and controls. The set of genetic markers which has best classification performance is then selected as having the strongest association with the trait. Although MDR has been extended in many directions, it is mainly applied in binary traits.
However in many cases, continuous outcomes or quantitative traits such as body weight, tumor size, blood pressure can provide more accurate analysis.
Some efforts have been made to extend MDR to quantitative traits (QTs). The Combinatorial Partitioning Method (CPM) [25] was proposed to identify partitions of multilocus genotypes for predicting variation in quantitative trait levels. The Restricted Partition Method (RPM) detects multilocus genotypes as predictors of a quantitative trait by a partitioning of genotypes into subgroups. The Generalized MDR (GMDR) [26] extends MDR to continuous phenotypes and includes covariate adjustment. In Model based MDR (MBMDR) [27], MDR is extended to continuous outcomes by using parametric regression.
There are also methods based on information theory. In [28], a method built on two informationtheoretic metrics, the kway interaction information (KWII) and phenotypeassociated information (PAI) is developed for analyzing the genegene and geneenvironmental interactions associated with quantitative traits. In [29], as an extension of the usefulness of information gain, a nonparametric evaluation method of conditional entropy of a quantitative phenotype associated with a given genotype is proposed. In [30], an entropybased statistic which asymptotically follows a χ^{2} distribution is proposed to test genetic epistasis. This approach can test genetic epistasis with high efficiency in a caseonly design.
CPM searches over the state space made up of all possible sets of genotypic partitions of all the mlocus genotypes to identify m loci that divide corresponding genotypes into k partitions that are most similar within and most dissimilar between partitions for the mean of a quantitative trait. The number of k sets of genotypic partitions is a Stirling number of the second kind:
Where g_{M} is the size of the set of mlocus genotypes. A permutation test is used to estimate P values for the R^{2} for each of the k sets of genotypic partitions.
The RPM tries to find the most reasonable partition for evaluation to decrease most of the computational burden associated with the CPM. However a permutation test is used for all possible mlocus classifiers.
MBMDR is implemented in R (https://www.rproject.org/) but is only used on oneway and twoway interaction models [30]. GMDR still requires the outcome in the data file to be dichotomous [30].
KWII needs to compute the entropies of all subsets of m loci. Although the computation of the PAI requires only individual and joint entropies, making it computationally far more tractable than the KWII, the hill climbing algorithm it employs will miss many interactions which have small main effects.
In Quantitative MDR (QMDR) [31], to exploit continuous outcomes to make the analysis more accurate, a test statistic, rather than the balanced accuracy, is used to determine the best interaction model. This is a computationally efficient algorithm. However this method still classified the outcome into two groups: high and low level groups, which results in the loss of the large variability of the quantitative outcome.
Also there are few methods applied to ordinal categorical traits. Ordinal categorical traits such as the obesity classification based on body mass index (e.g., normal, preobese, mild obese and severe obese), the diabetes diagnosis based on glucose level (e.g., normal, impaired glucose tolerance and diabetes) are common in many genetic association studies. These traits are also derived from quantitative traits. In Ordinal MDR (OMDR) [32], MDR is extended to analyze genegene interaction for ordinal traits and taub [33], a common ordinal association measure, is used to replace balanced accuracy to evaluate interactions. However the taub measure only measures the degree of tendency of positive association between true categories of an ordinal trait and predicted categories and doesn’t consider the difference between true categories and predicted categories.
In order to better use the information contained in the quantitative trait, we first classify the quantitative outcome into several (greater than two) ordinal levels. Then an extended MDR is used to identify genegene interactions on this converted ordinal categorical trait. Rather than using balanced accuracy or common ordinal association measures, such as taub, we use a generalized fuzzy classification method to select the set of genetic markers as having the strongest association with the trait. Usually for each prediction of a category, its accuracy value is either 1, if the prediction is right, or 0, if the prediction is wrong. However for quantitative or ordinal traits, when the prediction is wrong, the closeness of a quantitative value to the true category is different. To reflect such difference, member functions of fuzzy sets could be employed to compute accuracy in classification. Since the range of a member function is between 0 and 1, to better describe the difference of quantitative values to a category, we extend its range to {− 1, 1} when it is used in fuzzy classification.
In this paper, a new kind of member functions which have an extended output range from − 1 to 1 are proposed to be used in fuzzy classification first. Then Generalized Fuzzy Quantitative MDR (GFQMDR) algorithm which is an improvement of Fuzzy Quantitative trait based Ordinal MDR (FQOMDR) in [34] is given to strengthen identification of genegene interactions associated with QTs. This algorithm first transforms a quantitative trait into an ordinal trait and then select best sets of SNPs as having strong association with the trait using such kind of member functions in the extended MDR. To test the performance of the proposed algorithm, we use it to identify five different interaction models in simulated data and compare success rates with three other methods. We also use it in two real data sets to select SNPs having strong association with the trait and compare balanced test accuracy and consistency with the same three other methods.
Methods
Traditional member functions
The degree of membership of different values to a fuzzy set can be computed using a membership function whose range is between 0 and 1.
Take QTs as an example. Usually we can divide them into three intervals or levels: high (H), average (A) and low (L) associated with three fuzzy sets. Here as an example, we use equal length intervals and associate them with three fuzzy sets using linear member functions, as shown in Fig. 1.
Let Q_{min} and Q_{max} denote the maximum and minimum values that a QT takes on in all samples in a dataset. B_{1} and B_{2} are upper borders of the low level and the average level respectively. P_{1}, P_{2} and P_{3} are the middle positions of the low level, average level and high level respectively can be derived as follows:
Then member functions for L, A and H levels in Fig. 1 can be expressed as:
Generalized fuzzy classification using extended member functions
Membership functions of fuzzy sets can also be used as an accuracy measure in fuzzy classification. For example, when different values are classified to the high level, we can get different accuracies between 0 and 1 from μ_{H1}(x). However when selecting a best classifier composed of a set of SNPs to classify a QT, such a range could not fully show differences among different classifiers. For example, if there are both 500 samples in genotypes that are classified as the high level for two classifiers, for classifier 1 there are 300 samples located at P_{3}, 200 samples located at P_{2} and 100 samples located at P_{1} in genotypes that are classified as the high level, for classifier 2 there are 300 sample located at P_{3}, 100 samples located at P_{2} and 200 samples located at P_{1} in genotypes that are classified as high levels, then the accuracies of the high level for these two classifiers would be the same: 0.6. However classifier 1 is obviously a better classifier to classify the high level. To reflect such difference, we extend the range of member functions from {0, 1} to {− 1, 1} when they are used in fuzzy classification to select the best classifier.
Such a linear extended member function is illustrated in Fig. 2 and can be expressed as:
It can also be regarded as a transformation of the member function in Fig. 1 as follows:
MDR algorithm
In order to detect high dimensional genegene interaction, MDR reduces genotype combinations at multiple loci into a single class variable taking values of either high risk or low risk categories, then tests association between a binary trait or disease with this new one dimensional variable.
The MDR method proceeded as follows. The 10fold cross validation is used. A set of m genetic factors is selected and their possible combinations or cells are represented in m dimensional space. For example, for two diallelic loci, each has three genotypes and there are nine twolocusgenotype combinations. Then the ratio of the number of cases to the number of controls is estimated within each cell, which is then labeled either as “highrisk”, if the cases:controls ratio is equal or greater than some threshold, or otherwise as “lowrisk”. Thus all cells are allocated to either high risk group or low risk group, which reduces the ndimensional model into a one dimensional model. The procedure is repeated for each possible nfactor combination. The training balanced accuracy of the two groups is used to select the best classifier. Balanced accuracy is defined as the arithmetic mean of sensitivity and specificity:
where TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. The prediction error of the selected best classifier can be estimated using the remaining onetenth of the data to get the testing balanced accuracy. The process is repeated for all ten training sets and testing sets and for each of the selected mlocus classifiers, the number of crossvalidation replicates in which it is chosen as the best classifier (crossvalidation consistency) is recorded. The mlocus classifier that has the maximum testing balanced accuracy and highest crossvalidation consistency is selected as the final best mlocus classifier, where crossvalidation consistency is used as a tiebreak.
For an ordinal categorical trait with J levels, an m dimensional cell is labeled as one of J groups as follows. Let 1, 2, ..., J be J levels or categories for an ordinal trait. For any combination of m SNPs, let n_{+j} be the number of individuals in class j, n_{ij} be the number of individuals with the ith multilocus genotype in category j, where i = {1, 2,...,3^{m}} and j = 1, 2,..., J. Then the ith mlocus genotype will be labeled as category c(i) as follows:
GFQMDR algorithm
GFQMDR extends MDR to analyze quantitative traits by first converting them to ordinal traits. Then Instead of evaluating each classifier using balanced accuracy or common ordinal association measures, it uses generalized fuzzy classification based on extended member functions to evaluate each classifier and select the best one as having the strongest association with the trait. The procedure of GFQMDR is as follows:

1.
Divide the range of a quantitative trait into J intervals and label them as categories 1, 2,…,J respectively.

2.
Partition the dataset into L subsets for Lfold crossvalidation (CV). Use one of the L subsets as a testing set and the rest as a training set.

3.
For each mway interaction derived from m SNPs or SSLPs, let n_{ij} be the number of individuals belonging to category j with the ith multilocus genotype in the training set, n_{+j} be the total number of individuals belonging to category j in the training set, where i = {1, 2,...,3^{m}} and j = 1, 2,..., J. Then all individuals with the ith multilocus genotype will be assigned into the category c(i) by the classifier corresponding to the m given SNPs as follows:
$$ c(i)=\underset{j\in \left\{1,\dots, J\right\}}{\arg \max}\left(\frac{n_{ij}}{n_{+j}}\right) $$(15)
where n_{ij} and n_{+j} are real numbers, n_{ij} is computed using the extended linear member function, n_{+j}, the size of class j, is computed using the traditional linear member function.

4.
Compute the training balanced accuracy for each mway interaction:
$$ \frac{1}{J}\sum \limits_{i=1}^{3^m}\frac{n_{i,c(i)}}{n_{+c(i)}} $$(16)
where n_{i,c(i)}, the number of individuals with the ith multilocus genotype which really belong to the class they are classified to, is computed using the extended linear member function.

5.
Select k classifiers that have best training balanced accuracies and compute their testing balanced accuracies.

6.
Repeat steps 3–5 on all L CV dataset.

7.
Since multiple genegene interactions associated with a QT is common in complex traits, multiple candidates of mway genegene interactions are selected as having the maximum testing balanced accuracy and highest generalized crossvalidation consistency based on topK selection (GCVC^{K} or simplified as GCVC) [34], where general crossvalidation consistency is used as a tiebreak.. The GCVC^{K} is calculated as follows:
$$ {\mathrm{GCVC}}^K={\sum}_{l=1}^L{I}_l\ \mathrm{where}\ {I}_l=\left\{\begin{array}{l}1,\kern1.25em \mathrm{if}\ \mathrm{the}\ \mathrm{MDR}\ \mathrm{classifier}\ \mathrm{is}\ \mathrm{identified}\ \mathrm{as}\ \mathrm{one}\ \\ {}\kern1.5em \mathrm{of}\ \mathrm{top}\hbox{} K\ \mathrm{classifier}\mathrm{s}\ \mathrm{at}\ {1}^{\mathrm{th}}\mathrm{CV}\ \mathrm{dataset}\\ {}0,\kern1.25em \mathrm{otherwise}\end{array}\right. $$(17) 
8.
To lower type I error, compute P values of selected candidates of mway genegene interactions based on 1000 permutations and select candidates having P values smaller than α (α is a prescribed threshold) as final identified genegene interactions.
Results
Experiments on simulation data
Experimental setup
The simulation experiment is designed to study the success rate of the proposed method and compare it with that of MDR, OMDR and Fuzzy Quantitative MDR (FQMDR) which uses fuzzy classification based on traditional member functions.
Five different interaction models were used for the ordinal trait transferred from a quantitative trait (Fig. 3) [32]. For each model, one pair of SNPs was simulated as a causal factor among all possible combinations.
The program gs 2.0 [35, 36] can quickly generate a large number of samples with genotype data based on real data that share similar local linkage disequilibrium (LD) patterns as those in human populations. It can be used to implement various interaction models. So we first use gs 2.0 to generate simulated genotype data.
Since the outcome is binary status (case or control), we derive continuous outcome from the penetrance functions (the penetrance function denotes the probability of being a case for each genotype combination.) of the five models as follows:
Let f_{ij} be the element from the ith row and jth column of a penetrance function for two interacting SNPs, the QT is generated from the following normal distribution:
where f_{ij} and σ* are the mean and variance of the normal distribution respectively. Then the QT is transferred to an ordinary trait with three categories. Since the QT obeys a normal distribution, we use the following classification. Let μ, σ be the mean value and variance of the quantitative trait, any quantitative trait value smaller than μσ/2 is classified as low category; any value between μσ/2 and μ + σ/2 is classified as middle category; any value larger than μ + σ/2 is classified as high category.
We use two different minor allele frequencies (MAF = 0.2 and 0.4), five different variances (σ* = 0.1, 0.2, 0.3, 0.4 and 0.5) and three different sample size (n = 200, 400, 800) with fixed SNP number (100 SNPs) and penetrance functions (0.01, 0.25, 0.5 for white, light grey, dark grey in Fig. 3. respectively) to create simulated datasets. For each interaction model, 100 replicated datasets were generated. Varying variances with fixed penetrance functions is equivalent to varying penetrance functions with fixed variances.
Hit ratio which is defined as the proportion of replicates with which the true causal SNPs are detected as the best SNPs among all possible same number of SNPs is used to measure the success rate. Here the best SNPs are also selected by using step 8 of the GFQMDR algorithm with α set as 0.01.
To test the type I error rate, the null datasets with no causal pair of SNPs were simulated for different sample sizes (n = 200, 400 and 600) and different SNP numbers (m = 10, 15, 20). Permutation P values of the identified strongest interaction pair of SNPs were calculated by permuting trait values of each dataset 1000 times. The ratio of the permutation 푃 values smaller than the significance level 훼=0.01 in 1000 replicates is calculated as the type I error rate. The number of the permutation ensured its accuracy to one decimal place when expressed in percent.
To demonstrate the power of the proposed algorithm to detect multiple genegene interactions associated with a QT, we use a combination of two models to simulate two set of two interacting SNPs associated with a QT. We use two combinations. One is a combination of model 1 and model 2, another is a combination of model 3 and model 4.
Let f_{1ij} be the element from the ith row and jth column of the penetrance functions for the first set of two interacting SNPs and f_{2kl} be the element from the kth row and lth column of the penetrance functions for the first set of two interacting SNPs, the QT is generated from the following normal distribution:
where w_{1}f_{1ij} + w_{2}f_{2ij} and σ* are the mean and variance of the normal distribution respectively, w_{1} and w_{2} are the weights. Then the QT is transferred to an ordinary trait with three categories as in the first experiment.
We use two different MAFs (0.2 and 0.4), three different variances (σ* = 0.1, 0.2 and 0.3) and three ratio of weights with fixed sample size (n = 200) and fixed SNP number (100 SNPs) and penetrance functions (0.01, 0.25, 0.5 for white, light grey, dark grey in Fig. 3. respectively) to create simulated datasets. For each interaction model, 100 replicated datasets were generated. Hit ratio is also used to measure the success rate with α set as 0.01 in step 8 of the GFQMDR algorithm here.
Experiment results
Experiment results of five models are shown in Tables 1, 2, 3, 4 and 5.
The performance of GFQMDR is better than other three methods in general. It is also observed that the performances of FQMDR and OMDR are better than that of MDR and the performance of FQMDR is slightly better than that of OMDR.
For the type I error rate, results given in Table 6 show that GFQMDR has type I error rate tightly gathering around 1% with a range from 0.7 to 1.3%, better than three other methods. Therefore GFQMDR controls type I error rate better.
Tables 7 and 8 show results for the third experiment. All methods identify both models with relatively high ratios when two weights are similar and identify the model having higher weight with high ratios and the model having lower weight with low ratios when two weights are different. On the whole, GFQMDR identifies model 2, 3 and 4 with higher hit ratios than three other methods, but identifies model 1 with lower hit ratios than FQMDR and OMDR.
Experiments on real data
Experimental setup
We use two real datasets to show applications and performance of the proposed method.
One is high density lipoprotein and atherosclerosis data of 294 female F2 mice.
Atherosclerosis is a complex disease related to both environmental and genetic factors. Since the QTL for a trait are located in homologous regions in mice and humans, analysis of mouse atherosclerosis can facilitate genetic analysis of human atherosclerosis.
Female B6 mice have low plasma highdensity lipoprotein (HDL) levels and are susceptible to atherosclerosis while female 129 mice have high plasma HDL levels and are relatively resistant. F2 mice are derived from intercross of (B6 × 129) F1 progeny produced by the mating of C57BL/6 J (B6) and 129S1/SvImJ (129) mice. This dataset contains genotypes of 111 SSLPs, HDL concentration and size of aortic fatty streak measurements for 294 female F2 mice fed a highfat diet for 14 weeks [37]. The data were downloaded from the Center for Genome Dynamics at the Jackson Laboratory https://phenome.jax.org/projects/Ishimori1. Here HDL concentrations and size of aortic fatty streak (AFS) measurements are two quantitative traits of interest. The atherosclerotic aortic fatty streak lesion size variable was logarithmically transformed (base 10).
Another is Ultraviolet (UV) LightInduced Immunosuppression Data. F1 backcross mice are derived from a backcross between low susceptibility BALB/c female mice and high susceptibility (BALB/c × C57BL/6) F1 male mice. This dataset contains 64 markers, sex and UV lightinduced percent immunosuppression (PI) of a contact hypersensitivity response of 134 F1 backcross mice. The data were acquired from the Center for Genome Dynamics at the Jackson Laboratory https://phenome.jax.org/projects/Clemens1. UV lightinduced percent immuno suppression is the quantitative trait of interest.
For missing values of SSLP, we set them to the majority value of that SSLP; for missing values of QTs, we set them to the mean value of that quantitative trait.
All three QTs are divided into three equal length intervals since better performance can be achieved in this way. For HDL concentrations, three intervals are defined as high concentration, middle concentration and low concentration states respectively; for size of AFS, three intervals are defined as large size, middle size and small size states respectively; for UV lightinduced percent immunosuppression three intervals are defined as high percent immunosuppression, middle percent immunosuppression and low percent immunosuppression states respectively.
Experiment results
The GFQMDR method is used to select multiple best 2way, 3way and 4way interactions in the above real datasets associated with HDL, AFS and PI respectively and α is set as 0.01 in step 8 of the GFQMDR algorithm here.
The performance of the GFQMDR method is evaluated in maximum testing balanced classification accuracy (MTSBCA) on ten CVs and corresponding CVC, where CVC is used as a tie break, and compared with that of FQMDR, OMDR and MDR methods. Balanced accuracy using the extended linear member function, balanced accuracy using the traditional linear member function, taub and balanced accuracy are used to select multiple sets of best interaction SNPs in each CV in GFQMDR, FQMDR, OMDR and MDR methods respectively. We choose multiple best sets of SNPs for each of 2way, 3way and 4way interactions.
We set k to 3, i.e. for each CV of a specific QT, we choose three best sets of SNPs of a fixed order. MTSBCA1 through MTSBCA3 are used to represent three sets of SNPs which have largest MTSBCAs in the descending order and GCVC1 through GCVC3 are corresponding GCVCs which are used as a tie break.
From Tables 9, 10 and 11, we can see that the performance of GFQMDR is better than that of FQMDR, OMDR and MDR in most cases. Figure 4 shows that AMTSBCA1 with GFQMDR is higher than that with three other methods for each of the four QTs except for PI with OMDR, AMTSBCA2 with EFQMDR is higher than that of three other methods.
After computing P values of these classifiers, we found that HDL has P values 0 for all three classifiers of two way and three way classifiers for GFQMDR and FQMDR, 0, 0.002, 0.002 for three classifiers of two way classifiers for OMDR, 0, 0.001, 0.001 for three classifiers of two way classifiers for MDR, 0.001, 0.003, 0.004 for three classifiers of three way classifiers for OMDR. In other cases, P values are all above 0.01. For P values below 0.01, GFQMDR and FQMDR have lower P values than OMDR and MDR, indicating stronger genegene interactions.
For cases where P values are below 0.01, we set k to bigger values and identify more classifiers with P values below 0.01, but some of them are identified due to linkage disequilibrium among causal snps and noncausal snps.
In summary the performance of the proposed algorithm is better than that of FQMDR, OMDR and MDR.
Discussion
In step 3 and step 4 of GFQMDR Algorithm, when computing the size of each category in a particular cell, an extended linear member function is used; when computing the total size of each category in all cells, a traditional linear member function is used. The reason is that when deciding the label or category of a particular cell, the difference among different categories when being tried to assign to that cell can be reflected by the size of different categories in that cell, rather than the total size of different categories in all cells. Such a difference can be better reflected by an extended linear member function. Experiments also show much better performance when using the extended linear member function and the traditional linear member function in different cases.
In GFQMDR Algorithm, fuzzification is not only applied to the computation of training and testing accuracies, but also applied to the classification of each cell or genotype combination, while in [34], fuzzification is only applied to the computation of training and testing accuracies.. Experiments show better performance of such a double fuzzification than that of a single fuzzification in either the computation of training and testing accuracies or the classification of each cell or genotype combination.
It’s a complex problem to divide QTs into meaningful intervals. Usually deviation is used to divide QTs as in simulated data, but the condition is the data obey approximately some kind of normal distribution. If not, dividing QTs into equal length intervals is a simple and acceptable choice if it has a better performance.
To make our method more computationally efficient, the GENIE software package which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis [38] could be used.
Alternative methods could be to use fuzzy balanced accuracy based on traditional member function of fuzzy sets, or balanced signed accuracy where 1 is used to denote that the predicted category is the same as the true category, 0 to denote that the predicted category is close to the true category, − 1 to denote that the predicted category is far from the true category. However our experiments show that the performance of our algorithm is better than that of the above two methods.
When multiple sets of causal snps exist, the performance of our proposed method depends on the sizes of influence of different sets of causal snps. When the sizes are similar, they are easier to be identified, whereas the sizes are quite different, the set with bigger size will be easily identified.
Mathematical analysis is further needed to explain the better performance of the generalized fuzzy classification based on extended member functions. This will be our future work.
To apply our method, the continuous trait should be divided into J intervals first. To get the optimal J, we can try different number of intervals. If for J intervals, its performance is better than J1 intervals and J + 1 intervals, J intervals could be approximately considered as optimal. If the performance is increasingly better when J increases, we can set an upper bound. In this paper, we only try three intervals for simplicity. We intend to try more intervals in our future work.
We would also try testing the proposed method with data in dbGAP or other human data that we can get a hold on in our future work.
Conclusions
In this study, a new method to identify genegene interactions for complex quantitative traits is proposed based on generalized fuzzy classification. To better use the information contained in a quantitative trait, it is first divided into several (greater than two) ordinal levels. Then a new ordinal association measure, fuzzy balanced accuracy based on generalized fuzzy classification is employed to select best sets of SNPs as having the strongest association with the trait in our proposed GFQMDR algorithm. Experimental results on simulated datasets and real datasets show that our algorithm has better performance in identifying genegene interactions associated with a complex quantitative trait.
Abbreviations
 AFS:

Aortic fatty streak
 CV:

Cross validation
 FQMDR:

Fuzzy quantitative MDR
 GFQMDR:

Generalized fuzzy quantitative MDR
 GWAS:

Genomewide association study
 HDL:

Highdensity lipoprotein
 LD:

Linkage disequilibrium
 MAF:

Minor allele frequencies
 MDR:

Multifactor dimensionality reduction
OMDR
Ordinal MDR
 PI:

Percent immunosuppression
 QT:

Quantitative trait
 SNP:

Single nucleotide polymorphism
References
Collins FS, Guyer MS, Chakravarti A. Variations on a theme: cataloging human DNA sequence variation. Science. 1997;278(5343):1580–1.
WTCC Consortium. Genomewide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature. 2007;447(7145):661–78.
Marchini J, Donnelly P, Cardon LR. Genomewide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005;37(4):413–7.
Franke B, Neale BM, Faraone SV. Genomewide association studies in ADHD. Hum Genet. 2009;126(1):13–50.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.
Cordell HJ. Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Hum Mol Genet. 2002;11(20):2463–8.
Kooperberg C, Ruczinski I, LeBlanc ML, Hsu L. Sequence analysis using logic regression. Genet Epidemiol. 2001;21(1):S626–31.
Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol. 2005;28(2):157–70.
Millstein J, Conti DV, Gilliland FD, Gauderman WJ. A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet. 2006;78(1):15–27.
Park MY, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9(1):30–50.
Zhang H, Bonney G. Use of classification trees for association studies. Genet Epidemiol. 2000;19(4):323–32.
Nelson MR, Kardia SLR, Ferrell RE, Sing CF. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001;11(3):458–70.
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, et al. Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69(1):138–47.
Culverhouse R, Klein T, Shannon W. Detecting epistatic interactions contributing to quantitative traits. Genet Epidemiol. 2004;27(2):141–52.
Nunkesser R, Bernholt T, Schwender H, Ickstadt K, Wegener I. Detecting highorder interactions of single nucleotide polymorphisms using genetic programming. Bioinformatics. 2007;23(24):3280–8.
Motsinger AA, Lee SL, Mellick G, Ritchie MD. GPNN: power studies and applications of a neural network method for detecting gene–gene interactions in studies of human disease. BMC Bioinformatics. 2006;7(5):1–10.
MotsingerReif AA, Dudek SM, Hahn LW, Ritchie MD. Comparison of approaches for machinelearning optimization of neural networks for detecting gene–gene interactions in genetic epidemiology. Genet Epidemiol. 2008;32(4):325–40.
Li Z, Zheng T, Califano A, Floratos A. Patternbased mining strategy to detect multilocus association and gene× environment interaction. BMC Proc. 2007;1(S1):S16.
Long Q, Zhang Q, Ott J. Detecting diseaseassociated genotype patterns. BMC Bioinformatics. 2009;10(S1):S75.
Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, et al. Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol. 2005;28(2):171–82.
Chen S, Sun J, Dimitrov L, Turner AR, Adams TS, Meyers DA, et al. A support vector machine approach for detecting genegene interaction. Genet Epidemiol. 2008;32(2):152–67.
Zhang Z, Zhang S, Wong M, Wareham NJ, Sha Q. An ensemble learning approach jointly modelling main and interaction effects in genetic association studies. Genet Epidemiol. 2008;32(4):285–300.
Moore JH. Computational analysis of gene–gene interactions using multifactor dimensionality reduction. Expert Rev Mol Diagn. 2004;4(6):795–803.
Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, et al. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006;241(2):252–61.
Nelson M, Kardia S, Ferrell R, Sing C. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001;11(3):458–70.
Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, Elston RC, et al. A generalized combinatorial approach for detecting genebygene and genebyenvironment interactions with application to nicotine dependence. Am J Hum Genet. 2007;80(6):1125–37.
Calle ML, Urrea V, Malats N, Van Steen K. MBMDR: Model based multifactor dimensionality reduction for detecting interactions in highdimensional genomic data. Ann Hum Genet. 2008;75:1–14.
Chanda P, Sucheston L, Liu S, Zhang A, Ramanathan M. Informationtheoretic genegene and geneenvironment interaction analysis of quantitative traits. BMC Genomics. 2009;10(1):509–30.
Yee J, Kwon MS, Jin S, Park T, Park M. Detecting genetic interactions for quantitative traits using mspacing entropy measure. Biomed Res Int. 2015;2015(2):523641.
Kang G, Yue W, Zhang J, Cui Y, Zuo Y, Zhang D. An entropybased approach for testing genetic epistasis underlying complex diseases. J Theor Biol. 2008;250(2):362–74.
Gui J, Moore JH, Williams SM, Andrews P, Hillege HL, van der Harst P, et al. A simple and computationally efficient approach to multifactor dimensionality reduction analysis of genegene interactions for quantitative traits. PLoS One. 2013;8(6):e66545.
Kim K, Kwon MS, Oh S, Park T. Identification of multiple genegene interactions for ordinal phenotypes. BMC Med Genet. 2013;6(Suppl 2):S9.
Agresti A, Kateri M. Categorical data analysis. Berlin Heidelberg: Springer; 2011.
Zhou X, Chan KCC. An effective approach to identify genegene interactions for complex quantitative traits using generalized fuzzy accuracy.In: IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, Chiang Mai; 2016.
Li J, Chen Y. Generating samples for association studies based on HapMap data. BMC Bioinformatics. 2008;9(1):1–13.
Chen Y, Li J. Generation of synthetic data and experimental designs in evaluating interactions for association studies. J Bioinforma Comput Biol. 2012;10(1):1240005.
Ishimori N, Li R, Kelmenson PM, Korstanje R, Walsh KA, Churchill GA, et al. Quantitative trait loci analysis for plasma hdlcholesterol concentrations and atherosclerosis susceptibility between inbred mouse strains c57bl/6j and 129s1/svimj. Arterioscler Thromb Vasc Biol. 2004;24(1):161–6.
Chikkagoudar S, Wang K, Li M. GENIE: a software package for genegene interaction analysis in genetic association studies using multiple GPU or CPU cores. BMC Res Notes. 2011;4(1):158.
Acknowledgements
The authors would like to thank all the guest editors and anonymous reviewers for their constructive comments on the manuscript.
Availability of data and materials
The supporting datasets and the GFQMDR software are available in https://github.com/zhouxd66/zhouxd66/blob/master/GFQMDR%20programs.zip.
Author information
Authors and Affiliations
Contributions
XDZ and KCCC developed the method, designed the experiments, interpreted the results and wrote the manuscript. XDZ implemented the method and performed the experiments. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhou, X., Chan, K.C.C. Detecting genegene interactions for complex quantitative traits using generalized fuzzy classification. BMC Bioinformatics 19, 329 (2018). https://doi.org/10.1186/s1285901823615
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901823615