Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data
 Tricia A ThorntonWells^{1, 2, 3},
 Jason H Moore^{4} and
 Jonathan L Haines^{3, 5}Email author
DOI: 10.1186/147121057204
© ThorntonWells et al; licensee BioMed Central Ltd. 2006
Received: 08 September 2005
Accepted: 12 April 2006
Published: 12 April 2006
Abstract
Background
Trait heterogeneity, which exists when a trait has been defined with insufficient specificity such that it is actually two or more distinct traits, has been implicated as a confounding factor in traditional statistical genetics of complex human disease. In the absence of detailed phenotypic data collected consistently in combination with genetic data, unsupervised computational methodologies offer the potential for discovering underlying trait heterogeneity. The performance of three such methods – Bayesian Classification, HypergraphBased Clustering, and Fuzzy kModes Clustering – appropriate for categorical data were compared. Also tested was the ability of these methods to detect trait heterogeneity in the presence of locus heterogeneity and/or genegene interaction, which are two other complicating factors in discovering genetic models of complex human disease. To determine the efficacy of applying the Bayesian Classification method to real data, the reliability of its internal clustering metrics at finding good clusterings was evaluated using permutation testing.
Results
Bayesian Classification outperformed the other two methods, with the exception that the Fuzzy kModes Clustering performed best on the most complex genetic model. Bayesian Classification achieved excellent recovery for 75% of the datasets simulated under the simplest genetic model, while it achieved moderate recovery for 56% of datasets with a sample size of 500 or more (across all simulated models) and for 86% of datasets with 10 or fewer nonfunctional loci (across all simulated models). Neither Hypergraph Clustering nor Fuzzy kModes Clustering achieved good or excellent cluster recovery for a majority of datasets even under a restricted set of conditions. When using the average log of class strength as the internal clustering metric, the false positive rate was controlled very well, at three percent or less for all three significance levels (0.01, 0.05, 0.10), and the false negative rate was acceptably low (18 percent) for the least stringent significance level of 0.10.
Conclusion
Bayesian Classification shows promise as an unsupervised computational method for dissecting trait heterogeneity in genotypic data. Its control of false positive and false negative rates lends confidence to the validity of its results. Further investigation of how different parameter settings may improve the performance of Bayesian Classification, especially under more complex genetic models, is ongoing.
Background
Complex human genetic disease
Current statistical approaches to detecting heterogeneity, such as the admixture test [21, 22], are neither sensitive nor powerful and can merely account for, not resolve, any underlying heterogeneity. In addition, while a small number of supervised computational methods exist for discovering genegene interactions, the power of these methods drops dramatically when locus or trait heterogeneity is present [23, 24]. Phenotypic data can be utilized to improve the performance of these methods in the face of locus or trait heterogeneity by facilitating heuristic stratification of data. However, for many diseases, little detailed phenotypic data has been collected consistently in combination with genotypic data. It is for these reasons that an unsupervised method, which does not rely on phenotypic data, is needed to mine potentially heterogeneous genotypic data as a means of data stratification and hypothesis generation.
For genetic factors involving heterogeneity, there are multiple independent (predictor) variables or else multiple dependent (outcome) variables that complicate the analysis by creating a heterogeneous model landscape. In the case of locus heterogeneity, multiple predictor variables (i.e., multiple loci) are present, some of which may be unmeasured or unobserved and, therefore, unavailable for inclusion in the disease model. In the case of trait heterogeneity, multiple outcome variables are present, which cannot or have not been distinguished based on the available phenotypic information. Genegene interactions create a rugged model landscape for statistical analysis. There is clear and convincing evidence that genegene interactions, whether synergistic or antagonistic, are not only possible but probably ubiquitous [18, 19, 25–28]. Thus, it is critical that complex genetic data sets be properly interrogated for possible underlying interactions.
Statistical analysis
No one analytic method is superior in all respects for the range of complicating factors that might be present in a specific data set. Given the relative shortcomings of our current analyses in complex diseases, we need to greatly extend the range of available analytical tools. There is a critical need for extensive reevaluation of existing methodologies for complex diseases, as well as for massive efforts in new method development. It is important that empirical studies be conducted to compare and contrast the relative strengths and weaknesses of methods on specific types of problems. For example, while cluster analysis has shown promise in numerous scientific and mathematical fields, including genetics [29, 30], its use with discrete genotypic data has not been adequately explored. Similarly, artificial neural networks modified with evolutionary computation have great potential for discovering nonlinear interactions among genes and environmental factors [31]. However, work is still ongoing to evaluate their limitations with regard to the heritability and effect sizes that can be detected.
Ultimately, though, the real power of existing and yettobedeveloped methods lies in our ability to marry them into a comprehensive approach to genetic analysis, so that their relative strengths and weaknesses can be balanced and few alternative hypotheses are left uninvestigated. Because no single method adequately investigates heterogeneity and interaction issues simultaneously, we propose routinely taking a twostep approach to analysis. For example, clustering or ordered subset analysis [32] can be used first to uncover genotypic and/or phenotypic heterogeneity and to subdivide the data into more homogeneous groups. Then in a second step, specific tests of interactions, such as the S sum statistic approach [33, 34, 34] or the multifactor dimensionality reduction method [23, 24, 35–37] could be used to investigate genegene or geneenvironment interactions within each of the homogenized subgroups. While this is not a perfect approach, it is an important improvement over the more common alternative of a singlepronged approach to analysis.
Cluster analysis
For over 30 years, cluster analysis has been used as a method of data exploration [38]. Clustering is an unsupervised classification methodology, which attempts to uncover 'natural' clusters or partitions of data. It involves data encoding and choosing a similarity metric, which will be used in determining the relative 'goodness' of a clustering of data. No one clustering method has been shown universally effective when applied to the wide variety of structures present in multidimensional data sets. Instead, the choice of suitable methods is dependent on the type of target data to be analyzed. Clustering has been utilized widely for the analysis of gene expression (e.g., DNA microarray) data; however, its application to genotypic data has been limited [29].
Most traditional clustering algorithms use a similarity metric based on distance that may be inappropriate for categorical data such as genotypes. Newer methods have been developed with categorical data in mind and include extensions of traditional methods and application of probabilistic theory. Three such methods were chosen for comparison in the task of discovering trait heterogeneity using multilocus genotypes – Bayesian Classification [39], HypergraphBased Clustering [40], and Fuzzy kModes Clustering [41] – all of which are appropriate for categorical data.
Results
Descriptive statistics
Confidence intervals around ARI_{HA} means by method.
95% Confidence Interval  

Method  Mean  Standard Error  Lower End  Upper End 
Bayesian  0.666  0.001  0.664  0.667 
Hypergraph  0.354  0.001  0.352  0.355 
Fuzzy kModes  0.556  0.001  0.555  0.558 
Comparison of clustering methods
Overall results of ChiSquare Test of Independence. Overall results of chisquare test of independence testing the null hypothesis that the percentage of clustering results achieving the specified cluster recovery level does not differ across clustering methods.
Cluster Recovery Statistic  χ^{2}  df  p 

%Results achieving Excellent cluster recovery (ARI_{HA} ≥ 0.90)  1787  2  < 0.001 
%Results achieving Good cluster recovery (ARI_{HA} ≥ 0.80)  1614  2  < 0.001 
%Results achieving Moderate cluster recovery (ARI_{HA} ≥ 0.65)  8565  2  < 0.001 
Results of ChiSquare Test of Independence for Trait Heterogeneity Only datasets. Results of chisquare test of independence for Trait Heterogeneity Only datasets, testing the null hypothesis that the percentage of clustering results achieving the specified cluster recovery level does not differ across clustering methods.
Cluster Recovery Statistic  Model  χ^{2}  df  p 

%Results achieving Excellent cluster recovery (ARI_{HA} ≥ 0.90)  THO  3713  2  < 0.001 
%Results achieving Good cluster recovery (ARI_{HA} ≥ 0.80)  THO  3107  2  < 0.001 
%Results achieving Moderate cluster recovery (ARI_{HA} ≥ 0.65)  THO  2609  2  < 0.001 
Applicability to real data
Discussion
Data simulation
The new data simulation algorithm produced complex genotypic datasets that included trait heterogeneity, locus heterogeneity and genegene interactions. Most existing simulation software that attempt to simulate heterogeneity do so by allowing the user to specify what portion of the dataset is to be simulated under one model versus another, and the resulting individuals are simply combined into one dataset. In the new algorithm, however, the disease penetrance models, which were used to simulate the data, were constructed so that overall prevalence levels were controlled, allowing naturally occurring overlaps, in which some individuals would have both traits (and their associated multilocus genotypes) by chance. This novel data simulation algorithm should prove very useful for future studies of other proposed genetic analysis methods for complex diseases.
Comparison of clustering methods
The Bayesian Classification method outperformed the other two methods across most dataset parameter combinations, with the exception of the most complex model (THB) on which Fuzzy kModes Clustering performed best. When the results were further examined to find a set of parameters for which one or more methods performed well, Bayesian Classification was found to have achieved excellent recovery for 73% of the datasets with the THO model and achieved moderate recovery for 56% of datasets with 500 or more affecteds and for 86% of datasets with 10 or fewer nonfunctional loci (data not shown here). Neither Hypergraph Clustering nor Fuzzy kModes Clustering achieved good or excellent cluster recovery even under a restricted set of conditions.
Bayesian Classification was obtained as closedsource software, for which there were numerous parameters, which could have been optimized. Initial parameter settings were chosen as recommended by the authors based on the type of data being analyzed. However, it is possible that alternative settings may have yielded better results. For example, for datasets with the more complex genetic models, greater numbers of nonfunctional loci and smaller sample sizes, the maximum number of classification trials and/or the maximum number of classification cycles per trial may need to be longer, and those parameters concerned with convergence rate and stopping criteria may need to be changed to delay convergence. If improvements in performance could be achieved with reasonable time and resource tradeoffs, such changes would certainly be desirable. The results of this simulation study are encouraging enough to warrant further investigation of this matter.
It was disappointing that Hypergraph Clustering did not perform very well under most conditions, despite its intuitive appeal as a method that would find frequentlyoccurring multilocus genotypic patterns. The Hypergraph Clustering method has been reported to work well with very large variable sets (on the order of thousands), which have complex patterns for which large numbers of clusters (10–20+) were relevant [43]. However, there has been no examination of the method's performance on smaller variable sets. Thus, it is possible that the restricted patterns present in our multilocus genotypic data were too simple and sparse and that the method is simply tuned to search for more complex patterns. Also, we were required to devise a translation of the resulting partitioning of genotypes into a clustering of individuals. We tested several such translations and implemented the best process out of several tested. Oftentimes, even when the method correctly chose the functional genotypes to be in different partitions, too many other nonfunctional genotypes were also chosen, which meant that the difference between an individual's likelihood of belonging to one cluster versus another was too small, making the choice of cluster assignment almost arbitrary.
The Fuzzy kModes Clustering method performed comparably to Bayesian Classification for the more complex datasets and was much less computationally intensive. It has been widely reported that the performance of kmeans algorithms is highly variable depending on the method of seeding the initial cluster centroids [44]. While we used the recommended method of selecting individuals from the dataset to serve as the initial cluster modes, we perhaps could have achieved better results if we implemented an additional step to ensure that the initial centroids were substantially dissimilar to each other. This is supported by evidence that when the Fuzzy kModes Clustering resulted in only one cluster (effectively no partitioning of the data), the initial centroids were very similar and the method had converged early so that individuals had equal probability of belonging to any of the clusters. In such cases, the individual was arbitrarily assigned to the first cluster, thereby leading to all other clusters being empty.
As expected, the simpler the model, the better the performance of the three clustering algorithms, with the exception that the Hypergraph Clustering and Fuzzy kModes Clustering methods performed somewhat better on the THB (Trait Heterogeneity with Both locus heterogeneity and genegene interaction) datasets than they did on the THL (Trait Heterogeneity with Locus heterogeneity) and THG (Trait Heterogeneity with Genegene interaction) datasets. Likewise, in general, the fewer the nonfunctional loci and the larger the sample size, the better the performance was.
Applicability to real data
To determine the efficacy of using the Bayesian Classification method on real data, the reliability of its internal clustering metrics at finding good clusterings was evaluated. Using the combination of the average log of class strength and the average cross class entropy to determine significance, the false positive rate was controlled very well, at three percent or less for all three significance levels. The false negative rate was acceptably low (18 percent) for the less stringent significance levels of 0.10. However, it was high (47 percent) for the most stringent significance level of 0.01. Thus, if a clustering of data were called significant according to permutation testing using either the average log of class strength or the average cross class entropy, we can be quite confident that the result were real. Typically geneticists prefer to accept a higher false positive rate to increase power; however, there is indeed a tradeoff between these two types of error. Valuable time and resources can be spent on followup studies, and it can be very detrimental to pursue leads that do not have a good chance of yielding new information about the disease under study.
Conclusion
The efficacy of three clustering methodologies at uncovering trait heterogeneity in genotypic data was investigated. One method, Bayesian Classification, was found to perform very well under certain conditions (THO model) and to outperform the other methods. Permutation testing confirmed that the method could be used on real data with excellent Type I error control and acceptable Type II error control. By controlling the false positive rate so well, Bayesian Classification offers a comfortable degree of certainty with regard to the hypotheses that it generates. This is true at least when the underlying data structure is similar to that simulated under the THO model. Further investigation of how different parameter settings may improve the performance of Bayesian Classification is planned.
Methods
Data simulation
To compare the performance of clustering methodologies in the task of uncovering trait heterogeneity in genotypic data, datasets were needed in which such heterogeneity was known to exist. Since there are no wellcharacterized real datasets available that fit this description, a simulation study was needed. Genetic models that contained two binary diseaseassociated traits, such that there is trait heterogeneity among 'affected' individuals, were used. In addition, some of the models incorporate locus heterogeneity, a genegene interaction, or both. Figure 2 depicts the structure of the four genetic models used to simulate the genotypic data.
Four prevalence levels were simulated for each genetic model: (1) fifteen percent, which is characteristic of a common disease phenotype such as obesity [45], (2) five percent, which is characteristic of a relatively common disease such as prostate cancer [46], (3) one percent, which is characteristic of a less common disease such as schizophrenia [47], and (4) one tenth of one percent, which is characteristic of a more uncommon disease such as multiple sclerosis [48]. Three realistic levels of sample size were simulated for each model: 200, 500 and 1000 affected individuals. Finally, four levels of nonfunctional loci were simulated: 0, 10, 50 and 100. The inclusion of nonfunctional loci adds a random noise effect that is present in real candidate gene studies in which the functional locus or loci are among many more suspected but actually nonfunctional loci. All loci, including the functional loci, were simulated to have equal biallelic frequencies of 0.5.
Although the above parameter settings are by no means exhaustive of the biologically plausible situations, the outlined conditions are reasonable and specify 192 different sets of data specifications due to the combinatorial nature of the study design. To have adequate power to detect a difference in performance among clustering methodologies, 100 datasets per set of parameters, resulting in a total of 19,200 simulated data sets, were simulated.
The input file for each of the clustering methods described below includes genotype and trait status information. Each row is a single individual. Column headings include unique individual number, trait status (affected for Trait 1, Trait 2, or both), and all simulated loci. Genotypes for each locus are encoded nominally (not ordinally), such that no genetic model assumptions are incorporated. Loci are numbered, and alleles are lettered. Thus, for a given locus '3' that has two alleles 'A' and 'B', the three possible genotypes are '3A3A', '3A3B', and '3B3B'. A different nomenclature could easily be used, however, since the methods simply treat each genotype as a character string for labeling purposes only and do not attribute any meaning or order to them.
Clustering methods
There exists a very large number of clustering algorithms and even more implementations of those algorithms. The choice of which clustering methodology to use should be determined by the kind of data being clustered and the purpose of the clustering [51]. Genotypic data is categorical, which immediately narrows the field of appropriate methods for this study to only a few. Three different clustering methodologies were chosen that are suitable for categorical data and are appealing due to their speed or theoretical underpinnings.
The first clustering method is Bayesian Classification [39, 52]. The corresponding AutoClass software is freely available from Peter Cheeseman at the NASA Ames Research Center. Bayesian Classification (BC) aims to find the most probable clustering of data given the data and the prior probabilities. In the case of genotypic data, prior probabilities are based on genotype frequencies, which for the purpose of the proposed data simulations are set in accordance with HardyWeinberg equilibrium and equal biallelic frequencies of 0.5. The most probable clustering of data is determined from two posterior probabilities. The first involves the probability that a particular individual belongs to its 'assigned' cluster, or otherwise stated as the probability of the individual's multilocus genotype, conditional on it belonging to that cluster, with its characteristic genotypes. The second posterior probability involves the probability of a cluster given its assigned individuals, or otherwise stated as the probability of the cluster's characteristic genotypes, conditional on the multilocus genotypes of the individuals assigned to that cluster.
In actuality, individuals are not 'assigned' to clusters in the hard classification sense but instead in the fuzzy sense they are temporarily assigned to the cluster to which they have the greatest probability of belonging. Thus, each individual has its own vector of probabilities of belonging to each of the clusters. The assignment of individuals is also not considered the most important result of the clustering method. A ranked listing is produced of all loci in the dataset with their corresponding normalized "attribute influence" values (ranging between 0 and 1), which provide a rough heuristic measure of relative influence of each locus in differentiating the classes from the overall data set. Thus, emphasis is placed on the identification of which attributes, or loci, are most important in producing the clustering. This information that can then be used to more directly stratify affected (and/or unaffected) individuals, for instance, by using the top n most influential loci identified, and to enable meaningful interpretation of the clustering result.
The second method is Hypergraph Clustering [40]. It has been implemented in the hMETIS software, which is freely available from George Karypis at the University of Minnesota. Hypergraph clustering seeks a partitioning of vertices, such that intracluster relatedness meets a specified threshold, while the weight of hyperedges cut by the partitioning is minimized. In this case, vertices represent single locus genotypes, hyperedges represent association rules, and hyperedge weights represent the strength of the association rules. For instance, if a specific genotype at one locus cooccurs with a specific genotype at another locus, an association rule linking those two genotypes would be created, and that rule would have a weight equivalent to the proportion of individuals in the dataset that had both of those genotypes. Thus, for our purposes, association rules are multilocus genotype combinations that are found in the dataset. The freely available LPminer program was used to generate the association rules [53]. LPminer searches the database for multilocus genotype combinations that appear together with substantial frequency (above a prespecified "support" percentage) and outputs this info as a list of association rules. hMETIS takes these association rules and uses them to create a hypergraph in which single locus genotypes are vertices and association rules dictate the presence and weight of hyperedges. hMETIS creates a partition of the hypergraph such that the weight of the removed hyperedges is minimized. It achieves this by using a series of phases, somewhat analogous to the stages of a simulated annealing algorithm, in an attempt to avoid making decisions which are only locally (not globally) optimal.
Statistical analysis
Comparison of clustering methods
Each clustering method has its own metric(s) for evaluating the "goodness" of a clustering of data. Since these methods are being tested on simulated data, classification error of a given clustering can be calculated as the number of misclassified individuals divided by the total number of individuals. However, simple classification error has its disadvantages. Firstly, in cases such as this where there is overlap between the known classes, the researcher must make an arbitrary decision as to when individuals who have been simulated to have both traits, not just one or the other, are considered to be misclassified. The decision about error is equally arbitrary when the number of resulting clusters is greater than the number of known classes. For instance, if the individuals belonging to one class were divided into two classes by the clustering algorithm, calculating classification error would require either (1) that none of those individuals be considered incorrectly classified, since they are all in homogenous clusters, or else (2) that all individuals from one of those clusters be considered misclassified. Neither choice seems to satisfactorily capture the "goodness" of the clustering result. Subsequently, it is not advisable to compare the classification error of two clustering results for which the number of clusters differs.
It is for these reasons alternative cluster recovery metrics were investigated. The HubertArabie Adjusted Rand Index (ARI_{HA}) addresses the concerns raised by classification error and was, therefore, chosen to evaluate the goodness of clustering results from the three clustering methods being compared [42]. Calculation of the ARI_{HA} involves determining (1) whether pairs of individuals, who were simulated to have the same trait, are clustered together or apart and (2) whether pairs of individuals, who do not have the same trait, are clustered together or apart. The ARI_{HA} is robust with regard to the number of individuals to cluster, the number of resulting clusters, and the relative size of those clusters [54]. It is, however, sensitive to the degree of class overlap, which is desirable since it will penalize more for good clusterings that occur by chance than classification error would. When interpreting ARI_{HA} values, 0.90 and greater can be considered excellent recovery, 0.80 and greater is good recovery, 0.65 and greater reflects moderate recovery, and less than 0.65 indicates poor recovery. These values were derived from empirical studies showing observations cut at the 95th, 90th, 85th and 80th percentiles corresponded to ARI_{HA} values of 0.86, 0.77, 0.67 and 0.60 respectively [54].
The ARI_{HA} was used as the gold standard measure to compare the performance of the three clustering methods. Three categorical variables were created that could be tested using the nonparametric chisquare test of independence. The ARI_{HA} values were discretized into a 1 or 0 depending on whether they met or exceeded the cutoff values for excellent, good and moderate cluster recovery, as described above. A chisquare test of independence was performed testing the null hypothesis that the number of clusterings achieving a certain ARI_{HA} value was independent of the clustering method. Five percent was chosen as the significance level (alpha). An evaluation was performed of whether one method significantly outperformed the others and whether that method performed satisfactorily accordingly to the ARI_{HA}.
Applicability to real data
As a reminder, the ultimate goal of this research is to find a clustering method that works well at uncovering trait heterogeneity in real genotypic data. Unlike for the current simulation study, for real data to which clusters individuals belong is not known a priori, otherwise the clustering would not be necessary. Indeed, it is the goal of clustering to uncover natural clusters or partitions of data using the methodspecific "goodness" metric as a guide. In preparation for application of a clustering method to real data, after choosing the superior method, that method's internal clustering metrics were analyzed using permutation testing to determine how good a proxy they are for ARI_{HA}.
The Bayesian Classification method produces two internal clustering metrics for each resulting cluster, or class: (1) class strength, and (2) crossclass entropy. Class strength is a heuristic measure of how strongly each class predicts "its" instances and is reported as the log of class strength. Crossclass entropy is a measure of how strongly the class probability distribution function differs from that of the dataset as a whole. Because each metric is reported per resulting cluster, or class, the average metric value across clusters was calculated and utilized for evaluating cluster fitness.
The ratio of one hundred permuted datasets per simulated dataset was chosen, which should result in a reasonable approximation of the null distribution but would not put unreasonable strain on resources and time [55]. Genotypes were permuted within loci across individuals, such that the overall frequency of genotypes at any one locus was unchanged, but the frequency of multilocus genotypes was altered at random. This created a null sample in which the frequency of multilocus genotypes was no longer associated with trait status except by chance. The empiricallydetermined superior clustering method was applied to each permuted data set and both the internal clustering metric values and the ARI_{HA} were determined. For each set of 100 permuted data sets, the significance of each of the simulated dataset results was determined based on whether they exceeded the values at the significance level in the corresponding null distribution. Ten percent was chosen as the acceptable Type I error rate since these methods serve as a means of data exploration to be followed by more rigorous, supervised analyses on individual clusters of the data. However, the more conventional levels of 0.05 and 0.01 were also evaluated.
Finally, the ability of permutation testing to preserve an acceptable Type I (false positive) error rate was evaluated at the three specified significance levels. A false positive was defined as a clustering result which had a pvalue according to at either of the internal clustering metrics that was significant but had a pvalue according to ARI_{HA} that was not significant. The Type II (false negative) error rate was evaluated at the same alpha levels to determine the sensitivity for detecting trait heterogeneity when it is present. A false negative was defined as a clustering result which had a pvalue that was not significant according to both of the internal clustering metrics but had a pvalue that was significant according to ARI_{HA}.
Abbrevations
 ARIHA:

HubertArabie Adjusted Rand Index
 THO:

Trait Heterogeneity Only
 THL:

Trait Heterogeneity with Locus Heterogeneity
 THG:

Trait Heterogeneity with GeneGene Interaction
 THB:

Trait Heterogeneity with Both Locus Heterogeneity and GeneGene Interaction
Declarations
Acknowledgements
This work has been funded by NLM training grant T32 MH64913 and by NIH grants HL65234, AI59694, N532830, A619085, HD047447 and RR018787. The authors would like to acknowledge the contribution of Scott M. Dudek, who programmed the Fuzzy kModes Clustering algorithm. He was funded by NIH grant HL065962. The authors would also like to thank Marylyn Ritchie, Lance Hahn and Bill White for their thoughtful input on study design.
Authors’ Affiliations
References
 ThorntonWells TA, Moore JH, Haines JL: Genetics, statistics and human disease: analytical retooling for complexity. Trends Genet 2004, 20: 640–647.View ArticlePubMedGoogle Scholar
 Rivolta C, Sharon D, DeAngelis MM, Dryja TP: Retinitis pigmentosa and allied diseases: numerous diseases, genes, and inheritance patterns. Hum Mol Genet 2002, 11: 1219–1227.View ArticlePubMedGoogle Scholar
 Kulczycki LL, Kostuch M, Bellanti JA: A clinical perspective of cystic fibrosis and new genetic findings: relationship of CFTR mutations to genotypephenotype manifestations. Am J Hum Genet 2003, 116A: 262–267.View ArticleGoogle Scholar
 Povey S, Burley MW, Attwood J, Benham F, Hunt D, Jeremiah SJ, Franklin D, Gillett G, Malas S, Robson EB, Tippett P, Edwards JH, Kwiatkowski DJ, Super M, Mueller R, Fryer A, Clarke A, Webb D, Osborne J: Two loci for tuberous sclerosis: one on 9q34 and one on 16p13. Ann Hum Genet 1994, 58: 107–127.View ArticlePubMedGoogle Scholar
 Young J, Povey S: The genetic basis of tuberous sclerosis. Mol Med Today 1998, 4: 313–319.View ArticlePubMedGoogle Scholar
 Harding AE: The clinical features and classification of the late onset autosomal dominant cerebellar ataxias: a study of 11 families, including descendants of 'the Drew family of Walworth.'. Brain 1982, 105: 1–28.View ArticlePubMedGoogle Scholar
 Rosenberg RN: Autosomal dominant cerebellar phenotypes: the genotype has settled the issue. Neurology 1995, 45: 1–5.View ArticlePubMedGoogle Scholar
 Devos D, SchraenMaschke S, Vuillaume I, Dujardin K, Naze P, Willoteaux C, Destee A, Sablonniere B: Clinical features and genetic analysis of a new form of spinocerebellar ataxia. Neurology 2001, 56: 234–238.View ArticlePubMedGoogle Scholar
 TagerFlusberg H, Joseph RM: Identifying neurocognitive phenotypes in autism. Philos Trans R Soc Lond B Biol Sci 2003, 358: 303–314.PubMed CentralView ArticlePubMedGoogle Scholar
 Bradford Y, Haines JL, Hutcheson H, Gardiner M, Braun T, Sheffield V, Cassavant T, Huang W, Wang K, Vieland V, Folstein S, Santangelo S, Piven J: Incorporating language phenotypes strengthens evidence of linkage to autism. Am J Med Genet 2001, 105: 539–547.View ArticlePubMedGoogle Scholar
 Shao Y, Raiford KL, Wolpert CM, Cope HA, Ravan SA, AshleyKoch AA, Abramson RK, Wright HH, DeLong RG, Gilbert JR, Cuccaro ML, PericakVance MA: Phenotypic homogeneity provides increased support for linkage on chromosome 2 in autistic disorder. Am J Hum Genet 2002, 70: 1058–1061.PubMed CentralView ArticlePubMedGoogle Scholar
 Carrasquillo MM, McCallion AS, Puffenberger EG, Kashuk CS, Nouri N, Chakravarti A: Genomewide association study and mouse model identify interaction between RET and EDNRB pathways in Hirschsprung disease. Nat Genet 2002, 32: 237–244.View ArticlePubMedGoogle Scholar
 Dohura K, Tateishi J, Sasaki H, Kitamoto T, Sakaki Y: Protoleu change at position 102 of prion protein is the most common but not the sole mutation related to GerstmannStraussler syndrome. Biochem Biophys Res Comm 1989, 163: 974–979.View ArticlePubMedGoogle Scholar
 Owen F, Poulter M, Collinge J, Crow TJ: A codon 129 polymorphism in the PRIP gene. Nucleic Acids Res 1990, 18: 3103.PubMed CentralView ArticlePubMedGoogle Scholar
 Collinge J, Palmer MS, Dryden AJ: Genetic predisposition to iatrogenic CreutzfeldtJakob disease. Lancet 1991, 337: 1441–1442.View ArticlePubMedGoogle Scholar
 Palmer MS, Dryden AJ, Hughes JT, Collinge J: Homozygous prion protein genotype predisposes to sporadic CreutzfeldtJakob disease. Nature 1991, 352: 340–342.View ArticlePubMedGoogle Scholar
 De Silva R, Ironside JW, McCardle L, Esmonde T, Bell J, Will R, Windl O, Dempster M, Estibeiro P, Lathe R: Neuropathological phenotype and 'prion protein' genotype correlation in sporadic CreutzfeldtJakob disease. Neurosci Lett 1994, 179: 50–52.View ArticlePubMedGoogle Scholar
 Dohura K, Kitamoto T, Sakaki Y, Tateishi J: CJD discrepancy. Nature 1991, 353: 801–802.View ArticlePubMedGoogle Scholar
 Goldfarb LG, Brown P, Haltia M, Cathala F, McCombie WR, Kovanen J, Cervenakova L, Goldin L, Nieto A, Godec MS, Asher DM, Gajdusek DC: CreutzfeldtJakob disease cosegregates with the codon 178Asn PRNP mutation in families of European origin. Ann Neurol 1992, 31: 274–281.View ArticlePubMedGoogle Scholar
 Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K: A comprehensive review of genetic association studies. Genet Med 2002, 4: 45–61.View ArticlePubMedGoogle Scholar
 Ott J: Strategies for characterizing highly polymorphic markers in human gene mapping. Am J Hum Genet 1992, 51: 283–290.PubMed CentralPubMedGoogle Scholar
 Smith CAB: Testing for heterogeneity of recombination fraction values in human genetics. Annals of Human Genetics 1963, 27: 175–182.View ArticlePubMedGoogle Scholar
 Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet 2001, 69: 138–147.PubMed CentralView ArticlePubMedGoogle Scholar
 Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting genegene interactions in the presence of genotyping error, phenocopy and genetic heterogeneity. Genet Epidemiol 2003, 24: 150–157.View ArticlePubMedGoogle Scholar
 Moore JH: The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered 2003, 56: 73–82.View ArticlePubMedGoogle Scholar
 Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Change M, Chen Y, Cheng X, Chua G, Friesen H, Goldberg DS, Haynes J, Humphries C, He G, Hussein S, Ke L, Krogan N, Li Z, Levinson JN, Lu H, Menard P, Munyana C, Parsons AB, Ryan O, Tonikan R, Roberts T, Sdicu A, Shapiro J, Sheikh B, Suter B, Wong SL, Zhang LV, Zhu H, Gurd CG, Numro S, Sander C, Rine J, Greenblatt J, Peter M, Bretscher A, Bell G, Roth FP, Brown GW, Andrews B, Bussey H, Boone C: Global mapping of the yeast genetic interaction network. Science 2004, 303: 808–813.View ArticlePubMedGoogle Scholar
 Moore JH: A global view of epistasis. Nat Genet 2005, 37: 13–14.View ArticlePubMedGoogle Scholar
 Moore JH, Williams SM: Traversing the conceptual divide between biological and statistical epistasis: Systems biology and a more modern synthesis. Bioessays 2005, 27: 637–646.View ArticlePubMedGoogle Scholar
 Slonim DK: From patterns to pathways: gene expression data analysis comes of age. Nat Genet Suppl 2002, 32: 502–508.View ArticleGoogle Scholar
 Mountain JL, CavalliSforza LL: Multilocus genotypes, a tree of individuals, and human evolutionary history. Am J Hum Genet 1997, 61: 705–718.PubMed CentralView ArticlePubMedGoogle Scholar
 Ritchie MD, White B, Parker JS, Hahn LW, Moore JH: Optimization of neural network architecture improves the power to identify genegene interaction in common diseases. BMC Bioinformatics 2003, 4: 28.PubMed CentralView ArticlePubMedGoogle Scholar
 Hauser ER, Watanabe RM, Duren WL, Bass MP, Langefeld CD, Boehnke M: Ordered subset analysis in genetic linkage mapping of complex traits. Genet Epidemiol 2004, 27: 53–63.View ArticlePubMedGoogle Scholar
 Hoh J, Wille A, Ott J: Trimming, Weighting, and Grouping SNPs in Human CaseControl Association Studies. Genome Res 2001, 11: 2115–2119.PubMed CentralView ArticlePubMedGoogle Scholar
 Ott J, Hoh J: Set association analysis of SNP casecontrol and microarray data. J Comput Biol 2003, 10: 569–574.View ArticlePubMedGoogle Scholar
 Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting genegene and geneenvironment interactions. Bioinformatics 2003, 19: 376–382.View ArticlePubMedGoogle Scholar
 Hahn LW, Moore JH: Ideal discrimination of discrete clinical endpoints using multilocus genotypes. In Silico Biol 2004, 4: 183–94.PubMedGoogle Scholar
 Moore JH: Computational analysis of genegene interactions using multifactor dimensionality reduction. Expert Rev Mol Diagn 2004, 4: 795–803.View ArticlePubMedGoogle Scholar
 Anderberg MR: Cluster Analysis for Applications. New York: Academic Press; 1973.Google Scholar
 Hanson R, Stutz J, Cheeseman P: Bayesian classification theory. In Technical Report # FIA90–12–701. Artificial Intelligence Research Branch, NASA Ames Research Center; 1991.Google Scholar
 Han EH, Karypis G, Kumar V, Mobasher B: Clustering Based on Association Rule Hypergraphs. Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery: 1997; Tucson 1997, 9–13.Google Scholar
 Huang Z, Ng MK: A fuzzy kmodes algorithm for clustering categorical data. IEEE Trans Fuzzy Syst 1999, 7: 446–452.View ArticleGoogle Scholar
 Hubert L, Arabie P: Comparing partitions. J Classif 1985, 2: 193–218.View ArticleGoogle Scholar
 Han EH, Karypis G, Kumar V, Mobasher B: Clustering in High Dimensional Space Using Hypergraph Models. In Technical Report #97–063. Computer Science and Engineering, University of Minnesota; 1997.Google Scholar
 Duda RO, Hart PE: Pattern Classification and Scene Analysis. New York: John Wiley and Sons; 1973.Google Scholar
 Flegal KM, Carroll MD, Kuczmarski RJ: Overweight and obesity in the United States: prevalence and trends, 1960–1994. Int J Obe Relat Metab Disord 1998, 22: 39–47.View ArticleGoogle Scholar
 Narod SA, Dupont A, Cusan L, Diamond P, Gomez JL, Suburu R, Labrie F: The impact of family history on early detection of prostate cancer. Nat Med 1995, 1: 99–101.View ArticlePubMedGoogle Scholar
 Schultz S, Andreasen N: Schizophrenia. Lancet 1999, 353: 1425–1430.View ArticlePubMedGoogle Scholar
 Kurtzke JF: Multiple sclerosis: changing times. Neuroepidemiology 1991, 10: 1–8.View ArticlePubMedGoogle Scholar
 Li WT, Reich J: A complete enumeration and classification of twolocus disease models. Human Heredity 2000, 50: 334–349.View ArticlePubMedGoogle Scholar
 Frankel WN, Schork NJ: Who's afraid of epistasis? Nat Genet 1996, 14: 371–373.View ArticlePubMedGoogle Scholar
 Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, Inc; 1990.View ArticleGoogle Scholar
 Cheeseman P, Stutz J: Bayesian Classification (AutoClass): Theory and Results. In Advances in Knowledge Discovery and Data Mining. Edited by: Fayyad UM, PiatetskyShapiro G, Smyth P, Uthurusamy R. Menlo Park: The AAAI Press; 1996.Google Scholar
 Seno M, Karypis G: LPMiner: An Algorithm for Finding Frequent Itemsets Using LengthDecreasing Support Constraint. Proceedings of the IEEE Conference on Data Mining: 2001; IEEE Computer Society 2001, 505–512.View ArticleGoogle Scholar
 Steinley D: Properties of the HubertArabie Adjusted Rand Index. Psychol Methods 2004, 9: 386–396.View ArticlePubMedGoogle Scholar
 Good P: Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York: Springer; 2000.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.