Open Access

Optimizationof neural network architecture using genetic programming improvesdetection and modeling of gene-gene interactions in studies of humandiseases

  • Marylyn D Ritchie1,
  • Bill C White1,
  • Joel S Parker1,
  • Lance W Hahn1 and
  • Jason H Moore1Email author
BMC Bioinformatics20034:28

https://doi.org/10.1186/1471-2105-4-28

Received: 12 March 2003

Accepted: 07 July 2003

Published: 07 July 2003

Abstract

Background

Appropriate definitionof neural network architecture prior to data analysis is crucialfor successful data mining. This can be challenging when the underlyingmodel of the data is unknown. The goal of this study was to determinewhether optimizing neural network architecture using genetic programmingas a machine learning strategy would improve the ability of neural networksto model and detect nonlinear interactions among genes in studiesof common human diseases.

Results

Using simulateddata, we show that a genetic programming optimized neural network approachis able to model gene-gene interactions as well as a traditionalback propagation neural network. Furthermore, the genetic programmingoptimized neural network is better than the traditional back propagationneural network approach in terms of predictive ability and powerto detect gene-gene interactions when non-functional polymorphismsare present.

Conclusion

This study suggeststhat a machine learning strategy for optimizing neural network architecturemay be preferable to traditional trial-and-error approaches forthe identification and characterization of gene-gene interactionsin common, complex human diseases.

Background

The detection and characterization of genes associated with common,complex diseases is a difficult challenge in human genetics. Unlikerare genetic disorders which are easily characterized by a singlegene, common diseases such as essential hypertension are influencedby many genes all of which may be associated with disease risk primarilythrough nonlinear interactions [1, 2]. Gene-gene interactionsare difficult to detect using traditional parametric statisticalmethods [2] because of thecurse of dimensionality [3]. That is, wheninteractions among genes are considered, the data becomes too sparseto estimate the genetic effects. To deal with this issue, one can collecta very large sample size. However, this can be prohibitively expensive.The alternative is to develop new statistical methods that haveimproved power to identify gene-gene interactions in relativelysmall sample sizes.

Neural networks (NN) have been used for supervised pattern recognitionin a variety of fields including genetic epidemiology [412].The success of the NN approach in genetics, however, varies a greatdeal from one study to the next. It is hypothesized that for complexhuman diseases, we are dealing with a rugged fitness landscape [9, 13],that is, a fitness landscape with many local minima which make itmore difficult to find the global minimum. Therefore, the inconsistentresults in genetic epidemiology studies may be attributed to thefact that there are many local minima in the fitness landscape.Training a NN involves minimizing an error function. When there aremany local minima, a NN using a hill-climbing algorithm for optimizationmay stall on a different minimum on each run of the network [9].

To avoid stalling on local minima, machine learning methods suchas genetic programming [14] and genetic algorithms [15] have been explored. Genetic programming(GP) is a machine learning methodology that generates computer programsto solve problems using a process that is inspired by biologicalevolution by natural selection [1620].Genetic programming begins with an initial population of randomlygenerated computer programs, all of which are possible solutionsto a given problem. This step is essentially a random search orsampling of the space of all possible solutions. An example of onetype of computer program, called a binary expression tree, is shownin Figure 1.
Figure 1

Binary expression tree example of a GP solution. Thisfigure is an example of a possible computer program generated byGP. While the program can take virtually any form, we are usinga binary expression tree representation, thus we have shown thistype as an example.

Next, each of these computer programs are executed and assigneda fitness value that is proportional to its performance on the particularproblem being solved. Then, the best computer programs, or solutions,are selected to undergo genetic operations based on Darwin's principle ofsurvival of the fittest. Reproduction takes place with a subsetof the best solutions, such that these solutions are directly copiedinto the next generation. Crossover, or recombination, takes placebetween another subset of solutions. This operation is used to createnew computer programs by combining components of two parent programs.An example of a crossover event between two solutions is shown inFigure 2.
Figure 2

GP crossover. This figure shows a crossover eventin GP between two binary expression trees. Here, the left sub-treeof parent 1 is swapped with the left sub-tree of parent 2 to create2 new trees.

Thus, the new population is comprised of a portion of solutionsthat were copied (reproduced) from the previous generation, anda portion of solutions that are the result of recombination (crossover)between solutions of the parent population. This new populationreplaces the old population and the process begins again by executing eachprogram and assigning a fitness measure to each of them. This isrepeated for a set number of generations or until some terminationcriterion is met. The goal is to find the best solution, which islikely to be the solution with the optimal fitness measure.

Koza [17, 18] and Koza et al. [19] give a detailed description of GP.A description of GP for bioinformatics applications is given byMoore and Parker [13]. Evolutionary computationstrategies such as GP have been shown to be effective for both variableand feature selection with methods such as symbolic discriminantanalysis for microarray studies [13, 16, 21]. Koza and Rice [14] have suggested GP for optimizing thearchitecture of NN.

The goal of this study was to implement a GP for optimizing NNarchitecture and compare its ability to model and detect gene-geneinteractions with a traditional back propagation NN. To achievethis goal we simulated data from a set of different models exhibitingepistasis, or gene-gene interactions. We applied the back propagationNN (BPNN) and the GP optimized NN (GPNN) to these data to comparethe performance of the two methods. We considered the GPNN to haveimproved performance if 1) it had improved prediction error, and/or2) it had improved power compared to the BPNN. We implemented crossvalidation with both the BPNN and GPNN to estimate the classificationand prediction errors of the two NN approaches. Based on our results,we find that GPNN has improved prediction error and improved powercompared to the BPNN. Thus, we consider the optimization of NN architectureby GP to be a significant improvement to a BPNN for the gene-geneinteraction models and simulated data studied here. In the remainderof this section, we will provide background information on the BPNN, theGPNN strategy, and the cross validation approach used for this study.

Back propagation NN (BPNN)

The back propagation NN (BPNN) is one of the most commonly usedNN [22] and is the NNchosen for many genetic epidemiology studies [412].In this study, we used a traditional fully-connected, feed-forwardnetwork comprised of one input layer, zero, one, or two hidden layers,and one output layer, trained by back propagation. The softwareused, the NICO toolkit, was developed at the Royal Institute ofTechnology, http://www.speech.kth.se/NICO/index.html.

Defining the network architecture is a very important decisionthat can dramatically alter the results of the analysis [23]. There are a variety of strategiesutilized for selection of the network architecture. The featuresof network architecture most commonly optimized are the number of hiddenlayers and the number of nodes in the hidden layer. Many of theseapproaches use a prediction error fitness measure, such that theyselect an architecture based on its generalization to new observations [23], while others use classification error,or training error [24]. We chose to useclassification error rather than prediction error as a basis forevaluating and making changes to the BPNN architecture because weuse prediction error to measure the overall network fitness. Webegan with a very small network and varied several parameters includingthe number of hidden layers, number of nodes in the hidden layer,and learning momentum (the fraction of the previous change in aweight that is added to the next change) to obtain an appropriatearchitecture for each data set. This trial-and-error approach iscommonly employed for optimization of BPNN architecture [24, 25].

A Genetic Programming Neural Network (GPNN) Strategy

We developed a GP-optimized NN (GPNN) in an attempt to improveupon the trial-and-error process of choosing an optimal architecturefor a pure feed-forward BPNN. The GPNN optimizes the inputs froma larger pool of variables, the weights, and the connectivity ofthe network including the number of hidden layers and the numberof nodes in the hidden layer. Thus, the algorithm attempts to generateappropriate network architecture for a given data set. Optimizationof NN architecture using GP was first proposed by Koza and Rice[14].

The use of binary expression trees allow for the flexibility ofthe GP to evolve a tree-like structure that adheres to the componentsof a NN. Figure 3 shows an exampleof a binary expression tree representation of a NN generated byGPNN. Figure 4 shows the sameNN that has been reduced from the binary expression tree form tolook more like a common feed-forward NN. The GP is constrained insuch a way that it uses standard GP operators but retains the typicalstructure of a feed-forward NN. A set of rules is defined priorto network evolution to ensure that the GP tree maintains a structurethat represents a NN. The rules used for this GPNN implementationare consistent with those described by Koza and Rice [14]. The flexibility of the GPNN allowsoptimal network architectures to be generated that contain the appropriate inputs,connections, and weights for a given data set.
Figure 3

GPNN representation of a NN. This figure is an exampleof one NN optimized by GPNN. The O is the output node, S indicatesthe activation function, W indicates a weight, and X1-X4 arethe NN inputs.

Figure 4

Feed-forward BPNN representation of the GPNN in Figure3. To generate this NN, each weight in Figure 3 was computed to produce a single value.

The GP has a set of parameters that must be initialized beforebeginning the evolution of NN models. First, a distinct set of inputsmust be identified. All possible variables can be included as optionalinputs, although the GP is not required to use all of them. Second,a set of mathematical functions used in weight generation must bespecified. In the present study, we use only the four basic arithmetic operators.Third, a fitness function for the evaluation of GPNN models is definedby the user. Here, we have designated classification error as thefitness function. Finally, the operating parameters of the GP mustbe initialized. These include initial population size, number ofgenerations, reproduction rate, crossover rate, and mutation rate [14].

Training the GPNN begins by generating an initial random populationof solutions. Each solution is a binary expression tree representationof a NN, similar to that shown in Figure 3.The GP then evaluates each NN. The best solutions areselected for crossover and reproduction using a fitness-proportionateselection technique, called roulette wheel selection, based on theclassification error of the training data [26].A predefined proportion of the best solutions will bedirectly copied (reproduced) into the new generation. Another proportionof the solutions will be used for crossover with other best solutions.Crossover must take place such that the rules of network constructionstill apply. Next, the new generation, which is equal in size tothe original population, begins the cycle again. This continuesuntil some criterion is met at which point the GPNN stops. Thiscriterion is either a classification error of zero or the maximumnumber of generations has been reached. In addition, a "best-so-far"solution is chosen after each generation. At the end of the GP run,the one "best-so-far" solution is used as the solution to the problem[14, 16].

While GPNN can be effective for searching highly nonlinear, multidimensionalsearch spaces, it is still susceptible to stalling on local minima [14]. To address this problem, GPNN canbe run in parallel on several different processors. Several isolatedpopulations, or demes, are created and a periodic exchange of bestsolutions takes place between the populations. This is often referredto as an "island model" [27]. This exchangeincreases diversity among the solutions in the different populations.Following the set number of generations, the best-so-far solutionsfrom each of the n processors are compared and a singlebest solution is selected. This solution has the minimum classificationerror of all solutions generated [28].

Cross-Validation

While NN are known for their ability to model nonlinear data,they are also susceptible to over-fitting. To evaluate the generalizabilityof BPNN and GPNN models, we used 10 fold cross-validation [29, 30] (CV).Here, the data are divided into 10 parts of equal size. We use 9/10of the data to train the BPNN or the GPNN, and we use the remaining 1/10of data to test the model and estimate the prediction error, whichis how well the NN model is able to predict disease status in that1/10 of the data. This is done 10 times, each time leaving outa different 1/10 of data for testing [29, 30].A prediction error is estimated as an average across the 10 cross-validations.Cross-validation consistency is a measure of the number of timeseach variable appears in the BPNN or GPNN model across the 10 cross-validations [21, 31, 32]. That is, we measuredthe consistency with which each single nucleotide polymorphism (SNP)was identified across the 10 cross-validations. The motivation forthis statistic is that the effects of the functional SNPs shouldbe present in most splits of the data. Thus, a high cross-validationconsistency (~10) lends support to that SNP being important forthe epistasis model. Further detail regarding the implementationof cross-validation can be found in the Data Analysis section ofthe paper.

Results

Three different analyses were conducted to compare the performanceof the BPNN and the GPNN. First, a trial-and-error procedure forselecting the optimal BPNN architecture for a subset of the datawas performed. This step established the motivation for optimizingthe NN architecture. Next, the ability to model gene-gene interactions byboth the BPNN and GPNN was determined by comparing the classificationand prediction error of the two methods using data containing onlythe functional SNPs. Finally, the ability to detect and model gene-geneinteractions was investigated for both the BPNN and GPNN. This wasdetermined by comparing the classification and prediction errorsof the two methods using data containing the functional SNPs anda set of non-functional SNPs. The details of the implementationof these three analyses are reported in the Data Analysis sectionof the paper. The results of these three analyses are presentedin the following sections.

Trial-and-error procedure for the back propagation NN (BPNN)

The results of our trial-and-error optimization technique forselecting the traditional BPNN architecture indicate the importanceof selecting the optimal NN architecture for each different epistasismodel. Table 1 shows the resultsfrom the traditional BPNN architecture optimization technique onone data set from each epistasis model generated with the two functionalSNPs only. A schematic for each of the best architectures selectedis shown in Figure 5. Note thatthe optimal architecture varied for each of the epistasis models.For example, the optimal architecture for Model 1 was composed of2 hidden layers, 15 nodes in the first layer and 5 nodes in thesecond layer, and a momentum of 0.9. In contrast, for Model 2 theoptimal architecture included 2 hidden layers, 20 nodes in the firstlayer and 5 nodes in the second layer, and a momentum of 0.9. Differentarchitectures were optimal for models 3, 4, and 5 as well. Similarresults were obtained using the simulated data containing the twofunctional and eight non-functional SNPs as well (data not shown).This provides further motivation for automating the optimizationof the NN architecture in order to avoid the uncertainty of trial-and-errorexperiments.
Table 1

Trial and ErrorOptimization of BPNN with only functional SNPs

HL

U/L

M

Epistasis Models

   

1

2

3

4

5

0

0

.1

0.48786

0.49460

0.49560

0.49160

0.49545

0

0

.5

0.48786

0.49460

0.49560

0.49160

0.49545

0

0

.9

0.48786

0.49460

0.49560

0.49160

0.49545

1

5

.1

0.47317

0.45883

0.49568

0.49160

0.49553

1

5

.5

0.36422

0.34229

0.48754

0.49010

0.49543

1

5

.9

0.31206

0.23181

0.34522

0.44670

0.48905

1

10

.1

0.47430

0.46820

0.49607

0.49150

0.49559

1

10

.5

0.35916

0.36446

0.49284

0.49020

0.49542

1

10

.9

0.31209

0.23193

0.34524

0.44660

0.49136

1

15

.1

0.48495

0.47508

0.49599

0.49160

0.49552

1

15

.5

0.37511

0.36221

0.49364

0.49150

0.49542

1

15

.9

0.31217

0.23203

0.34525

0.44670

0.49399

1

20

.1

0.48630

0.49240

0.49583

0.49160

0.49549

1

20

.5

0.40750

0.34406

0.49469

0.49070

0.49544

1

20

.9

0.31217

0.23216

0.34511

0.44660

0.49402

2

5:5

.1

0.49965

0.49997

0.49997

0.50000

0.49996

2

5:5

.5

0.49628

0.49980

0.49996

0.49990

0.49995

2

5:5

.9

0.31205

0.23704

0.41740

0.44670

0.49471

2

10:5

.1

0.49623

0.49980

0.49987

0.49980

0.49972

2

10:5

.5

0.49024

0.49854

0.49929

0.49950

0.49847

2

10:5

.9

0.31201

0.23158

0.35430

0.45450

0.49477

2

15:5

.1

0.49398

0.49944

0.49954

0.49940

0.49913

2

15:5

.5

0.48697

0.49578

0.49850

0.49530

0.49700

2

15:5

.9

0.31199

0.23584

0.35993

0.44740

0.49465

2

20:5

.1

0.49160

0.49849

0.49946

0.49840

0.49889

2

20:5

.5

0.48700

0.49212

0.49808

0.49290

0.49596

2

20:5

.9

0.31199

0.23157

0.34657

0.44750

0.49519

Results from the trial and error optimizationof the BPNN on one dataset from each epistasis model. We used 27different architectures varying in HL – hidden layer, U/L – unitsper layer, M – momentum. The average classification error across10 cross-validations from each data set generated for each of thefive epistasis models are shown. The best architecture is shownin bold and in Figure 5. Thiswas the most parsimonious architecture with the minimum classificationerror.

Figure 5

Optimal architecture from BPNN trial and error optimization. Thisfigure shows the result of the BPNN trial and error procedure onone data set from each epistasis model. This shows the NN architecturefor the best classification error selected from Table 1.

Modeling gene-gene interactions

Table 2 summarizes the averageclassification error (i.e. training error) and prediction error(i.e. testing error) for the BPNN and GPNN evaluated using 100 datasets for each of the five epistasis models with only the two functionalSNPs in the analyses. GPNN and the BPNN performed similarly inboth training the NN and testing the NN through cross-validation.In each case, the NN solution had an error rate within 4 percentof the error inherent in the data. Due to the probabilistic natureof the penetrance functions used to simulate the data,there is some degree of noise simulated in the data. The average errorin the 100 datasets generated under each epistasis model are 24%,18%, 27%, 36%, 40% for Model 1, 2, 3, 4, and 5, respectively. Therefore,the error estimates obtained by the BPNN or GPNN are very closeto the true error rate. There is also little opportunity for either methodto over-fit the data here, since only the functional SNPs are presentin the analysis. These results demonstrate that the BPNN and GPNNare both able to model the nonlinear gene-gene interactions specifiedby these models.
Table 2

Results from theBPNN and GPNN analyses of datasets with only functional SNPs.

Epistasis Model

BPNN

GPNN

 

Classification Error

Prediction Error

Classification Error

Prediction Error

1

0.238

0.233

0.237

0.237

2

0.180

0.179

0.181

0.181

3

0.268

0.268

0.287

0.301

4

0.370

0.386

0.375

0.398

5

0.405

0.439

0.405

0.439

Detecting and modeling gene-gene interactions

Table 3 shows the average classificationerror and prediction error for the BPNN and GPNN evaluated using100 data sets for each of the five epistasis models with the two functionaland eight non-functional SNPs. GPNN consistently had a lower predictionerror than the BPNN, while the BPNN consistently had a lower classificationerror. The lower classification error seen with the BPNN is due toover-fitting. These results show that when non-functional SNPs arepresent, the BPNN has a tendency to over-fit the data and thereforehave a higher prediction error than GPNN. GPNN is able to modelgene-gene interactions and develop NN models that can generalizeto new observations.
Table 3

Results from theBPNN and GPNN analyses of data sets with both functional and non-functionalSNPs

Epistasis Model

BPNN

GPNN

 

Classification Error

Prediction Error

Classification Error

Prediction Error

1

0.008

0.340

0.237

0.237

2

0.008

0.303

0.242

0.243

3

0.009

0.398

0.335

0.360

4

0.013

0.477

0.387

0.431

5

0.012

0.486

0.401

0.479

Table 4 shows the power todetect the two functional SNPs for GPNN and the BPNN using 100 datasets for each of the five epistasis models with both the two functionaland eight non-functional SNPs in the analyses. For all five models,GPNN had a higher power than the BPNN to detect both SNPs. Theseresults demonstrate the high power of GPNN in comparison to a BPNNwhen attempting to detect gene-gene interactions in the presenceof non-functional SNPs.
Table 4

Power (%) to detecteach functional SNP by BPNN and GPNN analyses of data sets withboth functional and non-functional SNPs

Epistasis Model

BPNN

GPNN

 

SNP 1

SNP 2

SNP1

SNP 2

1

88

90

100

100

2

80

82

100

100

3

41

50

100

100

4

3

0

92

87

5

0

1

44

47

Discussion

We have implemented a NN that is optimized by GP using the approachoutlined by Koza and Rice [14]. Based on theresults of the trial and error architecture optimization, wehave shown that the selection of optimal NN architecture can alterthe results of data analyses. For one example data set from eachof the epistasis models, the best architecture was quite different,as shown in Table 1 and Figure 5 for functional SNP only data. This wasalso the case for data containing functional and non-functionalSNPs (data not shown). Since we only tried 27 different architectures,there may have been more appropriate architectures for these datasets. In fact, since enumeration of all possible NN architecturesis impossible [24], there is no wayto be certain that the global best architecture is ever selected.Thus, the ability to optimize the NN architecture using GPNN maydramatically improve the results of NN analyses.

Using simulated data, we demonstrated that GPNN was able to modelnonlinear interactions as well as a traditional BPNN. These resultsare important because it is well known that traditional BPNN areuniversal function approximators [22].When given the functional SNPs, one would expect the BPNN to accuratelymodel the data. Here, we have shown that GPNN is also capable ofaccurately modeling the data. This demonstrates that GPNN is ableto optimize the NN architecture such that the NN evolved is ableto model data as well as a BPNN.

GPNN had improved power and predictive ability compared to aBPNN when applied to data containing both functional and non-functionalSNPs. These results provide evidence that GPNN is able to detectthe functional SNPs and model the interactions for the epistasismodels described here with minimal main effects in a sample size of200 cases and 200 controls. In addition, these are the two criteriawe specified for considering GPNN an improvement over the BPNN.Therefore, in situations where the functional SNPs are known andthe user is attempting to model the data, either GPNN or a BPNN wouldbe equally useful. However, in situations where the functional SNPsare unknown and the user wants to perform variable selection aswell as model fitting, GPNN may be the preferable method. This distinctioncan be made due to the increase in power of GPNN and the fact thatGPNN does not over-fit the data much like the traditional BPNN.

We speculate that GPNN is not over-fitting because while theGPNN is theoretically able to build a tree with all of the variablesas inputs, it is not building a fully connected NN. This may bepreventing it from over-fitting the way that the BPNN has done.Secondly, it is possible that the strong signal of the correct solutionin the simulated data caused the GP to quickly pick up that signal,and propagate trees with components of that model in the population.Because we have mutation set to 0%, there is never a large changeto the trees to lead them to explore in the other areas of the searchspace. As a result, the GPNN converges quickly on a small solutioninstead of exploring the entire search space. We plan to explorewhether GPNN over-fits in certain situations and if so, to developstrategies to deal with this issue, such as the three-way data split discussedby Roland [33].

In an attempt to estimate the power of these NN methods for arange of genetic effects, we selected epistasis models with varyingdegrees of heritability. Heritability, in the broad sense, is theproportion of total phenotypic variance attributable to geneticfactors. Thus, higher heritability values will have a stronger geneticeffect. The five disease models we used varied in heritability from0.008 to 0.053. To calculate the heritability of these models, we usedthe formula described by Culverhouse et al. [34]. Heritabilityvaries from 0.0 (no genetic component) to 1.0 (completely genetically-specified).

We selected models with varying heritability values to obtaina more accurate comparison of the two NN methods in the presenceof different genetic effects. Interestingly, the results showedthat GPNN had greater than 80% power for all heritability valuestested in the range of 0.012 to 0.053. In addition, GPNN had 100%power for all models with a heritability of 0.026 or greater.However, the BPNN had greater than 80% power only for heritabilityvalues greater than 0.051. Therefore, the BPNN has low power todetect gene-gene interactions that have an intermediate to weakgenetic effect (i.e. heritability value in the range from 0.008to 0.026), while GPNN maintains greater than 80% power, even foran epistasis model with a relatively weak genetic effect (i.e. 0.012). Thepower of GPNN falls below 80% for a heritability value that is verysmall (i.e. 0.008). Thus, GPNN is likely to have higher power thanthe BPNN for detecting many gene-gene interaction models with intermediateto small genetic effects.

While GPNN has improved power and predictive ability over theBPNN, there are some advantages and disadvantages to this approach.An advantage of GPNN is its modeling flexibility. With commercialBPNN software, such as the BPNN used here, the user must definethe inputs, the initial values of the weights, the number of connections eachinput has, and the number of hidden layers. Often, the algorithmparameters that work well for one data set will not be successfulwith another data set, as demonstrated here. With the GP optimization,the user only needs to specify a pool of variables that the networkcan use and the GP will select the optimal inputs, weights, connections,and hidden layers. An important disadvantage of GPNN is the requiredcomputational resources. To use GPNN effectively, one needs accessto a parallel processing environment. For a BPNN, on the other hand, adesktop computer is the only requirement. Another disadvantage isthe interpretation of the GPNN models. The output of GPNN is a NNin the form of a binary expression tree. A NN in this form can bedifficult to interpret, as it can get quite large (up to 500 nodes).

While we have demonstrated the ability of GPNN to model and detectgene-gene interactions, further work is needed to fully evaluatethe approach. For example, we would like to know whether using alocal search algorithm, such as back propagation or simulated annealing [35], to refine the weights of a GPNN modelis useful. This sort of approach has been employed for a geneticalgorithm approach to optimizing NN architecture for classificationof galaxies in astronomy [36]. However, as describedabove, a local search could lead to increased over-fitting. Next,the current version of GPNN uses only arithmetic operators in thebinary expression trees. The use of a richer function set, includingBoolean operators and other mathematical operators, may allow moreflexibility in the NN models. Third, we would like to evaluate thepower of GPNN for a variety of high order epistasis models (suchas three, four, and five locus interaction models). Finally, wewould like to develop and distribute a GPNN software package.

Conclusions

The results of this study demonstrate that GP is an excellentway of automating NN architecture design. The NN inputs, weights,and interconnections are optimized for a specific problem whiledecreasing susceptibility to over-fitting which is common in thetraditional BPNN approach. We have shown that GPNN is able to model gene-geneinteractions as well as a BPNN in data containing only the functionalSNPs. We have also shown that when there are nonfunctional SNPsin the data (i.e. potential false positives), GPNN has higher powerthan a BPNN, in addition to lower prediction error. We anticipatethis will be an important pattern recognition method in the searchfor complex disease susceptibility genes.

Methods

Data simulation

The goal of the data simulation was to generate data sets thatexhibit gene-gene interactions for the purpose of evaluating theclassification error, prediction error, and power of GPNN and atraditional BPNN. As discussed by Templeton [1],epistasis, or gene-gene interaction occurs when the combined effectof two or more genes on a phenotype could not have been predictedfrom their independent effects. Current statistical approaches inhuman genetics focus primarily on detecting the main effects and rarelyconsider the possibility of interactions [1].In contrast, we are interested in simulating data using different epistasismodels that exhibit minimal independent main effects, but producean association with disease primarily through interactions. We simulateddata with two functional single nucleotide polymorphisms (SNPs)to compare GPNN to a BPNN for modeling nonlinear epistasis models.In this study, we use penetrance functions as genetic models.

Penetrance functions model the relationship between genetic variationsand disease risk. Penetrance is defined as the probability of diseasegiven a particular combination of genotypes. We chose five epistasismodels to simulate the data. The first model used was initiallydescribed by Li and Reich [37] and later by Culverhouseet al. [34] and Moore et al. [38] (Table 5).This model is based on the nonlinear XOR function that generatesan interaction effect in which high risk of disease is dependenton inheriting a heterozygous genotype (Aa) from one SNPor a heterozygous genotype from a second SNP (Bb), butnot both. The high-risk genotype combinations are AaBB, Aabb, AABb,and aaBb with disease penetrance of 0.1 for all four. Thesecond model was initially described by Frankel and Schork [39] and later by Culverhouse et al. [34] and Moore et al. [38] (Table 6).In this model, high risk of disease is dependent on inheriting twoand exactly two high-risk alleles, either "a" or "b"from two different loci. For this model, the high-risk genotypecombinations are AAbb, AaBb, and aaBB withdisease penetrance of 0.1, 0.05, and 0.1 respectively.
Table 5

Penetrance functionsfor Model 1.

 

Table penetrance

Margin penetrance

 

AA (.25)

Aa (.50)

aa (.25)

 

BB (.25)

0.00

0.10

0.00

0.05

Bb (.50)

0.10

0.00

0.10

0.05

bb (.25)

0.00

0.10

0.00

0.05

Margin penetrance

0.05

0.05

0.05

 

Each cell represents the probability of diseasegiven the particular combination of genotypes [p(D|G)]. This modelhas a heritability of 0.053.

Table 6

Penetrance functionsfor Model 2.

 

Table penetrance

Margin penetrance

 

AA (.25)

Aa (.50)

aa (.25)

 

BB (.25)

0.00

0.00

0.10

0.025

Bb (.50)

0.00

0.05

0.00

0.025

bb (.25)

0.10

0.00

0.00

0.025

Margin penetrance

0.025

0.025

0.025

 

Each cell represents the probability of diseasegiven the particular combination of genotypes [p(D|G)]. This modelhas a heritability of 0.051.

The subsequent three models were chosen from a set of epistasismodels described by Moore et al. [38] (Table 7,8,9). All five models were selected becausethey exhibit interaction effects in the absence of any main effectswhen allele frequencies are equal and genotypes are generated usingthe Hardy-Weinberg equation. In addition, we selected models withina range of heritability values. As mentioned previously, heritability,in the broad sense, is the proportion of total phenotypic varianceattributable to genetic factors. Thus, higher heritability valueswill have a stronger genetic effect. We selected models with varyingheritability values to obtain a more accurate comparison of thetwo NN methods in the presence of varying genetic effects. The heritabilitiesare 0.053, 0.051, 0.026, 0.012, and 0.008 for Models 1, 2, 3, 4,and 5 respectively. Although the biological plausibility of thesemodels is unknown, they represent the worst-case scenario for adisease-detection method because they have minimal main effects.If a method works well with minimal main effects, presumably themethod will continue to work well in the presence of main effects.
Table 7

Penetrance functionsfor Model 3.

 

Table penetrance

Margin penetrance

 

AA (.25)

Aa (.50)

aa (.25)

 

BB (.25)

0.00

0.04

0.00

0.02

Bb (.50)

0.04

0.02

0.00

0.02

bb (.25)

0.00

0.00

0.08

0.02

Margin penetrance

0.02

0.02

0.02

 

Each cell represents the probability of diseasegiven the particular combination of genotypes [p(D|G)]. This modelhas a heritability of 0.026.

Table 8

Penetrance functionsfor Model 4.

 

Table penetrance

Margin penetrance

 

AA (.25)

Aa (.50)

aa (.25)

 

BB (.25)

0.00

0.02

0.08

0.03

Bb (.50)

0.05

0.03

0.01

0.03

bb (.25)

0.02

0.04

0.02

0.03

Margin penetrance

0.03

0.03

0.03

 

Each cell represents the probability of diseasegiven the particular combination of genotypes [p(D|G)]. This modelhas a heritability of 0.012.

Table 9

Penetrance functionsfor Model 5.

 

Table penetrance

Margin penetrance

 

AA (.25)

Aa (.50)

aa (.25)

 

BB (.25)

0.00

0.04

0.08

0.04

Bb (.50)

0.06

0.04

0.02

0.04

bb (.25)

0.04

0.04

0.04

0.04

Margin penetrance

0.04

0.04

0.04

 

Each cell represents the probability of diseasegiven the particular combination of genotypes [p(D|G)]. This modelhas a heritability of 0.008.

Each data set consisted of 200 cases and 200 controls, each withtwo functional interacting SNPs. SNPs generally have two possiblealleles, and in our study, they were simulated with equal allelefrequencies (p = q = 0.5). We used a dummy variableencoding for the genotypes where n-1 dummy variables areused for n levels [40]. We simulated 100data sets of each epistasis model with two functional SNPs. Basedon the dummy coding, these data would have four variables and thusfour NN inputs. Next, we simulated 100 data sets of each model witheight non-functional SNPs and two functional SNPs. Based on the dummycoding, these data would have 20 variables and thus 20 NN inputs.These two types of data sets allow us to evaluate the abilityto either model gene-gene interactions or to detect gene-gene interactions.

Data analysis

In the first stage of analysis, we posed the following question:Is GPNN able to model gene-gene interactions as well or better thana traditional BPNN? First, we used a fully connected, feed-forwardBPNN to model gene-gene interactions in the simulated data containingfunctional SNPs only. The BPNN was trained for 1000 epochs. Althoughthere are an effectively infinite number of possible NN architectures,for each data set we evaluated the classification ability of 27different architectures. We chose the best architecture as the onethat minimized the classification error and was most parsimonious(i.e. simplest network) in the event of two or more with equal classificationerror. We began with a very small network (4 inputs: 1 output)and varied the number of hidden layers (0,1,2), number of nodesin the hidden layers (5:0, 10:0, 15:0, 20:0, 5:5, 10:5, 15:5, 20:5),and learning momentum (0.1, 0.5, 0.9). We used this optimizationprocedure to analyze 100 data sets of each epistasis model. We used 10fold cross-validation to evaluate the predictive ability of theBPNN models. After dividing the data into 10 equal parts, the architectureoptimization procedure is run on the first 9/10 of the data toselect the most appropriate architecture. Next, the best architectureis used to test the BPNN on the 1/10 of the data left out. Thisis done 10 times, each time leaving out a different 1/10 of thedata for testing. We then estimated the prediction error based onthe average predictive ability across the 10 cross-validations forall 100 data sets generated under each epistasis model.

Next, we used the GPNN to analyze the same 100 data sets foreach of the five epistasis models. The GP parameter settings included10 demes, migration of best models from each deme to all otherdemes every 25 generations, each deme had a population size of 200,50 generations, a crossover rate of 0.9, reproduction rate of 0.1,and mutation rate of 0.0. Fitness was defined as classificationerror of the training data. As with the BPNN, GPNN was requiredto use all four inputs in the NN model for this stage of the analysis.Again, we used 10 fold cross-validation to evaluate the predictiveability of the GPNN models. We then estimated the prediction errorof GPNN based on the average prediction error across the 10 cross-validationsfor all 100 data sets for each epistasis model.

The second aspect of this study involves answering the followingquestion: In the presence of non-functional SNPs (i.e. potentialfalse-positives), is GPNN able to detect gene-gene interactionsas well or better than a traditional BPNN? First we used a traditionalBPNN to analyze the data with eight non-functional SNPs and twofunctional SNPs. We estimated the power of the BPNN to detect the functionalSNPs as described below. In this network, all possible inputs areused and the significance of each input is calculated from its inputrelevance, R_I, where R_I is the sum of squared weights for theith input divided by the sum of squared weights for allinputs. Next, we performed 1000 permutations of the data to determinewhat input relevance was required to consider a SNP significantin the BPNN model. The range of critical relevance values for determiningsignificance was 10.43% – 11.83%.

Next, we calculated cross-validation consistency [21, 31, 32]. That is, we measuredthe consistency with which each SNP was identified across the 10cross-validations. The basis for this statistic is that the functional SNPs'effect should be present in most splits of the data. Thus, a highcross-validation consistency (~10) lends support to that SNP beingimportant for the epistasis model. Through permutation testing,we determined an empirical cutoff for the cross-validation consistencythat would not be expected by chance. We used this cut-off valueto select the SNPs that were functional in the epistasis model for eachdata set. For the BPNN, a cross-validation consistency greater thanor equal to five was required to be statistically significant. Weestimated the power by calculating the percentage of datasets wherethe correct functional SNPs were identified. Either one or bothof the dummy variables could be selected to consider a locus presentin the model. Finally, we estimated the prediction error based onthe average predictive ability across 100 data sets for each epistasismodel.

Next, we used the GPNN to analyze 100 data sets for each of theepistasis models. In this implementation, GPNN was not requiredto use all the variables as inputs. Here, GPNN performed randomvariable selection in the initial population of solutions. Throughevolution, GPNN selects those variables that are most relevant.We calculated the cross-validation consistency as described above. Permutationtesting was used to determine an empirical cut-off to select theSNPs that were functional in the epistasis model for each data set.For the GPNN, a cross-validation consistency greater than or equalto seven was required to be statistically significant. We estimatedthe power of GPNN as the number of times the functional SNPs wereidentified in the model divided by the total number of runs. Again,either one or both of the dummy variables could be selected to considera locus present in the model. We also estimated the prediction errorof GPNN based on the average prediction error across 100 data setsper epistasis model.

Declarations

Acknowledgements

This work was supported by National Institutes of Health grantsHL65234, HL65962, GM31304, AG19085, AG20135, and LM007450. We wouldlike to thank two anonymous reviewers for helpful comments and suggestions.

Authors’ Affiliations

(1)
Program in Human Genetics and Department of Molecular Physiology and Biophysics, Vanderbilt University Medical School

References

  1. Templeton AR: Epistasisand complex traits. In: Epistasisand Evolutionary Process (Edited by: Wade M, Brodie III B, Wolf J). Oxford, Oxford University Press 2000.Google Scholar
  2. Moore JH, Williams SM: New strategiesfor identifying gene-gene interactions in hypertension. Ann Med 2002, 34: 88–95.View ArticlePubMedGoogle Scholar
  3. Bellman R: Adaptive ControlProcesses. Princeton, PrincetonUniversity Press 1961.Google Scholar
  4. Bhat A, Lucek PR, Ott J: Analysis ofcomplex traits using neural networks. Genet Epidemiol 1999, 17: S503-S507.View ArticlePubMedGoogle Scholar
  5. Curtis D, North BV, Sham PC: Use of an artificialneural network to detect association between a disease and multiple markergenotypes. Ann Hum Genet 2001, 65: 95–107.View ArticlePubMedGoogle Scholar
  6. Li W, Haghighi F, Falk C: Design of artificialneural network and its applications to the analysis of alcoholismdata. Genet Epidemiol 1999, 17: S223-S228.View ArticlePubMedGoogle Scholar
  7. Lucek PR, Ott J: Neural networkanalysis of complex traits. Genet Epidemiol 1997, 14: 1101–1106.View ArticlePubMedGoogle Scholar
  8. Lucek P, Hanke J, Reich J, Solla SA, Ott J: Multi-locusnonparametric linkage analysis of complex trait loci with neural networks. Hum Hered 1998, 48: 275–284.View ArticlePubMedGoogle Scholar
  9. Marinov M, Weeks D: The complexityof linkage analysis with neural networks. Hum Hered 2001, 51: 169–176.View ArticlePubMedGoogle Scholar
  10. Ott J: Neural networksand disease association studies. Am J Med Genet 2001, 105: 60–61.View ArticlePubMedGoogle Scholar
  11. Saccone NL, Downey TJ Jr, Meyer DJ, Neuman RJ, Rice JP: Mapping genotypeto phenotype for linkage analysis. Genet Epidemiol 1999, 17: S703-S708.View ArticlePubMedGoogle Scholar
  12. Sherriff A, Ott J: Applicationsof neural networks for gene finding. Adv Genet 2001, 42: 287–298.View ArticlePubMedGoogle Scholar
  13. Moore JH, Parker JS: Evolutionarycomputation in microarray data analysis. In: Methods of Microarray Data Analysis (Edited by: Lin S, Johnson K). Boston:Kluwer Academic Publishers 2001.Google Scholar
  14. Koza JR, Rice JP: Genetic generationof both the weights and architecture for a neural network. IEEE Press 1991., II:Google Scholar
  15. Gruau FC: Cellular encodingof genetic neural networks. Master'sthesis Ecole Normale Superieure de Lyon 1992, 1–42.Google Scholar
  16. Moore JH, Parker JS, Hahn LW: Symbolic discriminantanalysis for mining gene expression patterns. In Lecture Notes in Artificial Intelligence 2167 (Edited by: De Raedt L, Flach P). Springer-Verlag, Berlin 2001, 372–381.Google Scholar
  17. Koza JR: Genetic Programming:On the programming of computers by means of natural selection. Cambridge, MIT Press 1993.Google Scholar
  18. Koza JR: Genetic ProgrammingII: Automatic discovery of reusable programs. Cambridge, MIT Press 1998.Google Scholar
  19. Koza JR, Bennett FH, Andre D, Keane MA: Genetic ProgrammingIII: Automatic programming and automatic circuit synthesis. Cambridge, MIT Press 1999.Google Scholar
  20. Banzhaf W, Nordin P, Keller RE, Francone FE: Genetic Programming:An Introduction. San Francisco,Morgan Kauffman Publishers 1998.Google Scholar
  21. Moore JH, Parker JS, Olsen NJ, Aune TS: Symbolic discriminantanalysis of microarray data in autoimmune disease. Genet Epidemiol 2002, 23: 57–69.View ArticlePubMedGoogle Scholar
  22. Schalkoff RJ: ArtificialNeural Networks,. New York,McGraw-Hill Companies Inc 1997.Google Scholar
  23. Utans J, Moody J: Selecting neuralnetwork architectures via the prediction risk application to corporatebond rating prediction. In: ConferenceProceedings on the First International Conference on ArtificialIntelligence Applications on Wall Street 1991.Google Scholar
  24. Moody J: Prediction riskand architecture selection for neural networks. In: From Statistics to Neural Networks: Theoryand Pattern Recognition Applications (Edited by: CherkasskyV, Friedman JH, Wechsler H). NATO ASI Series F, Springer-Verlag 1994.Google Scholar
  25. Fahlman SE, Lebiere C: The Cascade-CorrelationLearning Architecture. Mastersthesis Carnegie Mellon University School of Computer Science 1991.Google Scholar
  26. Mitchell M: An Introductionto Genetic Algorithms. Cambridge, MITPress 1996.Google Scholar
  27. Cantú-Paz E: Efficientand Accurate Parallel Genetic Algorithms. Boston, Kluwer Academic Publishers 2000.Google Scholar
  28. Koza JR: Survey of geneticalgorithms and genetic programming. In: Wescon 95: E2. Neural-Fuzzy Technologies andIts Applications IEEE, San Francisco 1995, 589–594.Google Scholar
  29. Hastie T, Tibshirani R, Friedman JH: The Elementsof Statistical Learning. NewYork, Springer-Verlag 2001.Google Scholar
  30. Ripley BD: Pattern Recognitionand Neural Networks. Cambridge,Cambridge University Press 1996.Google Scholar
  31. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactordimensionality reduction reveals high-order interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet 2001, 69: 138–147.PubMed CentralView ArticlePubMedGoogle Scholar
  32. Moore JH: Cross validationconsistency for the assessment of genetic programming results inmicroarray studies. In: Lecture Notesin Computer Science 2611 (Edited by: Corne D, Marchiori E). Berlin, Springer-Verlag 2003.Google Scholar
  33. Roland JJ: Generalisationand model selection in supervised learning with evolutionary computation. LNCS 2003, 2611: 119–130.Google Scholar
  34. Culverhouse R, Suarez BK, Lin J, Reich T: A Perspectiveon Epistasis: Limits of Models Displaying No Main Effect. Am J Hum Genet 2002, 70: 461–471.PubMed CentralView ArticlePubMedGoogle Scholar
  35. Sexton RS, Dorsey RE, Johnson JD: Optimizationof neural networks: a comparative analysis of the genetic algorithm andsimulated annealing. Eur J OperatRes 1999, 114: 589–601.View ArticleGoogle Scholar
  36. Cantú-Paz E: Evolvingneural networks for the classification of galaxies. In: Proceedings of the Genetic and EvolutionaryAlgorithm Conference (Edited by: Langdon, WB, Cantu-Paz E, MathiasK, Roy R, Davis D, Poli R, Balakrishnan K, Honavar V, Rudolph G,Wegener J, Bull L, Potter MA, Schultz AC, Miller JF, Burke E, Jonoska). San Francisco, Morgan Kaufman Publishers 2002, 1019–1026.Google Scholar
  37. Li W, Reich J: A complete enumerationand classification of two-locus disease models. Hum Hered 2000, 50: 334–349.View ArticlePubMedGoogle Scholar
  38. Moore JH, Hahn LW, Ritchie MD, Thornton TA, White BC: Applicationof genetic algorithms to the discovery of complex genetic modelsfor simulations studies in human genetics. In: Proceedings of the Genetic and EvolutionaryAlgorithm Conference (Edited by: Langdon WB, Cantu-Paz E, MathiasK, Roy R, Davis D, Poli R, Balakrishnan K, Honavar V, Rudolph G,Wegener J, Bull L, Potter MA, Schultz AC, Miller JF, Burke E, JonoskaN). San Francisco, Morgan Kaufman Publishers 2002, 1150–1155.Google Scholar
  39. Frankel WN, Schork NJ: Who's afraidof epistasis? Nature Genetics 1996, 14: 371–373.View ArticlePubMedGoogle Scholar
  40. Ott J: Neural networksand disease association. Am J Med Genet 2001, 105: 60–61.View ArticlePubMedGoogle Scholar

Copyright

© Ritchie et al; licensee BioMed Central Ltd. 2003

This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.

Advertisement