New application of intelligent agents in sporadic amyotrophic lateral sclerosis identifies unexpected specific genetic background

Background Few genetic factors predisposing to the sporadic form of amyotrophic lateral sclerosis (ALS) have been identified, but the pathology itself seems to be a true multifactorial disease in which complex interactions between environmental and genetic susceptibility factors take place. The purpose of this study was to approach genetic data with an innovative statistical method, artificial neural networks, to identify a possible genetic background predisposing to the disease. Methods A DNA multiarray panel was applied to genotype more than 60 polymorphisms within 35 genes selected from pathways of lipid and homocysteine metabolism, regulation of blood pressure, coagulation, inflammation, cellular adhesion and matrix integrity, in 54 sporadic ALS patients and 208 controls. Advanced intelligent systems based on a novel coupling of artificial neural networks and evolutionary algorithms were applied. The results obtained were compared with those derived from the use of standard neural networks and classical statistical analysis. Results Analytical processing of the DNA multiarray data with advanced artificial neural networks led to the unexpected discovery of a strong genetic background in sporadic ALS. The predictive accuracy obtained with Linear Discriminant Analysis and Standard Artificial Neural Networks ranged from 70% to 79% (average 75.31%) and from 69.1% to 86.2% (average 76.6%), respectively. The corresponding value obtained with Advanced Intelligent Systems reached an average of 96.0% (range 94.4% to 97.6%). This latter approach allowed the identification of seven genetic variants essential to differentiate cases from controls: apolipoprotein E arg158cys; hepatic lipase -480 C/T; endothelial nitric oxide synthase 690 C/T and glu298asp; vitamin K-dependent coagulation factor seven arg353glu; glycoprotein Ia/IIa 873 G/A and E-selectin ser128arg. Conclusion This study provides an alternative and reliable method to approach complex diseases. Indeed, the application of a novel artificial intelligence-based method offers a new insight into genetic markers of sporadic ALS, pointing out the existence of a strong genetic background.


Background
Amyotrophic lateral sclerosis (ALS), the most common form of motoneuron disease, is a relatively rare (incidence: 1-3 per 100,000 per year), progressive and fatal disease characterised by neurodegeneration involving primarily motor neurons of the cerebral cortex, brain stem and spinal cord. To date, most studies have focused upon the familial form of the disease, which accounts for less than 10% of cases and is usually inherited in an autosomal dominant manner. The gene coding for copper/zinc superoxide dismutase 1 (SOD1) appears to be mutated in 10-20% of familial cases [1].
Genetic risk factors for ALS have been extensively studied and some "major genes", in addition to SOD1, have been recognised as being responsible for the monogenic inheritance pattern. There are now at least six dominantly inherited adult-onset ALS genes, of which only three have been identified so far [2]. However, most ALS cases seem to reflect a typical multifactorial disease deriving from the interaction between a number of genes and environmental factors, some of which have not yet been established as causes of the disease, including brain and spinal cord trauma, strenuous physical activity and exposure to radiation [3].
Current hypotheses suggest a complex interplay between multiple mechanisms including genetic risk factors, oxidative stress, neuroexcitatory toxicity, mitochondrial dysfunction, intermediate neurofilament disorganization, failure of intracellular mineral homeostasis involving zinc, copper, or calcium, disrupted axonal transport, abnormal protein aggregation or folding, and neuroinflammation [3,4]. Recently there has been growing interest in the role played by non-neuronal neighbourhood cells in the pathogenesis of motor neuron injury and in the dysfunction of specific molecular signal pathways [5,6].
Among the genetic factors that may predispose to sporadic ALS, neurofilaments, apolipoprotein epsilon 4 genotype, excitotoxicity genes, ciliary neurotrophic factor (CNTF), cytochrome P450 debrisoquine hydroxylase CYP2D6, apurinic apyrimidinic endonuclease (APEX), mitochondrial manganese superoxide dismutase SOD2, monoamine oxidase allele B and paraoxonases have been reported in different studies, partly with contradictory results [2,4,7-9]. Not all the published studies have been replicated, probably because of the different populations analysed as well as insufficient sample size. On the other hand, different studies have employed either tissue microdissection or microarray technologies to search for other "low penetrant" or "susceptibility" genes, which are more common in the population and often polymorphic; the combination and interaction of these genes with environmental factors may contribute to modulating individual risk [10][11][12]. Recently, several genome-wide association studies have been performed with innovative approaches, i.e. the Illumina platform, and the authors have identified SNPs (single nucleotide polymorphisms) potentially associated with ALS [13][14][15][16]. However, most genome-wide association studies have not confidently identified risk genes that are replicated in every study. The most likely causes are disease heterogeneity, allelic heterogeneity, small effect sizes and, probably, insufficient sample size. Moreover, so far no microarray panel has been specifically developed for ALS, and the aetiology of the disease still remains to be defined.
Some years ago our group had the opportunity of working on another complex multifactorial disease, venous thrombosis, and of analysing the results with an innovative statistical approach, Artificial Neural Networks (ANNs) [17]. Indeed, ANNs promise to improve on the predictive value of traditional statistical data analysis. Initially, a known set of data, from a given problem with a known solution, is learned by the ANNs; subsequently the networks can reconstruct the fuzzy rules that may underlie a complex set of data. ANNs have been successfully used in many areas of medicine, as recently illustrated in an extensive review by Lisboa [18] and by Ritchie et al. [19], where neural networks were used for supervised pattern recognition in genetic epidemiology, and also in SNP association studies [20][21][22]. Much effort has been spent to adapt ANN architectures and ensembles to the specific problems to be solved. More specifically, many novel computational approaches have been developed and applied, with special attention to complex gene-gene and gene-environment interactions and ANNs [19][20][21][22][23].
The literature data, together with the impressive results we obtained with ANNs, by which we were able to identify a subset of polymorphisms related to that disease, prompted us to employ the same approach in ALS, hoping to discover specific genetic patterns underlying the sporadic form of this disease. We applied a multiarray approach including allelic variations in genes that could be involved in the pathogenesis of ALS, since it has been demonstrated that inflammation, cellular adhesion and lipid pathways are linked to the disease [10,11]. By contrast, it has never been demonstrated that the regulation of blood pressure, coagulation, homocysteine metabolism and matrix integrity pathways are directly linked to ALS, although they could be indirectly involved.
Genotyping of ALS cases and controls was performed. We applied advanced intelligent systems based on novel coupling of artificial neural networks and evolutionary algorithms and compared the results with those obtained by linear discriminant analysis and a simple back propagation approach.
Surprisingly, we discovered a novel strong genetic background allowing correct classification of cases and controls with an accuracy higher than 90%.

Subjects
The study population consisted of 54 sporadic ALS (SALS) patients and 208 control subjects, all Caucasian and of Italian ancestry.
Diagnostic criteria for ALS were based on the World Federation of Neurology El Escorial Revisited document [24]. All patients diagnosed with Definite, Probable or Probable laboratory-supported ALS who gave their informed consent were included in the study. The diagnosis of Possible ALS was also accepted. According to common clinical practice, our cases were subdivided into bulbar and spinal onset on the basis of the first symptoms reported by each patient. All patients were referred to the Department of Neurology of Niguarda Hospital, Milan, from 2001 to 2005 and were defined as sporadic when the disease was present in a single member of the family and when no mutations were present in the SOD1 gene.
Control subjects were selected from a healthy control population, randomly collected from healthy blood donors admitted to the "Healthy Blood Donor Service" of Niguarda Ca' Granda Hospital. We checked the absence of personal and familial history of ALS in this group through direct interview.
This study was approved by the local ethics committee.
The marker TNF beta thr26asn is present twice in the arrays as a control for the multiplex PCR and the hybridization procedures.
All ALS subjects were screened for SOD1 mutation through PCR amplification and direct sequencing according to standard procedures [27].

Database
Each record related to a known clinical condition or to a sample population, and comprised 62 variables corresponding to the 60 SNPs plus case and control. We eliminated from the database those markers for which only one genotype was present (APOB Arg3500Gln, CBS Ile278Thr, CETP Asp442Gly, 14G(+1) A and 14(+3) T ins) both in cases and controls. All the analysed polymorphisms may have three genotype classes: wild type, heterozygous and homozygous status. The association of these variables with ALS status was tested by ANNs and the results were compared with those obtained by a linear discriminant analysis. The models we used aimed at correct classification of the subjects in two classes: 1) SALS patients (cases), 2) healthy subjects (controls).
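The removal of uninformative markers described above can be illustrated with a minimal sketch. The genotype coding (0 = wild type, 1 = heterozygous, 2 = homozygous), the function name `drop_monomorphic` and the toy marker names are assumptions made for illustration and are not taken from the study.

```python
from typing import Dict, List

# Hypothetical genotype codes: 0 = wild type, 1 = heterozygous, 2 = homozygous.
Record = Dict[str, int]


def drop_monomorphic(records: List[Record]) -> List[Record]:
    """Remove every marker that shows a single genotype across all records."""
    if not records:
        return []
    markers = list(records[0].keys())
    # A marker is informative only if at least two genotype classes occur.
    informative = [m for m in markers if len({r[m] for r in records}) > 1]
    return [{m: r[m] for m in informative} for r in records]


# Toy data: marker names are placeholders, not the SNP panel of the study.
toy = [
    {"snp_a": 0, "snp_b": 0},
    {"snp_a": 1, "snp_b": 0},
    {"snp_a": 2, "snp_b": 0},
]
filtered = drop_monomorphic(toy)
```

In this toy example `snp_b` carries a single genotype in every record and is therefore dropped, which mirrors the exclusion rule stated in the text.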
No other specific genetic model potentially linked to the analysed SNPs was evaluated; ANNs are able to build a model with a strong genetic basis simply by collecting all the information contained in the SNPs, without any a priori definition. The mathematical approach of ANNs consists in measuring the general dependence of random variables related to a group of subjects without making any assumption about the nature of their underlying relationships.

Artificial neural networks analysis
In this study we applied supervised ANNs, in order to develop a model able to predict with high degree of accuracy the diagnostic class starting from genotype data alone.
Supervised ANNs are networks that learn by example, calculating an error function during the training phase and adjusting the connection strengths in order to minimize it. The learning constraint of supervised ANNs is to make their own output coincide with the predefined target. The general form of these ANNs is: y = f(x,w*), where w* constitutes the set of parameters which best approximate the function.
We employed the Back Propagation (BP) ANNs [28]. This type of ANN belongs to a very large family of ANNs that normally uses a specific kind of law of learning named Feed Forward (FF). In FF ANNs the signal proceeds from the input to the output of the ANN, crossing all of the nodes only once. The architecture of these networks is characterized by different layers of interconnected nodes (input, hidden and output nodes), which process the input signal according to a non-linear function (generally of sigmoid type). The activation of a single node, and therefore the signal transfer from one layer to another, is obtained by applying this non-linear function to the weighted sum of the node's inputs. Learning, i.e. the modelling of the input-output relation represented by the data, occurs through minimization of the output error and retropropagation of this error to the internal nodes, the hidden units, using the gradient descent algorithm in the majority of cases. In particular, each weight is corrected in proportion to the retropropagated error, which is computed from the output error for the last layer and from the errors of the following layer for all the other layers.
In theory, a Back Propagation having a sufficient number of hidden units is able to reconstruct any y = f(x) function.
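For reference, the textbook form of the Back Propagation equations for a sigmoid network with a squared-error function can be written as follows. The notation is an assumption made here for illustration ($x_i$ is the output of node $i$, $w_{ji}$ the weight from node $i$ to node $j$, $\theta_j$ the bias of node $j$, $t_j$ the target, $\mathrm{Rate}$ the learning coefficient) and may differ from the formulation used by the authors.

```latex
\begin{align}
\mathrm{net}_j &= \sum_i w_{ji}\, x_i + \theta_j, \qquad
x_j = f(\mathrm{net}_j) = \frac{1}{1 + e^{-\mathrm{net}_j}} \\
\Delta w_{ji} &= \mathrm{Rate} \cdot \delta_j \, x_i \\
\delta_j &= (t_j - x_j)\, f'(\mathrm{net}_j) && \text{(last layer)} \\
\delta_j &= f'(\mathrm{net}_j) \sum_k \delta_k \, w_{kj} && \text{(all other layers)}
\end{align}
```

For the sigmoid function, $f'(\mathrm{net}_j) = x_j (1 - x_j)$, so every correction can be computed from the node outputs alone.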
The BP used in this work was intentionally improved through the use of the SoftMax equation, specific for classification problems [29], and of the Selfmomentum equation [30], in which n indicates the learning cycle.
From a practical point of view, the Selfmomentum equation allows the solution of all of the problems solved by the Momentum, in a much faster way, maintaining the unitary learning coefficient (Rate = 1).
The architecture of the BP-FF ANN consists of an input layer sized according to the number of selected variables, one hidden layer sized according to the input layer (minimum 2 nodes, maximum 12 nodes), and an output layer representing the two prediction targets (SALS cases; controls).
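A minimal sketch of such a network is given below: one hidden sigmoid layer and a SoftMax output over two classes, trained by plain gradient descent. It is an illustration under stated assumptions and not the authors' implementation: the class name `TinyMLP`, the weight initialization range, the learning rate and the use of a cross-entropy error with SoftMax are all choices made here, and the Selfmomentum term is not reproduced.

```python
import math
import random
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs: List[float]) -> List[float]:
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


class TinyMLP:
    """Hypothetical network: one hidden sigmoid layer, SoftMax output."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int = 2, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Each row holds the incoming weights of one node; the last entry is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x: List[float]) -> Tuple[List[float], List[float]]:
        xb = x + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = hidden + [1.0]
        out = softmax([sum(w * v for w, v in zip(row, hb)) for row in self.w2])
        return hidden, out

    def train_step(self, x: List[float], target: int, rate: float = 0.1) -> None:
        hidden, out = self.forward(x)
        # With SoftMax and cross-entropy the output delta is (target - output).
        d_out = [(1.0 if k == target else 0.0) - o for k, o in enumerate(out)]
        # Hidden delta: sigmoid derivative times the back-propagated output deltas.
        d_hid = [
            h * (1.0 - h) * sum(d_out[k] * self.w2[k][j] for k in range(len(d_out)))
            for j, h in enumerate(hidden)
        ]
        hb, xb = hidden + [1.0], x + [1.0]
        for k, row in enumerate(self.w2):
            for j in range(len(row)):
                row[j] += rate * d_out[k] * hb[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                row[i] += rate * d_hid[j] * xb[i]

    def predict(self, x: List[float]) -> int:
        _, out = self.forward(x)
        return max(range(len(out)), key=out.__getitem__)


# Toy usage: three genotype-coded inputs, four hidden units, class 1 = case.
net = TinyMLP(n_in=3, n_hidden=4)
net.train_step([0.0, 1.0, 2.0], target=1)
```

The hidden layer size of four matches the value reported later in the text for the final classifier; any value between the stated minimum and maximum could be substituted.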
As a benchmark we employed linear discriminant analysis (LDA), applied to the same training and testing data sets used for the ANNs. LDA was performed with SAS version 6.04 (SAS Institute, Cary, NC, USA) using a forward stepwise procedure.

Preprocessing methods and experimental protocols
Data preprocessing was performed using two different resampling criteria of the global dataset.

-Random criterion
We employed the so-called 5 × 2 cross-validation protocol [31]. In this procedure the study sample is randomly divided five times into two sub-samples, always different but containing a similar distribution of cases and controls: the training one (containing the dependent variable) and the testing one. During the training phase the ANNs learn a model of the data distribution and then, on the basis of that model, classify the subjects in the testing set in a blind way. Training and testing sets are then reversed, so that 10 analyses are conducted for every model employed.
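The 5 × 2 protocol can be sketched as follows. The function name `five_by_two_splits` and the per-class shuffling used to keep the case/control ratio similar in both halves are assumptions made for illustration; the study does not state how the balance was enforced.

```python
import random
from typing import Iterator, List, Tuple


def five_by_two_splits(
    labels: List[int], seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield 10 (train_idx, test_idx) pairs: 5 random halvings, each used both ways."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    for _ in range(5):
        half_a: List[int] = []
        half_b: List[int] = []
        # Split each class separately so both halves keep a similar case/control ratio.
        for c in classes:
            idx = [i for i, y in enumerate(labels) if y == c]
            rng.shuffle(idx)
            cut = len(idx) // 2
            half_a.extend(idx[:cut])
            half_b.extend(idx[cut:])
        yield half_a, half_b  # train on A, test on B
        yield half_b, half_a  # roles reversed


labels = [1] * 54 + [0] * 208  # 54 cases and 208 controls, as in the study
splits = list(five_by_two_splits(labels))
```

Each of the ten pairs partitions the 262 records into two disjoint halves, so every subject is used once for training and once for blind testing within each of the five repetitions.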
The T&T system is a robust data resampling technique that is able to arrange the source sample into sub-samples that all possess a similar probability density function. In this way, the data are split into two or more sub-samples in order to train, test and validate the ANN models more effectively. The T&T is based on a population of n ANNs managed by an evolutionary system. In its simplest form, the performance score reached by each ANN in the testing phase represents its "fitness" value (i.e., the individual probability of evolution). The genome of each "network-individual" thus codifies a data distribution model with an associated validation strategy. The n data distribution models are combined according to their fitness criteria using an evolutionary algorithm. The selection of "network-individuals" based on fitness determines the evolution of the population; that is, the progressive improvement of the performance of each network until the optimal performance is reached, which is equivalent to the best division of the global dataset into subsets. The evolutionary algorithm mastering this process, named "Genetic Doping Algorithm" (GenD) [33], has characteristics similar to those of a genetic algorithm, but it is able to maintain an inner instability during the evolution, carrying out a natural increase of biodiversity and a continuous "evolution of the evolution" in the population.
The elaboration of T&T is articulated in two phases:
-Preliminary phase: in this phase the parameters of the fitness function that will be used on the global dataset are evaluated. During this phase an inductor is configured, which consists of an ANN with a standard Back Propagation algorithm (A). For this inductor the optimal configuration to reach convergence is established at the end of different training trials on the global dataset DΓ; in this way the configuration that most "suits" the available dataset is determined: the number of layers and hidden units and some possible generalizations of the standard learning law. The parameters thus determined define the configuration and the initialization of all the individual-networks of the population and then stay fixed in the following computational phase. Basically, during this preliminary phase there is a fine-tuning of the inductor that defines the fitness values of the population's individuals during evolution.
The accuracy of the ANN performance with the testing set will be the fitness of that individual (that is, of that hypothesis of distribution into two halves of the whole dataset).
-Computational phase: the system extracts from the global dataset the best training and testing sets. During this phase the individual-networks of the population are run according to the established configuration and initialization parameters. From the evolution of the population, managed by the GenD algorithm, the best distribution of the global dataset DΓ into two subsets is generated, starting from the initial population of possible solutions. Preliminary experimental sessions are performed using several different initializations and configurations of the network in order to achieve the best partition of the global dataset.
The I.S. system runs in parallel with T&T. It is an adaptive system, also based on the evolutionary algorithm GenD, which is able to evaluate the relevance of the different variables of the dataset in an intelligent way. It can therefore be considered on the same level as a feature selection technique. The genome of each individual has a length equal to the cardinality of the original input space. Every gene indicates whether an input variable is to be used or not during the evaluation of the population fitness. Through the evolutionary algorithm, the different "hypotheses" of variable selection, generated by each ANN of the population, change over time, at each generation: this leads to the selection of the best combination of input variables. As in the T&T system, the genetic operators crossover and mutation are applied to the ANN population; the rates of occurrence of both operators are self-determined by the system in an adaptive way at each generation.
When the evolutionary algorithm no longer improves its performance, the process stops, and the best selection of the input variables is employed on the testing subset.
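The genome encoding used for input selection can be illustrated with a short sketch. The helper names `random_genome`, `apply_genome` and `mutate`, as well as the fixed mutation rate, are hypothetical; in GenD the operator rates are self-determined at each generation, which is not reproduced here.

```python
import random
from typing import List


def random_genome(n_inputs: int, rng: random.Random) -> List[bool]:
    """One gene per input variable: True keeps the variable, False drops it."""
    return [rng.random() < 0.5 for _ in range(n_inputs)]


def apply_genome(genome: List[bool], record: List[float]) -> List[float]:
    """Project a record onto the input variables switched on in the genome."""
    return [value for keep, value in zip(genome, record) if keep]


def mutate(genome: List[bool], rate: float, rng: random.Random) -> List[bool]:
    """Flip each gene independently with probability `rate` (fixed here)."""
    return [(not g) if rng.random() < rate else g for g in genome]


rng = random.Random(0)
genome = random_genome(5, rng)          # one hypothesis of variable selection
selected = apply_genome([True, False, True], [0.0, 1.0, 2.0])
```

The fitness of a genome would then be the blind testing performance of a network trained only on the selected variables, as described in the text.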
In order to improve the speed and the quality of the solutions that have to be optimized, the GenD algorithm makes the evolutionary process of the artificial populations more natural and less centered on the individual liberalism culture.
The combined action of the T&T and I.S. systems allows us to solve two frequent problems in managing ANNs. Both systems are based on a genetic algorithm, the Genetic Doping Algorithm (GenD), developed at the Semeion Research Centre [33].
GenD was provided with 100 randomly generated individuals. Each individual represents a possible distribution of the whole dataset into two subsets. Two independent Multilayer Perceptrons (MLPs) with 4 hidden units are trained for 100 epochs and tested in a blinded manner on the two subsets. A function of the testing results of the two independent MLPs defines the fitness of each individual.
A crossover function is applied to the population of 100 individuals and new individuals are generated. A mutation operator is applied to the new individuals and to the individuals whose fitness is weakest. In the GenD algorithm the rates of crossover and mutation are self-determined by the system in an adaptive way at each generation. This loop is applied for at least 300 generations, or stopped when the system has not shown any significant improvement for at least 50 generations. The individual whose distribution of the whole dataset into two subsets gives the best blind testing results is saved and then used as the optimal distribution to train and test more sophisticated ANNs.
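The evolutionary search for a dataset partition can be sketched with a plain genetic algorithm. This is not GenD: the fixed mutation rate, one-point crossover, elitism and truncation selection are generic choices assumed here, the generation limit is treated as an upper bound, and the toy `balance_fitness` function merely stands in for the MLP-based fitness described in the text.

```python
import random
from typing import Callable, List

Split = List[bool]  # True: record goes to subset A; False: subset B


def evolve_split(
    n_records: int,
    fitness: Callable[[Split], float],
    pop_size: int = 100,
    max_generations: int = 300,
    patience: int = 50,
    mutation_rate: float = 0.01,
    seed: int = 0,
) -> Split:
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_records)] for _ in range(pop_size)]
    best, best_score, stale = population[0], float("-inf"), 0
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        top_score = fitness(ranked[0])
        if top_score > best_score:
            best, best_score, stale = ranked[0], top_score, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` generations
                break
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [ranked[0]]             # elitism: keep the current best
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_records)  # one-point crossover
            child = mother[:cut] + father[cut:]
            children.append([(not g) if rng.random() < mutation_rate else g for g in child])
        population = children
    return best


# Toy stand-in fitness: reward halves with a similar proportion of cases.
toy_labels = [1] * 10 + [0] * 30


def balance_fitness(split: Split) -> float:
    a = [y for keep, y in zip(split, toy_labels) if keep]
    b = [y for keep, y in zip(split, toy_labels) if not keep]
    if not a or not b:
        return -1.0
    return -abs(sum(a) / len(a) - sum(b) / len(b))


best_split = evolve_split(len(toy_labels), balance_fitness,
                          pop_size=20, max_generations=30, patience=10)
```

Replacing `balance_fitness` with a function that trains and blind-tests two networks on the two halves would bring the sketch closer to the procedure described above, at a much higher computational cost.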
We implemented both algorithms in C and used a Pentium III CPU to run the system on real data. Each run took around 48 hours. Note that the T&T and I.S. algorithms have to be used only once to train the system. Once trained, the system can respond online to any new pattern.
After this processing, the features that were most significant for the classification were selected and at the same time the training set and the testing set were created with a probability distribution function similar to the one that provided the best results in the classification.
A supervised Multi Layer Perceptron, with four hidden units, was then used for the classification task.

Study populations
We collected 54 patients. All patients were previously screened for SOD1 gene mutations by sequence analysis and no genetic variations were found. Control subjects were 144 males and 67 females; age range 21 to 75 years (average 38.94). Table 1 summarizes the distribution of the SNPs in the two groups of patients and controls. The reliability of the whole molecular procedure (multiplex and hybridization steps) was checked by the TNF beta thr26asn polymorphism, which gave the same results in both strips A and B for the same subject analysed (see [17] and [26] for details).

Classification performances with ANNs
Results obtained with Linear Discriminant Analysis were compared with those obtained with a simple Back Propagation approach (Tables 2 and 3).
In these experiments we applied the random criterion to divide the dataset five times in training and testing subsets applying the 5 × 2 Cross Validation protocol.
The predictive accuracy obtained with Linear Discriminant Analysis and standard artificial neural networks ranged from 70% to 79% (average 75.31%) and from 69.1% to 86.2% (average 76.6%), respectively.
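The performance measures reported for each model can be computed from a confusion matrix as sketched below. The function name `classification_summary` is hypothetical, and the interpretation of the weighted mean accuracy as the class-size-weighted mean of sensitivity and specificity (i.e. the overall fraction of correctly classified subjects) is an assumption; the counts in the usage line are invented for illustration and are not results of the study.

```python
from typing import Dict


def classification_summary(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Summarize a two-class confusion matrix (cases = positives)."""
    sensitivity = tp / (tp + fn)  # cases correctly classified
    specificity = tn / (tn + fp)  # controls correctly classified
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Arithmetic mean: both classes count equally.
        "arithmetic_mean_accuracy": (sensitivity + specificity) / 2,
        # Weighted mean (assumed definition): classes weighted by their size.
        "weighted_mean_accuracy": (tp + tn) / (tp + fn + tn + fp),
        "errors": float(fn + fp),
    }


# Invented counts for a testing half of 27 cases and 104 controls.
summary = classification_summary(tp=25, fn=2, tn=100, fp=4)
```

With strongly unbalanced groups, as here, the two means can differ appreciably, which is why both are worth reporting.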
With the TWIST approach, which combines the T&T and I.S. systems, every experiment was conducted in a blind and independent manner in two directions: training with sub-sample A and blind testing with sub-sample B vs training with sub-sample B and blind testing with sub-sample A. The results from the best five applications of TWIST procedures are reported in Table 4. This advanced intelligent system, through the final selection of a subgroup of 25-27 variables along ten independent applications, provided the highest predictive performance, with a sensitivity ranging from 92.0% to 100% (average 96.75%), a specificity ranging from 91.67% to 98.81% (average 95.78%) and an overall accuracy ranging from 94.4% to 97.6% (average 96.0%). In all the TWIST system experiments the 90% overall accuracy threshold was exceeded, whereas Back Propagation and Linear Discriminant Analysis remained below the 80% threshold on average.
Table note: Deviations of the genotype frequencies from the Hardy-Weinberg equilibrium were tested in the control group with chi-squared statistics (p values ranging from 0.2 to 1.0). Allele frequencies at each marker locus were calculated from the genotype frequencies of the control group under the null hypothesis of Hardy-Weinberg equilibrium and are reported in the table.
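The Hardy-Weinberg check applied to the control group can be sketched for a single biallelic marker as follows. The function names and the example genotype counts are hypothetical; the test uses one degree of freedom, for which the chi-squared tail probability equals `erfc(sqrt(x / 2))`.

```python
import math


def hwe_chi_squared(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-squared statistic for deviation from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def hwe_p_value(chi2: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Invented genotype counts (wild type, heterozygous, homozygous).
chi2 = hwe_chi_squared(30, 40, 30)
p_value = hwe_p_value(chi2)
```

A large p value, such as those reported for the control group, indicates no detectable deviation from equilibrium at that marker.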

Genetic variants independently selected by four TWIST procedures
The genetic variants selected four times over five experiments consisted of: peroxisome proliferator activated receptor gamma (PPARG) pro12ala (chromo...
Table note: The second and third columns report the percentage of patients correctly classified as belonging to cases or controls. The fourth and fifth columns report the accuracy obtained by the model as arithmetic mean and weighted mean. The number of errors is reported in the last column. The last row reports the mean values of all the columns.
The TNF beta thr26asn was used as a further control. At first it was selected by four TWIST systems; later, since the information linked to this variation had already been recruited, none of the TWIST systems selected this SNP.

Discussion
The mechanism of neurodegeneration in ALS remains an enigma. The major problem is that little is known about the disease mechanism, making candidate gene selection difficult and haphazard. It follows that an unconventional approach of making no a priori assumptions about the location of the variants of interest might be appropriate, provided that a similarly unconventional statistical approach is available to manage the data complexity.
Comparison of results obtained using three different analytical approaches (classical statistics, standard neural networks and advanced artificial neural networks) points out the need to employ systems that are truly capable of handling the disease complexity instead of treating the data with reductionist approaches unable to detect multiple genes of smaller effect in predisposing to the disease. The possibility of obtaining high diagnostic accuracy from limited and selected genetic information using these new analytical tools shows that even in so-called sporadic ALS the genetic background plays a fundamental role.
Table note: In the first column ab means training on subset a and testing on subset b; ba means the opposite.
Another important obstacle in approaching the molecular basis of a rare disease like ALS in a conventional manner is the difficulty of finding a homogeneous sample population large enough to be analysed for a wide number of genetic variants. Artificial neural networks, unlike classical statistical tests, can manage complexity even with relatively small samples and the resulting unbalanced ratio between variables and records. In this connection, it is important to note that adaptive learning algorithms of inference based on the principle of functional estimation, like artificial neural networks, overcome the problem of dimensionality.
Internal validation of the prediction accuracy is one of the most important problems in neural network analysis. In fact, the restriction of training procedures to only a part of the dataset, generally half of it, causes a potential loss of power to recognize hidden patterns. In this study, optimization of the training and testing procedures was addressed using the evolutionary training and testing algorithm, which ensured that the two halves of the dataset contained the same amount of relevant information. Thus, the best division of the whole dataset into a training and a testing set was reached after a finite number of generations. Finally, ANNs were able to identify gene combinations (allelic variants) that are likely to produce accurate predictions of ALS for a single individual, regardless of some possible limitations such as the male/female ratio and age differences between the case and control groups. This study enrolled more than 50 cases with an accurate diagnosis of ALS and we were able to test them for 69 SNPs in 35 genes. Although the SALS patients analysed represent a small cohort, it is nevertheless representative from an epidemiological point of view (e.g. male/female ratio, bulbar/spinal ratio).
Besides, all ALS patients were previously screened for SOD1 gene mutations with negative results, thus confirming the sporadic nature of the disease. However, the sample size of 54 cases analysed for more than 60 SNPs prompted us to look for valid, powerful and efficient statistical tools to approach and evaluate our data.
On the basis of the observed results, some inferences about the methodological approaches used can be drawn. The multiarray approach was previously validated by ourselves [17] and others [26] and contains TNF beta as the internal control.
Indeed, ApoE arg158cys was selected by all five TWISTs, while ApoE cys112arg was selected only once. For the NOS variants, position -922 in the promoter region was never selected, while the -690 variant, also in the promoter region, and the non-synonymous variant at position 298 were both selected by all five TWISTs. The Factor VII and E-selectin (SELE) genetic variants, which contain the information necessary for correct attribution to disease vs healthy status, were selected five times (FVII arg353glu and SELE ser128arg) and four times (FVII del/ins and SELE leu554phe), respectively. The role of paraoxonase in predisposing to ALS appears to be confirmed: PON1 met55leu and PON2 ser311cys were chosen four times, whereas PON1 gln192arg was never selected. PPARγ pro12ala was chosen four times: we can assume a generic role of this receptor in ALS, since PPARγ is at the crossroads between lipid metabolism and innate immune response [34]. In addition, we noticed, for example, that the same TNF locus, 6p21.3, also contains the HFE gene for haemochromatosis and the peripherin gene, both previously implicated in ALS [35].
A few genetic variants were never selected by any of the TWIST procedures. One possible reason is that some information had already been picked up by the systems, e.g. for PON1, NOS and TNF. Moreover, regarding the APOA4 and APOC3 variants, we observed that they lie on chromosome 11, which may not be involved in the disease at all. Indeed, a very recent paper on genome-wide genotyping in ALS [13] found no SNPs associated with the disease on chromosome 11.
From a biological point of view, the identified gene variations confirm some of the already known results (ApoE and PON, for example) and identify new genes and genetic variations not previously known to be involved in the disease. Our results strengthen the involvement of oxidative stress as well as angiogenesis (NOS) and immune response (TNF) pathways. Besides, our results shed light on the involvement of lipid pathways (LIPC, PPARγ). Indeed, a role for polyunsaturated fatty acids has been postulated in the aggregation of misfolded proteins in several neurodegenerative diseases, including familial ALS [36]. Furthermore, polyunsaturated fatty acids can be enzymatically converted into various lipid mediators, such as leukotrienes and prostaglandins, which have strong biological activity in several signalling pathways [37].

Conclusion
Our study focuses on disentangling the effect of multiple interacting low-penetrance alleles on complex diseases. We analysed genetic variables within genes possibly involved in ALS and, thanks to artificial intelligence agents such as those employed in this study, we were able to differentiate ALS cases from control subjects on the basis of a subset of genetic data only. We still do not know which specific variation within the subset of SNPs is linked to the disease; however, ANNs are able to discriminate between cases and controls with only seven SNPs.
We are aware that this is an exploratory study and that it should be replicated in another, much larger sample; nevertheless, this study offers new insight into genetic markers of sporadic ALS, pointing out the existence of a strong genetic background. The data provide useful information to direct future research on the complexity of the genetic profile of ALS subjects.