We split the discussion into three parts. The first part concerns the quality of the three feature selection methods (information gain, random forest and GA/SVM) as well as the classification performance of different classifiers. The second part deals with the over- and under-represented gene pairs selected by the GA/SVM. Finally, we discuss the biological relevance of the selected genes.
Comparison of the three feature selection methods
Usually, regulatory processes in a cell are very complex and single genes are not able to explain all aspects of a biological cell state. For this reason, combining the best-ranked genes to improve the classification capability is a frequently used approach [18, 19]. However, combining redundant genes usually does not improve the classification capability of a gene set much.
Figure 1 shows the classification capability of incrementally smaller sets of top-ranked genes for each of our three feature selection methods. Independently of the classifier used to evaluate the resulting gene list, we observe very similar results. In particular, the genes selected by the GA/SVM do not seem to be favored when the SVM is used to measure performance. This suggests that the selected genes generally have a good classification capability. Comparing classifiers trained with only the best 50 genes (Figure 1) to classifiers trained with the larger set of 1000 genes (Table 1), we find nearly no difference in classification accuracy. We take this as another piece of evidence for the quality of the selected biomarkers.
For both data sets we observe the same two main points. First, if we use only a few genes for training a classifier, the genes selected by information gain and random forest show better classification results than the genes selected by GA/SVM. Second, if we use five or more top-ranked genes, the genes selected by GA/SVM are better.
We explain these observations by the differences between the three feature selection methods. Information gain ranks a gene considering only the expression values of this single gene. For this reason we expect a lot of redundancy among the top-ranked genes. The results in Figure 1 show that the accuracy increases slowly as we increase the number of genes used for classification, which probably happens precisely because combining redundant genes does not improve the classification accuracy much. Nevertheless, the Shannon entropy (used by information gain) seems to be a good ranking criterion, as the top-ranked gene shows a very good classification capability.
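The single-gene nature of this ranking can be illustrated with a minimal sketch. It assumes that each gene is scored by the best entropy reduction over a single expression threshold; the discretization actually used in the study is not stated here, and the gene names and expression values below are hypothetical toy data.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(expression, labels, threshold):
    """Entropy reduction obtained by splitting the samples at `threshold`."""
    low = [y for x, y in zip(expression, labels) if x <= threshold]
    high = [y for x, y in zip(expression, labels) if x > threshold]
    if not low or not high:
        return 0.0  # a split that leaves one side empty carries no information
    total = len(labels)
    conditional = ((len(low) / total) * shannon_entropy(low)
                   + (len(high) / total) * shannon_entropy(high))
    return shannon_entropy(labels) - conditional


def best_information_gain(expression, labels):
    """Highest gain over all midpoints between consecutive distinct values.

    Assumption: one binary threshold per gene; other discretizations exist.
    """
    values = sorted(set(expression))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max((information_gain(expression, labels, t) for t in thresholds),
               default=0.0)


# Hypothetical toy data: gene_a separates the classes, gene_b does not.
labels = ["AD", "AD", "AD", "ctrl", "ctrl", "ctrl"]
genes = {
    "gene_a": [8.1, 7.9, 8.4, 5.2, 5.0, 5.5],
    "gene_b": [6.0, 5.1, 7.2, 6.1, 5.0, 7.3],
}
# Each gene is scored on its own; no other gene enters the score.
ranking = sorted(genes,
                 key=lambda g: best_information_gain(genes[g], labels),
                 reverse=True)
```

Because every score is computed from one gene alone, two genes with nearly identical expression profiles receive nearly identical scores and end up next to each other in the ranking, which is the source of redundancy discussed above.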
In contrast to information gain, random forest determines the importance of a gene based on the classification capability of that gene in multiple trees. This way the score of each gene depends on the expression values of other genes as well. Nevertheless, the list of top-ranked genes usually still contains many redundancies (see Figure 2). This explains why combining multiple top-ranked genes does not distinctly improve the classification capability of the trained classifier.
In contrast to these two methods, our GA/SVM has a strong tendency to eliminate redundant genes. This is demonstrated by the observed low values of mutual information (see Figure 2), which measures the mutual dependency of gene pairs. Thus we hypothesize that each small gene set, selected by the GA/SVM in a single run, consists of genes that fulfill different functions for the specific biological state. If multiple genes are redundant, only one of them tends to be chosen, by chance, as a member of a gene set. For this reason, when the genes are ranked by their frequency of occurrence in the small gene sets, genes with many redundant partners are infrequent. The genes ranked best by the GA/SVM tend to have no redundant partners in the list of top-ranked genes concerning the specific biological state (Alzheimer affected; pluripotent). Combining top-ranked genes strongly improves the classification capability of a gene set, as can be seen in Figure 1 in the sharp rise (from right to left) in the accuracy as the number of genes increases.
Analyzing the 50 top-ranked genes of each feature selection algorithm, we observe that the genes selected by the GA/SVM algorithm show, on average, less pairwise mutual information than those selected by information gain and random forest. As the mutual information of two genes is a measure of their mutual dependency, we assume that two genes selected by our GA/SVM algorithm depend on each other, on average, to a lesser degree than genes selected by the other two feature selection methods. As a high mutual dependency reveals redundancy between the genes, this supports our assumption that the gene lists selected by the GA/SVM algorithm contain fewer redundancies than those selected by information gain and random forest.
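The comparison of average pairwise mutual information can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the study: expression values are binarized at the median of each gene (the discretization used for Figure 2 is not given in this section), and the function names and toy profiles are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import median


def binarize(values):
    """Discretize expression values into 0/1 at the median (an assumption)."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def mean_pairwise_mi(genes):
    """Average mutual information over all pairs of genes in a list."""
    discrete = [binarize(g) for g in genes]
    pairs = list(combinations(discrete, 2))
    return sum(mutual_information(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy profiles over eight samples.
g1 = [1, 2, 3, 4, 5, 6, 7, 8]
g2 = [2, 3, 4, 5, 6, 7, 8, 9]   # rises with g1: redundant partner
g3 = [1, 8, 2, 7, 3, 6, 4, 5]   # unrelated to g1 after binarization

redundant_list = mean_pairwise_mi([g1, g2])
diverse_list = mean_pairwise_mi([g1, g3])
```

A gene list with a lower mean value, like the second toy list here, contains genes that carry less information about each other, which is the sense in which the GA/SVM list is described as less redundant.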
Even though GA/SVM is known to show good results in feature selection [3, 4, 21, 22], the utility of the small gene sets selected during a single run has not yet been investigated in depth. In the following section we consider this topic.
Analysis of the small gene sets of GA/SVM
Many machine learning methods used for learning biomarkers, such as information gain and random forest, perform only univariate ranking of genes. Although the top-ranked genes are valuable hints for understanding the underlying molecular mechanism of cellular processes, these processes are usually more complicated than single genes are able to explain. We are, therefore, interested in small sets of genes that best distinguish Alzheimer diseased versus healthy tissues (respectively pluripotent versus non-pluripotent cells).
In the previous section we discussed that small sets of genes obtained from the GA/SVM combined list are better suited for classification than those from information gain and random forest. Small sets individually selected by the GA/SVM have an even higher classification accuracy, as seen in Figure 3. For this reason, we conclude that the particular assembly of genes in a small set plays an important role for the prediction accuracy of classifiers using the genes selected by our GA/SVM. Therefore, it is useful to examine gene pairs that occur more frequently or less frequently than expected in the small gene sets of the GA/SVM.
In Figure 5 we display the difference between the most over- and under-represented gene pairs in absolute classification accuracy as well as in mean and minimal gain of accuracy. We observe that gene pairs often selected together in a single small gene set of the GA/SVM are on average better suited for separating the two groups of samples than gene pairs rarely selected together (Figure 5(a)). Further, we observe that the GA/SVM prefers to assemble those genes whose combination leads to an increase (gain) in classification accuracy (Figure 5(b) and Figure 5(c)). We propose that non-redundant genes are chosen together much more often than redundant genes.
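One way to make "more or less frequently than expected" concrete is sketched below. It assumes that the expected co-occurrence of two genes is derived from their individual frequencies under independence; the exact statistic used in the study is not given in this section, and the function name and toy gene sets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_representation(gene_sets):
    """Return {(gene_a, gene_b): (observed, expected)} for all gene pairs.

    `observed` counts the small gene sets that contain both genes.
    `expected` assumes the two genes enter a set independently of each
    other, with probabilities equal to their observed frequencies.
    """
    n_sets = len(gene_sets)
    gene_counts = Counter(g for s in gene_sets for g in set(s))
    pair_counts = Counter(
        pair for s in gene_sets for pair in combinations(sorted(set(s)), 2)
    )
    result = {}
    for a, b in combinations(sorted(gene_counts), 2):
        expected = gene_counts[a] * gene_counts[b] / n_sets
        result[(a, b)] = (pair_counts[(a, b)], expected)
    return result


# Hypothetical small gene sets from four GA/SVM runs.
runs = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
representation = pair_representation(runs)

# Pairs sorted from most over-represented to most under-represented.
by_excess = sorted(representation,
                   key=lambda p: representation[p][0] - representation[p][1],
                   reverse=True)
```

In this toy example the pair (a, b) is observed twice but expected only once, so it is over-represented, whereas (a, c) and (b, c) are never observed together and are under-represented.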
Figure 6 shows the most over-represented and under-represented gene pair for both data sets. The corresponding SVM accuracies can be found in Tables
4. For the AD data set, the two genes LAP3 and SLC39A12 individually have a low classification capability (accuracy: 63% and 70%, respectively). However, LAP3/SLC39A12 is the most over-represented gene pair in the AD data set, and combining the two genes increases classification accuracy to 80%. Figure 6(a) illustrates the similar distribution of the two sample groups regarding the single genes, as well as the straightforward separability of the samples using both genes. Using a simplified rule, we may classify a sample as Alzheimer-affected if the gene expression value of SLC39A12 is larger than the expression value of LAP3.
For the most over-represented gene pair of the PLURI data set we make similar observations. Combining the single genes Irx3 and Utp20, with individual classification accuracies of 77% and 81%, increases accuracy to 92%. In Figure 6(b) we find the sample distribution concerning the single genes even better separable than for LAP3/SLC39A12 (see Figure 6(a)). Using a simple rule, we classify a sample as pluripotent if Utp20≥8 and Irx3≤6.
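The two simplified rules and the accuracy gains of the two pairs can be written out as a short sketch. The rules and the accuracies are taken from the text; the definition of mean and minimal gain (pair accuracy minus each single-gene accuracy) is an assumption about how these quantities are computed, and the function names are hypothetical.

```python
def is_alzheimer_affected(slc39a12, lap3):
    """Simplified AD rule: SLC39A12 expression larger than LAP3 expression."""
    return slc39a12 > lap3


def is_pluripotent(utp20, irx3):
    """Simplified PLURI rule: Utp20 >= 8 and Irx3 <= 6."""
    return utp20 >= 8 and irx3 <= 6


def accuracy_gains(pair_accuracy, single_accuracies):
    """Mean and minimal gain of a pair over its single genes.

    Assumed definition: gain = pair accuracy - single-gene accuracy,
    in percentage points.
    """
    gains = [pair_accuracy - acc for acc in single_accuracies]
    return sum(gains) / len(gains), min(gains)


# Accuracies in percent, as reported in the text.
ad_gain = accuracy_gains(80, [63, 70])      # LAP3 / SLC39A12
pluri_gain = accuracy_gains(92, [77, 81])   # Irx3 / Utp20
```

Under this assumed definition, the AD pair gains 13.5 percentage points on average and at least 10, and the PLURI pair gains 13 on average and at least 11.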
For the two most under-represented gene pairs in Figure 6(c) and Figure 6(d) we cannot determine such an easy rule for distinguishing the two sample groups. From a biological point of view, we hypothesize that the genes contained in the pairs that occur more often than expected in the small sets selected by the GA/SVM are responsible for, or indicative of, the specific biological state, but we assume that they are not co-regulated and, therefore, not correlated. Instead we hypothesize that the genes fulfill different functions with respect to the specific biological state. We find indirect evidence for this hypothesis by inspecting Gbx2/Otx2, which constitute the most under-represented gene pair in the PLURI data set. Otx2 and Gbx2 are both known to play a role in pluripotent and undifferentiated stem cells [23, 24]. Further, Gbx2 and Otx2 interplay as antagonists in cellular processes [25–27]. This negative correlation suggests that the two genes are redundant and, therefore, rarely selected together in a small set of genes by the GA/SVM. Inspecting the other gene pairs displayed in Figure 6, we find no reference to their interactions or redundancy in the literature. Thus, investigating our hypothesis in more detail, ideally in a systematic way, remains future work.
Comparison of the two data sets
Performing the same analyses on two different data sets (PLURI and AD) enables us to compare the two data sets with each other and draw some conclusions about the two biological phenomena.
Besides many similarities between the two data sets already discussed in the two previous sections we also observe an important difference. Independently of the analysis performed, we observe a difference in absolute classification accuracy between the two data sets. Performing the same analysis, the accuracy obtained on the PLURI data set is usually at least 5% higher than on the AD data set (see Figures
5(a) as well as Tables
Another interesting point is the different size of the small gene sets selected by the GA/SVM on AD and PLURI. Starting with approximately 15 genes in the start chromosomes of the GA/SVM, the algorithm selects up to 15 genes in the final small sets on the AD data set but only up to 8 genes on PLURI. For that reason we can assume that more genes are required for separating Alzheimer’s disease affected samples from healthy ones than pluripotent from non-pluripotent.
There are two possible reasons for the differences between the two data sets. (1) As the size of the two data sets differs a lot (containing 286 samples in the PLURI and 161 samples in the AD data set) and machine learning methods work best on large data sets, increasing the number of samples of AD could improve the ability of training a good classifier. (2) There are differences in the number of genes involved in pluripotency and Alzheimer’s disease, and in the way these genes function together. This could lead to specific difficulties in the classification problem based on the respective data sets. Both reasons are also supported by the different sizes of the small sets selected by the GA/SVM.
Biological relevance of the selected genes
In this subsection we discuss the biological relevance of our results on the AD data set.
First, we elaborate on the enrichment analysis performed on the best genes selected by the three feature selection methods (see Figure
4). Second, we discuss the top genes in detail (see Table
The gene list provided by GeneCards (http://www.genecards.org) when searching for Alzheimer’s disease is the most extensive one. All of the gene sets computed by the three feature selection methods show a significant enrichment. For the two gene lists offered by Genotator and AlzGene, the biomarkers selected by information gain as well as random forest show a significant enrichment. The genes selected by the GA/SVM still show an enrichment. The experimental gene expression array studies of Soler et al. (brain) and Goni et al. (blood+brain) show an interesting point. Whereas the genes found by the GA/SVM are enriched in both studies concerning the brain, for information gain and random forest we only find an enrichment in the gene list of Soler et al. As we use only samples of brain tissues, it is understandable that none of the methods shows an enrichment in the blood-based genes found by Goni et al.
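To illustrate what an enrichment of selected genes in a reference list means, the following sketch computes a one-sided hypergeometric p-value. The test actually used in the study is not stated in this section, so the choice of the hypergeometric test is an assumption, and the function name and all numbers in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_reference, n_selected, n_overlap):
    """P(X >= n_overlap) under the hypergeometric distribution.

    n_universe  : number of genes measured on the array
    n_reference : genes in the reference list (for example a curated AD list)
    n_selected  : genes chosen by a feature selection method
    n_overlap   : selected genes that also appear in the reference list
    """
    upper = min(n_reference, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_reference, k) * comb(n_universe - n_reference, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers, for illustration only.
p_small = hypergeom_enrichment_p(10, 3, 3, 3)   # all three selected genes hit
p_none = hypergeom_enrichment_p(10, 3, 3, 0)    # no overlap required
```

A small p-value indicates that the overlap between the selected genes and the reference list is larger than expected by chance; which threshold counts as significant, and whether a multiple-testing correction is applied, are further choices not covered by this sketch.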
The top five genes we found using the GA/SVM are LOC642711, PRKXP1, LOC283345, SST and LY6H. Although some of these genes are not yet well characterized, we can identify a relevance for Alzheimer’s disease for the majority of them.
The GA/SVM wrapper method found that LOC642711 is a very good choice for a small set of genes that can discriminate Alzheimer-affected brain tissue from non-affected brain tissue. However, LOC642711 has a RefSeq status of 'withdrawn'. To shed some light on this, we performed a nucleotide BLAST search with the withdrawn RefSeq, accession XM_931285.1, using standard parameters. The best match (see Additional file 1) is NR_036650.1, the 'WAS protein homolog associated with actin, golgi membranes and microtubules pseudogene (LOC100288615)', which is 100% identical to our probe over 75% of its length. This pseudogene is abbreviated as WHAMMP3, WHAMML12 and WHDC1L1 (according to GeneCards), and very recently it was shown to be part of a duplication on chromosome 15 associated with Alzheimer’s disease. Also, sorting in post-golgi compartments has been implicated in AD. Further analyses and experiments are needed to determine the expression pattern of this pseudogene and its possible involvement in Alzheimer’s disease.
Recently, PRKXP1, SST and TNCRNA (also known as NEAT1, listed second by information gain) were identified by Squillario and Barla as part of a 39-gene signature implicated in Alzheimer’s disease.
LOC283345, known as RPL13P5, is ranked by all three feature selection methods as one of the best 20 genes, although we have not found any link to Alzheimer’s disease. However, RPL13 has been implicated in severe Alzheimer’s disease, although it is also known as a housekeeping gene in qRT-PCR studies of autopsy brain tissue samples from control and Alzheimer-diseased cases, and it indeed shows a very low log fold change of 0.04 in our data set. In contrast, its pseudogene 5 shows a very high log fold change of 0.96. Therefore, we assume that there is an unknown link between the pseudogene and Alzheimer’s disease, possibly mediated by the original RPL13 gene. Alternatively, there may be unknown phenomena that confound the distinction between the pseudogene and the gene itself.
PRKXP1 is another pseudogene; interestingly the original gene PRKX is patented as an Alzheimer’s disease diagnostic and therapeutic target (http://www.google.com/patents/US20090136504).
SST (somatostatin) expression was shown to be decreased in the cortex and hippocampus of Alzheimer-affected brains. SST also occurs in the top 20 overall list, since it is ranked high by information gain and random forest as well.
LY6H is patented as a brain-specific gene for treating Alzheimer’s disease (http://www.freepatentsonline.com/y2004/0254340.html).
Besides LOC283345, LOC642711 and TNCRNA (see above), the random forest ranks FLJ11903 and GPRASP2 best. The function of the pseudogene FLJ11903 is not yet known. GPRASP2 encodes a protein that was shown to interact with several GPCRs (G protein-coupled receptors), which are relevant for the signal transduction system in Alzheimer’s disease.
Similar to random forest, information gain ranks FLJ11903 and TNCRNA best. It further chooses LOC283755, PTPN3 and PCYOX1L as the most important genes. The protein encoded by PTPN3 (also known as PTPH1) belongs to a family of signaling molecules that regulate cellular processes. A PTPH1 inhibitor is patented for the treatment of Alzheimer’s disease (http://www.freepatentsonline.com/y2011/0015254.html). Moreover, PTPN3 is a phosphatase, and phosphorylation of the tau protein is considered highly relevant for AD progression. The two genes LOC283755, also called HERC2P3, and PCYOX1L have not yet been related to Alzheimer’s disease. Nevertheless, PCYOX1L is the second highest ranked gene in the overall list.
An interesting gene in the overall list, following LOC283345 and PCYOX1L, is C6ORF151. Like the other two, it is ranked by all three selection methods as one of the top 20 genes. C6ORF151 is likely involved in U12-type 5’ splice site recognition; also known as SNRNP48, it participates in the massive transcriptional downregulation seen in late-stage neurodegenerative disease (ALS), affecting mRNA metabolism and processing as well as RNA splicing.
The gene ranked 4th overall, MID1IP1, is among the top-10 genes found upregulated in the Alzheimer neocortex. Finally, the gene ranked 5th overall, BCL6, is (together with CD24) the only immunity-related gene with significantly higher expression in severe Alzheimer’s disease that was singled out by principal component analysis (PCA).
In summary, we can demonstrate that the majority of the genes found by the three feature selection methods are related to Alzheimer’s disease. The genes not yet associated with Alzheimer’s disease will have to be examined further.