Neuroimaging feature extraction using a neural network classifier for imaging genetics
BMC Bioinformatics volume 24, Article number: 271 (2023)
Abstract
Background
Dealing with the high dimension of both neuroimaging data and genetic data is a difficult problem in the association of genetic data to neuroimaging. In this article, we tackle the neuroimaging dimension problem with an eye toward developing solutions that are relevant for disease prediction. Supported by a vast literature on the predictive power of neural networks, our proposed solution uses neural networks to extract from neuroimaging data features that are relevant for predicting Alzheimer’s Disease (AD), for subsequent relation to genetics. The neuroimaging-genetic pipeline we propose comprises image processing, neuroimaging feature extraction and genetic association steps. We present a neural network classifier for extracting neuroimaging features that are related to the disease. The proposed method is data-driven and requires no expert advice or a priori selection of regions of interest. We further propose a multivariate regression with priors specified in the Bayesian framework that allows for group sparsity at multiple levels, including SNPs and genes.
Results
We find the features extracted with our proposed method are better predictors of AD than features used previously in the literature, suggesting that single nucleotide polymorphisms (SNPs) related to the features extracted by our proposed method are also more relevant for AD. Our neuroimaging-genetic pipeline led to the identification of some overlapping and, more importantly, some different SNPs when compared to those identified with previously used features.
Conclusions
The pipeline we propose combines machine learning and statistical methods to benefit from the strong predictive performance of black-box models to extract relevant features while preserving the interpretation provided by Bayesian models for genetic association. Finally, we argue in favour of using automatic feature extraction, such as the method we propose, in addition to ROI or voxelwise analysis to find potentially novel disease-relevant SNPs that may not be detected when using ROIs or voxels alone.
Background
Brain imaging genomic studies have great potential for better understanding psychopathology and neurodegenerative disorders. While high-throughput genotyping technology can determine high-density genetic markers such as single nucleotide polymorphisms (SNPs), neuroimaging technology provides a great level of detail of brain structure and function [1]. Various modalities of brain imaging can be used to generate meaningful biological information that can in turn be used to evaluate how genetic variation influences disease and cognition. In Alzheimer’s disease (AD), structural modalities such as magnetic resonance imaging (MRI) can detect the presence of neuronal cell loss and gray matter atrophy, both indicators of neurodegeneration. Such neuroimaging phenotypes are attractive because they are closer to the biology of genetic function than clinical diagnosis [2].
Imaging genetic data analysis is a statistically challenging task due to the high dimension of both the neuroimages and genetic data. Further increasing the challenge is the fact that the data can be of multiple forms; neuroimages can be collected in multiple formats, e.g. MRI, Computerised Tomography (CT) and Positron Emission Tomography (PET), using different machines and in different institutions. Consequently, it is important to find a general solution to the dimension problem that is applicable to a wide range of data structures, which is what we propose in this manuscript.
We consider studies having an emphasis on exploring the relation between genetic variation and brain imaging from structural modalities such as MRI, and consider associated statistical methodology for dimension reduction and genetic variable selection. We focus our effort on the identification of SNPs that are potentially related to disease, for example AD, through brain imaging endophenotypes, which have the potential to provide additional structure related to the underlying etiology of the disease. Existing approaches for such analysis consider the imaging data through a specific set of regions of interest (ROIs) (see, e.g., [3,4,5,6,7]) or are based on a full voxelwise analysis with statistical models fit at each voxel (see, e.g., [8,9,10,11,12]).
The first approach for statistical analysis in studies of imaging genetics was based on brain-wide and genome-wide mass univariate analyses [9]. A drawback of this framework is that it ignores linkage disequilibrium and the associated multicollinearity between genetic markers, as well as dependence between the components of the imaging phenotype. Hibar et al. [8] employed gene-based dimensionality reduction to avoid collinearity of SNP vectors. Vounou et al. [6] employed sparse procedures based on reduced-rank regression while Ge et al. [11] considered multilocus interactions and developed kernel machine approaches. A review of methods is provided by Nathoo et al. [13].
Bayesian joint modelling combining imaging, genetic and disease data has been considered in [14] and [15]. The proposed joint models use logistic regression to relate disease endpoints to imaging-based features, and a second regression relates imaging to genetic markers. Spike-and-slab selection is employed in both regression components of the joint model. Hierarchical models accounting for spatial dependence in the imaging phenotype using Markov random fields have been developed in [16] and [17]. Zhu et al. [7] developed a Bayesian reduced rank regression reducing the dimension of the regression coefficient matrix and incorporating a sparse latent factor representation for the covariance matrix of the imaging data based on a gamma process prior. Kundu et al. [18] proposed a semiparametric conditional graphical model for imaging genetics within the context of functional brain connectivity where a Dirichlet process mixture is used for clustering regression coefficients into a modular structure. Azadeh et al. [19] developed a voxelwise Bayesian approach that began by partitioning the brain into ROIs and then fitting multivariate regression models to lower-dimensional projections of the voxel-specific data within each ROI separately and in parallel across ROIs.
We investigate here a new approach for extracting imaging features in either the ROI or the voxelwise setting. Statistical learning approaches for feature construction and dimension reduction have been developed based on a number of approaches such as Gaussian Mixture Models (GMM) [20] and Principal Component Analysis (PCA) [21]. In the former, Chaddad et al. use the assignment weights of GMMs as a set of features, while in the latter the low-dimensional projection of PCA plays the role of extracted features. The ability of neural networks (NNs) to effectively reduce the dimension of large data has been known for some time [22]. Since then, NNs have been at the foundation of multiple feature extraction models [23,24,25] in image analysis. The autoencoder (AE) is a commonly used NN model for feature extraction [26, 27]. It consists of two pieces, an encoder and a decoder. The former compresses the data, embedding it within a lower-dimensional representation, while the latter decompresses this representation back to its original dimension. Both components are optimized simultaneously so as to reduce the reconstruction error. The encoder and the decoder can take various forms, but we will assume both are NNs.
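As a concrete illustration of the AE just described, the following minimal PyTorch sketch pairs a one-layer encoder with a linear decoder. The input and code dimensions (1860 and 35) are illustrative placeholders borrowed from the implementation discussed later, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: the encoder embeds the input in a
    lower-dimensional representation; the decoder reconstructs
    the input from that representation."""
    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        code = self.encoder(x)           # compressed representation
        recon = self.decoder(code)       # decompressed back to input size
        return recon, code

# Both components are trained jointly to minimize reconstruction error.
model = Autoencoder(in_dim=1860, code_dim=35)
x = torch.randn(8, 1860)                 # batch of 8 synthetic inputs
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction error
```

In practice the `code` tensor is the set of extracted features; note, however, that nothing in this objective ties those features to a disease outcome, which motivates the classifier-based alternative developed below.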
Predicting a diagnosis successfully using NNs is also supported by a large literature [28,29,30,31,32,33,34] that has demonstrated that various modern neural network architectures, such as Convolutional Neural Networks (CNNs) [35,36,37], weighted probabilistic neural networks [38] and ensembles of deep neural networks [36, 39] can achieve extremely high accuracy in the classification of MRI and PET scans. Shen et al. [40] present a thorough review of early applications of deep learning in medical imaging. Specifically within the context of imaging genetics, Ning et al. [41] were among the first to apply NN approaches. Their approach was to train a NN taking both imaging data and genetic markers as inputs to predict a binary disease response (AD diagnosis).
In this manuscript, we first present a novel three-step imaging genetics pipeline: image processing, feature extraction and finally genetic inference. This separates the pieces where we do not require strong interpretability, such as image processing and feature extraction, from the pieces where we do need interpretability, namely genetic inference. Then, we argue in favour of using a prediction model for the feature extraction step. Finally, we implement a simple version of the proposed pipeline as a proof of concept and discuss our findings.
This separation is beneficial for multiple reasons. First, it allows us to utilize the increased prediction accuracy of black-box models for feature extraction without suffering from their drawbacks, such as their lack of interpretability or their inability to provide us with rigorous confidence intervals or anything statistically equivalent. Additionally, it is easy to modify and improve the three pieces individually, making this pipeline applicable to a wide range of data structures. This is central to our contribution because what we propose is a general approach exemplified with a specific implementation. This way, our proposed pipeline is applicable to a wide range of imaging data and can be constructed with the latest state-of-the-art models.
Consequently, the novelty of our pipeline lies in how we combine well-established models so that the resulting SNP selection has greater meaning and relevance for disease, while the imaging features are nonlinear representations that are otherwise not attainable through standard voxelwise and ROI-based imaging genetics analysis.
Using a classification model for feature extraction ensures that the lower-dimensional representation, the extracted features, is relevant for predicting the neurological disease of interest. A popular NN architecture for feature extraction is the AE; however, there is no way to guarantee that its lower-dimensional representation is correlated with the disease of interest. Using a neural network classifier (NNC) to extract features is a way to combine the strength of AEs for producing low-dimensional representations with the high predictive accuracy of NNCs, so as to extract features relevant to disease diagnosis. Those features are subsequently related to genetics using a Bayesian inference model accounting for grouping of regression coefficients within SNPs and within genes.
We demonstrate that it is possible to achieve higher prediction accuracy in classifying disease status (AD relative to normal controls (NC)) when using NNC features compared with features used previously in the literature based on known AD ROIs. This improvement in classification accuracy could be made even larger by using more sophisticated models, but this is outside the scope of this manuscript, where our focus is imaging genetics. We do not argue in favour of a specific model for image classification but rather in favour of using classification models for feature extraction. Consequently, what we propose is a general approach where the classification model can be changed depending on the task at hand.
The rest of the paper proceeds as follows. We introduce our proposed pipeline in Sect. 2. Then, in Sect. 3 we discuss our experimental testing setup and an implementation of the proposed approaches with ADNI data. Section 4 presents our findings on a case example. Finally, Sect. 5 concludes with a discussion about our experimental results, implications of the findings and possible extensions.
Proposed pipeline
Concept
Based on the premise that neuroimaging data is a better representation of the phenotype of interest than clinical diagnostics, we aim to capture genetic variations related to the disease by directly considering the brain structure. Due to the high dimensionality of neuroimaging, we propose NNs to extract features related to disease while simultaneously reducing the data’s dimension.
We assume that the natural generation of data follows the premise that genotype is related to brain structure that in turn is related to disease as explained in [42], sequentially in that order. Our framework thus reverses this process which, while clearly an oversimplification, provides a useful mechanism for thinking about data analysis and SNP selection.
The automated disease-relevant feature extraction is based on training a classification model on the imaging data with the disease diagnostic variable as output. Without loss of generality, we propose a NN, without specifying an architecture at this point. The neurons of the second to last layer of this NN prediction function act as the features extracted by the model. Because the NN is optimized to predict disease diagnosis as accurately as possible using the image data, those neurons are in fact the variables constructed from the images that are the most appropriate to predict the disease, and are consequently features relevant for SNP selection. An alternative, to which we make comparisons in our test analysis, is features extracted from known disease regions using expert knowledge.
Formal definition
Let \(v_{n,m}\) denote voxel \(m \in \{1,\dots ,M\}\) for subject \(n \in \{1,\dots ,N\}\) and \({\textbf{v}}_n\) denote the complete imaging data for subject n. We identify with \({\textbf{v}}^*_n\) the processed image for subject n. Here, the processed images may take on different forms but \({\textbf{v}}^*_n\) is some standardized image data that the prediction model f takes as input. The processing might only involve image registration in its simplest form or it might involve the extraction of volumetric and cortical thickness statistics using FreeSurfer for instance. Then, \(y_n\) is the disease phenotype for subject n which can be binary or multiclass categorical. We further let \(g_{n,s}\) denote the genetic variant \(s \in \{1,\dots ,S\}\) for subject n so that \({\textbf{g}}_n\) is the genetic data for subject n.
Let h be the image processing function which takes the images \({\textbf{v}}\) as input and outputs \({\textbf{v}}^*\). We define as f the classification function which takes \(\mathbf {v^*}\), the processed imaging data, as input and outputs y, the disease phenotype. We define f as a NN composed of L layers, each identified as \(f_l, l \in \{1,\dots ,L\}\), with \(f_1\) being the input layer and \(f_L\) the output layer. Each layer l may have a different number of neurons, say \(K_l\), whose values we collect in the vector \({\textbf{x}}_l\). In our current parametrization, the output layer is a \(K_L\)-dimensional vector, \({\textbf{o}}\), where \(o_{n,k} = {\hat{P}}(y_n = k)\), the predicted probability that subject n belongs to class k. After training the neural network f, we fit a statistical model, p, which has the genetic data \({\textbf{g}}\) as explanatory variable and the neurons of the second to last layer of f, \(f_{L-1}\), as response.
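A minimal sketch of these definitions, assuming a fully connected PyTorch network with hypothetical layer sizes: f maps the processed image \({\textbf{v}}^*\) to class probabilities \({\textbf{o}}\), and the features passed to the inference model p are the activations of the second to last layer.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the classifier f with L = 3 layers; the layer
# sizes here are hypothetical, not those of the final model.
f = nn.Sequential(
    nn.Linear(1860, 128), nn.ReLU(),    # hidden layer(s)
    nn.Linear(128, 35), nn.ReLU(),      # f_{L-1}: second to last layer
    nn.Linear(35, 3),                   # f_L: K_L = 3 diagnosis classes
)

def extract_features(net: nn.Sequential, v_star: torch.Tensor) -> torch.Tensor:
    """Return x_{L-1}, the activations of the second to last layer,
    i.e. the network with its output layer f_L removed."""
    with torch.no_grad():
        return net[:-1](v_star)

v_star = torch.randn(4, 1860)            # 4 processed images v*
features = extract_features(f, v_star)   # imaging features passed to p
probs = torch.softmax(f(v_star), dim=1)  # o: predicted class probabilities
```

The slice `net[:-1]` simply drops the output layer, so the same trained weights produce both the diagnosis probabilities and the extracted features.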
A detailed representation of the proposed pipeline is shown in Fig. 1. All components previously described are trained as follows: (i) process the raw images \({\textbf{v}}\), (ii) train a prediction model, f, of choice by taking the processed images \({\textbf{v}}^*\) as inputs and the diagnosis score as output and (iii) train the inference model p that predicts the features extracted from the prediction model using genetic markers as inputs. The use of a statistical model as our choice of inference model is based on the current availability of interpretable and inferencefocused models in the literature.
The proposed approach can be generalized to include various prediction models, such as CNNs taking images as inputs, or different NN architectures with inputs being the imaging features extracted from commonly used software such as FreeSurfer, developed by Dale, Fischl and Sereno (see [43, 44]). This setup also has the flexibility to easily handle multiple brain imaging modalities, extracting features from, for example, EEG, MRI and fMRI using a modular NN with different modules corresponding to different modalities. Similarly, a wide range of inference models can be used and later combined using Bayesian model averaging techniques that account for model uncertainty at the inference stage.
Methods
The aim of this section is to provide readers with a concrete implementation of the proposed pipeline to lay out a test application. It also provides results that highlight the benefits and drawbacks of the proposed approach. In the following test analysis we use disease (AD), MRI and genetic data from the ADNI1 study. FreeSurfer is used for image processing, a simple NN for the prediction model and a multivariate group-sparse Bayesian regression model for SNP selection. Figure 2 provides a visual representation of this simple implementation.
We compare the prediction accuracy of the 56 volumetric and cortical thickness measurements considered in [5, 45], and [46], which include locations of regions of interest such as the hippocampus, cerebellum and ventricles relevant for AD, with features automatically extracted by our proposed technique. We also compare the SNPs identified given those two sets of phenotype features. Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a publicprivate partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD).
Cohort of subjects
The cohort of subjects we use in our test application has been previously described by Mirabnahrazam et al. [34]. Briefly, the ADNI1 database has genetic information for 818 subjects. Genotyping information of the ADNI1 subjects was downloaded in PLINK [47] format from the LONI Image Data Archive (https://ida.loni.usc.edu/). During the genotyping phase, 620,901 SNPs were obtained on the Illumina Human610Quad BeadChip platform. Genomic quality control was conducted using the PLINK software and yielded 521,014 SNPs for 570 subjects. When excluding subjects that had no diagnosis label available, we ended up with 543 subjects for our analysis. The diagnosis values we consider for this experiment are NC, MCI and AD.
In summary, we have a cohort of 543 subjects with 145 NC, 256 MCI and 142 AD. We have T1-weighted baseline MRI scans for every subject as well as 521,014 SNPs.
Image preprocessing
The T1-weighted baseline MRI scans were downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (n=543). A detailed description of the MRI acquisition protocols can be found on the ADNI website (https://adni.loni.usc.edu/methods/documents/mriprotocols). The T1-weighted images \({\textbf{v}}\) were then segmented into gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF) tissue compartments using FreeSurfer (version 6.0), which is freely available for download (http://surfer.nmr.mgh.harvard.edu), and has been described previously [43, 44, 48]. A standardized quality control procedure was used to manually identify and correct any errors in the automated tissue segmentation in accordance with FreeSurfer’s troubleshooting guidelines. Subsequently, cortical GM was parcellated into 68 regions using FreeSurfer’s cortical Desikan-Killiany atlas [49] and 62 regions using FreeSurfer’s Desikan-Killiany-Tourville atlas [50]. Subcortical GM was parcellated into 45 regions using FreeSurfer’s “aseg” atlas and subcortical WM was parcellated into 70 regions using FreeSurfer’s “wmparc” atlas [51]. For the white matter parcellation (wmparc), optional FreeSurfer parameters were used to ensure the entire white matter compartment was parcellated (not just WM within a fixed default distance from GM), and any T1 hypointensities were labelled as white matter. This was done to ensure that the white matter parcellation included all white matter voxels and was not biased by individual T1 hypointensity burden. For all other parcellations, the default FreeSurfer options were used. From these four parcellations, a total of 1860 features were obtained. These features included:

The volume, mean, standard deviation, min, max, and range of Freesurfer normalized T1 intensity values for the “aseg” (270 total features) and “wmparc” (420 total features) atlas parcellations.

The number of vertices, surface area, gray matter volume, thickness (mean, standard deviation), curvature (mean, Gaussian), folding index, and curvature index for the Desikan-Killiany-Tourville (558 total features) and Desikan-Killiany (612 total features) atlas parcellations.
These 1860 features form \({\textbf{v}}^*\), the processed image.
Prediction model for feature extraction
We propose a fully connected NN as a prediction model for this simple test application. The inputs of our prediction model are the entirety of the features extracted with FreeSurfer described previously, \(\mathbf {v^*}\). The output is AD diagnosis, which is a categorical variable in the ADNI1 database, and finally, the neurons of the second to last layer of this NN are the features we are interested in.
In this proposed approach, there is great flexibility in building the early stages of the NN. Specifically, we have control over the number of hidden layers and the nonlinear activation function. Assuming the response is a \(K_L\)-class categorical variable, the output of the NN is a \(K_L\)-dimensional vector \({\textbf{o}}\) where \(o_{n,k} = {\hat{P}}(y_n = k)\), which represents the belief that subject n belongs to class k. The relation between the second to last layer and the output layer can be thought of as the one established between predictors and output in a multiclass logistic regression. To do so, we take \(K_L\) linear combinations of the \(K_{L-1}\) inputs \({\textbf{x}}_{L-1}\), so that \(\mathbf {o^*} = B{\textbf{x}}_{L-1}\), where B is a \(K_L \times K_{L-1}\) matrix of coefficients. Then, as activation function, we apply the softmax function to ensure the values are positive and sum to one: \(o_{j} = \frac{\exp (o_j^*)}{\sum _{k=1}^{K_L} \exp (o_k^*)}\).
The model is trained in a similar fashion to a multiclass logistic regression. We minimize the negative log-likelihood loss \(NLLL({\textbf{o}},{\textbf{y}})= \sum _{n=1}^N nlll_n\), where \(nlll_n = -\sum _{k=1}^{K_L} \log (o_{n,k}) 1(y_n=k)\). This is essentially equivalent to maximizing the log likelihood of a multinomial distribution. Thus, one can think of the extracted features \({\textbf{x}}_{L-1}\) as effectively being one logistic regression away from the disease response; however, these features are constructed from data-driven nonlinear functions built from the input.
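The softmax and negative log-likelihood computations above can be sketched in a few lines of NumPy; the dimensions (3 classes, 35 features) and the random inputs are illustrative stand-ins.

```python
import numpy as np

def softmax(o_star):
    """Map the K_L linear scores o* = B x_{L-1} to probabilities that
    are positive and sum to one (computed in a numerically stable way)."""
    e = np.exp(o_star - o_star.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nlll(o, y):
    """Negative log-likelihood loss: -sum_n log o_{n, y_n}, i.e. the
    multinomial log likelihood up to sign."""
    n = np.arange(len(y))
    return -np.log(o[n, y]).sum()

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 35))      # K_L x K_{L-1} coefficient matrix
x = rng.normal(size=(10, 35))     # features x_{L-1} for 10 subjects
y = rng.integers(0, 3, size=10)   # class labels

o = softmax(x @ B.T)              # predicted class probabilities
loss = nlll(o, y)                 # quantity minimized during training
```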
We use the Python language and the pandas package [52] to import and manipulate the data set. The feature extraction is entirely done using Python. We use the PyTorch package [53] to define and train the NNC. Our NN is a single hidden layer NN with 35 hidden nodes trained with the Adagrad [54] optimizer. Finally, in order to train the NNC to distinguish AD from NC patients, and thus to extract features related to the difference between those two groups, we only keep NC and AD subjects during the training of the NNC, thus excluding MCI patients. In other words, we train the NNC on a cohort of 287 subjects (145 NC and 142 AD).
Most of the parameters, such as the number of hidden layers (1), the optimizer (Adagrad), the learning rate (0.01), the learning rate decay (0) and the number of epochs (350), were selected using cross-validation, with the exception of the number of neurons in the hidden layer. We initially set the number of neurons in the second to last layer to 56, as we wanted our model to extract the same number of features as in previous articles [5, 45], and [46]. However, reducing the number of neurons to 35 did not decrease the accuracy, so our final set of automatically-extracted features has 35 brain features.
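Putting the pieces together, a sketch of this final configuration on simulated stand-in data for the 287-subject training cohort (PyTorch's built-in `CrossEntropyLoss` combines the softmax and negative log-likelihood steps described earlier, so the network outputs raw logits).

```python
import torch
import torch.nn as nn

# Simulated stand-ins for the 287 NC/AD subjects and their labels.
torch.manual_seed(0)
v_star = torch.randn(287, 1860)       # 1860 FreeSurfer features per subject
y = torch.randint(0, 2, (287,))       # 0 = NC, 1 = AD

# One hidden layer of 35 neurons; its activations are the extracted features.
nnc = nn.Sequential(nn.Linear(1860, 35), nn.ReLU(), nn.Linear(35, 2))
opt = torch.optim.Adagrad(nnc.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()       # softmax + negative log-likelihood

for epoch in range(350):              # 350 epochs, full-batch updates
    opt.zero_grad()
    loss = loss_fn(nnc(v_star), y)
    loss.backward()
    opt.step()

# The 35 imaging features passed on to the genetic inference model.
features = nnc[:2](v_star).detach()
```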
Inference model
The large number of SNPs contrasts with the small fraction of them expected to be related to the imaging phenotypes. SNPs are connected to traits through various pathways, and multiple SNPs on one gene often jointly carry out genetic functionalities. Therefore, it is desirable to develop a model that exploits the group structure of SNPs.
Wang et al. [4] developed Group-Sparse Multi-task Regression and Feature Selection (G-SMuRFS) to perform simultaneous estimation and SNP selection across phenotypes. We denote matrices by boldface uppercase letters and vectors by boldface lowercase letters. Let the SNP data of the ADNI participants be \(\{{\varvec{g}}_1,...,{\varvec{g}}_n\}\subseteq {\mathbb {R}}^{S}\), where n is the number of participants (sample size) and S is the number of SNPs (feature dimensionality), with \({\varvec{G}}=[{\varvec{g}}_1,...,{\varvec{g}}_n]\), and let the imaging phenotypes be \(\{{\varvec{x}}_1,...,{\varvec{x}}_n\}\subseteq {\mathbb {R}}^{C}\), with C the number of imaging phenotypes and \({\varvec{X}}=[{\varvec{x}}_1,...,{\varvec{x}}_n]\). With \({\varvec{W}}\) a \(S \times C\) matrix of regression coefficients, where the entry \(w_{ij}\) of the weight matrix \({\varvec{W}}\) measures the relative importance of the ith SNP in predicting the response of the jth imaging phenotype, the matrix algebraic formulation of the regression is
$$\begin{aligned} \min _{{\varvec{W}}} \Vert {\varvec{X}} - {\varvec{W}}^T{\varvec{G}}\Vert _F^2 + \lambda _1 \Vert {\varvec{W}}\Vert _{G_{2,1}} + \lambda _2 \Vert {\varvec{W}}\Vert _{2,1}, \end{aligned}$$
where \(\Vert \cdot \Vert _{G_{2,1}}\) is the group \(l_{2,1}\)-norm devised by Wang et al. [4]. We recapitulate its definition: suppose the SNPs are partitioned into Q groups \(\Pi = {\{\pi _q\}}^Q_{q=1}\), such that the rows of \({\varvec{W}}\) corresponding to SNPs in \(\pi _q\) are genetically linked, with \(m_q\) the number of SNPs in \(\pi _q\). Denote \({\varvec{W}}=[{\varvec{W}}^1... {\varvec{W}}^Q]^T\), with \({\varvec{W}}^q \in {\mathbb {R}}^{m_q \times C} (1\le q \le Q)\); then, writing \({\varvec{w}}^i\) for the ith row of \({\varvec{W}}\), the group \(l_{2,1}\)-norm can be defined as
$$\begin{aligned} \Vert {\varvec{W}}\Vert _{G_{2,1}} = \sum _{q=1}^{Q} \sqrt{\sum _{i \in \pi _q} \Vert {\varvec{w}}^i\Vert _2^2} = \sum _{q=1}^{Q} \Vert {\varvec{W}}^q\Vert _F. \end{aligned}$$
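For concreteness, the group \(l_{2,1}\)-norm can be computed directly from this definition; the SNP-to-gene grouping below is a toy example.

```python
import numpy as np

def group_l21_norm(W, groups):
    """Group l_{2,1}-norm: for each group (gene) pi_q, take the
    Frobenius norm of the rows of W belonging to that group, then
    sum over the Q groups."""
    return sum(np.sqrt((W[idx] ** 2).sum()) for idx in groups)

# Toy example: S = 5 SNPs mapped to Q = 2 genes, C = 3 imaging phenotypes.
W = np.arange(15, dtype=float).reshape(5, 3)
groups = [np.array([0, 1]), np.array([2, 3, 4])]
norm = group_l21_norm(W, groups)
```

This penalty encourages entire gene blocks of coefficients to shrink to zero together, which is exactly the group structure motivated above.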
While producing sparse point estimates of regression coefficients, G-SMuRFS lacked standard error computation. Kyung et al. [55] demonstrated that bootstrapped standard error computations perform poorly when the true value of the coefficient is zero, so an equivalent hierarchical Bayesian model was developed in [5]. The hierarchical model takes the form
$$\begin{aligned} {\varvec{x}}_\ell \mid {\varvec{W}}, \sigma ^2 \overset{ind.}{\sim } N_C({\varvec{W}}^T{\varvec{g}}_\ell ,\, \sigma ^2 I_C), \quad \ell = 1,\dots ,n, \end{aligned}$$
with the coefficients corresponding to different genes assumed conditionally independent,
$$\begin{aligned} p({\varvec{W}} \mid \sigma ^2, \lambda _1, \lambda _2) = \prod _{q=1}^{Q} p({\varvec{W}}^{(q)} \mid \sigma ^2, \lambda _1, \lambda _2), \end{aligned}$$
and with the prior distribution for each \({\textbf{W}}^{(q)}\) having a density function that is based on a product of multivariate Laplace kernels,
$$\begin{aligned} p({\textbf{W}}^{(q)} \mid \sigma ^2, \lambda _1, \lambda _2) \propto \exp \left\{ -\frac{\lambda _1}{\sigma }\sqrt{\sum _{i \in \pi _q} \Vert {\textbf{w}}^i\Vert _2^2}\right\} \prod _{i \in \pi _q} \exp \left\{ -\frac{\lambda _2}{\sigma }\Vert {\textbf{w}}^i\Vert _2\right\} . \end{aligned}$$
This product Laplace density can be expressed as a Gaussian scale mixture, which allows for the implementation of Bayesian inference using a standard Gibbs sampling algorithm. The algorithm is implemented in the R package bgsmtr (https://cran.r-project.org/web/packages/bgsmtr/bgsmtr.pdf), which is available for download on the Comprehensive R Archive Network (CRAN). The selection of the tuning parameters \(\lambda _{1}\), \(\lambda _{2}\) in this model requires cross-validation.
This model serves as our primary inference model in this test application and we refer to this model by the name of its associated package, BGSMTR.
Results
The framework we propose is designed for the identification of SNPs related to a disease of interest. In the simple implementation provided, we aim to identify SNPs related to AD. Based on the assumption that neuroimaging features that can accurately predict disease status are more closely related to the disease, we compare the accuracy of logistic regression models that take NN-extracted features as inputs to the accuracy of a model that utilizes the expert-selected features used in recent imaging genetics publications such as [5, 45], and [46]. For that purpose, we proceed with 50 repetitions of random subsampling validation: randomly dividing the data set into a training set and a test set. The training set contains 200 observations while the 73 other observations are assigned to the test set. Compared to k-fold cross-validation, random subsampling validation has the benefit of allowing us to fix the size of the training and test sets independently from the number of Monte Carlo samples.
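This validation scheme can be sketched with scikit-learn's `ShuffleSplit`; the feature matrix and labels below are simulated stand-ins for either feature set (NN-extracted or ROI-based).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

# 50 Monte Carlo repetitions, each with a random 200-subject training
# set and a 73-subject test set.
rng = np.random.default_rng(0)
X = rng.normal(size=(273, 35))            # simulated imaging features
y = rng.integers(0, 2, size=273)          # simulated NC/AD labels

splitter = ShuffleSplit(n_splits=50, train_size=200, test_size=73,
                        random_state=0)
accuracies = []
for train_idx, test_idx in splitter.split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

mean_acc, sd_acc = np.mean(accuracies), np.std(accuracies)
```

Unlike `KFold`, `ShuffleSplit` lets the 200/73 split sizes stay fixed no matter how many repetitions are drawn.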
Table 1 shows the results. The model trained using the automatically extracted features not only has a significantly higher accuracy (p-value \(< 0.0001\)) but also has a smaller performance variance across the subsamples. The better prediction performance suggests that these features are useful for subsequent genetic analysis. More sophisticated prediction models can be investigated in future studies.
To provide an additional perspective on the NN-extracted features and to visually compare them to the features selected based on standard ROIs, we compute a 2-dimensional embedding for both sets of features using t-distributed stochastic neighbor embedding (t-SNE) as proposed in [56], a t-distributed variant of the original SNE proposed in [57]. Unlike PCA, which finds a linear representation capturing as much variability as possible, the SNEs proposed in [57] try to identify a low-dimensional representation that optimally preserves a neighborhood identity. A neighborhood-preserving embedding is especially interesting here as the features are extracted to carry information about the disease status of the patient. Figures 3 and 4 contain the embeddings of the training cohort containing strictly the NC and AD patients. A randomly selected neighborhood in Fig. 3 is more likely to have a high concentration of one class compared to a randomly selected neighborhood in Fig. 4.
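A t-SNE embedding of this kind can be obtained with scikit-learn; the feature matrix below is simulated in place of the extracted features, and the perplexity value is an illustrative choice.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(287, 35))   # stand-in for either feature set

# Map each subject's 35-dimensional feature vector to 2-D coordinates
# while preserving local neighborhood structure.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

# emb has one 2-D point per subject and can be scatter-plotted,
# coloured by NC/AD status, to inspect neighborhood purity.
```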
To begin the genetic analysis, we follow the recommendations found in [17, 46] and adjust for subject-specific factors by fitting a univariate least squares linear regression of every feature (both NN-derived and ROI-based) on age, gender, education level, APOE genotype and total intracranial volume. The residuals from each regression are then used as the adjusted imaging response in the inference model.
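This covariate adjustment amounts to keeping the residuals of one least squares fit per feature; a sketch with simulated data (the covariate and feature values below are random stand-ins for the real subject data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjust_features(features, covariates):
    """Regress each imaging feature on the subject-level covariates
    (age, gender, education, APOE genotype, total intracranial volume)
    and return the residuals, used as adjusted imaging responses."""
    adjusted = np.empty_like(features)
    for j in range(features.shape[1]):
        fit = LinearRegression().fit(covariates, features[:, j])
        adjusted[:, j] = features[:, j] - fit.predict(covariates)
    return adjusted

rng = np.random.default_rng(0)
Z = rng.normal(size=(543, 5))     # the five covariates, one row per subject
F = rng.normal(size=(543, 35))    # imaging features
F_adj = adjust_features(F, Z)     # residuals passed to the inference model
```

By construction the residuals are orthogonal to the covariates, so the downstream SNP associations are not driven by these subject-specific factors.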
We then proceed with a two-step process to reduce the number of SNPs selected. First, we reduce the large number of SNPs to a smaller subset of 485 SNPs potentially related to AD [5], based on expert advice. Second, we fit univariate models between every feature and every SNP and keep the top 100 SNPs based on the resulting p-values [58, 59]. We rank the SNPs by the smallest p-value across all models in which they are included.
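The second screening step can be sketched as follows; for brevity the simulated example screens only a handful of features, whereas the actual analysis uses all of them.

```python
import numpy as np
from scipy import stats

def screen_snps(snps, features, top=100):
    """Fit a univariate regression of every feature on every SNP and
    rank SNPs by the smallest p-value across all their models."""
    S = snps.shape[1]
    best_p = np.ones(S)
    for s in range(S):
        for j in range(features.shape[1]):
            p = stats.linregress(snps[:, s], features[:, j]).pvalue
            best_p[s] = min(best_p[s], p)
    return np.argsort(best_p)[:top]          # indices of the top SNPs

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(543, 485)).astype(float)  # 485 candidate SNPs
F = rng.normal(size=(543, 5))                # a few features, for speed
kept = screen_snps(G, F, top=100)            # 100 screened SNPs
```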
Table 2 contains the top 20 SNPs extracted using univariate regression as explained above. The status of each SNP, novel or known, is checked against two previous publications, [4] and [5]. By comparing the SNPs associated with both sets of features, we first notice that the top 3 SNPs are quite similar and that, overall, many SNPs belong to both groups. Additionally, we notice that the genes are also quite similar between the two sets of screened SNPs. However, we also identified multiple SNPs that were not identified using the original, expert-based, features. The possibility of identifying additional SNPs based on features that are more predictive of disease is the potential added value of the proposed approach. Thus the NN-derived features can be used alongside more standard ROI-based features. These novel SNPs could simply be carrying a gene-specific signature, which is one more reason why we rely on a multivariate regression model to determine the final set of SNPs.
For this reason, we follow with a subsequent multivariate regression that better allows us to distinguish SNPs truly associated with the features from confounded ones. We use the 100 screened SNPs as predictors in our inference model, the BGSMTR model described earlier.
Table 3 contains the top 20 SNPs ranked by the posterior standard score: the posterior mean divided by the posterior standard deviation. In this table we again see a mix of novel and known SNPs, and once again the status, novel or known, is checked against two previous publications [4, 5]. Among others, identifying the association with AD through MRI features of SNPs rs1699105, rs2025935 and rs12209631, to name a few, is consistent with previous publications [4, 5]. Since half of the SNPs identified were reported in previous publications, our approach is consistent with known results, which speaks to the reproducibility of our data-driven approach.
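The posterior standard score ranking can be illustrated with a toy example. The SNP names and posterior summaries below are hypothetical, and we assume ranking by the magnitude of the score:

```python
# Hypothetical posterior summaries (mean, sd) of each SNP's regression coefficient.
posterior = {
    "rs_a": (0.40, 0.10),   # standard score  4.0
    "rs_b": (-0.90, 0.50),  # standard score -1.8
    "rs_c": (0.05, 0.02),   # standard score  2.5
}
score = {snp: m / s for snp, (m, s) in posterior.items()}
ranked = sorted(score, key=lambda snp: abs(score[snp]), reverse=True)
print(ranked)  # ['rs_a', 'rs_c', 'rs_b']
```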
On the flip side, if we only discovered known SNPs there would be little advantage to our approach. The SNP rs6511720 is ranked very high on the list and was associated with 15 features (according to 95% credible intervals, with selection as in [5]). The SNPs rs6457200, rs2243581 and rs3785817 are also ranked high and/or are related to multiple features.
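Selection via 95% credible intervals can be sketched as follows, with hypothetical intervals: a SNP is selected for a feature when the interval for its coefficient excludes zero, and its feature count is the number of features for which this holds.

```python
# Hypothetical 95% credible intervals (lower, upper) for one SNP's coefficient
# across three imaging features.
intervals = {
    "feat_1": (0.1, 0.5),    # excludes zero: selected
    "feat_2": (-0.2, 0.3),   # contains zero: not selected
    "feat_3": (-0.7, -0.1),  # excludes zero: selected
}
selected = [f for f, (lo, hi) in intervals.items() if lo > 0.0 or hi < 0.0]
print(len(selected))  # 2
```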
Discussion
The results above provide a strong argument in favour of the proposed pipeline, which can be used in addition to a standard voxel-wise or ROI-based imaging genetics analysis. The features extracted are not only better at predicting the neurological disease of interest but, more importantly, allowed the identification of different SNPs. For instance, we identified the SNP rs6511720 as being related to 15 features, yet this SNP was not found to be related to the expert-selected features. Our proposed method could therefore lead to the identification of novel causal SNPs. Furthermore, the extraction process is data-driven and requires no expert advice apart from the diagnosis. Consequently, we argue in favour of using automatic feature extraction in addition to ROI or voxel-wise features to find potentially novel SNPs that may not be detected when using ROIs or voxels alone. Our focus here is to identify SNPs related to MRI in a manner that is predictive of disease and to obtain confidence intervals and posterior distributions. Integrating machine learning approaches within imaging genetics studies is of potential use, as demonstrated in our analysis.
One advantage of the procedure we propose is its flexibility: we can easily improve each of the three pieces of the pipeline separately. A limitation of the study of Sect. 3 is that only a single implementation was tested on a single data set; on the flip side, this opens up possible improvements for future projects. In this first implementation of our proposed pipeline, we use the well-established FreeSurfer software to obtain volumetric and cortical thickness statistics from the MRI scans. We obtain automatically extracted features in a data-driven manner that have higher predictive power for disease. It thus seems reasonable to extend that principle to image processing and to process the images automatically in a data-driven way as well. For instance, a common NNC for images is the Convolutional Neural Network (CNN) [35,36,37]. Using a CNN that takes the 3-dimensional brain scan images as input and training this model to predict the diagnosis would be of potentially great value for further investigation. The convolutional layers replace some of the image processing steps and the lower-level layers act as the feature extractor; however, some processing, mostly registration, would still be required. As previously demonstrated [28, 29], we expect the CNN to provide an even better prediction accuracy and thus features more closely related to AD. Another interesting approach to explore is to use an AE to first reduce the dimension of the images in an unsupervised manner. Different AEs can be trained for each brain region separately, which allows the number of features extracted per region to vary; the collection of AEs can then extract more features from regions with higher variability or from regions with more predictive power. Finally, a different perspective for future work would be to model and capture the complex interactions between the neuroimaging data and the genetic data using a heterogeneous information network. Zhao et al. [60] successfully combined different data modalities for drug–disease associations using a graph representation learning model on a biological heterogeneous information network.
Additionally, our work demonstrates the use of different objective functions to extract features and reduce the dimension of large observations such as neuroimages. Instead of using unsupervised models, we are able to direct the feature extraction towards a variable of interest, in our case the disease diagnosis, which would otherwise not have been used in the analysis relating imaging to genetics. Moreover, with gradient-based models such as NNs, we can design many other objective functions and tailor the feature extraction process to problem-specific needs. This idea can be applied in various ways when analysing neuroimages, and we recommend considering a large collection of data-driven objective functions when extracting features instead of relying strictly on expert advice.
We chose to use a NN for feature extraction, and this comes with strengths and weaknesses. Because our goal is to do inference at the SNP level, we accepted a loss of interpretability at the neuroimaging feature level; this is usually considered a weakness of black-box models such as NNs. In return, this allows us to obtain nonlinear features that are functions of the complete processed images, and the use of a classification model ensures that those features are indeed relevant to AD. The automatic feature extraction approach provides genuine added value when used alongside studies conducted at either the ROI or voxel-wise level. It requires no external expertise for feature selection and uses disease data that are typically available but not typically used in such analyses. The features are built with disease prediction in mind, through nonlinear representations of the neuroimaging data.
Finally, the last step of our pipeline involves an inference step using a multivariate Bayesian group sparse regression. There is scope for generalizing this step to account for model uncertainty where the Bayesian model used is included within a collection of different models (e.g., [7, 14,15,16, 18]) and then Bayesian model averaging is used for inference at the SNP level while accounting for model uncertainty. This extension will be explored as part of future work.
Availability of data and materials
The majority of the code used in preparation of this manuscript is available on the first author's GitHub page: https://github.com/CedricBeaulac/Neuroimaging_genetics. However, because the data are only available through ADNI, the code by itself does not run. The ADNI database is publicly available at https://adni.loni.usc.edu by filling in the appropriate forms.
References
Jinhua S, Yu X, Qiao Z, Luyun W, Ze Y, Jie Y. Predictive classification of Alzheimer’s disease using brain imaging and genetic data. Sci Rep. 2022;12:2405.
Meyer-Lindenberg A. The future of fMRI and genetics research. NeuroImage. 2012;62(2):1286–92. https://doi.org/10.1016/j.neuroimage.2011.10.063.
Wang H, Nie F, Huang H, Risacher SL, Saykin AJ, Shen L, Initiative ADN. Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning. Bioinformatics. 2012;28(12):127–36.
Wang H, Nie F, Huang H, Kim S, Nho K, Risacher SL, Saykin AJ, Shen L, Initiative ADN. Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the ADNI cohort. Bioinformatics. 2012;28(2):229–37.
Greenlaw K, Szefer E, Graham J, Lesperance M, Nathoo FS, Initiative ADN. A Bayesian group sparse multitask regression model for imaging genetics. Bioinformatics. 2017;33(16):2513–22.
Vounou M, Nichols TE, Montana G, Initiative ADN. Discovering genetic associations with high-dimensional neuroimaging phenotypes: a sparse reduced-rank regression approach. Neuroimage. 2010;53(3):1147–59.
Zhu H, Khondker Z, Lu Z, Ibrahim JG. Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers. J Am Stat Assoc. 2014;109(507):977–90.
Hibar DP, Stein JL, Kohannim O, Jahanshad N, Saykin AJ, Shen L, Kim S, Pankratz N, Foroud T, Huentelman MJ. Voxelwise gene-wide association study (vGeneWAS): multivariate gene-based association testing in 731 elderly subjects. Neuroimage. 2011;56(4):1875–91.
Stein JL, Hua X, Lee S, Ho AJ, Leow AD, Toga AW, Saykin AJ, Shen L, Foroud T, Pankratz N. Voxelwise genome-wide association study (vGWAS). NeuroImage. 2010;53(3):1160–74.
Ge T, Feng J, Hibar DP, Thompson PM, Nichols TE. Increasing power for voxel-wise genome-wide association studies: the random field theory, least square kernel machines and fast permutation procedures. Neuroimage. 2012;63(2):858–73.
Ge T, Nichols TE, Ghosh D, Mormino EC, Smoller JW, Sabuncu MR, Initiative ADN. A kernel machine method for detecting effects of interaction between multidimensional variable sets: an imaging genetics application. Neuroimage. 2015;109:505–14.
Huang M, Nichols T, Huang C, Yu Y, Lu Z, Knickmeyer RC, Feng Q, Zhu H, Initiative ADN. FVGWAS: fast voxelwise genome wide association analysis of large-scale imaging genetic data. Neuroimage. 2015;118:613–27.
Nathoo FS, Kong L, Zhu H, Initiative ADN. A review of statistical methods in imaging genetics. Can J Stat. 2019;47(1):108–31.
Batmanghelich NK, Dalca AV, Sabuncu MR, Golland P. Joint modeling of imaging and genetics. In: International Conference on Information Processing in Medical Imaging, 2013:766–777. Springer
Batmanghelich NK, Dalca A, Quon G, Sabuncu M, Golland P. Probabilistic modeling of imaging, genetics and diagnosis. IEEE Trans Med Imaging. 2016;35(7):1765–79.
Stingo FC, Guindani M, Vannucci M, Calhoun VD. An integrative Bayesian modeling approach to imaging genetics. J Am Stat Assoc. 2013;108(503):876–91.
Song Y, Ge S, Cao J, Wang L, Nathoo FS. A Bayesian spatial model for imaging genetics. Biometrics. 2021. https://doi.org/10.1111/biom.13460.
Kundu S, Kang J. Semiparametric bayes conditional graphical models for imaging genetics applications. Stat. 2016;5(1):322–37.
Azadeh S, Hobbs BP, Ma L, Nielsen DA, Moeller FG, Baladandayuthapani V. Integrative Bayesian analysis of neuroimaginggenetic data through hierarchical dimension reduction. In: 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), 2016:824–828. IEEE
Chaddad A. Automated feature extraction in brain tumor by magnetic resonance imaging using gaussian mixture models. Int J Biomed Imaging. 2015;2015:8.
López M, Ramírez J, Górriz JM, Álvarez I, Salas-Gonzalez D, Segovia F, Chaves R, Padilla P, Gómez-Río M, Initiative ADN. Principal component analysis-based techniques and supervised classification schemes for the early detection of Alzheimer’s disease. Neurocomputing. 2011;74(8):1260–71.
Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7.
Pradeep J, Srinivasan E, Himavathi S. Diagonal based feature extraction for handwritten character recognition system using neural network. In: 2011 3rd International Conference on Electronics Computer Technology, 2011;4:364–368. IEEE
Chen Y, Jiang H, Li C, Jia X, Ghamisi P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans Geosci Remote Sens. 2016;54(10):6232–51.
El-Kenawy ESM, Ibrahim A, Mirjalili S, Eid MM, Hussein SE. Novel feature selection and voting classifier algorithms for COVID-19 classification in CT images. IEEE Access. 2020;8:179317–35.
Wang W, Huang Y, Wang Y, Wang L. Generalized autoencoder: a neural network framework for dimensionality reduction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014:490–497
Wang Y, Yao H, Zhao S. Autoencoder based dimensionality reduction. Neurocomputing. 2016;184:232–42.
Suk HI, Lee SW, Shen D, Initiative ADN. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage. 2014;101:569–82.
Islam J, Zhang Y. Brain MRI analysis for Alzheimer’s disease diagnosis using an ensemble system of deep convolutional neural networks. Brain Informatics. 2018;5(2):1–14.
Lu D, Popuri K, Ding GW, Balachandar R, Beg MF, Initiative ADN. Multiscale deep neural network based analysis of FDG-PET images for the early diagnosis of Alzheimer’s disease. Med Image Anal. 2018;46:26–34.
Lin W, Tong T, Gao Q, Guo D, Du X, Yang Y, Guo G, Xiao M, Du M, Qu X. Convolutional neural networks-based MRI image analysis for the Alzheimer’s disease prediction from mild cognitive impairment. Front Neurosci. 2018;12:777.
Duraisamy B, Shanmugam JV, Annamalai J. Alzheimer disease detection from structural MR images using FCM based weighted probabilistic neural network. Brain Imaging Behav. 2019;13(1):87–110.
Jain R, Jain N, Aggarwal A, Hemanth DJ. Convolutional neural network based Alzheimer’s disease classification from magnetic resonance brain images. Cognit Syst Res. 2019;57:147–59.
Mirabnahrazam G, Ma D, Lee S, Popuri K, Lee H, Cao J, Wang L, Galvin JE, Beg MF, Initiative ADN, et al. Machine learning based multimodal neuroimaging genomics dementia score for predicting future conversion to Alzheimer’s disease. J Alzheimer’s Dis. 2022;(Preprint):1–21.
LeCun Y. Generalization and network design strategies. Connect Perspect. 1989;19:143–55.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge: MIT Press; 2016.
Kusy M, Kowalski PA. Weighted probabilistic neural network. Inf Sci. 2018;430:65–76.
Zhang C, Ma Y. Ensemble machine learning: methods and applications. Berlin: Springer; 2012.
Shen D, Wu G, Suk HI. Deep learning in medical image analysis. Annu Rev Biomed Eng. 2017;19:221–48.
Ning K, Chen B, Sun F, Hobel Z, Zhao L, Matloff W, Toga AW, Initiative ADN. Classifying Alzheimer’s disease with brain imaging and genetic data using a neural network framework. Neurobiol Aging. 2018;68:151–8.
Batmanghelich NK, Dalca A, Quon G, Sabuncu M, Golland P. Probabilistic modeling of imaging, genetics and diagnosis. IEEE Trans Med Imaging. 2016;35(7):1765–79.
Dale AM, Fischl B, Sereno MI. Cortical surfacebased analysis: I. Segmentation and surface reconstruction. Neuroimage. 1999;9(2):179–94.
Fischl B, Sereno MI, Dale AM. Cortical surfacebased analysis: II: inflation, flattening, and a surfacebased coordinate system. Neuroimage. 1999;9(2):195–207.
Shen L, Kim S, Risacher SL, Nho K, Swaminathan S, West JD, Foroud T, Pankratz N, Moore JH, Sloan CD. Whole genome association study of brainwide imaging phenotypes for identifying quantitative trait loci in MCI and AD: a study of the ADNI cohort. Neuroimage. 2010;53(3):1051–63.
Szefer E, Lu D, Nathoo F, Beg MF, Graham J, Initiative ADN. Multivariate association between single-nucleotide polymorphisms in Alzgene linkage regions and structural changes in the brain: discovery, refinement and validation. Stat Appl Genet Mol Biol. 2017;16(5–6):367–86.
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4(1):13742–015.
Fischl B. FreeSurfer. Neuroimage. 2012;62(2):774–81.
Desikan RS, Ségonne F, Fischl B, Quinn BT, Dickerson BC, Blacker D, Buckner RL, Dale AM, Maguire RP, Hyman BT. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage. 2006;31(3):968–80.
Klein A, Tourville J. 101 labeled brain images and a consistent human cortical labeling protocol. Front Neurosci. 2012;6:171.
Fischl B, Salat DH, Busa E, Albert M, Dieterich M, Haselgrove C, Van Der Kouwe A, Killiany R, Kennedy D, Klaveness S. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron. 2002;33(3):341–55.
McKinney W. Data structures for statistical computing in Python. In: van der Walt S, Millman J, editors. Proceedings of the 9th Python in Science Conference; 2010. p. 56–61. https://doi.org/10.25080/Majora-92bf1922-00a
Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, editors. Advances in Neural Information Processing Systems (vol. 32). New York: Curran Associates Inc.; 2019. pp. 8024–8035
Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12(7):2121.
Kyung M, Gill J, Ghosh M, Casella G. Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 2010;5:369–411.
Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(11).
Hinton GE, Roweis S. Stochastic neighbor embedding. Adv Neural Inf Process Syst 2002;15
Yin J, Li H. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann Appl Stat. 2011;5(4):2630.
Li Y, Nan B, Zhu J. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics. 2015;71(2):354–63.
Zhao BW, Wang L, Hu PW, Wong L, Su XR, Wang BQ, You ZH, Hu L. Fusing higher and lower-order biological information for drug repositioning via graph representation learning. IEEE Trans Emerging Topics Comput. 2023. https://doi.org/10.1109/TETC.2023.3239949.
Acknowledgements
The authors would like to acknowledge the financial support of the Canadian Statistical Sciences Institute (CANSSI), the Alzheimer Society Research Program (ASRP), the National Institutes of Health (NIH) and the Natural Sciences and Engineering Research Council of Canada (NSERC). Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (http://www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Funding
The funding bodies have no specific roles. Cédric Beaulac is funded by the Canadian Statistical Sciences Institute (CANSSI). Michelle F. Miranda is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). Jiguo Cao holds a Tier II Canada Research Chair in Data Science. Farouk S. Nathoo holds a Tier II Canada Research Chair in Biostatistics for Spatial and High-Dimensional Data. Mirza Faisal Beg and Erin Gibson are funded by the Alzheimer Society Research Program (ASRP) and the National Institutes of Health (NIH).
Author information
Authors and Affiliations
Contributions
CB: Conceptualization, Methodology, Software, Formal analysis, Writing—Original Draft, Writing—Review & Editing, Visualization. SW: Software, Formal analysis, Validation, Writing—Original Draft, Writing—Review & Editing. EG: Software, Data Curation, Writing—Review & Editing. MFM: Conceptualization, Writing—Review & Editing, Supervision. JC: Conceptualization, Writing—Review & Editing, Supervision. LR: Software, Formal analysis, Writing—Original Draft. MFB: Resources, Funding acquisition. FSN: Conceptualization, Methodology, Software, Writing—Original Draft, Writing—Review & Editing, Supervision, Funding acquisition.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Ethics approval and consent to participate were obtained by ADNI.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Beaulac, C., Wu, S., Gibson, E. et al. Neuroimaging feature extraction using a neural network classifier for imaging genetics. BMC Bioinformatics 24, 271 (2023). https://doi.org/10.1186/s12859-023-05394-x