Feature selection for fMRI-based deception detection
© Jin et al. 2009
Published: 17 September 2009
Functional magnetic resonance imaging (fMRI) is a technology used to detect brain activity. Patterns of brain activation have been utilized as biomarkers for various neuropsychiatric applications. Detecting deception based on the pattern of brain activation characterized with fMRI has attracted growing attention, and machine learning algorithms have been applied to this field in recent years. The high dimensionality of fMRI data makes it difficult to use the original data directly as input for classification algorithms in detecting deception. In this paper, we investigated feature selection procedures to enhance fMRI-based deception detection.
We used the t-statistic map derived from the statistical parametric mapping analysis of fMRI signals to construct features that reflect brain activation patterns. We subsequently investigated various feature selection methods, including an ensemble method, to identify discriminative features for detecting deception. Using 124 features selected from a set of 65,166 original features as inputs for a support vector machine classifier, our results indicate that feature selection significantly enhanced the classification accuracy of the support vector machine compared with models trained on all features and with dimension-reduction-based models. Furthermore, the selected features are shown to form anatomic clusters within brain regions, which supports the hypothesis that specific brain regions may play a role during deception processes.
Feature selection not only enhances classification accuracy in fMRI-based deception detection but also provides support for the biological hypothesis that brain activities in certain regions of the brain are important for discrimination of deception.
Blood oxygen level dependent (BOLD) fMRI has been widely used to detect brain activity. Recently, brain activation patterns detected by fMRI technology have been used as biomarkers to detect deception (lie detection) [2–10]. Existing rule-based classification methods for deception detection have already achieved excellent performance [5, 6, 9, 10]. In these studies, regions of interest were first identified by conventional group-wise statistical comparisons of brain activation patterns obtained during lying and truth-telling sessions. Then, the activation patterns within the regions were used as input features to derive various classification rules. Machine learning algorithms have also been applied to perform deception detection based on fMRI data. The task is commonly cast as a classification problem, in which pre-processed BOLD signals in the voxels of fMRI images are treated as input features and each input image is associated with a class label. Using fMRI image data as input for classification poses a major challenge due to the high dimensionality of fMRI images. The images usually consist of hundreds of thousands of voxels, which causes most contemporary classification algorithms to suffer from overfitting, a phenomenon whereby a classifier performs well on training data but fails on new data. In this study, we investigated the utility of different dimension reduction and feature selection procedures in fMRI-based deception detection. While this study concentrates on employing fMRI-based biomarkers for deception detection, the principles developed here can be applied to other fMRI-based biomarkers for translational research, e.g., psychiatric disease diagnosis and prognosis.
When dealing with high dimensional data, two types of methods can be applied: dimension reduction and feature selection. We first investigated a few models that perform high dimensional classification without selecting features. Contemporary state-of-the-art classifiers, particularly partial least squares (PLS), random forest (RF) and support vector machine (SVM), have achieved excellent performance in various high-dimensional classification tasks, e.g., text categorization and microarray-based disease classification [14–16]. The PLS algorithm deals with high dimensional data by reducing the dimensionality of the data through a linear projection from a high dimensional space to a low dimensional principal component space, with a constraint of maintaining the class separation in the low dimensional space. RF handles high dimensionality by averaging the output from a large number of classification trees, each trained with a relatively small number of features. SVM employs kernel methods to project training data into a high dimensional space to make the data points separable, and reduces overfitting by maximizing the margin that separates the data from different classes. Since each classification task is unique, we first investigated how well these classifiers handled the high dimensional data and whether they performed well in fMRI-based deception detection. The PLS and RF classifiers ran extremely slowly on the data with 65,166 features, which would render them impractical in real-world applications. We then tested the methods on the data consisting of 1,070 voxels selected in our previous study as features. We applied leave-one-out cross validation to estimate the accuracy, sensitivity and specificity. Briefly, the performances of the above classifiers were not satisfactory in terms of overall accuracy. In PLS classification, models with different numbers of principal components (5, 10 and 15) were trained. All models showed accuracy below 60%.
In the case of the RF classifier, models were trained with 2,000 trees, and the number of features for each tree was set to 5, 10, 50 or 100. The accuracies of all trained models were below 60%, which was deemed unsatisfactory. Lastly, the accuracy of SVM with the leave-one-out evaluation was 55.2%. The above results indicate that fMRI-based classification is a unique task, in which directly applying conventional classification algorithms in an out-of-the-box manner does not perform well. The results also indicate that dimension reduction using PLS and RF failed to provide practically acceptable performance.
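The leave-one-out evaluation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the nearest-centroid classifier is a hypothetical stand-in for PLS, RF or SVM, and the function names and the convention that label 1 marks the positive (lie) class are assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Predictor = Callable[[Vector], int]


def fit_nearest_centroid(X: List[Vector], y: List[int]) -> Predictor:
    """Hypothetical stand-in classifier: predict the class whose mean is closest."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x: Vector) -> int:
        def sq_dist(c: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

    return predict


def leave_one_out(
    X: List[Vector],
    y: List[int],
    fit: Callable[[List[Vector], List[int]], Predictor],
    positive: int = 1,
) -> Tuple[float, float, float]:
    """Return (accuracy, sensitivity, specificity) from leave-one-out CV."""
    tp = tn = fp = fn = 0
    for i in range(len(X)):
        # Train on every sample except the i-th, then test on the i-th.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        pred = fit(X_train, y_train)(X[i])
        if y[i] == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    accuracy = (tp + tn) / len(X)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity
```

Any classifier can be plugged in through the `fit` argument, provided it returns a function that maps one feature vector to a predicted label.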
Another approach to reduce the dimensionality of the fMRI data is to sample the BOLD signals using a large voxel size. In a recent report on fMRI-based deception detection by Davatzikos et al., the dimensionality of the fMRI image was reduced by sampling the BOLD signals using 560 large voxels of size 16 × 16 × 16 mm instead of the commonly used voxel size of 3 × 3 × 3 mm. Then, the BOLD signal was modeled using a general linear model to produce a β-value for each voxel (see Methods section). Using values from the voxels as input features, their in-house-developed SVM model achieved an outstanding accuracy in the study. Adopting a similar strategy, we tested the performance of SVM on the re-sampled BOLD signals from the subjects of this study, and the overall SVM classification accuracy was less than 60%. The discrepancy between our results and those of Davatzikos et al. may be explained by differences in the subject populations and in the parameterization of the SVM models. While the resampling approach alleviates the difficulty of high dimensionality, large voxels (~100 times the size of conventional voxels) may result in the loss of information regarding the anatomic structures involved in the cognitive process of deception. This prompted us to identify the most discriminative features through feature selection rather than through dimension reduction.
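To make the resampling idea concrete, the following sketch reduces a 3D volume by averaging non-overlapping cubic blocks. It assumes that simple block averaging is an acceptable approximation of sampling with larger voxels; the text does not specify the exact resampling procedure, and the function name and nested-list volume layout are hypothetical.

```python
from typing import List

Volume = List[List[List[float]]]


def block_average(volume: Volume, factor: int) -> Volume:
    """Downsample a 3D volume by averaging factor x factor x factor blocks.

    Blocks at the edge of the volume may be smaller when a dimension is not
    a multiple of `factor`; they are averaged over the voxels they contain.
    """
    nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])
    out: Volume = []
    for x0 in range(0, nx, factor):
        plane = []
        for y0 in range(0, ny, factor):
            row = []
            for z0 in range(0, nz, factor):
                block = [
                    volume[x][y][z]
                    for x in range(x0, min(x0 + factor, nx))
                    for y in range(y0, min(y0 + factor, ny))
                    for z in range(z0, min(z0 + factor, nz))
                ]
                row.append(sum(block) / len(block))
            plane.append(row)
        out.append(plane)
    return out
```

Each output value summarizes a whole block, which illustrates why fine anatomic detail can be lost as the block size grows.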
Methods for feature selection are mainly grouped into two categories: the filter approach and the wrapper approach [17, 18]. In the filter approach, feature selection is based only on predefined relevance measures and is independent of the classification performance of any specific classifier. In the wrapper approach, by contrast, candidate feature subsets are evaluated by the performance of a specific classifier trained on them. We investigated five feature selection approaches: two filter methods, Fisher criterion score (FCS) and Relief-F; two wrapper methods, GAR2W2 and GAJH; and an ensemble method. Both wrapper methods use SVM as the classifier and the genetic algorithm (GA) to search for the 'fittest' feature subset. The ensemble method is designed based on FCS, Relief-F, GAR2W2 and GAJH (see Methods section). We applied these five methods to the data with 65,166 features to identify the discriminative features.
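As an illustration of the filter approach, the sketch below ranks features by one common form of the Fisher criterion score, the squared difference of the class means divided by the sum of the class variances. The exact FCS definition used in the study is not reproduced in this text, so this form, the 0/1 label convention and the function names are assumptions.

```python
from statistics import mean, pvariance
from typing import List, Sequence


def fisher_scores(X: List[Sequence[float]], y: List[int]) -> List[float]:
    """Score each feature as (mean1 - mean0)^2 / (var1 + var0) for labels 0/1."""
    pos = [row for row, lab in zip(X, y) if lab == 1]
    neg = [row for row, lab in zip(X, y) if lab == 0]
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row in pos]
        b = [row[j] for row in neg]
        num = (mean(a) - mean(b)) ** 2
        denom = pvariance(a) + pvariance(b)
        if denom > 0:
            scores.append(num / denom)
        else:
            # Zero variance in both classes: perfectly separating or useless.
            scores.append(float("inf") if num > 0 else 0.0)
    return scores


def top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]
```

Because the score is computed per feature without training a classifier, it scales linearly with the number of features, which is what makes filter methods practical at tens of thousands of voxels.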
The anatomic overlap of the selected features with the brain regions from the previous studies indicates that the feature selection procedure provides another means to identify the brain regions that may be involved in the psychological process of deception. However, it should be noted that the method used in this study for identifying the brain regions differs significantly from the previous ones. In the previous analysis, all fMRI images associated with one type of event, e.g., truth or lie, were pooled from all subjects, followed by group-wise comparison to identify the regions with differential brain activities. Therefore, the procedure was not classification oriented, and the features from this procedure resulted in relatively poor classification performance using our methods. In contrast, the feature selection approach in this study combines the use of t-maps to remove background noise with the selection of features that are relevant to the classification. Therefore, it is not surprising that the features from this task-oriented procedure significantly enhanced classification performance.
In this study, we investigated the utility of feature selection in fMRI-based detection of deception. The high dimensionality of fMRI data makes it difficult to use the original data directly as input for classification algorithms. Our results indicate that feature selection not only enhances classification accuracy in fMRI-based detection of deception when compared to the models that rely on dimension reduction, but also provides support for the biological hypothesis that brain activities in certain regions of the brain are important for the discrimination of deception. While these conclusions were obtained in the setting of detection of deception, the general approach of feature selection is applicable to the identification of other fMRI-based biomarkers.
The data are from a previously published study of fMRI detection of deception. Briefly, a test participant was instructed to take a ring or a watch before the fMRI scan and was asked to lie regarding which object he/she took. Then, fMRI scans were acquired while the subject responded to the visually presented questions. Four types of questions were asked: neutral, truth, lie and control. All images associated with one type of event from a subject were grouped and modeled with a general linear model using the statistical parametric mapping (SPM2) software package. For each subject, the procedure produced 4 parametric maps, referred to as β-maps, corresponding to the 4 types of questions. Each voxel in these maps contained the estimated parameter of the general linear model, reflecting the influence of the event on the BOLD signal. Pooling the data from 61 subjects led to a data set consisting of 61 β-maps labeled as the truth class and 61 maps labeled as the lie class. In order to normalize the influence of the events, the β-maps corresponding to the lie and truth events were further compared to that of the neutral event to produce truth-vs-neutral and lie-vs-neutral t-statistic maps, referred to as t-maps. A standard gray-matter map was applied to the β-maps and t-maps, such that the values of 65,166 voxels corresponding to brain gray matter were retained and used as input features for classification. Through a series of tests, we found that the classification performances of all classifiers were consistently better using t-maps rather than β-maps (data not shown) as input. Therefore, the following results reflect using t-maps as input features.
For comparison with classification using feature selection, partial least squares (PLS), random forest (RF) and support vector machine (SVM) were used to perform high-dimensional classification without feature selection. We used LIBSVM, a library for SVM, and the PLS and RF packages implemented in the R language, downloaded from the Comprehensive R Archive Network (CRAN).
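The step of applying a gray-matter map and assembling the labeled data set can be sketched as below. This is an assumption-laden illustration: the maps are represented as nested lists, the mask as booleans, and the function names and the 0 = truth, 1 = lie labeling are hypothetical rather than taken from the study.

```python
from typing import List, Tuple

Volume = List[List[List[float]]]
Mask = List[List[List[bool]]]


def mask_to_features(volume: Volume, mask: Mask) -> List[float]:
    """Keep only the voxels flagged by the mask, in a fixed scan order."""
    return [
        volume[x][y][z]
        for x in range(len(mask))
        for y in range(len(mask[0]))
        for z in range(len(mask[0][0]))
        if mask[x][y][z]
    ]


def build_dataset(
    truth_maps: List[Volume], lie_maps: List[Volume], mask: Mask
) -> Tuple[List[List[float]], List[int]]:
    """Stack masked truth maps (label 0) and lie maps (label 1) into X, y."""
    X = [mask_to_features(v, mask) for v in truth_maps]
    X += [mask_to_features(v, mask) for v in lie_maps]
    y = [0] * len(truth_maps) + [1] * len(lie_maps)
    return X, y
```

With 61 truth maps, 61 lie maps and a mask retaining 65,166 voxels, `X` would hold 122 rows of 65,166 features each.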
On the other hand, Relief-F uses a weighting approach to rank features. For a two-class task, Relief-F repeatedly draws a random instance from the data set. Then, the k nearest neighbors of the instance are selected from the same class and from the opposite class, leading to two sets of cases. Iterating through each feature, the weight associated with the feature is adjusted. The weighting scheme strives to minimize the average distance, evaluated on the feature of interest, from the instance to its neighbors of the same class while maximizing the average distance to its neighbors of the other class.
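The weighting scheme just described can be sketched as a simplified two-class Relief-F. The details below are assumptions chosen for illustration (range-normalized absolute differences, a Manhattan distance for finding neighbors, a fixed random seed, and the parameter names), not the configuration used in the study.

```python
import random
from typing import List, Sequence


def relief_f(
    X: List[Sequence[float]],
    y: List[int],
    k: int = 3,
    n_iter: int = 50,
    seed: int = 0,
) -> List[float]:
    """Return one weight per feature; larger means more discriminative."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    # Normalize differences by each feature's range so weights are comparable.
    spans = []
    for j in range(n_feat):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(j: int, a: Sequence[float], b: Sequence[float]) -> float:
        return abs(a[j] - b[j]) / spans[j]

    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(diff(j, a, b) for j in range(n_feat))

    weights = [0.0] * n_feat
    for _ in range(n_iter):
        i = rng.randrange(len(X))  # draw a random instance
        others = sorted(
            (m for m in range(len(X)) if m != i),
            key=lambda m: dist(X[i], X[m]),
        )
        hits = [m for m in others if y[m] == y[i]][:k]     # same class
        misses = [m for m in others if y[m] != y[i]][:k]   # opposite class
        for j in range(n_feat):
            if hits:  # penalize distance to same-class neighbors
                avg_hit = sum(diff(j, X[i], X[m]) for m in hits) / len(hits)
                weights[j] -= avg_hit / n_iter
            if misses:  # reward distance to opposite-class neighbors
                avg_miss = sum(diff(j, X[i], X[m]) for m in misses) / len(misses)
                weights[j] += avg_miss / n_iter
    return weights
```

A feature that separates the classes accumulates a positive weight, while a feature that varies as much within a class as between classes drifts toward zero or below.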
where x_i and x_j are training samples, C is the regularization parameter that trades off margin maximization against error minimization, K is a kernel function, n is the size of the training data set, and α = (α_1, ⋯, α_n).
where β = (β_1, ⋯, β_n).
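For reference, the symbols C, K, n, α and β defined above match the textbook soft-margin SVM dual problem and the enclosing-sphere radius problem used in radius-margin (R²W²) bounds. The block below writes out those standard forms as an assumed reading, not as a verbatim restatement of the equations in the study; x_i denotes a training sample and y_i ∈ {−1, +1} its class label, which is notation introduced here.

```latex
% Standard soft-margin SVM dual (assumed form; y_i is notation added here)
\begin{align}
\max_{\alpha}\quad & \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j) \\
\text{subject to}\quad & \sum_{i=1}^{n} \alpha_i y_i = 0,
  \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, n .
\end{align}

% Squared radius of the smallest sphere enclosing the data in feature space
\begin{align}
R^2 = \max_{\beta}\quad & \sum_{i=1}^{n} \beta_i \, K(x_i, x_i)
  - \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \beta_j \, K(x_i, x_j) \\
\text{subject to}\quad & \sum_{i=1}^{n} \beta_i = 1,
  \qquad \beta_i \ge 0, \quad i = 1, \dots, n .
\end{align}
```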
The research was partially supported under a contract from Cephos Corporation, by NIH grant 1 R01 LM009153-01A1 to XL, and by a grant from the Defense Agency for Credibility Assessment (formerly the Department of Defense Polygraph Institute) to FAK (W74V8H-04-1-0010). The views expressed in this article are those of the authors and do not necessarily reflect the official policy or position of the Department of Defense, the National Institutes of Health (NIH), or the U.S. Government.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 9, 2009: Proceedings of the 2009 AMIA Summit on Translational Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S9.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.