A stack LSTM structure for decoding continuous force from local field potential signal of primary motor cortex (M1)

Background: Brain-Computer Interfaces (BCIs) translate the activity of the nervous system into a control signal that is interpretable by an external device. Using continuous motor BCIs, the user is able to control a robotic arm or a disabled limb continuously. In addition to decoding the target position, accurate decoding of force amplitude is essential for designing BCI systems capable of performing fine movements such as grasping. In this study, we propose a stack Long Short-Term Memory (LSTM) neural network which is able to accurately predict the amplitude of the force applied by three freely moving rats from their Local Field Potential (LFP) signals. Results: The performance of the network was compared with the Partial Least Squares (PLS) method. The average coefficient of correlation (r) over the three rats was 0.67 for PLS and 0.73 for the LSTM-based network, and the coefficients of determination (R2) were 0.45 and 0.54 for the PLS and LSTM-based networks, respectively. The network was able to accurately decode the force values without explicitly using time lags in the input features. Additionally, the proposed method predicted zero-force values very accurately owing to its output nonlinearity. Conclusion: The proposed stack LSTM structure was able to predict the applied force from the LFP signal accurately. In addition to higher accuracy, these results were achieved without explicitly using time lags in the input features, which can lead to more accurate and faster BCI systems.

automatically used for predicting the current target value. Recurrent neural networks have the aforementioned characteristics.
Recurrent neural networks (RNNs) can learn the intrinsic dynamics of data. However, due to the vanishing gradient problem, information from past samples cannot easily reach the current sample [21]. To solve this problem, new structures, including the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit), were introduced [22]. Long Short-Term Memory networks have an extra path for transferring relevant information from previous samples to more recent ones. Unlike classical recurrent neural networks, the LSTM is more robust to vanishing gradients and has been considerably successful in natural language processing and time-series prediction. These networks are therefore expected to also be able to learn relevant information in neural data. For instance, Belo et al. [23] used a GRU-based structure for synthesizing multiple biological signals, including the Electrocardiogram (ECG) and Electromyogram (EMG), for denoising, classifying and generating (reproducing) EMG and ECG signals. In another case, an LSTM network was used to predict hand kinematics [24]. Unfortunately, unlike natural language processing, BCI-related datasets contain few data samples. Therefore, LSTMs are prone to over-fitting on the training data. Besides regularization, dropout layers have proved useful in controlling over-fitting [25]. In this study, an LSTM-based neural network is used to decode the force amplitude from the spectral features of the LFP data without directly using time lags in the features. The results are compared with the Partial Least Squares (PLS) method.

Results
The network is evaluated using seven-fold cross-validation. In this method, the data is divided into 7 partitions; each time, 6 partitions are used for training and the remaining partition for testing. This process is repeated 7 times, and the final performance is the average performance over all 7 folds. The same training and test data are used in the PLS evaluation. The hyper-parameters of the LSTM network and the number of components for PLS were optimized for each fold as explained in the Methods section.
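As an illustration (not the authors' code), the fold scheme described above can be sketched with scikit-learn's `KFold`; the trial count is taken from the rat 1 dataset described in the Methods:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for the trial-level data (74 trials for rat 1)
n_trials = 74
trials = np.arange(n_trials)

kf = KFold(n_splits=7)  # 6 partitions train, 1 partition test per fold
splits = list(kf.split(trials))

# Every trial appears exactly once as test data across the 7 folds
test_all = np.sort(np.concatenate([test for _, test in splits]))
print(np.array_equal(test_all, trials))  # True
```

Per-fold metrics would then be averaged over the 7 splits to give the reported performance.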

Prediction accuracy
The correlation coefficient (r) and coefficient of determination (R2) are reported for each rat and every fold in Tables 1 and 2, respectively. In both tables, the highest values are shown in italics. In terms of correlation, the mean values over the 7 folds for rat 1, rat 2 and rat 3 are 0.7 ± 0.05, 0.6 ± 0.06 and 0.71 ± 0.02 for PLS with 10 time lags, and 0.74 ± 0.05, 0.70 ± 0.05 and 0.75 ± 0.03 for the LSTM network, respectively. The statistical significance of the results was tested using the Wilcoxon signed-rank test (p < 0.01). In terms of the coefficient of determination (R2), the mean values over the 7 folds for rat 1, rat 2 and rat 3 are 0.44 ± 0.06, 0.43 ± 0.1 and 0.49 ± 0.05 for PLS, and 0.52 ± 0.08, 0.49 ± 0.09 and 0.54 ± 0.06 for the LSTM network, respectively, with statistical significance p < 0.05 (Wilcoxon signed-rank test).
These results indicate that the LSTM network was able to predict the force values more accurately. For additional evaluation of prediction accuracy, we also evaluated the network with 7-times-repeated 7-fold cross-validation, and the significance of the results was evaluated for each rat (see Additional file 1).
For better visualization of the predictions, the three predictions with the highest (R2) values are plotted in Fig. 1, comparing the force values predicted by LSTM and PLS with the observed force values. The blue line indicates the true force value; the red and green lines are the predictions made by PLS and LSTM, respectively. For each rat, the fold with the highest decoding accuracy is plotted. As can be seen in Fig. 1, LSTM was able to follow the true force value more accurately.

CAR filtering
In order to evaluate the effect of the CAR filter, the force values were predicted once with the CAR filter applied to the LFP signal and once without it. The effect of the CAR filter was investigated for both PLS and the LSTM network and for all rats. The results are shown in Tables 3 and 4. They show that using the CAR filter improves the prediction accuracy of both PLS and the LSTM network. This improvement was significant for both PLS and LSTM (p < 0.05, Wilcoxon signed-rank test, for all rats and all folds combined).
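The CAR filter referred to here is a one-line operation; a minimal sketch, assuming the LFP is arranged as (time samples × channels):

```python
import numpy as np

def car_filter(lfp):
    """Common Average Reference: subtract the across-channel mean
    from every channel at each time sample.

    lfp: array of shape (n_samples, n_channels)
    """
    return lfp - lfp.mean(axis=1, keepdims=True)

# Example: one trial of 3000 time samples over 16 channels
rng = np.random.default_rng(0)
lfp = rng.standard_normal((3000, 16))
clean = car_filter(lfp)
# After CAR, the mean across channels is ~0 at every time sample
print(np.allclose(clean.mean(axis=1), 0.0))  # True
```

This removes any component common to all channels, under the assumption stated in the preprocessing section that the noise is shared across electrodes.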

Contribution of time lags in PLS prediction
In order to investigate the effect of time lags on the PLS prediction, PLS was evaluated with different numbers of time lags. The best results were obtained with 10 time lags. In Fig. 2, the coefficient of determination (R2) for PLS predictions with different numbers of time lags is shown.

Contribution of frequency bands
The contribution of each frequency band is calculated according to Eq. (16). The mean contribution values over all 7 folds are illustrated in Fig. 3. The standard error values were negligible and are therefore not displayed. Figure 3 shows that the higher frequency bands contribute more to predicting the force values.

Alternative recurrent cells
For all the above analyses, we used LSTM cells as the main recurrent block in the network. However, other recurrent cells, such as the Simple RNN and the GRU, are also able to remember relevant information across time samples. To compare these recurrent cells, we trained and evaluated the same network structure with three types of recurrent cell, i.e., Simple RNN, GRU, and LSTM. The LSTM-based network showed a higher coefficient of determination (R2) for all three rats (Fig. 4). The mean R2 value over all rats for the LSTM-based network was significantly higher than for the GRU-based (p < 0.05) and Simple RNN-based (p < 0.01) networks (Wilcoxon signed-rank test over the 7 folds and three rats).

Alternative regression models
We used PLS as the main regression method for comparison with the proposed LSTM-based neural network. Additionally, we compared the PLS results with Support Vector Regression (SVR), from the family of kernel methods, and Random Forest with bootstrapping, from the family of ensemble learning methods. Figure 5 shows the coefficient of determination (R2) for the force values predicted by Random Forest, SVR and PLS regression, respectively. The feature extraction process, number of time lags, and validation were identical for all the methods. PLS regression had a higher R2 value compared to the Random Forest (p < 0.01) and SVR (p < 0.01) methods (Wilcoxon signed-rank test over three rats and 7 folds).

Time lag features
Using multiple time-lag features is a common practice in decoding neural data. However, these time lags increase the dimensionality of the feature space dramatically. For instance, in this study, 96 features were extracted from 6 frequency bands and 16 channels. The dimensionality increases to 960 with only 10 time lags. This increment drastically increases the possibility of over-fitting. LSTM-based networks, owing to their intrinsic ability to carry relevant information forward from previous samples, can recall useful information from previous LFP samples to predict the current force observation without time lags being provided directly in the features. Therefore, the LSTM network is less prone to over-fitting caused by a large number of features, and there is no need to optimize the number of lags in the initial feature extraction process. On the other hand, as can be seen in Fig. 2, the decoding performance of PLS decreases as the number of time lags is reduced.

Predicted force values
As can be seen in Fig. 1, the LSTM network can predict zero-force values, whereas the PLS prediction fluctuates around zero. PLS and other linear methods can only generate output values that are a linear combination of the predictors; therefore, the nonlinear characteristics of the system are always approximated by the closest linear model. Neural networks, in this case the LSTM network, have nonlinear components (11) in their structure which can model the nonlinearity in the system. Furthermore, the activation function of the output layer can be selected in a way that improves decoding performance. For instance, in this study, the Rectified Linear Unit (ReLU) was deliberately chosen as the activation function of the last layer in the LSTM network. ReLU maps all negative values to zero and acts as the identity for positive values, which makes it easier for the network to predict zero-force values.

Contribution of each frequency band
Figure 3 shows that the higher frequency bands, β (12-30 Hz), γ (30-120 Hz) and high-γ (120-200 Hz), contributed more, for rat 2 and rat 3, to predicting the force values in the LSTM network. This significance of higher frequency bands in neural decoding was also observed in the previous study on the same data set [12].

Kashefi and Daliri BMC Bioinformatics (2021) 22:26

Alternative recurrent cells
We compared the proposed network structure with different recurrent cells. The LSTM cell showed the highest decoding accuracy (Fig. 4). Both GRU and LSTM are gated structures and use a gating mechanism to remember (forget) relevant (irrelevant) information. However, the higher accuracy of the LSTM is probably because it leverages an extra path for carrying information across time samples. The Simple RNN cell, as expected, had the lowest decoding performance because of the well-known problem of the gradient vanishing through time.

Alternative regression methods
In addition to PLS, we examined the decoding performance of two other regression methods from different families of regression techniques. Both SVR and Random Forest showed lower decoding performance compared to PLS (Fig. 5). The underperformance of SVR and Random Forest, we believe, is due to the large number of features and, consequently, over-fitting on the training data set. PLS, on the other hand, has an intrinsic mechanism for coping with a large number of features without requiring a separate feature selection or dimensionality reduction step.

Conclusion
In this study, we introduced an LSTM network which can learn both the nonlinearity and the intrinsic dynamics of the data. The overall results show that there is rich information in the LFP signal for decoding fine movements like force, and the proposed network can predict this continuous movement accurately, which can be used in LFP-based BCI systems.

Behavioral task
In this study, three Wistar rats (average weight between 200 and 300 g) were trained to push a load cell (1 DOF) in order to receive a drop of water from a rotating lever as a reward. The applied pressure between 0 and 0.15 N was linearly mapped to 0-90 degrees of rotation of the lever. The load cell was placed 10 cm above the setup floor, and due to the negligible movement of the load cell, the position and orientation of the forelimbs were stable while performing the task (Fig. 6). If the applied force on the load cell exceeded the 0.15 N threshold, the rat was rewarded after a 1.5 s delay. No start or end cue was defined; therefore, the timing of each trial was spontaneous.

Implantation of micro-arrays
After training, a micro-array was implanted in the primary motor cortex (M1) of each of the three rats. The array was placed contralateral to the dominant hand. All three rats performed the task with their right hands; therefore, the arrays were placed on the left hemisphere. The surgery started with anesthetizing the animal by administering 100 mg/kg ketamine and 10 mg/kg xylazine. The depth of the anesthesia was assessed by toe pinching and monitoring the respiration rate. Then, an incision was made along the midline of the head skin, and all the tissue was removed from the scalp in order to make the skull accessible. Afterward, Bregma, Lambda and the proper craniotomy positions were marked. One screw posterior to the lambda point was placed to connect the ground, and five other screws were placed to secure the area. Then, the forelimb region of M1 was pinpointed using a rat brain atlas. In the next step, the center of the micro-array was implanted 1.6 mm anterior to Bregma, 2.6 mm lateral to the midline, and 1.5 mm deep under the dura mater surface, covering the whole forelimb area. Finally, the area was sealed with dental acrylic. In order to avoid infection and assuage the pain, 0.2 mg/kg meloxicam and 5 mg/kg enrofloxacin were administered for two days after the surgery. More detailed information on the task and surgery can be found in [12].

Ethical considerations
All the rats belonged to the Neuroengineering and Neuroscience Research Laboratory of Iran University of Science and Technology, and the use of the animals in this study was approved and authorized by the local committee. After completion of the study, the rats were euthanized by exposure to CO2, and euthanasia was confirmed by continuing the gas exposure for 20 min after respiratory arrest, according to NIH guidelines. For more information, see the Ethics approval and consent to participate section.

Fig. 6 The experimental setup. a Raw LFP signal of 16 channels for one arbitrary trial. b The force profile recorded simultaneously with the LFP shown in a. c The experimental setup; the force applied to the sensor was linearly mapped to the rotation of a lever. The rat was rewarded if the applied force reached 0.15 N. The neural signal and the applied force were recorded simultaneously

Data recording
Two weeks after the surgery, the rats were placed in the task setup. Then, the neural and force data were recorded simultaneously. The implanted micro-array was connected to the preamplifier of the recording device via the implanted connector. The initial sampling rate was 10 kHz. Spikes were removed by filtering the signal between 300 and 3000 Hz and then manually thresholding each channel. The LFP signal was then extracted by filtering the signal between 0.1 and 500 Hz and downsampling to 1000 Hz. Regarding the force signal, there were negligible components above 5 Hz; therefore, the force signal was low-pass filtered and then downsampled to 10 samples per second. All of the filtering was performed with 4th-order Butterworth filters applied both forward and backward (zero-phase). As mentioned before, the rats were free to perform the task at any time; the interval from 1 s before to 2 s after crossing the 0.15 N threshold was considered as a trial.

Data preprocessing
The final shape of the data for each trial is a matrix of 3000 by 16, representing time samples (three seconds of data at a sampling rate of 1000 samples per second) and channels, respectively. There are 74, 79 and 80 successful trials for rat 1, rat 2 and rat 3, respectively. The first step is to remove the noise by applying a CAR (Common Average Reference) filter to the data [26]. In using CAR, we assume that the noise is a common component existing on all the channels. Therefore, by removing the mean of all channels from each channel, the common noise can be removed. Using CAR has been shown to improve the decoding performance of both methods. In the next step, the signal is decomposed into 6 frequency bands: δ (1-4 Hz), θ (4-8 Hz), α (8-12 Hz), β (12-30 Hz), γ (30-120 Hz) and high-γ (120-200 Hz). Then, the absolute value is calculated, and the signal is smoothed with a 3rd-order Savitzky-Golay filter with a 150-sample window length. Using the Savitzky-Golay filter improves the decoding mainly because it preserves the local minima in the signal. Next, the data is centered and normalized by subtracting the mean and dividing the signal by the standard deviation. Finally, the signal is downsampled in order to equalize the number of force and LFP samples. The dimension of the feature space is now (6 filter bands × 16 channels =) 96 for each sample of data. In the end, for each trial, the neural data and the target force values are matrices of size (30, 96) and (30, 1), respectively. The feature extraction pipeline is summarized in Fig. 7.
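The preprocessing pipeline above can be sketched roughly as follows. This is an illustrative reimplementation, not the authors' code; the band edges and window length follow the text, while details such as the 151-sample window (SciPy's Savitzky-Golay requires an odd length) and z-scoring before downsampling are assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, savgol_filter

FS = 1000  # LFP sampling rate after downsampling (Hz)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 120), "high_gamma": (120, 200)}

def band_power_features(lfp, fs=FS, target_rate=10):
    """lfp: (n_samples, n_channels) CAR-filtered trial.
    Returns (n_samples // (fs // target_rate), n_bands * n_channels)."""
    feats = []
    for lo, hi in BANDS.values():
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        x = sosfiltfilt(sos, lfp, axis=0)         # zero-phase band-pass
        x = np.abs(x)                             # amplitude envelope
        x = savgol_filter(x, 151, 3, axis=0)      # smooth, ~150-sample window
        x = (x - x.mean(axis=0)) / x.std(axis=0)  # center and normalize
        feats.append(x[::fs // target_rate])      # downsample to 10 Hz
    return np.concatenate(feats, axis=1)

trial = np.random.default_rng(0).standard_normal((3000, 16))
F = band_power_features(trial)
print(F.shape)  # (30, 96)
```

The (30, 96) output matches the per-trial feature matrix described in the text.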

PLS-exclusive preprocessing step
In the PLS, SVR, and Random Forest algorithms, for predicting the current time sample, the features from previous time samples are included in addition to the features of the current time sample. In this study, 10 sample time lags were included in the prediction of each time sample. Therefore, the dimensionality of the data increases to (10 lags × 6 filter bands × 16 channels =) 960.
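A hedged sketch of this lag-stacking step; zero-padding at the start of the trial is an assumption, since the text does not specify how the first samples are handled:

```python
import numpy as np

def lagged_features(X, n_steps=10):
    """Concatenate each sample with its previous time steps so that
    row t holds [x[t], x[t-1], ..., x[t-n_steps+1]].
    Lags falling before the trial start are zero-padded.
    X: (n_samples, n_features) -> (n_samples, n_steps * n_features)
    """
    n, m = X.shape
    out = np.zeros((n, n_steps * m))
    for k in range(n_steps):
        out[k:, k * m:(k + 1) * m] = X[:n - k]
    return out

X = np.ones((30, 96))       # one trial of band/channel features
Xl = lagged_features(X, 10)
print(Xl.shape)  # (30, 960)
```

The 960-dimensional rows match the lagged feature dimensionality quoted above.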

PLS model
The general model of PLS is shown in (1) and (2), where X and Y are the predictor matrix and the measurement vector, respectively. In this study, X is an (n: number of samples) × (m: number of features) matrix containing the features for all the time samples, and y is an (n × 1) vector of force values. The goal is to predict the y values using X.
In PLS, unlike general linear models such as least squares, the latent variables of X and Y are used instead of working with X and Y directly. In our underlying model, T and U are two n × l matrices which are the scores of X and Y, respectively. P and Q are orthogonal loading matrices of size m × l and p × l, respectively. E and F are two i.i.d. Gaussian random variables.
Generally, PLS tries to explain the latent variable of Y with the most variance, using the latent variable of X that describes it (the latent variable of Y) best. Therefore, the model does not require feature selection and can reduce the possibility of over-fitting caused by a large number of features, making it an ideal choice for neural data.
In the PLS regression problem, the final goal is to find a weight vector β and an intercept β0 which linearly relate the predictors to the measured values, as shown in (3).
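Equations (1)-(3) are not reproduced in this excerpt; a reconstruction consistent with the symbol definitions above is the standard PLS latent-variable model:

```latex
X = T P^{\top} + E \qquad \text{(1)}
\\ Y = U Q^{\top} + F \qquad \text{(2)}
\\ \hat{y} = X\beta + \beta_{0} \qquad \text{(3)}
```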
There are many variants of PLS methods and solution approaches for finding the loading and score matrices. In this study, we used the PLS-1 method, a well-known and widely used method for solving the PLS regression problem in cases where Y is a vector. PLS-1 finds the columns of the loading matrices one by one in a stepwise manner. PLS-1 can be summarized in the following steps: Step 1: Find the initial loading weights by finding the direction in which the covariance between X and y is maximized, and name it w.
Step 2: Find the first score column by projecting X onto w and name it t.
Step 3: Find the first loading vectors of X and y by projecting X and y on normalized score vector t found in step 2 and call them p and q, respectively.
Step 4: Remove all the information of the first score and loading vectors from X and y.
Step 5: Jump to step 1 and use X new and y new to find the next loading and score vectors. This is iterated l times for the desired number of components. Finally, all the loading, score and loading-weight vectors are concatenated. All the q values calculated in step 3 are scalars; therefore, concatenating them yields a vector Q.
Step 6: Calculate the regression weights β and the regression intercept β0 from the matrices calculated in step 5.
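Steps 1-6 can be condensed into a short NIPALS-style PLS-1 sketch; this is an illustrative implementation, not the authors' code, and it omits the intercept by assuming centered data:

```python
import numpy as np

def pls1(X, y, n_components):
    """Minimal PLS-1 (NIPALS-style) sketch following steps 1-6.
    X: (n, m) centered predictors, y: (n,) centered response.
    Returns regression weights beta so that y_hat ~ X @ beta.
    """
    Xk, yk = X.copy(), y.astype(float).copy()
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # step 1: direction maximizing cov(X, y)
        w /= np.linalg.norm(w)
        t = Xk @ w                    # step 2: score vector
        tt = t @ t
        p = Xk.T @ t / tt             # step 3: X loading
        q = yk @ t / tt               # step 3: y loading (scalar)
        Xk = Xk - np.outer(t, p)      # step 4: deflate X
        yk = yk - t * q               # step 4: deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    beta = W @ np.linalg.inv(P.T @ W) @ Q   # step 6: combined weights
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = X @ rng.standard_normal(8)            # noiseless linear target
beta = pls1(X, y, 8)
print(np.allclose(X @ beta, y, atol=1e-5))  # True with full components
```

With fewer components than features, the same routine gives the regularized solutions whose component count is chosen by Wold's criterion below.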
In order to evaluate the performance of the PLS model, we used seven-fold cross-validation. Furthermore, the number of latent variables is selected based on Wold's criterion [27], shown in (10). PRESS represents the prediction error of the model when the first l components are used for prediction.
Baibing et al. showed that using Wold's criterion can improve the performance of the model compared to other approaches for finding the optimal number of components [28]. To find the optimal number of components, we performed ten-fold cross-validation over the training data and considered the number of components optimal when R Wold reached 0.9. The best numbers of components were six, four and five for rat 1, rat 2 and rat 3, respectively.
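A minimal sketch of selecting the number of components with Wold's criterion, assuming Eq. (10) defines R_Wold as the ratio of successive cross-validated PRESS values:

```python
def select_components_wold(press, threshold=0.9):
    """Wold's criterion: R_Wold = PRESS[l+1] / PRESS[l].
    Stop adding components once the ratio reaches the threshold,
    i.e. the next component no longer reduces prediction error enough.
    press: PRESS values indexed by component count (press[0] -> 1 component).
    Returns the selected number of components.
    """
    for l in range(len(press) - 1):
        r_wold = press[l + 1] / press[l]
        if r_wold >= threshold:
            return l + 1
    return len(press)

# Hypothetical PRESS curve: error drops quickly, then flattens
press = [10.0, 6.0, 4.0, 3.6, 3.5, 3.49]
print(select_components_wold(press))  # 3, since 3.6 / 4.0 = 0.9
```

The PRESS values themselves would come from the ten-fold cross-validation over the training data described above.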

LSTM model
Inspired by classical recurrent neural networks, Long Short-Term Memory networks receive the data samples sequentially and use the latest prediction when predicting the next sample of data. Classical RNNs have a feedback loop which feeds the latest output of the network back into its input. This design leads to various problems, such as exploding or vanishing gradients during the training of the network. To address these problems, LSTM networks share an extra parameter, the cell state, between sequence steps, which gives them the ability to remember (forget) important (irrelevant) features of the data in any part of the sequence.
LSTM networks can have various input and output structures. For instance, an LSTM can receive all the input samples and return one output after the whole sequence, or it can yield an output for each input sample. In this work, as can be seen in Fig. 8, the network has one output for each input value in the sequence.
C̃⟨t⟩ = tanh( W_c [ y⟨t−1⟩, x⟨t⟩ ] )  (11)

x and y are one value of a sequence of input and output samples of data. W_c, W_u, W_f and W_o are the carry, update, forget and output weights, respectively, which are learned during the training process, and ⊙ stands for the element-wise product. The LSTM algorithm can be summarized in the following steps. First, using the current input sample and the previous output sample, a candidate carry value, C̃⟨t⟩, is calculated using (11). Then, again using the last output and the current input, the values of the update, forget and output gates, Γ_u, Γ_f and Γ_o, are determined according to (12). These gates can take values between 0 and 1. For instance, in the extreme case in which Γ_u = 1 and Γ_f = 0, the network fully forgets the previous values and replaces the carry with the new candidate carry value according to (13). Then, using the update and forget gates, the final carry value for the current step is calculated. Finally, the estimated output is calculated by the element-wise product of the output gate and the current carry value. In the original model, bias values are included in Eqs. (11) and (12), but in this study the bias values were eliminated owing to their negligible effect on the results, which also reduces the number of trainable parameters of the model. Among the different variants of LSTM structures, we used the vanilla LSTM structure because it has been shown that other structures do not yield a significant performance improvement in various tasks [29]. The summary of the LSTM structure is shown in Fig. 9.
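A plain NumPy sketch of one bias-free LSTM step following Eqs. (11)-(13); the weight shapes and the tanh applied to the carry in the output line are assumptions based on the vanilla LSTM formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, Wc, Wu, Wf, Wo):
    """One bias-free LSTM step.
    x_t: current input, y_prev: previous output, c_prev: previous carry.
    Each W maps the concatenated [y_prev, x_t] to the hidden size.
    """
    z = np.concatenate([y_prev, x_t])
    c_tilde = np.tanh(Wc @ z)        # (11) candidate carry
    gamma_u = sigmoid(Wu @ z)        # (12) update gate
    gamma_f = sigmoid(Wf @ z)        # (12) forget gate
    gamma_o = sigmoid(Wo @ z)        # (12) output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev   # (13) new carry
    y_t = gamma_o * np.tanh(c_t)     # output, gated element-wise
    return y_t, c_t

rng = np.random.default_rng(0)
hidden, n_in = 4, 3
W = [rng.standard_normal((hidden, hidden + n_in)) * 0.1 for _ in range(4)]
y, c = np.zeros(hidden), np.zeros(hidden)
y, c = lstm_step(rng.standard_normal(n_in), y, c, *W)
print(y.shape)  # (4,)
```

Setting Γ_u = 1 and Γ_f = 0 in this code reproduces the extreme case discussed above, where the carry is fully replaced by the candidate value.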

Network structure
The network structure used in this study is illustrated in Fig. 10. The network consists of two LSTM layers that are connected to a single fully connected neuron. The first LSTM layer has 30 units, and over-fitting in this layer is controlled by dropout in both the forward and recurrent paths. Also, this layer has no bias term. The second LSTM layer consists of 15 units, again with both forward and recurrent dropout. The output of the second layer is fully connected to a single neuron with ReLU activation. The weights of the fully connected neuron are regularized with the L2-norm in order to mitigate over-fitting.
The network is optimized using the Adam optimizer [30], with β 1 = 0.9 and β 2 = 0.999 as recommended in [30]. The mean absolute error was selected as the cost function owing to its robustness to noise. The number of training epochs, learning rate, dropout rates, batch size, and regularization value of the fully connected layer are the hyper-parameters of the network. Seven-fold cross-validation is used to evaluate the performance of the network and the PLS method. We trained the neural network on individual trials, while for PLS and the other methods, the training trials were concatenated. In each fold, 20% of the training data is used for validation and for selecting the optimum values of the hyper-parameters. We used a Bayesian optimization toolbox, Hyperopt [31], which selected from the table of hyper-parameters (Table 5) the combination of values that showed the best performance on the validation data.
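The described structure can be sketched in Keras; the dropout rates, L2 value and learning rate below are placeholders for the hyper-parameters that were tuned per fold with Hyperopt:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Placeholder hyper-parameters; in the study these were tuned per fold.
model = keras.Sequential([
    keras.Input(shape=(30, 96)),           # 30 time steps, 96 band/channel features
    layers.LSTM(30, return_sequences=True, use_bias=False,
                dropout=0.2, recurrent_dropout=0.2),
    layers.LSTM(15, return_sequences=True,
                dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(1, activation="relu",      # ReLU clamps negative forces to zero
                 kernel_regularizer=regularizers.l2(1e-3)),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3,
                                              beta_1=0.9, beta_2=0.999),
              loss="mean_absolute_error")

out = model.predict(np.zeros((2, 30, 96)), verbose=0)
print(out.shape)  # (2, 30, 1): one force estimate per time step
```

`return_sequences=True` on both layers gives the one-output-per-input behavior shown in Fig. 8, and the final ReLU implements the zero-force behavior discussed earlier.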

Alternative methods
In addition to PLS, we used SVR and Random Forest with bootstrapping to decode the applied force from the LFP signal. The feature extraction and validation process for SVR and Random Forest were identical to those of PLS. For SVR, we used an RBF kernel with kernel coefficient γ = 1/(number of features). We selected the regularization parameter (C) based on five-fold cross-validation on the training data. As for Random Forest, we considered a maximum of 100 trees, and the maximum number of features for the best split was set to the square root of the number of features.
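These two baselines map directly onto scikit-learn estimators; C is fixed here, whereas in the study it was chosen by five-fold cross-validation, and the data below is synthetic:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

n_features = 960  # 10 time steps x 96 band/channel features
rng = np.random.default_rng(0)
X = rng.standard_normal((120, n_features))
y = 0.5 * X[:, 0] + 0.1 * rng.standard_normal(120)

# RBF-kernel SVR with kernel coefficient gamma = 1 / n_features;
# C is a placeholder for the cross-validated value.
svr = SVR(kernel="rbf", gamma=1.0 / n_features, C=1.0).fit(X, y)

# Random Forest with bootstrapping, 100 trees, sqrt(n_features) per split
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                           bootstrap=True, random_state=0).fit(X, y)

print(svr.predict(X[:1]).shape, rf.predict(X[:1]).shape)  # (1,) (1,)
```

Note that γ = 1/(number of features) corresponds to scikit-learn's `gamma='auto'` setting.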

Performance criteria
The coefficient of correlation (r) and the coefficient of determination (R2) were used to evaluate the performance of the models. The coefficient of correlation shows the overall resemblance between the observed and predicted values. The coefficient of determination, on the other hand, shows how much of the variance in the observed data exists in the predicted values. The formulations of (r) and (R2) are given in (14) and (15), respectively.
In these formulations, y i is the ith sample of the target value and ỹ is the mean of the target values. In the same manner, ŷ i is the ith sample of the predicted value and ŷ is the mean of the predicted values.
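Eqs. (14) and (15) are not reproduced in this excerpt; their standard definitions can be sketched as:

```python
import numpy as np

def corr_r(y, y_hat):
    """Pearson correlation coefficient, as in Eq. (14)."""
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    return (yc @ pc) / np.sqrt((yc @ yc) * (pc @ pc))

def r_squared(y, y_hat):
    """Coefficient of determination, as in Eq. (15): 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy force trace (N) and a close prediction
y = np.array([0.0, 0.05, 0.12, 0.15, 0.10])
y_hat = np.array([0.01, 0.06, 0.10, 0.14, 0.11])
print(round(corr_r(y, y_hat), 3), round(r_squared(y, y_hat), 3))  # ~0.98 ~0.943
```

Unlike r, R2 penalizes systematic offsets, which is why the two metrics are reported separately.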

Contribution of each frequency band
The weights of the first LSTM layer contain information about the contribution of each feature of the neural data to predicting the force value. Therefore, according to (16), the absolute values of the weights related to each frequency band are summed and then normalized by the sum of the absolute values of all the weights. The contribution values were calculated and averaged over all validation folds.
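A hedged reading of Eq. (16) in code; the weight-matrix layout (features ordered band-by-band, as in the preprocessing section) is an assumption:

```python
import numpy as np

def band_contributions(W, n_bands=6, n_channels=16):
    """For each frequency band, sum the absolute first-layer input weights
    over its channels and all units, then normalize by the total.
    W: (n_units, n_bands * n_channels) input weight matrix, with the
    feature axis ordered band-by-band.
    Returns a length-n_bands vector summing to 1.
    """
    A = np.abs(W).sum(axis=0).reshape(n_bands, n_channels)
    per_band = A.sum(axis=1)
    return per_band / per_band.sum()

# Hypothetical first-layer weights: 30 units, 96 input features
W = np.random.default_rng(0).standard_normal((30, 96))
c = band_contributions(W)
print(c.shape, round(c.sum(), 6))  # (6,) 1.0
```

Averaging this vector over the validation folds yields the per-band contributions plotted in Fig. 3.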