An efficient scRNA-seq dropout imputation method using graph attention network

Background Single-cell sequencing technology can address the amount of single-cell library data at the same time and display the heterogeneity of different cells. However, analyzing single-cell data is a computationally challenging problem. Because there are low counts in the gene expression region, it has a high chance of recognizing the non-zero entity as zero, which are called dropout events. At present, the mainstream dropout imputation methods cannot effectively recover the true expression of cells from dropout noise such as DCA, MAGIC, scVI, scImpute and SAVER. Results In this paper, we propose an autoencoder structure network, named GNNImpute. GNNImpute uses graph attention convolution to aggregate multi-level similar cell information and implements convolution operations on non-Euclidean space on scRNA-seq data. Distinct from current imputation tools, GNNImpute can accurately and effectively impute the dropout and reduce dropout noise. We use mean square error (MSE), mean absolute error (MAE), Pearson correlation coefficient (PCC) and Cosine similarity (CS) to measure the performance of different methods with GNNImpute. We analyze four real datasets, and our results show that the GNNImpute achieves 3.0130 MSE, 0.6781 MAE, 0.9073 PCC and 0.9134 CS. Furthermore, we use Adjusted rand index (ARI) and Normalized mutual information (NMI) to measure the clustering effect. The GNNImpute achieves 0.8199 (ARI) and 0.8368 (NMI), respectively. Conclusions In this investigation, we propose a single-cell dropout imputation method (GNNImpute), which effectively utilizes shared information for imputing the dropout of scRNA-seq data. We test it with different real datasets and evaluate its effectiveness in MSE, MAE, PCC and CS. The results show that graph attention convolution and autoencoder structure have great potential in single-cell dropout imputation.

The scRNA-seq technology can produce single-cell-level resolution data. As a result of defects such as low capture rate and low sequencing depth, the sequencing library data contains a lot of noise [9,10].
Compared with the next-generation sequencing data, scRNA-seq data usually contains a lot of zero expressions. These zero expressions can arise in two ways: One is that the genes are not expressed in the corresponding cells and the other is that some genes with low expression cannot be detected due to technical limitations. These events are called dropout events [11]. There are some reasons for dropout, including nonlinear amplification of mRNA, transcription efficiency when reverse transcription of mRNA to cDNA and low sequencing read depth [11][12][13].
In the downstream analysis of scRNA-seq data, the dimensional reduction and unsupervised clustering are always be used to infer cell development trajectories and identify rare cell clusters [14,15]. However, dropout events will seriously affect the calculation of the distance between expression profiles, which leads to downstream results [16].
Recently, many methods have been developed to impute dropout in scRNA-seq data. For example, MAGIC [17] is based on Markov affinity-based graph, which uses similar cells and genes information to impute missing values. However, this method lacks robustness and cannot adapt to the nonlinear relationship with genes. Furthermore, DCA [18] is a neural network-based method that uses deep autoencoding networks for unsupervised learning. It performs zero-inflated negative binomial modeling on scRNAseq data to solve the problem of noise and gene distribution. In addition, there are some imputation methods based on deep learning or statistical methods such as scVI [19], scImpute [20], and SAVER [21]. These methods can only apply to Euclidean space by using Euclidean spatial data, such as the expression matrix. But they cannot directly deal with Non-Euclidean spatial data like cell graphs [22][23][24][25].
Therefore, we propose a novel structure neural network named GNNImpute, which is an autoencoder structure network that uses graph attention convolution. By building a graph from the scRNA-seq data, GNNImpute uses graph attention convolutional layer to make a targeted selection of similar neighboring nodes. Then, it aggregates these similar neighboring nodes. The nodes in the graph can continuously transmit messages along the edge direction until stability is reached. In this way, GNNImpute enables the expression of the cells in the same tissue area to be embedded in low-dimensional vectors through the autoencoder structure. GNNImpute can not only capture the coexpression patterns between similar cells but also remove sequencing technical noise from imputing dropout, which improves the downstream analysis of scRNA-seq data.

The high-level approach
In scRNA-seq data, each cell has its own expression profile, and the expression profile of each cell is different and unique. But cells from the same tissue or with the same function usually have similar features. Therefore, when a dropout event occurs in any cell, it can be recovered by the gene expression profile of similar cells.
GNNImpute is a deep learning method based on a graph attention neural network. Different from MAGIC, GNNImpute introduces the attention mechanism that can assign weights to different similar cells according to attention coefficients. It can establish nonlinear relationships between genes by learning low-dimensional embedding of expressions through the autoencoder structure network. Compared with DCA, GNNImpute can learn the gene co-expression patterns of similar cells by aggregating information from multi-level neighbors. The co-expression patterns can help recover low-expressed genes. GNNImpute reduces the dropout noise and improves the gene expression profile of cells.
The overall structure of GNNImpute is shown in Fig. 1a. It is composed of an encoder and a decoder. Figure 1b shows that the encoder of GNNImpute has two graph attention convolutional layers, which are used to transmit the information of neighbor nodes. And the decoder consists of two linear layers. GNNImpute uses the masked expression matrix as the model input. The output of the model is used to calculate the loss value. And the parameters of the model are optimized by this value.

Data preprocessing
GNNImpute uses the expression matrix of scRNA-seq data as input. As shown in Table 1, the expression matrix is an nCells * nGenes scale. The rows represent different cells number, and the columns represent different gene sites. Each value in the matrix indicates the expression intensity of a gene in a cell. Due to the sparseness of scRNA-seq   data, the expression matrix contains a very large number of zeros. When the expression values in row or column are all zero, it means that those cells or genes are no expression at all. We filtered these no expression values from the matrix, because these values may cause impurity interference and invalid information. Similarly, we filtered cells with overexpression in the matrix, which may be caused by incorrect counting or cell rupture after death.
In the data preprocessing, we use SCANPY [26] to filter the original matrix. We address the data in four steps. In the first step, cells with expression values less than 200 and genes with expression values less than 3 are filtered. Second, we filter the cells with overexpression of mitochondrial genes [27], as shown in Fig. 2a. Third, the cells with a high total expression count in the Fig. 2b should also be filtered. Finally, we normalize the filtered expression matrix, the purpose is to make each row (cell) in the expression matrix have the same value of expressions, and this value is the median of the values of expressions of all cells before normalization.

Build connection graph
GNNImpute is a dropout imputation method based on a graph attention network. It is used to obtain gene expression from similar cells to recover the dropout event. In order to aggregate the cells with similar expressions, it is necessary to define a connection graph between the cells. In this graph, we use nodes to represent cells and we use edges to represent the similarity between cells. In this graph, cells can transmit information to adjacent cells. As shown in Fig. 3, the construction of such a graph is divided into three  There are three steps to construct a connection graph. First, a shows the result of visualization after dimensionality reduction by PCA. Second, b is the distance matrix represented by a heat map. Third, c represents the K-nearest neighbor graph after selecting K neighbors steps. The first step is to reduce the dimensionality of the expression matrix. Figure 3a shows the result of scRNA-seq data dimensionality reduction by Principal Component Analysis (PCA). After PCA, we can see that the cells are clustered according to similar expressions (Fig. 3a). We select the first 50 principal components as the GNNImpute input. The second step is to calculate the Euclidean distance between every two cells in the expression matrix. As shown in Fig. 3b, we get a heat map of nCells * nCells scale. Heat map rows and columns represent different cells and the heat map color represent different cell distance. The color is deeper, the distance is closer. In the third step, we select K closest cells to construct graph edges. The K edges display a similar relation of cells. Through the above steps, we construct a cell-to-cell connection graph (K-nearest neighbor graph). We set the K = 5 (The K number can be customized).
After constructing the graph, all graph cells have K cells with the most similar expression. We call these K cells the "first-level" neighbors. These K cells are adjacent to the origin cells. Similarly, there are also K cells in the "first-level" neighbors. We named the "second-level" cells. It doesn't exist edges between origin cells and "second-level" cells, but the transferability of intermediate nodes can still indicate similarity. Figure 3c shows the origin cell and its neighbors. We use a two-layer graph convolution structure to transfer information within the range of "second-level" neighbors. This structure can not only maximize the aggregation of similar node information but also avoid over smooth node features.

Multi-head graph attention convolutional layer
In order to aggregate the information of cells, we need a connection graph and graph attention convolutional layers. The essence of the graph convolutional layer is not to aggregate information around the original nodes, but aggregate nodes connected by edges. The calculation process of graph convolutional layers is as follows: where k is the number of layers of graph convolution. W is the trainable weight. Â =D − 1 2ÃD − 1 2 , Where Ã and D are the adjacency matrix and degree matrix of the cellto-cell connection graph, respectively. The adjacency matrix Ã = A + I , I is the identity matrix, which means add self-connections to the adjacency matrix. The degree matrix D ii = jÃ ij . σ is the activation function. ReLU is used as the activation function.
is the input matrix of the k-th graph convolutional layer. When k = 0 , H (k) = X . In the GNNImpute, it is the expression matrix of scRNA-seq data.
Through the superposition of multiple graph convolutional layers, the information aggregation of multi-order neighbors can be achieved. We use two-layer graph convolution in the encoder, and the output of the encoder is as follows: To aggregate the information of neighbors more efficiently, we propose an attention model for neighbor nodes. By adding attention to neighbor nodes in the form of weights, this attention model achieves targeted aggregation of neighbor nodes. Specifically, the more similar the neighbor node is to the target node, the greater the attention coefficient obtained by the neighbor node. In this way, different weights are applied to different neighbors. The calculation of the attention coefficient is as follows: where − → h i and − → h j represent the features of node i and node j, which is the gene expression profile of the cells. And e ij represents the attention coefficient of cell j to cell i. a() is the attention calculation formula. It is used to calculate the similarity of every two nodes. we use the dot product as the attention calculation formula to calculate the similarity. W is a shared weight matrix. It transforms the input features into more advanced features, so that each node can obtain sufficient expressive ability. We only calculate the attention coefficient of cell i and cell j, where j ∈ N i , and N i is the first-order neighbors of cell i in the cell-to-cell connection graph. Further calculation of multiple independent attention coefficients can be extended to the multi-head attention mechanism. In this attention mechanism, the attention coefficients are combined by calculate the average values to stabilize the learning process. The formula is as follows: In order to compare attention coefficients between different nodes, it is necessary to add the softmax function to standardize it: Combining the above (1), (2), (3), (4), (5) formula, the final attention weight displays as follows:

Architecture and training
GNNImpute builds the model with an autoencoder (encoder and decoder). The input layer and output layer of the model have the same number of nodes. In the hidden layer, the nodes are much lower than encoder and decoder nodes. Different from the traditional autoencoder structure, we make an improvement on the GNNImpute encode layer. We use graph attention networks in the GNNImpute encode layer instead of linear networks. As shown in Fig. 4a, there is the autoencoder structure used by GNNImpute. In the encoder, the input size of the first layer is the number of gene features of the cell, and the first layer output is 512. The second input size is equal to the first layer output (512). GNNImpute decoder is composed of three parts: linear layer, batch normalization layer and ReLU.
Further, GNNImpute adds dropout layers to combat the over-fitting problem of the model. GNNImpute introduces a multi-head attention mechanism to achieve targeted . selection when aggregating neighbor nodes. This multi-head attention mechanism can stabilize the learning process and provide robustness for the model. It is noted that GNNImpute uses a semi-supervised learning method to recover from dropout events. The advantage of semi-supervised learning is that some labeled cells can provide soft labels for many other unlabeled cells, and it can help the model recover from dropout events more accurately. The model can learn the potential features by minimizing the error between the reconstructed expression matrix and the original expression matrix. Meanwhile, the hidden layer can capture the distribution of the matrix and ignore invalid changes in the low-dimensional environment.
Because the dropout event is random, there are few dropout benchmarks. Therefore, we used a fair measurement method [28,29]. This is a method of constructing a dropout benchmark by randomly masking the expression matrix. Using this fair measurement method can make various methods calculate the corresponding metrics. First, we process the expression matrix of the real scRNA-seq data to obtain the filtered matrix as the ground truth. Then, we randomly masked non-zeros based on a predetermined dropout rate. After two of the above steps, we can obtain the masked expression matrix and unmasked expression matrix. We can use the matrix data to train the GNNImpute model and validate the imputation effectiveness.
In the model training phase, we divide the PBMC dataset according to the ratio of 6:2:2. There are 1706 cells in the training set, 569 cells in the validation set, and 568 cells in the test set. Training set used to train the model, the validation set would be used to test the trained model, and the test set would evaluate the final model. The total parameters of GNNImpute will be adjusted according to the size of the dataset. When using the PBMC dataset, the total parameters of the model are 26.75 M. The loss function of the model is set to the mean square error loss function. The optimizer is Adam, and the learning rate is 0.0001. The maximum number of iterations is set to 3000. If the loss value of the validation set does not decrease in 200 consecutive iterations, it will interrupt training early. The training processes of the model are shown in Fig. 4b-d.

Evaluation metrics
In the experiment, we use four metrics to measure the imputation ability of GNNImpute with the other four methods. The four metrics are mean square error (MSE), mean absolute error (MSE), Pearson correlation coefficient (PCC) and Cosine similarity (CS), respectively. MSE and MAE are used to show whether the imputed gene expression values are the same as the labels. PCC and CS are used to measure whether the express trend of the imputed matrix is consistent with the raw matrix. In downstream data analysis, we employ the Adjusted rand index (ARI) and Normalized mutual information (NMI) to measure the clustering results.

Result
In order to validate the dropout imputation performance of different methods, we compare GNNImpute results with other five methods, including DCA, MAGIC, scVI, scImpute and SAVER. DCA uses a zero-inflated negative binomial distribution model to denoise the autoencoder network. This denoising network can solve the problem of count distribution, over-dispersion and sparsity. MAGIC is an imputation method based on Markov affinity-based graph, which imputes dropout values by sharing information among similar cells. scVI can capture the basic low-dimensional structure in the scRNAseq data by introducing a robust latent variable model, which can eliminate the noise in the data. scImpute is a statistical method that can automatically identify possible dropout events and recover them. It can also exclude outliers without introducing new bias. SAVER uses regularized regression prediction and empirical Bayesian methods to recover the gene expression profile in noisy and sparse data. We conduct experiments on four single-cell sequencing datasets of humans and mice. To illustrate the imputation performance of methods, we use dropout recovery index, clustering, robustness to evaluate the results.

Imputation evaluation
By randomly masking the expression matrix on the four real datasets, we can obtain positive and negative training data. We compare the imputation performance of GNNImpute with the other five imputation methods using four real datasets. Figure 5 shows the performance of GNNImpute with the other five methods. Overall, In the Fig. 5a, b, we can see the average MSE and MAE of GNNImpute can achieve 3.0130 and 0.6781. The results are better than DCA (3.0130 vs. 5.1888, and 0.6781 vs. 0.9036). The reason is that GNNImpute uses semi-supervised learning, which can learn from the labeled data to recover the dropout event. Since scVI is also a neural network method based on autoencoders, its performance is second only to DCA. For the scImpute, the MSE and MAE in the four datasets are the worst, because scImpute has an overall bias in the imputation. As shown in Fig. 5c, d, (Chen, Campbell), which indicates that they cannot perform effective imputation on the large dataset. Another reason why scImpute may have an overall bias in imputation is that it has the worst performance in MSE and MAE but the PCC and CS are better than MAGIC and SAVER.

Heat map and clustering evaluation
The purpose of imputation is to improve the downstream analysis of scRNA-seq data. Therefore, we use clustering results to evaluate the downstream analysis. We use two metrics (ARI, NMI) to measure the performance of cell clustering after imputation.
In the clustering analysis, we used the data published by Klein. They analyzed mouse embryonic stem cells, revealing in detail the population structure and the heterogeneous onset of differentiation after leukemia inhibitory factor (LIF) withdrawal. The cluster labels are determined by the intervals of LIF withdrawal (0, 2, 4, 7 days). The t-distributed stochastic neighbor embedding (t-SNE) algorithm is used to reduce the dimension of the expression matrix. And it can realize the visual analysis of clustering. In Fig. 6a-c, we show the figures of the raw matrix, noised matrix and the denoised matrix after GNNImpute imputation. The dimensions of these matrices are all reduced by t-SNE for visualization. The visual analysis of the noised expression matrix in Fig. 6b shows that four cell clusters have different degrees of mixing, and there is no obvious dividing line. But the expression matrix imputed by GNNImpute can separate different clusters, as shown in Fig. 6c. After imputing the matrices with different methods, we use k-means algorithm to measure the performance of matrix clustering. Then, we use ARI and NMI to measure the clustering results obtained by the k-means algorithm. GNNImpute reaches 0.8368 (ARI) and 0.8199 (NMI), which are at least 1.82% and 1.21% better than other methods (shown in Fig. 6d, e).
By calculating the gene heat map of all cells in the imputed expression matrix, the results can also be visualized to determine which methods can improve the downstream analysis of scRNA-seq data. Because the PBMC dataset does not have real cluster labels, we use Leiden algorithm to calculate pseudo labels. Then we find highly differentiated marker genes in each cluster by t-test based on the pseudo-labels. Finally, we select 50 marker genes most relevant to cluster classification in the PBMC dataset to measure the performance of GNNImpute and other five methods. Figure 6f, g shows the heat maps of the raw matrix and noised matrix. GNNImpute can recover the dropout events that occurred in different clusters, especially rare cell clusters (No. 7, No. 8 and No. 9 clusters, shown in Fig. 6h). The expression matrix after imputation by MAGIC shows that the large cell clusters are almost the same. The recovery of the dropout event is too smooth (such as LTB, RPS5 and CD74, shown in Fig. 6j). As a result, it lost the unique heterogeneity of scRNA-seq data. For scVI, it can only impute limited dropout values. The reason may be that low-expressed genes are mistaken for noise and ignored, such as RPL31 and RPS6 shown in Fig. 6k. scImpute can impute the dropout genes. But it changes the expression intensity of most genes, which further illustrates that scImpute imputes dropout values with a certain overall bias. The genes marked in Fig. 6l show that the expression intensity of these genes has been changed. SAVER does not perform obvious imputation. It may be that it cannot handle data with a high dropout rate.

Robustness analysis under different dropout rates
Next, we evaluate the ability of the imputation method for scRNA-seq data under different dropout rates. The dropout rates are 10%, 20%, 30%, 40%, 50%, and 60%, using PBMC dataset with random mask expression matrix. Figure 7a-d shows the performance of the six scRNA-seq dropout imputation methods under different dropout rates. From Fig. 7 we can see that GNNImpute is not sensitive to the dropout rate. It can recover the most dropout events at a high dropout rate (60%

Analysis of different training sets for semi-supervised learning
GNNImpute uses a semi-supervised learning method to train the model and learn the dropout knowledge. The advantage of semi-supervised is that it can use only a small amount of labeled data and a large amount of unlabeled data for training, which greatly reduces the requirements for manually labeling data. In this Fig. 7 a shows the MSE of six methods at different dropout rates. b shows the MAE of six methods at different dropout rates. c shows the PCC of six methods at different dropout rates. d shows the CS of six methods at different dropout rates experiment, we use 80 ~10% data with labels for model training. Even if only 10% labeled data is used, the model still shows great imputation performance, as shown in Fig. 8. The MSE and MAE are 3.4685 and 0.8147, and the PCC and CS are 0.9351 and 0.9436, respectively.

Imputation performance in simulated data
To further evaluate the performance of GNNImpute, we evaluate the performance in simulated data. Following the previous work, we used the Splatter package [30] to generate two simulation datasets. The first dataset has 2 groups, and the second dataset has 6 groups. Both simulation datasets contain expression matrices of 4000 cells and 20,000 genes. Table 2 shows the MSE, MAE, PCC and CS of GNNImpute and the other five methods in simulated dataset (2 groups). GNNImpute is better than other methods on MAE, PCC and CS. Only on MSE, our method is slightly inferior to DCA (21.2961 vs. 19.9282), but compared with the other four remaining methods, our method still has a significant improvement. When we evaluate in the simulated dataset (6 groups), we can get similar results, as shown in Table 3. The reason for this phenomenon may be that the simulated data generated by the Splatter package is quite different from the real data. After calculation, we can confirm

Analysis of attention mechanism of GNNImpute
In order to verify the effectiveness of the attention mechanism, we specially added experiments to evaluate our method and GCN architecture model (without attention). As shown in Tables 4 and 5, we evaluated the performance of five models on the PBMC dataset and Klein dataset. They are GNNImpute (GCN architecture without attention), GNNImpute (with 1 attention head), GNNImpute (with 3 attention heads), GNNImpute (with 5 attention heads) and GNNImpute (with 8 attention heads). The experiment method uses five independent repeated experiments to take the average value.
The results of PBMC dataset in Table 4 show that the performance of the model using the attention mechanism is all better than the GCN model (without attention). In the evaluation of MSE and MAE, the model using the attention mechanism is at least better than GCN model (without attention) by 12.9% (3.3047 vs. 2.8800) and 3.6% (0.8022 vs. 0.7736). And they are also better than GCN (without attention) on PCC and CS (0.9436    Table 5 shows the improvement of the clustering effect of the attention mechanism on the Klein dataset. In the evaluation of ARI, the attention mechanism provided about 2% (0.7998 vs. 0.8155). For the evaluation of NMI, it is also better than GCN (without attention) by 1.6% (0.8204 vs. 0.8049). We also observed that the performance of GNNImpute (with 1 attention head) is not stable, so we recommend using the multi-head attention mechanism to stabilize the performance of the model.

Discussion
On the high level, imputing the dropout of scRNA-seq data is a process of denoising the expression matrix data of the raw scRNA-seq library. Many existing approaches show that deep learning methods, especially autoencoders, can effectively denoise data. Our GNNImpute method extends this high-level approach to the case of Non-Euclidean spatial data like cell graphs. By reconstructing the expression matrix of scRNA-seq data, GNNImpute can establish a learning mechanism between the input and output of the model. It allows our model to capture non-linear relationships between genes and make better utilization of data.
Usually, the dropout imputation performance is affected by the sparseness of scRNA-seq data. For noise data, it is usually more difficult to impute dropout values. However, GNNImpute shows excellent performance compared with other methods. GNNImpute compensates for the lack of low expression intensity of some genes by aggregating the features information of similar cells. Meanwhile, it can recover the dropout events in the scRNA-seq data and remain the specificity between cells to avoid excessive smoothing of expression.
Compared with other dropout imputation methods, GNNImpute has great adaptability for addressing the different sizes of datasets (especially large datasets). The difficulty in processing large datasets is that there are many cell types, and each cell contains a lot of gene information. Since GNNImpute uses a neural network model, it can capture important features from all information, and then reduce the dimensionality to ignore unimportant features.
Moreover, GNNImpute is a semi-supervised learning model, which does not require manually labeled data. It can be trained with few labeled data. This model only uses 10% of the dataset for training can still achieve great results.

Conclusions
In this paper, a novel imputation method based on graph attention convolution is proposed, which is a semi-supervised learning method using an autoencoding structure network. GNNImpute focuses on determining the similarity between cells and constructs a connection graph to capture the features of similar cells. This method also introduces an attention mechanism of weighted neighbor nodes to select the cell node with the most useful features information. In the experiments of four datasets, the performance of GNNImpute is better than other existing methods for four metrics of MSE, MAE, PCC and CS. When we explore the limits of GNNImpute, we find that it cannot provide the interpretability of cell clusters. Future investigations will focus on how to make GNNImpute more explainable.