Deep learning improves the ability of sgRNA off-target propensity prediction

Background: The CRISPR/Cas9 system, as the third-generation genome editing technology, has been widely applied in target gene repair and gene expression regulation. Selection of an appropriate sgRNA can improve the on-target knockout efficacy of the CRISPR/Cas9 system with high sensitivity and specificity. However, when the CRISPR/Cas9 system operates, unexpected cleavage may occur at some sites, known as off-target effects. A number of prediction methods have been developed to predict the off-target propensity of an sgRNA at specific DNA fragments. Most of them rely on manual feature extraction and machine learning techniques to obtain off-target scores. With the rapid expansion of off-target data and the rapid development of deep learning theory, the existing prediction methods can no longer satisfy the prediction accuracy required at the clinical level. Results: Here, we propose a prediction method named CnnCrispr to predict the off-target propensity of sgRNA at specific DNA fragments. CnnCrispr automatically learns the sequence features of sgRNA-DNA pairs with the GloVe model and embeds the trained word vector matrix into a deep learning model combining biLSTM and a CNN with five hidden layers. We conducted performance verification on the data set provided by DeepCrispr and found that the auROC and auPRC in the "leave-one-sgRNA-out" cross validation reached 0.957 and 0.429, respectively (the Pearson and Spearman values reached 0.495 and 0.151, respectively, under the same settings). Conclusion: Our results show that CnnCrispr has better classification and regression performance than the existing state-of-the-art models. The code for CnnCrispr can be freely downloaded from https://github.com/LQYoLH/CnnCrispr.

Background
The CRISPR/Cas9 system (clustered regularly interspaced short palindromic repeats/CRISPR-associated protein 9) [1][2][3][4] can edit the genome in two main ways: introduction of an insertion/deletion (indel) to cause mutation of a gene at the target position by nonhomologous end-joining (NHEJ), or utilization of a "donor template" provided by foreign DNA that recombines with the mutant target to achieve precise editing by homology-directed repair (HDR) [17][18][19]. Some studies have found that when the CRISPR/Cas9 system operates, several mismatched sites may be tolerated in the complementary matching of the sgRNA to the target DNA sequence, resulting in unintended cleavage of DNA sequences, which is called "off-target" [16,20,21]. Fu et al. [20] confirmed that an sgRNA tolerates 1-5 base mismatches during the guiding process, which in turn causes unintended sequences to be erroneously edited. The off-target phenomenon has greatly hindered the clinical application and further promotion of CRISPR technology. How to assess the off-target propensity of specific sgRNAs and minimize the risk of off-target cleavage has become a focus of CRISPR/Cas9 research.
Presently, a variety of off-target detection methods have been developed. GUIDE-Seq [22][23][24], created by Tsai et al., can effectively identify mutations at frequencies as low as 0.1% in cells and predict the cleavage activity of the system based on sequencing results. The HTGTS [25] method uses the fusion of known DNA double-strand breaks with other cleaved DNA to detect DNA breaks by PCR amplification and thereby locate off-target sites. On this basis, Frock et al. [26] developed a higher-throughput off-target detection method. The BLESS [27] technique infers off-target sites by detecting DNA double-strand breaks; however, this method is complicated to operate and cannot detect a break site that has not yet occurred or has already been repaired. In addition, the IDLV [28,29] method can detect off-target sites across the whole genome without bias, but with an accuracy of only 1%.
The above detection methods cannot identify all off-target sites of a specific sgRNA and suffer from disadvantages such as high cost, difficult operation, and low detection accuracy. As the core of artificial intelligence, machine learning and deep learning can effectively analyze empirical data and provide important technical support for bioinformatics. To date, machine learning has been gradually applied to off-target site prediction [14,30], sgRNA activity prediction [14], sgRNA design optimization [31,32], and related tasks. Various machine learning based sgRNA design models [30, 33-36] have been developed and put into application. Their main design idea is to introduce sgRNA sequence features and secondary structure features, rank all candidate sgRNAs for a specific target DNA sequence by off-target score, and select the sgRNA with high cleavage efficiency and low off-target propensity.
The above machine learning methods were based on sequence features. At the time of writing, only three existing prediction models have introduced deep learning into the sgRNA off-target propensity prediction problem.
DeepCpf1 [31], based on a convolutional neural network (CNN), introduced sgRNA sequence features and chromatin accessibility to predict the editing efficiency of sgRNAs for Cpf1. This method requires no manual feature construction, which simplifies the model and makes it convenient for researchers to use. DeepCrispr [37] introduced four epigenetic features in addition to DNA sequence features and automatically extracts valid information using an auto-encoder; several models covering sgRNA target cleavage and off-target propensity prediction were established. However, it is still unknown whether the four epigenetic features have a positive impact on the model's predictions. CNN_std [38] used only sequence features, constructing a two-dimensional input matrix via an "XOR" coding scheme and applying a CNN for prediction. This deep learning method also achieved high accuracy on the CRISPOR data set [39]. In addition, Dimauro et al. proposed CRISPRLearner [40] for predicting sgRNA on-target knockout activity. Although its purpose differs from ours, its application of deep learning to sgRNA-related prediction tasks provided us with ideas.
Most existing prediction methods are still based on machine learning and rely on complex manual feature extraction [41][42][43][44][45][46]. However, the internal mechanism of CRISPR gene editing is not yet clear and explicit, so manually designed sgRNA features may have a negative impact on prediction results.
Therefore, we present CnnCrispr, a novel computational method for predicting sgRNA off-target cleavage propensity using deep learning. In CnnCrispr, the GloVe embedding model is introduced to extract global statistical information from input sequences by constructing the co-occurrence matrix of each sgRNA and its corresponding DNA sequence. Integrated with a deep neural network, the off-target propensity of a given sgRNA at a specific DNA fragment can then be predicted. We trained CnnCrispr on the data set used by DeepCrispr [37] and showed, through performance comparison with four state-of-the-art models, that CnnCrispr has a competitive advantage in predicting sgRNA off-target propensity; it is therefore expected to become a useful tool for research on the CRISPR system.

Model Structure and Prediction
In our initial conception, we combined biLSTM with a CNN framework in the final prediction model; the model structure is shown in Fig. 1. We also constructed several similar but distinct models by removing different network parts, compared their test results, and selected the final prediction model. All pre-selected network frameworks considered during model selection are briefly described in Table 1.
The structure of the benchmark framework of CnnCrispr is described in detail below. The first layer of CnnCrispr is an embedding layer, which takes as input the vectors obtained from the GloVe model. Since the vector dimension of the GloVe model is set to 100, the input of the embedding layer is a two-dimensional matrix of size 16×100. We called the mittens package in Python to train the GloVe model on the GloVe co-occurrence matrix.
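As a hedged illustration of the encoding step described above, the sketch below assumes (since the text does not spell this out) that the 16-word vocabulary consists of the 16 possible aligned nucleotide pairs (AA, AC, ..., TT), that each sgRNA-DNA pair becomes a sequence of such tokens, and that a distance-weighted co-occurrence matrix is then accumulated for a GloVe trainer such as mittens. The example sequences and window size are hypothetical.

```python
from itertools import product

# Hypothetical 16-token vocabulary: one token per aligned (sgRNA, DNA) base pair.
VOCAB = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 tokens
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize_pair(sgrna, dna):
    """Map aligned sgRNA/DNA sequences to token ids, position by position."""
    assert len(sgrna) == len(dna)
    return [TOKEN_ID[g + d] for g, d in zip(sgrna.upper(), dna.upper())]

def cooccurrence(token_seqs, window=5):
    """Accumulate a 16x16 co-occurrence matrix over a sliding window,
    weighting each pair by the inverse of its distance (GloVe-style)."""
    mat = [[0.0] * 16 for _ in range(16)]
    for seq in token_seqs:
        for i, ti in enumerate(seq):
            for j in range(max(0, i - window), min(len(seq), i + window + 1)):
                if i != j:
                    mat[ti][seq[j]] += 1.0 / abs(i - j)
    return mat

# Illustrative 23-nt sgRNA/DNA pair (made-up sequences).
tokens = tokenize_pair("GAGTCCGAGCAGAAGAAGAAGGG",
                       "GAGTCCTAGCAGGAGAAGAAGAG")
mat = cooccurrence([tokens])
```

The resulting matrix is what a GloVe implementation would factorize into the 16×100 embedding matrix fed to the first layer.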
The second layer is a biLSTM network, which is mainly used to extract the context features of the input. Five convolution layers are subsequently connected to the model, each with a different kernel number and kernel size. Fully connected layers of sizes 20 and 2 are then placed behind the last convolution layer.
In addition to the framework above, Batch Normalization and Dropout layers are added between layers to prevent overfitting; the Dropout rate is set to 0.3. In the output layer, softmax and sigmoid activation functions are used to obtain the predictions of the classification model and the regression model, respectively.
In the training process, the initial learning rate was set to 0.01, and we used the Adam algorithm to optimize the loss function. Furthermore, we set the batch size to 256, balancing the need to extract potential information from the negative data against the risk of overfitting. Too large a batch size may increase the risk of some positive samples appearing multiple times in a single batch during training, while too small a batch size may reduce the training speed and extend the training time.
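The architecture and training settings above can be sketched in Keras as follows. This is a hedged reconstruction, not the authors' code: the per-layer filter counts, kernel sizes, and LSTM width are not specified in the text and are placeholders here, and the embedding would in practice be initialised from the trained GloVe vectors rather than randomly.

```python
from tensorflow.keras import layers, models, optimizers

def build_cnncrispr(seq_len=23, vocab_size=16, embed_dim=100,
                    classification=True):
    inputs = layers.Input(shape=(seq_len,))
    # Embedding: 16-word vocabulary x 100-dim GloVe vectors (random here).
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    # biLSTM to capture context features of the token sequence.
    x = layers.Bidirectional(layers.LSTM(40, return_sequences=True))(x)
    # Five convolution layers, each followed by BatchNorm and Dropout(0.3).
    # Filter counts / kernel sizes below are hypothetical placeholders.
    for filters, ksize in [(16, 5), (32, 5), (64, 3), (64, 3), (128, 3)]:
        x = layers.Conv1D(filters, ksize, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.3)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(20, activation="relu")(x)
    if classification:
        out = layers.Dense(2, activation="softmax")(x)  # classification head
        loss = "categorical_crossentropy"
    else:
        out = layers.Dense(1, activation="sigmoid")(x)  # regression head
        loss = "mse"
    model = models.Model(inputs, out)
    # Adam with the stated initial learning rate; batch_size=256 at fit time.
    model.compile(optimizer=optimizers.Adam(learning_rate=0.01), loss=loss)
    return model
```

Training would then call `model.fit(..., batch_size=256)` with the sampled batches described in Section 5.6.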
Our experiment was divided into two parts. First, we compared the performance of different models. Then, the final prediction model was compared with the best-performing existing models to evaluate its practical applicability. Detailed network descriptions can be found in Additional file 2.

Model Selection
Experimental data come from the supplementary material of the DeepCrispr article; the data are described in detail in Section 5.1. Twenty percent of the data in the Hek293t and K562 data sets were randomly selected to form the test sets (the Hek293t test set, the K562 test set, and the Total test set). Different prediction models were obtained by training on all remaining data, and the prediction performance of each model on the three test sets was evaluated. During training, we generated the batch training data using the data sampling method described in Section 5.6.
We built two models, for classification and regression prediction, respectively. The first three models in Table 1 were trained to verify the influence of different parts on prediction performance. The structure of the benchmark model CnnCrispr is introduced in Section 2.1. The model CnnCrispr_No_LSTM was obtained by removing the LSTM part from CnnCrispr, and CnnCrispr_Conv_LSTM by swapping the order of the convolution layers and the recurrent layer. The purpose of these two variants was to test whether the CNN and RNN layers improve performance, and whether the order of the two frameworks affects it.
We initially trained the three models mentioned above and obtained the prediction results.
The model performance is shown in Table 2.
Due to the highly unbalanced nature of the data set, it was easy for a model to obtain a high auROC value. Therefore, we set aside the comparison of auROC values and focused on the auPRC and recall values on the test sets. The results in Table 2 were used to draw the histogram in Fig. 2, from which it can be seen intuitively that CnnCrispr has the best predictive performance. Therefore, we took CnnCrispr as the benchmark network framework and further fine-tuned the network structure.
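The preference for auPRC on this data is easy to motivate: with roughly 250 negatives per positive, a random scorer still obtains an auROC near 0.5, while its auPRC collapses to the positive prevalence (about 1/251 ≈ 0.004). A minimal sketch of the auPRC computation (step-wise interpolation of the precision-recall curve; ties in scores are ignored for simplicity) might look like this:

```python
def precision_recall_auc(labels, scores):
    """Area under the precision-recall curve via a threshold sweep.
    labels: 0/1 ground truth; scores: higher means more likely positive."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    total_pos = sum(labels)
    tp = fp = 0
    prev_recall, auc = 0.0, 0.0
    for _score, y in pairs:
        if y:
            tp += 1
        else:
            fp += 1
        recall = tp / total_pos
        precision = tp / (tp + fp)
        auc += (recall - prev_recall) * precision  # rectangle per step
        prev_recall = recall
    return auc
```

A perfect ranking (all positives scored above all negatives) yields auPRC 1.0, whereas one false positive ranked first immediately drags the area down, which is exactly why this measure is sensitive to the rare-positive setting.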
Based on CnnCrispr, the Dropout layer and the Batch Normalization layer were removed in turn to verify the influence of each on performance; a brief description of these network structures is given in Table 1. The recall value of CnnCrispr_No_Dropout was 0.810 on the total test set, slightly lower than that of CnnCrispr, showing that the Dropout layer does improve performance and prevent over-fitting, although the degree of improvement is not very noticeable. Moreover, after adding the Dropout layer, the training parameters of the model were greatly reduced, which saved training time, so we kept the Dropout layer in the final model. We then trained the model without the Batch Normalization layer several times and evaluated it on the test set, but every time all samples in the test set were classified as negative.
This indicated that the model without the BN layer had lost its ability to classify; the BN layer is therefore essential in the final model. Its importance for neural network models is also discussed in Section 5.5, and we retained it in our final model.
CFD is short for Cutting Frequency Determination. As a scoring model for evaluating the off-target propensity of an sgRNA-DNA interaction, CFD specifies a score for each position and type of mismatch between the sgRNA and the corresponding DNA sequence. When multiple mismatches appear in a sequence pair, the corresponding scores are multiplied to obtain the final score. For example, if an sgRNA-DNA pair has an rG-dA mismatch at position 6 and an rC-dT mismatch at position 10, it receives a CFD score of 0.67×0.87=0.583. Haeussler et al. [39] compared the performance of CFD with that of MIT and showed that CFD's predictions were slightly better on the CRISPOR data set. CNN_std is a CNN-based sgRNA off-target propensity prediction model developed by Jiecong Lin; the combination of sgRNA and corresponding DNA sequences is encoded by an "XOR" scheme and predicted by a multi-layer convolution network.
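The multiplicative CFD scoring described above can be sketched as follows. Only the two penalties from the worked example are included; the full published CFD table contains an entry for every (mismatch type, position) combination, and the key format below is our own illustrative choice.

```python
# Two (mismatch type, position) penalties taken from the worked example;
# the real CFD lookup table is much larger.
CFD_TABLE = {
    ("rG:dA", 6): 0.67,
    ("rC:dT", 10): 0.87,
}

def cfd_score(mismatches):
    """Multiply per-mismatch penalties; a perfect match scores 1.0.
    mismatches: list of (mismatch_type, position) tuples."""
    score = 1.0
    for key in mismatches:
        score *= CFD_TABLE[key]
    return score
```

For the example in the text, `cfd_score([("rG:dA", 6), ("rC:dT", 10)])` reproduces 0.67×0.87 ≈ 0.583.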
DeepCrispr is a deep learning method that combines sgRNA-DNA sequence information with genomic epigenetic features as input. DeepCrispr used the largest data set available for model training and introduced an auto-encoder to automatically acquire potential features of the sgRNA-DNA sequence, a good early application of deep learning to sgRNA-related prediction problems.
To make a comprehensive comparison with the four models above, we tested the performance of the classification and regression models under two test patterns. We downloaded the prediction models of CFD, MIT and CNN_std from the relevant websites and obtained their predictions on the same test set as CnnCrispr. Because the training methods of CnnCrispr and DeepCrispr were consistent, we used the test results reported by DeepCrispr for comparison.

Test pattern 1 -- withheld 20% as an independent testing set
Consistent with the training method of the "Model Selection" section, we randomly divided the data set of each cell line in the proportion 8:2 and compared the performance of CnnCrispr with the current best prediction models. Fig. 3 shows the comparison results under the classification schema. CnnCrispr achieved an auROC value of 0.975 and an auPRC value of 0.679 on the total test set, both higher than the values of CFD, MIT and CNN_std (similar trends held on the other test sets: CnnCrispr achieved auROC values of 0.971 and 0.995, and auPRC values of 0.686 and 0.688, on the Hek293t and K562 test sets, respectively). The areas under the ROC and PRC curves of CnnCrispr on the three test sets were all higher than those of CFD, MIT and CNN_std, which demonstrates CnnCrispr's more advanced prediction ability. In addition, the PRC curve of CnnCrispr on the total and Hek293t test sets completely contained the PRC curves of the other three models, while on the K562 test set only a small portion of the curve was covered by CNN_std. Since the training and test sets were extremely unbalanced, the PRC curve and the area under it are the more important measures for model evaluation, and there CnnCrispr had a strong competitive advantage. We further compared the test performance of CnnCrispr with that of DeepCrispr. Since the training methods and data sets were consistent, we directly compared the test results given in ref. [37]; the results are shown in Table 3. The auROC values of DeepCrispr were slightly better than those of CnnCrispr (most visibly on the Hek293t test set), but the auPRC values obtained by CnnCrispr on all three test sets were higher than those of DeepCrispr. On a comprehensive comparison, CnnCrispr performed better than DeepCrispr under test pattern 1.
Unlike the classification schema, the regression schema was evaluated mainly by the Pearson correlation coefficient and the Spearman rank correlation coefficient of the prediction results. The Pearson correlation between CnnCrispr's predictions and the true labels was strictly superior to that of the three comparison models (since the Pearson coefficient was not used as an evaluation measure in DeepCrispr, we compared only Spearman values with DeepCrispr).
The Pearson value of CnnCrispr on the Hek293t test set reached 0.712 (higher than the 0.371 obtained by CFD, 0.153 by MIT and 0.33 by CNN_std). On the entire test set, CnnCrispr also demonstrated better predictive ability, with a Pearson value of 0.682, higher than the 0.343 of CFD, 0.150 of MIT and 0.321 of CNN_std. For the Spearman correlation coefficient, the negative data in the test set greatly outnumbered the positive data (about 250:1), so no model can achieve a high Spearman value. Nevertheless, the prediction ability of CnnCrispr was still better than those of the four models above (the results of CnnCrispr on the Hek293t, K562 and Total test sets were 0.154, 0.160 and 0.134, respectively; the Spearman coefficients of CFD on the three test sets were 0.140, 0.143 and 0.128; of MIT, 0.085, 0.084 and 0.086; of CNN_std, 0.141, 0.144 and 0.132; and of DeepCrispr, 0.136, 0.126 and 0.133). In addition, following CRISTA's evaluation method and treating the predicted values as probabilities of the classification labels, we compared the areas under the ROC and PRC curves of the five models. The auROC and auPRC values obtained by CnnCrispr on the total test set were as high as 0.986 and 0.601, respectively, superior to the 0.942 and 0.316 of CFD and the 0.947 and 0.208 of CNN_std; the same pattern held on the Hek293t and K562 test sets. Based on these results, we concluded that CnnCrispr had the better prediction ability.
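The two regression measures used above can be computed from first principles; a minimal sketch (the Spearman variant here assigns ranks without averaging ties, which is sufficient for illustration but differs slightly from standard tie handling) is:

```python
def pearson(x, y):
    """Pearson correlation: covariance normalised by the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks.
    NOTE: ties are ranked arbitrarily here, not averaged."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```

Because Spearman depends only on rank order, a handful of positives buried among ~250× more negatives caps the achievable value, which is consistent with the low Spearman scores reported for all five models.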

Test pattern 2 -- "Leave-one-sgRNA-out"
To examine the accuracy and generalization ability of CnnCrispr when predicting the off-target propensity of a new sgRNA, we set up the "leave-one-sgRNA-out" experiment, a good evaluation method for off-target prediction. In each fold, one sgRNA and all of its corresponding off-target sequences (sites with true cleavage propensity, or potential sites obtained from the whole genome) were completely withheld for model testing. With one fold per sgRNA, model training and performance evaluation were conducted a total of 29 times. Through this 29-fold cross-validation we could comprehensively evaluate the generalization ability of CnnCrispr and detect over-fitting or under-fitting on particular sgRNAs.
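The protocol above can be sketched as a grouped splitter: all records sharing an sgRNA are held out together, so the model is always tested on a guide it has never seen. The record layout (a dict with an `sgrna` key) is an assumption for illustration.

```python
def leave_one_sgrna_out(records):
    """Yield (guide, train, test) splits, one per distinct sgRNA.
    records: list of dicts with at least an 'sgrna' key."""
    guides = sorted({r["sgrna"] for r in records})
    for g in guides:
        test = [r for r in records if r["sgrna"] == g]
        train = [r for r in records if r["sgrna"] != g]
        yield g, train, test

# Tiny illustrative data set: two guides, three records.
data = [{"sgrna": "g1", "label": 1}, {"sgrna": "g1", "label": 0},
        {"sgrna": "g2", "label": 0}]
splits = list(leave_one_sgrna_out(data))
```

With the 29 distinct sgRNAs of the DeepCrispr data set, this loop yields exactly the 29 train/test folds used for the cross-validation.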
For classification, CnnCrispr achieved an average auROC of 0.957 and an average auPRC of 0.429, both higher than the results of the four models above (CFD: average auROC 0.903, auPRC 0.319; MIT: auROC 0.848, auPRC 0.115; CNN_std: auROC 0.925, auPRC 0.303; DeepCrispr: auROC 0.841, auPRC 0.421). In the 29-fold cross validation, CnnCrispr's overall competitive advantage was more significant, and its auPRC results exceeded those of the other four models, which is essential to prevent the model from missing actual off-target sites (see Fig. 5).
For a more comprehensive evaluation, we also considered the distribution of the auROC and auPRC values obtained in the 29-fold cross-validation and drew violin plots (because the per-fold test data of DeepCrispr were unavailable to us, no violin plot could be drawn for it). The outer shape of a violin plot is a kernel density estimate of the underlying distribution. Fig. 6 shows that the auROC values of CnnCrispr were generally higher and more concentrated: 75% of them were greater than 0.9. In contrast, the auROC results of the other three models contained obvious outliers, indicating that they could not reliably predict the off-target propensity of individual sgRNAs; the auROC values of CFD and CNN_std in particular had clear stragglers (longer lower whiskers). The greater horizontal width of CnnCrispr's violin at high auROC values shows that more of its auROC values fell in that interval, further indicating good prediction performance. For auPRC, the median of CnnCrispr's results was significantly larger than that of the other three models, with 75% of its auPRC values greater than 0.2 and its distribution concentrated at higher scores, indicating that the overall predictive performance of CnnCrispr was indeed better than that of CFD and CNN_std (see Fig. 6).
We further compared the 29-fold cross-validation results under the regression schema; the performance visualizations are organized in Figs. 5-6. We first compared the average Pearson and Spearman correlation coefficients (see Fig. 5): CnnCrispr achieved higher mean Pearson and Spearman values, showing better fitting ability. We then drew the 29 sets of Pearson and Spearman values as violin plots. As shown in Fig. 6, the Pearson values of CnnCrispr were concentrated in the high-score range. The Spearman scores of all models were low, but even so the distribution of CnnCrispr's scores was significantly better than those of the other three models. We conclude that CnnCrispr had a higher probability of producing well-fitted predictions of off-target propensity (detailed results are in Additional file 1).

Discussion
As a classical neural network algorithm, the RNN has the following features: memory ability, shared parameters, and Turing completeness. It therefore has advantages in learning the nonlinear features of sequences and plays an important role in the study of sequence problems with temporal characteristics. Studies of CRISPR editing have shown that the base types at different positions influence the cleavage propensity of an sgRNA [11,21,41,42,47]; we therefore introduced an RNN framework into the prediction model to extract context information from sgRNA-DNA pairs. The convolution kernels of the CNN are smaller than the input matrix, so the convolution operation extracts local features, as in image processing: each neuron need not perceive the global input but only a local region, and local information is then integrated at a higher level to obtain global information. The parameter sharing of the CNN also greatly reduces computation. In addition, we set convolution kernels of different sizes at different levels of the convolutional part and used multiple kernels to convolve the input, so as to extract local features as comprehensively as possible. Furthermore, the GloVe method uses the statistics of global word co-occurrence to learn word vectors, combining the advantages of global statistical information with the local context window method. We used this method in place of the traditional "one-hot" representation, giving the input sequences of CnnCrispr better representational ability.
In the initial structural design of the model, we considered the need to extract both sequence context information and local region information, so we integrated the RNN and CNN models to improve feature extraction; the excellent prediction ability of the final network, CnnCrispr, was demonstrated by comparison with the pre-selected models. The final network structure is shown in Fig. 1: after the GloVe embedding, a biLSTM extracts context features, and the resulting two-dimensional matrix is further processed by five convolutional layers.
In the output layer of the network, the model was divided into classification schema and regression schema by setting different activation functions (softmax or sigmoid functions).
In the "Model Selection" section, we saw that the order of the RNN and CNN had a great impact on test performance; the model CnnCrispr_Conv_LSTM could not perform feature extraction and prediction well (see Section 2.2 and Table 2). We briefly analyzed the reason: the RNN can fully extract the contextual features of input sequences, while a preceding convolution operation breaks the internal connections of the sequence and impairs the RNN. Performing the RNN operation first to extract the sequence's context features, then using the CNN to extract local features and integrate them at a higher level into global feature information, improves the prediction ability of CnnCrispr.
Compared with the four existing state-of-the-art prediction models, CnnCrispr had better prediction ability on the highly unbalanced test sets from DeepCrispr. In the "leave-one-sgRNA-out" experiment, a mean auPRC of 0.471 and a mean Pearson value of 0.502 were achieved, showing that CnnCrispr has a competitive advantage. In addition, CnnCrispr used only the sequence information of the sgRNA and its corresponding potential DNA fragments, forgoing the construction of artificial features, thus avoiding the introduction of invalid or interfering information and making the prediction results more convincing.
We hope that CnnCrispr can help clinical researchers narrow down the screening range of off-target site test and save researchers more time and energy.
Since 2014, the number of open-source data sets and online resources available for studying the application of machine learning to the CRISPR/Cas9 system has been increasing. As of this writing, the data set used for model training here is the largest presently available. With the continuous development of biological research technology, the number of available open-source data sets will gradually increase, which will further improve the generalization ability of CnnCrispr in the future.

Conclusion
In this paper, we built a novel sgRNA off-target propensity prediction model, CnnCrispr.
With the introduction of the GloVe model, CnnCrispr attempts a new feature representation that embeds sequence information into the deep learning model, combines an RNN with a CNN, and uses only sequence information to predict the off-target propensity of an sgRNA at specific sites. Comparison with existing prediction models further confirmed the superior prediction ability of CnnCrispr. Our model used deep learning to learn sequence features automatically, without manual feature engineering.

Figure 5: "Leave-one-sgRNA-out" performance comparison.