### Dataset

The dataset we used for training, validation, and testing is the DeepHF dataset [17]. From its source data, we extracted 55,604, 58,617, and 56,888 sgRNAs with measured activity (represented by insertion/deletion (indel) frequency) for WT-SpCas9, eSpCas9(1.1), and SpCas9-HF1, respectively, and used the same partitioning method to divide the data into training and test sets.

### Sequence encoding and embedding

For the encoding process, we use the complementary base to represent each original base in the sgRNA. We then apply a one-hot encoding strategy: each base in the sgRNA is encoded as a four-dimensional vector (A, T, G, and C are encoded as [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively), called a one-hot vector. An sgRNA can thus be represented as a matrix \(X_{oh} \in R^{l \times 4}\), named the one-hot matrix; since each one-hot vector is zero in all but one dimension, this matrix is somewhat sparse. We believe it is meaningful to regard *X*_{oh} as a binary image; therefore, it is used as the input of a CNN, which performs well in the image field.
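The encoding rule above can be sketched as follows; this is a minimal illustration (the helper name and the `use_complement` flag are ours, and the base order follows the encoding described above):

```python
import numpy as np

# Base order matches the encoding rule above: A, T, G, C.
BASE_INDEX = {"A": 0, "T": 1, "G": 2, "C": 3}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def one_hot_encode(sgrna, use_complement=True):
    """Encode an sgRNA string into an l x 4 one-hot matrix X_oh."""
    seq = [COMPLEMENT[b] for b in sgrna] if use_complement else list(sgrna)
    x_oh = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        x_oh[i, BASE_INDEX[base]] = 1.0
    return x_oh

x = one_hot_encode("GACT", use_complement=False)
# each row has exactly one non-zero entry, so X_oh is sparse
```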

To facilitate the training process, we can map each one-hot vector into a dense, real-valued *m*-dimensional space, an operation called embedding. At the matrix level, the formula is as follows:

$$X_{e} = X_{oh} E_{m}$$

(1)

where *X*_{e} is named the embedding matrix, \(E_{m} \in R^{4 \times m}\) is a trainable transformation matrix, and *m* is the dimension of the embedding space. We believe it is also meaningful to regard the nucleotides in the sgRNA sequence as words and the sgRNA sequence itself as a sentence; under this view, *E*_{m} is the word embedding matrix and *X*_{e} is the sentence embedding in NLP terms. Therefore, *X*_{e} is used as the input of an RNN (or its variant), which performs well in the NLP field.
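At the matrix level, Eq. (1) is a plain matrix product: row *i* of *X*_{e} is just the row of *E*_{m} selected by the nucleotide at position *i*. A quick sketch (a random *E*_{m} stands in for the trained one; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
l, m = 21, 16                                   # sequence length, embedding dim
X_oh = np.eye(4)[rng.integers(0, 4, size=l)]    # random one-hot matrix, l x 4
E_m = rng.standard_normal((4, m))               # trainable embedding matrix (random here)

X_e = X_oh @ E_m                                # Eq. (1): embedding matrix, l x m
# row i of X_e equals the row of E_m picked out by the one-hot vector at position i
```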

As each element of *X*_{oh} is interpretable (it represents whether the corresponding nucleotide type occurs at the corresponding location), we call *X*_{oh} the spatial input, and the CNN that works on *X*_{oh} the method in the spatial domain. In contrast, *X*_{e} is interpretable only along its first dimension (row *i* is the embedding vector of the nucleotide at position *i*), and an embedding vector is difficult for humans to understand. We therefore call *X*_{e} the temporal input, and the RNN (or its variant) that works on *X*_{e} belongs to the method in the temporal domain.

### Neural network architecture

Based on the categorization above, we assume that the methods in the spatial domain and the temporal domain are heterogeneous, which satisfies the diversity premise of ensemble learning. On this assumption, we follow the stacking strategy of ensemble learning to develop AttCRISPR, which extracts potential feature representations of the sgRNA sequence in the spatial and temporal domains in parallel. Further, we apply attention mechanisms in both domains to enhance the interpretability of AttCRISPR.

### First-order preference and second-order preference

To introduce the neural network architecture of AttCRISPR, let us define first-order preference and second-order preference for convenience. Taking a simple linear regression model as an example, for input \(X \in R^{l}\), where *l* refers to the length of the base sequence, the predicted activity *y* is as follows:

$$y = A \cdot X = \sum\limits_{i}^{l} {A_{i} X_{i} }$$

(2)

where \(A \in R^{l}\). The total differential of *y* in Eq. (2) is as follows:

$$dy = \sum\limits_{i}^{l} {A_{i} dX_{i} }$$

(3)

where *A*_{i} and *X*_{i} denote the *i*-th dimensions of the vectors *A* and *X*, respectively. *A*_{i} indicates how dramatically the function changes as *X*_{i} changes in a neighborhood of *X*; in other words, it measures the importance of *X*_{i}. That is why we call *A* the first-order preference in our paper. Specifically, we use a vector *A*_{i} to build the first-order original preference at position *i* within the sgRNA sequence, and *X*_{i} is an embedding of the *i*-th position, so *A* and *X* become two matrices. Further, the final result can be weighted by a trainable non-negative weight vector \(W \in R^{l}\), as follows:

$$y = W \cdot AX^{T}$$

(4)

then we define \(\tilde{A}\) as the first-order combined preference matrix (or simply the first-order preference), which means \(\tilde{A}\) can be expressed linearly in terms of *A* as follows:

$$\tilde{A} = BA$$

(5)

where the weight matrix \(B \in R^{l \times l}\) is learned through an attention mechanism. We call *B* the second-order preference matrix in our paper, as its calculation is based on the first-order preference; it can explain how a particular pattern containing two nucleotides affects the activity of the base sequence. Then the predicted value can be expressed as:

$$y = W \cdot \tilde{A}X_{e}^{T}$$

(6)

### The method in spatial domain

As demonstrated in Fig. 2, the method in the spatial domain relies on a CNN. As previously mentioned, the sgRNA sequence is encoded into a 21 × 4 one-hot matrix *X*_{oh}, which we regard as a binary image. Convolution kernels of different sizes are then used to extract potential spatial features, as is commonly done in computer vision. Accordingly, the spatial attention module [22], which has been used to improve the performance of CNNs in vision tasks, can be applied in our method.

As shown in Additional file 1: Supplementary Figures Fig. S1, for a given one-hot matrix *X*_{oh}, the spatial attention module generates a spatial attention matrix \(A_{s} \in R^{l \times 4}\) with the same shape as *X*_{oh}. Each element of *A*_{s} is constrained to the range zero to one by a sigmoid function and reflects the importance of the corresponding element of *X*_{oh}. The overall spatial attention process can be summarized as:

$$\left\{ \begin{gathered} X_{mc} = f^{3 \times 4} (X_{oh} ) \hfill \\ A_{s} = \sigma (f^{3 \times 2} ([AvgPool(X_{mc} );MaxPool(X_{mc} )])) \hfill \\ X_{rf} = A_{s} \otimes X_{oh} \hfill \\ \end{gathered} \right.$$

(7)

where \(f^{p \times q}\) represents a convolution operation with a filter size of *p* × *q* (\(p,q \in Z^{+}\)), *X*_{mc} is a multi-channel map generated from *X*_{oh}, \(\sigma ( \cdot )\) denotes the sigmoid function, \(AvgPool( \cdot )\) and \(MaxPool( \cdot )\) denote the average-pooling and max-pooling operations, and ⊗ denotes element-wise multiplication. The spatial attention matrix *A*_{s} formally conforms to our definition of first-order preference (each element of *X*_{oh} is multiplied by the corresponding element of *A*_{s}); in other words, the elements of *A*_{s} reveal how important the corresponding elements of *X*_{oh} are. We think this can reveal the preference of the scoring function at each position. For instance, following the encoding rules above, we trained the spatial domain part of AttCRISPR with the WT-SpCas9 dataset and then averaged all spatial attention matrices. The element in the first row and third column is close to 1, which means that, when calculating the final score, G typically makes an important contribution at the first position within the sgRNA sequence. In fact, this agrees with early studies of the human U6 (hU6) promoter, which is believed to require G as the first nucleotide of its transcript [1,2,3].
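The spatial attention process of Eq. (7) can be sketched in PyTorch as below. This is a minimal illustration under our own assumptions: the channel count of *X*_{mc} and the use of `padding="same"` (to keep the *l* × 4 shape) are our choices, and the pooling is taken across channels in the CBAM style:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of Eq. (7): CBAM-style spatial attention over the one-hot 'image'."""
    def __init__(self, channels=8):
        super().__init__()
        # f^{3x4}: produces the multi-channel map X_mc from X_oh
        self.f_mc = nn.Conv2d(1, channels, kernel_size=(3, 4), padding="same")
        # f^{3x2}: maps the pooled 2-channel map to the attention matrix A_s
        self.f_att = nn.Conv2d(2, 1, kernel_size=(3, 2), padding="same")

    def forward(self, x_oh):                      # x_oh: (batch, 1, l, 4)
        x_mc = self.f_mc(x_oh)                    # multi-channel map X_mc
        avg = x_mc.mean(dim=1, keepdim=True)      # AvgPool over channels
        mx, _ = x_mc.max(dim=1, keepdim=True)     # MaxPool over channels
        a_s = torch.sigmoid(self.f_att(torch.cat([avg, mx], dim=1)))
        return a_s * x_oh, a_s                    # X_rf = A_s (x) X_oh, and A_s

x = torch.zeros(2, 1, 21, 4)
x[..., 0] = 1.0                                   # dummy one-hot input
x_rf, a_s = SpatialAttention()(x)
```

Because *A*_{s} multiplies *X*_{oh} element-wise, each entry of *A*_{s} directly scales one (position, nucleotide) cell, which is what lets it be read as a first-order preference.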

### The method in the temporal domain

As shown in Fig. 3, the temporal domain part of AttCRISPR relies on an RNN (or its variant). As previously mentioned, we map each one-hot vector into a dense real-valued space following Eq. 1, which generates the embedding matrix *X*_{e}, and we regard *X*_{e} as sequential (temporal) data. RNNs and their variants have shown outstanding performance in tasks with temporal data (for instance, NLP and sequential recommendation), which is why we prefer them for extracting potential temporal features. To be precise, we adopt the encoder-decoder architecture, which has proven effective in Seq2Seq tasks. Two main differences we must accommodate are that sgRNA is not a natural language in the traditional sense, and that we do not need to translate it into another sequence. Accordingly, the embedding matrix *X*_{e} is used as the input of both the encoder and the decoder, and the output sequence of the decoder is used to build the first-order preference \(\tilde{A}\) of the sgRNA sequence. As mentioned above, the predicted value *y* should satisfy Eq. 6.

On this basis, we apply the attention mechanism, which has been widely used in NLP tasks, to the temporal domain method of AttCRISPR, and name it the temporal attention module. The temporal attention module satisfies the following equation:

$$Attention(Q,K,V) = align(Q,K)V$$

(8)

where *Q*, *K*, and *V* are the query, key, and value matrices [21, 23].

As Additional file 1: Supplementary Figures Fig. S2 shows, in our attention module they are calculated by the following equation:

$$\left\{ \begin{gathered} K_{i} = Encoder(X_{ei} ,\theta_{E} ,K_{i - 1} ) \hfill \\ Q_{i} = Decoder(X_{ei} ,\theta_{D} ,Q_{i - 1} ) \hfill \\ V = K \hfill \\ \end{gathered} \right.$$

(9)

where the vectors *K*_{i} and *Q*_{i} denote the *i*-th rows of the matrices *K* and *Q*, respectively, *Encoder*(·) and *Decoder*(·) are independent GRU units, and \(\theta_{E}\) and \(\theta_{D}\) denote the corresponding parameters of the GRU networks. In the actual implementation, we apply bidirectional GRU networks for better performance; for the sake of conciseness, we show a conventional GRU network here. The function *align*(·) is as follows:
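Equation (9) can be sketched with two independent GRUs, as below. This is a unidirectional simplification (the paper uses bidirectional GRUs), and the hidden size `h` is our placeholder; the softmax term is the first factor of Eq. (11), before the Gaussian damping is applied:

```python
import torch
import torch.nn as nn

l, m, h = 21, 16, 32                        # length, embedding dim, hidden size
encoder = nn.GRU(m, h, batch_first=True)    # Encoder(.) in Eq. (9)
decoder = nn.GRU(m, h, batch_first=True)    # Decoder(.) in Eq. (9), independent params

x_e = torch.randn(1, l, m)                  # embedding matrix X_e (batch of 1)
K, _ = encoder(x_e)                         # K_i = encoder hidden state at step i
Q, _ = decoder(x_e)                         # Q_i = decoder hidden state at step i
V = K                                       # V = K, per Eq. (9)

# softmax(Q_i K^T): the attention weights before element-wise damping by G
scores = torch.softmax(Q @ K.transpose(1, 2), dim=-1)
```

Feeding *X*_{e} to both GRUs reflects the point above that there is no target sequence to translate into: the decoder re-reads the same sgRNA rather than generating a different output.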

$$B_{i} = softmax(Q_{i} K^{T} ) \otimes G_{i}$$

(11)

$$G_{ij} = \left\{ {\begin{array}{*{20}l} {\exp \left( {\frac{{(i - j)^{2} }}{ - 2\sigma }} \right),} \hfill & {\left| {i - j} \right| \le \sigma } \hfill \\ {0,} \hfill & {\left| {i - j} \right| > \sigma } \hfill \\ \end{array} } \right.$$

(12)

where the matrix \(B \in R^{l \times l}\) is the second-order preference we need, and the vector *B*_{i} denotes the *i*-th row of *B*. \(G \in R^{l \times l}\) is a damping matrix based on the Gaussian function. Following the simple belief that the closer a base is to the *i*-th position, the more influence it has on that position, we use the damping matrix *G* to constrain the network's learning. \(\sigma\) represents a length threshold: any base farther than \(\sigma\) from position *i* is considered to have no effect on it. Further, if we think of the value matrix *V* as a vector form of the first-order preference *A* in Eq. 5, we reach the following equation:

$$\tilde{A} = BV$$

(13)

As mentioned above, the value matrix *V* comes from the hidden states of a bidirectional GRU network, which is usually hard to interpret, while *B* is the second-order preference matrix obtained by the attention mechanism. We believe that the *j*-th dimension of *B*_{i}, denoted *B*_{ij}, can reveal the effect of the base at position *j* on position *i* in the biological sense.
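The damping matrix of Eq. (12) is straightforward to construct; a small sketch following the formula as written (with \(\sigma\) in the denominator), where the value of \(\sigma\) here is illustrative:

```python
import numpy as np

def damping_matrix(l, sigma):
    """Eq. (12): Gaussian damping; positions farther than sigma have no influence."""
    i = np.arange(l)[:, None]                 # row index, broadcast over columns
    j = np.arange(l)[None, :]                 # column index, broadcast over rows
    g = np.exp((i - j) ** 2 / (-2.0 * sigma)) # exp((i-j)^2 / -2*sigma)
    g[np.abs(i - j) > sigma] = 0.0            # hard cutoff beyond the threshold
    return g

G = damping_matrix(21, sigma=4)
# diagonal entries are exp(0) = 1; entries beyond the threshold are exactly 0
```

Multiplying the softmax weights element-wise by *G* therefore zeroes out attention between positions more than \(\sigma\) bases apart, encoding the locality belief stated above.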

### Ensemble model following stacking strategy

Some indirect sgRNA features that cannot be obtained directly by deep learning, including position accessibilities of the secondary structure, the stem-loop of the secondary structure, melting temperature, and GC content, are strongly associated with sgRNA activity. It is worth noting that these hand-crafted biological features are not standardized in the work of others [17, 24]. Because the data span a wide range, we standardize them using the Z-score.
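The Z-score standardization is applied per feature column; a minimal sketch with purely illustrative values (the two columns stand for, e.g., melting temperature and GC content):

```python
import numpy as np

# Hand-crafted biological features (rows: sgRNAs; columns: e.g. melting
# temperature, GC content). These numbers are illustrative only.
features = np.array([[67.2, 0.45],
                     [71.8, 0.60],
                     [64.9, 0.35]])

mu = features.mean(axis=0)          # per-feature mean
sd = features.std(axis=0)           # per-feature standard deviation
z = (features - mu) / sd            # Z-score: zero mean, unit variance per column
```

Standardizing per column keeps features with large numeric ranges (melting temperature) from dominating features on a [0, 1] scale (GC content) in the fully connected network.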

Then we use a simple fully connected network to extract the indirect features, and call its output *y*_{bio}. As mentioned above, we assume that the methods in the spatial and temporal domains satisfy the diversity premise of ensemble learning, which is why we follow the stacking strategy to integrate them. Specifically, *y*_{bio}, the spatial output *y*_{s}, and the temporal output *y*_{t} obtained earlier are concatenated, and a weighted average is then computed through a fully connected layer as follows:

$$y = W[y_{bio} ;y_{s} ;y_{t} ]$$

(14)

where *y* is the final prediction of AttCRISPR and *W* is the weight learned by the fully connected network. In the actual implementation, we first freeze the networks in the spatial and temporal domains so that the network focuses on learning the weight *W*; then the parameters of the entire network are adjusted during the fine-tuning of AttCRISPR.
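The stacking step of Eq. (14) reduces to a dot product between the learned weights and the concatenated branch outputs; a tiny sketch in which both the branch outputs and *W* are illustrative stand-ins (in AttCRISPR, *W* is learned by the fully connected layer):

```python
import numpy as np

# Illustrative outputs of the three branches (not real model outputs)
y_bio, y_s, y_t = 0.42, 0.57, 0.61

# Stand-in for the weight vector learned by the fully connected layer
W = np.array([0.20, 0.35, 0.45])

# Eq. (14): y = W [y_bio ; y_s ; y_t]
y = W @ np.array([y_bio, y_s, y_t])   # final AttCRISPR prediction
```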