Dataset
The dataset used for training, validation and testing is the DeepHF dataset [17]. From its source data we extracted 55604, 58617 and 56888 sgRNAs with activity values (represented by insertion/deletion (indel) frequencies) for WT-SpCas9, eSpCas9(1.1) and SpCas9-HF1, respectively, and used the same partition method to divide the data into training and test sets.
Sequence encoding and embedding
For the encoding process, we use the complementary base to represent each original base in the sgRNA. We then apply a one-hot encoding strategy: each base is encoded as a four-dimensional vector (A, T, G and C are encoded as [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1], respectively), called a one-hot vector. An sgRNA can therefore be considered as a matrix \(X_{oh} \in R^{l \times 4}\), named the one-hot matrix, which is rather sparse since each one-hot vector is zero in all but one dimension. We believe it is meaningful to regard Xoh as a binary image; it is therefore used as the input of the CNN, which performs well in the image field. Meanwhile, as noted, the one-hot matrix is sparse.
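As a concrete illustration, the following Python sketch implements the encoding rule described above (complementary-base substitution followed by one-hot encoding); the 21-nt sequence in the example is hypothetical.

```python
import numpy as np

# Base order follows the encoding rule in the text: A, T, G, C.
BASE_INDEX = {"A": 0, "T": 1, "G": 2, "C": 3}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def one_hot_encode(sgrna: str) -> np.ndarray:
    """Return the l x 4 one-hot matrix X_oh for an sgRNA sequence.

    Each base is first replaced by its complementary base and then mapped
    to a four-dimensional one-hot vector.
    """
    x_oh = np.zeros((len(sgrna), 4), dtype=np.float32)
    for i, base in enumerate(sgrna.upper()):
        x_oh[i, BASE_INDEX[COMPLEMENT[base]]] = 1.0
    return x_oh

# Example with a hypothetical 21-nt sgRNA: the result is a 21 x 4 binary matrix.
X_oh = one_hot_encode("GACGATCGATCGATCGATCGA")
print(X_oh.shape)  # (21, 4)
```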
To facilitate training, we can map each one-hot vector into a dense real-valued high-dimensional space, an operation called embedding. At the matrix level, the formula is as follows:
$$X_{e} = X_{oh} E_{m}$$
(1)
where Xe is called the embedding matrix, \(E_{m} \in R^{4 \times m}\) is a trainable transformation matrix, and m is the dimension of the embedding space. We believe it is also meaningful to regard the nucleotides in the sgRNA sequence as words and the sgRNA sequence itself as a sentence; under this view, Em is the word-embedding matrix and Xe is the sentence embedding in NLP terms. Therefore, Xe is used as the input of the RNN (or its variant), which performs well in the NLP field.
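The following NumPy sketch illustrates Eq. (1); the embedding dimension m = 32 and the random matrices are placeholder assumptions, and in the real model Em is learned during training.

```python
import numpy as np

l, m = 21, 32                                        # sequence length and assumed embedding dimension
rng = np.random.default_rng(0)

E_m = rng.normal(size=(4, m)).astype(np.float32)     # trainable in the real model
X_oh = np.eye(4, dtype=np.float32)[rng.integers(0, 4, size=l)]   # stand-in one-hot matrix

X_e = X_oh @ E_m                                     # Eq. (1): embedding as a matrix product, shape (l, m)

# Because each row of X_oh contains a single 1, this product is equivalent to a
# row lookup into E_m, i.e. the usual word-embedding table lookup in NLP.
assert np.allclose(X_e, E_m[X_oh.argmax(axis=1)])
```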
As each element of Xoh is interpretable (it indicates whether the corresponding nucleotide type is present at the corresponding position), we call Xoh the spatial input, and the CNN that works on Xoh the method in the spatial domain. In contrast, Xe is interpretable only along its first dimension (each row is the embedding vector of the corresponding nucleotide), and an embedding vector itself is difficult for humans to understand. We therefore call Xe the temporal input, and the RNN (or its variant) that works on Xe the method in the temporal domain.
Neural network architecture
Based on the categorization above, we assume that the methods in the spatial and temporal domains are heterogeneous, which satisfies the diversity premise of ensemble learning. Under this assumption we follow the stacking strategy to develop AttCRISPR, which extracts potential feature representations of the sgRNA sequence in the spatial and temporal domains in parallel. Further, we apply attention mechanisms in both domains to enhance the interpretability of AttCRISPR.
First-order preference and second-order preference
To introduce the neural network architecture of AttCRISPR, let us first define first-order preference and second-order preference for convenience. Taking a simple linear regression model as an example, for input \(X \in R^{l}\), where l refers to the length of the base sequence, the predicted activity y is as follows:
$$y = A \cdot X = \sum\limits_{i}^{l} {A_{i} X_{i} }$$
(2)
where \(A \in R^{l}\). The total differential of y in Eq. (2) is as follows:
$$dy = \sum\limits_{i}^{l} {A_{i} dX_{i} }$$
(3)
where Ai and Xi denote the i-th dimensions of the vectors A and X, respectively. Ai indicates how dramatically the function changes as Xi changes in a neighborhood of X, in other words, the importance of Xi. That is why we call A the first-order preference in our paper. Specifically, we use a vector Ai to build the first-order original preference at position i within the sgRNA sequence, and Xi is the embedding of the i-th feature, so that A and X become two matrices. Further, the final result can be weighted by a trainable non-negative weight vector \(W \in R^{l}\), as follows:
$$y = W \cdot AX^{T}$$
(4)
then we define \(\tilde{A}\) as the first-order combined preference matrix (or simply the first-order preference), which means that \(\tilde{A}\) can be expressed linearly in terms of A as follows:
$$\tilde{A} = BA$$
(5)
where the weight matrix \(B \in R^{l \times l}\) is learned through an attention mechanism. We call B the second-order preference matrix in our paper because its calculation is based on the first-order preference; it can explain how a particular pattern involving two nucleotides within the base sequence affects the prediction. The predicted value can then be expressed as:
$$y = W \cdot \tilde{A}X_{e}^{T}$$
(6)
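To make the preference formalism concrete, the following NumPy sketch evaluates Eqs. (5) and (6) with random placeholder values; reading the product in Eq. (6) row-wise (so that only the diagonal of \(\tilde{A}X_{e}^{T}\) contributes) is our interpretation of the matrix notation.

```python
import numpy as np

l, m = 21, 32                      # assumed sequence length and embedding size
rng = np.random.default_rng(0)

X_e = rng.normal(size=(l, m))      # embedded sgRNA matrix, Eq. (1)
A = rng.normal(size=(l, m))        # first-order original preference, one row per position
B = rng.random(size=(l, l))        # second-order preference, learned by attention in the model
W = rng.random(size=l)             # trainable non-negative weight vector

A_tilde = B @ A                    # Eq. (5): first-order combined preference

# Eq. (6), read row-wise: position i contributes W_i * (A_tilde_i . X_e_i).
y = W @ np.diag(A_tilde @ X_e.T)
print(float(y))
```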
The method in the spatial domain
As demonstrated in Fig. 2, the method in the spatial domain relies on a CNN. As previously mentioned, the sgRNA sequence is encoded into a 21 × 4 one-hot matrix Xoh, which we regard as a binary image. Convolution kernels of different sizes are then used to extract potential spatial features, just as in other computer-vision work. Accordingly, a spatial attention module [22], which has been used to improve the performance of CNNs in vision tasks, can be applied in our method.
As shown in Additional file 1: Supplementary Figures Fig. S1, for a given one-hot matrix Xoh, the spatial attention module generates a spatial attention matrix \(A_{s} \in R^{l \times 4}\) with the same shape as Xoh. Each element of As is constrained to the range zero to one by a sigmoid function and reflects the importance of the corresponding element of Xoh. The overall spatial attention process can be summarized as:
$$\left\{ \begin{gathered} X_{mc} = f^{3 \times 4} (X_{oh} ) \hfill \\ A_{s} = \sigma (f^{3 \times 2} ([AvgPool(X_{mc} );MaxPool(X_{mc} )])) \hfill \\ X_{rf} = A_{s} \otimes X_{oh} \hfill \\ \end{gathered} \right.$$
(7)
where \(f^{p \times q}\) represents a convolution operation with a filter size of p × q, \(p,q \in Z^{ + }\), Xmc is a multi-channel map generated from Xoh, \(\sigma ( \cdot )\) denotes the sigmoid function, \(AvgPool( \cdot )\) denotes the average-pooling operation, \(MaxPool( \cdot )\) denotes the max-pooling operation, and ⊗ denotes element-wise multiplication. The spatial attention matrix As formally conforms to our definition of first-order preference (each element of Xoh is multiplied by the corresponding element of As); in other words, the elements of As reveal how important the corresponding elements of Xoh are. We think it can reveal the preference of the scoring function at each position. For instance, following the encoding rules above, we trained the spatial-domain part of AttCRISPR on the WT-SpCas9 dataset and averaged all spatial attention matrices; the element in the first row and third column is close to 1, which means that G typically makes an important contribution at the first position of the sgRNA sequence when the final score is calculated. In fact, this corresponds to early studies of the human U6 (hU6) promoter, which is believed to require G as the first nucleotide of its transcript [1,2,3].
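The following PyTorch sketch mirrors Eq. (7); the channel count, the "same" padding, and pooling over the channel axis (as in CBAM-style spatial attention) are assumptions rather than the exact configuration used in AttCRISPR.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """A minimal sketch of Eq. (7); channel count and padding are assumptions."""

    def __init__(self, channels: int = 16):
        super().__init__()
        # f^{3x4}: lift the one-hot "image" to a multi-channel map X_mc.
        self.f_mc = nn.Conv2d(1, channels, kernel_size=(3, 4), padding="same")
        # f^{3x2}: fuse the pooled maps into a single attention map.
        self.f_att = nn.Conv2d(2, 1, kernel_size=(3, 2), padding="same")

    def forward(self, x_oh: torch.Tensor) -> torch.Tensor:
        # x_oh: (batch, 1, l, 4) one-hot matrix viewed as a binary image.
        x_mc = self.f_mc(x_oh)                              # (batch, C, l, 4)
        avg_pool = x_mc.mean(dim=1, keepdim=True)           # pooling over channels (assumed)
        max_pool = x_mc.max(dim=1, keepdim=True).values
        a_s = torch.sigmoid(self.f_att(torch.cat([avg_pool, max_pool], dim=1)))
        return a_s * x_oh                                   # X_rf = A_s (element-wise) X_oh

# Hypothetical 21 x 4 one-hot input.
x_oh = torch.zeros(1, 1, 21, 4)
x_oh[0, 0, torch.arange(21), torch.randint(0, 4, (21,))] = 1.0
print(SpatialAttention()(x_oh).shape)   # torch.Size([1, 1, 21, 4])
```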
The method in the temporal domain
As shown in Fig. 3, the temporal-domain part of AttCRISPR relies on an RNN (or its variant). As previously mentioned, we map each one-hot vector into a dense real-valued high-dimensional space following Eq. (1), which generates the embedded matrix Xe, and we regard Xe as sequential (temporal) data. RNNs (or their variants) have shown outstanding performance on tasks with temporal data (for instance, NLP and sequential recommendation), which is why we use them to extract potential temporal features. To be precise, we adopt the encoder-decoder architecture, which has proven effective in Seq2Seq tasks. Two main differences we must deal with are that sgRNA is not a natural language in the traditional sense and that we do not need to translate it into another sequence. To accommodate these, the embedded matrix Xe is used as the input of both the encoder and the decoder, and the output sequence of the decoder is used to build the first-order preference \(\tilde{A}\) of the sgRNA sequence. As mentioned above, the predicted value y should satisfy Eq. (6).
On this basis, we apply the attention mechanism, which has been widely used in NLP tasks, to the temporal-domain method of AttCRISPR and name it the temporal attention module. The temporal attention module satisfies the following equation:
$$Attention(Q,K,V) = align(Q,K)V$$
(8)
where Q, K and V are the query, key and value matrices [21, 23].
As Additional file 1: Supplementary Figures Fig. S2 shows, in our attention module they are calculated by the following equations:
$$\left\{ \begin{gathered} K_{i} = Encoder(X_{ei} ,\theta_{E} ,K_{i - 1} ) \hfill \\ Q_{i} = Decoder(X_{ei} ,\theta_{D} ,Q_{i - 1} ) \hfill \\ V = K \hfill \\ \end{gathered} \right.$$
(9)
where the vectors Ki and Qi denote the i-th rows of the matrices K and Q, respectively, Encoder(·) and Decoder(·) are independent GRU units, and \(\theta_{E}\) and \(\theta_{D}\) denote the corresponding parameters of the GRU networks. In the actual implementation we apply bidirectional GRU networks for better performance; for conciseness, a conventional GRU network is shown here. The function align(·) is as follows:
$$B_{i} = softmax(Q_{i} K^{T} ) \otimes G_{i}$$
(11)
$$G_{ij} = \left\{ {\begin{array}{*{20}l} {\exp \left( {\frac{{(i - j)^{2} }}{ - 2\sigma }} \right),} \hfill & {\left| {i - j} \right| \le \sigma } \hfill \\ {0,} \hfill & {\left| {i - j} \right| > \sigma } \hfill \\ \end{array} } \right.$$
(12)
where the matrix \(B \in R^{l \times l}\) is the second-order preference we need, and the vector Bi denotes the i-th row of B. \(G \in R^{l \times l}\) is a damping matrix based on the Gaussian function. Following the simple belief that the closer a base is to the i-th position, the greater its influence on that position, we use the damping matrix G to constrain the learning of the network. \(\sigma\) is a length threshold: any base farther than this length from position i is not considered to affect it. Further, if we regard the values matrix V as a vector form of the first-order preference A in Eq. (5), we reach the following equation:
$$\tilde{A} = BV$$
(13)
As mentioned above, the values matrix V comes from the hidden states of a bidirectional GRU network, which is usually hard to interpret, whereas B is the second-order preference matrix obtained by the attention mechanism. We believe that the j-th dimension of Bi, denoted Bij, can reveal the effect of the base at position j on position i in the biological sense.
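The following PyTorch sketch summarizes Eqs. (9), (11), (12) and (13); the sizes are assumed placeholders, plain (rather than bidirectional) GRUs are used to keep it short, and it is meant only to illustrate how K, Q, V, G, B and \(\tilde{A}\) relate.

```python
import torch
import torch.nn as nn

l, m, h, sigma = 21, 32, 16, 3.0   # length, embedding size, GRU size, damping threshold (assumed)

def damping_matrix(l: int, sigma: float) -> torch.Tensor:
    """Gaussian damping matrix G of Eq. (12)."""
    idx = torch.arange(l, dtype=torch.float32)
    dist = idx[:, None] - idx[None, :]
    g = torch.exp(-dist.pow(2) / (2.0 * sigma))
    return torch.where(dist.abs() <= sigma, g, torch.zeros_like(g))

encoder = nn.GRU(m, h, batch_first=True)   # the paper uses bidirectional GRUs; plain GRUs keep this short
decoder = nn.GRU(m, h, batch_first=True)

x_e = torch.randn(1, l, m)                 # embedded sgRNA, Eq. (1)
K, _ = encoder(x_e)                        # keys: encoder hidden states, Eq. (9)
Q, _ = decoder(x_e)                        # queries: decoder hidden states
V = K                                      # values are tied to the keys

# Eq. (11): row-wise softmax alignment, damped element-wise by G.
G = damping_matrix(l, sigma)
B = torch.softmax(Q @ K.transpose(1, 2), dim=-1) * G   # second-order preference, (1, l, l)

# Eq. (13): reading V as the first-order preference, A_tilde = B V.
A_tilde = B @ V                            # (1, l, h)
print(B.shape, A_tilde.shape)
```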
Ensemble model following stacking strategy
Some indirect sgRNA features that cannot be obtained directly by deep learning, including the position accessibilities of the secondary structure, the stem-loop of the secondary structure, the melting temperature, and the GC content, are strongly associated with sgRNA activity. It is worth noting that these hand-crafted biological features are not standardized in the work of others [17, 24]. Because they span a wide range of values, we standardize them using the Z-score.
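A minimal sketch of the Z-score standardization applied to the hand-crafted features; the feature matrix layout and the number of features are assumptions.

```python
import numpy as np

def zscore(features: np.ndarray) -> np.ndarray:
    """Column-wise Z-score standardization of the hand-crafted biological features."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + 1e-8)   # small epsilon guards against constant columns

# `features` has one row per sgRNA and one column per hand-crafted feature
# (e.g. secondary-structure accessibility, stem-loop indicator, melting temperature, GC content).
features = np.random.rand(100, 11)            # 100 sgRNAs, 11 assumed features
print(zscore(features).mean(axis=0).round(6))
```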
Then we use a simple fully connected network to extract the indirect features and call its output ybio. As mentioned above, we assume that the methods in the spatial and temporal domains satisfy the diversity premise of ensemble learning; we therefore follow the stacking strategy to integrate them. Specifically, ybio, the spatial output ys and the temporal output yt obtained earlier are concatenated, and a weighted average is then computed through a fully connected layer as follows:
$$y = W[y_{bio} ;y_{s} ;y_{t} ]$$
(14)
where y is the final prediction of AttCRISPR and W is the weight learned by the fully connected layer. In the actual implementation, we first freeze the networks in the spatial and temporal domains so that the network focuses on learning the weight W. The parameters of the entire network are then adjusted during the fine-tuning of AttCRISPR.
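As an illustration of Eq. (14) and the two-stage training described above, the following PyTorch sketch combines the three sub-model outputs through a single fully connected layer; the input tensors stand in for the frozen sub-networks' outputs and are placeholders.

```python
import torch
import torch.nn as nn

class StackingHead(nn.Module):
    """Final stacking layer of Eq. (14): a weighted combination of the sub-model outputs."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 1, bias=False)   # its weights play the role of W

    def forward(self, y_bio, y_s, y_t):
        # Concatenate [y_bio; y_s; y_t] and take a learned weighted average.
        return self.fc(torch.cat([y_bio, y_s, y_t], dim=-1))

head = StackingHead()

# Placeholder outputs standing in for the (frozen) biological, spatial and temporal sub-models.
y_bio, y_s, y_t = torch.rand(8, 1), torch.rand(8, 1), torch.rand(8, 1)
print(head(y_bio, y_s, y_t).shape)   # torch.Size([8, 1])
```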