BreakNet takes as input a long-read alignment file and the corresponding reference genome, and it outputs deletion regions. BreakNet contains four main modules (Fig. 1). The first is a feature matrix generation module, which splits the reference sequence into subregions and transforms each subregion into a feature matrix based on the alignment information. The second is a CNN module, which maps each feature matrix to a feature vector. The third is a bidirectional recurrent neural network (BRNN) module, which analyses the feature vectors across multiple time steps in both the forward and backward directions, integrating deletion information from neighbouring subregions. The fourth is a classification module, which evaluates the vectors output by the BRNN module and determines whether a deletion has occurred.
Feature matrix generation module
The feature matrix generation module first divides the reference into subregions of length m (200 bp by default) and extracts the long reads aligned to each subregion. The module then parses the CIGAR string of each aligned long read and identifies the positions that refer to a deletion (D operations in the CIGAR string).
We assume that the number of aligned long reads is p. Then, the i-th aligned long read is converted to an m-tuple \(\left( {d_{i1} , d_{i2} \cdots d_{im} } \right)\), where dij represents whether the j-th position in this subregion refers to a deletion based on the CIGAR string of the i-th aligned long read. When the j-th position is a deletion, dij is set to 1; otherwise, it is set to 0.
These aligned long reads are then sorted by their deletion counts in descending order. Note that the number of aligned long reads generally differs between subregions, so the number of rows would otherwise differ between matrices. To generate matrices of the same size, we fix the number of rows to n (18 by default). If p is greater than or equal to n, the module keeps the first n rows to produce the matrix; otherwise, the elements of the remaining n − p rows are set to 0.
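As an illustration of this step, the following Python sketch builds one feature matrix from a coordinate-sorted, indexed BAM file. It assumes the pysam library; the function name build_feature_matrix and the exact read handling are illustrative rather than part of BreakNet's released code.

```python
import numpy as np
import pysam

BAM_CDEL = 2  # CIGAR operation code for a deletion ("D")

def build_feature_matrix(bam_path, chrom, start, m=200, n=18):
    """Encode the m-bp subregion [start, start + m) as an n x m binary matrix."""
    rows = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, start + m):
            row = np.zeros(m, dtype=np.uint8)
            ref_pos = read.reference_start
            for op, length in read.cigartuples or []:
                if op == BAM_CDEL:
                    # mark positions of this subregion spanned by the deletion
                    lo, hi = max(ref_pos, start), min(ref_pos + length, start + m)
                    if lo < hi:
                        row[lo - start:hi - start] = 1
                if op in (0, 2, 3, 7, 8):  # M, D, N, =, X consume the reference
                    ref_pos += length
            rows.append(row)
    # sort reads by deletion count in descending order, then pad/truncate to n rows
    rows.sort(key=lambda r: int(r.sum()), reverse=True)
    matrix = np.zeros((n, m), dtype=np.uint8)
    for i, row in enumerate(rows[:n]):
        matrix[i] = row
    return matrix
```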
CNN module
In this module, we take 100 consecutive transposed feature matrices as inputs. Each feature matrix is fed into a time-distributed CNN, which outputs a 160-dimensional feature vector to the next module. The CNN module consists of several operations, including average pooling, max pooling and 2D convolution (conv2D). In detail, the module first applies a 1 × 2 average pooling layer to downsample the input matrix and reduce the computational cost. Then, it uses 6 convolutional blocks to map each feature matrix to the output vector. Each convolutional block consists of a conv2D layer, a squeeze-and-excitation (SE) optimization layer and a max pooling layer [22]. As an example, the convolution of a matrix can be computed by the following equation:
$$O_{m^{\prime},n^{\prime}}\left[ row, col \right] = \sum_{i = 0}^{a - 1} \sum_{j = 0}^{b - 1} w_{ij} \times I_{m,n}\left[ row + i, col + j \right]$$
(1)
where \(I_{m,n}\) and \(O_{m^{\prime},n^{\prime}}\) are the input and output matrices, respectively, \(w\) is the \(a \times b\) weight matrix of the convolution kernel, and \(row\) and \(col\) index the rows and columns of the output matrix. Next, we use a rectified linear unit (ReLU) as the activation function to add nonlinearity.
$$x_{a} = \mathrm{ReLU}\left( x \right) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{if } x \ge 0 \end{cases}$$
(2)
To enhance the performance of the network, we use SE optimization to weight each computed convolution channel, allowing the network to adaptively adjust the importance of each feature map. Similar to EfficientNet [23], the CNN module implements the SE module with a 2D global average pooling layer and two 1 × 1 conv2D layers. Finally, we apply a max pooling operation to the weighted convolution output to reduce the number of parameters.
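Below is a minimal tf.keras sketch of this design (one SE-optimized convolutional block and the stack of six blocks). The filter counts, kernel size, SE reduction ratio and the global average pooling used to obtain the 160-dimensional vector are our assumptions, since the text only fixes the number of blocks and the output size.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, ratio=4):
    """Squeeze-and-excitation: 2D global average pooling + two 1 x 1 convolutions."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Reshape((1, 1, channels))(s)
    s = layers.Conv2D(channels // ratio, 1, activation="relu")(s)
    s = layers.Conv2D(channels, 1, activation="sigmoid")(s)
    return layers.Multiply()([x, s])  # channel-wise reweighting of the feature maps

def conv_block(x, filters):
    """conv2D -> SE optimization -> max pooling, as one convolutional block."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = se_block(x)
    return layers.MaxPooling2D(pool_size=2, padding="same")(x)

def build_cnn(n=18, m=200, filters=(16, 32, 32, 64, 64, 160)):
    """Map one transposed n x m feature matrix to a 160-dimensional vector."""
    inp = layers.Input(shape=(m, n, 1))                 # transposed feature matrix
    x = layers.AveragePooling2D(pool_size=(1, 2))(inp)  # 1 x 2 average pooling
    for f in filters:                                   # six convolutional blocks
        x = conv_block(x, f)
    out = layers.GlobalAveragePooling2D()(x)            # 160-dimensional feature vector
    return tf.keras.Model(inp, out)
```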
BRNN module
The BRNN module takes a matrix with dimensions \(\left( T, F \right)\) as input. F is set to 160, corresponding to the 160-dimensional vectors produced by the CNN module. T is the number of time steps of the BRNN module, and each time step holds the hidden representation of one feature matrix. The feature vectors of consecutive time steps correspond to adjacent subregions, preserving the forward and backward context of the read alignment. In the BRNN module, we use two BLSTM layers; each layer includes 64 LSTM units and captures both the forward and backward information of the input feature vectors. The value of an LSTM cell can be calculated recursively using the following formulas.
$$I_{t} = \mathrm{sigmoid}\left( W_{I} x_{t} + U_{I} h_{t - 1} + b_{I} \right)$$
(3)
$$F_{t} = \mathrm{sigmoid}\left( W_{F} x_{t} + U_{F} h_{t - 1} + b_{F} \right)$$
(4)
$$O_{t} = \mathrm{sigmoid}\left( W_{O} x_{t} + U_{O} h_{t - 1} + b_{O} \right)$$
(5)
$$C_{t} = F_{t} \odot C_{t - 1} + I_{t} \odot \tanh\left( W_{C} x_{t} + U_{C} h_{t - 1} + b_{C} \right)$$
(6)
$$h_{t} = O_{t} \odot \tanh\left( C_{t} \right)$$
(7)
\(I, F, O\) represent the activation vectors of the input gate, forget gate and output gate, respectively. \(C\) represents the state vector of the cell. \(W, U, b\) are the parameters of the LSTM cell. \(x, h\) are the input and output vectors of the LSTM cell, respectively. The output vector of an LSTM cell depends on the current input vector, the output vector of the hidden layer at the previous time step and the information stored in the LSTM cell.
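A minimal sketch of this stack with the Keras Bidirectional and LSTM layers, under the default setting of T = 100 time steps, might look as follows.

```python
from tensorflow.keras import layers

T, F = 100, 160                       # time steps and feature size
brnn_in = layers.Input(shape=(T, F))  # one 160-dim CNN vector per time step
h = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(brnn_in)
h = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(h)
# h has shape (T, 128): the forward and backward 64-unit states concatenated per step
```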
Classification module
The classification module uses two fully connected layers to classify the vectors output by the BRNN module. Dropout is applied after each fully connected layer to improve the generalization of the network. Finally, a sigmoid function is used to compute the output value of the fully connected layers. If the final output is greater than 0.5, the corresponding subregion is predicted to contain a deletion; otherwise, it is not.
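Putting the modules together, an end-to-end sketch (reusing the hypothetical build_cnn() from the CNN section and repeating the BLSTM stack) might look as follows; the hidden width of the first fully connected layer is an assumption, while the dropout rate of 0.4 and the 0.5 threshold follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

T, n, m = 100, 18, 200
seq_in = layers.Input(shape=(T, m, n, 1))              # 100 consecutive feature matrices
x = layers.TimeDistributed(build_cnn(n, m))(seq_in)    # (T, 160) feature vectors
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)   # BRNN module
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.TimeDistributed(layers.Dense(32, activation="relu"))(x)    # fully connected
x = layers.Dropout(0.4)(x)
out = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)
breaknet = tf.keras.Model(seq_in, out)                 # one score per subregion
# a subregion is called as a deletion when its score exceeds 0.5
```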
Breakpoint estimation
A deletion larger than the subregion length spans multiple subregions and therefore corresponds to multiple feature matrices. Adjacent subregions that are predicted as deletions are merged into a single larger region that represents the large deletion.
Next, for each region determined to contain a deletion, BreakNet assigns a more precise deletion location and size. For each long read aligned within the region, BreakNet extracts the deletion location and size from its CIGAR string if the deletion size is larger than 20 bp. After processing all of these long reads, BreakNet obtains a deletion set {(L1, S1), (L2, S2), …, (Lt, St)}, where Li and Si represent the location and size of the i-th deletion extracted from the CIGAR strings. BreakNet then places (Li, Si) and (Lj, Sj) into the same cluster if |Li − Lj| < 40. After iteratively processing all elements in the deletion set, BreakNet selects the cluster containing the most elements and reports the average location and size of its elements as the final location and size of the deletion for this region.
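One possible reading of this clustering rule, in which a call joins a cluster when it lies within 40 bp of that cluster's most recently added member, is sketched below; the function name and the exact linkage criterion are our assumptions.

```python
def refine_breakpoint(calls, max_dist=40):
    """Cluster (location, size) deletion calls and return the mean location and
    size of the largest cluster."""
    if not calls:
        return None
    clusters = []
    for loc, size in sorted(calls):
        # join the first cluster whose most recent member is within max_dist bp
        for cluster in clusters:
            if abs(loc - cluster[-1][0]) < max_dist:
                cluster.append((loc, size))
                break
        else:
            clusters.append([(loc, size)])
    best = max(clusters, key=len)                       # most supported cluster
    mean_loc = sum(l for l, _ in best) / len(best)
    mean_size = sum(s for _, s in best) / len(best)
    return round(mean_loc), round(mean_size)
```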
Model training
Using the high-confidence call set of a sample, we determine the real deletion regions in that sample. We then label the subregions that overlap a real deletion as 1, while the rest are labelled 0. The model learns its parameters by minimizing a loss function.
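A minimal sketch of this labelling step is shown below, assuming the real deletions are available as half-open (start, end) intervals for one chromosome; the function name is illustrative.

```python
def label_subregions(chrom_length, deletions, m=200):
    """Return one 0/1 label per m-bp subregion (1 if it overlaps a true deletion)."""
    n_regions = -(-chrom_length // m)            # ceiling division
    labels = [0] * n_regions
    for start, end in deletions:                 # half-open interval [start, end)
        first, last = start // m, (end - 1) // m
        for idx in range(first, min(last, n_regions - 1) + 1):
            labels[idx] = 1
    return labels
```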
However, the high-confidence call set used in this study has two key characteristics. First, the call set may not be comprehensive, and some true variants are missing from it. Second, the calls it does provide are relatively accurate, with few false positives. Based on these characteristics, we design a new loss function that satisfies the following requirements. First, if a deletion predicted by the model is not present in the high-confidence call set, the loss function should produce a smaller loss value and gradient. Second, if a deletion is present in the high-confidence call set but is not detected by the model, the loss function should yield a larger loss value and gradient.
$$\left[ \ln \left( prediction - label + 1 + a \right) \right]^{2}$$
(8)
where \(prediction\) is the output value of the model, \(label\) is the label of the input data point and \(a\) is a hyperparameter used to adjust the maximum output value of the loss function. We set \(a\) to 0.001 in this paper.
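Expressed as a tf.keras-compatible loss, and reading Eq. (8) as the square of the logarithm (the interpretation under which both requirements above hold), a sketch is:

```python
import tensorflow as tf

def asymmetric_loss(y_true, y_pred, a=0.001):
    """Squared-log loss of Eq. (8): missed true deletions (label 1, prediction ~0)
    cost about ln(a)^2 ~ 48, while predictions absent from the call set (label 0,
    prediction ~1) cost only about ln(2 + a)^2 ~ 0.48."""
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.square(tf.math.log(y_pred - y_true + 1.0 + a))
```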
Because deletions are rare, approximately 99% of the training samples are negative. To ensure that sufficient positive samples are included in each training step, we thoroughly shuffle the training data and use a large batch size. We use the Adam optimizer for training and set the learning rate to 0.001.
Two techniques are employed to prevent overfitting during model training. First, we add a dropout layer after each fully connected layer and set the dropout probability to 0.4. Second, we use early stopping during training: by monitoring the performance of the model on the validation set, we save the parameters with the best performance, and if the model does not improve for 10 consecutive epochs, we stop training and keep the best parameters as the final model.
We use TensorFlow 2.4 to implement the CNN, BRNN, and classification modules.
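Assuming the model and loss sketches above, the training configuration might be written as follows; train_dataset and val_dataset are placeholder tf.data.Dataset objects and the batch size is illustrative, while the learning rate of 0.001 and the early-stopping patience of 10 epochs follow the text.

```python
import tensorflow as tf

# train_dataset / val_dataset are placeholder tf.data.Dataset objects yielding
# (matrix sequence, per-subregion label) pairs
breaknet.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss=asymmetric_loss)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
breaknet.fit(train_dataset.shuffle(10_000).batch(400),
             validation_data=val_dataset.batch(400),
             epochs=100,
             callbacks=[early_stop])
```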