 Research
 Open access
 Published:
AutoCoV: tracking the early spread of COVID19 in terms of the spatial and temporal patterns from embedding space by Kmer based deep learning
BMC Bioinformatics volumeÂ 23, ArticleÂ number:Â 149 (2022)
Abstract
Background
The widely spreading coronavirus disease (COVID19) has three major spreading properties: pathogenic mutations, spatial, and temporal propagation patterns. We know the spread of the virus geographically and temporally in terms of statistics, i.e., the number of patients. However, we are yet to understand the spread at the level of individual patients. As of March 2021, COVID19 is widespread all over the world with new genetic variants. One important question is to track the early spreading patterns of COVID19 until the virus has got spread all over the world.
Results
In this work, we proposed AutoCoV, a deep learning method with multiple loss object, that can track the early spread of COVID19 in terms of spatial and temporal patterns until the disease is fully spread over the world in July 2020. Performances in learning spatial or temporal patterns were measured with two clustering measures and one classification measure. For annotated SARSCoV2 sequences from the National Center for Biotechnology Information (NCBI), AutoCoV outperformed seven baseline methods in our experiments for learning either spatial or temporal patterns. For spatial patterns, AutoCoV had at least 1.7fold higher clustering performances and an F1 score of 88.1%. For temporal patterns, AutoCoV had at least 1.6fold higher clustering performances and an F1 score of 76.1%. Furthermore, AutoCoV demonstrated the robustness of the embedding space with an independent dataset, Global Initiative for Sharing All Influenza Data (GISAID).
Conclusions
In summary, AutoCoV learns geographic and temporal spreading patterns successfully in experiments on NCBI and GISAID datasets and is the first of its kind that learns virus spreading patterns from the genome sequences, to the best of our knowledge. We expect that this type of embedding method will be helpful in characterizing fastevolving pandemics.
Background
In December 2019, a new virus, severe acute respiratory syndrome coronavirus 2 (SARSCoV2) broke out as coronavirus disease 2019 (COVID19), and in March 2020, the World Health Organization (WHO) declared a pandemic [1,2,3]. Various vaccines are being developed as a result of the efforts of numerous scientists to overcome COVID19, but the effectiveness of vaccines against the newly occurring mutations is not well established yet [4,5,6]. As of August 2020, more than 21 million people were infected with SARSCoV2 across the world. In the United States, with the largest number of infections, 5.23 million or 1.7% of the U.S. population were infected [7]. As the virus has spread, the number of sequenced SARSCoV2 genomes increased and the sequence has been analyzed mainly for point mutations. Mutationbased studies classified the class of SARSCoV2 according to the type of variants and reported that COVID19 in early Asia and Europe were caused by different types of viruses [8, 9]. In addition, the basic reproductive number (\(R_0\)), which indicates the degree of transmission of the disease, showed that different types of variants had different rates of spread [10]. As of March 2021, COVID19 is widespread all over the world with new genetic variants. Once the virus becomes globally widespread, it is very difficult to track the spreading patterns of COVID19 because of external factors such as the global lockdown and vaccinations. However, while the virus is being spread, i.e., early stage of a disease epidemic, it might be possible to track spreading patterns in terms of spatial and temporal characteristics.
According to Nextstrainâ€™s August 2020 SARSCoV2 situation1 reports, the virus has mutated over time as the virus spreads to different regions [11]. Different strains of the virus evolve as the virus spreads to different regions over time (see the Additional file 1: Fig. S2). Based on the observations, there is no doubt that SARSCoV2 genome sequences have different characteristics depending on regions and time but are not yet fully understood. In this paper, we define such unknown information in terms of spatial patterns and temporal patterns, respectively. Thus, understanding the spatial and temporal characteristics of virus spreading patterns is a very important task. However, our knowledge so far is limited to simply measuring how many patients have occurred geographically and temporally. Beyond the simple statistics, the spread at the level of individual patients can be investigated through an embedding space with spatial and temporal features.
The main goal of our study is to investigate the tracking of COVID19 in terms of biological perspectives. When external factors such as global lockdown and vaccination are enforced, it is difficult to investigate the spreading potentials of COVID19 per se, thus we analyzed the early spread pattern of COVID19 using the SARSCoV2 sequences up to July 2020. In this work, we propose a deep learning method, AutoCoV that can track the early spread of COVID19 (Fig.Â 1). Tracking the virus spreading patterns of spatial and temporal characteristics was achieved by an autoencoder based latent representation deep learning model. Since there are quite a number of mutations in a virus sequence, a sequence is preprocessed using kmer, also known as ngram in natural language processing for information theoretic filtering. Then, the input to AutoCoV is a set of informative kmers for a virus sequence with spatial or temporal information. By augmenting an auxiliary classifier and a center loss objective function on the autoencoder, we guided the latent representation to learn the spatial and temporal patterns. Formally, AutoCoV optimizes three objective functions, such as reconstruction loss, classification loss, and center loss, as shown in Eq.Â 9. In this paper, we used SARSCoV2 sequences from two different datasets: the National Center for Biotechnology Information (NCBI) and the Global Initiative for Sharing All Influenza Data (GISAID) [12,13,14]. As a result, we showed that our model outperformed baselines on an experiment of annotated SARSCoV2 sequences in the NCBI dataset. An extensive ablation study showed the contributions of each component and strategy used in AutoCoV. Furthermore, we demonstrated the robustness of the embedding space generated by AutoCoV using the NCBI dataset against an independent GISAID dataset.
Related works
Since the first outbreak of the COVID19 virus in 2019, numerous papers have been published. Early papers conducted biological evolutionary and structure analysis of SARSCoV2 [15, 16]. As the number of SARSCoV2 genome sequences increased, mutationbased research was conducted and cluster analysis was performed on sequences with the same mutation [17, 18]. Some papers statistically summarized spatial and temporal features according to the types of mutations [9, 19, 20]. Recently, many studies have been conducted using machine learning technologies, for examples, diagnosing COVID19 using medical images by applying convolutional neural network (CNN) [21,22,23], predicting antiviral drugs using natural language processing (NLP) [24], or using deep neural network (DNN) [25] to form a drug repurposing perspective. In addition, research on the current outbreak status and spread of COVID19 using demographic or mobility data [26, 27], and the impact of external factors such as global lockdown policies were also studied [28, 29]. However, there have been no studies on the spatial and temporal spread of COVID19 utilizing the SARSCoV2 sequence.
Analyzing long biological sequences at the character level, i.e. single resolution, is a very difficult task. Although single nucleotidebased analysis may utilize more rich information about the sequences, large amounts of features may lead to the curse of dimensionality, especially in the long length of biological sequences. Therefore, it is important to use an appropriate encoding method that can reflect biological meaning while reducing the dimension of the sequence. There are simple but widely used methods of expressing the characteristics of the sequences such as onehot encoding or kmer encoding. Onehot encoding of the sequence data is straightforward and it has the advantage of preserving the positional information of the sequence, whereas the encoding length depends on the length of the sequence [30, 31]. Kmer encoding, on the other hand, loses positional information, but it has the same encoding length for a fixed value of k, and has the advantage of being able to learn the local context of a biological sequence [32,33,34,35]. In the case of SARSCoV2 genome sequences, the sequences are very similar to each other both in length and of nucleotide compositions and only a small number of variants determine the pathogenic properties of the virus. There is also a study on the possibility of additional mutations occurring around mutations [36]. Therefore, in this study, we used a kmer based approach to encode the sequence to focus on the local context of the sequence rather than the location information.
There are many existing machine learning algorithms for biological sequences that utilize the kmer information as input features of models in various ways and learn embedding vectors of the kmers or the sequence itself. For example, word2vec [37] or doc2vec [38] based models were proposed such as BioVec [37], dna2vec [39], or Super2Vec [40]. Hybrid approaches of CNN and recurrent neural network (RNN) were also proposed to capture local contexts of kmers and longrange interactions within the sequence [41, 42]. However, most existing methods are designed for shortlength of biological sequences and these methods have difficulty in learning embedding vectors of longlength sequences such as SARSCoV2.
Results
Dataset
SARSCoV2 sequences were obtained from two different databases: NCBI Virus [43,44,45] (http://www.ncbi.nlm.nih.gov/genome/viruses/) and GISAID (http://www.gisaid.org). From NCBI Virus, we collected 7,031 sequences including SARSCoV2 reference sequence NC_045512, as of July 17, 2020. NCBI Virus sequences had spatial and temporal labels based on collected continent and date. And each sequence was fully annotated with information about the location of the genes. From GISAID, we collected 61,210 sequences, as of July 17, 2020. GISAID sequences had spatial and temporal labels as well as subclass labels based on pathogenic mutations defined by the GISAID nomenclature system.
Experimental setup
For input data of the model, we used 5210 sequences commonly contained in NCBI Virus and GISAID. Each sequence was about 30,000 bases in length and had spatial and temporal labels. For the spatial label, the sequences were divided into four classes based on the collected continents: â€˜Asiaâ€™, â€˜Oceaniaâ€™, â€˜Europeâ€™, and â€˜North Americaâ€™. For the temporal label, the sequences were divided into three classes based on March 2020, when the spike protein mutation occurred [8, 9]. Based on March 2020, when the D614G mutation in which Aspartic acid (D) at position 614 of spike protein mutated to Glycine (G) occurred, the sequence before March 2020 in which there was no D614G mutation is labeled as â€˜Earlyâ€™ and the sequence in March 2020 in which the D614G mutation occurred is labeled as â€˜Middleâ€™. And the sequences after March 2020, when the D614G mutation has spread around the world, are labeled as â€˜Lateâ€™. Mutations in nonproteincoding regions were also important [46, 47], but due to different sequencing techniques and sequencing lengths of each sequence, there were lost in some sequences. Thus, only the proteincoding regions from ORF1ab to the N gene were used for quality control of the sequence. Among the sequences only from GISAID, we excluded lowquality sequences such as sequences with lowercase nucleotide or more than 100 Nâ€™s, so we used 23,979 as an external dataset to test the robustness of the embedding space learned from the NCBI dataset (Detailed GISAID dataset preprocessing is described in Additional file 1). The details of datasets are listed in TableÂ 1. We split NCBI dataset into training, validation, test data split with 80:10:10 ratio with preserving the spatial/temporal label ratio and performed the stratified split 10 times. In the case of learning spatial patterns, class labels are â€˜Asiaâ€™, â€˜Oceaniaâ€™, â€˜Europeâ€™, and â€˜North Americaâ€™ for supervised learning. Similarly, in the case of learning temporal patterns, class labels are â€˜Earlyâ€™, â€˜Middleâ€™, and â€˜Lateâ€™. All reported figures and tables in this section were created using the test data only. Detailed structures and hyperparameters of models are described in Fig.Â 2 and Additional file 1.
Baseline methods
The performances of AutoCoV on learning spatial and temporal patterns were compared with seven baselines belonging to the three categories: â€˜Dimension Reductionâ€™, â€˜Unsupervisedâ€™, and â€˜Supervisedâ€™. In case of â€˜Dimension Reductionâ€™, three conventional dimension reduction techniques were used such as Principal Component Analysis (PCA) [48], tStochastic Neighbor Embedding (tSNE) [49], and Uniform Manifold Approximation and Projection (UMAP) [50]. The input data of the three methods was the same kmer frequency matrix as AutoCoV. In case of â€˜Unsupervisedâ€™, two sequence embedding methods such as dna2vec [51] and seq2vec [39] were utilized. In the case of â€˜Supervisedâ€™, two wellknown deep learning models for nature language processing were utilized with augmentation of a classifier network (CF), which was similar with AutoCoV: Seq2Seq+CF [52] and BERT+CF [53]. In the contrast with â€˜Dimension Reductionâ€™, the baselines of â€˜Unsupervisedâ€™ and â€˜Supervisedâ€™ used the raw genome sequences of SARSCoV2 as the input data. Details are described in the Additional file 1.
Evaluation metrics
Learning the spatial or temporal patterns of SARSCoV2 means that in the embedding space generated by the model, sequences of the same labels are well clustered and sequences of different labels are well separated while well predicting labels of the sequences. Given the spatial or temporal labels of the sequence as ground truth, we used two clustering metrics and one classification metric to measure how well the model learn spatial or temporal pattern in embedding space. The first clustering metric is label homogeneous score (LHS) that measures the purity of class labels within the cluster [54]: \(LHS(Y; {\tilde{Y}})\) where Y is true class labels and \({\tilde{Y}}\) is predicted class labels by assigned the cluster in the space. The other metric is mutual information score (MI) that measures the dependence between target patterns and axes (\(\text {Dim}1\text { and }\text {Dim}2\)) of 2D embedding space: \(I(Y;\text {Dim}1, \text {Dim}2)\). Lastly, the classification metric is the F\(_1\) score of KNearest Neighbors classifier (neighbors = 10) where the neighbors are the sequences in the train data. All three metrics are ranged from 0.0 to 1.0 and the higher values represent the good performance. Details about the metrics are described in the Additional file 1.
Performance comparison of learning embedding spaces
We conducted experiments for learning spatial and temporal patterns of SARSCoV2. As qualitative results, Fig.Â 3 shows 2D embedding spaces of SARSCoV2 in the view of each pattern. As quantitative results, TableÂ 2 provides the results of the three metrics on each embedding space. From both results, AutoCoV outperformed the baselines in terms of learning the spreading patterns of the virus. We further investigated AutoCoV performance with two additional experiments: effects of different kmer sizes and ablation study. From comparative experiments on varying lengths of k ranging from 1 to 7, we showed that when k is 6, it can cover the performance aspect of the model and the aspect of biological prior knowledge. Furthermore, in an ablation experiment to measure the contribution of each component of AutoCoV, we showed that AutoCoV, when all components were incorporated, has the best performance. Details of the two additional experiments are in the Additional file 1.
Visualization of spatial and temporal patterns
For learning an embedding space representing spatial patterns, AutoCoV constructed the space that successfully distinguishes Asia and North America (Fig.Â 3a and Additional file 1: Figs. S3(a)â€“(b) and S4(a)â€“(b)). Interestingly, the AutoCoV was the only one that could infer the spatial spread of SARSCoV2. FigureÂ 4a showed that viruses were wellseparated and wellclustered according to continents in the AutoCoV spatial embedding space, and the spread pattern of the virus from Asia through Europe and Oceania to North America could be inferred through the embedding space. This trend was similar to the current outbreak of SARSCoV2 until August, and it was consistent with the August 2020 report of NextStrain [11]. In addition, without using subclass labels, AutoCoV also learned subclass characteristics simultaneously (Additional file 1: Fig. S5a). Among the baselines, UMAP showed reasonable spatial patterns, but the same regions were not well clustered. The other baselines showed results with all regions being mixed.
For learning an embedding space representing temporal patterns, AutoCoV distinguished middle and late time points of SARSCoV2 (Fig.Â 3b and Additional file 1: Figs. S3câ€“d and S4câ€“d). Furthermore, temporal spreading patterns of SARSCoV2 were only observed in the embedding space of AutoCoV. FigureÂ 4b showed that the time points were wellseparated in the AutoCoV temporal embedding space, and the temporal spread patterns of the virus could be inferred from the bottom center to the left and then to the right. Likewise, with learning spatial patterns, AutoCoV learned subclass characteristics simultaneously without subclass labels (Additional file 1: Fig. S5b). Among the baselines, dna2vec showed a similar time trend but most of the sequences were extremely tightly clustered, which can hardly be interpreted as the spreading patterns of the virus. Again, the other baselines showed results such as alltime points being mixed or projecting test data into the wrong time points (e.g. tSNE).
Quantitative measurements
AutoCoV outperformed the baselines in all three metrics on both spatial and temporal patterns (TableÂ 2). In the context of spatial patterns, AutoCoV had 2.2 times, 1.7 times, and 4.4% improvements in LHS, MI, and F\(_1\) over the best baseline UMAP, respectively. In the context of temporal patterns, AutoCoV had 2.4 times, 1.7 times, and 9.0% improvements in LHS, MI, and F\(_1\) over the best baseline UMAP, respectively. Among the baselines, dimension reduction techniques showed better performances than unsupervised and supervised learning methods. This is because the dimension reduction techniques captured local and global structures in both patterns and used kmer frequency matrix as input, like AutoCoV. For unsupervised and supervised learning that used raw sequences as input, rich information was available but it may contain redundant information. Since most of the characters in the SARSCoV2 sequences were similar between the sequences, most of the rich information was uninformative and would result in the curse of dimensionality that hinders entire learning processes.
External dataset validation
To demonstrate the robustness and usefulness of AutoCoV, unannotated sequences in the GISAID as an external dataset, were mapped into the embedding space constructed from the NCBI dataset. To preserve the ratio of each label in the spatial and temporal classes, we performed the stratified sampling 10 times from the GISAID. Details on processing the GISAID dataset are in the Additional file 1. FigureÂ 5 shows the spatial and temporal embedding space results of the fold with median performances in the GISAID dataset. For spatial patterns, the F\(_1\) score was about \(0.813 \pm 0.004\), and all continents were represented reasonably well in the embedding space (Fig.Â 5a). All continents spread more widely than those in the NCBI dataset, but this seemed to indicate the influx of mutant viruses by intercontinental migration, as mentioned in NextStrain [11]. For the temporal patterns, the F\(_1\) score was about \(0.739 \pm 0.005\), slightly lower than the spatial case, but it reflected the time flow of the virus relatively well (Fig.Â 5b). These results are expected to help characterize fastevolving pandemics in spatial and temporal embedding space via AutoCoV.
Discussion and conclusion
Effectiveness of encoding schemes of SARSCoV2 sequences using kmer
In general, supervised learning methods outperform unsupervised learning methods. Interestingly, however, the supervised learning methods (Seq2Seq+CF, BERT+CF) performed even worse than the unsupervised learning methods (dna2vec, seq2vec) or dimension reduction methods (PCA, tSNE, UMAP) in most LHS, MI, and F1 evaluations on both patterns. On the other hand, AutoCoV performed significantly better than unsupervised methods. We discuss why this happened in terms of the effectiveness of encoding schemes of SARSCoV2 sequences using kmer.
AutoCoV extracted kmer information from the sequence and transformed one sequence as a kmer frequency vector with entropy filtering. In the AutoCoV encoding scheme, positional information of kmers was lost. Meanwhile, Seq2Seq+CF and BERT+CF preserved the sentence structure with positional information of kmers. Intuitively, keeping the positional information of kmers seems more natural for biological sequences, but there is a good reason why the AutoCoV encoding scheme performed better. SARSCoV2 sequences are very similar to each other, and a small number of variants determine the pathogenic properties of the virus. The order of kmers in the sequence is nearly identical, so it is not useful for distinguishing between SARSCoV2 sequences. Therefore, AutoCoV did not utilize the positional information and focused on kmer count changes of the informative variants.
Limitations and further studies
In experiments on NCBI and GISAID datasets, it was shown that AutoCoV can effectively model spatial or temporal patterns in the SARSCoV2 sequence. There are a number of limitations in our approach. While the kmer encoding scheme of AutoCoV works well for SARSCoV2 sequences, this scheme may lose positional information. We can interpret which kmers are important, but it is hard to know exactly where the kmer is in the sequence. Therefore, analysis through additional learning is required to know the exact location of the important kmer in the sequence. In addition, more findgrained information has been lost because temporal labels were discretized in the experimental setup. There may also be sampling bias issues because the amount of data for each label was not adjusted to reflect the prevalence of diffusion.
To compensate for these shortcomings, future studies include improving the interpretability of the model by adopting attention mechanisms [55] or analyzing saliency map of kmer vectors like compute vision domain [56]. In addition, utilizing denoising techniques will enhance the robustness of the model and mitigate the effects of sequence noises. It also seems possible to improve model structures so that more finegrained label information can be modeled in the embedding space using continuous label representation and unbiased manipulation of the label data.
Conclusion
In this paper, we proposed AutoCoV that learned 2D embedding spaces for modeling the spatial and temporal patterns in the early spread of COVID19. To the best of our knowledge, AutoCoV is the first of its kind that learns virus spreading patterns from the genome sequences. The technical contribution of this paper is that AutoCoV can effectively handle the longlength SARSCoV2 genome sequences using information theory and autoencoder based deep learning model. The biological significance of AutoCoV is the ability to map SARSCoV2 genome sequences to 2D spaces that preserve the spatial and temporal patterns without knowing subclass information of the SARSCoV2 genome sequences. In extensive experiments, AutoCoV outperformed current embedding spaces learning and visualization methods. The generalization power of the embedding space was demonstrated in a cross validation experiment using the NCBI dataset and in another experiment with an independent GISAID dataset. The main contribution of our work is that a machine learning approach is shown to be capable of learning the spread of COVID19. We expect that the embedding space that can reflect temporal and spatial features can potentially be useful to track spread dynamics between sequences.
In closing this paper, we emphasize the importance and significance of learning the sequence embedding space for spatial and temporal patterns of the virus spread. A virus evolves into new types often with more aggressive characteristics even in the quarantined countries. Furthermore, infection among people can make different types of viruses, which makes the understanding of virus evolution more complicated. As of now, we do not have computational methods to model these complex spreading patterns spatially and temporally from the virus genome sequences. Existing studies have mainly studied virus propagation by external control factors such as lockout policies based on demographic data or mobility data [28, 29, 57]. As we showed in this study, methods for embedding or representation learning have potential for this important but challenging problem. We expect that comprehensive analysis of statistical methods based on demographic data and this type of embedding method will help characterize the rapidly evolving pandemics.
Methods
AutoCoV consists of four modules: Sequence Preprocessing, AutoEncoder Network, Classifier Network, and Center Loss for learning spatial and temporal patterns of SARSCoV2 sequences by constructing two dimensional (2D) embedding spaces.
Sequence preprocessing
Let \(\mathbf{X }\) denote a \(n \times d\) matrix where n is the number of sequences and d is the number of kmers, \(d = 4^k\). In this work, we use \(k=6\) for learning embedding spaces of the sequences by kmer approach, which can reflect the biological prior knowledge such as dicodon, i.e., two amino acids. Of course, 3mers in the DNA sequence are also useful because they can be considered codons, but the number of features that can be expressed as 3mers is limited than 6mers. For this reason, \(d = 4^6 = 4,096\) is used for the input matrix \(\mathbf{X }\). Because most of the sequences are almost identical except for a few character mismatches, frequencies of most kmers are the same and uninformative. An information theorybased kmer feature selection process is performed to distinguish subtly different sequences and to reduce the number of features. On each kmer, an entropy value \(H(\cdot )\) is calculated to measure the variance of \(\mathbf{X }_{*j}\), i.e.,
where \(\mathbf{X }_{*j} = [ x_{1j} \; x_{2j} \dots x_{nj} ]^T\) denotes the jth column vector of the input matrix \(\mathbf{X }\), \(M_j\) denotes the number of observed frequency count differently in \(\mathbf{X }_{*j}\), i.e., \(M_j = \text {Set}(\mathbf{X }_{*j})\) , \(\phi _m\) denotes the observed frequency count in \((\text {Set}(\mathbf{X }_{*j})\)) and \(p(\phi _m)\) denotes the probability of observation of \(\phi _m\) in \(\mathbf{X }_{*j}\). As various kmer frequency counts are observed, the entropy value increases. It means that the kmer is likely to contribute to distinguishing the sequences. Using the entropy value \(H(\mathbf{X }_{*j})\) of each kmer, we exclude uninformative kmers that have entropy values smaller than 0.2, an empirically determined threshold. Using the cutoff, kmers that show the same count in 99% of the entire sequences are discarded. Thus, the input matrix X is reduced to the smaller matrix \(\mathbf{X }'\), a \(n \times d'\) matrix where \(d'\) is the number of informative kmers by the information theoretic filtering.
After the kmer filtering step, twolevel normalization steps are performed: sequencelevel and kmer featurelevel. We generated a kmer vector of each sequence using proteincoding regions. However, there are slight differences in length for each sequence due to problems such as differences in sequencing techniques. To reduce the sequencelength dependency, as sequencelevel normalization, the frequency count of a kmer is divided by the total frequency count of the informative kmers. After sequencelevel normalization, the kmer feature level normalization is done by the standardization on each kmer feature. We used StandardScaler from scikitlearn library of Python. Finally, for the next deep learning modules, tanh activation function is applied to the normalized input matrix to adjust the scale of each kmer feature. We denote the result of the preprocessing module as \(\mathbf{X }_{norm}\).
Autoencoder network
We use an AutoEncoder Network in AutoCoV to learn latent representations \(\mathbf{Z } = \{\mathbf{z }_1, \mathbf{z }_2, ..., \mathbf{z }_n\}\) of SARSCoV2 sequences. As illustrated in Fig.Â 2, the AutoEncoder Network is a core module of AutoCoV. It is widely used for encoding input features to smaller dimensional features to reconstruct input data as much as possible [58, 59]. Given the input matrix \(\mathbf{X }_{norm}\), an encoder network generates a hidden representation matrix \(\mathbf{Z }\) and a decoder network generates the reconstructed input matrix \(\tilde{\mathbf{X }}_{norm}\) via the following calculations:

Single linear layer configuration: the layer \(f(\cdot )\) takes \(\mathbf{x }_i^T\) as input and generates \(\mathbf{h }_i^T\) as output where \(\mathbf{x }_i\) is the row vector of input \(\mathbf{X }_{norm}\), i.e., kmer frequency vector of ith sequence. To avoid overfitting and improve generalization performances, batch normalization layer and dropout layer are applied on the layer as below:
$$\begin{aligned} \mathbf{h }_i^T = f(\mathbf{x }_i^T) = \text {Dropout}(\sigma (\text {BN}(\mathbf{W }\mathbf{x }_i^T+\mathbf{b }))) \end{aligned}$$(2)where \(\mathbf{W }\) and \(\mathbf{b }\) are a weight matrix and a bias vector for the linear layer, respectively, BN is the batch normalization layer, \(\sigma (\cdot )\) is an activation function.

Encoder network configuration: the goal of encoder network is generating the latent representation \(\mathbf{z }_i^T\) from the input \(\mathbf{x }_i^T\). Let the encoder network consists of L single linear layers. Then, \(\mathbf{z }_i^T\) is calculated as below:
$$\begin{aligned} \mathbf{z }_i^T = f^L \circ \cdot \cdot \cdot \circ f^2 \circ f^1(\mathbf{x }_i^T) \end{aligned}$$(3)where the activation function \(\sigma (\cdot )\) is tanh function in all layers and Lth layer has no dropout layer.

Decoder network configuration: the goal of decoder network is to reconstruct the input data \(\tilde{\mathbf{x }}_i^T\) from the latent representation \(\mathbf{z }_i^T\). The structure of the decoder network is inversely same with that of the encoder network. Then, \(\tilde{\mathbf{x }}_i^T\) is calculated as below:
$$\begin{aligned} \tilde{\mathbf{x }}_i^T = f^{1} \circ f^{2} \circ \cdot \cdot \cdot \circ f^{L} (\mathbf{z }_i^T) \end{aligned}$$(4)where \(f^{l}\) is a corresponding decoding layer of lth encoding layer \(f^l\) and no tied weights are used for the network. The last layer \(f^{1}(\cdot )\) has no batch normalization and dropout layer.
To measure reconstruction performance, mean squared error (MSE) loss \({{\mathcal {L}}_{rec}}\) is calculated between \(\mathbf{x }_i\) and \(\tilde{\mathbf{x }}_i\) as below:
Classifier network
In general, standard AutoEncoder Network generates latent representations \(\mathbf{Z }\) and reconstructs input data well enough. In the case of SARSCoV2 sequences, however, it is hard to construct wellseparated embedding spaces due to the high similarity of sequences. To achieve the distinct latent representations according to spatial and temporal patterns, we adopt an auxiliary Classifier Network on the AutoEncoder Network. This auxiliary network will predict spatial labels for learning spatial patterns or temporal labels for learning temporal patterns. Like multitask learning, this can be considered as a guide to learning the representations for spatial or temporal patterns, while reconstructing input data by the AutoEncoder network. Given the latent representation \(\mathbf{z }_i^T\) from the AutoEncoder Network, a predicted class label \(\tilde{\mathbf{y }}_i\) is calculated by L linear layers:
where the last layer \(f^L\) has only fully connected layer. The Classifier Network is trained by minimizing the cross entropy loss \({{\mathcal {L}}}_{clf}\) as below:
where C is the number of classes and c is the index of class labels.
Center loss
Cross entropy loss with softmax layer guides the model to learn features that distinguish classes. However, it is insufficient to learn compact representations of data belonging to the same class [60, 61]. For this reason, Wen et al. proposed a new loss function called center loss [60]. The basic concept of center loss is to minimize the distances between data within the same class. In this study, to learn compact embedding spaces for spatial or temporal patterns, label information of each sequence is utilized in the Center Loss module. Let \({\varvec{\mu }}_c\) denotes a mean vector of class c. The center loss for ith sequence is measured as below:
where \(y_i\) is the class of ith sequence. Using the center loss, model parameters are updated to bring latent representations closer to their mean vectors. Then, the mean vector for each class is recalculated using the updated latent representations. That is, the latent representations and the mean vectors are updated simultaneously.
Loss function and training model
For learning spatial and temporal patterns of SARSCoV2 effectively, AutoCoV utilize three loss functions: MSE loss, cross entropy loss, and center loss. Then, a total loss function for a sequence is formulated in the following equation:
For minibatch training, the total loss is normalized by the size of the minibatch. We used Adam optimizer with two different learning rates for learning AutoCoV [62]. The first learning rate controls the overall network of the model except the center loss module. The second learning rate is used for the center loss module. In this study, to focus on generating compact and distinct latent representations, the learning rate for the center loss is larger than the overall learning rate.
Availability of data and materials
The SARSCoV2 sequence can be downloaded from NCBI Virus (http://www.ncbi.nlm.nih.gov/genome/viruses/) and GISAID (http://www.gisaid.org). And the proposed model AutoCoV is freely available on GitHub (https://github.com/smaster7/AutoCoV).
Abbreviations
 SARSCoV2:

Severe acute respiratory syndrome coronavirus 2
 COVID19:

Coronavirus disease 2019
 WHO:

The World Health Organization
 \(R_0\) :

The basic reproductive number
 NCBI:

The National Center for Biotechnology Information
 GISAID:

Global initiative for sharing all influenza data
 CNN:

Convolutional neural network
 NLP:

Natural Language processing
 DNN:

Deep neural network
 RNN:

Recurrent neural network
 2D:

Two dimensional
 MSE:

Mean squared error
 PCA:

Principal component analysis
 TSNE:

Tstochastic neighbor embedding
 UMAP:

Uniform manifold approximation and projection
 CF:

Classifier network
 LHS:

Label Homogeneous Score
 MI:

Mutual information
References
Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY, et al. A new coronavirus associated with human respiratory disease in china. Nature. 2020;579(7798):265â€“9.
Zhu N, Zhang D, Wang W, Li X, Yang B, Song J, Zhao X, Huang B, Shi W, Lu R, et al. A novel coronavirus from patients with pneumonia in China, 2019. New Engl J Med. 2020.
Gorbalenya AE, Baker SC, Baric RS, de Groot RJ, Drosten C, Gulyaeva AA, Haagmans BL, Lauber C, Leontovich AM, Neuman BW, Penzar D, Perlman S, Poon LLM, Samborskiy DV, Sidorov IA, Sola I, Ziebuhr J, ofÂ the International Committee on TaxonomyÂ of Viruses, C.S.G. The species severe acute respiratory syndromerelated coronavirus: classifying 2019ncov and naming it sarscov2. Nature Microbiol. 2020;5(4), 536â€“544.
Wise J. Covid19: The E484K mutation and the risks it poses. British Med J Publ Group. 2021.
Zou J, Xie X, FontesGarfias CR, Swanson KA, Kanevsky I, Tompkins K, Cutler M, Cooper D, Dormitzer PR, Shi PY. The effect of sarscov2 d614g mutation on bnt162b2 vaccineelicited neutralization. npj Vaccines. 2021;6(1):1â€“4.
Chen J, Gao K, Wang R, Wei GW. Prediction and mitigation of mutation threats to covid19 vaccines and antibody therapies. Chem Sci. 2021;12(20):6929â€“48.
World Health Organization: Coronavirus disease (covid19): situation report, 209, 2020.
Grubaugh ND, Hanage WP, Rasmussen AL. Making sense of mutation: what d614g means for the covid19 pandemic remains unclear. Cell. 2020;182(4):794â€“5.
Korber B, Fischer WM, Gnanakaran S, Yoon H, Theiler J, Abfalterer W, Hengartner N, Giorgi EE, Bhattacharya T, Foley B, et al. Tracking changes in sarscov2 spike: evidence that d614g increases infectivity of the covid19 virus. Cell, 2020;812â€“82719.
Ke R, RmeroSeverson EO, Sanche S, Hengartner N. Estimating the reproductive number r0 of sarscov2 in the united states and eight european countries and implications for vaccination. medRxiv. 2020.
Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: realtime tracking of pathogen evolution. Bioinformatics. 2018;34(23):4121â€“3.
Pruitt KD, Tatusova T, Maglott DR. Ncbi reference sequences (refseq): a curated nonredundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35(SUPPL. 1):61â€“5.
Elbe S, BucklandMerrett G. Data, disease and diplomacy: Gisaidâ€™s innovative contribution to global health. Glob Challenges. 2017;1(1):33â€“46.
Shu Y, McCauley J. Gisaid: global initiative on sharing all influenza datafrom vision to reality. Eurosurveillance. 2017;22(13):30494.
Benvenuto D, Giovanetti M, Ciccozzi A, Spoto S, Angeletti S, Ciccozzi M. The 2019new coronavirus epidemic: evidence for virus evolution. J Med Virol. 2020;92(4):455â€“9.
Zhou P, Yang XL, Wang XG, Hu B, Zhang L, Zhang W, Si HR, Zhu Y, Li B, Huang CL, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579(7798):270â€“3.
Forster P, Forster L, Renfrew C, Forster M. Phylogenetic network analysis of sarscov2 genomes. Proc Natl Acad Sci. 2020;117(17):9241â€“3.
Rambaut A, Holmes EC, Hill V, OToole A, McCrone J, Ruis C, du Plessis L, Pybus O. A dynamic nomenclature proposal for sarscov2 to assist genomic epidemiology. bioRxiv, 2020.
Chinazzi M, Davis JT, Ajelli M, Gioannini C, Litvinova M, Merler S, y Piontti AP, Mu K, Rossi L, Sun K, et al. The effect of travel restrictions on the spread of the 2019 novel coronavirus (covid19) outbreak. Science. 2020;368(6489):395â€“400.
Sun J, He WT, Wang L, Lai A, Ji X, Zhai X, Li G, Suchard MA, Tian J, Zhou J, et al. Covid19: epidemiology, evolution, and crossdisciplinary perspectives. Trends Mol Med. 2020.
Ozturk T, Talo M, Yildirim EA, Baloglu UB, Yildirim O, Acharya UR. Automated detection of covid19 cases using deep neural networks with Xray images. Comput Biol Med. 2020;103792.
Ardakani AA, Kanafi AR, Acharya UR, Khadem N, Mohammadi A. Application of deep learning technique to manage covid19 in routine clinical practice using ct images: results of 10 convolutional neural networks. Comput Biol Med. 2020;103795.
Farooq M, Hafeez A. Covidresnet: A deep learning framework for screening of covid19 from radiographs. arXiv preprint arXiv:2003.14395 2020.
Beck BR, Shin B, Choi Y, Park S, Kang K. Predicting commercially available antiviral drugs that may act on the novel coronavirus (sarscov2) through a drugtarget interaction deep learning model. Comput Struct Biotechnol J. 2020;784â€“790.
Ke YY, Peng TT, Yeh TK, Huang WZ, Chang SE, Wu SH, Hung HC, Hsu TA, Lee SJ, Song JS, et al. Artificial intelligence approach fighting covid19 with repurposing drugs. Biomed J. 2020.
Gao S, Rao J, Kang Y, Liang Y, Kruse J. Mapping countylevel mobility pattern changes in the united states in response to covid19. SIGSpatial Spec. 2020;12(1):16â€“26.
Castro MC, Kim S, Barberia L, Ribeiro AF, Gurzenda S, Ribeiro KB, Abbott E, Blossom J, Rache B, Singer BH. Spatiotemporal pattern of covid19 spread in brazil. Science. 2021;372(6544):821â€“6.
Pachetti M, Marini B, Giudici F, Benedetti F, Angeletti S, Ciccozzi M, Masciovecchio C, Ippodrino R, Zella D. Impact of lockdown on covid19 case fatality rate and viral mutations spread in 7 countries in Europe and North America. J Transl Med. 2020;18(1):1â€“7.
Ji T, Chen HL, Xu J, Wu LN, Li JJ, Chen K, Qin G. Lockdown contained the spread of 2019 novel coronavirus disease in Huangshi city, china: early epidemiological findings. Clin Infect Diseases. 2020;71(6):1454â€“60.
Kuzmin K, Adeniyi AE, DaSouza AK Jr, Lim D, Nguyen H, Molina NR, Xiong L, Weber IT, Harrison RW. Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone. Biochem Biophys Res Commun. 2020;533(3):553â€“8.
Melnyk A, Mohebbi F, Knyazev S, Sahoo B, Hosseini R, Skums P, Zelikovsky A, Patterson M. From alpha to zeta: Identifying variants and subtypes of sarscov2 via clustering. J Comput Biol. 2021;28(11):1113â€“29.
Edgar RC. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792â€“7.
Murray KD, Webers C, Ong CS, Borevitz J, Warthmann N. kwip: the kmer weighted inner product, a de novo estimator of genetic similarity. PLOS Comput Biol. 2017;13(9):1005727.
Lee S, Lee T, Noh YK, Kim S. Ranked kspectrum kernel for comparative and evolutionary comparison of exons, introns, and cpg islands. IEEE/ACM Trans Comput Biol Bioinform. 2019.
Steinegger M, SÃ¶ding J. Clustering huge protein sequence sets in linear time. Nature Commun. 2018;9(1):1â€“8.
Araya CL, Cenik C, Reuter JA, Kiss G, Pande VS, Snyder MP, Greenleaf WJ. Identification of significantly mutated regions across cancer types highlights a rich landscape of functional molecular alterations. Nature Genet. 2016;48(2):117â€“25.
Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 2013.
Le Q, Mikolov T. Distributed representations of sentences and documents. In: International Conference on Machine Learning, 2014;1188â€“1196.
Kimothi D, Soni A, Biyani P, Hogan JM. Distributed representations for biological sequence analysis. 2016. arXiv preprint arXiv:1608.05949.
Kimothi D, Biyani P, Hogan JM, Soni A, Kelly W. Learning supervised embeddings for large scale sequence comparisons. PloS One. 2020;15(3):0216636.
Jurtz VI, Johansen AR, Nielsen M, Almagro Armenteros JJ, Nielsen H, SÃ¸nderby CK, Winther O, SÃ¸nderby SK. An introduction to deep learning on biological sequence data: examples and solutions. Bioinformatics. 2017;33(22):3685â€“90.
Hu S, Ma R, Wang H. An improved deep learning method for predicting dnabinding proteins based on contextual features in amino acid sequences. PloS One. 2019;14(11):0225317.
Bao Y, Federhen S, Leipe D, Pham V, Resenchuk S, Rozanov M, Tatusov R, Tatusova T. National center for biotechnology information viral genomes project. J Virol. 2004;78(14):7291â€“8.
Brister JR, AkoAdjei D, Bao Y, Blinkova O. Ncbi viral genomes resource. Nucleic Acids Res. 2015;43(D1):571â€“7.
Hatcher EL, Zhdanov SA, Bao Y, Blinkova O, Nawrocki EP, Ostapchuck Y, SchÃ¤ffer AA, Brister JR. Virus variation resourceimproved response to emergent viral outbreaks. Nucleic Acids Res. 2017;45(D1):482â€“90.
Bukh J, Purcell RH, Miller RH. Sequence analysis of the 5â€™noncoding region of hepatitis c virus. Proc Natl Acad Sci. 1992;89(11):4942â€“6.
Bhattacharyya P, Biswas SC. Small noncoding rnas: Do they encode answers for controlling sarscov2 in the future? Front Microbiol. 2020;11:2271.
Pearson K. Liii. on lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559â€“72.
Maaten Lvd, Hinton G. Visualizing data using tsne. J Mach Learn Res. 2008;9(Nov):2579â€“605.
McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction. 2018. arXiv preprint arXiv:1802.03426.
Ng P. dna2vec: Consistent vector representations of variablelength kmers. 2017. arXiv preprint arXiv:1701.06279.
Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, 2014;3104â€“3112.
Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pretraining of deep bidirectional transformers for language understanding. 2018. arXiv preprint arXiv:1810.04805.
Rosenberg A, Hirschberg J. Vmeasure: A conditional entropybased external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLPCoNLL), 2007;410â€“420.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. 2017. arXiv preprint arXiv:1706.03762.
Hou X, Zhang L. Saliency detection: A spectral residual approach. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007;1â€“8. Ieee
Singh BP, Singh G. Modeling tempo of covid19 pandemic in India and significance of lockdown. J Public Affairs. 2020;20(4):2257.
Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504â€“7.
Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012;37â€“49.
Wen Y, Zhang K, Li Z, Qiao Y. A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision, 2016;499â€“515. Springer
Zheng W, Yang L, Genco RJ, WactawskiWende J, Buck M, Sun Y. Sense: Siamese neural network for sequence embedding and alignmentfree comparison. Bioinformatics. 2019;35(11):1820â€“8.
Kingma DP, Ba J. Adam: A method for stochastic optimization. 2014. arXiv preprint arXiv:1412.6980.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 23 Supplement 3, 2022: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2021): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume23supplement3.
Funding
This research was supported by the Collaborative Genome Program for Fostering New PostGenome Industry of the National Research Foundation (NRF) funded by the Ministry of Science and ICT (MSIT) [No.NRF2014M3C9A3063541]; Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No.2021001343, Artificial Intelligence Graduate School Program (Seoul National University)]; Basic Science Research Program through the NRF funded by the Ministry of Education [No.NRF2021R1A6A3A01086898], Republic of Korea. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript. Publication costs are funded by NRF funded by MSIT.
Author information
Authors and Affiliations
Contributions
SK conceived the experiment, IS and SL conducted the experiment, IS and SL drafted the manuscript, IS, SL, MP, and YS processed data and analyzed results. All authors read, revised, and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Â Supplementary document containing basic information about SARSCoV2 and details information about materials, method, and additional results.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sung, I., Lee, S., Pak, M. et al. AutoCoV: tracking the early spread of COVID19 in terms of the spatial and temporal patterns from embedding space by Kmer based deep learning. BMC Bioinformatics 23 (Suppl 3), 149 (2022). https://doi.org/10.1186/s1285902204679x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902204679x