Kullback-Leibler divergence is the log likelihood for the word counts
Kullback-Leibler (KL) divergence has a natural probability theory interpretation as a limiting case of the multinomial distribution. In particular, it was used in the context of alignment-free sequence comparison in the work ;
Under this model one assumes that the counted n-mers are drawn from a pool q with frequencies of each n-mer being q
, index i numbering all possible n-mers and running from 1 to 4>n. In other words, the model assumes that the words in a sequence are independent, and the probability of appearance of a particular word at a given position is position independent. The probability of appearance of the word i at a given position is q
; i = 1,…,4n. Under these assumptions the probability of obtaining n-mer count vector c is given by the multinomial distribution2:
We denote — the total number of words in a sequence. For sufficiently large counts one can use the Stirling’s approximation,
Denote the normalized counts as
consequently, the log of the probability is
The KL divergence between the frequency distributions p and q is:
When the difference between p
is small, this probability distribution reduces to the multivariate normal distribution3,
We have used the Taylor expansion for the natural logarithm:
dropping the cubic and higher terms.
Interpretation of the formula (3) in the context of clustering is as follows. When we have several candidate pools qα (“centroids”), KL divergence DKL(p|qα) gives the log odds ratio for a sequence having the vector of counts c to be attributed to centroid α. Therefore the ML estimate of a centroid is obtained by minimizing the KL divergence. We employ this relation within the framework of expectation maximization clustering.
Expectation maximization clustering
The problem of clustering is the challenge of partitioning a dataset of N points into K subsets (clusters) such that the similarity between the points within each subset is maximized, and the similarity between different subsets is minimized. The measure of similarity can vary depending on the data. Generally the clustering problem is computationally hard. However, there exist heuristic clustering algorithms that run in polynomial time. Most common clustering approaches are hierarchical clustering, k-means type (centroid based) clustering  and density based clustering . Each of these approaches possesses its own advantages and disadvantages.
Hierarchical clustering does not require one to specify the number of clusters a priori. Instead it produces the linkage tree, and the structure of the tree (in particular, branch lengths) determines the optimal number of clusters. However, the complexity of hierarchical clustering is at least . Density based algorithms (e. g., DBSCAN ) can find clusters of arbitrary shape as well as identify outliers. They do not require the prior specification of the number of clusters. Run time is . Centroid based approaches (k-means, expectation maximization) have a linear run time. Prior specification of the number of clusters is required, and results depend on the initialization of the algorithm.
In the present work we focus on centroid based technique. Our rationale for this is as follows. First, there exists a natural likelihood function for the word counts, which allows one to perform EM clustering. Also, the space of word counts possesses a natural notion of a centroid: for a set of sequences which belong to the same cluster one adds all the words within them; and the resulting frequencies yield the cluster centroid. Second, linear run time is critical for large datasets (in particular, HTS data).
EM is a generalization of k-means algorithm. The number of clusters K needs to be specified in advance. For the execution of the algorithm on N sequences one needs the following variables: centroids qα, α = 1,…,K; and assignments (“latent data”) za, a = 1,…,N. The algorithm consists of the two steps repeated iteratively until it converges.
Expectation step: given the current centroids q α, compute the new values of z a so that the log likelihood is maximized.
Maximization step: given the current assignments z a, compute the new values of q α so that the log likelihood is maximized.
This procedure guarantees that the log likelihood is non-decreasing at each step. Note that Equation (3) implies that the log likelihood is bounded from above by zero. These two facts imply that the algorithm converges4. In terms of the the variables qα and za the log likelihood is
We denote the total number of words in the a’th sequence as L
. Consequently, expectation step reassigns each point to its closest centroid:
Centroids are updated during the maximization step as follows:
Here we have introduced the Kronecker delta symbol:
This prescription exactly corresponds to the natural notion of a centroid: one adds all the words counts within a cluster to obtain the total count vector and normalizes this vector. Detailed derivation of Equation (9) is presented in Appendix 1.
The EM algorithm depends on initialization. In other words, depending on the initial centroid assignment the algorithm may converge to a partitioning that is only locally optimal. One of the ways to minimize the impact of the random initialization is to perform clustering several times using different initializations. This results in several partitionings, and then the one which maximizes the likelihood function is chosen. In the framework of k-means clustering selecting the partitioning with the minimal distortion leads to such maximization. Distortion is the sum of the intra-cluster variances for all the clusters. Using KL divergence as a likelihood function, one arrives at the modified definition of distortion:
Note that in the limit when the likelihood function reduces to the Gaussian one, our EM algorithm reduces to Gaussian mixture EM. In this case in the light of the formula (5) our definition of distortion reduces to the regular one.
Alternative distance (pseudo-likelihood) functions
We also explore some other distance functions, such as d2 and [6, 15, 16]. We are not aware of their direct probabilistic interpretation as a likelihood function. Nevertheless, they represent some distances; i. e., they can serve as some measure of a degree of dissimilarity between the two sequences. One can operate in terms of a distortion function as a measure of the quality of the partitioning. In the case of the EM clustering of k-means clustering, distortion equals the negative log likelihood. If one can prove that the analogs of both expectation and maximization steps lead to a decrease of distortion, this provides the basis for the convergence of the clustering algorithm.
d2 distance between the two vectors is defined as 1− cos θ, where θ is the angle between these vectors:
Here ∥v∥ denotes the norm of the vector v:
and the dot denotes the dot product:
One can define the distortion function as
In the context of d2 distance it is natural to normalize the word counts for centroids and individual sequences so that they have a unit norm: ∥p∥ = ∥q∥ = 1.
EM algorithm can be generalized to use the d2 distance as follows. During the expectation step one assigns each sequence to the closest (in terms of the d2 distance) centroid. During the maximization step one updates the centroids as follows:
We assume that the word counts for individual sequences are normalized so that ∥pa∥ = 1. Equation (16) is derived in Appendix 1. This procedure ensures that at each step the distortion is non-increasing. The distortion is bounded by zero from below. These two facts ensure the convergence of the algorithm. Equations (12) and (16) imply that the value of the d2 distance and the updated positions of the the centroids only depend on the normalized word counts. Consequently, the algorithm makes no distinction between the short and the long sequences.
distance was introduced in works ,. Its modification with suitable normalization for comparing short sequences was introduced in work  and called. This distance computation of expected word frequencies using the zero order Markov model and standardization of the observed word counts. In the context of centroid based clustering it can be formulated as follows.
For a given cluster count the frequencies of single nucleotides (1-mers) within the union of all sequences within the cluster.
Compute the vector of expected frequencies of n-mers Q using zero order Markov model. Under this prescription the expected frequency of n-mer is the product of the frequencies of individual characters.
For a vector of raw counts x define the corresponding standardized vector as
Denote the word count vector of all sequences within a cluster as x; then the distance between the centroid of this cluster and a sequence with the word count vector c is
Update of sequences’ assignment to clusters is the analog of the maximization step. Update of the expected frequencies is the analog of the expectation step. A priori it is not obvious how to define the distortion so that both expectation and minimization steps lead to a guaranteed decrease in distortion. We leave this question as well as the proof of convergence beyond the scope of the current work.
Standardization procedure as defined in Equation (17) is inspired by the Poisson distribution where mean equals variance. Following a similar logic, we introduce the χ2 distance:
Despite the apparent similarity of this definition with Equation (5), the frequency vector Q is the expected vector computed from the zero order Markov model (the same way as it was computed in the calculation of distance).
Symmetrized Kullback-Leibler divergence
This distance is the symmetrized Kullback-Leibler divergence:
It assumes that p and q are normalized frequency vectors:
Centroid based and hierarchical clustering techniques can be combined in consensus clustering. In this approach centroid based clustering is performed a number of times, each time randomly selecting a fraction of samples into the bootstrap dataset. After that the distance matrix is formed as
Hierarchical clustering is performed with distance matrix D
. This approach is computationally expensive as complexity of the distance matrix construction is , and the complexity of the hierarchical clustering using average linkage is for an input of N sequences.
Consider a set of HTS reads originating from several genes (contigs). Grouping together reads originating from the same gene provides a natural partitioning of the read set. Recall rate is a measure of how well the clustering agrees with this natural partitioning. In other words, the recall rate provides a measure of how well the reads from the same contig cluster together. It is defined as follows. Consider reads originating from some gene G. For example, if the number of clusters is K = 4 and 40% of reads from G are assigned to cluster 1, 20% of reads from G are assigned to cluster 2, 10% of reads from G are assigned to cluster 3, 30% of reads from G are assigned to cluster 4; the recall rate is R
Generally, assume that there are K clusters, and consider reads originating from some gene G. Denote f
the fraction of all reads originating from G which are assigned to the cluster k. Recall rate for gene G is
Recall rate provides a measure of how clustering interferes with assembly. In particular, when the recall rate is R
= 1, all reads from gene G get assigned to the same cluster; and the contig for G can be assembled from just one cluster with no losses.
We performed a numerical experiment to estimate the dependence of the recall rate on the read length and the clustering method. We generated 50 sets of human RNA sequences, each set containing 1000 sequences randomly chosen from the set of the reference sequences. We required that the length of each sequence is at least 500 bp and at most 10000bp. After that we simulated reads of length 30, 50, 75, 100, 150, 200, 250, 300, 400bp from each of these 50 sets using Mason . Each read set contained 100000 reads. Mason options used were illumina -N 100000 -n READLENGTH -pi 0 -pd 0 -pmm 0 -pmmb 0 -pmme 0. This way we obtained a total of 450 simulated read sets: one set for each of the 50 gene sets and 9 values of the read length. To study the dependence of the recall rate on the sequencing error rate for each of the 50 gene sets we generated 100000 reads of length 200 and error rate 0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05. Mason options used were illumina -N 100000 n 200 -pi 0 -pd 0 -pmm ERRORRATE -pmmb ERRORRATE -pmme ERRORRATE. This way we obtained 350 simulated read sets: one set for each of the 50 gene sets and 7 values of the error rate. To study the dependence of the recall rate on the depth of coverage (total number of reads) we simulated read sets with 200000, 150000, 100000, 75000, 50000, 30000, 20000, 10000, 5000 reads. Mason options used were mason illumina -N NUMREADS -n 200 -pi 0 -pd 0 -pmm 0 -pmmb 0 -pmme 0. This way we obtained 450 simulated read sets: one read set for each of the 50 gene sets and 9 values of the number of reads.
We performed hard EM clustering, k-means clustering, L2 clustering and d2 clustering and computed the recall rate for each gene in each read set. The results show that the EM algorithm exhibits a higher recall rate than that of k-means algorithm. For k-means clustering we used the implementation available in scipy package.
Soft EM clustering
For the execution of the algorithm one needs the following variables: centroids qα and probabilities for observation point a to be associated with cluster α. EM algorithm iteratively updates probabilities starting from centroid locations, and then updates centroid locations qα using the updated probabilities . These steps are performed as follows.
Given a set of centroids qα and observations (count vectors) ca, the probability for observation a to be associated with centroid α is
as it follows from Bayes’ theorem. In the “soft” EM algorithm can take fractional values, calculated according to Equation (24)5.
Given the probabilities , one updates centroid locations by maximizing the log likelihood expectation
When written explicitly it becomes
Here we denote the number of clusters by K and the number of sequences by N. In our conventions Greek index α runs over the different clusters, Latin index a runs over different sequences, and Latin index i runs over different n-mers. As derived in Appendix 1, centroids are computed as follows:
Note that Equation (27) conforms to the intuitive notion of centroid in terms of the word counts. Namely, word counts from all the sequences in the cluster are added up (with some weights in soft EM), and the resulting frequencies are calculated.
As explained, soft EM algorithm assigns the read a to the cluster α with some probability. Choice of a confidence threshold ε produces a set of clusters: read a is a member of cluster α if . Note that the clusters are possibly overlapping; i. e., one read can be assigned to multiple clusters simultaneously.
ROC curve for soft EM clustering
Consider a group of reads coming from the same origin (e. g., the same gene or the same organism). A perfect alignment-free classification would assign them to a single cluster (possibly containing some other reads). Let us assume that we know the origin of each read. A choice of some fixed value for the cutoff ε will assign each read to zero or more clusters. We consider the cluster which embraces the largest part of the reads from gene G to be the “correct” assignment for the reads originating from this gene. For example, assume that we have K = 4 (overlapping) clusters, containing 40%, 35%, 35% and 10% of the reads correspondingly. Then the first cluster is the “correct” assignment that would be attributed to all the reads from gene G if the clustering algorithm were perfect.
The true positive rate (recall rate) is
We define the false positive rate as
A read is considered “incorrectly” assigned if it is assigned to at least one cluster different from the correct one. Note that for some values of the threshold ε the same read can be simultaneously assigned to a correct and an incorrect cluster, thus producing both a true and a false positive. In the limit ε → 0 each read is assigned to each cluster (FPR=TPR=1). In the limit ε → 1 neither read gets assigned to any cluster (FPR=TPR=0).
Dependence of TPR vs FPR as ε changes from 0 to 1 gives an analog of the ROC curve6. Performance of the algorithm is characterized by the area under the curve (AUC).
Assembly of real data
Reads from an Illumina run on cDNA of a nasal swab were taken. After filtering out the low quality and the low complexity reads 21,568,249 100bp single end reads were left. Velvet  assembly was performed with the default settings. Velveth command line was velveth Assem HASHLENGTH -short -fasta INPUTFILE. Velvetg command line was velvetg Assem. Values of the has length were 21, 31, 41. Assembly was performed on the complete read set as well as on subsets obtained as a result of alignment-free clustering of the reads. Hard clustering was performed 5 times, and the partitioning with the minimal distortion was chosen. Soft clustering was performed once. Confidence cutoff for the soft clustering is ε = 0.05. For every splitting of the read set all the contings generated from individual parts were merged together. After that the original reads were mapped back onto the contings using bwa and allowing up to 2 mismatches. The number of reads which map back onto the contigs is a measure of the quality of assembly. It takes care of the possible duplicate contigs which may be formed when assembling separate parts of the sample.
Reference sequences for human mRNA genes were obtained from NCBI RefSeq ftp site, http://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/. Data were downloaded from NCBI on Apr 09 2013. Sequences for the bacterial recA, dnaA, rpsA and 16S rRNA genes used in the simulation were extracted from streptococcus pneumoniae genome, [GenBank:NC_003028]. Viral sequences used in the simulation are [GenBank:NC_001477, NC_001943, NC_000883, NC_015783, NC_001806, NC_003977, NC_001802]. We concatenate all the segments of segmented viruses (rotavirus [GenBank:NC_011500, NC_011506, NC_011507, NC_011508, NC_011510, NC_011501, NC_011502, NC_011503, NC_011504, NC_011509, NC_011505], Lujo virus [GenBank:FJ952385, FJ952384] and influenza virus). For influenza virus we use the sequence of the vaccine strain, A/California/7/2009.