We now formalize the problem of lossy compression of quality scores and describe the proposed method. As stated in previous sections, the FASTQ format is widely accepted as the standard for storing sequencing data. Therefore, we consider the compression of quality scores presented in this format, and assume all reads within a file have the same length *n*. Denote the number of entries in the FASTQ file by *N*, where each entry is composed of four lines. The quality score sequences in a FASTQ file are denoted by $\{\mathbf{Q}_i\}_{i=1}^{N}$, where $\mathbf{Q}_i=[Q_i(1),\ldots,Q_i(n)]$. Our goal is to design an encoder-decoder pair that describes the quality score vectors using only as many bits as specified by the user, while minimizing a given distortion *D* between the original vectors $\{\mathbf{Q}_i\}_{i=1}^{N}$ and the reconstructed vectors $\{\hat{\mathbf{Q}}_i\}_{i=1}^{N}$. More specifically, we consider that each $\mathbf{Q}_i$ is compressed using at most $nR$ bits, where *R* denotes the rate (bits per quality score), and that the distortion *D* is computed as the average distortion over the vectors, i.e., $D=\frac{1}{N}\sum_{i=1}^{N} d(\mathbf{Q}_i,\hat{\mathbf{Q}}_i)$. Furthermore, we consider the MSE as our distortion function $d(\cdot,\cdot)$, which operates symbol by symbol (as opposed to block by block), so that $d(\mathbf{Q}_i,\hat{\mathbf{Q}}_i)=\frac{1}{n}\sum_{j=1}^{n}\big(Q_i(j)-\hat{Q}_i(j)\big)^2$. These assumptions allow us to model the encoder-decoder pair as a rate-distortion scheme of rate *R*, where the encoder is described by the mapping $f_n:\mathbf{Q}_i\to\{1,2,\ldots,2^{nR}\}$, which represents the compressed version of the vector $\mathbf{Q}_i$ of length *n* using $nR$ bits, and the decoder is described by the mapping $g_n:\{1,2,\ldots,2^{nR}\}\to\hat{\mathbf{Q}}_i$, where $\hat{\mathbf{Q}}_i$ denotes the reconstructed sequence.

With this formulation of the problem, we can use results from rate distortion theory to guide the design of QualComp. For a detailed description of rate distortion theory and proofs, please refer to [41]. We are interested in the following result:

#### Theorem 1

For an i.i.d. Gaussian vector source $\mathbf{X}\sim N(\mu_X,\Sigma_X)$, with $\Sigma_X=\mathrm{diag}(\sigma_1^2,\ldots,\sigma_n^2)$ (i.e., independent components), the optimal allocation of $nR$ bits that minimizes the MSE is given as the solution to the following optimization problem:

$$
\begin{aligned}
\min_{\rho_1,\ldots,\rho_n}\quad & \sum_{j=1}^{n}\sigma_j^2\,2^{-2\rho_j}\\
\text{s.t.}\quad & \sum_{j=1}^{n}\rho_j\le nR,\qquad \rho_j\ge 0,\ j\in[1:n],
\end{aligned}
$$

where $\rho_j$ denotes the number of bits allocated to the $j$th component of **X**.
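The optimization in Theorem 1 is the classical reverse water-filling problem, whose solution has the form $\rho_j=\max(0,\tfrac{1}{2}\log_2(\sigma_j^2/\theta))$ for a "water level" $\theta$. As an illustration (not QualComp's actual code; the function name `allocate_bits` is hypothetical), a minimal NumPy sketch that finds the water level by bisection:

```python
import numpy as np

def allocate_bits(variances, total_bits, tol=1e-9):
    """Reverse water-filling: minimize sum_j var_j * 2**(-2*rho_j)
    subject to sum_j rho_j <= total_bits and rho_j >= 0.
    The optimum is rho_j = max(0, 0.5*log2(var_j / theta)) for a
    water level theta, found here by bisection (variances > 0 assumed)."""
    v = np.asarray(variances, dtype=float)

    def bits_used(theta):
        return np.sum(np.maximum(0.0, 0.5 * np.log2(v / theta)))

    lo, hi = 1e-12, float(v.max())   # bits_used is decreasing in theta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bits_used(mid) > total_bits:
            lo = mid                 # too many bits: raise the water level
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return np.maximum(0.0, 0.5 * np.log2(v / theta))
```

For example, variances $(4, 1, 0.25)$ with a budget of 3 bits yield the allocation $(2, 1, 0)$: high-variance components receive more bits, and components whose variance falls below the water level receive none.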

Next we describe how we use this result in the design of QualComp. In real data, quality scores take integer values in a finite alphabet, but for the purpose of modeling we assume they take values in $\mathbb{R}$ (the set of real numbers). Although the quality scores of different reads may be correlated, we model correlations only within a read, and consider quality scores across different reads to be independent. Thus we assume that the quality score vectors $\mathbf{Q}_i$ are independent and identically distributed (i.i.d.) as $P_Q$, which will be specified below.

To the best of our knowledge, there are no known statistics of the quality score vectors. However, among all vector sources with a given covariance matrix, the multivariate Gaussian is the least compressible. Furthermore, compression/coding schemes designed on the basis of a Gaussian assumption, i.e., the worst distribution for compression, will also perform well for non-Gaussian sources with the same mean and covariance matrix [40]. Guided by this observation, we model the quality scores as jointly Gaussian, i.e., $\mathbf{Q}_i\sim N(\mu_Q,\Sigma_Q)$, where the mean $\mu_Q$ and the covariance matrix $\Sigma_Q$ are empirically computed from the set of vectors $\{\mathbf{Q}_i\}_{i=1}^{N}$. Due to the correlation of quality scores within a read, $\Sigma_Q$ is in general not a diagonal matrix. Thus, to apply Theorem 1, we first need to decorrelate the quality score vectors.
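The decorrelation step can be checked numerically on synthetic data. The sketch below (variable names and the toy correlated source are assumptions for illustration) verifies that after the transform the empirical covariance becomes diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: N correlated "quality score" vectors of length n.
N, n = 5000, 4
A = rng.normal(size=(n, n))
Q = rng.normal(size=(N, n)) @ A.T + 30.0   # rows are the vectors Q_i

mu = Q.mean(axis=0)                        # empirical mean mu_Q
Sigma = np.cov(Q, rowvar=False)            # empirical covariance Sigma_Q

# SVD of the symmetric PSD covariance: Sigma = V S V^T
V, s, _ = np.linalg.svd(Sigma)

# Decorrelating transform: Q'_i = V^T (Q_i - mu_Q), one row per read
Qp = (Q - mu) @ V

# The empirical covariance of the transformed vectors equals diag(s)
# up to floating-point error, i.e., the components are decorrelated.
Sigma_p = np.cov(Qp, rowvar=False)
off_diag = Sigma_p - np.diag(np.diag(Sigma_p))
```

Since the transform is linear, the sample covariance of the new vectors is exactly $V^T\Sigma_Q V=S$, so the off-diagonal entries of `Sigma_p` vanish up to numerical precision.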

In order to decorrelate the quality score vectors, we first perform the singular value decomposition (SVD) of the matrix $\Sigma_Q$, which allows us to express it as $\Sigma_Q=VSV^T$, where $V$ is a unitary matrix satisfying $VV^T=I$ and $S$ is a diagonal matrix whose diagonal entries $s_{jj}$, for $j\in[1:n]$, are known as the singular values of $\Sigma_Q$. We then generate a new set of vectors $\{\mathbf{Q}'_i\}_{i=1}^{N}$ by performing the operation $\mathbf{Q}'_i=V^T(\mathbf{Q}_i-\mu_Q)$ for all $i$. Due to the Gaussian assumption on the quality score vectors, this transformation makes the components of each $\mathbf{Q}'_i$ independent and distributed as $N(0,s_{jj})$, for $j\in[1:n]$, since $E[\mathbf{Q}'_i\mathbf{Q}'^{T}_i]=V^T\Sigma_Q V=S$. This property allows us to use the result of Theorem 1. The number of bits allotted per quality score vector, $nR$, is a user-specified parameter. Thus we can formulate the bit allocation problem for minimizing the MSE as a convex optimization problem and solve it exactly. That is, given a budget of $nR$ bits per vector, we allocate the bits by first transforming each $\mathbf{Q}_i$ into $\mathbf{Q}'_i$, for $i\in[1:N]$, and then allocating bits to the independent components of $\mathbf{Q}'_i$. In order to minimize the MSE, we solve the following optimization problem:

$$
\begin{aligned}
\min_{\rho_1,\ldots,\rho_n}\quad & \sum_{j=1}^{n}s_{jj}\,2^{-2\rho_j}\\
\text{s.t.}\quad & \sum_{j=1}^{n}\rho_j\le nR,\qquad \rho_j\ge 0,\ j\in[1:n],
\end{aligned}
$$

where $\rho_j$ represents the number of bits allocated to the $j$th position of each $\mathbf{Q}'_i$, for $i\in[1:N]$; i.e., the allocation of bits is the same for all the quality score vectors, and thus the optimization problem has to be solved only once. Ideally, this allocation would be performed by vector quantization, i.e., by applying a vector quantizer with $N\rho_j$ bits to the $j$th components of all $N$ vectors, for $j\in[1:n]$. However, due to its ease of implementation and negligible performance loss, we use a scalar quantizer instead. Thus for all $i\in[1:N]$, each component $Q'_i(j)$, for $j\in[1:n]$, is normalized to unit variance and then mapped to decision regions representable with $\rho_j$ bits. For this we need $\rho_j$ to be an integer, which in general will not be the case, so we randomly map each $\rho_j$ to $\bar{\rho}_j$, given by either the closest integer from below or from above, chosen so that the average value of $\bar{\rho}_j$ coincides with $\rho_j$. To ensure the decoder obtains the same value of $\bar{\rho}_j$, the same pseudorandom generator is used in both the encoder and the decoder. The decision regions that minimize the MSE for different values of $\rho$, together with their representative values, are found offline by a Lloyd-Max procedure [42] on a scalar Gaussian distribution with mean zero and variance one. For example, for $\rho=1$ we have $2^1$ decision regions, corresponding to values below zero (decision region 0) and above zero (decision region 1), with representative values $-0.7978$ and $+0.7978$. Therefore, if we were to encode the value $-0.344$ with one bit, we would encode it as a '0', and the decoder would decode it as $-0.7978$. To reconstruct the quality scores $\hat{\mathbf{Q}}_i$, the decoder performs the operations complementary to those done by the encoder: it constructs $\mathrm{round}(V\hat{\mathbf{Q}}'_i+\mu_Q)$, where $\hat{\mathbf{Q}}'_i$ denotes the dequantized vector, and replaces the quality scores corresponding to an unknown basepair (given by the character 'N') by the least reliable quality score. The steps are summarized below.
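As a rough sketch of this encoder/decoder pair (restricted, for brevity, to integer allocations $\rho_j\in\{0,1\}$ and the one-bit Lloyd-Max codebook from the example above; all function names are hypothetical, and the 'N'-replacement step is omitted):

```python
import numpy as np

# 1-bit Lloyd-Max quantizer for N(0,1): regions (-inf, 0) and [0, inf),
# with representative values -/+ sqrt(2/pi) ~= -/+ 0.7979.
REP = np.sqrt(2.0 / np.pi)

def encode(Q, V, mu, s, rho):
    """Sketch encoder: decorrelate, normalize to unit variance, and
    1-bit-quantize the components with rho_j >= 1; components with
    rho_j == 0 receive no bits (their stored bit is forced to 0)."""
    Qp = (Q - mu) @ V                  # Q'_i = V^T (Q_i - mu_Q), per row
    Z = Qp / np.sqrt(s)                # normalize each component
    bits = (Z >= 0).astype(np.uint8)   # decision region 0 or 1
    return np.where(rho >= 1, bits, 0)

def decode(bits, V, mu, s, rho):
    """Sketch decoder: map bits to representatives, undo the transform,
    and round, i.e., compute round(V Q'_hat + mu_Q)."""
    Z_hat = np.where(rho >= 1, np.where(bits == 1, REP, -REP), 0.0)
    Qp_hat = Z_hat * np.sqrt(s)        # restore the component variances
    return np.rint(Qp_hat @ V.T + mu)
```

Note that a component with $\rho_j=0$ is dequantized to 0, so it reconstructs to its (rounded) mean value after the inverse transform.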

Notice that the final size is given by $nRN$ bits plus an overhead to specify the mean and covariance of the quality scores, the read length *n*, and the number of sequences *N*. This can be further reduced by performing a lossless compression using a standard universal entropy code.
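For concreteness, the payload size follows directly from this formula; with hypothetical parameters ($n=100$, $R=0.5$ bits per quality score, $N=10^6$ reads):

```python
n, R, N = 100, 0.5, 1_000_000      # hypothetical file parameters
payload_bits = n * R * N           # n*R bits per quality score vector, N vectors
payload_mb = payload_bits / 8 / 1e6
print(payload_mb, "MB before overhead")
```

That is, 50 million bits, or 6.25 MB before the (comparatively small) header overhead.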

Since the algorithm is based on the statistics of the quality scores, better statistics yield lower distortion. With that in mind, and to capture possible correlation between reads, we allow the user to first cluster the quality score vectors, and then perform the lossy compression within each cluster separately. For that we use the standard *k*-means algorithm [43], which we explain below. Notice that the total size in this case increases only by a small amount per cluster, since we do not preserve the order of the sequences. Specifically, after decoding the quality scores, we create the corresponding FASTQ file by incorporating the remaining information, i.e., the first three lines of each entry (including the header and the nucleotide sequence), and sorting the entries in alphabetical order with respect to the headers. This guarantees that related FASTQ files with paired-end reads will have the same ordering after applying the lossy compression algorithm.
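A minimal sketch of the clustering step (plain Lloyd iterations over the quality score vectors; the function name, initialization, and iteration count are assumptions, not QualComp's actual implementation):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means: partition the quality score vectors (rows of X)
    into k clusters so each cluster is compressed with its own mean
    and covariance statistics."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as its cluster mean (keep old if empty)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Each resulting cluster is then passed through the decorrelation, bit allocation, and quantization steps described above, using statistics computed from that cluster only.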

Finally, notice that *R*=0 is not the same as discarding the quality scores, since the decoder does not assign the same value to every reconstructed quality score. Instead, the reconstructed quality score vectors within a cluster are all identical and equal to the empirical mean of the original quality score vectors in that cluster, but the quality scores within each vector will in general differ from one another.
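To see this concretely: at $R=0$ every $\rho_j$ is zero, the dequantized vector is all zeros, and the decoder outputs $\mathrm{round}(V\cdot 0+\mu_Q)=\mathrm{round}(\mu_Q)$, whose entries generally differ across positions. A toy check with a hypothetical cluster mean:

```python
import numpy as np

# Hypothetical per-position empirical means for a cluster (mu_Q).
mu_Q = np.array([38.2, 37.6, 33.1, 25.4])
V = np.eye(4)                             # the basis is irrelevant when Q' = 0
Q_hat = np.rint(V @ np.zeros(4) + mu_Q)   # round(V*0 + mu_Q) = round(mu_Q)
```

Every vector in the cluster reconstructs to this same rounded mean, yet the scores within it are not all equal, unlike simply dropping the quality scores.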