Adaptable probabilistic mapping of short reads using position specific scoring matrices

Background Modern DNA sequencing methods produce vast amounts of data that often requires mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach is without a statistical foundation and can for some data types result in many wrongly mapped reads. Here we present a probabilistic mapping method based on position-specific scoring matrices, which can take into account not only the quality scores of the reads but also user-specified models of evolution and data-specific biases. Results We show how evolution, data-specific biases, and sequencing errors are naturally dealt with probabilistically. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT rich organism P. falciparum, when modeling the biases of these data. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism (D. melanogaster) to a related genome (D. simulans). Conclusion The presented work is an implementation of a novel approach to short read mapping where quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. The resulting implementation provides not only a tool for biologists working with low quality and/or biased sequencing data but also a demonstration of the feasibility of using a probability based alignment method on real and simulated data sets.


PAR-CLIP reads
Simulated reads were created by changing the base T to C at a rate of 0.11 in the single end length 36 simulated data set above. Real PAR-CLIP reads were obtained by sampling 100,000 reads from NCBI SRA data set SRR189777. Primers were removed from the reads using the AdapterRemoval tool (Lindgreen, 2012).

Xeno mapping
Three data sets were used for xeno mapping: a set of short reads (NCBI SRA data set SRR001981), a set of long reads with low quality (NCBI SRA data set ID SRR023647) and a set of long reads with high quality (NCBI SRA data setSRR516029). As reference genomes we used the April 2006 assembly of the D. melanogaster genome from the Berkeley Drosophila Genome Project and the April 2005 assembly of the D. simulans genome from the Genome Sequencing Center at Washington University School of Medicine in St. Louis. Both were obtained via the UCSC Genome Browser data downloads with IDs dm3 and droSim1. For comparing mapping positions in the two genomes we used the liftOver tool from the UCSC Genome Browser "kent" Bioinformatic Utilities and the associated liftOver data file (dm3ToDroSim1).

Mappers and Parameters
The performance of the program BWA-PSSM was tested on both simulated reads and real data from the Illumina platform. We compared the performance of BWA-PSSM to that of BWA (Li and Durbin, 2009), BWA-MEM (Li, 2013), Bowtie (Langmead et al., 2009), Bowtie2 (Langmead and Salzberg, 2012), and GEM (Marco-Sola et al., 2012) using the following options. In all of the tests, comparisons are shown between the raw alignment as well as the quality filtered alignment. The quality filtered alignment discards all mappings with a MapQ of less than 25.

Mapping ancient DNA with BWA-PSSM
The ancient DNA reads were converted to a PSSM using the provided fastq2wm33.pl script using the following command.
cat reads.fastq | ./fastq2wm33.pl > reads.pssm The BWA-PSSM mapper was then invoked as usual except instead of passing the fastq file as a parameter, the PSSM was used.

Mapping probabilities for ungapped alignments
In this section we will derive the mapping match probability P ( , M|x, g), which is the probability of the alignment position and the foreground model M given a read x and the genome g. Here we assume that the read is aligned to starting position in the genome using only end-gaps in the read sequence and no gaps in the genome. Using the sum and product rules we can express the match probability as where we used that P (N|x) + P (M|x) = 1. Using the sum and product rule, we can write If we assume that all mapping positions are equally likely a priori we have that P ( |M, x) = 1/L, where L = |g| is the length of the genome. Using this assumption and equation (1) we get .
We will assume that the bases in the genome are independent both in the foreground and background model. In the background model N we will also assume that the genome is independent of the read, which means that we can write In the foreground model M, we will assume that (a) the i'th genome base is independent of all read bases except the read base it is aligned to. Furthermore we will assume that (b) the probability of an unaligned genome base in the foreground model is the same as in the background model. Finally we will assume that (c) the probability of an aligned genome base is independent of the starting position of the alignment given the aligned read base. Based on these three assumptions we can write Using equations (2) and (3) we can write the ratio of the genome probability according to the foreground model and background model as We then define the log-odds score S(g| , x) def = log 2 P (g| ,M,x) P (g|N,x) , which based on equation (4) can be written as where S(g +j−1 |x j ) = log 2 P (g +j−1 |M,x j ) P (g +j−1 |N) . Clearly the individual log-odds scores can naturally be represented as a PSSM and from equation (5) we see that S(g| , x) can be calculated by scoring the sequence g : +|x| with this matrix. So finally we can express the mapping match probability in terms of PSSM scores by which is the result given in the paper. Figure S1: The search path for BWA-PSSM. The PSSM is converted into a set of offsets from the maximum score as shown in the top. Below, the partial prefix tree to depth three is shown for the sequence GATTACA. To search for matches of the PSSM in the sequence, partial hits are scored according to the greatest offset encountered so far. In the illustrated search, the highest scoring string would be 'GTT'. After first visiting the 'G', there is no 'T' on that branch and the 'A' in the next position would lead to a combined offset of -7. Thus, the value of 'A' in the first position (with a score offset of -4) will rise to the top of the heap and be visited next.   Table S2. Bowtie and GEM are excluded as they do not provided MapQ scores.    Table S2: Analysis of single-end data simulated with WG-SIM. Comparison of sensitivity, positive predictive value (PPV) and run time using BWA-PSSM, BWA, BWA-MEM, Bowtie, Bowtie2 and GEM on simulated data sets covering a random 1% of the human genome. The reads were simulated using the WG-SIM (Li, 2011) program with the parameters listed in the Read Simulation section.   Table S4: Analysis of single-end data simulated with MASON. Comparison of sensitivity, positive predictive value (PPV) and run time using BWA-PSSM, BWA, BWA-MEM, Bowtie, Bowtie2 and GEM on simulated data sets covering a random 1% of the human genome. The reads were simulated using the MASON (Holtgrewe, 2010) program with the parameters listed in the Read Simulation section.