Biases in read coverage demonstrated by interlaboratory and interplatform comparison of 117 mRNA and genome sequencing experiments
© Khrameeva and Gelfand; licensee BioMed Central Ltd. 2012
Published: 19 April 2012
Skip to main content
© Khrameeva and Gelfand; licensee BioMed Central Ltd. 2012
Published: 19 April 2012
High-throughput sequencing of whole genomes and transcriptomes allows one to generate large amounts of sequence data very rapidly and at a low cost. The goal of most mRNA sequencing studies is to perform the comparison of the expression level between different samples. However, given a broad variety of modern sequencing protocols, platforms and versions thereof, it is not clear to what extent the obtained results are consistent across platforms and laboratories. The comparison of 117 human mRNA and genome high-throughput sequencing experiments performed on the Illumina and SOLiD platforms at 26 institutions all over the world demonstrated high dependency of the gene coverage profiles on the producing laboratory. Gene coverage profiles showed laboratory-specific non-uniformity that survived the 3'-bias correction and mappability normalization, suggesting that there are other yet unknown mRNA-associated biases.
Next-generation sequencing technologies have completely transformed the field of genetics, making it possible to generate large amounts of sequence data very rapidly and at a low cost. High-throughput sequencing of whole genomes and transcriptomes has become a major focus of modern biology as DNA sequencing is now available to many more projects, and even single research groups. As the performance of platforms or versions may differ, it is not clear to what extent the obtained results are consistent across platforms or versions thereof, or even between different laboratories using the same equipment .
Many efforts have been made to understand and overcome the biases inherent in the next-generation sequencing technology. On the Illumina platform, regions of elevated GC content have higher read coverage; sequencing errors occur preferentially at the 3'-end of reads; sequences preceding error positions are G-rich; the transversions G → T and A → C are the most frequent substitutions; quality scores underestimate the true error rate for high quality values and overestimate the true error rate for low quality values . It has been shown that the distributions of the sequenced nucleotides change across the positions of the reads and this bias influences the uniformity of the read location along expressed transcripts [3, 4]. Also, there are PCR biases over-amplifying identical cDNA fragments ; mappability bias leading to lower coverage of regions with low sequence complexity; non-hydrolysis bias increasing levels of 5'-termini in the sequenced pool . mRNA sequencing may be influenced by contamination by non-processed or under-spliced transcripts leading to visible intron coverage; mRNA degradation when RNAs are selected by polyA leading to the higher coverage of the 3'-end; influence of RNA secondary structure on fragmentation.
Here, we compared results of 117 human mRNA and genome high-throughput sequencing experiments performed on the Illumina and SOLiD platforms of all generations at 26 institutions all over the world to demonstrate the existence of systematic biases that can potentially affect gene coverage profiles. The gene coverage profiles are important for widely applied differential expression studies because gene coverage non-uniformity can lead to wrong estimations of expression levels of different exons, for example at the beginning or at the end of the gene. We observed high dependency of the gene coverage profiles on the producing laboratory, with most Illumina mRNA datasets showing existence of a systematic bias, while the genome datasets were not biased significantly.
Starting with the raw data, we calculated coverage profiles that are normally used to determine biologically relevant parameters such as gene expression or exon inclusion rates. Only single-exon genes filtered for high coverage (178 genes, see Methods for details) were considered to make gene coverage profiles comparable between mRNA sequencing experiments for different tissues that may produce alternative splice isoforms, and also to allow for the comparison of genomic and transcriptomic data.
The genome sequencing experiments do not produce such a clear picture. While some of them cluster by laboratory, others do not. The SOLiD experiments all cluster together. The mRNA sequencing experiments performed on the SOLiD platform are distant, by gene coverage profiles, from other mRNA sequencing experiments, as well as from the genome sequencing experiments, including those performed on SOLiD.
The clustering procedure was repeated for the unfiltered set of single-exon genes (1074 genes, see Additional file 1), which demonstrated generally the same result as the filtered one. Hence our observations do not depend on the filtering criterion.
Most Illumina mRNA datasets show coverage bias, with 3' gene termini, on average, covered higher than 5'-termini, as demonstrated by linear fitting of averaged coverage plots for all genes in a single experiment (see Additional files 2 and 3). On the contrary, in all SOLiD mRNA datasets 5' gene termini are covered higher than 3'-termini. To eliminate the 5' - 3' coverage non-uniformity, we normalized all datasets according to their linear fitting models and re-clustered the experiments (see Additional file 4). No significant difference was observed between dendrograms corresponding to the initial and normalized datasets. Normalized mRNA experiments cluster by laboratory as strongly as the initial ones (average correlation of profiles from one laboratory 0.43 ± 0.21, and between laboratories 0.25 ± 0.14, respectively).
The observed high dependency of the gene coverage profiles on the producing laboratory demonstrates that comparisons of sequencing results are difficult. This problem is crucial for mRNA sequencing experiments as the goal of most such studies is to compare the expression levels in various tissues, diseased and healthy individuals, case and control experiments, etc. As the genome sequencing experiments do not cluster distinctly by laboratory, they are most likely not biased significantly, whereas the transcriptome sequencing experiments demonstrate existence of a systematic bias that could be caused by the influence of the RNA secondary structure on the sample preparation and sequencing.
The 5' - 3' coverage non-uniformity has been observed before [7, 8] and can be traced back to specifics of the random hexamers or oligo(dT) preparation protocols. Studies involving comparative analysis of sequencing data produced by the same laboratory would be largely unaffected by such artefacts (average correlation 0.46 ± 0.14), whereas biological implications of wider comparisons may require additional controls and normalization procedures (average correlation 0.27 ± 0.10).
Normalization for 5' - 3' coverage bias did not result in significant improvements, neither the mappability normalization. After normalization, some gene regions are still covered higher than the others, with their relative positions in genes unique for each laboratory, meaning that there may be other mRNA-associated laboratory-specific biases of an unknown origin.
The data were retrieved from the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra). In the SRA, the data are organized into sets of experiments that result from a single study, performed in one lab, presumably following the same protocol. Each experiment in a study contains one or more sequencing runs. Experiments that had the total amount of sequenced data exceeding 500 million bases (Mb) and were publicly available as of 10th of October, 2010 were selected for further analysis. For each experiment, several runs were retrieved to compile a dataset of about 1.5 billion sequenced bases (Gb). Studies SRP001106, SRP001699, SRP001734, SRP002009, and SRP002881 were excluded as the size of each of their runs exceeded 5 Gb.
Reads were aligned to the reference human genome (version hg19) with the program bowtie . The number of allowed mismatches depended on the read length: one mismatch for reads shorter than 26 nucleotides (nt), two mismatches for the interval 26-50 nt, and three mismatches for reads longer than 50 nt. Reads produced by the Illumina platform were mapped in the base space and reads produced by SOLiD were mapped in the color space. Reads that mapped to multiple locations in the human genome were discarded.
Per-nucleotide gene coverage was calculated for each position of the gene as the total number of reads spanning over this position. For each gene, the Pearson correlation coefficient of the coverage profiles was calculated between all possible pairs of sequencing experiments. The coverage profiles were not preliminary normalized by the overall read count for each experiment because the Pearson correlation coefficient does not depend on the scaling factor.
Genes that were covered by reads in less than 90% of experiments were discarded. 178 of 1074 single-exon protein-coding genes survived this filtration. The correlation coefficient was averaged over these genes for each pair of sequencing experiments, which were clustered by the UPGMA method  in the R environment .
This work was supported by State contracts 02.740.11.0101 and 14.740.11.0003, programs "Molecular and Cellular Biology" and "Basic Science for Medicine" of the Russian Academy of Sciences.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 6, 2012: Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S6.