A haplotype-based normalization technique for the analysis and detection of allele specific expression

Background Allele specific expression (ASE) has become an important phenotype, being utilized for the detection of cis-regulatory variation, nonsense mediated decay and imprinting in the personal genome, and has been used to both identify disease loci and consider the penetrance of damaging alleles. The detection of ASE using high throughput technologies relies on aligning short-read sequencing data, a process that has inherent biases, and there is still a need to develop fast and accurate methods to detect ASE given the unprecedented growth of sequencing information in big data projects. Results Here, we present a new approach to normalize RNA sequencing data in order to call ASE events with high precision in a short time-frame. Using simulated datasets we find that our approach dramatically improves reference allele quantification at heterozygous sites versus default mapping methods and also performs well compared to existing techniques for ASE detection, such as filtering methods and mapping to parental genomes, without the need for complex and time consuming manipulation. Finally, by sequencing the exomes and transcriptomes of 96 well-phenotyped individuals of the CARTaGENE cohort, we characterise the levels of ASE across individuals and find a significant association between the proportion of sites undergoing ASE within the genome and smoking. Conclusions The correct treatment and analysis of RNA sequencing data is vital to control for mapping biases and detect genuine ASE signals. By normalising RNA sequencing information after mapping, we show that this approach can be used to identify biologically relevant signals in personal genomes. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1238-8) contains supplementary material, which is available to authorized users.

Where ----DNAvcf is the name and path to VCF file containing variation that will be used within the simulation, ----DNAid is the name of the individual as shown in the VCF file, ----OUTid is the name of output folder (also used in output file names -- directory will be created with this name and simulation files will be produced within this directory, ----RNAbam is the name of the BAM file containing mapped RNA sequencing data for the individual of interest (This file must contain properly paired, uniquely mapped reads), ----MaxProcs allows the simulation to be run in parallel on multiple cpus (we recommend using no more than 8 cpus on a node containing 12 cpus, default: 1), and ----SampleDepth is the depth of reads to be simulated across each site (increasing this may improve accuracy during normalisation, decreasing will improve speed of simulation and subsequent mapping, default: 2000).
Step 3: Map the simulated null dataset: The simulation program will generate two fastq.gz files containing read pairs. Map these files using the exact same approach that you used in step 1. The simulation program will also produce two count files showing the number of reference (and thus alternative) reads covering each site for both trimmed (ignoring read pair overlaps) and non-trimmed data. These are for use in the ase_normalisation_compare.pl program to generate results.
Step 4: Correct the original data using the null dataset: Use the original mapped data, the simulated mapped data and the count files (Null_sim_file below) to generate P--values for each heterozygous site: perl ase_normalisation_compare.pl [options] ----DNAvcf <VCF_file_name> ----DNAid <ind_id_from_vcf> ----OUTid <Outfolder_name> ----RNAbam <Mapped_RNA_BAM> ----RNASIMbam <Mapped_RNASIM_BAM> ----Ref <Reference_fasta> ----MapNull <Null_sim_file> Where ----DNAvcf is the name and path to VCF file containing variation that will be used within the simulation, ----DNAid is the name of the individual as shown in the VCF file, ----OUTid is the name of Output folder (also used in output file names -- directory will be created with this name and simulation files will be produced within this directory), ----RNAbam is the name of the BAM file containing mapped RNA sequencing data for the individual of interest, ----RNASIMbam is the name of the BAM file containing mapped RNA sequencing data for the individual of interest, generated from the simulation program, ----Ref is the full path to Reference fasta file for use in Samtools Mpileup (this should be the same reference as used to call SNVs present in the supplied VCF file), ----MapNull is the path to null allele file generated by the simulation program, ----MaxProcs allows the simulation to be run in parallel on multiple cpus (we recommend using no more than 8 cpus on a node containing 12 cpus, default: 1) and ----Baq is used to turn on probabilistic realignment for the computation of base alignment quality in Samtools Mpileup (default: off). The results file will detail allele counts before and after normalisation, as well as a P-value from a binomial test.

Software Flags
The following is the full list of options available for the ase_normalisation_sim.pl program: ----DNAvcf <VCF_file_name> : Name and path to VCF file containing variation that will be used within the simulation. (Required) ----DNAid <ind_id_from_vcf> : Name of the individual as shown in the VCF file.
(Required) ----OUTid <Outfolder_name> : Name of Output folder (also used in output file names). Directory will be created with this name and simulation files will be produced within this directory.
The following is the full list of options available for the ase_normalisation_compare.pl program: ----DNAvcf <VCF_file_name> : Name and path to VCF file containing variation that will be used within the simulation. (Required) ----DNAid <ind_id_from_vcf> : Name of the individual as shown in the VCF file.
(Required) ----OUTid <Outfolder_name> : Name of Output folder (also used in output file names). Directory will be created with this name and simulation files will be produced within this directory.

(Required) ----RNAbam <Mapped_RNA_BAM>: Name of the BAM file containing mapped RNA sequencing data for the individual of interest. (Required) ----RNASIMbam <Mapped_RNA_BAM>: Name of the BAM file containing mapped RNA sequencing data for the individual of interest, generated from the simulation program. (Required) ----Ref <Reference_fasta>: Full path to Reference fasta file for use in Samtools Mpileup. This should be the same reference as used to call SNVs present in the supplied VCF file. (Required) ----MapNull <Null_sim_file> :
The path to null allele file generated by the simulation program. (Required) ----MaxProcs <number_of_cpus> : The simulation can be run in parallel on multiple cpus. We recommend using no more than 8 cpus on a node containing 12 cpus.

[default: 1] (Optional) ----Baq : Use to turn on probabilistic realignment for the computation of base alignment quality in Samtools Mpileup (BAQ). [default: off] (Optional) Example Pipeline and Data
As an example we have provided fastq files, bam files and an example results file at https://github.com/AJHodgkinson/ASE that can be used to test the software. The following details the steps that can be taken in order to reproduce the results in the example file: The following files have been provided at https://github.com/AJHodgkinson/ASE: • Sample fastq files: OUT_RNASeq_PE_sim_ind1_SNPs_1_chr22.fq and OUT_RNASeq_PE_sim_ind1_SNPs_2_chr22.fq --This data was generated by a custom script and is loosely based on the NA12812 1000G individual. It contains simulated data for a single chromosome.
• Mapped data: ind1_snps_chr22.Aligned.out.sort.PP.UM.bam --The above files were mapped with STAR and filtered. If you don't want to map the fastq files yourself, this bam file can be used in the simulation program. • SNP file: ind1.vcf --this file shows the locations of polymorphisms in the fastq files and is to be used in the simulation program. This will produce a null allele file 'ref_allele_count_ind1.txt' for use in the final step 2) Map and filter the null dataset in the same way as the original data. If using the test BAM file, use the following approach: #Map data (you will need to create a reference STAR genome with hg19 reference): STAR ----genomeDir hg19 ----readFilesIn fastq_sim_ind1_R1.fastq.gz fastq_sim_ind1_R2.fastq.gz ----runThreadN 12 ----readFilesCommand zcat ----outSAMstrandField intronMotif ----outFileNamePrefix ind1_SNPs_chr22_deepsim. #Covert SAM-->BAM samtools view --bh ind1_SNPs_chr22_deepsim. 3) Use the compare software to get the allele counts before and after normalisation (use the hg19 fasta file used to create STAR reference above): Although the results file you generate may not be exactly the same as the results file provided (as the software relies on random sampling of reads), the normalized read counts should correlate strongly across the two results files.