In the current study, RNA-Seq was performed twice using two chicken cDNA samples. The first run of RNA-Seq had fewer number of reads and larger variation in terms of total number of reads between the two samples, while the second run had greater number of reads and very small variation between the two samples. The first run was performed at the very early stage of the sequencing technology when it was still in the testing phase. The lower reads and larger variation in the first run may be coming from two major sources of technical error: the purification of cDNA templates during the library preparation, and the loading of libraries onto flow cells (RNA-Seq technical guide and personal communications, Illumina technical support staff). These potential sources of errors were corrected during the second RNA-Seq analysis, which provided very good sequencing depth with greater number of reads. The first RNA-Seq datasets were directly derived from actual experiment, which made the results more informative than replicating datasets by random sampling. Therefore, we chose to include these two early datasets in the analysis in the current study. Furthermore, all of the reads from each sample were normalized by the RPKM and the datasets can serve as a reference for random sampling at different sequencing depths from the exact same samples.
The capacity of sequencing length of 60 bp for the first run was increased to 75 bp at the second RNA-Seq analysis. Longer reads should reduce estimation error and mapping uncertainty, and read lengths have been consistently increasing with improving Illumina massively parallel sequencing technology. However, people have noted that the number of reads is more important than the read length once reaching a minimum read length of 25 bp [9, 10]. The read lengths (60 bp and 75 bp) in the current study were larger than 25 bp; therefore, the read length will not affect the overall conclusions drawn.
As a powerful new technology for transcriptome analysis, RNA-Seq provides a more comprehensive view of the transcriptome than earlier technologies. Besides its ability to detect splicing variation, RNA editing and discovery of new transcripts [11], RNA-Seq can also function in the role of a conventional microarray in measuring gene expression due to its accurate measurements. In order to detect less abundant transcripts, appropriate sequencing depth is needed. The transcriptome coverage or sequencing depth needed for a given study can be affected by several factors such as genome size, transcriptome complexity and objectives of the study. In general, the more complex the transcriptome, the more sequencing depth is required for adequate coverage [12]. Depending on the purpose of transcriptome analysis, the requirement of sequencing depth varies. In most transcriptome studies, quantifying mRNA abundance is one of the major objectives. There is a certain sequencing depth that is sufficient in simple transcriptomes. For example, in the yeast genome, a 29.9 M (35 bp) reads dataset was generated by RNA-Seq which was able to get 100% transcriptome coverage [13]. The number of transcripts detected by RNA-Seq in the yeast dataset was able to reach 80% transcriptome coverage at 4M mapped reads, and even though the sequencing depth doubled as 8M reads, the transcriptome coverage only increased 10% [13, 14]. These results suggest that the improvement of sequencing depth or transcriptome coverage after reaching a certain sequencing depth had relatively less impact on detecting low abundant genes [15]. In addition, the cost per sample per lane by RNA-Seq is still not affordable for most laboratories. Recent development in multiplex labelling using bar-coded libraries by Illumina and continued increase in sequence output have made it possible to sequence multiple samples per lane without extra cost or running time [16]. Therefore, it is imperative to examine the correlation between sequencing depth and transcriptome coverage; in other words, what sequencing depth might be sufficient in reaching a certain level of transcriptome coverage and reliable measurement for RNA-Seq. In order to accomplish this objective, two approaches could be applied: experimental or simulation methods. Both methods have been applied in this study. High correlation among replicates within each sequencing depth, gradual increase in correlation coefficients from 10 M to 20 M, and consistent patterns observed between Samples1 and 2 (Fig. 4) have demonstrated that random sampling was an effective and reliable method in reaching the goals of this study.
Transcriptome coverage is one of the most important parameters in profiling global gene expression. The number and level of transcript isoforms is not always known and transcription activity varies across the genome [14]. This was confirmed in a study by using the number of unique transcription start sites as a measure of coverage in mouse cells [15]. In the current study, we took a more practical approach using all annotated genes in the chicken genome. Because the chicken genome is far under-annotated, we assume that the 15,742 annotated chicken genes in the database would well represent different levels of expression abundance in the chicken genome, which is essential for the analysis of transcriptome coverage in this study. Since gene expression depends on tissue and time of biological process [15], it is impossible for any single tissue to have all genes in the genome expressed. Ninety percent of all annotated genes (Fig. 3) detected at about 30 M (75 bp) reads might represent a saturated detection of the whole genome. The analysis results indicated significant improvements of transcriptome coverage occurred from 1.6 M to 4.9 M and from 20 M to 30 M. Depending on the purpose of transcriptome analysis, the current study suggested that 10 M (75 bp) reads could have 80% of transcriptome coverage, while 30 M (75 bp) reads could reach 90% of transcriptome coverage.
When we analyzed overall correlation coefficients at different levels of sequencing depth regardless of gene expression level, we observed very high correlation coefficients between each level of sequencing depth compared with 30 M, except for 1.6 M. One might draw a conclusion that there is no significant difference among different levels of sequencing depth. But as we see in Figure 4, this might be true in the case of highly abundant genes (the 4th quartile group), but not in the case of the 1st to 3rd quartile groups, especially the first two quartile groups (i.e. expression below the median). High abundant genes will be less affected by sequencing depth than low abundant genes, because high abundant genes are more likely to be captured even when the sequencing depth varies [17]. This is also confirmed by our analysis. Collectively, the following points can be inferred: 1) Sequencing depth below 20 M (75 bp) reads had a significant effect on detecting transcript levels of medium and low abundant transcripts; 2) Sequencing depth at both 1.6 M and 4.9 M would result in unreliable mRNA expression on all genes except highly abundant transcripts; 3) There was no significant improvement in correlation coefficients when sequencing depth doubled from 10 M to 20 M. Based on these analysis, the results suggested: 1) 5 M reads might be sufficient in obtaining reliable mRNA expression measurement on highly abundant transcripts; 2) When sequencing depth is beyond 10 M reads, a relatively reliable measurement of mRNA expression is expected, especially for abundant transcripts; 3) It seems that 30 M of reads is needed to achieve reliable measurement of mRNA expression across all genes in the chicken genome. To our knowledge, this is the first study evaluating the appropriate sequencing depth using RNA-Seq in farm animals and will provide the first reference for similar studies. The knowledge generated from this study has laid a solid foundation for applying this analysis to other species.