Understanding transcriptome dynamics and their impact on gene expression levels is essential for unveiling gene regulatory mechanisms and interpreting genotypic and phenotypic variations. With the recent advent of high-throughput RNA sequencing (RNA-seq) technologies, researchers have gained a powerful tool not only for investigating expression profiles at the transcriptional level but also for identifying novel and non-coding transcripts [1,2,3]. To date, several transcriptome analysis methods for RNA-seq data have been developed. They fall into two approaches, depending on whether a reference genome is taken into account [4,5,6]. The reference-based (RB) transcriptome analysis method aligns the sequenced reads to a pre-existing reference genome and then assembles overlapping alignments into transcripts. In contrast, the reference-free (RF) de novo transcriptome analysis method assembles sequenced reads directly into transcripts by exploiting the high redundancy and overlap among reads, without using a reference genome.
In recent years, many bioinformatics studies have evaluated the advantages and disadvantages of tools implementing either the RB or RF transcriptome analysis method and have provided guidance for selecting easy-to-handle, reliable, and objective tools. Currently, there are several distinct strategies for assessing the methodological quality of transcriptome assembly. Using a reference genome, multiple RB approaches have been compared, and their performance has been found to vary with genome complexity, which can complicate correct alignment because of variance arising from polymorphisms, intron signals, incomplete annotation, and alternative splicing. Therefore, methods that effectively handle both low- and high-complexity regions are required. Without using any reference genome, Holzer and Marz assessed 10 reference-free methods using 9 RNA-seq datasets from 5 different species. The performance of each method showed species- and data-dependent differences, and no gold-standard tool achieved the best results for every type of RNA-seq dataset. Intriguingly, it has been suggested that when a well-annotated genome from a closely related species is available, this neighbor genome could be used to guide de novo transcriptome assembly, albeit with caution [9, 10]. Finally, comparisons of differential gene expression analysis results obtained by the RB and RF methods have highlighted that 70–80% of the differentially expressed genes are shared [11,12,13].
Due to the widespread availability and affordability of high-throughput next-generation sequencing technologies, the genomes of numerous species have been sequenced. However, most non-model species still lack a high-quality reference genome; thus, the number of studies that characterize transcriptomes by RNA-seq has increased rapidly and continues to grow, particularly in genetics and genomics. In these studies, RF is the only method available, and according to previous reports, it can very effectively complement the results of genome-based transcriptome analyses in terms of the transcriptome repertoire [14,15,16,17,18]. Fragmented and misassembled transcripts, which result from intrinsic methodological issues of RNA-seq data such as low sequencing accuracy, incomplete gene coverage, and chimerism [6, 19], can negatively affect the accurate and reproducible quantification of gene expression levels. Nevertheless, to the best of our knowledge, no previous study has provided a comprehensive evaluation of the consistency of expression levels between RF and RB approaches.
In the present study, we evaluated whether gene expression profiles obtained by RF and RB approaches are generally comparable. Using six human RNA-seq datasets, we observed that the RF analysis could predict, on average, up to 80% of the expressed genes; additionally, gene expression levels showed a significant positive correlation with those of the RB analysis. As expected, owing to the intrinsic methodological issues of the RF method, the overall gene expression levels were underestimated by approximately 30–44%. We found that this disparity between gene expression levels obtained by RF and RB methods could partly be attributed to the proportion of genes that were lowly expressed, had long coding sequences (CDSs), or belonged to large gene families.
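To make the three quantities in this comparison concrete (the fraction of expressed genes recovered, the correlation of expression levels, and the overall underestimation), the following minimal Python sketch computes them for a pair of per-gene expression tables. It is an illustration under stated assumptions, not the pipeline used in this study: the function names, the expression threshold `min_expr`, the use of Pearson correlation on log2-transformed values, the definition of underestimation as one minus the ratio of summed expression, and the toy values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def compare_expression(rb, rf, min_expr=1.0):
    """Compare RB and RF expression estimates keyed by gene ID.

    Returns (recovery, correlation, underestimation), where
    - recovery: fraction of RB-expressed genes also expressed in RF,
    - correlation: Pearson r of log2(x + 1) values over recovered genes,
    - underestimation: 1 - (summed RF / summed RB) over RB-expressed genes.
    The threshold and these definitions are assumptions for illustration.
    """
    expressed = {g for g, v in rb.items() if v >= min_expr}
    recovered = sorted(g for g in expressed if rf.get(g, 0.0) >= min_expr)
    recovery = len(recovered) / len(expressed)

    log_rb = [math.log2(rb[g] + 1) for g in recovered]
    log_rf = [math.log2(rf[g] + 1) for g in recovered]
    correlation = pearson(log_rb, log_rf)

    total_rb = sum(rb[g] for g in expressed)
    total_rf = sum(rf.get(g, 0.0) for g in expressed)
    underestimation = 1 - total_rf / total_rb
    return recovery, correlation, underestimation


# Toy values only; these are not data from the study.
rb_toy = {"geneA": 100.0, "geneB": 50.0, "geneC": 10.0, "geneD": 4.0, "geneE": 0.2}
rf_toy = {"geneA": 70.0, "geneB": 30.0, "geneC": 8.0, "geneD": 0.0}

recovery, correlation, underestimation = compare_expression(rb_toy, rf_toy)
print(f"recovered: {recovery:.2f}, r: {correlation:.3f}, "
      f"underestimated by: {underestimation:.1%}")
```

In this sketch, a gene missing from the RF table is treated as having zero expression, so unrecovered genes contribute to the underestimation. A rank-based correlation could be substituted if the expression values are strongly skewed.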