PΨFinder: a practical tool for the identification and visualization of novel pseudogenes in DNA sequencing data

Abstract

Background

Processed pseudogenes (PΨgs) are disabled gene copies that are transcribed and may affect expression of paralogous genes. Moreover, their insertion in the genome can disrupt the structure or the regulatory region of a gene, affecting its expression level. These events have been identified as mutations occurring during cancer development. Being able to identify PΨgs and their location will therefore improve diagnostic testing, not only in cancer but also in inherited disorders.

Results

We have implemented PΨFinder (P-psy-finder), a tool that identifies PΨgs, annotates known ones and predicts their insertion site(s) in the genome. The tool screens alignment files and provides user-friendly summary reports and visualizations. To demonstrate its applicability, we scanned 218 DNA samples from patients screened for hereditary colorectal cancer. We detected 423 PΨgs distributed in 96% of the samples, derived from 7 different parent genes. Among these, we confirmed the well-known insertion site of the SMAD4-PΨg within the last intron of the SCAI gene in one sample. The ubiquitous CBX3-PΨg, present in 82.6% of the samples, was found inserted in reverse orientation in the second intron of the C15ORF57 gene.

Conclusions

PΨFinder is a tool that can automatically identify novel PΨgs from DNA sequencing data and determine their location in the genome with high sensitivity (95.92%). It generates high quality figures and tables that facilitate the interpretation of the results and can guide the experimental validation. PΨFinder is a complementary analysis to any mutational screening in the identification of disease-causing mutations within cancer and other diseases.

Background

Pseudogenes (Ψgs) are abundant and ubiquitous protein-coding gene copies that are originally derived from functional genes [1]. They were widely regarded as “junk” DNA for many years [2]. However, there is now evidence of a handful of functional Ψgs [3,4,5,6,7]. For instance, some can interfere with their parental counterparts in tumorigenesis by retaining or gaining protein-coding properties [7,8,9]. Cheetham et al. [10] have recently compiled a list of such functional Ψgs.

Depending on their mechanism of origin, Ψgs can be classified in three major classes: unitary, unprocessed and processed. Unitary Ψgs are derived from an ancestral protein-coding gene that has lost its protein-coding potential due to spontaneous mutations [11, 12]. Unprocessed Ψgs originate from gene duplications that accumulate mutations, preventing their translation. Processed pseudogenes (PΨgs), on the other hand, arise from the reverse transcription (retrotransposition) and integration of a processed mRNA into a new genomic location [13]. PΨgs lack the 5′ promoter sequence as well as any introns; however, they exhibit a 3′ polyA tail and duplications of varying length at their insertion site [14]. Recently, a new group of PΨgs has been identified in human and mouse, which lack the 3′ polyA tail and are derived by retrotranscription of circular RNAs (circRNA) [15].

PΨgs are the most abundant type of Ψgs in the human genome, with estimates ranging between ~ 8000 and 14,112 [11, 16,17,18]. As of today, GENCODE [12], the reference annotation for the human and mouse genomes, has annotated 10,822 human PΨgs (Release 37, GRCh38.p13) [19].

Processed pseudogenes and cancer

Next-generation sequencing has contributed to the discovery of a large number of PΨgs and further studies have confirmed their involvement in the development, progression and prognosis of certain diseases, including cancer. A comprehensive list of PΨgs participating in the pathogenesis of different diseases has been compiled by Chen et al. [20]. Moreover, the detection of transcribed PΨgs has demonstrated that certain PΨgs are expressed only in cancer samples, either in a specific cancer or in multiple cancers [21, 22]. For example, the ATP8A2-PΨg has been restricted to breast tumors with luminal histology, showing a potential oncogenic nature [21]. In lung adenocarcinoma, the PTPN12-PΨg induces the removal of the promoter of MGA, a likely tumor suppressor gene [23]. In gastric cancer, POU5F1B, a PΨg adjacent to MYC, is a prognostic marker [24], while in prostate cancer, the fusion of the KLKP1-PΨg and the KLK4 gene may be a potential biomarker in routine screening [25, 26].

Several PΨg integrations have also been identified for which no clear function or correlation to disease is yet understood. For instance, the SMAD4-PΨg is a confounding element in quantitative results and increases erroneous variant calls. In addition, the integration of the SMAD4-PΨg in the SCAI gene has been corroborated in hereditary cancer-predisposition cases [27]. Although SCAI is characterized as having a suppressive effect on tumor cell invasiveness, it has not been determined whether SCAI expression is hindered by the SMAD4-PΨg [28].

Detection of processed pseudogenes

Ψgs were often discovered as a by-product of gene sequencing or PCR experiments. With the advent of whole genome sequencing projects, computational approaches have aided in their identification and annotation, relying on the specific features of the Ψgs, such as level of sequence homology and completeness relative to a parent gene, lack of introns, ratio of non-synonymous to synonymous substitution rates (KA/KS), occurrence of a polyadenine tail and the existence of frame disruptions, among others [29]. In eukaryotic genomes, some methods rely on homology-based approaches. These include in-house pipelines within genome-wide surveys [16, 30] and tools such as PseudoPipe [31], retroFinder [17] and PPFINDER [32], which unfortunately are not publicly available or are based on deprecated tools. Another type of algorithm relies on the information from mapped reads. The bioinformatics method developed by Cooke et al. [23] detects somatically acquired Ψgs by aligning paired-end sequencing data to the genome and the transcriptome; nevertheless, it is not publicly available. More recently, sideRETRO [33] was developed as a tool that focuses on the detection of de novo somatic and polymorphic insertions of PΨgs using a reference genome as well as a reference for the transcriptome. It applies a non-parametric density-based clustering algorithm and compiles the results in VCF format.

Today, sequencing data analyses are commonly performed by technically experienced staff, resulting in the use of formats that are burdensome to handle for researchers with basic computational skills. To aid in the prediction and interpretation of novel PΨg candidates, we present PΨFinder (P-psy-finder), a bioinformatics pipeline that rapidly screens alignments of DNA sequencing data to detect such events. It creates a simple table that can be sorted and filtered in any spreadsheet program, as well as graphical representations that, besides providing a visual confirmation of the candidates, can be used to guide the experimental validation and the characterization of the genomic arrangement of such candidates. In addition, PΨFinder provides information about known PΨgs found in the analyzed samples and can be used with any organism whose genome is available.

Implementation

PΨFinder overview

PΨFinder aims to detect PΨgs within DNA sequencing data and predict their insertion sites. PΨFinder is written in Python (3.6) [34] and requires STAR (2.7.7a) [35], SAMtools (1.11) [36], BEDTools (v2.30.0) [37] and R (4.0.3) [38].

The overall workflow is shown in Fig. 1A. For a given organism, PΨFinder takes fastq files as input and aligns them to the corresponding reference genome using STAR, a splice-aware aligner [35]. Alternatively, alignment files can be supplied as input. To provide evidence of PΨgs in the sample, spliced reads (SRs) across known exon-exon junctions are selected and clustered (Fig. 1B). To identify the insertion sites of the PΨg candidates, the pipeline extracts two pieces of information from the alignment files: (1) chimeric read pairs (CPs), pairs whose mates align to different chromosomes or at larger distances than expected, and (2) chimeric reads (CRs), soft-clipped reads that align to two different locations (Fig. 1C). The overlap between the PΨg candidates, CPs and CRs determines the insertion site of each PΨg. As output, PΨFinder provides summary reports in text and html formats as well as visualizations of the predicted insertion sites (Additional file 1). Individual PΨgs and their insertion sites can be plotted in either linear or circular format (Fig. 1A). As a complementary result, PΨFinder also provides a list of detected known PΨgs.
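The three read categories described above can be illustrated with a minimal sketch that labels a single alignment record from its SAM fields. This is not the PΨFinder implementation: the function names, the `min_clip` and `max_insert` thresholds are assumptions chosen for illustration, and the check against known exon-exon junctions that the pipeline performs is omitted.

```python
import re

# Splits a CIGAR string such as "50M1200N40M" into (length, operation) pairs.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_ops(cigar):
    """Return the CIGAR string as a list of (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def is_spliced(cigar):
    """A read containing an 'N' (skipped region) operation spans a splice gap."""
    return any(op == "N" for _, op in cigar_ops(cigar))


def is_soft_clipped(cigar, min_clip=10):
    """A read with a soft-clipped ('S') segment of at least min_clip bases.

    min_clip is a hypothetical threshold, not a PΨFinder parameter.
    """
    return any(op == "S" and n >= min_clip for n, op in cigar_ops(cigar))


def is_chimeric_pair(chrom, mate_chrom, template_length, max_insert=1000):
    """Mates on different chromosomes, or further apart than max_insert.

    In SAM, a mate reference of "=" means "same as this read" and "*" means
    the mate position is unavailable. max_insert is a hypothetical cut-off.
    """
    if mate_chrom == "*":
        return False
    if mate_chrom not in ("=", chrom):
        return True
    return abs(template_length) > max_insert


def classify_read(chrom, cigar, mate_chrom, template_length):
    """Return the set of labels (SR, CR, CP) that apply to one alignment."""
    labels = set()
    if is_spliced(cigar):
        labels.add("SR")
    if is_soft_clipped(cigar):
        labels.add("CR")
    if is_chimeric_pair(chrom, mate_chrom, template_length):
        labels.add("CP")
    return labels
```

For instance, `classify_read("chr18", "50M1200N40M", "=", 300)` labels a read as spliced only, while a soft-clipped read whose mate maps to another chromosome receives both the CR and CP labels.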

Fig. 1

PΨFinder workflow. A Overall workflow of PΨFinder. If the input data are fastq files, these will be aligned using STAR. The resulting BAM file (input as default) is annotated with known PΨgs. Spliced reads, chimeric pairs and chimeric reads are extracted and combined to detect novel PΨgs and predict their insertion sites. Results are displayed as text and html files. Different plots may be generated, including circular and linear layouts, scatter plots and summary dotplots. B Detection of a PΨg. Blue rectangles depict coding regions. Reads are depicted as arrows. Gray arrows refer to spliced reads, with dotted lines showing the splicing event. C Detection of the PΨg-insertion site. Blue arrows show reads mapping to the PΨg or a portion of the PΨg, while red arrows show reads mapping to the PΨg insertion site or a portion of the PΨg insertion site. Solid gray lines connect the two reads of a pair. PΨFinder reports positions marked with dashed arrows. CR chimeric read, CP chimeric pair

Results

Screening for PΨgs in blood samples using PΨFinder

To demonstrate the use of PΨFinder, we scanned DNA sequencing data from 218 human blood samples. These samples were initially sequenced and analyzed (data not shown) using a custom-designed panel that included genes associated with hereditary colorectal cancer (Additional file 2, Additional file 3: Table S1). The comprehensive panel covers 28 genes and their promoter regions [39]. As a complementary analysis, these samples were scanned with PΨFinder using the human genome (hg19) as reference.

We detected a total of 423 PΨgs distributed across 209 samples. The predictions included PΨgs of only seven parent genes: BMPR1A, CBX3, DHFR, HNRNPC, POLE, PTEN and SMAD4 (Table 1). PΨFinder detected only one PΨg in 34% of the positive samples, while 2 or more PΨgs were predicted in the rest of them (Additional file 3: Table S2). In terms of their genomic insertion site, the majority of the PΨgs were found either in intronic (54%) or intergenic (45%) regions, while only 1% had evidence of being inserted within an exon (BMPR1A-PΨg in 6 samples and PTEN-PΨg in 2 samples). The most common PΨgs detected were CBX3-PΨg and BMPR1A-PΨg, found in 180 and 155 samples, respectively. PTEN-PΨg and DHFR-PΨg were detected in 51 and 29 samples, respectively, while 6 samples contained HNRNPC-PΨg. POLE-PΨg and SMAD4-PΨg were found in one sample each.
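The counts reported above can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers stated in this section; the variable names are illustrative.

```python
# Number of samples in which a PΨg of each parent gene was detected (from the text).
samples_per_parent_gene = {
    "CBX3": 180,
    "BMPR1A": 155,
    "PTEN": 51,
    "DHFR": 29,
    "HNRNPC": 6,
    "POLE": 1,
    "SMAD4": 1,
}

total_samples = 218
positive_samples = 209

# Each detection of a parent gene in a sample counts as one PΨg.
total_pseudogenes = sum(samples_per_parent_gene.values())

# Percentage of samples with at least one PΨg, and with the CBX3-PΨg.
positive_pct = round(100 * positive_samples / total_samples, 1)
cbx3_pct = round(100 * samples_per_parent_gene["CBX3"] / total_samples, 1)
```

Here `total_pseudogenes` equals the 423 PΨgs reported, `positive_pct` corresponds to the 96% of positive samples quoted in the Abstract, and `cbx3_pct` to the 82.6% reported for the CBX3-PΨg.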

Table 1 Summary of identified processed pseudogene candidates across all samples analyzed

The detection of the insertion sites of the novel PΨgs relies on CPs and CRs. Having both as evidence is not essential (one suffices to narrow down the insertion region), but together they strengthen the accuracy of the predicted insertion site (Additional file 3: Table S4). In most of the insertion sites detected, only CPs (66.3%) or only CRs (25.2%) gave supporting evidence, while a small percentage (8.5%) had support from both CPs and CRs. Among these well-supported insertion sites we found (1) CBX3-PΨg located within the second intron of C15ORF57 (chr15:40854180–40854180, in 119 samples), (2) HNRNPC-PΨg detected in the intergenic region between LINC02541 and MARCKS (chr6:114017523–114017528, in 3 samples), (3) BMPR1A-PΨg inserted in the intergenic region between PCGF5 and HECTD2 (chr10:93083258–93083258, in 1 sample) and (4) SMAD4-PΨg located within intron 18 of SCAI (chr9:127732713–127732715, in 1 sample).
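How CP and CR evidence might be combined into a supported insertion site can be sketched as a simple positional grouping. This is an illustrative assumption, not the algorithm used by PΨFinder: the `window` size, the function names and the CP position in the example are hypothetical, whereas the CR coordinates are the SMAD4-PΨg and CBX3-PΨg breakpoints quoted above.

```python
def cluster_breakpoints(evidence, window=500):
    """Group (chrom, pos, kind) evidence lying within `window` bp of each other.

    kind is "CP" (chimeric pair) or "CR" (chimeric read). The window size
    is a hypothetical value chosen for illustration.
    """
    clusters = []
    for chrom, pos, kind in sorted(evidence):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and pos - last["end"] <= window:
            # Extend the current cluster and record the evidence type.
            last["end"] = pos
            last["kinds"].add(kind)
        else:
            clusters.append(
                {"chrom": chrom, "start": pos, "end": pos, "kinds": {kind}}
            )
    return clusters


def support(cluster):
    """Summarise the evidence types of a cluster, e.g. "CP", "CR" or "CP+CR"."""
    return "+".join(sorted(cluster["kinds"]))


# Example evidence; the CP position on chr9 is made up for the illustration.
evidence = [
    ("chr9", 127732713, "CR"),
    ("chr9", 127732500, "CP"),
    ("chr9", 127732715, "CR"),
    ("chr15", 40854180, "CP"),
]
clusters = cluster_breakpoints(evidence)
```

In this example the chr9 evidence collapses into one cluster supported by both CPs and CRs, while the chr15 site is supported by a CP only.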

To validate these results, we selected PΨgs that, besides having evidence from both CPs and CRs, were inserted within an exon or an intronic region. Such insertions could provide evidence of a disease-causing mutation if the coding region of the disturbed gene were altered. CBX3-PΨg and SMAD4-PΨg complied with these criteria and were selected to experimentally determine their insertion sites using Sanger sequencing (Additional file 2).

The ubiquitous CBX3-PΨg is inserted in reverse orientation in the second intron of C15ORF57

RNA-seq data from lymphoblast tissue have shown evidence that CBX3 is expressed as a chimera with C15ORF57 [40, 41]. This chimera has also been detected in multiple non-diseased tissues (tonsils, placenta, liver, skeletal muscle, adrenal gland and skin) from the Genotype Tissue Expression (GTEx) dataset [42] as well as in hepatocellular carcinoma [41] and glioblastoma [43]. In this study we present the DNA breakpoints of CBX3 and C15ORF57 as predicted with PΨFinder (Fig. 2A). The experimental validation showed an unknown insertion (ATTTTTTTTTTTAAAGA) and duplicated nucleotides (TCAGGAAATAT) in one of the breakpoints, while no aberrations were seen in the other breakpoint. CBX3-PΨg was found to be inserted in reverse orientation, in the same reading orientation as C15ORF57. This makes the transcription of this fusion gene possible (Fig. 2B). The fact that CBX3-PΨg is recurrent (found in 82.6% of the samples) may suggest an effect on the predisposition to colorectal cancer development. However, in previous studies the CBX3-C15ORF57 fusion was found not only in cancerous tissues, but also in normal or noncancerous samples [41, 42]. Although experimental validations are needed, for example testing whether silencing the fusion decreases cell proliferation and cell motility in specific cell populations, one might suggest that the expression level of this fusion gene has an impact on cancer development.

Fig. 2

Experimental validation of CBX3 processed pseudogene identified with PΨFinder. A Circos plot showing the coverage over CBX3 and its insertion site in C15ORF57 for sample113. The outer heatmap displays the coverage over the entire CBX3 parent gene (only exons are depicted); darker color indicates higher coverage. The inner histogram shows the coverage over the exon-exon junctions, suggesting the presence of a PΨg. The red outermost histogram displays the coverage across the predicted insertion site in C15ORF57, shown by the arrow. B DNA sequence over the breakpoints of CBX3-PΨg inserted in intron 2 of the C15ORF57 gene. Genomic coordinates refer to the human reference genome build hg19. Red arrows point out the actual breakpoints. Sequence for breakpoint 1 (PCR 1) includes an unknown insertion (ATTTTTTTTTTTAAAGA) and duplicated nucleotides (TCAGGAAATAT). Sequence for breakpoint 2 (PCR 2) shows no aberrations. CBX3-PΨg is inserted in reverse orientation on the minus strand, whereas its parent gene is read from the plus strand; the PΨg therefore has the same reading orientation as the C15ORF57 gene (see Additional file 2 for further details)

The well-known insertion site of the SMAD4-PΨg within the last intron of SCAI

Deleterious mutations in SMAD4 have been shown to result in pancreatic cancer [44], juvenile polyposis syndrome [45], hereditary hemorrhagic telangiectasia syndrome [46] and Myhre syndrome [47]. The presence of the SMAD4-PΨg has interfered with diagnostic analyses based on clinical sequencing applications, creating false-positive results in 0.24–0.26% of the cases [27, 28]. Thus, its identification is crucial to reduce this confounding effect. We validated the breakpoints of SMAD4 and SCAI as predicted with PΨFinder (Fig. 3A). In one of the breakpoints, we identified a deletion of three nucleotides (GTC), while in the other breakpoint a polyA tail and a duplication of four nucleotides (TTTC) were confirmed [28] (Fig. 3B). Because SCAI has a suppressive effect on tumor cell invasiveness, a disruption of the SCAI gene might lead to an upregulation of downstream genes involved in cancer development. Whether the integration of the SMAD4-PΨg disrupts the function of SCAI needs to be further evaluated experimentally.

Fig. 3

Experimental validation of SMAD4 processed pseudogene identified with PΨFinder. A Circos plot showing the coverage over SMAD4 and its insertion site in SCAI for sample220. The outer heatmap displays the coverage over the entire SMAD4 parent gene (only exons are depicted); darker color indicates higher coverage. The inner histogram shows the coverage over the exon-exon junctions, suggesting the presence of a PΨg. The red outermost histogram displays the coverage across the predicted insertion site in SCAI, shown by the arrow. B DNA sequence over the breakpoints of SMAD4-PΨg inserted in intron 18 of the SCAI gene. Genomic coordinates refer to the human reference genome build hg19. Red arrows point out the actual breakpoints. Sequence for breakpoint 1 (PCR 1) includes a deletion of three nucleotides (GTC). Sequence for breakpoint 2 (PCR 2) includes a polyA tail and a duplication of four nucleotides (TTTC). The SMAD4-PΨg is inserted in the minus strand, in the same reading orientation as its parent gene and in the opposite reading orientation to the SCAI gene (see Additional file 2 for further details)

Detection level and accuracy

Sequencing depth plays an important role in any kind of computational prediction. To establish the detection level of PΨFinder, the four samples used for experimental validation were downsampled to 0.1, 0.5, 1.0, 2.5, 5.0 and 7.0 M paired reads using seqtk (1.0) [48] (Additional file 3: Table S3). The resulting analysis with PΨFinder determined that predictions obtained from samples with a sequencing depth of 5 M reads, an average coverage of at least 144X and support from both CPs and CRs can be deemed true positive PΨg-insertion sites (Additional file 3: Table S4). Insertion site predictions based on samples with 2.5 M reads and an average coverage between 72 and 128X start to lose evidence from either CPs or CRs. Thus, at this sequencing depth or “gray-zone” area, we recommend further inspection of the resulting predictions. Samples with 1 M sequencing reads or less and an average coverage of less than 52X over the panel are outside the detection level of PΨFinder. Although we cannot entirely dismiss these predictions, they should be treated with caution. Of the 218 colorectal cancer samples analyzed in this work, only one sample is within the “gray zone”, with an average coverage of 120.71X (Additional file 3: Table S1); all others lie above the confidence prediction level (Fig. 4A). Considering this and the positive experimental validations, all detected PΨgs that have supporting evidence from both CPs and CRs are most likely to be true.
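The coverage thresholds described above can be expressed as a small decision rule. The sketch below is an interpretation of the text rather than part of PΨFinder: the function name and labels are hypothetical, and coverage values falling between the stated ranges (52–72X and 128–144X) are assigned to the gray zone as an assumption.

```python
def detection_level(mean_coverage, has_cp, has_cr):
    """Classify an insertion-site prediction by coverage and evidence.

    Thresholds follow the downsampling analysis in the text: below 52X is
    outside the detection level, and at 144X or more with both CP and CR
    support a prediction is deemed a true positive. Everything in between,
    or lacking one evidence type, is flagged for manual inspection
    (an assumption made for this sketch).
    """
    if mean_coverage < 52:
        return "below detection level"
    if mean_coverage >= 144 and has_cp and has_cr:
        return "confident"
    return "gray zone"
```

With this rule, the single sample at 120.71X average coverage falls in the gray zone, in agreement with the text.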

Fig. 4

Detection level and Performance evaluation of PΨFinder. A Detection level based on sequencing depth. Samples with an average coverage of 144X or more (gray dashed line) are confident PΨg-insertion site predictions as determined by downsampling analysis (blue triangles, Additional file 3: Table S4). Samples with an average coverage of 95 or less (red dashed line) are below the detection level of PΨFinder. Samples between these thresholds are likely to be true predictions and manual inspection of the results is required. Biological samples are shown as red circles. B Performance evaluation. A total of 117 samples were analyzed. TP, true positives; TN, true negatives; FP, false positives; FN, false negatives; F1 score, harmonic mean of the precision and sensitivity; FDR, false discovery rate; TPR, true positive rate; PPV, positive prediction value

To investigate the overall performance of PΨFinder we simulated 117 samples, each with a different PΨg inserted in a random position within the genome (Additional file 3: Table S5). An in-house script based on wgsim (0.3.0) [49] was developed, defining the simulated error rate (2%) of the sequencing reads as well as their outer distance (500) and read length (90 bp). All samples contained 5 M simulated reads, yielding 98.8% mapped reads with a mean coverage of 473X. These samples were analyzed with PΨFinder and sideRETRO (both with default values). The performance of both tools is remarkably similar (Fig. 4B) and their running time on the simulated data was on average one and two minutes per sample, respectively. Nevertheless, an advantage of using PΨFinder is the graphical visualization that it produces (Additional file 1), which aids in the confirmation of the predictions as well as in their experimental validation (Figs. 2, 3). Although the use of standard formats must be encouraged, e.g. sideRETRO reports its results as VCF files, these formats are still not easy to examine for researchers with basic computational skills. The html report and simple tabular format that PΨFinder generates are therefore more convenient and user-friendly. Moreover, our tool reports known PΨgs that are found during the analysis.
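The performance measures listed in the caption of Fig. 4B follow from the confusion-matrix counts. The sketch below shows the standard definitions; the function name is illustrative and the counts in the example are hypothetical, not the values obtained in this study.

```python
def performance(tp, fp, fn):
    """Compute TPR, PPV, FDR and F1 score from confusion-matrix counts.

    TPR (sensitivity) = TP / (TP + FN)
    PPV (precision)   = TP / (TP + FP)
    FDR               = FP / (TP + FP) = 1 - PPV
    F1                = harmonic mean of PPV and TPR
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) else 0.0
    return {"TPR": tpr, "PPV": ppv, "FDR": fdr, "F1": f1}


# Hypothetical counts, used only to exercise the formulas.
metrics = performance(tp=90, fp=5, fn=10)
```

The F1 score can equivalently be written as 2TP / (2TP + FP + FN), which is a convenient check on the computation.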

Conclusions

PΨFinder is a tool that can detect novel PΨgs from DNA sequencing data and determine their location in the genome. Here we demonstrated its application by scanning 218 DNA blood samples from patients suspected of an inherited form of colorectal cancer, and identified 423 PΨgs from seven parent genes.

Among the predicted PΨgs, we identified the ubiquitous CBX3-PΨg, which has been shown to form a chimeric transcript with C15ORF57 [40] and has been associated with glioblastoma [43] and hepatocellular cancer [41]. We validated its insertion site within the second intron of C15ORF57 and showed that CBX3-PΨg is inserted in reverse orientation. Although further expression and functional analyses are required, CBX3-C15ORF57 could also be involved in the development of colorectal cancer.

We also detected the SMAD4-PΨg and validated its insertion site within intron 18, the last intron, of SCAI. SMAD4-PΨg is a known confounding element in the mutation analysis of next generation sequencing data in patients with juvenile polyposis syndrome or combined juvenile polyposis/hereditary hemorrhagic telangiectasia [27]. Thus, its identification is important to determine its relevance.

PΨFinder is a tool whose comprehensive and user-friendly results can aid in the identification of PΨgs and complement any mutational screening that aims to identify mutations occurring during the development of cancer and other diseases.

Availability and requirements

Project name: Novel processed pseudogenes detection tool.

Project home page: https://github.com/bcfgothenburg/SSF.

Operating system(s): Linux, Mac OS.

Programming language: Python (3.6), bash.

Other requirements: STAR (2.7.7a), SAMtools (1.11), BEDTools (v2.30.0), R (4.0.3).

License: GNU General Public License, version 3.0 (GPLv3).

Any restrictions to use by non-academics: None (except the ones stated in GPLv3).

Availability of data and materials

The code for PΨFinder, the user guide and test dataset are available in GitHub (https://github.com/bcfgothenburg/SSF). The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

PΨgs: Processed pseudogenes

PΨFinder: P-psy-Finder

circRNA: Circular RNAs

KA/KS: Ratio of the non-synonymous to synonymous substitution rates

HAVANA: Human and Vertebrate Analysis and Annotation

SR: Spliced read

CP: Chimeric pairs

CR: Chimeric reads

References

  1. Xiao-Jie L, Ai-Mei G, Li-Juan J, Jiang X. Pseudogene in cancer: real functions and promising signature. J Med Genet. 2015;52(1):17–24.

  2. Sen K, Ghosh TC. Pseudogenes and their composers: delving in the ‘debris’ of human genome. Brief Funct Genomics. 2013;12(6):536–47.

  3. Wen Y-Z, Zheng L-L, Qu L-H, Ayala FJ, Lun Z-R. Pseudogenes are not pseudo any more. RNA Biol. 2012;9(1):27–32.

  4. McCarrey JR, Riggs AD. Determinator–inhibitor pairs as a mechanism for threshold setting in development: a possible function for pseudogenes. Proc Natl Acad Sci. 1986;83(3):679–83.

  5. Muro EM, Andrade-Navarro MA. Pseudogenes as an alternative source of natural antisense transcripts. BMC Evol Biol. 2010;10(1):338.

  6. Korneev SA, Park JH, O’Shea M. Neuronal expression of neural nitric oxide synthase (nNOS) protein is suppressed by an antisense RNA transcribed from an NOS pseudogene. J Neurosci. 1999;19(18):7711–20.

  7. Ishiguro T, Sato A, Ohata H, Sakai H, Nakagama H, Okamoto K. Differential expression of nanog1 and nanogp8 in colon cancer cells. Biochem Biophys Res Commun. 2012;418(2):199–204.

  8. Poliseno L, Salmena L, Zhang J, Carver B, Haveman WJ, Pandolfi PP. A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature. 2010;465(7301):1033–8.

  9. Bischof JM, Chiang AP, Scheetz TE, Stone EM, Casavant TL, Sheffield VC, Braun TA. Genome-wide identification of pseudogenes capable of disease-causing gene conversion. Hum Mutat. 2006;27(6):545–52.

  10. Cheetham SW, Faulkner GJ, Dinger ME. Overcoming challenges and dogmas to understand the functions of pseudogenes. Nat Rev Genet. 2020;21(3):191–201.

  11. Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu XJ, Harte R, Balasubramanian S, Tanzer A, Diekhans M, et al. The GENCODE pseudogene resource. Genome Biol. 2012;13(9):R51.

  12. Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, Mudge JM, Sisu C, Wright J, Armstrong J, et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 2018;47(D1):D766–73.

  13. Esnault C, Maestre J, Heidmann T. Human LINE retrotransposons generate processed pseudogenes. Nat Genet. 2000;24(4):363–7.

  14. Vanin EF. Processed pseudogenes: characteristics and evolution. Annu Rev Genet. 1985;19(1):253–72.

  15. Dong R, Zhang X-O, Zhang Y, Ma X-K, Chen L-L, Yang L. CircRNA-derived pseudogenes. Cell Res. 2016;26(6):747–50.

  16. Zhang Z, Harrison PM, Liu Y, Gerstein M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 2003;13(12):2541–58.

  17. Baertsch R, Diekhans M, Kent WJ, Haussler D, Brosius J. Retrocopy contributions to the evolution of the human genome. BMC Genomics. 2008;9:466.

  18. Navarro FC, Galante PA. RCPedia: a database of retrocopied genes. Bioinformatics. 2013;29(9):1235–7.

  19. https://www.gencodegenes.org/human/.

  20. Chen X, Wan L, Wang W, Xi W-J, Yang A-G, Wang T. Re-recognition of pseudogenes: from molecular to clinical applications. Theranostics. 2020;10(4):1479–99.

  21. Kalyana-Sundaram S, Kumar-Sinha C, Shankar S, Robinson DR, Wu YM, Cao X, Asangani IA, Kothari V, Prensner JR, Lonigro RJ, et al. Expressed pseudogenes in the transcriptional landscape of human cancers. Cell. 2012;149(7):1622–34.

  22. Han L, Yuan Y, Zheng S, Yang Y, Li J, Edgerton ME, Diao L, Xu Y, Verhaak RGW, Liang H. The Pan-Cancer analysis of pseudogene expression reveals biologically and clinically relevant tumour subtypes. Nat Commun. 2014;5(1):3963.

  23. Cooke SL, Shlien A, Marshall J, Pipinikas CP, Martincorena I, Tubio JMC, Li Y, Menzies A, Mudie L, Ramakrishna M, et al. Processed pseudogenes acquired somatically during cancer development. Nat Commun. 2014;5(1):3644.

  24. Hayashi H, Arao T, Togashi Y, Kato H, Fujita Y, De Velasco MA, Kimura H, Matsumoto K, Tanaka K, Okamoto I, et al. The OCT4 pseudogene POU5F1B is amplified and promotes an aggressive phenotype in gastric cancer. Oncogene. 2015;34(2):199–208.

  25. Lai J, Lehman ML, Dinger ME, Hendy SC, Mercer TR, Seim I, Lawrence MG, Mattick JS, Clements JA, Nelson CC. A variant of the KLK4 gene is expressed as a cis sense-antisense chimeric transcript in prostate cancer cells. RNA. 2010;16(6):1156–66.

  26. Chakravarthi BVSK, Dedigama-Arachchige P, Carskadon S, Sundaram SK, Li J, Wu K-HH, Chandrashekar DS, Peabody JO, Stricker H, Hwang C, et al. Pseudogene associated recurrent gene fusion in prostate cancer. Neoplasia. 2019;21(10):989–1002.

  27. Millson A, Lewis T, Pesaran T, Salvador D, Gillespie K, Gau CL, Pont-Kingdon G, Lyon E, Bayrak-Toydemir P. Processed pseudogene confounding deletion/duplication assays for SMAD4. Journal of Molecular Diagnostics. 2015;17(5):576–82.

  28. Watson CM, Camm N, Crinnion LA, Antanaviciute A, Adlard J, Markham AF, Carr IM, Charlton R, Bonthron DT. Characterization and genomic localization of a SMAD4 processed pseudogene. J Mol Diagn. 2017;19(6):933–40.

  29. Zhang Z, Gerstein M. Large-scale analysis of pseudogenes in the human genome. Curr Opin Genet Dev. 2004;14(4):328–35.

  30. Torrents D, Suyama M, Zdobnov E, Bork P. A genome-wide survey of human pseudogenes. Genome Res. 2003;13(12):2559–67.

  31. Zhang Z, Carriero N, Zheng D, Karro J, Harrison PM, Gerstein M. PseudoPipe: an automated pseudogene identification pipeline. Bioinformatics. 2006;22(12):1437–9.

  32. van Baren MJ, Brent MR. Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res. 2006;16(5):678–85.

  33. Miller TLA, Orpinelli F, Buzzo JLL, Galante PAF. sideRETRO: a pipeline for identifying somatic and polymorphic insertions of processed pseudogenes or retrocopies. Bioinformatics. 2020;13:e1005567.

  34. Python Software Foundation. Python language reference. 3.6 edn.

  35. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.

  36. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

  37. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.

  38. R Core Team. R: a language and environment for statistical computing. R Core Team; 2016.

  39. Rohlin A, Rambech E, Kvist A, Torngren T, Eiengard F, Lundstam U, Zagoras T, Gebre-Medhin S, Borg A, Bjork J, et al. Expanding the genotype-phenotype spectrum in hereditary colorectal cancer by gene panel testing. Fam Cancer. 2017;16(2):195–203.

  40. Schrider DR, Navarro FC, Galante PA, Parmigiani RB, Camargo AA, Hahn MW, de Souza SJ. Gene copy-number polymorphism caused by retrotransposition in humans. PLoS Genet. 2013;9(1):e1003242.

  41. Zhu C, Wu L, Lv Y, Guan J, Bai X, Lin J, Liu T, Yang X, Robson SC, Sang X, et al. The fusion landscape of hepatocellular carcinoma. Mol Oncol. 2019;13(5):1214–25.

  42. Singh S, Qin F, Kumar S, Elfman J, Lin E, Pham L-P, Yang A, Li H. The landscape of chimeric RNAs in non-diseased tissues and cells. Nucleic Acids Res. 2020;48(4):1764–78.

  43. Bao ZS, Chen HM, Yang MY, Zhang CB, Yu K, Ye WL, Hu BQ, Yan W, Zhang W, Akers J, et al. RNA-seq of 272 gliomas revealed a novel, recurrent PTPRZ1-MET fusion transcript in secondary glioblastomas. Genome Res. 2014;24(11):1765–73.

  44. MIM Number: 260350 [https://omim.org/].

  45. MIM Number: 174900 [https://omim.org/].

  46. MIM Number: 175050 [https://omim.org/].

  47. MIM Number: 139210 [https://omim.org/].

  48. seqtk [https://github.com/lh3/seqtk].

  49. wgsim [https://github.com/lh3/wgsim].

Acknowledgements

The authors thank the BRCAlab at the Division of Breastcancer-genetics, Dept. of Clinical Sciences, Lund, Lund University for performing sequencing of all samples and the Bioinformatics Core Facility at the University of Gothenburg for providing computational resources for data analysis and storage.

Funding

Open access funding provided by University of Gothenburg. This study was supported by the Swedish Foundation for Strategic Research (RIF14–0081) and the Assar Gabrielssons Foundation (FB19-56). The funding bodies did not play any role in the design of the study, or collection, analysis, or interpretation of data, or in writing the manuscript.

Author information

Contributions

SA participated in the design of the tool, implemented and tested the software, wrote the software manual and drafted the manuscript. FE collected the samples, carried out the experimental validation, contributed to the report layout and drafted the manuscript. AR provided expert feedback on the design, the evaluation of the results and the writing of the paper. MDL conceived the main idea, contributed to the overall software design, tested the software and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Marcela Dávila López.

Ethics declarations

Ethics approval and consent to participate

Ethical approval for the study was obtained from the regional ethics committee in Gothenburg (administration number 227-10).

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Fig. S1.

PΨFinder summary report and visualization aids.

Additional file 2.

Supplementary methods, experimental validation of SMAD4-SCAI and CBX3-C15ORF57 including primers and gel pictures.

Additional file 3: Tables S1–S5.

Sequencing data summary statistics and output from PΨFinder of case samples, downsampled data and simulated data. Benchmarking results.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Abrahamsson, S., Eiengård, F., Rohlin, A. et al. PΨFinder: a practical tool for the identification and visualization of novel pseudogenes in DNA sequencing data. BMC Bioinformatics 23, 59 (2022). https://doi.org/10.1186/s12859-022-04583-4

  • DOI: https://doi.org/10.1186/s12859-022-04583-4

Keywords

  • Processed pseudogenes
  • DNA sequencing
  • Colorectal cancer
  • SMAD4
  • CBX3
  • C15ORF57
  • SCAI