Analysis of the role of retrotransposition in gene evolution in vertebrates
© Yu et al; licensee BioMed Central Ltd. 2007
Received: 07 April 2007
Accepted: 24 August 2007
Published: 24 August 2007
The dynamics of gene evolution are influenced by several genomic processes. One such process is retrotransposition, where an mRNA transcript is reverse-transcribed and reintegrated into the genomic DNA.
We have surveyed eight vertebrate genomes (human, chimp, dog, cow, rat, mouse, chicken and the puffer-fish T. nigriviridis), for putatively retrotransposed copies of genes. To gain a complete picture of the role of retrotransposition, a robust strategy to identify putative retrogenes (PRs) was derived, in tandem with an adaptation of previous procedures to annotate processed pseudogenes, also called retropseudogenes (RψGs). Mammalian genomes are estimated to contain 400–800 PRs (corresponding to ~3% of genes), with fewer PRs and RψGs in the non-mammalian vertebrates. Focussing on human and mouse, we aged the PRs, analysed for evidence of transcription and selection pressures, and assigned functional categories. The PRs have significantly less transcription evidence mappable to them, are significantly less likely to arise from alternatively-spliced genes, and are statistically overrepresented for ribosomal-protein genes, when compared to the proteome in general. We find evidence for spurts of gene retrotransposition in human and mouse, since the lineage of either species split from the dog lineage, with >200 PRs formed in mouse since its divergence from rat. To examine for selection, we calculated: (i) Ka/Ks values (ratios of non-synonymous and synonymous substitutions in codons), and (ii) the significance of conservation of reading frames in PRs. We found >50 PRs in both human and mouse formed since divergence from dog, that are under pressure to maintain the integrity of their coding sequences. For different subsets of PRs formed at different stages of mammalian evolution, we find some evidence for non-neutral evolution, despite significantly less expression evidence for these sequences.
These results indicate that retrotranspositions are a significant source of novel coding sequences in mammalian gene evolution.
Genes are subject to many different processes that give rise to novel sequences, such as segmental and local duplication, gene conversion, and retrotransposition. The extent to which these different processes contribute to gene evolution is unclear. In the present paper, we focus on the phenomenon of gene retrotransposition. Retrotransposition entails the reverse transcription of an mRNA transcript and the subsequent re-integration of the resulting cDNA into genomic DNA, in germ-line cells . There is substantial genomic evidence for large-scale retrotransposition of mRNAs in mammalian genomes, from detection of thousands of apparent retropseudogenes in human, mouse and rat [2–4]. Such retropseudogenes (RψGs) are decayed or disabled gene sequence copies (typically bearing frameshifts and stop codons) that demonstrate the hallmark characteristics of retrotransposition, namely lack of introns of the parental gene, and also 3' polyadenine tails, if formed more recently . Other features include short direct repeats flanking the sequence (for young retrotranspositions) , frequent 5' truncations, and genomic location different from that of the parent gene [2, 3]. It has been demonstrated experimentally that RψG s can be formed through the action of LINE-1 reverse transcriptases . The computational comparison of LINEs and RψG s also supports the generation of RψG s by LINEs . The poly(A) tails and frequent truncations found at the 5' end in the RψG s are typical for LINEs . Moreover, they share similar structures, including a common TT|AAAA insertion motif .
Since the substantial majority of these retrosequences bear disablements (frameshifts and stop codons), or have codon substitution patterns indicative of decay [9, 5, 3], gene retrotransposition appears generally to lead to non-functional sequences that decay in the genomic DNA as evolution progresses [10, 2, 9]. However, even though the promoters of these gene retrosequences are not transferred, a small minority of them appears to be transcribed . For the human genome, there is a small population of at least ~200 transcribed processed pseudogenes, which have the symptoms of a lack of coding ability despite evidence of transcription, and are significantly likely to be found near others genes (as would be expected if they are co-opting promoters) .
Generation of a new functional gene is also a possible outcome of retrotransposition . There is an increasing number of transcribed, functionally characterized genes in mammalian and invertebrate animal genomes reported to bear the characteristics of retrosequences . Over ninety such retrogenes have been annotated in the human and mouse genomes . Most of the functional retrogenes identified are expressed in testis and may have provided important raw material for rapid testis evolution in primates .
Here, to derive an overview of the role of gene retrotransposition in the genome evolution of vertebrates, and particularly mammals, we derive and apply a robust procedure to annotate gene retrotranspositions, built on our previous analyses of retropseudogenes [11, 3, 2, 14]. Our strategy incorporates a new rapid procedure for annotating retrocopies in the genomic DNA, in tandem with a pipeline to identify them in existing gene annotations. This PR annotation pipeline incorporates aging of the sequences through evolutionary rate analysis relative to putative parents and their orthologs, as well as analysis of the chromosomal milieus of these sequences and their putative parents. We find evidence for, on average, several hundred PRs in each proteome. Focussing on human and mouse, we find evidence for spurts of gene retrotransposition in both human and mouse, since divergence from dog. A small number (>50) of PRs have formed in both mouse and human since divergence form dog, that show signs of being under selection to maintain their coding sequences.
Rapid identification of retropseudogenes (RψGs)
Retropseudogenes were annotated on the Ensembl genome versions used in this present analysis, using the rapid improvement of previous procedures to identify retropseudogenes described above (summarised in Figure 2) [11, 5, 14, 2, 3].
Identification of putative retrotransposed genes (PRs)
Homology detection: Each proteome was compared against itself using BLAST to find similarities with e-value ≤ 10-4 . Any match to a potential pseudogene contaminant in the proteome annotations was removed (Figure 1).
Exon seam analysis: Exon boundary information for each protein was extracted from the appropriate Ensembl genome annotation files. The positioning of exon boundaries in encoded protein sequences, i.e., 'exon seams', was then deduced . Using the positioning of exon seams, the BLAST matches between proteins were filtered to pick out alignments between a protein encoded by a multiple-exon gene, and a single exon of another gene. To define PRs, the length of the exon was required to be between 0.9 and 1.1 of the whole length of the multiple-exon protein. (This is stricter than the criterion of 0.7 of the length used for annotation of retropseudogenes ).
Additional filtering: In addition, PRs were discarded if they matched olfactory receptors (ORs) with BLAST e-value ≤ 10-4 over ≥ 0.5 of the length of the OR, since recent olfactory receptors (ORs) have probably originated from different mechanism other than retrotransposition . Olfactory receptor sequences were taken from ORDB .
Local gene order test for similarity of the chromosomal milieus for PRs and parent genes: To check that the PRs did not arise from local or segmental duplication, we derived a 'local gene order' test. For this test, we compared the chromosomal milieus of PRs and parents for significant similarity, as follows. Proteins encoded by genes adjacent to PRs in the chromosomes were BLASTed against the corresponding proteins from genes adjacent to putative parents, for a given window (w genes ) of number of genes in either direction (5' and 3'). A w genes size of 7 (the PR or parent, plus 3 genes in either direction), with an allowance for one gap between the positions of matches within w genes , was found to be suitable. The number of significant homologous matches N homologs (BLAST e-value ≤ 10-4, sequence identity >40%, and match ≥ 0.6 length of both PR and parent) between the milieus of PR and parent was tallied. An expected benchmark distribution for N homologs was derived for the chromosomal milieus of 1,000 randomly-sampled pairs of proteins that have any significant BLAST match to each other (e-value ≤ 10-4). From examination of this distribution, we found that 80% of such random pairs have N homologs <1, and 87% have N homologs ≤ 1. We thus chose N homologs = 1, as a suitable threshold for local similarity arising from duplication of genomic DNA. However, the results differ little if a threshold of N homologs = 0 is used. This procedure was applied to the genomes of the human and mouse. Interestingly, application of this criterion resulted in the exclusion of many sequences with large individual exons (PRs with FLE parent ≥ 0.8; 36/86 = 42% of those excluded), that may be false positives in our data set of PRs. A large fraction of these sequences (68%) tend to have long, tandem arrays of Zn-finger domains covering more than a third of their sequences (Additional Figure 1).
Additional filtering and annotation
The following additional analysis was performed on the PRs:
(i) Fraction of largest exon in parent: We calculated the fraction of the length any parent gene that is taken up by its largest exon. This is denoted FLE parent . We found that there is no peculiar tendency for the parents of PRs to have a single large exon (which would yield a tendency for high FLE values) [Additional Figure 1].
Overview of gene retrotransposition analysis for eight vertebrates
Number of genes *
Number of PRs
PR matches retrotransposed TE **
PR overlaps pseudogene annotation
FLE ≤ 0.8 ***
Number of retro- pseudogenes (RψGs)
PRs passing local gene order test
Matching Refseq mRNA or Unigene consensus ****
(iii) Filtering for potential transposable elements (TEs): Each proteome was compared using TBLASTN to libraries of transposable elements taken from the RepeatMasker distribution , using an e-value threshold of ≤ 10-4. Any proteins containing SINEs, or near-complete matches to LINEs (≥ 0.8 of their lengths), were labeled as potentially TE-containing.
(iv) Whether single-exon gene or multiple-exon gene: The PRs were labeled as either single-exon genes or part of multiple-exon genes.
Orthologs of parent genes were identified using the bi-directional best hits method, using BLAST (e-value ≤ 10-4, amino-acid sequence identity ≥ 40% and requiring the alignment to cover ≥ 0.6 of the lengths of each sequence. The bi-directional best hits method is a common procedure for guarding against considering paralogs.
Analysis of Ks and Ka/Ks values, and derivation of genome- and lineage-specific gene lists
The package PAML  was used to calculate maximum-likelihood Ka, Ks and Ka/Ks values for pairs of sequences (either PR versus computed ancestral sequences, or PR versus parent). In addition, branch-specific maximum-likelihood Ka/Ks values were calculated for three-way alignments of PR, parent and parent's orthologs from another close species.
We applied three different strategies based on analysis of Ks, to determine lists of genome-specific and lineage-specific PRs. For example, for the human genome, we calculated human-specific lists relative to the chimpanzee genome. Also, we calculated primate-specific lists for human plus chimpanzee, relative to a mammalian outgroup, such as dog or cow. To determine genome-specific lists of PRs, we investigated each of the following three methods ("parent's ortholog" refers to the ortholog of the parent in the most closely related genome):
(1) The distribution of Ks values for orthologous genes in the two organisms was calculated, and the median value m derived from this. If Ks [PR←→parent] <m and Ks [parent ←→parent's ortholog] >m, then a PR is labeled genome-specific;
(2) Secondly, a PR could be labeled genome-specific if Ks [PR ←→parent] < Ks [parent ←→parent's ortholog];
(3) Thirdly, a PR could be labeled genome-specific if, in a three-way tree of PR, parent and parent's ortholog, the branch-specific Ks [PR] is < (Ks [parent] + Ks [parent's ortholog])/2 ;
Lineage-specific lists were derived in a similar fashion. Additional Figure 2 shows how these three methods overlap which each other. Based on the overlaps observed, we used Method (3) for further analysis.
Analysis of reading frame conservation
We assessed the reading frame conservation (RFC) in sequences using simulations of insertion and deletion governed by power-law insertion/deletion (indel) statistics . Power-law statistics for indels were extracted from recently-formed RΨGs having ≥ 85% amino-acid identity with their parent sequences. Power-law relationships were fitted, omitting points for any indels of size 3n, with n any positive integer. Expected ratios for insertions versus deletions were taken from this data; the expected number of indels per nucleotide substitution for several mammals was culled from the literature [21, 22]. The program DNADIST  was used to calculate the nucleotide-level divergence of the PRs from ancestral sequences (calculated using PAML ; see section on Ka/Ks analysis above). This divergence value is used as a target in simulations. For each PR, repeated simulations of the evolution of the ancestral sequence towards present-day, for 1000 iterations, was performed using a Kimura two-parameter model. In each case, the resulting simulated protein coding sequence was marked for frame disablements (stop codons and frameshifts). PR sequences whose simulations yielded frame-disrupted sequences ≥ 99% of the time were labeled as having significant RFC.
Assignment of functional categories
GO (Gene Ontology; ) functional categories were taken from the annotation files provided on the Ensembl  and Gene Ontology websites . Further GO term annotations were derived by mapping functional GO annotations for the PDB (also downloaded from the GO website) onto Ensembl protein annotations, using 50% sequence identity and 0.8 fractional sequence coverage (for the protein domain) as thresholds, using alignment made by the program BLASTP (e-value ≤ 0.0001) . These thresholds were benchmarked on the complete SCOP protein domain sequence database , to give a 2% false positive rate for GO term transfer.
Mapping of cDNAs/mRNAs
Refseq mRNAs and complete Unigene consensus sequences were downloaded from the NCBI website , for both human and mouse. These were mapped to the coding sequences of Ensembl gene annotations, using blastn (e-value ≤ 1 × 10-10 for alignments ≥ 100 nucleotides) . All mappings that match with ≥ 99% sequence identity over ≥ 0.99 of the sequence length of the cDNA or mRNA, after removal of any polyadenylation, were retained. Further restriction of analysis of cDNA/mRNA mappings to those that do not match their putative parent sequences with >95% sequence identity, does not change the trends reported with regard to transcription evidence reported below.
Results and discussion
The pipelines for annotating the complement of gene retrotranspositions (both retropseudogenes (RψG s) and putative retrogenes (PRs)) were applied to eight vertebrates. In particular, we focused on the mammals, to analyse the ages of putative retrogenes (PRs), to derive genome- and lineage-specific lists and to check for spurts of gene retrotransposition activity. We then examined for evidence of transcription (mRNA and cDNA mapping), involvement in alternative splicing, selection pressures (significant Ka/Ks values and reading-frame conservation), and for functional categorizations of parent genes.
Overview of gene retrotranspositions in vertebrates
Our analysis suggests that up to ~3% of the genes encoded in a vertebrate genome contain a PR (Table 1), with the smallest percentages in the chicken and puffer fish T. nigroviridis. By comparison, mammalian genomes have ~2,000–5,000 retropseudogenes (RψG s), that have at least 70% of the coding sequence of their parent genes, again with smaller numbers in non-mammal vertebrates (just 644 RψG s in T. nigroviridis) (Table 1). These results together indicate that there has been less, recent gene-retrotransposition activity in the two non-mammal vertebrates. These observations tally well with other evidence for less retrotransposition activity in chicken and Tetraodon. In chicken, there appear to be little or no SINEs , and only ~8% of the genomic DNA is comprised of the CR1 ('chicken repeat 1') LINE-1 [28, 29], whose reverse transcriptase is thought not to copy polyadenylated mRNAs . In Tetraodon, <1% of the genome is comprised of retrotransposons, so gene retrotransposition should consequently be less likely . For the eight genomes studied here, there are no significant linear correlations between the number of genes or proteins from a genome, versus the number of PRs, or RψGs (data not shown). Small percentages of the PRs could be classified as homologs of retrotransposed transposable elements, such as LINEs (2–12%), or as overlapping pseudogene annotations (4–13%).
As described in detail in Methods, we applied a 'local gene order' test, to set aside any PRs that may have arisen through local or segmental duplication, specifically for the human and mouse genomes (Table 1). This filter allows for at most one homologous protein encoded within a window of +/-3 genes along the genomic DNA (i.e., N homologs ≤ 1) (Table 1). The substantial majority of human and mouse PRs pass this filter (80–87%).
Ages of primate and rodent gene retrotranspositions
By comparison, in the rodents, there is more, very recent gene retrotransposition activity. In mouse, we find proportionately more, genome-specific PRs (relative to rat), with 44% having Ks ≤ 0.02 (Figure 3B). In the two rodents, mouse and rat, there are >200 genome-specific PRs, compared to ~40 in each of the primates human and chimp. However, setting aside genome-specific examples, there are more gene retrotranspositions appearing in the primate lineage since its divergence from the dog or cow lineage (Figure 4).
These observations are in keeping with the apparent maintenance of greater levels of LINE and SINE retrotransposition activity in the rodents [31, 32]; also, they tally well with previous observations for a general fall-off in such retrotransposition activity in the primate lineage [2, 33].
Focussing on human and mouse, we examined the proportion of PRs that could be mapped to a complete Unigene consensus cDNA or a complete Refseq mRNA from the NCBI (; see Methods for details). For both organisms, we found that the PRs have significantly less mapping of this transcription evidence (P < 0.001, using the z-score for distribution of the sample mean). For human, only 23% of human PRs mapped to a Refseq mRNA or Unigene cDNA consensus sequence (compared to 41% for the whole proteome). This may be due to lower transcription levels, because they are novel gene sequences using co-opted promoter elements at a site distal to the genomic location of their parent genes . This general reduction in transcription is also to be expected, if some of the sequences are recent pseudogenes without disablements.
In addition, we examined how many PRs arise from alternatively-spliced genes. To do this, we cross-referenced the PR data with alternative splicings classified in the Alternative Splicing Database (ASD) at the EBI . We found that 24% of genes for human PRs arose from an alternatively spliced gene, compared to 59% of genes overall (significantly less, P < 0.001 using the z-score for distribution of the sample mean). A significant reduction in representation from alternatively-spliced genes was also observed in mouse (4%, compared to 29% overall).
Ka/Ks and reading frame conservation (RFC) analysis
Is there any evidence for selection pressures on the PRs in human and mouse? We investigated this question for PRs that have been formed in human and mouse, since their divergences from dog. One standard indicator of selection pressures is the Ka/Ks ratio. This is the number of non-synonymous mutations per non-synonymous site, over the number of synonymous mutations per synonymous site, in codons. Negative (or 'purifying') selection in a specific lineage is indicated by a value significantly <1.0, where positive ('diversifying') selection is demonstrated by a value significantly >1.0. We calculated Ka/Ks values for PRs relative to ancestral sequences for the parents of the PRs (see Methods for details). We tested whether any of these Ka/Ks values were significantly < or >1.0 by generating 500 random pairs of sequences as diverged as the PR and ancestral sequence (to calculate expected means and standard deviations for the Ka/Ks values), and then deriving a P-value for the observed Ka/Ks.
Strikingly, when we correct for multiple hypothesis testing in the Ka/Ks calculations, we find only one PR sequence (formed since divergence from dog) that is under significant selection at the codon level in the human genome, and none in the mouse genome. (The one significant human example is a PR under purifying selection, from a family of proteins with the GTP-binding SAR1 domain.)
Conservation of open reading frames without disablements (frameshifts or stop codons), can also be an indicator of coding ability [12, 35, 36]. We derived a method for assessing significant conservation of open reading frames, using simulation with power-law insertion/deletion (indel) statistics . Using simulations with calculated neutral rates of substitution, insertion and deletion, the likelihood of conservation of an open reading frame without interruption by frameshifts and stop codons, can be determined (see Methods for details). To give sufficient power, a P-value threshold of ≤ 0.01 was used as an indicator for significant reading-frame conservation (RFC). This calculation is complementary to the Ka/Ks analysis.
Results of analysis for reading-frame conservation (RFC)
Number with significantly conserved reading frame ††
Human-specific relative to chimp
Others in human, that were formed since divergence from dog
Other older PRs
Mouse-specific relative to rat
Others in mouse, that were formed since divergence from dog
Other older PRs
These results are evidence for conservation of protein open reading frames, even though we found no evidence for purifying selection from examination of the sequences individually for Ka/Ks. This would arise if the PRs were generally under relaxed or positive selection pressures at the codon level. The existence of relaxed selection is consistent with the markedly low numbers of PRs found to be transcribed in both human and mouse, particularly those that were formed since divergence from dog.
Out of those with significant RFC, is there any evidence for non-neutral Ka/Ks trends? We checked for significant excess of PRs with Ka/Ks values <0.5, > 0.5, <1.0 or >1.0 for each of the subsets listed (Table 2). In the human lineage, we find a significant excess of PRs with Ka/Ks >0.5 (40/59, P < 0.05, χ2 test and Fisher exact test, compared to an expectation from RψG sequences) formed between divergence from dog and from chimp. This non-neutral trend is suggestive of positive selection distributed throughout this specific subset population of RψG sequences. Out of the other subsets listed in Table 2, the only other significant non-neutral Ka/Ks tendency is for an excess of mouse-specific PRs to be under purifying selection (26/35 having Ka/Ks <0.5, compared to an expectation from RψG sequences, P < 0.05, χ2 test and Fisher exact test).
Most common Gene Ontology (GO) functional terms for different sets of sequences *
All genes (Total = 33930)
retropseudogenes (Total = 2493)
All PRs (Total = 631)
PRs formed since divergence from dog lineage (Total = 211) **
GO:0005515, protein binding (2360)
GO:0003735, structural constituent of ribosome (203)
GO:0008270, zinc ion binding (49)
GO:0003735, structural constituent of ribosome (11)
GO:0008270, zinc-ion binding (2069)
GO:0008270, zinc ion binding (189)
GO:0006355, regulation of transcription, DNA-dependent (35)
GO:0003677, DNA binding (10)
GO:0006355, regulation of transcription, DNA-dependent (2029)
GO:0006355, regulation of transcription, DNA-dependent (166)
GO:0005509, calcium ion binding (25)
GO:0006355, regulation of transcription, DNA-dependent (9)
GO:0005524, ATP-binding (1687)
GO:0003676, nucleic acid binding (132)
GO:0005525, GTP binding (21)
GO:0005525, GTP binding (5)
GO:0003677, DNA binding (1339)
GO:0003723, RNA binding (126)
GO:0005515, protein binding (21)
GO:0003823, antigen binding (5)
GO:0007165, signal transduction (1264)
GO:0005515, protein binding (114)
GO:0004842, ubiquitin-protein ligase activity (21)
GO:0003676, nucleic acid binding (5)
GO:0016740, transferase activity (1263)
GO:0003677, DNA binding (110)
GO:0003677, DNA binding (20)
GO:0030145, manganese ion binding (4)
GO:0004872, receptor activity (1242)
GO:0005524, ATP binding (93)
GO:0003676, nucleic acid binding (20)
GO:0020037, heme binding (4)
GO:0016787, hydrolase activity (1171)
GO:0046872, metal ion binding (63)
GO:0003735, structural component of the ribosome (16)
GO:0016757, transferase activity, transferring glycosyl groups (4)
GO:0003700, transcription factor activity (1052)
GO:0000166, nucleotide binding (57)
GO:0003723, RNA binding (13)
GO:0005509, calcium ion binding (4)
All genes (Total = 32442)
Retropseudogenes (Total = 2969)
PRs (Total = 663)
PRs formed since divergence from dog lineage (Total = 298) **
GO:0005515, protein binding (2502)
GO:0003676, nucleic acid binding (273)
GO:0005515, protein-binding (17)
GO:0003735, structural constituent of ribosome (16)
GO:0004872, receptor activity (1923)
GO:0051287, NAD binding (243)
GO:0003735, structural constituent of ribosome (16)
GO:0005524, ATP binding (8)
GO:0006355, regulation of transcription, DNA-dependent (1571)
GO:0008943, glyceraldehyde-3-phosphate dehydrogenase activity (243)
GO:0008270, zinc ion binding (12)
GO:0005515, protein-binding (7)
GO:0008270, zinc ion binding (1481)
GO:0004365, glyceraldehyde-3-phosphate dehydrogenase (phosphorylating) activity (243)
GO:0006355, regulation of transcription, DNA-dependent (12)
GO:0016740, transferase activity (6)
GO:0005524, ATP binding (1252)
GO:0008270, zinc ion binding (235)
GO:0005524, ATP binding (12)
GO:0016491, oxidoreductase activity (6)
GO:0016740, transferase activity (1036)
GO:0003735, structural constituent of ribosome (201)
GO:0005509, calcium ion binding (12)
GO:0006355, regulation of transcription, DNA-dependent (6)
GO:0003677, DNA binding (1017)
GO:0005515, protein-binding (101)
GO:0016740, transferase activity (10)
GO:0016853, isomerase activity (5)
GO:0016787, hydrolase activity (911)
GO:0016491, oxidoreductase activity (94)
GO:0016787, hydrolase activity (9)
GO:0016787, hydrolase activity (5)
GO:0000166, nucleotide binding (873)
GO:0005524, ATP binding (78)
GO:0003677, DNA binding (9)
GO:0003677, DNA binding (5)
GO:0003676, nucleic acid binding (872)
GO:0004190, aspartic-type endopeptidase activity (77)
GO:0003676, nucleic acid binding (9)
GO:0016874, ligase activity (3)
We have developed two parallel pipelines for the annotation of putative retrogenes (PRs) and retropseudogenes (RψG) in whole genomes. The new pipeline for retropseudogene annotation employs length-based heuristics to speed up the processing of sequence alignment data. We used these pipelines on the vertebrates here, but they are readily applicable to any genome and its set of gene/protein annotations. Genome analysis is constantly in flux, and so obviously, as vertebrate genome assemblies and their annotations are streamlined further, we will be further able to refine our retrotransposition analyses, to remove any errors from missing gene annotations, small genome assembly gaps, etc.
We focussed on the annotation of retro(pseudo)genes in mouse and human. We were particularly interested in the retro(pseudo)genes formed since divergence from an 'outgroup' genome, that of the dog. We found evidence for excess, recent gene-retrotransposition activity in both human and mouse, since their divergences from the dog lineage. We find some evidence for selection on PRs at different phases of mouse and human genome evolution. In human, there is statistical evidence for non-neutral evolution (suggestive of positive selection), for population of PRs that have significantly conserved reading frames and that formed since divergence from the dog lineage. Also, we found that, human PRs formed since divergence from the dog lineage have significantly less transcription evidence, which is consistent with the possibility that they are pseudogenes, or in some intermediate phase of relaxed selection. Such a state of low expression coupled with relaxed selection may also arise for alternatively-spliced exons [38, 39]. In summary, our genomic analysis suggests that some human PRs, formed since divergence from the dog lineage, are undergoing a form of non-neutral evolution, but the majority are either young pseudogenes (that are undisabled simply by chance), or lowly-expressed coding sequences in a state of 'relaxed' selection.
Further information on the PRs and RψGs is available at the website: .
- RψG :
reading frame conservation
Protein Data Bank
Structural Classification of Proteins
Long Interspersed Element
Short Interspersed Element
fraction of largest exon.
This work was funded by a Discovery Grant awarded to P.M.H., from the National Science and Engineering Research Council of Canada.
- D'Errico I, Gadaleta G, Saccone C: Pseudogenes in metazoa: origin and features. Briefings in functional genomics & proteomics 2004, 3(2):157–167. 10.1093/bfgp/3.2.157View ArticleGoogle Scholar
- Zhang Z, Harrison P, Liu Y, Gerstein M: Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res 2003, 13(12):2541–2558. 10.1101/gr.1429003PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Z, Harrison P, Gerstein M: Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome. Genome Res 2002, 12(10):1466–14482. 10.1101/gr.331902PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Z, Gerstein M: Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic acids research 2003, 31: 5338–5348. 10.1093/nar/gkg745PubMed CentralView ArticlePubMedGoogle Scholar
- Harrison P, Gerstein M: Studying genomes through the aeons: protein families, pseudogenes and proteome evolution. J Mol Biol 2002, 318(5):1155–1174. 10.1016/S0022-2836(02)00109-2View ArticlePubMedGoogle Scholar
- Betran E, Thornton K, Long M: Retroposed new genes out of the X in Drosophila. Genome research 2002, 12(12):1854–1859. 10.1101/gr.6049PubMed CentralView ArticlePubMedGoogle Scholar
- Esnault C, Maestre J, Heidmann T: Human LINE retrotransposons generate processed pseudogenes. Nature genetics 2000, 24(4):363–367. 10.1038/74184View ArticlePubMedGoogle Scholar
- Pavlicek A, Paces J, Elleder D, Hejnar J: Processed pseudogenes of human endogenous retroviruses generated by LINEs: their integration, stability, and distribution. Genome research 2002, 12(3):391–399. 10.1101/gr.216902. Article published online before print in February 2002PubMed CentralView ArticlePubMedGoogle Scholar
- Torrents D, Suyama M, Zdobnov E, Bork P: A genome-wide survey of human pseudogenes. Genome research 2003, 13(12):2559–2567. 10.1101/gr.1455503PubMed CentralView ArticlePubMedGoogle Scholar
- Brosius J: RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene 1999, 238(1):115–134. 10.1016/S0378-1119(99)00227-9View ArticlePubMedGoogle Scholar
- Harrison PM, Zheng D, Zhang Z, Carriero N, Gerstein M: Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability. Nucleic acids research 2005, 33: 2374–2383. 10.1093/nar/gki531PubMed CentralView ArticlePubMedGoogle Scholar
- Marques AC, Dupanloup I, Vinckenbosch N, Reymond A, Kaessmann H: Emergence of young human genes after a burst of retroposition in primates. PLoS biology 2005, 3: e357. 10.1371/journal.pbio.0030357PubMed CentralView ArticlePubMedGoogle Scholar
- Emerson JJ, Kaessmann H, Betran E, Long M: Extensive gene traffic on the mammalian X chromosom. Science 2004, 303(5657):537–540. 10.1126/science.1090042View ArticlePubMedGoogle Scholar
- Harrison PM, Hegyi H, Balasubramanian S, Luscombe NM, Bertone P, Echols N, Johnson T, Gerstein M: Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Res 2002, 12(2):272–280. 10.1101/gr.207102PubMed CentralView ArticlePubMedGoogle Scholar
- Ensembl database [ http://www.ensembl.org ] .
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Ophir R, Graur D: Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 1997, 205: 191–202. 10.1016/S0378-1119(97)00398-3View ArticlePubMedGoogle Scholar
- ORDB [ http://senselab.med.yale.edu/senselab/ORDB ].
- Repeatmasker [ http://www.repeatmasker.org ] .
- Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 1997, 13: 555–556.PubMedGoogle Scholar
- Cooper GM, Brudno M, Stone EA, Dubchak I, Batzoglou S, Sidow A: Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Res 2004, 14: 539–548. 10.1101/gr.2034704PubMed CentralView ArticlePubMedGoogle Scholar
- Britten RJ, Rowen L, Williams J, Cameron RA: Majority of divergence between closely-related DNA samples is due to indels. PNAS 2003, 100: 4665–4670. 10.1073/pnas.0330964100View ArticleGoogle Scholar
- Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5: 164–166.Google Scholar
- Consortium GO: The Gene Ontology (GO) database and informatics resource. Nucleic acids research 2004, 32: D258-D261. 10.1093/nar/gkh036View ArticleGoogle Scholar
- Gene Ontology [ http://www.geneontology.org ] .
- Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic acids research 2004, 32: D189-D192. 10.1093/nar/gkh034PubMed CentralView ArticlePubMedGoogle Scholar
- NCBI [ http://www.ncbi.nih.gov ] .
- Consortium ICGS: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 2004, 432: 695–716. 10.1038/nature03154View ArticleGoogle Scholar
- Haas NB, Grabowski JM, Sivitz AB, Burch JB: Chicken repeat 1 (CR1) elements, which define an ancient family of vertebrate non-LTR retrotransposons, contain two closely spaced open reading frames. Gene 1997, 197: 305–309. 10.1016/S0378-1119(97)00276-XView ArticlePubMedGoogle Scholar
- Roest Crollius H, Jaillon O, Dasilva C, Ozouf-Costaz C, Fizames C, Fischer C, Bouneau L, Billault A, Quetier F, Saurin W, Bernot A, J. W: Characterization and repeat analysis of the compact genome of the freshwater pufferfish Tetraodon nigroviridis. Genome Res 2000, 10: 939–949. 10.1101/gr.10.7.939View ArticlePubMedGoogle Scholar
- Consortium RGSP: Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 2004, 428(6982):493–521. 10.1038/nature02426View ArticleGoogle Scholar
- Consortium MGS: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420(6915):520–562. 10.1038/nature01262View ArticleGoogle Scholar
- Genomics C: The sequence of the human genome. Science 2001, 291(5507):1304–1351. 10.1126/science.1058040View ArticleGoogle Scholar
- Consortium ENCODEP: The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004, 306(5696):636–640. 10.1126/science.1105136View ArticleGoogle Scholar
- Harrison PM, Carriero N, Liu Y, Gerstein M: A "polyORFomic" analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs. J Mol Biol 2003, 333: 885–892. 10.1016/j.jmb.2003.09.016View ArticlePubMedGoogle Scholar
- Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 2003, 423: 241–254. 10.1038/nature01644View ArticlePubMedGoogle Scholar
- Harrison PM: Exhaustive assignment of compositional bias reveals universally prevalent biased regions: analysis of functional associations in human and Drosophila. BMC Bioinformatics 2006, 7: 441. 10.1186/1471-2105-7-441PubMed CentralView ArticlePubMedGoogle Scholar
- Modrek B, Lee CJ: Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet 2003, 34(2):177–180. 10.1038/ng1159View ArticlePubMedGoogle Scholar
- Wang W, Zheng H, Yang S, Yu H, Li J, Jiang H, Su J, Yang L, Zhang J, McDermott J, Samudrala R, Wang J, Yang H, Yu J, Kristiansen K, Wong GK, J W: Origin and evolution of new exons in rodents. Genome Res 2005, 15: 1258–1264. 10.1101/gr.3929705PubMed CentralView ArticlePubMedGoogle Scholar
- http://biology.mcgill.ca/faculty/harrison/retro [ http://biology.mcgill.ca/faculty/harrison/retro ] .
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.