Skip to main content

Analysis of the role of retrotransposition in gene evolution in vertebrates



The dynamics of gene evolution are influenced by several genomic processes. One such process is retrotransposition, where an mRNA transcript is reverse-transcribed and reintegrated into the genomic DNA.


We have surveyed eight vertebrate genomes (human, chimp, dog, cow, rat, mouse, chicken and the puffer-fish T. nigriviridis), for putatively retrotransposed copies of genes. To gain a complete picture of the role of retrotransposition, a robust strategy to identify putative retrogenes (PRs) was derived, in tandem with an adaptation of previous procedures to annotate processed pseudogenes, also called retropseudogenes (RψGs). Mammalian genomes are estimated to contain 400–800 PRs (corresponding to ~3% of genes), with fewer PRs and RψGs in the non-mammalian vertebrates. Focussing on human and mouse, we aged the PRs, analysed for evidence of transcription and selection pressures, and assigned functional categories. The PRs have significantly less transcription evidence mappable to them, are significantly less likely to arise from alternatively-spliced genes, and are statistically overrepresented for ribosomal-protein genes, when compared to the proteome in general. We find evidence for spurts of gene retrotransposition in human and mouse, since the lineage of either species split from the dog lineage, with >200 PRs formed in mouse since its divergence from rat. To examine for selection, we calculated: (i) Ka/Ks values (ratios of non-synonymous and synonymous substitutions in codons), and (ii) the significance of conservation of reading frames in PRs. We found >50 PRs in both human and mouse formed since divergence from dog, that are under pressure to maintain the integrity of their coding sequences. For different subsets of PRs formed at different stages of mammalian evolution, we find some evidence for non-neutral evolution, despite significantly less expression evidence for these sequences.


These results indicate that retrotranspositions are a significant source of novel coding sequences in mammalian gene evolution.


Genes are subject to many different processes that give rise to novel sequences, such as segmental and local duplication, gene conversion, and retrotransposition. The extent to which these different processes contribute to gene evolution is unclear. In the present paper, we focus on the phenomenon of gene retrotransposition. Retrotransposition entails the reverse transcription of an mRNA transcript and the subsequent re-integration of the resulting cDNA into genomic DNA, in germ-line cells [1]. There is substantial genomic evidence for large-scale retrotransposition of mRNAs in mammalian genomes, from detection of thousands of apparent retropseudogenes in human, mouse and rat [24]. Such retropseudogenes (RψGs) are decayed or disabled gene sequence copies (typically bearing frameshifts and stop codons) that demonstrate the hallmark characteristics of retrotransposition, namely lack of introns of the parental gene, and also 3' polyadenine tails, if formed more recently [5]. Other features include short direct repeats flanking the sequence (for young retrotranspositions) [6], frequent 5' truncations, and genomic location different from that of the parent gene [2, 3]. It has been demonstrated experimentally that RψG s can be formed through the action of LINE-1 reverse transcriptases [7]. The computational comparison of LINEs and RψG s also supports the generation of RψG s by LINEs [8]. The poly(A) tails and frequent truncations found at the 5' end in the RψG s are typical for LINEs [2]. Moreover, they share similar structures, including a common TT|AAAA insertion motif [8].

Since the substantial majority of these retrosequences bear disablements (frameshifts and stop codons), or have codon substitution patterns indicative of decay [9, 5, 3], gene retrotransposition appears generally to lead to non-functional sequences that decay in the genomic DNA as evolution progresses [10, 2, 9]. However, even though the promoters of these gene retrosequences are not transferred, a small minority of them appears to be transcribed [11]. For the human genome, there is a small population of at least ~200 transcribed processed pseudogenes, which have the symptoms of a lack of coding ability despite evidence of transcription, and are significantly likely to be found near others genes (as would be expected if they are co-opting promoters) [11].

Generation of a new functional gene is also a possible outcome of retrotransposition [10]. There is an increasing number of transcribed, functionally characterized genes in mammalian and invertebrate animal genomes reported to bear the characteristics of retrosequences [12]. Over ninety such retrogenes have been annotated in the human and mouse genomes [13]. Most of the functional retrogenes identified are expressed in testis and may have provided important raw material for rapid testis evolution in primates [12].

Here, to derive an overview of the role of gene retrotransposition in the genome evolution of vertebrates, and particularly mammals, we derive and apply a robust procedure to annotate gene retrotranspositions, built on our previous analyses of retropseudogenes [11, 3, 2, 14]. Our strategy incorporates a new rapid procedure for annotating retrocopies in the genomic DNA, in tandem with a pipeline to identify them in existing gene annotations. This PR annotation pipeline incorporates aging of the sequences through evolutionary rate analysis relative to putative parents and their orthologs, as well as analysis of the chromosomal milieus of these sequences and their putative parents. We find evidence for, on average, several hundred PRs in each proteome. Focussing on human and mouse, we find evidence for spurts of gene retrotransposition in both human and mouse, since divergence from dog. A small number (>50) of PRs have formed in both mouse and human since divergence form dog, that show signs of being under selection to maintain their coding sequences.


Genome data

The genome sequences and annotations of seven organisms analyzed in this paper (human, dog, cow, mouse, rat, chicken and Tetraodon nigriviridis), were downloaded from the Ensembl Web site [15], in January 2005. Version 2.1 of the chimpanzee assembly (downloaded in April 2006) was also used. Putative retrogenes (PRs) were identified in the annotated proteomes of eight vertebrates using the pipeline in Figure 1. This procedure is described in detail below. In tandem, putative retropseudogenes (RψGs), and additional PRs outside of current protein annotations, were assigned using a modification of previous procedures (Figures 1 &2) [14, 2]. Genes from which PR s and RψG s are thought to have originated are called parent genes.

Figure 1
figure 1

Pipeline summarizing the annotation of PRs and retropseudogenes. The pipeline for PR annotation is summarized. There is an inset at the bottom, that summarizes the tests for local gene order and chromosomal milieu.

Figure 2
figure 2

Rapid annotation of retropseudogenes. (1) TBLASTN matches (e-value ≤ 10-4) of the annotated proteome against the genomic DNA are sorted by coordinates and collated for each protein to form a set of matches {M}. (2) The sets {M} are filtered using length-based heuristics. (3) Each protein is realigned to the genomic DNA using FASTY, and the best-matching proteins at each point have disablements and that matches >70% of the length of the parent sequence are picked as retropseudogene annotations.

Rapid identification of retropseudogenes (RψGs)

Retropseudogenes were annotated on the Ensembl genome versions used in this present analysis, using the rapid improvement of previous procedures to identify retropseudogenes described above (summarised in Figure 2) [11, 5, 14, 2, 3].

Identification of putative retrotransposed genes (PRs)

  1. (1)

    Homology detection: Each proteome was compared against itself using BLAST to find similarities with e-value ≤ 10-4 [16]. Any match to a potential pseudogene contaminant in the proteome annotations was removed (Figure 1).

  2. (2)

    Exon seam analysis: Exon boundary information for each protein was extracted from the appropriate Ensembl genome annotation files. The positioning of exon boundaries in encoded protein sequences, i.e., 'exon seams', was then deduced [11]. Using the positioning of exon seams, the BLAST matches between proteins were filtered to pick out alignments between a protein encoded by a multiple-exon gene, and a single exon of another gene. To define PRs, the length of the exon was required to be between 0.9 and 1.1 of the whole length of the multiple-exon protein. (This is stricter than the criterion of 0.7 of the length used for annotation of retropseudogenes [2]).

(3) Assignment of parent genes: To assign parent genes to PRs, we calculated substitution rates at synonymous codon sites (i.e., Ks values) for all matches to PRs using the package PAML, for all instances where the amino-acid sequence identity for the pair of sequences is ≥ 70%. The sequence with the smallest Ks value was chosen as the 'parent gene'. For sequence identities <70%, saturation of substitutions is likely [17], and so the sequence with the highest BLAST bitscore in alignment with the PR was chosen as the putative parent.

  1. (4)

    Additional filtering: In addition, PRs were discarded if they matched olfactory receptors (ORs) with BLAST e-value ≤ 10-4 over ≥ 0.5 of the length of the OR, since recent olfactory receptors (ORs) have probably originated from different mechanism other than retrotransposition [2]. Olfactory receptor sequences were taken from ORDB [18].

  2. (5)

    Local gene order test for similarity of the chromosomal milieus for PRs and parent genes: To check that the PRs did not arise from local or segmental duplication, we derived a 'local gene order' test. For this test, we compared the chromosomal milieus of PRs and parents for significant similarity, as follows. Proteins encoded by genes adjacent to PRs in the chromosomes were BLASTed against the corresponding proteins from genes adjacent to putative parents, for a given window (w genes ) of number of genes in either direction (5' and 3'). A w genes size of 7 (the PR or parent, plus 3 genes in either direction), with an allowance for one gap between the positions of matches within w genes , was found to be suitable. The number of significant homologous matches N homologs (BLAST e-value ≤ 10-4, sequence identity >40%, and match ≥ 0.6 length of both PR and parent) between the milieus of PR and parent was tallied. An expected benchmark distribution for N homologs was derived for the chromosomal milieus of 1,000 randomly-sampled pairs of proteins that have any significant BLAST match to each other (e-value ≤ 10-4). From examination of this distribution, we found that 80% of such random pairs have N homologs <1, and 87% have N homologs ≤ 1. We thus chose N homologs = 1, as a suitable threshold for local similarity arising from duplication of genomic DNA. However, the results differ little if a threshold of N homologs = 0 is used. This procedure was applied to the genomes of the human and mouse. Interestingly, application of this criterion resulted in the exclusion of many sequences with large individual exons (PRs with FLE parent ≥ 0.8; 36/86 = 42% of those excluded), that may be false positives in our data set of PRs. A large fraction of these sequences (68%) tend to have long, tandem arrays of Zn-finger domains covering more than a third of their sequences (Additional Figure 1).

Additional filtering and annotation

The following additional analysis was performed on the PRs:

(i) Fraction of largest exon in parent: We calculated the fraction of the length any parent gene that is taken up by its largest exon. This is denoted FLE parent . We found that there is no peculiar tendency for the parents of PRs to have a single large exon (which would yield a tendency for high FLE values) [Additional Figure 1].

(ii) Overlap with retropseudogenes annotations: Retropseudogenes were annotated on the Ensembl genome versions used in this present analysis, using the rapid improvement of previous procedures to identify retro(pseudo)genes described above. Any PR that overlapped one of these annotations was flagged (Table 1).

Table 1 Overview of gene retrotransposition analysis for eight vertebrates

(iii) Filtering for potential transposable elements (TEs): Each proteome was compared using TBLASTN to libraries of transposable elements taken from the RepeatMasker distribution [19], using an e-value threshold of ≤ 10-4. Any proteins containing SINEs, or near-complete matches to LINEs (≥ 0.8 of their lengths), were labeled as potentially TE-containing.

(iv) Whether single-exon gene or multiple-exon gene: The PRs were labeled as either single-exon genes or part of multiple-exon genes.


Orthologs of parent genes were identified using the bi-directional best hits method, using BLAST (e-value ≤ 10-4, amino-acid sequence identity ≥ 40% and requiring the alignment to cover ≥ 0.6 of the lengths of each sequence. The bi-directional best hits method is a common procedure for guarding against considering paralogs.

Analysis of Ks and Ka/Ks values, and derivation of genome- and lineage-specific gene lists

The package PAML [20] was used to calculate maximum-likelihood Ka, Ks and Ka/Ks values for pairs of sequences (either PR versus computed ancestral sequences, or PR versus parent). In addition, branch-specific maximum-likelihood Ka/Ks values were calculated for three-way alignments of PR, parent and parent's orthologs from another close species.

We applied three different strategies based on analysis of Ks, to determine lists of genome-specific and lineage-specific PRs. For example, for the human genome, we calculated human-specific lists relative to the chimpanzee genome. Also, we calculated primate-specific lists for human plus chimpanzee, relative to a mammalian outgroup, such as dog or cow. To determine genome-specific lists of PRs, we investigated each of the following three methods ("parent's ortholog" refers to the ortholog of the parent in the most closely related genome):

(1) The distribution of Ks values for orthologous genes in the two organisms was calculated, and the median value m derived from this. If Ks [PR←→parent] <m and Ks [parent ←→parent's ortholog] >m, then a PR is labeled genome-specific;

(2) Secondly, a PR could be labeled genome-specific if Ks [PR ←→parent] < Ks [parent ←→parent's ortholog];

(3) Thirdly, a PR could be labeled genome-specific if, in a three-way tree of PR, parent and parent's ortholog, the branch-specific Ks [PR] is < (Ks [parent] + Ks [parent's ortholog])/2 ;

Lineage-specific lists were derived in a similar fashion. Additional Figure 2 shows how these three methods overlap which each other. Based on the overlaps observed, we used Method (3) for further analysis.

Analysis of reading frame conservation

We assessed the reading frame conservation (RFC) in sequences using simulations of insertion and deletion governed by power-law insertion/deletion (indel) statistics [4]. Power-law statistics for indels were extracted from recently-formed RΨGs having ≥ 85% amino-acid identity with their parent sequences. Power-law relationships were fitted, omitting points for any indels of size 3n, with n any positive integer. Expected ratios for insertions versus deletions were taken from this data; the expected number of indels per nucleotide substitution for several mammals was culled from the literature [21, 22]. The program DNADIST [23] was used to calculate the nucleotide-level divergence of the PRs from ancestral sequences (calculated using PAML [20]; see section on Ka/Ks analysis above). This divergence value is used as a target in simulations. For each PR, repeated simulations of the evolution of the ancestral sequence towards present-day, for 1000 iterations, was performed using a Kimura two-parameter model. In each case, the resulting simulated protein coding sequence was marked for frame disablements (stop codons and frameshifts). PR sequences whose simulations yielded frame-disrupted sequences ≥ 99% of the time were labeled as having significant RFC.

Assignment of functional categories

GO (Gene Ontology; [24]) functional categories were taken from the annotation files provided on the Ensembl [15] and Gene Ontology websites [25]. Further GO term annotations were derived by mapping functional GO annotations for the PDB (also downloaded from the GO website) onto Ensembl protein annotations, using 50% sequence identity and 0.8 fractional sequence coverage (for the protein domain) as thresholds, using alignment made by the program BLASTP (e-value ≤ 0.0001) [16]. These thresholds were benchmarked on the complete SCOP protein domain sequence database [26], to give a 2% false positive rate for GO term transfer.

Mapping of cDNAs/mRNAs

Refseq mRNAs and complete Unigene consensus sequences were downloaded from the NCBI website [27], for both human and mouse. These were mapped to the coding sequences of Ensembl gene annotations, using blastn (e-value ≤ 1 × 10-10 for alignments ≥ 100 nucleotides) [16]. All mappings that match with ≥ 99% sequence identity over ≥ 0.99 of the sequence length of the cDNA or mRNA, after removal of any polyadenylation, were retained. Further restriction of analysis of cDNA/mRNA mappings to those that do not match their putative parent sequences with >95% sequence identity, does not change the trends reported with regard to transcription evidence reported below.

Results and discussion

The pipelines for annotating the complement of gene retrotranspositions (both retropseudogenes (RψG s) and putative retrogenes (PRs)) were applied to eight vertebrates. In particular, we focused on the mammals, to analyse the ages of putative retrogenes (PRs), to derive genome- and lineage-specific lists and to check for spurts of gene retrotransposition activity. We then examined for evidence of transcription (mRNA and cDNA mapping), involvement in alternative splicing, selection pressures (significant Ka/Ks values and reading-frame conservation), and for functional categorizations of parent genes.

Overview of gene retrotranspositions in vertebrates

Our analysis suggests that up to ~3% of the genes encoded in a vertebrate genome contain a PR (Table 1), with the smallest percentages in the chicken and puffer fish T. nigroviridis. By comparison, mammalian genomes have ~2,000–5,000 retropseudogenes (RψG s), that have at least 70% of the coding sequence of their parent genes, again with smaller numbers in non-mammal vertebrates (just 644 RψG s in T. nigroviridis) (Table 1). These results together indicate that there has been less, recent gene-retrotransposition activity in the two non-mammal vertebrates. These observations tally well with other evidence for less retrotransposition activity in chicken and Tetraodon. In chicken, there appear to be little or no SINEs [28], and only ~8% of the genomic DNA is comprised of the CR1 ('chicken repeat 1') LINE-1 [28, 29], whose reverse transcriptase is thought not to copy polyadenylated mRNAs [29]. In Tetraodon, <1% of the genome is comprised of retrotransposons, so gene retrotransposition should consequently be less likely [30]. For the eight genomes studied here, there are no significant linear correlations between the number of genes or proteins from a genome, versus the number of PRs, or RψGs (data not shown). Small percentages of the PRs could be classified as homologs of retrotransposed transposable elements, such as LINEs (2–12%), or as overlapping pseudogene annotations (4–13%).

As described in detail in Methods, we applied a 'local gene order' test, to set aside any PRs that may have arisen through local or segmental duplication, specifically for the human and mouse genomes (Table 1). This filter allows for at most one homologous protein encoded within a window of +/-3 genes along the genomic DNA (i.e., N homologs ≤ 1) (Table 1). The substantial majority of human and mouse PRs pass this filter (80–87%).

Ages of primate and rodent gene retrotranspositions

How old are these PRs? Is there any evidence for spurts of gene retrotransposition activity in mammalian evolution? To answer these questions, we examined the distribution of Ks values for PRs compared to their assigned parent genes, in the human and mouse genomes. (Only PRs passing the local gene order test, with threshold N homologs ≤ 1 were analysed.) Ks is the rate of synonymous substitutions per synonymous site in codons, and has been generally used to age coding sequences. From comparing Ks values for PRs, their parents, and orthologs of their parents, we have also been able to derive lists of genome-specific and lineage-specific PRs (Figure 4; see Methods for details).

Figure 4
figure 4

Lineage-specific lists of PRs: The number of species-specific PRs relative to other species. PRs specific relative to other species were obtained by comparison of Ks between the PR and its parent and the Ks between the parent (Ks PR_parent ) and the ortholog of the parent in the other species (Ks parent_ortholog ). PRs with Ks PR_parent <Ks parent_ortholog were defined as specific PRs relative to the other species. Only PRs which amino acid identity to parents is more than 70% and have an ortholog in other species were subjected to this calculation. Orthology criteria used are 40% identity over 60% length overlap. 'Human-specific' and 'Chimp-specific' PRs are those formed since the species diverged from each other; similarly, for 'Mouse-specific' and 'Rat-specific' PRs. 'Other primate-specific' are any other PRs formed in human or chimp since divergence from dog (in bold typeface), or from cow (in italic typeface); similarly, for 'Other rodent-specific'.

In human PRs, we see that there is a bimodal distribution of Ks (Figure 3A). The median Ks values for lists of PRs that are human-specific or that have otherwise been formed since divergence from dog, are labelled on the Ks histogram. The peak at Ks ~0.06–0.08 corresponds approximately to the median Ks value for PRs formed between human divergence from dog and from chimpanzee. This peak has been noted previously in analyses of retropseudogenes and total retrosequence populations [2, 3, 12], peaking at approximately the point of human lineage divergence from the New World Monkeys, some ~40 million years ago [3]. The peak at 0.0–0.02 (containing 21% of PRs) obviously corresponds to human-specific PRs. Some of these PRs may simply be too young to differentiate as a PR or retropseudogene (They have not been around long enough to acquire (apparent) reading-frame disablements.) Evidence for selection pressures on these sequences is discussed below.

Figure 3
figure 3

Ks distributions: (A) Ks distribution for human PRs meeting the local gene order test with threshold of N homologs = 0, from comparison to their parent sequences. Labelled are the median values for the 'Human-specific' set, and those PRs formed between divergence from dog and from chimp [see panel (C)]. A similar distribution is observed with an N homologs threshold of ≤ 1 for the local gene order test. (B) Ks distribution for mouse PRs meeting the local gene order test with threshold of N homologs = 0, from comparison to their parent sequences. Labelled are the median values for the 'Mouse-specific' set, and those PRs formed between divergence from dog and from chimp [see panel (C)]. A similar distribution is observed with an N homologs threshold of ≤ 1 for the local gene order test.

By comparison, in the rodents, there is more, very recent gene retrotransposition activity. In mouse, we find proportionately more, genome-specific PRs (relative to rat), with 44% having Ks ≤ 0.02 (Figure 3B). In the two rodents, mouse and rat, there are >200 genome-specific PRs, compared to ~40 in each of the primates human and chimp. However, setting aside genome-specific examples, there are more gene retrotranspositions appearing in the primate lineage since its divergence from the dog or cow lineage (Figure 4).

These observations are in keeping with the apparent maintenance of greater levels of LINE and SINE retrotransposition activity in the rodents [31, 32]; also, they tally well with previous observations for a general fall-off in such retrotransposition activity in the primate lineage [2, 33].

Transcription evidence

Focussing on human and mouse, we examined the proportion of PRs that could be mapped to a complete Unigene consensus cDNA or a complete Refseq mRNA from the NCBI ([27]; see Methods for details). For both organisms, we found that the PRs have significantly less mapping of this transcription evidence (P < 0.001, using the z-score for distribution of the sample mean). For human, only 23% of human PRs mapped to a Refseq mRNA or Unigene cDNA consensus sequence (compared to 41% for the whole proteome). This may be due to lower transcription levels, because they are novel gene sequences using co-opted promoter elements at a site distal to the genomic location of their parent genes [11]. This general reduction in transcription is also to be expected, if some of the sequences are recent pseudogenes without disablements.

In addition, we examined how many PRs arise from alternatively-spliced genes. To do this, we cross-referenced the PR data with alternative splicings classified in the Alternative Splicing Database (ASD) at the EBI [34]. We found that 24% of genes for human PRs arose from an alternatively spliced gene, compared to 59% of genes overall (significantly less, P < 0.001 using the z-score for distribution of the sample mean). A significant reduction in representation from alternatively-spliced genes was also observed in mouse (4%, compared to 29% overall).

We examined the divergence of PRs from their putative parents, in the context of transcription evidence. This is illustrated in Figure 5 for those PRs that pass the local gene order test for both mouse and human, with N homologs ≤ 1. In human, there is a marked difference in the behaviour of transcribed PRs (purple bars in Figure 5A), compared to those without transcription evidence (blue bars in Figure 5A). There are relatively very few transcribed PRs with high sequence identities (i.e., that formed relatively recently). The bimodal character of these plots may arise because some of the PRs: (i) are in a younger population of PRs that are not under selection pressures, but which have not accumulated deleterious mutations, simply by chance; i.e., they are pseudogenes without disablements; or (ii) are in a state of relaxed selection, and thus concomitantly have low transcription levels. Similarly, only a small fraction of the PRs calculated to have formed since divergence from the dog lineage (Figure 4), in either human (15/207, 7%) or mouse (20/292, 7%), are transcribed (significantly less, P < 0.05 using χ2 tests for both cases).

Figure 5
figure 5

Distributions of percentage protein sequence identity between PRs and parents. (A) Distribution of % protein sequence identity for all human PRs that pass the local gene order test (N homologs ≤ 1). These are broken down into 'transcribed' and 'not transcribed'. (B) The fraction that are transcribed in each bin of the histogram in panel A. (C) Distribution of % protein sequence identity for all mouse PRs that pass the local gene order test (N homologs ≤ 1). These are broken down into 'transcribed' and 'not transcribed'. (D) The fraction that are transcribed in each bin of the histogram in panel C.

Ka/Ks and reading frame conservation (RFC) analysis

Is there any evidence for selection pressures on the PRs in human and mouse? We investigated this question for PRs that have been formed in human and mouse, since their divergences from dog. One standard indicator of selection pressures is the Ka/Ks ratio. This is the number of non-synonymous mutations per non-synonymous site, over the number of synonymous mutations per synonymous site, in codons. Negative (or 'purifying') selection in a specific lineage is indicated by a value significantly <1.0, where positive ('diversifying') selection is demonstrated by a value significantly >1.0. We calculated Ka/Ks values for PRs relative to ancestral sequences for the parents of the PRs (see Methods for details). We tested whether any of these Ka/Ks values were significantly < or >1.0 by generating 500 random pairs of sequences as diverged as the PR and ancestral sequence (to calculate expected means and standard deviations for the Ka/Ks values), and then deriving a P-value for the observed Ka/Ks.

Strikingly, when we correct for multiple hypothesis testing in the Ka/Ks calculations, we find only one PR sequence (formed since divergence from dog) that is under significant selection at the codon level in the human genome, and none in the mouse genome. (The one significant human example is a PR under purifying selection, from a family of proteins with the GTP-binding SAR1 domain.)

In addition, we calculated the distribution of Ka/Ks values from directly comparing the PRs versus their parents. From this specific sort of comparison, the neutral expectation for Ka/Ks is not ~1, because of non-synonymous mutations accumulating in the parent genes [2]. A significant excess of Ka/Ks values <0.5, however, may be indicative of purifying selection in the data set. For comparison, we also similarly calculated a Ka/Ks distribution for RψG s versus their respective parents, carefully parsing out disablements (frameshifts and premature stop codons) from the RψG sequences. This Ka/Ks distributional analysis is performed for both human and mouse (Figure 6). For human (Figure 6A), we find no significant excess of PRs with Ka/Ks values <0.5 relative to RψG s, either for the whole data set of PRs, or for the subset formed in the primate lineage, contrary to a previous report [12] (χ2 test or Fisher's exact test). This distribution is thus consistent with a set of largely neutral retrotranspositions, behaving like RψG s. However, for mouse-specific PRs, there is a significant excess with Ka/Ks <0.5 (P ≤ 0.001, χ2 test and Fisher's exact test) (Figure 6B), indicating that some of these mouse PR sequences are under purifying selection.

Figure 6
figure 6

Ka/Ks distributions for PRs and for retropseudogenes ( RψG s). (A) Distribution of Ka/Ks for human PRs (n = 262) meeting the local gene order test (N homologs ≤ 1), compared to Ka/Ks for the RψG s (n = 183). All sequences were required to have protein sequence identity ≥ 60.0% with their parent sequences. (B) As in (A), but for mouse PRs (n = 318) and RψG s (n = 220).

Conservation of open reading frames without disablements (frameshifts or stop codons), can also be an indicator of coding ability [12, 35, 36]. We derived a method for assessing significant conservation of open reading frames, using simulation with power-law insertion/deletion (indel) statistics [4]. Using simulations with calculated neutral rates of substitution, insertion and deletion, the likelihood of conservation of an open reading frame without interruption by frameshifts and stop codons, can be determined (see Methods for details). To give sufficient power, a P-value threshold of ≤ 0.01 was used as an indicator for significant reading-frame conservation (RFC). This calculation is complementary to the Ka/Ks analysis.

The results are listed Table 2. Even though there were no significant Ka/Ks values in mouse, we find over 30 mouse-specific PRs that have significant reading frame conservation, and a further 17 that were formed since divergence from dog (Table 2). In human, we find in total, 59 PRs with significant RFC, that have arisen since divergence from dog (Table 2). A phylogenetic tree for an example of one of the mouse PRs with significant RFC, which is homologous to citrate synthase, is depicted in Additional Figure 3, with a depiction of the chromosomal milieu of this PR and of its parent in Additional Figure 4. Two further representative examples of human PRs are also shown in Additional Figure 4 (one with significant RFC and the other without), with varying degrees of age and transcription evidence.

Table 2 Results of analysis for reading-frame conservation (RFC)

These results are evidence for conservation of protein open reading frames, even though we found no evidence for purifying selection from examination of the sequences individually for Ka/Ks. This would arise if the PRs were generally under relaxed or positive selection pressures at the codon level. The existence of relaxed selection is consistent with the markedly low numbers of PRs found to be transcribed in both human and mouse, particularly those that were formed since divergence from dog.

Out of those with significant RFC, is there any evidence for non-neutral Ka/Ks trends? We checked for significant excess of PRs with Ka/Ks values <0.5, > 0.5, <1.0 or >1.0 for each of the subsets listed (Table 2). In the human lineage, we find a significant excess of PRs with Ka/Ks >0.5 (40/59, P < 0.05, χ2 test and Fisher exact test, compared to an expectation from RψG sequences) formed between divergence from dog and from chimp. This non-neutral trend is suggestive of positive selection distributed throughout this specific subset population of RψG sequences. Out of the other subsets listed in Table 2, the only other significant non-neutral Ka/Ks tendency is for an excess of mouse-specific PRs to be under purifying selection (26/35 having Ka/Ks <0.5, compared to an expectation from RψG sequences, P < 0.05, χ2 test and Fisher exact test).

Functional categories

To assess whether the PRs and RψG s have any unusual functional associations, we assigned functional categories using the Gene Ontology (GO) functional classification (Table 3). As noted previously, 'Structural constituent of ribosome' is a prevalent functional category for RψG s [2, 3, 14]. Noticeably, for mouse, there are more retropseudogenes in metabolic categories, than in human (Table 3). A notable prevalence indicative of origin in retrotransposition, 'structural constituent of ribosome', occurs in the top ten of all PR (sub)sets, and is ranked number one for PRs formed since divergence from dog, for both mouse and human (Table 3). 'Structural constituent of ribosome' is also the only Gene Ontology term that is statistically overrepresented in all of the retrotransposed gene sets listed (Table 3 legend; P' <0.05, using binomial statistics and a Bonferroni correction for multiple hypothesis testing [37]). The functional category preferences are not caused by over-representation of any one parent, since when representations of PR s on a parent-by-parent basis are tallied up, we find only a very small number of parents giving rise to five or more PR s (Suppl. Table 1); the substantial majority of parents have only one PR offspring (286/353 [81%] for mouse, and 279/347 [80%] for human).

Table 3 Most common Gene Ontology (GO) functional terms for different sets of sequences *


We have developed two parallel pipelines for the annotation of putative retrogenes (PRs) and retropseudogenes (RψG) in whole genomes. The new pipeline for retropseudogene annotation employs length-based heuristics to speed up the processing of sequence alignment data. We used these pipelines on the vertebrates here, but they are readily applicable to any genome and its set of gene/protein annotations. Genome analysis is constantly in flux, and so obviously, as vertebrate genome assemblies and their annotations are streamlined further, we will be further able to refine our retrotransposition analyses, to remove any errors from missing gene annotations, small genome assembly gaps, etc.

We focussed on the annotation of retro(pseudo)genes in mouse and human. We were particularly interested in the retro(pseudo)genes formed since divergence from an 'outgroup' genome, that of the dog. We found evidence for excess, recent gene-retrotransposition activity in both human and mouse, since their divergences from the dog lineage. We find some evidence for selection on PRs at different phases of mouse and human genome evolution. In human, there is statistical evidence for non-neutral evolution (suggestive of positive selection), for population of PRs that have significantly conserved reading frames and that formed since divergence from the dog lineage. Also, we found that, human PRs formed since divergence from the dog lineage have significantly less transcription evidence, which is consistent with the possibility that they are pseudogenes, or in some intermediate phase of relaxed selection. Such a state of low expression coupled with relaxed selection may also arise for alternatively-spliced exons [38, 39]. In summary, our genomic analysis suggests that some human PRs, formed since divergence from the dog lineage, are undergoing a form of non-neutral evolution, but the majority are either young pseudogenes (that are undisabled simply by chance), or lowly-expressed coding sequences in a state of 'relaxed' selection.

Further information on the PRs and RψGs is available at the website: [40].



putative retrotransposition

RψG :



reading frame conservation


transposable element


Gene Ontology


Protein Data Bank


Structural Classification of Proteins


Long Interspersed Element


Short Interspersed Element


fraction of largest exon.


  1. D'Errico I, Gadaleta G, Saccone C: Pseudogenes in metazoa: origin and features. Briefings in functional genomics & proteomics 2004, 3(2):157–167. 10.1093/bfgp/3.2.157

    Article  Google Scholar 

  2. Zhang Z, Harrison P, Liu Y, Gerstein M: Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res 2003, 13(12):2541–2558. 10.1101/gr.1429003

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  3. Zhang Z, Harrison P, Gerstein M: Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome. Genome Res 2002, 12(10):1466–14482. 10.1101/gr.331902

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  4. Zhang Z, Gerstein M: Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic acids research 2003, 31: 5338–5348. 10.1093/nar/gkg745

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Harrison P, Gerstein M: Studying genomes through the aeons: protein families, pseudogenes and proteome evolution. J Mol Biol 2002, 318(5):1155–1174. 10.1016/S0022-2836(02)00109-2

    Article  CAS  PubMed  Google Scholar 

  6. Betran E, Thornton K, Long M: Retroposed new genes out of the X in Drosophila. Genome research 2002, 12(12):1854–1859. 10.1101/gr.6049

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  7. Esnault C, Maestre J, Heidmann T: Human LINE retrotransposons generate processed pseudogenes. Nature genetics 2000, 24(4):363–367. 10.1038/74184

    Article  CAS  PubMed  Google Scholar 

  8. Pavlicek A, Paces J, Elleder D, Hejnar J: Processed pseudogenes of human endogenous retroviruses generated by LINEs: their integration, stability, and distribution. Genome research 2002, 12(3):391–399. 10.1101/gr.216902. Article published online before print in February 2002

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  9. Torrents D, Suyama M, Zdobnov E, Bork P: A genome-wide survey of human pseudogenes. Genome research 2003, 13(12):2559–2567. 10.1101/gr.1455503

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  10. Brosius J: RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene 1999, 238(1):115–134. 10.1016/S0378-1119(99)00227-9

    Article  CAS  PubMed  Google Scholar 

  11. Harrison PM, Zheng D, Zhang Z, Carriero N, Gerstein M: Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability. Nucleic acids research 2005, 33: 2374–2383. 10.1093/nar/gki531

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  12. Marques AC, Dupanloup I, Vinckenbosch N, Reymond A, Kaessmann H: Emergence of young human genes after a burst of retroposition in primates. PLoS biology 2005, 3: e357. 10.1371/journal.pbio.0030357

    Article  PubMed Central  PubMed  Google Scholar 

  13. Emerson JJ, Kaessmann H, Betran E, Long M: Extensive gene traffic on the mammalian X chromosom. Science 2004, 303(5657):537–540. 10.1126/science.1090042

    Article  CAS  PubMed  Google Scholar 

  14. Harrison PM, Hegyi H, Balasubramanian S, Luscombe NM, Bertone P, Echols N, Johnson T, Gerstein M: Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Res 2002, 12(2):272–280. 10.1101/gr.207102

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  15. Ensembl database [ ] .

  16. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  17. Ophir R, Graur D: Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 1997, 205: 191–202. 10.1016/S0378-1119(97)00398-3

    Article  CAS  PubMed  Google Scholar 

  18. ORDB [ ].

  19. Repeatmasker [ ] .

  20. Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 1997, 13: 555–556.

    CAS  PubMed  Google Scholar 

  21. Cooper GM, Brudno M, Stone EA, Dubchak I, Batzoglou S, Sidow A: Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Res 2004, 14: 539–548. 10.1101/gr.2034704

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  22. Britten RJ, Rowen L, Williams J, Cameron RA: Majority of divergence between closely-related DNA samples is due to indels. PNAS 2003, 100: 4665–4670. 10.1073/pnas.0330964100

    Article  Google Scholar 

  23. Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5: 164–166.

    Google Scholar 

  24. Consortium GO: The Gene Ontology (GO) database and informatics resource. Nucleic acids research 2004, 32: D258-D261. 10.1093/nar/gkh036

    Article  Google Scholar 

  25. Gene Ontology [ ] .

  26. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic acids research 2004, 32: D189-D192. 10.1093/nar/gkh034

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  27. NCBI [ ] .

  28. Consortium ICGS: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 2004, 432: 695–716. 10.1038/nature03154

    Article  Google Scholar 

  29. Haas NB, Grabowski JM, Sivitz AB, Burch JB: Chicken repeat 1 (CR1) elements, which define an ancient family of vertebrate non-LTR retrotransposons, contain two closely spaced open reading frames. Gene 1997, 197: 305–309. 10.1016/S0378-1119(97)00276-X

    Article  CAS  PubMed  Google Scholar 

  30. Roest Crollius H, Jaillon O, Dasilva C, Ozouf-Costaz C, Fizames C, Fischer C, Bouneau L, Billault A, Quetier F, Saurin W, Bernot A, J. W: Characterization and repeat analysis of the compact genome of the freshwater pufferfish Tetraodon nigroviridis. Genome Res 2000, 10: 939–949. 10.1101/gr.10.7.939

    Article  CAS  PubMed  Google Scholar 

  31. Consortium RGSP: Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 2004, 428(6982):493–521. 10.1038/nature02426

    Article  Google Scholar 

  32. Consortium MGS: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420(6915):520–562. 10.1038/nature01262

    Article  Google Scholar 

  33. Genomics C: The sequence of the human genome. Science 2001, 291(5507):1304–1351. 10.1126/science.1058040

    Article  Google Scholar 

  34. Consortium ENCODEP: The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004, 306(5696):636–640. 10.1126/science.1105136

    Article  Google Scholar 

  35. Harrison PM, Carriero N, Liu Y, Gerstein M: A "polyORFomic" analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs. J Mol Biol 2003, 333: 885–892. 10.1016/j.jmb.2003.09.016

    Article  CAS  PubMed  Google Scholar 

  36. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 2003, 423: 241–254. 10.1038/nature01644

    Article  CAS  PubMed  Google Scholar 

  37. Harrison PM: Exhaustive assignment of compositional bias reveals universally prevalent biased regions: analysis of functional associations in human and Drosophila. BMC Bioinformatics 2006, 7: 441. 10.1186/1471-2105-7-441

    Article  PubMed Central  PubMed  Google Scholar 

  38. Modrek B, Lee CJ: Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet 2003, 34(2):177–180. 10.1038/ng1159

    Article  CAS  PubMed  Google Scholar 

  39. Wang W, Zheng H, Yang S, Yu H, Li J, Jiang H, Su J, Yang L, Zhang J, McDermott J, Samudrala R, Wang J, Yang H, Yu J, Kristiansen K, Wong GK, J W: Origin and evolution of new exons in rodents. Genome Res 2005, 15: 1258–1264. 10.1101/gr.3929705

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  40. [ ] .

Download references


This work was funded by a Discovery Grant awarded to P.M.H., from the National Science and Engineering Research Council of Canada.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Paul M Harrison.

Additional information

Authors' contributions

ZY developed the pipelines, performed most of the data analysis, and wrote initial drafts of the manuscript. DM performed some data analysis of alternative splicing events. MI performed phylogenetic analysis. PH conceived of the project, performed evolutionary analyses, and wrote later drafts of the manuscript.

Electronic supplementary material


Additional file 1: Additional Figure 1: Distributions of the fraction of the largest exon (FLE). The fraction of the parent sequences that are taken up by their largest exons (denoted FLE) is plotted for various sets of sequences. (PDF 231 KB)


Additional file 2: Additional Figure 2: Overlap of the three methods for determining species-specific or lineage-specific lists of PRs, for Human (top panel) and Mouse (bottom panel). Three different methods for determining the relative age of sequences were used for generating species-specific lists. This figure demonstrates the overlap between these methods. (PDF 55 KB)


Additional file 3: Additional Figure 3: Example of a PR of citrate synthase in Mouse. A phylogenetic analysis was performed of citrate synthase PR from mouse. (PDF 72 KB)


Additional file 4: Additional Figure 4: Chromosomal milieus for three PR examples. Screenshots taken from the Ensembl database, depicting nearby features for three PRs and their putative parents. (PDF 565 KB)

Additional file 5: Parents with most PRs. The parent genes that spawn most PRs are listed. (DOC 34 KB)

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Yu, Z., Morais, D., Ivanga, M. et al. Analysis of the role of retrotransposition in gene evolution in vertebrates. BMC Bioinformatics 8, 308 (2007).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Gene Ontology
  • Ancestral Sequence
  • Putative Parent
  • Chicken Repeat
  • Refseq mRNA