Open Access
Computational and experimental analyses of retrotransposon-associated minisatellite DNAs in the soybean genome
BMC Bioinformatics volume 13, Article number: S13 (2012)
Retrotransposons are mobile DNA elements that spread through genomes via the action of element-encoded reverse transcriptases. They are ubiquitous constituents of most eukaryotic genomes, especially those of higher plants. The pericentromeric regions of soybean (Glycine max) chromosomes contain >3,200 intact copies of the Gmr9/GmOgre retrotransposon. Between the 3' end of the coding region and the long terminal repeat, this retrotransposon family contains a polymorphic minisatellite region composed of five distinct, interleaved minisatellite families. To better understand the possible role and origin of retrotransposon-associated minisatellites, a computational project to map and physically characterize all members of these families in the G. max genome, irrespective of their association with Gmr9, was undertaken.
A computational pipeline was developed to map and analyze the organization and distribution of five Gmr9-associated minisatellites throughout the soybean genome. Polymerase chain reaction amplifications were used to experimentally assess the computational outputs.
A total of 63,841 copies of Gmr9-associated minisatellites were recovered from the assembled G. max genome. Ninety percent were associated with Gmr9, an additional 9% with other annotated retrotransposons, and 1% with uncharacterized repetitive DNAs. Monomers were tandemly interleaved and repeated up to 149 times per locus.
The computational pipeline enabled a fast, accurate, and detailed characterization of known minisatellites in a large, downloaded DNA database, and PCR amplification supported the general organization of these arrays.
The genomic landscapes of most higher eukaryotes are dominated by repetitive DNAs [1–3]. Most genome-wide, interspersed repeats are retrotransposons, including long and short interspersed elements (LINEs and SINEs, respectively) and long terminal repeat (LTR) retrotransposons [1, 3]. The action of LINE- or LTR retrotransposon-encoded reverse transcriptases on transcribed RNA intermediates, followed by integration of the resulting cDNAs, has led to the accumulation of thousands of these elements dispersed throughout the genomes of nearly all eukaryotic species [1, 3].
LTR retrotransposons range in length from a few hundred base pairs (non-autonomous, truncated copies) to >25,000 bp. Most autonomous elements encode structural proteins (gag) that assemble into intracellular virus-like particles, and enzymes (pol) required for polyprotein processing, reverse transcription, and cDNA integration (Figure 1). Most elements are littered with incapacitating mutations, including large insertions and deletions [1, 3].
The proliferation of retrotransposons can be highly disruptive to gene and genome structure and function, and host mechanisms can silence and eliminate elements [4, 5]. However, there is increasing evidence that retrotransposons have made important contributions to the evolution of gene and genome structure and function.
One feature of a few of these LTR retroelements is the presence of other classes of repeats within their DNA, specifically microsatellites and minisatellites [7–10]. Gmr9/GmOgre from soybean (Figure 1) is an uncharacteristically long and relatively high copy-number retrotransposon with a canonical representative >21 kb in length and in excess of 3,200 copies per genome [11, 12]. Gmr9 is a member of the Ty3-gypsy retrotransposon superfamily, and most copies are restricted to pericentromeric regions of all twenty soybean chromosomes. Members of this family and related elements in other plant species contain a polymorphic minisatellite (MS) array of several hundred base pairs just downstream of the coding region [7, 12, 13]. A combination of computational and experimental approaches was used to map and fully characterize the organization and distribution of the five Gmr9-associated MS throughout the soybean genome.
All G. max assembled chromosome sequences were downloaded from GenBank and made into a BLAST database. Details of the computational pipeline are described in Note 1 in Additional file 1, and its implementation is available at https://github.com/slowkow/soy-rtms.
Genomic DNA was isolated using a DNeasy Plant Mini Kit (Qiagen) from 100 mg of leaf tissue from Glycine max cv Williams 82 ground to a fine powder under liquid nitrogen. Primer sequences and cycling parameters are described in Note 2 in Additional file 1.
Computational analysis and results
The Gmr9/GmOgre MS region has five distinct repeat families designated A through E. Their consensus sequences have been reported [12, 15–19], and their lengths are 26, 38, 37, 105 and 43 bp, respectively (see Note 3 in Additional file 1). Nine of the last 11 bp of repeats B and C are identical and could be considered sub-repeats, but otherwise there are no detectable sequence similarities among any of the repeat families. BLASTn searches of all GenBank DNA databases, from which Glycine sequences were excluded, retrieved no similar sequences (see Note 4 in Additional file 1).
Individual queries of the five MS consensus sequences against the downloaded soybean chromosome database resulted in 63,841 unique hits with ≥90% identity, of which 51,154 (80%) were within the map coordinates of annotated retrotransposons (Table 1 and Figure 2). Of these, a total of 40,150 (78%) fell within the coordinates of an "intact" member of the Gmr9 family (Table 1). In addition to Gmr9, 42 other defined retrotransposon families representing both Ty3-gypsy and Ty1-copia superfamilies contained at least one of the MS sequences (Table 1). With the exception of Gmr5 and Gmr6, the MS repeats were generally more plentiful among Ty3-gypsy superfamily members than Ty1-copia members (Table 1).
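The filtering and assignment step described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes BLAST hits in the default tabular layout (`-outfmt 6`), treats "within the map coordinates" as full containment of the hit inside an annotated element, and uses hypothetical names (`Hit`, `Element`, `parse_blast_line`, `containing_family`) and made-up records.

```python
from typing import Iterable, NamedTuple, Optional


class Hit(NamedTuple):
    chrom: str
    start: int       # start <= end regardless of strand
    end: int
    identity: float  # percent identity reported by BLAST


class Element(NamedTuple):
    chrom: str
    start: int
    end: int
    family: str


def parse_blast_line(line: str) -> Hit:
    """Parse one line of BLAST tabular output (default -outfmt 6 columns)."""
    f = line.rstrip("\n").split("\t")
    s, e = int(f[8]), int(f[9])  # sstart/send; send < sstart on the minus strand
    return Hit(chrom=f[1], start=min(s, e), end=max(s, e), identity=float(f[2]))


def containing_family(hit: Hit, elements: Iterable[Element]) -> Optional[str]:
    """Return the family of the first annotated element that fully contains the hit."""
    for el in elements:
        if el.chrom == hit.chrom and el.start <= hit.start and hit.end <= el.end:
            return el.family
    return None


# Illustrative, made-up records (not data from the study).
blast_lines = [
    "A\tGm04\t100.0\t26\t0\t0\t1\t26\t1050\t1075\t1e-08\t52.0",
    "A\tGm04\t88.5\t26\t3\t0\t1\t26\t9000\t8975\t1e-03\t30.0",
]
annotated = [Element("Gm04", 1000, 22000, "Gmr9")]

# Keep hits at or above the 90% identity threshold, then look up their element.
kept = [h for h in map(parse_blast_line, blast_lines) if h.identity >= 90.0]
families = [containing_family(h, annotated) for h in kept]
```

The linear scan over elements is chosen for readability; for a whole genome an interval index per chromosome would be the more practical choice.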
The remaining 18,781 MS hits fell outside of annotated transposable elements (TE) and clustered into a total of 4,328 loci. Ninety-two percent of the DNA sequences (3,975) were at least 80% identical over a length of ≥400 bp to annotated copies of Gmr9 found elsewhere in the genome (Table 1). This far exceeded the number of discrete MS hits initially found for Gmr9, as did the corresponding data for Gmr3, Gmr4, Gmr5, Gmr25, and Gmr139. Of the remaining 354 unannotated loci, all but 75 could be assigned to a TE family. DNAs from the 75 unidentified loci were queried against the nr and gss GenBank databases, and all retrieved >25 hits with E-values <10^-10 in one or both of these databases, indicating that all belonged to repetitive families. No further analyses of these sequences were undertaken (see Note 5 in Additional file 1).
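Grouping individual hits into loci can be illustrated with a simple gap-based merge. The source does not state the clustering criterion, so the `max_gap` threshold and the function name `cluster_into_loci` below are assumptions made only for this sketch.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]     # (chromosome, start, end)
Locus = Tuple[str, int, int, int]   # (chromosome, start, end, number of hits)


def cluster_into_loci(hits: Iterable[Interval], max_gap: int = 500) -> List[Locus]:
    """Merge hits on the same chromosome that lie within max_gap bp of each other.

    max_gap is a hypothetical threshold, not a value taken from the study.
    """
    loci: List[Locus] = []
    for chrom, start, end in sorted(hits):
        if loci and loci[-1][0] == chrom and start - loci[-1][2] <= max_gap:
            c, s, e, n = loci[-1]
            loci[-1] = (c, s, max(e, end), n + 1)  # extend the current locus
        else:
            loci.append((chrom, start, end, 1))    # open a new locus
    return loci


# Illustrative, made-up coordinates.
example_hits = [
    ("Gm04", 5000, 5025),
    ("Gm04", 100, 125),
    ("Gm04", 130, 167),
    ("Gm05", 100, 125),
]
example_loci = cluster_into_loci(example_hits)
```

Sorting first means each hit only has to be compared with the most recently opened locus, so the merge is a single pass.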
The average number of repeats per Gmr9 element (the ratio of total hits to discrete hits) was 8.5 for repeat A, 6.4 for repeat B, 6.0 for repeat C, 2.0 for repeat D, and 2.9 for repeat E. These values were consistent with the organization of the consensus sequence reported previously. The total number of hits was considerably smaller for most of the other families. Figure 2 illustrates the distribution and density of TE and the five MS on chromosome 4. The densities of MS and TE are strongly correlated, and the former are restricted to the pericentromeric region. Figure 3 represents a 34 kb section of chromosome 4 with two tandem Gmr9 family members (top) and an expanded region of 2.9 kb from Gmr9_Gm4-9 (bottom). The MS array extends across 2.6 kb and consists of 17 tandem repeats of A-B-C, followed by one tandem array of A-B-A. Approximately 120 bp downstream of the last A repeat there is one D-E repeat followed by a break of about 100 bp and a second E repeat.
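The ratio of total hits to discrete hits can be computed directly from per-hit records. The sketch below assumes one reading of "discrete hits", namely the number of distinct elements that carry at least one copy of a given repeat; the function name and the record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def mean_copies_per_element(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """records holds one (element_id, repeat_family) pair per hit.

    Returns, for each repeat family, total hits divided by the number of
    distinct elements with at least one hit (an assumed definition).
    """
    total: Dict[str, int] = defaultdict(int)
    carriers: Dict[str, Set[str]] = defaultdict(set)
    for element_id, repeat in records:
        total[repeat] += 1
        carriers[repeat].add(element_id)
    return {repeat: total[repeat] / len(carriers[repeat]) for repeat in total}


# Illustrative, made-up records: element e1 has three A copies, e2 has one.
example_records = [
    ("e1", "A"), ("e1", "A"), ("e1", "A"), ("e2", "A"),
    ("e1", "D"), ("e2", "D"),
]
example_ratios = mean_copies_per_element(example_records)
```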
[ABAC]n was the primary pattern found in the MS arrays, but arrays of [ABC]n, as found for Gm4-97 (Figure 3), and [ACB]n were also retrieved (Table S1 in Additional file 2). The longest unbroken tandem array consisted of 37 repeats of ABAC, with a total length of 4,760 bp. Other long, unbroken tandem arrays were found in which ABC was repeated 16 to 28 times to total lengths of nearly 3,000 bp. The longest unbroken tandem array of ACB was nearly 1,800 bp in length. The majority of arrays were far shorter (see Table S1 in Additional file 2 and Note 6 in Additional file 1).
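Once the ordered hits at a locus are written as a string of family letters, unbroken tandem runs of a unit such as ABAC or ABC can be found with a regular expression. This is a minimal sketch under that encoding assumption; the function names are hypothetical, and the consensus lengths are the ones given in the text.

```python
import re

# Consensus lengths (bp) of repeat families A through E, as given in the text.
CONSENSUS_LENGTH = {"A": 26, "B": 38, "C": 37, "D": 105, "E": 43}


def longest_tandem_run(array: str, unit: str) -> int:
    """Number of copies of `unit` in the longest unbroken tandem run in `array`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", array)
    return max((len(run) // len(unit) for run in runs), default=0)


def nominal_length(unit: str, copies: int) -> int:
    """Length in bp of `copies` tandem units if every copy had consensus length."""
    return copies * sum(CONSENSUS_LENGTH[letter] for letter in unit)


# The array described for Figure 3: 17 copies of A-B-C followed by A-B-A.
figure3_array = "ABC" * 17 + "ABA"
abc_copies = longest_tandem_run(figure3_array, "ABC")
abac_nominal = nominal_length("ABAC", 37)
```

For 37 copies of ABAC the nominal figure is 4,699 bp, close to the 4,760 bp reported; the two need not match exactly because individual copies can differ in length from the consensus.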
Of the approximately 22,500 copies of repeat A retrieved, nearly 75% were identical to the consensus sequence, and another 20% differed by a single base pair (Fig. S1 in Additional file 3). In the case of repeat B, almost 44% of the approximately 17,650 copies of this repeat were identical to the consensus with the remaining 56% distributed among several different variants (Fig. S1). Repeat D, the longest repeat, was far more polymorphic than the other repeats, with a greater number of sequences that varied significantly from the consensus in identity and length (Figs. S1 and S2 in Additional file 3). Length variants of the other repeats are shown in Fig. S2 (see Note 7 in Additional file 1). Repeat A has virtually no length variants.
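The breakdown of copies into those identical to the consensus, those differing by one base, and length variants can be tallied as follows. This is only a sketch: it compares same-length copies by Hamming distance and sets aside copies of a different length, whereas the study's actual comparison method is not described here. The consensus string is a made-up 26-bp placeholder, not the real repeat A sequence.

```python
from collections import Counter
from typing import Iterable


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def mismatch_spectrum(consensus: str, copies: Iterable[str]) -> Counter:
    """Count copies by number of mismatches; other lengths go under 'length_variant'."""
    spectrum: Counter = Counter()
    for seq in copies:
        if len(seq) != len(consensus):
            spectrum["length_variant"] += 1
        else:
            spectrum[hamming(consensus, seq)] += 1
    return spectrum


# Hypothetical 26-bp placeholder consensus and made-up copies.
placeholder = "ACGTACGTACGTACGTACGTACGTAC"
example_copies = [
    placeholder,                    # identical
    placeholder,                    # identical
    "T" + placeholder[1:],          # one substitution
    placeholder[:-1],               # one base shorter
]
example_spectrum = mismatch_spectrum(placeholder, example_copies)
```

Dividing `example_spectrum[0]` by the number of copies gives the fraction identical to the consensus, which is the kind of figure quoted above for repeats A and B.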
Electrophoretic separation of the amplification products generated from all primer combinations resulted in long ladders of closely spaced bands (Fig. S3 in Additional file 3). The longest amplicons were in excess of 3 kb, consistent with the computational findings (see Table S1).
Discussion and conclusions
Gmr9/GmOgre is one of a number of plant retrotransposons in the Ogre retrotransposon lineage that contain embedded satellites (see Note 8 in Additional file 1) [7, 12, 13]. In the case of the five MS families initially found in Gmr9, we have shown that every single copy is embedded in a repetitive DNA, 99% of which are LTR retrotransposons, and most of these are Gmr9 copies (see Note 9 in Additional file 1). Virtually all are found in pericentromeric regions of all twenty G. max chromosomes. The origin of the MS repeats is clearly Gmr9, but the means by which other retrotransposon families acquired them is unknown.
The considerable repeat number variation among the clusters of MS loci (Table S1) was not unexpected. The mechanisms sponsoring expansions and contractions of satellite repeats, including polymerase slippage, gene conversion, non-allelic homologous recombination, and post-replicative DNA repair, might be elevated for several reasons. For instance, in the case of slippage, host RNA polymerase, element-encoded reverse transcriptase, and host DNA polymerase could all contribute. The sheer number of retrotransposon loci carrying these MS clusters creates thousands of potential sites for non-allelic recombination. The maintenance of the relatively high sequence identity of repeats A, B, and C suggests that gene conversion may be homogenizing these sequences.
The possible functions, if any, of the MS sequences reported here are not known. These and other more distantly related retrotransposons that possess internal MS regions [20–23] invite speculation about the origins and possible functions of these DNAs. Pericentromeric regions are highly enriched for both retrotransposons and centromere-specific MS DNAs, and both classes are recovered in centromere-specific histone H3 chromatin immunoprecipitation assays [24–27]. Alternatively, centromeric retrotransposons may contribute to molecular processes that facilitate the formation of centromeric chromatin. Minisatellites embedded in mobile elements that target centromeres would be an effective pairing for the dispersal and amplification of sequences that contribute to centromere function.
Computational tools enabled a complete physical characterization of the polymorphisms, map positions, and organization of five MS in the soybean genome. The results confirm that these particular MS are universally embedded in other repetitive DNA classes, primarily LTR retrotransposons, the majority of which are members of the Gmr9 retrotransposon family.
Grandbastien MA: Retrotransposons in Plants. Encyclopedia of Plants. Edited by: Mahy BWJ, Van Regenmortel MHV. 2008, Oxford: Elsevier, 428-436.
Richard GF, Kerrest A, Dujon B: Comparative genomics and molecular dynamics of DNA repeats in eukaryotes. Microbiol Mol Biol Rev. 2008, 72: 686-727. 10.1128/MMBR.00011-08.
Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O, Paux E, SanMiguel P, Schulman AH: A unified classification system for eukaryotic transposable elements. Nature Rev Genet. 2007, 8: 973-982. 10.1038/nrg2165.
Bennetzen JL: Mechanisms and rates of genome expansion and contraction in flowering plants. Genetica. 2002, 115: 29-36. 10.1023/A:1016015913350.
Lisch D, Bennetzen JL: Transposable element origins of epigenetic gene regulation. Curr Opin Plant Biol. 2011, 14: 156-161. 10.1016/j.pbi.2011.01.003.
Shapiro JA: Mobile DNA and evolution in the 21st century. Mobile DNA. 2010, 1: 4. 10.1186/1759-8753-1-4.
Macas J, Koblizkova A, Navratilova A, Neumann P: Hypervariable 3' UTR region of plant LTR-retrotransposons as a source of novel satellite repeats. Gene. 2009, 448: 198-206. 10.1016/j.gene.2009.06.014.
Ramsay L, Macaulay M, Cardle L, Morgante M, degli IS, Maestri E, Powell W, Waugh R: Intimate association of microsatellite repeats with retrotransposons and other dispersed repetitive elements in barley. Plant J. 1999, 17: 415-425. 10.1046/j.1365-313X.1999.00392.x.
Smykal P, Kalendar R, Ford R, Macas J, Griga M: Evolutionary conserved lineage of Angela-family retrotransposons as a genome-wide microsatellite repeat dispersal agent. Heredity. 2009, 103: 157-167. 10.1038/hdy.2009.45.
Laten HM, Havecker ER, Farmer LM, Voytas DF: SIRE1, an endogenous retrovirus family from Glycine max, is highly homogeneous and evolutionarily young. Mol Biol Evol. 2003, 20: 1222-1230. 10.1093/molbev/msg142.
Du J, Tian Z, Hans CS, Laten HM, Cannon SB, Jackson SA, Shoemaker RC, Ma J: Evolutionary conservation, diversity and specificity of LTR-retrotransposons in flowering plants: insights from genome-wide analysis and multi-specific comparison. Plant J. 2010, 63: 584-598. 10.1111/j.1365-313X.2010.04263.x.
Laten HM, Mogil LS, Wright LN: A shotgun approach to discovering and reconstructing consensus retrotransposons ex novo from dense contigs of short sequences derived from Genbank Genome Survey Sequence database records. Gene. 2009, 448: 168-173. 10.1016/j.gene.2009.06.011.
Macas J, Neumann P: Ogre elements-a distinct group of plant Ty3/gypsy-like retrotransposons. Gene. 2007, 390: 108-116. 10.1016/j.gene.2006.08.007.
Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W: Genome sequence of the palaeopolyploid soybean. Nature. 2010, 463: 178-183. 10.1038/nature08670.
Laten HM: Retrotransposon-associated 105-bp minisatellite in Glycine max: MSAT-105_Gm. Repbase Rep. 2010, 10: 2178-
Mogil LS, Laten HM: Retrotransposon-associated 43-bp minisatellite in Glycine max: SAT-43_Gm. Repbase Rep. 2011, 11: 2562-
Mogil LS, Laten HM: Retrotransposon-associated 26-bp minisatellite in Glycine max: SAT-26_Gm. Repbase Rep. 2011, 11: 2559-
Mogil LS, Laten HM: Retrotransposon-associated 38-bp minisatellite in Glycine max: SAT-38_Gm. Repbase Rep. 2011, 11: 2561-
Mogil LS, Laten HM: Retrotransposon-associated 37-bp minisatellite in Glycine max: SAT-37_Gm. Repbase Rep. 2011, 11: 2560-
Kejnovsky E, Kubat Z, Macas J, Hobza R, Mracek J, Vyskot B: Retand: a novel family of gypsy-like retrotransposons harboring an amplified tandem repeat. Mol Genet Genom. 2006, 276: 254-263. 10.1007/s00438-006-0140-x.
Martinez-Izquierdo JA, Garcia-Martinez J, Vicient CM: What makes Grande1 retrotransposon different?. Genetica. 1997, 100: 15-28. 10.1023/A:1018332218319.
Sanz-Alferez S, SanMiguel P, Jin YK, Springer PS, Bennetzen JL: Structure and evolution of the Cinful retrotransposon family of maize. Genome. 2003, 46: 745-752. 10.1139/g03-061.
Tek AL, Song J, Macas J, Jiang J: Sobo, a recently amplified satellite repeat of potato, and its implications for the origin of tandemly repeated sequences. Genetics. 2005, 170: 1231-1238. 10.1534/genetics.105.041087.
Houben A, Schroeder-Reiter E, Nagaki K, Nasuda S, Wanner G, Murata M, Endo TR: CENH3 interacts with the centromeric retrotransposon cereba and GC-rich satellites and locates to centromeric substructures in barley. Chromosoma. 2007, 116: 275-283. 10.1007/s00412-007-0102-z.
Nagaki K, Shibata F, Suzuki G, Kanatani A, Ozaki S, Hironaka A, Kashihara K, Murata M: Coexistence of NtCENH3 and two retrotransposons in tobacco centromeres. Chromosome Res. 2011, 19: 591-605. 10.1007/s10577-011-9219-2.
Tek AL, Kashihara K, Murata M, Nagaki K: Functional centromeres in soybean include two distinct tandem repeats and a retrotransposon. Chromosome Res. 2010, 18: 337-347. 10.1007/s10577-010-9119-x.
Plohl M, Luchetti A, Mestrovic N, Mantovani B: Satellite DNAs between selfishness and functionality: structure, genomics and evolution of tandem repeats in centromeric (hetero)chromatin. Gene. 2008, 409: 72-82. 10.1016/j.gene.2007.11.013.
Neumann P, Navratilova A, Koblizkova A, Kejnovsky E, Hribova E, Hobza R, Widmer A, Dolezel J, Macas J: Plant centromeric retrotransposons: a structural and cytogenetic perspective. Mobile DNA. 2011, 2: 4. 10.1186/1759-8753-2-4.
Grant D, Nelson RT, Cannon SB, Shoemaker RC: SoyBase, the USDA-ARS soybean genetics and genomics database. Nucleic Acids Res. 2010, 38: D843-846. 10.1093/nar/gkp798.
This work was supported by Loyola University Carbon and Mulcahy undergraduate research fellowships to LSM and KS, respectively. The authors thank Catherine Putonti for programming support.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 2, 2012: Proceedings from the Great Lakes Bioinformatics Conference 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S2
The authors declare that they have no competing interests.
LSM carried out the experimental work and contributed to general design, implementation, and analysis of computational outputs. KS designed, developed, and implemented all computational tools and contributed to analysis of outputs. LSM and KS contributed to the initial draft of the manuscript. HML conceived of the study, participated in its design and coordination, and generated the final draft of the manuscript. All authors read and approved the final manuscript.
Lauren S Mogil and Kamil Slowikowski contributed equally to this work.