MITE Digger, an efficient and accurate algorithm for genome wide discovery of miniature inverted repeat transposable elements
© Yang; licensee BioMed Central Ltd. 2013
Received: 11 March 2013
Accepted: 2 June 2013
Published: 7 June 2013
Miniature inverted repeat transposable elements (MITEs) are abundant non-autonomous elements that play important roles in shaping gene and genome evolution. Their characteristic structural features make them suitable for automated identification by computational approaches; however, de novo MITE discovery at the genomic level is still resource expensive, so efficient and accurate computational tools are desirable. Existing algorithms process every member of a MITE family; therefore, a major portion of the computing task is redundant.
In this study, redundant computing steps were analyzed, and a novel algorithm emphasizing the reduction of such redundant computing was implemented in MITE Digger. It processed the whole rice genome sequence database in ~15 hours and produced 332 MITE candidates with low false positive (1.8%) and false negative (0.9%) rates. MITE Digger was also tested for genome wide MITE discovery with four other genomes.
MITE Digger is efficient and accurate for genome wide retrieval of MITEs. Its user friendly interface further facilitates genome wide analyses of MITEs on a routine basis. The MITE Digger program is available at: http://labs.csb.utoronto.ca/yang/MITEDigger.
Miniature inverted repeat transposable elements (MITEs) are short non-autonomous transposable elements (TEs) that move by cut-and-paste mechanisms [1-3]. They do not produce transposases, the proteins that mobilize TEs, and therefore depend on transposases produced by autonomous elements for transposition [4, 5]. Compared to typical cut-and-paste transposons, MITE families often have high copy numbers, and transposition of these elements generates widespread genomic variation [6-8]. Due to their small size, MITE insertions are much less disruptive to genes than insertions of larger elements. Therefore, they are often found in genic regions, introducing phenotypic changes in some cases. For example, an insertion of a Stowaway MITE named dTstu1 in the flavonoid 3′,5′-hydroxylase (F3′5′H) gene of potato leads to red pigmentation, and an mPing insertion in the rice Hd1 gene results in changes in flowering time [9-11]. Most MITE insertions may not cause phenotypic changes; rather, they alter gene expression levels and epigenetic profiles in ways that may contribute to the overall fitness of the organism under certain conditions [6, 12-16]. While understanding how MITEs transpose to achieve high copy numbers can further our knowledge of their influence on genome evolution and provide MITE based genetic tools [4, 17-19], genome wide identification and characterization of MITE families broaden our views on the different types of MITEs and on the scale of their activity and amplification during evolution [6, 20-23]. New MITE families may become better candidates for studies of transposition and amplification as well as for genetic markers.
MITEs were first discovered from the genetic variation caused by an insertion at the maize wx-B2 locus. In the early 1990s, computational approaches were employed to assist the characterization of MITEs as genomic sequences became increasingly available in databases. Because MITEs have well defined structural features, including small size (50-800 bp), terminal inverted repeats (TIRs) and target site duplications (TSDs), the task of discovering MITE families at the genomic level is suitable for automation. However, the complexity of higher eukaryotic genomes presents a major challenge for such automation. The TSDs and TIRs of MITEs are very short sequences that can occur at a high frequency by chance, resulting in a large number of false output entries that need to be manually analyzed [26-28]. Automated genome wide identification of MITEs can be time consuming because of the large sizes and high TE contents of higher eukaryotic genomes. Such tasks are often demanding on computing resources such as the number of CPUs and the amount of RAM.
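The structural signature described above (size window, TIRs, flanking TSDs) can be illustrated with a minimal Python sketch. It is not taken from MITE Digger or from any of the programs discussed here; the function names and the default values for `tir_len`, `tsd_len` and `max_mismatch` are assumptions chosen for illustration, and only the 50-800 bp size window comes from the text.

```python
# Hypothetical helper names; parameter defaults are illustrative assumptions.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def has_mite_signature(genome: str, start: int, end: int,
                       tir_len: int = 10, tsd_len: int = 2,
                       max_mismatch: int = 1,
                       min_len: int = 50, max_len: int = 800) -> bool:
    """Check whether genome[start:end] looks like a MITE candidate.

    The candidate must fall in the size window, its two ends must be
    inverted repeats of each other (TIRs, allowing a few mismatches),
    and the bases immediately flanking it must be identical (TSDs).
    """
    element = genome[start:end].upper()
    if not (min_len <= len(element) <= max_len):
        return False
    if start < tsd_len or end + tsd_len > len(genome):
        return False  # not enough flanking sequence to hold a TSD

    left_tir = element[:tir_len]
    right_tir_rc = reverse_complement(element[-tir_len:])
    mismatches = sum(a != b for a, b in zip(left_tir, right_tir_rc))
    if mismatches > max_mismatch:
        return False

    left_tsd = genome[start - tsd_len:start].upper()
    right_tsd = genome[end:end + tsd_len].upper()
    return left_tsd == right_tsd


# Toy example: a 60 bp element flanked by a "TA" duplication.
tir = "CTCCCTCCGT"
element = tir + "ATGC" * 10 + reverse_complement(tir)
genome = "GGGG" + "TA" + element + "TA" + "CCCC"
print(has_mite_signature(genome, 6, 6 + len(element)))
```

Because a 2 bp TSD and a 10 bp TIR are short, many non-MITE sequences in a large genome pass a check of this kind by chance, which is the source of the false output entries mentioned above.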
The program FINDMITE was developed and used for the discovery of eight novel MITE families in the malaria mosquito Anopheles gambiae genome from a sequence database containing short entries. Its input parameters include a predefined sequence or size of the TSD, the length of the TIRs, and the minimal distance between TIRs. All sequences satisfying the parameters are retrieved and processed. The program MUST (MITE Uncovering SysTem) uses string matching to identify candidate TIR structures and then checks for the presence of a flanking TSD pair. All candidates are retrieved and grouped. For genome wide analyses of higher eukaryotic MITEs, FINDMITE and MUST generate a large number of false positives because many sequences that satisfy the defined parameters occur by chance. MITE-Hunter was developed to decrease the number of false positives. It uses multiple sequence alignment to filter out sequences that meet the MITE signature criteria but have similar flanking sequences. As a result, MITE-Hunter has a false positive rate of 4.4-8.3%, compared with 85% for FINDMITE and 86% for MUST. All of these programs retrieve and analyze every candidate element, whereas, theoretically, the identification of a single element is sufficient for a MITE family with hundreds of copies. The existing algorithms are therefore resource expensive and require lengthy processing time. For example, it took MITE-Hunter ~44 hours to process the rice genome database (~380 Mb) on a Linux cluster using five CPUs. The 700 raw output entries were reduced and grouped into 132 MITE families by manual downstream analyses.
Here, an algorithm was developed to increase processing efficiency by reducing or avoiding redundant computing, thereby shortening the processing time and reducing the requirement for computing resources. The novel algorithm was implemented in MITE Digger, one of the few TE analysis programs featuring a graphical user interface [31, 32]. When tested with the rice genome sequence database, MITE Digger took ~15 hours on a quad core Windows system to complete processing, with a typical memory use of ~150 MB. Comparative analyses of the MITE Digger output with the MITE-Hunter output showed that MITE Digger is accurate, with low false positive and false negative rates.
Database and programs
The rice genome sequence database was obtained from IRGSP/RAP build 5. The output from MITE-Hunter was obtained from Yujun Han and Sue Wessler (personal communication). Analyses of MITE families and comparisons were performed with MAK1.8 [34, 35] (http://labs.csb.utoronto.ca/yang/MAK/). MITE Digger was based on the Perl script used for the retrieval of ATon elements. BLAST+ 2.2.24 was used to perform sequence similarity searches. MITE Digger output was generated using its default parameters. The MITE Digger entries that did not match those in the MITE-Hunter output were searched against the rice repeat database using RepeatMasker (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker). Genome sequences of Brassica rapa, tomato, potato and sorghum were downloaded from PlantGDB (http://www.plantgdb.org/).
Implementation and testing system
The algorithm was implemented in ActivePerl 5.10.1 with Perl Tk 804.029 and BioPerl modules. MITE Digger was tested on Windows XP and Windows 7 systems.
Results and discussion
Redundant computing in genome wide discovery of MITEs
A MITE family typically consists of several hundred highly similar copies. Therefore, when every candidate element is processed in genome wide analyses, a MITE family can be computed hundreds of times. Such repetition can occur at multiple stages, including signature feature (i.e. TIR and TSD) identification, screening, multiple sequence alignment and clustering. It also applies to elements that do not meet input criteria such as the element length. Furthermore, genomes often contain highly repetitive sequences, such as retroelements, that have short inverted repeats flanked by short direct repeats buried in their internal sequences. These non-MITE sequences can occur more often than MITE families, and each of them may be computed hundreds or thousands of times depending on its copy number.
This redundant computing can take up a major portion of the total processing time. In addition, after every element of a family has been retrieved, removing redundancy or grouping elements into families can be time consuming and resource intensive. In particular, when multiple sequence alignment is used to identify candidates with different flanking sequences, aligning the elements of a family with several hundred copies takes a significant amount of time. Finally, when database entries are sliced into very short fragments to reduce memory use, processing efficiency can drop dramatically because of the overhead of retrieving and analyzing a large number of sequences.
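To make the cost of this redundancy concrete, the following Python sketch keeps one exemplar per group of similar candidates and skips later copies before any expensive step runs. It is only an illustration of the general idea of avoiding repeated work, not a description of how MITE Digger does it; the k-mer comparison, the function names and the `k` and `min_shared` defaults are all assumptions.

```python
def kmers(seq: str, k: int = 12) -> set:
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pick_exemplars(candidates, k: int = 12, min_shared: float = 0.5):
    """Keep the first copy of each group of similar candidates.

    A candidate is skipped when at least `min_shared` of its k-mers were
    already seen in an earlier exemplar, so expensive downstream steps
    (alignment, flank comparison) would run once per family rather than
    once per copy. Returns (exemplars, number_skipped).
    """
    seen = set()
    exemplars = []
    skipped = 0
    for seq in candidates:
        ks = kmers(seq, k)
        if not ks:
            continue  # shorter than k, nothing to compare
        shared = len(ks & seen) / len(ks)
        if shared >= min_shared:
            skipped += 1
            continue
        exemplars.append(seq)
        seen |= ks
    return exemplars, skipped


# Toy input: two copies and one near-copy of family A, one copy of family B.
family_a = "GATTACAGGCTCAGTCCGATAGGCATCGTTAGCCAATGCGTACCTGAAGTCGGATCTACG"
family_a_variant = family_a[:29] + "A" + family_a[30:]  # one substitution
family_b = "AT" * 30
exemplars, skipped = pick_exemplars(
    [family_a, family_a_variant, family_b, family_a])
print(len(exemplars), skipped)
```

With several hundred copies per family, a filter of this kind would turn hundreds of alignments into one, at the price of possibly merging distinct but related subfamilies when `min_shared` is set too low.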
Measures to reduce redundant computing
Pipeline algorithm and parameters
MITE Digger allows customized input of parameters. Database entries larger than the defined entry size are automatically sliced and formatted. The option to set the number of CPUs allows optimal performance of MITE Digger on platforms with different hardware settings. The option for probability level allows timely processing of large genome databases with a minor chance of missing a MITE family. Changes to other parameters, such as the copy number threshold, the different flank sequence threshold and the sensitivity, affect the number of output entries. The predicted running time is based on the current average processing rate; because processing accelerates as it proceeds (Figure 3B), the actual total run time can be dramatically shorter than the time predicted early in processing.
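The behaviour of a run time prediction based on the average processing rate can be shown with a small Python sketch. The extrapolation formula is a generic one and is assumed here rather than taken from MITE Digger, and the progress figures in `progress_log` are invented purely to illustrate an accelerating run.

```python
def predicted_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Extrapolate total run time from the average rate so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# Hypothetical (elapsed hours, fraction of database processed) checkpoints
# for a run that speeds up over time; these are not measured values.
progress_log = [(1.0, 0.02), (5.0, 0.20), (10.0, 0.55), (15.0, 1.00)]

predictions = [predicted_total_hours(t, f) for t, f in progress_log]
for (elapsed, fraction), total in zip(progress_log, predictions):
    print(f"after {elapsed:4.1f} h ({fraction:4.0%} done): "
          f"predicted total {total:5.1f} h")
```

In this invented run the early checkpoints extrapolate from the slowest part of the job, so the prediction shrinks at every checkpoint until it meets the actual run time.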
Evaluation of MITE Digger output
The output from MITE Digger was compared with that of MITE Hunter. First, the entries in the MITE Digger output were cross matched with those in the MITE Hunter output, resulting in 1407 non-redundant matching pair records. Since the MITE Hunter output contains entries up to 1500 bp, only those between 50 and 800 bp were considered MITEs. Among the 1407 records, 658 pairs cover at least 80% of the length of both the query and the hit sequence. The remaining records were manually inspected, and four additional matching pairs were found. Because a MITE family may contain several subfamilies, one entry from MITE Digger can match several entries in MITE Hunter and vice versa. Therefore, the 662 matching pairs consisted of only 287 MITE Digger entries and 301 MITE Hunter entries (Additional file 2). In the remaining 749 records, 13 MITE Digger entries match the terminal sequences of some MITE Hunter entries, suggesting that these MITE Digger entries represent new subfamilies in the families of the MITE Hunter entries (Additional file 2). The rest of the records match short regions in the internal sequences of some long entries of the MITE Hunter output and were considered non-informative. Therefore, of the 332 MITE Digger output entries, 300 can be classified with the MITE Hunter output entries. The remaining 32 entries were scanned against the rice repeat database using RepeatMasker; 11 of them were previously annotated elements or could be manually classified. Six (1.8%) of them are false positives: two rice simple repeats and four retroelements (one SZ-50_int-int LTR terminal, two Cassandra, and one SINE03_OS). The other five are DNA elements: two Stowaways (TREP215, STOWAWAY10_OS), one Harbinger (ID-4), one Mutator and one of unknown category (OSTE23). Therefore a total of 17 (7.5%) MITE Digger output entries are classifiable MITE families that are not in the MITE Hunter output (Additional file 2). The remaining 21 entries cannot be classified even though they have the characteristics of Class II TEs.
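The reciprocal coverage criterion used for the matching pairs (at least 80% of both the query and the hit length) can be sketched in Python as follows. The `Hit` record, its field names and the toy records are assumptions for illustration and do not reflect the actual file formats used in the comparison; the sketch also shows why the number of matching pairs can exceed the number of distinct entries on either side.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One query/hit matching pair (hypothetical record layout)."""
    query_id: str       # MITE Digger entry
    hit_id: str         # MITE Hunter entry
    query_len: int      # full length of the query sequence (bp)
    hit_len: int        # full length of the hit sequence (bp)
    query_aligned: int  # aligned bases on the query
    hit_aligned: int    # aligned bases on the hit


def covers_both(h: Hit, min_cov: float = 0.8) -> bool:
    """True when the match spans >= min_cov of BOTH sequences."""
    return (h.query_aligned / h.query_len >= min_cov
            and h.hit_aligned / h.hit_len >= min_cov)


def summarize(hits, min_cov: float = 0.8):
    """Return (matching pairs, distinct query entries, distinct hit entries)."""
    kept = [h for h in hits if covers_both(h, min_cov)]
    return (len(kept),
            len({h.query_id for h in kept}),
            len({h.hit_id for h in kept}))


# Toy records: D1 matches two subfamily entries; D2 and D3 fail coverage.
hits = [
    Hit("D1", "H1", 200, 210, 190, 195),
    Hit("D1", "H2", 200, 220, 180, 185),
    Hit("D2", "H3", 300, 900, 290, 290),  # matches inside a long entry
    Hit("D3", "H1", 150, 210, 100, 205),  # covers too little of the query
]
print(summarize(hits))
```

Requiring coverage on both sides is what separates full-length family matches from short internal matches to long entries, which are the non-informative records described above.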
Summary of the comparison between the MITE Digger and MITE Hunter output
[Table labels: MITE Hunter output; MITE Digger output; Only in Hunter; Only in Digger; <= 800 bp; MITE (<= 800 bp, > 10 copies); False positive (retro, simple)]
To calculate the false negative rate, the entries of the MITE Digger false output were cross matched with the MITE Hunter output entries. As expected, a large number of the matching pairs match internal regions of MITEs in the MITE Hunter output. Only three MITE Digger false output entries were found to be real MITEs that matched MITE Hunter output entries (OS_mMutator_126, OS_mMutator_69, OS_mMutator_67); therefore, the false negative rate is 0.9% (3/332).
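The counts behind these rates can be rechecked with a few lines of Python. All input numbers are taken from the text above; the variable and function names are arbitrary, and the false negative rate is computed over the 332 output entries, as in the text.

```python
def percent(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to 0.1."""
    return round(100.0 * part / whole, 1)


total_output = 332      # MITE Digger output entries
matched_families = 287  # entries in full-length matching pairs
new_subfamilies = 13    # entries matching termini of MITE Hunter entries
annotated = 11          # unmatched entries annotated or manually classified
false_positives = 6     # simple repeats and retroelements
missed_mites = 3        # real MITEs found in the false output

classified = matched_families + new_subfamilies    # 300
unmatched = total_output - classified              # 32
unclassified = unmatched - annotated               # 21

print("false positive rate:", percent(false_positives, total_output), "%")
print("false negative rate:", percent(missed_mites, total_output), "%")
```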
MITE Digger processing of additional genome databases
[Table labels: Genome DB size (Mb); No. output entries; No. DB entries; No. false output entries]
In summary, MITE Digger retrieved exemplars of MITE families from the rice genome with high accuracy and low false positive and false negative rates. Importantly, MITE Digger is not computing resource intensive and the output requires minimal manual processing. Therefore, it can be used routinely to perform genome wide identification of MITEs in higher eukaryotic genomes.
Availability and requirements
Project Name: MITE Digger
Project homepage: http://labs.csb.utoronto.ca/yang/MITEDigger
Operating system: Windows
Programming language: Perl
Other requirements: N/A
License: by the developer
Any restrictions to use by non-academics: license needed
MITE: Miniature inverted repeat transposable element
TIR: Terminal inverted repeat
TSD: Target site duplication
The MITE Hunter output was kindly provided by Dr. Yujun Han and Dr. Susan Wessler. I would like to thank Dr. Brad Cavinder for comments on the manuscript. This work was funded by the Natural Sciences and Engineering Research Council of Canada (RGPIN371565 to G.Y.), the Canada Foundation for Innovation (24456 to G.Y.), the Ontario Research Fund (24456 to G.Y.) and the University of Toronto.
- Fattash I, Rooke R, Wong A, Hui C, Luu T, Bhardwaj P, Yang G: Miniature Inverted-Repeat Transposable Elements (MITEs): Discovery, Distribution and Activity. Genome. 2013, 10.1139/gen-2012-0174.
- Feschotte C, Zhang X, Wessler SR: Miniature Inverted-Repeat Transposable Elements (MITEs) and their relationship with established DNA transposons. Mobile DNA II. Edited by: Craig N, Craigie R, Gellert M, Lambowitz A. 2002, Washington DC: American Society of Microbiology Press, 1147-1158.
- Jiang N, Feschotte C, Zhang XY, Wessler SR: Using rice to understand the origin and amplification of Miniature Inverted Repeat Transposable Elements (MITEs). Curr Opin Plant Biol. 2004, 7 (2): 115-119. 10.1016/j.pbi.2004.01.004.
- Yang G, Zhang F, Hancock CN, Wessler SR: Transposition of the rice miniature inverted repeat transposable element mPing in Arabidopsis thaliana. Proc Natl Acad Sci USA. 2007, 104 (26): 10962-10967. 10.1073/pnas.0702080104.
- Yang GJ, Nagel DH, Feschotte C, Hancock CN, Wessler SR: Tuned for transposition: molecular determinants underlying the hyperactivity of a stowaway MITE. Science. 2009, 325 (5946): 1391-1394. 10.1126/science.1175688.
- Lu C, Chen JJ, Zhang Y, Hu Q, Su WQ, Kuang HH: Miniature Inverted-Repeat Transposable Elements (MITEs) have been accumulated through amplification bursts and play important roles in gene expression and species diversity in Oryza sativa. Mol Biol Evol. 2012, 29 (3): 1005-1017. 10.1093/molbev/msr282.
- Yaakov B, Ben-David S, Kashkush K: Genome-wide analysis of stowaway-like MITEs in wheat reveals high sequence conservation, gene association, and genomic diversification. Plant Physiol. 2013, 161 (1): 486-496. 10.1104/pp.112.204404.
- Park KC, Lee JK, Kim NH, Shin YB, Lee JH, Kim NS: Genetic variation in Oryza species detected by MITE-AFLP. Genes Genet Syst. 2003, 78 (3): 235-243. 10.1266/ggs.78.235.
- Momose M, Abe Y, Ozeki Y: Miniature inverted-repeat transposable elements of stowaway are active in potato. Genetics. 2010, 186 (1): 59-U115. 10.1534/genetics.110.117606.
- Jiang N, Bao ZR, Zhang XY, Hirochika H, Eddy SR, McCouch SR, Wessler SR: An active DNA transposon family in rice. Nature. 2003, 421 (6919): 163-167. 10.1038/nature01214.
- Yano M, Katayose Y, Ashikari M, Yamanouchi U, Monna L, Fuse T, Baba T, Yamamoto K, Umehara Y, Nagamura Y: Hd1, a major photoperiod sensitivity quantitative trait locus in rice, is closely related to the Arabidopsis flowering time gene CONSTANS. Plant Cell. 2000, 12 (12): 2473-2483.
- Yang GJ, Lee YH, Jiang YM, Shi XY, Kertbundit S, Hall TC: A two-edged role for the transposable element Kiddo in the rice ubiquitin2 promoter. Plant Cell. 2005, 17 (5): 1559-1568. 10.1105/tpc.104.030528.
- Yan YS, Zhang YM, Yang K, Sun ZX, Fu YP, Chen XY, Fang RX: Small RNAs from MITE-derived stem-loop precursors regulate abscisic acid signaling and abiotic stress responses in rice. Plant J. 2011, 65 (5): 820-828. 10.1111/j.1365-313X.2010.04467.x.
- Kuang HH, Padmanabhan C, Li F, Kamei A, Bhaskar PB, Shu OY, Jiang JM, Buell CR, Baker B: Identification of miniature inverted-repeat transposable elements (MITEs) and biogenesis of their siRNAs in the Solanaceae: new functional implications for MITEs. Genome Res. 2009, 19 (1): 42-56.
- Cantu D, Vanzetti LS, Sumner A, Dubcovsky M, Matvienko M, Distelfeld A, Michelmore RW, Dubcovsky J: Small RNAs, DNA methylation and transposable elements in wheat. BMC Genomics. 2010, 11: 408. 10.1186/1471-2164-11-408.
- Piriyapongsa J, Jordan IK: A Family of human MicroRNA genes from Miniature Inverted-Repeat Transposable Elements. PLoS One. 2007, 2 (2): e203. 10.1371/journal.pone.0000203.
- Naito K, Zhang F, Tsukiyama T, Saito H, Hancock CN, Richardson AO, Okumoto Y, Tanisaka T, Wessler SR: Unexpected consequences of a sudden and massive transposon amplification on rice gene expression. Nature. 2009, 461 (7267): 1130-U1232. 10.1038/nature08479.
- Fattash I, Bhardwaj P, Hui C, Yang G: A rice stowaway MITE for gene transfer in yeast. PLoS One. 2013, In press.
- Yang G, Weil CF, Wessler SR: A rice Tc1/mariner-like element transposes in yeast. Plant Cell. 2006, 18 (10): 2469-2478. 10.1105/tpc.106.045906.
- Wang S, Zhang LL, Meyer E, Bao ZM: Genome-wide analysis of transposable elements and tandem repeats in the compact placozoan genome. Biol Direct. 2010, 5: 18. 10.1186/1745-6150-5-18.
- Feschotte C, Swamy L, Wessler SR: Genome-wide analysis of mariner-like transposable elements in rice reveals complex relationships with stowaway miniature inverted repeat transposable elements (MITEs). Genetics. 2003, 163 (2): 747-758.
- Han MJ, Shen YH, Gao YH, Chen LY, Xiang ZH, Zhang Z: Burst expansion, distribution and diversification of MITEs in the silkworm genome. BMC Genomics. 2010, 11: 520. 10.1186/1471-2164-11-520.
- Nene V, Wortman JR, Lawson D, Haas B, Kodira C, Tu ZJ, Loftus B, Xi Z, Megy K, Grabherr M: Genome sequence of Aedes aegypti, a major arbovirus vector. Science. 2007, 316 (5832): 1718-1723. 10.1126/science.1138878.
- Bureau TE, Wessler SR: Tourist: a large family of small inverted repeat elements frequently associated with maize genes. Plant Cell. 1992, 4 (10): 1283-1294.
- Bureau TE, Wessler SR: Stowaway: a new family of inverted repeat elements associated with the genes of both monocotyledonous and dicotyledonous plants. Plant Cell. 1994, 6 (6): 907-916.
- Bergman CM, Quesneville H: Discovering and detecting transposable elements in genome sequences. Brief Bioinform. 2007, 8 (6): 382-392. 10.1093/bib/bbm048.
- Lerat E: Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity. 2010, 104 (6): 520-533. 10.1038/hdy.2009.165.
- Tu ZJ: Eight novel families of miniature inverted repeat transposable elements in the African malaria mosquito, Anopheles gambiae. Proc Natl Acad Sci USA. 2001, 98 (4): 1699-1704. 10.1073/pnas.98.4.1699.
- Chen Y, Zhou F, Li G, Xu Y: MUST: a system for identification of miniature inverted-repeat transposable elements and applications to Anabaena variabilis and Haloquadratum walsbyi. Gene. 2009, 436 (1-2): 1-7.
- Han Y, Wessler SR: MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 2010, 38 (22): e199. 10.1093/nar/gkq862.
- Rooke R, Yang GJ: TE Displayer for post-genomic analysis of transposable elements. Bioinformatics. 2011, 27 (2): 286-287. 10.1093/bioinformatics/btq639.
- Tempel S, Jurka M, Jurka J: VisualRepbase: an interface for the study of occurrences of transposable element families. BMC Bioinformatics. 2008, 9: 345. 10.1186/1471-2105-9-345.
- Tanaka T, Antonio BA, Kikuchi S, Matsumoto T, Nagamura Y, Numa H, Sakai H, Wu J, Itoh T, Sasaki T: The rice annotation project database (RAP-DB): 2008 update. Nucleic Acids Res. 2008, 36: D1028-D1033.
- Yang GJ, Hall TC: MAK, a computational tool kit for automated MITE analysis. Nucleic Acids Res. 2003, 31 (13): 3659-3665. 10.1093/nar/gkg531.
- Janicki M, Rooke R, Yang GJ: Bioinformatics and genomic analysis of transposable elements in eukaryotic genomes. Chromosome Res. 2011, 19 (6): 787-808. 10.1007/s10577-011-9230-7.
- Yang GJ, Wong A, Rooke R: ATon, abundant novel nonautonomous mobile genetic elements in yellow fever mosquito (Aedes aegypti). BMC Genomics. 2012, 13: 283. 10.1186/1471-2164-13-283.
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H: The Bioperl toolkit: perl modules for the life sciences. Genome Res. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.