Simple sequence repeats (SSRs), also named microsatellites or short tandem repeats (STRs), can be widely found in eukaryotic genomes [1]. The sequences that flank an SSR may be sufficiently conserved to allow specific amplification primers to be designed; thus, SSRs can be detected through conventional PCR amplification and typed based on the amplification products. The majority of SSRs are noncoding and thus can affect the expression, splicing, protein sequence, and genome structure of genes [2, 3]. SSRs makers are commonly used in genome-related studies [4, 5]. SSR genotyping has also become an extensive application in different fields and has been used for population genetic diversity and structure analysis, the construction of genetic maps, and the investigation of intraspecies relationships [6,7,8].
All applications of SSRs are based on accurate SSR genotyping methods, and less accuracy may have serious consequences [9, 10]. Moreover, the construction and application of DNA databases also require the accurate SSR genotyping of samples [11, 12]. Factors influencing accurate SSR genotyping include the following: 1) The slippage of polymerase is inherent to in vitro SSR polymerase PCR amplification, which leads to incorrect SSR alleles and makes it challenging to genotype SSRs accurately; and 2) the occurrence of variations in the SSR or flanking region will directly affect the genotyping results (Fig. 1) [13, 14]. These problems accompanied the SSR genotyping technology development. The technology has experienced the initial gel electrophoresis, capillary electrophoresis, the first- and second-generation sequencing, and the high-throughput amplicon sequencing stage. At present, amplicon sequencing technology is widely used in genetic disease screening and gene diagnosis, as well as in other research [15, 16]. However, there is still no suitably accurate SSR genotyping method that can tolerate nucleotide variations in SSRs and flanking regions which may affect the sequence alignment and genotyping accuracy.
Here, we developed a new open-source microsatellite genotyping strategy that includes an accurate microsatellite genotyping tool based on targeted sequencing (AMGT-TS) and a user-friendly web-based version. AMGT-TS can quickly perform precise SSR genotyping with ultradeep coverage and high locus throughput, and it includes a broad matching algorithm (BMA) that can handle situations with nucleotide variations in the SSR and flanking regions. We also performed a comprehensive assessment of AMGT-TS using internal laboratory testing and simulated data testing. The results showed that AMGT-TS could achieve nearly 100% typing accuracy. Although AMGT-TS developed on plants, which is the focus of our current work, the new method is generic and can be used as a new tool for many biological fields. All completed codes, sample data, and documentation have been submitted to GitHub.
Implementation
AMGT-TS tool design
The process of AMGT-TS has three main steps to obtain accurate genotyping information (Fig. 1). For each sample: first, reads are mapped to their bona fide loci according to the reference sequences; then, the SSR regions are determined by the loci’s flanking information; and finally, AMGT-TS obtains accurate SSR genotyping results based on the dissection of read information, such as read number and primary SSR typing.
In detail, after obtaining raw sequencing data (usually in FASTQ format), we use FASTX (http://hannonlab.cshl.edu/fastx_toolkit/) to remove low-quality data. Then, we perform the "Alignment to loci" processing step with bwa-mem [17], based on the read information for the loci in the reference sequence file. After this step, reads will be grouped to a locus. Next, Picard (https://broadinstitute.github.io/picard/) is used to group reads in the same locus together. At the same time, SAMtools [18] is used to index the data to improve the efficiency of subsequent processing. Next, we use SAMtools to “Split by direction,” separating the forward and reverse data. Then, we use SEQTK (https://github.com/lh3/seqtk) to “Adjust direction,” which flips reverse sequences into forward sequences. After that, we use the BLAST tool [19] to perform the “Find SSR region” operation according to the 20-bp sequences of the left and right flanking sequences of SSR regions in the reference sequence to obtain the SSR region of each read. Finally, we use Python scripts to “Find SSR typing” in the SSR region to obtain SSR typing information.
Here, as an example of the AMGT-TS processing workflows, two actual experimental datasets are provided for result 1 and 2 in Fig. 2. For result 1, an SSR genotyping result of AGAGA*6 for locus s4121 in B73 (a model variety of maize) is shown (Fig. 3). We used the AMGT-TS web platform (https://amgt-ts.plantdna.site:8445/) to generate the alignment figure for the reads. The platform grouped the read files to align the heap sequences in Fig. 3, which shows the results of the classification alignment. Each line is one read, and the yellow background region is the SSR region being typed. The genotype is exactly six repeats of 5 bases. For result 2, the typing result of locus s17883 is shown in Additional file 1: Table S1. An SSR length of 12 (ATA*4) was found for 98.20% of the reads, so we can obtain the result ATA (4,4) (maize is a diploid plant, so each locus has two alleles). In addition, we obtained the SSR typing result of AGG (4,4) for locus s691405. Finally, the genotype of the third locus (s838417) was a homozygous type CTC (5,5), which is a 15 bp long repeat, and the corresponding reads accounted for 98.70% of the total reads. Overall, the typing strategy of AMGT-TS is clear and satisfactory.
Evaluation of typing error
The typing error can be measured in two ways. One is the false-positive rate of SSR typing and the other is the rate of the error reads found for the correct typing result. Equation (1) can be obtained, where j represents the index of the locus, k represents the typing index, and Ramgt(j, k) represents the reads of the k-th typing of the j-th locus from AMGT-TS.
$${\text{Sum }}\;{\text{of}}\;{\text{R}}_{{{\text{amgt}}}} = \mathop \sum \limits_{{\text{j}}} \mathop \sum \limits_{{\text{k}}} {\text{R}}_{{{\text{amgt}}}} \left( {{\text{j}},{\text{k}}} \right)$$
(1)
Sra represents the sum of reads from artificial data and Er represents the error of the reads. Equation (2) can be obtained as follows:
$$E_{r} = \frac{{{\text{S}}_{{{\text{ra}}}} { } - { }\mathop \sum \nolimits_{{\text{j}}} \mathop \sum \nolimits_{{\text{k}}} {\text{R}}_{{{\text{amgt}}}} \left( {{\text{j}},{\text{k}}} \right)}}{{{\text{S}}_{{{\text{ra}}}} }}$$
(2)
In the same way, Et represents typing error, Ta represents the sum of typing of artificial data and Tamgt represents the correct typing result count from AMGT-TS. Equation (3) can be obtained as follows:
$$E_{t} = \frac{{{\text{T}}_{{\text{a}}} { } - { }\mathop \sum \nolimits_{{\text{j}}} {\text{T}}_{{{\text{amgt}}}} \left( {\text{j}} \right)}}{{{\text{T}}_{{\text{a}}} }}$$
(3)
Precise and broad matching algorithms
To archive both the high accuracy of SSR typing and tolerance of the variations in SSRs and flanking regions, we respectively developed two different algorithms, precise and broad matching strategies (Fig. 4). The analytical strategy of precise matching is divided into three steps. The first step is "grouping." For multilocus amplicon sequencing data, sequencing reads are first assigned to the corresponding loci according to the reference sequences. AMGT-TS uses bwa-mem to implement data mapping. The second step is "SSR boundary determination". After extraction of sequences for each locus, AMGT-TS uses flanking sequences of the SSR region of each locus in the reference sequences to determine the boundaries of the left and right flanking sequences, which indirectly determines the boundaries of the SSR region and further extracts the sequence of the SSR region by calling BLAST. The third step is "SSR genotyping." After the SSR sequence has been determined, the repetition number of SSRs is determined by using the precise match method of repeated sequences, and the SSR repetition length is used to name the SSR genotype. For example, the motif of a certain SSR was ATC, with a repetition number of three times, so the SSR was named SSR9.
The broad matching algorithm (BMA) has the same first step as the precise matching algorithm. However, in the second step, BMA directly processes the information of BAM files and uses the Concise Idiosyncratic Gapped Alignment Report (CIGAR) information for each read to mask the classification information, which makes it compatible with the variation within a certain error range and results in better fault-tolerant classification information. As shown in Fig. 4, the SSR motif represents the repeating unit in the SSR region. The SSR region represents the region where the SSR sequence is located. For example, when a sequence of an SSR is AGCAGCAGC, the SSR motif is AGC, and the SSR region is AGCAGCAGC. The precise match identifies only contiguous motifs, so in read 2, only the last 10 bp is identified as two replicates. Read 3 has only one motif repeat. For the broad match, the results identified were identical to those for the precise match, except for read 1, which is a perfectly repeated sequence; different results were obtained for the other two reads. For read 2, when the two red bases are considered to be two SNPs, 5 repeats of the motif are obtained. For read 3, when the region is considered to be an InDel, is the motif is considered to contain 3 repeats.
Simulation test of AMGT-TS
To better simulate different situations, each read was divided into five parts (Additional file 1: Figure S1). Five different categories of reads were considered in our simulation, named Classes A to E. The detailed method of generating these data was as follows:
-
Class A: We created the SSR_Region according to 3 repeats of the motif of s17883. We then added 35 bp from the left and right flanking sequences of the SSR_region of the reference sequence as Flank_L and Flank_R, respectively. Finally, we added one SNP on the left flank and one SNP on the right flank. There were 2000 artificial reads for this dataset.
-
Class B: We created the SSR_Region according to 5 repeats of the motif (AGCT) and 6 repeats of the motif of s423645. We then added 35 bp from the left and right flanking sequences of the SSR_region of the reference sequence as Flank_L and Flank_R, respectively. There were 1000 artificial reads for this dataset.
-
Class C: We created the SSR_Region according to 4 repeats of the motif (CGCAT) and 3 repeats of the motif (CGCAT) + CACAT + 2 repeats of motif s566749. We then added 35 bp from the left and right flanking sequences of the SSR_region of the reference sequence as Flank_L and Flank_R, respectively. There were 2000 artificial reads for this dataset.
-
Class D: In the SSR_Region of this type, Flank_L and Flank_R are also random bases. With the addition of Random_L and Random_R, the total length was randomly extended to 180 ~ 220 bp. There were 1000 artificial reads for this dataset.
-
Class E: The rule is the same as for Class A, but There were 2000 artificial reads for this dataset. Random_L and Random_R are random bases and the total length was randomly extended to 180 ~ 220 bp.
Classes A to C combined contained a total of 8000 reads. Random_L and Random_R regions were random bases, and the total length was randomly extended to 180 ~ 220 bp. Classes A to D combined contained a total of 9000 reads. The quality information from Class A to Class D was marked as the highest. The quality information of Class E was marked as the lowest. For the read numbering rule, numbering is divided into three segments. The first segment is fixed: @BMSTC (Beijing Maize Seed Testing Center), representing the artificial sequence. The second segment is category information, using 1, 2, 3, 4, and 5 to represent A, B, C, D, and E, respectively. The third is the ordinal number, starting from 1 in each category and ending with the maximum number of entries in the current category. After 10,000 reads were created, they were randomly distributed into FASTQ files.
Software and package dependencies
AMGT-TS was verified on Ubuntu Server 14.04.4 LTS and 18.04.2 LTS. AMGT-TS relies on various tools including Bamtools (v2.5.0) [20], BLAST tool suite (v2.6.0 +) [19], BWA (v0.7.17-r1188) [17], fastx_toolkit (v0.0.13), Picard (v2.15.0), SAMtools (v1.3.1) [18], and SEQTK (v1.2). The Java version used in AMGT-TS is OpenJDK1.7. The Python version is 2.7 + . Pandas are required in Python and can be installed using PIP.
AMGT-TS implementation details
AMGT-TS runs on Linux, and Ubuntu 18.04 has been tested. After downloading the code from GitHub, the user needs to install the dependent components as described in the README.md file. The ENV_FILE variable in the launch.sh specifies the location of the configuration file. In the configuration file, the user must configure the corresponding component location. The targeted sequencing sample files are placed in the “working/00_fastq” directory. Under the directory “REF_DIR” of the configuration file is the reference sequence file information for each locus. Once this information is configured, the user can execute the launch.sh file to run the tool. Before running the program, the user can specify the different algorithms: precise or broad. When the tool is finished running, a log file will be generated. In the “working/04_reads” directory, the locus typing information of the whole sample is present. In the directory of each locus is the typing information of the current loci and reads corresponding to each type. For each subheap read file, the user can use the Reads Alignment tool for a graphical presentation.