Accurately estimating the length distributions of genomic micro-satellites by tumor purity deconvolution

Background: Genomic micro-satellites are genomic regions that consist of short, repetitive DNA motifs. Estimating the length distribution and state of a micro-satellite region is an important computational step in cancer sequencing data pipelines and has been suggested to facilitate downstream analysis and clinical decision support. Although several state-of-the-art approaches have been proposed to identify micro-satellite instability (MSI) events, they are limited in dealing with regions longer than one read length. Moreover, to the best of our knowledge, all of these approaches implicitly assume that the tumor purity of the sequenced samples is sufficiently high. This assumption is inconsistent with reality: the normal-cell component dilutes the data signal in the inferred length distribution and introduces false positive errors.
Results: In this article, we propose a computational approach, named ELMSI, which detects MSI events from next generation sequencing data. ELMSI can estimate the specific length distributions and states of micro-satellite regions from a mixed tumor sample paired with a control sample. It first estimates the purity of the tumor sample based on the read counts at filtered SNV loci. The algorithm then identifies the length distributions and states of short micro-satellites by adding a Maximum Likelihood Estimation (MLE) step to the existing algorithm. After that, ELMSI infers the length distributions of long micro-satellites by combining a simplified Expectation Maximization (EM) algorithm with the central limit theorem, and then uses statistical tests to output the states of these micro-satellites. In our experiments, ELMSI was able to handle micro-satellites with lengths ranging from shorter than one read length to 10kbps.
Conclusions: To verify the reliability of our algorithm, we first compared its ability to classify shorter micro-satellites from mixed samples with that of the existing algorithm MSIsensor. We then varied the number of micro-satellite regions, the read length and the sequencing coverage to separately test the performance of ELMSI in estimating longer micro-satellites from mixed samples. ELMSI performed well on mixed samples and is therefore of value for improving the recognition of micro-satellite regions and for supporting clinical decisions. The source code is maintained at https://github.com/YixuanWang1120/ELMSI for academic use only.

Keywords: Cancer genomics, Genomic micro-satellite, Length distribution estimation, Tumor purity, Computational pipeline, Sequencing data analysis

Background
Micro-satellites are repetitive DNA sequences that consist of specific oligonucleotide units [1,2]. They exhibit intrinsic length polymorphisms, which are often described as length distributions [3]. Micro-satellite instability (MSI) refers to a pattern of hypermutation caused by defects in the mismatch repair system [4], characterized by widespread length polymorphisms of micro-satellite repeats as well as by an elevated frequency of single-nucleotide variants (SNVs) [3,5]. An MSI event occurs if the length distributions of the same micro-satellite region differ significantly between tissue samples, such as a tumor sample and a normal sample; otherwise the region is in the micro-satellite stable (MSS) state. Up to 15%-20% of sporadic colorectal cancer cases exhibit MSI events [6,7], and 12% of advanced prostate cancer cases have MSI events [8]. Recent studies have surveyed the MSI landscape across a range of cancer types [9][10][11] and imply that these regions have important clinical implications for cancer diagnostics and patient prognosis [12,13]. For example, MSI-positive colorectal tumors respond well to PD-1 blockade [14]. Given this clinical utility, the detection of MSI events has become increasingly important.
Owing to the increasing prevalence of next generation sequencing (NGS) technologies, several computational tools for MSI diagnosis from NGS data have been developed, replacing the traditional fluorescent multiplexed PCR-based methods, which are time-consuming and costly. These algorithms include MSIsensor [15], mSINGS [16], MANTIS [17], MSIseq [18], MSIpred [19], and MIRMMR [20]. To the best of our knowledge, these algorithms can be roughly divided into two categories: those based on read-count distributions and those based on mutation burden. MSIsensor is among the first algorithms for analyzing cancer sequencing data. It calculates the length distributions of each micro-satellite in paired tumor-normal sequence data and applies a statistical test to identify significant differences between the paired distributions. mSINGS works on target-gene captured sequencing data and compares, between tumor and control samples, the numbers of signals that reflect repetitive micro-satellite tracts of differing lengths. mSINGS is computationally complex and is thus only suitable for small panels. MANTIS analyzes the MSI of a normal-tumor sample pair as an aggregate of loci instead of analyzing the differences at individual loci. By pooling the scores of all loci and focusing on the average score, it reduces the impact that sequencing errors or poorly performing loci may have on the results. Meanwhile, MSIseq, MIRMMR and MSIpred use machine learning algorithms to predict MSI status. MSIseq compares the length distributions using four machine learning frameworks: logistic regression, decision tree, random forest and naive Bayes. It is a classifier that only reports MSI-H vs. non-MSI-H, without a score or percentage and without information about the instability of particular loci. MIRMMR builds a logistic regression classifier that considers both the methylation and the mutation information of the genes belonging to the MMR system.
MSIpred computes 22 features characterizing the tumor mutational load from mutation data in mutation annotation format (MAF), generated from paired tumor-normal exome sequencing data, and then uses these features in a support vector machine (SVM) to predict tumor MSI status. The classifier was trained on the MAF data of 1074 samples belonging to four types. However, none of these approaches is able to overcome the one-read-length limitation. Once a micro-satellite is longer than one read, partially mapped reads can no longer bracket it, so the algorithms cannot locally anchor the micro-satellite using paired-end reads. ELMSI is proposed to break through this one-read-length limitation.
Of note, all of these existing algorithms implicitly assume that the tumor purity of the input sequenced samples is sufficiently high, where purity refers to the proportion of tumor cells in the mixed sample, which varies widely among samples and cancer types. In practice, however, sample purity is not as high as expected. Owing to the growth pattern of tumor tissues and to clinical sampling methods, the sequenced tumor sample is actually a mixture that contains non-cancerous cells [21]. The presence of non-cancerous cells can influence the judgment of the micro-satellite state; if tumor purity is ignored, the inferred micro-satellite length distributions and states may be inaccurate. For a micro-satellite region from a mixed tumor sample, different tissues may carry different length distributions, and the observed "distribution" from the sequencing data is actually a convolution of the distribution in tumor cells with that in normal cells. If we assume at the outset that the input sample is sufficiently pure, we have in effect assumed that the mathematical model contains only one distribution, and such a model cannot fit the two actual distributions at all (see Fig. 1). Meanwhile, even if we use software to estimate the tumor purity p in advance, we cannot directly solve the deconvolution problem. To recognize the actual length distribution of the tumor micro-satellite from a given mixed sample, we must calculate the parameter values of the distributions accurately. Furthermore, the existing algorithms mainly use statistical tests to detect MSI. Even if the sample is fairly pure, the distribution they infer is based on mixed data that contain micro-satellite lengths from normal tissue; this dilutes the data signal and may mislead the statistical tests into reporting an MSS event, ultimately introducing a type-I error.
Existing tumor purity estimation algorithms, such as EMpurity [22], can accurately identify the proportions of normal cells and tumor cells in sequencing samples, which helps us to correct the length distributions according to the estimated purity.
Motivated by this, in this article we propose a novel algorithm, termed ELMSI, that offers a new approach to identifying the state and length distributions of a micro-satellite from a given mixed sample. First, we establish a more realistic hypothesis: the sequencing sample is a normal-tumor mixed sample in which the micro-satellite lengths follow two different distributions. Second, we use purity estimation algorithms to accelerate the deconvolution process that calculates the respective distribution parameters. Finally, our algorithm is suitable for detecting both short and long MSI events. To test the performance of ELMSI, a series of simulation experiments were conducted. Because mSINGS is only used for small panels and MSIseq targets sequencing of smaller regions of interest, whereas ELMSI focuses on longer micro-satellites and larger panels, these algorithms were not selected for comparison. The experimental results were compared with those of MSIsensor. They demonstrate that ELMSI can accurately identify the state of a micro-satellite and infer its length distributions from a given mixed normal-tumor sample. Our algorithm outperformed MSIsensor on multiple indicators and maintained satisfactory accuracy even when coverage decreased.

Computational pipeline
Suppose that we are given a series of mapped files in Binary Alignment/Map (BAM) format generated from a normal-tumor mixed sample. The outputs of the proposed algorithm include both the length distributions and the state of each micro-satellite. The proposed approach, ELMSI, consists of three components. The first component estimates the tumor purity of the given sequenced sample from the read counts of the filtered SNVs. Based on the estimated purity, the second component identifies the length distributions and the state of the shorter micro-satellites from the mixed sample by adding a Maximum Likelihood Estimation (MLE) step to the existing algorithm MSIsensor [15]. The third component infers the length distributions of the longer micro-satellites by combining a simplified Expectation Maximization (EM) algorithm with the central limit theorem.

A model of micro-satellite evolution that has been well recognized in recent years holds that the distribution of micro-satellite length is a balance between length mutations and point mutations [23,24]. Length mutations, whose rate increases with increasing repeat counts, favor loci attaining arbitrarily high values, whereas point mutations break long repeat arrays into smaller units. Therefore, we make the same assumption [25] that the length distribution approximates a normal distribution. We have made two assumptions on the established computational model.

Before building the model, we need to process the input data. Our initial input data are the BAM files of whole-exome sequencing (WES) data mapped to the reference genome by BWA [26]. We then define the following terms on the aligned reads.
MS-pair: Two paired reads, one of which is perfectly mapped while the other spans a breakpoint.
SB-read: A read that spans the breakpoint in an MS-pair.
PSset: The collection of two-tuples (POS, SEQ) consisting of the initial position and the sequence of each SB-read.
Sk-mer: The sequence consisting of the first k bases.

We first find all the micro-satellite candidate regions by scanning the given reference genome, recording micro-satellites with a maximum repeat unit length of 6bp and saving the location and the corresponding sequence of each site. Then, we use a clustering algorithm to find the remaining micro-satellite candidate regions that the initial scan may have missed. The algorithm clusters the SB-reads by the distances among their initial mapping positions, and the number of clusters gives the number of micro-satellite regions. We set L max as the longest micro-satellite length. Since micro-satellites are generally shorter than 50kbps [27], L max is set to 50kbps. The clustering strategy is as follows: according to the mapping results in the PSset, two SB-reads belong to the same cluster only if the distance between their initial positions is less than L max . Each cluster then represents a candidate micro-satellite region.
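The clustering rule can be illustrated with a minimal Python sketch. It is not taken from the ELMSI source code: the function name `cluster_sb_reads` is hypothetical, and the single-linkage reading of the rule (a read joins the current cluster when it starts less than L max after the previous read in sorted order) is an assumption.

```python
from typing import List

L_MAX = 50_000  # longest micro-satellite length (50 kbp), as set in the text


def cluster_sb_reads(positions: List[int], l_max: int = L_MAX) -> List[List[int]]:
    """Group SB-read start positions into candidate micro-satellite regions.

    Assumption: positions are chained in sorted order, so a read joins the
    current cluster when it starts less than l_max after the previous read.
    """
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < l_max:
            clusters[-1].append(pos)   # close enough: same candidate region
        else:
            clusters.append([pos])     # too far: open a new candidate region
    return clusters


# Hypothetical SB-read start positions taken from a PSset
starts = [1_000, 1_250, 30_000, 200_000, 205_500, 900_000]
regions = cluster_sb_reads(starts)
print(len(regions), regions)
```

The number of returned clusters plays the role of the estimated number of micro-satellite regions.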
Once the number of micro-satellites is determined, ELMSI uses a k-mer based algorithm to split each read in each candidate micro-satellite region. As the repeat units that compose micro-satellites are usually no longer than 6 bps, we set k = 6 as the default. Starting from the first base of the read sequence, the algorithm checks whether two consecutive k-mers are identical copies. If so, this k-mer is a candidate repeat unit, and its first base is a candidate breakpoint of the micro-satellite. The same operation is conducted for all reads in the micro-satellite region and in the other candidate areas, and the modes of the repeat units and breakpoints are taken as the final results.
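The k-mer scan and the mode-based consensus can be sketched as follows. This is an illustrative reading of the paragraph above, not the ELMSI implementation; `first_repeat` and `consensus` are hypothetical names, and each read is assumed to be given as a (mapping start, sequence) pair.

```python
from collections import Counter
from typing import List, Optional, Tuple


def first_repeat(read: str, k: int = 6) -> Optional[Tuple[int, str]]:
    """Return (offset, k-mer) of the first k-mer directly followed by an identical copy."""
    for i in range(len(read) - 2 * k + 1):
        if read[i:i + k] == read[i + k:i + 2 * k]:
            return i, read[i:i + k]
    return None


def consensus(reads: List[Tuple[int, str]], k: int = 6) -> Optional[Tuple[int, str]]:
    """Take the mode of candidate breakpoints and repeat units over all reads."""
    breakpoints: Counter = Counter()
    units: Counter = Counter()
    for start, seq in reads:
        hit = first_repeat(seq, k)
        if hit is not None:
            offset, unit = hit
            breakpoints[start + offset] += 1  # reference coordinate of the candidate breakpoint
            units[unit] += 1
    if not units:
        return None
    return breakpoints.most_common(1)[0][0], units.most_common(1)[0][0]


# Three hypothetical reads covering the same breakpoint at reference position 105
reads = [
    (100, "GGTCG" + "ACGTTA" * 3),
    (102, "TCG" + "ACGTTA" * 3),
    (98, "TTGGTCG" + "ACGTTA" * 2),
]
print(consensus(reads))
```

Note that a k-mer found this way may contain several copies of a shorter unit (for example a 2-bp unit repeated three times when k = 6); reducing it to the minimal period is left out of this sketch.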

Estimating the tumor purity of the sample
First, we introduce a tumor purity estimation algorithm. Owing to the limitations of current sequencing technologies, the purity problem is almost inevitable during actual sampling, and many algorithms have been proposed to address it. Among them, EMpurity [22] establishes a probabilistic model to accurately estimate the proportion of tumor cells in the mixed sample. The observed indicators are the numbers of reads supporting the reference allele and the mutation at each site, respectively, while the unknown hidden states include the tumor purity and the joint genotype. EMpurity describes the emission probabilities from the hidden states to the observed indicators and the transition probabilities among the hidden states. This model is solved by an Expectation Maximization algorithm.
EMpurity uses paired-sample DNA sequencing data as the model input and only considers heterozygous sites with somatic mutations. For one sample in the pair, the set of possible genotype values at each locus is $G = \{AA, AB, BB\}$. Let $N$, $T$ and $T_M$ represent the normal sample, the virtual pure tumor sample and the mixed tumor sample, respectively. Here, the virtual pure tumor sample $T$ is actually part of $T_M$. Then, for the paired samples, the set of possible joint genotype values is the Cartesian product

$$G \times G = \{(g_N, g_T) : g_N \in G,\ g_T \in G\}.$$

For any site $i$, let $n^{i}_{N\_ref}$ and $n^{i}_{T_M\_ref}$ denote the number of reads supporting the reference allele in the normal sample and in the mixed tumor sample, respectively, each of which follows a binomial distribution with parameters $\mu_N$ and $\mu_{T_M}$. There are only 9 possible joint genotypes, which follow a multinomial distribution with parameter $\mu_G$. Considering the bias in read depth, we assume that the tumor purity follows a normal distribution across all of the given sites, with parameters $\mu_p$ and $\lambda_p$. The model is solved by an Expectation Maximization algorithm that maximizes the established likelihood function; the specific EM iterative process is described in EMpurity [22].
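To give some intuition for how read counts at SNV loci carry purity information, the following Python sketch uses a much simpler estimator than EMpurity. It is not the EMpurity model: it assumes that every selected site is a clonal heterozygous somatic SNV in a diploid region (normal genotype AA, tumor genotype AB), so that the expected variant allele fraction in the mixed sample is p/2. The function name `naive_purity` is hypothetical.

```python
from typing import List, Tuple


def naive_purity(sites: List[Tuple[int, int]]) -> float:
    """Pooled purity estimate from (ref_reads, alt_reads) counts in the mixed tumor sample.

    Assumption: clonal heterozygous somatic SNVs in diploid regions, so the
    expected variant allele fraction is p / 2 and the estimate is 2 * VAF.
    """
    ref_total = sum(ref for ref, _ in sites)
    alt_total = sum(alt for _, alt in sites)
    depth = ref_total + alt_total
    if depth == 0:
        raise ValueError("no reads at the selected sites")
    return min(1.0, 2.0 * alt_total / depth)  # clip, since purity cannot exceed 1


# Hypothetical read counts at three filtered SNV loci
print(naive_purity([(70, 30), (65, 35), (75, 25)]))
```

A probabilistic model such as EMpurity additionally treats the joint genotype as a hidden state instead of fixing it in advance, which this sketch does not attempt.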

Estimating the length distribution parameters of the short micro-satellite
For the shorter micro-satellites (shorter than one read length), existing algorithms such as MSIsensor [15] can accurately calculate the specific lengths and estimate the state. However, when the sequenced sample is a normal-tumor mixture, the calculated micro-satellite lengths actually contain both normal and tumor micro-satellite lengths, and a state estimated directly from them is inaccurate. Consider a mixed sample with known proportions (normal cells account for $1-p$, tumor cells account for $p$) and a micro-satellite region belonging to this sample. MSIsensor can detect this micro-satellite region and return a set of lengths $L = \{l_1, l_2, \ldots, l_N\}$. $L$ is in fact a set of lengths sampled randomly from two populations that are independent of each other and follow two different normal distributions. According to the law of large numbers, each value in $L$ is the length of a micro-satellite from normal cells with probability $1-p$ and from tumor cells with probability $p$. Given a micro-satellite region, we assume that its length follows a normal distribution $N_1(\mu_1, \sigma_1^2)$ when it belongs to normal cells and a normal distribution $N_2(\mu_2, \sigma_2^2)$ when it belongs to tumor cells. Therefore, the length of this micro-satellite in the mixed sample follows a probability distribution with density function $f = (1-p) f_1 + p f_2$, where $f_1$ and $f_2$ are the density functions of $N_1$ and $N_2$, respectively, and $L$ is the set of lengths drawn independently from this mixed sample. The values of $\mu_1$ and $\sigma_1$ can be obtained by separately analyzing normal samples (such as blood samples). Under these known conditions, we can use a Maximum Likelihood Estimation (MLE) step to estimate the values of $\mu_2$ and $\sigma_2$.
From the above, the likelihood function is the joint probability density function of the lengths:

$$\mathcal{L}(\mu_2, \sigma_2) = \prod_{j=1}^{N} f(l_j) = \prod_{j=1}^{N} \left[ (1-p)\, f_1(l_j) + p\, f_2(l_j) \right].$$

The likelihood function reflects the probability of generating the length values in $L$. The parameter values that maximize this probability are the estimates we need to calculate:

$$(\hat{\mu}_2, \hat{\sigma}_2) = \arg\max_{\mu_2, \sigma_2} \mathcal{L}(\mu_2, \sigma_2).$$

In this way, the estimated values $\hat{\mu}_2$ and $\hat{\sigma}_2$ can be obtained. Thus, the length distributions of shorter micro-satellites from a given mixed sample can be recognized, and we then perform a z-test to assess the micro-satellite state.
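One way to maximize this likelihood numerically is sketched below in Python. The optimizer is a swapped-in choice, since the text does not say how the maximization is carried out: the sketch runs EM-style updates on the tumor component only, keeping the purity p and the normal parameters fixed, which increases the same mixture likelihood at every iteration. The function names and the simulated data are hypothetical.

```python
import math
import random
from typing import List, Tuple


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_tumor_component(lengths: List[float], p: float, mu1: float,
                        sigma1: float, iters: int = 100) -> Tuple[float, float]:
    """Estimate (mu2, sigma2) of the tumor component with p, mu1, sigma1 known."""
    n = len(lengths)
    mu2 = sum(lengths) / n  # start from the pooled mean and spread
    sigma2 = math.sqrt(sum((l - mu2) ** 2 for l in lengths) / n) or 1.0
    for _ in range(iters):
        # E-step: probability that each observed length comes from tumor cells
        resp = []
        for l in lengths:
            tumor = p * normal_pdf(l, mu2, sigma2)
            normal = (1.0 - p) * normal_pdf(l, mu1, sigma1)
            resp.append(tumor / (tumor + normal) if tumor + normal > 0 else 0.5)
        total = sum(resp)
        if total == 0:
            break
        # M-step: weighted mean and standard deviation of the tumor component
        mu2 = sum(r * l for r, l in zip(resp, lengths)) / total
        var2 = sum(r * (l - mu2) ** 2 for r, l in zip(resp, lengths)) / total
        sigma2 = max(math.sqrt(var2), 1e-6)
    return mu2, sigma2


# Simulated mixed sample: normal ~ N(20, 1), tumor ~ N(26, 1.5^2), purity 0.4
random.seed(0)
data = [random.gauss(26.0, 1.5) if random.random() < 0.4 else random.gauss(20.0, 1.0)
        for _ in range(1000)]
print(fit_tumor_component(data, p=0.4, mu1=20.0, sigma1=1.0))
```

Any general-purpose numerical optimizer applied to the log-likelihood would serve the same purpose; the fixed-weight updates are used here only because they need no external library.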

Estimating the length distribution parameters of the long micro-satellite
On the other hand, reads cannot span the longer micro-satellites, so we cannot pinpoint their specific lengths. We therefore use the length distribution to characterize them. Given a mixed sample of normal and tumor cells, we denote the proportion of tumor cells by $p$ to facilitate the computation. In this paper, we only consider the following two scenarios (see Fig. 2). As before, the micro-satellite lengths in the normal cells (proportion $1-p$) follow a normal distribution $N_1(\mu_1, \sigma_1^2)$, while the micro-satellite lengths in the pure tumor cells (proportion $p$) follow another normal distribution $N_2(\mu_2, \sigma_2^2)$. The parameters of $N_1$ can be estimated by analyzing normal tissue cells alone. According to the central limit theorem, the average of the samples is roughly equal to the average of the population. Whatever the distribution of the population (with mean $\mu$ and variance $\sigma^2$), when the number of sampling rounds is large enough (> 30), the means of the samples (of size $n$) drawn from it cluster around the population mean and are normally distributed with mean $\mu$ and variance $\sigma^2/n$. Because the specific lengths of longer micro-satellites cannot be assessed with existing technology, we use the distribution of their mean length to reflect the overall length distribution. Our approach supposes that the length of a micro-satellite is normally distributed. ELMSI therefore adopts a continuous estimation strategy, whose basic goal is to estimate the average micro-satellite length from the coverage of the specified area containing the micro-satellite, and then to use the updated average length to re-estimate the coverage of this area in turn. This loop is repeated until the average micro-satellite length no longer changes significantly.

Fig. 2 The patterns of sequencing reads from a micro-satellite region sampled from a mixed sample. a Short MS region. b Long MS region
Therefore, we can use at least 30 groups of sampled average lengths to assess the overall distribution of a long micro-satellite. The length of the hybrid longer micro-satellite in this mixed sample follows a normal distribution with $\mu = (1-p)\mu_1 + p\mu_2$ and $\sigma^2 = (1-p)\sigma_1^2 + p\sigma_2^2$. According to the central limit theorem, the mean $\mu$ of the sampled average lengths can be obtained to reflect the overall length distribution. However, under the technical restrictions, we can only use the estimated $\sigma^2$ to represent the overall variance, because the sample size cannot be counted. By substituting these values into the above formulas, the length distribution parameters $\mu_2$ and $\sigma_2$ of the micro-satellites in the pure tumor sample can be calculated.

The specific EM process is as follows. Let WIN−bk be the window on the reference with the breakpoint of a micro-satellite as its midpoint. The default length of WIN−bk is 5000bps. The read pairs can then be divided into the following categories.
C-pair: paired reads both perfectly mapped to WIN−bk.
T-pair: paired reads both perfectly mapped to the micro-satellite region.
O-pair: paired reads with one read mapped to WIN−bk and the other mapped to the micro-satellite region.
SO-pair: paired reads with one read mapped to the micro-satellite region and the other spanning a breakpoint.
S-pair: paired reads with one read mapped to WIN−bk and the other spanning a breakpoint.
S-read: a read that spans a breakpoint in any SO-pair or S-pair.
Figure 3 is a graphical representation of these definitions.
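The substitution step for the mixture mean and variance stated earlier in this subsection amounts to solving two linear relations for the tumor parameters. A minimal Python sketch follows, with the hypothetical function name `tumor_params` and illustrative numbers that do not come from the paper.

```python
from typing import Tuple


def tumor_params(mu: float, var: float, mu1: float, var1: float,
                 p: float) -> Tuple[float, float]:
    """Solve mu = (1-p)*mu1 + p*mu2 and var = (1-p)*var1 + p*var2 for (mu2, var2)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("purity p must lie in (0, 1]")
    mu2 = (mu - (1.0 - p) * mu1) / p
    var2 = (var - (1.0 - p) * var1) / p
    if var2 < 0.0:
        raise ValueError("inputs imply a negative tumor variance")
    return mu2, var2


# Illustrative values: mixed sample N(1500, 250), normal sample N(1000, 100), purity 0.5
print(tumor_params(mu=1500.0, var=250.0, mu1=1000.0, var1=100.0, p=0.5))  # (2000.0, 400.0)
```

Because both relations are divided by p, small errors in the mixed-sample estimates are amplified at low purity, which is one reason the estimate of the tumor variance is more fragile than that of the mean.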
The breakpoints and the repeat units of these micro-satellites can be identified by the aforementioned data preprocessing. We set a window WIN−bk with the breakpoint as its midpoint; the initial length of WIN−bk is 5000 bps. From the aligned reads corresponding to WIN−bk, we can obtain the coverage of the reference in WIN−bk using the following formulas:

$$SUM_{bp} = NUM_{read} \times L_{read}, \qquad C = \frac{SUM_{bp}}{L},$$

where $SUM_{bp}$ represents the total number of bases in WIN−bk, $NUM_{read}$ represents the total number of reads in the target area, $L_{read}$ represents the read length, $C$ represents the coverage of the target area, and $L$ represents the length of the target area. When the WIN−bk length is fixed, $SUM_{bp}$ is a constant. Thus, the lengths of micro-satellites do not affect $SUM_{bp}$, but do influence the coverage $C$. We can therefore calculate the normal distribution parameters of the micro-satellite lengths through the following nine steps.
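As a small worked example of the coverage calculation, the Python sketch below assumes the usual relations that match the symbol definitions above: the total number of bases equals the number of reads times the read length, and the coverage equals the total number of bases divided by the length of the target area. The function name `window_coverage` and the numbers are hypothetical.

```python
def window_coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Coverage C = SUM_bp / L, with SUM_bp = NUM_read * L_read (assumed relations)."""
    if target_length <= 0:
        raise ValueError("target length must be positive")
    sum_bp = num_reads * read_length  # total number of sequenced bases in the window
    return sum_bp / target_length


# 2500 reads of 200 bp aligned to the default 5000-bp window give 100x coverage
print(window_coverage(num_reads=2500, read_length=200, target_length=5000))  # 100.0
```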

Variable initialization:
Let m be the total number of micro-satellites, i be the index of the current micro-satellite, S be the number of sampling rounds, WIN−bk be the window on the reference with the micro-satellite's breakpoint as its midpoint, L Win be the length of WIN−bk, L aln be the total number of bases belonging to the micro-satellite region in all S-reads, and L set be the set of micro-satellite lengths.
Step 1-1: Initialize the number of micro-satellites, the repeat units and the breakpoints by the data preprocessing;

The obtained micro-satellite length is incorporated into a set, $L_{set} = L_{set} \cup \{L\}$.
6. In order to assess the normal distribution parameters of a given micro-satellite sequence, we sample at least 30 times by changing the size of L Win . If S < 30, set S = S + 1 and L Win = L Win + 1000, then return to Step 1.
7. The micro-satellite lengths obtained from these 30 groups of sampling experiments are tested for normality using a normality test and the Shapiro-Wilk test. Output the normal distribution parameters of the micro-satellite, $N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of the lengths.
8. If i < m, set i = i + 1 and go to Step 1.
9. An independent z-test is used to compare the state of the micro-satellite between tumor cells and normal cells. If the p-value < 0.05, the identified micro-satellite is an MSI event; otherwise it is an MSS event.
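The final decision step can be illustrated with a two-sample z-test in Python. The exact test statistic used by ELMSI is not given in the text, so this sketch assumes the standard two-sided two-sample form; the sample sizes (30 sampling groups per distribution) and the function names are likewise assumptions.

```python
import math
from typing import Tuple


def two_sample_z_test(mu_t: float, var_t: float, n_t: int,
                      mu_n: float, var_n: float, n_n: int) -> Tuple[float, float]:
    """Two-sided z-test comparing tumor and normal mean micro-satellite lengths."""
    se = math.sqrt(var_t / n_t + var_n / n_n)
    z = (mu_t - mu_n) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def classify_state(p_value: float, alpha: float = 0.05) -> str:
    """MSI if the length distributions differ significantly, MSS otherwise."""
    return "MSI" if p_value < alpha else "MSS"


# Illustrative parameters: tumor N(2000, 400) vs. normal N(1000, 100), 30 groups each
z, p_value = two_sample_z_test(2000.0, 400.0, 30, 1000.0, 100.0, 30)
print(classify_state(p_value))
```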

Results and discussion
To test the performance of ELMSI, we first tested its ability to classify micro-satellite states and compared two major indicators, the precision rate and the recall rate, with those yielded by MSIsensor [15]. We also conducted experiments on a series of simulated datasets with different configurations, in which we varied the number of micro-satellites, the coverage, and the read length. In these simulation experiments, the following key indicators were calculated to evaluate ELMSI: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). In addition, five popular indicators were calculated: accuracy, recall, precision, MCC and Gain.
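For reference, four of these indicators can be computed from the confusion counts using their standard definitions, as in the Python sketch below. Gain is left out because its definition is not given in the text. The function name and the example counts are hypothetical, and returning 0 for an undefined ratio is a convention chosen for the sketch.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, recall, precision and Matthews correlation coefficient (MCC)."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # convention: undefined ratios become 0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical confusion counts for 50 simulated micro-satellites
print(classification_metrics(tp=25, fp=2, tn=20, fn=3))
```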

Simulation dataset generation
To generate the simulation datasets, we first randomly selected a 10Mbps region on human chromosome 19.
To design a complex situation, we randomly chose the micro-satellite length, the repeat unit, and the breakpoint. As mentioned above, the micro-satellite length in a given individual is normally distributed. We divided the normal distribution $N(\mu, \sigma^2)$ into seven parts, $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$ and $\mu + 3\sigma$. The number of micro-satellites planted into the reference for each part was obtained by multiplying the coverage by the corresponding probability, 1%, 6%, 24%, 38%, 24%, 6%, and 1%, respectively. Once the micro-satellites of each part were planted, we merged the seven read files. All of the simulated reads were then mapped to the reference sequence, and the alignment file was provided to the variant calling tools.
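The seven-part discretization can be sketched in Python as follows. The function name `plan_bins` and the example values of the mean, standard deviation and coverage are hypothetical; only the seven offsets and their probabilities come from the text.

```python
from typing import List, Tuple

OFFSETS = (-3, -2, -1, 0, 1, 2, 3)                       # multiples of sigma
FRACTIONS = (0.01, 0.06, 0.24, 0.38, 0.24, 0.06, 0.01)   # probabilities from the text


def plan_bins(mu: float, sigma: float, coverage: float) -> List[Tuple[float, float]]:
    """Return (micro-satellite length, coverage share) for each of the seven parts."""
    return [(mu + k * sigma, coverage * f) for k, f in zip(OFFSETS, FRACTIONS)]


# Hypothetical micro-satellite with mean length 2000 bp, sd 100 bp, at 100x coverage
for length, share in plan_bins(mu=2000.0, sigma=100.0, coverage=100.0):
    print(f"length {length:.0f} bp: {share:.0f}x")
```

The seven probabilities sum to 100%, so the merged read files together reproduce the requested total coverage.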

Micro-satellites state classification and comparison experiment
In this part, we first tested the accuracy of ELMSI in classifying the micro-satellite state from mixed samples. The z-test was used to determine whether a micro-satellite is an MSI event.
For the shorter micro-satellites, we compared our algorithm with the existing approach MSIsensor. Among the existing micro-satellite state classification algorithms, mSINGS is suitable for small panels and has been reported to be used only for limited exome data, and MSIseq only targets sequencing of smaller regions, so a comparison with these algorithms would not be informative. MSIsensor can accurately identify the state and lengths of micro-satellites when they are shorter than one read length. We therefore chose MSIsensor for the comparison experiment. The number of micro-satellites was set to 30, the coverage to 100× and the read length to 200bps. The tumor purity was set to 0.9, 0.7, 0.5, 0.3 and 0.1, respectively. Micro-satellite states were then identified by the two classification tools, MSIsensor and ELMSI. The results are shown in Table 1.
As can be seen, ELMSI performs better in classifying the state of hybrid micro-satellites. When the tumor purity of the input sequenced sample is below a certain ratio, the MSS signal from the normal cells dilutes the MSI signal, causing MSIsensor to report an MSS event. Thus, when the input tumor sample is a mixture with high normal cell contamination, MSIsensor cannot distinguish MSI accurately. ELMSI, however, can perform the classification even when the tumor purity is as low as 10%.
On the other hand, for the longer micro-satellites, the paired reads used to locate the candidate micro-satellite region are invalid, and none of the existing approaches is able to overcome the one-read-length limitation. We therefore proposed ELMSI, which can identify the longer hybrid micro-satellites and classify their state. Next, we tested its classification accuracy. The number of micro-satellites was set to 30, the coverage to 100× and the read length to 200bps. The tumor purity was set to 0.9, 0.7, 0.5, 0.3 and 0.1, respectively. The detailed results are shown in Table 2.
As shown in Table 2, a decreasing tumor ratio can influence the accuracy of ELMSI. However, even at a purity as low as 10%, the results indicate that ELMSI can provide a reliable MSI classification.

Estimating the distribution of micro-satellite lengths
To separately verify the validity of ELMSI in estimating the length distributions of the longer micro-satellites, we ignored the influence of tumor purity and tested the performance of ELMSI by changing the micro-satellite number, the coverage, and the read length. A correct call is defined as follows: a micro-satellite is identified with the correct repeat unit, the detected breakpoint lies within (b − 10bps, b + 10bps), where b is the planted breakpoint, and the actual micro-satellite length lies within (μ − 3σ, μ + 3σ), where μ and σ are the estimated parameter values.
We first changed the number of micro-satellites from 20 to 100. To better reflect the influence of the micro-satellite number on ELMSI, we also varied the coverage over 30×, 60×, 100× and 120×. The read length was set to 100bp in this group of experiments. For each micro-satellite number, we repeated the test five times using the same setting and report the average results, which are summarized in Table 3.
An increasing micro-satellite number can influence the robustness of ELMSI. In practice, since micro-satellites are very rare, few micro-satellites exist in a given 10Mbps chromosomal region. Even so, we deliberately increased this density to test ELMSI. Table 3 shows that ELMSI can identify micro-satellites accurately and exclude interference from non-micro-satellite sequences. The results also show that ELMSI offers high reliability.
Sequencing coverage affects somatic mutation calling, which in turn would presumably affect the performance of ELMSI. Table 4 shows how changes in coverage affect the key indicators. In this group of experiments, we set the number of micro-satellites to 20, 40, or 60, and the read length to 100 bps.
The lower the coverage, the greater the difficulty faced by this computational approach. Consistent with this, Table 4 indicates that the performance of ELMSI increases as coverage increases, with a maximal recall rate of more than 80%. Thus, the higher the coverage, the higher the accuracy of ELMSI in inferring micro-satellites.
ELMSI also remains valid when the read length is altered. The number of micro-satellites was set to 20 or 50, the coverage to 30×, 60×, 100×, or 120×, and the read length to 100bps, 150bps, 200bps, 250bps and 300bps. The results are shown in Table 5.
The main weakness of this method is the large amount of splicing required. The longer the read length, the smaller the splicing workload and the fewer the errors introduced by splicing. We thus predicted that ELMSI performance would improve as read length increases. Table 5 supports this hypothesis and shows that the longer the read length, the more accurate the estimation result.

Conclusion
In this article, we focus on the computational problem of inferring the length distributions and states of all kinds of micro-satellites in tumors with normal cell contamination. Existing approaches, such as MSIsensor, mSINGS, MANTIS and MSIseq, perform well in handling genomic micro-satellite events shorter than one read length, but often suffer a significant loss of accuracy when the micro-satellite becomes longer. Meanwhile, all of these MSI detection algorithms assume, before establishing a mathematical model, that the input sample is a pure tumor sample, which is difficult to achieve with existing sequencing technology. We have therefore proposed an algorithm that breaks these limitations and handles micro-satellites with a wide range of lengths from a mixed normal-tumor sample based on NGS data. Our proposed algorithm, termed ELMSI, computes directly on the aligned reads. ELMSI can recognize the length distributions and states of micro-satellites with a wide range of lengths from mixed sequenced samples. For short micro-satellites, it can identify the lengths accurately, while for long micro-satellites, it can estimate the normal distribution parameters. ELMSI is among the first approaches to recognize and identify long micro-satellites. However, owing to the nature of sequencing data and the limitation of computing capacity, the estimated mean μ is relatively accurate, while the estimated variance σ² deviates to some extent. Thus, for longer MSI detection, our algorithm mainly uses the independent z-test. When the sample size can be calculated during the iteration process, the variance of the longer micro-satellite can be estimated more accurately, and in that case we recommend using the independent t-test to infer the MSI state. The performance of ELMSI was compared with that of MSIsensor, and ELMSI is superior in classifying hybrid shorter micro-satellites.
For the mixed longer samples, ELMSI also obtains satisfactory results. The simulation experiments demonstrate that ELMSI is robust, with good performance in response to variations in coverage, read length, and the number of micro-satellites. It will be useful for micro-satellite screening, and we anticipate wider usage in clinical cancer sequencing.