GenHtr: a tool for comparative assessment of genetic heterogeneity in microbial genomes generated by massive short-read sequencing

Background Microevolution is the study of short-term changes of alleles within a population and their effects on the phenotype of organisms. The result of the below-species-level evolution is heterogeneity, where populations consist of subpopulations with a large number of structural variations. Heterogeneity analysis is thus essential to our understanding of how selective and neutral forces shape bacterial populations over a short period of time. The Solexa Genome Analyzer, a next-generation sequencing platform, allows millions of short sequencing reads to be obtained with great accuracy, allowing for the ability to study the dynamics of the bacterial population at the whole genome level. The tool referred to as GenHtr was developed for genome-wide heterogeneity analysis. Results For particular bacterial strains, GenHtr relies on a set of Solexa short reads on given bacteria pathogens and their isogenic reference genome to identify heterogeneity sites, the chromosomal positions with multiple variants of genes in the bacterial population, and variations that occur in large gene families. GenHtr accomplishes this by building and comparatively analyzing genome-wide heterogeneity genotypes for both the newly sequenced genomes (using massive short-read sequencing) and their isogenic reference (using simulated data). As proof of the concept, this approach was applied to SRX007711, the Solexa sequencing data for a newly sequenced Staphylococcus aureus subsp. USA300 cell line, and demonstrated that it could predict such multiple variants. They include multiple variants of genes critical in pathogenesis, e.g. genes encoding a LysR family transcriptional regulator, 23 S ribosomal RNA, and DNA mismatch repair protein MutS. The heterogeneity results in non-synonymous and nonsense mutations, leading to truncated proteins for both LysR and MutS. Conclusion GenHtr was developed for genome-wide heterogeneity analysis. Although it is much more time-consuming when compared to Maq, a popular tool for SNP analysis, GenHtr is able to predict potential multiple variants that pre-exist in the bacterial population as well as SNPs that occur in the highly duplicated gene families. It is expected that, with the proper experimental design, this analysis can improve our understanding of the molecular mechanism underlying the dynamics and the evolution of drug-resistant bacterial pathogens.


Background
Microevolution is defined as any evolutionary changes below the species level. It is the study of short-term changes within a population or a species of its alleles (alternative genes) and their effects on the phenotype of organisms that make up that population. The result of the below-species-level evolution is heterogeneity, where populations are made up of subpopulations with a large number of structural variations. Heterogeneity analysis is, therefore, essential to our understanding of how the selective and neutral forces shape bacterial populations over a short period of time [1][2][3][4].
In S. aureus, the role of microevolution is especially significant, in which heterogeneity is hypothesized to be the basic molecular mechanism for drug resistance [5]. Indeed, pre-existing drug-resistant mutants were often detected, from which drug-resistant mutants emerged during drug therapy or on the drug-containing growth medium. S. aureus strain of PC-1, PC-2 and PC-3 are great examples [6]. They were isolated at different stages of vancomycin therapy. The first two were recovered from a patient in the early stages of vancomycin therapy, whereas the third was isolated one day before the patient's death. The minimal inhibitory concentration of vancomycin is 2 μg per milliliter for PC-1 and PC-2, while PC-3 is capable of growing on agar plates containing vancomycin at a concentration of 8 μg per milliliter at a frequency of approximately 10 -3 . PC-3* is a colony picked from PC-3 strains capable of growing on agar containing 8 μg of vancomycin per milliliter. The new isolate can even grow on 16 μg of vancomycin per milliliter at a frequency of approximately 10 -5 , a strong indicator for the existence of subpopulations. Similarly, mutants with higher vancomycin resistance were also picked from MM66 plated on vancomycin containing agar with a minimum inhibitory concentration [7,8].
The heterogeneity is not unique to S. aureus. It is found in the population of Mycobacterium tuberculosis and many other bacterial pathogens, where the pathogen can gain drug resistance under the selective pressure of drug therapy. Post et al identified drug-resistant subpopulations within TB patients [9], in which four of the 13 patients acquired additional drug-resistance mutations during the course of treatment [9]. On the other hand, the bacterial pathogen could also lose its pathogenicity when selective pressure is absent, leading to the in vitro accumulation of attenuated mutants. A systematic analysis of individual clones isolated from subcultured M. tuberculosis H37Rv and a non-subcultured frozen stock detected a mixed population in H37Rv containing wild-type cells as well as neutral red and Phthiocerol Dimycocerosate (PDIM) mutants [10]. Microarray analysis confirmed a significant heterogeneity [10]. Heterogeneity was also found in natural bacterial populations of Escherichia coli [11], Salmonella enterica [12], Neisseria meningitidis [13], Haemophilus influenzae [14], Helicobacter pylori [15], Streptococcus pneumoniae [16] and Pseudomonas aeruginosa [17]. From these bacterial pathogens, hypermutation and the pre-existence of drug-resistant subpopulations were detected. It is thus logical to conjecture that the heterogeneity in bacterial populations is fundamental to the development of drugresistant strains in these and possibly other bacterial species. Analysis tools that possess the ability to study heterogeneous bacterial populations and the dynamic of genetic changes in the populations would greatly improve our understanding of the molecular mechanisms of microevolution.
Molecular epidemiological analysis tools play critical roles in our current understanding of bacterial microevolution. The tools include IS6110-based restriction fragment length polymorphisms (RFLP); spoligotyping and the mycobacterial interspersed repetitive-unit-variable-number tandem-repeat (MIRU-VNTR) typing in the study of M. tuberculosis populations; multilocus sequence typing (MLST); amplified fragment length polymorphism analysis (AFLP); double-locus sequence typing (DLST); and spa typing in the analysis of S. aureus population. All of these techniques, however, have severe limitations. First, these tools focus on small sets of genome components, such as DNA sequences of internal fragments of multiple (usually seven) housekeeping genes in MLST [18], single-strand sequencing of partial repeat sequences of genes clfB and sp in DLST [19], and DNA sequence analysis of variable repeat regions of the protein A gene in spa typing [20]. Consequently, either the genetic heterogeneity could not be detected at all or its frequency is seriously underestimated [21][22][23].
The Solexa Genome Analyzer is a sequencing system powered by the next-generation Solexa sequencing technology [24]. It is based on massively parallel, shotgun, clonal sequencing-by-synthesis and is characterized by high throughput and precision in base calls. With this platform, millions of short reads can be obtained with an accuracy of up to 99% [25], allowing massively parallel picoliter-scale amplification and sequence determination of individual DNA molecules [26]. For example, a mixture of three HIV-1 envelope variants pooled in proportions of 89%, 10%, and 1% can be accurately detected [27]. Recently, Wang et al. detected an average of 58 variants per clinical HIV plasma samples using this technology compared to an average of eight variants per sample using conventional direct-PCR dideoxynucleotide sequencing [28]. It can be hypothesized that, with these technologies, bacterial genomes in a heterogeneous population can be adequately sequenced so that population dynamics in respect to genetic heterogeneity can be studied at the whole genome level.
Heterogeneity analyses are straightforward due to the small size of the viral genome and the rarity of sequence duplications in the genome. However, the heterogeneity analysis is expected to face a significant challenge in bacterial genomes. Sequence duplications, insertions and deletions are ubiquitous features of bacterial genomes, which resulted in gene families and super-families with many paralogs [29][30][31][32]. The resulting complexity would make it especially difficult to determine whether particular heterogeneity sites are due to mutations that occur between the paralogs or spontaneous mutations that generate heterogeneous bacterial populations.
A few computational tools have been developed for genome-wide variant analysis, including BFAST [33], RMAP [34] and Maq [35]. This software facilitates the fast and accurate mapping of short reads to detect sequencing errors, single nucleotide polymorphisms (SNPs) and indels with well-defined statistics. For example, Maq calls the consensus based on a statistical model that maximizes the posterior probability [35]. All these tools, however, have limited or no capability for genetic heterogeneity analysis. Even though heterozygotes are called in the Maq analysis process, all analyses are local in nature, meaning that the process narrows down particular genome positions without considering global genome contexts. As a consequence, it is difficult to determine whether the heterogeneity is due to intragenome variation or to the heterogeneous population. In addition, like many other tools, Maq has limited capability to reveal mutations that occur in repetitive DNA. As a matter of fact, SNPs lying in repetition were intentionally excluded from analysis [36].
Here, a new tool referred to as Gen Htr was established. Gen Htr is unique in that it could detect variants within the context of whole genomes. Specifically, the software first establishes the genotype in genetic heterogeneity at all chromosomal positions for both newly sequenced bacterial strains (based on the Solexa reads) and their isogenic reference genomes (based on the simulated data). Comparative genotype analysis allows for the identification of heterogeneity sites where the chromosomal positions are heterogeneous in newly sequenced genomes but homogenous in the isogenic reference genome. Together with Maq, the software then provides clues to determine whether the multiple variants are due to intra-genome duplications, sequencing artifacts or spontaneous mutations, thus helping to prioritize the variants for experimental validation. In addition, Gene-Wise [37,38] was integrated so that synonymous/nonsynonymous mutations can be analyzed as well.
To prove this concept, I applied this approach to SRX007711, the Solexa sequencing data downloaded from NCBI Sequence Read Archive (SRA) for a newly sequenced S. aureus subsp. USA300 cell line. The data set was chosen as a user model organism for the following reasons: First, completely sequenced genomes of two USA300 strains are available in the NCBI [Genome Assembly/Annotation Projects]; second, S. aureus is one of the leading causes of infectious disease mortality; and third, heterogeneity has been implicated in the process of drug-resistance development where S. aureus undergoes genetic shifts during treatment, resulting in the acquisition of subtle genetic changes in S. aureus subpopulations [39]. It is imperative to find cures effective in treating S. aureus infections since failures can lead to a dire consequence: the selection and spread of multiple drug-resistant strains [40]. This paper illustrates the analysis procedure and presents partial results.

Genome sequence data
The Solexa genome sequences (Raw Solexa sequence reads) of S. aureus strain were downloaded from ftp://ftp. ncbi.nlm.nih.gov/sra/static/SRX007/SRX007710/. The S. aureus is a USA300 strain that was sequenced using Illumina sequencing technology by the BROAD Institute in the Staphylococcus_aureus_Assembly_Development project. In addition, the completely sequenced S. aureus genome of S. aureus USA300 FPR3757 was downloaded from NCBI ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ Staphylococcus_aureus_USA300_FPR3757/. Like the newly sequenced strain, this strain belongs to the aureus USA300 subspecies. Its genome includes a chromosome of 2872769 base pairs (bps) (NC_007793) and three plasmids of 3125 bps (NC_007790), 4439 bps (NC_007791), and 37136 (NC_007792), respectively. The USA300 subspecies is methicillin-resistant, community-acquired (CA-MRSA) and has been involved in epidemiologically unassociated outbreaks of skin and soft tissue infections in healthy individuals in at least 21 U.S. states, as well as in Canada and Europe [41].

Gen Htr Steps
Gen Htr is based on a seamless integration of various computational tools such as MegaBlast, Blat and Gene-Wise as well as some in-house-developed Perl modules. MegaBlast uses a greedy algorithm for the nucleotide sequence-alignment search, up to 10 times faster than more common sequence-similarity programs [42]. Blat is a BLAST-Like Alignment Tool, designed specifically for accurate and faster sequence alignments [43]. For the heterogeneity analysis of a particular bacterial strain, a completely sequenced genome of an isogenic strain is selected as a reference genome, named IRG for short.
Gen Htr is partitioned into four conceptual steps (Fig. 1). The first step is to create the database Reference Genome DNA Fragments (RGDF), an entire set of nonoverlapped DNA fragments from the IRG (Fig.1 I.a). The RGDFs have a pre-defined length, e.g. a default length of 1,000 base pairs, which was empirically determined. Larger RGDFs often present a significant challenge since Solexa creates massive sequencing reads in certain genomic areas that often overwhelm MegaBlast, a key tool in this analysis. In such case, no sequence alignments will be displayed.
Once established, RGDF is used to search genomespecific databases of Solexa short reads via MegaBlast to identify its candidate trace sequences ( Fig.1 I.b). Mega-Blast is run with all default parameters except -v and -b. Both are assigned a value of 1,000,000 to allow all possible database sequences to show one-line descriptions (-v) and alignments (-b). The candidate trace sequences were defined as those that display at least 95% sequence identity with a given RGDF. The 95% is used to limit alignments between Solexa Reads and RGDFs to those of at most two mismatches.
Candidate trace sequences are then mapped to the IRG based on the alignments from the MegaBlast analysis ( Fig.1 I.c) to define genetic heterogeneity within a newly sequenced bacterial strain (treated as a population). In the mapping, nucleotide identities (D ij ) are determine at each chromosomal position, where D ij represents the j th alternative nucleotide at chromosomal position i. D ij is identical either to those in the reference genome or to those that are substituted, deleted or inserted. Furthermore, the numbers of candidate Trace sequences (M ij ) used to derive nucleotide identity j at position i as well as the percentage sequence identities (P ij ) between the candidate trace sequences and the RGDF are recorded. As a result, the genetic heterogeneity genotype in the format of P i1 :D i1 :M i1 , P i2 :D i2 :M i2 ... P ii :D ij :M ij ... P in :D in :M ik will be established for all chromosomal positions. Moreover, a 2-2-5 rule is designed specifying that any chromosomal position will be defined as candidate genetic heterogeneity sites if the following conditions are satisfied: First, there must exist at least two alternative nucleotides at certain positions; and second, among the alternative nucleotides at these positions, there must be at least five Solexa reads per an alternative with an average Phred value of 13 (a probability of p that is no greater than 0.05 to be incorrect base calls) [26]. At least five reads were empirically required for accurate single nucleotide mutation calling [44].
The second step is a simulation procedure to analyze genome complexities of DNA mixtures derived from the isogenic reference genome. The procedure first creates an IRG-specific DNA database, where all possible N-base pair "Solexa read-like" DNA fragments are generated from the IRG (Fig.1 II.a). A moving window of N-base pairs is used to scan over the genome to generate DNA fragments. The length of Solexa reads determines the size of the moving window (N = 37 base pairs in this analysis). Then the analysis follows the exact procedures from Fig.1 I.b and I.c to identify candidate sequences from the IRG-specific DNA database ( Fig.1 II.b) and establish genome-wide "heterogeneity" genotypes ( Fig.1 II.c). For consistency, the term "heterogeneity" is still used in this section, but instead of describing the heterogeneous bacterial population, it is Figure 1 The four-step analysis procedure for heterogeneity analysis of bacterial population. The first step is to establish genome-wide heterogeneity phenotypes for the newly sequenced bacterial strain (a clone population) (I.). The step first creates a database of Reference Genome DNA Fragments (RGDF) with a set of non-overlapped DNA fragments from the isogenic reference genome (IRG) (I.a); Once established, RGDF is used to search the database of Solexa short reads of the bacterial strain via MegaBlast to identify its candidate trace sequences (I.b), which are then mapped to the IRG to define genetic heterogeneity (I.c). The second step is a simulation procedure to study the genome complexity of the IRG (II.). The procedure creates an IRG-specific DNA database, covering all possible N-base pair "Solexa read-like" DNA fragments from the IRG (II.a), Then the analysis follow the same procedure from I.b to I.c to identify candidate sequences from the IRG-specific DNA database (II.b) and to establish genome-wide "heterogeneity" genotypes (II.c). The genotypes from step I and II are comparatively analyzed (III.). The genetic heterogeneity sites were analyzed with genewisedb for synonymous/non-synonymous mutations (IV).
used to portray the genome complexity of the IRG. The read depth from simulated data varies from genomic area to genomic area, depending on the complexity of the DNA. In the areas of single copy, the read depth is equal to N, the width of the moving window.
The genotypes from this and the previous step are compared ( Fig.1 III.) to identify the genetic heterogeneity sites where alternative nucleotides were found from the analysis of Solexa sequence data but not from the analysis of the reference genome. Once the genetic heterogeneity sites were predicted, synonymous/non-synonymous mutations at the sites were analyzed with genewisedb (GeneWise) [37,38] (Fig.1, IV.), where candidate trace sequences with heterogeneity sites were comparatively analyzed with their orthologous proteins from the isogenic reference genome. The genewisedb is run with all default parameters.
Gen Htr runs are semi-automated. A run of Gen Htr on the 700 M base pairs of S. aureus takes~10 h on a 500 MB RAM Pentium 4 computer running Linux. Running time increases linearly with the size of the Solexa genome sequences and genome length.

Characterization of the model genome
In heterogeneity analyses, a key pre-requirement is the existence of completely sequenced, assembled genomes of an isogenic bacterial strain. In NCBI, many completely sequenced S. aureus genomes are available, including those for FPR3757 and TCH1516, two strains of S. aureus subspecies USA300. The data set SRX007711, a USA300 strain, was selected as the model organism. The SRX007711 comes from the same subspecies. It was assumed that the newly sequenced genome was highly similar to the two sequenced subspecies. Indeed, using S. aureus subspecies USA300 FPR3757 as a template, Gen Htr can reconstruct the entire genome of SRX007711 with merely five gaps to a total of 410 base pairs (Additional file 1 Table S1). Moreover, only 101 substitutions, 2 insertions and 3 deletions were revealed, suggesting that SRX007711 is isogenic to FPR3757. Another critical pre-requirement is the size of the sequencing data. This analysis indicated that SRX007711 has massive coverage of the bacterial genome with an average number of up to 130 Solexa reads per mapped chromosomal position. Among these chromosomal positions, about 77.37% have greater than 100 of Solexa reads, with a maximum of 1,346.

Analysis of the isogenic reference genome
The results indicated that the simulation in Gen Htr could characterize genome complexity of the IRG. First, it determined whether chromosomal positions occur in single-copy-DNA fragments or in duplicated DNA fragments (Additional file 2 Table S2). The analysis also revealed "heterogeneity" within the IRG, a mixture of DNA fragments derived from recent duplication and subsequent mutations. A total of 981 "heterogeneity sites" were detected at genic areas (Additional file 3 Table S3). A majority of the "heterogeneity sites" are present in proteins important to pathogenesis (about 84% if genes encoding phage-like, transposase and hypothetic proteins were excluded). These include 5 sites on the gene for the immunoglobulin G binding protein A precursor, 8 sites on the gene for the superantigen-like protein, 12 sites on the gene for the cell surface proteins, 18 sites on the gene of teichoic acid biosynthesis protein, 36 sites on the gene of fibronectin binding protein A/B, 71 sites on the genes of clumping factor A/B, 191 sites on the genes of sdrC/D/E proteins and 135 sites on the genes of staphylococcal tandem lipoprotein. Many of those genes encode surface proteins that have a common C-terminal LPXTG/NPQTN cell wall attachment motif, a sequence fragment that plays an essential role in host colonization, biofilm formation, and the evasion of host defense [45]. For example, wall teichoic acids (WTA) have been shown to be essential for the survival of Bacillus subtilis [46] and function as virulence factors in S. aureus [47,48]. Deficient WTAmutants were impaired in their adherence to nasal cells and were unable to colonize cotton rat nares [47,49]. The superantigen-like protein is a bacterial protein toxin that binds to the major histocompatibility complex class II and T-cell receptor to stimulate large numbers of T cells, leading to toxic shock syndrome [50].
The detection of extensive "heterogeneity" from the isogenic reference genome, especially on the pathogenesis-related genes, is not surprising. Gene duplication has long been recognized as one of the most important mechanisms in the evolution of bacterial genomes, creating multiple homologs within a genome. The paralogous gene groups are further involved in mutations and genome rearrangements that help the bacteria adapt to ever-changing environments [51]. By "heterogeneity" analysis, Gen Htr can characterize the genome dynamics of bacterial pathogens.
Heterogeneity analysis of the newly sequenced S. aureus subsp. USA300 cell line The genetic heterogeneity analysis of SRX007711, the Solexa data from the newly sequenced USA300 cell line, detected a similar phenomenon. A total of 2,056 heterogeneity sites were identified through heterogeneity analysis. Among them, 204 are unique to the new cell line when compared to the "heterogeneity" genotypes of its isogenic reference genome. Many of them were also detected by Maq as single nucleotide polymorphisms (SNPs) and passed the "SNPfilter" when Solexa reads that cover the heterogeneity sites were extracted and used as an alternative Maq input (Section I in Table 1). The heterogeneity site that was detected at the gene coding for sensor histidine kinase is a good example. Among the 123 candidate trace sequences mapped to this site, 115 have a nucleotide that is identical to the isogenic reference genome and 7 with a G-> C substitution. All seven substitutions have a Phred value greater than or equal to 27 in the base calling and three of them have a Phred value of 40, the highest possible quality value in the data set. A value of 40 has a converted probability of 0.0001 incorrect reads. In contrast, the position is homogeneous in the isogenic reference genome. The same phenomenon was observed at other heterogeneity sites, including those at the genes encoding for the sulfatase family protein, the lipoate-protein ligase A family protein, the penicillin-binding protein 3, the lantibiotic epidermin biosynthesis protein EpiC, the oxacillin resistance-related FmtC protein, and the putative fibronectin/fibrinogen binding protein. Furthermore, a majority of the heterogeneity sites are located in the single-copy DNA fragment in the isogenic reference genome.

Heterogeneity analysis of the genes of single DNA copies
Further analysis indicated that the genetic heterogeneity has certain unique properties. First, a majority of the genetic heterogeneity sites were detected in unique genomic areas. Many of them have non-synonymous mutations, leading to amino acid alterations in target proteins ( Table 2). For example, the genetic heterogeneity leads to a Q 59 -> K 59 substitution on putative fibronectin/fibrinogen binding protein, a C 717 -> Y 717 on oxacillin resistance-related FmtC protein at a ratio of 3.87% [T:6 G:145], and 3.7% [A:5 G:128] out of the detected candidate sequences, respectively. The genetic heterogeneity also leads to truncated proteins, e.g. at genes encoding sensor histidine kinase at a percentage of 5.74% [C:7 G:115], and phosphotransferase system, glucose-specific IIABC component at a percentage of 2.85% [A:5 C:170].
The genetic heterogeneity is involved in some pathogenesis-related genes, including penicillin-binding proteins (PBP), fibrinogen/fibronectin-binding proteins (Fnbp), and lipoate-protein ligase A family protein. PBP is a key player in the bacterial cell cycle and drug-resistance processes. Altered PBPs with a reduced affinity to penicillin lead to penicillin resistance in clinical isolates of S. pneumoniae [52][53][54]. Fnbp is a major platelet-activating factor on the surface of S.aureus [55]. Truncated derivatives of the genes promote platelet activation when expressed on the surface of S.aureus or Lactococcus lactis, indicating two distinct mechanisms of activation. fmtC is a gene that affects oxacillin resistance in methicillin-resistant S. aureus. Its mutants showed increased susceptibility to beta-lactam antibiotics and bacitracin [56]. The gene of lipoate-protein ligase A family protein is upregulated by daptomycin [57]. The antibiotic induces the S. aureus cell wall stress stimulon and genes responsive to membrane depolarization.
As indicated above, Maq was used as an important tool for validation. However, many genetic heterogeneity sites could not pass the "SNPfilter" (Section II in Table 1), but it is worth illustrating them from the functional perspective because of their potential roles in the pathogenicity and microevolution of S. aureus strains. The sites include a LysR family transcriptional regulator gene with a genotype of [A:5 G:92]. A nonsense mutation results in a truncated LysR family transcriptional regulator. This protein is a global transcriptional regulator, acting as either activators or repressors of single or operonic genes. It regulates a diverse set of genes, including those involved in virulence, metabolism, quorum sensing and motility [58,59]. In mycobacteria, the lack of mismatch correction is recognized as a promoter of mycobacterial evolution [60]. It is, however, not completely clear how the genetic variation and phenotypic diversity are created in S. aureus. The discovery of the multi-variants in the genes that are known to be associated with hypermutability will shed light on the molecular mechanisms. The first such site is located at the gene encoding DNA mismatch repair protein MutS. The heterogeneity site has a genotype of [C:115 G:5]. The C-> G substitution at the chromosomal position of 1,309,034 results in a nonsense mutation, leading to a truncated protein with a loss of 77 amino acids at the C-terminal of the protein. The finding of truncated DNA mismatch repair protein MutS in the newly sequenced S. aureus subsp. USA300 cell line is new but not a surprise. This gene is part of the bacterial mismatch repair system. It functions to correct point mutations and small insertions/deletions that fail to be proofread by DNA polymerase activity. Previous experiments indicated that its disruption is related to the high-frequency hypermutability and drug-resistant mutants. In P. aeruginosa, mutS mutants displayed an increase in antibiotic-resistance. Furthermore, antibiotic-resistant levels of the generated mutants were higher than those measured from spontaneous resistant mutants derived from wild-type cells [61]. In S. aureus, the inactivation of MutS or MutL was associated with the emergence of a hypermutator phenotype that favors the acquisition of antibiotic resistance and facilitates bacterial adaptation during long-term persistence [62]. Drug-resistant subpopulations with mutant genes coding for DNA repair enzymes, referred to as strong mutators, were found in the laboratory populations of E. coli [63]; S. enterica serovar Typhimurium [12] and in natural populations of E. coli [11]; S. enterica [64]; N. meningitidis [13]; and P. aeruginosa [65]. All of them    have a defective mismatch repair system due to the inactivation of mutS or mutL genes. Additional candidates to establishing the genetic diversity necessary for the evolution of drug-resistant strains are the variants found at genes of exonuclease RexB and DNA topoisomerase I. The former has a genotype of [A:5 C:113] at the chromosomal position of 950,365, and the latter has a genotype of [T:9 G:125] at the chromosomal position of 1,252,956. Substitutions at both positions lead to non-synonymous mutations, causing amino acid replacements: Q 620 -> K 620 at exonuclease RexB protein and V 65 -> F 65 at topoisomerase I protein.
The RexB is part of the cellular system that plays a crucial role in homologous recombination for the repair of a variety of DNA damage, involving in the maintenance of chromosome integrity and the generation of genetic variability [66]. Topoisomerases are essential enzymes that solve topological problems arising from the doublehelical structure of DNA. It is, then, related to the status of DNA supercoiling, a critical factor that modulates the expression of virulence genes in pathogenic bacteria at different phases of the host-pathogen relationship [67]. Validation of the heterogeneity sites by targeted-sequencing with different technologies would improve our understanding of the molecular mechanisms underlying the evolution and pathogenicity of S. aureus.

Detection of mutations in the large gene families
Mutations were not only detected in the unique genes but also in large gene families, e.g. 5 S rRNA and 23 S rRNA genes (Section III in table 1). The mutations in these gene families were further validated through Maq with high confidence, where individual sequences from the genes were used as references. The results are not surprising. Previous experiments established a clear association between mutations in the ribosomal genes and hypermutability. Canu et al. detected a variety of ribosomal mutations that conferred resistance to macrolides, clindamycin, streptogramin, and telithromycin in Streptococcus pneumoniae [68]. Prunier et al. provided additional support for the role of the rRNA gene in drug resistance and hypermutability, where six strains of S. aureus were isolated from cystic fibrosis patients after treatment with azithromycin and all carried either A2058G/U or A2059G mutations within the 23 S rRNA gene and all acquired cross-resistances to azithromycin and erythromycin, the two therapeutic agents [69].
Mutations were also detected in the sequences that coded for clumping factor A/B and sdrD proteins. Clumping factor A/B and sdrD protein are pathogenic factors from two large gene families in the S. aureus. Clumping factor A (ClfA) is a fibrinogen-binding protein expressed on the S. aureus cell surface and has previously been shown to act as a virulence factor in experimental septic arthritis [70]. Clumping factor B (ClfB) is one of the microbial surface components that recognizes adhesive matrix molecules, mediates the adherence of S. aureus to immobilized fibrinogen and promotes bacterial clumping in soluble fibrinogen [71]. Just like clumping factor B, the serine-aspartic acid repeat protein SdrD can also promote adhesion to squamous cells [72]. A mutant strain that lacks SdrD proteins can cause defective in adherence [73].

Advancement of the tool
The success of the demonstration indicated that Gen Htr is an improvement over available tools to calculate genetic heterogeneity. For instance, one of the biggest challenges in the Solexa-based heterogeneity analysis, specifically with Maq, is the low frequency of mutant subpopulation [6]. Using the new sequence data of the S. aureus USA300 cell line, Maq detected about 119 single nucleotide polymorphisms (SNPs) but missed a majority of the heterogeneity sites detected by the Gen Htr analysis. Improvement was made when Solexa reads that cover the heterogeneity sites were extracted and used as an alternative Maq input. Additional heterogeneity sites were indeed rediscovered but not all of them. This is because a greater part of the heterogeneity sites detected have read depths less than 10, a threshold used to identify SNP between strains of the highly monomorphic pathogen Salmonella Paratyphi A [36]. Furthermore, the analysis with a simulation procedure indicated that the read depth might contribute to the failure in this data set (Additional file 4Table S4). In the simulation, 1,000 Solexa data sets were created separately for the depth of 4, 6, 8 and 10. Each set consists of unique Solexa reads that were randomly sampled from the 38 Solexa reads covering SNP identified by Maq at the position of 395,176. All failed to pass the Maq filter (SNPfilter) except the data sets of the read depth of 10, where SNPs were detected in 99.9% of the simulated data. SNPfilter is one of the programs in the Maq package, which helps to filter SNPs according to predefined rules of the package. Like many other computational tools, Maq analyses are local and all analyses narrow down to particular genome positions without considering global genome contexts. So, it is difficult to determine whether the heterogeneity is due to intra-genome variation or the heterogeneous population. This is exactly the case for five chromosomal positions detected in the gene-encoding putative transposase, where Maq identified them as heterozygotes (Additional file 5 Table S5). The genotypic analysis indicated that they are indeed heterogeneous in the newly sequenced genome; however, they are also heterogeneous in the simulated sequence population from the isogenic reference genome, implying intra-genome variation at these sites (Additional file 6 Table S6).
In the SNP analysis, Maq detects potential SNPs via comparing the consensus sequence to the reference genomes, which are then filtered by a set of predefined rules [35]. The filtering rules out SNPs if they are, among other constraints, falling in a possible repetitive region. It is understandable to rule out repetitive regions from SNP analysis due to possible intra-genome variations. The filtering, however, may have an unexpected consequence. As illustrated above, gene duplication and subsequent mutations are vital in pathogenic bacteria because they help the bacteria adapt to ever-changing environments [51]. By establishing and comparatively analyzing the heterogeneity genotypes at the newly sequenced strains as well as their isogenic reference strain, Gen Ht can identify SNPs/heterogeneity sites that fall in the highly repetitive regions. This can be demonstrated in the comparative analysis between S. aureus JH9 and S. aureus JH1. Using the simulation procedure (See MATERIALS AND METHODS for detailed procedure), three differentiating loci were identified within the 23 S ribosomal RNA genes (Additional file 7 Fig. S1 and Additional file 8 Table S7). Among them, two loci are heterogeneous in JH1 but homogeneous in JH9.

Future improvement
Despite the success of the demonstration, there are minor concerns in the development and application of Gen Htr . First, Gen Htr utilized MegaBlast for the mapping of the Solexa reads and the identification of the genetic heterogeneity sites. The program uses a greedy algorithm for the nucleotide sequence-alignment search, up to 10 times faster than more common sequence-similarity programs [42]. The alignment tool is able to handle the massive data that Solexa generated but would sacrifice certain levels of accuracies. Besides its speed, another reason for choosing MegaBlast is its intolerance for mismatches between aligned sequences. For example, runs using MegaBlast with default MegaBlast parameters permit only two mismatches in an alignment of 37 base pairs, which allow the identification of Solexa reads that are likely to be orthologous to the reference genome. One of the alternative tools for aligning genome sequences with Solexa reads is Blat. A trial with the alignment tool indicated that it could provide much better alignments; for example, when the DNA fragment of the first 1,000 base pairs of the isogenic reference genome was run against the Solexa reads, MegaBlast achieved a full sequence alignment and a 100% sequence identity for 250 homologous Solexa reads while the number was much larger when Blat was applied -1,530 in total. As a result, genome coverage was increased. The average Solexa reads per position was up 12.3% from 130 to 146 when MegaBlast was replaced with Blat on the genome analysis of SRX07711. Blat was applied with default parameters, except with the option -out=blast. However, Blat took a much longer alignment time than MegaBlast, roughly 35 hours to align 2,873 Reference Genome DNA Fragments (RGDFs) against the Solexa read data while MegaBlast took approximately 8 hours. On the other hand, Blat is much faster if Solexa reads were used as queries against the isogenic reference genome, taking only a few minutes to finish the analysis of the entire Solexa data. The problem is that substitutions could disrupt the alignments when they occur at sub-centric positions (Additional file 9 Table S8). The sub-centric positions mean that the substitutions divide the read into a large DNA fragment of about 28 to 32 base pairs and a small fragment of 5-9 base pairs. Instead of making the alignment with the mismatches, Blat simply aligns the larger one, then searches downstream for the sequence identical to the smaller one and aligns with it. Approximately 81,135 such alignments were found when a Solexa data set was mapped to the USA300 isogenic reference genome. The result indicated that such analysis could compromise the SNP and heterogeneity analysis if not corrected.
Second, the sequencing errors are the main concerns. Gen Htr used the Phred value of 13 (base quality) as a measure to determine whether the base calls from Solexa reads are real or are derived from sequencing artifacts. However, a weighted function will be necessary. First, the quality scores are unevenly distributed along the 37-bp Solexa reads (Additional file 10 Table  S9). Overall, base calls at the 5' end of the Solexa read have a high-quality score that gradually decreases as sequencing reaches the 3' end [74]. For example, a majority of the base calls at the first position of these reads are high quality, where less than 5% of base calls have Phred values of less than 13. In the contrast, 10% of the base calls from position 2 to 13, 25% from 14 to 19, 50% from position 20 to 28, and 75% from position 29 to 37 are considered to be incorrect base calls based on the threshold. Therefore, the geographical location of the bases in the Solexa reads is an important factor in evaluating data quality. On the one hand, bases with a low Phred score are not necessarily wrong base calling. Indeed, previous experiments estimated that the error rate per base read of SNPs detected by Solexa and checked by Sanger sequencing for the Maq b5 (base quality ≥5) is 1.0 × 10 2 , indicating that the majority of the base calling with a low Phred score is correct. On the other hand, a higher base quality is not a guarantee of correct base calling. For the Maq b20 (base quality ≥20), the error rate is still 1.0 × 10 3 although tenfold lower. In addition, genotypes at some heterogeneity sites with a low Phred value showed the exact same patterns as those from the isogenic reference genome (Additional file 11 Table S10). This also indicates that a base calling with a low-quality score can still be correct.
Another approach to evaluate the quality of the heterogeneity sites is to determine the general tendency of base substitutions. As described above, the progressive deterioration in quality scores as the sequencing proceeds leads to higher noise levels. The consequence is intensity more close to the background, which results in misleading base calling, e.g. it is more likely to be called T than A, or C than G [74]. If sequencing errors are the only cause of the heterogeneity, there should be more A-> T and G-> C substitutions than any other types of substitutions at the heterogeneity sites. Detailed examination of the heterogeneity genotypes found no such tendency. In Section III of table 1, there are more C-> A (23%) than any other substitutions. This is followed by G-> T (13.4%) and C-> G (11.5%). In fact, there are only 7% in both A-> T and G-> C substitutions, indicating that the resulting alternative nucleotides at the heterogeneity sites are not likely entirely from sequencing errors.
Third, the frequency of subpopulations in bacterial clone populations represents a great challenge in heterogeneity analysis. Rates from 10 -3 to 10 -6 of more highly vancomycin-resistant cells in hVISA cultures appear to define hVISA strains at the moment. It is certainly below the proven polymorphism detection limit. Evidence also indicated that the frequency, however, varied from strain to strain and condition to condition. Generally, strains with a higher level of drug-resistance appear to have higher frequencies of drug-resistant subpopulations when cultured at a medium with a lower drug concentration. For example, NYH-2*, an hVISA strain, has a minimum inhibitory concentration of 8 μg/ml. The frequency of drug-resistant subpopulations in NYH-2* is 100% at a vancomycin concentration of 2 μg/ml, 10% at 4 μg/ml and 6 × 10 -2 % at 8 μg/ml [6]. PC-3* another hVISA strain with a minimum inhibitory concentration of 16 μg/ml, has a drug-resistant subpopulation frequency of 100% when cultured at all three vancomycin concentrations: 2 μg/ml, 4 μg/ml, and 8 μg/ml [6]. Thus, a better experimental design plus greater genome coverage are expected to overcome the low polymorphism detection limit.
Finally, the complexity of bacterial genomes will make it difficult to assess the heterogeneity status of particular chromosomal positions if they occur in the DNA fragments with multiple copies, e.g. those observed in 5S/23 S ribosomal RNA and genes encoding sdrE protein and clumping factor A/B (Table 2). In these cases, the heterogeneity sites can be interpreted to be due either to mutations between paralogs (within the genome) or to mutation in orthologs (between genomes of different subpopulations). On the other hand, the heterogeneity sites detected in the single-copy DNA fragments are less problematic. These sites cover many positions in proteinencoding genes, including those at penicillin-binding protein 3, exonuclease RexB, DNA topoisomerase I, and DNA mismatch repair protein MutS, and involve in 48 different functions. It will be a rare event for all these genes to be duplicated in such a short period of time between the isogenic isolates. In addition, no recent duplication events are detected at this or the 14 other completely sequenced S. aureus genomes. This allows us to predict that the sites are truly heterogeneous, which likely represent multi-variants in the bacterial populations.

Conclusions
In summary, Gen Htr was developed and tested with a newly sequenced S. aureus USA 300 cell line. Although it is much more time-consuming when compared to Maq, a popular tool for SNP analysis, Gen Htr is able to build genome-wide heterogeneity genotypes for both newly sequenced genomes (using massively parallel short-read sequencing) and their isogenic reference genome (using simulated data). From that, Gen Htr can predict potential multiple variants that pre-exist in the bacterial population as well as SNPs that occur in highly duplicated gene families. In addition, the establishment of genome-wide heterogeneity genotypes for newly sequenced genomes and their isogenic reference genomes allows the heterogeneity to be quantified. For example, we plan to use an evolving distance (d) to quantify how many newly sequenced strains are evolved from the isogenic reference strains and characterize loci that have a greater complexity than those that happened in the 23 S rRNA genes. The successful implementation and testing of Gen Htr is expected to have a large impact on the research of bacterial pathogen. Rather than identifying sequence variations from strains to strains or isolates to isolates [75][76][77][78][79][80][81], the Solexa technology and Gen Htr will allow bacterial strains/isolates to be studied as heterogeneous populations instead of as monomorphic clones [5]. By this approach, the population dynamics of bacterial populations can be carefully characterized and comparatively analyzed with respect to genetic heterogeneity. With the paradigm shift, we expect that the evolutionary forces that shape bacterial populations can be evaluated at the DNA sequence level on the whole genome scale.