Skip to main content

Modeling of shotgun sequencing of DNA plasmids using experimental and theoretical approaches



Processing and analysis of DNA sequences obtained from next-generation sequencing (NGS) face some difficulties in terms of the correct prediction of DNA sequencing outcomes without the implementation of bioinformatics approaches. However, algorithms based on NGS perform inefficiently due to the generation of long DNA fragments, the difficulty of assembling them and the complexity of the used genomes. On the other hand, the Sanger DNA sequencing method is still considered to be the most reliable; it is a reliable choice for virtual modeling to build all possible consensus sequences from smaller DNA fragments.


In silico and in vitro experiments were conducted: (1) to implement and test our novel sequencing algorithm, using the standard cloning vectors of different length and (2) to validate experimentally virtual shotgun sequencing using the PCR technique with the number of cycles from 1 to 9 for each reaction.


We applied a novel algorithm based on Sanger methodology to correctly predict and emphasize the performance of DNA sequencing techniques as well as in de novo DNA sequencing and its further application in synthetic biology. We demonstrate the statistical significance of our results.

Graphical abstract


Optimization in the processing of DNA sequence data may impose a serious challenge regarding the correct prediction of DNA sequencing outcomes without the application of bioinformatics approaches [1]. These approaches play an important role in novel sequencing pipelines, termed Next-Generation Sequencing (NGS) technologies, and they have transformed the sequencing landscape in the past few years [2, 3]. Despite significant scientific achievements in DNA sequencing, there is still a shortage of efficient bioinformatics tools for virtual NGS simulations due to the generation of long DNA fragments and difficulty assembling them [4, 5]. Although some computational algorithms have already been developed and tested for the construction of a realistic data set, such as the MetaSim simulator to model Roche’s 454 and Illumina technologies, they still lack thorough experimental validation of the generated results [6]. Moreover, these algorithms deal with NGS, which might be error-prone [7]. Thus, they may lead to artificial mutations and sequencing bias [7]. In particular, a study of NGS biases defined that introns are less covered with reads than exons due to the much higher complexity of the latter structures [8].

On the other hand, the Sanger method is a termination sequencing technology for determining a nucleotide sequence of DNA molecules that can only be used for short DNA strands of 100 to 1000 base pairs [9], which is suitable for the sequencing of small DNA plasmids. For example, Sentchilo et al. used both Sanger sequencing and 454- sequencing combined with classic CsCl density gradient centrifugation, to characterize a wastewater metagenome of plasmids to determine some larger circular genetic elements [10]. The conventional Sanger method is still considered the “gold standard,” it is the most reliable sequencing methodology, but it might be also laborious and time-consuming [11, 12].

Some attempts have been made to develop open-source bioinformatics tools, simulating shotgun (genomic, metagenomic, transcriptomic, and metatranscriptomic) datasets from reference sequencing platforms, such as the Grinder and Tracembler programs [13, 14]. However, the virtual shotgun sequencing mimicking the Sanger method of the standard cloning vectors with different length sizes has yet to be tested and validated experimentally, using polymerase chain reaction. Therefore, we implemented the virtual sequencing algorithm based on the Sanger methodology to correctly predict and emphasize the performance of this DNA sequencing technique, using the average sequence length for the adjustment of coverage values in experimental settings.


Plasmid selection

The sequencing data for the pCR™4-TOPO® plasmid containing 125 bp insertion (Thermo Scientific, Germany), pQE-30-UA-mCHERRY-GFP (in-house modified vector pQE-30 UA, Qiagen, USA) and pLEXSY-Ig-1 vector for in vitro translation of Ig-like C2-type 1 protein (Jena Biosciences, Germany, [15]) were obtained in a FASTA format from our previous sequencing experiments.

In silico sequencing and fragment assembly

The Sequencer algorithm developed by Bernhard Haubold (Max Planck Institute for Evolutionary Biology) was used to simulate the in silico shotgun sequencing technique for determining the nucleotide sequence of DNA molecules that are no more than a few kilobases ( The TIGR (The Institute for Genomic Research) Assembler, a classic open-access assembly tool developed by the Institute for Genomic Research [16], was applied to build all possible consensus sequences (contigs) from smaller sequence fragments, coming from the virtual shotgun sequencing. The Dotter software, a graphical dot plot program [17], was utilized to provide the complete and detailed comparison of two analyzed sequences and to calculate the Karlin-Altschul statistics [18]. For this, the program has the ability to adjust the stringency cutoffs interactively so that the dot-matrix only needs to be calculated once [17]. The CLUSTALW 2.1 program was used for the DNA sequence alignment [19]. The DNA analysis was performed by using the BioEdit software to calculate guanine-cytosine (GC) and adenine-thymine (AT) contents together with identity matrices between the analyzed sequences in the alignment [20].

PCR validation

Primers (Table 1), targeting 100, 400 and 800 bp regions of the analyzed plasmids, were designed by the Geneious PRO software (Biomatters, New Zealand). The gene-expression vectors were used as templates for PCR, containing Thermo Scientific DreamTaq Green PCR Master Mix (2x), 0.1 μM of each forward and reverse primers and 10 ng of template DNA. All reactions were performed according to the manufacturer’s instructions, with the number of cycles 1, 3, 5, 7, and 9 for each reaction. The amplicons were purified using the NucleoSpin Gel and PCR clean-up kits (Macherey-Nagel, Germany) and quantified by using the Infinite 200 PRO plate reader (Infinite® 200 PRO NanoQuant, Tecan, Switzerland). Statistical analyses were performed using linear and nonlinear regression modeling by the GraphPad Prism v.7 software for Windows (GraphPad Software, San Diego, CA). The differences were considered statistically significant at a p-value of < 0.05. All the necessary files (tested in Linux environment) required for the virtual sequencing, including executable programs, bash scripts, FASTA format files, and program outputs can be found in Additional file 1. All in silico experiments were performed at least three times.

Table 1 List of primers used in PCR


The simulation of the sequencing process was used in order to optimize the DNA sequencing output for sequence assembly (Fig. 1). According to this, the resulting DNA fragments were assembled into the sought template sequence using the TIGR Assembler computer program [16]. To test this program’s performance, it is useful to simulate random DNA fragments in association with the Sequencer algorithm. In particular, the algorithm takes a template DNA sequence as input and outputs random reads until the number of sequenced nucleotides, divided by the length of the template molecule, has reached a threshold known as the coverage value.

Fig. 1

Flowchart describing the methodology used to model shotgun sequencing of DNA plasmids with experimental and theoretical approaches

For this purpose, we simulated this sequencing process of pCR™4-TOPO® (4079 bp), pQE-30-UA-mCHERRY-GFP (4886 bp) and pLEXSY-Ig-1 (2319 bp) cloning vectors via the Sequencer algorithm using 100, 400, and 800 bp lengths of the sequences to determine the coverage rate. In the past, most of the gene expression plasmids were being sequenced using Sanger sequencing of ~ 3 kb clone libraries, but presently it has been switched to the Roche/454 platform with GS FLX Titanium sequencing chemistry and Illumina sequencing technology [21, 22]. Therefore, the plasmid sequencing can be done by NGS, allowing us to analyze samples in a high-throughput manner with small reads of approximately 200–500 bp, but this also might be expensive and time-consuming [23]. The coverage parameter was tested in the range of 1 to 9, representing the rate at which every nucleotide in DNA should be sequenced on average. In fact, the Sanger sequencing delivers reads of up to 800 bp; however, the critical limitation is the volume of the analyzed sample and its low scalability in comparison to modern techniques [24]. Therefore, the maximal average length of the sequences was chosen to be 800 nt, which is a realistic assumption for the Sanger sequencing [25].

Next, we implemented the TIGR Assembler algorithm in order to recover the original cloning vectors and calculate the number of the assembled sequence fragments as contiguous sequences. Optimally, the program was designed to generate a single contiguous string from the various overlapping DNA fragments [16]. In order to test the quality of sequencing, we used different sequence lengths and coverage parameters to produce one contiguous sequence, which was then observed for all the analyzed vectors (Fig. 2 [a-c]), starting from the coverage number of 4 and higher sequence lengths (400 nt). Notably, the number of contigs at low sequence length (100 nt) fits a Gaussian distribution with reliable statistics (r2 = 0.83–0.86), whereas this parameter is linearly distributed at higher sequence lengths. Similar patterns had been previously inspected as multivariate distributions of tetranucleotide frequencies of artificial DNA fragments, where these distributions can be approximated by a single Gaussian function [26]. On the other hand, the results corresponding to 800 nt also correlate with the number of sequences for all the analyzed plasmids (Fig. 3 [a-c]) produced by the Sequencer algorithm, which is minimal at maximal sequence length. Consequently, the relationship between coverage and the number of sequences for the analyzed cloning vectors was estimated by the linear regression analysis with reliable statistics (r2 = 0.99) and a p-value of < 0.0001. The diagonal lines in Fig. 4 [a-c], indicating the DNA molecules generated and assembled by the Sequencer and TIGR Assembler algorithm, are corresponded to the template DNA as negative slopes (m < 0) of the lines from the reverse DNA strand. Furthermore, the pairwise sequence alignments using the CLUSTALW and BioEdit programs at the maximal value of sequence length and coverage indicated the exact match between the reference cloning vectors and the assembled DNA. However, the virtual DNA fragments were assembled in the manner (“impaired topology”), where the uncovered DNA located at the beginning of the reference DNA was complementary to the uncovered DNA located at the end of the assembled sequence (Fig. 4). This “impaired topology” outcome might be associated with the development of TIGR Assembler based on the data derived from more than 20 sequencing projects, leading, however, to a sequence assembler that produces some misassemblies [27]. Despite this drawback, the algorithm has been successfully implemented in whole-genome shotgun sequencing of prokaryotic and eukaryotic organisms, bacterial artificial chromosome-based sequencing of eukaryotic organisms, and expressed sequence tag assembly [16, 28, 29].

Fig. 2

The relation between coverage, sequence length (nt), and the number of the assembled sequences (contigs) for the pCR™4-TOPO® (a), pQE-30-UA-mCHERRY-GFP (b), and pLEXSY-Ig-1 (c) cloning vectors using the Sequencer algorithm. The Gaussian distribution function was used for a curve fitting

Fig. 3

The relation between coverage, sequence length (nt), and the number of the generated sequences for the pCR™4-TOPO® (a), pQE-30-UA-mCHERRY-GFP (b), and pLEXSY-Ig-1 (c) cloning vectors using the Sequencer algorithm

Fig. 4

Dot plots of the pCR™4-TOPO® (a), pQE-30-UA-mCHERRY-GFP (b), and pLEXSY-Ig-1 (c) cloning vectors (x-axis, nt) versus the assembled DNA molecules with the maximal sequence length and coverage (y-axis, nt) using the Sequencer and TIGR Assembler algorithms

To access the local sequence alignment, the expectation value (E-value) as the expected number of local alignments with a given score (S) was calculated according to the Karlin-Altschul statistics (Table 2), using the following equation [18]:

$$ E=K\ast mn{e}^{-\lambda S} $$

where m is the MSP sequence length; n is the size of the database (n = 1 as no database was used); e is the exponential function; and K and λ are the search-space related and normalizing constants, respectively.

Table 2 Karlin-Altschul statistics for the analyzed DNA sequences

For the local alignments, the E-values were found to be the same (0.002) for all the analyzed vectors, indicating the statistically significant (E-value < 0.05) data produced by the Karlin-Altschul approach. Nonetheless, it has been shown previously that E-values might be dependent on the query sequence length, which might generate some “false positive” hits, previously observed, analyzing short primer regions and small domain regions [30, 31]. On the other hand, the DNA composition and identity analysis (Table 3) for the reference and assembled DNA revealed that their GC and AT contents were almost identical at the minimum vector size (pLEXSY-Ig-1) and the highest identity value (0.78). From the previous DNA sequence alignments (Fig. 5), it is clear that the identity values depend on the vector size and the “impaired topology” of the assembled DNA generated by the virtual sequencing algorithm. It has also been reported for NGS that extreme base compositions could lead to uneven coverage of reads, hindering genome assembly [32]. However, our experiments were conducted using the balanced GC and AT biases (~ 40–60%), which prevents the results from sequencing errors related to GC-poor sequences with a mean GC content of less than 25% [33].

Table 3 DNA composition and identity analysis for the reference (Ref) and assembled (Ass) DNA molecules
Fig. 5

DNA sequence alignments of the analyzed cloning vectors as reference versus the assembled DNA molecules (output_1) produced by the Sequencer algorithm at the maximal value of sequence length and coverage, using the CLUSTALW and BioEdit programs. The starting and ending parts of the sequences are shown to enhance overall clarity

To validate the in silico output, the PCR methodology was applied to calculate the number of DNA copies (n) produced by amplifying the different sequence lengths (100, 400, and 800 nt) of the analyzed vectors (Additional file 2), using the following equation [34]:

$$ n=\frac{\left(a\times {N}_A\right)}{l\times 650\times {10}^9} $$

where a is the amount of DNA; NA is the Avogadro constant; l is the sequence length (nt). The PCR technique largely depends on several factors, such as a polymerase, number of cycles, probe degradation, template volume and size of the reaction mixture [35, 36]. In contrast, virtual sequencing is a fully independent system, mimicking the sequencing strategy and identifying novel features of genomes, namely the satellite repeats, variations, and single-nucleotide polymorphism. Whereas our primers designed to amplify the specific DNA parts, the virtual sequencing allows targeting the random areas, which enables it to be used for de novo sequencing of random DNA fragments. For the experimental generation of truly novel plasmids in their native host, where the genetic material requires a correct assembly, it might be necessary to enrich and purify the plasmid DNA [37]. This can be achieved by closing the sequence gaps between contigs by PCR-amplification and subsequent Sanger sequencing of the PCR-product [37]. In our case, to provide consistent results, we used the same conditions for each experiment, including the design of primers with the same melting temperature, reaction time, and amount of reaction mixture. In addition, we used the fixed number of DNA copies (10 ng), contributing to the sequencing accuracy in our experiments.

As an outcome, the numerical and experimental data corroborated reasonably for the number of DNA copies, which was minimal at 800 nt of sequence length, representing a relationship with average r2 and p-values of 0.92 and 0.012 for all the analyzed genetic elements (Fig. 6 [a-c]). This demonstrates that the number of amplicons in the PCR experiments and virtual sequencing correlate with the numbers of cycles of coverage rate consistently. Therefore, this technique can be applied (i) as an inexpensive quality control technique for sequencing analysis and (ii) as a support for the user with a reduced sequencing budget to emphasize sequencing data in silico. Currently, several sequencing strategies are available to determine the correct DNA sequence [38, 39]. Further applications of in silico sequencing algorithms might include the single-molecule sequencing method, able to analyze short-length segments in a large volume, which does not require the amplification of a DNA template [40] together with old-fashioned but very precise methods, such as the Sanger sequencing [41, 42].

Fig. 6

The relationship between sequence length (nt), the number of DNA copies (× 1011), and PCR cycles for the pCR™4-TOPO® (a), pQE-30-UA-mCHERRY-GFP (b), and pLEXSY-Ig-1 (c) cloning vectors using PCR


We tested and validated a novel virtual sequencing algorithm able to simulate shotgun sequencing. In reality, these simulations are challenging and require the implementation of multi-step protocols, including data production, assembly, and validation. Despite all the recent sequencing improvements, the Sanger method is still considered the “gold standard” in DNA sequencing due to its high accuracy. Therefore, in this study, we applied a novel algorithm to simulate shotgun sequencing and to build all possible consensus sequences from small DNA fragments for the gene-expression vectors. Therefore, the virtual sequencing approach was validated experimentally using the PCR technique with the number of cycles from 1 to 9 for each reaction. Overall, the obtained results can be used to correctly predict and emphasize the performance of this DNA sequencing technique based on the average sequence length to adjust the coverage values in experimental settings.

Availability of data and materials

Project name: Virtual sequencing project.

Project home pages:

Operating system(s): Linux.

Programming language: C.

Other requirements: gcc, gawk, ranlib.

License: GNU GPL.

Any restrictions to use by non-academics: none.



Next-generation sequencing


Maximal-scoring segment pairs


Guanine-cytosine content


Adenine-thymine content


Polymerase chain reaction


Deoxyribonucleic acid


The Institute for Genomic Research


  1. 1.

    Lam K-K, Khalak A, Tse D. Near-optimal assembly for shotgun sequencing with noisy reads. BMC Bioinformatics. 2014;15(Suppl 9):S4.

    PubMed  PubMed Central  Google Scholar 

  2. 2.

    Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):376–80.

    CAS  PubMed  PubMed Central  Google Scholar 

  3. 3.

    Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24(3):133–41.

    CAS  Google Scholar 

  4. 4.

    Buermans HPJ, den Dunnen JT. Next generation sequencing technology: advances and applications. Biochim Biophys Acta. 2014;1842(10):1932–41.

    CAS  PubMed  Google Scholar 

  5. 5.

    Yeh C-M, Liu Z-J, Tsai W-C. Advanced applications of next-generation sequencing technologies to orchid biology. Curr Issues Mol Biol. 2018;27:51–70.

    PubMed  Google Scholar 

  6. 6.

    Richter DC, Ott F, Auch AF, Schmid R, Huson DH. MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3(10):e3373.

    PubMed  PubMed Central  Google Scholar 

  7. 7.

    Muzzey D, Evans EA, Lieber C. Understanding the basics of NGS: from mechanism to variant calling. Curr Genet Med Rep. 2015;3(4):158–65.

    PubMed  PubMed Central  Google Scholar 

  8. 8.

    Schwartz S, Oren R, Ast G. Detection and removal of biases in the analysis of next-generation sequencing reads. PLoS One. 2011;6(1):e16685.

    CAS  PubMed  PubMed Central  Google Scholar 

  9. 9.

    Meng F, Dong X, Hu T, Liu Y, Zhao Y, Lv Y, Chang S, Zhao P, Cui Z. Analysis of Quasispecies of Avain Leukosis virus subgroup J using Sanger and high-throughput sequencing. Virol J. 2016;13:112.

    PubMed  PubMed Central  Google Scholar 

  10. 10.

    Sentchilo V, Mayer AP, Guy L, Miyazaki R, Green Tringe S, Barry K, Malfatti S, Goessmann A, Robinson-Rechavi M, van der Meer JR. Community-wide plasmid gene mobilization and selection. Isme J. 2013;7(6):1173–86.

    CAS  PubMed  PubMed Central  Google Scholar 

  11. 11.

    McCourt CM, McArt DG, Mills K, Catherwood MA, Maxwell P, Waugh DJ, Hamilton P, O'Sullivan JM, Salto-Tellez M. Validation of next-generation sequencing technologies in comparison to current diagnostic gold standards for BRAF, EGFR and KRAS mutational analysis. PLoS One. 2013;8(7):e69604.

    CAS  PubMed  PubMed Central  Google Scholar 

  12. 12.

    Chen L, Cai Y, Zhou G, Shi X, Su J, Chen G, Lin K. Rapid Sanger sequencing of the 16S rRNA gene for identification of some common pathogens. PLoS One. 2014;9(2):e88886.

    PubMed  PubMed Central  Google Scholar 

  13. 13.

    Dong Q, Wilkerson MD, Brendel V. Tracembler--software for in-silico chromosome walking in unassembled genomes. BMC Bioinformatics. 2007;8:151.

    PubMed  PubMed Central  Google Scholar 

  14. 14.

    Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW. Grinder: a versatile amplicon and shotgun sequence simulator. Nucleic Acids Res. 2012;40(12):e94.

    CAS  PubMed  PubMed Central  Google Scholar 

  15. 15.

    Bhide M, Natarajan S, Hresko S, Aguilar C, Bencurova E. Rapid in vitro protein synthesis pipeline: a promising tool for cost-effective protein array design. Mol BioSyst. 2014;10(6):1236–45.

    CAS  PubMed  Google Scholar 

  16. 16.

    Pop M, Kosack D. Using the TIGR assembler in shotgun sequencing projects. Methods Mol Biol. 2004;255:279–94.

    CAS  PubMed  Google Scholar 

  17. 17.

    Sonnhammer EL, Durbin R. A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene. 1995;167(1–2):GC1–10.

    CAS  PubMed  Google Scholar 

  18. 18.

    Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A. 1990;87(6):2264–8.

    CAS  PubMed  PubMed Central  Google Scholar 

  19. 19.

    Hall TA. BioEdit: a user-friendly biological sequence alignment editor and analysis program for windows 95/98/NT. Nucleic Acids Symp Ser. 1999;41:95–8.

    CAS  Google Scholar 

  20. 20.

    Thompson JD, Gibson TJ, Higgins DG. Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics. 2002;Chapter 2:Unit 2.3.

    PubMed  Google Scholar 

  21. 21.

    Sen D, Brown CJ, Top EM, Sullivan J. Inferring the evolutionary history of IncP-1 plasmids despite incongruence among backbone gene trees. Mol Biol Evol. 2013;30(1):154–66.

    CAS  PubMed  Google Scholar 

  22. 22.

    Brown CJ, Sen D, Yano H, Bauer ML, Rogers LM, Van der Auwera GA, Top EM. Diverse broad-host-range plasmids from freshwater carry few accessory genes. Appl Environ Microbiol. 2013;79(24):7684–95.

    CAS  PubMed  PubMed Central  Google Scholar 

  23. 23.

    Grumbt B, Eck SH, Hinrichsen T, Hirv K. Diagnostic applications of next generation sequencing in Immunogenetics and molecular oncology. Transfus Med Hemoth. 2013;40(3):196–206.

    Google Scholar 

  24. 24.

    Bybee SM, Bracken-Grissom H, Haynes BD, Hermansen RA, Byers RL, Clement MJ, Udall JA, Wilcox ER, Crandall KA. Targeted amplicon sequencing (TAS): a scalable next-gen approach to multilocus, multitaxa phylogenetics. Genome Biol Evol. 2011;3:1312–23.

    CAS  PubMed  PubMed Central  Google Scholar 

  25. 25.

    Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975;94(3):441–8.

    CAS  PubMed  Google Scholar 

  26. 26.

    Strous M, Kraft B, Bisdorf R, Tegetmeyer HE. The binning of metagenomic contigs for microbial physiology of mixed cultures. Front Microbiol. 2012;3:410.

    PubMed  PubMed Central  Google Scholar 

  27. 27.

    Liang F, Holt I, Pertea G, Karamycheva S, Salzberg SL, Quackenbush J. An optimized protocol for analysis of EST sequences. Nucleic Acids Res. 2000;28(18):3657–65.

    CAS  PubMed  PubMed Central  Google Scholar 

  28. 28.

    Parsons JD, Rodriguez-Tome P. JESAM: CORBA software components to create and publish EST alignments and clusters. Bioinformatics. 2000;16(4):313–25.

    CAS  PubMed  Google Scholar 

  29. 29.

    Theologis A. Goodbye to ‘one by one’ genetics. Genome Biol. 2001;2(4):COMMENT2004.

    CAS  PubMed  PubMed Central  Google Scholar 

  30. 30.

    Ochoa A, Storey JD, Llinas M, Singh M. Beyond the E-Value: stratified statistics for protein domain prediction. Plos Comput Biol. 2015;11(11):e1004509.

    PubMed  PubMed Central  Google Scholar 

  31. 31.

    O'Donnell JL, Kelly RP, Lowell NC, Port JA. Indexed PCR primers induce template-specific Bias in large-scale DNA sequencing studies. PLoS One. 2016;11(3):e0148698.

    PubMed  PubMed Central  Google Scholar 

  32. 32.

    Chen YC, Liu TL, Yu CH, Chiang TY, Hwang CC. Effects of GC bias in next-generation-sequencing data on De Novo genome assembly. PLoS One. 2013;8(4):e62856.

    CAS  PubMed  PubMed Central  Google Scholar 

  33. 33.

    Kozarewa I, Ning ZM, Quail MA, Sanders MJ, Berriman M, Turner DJ. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G plus C)-biased genomes. Nat Methods. 2009;6(4):291–5.

    CAS  PubMed  PubMed Central  Google Scholar 

  34. 34.

    Staroscik A (2004) Calculator for determining the number of copies of a template. URI Genomics & Sequencing Center.

    Google Scholar 

  35. 35.

    Chen CY. DNA polymerases drive DNA sequencing-by-synthesis technologies: both past and present. Front Microbiol. 2014;5:305.

    PubMed  PubMed Central  Google Scholar 

  36. 36.

    Kuang J, Yan X, Genders AJ, Granata C, Bishop DJ. An overview of technical considerations when using quantitative real-time PCR analysis of gene expression in human exercise research. PLoS One. 2018;13(5):e0196438.

    PubMed  PubMed Central  Google Scholar 

  37. 37.

    Smalla K, Jechalke S, Top EM. Plasmid Detection, Characterization, and Ecology. Microbiol Spectr. 2015;3(1):PLAS-0038-2014.

    PubMed  Google Scholar 

  38. 38.

    Pareek CS, Smoczynski R, Tretyn A. Sequencing technologies and genome sequencing. J Appl Genet. 2011;52(4):413–35.

    CAS  PubMed  PubMed Central  Google Scholar 

  39. 39.

    Bowden R, Davies RW, Heger A, Pagnamenta AT, de Cesare M, Oikkonen LE, Parkes D, Freeman C, Dhalla F, Patel SY, Popitsch N, Ip CLC, Roberts HE, Salatino S, Lockstone H, Lunter G, Taylor JC, Buck D, Simpson MA, Donnelly P. Sequencing of human genomes with nanopore technology. Nat Commun. 2019;10:1869.

    PubMed  PubMed Central  Google Scholar 

  40. 40.

    Thompson JF, Milos PM. The properties and applications of single-molecule DNA sequencing. Genome Biol. 2011;12(2):217.

    CAS  PubMed  PubMed Central  Google Scholar 

  41. 41.

    Totomoch-Serra A, Marquez MF, Cervantes-Barragan DE. Sanger sequencing as a first-line approach for molecular diagnosis of Andersen-Tawil syndrome. F1000Res. 2017;6:1016.

    PubMed  PubMed Central  Google Scholar 

  42. 42.

    Zheng J, Zhang H, Banerjee S, Li Y, Zhou J, Yang Q, Tan X, Han P, Fu Q, Cui X, Yuan Y, Zhang M, Shen R, Song H, Zhang X, Zhao L, Peng Z, Wang W, Yin Y. A comprehensive assessment of next-generation sequencing variants validation using a secondary technology. Mol Gen Genomic Med. 2019;7(7):e00748.

    Google Scholar 

Download references


Not applicable.


This publication was supported by the German Research Foundation (DFG) and the University of Würzburg in the funding program Open Access Publishing. We also thank the Land of Bavaria for funding (DFG project 324392634 /TR221). This work, including publication costs, was funded by the University of Würzburg.

Author information




This project was conceived and developed by SS, CF, and TD. The experimental validation of theoretical results was performed by EB. The authors read and approved the final manuscript.

Corresponding authors

Correspondence to Sergey Shityakov or Thomas Dandekar.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Input and output data for modeling of shotgun sequencing of DNA plasmids. All the necessary files (tested in Linux environment) required for the virtual sequencing, including executable programs, bash scripts, FASTA format files, and program outputs.

Additional file 2.

PCR results for plasmid vectors. To validate the in silico of DNA plasmids, the PCR methodology was applied to calculate the number of DNA copies produced by amplifying the different lengths (100, 400, and 800 nt) sequences of the plasmid vectors.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Shityakov, S., Bencurova, E., Förster, C. et al. Modeling of shotgun sequencing of DNA plasmids using experimental and theoretical approaches. BMC Bioinformatics 21, 132 (2020).

Download citation


  • Shotgun method
  • Sanger sequencing
  • Virtual sequencing
  • Polymerase chain reaction
  • Gene expression vectors
  • Synthetic biology