Generation and comparative genomics of synthetic dengue viruses.

Background
Synthetic virology is an important multidisciplinary scientific field, with emerging applications in biotechnology and medicine, that aims to develop methods to generate and engineer synthetic viruses. In particular, many RNA viruses, including Dengue and Zika among others, are widespread pathogens of significant importance to human health. The ability to design and synthesize such viruses may contribute to exploring novel approaches for developing vaccines and virus-based therapies.

Results
Here we develop a full multidisciplinary pipeline for the generation and analysis of synthetic RNA viruses and specifically apply it to Dengue virus serotype 2 (DENV-2). The major steps of the pipeline include comparative genomics of endogenous and synthetic viral strains. Specifically, we show that although the synthetic DENV-2 viruses were found to have lower nucleotide variability, their phenotype, as reflected in AG129 mouse model morbidity, RNA levels, and neutralizing antibodies, is similar to, or even more pathogenic than, that of the wildtype master strain. Additionally, the highly variable positions identified in the analyzed DENV-2 population were found to overlap with less conserved homologous positions in Zika virus and other Dengue serotypes. These results may suggest that synthetic DENV-2 could enhance virulence if the correct sequence is selected.

Conclusions
The approach reported in this study can be used to generate and analyze synthetic RNA viruses at both the genotypic and the phenotypic level. It could be applied to understanding the functionality and the fitness effects of any set of mutations in viral RNA and to editing RNA viruses for various target applications.

Electronic supplementary material
The online version of this article (10.1186/s12859-018-2132-3) contains supplementary material, which is available to authorized users.

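aligned sequences) which are equal to each other and $P_{ijk} = 0$ otherwise. The score $S_i$ for the $i$-th column is

$$S_i = \frac{1}{N(N-1)/2} \sum_{j=1}^{N-1} \sum_{k=j+1}^{N} P_{ijk}$$

where $N$ is the number of aligned sequences, and the SP for the alignment is the average of the column scores over all $L$ alignment columns:

$$SP = \frac{1}{L} \sum_{i=1}^{L} S_i$$

The following values summarize the SP scores for the multiple alignment of 618 DENV-2 coding sequences analyzed in this study: SP(amino acids) = 0.97, SP(nucleotides) = 0.94.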
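The column score and the alignment-level SP can be illustrated with a short Python sketch. It assumes that the alignment is given as equal-length strings, that gap characters are treated like any other symbol, and that the alignment SP is the mean of the column scores (consistent with the reported values lying between 0 and 1). The function names are illustrative and are not taken from the study.

```python
from collections import Counter


def column_score(column):
    """Fraction of sequence pairs in one alignment column that share the same symbol."""
    n = len(column)
    if n < 2:
        raise ValueError("at least two aligned sequences are required")
    n_pairs = n * (n - 1) // 2
    # A symbol seen c times contributes c*(c-1)/2 identical pairs.
    matches = sum(c * (c - 1) // 2 for c in Counter(column).values())
    return matches / n_pairs


def sp_score(alignment):
    """Mean column score over all columns (assumed definition of the alignment SP)."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(*alignment))
    return sum(column_score(col) for col in columns) / len(columns)


# Toy alignment of three sequences (hypothetical data).
toy_alignment = ["ACGT", "ACGA", "ACCT"]
toy_sp = sp_score(toy_alignment)
```

Counting identical pairs per symbol avoids enumerating all sequence pairs explicitly, which matters for alignments with hundreds of sequences.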

The Effective Number of Codons (ENC)
The Effective Number of Codons (ENC) is a measure that quantifies how far the codon usage of a coding sequence departs from equal usage of synonymous codons. For each amino acid (AA) with $k$ synonymous codons, let us define $n_i$ ($i = 1, \dots, k$) to be the number of occurrences of its synonymous codon of type $i$ in the sequence, and $n$ to be the number of times this AA appears in the sequence:

$$n = \sum_{i=1}^{k} n_i$$

The frequency of each codon is therefore:

$$p_i = \frac{n_i}{n}$$

The ENC for a specific AA is:

$$ENC_{AA} = \frac{1}{F}, \qquad F = \frac{n \sum_{i=1}^{k} p_i^2 - 1}{n - 1}$$

In case of a missing AA, the corresponding effective number of codons is defined as an average over the given AAs of the same degeneracy.
Finally, the ENC for a gene, computed over the entire coding sequence, is defined as the sum of the group-average ENCs over all the degeneracy AA groups, each weighted by the number of AAs in that group.
ENC can take values from 20, in the case of extreme bias where one codon is exclusively used for each amino acid (AA), to 61, when the alternative synonymous codons are equally likely to be used. Therefore, smaller ENC values correspond to a stronger bias in synonymous codon usage; consequently, a negative correlation with ENC values is equivalent to a positive correlation with synonymous codon usage bias.
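The ENC computation described above can be sketched in Python as follows. This is an illustration under several assumptions rather than the code used in the study: it uses the standard genetic code, takes the per-AA value as the reciprocal of the homozygosity $F = (n \sum p_i^2 - 1)/(n - 1)$ as in Wright's original ENC, treats an AA as missing when it occurs fewer than twice or when $F \le 0$, and caps the result at 61. The function name and the capping are choices made for this sketch.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Synonymous codons of every amino acid (stop codons excluded).
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        SYNONYMS[_aa].append(_codon)


def effective_number_of_codons(cds):
    """ENC of one coding sequence (sketch; see the stated assumptions)."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))

    per_aa = {}
    for aa, codons in SYNONYMS.items():
        if len(codons) == 1:          # Met and Trp always contribute 1
            per_aa[aa] = 1.0
            continue
        n = sum(counts[c] for c in codons)
        if n < 2:                     # treated as a missing AA (assumption)
            continue
        f = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
        if f > 0:
            per_aa[aa] = 1.0 / f

    # Group the available per-AA values by degeneracy class.
    by_degeneracy = defaultdict(list)
    for aa, value in per_aa.items():
        by_degeneracy[len(SYNONYMS[aa])].append(value)

    total = 0.0
    for aa, codons in SYNONYMS.items():
        if aa in per_aa:
            total += per_aa[aa]
        else:
            group = by_degeneracy[len(codons)]
            if not group:
                raise ValueError(f"no amino acid of degeneracy {len(codons)} is present")
            total += sum(group) / len(group)   # fill in with the class average
    return min(total, 61.0)                    # cap for finite-sample overshoot


# Extreme bias: a single codon per amino acid, each used three times.
biased_cds = "".join(codons[0] * 3 for codons in SYNONYMS.values())
# Equal usage: every sense codon used 50 times.
uniform_cds = "".join(c * 50 for codons in SYNONYMS.values() for c in codons)
```

The two toy sequences reproduce the limits quoted in the text: `biased_cds` gives 20 and `uniform_cds` gives 61 (after capping).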

Codon Pairs Bias (CPB)
To quantify codon pair bias, we define a codon pair score (CPS) as the log ratio of the observed over the expected number of occurrences of this codon pair in the coding sequence. To achieve independence from amino acid and codon bias, the expected frequency is calculated based on the relative proportion of the number of times an amino acid is encoded by a specific codon:

$$CPS(AB) = \ln\left(\frac{F(AB)}{\dfrac{F(A)\,F(B)}{F(X)\,F(Y)}\,F(XY)}\right)$$

where the codon pair AB encodes the amino acid pair XY and F denotes the number of occurrences. The codon pair bias (CPB) of a virus is then defined as the average of the codon pair scores over all $K$ codon pairs comprising all viral coding sequences:

$$CPB = \frac{1}{K} \sum_{l=1}^{K} CPS_l$$
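A minimal Python sketch of the CPS and CPB calculation is given below. It assumes the standard genetic code, in-frame coding sequences, the commonly used form CPS(AB) = ln(F(AB) / (F(A)F(B) / (F(X)F(Y)) · F(XY))), and that pairs involving stop or non-standard codons are skipped. The CPB is taken as the mean CPS over every codon-pair occurrence. Function names are illustrative only.

```python
from collections import Counter
from itertools import product
from math import log

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def _sense_codons(cds):
    """Split a CDS into codons; stop or non-standard codons become None."""
    cds = cds.upper().replace("U", "T")
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        yield codon if CODON_TABLE.get(codon, "*") != "*" else None


def codon_pair_scores(coding_sequences):
    """Return ({(A, B): CPS}, {(A, B): observed count}) pooled over all sequences."""
    codon_n, aa_n, pair_n, aa_pair_n = Counter(), Counter(), Counter(), Counter()
    for cds in coding_sequences:
        codons = list(_sense_codons(cds))
        for c in codons:
            if c is not None:
                codon_n[c] += 1
                aa_n[CODON_TABLE[c]] += 1
        for a, b in zip(codons, codons[1:]):       # adjacent codon pairs
            if a is not None and b is not None:
                pair_n[a, b] += 1
                aa_pair_n[CODON_TABLE[a], CODON_TABLE[b]] += 1

    scores = {}
    for (a, b), observed in pair_n.items():
        x, y = CODON_TABLE[a], CODON_TABLE[b]
        expected = codon_n[a] * codon_n[b] / (aa_n[x] * aa_n[y]) * aa_pair_n[x, y]
        scores[a, b] = log(observed / expected)
    return scores, pair_n


def codon_pair_bias(coding_sequences):
    """Mean CPS over all codon-pair occurrences in the given sequences."""
    scores, pair_n = codon_pair_scores(coding_sequences)
    total = sum(pair_n.values())
    if total == 0:
        raise ValueError("no codon pairs found")
    return sum(scores[pair] * n for pair, n in pair_n.items()) / total


# Toy example (hypothetical sequence): alternating alanine codons GCT and GCC.
toy_cpb = codon_pair_bias(["GCTGCCGCTGCCGCT"])
```

In the toy sequence each observed pair occurs twice while 0.96 occurrences are expected, so the CPB equals ln(2/0.96), a positive value indicating over-represented pairs.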

The dinucleotide pair bias (DNTB)
The dinucleotide pair bias (DNTB) of a virus is defined as the average of the dinucleotide scores (DNTS) over all $K$ dinucleotides comprising all viral sequences:

$$DNTB = \frac{1}{K} \sum_{l=1}^{K} DNTS_l$$

The GC content is defined as:

$$GC = \frac{F(G) + F(C)}{F(A) + F(G) + F(C) + F(T)}$$

where F(·) is the number of occurrences of each of the nucleotides A, G, C, and T.

CpG Content
We compute a dinucleotide score (DNTS) for a pair of nucleotides XY as an odds ratio of observed over expected frequencies:

$$DNTS(XY) = \frac{F(XY)}{F(X)\,F(Y)}$$

where F denotes the frequency of occurrences.
Specifically, the CpG score is equal to the DNTS corresponding to the CG dinucleotide.
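The dinucleotide measures can be sketched in Python as below, together with the GC content defined earlier. The sketch assumes a single nucleotide sequence, overlapping dinucleotides (positions 1-2, 2-3, and so on), frequencies computed as counts divided by the number of valid positions, and that symbols other than A, C, G, T are ignored. The DNTB is taken as the mean DNTS over every dinucleotide occurrence. All function names are illustrative.

```python
from collections import Counter

NUCLEOTIDES = set("ACGT")


def _counts(seq):
    """Count single nucleotides and overlapping dinucleotides made of A, C, G, T."""
    seq = seq.upper().replace("U", "T")
    mono = Counter(b for b in seq if b in NUCLEOTIDES)
    pairs = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in NUCLEOTIDES and seq[i + 1] in NUCLEOTIDES
    )
    return mono, pairs


def dinucleotide_scores(seq):
    """DNTS(XY) = f(XY) / (f(X) * f(Y)) for every dinucleotide observed in seq."""
    mono, pairs = _counts(seq)
    n_mono, n_pairs = sum(mono.values()), sum(pairs.values())
    return {
        xy: (count / n_pairs) / ((mono[xy[0]] / n_mono) * (mono[xy[1]] / n_mono))
        for xy, count in pairs.items()
    }


def cpg_score(seq):
    """DNTS of the CG dinucleotide (0.0 if CG never occurs)."""
    return dinucleotide_scores(seq).get("CG", 0.0)


def dinucleotide_bias(seq):
    """Mean DNTS over all dinucleotide occurrences in seq."""
    _, pairs = _counts(seq)
    scores = dinucleotide_scores(seq)
    total = sum(pairs.values())
    if total == 0:
        raise ValueError("sequence has no valid dinucleotides")
    return sum(scores[xy] * n for xy, n in pairs.items()) / total


def gc_content(seq):
    """(F(G) + F(C)) / (F(A) + F(G) + F(C) + F(T))."""
    mono, _ = _counts(seq)
    return (mono["G"] + mono["C"]) / sum(mono.values())


# Toy sequence (hypothetical): one CG among three dinucleotides.
toy_cpg = cpg_score("ACGT")
```

For the toy sequence `ACGT`, CG accounts for 1 of 3 dinucleotides while C and G each account for 1 of 4 nucleotides, so the CpG score is (1/3) / (1/16) = 16/3.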

List of regions selected for strong/weak folding energy used in this study
Coordinates of regions predicted to be selected for strong/weak folding energy can be found in the following tables (see details in reference [16] in main text). Each row in a file corresponds to one region (number of rows = number of regions) and contains 3 comma-separated values x, y, z in the following order: region start coordinate, region end coordinate, maximum folding selection conservation index (FSCI) in the cluster.
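Files in the row format described above could be read with a few lines of Python. The sketch assumes that the coordinates are integers, that the FSCI is a decimal number, and that the files have no header row. The `Region` type and `parse_regions` function are hypothetical names introduced here for illustration.

```python
import csv
from typing import Iterable, List, NamedTuple


class Region(NamedTuple):
    start: int        # region start coordinate (x)
    end: int          # region end coordinate (y)
    max_fsci: float   # maximum FSCI in the cluster (z)


def parse_regions(lines: Iterable[str]) -> List[Region]:
    """Parse rows of the form 'x, y, z' into Region records, skipping blank rows."""
    regions = []
    for row in csv.reader(lines):
        if not row:
            continue
        start, end, max_fsci = (field.strip() for field in row)
        regions.append(Region(int(start), int(end), float(max_fsci)))
    return regions


# Hypothetical rows, for illustration only (not data from the study).
example_regions = parse_regions(["10,50,0.8", "100, 160, 0.95"])
```

Because `csv.reader` accepts any iterable of strings, the same function works on an open file handle, for example `with open(path, newline="") as handle: parse_regions(handle)`, where `path` points to one of the supplementary tables.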