Gene design
Visual Gene Developer has a hierarchical and expandable system to define a gene construct and gene construct components. A gene construct is defined as a full length sequence that consists of several gene construct components as building blocks. Each gene construct component has a collection of predefined properties that is referred to as gene construct component type. A property which has its own name and storage space works as a variable that can be used to store information such as DNA sequence. A gene construct component type determines the data structure of a gene construct component, and each gene construct component can possess several different sequence or non-sequence data.
As an example, a gene construct may have four gene construct components in consecutive order: a 5' HindIII restriction enzyme site, a Shine-Dalgarno sequence, a GFP coding sequence, and a 3' multiple cloning site. Each gene construct component belongs to one of the predefined gene construct component types such as 'Coding sequence', 'Non-coding sequence', 'Restriction enzyme site', or 'Multiple cloning site'. The 'Coding sequence' type has three properties named 'Original AA', 'Original DNA', and 'Modified DNA', where AA stands for amino acid. Each property holds amino acid sequence or DNA sequence information. In contrast to 'Coding sequence', a 'Non-coding sequence' such as the Shine-Dalgarno sequence doesn't need amino acid sequence data. Therefore, the gene construct component type of 'Non-coding sequence' has only two properties: 'Original DNA' and 'Modified DNA'.
For reference, the 'Original DNA' sequence data can be used only for special purposes such as to calculate mismatched bases or codons between the original and variant sequences whereas 'Modified DNA' is inevitably used as an essential property that most modules utilize to modify, optimize, analyze, read, and write. With regard to sequence or data size, the software permits both variable length and null size (zero length) of a sequence. Therefore a user can modify the sequence length during the gene optimization process and even generate an "invisible" gene construct component that will not be shown in the 'Gene Construct View' window by setting the 'Modified DNA' to be null.
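The component model described above can be pictured as a small data structure. The following Python sketch is illustrative only: the class name, field names, and the `full_sequence` helper are hypothetical and do not reflect the software's internal implementation. Only the two component types whose properties are listed in the text are modeled.

```python
from dataclasses import dataclass, field

# Property names per component type, as listed in the text.
COMPONENT_TYPES = {
    "Coding sequence": ("Original AA", "Original DNA", "Modified DNA"),
    "Non-coding sequence": ("Original DNA", "Modified DNA"),
}

@dataclass
class Component:  # hypothetical name
    name: str
    type_name: str
    properties: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every property defined by the type exists; unset ones are empty (null size).
        for prop in COMPONENT_TYPES[self.type_name]:
            self.properties.setdefault(prop, "")

def full_sequence(components):
    """Concatenate 'Modified DNA' in order; null components contribute nothing."""
    return "".join(c.properties["Modified DNA"] for c in components)

construct = [
    Component("HindIII site", "Non-coding sequence", {"Modified DNA": "AAGCTT"}),
    Component("Shine-Dalgarno", "Non-coding sequence", {"Modified DNA": "AGGAGG"}),
    Component("Hidden component", "Non-coding sequence"),  # null 'Modified DNA'
    Component("Start codon", "Coding sequence",
              {"Original AA": "M", "Modified DNA": "ATG"}),
]
print(full_sequence(construct))
```

The component with a null 'Modified DNA' stays in the list but adds nothing to the full-length sequence, which mirrors the "invisible" component described above.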
Furthermore, the software lets a user define new gene construct component types to hold unique information and functions. A gene construct component type can be easily designed in the 'PropertyBag Editor' and 'Module Editor' windows. However, to be functional, it is usually necessary to develop new modules or modify existing modules to handle new properties.
As a main part of the user interface, the software has the 'Gene Construct Designer' window. A user can design a gene construct by adding or deleting gene construct components, or changing a location of a gene construct component. When adding a new gene construct component to a gene construct, a user can choose a gene construct component type among the defined gene construct component types that are listed in a drop down list box in the window.
Sequence analysis
Visual Gene Developer includes basic functions to manipulate a sequence such as sequence parsing, back translation, and conversion into the reverse sequence or the complementary sequence. After putting the source sequence into a text box in the 'Workspace' window, a user can click on one of the main menus to perform the function. Alternatively, a user can develop a module to modify the input sequence in the 'Workspace' window and then expose the module on the 'Toolbox' window as a typical menu item. For gene analysis purposes, the software includes several algorithms for calculating the CAI, GC content, Nc, and codon usage table, and for performing sequence comparisons, repeated sequence searches, multiple sequence searches, and mRNA secondary structure prediction. The software also supports batch processing to analyze several thousand genes. After setup in the 'Gene Optimization' and 'Gene Construct Designer' windows, a user can import an ASCII text file that contains multiple sequences and then check the analysis result in the 'Gene Optimization' window. Owing to the programming capability, a user can make use of implemented classes and add new gene analysis metrics or predictions to Visual Gene Developer.
(1) CAI (Codon adaptation index)
The CAI has been widely used as an effective measure of synonymous codon usage bias. It was originally proposed by Sharp and Li to quantify the extent of codon usage similarity between a reference set of genes and a gene of interest [33]. The CAI ranges from 0 to 1, where a higher CAI means stronger codon bias or higher codon usage similarity between two different codon usage tables. In order to calculate the CAI, we follow the same procedure and make use of the original definition given by Sharp and Li [33]. For reference, the software calculates the RSCU (Relative Synonymous Codon Usage) from a codon usage table of a reference gene and then computes the $w_i$ (relative adaptiveness of a codon) value for each codon by dividing RSCU by $\mathrm{RSCU}_{max}$. Finally, the CAI value can be calculated using the following equation.
$$\mathrm{CAI} = \exp\left(\frac{1}{L}\sum_{i=1}^{18}\sum_{j=1}^{k_i} X_{ij}\,\ln w_{ij}\right)$$
where $X_{ij}$ is the total number of the j th codon for the i th amino acid in the test gene, $w_{ij}$ is the relative adaptiveness of the j th codon for the i th amino acid in the reference gene, $k_i$ is the number of synonymous codons for the i th amino acid, and L is the total number of codons excluding AUG (Met) and UGG (Trp) in the test gene. As a special case, if $w_{ij}$ is smaller than 0.01, it is adjusted to 0.01 [34].
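The CAI calculation described above can be sketched in Python as follows. This is a minimal illustration, not the software's code: the function names are hypothetical, sequences are written as DNA (ATG and TGG in place of AUG and UGG), and treating an amino acid that is absent from the reference table as neutral (w = 1) is an assumption of this sketch. Dividing a codon count by the largest count among its synonyms is equivalent to dividing RSCU by its maximum.

```python
import math

# Standard genetic code, bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def relative_adaptiveness(ref_counts):
    """w for every codon except Met, Trp, and stop codons, floored at 0.01."""
    w = {}
    for aa in set(CODON_TO_AA.values()) - {"*", "M", "W"}:
        family = [c for c, a in CODON_TO_AA.items() if a == aa]
        top = max(ref_counts.get(c, 0) for c in family)
        for c in family:
            # Assumption: amino acids missing from the reference get w = 1.
            w[c] = max(ref_counts.get(c, 0) / top, 0.01) if top else 1.0
    return w

def cai(seq, w):
    """Geometric mean of w over the scorable codons of an in-frame sequence."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(w[c]) for c in codons if c in w]  # skips ATG, TGG, stops
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))

reference = {"GCT": 40, "GCC": 20, "GCA": 10}  # toy reference counts
w = relative_adaptiveness(reference)
print(cai("ATGGCTGCC", w))
```

In the toy example the two alanine codons have w = 1 and w = 0.5, so the CAI is the geometric mean of the two values; ATG is ignored, as in the definition of L.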
(2) Nc (Effective number of codons)
This quantity was originally defined by Wright [35] to measure the degree of codon bias. It is a number between 20 and 61, where 20 means extremely biased and 61 means that all synonymous codons are used equally. In contrast to the CAI, the calculation of Nc doesn't need a reference codon usage table. First of all, the software calculates the codon homozygosity ($F_i$) of each amino acid [35].
$$F_i = \frac{n\sum_{j=1}^{k_i} p_j^2 - 1}{n - 1}$$
where $F_i$ is the codon homozygosity of the i th amino acid, n is the total number of occurrences of that amino acid in the test gene, $k_i$ is the total number of synonymous codons of the i th amino acid, and $p_j$ is the codon frequency of the j th allele (synonymous codon).
The effective number of codons is then calculated by summation of the average homozygosities.
$$N_c = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}$$
where $\bar{F}_m$ (m = 2, 3, 4, or 6) is the average homozygosity for the amino acids whose total number of synonymous codons is m. For example, $\bar{F}_6 = (F_{Leu} + F_{Ser} + F_{Arg})/3$.
As Wright suggested, if some amino acids are missing then Visual Gene Developer computes the average homozygosity by taking an average of the homozygosities of the amino acids present in the test gene. If isoleucine is absent or rarely used, Fuglsang's estimation is used to calculate $\bar{F}_3$ [36].
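A minimal Python sketch of the Nc calculation follows, using Wright's weights 9, 1, 5, and 3 for the two-, three-, four-, and six-fold degenerate classes of the standard genetic code. The function name is hypothetical. Two simplifications are assumptions of this sketch: the result is capped at 61, and a degeneracy class with no usable amino acid raises an error instead of applying the Fuglsang estimate mentioned above.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def effective_number_of_codons(seq):
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(codons)
    family = defaultdict(list)          # amino acid -> its synonymous codons
    for codon, aa in CODON_TO_AA.items():
        if aa != "*":
            family[aa].append(codon)
    f_by_class = defaultdict(list)      # degeneracy m -> homozygosities F_i
    for syn in family.values():
        k = len(syn)
        n = sum(counts[c] for c in syn)
        if k == 1 or n < 2:             # Met, Trp, or too few observations
            continue
        f = (n * sum((counts[c] / n) ** 2 for c in syn) - 1) / (n - 1)
        f_by_class[k].append(f)
    nc = 2.0                            # Met and Trp contribute one codon each
    for m, weight in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_class[m]
        if not values or sum(values) == 0:
            raise ValueError(f"no usable amino acid with {m} synonymous codons")
        nc += weight / (sum(values) / len(values))
    return min(nc, 61.0)                # assumption: cap at the theoretical maximum

# Each amino acid present uses a single codon: the extreme-bias limit.
print(effective_number_of_codons("TTTTTTATTATTGCTGCTCTGCTG"))
```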
(3) Repeated sequence search
The software allows a user to identify repeated sequences in a test sequence. It detects not only forward repeats and reverse complementary repeats but also palindromic sequences and consecutively connected (tandem) repeats. To find repeated sequences, a moving window method was employed. The algorithm clips a short sequence from the test sequence and then compares this partial sequence with the test sequence to find matching locations. The search is repeated while the moving window scans along the test sequence. When the scan is complete, findings that are already contained in other findings are removed as duplicates. This feature is named 'Smart filter' in the 'Search sequences' window.
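The moving window search can be sketched as below. This is an illustration under stated assumptions, not the software's algorithm: the function names and the tuple layout are hypothetical, positions are 0-based, the window length is fixed, and the 'Smart filter' step that removes findings contained in other findings is not reproduced.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_all(seq, sub, start):
    """Yield every start position of sub in seq at or after start (overlaps allowed)."""
    pos = seq.find(sub, start)
    while pos != -1:
        yield pos
        pos = seq.find(sub, pos + 1)

def find_repeats(seq, window=6):
    """Return (kind, window_start, match_start, fragment) tuples."""
    hits = []
    for i in range(len(seq) - window + 1):
        frag = seq[i:i + window]            # sequence clipped by the window
        rc = reverse_complement(frag)
        if frag == rc:
            hits.append(("palindrome", i, i, frag))
        for j in find_all(seq, frag, i + 1):
            hits.append(("forward", i, j, frag))
        if frag != rc:                      # palindromes are already reported
            for j in find_all(seq, rc, i + 1):
                hits.append(("inverted", i, j, frag))
    return hits

for hit in find_repeats("GAATTCAAAGAATTC"):
    print(hit)
```

Searching only downstream of the window (from `i + 1`) reports each pair of locations once.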
(4) Multiple query sequence search
This function was developed to identify the locations of query sequences within a sequence. A user can input a set of multiple query sequences or restriction enzyme names separated by Tab, comma (,), or Carriage Return (Enter key) in the 'Search sequences' window. For in-depth analysis, the query string is split into individual query sequences. Restriction enzyme names are converted into their DNA recognition sequences. After searching the test sequence for every query sequence, the software shows the total number of occurrences and their locations in the gene for each query. A user can also choose one of the predefined sequence sets such as common restriction enzyme sites, potential intron cryptic splice sites, or polyadenylation signal sequences.
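A minimal Python sketch of the query handling described above is given below. The function name and the three-entry enzyme table are hypothetical stand-ins for the software's predefined sets, and positions are reported 0-based.

```python
import re

# Small illustrative table; the software ships its own predefined sets.
ENZYME_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def search_queries(seq, query_string):
    """Map each query (sequence or enzyme name) to its start positions in seq."""
    results = {}
    for token in re.split(r"[\t,\r\n]+", query_string):
        token = token.strip()
        if not token:
            continue
        site = ENZYME_SITES.get(token, token.upper())
        # A lookahead match has zero width, so overlapping hits are also found.
        pattern = "(?=" + re.escape(site) + ")"
        results[token] = [m.start() for m in re.finditer(pattern, seq)]
    return results

found = search_queries("AAGCTTGGGAATTCAAAA", "HindIII,EcoRI\tAAA\nGGATCC")
for query, positions in found.items():
    print(query, len(positions), positions)
```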
(5) Profile calculation of CAI, mRNA Gibbs free energy, or GC content
The software contains three implemented modules that calculate a profile of CAI, mRNA binding energy, or GC content of a test sequence. Their algorithms are very similar because all three adopt the moving window approach and their codes were developed from the same template. In general, any single calculation such as GC content can be performed repeatedly while a moving window slides over a test sequence. The procedure starts with the moving window located at the first base of the test sequence. For example, the mRNA binding energy of the first 60 bases is calculated if the size of the moving window, which can be adjusted by the user, is set to 60 bases. After the first calculation, the moving window steps forward to the next location, for example to the 11th base of the test sequence if the step size is 10 bases. In this way, the mRNA binding energy is repeatedly computed at an interval of 10 bases until the moving window arrives at the end of the sequence. To generate data for a profile plot, both the location of the moving window and its corresponding mRNA binding energy are recorded in a table. Since the codes were written in VBScript, a user can easily modify the source code to develop new profiling functions.
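The profile procedure can be illustrated with GC content, the simplest of the three metrics. The sketch below is written in Python instead of the VBScript used by the software, and the function name is hypothetical. Window locations are reported 1-based to match the description above.

```python
def gc_profile(seq, window=60, step=10):
    """Return (window start, GC fraction) rows for a sliding window."""
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        piece = seq[start:start + window]
        gc = (piece.count("G") + piece.count("C")) / window
        rows.append((start + 1, gc))   # 1-based location of the window
    return rows

# Short toy sequence with a small window so the rows are easy to check by hand.
for location, gc in gc_profile("GCGCGCGCGCATATATATAT", window=10, step=5):
    print(location, gc)
```

Replacing the GC calculation with another per-window metric gives the other profiles; only the body of the loop changes.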
Sequence optimization
Visual Gene Developer contains useful modules to optimize a gene construct in terms of codon usage, mRNA binding energy, known conserved sequence, or undesirable sequence. Owing to programming capability, a user can develop new modules utilizing simplified helper functions of the classes mentioned earlier.
(1) Codon optimization
The software provides a predefined module that is based on a well-known Monte-Carlo simulation or a predefined probability table [13, 15, 19, 20]. It utilizes a codon usage table and replaces original codons with new ones while keeping the encoded amino acids unchanged. To be specific, Visual Gene Developer not only has a function to import codon usage tables from CUTG (Codon Usage Tabulated from GenBank) but also provides a manual edit mode for the target codon usage map and allows a user to generate a local database of reference sets of optimal codon usage tables. The software automatically calculates the RSCU, $\mathrm{RSCU}_{max}$, and $w_i$ values, and then generates a look-up table (LUT) of synonymous codons. For example, if alanine has four synonymous codons GCA, GCC, GCG, and GCT whose expected fractions are 0.1, 0.2, 0.3, and 0.4, respectively, the LUT will consist of 100 GCAs from 1 to 100, 200 GCCs from 101 to 300, 300 GCGs from 301 to 600, and 400 GCTs from 601 to 1000 in a memory array. Finally, one of the 1000 codons is randomly chosen and replaces the original codon. By utilizing the look-up table, it is possible to perform codon optimization very quickly. In addition, the software has a predefined function that allows a user to keep track of changes in codon usage bias as a graphical representation. Meanwhile, since the current version of the software doesn't have a built-in database of optimal codon usage maps of highly expressed genes, a user may need to rely on other available sources, including papers and web databases, to obtain an optimal codon usage map for a specific host genome and then put it into Visual Gene Developer.
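The look-up table idea can be sketched in a few lines of Python. The function names are hypothetical and only the alanine example from the text is modeled; array indices are 0-based here, so the first GCC sits at index 100.

```python
import random

def build_lut(fractions, size=1000):
    """Fill an array with each codon repeated in proportion to its fraction."""
    lut = []
    for codon, fraction in fractions.items():
        lut.extend([codon] * round(fraction * size))
    return lut

ALA = {"GCA": 0.1, "GCC": 0.2, "GCG": 0.3, "GCT": 0.4}
ala_lut = build_lut(ALA)

def resample_alanines(codons, lut, rng):
    """Replace every alanine codon by a random draw from the LUT."""
    return [rng.choice(lut) if c in ALA else c for c in codons]

rng = random.Random(0)  # seeded only to make the demonstration repeatable
print(resample_alanines(["ATG", "GCA", "GCA", "TAA"], ala_lut, rng))
```

Drawing a uniform random index into the array reproduces the target fractions without any per-draw arithmetic, which is why the approach is fast.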
(2) mRNA optimization
In order to optimize a gene in terms of mRNA binding energy, the algorithm was developed utilizing both mRNA prediction and codon optimization modules. At the code level, an original sequence is continuously modified until its Gibbs free energy is in a specific range given by minimum or maximum Gibbs free energy where the modification refers to synonymous substitution. With regard to the modification strategy, the simplest approach is that the number of mutations is gradually increased one by one when the calculated Gibbs free energy is out of range. Meanwhile, the base module was also used to develop more complicated modules. For example, a binding energy profile of a long test sequence can be optimized by repeatedly applying the base algorithm to all local sequences with a moving window method. In this way, all local mRNA structures can be optimized while minimizing the number of base changes. Similarly, a user can increase or decrease binding energy at specific locations in a sequence. Visual Gene Developer has a specialized window for mRNA optimization for a typical user and provides related class functions for a module developer.
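The simplest modification strategy, increasing the number of synonymous substitutions one by one until the energy falls inside the target range, can be sketched as follows. Everything here is an assumption made for illustration: the function names are hypothetical, substitutions are applied deterministically from the 5' end instead of at random, and `toy_energy` is a stand-in that merely counts G and C bases. It is not an mRNA folding algorithm and its values are not Gibbs free energies.

```python
def optimize_energy(codons, synonyms, energy, e_min, e_max):
    """Add one synonymous substitution at a time until energy is in range.

    Returns (codons, number of substitutions), or (None, n) if the range
    cannot be reached with the substitutions available.
    """
    current = list(codons)
    changed = 0
    if e_min <= energy("".join(current)) <= e_max:
        return current, changed
    for i, codon in enumerate(codons):
        alt = synonyms.get(codon)
        if alt is None:
            continue                      # no alternative codon supplied
        current[i] = alt
        changed += 1
        if e_min <= energy("".join(current)) <= e_max:
            return current, changed
    return None, changed

def toy_energy(seq):
    # Placeholder score, NOT a folding energy: more G/C gives a lower value.
    return -1.0 * (seq.count("G") + seq.count("C"))

result, n_changes = optimize_energy(
    ["GCC"] * 4, {"GCC": "GCA"}, toy_energy, e_min=-10.0, e_max=-9.0)
print(result, n_changes)
```

A real module would pass a folding-based energy function in place of `toy_energy`; the loop structure stays the same.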
(3) Removal of undesirable sequences
The coding region of a gene may include undesirable sequences such as restriction enzyme sites, potential polyadenylation signal sequences, or cryptic splice sites. Visual Gene Developer provides a function to remove such unwanted sequences without changing the resulting amino acid sequence. The algorithm is based on synonymous substitution and is similar to that for codon optimization, except that it replaces only a few codons with corresponding synonymous codons in the target sequence region that needs to be modified (Figure 4). The first step is to identify the location of the target DNA sequence in a gene and then determine the location of the site for synonymous substitution. To simplify the substitution process, terminal bases of the target sequence are truncated if they are located outside of complete codons. The current version of the software carries four relevant modules that are written in VBScript. A user can easily remove undesirable sequences including predefined potential polyadenylation signals and intron cryptic splice sequences.
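The removal step can be sketched in Python as shown below. This is a simplified illustration, not one of the four VBScript modules: the function names are hypothetical, the first synonymous codon that eliminates the site is accepted, and, following the truncation rule described above, only codons lying completely inside the target site are considered for substitution.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def synonyms(codon):
    aa = CODON_TO_AA[codon]
    return [c for c, a in CODON_TO_AA.items() if a == aa and c != codon]

def count_sites(seq, site):
    return sum(seq.startswith(site, i) for i in range(len(seq)))

def remove_site(seq, site):
    """Remove every occurrence of site from in-frame coding DNA, or return None."""
    while True:
        pos = seq.find(site)
        if pos == -1:
            return seq
        first = -(-pos // 3)                # first codon fully inside the site
        last = (pos + len(site)) // 3 - 1   # last codon fully inside the site
        before = count_sites(seq, site)
        for idx in range(first, last + 1):
            start = 3 * idx
            for alt in synonyms(seq[start:start + 3]):
                trial = seq[:start] + alt + seq[start + 3:]
                if count_sites(trial, site) < before:
                    break                   # this substitution removes a site
            else:
                continue                    # try the next codon
            break
        else:
            return None                     # no silent substitution works
        seq = trial

print(remove_site("ATGGAATTCTAA", "GAATTC"))  # in-frame EcoRI site
print(remove_site("GCGAATTCG", "GAATTC"))     # site starting inside a codon
```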
Optimization process
To help users design unique optimization processes, Visual Gene Developer has a versatile and configurable optimization strategy and interface. First of all, the optimization process is based on a novel combination of multiple modules. Each independent and fully functional module does a simple job like codon optimization or silent removal. By integrating individual modules into a comprehensive optimization process, it is possible to implement a more complicated and diverse gene optimization strategy. A user can easily add new modules by choosing one of the listed algorithms in the 'Configuration for Gene Optimization' window. At the same time, a user can determine their priority or the order of module execution. For module development, there are 5 different types of optimization modules such as 'Sequence optimization', 'mRNA structure optimization', 'Gene manipulation', 'Constraint' and 'Search strategy'. Especially, 'Search strategy' belongs to a global optimization module that determines the optimization process and controls all other modules.
Secondly, the software has the ability to generate and handle a large quantity of candidate gene constructs that satisfy the user's gene design criteria. This is important because the number of possible variant genes is practically infinite even after codon, mRNA structure, or UTR optimization and after undesired gene constructs are screened out. As a simple estimate, the number of possible variants of a gene is $3.28^{(18/20)n}$ if we assume equal probability among the 20 amino acids, where n is the total number of amino acids of the gene and 3.28 is the average number of synonymous codons of the 18 amino acids that have multiple codons. For instance, if a gene consists of 250 amino acids, the total number of possible variants is about $1.18 \times 10^{116}$. One cycle of the optimization process generates a single candidate gene construct, and multiple cycles produce many candidate gene constructs. A user can check all generated gene constructs in the 'Gene Optimization' window. As one interesting feature of the software, a generated gene construct keeps a record of its origin, like the relationship between mother and daughter, and a user can specify a source gene construct for the next round of the optimization process. This option is useful for finding desirable sequences step by step in a short time.
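The size of the search space quoted above can be checked with a short calculation. The sketch assumes, as the text does, that all 20 amino acids are equally likely, so that 18/20 of the positions have on average 3.28 synonymous codons (59 codons shared by 18 amino acids). The function name is hypothetical, and the result is kept on a log scale for convenience.

```python
import math

def log10_variants(n_amino_acids, mean_synonyms=3.28, multi_codon_fraction=18 / 20):
    """log10 of the estimated number of synonymous variants of a gene."""
    return multi_codon_fraction * n_amino_acids * math.log10(mean_synonyms)

exponent = log10_variants(250)
mantissa = 10 ** (exponent - math.floor(exponent))
print(f"{mantissa:.2f} x 10^{math.floor(exponent)}")
```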
Thirdly, the software has a built-in screening system to remove undesirable gene constructs. A researcher may prefer using active processes such as silent removal of unwanted sequences. However, the screening process can be a faster and simple way to find good candidate gene constructs. Any designed modules that are registered as 'Constraint' type will be used to determine whether a gene construct satisfies a certain set of criteria or not. If a module returns a 'Not pass' value, the current gene construct will be discarded.
(1) Excluding query sequences or specified restriction enzyme sites
The purpose of the module is to avoid undesirable DNA sequences including potential cryptic splice sites, polyadenylation signal, and restriction enzyme sites. When those sequences are found, the gene construct will be excluded from the candidate gene construct list.
(2) Checking stability of mRNA secondary structure
This algorithm is a modification of the 'mRNA Gibbs free energy plot'. It is used to analyze mRNA secondary structures of all partial sequences of a sequence. If the calculated Gibbs free energy of a local sequence in a moving window is lower than a threshold value, the module returns 'Not pass' value and consequently the gene construct will be screened out.
(3) Removing repeated sequences
In order to prevent repeated sequences in a gene construct, the module is developed to count the total number of repeated sequences in a sequence. If the number is more than a prescribed set point, the gene construct will be ruled out.
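The screening step can be pictured as a list of checks that each return pass or not pass, with a candidate discarded as soon as one check fails. The Python sketch below is illustrative only: the function names are hypothetical, the repeat check simply counts window-length fragments that occur more than once, and the mRNA stability check is omitted because it needs a folding algorithm.

```python
from collections import Counter

def no_forbidden_sites(forbidden):
    """Constraint: pass only if none of the listed sequences occurs."""
    def check(seq):
        return all(site not in seq for site in forbidden)
    return check

def repeat_count_at_most(limit, window=8):
    """Constraint: pass if at most `limit` distinct fragments are repeated."""
    def check(seq):
        kmers = Counter(seq[i:i + window] for i in range(len(seq) - window + 1))
        repeated = sum(1 for n in kmers.values() if n > 1)
        return repeated <= limit
    return check

def screen(candidates, constraints):
    """Keep the candidates for which every constraint returns True (pass)."""
    return [c for c in candidates if all(check(c) for check in constraints)]

constraints = [no_forbidden_sites(["GAATTC"]), repeat_count_at_most(0, window=6)]
candidates = ["ATGGAATTCGCA", "ATGGAGTTCGCA", "ATGATGATGATGATG"]
print(screen(candidates, constraints))
```

In the example the first candidate fails the site check, the third fails the repeat check, and only the second survives.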
Comparison with other similar web servers and software
Most available software, including Visual Gene Developer, shares a similar codon optimization strategy. A Monte Carlo algorithm or the 'one amino acid-one codon' method is frequently adopted [19]. For high gene expression, several programs such as Gene Composer, Gene Designer, JCat, OPTIMIZER, and Synthetic Gene Designer include optimal codon usage maps of highly expressed genes. Regarding mRNA secondary structure optimization, Gene Composer and Visual Gene Developer carry the most sophisticated modules. Both programs have functions to eliminate stable mRNA hairpin structures and control the Gibbs free energy utilizing advanced mRNA folding algorithms. GeneOptimizer, Gene Designer, GeMS, and JCat don't calculate the Gibbs free energy of mRNA folding. However, they indirectly eliminate sequences that may form mRNA structures by analyzing sequence repetitions or calculating energy scoring functions over a short range of a test sequence. The other software tools such as Codon optimizer, DNAWorks, DyNAVacS, GeneDesign, OPTIMIZER, Synthetic Gene Designer, and UpGene don't have a function to predict mRNA secondary structure and don't perform mRNA optimization, which here means the use of Gibbs free energy analysis to assess the stability of mRNA secondary structure. Compared with other available software, Visual Gene Developer has several novel features that have not been implemented elsewhere, such as artificial neural network modeling, an integrated programming environment using VBScript and JScript (= JavaScript), network/multi-threaded computing, and a sophisticated batch analysis and optimization process for multiple gene construct candidates. However, one of its limitations is that Visual Gene Developer is platform-dependent as a Microsoft Windows™ application, whereas other software supports multiple platforms or web browsers (DyNAVacS, GeneDesign, Gene Designer, GeneOptimizer, JCat, OPTIMIZER, Synthetic Gene Designer).
In addition, further development is needed to include other useful features that have not been implemented yet, such as a built-in database of codon usage tables of highly expressed genes or robust regression toolboxes like PLS (partial least squares) or SVM (support vector machine) models.