Overview of Multi-objective Evolution Algorithm
In real life, most optimization problems involve multiple objectives (possibly in conflict), which should be satisfied simultaneously. Generally, optimization with single objective often results in "optimal" solutions against only one objective but unacceptable with respect to the other ones. To solve this trade-off problem, multiple instead of single reasonable solutions should be reserved, incorporating multiple objectives optimized simultaneously. A solution is said to be Pareto optimal if it is not dominated by any other solutions in the solution space, which cannot be improved with respect to any objective without worsening at least one of other objectives. The set of all feasible non-dominated solutions is referred to as the non-dominated or Pareto optimal solutions and the Pareto frontier can be mapped out by ranking the fitness of the Pareto optimal solutions in the search space, see references [43–45] for more details.
Through the operation on the population of individuals with inherent parallel mechanism, GA or EA is suitable to solve multi-objective optimization problems. The crossover operator may exploit structures of good solutions with respect to different objectives to create new non-dominated solutions in unexplored parts of Pareto front. There are many MOGA and MOEA methods tailored for different practical problems, and generally speaking they are merely different with respect to fitness assignment procedure, elitism, or diversification approaches [46].
We adopt ο-MOEA method, a robust algorithm based on the concept of ο-domination developed by Deb et al. [47, 48] as our conformational searching engine. In ο-MOEA, two populations co-evolve, one is the general evolving population P(t) identical to those used in single-objective GA, the other is the "archive" population A(t) serving as the collection of Pareto solutions generated in the t th generation. A(0) is the ο-non-dominated solution set of P(0) after initialization. The crossover operator is applied to two individuals randomly selected from P(t) and A(t), respectively, and each new child is compared with the individuals in the two populations by the ο-domination method to check if it would be accepted or not. Each individual in the archive is defined as identification array B as following:
where
is the minimum probable value of the j th objective function and ο
j
is the user-definable tolerance between two objective values. Identification array divides the whole j th objective function space into the hyper-boxes with the size of ο
j
. If the identification array of the new child dominates any archive individual, then the dominated individual is substituted by the new child; otherwise the new child is discarded. If none of the two conditions is satisfied, the new child is a ο-non-dominated solution. Following criteria are used to handle the new child:
-
1.
When the new child and one archive individual share the same identification array B, namely they are in the same box, the new child would only be accepted either if it dominates the archive individual or it is nearer to B.
-
2.
When the new child doesn't share one identification array B with any of archive individuals, the new child would be accepted.
In this way, the distribution of Pareto optimal solution set is attained under the condition that each box of the Pareto frontier can only be occupied by one solution. The number of Pareto optimal solutions can be restrained by scaling the ο value.
Similarly, the selection procedure in the population P(t) is also determined by epsilon domination. If the new child dominates one or more individuals in P(t), the new child would substitute one of such individuals randomly, or else it should be discarded. If they are non-dominated by each other, one of them would be substituted by the other randomly to fix the size of the population.
Details of MOEA in Cyndi
2.1 Genome Encoding
Similar to the encoding scheme used in traditional single objective GA operated on molecules, namely the molecules are divided into a set of rigid fragments linked by several rotatable single bonds, Cyndi encodes the genome into a vector of real numbers within the interval of [0, 2π), which represents the torsion angle each rotatable bond to be rotated during the conformation generation. The phenotypes, namely the conformers, are easily translated through rotating the single bond by the values coded in each gene. There have been some reports declaring that reducing the continuous torsion space into discrete one (using torsion grid or biasing the torsion value towards the favourable ranges discovered through mining torsion space) may boost convergence [15], however, this assumption may cause missing some conformational significant areas due to the limited accessible information, for this reason, we still manipulate the genomes in continuous torsion space.
2.2 Optimization Objective Functions
Two terms of potential energy calculated with Tripos force field, VDW and torsion energy, are taken as parts of the objectives for separating the solutions in energy space, during which both bond lengths and bond angles are frozen. The torsion term is directly subjected to the geometric variations during the evolution, consequently it is selected as one of the objectives; the second objective is VDW energy in that it increases exponentially with intensive steric bumps so as to penalize the "ugly" conformers. The format of VDW term is a conventional Lenard-Jones 12-6 function defined in Tripos force field and no cutoff is employed regarding the small size of most drug-size molecules. Since all the energy calculations are performed without considering solvent effects, electrostatic term is also discarded to avoid abnormal compact conformations induced by intra-molecular interaction such as internal hydrogen bonds and salt bridges.
The profile of energy landscape is often described as a zigzagged hypersurface. Two geometrically similar conformers may locate quite distant in energy hypersurface but converge to one minimum after energy optimization. Therefore a mechanism is necessary to promote geometric diversity in the population of evolving conformers; the geometric dissimilarity (GD) between each individual and the input conformation is set as the third objective, to control the geometrical distribution of the Pareto optimal solutions. The GD calculation is formulated as eq. (2):
where xindiv, iand xinput, iare the positions of the i th heavy atom in each individual and the input conformer. Here only the positions of the heavy atoms are considered in GD calculation.
Previous study indicated most of the bioactive conformers tend to adopt relatively extended geometries [16], in light of this important observation, Cyndi provides an optional objective to direct the evolution towards generating conformers with different geometrical extension degree. Similar to the work by Izrailev et al., the objective that quantifies the degree of extension is formulated by calculating the gyration radius for each molecule, as defined in eq.(3):
where x
i
and m
i
are the position and mass, respectively, of the i th atom;
is the geometrical centre of the molecule. This objective may improve the conformational diversity if we have no idea about whether the bioactive conformer is compact or extended.
The multi-objective optimization model of Cyndi consists of a set of n parameters (design variables), a set of l objective functions, and a set of m constraints. Objective functions and constraints are functions of the design variables. It can be formulated mathematically as follows:
where x is the design vector x= {Tb 1, ⋯, T
bn
}T, in which Tb 1, ⋯, T
bn
are the torsion angles of the n th rotatable bonds. Accordingly, the constraints for the design variables (g ( x )s) can be represented as
0 ≤ Tb 1, ⋯, T
bn
< 2π
y is the objective vector, which consists of VDW term f1(x), torsion score f2(x), and the diversity of geometry f3(x) or f4(x), respectively. Note that the maximal problem for GD or gyration radius of each conformation is converted to minimal problem through reciprocal to ensure the most conformational distribution in geometric space. The maximal X is denoted as the decision space, and Y is denoted as the objective space.
The objective functions of Cyndi are easy to extend and customize without modifying kernel of the algorithm. By integrating other objectives as constraints, such as the interaction energy of ligand with the binding receptor or pharmacophore model matching, Cyndi would evolve to facilitate structure or ligand based virtual screening easily.
2.3 Archive Post Processing
The final archive contains the optimal solutions on Pareto frontier which have been separated by the objectives and are non-dominant to each other. However, the conformers in the archive may be energy-unfavourable or geometrically crowded because they may be suboptimal in the single objective case. Therefore, the final set of conformations is the pruned result of archive members by user-definable energy and RMSD filters.
The conformational energy and geometry are optimized using the conjugated gradient (CG) method with Tripos force field, and all energy terms excluding electrostatic term and torsion term lest inner electrostatic interaction and torsion angle perturbation undermine generated conformation. After energy minimization, following energy cutoff is applied to discard those conformers beyond the energy window:
E
cutoff
= E
w
+ k × N
rot
where E
w
is a user-definable energy cutoff (default is 20 kcal/mol), N
rot
is the number of rotatable bonds of current molecule and k is the scaling parameter (user-definable and the default value is 0.5). Similar to Balloon's approach, such energy cutoff takes into account of the effect imposed by molecular flexibility and reflects the observed dependence between the local minima and number of rotatable bonds of bioactive conformation [49]. Only those conformers within the energy threshold are labelled as energy-favoured and survived to subsequent RMSD filter.
The conformers survived from the energy filter are submitted to a RMSD filter to remove geometrical duplicates. We adopted the filter scheme defined in Catalyst [50]. Conformers are compared with each other in terms of RMSD and any conformer within the RMSD tolerance (vide infra) of another conformer is discarded:
where c is a user-definable scaling parameter, and N
rot
, again, is the number of rotatable bonds. The involvement of molecular flexibility can efficiently reduce the geometrical redundancy for highly flexible molecules.
Testing of Cyndi
The ability for conformation generation of Cyndi was tested against a data set containing 329 drug-sized molecules whose bioactive conformations were extracted from the PDB database (see Additional file 1). This is a combined set of part of CCDC/Astex subset [51, 52] used in Balloon's validation and the subset used by Izrailev et.al. [53, 54]. To avoid the influences on the generation of conformers from existing crystal structures, single 3D conformer was regenerated for each molecule from 2D structures using the Standardizer module of ChemAxon package [55].
Cyndi test run was performed on each of the molecules in the test set with 200 populations and 200 generations. The probabilities for crossover and mutation operation were set to 0.85 and 0.1, respectively. The relatively larger mutation probability was designated to sample the conformational space more broadly by mutating the existing conformers more frequently. The epsilon values for the four objectives (VDW energy, torsion energy, GD value, and gyration radius) were set as 20 kcal/mol, 5 kcal/mol, 0.2 Å and 0.1 Å, respectively. The maximum iteration for post processing CG minimization was set to 100, and the convergence criterion based on gradient RMS was set to 0.1 kcal·mol-1·Å-1. No initial optimization against the input conformers was applied and the input conformers were discarded from the final conformer ensembles. Other parameters were set to the default values mentioned previously.
To make a comparison, the Pentium4 processor optimized version of Balloon (version 0.6.6.4641) [56] as well as three conformer generation methods (Catalyst Best, Catalyst Fast and CEASAR) in DS2.0 [57–60] were run against the same test. All default parameters were applied and the maximum number of generated conformers was limited to 100. The initial single 3D conformers generated by Standardizer of ChemAxon were also retained for further comparison (see Table A1 in Additional file 2).
To make an impartial comparison, the minimum RMSD values between the crystal bound conformers and the conformers generated by all methods were calculated with a third-party program, Vrms 1.0, which is an RMSD calculator removing artificial differences caused by symmetry [61]. In this study, a conformation is considered to be successfully reproduced if the RMSD value is less than 0.5 Å, as compared to the original X-ray structure. Balloon, Cyndi, Vrms and DS 2.0 were all run on a workstation with an Intel Pentium Dual Core Processor (2.66 GHz).
Hierarchical clustering was applied to analyze the conformational diversity of each conformer ensemble generated by Cyndi. Vrms calculated RMSD values between each pairs of conformers were used to populate a distance matrix, which was then analyzed by applying complete linkage agglomerative hierarchical clustering. The clustering results were visualized by dendrograms and reordered heatmaps of the distance matrix.