 Methodology article
 Open Access
 Published:
A unified framework for packing deformable and nondeformable subcellular structures in crowded cryoelectron tomogram simulation
BMC Bioinformatics volume 21, Article number: 399 (2020)
Abstract
Background
Cryoelectron tomography is an important and powerful technique to explore the structure, abundance, and location of ultrastructure in a nearnative state. It contains detailed information of all macromolecular complexes in a sample cell. However, due to the compact and crowded status, the missing edge effect, and low signal to noise ratio (SNR), it is extremely challenging to recover such information with existing image processing methods. Cryoelectron tomogram simulation is an effective solution to test and optimize the performance of the above image processing methods. The simulated images could be regarded as the labeled data which covers a wide range of macromolecular complexes and ultrastructure. To approximate the crowded cellular environment, it is very important to pack these heterogeneous structures as tightly as possible. Besides, simulating nondeformable and deformable components under a unified framework also need to be achieved.
Result
In this paper, we proposed a unified framework for simulating crowded cryoelectron tomogram images including nondeformable macromolecular complexes and deformable ultrastructures. A macromolecule was approximated using multiple balls with fixed relative positions to reduce the vacuum volume. A ultrastructure, such as membrane and filament, was approximated using multiple balls with flexible relative positions so that this structure could deform under force field. In the experiment, 400 macromolecules of 20 representative types were packed into simulated cytoplasm by our framework, and numerical verification proved that our method has a smaller volume and higher compression ratio than the baseline singleball model. We also packed filaments, membranes and macromolecules together, to obtain a simulated cryoelectron tomogram image with deformable structures. The simulated results are closer to the real CryoET, making the analysis more difficult. The DOG particle picking method and the image segmentation method are tested on our simulation data, and the experimental results show that these methods still have much room for improvement.
Conclusion
The proposed multiball model can achieve more crowded packaging results and contains richer elements with different properties to obtain more realistic cryoelectron tomogram simulation. This enables users to simulate cryoelectron tomogram images with nondeformable macromolecular complexes and deformable ultrastructures under a unified framework. To illustrate the advantages of our framework in improving the compression ratio, we calculated the volume of simulated macromolecular under our multiball method and traditional singleball method. We also performed the packing experiment of filaments and membranes to demonstrate the simulation ability of deformable structures. Our method can be used to do a benchmark by generating large labeled cryoET dataset and evaluating existing image processing methods. Since the content of the simulated cryoET is more complex and crowded compared with previous ones, it will pose a greater challenge to existing image processing methods.
Background
Cryoelectron tomography (CryoET), which was first proposed in 1970s, is now a popular and powerful imaging technique in the fields of life and medical sciences [1–3]. A series of twodimensional images recorded by electron microscopes are collected to generate 3D reconstruction of macromolecules and then used to analyze the architecture of these structures. In traditional sample preparation process, due to the incompatibility with vacuum, water in sample cells tends to boil out and leads to unacceptable explosions. This seriously effected the efficiency of the sample preparation process and the accuracy of result data. In order to extract the cellular structure more accurately with higher resolution, cryoelectron tomography has emerged, which prepares samples at low temperatures and is able to record the cellular structure in a natural state [4].
In the fields of protein visualization and structural biology, various machine learning methods are applied to the analyse the structure of macromolecules and ultrastructures [5–7]. These methods can resolve the structure of macromolecules to a large extent, but the accuracy still need to be improved. On one hand, due to the lack of training data, only unsupervised methods could be introduced. These models often fail to obtain credible results and can only make fuzzy estimates. On the other hand, the parameter adjustment process is very time consuming, and it is extremely difficult to verify whether these specific parameters can obtain the best results [8, 9]. Therefore, generating simulated cryoET is very important. To make the simulation result more realistic, Molecular Dynamics(MD) is introduced. The label of simulated cryoET data is known which is very helpful in training machine learning algorithms and improving the performance. It is also of great help to adjust parameters and find the optimal ones.
There are many studies on the simulation of the interaction and movement of subcelluar structures. Some studies focus on the simplified modeling of these structures [10–13] and the generation of topology file [14]. Others tend to explore the interaction between a pair of structures [15, 16]. In this field, the study of macromolecular crowding is of great importance[17]. In 2015, Pei et al. [18] proposed to simulate cryoET of macromolecular crowding. This method modeled macromolecules with single ball, and squeezed them into a limited space. This method can simulate crowded cell cytoplasm at varying crowding levels by a score function. Random noise is used to achieve varying SNR levels. However, representing a macromolecule with only one sphere can not obtain the tightest packing results.
In general, traditional packing methods tend to model a single structure with a single cubic, sphere, cylinder, or other threedimensional box. A specific macromolecule is placed in the smallest box that can hold it, which successfully simplified the complex into an alternative with regular boundaries. Although these models are easier to operate for the experimenter, only rough results can be obtained due to the huge waste of space. For example, when modeling a sticklike substance, the radius of the minimum boundary ball is extremely large due to its large height to thickness ratio. This means a large area of vacuum is present inside the sphere. This kind of model cannot represent the sticklike substance and its surrounding environment accurately during the simulation process. In contrast, a cuboid may be more suitable for simulating this type of structure. Besides, the scale of the minimum boundary is fixed, so it cannot exhibit the deformation and stretching of structures. The packing result is not tight enough because it is limited by the space waste issue. The result of the simulation is not rich enough because it lacks the dynamic features. These inherent problems with existing methods lead to the inaccurate representation of the cytoplasmic environment, and it hardly provides adequate assistance to subsequent image process methods. To this end, a new scheme to simulate deformable ultrastructure, such as filaments and membranes, and nondeformable macromolecules, like proteins, in a unified framework is of great significance. It is also very important to achieve the result of crowded packing, which is based on the movement and interaction of the above structure according to molecular dynamics. This method is not only a supplement and development of traditional biological structure modeling, but also a great challenge in computer graphics, expecially physicsbased simulation [19–21].
In the field of protein visualization, many basic methods have been proposed to simplify a macromolecule which composed of thousands of atoms based on their structural information. These methods are collectively referred to as coarsegrained methods, which include the efficient and feasible clustering methods. Classical clustering methods include hierarchical methods [22], partitionbased methods [23], densitybased methods [24], gridbased methods [25], and modelbased methods [26]. Specifically, partitionbased methods are simple and efficient for large data sets. This type of methods has a low space complexity which means it requires less time than other methods. The kmeans method [27], as a representative partitionbased method, allows the user to freely set the number of clusters and divides the data according to the similarity between discrete points. When used with macromolecules, the clustered atoms are close in distance, and the positions of the atoms can be used effectively as the basis for dividing the atoms into clusters.
In this paper, we proposed a unified simulation framework for packing deformable and nondeformable structures together under a rather crowded status. The nondeformable macromolecules are coarsegrained by the kmeans clustering method. A limited number of fixedposition spheres with different radii are used to represent a single macromolecule. Deformable filament and membrane are modeled by multiple balls with flexible relative positions. Generally speaking, filaments and membranes are relatively larger than ordinary maceomolecules, and they are more flexible in interactions and movement modes. The number of balls in a specific structure is depending on its atom number. Next, the deformability of various structures is achieved by designing different typologies. For example, balls in a macromolecule will be fully connected to each other and the length of bonds is hard to change. This will make the relative position of the balls relatively fixed, and the macromolecule will reflect the rigid body characteristics. The filaments and membranes may have relatively few bonds which by using a not fully connected topology. The angle between two bonds is changeable to achieve deformable properties. Then, certain number of macromolecules, filaments, and membranes are placed in a given space, and be moved towards the center by applying a external force pointing to the center. After a certain period of time, the system reaches a sufficient crowding level. The movement process under bonded forces, nonbonded forces, and external forces follows molecular dynamics and conforms to physical laws.
Method
The simulation of Cryoelectron tomogram with rigid macromolecules and deformable ultrastructure contains three parts: the modeling of macromolecules and ultrastructure, molecular dynamics simulation, and generating simulated cryoelectron tomograms. In this section, we will demonstrate the above three parts respectively.
Modeling of macromolecules and ultrastructure
To simulate the intercellular environment realistically, the modeling of intercellular ultrastructures and macromolecule complexes is required. Traditional simulation methods tend to model a structure with singleelement model, and do not take into account the deformable filament or membrane. This greatly reduces the richness of simulation results, and cannot represent the compact Cryoelectron tomography properly.
Therefore, we proposed to represent a macromolecule or ultrastructure with multipleball model which is able to describe deformable and nondeformable body in a unified framework, and could also save the space. This process is also called “coarse graining”. A highresolution structure consisting of thousands of atoms is represented at a low resolution by limited number of balls. A topology file is also required to describe how a structure is organized, including the atomic mass and bond configurations. In this paper, the macromolecule, which is nondeformable, is approximated using multiple balls with fixed relative positions; while the deformable ultrastructure, like membranes and filaments,are approximated using multiple balls with flexible relative positions. This feature is achieved by setting different force fields for balls, and setting the chemical bond angle to have different elasticity.
Coarse graining of macromolecule
There are thousands of atoms in one macromolecule, which makes it extremely difficult to accurately represent a macromolecule with a limited number of balls. In this paper, we start with the coordinates of atoms and use the kmeans method, a classical clustering algorithm in machine learning, to realize the coarse granulation of macromolecules. The algorithm of kmeans method could be found in supplementary document, Section S2.
To improve the richness and accuracy of simulation, 20 macromolecules with different scales and shapes were selected from the Protein Data Bank (PDB) [28]. These PDB files contain abundant information about the complex such as the source, structure, sequence, etc. In a PDB file, the most important value that we are focused on is the spatial position of each atom, called the coordinates.
Clustering methods are unsupervised learning methods which is used to devide the data into several groups. It group objects with similar properties into the same cluster. It has a wide range of applications and can be applied to almost all objects. In kmeans clustering method, k denotes that we want k clusters, and mean represents that we calculate the center of the cluster using the mean value of the attribute values in each cluster. In a word, the cluster is described by the centroid of each cluster. The more similar the objects in the same cluster are, the better the clustering accuracy is. Cluster analysis attempts to find the similarity between different objects, which is usually represented by their Euclidean distance, cosine distance, or Hamming distance. Obviously, for a macromolecule the Euclidean distance is a simple and effective way to group atoms which are close to each other into one cluster. In this paper, the clustering process was implement based on the kmeans program in scikitlearn package [29].
In this paper, the kmeans method is used to divide the atoms in a macromolecule into k clusters. First, the coordinates of all n atoms is extracted from the PDB file: atoms={coord_{1},coord_{2},...,coord_{n}}, where coord_{i}={(x_{i},y_{i},z_{i})∣x_{i},y_{i},z_{i}∈R}. Then k atoms is randomly selected as the initial cluster center: centers_{tmp}={center_{1},center_{2},...center_{k}}, where center_{i}={(x_{i},y_{i},z_{i})∣(x_{i},y_{i},z_{i})∈atoms}. Secondly, for each atom i in the macromolecule, the distance to all k centers dist_{i}={dist_{i→1},dist_{i→2},...,dist_{i→k}} is calculated by Euclidean Distance formula:
Next, each atom is assigned to the nearest cluster by finding the minimum value in its distant list \({dist_{i_{min}} = min\left (dist_{i}\right)}\). Then, the centroid of each cluster is recalculated, and k new cluster centers is obtained by the arithmetic mean:
where i denotes the number of center, coord_{ik}=(x_{k},y_{k},z_{k}) is the coordinate of the k_{th} atom in cluster i. Then repeat the second and third steps to redistribute the atoms into different clusters until the cluster center no longer changes. Finally, in coarse graining process, the center of the k small balls is represented by the coordinates of the cluster center, and the radius is the maximum distance \({dist_{i_{max}} = max(dist_{i})}\) from the cluster center to the atom in this cluster. The radius of the small balls in a macromolecule is in the range of 2065 \(\overset {\circ }{A}\), most of which are concentrated between 3555 \(\overset {\circ }{A}\). The overall size of the macromolecule is around 60250 \(\overset {\circ }{A}\), most of which is concentrated between 120160 \(\overset {\circ }{A}\).
Due to the variety of macromolecule sizes,the number of atoms contained in each macromolecule is also different. Therefore, a uniformed cluster number k can not meet the need of reasonable coarse granulation. In this paper, the original number of cluster clusters k_{tmp} is set base on the scale of the macromolecule by the following formula:
where N_{atom} refers to the atom number of a macromolecule. This means that every five thousand atoms will be represented by one ball. When there are less than 5,000 remaining atoms, one ball will still be added to ensure that all particles are represented. It is important to make sure that each macromolecule is divided into at least three clusters to get three cluster centers. This means that each macromolecule can form at least two linearly independent vectors to determine the rotation angle of the macromolecule in space during subsequent molecular dynamics simulation step. Eq. 4 is used to limit the range of k:
The process of coarse graining is shown in Fig. 1, its algorithm is shown in Algorithm 1.
Model of filaments and membrane
The single ball/cylindrical model is difficult to properly simulate a largescale filament or membrane. In contrast, the multiball model can not only decrease the vacuum volume, but also flexibly represent the shape change of the above structure by changing the relative position of the balls. In order to reduce the complexity of the model, the filament and membrane was initialized by the simplest model with a rule layout.
For the filament, its initial state was set to a straight line, arranged by m balls. The model is shown in Fig. 2a. For the membrane, its initial state was set to a rectangular mesh, arranged by m∗n balls. The model is shown in Fig. 2c.
Topological structure
The deformability of a specific macromolecules or ultrastructure is determined by its topology structure. For macromolecules, we set the structure to be indeformable. k spheres that make up the macromolecule are connected in pairs (see Fig. 1d), and the angles of the chemical bonds are fixed. For filament and membrane, we set the structure to be deformable. The angles of the chemical bonds is able to change elastically.
In a filament, inside balls are sequentially connected to the nearest neighbors of the left and right by the serial number. As shown in Fig. 2b, only balls at both ends are bound by one chemical bond, while the other balls are bound by two chemical bonds.
In a membrane, balls are in a mesh structure, and each ball is connected to the neighbors in the four directions of the upper, lower, left, and right directions. It is also connected to the neighbors of the four diagonal directions. The structure of a membrane is shown in Fig. 2d, the balls at the apex are bound by three chemical bonds, the balls at the boundary are bound by five chemical bonds, and the inner balls are bound by eight chemical bonds.
Molecular dynamics simulation
The setting of the biomolecules force field is the most important part of dynamics simulation process. In this paper, force field could be divided into three types: bonded force, nonbonded force and external force. In each time step, all the particles move under the above forces. As for a specific structure, if the relative position of inner particles change significantly, this macromolecular complex will deform. For each ball i in the force field, its acceleration a_{i} at time t is described by the following formula:
where m denodes the mass of each ball, t is the time, F_{i}(t) is the resultant force, F_{i}^{b}(t), F_{i}^{nb}(t) and \({F_{i}^{ext}(t)}\) represent the bonded force, nonbonded force and external force respectively. Since each ball represents 5,000 atoms, the mass m in our experiment was set to 5000. Then the velocity and position of each ball is describe by the following foluma:
where δt is the time step, and it was set to 1 ns in this paper.
In biomelecules system, the force is calculated by negative gradient of the scalar potential function. The force from particle j at position x_{j} to particle i at position x_{i} is F(x_{ij})=−∇U(x_{ij}), where x_{ij}=x_{i}−x_{j} is the position vector. Then Eq. 5 can be written as:
where \({ U_{i}^{b}(t)}\) is the bonded potential function, \({U_{i}^{nb}(t)}\) is the nonbonded potential function. The calculation of the above functions is implemented by NAnoscale Molecular Dynamics(NAMD) [30].
External force field
Our packing method considers external force which is a force field toward the center of simulation scene. The external forces is applied directly to the particles without the use of potential function. This force field lead all particles to move toward the center of the scene and squeeze together, which plays a major role in the packing operation.
In order to speed up the simulation process while maintaining the stability of the calculation in the crowded area, a force field action radius is set. Outside this radius, balls move toward the center under a uniform constant max force action. In order to avoid instability, the force field was set to be evenly decremented, and is reduced to zero in the center of the scene. The segmented external force field for ball i is as follows:
where ∣∣x_{i}∣∣ is the distance from ball i to the center of simulation scene.
Bond interactions
Bond interactions involve two type of constraint: bond stretching and angle stretching. As seen in Fig. 3, they are used to describe the 2body and 3body interaction of covalency bonded atoms respectively. The bond potential could be written as:
For a pair of bonded particles i and j, their bond harmonic vibrational motion is described by spring bond potential:
where k_{b} is the spring constant and it was set to 2000, ∣∣x_{ij}∣∣=∣∣x_{i}−x_{j}∣∣ is the distance between particle i and j, x_{0} is the equilibrium distance of this particle pair. In this paper, ∣∣x_{0}∣∣ of two specific bonded balls is set as the initial distance between two ball centers.
For three bonded particle i, j and k, their angular vibrational motion is described by angular bond potential:
Where k_{θ} is the angle constant, θ_{ijk} is the angle formated by vectors x_{ji}=x_{j}−x_{i} and x_{ki}=x_{k}−x_{i}, θ_{0} is the equilibrium angle. In this paper, θ_{0} of is set as the initial angle of three ball centers.
Given the atomic coordinates and topology, the bonded constraint is the mainly influence factor of the deformation properties of macromolecules and ultrastructures. For a macromolecule, its topological structure is relatively stable because there are chemical bonds between any two balls in it. This means macromolecules do not deform during simulation as long as the length of the chemical bond is not changed. Therefore, k_{b} in the bond potential function is set to 2000, which is a very large value, to effectively ensure the rigidity of the macromolecule. In this paper, filament and membrane are set to deformable and nonstretchable structures. Therefore, k_{b} for filament and membrane is also set to 2000 to ensure the nonstretch feature, and k_{θ} is set to zero to enable the deformability of the angle.
It is obvious that the deformability of a structure is determined by its topological structure and bond constraint. The stability of the topological structure can determine the feasibility of the stretching and bending feature; the bond constraint is used to adjust the difficulty of the deformation.
Nonbond interactions
For particles pair i and j without chemical bond, van der Waals force play the main role of the interaction between them. There is a strong core repulsion between particle i and particle j if they are too close together, while there is a weak dipole attraction between them if they are a little far apart. The potential function of van der Waals force is described by Lennard–Jones potential:
where ε is the depth of potential wall with the value \({U_{ij}^{van}(R_{min}) }\), R_{min} is the distance at which the potential reach its minimum. This potential approaches 0 rapidly as x_{ij} increases, so a switchdist and a cutoff point are chosen to save the computational resource. In this paper, the switchdist and cutoff are set to S=600nm and C=610nm respectively. This is a piecewise force field function. When the distance between the two particles is smaller than S, \({U_{ij}^{nb} = U_{ij}^{LJ} }\). Then, it starts to drop evenly from point S and become zero at point C.
In biomelecules system, the electrostatic force is also an important part of interactions. In this paper, since all the balls represent (0,5000] atoms, it is hard to say which sign of charge does it have. Besides, the goal of this paper is to pack a number of macromolecules into a limited space. Under the given external force, there is no need to calculate the intermolecular forces accurately. Van der Waals force is enough to describe nonbond interactions and prevent particles form overlapping. In order to save computing resources, the charge of all balls is set to zero, and the electrostatic force is ignored.
Generating simulated cryoelectron tomograms
To generate a simulated cryoelectron tomogram, we need to obtain the displacement vector and rotation angle of all the structures first. In this paper, each structure is approximated by multiple balls and represented by multiple vectors. For a structure S with k balls B={b_{1},b_{2},...,b_{k}}, its position coord_{S}=(x_{S},y_{S},z_{S}) could be calculated by the arithmetic mean of k balls:
where b_{i}=(x_{bi},y_{bi},z_{bi}) is the coordination of the i_{th} ball center. The displacement vector is the vector from the initial position to the final position, that is \({\vec {V}_{d} = coord_{S}^{init}  coord_{S}^{final} }\).
The rotation of a point can be uniquely determined by a Euler angle which is represented by a triple array [31]. As shown in Fig. 4, the rotation accordance in this paper is ZYZ sequence, and the final rotation can be represented by a rotation matrix:
where α, β and γ are the rotation angle in three subprocess.
For a rigid body in threedimensional space, its rotation states can be uniquely determined by two linearly independent vectors inside it, and the rotation angle is obtained from the rotation matrixes of two vectors. Therefore, at least three points within each macromolecule need to be labeled, which is why each macromolecule need to consist of at least three clusters in the coarse graining process.
As shown in Fig. 5, macromolecule P is rotated from (P_{0}) to subfigure P_{1} during time δt. Three points A_{0},B_{0},C_{0} move to the point A_{1},B_{1},C_{1} via the above transformation. The rotation matrix R_{AB} of A_{0}B_{0} to A_{1}B_{1} can be obtained, which represents the rotation of macromolecule P except the rotation with AB as an axis. To solve the rotation with AB as the axis, it is necessary to consider the motion of point C_{0}. The intermediate vector A_{1}C^{′}_{0} is solved by A_{0}C_{0} and R_{AB}, then the second rotation matrix R_{AC} of A_{1}C^{′}_{0} to A_{1}C_{1} is obtained. The final rotation matrix R_{P} is calculated by superimposing R_{AB} and R_{AC}, which is the rotation matrix of macromolecule P.
For each complex, density maps are generated at 4 nm resolution and with pixel size of 1 nm using the PDB2VOL program of the Situs 2.0 package [32].Then based on the coordinate,orientation and the density map of single macromolecule, the overall density map could be generated. As for a deformable structure, it density map is generated based on the final shape after packing, and the rotation angle is set to zero.
Experiment of packing subcelluar structures together and generating simulated cryoelectron tomogram
In order to verify the effectiveness of our method, four experiments with different type of Macromolecules and ultrastructure are carried out. The first one is a packing of 400 macromolecules with 20 types which is used to compare the crowded level of our method and singleball method. The second one is the packing of five macromolecules to show the process and result clearly. The third one is the packing of six macromolecules, one filament and one membrane which is used to show the ability of packing deformable and undeformable sturcture under a unified framework. The fourth experiment shows the deformation process of a single deformable structure. All the rendering result is visualized by Visual Molecular Dynamics(VMD) [33]. The PDB view of all experiments are visualized by Chimera [34]. The video of the above experiments could be found in the Additional file 2.
Crowded packing of macromolecules
Packing of 400 macromolecules. As shown in Fig. 6, 20×20types=400 macromolecules are packed in a box of 120nm×120nm×120nm. It could be seen in Subfig. 6a and Fig. 6c, all macromolecules gradually squeeze together from the initial dispersed state, forming a spherelike structure under a force pointing to the scene center. Subfigure 6b shows the PDB view of the final result with different types of macromolecules in different colors. A slice in the volume is also shown in Subfig. 6d. In Fig. 7, the simulated cryoelectron tomography of this packing result is shown. In this experiment, macromolecules are tightly packed together, and the gaps between each other are very small.
Including 1A1F shown in Fig. 1, 20 kinds of different macromolecules which are varying in structure, size and shape were selected and approximated with multiple balls. In order to model them more reasonable, the number of balls was determined by macromolecule’s size using Eqs. 4 and 3. Atoms in a specific macromolecule are divided into k clusters according to their coordinates. The radius of each cluster is determined by the most distant atom in this cluster to the cluster center.
In a macromolecule, each pair of balls are linked by a chemical bond to form a stable topology. The length stretching and angular stretching of a chemical bond are constrained by the bonding potential function. Due to the high spring constant of the chemical bond, the relative position of balls in a macromolecule is hard to change, and the stability of the structure can be ensured.
The PDB view, clustering results, coarsegrained multipleball model, ball stick represent of 20 selected macromolecules are shown in Figs. S1, S2, S3, S4 in supplementary document, respectively.
Volume comparison. As shown in Fig. 8, the PDB point cloud model isosurface volume, multiball model volume and singleball model volume of the 20 macromolecule models were determined. The volume of the PDB point cloud isosurface is the smallest one, which can be regarded as the true volume of each macromolecule. The atoms in a macromolecule form a point cloud, and the isosurface of the point cloud of all 20 macromolecules are shown in Fig. S5 in supplementary document. The calculation method is illustrated in supplementary document (Section S3) as well. Compared to the singleball model, there is much less empty space in the multiball model, which means it is easier to achieve a tightly packed result. For 75% macromolecule, the multiball model is able to save 11%72% volume, and our model has obvious advantages over the singleball model in modeling macromolecule 1BXR, 1GYT and 1QO1. However, for some special macromolecules, our method does not show an advantage, which is also our followup research.
Small scene with 5 macromolecules. To show the packing process clearly, an experiment with only five macromolecules is conducted. As shown in Fig. 9, five macromolecules (1KY1, 1VPX, 1W6T, 2BO9, 2GHO) are put in the scene with 50nm×50nm×50nm. The wireframe view (see Fig. 9a) and the multipleball view (see Fig. 9b) display the packing process clearly. Five macromolecules are moving from distance to center and eventually squeeze together. The final PDB view (see Fig. 9c) and inside slice (see Fig. 9d) demonstrate that the macromolecules are indeed tightly packed. The density map and Cryoelectron tomogram 2D slice image corresponding to this packing result are shown in Fig. 10.
Unified packing of nondeformable macromolecules and deformable ultrastructure
Unified packing of nondeformable and deformable structures. Our method can simulate nondeformable structures (macromolecules) and deformable structures (filaments, membranes) under a unified framework. In order to show the simulation results clearly, a small scene packing process including six macromolecules, a filament, and a membrane was performed. The names of the selected macromolecules are 1A1S, 1BXR, 1EQR, 1F1B and 1GYT. As shown in Fig. 11, the wireframe view and the multisphere view of the packing process are shown separately. It can be clearly seen that under the external forces, structures are gathering toward the center, and the membrane (sky blue wireframe and small ball) and the filament (bright blue wireframe and small ball) are deforming. The corresponding density map and the cryoelectron tomography are shown in Fig. 12.
For singleball model, since the size of the minimum bounding ball changes as these structures deforming, it is not able to reasonably simulate the filament and membrane structures with deformable properties. In contrast, our method represent the deformation of filaments and membranes by changing the relative position between multiple balls. The deformable topology of filament and membrane are shown in Fig. 2, this structure ensures their flexibility.
Deformation of single deformable structure.In order to verify the capability of the deformation process simulation, we simulated the deformation of a single filament and a single membrane. As shown in Fig. 13, a filament made up by 9 small balls is deforming under the applied force. The 3D deformation process is shown in Fig. 13b. In order to better show the shape of the filament during the deformation process, 36 slices corresponding to each state are generated (see Fig. 14). Figure 14 shows the deformation process of a single membrane, in which Fig. 14a and Fig. 14b are the 3D deformation process and the slice image, respectively. Due to having more chemical bonds, the membrane deforms much less than the filament.
Image processing method testing
Test of DoG reference free particle picking method
The first step to any analysis done on the cryoET is to identify the locations of existing particles and structures in the image. In the past, several machine learning algorithms have been proposed for reliable and efficient particle picking in a cryoET, but the performance of such methods still remain unconvincing due to the lack of labelled data for testing. The simulated tomograms can serve as the ground truth for this exact purpose.
Specifically our study focused on the performance of the Difference of Gaussian (DoG) reference free particle picking method. The method generates two different Gaussian filtered images and take the subtraction of the two. Then predicted locations of potential particles are then determined by detecting peaks in the resulting map.
The Gaussian filtered image is obtained using the Gaussian function G(σ) for a given σ value. More specifically, the filtered image I_{G}(σ) is obtained by multiplying the original image and the Gaussian function \(G(\sigma) =\frac {1}{\sigma \sqrt {2\pi }}e^{\frac {r^{2}}{2\sigma ^{2}}}\). The difference of Gaussian image transformation is obtained by subtracting two such filtered images using two different σ, hence the name DoG. Typically, the ratio of the two σ is set to 1.1 for cryoET.
After the DoG image transformation is generated, local density peaks are detected to identify the possible particle locations. Such local peaks are filtered using a density threshold T to remove those resulted by noise.
where M is the maximum density value of all local peaks, m is the minimum density value of all local peaks, and t is the threshold level at which we wish to filter the noise. The threshold level t is set to 5 for this particular performance evaluation.
Evaluating the performance of particle picking
To evaluate the Difference of Gaussian method mentioned above, there need to be a metric to identify true positives along with other useful information such as false negatives and false positives, and this process requires knowledge of the exact molecules and the locations in the cryoET.
Fortunately, we could use the simulated tomograms as the ground truth for this task. The labels of the particles, their locations, angles, radius, and other information are generated and embedded within the simulation process itself, eliminating the need for human annotation or other predictive measures.
Knowing the correct location and radius information, we could determine the true positives if there exist a detected local density peak close to the center of the particle. In other words, given the center and the radius of a particular particle, we consider it to be accurately labelled if one density peak is detected within the radius of the particle.
Precision and recall are effective method for performance analysis, and F score is used for overall particle picking performance defined by
Test results of particle picking
Considering the following equations for precision and recall:
where TP is the number of true positives (see true positive definition above), FP is the number of false positives, and FN is the number of false negatives.
The simulated tomogram contained 400 particles which was then used as the ground truth, and the Difference of Gaussian particle picking method predicted potential locations for 400 particles. Note that TP+FP is the total number positive predictions the method make, which equals to 400, and TP+FN is the total number of true positives the tomogram had, which also equals to 400. Therefore, the precision and recall values would be exactly the same. Now consider the Fscore, notice that the value of the Fscore would also equal to the precision and recall. Below in the table, we only present the Fscore value, but keep in mind that this would also be the precision and recall value.
Difference of Gaussian particle picking method at different σ_{1} value was performed on the simulated tomogram and the particle picking results were analyzed giving the Fscore. The experiment result is shown in Table 1.
Figure 15 illustrates the simulated tomogram as well as the center of predicted particle locations given by the Difference of Gaussian particle picking method where the centers of the predictions are marked with black dots.
Test of semantic segmentation method using fully convolutional network
After identifying locations of potential particles, another useful method to be applied for CryoET analysis is semantic segmentation. Being able to decode the structural information stored in a tomogram requires divide the tomogram in separate pieces for each structure for better analysis. However, one problem that still exists in the research field of CryoET semantic segmentation is the tradeoff between model accuracy and the amount of required training data. Models with high accuracy are typically deep learning models that require a large amount of high quality labelled training data. And because of the challenging and timedemanding nature of image segmentation and labeling, such training data is very scarce. Other methods that do not have as high demand for training data usually do not have comparable results as the deep learning models. To resolve this issue, our simulated data can serve as the labelled training data for high performing deep learning models.
Fully Convolutional Network (FCN) is one of the most popular neural network model due to its ability to learn both higher level patterns and low level details. Specifically, the FCN will make predictions of a tomogram by analyzing every pixel. The network contains 3D convolutional layers to capture highlevel patterns as well as upsampling and pooling layers for finer segmentation information. The exact architecture of the network is given in supplementary material Section S4 and Fig. S6. This encoderdecoder variant of the FCN adds skip connections in the upsampling phase that combines the higher layers with the lower pooling layers to fully utilize global context information while integrating the local information to accurately map lowresolution images to voxelwise predictions. In a previously published paper, it is found that this model has consistently outperformed other similar variants of FCNs for the segmentation task [7]. Our simulation provides 3D simulated cryoET data that can serve as the training data for EDSSN3D network and reduce the amount of training data preparation required of this model and thus successfully break the originally thought tradeoff between accuracy and data preparation.
Evaluation of image segmentation method
To evaluate this network model, the annotated data was split into two sets: 90 percent of the prepared data was used as the training data for the network, and the remaining 10 percent of data was held back for evaluation purposes. After the training is done, the 10 percent held out data would then be fed into the trained network and predictions would be generated with the real labels hidden from the network. The predictions would then be compared to the real label.
Pixel accuracy \({accuracy = \frac {TP}{\text {all data points}}}\) was calculated by counting the number of pixels the network made the correct predictions, where TP is the number of true positives. In our experiment, the final accuracy of this method is 0.7015. The simulation results produced by the method proposed in this paper pose a greater challenge to existing image segmentation methods. There is an urgent need to develop new algorithms to enable better experimental results.
Discussion
In this paper, we proposed the first framework to generate realistic simulated cryoelectron tomograms including nondeformable macromolecular complexes and deformable ultrastructures under a crowded status. The simulated cryoET can be regarded as labeled data to assist subsequence analysis methods.
To achieve this goal, we approximated a macromolecules using multiple balls with fixed relative positions, and approximated ultrastructure (membranes, filaments) using multiple balls with flexible relative positions. Compared to the traditional method [18] in which each macromolecule is approximated by a single minimum boundary sphere, small square or small cylinder, our multiball model can achieve a higher spatial compression ratio. We selected 20 representative macromolecules for experiments, taking 5000 atoms as a unit and dividing the atoms into near clusters based on coordinates. Taking the cluster center as the ball’s center, the distance of the farthest atom in this cluster as the ball’s radius, a multisphere model is obtained. Compared to the singleball model, our method reduces the volume by 11%72%.
Besides, it is difficult for the singleball model to dynamically represent the change of minimum bounding sphere size due to the shape change, while our multipleball model can effectively express the deformability of the structure through the relative displacement of the inner balls. In this paper, different typologies are designed to describe macromolecules, filaments and membrane with different deformability. In macromolecule, all balls are chemically linked. In the initial model of filament, the balls are collinear and are connected end to end in order. Balls in the initial model of the membrane are coplanar and are connected to their neighbors in the eight directions of up, down, left, right and four diagonal directions. For macromolecules, the topology determines that the structure does not deform as long as the bond length does not change. For filaments and membranes, the topology determines that they can deform flexibly. Compared to the singleball model, the bonding force is need to be considered in addition to the nonbonding force and external force. The elasticity of the bond length determines its stretching capability, and the angle elasticity determines its bending capability. A single filament and membrane simulation experiment proves that the multiball model can effectively represent the deformation process of the structure in space. For the multiball model, since one macromolecular complex or ultrastructure is described by multiple points, the rotation angle during motion is achievable. In contrast, traditional single ball model does not require this step and the angle of rotation is randomly set which not fit the physical law. In this paper, the rotation angle of a macromolecule is uniquely determined by three noncollinear points in it, which is why the minimum cluster number is set to 3.
In this paper, six nondeformable macromolecules and two deformable filaments and membranes are placed in the same scene, and the object with different propertied could be distinguished easily. Experiments show that the proposed method can obtain realistic and highly crowded simulated cryoelectron tomogram, including deformable objects and nondeformable objects.
Our simulation data can be used to evaluate existing image processing methods. Machine learning methods can deliver high accuracy in various tasks, sometimes outperforms humans. One major shortcomings of machine learning and deep learning methods is the need for large quantity of accurately labelled data. In the field of CryoET analysis, the lack of labelled data has resulted in unconvincing and sometimes inconclusive algorithm searching. By generating realistic simulated tomograms, we can bridge the gap by providing benchmarks for various algorithms [35–40] for cryoET analysis.
The accuracy and performance of Difference of Gaussian particle picking method is highly dependent on the hyperparameter σ_{1}. With real tomograms, the performance evaluation can be quite challenging and error prune as the particles in the image needs to be manually identified. In a simulated image, the exact locations of every particle are known, allowing the true precision and recall to be calculated. Simulated tomograms gives more accurate evaluation schemes. For supervised deep learning methods, the availability of correctly labelled data is even more important. By using this simulation framework, we can generate large quantities of tomograms with the correct label which can then serve as the training and testing data. The CryoET data generated by our method is closer to the real state, so that its analysis is more difficult than previous simulated cryoET. The existing image processing methods still have much room for improvement on our simulation data, and new algorithms are expected to be proposed to obtain better experimental results.
Conclusions
In this paper, we proposed a unified framework to simulate cryoET with both deformable and nondeformable subcelluar structures, including macromolecules, filaments and mambranes. Our multipleball model makes it possible to achieve more crowded status than sinlgeball model, which is more close to real status in cell. The proposed packing algorithm is combined with the Molecule Dynamics, which makes the simulated cryoET results more reliable. The experiment results shows that our method can pack subcelluar structures more tightly. The simulated cryoET data is easy to be labeled and able to improve DoG particle picking method and bioimage processing methods based on Machine Learning.
In the future, we will generate cryoET datasets using this method to provide help of all the researchers in this research field. At the same time, we will focus on improving the Molecule Dynamics model to make it more accurate. Futhermore, We will try to improve the density map merging process of deformable structures and nondeformable structures to make the simulated result clearer.
Availability of data and materials
The macromolecular complexes can be found at the Protein Data Bank (the Research Collaboratory for Structural Bioinformatics: http://www.rcsb.org/pdb).
Abbreviations
 cryoET:

Cryoelectron tomography
 PDB:

Protein data bank
 SNR:

Signal to noise ratio
References
 1
Irobalieva RN, Martins B, Medalia O. Cellular structural biology as revealed by cryoelectron tomography. J Cell Sci. 2016; 129(3):469–76.
 2
FernandezLeiro R, Scheres SH. Unravelling biological macromolecules with cryoelectron microscopy. Nature. 2016; 537(7620):339–46.
 3
Cheng Y, Grigorieff N, Penczek PA, Walz T. A primer to singleparticle cryoelectron microscopy. Cell. 2015; 161(3):438–49.
 4
Oikonomou CM, Chang YW, Jensen GJ. A new view into prokaryotic cell biology from electron cryotomography. Nat Rev Microbiol. 2016; 14(4):205–20.
 5
Wang F, Gong H, Liu G, Li M, Yan C, Xia T, Li X, Zeng J. DeepPicker: a deep learning approach for fully automated particle picking in cryoEM. J Struct Biol. 2016; 195(3):325–36.
 6
Frank J. Advances in the field of singleparticle cryoelectron microscopy over the last decade. Nat Protoc. 2017; 12(2):209–12.
 7
Liu C, Zeng X, Lin R, Liang X, Freyberg Z, Xing E, Xu M. Deep learning based supervised semantic segmentation of electron cryosubtomograms. In: 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE: 2018. p. 1578–82.
 8
Guay MD, Emam ZA, Anderson AB, Leapman RD. Transfer learning for efficient segmentation of subcellular structures in 3D electron microscopy. Biophys J. 2019; 116(3):288.
 9
Li R, Zeng X, Sigmund SE, Lin R, Zhou B, Liu C, Wang K, Jiang R, Freyberg Z, Lv H, Xu M. Automatic localization and identification of mitochondria in cellular electron cryotomography using fasterRCNN. BMC Bioinformatics. 2019; 20(3):75–85.
 10
Kmiecik S, Gront D, Kolinski M, Wieteska L, Dawid AE, Kolinski A. Coarsegrained protein models and their applications. Chem Rev. 2016; 116(14):7898–936.
 11
de Jong DH, Singh G, Bennett WD, Arnarez C, Wassenaar TA, Schafer LV, Periole X, Tieleman DP, Marrink SJ. Improved parameters for the martini coarsegrained protein force field. J Chem Theory Comput. 2012; 9(1):687–97.
 12
Bond PJ, Holyoake J, Ivetac A, Khalid S, Sansom MS. Coarsegrained molecular dynamics simulations of membrane proteins and peptides. J Struct Biol. 2007; 157(3):593–605.
 13
Tozzini V. Coarsegrained models for proteins. Curr Opin Struct Biol. 2005; 15(2):144–50.
 14
Al Nasr K, Chen L, Si D, Ranjan D, Zubair M, He J. Building the initial chain of the proteins through de novo modeling of the cryoelectron microscopy volume data at the medium resolutions. In: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM: 2012. p. 490–7.
 15
Pierce BG, Hourai Y, Weng Z. Accelerating protein docking in ZDOCK using an advanced 3D convolution library. PloS ONE. 2011; 6(9):24657.
 16
Lindow N, Baum D, Bondar AN, Hege HC. Exploring cavity dynamics in biomolecular systems. BMC Bioinformatics. 2013; 14(19):5.
 17
Ellis RJ. Macromolecular crowding: an important but neglected aspect of the intracellular environment. Curr Opin Struct Biol. 2001; 11(1):114–9.
 18
Pei L, Xu M, Frazier Z, Alber F. Simulating cryo electron tomograms of crowded cell cytoplasm for assessment of automated particle picking. BMC Bioinformatics. 2016; 17(1):405.
 19
Liu S, Wang B, Ban X. Multiplescale simulation method for liquid with trapped air under particlebased framework. In: 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE: 2020. p. 842–50.
 20
Liu S, Wang X, Ban X, Xu Y, Zhou J, Zhang Y. Viscositybased vorticity correction for turbulent sph fluids. In: 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE: 2019. p. 1048–9.
 21
Wang X, Liu S, Ban X, Xu Y, Zhou J, Wang C. Recovering turbulence details using velocity correction for sph fluids. In: SIGGRAPH Asia 2019 Technical Briefs: 2019. p. 95–8.
 22
Peng K, Zheng L, Xu X, Lin T, Leung VC. Balanced iterative reducing and clustering using hierarchies with principal component analysis (PBIRCH) for intrusion detection over big data in mobile cloud environment. In: International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage. Springer: 2018. p. 166–77.
 23
Lahari K, Murty MR, Satapathy SC. Partition based clustering using genetic algorithm and teaching learning based optimization: performance analysis. In: Emerging ICT for Bridging the FutureProceedings of the 49th Annual Convention of the Computer Society of India CSI Volume 2. Springer: 2015. p. 191–200.
 24
Kriegel HP, Kröger P, Sander J, Zimek A. Densitybased clustering. Wiley Interdiscip Rev Data Min Knowl Disc. 2011; 1(3):231–40.
 25
Amini A, Wah TY, Saybani MR, Yazdi SRAS. A study of densitygrid based clustering algorithms on data streams. In: 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD). IEEE: 2011. p. 1652–6.
 26
McNicholas PD. Modelbased clustering. J Classif. 2016; 33(3):331–73.
 27
Mohri M, Rostamizadeh A, Talwalkar A. Foundations of Machine Learning: MIT press; 2018.
 28
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res. 2000; 28(1):235–42.
 29
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikitlearn: machine learning in Python. J Mach Learn Res. 2011; 12:2825–30.
 30
Phillips JC, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C, Skeel RD, Kale L, Schulten K. Scalable molecular dynamics with NAMD. J Comput Chem. 2005; 26(16):1781–802.
 31
Slabaugh GG. Computing euler angles from a rotation matrix. Retrieved on August. 1999; 6(2000):39–63.
 32
Wriggers W, Milligan R, McCammon J. Situs: A package for the docking of protein crystal structures to lowresolution maps from electron microscopy. In: BIOPHYSICAL JOURNAL. BIOPHYSICAL SOCIETY 9650 ROCKVILLE PIKE, BETHESDA, MD 208143998 USA: 1999. p. 23.
 33
Humphrey W, Dalke A, Schulten K. VMD: visual molecular dynamics. J Mol Graph. 1996; 14(1):33–8.
 34
Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE. UCSF Chimera—a visualization system for exploratory research and analysis. J Comput Chem. 2004; 25(13):1605–12.
 35
Xu M, Singla J, Tocheva EI, Chang YW, Stevens RC, Jensen GJ, Alber F. De novo structural pattern mining in cellular electron cryotomograms. Structure. 2019; 27(4):679–91.
 36
Zeng X, Leung MR, ZeevBenMordehai T, Xu M. A convolutional autoencoder approach for mining features in cellular electron cryotomograms and weakly supervised coarse segmentation. J Struct Biol. 2018; 202(2):150–60.
 37
Zeng X, Xu M. GumNet: Unsupervised geometric matching for fast and accurate 3D subtomogram image alignment and averaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: 2020. p. 4073–84.
 38
Zeng X, Xu M. AITom: Opensource AI platform for cryoelectron Tomography data analysis. arXiv preprint arXiv:1911.03044. 2019.
 39
Hrabe T, Chen Y, Pfeffer S, Cuellar LK, Mangold AV, Förster F. PyTom: a pythonbased toolbox for localization of macromolecules in cryoelectron tomograms and subtomogram analysis. J Struct Biol. 2012; 178(2):177–88.
 40
Moebel E, MartinezSanchez A, Lariviere D, Fourmentin E, Ortiz J, Baumeister W, Kervrann C. Deep learning improves macromolecules localization and identification in 3D cellular cryoelectron tomograms. bioRxiv. 2020.
Acknowledgements
The authors acknowledge Deen Zhang, Jiawei Han and Hongchun li, Yifan Wu for their thoughtful discussion. The author acknowledge Wenqian Kang and Xinyu Liu from School of Precision Instruments and Optoelectronics Engineering in Tianjin University for their help in processing data.
Funding
This work was supported in part by National Natural Science Foundation of China grants 61873299 and 61572075 (to XB), Beijing Top Discipline for Artificial Intelligent Science and Engineering, University of Science and Technology Beijing (to XB). The computing work is (partly) supported by MAGICOM Platform of Beijing Advanced Innovation Center for Materials Genome Engineering (to XB). The work was supported in part by U.S. National Institutes of Health grant P41GM103712 (to MX), U.S. National Science Foundation (NSF) grants DBI1949629 (to MX), Mark Foundation for Cancer Research grant 19044ASP (to MX). SL was supported by China Scholar Council (CSC). XZ was supported by a predoctoral fellowship from Carnegie Mellon University’s Center for Machine Learning and Health. This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/260201 and URF/1/300701(to XG). The founder played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Affiliations
Contributions
MX conceived the study. MX and XB provided guidance for this project. SL and MX proposed the method and design the study. SL, HZ, FZ, FC and WW implement the simulation methods, carried out the simulation experiments and analyzed the result with the help of MX and XZ. SL wrote the manuscript with the help of MX, XG, XZ and TH.
During revision process, SL, XZ, FZ, WW, YG revised the paper under the guidnace of MX and XB. FZ helped finish the simulation experiment. WW helped finish the supplementary document. YG helped finish the “Image Processing Method Testing” section. All authors have read and approved the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that one of the authors, Xin Gao, is a member of the editorial board (Associate Editor) of this journal.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplementary document. Section S1: The visualization of 20 selected macromolecules. Section S2: Kmeans method. Section S3: The calculation of PDB volume. Section S4: The network architecture of EDSSN3D.
Additional file 2
A video of all the experiments in this paper, including the visualization of the packing process, the simulated density map and the simulated cryoET.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liu, S., Ban, X., Zeng, X. et al. A unified framework for packing deformable and nondeformable subcellular structures in crowded cryoelectron tomogram simulation. BMC Bioinformatics 21, 399 (2020). https://doi.org/10.1186/s1285902003660w
Received:
Accepted:
Published:
Keywords
 Cryoelectron tomography
 Unified packing
 Coarsegraining
 Molecular dynamics