 Research article
 Open Access
Robustness of signal detection in cryo-electron microscopy via a bi-objective-function approach
BMC Bioinformatics volume 20, Article number: 169 (2019)
Abstract
Background
The detection of weak signals and selection of single particles from low-contrast micrographs of frozen hydrated biomolecules by cryo-electron microscopy (cryo-EM) represents a major practical bottleneck in cryo-EM data analysis. Template-based particle picking by an objective function using fast local correlation (FLC) allows computational extraction of a large number of candidate particles from micrographs. Another independent objective function based on maximum likelihood estimates (MLE) can be used to align the images and verify the presence of a signal in the selected particles. Despite the widespread applications of the two objective functions, an optimal combination of their utilities has not been exploited. Here we propose a bi-objective function (BOF) approach that combines both FLC and MLE and explore the potential advantages and limitations of BOF in signal detection from cryo-EM data.
Results
The robustness of the BOF strategy in particle selection and verification was systematically examined with both simulated and experimental cryo-EM data. We investigated how the performance of the BOF approach is quantitatively affected by the signal-to-noise ratio (SNR) of cryo-EM data and by the choice of initialization for FLC and MLE. We quantitatively pinpointed the critical SNR (~0.005), at which the BOF approach starts losing its ability to select and verify particles reliably. We found that the use of a Gaussian model to initialize the MLE suppresses the adverse effects of reference dependency in the FLC function used for template matching.
Conclusion
The BOF approach, which combines two distinct objective functions, provides a sensitive way to verify particles for downstream cryo-EM structure analysis. Importantly, reference dependency of the FLC does not necessarily transfer to the MLE, enabling the robust detection of weak signals. Our insights into the numerical behavior of the BOF approach can be used to improve automation efficiency in the cryo-EM data processing pipeline for high-resolution structural determination.
Background
Cryo-electron microscopy (cryo-EM) has recently emerged as a mainstream approach for high-resolution structure determination of biological macromolecules [1]. Image formation in electron microscopy is understood as the weak-phase approximation of thin, electron-penetrable objects [2]. The electron image formed after the objective lens is a convolution of the exit wave function passing through the object with the point spread function of the objective lens [2]. The phase-contrast transfer function (CTF), which is the Fourier transform of the point spread function of the objective lens, gives rise to a trade-off between the resolution and the contrast of the image [3]. To image biomolecular structures in their native states by cryo-EM, the molecules of interest are flash-frozen in a thin layer of amorphous ice suspended over holes in a perforated carbon film. Thus, the biomolecular objects are surrounded by imaging noise from electrons scattered by the amorphous ice. Another thin carbon film over the holes may also be used as a support to enrich biomolecules for cryo-EM; in this case, the carbon film adds further noise. Moreover, additional noise may be introduced in the process of electron signal transfer into the recording medium, such as detection noise in a CCD camera and electron-counting noise in a direct electron detector. The strong background ice noise, together with the weak-phase approximation in image formation, results in extremely low signal-to-noise ratios (SNR), which are often in the range of 0.005–0.05. Therefore, the determination of cryo-EM structures of biomolecules at high resolution requires that a large number of single-particle images, often on the scale of hundreds of thousands to a million, are acquired, aligned and averaged to remove background image noise in signal reconstruction.
Due to the required large number of images, the selection of single particles from noisy cryo-EM micrographs represents a major practical bottleneck. Since manual selection can be very time-consuming and is prone to errors resulting from subjective factors, a number of automated approaches have been investigated. Computerized procedures for signal detection in single-particle cryo-EM involve two steps: particle picking and particle verification [4,5,6]. A number of algorithms have been developed to automate template-matching procedures for particle picking. However, these procedures require subsequent manual selection of particles, in some cases with the help of data clustering to expedite the rejection of false positives [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. A popular implementation of these template-matching methods is based on the cross-correlation function, in which the fast local correlation (FLC) is calculated between a template image and an equally sized local area of the cryo-EM micrograph [8, 12, 13]. A disadvantage of the FLC function lies in its sensitivity to noise, which can create false correlation peaks that do not result from real signals. Furthermore, the outcome of cross-correlation algorithms may be influenced by the alignment of noise to the template used as a reference, known as “reference bias” or “reference dependency” [23].
Maximum likelihood estimation (MLE), which exhibits reduced susceptibility to reference bias compared to the cross-correlation algorithm [24, 25], has been used to evaluate the homogeneity of the picked particles by multi-reference image alignment [26, 27]. In principle, the use of two mathematically distinct objective functions in signal recognition can serve as a test of the robustness of the image analysis and a verification of the detected signals, since reference dependency is not expected to be reproduced in the same way by two different objective functions. The combination of one objective function (FLC) for particle picking and another (MLE) for particle alignment may allow the reconstitution of the true signal from the selected images. However, despite the application of both FLC and MLE in single-particle analysis of cryo-EM structures [22, 28,29,30,31,32], it remains unknown how the bi-objective function (BOF) scheme performs in terms of various control parameters, such as signal-to-noise ratio (SNR) and initialization inputs.
Beyond FLC and MLE, several machine-learning approaches, such as deep learning based on convolutional neural networks, have been applied to address the problem of signal detection in cryo-EM data [20, 33,34,35,36]. These approaches not only relieve the burden of post-picking manual selection [20, 33], but also work in a template-free fashion [34,35,36]. However, these advantages come at a significant computational cost. Thus, except for a few cases dealing with highly dynamic complex machineries that have benefitted from the deep-learning-based particle selection approach [37,38,39], most high-resolution cryo-EM structures published to date have relied heavily on FLC-based particle picking [40,41,42].
In the present study, we systematically evaluated how the performance of the BOF approach is affected by three variables: (1) the SNR of the cryo-EM data, (2) the choice of the template used for particle picking, and (3) the initialization reference used in MLE alignment for signal verification. We quantitatively characterized the performance and robustness of the BOF approach with simulated micrographs exhibiting a wide range of SNRs, as well as with real-world cryo-EM data of a 173-kDa glucose isomerase. We performed comparative BOF studies with different references to investigate how the adverse effect of reference dependency incurred by the use of the FLC may be suppressed by the application of the MLE initialized using a Gaussian model.
Methods
A brief review on objective functions used for signal alignment
Within a set of N single-particle images, each of which is a noisy, translated and rotated copy of the underlying 2D projection structure A, the ith image can be represented by the equation
\[ X_i = R(\phi_i)\, A + \sigma\, G_i \qquad (1) \]
where X_{i} is the observed ith image comprising J pixels with values X_{ij}; R(ϕ_{i}) denotes the in-plane transformation depending on the parameter vector ϕ_{i} = (α_{i}, x_{i}, y_{i}) that comprises a rotation α_{i} and two translations x_{i} and y_{i} along two orthogonal directions; A is the underlying signal with pixel values A_{j} that is common to all images; G_{i} is Gaussian noise with unit standard deviation, further scaled by a scalar factor σ. Because the parameter vector ϕ_{i} is experimentally unknown, the problem of image alignment is to determine a set of parameter vectors Φ = { \( {\phi}_i^{(n)} \); i = 1, 2, … N} that allows an optimal estimate of the underlying true signal through averaging of these images,
\[ A^{(n)} = \frac{1}{N} \sum_{i=1}^{N} R^{-1}\left(\phi_i^{(n)}\right) X_i \qquad (2) \]
in which \( {R}^{-1}\left({\phi}_i^{(n)}\right) \) is the reverse transformation that brings the image X_{i} to the common orientation and position of A. This image alignment problem may be mathematically translated into different optimization problems. Two main types of mathematical translations have emerged in past studies [24, 43]. In the first type, the image alignment problem was addressed by maximizing the squared magnitude of the summed images [43], which can be described as
\[ \max_{\Phi} \left\| \sum_{i=1}^{N} R^{-1}(\phi_i)\, X_i \right\|^2 \qquad (3) \]
The maximum of this function is equivalent to the minimization of the least-squares target
\[ \min_{\Phi,\, A} \sum_{i=1}^{N} \left\| X_i - R(\phi_i)\, A \right\|^2 \qquad (4) \]
A local minimization of this function can be obtained by iteratively maximizing the cross-correlation between each image and the average,
\[ \phi_i^{(n+1)} = \arg\max_{\phi} \left[ R^{-1}(\phi)\, X_i \cdot A^{(n)} \right] \qquad (5) \]
Here, the dot indicates an inner product between two images \( \boldsymbol{X}\cdot \boldsymbol{A}={\sum}_{k=1}^J{x}_k{a}_k \). An approximate solution may be obtained by iteratively estimating the underlying signal A^{(n)} and the alignment parameter \( {\phi}_i^{(n)} \) according to eqs. (3) and (5).
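The alternation between averaging and cross-correlation alignment described above can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes that the in-plane transformation is restricted to circular translations (no rotations), so that the inner product over all shifts can be evaluated with one FFT, and the function names `best_shift` and `align_by_cross_correlation` are hypothetical.

```python
import numpy as np

def best_shift(image, reference):
    """Circular shift (dy, dx) of `image` that maximizes its inner product with `reference`."""
    # Correlation theorem: cc[s] = sum_k image[k] * reference[k + s]
    cc = np.fft.ifft2(np.conj(np.fft.fft2(image)) * np.fft.fft2(reference)).real
    return np.unravel_index(np.argmax(cc), cc.shape)

def align_by_cross_correlation(images, n_iter=5):
    """Alternate between averaging the aligned images and re-aligning each image to the average."""
    aligned = images.copy()
    for _ in range(n_iter):
        average = aligned.mean(axis=0)  # current estimate of the signal A
        # Deterministic assignment: each image gets exactly one best shift.
        aligned = np.stack(
            [np.roll(x, best_shift(x, average), axis=(0, 1)) for x in images]
        )
    return aligned.mean(axis=0), aligned
```

Because every image is assigned a single best shift, noise that happens to correlate with the current average is pulled into register with it, which is one way to picture the reference dependency discussed in the Background.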
In the second type, the image alignment problem is interpreted as a maximum-likelihood estimate (MLE) of the signal A, i.e. the maximization of the probability function
\[ \mathcal{L}\left(\varTheta \right) = \prod_{i=1}^{N} P\left(X_i \mid \varTheta \right) \qquad (6) \]
whereby P(X_{i} | Θ) is the probability density function observed for the image X_{i} given the set of model parameters Θ = (A, σ, ξ), where ξ characterizes the statistics of R(ϕ_{i}). In this case, the alignment parameters Φ = { ϕ_{i}; i = 1, 2, … N} are treated as latent variables. The maximization of the probability function \( \mathcal{L}\left(\varTheta \right) \) is more conveniently replaced by its logarithm
\[ L\left(\varTheta \right) = \ln \mathcal{L}\left(\varTheta \right) = \sum_{i=1}^{N} \ln P\left(X_i \mid \varTheta \right) \qquad (7) \]
A local maximum of the log-likelihood function L(Θ) can be obtained by finding the value of Θ at which the partial derivatives of L(Θ) are zero. The problem of finding the maximum likelihood can be numerically tackled through the expectation-maximization algorithm. This algorithm is an iterative method that alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate for the model parameters, and a maximization (M) step, which computes model parameters maximizing the expected log-likelihood found in the E-step [24]. These estimates of parameters are then used to determine the distribution of the latent variables in the next E-step. In each E-step, the observed data X_{i} and the current estimates of model parameters Θ^{(n)} are used to calculate the expectation of the log-likelihood function as
\[ Q\left(\varTheta, \varTheta^{(n)}\right) = \sum_{i=1}^{N} \int P\left(\phi \mid X_i, \varTheta^{(n)}\right) \ln P\left(X_i, \phi \mid \varTheta \right) d\phi \qquad (8) \]
Under the assumption of a Gaussian distribution of the latent variables Φ = { ϕ_{i}; i = 1, 2, … N} and the observed signal, this gives rise to
\[ Q\left(\varTheta, \varTheta^{(n)}\right) = \sum_{i=1}^{N} \int P\left(\phi \mid X_i, \varTheta^{(n)}\right) \left[ -\frac{\left\| X_i - R(\phi)\, A \right\|^2}{2\sigma^2} - J \ln \sigma + \ln P\left(\phi \mid \xi \right) \right] d\phi + \mathrm{const} \qquad (9) \]
In the M-step, Q(Θ, Θ^{(n)}) is maximized with respect to the model parameters
\[ \varTheta^{(n+1)} = \arg\max_{\varTheta} Q\left(\varTheta, \varTheta^{(n)}\right) \qquad (10) \]
which corresponds to the minimization of a weighted least-squares target with a weight of P(ϕ | X_{i}, Θ^{(n)}) for each image. Note that this is in marked contrast to eq. (4). The estimate of the signal therefore is a weighted average including contributions from all possible values of ϕ for every image X_{i}, so that the class averages can be updated in a probability-weighted manner
\[ A^{(n+1)} = \frac{1}{N} \sum_{i=1}^{N} \int P\left(\phi \mid X_i, \varTheta^{(n)}\right) R^{-1}(\phi)\, X_i \, d\phi \qquad (11) \]
All other model parameters in Θ^{(n + 1)} are updated in the M-step similarly as probability-weighted averages [24].
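To make the probability-weighted update concrete, the following sketch performs one expectation-maximization iteration for the signal estimate only. It is a simplified illustration under stated assumptions rather than the Xmipp implementation: the images are 1D, the latent transformation is a circular shift with a uniform prior, and the noise level `sigma` is held fixed instead of being re-estimated. The name `em_update` is hypothetical.

```python
import numpy as np

def em_update(images, reference, sigma):
    """One EM iteration for the signal estimate with circular 1D shifts as latent variables."""
    n_images, n_pixels = images.shape
    new_reference = np.zeros(n_pixels)
    for x in images:
        # Every candidate alignment of this image: candidates[s] = x shifted by s pixels.
        candidates = np.stack([np.roll(x, s) for s in range(n_pixels)])
        # E-step: posterior weight of each shift under Gaussian noise and a uniform prior.
        sq_dist = ((candidates - reference) ** 2).sum(axis=1)
        log_w = -sq_dist / (2.0 * sigma ** 2)
        w = np.exp(log_w - log_w.max())  # subtract the maximum for numerical stability
        w /= w.sum()
        # M-step contribution: probability-weighted average over all shifts.
        new_reference += w @ candidates
    return new_reference / n_images
```

When `sigma` is very small the weights collapse onto a single shift per image and the update behaves like the deterministic average of the cross-correlation approach; when `sigma` is large, many shifts contribute and no single alignment dominates the estimate.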
It is also necessary to consider the mathematical relationships and differences between the image alignment approaches. First, in recovering the signal A, the latter approach uses a probability-weighted average instead of the deterministic average used in the former approach, as illustrated by the differences between eqs. (2) and (11). Second, if one assumes that the estimate of the hidden variable Φ is deterministic instead of probabilistic, P(ϕ_{i} | X_{i}, Θ^{(n)}) adopts the form of a Dirac δ-function. Under this condition, the maximization of the log-likelihood function shown in eq. (9) is simplified to the minimization of the least-squares target shown in expression (5), instead of the probability-weighted least-squares target in eq. (9). At the same time, the estimate of the signal by eq. (11) can be reduced to eq. (2). Third, despite this conditional equivalence in terms of numerical optimization, the two approaches adopt essentially different objective functions that include different variables and parameters, as evidenced by a comparison of eqs. (5) and (8). Importantly, all model parameters Θ = (A, σ, ξ) are re-estimated during each iteration of optimization in the latter approach, whereas only one type of model parameter, A, is re-estimated during the course of optimization in the former approach.
Previously proposed solutions to the particle-picking problem were mostly derived from the cross-correlation-based approach. In a typical case, the locally normalized correlation function is calculated between a search object S (template) and target micrograph T under the footprint of a mask M [8]:
\[ C_L(x) = \frac{1}{P} \sum_{k} \frac{\left(S_k - \overline{S}\right) M_k \left(T_{k+x} - \overline{T}\right)}{\sigma_S\, \sigma_{MT}(x)} \qquad (12) \]
where \( \overline{S} \) and σ_{S} are the average and standard deviation of the search object S_{k}; \( \overline{T} \) and σ_{MT} are the local average and standard deviation of T within the footprint of mask M; x is the position of the footprint of mask M, and P is the total number of nonzero points inside the mask. If \( \overline{S} \) and σ_{S} are set to zero and unity, respectively, eq. (12) is reduced to
\[ C_L(x) = \frac{1}{P\, \sigma_{MT}(x)} \sum_{k} S_k M_k T_{k+x} \qquad (13) \]
The local standard deviation of T can be calculated via
\[ \sigma_{MT}^2(x) = \frac{1}{P} \sum_{k} M_k T_{k+x}^2 - \left[ \frac{1}{P} \sum_{k} M_k T_{k+x} \right]^2 \qquad (14) \]
This and other similar implementations of a particle-picking strategy have been collectively referred to as “template matching”. As the image size of S is much smaller than that of T, the local cross-correlation is calculated with the mask M raster-scanning across the entire micrograph to produce a cross-correlation map. The local maxima in the correlation map are identified, ranked, and used to indicate the positions of the picked candidate particle images. The FLC function expressed in eq. (13) has led to a more efficient implementation of a computational particle-picking procedure [8, 12, 13].
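As an illustration of how a locally normalized correlation map can be computed efficiently, the sketch below evaluates all three sums over the mask footprint with FFT-based circular correlations. It is a minimal NumPy example, not the SPIDER procedure: it assumes periodic boundary conditions, a single unrotated template, and a binary mask given as a 0/1 float array, and the function names are hypothetical.

```python
import numpy as np

def _correlate(padded_kernel, image):
    """out[x] = sum_k kernel[k] * image[k + x], with circular wrap-around."""
    return np.fft.ifft2(np.conj(np.fft.fft2(padded_kernel)) * np.fft.fft2(image)).real

def fast_local_correlation(micrograph, template, mask):
    """Locally normalized correlation of `template` with `micrograph` under `mask`."""
    micrograph = np.asarray(micrograph, dtype=float)
    p = mask.sum()  # number of nonzero points inside the mask

    # Normalize the template to zero mean and unit standard deviation under the mask.
    s = (template - (template * mask).sum() / p) * mask
    s = s / np.sqrt((s ** 2).sum() / p)

    def pad(a):
        out = np.zeros_like(micrograph)
        out[: a.shape[0], : a.shape[1]] = a
        return out

    # Local mean and standard deviation of the micrograph under the mask footprint.
    local_mean = _correlate(pad(mask), micrograph) / p
    local_sq = _correlate(pad(mask), micrograph ** 2) / p
    local_std = np.sqrt(np.maximum(local_sq - local_mean ** 2, 1e-12))

    # Entry [y, x] refers to the footprint whose top-left corner is at (y, x).
    return _correlate(pad(s), micrograph) / (p * local_std)
```

With this normalization, a window that is an exact scaled and offset copy of the template under the mask gives a correlation of 1, which is the upper bound of the map. Sorting the local maxima of the returned map from high to low gives the kind of ranking used for candidate selection.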
As explained above, the FLC and MLE functions differ notably in their mathematical forms for signal recognition. In the absence of noise, the cross-correlation function and MLE should both lead to the same solution for the image alignment problem [24]. However, in the presence of noise, the FLC and MLE behave differently [24]. The FLC is very fast and efficient in computation. However, it demonstrates an increasing propensity to identify false-positive particles or introduce misalignment as the SNR decreases [8, 12, 13]. By contrast, at the expense of significantly more computational power, the exhaustive probability search across parameter space in the MLE substantially reduces the effect of false positives over the iterations of the expectation-maximization algorithm. The probability-weighted averages further limit the contribution of false positives and misalignment to the estimation of the signal. Therefore, the FLC and MLE are complementary to each other in their responses to noise, as well as in their computational efficiency.
Procedure of the BOF approach
Throughout this study, the following BOF-based procedure was applied to 26 datasets of either pure noise or simulated micrographs of the trimeric ectodomain of the influenza hemagglutinin (HA) glycoprotein [44], as well as an experimental dataset of focal-pair micrographs of the 173-kDa glucose isomerase complex. The BOF strategy and an implementation of the BOF procedure are shown in Fig. 1a and b, respectively.
Step 1: Particle picking by fast local cross-correlation
We used template matching by FLC implemented in SPIDER to pick particles [45]. The SPIDER system is a comprehensive software package for image processing that supports rapid scripting to handle batch processing of cryo-EM data [45]. The SPIDER script lfc_pick.spi has already been applied to the ribosome [12] and has served as a control for the recent development of a reference-free particle-picking approach [35]. This procedure applies the FLC function to particle recognition [8]. In this study, we picked particles using single 2D templates, as described in the specific experiments below. Note that previous studies have shown that using the FLC function with a single template can pick many views of particles [12]. Nonetheless, it has been suggested that using more templates can potentially reduce the number of false positives that are picked [8, 12, 13].
Step 2: Candidate particle selection using a threshold in the ranking of correlation peaks and manual rejection of obvious artifacts
The SPIDER particle-picking program lfc_pick.spi sorts and ranks the picked particles according to their correlation peaks, from high to low peak values. Upon sorting and ranking, the potential true particles often appear at higher correlation peak values and the pure noise images at lower correlation peaks. A threshold that approximately demarcates the boundary between the potential true particles and pure noise can be used to select the initial candidate particles, followed by manual inspection of each particle and rejection of obvious artifacts. The rejection of suspected artifacts and false positives can be done in batch mode if the picked particles are grouped into many 2D classes by multivariate statistical analysis or unsupervised clustering [15, 19, 46, 47].
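The ranking and thresholding step can be sketched as follows. The rule used here to place the threshold, namely cutting at the largest drop between consecutive ranked peak values, is only one illustrative heuristic and is an assumption of this example; in practice the threshold is chosen by inspection, and the function name `rank_and_threshold` is hypothetical.

```python
def rank_and_threshold(peaks):
    """Rank correlation peaks from high to low and cut at the largest drop-off.

    Returns the indices of the selected candidates (in rank order) and the cut rank.
    """
    order = sorted(range(len(peaks)), key=lambda i: peaks[i], reverse=True)
    ranked = [peaks[i] for i in order]
    # Drop in peak value between each rank and the next one.
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    cut = max(range(len(drops)), key=drops.__getitem__) + 1
    return order[:cut], cut
```

A heuristic of this kind is only meaningful when the ranked peak values show a clear drop-off; at low SNR the drop-off becomes smooth and a fixed number of candidates per micrograph is a more predictable choice.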
Step 3: Particle validation by MLE alignment with multiple classes
Image similarity measured via the MLE-based probability, and the subsequently calculated class averages obtained by integrating over all probabilities, are more sensitive to the presence of true signals [24]. The particles belonging to the class averages that clearly exhibit the expected signal features are chosen for further processing; the particles in the class averages that are suspicious or apparently artefactual may then be discarded. This step provides an opportune checkpoint to efficiently remove non-particles in batch mode.
BOF testing of simulated and experimental noise micrographs
To conduct a baseline control, we first simulated 200 micrographs containing only Gaussian noise using the SPIDER command MO (option R with Gaussian distribution). Each micrograph had dimensions of 4096 × 4096 pixels. We then used one projection view of the ~11-Å human immunodeficiency virus (HIV-1) envelope glycoprotein (Env) trimer [28] as a template for particle picking from the simulated Gaussian-noise micrographs. The box size was 256 × 256 pixels. Although the micrographs can be binned twice or 4 times to speed up the computational procedure of particle picking by FLC, it is necessary to extract the particles from unbinned original micrographs because they are required for high-resolution 3D reconstruction in later steps in an actual scenario of structure determination [48]. In each micrograph, about 20–25 boxed images of the highest local correlation peaks were selected to assemble a particle stack of 4485 images. After particle picking and selection, each particle image was downscaled 4-fold to 64 × 64 pixels using xmipp_scale, and normalized using xmipp_normalize [49]. Subsequent MLE alignment using xmipp_ml_align2d was repeated with three different starting references: (1) a noise image randomly chosen from the entire image stack, which contains weak signal that is likely to introduce some initiation bias; (2) a Gaussian circle, which follows a Gaussian distribution in radial intensity and does not introduce any prior bias to the reference; and (3) an average of a random subset of the unaligned images that replicates the template used for particle picking, which can be used to test the reference dependency of the MLE alignment. Comparison among these three cases allows us to examine whether and how the initial reference used for MLE impacts the potential capability of MLE to suppress reference dependency introduced during FLC-based particle picking.
To repeat the above BOF test on real-world experimental ice noise, we imaged a cryo-grid that was flash-frozen from a buffer containing no protein sample. The composition of the buffer was 20 mM Tris-HCl, pH 7.4, 300 mM NaCl and 0.01% Cymal-6 (Anatrace, USA). This was the same buffer used for vitrifying the HIV-1 Env trimer for its cryo-EM structural analysis [28, 32]. The cryo-grid was made from a C-flat holey carbon grid using the FEI Vitrobot Mark IV (Thermo Fisher Scientific, USA). The data were collected on an FEI Tecnai G2 F20 microscope (Thermo Fisher Scientific, USA) operating at 120 kV, equipped with a Gatan Ultrascan 4096 × 4096-pixel CCD camera (Gatan, USA), at a nominal magnification of 80,000×. We selected 218 micrographs of pure ice noise collected in one cryo-EM session. The same particle-picking procedure performed with the simulated Gaussian noise micrographs (see above) was applied to the experimental ice noise micrographs, with the same HIV-1 Env trimer template. After particle picking, the apparent ice-crystal contaminants were manually rejected from the particle set, leaving only images of amorphous ice noise. By selecting only about 10–25 boxed images with the highest local correlation peaks from each micrograph, a particle stack of 4591 images was assembled, and was subjected to the same MLE alignment as described above for the data from the simulated Gaussian noise micrographs. These BOF tests on both the simulated and experimental pure noise micrographs (Fig. 2) served as controls for the subsequent examination of the BOF approach.
BOF testing of simulated micrographs
Throughout this study, the SNR was defined as the ratio of signal variance to noise variance [3, 50],
\[ \mathrm{SNR} = \frac{P_{Signal}}{P_{Noise}} = \frac{\sigma_{Signal}^2}{\sigma_{Noise}^2} \qquad (15) \]
When the background noise has a mean value of zero, its power P_{Noise} equals its variance \( {\sigma}_{Noise}^2 \). In single-particle cryo-EM images, the particles are located at different positions in the micrographs and carry the signal. When the mean value of the signal is normalized to zero, P_{Signal} becomes equal to \( {\sigma}_{Signal}^2 \), and the power ratio of signal to noise thus equals the variance ratio. The SNR of a micrograph was calculated as the power ratio of the signal from all the particles to the background noise in this micrograph. For the SNR of a single-particle image, the noise variance was calculated on a boxed background area without any particle, and the signal variance was calculated on the particle image of the same box size without background noise.
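The variance-ratio definition translates directly into a recipe for adding noise at a prescribed SNR. The sketch below is an illustration of that definition only; it assumes zero-mean Gaussian noise whose standard deviation is derived from the variance of the noiseless image, which may differ in detail from the SPIDER-based procedure used for the simulations. The function names are hypothetical.

```python
import numpy as np

def snr(signal, noise):
    """SNR as the ratio of signal variance to noise variance."""
    return signal.var() / noise.var()

def add_gaussian_noise(signal, target_snr, rng):
    """Add zero-mean Gaussian noise so that var(signal) / var(noise) is about target_snr."""
    sigma_noise = np.sqrt(signal.var() / target_snr)
    noise = rng.normal(0.0, sigma_noise, size=signal.shape)
    return signal + noise, noise
```

For example, calling `add_gaussian_noise(image, 0.005, np.random.default_rng(0))` on a noiseless array `image` produces noise whose variance is about 200 times that of the signal.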
We simulated 120 micrographs of noiseless particles corresponding to the crystal structure of the influenza A virus hemagglutinin (HA) glycoprotein ectodomain (PDB ID: 3HMG) using xmipp_phantom_create_micrograph [44]. The simulation assumed a pixel size of 1.0 Å and micrograph dimensions of 4096 × 4096 pixels. To simulate the aberration effect of the objective lens in electron microscopy, the contrast transfer function (CTF) was applied in the Fourier transform of the simulated noiseless micrographs using a separate SPIDER script. The CTF simulation assumed an acceleration voltage of 200 kV, a defocus of −1 μm, a spherical aberration Cs of 2.0 mm, an amplitude contrast ratio of 10%, and a Gaussian envelope half width of 0.333 Å^{−1}. In each simulated micrograph, there were 323 HA molecules that assumed random orientations. To add different levels of Gaussian noise to the noiseless micrographs, the standard deviation of the background of each micrograph was calculated and used as input to simulate a background Gaussian noise image that was added to the noiseless micrographs. After the addition of Gaussian noise, the simulated micrographs had SNRs of 0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001 or 0.0005. A typical series comprising a simulated noiseless micrograph and the derived noisy micrographs at different SNRs is shown in Additional file 1: Figure S1. A comparison of the corresponding behaviors of the power spectra in Fourier space is shown in Fig. 3. Note that the SNR calculated for an entire micrograph is often lower than the SNR calculated from boxed single-particle images, since there are more empty background areas in the micrograph than in appropriately boxed single-particle images.
For the simulated micrographs at each SNR value, we conducted BOF tests using three different templates for particle picking: a Gaussian circle, one projection view of the influenza virus HA trimer filtered to 30 Å, and one projection view of the HIV-1 Env trimer filtered to 30 Å (Fig. 4). Each set of micrographs with a given SNR and selected by a particular particle-picking template was treated as a separate case. Therefore, there were 8 × 3 = 24 cases studied and compared in our BOF tests. For each case, a stack of 38,760 particle images was assembled from 120 simulated micrographs, based on a selection threshold of 323 particles per micrograph. The original box dimension for particle picking was 180 × 180 pixels. After particle picking and selection, each particle image was first downscaled 3-fold to a dimension of 60 × 60 pixels, normalized for background noise, and subjected to multi-reference MLE classification into 5 classes, using two different initial references: (1) the average of a randomly selected subset of particles (Fig. 5), and (2) a Gaussian circle, which follows a Gaussian distribution in radial intensity (Fig. 6). When extrapolating to the SNR of single-particle images, the SNR of an entire micrograph needs to be multiplied by a factor (> 1), which depends on the particle density and the box size of particles, to make it equivalent to the SNR of single-particle images. Given the aforementioned parameters, the SNRs of the simulated micrographs at 0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001 and 0.0005 correspond to the single-particle SNRs of 0.16, 0.08, 0.032, 0.016, 0.008, 0.0032, 0.0016 and 0.0008, respectively. Throughout the rest of this paper, unless stated explicitly, the “SNR” refers to that of the simulated micrographs instead of the single-particle SNRs.
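The bookkeeping in this paragraph can be checked with a few lines of arithmetic. The snippet below only restates numbers given in the text; the conversion factor between micrograph SNR and single-particle SNR is inferred here from the listed pairs of values and is not quoted as such in the text.

```python
micrograph_snrs = [0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001, 0.0005]
particle_snrs = [0.16, 0.08, 0.032, 0.016, 0.008, 0.0032, 0.0016, 0.0008]
n_templates = 3  # Gaussian circle, HA trimer view, HIV-1 Env trimer view

# Number of (SNR, template) cases and number of picked images per case.
n_cases = len(micrograph_snrs) * n_templates
stack_size = 120 * 323  # micrographs x selection threshold per micrograph

# Ratio of single-particle SNR to micrograph SNR for each listed pair.
factors = [round(p / m, 6) for m, p in zip(micrograph_snrs, particle_snrs)]
```

Every listed pair gives the same ratio of 1.6, so with these simulation parameters the single-particle SNR is 1.6 times the micrograph SNR, and the critical micrograph SNR of about 0.005 corresponds to a single-particle SNR of about 0.008.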
BOF tests on experimental cryo-EM data
We collected an experimental cryo-EM dataset of the 173-kDa glucose isomerase complex (Hampton Research, CA, USA). A 2.5-μl drop of a 3 mg/ml glucose isomerase solution was applied to a glow-discharged C-flat grid (R 1.2/1.3, 400 Mesh, Protochips, CA, USA), and flash-frozen in liquid ethane using the FEI Vitrobot Mark IV (Thermo Fisher Scientific, USA). The cryo-grid was imaged in an FEI Tecnai Arctica microscope (Thermo Fisher Scientific, USA) at a nominal magnification of 21,000× and an acceleration voltage of 200 kV. We selected 95 focal pairs of micrographs collected using a Gatan K2 Summit direct detector camera (Gatan Inc., CA, USA), with a defocus difference of 1.5 μm and a pixel size of 1.74 Å. The actual defocus values of the micrographs were determined through CTFFind3 [51]. The first exposure was taken at a defocus between −1.0 and −3.0 μm. In this defocus range, the visibility of the complexes was marginal, posing difficulties for manual particle identification. The second exposure was taken at a defocus between −3.0 and −5.0 μm. In this defocus range, the particles were more visible. We then used FLC to pick particles directly from the micrographs of the first exposure, and used the second exposure to manually verify the particle selection from the first exposure. Using the first exposure at a lower defocus, which gives lower single-particle SNRs, provides a more stringent test of the robustness of the BOF approach than using the second exposure at a higher defocus.
To perform BOF tests on these cryo-EM data, we assembled three particle stacks (comprising 22,298, 20,632 and 22,828 particles, respectively) using three different templates for particle picking, i.e., a Gaussian circle, one projection view of the glucose isomerase crystal structure (PDB ID: 1OAD) filtered to 30 Å, and one projection view of the HIV-1 Env trimer filtered to 30 Å. Particle images of 90 × 90 pixels, picked by FLC, were phase-flipped to partially correct the CTF effect. The three stacks of particles were normalized for background noise and subjected to multi-reference MLE classification into 5 classes, using two different initial references: (1) the average of a randomly selected subset of particles; and (2) a Gaussian circle, which follows a Gaussian distribution in radial intensity.
Results
BOF tests on simulated and experimental noise
As a control experiment to investigate the ability of the BOF approach to resist reference bias, we conducted BOF tests on simulated micrographs that contain only Gaussian noise. A single 2D projection of the HIV-1 Env trimer was used as the template for picking “particles” by FLC (Objective function A) (Fig. 2a). Images with the highest local correlation peaks were selected and subjected to MLE alignment, using three different starting references for MLE optimization (Objective function B). In the first BOF test, a raw pure noise image randomly chosen from the particle stack was used as the starting reference for MLE optimization (Fig. 2b). Over more than 3000 iterations of MLE alignment, no 2D structure resembling the particle-picking template was observed. The resulting average image in each iteration was still a random noise image. We then used a Gaussian circle as the starting reference to repeat the MLE optimization (Fig. 2c). Again, the resulting average image contained only random noise but no observable 2D model. As the third starting reference for MLE optimization, we used the average of template-selected particle images without any further alignment. Notably, this average closely resembled the HIV-1 Env trimer template used for particle picking (Fig. 2d), and apparently resulted from reference dependency in template-based particle picking by the FLC. When this average image was used as the starting reference for the MLE alignment, the replica of the template faded away in the average image and nearly disappeared upon the convergence of MLE optimization. Thus, the BOF approach can work against reference bias associated with the alignment of pure noise during the particle-picking process, particularly when the MLE verification is conducted using a random noise image or a Gaussian circle as the starting reference. Note that in the above-mentioned test, we performed up to 3000 iterations of MLE optimization. Such a prolonged optimization provides the computation with a greater opportunity to evade local optima and helps to examine the robustness of the convergence [24].
Next, we wanted to know if the results observed with the simulated micrographs of Gaussian noise would be reproduced with images of actual cryo-EM noise resulting from amorphous ice. We repeated the BOF tests on the dataset assembled from experimental ice noise micrographs. When aligned using MLE, with pure noise or a Gaussian circle as the starting reference, no structure was observed after more than 3000 iterations of optimization (Fig. 2e and f). Thus, images of experimental ice noise taken by a CCD camera reproduced the results observed with simulated Gaussian noise, supporting the notion that the experimental cryo-EM noise from amorphous ice basically exhibits Gaussian-like behavior [3]. Particle verification by MLE with starting references comprising random noise or a Gaussian circle effectively removed reference bias arising from the alignment of simulated or experimental noise. By contrast, when the unaligned average of the template-selected images was used as the starting reference for MLE alignment, the structure of the particle-picking template in the class average faded over the iterations of MLE, but was not completely removed by the MLE alignment (Fig. 2g).
FLC performance on simulated micrographs with different SNRs
We further tested the FLC-based particle-picking program on a number of simulated micrograph datasets (Additional file 1: Figure S1). As expected, the visibility of particles was drastically diminished in the images with lower SNRs [52]. Figure 3 shows the power spectra of the simulated micrographs and their corresponding spectral SNRs (SSNRs). We applied a number of contrast-enhancement techniques, including histogram normalization, contrast stretching, low-pass filtering and pixel binning, to the simulated micrographs with different SNRs. We found that these approaches were insufficient to restore unambiguous visibility to particles when the SNR approached 0.005 (Additional file 1: Figure S2). Because the loss of visibility created difficulties with directly verifying the true and false positives in the same micrograph in our particle-picking test, the original noiseless micrograph from which the low-contrast micrograph was derived was used to verify the particle-picking performance (Additional file 1: Figure S3).
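To give a sense of scale for these SNR values, the sketch below adds white Gaussian noise to a noiseless image at a target SNR. It assumes that the SNR is defined as the ratio of signal variance to noise variance, which is a common convention but is not restated in this section. The disc image and the function names are hypothetical, and this is not the simulation procedure used to generate the micrographs in Figure S1.

```python
import math
import random

def variance(values):
    """Population variance of a flat list of pixel values."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def add_gaussian_noise(signal, snr, rng):
    """Return signal plus white Gaussian noise, scaled so that
    var(signal) / var(noise) is approximately snr (assumed SNR definition)."""
    sigma = math.sqrt(variance(signal) / snr)
    return [s + rng.gauss(0.0, sigma) for s in signal]

rng = random.Random(1)
# Hypothetical noiseless image: one bright disc on a 64 x 64 grid, flattened.
size = 64
signal = [1.0 if (r - 32) ** 2 + (c - 32) ** 2 < 100 else 0.0
          for r in range(size) for c in range(size)]
for snr in (0.1, 0.005):
    noisy = add_gaussian_noise(signal, snr, rng)
    noise = [y - s for y, s in zip(noisy, signal)]
    ratio = math.sqrt(variance(noise) / variance(signal))
    print(f"SNR {snr}: noise std is about {ratio:.1f} x signal std")
```

With this definition, an SNR of 0.005 corresponds to a noise standard deviation of about sqrt(1/0.005) ≈ 14 times that of the signal, which is consistent with particles being invisible to the eye even after contrast enhancement.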
Using the noisy micrographs containing the randomly oriented influenza virus HA trimers, we picked particles using three different templates: a Gaussian circle, one projection view of the influenza virus HA trimer, and one projection view of the HIV-1 Env trimer. Figures 4a–c show the plots of the correlation peaks versus the rank numbers of the picked particles. Notably, when the Gaussian circle was used as a template (Fig. 4a), the plots corresponding to SNRs of 0.1, 0.05, 0.02 and 0.01 showed a clear-cut drop-off in the value of the correlation peak at a rank of 323, which was the number of actual simulated particles in each micrograph [4]. All of these 323 particles with high correlation peak values were confirmed to be true positives. When the Gaussian circle was used to pick particles from micrographs with an SNR of 0.005, the plot of the correlation peaks still exhibited a discernible drop-off at N = 323, but with a much smoother edge (Fig. 4a). The drop-offs in correlation peak values were smoother and less prominent at lower SNR values (0.002, 0.001 and 0.0005). Using 323 as the threshold for particle selection, the proportion of false positives was less than 2% at an SNR of 0.005, and increased to approximately 7% at an SNR of 0.002 (Fig. 4d).
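The rank-based reading of the correlation-peak plots can be sketched as follows. This is an illustrative heuristic under stated assumptions, not the procedure used to produce Fig. 4: the drop-off is located as the largest gap between consecutive ranked peaks, and the false-positive rate is taken as the fraction of non-particles among the top-ranked picks. Both function names and the toy numbers are hypothetical.

```python
def largest_dropoff(peaks):
    """Return the rank k at which the ranked correlation peaks drop most
    sharply, i.e. the gap between the k-th and (k+1)-th peak is largest."""
    ranked = sorted(peaks, reverse=True)
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    return gaps.index(max(gaps)) + 1

def false_positive_rate(is_true_particle, n_keep):
    """Fraction of false positives among the n_keep top-ranked picks.
    is_true_particle lists, in rank order, whether each pick is genuine."""
    kept = is_true_particle[:n_keep]
    return sum(1 for genuine in kept if not genuine) / n_keep

# Toy example: five strong peaks (particles) followed by five weak ones (noise).
peaks = [0.90, 0.85, 0.80, 0.82, 0.88, 0.30, 0.25, 0.20, 0.28, 0.22]
n_keep = largest_dropoff(peaks)
print("picks kept:", n_keep)
print("false-positive rate:", false_positive_rate([True, True, False, True], 4))
```

A largest-gap rule of this kind is only expected to work while the drop-off is sharp. As the drop-off becomes smoother at lower SNRs, a fixed cut at the known particle count, as used in the text, is the more reliable choice in a simulation.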
We evaluated the specificity of particle picking when using templates other than a Gaussian circle, i.e., one projection view of the influenza virus HA trimer itself, and one projection view of the HIV-1 Env trimer, which bears little similarity to the HA trimer (Fig. 4b and c). For both templates, clear drop-offs in the correlation peak-ranking plots at N = 323 were observed at SNR values of 0.005 and higher. Notably, in all cases where we used different templates in the particle-picking test, the false-positive rate was below 2.5% at SNR values of 0.005 and above; there were no false positives at SNR values of 0.02 and greater (Fig. 4d). However, using the Gaussian circle template allowed better centering of picked particles than using the other two templates (Additional file 1: Figures S3 and S4). Among the cases compared here, the centering of picked particles was the worst when a dissimilar 2D structure (the HIV-1 Env trimer) was used as a template for micrographs with the lowest SNRs (0.005–0.0005) (Additional file 1: Figure S4). This implies that particle recognition is less sensitive to the detailed shape of the particle-picking template than are the specificity and particle-centering accuracy. Thus, the use of a dissimilar template allowed overall particle recognition, but resulted in a greater miscentering of the picked particles and more false positives at the lowest SNRs (0.005–0.0005).
BOF tests on the simulated cryo-EM datasets
We evaluated the ability of the BOF approach to verify the presence of genuine signals in the particles selected from micrographs with different SNRs using different particle-picking templates. Strikingly, for those datasets derived from micrographs with SNRs higher than 0.002, the class averages after the MLE alignment all recapitulated the projection views of the influenza virus HA trimer, no matter what type of initial reference was used for both FLC and MLE (Figs. 5 and 6). The MLE alignment results using particles selected from micrographs with SNR values of at least 0.002 were comparable for those selected using the three distinct templates. Evidently, the model used for the particle-picking template does not govern the outcome of MLE optimization when a sufficiently strong signal is present. Below the SNR value of 0.002, the MLE reduced but did not completely remove the reference dependency in the converged class averages when the unaligned class average was used as the starting reference for MLE alignment (Fig. 5i and l). Nonetheless, this effect was substantially reduced in the converged class averages when the Gaussian circle was used as the starting reference for the MLE alignment (Fig. 6i and l).
BOF tests on experimental cryo-EM data of glucose isomerase
To further examine the robustness of the BOF approach, we applied BOF tests to an experimental cryo-EM dataset of the 173-kDa glucose isomerase complex (Additional file 1: Figure S5). The single-particle SNR of this dataset is approximately 0.005–0.01. The BOF tests successfully produced class averages that corresponded to projection views of the glucose isomerase complex in all six cases (Fig. 7 and Additional file 1: Figure S6). Consistent with our observations with the simulated micrographs, the use of a Gaussian circle as both the particle-picking template and the MLE alignment reference performed as well as or better than the other combinations in generating class averages corresponding to glucose isomerase projections (Fig. 7b). When the HIV-1 Env trimer was used as the particle-picking template and the unaligned average was used as the starting reference for MLE alignment, two class averages showed structures that were strongly biased by the particle-picking template (rows 3 and 4 in Fig. 7e). By contrast, the other three class averages more closely reflected the low-resolution projection views of glucose isomerase (rows 1, 2 and 5 in Fig. 7e), although some residual elements of the HIV-1 Env trimer persisted in the background. However, when the Gaussian circle was used as the starting reference for MLE alignment, the particle-picking template of the HIV-1 Env trimer was no longer recapitulated in any of the converged class averages (Fig. 7f). Even when one of the class averages demonstrated indistinct features, perhaps due to a clustering of non-particle false positives, the aligned average did not resemble the particle-picking template of the HIV-1 Env trimer (second row in Fig. 7f). As discussed above, such classes of particles can be discarded, which provides an opportunity to cull non-particles in batch mode. These results therefore indicate that the BOF approach, when used with Gaussian references, can be successfully applied to experimental cryo-EM data of a 173-kDa protein complex.
BOF robustness
The ability of BOF tests to suppress reference bias can be quantitatively evaluated by assessing the Fourier ring correlation (FRC) between the particle-picking template and the class averages as they evolve during the process of MLE optimization. We first analyzed the cases in which the HIV-1 Env trimer was used to pick particles, and unaligned class averages were used as starting references for MLE optimization (solid curves in Fig. 8). In these cases, the FRC curves showed a significant correlation (> 0.5) in the low-resolution range (20–50 Å) at the beginning of the MLE optimization (black solid curves in Fig. 8). However, as MLE optimization progressed to convergence, the FRC values decreased and the image of the particle-picking template diminished in significance (red solid curves in Fig. 8). In the case of the simulated data at an SNR of 0.005, the spatial frequency at which the FRC falls to 0.5 (FRC0.5) dropped to 0.015 Å⁻¹ upon convergence, indicating an efficient removal of reference bias (Fig. 8a). Correspondingly, the converged class averages efficiently recovered the projection views of the influenza virus HA trimer (Fig. 5c). At SNRs of 0.002 and lower, the frequency of FRC0.5 was reduced to 0.02–0.04 Å⁻¹ upon convergence, indicating a less efficient removal of reference bias (Figs. 8b–e). By contrast, in all MLE alignments performed using a Gaussian circle as the starting reference, the FRC curves showed no significant correlation (> 0.5) between the particle-picking template and the converged class averages at a spatial frequency higher than ~ 0.02 Å⁻¹ (dashed curves in Fig. 8). Thus, when a Gaussian model was used as the starting reference for MLE optimization, the converged class averages did not recapitulate the structure of the particle-picking template.
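For readers who wish to reproduce this kind of analysis, the following is a minimal sketch of an FRC calculation and of reading off the FRC0.5 frequency. It assumes the standard FRC definition (the normalized cross-correlation of Fourier coefficients within rings of equal spatial frequency) and is not the code used to generate Fig. 8. The function names, the box size and the pixel size are hypothetical. A naive discrete Fourier transform is used only to keep the example dependency-free; for realistic image sizes a fast Fourier transform (for example numpy.fft.fft2) would be used instead.

```python
import cmath
import math
import random

def dft2(img):
    """Naive 2D discrete Fourier transform of a square image (list of rows)."""
    n = len(img)
    w = [[cmath.exp(-2j * math.pi * u * x / n) for x in range(n)] for u in range(n)]
    # The 2D transform is separable: transform along rows, then along columns.
    rows = [[sum(row[x] * w[u][x] for x in range(n)) for u in range(n)] for row in img]
    return [[sum(rows[y][u] * w[v][y] for y in range(n)) for u in range(n)]
            for v in range(n)]

def frc(img_a, img_b):
    """Fourier ring correlation for integer-radius rings from 0 to Nyquist."""
    n = len(img_a)
    fa, fb = dft2(img_a), dft2(img_b)
    n_rings = n // 2 + 1
    cross = [0.0] * n_rings
    pow_a = [0.0] * n_rings
    pow_b = [0.0] * n_rings
    for v in range(n):
        for u in range(n):
            # Convert DFT indices to signed frequencies before taking the radius.
            ku = u if u <= n // 2 else u - n
            kv = v if v <= n // 2 else v - n
            ring = int(round(math.hypot(ku, kv)))
            if ring >= n_rings:
                continue  # corners beyond the Nyquist ring are ignored
            cross[ring] += (fa[v][u] * fb[v][u].conjugate()).real
            pow_a[ring] += abs(fa[v][u]) ** 2
            pow_b[ring] += abs(fb[v][u]) ** 2
    return [c / math.sqrt(pa * pb) if pa > 0 and pb > 0 else 0.0
            for c, pa, pb in zip(cross, pow_a, pow_b)]

def frc_threshold_frequency(curve, box_size, pixel_size, threshold=0.5):
    """Spatial frequency (1/Angstrom) of the first ring whose FRC is below
    the threshold, or None if the curve never drops below it."""
    for ring, value in enumerate(curve):
        if value < threshold:
            return ring / (box_size * pixel_size)
    return None

rng = random.Random(3)
image = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
noisy_copy = [[p + rng.gauss(0.0, 1.0) for p in row] for row in image]
curve = frc(image, noisy_copy)
print("FRC per ring:", [round(value, 2) for value in curve])
```

In the context of the text, `img_a` would be the particle-picking template and `img_b` a class average at a given MLE iteration; a lower FRC0.5 frequency upon convergence then indicates that less of the template survives in the class average.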
Discussion
This study provides insights into the numerical performance of the BOF procedure in the detection of weak signals. First, the FLC implementation in SPIDER successfully picked particles from micrographs with SNRs as low as 0.002–0.005, at least in our tests (Fig. 4); such low SNRs are potentially relevant to small proteins below 200 kDa or certain views of larger proteins with less ordered or dynamic structures. Together with previous studies [8, 12, 13], our results suggest that the FLC function is sensitive to the presence of weak signals. A Gaussian circle seems to be as effective at picking particles as a single projection view of the imaged molecule. Second, the output parameters in the particle-picking problem are the x-y coordinates of the particle box. The choice of template in particle picking affects the coordinates of the extracted boxes, probably through biases in the correlation between the noise and the template. Consequently, the average image of the picked particles after boxing and before alignment closely resembled the particle-picking template. However, the template does not change the true signal in the boxed particle images, which allows objective signal validation by the MLE function with proper initialization. Third, the adverse effects of reference bias resulting from FLC-based particle picking can be suppressed by MLE-based alignment using a Gaussian circle as the starting reference. In other words, the reference bias derived from the FLC function does not necessarily translate into reference bias in the MLE function initialized with a Gaussian model. Finally, at the lowest SNRs (0.001 and below), the BOF procedure became inefficient at verifying signals from our dataset of 38,760 particles. In this case, the MLE alignment initialized with a Gaussian model mostly led to a blank or blurred class average that was insufficient to reproduce the particle-picking template. A similar lower bound of SNR (0.001 and below) was also found for a deep-learning-based particle-picking approach [34].
We found that the use of a dissimilar structure as the particle-picking template slightly increased the number of false positives in the examined cases. Thus, a Gaussian circle could be a preferred picking template in the initial stage of automated particle picking, since it can help avoid any potential selection bias [6]. Nevertheless, although the Gaussian model works well for picking particle images of globular proteins or similar macromolecules, it could be error-prone and potentially miss particles with unique shapes and topologies, such as ring-like and other centrally sparse structures [6]. In this case, a validated initial model low-pass filtered at 30–60 Å, which follows the low-frequency features of the particles, could be used as a particle-picking template.
False-positive particles, such as ice contamination, can hardly be avoided by the FLC function. Nevertheless, the percentage of false positives in the candidate particle pools can be reduced by manual curation [8, 12, 13, 19]. Moreover, recent advances in applying machine learning to particle recognition can mostly remove these types of false positives, with little manual intervention [34, 35]. Thus, the objective functions in the BOF approach could be replaced with more advanced ones, such as those based on deep learning or manifold learning [34, 47], to further improve the performance of signal detection by the BOF approach.
Importantly, the aforementioned technical insights can be used to optimize and quality-control the everyday practice of cryo-EM data processing. First, all current implementations of FLC-based template-matching procedures, such as those in SPIDER [45] and RELION [22], require 2D templates derived either from 2D class averaging of thousands of manually picked particles or from 2D projections of an initial 3D model, both of which are still time-consuming and laborious to achieve. The use of a Gaussian circle as a default template for initial FLC-based particle picking can improve the level of automation and save significant labor in generating initial 2D class averages or 3D models. This strategy has already been successful in high-resolution cryo-EM structure determination in a few cases [42].
Second, in our practice of cryo-EM data processing, we have found that templates for FLC derived by averaging manually selected particles can potentially generate bias in particle picking toward the views with orientations similar to those of the templates. This is particularly a concern for smaller proteins below 200 kDa or non-globular particles (plate-like, discoidal or rod-shaped, etc.) [30], of which some views might have much lower contrast or SNR than other views and could thus evade visual detection in initial manual picking. If certain views that have projection structures or shapes significantly different from the orthogonal views are missed or not included in the particle-picking templates, the FLC procedures can potentially result in more false negatives of these views, causing artificial orientation preference in the selected particle dataset. In this case, we have found that the use of a Gaussian circle as an FLC template to thoroughly pick all potential particles, followed by deeper 2D classification using statistical manifold learning [47], can reduce or avoid the artificially introduced orientation preference in the particle selection, thus eventually improving the quality and resolution of the 3D reconstruction.
Third, it has been previously hypothesized that wrong templates used for particle picking can be inadvertently recapitulated in the final 3D reconstruction of these particles, resulting in the visualization of non-existent objects [53–55]. The present study systematically demonstrates that, given sufficient SNR in the images, such an outcome is unlikely when a Gaussian circle is used to initiate the image alignment by MLE, regardless of what type of template is used for FLC. When the initiation reference for MLE is the same as the template used for FLC on the data with lower SNRs (0.001 or lower), elements of the particle-picking template can be recapitulated in some 2D class averages generated by MLE, and could potentially bias the resulting 3D reconstruction. Thus, the use of a Gaussian circle to initialize MLE-based image alignment and refinement can be very useful for either validating the authenticity of the reconstruction or safeguarding routine cryo-EM data processing over a broad range of SNRs, avoiding the reconstruction of non-existent structures and features out of noise [31].
Our study of the variables that affect BOF performance was limited to the combination of FLC and MLE. There are other choices for the two distinct objective functions in the BOF framework. For example, the FLC can be replaced with a deep convolutional neural network [34]. With additional testing, these modifications may further improve the utility of the BOF framework in real cryo-EM data processing pipelines.
Conclusions
In this work, we examined the effects of SNR and choice of initialization on the ability of the BOF approach to select and verify particles from noisy cryo-EM micrographs. We quantitatively characterized the critical SNR at which BOF performance begins to degrade, and found it to be surprisingly small, as low as 0.002–0.005, given the size of the dataset (38,760 particles) tested in each case. Importantly, reference dependency of the FLC does not necessarily transfer to the MLE, making possible the robust detection and validation of weak signals. When a non-Gaussian template is used for particle picking by the FLC, the use of a Gaussian model to initialize the MLE optimization can largely suppress reference dependency of the FLC on the particle-picking template. Thus, given an SNR above the critical value, the combination of two distinct objective functions may provide a sensitive and robust way to detect and verify weak signals in cryo-EM micrographs. The essential insights into the numerical behavior of the BOF approach provided by our systematic study can guide optimization of weak signal verification and improve automation efficiency in the cryo-EM data processing pipeline for high-resolution structural determination.
Abbreviations
BOF: Bi-objective function
Cryo-EM: Cryo-electron microscopy
FLC: Fast local correlation
MLE: Maximum likelihood estimate
SNR: Signal-to-noise ratio
References
1. Nogales E. The development of cryo-EM into a mainstream structural biology technique. Nat Methods. 2016;13(1):24–7.
2. Spence JCH. High-resolution electron microscopy. 4th ed. Oxford: Oxford University Press; 2013.
3. Frank J. Three-dimensional electron microscopy of macromolecular assemblies: visualization of biological molecules in their native state. 2nd ed. Oxford; New York: Oxford University Press; 2006.
4. Frank J, Wagenknecht T. Automatic selection of molecular images from electron micrographs. Ultramicroscopy. 1984;12:169–76.
5. Nicholson WV, Glaeser RM. Review: automatic particle detection in electron microscopy. J Struct Biol. 2001;133(2–3):90–101.
6. Glaeser RM. Historical background: why is it important to improve automated particle selection methods? J Struct Biol. 2004;145(1–2):15–8.
7. Lata KR, Penczek P, Frank J. Automatic particle picking from electron micrographs. Ultramicroscopy. 1995;58(3–4):381–91.
8. Roseman AM. Particle finding in electron micrographs using a fast local correlation algorithm. Ultramicroscopy. 2003;94(3–4):225–36.
9. Hall RJ, Patwardhan A. A two step approach for semi-automated particle selection from low contrast cryo-electron micrographs. J Struct Biol. 2004;145(1–2):19–28.
10. Huang Z, Penczek PA. Application of template matching technique to particle detection in electron micrographs. J Struct Biol. 2004;145(1–2):29–40.
11. Mallick SP, Zhu Y, Kriegman D. Detecting particles in cryo-EM micrographs using learned features. J Struct Biol. 2004;145(1–2):52–62.
12. Rath BK, Frank J. Fast automatic particle picking from cryo-electron micrographs using a locally normalized cross-correlation function: a case study. J Struct Biol. 2004;145(1–2):84–90.
13. Roseman AM. FindEM - a fast, efficient program for automatic selection of particles from electron micrographs. J Struct Biol. 2004;145(1–2):91–9.
14. Wong HC, Chen J, Mouche F, Rouiller I, Bern M. Model-based particle picking for cryo-electron microscopy. J Struct Biol. 2004;145(1–2):157–67.
15. Zhu Y, Carragher B, Glaeser RM, Fellmann D, Bajaj C, Bern M, Mouche F, de Haas F, Hall RJ, Kriegman DJ, et al. Automatic particle selection: results of a comparative study. J Struct Biol. 2004;145(1–2):3–14.
16. Adiga U, Baxter WT, Hall RJ, Rockel B, Rath BK, Frank J, Glaeser R. Particle picking by segmentation: a comparative study with SPIDER-based manual particle picking. J Struct Biol. 2005;152(3):211–20.
17. Chen JZ, Grigorieff N. SIGNATURE: a single-particle selection system for molecular electron microscopy. J Struct Biol. 2007;157(1):168–73.
18. Voss NR, Yoshioka CK, Radermacher M, Potter CS, Carragher B. DoG picker and TiltPicker: software tools to facilitate particle selection in single particle electron microscopy. J Struct Biol. 2009;166(2):205–13.
19. Zhao J, Brubaker MA, Rubinstein JL. TMaCS: a hybrid template matching and classification system for partially-automated particle selection. J Struct Biol. 2013;181(3):234–42.
20. Langlois R, Pallesen J, Ash JT, Nam Ho D, Rubinstein JL, Frank J. Automated particle picking for low-contrast macromolecules in cryo-electron microscopy. J Struct Biol. 2014;186(1):1–7.
21. Tang G, Peng L, Baldwin PR, Mann DS, Jiang W, Rees I, Ludtke SJ. EMAN2: an extensible image processing suite for electron microscopy. J Struct Biol. 2007;157(1):38–46.
22. Scheres SH. Semi-automated selection of cryo-EM particles in RELION-1.3. J Struct Biol. 2015;189(2):114–22.
23. Shaikh TR, Hegerl R, Frank J. An approach to examining model dependence in EM reconstructions using cross-validation. J Struct Biol. 2003;142(2):301–10.
24. Sigworth FJ. A maximum-likelihood approach to single-particle image refinement. J Struct Biol. 1998;122(3):328–39.
25. Sigworth FJ, Doerschuk PC, Carazo JM, Scheres SH. An introduction to maximum-likelihood methods in cryo-EM. Methods Enzymol. 2010;482:263–94.
26. Scheres SH. Classification of structural heterogeneity by maximum-likelihood methods. Methods Enzymol. 2010;482:295–320.
27. Scheres SH, Valle M, Nunez R, Sorzano CO, Marabini R, Herman GT, Carazo JM. Maximum-likelihood multi-reference refinement for electron microscopy images. J Mol Biol. 2005;348(1):139–49.
28. Mao Y, Wang L, Gu C, Herschhorn A, Xiang SH, Haim H, Yang X, Sodroski J. Subunit organization of the membrane-bound HIV-1 envelope glycoprotein trimer. Nat Struct Mol Biol. 2012;19(9):893–9.
29. Brown A, Amunts A, Bai XC, Sugimoto Y, Edwards PC, Murshudov G, Scheres SHW, Ramakrishnan V. Structure of the large ribosomal subunit from human mitochondria. Science. 2014;346(6210):718–22.
30. Zhang L, Chen S, Ruan J, Wu J, Tong AB, Yin Q, Li Y, David L, Lu A, Wang WL, et al. Cryo-EM structure of the activated NAIP2-NLRC4 inflammasome reveals nucleated polymerization. Science. 2015;350(6259):404–9.
31. Mao Y, Castillo-Menendez LR, Sodroski JG. Reply to Subramaniam, van Heel, and Henderson: validity of the cryo-electron microscopy structures of the HIV-1 envelope glycoprotein complex. Proc Natl Acad Sci U S A. 2013;110(45):E4178–82.
32. Mao Y, Wang L, Gu C, Herschhorn A, Desormeaux A, Finzi A, Xiang SH, Sodroski JG. Molecular architecture of the uncleaved HIV-1 envelope glycoprotein trimer. Proc Natl Acad Sci U S A. 2013;110(30):12438–43.
33. Sorzano CO, Recarte E, Alcorlo M, Bilbao-Castro JR, San-Martin C, Marabini R, Carazo JM. Automatic particle selection from electron micrographs using machine learning techniques. J Struct Biol. 2009;167(3):252–60.
34. Zhu Y, Ouyang Q, Mao Y. A deep convolutional neural network approach to single-particle recognition in cryo-electron microscopy. BMC Bioinformatics. 2017;18(1):348.
35. Langlois R, Pallesen J, Frank J. Reference-free particle selection enhanced with semi-supervised machine learning for cryo-electron microscopy. J Struct Biol. 2011;175(3):353–61.
36. Wang F, Gong H, Liu G, Li M, Yan C, Xia T, Li X, Zeng J. DeepPicker: a deep learning approach for fully automated particle picking in cryo-EM. J Struct Biol. 2016;195(3):325–36.
37. Dong Y, Zhang S, Wu Z, Li X, Wang WL, Zhu Y, Stoilova-McPhie S, Lu Y, Finley D, Mao Y. Cryo-EM structures and dynamics of substrate-engaged human 26S proteasome. Nature. 2019;565(7737):49–55.
38. Zhu Y, Wang WL, Yu D, Ouyang Q, Lu Y, Mao Y. Structural mechanism for nucleotide-driven remodeling of the AAA-ATPase unfoldase in the activated human 26S proteasome. Nat Commun. 2018;9(1):1360.
39. Chen S, Wu J, Lu Y, Ma YB, Lee BH, Yu Z, Ouyang Q, Finley DJ, Kirschner MW, Mao Y. Structural basis for dynamic regulation of the human 26S proteasome. Proc Natl Acad Sci U S A. 2016;113(46):12991–6.
40. Lu Y, Wu J, Dong Y, Chen S, Sun S, Ma YB, Ouyang Q, Finley D, Kirschner MW, Mao Y. Conformational landscape of the p28-bound human proteasome regulatory particle. Mol Cell. 2017;67(2):322–33 e326.
41. Zhao Q, Zhou H, Chi S, Wang Y, Wang J, Geng J, Wu K, Liu W, Zhang T, Dong MQ, et al. Structure and mechanogating mechanism of the Piezo1 channel. Nature. 2018;554(7693):487–92.
42. Masiulis S, Desai R, Uchanski T, Serna Martin I, Laverty D, Karia D, Malinauskas T, Zivanov J, Pardon E, Kotecha A, et al. GABA_{A} receptor signalling mechanisms revealed by structural pharmacology. Nature. 2019;565(7740):454–9.
43. Penczek P, Radermacher M, Frank J. Three-dimensional reconstruction of single particles embedded in ice. Ultramicroscopy. 1992;40(1):33–53.
44. Weis WI, Brunger AT, Skehel JJ, Wiley DC. Refinement of the influenza virus hemagglutinin by simulated annealing. J Mol Biol. 1990;212(4):737–61.
45. Frank J, Radermacher M, Penczek P, Zhu J, Li Y, Ladjadj M, Leith A. SPIDER and WEB: processing and visualization of images in 3D electron microscopy and related fields. J Struct Biol. 1996;116(1):190–9.
46. Xu Y, Wu J, Yin CC, Mao Y. Unsupervised Cryo-EM data clustering through adaptively constrained K-means algorithm. PLoS One. 2016;11(12):e0167765.
47. Wu J, Ma YB, Congdon C, Brett B, Chen S, Xu Y, Ouyang Q, Mao Y. Massively parallel unsupervised single-particle cryo-EM data clustering via statistical manifold learning. PLoS One. 2017;12(8):e0182130.
48. Scheres SH. RELION: implementation of a Bayesian approach to cryo-EM structure determination. J Struct Biol. 2012;180(3):519–30.
49. Sorzano CO, Marabini R, Velazquez-Muriel J, Bilbao-Castro JR, Scheres SH, Carazo JM, Pascual-Montano A. XMIPP: a new generation of an open-source image processing package for electron microscopy. J Struct Biol. 2004;148(2):194–204.
50. Baxter WT, Grassucci RA, Gao H, Frank J. Determination of signal-to-noise ratios and spectral SNRs in cryo-EM low-dose imaging of molecules. J Struct Biol. 2009;166(2):126–32.
51. Mindell JA, Grigorieff N. Accurate determination of local defocus and specimen tilt in electron microscopy. J Struct Biol. 2003;142(3):334–47.
52. Rose A. The sensitivity performance of the human eye on an absolute scale. J Opt Soc Am. 1948;38(2):196–208.
53. Henderson R. Avoiding the pitfalls of single particle cryo-electron microscopy: Einstein from noise. Proc Natl Acad Sci U S A. 2013;110(45):18037–41.
54. Subramaniam S. Structure of trimeric HIV-1 envelope glycoproteins. Proc Natl Acad Sci U S A. 2013;110(45):E4172–4.
55. van Heel M. Finding trimeric HIV-1 envelope glycoproteins in random noise. Proc Natl Acad Sci U S A. 2013;110(45):E4175–7.
Acknowledgements
The authors thank J. Jackson and T. Song for assistance in maintaining the high-performance computing system; C. Marks, A. Graham, A. Magyar and D. Bell for assistance in maintaining the imaging system; Y. Zhu, Y. McLaughlin and E. Carpelan for assistance in manuscript preparation.
Funding
The experiments and data processing were performed in part at the Center for Nanoscale Systems at Harvard University, Cambridge, MA, USA, a member of the National Nanotechnology Infrastructure Network (NNIN), which is supported by the National Science Foundation of the USA under NSF award no. 1541959. The cryo-EM facility was supported by the NIH grant AI100645 to the Center for HIV/AIDS Vaccine Immunology and Immunogen Design (CHAVIID). This work was funded in part by an Intel academic grant, by the National Institutes of Health (NIH) (AI93256, AI67854, AI100645 and AI24755), by an Innovation Award and a Fellowship Award from the Ragon Institute of MGH, MIT and Harvard, by the National Natural Science Foundation of China grant No. 11774012, Beijing Natural Science Foundation grant No. Z180016 and by gifts from Mr. and Mrs. Daniel J. Sullivan, Jr.
Availability of data and materials
Scripts are provided in the online Supplementary Material.
Author information
Contributions
YM conceived the guiding principles, designed the experiments, and conducted the simulation study. WLW, LRCM, and YM conducted the cryo-EM experiments. ZY conducted the tests on the real-world cryo-EM data of the glucose isomerase complex. YM and JS wrote the manuscript. All authors have read and approved the final manuscript.
Corresponding author
Correspondence to Youdong Mao.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1:
Figure S1. The simulated micrographs with different SNRs. Figure S2. Contrast enhancement of the simulated micrographs by a number of conventional techniques, including histogram normalization, contrast stretching, low-pass filtering and binning, at the SNRs of 0.005 (A) and 0.002 (B). Figure S3. An example of FLC-based particle picking from micrographs of the influenza virus HA trimer with low SNRs. Figure S4. Comparison of the FLC-based particle-picking results near the critical SNR with different templates. Figure S5. Automated particle picking from low-defocus (close-to-focus) micrographs and manual verification of picked particles from high-defocus (far-from-focus) micrographs. Figure S6. Verification of the class averages after ML classification for the BOF tests on the real cryo-EM data, using the atomic model of the glucose isomerase complex (PDB ID: 1OAD). (PDF 18817 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 Automatic particle picking
 Fast local correlation function
 Cryo-EM
 Maximum-likelihood estimate
 Single-particle analysis