Navigating 3D electron microscopy maps with EM-SURFER

Background: The Electron Microscopy DataBank (EMDB) is growing rapidly, accumulating biological structural data obtained mainly by electron microscopy and tomography, which are emerging techniques for determining large biomolecular complex and subcellular structures. Together with the Protein Data Bank (PDB), EMDB is becoming a fundamental resource of the tertiary structures of biological macromolecules. To take full advantage of this indispensable resource, the ability to search the database by structural similarity is essential. However, unlike high-resolution structures stored in PDB, methods for comparing low-resolution electron microscopy (EM) density maps in EMDB are not well established.
Results: We developed a computational method for efficiently searching low-resolution EM maps. The method uses a compact fingerprint representation of EM maps based on the 3D Zernike descriptor, which is derived from a mathematical series expansion for EM maps that are considered as 3D functions. The method is implemented in a web server named EM-SURFER, which allows users to search against the entire EMDB in real-time. EM-SURFER compares the global shapes of EM maps. Examples of search results from different types of query structures are discussed.
Conclusions: We developed EM-SURFER, which retrieves structurally relevant matches for query EM maps from EMDB within seconds. The unique capability of EM-SURFER to detect 3D shape similarity of low-resolution EM maps should prove invaluable in structural biology.


Background
The three-dimensional (3D) structure of proteins and other biomolecules provides the molecular basis for understanding mechanisms of biological functions, interactions, and pathways, and serves as a foundation for numerous areas in biotechnology. In addition to the exponential growth of solved 3D protein structures and complexes in the Protein Data Bank (PDB) [1,2], which are mostly determined by X-ray crystallography or NMR, low-resolution biomolecular structural data determined by cryo-electron microscopy (cryo-EM) and electron tomography are rapidly accumulating in the Electron Microscopy Data Bank (EMDB, http://www.emdatabank.org/) [3]. Cryo-EM is an important technique in structural biology used to solve large protein complex and subcellular structures. Currently, EMDB holds over 2600 entries, and the number of entries is growing rapidly. The mean resolution of the EM maps is currently about 15 Å, but recent papers [4][5][6] report high-resolution structures at around 3.5 Å. There is no doubt that EMDB will become increasingly important not only in structural biology, but also in various areas including molecular biology and bioinformatics.
To take full advantage of these valuable resources of 3D biomolecular structures, one must be able to efficiently perform a structure-based search against entire structure databases in real-time. Similarity search is the most essential operation that a database needs to provide. However, compared to biological sequence databases, which are usually equipped with real-time database search methods, structure databases lag behind in efficient search methods, particularly for low-resolution structural data.
To this end, we have developed EM-SURFER for real-time searching of EM density maps from EMDB. Users can search for similar EM maps in EMDB in terms of the global shape and the volume of a query map. A query can be either chosen from existing EMDB entries or uploaded. Unlike atomic-detail structures stored in PDB, EM density maps are at low resolution and thus conventional structure comparison approaches cannot be directly applied.
A fast map comparison is achieved by using a mathematical representation of 3D shapes named the 3D Zernike Descriptor (3DZD) [7]. A 3DZD is a vector derived from a series expansion of a 3D function, which describes an EM map in a compact and rotation-invariant fashion. 3DZD has been successfully applied in various biomolecular structure analyses [8], including protein 3D shape comparison [9], protein docking [10][11][12], ligand binding site comparison [13,14], and fast ligand database search [15].
In EM-SURFER, each search is performed on-the-fly and only takes a few seconds. The database of EM maps is automatically synchronized with EMDB weekly. In what follows, we first describe how 3D EM maps are represented in EM-SURFER, and then explain input data and output search results with examples.

Implementation
The main operation performed by EM-SURFER involves comparing two EM maps using an efficient structure representation with 3DZD. The descriptor is derived from a mathematical series expansion of a 3D function based on the 3D Zernike moments. 3DZD was originally derived by Canterakis [7] and later applied to 3D object retrieval [16]. A 3DZD can be viewed as a fingerprint that consists of a vector of real numbers, where each number is a coefficient of the series expansion. Comparisons between these fingerprints form the basis of the rapid search performed by our server. The similarity between 3DZD vectors is quantified by their Euclidean distance.
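The fingerprint comparison described above can be illustrated with a minimal sketch. It assumes fingerprints are plain lists of real numbers; the function names, the entry labels, and the toy 3-element vectors are hypothetical and are not taken from the EM-SURFER implementation. The default of 20 hits mirrors the number of entries shown on the results page.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two fingerprint vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query, database, top_k=20):
    """Return the top_k (entry_id, distance) pairs, closest first.

    `database` maps an entry label to its precomputed fingerprint.
    """
    scored = [(entry_id, euclidean_distance(query, fp))
              for entry_id, fp in database.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Toy data: hypothetical labels and short vectors, for illustration only.
toy_database = {
    "map_a": [1.0, 2.0, 2.0],
    "map_b": [0.0, 0.5, 0.0],
    "map_c": [4.0, 4.0, 2.0],
}
for entry_id, dist in rank_by_distance([0.0, 0.0, 0.0], toy_database, top_k=2):
    print(entry_id, round(dist, 3))
```

Because the fingerprints are precomputed and short, a linear scan of this kind over a few thousand entries is inexpensive, which is consistent with the search times of a few seconds reported for the server.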
EM density maps for EM-SURFER are obtained from EMDB [3], the primary repository of electron microscopy data, and updated on a weekly basis. For each EM map, 3DZD vectors are computed. It was shown in previous studies [17,18] that 3DZD can properly represent EM maps. An EM map is a 3D grid where an electron density value is assigned at each grid point. Using the author-recommended density contour level provided in EMDB, grid points with an electron density equal to or larger than the author-recommended density are marked with 1, and all others with 0. The value-mapped 3D grid is considered as a 3D function, f(x). This f(x) is expanded into a series in terms of the Zernike-Canterakis basis, defined as follows:
$$Z_{nl}^{m}(r, \vartheta, \varphi) = R_{nl}(r)\, Y_{l}^{m}(\vartheta, \varphi)$$
The ranges of parameters l and m are defined by the order n: −l ≤ m ≤ l, 0 ≤ l ≤ n, and n − l even. We used order n = 20, which corresponds to 121 invariants. $Y_{l}^{m}(\vartheta, \varphi)$ are the spherical harmonics and $R_{nl}(r)$ are the radial functions, constructed in a way that $Z_{nl}^{m}(r, \vartheta, \varphi)$ can be converted to polynomials in Cartesian coordinates, $Z_{nl}^{m}(\mathbf{x})$. The 3D Zernike moments of f(x) are the coefficients of the expansion in this basis:
$$\Omega_{nl}^{m} = \frac{3}{4\pi} \int_{|\mathbf{x}| \le 1} f(\mathbf{x})\, \overline{Z_{nl}^{m}(\mathbf{x})}\, d\mathbf{x}$$
The 3DZD invariants, $F_{nl}$, are calculated as norms of vectors $\Omega_{nl}$. The norm gives rotational invariance to the descriptor:
$$F_{nl} = \left\| \Omega_{nl} \right\| = \sqrt{\sum_{m=-l}^{l} \left| \Omega_{nl}^{m} \right|^{2}}$$
Figure 1. 3DZD computation pipeline. Every map in EMDB yields several 3D Zernike descriptor fingerprints. The raw map is used to generate four voxelizations: one from the author-recommended density value, one at one standard deviation, which is lower than the author-recommended contour level, and two additional thresholds that reveal core features. Each surface is represented by 121 descriptors, which are concatenated to generate various fingerprints.
A similar rotation-invariant 3D shape descriptor can be constructed by using only spherical harmonics $Y_{l}^{m}(\vartheta, \varphi)$. In particular, in the spherical harmonics descriptor (SHD), a 3D object is segmented by a set of concentric spheres; a rotation-invariant descriptor based on spherical harmonics is constructed for each sphere, and the descriptors are concatenated to incorporate distance information from the object center [19][20][21].
3DZD is mathematically superior to SHD because SHD computes a rotation-invariant descriptor for each concentric sphere separately, and thus the shells can be rotated independently by random angles without changing the resulting descriptors. As a consequence, SHD cannot distinguish objects that differ only by such independent shell rotations. Also, in 3DZD, the orthonormality of the Zernike-Canterakis basis results in less information redundancy. In contrast, in SHD, descriptors coming from adjacent shells are highly correlated, making them redundant to some extent. This usually makes the size (the length of the descriptor) of SHD larger than that of 3DZD. Moreover, 3DZD was shown to perform better than SHD in shape-based object retrieval [16] and protein global surface shape comparison [22]. For more discussion about 3DZD and spherical harmonics, refer to a review paper [23]. A more detailed derivation of 3DZD as well as the mathematical foundation can be found in previous publications [7,16,24].
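The figure of 121 invariants for order n = 20 follows directly from the index constraints 0 ≤ l ≤ n and n − l even, because one invariant is obtained per (n, l) pair once the norm over m is taken. The short sketch below enumerates those pairs; the function name is hypothetical and serves only to check the count.

```python
def zernike_index_pairs(max_order):
    """List all (n, l) pairs with 0 <= l <= n <= max_order and n - l even.

    Each pair corresponds to one rotation-invariant 3DZD component.
    """
    return [(n, l)
            for n in range(max_order + 1)
            for l in range(n + 1)
            if (n - l) % 2 == 0]


pairs = zernike_index_pairs(20)
print(len(pairs))  # 121
```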
Besides the author-recommended density level, three further voxelizations were computed (Figure 1): one at one standard deviation of the electron density, and two at higher density levels, 1/3 and 2/3 of the highest density. The purpose of these additional map descriptions, one at a lower and two at higher densities, is to capture the shapes of the molecules at different contour levels. Each contour level yields its own vector of 121 3DZD invariants. In total, five EM map descriptors were prepared: 1) the 3DZD for the author-recommended density level; descriptors that concatenate the 3DZD of the author-recommended density level with the 3DZD computed at 2) one standard deviation, 3) 1/3 of the maximum density, or 4) 2/3 of the maximum density; and 5) a descriptor that concatenates the 3DZDs of the author-recommended, 1/3, and 2/3 density levels. The second to fourth descriptors have 242 invariants each and the last one has 363 invariants. The 3DZDs were precomputed for each EMDB entry. For a query map uploaded by a user, they are computed on-the-fly.
PDBj (Protein Databank Japan, http://pdbj.org/) provides a list of structurally similar maps for each EM map entry in their EM Navigator. Similar maps are identified by vector quantization, and the similarity of all EM maps is visualized in a two-dimensional map (named the Omokage map) computed by multidimensional scaling. Details of the implementation of the method are not provided at the EM Navigator website (http://pdbj.org/emnavi/emnavi_doc.php?doc=omokage), but one clear difference between EM-SURFER and EM Navigator is the following: unlike the Omokage map, which seems to be pre-computed, the similarity search for a query is performed on-the-fly in EM-SURFER. Thus, a search can also be performed for a map that is uploaded by a user.
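The relation between the four contour levels and the five fingerprints can be sketched as follows. This is an illustration under stated assumptions, not the server's code: the density grid is flattened to a list, "one standard deviation" is read here as the mean plus one standard deviation (an assumption; the text does not specify the reference point), and the per-contour 3DZD vectors are represented by placeholder lists because the Zernike computation itself is outside the scope of the sketch. All function and key names are hypothetical.

```python
from statistics import mean, pstdev

N_INVARIANTS = 121  # invariants per contour level at order n = 20


def contour_levels(densities, author_level):
    """Return the four density thresholds used for voxelization."""
    d_max = max(densities)
    return {
        "author": author_level,
        # Assumption: one standard deviation is taken above the mean density.
        "one_sd": mean(densities) + pstdev(densities),
        "third_max": d_max / 3.0,
        "two_thirds_max": 2.0 * d_max / 3.0,
    }


def binarize(densities, level):
    """Mark grid points at or above the contour level with 1, others with 0."""
    return [1 if d >= level else 0 for d in densities]


def build_fingerprints(zd):
    """Concatenate per-contour 3DZD vectors into the five fingerprints."""
    a = zd["author"]
    return {
        1: a,                                           # 121 values
        2: a + zd["one_sd"],                            # 242 values
        3: a + zd["third_max"],                         # 242 values
        4: a + zd["two_thirds_max"],                    # 242 values
        5: a + zd["third_max"] + zd["two_thirds_max"],  # 363 values
    }


# Placeholder descriptors stand in for real 3DZD vectors.
placeholder = {key: [0.0] * N_INVARIANTS
               for key in ("author", "one_sd", "third_max", "two_thirds_max")}
fingerprints = build_fingerprints(placeholder)
print({k: len(v) for k, v in fingerprints.items()})
```

Concatenation keeps each contour level's invariants in a fixed position, so the Euclidean distance between two concatenated fingerprints combines the per-level differences in a single number.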
The validity of applying 3DZD for EM map database search was shown in previous studies [17,18]. These two studies demonstrated database searches for simulated and actual EM maps, which achieved high accuracy by describing EM maps with 3DZD.

Results and discussion
The main result generated by EM-SURFER is a list of EM maps, with queries submitted through the Search page (Figure 2). To submit a query entry, users should go through the following four steps. In Step 1, the contour shape representation should be specified. The default is set to the author-recommended contour level. In Step 2, users need to choose the EMDB entry ID or upload an EM map file. To find an ID from a protein name or other information, use the EMDB text search page at http://www.ebi.ac.uk/pdbe/emdb/searchForm.html. In Step 3, a volume filter is provided, which is enabled by default. When this filter is on, a search only retrieves EM maps that have a volume similar to the query (the volume ratio between each retrieved map and the query should be between 0.8 and 1.2). Finally, in Step 4, a resolution filter allows users to restrict the maps returned for the query to a specified resolution range.
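The two filters in Steps 3 and 4 amount to simple range checks on each candidate map. The sketch below is one possible reading, with hypothetical function names and record fields; it assumes the volume ratio is the retrieved map's volume divided by the query's volume, and that an entry without a resolution value is dropped whenever a resolution range is requested.

```python
def volume_ratio(query_volume, entry_volume):
    """Volume of the retrieved entry divided by that of the query."""
    return entry_volume / query_volume


def filter_hits(hits, query_volume, use_volume_filter=True,
                ratio_range=(0.8, 1.2), resolution_range=None):
    """Keep hits that pass the volume filter and the optional resolution filter.

    Each hit is a dict with a "volume" key and an optional "resolution" key.
    """
    kept = []
    for hit in hits:
        if use_volume_filter:
            ratio = volume_ratio(query_volume, hit["volume"])
            if not (ratio_range[0] <= ratio <= ratio_range[1]):
                continue
        if resolution_range is not None:
            res = hit.get("resolution")
            # Assumption: entries lacking a resolution are excluded.
            if res is None or not (resolution_range[0] <= res <= resolution_range[1]):
                continue
        kept.append(hit)
    return kept


# Hypothetical candidate records, for illustration only.
candidates = [
    {"id": "map_a", "volume": 110.0, "resolution": 9.0},
    {"id": "map_b", "volume": 150.0, "resolution": 12.0},
    {"id": "map_c", "volume": 95.0},
]
print([h["id"] for h in filter_hits(candidates, query_volume=100.0)])
```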
The results page displays the top 20 entries in the database that have the most similar global shape to the query EM map. Figure 3 shows the four most similar EM maps for EMD-1375 as the query. The top panel shows the query entry ID and its molecule name, a figure of the query (provided by EMDB), and the 3DZD that characterizes the query entry in text and graphic forms. The query entry ID is a unique 4-digit accession number used in EMDB. The top panel also gives a link to a text file that lists the most similar maps. The bottom graphic panel lists the entries retrieved for the query. They are ranked by the distance of their 3DZDs to that of the query entry (quantified by the Euclidean distance, EucD, i.e. the square root of the sum of the squares of the differences between corresponding values). The smaller the EucD, the more similar the shapes of the two EM maps. Empirically, entries with a Euclidean distance of less than 8.0 are biologically related. For each retrieved entry, the page also shows the resolution of the map and the volume ratio, which is defined as the volume of the retrieved entry divided by that of the query. Clicking on the image of a retrieved entry triggers a new search using the clicked entry as the query.
We show three examples of search results by EM-SURFER. For these searches, the author-recommended density level was used, the volume filter was on, and only structures with a resolution provided in their metadata were retrieved. Figure 4 and Table 1 give detailed information on the top eight most similar EM maps for the first two queries. The first example is a search with a 30S ribosomal complex structure (EMD-2456) as the query. All of the top 10 most similar maps retrieved from the database are 30S ribosomal subunit structures.
The second example (Figure 3B) shows search results for tubulin, which has a cylindrical shape (EMD-1033). The top thirteen retrieved EM maps are all from tubulins. Similar to the first example, entries retrieved with a Euclidean distance of 6.5 or less are all tubulins. The second example demonstrates that EM-SURFER can retrieve similar EM maps not only for globular EM maps but also for cylindrical complexes.
The examples shown above demonstrate that EM-SURFER successfully retrieves related entries of the same molecules. However, since EM-SURFER performs global shape and volume comparison between EM maps, entries of the same molecule under different conditions that lead to an overall different shape would not be retrieved at a high rank, even if they would be easily retrieved by the text search that is currently available at EMDB. Table 2 and Figure 5 provide results that exemplify this type of situation. Nine EMDB entries, EMD-2555 to 2563, are maps under different conditions and of mutants of the hexameric AAA+ chaperone ClpB (gray region in Figure 5) bound (or not bound) to the protease ClpP (green). These entries were reported in the same paper [25]. Six copies of ClpB assemble into a ring-shaped complex (gray region) and work as a chaperone, in which a misfolded protein passes through the pore at the center of the hexamer ring and is unfolded. In a study by Carroni et al., mutants of ClpB were constructed that lock the complex in active or repressed states, which yielded the nine EM structures [25].
As shown in Table 2, when a search was performed with EMD-2556 as the query, not all of the other eight entries were close: three entries, EMD-2555, 2558, and 2560, were retrieved within a distance of 8.0, but the remaining five entries (EMD-2557, 2559, 2561, 2562, and 2563) were more distant than 10.0 (12.0 to 23.0). To understand why these five entries have a large distance, we computed the similarity of the ClpB (gray) and ClpP (green) regions separately (Figure 5). Interestingly, the entries with a large Euclidean distance have ClpB in different shapes, reflecting their different functional states. The ClpP region is similar in all the entries (the distance ranges from 4.08 to 6.87). EMD-2563 does not even have bound ClpP in the map, which makes the overall shape of the map completely different from that of the query. Thus, in this example, EM-SURFER detected different states of the same complex, which would be very useful for analyzing sub-states of the same macromolecules.
The current EM-SURFER identifies entries with globally similar shape to the query EM map, but does not detect local shape similarity between maps. Local map similarity search is left as future work.

Conclusions
We reported a web application named EM-SURFER for real-time biomolecular structure search based on electron microscopy density maps. EM density maps are updated weekly from EMDB. The unique feature of EM-SURFER, the ability to search EM maps by shape similarity in a matter of seconds, should prove invaluable in structural biology. A similar strategy will also be valuable for other types of low-resolution biological structure data.