The information stored in the Protein Data Bank [1] would make fully automated *in silico* studies possible if mislabeled chemical groups, broken protein and nucleic acid chains, and other errors were corrected. Even today, newly submitted data are verified "by hand" by human experts. In an earlier work, we applied a rigorous cleaning and restructuring procedure to the entries in the Protein Data Bank [2] and created the RS-PDB database. We made use of non-trivial mathematical tools, mainly graph algorithms: computing the InChI™ code [3, 4] involved graph-isomorphism testing; transforming aromatic notation to Kekulé notation used a non-bipartite graph-matching algorithm [5]; breadth-first-search graph traversals [6] were used throughout the work [2]; depth-first search [6] was used in building the ligand molecules and identifying ring structures; kd-trees [7] were applied for computing covalent bonds; and hashing [6] was utilized for the fast generation of protein-sequence IDs.

The resulting RS-PDB database is capable of serving intricate structural queries on all the three-dimensional protein structures known to mankind.

Mapping the physico-chemical properties of protein-ligand binding sites, most importantly the Coulomb and van der Waals forces, is of basic importance for predicting protein-ligand binding, for designing ligands for a given binding site on the surface of a protein, and for designing inhibitors or activators of enzymatic mechanisms. The exact description of the forces in question is a deep quantum-chemical problem. The atomic environment of a binding site clearly has a strong effect on these forces; consequently, examining the atomic environments of the ligands in the crystallographically verified protein-ligand complexes in the PDB would yield insight into binding mechanisms and the design of biologically active molecules. The first step in this direction needs to be the analysis of the simplicial structures formed by the atoms of the protein structures themselves. The second step is the analysis of the simplicial neighborhoods of the ligand atoms.

In the present work we define a certain simplicial decomposition on the heavy atoms of the protein structures in the PDB, and analyze some geometrical properties of the tetrahedra of different atomic composition. In this way we succeeded, for the first time in the literature, in defining a structure capable of answering topological questions concerning the distribution of volume and shape of heavy protein atoms in the whole PDB. One of our main results is the identification of the volume-shape relation of tetrahedra of distinct atomic composition.

### Delaunay-decompositions

Even the refined, cleaned RS-PDB database [2] lacks important features, such as easy handling of queries like the following: What atoms surround a certain (ligand or protein) atom in the structure? Which atoms neighbour the atom or amino acid X in the protein? How many ligand atoms are surrounded by a tetrahedron with C-C-C-O atoms at its vertices? How frequent are the tetrahedra with vertices C-C-O-N? Are there differences in the shape of tetrahedra of different composition?

Note that such queries cannot be answered from the amino-acid sequence of the protein, since they intrinsically depend on the tertiary structure of the protein. Consequently, one needs to use some cleaned version of the PDB as the initial data.
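To make the flavour of these queries concrete, the following is a minimal sketch, assuming the structure has already been decomposed into tetrahedra given as 4-tuples of atom indices. The names `composition`, `composition_counts` and `neighbours`, the canonical (alphabetically sorted) composition label, and the toy data are all illustrative assumptions, not part of the RS-PDB database.

```python
from collections import Counter

def composition(tetra, elements):
    """Canonical label of a tetrahedron: its vertex elements, sorted and joined."""
    return "-".join(sorted(elements[i] for i in tetra))

def composition_counts(tetrahedra, elements):
    """How many tetrahedra have each atomic composition."""
    return Counter(composition(t, elements) for t in tetrahedra)

def neighbours(atom, tetrahedra):
    """Atoms that share at least one tetrahedron with the given atom."""
    found = set()
    for t in tetrahedra:
        if atom in t:
            found.update(t)
    found.discard(atom)
    return found

# Toy data (hypothetical): six atoms and three tetrahedra over them.
elements = ["C", "C", "O", "N", "C", "C"]
tetrahedra = [(0, 1, 2, 3), (0, 1, 2, 4), (1, 2, 4, 5)]

counts = composition_counts(tetrahedra, elements)
print(counts["C-C-C-O"])          # frequency of C-C-C-O tetrahedra
print(neighbours(3, tetrahedra))  # atoms neighbouring atom 3
```

Because the label is sorted, the composition written as C-C-O-N in the text appears under the key `"C-C-N-O"` in this sketch.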

We have chosen Delaunay decomposition for the discretization of the dataset in the RS-PDB database, since in this "tessellation" the tetrahedra are close to regular ones, and it is a natural and well-defined notion with a well-known algorithm for generating the tessellation.

**Definition 1** *Given a finite set of points A* ⊆ *R*^{3}, *and a subset H* ⊆ *A such that the points of H lie on the surface of a sphere and the sphere contains no further points of A, the convex hull of H is called a Delaunay region*.

Delaunay regions define a partition of the convex hull of *A*. If the points of *A* are in general position (i.e., no five of the points are on the surface of a sphere), then all regions are tetrahedra.
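Definition 1 can be checked directly on a small point set. The sketch below is a brute-force illustration only, not the well-known efficient algorithm: it tests every 4-subset for an empty circumsphere, assuming the points are in general position. The function names, the tolerance `eps` and the five sample points are assumptions made for this example.

```python
from itertools import combinations

def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def circumsphere(p0, p1, p2, p3):
    """Centre and squared radius of the sphere through four points (None if coplanar)."""
    # Equal distance to p0 and p_i gives the linear system 2(p_i - p0) . c = |p_i|^2 - |p0|^2.
    rows = [[2 * (p[k] - p0[k]) for k in range(3)] for p in (p1, p2, p3)]
    rhs = [sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)) for p in (p1, p2, p3)]
    d = det3(rows)
    if abs(d) < 1e-12:
        return None
    centre = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        centre.append(det3(m) / d)
    r2 = sum((centre[k] - p0[k]) ** 2 for k in range(3))
    return centre, r2

def delaunay_tetrahedra(points, eps=1e-9):
    """All 4-subsets whose circumsphere contains no further point (O(n^5) brute force)."""
    result = []
    for quad in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in quad))
        if sphere is None:
            continue
        centre, r2 = sphere
        others = (j for j in range(len(points)) if j not in quad)
        if all(sum((points[j][k] - centre[k]) ** 2 for k in range(3)) > r2 + eps
               for j in others):
            result.append(quad)
    return result

# Five sample points in general position (hypothetical data).
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1.5)]
tets = delaunay_tetrahedra(points)
print(tets)
```

For realistic structure sizes this exhaustive test is far too slow; it is meant only to mirror the empty-sphere condition of the definition.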

Singh, Tropsha and Vaisman [8] applied Delaunay decomposition to protein structures as follows: they selected *A* to be the set of C_{α} atoms of the protein, and analyzed the relationship between the volume and "tetrahedrality" of the Delaunay regions and the amino-acid order, in order to predict secondary protein structure.

They gave the following definition:

**Definition 2** ([8]) *The tetrahedrality of the tetrahedron with edge-lengths* ℓ_{1}, ℓ_{2}, ℓ_{3}, ℓ_{4}, ℓ_{5}, ℓ_{6} *is defined as*

$$4\left(\sum_{k}\ell_{k}\right)^{2}\sum_{i<j}\frac{(\ell_{i}-\ell_{j})^{2}}{15}$$

*where* ℓ_{i} *is the length of edge i*.

Note that the tetrahedrality of the regular tetrahedron is 0.
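The remark about the regular tetrahedron can be verified numerically. The sketch below takes the sum of squared pairwise edge-length differences and the factor 15 from Definition 2; the normalization by the squared mean edge length is an assumption of this sketch (it makes the value independent of scale), as are the function names and the sample coordinates.

```python
from itertools import combinations
from math import dist

def edge_lengths(a, b, c, d):
    """The six edge lengths of the tetrahedron with vertices a, b, c, d."""
    return [dist(p, q) for p, q in combinations((a, b, c, d), 2)]

def tetrahedrality(a, b, c, d):
    """Sum of squared pairwise edge-length differences over 15 * (mean edge length)^2.

    The normalization by the squared mean edge length is an assumption of this sketch.
    """
    lengths = edge_lengths(a, b, c, d)
    mean = sum(lengths) / 6
    spread = sum((x - y) ** 2 for x, y in combinations(lengths, 2))
    return spread / (15 * mean ** 2)

# A regular tetrahedron (alternate corners of a cube) and a distorted "corner" one.
regular = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
corner = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

print(tetrahedrality(*regular))  # all six edges are equal, so every difference vanishes
print(tetrahedrality(*corner))   # unequal edges give a strictly positive value
```

Whatever normalization is used, the value is zero exactly when all six edges have the same length, which is why the regular tetrahedron attains the minimum.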