Adjusting protein graphs based on graph entropy

Measuring protein structural similarity attempts to establish an equivalence relationship between polymer structures based on their conformations. In several recent studies, researchers have explored protein-graph remodeling instead of searching for a minimum superimposition of protein pairs. When graphs are used to represent structured objects, the problem of measuring object similarity becomes one of computing the similarity between graphs. Graph theory provides an alternative perspective as well as efficiency. Once a protein graph has been created, its structural stability must be verified. Therefore, a criterion is needed to determine whether a protein graph can be used for structural comparison. In this paper, we propose a measurement for protein graph remodeling based on graph entropy. We extend the concept of graph entropy to determine whether a graph is suitable for representing a protein. The experimental results suggest that graph entropy helps to select a suitable conformation in protein graph modeling. Furthermore, it indirectly contributes to protein structural comparison when the protein graph is sound.


Background
Graph theory is now widely used in information theory, combinatorial optimization, structural biology, chemistry, and many other fields. Measuring graph similarity is a practical approach in many of these fields. When graphs are used to represent structured objects, the problem of measuring similarities between objects becomes one of computing similarities between graphs [1]. Protein remodeling is another such field, in which structures with multiple domains are considerably complicated.
Proteins are important molecules for living organisms. They are essential parts of organisms and participate in almost every process within cells. A protein contains at least one linear chain of amino acid residues called a polypeptide. Through various kinds of synthesis, e.g., biosynthesis and chemical synthesis, a polypeptide is folded into a unique 3-dimensional structure. Usually, the structure of a protein determines the biological function it performs in an organism. Knowledge of a protein structure can help us understand biological functions and evolution. Measuring protein similarities according to the 3-dimensional structures of proteins provides a valuable tool for evaluating proteins with low sequence similarities when evolutionary relations among proteins cannot be detected by sequence alignment techniques. To perform a structural comparison of molecules, accurate information about two superimposed protein structures must be obtained. However, optimizing these two quantities simultaneously is difficult. Unlike the sequence alignment problem, the structural alignment problem has not even been classified as solvable.
For decades, studies have attempted to define topological relations and notations on protein structures; a schematic description is essentially expected to describe the topology of a structure. Mathematical formulations of structural patterns can facilitate the composition in a polypeptide chain. A schematic description has the advantage of simplicity, making it possible to use graph theory as an alternative approach [2]. By selectively ignoring protein structural features, it has the potential to detect further homologous relationships based on various geometric methods and motivations.
The structure of a protein can be regarded as a conformation with various local elements (e.g., helices, sheets) and forces (e.g., van der Waals forces, hydrogen bonds), folding into its specific characteristic and functional structure. With the help of graph transformation, folded polypeptide chains can be represented as a graph using several mapping rules. Proteins contain complex relationships within their polymers: residual reactions, covalent interactions, peptide bonding, and hydrophobic packing are essential parts of structural determination. The intention is to transform a protein structure into a graph. Formally, a graph transformation system consists of a set of graph rewrite rules of the form $L \to R$, where L is called the pattern graph and R is called the replacement graph [3]. Applying such a rule is the key operation in graph transformation.

Protein Remodeling
A detailed review of protein graph (abbreviated as P-graph) descriptions for protein remodeling can be found in [4]. Usually, the vertex set of a P-graph is defined by Cα atoms, residues, side chains, DSSP (the Dictionary of Protein Secondary Structure), or SSEs (secondary structure elements). The edge set is usually defined by the distance between two vertices, together with some labels, e.g., chemical properties. Figure 1 shows an overview of protein graph remodeling. Table 1 outlines some categories of protein graph approaches, each representing specific graph rewriting and graph measuring techniques. It is therefore useful to begin by summarizing common research under two headings: geometric relations and chemical relations.
Proteins have been represented in various ways using different levels of detail. The conformation of a protein structure has been shown to be determined geometrically by various constraints [5]. Therefore, the most common method for protein modeling is to preserve its topological relationships in graphs. From the perspective of graph theory, a simplified representation of protein structure focuses on connectivity patterns. It helps to examine in detail the interactions within a folded polypeptide. In brief, geometry-based protein modeling refines the edges (relations) among vertices (objects), using the interobject distances for all pairs of objects.
Compared with geometric relationships, chemical properties provide a more complicated description in the protein graph model. Amino acids have various chemical properties, including electrostatic charge, hydrophobicity, bonding type, size, and specific functional groups [6]. By assigning values to the edges and nodes of the graph, each labeled component varies according to the type of chemical relation it represents.

Entropy
Figure 1 An overview of protein graph remodeling.

Table 1 Recent studies for constructing protein graphs.
Ref.   Vertex Set    Edge Set
[20]   Cα atoms      labeled edges
[21]   DSSP          attributed edges
[22]   side chains   defined by interacted energy
[23]   residues      defined by geometrical constraints
[24]   SSE           labeled edges

Entropy defines a quantitative equilibrium property of a system, and it expresses the principle of disorder from the second law of thermodynamics [7]. It is particularly important in describing how energy is applied and transferred in an isolated system. The higher the disorder, the greater the entropy of the system [8]. This concept also applies to life. As we know, life is composed of many cells, tissues, and organs built from the vital element of protein. Since proteins are biochemical compounds consisting of one or more polypeptide chains, the arrangement of protein polymers is assumed to be in a compact state, according to its backbone dihedral angles and side chain rotamers. This is called conformational entropy. There is considerable evidence that the same observation can be applied to a protein graph model. In such a case, a graph model should also follow the second law of thermodynamics. For an n-object system G, assume that each object i is associated with a probability $p_i$. Then the entropy of the system G is defined as in Formula 1 [9].
In graph theory, the entropy of a graph is usually defined by its degree sequence. For example, consider the cycle with 4 vertices, i.e., $C_4$. Its degree sequence is (2, 2, 2, 2), and the degrees sum to 8. Thus, the probability $p_i$ for each vertex $v_i$ is $2/8 = 0.25$, and the entropy of $C_4$ follows by substituting these values into Formula 1.
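The $C_4$ example can be checked numerically. The sketch below assumes that Formula 1 is the usual Shannon form, $I(G) = -\sum_i p_i \log p_i$, and that the logarithm is taken to base 2; neither detail is spelled out in the text above, and the function name `degree_entropy` is hypothetical.

```python
from math import log2

def degree_entropy(degrees):
    """Shannon entropy (in bits) of p_i = deg(v_i) / (sum of all degrees).

    Assumption: Formula 1 is -sum(p_i * log2(p_i)); the base is a guess.
    """
    total = sum(degrees)
    probabilities = [d / total for d in degrees]
    # Vertices with p_i = 0 (isolated vertices) contribute nothing to the sum.
    return -sum(p * log2(p) for p in probabilities if p > 0)

c4_degrees = [2, 2, 2, 2]          # degree sequence of the 4-cycle
print(degree_entropy(c4_degrees))  # 2.0 under the base-2 assumption
```

With a different logarithm base the value is only rescaled, so comparisons between graphs are unaffected by that choice.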

Methods
In this section, we extend the concept of graph entropy to measure protein graphs. To illustrate the calculation of graph entropy, peptide chains of the MHC (Major Histocompatibility Complex) are selected as the material for examining the utility of graph entropy.

Graph entropy
For a given graph G = (V, E) and two vertices u and v in V, let d(u, v) denote the length of the shortest path between u and v. Let $N_k(u) = \{v \mid d(u, v) = k\}$. In graph theory, $N_k(u)$ is called the k-distance neighborhood of u, also known as the k-sphere of u [10]. Counting the k-distance neighbors of a vertex $v_i$ gives a good account of the mutual connectivity of nodes in G. We define the following formula.
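The k-distance neighborhoods themselves are straightforward to compute with a breadth-first search. The sketch below only builds the sets $N_k(u)$ that serve as input to the formula; it does not implement the formula itself. The function name `k_spheres` and the example path graph are illustrative assumptions, not taken from the paper.

```python
from collections import deque

def k_spheres(adjacency, u):
    """Return {k: N_k(u)} for every k >= 1 with N_k(u) non-empty.

    adjacency maps each vertex to an iterable of its neighbors.
    """
    distance = {u: 0}
    queue = deque([u])
    while queue:  # breadth-first search yields shortest-path lengths
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    spheres = {}
    for vertex, k in distance.items():
        if k > 0:
            spheres.setdefault(k, set()).add(vertex)
    return spheres

# Path graph 0 - 1 - 2 - 3 (an illustrative graph, not one from the paper)
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(k_spheres(path, 0))  # {1: {1}, 2: {2}, 3: {3}}
print(k_spheres(path, 1))  # {1: {0, 2}, 2: {3}}
```

The largest key of the returned dictionary is the greatest distance k at which $N_k(u)$ is still non-empty.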
In Formula 2, k is the greatest distance from u to any vertex (i.e., $N_k(u) \neq \emptyset$ but $N_{k+1}(u) = \emptyset$). The idea behind the formula is that every other vertex v has an impact on the current vertex u. In particular, the closer v is to u, the greater the impact of v. For simplicity, we let … Finally, our modified entropy formula I′(G) for a graph G = (V, E) is as follows.
For convenience, we consider the graph depicted in Figure 1 as an example; it is a P-graph based on crambin, a small plant protein. This graph is an unlabeled graph corresponding to the protein. The following equations are easy to obtain: … So the entropy of the graph depicted in Figure 1 is: … Let us consider the four graphs depicted in Figure 2. … makes a better decision. Note that in graph entropy, a higher entropy indicates that the graph structure is more stable.

Edge adjustment
By the definition of I′(G), its value does not increase monotonically as the density of G increases. Thus, we use the following cases to determine how to adjust the graph. Assume that G is the current graph and I′(G) = x. Let I′(G − e) = y and I′(G + e) = z, where G − e means that we remove the longest edge from G and G + e means that we add the shortest non-edge to G.
• Case 1: y = 0. After this edge is removed, G is no longer a connected graph.
• Case 2: z > x > y. Adding a new edge makes G more stable.
• Case 3: x > z > y. G is stable enough.
• Case 4: y > x > z. Removing an existing edge makes G more stable.
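The four cases can be written as a small decision function. This is a minimal sketch that takes the three entropy values as plain numbers; how they are computed is left to the entropy formula above. The function name `adjustment_case`, the returned labels, and the fallback branch for orderings outside the four cases are assumptions made for illustration.

```python
def adjustment_case(x, y, z):
    """Classify one adjustment step.

    x = I'(G), y = I'(G - e) after removing the longest edge,
    z = I'(G + e) after adding the shortest non-edge.
    """
    if y == 0:
        return "case 1: removing the longest edge disconnects G"
    if z > x > y:
        return "case 2: add the shortest non-edge"
    if x > z > y:
        return "case 3: G is stable enough"
    if y > x > z:
        return "case 4: remove the longest edge"
    # Orderings such as x > y > z are not discussed in the text.
    return "not covered by the four cases"

# Hypothetical entropy values, chosen only to exercise each branch.
print(adjustment_case(1.5, 0, 1.7))    # case 1
print(adjustment_case(1.5, 1.2, 1.7))  # case 2
print(adjustment_case(1.5, 1.2, 1.4))  # case 3
print(adjustment_case(1.5, 1.7, 1.2))  # case 4
```

Repeating this step until case 1 or case 3 is reached would give one possible adjustment loop, although the text does not state the stopping rule explicitly.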
As illustrated in Figure 3, the entropy value of a graph changes when edges are added or removed. A set of connected 5-node graphs is shown in Figure 4.

Graph spectra
Given two graphs $G_A = (V_A, E_A)$ and $G_B = (V_B, E_B)$, the graph matching problem is to find a one-to-one mapping $f : V_A \to V_B$ such that if $(u, v) \in E_A$, then the probability that $(f(u), f(v)) \in E_B$ is as high as possible. Numerous attempts have been made in recent years to compute graph similarity efficiently. In [11], it was shown that the graph matching problem may be divided into different types depending on their levels. Graph scoring qualitatively measures the mutual dependence of two objects [12]. Generally, a similarity value ranges between 0 and 1, from dissimilar to identical.
Occasionally, the topologies of graphs are complicated; therefore, one practical approach is to represent a graph as a matrix, turning the graph into numbers and vectors. Since graph isomorphism is hard to determine, graph spectra give an alternative solution for graph matching. By definition, the spectrum of a finite graph G is derived from its adjacency matrix $A_G$ and diagonal degree matrix $D_G$, whose entries $a_{i,j}$ and $d_{i,j}$ can be written as in Formula (6) and Formula (7), respectively. That is, the spectrum consists of the eigenvalues together with their multiplicities [13].
Figure 3 The effects of adding and removing edges in a graph.
The Laplacian matrix of G is $L_G = D_G - A_G$, and its spectrum indicates the topological properties and connectedness of the graph. In brief, the graph spectrum of G can be regarded as a set of eigenvalues $\lambda$ [14]. Compared with the binary relation of graph G, the spectrum of G carries more information on adjacency relations. We give an example to describe a graph spectrum transformation. Let X be the graph obtained by removing one edge from $K_4$. Let Y be the graph depicted in Figure 1.
Clearly, $|V_X| = 4 < |V_Y| = 5$. When spectra have different sizes, the smaller one may be padded with zeros to equalize the sizes of the spectra of $G_X$ and $G_Y$. By definition, the spectra $\lambda_X$ and $\lambda_Y$ can be obtained, i.e., $\lambda_X = [4, 4, 2, 0, 0]$ and $\lambda_Y = [4, 4, 1, 1, 0]$. The similarity between $G_X$ and $G_Y$ can be measured simply by the Euclidean distance between $\lambda_X$ and $\lambda_Y$. In this case, the distance is $\sqrt{(2-1)^2 + (0-1)^2} = \sqrt{2} \approx 1.414$.
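The worked example can be reproduced with a few lines of NumPy. The sketch below assumes the standard definitions for the adjacency and degree matrices (an entry of 1 for each edge, and vertex degrees on the diagonal), builds the Laplacian of X, and compares its zero-padded spectrum with the spectrum of Y quoted in the text. Since the graph Y itself is not reproduced here, its spectrum is entered as given; the function names are hypothetical.

```python
import numpy as np

def laplacian_spectrum(n_vertices, edges):
    """Eigenvalues of L = D - A for an undirected simple graph, largest first."""
    adjacency = np.zeros((n_vertices, n_vertices))
    for u, v in edges:
        adjacency[u, v] = adjacency[v, u] = 1.0
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    # eigvalsh handles symmetric matrices and returns ascending eigenvalues.
    return np.linalg.eigvalsh(laplacian)[::-1]

def spectral_distance(spectrum_a, spectrum_b):
    """Euclidean distance after zero-padding the shorter spectrum."""
    size = max(len(spectrum_a), len(spectrum_b))
    a = np.pad(np.asarray(spectrum_a, dtype=float), (0, size - len(spectrum_a)))
    b = np.pad(np.asarray(spectrum_b, dtype=float), (0, size - len(spectrum_b)))
    return float(np.linalg.norm(a - b))

# X: the complete graph K4 with the edge (2, 3) removed.
edges_x = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
spectrum_x = laplacian_spectrum(4, edges_x)
spectrum_y = [4, 4, 1, 1, 0]  # spectrum of Y as quoted in the text

print(np.round(spectrum_x, 3))
print(round(spectral_distance(spectrum_x, spectrum_y), 3))  # 1.414
```

Sorting both spectra in the same order before padding matters: the distance compares eigenvalues position by position.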

Results and discussion
In this experiment, we validated the remodeling function of the P-graph by using the extended graph entropy to verify the stability of a given P-graph. For the P-graph construction, please refer to [15]. We were interested only in the impact of connectivity on protein structural similarities. Various types of MHC were chosen as the material to evaluate the proposed method: 1HDM, 1K5N, 2ENG, 1VCA, 1ZXQ, 1UXW, 1A2Y, 3ARD, 2Q3Z, and 2CRY. The MHC, part of the immune system in most vertebrates, encodes a small complex cell surface protein. In humans it is also known as HLA (Human Leukocyte Antigen), whose genes are among the most intensively studied [16]. Due to the great diversity of microbes in the environment, MHC genes vary widely in their peptides through several mechanisms [17]; this is also the major reason why MHC proteins were selected as the material for this study.

P-graph comparison
Let G = (V, E) be the P-graph obtained by remodeling with the construction proposed in [15]. The vertices of V are created according to DSSP. Under this scheme, a protein secondary structure is represented by a single-letter code: H, helix (containing G, H, and I); T, hydrogen turn (containing T, E, and B); or C, coil (containing only C). To control a single variable in this experiment, the edge set E of G is varied within a specific range.
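The three-class grouping of DSSP codes described above amounts to a lookup table. The sketch below covers only the codes named in the text; how consecutive residues are then merged into vertices is defined in [15] and is not modeled here. The names `DSSP_TO_CLASS` and `reduce_dssp` and the sample string are hypothetical.

```python
# Grouping of DSSP codes into H / T / C as listed in the text.
DSSP_TO_CLASS = {
    "G": "H", "H": "H", "I": "H",  # helix
    "T": "T", "E": "T", "B": "T",  # hydrogen turn
    "C": "C",                      # coil
}

def reduce_dssp(codes):
    """Map a string of DSSP codes to the three-letter alphabet H/T/C."""
    try:
        return "".join(DSSP_TO_CLASS[code] for code in codes)
    except KeyError as error:
        # Codes outside the grouping above are rejected rather than guessed.
        raise ValueError(f"DSSP code not covered: {error.args[0]!r}") from None

print(reduce_dssp("CGGGHHHHTEEBC"))  # CHHHHHHHTTTTC
```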
A comparison of MHC proteins is shown in Table 2. In the table, PID is the protein identification number in the PDB [18]. Since MHC proteins are composed of multiple polypeptide chains, they are marked as multimeric in the Domain field. Dens is the density of the graph, defined as $\frac{2|E|}{|V|(|V|-1)}$ and ranging from 0 to 1. AVG indicates the average distance among DSSP vertices. If the distance between $v_i$ and $v_j$ is no greater than AVG, then there is an edge between them. In the table, +ke (−ke) means that we add (remove) the k shortest (longest) possible edges. For example, +1e means that we add the edge with the shortest length that is greater than AVG. NaC indicates that the resulting graph is not a connected graph.
The relationship between |E| and I′(G) is as follows. First, as the density of G increases, the graph G goes from sparse to dense. However, its extended entropy does not increase monotonically with its density, which appears somewhat anomalous. Second, the edge set in protein remodeling can be determined from its extended entropy. By definition, the P-graph G should be a connected graph. Once G becomes disconnected, we cannot determine its entropy. For example, 1A2Y is not a connected graph when the density is lower than 0.400. Third, E appears to be closely related to V in graph entropy. Consider the P-graph 2CRY as another example. If a protein remodeling function adopts a specific value as the threshold for its geometrical edges, then it might be an error to assume a fixed value as a criterion. This is an essential fact to stress. It is worth pointing out that the construction of a P-graph is limited by V.
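The quantities used in Table 2 (AVG, Dens, +ke, −ke, and NaC) can be sketched as follows. The code assumes that each vertex has a single coordinate and that AVG is the mean over all vertex pairs; both are assumptions, since the text does not say how a vertex position is derived. All function names and the toy coordinates are hypothetical.

```python
from itertools import combinations
from math import dist

def threshold_graph(coords):
    """Return (pair lengths, AVG, edges) with an edge for each pair at distance <= AVG."""
    lengths = {(i, j): dist(coords[i], coords[j])
               for i, j in combinations(range(len(coords)), 2)}
    avg = sum(lengths.values()) / len(lengths)
    edges = {pair for pair, d in lengths.items() if d <= avg}
    return lengths, avg, edges

def density(n_vertices, n_edges):
    """Dens = 2|E| / (|V| (|V| - 1))."""
    return 2 * n_edges / (n_vertices * (n_vertices - 1))

def add_shortest(lengths, edges, k):
    """+ke: add the k shortest pairs that are not yet edges."""
    non_edges = sorted((p for p in lengths if p not in edges), key=lengths.get)
    return edges | set(non_edges[:k])

def remove_longest(lengths, edges, k):
    """-ke: remove the k longest current edges."""
    longest = sorted(edges, key=lengths.get, reverse=True)
    return edges - set(longest[:k])

def is_connected(n_vertices, edges):
    """False corresponds to the NaC entries of the table."""
    neighbors = {i: set() for i in range(n_vertices)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for nxt in neighbors[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == n_vertices

# Toy coordinates on a line, chosen so that all pair lengths differ.
coords = [(0, 0), (1, 0), (3, 0), (7, 0)]
lengths, avg, edges = threshold_graph(coords)
print(sorted(edges), density(len(coords), len(edges)))
print(is_connected(len(coords), edges))                            # False (NaC)
print(is_connected(len(coords), add_shortest(lengths, edges, 1)))  # True
```

In this toy case the AVG threshold alone leaves vertex 3 isolated, and adding one edge (+1e) restores connectivity, which mirrors the role of NaC in the table.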

P-graph verification
To validate the previous assumptions, a method for protein structural comparison is adopted to measure similarity. Graph spectra give an alternative solution to graph matching. A graph spectrum is a set of relational parameters, consisting of the characteristic polynomial and eigenvectors of its adjacency matrix or Laplacian matrix. Graph spectra quantitatively provide graph information, e.g., structure, topology, and connectivity [19]. In Table 3 we list the results of protein structure remodeling. The field Old shows a remodeling based on a specific value of edge length, and New indicates that the edges in G are adjusted by the extended entropy. The value in each entry is the distance between the two spectra. If our method obtains a better result in the comparison, we mark "+"; otherwise, we mark "=" (not bad) or "−" (worse). Table 4 shows the CATH codes for the selected macromolecules. In summary, the extended entropy determines a better conformational graph for protein structure remodeling.

Program and environment
The procedure for computing the extended entropy of a P-graph was implemented and tested with the MHC PDB dataset. The experiments ran on a 2 GHz PC with 512 MB of main memory under Linux-2.6.11-1.1369. The implementation was written using Bash-3.00.16(1) and Octave-3.0.0.

Conclusion
In this paper, we proposed a measurement to determine graph stability for protein structure remodeling based on graph entropy. Our modified entropy validation shows a positive result for protein structural comparison. This graph-based approach offers a practical concept to support protein structural alignment.