Adjusting protein graphs based on graph entropy

Peng, Sheng-Lung; Tsay, Yu-Wei

doi:10.1186/1471-2105-15-S15-S6

Volume 15 Supplement 15

Proceedings of the 2013 International Conference on Intelligent Computing (ICIC 2013)

Proceedings
Open access
Published: 03 December 2014

Sheng-Lung Peng¹ &
Yu-Wei Tsay¹

BMC Bioinformatics volume 15, Article number: S6 (2014)

Abstract

Measuring protein structural similarity attempts to establish a relationship of equivalence between polymer structures based on their conformations. In several recent studies, researchers have explored protein-graph remodeling instead of searching for a minimum superimposition of pairs of proteins. When graphs are used to represent structured objects, the problem of measuring object similarity becomes one of computing the similarity between graphs. Graph theory provides an alternative perspective as well as efficiency. Once a protein graph has been created, its structural stability must be verified. Therefore, a criterion is needed to determine whether a protein graph can be used for structural comparison. In this paper, we propose a measurement for protein graph remodeling based on graph entropy. We extend the concept of graph entropy to determine whether a graph is suitable for representing a protein. The experimental results suggest that graph entropy helps to select a conformation in protein graph modeling. Furthermore, it indirectly contributes to protein structural comparison when the protein graph is stable.

Background

Graph theory is now widely used in information theory, combinatorial optimization, structural biology, the study of chemical molecules, and many other fields. Measuring graph similarity is a practical approach in many of them. When graphs are used to represent structured objects, the problem of measuring similarities between objects becomes one of computing similarities between graphs [1]. Protein remodeling is another such field, in which the multiple domains within structures are considerably complicated.

Proteins are important molecules for living organisms: they are essential parts of organisms and participate in almost every process within cells. A protein contains at least one linear chain of amino acid residues called a polypeptide. Through various kinds of synthesis, e.g., biosynthesis and chemical synthesis, a polypeptide is folded into a unique 3-dimensional structure. Usually, the structure of a protein determines the biological function it performs in an organism. Knowledge of a protein's structure can help us understand its biological functions and evolution. Measuring protein similarity according to 3-dimensional structure provides a valuable tool for evaluating proteins with low sequence similarity, when evolutionary relations among proteins cannot be detected by sequence alignment techniques. To perform a structural comparison of molecules, accurate information about two superimposed protein structures must be obtained. However, optimizing these two quantities simultaneously is difficult. Unlike the sequence alignment problem, the structural alignment problem has not even been classified as solvable.

For decades, studies have attempted to define topological relations and notations on protein structures, in which a schematic description is expected to capture the topology of a protein. Mathematical formulations of structural patterns can facilitate the description of the composition of a polypeptide chain. A schematic description has the advantage of simplicity, which makes graph theory a possible alternative approach [2]. By selectively ignoring protein structural features, it has the potential to detect further homologous relationships based on various geometric methods and motivations.

The structure of a protein can be regarded as a conformation with various local elements (e.g., helices, sheets) and forces (e.g., van der Waals forces, hydrogen bonds), folding into its specific characteristic and functional structure. With the help of graph transformation, folded polypeptide chains can be represented as a graph using several mapping rules. Proteins contain complex relationships within their polymers: residual reactions, covalent interactions, peptide bonding, and hydrophobic packing are essential parts of structural determination. The intention is to transform a protein structure into a graph. Formally, a graph transformation system consists of a set of graph rewrite rules L → R, in which L is called the pattern graph and R is called the replacement graph [3]. Rewriting is the key operation in graph transformation.

Protein Remodeling

Regarding protein remodeling, a detailed review of protein graph (abbreviated as P-graph) descriptions can be found in [4]. Usually, the vertex set of a P-graph is defined by C_α atoms, residues, side chains, DSSP (the dictionary of protein secondary structures) codes, or SSEs (secondary structure elements). The edge set is usually defined by the distance between two vertices, together with some labels, e.g., chemical properties. Figure 1 shows an overview of protein graph remodeling. Table 1 outlines some categories of protein graph approaches, each representing specific graph rewriting and graph measuring techniques. It is therefore useful to begin by summarizing common research under two headings: geometric relations and chemical relations.

Table 1 Recent studies for constructing protein graphs.

Proteins have been represented in various ways using different levels of detail. The conformation of a protein structure has been shown to be determined geometrically by various constraints [5]. Therefore, the most common method for protein modeling is to preserve its topological relationships in a graph. From the perspective of graph theory, a simplified representation of protein structure focuses on connectivity patterns, which helps to examine in detail the interactions within a folded polypeptide. In brief, geometry-based protein modeling refines the edges (relations) among vertices (objects), using the inter-object distances for all pairs of objects.

Compared with geometric relationships, chemical properties provide a more complicated description in the protein graph model. Amino acids have various chemical properties, including electrostatic charge, hydrophobicity, bonding type, size, and specific functional groups [6]. By assigning values to the edges and nodes of a graph, each labeled component can reflect a different type of chemical relation.

Entropy

Entropy defines a quantitative equilibrium property within a system, and it implies the principle of disorder from the second law of thermodynamics [7]. It is particularly important in describing how energy is applied and transferred in an isolated system. The higher the disorder, the greater the entropy of the system [8]. This concept also appears in living systems. As we know, life is composed of many cells, tissues, and organs built from the vital element of protein. Since proteins are biochemical compounds consisting of one or more polypeptide chains, the arrangement of a protein polymer is assumed to be in a compact state determined by its backbone dihedral angles and side chain rotamers. This is called conformational entropy. There is considerable evidence that the same observation can be applied to a protein graph model. In such a case, a graph model should also follow the second law of thermodynamics.

For an n-object system G, assume that each object i is associated with a probability p_i. Then the entropy of the system G is defined as in Formula 1 [9].

$$I(G) = \sum_{i=1}^{n} -p_{i} \log_{2} p_{i} \tag{1}$$

In graph theory, the entropy of a graph is usually defined by its degree sequence, with p_i taken as the degree of v_i divided by the sum of all degrees. For example, consider the cycle with 4 vertices, C₄. Its degree sequence is (2, 2, 2, 2), so p_i for each vertex v_i is $\frac{2}{8} = 0.25$. By definition, I(C₄) = −4 × 0.25 × log₂(0.25) = 2.
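The degree-based calculation can be sketched in a few lines of Python. This is only an illustration of Formula 1 under the degree-sequence reading used in the C₄ example; the function name `degree_entropy` is hypothetical and not part of the original implementation.

```python
import math

def degree_entropy(degrees):
    """Shannon entropy (base 2) with p_i = deg(v_i) / (sum of all degrees)."""
    total = sum(degrees)
    # Vertices of degree 0 would give p_i = 0 and contribute nothing to the sum.
    probs = [d / total for d in degrees if d > 0]
    return -sum(p * math.log2(p) for p in probs)

# Degree sequence of the cycle C4 from the example above.
print(degree_entropy([2, 2, 2, 2]))
```

For the degree sequence (2, 2, 2, 2) the function returns 2, in agreement with I(C₄).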

Methods

In this section, we extend the concept of graph entropy to the measurement of protein graphs. To illustrate the calculation of graph entropy, peptide chains of MHC (Major Histocompatibility Complex) proteins are selected as the material for examining its utility.

Graph entropy

For a given graph G = (V, E) and two vertices u and v in V, let d(u, v) denote the length of the shortest path between u and v. Let N_k(u) = {v | d(u, v) = k}. In graph theory, N_k(u) is called the k-distance neighborhood of u, also known as the k-sphere of u [10]. Counting the k-distance neighbors of each vertex gives a good account of the mutual connectivity of the nodes in G. We define the following formula.

$$f(u) = \sum_{i=1}^{k} \frac{|N_{i}(u)|}{n - i + 1} \tag{2}$$

In Formula 2, n = |V| and k is the greatest distance from u to any vertex (i.e., N_k(u) ≠ ∅ but N_{k+1}(u) = ∅). The idea behind the formula is that every other vertex v has an impact on the current vertex u. In particular, the closer v is to u, the greater the impact of v. For simplicity, we let f(V) = ∑_{v∈V} f(v). Assume that V = {v₁, v₂, ..., v_n}. We define q_i for each v_i as follows.

$$q_{i} = \frac{f(v_{i})}{f(V)} \tag{3}$$

Finally, our modified entropy formula for a graph G = (V, E) is as follows.

$$I'(G) = \sum_{i=1}^{n} -q_{i} \log_{2} q_{i} \tag{4}$$
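Formulas 2 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the original Bash/Octave implementation: the graph is assumed to be connected with at least two vertices and given as an adjacency dictionary, and the names `sphere_sizes` and `extended_entropy` are hypothetical.

```python
import math
from collections import deque

def sphere_sizes(adj, u):
    """Return [|N_1(u)|, ..., |N_k(u)|] using breadth-first search from u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    sizes = [0] * max(dist.values())
    for d in dist.values():
        if d > 0:
            sizes[d - 1] += 1
    return sizes

def extended_entropy(adj):
    """I'(G) following Formulas 2-4; adj maps each vertex to its neighbors."""
    n = len(adj)
    f = {}
    for u in adj:
        sizes = sphere_sizes(adj, u)
        if sum(sizes) != n - 1:
            raise ValueError("graph must be connected")
        # Formula 2: the i-th sphere is divided by n - i + 1.
        f[u] = sum(size / (n - i + 1) for i, size in enumerate(sizes, start=1))
    total = sum(f.values())                      # f(V)
    q = [value / total for value in f.values()]  # Formula 3
    return -sum(qi * math.log2(qi) for qi in q)  # Formula 4

star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(round(extended_entropy(star4), 3), round(extended_entropy(cycle4), 3))
```

For the star S₄ this sketch gives approximately 1.995 and for the cycle C₄ it gives 2. Because Formula 4 is a Shannon entropy over n probabilities, a literal implementation is bounded above by log₂ n, so values obtained this way may differ from other figures printed in the text.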

As an example, we consider the graph depicted in Figure 1, a P-graph based on the small plant protein crambin. This graph is an unlabeled graph corresponding to the protein. The following equations are easy to obtain:

$$\begin{aligned} f(v_{1}) &= f(v_{3}) = \frac{3}{5} + \frac{1}{4} \\ f(v_{2}) &= \frac{2}{5} + \frac{2}{4} \\ f(v_{4}) &= f(v_{5}) = \frac{1}{5} + \frac{2}{4} + \frac{1}{3} \end{aligned}$$

Thus, the entropy of the graph depicted in Figure 1 is:

$$\begin{aligned} I'(G) &= -q(v_{1}) \log_{2} q(v_{1}) - q(v_{2}) \log_{2} q(v_{2}) - q(v_{3}) \log_{2} q(v_{3}) - \dots - q(v_{5}) \log_{2} q(v_{5}) \\ &= -2 q(v_{1}) \log_{2} q(v_{1}) - q(v_{2}) \log_{2} q(v_{2}) - 2 q(v_{4}) \log_{2} q(v_{4}) \\ &= -2 \times 0.3188 \times \log_{2}(0.3188) - 0.2455 \times \log_{2}(0.2455) - 2 \times 0.2214 \times \log_{2}(0.2214) \\ &= 2.4835 \end{aligned} \tag{5}$$

Let us consider the four graphs depicted in Figure 2: C₄, a cycle of four vertices; K₄, a clique of four vertices; P₄, a path of four vertices; and S₄, a star of four vertices. For C₄ and K₄, since the four vertices are in the same situation, they have the same probability. Thus I(C₄) = I′(C₄) = I(K₄) = I′(K₄) = log₂(4) = 2. However, for P₄ and S₄, we have I(P₄) = 1.918 and I(S₄) = 1.793. By Formula 4, we have I′(P₄) = 1.894 and I′(S₄) = 1.995. The densities of P₄ and S₄ are the same (i.e., 0.5). However, the diameter of P₄ is greater than that of S₄. According to traditional graph entropy, I(S₄) < I(P₄) < I(K₄) = I(C₄). However, in our formula, I′(P₄) < I′(S₄) < I′(K₄) = I′(C₄). Intuitively, S₄ is more compact than P₄. Thus, our formula makes a better decision. Note that in graph entropy, a higher entropy indicates that the graph structure is more stable.

Edge adjustment

By the definition of I′(G), its value does not increase monotonically as the density of G increases. Thus, we use the following cases to determine how to adjust the graph. Assume that G is the current graph and I′(G) = x. Let I′(G − e) = y and I′(G + e) = z, where G − e means that we remove the longest edge from G and G + e means that we add the shortest non-edge to G.

Case 1: y = 0. After this edge is removed, G is no longer a connected graph.
Case 2: z > x > y. By adding a new edge, G will become more stable.
Case 3: x > z > y. G is stable enough.
Case 4: y > x > z. By removing an old edge, G will become more stable.

As illustrated in Figure 3, when edges are added to or removed from a graph, its entropy value changes. A set of connected 5-node graphs is shown in Figure 4.
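The four cases can be written as a small decision function. The sketch below only encodes the comparisons listed above; the function name `adjustment_case` and the returned labels are hypothetical, and any ordering of x, y, and z not covered by the four cases is reported as unclassified rather than guessed.

```python
def adjustment_case(x, y, z):
    """Classify an adjustment step from x = I'(G), y = I'(G - e), z = I'(G + e)."""
    if y == 0:
        return "case 1: removing the edge disconnects G"
    if z > x > y:
        return "case 2: add the shortest non-edge"
    if x > z > y:
        return "case 3: G is stable enough"
    if y > x > z:
        return "case 4: remove the longest edge"
    return "unclassified"

print(adjustment_case(2.30, 2.25, 2.31))
```

With the illustrative values above (not taken from the experiments), the function reports case 2, because z > x > y.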

Graph spectra

Given two graphs G_A = (V_A, E_A) and G_B = (V_B, E_B), the graph matching problem is to find a one-to-one mapping f : V_A → V_B such that if (u, v) ∈ E_A, then the possibility that (f(u), f(v)) ∈ E_B is as high as possible. Numerous attempts have therefore been made in recent years to compute graph similarity efficiently. As shown in [11], the problem of graph matching may be divided into different types depending on their levels. Graph scoring qualitatively measures the mutual dependence of two objects [12]. Generally, a similarity value ranges between 0 and 1, from dissimilar to identical.

Occasionally, the topologies of graphs are complicated; therefore, one practical way is to represent a graph as a matrix, turning the graph into numbers and vectors. Since it is hard to determine graph isomorphism, graph spectra give an alternative solution for graph matching. By definition, the spectrum of a finite graph G is derived from its adjacency matrix A_G and diagonal degree matrix D_G, whose entries a_{i,j} and d_{i,j} can be written as in Formula (6) and Formula (7), respectively. That is, it is the set of eigenvalues of the matrix together with their multiplicities [13].

$$a_{i,j} = \begin{cases} 1, & \text{if } (i, j) \in E, \\ 0, & \text{otherwise}. \end{cases} \tag{6}$$

$$d_{i,j} = \begin{cases} \deg(v_{i}), & \text{if } i = j, \\ 0, & \text{if } i \neq j. \end{cases} \tag{7}$$

The Laplacian matrix of G is L_G = D_G − A_G, which indicates the topological properties and connectedness of the graph. In brief, the spectrum of G can be regarded as the set of eigenvalues λ of this matrix [14]. Compared with the binary adjacency relation of G, the spectrum of G carries richer information on adjacency. We give some examples to describe a graph spectrum transformation. Let X be the graph obtained by removing one edge from K₄. Let Y be the graph depicted in Figure 1.

$$L_{X} = \begin{pmatrix} 3 & -1 & -1 & -1 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ -1 & -1 & 3 & -1 & 0 \\ -1 & 0 & -1 & 2 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix} \tag{8}$$

$$L_{Y} = \begin{pmatrix} 3 & -1 & -1 & -1 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ -1 & -1 & 3 & 0 & -1 \\ -1 & 0 & 0 & 1 & 0 \\ 0 & 0 & -1 & 0 & 1 \end{pmatrix} \tag{9}$$

Obviously, |V_X| = 4 < |V_Y| = 5. When spectra have different sizes, the smaller one may be padded with zero values to equalize the sizes of G_X and G_Y. By definition, the spectra λ_X and λ_Y can be obtained, i.e., λ_X = [4, 4, 2, 0, 0] and λ_Y = [4, 4, 1, 1, 0]. The similarity between G_X and G_Y can be simply measured by the Euclidean distance between λ_X and λ_Y. In this case, the distance between G_X and G_Y is 1.414.
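The spectral comparison can be sketched with NumPy. This is an illustrative reading of the procedure (build L = D − A, take eigenvalues, zero-pad the shorter spectrum, compute the Euclidean distance), not the original Octave code; the helper names `laplacian` and `padded_spectrum` and the 0-based edge lists are assumptions read off the matrices L_X and L_Y.

```python
import numpy as np

def laplacian(n, edges):
    """Laplacian L = D - A of an undirected graph on vertices 0..n-1."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1
        L[j, i] -= 1
        L[i, i] += 1
        L[j, j] += 1
    return L

def padded_spectrum(L, size):
    """Eigenvalues of L in descending order, zero-padded to the given size."""
    eig = np.sort(np.linalg.eigvalsh(L))[::-1]
    out = np.zeros(size)
    out[:len(eig)] = eig
    return out

# X: K4 minus one edge; Y: the 5-vertex graph whose Laplacian is L_Y.
edges_x = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
edges_y = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 4)]

lam_x = padded_spectrum(laplacian(4, edges_x), 5)
lam_y = padded_spectrum(laplacian(5, edges_y), 5)
distance = float(np.linalg.norm(lam_x - lam_y))
print(np.round(lam_x, 3), np.round(lam_y, 3), round(distance, 3))
```

Numerically computed eigenvalues need not be integers, so the spectrum of Y and the resulting distance may differ from the figures quoted in the text.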

Results and discussion

In this experiment, we validated the remodeling function of the P-graph by using extended graph entropy to verify the stability of a given P-graph. For the P-graph construction, please refer to [15]. We were interested only in the impact of connectivity on protein structural similarity. Various types of MHC were chosen as the material to verify the proposed method: 1HDM, 1K5N, 2ENG, 1VCA, 1ZXQ, 1UXW, 1A2Y, 3ARD, 2Q3Z, and 2CRY. MHC, part of the immune system in most vertebrates, encodes a small complex cell surface protein. It is also known as HLA (Human Leukocyte Antigen), one of the most intensively studied sets of genes in humans [16]. Due to the great diversity of microbes in the environment, MHC genes vary their peptides widely through several mechanisms [17]; this is also the major reason why MHC proteins were selected as the material for this study.

P-graph comparison

Let G = (V, E) be the P-graph obtained by remodeling with the construction proposed in [15]. The vertices of V in G are created according to the DSSP. Under this metric, a protein secondary structure is represented by a single-letter code: H, helix (containing G, H, and I); T, hydrogen turn (containing T, E, and B); or C, coil (containing only C). To control a single variable in this experiment, only the edge set E of G is varied, within a specific range.
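The reduction of DSSP codes to the three classes can be sketched as a lookup table. Only the mapping stated above is encoded; the names `DSSP_CLASS` and `reduce_dssp` are hypothetical, and returning "?" for any other code is an assumption of this sketch, since the text does not say how such codes are handled.

```python
# Mapping stated in the text: G, H, I -> H; T, E, B -> T; C -> C.
DSSP_CLASS = {
    "G": "H", "H": "H", "I": "H",
    "T": "T", "E": "T", "B": "T",
    "C": "C",
}

def reduce_dssp(codes):
    """Map a string of DSSP codes to the classes H, T, C ("?" if unlisted)."""
    return "".join(DSSP_CLASS.get(code, "?") for code in codes)

print(reduce_dssp("GHITEBC"))
```

How runs of identical classes are turned into vertices is defined by the construction in [15] and is not reproduced here.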

A comparison of MHC proteins is shown in Table 2. In the table, PID is the protein identification number in the PDB [18]. Since MHC proteins are composed of multiple polypeptide chains, their Domain is multimeric. Furthermore, Dens means the density of the graph. It is defined as $\frac{2 | E |}{| V | (| V | - 1)}$ and ranges from 0 to 1. AVG indicates the average distance between DSSP vertices. If the distance between v_i and v_j is no greater than AVG, then there is an edge between them. In the table, +ke (−ke) means that we add (remove) the k shortest (longest) possible edges. For example, +1e means that we add the edge with the shortest length that is greater than AVG. In the table, NaC indicates that the resulting graph is not a connected graph.
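The AVG threshold, the ±ke adjustment, and the density can be sketched together. The sketch assumes that each vertex has a 3-dimensional coordinate and that AVG is the mean over all pairwise distances; both are assumptions for illustration, and the names `threshold_edges` and `density` are hypothetical.

```python
import math
from itertools import combinations

def threshold_edges(coords, k=0):
    """Edges no longer than AVG, then adjusted by +k (add) or -k (remove)."""
    pairs = sorted(combinations(range(len(coords)), 2),
                   key=lambda p: math.dist(coords[p[0]], coords[p[1]]))
    dists = [math.dist(coords[i], coords[j]) for i, j in pairs]
    avg = sum(dists) / len(dists)
    base = sum(1 for d in dists if d <= avg)  # edges with length <= AVG
    # Pairs are sorted by length, so +k appends the k shortest non-edges
    # and -k drops the k longest edges.
    count = max(0, min(len(pairs), base + k))
    return pairs[:count], avg

def density(n_vertices, n_edges):
    """Dens = 2|E| / (|V| (|V| - 1))."""
    return 2 * n_edges / (n_vertices * (n_vertices - 1))

# Four collinear points one unit apart (illustrative data only).
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
edges, avg = threshold_edges(coords)
print(edges, round(avg, 3), density(len(coords), len(edges)))
```

For these four points the thresholded graph is a path with three edges, whose density is 0.5; calling `threshold_edges(coords, 1)` corresponds to +1e and `threshold_edges(coords, -1)` to −1e.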

Table 2 Selected proteins with corresponding extended entropies.

The relationship between |E| and I′(G) is as follows. First, as the density of G increases, G goes from sparse to dense; however, its extended entropy does not increase strictly with its density, which may seem somewhat anomalous. Second, the edge set in the protein remodeling problem can be determined from the extended entropy. By definition, the P-graph G should be a connected graph; once G becomes disconnected, its entropy cannot be determined. For example, 1A2Y is not a connected graph when the density is lower than 0.400. Third, E appears to be considerably related to V in graph entropy. Consider the P-graph 2CRY as another example. If a protein remodeling function adopts a specific value for its geometric edge criterion, then it might be an error to assume a fixed value as a criterion. This is an essential fact to stress. It is worth pointing out that the construction of a P-graph is limited by V.

P-graph verification

To validate the previous assumptions, a method for protein structural comparison is adopted to measure similarity. Graph spectra give an alternative solution to graph matching. A spectrum is a set of relational parameters, consisting of a characteristic polynomial and the eigenvectors of the adjacency matrix or Laplacian matrix. Graph spectra quantitatively provide graph information, e.g., structure, topology, and connectivity [19]. In Table 3 we list the results of protein structure remodeling. The field Old shows a remodeling based on a specific value of edge length, and New indicates that the edges in G are adjusted by extended entropy. The value in each entry is the distance between the two spectra. If our method obtains a better result in the comparison, we mark it "+"; otherwise, it is marked "=" (not bad) or "−" (worse). Table 4 shows the CATH codes for the selected macromolecules. In summary, the extended entropy determines a better conformational graph from protein structure remodeling.

Table 3 A comparison of protein structure remodelings.

Table 4 CATH codes for the selected macromolecules.

Program and environment

The procedure for computing the extended entropy of a P-graph was implemented and tested with the MHC PDB dataset. The experiments ran on a 2 GHz PC with 512 MB of main memory under Linux 2.6.11-1.1369. The implementation was written using Bash 3.00.16(1) and Octave 3.0.0.

Conclusion

In this paper, we proposed a measurement to determine graph stability for protein structure remodeling based on graph entropy. Our modified entropy validation shows a positive result for protein structural comparison. This graph-based approach offers a practical concept to support protein structural alignment.

References

1. Bunke H: Graph Matching: Theoretical Foundations, Algorithms, and Applications. Proc Vision Interface 2000. 2000, 21:
2. Gilbert D, Westhead DR, Nagano N, Thornton JM: Motif-based searching in TOPS protein topology databases. Bioinformatics. 1999, 15 (4): 317-326. 10.1093/bioinformatics/15.4.317.
3. Ehrig H, Engels G, Kreowski H: Handbook of Graph Grammars and Computing by Graph Transformation: Applications, Languages and Tools. 1997, World Scientific Publishing Company
4. Vishveshwara S, Brinda K, Kannan N: Protein Structure: Insights from Graph Theory. Journal of the Comp Chem. 2002, 1: 187-211. 10.1142/S0219633602000117.
5. Lund O, Hansen J, Brunak S, Bohr J: Relationship between Protein Structure and Geometrical Constraints. Protein Science: a Publication of the Protein Society. 1996, 5 (11): 2217-2225. 10.1002/pro.5560051108.
6. Nelson DL, Cox MM: Lehninger Principles of Biochemistry. 2004, Freeman, 4
7. Shannon C: Prediction and Entropy of Printed English. Bell Systems Technical Journal. 1951, 30: 50-64. 10.1002/j.1538-7305.1951.tb01366.x.
8. Chang R: Physical Chemistry for the Biosciences. 2005, University Science
9. Simonyi G: Graph Entropy: a Survey. Combinatorial Optimization. 1995, 20: 399-441.
10. Dehmer M, Emmert-Streib F: Structural Information Content of Networks: Graph Entropy based on Local Vertex Functionals. Computational Biology and Chemistry. 2008, 32 (2): 131-138. 10.1016/j.compbiolchem.2007.09.007.
11. Zager LA, Verghese GC: Graph similarity scoring and matching. Applied Mathematics Letters. 2008, 21: 86-94. 10.1016/j.aml.2007.01.006.
12. Rand WM: Objective Criteria for the Evaluation of Clustering Methods. J Amer Statistical Assoc. 1971, 66 (336): 846-850. 10.1080/01621459.1971.10482356.
13. Brouwer AE, Haemers WH: The Gewirtz graph: an exercise in the theory of graph spectra. Eur J Comb. 1993, 14: 397-407. 10.1006/eujc.1993.1044.
14. Cvetković DM, Doob M, Sachs H: Spectra of Graphs: Theory and Applications, 3rd Revised and Enlarged Edition. 1998, Vch Verlagsgesellschaft Mbh
15. Hsu CH, Peng SL, Tsay YW: An Improved Algorithm for Protein Structural Comparison based on Graph Theoretical Approach. Chiang Mai Journal of Science. 2011, 38 (2): 71-81.
16. Goodsell DS: The Machinery of Life. 2009, Springer, 2nd edition
17. Pamer E, Cresswell P: Mechanisms of MHC class I-restricted antigen processing. Annual Review of Immunology. 1998, 16 (10): 323-358.
18. Berman HM, Westbrook J, Feng Z: The Protein Data Bank. Nucl Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
19. Peng SL, Tsay YW: On the Usage of Graph Spectra in Protein Structural Similarity. Journal of Computers. 2012, 23: 95-102.
20. Huan J, Bandyopadhyay D, Wang W, Snoeyink J, Prins J, Tropsha A: Comparing Graph Representations of Protein Structure for Mining Family-specific Residue-based Packing Motifs. J Computational Biology. 2005, 12 (6): 657-671. 10.1089/cmb.2005.12.657.
21. Peng SL, Tsay YW: Measuring Protein Structural Similarity by Maximum Common Edge Subgraphs. Advanced Intelligent Computing Theories and Applications. 2010, 100-107. LNCS, 6216
22. Canutescu A, Shelenkov A, Dunbrack R: A Graph-theory Algorithm for Rapid Protein Side-chain Prediction. Protein Science. 2003, 12 (9): 2001-2014. 10.1110/ps.03154503.
23. Samudrala R, Moult J: A Graph-theoretic Algorithm for Comparative Modeling of Protein Structure. J Mol Biology. 1998, 279: 279-287.
24. Borgwardt K, Ong C, Schönauer S, Vishwanathan SN, Smola A, Kriegel HP: Protein Function Prediction via Graph Kernels. Bioinformatics. 2005, 21 (suppl 1): i47-i56. 10.1093/bioinformatics/bti1007.

Acknowledgements

This work is supported in part by the National Science Council, Taiwan, under the Grant No. NSC 101-2221-E-259-004.

Declarations

Publication charges for this article have been funded by the authors.

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 15, 2014: Proceedings of the 2013 International Conference on Intelligent Computing (ICIC 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S15.

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien, 974, Taiwan
Sheng-Lung Peng & Yu-Wei Tsay

Authors

Sheng-Lung Peng
Yu-Wei Tsay

Corresponding author

Correspondence to Sheng-Lung Peng.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Peng, SL., Tsay, YW. Adjusting protein graphs based on graph entropy. BMC Bioinformatics 15 (Suppl 15), S6 (2014). https://doi.org/10.1186/1471-2105-15-S15-S6
