Exploratory analysis of protein translation regulatory networks using hierarchical random graphs

Abstract
Background: Protein translation is a vital cellular process for any living organism. The availability of interaction databases provides an opportunity for researchers to exploit the immense amount of data in silico, for example by studying biological networks. There has been an extensive effort to decipher transcriptional regulatory networks using computational methods. However, research on translation regulatory networks has received little attention in the bioinformatics and computational biology community.
Results: In this paper, we present an exploratory analysis of yeast protein translation regulatory networks using hierarchical random graphs. We derive a protein translation regulatory network from a protein-protein interaction dataset. Using a hierarchical random graph model, we show that the network exhibits a well-organized hierarchical structure. In addition, we apply this technique to predict missing links in the network.
Conclusions: The hierarchical random graph model can be a potentially useful technique for inferring hierarchical structure from network data and predicting missing links in partly known networks. The results from the reconstructed protein translation regulatory networks have potential implications for better understanding mechanisms of translational control from a system’s perspective.


Background
The central dogma of molecular biology states that genetic information is transferred from DNA to mRNA through transcription and from mRNA to protein via translation. In every living organism, translation is a vital cellular process in which the information contained in the mRNA sequence is translated into the corresponding protein by the complex translation machinery.
There are three major steps in protein biosynthesis: initiation, elongation, and termination. Initiation is a series of biochemical reactions leading to the binding of the ribosome to the mRNA and the formation of the initiation complex around the start codon. This process involves various regulatory proteins (the so-called initiation factors). Eukaryotic protein synthesis exploits various mechanisms to initiate translation, including cap-dependent initiation, re-initiation, and internal initiation. The majority of mRNAs in the cell are translated via the cap-dependent pathway. Although debatable, it is widely believed that some cellular mRNAs contain internal ribosome entry sites (IRES) and that a cap-independent, IRES-mediated translation pathway exists [1]. During elongation, codon-specific tRNAs are recruited by the ribosome to grow the polypeptide chain one amino acid at a time while the ribosome moves along the mRNA template (one codon at a time). This process also involves various elongation factors and proceeds in a cyclic manner. In termination, the termination codon is recognized by the ribosome. The newly synthesized peptide chain and eventually the ribosomes themselves are released [2].
Recent years have witnessed breakthroughs in high-throughput technologies that are used to monitor the various components of the transcription and translation machineries. DNA microarrays enable the estimation of the copy number for every mRNA species within a single cell and of the changes in gene expression over time or under different physiological conditions [3]. Two-dimensional gel electrophoresis coupled with tandem mass spectrometry makes it possible to measure simultaneously the levels of thousands of proteins in the cell. These high-throughput technologies and the success of several genome projects are rapidly generating an enormous amount of data about genes and proteins that govern such cellular processes as transcription and translation. Analyzing these data is providing new insights into the regulatory mechanisms in many cellular systems. One of the major goals in the postgenomic era is to elucidate in a holistic manner the mechanisms by which sub-cellular processes at the molecular level are manifested at the phenotypic level under physiological and pathological conditions.
The complexity and the large sizes of the transcription and translation machineries make computational approaches attractive and necessary for understanding the design principles and functional properties of these cellular systems. Transcriptional regulation, used by cells to control gene expression, has been the focus of a variety of computational methods that infer the structure of genetic regulatory networks or study their high-level properties [4]. However, research on translational regulatory networks has received little attention in the bioinformatics and computational biology community, being either underestimated or neglected. This contrast may be partly due to two factors. Firstly, transcriptional control, rather than translational control, has long been regarded by conventional wisdom as the primary control point in gene expression. Secondly, the success of genome projects and the application of high-throughput technologies provide a tremendous amount of data about transcriptional regulation that are readily available for computational analysis. In contrast, data about translational control are probably still so specialized that they are used primarily by biologists.
Proteins, rather than DNAs or mRNAs, are the executors of the genetic program. They provide the structural framework of a cell and perform a variety of cellular functions such as serving as enzymes, hormones, growth factors, receptors, and signalling intermediates. Biological and phenotypic complexity eventually derives from changes in protein concentration and localization, posttranslational modifications, and protein-protein interactions. Expression levels of a protein depend not only on transcription rates but also on such control mechanisms as nuclear export and mRNA localization, transcript stability, translational regulation, and protein degradation.
Results from biological research have demonstrated that translational regulation is one of the major mechanisms regulating gene expression in cell growth, apoptosis, and tumorigenesis [5]. Therefore, study of protein translation networks, especially from computational systems biology approaches, may provide new insights into our understanding of this important cellular process.
Mehra and colleagues [6] develop a genome-wide model for the translation machinery in E. coli that provides mapping between changes in mRNA levels and changes in protein levels in response to environmental or genetic perturbations. They also propose a mathematical and computational framework [7] that can be applied to the analysis of the sensitivity of a translation network to perturbation in the rate constants and in the mRNA levels in the system.
However, to understand how the translation machinery functions from a system's perspective, which may enable us to form new theories and make new predictions, it is imperative that we have a better understanding of the structure and properties of protein translation networks. In pursuing such a goal, we previously reported a global network analysis of Protein Translation Regulatory Networks (PTRN) in yeast [8]. In this paper, we extend our efforts to study one important network feature: hierarchy.
Biological processes are hierarchically organized, evident from interactions between molecules within a cell to relationships among members of an ecological system, and hierarchical structure plays an important role in the dynamics of these processes.
Active research has been done to assess whether a network is actually organized in a hierarchical manner and to identify the different levels in the hierarchy. The majority of the work has been focusing on identifying "global signatures" of a hierarchical network architecture. For example, Ravasz and colleagues [9] studied the hierarchical structure of metabolic networks and reported that the uncovered hierarchical modularity closely overlaps with known metabolic functions in E. coli.
Out of many methods proposed to investigate the hierarchical organization in a network [10][11][12][13][14], an especially appealing one is the hierarchical random graph model introduced by Clauset and colleagues [13,14].
In the following, we define a PTRN that contains proteins involved in translational regulation and controls. We then describe the hierarchical random graph model and the adapted approach we use based on this model to infer the hierarchical structure of the constructed network and further to predict missing links within the network.

Datasets
The yeast protein-protein interaction data were downloaded from the General Repository for Interaction Datasets (GRID) [15]. We selected GRID because it contains arguably the most comprehensive data. The GRID database includes all published large-scale interaction datasets as well as available curated interactions such as those deposited in BIND [16] and MIPS [17]. The yeast dataset we downloaded has 4,948 distinct proteins and 18,817 unique interactions. From this network, we derive the protein translation network, which contains proteins with MIPS functional categories related to protein translation as described next.

Construction of PTRN
We extract proteins that are involved in protein biosynthesis from the MIPS functional category database as shown in Table 1. The extracted proteins belong to the following categories: 12.04 (translation), 12.04.01 (translation initiation), 12.04.02 (translation elongation), 12.04.03 (translation termination), and 12.07 (translational control). In total, there are 133 unique proteins in this dataset. We then build the network from the protein-protein interaction data, including only interactions among the selected proteins and ignoring all other interactions. After excluding isolated proteins (those without any edges connecting to them) and self-looping interactions, the resulting network contains 108 vertices and 342 edges.
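The construction described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name `build_subnetwork` and the toy protein labels are hypothetical, and the input is assumed to be a list of interaction pairs plus a list of selected proteins.

```python
def build_subnetwork(interactions, selected):
    """Return (vertices, edges) of the subnetwork induced by `selected`.

    Self-looping interactions are dropped, duplicate or reversed pairs
    collapse into one undirected edge, and isolated proteins are excluded
    because vertices are collected from the surviving edges only.
    """
    selected = set(selected)
    edges = set()
    for a, b in interactions:
        if a == b:  # self-looping interaction
            continue
        if a in selected and b in selected:
            edges.add(frozenset((a, b)))
    vertices = set().union(*edges) if edges else set()
    return vertices, edges


# Toy data for illustration only (not taken from GRID or MIPS).
interactions = [("A", "B"), ("B", "A"), ("B", "C"),
                ("C", "C"), ("C", "X"), ("D", "D")]
selected = ["A", "B", "C", "D"]
vertices, edges = build_subnetwork(interactions, selected)
```

In this toy example, protein D is selected but has only a self-loop, so it is treated as isolated and removed, and the interaction with the unselected protein X is ignored.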
There are several reasons for such a construction. First of all, our interest in this research is focused on protein translation regulatory networks. Secondly, protein-protein interaction data are notoriously noisy and incomplete. The approach we take allows us not only to study the hierarchy but also to predict missing links despite the noise and incompleteness of the data. At the current stage, it is also computationally more feasible to work with networks of smaller sizes. In addition, we want to examine whether hierarchical structure exists even in such isolated subnetworks.

Hierarchical random graphs
Our approach is based on the hierarchical random graph proposed by Clauset and colleagues [13,14], incorporating work by Sales-Pardo and colleagues [12]. There are two important assumptions in this approach. Firstly, if a network consists of sub-networks that are connected to one another with equal probability, then the network can be represented by splitting off the sub-networks one at a time until only the last one remains. Secondly, there may be more than one hierarchical random graph that best fits the observed network data.
In hierarchical random graphs, the probability of a connection between any two vertices, or between any two sub-networks, is independent of the presence or absence of other connections. In this respect the model is similar to the classical Erdős-Rényi random graph. However, in the hierarchical random graph, the connection probabilities are not uniform but depend on the topological structure of the graph.

1) Graph notation
We intuitively model a protein translation network as an undirected graph, where vertices represent proteins and edges represent interactions between pairs of proteins.
An undirected graph, G = (V, E), is comprised of two sets, vertices V and edges E. An edge e is defined as a pair of vertices (i, j) denoting the direct connection between i and j. The graphs we use in this paper are undirected, unweighted, and simple, meaning that they contain no self-loops or parallel edges.

2) Definition of a hierarchical random graph
Let n be the size of the vertex set, n = |V|. Let D be a dendrogram with n leaves representing the vertices of G. Each internal node r of D is associated with a probability $p_r$, which denotes the probability that an edge exists between two vertices that have r as their lowest common ancestor in D. A hierarchical random graph is thus defined by $(D, \{p_r\})$.
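To make the definition concrete, the following Python sketch draws one graph from a hierarchical random graph. The representation is an assumption made for illustration: an internal node is a tuple `(p, left, right)` holding its probability and two subtrees, and a leaf is any non-tuple label. The names `leaves` and `sample_graph` are hypothetical.

```python
import random
from itertools import product


def leaves(node):
    """Leaf labels under `node`; a leaf is any non-tuple value."""
    if not isinstance(node, tuple):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def sample_graph(node, rng):
    """Draw one edge set from the hierarchical random graph rooted at `node`.

    Every pair of leaves whose lowest common ancestor is `node` is
    connected independently with that node's probability p.
    """
    if not isinstance(node, tuple):
        return set()
    p, left, right = node
    edges = sample_graph(left, rng) | sample_graph(right, rng)
    for u, v in product(leaves(left), leaves(right)):
        if rng.random() < p:
            edges.add(frozenset((u, v)))
    return edges


# Two tightly connected pairs that never connect to each other.
hrg = (0.0, (1.0, "a", "b"), (1.0, "c", "d"))
edges = sample_graph(hrg, random.Random(0))
```

With probabilities of exactly 0 and 1 the draw is deterministic; intermediate values produce a different graph for each random seed.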

3) Inferring the hierarchical structure
We assume that all hierarchical random graphs are a priori equally likely. By Bayes' theorem, the posterior probability that a model $(D, \{p_r\})$ explains the observed data is then proportional to the likelihood L with which the model generates the observed network.
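Let $E_r$ be the number of edges in G whose endpoints have r as their lowest common ancestor in D, and let $L_r$ and $R_r$ be the numbers of leaves in the left and right subtrees rooted at r in D. We have

$$L(D, \{p_r\}) = \prod_{r \in D} p_r^{E_r} (1 - p_r)^{L_r R_r - E_r}.$$

For each internal node r in D, the probability $p_r$ that maximizes L is $\bar{p}_r = E_r / (L_r R_r)$. Thus, the likelihood of D at this maximum is

$$L(D) = \prod_{r \in D} \left[ \bar{p}_r^{\,\bar{p}_r} (1 - \bar{p}_r)^{1 - \bar{p}_r} \right]^{L_r R_r}.$$

Conveniently, instead of using the above equation directly, we use its logarithmic form:

$$\log L(D) = -\sum_{r \in D} L_r R_r \, h(\bar{p}_r),$$

where $h(p) = -p \log p - (1 - p) \log(1 - p)$.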
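As an illustration, the log-likelihood of a dendrogram at the maximizing probabilities, that is, with each internal node r assigned the fraction of possible left-right leaf pairs that are actually connected, can be computed as in the sketch below. The representation is an assumption: a dendrogram is a nested 2-tuple `(left, right)` with non-tuple leaves, and edges are a set of `frozenset` pairs. The function name `log_likelihood` is hypothetical.

```python
import math


def log_likelihood(dendrogram, edges):
    """Return log L(D) with each p_r set to E_r / (L_r * R_r)."""
    total = 0.0

    def visit(node):
        nonlocal total
        if not isinstance(node, tuple):
            return {node}
        left, right = visit(node[0]), visit(node[1])
        # E_r: edges with one endpoint in each subtree of r.
        e_r = sum(1 for u in left for v in right
                  if frozenset((u, v)) in edges)
        pairs = len(left) * len(right)  # L_r * R_r
        p = e_r / pairs
        if 0.0 < p < 1.0:  # nodes with p = 0 or p = 1 contribute zero
            total += e_r * math.log(p) + (pairs - e_r) * math.log(1.0 - p)
        return left | right

    visit(dendrogram)
    return total


edges = {frozenset(("a", "b")), frozenset(("c", "d"))}
good = (("a", "b"), ("c", "d"))  # groups match the edges
bad = (("a", "c"), ("b", "d"))   # groups cut across the edges
ll_good = log_likelihood(good, edges)
ll_bad = log_likelihood(bad, edges)
```

In this toy case the dendrogram whose groups match the edges has a higher log-likelihood than the one that cuts across them, which is the property the inference relies on.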

4) Markov chain Monte Carlo method
Since maximizing $L(D, \{p_r\})$ is an NP-hard problem, the estimation is done with a Markov chain Monte Carlo method that samples dendrograms D with probability proportional to their likelihood.
For networks of relatively small size, the Markov chain converges fairly quickly. Therefore, the method is suitable for our constructed PTRNs.
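The sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: dendrograms are nested 2-tuples with non-tuple leaves, a proposal swaps the sibling of a randomly chosen non-root internal node with one of that node's two subtrees, and proposals are accepted with the standard Metropolis rule. All function names are hypothetical, and the likelihood is recomputed from scratch at every step for clarity rather than speed.

```python
import math
import random


def leaves(node):
    if not isinstance(node, tuple):
        return {node}
    return leaves(node[0]) | leaves(node[1])


def log_likelihood(node, edges):
    """log L(D) with each p_r set to E_r / (L_r * R_r)."""
    if not isinstance(node, tuple):
        return 0.0
    left, right = leaves(node[0]), leaves(node[1])
    e_r = sum(1 for u in left for v in right if frozenset((u, v)) in edges)
    pairs = len(left) * len(right)
    p = e_r / pairs
    here = 0.0
    if 0.0 < p < 1.0:
        here = e_r * math.log(p) + (pairs - e_r) * math.log(1.0 - p)
    return here + log_likelihood(node[0], edges) + log_likelihood(node[1], edges)


def movable_paths(node, path=()):
    """Paths (tuples of 0/1) to every internal node other than the root."""
    found = []
    if isinstance(node, tuple):
        for i in (0, 1):
            if isinstance(node[i], tuple):
                found.append(path + (i,))
                found += movable_paths(node[i], path + (i,))
    return found


def rearrange(node, path, choice):
    """Swap the sibling of the node at `path` with one of its two subtrees."""
    i = path[0]
    if len(path) > 1:
        child = rearrange(node[i], path[1:], choice)
        return (child, node[1]) if i == 0 else (node[0], child)
    (a, b), sibling = node[i], node[1 - i]
    return ((a, sibling), b) if choice == 0 else ((b, sibling), a)


def mcmc(dendrogram, edges, steps, rng):
    """Metropolis sampling; returns the best dendrogram seen and its log L."""
    current, current_ll = dendrogram, log_likelihood(dendrogram, edges)
    best, best_ll = current, current_ll
    for _ in range(steps):
        path = rng.choice(movable_paths(current))
        proposal = rearrange(current, path, rng.randrange(2))
        proposal_ll = log_likelihood(proposal, edges)
        # Accept uphill moves always, downhill moves with probability L'/L.
        if math.log(1.0 - rng.random()) < proposal_ll - current_ll:
            current, current_ll = proposal, proposal_ll
            if current_ll > best_ll:
                best, best_ll = current, current_ll
    return best, best_ll


edges = {frozenset(("a", "b")), frozenset(("c", "d"))}
start = (("a", "c"), ("b", "d"))
best, best_ll = mcmc(start, edges, 200, random.Random(0))
```

The sketch requires at least three leaves so that a non-root internal node exists. In practice one would also record the dendrograms visited after the chain has equilibrated, since the sampled set, rather than the single best dendrogram, is what the consensus and link-prediction steps use.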

Results
Fitting the hierarchical random graph to data
We construct our protein translation network using protein-protein interactions among the extracted proteins and then fit the hierarchical random graph model to the constructed network. Fig. 1 shows an example of a maximum likelihood dendrogram with log L = -539. The dendrogram clearly divides the majority of proteins into groups consistent with their MIPS functional categories. Fig. 2 shows an example of a consensus dendrogram constructed from the sampled hierarchical random graphs. A consensus dendrogram is a summary of a set of dendrograms that fit the observed data. We may expect it to capture the topological features that are consistent across the majority of the dendrograms and to characterize the structure of the network better than any individual dendrogram.
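One common way to build such a summary is a majority rule: keep only the groups of leaves that appear under some internal node in more than half of the sampled dendrograms. The sketch below assumes this rule, which the text does not specify, and uses nested 2-tuples for dendrograms. The names `clusters` and `majority_clusters` are hypothetical.

```python
from collections import Counter


def clusters(node, out):
    """Append the leaf set under every internal node to `out`."""
    if not isinstance(node, tuple):
        return frozenset([node])
    members = clusters(node[0], out) | clusters(node[1], out)
    out.append(members)
    return members


def majority_clusters(samples):
    """Leaf sets that occur in more than half of the sampled dendrograms."""
    counts = Counter()
    for dendrogram in samples:
        found = []
        clusters(dendrogram, found)
        counts.update(set(found))  # count each cluster once per sample
    return {c for c, k in counts.items() if k > len(samples) / 2}


# Three toy dendrograms over the same four leaves.
samples = [
    (("a", "b"), ("c", "d")),
    ((("a", "b"), "c"), "d"),
    (("a", "b"), ("d", "c")),
]
kept = majority_clusters(samples)
```

Here the group {a, b, c} occurs in only one of the three samples and is discarded, while {a, b} and {c, d} survive. The retained groups are nested or disjoint, so they can be drawn as a (possibly non-binary) tree.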

Prediction of missing links
The most interesting and possibly the most useful application of hierarchical random graphs is the prediction of missing interactions in networks for which the available information is incomplete, as is the case for protein-protein interaction data and, in particular, for the protein translation regulatory networks studied here. Table 2 lists the 15 candidate missing links with the highest probabilities, compiled from 10 runs of the prediction algorithm.
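A minimal sketch of one way to score candidate missing links is given below. It assumes that each unconnected pair is scored by the connection probability of its lowest common ancestor (the fraction of connected left-right leaf pairs at that node), averaged over the sampled dendrograms. Dendrograms are nested 2-tuples, and the function names are hypothetical.

```python
def pair_probabilities(node, edges, out):
    """Set out[pair] to p_r of the pair's lowest common ancestor."""
    if not isinstance(node, tuple):
        return [node]
    left = pair_probabilities(node[0], edges, out)
    right = pair_probabilities(node[1], edges, out)
    cross = [frozenset((u, v)) for u in left for v in right]
    p = sum(1 for pair in cross if pair in edges) / len(cross)
    for pair in cross:
        out[pair] = p
    return left + right


def rank_missing_links(samples, edges):
    """Unconnected pairs sorted by mean connection probability, highest first."""
    totals = {}
    for dendrogram in samples:
        probs = {}
        pair_probabilities(dendrogram, edges, probs)
        for pair, p in probs.items():
            totals[pair] = totals.get(pair, 0.0) + p
    scores = {pair: total / len(samples)
              for pair, total in totals.items() if pair not in edges}
    return sorted(scores.items(), key=lambda item: -item[1])


# Toy network: a is linked to b and c, while b-c is unobserved.
edges = {frozenset(("a", "b")), frozenset(("a", "c"))}
samples = [
    ((("a", "b"), "c"), "d"),
    ((("a", "c"), "b"), "d"),
]
ranked = rank_missing_links(samples, edges)
```

In this toy case the pair b-c sits inside a densely connected group in both sampled dendrograms and therefore ranks first among the four unconnected pairs.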
At the top of the list is the interaction between SUP35 and PAT1. SUP35 is translation termination factor eRF3, involved in the termination of protein translation. PAT1 is a topoisomerase II-associated deadenylation-dependent mRNA-decapping factor. It is required for faithful chromosome transmission, maintenance of rDNA locus stability, and protection of mRNA 3'-UTRs from trimming. There is no interaction between these two proteins in our downloaded datasets. However, this interaction has been reported rather recently [18].
An intriguing finding of the prediction results is that a few proteins have multiple highly probable missing links, such as GCD11, SUI3, SUI2, RLI1, IST1, and HCR1. GCD11 is the gamma subunit of the translation initiation factor eIF2, involved in the identification of the start codon. Its interaction with HCR1 has been reported recently [18]. RLI1 is an essential iron-sulfur protein required for ribosome biogenesis and translation initiation. Its interaction with SUI3 has also been reported [18]. SUI3 is the beta subunit of the translation initiation factor eIF2, involved in the identification of the start codon and possibly in mRNA binding as well. HCR1 is a dual-function protein involved in translation initiation as a substoichiometric component (eIF3j) of translation initiation factor 3 (eIF3) and is required for processing of 20S pre-rRNA. The interaction between SUI3 and HCR1 has also been reported [18].

Discussion
In this paper, we present the exploratory analysis of a protein translation regulatory network using hierarchical random graphs.
We constructed a protein translation network by extracting proteins categorized in the MIPS functional category database [17] and protein-protein interaction data curated in GRID [15]. One important feature of such reconstructed networks is their incompleteness. Our current knowledge about the links may cover only a fraction of all interactions that exist among these proteins in reality. It is thus an enormous challenge to study such partial networks. As shown in Figure 1, by using the hierarchical random graphs, the reconstructed dendrogram divided the majority of proteins into groups corresponding to their MIPS functional categories. Our results demonstrate 1) the existence of hierarchical structure in the constructed protein translation network; and 2) the usefulness of the hierarchical random graph model in exploring the network structure.
Our results also show that the hierarchical random graph can be used to predict missing links in networks. At least four of the top 15 predicted missing links have been reported recently [18]. Experimental biologists can use such a drastically narrowed list to formulate and validate hypotheses. One direction of our future work will be to collaborate with biologists to validate the predicted missing links and eventually help build a much more complete translation regulatory network.
A limitation of the current approach using Markov chain Monte Carlo is its high computational cost. Improving the computational efficiency in the future will allow us to apply this approach to larger networks.

Conclusions
In this paper, we apply a hierarchical random graph model to the analysis of yeast protein translation regulatory networks. We reconstruct protein translation regulatory networks from a protein-protein interaction dataset. Using the hierarchical random graphs, we show that the reconstructed network exhibits a well-organized hierarchical structure. Furthermore, we apply this technique to predict missing links in the network. Therefore, the hierarchical random graph model can be a potentially useful technique for inferring hierarchical network structure and predicting missing links in partly known networks. The results have potential implications for better understanding mechanisms of translational control from a system's perspective.