GPCRtm: An amino acid substitution matrix for the transmembrane region of class A G Protein-Coupled Receptors

Background Protein sequence alignments and database search methods use standard scoring matrices calculated from amino acid substitution frequencies in general sets of proteins. These general-purpose matrices are not optimal to align accurately sequences with marked compositional biases, such as hydrophobic transmembrane regions found in membrane proteins. In this work, an amino acid substitution matrix (GPCRtm) is calculated for the membrane spanning segments of the G protein-coupled receptor (GPCR) rhodopsin family; one of the largest transmembrane protein family in humans with great importance in health and disease. Results The GPCRtm matrix reveals the amino acid compositional bias distinctive of the GPCR rhodopsin family and differs from other standard substitution matrices. These membrane receptors, as expected, are characterized by a high content of hydrophobic residues with regard to globular proteins. On the other hand, the presence of polar and charged residues is higher than in average membrane proteins, displaying high frequencies of replacement within themselves. Conclusions Analysis of amino acid frequencies and values obtained from the GPCRtm matrix reveals patterns of residue replacements different from other standard substitution matrices. GPCRs prioritize the reactivity properties of the amino acids over their bulkiness in the transmembrane regions. A distinctive role is that charged and polar residues seem to evolve at different rates than other amino acids. This observation is related to the role of the transmembrane bundle in the binding of ligands, that in many cases involve electrostatic and hydrogen bond interactions. This new matrix can be useful in database search and for the construction of more accurate sequence alignments of GPCRs. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0639-4) contains supplementary material, which is available to authorized users.

Background G protein-coupled receptors (GPCRs) constitute a large family of integral membrane proteins that mediate numerous signaling pathways through second messenger cascades [1]. These receptors are activated by a vast chemical diversity of ligands, ranging from small molecules to lipids, peptides, or hormones [2] and display a highly conserved molecular architecture characterized by the presence of seven α-helical transmembrane segments (7TM) [3]. GPCRs are classified into six main families or classes (named A to F) based on sequence similarity, with only four of them (A, B, C and F) present in vertebrates [4]. The class A, also known as rhodopsin family [5], is the largest (~847 genes in humans) and exhibit a distinctive feature that most effector molecules bind to a cavity formed by the TM helices. The rhodopsin family is the subject of numerous studies due to their pharmacological relevance, representing the largest family of individual drug targets [6,7].
The importance of the GPCRs in cellular physiology has inspired the development of numerous computational tools and databases for their study over the years [8][9][10][11][12][13][14][15]. The majority of these approaches have required multiple sequence alignments with very low identities (~20 %), in many cases below the twilight region significant for homology detection [16]. One important part of sequence alignment algorithms is the use of substitution matrices to account for the exchange rates of the amino acids within proteins [17]. Amino acid substitution matrices are obtained by the application of statistical methods on sequence alignments of evolutionarily related proteins (generally globular) and in all cases are biased by the composition of the data set used [18]. In this regard, it is known that the evolutionary selective pressure that governs the conservation and relative mutability of amino acids varies among protein families. As a consequence, the application of a standard matrix for the alignment of a determinate protein family could give inaccurate results, particularly if the amino acid composition differs from those used for the matrix construction. Still, only a few standard substitution matrices have been employed for database search and comparison of protein sequences during decades [19][20][21]. Nonetheless specific substitutions matrices for certain families of proteins are continuously developing [22][23][24]. These matrices, in many cases have proven to be more effective than the standard matrices in recognizing evolutionary relationships between the proteins of interest.
In this work, we computed a substitution matrix from a curated alignment of one thousand sequences of the TM regions of the GPCR rhodopsin family. Analysis of amino acid frequencies and values obtained from the matrix reveals patterns of residue replacements different from other standard substitution matrices. Charged and polar residues in particular seem to evolve at different rates than other amino acids. This observation could be related to the extraordinary diversification of the 7TM helical bundle in GPCRs for ligand recognition [25].

GPCR sequences retrieval and alignment
Class A GPCR protein sequences from the four main groups (α, β, δ and γ) and 13 sub-branches [5], including orphans, were obtained from the UniProt database from different biological sources [26]. This dataset was extended with the inclusion of 314 sequences from a curated set of functional human olfactory GPCR repertoire [27]. To avoid poorly aligned positions, UniProt and GPCRdb [14] annotations were used to identify TM segments and to remove the highly divergent intra and extracellular loops and the N-and C-terminal regions of the receptors. Boundaries of the TM helices were defined attending to the available crystal structures of class A GPCRs [28,29]. Sequences corresponding to TMs 1-7 were aligned using the Win32 version of ClustalW 2.1 [30] and the closely related (>90 % identity) were excluded from the analysis. The resulting alignment was manually curated in order to achieve the optimal match between conserved sequence motifs present in the rhodopsin family [31] and small gaps were inserted in the TM2 and 5 according to previous studies [32]. This resulted in a final alignment of 1019 non-redundant TM GPCR sequences (see Additional file 1).

Construction of GPCRtm
The alignment of the TM regions was used to generate a substitution matrix representing changes on GPCR sequences using an implementation of the methodology described by Henikoff et al. [20]. In this regard, the corresponding TM segments (1-7), which consist of multiple alignments of short regions (<40 amino acids), were treated as sequence blocks. As initial step, a transition count (frequency) table was computed to determine the total number of amino acid transitions pairs from each column of the alignment. After the transition count table was completed, observed and expected probability of transition were computed for each pair. The observed probability (O) for the amino acid pair (i,j) is the total number of transitions observed (from the frequency table) divided by the total number of transitions for the entire alignment.
The expected probability (e) of occurrence for each (i,j) pair was calculated from the observed probabilities for the pair.
For a single residue: for an (i,j) pair: Using the expected (e) and observed (O) probabilities of transitions, the substitution values were calculated from the odds ratio matrix, as the logarithm of odds, where each entry is obtained according to: The scaling factor of 2 is taken from Henikoff et al. [20] in order to facilitate comparisons. In the final 20 × 20 amino acid matrix ( Fig. 1), substitutions values where rounded to the nearest integer value. In addition, we  calculate the average mutual information per amino acid pair or relative entropy (H) according to:

Results and Discussion
Amino acid compositional bias in the rhodopsin family of GPCRs The average amino acid composition of the TM regions of the rhodopsin family was compared with amino acid frequencies derived from other studies ( Table 1). As expected, the fraction of hydrophobic residues in the membrane spanning regions of GPCRs is similar to other TM proteins (JTTtm and PHDhtm) and is higher than in general proteins (BLOSUM62, and Swiss-Prot). Leucine is the most common occurring residue followed by valine and isoleucine. Nonetheless, there are differences in the amino acid composition of GPCRs. This is the case for charged and polar residues, with the exception of serine and threonine that behave similar in all datasets. The accumulated percentage for the R, K, H, D, E, N, and Q amino acids in the GPCRtm dataset (19.6 %) is in between JTTtm (9.5 %) and PHDhtm (9.9 %) datasets and BLOSUM62 (32.3 %) and Swiss-Prot (33.8 %) datasets. In addition, TM regions of the rhodopsin family are also characterized for a lower frequency of glycine (4.6 %) and a higher frequency of cysteine (3.6 %) residues relative to the other datasets. Given such differences in amino acid composition, we presume that general protein matrices such as the BLOSUM series and TM-derived protein matrices may not perform accurately in the alignment of the TM regions of GPCRs. A curated alignment of more than one thousand membrane spanning sequences of class A GPCRs from different organisms were used for the generation of an amino acid substitution matrix (Fig. 1). The matrix was built using an approach similar to the one employed for the construction of the BLOSUM series of matrices [20]. Unlike BLOSUM matrices, built from sequence blocks of a variety of biological sources, we employ sequences of only GPCRs that accounts for the compositional bias in this family of receptors. Inspecting the diagonal elements of the matrix in the Fig. 1 we can estimate the mutability potential of each residue. Hydrophobic residues (V, L, I, A, F) display the highest level of relative mutability (corresponding to low values on the matrix, ≤ 2), whereas charged and polar residues are in general less mutable. Polar serine and threonine residues are special cases, displaying similar values than hydrophobic residues. These two amino acids, unlike other polar or charged residues, do not destabilize TM helices, as their hydrogen bonding potential can be satisfied by interacting with the carbonyl oxygen in the preceding turn of the same helix [36]. In contrast, N, D, R, W and P amino acids display the lowest level of relative mutability (corresponding to high values on the matrix, ≥ 7). All these residues display a high conservation pattern in at least one of TM helices of class A GPCRs [31,37]: N in TM 1 (present in 98 % of the sequences), D in TM 2 (93 %), R in TM 3 (95 %), W in TM 4 (96 %) and P in TMs 5 (76 %), 6 (98 %) and 7 (93 %). Significantly, the position of these highly conserved amino acids in each helix is the same in the superimposition of the currently available crystal structures [38]. Positively (K, R, and H) and negatively (D, E) charged residues are easily interchangeable with each other. This could be due to a selection pressure to adapt the binding cavity of the TM bundle to the different chemical features of the ligands that, in many cases, display strong electrostatic properties (discussed below). TM-derived and globular protein matrices with regard to the majority of charged and polar residues, which suggest a distinctive role of these amino acids in GPCRs.
One of the most important aspects of substitution matrices is amino acid grouping based on their chemical properties. These similarities could be easily visualized through the construction of dendograms and multi-dimensional projections to account for the correspondence of amino acids in the matrix (Fig. 3). Clearly, clustering of residues in GPCRtm, JTTtm and BLOSUM62 follow similar patterns, but with significant differences. The cluster of hydrophobic residues (I, V, L, M) is closer to the cluster of small amino acids (A, S, T) in all cases. However, GPCRtm differs from other matrices in that phenylalanine is grouped with hydrophobic amino acids (the I, V, L, M, F cluster), whereas in BLOSUM62 is grouped with the aromatic tyrosine and in JTTtm with cysteine. Similarly, glycine is clustered together with the other small amino acids (A, S, T), in contrast to other matrices in which is grouped alone. Histidine clusters with positively charged and polar amino acids in GPCRtm and JTTtm, in contrast to BLO-SUM62. This residue is grouped with glutamine in GPCRtm and JTTtm, probably due to its hydrogen bond donor/acceptor properties, whereas in BLOSUM62 is grouped with phenylalanine and tyrosine probably due to its aromaticity. GPCRtm clusters tryptophan and tyrosine together, preserving aromaticity and hydrogen bond capacity, whereas in the other matrices tryptophan is unaccompanied. The negatively charged aspartate and glutamate form one group in GPCRtm and JTTtm, while in BLOSUM62 aspartate pairs with asparagine and glutamate with glutamine. In this regard, positive (K, R) and negative (D, E) residues are grouped at closer distance in BLOSUM62. In contrast, positive and negative residues are distant in GPCRtm and JTTtm. Interestingly, the distance between branches containing opposite charged residues in GPCRtm is larger than in JTTtm, suggesting than the sign of the charge is apparently more conserved in the GPCR TM sequences than in a general set of TM proteins.
Overall, the results show that GPCRtm prioritized the reactivity properties of the amino acids over their bulkiness. In this way, hydrophobic residues (including phenylalanine), which are key in TM regions, are  clustered together. On the other side, the hydrogen bond capacity and electronic properties of the amino acids tend to be maintained in GPCR sequences. Thus, the H/ Q, K/R, E/D/N and W/Y pairs together. These residues contribute largely to the diversity of interactions between ligands and the 7TM bundle as can be observed in the 3D structures of ligand-receptor complexes in some members of the rhodopsin family (see Fig. 4). In this respect, GPCRs are distinguished from most TM proteins for their ability to interact with a diverse variety of chemical entities.

Evaluation of the GPCRtm matrix
The GPCRtm matrix was tested on sequence similarity searches and pairwise alignments. The results of GPCRtm were compared with commonly used amino acid exchange matrices, the JTTtm and PHAT transmembrane matrices and the general-purpose BLOSUM45 and BLOSUM62 matrices. At high sequence identity values (above the twilight zone) all matrices behave similarly. However, as sequence identity falls below 40 %, significant differences emerged. Table 2 shows a comparison among the different substitution models in BLASTP database searches for one hundred GPCR queries against the PDB database [39]. As observed in the table, the GPCRtm matrix performs better than other matrices. The second best performance was achieved by the closely related PHAT matrix, followed by the BLOSUM62, BLOSUM45 and JTTtm matrices, respectively. Criteria for the performance evaluation were based on the recognition of the closest homologue with known three-dimensional structure for a determinate query, according to the well-established GPCR classification systems [4,5]. Table 3 illustrates an example for the adrenergic receptor (ADR) subfamily of GPCRs. ADRs interact with the endogenous catecholamines adrenaline and noradrenaline and constitute essential regulators of central and peripheral metabolic functions [40]. These receptors are classified into three main groups: the α 1 -, α 2 -and β-adrenoceptors. Only two members (β 1 -or ADRB1 and β 2 -or ADRB2) have been solved by X-Ray crystallography, constituting the reference structures for the adrenoceptors subfamily [41]. According to the results shown in Table 3, the GPCRtm matrix performs better than general-purpose matrices in BLASTP searches, resolving a receptor of the same subfamily (ADRB1 or ADRB2) as a first hit for searches involved the nine ADR subtypes as queries. On the other hand, in some instances (at lower identities) the standard matrices deliver as best hit a receptor of a different GPCR subfamily.
One of the best ways to test alignment accuracies is to compare the results with structure-based information derived from three-dimensional structural data. In this regard, the GPCRmt matrix was tested on pairwise sequence alignments of class A GPCR whose structures are known. Figure 5 shows the result of the alignment between the adenosine A 2A receptor (AA2AR) and sphingosine-1-phosphate receptor 1 (S1PR1) using different substitution matrices. Both receptors are members of the MECA receptor cluster of the rhodopsin family [5] with known three-dimensional structures [42,43]. In Fig. 5 Example of pairwise alignments of the adenosine AA2AR and sphingosine-1-phosphate S1PR1 amino acid sequences using: GPCRtm (a), JTTtm (b), PHAT (c), BLOSUM62 (d) and BLOSUM45 (e) substitution matrices. Transmembrane regions TM 1 to 7 appear outlined in red according on the crystallographic 3D structural data for each receptor (PDBid: 3EML and 3V2Y). Pairwise sequence alignments were done with MAFFT program [35] this example, the resulting alignments denote the accuracy of the GPCRtm to correctly align the TM helices of both receptors, whereas generalized matrices fails to correctly align some of the TM regions. According to these results, the GPCRtm matrix improve the detection of closest homologues and produce accurate alignments in the TM regions of GPCRs, even at low sequence identities. This is particularly relevant in the development of homology models for structure-based drug discovery, which in many cases are generated from low sequence identity alignments due to the limited number of GPCRs crystallographic structural templates [32].