BMC Bioinformatics BioMed Central Methodology article H2r: Identification of evolutionary important residues by means of

Background A multiple sequence alignment (MSA) generated for a protein can be used to characterise residues by means of a statistical analysis of single columns. In addition to the examination of individual positions, the investigation of co-variation of amino acid frequencies offers insights into function and evolution of the protein and residues. Results We introduce conn(k), a novel parameter for the characterisation of individual residues. For each residue k, conn(k) is the number of most extreme signals of co-evolution. These signals were deduced from a normalised mutual information (MI) value U(k, l) computed for all pairs of residues k, l. We demonstrate that conn(k) is a more robust indicator than an individual MI-value for the prediction of residues most plausibly important for the evolution of a protein. This proposition was inferred by means of statistical methods. It was further confirmed by the analysis of several proteins. A server, which computes conn(k)-values is available at . Conclusion The algorithms H2r, which analyses MSAs and computes conn(k)-values, characterises a specific class of residues. In contrast to strictly conserved ones, these residues possess some flexibility in the composition of side chains. However, their allocation is sensibly balanced with several other positions, as indicated by conn(k).

In Additional Figure 2, elements of five groups of columns were compared on their U(k, l)values. The groups subsumed columns with frac values of 0.2, 0.4, 0.6, 0.8, and 1.0, respectively. In each group, blocks 2 to 5 were represented, which differ in the number of symbols constituting the frac part. As H2r excludes strictly conserved residues from further analysis, entries of block 1 were missing in group 5. As expected, the U(k, l)-values increased with frac. This increase did not markedly depend on the number of symbols constituting the correlated pairs. If two or more symbol-pairs comprised the same amount of conservation, the U(k, l)-values were somewhat larger that those of the one-symbol case.
Please note that this score does not favour a certain degree of intermediate conservedness as other algorithms do; see [2].
Additional Figure 3 is a plot of the U(17, l)-values. In column 17, two symbols constitute a frequency of 40%, each; the remaining 20% of symbols were chosen randomly. The bars show that those columns having a composition, which were in register with column 17, gained high U(k, l)-scores, irrespective of the number of symbols filling the columns.
In Additional Figure 4, all U(k, l)-values originating from MSA_1 were plotted colour-coded.
Columns 9 and 10, which represent strictly conserved residues (compare Additional Figure   1), were excluded from further analysis. As already seen, the U(k, l)-values did not markedly depend on the number of amino acids that contribute to a certain level of conservation. All values confirmed the stability of the U(k, l) approach. MSA_1 does model some special cases of correlated amino acid frequencies. The above findings confirmed that U(k, l) quantifies the coupling of residues for these situations according to expectation. Due to the number of cases that had to be considered, an exhaustive analysis of more complicated combinations was not feasible. Therefore, it is more reasonable to study predictions in the context of protein structure and function, as done here.

High scoring residue pairs form clusters
It is known for some algorithms of correlation analysis that residues predicted to interact are far apart in the protein structure. Additionally, it has been deduced that coupled residues can be classified into two types according to their cluster membership. There exist isolated pairs k, l and larger groups of residues which form clusters [3]. For the optimisation of parameters (see below), we focussed on these clusters.
We complemented H2r with a module to read pdb-entries, to determine 3D-distances, and to link a protein structure to the columns of an MSA. Thus, we were able to study predictions with respect to their spatial distance. Please note that we define the distance dist min (k, l) of two residues k, l as the minimal space to be measured between van der Waals spheres of any pair of atoms belonging to k or l; see Methods.
Dekker et al. have used the MSA constituting PFAM family PF01053 (Cys_Met_Meta) and the corresponding crystal structure of cystathionine γ synthase (pdb-code 1QGN) to benchmark algorithms that assess perturbations [4]. In order to compare the performance of H2r with these results, we analysed this dataset, too. Additional Table 1 lists those 20 residue-pairs having the highest U(k, l)-values. The distance distribution of these residue pairs is skewed. A KS-test [5] confirmed that their 3D-distances deviated statistically significantly from the distance distribution deduced from all residue pairs (p << 0.001).
Interestingly, among the first ten entries the residues of three pairs are more than 10Å apart in 3D-space. Most striking is the pair (86, 388). The distance dist min of these two residues is more than 28Å. Without any doubt, no atomic force field ranges that wide. Therefore, two alternatives most plausibly explain this finding: Either it is an artefact (false positive prediction), or these two residues are part of a network interlinking several residues.
Ranganathan and co-workers have postulated the existence of connected pathways among residues and have done mutational studies in the PDZ domain which confirmed their predictions [6]. This finding made the second explanation more plausible.
We used a simple neighbour joining algorithm to deduce clusters from HSRPs. Starting from a sorted list (compare Additional Table 1), a first cluster was instantiated consisting of those two residues k and l possessing the highest U(k, l)-value. Then, in descending order, sets of residue pairs m, n were allocated according to their U(m, n)-values. If m or n was an element of an existing cluster, the sets were merged; otherwise, an additional cluster {m, n} was created. Following the arguments of [4], 75 HSRPs were used to create clusters. This choice of a cut-off is somewhat arbitrary but uncritical; see [4]. Finally, a set of clusters CL_MSA resulted for each MSA.

A minimal spanning tree allows assessing clusters of coupled residue pairs
A generally accepted method for the characterisation of networks and clusters is a minimal spanning tree [7]. In our case, the sets CL_MSA had to be processed. During the construction of spanning trees, strictly conserved residues were accepted as additional nodes, if such a node interlinked two HSRPs of CL_MSA. The rationale for this approach was that conserved residues cannot be analysed by our coupling analysis but could be elements of an interaction network.
Additional Figure 5 is a plot of the resulting three trees for Cys_Met_Meta. Please note that these trees contain all residues occurring at least once as an element of the 75 HSRPs. It is highly unlikely that all HSRP residues can be allocated to a small number of clusters (trees) if the selection were random. With high probability, several links should be missing among this extremely small fraction of all possible residue-pairs. In summary, more that 74 000 residue-pairs constitute the complete dataset, and only a fraction of 75 / 74 000 ≈ 1·10 -3 was used to construct spanning trees. Most surprisingly, from the 32 residues contributing to these 75 highest scores, 27 were elements of one tree. Two smaller trees with two (residues 155 and 197) or three nodes (residues 164, 403, 404), respectively, were generated. The direct neighbours of residues 164 and 403 were elements of the dominating tree indicating that just one link was missing for merging these trees. Cys_Met_Meta possesses 35 strictly conserved residues. Only five were involved in the network establishing the dominating spanning tree.
Most strikingly, the distances dist min of the residues constituting the spanning tree were much shorter than those listed in Additional Table 1: Only one distance was larger than 10Å and the mean distance was 2.2Å. This finding and the fact that 27 out of 35 residues were elements of one tree strongly support the hypothesis that coupled residues identified by H2r are elements of a tightly interconnected network. In order to confirm the validity of this statement, we randomly allocated U(k, l)-values to residue pairs and generated spanning trees in 500 (H2r_train) or 1000 (PF01053) individual experiments. We determined the mean distance for each spanning tree, both with and without integrating strictly conserved residues.
In all experiments, all clusters had a mean distance > 6Å. This finding and the performance tests introduced below further supported the network hypothesis. For the moment, we can state that the minimum spanning tree and its characteristic parameters are proper concepts for the evaluation of residue networks and the mode of their generation.

Selecting the input for H2r
Considering the selection of sequences constituting MSAs for correlation analysis, two quite different approaches have been utilised so far. Frequently, sequences have been accumulated by using PSI-BLAST and species-specific hits exceeding a certain E-value were used for MSA generation [8], [3]. When developing PSI-BLAST, Altschul and co-workers have argued that a large set of closely related sequences carries little more information than a single member but may induce the outvoting of a small number of divergent sequences [9]. Therefore, a sequences weighting scheme has been implemented for PSI-BLAST. For proteins sharing less than 20% identical residues, it is unclear whether they possess an identical fold [10]. This is why we eliminated unrelated sequences. Considering highly similar sequences, we followed the second of the above lines of arguments. H2r uses two parameters ident min and ident max , which define the minimal and the maximal sequence identity values. All sequences were compared pairwise and only sequences, which possessed at least ident min and at most ident max residues, were accepted for further analysis. This filtering was initiated with the first entry of the MSA. Due to the results of parameter optimisation (see below), the default for ident min is 20% and for ident max it is 90%.
In the case of H2r, strictly conserved positions must be filtered out. For each column k, H2r determines the max ( ) that at least 10% of the samples vary for each residue. A high fraction of gaps occurring in a column is an indicator for a possibly low alignment quality. Therefore, columns with more than 25% gaps were masked and not processed further.

A test bed for parameter optimisation
As already pointed out, the construction of a high quality MSA is a crucial prerequisite for the analysis of correlated mutations. In order to eliminate the overrepresentation of closely related sequences and to avoid the inclusion of unrelated sequences, all sequences S i , S j are usually compared pairwise to determine the number of identical residues ident(S i , S j ). If ident(S i , S j ) exceeds the cut-off ident max or falls below ident min , S i , or S j is eliminated. Instead of fixing these parameters heuristically, we wanted to train them. Two more parameters had to be optimised in the case of H2r: 1) frequ max , which is the maximal frequency for the occurrence of an amino acid in any column. H2r does not analyse residues where any amino acid occurs with f(a i ) > frequ max . 2) λ, the weight factor for adding pseudo counts (see Methods).
In addition, we wanted to test whether the choice of the programme used for the generation of MSAs influences the results. In order to compose a training set, we randomly selected entries of a database containing the names of structures (pdb-files) resolved with high quality (see compilations on web site of [11]). We accepted the first 20 hits that belonged to both a PFAM-MSA and an InterPro family with at least 90 entries. We selected a low cut-off value in order to generate datasets allowing to study the impact of pseudo counts on the quality of the output. One PFAM entry had 91 sequences, a second 164, and the remaining ones more than 200. The mean number of sequences was for the PFAM datasets 429 and for the InterPro entries 391. As an alternative to PFAMs, we generated MSAs by using the respective InterPro family as input for MAFFT [12], which we utilised with the refinement method L-INS-i. We named this dataset, whose elements are listed in Additional Table 2,

H2r_train.
In order to optimise the above parameters, the mean connectivity and the compactness of spanning trees were analysed. We defined compactness as Additional Table 3 lists the results. For the set of PFAM MSAs, ident min = 20%, ident max = 90%, and frequ max = 90% gave the best results, both without and after adding pseudo counts (λ = 1.0). For H2r_train, we were not able to outperform the PFAM results by creating individual MSAs. Interestingly, adding pseudo counts increased the mean connectivity to a certain degree in both cases.
To study in more detail how the mode of MSA generation influences the results of H2r, we focussed on PF01053, which has been analysed previously [4]. We created MSAs by applying several methods to the sequences of IPR000277, which is the related InterPro family [13]. In addition to the above output parameters, we analysed the mean distance between residues of the spanning tree. The results are listed in Additional Table 4. In summary, the mean distances dist min determined for the spanning trees varied between 1.64Å and 2.96Å. This finding corroborates the above finding that HSRPs are tightly linked and form connected networks. This notion is further supported by a mean connectivity of approximately 2.
In order to evaluate the power of an algorithm for predicting physically close residues, the 3D-distances of 75 non-trivial high scoring pairs have been analysed [4]. Non-trivial residue pairs are those, which are more than 8 residues apart in the primary sequence. These distances were compared to the median distance deduced from all residue pairs of the dataset. The quotient of the fractions of HSRPs lying below or above the median has been proposed as a quality criterion [4]. For PF01053, Ranganathan's approach reached 56 / 19, whereas the ELSC algorithm gained 62 / 13; see [4]. Additional Table 4  MSAs consisting of more than 150 sequences; Additional Table 5 lists relevant parameters.
We named this set of MSAs H2r_filt_100. For every MSA(i) 0 H2r_filt_100, we created 25 MSAs consisting each of a randomly selected fraction of 75%, and 25 MSAs consisting each of a randomly selected fraction of 60% of the sequences belonging to MSA(i). We named these samples H2r_filt_75 and H2r_filt_60. For each of these 500 MSAs, we determined conn(k)-values and compared the outcome to that of H2r_filt_100. We named a residue k possessing a conn(k)-value above the MSA-specific threshold a distinguished residue. We used Dist_100 to name the set of residues being distinguished in H2r_filt_100; accordingly, we used the names Dist_75 and Dist_60. Dist_75 consists of those residues being distinguished in at least one of the 250 MSAs 0 H2r_filt_75.
For H2r_filt_60 these were 20 out of 25 datasets, respectively.
In addition, we characterised the set of residues being not an element of Dist_100 but belonging to Dist_75 or Dist_60, respectively. For the 75% samples, these were 12 residues; for the 60% samples, these were 33. As some residues were distinguished in several of the 25 analyses, 49 times a conn(k)-value above the threshold occurred for the 75% samples. The H2r_filt_75. For the 60% samples, these were 29 cases. Additional Table 5 list the occurrence of these additionally distinguished residues. For the 60% samples, 18 of these distinguished residues occur in PFAM00068 and PF00557. The corresponding MSAs of the 60% samples contained 137 or 113 sequences, respectively. As explained above, approximately 125 sequences are a lower limit for the determination of U(k, l) values [1].
Therefore, the small sample size most plausibly explains the higher variation of distinguished residues in H2r_filt_60. When assessing the number of additional residues possessing conn(k)-values of 4 or 5 in the 75% and 60% datasets one has to consider that 20 residues of H2r_filt_100 had a conn(k)-value of 3, which was too low for a "distinguishment" in H2r_filt_100. Thus, relatively small fluctuations in the correlation signals made these residues to distinguished ones in H2r_filt_75 or H2r_filt_60.
In order to study these fluctuations in more detail, we HSRPs are approximately 1l of all residue pairs. Additional Table 6  HSRPs as well as 1‰, 2.5‰, and 5‰ of sequence length. Additional Table 5 lists the results. In summary, this setting did not markedly influence the identification of those residues having highest conn(k)-values. The comparison of the HSRPs = 75 and the 2.5‰ results showed that the outcome overlapped to a great extent. For short sequences, a larger percentage of residues has to be chosen. There was no obvious trend in the dependency of the rank of conn(k)-values and the number of HSRPs used for computation. Therefore, we propose as a starting point HSRPs = 75. However, the web server [15] accepts different values.

Mode of MSA generation is uncritical
In order to characterise conn(k)-values with respect to the mode of MSA generation, we created additional alignments. Several input filters and methods of MSA generation were applied to the sequences of PF01053 and IPR000277, which is the corresponding InterPro family [13]. As MSAs used for coupling analysis contain large numbers of sequences, we focussed on MAFFT [12], which combines accuracy and speed [16]. Altogether, 11 different MSAs were generated; see Additional Table 4. For each element of a HSRP, we recorded the minimal and maximal connectivity value observed in any experiment. Additional Table 7 summarises the results. 43 residues were an element of at least one HSRP. For 12 residues, their occurrence showed a skewness that might be caused by a specific dataset or a method of MSA generation. 6 of these residues occurred in PFAM specific datasets, 6 in those MSAs originating from InterPro sequences by using MAFFT. The two refinement modes of MAFFT (FFTNS or L-INS-i), which we tested did not contribute a suspiciously high number of extra residues. Importantly, none of these 12 extra residues gained in any of the experiments a conn(k)-value > 3. The set of HSRPs originating from a realignment of PF01053 generated by means of MAFFT did not notably deviate from the above set of HSRPs (data not shown).

The output of algorithms for correlation analysis differs significantly
In H2r_filt_100, each PFAM is associated with a protein-structure being resolved with high quality. In order to characterise a larger set of proteins, conn(k)-values were compared with the outcome of CorrMut [17] and P2PConPred [18]. For both programmes, the protein structure representing the PFAM (see Additional Table 8 For each of the 10 MSAs of H2r_filt_100, 25 MSAs were generated by randomly choosing 75% or 60% of the sequences. These two sets of MSAs were named H2r_filt_75 or H2r_filt_60. Dist_100 are those residues having assigned a conn(k)-value above the cut-off in H2r_filt_100. NOT Dist_100 are those residues having no conn(k)-value above the cut-off in H2r_filt_100 but were distinguished in at least one of the 250 MSAs of H2r_filt_75 (or H2r_filt_60). Dist_100, NOT Dist_75 are those residues having a conn(k)-value above the cut-off in H2r_filt_100 but not in H2r_filt_75. All other abbreviations were used accordingly. Mean conn(k)-values resulted from the 25 MSAs constituting H2r_filt_75 or H2r_filt_60. Bootstrap-values give for each residue the fraction of conn(k)-values above the cut-off in all of the 25 samples. Panel A lists the results for comparing the outcome of H2r_filt_100 and H2r_filt_75. Panel B ditto for H2r_filt_60. In Panel C, mean conn(k)and bootstrap-values were plotted for those residues being distinguished in one dataset but not in the other one.

Additional Tables
Additional Table 1 -20 high scoring residue pairs of PF01053 as identified by k a k f max (a k ) Pos l a l U(k, l)  The first column lists the rank deduced from the U(k, l)-values. Pos k and l give the number of the residues; a k and a l are the amino acids occurring at these positions in pdb-entry 1QGN. f max (a k ) is the maximal frequency of any amino acid occurring in column k. U(k, l) is the score calculated according to formula (3). "dist min " gives the distance dist min of the residues in Angström; for calculation see Methods. These values were computed without adding pseudo counts.

# Pos
Additional The dataset comprises 20 combinations of 3D-structures resolved with high quality and the related PFAM families and InterPro datasets. For structures, their pdb-code is given.
-21 -Additional Table 3   All elements of the datasets H2r_train given in Additional Table 2   . Name indicates the origin of the sequences, ident min and ident max are the cut-off values for filtering, frequ max is the maximal frequency of any amino acid, mode of generation gives the MAFFT options, LF indicates that the length of the sequences was used for filtering, too. The column "non-trivial pairs" gives the distribution of 75 non-trivial, high scoring residue pairs below and above the median distance of all residue pairs. "Mean Connectivity" is the mean connectivity value. Compactness is the ratio |residues| / Σ distances. "Mean Distance" gives the mean of all dist min values deduced from all pairs constituting the largest spanning trees, respectively.   For each of the MSAs 1 -11 given in Additional Table 4, the residues constituting clusters were determined and listed. The assignments that could be identified as being specific for a dataset or a method of MSA generation were labelled with a grey background. Example: The second column has to be read as follows: At residue 80, an amino acid occurred with a maximal frequency of 85%. For dataset 4, this residue was element of cluster 4 with a conn(k)-value of 1, in dataset 8 it belonged to cluster 3 and had a conn(k)-value of 2. The minimal conn(k)-value of this residue was 1, the maximal one was 2. In summary, residue 80 was element of 2 clusters (Sum value), 1 cluster originated from the PFAM datasets, and the other one was generated by using the FFTNS option of MAFFT.