The AlignHUSH procedure has been developed to detect remote homologs more effectively than the currently existing tools without compromising on the alignment accuracy. The objective of the current work was to develop a generally applicable algorithm without the use of explicit structural information to align two sequence-based HMMs. The sequence conservation information based profiles currently used for remote homology detection are inadequate for detection of very remote homologs since it is known that the sequence conservation is not seen in such cases of remote homology. The structure is still seen to be conserved and hence use of structural information can lead to an improvement in the remote homology detection [18]. The use of structural information may lead to many relationships being incorrectly found as significant matches in a search since there may not be a robust method to distinguish between related and unrelated sequence of secondary structures. One way to distinguish such cases would be to find the packing of the secondary structures, but such information can be derived only from an analysis of the 3-D structure. An approximation to the packing can be derived from the hydrophobicity pattern since it is known that the core of any protein is composed of hydrophobic residues. Such an approach has been shown to improve remote homology detection in sequence-profile alignments recently [27]. Thus, in the AlignHUSH procedure a contribution to the score is derived from explicit hydrophobicity calculations which might enable the detection of remote homologues with finer detail.
Another fact that has been exploited in the development of the AlignHUSH procedure is the observation that conservation of residues occurs over a stretch of residues. Thus, adding the contribution from neighboring columns should enable better differentiation than by using information from a single column. In case of a match between homologous patches, the score per column is enhanced but in unrelated patches, it is expected that the effect will be on average cancel out and the score for such patches will remain largely unchanged. Thus, use of neighbor information is expected to increase the separation between the scores for homologous matches and non-homologous matches which should enhance the difference in scores between the unrelated and related profiles.
In the next few sections, the incorporation of these ideas into the algorithm will be described in detail, followed by assessment based on the SCOP database.
Datasets used
The dataset of HMMs used in the analysis was obtained from the Superfamily database [28] which provides HMMs and PSSMs of SCOP database. SCOP 1.69 [29, 30] has been used throughout the analysis for parametrization. Out of the 10,894 HMMs present in the database corresponding to 2829 SCOP families, one HMM has been selected randomly from each family. This resulted in a total of 2829 families which will henceforth be called the 'full SCOP 1.69 dataset'. The latest SCOP database (version 1.75) profiles from Superfamily database were also used to determine the sensitivity and specificity. The HHSearch compatible HMMs for the Superfamily database (version 1.75) were downloaded from HHSearch website ftp://toolkit.lmb.uni-muenchen.de/HHsearch/databases/. From a total 13921 profiles, extraction of single profile per family as performed for SCOP 1.69 database lead to a dataset of 3464 profiles which were used for comparison of the performance of three methods.
The true relationships in this work have been defined as two profiles corresponding to two SCOP families that belong to the same SCOP Superfamily. Any two profiles belonging to SCOP families that are present in different classes have been deemed as unrelated (false positives in a search). The SCOP families belonging to two different folds in the same class are ignored. In the next section, the procedure used to incorporate additional information to HMMs is described.
Format of HMMs in AlignHUSH procedure
The AlignHUSH procedure uses information which is not present in the HMMs generated by HMMER/SAM packages. Hence, the input HMM was re-encoded to AlignHUSH format and this is explained in detail in the subsequent sections.
Gap penalties
HMMER2.0 uses the plan7 architecture of HMMs, which has been well documented (refer to the User Guide of HMMER2.0). The architecture results in three 'states' at each 'node' (a node corresponds roughly to a column of a MSA). The states can be either M (match), I (insert), or D (delete) states. Each of the match/delete states correspond exactly to one column in the MSA from which the HMM is derived. Multiple columns of MSA can lead to a single Insert state. The Match - Match transition probability is close to 1 in most nodes. Where the probability of occurrence of inserts is high, the transition from a previous match state to the current insert state is also high. This leads to a position-specific gap penalty which is one of the reasons why HMMs are more accurate and sensitive than PSSM-based methods which usually have uniform gap penalties. The alignment between profiles is usually thought of as optimal match of two conservation patterns. Hence, the information on the probability of insert at a site is important but, the composition of insert states is not important. Thus, the insert emission line is discarded in the HMMs for use in AlignHUSH.
Scores for emission of amino acids at each match state
The conservation of amino acids at a position in a MSA is not sufficient to discriminate between related and unrelated patterns. Thus, to enhance the information content of a column in the MSA, information on the hydrophobicity and predicted secondary structure scores are added to a column of the MSA. The HMMER suite of programs stores the match emission probabilities as log odds scores. These scores are first converted into probabilities using the equations specified in HMMER UserGuide. The probabilities are modified and stored as ratios of probability of amino acid to square root of background probability of that amino acid (this step makes it easy to calculate sum of log odds ratios later). The explanation for this unusual step will be made clear when alignment procedure is explained.
Since AlignHUSH uses hydrophobicity and secondary structure at each position, these values/features are incorporated as emission probabilities set. The hydrophobicity of each state is calculated as the product of probabilities of amino acids and the Kyte-Doolittle hydropathy values [31] for the respective amino acids. For the generation of secondary structure scores, secondary structures of each of the sequences in the MSA have been predicted using PSIPRED [32] and the frequency of occurrence of each of the secondary structures in a column of the MSA has been calculated. Since each column in the MSA corresponds to either match state or an insert state, the probabilities of secondary structures occurring in a MSA column can be transferred to either a match state or an insert state. The frequencies for each secondary structural state, for every match position is stored and used for calculating the secondary structure based score.
Alignment scheme
The alignment of profiles can either be global or local. Most profile-profile methods use local alignments since it is thought that only highly conserved amino acid motifs are conserved across families in a SCOP superfamily. This view is appropriate when conservation of amino acids is used as a measure of relatedness. But, it is known from structural classification databases that even if sequence patterns are not conserved, structure is conserved in such sequence fragments. Since the objective of the AlignHUSH method is to recognize remote homologs it is imperative that the alignment length cover at least the structurally similar core of the two families in consideration. This can be achieved if local alignments are constructed and parameters are optimized to allow for the local alignment to cover most of the homologous regions. The Viterbi algorithm has been used for local alignments of profile-profile matches, and the equations used closely follows the procedure of HHSearch [18] and are reproduced in this manuscript. The procedure is briefly described in the next few subsections.
The Viterbi procedure breaks down the goal of aligning two profiles into a series of local optimizations. Initially as no alignment is made yet, the score is set to zero. Throughout this section, one of the two profiles is defined as P(1....n), which is a profile P with 'n' match states, and the other profile is represented by Q (1....m) which is the profile Q with 'm' match states. P(i) means the ith column of profile P and P(ia) means the ath amino acid in the ith column in profile P. The notation for state transition probabilities is derived on similar lines. For example, for a state transition from Match state at i-1th position in profile P to insert state at ith position in profile P is given as P(M->M)i-1.
The alignment between two profiles is thought of as a pair-HMM following the suggestions made in HHSearch procedure [18]. According to the HHSearch procedure, among the nine pair-HMM states, five states MM, MI, IM, DG, and GD are sufficient for describing the profile-profile alignment, where M means 'match', D means delete, G means gap and I means insert in a profile. The Viterbi procedure starts by setting up five matrices for each of the pair-HMM states. The total score till Pi and Qj are aligned is given by the variable Sij (one for each of the five matrices) which is maximized in the Viterbi procedure. The equations for aligning the profiles closely follows the procedure suggested in HHSearch [18] and are given in detail as equations 1a - 1e. The difference between the HHSearch procedure and AlignHUSH lies in the calculation of the score for alignment of two match positions, which is described below
In the AlignHUSH procedure, the match between the columns consists of three scores, the hydrophobic score, the conservation score and the secondary structure score. The conservation score is taken from the work of Soding on HHSearch [18], and is given as a log-sum of odds score (equation 2) where ba refers to the background frequency of amino acid 'a'. The conservation patterns are usually observed in short motifs, and not in isolation, and hence in AlignHUSH procedure, the conservation score is taken over a window of a few columns. Thus the conservation score at position (i,j) is the sum of conservation scores over a window. Optimization using sensitivity as a criterion revealed that a window size of 5 gives the best results and this has been used throughout the analysis (indicated in equation 1). The hydrophobic score and the secondary structure score are likewise derived (equations 3 and 4 given below) and the window size for each score is determined separately. In equation 3, Hx and Hy refer to the hydrophobicity of a column at position x in first HMM and position y in second HMM. Similarly, the f(x)s term is the frequency of occurrence of secondary structure 's' at position x in first HMM, and likewise for the second HMM (at position 'y'). The Lss term in the equation 4, is taken from a substitution table of secondary structural elements that was derived during the AlignHUSH development. The substitution table was derived by calculating the frequency of substitution of one secondary structure (Helix, Sheet, Loop), with another in structural alignment of SCOP family sequences.
The three scores combined with their respective weightage given to each term gives the final column match score (equation 5) that is used to calculate the column match score (Wc, Wh and Ws are the weights given to the conservation core, hydrophobic score and the secondary structure score). In addition to the weights given to the column match score parameters, an additional weightage term is used for gap penalties (the log terms in equations 1a to 1e. This has been done to ensure that the use of neighbor information and additional information does not make the Mij term much larger than the gap penalty terms which could lead to a long alignment between unrelated profiles. Traceback pointers are stored at each position to allow for the alignment to be re-generated later. The traceback starts from the element in the MM matrix where the score is highest, which is also the score determined for the alignment between the profiles. In the next section the definitions of sensitivity and statistical assessment of the method is described in more detail.
Statistical Assessment and sensitivity and error rate analysis
Statistical assessment of similarity between proteins is usually expressed as E-values. The E-value as defined by [33, 34] is the number of random alignments (or profiles) which can be expected to give a score better than the score that is observed between two proteins. If the number is very low, then the alignment score is very significant. The E-value parameters for local ungapped alignments are well understood, but heuristic measures are used for the case of gapped local alignments. The assessment of statistical significance in case of HMMs is more complicated and several methods have been proposed based on whether the alignment has been generated using Viterbi algorithm or the full forward algorithm [9, 35]. In order to decrease the bias due to the length of profile or composition of the profile, a per-family assessment of significance is better suited than an estimation of database specific parameters. For assessment of statistical significance, the distribution followed by alignment scores of random profiles is required. An approximation to alignment to random profiles can be obtained by aligning each query HMM to a large randomly selected profile database. The scores obtained using the random database of profiles is used to fit the E-value parameters and these are used to report E-values in searches.
The definition of homologous and non-homologous families is subjective in several cases, with various classification schemes following different criteria for determining homology. Since the HMMs are derived based on a SCOP based classification, homology as proposed in SCOP has been used. Thus, if two HMMs belong to same SCOP superfamily, then they are considered homologous. HMMs belonging to the different SCOP classes were considered clearly non-homologous Two HMMs belonging to the same SCOP fold but different SCOP superfamilies, or two HMMs belonging to different folds in the same class are ignored and not evaluated as either true positives or false positives. The sensitivity is defined as TP/(TP+FN) and error rate as FP/(FP+TP), where TP stands for true positives, FN stands for false negatives and FP stands for false positives.
The comparison of sensitivity has been made with two other methods namely, PRC (version 1.5.6) and HHSearch (version 1.5.0 with SCOP1.69 and 1.5.1 with SCOP 1.75 profiles). Since PRC accepts HMMs in HMMER format, no conversion of HMM format was needed to search in the SCOP database of HMMs provided by Superfamily database. The HMMs (for SCOP1.69 version) downloaded from the Superfamily database were converted into the HHSearch format by using the 'hhmake' program and the predicted secondary structure was added using the HHSearch tool 'addpsipred.pl' in order to use secondary structure scores in HHSearch. For the latest SCOP (1.75 version) database, the profiles corresponding to latest SCOP were downloaded from HHSearch web site in HHSearch compatible format. The calibration of each of the HMMs in the SCOP1.69 and the latest SCOP database was done by searching in the calibration database provided with the respective HHSearch programs. Both PRC and HHSearch were used with an E-value of 10 as default, and in case of HHsearch, an option to print at least 100 hits was used to estimate correctly the number of profiles found as false positives for a query.
Assessment of Alignment Accuracy
The assessment of alignment accuracy especially for remotely related proteins is practically not feasible since the evolutionary history of any protein cannot be determined with high accuracy. Currently, there are no methods available which can provide the evolutionary path through which two proteins might have diverged from an ancestor protein. The alignment is a statistical approximation to this path and its accuracy cannot as such be measured reliably. In the absence of such 'gold standard' alignments, careful approximations must be made to derive a set of reference alignments against which the accuracy can be gauged. At present, the structure based alignments are considered the 'gold standard' especially if manual curation has been made to refine small misaligned regions. There are several such databases, out of which the BaliBASE database [36] was chosen since it provides alignments based on sequence identity cutoffs (as low as < 20%), which enables the assessment of alignment accuracy for remotely related proteins.
The BaliBASE database provides manually curated structure-based alignments which have been pruned to ensure that the secondary structure elements align well, and care is also taken to ensure that functional residues are always aligned. Thus, the alignments present in the BaliBASE database are considered as the 'gold standard' of alignments. The comparison of alignments (from SCOP 1.69 dataset) generated using AlignHUSH, PRC and HHSearch with reference alignments in this database was done and the correctly aligned columns in each case were determined. The procedure followed was to take the RV11, RV12, RV20 and RV30 subsets of the BaliBASE 3.0 database, which give alignment between proteins with less than 30% sequence identity. There were 745 pairs of proteins (number of unique proteins being 317), which have less than 30% sequence identity and which are also in different SCOP families in the same SCOP superfamily. The sequences of the 317 proteins were aligned to its cognate SCOP family using the hmmalign program in the HMMER suite of programs. This step might introduce inaccuracies, since the sequence-profile alignment is not 100% accurate. The profile-profile alignment between two SCOP families is taken from each of the three methods, and a sequence - sequence alignment of remotely related proteins is generated by combining the two sequence-profile alignments with the profile-profile alignment. For each test alignment (generated by AlignHUSH, PRC and HHSearch), the region of overlap with the reference alignment was taken and the proportion of correctly aligned residues was calculated (ratio of correctly aligned columns to total aligned columns present in both reference alignment and the test alignment).