The FIRE algorithm
The FIRE algorithm is implemented in the Python programming language, it finds the optimal alignment of two amino acid sequences using the Bayes Empirical Bayes (BEB) maximum likelihood estimates (MLEs) [11] of the evolutionary rate (ω = dN/dS) at codon sites. FIRE requires two multiple sequence alignments (MSAs) of nucleotide sequences with their corresponding phylogenetic trees to generate a pairwise amino acid alignment. A modified Needleman-Wunsch algorithm [18] determines the global alignment between the sequences based on a dynamic programming (DP) approach. The generation of alignments requires MSA files and their corresponding phylogenetic tree files which are used as input for the CODEML program found in the Phylogenetic Analysis by Maximum Likelihood (PAML) suite of software [19] to produce the BEB MLEs of ω at codon sites. This pre-processing using CODEML provides the rst output files which are consequently used as input for the FIRE algorithm to generate alignments.
Data sets
For the initial evaluation of the algorithm, data sets used in the concept paper were utilised [13]. To evaluate the BLOSUM-FIRE algorithm evolutionary rate profiles of the Pfam (Protein family) [20] database were compiled into the evolutionary rates database (EvoDB) described elsewhere (manuscript accepted in Database).
To demonstrate the utility of the database evolutionary profiles of the enigmatic HBx protein were used. HBx was chosen because functional inference has been elusive. HBx curated nucleotide sequences were obtained from [21]. An alignment of 20 nucleotide sequences of the HBx was generated using the CLUSTAL Omega (ver. 1.2.0) program and phylogenetic trees inferred from the FastTree program (ver. 2.1.7) [22]. MLEs of ω were determined using the CODEML program in PAML (ver. 4.4) suite of software.
To infer the domain functions of HBx the EvoDB database was used. EvoDB (www.bioinf.wits.ac.za/software/fire/evodb) is a database of 98 % of the gapped nucleotide sequence alignments for the PFAM-A database. It provides the evolutionary rate (ω = dN/dS) profiles determined under the M2a model (CODEML algorithm in the PAML suite) for 97 % of the PFAM domains. The clustering of proteins into families in PFAM using domain functions provided a suitable framework for implementing a searchable database for inferring domain functions. The database was compiled for use by BLOSUM-FIRE. Briefly, BEBs [11] ω MLEs at codon sites were calculated using the CODEML program (PAML ver. 4.4) [19] under the M2a Model (NSsites = 2). This parameter assumes one ratio for all the branches and allows for the detection of positive selection at codon sites. The ω MLEs were extracted from the rst CODEML output file and used to compile the ω MLE profiles for each dataset. This rst file contains supplemental results including: Naive Empirical Bayes (NEB) probabilities for the site classes of ω, a list of positively selected sites, log likelihood values and the BEBs. These rst files for PFAM domain analysis using CODEML under the M1a and M2a models are available for download on EvoDB. Individual domains form functional units and are less variable than multi-domain proteins. The evolutionary rates at codon sites across discrete domains provide signature profiles of ω values, which may be used for homology detection [13]. An independent study of RNA viruses based on a novel model of sequence evolution also demonstrated that protein-coding regions were moulded into sites of varying selective pressure [12].
To evaluate the accuracy of the new BLOSUM-FIRE algorithm against widely used algorithms: CLUSTAL Omega, MUSCLE and T-COFFEE using MAFFT as a reference, 20 datasets were used. The datasets consisted of ω profiles of unrelated and related domains using nucleotide sequence data for the PFAM-A domains obtained from EvoDB.
Evaluation of the FIRE algorithm with real data
The limitation of the FIRE algorithm identified in the concept paper of [13] was that the algorithm produced false positive results in certain datasets . Those data sets where the FIRE algorithm produces false positives were identified, analysed and the correlation between the statistical distribution of the ω values and the quality of alignments produced was investigated. To further explore the problem posed by false positives, 100 alignments of evolutionary rate profiles of the PFAM-A database were investigated. Alignments were generated using the FIRE algorithm and the number of residues aligned was counted as a measure of the quality of alignment. Alignments were then generated using the CLUSTAL Omega algorithm (ver. 1.2.0) [23] and MAFFT (ver. 7.130b) algorithm [24] using default parameters. Each alignment was scored using the numbers of aligned residues normalised by maximum sequence length indicated by the identity score. Our identity score is equivalent to the percentage identity score (PID) [9] normalised between 0 and 1 for intuitive comparison to a FIRE algorithm score. To assess the false positive rate we defined a false positive as any alignment with a FIRE score above 0.6 for functionally unrelated domains. This threshold was adopted from the concept study where domains with this score or higher were inferred to share similar domain functions. Each alignment was scored using the numbers of matched residues normalised by the maximum sequence length indicated by the identity score. In such scenarios, structural data could provide a standard of truth for evaluating alignment quality [25]; however, the requirement for MSAs in BLOSUM-FIRE and the limited availability of domain structures made this impossible in this study.
Evaluation of FIRE using simulated data
The FIRE algorithm had reduced specificity in highly conserved data sets resulting in false positives. Simulated data sets were therefore created to investigate false positives further. Highly conserved datasets were created such that the ω values were in the range [0,0.02] across all coding sites. Simulations were carried out using real and truncated datasets generated by a combination of custom scripts and manual editing of input files to insert sites of positive selection. Alignments were generated using FIRE default parameters for gap open and gap extension penalties of 0.5 and 0.1, respectively. The simulated datasets were then aligned using the MAFFT and CLUSTAL Omega algorithms.
Coupling FIRE with a BLOSUM substitution matrix
Evaluation of the false positive results identified in the concept paper [13] revealed that the homogeneity of highly conserved data, where the variance of ω values was low, resulted in reduced specificity. To address this challenge the new algorithm called BLOSUM-FIRE was implemented by incorporating the identity of the amino acids using the BLOSUM62 substitution matrix [26]. Recently, the efficacy of substitution matrices has been questioned with the development of search approaches that do not utilise amino acid substitution matrices, for example, the CS-BLAST tool, is a novel search approach in which context-specific (CS) substitution matrices were incorporated into the BLAST algorithm [27]. At the same time, MIQS (Matrix to Improve Quality in Similarity search), a more robust matrix, has been developed from numerous matrices using principal component analysis for detecting remote homologies, for example, in transmembrane regions [28].
Empirical experiments with simulated data sets provided the framework for coupling FIRE and the BLOSUM62 substitution matrix. We propose a scoring function that scores the aligned amino acids depending on their conservation measured through the ω parameter. To simplify the task of generating alignments using the Needleman-Wunsch DP algorithm, a unified scoring function was required. Therefore, for two amino acids: i and j with evolutionary rate values of ω
i
and ω
j
, the guiding principles for the BLOSUM-FIRE scoring function are:
-
Determine the BLOSUM amino acid score s(i,j), from the BLOSUM matrix using the identity of the aligned residues.
-
Determine the selective pressure on the amino acids using the values of ω
i
and ω
j
.
-
Scale the BLOSUM score using the following principles of selective pressure: ω > 1 is positive selection, ω < 1 is negative selection or purifying selection and ω = 1 indicates neutral selection.
-
Determine the similarity of ω
i
and ω
j
values normalised between 0 and 1 to a score called similarity.
-
Use the selective pressure to scale BLOSUM values to a BLOSUM comparable score called selection.
-
Combine the similarity score and selection score to obtain the BLOSUM-FIRE score.
-
Use a modified Needleman-Wunsch DP algorithm to obtain the global alignment of the amino acid sequences.
To simplify the scoring function, a single value was required to score the similarity of the evolutionary rates (dN/dS) and selection of the selective pressure adjusted BLOSUM score. We formulated an additive scoring scheme such that:
$$ BLOSUM\hbox{-} FIRE= similarity+ selection $$
Therefore, the BLOSUM-FIRE scoring function scores amino acid residues using BLOSUM scores which are subsequently adjusted according to selective pressure and this is added to similarity score of ω values aligned at codon sites.
Approximating selective pressure and scaling the BLOSUM score
The ω value is widely accepted as a proxy for the evolutionary pressure on an amino acid at a codon site. This has been extended to indicate whether the amino acid has a functional role and identifying the coding regions in a genome [29, 30]. An ω value close to zero indicates strong selective pressure, while an ω > 1 means non-synonymous substitutions may offer some fitness advantage to the function of the protein concerned at that site. In addition, amino acid variability at such sites with elevated ω values is expected. On the other hand, ω <1 indicates negative or purifying selection and ω = 1 indicates neutral selection. We propose an approximation equation based on these selective pressure guidelines. This equation was chosen as it is monotonic and, more importantly, it approximates the relationship between ω values and the selective pressure according to theoretical guidelines, for example, detecting sites under positive selection [31]. Let a be the ω value of amino acid
i
and b is the ω value of amino acid
j
, the approximated selective pressure on the two aligned amino is acids such that:
$$ se{l}_{ab}={e}^{-\left(a+b\right)} $$
(1)
where sel
ab
is the approximated selective pressure at the codon site in the range [0,1]. The selective pressure (sel
ab
) provides information that is untapped by conventional alignment algorithms. Such an approximation given by Equation (1) emphasises the quality of alignments based on the level of conservation at the amino acid level. We propose that this selective pressure can be used to scale BLOSUM scores. This selective pressure scaled BLOSUM score called selection is given by the equation:
$$ selection= BLOSUM\times se{l}_{ab} $$
(2)
The premise of the FIRE principle is that amino acid sequences under similar selective pressures share similar functions. Several studies have demonstrated that highly conserved sites can be used as a proxy for functionality, for example, ConSurf [32] and Rate4site [33]; it has also been demonstrated that natural selection imprints genes with evolutionary fingerprints which can be used as gene identifiers [12]. Our concept study [13] demonstrated that at least in some cases those domains that are under similar selective pressures inferred from the evolutionary rates can be responsible for similar functions. While amino acid sequences under similar selective pressures may share similar functions the converse is not true. Let a be the evolutionary rate (ω) of amino acid
i
and b the ω value of amino acid
j
, where i and j are indexes in amino acid sequences, then the similarity sim
ab
of the evolutionary rates can be calculated such that:
$$ si{m}_{ab}=1-\frac{\left|a-b\right|}{ \max \left(a,b\right)}\kern1em \mathrm{if}\kern0.5em a\kern0.5em or\kern0.5em b\ \ge\ 1 $$
(3)
or
$$ si{m}_{ab}=1-\left|a-b\right|\kern1em \mathrm{if}\kern0.5em a\kern0.5em and\kern0.5em b<1 $$
(4)
Either Equation (3) or Equation (4) can be used to calculate the similarity sim
ab
to the absolute difference normalised in the range [0,1]. Therefore, sim
ab
, maximises the absolute normalised differences at codon sites.
Coupling the BLOSUM matrix approach with the FIRE approach
To couple the two approaches an assumption that an amino acid match is analogous to the similarity of low evolutionary rates at a highly conserved site was adopted. Therefore, an identical match score in amino acid residues was proposed to be equivalent to a FIRE score of 1 for a highly conserved site. A proportionality constant (K) was proposed for the FIRE approach score or sim
ab
to be comparable to BLOSUM scores. Let sim
ab
from Equation (3) or (4), be the similarity between the ω values at sites a and b, therefore the BLOSUM comparable score is given by:
$$ F=K\times si{m}_{ab} $$
(5)
where F is a BLOSUM comparable score, K is the proportionality constant obtained from the substitution matrix and sim
ab
is described in Equations (3) and (4). To calculate K, an assumption that an identical amino acid match is analogous to a match at codon sites where the evolutionary rates are close to zero was adopted. The proportionality constant (K) is the mean amino acid identical match scores determined from the BLOSUM62 matrix and it was found that K = 5.64. Using this proportionality framework we adopt the BLOSUM approach and weigh ambiguity characters in the same manner as regular amino acids.
The BLOSUM-FIRE scoring function
The strength of the FIRE approach is its ability to detect shared functions using ω values at low sequence similarity without using the amino acid identity. The BLOSUM approach is effective for aligning sequences using amino acid identity. The FIRE algorithm produced false positives when sequence conservation was high. This is because at high conservation the high similarity of the ω values creates “noise” for the scoring function resulting in poor alignments. Under such conditions the power of aligning sequences based on dN/dS values at amino acid sites becomes weak. Therefore, at high conservation the BLOSUM score is more acceptable than a FIRE score.
To allow the algorithm to be dynamic between the two approaches: BLOSUM to FIRE depending on the selective pressure on the amino acids the equations were combined, resulting in a dynamic scoring scheme such that:
$$ BLOSUM\hbox{-} FIRE= BLOSUM\times se{l}_{ab}+K\times si{m}_{ab}\left(1-se{l}_{ab}\right) $$
(6)
Conversely, the approach recognises that when conservation is low the BLOSUM score may not be strong evidence. It has not left our attention that determining evolutionary rates requires nucleotide MSAs, which can be challenging to obtain. Therefore, to mitigate this challenge the database EvoDB is provided (described above) and BLOSUM-FIRE allows for the comparison of protein sequences with evolutionary rate profiles in a one sided comparison. However, this comparison may not be as robust as a pairwise evolutionary rates comparison. To allow for this one-sided comparison, the algorithm assumes that all the protein residue sites for the protein sequence without ω MLEs are highly conserved such that ω = 0.
A variable gap penalty scheme
The affine gap penalty scheme used in the FIRE algorithm proposed by [34] and [35] was modified for the BLOSUM-FIRE algorithm to make it more variable depending on the selective pressure at codon sites. A variable gap penalty scheme similar to the form described by [36] is proposed. Furthermore, Equation (1) was also adopted for approximating the selective pressure at codon sites to determine the gap penalties. The proposed variable gap penalty uses the same principle for adjusting BLOSUM scores. Consequently, those sites that are highly conserved have relatively high gap penalties given by the equation:
$$ P=g{e}^{-a}+tl{e}^{-x} $$
(7)
where P is the total gap open penalty g, is the initial gap open penalty t, is the initial gap extension penalty l, is the length of the gap, e
− a is the approximated selective pressure at the codon site, a is the ω value where the gap is opened and similarly, e
− x is the approximated selection pressure where x is the ω value for the site where the gap is extended. Therefore, gaps are extended based on the level of conservation at each site where the gap open penalty and its extension thereof varies as a function of the evolutionary rate at each site. According to this penalty scheme, variable sites are more likely to be gap insertion or extension sites than conserved sites.
Determining optimal gap penalties for the new BLOSUM-FIRE algorithm
Theoretical guidelines for the determination of gap penalties are scarce. In this study an empirical approach similar to the investigation by [37] was adopted. The datasets consisted of 50 amino acid sequences with their ω MLEs of varying sequence lengths randomly selected from the Pfam-A seed alignments database (ver. 27.0) [20]. The BLOSUM-FIRE score was used as a proxy for the quality of the alignments, the score is the mean of the sequence identity and the normalised evolutionary rate score (old FIRE score) for that alignment. We aligned the Pfam sequences against each other for iterations of the gap penalties. Iterations were carried out using gap open penalties in the range [0,11] and gap extension penalties in the range [0,6]. Analysis was only carried out on the gap penalties and the BLOSUM-FIRE score for that alignment.
Evaluating the effect of MSA quality on the final alignment
We evaluated the effect of sequence number and MSA accuracy on the final alignment. To evaluate the effect of number of sequences on the final alignment, 10 Pfam families were randomly selected and using between 3 and 20 taxa we investigated how shuffling the order of sequences in the MSA affected the BLOSUM-FIRE score. The domains with the varied number of sequence were aligned with the same domain with 20 sequences. The effect of MSA accuracy on the final alignment was investigated using the heat shock hsp70 (PF00012) domain using nucleotide sequence data corresponding to the manually curated Pfam alignments from EvoDB. Using the MAFFT algorithm and by iteratively increasing the gap open penalty from 0 to 3 in increments of 0.1, we were able to obtain alignments of varying quality and accuracy using the comparison against the alignment generated using default parameters as a performance measure. These alignments were then aligned to the reference alignment to evaluate the effect on the final alignment. Phylogenetic trees for the above experiments were determined using the CLUSTALO program and ω profiles were determined under the M2a model using CODEML.
Resources
The resource requirement of an algorithm is an important consideration as some algorithms may require prohibitive resources or running times. While BLOSUM-FIRE has not been optimised for performance, we assessed the time and resource requirements for generating alignments from the MSAs to the final pairwise alignment. For this analysis, the heat shock hsp70 domain was used; we trimmed the sequence to 900 nucleotides to approximate the average length of a protein. Alignments were then de-aligned and iteratively added from 3 sequences to 30. Resources were measured using the Unix time command. Execution times were measured for the generation of alignments and calculating the phylogenetic tree and determining the ω MLEs. Furthermore, the time taken to generate the final alignment using BLOSUM-FIRE was also measured. Resources were measured on one of the nodes on our cluster (WITS-CORE). This is an Intel (R) Xeon (R) machine with 15 E5630 processors at 2.53GHz running the Scientific Linux operating system. The machine has a cache size of 12M and 23GB RAM. All experiments were carried out on a single core using default algorithm parameters.
Evaluation of BLOSUM-FIRE performance
The conventional approach when evaluating the performance of a sequence alignment algorithm is to make use of a sequence alignment benchmark. However, the reliance on pre-processed data makes it a challenge to evaluate the BLOSUM-FIRE approach. The MAFFT algorithm is well known for its accuracy and speed, for example, [38] and more recently [39] and was therefore chosen as a reference aligner for evaluating the new implementation. Datasets comprised 10 unrelated and 10 related domains obtained from EvoDB. To demonstrate the utility of our approach and EvoDB, we simulated the 10 related datasets by randomly selecting and generating an alignment of 10 taxa from the selected families with total sequence numbers in the range [14,187] and these were then aligned against their full alignments. The quality of alignments was measured using the Sum of Pairs Score (SPS) and Total-Column score (TC) implemented in the bali_score tool provided with the BAliBASE benchmark database [40]. The SPS is the ratio of number aligned pairs in the test alignment to the number in the reference alignment; it evaluates the quality of the alignment produced. The TC score is a binary score of the comparison of the test and reference alignment for each column and the SPS was measured with MAFFT as the reference alignment.
The statistical significance of the alignments
To evaluate the significance of the results, a statistical framework was required. The inference of homology requires assessment of the statistical significance of an alignment to identify those real alignments from those due to chance. A p-value framework was implemented to assess the statistical significance of the alignments in EvoDB. The HBx ω MLEs were shuffled and used to query the EvoDB database. The scores of the alignments were tested for statistical significance using a p-value statistical framework. In this framework, the fraction of shuffled alignments scoring higher than the actual alignment was determined and this provided the p-value for that alignment. The functions of those proteins that were statistically significant were assessed by analysing the alignments produced. The BLAST algorithm finds statistically significant regions of local similarity between two sequences. The most statistically significant results found by BLOSUM-FIRE were then assessed using the BLAST algorithm to determine their statistical significance using a local alignment approach.