Physicochemical property consensus sequences for functional analysis, design of multivalent antigens and targeted antivirals
© Schein et al; licensee BioMed Central Ltd. 2012
Published: 24 August 2012
Analysis of large sets of biological sequence data from related strains or organisms is complicated by superficial redundancy in the set, which may contain many members that are identical except at one or two positions. Thus a new method, based on deriving physicochemical property (PCP)-consensus sequences, was tested for its ability to generate reference sequences and distinguish functionally significant changes from background variability.
The PCP consensus program was used to automatically derive consensus sequences starting from sequence alignments of proteins from Flaviviruses (from the Flavitrack database) and human enteroviruses, using a five-dimensional set of eigenvectors that summarize over 200 different scalar values for the PCPs of the amino acids. A PCP-consensus version of a Dengue virus (DENV) envelope protein was produced recombinantly and tested by ELISA for its ability to bind antibodies against DENV strains.
PCP-consensus sequences of the flavivirus family could be used to classify the viruses into five discrete groups and distinguish areas of the envelope proteins that correlate with host specificity and disease type. A multivalent Dengue virus antigen was designed and shown to bind antibodies against all four DENV types. A consensus enteroviral VPg protein had the same distinctive high pKa as wild type proteins and was recognized by two different polymerases.
The process for deriving PCP-consensus sequences for any group of aligned similar sequences has been validated for sequences with up to 50% diversity. Ongoing projects have shown that the method identifies residues that significantly alter PCPs at a given position, and might thus cause changes in function or immunogenicity. Other potential applications include deriving target proteins for drug design and diagnostic kits.
The most useful information one can glean from aligned protein sequences is, first, the set of absolutely conserved residues, which are usually those that maintain the structure of the protein or are vital for function. The pattern, or profile, of conserved residues in an alignment of a protein type can be used to identify other proteins in the same group that may have a similar structure [1–4]. In addition, the variance within the sequences, which may arise at specific positions from random variation (e.g., in RNA viruses, from an error-prone polymerase), can also indicate a functional change. It is thus important to be able to separate background variation, which in our approach is assumed to cause little change in the physicochemical properties (PCPs) at a position, from substitutions that alter these properties sufficiently to change protein function or immunogenicity [5–8].
Here we show applications of a general method to calculate physicochemical property (PCP)-consensus sequences. The method is designed to filter noise due to random amino-acid variations within strains or subtypes from more significant variation. We first modeled PCP-consensus sequences for several proteins, and showed that they were stable after minimization with our FANTOM program. We have also produced several PCP-consensus proteins from synthetic gene sequences in E. coli and tested their ability to be recognized enzymatically and immunologically. As discussed below, PCP-consensus sequences have many uses, in sequence classification, epitope comparison, in defining multivalent sequences as immunogens for vaccine use, and for defining targets for multivalent drug design.
The alignment-independent scale factors b_p were chosen so that vector values with higher relative entropies at a given column would be more significant; they were calculated as described elsewhere.
For very variable positions or highly biased datasets, each amino acid that naturally occurs at a position can be counted once, without regard to its rate of occurrence in the column, to calculate the average values of the 5 property vectors. In that case, equation 2 can still be used, and the chosen “consensus” amino acid is simply the one closest in its physical properties to all the naturally occurring amino acids. Other ways of dealing with bias in the data set, such as selective sequence weighting, can also be used to determine the property averaging method. This is an area for further study, as a completely mathematical solution for all situations is probably not possible.
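The column-wise selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PCP consensus program itself: the descriptor table `TOY_DESCRIPTORS` contains made-up numbers for six residues (the published 5D eigenvector values are not reproduced here), the function names are hypothetical, and a weighted Euclidean distance is assumed as a stand-in for equation 2. The optional `weights` argument is only a placeholder for scale factors such as b_p.

```python
import math
from collections import Counter

# HYPOTHETICAL 5-D descriptors: illustrative numbers only, NOT the published
# eigenvector values. Substitute the real table before any real use.
TOY_DESCRIPTORS = {
    "I": (-1.0, 0.2, 0.1, 0.0, 0.1),
    "L": (-0.9, 0.3, 0.0, 0.1, 0.0),
    "V": (-0.8, 0.0, 0.2, 0.0, 0.1),
    "K": (1.0, 0.5, -0.2, 0.1, 0.0),
    "R": (1.1, 0.7, -0.1, 0.0, 0.1),
    "D": (0.9, -0.8, 0.3, 0.1, 0.0),
}


def mean_property_vector(column, descriptors, use_frequencies=True):
    """Average the descriptor vectors of the residues in one alignment column.

    With use_frequencies=False each residue type is counted once, as suggested
    for very variable positions or highly biased datasets.
    """
    # Gap characters and unknown symbols are skipped.
    residues = [aa for aa in column if aa in descriptors]
    if not residues:
        raise ValueError("column contains no residues with known descriptors")
    counts = Counter(residues) if use_frequencies else Counter(set(residues))
    total = sum(counts.values())
    dims = len(next(iter(descriptors.values())))
    return tuple(
        sum(descriptors[aa][k] * n for aa, n in counts.items()) / total
        for k in range(dims)
    )


def consensus_residue(column, descriptors, use_frequencies=True, weights=None):
    """Return the residue whose descriptor vector lies closest to the column mean.

    Assumption: a weighted Euclidean distance; candidates are all residues in
    the descriptor table, not only those observed in the column.
    """
    target = mean_property_vector(column, descriptors, use_frequencies)
    if weights is None:
        weights = (1.0,) * len(target)

    def distance(aa):
        return math.sqrt(
            sum(w * (x - t) ** 2 for w, x, t in zip(weights, descriptors[aa], target))
        )

    # Sorting makes the choice deterministic if two residues tie exactly.
    return min(sorted(descriptors), key=distance)
```

With these toy values, the column `"VVVVVVVVIL"` yields `V` when residues are weighted by their frequency and `I` when each observed residue is counted once, which shows how the choice of averaging method can shift the consensus at a biased position.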
It should be stressed that biological findings can be incorporated at any point in this process, in distinguishing sequences that have specific properties. Bioinformaticians should be aware that sequences grouped according to a biological assay may or may not correlate with distinctive genotypes. For example, the four types of Dengue viruses (DENV1-4), first characterized by Sabin in the early fifties based on immunological reactivity, segregate rather cleanly into four distinct genotypes (see below). However, human enteroviruses (HEV), designated Coxsackie virus A or B based on the type of paralysis they caused in newborn mice, did not separate neatly into two distinct sequence groups. While we have relied on the strain designations in the NCBI for Flaviviruses, other useful data that should be part of the functional annotation (such as lethality) is often not specified by those providing the sequences to NCBI.
There are many potential uses for PCP-consensus sequences in virology, for example in classifying strains, identifying functional alterations, and designing novel, multivalent antigens for vaccines and diagnostics. Here we will show some applications based on data stored in our Flavitrack database (http://carnot.utmb.edu/flavitrack), which is a compendium of annotated Flavivirus sequences [9, 10]. Flaviviruses (FV), which include yellow fever (YFV), DENV, and West Nile viruses (WNV), are important human and animal pathogens which typically require insect vectors to infect mammalian hosts [30–35]. While mosquito control can be effective, antiviral agents and wide-spectrum vaccines are being sought to protect those in endemic areas [36–43]. To design effective vaccines, the areas of the viral proteins required for virus function or infectivity should be targeted by antibodies. Flaviviruses are variable, with many sequence variants, so-called “quasispecies”, found even in single virus isolates from the same patient. However, when catalogued, the strains appear redundant from a mathematical standpoint, with interstrain diversity occurring at fewer than 1% of positions. While much of this variation is neutral for phenotype, even a single point mutation can greatly alter the immunogenicity of the envelope protein or alter virus entry [38, 40, 45–47]. Recognizing such function-altering amino acid substitutions is important for designing entry inhibitors and vaccines that will protect against many Flaviviruses simultaneously.
Our first programs for analyzing the sequences in Flavitrack attempted to highlight all variation in the aligned sequences in a fashion suitable for conventional visual scanning of the data. These first attempts illustrated the need for unbiased data reduction: an alignment of 928 sequences (Flavitrack ca. 2009) covered dozens of pages of paper. Even at the smallest possible type (a microtext version of the database sequences provided to us by Reiner Eschbach’s group at Xerox), no screen setting was adequate to view more than a small part of the data. In addition, to quantify variation we needed a rational mean sequence against which individual sequences could be compared. Another intrinsic problem in the data was the non-random sequence distribution, with many more sequences available for certain mosquito-borne viruses (WNV and DENV) than for any of the tick-borne or no-known-vector (NKV) groups.
Proper grouping of virus isolates has meaning beyond mere nomenclature: comparing these consensus sequences, we could discriminate residue changes that fell outside the expected group variance. The comparison highlighted insertions and deletions that correlated with whether a species was carried by mosquitoes or ticks, and even with the type of disease (encephalitic vs. haemolytic) resulting from human infection. Additional uses of classification are to detect when a strain that appeared to be adapted to growth only in mosquitoes or bats contained key substitutions that might indicate human cross-over potential.
To further illustrate the potential uses of the method, we designed and produced a PCP-consensus “viral peptide linked to the genome” (VPg) for the human enteroviruses (HEV), which include polioviruses (PV), Coxsackie viruses A and B (CVA and CVB), and Echovirus. To initiate RNA synthesis, HEV polymerases (3D-pol) uridylylate a conserved tyrosine residue in the 22 amino acid long VPg to form VPgpU. We have determined the NMR structure of poliovirus type 1 (PV1)-VPg and PV1-VPgpU and shown that uridylylation stabilized the 3D-structure of the peptide, which is probably necessary for VPgpU to serve as a precursor for RNA synthesis [57–59]. As this reaction is not found in normal cells, it is a target for antiviral drug design. To develop a multivalent target VPg suitable for designing inhibitors against all HEV, the sequences of 33 unique HEV-VPgs were aligned and a PCP consensus protein, VPg-cons, was prepared with our automatic program. Although only about 50% of the amino acids were completely conserved in the selected VPgs, the calculated pKa values of all of them were exactly 10.46, suggesting the peptide must be very basic in order to function (this is consistent with our NMR structures, which show the basic, essential side chain of Arg17 very close to the phosphates of the coupled UMP). The PCP-consensus VPg, which is not identical to any naturally encoded sequence, had the same calculated pKa of 10.46. This illustrates that the consensus represents conserved physicochemical parameters of a sequence set. Both the PCP-consensus HEV-VPg and HEV-VPgpU were prepared synthetically [60, 61]. The HEV-VPg can be uridylylated by both the PV1- and CVA24-RNA-polymerases as well as or better than the wild type VPg encoded in their respective genomes. Thus the PCP-consensus VPg represents the conserved properties of HEV wild type VPgs, and functions in a multivalent manner.
Further study of the structure of the HEV-VPgpU should aid in deriving a general mechanism for uridylylation. Inhibitors based on the consensus HEV sequence should be multivalent, and prevent replication of all HEVs.
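A figure such as the "about 50% completely conserved" quoted for the VPg alignment can be obtained by counting alignment columns in which every sequence carries the same residue. The sketch below shows one way to do this, assuming a gap-padded alignment supplied as equal-length strings. The function names and the three toy sequences are hypothetical and are not VPg sequences.

```python
def conserved_columns(aligned):
    """Return the 0-based indices of columns where all sequences share one residue."""
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    # zip(*aligned) walks the alignment column by column.
    return [
        i
        for i, column in enumerate(zip(*aligned))
        if len(set(column)) == 1 and column[0] != "-"  # all-gap columns do not count
    ]


def fraction_conserved(aligned):
    """Fraction of alignment columns that are completely conserved."""
    return len(conserved_columns(aligned)) / len(aligned[0])


# Made-up toy alignment, for illustration only.
toy_alignment = ["ACDEFGHK", "ACDQFGYR", "AMDEFGHK"]
print(conserved_columns(toy_alignment))   # columns 0, 2, 4 and 5 are invariant
print(fraction_conserved(toy_alignment))  # 4 of 8 columns
```

Completely conserved columns need no property averaging, so a count like this also indicates how many positions a PCP-consensus calculation actually has to decide.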
Defining PCP-consensus sequences can aid in analysis of large sequence datasets. The calculation method, based on a previously validated 5D-vector scale for the physicochemical properties of the amino acids, is straightforward, once a suitable alignment of related sequences is obtained. Having a rational consensus allows one to distinguish residue variations that significantly alter the properties at a given position. The method is thus suitable for application to many types of bioinformatics data.
The usefulness of the methodology in virology was demonstrated in two practical applications. A multivalent, PCP-consensus DENV vaccine candidate was designed, produced, and shown to bind antibodies against all four types of DENV. In addition, a consensus HEV-VPg retained the properties conserved in wild type VPgs, particularly the high pKa, and was uridylylated by two different HEV polymerases. This validated method should find application in many practical areas of virology and other areas of biology.
We thank all coworkers from the UTMB, especially David Beasley for his invaluable assistance with all the DENV work, and Werner Braun for his thoughtful input on handling bias and alternate methods of consensus sequence determination; and the Xerox team of Dr. Reiner Eschbach for microtext versions of alignments.
Funding: The DENV vaccine project was supported in part by grant 1UL1RR029876-01 from the National Center for Research Resources, NIH to the Institute for Translational Studies of the UTMB (support was to CHS and David Beasley for pilot protocols #763 and #809). Jessica A. Lewis is supported by a pre-doctoral Fellowship from the Sealy Center for Vaccine Development at the UTMB; Dr. Paul's work is supported by NIH AI015122 to E. Wimmer. Kay Choi’s work is supported by NIH grant 1R01AI087856. Development of the PCP-consensus method was supported in part by NIH grant AI064913 (to Werner Braun and CHS) and EPA-STAR grant RE-83406601-0 (to CHS). D.V.F. and G.J. v.d. H. v. N. are supported by The Netherlands Organization for Scientific Research (NWO). The computational resources of the Sealy Center for Structural Biology and Molecular Biophysics were also used in this project.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 13, 2012: Selected articles from The 8th Annual Biotechnology and Bioinformatics Symposium (BIOT-2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S13/S1