The variant surface glycoprotein (VSG) is essential for the survival of Trypanosoma brucei in mammalian hosts. There are ~5.5 × 10⁶ VSG homodimers per cell, and the cell surface monolayer that the VSG forms is considered to provide general protection from innate immune mechanisms [1, 2]. The coat nevertheless elicits a specific, trypanocidal immune response. This is countered by antigenic variation, in which trypanosomes switch to expression of a distinct VSG which, if antigenically novel, allows clonal proliferation of the switched cells, generating a new parasitaemia peak. Each trypanosome expresses only one VSG gene but has the potential to switch to any of probably hundreds of others [3, 4].
The VSG is a structural paradigm for α-helical coiled-coil proteins and for B cell epitope variation [5, 6]. This is because its hundreds or thousands of isoforms have limited similarity in peptide sequence and antigenicity, but strong conservation in higher-level structure. In T. brucei, mature VSGs contain 400–500 amino acids, most having between 420 and 460 residues. Most of the protein is an N-terminal domain of ~350 residues, which is followed by a C-terminal domain containing one or two smaller subdomains of 40–80 residues each [5, 7]. N-terminal domains usually have only ~20% identity between different VSGs, although some are more closely related. The most conserved primary structure feature is the cysteine pattern, of which there are three, resulting in the classification of this domain into types A, B and C. In contrast to the extensive sequence diversity, secondary structure potential is conserved, with a consequent overall similarity between the N-terminal domains of distinct VSGs. The backbone comprises two long, antiparallel α helices that form a coiled coil. It is not known exactly which elements of this domain contain the exposed epitopes that are the basis of antigenic variation, but it has been inferred that they are conformational rather than linear and are located in the loops exposed at the most extracellular end of the N-terminal domain, outwith the helices [8–10]. One important question that is yet to be definitively answered is why N-terminal domains vary over their entire sequence, rather than in just the region encoding the exposed protective epitopes.
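The identity figures quoted here are pairwise measures over aligned residues. As a minimal sketch of how such a figure can be computed, assuming two sequences that have already been aligned to equal length with "-" as the gap character (the function name and the toy sequences are hypothetical and are not taken from any VSG):

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0  # columns where both sequences carry a residue
    matches = 0   # compared columns where the residues are identical
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # assumption: gapped columns are excluded from the denominator
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned fragments (hypothetical): three of four columns match.
identity = percent_identity("ACDE", "ACDF")
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, so identity values are only comparable when the same denominator is used throughout.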
The C-terminal domain is hidden from antibodies (NJ & MC, unpublished), presumably due to its membrane-proximal location, so is thought not to contribute to antigenic variation. It has ~40% identity between different VSGs. The genome project has revealed that, rather than the four C-terminal domain types (1–4) previously recognized on the basis of their cysteine patterns, there are six types. Types 2, 4 and 5 are single domains, each containing four cysteine residues, whereas types 1, 3 and 6 contain eight cysteines and appear to be composed of two subdomains, each containing four cysteines. Individual VSGs can have any combination of N- and C-terminal domains (A1, A2, A3, A4, B1, B2 etc.), and, as judged from the VSGs analysed so far, there appears to be no restriction on combinations. At its C-terminal end, this domain contains a signal sequence for the addition of a glycosylphosphatidylinositol (GPI) anchor. Although sequence features specifying GPI signal sequences have been identified, their full diversity across VSGs is not known, and study of as many potential signal sequences as possible could enable deeper understanding.
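The cysteine counts above suggest a coarse first-pass grouping of C-terminal domains, and the domain nomenclature can be enumerated directly. The following is a minimal sketch only: the function name and the toy sequence are hypothetical, and because the positions of the cysteines that define each individual type are not given here, the count alone cannot distinguish types within a group.

```python
from itertools import product


def cterminal_group(domain_seq: str) -> str:
    """Group a C-terminal domain by its number of cysteine residues."""
    n_cys = domain_seq.upper().count("C")
    if n_cys == 4:
        return "single domain (types 2, 4 or 5)"
    if n_cys == 8:
        return "two subdomains (types 1, 3 or 6)"
    return "unclassified"  # count matches neither group described in the text


# All N-terminal (A-C) by C-terminal (1-6) domain combinations: A1, A2, ..., C6.
N_TYPES = ("A", "B", "C")
C_TYPES = (1, 2, 3, 4, 5, 6)
COMBINATIONS = [f"{n}{c}" for n, c in product(N_TYPES, C_TYPES)]

# Toy sequence with four cysteines (hypothetical, not a real VSG domain).
toy_single = "ACGGCKKCTTCA"
group_single = cterminal_group(toy_single)
group_double = cterminal_group(toy_single * 2)
```

With three N-terminal and six C-terminal types there are 18 possible combinations. Whether all of them occur in practice is an empirical question that a complete sequence set could answer.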
Although about 1000 VSG sequences are currently available, mainly through genome and cDNA sequencing, it is not straightforward to retrieve a complete set from general databases. We have therefore created a database allowing retrieval of criterion-based subsets. This should facilitate a more detailed analysis of VSG structure and of more general questions about protein structure, including:
What are the sequence requirements for coiled coil structures?
Can epitope diversity be correlated with diversity in primary and higher-order structure?
How diverse are GPI-anchor signal sequences?
How does evolutionary selection for diversification fit within a conserved protein structure?
It is worth noting that relatively few of the silent VSGs sequenced in the genome project are considered to be fully functional, and verification of function of any element, for example GPI anchor signal sequences, requires demonstration of expression. In contrast, the non-genome VSG sequences are based mainly on expression, most having been derived from cDNA sequences, and query returns in FASTA format report their derivation from cDNA or genomic DNA.
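Because FASTA returns report the derivation of each sequence, expression-supported records can be separated from genomic ones after download. The sketch below illustrates the idea under a loud assumption: the header layout, with a "source=" field separated by "|", is hypothetical and is not the database's documented format, so the field test would need adapting to the real headers. The record names and sequences are also invented.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        yield header, "".join(chunks)


def filter_by_source(text: str, source: str) -> List[Tuple[str, str]]:
    """Keep records whose header has a 'source=<source>' field (assumed layout)."""
    wanted = f"source={source}".lower()
    kept = []
    for header, seq in read_fasta(text):
        fields = [field.strip().lower() for field in header.split("|")]
        if wanted in fields:
            kept.append((header, seq))
    return kept


# Hypothetical query return; names, header fields and sequences are invented.
EXAMPLE = """>vsg_example_1|source=cDNA
MKTA
CLLA
>vsg_example_2|source=genomic
MRTAQ
"""

cdna_records = filter_by_source(EXAMPLE, "cDNA")
```

Restricting an analysis of, for example, GPI anchor signal sequences to the cDNA-derived subset would limit it to sequences with evidence of expression.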
At the genetic level, trypanosomes use a strategy common to antigenic variation in a diverse range of microbial pathogens: accessing an archive of silent genes. In T. brucei, there is a large archive of silent, distinct VSG genes, effectively all of which are telomeric and subtelomeric. In the genome strain, about 1600 VSGs are arranged as tandem arrays in subtelomeres of a range of chromosomes, and it is likely that different strains contain substantially larger archives. Only 4.5% of this set are annotated as intact genes, the rest consisting of atypical genes (which do not convincingly encode maturable VSGs), pseudogenes (which include frameshifts and/or stop codons), and VSG fragments. Another set, of up to ~200 genes, is located telomere-proximally in the ~100 minichromosomes; so far, based on three genes [13, 14], this set appears to consist of intact VSGs. Despite the enormous size of the silent archive, each trypanosome expresses only one VSG. Expression occurs only from specialized, telomere-proximal transcription units termed bloodstream expression sites (BESs). For archival genes, activation therefore involves duplication into a BES at the expense of the previously transcribed VSG, which is lost. Even pseudogenes can be activated this way, as part of their sequence can contribute to the formation of mosaic genes. Although it is known that homologous recombination participates actively in VSG switching, it is thought that limited sequence homology within the coding sequence is involved in the formation of mosaic genes, and an important function of a database could be enabling identification of such sequence homologies. Duplication of intact VSGs apparently can utilize homologous, imperfect repeats upstream of most VSGs and can end at the other flank in conserved sequences towards the 3' end of the coding region or further downstream, where the 3' untranslated region is encoded. Sometimes the incoming gene duplicate inherits part of the C-terminal domain encoding sequence from the VSG already in the BES.
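The proportions quoted above imply only a small number of intact subtelomeric genes. As a rough worked calculation, treating the approximate figures in the text as exact and using hypothetical variable names:

```python
# Approximate figures from the text, treated as exact for illustration only.
ARRAY_VSGS = 1600              # "about 1600" VSGs in subtelomeric arrays
INTACT_PER_THOUSAND = 45       # 4.5% annotated as intact
MINICHROMOSOMAL_MAX = 200      # "up to ~200" minichromosomal genes

# Integer arithmetic avoids floating-point rounding of 0.045.
intact_array = ARRAY_VSGS * INTACT_PER_THOUSAND // 1000
non_intact_array = ARRAY_VSGS - intact_array

# Upper bound on intact genes, assuming every minichromosomal gene is intact
# (the text bases this on only three sequenced genes).
intact_upper_bound = intact_array + MINICHROMOSOMAL_MAX

print(f"intact subtelomeric VSGs: ~{intact_array}")
print(f"atypical genes, pseudogenes and fragments: ~{non_intact_array}")
print(f"intact genes including minichromosomes: up to ~{intact_upper_bound}")
```

On these figures, the great majority of the archive can contribute to expressed coats only through partial donation to mosaic genes.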
The VSG archive is very diverse, to the extent that different trypanosome strains have widely different gene sets. How the archive evolves is unknown, but it is evident from the dispersed nature of the VSG gene arrays, and from analysis of duplication events within the archive (LM, JDB, unpublished), that homologous recombination, involving primarily gene conversion, plays a major role. It is now becoming clear that subtelomeres of various organisms, including humans, are preferential sites for the rapid evolution of multigene families, possibly due to the preferred use of particular recombination mechanisms. Due to the availability of the sequence of most VSGs in the silent gene arrays, the trypanosome has now become an experimentally tractable paradigm for the role of subtelomere recombination in multigene family diversification. Thus, ready access to the individual gene sequences in a dedicated database can help address a number of questions about chromosomes and recombination, such as:
How do sequences spread and diversify within and between subtelomeres?
What is the contribution of partial gene conversion, (micro)homologous recombination and point mutation to diversification of the gene family?
What is the rate of evolution of coding sequences and of pseudogenes that can donate partial coding information?