In what follows, we present two examples, one from the study of influenza hemagglutinin (HA) genes and one from the study of mammalian olfactory receptor (OR) genes, to demonstrate how Phylo-mLogo can help users observe and analyze alignments of many sequences. The numbers of aligned sequences in these two examples are 453 and 1425, respectively. In each example, the phylogenetic relationships among the aligned sequences are also obtained.
Example 1: 453 avian influenza HA genes
The spread of H5N1 avian influenza from China to Europe has raised global concern about its potential to infect humans and cause a pandemic. A more comprehensive collection and analysis of avian influenza sequences is critically needed for biologists and epidemiologists to assess the virulence of these viruses and their transmissibility from avian species to humans. Obenauer et al. (2006) established the first large-scale sequencing effort to collect additional genomic data on the avian population of influenza A viruses [3]. They introduced a proteotyping method to identify and number unique amino acid signatures, called proteotypes, for sequences that may or may not be distinguished by branches on a phylogenetic tree. They analyzed eight avian influenza genes and provided proteograms to show the amino acid signatures within each clade (Figures S2-S9 in the supplementary material of [3]). Based on these observations, they concluded that virus families tend to have multiple conserved core genes and that the surface glycoproteins, HA and NA, appear to be exchanged more freely than the core proteins because of immune pressure [3].
In this example, we downloaded the 453 avian influenza HA genes analyzed in [3]. Because inferring the phylogenetic tree of these sequences with MrBayes [3, 21] would have been very time consuming, we took the tree shown in Fig. S6 of [3] directly and reconstructed the phylogenetic relationships manually. The proteotypes of the analyzed sequences included p1.1, p2.1, p5.1–4, p6.1–6, p8.1, p9.1, p9.2, and p12.1. We first aligned the sequences within each proteotype and then aligned the proteotypes with one another, in both steps using ClustalW [22]. The total alignment length is 584.
Figure 3A shows the sequence logos and their phylogenetic tree simultaneously. Unlike other tools for tree visualization [23], Phylo-mLogo displays the phylogenetic tree in the style of a standard file browser, because this representation is more compact than the traditional drawing of the original phylogenetic tree shown in Fig. 3B. The user can therefore click on different clades, shown in various background colors, much like selecting different folders in a file browser, to visualize the sequence logos of the alignment at different levels.
Stevens et al. (2006) listed some conserved residues within the receptor binding domains of the H1 and H5 serotypes that are implicated in receptor specificity: amino acid positions 183, 190, 193, 194, 216, 221, 222, and 225–228 [24]. The corresponding positions in our alignment are 205, 212, 215, 216, 238, 243, 244, and 247–250, respectively. We compared the sequence logos at these sites among different proteotypes. To avoid confusion, we refer to the sites by the original positions given in Stevens et al. (2006). As shown in Fig. 3C, the amino acids at sites 194, 225, and 228 are almost conserved across the H1, H2, H5, H6, H8, H9, and H12 serotypes. If we consider only H1, H2, and H6, which belong to the same clade as H5 [3, 25], the amino acids at sites 183, 190, 194, 225, 226, and 228 are almost the same across these serotypes.
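The correspondence between the two numbering schemes quoted above can be written down as a small lookup table. The following Python sketch is only an illustration and is not part of Phylo-mLogo; the function names are hypothetical. It also checks that, for the listed sites, the two schemes differ by a single constant offset. That offset is verified only for these sites and should not be assumed to hold elsewhere in the alignment, where gaps may shift the numbering.

```python
# Position pairs quoted in the text above:
# numbering of Stevens et al. (2006) -> column in the 584-column alignment.
# 226 and 227 are filled in from the quoted ranges 225-228 and 247-250.
STEVENS_TO_ALIGNMENT = {
    183: 205, 190: 212, 193: 215, 194: 216,
    216: 238, 221: 243, 222: 244,
    225: 247, 226: 248, 227: 249, 228: 250,
}


def shared_offset(pairs):
    """Return the common offset if all pairs differ by the same amount, else None."""
    offsets = {column - position for position, column in pairs.items()}
    return offsets.pop() if len(offsets) == 1 else None


def to_alignment_column(position, pairs=STEVENS_TO_ALIGNMENT):
    """Alignment column for a listed position (raises KeyError if not listed)."""
    return pairs[position]


def to_stevens_position(column, pairs=STEVENS_TO_ALIGNMENT):
    """Original position for a listed alignment column (raises KeyError if not listed)."""
    reverse = {col: pos for pos, col in pairs.items()}
    return reverse[column]


# Offset shared by all listed sites, or None if they disagree.
OFFSET = shared_offset(STEVENS_TO_ALIGNMENT)
```

Restricting the lookup to the listed sites is deliberate: it avoids extrapolating the offset to positions for which the text gives no correspondence.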
In brief, Phylo-mLogo helps users compare and visualize polymorphisms and indel events across different clades or subtypes in an alignment of many sequences, so that they can formulate hypotheses about possible evolutionary and functional mechanisms and examine them further.
Example 2: 1425 human and mouse functional olfactory receptor genes
Olfactory receptor (OR) genes, which encode the proteins used for detecting odor molecules, form the largest multigene family in mammals [2, 26]. OR genes have no introns in their coding regions, and the encoded proteins are 300 to 350 amino acids in length. As members of the G-protein-coupled receptor (GPCR) superfamily, OR proteins contain seven transmembrane domains (TM1-TM7, 19–26 amino acids each) interconnected by intracellular and extracellular loops, an extracellular N-terminal chain, and an intracellular C-terminal chain [27, 28].
Functional OR genes are divided into class I and class II [29]. In the human genome, Niimura and Nei (2003) identified 388 functional OR genes, of which 57 belong to class I and 331 to class II [1, 29]. They also identified 19 phylogenetic clades (clades A-S) among the human class II OR genes and found that many genes belonging to the same phylogenetic clade were located in the same genomic cluster [1]. In 2005, they conducted a detailed study of functional OR genes and pseudogenes in mice and identified 1,037 functional genes and 354 pseudogenes. The number of functional genes in mice is thus about 2.7 times that in humans, whereas the number of pseudogenes is slightly smaller in mice than in humans [30].
In this example, we downloaded all human and mouse functional OR genes from [1], 1,425 sequences in total, together with their detailed annotations. We aligned these sequences with FFT-NS-i [31] and then constructed their neighbor-joining phylogenetic tree with MEGA3 [32]. Following Niimura and Nei's classification [1, 30], we also marked the 19 phylogenetic clades (clades A-S) of class II OR genes in the final alignment and in the phylogenetic tree.
The phylogenetic relationship of all human and mouse functional OR genes is shown in the left part of Fig. 4A. We selected several major branches which include over ten sequences each, and displayed their corresponding logos shown in the right part of Fig. 4A. We observed that the residues at some sites are highly conserved across different branches but others are not. Some blank regions in the logos show that the columns in these regions across the whole alignment are almost filled with gaps due to some insertions in only a few sequences.
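To make the relation between gap-filled columns and blank logo regions concrete, the following Python sketch computes the information content of each column of a clade's alignment. It is a minimal illustration under stated assumptions, not the computation used by Phylo-mLogo: it uses the usual sequence logo measure (log2 of the alphabet size minus the Shannon entropy), ignores small-sample corrections, and, as an assumption of this sketch, scales each column by its fraction of non-gap symbols so that columns occupied by only a few sequences appear nearly blank. The function names are hypothetical.

```python
import math
from collections import Counter

GAP = "-"


def column_information(column, alphabet_size=20):
    """Return (scaled information in bits, residue frequencies) for one column.

    Frequencies are computed over non-gap symbols only. The information
    content is multiplied by the fraction of non-gap symbols (an assumption
    of this sketch), so gap-dominated columns get a small height.
    """
    residues = [symbol for symbol in column if symbol != GAP]
    if not residues:
        return 0.0, {}
    total = len(residues)
    frequencies = {aa: n / total for aa, n in Counter(residues).items()}
    entropy = -sum(p * math.log2(p) for p in frequencies.values())
    information = math.log2(alphabet_size) - entropy
    occupancy = total / len(column)
    return information * occupancy, frequencies


def clade_logo(sequences):
    """Compute column_information for every column of equally long aligned sequences."""
    return [column_information(column) for column in zip(*sequences)]


# Toy clade of four aligned sequences (illustrative data only).
toy_clade = ["QA-", "QS-", "QA-", "QSK"]
toy_logo = clade_logo(toy_clade)
```

In the toy clade, the first column is fully conserved and reaches the maximum height, whereas the third column is a gap in three of the four sequences and is scaled down to a quarter of that height.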
To help users quickly identify and annotate functional regions or specific target sites, Phylo-mLogo provides an operation that allows the user to select annotated or well-studied sequences in the alignment, pull them from the sequence pool in the Phylogeny/Relationship Viewing Control into the Clade Logos View, and create new logos from them for reference. In this example, we selected HsOR17.1.14, also known as OR1E1, as a reference sequence to determine the locations of the TM domains in the alignment, because the positions of its TM domains had already been identified [33]. We pulled this sequence from the sequence pool to the top of the Clade Logos View, and a new logo of the selected sequence was generated (Fig. 4B). We also selected the 19 phylogenetic clades (clades A-S) of class II OR genes according to Niimura and Nei's classification [1, 30]. The boundaries of the seven TM domains in the alignment can therefore be identified. For example, the start and end positions of TM1 are 120 and 146, respectively, and those of TM2 are 164 and 187, respectively (Fig. 4C).
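The idea of locating domain boundaries in the alignment from an annotated reference sequence can be sketched as follows. This Python example is an illustration only: the aligned reference string and the domain coordinates are toy data, not the OR1E1 sequence or its published TM positions, and the function names are hypothetical. It assumes that the domain boundaries are given as 1-based residue numbers of the ungapped reference.

```python
def residue_to_column(aligned_reference, gap="-"):
    """Map 1-based residue numbers of the ungapped reference to 1-based alignment columns."""
    mapping = {}
    residue_number = 0
    for column, symbol in enumerate(aligned_reference, start=1):
        if symbol != gap:
            residue_number += 1
            mapping[residue_number] = column
    return mapping


def domain_columns(aligned_reference, domains):
    """Translate {name: (start, end)} from reference residues to alignment columns."""
    mapping = residue_to_column(aligned_reference)
    return {
        name: (mapping[start], mapping[end])
        for name, (start, end) in domains.items()
    }


# Toy reference row of an alignment and toy domain annotation (hypothetical values).
toy_reference = "--MKT-LLVA--GGSW"
toy_domains = {"TM1": (2, 5), "TM2": (7, 9)}
toy_boundaries = domain_columns(toy_reference, toy_domains)
```

Because gaps in the reference row shift the numbering, a domain that is contiguous in the reference may span more columns in the alignment than it has residues.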
Interestingly, we found many sites that seem to have undergone positive selection, because their amino acids are fixed within some clades but differ from those in other clades. For example, the majority amino acid at site 120 in the TM1 domain is Q (glutamine, uncharged) in clades A, D, E, G, H, L, M, N, O, Q, R, and S, but E (glutamic acid, negatively charged) in clades C, F, and J, while clade I and class I both carry positively charged residues (K (lysine) in clade I and H (histidine) in class I). Moreover, the amino acid at site 171 (TM2 domain) is S (serine, polar) in most clades but A (alanine, non-polar) in three clades of class II, implying that the physico-chemical properties at these sites differ among clades.
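The visual pattern described above, sites that are fixed within clades but differ between clades, can also be screened for programmatically. The following Python sketch is an illustration under stated assumptions and not a feature of Phylo-mLogo: a site counts as nearly fixed in a clade when one residue accounts for at least a chosen fraction of the non-gap symbols (the 0.9 threshold is arbitrary), and the function names and toy sequences are hypothetical. It is a descriptive screen only and does not test for positive selection.

```python
from collections import Counter


def majority_residue(column, min_fraction=0.9, gap="-"):
    """Return the residue reaching min_fraction of the non-gap symbols, else None."""
    residues = [symbol for symbol in column if symbol != gap]
    if not residues:
        return None
    residue, count = Counter(residues).most_common(1)[0]
    return residue if count / len(residues) >= min_fraction else None


def clade_specific_sites(clades, min_fraction=0.9):
    """Find columns that are nearly fixed in every clade but differ between clades.

    clades maps a clade name to a list of equally long aligned sequences.
    Returns {1-based column: {clade name: majority residue}}.
    """
    length = len(next(iter(clades.values()))[0])
    hits = {}
    for i in range(length):
        per_clade = {
            name: majority_residue([seq[i] for seq in seqs], min_fraction)
            for name, seqs in clades.items()
        }
        if None in per_clade.values():
            continue  # at least one clade is not nearly fixed at this column
        if len(set(per_clade.values())) > 1:
            hits[i + 1] = per_clade
    return hits


# Toy clades with three aligned sequences each (illustrative data only).
toy_clades = {
    "A": ["QLS", "QLS", "QIS"],
    "C": ["ELS", "ELS", "ELS"],
    "I": ["KLA", "KLA", "KLA"],
}
toy_hits = clade_specific_sites(toy_clades)
```

In the toy data, columns 1 and 3 are reported because each clade is fixed there and the clades disagree, whereas column 2 is skipped because clade A is not nearly fixed at that column.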
To sum up, residues at a site that are conserved only within some clades usually cannot be identified from a global logo representation. However, such sites are generally considered to have undergone positive selection and to be biologically meaningful. Although the identification of positively selected sites should be tackled by concrete computational models rather than by visualization, Phylo-mLogo, as the examples show, allows the user to visually inspect candidate sites in the initial step of sequence analysis as well as to present computational results in the final stage.