Subfamily logos: visualization of sequence deviations at alignment positions with high information content

Beitz, Eric

doi:10.1186/1471-2105-7-313

Software
Open access
Published: 21 June 2006

Subfamily logos: visualization of sequence deviations at alignment positions with high information content

Eric Beitz¹

BMC Bioinformatics volume 7, Article number: 313 (2006) Cite this article

4647 Accesses
13 Citations
6 Altmetric
Metrics details

Abstract

Background

Recognition of relevant sequence deviations can be valuable for elucidating functional differences between protein subfamilies. Interesting residues at highly conserved positions can then be mutated and experimentally analyzed. However, identification of such sites is tedious because automated approaches are scarce.

Results

Subfamily logos visualize subfamily-specific sequence deviations. The display is similar to classical sequence logos but extends into the negative range. Positive, upright characters correspond to residues which are characteristic for the subfamily, negative, upside-down characters to residues typical for the remaining sequences. The symbol height is adjusted to the information content of the alignment position. Residues which are conserved throughout do not appear.

Conclusion

Subfamily logos provide an intuitive display of relevant sequence deviations. The method has proven to be valid using a set of 135 aligned aquaporin sequences in which established subfamily-specific positions were readily identified by the algorithm.

Background

Most protein families can be divided into functionally distinct subfamilies. Such subfamilies exhibit characteristic properties which manifest for instance as binding specificity of regulatory proteins, substrate specificity of enzymes, and pore selectivity of channels and transporters. Functional differences are often linked to sequence characteristics in regions which are conserved throughout the protein superfamily. This is because conserved domains define the fold of the functional protein core or provide catalytic residues. Recognition of subfamily-specific deviations at such sites can be valuable for elucidating mechanistic principles of the protein family by site-directed mutagenesis and subsequent functional analysis of the mutants. An automated approach to identify relevant deviations should (i) provide the ability to take into account a large number of reference sequences, (ii) determine sequence conservation, i. e. positions of high information content, and (iii) visualize deviations, i.e. subfamily characteristics, relative to the information content in a graphical output which is easy to comprehend.

Implementation

One sophisticated way of presenting sequence conservation is to display a sequence logo [6]. Here, the information content I (P_i) of each alignment position i is defined inverse to the uncertainty H(P_i) by the equation

I (P_{i}) = \log_{2} | Σ | - H (P_{i}) = \log_{2} | Σ | + \sum_{j \in | Σ |} P_{i j} \cdot \log_{2} P_{i j}

with |Σ| being the cardinality of the used alphabet, i.e. 4 for DNA and 20 for protein sequences, and P_ijbeing the frequency of residue j at this position (variables according to [7]). Each position is displayed as a stack of residue symbols whose heights l_ijrepresent their proportion of the information content:

l_ij= P_ij·I(P_i)

Protein sequence logos are often adjusted to the background frequency of each amino acid in the alignment [7]. For simplicity, the variable name I (P_i) will be used in the following for both, information content with or without frequency correction. Generally, both approaches are compatible with subfamily logos and have been implemented in the algorithm.

Contrary to a sequence logo that depicts sequence conservation, here, it is desired to display the relevance of deviations at conserved positions. The recently published pairwise HMM logo approach does align the sequence logos of two subfamilies [8]. This certainly facilitates the identification of relevant deviant positions, but one still has to inspect position by position and judge different symbol heights by eye. Subfamiliy logos provide a very intuitive display. They are derived by subtracting from the frequency S_ijof a residue j within a pre-defined subset of sequences, i. e. a subfamily, the frequency R_ijof this residue in the remaining set of sequences for each position i. The difference is then weighted by the overall information content I(P_i) computed from all sequences and the residue is plotted with a symbol height of s_ij:

s_ij= (S_ij- R_ij)· $f_{\tilde{S} \tilde{R}}$ ·I(P_i)

The term (S_ij- R_ij) gives values from -1 to 1. Positive values correspond to residues which are characteristic for the subfamily (shown upright in the output), negative values to those that are typical for the remaining sequences (shown upside-down). Positions with an equal distribution of residue j result in a zero value.

The need for a correction factor $f_{\tilde{S} \tilde{R}}$ is illustrated by the following example. Assume an alignment with an equal number of sequences in the subfamily and in the remaining set of sequences. Further, assume a position i within the alignment where all sequences in the subfamily carry amino acid a and all remaining sequences carry amino acid b with a ≠ b. This situation can be considered as the best possible discrimination between the subfamily and the remaining set of sequences and results in the frequencies P_ia= 0.5, P_ib= 0.5 and all other P_ij= 0. The overall information content at this position, thus, is I(P_i) = Iog₂ 20 + 0.5 Iog₂ 0.5 + 0.5·Iog₂·0.5 = Iog₂ 20 - 1, i. e. one bit less than the maximal information content. For either group of sequences, however, the information content should be maximal due to the frequencies S_ia= 1 (subfamily) and R_ib= 1 (remaining sequences). The decrease in the apparent information content depends on the fraction of sequences in the subfamily ( $\tilde{S}$ ) and in the remaining set ( $\tilde{R}$ ). Hence, the factor $f_{\tilde{S} \tilde{R}}$ was introduced, which follows the form shown in the example above and corrects for the described error:

f_{\tilde{S} \tilde{R}} = \frac{\log_{2} | Σ |}{\log_{2} | Σ | + \tilde{S} \cdot \log_{2} \tilde{S} + \tilde{R} \cdot \log_{2} \tilde{R}}

Results and discussion

Fig. 1 displays two sections of a protein alignment (pos. 42–71 and 173–202) which consists of a total of 135 aquaporin sequences. Two functionally distinct subfamilies are represented by 32 aquaglyceroporins (GlpFs; permeability for water and glycerol), and 103 water-specific aquaporins (AQPs). From the latter, another water-specific subfamily consisting of 11 plant tonoplast intrinsic proteins (TIPs) can be separated.

The frequency-corrected sequence logo on the top highlights conserved positions around the two canonical Asn-Pro-Ala (NPA) motifs. The scale of the ordinate is in bits. Sequence conservation is further indicated by a color scale below the logo based on a structural matrix integrated into TEXshade. The triangles mark positions where the GlpF, AQP, and TIP subfamilies deviate as shown before in various publications [2, 4, 5]. These positions are directly connected to function because they contribute to the layout of the selective pore constriction.

Three frequency-corrected subfamily logos are shown below. Readability is greatly improved when upside-down symbols are tinted by 50%. This gives the impression of a reflective surface with a focus on the positive, subfamily-relevant residue symbols. The output is intuitive and basically self-explaining. Positions which are conserved throughout do not appear in the subfamily logos, see for instance the NPA motifs at positions 63–65 and 194–196. However, sequence deviations become visible dependent on the information content, e.g. Val₁₉₇ vs. Arg in the TIP subfamily, or Asp₁₉₈ and Ser₁₉₈, respectively, in the GlpF or AQP subfamilies. Deviations are less pronounced at positions with a higher number of possible residues due to the lower information content. Nevertheless, subfamily characteristics are still visible if relevant, e.g. at positions 43, 182, and 202. The algorithm further accepts a threshold bit-value above which a deviant residue is additionally highlighted by a symbol (asterisks in Fig. 1). Empirically, this value is set to log₂5 (2.322 bit) for proteins, which corresponds to the presence of one particular residue in 25% of all sequences or 50% of the subfamily, and log₂2 (1 bit) for DNA sequences. The threshold value can be manually adjusted to match the alignment situation in question. It may also be used in the future to indicate statistical evaluations of the residue distribution. Inherently, best results are obtained when only two subfamilies are compared.

Currently, subfamily logos are implemented in TEXshade [see additional files 1 and 2], i.e. a LATEX macro package for setting and shading multiple sequence alignments [1]. Some sample code is displayed in Fig. 2 depicting that a small number of commands leads to satisfying output. TEXshade provides numerous additional commands for individual adjustments of the output and comprehensive labeling. However, implementation of a subfamily logo extension into software that provides a graphical user interface and TEXshade output, such as STRAP [3] or the San Diego Supercomputer Center Biology WorkBench http://workbench.sdsc.edu/, is strongly encouraged. Further, integration of the subfamily logo algorithm into local or web-based sequence logo plotting tools should be straight forward.

Conclusion

Subfamily logos are an extension to the classical application of sequence logos. They provide a novel tool to intuitively visualize subfamily sequence characteristics. The validity of the method was confirmed by analysis of 135 aligned aquaporin sequences and correct identification of subfamily-specific sequence deviations. Their relationship to sequence logos makes it easy to integrate them into existing logo software.

Availability and requirements

Project name: TEXshade

Project home page: http://homepages.uni-tuebingen.de/beitz/txe.html or any CTAN site

Operating system(s): Platform independent

Programming language: LATEX

Other requirements: LATEX2ε

License: GNU GPL

Any restrictions to use by non-academics: none

References

Beitz E: TEXshade: shading and labeling of multiple sequence alignments using LATEX2ε. Bioinformatics 2000, 16: 135–139. 10.1093/bioinformatics/16.2.135
Article CAS PubMed Google Scholar
Beitz E: Aquaporins from pathogenic protozoan parasites: structure, function and potential for chemotherapy. Biol Cell 2005, 97: 373–383.
Article CAS PubMed Google Scholar
Gille C, Frommmel C: STRAP: editor for STRuctural Alignments of Proteins. Bioinformatics 2001, 17: 377–378. 10.1093/bioinformatics/17.4.377
Article CAS PubMed Google Scholar
Lagree V, Froger A, Deschamps S, Hubert JF, Delamarche C, Bonnec G, Thomas D, Gouranton J, Pellerin I: Switch from an aquaporin to a glycerol channel by two amino acids substitution. J Biol Chem 1999, 274: 6817–6819. 10.1074/jbc.274.11.6817
Article CAS PubMed Google Scholar
Pavlović-Djuranović S, Schultz JE, Beitz E: A single aquaporin gene encodes a water/glycerol/urea facilitator in Toxoplasma gondii with similarity to plant tonoplast intrinsic proteins. FEBS Lett 2003, 555: 500–504. 10.1016/S0014-5793(03)01313-9
Article PubMed Google Scholar
Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucl Acids Res 1990, 18: 6097–6100.
Article PubMed Central CAS PubMed Google Scholar
Schuster-Böckler B, Schultz J, Rahmann S: HMM Logos for visualization of protein families. BMC Bioinformatics 2004, 5: 7. 10.1186/1471-2105-5-7
Article PubMed Central PubMed Google Scholar
Schuster-Böckler B, Bateman A: Visualizing profile-profile alignment: pairwise HMM logos. Bioinformatics 2005, 21: 2912–2913. 10.1093/bioinformatics/bti434
Article PubMed Google Scholar

Download references

Author information

Authors and Affiliations

Dept. of Pharmaceutical Chemistry, University of Tübingen, Morgenstelle 8, 72076, Tübingen, Germany
Eric Beitz

Authors

Eric Beitz
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Eric Beitz.

Additional information

Authors' contributions

EB designed, implemented, and tested the algorithm and prepared the manuscript.

Electronic supplementary material

12859_2006_1052_MOESM1_ESM.dtx

Additional file 1: TEXshade source code. This is the TEXshade macro package containing the sequence logo algorithm and documentation as a LATEX docstrip archive (DTX 719 KB)

12859_2006_1052_MOESM2_ESM.ins

Additional file 2: LATEX instruction file. This file contains the LATEX instructions for unpacking the TEXshade macros (INS 2 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Beitz, E. Subfamily logos: visualization of sequence deviations at alignment positions with high information content. BMC Bioinformatics 7, 313 (2006). https://doi.org/10.1186/1471-2105-7-313

Download citation

Received: 21 March 2006
Accepted: 21 June 2006
Published: 21 June 2006
DOI: https://doi.org/10.1186/1471-2105-7-313

Subfamily logos: visualization of sequence deviations at alignment positions with high information content