Subfamily logos: visualization of sequence deviations at alignment positions with high information content

Background: Recognition of relevant sequence deviations can be valuable for elucidating functional differences between protein subfamilies. Interesting residues at highly conserved positions can then be mutated and experimentally analyzed. However, identification of such sites is tedious because automated approaches are scarce.

Results: Subfamily logos visualize subfamily-specific sequence deviations. The display is similar to classical sequence logos but extends into the negative range. Positive, upright characters correspond to residues which are characteristic for the subfamily; negative, upside-down characters correspond to residues typical for the remaining sequences. The symbol height is adjusted to the information content of the alignment position. Residues which are conserved throughout do not appear.

Conclusion: Subfamily logos provide an intuitive display of relevant sequence deviations. The method was validated using a set of 135 aligned aquaporin sequences in which established subfamily-specific positions were readily identified by the algorithm.


Background
Most protein families can be divided into functionally distinct subfamilies. Such subfamilies exhibit characteristic properties which manifest, for instance, as binding specificity of regulatory proteins, substrate specificity of enzymes, and pore selectivity of channels and transporters. Functional differences are often linked to sequence characteristics in regions which are conserved throughout the protein superfamily. This is because conserved domains define the fold of the functional protein core or provide catalytic residues. Recognition of subfamily-specific deviations at such sites can be valuable for elucidating mechanistic principles of the protein family by site-directed mutagenesis and subsequent functional analysis of the mutants. An automated approach to identify relevant deviations should (i) provide the ability to take into account a large number of reference sequences, (ii) determine sequence conservation, i.e. positions of high information content, and (iii) visualize deviations, i.e. subfamily characteristics, relative to the information content in a graphical output which is easy to comprehend.

Implementation
One sophisticated way of presenting sequence conservation is to display a sequence logo [6]. Here, the information content $I(P_i)$ of each alignment position $i$ is defined inverse to the uncertainty $H(P_i)$ by the equation

$$I(P_i) = \log_2 |\Sigma| - H(P_i) = \log_2 |\Sigma| + \sum_{j} P_{ij} \log_2 P_{ij}$$

with $|\Sigma|$ being the cardinality of the alphabet used, i.e. 4 for DNA and 20 for protein sequences, and $P_{ij}$ being the frequency of residue $j$ at this position (variables according to [7]). Each position is displayed as a stack of residue symbols whose heights $l_{ij}$ represent their proportion of the information content:

$$l_{ij} = P_{ij} \cdot I(P_i)$$

Protein sequence logos are often adjusted to the background frequency of each amino acid in the alignment [7]. For simplicity, the variable name $I(P_i)$ will be used in the following for the information content both with and without frequency correction. Both approaches are compatible with subfamily logos and have been implemented in the algorithm.
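The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the TeXshade implementation: the function names `information_content` and `logo_heights` are hypothetical, the column is assumed to contain no gap characters, and no background-frequency correction is applied.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Return I(P_i) in bits for one alignment column.

    Assumption: `column` is a sequence of residue symbols without gaps,
    and no background-frequency correction is applied.
    """
    n = len(column)
    counts = Counter(column)
    # Uncertainty H(P_i) = -sum_j P_ij * log2(P_ij) over observed residues.
    uncertainty = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - uncertainty


def logo_heights(column, alphabet_size=20):
    """Return {residue: l_ij} with l_ij = P_ij * I(P_i)."""
    n = len(column)
    info = information_content(column, alphabet_size)
    return {res: (c / n) * info for res, c in Counter(column).items()}


if __name__ == "__main__":
    print(information_content("AAAA"))   # fully conserved protein column
    print(logo_heights("AABB"))          # two residues, equally frequent
```

The symbol heights of one column always sum to the information content of that column, which is why a fully conserved position produces the tallest stack.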
In contrast to a sequence logo, which depicts sequence conservation, the aim here is to display the relevance of deviations at conserved positions. The recently published pairwise HMM logo approach aligns the sequence logos of two subfamilies [8]. This certainly facilitates the identification of relevant deviant positions, but one still has to inspect position by position and judge different symbol heights by eye. Subfamily logos provide a very intuitive display. They are derived by subtracting from the frequency $S_{ij}$ of a residue $j$ within a pre-defined subset of sequences, i.e. a subfamily, the frequency $R_{ij}$ of this residue in the remaining set of sequences for each position $i$. The difference is then weighted by the overall information content $I(P_i)$ computed from all sequences, and the residue is plotted with the resulting symbol height $s_{ij}$. The term $(S_{ij} - R_{ij})$ gives values from -1 to 1. Positive values correspond to residues which are characteristic for the subfamily (shown upright in the output), negative values to those that are typical for the remaining sequences (shown upside-down). Positions with an equal distribution of residue $j$ result in a zero value.
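The weighting described above can be sketched as follows. This is an illustrative simplification under stated assumptions: the helper names `frequencies` and `subfamily_heights` are hypothetical, gaps are assumed absent, and only the plain product of the frequency difference and the overall information content is computed. The correction factor discussed in the text is deliberately left out, so the values are not the final symbol heights of the published algorithm.

```python
import math
from collections import Counter


def frequencies(residues):
    """Relative frequency of each residue in a list or string of symbols."""
    n = len(residues)
    return {res: c / n for res, c in Counter(residues).items()}


def subfamily_heights(subfamily_col, remaining_col, alphabet_size=20):
    """Return {residue: (S_ij - R_ij) * I(P_i)} for one alignment position.

    Assumption: uncorrected weighting only; the correction factor
    mentioned in the text is NOT applied here.
    """
    s = frequencies(subfamily_col)                       # S_ij
    r = frequencies(remaining_col)                       # R_ij
    p = frequencies(list(subfamily_col) + list(remaining_col))  # P_ij
    info = math.log2(alphabet_size) + sum(
        f * math.log2(f) for f in p.values()
    )
    return {j: (s.get(j, 0.0) - r.get(j, 0.0)) * info for j in p}


if __name__ == "__main__":
    # Subfamily carries A, remaining sequences carry B: A is upright, B upside-down.
    print(subfamily_heights("AAAA", "BBBB"))
    # Residue conserved throughout: the difference is zero, so nothing is drawn.
    print(subfamily_heights("GG", "GGG"))
```

A positive value would be drawn upright and a negative value upside-down; a residue that is equally frequent in both groups yields zero and therefore does not appear.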
The need for a correction factor is illustrated by the following example. Assume an alignment with an equal number of sequences in the subfamily and in the remaining set of sequences. Further, assume a position $i$ within the alignment where all sequences in the subfamily carry amino acid $a$ and all remaining sequences carry amino acid $b$ with $a \neq b$. This situation can be considered the best possible discrimination between the subfamily and the remaining set of sequences and results in the frequencies $P_{ia} = 0.5$, $P_{ib} = 0.5$ and all other $P_{ij} = 0$. The overall information content at this position is thus $I(P_i) = \log_2 20 + 0.5 \log_2 0.5 + 0.5 \log_2 0.5 = \log_2 20 - 1$, i.e. one bit less than the maximal information content. For either group of sequences, however, the information content should be maximal due to the frequencies $S_{ia} = 1$ (subfamily) and $R_{ib} = 1$ (remaining sequences). The decrease in the apparent information content depends on the fraction of sequences in the subfamily and in the remaining set. Hence, a correction factor was introduced, which follows the form shown in the example above and corrects for the described error.
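The arithmetic of this example can be checked numerically, and the same calculation shows how the apparent loss of information varies with the relative size of the subfamily. The sketch below is only a check of the worked example: the function name `perfect_split_information` is hypothetical, and the code does not claim to reproduce the exact correction factor used by the algorithm.

```python
import math


def perfect_split_information(subfamily_fraction, alphabet_size=20):
    """I(P_i) when the subfamily carries only residue a and the rest only b.

    `subfamily_fraction` is the share of all sequences that belong to the
    subfamily and must lie strictly between 0 and 1.
    """
    f = subfamily_fraction
    return (
        math.log2(alphabet_size)
        + f * math.log2(f)
        + (1 - f) * math.log2(1 - f)
    )


if __name__ == "__main__":
    max_info = math.log2(20)
    for f in (0.5, 0.25, 0.1):
        deficit = max_info - perfect_split_information(f)
        print(f"subfamily fraction {f:.2f}: deficit {deficit:.3f} bits")
```

For equally sized groups the deficit is exactly one bit, as stated in the text; it shrinks as the two groups become more unequal in size.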

Results and discussion
residues due to the lower information content. Nevertheless, subfamily characteristics are still visible if relevant, e.g. at positions 43, 182, and 202. The algorithm further accepts a threshold bit-value above which a deviant residue is additionally highlighted by a symbol (asterisks in Fig. 1). Empirically, this value is set to $\log_2 5$ (2.322 bits) for proteins, which corresponds to the presence of one particular residue in 25% of all sequences or 50% of the subfamily, and to $\log_2 2$ (1 bit) for DNA sequences. The threshold value can be adjusted manually to match the alignment in question. In the future, it may also be used to indicate statistical evaluations of the residue distribution. Inherently, best results are obtained when only two subfamilies are compared.
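The thresholding step can be illustrated with a short sketch. The default values are the ones given in the text; everything else is an assumption made for illustration: the function name `highlighted`, the input format (a mapping from residue to symbol height in bits), and the choice to mark only upright, i.e. positive, symbols.

```python
import math

# Default thresholds as stated in the text.
PROTEIN_THRESHOLD_BITS = math.log2(5)  # about 2.322 bits
DNA_THRESHOLD_BITS = math.log2(2)      # 1 bit


def highlighted(heights, threshold_bits=PROTEIN_THRESHOLD_BITS):
    """Return the residues whose symbol height exceeds the threshold.

    Assumption: `heights` maps residue -> symbol height in bits, and only
    upright (positive) symbols are candidates for highlighting.
    """
    return sorted(res for res, h in heights.items() if h > threshold_bits)


if __name__ == "__main__":
    position = {"A": 3.0, "B": -3.0, "C": 1.0}  # made-up heights
    print(highlighted(position))
    print(highlighted(position, threshold_bits=0.5))
```

Passing a different `threshold_bits` value corresponds to the manual adjustment of the threshold mentioned above.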
Currently, subfamily logos are implemented in TeXshade [see additional files 1 and 2], a LaTeX macro package for setting and shading multiple sequence alignments [1]. Sample code is displayed in Fig. 2, showing that a small number of commands is sufficient to produce satisfactory output. TeXshade provides numerous additional commands for individual adjustments of the output and comprehensive labeling. However, implementation of a subfamily logo extension into software that provides a graphical user interface and TeXshade output, such as STRAP [3] or the San Diego Supercomputer Center Biology WorkBench http://workbench.sdsc.edu/, is strongly encouraged. Further, integration of the subfamily logo algorithm into local or web-based sequence logo plotting tools should be straightforward.

Conclusion
Subfamily logos are an extension to the classical application of sequence logos. They provide a novel tool to intuitively visualize subfamily sequence characteristics. The validity of the method was confirmed by analysis of 135 aligned aquaporin sequences and correct identification of subfamily-specific sequence deviations. Their relationship to sequence logos makes it easy to integrate them into existing logo software.

Figure 1. Subfamily logos in comparison to classical sequence logos. Sections of three aquaporin subfamilies are shown, i.e. water/glycerol channels (GlpFs), water-specific channels (AQPs), and tonoplast intrinsic proteins (TIPs). Subfamily-specific residues are displayed upright, residues that are typical for the remaining sequences as tinted upside-down characters. The unit of the ordinates is bits. Triangles mark known positions of relevant subfamily-specific deviations. Asterisks were computed by the subfamily logo algorithm to label subfamily-specific residues.