Figure 3 | BMC Bioinformatics

Figure 3

From: Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures

The amcBPPS procedural substeps used to obtain a hierarchy from a multiple alignment. Starting from a multiple sequence alignment for a particular protein domain, the amcBPPS program applies the following substeps (‘a’ to ‘e’) to create a domain hierarchy. Note that substep (a) corresponds to Step 1 of the amcBPPS algorithm, whereas the other substeps correspond to Step 2.
(a) Use heuristic procedures to create distinct FD-tables, corresponding to a forest of simple (rooted, branchless) trees; each leaf of a given tree corresponds to a distinct subgroup within the protein class. (The mcBPPS sampler is used to optimally assign sequences to each leaf node; different prior probability settings can be used to favor convergence on subfamilies, families or superfamilies.)
(b) Select leaf nodes from the forest corresponding to more or less distinct, functionally divergent subgroups; this is done by combining each set of nearly identical nodes into a single set. Define a root node (labeled R in the figure) corresponding to the universal sequence set. Larger superfamily nodes (labeled with red integers) are also created from related leaf nodes. The haze around nodes indicates the partially overlapping nature (i.e., fuzziness) of the corresponding sequence sets.
(c) Generate a directed acyclic graph (DAG) representing superset-to-subset relationships between nodes, with arcs weighted by (the negative of) the corresponding log-likelihood ratios (LLRs) associated with the BPPS statistical model. For clarity, nodes and arcs directly connected to the root are shown in orange, whereas other (non-root) nodes are uniquely colored.
(d) Obtain from the DAG a shortest path spanning tree using a breadth-first scanning algorithm [45]. Because the arcs are weighted using LLRs, this procedure returns a maximum likelihood tree associated with the DAG.
(e) Prune nodes that are both directly attached to the root and overlap significantly with other nodes, and that thus correspond to ill-defined sequence sets. For the remaining nodes, remove the overlap between their corresponding sequence sets (see text for details) and prune from the tree those nodes that lack a minimum number of sequences (30 by default). This typically yields a reduced hierarchy (as shown), which is converted into an FD-table (as illustrated in Figure 2) for optimization by the mcBPPS sampler.
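
Substeps (c) to (e) can be illustrated with a minimal Python sketch. It is not the amcBPPS implementation: the function names, the node labels, the LLR values and the sequence counts are all hypothetical, and the queue-based label-correcting search is only one plausible reading of "breadth-first scanning". The overlap test and overlap removal of substep (e) are not specified in the caption and are omitted; reattaching the children of a pruned node to its nearest surviving ancestor is likewise an assumption.

```python
from collections import deque


def shortest_path_tree(arcs, root):
    """Return (dist, parent) for a DAG given as {(superset, subset): LLR}.

    Each arc is weighted by the negative of its LLR, so the shortest path
    to a node is the path with the largest summed LLR.
    """
    out = {}
    for (u, v), llr in arcs.items():
        out.setdefault(u, []).append((v, -llr))

    dist = {root: 0.0}
    parent = {root: None}
    queue = deque([root])   # FIFO queue: nodes are scanned breadth-first
    queued = {root}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        for v, w in out.get(u, []):
            cand = dist[u] + w
            if v not in dist or cand < dist[v]:
                dist[v] = cand
                parent[v] = u
                if v not in queued:   # rescan v so its descendants update
                    queue.append(v)
                    queued.add(v)
    return dist, parent


def prune_small_nodes(parent, sizes, min_size=30):
    """Drop non-root nodes with fewer than min_size sequences.

    Assumption: children of a pruned node are reattached to the nearest
    surviving ancestor.
    """
    keep = {n for n, p in parent.items() if p is None or sizes[n] >= min_size}
    pruned = {}
    for n in keep:
        p = parent[n]
        while p is not None and p not in keep:
            p = parent[p]
        pruned[n] = p
    return pruned


# Hypothetical DAG: root R, superfamily node "1", subgroup nodes A, B, C.
arcs = {
    ("R", "1"): 50.0,
    ("R", "A"): 20.0,
    ("1", "A"): 40.0,
    ("R", "B"): 30.0,
    ("A", "C"): 10.0,
}
dist, parent = shortest_path_tree(arcs, "R")

# Hypothetical per-node sequence counts after overlap removal.
sizes = {"R": 500, "1": 200, "A": 12, "B": 80, "C": 45}
tree = prune_small_nodes(parent, sizes)
```

In this toy example node A is reached through node "1" rather than directly from R, because the path R → 1 → A has the larger summed LLR (90 versus 20). Node A is then pruned for holding fewer than 30 sequences, and its child C is reattached to node "1".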
