Fig. 1 | BMC Bioinformatics

Fig. 1

From: MapGL: inferring evolutionary gain and loss of short genomic sequence features by phylogenetic maximum parsimony

The mapGL algorithm.
a Schematic outline of the mapGL algorithm. After initialization, the algorithm loops over query features, performing an initial mapping step against the target species. If the feature maps to the target species, it is labelled as an ortholog and written to output. If not, it enters the ancestral reconstruction stage. The feature is then mapped to each outgroup species in the full phylogeny, and the corresponding leaves are labelled to indicate presence or absence. Internal labels are inferred from the patterns observed at the leaf nodes (see Fig. 2). If the root state cannot be inferred unambiguously, root state disambiguation is performed as shown in panel d. Gain and loss events can then be inferred based on whether a feature is present at the root of the tree. The labelled feature is then written to output. This process is repeated until all query features are labelled.
b Full phylogenetic tree describing evolutionary relationships between the query and target species (nodes 3 and 4) plus three outgroups (nodes 6–8). Query, target, and outgroup species occupy the leaf nodes of the tree. These are the only species for which we can directly observe sequence presence/absence. Internal nodes (0, 1, 2, and 5) represent ancestral species.
c Since we cannot observe internal sequences directly, we must infer sequence presence/absence from present-day observations in the leaf species. The core step of the ancestral reconstruction stage involves labelling all leaf nodes with their observed states and performing a post-order tree traversal to infer the states at internal nodes following the principle of maximum parsimony (MP). The most recent common ancestor (MRCA) occupies the root node (node 0), and the inferred state at this node is returned and used to predict whether query-specific sequences were gained in the query genome or lost from the target genome (see Fig. 2a-b for an example).
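The post-order labelling step can be illustrated with a minimal Python sketch. It is not the mapGL source code: it assumes a Fitch-style set rule (intersection of the child state sets if non-empty, otherwise their union), whereas the exact labelling rules are those given in Fig. 2. The `Node` class, the function names, and the topology ((3,4),(6,7)),8 are hypothetical choices made only to match the node numbering in panel b.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass
class Node:
    """Binary tree node (hypothetical helper, not part of mapGL)."""
    name: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    states: FrozenSet[int] = frozenset()  # filled in by label_postorder

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def label_postorder(node: Node, observed: Dict[int, int]) -> FrozenSet[int]:
    """Label children before parents and return the state set at `node`.

    Leaves take their observed state (1 = present, 0 = absent).
    Internal nodes use the assumed Fitch-style rule: the intersection of
    the child sets if it is non-empty, otherwise their union.
    """
    if node.is_leaf():
        node.states = frozenset({observed[node.name]})
    else:
        left = label_postorder(node.left, observed)
        right = label_postorder(node.right, observed)
        shared = left & right
        node.states = shared if shared else left | right
    return node.states


def classify(root_states: FrozenSet[int]) -> str:
    """Translate the root state of a query-specific feature into an event."""
    if root_states == {1}:
        return "loss"       # present in the MRCA, so lost from the target
    if root_states == {0}:
        return "gain"       # absent in the MRCA, so gained in the query
    return "ambiguous"      # would go on to root state disambiguation


def example_tree() -> Node:
    """Assumed topology ((3,4),(6,7)),8 with internal nodes 0, 1, 2 and 5."""
    return Node(0,
                Node(1,
                     Node(2, Node(3), Node(4)),
                     Node(5, Node(6), Node(7))),
                Node(8))
```

Under these assumptions, a feature observed in the query (node 3) and all three outgroups but not in the target (node 4), i.e. `{3: 1, 4: 0, 6: 1, 7: 1, 8: 1}`, gives a root state of presence and is classified as a loss. The same feature absent from every outgroup gives a root state of absence and is classified as a gain.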
d When the root state cannot be resolved, it is disambiguated following a simple decision tree. In the first step, the larger of the left and right subtrees is chosen. If the state at the base of this subtree is unambiguous, the root state is set to the corresponding state. Otherwise, we check the state at the base of the opposite subtree. If this node is not a leaf node and is labelled with an unambiguous state, we set the root state to the corresponding state. If neither the left nor the right subtree has an unambiguous root state, or if the only unambiguous descendant node is a leaf node, the root state is chosen based on the –priority parameter. If this is set to “gain,” the root state defaults to 0 (sequence absence). If it is set to “loss,” the root state defaults to 1 (sequence presence).
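The decision tree for root state disambiguation can be written out as a short Python sketch. This is an illustration of the caption rather than the mapGL implementation. The `Node` class and the function names are hypothetical. Two details are assumptions because the caption does not specify them: subtree size is measured by leaf count, and the left subtree is chosen when the two sizes are equal. Node states are taken as given (1 = presence, 0 = absence, `None` = ambiguous), however they were inferred.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Binary tree node (hypothetical helper, not part of mapGL)."""
    state: Optional[int] = None  # 1 = present, 0 = absent, None = ambiguous
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def n_leaves(node: Node) -> int:
    """Number of leaves below `node` (assumed measure of subtree size)."""
    if node.is_leaf():
        return 1
    return n_leaves(node.left) + n_leaves(node.right)


def disambiguate_root(root: Node, priority: str = "gain") -> int:
    """Resolve an ambiguous root state following the caption's decision tree."""
    if priority not in ("gain", "loss"):
        raise ValueError("priority must be 'gain' or 'loss'")
    if root.state is not None:
        return root.state  # nothing to disambiguate

    # Step 1: choose the larger subtree (ties go left; an assumption).
    if n_leaves(root.left) >= n_leaves(root.right):
        larger, opposite = root.left, root.right
    else:
        larger, opposite = root.right, root.left

    # Step 2: an unambiguous state at the base of the larger subtree wins.
    if larger.state is not None:
        return larger.state

    # Step 3: otherwise use the opposite subtree, but only if its base
    # is an internal node carrying an unambiguous state.
    if not opposite.is_leaf() and opposite.state is not None:
        return opposite.state

    # Step 4: fall back to the priority setting.
    return 0 if priority == "gain" else 1
```

For example, if the larger subtree is ambiguous and the opposite subtree is a single leaf, the leaf's state is ignored and the result depends only on `priority`: 0 for "gain" and 1 for "loss".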
