MapGL applies a simple phylogenetic inference approach based on Wagner parsimony [11] to infer the evolutionary history of short genomic sequence features from a query genome relative to a target genome (Fig. 1a). This approach seeks to minimize the number of state-changes (i.e., sequence gains and losses) necessary to explain the pattern of sequence presence/absence in a multiple-alignment of contemporary sequences. Briefly, for each query feature, an initial mapping step to the target genome determines whether an orthologous sequence exists. If so, the sequence is labeled as an ortholog and written to output. Otherwise, the ancestral state is inferred by projecting observed data from the query, target, and outgroup species onto a precomputed, fixed phylogeny describing the evolutionary relationships between all present-day and ancestral species (Fig. 1b). Ancestral state inference proceeds using an adaptation of Fitch’s Algorithm [12], whereby the most-likely ancestral state is chosen to minimize the total number of gain/loss events required to explain the pattern of sequence presence/absence in the observed data [12] (Fig. 1c). Query sequences are first mapped to each outgroup and state labels recorded at the corresponding leaf nodes: “1” if an orthologous sequence exists or “0” if not. Next, a post-order traversal is performed to infer states at all internal nodes, choosing the most-frequent state observed among child nodes, or storing the union set of observed symbols in case of ties. The inferred state at the root node, representing the MRCA, is then used to infer whether sequences were gained on the branch leading to the query species (i.e., were absent in the MRCA) or lost on the branch leading to the target species (i.e., were present in the MRCA) (Fig. 2a-b).
In cases where the root state cannot be resolved, we set the root state equal to the unambiguous descendant state supported by the subtree with the most leaves. In the case of a single terminal outgroup species, where the right subtree has only one leaf, this is equivalent to pruning the most-terminal outgroup at the branch directly above the root (Fig. 1d, Fig. 2c-d). When this fails, we choose the root state based on the value of the –priority (−p) parameter that prioritizes sequence gain or loss. Setting this to “gain” is equivalent to assigning a lower evolutionary cost to sequence insertions compared to deletions. The root state will then default to “0” in ambiguous cases and sequence presence in descendant nodes will be interpreted as sequence gain. Setting this to “loss” will have the opposite effect. This behavior can be disabled via the –no_prune (−n) option, in which case such sequences will be labeled as “ambiguous.” Optionally, an additional pre-order tree traversal can be performed to infer and report all branch(es) on which sequence gain/loss events occurred throughout the phylogeny.
MapGL takes as inputs a set of genomic query features in BED format, a set of liftover chains [13] corresponding to the target and outgroup species, and a Newick tree describing the full phylogeny. The program is self-documenting and ships with a small example dataset. MapGL can be tuned through several command-line options. The most important of these are –threshold (−t) and –gap (−g). The -t < FLOAT> option specifies the amount of overlap required between query features and target chains, expressed as the fraction of query nucleotides overlapping a target chain, to call a match as a function of the query element length. The default value of 0.0 is equivalent to requiring a single-base overlap while a value of 1 would enforce only full-length alignments. The -g < INT> option specifies the tolerance for gaps and takes an integer specifying the maximum tolerable gap length, in nucleotides, as its argument. The default value of −1 indicates that gaps of any length are allowed. The default values maximize the algorithm’s ability to find orthologous sequences, making gain and loss predictions very conservative. However, these values may lead to erroneous ortholog predictions in cases where aligned features contain embedded indels, may be overly conservative when comparing closely related genomes, and may lead to underprediction of sequence gains/losses in poorly aligned regions. Specifying a non-zero mapping threshold and a gap tolerance near the average gap length in such regions may increase the sensitivity of MapGL to detect gain and loss events, albeit at the expense of reducing its sensitivity for orthologs. Nonetheless, as the prototypical use of this program is to identify whole sequence features that have undergone species-specific gain or loss, the default values are likely reasonable for most users.
MapGL is free and open-sourced under the MIT License. It is based on bnMapper [14], which maps genomic sequences across species based on liftover chains [13]. We retained the core mapping steps from bnMapper with one modification: while bnMapper drops alignments that are split across chains, we keep the longest alignment among all chains represented. This behavior can be controlled with the –drop_split (−d) option. We extended this framework by adding the ability to map against multiple liftover chains to incorporate target and outgroup species and added the ancestral reconstruction steps outlined above. Tree parsing and phylogenetic methods were adapted from the python-newick package (https://pypi.org/project/newick/). MapGL is compatible with MacOS and most Unix-like operating systems, and is available for installation through PyPI and Bioconda, with source code available from the GitHub repository (https://github.com/adadiehl/mapGL).
We present three case studies spanning a wide variety of phylogenetic tree topologies and evolutionary distances to establish the performance of MapGL. We obtained CTCF binding locations in one of two query species: human and D. melanogaster. Human data were compared to mouse, a distant target species, in the mammalian analysis (Fig. S1A), and chimpanzee, a closely related target species, in the primate analysis (Fig. S1C). D. melanogaster data were compared to D. simulans, which are roughly three times more divergent than human and chimpanzee, in the invertebrate analysis (Fig. S1E). Human CTCF data were originally analyzed as part of a study on the effects of transposable element (TE) activity on CTCF binding site content in mammalian genomes [15]. In keeping with this, dog, horse, and elephant were chosen as outgroups to yield phylogenetic tree that encompasses two major groups of mammals: Atlantogenata and Boreoeutheria, spanning the primary activity periods for several TE types associated with CTCF binding (Fig. S1A). Methods for all three analyses are described in detail within the Supplementary Methods.