MapGL: inferring evolutionary gain and loss of short genomic sequence features by phylogenetic maximum parsimony

Background Comparative genomics studies are growing in number partly because of their unique ability to provide insight into shared and divergent biology between species. Of particular interest is the use of phylogenetic methods to infer the evolutionary history of cis-regulatory sequence features, which contribute strongly to phenotypic divergence and are frequently gained and lost in eutherian genomes. Understanding the mechanisms by which cis-regulatory element turnover generate emergent phenotypes is crucial to our understanding of adaptive evolution. Ancestral reconstruction methods can place species-specific cis-regulatory features in their evolutionary context, thus increasing our understanding of the process of regulatory sequence turnover. However, applying these methods to gain and loss of cis-regulatory features historically required complex workflows, preventing widespread adoption by the broad scientific community. Results MapGL simplifies phylogenetic inference of the evolutionary history of short genomic sequence features by combining the necessary steps into a single piece of software with a simple set of inputs and outputs. We show that MapGL can reliably disambiguate the mechanisms underlying differential regulatory sequence content across a broad range of phylogenetic topologies and evolutionary distances. Thus, MapGL provides the necessary context to evaluate how genomic sequence gain and loss contribute to species-specific divergence. Conclusions MapGL makes phylogenetic inference of species-specific sequence gain and loss easy for both expert and non-expert users, making it a powerful tool for gaining novel insights into genome evolution.

: Phylogenetic trees and gain/loss statistics for CTCF binding sites in mammalian, primate, and invertebrate phylogenies Figure S2: Representative gain and loss predictions from the primate and invertebrate analyses

Supplementary Methods
For our pilot analysis, we applied MapGL to a mammalian phylogeny with human (hg19) as the query species and mouse (mm9) as the target species. Dog (canFam2), horse (equCab2), and elephant (loxAfr3) were chosen as outgroups (Fig. S1A). This phylogeny includes species from two major groups of placental mammals: Atlantogenata and Boreoeutheria, and was selected to span an evolutionary distance encompassing the estimated activity dates of major transposable element families under investigation as sources of CTCF binding sites in the human and mouse genomes. Other considerations in selecting outgroup species were genome assembly and alignment quality, branch lengths separating individual outgroup species, and overall branch length separating outgroups from the human-mouse subtree. Horse and dog were chosen because their genome assemblies are of high quality, and they have a similar level of divergence relative to human and mouse. Elephant was chosen as a distal outgroup based on the high quality of its genome assembly and distance from both the human-mouse and horse-dog subtrees, leading to an overall outgroup tree length nearly equal to the human-mouse subtree. For comparison, we applied MapGL to two alternative phylogenies: a primate phylogeny spanning a much shorter evolutionary distance than the mammalian phylogeny, and an invertebrate phylogeny spanning a much larger evolutionary distance than either of the two other phylogenies.
Alignment chains and phylogenetic trees were obtained from the UCSC Genome Browser (1) download portal ( Table S1). The newick tree including only human, mouse, dog, horse, and elephant, was extracted from the 100-way vertebrate phylogeny using tree_doctor, and trees were visualized with draw_tree, both from the Phast software package (2). As part of the parent analysis (3), CTCF ChIP-seq data for two human immune cell lines were obtained from the ENCODE project portal (4) (Supplementary Table 1). These were merged into a union dataset with bedtools merge (5). MapGL was then applied to the union CTCF dataset to label each binding site as an ortholog, gain, or loss. Statistics were gathered in and visualized in R. Results are summarized in Fig. S1B and representative gain and loss events are presented in Figure 2E-F of the main text.
The primate phylogeny consisted of human (hg19) and chimpanzee (panTro4) as the query and target species with rhesus macaque (rheMac3), crab-eating macaque (macFas5), marmoset (calJac3), and squirrel monkey (SaiBol1) as outgroups (Fig. S1C). For this test, we utilized the same human CTCF ChIPseq dataset used in the mammalian analysis. The primate phylogenetic tree was extracted from the 100-way vertebrate phylogeny as described above, and sites were analyzed with MapGL using default options. Results are summarized in Fig. S1D and representative gain and loss events are depicted in Fig. S2A-B.
The invertebrate phylogeny consisted of four Drosophila species: D. melanogaster, D. simulans, D. ananassae, and D. pseudoobscura, and mosquito (A. gambiae) (Fig. S1E). We used CTCF ChIP-seq data for D. melanogaster, obtained from the ENCODE project, as a query dataset, and D. simulans as the target, with the remaining three species as outgroups. As no phylogenetic tree was available from UCSC for the dm3 assembly, we retrieved the 27-way insect phylogeny for the dm6 assembly from the UCSC Genome Browser download portal (1). We extracted the five species used in the invertebrate comparison using tree_doctor (2) and relabeled the "dm6" and "droPse3" nodes as "dm3" and "dp3", respectively, to match assembly names used in this analysis. The resulting tree was visualized using draw_tree (2). CTCF binding sites were then analyzed with MapGL using default options. Data were analyzed for visualization in R, with results presented in Fig. S1F. Typical gain and loss events are presented in Fig. S2C-D.   Figure S2: Representative gain and loss predictions from the primate and invertebrate analyses. Images were captured from the UCSC Genome Browser {Kent:2002ch}. Primate predictions are referenced to the hg19 genome build while invertebrate predictions are referenced to the dm3 genome build. A) A typical human-specific sequence gain prediction found on human chromosome (chr1:196,903,018-196,903,441). Note absence of sequence annotations within chimp, rhesus, crab-eating macaque, rhesus macaque, marmoset, and squirrel monkey multiple alignment and chain tracks. B) A typical chimpanzee-specific sequence loss prediction found on human chromosome 1 (chr1:1,565,635-1,565,850). Note the absence of sequence in chimp sequence in the multiple alignment and chain tracks, whereas sequence alignments and chains are present for all outgroup species. C) Representative D. melanogaster sequence gain prediction found on chromosome 3R (chr3R:4,541,763-4,542,082). Aside from small flanking fragments in D. pseudoobscura, all outgroup species lack sequence in this region, making D. melanogaster-specific sequence gain the most likely explanation for the observed data. D) Representative D. simulans sequence loss prediction found on D. melanogaster chromosome 2R (chr2R:13,559,969-13,560,288). Sequence is present in D. melanogaster, D. ananassae, and D. pseudoobscura, but absent in D. simulans and A. gambiae, making sequence loss on the branch leading to D. simulans the most parsimonious explanation for the observed data.