fagin: synteny-based phylostratigraphy and finer classification of young genes

Background With every new genome that is sequenced, thousands of species-specific genes (orphans) are found, some originating from ultra-rapid mutations of existing genes, many others originating de novo from non-genic regions of the genome. If some of these genes survive across speciations, then extant organisms will contain a patchwork of genes whose ancestors first appeared at different times. Standard phylostratigraphy, the technique of partitioning genes by their age, is based solely on protein similarity algorithms. However, this approach relies on negative evidence ─ a failure to detect a homolog of a query gene. An alternative approach is to limit the search for homologs to syntenic regions. Then, genes can be positively identified as de novo orphans by tracing them to non-coding sequences in related species. Results We have developed a synteny-based pipeline in the R framework. Fagin determines the genomic context of each query gene in a focal species compared to homologous sequence in target species. We tested the fagin pipeline on two focal species, Arabidopsis thaliana (plus four target species in Brassicaseae) and Saccharomyces cerevisiae (plus six target species in Saccharomyces). Using microsynteny maps, fagin classified the homology relationship of each query gene against each target genome into three main classes, and further subclasses: AAic (has a coding syntenic homolog), NTic (has a non-coding syntenic homolog), and Unknown (has no detected syntenic homolog). fagin inferred over half the “Unknown” A. thaliana query genes, and about 20% for S. cerevisiae, as lacking a syntenic homolog because of local indels or scrambled synteny. Conclusions fagin augments standard phylostratigraphy, and extends synteny-based phylostratigraphy with an automated, customizable, and detailed contextual analysis. By comparing synteny-based phylostrata to standard phylostrata, fagin systematically identifies those orphans and lineage-specific genes that are well-supported to have originated de novo. Analyzing within-species genomes should distinguish orphan genes that may have originated through rapid divergence from de novo orphans. Fagin also delineates whether a gene has no syntenic homolog because of technical or biological reasons. These analyses indicate that some orphans may be associated with regions of high genomic perturbation. Electronic supplementary material The online version of this article (10.1186/s12859-019-3023-y) contains supplementary material, which is available to authorized users.

S2 synder results and summary Table S4: Numeric summary of the lengths of the search intervals inferred by synder [38] between the focal species (A. thaliana and S. cerevisiae) and each of the listed target species for the Brassicaceae (top) and Saccharomyces (bottom) case studies. Column 2-7 refer to the minimum, 25th quantile, median, 75th quantile, and maximum of the block lengths. The final column, N, is the total number of blocks in the synteny map.  fagin infers the search intervals by calling the synder search function. This is a function of a synteny map and four parameters: 1) trans specifies the function needed to transform the score column in the synteny map to one that is additive; 2) k is the number of conflicting syntenic intervals allowed in a block before it is broken (set to 0 by default); 3) r is a score decay rate that is used in calculating a score for each syntenic block created by synder; 4) offsets which specify the input and output bases (0 or 1) of the synteny map. Parameters 1 and 4 are specific to the tool that created the synteny map.
The output of the synder search function is: 1) a set of one or more search intervals for each focal gene (summarized in Table S4) and 2) a description of each search interval consisting of a flag describing each edge of the interval, whether the search interval overlaps a syntenic region in the synteny map, and a relative score for the search interval. The search intervals can be classified as shown in Table S5.

S3 fagin validation and parsing of GFF files
The GFF format is simple but highly error prone and GFFs from difference sources can follow very different conventions. This complicates practical analysis. While more standardized formats exist for storing feature information, GFF has persisted as the most commonly used. For this reason, fagin uses GFF, but also performs extensive validation and cleaning. The most problematic component of GFF is the 9th column that stores tag-value data. It is from these tag-value pairs that we build the gene models that we use to extract the locations of features, the protein sequences for genes, and the RNA sequences of transcripts.
fagin perfoms the following checking and cleaning steps: • Assert that all columns have correct type (see Table S7) • Unify type synonyms (see Table S6) • Handle AUGUSTUS fields. The AUGUSTUS gene prediction program uses the tag 'Other' to represent the Parent relationship. If the source column of the GFF3 file (2nd column) is "AUGUSTUS", then the tag 'Other' will be converted to 'Parent' (with a warning that will be passed through rmonad to the user).
• Assert that each Parent, ID, and Name tag contains a unique value. In the GFF3 spec, a tag can be associated with a comma-delimited list of values. fagin currently does not support this and will raise an error if this case is found.
• Treat Parent tags with a value of '-' as missing.
• Handle unnamed fields. If no ID is given, but there is one untagged field, and if there are no other fields, then cast the untagged field as an ID. This is needed to accommodate the irregular output of AUGUSTUS.
• Assert parent child relations are correct.
-If a feature links to a Parent, then the Parent must exist in the GFF -All Parent IDs must have either type gene or mRNA -All CDS and all exon must have a Parent (gene or mRNA) -All mRNA and IDs must be unique  S4 fagin data extraction from GFF and genome files Get mRNAs. Given the GFF and genome, the extraction of the transcripts (mRNAs) is fairly straightforward. The sequences of all exons are extracted and concatenated. If the sense of the mRNA is negative, the result is then reverse transcribed.
Get proteins. Extracting the protein coding sequences from the genome given the GFF is slightly more involved. The GFF records all Coding Sequences (CDS) and associates each with a parent (an mRNA or gene feature). The CDS may be spread across many exons, and thus be a list of DNA intervals. These DNA intervals can be extracted from the genome and pasted together to form the full CDS. However, there is some nuance to this step. First, if the mRNA is negative sense, the CDS must be reverse transcribed. A more difficult case arises when the initial interval of the CDS does not begin in the correct reading frame. This can happen, for example, when part of the gene model is missing from the assembly. So the first interval in the CDS may begin on the 2nd or 3rd position on the codon. The 'phase' column of the GFF stores the number of nucleotides that must be subtracted from the beginning of a CDS interval to read the first complete codon. fagin stores the phase data and will trim all models that start in a non-zero phase. Thus, partial protein models are allowed.
Once the CDSs, or partial CDSs, have been extracted, they may then they can be translated. fagin uses the translation function from the Biostrings package of the Bioconductor project. By setting if.fuzzyi.codon="solve", it perform a "fuzzy" translation where codons with ambiguous nucleotides (e.g. N for unknown base or Y for pyrimidine) will either be translated as X (if more than one amino acid matches the pattern) or a specific amino acid (if only one amino acid matches). The resulting proteins are given the name of their parent and stored for future use.
Get ORFs in mRNAs and the genome. fagin identifies ORFs in the mRNAs and across the entire genome. ORF identification is limited to the mono-exonic case (i.e. splice sites are not searched for). This is often reasonable since young genes tend to have few or no introns (though there are notable exceptions to this trend [8]). fagin uses the Bioconductor ORFik package [54] to identify the longest, uninterrupted ORF for each stop codon in the genome (or mRNA). For the genome (but not the mRNA) it search both strands. The start and stop codons can be set by the user (the fagin default is START=ATG and STOP=TAA,TGA,TAG). The minimum ORF length can also be set by the user, with the default being 30 amino acids.
fagin currently use the standard gene table for all genes. This would cause problems in animal and fungi mitochondria and other cases where non-standard gene codes appear (plant organelle genomes use the standard gene code).
S5 fagin homology inference statistics fagin calculates sequence similarity for proteins and DNA through Smith-Waterman alignments of the query features of the focal genome (e.g., the gene sequence, spliced mRNA sequence, or translated coding sequence) against the features on the target genome that overlap the search interval. These alignments provide a score, but no direct measure of statistical significance. To estimate the statistical significance of sequence matches, fagin infers p-values from an estimated false-positive distribution.
To simulate the false positive distribution, fagin starts with reversed query sequences. The sequence reversal erases the homology signal while preserving the sequence composition and site-dependencies (assuming site-dependencies are symmetric). For DNA, the order of nucleotides is reversed but not complemented. The idea of using reversed sequences as a control has been explored in the past [55,42]. For proteins, the order of amino acids is reversed. Reversed proteins have been used as a control to test sequence masking algorithms [56]). There is no natural process that will reverse the order of codons in a coding sequence (since this would require many independent tri-nucleotide inversions), or to reverse the order of bases in a nucleotide sequence (inversions would also take the complement of the bases). fagin next compares randomly selected query/target pairs of sequences. For protein searches, k query proteins are randomly sampled with replacement. For each sampled query protein, one target is randomly selected for each target that overlaps a search space on the target genome. For nucleotide searches, the k query genes are each searched against a random search interval. The alignments of the reversed query genes against the randomly chosen target sequences provides scores that approximate draws from the false positive distribution.
The raw Smith-Waterman scores need to be adjusted to account for search space size. The search space size is the product of the length of the focal sequence and the summed lengths of all target sequences. To account for search space size, fagin replaces the score with an adjusted score: where S i is the adjusted score, S i is the original raw alignment score, m i is the length of the focal amino acid sequence, n j is the length jth of the target sequence, and J is the total number of target sequences that will be searched. b 0 and b 1 are the regression coefficients for the robust regression of the raw scores on the search log(mn) where mn is the search space size. Robust linear regression was performed using the R function L1fit from the L1Pack package which implements the Barrodale-Roberts algorithm for L1 linear approximation [57].
Once adjusted scores are obtained against random search intervals, fagin fits the simulated highest scores for each query gene to a Gumbel distribution (a model of maximum values) using the R packages fitdistrplus [58] (using the maximum goodness-of-fit estimation with Cramer-von Mises distance), and calculates one-sided p-values for observed hits from this distribution. Finally, the p-value for each query is adjusted for the number of target sequences (across all species) that it is searched against (using the Holm method by default).
fagin uses the local Smith-Waterman algorithm, as implemented in the Bioconductor Biostrings package [59], to align each query gene to each of the similar sequences found in its target-genome search interval(s). For protein sequences, fagin uses the BLOSUM80 substitution matrix by default (the user may choose a different one). Nucleotide sequences are aligned with a local Smith-Waterman algorithm with a gap opening penalty of 10 and gap extension penalty of 4.