AdmixPipe: population analyses in Admixture for non-model organisms

Background Research on the molecular ecology of non-model organisms, while previously constrained, has now been greatly facilitated by the advent of reduced-representation sequencing protocols. However, tools that allow these large datasets to be efficiently parsed are often lacking, or if indeed available, then limited by the necessity of a comparable reference genome as an adjunct. This, of course, can be difficult when working with non-model organisms. Fortunately, pipelines are currently available that avoid this prerequisite, thus allowing data to be a priori parsed. An oft-used molecular ecology program (i.e., Structure), for example, is facilitated by such pipelines, yet they are surprisingly absent for a second program that is similarly popular and computationally more efficient (i.e., Admixture). The two programs differ in that Admixture employs a maximum-likelihood framework whereas Structure uses a Bayesian approach, yet both produce similar results. Given these issues, there is an overriding (and recognized) need among researchers in molecular ecology for bioinformatic software that will not only condense output from replicated Admixture runs, but also infer from these data the optimal number of population clusters (K). Results Here we provide such a program (i.e., AdmixPipe) that (a) filters SNPs to allow the delineation of population structure in Admixture, then (b) parses the output for summarization and graphical representation via Clumpak. Our benchmarks effectively demonstrate how efficient the pipeline is for processing large, non-model datasets generated via double digest restriction-site associated DNA sequencing (ddRAD). Outputs not only parallel those from Structure, but also visualize the variation among individual Admixture runs, so as to facilitate selection of the most appropriate K-value. Conclusions AdmixPipe successfully integrates Admixture analysis with popular variant call format (VCF) filtering software to yield file types readily analyzed by Clumpak. Large population genomic datasets derived from non-model organisms are efficiently analyzed via the parallel-processing capabilities of Admixture. AdmixPipe is distributed under the GNU Public License and freely available for Mac OSX and Linux platforms at: https://github.com/stevemussmann/admixturePipeline.


Background
Advances in genomics during the past decade have accelerated research in molecular ecology by significantly increasing the capacity of researchers to generate vast quantities of data at relatively low cost. These advances largely represent the development of reduced representation genomic libraries [1][2][3] that identify tens of thousands of SNPs for non-model organisms, coupled with high-throughput sequencing methods that efficiently genotype fewer SNPs for thousands of individuals [4]. However, data generation, particularly through these novel and affordable marker-discovery methods [5], has greatly outpaced analytical capabilities, and especially so with regard to evolutionary and conservation genomics.
Technological advances have also precipitated a suite of new analytical issues. The thousands of SNPs generated in a typical RADseq project may exhibit biases that impact the inferences that can be drawn from these data [6], and which necessitate careful data filtration to avoid [7]. Yet, the manner by which data are filtered represents a double-edged sword. While it is certainly mandated (as above), the procedures involved must be carefully evaluated in the context of each study, in that downstream analyses can be seriously impacted [8,9], to include the derivation of population structure [10].
For example, the analysis of multilocus codominant markers in evaluation of population structure is frequently accomplished using methods that make no a priori assumptions about underlying population structure. One of the most popular methods in this regard is the program STRUCTURE [11][12][13]. However, it necessitates that users test specific clustering values (K), and conduct post hoc evaluation of results so as to determine an optimal K [14]. This typically involves searching a complicated parameter space using heuristic algorithms for Maximum Likelihood (ML) and Bayesian (BA) methods that, in turn, provide additional complications such as a tendency to sample local optima [15].
A common mitigation strategy is to sample multiple independent replicates at each K, using different random number seeds for initialization. These results are subsequently collated and evaluated to assess confidence that global rather than local optima have indeed been sampled. Clearly, this procedure must be automated so as to alleviate the onerous task of testing multiple replicates across a range of K-values. Pipelines to do so are available for STRUCTURE, and have been deployed on high-performance computing systems via integrated parallelization (STRAUTO, PARALLELSTRUCTURE) [16,17]. Multiple programs have likewise been developed for handling STRUCTURE output (i.e., CLUMPP, DISTRUCT) [18,19]; and pipelines constructed to assess the most appropriate K-values (i.e., STRUCTUREHARVESTER, CLUMPAK) [20,21].
Despite the considerable focus on STRUCTURE, few such resources have been developed for a popular alternative program (i.e., ADMIXTURE [22]). The Web of Science indexing service indicates that (as of January, 2020) ADMIXTURE has been cited 1812 times since initial publication (September, 2009). This includes 479 (26.4%) in 2019 alone. Despite its popularity, it has just a single option that promotes the program as part of a pipeline (i.e., SNIPLAY3 [23]), which unfortunately requires a reference genome as an adjunct for its application. Needless to say, its applicability is thus limited for those laboratories that employ non-model organisms as study species.
Options for post-processing of ADMIXTURE results are similarly limited, but some packages do exist. One positive is that CLUMPAK is flexible enough in its implementation to allow for the incorporation of ADMIXTURE output, as well as that of STRUCTURE. Alternatively, PONG provides options for processing and visualizing ADMIX-TURE outputs [24]. However, no available software currently exists to summarize variation in cross-validation (CV) values, the preferred method for selecting an optimal Kvalue in ADMIXTURE [25].
Here we describe a novel software package that integrates ADMIXTURE as the primary component of an analytical pipeline that also incorporates the filtering of data as part of its procedure. This, in turn, provides a high-throughput capability that not only generates input for ADMIXTURE but also evaluates the impact of filtering on population structure. ADMIXPIPE also automates the process of testing multiple K-values, conducts replicates at each K, and automatically formats these results as input for the CLUMPAK pipeline. Optional post-processing scripts are also provided as a part of the toolkit to process CLUMPAK output, and to visualize the variability among CV values for independent ADMIXTURE runs. Sections of the pipeline are specifically designed for use with non-model organisms, as these are the dominant study species in evolutionary and conservation genomic investigations.

Implementation
The workflow for ADMIXPIPE is presented in Fig. 1. The pipeline requires two input files: a population map and a standard VCF file. The population map is a tab-delimited text file with each row representing a sample name/ population pair. The VCF file is filtered according to user-specified command line options that include the following: minor allele frequency (MAF) filter, biallelic filtering, data thinning measured in basepairs (bp), and missing data filtering (for both individuals and loci). Users may also remove specific samples from their analysis by designating a file of sample names to be Fig. 1 The workflow for AdmixPipe involves two files as Input: 1) a VCF-formatted file of genotypes, and 2) a tab-delimited population map. These proceed through admixturePipeline.py which handles filtering, file conversion, and execution of Admixture according to user-specified parameters. After completion, the user can submit their output to Clumpak for analysis. The resulting files can then be visualized using distructRerun.py, and variability in cross validation (CV) values is assessed using cvSum.py ignored. All filtering and the initial conversion to PLINK (PED/MAP) format [26] is handled by VCFTOOLS [27].
An important consideration in filtering is mitigation of linkage disequilibrium. VCFTOOLS can calculate linkage disequilibrium statistics, however these do not consider population information, thereby increasing the potential for type I error [28]. PLINK not only suffers from these limitations, but also requires a "window size" input that specifies the lengths of genomic regions within which statistical comparisons among loci are conducted. This is typically inappropriate for non-model organisms due to a lack of whole-genome resources. Non-overlapping contigs produced via reducedrepresentation methods can be short (e.g., 100 bp), making it a reasonable assumption that all SNPs within a contig are linked. Therefore, we suggest specifying a thinning interval in excess of the longest contig length to ensure that ADMIXPIPE samples a single SNP per contig. This method is homologous to solutions implemented in popular RADseq assembly pipelines such as STACKS and IPYRAD to minimize linkage disequilibrium in datasets [29,30].
Additional conversions following the filtering and initial conversion via VCFTOOLS are required before the PLINK-formatted files will be accepted by ADMIXTURE. Popular software packages for de novo assembly of RADseq data, such as pyRAD [29,31] produce VCF files with each locus as an individual "chromosome." As a consequence, these pipelines produce outputs in which the number of "chromosomes" exceeds the number present in the model organisms for which PLINK was originally designed. The initial MAP file is therefore modified to append a letter at the start of each "chromosome" number. PLINK is then executed using the "-allow-extra-chr 0" option that treats loci as unplaced contigs in the final PED/ MAP files submitted to ADMIXTURE.
The main element of the pipeline executes ADMIXTURE on the filtered data. The assessment of multiple K values and multiple replicates is automated, based upon userspecified command line input. The user defines minimum and maximum K values to be tested, in addition to the number of replicates for each K. Users may also specify the number of processor cores to be utilized by ADMIXTURE, and the cross-validation number that is utilized in determining optimal K. The final outputs of the pipeline include a compressed results file and a population file that are ready for direct submission to CLUMPAK for processing and visualization.
The pipeline also offers two accessory scripts for processing of CLUMPAK output. The first (i.e., distructRerun.py) compiles the major clusters identified by CLUMPAK, generates DISTRUCT input files, executes DISTRUCT, and extracts CV-values for all major cluster runs. The second script (i.e., cvSum.py) plots the boxplots of CV-values against each K so as to summarize the distribution of CV-values for multiple ADMIXTURE runs. This permits the user to make an informed decision on the optimal K by graphing how these values vary according to independent ADMIXTURE runs.
ADMIXTURE is the only component of the pipeline that is natively parallelized. Therefore, we performed benchmarking to confirm that processing steps did not significantly increase runtime relative to that expected for ADMIXTURE. Data for benchmarking were selected from a recently published paper that utilized ADMIXPIPE for data processing [32]. The test data contained 343 individuals and 61,910 SNPs. Four data thinning intervals (i.e.,1, 25, 50, and 100) yielded SNP datasets of variable size for performance testing. All filtering intervals were repeated with variable numbers of processor cores (i.e.,1, 2, 4, 8, and 16). Sixteen replicates of ADMIXTURE were first conducted for each K = 1-8 at each combination of thinning interval and number of processor cores, for a total of 20 executions of the pipeline. The process was then repeated for each K = 9-16, for an additional 20 runs of the pipeline. Memory profiling was conducted through the python3 'mprof' package at K = 16, with a thinning interval of 1 as a final test of performance. All tests were completed on a computer equipped with dual Intel Xeon E5-4627 3.30GHz processors, 256GB RAM, and with a 64-bit Linux environment.

Results
The Little change was observed in response to increasing the numbers of processor cores from K = 1-8 (Fig. 2b). A slight decrease in performance was observed in some cases, particularly for the largest dataset. This trend changed at higher K-values, as substantial gains were observed at K = 9-16 ( Fig. 2c) when processors were increased from 1 to 4. The most dramatic performance increase was observed for the 61,910 SNP dataset, where a 24.3-h (34.5%) reduction in computation time occurred when processors increased from 1 to 4. However, only marginal improvements occurred when processors were increased from 1 to 8 (24.5 h; 34.7%) or 16 (26.2 h; 37.7%).
Profiling also revealed efficient and consistent memory usage of ADMIXPIPE. The greatest memory spike occurred during the initial filtering steps, when peak memory usage reached approximately 120 MB. All subsequent usage held constant at~60 MB as ADMIXTURE runs progressed.

Discussion
The performance of ADMIXPIPE improved with the number of processor cores utilized at higher K-values. However, it did not scale at the rate suggested in the original AD-MIXTURE publication. We have been unable to attribute the difference in performance to any inherent property of our pipeline. Filtering and file conversion steps at the initiation of ADMIXPIPE are non-parallel sections. Reported times for completion of these steps were approximately constant across runs, with the maximum being 8 seconds. This indicates that ADMIXTURE itself is the main driver of performance, as it comprises the vast majority of system calls made by ADMIXPIPE.
The original performance increase documented for ADMIXTURE was 392% at K = 3, utilizing four processor cores [25]. Unfortunately, we could not replicate this result with our benchmarking data [32], or the original test data (i.e., 324 samples; 13,928 SNPs) [25] which parallels our own. When we attempted to replicate the original benchmark scores, we found that it also failed to scale as the number of processor cores increased (1-core x = 40.63 s, σ = 0.90; 4-core x = 47.46 s, σ = 4.71). Furthermore, we verified that performance did increase with up to four processor cores at higher K values (K ≥ 9). We therefore view this as 'expected behavior' for ADMIXTURE, and find no reason to believe that ADMIXPIPE has negatively impacted the performance of any individual program.
Results of ADMIXPIPE were similar to those estimated by STRUCTURE for the test dataset, as evaluated in an earlier publication [32], and gauged for the optimum K = 8. This is not surprising, given that ADMIXTURE implements the same likelihood model as does STRUCTURE [22]. However, minor differences have previously been noted for both programs in the assignment probabilities [32,33].
Memory usage was efficient and constant, with the greatest increase occurring when PLINK was executed. Thus, users will be able to execute ADMIXPIPE on their desktop machines for datasets sized similarly to those evaluated herein. Performance gains were minimal with > 4 processors, and this (again) reduces the necessity for supercomputer access, since desktop computers with ≥4 processor cores are now commonplace. However, given the built-in parallelization capabilities of ADMIXTURE, its application on dedicated high-performance computing clusters will be beneficial when runtime considerations are necessary, such as when evaluating K > 8, or SNPs≥20,000.
Finally, our integration of common SNP filtering options provides the flexibility to quickly filter data and assess the manner by which various filtering decisions impact results. A byproduct of the filtering process is the production of a STRUCTURE-formatted file that will facilitate comparisons with other popular algorithms that assess population structure. These options are important tools, particularly given recent documentation regarding the impacts of filtering on downstream analyses. We thus suggest that users implement existing recommendations on filtering RAD data, and use these to investigate subsequent impacts on their own data [7][8][9][10].

Conclusions
Benchmarking has demonstrated that the benefits of ADMIXPIPE (e.g., low memory usage and performance scaling with low numbers of processor cores at high K-values) will prove useful for researchers with limited access to advanced computing resources. ADMIXPIPE also allows the effects of common filtering options to be assessed on population structure of study species by coupling this process with the determination of population structure. Integration with CLUMPAK, and our custom options that allow plotting of data, to include variability in CV-values and customization of population-assignment plots, will facilitate the selection of appropriate K-values and allow variability to be assessed across runs. These benefits will allow researchers to implement recommendations regarding assignment of population structure in their studies, and to accurately report the variability found in their results [34]. In conclusion, ADMIXPIPE is a new tool that successfully fills a contemporary gap found in pipelines that assess population structure. We anticipate that ADMIXPIPE, and its subsequent improvements, will greatly facilitate the analysis of SNP data in non-model organisms. Authors' contributions SMM, MRD, and MED designed the study; SMM and TKC authored the Python code for ADMIXPIPE; TKC and SMM completed data analyses and program testing; all authors contributed in drafting the manuscript, and all approved the final version.

Funding
We acknowledge indirect financial support from the University of Arkansas in the form of university endowments. These include the Bruker Professorship in Life Sciences (MRD), the twenty-first Century Chair in Global Change Biology (MED), a Doctoral Academy Fellowship (SMM), and a Distinguished Doctoral Fellowship (TKC). Funding agencies played no role in the design and/or conclusions of this study.
Availability of data and materials Data utilized for benchmarking was part of an earlier publication, and are available on Data Dryad (https://datadryad. org/stash/dataset/doi:10.5061/dryad.d3q3220). Source code for ADMIXPIPE is released under the GNU General Public License v3.0 at https://github.com/stevemussmann/admixturePipeline. The pipeline will run on Unix-based operating systems such as Mac OSX and Linux. It is compatible with Python 2.7+ and Python 3.5+. Dependencies include other freely available software packages (ADMIXTURE, DISTRUCT, PLINK, and VCFTOOLS).
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.