MI-MAAP: marker informativeness for multi-ancestry admixed populations

Background Admixed populations arise when two or more previously isolated populations interbreed. A powerful approach to addressing the genetic complexity of admixed populations is to infer ancestry. Ancestry inference, which includes estimating the proportion of an individual’s genome derived from each ancestral population and the ancestral origin of each chromosomal segment, requires ancestry informative markers (AIMs) from reference ancestral populations. AIMs exhibit substantial differences in allele frequency between ancestral populations. Given the huge amount of human genetic variation data available from diverse populations, a computationally feasible and cost-effective approach to extracting or filtering AIMs with the maximum information content is becoming increasingly important for ancestry inference, admixture mapping, forensic applications, and the detection of genomic regions that have been under recent selection.
Results To address this gap, we present MI-MAAP, an easy-to-use web-based bioinformatics tool designed to prioritize informative markers for multi-ancestry admixed populations by utilizing feature selection methods and multiple genomics resources, including the 1000 Genomes Project and the Human Genome Diversity Project. Specifically, this tool implements a novel allele frequency-based feature selection algorithm, Lancaster Estimator of Independence (LEI), as well as genotype-based methods such as Principal Component Analysis (PCA), Support Vector Machine (SVM), and Random Forest (RF). We demonstrated that MI-MAAP is a useful tool for prioritizing informative markers and accurately classifying ancestral populations. LEI is an efficient feature selection strategy for retrieving ancestry informative variants whose allele frequency or selection pressure differs among (or between) ancestries, without requiring computationally expensive individual-level genotype data.
Conclusions MI-MAAP has a user-friendly interface which provides researchers an easy and fast way to filter and identify AIMs. MI-MAAP can be accessed at https://research.cchmc.org/mershalab/MI-MAAP/login/.


MI-MAAP User Manual
MI-MAAP is an easy-to-use web-based bioinformatics tool designed for analyzing informative markers for multi-ancestry admixed populations by utilizing feature selection methods and retrieving the associated SNP or gene information from multiple public resources. It integrates a novel allele frequency-based feature selection algorithm, Lancaster Independence Estimator (LIE; called the Lancaster Estimator of Independence, LEI, in the abstract above), as well as genotype-based methods such as PCA, SVM, and Random Forest. LIE is an efficient feature selection strategy for identifying significant markers from multiple ancestral populations without requiring individual-level genotype data, which are usually massive and computationally expensive to process. MI-MAAP has a user-friendly interface that gives researchers an easy and fast way to identify and analyze ancestry informative markers (AIMs).

Login Page
Users must sign in with their email address to access MI-MAAP. First-time users need to register by providing a valid email address and a password.

Use Public Database
When users choose one of the publicly available genome databases, the next section asks them to select two or more populations from the selected database.
The population information for the four provided databases is shown below.
To input markers, users can choose a chromosome number to retrieve all SNPs on that chromosome. Alternatively, in the text area that follows, users can type or paste a list of SNP IDs (one SNP per row) or a single gene name.
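A pasted SNP list of the kind described above can be tidied before submission. The following is only a sketch: the function name `clean_snp_list` and the rule that a valid ID is `rs` followed by digits are assumptions made for illustration, not a description of how MI-MAAP validates its input.

```python
import re

# Assumed pattern for a dbSNP-style identifier: "rs" followed by digits.
RSID = re.compile(r"^rs\d+$", re.IGNORECASE)


def clean_snp_list(text):
    """Split pasted text into unique rsIDs (input order kept) and rejected lines."""
    valid, rejected, seen = [], [], set()
    for line in text.splitlines():
        token = line.strip()
        if not token:
            continue  # skip blank rows
        if RSID.match(token):
            rsid = token.lower()
            if rsid not in seen:
                seen.add(rsid)
                valid.append(rsid)
        else:
            rejected.append(token)
    return valid, rejected


# Example input: one ID per row, with a blank row, a duplicate and a non-rsID entry.
pasted = "rs58468071\n rs123 \n\nchr22:100\nrs123\n"
valid_ids, rejected_lines = clean_snp_list(pasted)
```

Rejected lines are returned rather than silently dropped so that typing mistakes can be corrected before the list is pasted into the form.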

Use User Generated Input
When users choose to upload their own data files, a section called "User Defined Input" appears below the Database Selection. In this section, users can upload their own allele frequency data or individual-level genotype data. The upload field accepts .txt (tab-delimited), .csv, or .xlsx files as input.
When allele frequency data are uploaded, users need to enter the number of samples for every population in the data file, using the dynamic form that appears with the upload field. The sample sizes must be entered in the same population order as in the data file. After the "Display" button is clicked, LIE values are computed for each marker in the file. When genotype data are provided, PCA, SVM and Random Forest are used to analyze the input variants. The sample files of the allele frequency data and the genotype data are shown below:

Sample file of the allele frequency data
Sample file of the genotype data
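To make the allele frequency input concrete, the sketch below reads a small tab-delimited table and checks that one sample size is supplied per population, in file order. The column layout, the SNP IDs, the frequencies and the sample sizes are all made up for illustration and may differ from the sample file. Because the LIE formula is not given in this manual, the score computed here is a plain stand-in (the largest pairwise allele frequency difference), not LIE, and it does not use the sample sizes.

```python
import csv
import io
from itertools import combinations

# Hypothetical table: one row per SNP, one allele frequency column per population.
SAMPLE = (
    "SNP\tCEU\tCHB\tYRI\n"
    "rs0001\t0.10\t0.15\t0.90\n"
    "rs0002\t0.50\t0.52\t0.48\n"
)


def read_allele_frequencies(text, sample_sizes):
    """Parse the table and require one sample size per population column."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    populations = header[1:]
    if len(sample_sizes) != len(populations):
        raise ValueError("need one sample size per population, in file order")
    freqs = {row[0]: [float(value) for value in row[1:]] for row in body}
    return populations, freqs


def max_frequency_difference(freqs):
    """Largest pairwise absolute allele frequency difference (stand-in score, not LIE)."""
    return max(abs(a - b) for a, b in combinations(freqs, 2))


# Sample sizes are arbitrary placeholders.
populations, table = read_allele_frequencies(SAMPLE, sample_sizes=[100, 100, 100])
scores = {snp: max_frequency_difference(values) for snp, values in table.items()}
```

In this toy table the first marker differs strongly between populations and the second barely at all, which is the contrast an allele frequency-based method is meant to pick up.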

Select Threshold for the Feature Selection Method (optional)
Users can either select a provided threshold value using the radio buttons or enter a custom value. When three or more populations are involved, LIE values range from 0 to 2; when only two populations are involved, LIE values range from 0 to 1. A marker with an LIE value of 0 carries no ancestry information.
When user-generated genotype data are uploaded, users can also select thresholds for the PCA, SVM and Random Forest algorithms by clicking the 'Others' tab.
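The threshold step can be pictured as a simple filter over per-marker LIE values. The sketch below takes the value ranges from the text above (0 to 1 for two populations, 0 to 2 for three or more). The function names, the example values and the choice of an inclusive cut-off (score >= threshold) are assumptions, since the manual does not say whether the comparison is strict.

```python
def lie_upper_bound(n_populations):
    """Upper end of the LIE range: 1 for two populations, 2 for three or more."""
    if n_populations < 2:
        raise ValueError("at least two populations are required")
    return 1.0 if n_populations == 2 else 2.0


def filter_by_threshold(scores, threshold, n_populations):
    """Keep markers scoring at least `threshold` (inclusive cut-off is an assumption)."""
    upper = lie_upper_bound(n_populations)
    if not 0.0 <= threshold <= upper:
        raise ValueError(f"threshold must lie in [0, {upper}]")
    return {snp: score for snp, score in scores.items() if score >= threshold}


# Made-up LIE values for a three-population comparison.
example_scores = {"rs0001": 1.35, "rs0002": 0.02, "rs0003": 0.60}
kept = filter_by_threshold(example_scores, threshold=0.5, n_populations=3)
```

Checking the threshold against the population-dependent range catches a common slip, such as reusing a three-population cut-off above 1 in a two-population run, where it would discard every marker.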

Select Spacing Between Markers (optional)
Additionally, users can specify the minimum physical distance between markers, either by selecting a value from the dropdown menu or by entering a custom value in kb (1 kb = 1000 base pairs).
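One way to enforce a spacing constraint is greedy thinning: visit markers from the highest score to the lowest and keep a marker only if it lies at least the chosen distance from every marker already kept on the same chromosome. This is an illustrative sketch, not necessarily the procedure MI-MAAP uses; the function name, tuple layout, positions and scores are hypothetical.

```python
def thin_by_spacing(markers, min_spacing_kb):
    """Greedily keep high-scoring markers that are at least `min_spacing_kb` apart.

    Each marker is a tuple (snp_id, chromosome, position_bp, score).
    """
    min_bp = min_spacing_kb * 1000  # 1 kb = 1000 base pairs
    kept = []
    # Visit markers from the most to the least informative.
    for snp, chrom, pos, score in sorted(markers, key=lambda m: -m[3]):
        far_enough = all(
            c != chrom or abs(pos - p) >= min_bp for _, c, p, _ in kept
        )
        if far_enough:
            kept.append((snp, chrom, pos, score))
    # Report the result in genomic order.
    return sorted(kept, key=lambda m: (m[1], m[2]))


# Made-up markers on chromosome 22.
markers = [
    ("rs0001", "22", 16_050_000, 1.20),
    ("rs0002", "22", 16_060_000, 1.50),
    ("rs0003", "22", 16_200_000, 0.80),
    ("rs0004", "22", 16_250_000, 0.90),
]
selected = thin_by_spacing(markers, min_spacing_kb=100)
```

With a 100 kb spacing, rs0002 and rs0004 are kept: rs0001 lies 10 kb from the higher-scoring rs0002, and rs0003 lies 50 kb from the higher-scoring rs0004.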

Attributes (optional)
MI-MAAP also lets users select attributes to report for the target SNPs.
The available attributes are grouped into eight categories: SNP information (chromosome, alleles, MAF, functional class, Regulome Score, TSS Score, and links to resources such as GWAS Catalog, dbGaP, Exome variant and Genome variant), gene information (such as gene ID, gene symbol, synonyms, gene description, CpG sites and mapped diseases), genome and variation (links to ENCODE, dbVar, ClinVar and BioGPS), gene expression (such as GEO profiles, GTEx eQTL and Blood eQTL), biological pathways (links to KEGG pathways, Reactome and BioCarta), gene ontology (cellular component, biological process and molecular function), protein (links to UniProt, Protein Atlas, PFAM and SMART), and species orthologs (such as Entrez IDs for chimp, rhesus, mouse, rat, zebrafish, cattle, chicken, dog and frog).
Checking the "Select All Attributes" checkbox selects all attributes automatically.

Output
Clicking the "Display" button submits all the input information. Once the analysis is finished, a result table is generated on the output page.

Output Page
The table below shows sample output from the 1000 Genomes Project database: 5 SNPs on chromosome 22 for three populations (CEU, CHB and YRI), with the LIE threshold set to 0.
Five buttons at the top of the table give access to the output: the table can be copied using the "Copy" button, downloaded using the "Excel", "CSV" or "PDF" buttons, and printed using the "Print" button. The number of rows displayed in the table can be adjusted using the "Show 15 entries" dropdown menu.
The left column shows all the attributes that users selected on the input form page. To view the attribute information for a SNP, users can click its rsID (shown in blue) in the output table.

SNP Attribute Page
As an example, the attribute information page for rs58468071 is shown below.

An Example of Racial Ancestry Classification by Using LIE
To demonstrate how MI-MAAP can efficiently extract ancestry informative markers (AIMs) and cluster individuals into their geographic populations, the principal component analysis (PCA) algorithm was applied to the top 100 AIMs selected by LIE from the 1000 Genomes Project dataset. The PCA plots for two sets of three populations, (CEU, CHB and YRI) and (ASW, CEU and YRI), are shown below.

PCA clustering of parental populations: scatterplots of principal components axis one (PC1) and axis two (PC2) for (A) CEU, CHB and YRI populations, and (B) ASW, CEU and YRI populations.
As expected, panel (A) shows a distinct separation of the three continental ancestral populations CEU, CHB and YRI. In panel (B), CEU and YRI are clearly separated, whereas the ASW samples are more dispersed, with a larger sample variance. Most of the ASW samples lie much closer to YRI than to CEU, and CEU is separated from the other two populations along the PC1 axis. This pattern is consistent with the African American population being an admixed population with an average of 80% African ancestry and 20% European ancestry. These results show that continental populations can be clearly distinguished, while more markers are needed to improve the classification of closely related and admixed populations. The observations support the conclusion that the markers selected using LIE are ancestry informative.
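The pattern described for panel (B) can be reproduced in miniature with simulated data. The sketch below is not the analysis behind the figure: the two source populations "A" and "B", their allele frequencies, the 40 markers and the per-individual ancestry share (drawn uniformly between 0.6 and 1.0, so about 0.8 on average) are all assumptions chosen for illustration. PCA is done with plain power iteration so that the example needs only the standard library; the second component is therefore only approximate.

```python
import random

rng = random.Random(0)  # fixed seed so the simulation is repeatable


def simulate(freqs, n_individuals):
    """Draw 0/1/2 genotypes: two independent allele draws per marker."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n_individuals)]


def simulate_admixed(freqs_a, freqs_b, n_individuals):
    """Each individual gets its own share q of ancestry from population A."""
    rows = []
    for _ in range(n_individuals):
        q = rng.uniform(0.6, 1.0)  # assumed spread, mean 0.8
        mixed = [q * a + (1 - q) * b for a, b in zip(freqs_a, freqs_b)]
        rows.extend(simulate(mixed, 1))
    return rows


def principal_components(matrix, n_components=2, n_iter=100):
    """Return per-sample scores on the leading components (power iteration)."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in matrix]  # center markers
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1)
            for b in range(m)] for a in range(m)]
    scores = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(m)]
        for _ in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = sum(t * t for t in w) ** 0.5
            v = [t / norm for t in w]
        cv = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        eigval = sum(v[a] * cv[a] for a in range(m))
        scores.append([sum(x[i][j] * v[j] for j in range(m)) for i in range(n)])
        # Deflate: remove the component just found before searching for the next.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(m)]
               for a in range(m)]
    return list(zip(*scores))


# 40 strongly differentiated markers: the first half is common in A, the rest in B.
freqs_a = [0.9] * 20 + [0.1] * 20
freqs_b = [0.1] * 20 + [0.9] * 20
genotypes = (simulate(freqs_a, 30) + simulate(freqs_b, 30)
             + simulate_admixed(freqs_a, freqs_b, 30))
labels = ["A"] * 30 + ["B"] * 30 + ["admixed"] * 30
pcs = principal_components(genotypes)
pc1_mean = {g: sum(pc[0] for pc, lab in zip(pcs, labels) if lab == g) / 30
            for g in ("A", "B", "admixed")}
```

In this simulation the admixed group's mean PC1 score falls between those of the two source populations and nearer to the population contributing most of its ancestry, mirroring the position of ASW relative to YRI and CEU. The sign of PC1 is arbitrary, so only relative positions are meaningful.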