
Population structure analysis on 2504 individuals across 26 ancestries using bioinformatics approaches


Characterizing genetic diversity is crucial for reconstructing human evolution and for understanding the genetic basis of complex diseases; however, human population genetics is highly complex. Previously, we showed that, under Hardy-Weinberg equilibrium, the expected ratio of heterozygous to non-reference homozygous single nucleotide polymorphisms (SNPs), the het/nonref-hom ratio, is two[1]. Later, we found that this ratio depends on ancestry, with Africans being the most genetically diverse group and Asians the most homozygous[2]. This observation prompted further study to understand the reasons behind these differences.

Materials and methods

Using genomic data released by the 1000 Genomes Project (1000G) for 2504 individuals (26 populations from five major ancestry groups), we first computed the het/nonref-hom ratio, which has been applied as a quality control parameter for sequencing data[1, 3].
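The ratio itself is straightforward to compute once genotype calls are available. The following is a minimal sketch and not the authors' pipeline: it assumes each individual's genotypes are encoded as the number of non-reference alleles per SNP (0, 1 or 2, with None for a missing call), and the function name het_nonref_hom_ratio is hypothetical.

```python
from typing import Iterable, Optional


def het_nonref_hom_ratio(genotypes: Iterable[Optional[int]]) -> float:
    """Return (# heterozygous calls) / (# non-reference homozygous calls).

    Assumed encoding (not taken from the source):
    0 = homozygous reference, 1 = heterozygous,
    2 = homozygous non-reference, None = missing call.
    """
    het = 0
    nonref_hom = 0
    for g in genotypes:
        if g == 1:
            het += 1
        elif g == 2:
            nonref_hom += 1
        # 0 and None contribute to neither count
    if nonref_hom == 0:
        raise ValueError("no non-reference homozygous calls; ratio is undefined")
    return het / nonref_hom


# Toy example with made-up genotypes for a single individual
sample = [0, 1, 1, 2, 0, 1, None, 2, 1, 0]
print(het_nonref_hom_ratio(sample))  # 4 het / 2 nonref-hom = 2.0
```

In practice the genotypes would be read from the 1000G variant call files, and the ratio would be computed once per individual.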


As expected, we found that the het/nonref-hom ratio is strongly associated with human ancestry. Africans had the highest het/nonref-hom ratios, followed by Americans and Europeans, and East Asians had the lowest (Figure 1). More interestingly, the het/nonref-hom ratios of South Asians were much higher than those of East Asians, and Americans showed the widest range (Figure 1). We therefore quantitatively analyzed genetic variation in human populations on the 1000G dataset of 10^11 observed genotypes (2504 individuals at 13424776 SNPs) using Structure 2.3.4[4]. The resulting population structure is consistent with the major geographical regions. Each major group showed a single dominant origin population, except Americans, who had the most variation in structure and were represented by several populations, including the dominant population of Europeans (Figure 2). Moreover, East Asians and South Asians were found to derive from different ancestral populations (Figure 2).
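The group-level comparison of ratios described here (which group has the highest values, which has the widest range) amounts to grouping per-individual ratios by ancestry label. The sketch below illustrates that step under stated assumptions: the labels GroupA and GroupB and all numeric values are made up for illustration and are not results from the study, and the function name summarize_by_group is hypothetical.

```python
from collections import defaultdict
from statistics import median


def summarize_by_group(records):
    """Group (label, ratio) pairs and return {label: (median, range)}."""
    by_group = defaultdict(list)
    for label, ratio in records:
        by_group[label].append(ratio)
    return {
        label: (median(vals), max(vals) - min(vals))
        for label, vals in by_group.items()
    }


# Made-up illustrative values; NOT data from the study
records = [
    ("GroupA", 2.0), ("GroupA", 2.2), ("GroupA", 2.1),
    ("GroupB", 1.2), ("GroupB", 1.8), ("GroupB", 1.5),
]
summary = summarize_by_group(records)

# Label of the group whose ratios span the widest range
widest = max(summary, key=lambda g: summary[g][1])
print(widest)
```

The median is used here rather than the mean only because it is less sensitive to outlying individuals; the source does not state which summary statistic underlies Figure 1.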

Figure 1. het/nonref-hom ratio across 26 ancestries.

Figure 2. Population structure inferred from the 1000G genetic data.


Using a novel bioinformatics approach, we gained new insights into the history and geography of human evolution. These insights are valuable for tracking human migration and adaptation to local conditions.


  1. Guo Y, Ye F, Sheng Q, Clark T, Samuels DC: Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform. 2013


  2. Wang J, Raskin L, Samuels DC, Shyr Y, Guo Y: Genome measures used for quality control are dependent on gene function and ancestry. Bioinformatics. 2015, 31 (3): 318-323.


  3. Guo Y, Zhao S, Sheng Q, Ye F, Li J, Lehmann B, Pietenpol J, Samuels DC, Shyr Y: Multi-perspective quality control of Illumina exome sequencing data using QC3. Genomics. 2014, 103 (5-6): 323-328.


  4. Hubisz MJ, Falush D, Stephens M, Pritchard JK: Inferring weak population structure with the assistance of sample group information. Mol Ecol Resour. 2009, 9 (5): 1322-1332.


Author information



Corresponding author

Correspondence to Yan Guo.

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Wang, J., Samuels, D.C., Shyr, Y. et al. Population structure analysis on 2504 individuals across 26 ancestries using bioinformatics approaches. BMC Bioinformatics 16 (Suppl 15), P19 (2015).



  • Single Nucleotide Polymorphism
  • Population Structure
  • Human Population
  • Human Evolution
  • Genome Project