A unified STR profiling system across multiple species with whole genome sequencing data

Background Short tandem repeats (STRs) serve as genetic markers in forensic scenes due to their high polymorphism in eukaryotic genomes. A variety of STRs profiling systems have been developed for species including human, dog, cat, cattle, etc. Maintaining these systems simultaneously can be costly. These mammals share many high similar regions along their genomes. With the availability of the massive amount of the whole genomics data of these species, it is possible to develop a unified STR profiling system. In this study, our objective is to propose and develop a unified set of STR loci that could be simultaneously applied to multiple species. Result To find a unified STR set, we collected the whole genome sequence data of the concerned species and mapped them to the human genome reference. Then we extracted the STR loci across the species. From these loci, we proposed an algorithm which selected a subset of loci by incorporating the optimized combined power of discrimination. Our results show that the unified set of loci have high combined power of discrimination, >1−10−9, for both individual species and the mixed population, as well as the random-match probability, <10−7 for all the involved species, indicating that the identified set of STR loci could be applied to multiple species. Conclusions We identified a set of STR loci which shared by multiple species. It implies that a unified STR profiling system is possible for these species under the forensic scenes. The system can be applied to the individual identification or paternal test of each of the ten common species which are Sus scrofa (pig), Bos taurus (cattle), Capra hircus (goat), Equus caballus (horse), Canis lupus familiaris (dog), Felis catus (cat), Ovis aries (sheep), Oryctolagus cuniculus (rabbit), and Bos grunniens (yak), and Homo sapiens (human). Our loci selection algorithm employed a greedy approach. The algorithm can generate the loci under different forensic parameters and for a specific combination of species.

adopted as genetic markers in human [3] and non-human forensic applications such as individual identification [4,5], paternity testing [6,7], and kinship analysis [8]. DNA profiles can provide valuable clues and evidence during investigations of crime scenes. In previous studies, a number of STR profiling systems have been developed for various animal species including dog [9,10], cat [11,12], horse [13,14], etc. However, to maintain one DNA profiling system for each species can be tedious and costly. In previous studies [15][16][17], panels of cross-species STR markers were developed among some closely related species. Given the fact that mammalian animals share large proportions of DNA sequences [18,19], it is sensible to expect that a unified set of DNA markers may exist and the set can be identified from the shared sequence regions, and hence, a unified STR profiling system is possible for multiple species.
In this study, we aim to develop a unified set of STR loci for ten species, namely Sus scrofa (pig), Bos taurus (cattle), Capra hircus (goat), Equus caballus (horse), Canis lupus familiaris (dog), Felis catus (cat), Ovis aries (sheep), Oryctolagus cuniculus (rabbit), and Bos grunniens (yak), and Homo sapiens (human). Our objective is to identify a minimum set of STR loci satisfying the forensic parameters from the shared genome regions among these species. Previous studies have utlized high-volume data from the shotgun sequencing technology in forensic analysis to search the new forensic DNA markers [20,21]. While our targeted species belong to different families or different orders, and it is challenging and time-consuming to obtain the their shotgun data. Therefore, we proposed a model based on whole genome sequence data.
In this work, we downloaded the whole genome sequence (WGS) data of the multiple species. Then, we mapped the sequencing data with BWA [22] to the human genome (hg19) [23]. After that, we obtained the possible STR sites in each species using the software package lobSTR [24]. Last, we performed a greedy locus selection algorithm with the incorporation of the forensic parameters to identify a unified set.
The obtained locus set contains much less loci than the total number of the loci from individual species. The acquired loci set also demonstrates a lower random-match probability and a higher combined power of discrimination when applied to each species. Our statistical results show that the ultimately unified loci set has the combined power of discrimination (CPD, C(L)) larger than 1 − 10 −9 for each involved species, and random-match probability at most 6.30×10 −8 . Furthermore, our proposed loci selection algorithm can also be applied to individual species. The experimental results showed that, in comparison with other previous report work, the proposed method is capable to generate more efficient loci for the given species. Using the C (L) and R(L) values in 13 CODIS loci as benchmarks, the proposed algorithm generated eight loci to meet the criteria with lower C (L) and R(L), indicating that the selected loci are more sensitive for individual identification.
In order to obtain the candidate STR loci in the shared genomic region of involved species, we aligned WGS reads in the samples to hg19 and processed the aligned data with lobSTR. According to the outputs of lobSTR, we estimated allele frequencies as well as genotype frequencies of sites in each species, which were used to estimate forensic parameters described in the following sections.

Problem statement
In this work, we call the multiple species, which we want to search the unified STR locus set from, as an integrated pseudo species. Given the genome data of multiple species, the main purpose here is to obtain a small set of STR loci from the shared genome region among the species to achieve individual identification; that is, we want to find a set of STR loci with the cardinality minimized. In addition, these loci should satisfy the criteria specified by the forensic parameters for both an individual species and the integrated pseudo species.

Call rate
In each species, the call rate at each locus (denoted as η ) is the percentage of the individuals covers the locus in the sample data. The call rate is restricted by the pre-specified threshold δ η so that the selected loci L k could be observed in the majority of the individuals in a species.

Parameters for high distinguishing power
First, within each species, a valid locus should satisfy the constraints set by four forensic parameters: the power of discrimination (PD), the probability of matching (PM), the power of exclusion (PE), and heterozygosity (HE) [26]. The constraints set by the parameters to ensure that only loci with high distinguishing power on individuals would be considered. Denote PD, PM, PE, and HE of the locus of the species s as D ,s , M ,s , E ,s , H ,s , respectively. Let thresholds of the four parameters as δ d , δ m , δ e , and δ h , the selected loci set L s of each species s should satisfy Moreover, since multiple species are considered here, we need to define the PD for the integrated pseudo species at each locus. At locus , denote PD for the integrated pseudo species as D , then the calculation of D is defined as ( 2 )

Combined power of discrimination (C(L))
C(L) is also a common parameter to evaluate the capability of individual identification of a locus set. Denote C(L) for species s as C s . Then, The cumulative product of PM refers to the combined probability of matching (CPM) and maximizing the value of C(L) is equivalent to minimizing the value of CPM(1 − C(L), denote it as C (L)).
Also, we also define the CPD for the integrated pseudo species. Similar to the definition above, for the integrated population, C(L) is defined as We aim to find a locus set L to satisfy C s (L) ≥ δ c for each individual species as well as optimize the value of C for the integrated pseudo species.

Random-match probability (RMP)
Random-match probability (RMP, R(L)), defined as the probability that a randomly selected individual from the population other than the suspect would have the given DNA profile, is another criterion to evaluate DNA profile systems. Such constraints are imposed to statistically eliminate the fallacy that an irrelevant individual in the population is matched to the given profile [26]. Therefore, here we limit value of R(L) to be under 1/N where N refers to the number of individuals in the population on which we would like to apply our selected markers. Denote the maximum profile frequency (MPF) at each locus of species s as F ,s . Then the R s (L) of locus set L in species s is calculated as We want to find a locus set L such that for each species s, R s (L) ≤ 1 N s ; that is, the cumulative product of is set to be no more than the reciprocal of the population size. Similarly, we can define the the maximum profile frequency and the random-match probability for the integrated pseudo species, and denote them as F and R(L), respectively.
The loci apply to both individual or the integrated pseudo species; it is necessary to restrict R(L)s for both in individual species and in the integrated pseudo species below the threshold δ r , the reciprocals of the population sizes, for both a single species or the integrated pseudo species.

Locus selection
First, we aligned raw sequencing data to the reference of the human genome, processed the data through lob-STR workflow and obtained STR locus candidate set from in the overlapping genomic area of the involved species. We excluded loci on the sex chromosome, and mononucleotide repeats, which are inapplicable in practical situations. Using loci retrieved by lobSTR, we estimated allele frequency at each locus and used the frequencies to compute PD, PM, PE, HE, as well as maximum genotype frequency. Then we applied the thresholds specified in Eq. (1) on PD, PM, PE, HE and call rate to filter the STR loci.
The selected loci set should have the C s (L) values no less than δ c for each species, and its R s (L)s are no more than δ r for each species and an optimized C(L) for the integrated pseudo species. The R(L) can be expressed as the products of F values, and C (L) can be expressed as the products of PM values.
Next, we employ a greedy algorithm to find a unified STR locus set from the filtered loci. We started from an empty set of loci. At each iteration, we incorporate the locus which reduces 1 − C (denote it as C ) and R the most into the set. As both C and R are probabilities, we choose the locus which decreases their product the most. We repeat the process until the C and R are below the pre-specified thresholds. The algorithm is displayed as (Algorithm 1). To initialize, we create two vectors v C and v R , with initial values as 1. They will store the values of M ,s ) and F ,s of all loci in every species. v C (s) and v R (s) are the M ,s and F ,s values of species s of the current identified loci. Once a species s has its C s (L) (or R s (L)) reach the given thresholds, we skip the species for later iterations.

Algorithm 1 Locus Selection
Input: Candidates loci set A including the PD(M l,s ), freq of STR type(F l,s ) and each locus information Output: A shared loci set L. for l in A do 5: compute-the-decrease(l, M, F). 6: end for 7: let be the l which has largest decrease. 8: Initialize v C and v R as vector of ones for each s do 5: if v C (s) ≥ 1 − δ c then

Sample size determination
According to Hale et. al [27], most of the informative alleles, alleles at a frequency of ≥ 0.05, could be captured by 30 samples, thus the estimated allele frequencies from 30 samples can represent the population allele frequencies.
Here, we utilized the data of the Phase 3 publication of the 1000 Genomes Project, and randomly sampled different numbers of individuals from the data set. The statistical results showed that with 25 or more samples, at least 90% of alleles at each site can be detected, and that the difference between the estimated and theoretical values of PD, HE and MPF is small-less than 0.1.

Performance on pseudo species consisting of 10 species
Prior to the locus selection process, we conducted experiments to examine impact of different sample sizes on the detection of alleles as well as estimation of the involved forensic parameters. We chose the data in the 1000 Genomes Project. We sampled different numbers of individuals from this population and estimated number of alleles detected at each site, as well as the forensic parameters (PD, HE, MPF) involved in individual identification. As shown in Fig. 1, with 25 or more samples, over 90% of alleles can be detected and meanwhile, the difference between the estimated and theoretical values of forensic parameters are small. Therefore, using 25-30 samples will also be able to generate sensible estimations.
After processing data through lobSTR, 119,717 loci were detected with reference to human genome of all 23 chromosome pairs. The initial distribution of call rate among STR loci is shown in Fig. 2 .
First, we filtered out loci that were observed only in single species, 7784 STR loci remains. Next, we excluded mononucleotide repeat loci as well as loci on sex chromosomes, and 4614 loci remain. After we dropped loci with call rate less than the threshold δ η and filtered out loci that failed to satisfy any of the thresholds of forensic parameters (δ d = 0.6, δ m = 0.4, δ e = 0.3, δ h = 0.5), we obtained 1268 loci subjected to further processing. The distribution of PD and genotype frequency among remaining loci is shown in Figs. 2 and 3. To be noted, in Fig. 3 the y-axis refers to the joint PD defined in (2), while in Fig. 2, the y-axis represents a subtraction of the cumulative value of genotype frequency at each locus. The red lines Figs. 2 and 3 stand for our desired thresholds for these two parameters in the loci selection algorithm.
We applied the selection algorithm on STR sites of the concerned species. We set the threshold for C(L) at 0.9999 and R(L) at 10 −7 for every individual species and generated a set with 31 loci for ten species. Table 1 shows forensic parameters of selected loci with highest PDs, and Table 2 shows assessment of selected loci on their combined power for individual identification as well as paternity testing on each species population. The generated loci set has C(L)s higher than 1 − 10 −9 , R(L)s less than 10 −7 for every concerned species, suggesting its power for distinguishing individuals randomly chosen from a related species. Also, to consider the possible use of generated loci set in paternity testing, we evaluated the combined power of exclusion (CPE) within each species population. We can observe that CPEs are higher than 0.99 for every species group. In addition, our algorithm can be applied with different thresholds and on a different number of species. We applied the selection algorithm on 10 species with C(L) threshold ranging from 1 − 10 −3 to 1 − 10 −7 and R(L) threshold from 10 −3 to 10 −10 . As shown in Table 3, at a given C(L) threshold, the number of loci would increase with descending threshold of R(L); conversely, when the threshold of R(L) is set, the number of loci may slightly increase or remain unchanged when more rigorous C(L) threshold is given. Moreover, we examined the value of R(L) that different number of loci could achieve among ten species. (Figure 4) It can be seen that to satisfy R(L) upper bounded at 10 −5 in each species, at least 21 loci are required. At given C(L) and R(L) threshold, the number of loci generated among different number of species is shown in Fig. 5. It can be seen that when R(L) threshold is settled, the increment of species number will not result in continuously growing of loci number, indicating that our proposed algorithm is effective when more species groups are involved.

Our method beats CODIS on human
The Federal Bureau of Investigation (FBI) has published thirteen core loci for the Combined DNA Index System (CODIS) [28] in 1997, which have been used as dominant DNA markers in human profiling. Here, we retrieved the thirteen loci from 320 individuals of Chinese population in 1000 Genome Project. We computed the C(L) and R(L) of CODIS based on the data. To evaluate the performance of the proposed algorithm, we used the calculated values of C(L) and R(L) as thresholds, and applied the algorithm on the same population. Eight loci were selected by our algorithm to satisfy both thresholds. The fewer number of loci suggests that our algorithm is effective in searching for loci to meet given thresholds. As shown in Table 4, the selected loci has higher C(L) and lower R(L) than the CODIS loci. To further evaluate the efficacy of the proposed algorithm in loci selection, we generated sets containing different number of loci for the same Chinese population. For each generated loci set L, we simulated profiles of 10,000 individuals at each selected locus, and computed the values of R(L).
The results are shown in Fig. 6. With more loci chosen by the proposed algorithm, the value of R(L) would decrease accordingly. When a locus set with 13 loci was generated, the R(L) became much smaller than CODIS, suggesting the much lower probability of occurrence of random matching cases. Therefore, it can be concluded that the greedy strategy implemented in the proposed algorithm -to assign priority on loci with high PD and low MPF -is capable to generate a small locus set to achieve pre-specified CPD and RMP on target population. To better evaluate the capability of using the selected loci in paternity test, we conducted simulations of trio paternity testing cases and for human population. We generated sets containing different number of loci -ranging from 8 to 20, where the 8-loci set was generated when the observed values of C(L) and R(L) of CODIS were used as the thresholds in our selection algorithm. We computed the combined paternity index (CPI) [29] as the evaluation metric. We simulated 1000 true trio families (where the man and woman are true biological parents of the child) as well as 1000 false trio families (where the man is a random male chosen from the population) and calculated the paternity index (PI) and CPI for families. As shown in Fig. 7, we took the logarithm of CPI values. In true trio cases, where CPIs are usually higher than 1, larger CPI indicates the corresponding loci set can provide higher paternity probabilities, and thus more reliable in practical use; while in false trio cases, where CPIs become usually lower than 1, smaller value implies the less probability of false exclusion. It can be seen that, in comparison with CODIS, the 8 loci obtained by our algorithm at same thresholds have higher CPI at true trio case and lower CPI at false trio case. Moreover, when increasing the number of loci selected by our algorithm, the normalized curves go further from the y-axis, suggesting the growing reliability of the selected loci in paternity testing. To sum up, in both terms of individual identification and paternity testing, our proposed algorithm is capable to select efficient loci to generate an optimized loci set for a given population.

Our method beats Kemp's on pseudo species consisting of cattle, goat, and sheep
In the work of Kemp et al. [16], a panel of 97 microsatellite markers was developed jointly for cattle, goat, and sheep. Here we extracted samples of cattle, goat, and sheep to form a pseudo species, and applied the selection algorithm on this newly integrated population. With the threshold for C(L) and R(L) at 0.9999 and 10 −7 respectively, 18 loci were selected for the three species. As shown in Table 5, the generated loci set L has low R(L) (under 10 −7 ), high C(L) (greater than 1 − 10 −9 ) and CPE (greater than 0.996), indicating the power of selected loci in identity and paternity testing.

Conclusion
In this study, we developed an algorithm based on whole genome sequencing (WGS) data to generate a unified STR loci set for multiple species. The algorithm was designed to search for minimum number of loci to optimize the combined power of discrimination (C(L)) for each species as well as the integrated population. For each individual species, the selected loci should have C(L) no less than δ c (here we set δ c = 0.9999) to ensure their efficacy to distinguish one individual from another, and random-match probability ( R(L)) no greater than δ r to control the fallacy that two randomly selected individuals would have same profiles at given loci. We included 10 species in this study, namely, Sus scrofa (pig), Bos taurus (cattle), Capra hircus (goat), Equus caballus (horse), Canis lupus familiaris (dog), Felis catus  To obtained all possible STR sites at the common genome region among involved species, we mapped raw sequencing data of individual samples concerning the human genome. We implemented the proposed selection algorithm on STR sites owned by at least two species and finally generated 31 loci at R(L) (≤ 10 −7 ) for all concerned species. Under this threshold, the generated loci set has C(L)s greater than 1 − 10 −9 and R(L)s no greater than 10 −7 , which collectively demonstrate its capability of individual identification in every involved species population. Furthermore, we assessed the capacity of using selected loci in paternity testing by their combined power of exclusion (CPE). The generated loci set could achieve CPEs greater than 0.99.
In addition, we evaluated our proposed algorithm by applying it on different selection thresholds and varying number of species. It turns out that the loci number may increase to satisfy more rigorous R(L) threshold, whereas with settled R(L) threshold, the number of loci would not continue to increase significantly when more species are involved. Thus it can be concluded that the algorithm proposed here can find loci that commonly have high power of discrimination in involved species and generate a loci set to satisfy the criteria of C(L) and R(L) with a minimized number of loci.
With the data from 1000 Genomes Project, we computed the C(L) and R(L) of 13 CODIS loci, used the values as thresholds in our proposed loci selection algorithm, and obtained eight loci that could satisfy the criteria. This loci set generated by our method not only have fewer number of loci, but also demonstrate higher C(L) and lower R(L) in the concerned population. In addition, in the respective simulated 1000 cases of true trio and false trio paternity tests, the generated 8-loci set demonstrated higher reliability than CODIS in terms of the combined paternity index (CPI). Therefore, it can be concluded that, given either a specific or several separated population(s), the proposed algorithm has the capability to generate an optimized loci set that can be utilized in both identity testing and paternity testing with minimized number of loci.
We also compared the study of Kemp et al. [16] on cattle, goat and sheep. Kemp identified 97 loci for individual identification across the three species. Through our algorithm optimized, 18 loci are satisfactory for this task. To summarize, our algorithm can be used for individual identification (on human) or across groups. After comparison with existing research, our results are better than previous studies.

Availability of data and materials
The source and on-line tools can be found in https://spe.deepomics.org Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.