 Methodology Article
 Open access
 Published:
An adaptive threshold determination method of feature screening for genomic selection
BMC Bioinformatics volume 18, Article number: 212 (2017)
Abstract
Background
Although the dimension of the entire genome can be extremely large, only a parsimonious set of influential SNPs are correlated with a particular complex trait and are important to the prediction of the trait. Efficiently and accurately selecting these influential SNPs from millions of candidates is in high demand, but poses challenges. We propose a backward elimination iterative distance correlation (BEIDC) procedure to select the smallest subset of SNPs that guarantees sufficient prediction accuracy, while also solving the unclear threshold issue for traditional feature screening approaches.
Results
Verified through six simulations, the adaptive threshold estimated by the BEIDC performed uniformly better than fixed threshold methods that have been used in the current literature. We also applied BEIDC to an Arabidopsis thaliana genomewide data. Out of 216,130 SNPs, BEIDC selected four influential SNPs, and confirmed the same FRIGIDA gene that was reported by two other traditional methods.
Conclusions
BEIDC accommodates both the prediction accuracy and the computational speed that are highly demanded in the genomic selection.
Background
Genomic selection is an important task for increasing the efficiency of plant breeding, disease diagnosis, personalized medicine, and genotyping chip design. Genomic selection is improved by identifying a small subset of influential single nucleotide polymorphisms (SNPs) from highdimensional genetic information to efficiently predict individual’s phenotype [1–5]. The rapid developments of highthroughput genomic technologies, such as whole genome genotyping, next generation sequencing, gene expression microarray, and RNAseq, have dramatically boosted the landscape and power of genomic selection [6, 7], while nevertheless bringing unprecedented challenges for statistical modeling.
Feature screening has been receiving extensive attention as a powerful approach to handle ultrahigh dimensional data, which is defined as p= exp(n ^{ζ}), for some ζ>0. Here p is the number of features and n is the number of observations [8–19]. Specifically, Li et al. developed a distance correlation based sure independence feature screening (DCSIS) strategy that defines an association strength measure for each feature based on its distance correlation with the phenotype [16]. The idea of DCSIS is to theoretically satisfies the sure screening property, ranks the features from the most important to the least important by decreasing distance correlation values, and filters the majority of noise with low values of the defined association strength measure. A very attractive property of DCSIS is that it effectively captures both the linear and nonlinear association between the feature and the phenotype, and feasible for binary, continuous, and categorical features and phenotype, without assuming any specific model structure, distribution, or data type. In addition, DCSIS outperforms the traditional sure independence screening (SIS) [9] and sure independent ranking and screening (SIRS) approaches [13]. Therefore, DCSIS has great potential in serving genomic selection and recognizing the truly influential SNPs from millions of candidates covering the entire genome.
However, a limitation that restricts the application of DCSIS and other feature screening approaches is the lack of a clear threshold determination to separate influential features from noise, which is crucial in genomic selection. Frequently, an arbitrary decision as to the number of genes to retain is made. For example, some papers used sheer empirical experience to select the 50 or 150 highest ranked genes [20–22]. Other traditional approaches selected the SNPs passing the threshold of −log(0.05/p) [23–26]. However, this approach requires that a pvalue is accessible, and hence annuls the possibility of applying any approaches that do not compute a pvalue. Current feature screening literature keeps the top d features having the highest rankings, where d is often computed from an integral multiplier of [n/log(n)] (e.g., d _{1}=[ n/log(n)],d _{2}=2[ n/log(n)], or d _{3}=3[ n/log(n)] is suggested) [9, 16, 17, 27]. These options may work well for some circumstances, but several drawbacks also present themselves: 1) It is still not clear what d exactly should be. For a real data analysis, we have no any idea whether d _{1}, d _{2}, d _{3}, or an even larger value should be used. 2) A formula such as [ n/log(n)] is only restricted by the sample size, but unreasonably neglect two indispensable considerations: the number of features, and the signaltonoise ratio. Assuming the sample size is fixed, it is unreasonable to set the same threshold for one dataset with 100 features and another dataset with 100,000 features. For a dataset with a large number of features, but a weak signaltonoise ratio, the threshold may be relatively large, whereas for a much easier scenario, the threshold may be small. Kong et al. recognized these limitations and proposed a theorem to adaptively determine the threshold for DCSIS, pioneering threshold determination research [28]. However, they assumed that noise SNP was purely independent of the influential SNPs and the phenotype, which may not be true in the genomewide datasets.
In particular, determining a threshold that separates influential SNPs from noise SNPs is a necessity in genomic selection. But method for determining the threshold is also limited in the genomic selection literature. Frequently, an arbitrary decision as to the number of genes to retain was made. For example, some papers used sheer empirical experience to select the 50 or 150 highest ranked genes. Other traditional approaches selected the SNPs passing the threshold of −log(0.05/p). However, this approach requires that a pvalue is accessible, and hence annuls the possibility of applying many other feature screening approaches which do not compute a pvalue.
This article extends the work of Zhong et al. [27], and proposes a backward elimination iterative distance correlation (BEIDC) procedure that adaptively and automatically determines an optimal threshold for genomic selection while guaranteeing prediction accuracy. The smoothly clipped absolute deviation (SCAD) penalized regression model fitting the bootstrap samples is used to compute the mean square prediction error (MSPE). A certain percentage of SNPs (controlled by the drop rate) are backward eliminated after iterative DCSIS ranks the SNPs from the most important to the least important, and the point at which the minimum MSPE is attained is determined as the final threshold. One aim is to optimize the threshold, which is not trivial. If the standard is too stringent and the number of SNPs selected is too small, we may fail to cover all influential SNPs, and hence the power will be reduced; whereas if the rule is too liberal and the number of SNPs selected is too large, too much of the noise will be mistreated as influential features, and hence the false discovery rate will be inflated [29]. Another aim is to obtain the smallest possible set of SNPs that can still achieve acceptable prediction accuracy. The proposed BEIDC realizes these two aims, serves as a good genomic selection procedure for the ultrahigh dimensional genomewide dataset, and overcomes the limitation of previous feature screening approaches.and boosts the potential of feature screening approaches to bring a new horizon for genomic selection
We explored the performances of the BEIDC approach using six different simulation settings and one real genomewide dataset. We also compared the results of the adaptive threshold estimated by BEIDC with those found by the fixed thresholds suggested by current feature screening literature [9, 16, 17, 27]. The average thresholds estimated by BEIDC are uniformly lower than the fixed thresholds while yet achieving 100% power. It is worth mention that the BEIDC uses an average threshold as small as 5.54 to achieve a 100% power for Example 1, which is only 7% of the fixed threshold, d=2[ n/log(n)]=74, used by the current literature [9, 27]. In addition, BEIDC can flexibly and automatically adjust the threshold when the dataset has harder conditions (see Example 3 and Additional file 1: Supplementary example 1). From these six simulations, we conclude that BEIDC shows uniformly excellent performances even when the signaltonoise ratio is low (e.g., only four influential features are truly associated with the phenotype, and 4996 features are noise), and when the number of features is much larger than the number of observations (e.g., p=5,000 and n=200). We also demonstrate that the BEIDC approach selects a very small set of SNPs for Arabidopsis thaliana data. Here, only four SNPs are selected from a pool of 216,130 SNPs covering the entire genome, and the FRIGIDA gene, reported to be highly associated with the FRIGIDA expression trait being analyzed [23], is successfully picked out.
Methods
Iterative DCSIS
Szekely et al. defined an association strength measure for each feature based on its distance correlation (Dcorr) with the phenotype and showed that the Dcorr of two random vectors equals zero if and only if the two random vectors are independent [30]. Li et al. proposed the DCSIS feature screening approach, ranked the SNPs from the most important to the least important by decreasing the values of Dcorr, and proved the sure screening theorem to theoretically ensure that DCSIS will not miss any influential SNPs if the sample size is large enough [16].
Let y be the analyzed phenotype. Let X _{ j } be the genotype of each SNP, j=1,…,p. For each biallelic locus, the three possible genotypes can be coded as 0 (for aa), 1 (for Aa), and 2 (for AA). The distance covariance between the phenotype and each SNP is defined as
where ϕ _{ y }(t) and \(\phantom {\dot {i}\!}\phi _{\mathbf {X}_{j}}(s)\) are the respective characteristic functions of y and X _{ j }, and \(\phantom {\dot {i}\!}\phi _{\mathbf {y},\mathbf {X}_{j}}(t,s)\) is the joint characteristic function of (y,X _{ j }), and
where · stands for the Euclidean norm. Then the Dcorr between the phenotype and each SNP is defined as
Szekely et al. gave a numerically easier estimator of \(\widehat {dcov}^{2}(\mathbf {y},\mathbf {X}_{j})\) as
Let y _{ i1}, y _{ i2}, X _{ i1,j }, and X _{ i2,j } denote the \(i_{1}^{th}\) and \(i_{2}^{th}\) sample observations for y and X _{ j }, respectively. Then
Finally, the point estimator \(\widehat {Dcorr}(y,\mathbf {X}_{j})\) can be estimated by Eqs. (2), (3) and (4). We are then able to rank all SNPs, from the most influential to the least influential, by decreasing values of \(\widehat {Dcorr}(y,\mathbf {X}_{j}), j=1,\ldots,p\) [16]. Let \(\mathbf {X}_{p}=\{X^{*}_{k}, k=1,\ldots,p\},\) be the reordered SNPs, where the asterisk is used to differentiate the top k ^{th} SNP after selection by DCSIS from the originally observed k ^{th} SNP.
While DCSIS is a very powerful feature selection approach for ultrahigh dimensional data, it may neglect some important SNPs that are marginally not relevant, but jointly associated with the phenotype; or, it may rank highly some noise SNPs that are spuriously correlated with the phenotype due to their strong linkage disequilibrium (LD) with other important SNPs. To overcome these shortcomings, Zhong et al. introduced an iterative distance correlation feature screening approach (IDCSIS) [27]. The main idea of IDCSIS is to iteratively regress unselected SNPs on selected SNPs, regain information from the residuals, and effectively break down the effects of correlation structure among SNPs.
DCSIS ranks all SNPs and achieves the set X _{ p } in a single step, while IDCSIS builds up X _{ p } gradually with several steps, i.e. \(\mathbf {X}_{p}=\mathbf {X}_{p1} \bigcup \mathbf {X}_{p2} \bigcup \ldots \bigcup \mathbf {X}_{pm}\), with p=p _{1}+p _{2}+…+p _{ m }, where X _{ pi } stands for the set of SNPs selected at the i ^{th} iterative step, p _{ i } denotes the size of each set X _{ pi }; i=1,…,m, and m is the number of iterative steps. Zhong et al. claimed that a small number of iterations is adequate to guarantee good performance and they suggested m=2,p _{1}=5,p _{2}=d−5, and d=2[n/logn] [27]. Note that this article aims to adaptively determine d without assuming it is given, hence we set m=3,p _{1}=3,p _{2}=3, and p _{3}=p−6 to rank all features. This setting is empirically proven to work well in all simulations.
The details of IDCSIS can be summarized as follows [27]:

Step 1: Use DCSIS for y and X and select the first p _{1} features into X _{ p } (i.e. X _{ p }=X _{ p1}).

Step 2: Define \(\mathbf {X}_{r}=\{I_{n}\mathbf {X}_{p}(\mathbf {X}_{p}^{T}\mathbf {X}_{p})^{1}\mathbf {X}_{p}^{T}\}\mathbf {X}_{p}^{C}\), where \(\mathbf {X}_{p}^{C}\) is the complement set of X _{ p }. Use DCSIS for y and X _{ r } and select the second p _{2} features into X _{ p } (i.e. \(\mathbf {X}_{p}=\mathbf {X}_{p1} \bigcup \mathbf {X}_{p2}\)).

Step 3: Repeat Step 2 for m times until all p features are ranked, i.e., \(\mathbf {X}_{p}=\mathbf {X}_{p1} \bigcup \mathbf {X}_{p2} \bigcup \ldots \bigcup \mathbf {X}_{pm}\), with p=p _{1}+p _{2}+…+p _{ m }. Note that the computational cost will be shockingly large if we repeat step 2 for too many times for a large number of SNPs. Additionally, the theoretical sure screening property may not continue to be true if too many iterations are applied. To balance the computational cost and accuracy, we only selected the first one hundred SNPs by IDCSIS and then applied DCSIS for all remaining SNPs. This combination worked well after verifying by quite a few empirical studies (see simulation section).
Backward elimination
Let d denote the threshold that we need to determine. Let \(\mathbf {X}_{\mathcal {C}}=\{X_{k}^{*}, k=1,\ldots,d\}\) be the subset of influential SNPs, i.e., the conditional distribution function of y depends on X merely through \(\mathbf {X}_{\mathcal {C}}\), and let \(\mathbf {X}_{\mathcal {N}}=\{X^{*}_{k}, k=d+1,\ldots,p\}\) be the set of noise SNPs, i.e., the complement set of \(\mathbf {X}_{\mathcal {C}}\). The goal of genomic selection is to remove \(\mathbf {X}_{\mathcal {N}}\) and pick the subset \(\mathbf {X}_{\mathcal {C}}\). DCSIS is able to rank important features before noise, but a genomic selection process cannot be finalized if d is not determined. The current feature screening literature suggests using a fixed threshold of d= [ n/logn] ([·] is the nearest integer function) [16, 27]. But again, this has several limitations as discussed in the Introduction section.
Starting from the biggest pool, X _{ p }, which contained all reordered SNPs as ranked by IDCSIS, we discarded the noise SNPs by a backward elimination process through several iterations. For each iteration, we computed mean square prediction error based on its current pool, threw away a certain drop rate of SNPs from the bottom of the rank (i.e., those having the smallest Dcorr values), then moved to the next iteration step. The backward elimination considered all SNPs at the initial stage of the modeling to attenuate possible modeling biases.
SCAD penalized regression model
To predict the phenotypic values for the test data while accommodating the ultrahigh dimensionality of the genomewide data (in particular for the first couple of iterations of the backward elimination process), we applied a penalized regression model with the nonconcave SCAD penalty function [31]. Unlike the traditional regression model, the penalized least squares estimators were obtained by minimizing
where the SCAD penalty function was given as
Two unknown tuning parameters λ and α are contained in the penalty function. As suggested by Fan et al. [31], α=3.7 is a good choice for various problems, and λ is selected by cross validation. This penalty function corresponds to a quadratic spline function with knots at λ and α λ. Besides its capability of handling ultrahigh dimensional genomewide data, the SCAD penalty function satisfied three properties that are important for genomic selection: It is singular at the origin to produce sparse solutions and shrink unimportant parameters to zero to reduce model complexity; the resulting estimator is continuous, which retains stability in model prediction; and it is bounded by a constant to produce nearly unbiased estimates for large coefficients to avoid unnecessary modeling bias [31].
Scheme of BEIDC
The details of BEIDC are summarized as follows:

Step 1: Rank all SNPs by IDCSIS, and obtain the reordered set (i.e., X _{ p }), where p is expected to be ultrahigh.

Step 2: Start from the biggest pool (size of p), X _{ p }, and compute the MSPE for the corresponding model.

Step 3: Remove a certain drop rate of SNPs having the lowest Dcorr values, based on the ranks obtained from Step 1. Then compute the MSPE for the model corresponding to the current pool. For more details about the drop rate, please see the simulation section.

Step 4: Repeat Step 3 until the smallest pool (size of 1 at minimum) is reached.

Step 5: Draw a plot of the MSPE versus the number of SNPs and locate the model size for which the MSPE is minimized, as model size decreases from p to 1. Finally, the selected influential SNP set (i.e., \(\mathbf {X}_{\mathcal {C}}\)) and the adaptive threshold (i.e., \(\hat {d}\)) can be simultaneously determined from this optimal spot. The noise set \(\mathbf {X}_{\mathcal {N}}\) is already thrown away during the iterations of Steps 3 and 4.
The computation of the MSPE mentioned in above steps 2 and 3 is done as follows: Draw 1000 bootstrap samples with replacement, divide each bootstrap sample into training data (the observations being drawn) and test data (the observations not being drawn, also called out of bag (OOB) observations), fit the SCAD penalized regression model using the training data, predict for the test data, then compute the mean square prediction errors for all the bootstrap samples.
Following this BEIDC scheme, the prediction accuracy and reproducibility of results on new datasets should be guaranteed because the minimum mean square prediction error is used. However, if a very small threshold is preferred, we suggest using the smallest number of SNPs whose MSPE is within 1 standard error (1 s.e. rule) above the minimum MSPE. In this case, the number of SNPs may be smaller than \(\hat {d}\), and the prediction error a little larger than the minimum MSPE value but the MSPE will still lie within an acceptable range. However, it is expected that the power may decrease and influential SNPs may be missed if this 1 s.e. rule is used. Therefore, unless a very small number of SNPs is preferred for reason of saving experimental cost in breeding or disease diagnosis applications, we suggest taking the threshold to be that for which the MSPE is minimized.
Results
Simulation studies
The performances of DCSIS and IDCSIS have been investigated by Li et al. [16] and Zhong et al. [27] using a fixed threshold. In this section we examined the performance of the adaptive threshold estimated from the new approach BEIDC through six simulation studies. Additionally, we compared the performance of the adaptive threshold with the fixed threshold \(\hat {d}=2[\!n/log(n)]=74\) used by Li et al. [16] and Zhong et al. [27]. To make the comparisons fair, the first two examples were imitated from the current literature [9, 27, 32]. In the third example, we increased the level of rigor. As with the first two simulations, we also used the same criteria that have been widely used to assess the power of a method in current feature screening literature [16, 17, 27]:

Average Threshold \(\bar {\hat {d}}\): the average value of the adaptive thresholds estimated from the 100 simulation replicates;

Strict Power P _{ a }: the proportion of the 100 replicates for which all truly influential features are simultaneously selected by \(\hat {d}\);

Individual Power P _{ j }: the proportion of the 100 replicates for which each individual influential feature is selected by \(\hat {d}\) respectively.
We simulated 100 replicate datasets consisting of n=200 observations and p=5,000 features for Example 12 and Additional file 1: Supplementary example 12. We simulated 100 replicate datasets consisting of n=200 observations and p=2,000 features for Example 3 and Additional file 1: Supplementary example 3.
Example 1
Similar to Zhong et al. [9, 27], we simulated data for Example 1 from the model
where the features were generated from a normal distribution with zero mean, unit variance, and covariance structure following AR(1) design with auto correlation parameter ρ=0.5. Model (5) actually sets the first four features as the influential ones, and all remaining 4,996 features as noise. The white noise was generated from the standard normal distribution and the continuous phenotype y was generated from Model (5) accordingly.
Table 1 summarizes the simulation results of average thresholds \(\bar {\hat {d}}\), strict power P _{ a }, and individual power P _{ j }, j=1,…,4, achieved by fixed threshold d=2[n/log(n)] and adaptive threshold, respectively. From Table 1, we can see that BEIDC achieved 100% strict power using only 5.54 features on average after searching 5,000 features for 100 replications, while the fixed threshold used a threshold of 74, which is 13 times larger than the result of BEIDC to achieve the same power. Simulation further indicated that BEIDC only made an average of 1.54 spurious signal, but that the fixed threshold made an average of 70 spurious signals for 100 simulations (akin to a type I error rate). Note that X _{4} was jointly important but marginally independent to the phenotype y, so it trapped the DCSIS (P _{4}=0%) but was successfully detected by BEIDC using a very small model size (P _{4}=100%). In addition, we assessed the average of 100 mean square prediction errors using the 5fold cross validation for these three methods, respectively. The BEIDC achieved the best prediction accuracy (see Table 1). Given the fact that IDCSIS already beat four other feature screening approaches [27], to wit, LASSO [33], sure independence screening (SIS) [9], iterative sure independence screening (ISIS) [9], and DCSIS [16], this improvement is agreeable. The strict power of BEDC (i.e., backward elimination distance correlation without iteration) is zero even using 9 times more features because X _{4} is purposely designed to be trapped by noise. This example illustrates that the benefits of BEIDC over BEDC.
We also tested the sensitivity of different drop rates (i.e., the percentage of the noise SNPs discarded at each iteration, mentioned in Step 3 of the BEIDC scheme) using this simulation study and empirically verified that the results are quite stable when drop rate varies dramatically from 10 to 50% (see Table 2). The thresholds, \(\hat {d}\), have only negligible differences according to the five different resolutions of the drop rates. Throughout this article, we used a drop rate of 50% to save the computational cost and found that such a big drop rate is able to achieve high powers and small thresholds. But if a high resolution is needed and computational cost is not a concern, a smaller drop rate, say 10%, is suggested.
Example 2
We simulated SNP data for Example 2 following the procedure of Li et al. [32]. Firstly, u _{ ij } was generated from a standard normal distribution with correlation structure of ρ=cov(u _{ ij },u _{ ik })=0.1. To simulate SNPs with equal allele frequencies, we set
where c is the third quartile of a standard normal distribution. Secondly, the additive (\(X_{ij}^{a}\)) and dominant feature (\(X_{ij}^{d}\)) of each SNP were coded as follows,
Thirdly, we let the set j∈{100,200,300,400,500} contains the indices of truly influential SNPs, and the additive (\(\beta _{j}^{a}\)) and dominant coefficients (\(\beta _{j}^{d}\)) of the influential SNPs are given by Table 3.
Finally, the phenotype y was generated following Li et al.’s design [32],
where ε∼N(0,1). Note that X _{ ij } is the feature that we analyzed, therefore the Example 2 connected y and X _{ ij } indirectly by way of Eq. (6).
As can been seen from Table 4, BEIDC achieved the smallest average threshold, \(\bar {\hat {d}}=15.04\), which was about 1/5 of the fixed threshold. In addition, BEIDC had the minimum MSPE of 1.92.
Example 3
To increase the rigor of the above two examples, we 1) weakened the signal of the influential features; and 2) increased the number of influential features. The SNPs were generated similar as in Example 2, except the correlation structure between SNPs was a little more complex, cov(u _{ ij },u _{ ik })=0.2^{j−k};j,k=1,…,p. We fix the indices of ten truly influential SNPs from j∈{100,200,...,1000}. The phenotype y and the truly influential SNPs were directly connected using a similar model as Example 1, but we took the indicator function to accommodate the categorical features, \(y_{i} = \sum _{j} (\beta _{j} I(X_{ij} = 1) + 2\beta _{j} I(X_{ij} = 2)) + \epsilon \), where ε∼N(0,1). The coefficient β _{ j } was randomly generated from Uniform(2,3), where the magnitudes of these coefficients were much weaker than those used in Example 1. Example 3 was simulated for p=2,000 SNPs, with ten influential and 1990 features as noise.
Table 5 summarizes the simulation results and the comparisons of fixed and adaptive thresholds for Example 3. If using a threshold of 37 (or 74), the DCSIS approach only achieved a power as low as 12% (or 44%). Apparently, some feature like X _{10} seems to have very weak signal and trapped the DCSIS. The results of IDCSIS are dramatically better than those of the DCSIS. However, the power of IDCSIS is only 92% if using a fixed threshold of [ n/logn]=37. It indicates that IDCSIS includes 16 more features in average than the \(\bar {\hat {d}}=21.35\) estimated by BEIDC, but was still unable to simultaneously detecting all ten influential features among all 100 replicates. To achieve a 100% power as the BEIDC did, the IDCSIS had to increase the threshold to 2[ n/logn]=74, but it sacrificed 53 more unnecessary noise features on average than BEIDC.
Analysis of Arabidopsis data
The BEIDC procedure was applied to select the most influential SNPs for a continuous trait of the Arabidopsis thaliana diseaseresistance phenotype, lesioning and FRIGIDA expression (FRI), with 164 inbred lines and 216,130 SNPs covering the entire genome. These data are publicly available from the link (http://arabidopsis.usc.edu). Two traditional statistical models have been implemented on this same dataset, i.e., the nonparametric Wilcoxon ranksum test and a linear mixed model implemented in EMMA (Supplementary material of Atwell et al. [23]).
The four influential polymorphisms that were selected by BEIDC are summarized in Table 6. Using the Arabidopsis Genome Initiative (AGI) genetic map and the Arabidopsis information resource (TAIR.org, verified on 9/5/2016) GBrowse database, we matched our significant findings with three genes. The rank 1 SNP lay approximately 217 bp away from the FRIGIDA region (269026  271503 bp). Like all other approaches working on this same dataset, the BEIDC procedure also identified exactly the same position (Chr 4, 268809 bp) as reported by current literature. This SNP was associated with the highest peak on the Manhattan plot by all approaches, including BEIDC. Besides the unanimous agreement on this peak, we detected three new positions that were not found by these traditional approaches for the same dataset. In particular, the rank 4 SNP lay exactly within the single large exon of FRIGIDA or FLOWERING LOCUS A (FLA) region (269026  271503 bp), which locates the expected gene more closely than the previously discussed highest peak. Specifically, FRI encodes a major determinant of natural variation in Arabidopsis flowering time and has been reported to regulate this lesioning and FRIGIDA expression trait under analysis. Additionally, we also found that the rank 2 and rank 3 SNPs lay within exon4 and intron2 in the neighboring RNAHELICASELIKE 8 (RH8) gene (274257  278737 bp), respectively.
Figure 1 a illustrates the MSPE plot versus the number of SNPs, covering the entire search region, during which the backward elimination removed SNPs from 216,130 to 1 iteratively at a drop rate of 50%. We observed that the minimum MSPE is achieved at a very crowded region where the number of SNPs is small. After magnifying this crowded region (Fig. 1 b), the optimal spot \(\hat {d}=4\) is recognized. Figure 2 demonstrates the Manhattan plot of the continuous FRI trait along the entire genome, based on the iterative Dcorr values of 216,130 SNPs against their individual physical chromosomal position. Unlike the traditional Manhattan plot, we did not use the pvalue and −log(0.05) horizontal line to determine the significance. Instead, we used the adaptive threshold determined by the BEIDC procedure (marked in yellow triangle). Unlike the regular Manhattan plot having dense signals, Fig. 2 looks very sparse. We knew that the whole genome data had low signaltonoise ratio and expected that the majority of the noise SNPs could be dramatically filtered out by the proposed BEIDC approach. We also knew that IDCSIS iteratively regressed unselected SNPs on selected SNPs and regained information from the residuals. It was also known that the residuals contain much weaker information than the original data. This explains why the majority of noise SNPs were close to 0 and why Fig. 2 looks sparse.
Discussion
As the level of difficulty increased from Example 1 to Example 3, the average thresholds \(\bar {\hat {d}}\) estimated by BEIDC automatically became larger, causing the power to approach 100%. Comparing the different results of the three simulation examples, we confirmed that BEIDC was indeed able to flexibly and adaptively adjust its estimated threshold value according to the specific scenarios of different datasets. On the contrary, the fixed threshold approaches failed to give a clear and flexible threshold determination. For the real data analysis, where the truth is never known, this datadriven ability is crucial. If a threshold is set too small, influential SNPs will be neglected and power will be decreased (see d=37 of Table 5); on the contrary, if the threshold is set too large, too much of the noise will be mistreated as influential features (i.e., false discovery) (see Tables 1 and 4).
BEIDC works well for SNPs with either binary feature (see Additional file 1: Supplementary example 3 and Table S3) or categorical feature (see Example 2 and Table 4; Example 3 and Table 5, and Additional file 1: Supplementary example 2 and Table S2), as well as continuous features (see Example 1 and Table 1) such as age or BMI, among others [5, 34]. As for the phenotype, BEIDC approach proposed in this article is mainly targeted for continuous phenotype/trait, hence the models, simulations, and real data all focus on continuous traits. However, we tried one simulation study with a categorical phenotype and noticed that the results were also nice (see Additional file 1: Supplementary example 1 and Table S1). We will leave this categorical trait for future examination. This article focuses on the selection of a small subset of influential genes that still achieve sufficient prediction accuracy for new observations, which is a common interest in plant breeding in crop, plant, and cattle species or disease diagnosis and prevention in clinical practices [4, 35–39]. For the case of several hundreds of influential SNPs all individually having small effect, the proposed BEIDC may not be feasible and we will consider it in future work.
Conclusion
This article proposes a BEIDC procedure with the aims of (1) selecting the smallest possible set of influential genes from a big pool that are not only associated with the analyzed phenotype, but also enable accurate prediction for new observations; and (2) determining an adaptive threshold effectively separating influential SNPs from the noise SNPs. An approach with accurate prediction capability will make the results obtained in one dataset be reproduced better in a different dataset. The difficulties that BEIDC overcomes are as follows: issues of ultrahigh dimensionality when the number of SNPs is in the tens of thousands or even in the millions, but the number of observations is in the hundreds or the thousands; detecting signals from a sparse structure (i.e., signaltonoise ratio is very weak); and detecting truly important SNPs that are confounded by noise due to strong linkage disequilibrium.
Abbreviations
 BEIDC:

Backward elimination iterative distance correlation
 Dcorr:

Distance correlation
 DCSIS for distance correlation based sure independence feature screening; FRI:

FRIGIDA expression
 IDCSIS:

Iterative distance correlation feature screening approach
 LD:

Linkage disequilibrium
 MSPE:

Mean square prediction error
 OOB:

Out of bag observations
 RH8:

RNAHELICASELIKE 8
 SCAD:

Smoothly clipped absolute deviation
 SNP stands:

single nucleotide polymorphisms
References
Risch NJ. Searching for genetic determinants in the new millennium. Nature. 2000; 405(6788):847–56.
Meuwissen T, Hayes B, Goddard M, et al.Prediction of total genetic value using genomewide dense marker maps. Genetics. 2001; 157(4):1819–1829.
Merelli I, Calabria A, Cozzi P, Viti F, Mosca E, Milanesi L. Snpranker 2.0: a genecentric data mining tool for diseases associated snp prioritization in gwas. BMC Bioinforma. 2013; 14(Suppl 1):9.
Spindel J, Begum H, Akdemir D, Virk P, Collard B, Redoña E, Atlin G, Jannink JL, McCouch SR. Genomic selection and association mapping in rice (oryza sativa): effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines. PLoS Genet. 2015; 11(2):1004982.
Ueki M, Tamiya G. Smooththreshold multivariate genetic prediction with unbiased model selection. Genet Epidemiol. 2016; 40(3):233–43.
De Bakker PI, Yelensky R, Pe’er I, Gabriel SB, Daly MJ, Altshuler D. Efficiency and power in genetic association studies. Nat Genet. 2005; 37(11):1217–1223.
Goldstein DB, Cavalleri GL. Genomics: understanding human diversity. Nature. 2005; 437(7063):1241–1242.
Fan J, Li R. Statistical challenges with high dimensionality: Feature selection in knowledge discovery. 2006. https://arxiv.org/abs/math/0602133.
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol. 2008; 70(5):849–911.
Fan J, Samworth R, Wu Y. Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res. 2009; 10:2013–038.
Hall P, Miller H. Using generalized correlation to effect variable selection in very high dimensional problems. J Comput Graph Stat. 2009; 18(3):533–50.
Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultrahighdimensional additive models. J Am Stat Assoc. 2011; 106(494):544–57.
Zhu LP, Li L, Li R, Zhu LX. Modelfree feature screening for ultrahighdimensional data. J Am Stat Assoc. 2011; 106(496):1464–75.
Zhao SD, Li Y. Principled sure independence screening for cox models with ultrahighdimensional covariates. J Multivar Anal. 2012; 105(1):397–411.
Li G, Peng H, Zhang J, Zhu L, et al.Robust rank correlation based screening. Ann Stat. 2012; 40(3):1846–1877.
Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. J Am Stat Assoc. 2012; 107(499):1129–39.
Liu J, Li R, Wu R. Feature selection for varying coefficient models with ultrahighdimensional covariates. J Am Stat Assoc. 2014; 109(505):266–74.
Huang D, Li R, Wang H. Feature screening for ultrahigh dimensional categorical data with applications. J Bus Econ Stat. 2014; 32(2):237–44.
Liu J, Zhong W, Li R. A selective overview of feature screening for ultrahighdimensional data. Sci China Math. 2015; 58(10):1–22.
Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002; 97(457):77–87.
Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004; 20(15):2429–437.
Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal. 2005; 48(4):869–85.
Atwell S, Huang YS, Vilhjálmsson BJ, Willems G, Horton M, Li Y, Meng D, Platt A, Tarone AM, Hu TT, et al.Genomewide association study of 107 phenotypes in arabidopsis thaliana inbred lines. Nature. 2010; 465(7298):627–31.
Takahashi Y, Kou I, Takahashi A, Johnson TA, Kono K, Kawakami N, Uno K, Ito M, Minami S, Yanagida H, et al.A genomewide association study identifies common variants near lbx1 associated with adolescent idiopathic scoliosis. Nat Genet. 2011; 43(12):1237–1240.
Barrett JH, Iles MM, Harland M, Taylor JC, Aitken JF, Andresen PA, Akslen LA, Armstrong BK, Avril MF, Azizi E, et al.Genomewide association study identifies three new melanoma susceptibility loci. Nat Genet. 2011; 43(11):1108–1113.
Korte A, Farlow A. The advantages and limitations of trait analysis with gwas: a review. Plant Methods. 2013; 9(1):1.
Zhong W, Zhu L. An iterative approach to distance correlationbased sure independence screening. J Stat Comput Simul. 2014; 85(11):2331–45.
Kong J, Wang S, Wahba G. Using distance covariance for improved variable selection with application to learning genetic risk models. Stat Med. 2015; 34(10):1708–1720.
Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J. Selecting snps in twostage analysis of disease association data: a modelfree approach. Ann Hum Genet. 2000; 64(5):413–7.
Székely GJ, Rizzo ML, Bakirov NK, et al.Measuring and testing dependence by correlation of distances. Ann Stat. 2007; 35(6):2769–794.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001; 96(456):1348–1360.
Li J, Das K, Fu G, Li R, Wu R. The bayesian lasso for genomewide association studies. Bioinformatics. 2011; 27(4):516–23.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol. 1996; 58(1):267–88.
Ueki M, Tamiya G. Ultrahighdimensional variable selection method for wholegenome genegene interaction analysis. BMC Bioinforma. 2012; 13(1):72.
DíazUriarte R, De Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinforma. 2006; 7(1):3.
Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol J. 2017; 15:104–16.
Tang W, Wan S, Zou Q. Tumor origin detection with tissuespecific mirna and dna methylation markers. bioRxiv. 2016. https://doi.org/10.1101/090746.
Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002; 46(n):389–422.
Chen KH, Wang KJ, Tsai ML, Wang KM, Adrian AM, Cheng WC, Yang TS, Teng NC, Tan KP, Chang KS. Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm. BMC Bioinforma. 2014; 15(1):1.
Acknowledgements
We thank Randall Reese for revising the English and improving the writing. We also thank Wei Zhong for clarifying some questions related to the iterative distance correlation.
Funding
This work was supported by a grant from the National Science Foundation (DMS1413366) to GF. The funding body played no role in the design, writing or decision to publish this manuscript.
Availability of data and materials
The data that we analyzed is publicly available from the link (http://arabidopsis.usc.edu). All other data is presented in the main paper.
Authors’ contributions
GF conceived the research and wrote the manuscript; GW created the first version of the programming code; GF and XD revised the programming code and performed the real data analyses; XD and GW performed simulations; XD created all figures. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable. We have no human or animal data involved.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Authors and Affiliations
Corresponding author
Additional file
Additional file 1
Supplementary examples. (PDF 122 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Fu, G., Wang, G. & Dai, X. An adaptive threshold determination method of feature screening for genomic selection. BMC Bioinformatics 18, 212 (2017). https://doi.org/10.1186/s1285901716179
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901716179