An adaptive threshold determination method of feature screening for genomic selection
 Guifang Fu^{1}Email author,
 Gang Wang^{1} and
 Xiaotian Dai^{1}
DOI: 10.1186/s12859-017-1617-9
© The Author(s) 2017
Received: 9 November 2016
Accepted: 28 March 2017
Published: 12 April 2017
Abstract
Background
Although the dimension of the entire genome can be extremely large, only a parsimonious set of influential SNPs is correlated with a particular complex trait and is important for predicting that trait. Efficiently and accurately selecting these influential SNPs from millions of candidates is in high demand, but poses challenges. We propose a backward elimination iterative distance correlation (BEIDC) procedure to select the smallest subset of SNPs that guarantees sufficient prediction accuracy, while also resolving the unclear threshold issue of traditional feature screening approaches.
Results
Verified through six simulations, the adaptive threshold estimated by BEIDC performed uniformly better than the fixed thresholds that have been used in the current literature. We also applied BEIDC to an Arabidopsis thaliana genome-wide dataset. Out of 216,130 SNPs, BEIDC selected four influential SNPs and confirmed the same FRIGIDA gene that was reported by two other traditional methods.
Conclusions
BEIDC accommodates both the prediction accuracy and the computational speed that are highly demanded in genomic selection.
Keywords
Genomic selection; Feature screening; Backward elimination; FRIGIDA expression
Background
Genomic selection is an important task for increasing the efficiency of plant breeding, disease diagnosis, personalized medicine, and genotyping chip design. Genomic selection is improved by identifying a small subset of influential single nucleotide polymorphisms (SNPs) from high-dimensional genetic information to efficiently predict an individual's phenotype [1–5]. The rapid development of high-throughput genomic technologies, such as whole-genome genotyping, next-generation sequencing, gene expression microarrays, and RNA-seq, has dramatically boosted the landscape and power of genomic selection [6, 7], while nevertheless bringing unprecedented challenges for statistical modeling.
Feature screening has been receiving extensive attention as a powerful approach to handling ultra-high dimensional data, defined as p=exp(n^{ζ}) for some ζ>0, where p is the number of features and n is the number of observations [8–19]. Specifically, Li et al. developed a distance correlation based sure independence feature screening (DCSIS) strategy that defines an association strength measure for each feature based on its distance correlation with the phenotype [16]. DCSIS theoretically satisfies the sure screening property, ranks the features from most important to least important by decreasing distance correlation values, and filters out the majority of noise features, which have low values of the association strength measure. A very attractive property of DCSIS is that it effectively captures both linear and nonlinear associations between a feature and the phenotype, and it is feasible for binary, continuous, and categorical features and phenotypes, without assuming any specific model structure, distribution, or data type. In addition, DCSIS outperforms the traditional sure independence screening (SIS) [9] and sure independent ranking and screening (SIRS) [13] approaches. Therefore, DCSIS has great potential for serving genomic selection and recognizing the truly influential SNPs among millions of candidates covering the entire genome.
However, a limitation that restricts the application of DCSIS and other feature screening approaches is the lack of a clear threshold determination to separate influential features from noise, which is crucial in genomic selection. Frequently, an arbitrary decision is made as to the number of genes to retain. For example, some papers used sheer empirical experience to select the 50 or 150 highest ranked genes [20–22]. Other traditional approaches selected the SNPs passing the threshold of −log(0.05/p) [23–26]. However, this approach requires that a p-value be accessible, and hence rules out any approach that does not compute a p-value. The current feature screening literature keeps the top d features having the highest rankings, where d is often computed as an integer multiple of [n/log(n)] (e.g., d_{1}=[n/log(n)], d_{2}=2[n/log(n)], or d_{3}=3[n/log(n)] is suggested) [9, 16, 17, 27]. These options may work well in some circumstances, but several drawbacks present themselves: 1) It is still not clear what d should be exactly. For a real data analysis, we have no way of knowing whether d_{1}, d_{2}, d_{3}, or an even larger value should be used. 2) A formula such as [n/log(n)] depends only on the sample size, and unreasonably neglects two indispensable considerations: the number of features and the signal-to-noise ratio. Assuming the sample size is fixed, it is unreasonable to set the same threshold for one dataset with 100 features and another with 100,000 features. For a dataset with a large number of features but a weak signal-to-noise ratio, the threshold may need to be relatively large, whereas for a much easier scenario, the threshold may be small. Kong et al. recognized these limitations and proposed a theorem to adaptively determine the threshold for DCSIS, pioneering threshold determination research [28]. However, they assumed that the noise SNPs were completely independent of the influential SNPs and of the phenotype, which may not hold in genome-wide datasets.
This article extends the work of Zhong et al. [27] and proposes a backward elimination iterative distance correlation (BEIDC) procedure that adaptively and automatically determines an optimal threshold for genomic selection while guaranteeing prediction accuracy. A smoothly clipped absolute deviation (SCAD) penalized regression model fitted to bootstrap samples is used to compute the mean square prediction error (MSPE). After iterative DCSIS ranks the SNPs from most important to least important, a certain percentage of SNPs (controlled by the drop rate) is backward eliminated at each step, and the point at which the minimum MSPE is attained determines the final threshold. One aim is to optimize the threshold, which is not trivial. If the standard is too stringent and the number of SNPs selected is too small, we may fail to cover all influential SNPs, and hence the power will be reduced; whereas if the rule is too liberal and the number of SNPs selected is too large, too much of the noise will be mistaken for influential features, and hence the false discovery rate will be inflated [29]. Another aim is to obtain the smallest possible set of SNPs that still achieves acceptable prediction accuracy. The proposed BEIDC realizes these two aims, serves as a good genomic selection procedure for ultra-high dimensional genome-wide datasets, and overcomes the limitations of previous feature screening approaches, boosting the potential of feature screening to open a new horizon for genomic selection.
We explored the performance of the BEIDC approach using six different simulation settings and one real genome-wide dataset. We also compared the results of the adaptive threshold estimated by BEIDC with those of the fixed thresholds suggested by the current feature screening literature [9, 16, 17, 27]. The average thresholds estimated by BEIDC are uniformly lower than the fixed thresholds while still achieving 100% power. It is worth mentioning that BEIDC uses an average threshold as small as 5.54 to achieve 100% power for Example 1, which is only 7% of the fixed threshold, d=2[n/log(n)]=74, used in the current literature [9, 27]. In addition, BEIDC can flexibly and automatically adjust the threshold when the dataset poses harder conditions (see Example 3 and Additional file 1: Supplementary example 1). From these six simulations, we conclude that BEIDC shows uniformly excellent performance even when the signal-to-noise ratio is low (e.g., only four influential features are truly associated with the phenotype, and 4,996 features are noise), and when the number of features is much larger than the number of observations (e.g., p=5,000 and n=200). We also demonstrate that the BEIDC approach selects a very small set of SNPs for the Arabidopsis thaliana data. Here, only four SNPs are selected from a pool of 216,130 SNPs covering the entire genome, and the FRIGIDA gene, reported to be highly associated with the FRIGIDA expression trait being analyzed [23], is successfully picked out.
Methods
Iterative DCSIS
Székely et al. defined an association strength measure for each feature based on its distance correlation (Dcorr) with the phenotype and showed that the Dcorr of two random vectors equals zero if and only if the two random vectors are independent [30]. Li et al. proposed the DCSIS feature screening approach, ranked the SNPs from most important to least important by decreasing values of Dcorr, and proved a sure screening theorem to theoretically ensure that DCSIS will not miss any influential SNPs if the sample size is large enough [16].
Finally, the point estimator \(\widehat {Dcorr}(y,\mathbf {X}_{j})\) can be estimated by Eqs. (2), (3) and (4). We are then able to rank all SNPs, from the most influential to the least influential, by decreasing values of \(\widehat {Dcorr}(y,\mathbf {X}_{j}), j=1,\ldots,p\) [16]. Let \(\mathbf {X}_{p}=\{X^{*}_{k}, k=1,\ldots,p\}\) be the reordered SNPs, where the asterisk differentiates the k^{th}-ranked SNP after selection by DCSIS from the originally observed k^{th} SNP.
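As an illustration, the empirical distance correlation of Székely et al. [30] and the resulting marginal ranking can be sketched in a few lines of Python. This is a minimal sketch using the biased V-statistic estimator for 1-D samples; the function names are ours, not from any published code for this paper.

```python
import numpy as np

def dist_corr(x, y):
    """Empirical (biased, V-statistic) distance correlation of two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distance matrix of x
    b = np.abs(y[:, None] - y[None, :])   # pairwise distance matrix of y
    # double-center each distance matrix
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else np.sqrt(max(dcov2, 0.0) / denom)

def dcsis_rank(X, y):
    """Rank columns of X (n x p) by decreasing distance correlation with y."""
    scores = np.array([dist_corr(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-scores), scores
```

Because the measure is distance-based, a purely nonlinear signal such as y = x² is still picked up, which is the property that makes DCSIS attractive for genomic selection.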
While DCSIS is a very powerful feature selection approach for ultrahigh dimensional data, it may neglect some important SNPs that are marginally not relevant, but jointly associated with the phenotype; or, it may rank highly some noise SNPs that are spuriously correlated with the phenotype due to their strong linkage disequilibrium (LD) with other important SNPs. To overcome these shortcomings, Zhong et al. introduced an iterative distance correlation feature screening approach (IDCSIS) [27]. The main idea of IDCSIS is to iteratively regress unselected SNPs on selected SNPs, regain information from the residuals, and effectively break down the effects of correlation structure among SNPs.
DCSIS ranks all SNPs and obtains the set X _{ p } in a single step, while IDCSIS builds up X _{ p } gradually over several steps, i.e., \(\mathbf {X}_{p}=\mathbf {X}_{p1} \bigcup \mathbf {X}_{p2} \bigcup \ldots \bigcup \mathbf {X}_{pm}\), with p=p _{1}+p _{2}+…+p _{ m }, where X _{ pi } stands for the set of SNPs selected at the i ^{ th } iterative step, p _{ i } denotes the size of each set X _{ pi }, i=1,…,m, and m is the number of iterative steps. Zhong et al. claimed that a small number of iterations is adequate to guarantee good performance and suggested m=2, p _{1}=5, p _{2}=d−5, and d=2[n/log(n)] [27]. Note that this article aims to adaptively determine d without assuming it is given; hence we set m=3, p _{1}=3, p _{2}=3, and p _{3}=p−6 to rank all features. This setting worked well empirically in all simulations.

Step 1: Use DCSIS for y and X and select the first p _{1} features into X _{ p } (i.e. X _{ p }=X _{ p1}).

Step 2: Define \(\mathbf {X}_{r}=\{I_{n}-\mathbf {X}_{p}(\mathbf {X}_{p}^{T}\mathbf {X}_{p})^{-1}\mathbf {X}_{p}^{T}\}\mathbf {X}_{p}^{C}\), where \(\mathbf {X}_{p}^{C}\) is the complement set of X _{ p }. Use DCSIS for y and X _{ r } and select the second p _{2} features into X _{ p } (i.e. \(\mathbf {X}_{p}=\mathbf {X}_{p1} \bigcup \mathbf {X}_{p2}\)).

Step 3: Repeat Step 2 until all p features are ranked, i.e., \(\mathbf {X}_{p}=\mathbf {X}_{p1} \bigcup \mathbf {X}_{p2} \bigcup \ldots \bigcup \mathbf {X}_{pm}\), with p=p _{1}+p _{2}+…+p _{ m }. Note that the computational cost becomes prohibitively large if Step 2 is repeated too many times for a large number of SNPs. Additionally, the theoretical sure screening property may no longer hold if too many iterations are applied. To balance computational cost and accuracy, we selected only the first one hundred SNPs by IDCSIS and then applied DCSIS to all remaining SNPs. This combination worked well in a number of empirical studies (see the simulation section).
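The three steps above can be sketched as follows. This is a simplified illustration, not the authors' code: `dist_corr` is the empirical distance correlation, the least-squares projection implements the residual step of Step 2, and the step sizes are passed in by the caller.

```python
import numpy as np

def dist_corr(x, y):
    """Empirical distance correlation of two 1-D samples (helper)."""
    a = np.abs(x[:, None] - x[None, :]); b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else np.sqrt(max((A * B).mean(), 0.0) / denom)

def idcsis_order(X, y, step_sizes):
    """Return all column indices of X ordered by iterative DC-SIS.

    step_sizes, e.g. (3, 3), gives the number of features taken at each
    iterative step; all remaining features are ranked in a final pass.
    """
    n, p = X.shape
    selected, remaining = [], list(range(p))
    Xr = X.copy()
    for k in step_sizes:
        scores = [dist_corr(Xr[:, i], y) for i in range(len(remaining))]
        take = np.argsort(scores)[::-1][:k]           # top-k by Dcorr
        selected += [remaining[i] for i in take]
        remaining = [remaining[i] for i in range(len(remaining))
                     if i not in set(take)]
        # regress the unselected SNPs on the selected ones; keep residuals
        Xp = X[:, selected]
        H = np.eye(n) - Xp @ np.linalg.pinv(Xp.T @ Xp) @ Xp.T
        Xr = H @ X[:, remaining]
    # final pass: rank whatever is left by plain DC-SIS on the residuals
    scores = [dist_corr(Xr[:, i], y) for i in range(len(remaining))]
    selected += [remaining[i] for i in np.argsort(scores)[::-1]]
    return selected
```

The projection step is what breaks the correlation structure among SNPs: a noise SNP that ranks highly only through linkage disequilibrium with a selected SNP loses its spurious signal once the selected SNPs are regressed out.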
Backward elimination
Let d denote the threshold that we need to determine. Let \(\mathbf {X}_{\mathcal {C}}=\{X_{k}^{*}, k=1,\ldots,d\}\) be the subset of influential SNPs, i.e., the conditional distribution function of y depends on X merely through \(\mathbf {X}_{\mathcal {C}}\), and let \(\mathbf {X}_{\mathcal {N}}=\{X^{*}_{k}, k=d+1,\ldots,p\}\) be the set of noise SNPs, i.e., the complement set of \(\mathbf {X}_{\mathcal {C}}\). The goal of genomic selection is to remove \(\mathbf {X}_{\mathcal {N}}\) and pick out the subset \(\mathbf {X}_{\mathcal {C}}\). DCSIS is able to rank important features ahead of noise, but a genomic selection process cannot be finalized if d is not determined. The current feature screening literature suggests using a fixed threshold of d=[n/log(n)] ([·] is the nearest integer function) [16, 27]. But again, this has several limitations, as discussed in the Background section.
Starting from the biggest pool, X _{ p }, which contains all SNPs as reordered by IDCSIS, we discarded the noise SNPs through a backward elimination process over several iterations. At each iteration, we computed the MSPE based on the current pool, discarded a fixed proportion (the drop rate) of SNPs from the bottom of the ranking (i.e., those having the smallest Dcorr values), and then moved to the next iteration. The backward elimination considers all SNPs at the initial stage of the modeling to attenuate possible modeling biases.
SCAD penalized regression model
Two unknown tuning parameters, λ and α, are contained in the penalty function. As suggested by Fan et al. [31], α=3.7 is a good choice for various problems, and λ is selected by cross validation. This penalty function corresponds to a quadratic spline function with knots at λ and αλ. Besides its capability of handling ultra-high dimensional genome-wide data, the SCAD penalty function satisfies three properties that are important for genomic selection: it is singular at the origin, producing sparse solutions and shrinking unimportant parameters to zero to reduce model complexity; the resulting estimator is continuous, which retains stability in model prediction; and it is bounded by a constant, producing nearly unbiased estimates for large coefficients to avoid unnecessary modeling bias [31].
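The SCAD penalty of Fan and Li [31] is a quadratic spline with knots at λ and αλ (written below with `a` in place of α). The following sketch evaluates it piecewise, making the three properties visible: linear (hence singular) near the origin, a quadratic bridge in between, and constant beyond αλ.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|) of Fan & Li (2001), elementwise."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,                                             # L1 zone: sparsity
        np.where(
            t <= a * lam,
            -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),  # quadratic spline
            (a + 1) * lam**2 / 2,                            # constant: no bias
        ),
    )
```

Continuity at the two knots can be checked directly: the penalty equals λ² at t=λ and (α+1)λ²/2 at t=αλ from either side, and it never exceeds that constant, which is why large coefficients are left nearly unbiased.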
Scheme of BEIDC

Step 1: Rank all SNPs by IDCSIS, and obtain the reordered set (i.e., X _{ p }), where p is expected to be ultrahigh.

Step 2: Start from the biggest pool (size of p), X _{ p }, and compute the MSPE for the corresponding model.

Step 3: Remove a fixed proportion (the drop rate) of SNPs having the lowest Dcorr values, based on the ranks obtained from Step 1. Then compute the MSPE for the model corresponding to the current pool. For more details about the drop rate, please see the simulation section.

Step 4: Repeat Step 3 until the smallest pool (size of 1 at minimum) is reached.

Step 5: Draw a plot of the MSPE versus the number of SNPs and locate the model size for which the MSPE is minimized, as model size decreases from p to 1. Finally, the selected influential SNP set (i.e., \(\mathbf {X}_{\mathcal {C}}\)) and the adaptive threshold (i.e., \(\hat {d}\)) can be simultaneously determined from this optimal spot. The noise set \(\mathbf {X}_{\mathcal {N}}\) is already thrown away during the iterations of Steps 3 and 4.
The MSPE mentioned in Steps 2 and 3 above is computed as follows: draw 1,000 bootstrap samples with replacement; divide each bootstrap sample into training data (the observations drawn) and test data (the observations not drawn, also called out-of-bag (OOB) observations); fit the SCAD penalized regression model using the training data; predict for the test data; and then average the squared prediction errors over all bootstrap samples.
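Steps 2–5 can be sketched as below. To keep the sketch dependency-free, an ordinary least-squares fit stands in for the SCAD-penalized fit (so it assumes the current pool is smaller than n), and a smaller bootstrap count than the paper's 1,000 is used; both substitutions are ours.

```python
import numpy as np

def oob_mspe(X, y, n_boot=200, rng=None):
    """Average out-of-bag MSPE over bootstrap samples (OLS stand-in for SCAD)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(y)
    errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # out-of-bag observations
        if oob.size == 0:
            continue
        Xtr = np.column_stack([np.ones(idx.size), X[idx]])
        beta, *_ = np.linalg.lstsq(Xtr, y[idx], rcond=None)
        pred = np.column_stack([np.ones(oob.size), X[oob]]) @ beta
        errs.append(np.mean((y[oob] - pred) ** 2))
    return float(np.mean(errs))

def backward_eliminate(X, y, ranked, drop_rate=0.5, n_boot=200):
    """Steps 2-5: shrink the ranked pool, track MSPE, return the best pool."""
    pool = list(ranked)
    best_pool, best_mspe = list(pool), np.inf
    while pool:
        m = oob_mspe(X[:, pool], y, n_boot=n_boot)
        if m < best_mspe:
            best_pool, best_mspe = list(pool), m
        if len(pool) == 1:
            break
        pool = pool[: max(1, int(len(pool) * (1 - drop_rate)))]
    return best_pool, best_mspe
```

Because the pools are nested prefixes of the IDCSIS ranking, the minimizer of the MSPE curve simultaneously yields the selected set \(\mathbf {X}_{\mathcal {C}}\) and the adaptive threshold \(\hat {d}\).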
Following this BEIDC scheme, the prediction accuracy and the reproducibility of results on new datasets should be guaranteed because the minimum mean square prediction error is used. However, if a very small threshold is preferred, we suggest using the smallest number of SNPs whose MSPE is within one standard error (the 1 s.e. rule) of the minimum MSPE. In this case, the number of SNPs may be smaller than \(\hat {d}\), and the prediction error a little larger than the minimum MSPE value, but the MSPE will still lie within an acceptable range. However, the power may decrease and influential SNPs may be missed if this 1 s.e. rule is used. Therefore, unless a very small number of SNPs is preferred to save experimental cost in breeding or disease diagnosis applications, we suggest taking the threshold to be that for which the MSPE is minimized.
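The 1 s.e. rule can be stated compactly: given the mean and standard error of the MSPE at each candidate pool size, pick the smallest size whose mean MSPE does not exceed the minimum mean plus that minimizer's standard error. A sketch (the function name is ours):

```python
import numpy as np

def one_se_choice(sizes, mspe_mean, mspe_se):
    """Smallest pool size whose mean MSPE is within 1 s.e. of the minimum."""
    sizes = np.asarray(sizes)
    mean = np.asarray(mspe_mean)
    se = np.asarray(mspe_se)
    i_min = int(np.argmin(mean))          # location of the minimum MSPE
    cutoff = mean[i_min] + se[i_min]      # 1 s.e. band above the minimum
    return int(sizes[mean <= cutoff].min())
```

For instance, if pools of sizes 32, 16, 8, 4, 2 have mean MSPEs 2.0, 1.5, 1.0, 1.05, 1.8 with standard errors of 0.1, the minimum is at size 8, but size 4 is within one standard error and is chosen.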
Results
Simulation studies

Average Threshold \(\bar {\hat {d}}\): the average value of the adaptive thresholds estimated from the 100 simulation replicates;

Strict Power P _{ a }: the proportion of the 100 replicates for which all truly influential features are simultaneously selected by \(\hat {d}\);

Individual Power P _{ j }: the proportion of the 100 replicates for which each individual influential feature is selected by \(\hat {d}\) respectively.
We simulated 100 replicate datasets consisting of n=200 observations and p=5,000 features for Examples 1–2 and Additional file 1: Supplementary examples 1–2. We simulated 100 replicate datasets consisting of n=200 observations and p=2,000 features for Example 3 and Additional file 1: Supplementary example 3.
Example 1
where the features were generated from a normal distribution with zero mean, unit variance, and a covariance structure following an AR(1) design with autocorrelation parameter ρ=0.5. Model (5) sets the first four features as the influential ones and all remaining 4,996 features as noise. The white noise was generated from the standard normal distribution, and the continuous phenotype y was generated from Model (5) accordingly.
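Since Model (5) itself is not reproduced here, the sketch below generates data of the same shape under an assumed linear model in the first four features; the coefficients are hypothetical placeholders, not those of Model (5). The AR(1) recursion yields corr(X_j, X_k)=ρ^{|j−k|} with unit marginal variances.

```python
import numpy as np

def simulate_example1(n=200, p=5000, rho=0.5, rng=None):
    """AR(1)-correlated Gaussian features plus a linear phenotype (sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    Z = rng.normal(size=(n, p))
    X = np.empty((n, p))
    X[:, 0] = Z[:, 0]
    for j in range(1, p):                    # stationary AR(1) across columns
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * Z[:, j]
    beta = np.array([2.0, 1.5, 1.5, 2.0])    # hypothetical coefficients
    y = X[:, :4] @ beta + rng.normal(size=n) # first four features influential
    return X, y
```

The recursion is an O(np) alternative to a Cholesky factorization of the full p×p AR(1) covariance matrix, which matters when p=5,000.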
Strict and individual statistical power for methods using fixed or adaptive thresholds for Example 1
Methods  Average \(\hat {d}\)  P _{ a }  P _{1}  P _{2}  P _{3}  P _{4}  Average MSPE 

DCSIS (d=74)  74  0%  100%  100%  100%  0%  6.91 
IDCSIS (d=74)  74  100%  100%  100%  100%  100%  1.82 
BEIDC  5.54  100%  100%  100%  100%  100%  1.04 
Strict power and average threshold for BEIDC approach under different drop rates
Drop Rate  Average \(\hat {d}\)  P _{ a } 

50%  5.54  100% 
40%  5.45  100% 
30%  5.34  100% 
20%  5.39  100% 
10%  5.33  100% 
Example 2
Genetic effects of 5 assumed SNPs in Example 2
Position  Additive (\(\beta _{j}^{a}\))  Dominant (\(\beta _{j}^{d}\)) 

100  1.2  0.8 
200  1.2  0.4 
300  1.2  0.8 
400  0.8  1.2 
500  1.0  1.2 
where ε∼N(0,1). Note that X _{ ij } is the feature that we analyzed; therefore, Example 2 connects y and X _{ ij } indirectly by way of Eq. (6).
Strict and individual statistical power for methods using fixed or adaptive thresholds for Example 2
Methods  Average \(\hat {d}\)  P _{ a }  P _{100}  P _{200}  P _{300}  P _{400}  P _{500}  Average MSPE 

DCSIS (d=74)  74  97%  100%  99%  100%  99%  99%  3.67 
IDCSIS (d=74)  74  100%  100%  100%  100%  100%  100%  2.70 
BEIDC  15.04  100%  100%  100%  100%  100%  100%  1.92 
Example 3
To increase the rigor of the above two examples, we 1) weakened the signal of the influential features and 2) increased the number of influential features. The SNPs were generated similarly to Example 2, except that the correlation structure between SNPs was a little more complex, cov(u _{ ij },u _{ ik })=0.2^{|j−k|}; j,k=1,…,p. We fixed the indices of the ten truly influential SNPs at j∈{100,200,...,1000}. The phenotype y and the truly influential SNPs were directly connected using a model similar to Example 1, but we used the indicator function to accommodate the categorical features, \(y_{i} = \sum _{j} (\beta _{j} I(X_{ij} = 1) + 2\beta _{j} I(X_{ij} = 2)) + \epsilon \), where ε∼N(0,1). The coefficient β _{ j } was randomly generated from Uniform(2,3), the magnitudes of these coefficients being much weaker than those used in Example 1. Example 3 was simulated with p=2,000 SNPs, ten of them influential and the remaining 1,990 noise.
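The Example 3 phenotype model can be coded directly from the formula above. In this sketch the genotype matrix G (coded 0/1/2) and the set of influential columns are supplied by the caller; the function name is ours.

```python
import numpy as np

def simulate_example3_y(G, influential, rng=None):
    """Phenotype under the Example 3 model for genotypes G coded 0/1/2:
    y_i = sum_j [ beta_j I(G_ij = 1) + 2 beta_j I(G_ij = 2) ] + eps_i,
    with beta_j ~ Uniform(2, 3) over influential columns and eps_i ~ N(0, 1)."""
    rng = np.random.default_rng(0) if rng is None else rng
    G = np.asarray(G)
    beta = rng.uniform(2, 3, size=len(influential))   # weak random effects
    Gi = G[:, influential]
    signal = (beta * (Gi == 1)).sum(axis=1) + (2 * beta * (Gi == 2)).sum(axis=1)
    return signal + rng.normal(size=G.shape[0])       # add N(0,1) noise
```

With all genotypes homozygous reference (all zeros), the phenotype reduces to pure noise; with all genotypes coded 2, each influential SNP contributes 2β_j, which is a quick sanity check on the indicator coding.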
Strict and individual statistical power for methods using fixed or adaptive thresholds for Example 3
Methods  Average \(\hat {d}\)  P _{ a }  P _{100}  P _{200}  P _{300}  P _{400}  P _{500} 
DCSIS (d=37)  37  12%  100%  83%  93%  100%  100% 
DCSIS (d=74)  74  44%  100%  99%  100%  100%  100% 
IDCSIS (d=37)  37  92%  100%  100%  98%  100%  100% 
IDCSIS (d=74)  74  100%  100%  100%  100%  100%  100% 
BEIDC  21.35  100%  100%  100%  100%  100%  100% 
Methods  P _{600}  P _{700}  P _{800}  P _{900}  P _{1000}  Average MSPE  
DCSIS (d=37)  100%  81%  99%  100%  22%  4.23  
DCSIS (d=74)  100%  98%  100%  100%  46%  3.68  
IDCSIS (d=37)  100%  95%  99%  100%  100%  1.44  
IDCSIS (d=74)  100%  100%  100%  100%  100%  1.84  
BEIDC  100%  100%  100%  100%  100%  1.17 
Analysis of Arabidopsis data
The BEIDC procedure was applied to select the most influential SNPs for a continuous trait of the Arabidopsis thaliana disease-resistance phenotype, lesioning and FRIGIDA expression (FRI), with 164 inbred lines and 216,130 SNPs covering the entire genome. These data are publicly available at http://arabidopsis.usc.edu. Two traditional statistical models have been applied to this same dataset, i.e., the nonparametric Wilcoxon rank-sum test and a linear mixed model implemented in EMMA (Supplementary material of Atwell et al. [23]).
Influential SNPs selected by BEIDC based on AGI physical map (TAIR.org)
Rank  Chr  SNP pos (bp)  Gene  Distance to gene (bp) 

1  4  268809  FRI or FLA  217 
2  4  276143  RH8  0 
3  4  275349  RH8  0 
4  4  269260  FRI or FLA  0 
Discussion
As the level of difficulty increased from Example 1 to Example 3, the average thresholds \(\bar {\hat {d}}\) estimated by BEIDC automatically became larger, keeping the power at 100%. Comparing the results of the three simulation examples, we confirmed that BEIDC was indeed able to flexibly and adaptively adjust its estimated threshold value according to the specific scenario of each dataset. In contrast, the fixed threshold approaches failed to give a clear and flexible threshold determination. For real data analysis, where the truth is never known, this data-driven ability is crucial. If a threshold is set too small, influential SNPs will be neglected and power will be decreased (see d=37 in Table 5); on the contrary, if the threshold is set too large, too much of the noise will be mistaken for influential features (i.e., false discoveries) (see Tables 1 and 4).
BEIDC works well for SNPs with binary features (see Additional file 1: Supplementary example 3 and Table S3) or categorical features (see Example 2 and Table 4; Example 3 and Table 5; and Additional file 1: Supplementary example 2 and Table S2), as well as continuous features (see Example 1 and Table 1) such as age or BMI, among others [5, 34]. As for the phenotype, the BEIDC approach proposed in this article mainly targets continuous phenotypes/traits; hence the models, simulations, and real data all focus on continuous traits. However, we tried one simulation study with a categorical phenotype and found that the results were also promising (see Additional file 1: Supplementary example 1 and Table S1). We leave a full treatment of categorical traits for future examination. This article focuses on the selection of a small subset of influential genes that still achieves sufficient prediction accuracy for new observations, which is a common interest in plant breeding in crop, plant, and cattle species and in disease diagnosis and prevention in clinical practice [4, 35–39]. For the case of several hundred influential SNPs, each individually having a small effect, the proposed BEIDC may not be feasible; we will consider this in future work.
Conclusion
This article proposes the BEIDC procedure with the aims of (1) selecting the smallest possible set of influential genes from a big pool that are not only associated with the analyzed phenotype but also enable accurate prediction for new observations; and (2) determining an adaptive threshold that effectively separates influential SNPs from noise SNPs. An approach with accurate prediction capability makes results obtained on one dataset more reproducible on another. The difficulties that BEIDC overcomes are as follows: the ultra-high dimensionality that arises when the number of SNPs is in the tens of thousands or even the millions while the number of observations is in the hundreds or thousands; detecting signals from a sparse structure (i.e., a very weak signal-to-noise ratio); and detecting truly important SNPs that are confounded with noise due to strong linkage disequilibrium.
Abbreviations
BEIDC: Backward elimination iterative distance correlation
Dcorr: Distance correlation
DCSIS: Distance correlation based sure independence feature screening
FRI: FRIGIDA expression
IDCSIS: Iterative distance correlation feature screening
LD: Linkage disequilibrium
MSPE: Mean square prediction error
OOB: Out of bag
RH8: RNA-HELICASE-LIKE 8
SCAD: Smoothly clipped absolute deviation
SNP: Single nucleotide polymorphism
Declarations
Acknowledgements
We thank Randall Reese for revising the English and improving the writing. We also thank Wei Zhong for clarifying some questions related to the iterative distance correlation.
Funding
This work was supported by a grant from the National Science Foundation (DMS1413366) to GF. The funding body played no role in the design, writing or decision to publish this manuscript.
Availability of data and materials
The data that we analyzed are publicly available at http://arabidopsis.usc.edu. All other data are presented in the main paper.
Authors’ contributions
GF conceived the research and wrote the manuscript; GW created the first version of the programming code; GF and XD revised the programming code and performed the real data analyses; XD and GW performed simulations; XD created all figures. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable. We have no human or animal data involved.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
1. Risch NJ. Searching for genetic determinants in the new millennium. Nature. 2000; 405(6788):847–56.
2. Meuwissen T, Hayes B, Goddard M, et al. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001; 157(4):1819–29.
3. Merelli I, Calabria A, Cozzi P, Viti F, Mosca E, Milanesi L. SNPranker 2.0: a gene-centric data mining tool for disease-associated SNP prioritization in GWAS. BMC Bioinforma. 2013; 14(Suppl 1):9.
4. Spindel J, Begum H, Akdemir D, Virk P, Collard B, Redoña E, Atlin G, Jannink JL, McCouch SR. Genomic selection and association mapping in rice (Oryza sativa): effect of trait genetic architecture, training population composition, marker number and statistical model on accuracy of rice genomic selection in elite, tropical rice breeding lines. PLoS Genet. 2015; 11(2):1004982.
5. Ueki M, Tamiya G. Smooth-threshold multivariate genetic prediction with unbiased model selection. Genet Epidemiol. 2016; 40(3):233–43.
6. De Bakker PI, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D. Efficiency and power in genetic association studies. Nat Genet. 2005; 37(11):1217–23.
7. Goldstein DB, Cavalleri GL. Genomics: understanding human diversity. Nature. 2005; 437(7063):1241–2.
8. Fan J, Li R. Statistical challenges with high dimensionality: feature selection in knowledge discovery. 2006. https://arxiv.org/abs/math/0602133.
9. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol. 2008; 70(5):849–911.
10. Fan J, Samworth R, Wu Y. Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res. 2009; 10:2013–38.
11. Hall P, Miller H. Using generalized correlation to effect variable selection in very high dimensional problems. J Comput Graph Stat. 2009; 18(3):533–50.
12. Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc. 2011; 106(494):544–57.
13. Zhu LP, Li L, Li R, Zhu LX. Model-free feature screening for ultrahigh-dimensional data. J Am Stat Assoc. 2011; 106(496):1464–75.
14. Zhao SD, Li Y. Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivar Anal. 2012; 105(1):397–411.
15. Li G, Peng H, Zhang J, Zhu L, et al. Robust rank correlation based screening. Ann Stat. 2012; 40(3):1846–77.
16. Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. J Am Stat Assoc. 2012; 107(499):1129–39.
17. Liu J, Li R, Wu R. Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J Am Stat Assoc. 2014; 109(505):266–74.
18. Huang D, Li R, Wang H. Feature screening for ultrahigh dimensional categorical data with applications. J Bus Econ Stat. 2014; 32(2):237–44.
19. Liu J, Zhong W, Li R. A selective overview of feature screening for ultrahigh-dimensional data. Sci China Math. 2015; 58(10):1–22.
20. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002; 97(457):77–87.
21. Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004; 20(15):2429–37.
22. Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal. 2005; 48(4):869–85.
23. Atwell S, Huang YS, Vilhjálmsson BJ, Willems G, Horton M, Li Y, Meng D, Platt A, Tarone AM, Hu TT, et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature. 2010; 465(7298):627–31.
24. Takahashi Y, Kou I, Takahashi A, Johnson TA, Kono K, Kawakami N, Uno K, Ito M, Minami S, Yanagida H, et al. A genome-wide association study identifies common variants near LBX1 associated with adolescent idiopathic scoliosis. Nat Genet. 2011; 43(12):1237–40.
25. Barrett JH, Iles MM, Harland M, Taylor JC, Aitken JF, Andresen PA, Akslen LA, Armstrong BK, Avril MF, Azizi E, et al. Genome-wide association study identifies three new melanoma susceptibility loci. Nat Genet. 2011; 43(11):1108–13.
26. Korte A, Farlow A. The advantages and limitations of trait analysis with GWAS: a review. Plant Methods. 2013; 9(1):1.
27. Zhong W, Zhu L. An iterative approach to distance correlation-based sure independence screening. J Stat Comput Simul. 2014; 85(11):2331–45.
28. Kong J, Wang S, Wahba G. Using distance covariance for improved variable selection with application to learning genetic risk models. Stat Med. 2015; 34(10):1708–20.
29. Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J. Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann Hum Genet. 2000; 64(5):413–7.
30. Székely GJ, Rizzo ML, Bakirov NK, et al. Measuring and testing dependence by correlation of distances. Ann Stat. 2007; 35(6):2769–94.
31. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001; 96(456):1348–60.
32. Li J, Das K, Fu G, Li R, Wu R. The Bayesian lasso for genome-wide association studies. Bioinformatics. 2011; 27(4):516–23.
33. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol. 1996; 58(1):267–88.
34. Ueki M, Tamiya G. Ultrahigh-dimensional variable selection method for whole-genome gene-gene interaction analysis. BMC Bioinforma. 2012; 13(1):72.
35. Díaz-Uriarte R, De Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinforma. 2006; 7(1):3.
36. Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol J. 2017; 15:104–16.
37. Tang W, Wan S, Zou Q. Tumor origin detection with tissue-specific miRNA and DNA methylation markers. bioRxiv. 2016. https://doi.org/10.1101/090746.
38. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002; 46(1–3):389–422.
39. Chen KH, Wang KJ, Tsai ML, Wang KM, Adrian AM, Cheng WC, Yang TS, Teng NC, Tan KP, Chang KS. Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm. BMC Bioinforma. 2014; 15(1):1.