KNN-MDR: a learning approach for improving interactions mapping performances in genome wide association studies

Background Finding epistatic interactions in large association studies such as genome-wide association studies (GWAS), given the large volumes of genomic data now available, is a challenging and largely unsolved issue. Few previous studies could handle genome-wide data, because of the difficulty of searching a combinatorially explosive search space and of statistically evaluating epistatic interactions with a limited number of samples. Our work is a contribution to this field. We propose a novel approach combining K-Nearest Neighbors (KNN) and Multi Dimensional Reduction (MDR) methods for detecting gene-gene interactions as a possible alternative to existing algorithms, especially in situations where the number of involved determinants is high. After describing the approach, we compare our method (KNN-MDR) with a set of other top-performing methods (i.e., MDR, BOOST, BHIT, MegaSNPHunter and AntEpiSeeker) for detecting interactions, using simulated data as well as real genome-wide data.

Results Experimental results on both simulated data and real genome-wide data show that KNN-MDR has interesting properties in terms of accuracy and power, and that, in many cases, it significantly outperforms its recent competitors.

Conclusions The presented methodology (KNN-MDR) is valuable in the context of loci and interactions mapping and can be seen as an interesting addition to the arsenal used in complex traits analyses.

Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1599-7) contains supplementary material, which is available to authorized users.


Introduction
KNN MDR is a Fortran 90 program implementing the KNN MDR methodology detailed in (Al Chamlat S. and Farnir F., 2014). The aim of the program is to help obtain clues about the position of the genes jointly involved in a phenotype. One of the strengths of the approach is that it is able to find interacting genes even in the absence of marginal effects. This capability, which was already present in other methods such as MDR (Multi Dimensional Reduction), is made available through KNN MDR for situations with more markers and more complicated interaction patterns than was computationally feasible with "simple" MDR. The current version has been written for binary (0/1) traits and for SNP data, but could easily be extended to other traits/attributes. These options will be included in future versions of the software.

Methodology
This section briefly summarizes how the method works, so that the parameters that must be provided to the software can be understood. More details can be found in the original publication.

Each data point is represented by a phenotype (0/1, where the meaning of these codes is problem-dependent) and a set of N attributes. As mentioned above, in the current version of the program, the attributes correspond to SNP genotypes. It is assumed that these genotypes are available (through direct genotyping or through an imputation method) for all individuals.

The idea behind the MDR methods is to reduce the very large multidimensional space faced in situations involving multiple loci (such as genetic interactions) to a one-dimensional space. In basic MDR, a status (0/1) is associated with each multi-locus genotype through a majority vote among the individuals presenting this multi-locus genotype in the training set; after that training stage, a status (0/1) can be allocated to individuals from the test set on the basis of their multi-locus genotype as well (provided similar genotypes were present in the training set). Accuracy of allocation can then be obtained by computing the false positive (i.e. status 1 wrongly allocated) and false negative (i.e. status 0 wrongly allocated) rates in both training and test sets.

KNN MDR uses the same strategy. The difference with basic MDR is that the allocation phase is performed through a K nearest-neighbors approach: a status is allocated on the basis of the most prevalent status among the K nearest neighbors of the tested individual. The neighborhood is defined using a distance which, in the current version, is a simple Euclidean distance between the involved genotypes of the two individuals being compared. The advantage of such an approach is twofold: the distance can be computed easily (i.e. with little computational effort) for any number of markers, and the allocation procedure works even in situations where no other individual in the training set has the same multi-locus genotype. Note that both points become more relevant as the number of involved markers increases, which is a frequent situation in practice.

The training and test sets remain to be defined. Again, KNN MDR mimics the approach followed in basic MDR, using cross-validation: the complete dataset is randomly split into V equally sized subsets, and each subset is in turn considered as the test set, while the (V-1) other subsets are used as the training set. Accuracy is computed for every configuration, and the final model accuracy is computed as the average of the obtained accuracies. In order to balance the true positive and true negative rates in the results, we use "balanced accuracy" as our accuracy measurement, defined as the average of the true positive and true negative rates.

When looking for sets of genes involved in a phenotype, various attribute sets are usually tested in order to find the one that best explains the data, which is, in our approach, the one with the highest balanced accuracy on the test set. This "best" attribute set is considered as our "best model".

The last problem is to test the significance of the best model. This is done in our software through a permutation procedure: if a specific attribute set is associated with the phenotype, disrupting the link between phenotypes and genotypes should destroy this association. Consequently, by randomly permuting the phenotypes with respect to the genotypes, we create datasets where no association should exist, which corresponds to the null hypothesis we want to test. Comparing the balanced accuracy obtained on the real data to the ones obtained on the permuted datasets yields an estimate of the p-value associated with our best model.
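The allocation, cross-validation and permutation steps described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the Fortran 90 implementation: the function names, the 0-based attribute indices, the tie-breaking rules and the way subsets are drawn are all assumptions made for the example.

```python
import math
import random


def knn_status(train_x, train_y, x, k):
    """Allocate the most prevalent status among the k nearest training individuals."""
    # Stable sort on Euclidean distance only: equidistant neighbours keep training order.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], x))
    ones = sum(status for _, status in ranked[:k])
    return 1 if 2 * ones > k else 0  # assumption: a tied vote is allocated status 0


def balanced_accuracy(truth, predicted):
    """Average of the true positive and true negative rates."""
    pairs = list(zip(truth, predicted))
    pos = sum(1 for t, _ in pairs if t == 1)
    neg = len(pairs) - pos
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Assumption: a rate whose denominator is empty counts as 0.
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return (tpr + tnr) / 2


def cv_balanced_accuracy(genotypes, status, attrs, k=5, v=10, seed=0):
    """Mean test-set balanced accuracy over v cross-validation subsets.

    attrs holds 0-based column indices of the attribute set under evaluation.
    """
    rng = random.Random(seed)
    order = list(range(len(status)))
    rng.shuffle(order)
    folds = [order[i::v] for i in range(v)]  # v (nearly) equally sized subsets
    rows = [[g[a] for a in attrs] for g in genotypes]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        train_x = [rows[i] for i in train]
        train_y = [status[i] for i in train]
        predicted = [knn_status(train_x, train_y, rows[i], k) for i in fold]
        scores.append(balanced_accuracy([status[i] for i in fold], predicted))
    return sum(scores) / len(scores)


def permutation_p_value(genotypes, status, attrs, n_perm=100, seed=0, **cv_args):
    """Share of permuted datasets scoring at least as well as the real one."""
    observed = cv_balanced_accuracy(genotypes, status, attrs, seed=seed, **cv_args)
    rng = random.Random(seed + 1)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the phenotype/genotype link
        if cv_balanced_accuracy(genotypes, shuffled, attrs, seed=seed, **cv_args) >= observed:
            hits += 1
    return observed, hits / n_perm
```

Calling `cv_balanced_accuracy` once per candidate attribute set and keeping the set with the highest value corresponds to the "best model" selection described above; `permutation_p_value` is then applied to that set only.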

Parameters
Several parameters have been defined in the previous section and can be passed to the program. These parameters are provided through a parameters file, which is named when calling the program, as follows:

path/knn mdr <analysis name>

In this command, "path" represents the optional path leading to the executable, and "analysis name" represents the name of the analysis. This name is used to locate the parameters file just discussed (named <analysis name>.prm) and to name output files (see below). The parameters file is a text file in which each line specifies one option of the program. These options are:
• ATT SET FILE file: this option specifies the file containing the list of attribute sets to be evaluated. The best model will be chosen among these attribute sets. Attribute sets are specified on distinct lines of the file by providing a comma-separated list of the positions, in the attributes file, of the attributes to be considered. When several consecutive attributes have to be used, the first and the last attribute can be given separated by a hyphen. For example, "1,3,7-10" means "use the first, third, seventh, eighth, ninth and tenth attributes" of the attributes file. No default exists for this parameter.
• ATTRIB FILE file: this option specifies the file containing all attributes for all individuals in the analysis. Again, "file" is a text file, with one line per individual, and at least as many blank-separated columns as the number M of attributes. The attributes file also contains a column with an individual identifier, and may also contain the (0/1) phenotype. Since attributes, in the current version, are SNP genotypes, these genotypes are assumed to be recoded genotypes: for each SNP, one of the alleles is arbitrarily considered as the reference allele, and the recoded genotype is simply the number of occurrences of the reference allele in the genotype. Consequently, the allowed attribute values are 0, 1 or 2.
No default exists for this parameter.
• HELP: this option is used to obtain a short reminder of the available options.
• KLOW n: this option specifies the minimum number of neighbors to be used to allocate a status to tested individuals. Default is KLOW = 5.
• KHIGH n: this option specifies the maximum number of neighbors to be used to allocate a status to tested individuals. Default is KHIGH = 5.
• MODEL model: with this option, a model can be specified. In the current version, the only available model is KNN. Default is 'KNN'.
• NB ATTRIB n: this option indicates how many attributes should be found in the data file. Default is NB ATTRIB = 1.
• NB CROSS V n: this is used to provide the number V of cross-validation subsets. Default is NB CROSS V = 10.
• NB INDIV n: this option indicates how many individuals should be found in the data file. Default is NB INDIV = 1.
• NB PERM n: with this option, the number of permutations can be provided. Default is NB PERM = 0.
• PHENO FILE file: this option specifies the file containing the phenotypes for all individuals in the analysis. As above, "file" is a text file, with one line per individual, one column with the 0/1 phenotype and one column with an individual identifier. This file may be the same as the attributes file. No default exists for this parameter.
• POS FIRST ATTRIB n: with this option, the position (column number) of the first attribute to be considered can be provided. Default value is POS FIRST ATTRIB = 1.
• POS LAST ATTRIB n: with this option, the position (column number) of the last attribute to be considered can be provided. Default value is POS LAST ATTRIB = 1.
• POS ID ATTRIB n: this option gives the position (column number) of the individual identifier in the attributes file. Default is POS ID ATTRIB = 1.
• POS ID PHENO n: this option gives the position (column number) of the individual identifier in the phenotypes file. Default is POS ID PHENO = 1.
• POS PHENO n: this option gives the position (column number) of the phenotype field in the phenotypes file. Default is POS PHENO = 2.
• SEED s: since random choices (cross-validation subsets, permutations) are made, successive invocations of the program will not necessarily produce identical outputs. Identical (different) runs can be performed by specifying identical (different) seeds through this option.
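Two of the file conventions listed above, the attribute set notation and the genotype recoding, can be illustrated with a short Python sketch. The function names are hypothetical and the code is independent of the program itself; in particular, representing a genotype as a two-letter string is an assumption made for the example.

```python
def parse_attribute_set(spec):
    """Expand a line such as '1,3,7-10' into a list of 1-based attribute positions."""
    positions = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            # 'first-last' stands for all consecutive positions, both ends included.
            first, last = (int(part) for part in item.split("-", 1))
            positions.extend(range(first, last + 1))
        else:
            positions.append(int(item))
    return positions


def recode_genotype(alleles, reference):
    """Count the occurrences of the reference allele in a genotype (0, 1 or 2)."""
    return sum(1 for allele in alleles if allele == reference)


print(parse_attribute_set("1,3,7-10"))  # [1, 3, 7, 8, 9, 10]
print(recode_genotype("AG", "A"))       # 1
```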

Example
In this section, we show the use of the program on a simulated example. Twenty attributes are measured for 500 cases and 500 controls. All data are included in one file, named 'sample.dat'. The individual identifier is the first field, followed by the 0/1 phenotype, and then by the 20 attributes. An interaction has been introduced artificially between attributes 4 and 12 as follows: all genotypes at locus 4 are generated randomly, irrespective of the status of the individuals. This should ensure that no marginal effect exists for this locus on the trait. Control genotypes for locus 12 are also randomly allocated, but case genotypes for locus 12 are copies of control ones. This creates an interaction between these two loci.
Step 6 reports that set 6, containing markers 1-5 and 11-15, is significantly associated with the phenotype, which is good news given the way the dataset has been generated. Note also that the p-value equal to 0 is obtained through permutations and is only an estimate of the true one. Computing a confidence interval for this p-value shows that p lies within [0; 0.036] at the 95% confidence level, so this seems to be a really significant signal!
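A bound of this kind can be reproduced with an exact binomial (Clopper-Pearson) interval for the case where none of the permuted datasets reaches the observed accuracy. The sketch below rests on two assumptions that are not stated in the text: that the interval is the two-sided exact one, and that 100 permutations were run. Under those assumptions the upper limit is about 0.036.

```python
def zero_hit_upper_bound(n_perm, confidence=0.95):
    """Upper limit of the two-sided exact binomial interval when 0 of n_perm
    permutations score at least as well as the real data."""
    alpha = 1.0 - confidence
    # With zero hits, the upper limit p solves (1 - p) ** n_perm = alpha / 2.
    return 1.0 - (alpha / 2.0) ** (1.0 / n_perm)


# Assumption: 100 permutations (the number used in the example is not given here).
print(round(zero_hit_upper_bound(100), 3))  # 0.036
```

The bound shrinks roughly in proportion to 1/n, so a tighter interval around an estimated p-value of 0 requires more permutations.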