kruX: matrixbased nonparametric eQTL discovery
 Jianlong Qi^{1, 2},
 Hassan Foroughi Asl^{3},
 Johan Björkegren^{3, 4} and
 Tom Michoel^{1, 5}Email author
DOI: 10.1186/147121051511
© Qi et al.; licensee BioMed Central Ltd. 2014
Received: 10 October 2013
Accepted: 11 January 2014
Published: 14 January 2014
Abstract
Background
The KruskalWallis test is a popular nonparametric statistical test for identifying expression quantitative trait loci (eQTLs) from genomewide data due to its robustness against variations in the underlying genetic model and expression trait distribution, but testing billions of markertrait combinations onebyone can become computationally prohibitive.
Results
We developed kruX, an algorithm implemented in Matlab, Python and R that uses matrix multiplications to simultaneously calculate the KruskalWallis test statistic for several millions of markertrait combinations at once. KruX is more than ten thousand times faster than computing associations onebyone on a typical human dataset. We used kruX and a dataset of more than 500k SNPs and 20k expression traits measured in 102 human blood samples to compare eQTLs detected by the KruskalWallis test to eQTLs detected by the parametric ANOVA and linear model methods. We found that the KruskalWallis test is more robust against data outliers and heterogeneous genotype group sizes and detects a higher proportion of nonlinear associations, but is more conservative for calling additive linear associations.
Conclusion
kruX enables the use of robust nonparametric methods for massive eQTL mapping without the need for a highperformance computing infrastructure and is freely available from http://krux.googlecode.com.
Keywords
eQTL Nonparametric methods Matrix algebra SoftwareBackground
Genomewide association studies have identified hundreds of DNA variants associated to complex traits including disease in human alone [1]. To understand how these variants affect disease risk, genotype and organismal phenotype data are integrated with intermediate molecular phenotypes to reconstruct disease networks [2]. A first step in this procedure is to identify DNA variants that underpin variations in expression levels (eQTLs) of transcripts [3], proteins [4] or metabolites [5]. As modern technologies routinely produce genotype and expression data for a million or more singlenucleotide polymorphisms (SNPs) and tenthousands of molecular abundance traits in a single experiment, often repeated across multiple cell or tissue types, the number of statistical tests to be performed when testing each SNP for association to each trait is huge. Furthermore, multiple testing correction requires all tests to be repeated several times on permuted data to generate an empirical null distribution. Despite being trivially parallelisable, the computational burden of testing SNPtrait associations onebyone quickly becomes prohibitive.
Recently a new approach ("matrixeQTL") was developed which uses the fact that the test statistics for the additive linear regression and ANOVA models can be expressed as multiplications between rescaled genotype and expression data matrices, thereby realising a dramatic speedup compared to traditional QTLmapping algorithms [6]. A limitation of these models is their assumption that the expression data is always normally distributed within each genotype group. For this reason, QTL and eQTL studies have frequently used nonparametric methods which are more robust against variations in the underlying genetic model and trait distribution [7, 8]. In particular, the nonparametric KruskalWallis oneway analysis of variance [9] does not assume normal distributions and reports small Pvalues if the median of at least one genotype group is significantly different from the others [8].
Here we report a matrixbased algorithm ("kruX"), implemented in Matlab, Python and R, to simultaneously calculate the KruskalWallis test statistics for several millions of SNPtrait pairs at once that is more than ten thousand times faster than calculating them onebyone on a human test dataset with more than 500,000 SNPs and 20,000 expression traits. Additional benefits of kruX include the explicit handling of missing values in both genotype and expression data and the support of genetic markers with any number of alleles, including variable allele numbers within a single dataset.
Implementation
Input data
KruX takes as input genotype values of M genetic markers and expression levels of N transcripts, proteins or metabolites in K individuals, organised in an M × K genotype matrix G and N × K expression data matrix D. Genetic markers take values 0,1,…,ℓ, where ℓ is the maximum number of alleles (ℓ = 2 for biallelic markers), while molecular traits take continuous values. We use builtin functions of Matlab, Python and R to convert the expression data matrix D to a matrix R of data ranks, ranked independently over each row (i.e. molecular trait). KruX assumes that the input expression data has been adjusted for covariates if it is necessary to do so [10, 11] and all data quality control has been performed.
Calculation of the KruskalWallis test statistic by matrix multiplication
using efficient vectorised operations. If none of the rows in D contain ties, then each entry S(n,m) equals the KruskalWallis test statistic for testing trait n against marker m [9]. For markers with less than the maximum of ℓ genotype values, 0/0 division will result in NaN columns in the intermediate matrices with entries S_{ i }(n,m)^{2}/N_{ i }(m) for the empty genotype groups. By replacing all NaN’s by zeros before making the sum in eq. (2), the corresponding entries in S will be the correct statistics for a test with fewer than ℓ degrees of freedom. Thus we need ℓ + 1 matrix multiplications and the associated elementwise operations to calculate the test statistic values for all markertrait combinations.
Pvalue calculation and empirical FDR correction
KruX takes as input a Pvalue threshold P_{ c }, calculates the corresponding test statistic thresholds for d degrees of freedom (d = 1,…,ℓ  1), and identifies the entries in S which exceed the appropriate threshold value. For these entries only a Pvalue is calculated using the χ^{2} distribution. Empirical falsediscovery rate (FDR) values are computed by repeating the Pvalue calculation (with the same P_{ c }) multiple times on data where the columns of the expression data ranks are randomly permuted. The FDR value for any value P ≤ P_{ c } is defined as the ratio of the average number of associations with P^{′} ≤ P in the randomised data to the number of associations with P^{′} ≤ P in the real data.
Handling missing values
where Z is the N × K matrix with Z(n,k) = 0 whenever D(n,k) = NaN and 1 otherwise, produces the corrected number of individuals in the ith group of the mth marker for the nth trait. Replacing the constant K in eq. (2) by a N × M matrix K where K(n,m) is the number of nonmissing samples for trait n and performing elementwise division and substraction operations then gives the correct test statistic for all pairs.
Handling missing genotype data is less easy because the expression ranks that need to be adjusted are specific to each markertrait combination (e.g if a marker has a missing value where a trait has rank r_{1}, then all samples with ranks r = r_{1} + 1,…,K need to be lowered by 1). KruX uses the fact that missing genotype values are generally due to sample quality and therefore patterns of missing values are often repeated among markers. For each unique missing value pattern, a new genotype matrix for all markers with that pattern and a new expression data matrix with the corresponding samples removed are constructed to calculate the test statistics for all affected marker gene combinations. Missing genotype data increases the computational cost of the algorithm considerably and it is recommended to limit the number of missing values by only considering markers with a sufficiently high call rate.
Handling tied data
In the presence of tied observations, the statistic in eq. (2) needs to be divided by a factor $1\frac{\sum T}{{K}^{3}K}$, where the summation is over all groups of ties and T = t^{3}  t for each group of ties, with t the number of tied data in the group [9]. The factor T is automatically computed for each trait during the ranking step and the matrix S is therefore easily corrected using elementwise matrix operations (Matlab version only). Whereas ties are usually rare in standard gene expression datasets, the ability to handle tied data expands the scope of kruX to countbased, discretised or qualitative data types.
Data slicing
Since kruX needs to create intermediate matrices of size N × M, where N is the number of traits and M the number of markers, which do not usually fit into memory for large datasets, kruX supports the use of data ‘slices’ to divide the complete data into manageable chunks. In typical applications, the number of markers is one or two orders of magnitude larger than the number of traits. Therefore the default behaviour of kruX is to keep the expression data as a single matrix and simultaneously test all traits against subsets of markers. The user can provide either a slice size and kruX will process marker blocks of this size serially, or a slice size and initial marker and kruX will process a single slice starting from that marker. The latter option allows trivial parallelisation across multiple processors.
Results and discussion
Validation data
To test kruX we provide example analysis scripts and a small anonymised dataset of 2,000 randomly selected genes and markers from 100 randomly selected yeast segregants [12]. Here we describe an application of kruX on a human dataset of 19,610 genes and 530,222 SNP markers measured in 102 whole blood samples from the Stockholm Atherosclerosis Gene Expression (STAGE) study [13]. All SNPs in the dataset had minor allele frequency greater than 5%, no missing values and probability to be in HardyWeinberg equilibrium greater than 10^{6}.
kruX is exact and fast
The KruskalWallis test is more conservative than corresponding parametric tests
The KruskalWallis test is more robust and detects more nonlinear associations
Conclusions
We have developed kruX, a software tool that uses matrix multiplications to simultaneously calculate the KruskalWallis test statistics for millions of markertrait combinations in a single operation, thereby realising a dramatic speedup compared to calculating the test statistics onebyone. The availability of a fast method to identify eQTL associations using a nonparametric test allowed us to assess in more detail how differences in model assumptions compared to parametric methods lead to differences in identified eQTLs. Our results on a typical human dataset indicate that the the parametric ANOVA method is highly sensitive to the presence of outlying gene expression values and SNPs with singleton genotype groups. We caution against its use without prior filtering of such outliers. Linear models reported the highest number of eQTL associations after empirical FDR correction. These are understandably biased towards additive linear associations and were also sensitive to the presence of skewed genotype group sizes, albeit to a much lesser extent than the parametric ANOVA method. The KruskalWallis test on the other hand is robust against data outliers and heterogeneous genotype group sizes and detects a higher proportion of nonlinear associations, but it is more conservative for calling additive linear associations than linear models, even after FDR correction.
In summary, kruX enables the use of nonparametric methods for massive eQTL mapping without the need for a highperformance computing infrastructure.
Availability and requirements

Project name: kruX

Project home page:http://krux.googlecode.com

Operating systems: Platform independent

Programming language: Matlab, R, Python

Other requirements: None

License: GNU GPL v3

Any restrictions to use by nonacademics: None
Declarations
Acknowledgements
This work was supported by the Freiburg Institute for Advanced Studies and Roslin Institute Strategic Grant funding from the BBSRC.
Authors’ Affiliations
References
 Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genomewide association loci for human diseases and traits. Proc Nat Acad Sci. 2009, 106 (23): 93629367. 10.1073/pnas.0903103106.View ArticlePubMed CentralPubMedGoogle Scholar
 Schadt EE: Molecular networks as sensors and drivers of common human diseases. Nature. 2009, 461: 218223. 10.1038/nature08454.View ArticlePubMedGoogle Scholar
 Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M: Mapping complex disease traits with global gene expression. Nat Rev Genet. 2009, 10 (3): 184194. 10.1038/nrg2537.View ArticlePubMed CentralPubMedGoogle Scholar
 Foss EJ, Radulovic D, Shaffer SA, Ruderfer DM, Bedalov A, Goodlett DR, Kruglyak L: Gentic basis of proteome variation in yeast. Nat Genet. 2007, 39: 13691375. 10.1038/ng.2007.22.View ArticlePubMedGoogle Scholar
 Nicholson G, Rantalainen M, Li JV, Maher AD, Malmodin D, Ahmadi KR, Faber JH, Barrett A, Min JL, Rayner NW, et al: A genomewide metabolic QTL analysis in Europeans implicates two loci shaped by recent positive selection. PLoS Genetics. 2011, 7 (9): 100227010.1371/journal.pgen.1002270.View ArticleGoogle Scholar
 Shabalin AA: Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012, 28 (10): 13531358. 10.1093/bioinformatics/bts163.View ArticlePubMed CentralPubMedGoogle Scholar
 Kruglyak L, Lander ES: A nonparametric approach for mapping quantitative trait loci. Genetics. 1995, 139 (3): 14211428.PubMed CentralPubMedGoogle Scholar
 Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, Kasarskis A, Zhang B, Wang S, Suver C, Zhu J, Millstein J, Sieberts S, Lamb J, GuhaThakurta D, Derry J, Storey JD, AvilaCampillo I, Kruger MJ, Johnson JM, Rohl CA, van Nas A, Mehrabian M, Drake TA, Lusis AJ, Smith RC, Guengerich FP, Strom SC, Schuetz E, et al: Mapping the genetic architecture of gene expression in human liver. PLoS Biol. 2088, 6: 107View ArticleGoogle Scholar
 Kruskal WH, Wallis WA: Use of ranks in onecriterion variance analysis. J Am Stat Assoc. 1952, 47 (260): 583621. 10.1080/01621459.1952.10483441.View ArticleGoogle Scholar
 Leek JT, Storey JD: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007, 3 (9): 16110.1371/journal.pgen.0030161.View ArticleGoogle Scholar
 Listgarten J, Kadie C, Schadt EE, Heckerman D: Correction for hidden confounders in the genetic analysis of gene expression. Proc Nat Acad Sci. 1646, 107 (38): 516470.Google Scholar
 Brem RB, Kruglyak L: The landscape of genetic complexity across 5,700 gene expression traits in yeast. PNAS. 2005, 102 (5): 15721577. 10.1073/pnas.0408709102.View ArticlePubMed CentralPubMedGoogle Scholar
 Hägg S, Skogsberg J, Lundström J, Noori P, Nilsson R, Zhong H, Maleki S, Shang MM, Brinne B, Bradshaw M, Bajic VB, Samnegard A, Silveira A, Kaplan LM, Gigante B, Leander K, de Faire U, Rosfors S, Lockowandt U, Liska J, Konrad P, Takolander R, FrancoCereceda A, Schadt EE, Ivert T, Hamsten A, Tegner J, Björkegren J: Multiorgan expression profiling uncovers a gene module in coronary artery disease involving transendothelial migration of leukocytes and LIM domain binding 2: the Stockholm atherosclerosis gene expression (STAGE) study. PLoS Genet. 1000754, 5 (12):
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.