 Methodology article
 Open access
 Published:
Optimized permutation testing for information theoretic measures of multigene interactions
BMC Bioinformatics volume 22, Article number: 180 (2021)
Abstract
Background
Permutation testing is often considered the “gold standard” for multitest significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form in which the full analysis pipeline is rerun after permuting the phenotype labels. This can become intractable in multilocus genomewide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large.
Results
In this paper, we develop an approach for permutation testing in multilocus GWAS, specifically focusing on SNP–SNPphenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based in Information Theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speedup by a factor of over 10^{3} for a typical permutation test compared to the naive approach. Additionally, this approach is insensitive to the number of samples making it suitable for datasets with large number of samples.
Conclusions
The proliferation of largescale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multilocus genotypephenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permutecounts.
Background
Genomewide association studies (GWAS) have shed light on the genetics of complex traits and diseases, but singlelocus analyses fail to detect the epistatic gene–gene interactions, which play a crucial role in the genetics of complex traits [1,2,3]. This has resulted in a proliferation of methods for detecting gene–gene interactions [4], for example: regression methods, including regularized regression techniques such as LASSO [5, 6]; ensemble methods such as random forests [7,8,9]; and multifactor dimensionality reduction [10, 11].
We focus here on the class of techniques based on information theory, which formulate entropybased measures sensitive to multigene epistatic interactions. These approaches are powerful due to being inherently modelfree and particularly sensitive to nonlinear relationships [3]. This has led to its own proliferation in entropybased measures of epistatic interaction, including: Conditional Mutual Information [12], Information gain [13,14,15,16,17,18], Relative Information Gain [16, 19,20,21], Total Correlation [22,23,24,25,26,27], Synergy [28, 29], and the Information Delta [30, 31].
Though these different formulations vary, they share many of the same advantages inherent to the information theorybased approach, but also many of the same weaknesses, and of particular note here is the recurrent difficulty in constructing statistical tests for the significance of a detected interaction. There is typically no simple analytic formulation for the null distributions of these estimators, and thus significance tests require either some approximation or, more reliably, permutation testing. Permutation testing is often considered the “gold standard” for multitest significance analysis [32, 33], and is the approach utilized by the majority of the above studies [20,21,22,23,24,25,26,27, 29, 34, 35].
Even in a singlelocus GWAS, permutation testing is computationally costly [33]. SNP arrays may contain hundreds of thousands of individual SNPs, and thus there are hundreds of thousands of pairwise SNPphenotype relationships to be tested. Higherorder relationships quickly lead to computationally intractable problems: this same number of SNPs leads to billions of possible threeway SNP–SNPphenotype interactions, and to tens of trillions of fourway SNP–SNPSNPphenotype interactions. Detecting and testing these interactions becomes difficult on both statistical and computational levels.
In its simplest naive form, permutation testing consists of iterations of randomly permuting the phenotype labels and rerunning the analysis pipeline. However, this approach can be optimized considerably, especially when performed multiple times: for example, standard packages such as PLINK [5] by default use an adaptive approach, which iteratively checks if the permutations already performed are sufficient to rule any of the observed SNPphenotype associations as statistically insignificant, and drops insignificant SNPs from subsequent computations. Even the cost of this approach can be further reduced by an order of magnitude, and there exist multiple approaches for optimizing these singlelocus analyses, including PRESTO [36], SLIP and SLIDE [37], and PERMORY [38].
In this paper, we develop an approach which reduces the computational cost of permutation tests by orders of magnitude for all information theory based measures. We identify the construction of count tables as the largest computational bottleneck, and devise a method for directly transforming these count tables to replicate a permutation test, without having to reconstruct them. We find that this reduces the computation time of each permutation by over three orders of magnitude. This approach therefore allows for the principled assessment of statistical significance in a multiSNP association study, and enables the consideration and comparison of multiple candidate measures for multivariable dependence.
Results
Construction of count tables
Genotype and phenotype data can be represented with an \(n\times m\) genotype array G and a lengthn phenotype vector p, where m SNPs are measured for n individuals. The number of threeway SNP–SNPphenotype interactions is typically quite large, as this scales as \(m^2\). In this case of large n and m, we find that the bulk of the computation consists of merely computing the count tables for each possible tuple.
The computation of the joint entropy between the variables in a given tuple first requires the calculation of a count table. Consider a tuple consisting of two SNPs and a single phenotype. Each SNP takes a value 0, 1, 2 (for homozygous major, heterozygous, and homozygous minor alleles respectively), and the phenotype is binary with possible values 0 and 1. The count table C is then a \(3\times 3 \times 2\) array:
where \(c_{ijk}\) is the number of individuals for whom the first SNP has a value of i, the second SNP has a value of j, and the phenotype has a value of k. Clearly, the elements sum to the total number of individuals n; dividing this array by n gives the joint probability estimates, from which the various joint entropies can be calculated, which can then be used to calculate informationtheoretic measures for the corresponding tuple.
Notation and reasoning
A count table C must be constructed for each of the billions of tuples. A naive approach to permutation testing would simply randomly shuffle the phenotype vector p and repeat the entire analysis, including the reconstruction of count tables from the data. We instead seek a transformation which, starting from a count table C, will generate a randomized count table \(C^*\) from the same distribution of randomized count tables given by naive permutation. The first crucial observation is that the sum over the third axis of C will remain constant over a permutation test:
where \(n_{ij}\) is the number of individuals for whom the first SNP is i and the second SNP is j. With this notation, we can write:
We need only compute the \(k=0\) layer of this array, from which the \(k=1\) layer immediately follows. We also have the constraint that:
\(n_0\) is the total number of individuals with phenotype label 0, which will also remain constant as the labels are shuffled.
With our notation and the constraints of Eqs. 2 and 4, we can begin to consider the effect of a permutation test upon a count table. Firstly, how is \(c_{000}^*\) distributed? Consider the \(n_{00}\) individuals with this genotype. If we randomly shuffle the phenotype labels, we are, in effect, randomly drawing without replacement \(n_{00}\) labels from the population of n labels, \(n_0\) of which have a value of 0. This process of drawing from a finite set of labels without replacement is described by the hypergeometric distribution, and we can write:
from which \(c^*_{001}=n_{00}c^*_{000}\) immediately follows.
When computing the next element, we must consider that the previous step has already assigned \(n_{00}\) labels, \(c_{000}^*\) of which had a value of 0. We again draw without replacement \(n_{10}\) labels, now from a total population of \(nn_{00}\) phenotype labels, of which \(n_0c_{000}^*\) have value 0:
The next element is drawn iteratively in the same manner:
This process is repeated until all of the elements have been assigned.
Algorithm for transformed count tables
More formally, this count transformation process can be written as follows:

1.
From the original count table \(c_{ijk}\), compute the genotype counts \(n_{ij}\), the value0 phenotype count \(n_0\), and the total phenotype count n.

2.
Assign an (arbitrary) order to the indices (i, j). This will be the order in which the elements are assigned. For example, let:
$$\begin{aligned} \{(i,j)\} = \{(0,0)< (1,0)< (2,0)< (0,1)< \cdots < (2,2)\} \end{aligned}$$ 
3.
For each (i, j) in the ordered set, sample from the hypergeometric distribution:
$$\begin{aligned} c^*_{ij0} \sim \text {Hypergeometric}\left( n\sum \nolimits _{(i',j')<(i,j)}n_{i'j'},\quad n_0\sum \nolimits _{(i',j')<(i,j)}c^*_{i'j'0},\quad n_{ij}\right) \end{aligned}$$ 
4.
Calculate the corresponding number of counts with phenotype value 1:
$$\begin{aligned} c^*_{ij1} = n_{ij}  c^*_{ij0} \end{aligned}$$
Discussion
Comparing generated distributions
To check that this method works as intended, we verified that the distribution of count tables generated via our method is indistinguishable from count tables generated by direct permutation of the phenotype labels. Specifically, we randomly generated a total of \(N_P=1{,}000{,}000\) permuted count tables using each method, and found the distributions of the permuted elements \(c^*_{ij0}\) to be both visually and statistically indistinguishable (via an ensemble of Epps–Singletons tests between the two distributions [39]). Further details on how these count tables were generated and how the analysis was performed are given in the Methods.
Comparing computational complexity
We can also generate synthetic data (as described in the Methods) to compare the computational cost of each approach. Figure 1 compares the computational complexity of the naive direct permutation approach as compared to our method, as a function of both the number of individuals n and number of permutations \(N_p\). In Fig. 1a, we calculate the computation time as a function of \(N_p\), with a fixed \(n=10{,}000\) samples and 100k SNPs. The computation time of both methods scales linearly with the number of permutations (i.e. they are both \(\mathcal {O}(N_p)\)). However, the linear fits to each method imply a time per permutation of 586.6s for the direct permutation method and 0.52s for our method. Our method is therefore over \(10^3\) times faster for each permutation, for this number of samples.
Figure 1b, which calculates the computation time as a function of number of samples n with a fixed \(N_p=20\), shows an even clearer computational advantage of our approach. The direct permutation approach scales linearly with the number of samples (i.e. it scales as \(\mathcal {O}(n)\)), whereas the computation time for our method does not depend on the number of samples (i.e. it scales as \(\mathcal {O}(1)\)). This is not unexpected, since our method bypasses the need to perform any operations on the original \(n\times m\) array. This represents a considerable computational savings for datasets with a large number of samples.
Conclusion
This paper outlines the algorithm for direct transformation of count tables, shows that the results are identical to those obtained by the naive approach of directly permuting the phenotype labels, and shows the considerable reduction in computational expense using this method. Specifically, we demonstrate a reduction of computation time per permutation by a factor of over \(10^3\), and show that our method is insensitive to the total number of samples while the naive approach scales linearly. By bypassing the most computationally expensive step of the naive approach to permutation testing, our method therefore considerably decreases the cost of permutation testing for information theoretic measures.
Future developments on this method should incorporate additional methods for decreasing the computational cost of permutation analyses. For example, it is common for pairwise GWAS analyses to use an adaptive scheme which iteratively drops interactions if they are clearly not statistically significant (e.g. this is done by default in PLINK [5]). A similar adaptive scheme could be implemented here on top of our method.
Given the recent proliferation of large datasets for which multilocus analyses can yield novel biological insights, and given the importance of permutation testing for information theoretic measures without a clean analytically known null distribution, we believe that our approach is a valuable contribution towards making these large and important analyses more computationally tractable. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permutecounts.
Methods
Synthetic dataset and its count distributions
Each SNP–SNPphenotype tuple in our synthetic dataset is generated as described in this section. SNP data is generated independently for both SNPs by assuming perfect HardyWeinberg equilibrium with a minor allele frequency of \(p = 0.45\) (i.e. we generate a \(n\times 2\) genotype array where each element has a probability \(p^2\) of being 0, probability \(2p(1  p)\) of being 1, and probability \((1p)^2\) of being 2). We similarly generate a binary phenotype vector which has a probability \(q=0.66\) of equaling zero. As we will establish later, the values of p and q do not affect our results.
The above parameters lead to a random count table such as the one below, generated for \(n=10{,}000\) individuals:
We verify that our method is working as desired by permuting the above count table \(N_P=1{,}000{,}000\) times using two different approaches: (1) the naive permutation testing approach, in which we randomly shuffle the phenotype vector and recompute the count table; (2) our method as outlined in Sect. of the main text. The distributions of the elements \(c_{ij0}^*\) are shown in Fig. 2. As shown in the figure, the resulting distributions are nearly identical, and the distributions generated from the two approaches overlap almost perfectly. The computational cost savings of this approach are considerable. On our machine, generating \(N_P=1{,}000{,}000\) permuted count tables took a total of 761.8 seconds using the naive method and only 5.7 seconds using our method.
It is immediately obvious from Fig. 2 that the distributions are very close to normal distributions, which is not surprising given our relatively large choice of \(N_P\). One may be tempted to use this fact to formulate a simpler approach to generating random count tables: could we simply estimate the normal distributions for each \(c_{ij0}^*\) and sample those directly? This approach will not work because the elements are not independent from each other, meaning that an iterative procedure such as ours is required.
Distributions of information measures
Having generated an ensemble of 1,000,000 count tables using each method, we can compute the joint entropies of our variables as well as any information theoretic measure which is a function of the entropies. For example, we can compute the multiinformation:
where \(H_i\) are the entropies of each individual variable, and \(H_{123}\) is the joint entropy of all three variables (i.e. our two SNPs and the phenotype).
The subsequent computation of information measures is considerably less expensive than the construction of the count tables. For instance, computing the distributions of \(\Omega\) values using either set of the 1,000,000 count tables generated in the previous section took 1.6 s. Figure 3 shows the distributions of \(\Omega\) values based on the count tables generated by two different permutation methods in the previous section. Once again, we see that our method yields a nearly identical distribution to the naive method of direct permutation. In the case of real data analysis, these permuted distributions would serve as null distributions in our significance analysis. This result verifies that our method produces a null distribution equally sufficient for significance analysis as the naive permutation method, but at considerably less computational expense.
Statistical testing of distribution equivalence
The distributions in Figs. 2 and 3 appear to be nearly identical, but we wish to test (1) whether or not they may be distinguished via statistical testing, and (2) whether or not this result is sensitive to the choice of parameters p and q. We therefore ran 1000 trials of the following:

1.
Independently choose parameter values p, q from a uniform random distribution on (0.01, 0.99), and use this to generate a count table with \(n=10{,}000\) samples.

2.
Generate \(N_P=1000\) permuted count tables using both direct permutation of phenotype labels and our method.

3.
For each \(c_{ij0}^*\), perform a twosample Epps–Singleton test comparing the two methods.
This will yield 9000 p values generated under a broad range of different parameter values. The Epps–Singleton test [39] has the null hypothesis that both samples are drawn from the same distribution (and is used here since it allows for discrete distributions). By definition, the p values should be uniformly distributed under the null hypothesis. In Fig. 4, we show that our p values are fully compatible with a uniform distribution, such that the count tables generated by naive permutation and those generated by our method are not statistically distinguishable.
Availability of data and materials
The datasets generated and/or analysed during the current study are available in the Zenodo repository, http://doi.org/10.5281/zenodo.4068765
Abbreviations
 GWAS:

Genomewide association study
 SNP:

Single nucleotide polymorphism
References
Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet. 2010;86:6–22. https://doi.org/10.1016/j.ajhg.2009.11.017.
Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: genetic interactions create phantom heritability. Proc Nat Acad Sci. 2012;109(4):1193–8.
Ferrario PG, König IR. Transferring entropy to the realm of GxG interactions. Briefings Bioinf. 2016;19(1):136–47. https://doi.org/10.1093/bib/bbw086.
Cordell HJ. Detecting gene–gene interactions that underlie human diseases. Nat Rev Genet. 2009;10(6):392–404.
Purcell S, Neale B, Brown TK, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. Plink: a toolset for wholegenome association and populationbased linkage analysis. Am J Hum Genet. 2007;81:559–75.
Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genomewide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25(6):714–21.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Lunetta KL, Hayward LB, Segal J, Van Eerdewegh P. Screening largescale association study data: exploiting interactions using random forests. BMC Genet. 2004;5(1):32.
Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, Van Eerdewegh P. Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol: Off Publ Int Genet Epidemiol Soc. 2005;28(2):171–82.
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69(1):138–47.
Gola D, Mahachie John JM, Van Steen K, König IR. A roadmap to multifactor dimensionality reduction methods. Briefings Bioinf. 2016;17(2):293–308.
Zuo X, Rao S, Fan A, Lin M, Li H, Zhao X, Qin J. To control false positives in genegene interaction analysis: two novel conditional entropybased approaches. PLoS ONE. 2013;8(12):e81984.
Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006;241(2):252–61.
Fan R, Zhong M, Wang S, Zhang Y, Andrew A, Karagas M, Chen H, Amos C, Xiong M, Moore J. Entropybased information gain approaches to detect and to characterize genegene and geneenvironment interactions/correlations of complex diseases. Genet Epidemiol. 2011;35(7):706–21.
Chen L, Yu G, Langefeld CD, Miller DJ, Guy RT, Raghuram J, Yuan X, Herrington DM, Wang Y. Comparative analysis of methods for detecting interacting loci. BMC Genom. 2011;12(1):344.
Kwon MS, Park M, Park T. Igent: efficient entropy based algorithm for genomewide genegene interaction analysis. BMC Med Genomics. 2014;7(1):6.
Su L, Liu G, Wang H, Tian Y, Zhou Z, Han L, Yan L. Research on single nucleotide polymorphisms interaction detection from network perspective. PLoS ONE. 2015;10(3):e0119146.
Hu T, Chen Y, Kiralis JW, Collins RL, Wejse C, Sirugo G, Williams SM, Moore JH. An informationgain approach to detecting threeway epistatic interactions in genetic association studies. J Am Med Inform Assoc. 2013;20(4):630–6.
Dong C, Chu X, Wang Y, Wang Y, Jin L, Shi T, Huang W, Li Y. Exploration of gene–gene interaction effects using entropybased methods. Eur J Hum Genet. 2008;16(2):229–35.
Yee J, Kwon MS, Park T, Park M. A modified entropybased approach for identifying gene–gene interactions in casecontrol study. PLoS ONE. 2013;8(7):e69321.
Chattopadhyay AS, Hsiao CL, Chang CC, Lian IB, Fann CS. Summarizing techniques that combine three nonparametric scores to detect diseaseassociated 2way SNP–SNP interactions. Gene. 2014;533(1):304–12.
Chanda P, Sucheston L, Zhang A, Brazeau D, Freudenheim JL, Ambrosone C, Ramanathan M. Ambience: a novel approach and efficient algorithm for identifying informative genetic and environmental associations with complex phenotypes. Genetics. 2008;180(2):1191–210.
Chanda P, Sucheston L, Zhang A, Ramanathan M. The interaction index, a novel informationtheoretic metric for prioritizing interacting genetic variations and environmental factors. Eur J Hum Genet. 2009;17(10):1274–86.
Chanda P, Sucheston L, Liu S, Zhang A, Ramanathan M. Informationtheoretic gene–gene and geneenvironment interaction analysis of quantitative traits. BMC Genom. 2009;10:509. https://doi.org/10.1186/1471216410509.
Sucheston L, Chanda P, Zhang A, Tritchler D, Ramanathan M. Comparison of informationtheoretic to statistical methods for gene–gene interactions in the presence of genetic heterogeneity. BMC Genom. 2010;11(1):487.
Chanda P, Zhang A, Ramanathan M. Modeling of environmental and genetic interactions with ambrosia, an informationtheoretic model synthesis method. Heredity. 2011;107(4):320–7.
Knights J, Yang J, Chanda P, Zhang A, Ramanathan M. Symphony, an informationtheoretic method for gene–gene and gene–environment interaction analysis of disease syndromes. Heredity. 2013;110(6):548–59.
Anastassiou D. Computational analysis of the synergy among multiple interacting genes. Mol Syst Biol. 2007;3(1):83.
Curk T, Rot G, Zupan B. SNPsyn: detection and exploration of SNP–SNP interactions. Nucleic Acids Res. 2011;39(suppl_2):444–9.
Sakhanenko NA, Galas DJ. Biological data analysis as an information theory problem: multivariable dependence measures and the shadows algorithm. J Comput Biol. 2015;22(11):1005–24.
Sakhanenko NA, KunertGraf J, Galas DJ. The information content of discrete functions and their application in genetic data analysis. J Comput Biol. 2017;24(12):1153–78.
Westfall PH, Young SS. Resamplingbased Multiple Testing: Examples and Methods for Pvalue Adjustment, vol. 279. Hoboken: Wiley; 1993.
Han B, Kang HM, Eskin E. Rapid and accurate multiple testing correction and power estimation for millions of correlated markers. PLoS Genet. 2009;5:1–13. https://doi.org/10.1371/journal.pgen.1000456.
Shang J, Zhang J, Sun Y, Zhang Y. Epiminer: a threestage coinformation based method for detecting and visualizing epistatic interactions. Digit Signal Proc. 2014;24:1–13.
Ignac T, Skupin A, Sakhanenko N, Galas D. Discovering pairwise genetic interactions: an information theorybased approach. PLoS ONE. 2014. https://doi.org/10.1371/journal.pone.0092310.
Browning BL. Presto: rapid calculation of order statistic distributions and multipletesting adjusted pvalues via permutation for one and twostage genetic association studies. BMC Bioinf. 2008;9:309. https://doi.org/10.1186/147121059309.
Han B, Kang HM, Eskin E. Rapid and accurate multiple testing correction and power estimation for millions of correlated markers. PLoS Genet. 2009;5(4):1–13. https://doi.org/10.1371/journal.pgen.1000456.
Pahl R, Schäfer H. PERMORY: an LDexploiting permutation test algorithm for powerful genomewide association testing. Bioinformatics. 2010;26(17):2093–100. https://doi.org/10.1093/bioinformatics/btq399.
Epps T, Singleton KJ. An omnibus test for the twosample problem using the empirical characteristic function. J Stat Comput Simul. 1986;26(3–4):177–203.
Acknowledgements
We wish to acknowledge support from the Pacific Northwest Research Institute.
Funding
Research reported in this publication was supported by the National Heart, Lung, And Blood Institute of the National Institutes of Health under Award Number U01HL126496. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Authors and Affiliations
Contributions
J.K., N.S., and D.G. conceived of and designed the project; J.K. performed the computations and formal analysis; N.A. and D.G. supervised the project and validated the results; J.K. visualized the results; J.K., N.A. and D.G. wrote and edited the paper. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent to publish
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
KunertGraf, J.M., Sakhanenko, N.A. & Galas, D.J. Optimized permutation testing for information theoretic measures of multigene interactions. BMC Bioinformatics 22, 180 (2021). https://doi.org/10.1186/s12859021041076
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859021041076