 Methodology Article
 Open Access
 Published:
High dimensional model representation of loglikelihood ratio: binary classification with expression data
BMC Bioinformatics volume 21, Article number: 156 (2020)
Abstract
Background
Binary classification rules based on a smallsample of highdimensional data (for instance, gene expression data) are ubiquitous in modern bioinformatics. Constructing such classifiers is challenging due to (a) the complex nature of underlying biological traits, such as gene interactions, and (b) the need for highly interpretable glassbox models. We use the theory of high dimensional model representation (HDMR) to build interpretable low dimensional approximations of the loglikelihood ratio accounting for the effects of each individual gene as well as genegene interactions. We propose two algorithms approximating the second order HDMR expansion, and a hypothesis test based on the HDMR formulation to identify significantly dysregulated pairwise interactions. The theory is seen as flexible and requiring only a mild set of assumptions.
Results
We apply our approach to gene expression data from both synthetic and real (breast and lung cancer) datasets comparing it also against several popular stateoftheart methods. The analyses suggest the proposed algorithms can be used to obtain interpretable prediction rules with high prediction accuracies and to successfully extract significantly dysregulated genegene interactions from the data. They also compare favorably against their competitors across multiple synthetic data scenarios.
Conclusion
The proposed HDMRbased approach appears to produce a reliable classifier that additionally allows one to describe how individual genes or genegene interactions affect classification decisions. Both real and synthetic data analyses suggest that our methods can be used to identify gene networks with dysregulated pairwise interactions, and are therefore appropriate for differential networks analysis.
Background
The notion of a simple binary classification, as one of the corner stones of modern data analysis, has been considered in many different contexts and an abundance of algorithms have been proposed for this task. While research has recently shifted focus to classification rules in the context of big data, many bioinformatics applications deal with smallsample, highdimensional prediction problems. Current highthroughput “omics” technologies measure tens of thousands of molecular features for each experimental unit (for instance, a patient’s tissue sample); however, research data is still usually limited to small sizes, rarely more than a few hundred units, impeding reliable analysis. Additionally, data might be heavily imbalanced, which adds to the challenge of correct classification in a smallsample, highdimensional setting, with the minimum misclassification error criteria being too unreliable for consistent feature selection across multiple datasets [1, 2].
In contrast to many applications where machine learning methods are used merely to predict and do not have to provide explicit decision rules, the bioinformatics applications demand highly interpretable glassbox models to explain how a specific decision is obtained. In many instances it is important to know which features, e.g., genes, are used by the classifier, whether these features are biologically relevant, whether the distributional differences in features across two classes indicate biological variability or are merely artifacts of the measurement/normalization process, and what is the uncertainty of prediction at a new test point?
Answers to these questions are necessary to hypothesize about biological mechanisms of complex diseases such as cancer and to evaluate clinical utility of developed decision rules for tasks such as diagnosis and prognosis. But they are also necessary to explain how certain patterns in the data might motivate different actions, such as choosing a specific treatment over another for targeted therapy, exploring alternative treatments, or how to form hypotheses on the biological mechanisms that can potentially be targeted in drug discovery applications.
The smallsample highdimensional nature of the problem, interpretability of outputted statistics, and complex feature dependencies, force the development of methods with few degrees of freedom that place strong assumptions (e.g. distributional assumptions) on the classification problem. For example, linear discriminant analysis (LDA) assumes features are Gaussian and have the same covariance matrix in both classes, quadratic discriminant analysis (QDA) assumes features are jointly Gaussian with different classconditioned covariances, and a logistic regression model assumes that the loglikelihood ratio is an additive function of features. The common idea behind these methods is that although the “optimal” decision rule might be very complex (e.g., a high dimensional separating surface), it can be well approximated by a low dimensional model, and an appropriate family of models should contain a point close to the “best” low dimensional representation that can be reliably approximated given the observed data. In the current paper we follow a rather similar general approach, but apply a much more flexible method for deriving classifiers that allow for more flexible classification rules.
Recent studies emphasize the importance of gene synergies and genetic interactions for reliable analysis [3]. However, two general themes of the recent method developments are leveraging big data, such as the cancer genome atlas (TCGA), or taking advantage of side information such as sets of comutated genes or disease protein subnetworks, e.g. [4, 5]. Such information may not be readily available or may not be easily applicable to the current dataset, as cancer gene interactions are highly context dependent [6].
Utilizing pairwise interactions for reliable prediction aside, detecting diseaseassociated genetic interactions has been studied as a “gene discovery problem”. To that end, mutual information based synergy scores are proposed, e.g., [7, 8]. However, reliably inferring mutual information from data is a challenging task, which can be circumvented by quantizing expression values, building dendrograms based on expressions, utilizing ranked expressions instead of raw continuous values, or defining new statistics based on genepair expression rankings [7–10]. In [9] it is stated that dendrogram and doublet (a specific collection of transformations merging gene pair expressions into onedimensional values) based methods are “helpless for discovering pairwise gene interactions”. The information theoretic score of [8] cannot be easily utilized to test significance, using limited permutations of data to approximate the null, hypothesis which is computationally intensive [9]. Finally, [9] proposes a new conversion transformation, the absolute difference of ranked expressions constructing a tstatistic, which seems to balance performance and computation cost.
High dimensional model representation
Consider a set of predictors as a random vector and a dependent variable as a function of predictors, e.g., class labels as a function of observed expressions. High dimensional model representation (HDMR) is a recently proposed framework to decompose functions of a random vector, i.e., the dependent variable, into a hierarchy of low dimensional models based on partial marginals of the full joint distribution [11]. Intuitively speaking, HDMR expansion optimally decomposes a high dimensional nonlinear system into a hierarchy of lower dimensional nonlinear systems, simplifying the process of studying each highdimensional component. It enjoys several interesting properties. For example, the d^{th} order expansion is the best representation, in mean square error (MSE) sense, to estimate the dependent variable given its marginal distribution with all subsets of the predictors with at most d elements. Additionally, higher order expansion terms are independent of the lower order terms. HDMR assumptions are mild, only requiring certain moments to exist. Unfortunately, computing the HDMR expansion requires complete knowledge of the full joint distribution, and potentially solving large families of highly complex integral equations unless simplifying assumptions are made or special cases are considered [12]. This can be a deal breaker in many practical applications, where it is not always possible to obtain the full joint distribution given the small sample size. In this work, as a workaround, we propose algorithms that aim to approximate the HDMR expansion without directly estimating the full joint distribution and solving integral equations.
Our contribution
The novelty of the proposed classification framework is threefold. (1) Our approach provides a hierarchy of low dimensional representations of data, possibly allowing for analyzing progressively more complex interactions among features. (2) We propose a regression based approach to circumvent solving complex integral equations. (3) We can easily study the effects of any specified subsets of variables, and assess how their interactions affect the classifier output. As a side note, the proposed framework can also easily combine different parametric and nonparametric methods for computing loglikelihood ratios, an interesting property that adds further flexibility. However, we leave this extension for future work.
The paper is organized as follows. We first briefly overview the theory of HDMR expansion, and how it can be used for binary classification, considering in particular the special case of second order HDMR expansion. We then explain the regressionbased algorithm of approximating the second order HDMR expansion and perform synthetic simulations comparing our method with several other popular classification rules proposed in the literature. Finally, we provide several real data examples, studying breast cancer, leukemia, and lung cancer.
Methods
Here we describe our classification methodology based on the HDMR expansion, studied in detail in [11, 13]. We briefly review the theory, and then show how it applies to binary classification.
HDMR expansion
HDMR provides a hierarchy of functions that describe how the interactions of variables affect the output. In particular, assuming output Z as a function of input random vector X=[X_{1},⋯,X_{D}], i.e., Z=h(X), HDMR studies how h(X) can be decomposed to a hierarchy of partial observations. Let F={1,⋯,D}. The HDMR expansion of order d is the collection of functions h_{u}(X_{u}) for all u⊆F with u≤d that minimize the mean square error (MSE) of estimating Z given E(ZX_{u}) for all u under the condition that for all f∈u, E(h_{u}(X_{u})X_{u∖f})=0, which is equivalent to a hierarchical orthogonality criterion [13], i.e., HDMR terms of different orders are independent of each other. From [13] we have
Eq. 3 suggests that in the general case of dependent variables a component function, h_{u}(x_{u}), depends on all other expansion terms that have a nonempty intersection with u. However, assuming elements of X are independent, the last term of (3) equals zero. While this greatly simplifies the process of computing the HDMR expansion, the independence assumption can be heavily violated for expression data. We hereafter use E_{d}(ZX) to denote the d^{th} order HDMR expansion.
Approximate second order HDMR for classification
We now focus on the second order HDMR expansion for correlated features. Observe that under the independence assumption, we have
for some \(w_{0},w_{f}, w_{f,f'} \in \mathbb {R}\). In case of dependent features, we still assume the second order HDMR expansion follows a structure similar to (4), except that the coefficients \(\phantom {\dot {i}\!}w_{f}, w_{f,f'}\) are different than the independent case. Now, consider a binary classification problem with class labels y=0,1 and feature index set F. Let X be a random unlabeled observation with true label y_{x}. Given a training sample \(\mathscr {S}\), it is desired to design a decision rule that assigns a label, \(\hat {y}_{x}\) to X so that \(\hat {y}_{x}=y_{x}\) with high probability. Note that given the full joint distribution of X and y_{x}, one could have easily computed P(y_{x}=1X), or equivalently the loglikelihood ratio L(X)= log(P(y_{x}=1X)/P(y_{x}=0X)), and use a decision rule \(\hat {y}_{x}=1_{L(X)>T}\), where 1_{q} is the indicator function of statement q being correct and T is a threshold.
However, the full joint distribution is typically not available, and is usually estimated using \(\mathscr {S}\). Alternatively, many models assume the classification rule belongs to a family parametrized by θ, which is estimated from \(\mathscr {S}\). For example, a generalized linear model (GLM) with the logit link assumes \(L(X)=\beta _{0}+ \sum _{f \in F}\beta _{f} X_{f}\), where X_{f} is the value of X for feature f. Here θ is the collection of β_{0} and β^{f}’s. However, such model may be insufficient when pairwise feature interactions are of interest, and it can be difficult to train a GLM considering all potential pairwise interactions using LASSO and elastic net penalties due to the potentially large number of feature pairs. Assuming Z=L(X), and \(\mathscr {S}\) is the training sample, we have
for some \(c_{0},c_{f},c_{f_{i},f_{j}} \in \mathbb {R}\) where
It now remains to estimate coefficients c_{f} and \(c_{f_{i},f_{j}}\), which is part of our classification algorithm discussed in the next section. Note that we also assume an external mechanism which already outputs \(E(L(X)X_{f},\mathscr {S})\) and \(E(L(X)X_{f_{i},f_{j}},\mathscr {S})\).
Identifying pairwise feature interactions
An important application is identifying feature interactions that are significantly different between the classes, i.e., identify u={f_{i},f_{j}}’s for which h_{u}≢0. We have the following hypothesis test:
Note this is a very difficult problem in general, and only the special case of Gaussian features is studied here. Assuming f_{i} and f_{j} are jointly Gaussian in each class, for an HDMR expansion of first order to be able to grasp the exact form of the loglikelihood ratio we need (1) all features of u to be independent given class label y, or (2) have the same covariance in both classes (assuming \(\mu ^{f}_{0} \neq \mu ^{f}_{1}\) for all f∈F). In either case we have \(L(X)=a_{0}+\sum _{f} a_{f} L(XX_{f})\), for some coefficients \(a_{0},a_{f} \in \mathbb {R}\). It is straightforward but tedious to show that other cases result in a second order expansion. Therefore, we can reformulate the hypothesis test of Eq. 7 as
where \(\rho ^{f_{i},f_{j}}_{y}\) and \(\Sigma ^{f_{i},f_{j}}_{y}\) are the correlation coefficient and covariance matrix of feature pair f_{i},f_{j} in class y, respectively. Figure 1 provides several examples on cases with and without pairwise feature interactions. In cases (a), (b), and (d) there are no pairwise feature interactions. Case (c) denotes a degenerate case and is studied in the Supplementary. Cases (e) and (f) depict feature pairs with pairwise interactions.
Testing conditional independence is a statistically difficult problem [14], and has been studied for certain cases in [14]. As an approximation, we adopt the following approach. Let \(P^{f_{i},f_{j}}_{y}\) be the pvalue of an independence test performed on data in class y. We use the Pearson linear correlation test for Gaussian data. Assuming points in different classes are independent, we treat them as independent tests, and use Fisher’s method for metaanalysis: \(C=2 \left (\log \left (P^{f_{i},f_{j}}_{0}\right)+ \log \left (P^{f_{i},f_{j}}_{1}\right)\right)\) follows a χ^{2} distribution with 4 degrees of freedom under the null, giving us the final pvalue. We use the likelihood ratio test of [15] with the χ^{2} adjustment to find pvalues of Σ0f_{i},f_{j}=Σ1f_{i},f_{j}. Finally, we use the union bound and add the two pvalues to obtain our final pvalue, which is an overestimate. We hereafter call this approach multiple test mixing for pairwise interactions (MTM) and note that it is appropriate for differential network analysis. Indeed, identifying feature pairs with interesting interactions, i.e., pairwise dependencies that require looking at a second order expansion, instead of a first order expansion, are a goal of coexpression and differential network analysis. Different modes of coexpression is discussed in [16], and some of its applications, such as expression analysis, functional classification, and genedisease prediction, are described in [17, 18]. Differential network analysis is further discussed in [19, 20].
The classification algorithm
Here we describe our approach to build a classifier that is inspired by second order HDMR expansion of the loglikelihood ratio. Again suppose sample \(\mathscr {S}\) is collected, and for each set u such that u≤2, we have a method that outputs the loglikelihood ratio of the test point belonging to class 1, i.e., we have a method that outputs \(L(X_{u}\mathscr {S})\) for all u such that u≤2. Given all these values, it remains to combine them into a test score. However, in a smallsample problem the number of feature pairs can be much larger than sample size. This creates an illposed problem as the number of equations is smaller than the number of parameters to estimate. Therefore, many classical methods for obtaining the model parameters from data, such as maximum likelihood, are not applicable. Not being able to compute the exact second order HDMR expansion by framing it as a regression problem is the main reason we need to look at alternative training methods, and is the main reason we label the concomitant classifier design approach as “HDMR expansion inspired” or as an “HDMR expansion perspective” for classification.
To circumvent the illposed problem, we use a variation of objective functions mostly studied in compressed sensing, e.g. [21], that estimate a sparse signal given 1bit quantized observations. The connection between these objectives and a convex relaxation to the logistic regression problem is discussed in [22]. Heuristically speaking, given a feature vector in the form of loglikelihood ratios of partial observations x_{u}, here we find weights that maximize the distance between the average points of each class. The heuristic for using such objective function is as follows. On one hand, HDMR obtains the weights that result in the “best” low dimensional representation, i.e., the MSE estimate of the loglikelihood ratio, yielding low prediction errors. On the other hand, weights that maximize the distance between the projections of the center points of the two classes into a one dimensional space should also yield low prediction error. Hence, such objective function should result in a model with an error probability close to that of the HDMR expansion. Here we have used the HDMR theory to obtain the functional form of the solution, and use algorithms borrowed from 1bit compressed sensing to estimate the HDMR coefficients.
In many highdimensional statistics applications, it is common to favor some bias to reduce estimation variance. Examples and a detailed discussion on this issue is provided in [23]. In order to reduce variance in our setting, we remove the weakest classifiers, in forms of single feature classification rules or relatively independent feature pairs whose information is already provided in the first order expansion. We compute
where \(\mathscr {S}_{y}\) is the restriction of \(\mathscr {S}\) to points in class y, and n_{y} is sample size in class y. We remove features for which r^{f}<T_{1}, where T_{1} is a threshold. For feature pairs we compute \(\phantom {\dot {i}\!}r^{f_{i},f_{j}}\) defined as
and remove feature pairs for which \(\phantom {\dot {i}\!}r^{f_{i},f_{j}}<T_{2}\), for some threshold T_{2}. Feature pairs for which \(r^{f_{i},f_{j}}>T_{2}\phantom {\dot {i}\!}\) are risk increasing pairs, and feature pairs for which \(r^{f_{i},f_{j}}<T_{2}\phantom {\dot {i}\!}\) are risk decreasing pairs. We now find weights that combine \(S(XX_{u},\mathscr {S})\) of feature and feature pairs with large r^{f} and \(r^{f_{i},f_{j}}\phantom {\dot {i}\!}\), respectively. Although the HDMR expansion of the loglikelihood ratio is unique, the actual true weights may be complicated to derive. Since we mostly compare the HDMR expansion against a specific threshold to assign a label to a newly observed point, we may consider the following optimization problem to approximately solve for the desired weights.
where V(x) is the collection \(S(X_{f}\mathscr {S})\)’s for all features f such that r^{f}>T_{1} and \(S(X_{f_{i},f_{j}}\mathscr {S})\) of all feature pairs f_{i},f_{j} such that \(\phantom {\dot {i}\!}r^{f_{i},f_{j}}>T_{2}\), and “.” denotes inner product. Figure 2 depicts how b^{∗} is selected given vectors V(X) for the training data. Given a new observation X, we find R(X)=b^{∗}.V(X), and we assign class label \(\hat {Y}=1_{R(X)>T}\), where T is a threshold. Note T_{1}, T_{2}, and T are parameters of the model. Here they are selected using a grid search within cross validation; however, efficient parameter tuning strategies should be explored (see the Conclusion and Future Work section). We hereafter refer to this classification rule as linear approximation second order HDMR expansion (LASHDMR). The pseudocode of LASHDMR is provided in Algorithm 1. To summarize, given the data, the machinery that outputs log likelihood ratios of features and feature pairs, and the algorithm parameters, LASHDMR computes the risks of individual features and feature pairs, removes weak ones, and computes the weights that maximize the distance between the centers of each class. The overall pipeline is provided in Fig. 3.
The block model extension
The optimization problem in Eq. 11 maximizes the Euclidean distance between the centers of points in different classes, which might be most suitable for cases where elements of V(x) are independent. However, the feature pairs in LASHDMR can be heavily correlated. Therefore, it might be useful to merge heavily correlated feature pairs in blocks, average feature pairs of each block, and use the average loglikelihood of each block in V(x). Note that almost any community detection algorithm over graphs can be used to cluster feature pairs into blocks, where each node is a feature pair and the edges measures the correlations between loglikelihood ratios of the two feature pairs (see [24, 25] for a review on graph community detection). We consider the following simple block construction scheme. For each feature f_{i} construct risk increasing and risk decreasing blocks \(\phantom {\dot {i}\!}P^{f_{i}}\) and \(\phantom {\dot {i}\!}N^{f_{i}}\), respectively, as follows:
Afterwards, for each block \(\phantom {\dot {i}\!}P^{f_{i}}\) and \(N^{f_{i}}\phantom {\dot {i}\!}\) compute \(r^{P^{f_{i}}}\phantom {\dot {i}\!}\) and \(r^{N^{f_{i}}}\phantom {\dot {i}\!}\), the risks of positive and negative risk feature pairs containing f_{i}, being the average risks of feature pairs in the block. We then remove “weak” blocks. Weak risk increasing blocks are those for which \(r^{P^{f_{i}}}<T_{3}\phantom {\dot {i}\!}\), and weak risk decreasing blocks are those for which \(r^{N^{f_{i}}}>T_{3}\phantom {\dot {i}\!}\). Note T_{3} is a parameter of the model that is tuned via a grid search within cross validation. Now, given observation x, V(x) is comprised of the loglikelihood ratio of single features for which r^{f}>T_{1} and average \(\phantom {\dot {i}\!}S(X_{f_{i},f_{j}})\) of risk increasing and risk decreasing blocks that had absolute average risk large than T_{3}. We again use Eq. 11 to obtain HDMR coefficients. Finally, given new observation X, V(X) is formed, R(X)=b^{∗}.V(X) is computed, and \(\hat {Y}=1_{\{R(X)>T\}}\) is the predicted label. We hereafter call this classification algorithm linear approximation of block second order HDMR expansion (LABSHDMR). The pseudocode of LABSHDMR is described in Algorithm 2, and the overall pipeline is provided in Fig. 3.
Synthetic simulations
Here we perform several simulations to study LASHDMR and LABSHDMR classifiers in more detail. We use a synthetic model developed to mimic microarrays and gene expression levels for data generation. The original model is proposed in [26], and has been extended in [27, 28]. Here we use the extended model of [27] in which features are markers or nonmarkers. Markers are either global or heterogeneous, and comprise blocks of size k, where features in the same block are dependent and features in different blocks are independent of each other. Each block of global markers in class 0 is Gaussian with zero mean and covariance \(\sigma ^{2}_{0} \Sigma _{0}\), where diagonal elements of Σ_{0} are 1 and offdiagonal elements are ρ_{0}. Global markers in class 1 are either synergetic or marginal. In the synergetic case mean vector of each block in class 1 is [1,1/2,⋯,1/k], and in the marginal case it is [1,0,⋯,0]. The covariance matrix of the Gaussian distribution is \(\sigma ^{2}_{1} \Sigma _{1}\). Diagonal elements of Σ_{1} are 1 and off diagonal elements are ρ_{1}. Heterogeneous markers are similar to global markers in class 0 but comprise c subclasses in class 1, where in each subclass certain points follow a distribution similar to class 1 global markers and the remaining points are similar to class 0 markers. Nonmarkers are either low variance or high variance. Low variance nonmarkers are similar to class 0 markers. High variance nonmarkers are independent of each other, and each HV nonmarker follows a Gaussian mixture of the form \(pN(0,\sigma ^{2}_{0})+(1p)N(1,\sigma ^{2}_{1})\), where p is drawn uniformly at random over the interval [0, 1]. Note \(\sigma ^{2}_{0}\), \(\sigma ^{2}_{1}\), ρ_{0}, ρ_{1}, k, and c are parameters of the model. Figure 4, adopted from [26], shows an illustration of the feature types developed in this synthetic model.
We now study how LASHDMR and LABSHDMR perform under this model. In this simulation, all features are heterogeneous markers, to create a more difficult problem. We fix F=60, c=2, k=10, \(\sigma ^{2}_{0}=0.25\), and \(\sigma ^{2}_{1}=0.64\) and consider 4 scenarios: (1) synergetic markers with ρ_{0}=ρ_{1}=0.5, (2) synergetic markers ρ_{0}=0.1, and ρ_{1}=0.9, (3) marginal markers with ρ_{0}=ρ_{1}=0.5, and (4) marginal markers with ρ_{0}=0.1 and ρ_{1}=0.9. We generate a stratified sample of size n (to be specified below) with an equal number of points in each class for training, and a stratified sample of size 2000 with an equal number of points in each class for testing. Given training data, several classifiers are trained, which are then applied to the test data. We compute the receiver operator characteristic (ROC) curve and the area under curve (AUC) averaging over 100 iterations. Note large test sets were used to accurately compute prediction errors. Despite the test data being balanced, we believe AUC is a more reliable performance statistic than accuracy, as experimental data are typically imbalanced. That being said, in the current setup, for each point on the ROC curve obtained for threshold T, accuracy is 1−0.5(P_{I}+P_{II}), where P_{I} and P_{II} are probabilities of Type I and Type II errors, respectively.
In addition to LASHDMR and LABSHDMR we implement the following classifiers for comparison: regularized quadratic discriminant analysis (RQDA) with regularization value ranging from 0 to 1 in steps of 0.1, regularized linear discriminant analysis (RLDA) with regularization value ranging from 0 to 1 in steps of 0.1, linear support vector machine (SVM), random forest (RF) with the number of tree ranging from 10 to 100 in steps of 20, k nearest neighbors (kNN) with k=3,5,⋯,30, and generalized linear models with linear and quadratic probit links using LASSO and elastic net penalties (α=0.5) with penalty coefficients λ=0.01:0.01:0.1. These methods are discussed in detail in [29–31].
For each family with multiple tuning parameters and for each sample size, we report the largest AUC among all tested parameter values over the test data. Figure 5 plots the AUCs over test data averaging over 100 iterations as the sample size increases from 20 to 80 in steps of 4. When features have similar correlation matrices in both classes, classical methods such as RLDA and RQDA perform best and are closely followed by LASHDMR and LABSHDMR. However, when correlation coefficients differ between the two classes LASHDMR and LABSHDMR outperform other tested classifiers. Here we observe little difference between LASHDMR and LABSHDMR, suggesting we do not need to merge feature pairs into blocks and the number of feature pairs to consider is not too large for this problem. We leave a more thorough comparison of LASHDMR and LABSHDMR for future work. ROC plots are provided in the Supplementary.
Note that we started from the problem of finding the “best” second order representation of the loglikelihood ratio, but, due to computational difficulties, had to make several assumptions and approximations along the way. Therefore, it is possible that we end up misspecifying the exact second order HDMR decomposition. In such scenarios, it can be very probable that another method outperforms LASHDMR and LABSHDMR. Note LASHDMR and LABSHDMR enjoy competitive overall performance in all tested scenarios, and outperform other methods when correlation coefficients differ between the classes. They are competitive methods, and hence can be suitable for a wide range of problems. Note that settings where LASHDMR and LABSHDMR do not perform best correspond to those in which correlation coefficients are equal in both classes, for which RLDA and linear probit models perform best. This suggests maybe in these cases the first order HDMR expansion is more appropriate to represent the data (LDA is equivalent to a first order HDMR expansion under its modeling assumptions), although variances are slightly different between the classes. The small sample sizes used in this simulation may impede quadratic classifiers to satisfactorily estimate the distribution parameters, which may result in their poor performance. Similar patterns are observed in [32, 33]. Additionally, given that the first order expansion is sufficient to represent data, a second order HDMR model might suffer curse of dimensionality.
Identifying pairwise interactions
Here we evaluate the performance of MTM in identifying significant pairwise feature interactions. A comparison of MTM with the method of [9] is provided in the Supplementary. We fix F=5000, with 20 global markers, 80 heterogeneous markers, and 2000 high variance nonmarkers. We again assume k=10, c=2, \(\sigma ^{2}_{0}=0.25\), \(\sigma ^{2}_{1}=0.64\), and consider the 4 scenarios of the previous section for mean types and correlation coefficients. Figure 6 provides the ROC curves of MTM for different marker and correlation values when n=40, averaging over 100 iterations. MTM performs best when correlations are different between the two classes: red and blue lines denoting unequal correlations for synergetic and marginal markers, respectively, enjoy a higher probably of detection compared with black and magenta lines denoting equal correlations for synergetic and marginal markers, respectively. Note the mean types (marginal or synergetic) have little effect on the ROC curves. Figure 7 provides ROC curves of the equal correlation cases for different sample sizes, averaging over 100 iterations. As expected, with increase in the sample size it becomes easier to detect pairwise interactions. To benchmark MTM we compared it with the absolute conversion method of [9], which is proposed as a “fast” algorithm. However, we observed it is computationally more intensive than MTM. We considered marginal and synergetic markers with equal correlations, fixed n=40, and reduced the number of iterations to 50. Results are provided in the Supplementary, in which MTM outperforms the method of [9].
Experimental data analysis
We apply LASHDMR, LABSHDMR, and the comparison classifiers of the previous section to datasets studying relapsing breast and lung cancer patients. We also evaluate if MTM can detect significant pairwise gene interactions in realistic settings. We specifically selected datasets resulting in tasks more challenging than healthy versus normal labels. Such datasets are in particular challenging as data can be small in size and imbalanced (only a small portion of followed up patients may relapse). Additionally, breast and lung cancers are well studied in the literature, allowing us to evaluate if the detected patterns are biologically plausible. A leukemia dataset is studied in the Supplementary.
Breast cancer
The data obtained in [34] and [35] deposited on gene expression omnibus (GEO) database [36] with accession number GSE25066, containing expression levels of 397 relapse free and 111 relapsing breast cancer patients, all of whom went through neoadjuvant taxaneanthracycline chemotherapy. Data is based on the GPL96 platform, and is already preprocessed and normalized. The dataset contains 22,283 probes, of which 20,967 map to genes. We only use probes that map to genes in our analysis. First, 100 relapsing and 360 nonrelapsing patients randomly select as training data, and the remaining points are used for testing. The likelihood ratio test (LRT) statistic of [37], which is equivalent to the optimal Bayesian filter scoring function under independent Gaussian models [27, 38] under Jeffreys prior, is used to select the top 100 differentially expressed genes. We iterate 100 times.
2D and 3D tSNE [39] plots using the cityblock distance are provided in Fig. 8, suggesting the two classes do separate. It seems each class contains a few points which may truly belong to the other class, i.e., each class is polluted with a small subpopulation truly belonging to the other class. Alternatively, larger followup times may be necessary to further determine if certain nonrelapsing patients relapse, and hence should belong to class 1. The large number of nonrelapsing patients that resemble relapsing patients reduces the measured AUC. Table 1 lists the AUC of different classifiers on this dataset (over the hold out test data). Figure 9a provides the ROC plots.
As Table 1 suggests, all methods do not enjoy a high AUC. We observed that the variant of LASHDMR using RQDA with λ=0.8 achieved the highest AUC. The largest AUC for the variant of LASHDMR using RLDA was 63.12% obtained for λ=0.9. In contrast, LABSHDMR seems to enjoy the highest AUC, obtained using RQDA with λ=0.1, which is the closest tested variant to conventional QDA. This may suggest that a second order expansion is not satisfactory enough for this dataset, emphasizing the need to look at higher order expansions. Finally, Fig. 10a provides the KaplanMeier survivorship plots based on the assigned labels to the test data, averaging over 100 iterations for LASHDMR and LABSHDMR. The figure provides extra assurance that indeed the proposed algorithms separate the two classes, the approximate second order HDMR expansion of the loglikelihood ratio, i.e., R(X), is an appropriate statistic to denote the “risk” of an event, and the proposed methods can be further used in conjunction with other data analysis tools. As tSNE plots in Fig. 8 suggest, many low risk patients resemble high risk patients, and we expect a welldesigned classification rule to mislabel such points; otherwise, the separating plane (curve) should be extremely complex, raising serious concerns of overfitting. This explains why many high risk patients have not relapsed up to the followup time.
Although LASHDMR and LABSHDMR may not yield high classification accuracy similar to other classification algorithms, their glass box nature simplifies the process of identifying genes and gene pairs that contribute the most to the classifier’s prediction. We use all of the data for training, and use RQDA with λ=0.8 to obtain the loglikelihood ratios. Table 2 lists the top 10 genes of LASHDMR and their risks. Many of the top genes, such as IL8 [40], also known as CXC motif chemokine ligand 8 (CXCL8), and growth regulating estrogen receptor binding 1 (GREB1) [41] are suggested to be affected in breast cancer. Table 3 lists the top 10 LASHDMR gene pairs. Comparing Tables 2 and 3 we observe that gene interactions tend to have a larger risk than individual features. Note all risks describe the average increase/decrease of the log likelihood ratio, and are hence on the same scale for all genes and gene pairs. Scatter plots of several gene pairs with interesting interactions is provided in the Supplementary). For example, we observed either GREB1 or carboxypeptidase B1 (CPB1) are overexpressed among nonrelapsing patients, and underexpression of both GREB1 and CPB1 is necessary to have a high risk of relapse. Finally, many of the top genegene interaction pairs contain GREB1, signal peptide CUB domain EGFlike 2 (SCUBE2), GATA Binding Protein 3 (GATA3), and IL8, suggesting their interaction might be key to studying breast cancer. The variant of LABSHDMR that achieved the highest AUC used RQDA with λ=0.1, and is studied in detail in the Supplementary.
Now we look for significant pairwise gene interactions. First, for each gene, we only consider the probe ranking highest by LRT so that probes mapping to the same genes do not disrupt the analysis, resulting in 13,211 different genes. This results in 87,258,655 tests. We observed that MTM can be heavily affected by small subpopulations, heavy tails, and outliers, and therefore used MATLAB’s built in isoutlier function with its default settings to remove potential outliers before further analysis. Bounding the false discovery rate (FDR) by 5% using the BenjaminiHochberg (BH) procedure [42] 1,275,351 pairwise interactions are significant (about 1.46% of tested hypotheses). Table 4 lists several of the top gene pairs and their adjusted pvalues. Figure 11 provides scatter plots of several gene pairs. We observe many interesting patterns that require further investigation: (1) underexpression of both SAM pointed domain containing ETS transcription factor (SPDEF) and MLPH, also known as synaptotagminlike protein 2A (SLAC2A), increases the risk of relapse, (2) overexpression of anterior gradient protein 2 homolog (AGR2) and NAcetyltransferase 1 (NAT1) is an indicator of low risk, overexpression of AGR2 and underexpression of NAT1 is an indicator of “medium” risk, and underexpression of both AGR2 and NAT1 is an indicator of high risk, (3) overexpression of NAT1 or DNALI1 is an indicator of low relapse risk. Comparing Fig. 11a and d suggests solute carrier family 2 member 10 (SLC2A10) and MLPH are heavily correlated with a positive correlation coefficient, which is indeed observed in the data as well.
Considering a weighted graph where nodes are genes and edge weights are − logpvalue of the gene pair, we observed many detected gene pairs construct highly connected clusters. To verify if the selected gene pairs are biologically relevant we (a) associated each gene with its smallest gene pair pvalue, (b) selected the top 200 genes, (c) constructed their graph, (d) used community detection algorithm of [43] to identify network clusters, (e) selected the genes corresponding to the largest cluster, and (f) used Ingenuity Pathway Analysis^{Footnote 1} (IPA) [44] to identify the networks associated with these genes only using experimentally observed results as well as the top canonical pathways. The top canonical pathways and the largest detected network are provided in Figs. 12 and 13, respectively. Note the top ranking IPA gene network is identified with cellular development, cellular growth, and cell cycle functions. The log fold changes, computed using method of [45], were used to identify over/underexpressed genes in the network; however, as data is highly heterogeneous, these effects might not be highly pronounced. Many of the selected genes are connected directly or indirectly with only one gene in between. Literature review suggests many of the top IPA pathways are also affected in breast cancer.
We now randomly leave out one point in each class, train LASHDMR, and look at the genes and gene pair that yield the highest scores in absolute values for these two test points. The minimum value for HDMR terms, either S(X_{f}) or \(S(X_{f_{i},f_{j}})\), was −3.62 and −0.59 for the points in classes 0 and 1, corresponding to gene pairs GREB1 and NAT1, and ZNF673 and ZNF391, respectively. The largest values were 3.64 and 14.32 respectively, corresponding to the ERBB2 gene, and gene pair ZNF395 and PTOV1. Figure 14 plots how the different genes and gene pairs are combined to arrive at the final loglikelihood ratio estimate.
Lung cancer
Data obtained in [46] is deposited on GEO with accession number GSE68465, containing expressions of 443 lung cancer patients. 279 patients whose cancer relapsed or died within the follow up time comprise class 1, and the remaining 164 patients comprise class 0. This dataset is based on the GPL96 platform. We again only use probes mapping to genes, perform a lognormalization step, randomly select 250 points in class 0 and 140 points in class 1 for training, use the remaining points for testing, use the top 100 LRT genes for classifier design which we evaluate on test data, and iterate 100 times. Before we train the classifiers we provide 2D and 3D tSNE [39] plots using the cityblock distance in Fig. 15, suggesting the two classes do separate; however, many patients who relapsed or died within six year (high risk) resemble those who survived (low risk). Both tSNE plots suggest there are at least two high risk subpopulations.
Table 1 lists the AUCs (over test data), and Fig. 9b provides the ROC curves. Again we observe that none of the classifiers enjoy a very high AUC, and both LASHDMR and LABSHDMR enjoy competitive performance compared with other classifiers. This may again suggest that a quadratic model might not be enough to capture the complicate structure of data. In this dataset, the variants of LASHDMR and LABSHDMR achieving the highest AUCs are RLDA with λ=0.1 and RLDA with λ=0.2, respectively. Figure 10b provides the KaplanMeier survivorship plots based on the assigned labels to the test data, averaging over 100 iterations for LASHDMR and LABSHDMR, providing extra assurance that indeed the proposed algorithms separate the two classes. Here, the large ratio of high risk patients who resemble low risk ones in the tSNE plots of Fig. 15 suggest that a reasonable decision rule would mislabel many high risk patients as low risk, explaining the low survivorship of the estimated low risk patients in Fig. 10b.
Table 5 lists the top 10 genes of LASHDMR and their associated risks. We again observe that many of the top genes, such as bromodomain PHD finger transcription factor (BPTF) [47] and LUC7 like 3 preMRNA splicing factor (LUC7L3) [48], are shown or suggested to be affected in lung cancer. Table 6 lists the top 10 gene pairs and their associated risks. A deeper analysis, including the results of LABSHDMR, is provided in the Supplementary.
For each gene we only use the probe ranking highest by LRT giving us 13,211 different genes and 8,758,655 pairwise dependence tests. We also perform outlier detection to further improve identifying general gene interaction patterns. Bounding FDR by 5% using BH 701410 gene pairs are significant, about 0.8% of all tests. Table 7 lists several of the top gene pairs and their adjusted pvalues. Figure 16 provides scatter plots. We again observe MTM detects interesting pairwise interactions: (a) It seems there is a subpopulation of high risk lung cancer patients with poor survival for whom BCL2 associated transcription factor 1 (BCLAF1) is underexpressed and interleukin enhancer binding factor 3 (ILF3) is overexpressed. For other patients, irrespective of their label, these two genes are positively correlated. (b) Patients for whom both laminin subunit gamma 2 (LAMC2) and cadherin 3 (CDH3) are overexpressed have a high chance of relapse/death, those for whom CDH3 is over expressed and LAMC2 is underexpressed have a “medium” chance of relapse/death, and those for whom both CDH3 and LAMC2 are underexpressed have a low chance of relapse/death. (c) Under expression of either BCLAF or SOS Ras/Rho guanine nucleotide exchange factor 2 (SOS2) can be used as an indicator of poor survival. (d) Patients for whom both Annexin A1 (ANXA1) and CDH3 are overexpressed have a medium chance of relapse/death, those for whom CDH3 is over expressed and ANXA1 is underexpressed have a high chance of relapse/death, and those for whom both CDH3 and ANXA1 are underexpressed have a low chance of relapse/death. Finally, we again perform the IPA analysis similar to the breast cancer dataset, expect we include highly probable interactions in the analysis as well as experimentally observed ones. Figure 17 plots the detected network, which is associated with cell cycle, cellular assembly and organization, and cellular function and maintenance. Again observe many of the selected genes are connected with at most two genes in between. A deeper analysis is provided in the Supplementary.
Discussion
In the simulations and real data analyses we observed that both LASHDMR and LABSHDMR enjoy competitive prediction accuracies compared with several popular classification rules. Additionally, they explicitly reveal how individual features or pairwise feature interactions motivate certain decisions, and how unique patterns of a new observation motivate its predicted label. Scatter plots of breast and lung cancer gene pairs in Figs. 11 and 16, respectively, suggest a quadratic classifier is adequate for predicting class labels from each gene pair; however, AUCs are not very high. This suggests higher order expansions, i.e., the joint interaction of three genes or more, are necessary to increase prediction accuracy. In these datasets we observe LABSHDMR assigns larger risks to the top gene pairs compared with individual genes (see Tables 2 and 3 for breast cancer and Tables 5 and 6 for lung cancer), suggesting gene interactions are crucial to reliable prediction; not that linear classifiers miss important information in the data, but that pairwise interactions seem to carry more information about class labels than individual genes. In the leukemia dataset (studied in the Supplementary) we observed both LASHDMR and LASHDMR perform competitively, but all methods seem to enjoy high AUCs (AUC was larger than 94% for all classifiers). In particular, we observed highest AUCs for linear classifiers, suggesting there is no need to use more complex rules. In particular, quadratic rules such as RQDA perform inferior to LASHDMR and LABSHDMR. Finally, in the leukemia dataset we observe gene pairs have much smaller risks compared with individual genes.
MTM is an interesting test, allowing to extract coexpression patterns that differ between the two classes, i.e., identify pairwise interactions that differ between the two classes. MTM efficiently takes advantage of information in the data and is computationally fast. This is in contrast to several recent pipelines proposed for analyzing genepairs that are rather computationally intensive or may rely on large datasets (see [4, 5, 7–10] for examples).
Conclusion and future work
Glass box models for binary classification open many avenues of research for analyzing genomic data as they enable us to make meaningful hypotheses about the underlying biological dysfunctions that are involved in a disease. To that end, HDMR seems to be an interesting theory for studying low dimensional glassbox models. The limitation of this approach in its current form is the assumption of known mechanisms that output the loglikelihood ratios given any partial observation. On the other hand, different feature sets can use different classification rules tailored to their joint distribution. This provides HDMR with tremendous flexibility, for instance, we may use QDA for some feature while a GLM is used for another feature pair. While this is a very exciting potential benefit of HDMR, we did not really exploit it in our current analysis and leave its careful consideration for possible future work. Another future direction for improving LASHDMR and LABSHDMR is the development of more efficient methods of parameter selection to reduce computation cost of the currently implemented grid search approach.
The ability of MTM to identify pairwise interactions containing information not encoded in the likelihood function of each feature is a very practical contribution of our work. Here, instead of determining if two features are dependent, the goal is to verify if the pairwise interactions contain additional information about the classes which no model can extract by observing features individually. To do so, we developed a test where the null assumes the first order HDMR expansion is sufficient to explain the class differences. In the case of Gaussian features this translates into testing specific structures on the covariance matrices. Synthetic simulations and experimental data analyses suggest that MTM is indeed a powerful tool to extract dysregulated pairwise gene coexpression patterns that motivate new hypotheses about cancer biology. In the real data analyses we observed many pairs might have a gene in common, for instance, SPDEF in the breast cancer dataset and BCLAF1 and CDH3 in the lung cancer dataset. These patterns motivate hypotheses not only about gene pairs, but more generally about heavily correlated marker families. Graph representations are interesting tools for analyzing families of pairwise interactions, and graph community detection algorithms can infer marker interaction blocks given pairwise interaction graphs. However, future work should investigate the use of HDMR for directly inferring such structures. To that end, LABSHDMR seems to be an ideal first step, where its constructed blocks can be potential first approximations to marker families of interest.
Availability of data and materials
All cancer data used in this work is publicly available via GEO database accession numbers GSE25066 and GSE68465. The synthetic datasets are available upon request.
Notes
 1.
QIAGEN Inc., https://www.qiagenbioinformatics.com/products/ingenuitypathwayanalysis
Abbreviations
 AGR2:

Anterior gradient protein 2 homolog
 ANXA1:

Annexin A1
 AUC:

Area under the ROC curve
 BCLAF1:

BCL2 associated transcription factor 1
 BH:

BenjaminiHochberg stepup procedure
 BPTF:

Bromodomain PHD finger transcription factor
 CDH3:

Cadherin 3
 CPB1:

Carboxypeptidase B1
 CXCL8:

CXC motif chemokine ligand 8
 FDR:

False discovery rate
 GATA3:

GATA binding protein 3
 GEO:

Gene expression omnibus
 GLM:

Generalized linear model
 GREB1:

Growth regulating estrogen receptor binding 1
 HDMR:

High dimensional model representation
 ILF3:

Interleukin enhancer binding factor 3
 IPA:

Ingenuity pathway analysis
 kNN:

k nearest neighbors
 LABSHDMR:

Linear approximation of block second order HDMR expansion
 LAMC2:

Laminin subunit gamma 2
 LASHDMR:

Linear approximation second order HDMR expansion
 LASSO:

Least absolute shrinkage and selection operator
 LDA:

Linear discriminant analysis
 LRT:

Likelihood ratio test
 LUC7L3:

LUC7 like 3 preMRNA splicing factor
 MSE:

Mean squared error
 MTM:

Multiple test mixing for pairwise interactions
 NAT1:

NAcetyltransferase 1
 QDA:

Quadratic discriminant analysis
 RF:

Random forest
 ROC:

Receiver operator characteristic
 RLDA:

Regularized linear discriminant analysis
 RQDA:

Regularized quadratic discriminant analysis
 SLC2A10:

Solute carrier family 2 member 10
 SOS2:

Ras/Rho guanine nucleotide exchange factor 2
 SPDEF:

SAM pointed domain containing ETS transcription factor
 SCUBE2:

Signal peptide CUB domain EGFlike 2
 SLAC2A:

Synaptotagminlike protein 2A
 SVM:

Support vector machine
 TCGA:

The cancer genome atlas
References
 1
Sima C, Dougherty ER. What should be expected from feature selection in smallsample settings. Bioinformatics. 2006; 22(19):2430–6.
 2
Sima C, Dougherty ER. The peaking phenomenon in the presence of featureselection. Pattern Recognit Lett. 2008; 29(11):1667–74.
 3
Tutuncuoglu B, Krogan NJ. Mapping genetic interactions in cancer: a road to rational combination therapies. Genome Med. 2019; 11(1):62.
 4
ReganFendt KE, Xu J, DiVincenzo M, Duggan MC, Shakya R, Na R, Carson WE, Payne PR, Li F. Synergy from gene expression and network mining (syngenet) method predicts synergistic drug combinations for diverse melanoma genomic subtypes. NPJ Systs Biol Appl. 2019; 5(1):1–15.
 5
Deng X, Das S, Valdez K, Camphausen K, Shankavaram U. Slbiodp: Multicancer interactive tool for prediction of synthetic lethality and response to cancer treatment. Cancers. 2019; 11(11):1682.
 6
Henkel L, Rauscher B, Boutros M. Contextdependent genetic interactions in cancer. Curr Opin Genet Dev. 2019; 54:73–82.
 7
Chen Y, Cao D, Gao J, Yuan Z. Discovering pairwise synergies in microarray data. Sci Rep. 2016; 6:30672.
 8
Watkinson J, Wang X, Zheng T, Anastassiou D. Identification of gene interactions associated with disease from gene expression data using synergy networks. BMC Syst Biol. 2008; 2(1):10.
 9
Xing P, Chen Y, Gao J, Bai L, Yuan Z. A fast approach to detect gene–gene synergy. Sci Rep. 2017; 7(1):1–8.
 10
Chopra P, Lee J, Kang J, Lee S. Improving cancer classification accuracy using gene pairs. Plos ONE. 2010; 5(12).
 11
Li G, Rabitz H. General formulation of HDMR component functions with independent and correlated variables. J Math Chem. 2012; 50(1):99–130.
 12
Lu R, Wang D, Wang M, Rempała GA. Estimation of Sobol’s sensitivity indices under generalized linear models. Commun StatTheory Methods. 2018; 47(21):5163–95.
 13
Hooker G. Generalized functional ANOVA diagnostics for highdimensional functions of dependent variables. J Comput Graph Stat. 2007; 16(3):709–32.
 14
Shah RD, Peters J. The hardness of conditional independence testing and the generalised covariance measure. arXiv preprint arXiv:1804.07203. 2018.
 15
Gupta AK, Tang J. Distribution of likelihood ratio statistic for testing equality of covariance matrices of multivariate Gaussian models. Biometrika. 1984; 71(3):555–9.
 16
Crow M, Paul A, et al. Exploiting singlecell expression to characterize coexpression replicability. Genome Biol. 2016; 17(1):101.
 17
van Dam S, Võsa U, et al.Gene coexpression analysis for functional classification and gene–disease predictions. Brief Bioinforma. 2017; 19(4):575–92.
 18
Ruan J, Dean AK, et al. A general coexpression networkbased approach to gene expression analysis: comparison and applications. BMC Syst Biol. 2010; 4(1):8.
 19
Ideker T, Krogan NJ. Differential network biology. Mol Syst Biol. 2012; 8(1):565.
 20
Gill R, Datta S, Datta S. A statistical framework for differential network analysis from microarray data. BMC Bioinformatics. 2010; 11(1):95. https://doi.org/10.1186/147121051195.
 21
Plan Y, Vershynin R. Onebit compressed sensing by linear programming. Commun Pure Appl Math. 2013; 66(8):1275–97.
 22
Plan Y, Vershynin R. Robust 1bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Trans Inf Theory. 2013; 59(1):482–94.
 23
Wasserman L. All of Nonparametric Statistics, 1st edn. New York: Springer; 2010.
 24
Fortunato S. Community detection in graphs. Phys Rep. 2010; 486(3):75–174.
 25
Lancichinetti A, Fortunato S. Community detection algorithms: a comparative analysis. Phys Rev E. 2009; 80(5):056117.
 26
Hua J, Tembe WD, et al. Performance of featureselection methods in the classification of highdimension data. Pattern Recog. 2009; 42(3):409–24.
 27
Foroughi pour A, Dalton LA. Heuristic algorithms for feature selection under Bayesian models with blockdiagonal covariance structure. BMC Bioinformatics. 2018; 19(3):70.
 28
Foroughi pour A, Dalton LA. Robust feature selection for block covariance Bayesian models. In: Proceedigns of IEEE International Conference on Acoustics, Speech and Signal Processing: 2017. p. 2696–700.
 29
Fukunaga K. Introduction to Statistical Pattern Recognition, 2nd edn. Boston, MA: Academic Press; 1990. https://doi.org/10.1016/B9780080478654.500053. http://www.sciencedirect.com/science/article/pii/B9780080478654500053.
 30
Theodoridis S, Koutroumbas K. Pattern Recognition, 4th edn. Boston, MA: Academic Press; 2009. https://doi.org/10.1016/B9781597492720.500050. http://www.sciencedirect.com/science/article/pii/B9781597492720500050.
 31
Bishop CM. Pattern Recognition and Machine Learning, 1st edn. New York: Springer; 2006.
 32
Lu J, Plataniotis KN, et al.Regularized discriminant analysis for the small sample size problem in face recognition. Pattern Recog Lett. 2003; 24(16):3079–87.
 33
Wu B, Abbott T, et al.Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics. 2003; 19(13):1636–43.
 34
Itoh M, Iwamoto T, et al.Estrogen receptor (ER) mRNA expression and molecular subtype distribution in ERnegative/progesterone receptorpositive breast cancers. Breast Cancer Res Treat. 2014; 143(2):403–9.
 35
Hatzis C, Pusztai L, et al. A genomic predictor of response and survival following taxaneanthracycline chemotherapy for invasive breast cancer. JAMA. 2011; 305(18):1873–81.
 36
Edgar R, Domrachev M, et al. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002; 30(1):207–10.
 37
Pearson ES, Neyman J. On the problem of two samples In: Neyman J, Pearson ES, editors. Joint Statistical Papers 1967: 1930. p. 99–115.
 38
Foroughi pour A, Dalton LA. Optimal bayesian filtering for biomarker discovery: Performance and robustness. IEEE/ACM Trans Comput Biol Bioinforma (to appear). 2018.
 39
Maaten Lvd, Hinton G. Visualizing data using tsne. J Mach Learn Res. 2008; 9(Nov):2579–605.
 40
Finak G, Bertos N, et al.Stromal gene expression predicts clinical outcome in breast cancer. Nat Med. 2008; 14(5):518–27.
 41
Rae JM, Johnson MD, et al.GREB1 is a critical regulator of hormone dependent breast cancer growth. Breast Cancer Res Treat. 2005; 92(2):141–9.
 42
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol). 1995:289–300.
 43
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008; 2008(10):10008.
 44
Krämer A, Green J, et al.Causal analysis approaches in ingenuity pathway analysis. Bioinformatics. 2013; 30(4):523–30.
 45
Zhou W, Wang Y, et al. A standardized fold change method for microarray differential expression analysis used to reveal genes involved in acute rejection in murine allograft models. FEBS Open Bio. 2018; 8(3):481–90.
 46
Shedden K, Taylor JMG, et al.Gene expression–based survival prediction in lung adenocarcinoma: a multisite, blinded validation study. Nat Med. 2008; 14(8):822–7.
 47
Dai M, Lu JJ, et al. BPTF promotes tumor growth and predicts poor prognosis in lung adenocarcinomas. Oncotarget. 2015; 6(32):33878–92.
 48
Lu Y, Wang L, et al. Geneexpression signature predicts postoperative recurrence in stage I nonsmall cell lung cancer patients. PloS ONE. 2012; 7(1):30880.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Science Foundation (DMS 1853587 and DMS 1923038 to GR). The funding body did not play any role in the design of the study, analysis and interpretation of data, or in writing the manuscript.
Author information
Affiliations
Contributions
AF and GR developed the proposed approach and performed analyses. AF drafted the initial manuscript. LAD, AF, MP, and GR provided comments on the draft and helped edit the manuscript. All authors reviewed and approved the final manuscript. AF is currently at the Jackson laboratory for genomic medicine.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplementary. The supplementary contains extra information regarding the synthetic simulations and real data analysis. It also studies a Leukemia dataset not discussed in the main manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Foroughi pour, A., Pietrzak, M., Dalton, L.A. et al. High dimensional model representation of loglikelihood ratio: binary classification with expression data. BMC Bioinformatics 21, 156 (2020). https://doi.org/10.1186/s128590203486x
Received:
Accepted:
Published:
Keywords
 High dimensional model representation
 Classification
 Disease prediction
 Loglikelihood ratio
 Expression analysis