Skip to main content

Quasi-linear score for capturing heterogeneous structure in biomarkers



Linear scores are widely used to predict dichotomous outcomes in biomedical studies because of their learnability and understandability. Such approaches, however, cannot be used to elucidate biodiversity when there is heterogeneous structure in target population.


Our study was focused on describing intrinsic heterogeneity in predictions. Because heterogeneity can be captured by a clustering method, integrating different information from different clusters should yield better predictions. Accordingly, we developed a quasi-linear score, which effectively combines the linear scores of clustered markers. We extended the linear score to the quasi-linear score by a generalized average form, the Kolmogorov-Nagumo average. We observed that two shrinkage methods worked well: ridge shrinkage for estimating the quasi-linear score, and lasso shrinkage for selecting markers within each cluster. Simulation studies and applications to real data show that the proposed method has good predictive performance compared with existing methods.


Heterogeneous structure is captured by a clustering method. Quasi-linear scores combine such heterogeneity and have a better predictive ability compared with linear scores.


In recent years, biomedical data have become complicated and high-dimensional [1, 2]. For example, a single human gene expression dataset contains tens of thousands of features, many of which are highly correlated [3]. In addition, large mixed datasets are crucial for personalized treatment, in which the optimal treatment strategy is determined based on a dataset that combines a very large number of prognostic factors [4].

From the viewpoint of statistical machine learning, supervised and unsupervised learning methods play central roles in such biomedical studies [5]. In fact, shrinkage methods such as ridge and lasso are frequently used in the context of prediction [6], and clustering methods are used in the context of interpretation [7], potentially revealing novel findings.

Supervised learning methods are often used to estimate risk scores when predicting dichotomous outcomes. Linear scores are among the most widely used forms because it is easy to learn the predictive score from a training dataset. Moreover, it is easy to understand the estimated score. Linear scores are often evaluated by linear discriminant or logistic regression analysis, and achieve not bad discriminative performance. For example, the study of [8] used a linear score, and their discoveries led to the development of Mammaprint, a diagnostic kit for breast cancer metastasis.

Unsupervised learning methods can also yield beneficial insights in high-dimensional data analysis. For example, [9] used biclustering to reveal more detailed subtypes in breast carcinomas with distinctive gene expression profiles from a group that was previously regarded as a homogenous unit. Their study revealed that in order to understand biodiversity, the heterogeneous structure of the targeted population must be considered, and that such heterogeneity can be clarified by the clustering method. Previously published reviews have described both clustering methods [10] and biclustering algorithms [11].

Several studies have combined supervised and unsupervised learning methods. For example, [12] used clustering to discover different patterns of gene expression in different subgroups. They then derived the respective scores for these groups and achieved good specificity without loss of sensitivity relative to existing diagnostic rules. Sample heterogeneity may result in marker heterogeneity. As a result, different samples in different subgroups may have different intrinsic characteristics in their environmental and genetic factors as demonstrated by the motivated example in the “Methods” section. Such heterogeneity may have unexpected effects on a therapy or treatment which is considered as best practice, and lead to an unfavorable risk in one part of the population. Bravo et al. [13] focused on the marker heterogeneity by detecting the genes that showed different variation between healthy and disease samples. They then defined an anti-profile score as the number of hyper-variable genes. Thus, more and more studies have considered heterogeneous structure and reflected this heterogeneity in their predictions. However, the risk scores highlighted by published papers are linear, and heterogeneity is therefore not directly reflected in the score form. In this study, we focused on heterogeneity and determined how to directly reflect this intrinsic characteristic in the score form. We developed the quasi-linear score as a result, which combines linear scores as a Kolmogorov-Nagumo average [14, 15], enabling us to reflect the clustering result naturally, because it is based on separated feature vectors.

The rest of this paper is organized as follows. In the “Methods” section, we first present a motivated example of gene expression data and develop the quasi-linear score. Heterogeneity is observed via the clustering method, and we define the quasi-linear score to reflect gene clusters with a generalized average form. We also formulate the quasi-linear logistic model and discuss the difference between the linear and quasi-linear scores. We subsequently evaluated our method by numerical simulations and applications to real datasets. We refer to the relationship between the quasi-linear score and traditional combined approaches in “Discussion” section. All technical details given as Appendix are available in Additional file 1.


Motivation and derivation

We studied the gene expression dataset from [8]. This dataset is derived from 51 non-metastatic and 46 metastatic breast cancer patients. In their study, the linear score was evaluated to discriminate metastatic events. Because estimation of the predictive linear score is often achieved by a diagonal Fisher’s linear discriminant analysis (DLDA) [16], we considered applying DLDA to this dataset. Because the coefficients of the linear score estimated by DLDA correspond to the t-statistic values, we checked the t-statistics directly for the purpose of visualization. If the data have heterogeneous structure, it can be clarified by observing the difference between two divided, independent datasets. Therefore, we divided the full data into two independent sets, data1 and data2, before calculating the t-statistics for each of them separately. Figure 1 shows the correspondence of the t-statistics. Some genes had no consistency in the signs of their t-values, indicating that some samples from the metastatic group had higher expression, whereas other samples had lower expression, relative to the non-metastatic group. This phenomenon may be caused by heterogeneous factors [17]. In fact, due to the existence of multiple subtypes of breast cancer, this disease is known to exhibit heterogeneity [9]. For such heterogeneous data, clustering methods should work well, as shown in [9]. We applied clustering according to the Ward’s method [18], as shown in Fig. 2, which highlights the results of clustering and the correlation matrix arranged by the clustered genes. Although biclustering result was not suggestive of the heterogeneity in appearance, it was observed via the correlation matrix. Some genes are strongly correlated with others in the same cluster. Thus, we observed the existence of heterogeneity using a t-statistics plot and trends in the expression patterns by clustering. Next, we developed an appropriate score form for discriminating such heterogeneous data based on clustering.

Fig. 1
figure 1

t-statistic values for two datasets from van’t Veer et al. (2002). The red points show the genes with sign mismatched t-values for these data

Fig. 2
figure 2

The hierarchical clustering and the correlation matrix of 70 genes for the dataset from van’t Veer et al. (2002). The figure shows the clustering result (upper) and the correlation matrix (lower). There are 70 rows representing genes and 78 columns representing samples (upper) and the gene expression data ranging from green (negative) to red (positive) are displayed. Elements of the correlation matrix (lower) ranging from blue (negative) to yellow (positive) are displayed

We assume to know decomposition of p biomarkers into K groups by clustering. Based on these K sets of clustered markers, we define a quasi-linear score as

$$\begin{array}{@{}rcl@{}} Q=\text{log}\left(\sum_{k=1}^{K} \text{exp}(L_{k})\right), \end{array} $$

where \(L_{k}=\alpha _{k}+\beta _{k}^{\top } X_{(k)}\) with the parameters α k ,β k , and the marker vector X (k) for the k-th cluster of k=1,2,,K. When K=1, the quasi-linear score Q is reduced to the linear score,

$$\begin{array}{@{}rcl@{}} L=\alpha+\beta^{\top} X, \end{array} $$

where α and \(\beta =\left (\beta _{1}^{\top },\cdots,\beta _{K}^{\top }\right)^{\top }\) are parameters, and X is the full vector of X (1),,X (K). We note that the intercepts α k ’s in (1) are reduced to the single intercept α in (2). Additional file 1: Appendix A gives another parameterization for Q in which the intercepts α k are uniquely decomposed to the overall intercept and weights of the K clusters.

When determining how to reflect cluster information in the score form, we had two main considerations: which scores should be integrated, and how integration should be performed. All the linear scores L k are integrated in the quasi-linear score Q. We believe that this is reasonable because there are similar markers in each cluster, and we expect that heterogeneity would be caused by different mixed homogeneous features that are sufficiently described by the linear form. Although such an idea of combining the linear scores has already been proposed by [19] as a composite link, it is different from the quasi-linear score in several ways. One of the most significant differences between them is that the quasi-linear score is defined by disjointed sets of markers. This results in a small number of parameters for the predictive score: the parsimonious expression. The difference in these forms is mentioned in the Discussion. Moreover, the quasi-linear score Q summarizes L k approximating the maximum function. In fact, (1) is equal to a soft maximum function discussed by [20], which can be approximated with

$$\begin{array}{@{}rcl@{}} M=\max_{1 \leq k \leq K}L_{k}. \end{array} $$

Therefore, the quasi-linear score Q respects the maximum of K linear scores from all clusters. See [21] for a discussion of Eq. (3) as maxout in neural networks.

The relationships among Q, L, and M are clearly evaluated when a tuning parameter, τ, is introduced in the quasi-linear score Q as

$$\begin{array}{@{}rcl@{}} Q_{\tau}=\frac{1}{\tau}\text{log}\left(\sum_{k=1}^{K} \text{exp}(\tau L_{k})\right), \end{array} $$

where τ is a positive parameter. If L k is fixed for all k, then the form of (4) is defined solely by the tuning parameter τ. When τ is equal to 1, Q τ is equal to the quasi-linear score Q by definition. When τ goes to infinity, Q τ is simply the maximum score M. When τ goes to 0, Q τ is equivalent to the linear score L. Thus, these are unified by the hardness of approximation to the maximum function. More details are provided in Additional file 1: Appendix B. The characteristics of the quasi-linear score Q are understood by a more general expression of (1) : \(G=\phi ^{-1} (\sum _{k=1}^{K} \phi (L_{k}))\), where ϕ is an invertible function. We define G as a generalized quasi-linear score because the form is the generalized mean called the Kolmogorov-Nagumo average. If we take a simple average, ϕ(z)=z, then the generalized quasi-linear score G corresponds to the linear score L. In this sense, the linear score L is a simple mean, and the quasi-linear score Q is a generalized mean of linear scores L k averaged by the exponential function. Although the simplest integration of clustered information is achieved by a simple average, resulting in the linear score form, it is intuitively unsatisfying because the predictive performance of these linear scores L k differs among the clusters. A cluster that strongly discriminates the outcome on its own should be respected in comparison with the other clusters. If only the cluster with the highest linear score is reflected in the prediction, it is described by the maximum score M. However, this situation is still not ideal, because only one cluster is reflected in the prediction, and form (3) is difficult to handle mathematically because it is not differentiable when two or more linear scores are equal. Consequently, parameter estimation becomes impossible. The quasi-linear score Q is therefore naturally derived, and it is reasonable for discriminant analysis of the heterogeneous data because the quasi-linear score Q plays an important role in cluster selection, as discussed later.

In the following sections, we let ϕ(z)=exp(z), both because this form is approximated by the maximum function, and because it is optimal in the sense of Bayes risk consistency when we consider the simple case in which the label conditional random variables follow a mixture of normal distribution and a normal distribution with equal variance, respectively. Additional file 1: Appendix C provides more detail about the Bayes risk consistency of the situation. Moreover, the exponential function gives us an understandable interpretation of the parameter estimation.

Because we modify only the scoring form, the quasi-linear score Q can be applied to all traditional settings in biostatistics, as the generalized linear model with the L1 and L2 shrinkage methods. In particular, when we combine the quasi-linear score Q with lasso shrinkage, the important clusters and variables in each cluster are determined simultaneously because of soft maximum property and L1 sparseness. This property provides good performance for the discriminant problem when the data have a much larger number of correlated markers than the number of samples. Therefore, we derive the L1 and L2 shrinkage quasi-linear logistic model and display the performance of the quasi-linear score Q when it is applied to gene expression data.

Likelihood for logistic model and maximum likelihood estimation

Consider the data {(X i ,Y i );i=1,,n}, where X i is a covariate vector and Y i is a dichotomous outcome which takes 0 or 1 with the i-th individual. Assume that we know the decomposition of X i as X i(1),,X i(K) with a fixed cluster size K, and that this is identical among individuals. We denote the size of X i(k) as p k , where \(\sum _{k=1}^{K} p_{k}=p\). We note that the decomposition is given a priori by one of clustering methods for {X i ;i=1,,n}.

We derive the likelihood for a logistic model of the quasi-linear score because of versatility. Therefore, we assume that Y i is independently distributed according to the Bernoulli distribution with a parameter π i , and consider the logistic model. Below, we denote the quasi-linear score Q based on X i(1),,X i(K) as Q i for simplicity. The association between π i and the quasi-linear score Q i is described by

$$\begin{array}{@{}rcl@{}} \text{log}\frac{\pi_{i}}{1-\pi_{i}}=Q_{i}. \end{array} $$

In this setting, the unknown parameters are {α k ,β k ;k=1,,K} which specify Q i ’s over individuals as in Eq. (1). The log-likelihood function of parameter \(\theta =\left (\alpha _{1},\cdots,\alpha _{K}, \beta _{1}^{\top }, \cdots, \beta _{K}^{\top }\right)^{\top }\) is

$$\begin{array}{@{}rcl@{}} l(\theta)&=&\sum_{i=1}^{n} Y_{i} Q_{i}-\text{log}(1+\text{exp}(Q_{i})). \end{array} $$

The maximum likelihood estimator (MLE) of θ is therefore the solution of

$$\begin{array}{@{}rcl@{}} \frac{\partial l(\theta)}{\partial\theta}=W^{\top}(Y-\Pi), \end{array} $$

where W=( Q 1/ θ,, Q n / θ), Y=(Y 1,,Y n ) and Π=(π 1,,π n ). The solution is calculated by updating some initial value repeatedly by Fisher’s scoring method as

$$ \theta^{(t+1)}=\theta^{(t)}+\left(W^{(t)\top} V^{(t)} W^{(t)} \right)^{-1} W^{(t)\top}\left(Y-\Pi^{(t)}\right), $$

where \(W^{(t)}=\left (\partial Q_{1} / \partial \theta,\cdots,\partial Q_{n} / \partial \theta \right)^{\top } |_{\theta =\theta ^{(t)}}\), \(V^{(t)}=\text {diag}\left \{\pi _{1}^{(t)}\left (1-\pi _{1}^{(t)}\right),\cdots,\pi _{n}^{(t)}\left (1-\pi _{n}^{(t)}\right) \right \}\) and \(\Pi ^{(t)}=\left (\pi _{1}^{(t)},\cdots,\pi _{n}^{(t)}\right)^{\top }\). The R source code of the parameter estimation of the quasi-linear logistic model is available in Additional file 2. In the framework of a generalized linear model, Z (t)=W (t) θ (t)+V (t) −1(YΠ (t)) is called the working response, and this algorithm is referred to as the iteratively reweighted least-square method [22] because the Eq. (8) is written as θ (t+1)=(W (t) V (t) W (t))−1 W (t) V (t) Z (t). Thus, the parameter estimation strategy is very similar to the linear-logistic model. However, the estimation is not stable in a high dimensional setting. In such situation, W (t) V (t) W (t) becomes a singular matrix. It is thus difficult to compute the inverse matrix in Eq. (8) for each step. We can avoid the problem by regularization method, just as for the penalized linear logistic model [23, 24].

L1 and L2 regularization of the quasi-linear logistic model

The L2 penalized log-likelihood is described by

$$ l^{\text{ridge}}(\theta,\lambda) = l(\theta)-\frac{1}{2}\lambda_{0} \sum_{k=1}^{K} \alpha_{k}^{2}-\frac{1}{2}\sum_{k=1}^{K} \lambda_{k} \beta_{k}^{\top} \beta_{k}. $$

We note that we regularized α k ’s by λ 0 to avoid computational difficulty in calculating the inverse matrix, although the intercept parameters should not be regularized in the linear logistic model.

A MLE with the ridge regularization of θ is calculated by Fisher’s scoring method as

$$\begin{array}{@{}rcl@{}} \theta^{(t+1)}&=& \left(W^{(t)\top} V^{(t)} W^{(t)}+R \right)^{-1} W^{(t)\top} V^{(t)}\\ &&\times\left\{W^{(t)\top}\theta^{(t)}+{V^{(t)}}^{-1}\left(Y-\Pi^{(t)}\right)\right\}. \end{array} $$

Here \(R=\text {diag}(\lambda _{0} I_{K},\lambda _{1} I_{p_{1}},\cdots,\lambda _{K} I_{p_{K}})\phantom {\dot {i}\!}\), where I m denotes the identity matrix with size m. The derivation of the algorithm is described in Additional file 1: Appendix D in greater detail.

Next we consider L1 regularization for the quasi-linear logistic model. The L1 penalized log-likelihood is given by

$$\begin{array}{@{}rcl@{}} l^{\text{lasso}}(\theta,\lambda) &=& l(\theta)-\sum_{k=1}^{K} \lambda_{k} |\beta_{k}|. \end{array} $$

This form is compatible with the group lasso [25]. We note that the group lasso has a very similar concept in that regularizations are performed for each cluster. However, the score forms are different between the two regularization methods. The comparison of group lasso and quasi-linear score are performed in the “Application” subsection of the “Results”. For the quasi-linear score Q, it is computationally difficult to solve the problem of maximization with (11) by a method that involves the inverse matrix. Therefore, we applied the gradient ascent method of [26] by using the directional derivative, which is a simple gradient ascent algorithm based on the components of a score function:

$$ \theta^{(t+1)}=\theta^{(t)}+\text{min} \left\{t_{\text{opt}}\left(\theta^{(t)}\right), t_{\text{edge}}\left(\theta^{(t)}\right)\right\}g\left(\theta^{(t)}\right), $$

where g(θ)=(g 1(θ),,g p+K (θ)),

$$\begin{aligned} t_{\text{edge}}(\theta)=\min_{1+K \leq j \leq p+K}\left(-\frac{\theta_{j}}{g_{j} (\theta)} : \text{sign}(\theta_{j})=-\text{sign}(g_{j}(\theta))\neq 0\right) \end{aligned} $$


$$t_{\text{opt}}(\theta)=\frac{|g(\theta)|}{g(\theta)^{\top} \frac{\partial^{2} l(\theta)}{\partial \theta \partial \theta^{\top}}g(\theta)}. $$

Here g j (θ)=l j (θ) for j=1,,K and

$$\begin{array}{@{}rcl@{}} g_{j}(\theta)= \left\{\begin{array}{ll} l_{j}(\theta)-\lambda_{k} \text{sign}(\theta_{j}) & \text{if}\ \theta_{j} \neq 0 \\ l_{j}(\theta)-\lambda_{k} \text{sign}(l_{j}(\theta)) & \text{if}\ \theta_{j}=0 \ \ \text{and}\ |l_{j}(\theta)|>\lambda_{k} \\ 0 & \text{otherwise} \end{array}\right. \end{array} $$

for j=K+1,,p+K, where sign(z) is a sign function, l j is the j-th component of Eq. (7) and k denotes the cluster number the j-th marker belongs to. In each step, the t opt provides the optimal solution of the gradient descent algorithm and t edge controls the direction of the gradient so as to avoid changing the signs of the parameters. The vector of the tuning parameters (λ 1,,λ K ) is determined by a cross-validation method from candidate sets of parameters.

Non-linearity of the quasi-linear score

The quasi-linear score Q is non-linear by definition. The non-linearity of the quasi-linear score Q can be demonstrated by a simple illustration. Figure 3 shows the fitted curve of Q when p=k=2. In this figure, it looks as if two linear planes, specialized to each sub-space, are connected smoothly. In this case, the linear surface is curved while still maintaining local linearity, thus forming a quasi-linear surface. As an extreme case, let there be only one cluster with strong markers. When all scores are integrated, the information from this cluster should not be affected by the others. The quasi-linear score Q makes up this nature because this approximates the maximum function. If there is an ,1≤K such that

$$\begin{array}{@{}rcl@{}} L_{\ell} \gg L_{k} \end{array} $$
Fig. 3
figure 3

The boundary (upper) and contour (lower) of the quasi-linear score. Left and center; F(x 1,x 2)=log(exp(1+x 1)+exp(1+x 2)), the right; G(x 1,x 2)=max(1+x 1,1+x 2). The center panel is an expansion around the origin (0,0) of the left panel

for k, then \(\sum _{k=1}^{K} \exp (L_{k})\approx \exp (L_{\ell })\), so that Q almost equals L and the score is almost evaluated by the -th cluster. In such a case, the quasi-linear score Q achieves the cluster selection. In the numerical sense, even if the inequality (13) is not very evident, selection is considered to be achieved because the exponential function inflates the input sufficiently. For example, log{exp(5)+ exp(2)+ exp(−1)+ exp(−4)}=5.051, which essentially means that only the first term is reflected in the construction of the quasi-linear score Q. Accordingly, QL if X is in a set {X:L =M}, say C . We note that C is expressed by the intersection of K−1 half planes, such that C is a convex polyhedron. Thus the quasi-linear score Q is locally linear over disjointed and exhaustive regions of the space of all biomarkers : \(\bigcup _{\ell =1}^{K} C_{\ell }\). Thus we observe that the quasi-linear score Q is approximately equal to the linear score L that dominates over the other K−1 scores. This property contrasts with the ordinary linear score, which is the sum of K linear scores. In particular, the quasi-linear score Q is advantageous in cases where there are predominant sets of separate biomarkers within the space of all biomarkers.

Also, for both logistic models in the parameter estimation steps, we can see the difference between the linear and quasi-linear models reflected in the derivative term as:

$$\begin{array}{@{}rcl@{}} \frac{\partial l^{L}\left(\theta^{L}\right)}{\partial \theta^{L}}&=&\left(1, X^{\top}\right)^{\top}, \end{array} $$
$$\begin{array}{@{}rcl@{}} \frac{\partial l(\theta)}{\partial \theta}&=&\left(S_{1},\cdots,S_{K},S_{1} X_{(1)}^{\top},\cdots,S_{K} X_{(K)}^{\top}\right)^{\top}, \end{array} $$

where l L(θ L) is a log-likelihood function of the linear logistic model with parameter θ L=(α,β ) and \(S_{k}=\text {exp}(L_{k})/\sum _{k=1}^{K} \text {exp}(L_{k})\). A derivation of Eq. (15) is given in Additional file 1: Appendix E. Thus, the data space is decomposed by updated S k and composed as one unit in each learning step. This concept used in probabilistic models is referred to as the divide and conquer strategy, which is employed in many machine-learning studies as a mixture of expert models [27].


Simulation study

We examined the efficiency of the quasi-linear score Q using logistic models (QL), compared with the linear score L using logistic model (LL). We conducted simulations with five different settings. For each dataset, the samples were divided between the disease group (Y=1) and normal group (Y=0).

First, to show the consistency of the quasi-linear logistic model without regularization, we used a simple setting that has an optimal solution of the quasi-linear form. In this example, we simulated 1000 random datasets. Each dataset was either small, containing 400 samples, or large, containing 1600 samples. Next, we estimated the parameters using Eq. (8) and checked the consistency.

Second, we examined four high dimensional settings focusing on marker’s selection. The divided populations were considered to have homogeneous or heterogeneous structure, which were described by normal or mixed normal distribution. In these examples, we simulated 1000 random datasets, each containing either 400 or 200 samples for training and test datasets, respectively. For these settings, we use the L1 and L2 shrinkage method in order to avoid overfitting and hard computation. Below, we define \(\boldsymbol {r}_{p}=(r,r,\cdots,r) \in \mathbb {R}^{p}\) for the simple notations.


In this example, we assumed normality for the normal group and mixture normality for the disease group.

$$\begin{array}{@{}rcl@{}} X|(Y=0) &\sim& \mathrm{N}\left(\boldsymbol{0}_{2}^{\top}, I_{2}\right), \quad X|(Y=1)\\ &\sim& \sum\limits_{g=1}^{2} \tau_{g} \mathrm{N}\left(\mu_{1g}, I_{2}\right),\quad\sum\limits_{g=1}^{2} \tau_{g} =1. \end{array} $$

We let μ 11=(−1,0) and μ 12=(0,1.5). In this setting, the Bayes optimal form is log(exp(α 1+β 1 X 1)+exp(α 2+β 2 X 2)). Figure 4 shows box plots of estimated parameters for 1000 trials. The optimal parameter derived from the true likelihood is (α 1,α 2,β 1,β 2)=(−1.19,−1.82,−1.00,1.50). The means of the estimated parameters from 1000 trials are (α 1,α 2,β 1,β 2)=(−1.28,−1.99,−1.07,1.61) for the small datasets and (α 1,α 2,β 1,β 2)=(−1.21,−1.85,−1.01,1.53) for the large datasets. We observed that parameter estimation was more precise when the sample size was large, and that the estimated parameters were consistent.

Fig. 4
figure 4

Box plot of the estimated parameters in the simple simulation. Left: small sample size setting (n=400); right: large sample size setting (n=1600). The red lines show the optimal parameters derived from the true likelihood

High dimensional settings

  • (a): homo-homo

    In this example we assumed normality for both groups.

    $$\begin{array}{@{}rcl@{}} X|(Y=y) \sim \mathrm{N}\left(\mu_{y}, I_{p}\right) \quad (y=0,1). \end{array} $$

    We had three settings: (1) \(p=2,\mu _{0} =\boldsymbol {0}_{2}^{\top },\mu _{1}=\boldsymbol {1}_{2}^{\top }\), (2) \(p=100, \mu _{0} = \boldsymbol {0}_{100}^{\top },\mu _{1}=\boldsymbol {0.1}_{100}^{\top }\), (3) \(p=100, \mu _{0} = \boldsymbol {0}_{100}^{\top },\mu _{1}= \boldsymbol {0.5}_{100}^{\top }\). For the quasi linear score Q, we assumed the misspecification of heterogeneous structure, as K=2 and p 1=p 2=1 for (1) or p 1=p 2=50 for (2) and (3).

  • (b): homo-hetero

    In this example, we assumed normality for the normal group and mixed normality for the disease group.

    $$\begin{array}{@{}rcl@{}} X|(Y=0) &\sim& \mathrm{N}(\mu_{0}, I_{p}), \quad X|(Y=1)\\ &\sim& \sum_{g=1}^{G} \tau_{g} \mathrm{N}(\mu_{1g}, I_{p}), \quad \sum_{g=1}^{G} \tau_{g} =1. \end{array} $$

    We had four settings. In (1) and (2), we let G=2, p=100, τ 1=τ 2=0.5, \(\mu _{0}=\boldsymbol {0}_{100}^{\top }\). In (3) and (4), we let G=3, p=100, τ 1=τ 2=τ 3=1/3, \(\mu _{0}=\boldsymbol {0}_{100}^{\top }\). The mean parameter for the disease group was set as (1) μ 11=(−1,0 99), μ 12=(0 50,1.5,0 49), (2) μ 11=(−1 10,0 90), μ 12=(0 50,1.5 10,0 40), (3) μ 11=(−1.5,0 99), (4) μ 11=(−1.5 3,0 97), μ 12=(0 34,1.5 3,0 63), μ 13=(0 67,1 3,0 30). For the quasi-linear score Q we assumed the correct specification of heterogeneous structure as K=G and p 1=p 2=50 or p 1=34,p 2=p 3=33.

  • (c): hetero-hetero

    In this example, we assumed mixed normality for both groups.

    $$\begin{array}{@{}rcl@{}} X|(Y=y) &\sim& \sum_{g=1}^{G} \tau_{yg} \mathrm{N}(\mu_{yg}, I_{p}),\\ &&\sum_{g=1}^{G} \tau_{yg} =1\ \ (y=0,1). \end{array} $$

    We used the following settings: G=2, p=100, τ yg =0.5 (y=0,1, g=1,2), \(\mu _{01}=\boldsymbol {0}_{100}^{\top }\), μ 02=(0 50,0.3 10,0 40), μ 11=(0.5 50,0 50), μ 12=(0 50,0.8 50). For the quasi-linear score Q we assumed to specify there are heterogeneous structure as K=2 and p 1=p 2=50.

  • (d): correlated

    In this example, we assumed normality for the normal group and mixed normality for the disease group.

    $$\begin{array}{@{}rcl@{}} X|(Y=0) &\sim& \mathrm{N}(\mu_{0}, \Sigma), \quad X|(Y=1)\\ &\sim& \sum_{g=1}^{G} \tau_{g} \mathrm{N}(\mu_{1g}, \Sigma), \quad \sum_{g=1}^{G} \tau_{g} =1. \end{array} $$

    The variance assumption was based on a real dataset, as shown in Fig. 2. We used the following settings: (1) \(G\,=\,2, p\,=\,70, \tau _{1}=\tau _{2}=0.5, \mu _{0}=\boldsymbol {0}_{70}^{\top }, \mu _{11}=(-\boldsymbol {0.5}_{5},\boldsymbol {0}_{65})^{\top }\), μ 12=(0 35,1 5,0 30), \(\Sigma = \left (\begin {array}{cc} \Sigma _{1} & \Sigma _{2} \\ \Sigma _{2}^{\top } & \Sigma _{1} \end {array}\right)\), where Σ 1=0.7I 35+0.3J 35,Σ 2=−0.15J 35, where J m is a matrix of size m of which all components are 1. For the quasi-linear score Q we assumed to specify there are heterogeneous structure as K=2 and p 1=p 2=50.

Table 1 (a) summarizes the AUC value of the test datasets for the (a) settings. We note that the linear score L is optimal, in terms of the likelihood ratio, under this assumption. However, the quasi-linear score Q is not less than the simple linear score L regardless of the misspecified structure. This is because the quasi-linear score Q includes the local linear boundary, and almost of all data points are fitted to it. As a result, the predictions based on the quasi-linear score Q were not so mismatched. Table 1 (b) summarizes the AUC values of the test datasets for the (b) settings. We note that the quasi-linear score Q is Bayes-optimal under this assumption. Unlike in a situation that involves checking for consistency, the quasi-linear score Q succeeded in making a difference in performance relative to the ordinary linear score L. As the numbers of effective explanatory valuables increased, the difference in predictive performance between the quasi-linear and linear scores also grew. In these settings, the L1 shrinkage method performed well, because the number of effective explanatory variables was small compared to the number of noisy variables. Table 1 (c) summarizes the AUC value of test datasets for the (c) setting. When we assumed normal heterogeneity for both groups, the optimum form of the score was no longer simple, and differs from the linear and the quasi-linear forms. However, the quasi-linear score Q also worked well in this setting. This result indicates that the quasi-linear score Q should have good predictive performance relative to the linear score L in complex heterogeneous settings like real datasets. Table 1 (d) summarizes the AUC value of the test datasets for the (d) setting. The quasi-linear score Q also worked well in this setting.

Table 1 Estimated AUC (standard deviation) of 1000 repetitions


We applied our method for two datasets, namely breast cancer and prostate cancer data. For both types of datasets, two independent datasets were used as training and testing to evaluate the predictive ability by test AUC. First, we compared the test AUC among decision tree (DT), random forest (RF), support vector machine (SVM), naive Bayes (NB), group lasso (GL), neural network (NN), L1 or L2 penalized linear logistic (LL1, LL2) and L1 or L2 penalized quasi-linear logistic (QL1, QL2). Performance was evaluated by the test AUC and the 95% CIs of the test AUC based on 2000 bootstrapping sampling, as described in [28]. The tuning parameters were determined with a grid search and resampling method as needed. Second, the stability for marker selection was compared among LL1, QL1 and GL. We used a similarity index proposed by [29] defined by S(A,B)=|AB|/|AB|, where A and B are subsets of marker index set, and |A| is a cardinality of the set A. S takes a value between 0 and 1 whose high value means high stability. We evaluated the stability measure by \(\frac {2}{R(R-1)}\sum _{i=1}^{R-1}\sum _{j=i+1}^{R} S(M_{i},M_{j})\), where M 1,,M R are sets of the selected marker for R bootstrap sample sets from the training data set. R was set to 100 below.

Breast cancer data

The training dataset was taken from [8] and the test datasets were from [30]. Yan et al. [28] used these datasets and compared the AUCs by the linear score L, which they evaluated by traditional methods as well as methods they proposed. We focused on the 70 genes detected by [8], as in [28]. These datasets include 78 patients in one and 307 patients in the other. For QL, grouping of 70 genes was based on the Ward’s clustering method only by training dataset. We had two options for dividing all the genes into clusters. In the first option, the 70 genes were divided into two clusters, one with 36 and the other with 34 genes. In the second option, the 70 genes were divided into three clusters of 36, 16, and 18 genes. For GL, We used two clusters option.

Figure 5 displays the estimated AUC for the test dataset. QL1 and QL2 performed better than LL1, LL2 and any other non-linear methods, and the highest test AUC was obtained when we used QL1 based on two clusters. The test AUC of the quasi-linear score did not change for different cluster sizes (K=2 and K=3). The numbers of selected markers in LL1, QL1 (K=2), QL1 (K=3) and GL were 14, 14, 24 and 70, respectively. Similarly, the stability measures were 0.323, 0.320, 0.399 and 0.960, respectively. The stability did not differ between LL1 and QL1 greatly. We note that GL almost did not shrink any coefficients to zero in this setting.

Fig. 5
figure 5

Box plots of the test AUC for all comparative methods by breast cancer data

When we use the linear score L, the absolute value of the coefficients of each marker reflects the order of importance of all markers for prediction. Therefore, the linear score is understandable in the sense that we can recognize strong markers. This is no longer a consideration when we use a generalized non-linear score. However, the quasi-linear score enables us to compare coefficients within the same cluster. An example is shown in Fig. 6, which displays the ranking of the absolute values of the estimated coefficients by the ridge regularization method based on the existence of two clusters. The gene labels are arranged in order of the rankings. We observed that Q and L gave quite different rankings. This result shows that the quasi-linear score would produce different interpretations for the relationship between the markers.

Fig. 6
figure 6

Ranking of the absolute values of the coefficients within the cluster with ridge regularization

Figure 7 shows learning and fitting of the quasi-linear score Q using the lasso regularization method. The score distributions in the training and test datasets were quite well-matched. Figure 7 shows that the quasi-linear score of two clusters with L1 regularization will work well if we give a cut-off value for binary decisions. For example, the test error rates of Q and L were 37.8% and 45.0%, respectively, when we used the Youden-index [31].

Fig. 7
figure 7

Learning and fitted plot for the training and test dataset when using the quasi-linear score of two clusters with lasso regularization. The horizontal and vertical axis are the linear scores of the first and second cluster. Red points indicate the metastatic group and black points indicate the control group. Curve lines are contours of the quasi-linear score and blue line shows cut-off value based on Youden-index

Although the quasi-linear score Q is approximately equivalent to the maximum function, the two are numerically different. In fact, the test AUC of the quasi-linear score with the lasso regularization method when we assumed two clusters was 0.752, and the corresponding maximum score M is 0.745, so that the smooth non-linearity of the quasi-linear form produced good predictive performance

The elastic net shrinkage method [32], which combines the lasso and ridge shrinkage methods, is among the most frequently used. When we combined the quasi-linear score and the elastic net regularization, the number of the tuning parameters was inflated. Although we used the elastic net experimentally for the application for some selected parameters, the predictive performance was not significantly different from the performance obtained with either the lasso or ridge. Detailed results are summarized in Table 2. Moreover, to check the utility of the unsupervised clustering, we randomly divided the 70 genes into two subsets of 36 and 34 genes, and applied QL2 for the test dataset (2000 times). Figure 8 shows that clustered subsets (red line) performs better than randomly divided subsets. Thus, unsupervised clustering naturally benefits supervised learning via the quasi-linear form.

Fig. 8
figure 8

Test AUCs by the quasi-linear score for the dataset from Buyse et al. (2006). The score is learning by randomly divided genes subsets for the dataset from van’t Veer et al (2002). The red line is the test AUC by the quasi-linear score, which consists of subsets of genes clustered by unsupervised learning

Table 2 Estimated AUC (95% confidence interval) by elastic net shrinkage; training dataset from [8], test dataset from [30]

Prostate cancer data

The data set was taken from [33] which contains expression data for 6144 genes obtained from 455 prostate cancer tumors. The tumors were from 103 subjects determined to be fusion status-positive and 352 subjects determined to be fusion status-negative. We randomly divided the whole dataset into two independent datasets with the same number of tumor samples (training and test data) while maintaining the ratio of positive to negative statuses. First, we selected 100 relevant genes which had top 100 absolute value of t-statistic between the two statuses using only the training dataset. Such marker preselection has been performed in many studies [34]. For QL, grouping of 100 genes was based on the Ward’s clustering method only by training dataset. We had two options for dividing all the genes into clusters. In the first option, the 100 genes were divided into two clusters, one with 81 and the other with 19 genes. In the second option, the 100 genes were divided into three clusters of 25, 56, and 19 genes. For GL, We used two clusters option. We then compared the test AUC among all comparative methods. Figure 9 displays the estimated AUC for the test dataset. As well as the application for breast cancer data, QL1 and QL2 performed better than any other comparative methods. The numbers of selected markers in LL1, QL1 (K=2), QL1 (K=3) and GL were 31, 38, 67 and 100, respectively. Similarly, the stability measures were 0.361, 0.993, 0.982 and 1.00, respectively. The stability of QL1 was higher than LL1. We note that GL almost did not shrink any coefficients to zero as application for breast cancer data set.

Fig. 9
figure 9

Box plots of the test AUC for all comparative methods by prostate cancer data


We focused on heterogeneous structure and determined how to reflect such heterogeneity in the score function defined in (1). For this purpose, the quasi-linear score was derived as the generalized mean called the Kolmogorov-Nagumo average. The quasi-linear form is also called a soft maximum function or log-sum-exp function [35]. In machine learning, the softmax function is often used as a differentiable approximation of the maximum. In computer science, the log-sum-exp function is used to avoid computational problems such as overflow. The non-linearity of the quasi-linear score is explained by the soft maximum function. The quasi-linear score achieves cluster selection because of the soft maximum property as discussed in the subsection of “Non-linearity of the quasi-linear score”. This formulation does not require any prior information or assumption to separate markers into clusters, because this is achieved by the unsupervised learning step.

The quasi-linear score is based on the idea of combining predictors, which is related to several ideas in the literature. For example, a mixture of expert models suggest the idea of decomposing input space [27], in which the model divides the problem space probabilistically and the scores learned in all sub-spaces are combined. The quasi-linear score utilizes the information given by the clustering method to reflect the heterogeneity of markers and combines the linear scores of all clusters. Hence, it relies on the disjointed decomposition of the markers. The method of combining linear scores was also discussed in [19], known as composite links, which assumes that the score is formed by a weighted sum of block-wise markers. Unlike the generalized linear model, the composite link model does not restrict the use of the single link function. In a special case, the composite link logistic model corresponds to the quasi-linear logistic model. However, these ideas differ in that the composite link considers the sum of the linked linear scores whereas the quasi-linear score considers a linkage of the summarization of linear scores in all clusters. The key in our proposal is to model heterogeneity using information from the clustering method, thereby connecting supervised learning with unsupervised learning without any assumption via a change in the score form from the simple average to the Kolmogorov-Nagumo average.

For future work, we intend to extend some fixed settings presented in this report. These include the choice of the clustering method, the size of the markers and clusters, the set of tuning parameters, the type of outcomes, and the format of the targeted data. Because the quasi-linear score can be defined by any decomposition ideas, the performance should be evaluated by clustering methods other than Ward’s method, such as the k-means method [36]. Moreover, we need to investigate the sizes of markers and clusters, and the number of candidate sets of tuning parameters in addition to the parameter τ in (4), to obtain a more flexible form of the quasi-linear function. Although we applied and evaluated the proposed method after marker preselection in Application, the performance should be evaluated in much higher dimensional setting. An especially big concern is how to decide the cluster size for the quasi-linear score. As described in the “Application” subsection, the quasi-linear score by cluster size 2 gave the best predictive performance for breast cancer data, and adding more clusters yielded no improvement. Figure 2 supports this result because whole markers were divided into two primary clusters. It is necessary to develop an objective index of definite cluster size selection for general applications.

The quasi-linear score would be also applicable in a case of the continuous outcomes and in a regression model, although we focused on binary outcomes and the logistic model in this study. The performance of the quasi-linear score would be exhibited in the mixed large dataset, which would play an important role in biomedical studies in the near future, because such data must be heterogeneous. Furthermore, our method is not limited to biomedical data, and could also be beneficial for analyzing any data that have heterogeneous structure.


In this paper, we focused on heterogeneous structure. Such heterogeneity was captured well by a clustering method. The quasi-linear score was naturally derived by Bayes risk consistency between mixed and standard normal distributions. Moreover, the quasi-linear score approximates the maximum function and plays an important role in selecting the most effective cluster for prediction from given clusters. The quasi-linear score has better predictive ability compared to linear score as shown in simulation studies and applications to real data.



Area under curve


Diagonal Fisher’s linear discriminant analysis


maximum likelihood estimator


  1. Elsebakhi E, Lee F, Schendel E, Haque A, Kathireason N, Pathare T, et al. Large-scale machine learning based on functional networks for biomedical big data with high performance computing platforms. J Comput Sci. 2015; 11:69–81.

    Article  Google Scholar 

  2. Li Y. Big biological data: Challenges and opportunities. Genomics Proteomics Bioinforma. 2014; 12:187–9.

    Article  Google Scholar 

  3. Yun T, Yi GS. Biclustering for the comprehensive search of correlated gene expression patterns using clustered seed expansion. BMC Genomics. 2013; 14:144.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Lu W, Zhang HH, Zend D. Variable selection for optimal treatment decision. Stat Methods Med Res. 2013; 22:493–504.

    Article  PubMed  Google Scholar 

  5. Foster KR, koprowski R, Skufca JD. Machine learning, medical diagnosis, and biomedical engineering research - commentary. Biomed Eng Online. 2014; 13:94.

    Article  PubMed  PubMed Central  Google Scholar 

  6. Brimacombe M. High-dimensional data and linear models: a review. Open Access Med Stat. 2014; 4:17–27.

    Article  Google Scholar 

  7. Oghabian A, Kilpinen S, hautaniemi S, Czeizler E. Biclustering methods: Biological relevance and application in gene expression analysis. PLoS ONE. 2014; 9:90801.

    Article  Google Scholar 

  8. van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002; 415:530–6.

    Article  Google Scholar 

  9. Sørie T, Perou CM, Tibshirani R, Aas T, Geisler SJ, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Nat Acad Sci USA. 2001; 98:10869–74.

    Article  Google Scholar 

  10. Jain AK, Murty MN, Flynn PJ. Data clustering: A review. ACM Comput Surv. 1999; 31:264–323.

    Article  Google Scholar 

  11. Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis; a survey. IEEE/ACM Trans Comput Biol Bioinforma. 2004; 1:24–45.

    Article  CAS  Google Scholar 

  12. Wang Y, Kijin JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005; 365:671–9.

    Article  CAS  PubMed  Google Scholar 

  13. Bravo HC, Pihur V, McCall M, Irizarry RA, Leek JT. Gene expression anti-profiles as a basis for accurate universal cancer signatures. BMC Bioinforma. 2012; 13:272.

    Article  Google Scholar 

  14. Naudts J. Generalized Thermostatistics. New York City: Springer; 2011.

    Book  Google Scholar 

  15. Eguchi S, Komori O. Path connectedness on a space of probability density functions. Lecture Notes Comput Sci. 2015; 9389:615–24.

    Article  Google Scholar 

  16. Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal. 2005; 48:869–85.

    Article  Google Scholar 

  17. Omae K, Komori O, Eguchi S. Reproducible detection of disease-associated markers from gene expression data. BMC Med Genomics. 2016;9:53. doi:10.1186/s12920-016-0214-5.

  18. Ward JHJ. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963; 58:236–44.

    Article  Google Scholar 

  19. Thompson BR, Baker RJ. Composite link functions in generalized linear models. J R Stat Soc. 1981; 30:125–31.

    Google Scholar 

  20. Cook J. Basic properties of the soft maximum. In: UT MD Anderson Cancer Center Department of Biostatistics Working Paper Series, Available at Http:// 2011.

  21. Goodfellow IJ, Warde-Farley D, Mirza M, Courville CA, Bengio Y. Maxout networks. ICML. 2013; 28:2356–64.

    Google Scholar 

  22. Nelder JA, Wedderburn RWM. Generalized linear models. J R Stat Soc. 1972; 125:370–84.

    Google Scholar 

  23. Park MY, Hastie T. l 1 regularization path algorithm for generalized linear models. J R Stat Soc. 2007; 69:659–77.

    Article  Google Scholar 

  24. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33:1–22.

    Article  PubMed  PubMed Central  Google Scholar 

  25. Meier SL, van de Geer S, Bühlmann P. The group lasso for logistic regression. J R Stat Soc. 2008; 70:53–71.

    Article  Google Scholar 

  26. Goeman JJ. l 1 penalized estimation in the cox proportional hazards model. Biometrical J. 2010; 52:70–84.

    Google Scholar 

  27. Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE. Adaptive mixture of local expert. Neural Comput. 1991; 3:79–87.

    Article  Google Scholar 

  28. Yan L, Tian L, Liu S. Combining large number of weak biomarkers based on auc. Stat Med. 2015; 34:3811–830.

    Article  PubMed  Google Scholar 

  29. Kalousis A, Prados J, Hilario M. Stability of feature selection algorithms. In: Proc. 5th IEEE International Con- ference on Data Mining (ICDM’05). IEEE: 2005. p. 218–225.

  30. Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas A, et al. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Nat Cancer Inst. 2006; 98:1183–92.

    Article  CAS  PubMed  Google Scholar 

  31. Youden WJ. Index for rating diagnostic tests. Cancer. 1950; 3:32–5.

    Article  CAS  PubMed  Google Scholar 

  32. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc. 2005; 67:301–20.

    Article  Google Scholar 

  33. Setlur S, Mertz K, Hoshida Y, Demichelis FLM, et al. Estrogen-dependent signaling in a molecularly distinct subclass of aggressive prostate cancer. J Nat Cancer Inst. 2008; 100:815–25.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Dettling M, Bühlmann P. Boosting for tumor classification with gene expression data. Bioinformatics. 2003; 19(9):1061–9.

    Article  CAS  PubMed  Google Scholar 

  35. Boyd S, Vandenberghe L. Convex Optimization. Cambridge: Cambridge University Press; 2004.

    Book  Google Scholar 

  36. McQueen J. Some methods for classification and analysis of multivariate observartions. Proc 5-th Berkeley Symp Math Stat Probab. 1967; 1:281–97.

    Google Scholar 

Download references


We would like to thank Li Yan for helpful comments and for providing us with datasets.


This work was supported by JSPS KAKENHI Grant Number 25280008. The funding bodies had no role in the design and the conclusions of the presented study.

Availability of data and materials

All the data used to perform the application described in this paper are freely available. The data of van’t Veer et al. is available on the Gene Expression Omnibus data base [], series GSE2990. The data of Buyse et al. is available on the European Bioinformatics Institute ArrayExpress database [], accession number E-TABM-77.

Authors’ contributions

KO, OK and SE designed the methods used in this study. KO carried out the simulation study and data analysis, and wrote the paper. All authors have read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Katsuhiro Omae.

Additional files

Additional file 1

Technical derivations. In this file, we perform some technical derivations and evaluations for quasi-linear score: parameterization; the relationship with linear and maximum score; the Bayes risk consistency; L1 and L2 regularization methods; the derivatives. (PDF 45 kb)

Additional file 2

R source code of the parameter estimation of the quasi-linear logistic model. In this file, we introduce the R source code of the parameter estimation of the quasi-linear logistic model, which was used for Simulation and Application. (PDF 12 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Omae, K., Komori, O. & Eguchi, S. Quasi-linear score for capturing heterogeneous structure in biomarkers. BMC Bioinformatics 18, 308 (2017).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: