### Empirical Bayes hierarchical model

If the transcript-specific parameter

, where

*μ*
_{
t
} and

are the location and scale parameters respectively, then the conditional likelihood of the

*t*th transcript’s expression measurement

y
_{
t
}= (

*y*
_{
t1},

*y*
_{
t2},…,

*y*
_{
tn
}) can be expressed as

(

*t*=1,2,…,

*T*). The location parameter

*μ*
_{
t
} follows the prior distribution,

*Π*(

*μ*
_{
t
}|

θ), where

θ is the hyper-parameter specifying the prior distribution. The predictive likelihood of

y
_{
t
} (unconditional on the location parameter

*μ*
_{
t
}) is obtained by integrating over the location parameter,

*μ*
_{
t
}, as follows:

When expression measurements between two groups (for example, different cell types) are compared for transcript

*t*, the measurements are partitioned into two user defined groups

*G*
_{1} and

*G*
_{2} of sizes

*n*
_{1} and

*n*
_{2} respectively, where

*n*
_{1} +

*n*
_{2} =

*n*. If there is no significant difference between the means of the two groups, the gene is assumed to be equivalently expressed (EE); otherwise, it is assumed to be a DE gene. If the

*t*th transcript is DE, the two groups will have different mean expression levels,

. Given the values of

and

, the conditional likelihood of

is written as follows:

because components of

y
_{
t
}are independent of each other. Assuming that the group means

(such that

) independently originate from

*Π*(

*μ*
_{
t
}|

θ), then the predictive likelihood of

y
_{
t
} (unconditional on the location parameters

) is obtained as a mean of the conditional likelihood of

y
_{
t
}(2) over the prior distribution of

and

as follows:

Because it is unknown whether the

*t*th gene is EE or DE between the two groups, the final likelihood of

y
_{
t
}(unconditional on the location parameters) becomes a mixture of two distributions (1) and (3) as follows:

Here,

*p*
_{0} and

*p*
_{1} are the mixing proportions of the EE and DE transcripts in the two user defined groups respectively, such that

*p*
_{0} +

*p*
_{1} = 1. The posterior probability of differential expression (PPDE) is calculated by Bayes rule using the estimates of

*p*
_{0},

*f*
_{0} and

*f*
_{1} as follows:

It should be noted here that θ and
in equations (1)-(5) are assumed to be exactly the same.

### Maximum *β*-likelihood estimation of mixture distribution using an EM-like algorithm to calculate *β*-posterior probabilities of differential expressions

Box and Cox
[

38] proposed a family of power transformations of the dependent variable in regression analysis to robustify the normality assumption. By choosing an appropriate value of

*λ* in the transformation,

the standard linear regression model with the normality assumption fits well to a wide range of data. Inspired by this idea, Basu

*et al*[

36] and Minami and Eguchi
[

37] proposed a robust and efficient method for estimating model parameter

θ by minimizing a density power divergence in a general framework of statistical modeling and inference. They
[

36,

37] have also shown that minimizer of density power divergence is equivalent to the maximizer of

*β*-likelihood function. According to the current problem in this paper, the

*β*-likelihood function for

θ given the values of the mixing parameter

*p*
_{0} = 1 −

*p*
_{1} and the gene specific scale parameter

for all

*t* can be written as

where

*f*(.) is the mixture of distributions as defined in (4) and

which is independent of observations. Because the gradient of (6) can be converted as follows,

the maximum *β*-likelihood estimator (*β*-MLE) of θ can be regarded as a weighted (quasi-) likelihood estimator. Then the weight of gene *t* is described as a power function of its likelihood,
, where *f*(.) is defined by equation (4). Thus, the genes with low likelihoods have unexpected expression patterns and have low weights because the normal density function produces smaller outputs for larger inputs. By assigning low weights to outliers, the inference becomes robust. It is obvious from (7) that *β*-MLE reduces to the classical MLE for *β* = 0. Because the expression pattern (EE or DE) of each gene is unknown, it is difficult to optimize both the classical log-likelihood function and the proposed *β*-likelihood function for directly estimating θ. To overcome this problem, we consider the EM-like algorithm to obtain *β*-MLE of θ treating the mixture distribution (4) as an incomplete-data density. The hyper-parameters θ and the mixing proportion *p*
_{0} are estimated by EM algorithm as follows:

The hyperparameters,

θ,

*p*
_{0} are estimated by the EM algorithm in two steps.

**E-step**: Compute the Q-function which is defined by the conditional expectation of the complete-data

*β*-likelihood with respect to the conditional distribution of missing data (

Z) given the observed data (

Y) and the current estimated parameter value

as follows:

where

*k* = 0 for

y
_{
t
} belongs to EE pattern and

*k* = 1 for

y
_{
t
} belongs to DE pattern. Here

which does not depend on observations,

is the posterior probability of

*k*th pattern for gene

*t* and the value of

*p*
_{1} = 1 −

*p*
_{0} is updated by a separate EM formulation as follows:

For
, the proposed Q-function
reduces to the standard Q-function *Q*(θ|θ
^{(j)}) of the standard empirical Bayes approaches
[8, 30].

**M-step**: Find θ
^{(j + 1)}by maximizing the proposed Q-function as defined in (8). Continue EM iterations up to the convergence of successive estimates of θ. The estimate of θ after convergence is taken to be the *β*-MLE of θ according to the EM properties.

The tuning parameter, *β*, controls the balance between the robustness and efficiency of the estimators. By setting a tentative value for *β*
_{0}, the optimal value is estimated by maximizing the predictive *β*
_{0}-likelihood via a five-fold cross validation. The dataset is divided into five subsets by transcripts. For each value of *β*, the predictive *β*
_{0}-likelihood of each subset is calculated based on the maximum *β*-likelihood estimates of the parameters based on the rest of the data. Finally, the *β* value that maximizes the average predictive *β*
_{0}-likelihood is selected as the optimal value of *β*. For more information about *β*-selection, please see
[39, 40].

Then, based on the estimate values of the model parameters, we can compute the PPDE between two groups of y
_{
t
} using equation (5) for all *t*. However, PPDE of contaminated gene using equation (5) might be produced misleading result, since PPDE of y
_{
t
} depends on the estimate values of parameters and measurements of y
_{
t
}. To overcome this problem, we detect contaminated genes using *β*-weight function and replace the contaminated measurements in y
_{
t
}by its group means. Then we compute the PPDE of contaminated y
_{
t
} using equation (5) also. The PPDE based on *β*-MLE, we call *β*-PPDE in this paper. The detail discussion for computation of *β*-PPDE under LNN model is discussed below in the LNN model.

### The LNN model

In this paper, we use the LNN (log-normal-normal) hierarchical model for computing the posterior probability of differential expressions. In the LNN model, log-transformed gene expression measurements are assumed to follow normal distribution for each gene with the transcript-specific parameter

, where

*μ*
_{
t
} is the transcript-specific mean and

is the transcript-specific variance for gene

*t*[

8,

30]. A conjugate prior for

*μ*
_{
t
} is assumed to follow the normal with some underlying mean

*μ*
_{0}and variance

; that is,

, where

. By integrating as in (1), the density

*f*
_{0}(·) for an

*n*-dimensional input becomes Gaussian with the mean vector

μ
_{0} =

and an exchangeable covariance matrix as follows:

where I
_{
n
} is an *n* × *n* identity matrix and M
_{
n
}is a matrix of ones.

The gene specific variance

is computed separately assuming prior distribution for

as scale-inverse

, where

*ν*
_{∗} is the degrees of freedom and

is the scaled parameter. Yang et al.
[

30] proposed that

could be estimated by a Bayes estimator defined as,

is the pooled sample variances with

as the sample variance in group

*g* = 1,2. By viewing the pooled sample variances

as a random sample from the prior distribution of

, the estimates

of

are obtained using the method of moments. However, it is obvious that (12) will be very sensitive to outliers. Therefore, we have used a maximum

*β*-likelihood estimation of

which is highly robust against outliers
[

39] and can be obtained iteratively as follows:

is the *β*-weight function for estimating robust mean and variance which produces an almost zero or very small weight for *y*
_{
ti
} if it is an outlying/extreme observation.

To estimate the hyper-parameters

by maximizing of the proposed Q-function (8) in the M-step, we compute the gradient of

with respect to

θ which is given by

It reduces to the gradient of the standard Q-function denoted by

based on the log-likelihood function for

*β* = 0. The second term on the right-hand side of equation (15) is independent of observations; the first term is the weighted gradient of

*Q*(

θ|

θ
^{(j)}) with the weight function

. This weight function produces a smaller weight if the

*t*th gene is contaminated by outliers; otherwise, it produces a comparatively larger weight for the

*t*th gene independent of whether it is EE (

*k*=0) or DE (

*k*=1). Therefore contaminated genes cannot influence the estimates and robust estimates of the parameters can be obtained. For convenience of choosing the threshold weight to identify contaminated genes statistically, we define the

*β*-weight function for the gene

*t* as follows

where the circumflex above a parameter indicates the proposed estimate of the parameters. Excluding the normalization constant, the

*β*-weight function corresponding to an EE gene becomes,

which measures the deviation of each gene expression data vector from the grand mean vector for the expression of all the genes in the dataset. The

*β*-weight function corresponding to a DE gene becomes

where
=
and
=
are the grand mean vectors, and
and
are the exchangeable covariance matrices in two user defined groups. Both the *β*-weight functions defined by equations (17) and (18) for genes *t* = 1,2,…,*T*produce weights that are between 0 and 1 for any data vector y
_{
t
}.

Because, both weight functions are the negative exponential function of the squared Mahalanobis Distance (MD) defined by
between the data vector y
_{
t
} and and the mean vector
. From equations (17) and (18), the *β*-weight for gene *t* decreases when MD_{
t
} increases and increases when MD_{
t
} decreases. That is, the *β*-weight for a gene *t* becomes smaller (≥ 0) when y
_{
t
} is contaminated by outliers, and larger (≤ 1) when it is not contaminated.

The large number of transcripts in microarray data enables a statistical investigation of the observed distribution of the

*β*-weights compared to the predicted distribution under the assumption that the model is correct and the data is free from outliers. To investigate this further, we start with the case where the predicted distribution can be obtained theoretically. When the normality assumptions hold and there are no outliers, and when the gene-specific variance is known for EE genes, the cumulative distribution of the

*β*-weight

for gene

*t* with known gene specific variance (

) becomes,

which implies that *w*
_{
t
}follows
, where
denotes the chi-square variable which assumes values
for 0 < *w*
_{0} ≤ 1, with *n* degrees of freedom. Similarly, for DE genes (18) the *β*-weight
also follows
, for 0 < *w*
_{0} ≤ 1 using the additive property of *χ*
^{2} distributions.

In many cases, however, the variance is unknown. For such cases, the distribution of the

*β*-weights is obtained by parametric bootstrapping. Thus statistically, we can examine whether or not a gene is contaminated by outliers using either one of the two

*β*-weight functions because both weight functions follow the same distribution and show similar trends for the observed weights of both gene expression patterns (DE and EE). However, the

*t*th gene is defined as contaminated by outliers if

where

*ξ*
_{
p
} is the

*p*-quantile of the

*β*-weights defined by

Heuristically, we choose

*p* = 10

^{−5} for the detection of contaminating genes. Then we compute the

*β*-PPDE using equation (5) updating the measurements in the contaminated genes. To compute the

*β*-PPDE with respect to a contaminating gene expression, say, for example,

by equation (5), we modify the contaminated measurements in

using the robust mean

obtained iteratively using equation (13). Here

is taken to be the

*i*th contaminated measurement of

in group

*g*=1, 2 if

where

*α*
_{
p
} is the

*p*-quantile of the

*β*-weights defined by

Here
is the *β*-weight function that is used to compute the robust mean and variance (14), which follows
, where
denotes the chi-square variable which assumes values of
for 0 < *w*
_{0} ≤ 1, with 1 degree of freedom. However, we can set an arbitrary threshold (*α*
_{0} = 0.2 ) to detect contaminated measurements with weights that are below the threshold, because weights are close to zero for outlying/extreme observations.

### Simulated data that were used to examine the performance of the *β*-EB approach

The *β*-EB approach that we developed detected a large proportion of outliers with p-values less than 10^{−5}. In the microarray data of head and neck cancer, 1.75% of the genes were outliers; in the lung cancer data, 13.75% were outliers; and in *Arabidopsis thaliana*, 16.59% were outliers in the empirical data analysis. A detailed inspection of the outliers detected in the lung cancer data reflected misspecification of the model. To investigate the effect of outliers and model misspecification, we conducted a numerical simulation in which we compared the performance of the proposed *β*-EB approach with the t-test, linear models for microarray data (Limma)
[22], SAM
[17], and other EB approaches (EB-LNN, eGG
[29], eLNN
[29], GaGa
[21]). The t-test, Limma, and SAM detect DE genes based on p-values while, the EB procedures and the *β*−EB approach detect DE genes based on posterior probabilities. Therefore, we calculated the AUC (area under the curve) and pAUC (partial area under the curve) of the ROC curves. We also compared the estimated proportion of DE genes obtained using the *β*−EB and EB approaches. This characteristic plays an important role, especially when the aim of the study is to identify the major regulatory elements that influence the expressions of a large number of genes. The EB approaches estimate the proportion of DE genes by the mean posterior probability. The *β*−EB approach estimates it by using equation (11). No reasonable procedure to calculate the proportion of DE genes for the t-test, Limma and SAM methods could be found, because, in these methods, the estimation depends on the threshold value of the p-values.

#### Simulated gene expression profiles with and without outliers

We generated 50 datasets that roughly reflect the head and neck cancer data described in empirical data analysis below. Each dataset contained measurements of 1,000 genes, and 50 out of the 1,000 genes were DE (*p*
_{1} = 0.05). The log-transformed expression was assumed to follow normal distribution. The mean log-expression level of a gene followed a normal distribution with the mean *μ*
_{0} = 2.0 and the variance
The gene-specific variance
of the log expression level among the genes varied from the exponential distribution with a mean of *σ*
^{2} = 0.10.

We considered two scenarios with different proportions of contaminating genes (10%, 20%), and two scenarios with two patterns of outliers (mild outliers: *μ*
_{
ti
}
*′* = 5*μ*
_{
ti
}), and (extreme outliers: *μ*
_{
ti
}
*′* = 10*μ*
_{
ti
}). To estimate the dependence of the performance on the sizes of the groups, we considered two more scenarios with different group sizes (moderate/large (*n*
_{1}= *n*
_{2} = 30) and small (*n*
_{1}= *n*
_{2} = 10)).

#### Simulated gene expression profiles from misspecified model

To show how the *β*− weight can be used for model diagnosis, we generated the expressions of each of the 1,000 genes in the dataset from their gamma distribution. The shape parameter that we obtained followed log normal distribution with the location parameter 1 and scale parameter 1. The scale parameter of the gamma distribution was set to 0.067. The LNN model was applied to this data. When the shape parameter is large, a gamma distribution can be approximated by a log normal distribution; however, when the shape parameter is small, especially when it is smaller than 1, the gamma distribution has a heavy mass near 0 and it cannot be approximated by a log normal distribution. In our simulation scenario, the proportion of transcripts with a shape parameter < 1 was 0.159. We used the dataset that contained the measurements of 1,000 genes with 30 samples in each of the two groups. The measurements for 50 out of 1,000 genes were DE (*p*
_{1} = 0.05). The gene-specific variance (scale) of the log expression level among genes varied from the gamma distribution.