In the following section, we propose a novel normalization method for RNA-seq data with different species by utilizing the available knowledge of conserved orthologous genes and the hypothesis testing framework.

### Notations and model

Let *G*={*g*_{1},*g*_{2},…,*g*_{n}} be the complete set of genes from two different species, and *G*_{0} be the set of one-to-one orthologous genes that are to be tested for differential expression. For species *t*=1 or 2, let \(X_{g_{k}t}\) be the random variable that represents the count of reads mapped to the orthologous gene *g*_{k}∈*G*_{0}, and \(X_{g_{k}t}\) be the observed value of \(X_{g_{k} t}\). Accordingly, the total number of orthologous reads for species *t* is \(N_{t}=\sum _{g_{k} \in G_{0}}x_{g_{k}t}\). For ease of presentation, our normalization method is presented for the setting of one sample in each species only. Our proposed method, however, can be readily extended to more general settings including multiple samples for each species. For gene *g*_{k} in species *t*, we consider the mean model:

$$ E(X_{g_{k}t}) = \frac{\mu_{g_{k}t}L_{g_{k}t}}{S_{t}}N_{t}, $$

(1)

where \(\mu _{g_{k}t}\) is the true expression level, \(L_{g_{k}t}\) is the true gene length, and \(S_{t}=\sum _{g_{k} \in G_{0}}\mu _{g_{k}t}L_{g_{k}t}\) is the total expression output of all orthologous genes in species *t*. Note that, since \(L_{g_{k}t}\) is often different between species, we have included it in model (1) to alleviate the bias in gene length.

### Novel normalization method

We propose a novel normalization method by employing the available knowledge of conserved orthologous genes and the hypothesis testing framework. Specifically, we choose a scale to minimize the deviation between the empirical and nominal type I errors in RNA-seq data based on the hypothesis test.

To detect differential expressions of orthologous genes between two species, for each *g*_{k}∈*G*_{0}, we consider the hypothesis

$$\begin{array}{@{}rcl@{}} H_{0}^{g_{k}}: \mu_{g_{k}1}=\mu_{g_{k}2}\quad \text{versus} \quad H_{1}^{g_{k}}: \mu_{g_{k}1}\neq \mu_{g_{k}2}. \end{array} $$

We further assume that the reads mapped to the orthologous genes are Poisson random variables with \(\lambda _{g_{k}1}=E(X_{g_{k}1})\) and \(\lambda _{g_{k}2}=E(X_{g_{k}2})\). Then under model (1), the hypothesis is equivalent to

$$ {{} \begin{aligned} H_{0}^{g_{k}}\!:\!\lambda_{g_{k}1}\,=\,\frac{L_{g_{k}1}}{L_{g_{k}2}}\frac{N_{1}}{N_{2}}c\lambda_{g_{k}2}~ \text{versus}\ \ H_{1}^{g_{k}}\!:\!\lambda_{g_{k}1}\!\neq\!\frac{L_{g_{k}1}}{L_{g_{k}2}}\frac{N_{1}}{N_{2}}c\lambda_{g_{k}2}, \end{aligned}} $$

(2)

where *c*=*S*_{2}/*S*_{1} is the scaling factor for normalization.

Given that \(X_{g_{k}1}+X_{g_{k}2}=n_{g_{k}}\) with \(n_{g_{k}}\) a fixed integer, the random variable \(X_{g_{k}1}\) follows a binomial distribution with the conditional probability density function as

$${\begin{aligned} &P\left(X_{g_{k}1}= x_{g_{k} 1}\big|{X_{g_{k}1}+X_{g_{k}2}}=n_{g_{k}}\right) \\&\quad=\frac{n_{g_{k}}!}{x_{g_{k}1}!\left(n_{g_{k}}-x_{g_{k}1}\right)!} \left(p_{0}^{g_{k}}\right)^{x_{g_{k}1}} \left(1-p_{0}^{g_{k}}\right)^{n_{g_{k}}-x_{g_{k}1}}, \end{aligned}} $$

where

$$p_{0}^{g_{k}}=\frac{\lambda_{g_{k}1}}{\lambda_{g_{k}1}+\lambda_{g_{k}2}}= \frac{ {g_{k}1}N_{1}}{L_{g_{k}2}N_{2}+{cL}_{g_{k}1}N_{1}}$$

is the probability of success under the null hypothesis of (2). For the above model, the *p*-value of the test is

$$\begin{array}{*{20}l} p_{g_{k}}(c)&=P\left(| X_{g_{k}1}-n_{g_{k}} p_{0}^{g_{k}}|\geq | x_{g_{k}1}-n_{g_{k}} p_{0}^{g_{k}}|\big|n_{g_{k}}\right)\notag \\ &=P\left(|(1+\frac{L_{g_{k}1}}{L_{g_{k}2}}\frac{N_{1}}{N_{2}}c\right)X_{g_{k}1}-\frac{L_{g_{k}1}}{L_{g_{k}2}}\frac{N_{1}}{N_{2}}{cn}_{g_{k}}| \geq \\ &~~~~~\left|\left(1+\frac{L_{g_{k}1}}{L_{g_{k}2}}\frac{N_{1}}{N_{2}}c)x_{g_{k}1}-\frac{L_{g_{k}1}}{L_{g_{k}2}}\frac{N_{1}}{N_{2}}{cn}_{g_{k}}|\right|n_{g_{k}}\right). \end{array} $$

(3)

Note that the *p*-value in (3) is a function of the scaling factor *c* under the condition \(X_{g_{k}1}+X_{g_{k}2}=n_{g_{k}}\). To search for the optimal *c* for normalization, we apply the following two questions as criteria. (i) Does the normalization method improve the accuracy of DE detection, i.e., whether or not it will decrease the false discovery rate (FDR) of the tests? (ii) Does the normalization method result in a lower technical variability or specificity? For multiple testing, Storey [23] pointed out that different hypothesis tests will result in different significant regions. To transform these tests into a common space, the *p*-value is a natural way to do so with respect to the positive false discovery rate (pFDR). By taking the number of set *G*_{0} identical hypothesis tests, the pFDR is defined as follows:

$$\begin{array}{*{20}l} {} \text{pFDR}_{g_{k}}&=\!\frac{P(H_{0};c)P(R_{g_{k}}\mid H_{0}; c)}{P(R_{g_{k}};c)} \\ &\,=\,\frac{P(H_{0};c)P(R_{g_{k}}\mid H_{0};c)}{\!P(H_{0};c)P(R_{g_{k}}\mid H_{0};c)\,+\,P(H_{1};c)P(R_{g_{k}}\mid H_{1};c)\!}, \end{array} $$

(4)

where *α* is the significance level and \(R_{g_{k}}=\{p_{g_{k}}(c) < \alpha \}\) is the rejection region. By (4), the pFDR of gene *g*_{k} is a function of both *α* and *c*. Given the values of *α* and *c*, we can apply the empirical distributions to estimate \(P(R_{g_{k}}|H_{0};c)\) and \(P(R_{g_{k}}|H_{1};c)\). Let *V*_{0} and *V*_{1} be the sets of non-DE genes and DE genes in *G*_{0}, respectively. Then, \(\text {pFDR}_{g_{k}}(\alpha ;c)\) can be estimated as

$$\begin{array}{@{}rcl@{}} \widehat{\text{pFDR}}_{g_{k}}=\!\frac{P(H_{0};c){\widehat P}(R_{g_{k}}\mid H_{0};c)}{P(H_{0};c){\widehat P}(R_{g_{k}}\mid H_{0};c)+P(H_{1};c){\widehat P}(R_{g_{k}}\mid H_{1};c)\!}, \end{array} $$

where

$${\widehat P}(R_{g_{k}} \mid H_{0};c)=\frac{1}{n_{0}}\sum_{g_{k}\in V_{0}}I(p_{g_{k}}(c)<\alpha|H_{0};c)$$

for any *g*_{k}∈*V*_{0}, and

$${\widehat P}(R_{g_{k}}\mid H_{1};c)=\frac{1}{n_{1}}\sum_{g_{k}\in V_{1}}I(p_{g_{k}}(c) <\alpha|H_{1};c)$$

for any *g*_{k}∈*V*_{1}, where *I*(·) is the indicator function, and *n*_{0} and *n*_{1} represent the cardinalities of *V*_{0} and *V*_{1}, respectively.

When all non-DE genes in *V*_{0} are given, we can perform our new normalization by determining the optimal scaling factor that minimizes the value of pFDR. For real data, however, it is not uncommon that only a small proportion of non-DE genes are known a priori by background knowledge. In this paper, we assume that a set of conserved orthologous genes between species are given in advance, which may either be reported in other studies or be selected by a certain biological measure [7, 24]. For the given set *H* of conserved orthologous genes that are considered as non-DE genes for its stability between species, we search for the optimal scaling factor by minimizing the deviation between the empirical and nominal type I errors. Let *m* be the number of genes in the set *H*. Given the true value of *c*, the *p*-values of the tests for the conserved orthologous genes follow a uniform distribution on interval (0,1). That is, for the specified *α* and *c*, the value of \(\sum _{g_{k} \in H}(1/m)I(p_{g_{k}}(c)<\alpha |H_{0};c)\) should be around the nominal level at *α*. In our method, we define the optimal scaling factor as *c*_{opt} that minimizes the objective function \(\mid \sum _{g_{k} \in H}(1/m)I(p_{g_{k}}(c)<\alpha |H_{0};c)-\alpha \mid \); that is,

$$\begin{array}{*{20}l} c_{\text{opt}}=\underset{c>0}{\text{argmin}} \big|\sum_{g_{k} \in H}\frac{1}{m}I(p_{g_{k}}(c)<\alpha|H_{0};c)-\alpha \big|. \end{array} $$

(5)

Finally, to estimate the optimal scaling factor defined in (5), we apply a grid search method and denote the best estimate as \(\hat c_{\text {opt}}\). For convenience, we refer to the proposed scale based normalization method as the SCBN method.