Identification of gene pairs through penalized regression subject to constraints

Background This article concerns the identification of gene pairs, or combinations of gene pairs, associated with a biological phenotype or clinical outcome, enabling predictive models that are not only robust to normalization but also easily validated and measured by qPCR techniques. However, given a small number of biological samples yet a large number of genes, this problem suffers from high computational complexity and poses statistical challenges to accurate identification. Results In this paper, we propose a parsimonious model representation and develop efficient algorithms for identification. In particular, we derive an equivalent model subject to a sum-to-zero constraint in penalized linear regression, and establish the correspondence between nonzero coefficients in the two models. Most importantly, this reduces the model complexity of the traditional approach from quadratic to linear order in the number of candidate genes, while overcoming the nonidentifiability of the traditional model. Computationally, we develop an algorithm using the alternating direction method of multipliers (ADMM) to handle the constraint. Numerically, we demonstrate that the proposed method outperforms the traditional method in terms of statistical accuracy, and that our ADMM algorithm is more computationally efficient than a coordinate descent algorithm with a local search. Finally, we illustrate the proposed method on a prostate cancer dataset, identifying gene pairs that are associated with pre-operative prostate-specific antigen. Conclusion Our findings demonstrate the feasibility and utility of using gene pairs as biomarkers.

removed or added from the dataset, imposing computational challenges for large studies. Moreover, any analysis using selected genes based on one dataset may be sensitive to normalization, leading to non-generalizable and/or non-reproducible scientific findings [4].
To address the foregoing challenges, a modeling method based on gene pairs was first presented in the top-scoring pair (TSP) classifier by [5] and later implemented by [6]. Compared to predictors based on individual genes, gene pair-based predictors are more robust to normalization and have better prediction or classification accuracy. Another advantage of gene pair-based predictive modeling is its ease of evaluation and validation by qPCR methods. Ideally, to use qPCR to measure a single gene's expression level, one applies the delta-Ct method [7], in which the difference in Ct values between the gene of interest and a housekeeping gene such as GAPDH measures the gene's expression level. However, the between-sample variation of a housekeeping gene may be large, imposing a great challenge [8]. In this sense, gene pair-based modeling removes the requirement of housekeeping genes, since the difference in Ct values between the two genes of interest can be treated directly as a measurement. Consequently, the two genes in a gene pair serve as internal controls for each other. Due to all these advantages, gene pair-based predictors have been adopted in several cancer studies [9][10][11].
Despite the advantages of the gene pair approach, identifying the best gene pair, or the best combination of several gene pairs, from all possible pairs in a pool of tens of thousands of genes is statistically and computationally challenging due to the combinatorial complexity. For instance, the TSP algorithm employs a direct search whose running time grows quadratically in the number of candidate genes. Although in practice one can first identify differentially expressed genes and then restrict the search to these genes, such a two-step approach is no longer invariant to normalization and may miss informative pairs in which at most one gene is differentially expressed [5]. The computational problem is even more severe when more than one gene pair is sought, such as in k-TSP, which involves exactly k top disjoint pairs in prediction [12].
Moreover, even though rank-based gene pair predictors like the TSP are robust to normalization, their utility in modeling complex data remains limited. One possible extension is to use ratios of gene expression levels as predictors in regression models that select gene pairs. In recent years, regression models with sparsity-enforcing penalties (such as the Lasso [13], SCAD [14], and TLP [15] penalties) have been widely used for variable selection, and many efficient algorithms have been proposed for fitting such models. One may take such an approach by treating the ratios of gene expression levels from all possible gene pairs as candidate predictors. However, the number of candidate predictors then grows quadratically in the number of candidate genes.
In this paper, we develop a new regression approach and an efficient algorithm for identifying gene pairs associated with biological phenotype or clinical outcome. We propose an equivalent model subject to a sum-to-zero constraint on the regression coefficients, and establish the correspondence between nonzero coefficients in the two models. A model of this type has been proposed for compositional data [16] and recently for reference point insensitive data [17]. One salient aspect is that this model is more parsimonious, with the number of predictors growing only linearly in the number of candidate genes. To deal with the constraint, we develop an efficient algorithm based on the alternating direction method of multipliers (ADMM) [18,19] for identification and model parameter estimation. The new approach shares not only the benefit of simplicity in interpretation but also a linear complexity. Most importantly, the proposed method substantially improves statistical accuracy and computational efficiency. Finally, in simulations, the method compares favorably against the traditional method in terms of the accuracy of identification, and our ADMM algorithm is more computationally efficient than the coordinate descent with local search (CD+LS) algorithm of [17].

High-dimensional linear regression
This section proposes predictive models based on combinations of ratios of gene expression levels, on the grounds that such ratios are not only robust to normalization but also easily validated and measured by qPCR techniques.
Given p predictors (x_1, ..., x_p) measuring the expression levels of p genes (g_1, ..., g_p), we consider informative second-order interactions defined by the pairwise ratios {x_j/x_k, 1 ≤ j < k ≤ p} of (x_1, ..., x_p) with respect to a continuous response, such as the pre-operative prostate-specific antigen level measured from prostate cancer patients, as demonstrated in the "Results" section. It is assumed that only a small number (i.e., much smaller than p) of genes are informative. Now consider a regression model in which the response Y depends on a predictor vector z in a linear fashion:

Y = f(z) + ε = α^T z + ε, (1)

where α = (α_12, α_13, ..., α_{p-1,p})^T and z = (log(x_1/x_2), log(x_1/x_3), ..., log(x_{p-1}/x_p))^T are q = p(p-1)/2-dimensional vectors of regression coefficients and predictors, and ε is a random error that is independent of z. For convenience, for i < j, we let α_ji = -α_ij. In Eq. (1), the primary reasons for taking the logarithm of the pairwise ratios {x_j/x_k, 1 ≤ j < k ≤ p} are two-fold. First, it stabilizes the variance of gene expression levels so that Eq. (1) is suitable; indeed, the logarithm transformation is widely used in the literature on gene expression modeling [20]. Second, it facilitates an efficient model fitting algorithm, introduced subsequently. Our objective is to identify the nonzero coefficients of α, corresponding to informative gene pairs, based on gene expression.
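As a concrete illustration, the q = p(p-1)/2 log-ratio predictors of Eq. (1) can be assembled as follows. This is a minimal sketch; the function name pairwise_log_ratios and the use of NumPy are our own choices, not from the paper.

```python
import numpy as np

def pairwise_log_ratios(X):
    """Build the q = p(p-1)/2 predictors z = (log(x_j/x_k))_{j<k} of Eq. (1).

    X is an (n, p) matrix of positive expression levels; columns of the
    result are ordered (1,2), (1,3), ..., (p-1,p).
    """
    n, p = X.shape
    L = np.log(X)
    cols = [L[:, j] - L[:, k] for j in range(p) for k in range(j + 1, p)]
    return np.column_stack(cols)

# Toy data: 5 samples, 4 genes -> 6 pairwise log-ratio predictors.
X = np.random.default_rng(0).lognormal(size=(5, 4))
Z = pairwise_log_ratios(X)
```

Even for this tiny example the quadratic growth is visible: 4 genes already yield 6 candidate predictors, and p = 15,382 genes (as in the real-data analysis below) would yield over 10^8.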
There are several challenges in identifying informative ratios within the framework of Eq. (1), in which p may greatly exceed the sample size n, a setting known as high-dimensional regression. Normally, one may apply a feature selection method such as the Lasso [13] for this task. Unfortunately, the high dimensionality of Eq. (1) impedes the accuracy of feature selection in the presence of noise, in addition to the computational cost, both of which grow roughly in proportion to p^2. To overcome these difficulties, we propose an alternative yet equivalent formulation of Eq. (1) through a more parsimonious representation involving one linear constraint.
The next proposition shows that f(z) in Eq. (1) has an equivalent representation with only p variables; in this sense, it achieves the objective of dimensionality reduction.

Proposition 1
f(z) in Eq. (1) admits the equivalent representation

f(z) = Σ_{j=1}^p β_j log x_j, where β_j = Σ_{k≠j} α_jk. (2)

Importantly, Σ_{j=1}^p β_j = 0.
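Proposition 1 can be checked numerically. The sketch below (our own illustration, not from the paper) draws a random antisymmetric α and verifies that the pairwise form of Eq. (1) collapses to the p-variable form, with β_j = Σ_{k≠j} α_jk summing to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6
# Random antisymmetric coefficient matrix: alpha[j, k] = -alpha[k, j].
A = rng.normal(size=(p, p))
alpha = np.triu(A, 1) - np.triu(A, 1).T

x = rng.lognormal(size=p)
lx = np.log(x)

# f(z) from Eq. (1): sum over pairs j < k of alpha_jk * log(x_j / x_k).
f_pairs = sum(alpha[j, k] * (lx[j] - lx[k])
              for j in range(p) for k in range(j + 1, p))

# Proposition 1: beta_j = sum_{k != j} alpha_jk, and sum_j beta_j = 0.
beta = alpha.sum(axis=1)
f_beta = beta @ lx
```

The identity holds because each log x_j collects the coefficients α_jk over k > j with a plus sign and over k < j with a minus sign, which by antisymmetry is exactly Σ_{k≠j} α_jk.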
Based on Proposition 1, we derive an equivalent model of Eq. (1):

Y = Σ_{j=1}^p β_j log x_j + ε, subject to Σ_{j=1}^p β_j = 0. (3)

Model Eq. (3) has been proposed for compositional data [16] and recently also for reference point insensitive data [17]. The non-existence of a one-to-one correspondence between α and β is due to the fact that model Eq. (1) is largely non-identifiable. In fact, for any cycle formed by a sequence of indices i_1, i_2, ..., i_k, i_{k+1} = i_1, we can add any constant c to the α_{i_j i_{j+1}}'s formed by adjacent indices without changing the model. That is, we can construct α' with α'_{i_j i_{j+1}} = α_{i_j i_{j+1}} + c for j = 1, ..., k and α'_{ij} = α_{ij} otherwise, and model Eq. (1) with α' is equivalent to that with α. Therefore, when we obtain a solution β̂ by solving Eq. (3), there are, by the argument above, an infinite number of α's related to β̂ through Eq. (2). Among them, we would like to choose the "simplest" one.
In this paper, we define the "simplest" α to be the one(s) with the minimum L1-norm, where the L1-norm of a vector α is the sum of the absolute values of its entries. In practice, given an estimate β̂ of β from Eq. (3), an estimate of α can be obtained using Algorithm 1 below.
Step 1: Identify one positive and one negative β̃_j, say β̃_{k_1} > 0 and β̃_{k_2} < 0, where k_1 and k_2 are two distinct indices from {1, ..., p}. For instance, β̃_{k_1} and β̃_{k_2} can be taken as the most positive and the most negative (ties broken arbitrarily) of the β̃_j's. This step can always proceed as long as not all β̃_j's are zero, since the β̃_j's sum to zero.
Algorithm 1 terminates in at most p steps because the number of nonzeros in β̃ decreases by either 1 or 2 after each iteration. Note that the β̃_{k_1} and β̃_{k_2} identified in Step 1 may not be unique, so the algorithm may lead to different α̂'s. Importantly, it always yields a minimum L1-norm estimate of α (see Proposition 5 later in this section).
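Algorithm 1 can be sketched as follows. Step 1 follows the text; the per-iteration update (setting α̂_{k_1 k_2} = min(|β̃_{k_1}|, |β̃_{k_2}|) and shrinking the two β̃'s accordingly) is inferred from the proof of Proposition 5 in the Appendix, so this should be read as an assumption-laden reconstruction rather than the authors' exact pseudocode.

```python
import numpy as np

def recover_alpha(beta, tol=1e-12):
    """Sketch of Algorithm 1: recover an antisymmetric alpha from beta.

    beta must satisfy the sum-to-zero constraint. Step 1 picks the most
    positive and most negative entries; the update is inferred from the
    proof of Proposition 5 (an assumption, not the paper's pseudocode).
    """
    beta = np.asarray(beta, dtype=float).copy()
    assert abs(beta.sum()) < 1e-9, "beta must sum to zero"
    p = beta.size
    alpha = np.zeros((p, p))
    while np.abs(beta).max() > tol:
        k1 = int(np.argmax(beta))   # most positive entry
        k2 = int(np.argmin(beta))   # most negative entry
        m = min(beta[k1], -beta[k2])
        alpha[k1, k2] += m          # alpha_{k1 k2} = m, alpha_{k2 k1} = -m
        alpha[k2, k1] -= m
        beta[k1] -= m               # at least one of the two entries
        beta[k2] += m               # becomes zero each iteration
    return alpha

beta = np.array([1.5, -0.5, -0.5, -0.5, 0.0])
alpha = recover_alpha(beta)
```

On this example (the first simulation setting below), the recovered α has α̂_12 = α̂_13 = α̂_14 = 0.5 and reproduces β through Eq. (2).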
The following two propositions characterize properties of such α with respect to its representations.

Proposition 2 (Minimum L1-norm of α) Given α and β satisfying Eq. (2), the following conditions are equivalent:

Proposition 3 (Uniqueness of the representation of α) Given α and β satisfying Eq. (2), the following conditions are equivalent: the conditions in Proposition 2 are met by α, and there does not exist a distinct α' also meeting them.

The following proposition establishes the relations between the numbers of nonzero elements of α and β under different settings. In view of condition (H) in Proposition 4, if the conditions of Proposition 2 are met with B = 2 or 3, then those of Proposition 3 must be satisfied.

Proposition 5 For any β̂ satisfying the sum-to-zero constraint, the corresponding α̂ produced by Algorithm 1 satisfies the conditions of Proposition 2.
The proofs of the propositions are supplied in Appendix.

Constrained penalized likelihood
Suppose a random sample {(y_i, x_i)}_{i=1}^n from model Eq. (3) is obtained, based on which the log-likelihood function l(β) can be written, under Gaussian errors and up to additive constants, as

l(β) = -(1/(2σ^2)) Σ_{i=1}^n (y_i - Σ_{j=1}^p β_j log x_{ij})^2.

In a high-dimensional situation, model Eq. (3) is over-parameterized when p > n, and hence l(β) has multiple maximizers. Towards this end, we introduce a constrained penalized likelihood as a generalization of Lasso regression using L1-regularization:

min_β (1/(2σ^2)) Σ_{i=1}^n (y_i - Σ_{j=1}^p β_j log x_{ij})^2 + λ Σ_{j=1}^p |β_j|, subject to Σ_{j=1}^p β_j = 0. (4)
Minimization of Eq. (4) in β yields the minimizer β̂. Since the factor σ^2 can be absorbed into the regularization parameter λ, we set σ = 1 in the objective function for simplicity.
In contrast to the Lasso problem, Eq. (4) has one additional linear constraint. The coordinate descent algorithm has been shown to be very efficient for solving L1-penalized problems [21], since the nondifferentiable L1 penalty is separable in the β_j's. However, the sum-to-zero constraint destroys this separability, so that the coordinate descent algorithm can no longer guarantee convergence. In [17], the authors proposed adding diagonal moves and a random local search to the coordinate descent algorithm, which improves the chance of convergence.
To deal with this convex optimization subject to a linear constraint, we develop an algorithm using the alternating direction method of multipliers (ADMM) [18,19] to solve Eq. (4) iteratively; see the Appendix for details. In each iteration, we derive an analytic updating formula to expedite convergence, and convergence of the ADMM algorithm is guaranteed by a result in Section 3.1 of [18]. We compare our ADMM algorithm with the algorithm proposed in [17] in the "Results" section.

Comparison of ADMM and CD+LS algorithms
This section compares our ADMM algorithm with a coordinate descent with local search (CD+LS) algorithm of [17] for Eq. (4) with respect to computational efficiency through one simulated example. The CD+LS algorithm is implemented in the R package zeroSum (version 1.0.4, https://github.com/rehbergT/zeroSum). In this example, we consider correlated predictors: the x_i's are drawn iid from N(0, V), independently of the ε_i's, which are sampled from N(0, 1), where V is a p × p matrix whose ijth element is 0.5^|i-j|. Moreover, the true β_j's are drawn iid from N(0, 1) and then centered to sum to zero, and λ is fixed at 1. The rates of successful convergence and the running times are recorded for the ADMM and CD+LS algorithms over 20 simulations. In particular, a precision (tolerance) of 10^-10 is used for both algorithms. Successful convergence is reported if a solution from an algorithm satisfies the sum-to-zero constraint within a tolerance of 10^-8, and both the solution and its objective value are no further than 10^-8 from the optimal solution and its corresponding objective value in terms of the L2-distance. Here, the optimal solution is defined as the one, among the two solutions from the two algorithms, having the minimal objective value while satisfying the sum-to-zero constraint.
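For reference, data following this design can be generated as below. This is a sketch under the stated assumptions (AR(1) covariance with base 0.5, centered N(0,1) coefficients, unit-variance noise); the function name and defaults are ours.

```python
import numpy as np

def simulate(n=100, p=50, seed=0):
    """Generate data as in the timing comparison: predictors with AR(1)
    covariance V_ij = 0.5**|i-j|, true beta iid N(0,1) centered to sum
    to zero, and iid N(0,1) noise.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    V = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), V, size=n)
    beta = rng.normal(size=p)
    beta -= beta.mean()            # enforce the sum-to-zero constraint
    y = X @ beta + rng.normal(size=n)
    return X, y, beta

X, y, beta = simulate()
```

Centering by the mean is the orthogonal projection of the raw coefficients onto the constraint set {β : Σ_j β_j = 0}.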
Four different settings are compared, ranging from low- to high-dimensional situations. As shown in Table 1, the proposed ADMM algorithm outperforms the CD+LS algorithm with respect to both convergence guarantees and running time.

Comparison of the proposed method and the Lasso
This section examines the effectiveness of the proposed method through simulated examples. Specifically, the proposed method is compared with the Lasso in terms of predictive accuracy and identification of the true model, where the Lasso is implemented in the R package glmnet [21].
Simulated examples are generated with correlated predictors, as described below. These simulations are designed to examine various operating characteristics of the proposed method with respect to (p, n), the noise level σ^2, and correlation structures among the predictors in Eqs. (1) and (3). For tuning, λ is searched over 100 grid points uniformly spaced (on the log-scale) between 10^4 and 10^-2. An independent testing dataset with 1000 randomly generated data points is used to find the optimal λ minimizing the mean squared error (MSE).
For performance metrics, an independent validation dataset with 1000 randomly generated data points is used to evaluate the fitted model in terms of mean squared error (MSE) and R^2. To assess robustness of the approaches under data normalization, we randomly add sample-wise shifts from N(0, 1) to the validation dataset. Furthermore, we consider two other metrics, for parameter estimation and for the quality of identification of the zero elements of the true α in Eq. (1) and β in Eq. (3). For parameter estimation, we use the relative error RE = ‖γ̂ - γ‖_2/‖γ‖_2, where γ denotes α or β. For identification, we report the false identification rate (FR) of γ̃, with γ̃ being a truncated version of γ̂ such that only the coefficients with the ‖γ‖_0 largest absolute values are retained, and all others are zeroed out. Our simulated example concerns correlation structures among predictors. In Eqs. (1) and (3), the log x_i's are iid from N(0, V) and independent of the ε_i's, which are iid from N(0, σ^2); V is a p × p matrix whose ijth element is 0.5^|i-j|, and α = (α_12, α_13, ..., α_{p-1,p})^T. Three settings for α are considered: 1) α_12 = 1, α_13 = 0.5, α_24 = 0.5, and α_jk = 0 otherwise, which does not satisfy the conditions defined in Proposition 2, with β_1 = 1.5, β_2 = β_3 = β_4 = -0.5, and β_j = 0 for j ≥ 5. The proposed method is compared with the Lasso in two models, corresponding to the gene-pair level design matrix z in Eq. (1) and the gene-level design matrix in Eq. (3) (without the sum-to-zero constraint), referred to as Lasso 1 and Lasso 2, respectively. The three methods are examined in simple and difficult situations with (p, n, σ) = (25, 50, 0.5) and (100, 25, 0.2), respectively, and the values of MSE, R^2, RE, and FR are reported. As suggested in Table 2, the proposed method performs better than or comparably to Lasso 1 and Lasso 2 in terms of accuracy and robustness across all six situations. The improved performance is attributed to the reduced number of candidate parameters in Eq. (3) compared with Eq. (1), as well as to the sum-to-zero constraint introduced in Eq. (3).
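The two identification metrics can be computed as follows. The truncation rule for FR used here (keep the ‖γ‖_0 largest-magnitude estimates, then count retained positions whose true value is zero) is one plausible reading of the definition above and should be treated as an assumption.

```python
import numpy as np

def relative_error(gamma_hat, gamma):
    """RE = ||gamma_hat - gamma||_2 / ||gamma||_2."""
    return np.linalg.norm(gamma_hat - gamma) / np.linalg.norm(gamma)

def false_rate(gamma_hat, gamma):
    """False identification rate (one plausible reading of the text):
    retain the ||gamma||_0 largest-magnitude entries of gamma_hat, zero
    the rest, and report the fraction of retained positions where the
    true gamma is zero.
    """
    s = int(np.count_nonzero(gamma))
    keep = np.argsort(-np.abs(gamma_hat))[:s]
    return float(np.mean(gamma[keep] == 0))

gamma = np.array([1.5, -0.5, -0.5, -0.5, 0.0, 0.0])
gamma_hat = np.array([1.4, -0.45, -0.55, -0.02, 0.4, 0.0])
```

With these toy values, one of the four retained positions is truly zero, so the false identification rate is 0.25.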
Interestingly, the false identification rates of the proposed method are almost zero in the three low-noise settings of (p, n, σ) = (20, 50, 0.5), regardless of whether the conditions in Propositions 2 and 3 are met, and are small in the other three settings (Table 2 reports sample means, with standard errors in parentheses, of MSE, R^2, relative error (RE) and false identification rate (FR), based on 20 simulation replications, for the proposed method and the Lasso). In contrast, Lasso 1 has a higher relative error and false identification rate. While Lasso 2 has a relative error and false identification rate similar to those of the proposed method, it has higher MSE and lower R^2 in all settings, due to its non-robustness to sample-wise shifts without the sum-to-zero constraint. Overall, all three methods perform better in the low-noise situation of (p, n, σ) = (20, 50, 0.5) than in the high-noise situation of (p, n, σ) = (100, 25, 0.2). Across the three settings of α, the performance of the proposed method is rather stable. However, Lasso 1 performs much worse in settings where α fails to satisfy the conditions in Proposition 3, corresponding to non-uniqueness of the representation of α. Most critically, even when α satisfies the conditions in Proposition 3, the proposed method continues to outperform its counterparts in terms of the performance metrics. Overall, the proposed method achieves our objective.

An application to a real RNA-Seq dataset
This section applies the proposed method to a prostate adenocarcinoma (PRAD) RNA-Seq dataset published as part of The Cancer Genome Atlas (TCGA) project [22]. In particular, we identify gene pairs that are associated with pre-operative prostate-specific antigen (PSA), an important risk factor for prostate cancer. Towards this end, we download normalized gene expression data from the TCGA data portal (https://tcga-data.nci.nih.gov/docs/publications/tcga/). As described by TCGA, tissue samples from 333 PRAD patients were sequenced using Illumina sequencing instruments. Raw sequencing reads were processed and analyzed using the SeqWare Pipeline 0.7.0 and the MapspliceRSEM workflow 0.7 developed by the University of North Carolina: reads were aligned to the human reference genome using MapSplice [23], and gene expression levels were estimated using RSEM [24] with gene annotation GAF 2.1 and further normalized so that the upper-quartile count is 1,000 in each sample. All these steps were performed by the TCGA consortium. In our experiment, the normalized RSEM gene expression estimates are used, excluding samples with missing pre-operative PSA values and genes whose average normalized expression level is lower than 10. This preprocessing step yields p = 15,382 genes and n = 187 samples. Furthermore, we run Pearson correlation tests for each gene between log-transformed expression levels and log-transformed pre-operative PSA levels, and exclude genes with false discovery rate (FDR) values larger than 0.01, calculated using the Benjamini-Hochberg method [25] on the p-values from the Pearson correlation tests. Consequently, 520 genes are retained in the analysis, on which we fit model Eq. (3) using the proposed ADMM algorithm.
To visualize the selection result, we display the solution paths of the model fitting. As shown in Fig. 1, the first pair of genes entering the model is PTPRR and KRT15. While PTPRR is a member of the protein tyrosine phosphatase (PTP) family, which is known to be related to prostate cancer [26,27], KRT15 is a member of the keratin gene family, which is known to be associated with breast cancer [28] and lung cancer [29]. Interestingly, we find no publication record on PubMed with keywords such as "KRT15 AND PSA" or "KRT15 AND prostate". By correlating log expression levels and log PSA levels in the 187 patients, we find that both PTPRR and KRT15 are significantly correlated with PSA levels (r = 0.28 and p < 10^-4 for PTPRR; r = -0.33 and p < 10^-5 for KRT15). Not surprisingly, their log-ratio is even more strongly correlated with log PSA levels (r = 0.41 and p < 10^-8), demonstrating the potential of using gene pairs as biomarkers.
To demonstrate the scalability of the proposed method, we employ the ADMM algorithm to fit Eq. (3) with all p = 15,382 genes, without pre-screening. In this situation, the first pair of genes entering the model is BCL8 and KRT15, where BCL8 is known to be associated with lymphoma [36]. The other selected genes are PTPRR, LRAT and LCN2, largely consistent with the foregoing results based on pre-screening. By comparison, fitting the corresponding model Eq. (1) using a standard Lasso algorithm such as glmnet [21] would be practically prohibitive, as it requires storing a design matrix with p(p-1)/2 × n ≈ 22 × 10^9 elements. To further demonstrate the robustness of the proposed method with respect to data normalization, we randomly scale the gene expression levels along both the gene dimension and the sample dimension, mimicking gene length normalization and library size (sequencing depth) normalization, respectively. Numerically, we find that the solution path fitted using the randomly scaled data is always identical to that fitted using the original data.
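The invariance to sample-wise scaling follows directly from the sum-to-zero constraint: multiplying a sample's expression vector by a constant c shifts every log-expression by log c, which contributes log(c) Σ_j β_j = 0 to the fitted value. A minimal numeric check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 10
beta = rng.normal(size=p)
beta -= beta.mean()                 # sum-to-zero coefficients
x = rng.lognormal(size=p)           # one sample's expression levels
c = 37.0                            # arbitrary sample-wise scaling factor

pred_original = beta @ np.log(x)
pred_scaled = beta @ np.log(c * x)  # log(c*x) = log(c) + log(x)

# log(c) * sum(beta) = 0, so the fitted value is unchanged.
assert abs(pred_original - pred_scaled) < 1e-9
```

The same cancellation explains why the pairwise log-ratios of Eq. (1) are scaling-invariant: log((c x_j)/(c x_k)) = log(x_j/x_k).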

Discussion
Model Eq. (3) has been proposed for compositional data [16] and recently for reference point insensitive data [17]. In this article, we explore Eq. (3) for the identification of gene pairs as biomarkers, enjoying robustness to sample-wise scaling normalization (a common practice for RNA-Seq data) and simplicity of validation and measurement by qPCR techniques. Through Propositions 1-4, we establish the relationship between models Eq. (1) and Eq. (3). Additionally, we develop an efficient ADMM algorithm for solving model Eq. (3), which is guaranteed to converge and is shown to be highly competitive in terms of computational efficiency.
One interesting yet important issue for model Eq. (3) is the determination of α. One proposal is to choose the α minimizing the L0-norm instead of the L1-norm. However, in that situation, it remains unclear what the analogues of the conditions in Propositions 2, 3 and 4 would be. Furthermore, minimization of the L0-norm in α remains challenging in itself due to the non-convexity and discontinuity of the L0-function. Our approach based on the L1-norm, by contrast, gives rise to a convex minimization, which is easier to manage.
One important aspect of model Eq. (3) is that it enables the identification of gene pairs in an unbiased manner, without prior knowledge of the known biology of the disease. However, in some situations, information regarding the disease of interest is available from previous studies. The prior knowledge can then be integrated into gene pair identification so that more plausible subsets of genes or pathways have a higher chance of being selected. This can be accomplished through weighted regularization with weights {λ_k}_{k=1}^p, with a large weight corresponding to a small chance of being selected. Moreover, in some other situations, gene pairs are constrained to be formed only between relevant genes from the same pathway. This can be achieved by replacing the Lasso penalty with a (sparse) group Lasso penalty [37], and/or the single equality constraint with a set of constraints, each corresponding to a given pathway of interest. Finally, non-convex penalties such as the SCAD penalty [14] and the TLP penalty [15] can be used in place of the Lasso penalty to achieve higher selection accuracy at the expense of computation.
For a large-scale problem, an ADMM algorithm may have a linear convergence rate. To expedite convergence, an inexact ADMM algorithm may be useful in our setting, which has been shown recently to lead to a substantial improvement over the standard ADMM algorithm [19]. Furthermore, parallelization of our ADMM algorithm may achieve further scalability, which is one advantage of ADMM algorithms over many other optimization techniques [18].
One extension of Eq. (1) is to generalized linear models, such as logistic regression, or to other predictive models such as support vector machines. In such situations, the proposed method for Eq. (1) applies to gene pair identification with some modifications; further investigation is warranted.

Conclusion
In conclusion, the experimental results demonstrate that gene pairs can be used as robust biomarkers which can tolerate sample-wise scaling normalization. Furthermore, using L 1 penalized regression with equality constraints, the model fitting can be formulated as a convex optimization problem which can be solved efficiently using the proposed ADMM algorithm. This approach has the potential to discover novel and reliable biomarkers for biological or clinical studies.
This completes the proof.

Proof of Proposition 4:
(G): By Eq. (2), β_j ≠ 0 only if α_jk ≠ 0 for at least one k ≠ j. Based on α and β, we construct an undirected graph G = (V, E) with p vertices such that there is an edge between vertices i and j if and only if α_ij ≠ 0. We know that β_j cannot be nonzero unless vertex V_j has degree at least 1. Since the total number of edges equals the number of nonzero pairs α_ij (each pair counted once), and each edge contributes to the degrees of exactly two vertices, the number of vertices of degree at least 1 is at most twice the number of edges.

(H): Suppose α satisfies the conditions defined in Proposition 2. If α_ij ≠ 0, let α_ij be the weight associated with the edge connecting V_i and V_j. By condition (B), for any cycle in the graph formed by a sequence i_1, i_2, ..., i_k, i_{k+1} = i_1, the weights associated with adjacent edges (i.e., α_{i_{j-1} i_j} and α_{i_j i_{j+1}}) always have opposite signs. Therefore, the number of edges in the cycle has to be even, which means that the graph has to be bipartite. It is then easy to see that A ≤ B.

Proof of Proposition 5: It suffices to show that α̂ satisfies condition (C) of Proposition 2. Note that β̃ in Algorithm 1 satisfies the sum-to-zero constraint at each step of the iteration before termination. In the beginning, ‖α̂‖_1 = 0 and ‖β̃‖_1 = ‖β̂‖_1. After each iteration, ‖α̂‖_1 is increased by α̂_{k_1 k_2} = min(|β̃_{k_1}|, |β̃_{k_2}|), and ‖β̃‖_1 is decreased by 2α̂_{k_1 k_2}.

ADMM algorithm for solving Eq. (4)
We adopt the notation in [18] and reformulate Eq. (4) as a Lasso problem in the parameters of interest x, subject to the linear constraint Cx = d. If C = 1^T and d = 0, all coefficients sum to 0. When there is an intercept in the model, we add a scalar 0 as the first element of C, meaning that the intercept is unconstrained; as a convention, we also do not penalize the intercept. Denoting B = [C; I] (the constraint matrix stacked on the identity) and letting E = [A; √ρ C], the ADMM algorithm consists of iterations with analytic updating formulas for the primal and dual variables. Usually, the relative stopping criterion is chosen as ε_rel = 10^-4, and the choice of the absolute stopping criterion ε_abs depends on the scale of the variable values; see Boyd et al. [18] for details. To compute a solution path for a decreasing sequence of λ values, we adopt the approach of Friedman et al. [21] and use warm starts for each λ value. The sequence of λ values is either provided by the user, or we begin with λ_max = ‖A^T b‖_∞, for which all coefficients are 0, set λ_min = ε λ_max with ε a small value such as 0.01, and generate a decreasing sequence of 100 λ values from λ_max to λ_min on the log-scale.
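As a concrete sketch, an ADMM scheme of this kind can be implemented with the splitting x = z and the sum-to-zero constraint handled through an augmented quadratic term. This follows the standard scaled-dual recipe of Boyd et al. [18] and is a generic reconstruction under our own choices of names and of ρ, not the paper's exact implementation.

```python
import numpy as np

def soft(v, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_zero_sum_lasso(A, b, lam, rho=1.0, n_iter=2000):
    """ADMM sketch for min_x 0.5*||Ax - b||^2 + lam*||x||_1
    subject to sum(x) = 0, using the splitting x = z and scaled duals.
    A generic reconstruction, not the paper's exact updates.
    """
    n, p = A.shape
    ones = np.ones(p)
    # x-update solves (A'A + rho*(I + 11')) x = A'b - rho*u1*1 + rho*(z - u2).
    M_inv = np.linalg.inv(A.T @ A + rho * (np.eye(p) + np.outer(ones, ones)))
    Atb = A.T @ b
    z = np.zeros(p)
    u1 = 0.0                       # scaled dual for sum(x) = 0
    u2 = np.zeros(p)               # scaled dual for x = z
    for _ in range(n_iter):
        x = M_inv @ (Atb - rho * u1 * ones + rho * (z - u2))
        z = soft(x + u2, lam / rho)   # analytic z-update (soft-threshold)
        u1 += x.sum()
        u2 += x - z
    return x, z

rng = np.random.default_rng(3)
A = rng.normal(size=(40, 15))
beta = np.zeros(15); beta[0], beta[1] = 1.0, -1.0
b = A @ beta + 0.1 * rng.normal(size=40)
x, z = admm_zero_sum_lasso(A, b, lam=1.0)
```

With C = 1^T, the x-update matrix A^T A + ρ(I + 11^T) equals E^T E + ρI for E = [A; √ρ 1^T], matching the E defined above; the x-update is the analytic formula, and the z-update is a simple soft-thresholding.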

Availability of data and materials
The datasets and R programs for reproducing the results in this paper are available at http://www-personal.umich.edu/~jianghui/gene-pair/.