On the inconsistency of ℓ1-penalised sparse precision matrix estimation

Background Various ℓ 1-penalised estimation methods such as graphical lasso and CLIME are widely used for sparse precision matrix estimation and learning of undirected network structure from data. Many of these methods have been shown to be consistent under various quantitative assumptions about the underlying true covariance matrix. Intuitively, these conditions are related to situations where the penalty term will dominate the optimisation. Results We explore the consistency of ℓ 1-based methods for a class of bipartite graphs motivated by the structure of models commonly used for gene regulatory networks. We show that all ℓ 1-based methods fail dramatically for models with nearly linear dependencies between the variables. We also study the consistency on models derived from real gene expression data and note that the assumptions needed for consistency never hold even for modest sized gene networks and ℓ 1-based methods also become unreliable in practice for larger networks. Conclusions Our results demonstrate that ℓ 1-penalised undirected network structure learning methods are unable to reliably learn many sparse bipartite graph structures, which arise often in gene expression data. Users of such methods should be aware of the consistency criteria of the methods and check if they are likely to be met in their application of interest.


INTRODUCTION
Estimating the sparse precision matrix, i.e. the inverse covariance matrix, from data is a very widely used method for exploring the dependence structure of continuous variables.The motivation for the approach stems from the fact that for a Gaussian Markov random field model, zeros in the precision matrix translate exactly to absent edges in the corresponding undirected Gaussian graphical model, thus being informative about the marginal and conditional independence relationships among the variables.
The full p-dimensional covariance matrix contains p(p + 1)/2 parameters, making its accurate estimation from limited data difficult.Additionally, the structure learning requires the inverse of the covariance, and matrix inversion is in general a very fragile operation.To make the problem tractable, some form of regularisation is typically needed.Direct optimisation of the sparse structure would easily lead to very difficult combinatorial optimisation problems.To avoid these computational difficulties, several convex 1 -penalty-based approaches have been proposed.Popular examples include 1 -penalised maximum likelihood estimation (Meinshausen and Bühlmann, 2006), which also forms the basis for the highly popular graphical lasso (glasso) algorithm (Friedman et al., 2008). 1 regularisation has also been used for example in a non-probabilistic alternative with linear-programming-based constrained 1 minimisation (CLIME) algorithm of Cai et al. (2011).
At the heart of the optimisation problems considered by all these methods is a term depending on the 1 norm of the estimated precision matrix. 1 -penalisation-based approaches such as lasso are popular for sparse regression, but they have a known weakness: in addition to promoting sparsity they also push true non-zero elements toward zero (Zhao and Yu, 2006).In the context of precision matrix estimation this effect would be expected to be especially strong when some elements of the precision matrix are large, which happens for scaled covariance matrices when the covariance matrix becomes ill-conditioned.This phenomenon occurs frequently under the circumstances where some of the variables are nearly linearly dependent.
In this paper we demonstrate a drastic failure of the 1 penalised sparse covariance estimation methods for a class of models that have a linear latent variable structure where some variables depend linearly on others.For such mod-els even in the limit of infinite data, popular 1 penalised methods cannot yield results that are significantly better than based on random guessing on any setting of the regularisation parameter.Yet these models have a very clear sparse structure that becomes obvious from the empirical precision matrix with an increasing n.
Given the huge popularity and success of linear models in modelling data, structures like the one considered in our work are natural for various real world data sets.Motivated by our discovery, we also explore the inconsistency of 1 penalised methods on models derived from real gene expression data and find them poorly suited for such applications.

BACKGROUND
We start with a quick recap on the basics of Gaussian graphical models in order to formulate the problem of structure learning.For a more comprehensive treatment of the subject, we refer to (Whittaker 1990;Lauritzen 1996).Let X = (X 1 , . . ., X p ) denote a random vector following a multivariate normal distribution with zero mean and a covariance matrix Σ, X ∼ N p (0, Σ).Let G = (V, E) be an undirected graph, where the V = {1, . . ., p} is the set of nodes and E ⊂ V × V stands for the set of edges.
The nodes in the graph represent the random variables in the vector X and absences of the edges in the graph correspond conditional independence assertions between these variables.More in detail, we have that (i, j) ∈ E and (j, i) ∈ E if and only if X i is conditionally independent of X j given the remaining variables in X.
In the multivariate normal setting, there is a one-to-one correspondence between the missing edges in the graph and the off-diagonal zeros of the precision matrix Lauritzen 1996, p. 129).Given an undirected graph G, a Gaussian graphical model is defined as the collection of multivariate normal distributions for X satisfying the conditional independence assertions implied by the graph G.
Assume we have a complete (no missing observations) i.i.d.sample x = (x 1 , . . ., x n ) from the distribution N p (0, Σ).
Based on the sample x, our goal in structure learning is to find the graph G, or equivalently, learn the zero-pattern of Ω.The usual assumption is that the underlying graph is sparse.A naive estimate for Ω by inverting the sample covariance matrix is practically never truly sparse for any real data.Furthermore, if n < p the sample covariance matrix is rank-deficient and thus not even invertible.
One common approach to overcome these problems is to impose an additional 1 -penalty on the elements of Ω when estimating it.This kind of regularisation effectively forces some of the elements of Ω to zero, thus resulting in sparse solutions.In the context of regression models, this method applied on the regression coefficients goes by the name of lasso (Tibshirani, 1996).There exists a wide variety of methods making use of 1 -regularisation in the setting of Gaussian graphical model structure learning (Yuan and Lin 2007;Meinshausen and Bühlmann 2006;Banerjee et al. 2008;Friedman et al. 2008;Peng et al. 2009;Cai et al. 2011;Hsieh et al. 2014).

1 -REGULARISED METHODS
In this section we provide a brief review of selected examples of different types of 1 -penalised methods.

Glasso
We begin with the widely used graphical lasso-algorithm (glasso) by Friedman et al. (2008).Glasso-method maximises an objective function consisting of the Gaussian loglikelihood and an 1 -penalty: where S denotes the sample covariance matrix and λ > 0 is the regularisation parameter controlling the sparsity of the solution.The 1 penalty, ||Ω|| 1 = i,j |ω ij |, is applied on all the elements of Ω, but the variant where the diagonal elements are omitted is also common.The objective function (1) is maximised over all positive definite matrices Ω and the optimisation is carried out in practice using a block-wise coordinate descent.Cai et al. (2011) approach the problem of sparse precision matrix estimation from a slightly different perspective.Their CLIME-method (Constrained 1 -minimisation for Inverse Matrix Estimation) seeks matrices Ω with a minimal 1 -norm under the following constraint

CLIME
where λ is the tuning parameter and |A| ∞ = max i,j |a ij | is the element-wise maximum.The optimisation problem min Ω ||Ω|| 1 subject to the constraint (2) does not explicitly force the solution to be symmetric, which is resolved by picking from estimated values ω ij and ω ji the one with a smaller magnitude into the final solution.In practice, the optimisation problem is decomposed over variables into p sub-problems which are then efficiently solved using linear programming.Liu and Luo (2015) introduced recently a method called Sparse Column-wise Inverse Operator (SCIO).The SCIO-method decomposes the estimation of Ω into a following smaller problems

SCIO
where S and λ are defined as before and e i is an i:th standard unit vector.The regularisation parameter λ can in general vary with i but this is omitted in our notation.The solutions βi form the columns for the estimate of Ω.Also for SCIO, the symmetry of the resulting precision matrix must be forced, and this is done as described in the case of CLIME.

The naive approach
In addition to the above-mentioned 1 -penalised methods, we consider two alternative approaches.In a "naive" approach, we simply take the sample covariance matrix, invert it, and then threshold the resulting matrix to obtain a sparse estimate for the precision matrix.The threshold value is chosen using the ground truth graph so that the naive estimator will have as many non-zero entries as there are edges in the true graph.Setting the threshold value according to the ground truth is of course unrealistic, however, it is nevertheless interesting to compare the accuracy of this simple procedure to the performance of the more refined 1 -methods, when also their tuning parameters are chosen in a similar fashion.

FMPL
Lastly, we consider a Bayesian approach which is based on finding a graph with a highest fractional marginal pseudolikelihood (FMPL) by Leppä-aho et al. (2016).The fractional marginal pseudo-likelihood is an approximation of the marginal likelihood and it has been shown to be a consistent scoring function in the sense that the true graph maximises it as the sample size tends to infinity, under the assumption that data are generated from a multivariate normal distribution.The FMPL-score decomposes over variables and in practice, the method identifies optimal Markov blankets for each of the variables, which are then combined into a proper undirected graph using any of the three different schemes commonly employed in graphical model learning: OR, AND and HC.

MODEL SELECTION CONSISTENCY
The assumptions required for a consistent model selection with an 1 -penalised Gaussian log-likelihood have been studied, for instance, in Ravikumar et al. (2011).The authors provide a number of conditions in the multivariate normal model that are sufficient for the recovery of the zero pattern of the true precision matrix Ω * with a high probability when the sample size is large.For our purposes, the most relevant condition is the following: Assumption 1.There exists α ∈ (0, 1], such that Here S ⊂ V × V is a set defining the support of Ω * , that is, the non-zero elements of Ω * (diagonal and the elements corresponding to the edges in the graphical model) and and Γ AB refers to the specific rows and columns of Γ indexed by A ⊂ V ×V and B ⊂ V ×V , respectively.The norm in the equation is defined as The above result applies to glasso.However, a quite similar result was presented for SCIO in Liu and Luo (2015): Assumption 2. There exists α ∈ (0, 1), such that Here Assumption 2 under the multivariate normality guarantees that the support of Ω * is recovered by SCIO with a high probability as the sample size gets large.

-PENALISATION
Methods for sparse precision matrix estimation generally depend on an objective function (such as log-likelihood) and a penalty function or regulariser, which in a Bayesian setting is usually represented by the prior.The ideal penalty function for many problems would be the 0 "norm" counting the number of non-zero elements: ||x|| 0 = #{i|x i = 0}.This 0 function is not a proper norm, but it provides a very intuitive notion of sparsity.The main problem with its use is computational: using 0 -penalisation leads to very difficult non-convex combinatorial optimisation problems.The most common approach to avoid the computational challenges is to use 1 -penalisation as a convex relaxation of 0 .As mentioned above this works well in many cases but it comes with a price, since in addition to providing the sparsity, 1 also regularises large non-zero values.Depending on the problem, as we demonstrate here, this effect can be substantial and may cause 1 -regularised methods to return totally meaningless results.
Intuitively, 1 -regularised methods are expected to fail when some elements of the true precision matrix become so large that their contribution to the penalty completely overwhelms the other parts of the objective and the penalty.
One example where this happens is when some set of variables depends linearly on another set of variables.In such situation the covariance matrix can become ill-conditioned and the elements of its inverse, the precision matrix, grow.
One example of when this happens is models with a linear latent variable structure.
Let us consider a model for x ∈ R d1 , y ∈ R d2 , where y = Ax + .The graphical structure of the model and the corresponding precision matrix structure are illustrated in Fig. 1.Assuming x ∼ N (0, σ 2 x I), ∼ N (0, σ 2 I), the covariance of the concatenated vectors (x T , y T ) T is given by the block matrix The covariance matrix has an analytical block matrix inverse (Lu and Shiou, 2002) This precision matrix recapitulates the conditional independence result for Gaussian Markov random fields: the lower right block is diagonal because the variables in y are conditionally independent of each other given x.The matrix is clearly sparse, so we would intuitively assume sparse precision matrix estimation methods should be able to recover it.The non-zero elements do, however, depend on σ −2 which can make them very large if the noise σ 2 is small.
It is possible to evaluate and bound the different terms of Eq. (1) evaluated at the ground truth for these models: The magnitude of the last penalty term clearly grows very quickly as σ 2 decreases.Clearly the magnitude of the two first log-likelihood terms grows much more slowly as they only depend on log σ 2 .Thus the total value of Eq. (1) decreases without bound as σ 2 decreases.
Forgetting the ground truth, it is easy to see that one can construct an estimate Ω for which the objective remains bounded.If we assume all values of C to be ≤ 1 (after normalisation), trace(CΩ) ≤ ||Ω|| 1 .
As the other terms only depend on Ω it is easy to choose Ω so that they remain bounded.The estimate Ω that yields these values will in many cases not have anything to do with C −1 , as seen in the experiments below.

EXPERIMENTS
We tested the performance of glasso, SCIO and CLIME as well as FMPL using the model structure introduced in Sec. 3. The performance of the methods was investigated by varying the noise variance σ 2 , and the sample size n.The model matrix A was created as a (d 2 , d 1 )-array of independent normal random variables with mean 0 and variance 1.The majority of the tests were run using input dimensionality d 1 = 2, output dimensionality d 2 = 10 and noise variance σ 2 = 0.1 2 but we also tested varying these settings.For each individual choice of noise and sample size, k = 50 different matrices A were generated and the results were averaged.
Generating n samples using model described, data were normalised and analysed using the five different methods.
We calibrated the methods in a way that number of edges in the resulting graph would match the true number.Similarly, we thresholded the naive method by taking inverse matrix directly to output the correct number of edges.The FMPL method has no direct tuning parameters so we used its OR mode results as such.Similar tuning is not possible in a real problem where the true number of edges is now known.The tuning represents the best possible results the methods could obtain with an oracle that provides an optimal regularisation parameter.
We evaluated the results using the Hamming distance between the ground truth and the inferred sparsity pattern, i.e. the number of incorrect edges and non-edges which were treated symmetrically.For methods returning the correct number of edges, this value is directly related to the precision pr through We will nevertheless use the Hamming distance as it enables fair comparison with FMPL that sometimes returns a different number of edges.
Figs. 2 and 3 show the Hamming distance obtained by the different methods as a function of the noise level when using 100 and 1000 samples, respectively.The results show that especially for low but also for high noise levels, the 1 -based methods all perform very poorly with especially glasso and CLIME performing very close to random guessing level for low noise levels σ ≤ 0.1.The naive inverse and FMPL work much better up to moderate noise levels of σ ≈ 2 after which the noise starts to dominate the signal and the performance of all methods starts to drop.SCIO is a little better than the other 1 -based methods but clearly worse than FMPL and naive in the low noise regime.
Fig. 4 shows the results when changing the output dimensionality d 2 from 10.The results show that the performance of all 1 -based methods is very poor across all d 2 .Glasso performance is close to random guessing level across the entire range considered, while CLIME is slightly better for d 2 ≥ 18 and SCIO slightly better across the entire range.Both FMPL and naive are significantly better than any of the 1 -based methods.Fig. 5 shows the corresponding result when changing the input dimensionality d 1 .The results are now quite different as all methods are better than random especially for larger values.SCIO still outperforms CLIME which outperforms glasso.FMPL is really accurate for small d 1 but degrades for larger d 1 while the naive method is the most accurate in almost all cases.
To further illustrate the behaviour of glasso on these exam- ples, Fig. 6 shows the contributions of the different parts of the glasso objective function (1) as a function of the noise level both for the true solution ("truth") as well as the glasso solution.The results show that for low noise levels the penalty incurred by the true solution becomes massive.The glasso solution has a much lower log-likelihood ("logl") than ground truth but this is amply compensated by the significantly smaller penalty.As the noise increases, the penalty of the true solution decreases and the glasso solution converges to similar values.

NECESSITY OF ASSUMPTION 1
It can be checked that the norm γ in Assumption 1 and Eq.(3) for latent-variable-like models depends on the scale of A. We took advantage of this by creating examples with different values of γ and testing the precision of glasso using the true covariance which corresponds to infinite data limit.The results of this experiment are shown in Fig. 7.The results verify that glasso consistently yields perfect results when γ < 1 which is a part of the sufficient conditions for consistency of glasso.As γ grows and the sufficient conditions are no longer satisfied, it is clearly seen that the accuracy of glasso starts to deteriorate rapidly.This suggests that the sufficient condition of Assumption 1 is in practice also necessary to ensure consistence.1) and the blue curves show the contributions of the last penalty term.Solid lines show the result of the glasso optimal solution while dashed lines show the result for the true solution.

INCONSISTENCY FOR MODELS OF REAL GENE EXPRESSION DATA
We tested how often the problems presented above appear in real data using the "TCGA breast invasive carcinoma (BRCA) gene expression by RNAseq (Illumi-naHiSeq)" data set (Cancer Genome Atlas Network, 2012) downloaded from https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/.The data set contains gene expression measurements for 20530 genes for n = 1215 samples.After removing genes with a constant expression across all samples there are p = 20252 genes remaining.
In order to test the methods we randomly sampled subsets of d genes and considered the correlation matrix C 0 over that subset.We generated sparse models with known ground truth by computing the corresponding precision matrix Λ 0 from the empirical correlation matrix, setting elements with absolute values below chosen cutoff δ = 0.1 to 0 to obtain and the testing covariance matrix C = Λ −1 .The cutoff lead to networks that were sparse with on average 60% zeros in the precision matrix.values below more relaxed bounds.The figure shows that the assumption is reliably satisfied only for very small d while for d ≥ 20, the assumption is essentially never satisfied.Based on the results of Fig. 7 it is likely that glasso results will degrade significantly by for γ > 10 and beyond which are very common for large networks.
We further studied how accurately glasso can recover the graphical structures when the data were generated using the precision matrices described above.We used a similar thresholding with a cut-off value of 0.1 in order to first form sparse precision matrices for a random subset of genes with given dimension.These matrices were then inverted to obtain covariance matrices.We checked that the resulting matrices were positive definite and then used them to sample multivariate normal data with zero mean with different sample sizes.
The obtained data sets were centred and scaled before computing the sample covariance which was used as input to the glasso algorithm.The regularisation parameter was chosen with the aid of the ground truth graph, so that the the graph identified by glasso would contain as many edges as there were in the real graph.Results are shown in Figure 9.The results show that glasso performance decreases as the network size increases and is approaching that of random guessing for the largest networks considered here.Fig. 10 shows the contributions of different parts of the glasso objective function (1) as a function of the number of genes d.The regularisation parameter λ of glasso was tuned to return a solution with the same number of edges as in the true solution.We used the glasso implementation of scikit-learn (Pedregosa et al., 2011), which ignores the diagonal terms of Ω when computing the penalty.The figure shows clearly how the penalty term for the true so- lution increases superlinearly as a function of d. (A linear increase would correspond to a horizontal line.)The result is even more striking given that the optimal λ decreases slightly as d increases.The penalty contribution for glasso solution increases much more slowly.The excess loss in log-likelihood from glasso solution increases as d increases, but this is compensated by a larger saving in the penalty.Together these suggest that glasso solutions are likely to remain further away from ground truth as d increases.

DISCUSSION
The class of latent variable like models presented in Sec. 3 is an interesting example of models that have a very clear sparse structure, which all 1 -penalisation-based methods seem unable to recover even in the limit of infinite data.This class complements the previously considered examples of models where glasso is inconsistent including the "two neighbouring triangles" model of Meinshausen (2008) and the star graph of Ravikumar et al. (2011), the latter of which can be seen as a simple special case of our example.
An important question arising from our investigation is how significant the discovered limitation to inferring sparse covariance matrices is in practice, i.e. how common are the latent variable like structures in real data sets.Given the popularity and success of linear models in diverse applications it seems plausible such structures could often exist in real data sets, either as an intrinsic property or as a result of The gene expression data set is a natural example of an application where graphical model structure learning has been considered.The original glasso paper (Friedman et al., 2008) contained an example on learning gene networks, although from proteomics data.Other authors (e.g.Ma et al., 2007) have applied Gaussian graphical models and even glasso (e.g.Menéndez et al., 2010) to gene network inference from expression data.Our experiments on the TCGA gene expression data suggest that in such applications it is advisable to consider the conditions for the consistency of 1 penalised methods very carefully when planning to apply those.
Previous publications presenting new methods for sparse precision matrix have typically tested the method on synthetic examples where the true precision matrix is specified to contain mostly small values.Specifying the precision matrix provides a convenient way to generate test cases as the sparsity pattern can be defined very naturally through it.At the same time, this excludes any models that have an ill-conditioned covariance.As shown by our example, such ill-conditioned covariances arise very naturally from model structures that are plausible from the application perspective.
Ultimately, our results suggest that users of the numerous

Figure 1 :Figure 2 :
Figure 1: Left: Graphical representation of a latent variable model as an undirected graphical model for a case with somewhat sparse A. Right: The adjacency matrix of the graph showing the sparse pattern of non-zero elements in the corresponding precision matrix.

Figure 3 :
Figure 3: Performances of different methods on latent variable like model with 1000 samples.(Lower values are better.)

Figure 4 :Figure 5 :
Figure 4: Performances of different methods on latent variable like model with varying output dimensionality.(Lower values are better.)

Figure 6 :
Figure6: Contributions of the different terms of the glasso objective (1) for latent variable like model with 1000 samples.The green curves show the contributions of the first two terms of Eq. (1) and the blue curves show the contributions of the last penalty term.Solid lines show the result of the glasso optimal solution while dashed lines show the result for the true solution.

Fig. 8 Figure 7 :
Fig. 8 shows the fraction of covariances derived from random subsets of d genes that satisfy the Assumption 1 of Ravikumar et al. (2011) (c = 1) as well as the fraction of

Figure 8 :
Figure 8: Testing the condition of Assumption 1 of Ravikumar et al. (2011) in Eq. (3) on real gene expression data showing the fraction of random subsets of d genes that fulfil the requirement and various relaxations.The condition (3) requires γ < 1, but the figure shows results also for larger γ cutoffs.

Figure 9 :
Figure9: Average precisions for glasso with different dimensions and sample sizes of the real gene expression data, higher values are better.The precision obtained by random guessing is also illustrated.

Figure 10 :
Figure 10: Average contributions of the different terms of the glasso objective function (1) on real gene expression data over random subsets of d genes.The values are shown for the 1 penalty term as well as the unnormalised loglikelihood, divided by d to make them comparable.Solid lines show the values for glasso result while dashed lines show the result for ground truth.