Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries

Background The joint analysis of several categorical variables is a common task in many areas of biology, and is becoming central to systems biology investigations whose goal is to identify potentially complex interaction among variables belonging to a network. Interactions of arbitrary complexity are traditionally modeled in statistics by log-linear models. It is challenging to extend these to the high dimensional and potentially sparse data arising in computational biology. An important example, which provides the motivation for this article, is the analysis of so-called full-length cDNA libraries of alternatively spliced genes, where we investigate relationships among the presence of various exons in transcript species. Results We develop methods to perform model selection and parameter estimation in log-linear models for the analysis of sparse contingency tables, to study the interaction of two or more factors. Maximum Likelihood estimation of log-linear model coefficients might not be appropriate because of the presence of zeros in the table's cells, and new methods are required. We propose a computationally efficient ℓ1-penalization approach extending the Lasso algorithm to this context, and compare it to other procedures in a simulation study. We then illustrate these algorithms on contingency tables arising from full-length cDNA libraries. Conclusion We propose regularization methods that can be used successfully to detect complex interaction patterns among categorical variables in a broad range of biological problems involving categorical variables.


Background
One of the most striking discoveries of the genomic era is the unexpectedly small number of genes in the human genome. This amount has decreased from more than 100000 [1] to an estimated number of roughly between 20000 and 25000 [2,3], tens of thousands less than ini-tially expected and essentially the same number as found in phenotypically much simpler organisms. A question of overriding biological significance is, how complex phenotypes of higher organisms arise from limited genomes. Part of the explanation may be that many genes undergo a process called alternative RNA splicing, which can generate many distinct proteins from a single gene.
RNA splicing is a post-transcriptional process that occurs prior to mRNA translation. After the gene has been transcribed into a pre-messenger RNA (pre-mRNA), it consists of intronic regions destined to be removed during pre-mRNA processing (RNA splicing), as well as exonic sequences that are retained within the mature mRNA. After transcription occurs the actual splicing process, where it is decided which exons are retained in the mature message and which are targets for removal. In general, exons and introns are retained and deleted in different combinations to create a diverse array of mRNAs from a common coding sequence. This process is known as alternative RNA splicing. Depending on the source, the percentage of alternatively spliced genes lies between 35% and 60% [4][5][6][7][8][9][10]. By screening many full-length cDNAs it is possible to record the complete cDNA from a mature RNA for the same gene again and again and a full-length cDNA library, also known as single-gene library (SGL), builds up. The library contains detailed information about how specific exon combinations go together. This information is directly related to the functional regions of the proteins as they are grouped in domains which in many cases correspond to a single exon which encodes these domains. For example a transcription factor consists of a DNA binding domain and a regulatory domain. Thus the alteration of the exon structure corresponds to an alteration in the function of this particular domain. The central premise is that a dependency in the domains points to a functional association. If domains interact functionally then their splicing should be co-regulated. And this co-regulation has direct biological significance because it shows us which variable components also interact in the expressed protein. Because the polypeptide is intricately folded and tightly packed, segments that are separated by dozens of introns in the primary transcript may encode domains that interact functionally within the protein. These domains need not be structural neighbors even in the folded protein, but may interact through electrical or van der Waals forces, effects of global conformational changes, or even associations with other proteins. Because of these intricacies, there are no inherent distance restrictions, or limits on the number of interacting sites, and separate domains may combine their functional effect in unpredictable ways.
Due to the large number of potential combinations in highly alternatively spliced genes, any library will only comprise a small portion of the total theoretically possible inventory of combinations. Statistically, this leads to sparse contingency tables in which dimensions represent exons and cells represent variants. The investigation of interactions among categorical variables where not all possible combinations are observed, means addressing a model selection problem that is challenging both inferentially and computationally.
As far as alternative splicing is concerned, there is an important reason to determine this interaction structure: searching for intrapeptide interactions in functional assays is a very difficult, open-ended problem, where statistical analysis of the splicing interaction structure in the transcriptome can simplify this task enormously by identifying the sets of interacting domains. And as more investigators become interested in this type of information, and large-scale single-gene libraries become available, there is a strong need for reliable statistical methods for analyzing the resulting datasets.
We develop different statistical methods to analyse sparse contingency tables in order to determine the underlying interaction pattern and we use graphical models to visualize these patterns. The methods are compared in a simulation study and illustrated on full-length cDNA libraries.

General introduction to contingency tables and Log-linear Models
In this section we provide general definitions and notations.
Assume we have q categorical random variables or factors, C = {C 1 ,..., C q }, where each C j can take on a finite number g j of possible values, called levels. The vector (c 1 ,..., c q ) represents a particular combination of levels of the joint random variable C = {C 1 ,..., C q }. The total cardinality of C is , which corresponds to the m different combinations of levels (m = 2 q when all C j are dichotomous, as in our splicing example).
We simplify the notation by mapping each configuration of C to a unique natural number i ∈ {1,..., m} with a (bijective) function f: so we may write c i = (c 1 ,..., c q ). For n observations of C, the corresponding q-way contingency table has m cells, each listing the frequency of a particular configuration c i : A general introduction to contingency tables can be found in [11]. If the observations are independent, with p i the probability of sampling configuration c i , the distribution of the cell counts (n 1 ,..., n q ) t is multinomial with probability p = (p 1 ,..., p q ).
In the splicing example, we may consider the C j as dichotomous random variables representing q sites of alternative splicing, each with two levels, denoted by c j ∈ {1, -1}, corresponding to the presence or absence of exon j in a transcript. The contingency table therefore has m = 2 q cells, with each cell represented by the q-dimensional binary vector c i = (c 1 ,..., c q ). A log-linear model for the cell probabilities can be written the following way: A general log-linear model represents p as: where β is a vector of unknown coefficients and X a suitable design matrix as indicated below. Let's assume that the cell probabilities are expressed in the following way: where δ ∅ is the global mean, is the main effect of the first variable and only depends on the distribution of C 1 .
Similarly is the first order interaction between the first two variables and its value only depends on the joint distribution of these two variables.
We now look for a suitable parametrization of the vector spaces spanned by the main effects , a parametrization for the vector spaces spanned by the first order interactions and so on. To ensure identifiability, we impose constraints on these matrices and denote the resulting matrices by , and so on. The design matrix X finally consists of these submatrices. The constitution of the design matrix X for factors with two levels can directly be derived from (1). The derivation of the design matrix X from (3) in the case of more than two levels per factor is basically an analysis of variance (ANOVA) parametrization with poly-contrasts. Details can be found in Additional file 1 Section 1.
Sometimes we may assume a smaller model without some of the interaction terms. It is of the form as in (2) with some columns removed from the design matrix X. We denote matrices of the form by X a , with . The corresponding subvector of β is denoted by β a .

Graphical Models
A powerful way for visualizing conditional dependencies among variables is given by a graph. A graph consists of a finite set of vertices and a finite set of edges between these vertices. In our context, the vertices correspond to the different discrete random variables. We form the so-called Conditional Independence Graph by connecting all pairs of vertices that appear in the same generator, that is the maximal terms a ⊆ C which are present in the model. To translate a vector β into a graphical model we look for β a ≠ 0 with β b = 0 ∀ a ⊂ b (where b is a strict super-set of a and |a| > 1) and we draw edges between all vertices corresponding to a. From this graph we can directly read off all marginal and conditional independences by the global Markov property for undirected graphs which states: if two sets of variables a and b are separated by a third set of variables c then a and b are conditionally independent given c (a b|c), where for three subsets a, b and c of , we say c separates a and b if all paths from a to b intersect c. For details, see [12].

Model selection -Non-Hierarchical versus hierarchical models
In the following subsections we introduce different model selection strategies for log-linear models. We first develop an ᐍ 1 -regularization model selection approach, which is then expanded to the new so-called level-ᐍ 1 -regularization approach. In addition, different Bayesian model selection strategies, which we use for comparisons, are explained in Additional file 1 Section 2. Hierarchical models are a subclass of models such that if an interaction term β a is zero, then all higher order interaction terms β b for b ⊇ a are also zero. If we consider the example above with 2 levels, this means for example that if the first order interaction coefficient β ij = 0 then all higher order interaction coefficients including i and j are also zero, i.e. β ijk = 0, ∀ k. While it is possible that the true underlying interaction model may not be hierarchical from a biological standpoint, a difficulty in the use of non-hierarchical models arises from the fact that they are not invariant under reparametrization. We have chosen the design matrix X with some constraints to ensure identifiability, and we used a specific, namely an orthonormal basis. In terms of ANOVA, this choice is equivalent to choosing a poly-contrast. We could have imposed different constraints or have chosen a different basis, and this would have resulted in a different design matrix X or in terms of ANOVA, a different choice of contrast. Suppose we have found an interaction vector β for one parametrization of the log-linear model and that this vector corresponds to a non-hierarchical model, meaning there is at least one lower order interaction term β a equal to zero, while β b ≠ 0 for at least one b ⊇ a. If we reparametrize the model, using a different design matrix, the coefficient for the model term a may no longer be zero. On the other hand, by reparametrizing a hierarchical model, all zero terms remain zero after reparametrization. Therefore, hierarchicity is preserved after reparametrization while non-hierarchicity depends on the parametrization. This is a distinct advantage of working within the hierarchical class. In a hierarchical model, all zero coefficients can directly be interpreted in terms of conditional independence, while this is not true for non-hierarchical models.

ᐍ 1 -Regularized model selection
The Lasso, originally proposed by [13] for linear regression, performs regularized parameter estimation and variable selection at the same time. The Lasso estimate is defined as follows: where Y = (Y 1 ,..., Y n ) is the response vector. This can also be viewed as a penalized Maximum Likelihood estimator, as is proportional to the negative log-likelihood function for Gaussian linear regression. While the MLE for the general regression model is no longer uniquely defined and very poor in the case of more variables than observations, the Lasso estimator is still reasonable as long as λ > 0. For our analysis, we have a similar problem, namely that the MLE does not exist in case of zero counts in the contingency table: a detailed description of the existence of the MLE in general log-linear interaction models is given in [14]. Inspired by the Lasso, we estimate our parameter vector β by the following expression: where . This minimization has to be calculated under the additional constraint that the cell probabilities add to 1: A problem of the optimization (4) is that the solution is no longer independent of the choice of the orthogonal subspaces X a . That is, if any set of orthogonal columns X a of X is reparametrized by a different orthogonal set, we get a different solution. To avoid this undesirable outcome we use a penalty that is intermediate between the ᐍ 1 -and the ᐍ 2 -penalty. This penalty, called group-ᐍ 1 -penalty, has the following form: Originally, this has been proposed by [15] for the linear regression problem with factor variables. The estimator of β then becomes subject to the constraint in (5). By imposing a penalty function on the coefficients of the log-linear interaction terms, overfitting as it might occur by using MLE is reduced. Furthermore, the ᐍ 1 -penalty encourages sparse solutions for the single components of β, the group ᐍ 1penalty encourages sparsity at the interaction level, meaning that the vector β a , which corresponds to the interaction term a is either present or absent in the model as a whole. In case of factors with only 2 levels, the group ᐍ 1penalty and the ᐍ 1 -penalty are equivalent.  (7). We use a ten-fold cross-validation in our example.
The resulting model does not necessarily have to be hierarchical and if we consider the hierarchical model induced by this procedure, it might happen that the final model is large for example if a single high order interaction is estimated to be active. To address this, we set up an algorithm described in the next Section.

Level-ᐍ 1 -regularized model selection
In order to prevent the procedure from choosing single high-order interactions, we alter the ᐍ 1 -regularized algorithm described in the previous Section: we do not exclusively apply it to the fully saturated model but also to submodels with lower order interactions. Precisely, a model is fitted with main effects only, and the predictive negative log-likelihood score (7) is calculated for the best main effects model (level 1). The same is done for the model including all main effects and first order interactions (level 2). Proceeding accordingly, we get |C| log-likelihood scores corresponding to the |C| levels. The level with minimal score (7) is then chosen (and within this selected level, we have an ᐍ 1 -regularized estimate).
With this procedure the tendency of including a single high-order interaction while most of its lower order interactions are absent is decreased, and the inclusion is only forced if the predictive negative log-likelihood score strongly speaks in favour of the inclusion. Therefore we tend to select sparser models which can be better hierarchized and interpreted in terms of conditional independence, in contrast to the ordinary ᐍ 1 -model selection procedure.

Algorithm for ᐍ 1 -regularization for factors with two levels
For the regularization approaches we calculate over a large number of values of λ in order to do some cross-validation using (7). For this purpose, an efficient algorithm is required. As one can easily verify by introducing Lagrange multipliers, finding the solution to (6) under the constraint (5)  The algorithm, which is an adaption of the path following algorithm proposed by [16], is set up as follows: (1) Start with = (-log(m), 0,..., 0) (2) Set: λ 0 = 1, = {∅} and t = 0.
(3) While (λ t > λ min ) The pairs , obtained from the algorithm above, represent the estimates from (6) under the constraint (5) for a range of penalty parameters λ t e.g. (t = ε, 2ε...). The choice of the step length ε represents the tradeoff between computational complexity and accuracy. To increase accuracy, one can perform more than one Newton step (3.3) if the gradient starts deviating from zero. The coefficient δ is also flexible. Typically it is chosen in the order of ε. The lowest λ for which one wants the solution to be calculated is denoted by λ min . Technical details concerning the algorithm can be found in the Appendix.

Testing Data
We choose the true underlying interaction vector β con- With the simulated cell counts, is estimated with different methods described in the previous sections and these methods are then compared as follows:

Criteria
As a model selection score (MSS), we consider the fraction of correctly assigned model terms: Moreover, we consider the root mean squared error for the interaction coefficients, For assessing how much the estimation of β varies over multiple datasets, we calculate for every coefficient the estimated standard deviation . The means of these standard deviations are reported as a measure of variability.
To compare the different procedures for estimation of probabilities p = exp (Xβ), we calculate the negative loglikelihood score (NLS) similar to the score in (7):

Results of simulation study
The results of the simulation study are summarized in Table 1, where we also include the MAP estimators of the Bayesian approaches described in Additional file 1 Section 2. We notice that the penalty-based regularization approaches proposed in this article leads to comparable or better results than the Bayesian approaches with respect to the NLS-score, RMSE and the variation (SPREAD), though the results of Bayesian approaches vary with the prior and the set of possible priors has not been extensively explored.
The level-ᐍ 1 -regularization and the relaxed ᐍ 1 -regularization (see below) are both competitive and can be better than MCMC for model selection.
The results of the MCMC procedures are sensitive to the choice of the prior value or the prior distribution for σ 2 . A at prior for α a (σ 2 = 2) results in worse performance than that of a prior that shrinks the coefficients more towards zero (σ 2 = 1/2). This suggests that specification of this prior hyperparameter may be difficult in practice, while we can easily optimize λ in the regularization approach by cross-validation.
The MCMC approaches without model selection perform poorly, as should be expected from data generated by a sparse model. MCMC methods based on a non-hierarchical model selection are also clearly inferior to the hierarchical counterpart. This is not surprising, as we have simulated data from a hierarchical model. In Table 1 we have also added an additional approach, denoted by ᐍ 2 , the equivalent to the ᐍ 1 -regularization but using an ᐍ 2penalty instead of an ᐍ 1 -penalty on the coefficients of the log-linear model. This method is equivalent to the MAP estimator with Gaussian priors on β a , with the parameter of the distribution optimized by cross-validation. This Ridge-type method does not perform variable selection, but it is competitive for all other criteria that we assessed.
In addition we consider the relaxed ᐍ 1 -regularization approach. Rather than using a single penalty parameter λ, the idea of this method is to control variable selection and parameter estimation by incorporating two penalty parameters. For linear regression it has been proven theoretically as well as empirically [17] that under suitable conditions the relaxed ᐍ 1 -regularization is better than Lasso.
Overall, the level-ᐍ 1 -regularization has good model selection performance (high MSS score) in combination with low negative log-likelihood score (NLS) and a low mean squared error for the true β (RMSE). In addition, it is feasible to optimize the tuning parameter λ by cross-validation as the computational cost is very low compared to the MCMC approaches. On the other hand, posterior distributions of estimates from MCMC methods provide additional information about uncertainty in the model space, compared to point estimates from ᐍ 1 -or ᐍ 2 -regularization.

Dataset
We estimate the splicing interaction pattern for a dataset corresponding to the itpr1 gene, one of three mammalian genes encoding receptors for the second messenger inositol 1,4,5-trisphosphate (InsP 3 ). This gene is subject to alternative RNA splicing, with seven sites of transcript variation, 6 of these within the ORF and among these, q = 5 were completely assessed in the single-gene libraries. Five single-gene libraries were built, one for adult rat cerebrum as well as four for different stages of postnatal cerebellar development, namely on days 6, 12, 22 and 90, the latter being considered as adult. Each library consists of between 179 and 277 transcripts which were assessed, i.e. ∈ [179, 277]. This gene is 89% identical at the cDNA level and 95% identical at the amino acid level with the human receptor gene. The complete dataset can be found in [18].

Results of application to Single-Gene Libraries
Unless stated differently, we report the results using the level ᐍ 1 -penalization method. We display the interaction vector graphically by plotting the components for the different tissue and development stages in Figure 1.
Our results suggest that the exons interact mainly in pairs and there is no reliably estimated higher order interaction in the splicing interaction pattern of rat cerebellum. We further notice that the main interaction pattern is very well conserved over different developmental stages. A strong mutual interaction between exons number three, four and five can be observed in all development stages of rat cerebellum as well as in the cerebral tissue. The biggest changes in the interaction pattern during development of rat cerebellum occur from postnatal day six to day 12. This can be seen at position number 10 on the x-axis in Figure  1, and it corresponds to the first order interaction between exons two and three, and from day 12 to day 16, the first main effect changes in sign and magnitude. The first main effect decreases progressively from day 6 to adult, reversing in sign between day 12 and 22. Between day 22 and 90, the interaction pattern is strongly conserved. Comparing the splicing interaction patterns between cerebellum and cerebrum in the adult rat, we see a much more complex pattern in the cerebrum, involving several second order interactions, and therefore a clear distinction from that of the cerebellum.
The conditional independence graphs for the estimated log-linear models are drawn in Figure 2, where the thickness of the edges are proportional to the corresponding coefficient of the interaction vector (the largest, if there are several giving rise to the same edge) and the radius of the vertices are chosen proportional to the corresponding main effect coefficient. Figure 2 graphically exploits the strongly conserved interactions between exons three, four and five. Except for a rather strong interaction between exon two and three on day six, all other interactions appear to be rather small. The graphical representation of the interaction pattern of adult rat cerebrum reveals a β β Graphical display of interaction vector Figure 1 Graphical display of interaction vector. The upper panel shows the estimated splicing interaction vectors of rat cerebellum tissues at postnatal days six, 12 and 22. The lower panel shows the splicing interaction vector of rat cerebellum tissues at the age of 90 days, which is considered adult, as well as the splicing interaction vector of rat cerebral tissue at the age of 90 days. Within an interaction degree, the sequence of coefficients is ordered from left to right as follows: e.g. for 2nd order interactions, 123, 124, 125,..., 345, where 1,...., 5 represent exons 12, 23B, 40, 41, and 42 in the rip3r1 gene, as described in [18]. The approaches and results presented here can provide valuable insight into the underlying processes in alternative splicing in general, and specifically in the brain development experiments considered here. Most striking is the strong conservation over developmental stages at day 12, 22 and 90 (adult); some differences are showing between postnatal day six and day 12. Also, the conservation between the cerebellum and cerebrum is less pronounced than over developmental stages. Finally, second-or higher-order interaction terms seem to be of minor relevance, suggesting that in this gene/tissue combination, direct interaction mainly happens between pairs of exons, but not combinations of three or more exons.
We have also estimated β with the hierarchical Bayesian approach using MCMC. For the choice of σ 2 = 1 this resulted in very similar interaction patterns as for the level ᐍ 1 -penalization method. For σ 2 = 2 it led to remarkably different results. In addition to this, a further dataset was analyzed where the details can be found in Additional file 1 Section 3.

Conclusion
We have developed an efficient method for identifying interaction patterns of categorical variables. This can be used to fit a graphical model which is a valuable tool to visualize the conditional dependence structure among the random variables. In a simulation study, the results of the new level-ᐍ 1 -regularization method are superior in comparison to ᐍ 1 -regularization and slightly better than the Conditional Independence Graphs Figure 2 Conditional Independence Graphs. Conditional independence graphs for the estimated log-linear models for the itpr1 gene. For each graph, the predictive probability score (7) is reported as a goodness of fit measure. Note the strong mutual interaction between exons three, four and five. MAP estimator from some of the MCMC methods we considered. With real data, the level ᐍ 1 -regularization and hierarchical Bayesian approach led to similar results, subject to a specific choice of priors for the Bayesian method. An important computational advantage of the level-ᐍ 1method in comparison to MCMC, is that cross-validation becomes feasible which in turn allows for an empirical choice of the tuning parameter. While the methodology described in this article is motivated by the study of exon splicing interactions in single-gene transcriptomes, it provides a general and flexible toolbox for regularization analysis in relatively high dimensional, sparse contingency tables. Model selection in high dimensional contingency tables has been a traditionally challenging area, and we hope that our generalization of regularization methodologies to this context will prove useful in a variety of areas of computational biology and biostatistics. Several technologies generate categorical data: these include SNP chips that provide genotype and copy number information at the DNA level, sequencing technologies, assays that study binding properties of proteins and binding of RNA to DNA, a variety of disease phenotypes, and more. In most of these contexts the interactions among the variables are critical features in systems biology investigations that aim at studying how the components of complex systems work together in in fluencing biological outcomes. For example, the log-linear models described here provide a natural approach for fitting very general classes of networks to discrete data. The level-ᐍ 1 -regularization is a general tool which can be applied to a wide variety of problems involving sparse contingency tables.
An R package called logilasso will be available for download on the Comprehensive R Archive Network (CRAN).
from a solution which meets condition (9), the new approximately meets (9) again.