 Methodology article
 Open access
 Published:
Penalized likelihood for sparse contingency tables with an application to fulllength cDNA libraries
BMC Bioinformatics volume 8, Article number: 476 (2007)
Abstract
Background
The joint analysis of several categorical variables is a common task in many areas of biology, and is becoming central to systems biology investigations whose goal is to identify potentially complex interaction among variables belonging to a network. Interactions of arbitrary complexity are traditionally modeled in statistics by loglinear models. It is challenging to extend these to the high dimensional and potentially sparse data arising in computational biology. An important example, which provides the motivation for this article, is the analysis of socalled fulllength cDNA libraries of alternatively spliced genes, where we investigate relationships among the presence of various exons in transcript species.
Results
We develop methods to perform model selection and parameter estimation in loglinear models for the analysis of sparse contingency tables, to study the interaction of two or more factors. Maximum Likelihood estimation of loglinear model coefficients might not be appropriate because of the presence of zeros in the table's cells, and new methods are required. We propose a computationally efficient ℓ_{1}penalization approach extending the Lasso algorithm to this context, and compare it to other procedures in a simulation study. We then illustrate these algorithms on contingency tables arising from fulllength cDNA libraries.
Conclusion
We propose regularization methods that can be used successfully to detect complex interaction patterns among categorical variables in a broad range of biological problems involving categorical variables.
Background
One of the most striking discoveries of the genomic era is the unexpectedly small number of genes in the human genome. This amount has decreased from more than 100000 [1] to an estimated number of roughly between 20000 and 25000 [2, 3], tens of thousands less than initially expected and essentially the same number as found in phenotypically much simpler organisms. A question of overriding biological significance is, how complex phenotypes of higher organisms arise from limited genomes. Part of the explanation may be that many genes undergo a process called alternative RNA splicing, which can generate many distinct proteins from a single gene.
RNA splicing is a posttranscriptional process that occurs prior to mRNA translation. After the gene has been transcribed into a premessenger RNA (premRNA), it consists of intronic regions destined to be removed during premRNA processing (RNA splicing), as well as exonic sequences that are retained within the mature mRNA. After transcription occurs the actual splicing process, where it is decided which exons are retained in the mature message and which are targets for removal. In general, exons and introns are retained and deleted in different combinations to create a diverse array of mRNAs from a common coding sequence. This process is known as alternative RNA splicing. Depending on the source, the percentage of alternatively spliced genes lies between 35% and 60% [4–10]. By screening many fulllength cDNAs it is possible to record the complete cDNA from a mature RNA for the same gene again and again and a fulllength cDNA library, also known as singlegene library (SGL), builds up. The library contains detailed information about how specific exon combinations go together. This information is directly related to the functional regions of the proteins as they are grouped in domains which in many cases correspond to a single exon which encodes these domains. For example a transcription factor consists of a DNA binding domain and a regulatory domain. Thus the alteration of the exon structure corresponds to an alteration in the function of this particular domain. The central premise is that a dependency in the domains points to a functional association. If domains interact functionally then their splicing should be coregulated. And this coregulation has direct biological significance because it shows us which variable components also interact in the expressed protein. Because the polypeptide is intricately folded and tightly packed, segments that are separated by dozens of introns in the primary transcript may encode domains that interact functionally within the protein. These domains need not be structural neighbors even in the folded protein, but may interact through electrical or van der Waals forces, effects of global conformational changes, or even associations with other proteins. Because of these intricacies, there are no inherent distance restrictions, or limits on the number of interacting sites, and separate domains may combine their functional effect in unpredictable ways.
Due to the large number of potential combinations in highly alternatively spliced genes, any library will only comprise a small portion of the total theoretically possible inventory of combinations. Statistically, this leads to sparse contingency tables in which dimensions represent exons and cells represent variants. The investigation of interactions among categorical variables where not all possible combinations are observed, means addressing a model selection problem that is challenging both inferentially and computationally.
As far as alternative splicing is concerned, there is an important reason to determine this interaction structure: searching for intrapeptide interactions in functional assays is a very difficult, openended problem, where statistical analysis of the splicing interaction structure in the transcriptome can simplify this task enormously by identifying the sets of interacting domains. And as more investigators become interested in this type of information, and largescale singlegene libraries become available, there is a strong need for reliable statistical methods for analyzing the resulting datasets.
We develop different statistical methods to analyse sparse contingency tables in order to determine the underlying interaction pattern and we use graphical models to visualize these patterns. The methods are compared in a simulation study and illustrated on fulllength cDNA libraries.
Results
Algorithm
General introduction to contingency tables and Loglinear Models
In this section we provide general definitions and notations.
Assume we have q categorical random variables or factors, C = {C_{1},..., C_{ q }}, where each C_{ j }can take on a finite number g_{ j }of possible values, called levels. The vector (c_{1},..., c_{ q }) represents a particular combination of levels of the joint random variable C = {C_{1},..., C_{ q }}. The total cardinality of C is m={\displaystyle {\prod}_{j=1}^{q}{g}_{j}}, which corresponds to the m different combinations of levels (m = 2^{q}when all C_{ j }are dichotomous, as in our splicing example).
We simplify the notation by mapping each configuration of C to a unique natural number i ∈ {1,..., m} with a (bijective) function f:
f: (c_{1},..., c_{ q }) ↔ i ∈ {1,..., m},
so we may write c_{ i }= (c_{1},..., c_{ q }). For n observations of C, the corresponding qway contingency table has m cells, each listing the frequency of a particular configuration c_{ i }:
A general introduction to contingency tables can be found in [11].
If the observations are independent, with p_{ i }the probability of sampling configuration c_{ i }, the distribution of the cell counts (n_{1},..., n_{ q })^{t}is multinomial with probability p = (p_{1},..., p_{ q }).
In the splicing example, we may consider the C_{ j }as dichotomous random variables representing q sites of alternative splicing, each with two levels, denoted by c_{ j }∈ {1, 1}, corresponding to the presence or absence of exon j in a transcript. The contingency table therefore has m = 2^{q}cells, with each cell represented by the qdimensional binary vector c_{ i }= (c_{1},..., c_{ q }). A loglinear model for the cell probabilities can be written the following way:
A general loglinear model represents p as:
log (p) = X β,
where β is a vector of unknown coefficients and X a suitable design matrix as indicated below. Let's assume that the cell probabilities are expressed in the following way:
where δ_{∅} is the global mean, {\delta}_{{c}_{1}}^{{C}_{1}} is the main effect of the first variable and only depends on the distribution of C_{1}. Similarly {\delta}_{{c}_{1},{c}_{2}}^{{C}_{1},{C}_{2}} is the first order interaction between the first two variables and its value only depends on the joint distribution of these two variables.
We now look for a suitable parametrization {\tilde{X}}^{{C}_{i}} of the vector spaces spanned by the main effects {\delta}^{{C}_{i}}, a parametrization {\tilde{X}}^{{C}_{i},{C}_{j}} for the vector spaces spanned by the first order interactions {\delta}^{{C}_{i},{C}_{j}} and so on. To ensure identifiability, we impose constraints on these matrices and denote the resulting matrices by {X}^{{C}_{i}}, {X}^{{C}_{i},{C}_{j}} and so on. The design matrix X finally consists of these submatrices. The constitution of the design matrix X for factors with two levels can directly be derived from (1). The derivation of the design matrix X from (3) in the case of more than two levels per factor is basically an analysis of variance (ANOVA) parametrization with polycontrasts. Details can be found in Additional file 1 Section 1.
Sometimes we may assume a smaller model without some of the interaction terms. It is of the form as in (2) with some columns removed from the design matrix X. We denote matrices of the form {X}^{{C}_{{j}_{1}},\mathrm{...},{C}_{{j}_{k}}} by X_{ a }, with a=\{{C}_{{j}_{1}},\mathrm{...},{C}_{{j}_{k}}\}\subseteq C. The corresponding subvector of β is denoted by β_{ a }.
Graphical Models
A powerful way for visualizing conditional dependencies among variables is given by a graph. A graph \mathcal{G}=(\mathcal{V},\mathcal{E}) consists of a finite set \mathcal{V} of vertices and a finite set \mathcal{E} of edges between these vertices. In our context, the vertices correspond to the different discrete random variables. We form the socalled Conditional Independence Graph by connecting all pairs of vertices that appear in the same generator, that is the maximal terms a ⊆ C which are present in the model. To translate a vector β into a graphical model we look for β_{ a }≠ 0 with β_{ b }= 0 ∀ a ⊂ b (where b is a strict superset of a and a > 1) and we draw edges between all vertices corresponding to a. From this graph we can directly read off all marginal and conditional independences by the global Markov property for undirected graphs which states: if two sets of variables a and b are separated by a third set of variables c then a and b are conditionally independent given c (a ⫫ bc), where for three subsets a, b and c of \mathcal{V}, we say c separates a and b if all paths from a to b intersect c. For details, see [12].
Model selection – NonHierarchical versus hierarchical models
In the following subsections we introduce different model selection strategies for loglinear models. We first develop an ℓ_{1}regularization model selection approach, which is then expanded to the new socalled levelℓ_{1}regularization approach. In addition, different Bayesian model selection strategies, which we use for comparisons, are explained in Additional file 1 Section 2. Hierarchical models are a subclass of models such that if an interaction term β_{ a }is zero, then all higher order interaction terms β_{ b }for b ⊇ a are also zero. If we consider the example above with 2 levels, this means for example that if the first order interaction coefficient β_{ ij }= 0 then all higher order interaction coefficients including i and j are also zero, i.e. β_{ ijk }= 0, ∀ k. While it is possible that the true underlying interaction model may not be hierarchical from a biological standpoint, a difficulty in the use of nonhierarchical models arises from the fact that they are not invariant under reparametrization. We have chosen the design matrix X with some constraints to ensure identifiability, and we used a specific, namely an orthonormal basis. In terms of ANOVA, this choice is equivalent to choosing a polycontrast. We could have imposed different constraints or have chosen a different basis, and this would have resulted in a different design matrix X or in terms of ANOVA, a different choice of contrast. Suppose we have found an interaction vector β for one parametrization of the loglinear model and that this vector corresponds to a nonhierarchical model, meaning there is at least one lower order interaction term β_{ a }equal to zero, while β_{ b }≠ 0 for at least one b ⊇ a. If we reparametrize the model, using a different design matrix, the coefficient for the model term a may no longer be zero. On the other hand, by reparametrizing a hierarchical model, all zero terms remain zero after reparametrization. Therefore, hierarchicity is preserved after reparametrization while nonhierarchicity depends on the parametrization. This is a distinct advantage of working within the hierarchical class. In a hierarchical model, all zero coefficients can directly be interpreted in terms of conditional independence, while this is not true for nonhierarchical models.
ℓ_{1}Regularized model selection
The Lasso, originally proposed by [13] for linear regression, performs regularized parameter estimation and variable selection at the same time. The Lasso estimate is defined as follows:
where Y = (Y_{1},..., Y_{ n }) is the response vector. This can also be viewed as a penalized Maximum Likelihood estimator, as {\displaystyle {\sum}_{i}{(YX\beta )}_{i}^{2}} is proportional to the negative loglikelihood function for Gaussian linear regression. While the MLE for the general regression model is no longer uniquely defined and very poor in the case of more variables than observations, the Lasso estimator is still reasonable as long as λ > 0. For our analysis, we have a similar problem, namely that the MLE does not exist in case of zero counts in the contingency table: a detailed description of the existence of the MLE in general loglinear interaction models is given in [14]. Inspired by the Lasso, we estimate our parameter vector β by the following expression:
where l(β) is the loglikelihood function l(\beta )=\mathrm{log}\phantom{\rule{0.1em}{0ex}}{\mathbb{P}}_{\beta}[n]\propto {\displaystyle {\sum}_{i=1}^{m}\frac{{n}_{n}}{n}}{(X\beta )}_{n}. This minimization has to be calculated under the additional constraint that the cell probabilities add to 1:
A problem of the optimization (4) is that the solution is no longer independent of the choice of the orthogonal subspaces X_{ a }. That is, if any set of orthogonal columns X_{ a }of X is reparametrized by a different orthogonal set, we get a different solution. To avoid this undesirable outcome we use a penalty that is intermediate between the ℓ_{1} and the ℓ_{2}penalty. This penalty, called groupℓ_{1}penalty, has the following form:
Originally, this has been proposed by [15] for the linear regression problem with factor variables. The estimator of β then becomes
subject to the constraint in (5). By imposing a penalty function on the coefficients of the loglinear interaction terms, overfitting as it might occur by using MLE is reduced. Furthermore, the ℓ_{1}penalty encourages sparse solutions for the single components of β, the group ℓ_{1}penalty encourages sparsity at the interaction level, meaning that the vector β_{ a }, which corresponds to the interaction term a is either present or absent in the model as a whole. In case of factors with only 2 levels, the group ℓ_{1}penalty and the ℓ_{1}penalty are equivalent.
For both the ℓ_{1}, and the group ℓ_{1}regularization, the parameter λ can be assessed by crossvalidation: we divide the individual counts into a number of equal parts and in turn leave out one part for the rest to form a training contingency table with cell counts n_{ train }. The solution for an array of values for λ, the socalled solution path, is calculated according to an algorithm described in the following Implementation section. The corresponding vectors of cell probabilities are denoted by p({\stackrel{\_}{\beta}}^{\lambda}). We then use the remainder of the cell counts n_{ test }to calculate the predictive negative loglikelihood score
which is proportional to the outofsample negative loglikelihood. This score is on the same scale when varying the number of observations and may therefore be used to compare contingency tables of the same dimension but with different numbers of cell entries. The parameter λ is chosen as the value which minimizes the crossvalidated score in (7). We use a tenfold crossvalidation in our example.
The resulting model does not necessarily have to be hierarchical and if we consider the hierarchical model induced by this procedure, it might happen that the final model is large for example if a single high order interaction is estimated to be active. To address this, we set up an algorithm described in the next Section.
Levelℓ_{1}regularized model selection
In order to prevent the procedure from choosing single highorder interactions, we alter the ℓ_{1}regularized algorithm described in the previous Section: we do not exclusively apply it to the fully saturated model but also to submodels with lower order interactions. Precisely, a model is fitted with main effects only, and the predictive negative loglikelihood score (7) is calculated for the best main effects model (level 1). The same is done for the model including all main effects and first order interactions (level 2). Proceeding accordingly, we get C loglikelihood scores corresponding to the C levels. The level with minimal score (7) is then chosen (and within this selected level, we have an ℓ_{1}regularized estimate).
With this procedure the tendency of including a single highorder interaction while most of its lower order interactions are absent is decreased, and the inclusion is only forced if the predictive negative loglikelihood score strongly speaks in favour of the inclusion. Therefore we tend to select sparser models which can be better hierarchized and interpreted in terms of conditional independence, in contrast to the ordinary ℓ_{1}model selection procedure.
Algorithm for ℓ_{1}regularization for factors with two levels
For the regularization approaches we calculate {\stackrel{\_}{\beta}}^{\lambda} over a large number of values of λ in order to do some crossvalidation using (7). For this purpose, an efficient algorithm is required. As one can easily verify by introducing Lagrange multipliers, finding the solution to (6) under the constraint (5) is equivalent to minimizing an unconstrained function g(β):
with μ= X β and l(\beta )\propto {\displaystyle {\sum}_{i}\frac{{n}_{i}}{n}{(X\beta )}_{i}}. Here, g is a convex function. If each factor has two levels only, as in our application with singlegene libraries, we can set up an algorithm, which efficiently yields the estimates for a whole sequence of parameters λ. Let \mathcal{A} denote the set of active interaction terms, which means for a ∈ \mathcal{A} it holds that β_{ a }≠ 0; {X}_{\mathcal{A}} is the corresponding submatrix of X, {\beta}_{\mathcal{A}} the corresponding subvector of β and {g}_{\mathcal{A}} is g restricted to the subspace {\beta}_{\mathcal{A}}. We restrict ourselves to the currently active set \mathcal{A}, where \nabla {g}_{\mathcal{A}} and {\nabla}^{2}{g}_{\mathcal{A}} are welldefined:
The algorithm, which is an adaption of the path following algorithm proposed by [16], is set up as follows:

(1)
Start with \stackrel{\_}{\beta} = (log(m), 0,..., 0)

(2)
Set: λ_{0} = 1, \mathcal{A} = {∅} and t = 0.

(3)
While (λ_{ t }> λ_{ min })
(3.1) λ_{t+1}= λ_{ t } ε
(3.2) \mathcal{A} = \mathcal{A} ∪ {j ∉ \mathcal{A}: [X^{t}·\frac{n}{n}  exp (X\stackrel{\_}{\beta})]_{ j } > λ_{t+1}}
(3.3) \stackrel{\_}{\beta} is updated as \stackrel{\_}{\beta}_{t+1}= \stackrel{\_}{\beta}_{ t } {\nabla}^{2}{g}_{\mathcal{A}}(\stackrel{\_}{\beta}_{ t }, λ_{t+1})^{1}·\nabla {g}_{\mathcal{A}}(\stackrel{\_}{\beta}_{ t }, λ_{t+1}).
(3.4) \mathcal{A} = \mathcal{A}\{j ∈ \mathcal{A}: {\stackrel{\_}{\beta}_{t+1,j} <δ}
(3.5) t = t + 1
The pairs ({\stackrel{\_}{\beta}}_{t},{\lambda}_{t}), obtained from the algorithm above, represent the estimates from (6) under the constraint (5) for a range of penalty parameters λ_{ t }e.g. (t = ε, 2ε...). The choice of the step length ε represents the tradeoff between computational complexity and accuracy. To increase accuracy, one can perform more than one Newton step (3.3) if the gradient starts deviating from zero. The coefficient δ is also flexible. Typically it is chosen in the order of ε. The lowest λ for which one wants the solution to be calculated is denoted by λ_{ min }. Technical details concerning the algorithm can be found in the Appendix.
Testing
Data
We choose the true underlying interaction vector β consisting of 5 factors of 2 levels. By enumerating the factors from 1 to 5, the generators of the model are 345 + 235 + 234 + 135 + 123 + 14, which means that all third and fourth order interactions are absent, only five of ten second order interactions and all first order interactions are present. The corresponding coefficients of β are independently simulated using a normal distribution with mean zero and variance one.
Then, 250 draws from a multinomial distribution with probability vector p where log (p) = X β, are taken. This corresponds to a reasonable number of cDNAs in a singlegene library. This is then repeated 10 times. With our choice of β, the resulting contingency tables are sparse. With the simulated cell counts, \stackrel{\_}{\beta} is estimated with different methods described in the previous sections and these methods are then compared as follows:
Criteria
As a model selection score (MSS), we consider the fraction of correctly assigned model terms:
Moreover, we consider the root mean squared error for the interaction coefficients,
For assessing how much the estimation of β varies over multiple datasets, we calculate for every coefficient {\stackrel{\_}{\beta}}_{i} the estimated standard deviation {\stackrel{\_}{\sigma}}_{i}. The means of these standard deviations are reported as
a measure of variability.
To compare the different procedures for estimation of probabilities p = exp (X β), we calculate the negative loglikelihood score (NLS) similar to the score in (7):
Results of simulation study
The results of the simulation study are summarized in Table 1, where we also include the MAP estimators of the Bayesian approaches described in Additional file 1 Section 2. We notice that the penaltybased regularization approaches proposed in this article leads to comparable or better results than the Bayesian approaches with respect to the NLSscore, RMSE and the variation (SPREAD), though the results of Bayesian approaches vary with the prior and the set of possible priors has not been extensively explored.
The levelℓ_{1}regularization and the relaxed ℓ_{1}regularization (see below) are both competitive and can be better than MCMC for model selection.
The results of the MCMC procedures are sensitive to the choice of the prior value or the prior distribution for σ^{2}. A at prior for α_{ a }(σ^{2} = 2) results in worse performance than that of a prior that shrinks the coefficients more towards zero (σ^{2} = 1/2). This suggests that specification of this prior hyperparameter may be difficult in practice, while we can easily optimize λ in the regularization approach by crossvalidation.
The MCMC approaches without model selection perform poorly, as should be expected from data generated by a sparse model. MCMC methods based on a nonhierarchical model selection are also clearly inferior to the hierarchical counterpart. This is not surprising, as we have simulated data from a hierarchical model. In Table 1 we have also added an additional approach, denoted by ℓ_{2}, the equivalent to the ℓ_{1}regularization but using an ℓ_{2}penalty instead of an ℓ_{1}penalty on the coefficients of the loglinear model. This method is equivalent to the MAP estimator with Gaussian priors on β_{ a }, with the parameter of the distribution optimized by crossvalidation. This Ridgetype method does not perform variable selection, but it is competitive for all other criteria that we assessed.
In addition we consider the relaxed ℓ_{1}regularization approach. Rather than using a single penalty parameter λ, the idea of this method is to control variable selection and parameter estimation by incorporating two penalty parameters. For linear regression it has been proven theoretically as well as empirically [17] that under suitable conditions the relaxed ℓ_{1}regularization is better than Lasso.
Overall, the levelℓ_{1}regularization has good model selection performance (high MSS score) in combination with low negative loglikelihood score (NLS) and a low mean squared error for the true β(RMSE). In addition, it is feasible to optimize the tuning parameter λ by crossvalidation as the computational cost is very low compared to the MCMC approaches. On the other hand, posterior distributions of estimates from MCMC methods provide additional information about uncertainty in the model space, compared to point estimates from ℓ_{1} or ℓ_{2}regularization.
Implementation
Dataset
We estimate the splicing interaction pattern for a dataset corresponding to the itpr1 gene, one of three mammalian genes encoding receptors for the second messenger inositol 1,4,5trisphosphate (InsP_{3}). This gene is subject to alternative RNA splicing, with seven sites of transcript variation, 6 of these within the ORF and among these, q = 5 were completely assessed in the singlegene libraries. Five singlegene libraries were built, one for adult rat cerebrum as well as four for different stages of postnatal cerebellar development, namely on days 6, 12, 22 and 90, the latter being considered as adult. Each library consists of between 179 and 277 transcripts which were assessed, i.e. {\displaystyle {\sum}_{j=1}^{m}{n}_{j}} ∈ [179, 277]. This gene is 89% identical at the cDNA level and 95% identical at the amino acid level with the human receptor gene. The complete dataset can be found in [18].
Results of application to SingleGene Libraries
Unless stated differently, we report the results using the level ℓ_{1}penalization method. We display the interaction vector \stackrel{\_}{\beta} graphically by plotting the components {\stackrel{\_}{\beta}}_{j} for the different tissue and development stages in Figure 1. Our results suggest that the exons interact mainly in pairs and there is no reliably estimated higher order interaction in the splicing interaction pattern of rat cerebellum. We further notice that the main interaction pattern is very well conserved over different developmental stages. A strong mutual interaction between exons number three, four and five can be observed in all development stages of rat cerebellum as well as in the cerebral tissue. The biggest changes in the interaction pattern during development of rat cerebellum occur from postnatal day six to day 12. This can be seen at position number 10 on the xaxis in Figure 1, and it corresponds to the first order interaction between exons two and three, and from day 12 to day 16, the first main effect changes in sign and magnitude. The first main effect decreases progressively from day 6 to adult, reversing in sign between day 12 and 22. Between day 22 and 90, the interaction pattern is strongly conserved. Comparing the splicing interaction patterns between cerebellum and cerebrum in the adult rat, we see a much more complex pattern in the cerebrum, involving several second order interactions, and therefore a clear distinction from that of the cerebellum.
The conditional independence graphs for the estimated loglinear models are drawn in Figure 2, where the thickness of the edges are proportional to the corresponding coefficient of the interaction vector \stackrel{\_}{\beta} (the largest, if there are several giving rise to the same edge) and the radius of the vertices are chosen proportional to the corresponding main effect coefficient. Figure 2 graphically exploits the strongly conserved interactions between exons three, four and five. Except for a rather strong interaction between exon two and three on day six, all other interactions appear to be rather small. The graphical representation of the interaction pattern of adult rat cerebrum reveals a more complex interaction pattern with no conditional independences.
The approaches and results presented here can provide valuable insight into the underlying processes in alternative splicing in general, and specifically in the brain development experiments considered here. Most striking is the strong conservation over developmental stages at day 12, 22 and 90 (adult); some differences are showing between postnatal day six and day 12. Also, the conservation between the cerebellum and cerebrum is less pronounced than over developmental stages. Finally, second or higherorder interaction terms seem to be of minor relevance, suggesting that in this gene/tissue combination, direct interaction mainly happens between pairs of exons, but not combinations of three or more exons.
We have also estimated β with the hierarchical Bayesian approach using MCMC. For the choice of σ^{2} = 1 this resulted in very similar interaction patterns as for the level ℓ_{1}penalization method. For σ^{2} = 2 it led to remarkably different results. In addition to this, a further dataset was analyzed where the details can be found in Additional file 1 Section 3.
Conclusion
We have developed an efficient method for identifying interaction patterns of categorical variables. This can be used to fit a graphical model which is a valuable tool to visualize the conditional dependence structure among the random variables. In a simulation study, the results of the new levelℓ_{1}regularization method are superior in comparison to ℓ_{1}regularization and slightly better than the MAP estimator from some of the MCMC methods we considered. With real data, the level ℓ_{1}regularization and hierarchical Bayesian approach led to similar results, subject to a specific choice of priors for the Bayesian method. An important computational advantage of the levelℓ_{1}method in comparison to MCMC, is that crossvalidation becomes feasible which in turn allows for an empirical choice of the tuning parameter. While the methodology described in this article is motivated by the study of exon splicing interactions in singlegene transcriptomes, it provides a general and flexible toolbox for regularization analysis in relatively high dimensional, sparse contingency tables. Model selection in high dimensional contingency tables has been a traditionally challenging area, and we hope that our generalization of regularization methodologies to this context will prove useful in a variety of areas of computational biology and biostatistics. Several technologies generate categorical data: these include SNP chips that provide genotype and copy number information at the DNA level, sequencing technologies, assays that study binding properties of proteins and binding of RNA to DNA, a variety of disease phenotypes, and more. In most of these contexts the interactions among the variables are critical features in systems biology investigations that aim at studying how the components of complex systems work together in in fluencing biological outcomes. For example, the loglinear models described here provide a natural approach for fitting very general classes of networks to discrete data. The levelℓ_{1}regularization is a general tool which can be applied to a wide variety of problems involving sparse contingency tables.
An R package called logilasso will be available for download on the Comprehensive R Archive Network (CRAN).
Appendix
We note that if β is a minimum of g, then {\beta}_{\mathcal{A}} is a minimum of {g}_{\mathcal{A}}.
In our application with singlegene libraries, all factors have two levels only, which allows to construct an efficient algorithm. Since the gradient
where exp(X β) is understood as the componentwise exponential function, it follows that for a minimum {\beta}_{\mathcal{A}} of {g}_{\mathcal{A}}, the following equation holds:
Without loss of generality, we can restrict ourselves to the subspace β∈ ℝ^{} × ℝ^{m1}, because the constraint (5) can only be satisfied for β_{∅} < 0 as is proved in the following Lemma 1. Therefore β_{∅} ∈ \mathcal{A}.
Lemma 1. β_{∅} < 0 for a minimum of g(β) for all λ ∈ ℝ^{+}.
Proof.
log(p) = X β< 0 which yields (1,..., 1)X β= mβ_{∅} < 0 this implies β_{∅} < 0.
This holds because (1,....., 1) is orthogonal to all columns of X except for the first one. □
Additionally for β being a minimum, a necessary condition is:
Conditions (9) and (10) are sufficient for β being a minimum of (8). To find the β's that solve these equations for an array of values for λ, we set up a socalled path following algorithm. The idea is to start from an optimal solution {\beta}^{{\lambda}_{0}} for λ_{0}, and follow the path for decreasing λ, using a secondorder approximation for {\beta}_{\mathcal{A}}. In the following, we restrict ourselves to the currently active set \mathcal{A}, omitting the index \mathcal{A}. It then holds:
The algorithm tries to follow the optimal path as close as possible. At each step, it aims to meet the conditions (9) and (10). In step (3.2), the active set \mathcal{A} is identified, which forces \stackrel{\_}{\beta} to meet the condition (10). In step (3.3), a Newton step as described in (11) is performed. Starting from a solution which meets condition (9), the new {\stackrel{\_}{\beta}}^{\lambda} approximately meets (9) again.
References
Liang F, Holt I, Pertea G, Karamycheva S, Salzberg S, Quackenbush J: Gene index analysis of the human genome estimates approximately 120000 genes. Nature Genetics. 2000, 25: 239240. 10.1038/76126.
International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004, 431: 931945. 10.1038/nature03001.
Southan C: Has the yoyo stopped? An assessment of human proteincoding gene number. Proteomics. 2004, 4: 17121726. 10.1002/pmic.200300700.
Mironov A, Fickett J, Gelfand M: Frequent alternative splicing of human genes. Genome Research. 1999, 9: 12881293. 10.1101/gr.9.12.1288.
Brett D, Hanke J, Lehmann G, Haase S, Delbruck S, Krueger SR, J Bork P: EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett. 2000, 474: 8386. 10.1016/S00145793(00)015817.
International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860921. 10.1038/35057062.
Brett D, Pospisil H, Valcarcel J, Reich J, Bork P: Alternative splicing and genome complexity. Nature Genetics. 2002, 30: 2930. 10.1038/ng803.
The FANTOM Consortium: The transcriptional landscape of the mammalian genome. Science. 2005, 309 (5740): 15591563. 10.1126/science.1112014.
Zavolan M, van Nimwegen E, Gaasterland T: Splice variation in mouse fulllength cDNAs identified by mapping to the mouse genome. Genome Research. 2003, 12: 13771385. 10.1101/gr.191702.
Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi m, Barrero m, Tamura T, YamaguchiKabata Y, Tanino M: Integrative annotation of 21037 human genes validated by fulllength cDNA clones. PloS Biology. 2004, 2: 120. 10.1371/journal.pbio.0020162.
Everitt BS: The Analysis of Contingency Tables. Monographs on Statistics and Applied Probability. 1992, Chapman and Hall, 45: 2
Lauritzen SL: Graphical Models. Oxford Statistical Science Series. 1996, Oxford Clarendon Press, 17:
Tibshirani R: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. 1996, 58: 267288.
Christensen R: Linear Models for Multivariate Time Series, and Spatial Data. 1991, SpringerVerlag
Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. 2006, 68: 4967. 10.1111/j.14679868.2005.00532.x.
Rosset S: Following Curved Regularized Optimization Solution Paths. Advances in Neural Information Processing Systems. Edited by: Saul LK, Weiss Y, Bottou L. 2005, Cambridge, MA: MIT Press, 17: 11531160.
Meinshausen N: Lasso with relaxation. Computational Statistics & Data Analysis.
Regan MR, Lin DDM, Emerick MC, Agnew WS: The effect of higher order RNA processes on changing patterns of protein domain selection: A developmentally regulated transcriptome of type 1 inositol 1,4,5trisphosphate. Proteins: Structure, Function and Bioinformatics. 2005, 59: 312331. 10.1002/prot.20225.
Acknowledgements
CD was partially supported by the Swiss National Science Foundation grant number NF 200020113270 and by a PhD scholarship from the CCSPMD. GP was partly supported by NSF grant DMS034211.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
CD derived the mathematical details, implemented and tested the algorithm. GP initiated the project, suggested ideas and edited the manuscript. ME provided the datasets and the biological interpretation. PB supervised the project and suggested some of the main ideas. All authors read and approved the final manuscript.
Electronic supplementary material
12859_2007_1848_MOESM1_ESM.pdf
Additional file 1: The Additional file consists of 3 sections. Section 1 contains details concerningthe parametrization of the loglinear model. Section 2 describes some Bayesian model selection approaches, which were used for comparison with our algorithm. In Section 3 a further dataset on which we tested our algorithm is introduced and the results are given on that dataset. (PDF 190 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Dahinden, C., Parmigiani, G., Emerick, M.C. et al. Penalized likelihood for sparse contingency tables with an application to fulllength cDNA libraries. BMC Bioinformatics 8, 476 (2007). https://doi.org/10.1186/147121058476
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121058476