Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries

Dahinden, Corinne; Parmigiani, Giovanni; Emerick, Mark C; Bühlmann, Peter

doi:10.1186/1471-2105-8-476

Methodology article
Open access
Published: 11 December 2007


BMC Bioinformatics volume 8, Article number: 476 (2007)


Abstract

Background

The joint analysis of several categorical variables is a common task in many areas of biology, and is becoming central to systems biology investigations whose goal is to identify potentially complex interactions among variables belonging to a network. Interactions of arbitrary complexity are traditionally modeled in statistics by log-linear models. It is challenging to extend these to the high dimensional and potentially sparse data arising in computational biology. An important example, which provides the motivation for this article, is the analysis of so-called full-length cDNA libraries of alternatively spliced genes, where we investigate relationships among the presence of various exons in transcript species.

Results

We develop methods to perform model selection and parameter estimation in log-linear models for the analysis of sparse contingency tables, to study the interaction of two or more factors. Maximum Likelihood estimation of log-linear model coefficients might not be appropriate because of the presence of zeros in the table's cells, and new methods are required. We propose a computationally efficient ℓ₁-penalization approach extending the Lasso algorithm to this context, and compare it to other procedures in a simulation study. We then illustrate these algorithms on contingency tables arising from full-length cDNA libraries.

Conclusion

We propose regularization methods that can be used successfully to detect complex interaction patterns among categorical variables in a broad range of biological problems.

Background

One of the most striking discoveries of the genomic era is the unexpectedly small number of genes in the human genome. Estimates have fallen from more than 100000 [1] to roughly between 20000 and 25000 [2, 3], tens of thousands fewer than initially expected and essentially the same number as found in phenotypically much simpler organisms. A question of overriding biological significance is how complex phenotypes of higher organisms arise from limited genomes. Part of the explanation may be that many genes undergo a process called alternative RNA splicing, which can generate many distinct proteins from a single gene.

RNA splicing is a post-transcriptional process that occurs prior to mRNA translation. After the gene has been transcribed into a pre-messenger RNA (pre-mRNA), it consists of intronic regions destined to be removed during pre-mRNA processing (RNA splicing), as well as exonic sequences that are retained within the mature mRNA. Transcription is followed by the actual splicing process, in which it is decided which exons are retained in the mature message and which are targets for removal. In general, exons and introns are retained and deleted in different combinations to create a diverse array of mRNAs from a common coding sequence. This process is known as alternative RNA splicing. Depending on the source, the percentage of alternatively spliced genes lies between 35% and 60% [4–10]. By screening many full-length cDNAs it is possible to record the complete cDNA from a mature RNA for the same gene again and again, so that a full-length cDNA library, also known as a single-gene library (SGL), builds up. The library contains detailed information about which specific exon combinations occur together. This information is directly related to the functional regions of the proteins, which are grouped in domains that in many cases are each encoded by a single exon. For example, a transcription factor consists of a DNA binding domain and a regulatory domain. Thus an alteration of the exon structure corresponds to an alteration in the function of the corresponding domain. The central premise is that a dependency among the domains points to a functional association. If domains interact functionally then their splicing should be co-regulated. This co-regulation has direct biological significance because it shows us which variable components also interact in the expressed protein.
Because the polypeptide is intricately folded and tightly packed, segments that are separated by dozens of introns in the primary transcript may encode domains that interact functionally within the protein. These domains need not be structural neighbors even in the folded protein, but may interact through electrical or van der Waals forces, effects of global conformational changes, or even associations with other proteins. Because of these intricacies, there are no inherent distance restrictions, or limits on the number of interacting sites, and separate domains may combine their functional effect in unpredictable ways.

Due to the large number of potential combinations in highly alternatively spliced genes, any library will only comprise a small portion of the total theoretically possible inventory of combinations. Statistically, this leads to sparse contingency tables in which dimensions represent exons and cells represent variants. Investigating interactions among categorical variables when not all possible combinations are observed means addressing a model selection problem that is challenging both inferentially and computationally.

As far as alternative splicing is concerned, there is an important reason to determine this interaction structure: searching for intrapeptide interactions in functional assays is a very difficult, open-ended problem, where statistical analysis of the splicing interaction structure in the transcriptome can simplify this task enormously by identifying the sets of interacting domains. And as more investigators become interested in this type of information, and large-scale single-gene libraries become available, there is a strong need for reliable statistical methods for analyzing the resulting datasets.

We develop different statistical methods to analyse sparse contingency tables in order to determine the underlying interaction pattern and we use graphical models to visualize these patterns. The methods are compared in a simulation study and illustrated on full-length cDNA libraries.

Results

Algorithm

General introduction to contingency tables and Log-linear Models

In this section we provide general definitions and notations.

Assume we have q categorical random variables or factors, C = {C₁,..., C_q}, where each C_j can take on a finite number g_j of possible values, called levels. The vector (c₁,..., c_q) represents a particular combination of levels of the joint random variable C = {C₁,..., C_q}. The total cardinality of C is $m = \prod_{j = 1}^{q} g_{j}$, which corresponds to the m different combinations of levels (m = 2^q when all C_j are dichotomous, as in our splicing example).

We simplify the notation by mapping each configuration of C to a unique natural number i ∈ {1,..., m} with a (bijective) function f:

f: (c₁,..., c_q) ↔ i ∈ {1,..., m},

so we may write c_i = (c₁,..., c_q). For n observations of C, the corresponding q-way contingency table has m cells, each listing the frequency of a particular configuration c_i:

n_{c_{1}, ..., c_{q}} = n_{i}, \qquad \sum_{i = 1}^{m} n_{i} = n .

A general introduction to contingency tables can be found in [11].
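The index mapping f and the resulting cell counts can be illustrated with a minimal Python sketch. The function names `make_index` and `contingency_counts`, the enumeration order of the configurations, and the toy observations are assumptions made for illustration only; any fixed bijection would serve equally well.

```python
from itertools import product

def make_index(levels):
    """Map each configuration (c_1, ..., c_q) to a unique i in {1, ..., m}."""
    configs = list(product(*levels))          # all m = prod_j g_j configurations
    return {c: i for i, c in enumerate(configs, start=1)}

def contingency_counts(observations, levels):
    """Return the cell counts n_1, ..., n_m (cell i is stored at position i - 1)."""
    index = make_index(levels)
    counts = [0] * len(index)
    for obs in observations:
        counts[index[tuple(obs)] - 1] += 1
    return counts

# Toy example: q = 3 dichotomous factors with levels {1, -1}
levels = [(1, -1)] * 3
observations = [(1, 1, -1), (1, 1, -1), (-1, 1, 1), (1, -1, -1)]
index = make_index(levels)
counts = contingency_counts(observations, levels)
```

With only four observations spread over m = 8 cells, most entries of `counts` are zero, which is the kind of sparsity the article is concerned with.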

If the observations are independent, with p_i the probability of sampling configuration c_i, the distribution of the cell counts (n₁,..., n_m)^t is multinomial with probability p = (p₁,..., p_m).

In the splicing example, we may consider the C_j as dichotomous random variables representing q sites of alternative splicing, each with two levels, denoted by c_j ∈ {1, -1}, corresponding to the presence or absence of exon j in a transcript. The contingency table therefore has m = 2^q cells, with each cell represented by the q-dimensional binary vector c_i = (c₁,..., c_q). A log-linear model for the cell probabilities can be written the following way:

\log p_{i} = β_{\emptyset} + \sum_{l = 1}^{q} β_{l} c_{l} + \sum_{1 \le j < k \le q} β_{j k} c_{j} c_{k} + \dots + β_{12 \dots q} c_{1} c_{2} \cdots c_{q} .

(1)

A general log-linear model represents p as:

log (p) = X β,

(2)

where β is a vector of unknown coefficients and X a suitable design matrix as indicated below. Let's assume that the cell probabilities are expressed in the following way:

\log p_{c_{1}, ..., c_{q}} = δ_{\emptyset} + δ_{c_{1}}^{C_{1}} + \dots + δ_{c_{q}}^{C_{q}} + δ_{c_{1}, c_{2}}^{C_{1}, C_{2}} + \dots + δ_{c_{1}, ..., c_{q}}^{C_{1}, ..., C_{q}},

(3)

where δ_∅ is the global mean, $δ_{c_{1}}^{C_{1}}$ is the main effect of the first variable and only depends on the distribution of C₁. Similarly $δ_{c_{1}, c_{2}}^{C_{1}, C_{2}}$ is the first order interaction between the first two variables and its value only depends on the joint distribution of these two variables.

We now look for a suitable parametrization ${\tilde{X}}^{C_{i}}$ of the vector spaces spanned by the main effects $δ^{C_{i}}$ , a parametrization ${\tilde{X}}^{C_{i}, C_{j}}$ for the vector spaces spanned by the first order interactions $δ^{C_{i}, C_{j}}$ and so on. To ensure identifiability, we impose constraints on these matrices and denote the resulting matrices by $X^{C_{i}}$ , $X^{C_{i}, C_{j}}$ and so on. The design matrix X finally consists of these submatrices. The constitution of the design matrix X for factors with two levels can directly be derived from (1). The derivation of the design matrix X from (3) in the case of more than two levels per factor is basically an analysis of variance (ANOVA) parametrization with poly-contrasts. Details can be found in Additional file 1 Section 1.
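For factors with two levels, the construction of X from (1) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the columns are left as unscaled ±1 products (the first column is all ones), and the function name `design_matrix` and the column ordering by interaction order are choices made here.

```python
from itertools import combinations, product
from math import prod

def design_matrix(q):
    """Rows: the 2^q cells. Columns: one per subset a of the factors (empty set first)."""
    cells = list(product((1, -1), repeat=q))
    terms = [a for k in range(q + 1) for a in combinations(range(q), k)]
    # Entry for cell c and term a is the product of c_j over j in a (1 for the empty set)
    X = [[prod(cell[j] for j in a) for a in terms] for cell in cells]
    return cells, terms, X

cells, terms, X = design_matrix(3)
m = len(cells)
# Gram matrix X^t X, used to check that distinct columns are orthogonal
gram = [[sum(X[i][r] * X[i][s] for i in range(m)) for s in range(m)] for r in range(m)]
```

In this sketch the Gram matrix equals m times the identity, so dividing every column by the square root of m would give an orthonormal basis.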

Sometimes we may assume a smaller model without some of the interaction terms. It is of the form as in (2) with some columns removed from the design matrix X. We denote matrices of the form $X^{C_{j_{1}}, ..., C_{j_{k}}}$ by X_a, with $a = {C_{j_{1}}, ..., C_{j_{k}}} \subseteq C$ . The corresponding subvector of β is denoted by β_a.

Graphical Models

A powerful way of visualizing conditional dependencies among variables is given by a graph. A graph $G = (V, ℰ)$ consists of a finite set $V$ of vertices and a finite set $ℰ$ of edges between these vertices. In our context, the vertices correspond to the different discrete random variables. We form the so-called Conditional Independence Graph by connecting all pairs of vertices that appear in the same generator, where the generators are the maximal terms a ⊆ C which are present in the model. To translate a vector β into a graphical model we look for β_a ≠ 0 with β_b = 0 for all b ⊃ a (where b is a strict superset of a and |a| > 1) and we draw edges between all vertices corresponding to a. From this graph we can directly read off all marginal and conditional independences by the global Markov property for undirected graphs, which states: if two sets of variables a and b are separated by a third set of variables c then a and b are conditionally independent given c (a ⫫ b|c), where for three subsets a, b and c of $V$, we say c separates a and b if all paths from a to b intersect c. For details, see [12].
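The translation from active interaction terms to a graph, and the separation check behind the global Markov property, can be sketched as follows. The function names and the example model are hypothetical, and the sketch assumes that a and b are disjoint sets of vertices.

```python
from itertools import combinations

def independence_graph(active_terms):
    """Connect all pairs of vertices that appear together in an active interaction term."""
    edges = set()
    for term in active_terms:
        if len(term) > 1:
            edges.update(frozenset(pair) for pair in combinations(sorted(term), 2))
    return edges

def separates(edges, a, b, c):
    """True if every path from a vertex in a to a vertex in b passes through c."""
    seen = set(a) - set(c)
    frontier = list(seen)
    while frontier:
        v = frontier.pop()
        for e in edges:
            if v in e:
                (w,) = e - {v}                 # the other endpoint of the edge
                if w not in seen and w not in c:
                    seen.add(w)
                    frontier.append(w)
    return not (seen & set(b))

# Hypothetical model whose non-zero interaction terms are {1,2}, {2,3} and {3,4,5}
edges = independence_graph([(1, 2), (2, 3), (3, 4, 5)])
```

In this example vertex 2 separates vertices 1 and 3, so under the global Markov property variable 1 would be read as conditionally independent of variable 3 given variable 2.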

Model selection – Non-Hierarchical versus hierarchical models

In the following subsections we introduce different model selection strategies for log-linear models. We first develop an ℓ₁-regularization model selection approach, which is then expanded to the new so-called level-ℓ₁-regularization approach. In addition, different Bayesian model selection strategies, which we use for comparisons, are explained in Additional file 1 Section 2.

Hierarchical models are a subclass of models such that if an interaction term β_a is zero, then all higher order interaction terms β_b for b ⊇ a are also zero. If we consider the example above with 2 levels, this means for example that if the first order interaction coefficient β_{ij} = 0 then all higher order interaction coefficients including i and j are also zero, i.e. β_{ijk} = 0, ∀ k. While it is possible that the true underlying interaction model may not be hierarchical from a biological standpoint, a difficulty in the use of non-hierarchical models arises from the fact that they are not invariant under reparametrization. We have chosen the design matrix X with some constraints to ensure identifiability, and we used a specific basis, namely an orthonormal one. In terms of ANOVA, this choice is equivalent to choosing a poly-contrast. We could have imposed different constraints or have chosen a different basis, and this would have resulted in a different design matrix X or, in terms of ANOVA, a different choice of contrast. Suppose we have found an interaction vector β for one parametrization of the log-linear model and that this vector corresponds to a non-hierarchical model, meaning there is at least one lower order interaction term β_a equal to zero, while β_b ≠ 0 for at least one b ⊇ a. If we reparametrize the model, using a different design matrix, the coefficient for the model term a may no longer be zero. On the other hand, by reparametrizing a hierarchical model, all zero terms remain zero after reparametrization. Therefore, hierarchicity is preserved after reparametrization while non-hierarchicity depends on the parametrization. This is a distinct advantage of working within the hierarchical class. In a hierarchical model, all zero coefficients can directly be interpreted in terms of conditional independence, while this is not true for non-hierarchical models.
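The hierarchy condition can be checked mechanically on the set of active terms. The following Python sketch is illustrative only; the names `is_hierarchical` and `hierarchical_closure` are hypothetical, and terms are represented as tuples of factor labels with the intercept left implicit.

```python
from itertools import combinations

def is_hierarchical(active_terms):
    """True if every non-empty proper subset of an active term is itself active."""
    active = {frozenset(t) for t in active_terms}
    for term in active:
        for k in range(1, len(term)):
            for sub in combinations(term, k):
                if frozenset(sub) not in active:
                    return False
    return True

def hierarchical_closure(active_terms):
    """Smallest hierarchical model containing all the given active terms."""
    closure = set()
    for term in active_terms:
        for k in range(1, len(term) + 1):
            closure.update(frozenset(sub) for sub in combinations(term, k))
    return closure

closure = hierarchical_closure([(1, 2, 3)])
```

The closure of a single three-factor interaction already contains seven terms, which shows how one active high order term can make the induced hierarchical model large.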

ℓ₁-Regularized model selection

The Lasso, originally proposed by [13] for linear regression, performs regularized parameter estimation and variable selection at the same time. The Lasso estimate is defined as follows:

{\overline{β}}^{λ} = \arg \min_{β} [\sum_{i} {(Y - X β)}_{i}^{2} + λ \sum_{j} | β_{j} |],

where Y = (Y₁,..., Y_n) is the response vector. This can also be viewed as a penalized Maximum Likelihood estimator, as $\sum_{i} {(Y - X β)}_{i}^{2}$ is proportional to the negative log-likelihood function for Gaussian linear regression. While the MLE for the general regression model is no longer uniquely defined and very poor in the case of more variables than observations, the Lasso estimator is still reasonable as long as λ > 0. For our analysis, we have a similar problem, namely that the MLE does not exist in case of zero counts in the contingency table: a detailed description of the existence of the MLE in general log-linear interaction models is given in [14]. Inspired by the Lasso, we estimate our parameter vector β by the following expression:

{\overline{β}}^{λ} = \arg \min_{β} [- l (β) + λ \sum_{j} | β_{j} |],

(4)

where l(β) is the log-likelihood function $l (β) = \log ℙ_{β} [n] \propto \sum_{i = 1}^{m} \frac{n_{i}}{n} {(X β)}_{i}$. This minimization has to be calculated under the additional constraint that the cell probabilities add to 1:

\sum_{i = 1}^{m} \exp {{(X β)}_{i}} = 1.

(5)

A problem of the optimization (4) is that the solution is no longer independent of the choice of the orthogonal subspaces X_a. That is, if any set of orthogonal columns X_a of X is reparametrized by a different orthogonal set, we get a different solution. To avoid this undesirable outcome we use a penalty that is intermediate between the ℓ₁- and the ℓ₂-penalty. This penalty, called group-ℓ₁-penalty, has the following form:

\sum_{a \subseteq C} {‖ β_{a} ‖}_{ℓ_{2}}, where {‖ β_{a} ‖}_{ℓ_{2}}^{2} = \sum_{j} {(β_{a})}_{j}^{2}

Originally, this has been proposed by [15] for the linear regression problem with factor variables. The estimator of β then becomes

{\overline{β}}^{λ} = \arg \min_{β} [- l (β) + λ \sum_{\begin{matrix} a \subseteq C \\ a \neq \emptyset \end{matrix}} {‖ β_{a} ‖}_{ℓ_{2}}],

(6)

subject to the constraint in (5). Imposing a penalty function on the coefficients of the log-linear interaction terms reduces the overfitting that might occur when using the MLE. Furthermore, while the ℓ₁-penalty encourages sparse solutions for the single components of β, the group ℓ₁-penalty encourages sparsity at the interaction level, meaning that the vector β_a, which corresponds to the interaction term a, is either present or absent in the model as a whole. In the case of factors with only 2 levels, the group ℓ₁-penalty and the ℓ₁-penalty are equivalent.
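The difference between the two penalties, and their equivalence for two-level factors, can be seen in a small Python sketch. The dictionary layout (interaction term mapped to its coefficient sub-vector β_a) and the numerical values are assumptions chosen purely for illustration.

```python
from math import sqrt

def l1_penalty(beta_groups):
    """Sum of absolute values over all individual coefficients."""
    return sum(abs(b) for group in beta_groups.values() for b in group)

def group_l1_penalty(beta_groups):
    """Sum over interaction terms a of the Euclidean norm of beta_a."""
    return sum(sqrt(sum(b * b for b in group)) for group in beta_groups.values())

# Two-level factors: every term a has a single coefficient, so both penalties agree
binary = {("C1",): [0.5], ("C2",): [-1.5], ("C1", "C2"): [2.0]}

# A hypothetical three-level factor: its main effect has two coefficients
three_level = {("C1",): [3.0, 4.0]}
```

For the three-level example the group penalty depends only on the length of the sub-vector, so it is unchanged when the two columns of X_a are replaced by another orthonormal basis of the same subspace, whereas the plain ℓ₁-penalty is not.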

For both the ℓ₁- and the group ℓ₁-regularization, the parameter λ can be assessed by cross-validation: we divide the individual counts into a number of equal parts and in turn leave out one part, using the rest to form a training contingency table with cell counts n_train. The solution for an array of values for λ, the so-called solution path, is calculated according to an algorithm described in the following Implementation section. The corresponding vectors of cell probabilities are denoted by p(${\overline{β}}^{λ}$). We then use the left-out part of the cell counts, n_test, to calculate the predictive negative log-likelihood score

\frac{- \sum_{i = 1}^{m} n_{t e s t, i} \cdot \log (p_{i} ({\overline{β}}^{λ}))}{\sum_{i = 1}^{m} n_{t e s t, i}},

(7)

which is proportional to the out-of-sample negative log-likelihood. This score is on the same scale when varying the number of observations and may therefore be used to compare contingency tables of the same dimension but with different numbers of cell entries. The parameter λ is chosen as the value which minimizes the cross-validated score in (7). We use a ten-fold cross-validation in our example.
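The splitting of individual counts and the score (7) can be sketched in Python as below. The helper names, the random assignment of observations to folds, and the toy counts are assumptions; the fitting step that would produce the probabilities p for each λ is deliberately left out.

```python
import random
from math import log

def predictive_score(n_test, p):
    """Equation (7): negative log-likelihood of the test counts per test observation."""
    total = sum(n_test)
    # Cells with zero test count contribute nothing and are skipped
    return -sum(n * log(pi) for n, pi in zip(n_test, p) if n > 0) / total

def split_counts(counts, folds, seed=0):
    """Distribute the individual observations over `folds` parts of near-equal size."""
    cells = [i for i, n in enumerate(counts) for _ in range(n)]
    random.Random(seed).shuffle(cells)
    tables = [[0] * len(counts) for _ in range(folds)]
    for pos, cell in enumerate(cells):
        tables[pos % folds][cell] += 1
    return tables

counts = [5, 0, 3, 2]
tables = split_counts(counts, folds=5)
# Training table for fold k: the total counts minus the held-out counts
train_0 = [n - t for n, t in zip(counts, tables[0])]
```

In a full cross-validation loop each `tables[k]` would in turn play the role of n_test, and the score would be averaged over the folds for every λ on the solution path.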

The resulting model does not necessarily have to be hierarchical. If we consider the hierarchical model induced by this procedure, the final model may be large, for example if a single high order interaction is estimated to be active. To address this, we set up the algorithm described in the next Section.

Level-ℓ₁-regularized model selection

In order to prevent the procedure from choosing single high-order interactions, we alter the ℓ₁-regularized algorithm described in the previous Section: we do not exclusively apply it to the fully saturated model but also to submodels with lower order interactions. Precisely, a model is fitted with main effects only, and the predictive negative log-likelihood score (7) is calculated for the best main effects model (level 1). The same is done for the model including all main effects and first order interactions (level 2). Proceeding accordingly, we get |C| log-likelihood scores corresponding to the |C| levels. The level with minimal score (7) is then chosen (and within this selected level, we have an ℓ₁-regularized estimate).

With this procedure the tendency of including a single high-order interaction while most of its lower order interactions are absent is decreased, and the inclusion is only forced if the predictive negative log-likelihood score strongly speaks in favour of the inclusion. Therefore we tend to select sparser models which can be better hierarchized and interpreted in terms of conditional independence, in contrast to the ordinary ℓ₁-model selection procedure.
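The level-wise procedure can be outlined in Python as follows. This is a sketch of the selection logic only: the function names are hypothetical, the per-level fitting and cross-validation are assumed to happen elsewhere, and the scores in the example are placeholder values, not results from the article.

```python
from itertools import combinations

def terms_up_to_level(q, level):
    """All interaction terms involving at most `level` of the q factors (intercept included)."""
    return [a for k in range(level + 1) for a in combinations(range(1, q + 1), k)]

def select_level(scores):
    """Return the level whose best cross-validated score (7) is smallest."""
    return min(scores, key=scores.get)

# Level 1: main effects only; level 2: main effects plus first order interactions
level_1_terms = terms_up_to_level(5, 1)
level_2_terms = terms_up_to_level(5, 2)

# Placeholder scores for illustration only (one per level)
placeholder_scores = {1: 2.9, 2: 2.6, 3: 2.7}
best_level = select_level(placeholder_scores)
```

Restricting the columns of X to `terms_up_to_level(q, level)` before running the ℓ₁-regularized fit is what keeps isolated high order terms out unless the score clearly favours them.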

Algorithm for ℓ₁-regularization for factors with two levels

For the regularization approaches we calculate ${\overline{β}}^{λ}$ over a large number of values of λ in order to do some cross-validation using (7). For this purpose, an efficient algorithm is required. As one can easily verify by introducing Lagrange multipliers, finding the solution to (6) under the constraint (5) is equivalent to minimizing an unconstrained function g(β):

g (β) = - l (β) + \sum_{i = 1}^{m} \exp (μ_{i}) + λ \sum_{\begin{matrix} a \subseteq C \\ a \neq \emptyset \end{matrix}} {‖ β_{a} ‖}_{ℓ_{2}},

(8)

with μ = X β and $l (β) \propto \sum_{i} \frac{n_{i}}{n} {(X β)}_{i}$. Here, g is a convex function. If each factor has two levels only, as in our application with single-gene libraries, we can set up an algorithm which efficiently yields the estimates for a whole sequence of parameters λ. Let $A$ denote the set of active interaction terms, which means that β_a ≠ 0 for a ∈ $A$; $X_{A}$ is the corresponding sub-matrix of X, $β_{A}$ the corresponding sub-vector of β and $g_{A}$ is g restricted to the subspace $β_{A}$. We restrict ourselves to the currently active set $A$, where $\nabla g_{A}$ and $\nabla^{2} g_{A}$ are well-defined:

\begin{matrix} \nabla g_{A} (β_{A}, λ) = - X_{A}^{t} \left( \frac{\mathbf{n}}{n} - \exp (X_{A} β_{A}) \right) + λ {(0, \mathrm{sign} (β_{A}))}^{t} \\ \nabla^{2} g_{A} (β_{A}, λ) = X_{A}^{t} \, \mathrm{diag} (\exp (X_{A} β_{A})) \, X_{A} . \end{matrix}

Here $\mathbf{n} = (n_{1}, ..., n_{m})^{t}$ is the vector of cell counts, n is its total, and exp and sign are applied componentwise. The algorithm, which is an adaptation of the path following algorithm proposed by [16], is set up as follows:

(1) Start with $\overline{β}$ = (-log(m), 0,..., 0).

(2) Set λ₀ = 1, $A$ = {∅} and t = 0.

(3) While (λ_t > λ_min):

(3.1) λ_{t+1} = λ_t - ε

(3.2) $A$ = $A$ ∪ {j ∉ $A$ : |[X^t · ($\frac{\mathbf{n}}{n}$ - exp (X $\overline{β}$))]_j| > λ_{t+1}}

(3.3) $\overline{β}$ is updated as $\overline{β}_{t+1}$ = $\overline{β}_{t}$ - $\nabla^{2} g_{A}$($\overline{β}_{t}$, λ_{t+1})^{-1} · $\nabla g_{A}$($\overline{β}_{t}$, λ_{t+1})

(3.4) $A$ = $A$ \ {j ∈ $A$ : |$\overline{β}_{t+1,j}$| < δ}

(3.5) t = t + 1

The pairs $({\overline{β}}_{t}, λ_{t})$, obtained from the algorithm above, represent the estimates from (6) under the constraint (5) for a range of penalty parameters λ_t, e.g. (λ₀ - λ_t = ε, 2ε, ...). The choice of the step length ε represents the tradeoff between computational complexity and accuracy. To increase accuracy, one can perform more than one Newton step (3.3) if the gradient starts deviating from zero. The coefficient δ is also flexible. Typically it is chosen in the order of ε. The lowest λ for which one wants the solution to be calculated is denoted by λ_min. Technical details concerning the algorithm can be found in the Appendix.
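For readers who want to experiment with the objective (8) on a small table, the following Python sketch minimizes it for a single λ. It is not the Newton-type path following algorithm described above: a simpler proximal-gradient iteration with soft-thresholding is swapped in to keep the example short. The unscaled ±1 design matrix, the fixed step size 1/m (an assumption that suits small tables whose fitted cell probabilities stay below 1), the iteration count and all function names are choices made here.

```python
from itertools import combinations, product
from math import exp, log, prod

def design_matrix(q):
    """Unscaled +/-1 design matrix for q two-level factors (first column all ones)."""
    cells = list(product((1, -1), repeat=q))
    terms = [a for k in range(q + 1) for a in combinations(range(q), k)]
    return [[prod(c[j] for j in a) for a in terms] for c in cells]

def soft_threshold(x, t):
    """Shrink x towards zero by t; the proximal map of the l1 penalty."""
    return max(abs(x) - t, 0.0) * (1 if x > 0 else -1)

def fit_l1(counts, q, lam, steps=5000):
    """Proximal-gradient minimisation of g in (8) for two-level factors."""
    X = design_matrix(q)
    m, n = len(X), sum(counts)
    freq = [c / n for c in counts]
    beta = [-log(m)] + [0.0] * (m - 1)         # start at uniform cell probabilities
    step = 1.0 / m                             # assumed fixed step size
    for _ in range(steps):
        mu = [sum(x * b for x, b in zip(row, beta)) for row in X]
        resid = [f - exp(u) for f, u in zip(freq, mu)]
        grad = [-sum(X[i][j] * resid[i] for i in range(m)) for j in range(m)]
        # The intercept is not penalised; all other coefficients are soft-thresholded
        beta = [beta[0] - step * grad[0]] + [
            soft_threshold(b - step * g, step * lam)
            for b, g in zip(beta[1:], grad[1:])
        ]
    return beta

counts = [10, 0, 0, 10]            # cells (1,1), (1,-1), (-1,1), (-1,-1)
beta_sparse = fit_l1(counts, q=2, lam=2.0)
beta_dense = fit_l1(counts, q=2, lam=0.1)
```

Because the intercept is unpenalized, its stationarity condition forces the fitted cell probabilities to sum to one, which is how the unconstrained objective (8) reproduces constraint (5). Calling `fit_l1` for a decreasing sequence of λ values gives a crude approximation of the solution path, although the path following algorithm above reuses the previous solution and is far more efficient.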

Testing

Data

We choose a true underlying interaction vector β for a model with 5 factors of 2 levels each. By enumerating the factors from 1 to 5, the generators of the model are 345 + 235 + 234 + 135 + 123 + 14, which means that all third and fourth order interactions are absent, and that only five of ten second order interactions but all first order interactions are present. The corresponding coefficients of β are independently simulated using a normal distribution with mean zero and variance one.

Then, 250 draws from a multinomial distribution with probability vector p where log (p) = X β, are taken. This corresponds to a reasonable number of cDNAs in a single-gene library. This is then repeated 10 times. With our choice of β, the resulting contingency tables are sparse. With the simulated cell counts, $\overline{β}$ is estimated with different methods described in the previous sections and these methods are then compared as follows:
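A simulation of this kind can be sketched in Python using only the standard library. The generators, the standard normal coefficients and the 250 draws follow the description above; the function name, the seed, and the way the intercept is fixed (by normalizing the probabilities to sum to one) are assumptions of this sketch.

```python
import random
from itertools import combinations, product
from math import exp, log, prod

def simulate_table(q, active_terms, n_draws, seed=0):
    """Draw N(0, 1) coefficients for the active terms and sample a contingency table."""
    rng = random.Random(seed)
    cells = list(product((1, -1), repeat=q))
    beta = {a: rng.gauss(0.0, 1.0) for a in active_terms}
    # Linear predictor without intercept; factors in a term are numbered from 1
    mu = [sum(b * prod(c[j - 1] for j in a) for a, b in beta.items()) for c in cells]
    norm = log(sum(exp(u) for u in mu))        # plays the role of -beta_emptyset
    p = [exp(u - norm) for u in mu]
    counts = [0] * len(cells)
    for i in rng.choices(range(len(cells)), weights=p, k=n_draws):
        counts[i] += 1
    return beta, p, counts

generators = [(3, 4, 5), (2, 3, 5), (2, 3, 4), (1, 3, 5), (1, 2, 3), (1, 4)]
# Hierarchical model: every non-empty subset of a generator is an active term
active = sorted(
    {sub for g in generators for k in range(1, len(g) + 1) for sub in combinations(g, k)},
    key=lambda a: (len(a), a),
)
beta, p, counts = simulate_table(5, active, n_draws=250)
```

With 250 observations spread over 32 cells whose probabilities can differ by orders of magnitude, some cells will typically stay empty, though the exact number depends on the simulated coefficients.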

Criteria

As a model selection score (MSS), we consider the fraction of correctly assigned model terms:

MSS = 1 - \frac{1}{m} \sum_{i = 1}^{m} | 1_{\{β_{i} \neq 0\}} - 1_{\{{\overline{β}}_{i} \neq 0\}} | .

Moreover, we consider the root mean squared error for the interaction coefficients,

RMSE = \sqrt{\frac{1}{m} \sum_{i = 1}^{m} {({\overline{β}}_{i} - β_{i})}^{2}} .

For assessing how much the estimation of β varies over multiple datasets, we calculate for every coefficient ${\overline{β}}_{i}$ the estimated standard deviation ${\overline{σ}}_{i}$ . The means of these standard deviations are reported as

SPREAD = \frac{1}{m} \sum_{i = 1}^{m} {\overline{σ}}_{i},

a measure of variability.

To compare the different procedures for estimation of probabilities p = exp (X β), we calculate the negative log-likelihood score (NLS) similar to the score in (7):

NLS (\overline{β}) = - \sum_{i = 1}^{m} p_{i} \cdot \log (p_{i} (\overline{β})) .
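The four criteria translate directly into code. The following Python sketch is an illustration with hypothetical function names; it assumes that the estimated standard deviation in SPREAD is the sample standard deviation across datasets, which the text does not specify.

```python
from math import log, sqrt
from statistics import stdev

def mss(beta_true, beta_hat):
    """Fraction of coefficients whose zero / non-zero status is estimated correctly."""
    m = len(beta_true)
    return 1 - sum(abs((b != 0) - (h != 0)) for b, h in zip(beta_true, beta_hat)) / m

def rmse(beta_true, beta_hat):
    """Root mean squared error of the estimated coefficients."""
    m = len(beta_true)
    return sqrt(sum((h - b) ** 2 for b, h in zip(beta_true, beta_hat)) / m)

def spread(estimates):
    """Mean over coefficients of the standard deviation across datasets."""
    m = len(estimates[0])
    return sum(stdev([e[i] for e in estimates]) for i in range(m)) / m

def nls(p_true, p_hat):
    """Negative log-likelihood score of the estimated probabilities under the true ones."""
    return -sum(p * log(ph) for p, ph in zip(p_true, p_hat) if p > 0)
```

As a usage example, `mss([1, 0, 0, 2], [0.5, 0, 3, 0])` counts two wrongly assigned terms out of four and therefore evaluates to one half.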

Results of simulation study

The results of the simulation study are summarized in Table 1, where we also include the MAP estimators of the Bayesian approaches described in Additional file 1 Section 2. We notice that the penalty-based regularization approaches proposed in this article lead to comparable or better results than the Bayesian approaches with respect to the NLS-score, RMSE and the variation (SPREAD), though the results of Bayesian approaches vary with the prior and the set of possible priors has not been extensively explored.

Table 1 Performance of different algorithms


The level-ℓ₁-regularization and the relaxed ℓ₁-regularization (see below) are both competitive and can be better than MCMC for model selection.

The results of the MCMC procedures are sensitive to the choice of the prior value or the prior distribution for σ². A flat prior for α_a (σ² = 2) results in worse performance than that of a prior that shrinks the coefficients more towards zero (σ² = 1/2). This suggests that specification of this prior hyperparameter may be difficult in practice, while we can easily optimize λ in the regularization approach by cross-validation.

The MCMC approaches without model selection perform poorly, as should be expected from data generated by a sparse model. MCMC methods based on a non-hierarchical model selection are also clearly inferior to the hierarchical counterpart. This is not surprising, as we have simulated data from a hierarchical model. In Table 1 we have also added an additional approach, denoted by ℓ₂, the equivalent to the ℓ₁-regularization but using an ℓ₂-penalty instead of an ℓ₁-penalty on the coefficients of the log-linear model. This method is equivalent to the MAP estimator with Gaussian priors on β_a, with the parameter of the distribution optimized by cross-validation. This Ridge-type method does not perform variable selection, but it is competitive for all other criteria that we assessed.

In addition we consider the relaxed ℓ₁-regularization approach. Rather than using a single penalty parameter λ, the idea of this method is to control variable selection and parameter estimation by incorporating two penalty parameters. For linear regression it has been proven theoretically as well as empirically [17] that under suitable conditions the relaxed ℓ₁-regularization is better than Lasso.

Overall, the level-ℓ₁-regularization has good model selection performance (high MSS score) in combination with a low negative log-likelihood score (NLS) and a low root mean squared error for the true β (RMSE). In addition, it is feasible to optimize the tuning parameter λ by cross-validation as the computational cost is very low compared to the MCMC approaches. On the other hand, posterior distributions of estimates from MCMC methods provide additional information about uncertainty in the model space, compared to point estimates from ℓ₁- or ℓ₂-regularization.

Implementation

Dataset

We estimate the splicing interaction pattern for a dataset corresponding to the itpr1 gene, one of three mammalian genes encoding receptors for the second messenger inositol 1,4,5-trisphosphate (InsP₃). This gene is subject to alternative RNA splicing, with seven sites of transcript variation, 6 of these within the ORF and among these, q = 5 were completely assessed in the single-gene libraries. Five single-gene libraries were built, one for adult rat cerebrum as well as four for different stages of postnatal cerebellar development, namely on days 6, 12, 22 and 90, the latter being considered as adult. Each library consists of between 179 and 277 transcripts which were assessed, i.e. $\sum_{j = 1}^{m} n_{j}$ ∈ [179, 277]. This gene is 89% identical at the cDNA level and 95% identical at the amino acid level with the human receptor gene. The complete dataset can be found in [18].

Results of application to Single-Gene Libraries

Unless stated differently, we report the results using the level ℓ₁-penalization method. We display the interaction vector $\overline{β}$ graphically by plotting the components ${\overline{β}}_{j}$ for the different tissue and development stages in Figure 1. Our results suggest that the exons interact mainly in pairs and there is no reliably estimated higher order interaction in the splicing interaction pattern of rat cerebellum. We further notice that the main interaction pattern is very well conserved over different developmental stages. A strong mutual interaction between exons three, four and five can be observed in all development stages of rat cerebellum as well as in the cerebral tissue. The biggest changes in the interaction pattern during development of rat cerebellum occur from postnatal day six to day 12. This can be seen at position number 10 on the x-axis in Figure 1, which corresponds to the first order interaction between exons two and three. From day 12 to day 22, the first main effect changes in sign and magnitude: it decreases progressively from day 6 to adult, reversing in sign between day 12 and 22. Between day 22 and 90, the interaction pattern is strongly conserved. Comparing the splicing interaction patterns between cerebellum and cerebrum in the adult rat, we see a much more complex pattern in the cerebrum, involving several second order interactions, and therefore a clear distinction from that of the cerebellum.

The conditional independence graphs for the estimated log-linear models are drawn in Figure 2, where the thickness of each edge is proportional to the corresponding coefficient of the interaction vector $\overline{β}$ (the largest, if there are several giving rise to the same edge) and the radius of each vertex is chosen proportional to the corresponding main effect coefficient. Figure 2 displays the strongly conserved interactions between exons three, four and five. Except for a rather strong interaction between exon two and three on day six, all other interactions appear to be rather small. The graphical representation of the interaction pattern of adult rat cerebrum reveals a more complex interaction pattern with no conditional independences.

The approaches and results presented here can provide valuable insight into the underlying processes in alternative splicing in general, and specifically in the brain development experiments considered here. Most striking is the strong conservation over developmental stages at day 12, 22 and 90 (adult); some differences appear between postnatal day six and day 12. Also, the conservation between the cerebellum and cerebrum is less pronounced than over developmental stages. Finally, second- or higher-order interaction terms seem to be of minor relevance, suggesting that in this gene/tissue combination, direct interaction mainly happens between pairs of exons, but not combinations of three or more exons.

We have also estimated β with the hierarchical Bayesian approach using MCMC. For the choice of σ² = 1 this resulted in interaction patterns very similar to those of the level ℓ₁-penalization method. For σ² = 2 it led to remarkably different results. In addition, a further dataset was analyzed; details can be found in Additional file 1 Section 3.

Conclusion

We have developed an efficient method for identifying interaction patterns of categorical variables. This can be used to fit a graphical model which is a valuable tool to visualize the conditional dependence structure among the random variables. In a simulation study, the results of the new level-ℓ₁-regularization method are superior in comparison to ℓ₁-regularization and slightly better than the MAP estimator from some of the MCMC methods we considered. With real data, the level ℓ₁-regularization and hierarchical Bayesian approach led to similar results, subject to a specific choice of priors for the Bayesian method. An important computational advantage of the level-ℓ₁-method in comparison to MCMC is that cross-validation becomes feasible, which in turn allows for an empirical choice of the tuning parameter. While the methodology described in this article is motivated by the study of exon splicing interactions in single-gene transcriptomes, it provides a general and flexible toolbox for regularization analysis in relatively high dimensional, sparse contingency tables. Model selection in high dimensional contingency tables has been a traditionally challenging area, and we hope that our generalization of regularization methodologies to this context will prove useful in a variety of areas of computational biology and biostatistics. Several technologies generate categorical data: these include SNP chips that provide genotype and copy number information at the DNA level, sequencing technologies, assays that study binding properties of proteins and binding of RNA to DNA, a variety of disease phenotypes, and more. In most of these contexts the interactions among the variables are critical features in systems biology investigations that aim at studying how the components of complex systems work together in influencing biological outcomes.
For example, the log-linear models described here provide a natural approach for fitting very general classes of networks to discrete data. The level-ℓ₁-regularization is a general tool which can be applied to a wide variety of problems involving sparse contingency tables.

An R package called logilasso will be available for download on the Comprehensive R Archive Network (CRAN).

Appendix

We note that if β is a minimum of g, then $β_{A}$ is a minimum of $g_{A}$.

In our application with single-gene libraries, all factors have only two levels, which allows us to construct an efficient algorithm. The gradient of the smooth part of the objective is

\nabla [- l (β) + \sum_{j = 1}^{m} \exp (μ_{j})] = - X^{t} \cdot (\frac{\mathbf{n}}{n} - \exp (X β)),

where $\mathbf{n}$ denotes the vector of cell counts, n the total count, and exp(X β) is understood as the componentwise exponential function. It follows that for a minimum $β_{A}$ of $g_{A}$, the following equation holds:

\nabla g_{A} (β_{A}) = - X_{A}^{t} \cdot (\frac{\mathbf{n}}{n} - \exp (X_{A} β_{A})) + {(0, \operatorname{sign} (β_{A}))}^{t} \cdot λ = 0

(9)
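The gradient formula can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a saturated model for binary factors with +1/-1 coding, and a log-likelihood scaled as l(β) = Σᵢ (nᵢ/n) μᵢ, which is what the stated gradient implies. The helper names (`design_matrix`, `smooth_part`, `smooth_gradient`, `numeric_gradient`) and the counts are hypothetical.

```python
import math
from itertools import combinations, product

def design_matrix(d):
    """Saturated design matrix for d binary factors, +1/-1 coding (assumed)."""
    cells = list(product([1, -1], repeat=d))
    subsets = [s for k in range(d + 1) for s in combinations(range(d), k)]
    # The column for subset s is the product of the codes of the factors in s;
    # the empty subset gives the intercept column of ones.
    return [[math.prod(c[i] for i in s) for s in subsets] for c in cells]

def smooth_part(X, beta, freq):
    """-l(beta) + sum_j exp(mu_j), with mu = X beta and l = sum_i freq_i * mu_i."""
    mu = [sum(x * b for x, b in zip(row, beta)) for row in X]
    return -sum(f * m for f, m in zip(freq, mu)) + sum(math.exp(m) for m in mu)

def smooth_gradient(X, beta, freq):
    """Analytic gradient -X^t (freq - exp(X beta))."""
    mu = [sum(x * b for x, b in zip(row, beta)) for row in X]
    resid = [f - math.exp(m) for f, m in zip(freq, mu)]
    return [-sum(X[i][j] * resid[i] for i in range(len(X)))
            for j in range(len(beta))]

def numeric_gradient(X, beta, freq, eps=1e-6):
    """Central finite differences of smooth_part, one coordinate at a time."""
    grad = []
    for j in range(len(beta)):
        up, down = beta[:], beta[:]
        up[j] += eps
        down[j] -= eps
        grad.append((smooth_part(X, up, freq) - smooth_part(X, down, freq)) / (2 * eps))
    return grad

X = design_matrix(2)                      # 4 cells, columns: intercept, A, B, AB
counts = [3, 0, 1, 4]                     # hypothetical sparse table with a zero cell
freq = [c / sum(counts) for c in counts]  # the vector n / n
beta = [-1.5, 0.2, -0.1, 0.3]             # arbitrary evaluation point
analytic = smooth_gradient(X, beta, freq)
numeric = numeric_gradient(X, beta, freq)
```

The two gradients should agree up to finite-difference error, including for tables that contain zero cells.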

Without loss of generality, we can restrict ourselves to the subspace β ∈ ℝ⁻ × ℝ^{m−1}, because the constraint (5) can only be satisfied for β_∅ < 0, as is proved in Lemma 1 below. Therefore β_∅ ∈ $A$.

Lemma 1. β_∅ < 0 for a minimum of g(β) for all λ ∈ ℝ⁺.

Proof.

log(p) = X β < 0 componentwise, which yields (1, ..., 1) X β = mβ_∅ < 0 and hence β_∅ < 0.

The equality holds because (1, ..., 1) is orthogonal to all columns of X except for the first one. □
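The orthogonality used in the proof can be illustrated on a small example. This is a sketch under the assumption of a saturated model for binary factors with +1/-1 coding; the helper `design_matrix` and the probabilities are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations, product

def design_matrix(d):
    """Saturated design matrix for d binary factors, +1/-1 coding (assumed)."""
    cells = list(product([1, -1], repeat=d))
    subsets = [s for k in range(d + 1) for s in combinations(range(d), k)]
    return [[math.prod(c[i] for i in s) for s in subsets] for c in cells]

X = design_matrix(3)
m = len(X)  # number of cells, here 8

# (1, ..., 1) X: every column sums to zero except the intercept column.
column_sums = [sum(row[j] for row in X) for j in range(len(X[0]))]

# Hypothetical cell probabilities, strictly between 0 and 1 and summing to 1.
raw = [5, 1, 2, 8, 3, 1, 4, 6]
p = [r / sum(raw) for r in raw]
log_p = [math.log(q) for q in p]

# Under this coding X^t X = m I, so beta = X^t log(p) / m solves X beta = log(p).
beta = [sum(X[i][j] * log_p[i] for i in range(m)) / m for j in range(m)]
intercept = beta[0]  # equals the mean of log(p), hence negative
```

Here the intercept is the average of the log cell probabilities, which makes the sign claim of Lemma 1 visible directly.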

Additionally, a necessary condition for β to be a minimum is:

| {(X^{t} \cdot (\frac{\mathbf{n}}{n} - \exp (X β)))}_{j} | < λ, \quad \forall j \notin A .

(10)
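Conditions (9) and (10) can be verified for a candidate β with a few lines of code. The sketch below assumes +1/-1 coding for binary factors, treats every nonzero coefficient (plus the intercept) as active, and uses a numerical tolerance for the equalities in (9); the function names and the example table are hypothetical.

```python
import math
from itertools import combinations, product

def design_matrix(d):
    """Saturated design matrix for d binary factors, +1/-1 coding (assumed)."""
    cells = list(product([1, -1], repeat=d))
    subsets = [s for k in range(d + 1) for s in combinations(range(d), k)]
    return [[math.prod(c[i] for i in s) for s in subsets] for c in cells]

def check_conditions(X, beta, freq, lam, tol=1e-10):
    """Return True if beta meets (9) on the active set and (10) off it."""
    mu = [sum(x * b for x, b in zip(row, beta)) for row in X]
    resid = [f - math.exp(m) for f, m in zip(freq, mu)]
    # Gradient of the smooth part: -X^t (freq - exp(X beta)).
    grad = [-sum(X[i][j] * resid[i] for i in range(len(X)))
            for j in range(len(beta))]
    for j, (g, b) in enumerate(zip(grad, beta)):
        if j == 0:
            ok = abs(g) <= tol                                # intercept is unpenalized
        elif b != 0:
            ok = abs(g + lam * math.copysign(1.0, b)) <= tol  # condition (9)
        else:
            ok = abs(g) < lam                                 # condition (10)
        if not ok:
            return False
    return True

X = design_matrix(2)                       # columns: intercept, A, B, AB
freq = [3 / 8, 0.0, 1 / 8, 4 / 8]          # hypothetical relative frequencies
intercept_only = [-math.log(4), 0.0, 0.0, 0.0]

# For this table, the intercept-only fit has an AB gradient of magnitude 0.75,
# so it satisfies (10) for lambda = 1 but not for lambda = 0.5.
b3 = math.atanh(0.25)
with_interaction = [-math.log(4 * math.cosh(b3)), 0.0, 0.0, b3]
```

For λ = 0.5 the candidate `with_interaction` was derived by hand for this particular table; it is included only to show a case where (9) holds with a nonzero penalized coefficient.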

Conditions (9) and (10) are sufficient for β to be a minimum of (8). To find the β's that solve these equations for an array of values of λ, we set up a so-called path-following algorithm. The idea is to start from an optimal solution $β^{λ_{0}}$ for λ₀ and follow the path for decreasing λ, using a second-order approximation for $β_{A}$. In the following, we restrict ourselves to the currently active set $A$ and omit the index $A$. It then holds:

\nabla g (β_{t + 1}, λ_{t + 1}) = 0 \approx \nabla g (β_{t}, λ_{t + 1}) + \nabla^{2} g (β_{t}, λ_{t + 1}) \, δ β .

This implies

δ β = - {[\nabla^{2} g (β_{t}, λ_{t + 1})]}^{- 1} \nabla g (β_{t}, λ_{t + 1}) .

(11)

The algorithm tries to follow the optimal path as closely as possible. At each step, it aims to meet the conditions (9) and (10). In step (3.2), the active set $A$ is identified, which forces $\overline{β}$ to meet the condition (10). In step (3.3), a Newton step as described in (11) is performed. Starting from a solution which meets condition (9), the new ${\overline{β}}^{λ}$ approximately meets (9) again.
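The interplay of active-set identification and the Newton step (11) can be sketched as follows. This is an illustration only, not the logilasso implementation. It assumes +1/-1 coding for binary factors, an intercept-only start that is optimal for the largest λ, a variable entering the active set when its gradient magnitude reaches λ (with sign opposite to the gradient), removal when a coefficient crosses zero, and several Newton corrector steps per λ instead of a single one. All function names are hypothetical.

```python
import math
from itertools import combinations, product

def design_matrix(d):
    """Saturated design matrix for d binary factors, +1/-1 coding (assumed)."""
    cells = list(product([1, -1], repeat=d))
    subsets = [s for k in range(d + 1) for s in combinations(range(d), k)]
    return [[math.prod(c[i] for i in s) for s in subsets] for c in cells]

def solve(H, b):
    """Solve H x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [H[i][:] + [b[i]] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for q in range(c, k + 1):
                M[r][q] -= f * M[c][q]
    x = [0.0] * k
    for c in reversed(range(k)):
        x[c] = (M[c][k] - sum(M[c][q] * x[q] for q in range(c + 1, k))) / M[c][c]
    return x

def cell_means(X, beta):
    """Componentwise exp(X beta)."""
    return [math.exp(sum(x * b for x, b in zip(row, beta))) for row in X]

def smooth_gradient(X, beta, freq):
    """-X^t (freq - exp(X beta))."""
    w = cell_means(X, beta)
    return [-sum(X[i][j] * (freq[i] - w[i]) for i in range(len(X)))
            for j in range(len(beta))]

def newton_step(X, beta, freq, lam, active, signs):
    """One step of (11) on the active set; signs[0] is 0 (unpenalized intercept)."""
    w = cell_means(X, beta)
    sg = smooth_gradient(X, beta, freq)
    grad = [sg[j] + lam * signs[j] for j in active]
    hess = [[sum(X[i][a] * w[i] * X[i][b] for i in range(len(X))) for b in active]
            for a in active]
    delta = solve(hess, [-g for g in grad])
    new_beta = beta[:]
    for j, step in zip(active, delta):
        new_beta[j] += step
    return new_beta

def follow_path(X, freq, lambdas, corrector_steps=5):
    """Follow the solution over a decreasing grid of lambda values."""
    p = len(X[0])
    beta = [-math.log(len(X))] + [0.0] * (p - 1)  # intercept-only start
    active, signs, path = [0], [0] * p, []
    for lam in lambdas:
        for _ in range(corrector_steps):
            sg = smooth_gradient(X, beta, freq)
            for j in range(1, p):                 # enforce condition (10)
                if j not in active and abs(sg[j]) >= lam:
                    active.append(j)
                    signs[j] = -1 if sg[j] > 0 else 1
            beta = newton_step(X, beta, freq, lam, active, signs)
            for j in [a for a in active if a != 0 and beta[a] * signs[a] < 0]:
                beta[j], signs[j] = 0.0, 0        # coefficient crossed zero: drop it
                active.remove(j)
        path.append((lam, beta[:]))
    return path

X = design_matrix(2)                              # columns: intercept, A, B, AB
freq = [3 / 8, 0.0, 1 / 8, 4 / 8]                 # hypothetical relative frequencies
path = follow_path(X, freq, [1.0, 0.9, 0.8, 0.7, 0.6, 0.5])
```

For this hypothetical table, only the AB interaction enters the model on the grid shown, once λ falls below 0.75; the main effects stay at zero.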

Acknowledgements

CD was partially supported by the Swiss National Science Foundation grant number NF 200020-113270 and by a PhD scholarship from the CC-SPMD. GP was partly supported by NSF grant DMS034211.

Author information

Authors and Affiliations

Seminar für Statistik, ETH Zürich, CH-8092, Zürich, Switzerland
Corinne Dahinden & Peter Bühlmann
Competence Center for Systems Physiology and Metabolic Diseases, ETH Zürich, CH-8093, Zürich, Switzerland
Corinne Dahinden & Peter Bühlmann
Departments of Oncology and Biostatistics, Johns Hopkins Schools of Medicine and Public Health, Baltimore, MD, USA
Giovanni Parmigiani
Department of Physiology, Johns Hopkins School of Medicine, Baltimore, MD, USA
Mark C Emerick

Corresponding author

Correspondence to Corinne Dahinden.

Additional information

Authors' contributions

CD derived the mathematical details, implemented and tested the algorithm. GP initiated the project, suggested ideas and edited the manuscript. ME provided the datasets and the biological interpretation. PB supervised the project and suggested some of the main ideas. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2007_1848_MOESM1_ESM.pdf

Additional file 1: The Additional file consists of 3 sections. Section 1 contains details concerning the parametrization of the log-linear model. Section 2 describes some Bayesian model selection approaches, which were used for comparison with our algorithm. In Section 3 a further dataset on which we tested our algorithm is introduced and the results are given on that dataset. (PDF 190 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Dahinden, C., Parmigiani, G., Emerick, M.C. et al. Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries. BMC Bioinformatics 8, 476 (2007). https://doi.org/10.1186/1471-2105-8-476

Received: 16 March 2007
Accepted: 11 December 2007
Published: 11 December 2007
DOI: https://doi.org/10.1186/1471-2105-8-476
