RNA-Seq multiread expression estimation
As we seek to extend the prevalent generative model of RNA-Seq [7–11], we begin by reviewing the basic elements of that model. Let G = (G_1, ..., G_M) be the set of M transcribed regions considered and P = (P_1, ..., P_M) be the proportions of RNA bases attributed to each transcript out of the total number of transcribed bases in a sequenced sample. Regions may be either genes or transcripts, depending on the level of transcription being investigated. We require P to satisfy ∑_{g∈G} P_g = 1 and ∀g∈G, 0 ≤ P_g ≤ 1.
The model describes an RNA sequencing experiment where regions in G are randomly chosen according to the distribution P, start positions in these regions are chosen uniformly, and reads of length ℓ are generated by copying ℓ consecutive bases from each chosen region to produce a set of reads R = (r_1, ..., r_ρ). Sequencing is assumed to be error prone, leading to a certain probability of error for each read base. Because of repeats among the regions and errors in alignment, reads may fail to map to the region from which they originate or may map to additional locations. Thus, we assign a probability P(r_j | G_k) of obtaining read r_j given that it originated from region G_k. In this case we rely on the alignment of r_j to G_k to afford us the best match position instead of summing over all possible starting positions. Here ℓ_k is the effective length of G_k (i.e., the number of start positions from which a full-length read can be derived) as defined in [11], ϵ is taken to be a constant per-base error rate, errors are assumed to be independent, and error_jk is the number of mismatches in the best alignment of r_j to G_k.
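The quantities just defined can be combined into a per-read probability. The sketch below assumes the usual form for this kind of model: a uniform start position (1/ℓ_k) multiplied by independent per-base error terms, ϵ^error_jk (1 − ϵ)^(ℓ − error_jk). That form is an assumption made for illustration, and the function and parameter names are hypothetical.

```python
def read_given_region_prob(n_mismatches, effective_length, read_length, eps):
    """Probability of a read given its region of origin.

    Assumed form (not confirmed by the text):
        (1 / l_k) * eps**error_jk * (1 - eps)**(l - error_jk)
    """
    if effective_length <= 0:
        # No start position can yield a full-length read.
        return 0.0
    start_prob = 1.0 / effective_length  # uniform choice of start position
    mismatch_term = eps ** n_mismatches  # independent per-base errors
    match_term = (1.0 - eps) ** (read_length - n_mismatches)
    return start_prob * mismatch_term * match_term
```

For example, `read_given_region_prob(1, 1000, 50, 0.01)` scores a 50-base read aligned with one mismatch to a region with 1000 valid start positions.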
This formulation leads to the likelihood of observing the data:
(1)
This likelihood function is used to estimate P given the read alignments. Typically, one will use expectation maximization to find the P for which the likelihood is maximized. It is assumed that P(r_j | G_k) is zero for all regions to which r_j is not aligned.
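The expectation maximization procedure mentioned here can be sketched for a single sample as follows. The sketch assumes the likelihood is a product over reads of ∑_k P_k P(r_j | G_k), which is the standard multiread mixture form. The function name and the dense list-of-lists input are illustrative choices only.

```python
def em_single_sample(q, n_iter=100):
    """Estimate region proportions P for one sample by EM.

    q[j][k] holds P(r_j | G_k), with zero where read j is not aligned
    to region k. Assumes the likelihood prod_j sum_k P_k * q[j][k].
    """
    n_regions = len(q[0])
    p = [1.0 / n_regions] * n_regions  # uniform starting point
    for _ in range(n_iter):
        counts = [0.0] * n_regions
        for row in q:
            # E step: fractional assignment of this read to each region.
            weights = [p[k] * row[k] for k in range(n_regions)]
            total = sum(weights)
            if total == 0.0:
                continue  # read is incompatible with every region under p
            for k in range(n_regions):
                counts[k] += weights[k] / total
        assigned = sum(counts)
        if assigned == 0.0:
            break  # nothing could be assigned; keep the current estimate
        # M step: proportions are the normalized expected read counts.
        p = [c / assigned for c in counts]
    return p
```

With two reads unique to the first region, one unique to the second, and one multiread matching both equally well, the estimate settles near (2/3, 1/3): the multiread is shared in proportion to the evidence from the unique reads.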
Common population extension
To estimate expression levels in N individuals from a defined population, we modify the above model by assuming that samples are drawn from a common population. This is imposed by having P = [(P_11, ..., P_1M), ..., (P_N1, ..., P_NM)] be probability densities drawn from a common Dirichlet distribution, defined by a set of hyper-parameters specific to the population: ∀i∈[1, N], p_i = (P_i1, ..., P_iM) ~ Dirichlet(α_1, ..., α_M).
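The generative assumption can be illustrated by simulating expression vectors for N samples from one shared Dirichlet distribution. This is a minimal sketch that uses the standard construction of a Dirichlet draw from normalized Gamma variates. The function names and the seed are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng=None):
    """Draw one probability vector from Dirichlet(alpha)."""
    rng = rng or random.Random()
    # Independent Gamma(alpha_k, 1) draws, normalized to sum to one.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_population(alpha, n_samples, seed=0):
    """Draw p_1, ..., p_N from the same Dirichlet(alpha)."""
    rng = random.Random(seed)
    return [sample_dirichlet(alpha, rng) for _ in range(n_samples)]
```

Larger α values concentrate the simulated samples around the shared mean α_k / ∑_k α_k, while small values allow samples to differ widely from one another.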
For sample i, we denote the set of reads as R_i = (r_{i1}, ..., r_{iρ_i}), where each r_ij is mapped to one or more regions in G. The output of a read alignment program defines the set of accepted regions for the read (in practice only alignments with up to 2 errors are accepted) and provides the number of errors in alignment for each read-region pair. This allows us to calculate P(r_ij | G_k) as done above for one sample. For convenience we denote P(r_ij | G_k) = q_ijk (taken to be zero for all regions not mapped to), which is independent of α and P.
As before, our objective is to estimate P, but in this case we must optimize by estimating P and α together. We begin by writing the likelihood function:
(2)
Since expression values are sampled from the Dirichlet distribution,
(3)
where
(4)
and similar to (1) above,
(5)
This leads to
(6)
Taking the log, we get
(7)
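To make the objective concrete, the joint log-likelihood can be evaluated numerically. The sketch below assumes it is the sum over samples of a Dirichlet log-density for p_i plus the read term ∑_j log ∑_k P_ik q_ijk, as the surrounding description suggests. The exact arrangement of terms is an assumption and the function name is hypothetical.

```python
import math


def joint_log_likelihood(p, q, alpha):
    """Log-likelihood of all samples under a shared Dirichlet prior.

    p[i][k]    : expression proportions of sample i
    q[i][j][k] : P(r_ij | G_k), zero for regions the read is not mapped to
    alpha[k]   : Dirichlet hyper-parameters
    Assumed form: sum_i [ log Dir(p_i | alpha) + sum_j log sum_k p_ik q_ijk ].
    """
    # Log of the Dirichlet normalizing constant, shared by all samples.
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    total = 0.0
    for p_i, q_i in zip(p, q):
        total += log_norm
        # Prior term; p_ik must be positive here or math.log raises.
        total += sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p_i))
        # Read term: each read contributes the log of its mixture probability.
        for row in q_i:
            total += math.log(sum(x * q_ijk for x, q_ijk in zip(p_i, row)))
    return total
```

Because the prior term takes the log of every P_ik, zero entries have to be replaced by a small positive value before this function is called.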
Multi-Genome Multi-Read (MGMR) algorithm
We wish to estimate α and p_1, ..., p_N to maximize equation (7) above. For this purpose, we adopt an alternating iterative procedure of estimating α given the current estimate of p_1, ..., p_N and vice versa until the total change in α becomes sufficiently small (or until a pre-set number of iterations have been executed).
Although for EM-based estimation methods convexity guarantees an optimal solution will be obtained, here (as shall be seen below) we have no such guarantee. Thus, we confine updates to be local by performing only one update for P and one for α. By one MGMR iteration, we refer to one EM-based P update followed by one α update.
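The alternating procedure can be sketched as a short driver loop. The two update rules are passed in as functions so that the sketch stays independent of their exact form. The names `update_p` and `update_alpha`, the tolerance, and the use of the summed absolute change in α as the stopping measure are all illustrative assumptions.

```python
def mgmr_loop(p, alpha, update_p, update_alpha, tol=1e-6, max_iter=100):
    """Alternate one P update and one alpha update per iteration.

    update_p(p, alpha) and update_alpha(p, alpha) are caller-supplied
    functions that each perform a single update and return the new value.
    Stops when the total change in alpha falls below tol or after
    max_iter iterations.
    """
    iterations = 0
    for iterations in range(1, max_iter + 1):
        p = update_p(p, alpha)              # one EM-based P update
        new_alpha = update_alpha(p, alpha)  # one alpha update
        change = sum(abs(a_new - a_old)
                     for a_new, a_old in zip(new_alpha, alpha))
        alpha = new_alpha
        if change < tol:
            break
    return p, alpha, iterations
```

Performing exactly one update of each kind per pass keeps every step local, which matters when the objective offers no guarantee of a single optimum.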
Estimating P given α
If we assume α is given, we can write the EM steps to find p_1, ..., p_N:
E step Letting Match signify a matching between reads and regions, and Match(j) be the region from which read j originates, we get:
(8)
which leads to
(9)
(10)
(11)
where
(12)
M step Given that each q_ijk is fixed, the above reduces to maximizing
(13)
Using Lagrange multipliers and differentiating, we see that this is maximized with
(14)
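One EM pass for a single sample with α held fixed can be sketched as below. It assumes the update takes the usual maximum a posteriori form for a Dirichlet prior: expected read counts plus α_k − 1, normalized over regions. That form is an assumption, and the function name and the `add_one` flag are hypothetical. The flag mirrors the option of adding one to each α so that the −1 terms cancel.

```python
def update_p_given_alpha(p_i, q_i, alpha, add_one=True):
    """One E step and M step for sample i with alpha held fixed.

    Assumed update: P_ik proportional to sum_j a_ijk + alpha_k - 1, where
    a_ijk is the posterior probability that read j came from region k.
    With add_one=True each alpha is raised by one, cancelling the -1 and
    keeping every numerator non-negative.
    """
    n_regions = len(alpha)
    shift = 0.0 if add_one else 1.0
    counts = [0.0] * n_regions
    for row in q_i:
        # E step: posterior assignment of this read to each region.
        weights = [p_i[k] * row[k] for k in range(n_regions)]
        total = sum(weights)
        if total == 0.0:
            continue  # read cannot be assigned under the current p_i
        for k in range(n_regions):
            counts[k] += weights[k] / total
    # M step: add the prior pseudo-counts and normalize.
    numer = [counts[k] + alpha[k] - shift for k in range(n_regions)]
    denom = sum(numer)
    return [x / denom for x in numer]
```

With `add_one=False` and some α_k below one, a region with few assigned reads can receive a negative value, which is the situation the shifted variant avoids.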
Estimating α given P
Given a new estimate for P(t), we can use a fixed point iteration [12] to get a new estimate of α
(15)
By using the known bound , we can get a lower bound on F(α):
(16)
where .
We maximize this bound with a fixed point iteration similar to EM, noting that for fixed values of P convergence is guaranteed, and that for the Dirichlet distribution the maximum is the only stationary point [12]. This leads to the update
(17)
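A single α update can be sketched with a widely used fixed-point rule for Dirichlet maximum likelihood, in which the digamma function of each new α_k is set to digamma(∑_k α_k) plus the mean of log P_ik over samples. Whether this matches update (17) term for term is an assumption. The digamma, trigamma and inverse digamma helpers are written out here because the Python standard library does not provide them.

```python
import math


def digamma(x):
    """Digamma for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def trigamma(x):
    """Trigamma for x > 0, computed with the same strategy as digamma."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result + 1.0 / x + 0.5 * inv2 + series / x


def inverse_digamma(y, n_newton=5):
    """Solve digamma(x) = y by Newton's method from a standard start point."""
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y - digamma(1.0))
    for _ in range(n_newton):
        x -= (digamma(x) - y) / trigamma(x)
    return x


def update_alpha(p, alpha):
    """One fixed-point step for alpha given expression estimates p[i][k].

    Every p[i][k] must be positive, since its logarithm is taken.
    """
    n_samples = len(p)
    psi_sum = digamma(sum(alpha))
    new_alpha = []
    for k in range(len(alpha)):
        mean_log = sum(math.log(p_i[k]) for p_i in p) / n_samples
        new_alpha.append(inverse_digamma(psi_sum + mean_log))
    return new_alpha
```

Only one step is taken per call, so the function fits the scheme of a single α update per MGMR iteration.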
Heuristics/Implementation
As we have found that F(α), presented in equation (15), is non-convex even in 2 dimensions (Figure 1), we confine updates to be local by allowing only one update for each of the α and P estimation steps at each MGMR iteration. For genes with EM expression estimates equaling zero in all samples, we substitute 10^-20 for their values in MGMR to avoid taking the log of zero. For P updates (e.g., equation 14), we avoid potentially negative P values by adding one to each alpha (thus ignoring the -1 terms in the numerator and denominator). We implemented the algorithm in MATLAB, where the inputs required are read-gene map files for each sample, as in SEQEM [7], and an initial P estimate matrix. Alphas are initialized as an M-length vector of ones.
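Two of these heuristics are simple enough to sketch directly. The code below is an illustrative Python rendering and not the MATLAB implementation: it replaces the estimates of genes that are zero in every sample with a small floor, and builds the initial α vector of ones. The function names are hypothetical, and no renormalization is applied after flooring because none is described.

```python
def floor_all_zero_genes(p, floor=1e-20):
    """Replace estimates of genes that are zero in all samples by `floor`.

    p[i][k] is the initial expression estimate of gene k in sample i.
    Genes that are zero in only some samples are left unchanged.
    """
    n_genes = len(p[0])
    all_zero = [all(p_i[k] == 0.0 for p_i in p) for k in range(n_genes)]
    return [[floor if all_zero[k] else p_i[k] for k in range(n_genes)]
            for p_i in p]


def initial_alpha(n_genes):
    """Initial hyper-parameters: an M-length vector of ones."""
    return [1.0] * n_genes
```

A floor of 10^-20 is small enough to leave the remaining proportions effectively unchanged while keeping every logarithm finite for the floored genes.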