Samples
mRNAs were prepared from mouse embryonic stem (ES) cells at 0, 12, 24, 48, and 96 h after removal of Leukemia Inhibitory Factor (LIF) from the culture medium. The samples subjected to HiCEP reaction were duplicated. We designated each sample as 0 h-1, 0 h-2, 12 h-1, 12 h-2, 24 h-1, 24 h-2, 48 h-1, 48 h-2, 96 h-1, and 96 h-2.
HiCEP analysis
mRNAs prepared from each sample were digested with two 4-bp-cutting endonucleases (Msp I combined with Mse I) and ligated with the corresponding adaptors. The resulting HiCEP templates, MspI-MseI-poly(A) mRNAs, were amplified by fluorescently labelled primers; for labelling, FAM, HEX, and NED were used. In total, 256 primer combinations (16 MspI-NN primers combined with 16 NN-MseI primers; N = {A, C, G, T}) were used in the HiCEP analysis. For example, a primer combination of MspI-TA and GC-MseI is capable of amplifying particular transcript-derived fragments (TDFs) corresponding to that combination. The details of the protocol of the HiCEP reaction are described elsewhere [11]. An animation of the principle is provided at the following URL http://133.63.22.11/english/research/serch03.html.
Electrophoresis and image analysis
The PCR products were denatured and loaded on an ABI Prism 310 (Applied Biosystems) for capillary gel electrophoresis. The digitized images were analyzed by the GeneScan software (Applied Biotech). The size of the fragments was calculated by the software, according to internal molecular size markers of 35, 50, 75, 100, 139, 150, 160, 200, 300, 340, 350, 400, 490, 500, 600, and 700 bp, on each gel. The fragment sizing and baseline subtraction were performed by the software. The software quantifies each peak by the fragment length L (in bp), peak height H, and area A (in arbitrary units). Accordingly, the subsequent normalization procedure accepts these three-tuples as input for detected TDFs between 35 bp and 700 bp. TDFs smaller than 35 bp or larger than 700 bp were omitted from the analysis because the range was outside the size calibration range.
Delineation of quality scores for lanes
The starting point of normalization is a set of lanes (10 time-course measurements; 0, 12, 24, 48, and 96 h, each experiment duplicated) in each of 256 primer combinations. We explain the procedure using data from the primer combination of 'Msp I-CT combined with tt-Mse I (designated as CT-tt)' because the ten electropherograms have some ranges for which fragment sizing is obviously inappropriate (we therefore designated such ranges as "dissimilar ranges").
The first step starts from the Gaussian approximation of each lane. The use of the approximating lane is the same as described in Aittokallio et al. [25–27]. Briefly, a fragment F
i
in lane P is originally characterized by the three-tuples (L
i
, H
i
, A
i
). If lane P consists of n fragments
, the approximation of the lane at length t is given by:
where σ
i
is obtained from the following equation:
The approximation is performed independently for each lane. The ten approximate profiles of time-course data in the primer combination of CT-tt are shown in Fig. 1a.
For the automated identification of 'dissimilar ranges' from the expression profiles of ten lanes
, we next assign quality scores to each of the fragments
, where the fragments are originally numbered with respect to their lengths. By using the ten approximate profiles, relative similarity scores
for intervals from fragment i to fragment (i+T-1) (i = 1, 2,..., n - T + 1) in lane Ptarget(target = {1, ..., 10}) are calculated from the following equation:
where
is the Pearson correlation coefficient between the target lane Ptargetand one of the other lanes Pkin the interval (start-end bp) which always includes T fragments from fragment i to fragment (i+T-1) (i = 1, 2, ..., n-T+1). The interval is defined as: start = L
i
- 2.5σ
i
and end = Li+T-1+ 2.5σi+T-1. In this analysis, the number of fragments T is held constant at T = 5; other numbers are of course possible. By applying a moving window of T fragments, most of the fragments (n-T+2 fragments in this case, with the exception of F1, F2, F3, F4, Fn-3, Fn-2, Fn-1, and F
n
) have T relative similarity scores. Finally, the relative quality value Q(L
i
) for fragment F
i
is defined as the average of the similarity scores which satisfy start ≤ L
i
≤ end. An example of the calculation is given in Table 1. Quality scores at arbitrary lengths t, Q(t), are interpolated by the use of cubic splines to
. The procedure is applied to each of the ten lanes
and then the quality profiles
corresponding to the expression profiles are created (Fig. 1b).
The quality profiles delineated from Q(t) have a clear interpretation. The high (or low) score for Qk(t) in lane k indicates a high (or low) level of relative similarity between the lane and the others around the length t.
Detection of dissimilar ranges
Now we have information (quality profiles) for the automated detection of dissimilar ranges. Here we adopt a simple method for detecting the range. Briefly,
-
1)
Seek 'seed' ranges (C
seed_s
- C
seed_e
bp) which satisfy two conditions: a) Q(t) ≤ thres
seed
and b) they contain at least two peaks.
-
2)
Seek C
tmp_s
which satisfies both
and C
tmp_s
<C
seed_s
; similarly, C
tmp_e
,
and C
tmp_e
<C
seed_e
-
3)
Substitute the nearest marker length
(in this case, M1 = 35, M2 = 50, ...,
= 700) to C
tmp_s
(resp. C
tmp_e
) for C
s
(resp. C
e
); accordingly, both C
s
and C
e
=
and C
s
<C
e
Aparameter thres
seed
is set to 0.3 empirically. Forexample, P9 has the following parameters in Fig. 1b: C
seed_s
= 57.04, C
seed_e
= 89.98, C
tmp_s
= 52.60, C
tmp_e
= 104.60, C
s
= M2, and C
e
= M4. Although fine tuning might be necessary, the procedure enables us to display dissimilar ranges.
Selection of the reference lane
When we want to correct a dissimilar range (C
s
- C
e
bp), we have to select the "reference" (a kind of mean or typical profile in the corresponding range). One method is to choose lane Preferencesatisfying max {
}, where
is the average of Qkin the range (C
s
- C
e
bp) in lane
. For example, the algorithm selects P10 (i.e., 96 h-2) as a reference in a particular range (M1-M2 bp) and also in range (M2-M4 bp).
Two models for the normalization of dissimilar ranges
The meaning of the word "normalization" here is to correct the fragment lengths (L) and the areas (A) of peaks in a dissimilar range so that the similarity between the normalized electrophoretic pattern and the reference pattern in the corresponding range can be maximized. To normalize a particular lane Ptargetagainst the reference Preference, we now consider the following two models. Model 1 is the case of an incorrect fragment sizing at the shortest (or longest) marker peak, i.e, C
s
= M1 = 35 (or C
e
=
= 700). The peak lengths deviate more and more from the reference length moving from C
e
to C
s
(or from C
s
to C
e
). Model 2 is the case of an incorrect fragment sizing near marker length M
j
(C
s
<M
j
<C
e
, j = {2, 3,..., n
M
- 1}; the inside of dissimilar range (C
s
- C
e
bp)). Roughly, the deviation of peak lengths from the reference length gradually increases starting from C
s
; the maximum deviation is reached at M
j
(C
s
<M
j
<C
e
); the deviation decreases gradually; and finally disappears at C
e
bp.
Normalization is performed by either expanding or compressing. Consider, for example in Model 1, normalization for the expression profile of P3 (12 h-1) in range (M1-M2 bp) against the reference P10 (96 h-2). Undoubtedly, the profile displays a systematic deviation from the reference. The degree of the deviation gradually increases starting from M2 bp to M1 bp probably because an intense peak generated near the shortest marker peak for the correction of M1 bp is used mistakenly. We expect the normalization will be achieved by a linear expansion of the short side (M1) of the range (M1-M2 bp) by anchoring the long side. The best approximating profile is found by considering various combinations of normalization candidates starting from D × 100% expansion to D × 100% compression of the short side at intervals of d bp. We set D = 0.4, as a maximal realistic displacement and d = 0.2. Accordingly, in practice, the number of combinations is 2 × (C
e
- C
s
) × D/d + 1 (for example, there are 61 combinations of normalization candidates in the range (M1-M2 bp)) in Model 1.
For each combination x (x = {0, 1, ..., 2 × (C
e
- C
s
) × D/d)}, we make a candidate profile P
x
by changing three parameters (L
i
, A
i
, and σ
i
) accompanied by fragments (F
i
) in the dissimilar range (C
s
- C
e
bp), according to the level of correction (expansion or compression). Those parameters are calculated as follows:
Candidates are made by substituting these transformed three-tuples
in a given range (C
s
- C
e
bp) into eq. (1). The best approximate profile is the one that achieves the highest correlation coefficient between Preferenceand P
x
(x = {0, 1, ..., 2 × (C
e
- C
s
) × D / d}) in the range (C
s
- C
e
bp). In the normalization for the expression profile P3 in the range (M1-M2 bp) against the reference P10, the best normalized profile by our method matches well with the reference (Fig. 2).
A good example of Model 2 is the expression profile P2 (0 h-2) in range (M2-M4 bp) with the reference P10 (96 h-2); there is no possibility of Model 1 (C
s
≠ M1) and the number of incorrect marker lengths is only one (M3). Model 2 is a mixture of Model 1. The normalization is also done by one of the {2 × (C
e
- C
s
) × D / d + 1} combinations starting from D × 100% compression of the long side in (M2-M3 bp) and D × 100% expansion of the short side in (M3-M4 bp) to D × 100% expansion of the long side in (M2-M3 bp) and D × 100% compression of the short side in (M3-M4 bp) at intervals of d (= 0.2) bp. Unlike Model 1, we set D = 0.1 as a maximal realistic displacement. In the normalization for the expression profile P2 in the range (M2-M4 bp) against the reference, the best normalized profile by our method is matches well with the reference (Fig. 3).
It should be noted that when a dissimilar range (M
j
-Mj+lbp) is very wide (j = 1, 2,..., n
M
- l; l ≥ 3), there are two or more possibilities for incorrect marker lengths in Model 2. Of these cases, we only consider cases with j = 1 in Model 1 because such cases are the only realistic ones. For the remaining cases (j = 2,..., n
M
- l; l ≥ 3), the experiment should be redone rather than trying to normalize them by considering numerous possibilities. It should also be noted that there is a case of a dissimilar range (M1-M3 bp) to which both Models 1 and 2 are applicable. In this case, the best approximate profile is decided by comparing the two best possible profiles determined using Models 1 and 2.