Templates and pyrosequencing
The ribosomal rRNA templates that were used for the implementation of the pyrosequencing procedure were derived from a project where the D3–D5 expansion segment region of the LSU was cloned from organisms that are present in meiobenthos samples [5, 8]. The seven sequences that were chosen represent an algae (A), a nematode (N), a tardigrade (T), three crustaceans (C = cyclops, H = harpacticoid, O = ostracod) and an insect (E = ephemeroptera). The fragments were amplified from the clones using primers designed with the software PROBE http://jakob.genetik.uni-koeln.de/bioinformatik/software/probedesign/[10]: forward primer 5'-GAC-CCG-TCT-TGA-AAC-ACG-G-3' and a biotinylated reverse primer 5'-ATC-GAT-TTG-CAC-GTC-AGA-A-3'. Pyrosequencing was performed with 5'-GAA-ACA-CGG-ACC-AAG-GAG-T-3' as sequencing primer according to the instructions of the supplier (Biotage, Uppsala) and on the PSQ96 MA instrument (Biotage). The pyrograms for sixty dispensation cycles were recorded and exported into Excel (Microsoft Inc.) for the further calculation steps.
Deconvolution of the mixture for the pyrosequencing procedure
Each component of the mixture contributes to the final pattern recorded for this mixture. Determination of the pattern contributions, i.e. quantification of the components in the mix, can be achieved by solving a system of linear equations. In theory the following system of equations describes the pyrosequencing process:
Sj – is the peak intensity at j-th step of a specified nucleotide; kji(Z) – the linear coefficient between signal intensity and incorporation event of a specified nucleotide Z (A,T,G,C) at the j-th step of sequencing for the i-th organism; nji(Z) – number of available incorporation events for the nucleotide at the j-th step for the i-th organism (0,1,2,3...) since polynucleotide sequence can have successive repetitions of the same nucleotide several times; xi – the sought concentration of the i-th organism; N – total number of organisms.
In practice the coefficients kji(Z) are unknown, hence it is necessary to record pyrosequencing profiles for each expected component prior to solving the linear system. Pre-recorded profiles represent a set of kji(Z)nji(Z) which is then used to deconvolute a mix. Needless to say that the dispensation order of dNTPs determines pyrograms, therefore it must be consistent throughout individual components and the sample (in our case: A-T-G-C).
In a matrix form the system can be re-written as:
N·X = S
where N – matrix of nji multiplied by kji(Z), actually the pre-recorded profiles; X vector of xi; S – vector of peak intensities. This system can analytically be solved by the "least squares" solution which minimizes the square of the norm of the residual difference [11]:
X = (N
T
·N)
-1
·N
T
·S
To obtain an estimator for the standard deviations of the solutions one has to assume that the values in the matrices N and S, being physical measurements of light intensity, are distributed normally. This assumption allows to calculate errors associated with each solution using the following procedure (according to[11]): a non-scaled covariance matrix for the vector of solutions (X) can be computed as:
C = (NT·N)-1
This matrix needs to be scaled by a factor that can be determined as follows:
where X is the solution of the above equation, r – number of rows and p – number of columns in N respectively [12].
The diagonal elements of the scaled covariance matrix are variances (squared standard deviations) of each solution in the vector X. Therefore, these diagonal elements can be used as a measure of an error associated with each solution. Note that values in the covariance matrix as well as the solution are only meaningful if there is a good correlation between S and NX, where X is the computed solution. Correlating S and NX allows determining how well the pyrogram of the unknown sample can be explained in terms of the pyrograms of individual components. If the correlation is poor, the design matrix N is not adequate to the sample (i.e. there are too many unknown RNAs in the sample) and the solution as well as covariance matrix are meaningless.
The number of steps required for an unambiguous solution must be at least as many as the number of components to be identified plus possible additional steps to provide non-singularity of the matrix N. Any further additional steps provide an overdefinition of the system that makes its resolution more robust. For a given set of anticipated components it is possible to determine a minimal number of sequencing steps beforehand assuming all kji(Z) = 1 and nji(Z) taken from the sequences of these components. A simple algorithm simulating pyrosequencing evaluates singularity of the matrix N at each step until N is no more singular. The solution of the linear equations was carried out with the MathCad (Math Soft Inc.).
Implementation for dideoxy sequencing
A mixture electropherogram from a capillary sequencer can be represented as follows:
where f
i
(t) is the electropherogram of i-th component, t – the scan number or time after the start of electrophoresis, a
i
– the quantity of the i-th component and N – number of components. Thus, the problem can be converged to finding a
i
, which is the best linear fit given that f
i
(t) is known beforehand – again from the set of pre-recorded electropherograms of expected components. One solves a system of linear equations:
FA = Y
where A is a {a
i
}T vector, F – matrix, which k-th row is {f
i
(t
k
)} and Y – vector {y(t
k
)}T. The index k runs from 0 to at least N-1. The tk is a subset of scans from the entire electropherogram. The solution is found as shown above
The functions {f
i
(t)} are determined by sequencing the individual components and storing the electropherograms in the profile library.
Simulation of Dideoxy sequencing results
Gauss-shaped peaks were simulated for random sequences, with unequal assignment of peak height at each position to mimic the known differential incorporation effects. A component library was build from these simulated electropherograms. Concentrations of each component of the library were chosen to be 1.
A mixture was composed from randomized amounts of components adding up to 1. This way of mixture simulation reflects a realistic case where the total amount of DNA or RNA of the mix would be the same as that used to record the library. The distribution of the components was chosen to represent abundant and rare components (compare Fig. 3). The system was solved as described above. Solutions represent fractions of each component of the library in the sample mixture. To estimate the influence of noise, normally distributed random numbers were used to change each peak in the simulated electropherogram of the sample and the library at a specified noise level. The noise level was determined as a fraction of the maximal peak of an electropherogram.