Mixture models for analysis of melting temperature data

Nellåker, Christoffer; Uhrzander, Fredrik; Tyrcha, Joanna; Karlsson, Håkan

doi:10.1186/1471-2105-9-370

Methodology article
Open access
Published: 11 September 2008

BMC Bioinformatics volume 9, Article number: 370 (2008)

Abstract

Background

In addition to their use in detecting undesired real-time PCR products, melting temperatures are useful for detecting variations in the desired target sequences. Methodological improvements in recent years allow the generation of high-resolution melting-temperature (T_m) data. However, there is currently no convention on how to statistically analyze such high-resolution T_m data.

Results

Mixture model analysis was applied to T_m data. Models were selected based on Akaike's information criterion. Mixture model analysis correctly identified categories in T_m data obtained for known plasmid targets. Using simulated data, we investigated the number of observations required for model construction. The precision of the reported mixing proportions from data fitted to a preconstructed model was also evaluated.

Conclusion

Mixture model analysis of T_m data allows the minimum number of different sequences in a set of amplicons and their relative frequencies to be determined. This approach allows T_m data to be analyzed, classified, and compared in an unbiased manner.

Background

Real-time PCR or semiquantitative PCR is widely used to detect and quantify specific target sequences. The exponential amplification of a sequence is monitored in real time by fluorescence. Commonly, a nonspecific fluorescent dye is used, such as SYBR Green I or LCGreen, which only reports the presence of double-stranded DNA. These dyes do not distinguish sequences and can thus report the amplification of undesired targets. Undesired sequences are normally detected during a dissociation step after thermocycling is complete. During dissociation, the double-stranded PCR products melt into single strands, so fluorescence is diminished. A curve can be produced by plotting the loss of fluorescence against a gradual increase in temperature. The temperature at which the rate of signal loss is the greatest can be defined as the melting temperature (T_m) of the PCR product. Although the T_m is sequence dependent, different sequences do not necessarily have different T_m. The reverse inference does hold, however: the detection of different T_m implies the presence of different sequences. Therefore, by monitoring T_m, we can distinguish different targets for one set of primers. This technique has been used for the detection of single-nucleotide polymorphisms [1], allelic discrimination [2], and strain typing of microorganisms [3–5]. We previously reported the use of T_m analysis to detect the expression patterns of transcripts containing different members of the W family of human endogenous retrovirus (HERV) elements in vitro and in vivo [6, 7].
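The definition of T_m as the temperature of greatest signal loss can be illustrated with a minimal Python sketch. It is not the GcTm algorithm; the function name `melting_temperature`, the synthetic sigmoidal curve, and the 0.1°C temperature step are assumptions made for illustration only.

```python
import math

def melting_temperature(temps, fluorescence):
    """Return the temperature at which -dF/dT is greatest (central differences)."""
    best_t, best_rate = None, -math.inf
    for i in range(1, len(temps) - 1):
        # Rate of signal loss at temps[i], estimated from its two neighbours
        rate = -(fluorescence[i + 1] - fluorescence[i - 1]) / (temps[i + 1] - temps[i - 1])
        if rate > best_rate:
            best_t, best_rate = temps[i], rate
    return best_t

# Synthetic dissociation curve (assumption): sigmoidal loss of fluorescence
# centred on 82.0 degrees C, sampled every 0.1 degrees C from 70 to 90.
temps = [70.0 + 0.1 * i for i in range(201)]
fluor = [1.0 / (1.0 + math.exp((t - 82.0) / 0.5)) for t in temps]
tm = melting_temperature(temps, fluor)
```

On real dissociation data the curve is noisy, so some smoothing or curve fitting would normally precede the derivative step.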

The precision of the T_m measurements determines the sensitivity with which different sequences can be distinguished. The instrument used to obtain the T_m recordings is the principal factor limiting the amount of information that can be extracted from the data. We recently reported a method that allows improved resolution, reduced spatial bias, and automated data collection for T_m detection in an ABI Prism 7000 Sequence Detection System (Applied Biosystems, Palo Alto, CA) [8]. Using a temperature indicator probe (T_mprobe) and an algorithm (GcTm) to interpolate more-precise T_m measurements from multiple data points, the standard deviation of the measurement error (σ) of the T_m recordings was improved from 0.19°C to 0.06°C [8].

However, there is no convention on how to analyze T_m data to objectively distinguish sequences by T_m. The need for such a tool becomes apparent when i) the T_m data are not easily stratified because of overlapping clusters of T_m observations, and/or ii) the number of different sequences and possible T_m categories is unknown. In this report, we use mixture model analysis to construct a model for a particular set of primer targets, to classify T_m data, and to calculate the mixing proportions of the amplicons within these categories. The mixture model technique allows T_m analysis to be applied to any set of primers to determine the minimum number of T_m categories (i.e., the number of different sequences detected) and the mixing proportions (frequency distributions) of the detected categories. Thus, mixture model analysis of T_m data is an objective method with which more refined T_m assays can be established.

Results

In a T_m analysis using the T_mprobe and GcTm program, described previously [8], we demonstrated, using plasmids containing known sequences, that it was possible to distinguish some but not all sequences based on their T_m. In the present report, we applied the mixture models and the σ established in the previous publication [8] to determine the T_m categories and mixing proportions of these data (Figure 1). Akaike's information criterion (AIC), a measure of how well a model explains the data, with a penalty for the number of parameters estimated, determined that the T_m of the four sequences were best represented by a three-category mixture model. This model precisely estimated the mixing proportions of the T_m into the categories, attributing the correct number of T_m recordings to each of the four sequences (where two of them shared a category). For an overview of the procedure for using mixture models to analyze T_m data, see the Methods section.

We next assessed the performance of the mixture model analysis in constructing models for categories of T_m with varying separations. To this end, we generated simulated data points mimicking the T_m of four sequences separated by multiples of σ. These data were used to identify the model that best explained the data according to AIC (see an example of the AIC plot in Figure 2) for a range of T_m separations and numbers of data points (Figure 3). A large separation of T_m, 10 × σ (0.6°C), allows the mixture model analysis to close in on four separate categories with only 10 data points. Smaller separations of T_m require larger numbers of data points to determine the correct number of T_m categories. The distinction of categories with a separation of 1 × σ required approximately 2000 data points to model the correct number of T_m categories.
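The simulated data described above could be generated along the following lines. This is a sketch, not the authors' MATLAB code: the function name `simulate_tm`, the base temperature of 80.0°C, the equal sampling of the four sequences, and the random seed are all assumptions; only σ = 0.06°C and the separation in multiples of σ come from the text.

```python
import random

SIGMA = 0.06  # standard deviation of the measurement error in degrees C (from the text)

def simulate_tm(n_points, separation_in_sigma, n_sequences=4, base_tm=80.0, seed=0):
    """Draw n_points simulated T_m values from n_sequences equally likely sequences.

    The true T_m of neighbouring sequences differ by separation_in_sigma * SIGMA.
    base_tm and the equal frequencies are illustrative assumptions.
    """
    rng = random.Random(seed)
    means = [base_tm + i * separation_in_sigma * SIGMA for i in range(n_sequences)]
    data = [rng.gauss(rng.choice(means), SIGMA) for _ in range(n_points)]
    return means, data

# Ten data points with a separation of 10 x sigma (0.6 degrees C)
means, data = simulate_tm(n_points=10, separation_in_sigma=10)
```

Repeating the call over a grid of separations and sample sizes gives the kind of input from which the smallest sufficient number of data points can be explored.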

Next, we evaluated the fit of the data points to preestablished models. For this purpose, we generated data points corresponding to a sample containing three of four possible T_m represented in a model. We compared the mixing proportions reported by the mixture model analysis with the mixing proportions in which all four T_m were present at equal frequencies. In Figure 4, the P values obtained from χ² analyses for various separations of the T_m are plotted against the numbers of data points used. The P values for the χ² test drop rapidly with increasing sample numbers for any T_m separation of more than 1 × σ. With smaller separations of the T_m categories, the mixture model analysis is unable to reliably establish the differences in the mixing proportions.
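The χ² comparison against equal frequencies can be sketched as follows for the four-category case. The helper names are hypothetical, the counts in the example are made up for illustration, and the closed-form tail probability used here is specific to 3 degrees of freedom (four categories); it is not taken from the paper.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

def chi2_against_equal(observed_counts):
    """Chi-square test of four observed category counts against equal frequencies."""
    if len(observed_counts) != 4:
        raise ValueError("this sketch is hard-coded for four categories (3 df)")
    n = sum(observed_counts)
    expected = n / 4.0
    stat = sum((o - expected) ** 2 / expected for o in observed_counts)
    return stat, chi2_sf_3df(stat)

# Illustrative counts (made up): 30 observations, three of four categories present
stat, p = chi2_against_equal([10, 10, 10, 0])
```

In practice the observed counts would be the fitted mixing proportions multiplied by the number of data points.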

Discussion

We report the application of mixture models to the analysis of high-resolution T_m data. Whereas the plasmid T_m data reported are sufficiently separated to be stratified manually, we use these data to demonstrate the principle that can be applied to analyze more complex T_m data.

Mixture model analysis of T_m data entails the construction of a model based on the T_m data for a set of primers. With such a model established, it is possible to fit smaller subsets of data to calculate the mixing proportions of the T_m categories of the model. This gives a proxy marker for the frequency distributions of different amplicon sequences in the analyzed data. This approach requires no prior knowledge of how many different amplicons are present and there is no limit to the number of different T_m that can be distinguished. However, the T_m analysis method with mixture models only reports the minimum number of different sequences required to explain the T_m data because different sequences can have the same T_m.

Mixture model analysis is a modern type of cluster analysis. The purpose of cluster analysis is to group data that have properties in common. When constructing the mixture model for a set of primers, the number of categories in the model that most appropriately explains the T_m data is determined by AIC. Other information criteria exist, such as the Bayesian information criterion, but this penalizes free parameters more harshly than does the AIC.

By empirical testing with simulated data, we found that smaller separations of T_m require exponentially larger numbers of data points to distinguish the correct number of categories in a mixture model. Insufficient numbers of observations yield an underestimation of the numbers of unique T_m represented by the data, erring on the side of safety. In other words, with insufficient data, the number of unique sequences in the data is underestimated by the optimal model.

In an established model, based on a large number of T_m observations, a smaller number of observations can be fitted to calculate the mixing proportions in the T_m categories. These proportions can then be compared between sets of T_m data as frequency distributions of sequences and analyzed with χ² tests. We observed that, whereas a large number of T_m observations are required to establish a model with a small separation between categories (e.g., 1000 data points are required with 2 × σ separation), far fewer are sufficient for comparisons once the model is established (e.g., 100 data points for P < 0.001). A separation of the T_m categories in the model of less than 1 × σ results in unreliable mixing proportions. However, this should rarely be a problem in practice, because constructing the models puts a larger constraint on T_m separation by AIC. In other words, models constructed with mixture model analysis will consist of T_m categories separated by more than 1 × σ.
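One way to read "fitting" a smaller data set to an established model is to hold the category means and σ fixed and re-estimate only the mixing proportions. The sketch below does this with plain EM updates of the proportions; it is an assumption about the procedure rather than the MIX implementation, and the function name, starting values, seed, and example means are hypothetical.

```python
import math
import random

def fit_proportions(data, means, sigma, n_iter=50):
    """Estimate mixing proportions for fixed category means and a fixed sigma."""
    k = len(means)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        totals = [0.0] * k
        for x in data:
            exps = [-((x - mu) ** 2) / (2.0 * sigma ** 2) for mu in means]
            # Rescale by the largest exponent among live components to avoid underflow
            top = max(e for w, e in zip(weights, exps) if w > 0)
            dens = [w * math.exp(e - top) if w > 0 else 0.0
                    for w, e in zip(weights, exps)]
            s = sum(dens)
            for i in range(k):
                totals[i] += dens[i] / s
        weights = [t / len(data) for t in totals]
    return weights

# Established model (assumed): four categories separated by 10 x sigma
model_means = [80.0, 80.6, 81.2, 81.8]
rng = random.Random(2)
# Sample containing only three of the four categories, 30 observations each
sample = [rng.gauss(mu, 0.06) for mu in model_means[:3] for _ in range(30)]
proportions = fit_proportions(sample, model_means, 0.06)
```

The resulting proportions, multiplied by the sample size, give category counts that can be compared between samples.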

Not all dissociation curves are easily defined by a single T_m, as in the case of multiple domain transitions in longer sequences [9] (generally longer than those generated in real-time PCR assays) and for heterodimers. Using the GcTm approach to curve fitting and SYBR Green I chemistry, such melting profiles will be assigned a single T_m value. Although some additional information is therefore lost, mixture model analysis still validly identifies clusters of T_m and sequences. There is an established high-resolution amplicon melting analysis (usually denoted HRM) using LCGreen, primarily based on differences in the profiles of melting curves rather than on absolute T_m [10]. Although this method is superior to mixture model analysis in identifying heterodimers, absolute T_m values are required to identify homodimers. Recently, a method with sufficient resolution to distinguish base-pair neutral homozygotes was reported [11]. Mixture model analysis of T_m can be used in all cases where the T_m can be denoted as a single value, but primarily for homodimer discrimination.

Conclusion

In conclusion, the mixture model analysis of T_m presented here allows the unbiased analysis of high-resolution T_m data. This analysis is applicable to the identification of sequences in T_m data regardless of the method by which the T_m are acquired, provided the measurement error is known. Mixture models allow T_m analyses to be performed on more complex and varied sequence targets than hitherto possible. Possible applications include typing microbial strains and their relative abundances in a population and the analysis of transcripts containing repetitive elements [3, 4, 6, 12].

Methods

Finite mixture models

Mixture models are useful for describing complex populations with observed or unobserved heterogeneity. The term mixture model encompasses many types of statistical structures. Here, we use it to denote mixture distributions. A mixture distribution is a collection of statistical distributions that arise when mixed populations are sampled that have a different probability density function for each component.

Let X be a random variable or vector taking values in sample space χ with the probability density function

g(x) = π₁ f₁(x) + ... + π_k f_k(x), x ∈ χ,

where 0 ≤ π_i ≤ 1, i = 1, ..., k, and π₁ + ... + π_k = 1.

Such a model can arise if one is sampling from a heterogeneous population that can be decomposed into k distinct homogeneous subpopulations, called component populations. If these components have been "mixed" together, and we measure only the variable X without determining the particular components, then this model holds. We say that X has a finite mixture distribution and that g(·) is a finite mixture density function. The parameters π₁, ..., π_k are called mixing weights or mixing proportions, and each π_i represents the proportion of the total population in the i-th component.

There is no requirement that the component densities should all belong to the same parametric family, but in this paper, we keep to the simplest case where f₁(x),..., f_k(x) have a common functional form but different parameters.

We apply the theory of finite mixture models to T_m data consisting of normally distributed components in a mixture model, where each component has a standard deviation of σ°C. The finite mixture density function is then as follows:

g (x | ψ) = \sum_{i = 1}^{k} π_{i} \frac{1}{σ \sqrt{2 π}} \exp \left( - \frac{{(x - μ_{i})}^{2}}{2 σ^{2}} \right)

where ψ = (π₁,..., π_k, μ₁,..., μ_k, σ)^T.

The likelihood function corresponding to the data (x₁,..., x_n) is as follows:

L (ψ; x_{1}, \dots, x_{n}) = \prod_{j = 1}^{n} g (x_{j} | ψ) .

The logarithm of the likelihood function is

\ln L (ψ) = \sum_{j = 1}^{n} \ln g (x_{j} | ψ) .
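As a minimal sketch, the normal mixture density and its log likelihood translate directly into code. The function names and the example parameter values are illustrative assumptions; the formulas are those of a mixture of normal components sharing one standard deviation σ.

```python
import math

def mixture_density(x, weights, means, sigma):
    """g(x | psi): weighted sum of normal densities with a common sigma."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return sum(w * norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
               for w, mu in zip(weights, means))

def log_likelihood(data, weights, means, sigma):
    """ln L(psi): sum over the observations of ln g(x_j | psi)."""
    return sum(math.log(mixture_density(x, weights, means, sigma)) for x in data)

# Example parameters (assumed): two categories 0.6 degrees C apart, sigma = 0.06
weights, means, sigma = [0.5, 0.5], [80.0, 80.6], 0.06
ll = log_likelihood([79.98, 80.03, 80.61], weights, means, sigma)
```

For observations far from every component mean the density can underflow to zero, so a production implementation would work with log densities throughout.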

We attempt to find the particular ψ that maximizes the likelihood function. This maximization can be undertaken in the traditional way by differentiating L(ψ; x) with respect to the components of ψ and equating the derivatives to zero to give the likelihood equation:

\frac{\partial L (ψ)}{\partial ψ} = 0, or equivalently \frac{\partial \ln L (ψ)}{\partial ψ} = 0.

Quite often, the log likelihood function cannot be maximized analytically, i.e., the likelihood equation has no explicit solutions. In such cases, it is possible to compute the maximum likelihood estimate of ψ iteratively. To calculate maximum likelihood estimates, we use the expectation maximization (EM) method in combination with the Newton-Raphson algorithm. Iterations of the EM algorithm consist of two steps: the expectation step or the E-step and the maximization step or the M-step [13, 14]. The Newton-Raphson algorithm for solving the likelihood equation approximates the gradient vector of the log likelihood function by a linear Taylor series expansion [15]. We use the Newton-Raphson algorithm in the M-step of the EM method.
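The E-step and M-step can be sketched for the special case of normal components with a common, known σ. Note the substitution: in this case the M-step has a closed form, so the sketch uses it in place of the Newton-Raphson step described above, and it treats σ as fixed rather than as part of ψ. The function name, starting means, seed, and example data are assumptions.

```python
import math
import random

def em_fixed_sigma(data, start_means, sigma, n_iter=50):
    """EM for a normal mixture with a common, known sigma; returns (weights, means)."""
    k = len(start_means)
    means = list(start_means)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        resp = []
        for x in data:
            exps = [-((x - mu) ** 2) / (2.0 * sigma ** 2) for mu in means]
            top = max(e for w, e in zip(weights, exps) if w > 0)
            dens = [w * math.exp(e - top) if w > 0 else 0.0
                    for w, e in zip(weights, exps)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step (closed form): update mixing proportions and component means
        for i in range(k):
            r_i = sum(r[i] for r in resp)
            weights[i] = r_i / len(data)
            if r_i > 0:
                means[i] = sum(r[i] * x for r, x in zip(resp, data)) / r_i
    return weights, means

# Illustrative data (assumed): 350 and 150 observations around 80.0 and 80.6 degrees C
rng = random.Random(1)
data = ([rng.gauss(80.0, 0.06) for _ in range(350)]
        + [rng.gauss(80.6, 0.06) for _ in range(150)])
weights, means = em_fixed_sigma(data, [79.9, 80.7], 0.06)
```

EM converges to a local maximum of the likelihood, so the result depends on the starting means; several starting points are usually tried.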

We developed an algorithm that allows the automated estimation, in parallel, of a finite number of normally distributed components. The number of components can be assessed by several different methods, although none of them is optimal. We chose the AIC [16, 17]. AIC is a relative score for comparing different models; the model it selects as optimal depends on the number of data points, the number of categories, and the separation of the T_m categories. AIC is defined as −2L_m + 2m, where L_m is the maximized log likelihood and m is the number of parameters.
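The AIC definition can be written out as a short sketch of model selection. The parameter count assumes k means and k − 1 independent mixing proportions, plus one parameter if σ is estimated rather than known; this counting convention is an assumption, not stated in the text. The log likelihood values below are made-up numbers used only to show the arithmetic.

```python
def aic(max_log_likelihood, n_params):
    """AIC = -2 * L_m + 2 * m."""
    return -2.0 * max_log_likelihood + 2.0 * n_params

def n_free_params(k, sigma_known=True):
    """k means + (k - 1) independent mixing proportions (+1 if sigma is estimated)."""
    return 2 * k - 1 + (0 if sigma_known else 1)

# Hypothetical maximized log likelihoods for k = 1..5 categories (made-up values)
loglik_by_k = {1: -310.0, 2: -205.0, 3: -150.0, 4: -149.2, 5: -148.9}
scores = {k: aic(ll, n_free_params(k)) for k, ll in loglik_by_k.items()}
best_k = min(scores, key=scores.get)  # the model with the lowest AIC is preferred
```

With these illustrative values, a fourth category improves the log likelihood by less than the penalty of two extra parameters, so the three-category model has the lowest AIC.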

Acquisition of HERV-W gag T_m

T_m data were generated with GcTm, as previously described [8], on dissociation data obtained from the amplification of plasmids containing known HERV-W gag sequences.

Simulated T_m data recordings and GcTm analysis were performed in MATLAB™ (The MathWorks) version 7.0.1.24704 with the Optimization Toolbox. Mixture model analysis was performed in R 2.6.0 [18] with the MIX software [19, 20].

Overview of mixture model analysis of T_m

A mixture model is constructed for a set of primers. The model should be constructed on a sample of T_m data large enough that all possible sequences can be expected to be represented. The T_m data are then stratified into small-interval groups, and the frequency distributions of these arbitrary categories are used to construct and compare the mixture models. AIC is used to evaluate which model best explains the data with the minimum number of categories. The model with the lowest AIC is preferred, i.e., the one that explains the data adequately with the fewest parameters. Once a model is selected, T_m data from different samples can be fitted to the model and the mixing proportions compared between samples. Differences between samples can be evaluated with χ² tests, provided a conservative stance is taken that accounts for the separation between the T_m categories and the numbers of data points.
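The stratification of T_m data into small-interval groups can be sketched as a simple binning step. The interval width of 0.02°C, the function name, and the example values are assumptions; the text does not state the width that was used.

```python
import math
from collections import Counter

def bin_tm(data, width=0.02):
    """Group T_m values into intervals of `width` degrees C.

    Returns a dict mapping the lower edge of each non-empty interval to its count.
    """
    counts = Counter(math.floor(x / width) for x in data)
    return {round(i * width, 10): counts[i] for i in sorted(counts)}

# Illustrative T_m values (made up)
bins = bin_tm([80.011, 80.013, 80.035, 80.61])
```

The resulting frequency table is the kind of grouped input from which a mixture model can be fitted and compared across candidate numbers of categories.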

Abbreviations

T_m: melting temperature
AIC: Akaike's information criterion
HERV: human endogenous retrovirus
EM: expectation maximization

References

1. Germer S, Higuchi R: Single-tube genotyping without oligonucleotide probes. Genome Res 1999, 9(1):72–78.
2. Graziano C, Giorgi M, Malentacchi C, Mattiuz PL, Porfirio B: Sequence diversity within the HA-1 gene as detected by melting temperature assay without oligonucleotide probes. BMC Med Genet 2005, 6: 36. 10.1186/1471-2350-6-36
3. Pham HM, Konnai S, Usui T, Chang KS, Murata S, Mase M, Ohashi K, Onuma M: Rapid detection and differentiation of Newcastle disease virus by real-time PCR with melting-curve analysis. Arch Virol 2005, 150(12):2429–2438. 10.1007/s00705-005-0603-0
4. Waku-Kouomou D, Alla A, Blanquier B, Jeantet D, Caidi H, Rguig A, Freymuth F, Wild FT: Genotyping measles virus by real-time amplification refractory mutation system PCR represents a rapid approach for measles outbreak investigations. J Clin Microbiol 2006, 44(2):487–494. 10.1128/JCM.44.2.487-494.2006
5. Harasawa R, Mizusawa H, Fujii M, Yamamoto J, Mukai H, Uemori T, Asada K, Kato I: Rapid detection and differentiation of the major Mycoplasma contaminants in cell cultures using real-time PCR with SYBR Green I and melting curve analysis. Microbiol Immunol 2005, 49(9):859–863.
6. Nellåker C, Yao Y, Jones-Brando L, Mallet F, Yolken RH, Karlsson H: Transactivation of elements in the human endogenous retrovirus W family by viral infection. Retrovirology 2006, 3(1):44. 10.1186/1742-4690-3-44
7. Yao Y, Schröder J, Nellåker C, Bottmer C, Bachmann S, Yolken RH, Karlsson H: Elevated levels of human endogenous retrovirus-W transcripts in blood cells from patients with first episode schizophrenia. Genes Brain Behav 2007, 7: 103–112.
8. Nellåker C, Wallgren U, Karlsson H: Molecular beacon-based temperature control and automated analyses for improved resolution of melting temperature analysis using SYBR I Green chemistry. Clin Chem 2007, 53(1):98–103. 10.1373/clinchem.2006.075184
9. Volker J, Blake RD, Delcourt SG, Breslauer KJ: High-resolution calorimetric and optical melting profiles of DNA plasmids: resolving contributions from intrinsic melting domains and specifically designed inserts. Biopolymers 1999, 50(3):303–318. 10.1002/(SICI)1097-0282(199909)50:3<303::AID-BIP6>3.0.CO;2-U
10. Wittwer CT, Reed GH, Gundry CN, Vandersteen JG, Pryor RJ: High-resolution genotyping by amplicon melting analysis using LCGreen. Clin Chem 2003, 49(6 Pt 1):853–860. 10.1373/49.6.853
11. Gundry CN, Dobrowolski SF, Martin YR, Robbins TC, Nay LM, Boyd N, Coyne T, Wall MD, Wittwer CT, Teng DH: Base-pair neutral homozygotes can be discriminated by calibrated high-resolution melting of small amplicons. Nucleic Acids Res 2008, 36(10):3401–3408. 10.1093/nar/gkn204
12. Slinger R, Bellfoy D, Desjardins M, Chan F: High-resolution melting assay for the detection of gyrA mutations causing quinolone resistance in Salmonella enterica serovars Typhi and Paratyphi. Diagn Microbiol Infect Dis 2007, 57(4):455–458. 10.1016/j.diagmicrobio.2006.09.011
13. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc B 1977, 39(1):1–38.
14. McLachlan GJ, Krishnan T: The EM Algorithm and Extensions. New York: Wiley; 1997.
15. Dennis JJE, Schnabel RB: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. New Jersey: Prentice Hall; 1983.
16. Akaike H: A new look at the statistical model identification. IEEE Trans Automat Control 1974, 19(6):716–723. 10.1109/TAC.1974.1100705
17. Akaike H, (ed.): Information Theory and an Extension of the Maximum Likelihood Principle. Budapest: Akademiai Kiado; 1973.
18. R Development Core Team: R: A Language and Environment for Statistical Computing. 2.6.0 edition. Vienna, Austria: R Foundation for Statistical Computing; 2008.
19. Macdonald P: MIX Software for Mixture Distributions. 2.3rd edition. Ontario, Canada: Ichthus Data Systems; 1988.
20. Du J: Combined algorithms for fitting finite mixture distributions. In Masters thesis. Hamilton, Ontario: McMaster University; 2002.

Acknowledgements

This study was generously supported by the Stanley Medical Research Institute, Bethesda, MD, and the Swedish Research Council (21X-20047).

Author information

Authors and Affiliations

Department of Neuroscience, Karolinska Institutet, Retzius Väg 8 B2:5, 17177, Stockholm, Sweden
Christoffer Nellåker & Håkan Karlsson
Mathematical Statistics, Stockholm University, Kräftriket Hus 6, 106 91, Stockholm, Sweden
Fredrik Uhrzander & Joanna Tyrcha

Corresponding author

Correspondence to Christoffer Nellåker.

Additional information

Authors' contributions

CN conceived the study, tested and prepared the manuscript; FU developed the method and critically revised the manuscript; JT developed the method and prepared the manuscript; HK conceived the study and prepared the manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Nellåker, C., Uhrzander, F., Tyrcha, J. et al. Mixture models for analysis of melting temperature data. BMC Bioinformatics 9, 370 (2008). https://doi.org/10.1186/1471-2105-9-370

Received: 05 March 2008
Accepted: 11 September 2008
Published: 11 September 2008
DOI: https://doi.org/10.1186/1471-2105-9-370