Data pre-processing
MLPAinter cannot handle the raw electrophoresis signal and therefore requires that the MLPA amplification product peaks have already been linked to the corresponding MLPA probes of the MLPA kit used. After electrophoresis, all MLPA sample trace files should be pre-processed in standard software for basic analysis of MLPA traces. Subsequently, the report files can be imported. Here we used GeneMapper (Applied Biosystems, Foster City, CA, USA) for MLPAinter, but the system can also import data from the combination of GeneScan Analysis and Genotyper software (Applied Biosystems, Foster City, CA, USA). Adaptations to other software programs like GeneMarker (SoftGenetics, State College, PA, USA) should be straightforward. A step-by-step vignette for GeneMapper settings can be found at http://code.google.com/p/mlpainter/. Briefly, the product lengths of the ligated probes are defined with an internal size standard. The peak height and area are calculated for every peak present in the trace. Any undefined peaks are discarded from further analysis. Data tables are then automatically generated with the length, height and area of all recognised peaks. These tables are exported from the GeneMapper software package and imported into MLPAinter for specific analysis of the raw data. Protocols for linking output files from other software packages are planned for future versions.
Data management
Here, we developed a relational database using Microsoft Access to manage all pertinent information for MLPA experiments and created a front end with Borland Delphi to guide laboratory workflow and data analysis. Characteristics that are relevant for the performance of the MLPA, such as the sample number and status (e.g., tumour or normal), DNA concentration and, if available, tumour percentage, should be stored in a database. Annotation information like the chromosomal position and gene names of the different probes in a kit should be available for the interpretation of the results in output tables, heat maps, and plots. To assist the laboratory workflow, electronic and paper sample sheets can be prepared for the automated sequencer. The raw data of the sequencing reports are imported into the database for subsequent quality control steps and analysis.
The relational database contains three interconnected hierarchies: MLPA kits and probes, electrophoresis results, and analyses. Next to the specific kit information, the database tables hold gene and probe names, as well as the physical and cytogenetic location of the probes. All probes in a particular kit are numbered from 1, for the probe with the smallest product size, to n, for the probe with the largest product size. Every kit contains a number of probes that can be used for a quality check of the trace. The corresponding products are named based on their size in base pairs. The different kits, as defined by MRC-Holland, can be imported from http://www.mlpa.com.
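The probe numbering convention can be illustrated with a minimal Python sketch. It is not the Access/Delphi implementation; the `Probe` class, the `number_probes` helper, and the gene names and product sizes are hypothetical and serve only to show the ordering rule.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    gene: str             # hypothetical gene name
    product_size_bp: int  # length of the ligated product in base pairs
    number: int = 0       # assigned below from the product size order


def number_probes(probes):
    """Return the probes sorted by product size and numbered 1..n."""
    ordered = sorted(probes, key=lambda p: p.product_size_bp)
    for index, probe in enumerate(ordered, start=1):
        probe.number = index
    return ordered


kit = number_probes([
    Probe("GENE_C", 202),
    Probe("GENE_A", 130),
    Probe("GENE_B", 166),
])
print([(p.number, p.gene, p.product_size_bp) for p in kit])
```

Probe 1 is then the probe with the smallest product size and probe n the one with the largest.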
Both the MLPA run and analysis hierarchies use the samples table. This table contains clinical information like the origin of the DNA used, e.g., whether the DNA was isolated from whole blood, fresh frozen tissue or formalin-fixed paraffin-embedded tissue. Every sample is labelled N for normal or T for test (tumour) tissue. Normal samples are treated differently from test samples in the normalisation and analysis steps, as described in the normalisation section. An electrophoresis run typically consists of a sample plate to be processed by the sequencer. The sample, the kit, and a unique name for the plate are recorded for each position on the sample plate. Different types of kits can be used within one run. From this information a sample sheet or configuration file is created for the sequencer. The resulting peak heights and peak areas of an MLPA run are imported for all of the probes in a kit, and the analysis settings determine whether peak heights or peak areas are analysed.
During analysis, specific MLPA runs can be combined from one or more electrophoresis runs. A group of reference probes can be copied from another analysis with the same kit and adapted to suit the needs of the specific analysis. However, to avoid inter-experimental differences, values from experiments performed at a different time should not be used. Probes can also be excluded from the analysis. A successful analysis can be finalised by authorising the results. After authorisation, all options are fixed except for the visualisation and sorting options.
Quality control
MLPAinter presents three data quality indicators, Q1, Q2 and Q3 (Figure 1A), to assist with the decision of whether to include a trace in the analysis.
The first indicator (Q1) is the ratio between the ligation-dependent peak at 94 base pairs and the median of the DNA-dependent peaks at 64, 70, 76 and 82 base pairs (Figure 2). Van Dijk et al. [10] state that this ratio should be greater than 5 to obtain good and reproducible results. Nonetheless, we have observed that in some cases, lower ratios can also give reliable peak patterns (Figure 2).
The second quality indicator (Q2) is the median peak height of the probe signals present in the kit. If the median of the first 20 ligated probe peak heights is below 450 relative fluorescent units (RFU), the trace quality is considered low. Moreover, because of the limits of the detection optics of the instrument, a median peak height over 4000 RFU also indicates low trace quality (Figure 2) [14].
For the last indicator (Q3), all analysis peaks are split into two parts based on sequence length. The value is computed as the median signal of the longest probes divided by the median signal of the shortest probes. The longest probes often show lower signals; in high quality traces, however, this indicator is usually over 0.5.
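The three indicators can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MLPAinter code: the function names, the input layout, and the example peak heights are hypothetical, and the split for Q3 is assumed to fall at the midpoint of the size-ordered probes (with an odd number of probes the middle one is left out).

```python
from statistics import median


def quality_indicators(control_peaks, probe_peaks):
    """control_peaks: dict mapping fragment size (bp) to peak height.
    probe_peaks: list of (product_size_bp, peak_height) for ligated probes."""
    # Q1: ligation-dependent 94 bp peak over the median DNA-dependent peak.
    dna_dependent = [control_peaks[size] for size in (64, 70, 76, 82)]
    q1 = control_peaks[94] / median(dna_dependent)

    # Q2: median height of the first 20 ligated probes, ordered by size.
    heights = [height for _, height in sorted(probe_peaks)]
    q2 = median(heights[:20])

    # Q3: median of the longest probes over median of the shortest probes
    # (assumption: split at the midpoint of the size-ordered list).
    half = len(heights) // 2
    q3 = median(heights[-half:]) / median(heights[:half])
    return q1, q2, q3


def trace_flags(q1, q2, q3):
    """Apply the thresholds quoted in the text."""
    return {
        "q1_ok": q1 > 5,
        "q2_ok": 450 <= q2 <= 4000,
        "q3_ok": q3 > 0.5,
    }


# Hypothetical peak heights in RFU.
controls = {64: 100, 70: 120, 76: 110, 82: 90, 94: 800}
probes = [(130, 1200), (160, 1000), (300, 700), (400, 500)]
q1, q2, q3 = quality_indicators(controls, probes)
flags = trace_flags(q1, q2, q3)
print(q1, q2, q3, flags)
```

As noted in the text, a failed flag is an aid to the user's decision and not an automatic exclusion, since traces with a lower Q1 can still give reliable peak patterns.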
Other factors that are important for the assessment of quality, which can optionally be stored in the database, are the DNA concentration of the sample, the tumour percentage of the tumour specimens and the intrinsic DNA quality of the sample. The combination of these quality parameters allows the user to decide on inclusion or exclusion of a trace from the analysis.
Normalisation
Raw MLPA results are not calibrated. Peak areas or heights depend on sample quality, hybridisation parameters and instrument settings. To analyse the MLPA traces, internal and external control loci are used for the normalisation of the data. External controls, e.g., normal tissue in tumour analysis, have to be present in every experiment for the pattern comparison. Internal controls for the calibration of the samples are present in every kit; these reference probes are assumed to be unaltered in a tumour sample. The reference probes are compared to the probes where DNA changes are expected.
The top trace in Figure 2 shows a normal sample. It is evident that peak heights or areas differ between probes, and these differences have to be corrected. Also, the average peak areas or heights may differ from sample to sample. Therefore, sample calibration and probe calibration have to be performed. Consider the data as a matrix $Y = [y_{ij}]$, with columns for the probes and rows for the samples. Then, we need to apply normalisation to both rows and columns. Normalisation is implemented as division by row parameters $r_i$, $i = 1, \ldots, m$, and column parameters $c_j$, $j = 1, \ldots, n$, such that a matrix $X = [x_{ij}]$ results, with $x_{ij} = y_{ij}/(r_i c_j)$. We prefer to work on the original scale instead of with logarithms because loss and gain correspond to integer ratios (including zero) on the original scale.
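The division by row and column parameters can be written out directly. The sketch below uses plain Python lists; the function name `normalise` and the numbers are hypothetical and chosen so that the arithmetic is easy to follow.

```python
def normalise(y, r, c):
    """Return X with x[i][j] = y[i][j] / (r[i] * c[j])."""
    return [
        [y[i][j] / (r[i] * c[j]) for j in range(len(c))]
        for i in range(len(r))
    ]


# Two samples (rows) and two probes (columns), hypothetical values.
y = [[200.0, 400.0],
     [100.0, 100.0]]
r = [2.0, 1.0]       # row (sample) parameters
c = [100.0, 200.0]   # column (probe) parameters

x = normalise(y, r, c)
print(x)  # [[1.0, 1.0], [1.0, 0.5]]
```

In this toy example the second probe of the second sample ends up at 0.5, the value expected on the original scale for loss of a single allele.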
A simple approach would be to take row and column medians for r and c, respectively. This could work well if the number of deletions or amplifications is relatively small. However, for samples with a large number of deletions (more than 50%), the corresponding row median might become a number near zero, and normalisation by dividing by this small number would give a completely wrong result.
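A small numerical illustration of this failure mode, using hypothetical values for a single sample in which four of six probes are homozygously deleted and only background signal remains. Which probes count as unaltered is assumed to be known here.

```python
from statistics import median

# Hypothetical probe-corrected signals for one tumour sample.
# The first four probes are deleted; the last two are unaltered.
signals = [0.02, 0.01, 0.02, 0.02, 1.00, 0.98]

# Naive approach: the row median is dominated by the deleted probes.
row_median = median(signals)
by_row_median = [s / row_median for s in signals]

# Alternative: calibrate on the probes assumed to be unaltered.
reference = median(signals[4:])
by_reference = [s / reference for s in signals]

print(row_median, max(by_row_median))
print(reference, [round(v, 2) for v in by_reference])
```

With the row median, the unaltered probes appear as roughly 50-fold gains, whereas calibration on the unaltered probes keeps them near 1 and the deleted probes near 0.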
To improve normalisation and obtain calibration factors, we use only a subset of the samples and probes. Specifically, we use normal samples, and only a subset of the probes, where copy number changes are unlikely, even in tumour samples. We use the following algorithm:
1. To correct for the sample-to-sample variation, divide the peak heights or areas of all the probes in each sample by their median. This gives provisional row parameters, ř, for the normal samples, and a provisional normalisation of the normal samples.
2. To correct for systematic differences between probes, divide the peak heights or areas of all the probes within an MLPA run by their median. This results in the normalised peak areas or heights, and represents the column parameters c for all probes. The average of all probes is now close to 1.
3. Select the probes that have a small probability of change in copy number. Call these the reference probes. The remaining probes are called the focus probes, since we look for changes in these. The description file for commercial kits includes this information, and the program uses these probes by default.
4. Select the part of the data that represents the normal control or non-tumour samples and the reference probes.
5. Redo steps 1 and 2 for the subsets of reference samples and reference probes.
6. Determine which probes are most stable. Subtract 1 from each normalised peak height or area and take the absolute value. For the integrated MLPA analysis, compute the median of these numbers for each probe. This is the median of the absolute deviations (MAD).
7. The reference probes with the lowest MAD are the most stable. Select the five probes with a MAD closest to zero. These are the probes that we call the calibration probes.
8. Compute the median peak height or area of the five calibration probes for each sample (normal and test samples, or tumours and non-tumours). Divide all peak heights or areas as computed in step 2 in each sample by this value. This gives the final row parameters r for all samples, and their final normalisation.
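The eight steps can be sketched compactly in Python. This is one reading of the algorithm, not the MLPAinter source: it assumes that the column parameters c are estimated from the normal samples only and then applied to all samples, and the function names, argument layout and toy data are hypothetical.

```python
from statistics import median


def row_then_column_normalise(rows):
    """Steps 1 and 2: divide each row by its median, then each column
    by its median. Returns (normalised rows, column medians)."""
    by_row = [[v / median(row) for v in row] for row in rows]
    col_medians = [median(col) for col in zip(*by_row)]
    normalised = [[v / m for v, m in zip(row, col_medians)] for row in by_row]
    return normalised, col_medians


def normalise_mlpa(y, is_normal, is_reference, n_calibration=5):
    """y: samples x probes matrix of raw peak heights or areas.
    is_normal: one bool per sample; is_reference: one bool per probe."""
    normals = [row for row, flag in zip(y, is_normal) if flag]

    # Steps 1-2 on the normal samples give the column parameters c.
    _, c = row_then_column_normalise(normals)

    # Steps 3-5: repeat on normal samples and reference probes only.
    ref_idx = [j for j, flag in enumerate(is_reference) if flag]
    subset = [[row[j] for j in ref_idx] for row in normals]
    subset_norm, _ = row_then_column_normalise(subset)

    # Step 6: median absolute deviation from 1 for each reference probe.
    mad = [median(abs(v - 1.0) for v in col) for col in zip(*subset_norm)]

    # Step 7: the reference probes with the lowest MAD calibrate the data.
    ranked = sorted(zip(mad, ref_idx))
    calibration = [j for _, j in ranked[:n_calibration]]

    # Step 8: final row parameter r per sample, applied to all samples.
    x = []
    for row in y:
        corrected = [v / cj for v, cj in zip(row, c)]
        r = median(corrected[j] for j in calibration)
        x.append([v / r for v in corrected])
    return x, calibration


# Toy data: six reference probes followed by two focus probes.
probe_signal = [100, 200, 150, 120, 180, 160, 140, 110]
is_reference = [True] * 6 + [False] * 2
normal_a = [1.0 * p for p in probe_signal]
normal_b = [2.0 * p for p in probe_signal]
normal_c = [0.5 * p for p in probe_signal]
normal_b[5] *= 1.3   # reference probe 5 is unstable in this toy data
normal_c[5] *= 0.7
tumour = [1.5 * p for p in probe_signal]
tumour[6] *= 0.5     # single-copy loss at both focus probes
tumour[7] *= 0.5

y = [normal_a, normal_b, normal_c, tumour]
x, calibration = normalise_mlpa(y, [True, True, True, False], is_reference)
print(sorted(calibration))
print([round(v, 2) for v in x[3]])
```

In the toy data the unstable reference probe has the highest MAD and is not chosen for calibration, and the tumour sample shows values near 0.5 at the two focus probes and near 1 elsewhere. The `n_calibration` argument mirrors the configurable number of calibration probes.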
Reference probe selection
As in quantitative RT-PCR, the selection of reference probes is a critical element of the analysis [17]. MLPA kits contain about 10 reference probes that are included for normalisation purposes because they are not involved in the experimental hypothesis or diagnostic question. Alternatively, one can usually find a subset of probes in existing kits that are known not to be involved in the hypothesis. The procedure selects the most stable probes from the reference probes to calibrate the data. The number of calibration probes used (five, in this instance) did not significantly influence the results (data not shown); nevertheless, the number is configurable in the program. If probes show high variability between replicates or between normal samples, they should be excluded from the analysis.
Visualisation
We have designed a number of visualisations to interpret the results after the normalisation and quality control of the data set. The first visualisation is a heat map that shows all of the data in an experiment. Deletions and gains are colour-coded with configurable thresholds. Probes can be sorted by locus name or chromosomal position. The reference and calibration probes are clearly differentiated by a grey shade (Figure 1). Another visualisation shows the normalised values of all replicates of one sample in a plot (Figure 1). Technical replicates are shown in different colours. On the x-axis, the different probes are shown in the selected probe order. The y-axis is on a scale from 0 to 2.5, where 0 stands for absent probes. Ideally, probes at genomic loci with loss of a single allele show values around 0.5. Unaltered probes are visualised around 1.0. Probes with DNA gains have values around 1.5 or above. In tumour samples contaminated with normal DNA, these values are usually less pronounced. The researcher should keep this in mind during the interpretation. Information about sample characteristics and the probes used is also shown in the plots.
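The colour coding and the effect of normal-cell contamination can be illustrated with a short sketch. The cut-offs of 0.75 and 1.25 are hypothetical defaults, since the thresholds in the program are configurable, and `expected_ratio` is a simple mixing model that assumes a diploid genome in both the normal and the tumour cells; neither is taken from MLPAinter.

```python
def classify(value, loss_below=0.75, gain_above=1.25):
    """Label a normalised probe value; the thresholds are assumptions."""
    if value < loss_below:
        return "loss"
    if value > gain_above:
        return "gain"
    return "normal"


def expected_ratio(tumour_copies, tumour_fraction):
    """Expected normalised value when a fraction of the cells carries
    tumour_copies copies and the remaining cells carry two copies."""
    mixed_copies = (1 - tumour_fraction) * 2 + tumour_fraction * tumour_copies
    return mixed_copies / 2


calls = [classify(v) for v in (0.0, 0.5, 1.0, 1.5)]
print(calls)  # ['loss', 'loss', 'normal', 'gain']

# Single-copy loss in a sample with 60% tumour cells: about 0.7, not 0.5.
diluted = expected_ratio(1, 0.6)
print(round(diluted, 2), classify(diluted))
```

Under this model a single-copy loss in a sample with 60% tumour cells appears at about 0.7, which is why fixed thresholds should be interpreted with the tumour percentage in mind.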
Future developments
Currently, the system is suited for the analysis, visualisation and data management of MLPA. However, not all of the information generated during an experiment is yet fully integrated in the data analysis. For instance, tumour percentages can be stored in the database and will be displayed, but it is up to the user to incorporate this information into the interpretation. We plan to include the tumour percentage and probably the DNA index for automated identification of the allelic state of the chromosomal aberrations in the analysed sample [18]. Another worthwhile improvement would be to remove the dependency on an external program for peak detection.