Auto-validation of fluorescent primer extension genotyping assay using signal clustering and neural networks
© Huang et al; licensee BioMed Central Ltd. 2004
Received: 01 November 2003
Accepted: 02 April 2004
Published: 02 April 2004
SNP genotyping typically incorporates a review step to ensure that the genotype calls for a particular SNP are correct. For high-throughput genotyping, such as that provided by the GenomeLab SNPstream® instrument from Beckman Coulter, Inc., the manual review used for low-volume genotyping becomes a major bottleneck. The work reported here describes the application of a neural network to automate the review of results.
We describe an approach to reviewing the quality of primer extension 2-color fluorescent reactions by clustering optical signals obtained from multiple samples and a single reaction set-up. The method evaluates the quality of the signal clusters from the genotyping results. We developed 64 scores to measure the geometry and position of the signal clusters. The expected signal distribution was represented by a distribution of a 64-component parametric vector obtained by training a two-layer neural network on a set of 10,986 manually reviewed 2D plots containing the signal clusters.
The neural network approach described in this paper may be used with results from the GenomeLab SNPstream instrument for high-throughput SNP genotyping. The overall correlation with manual review was 0.844. The approach can be applied to a quality review of results from other high-throughput fluorescence-based biochemical assays.
The imaging software performs the first steps of data analysis: grid alignment and recording of spot images. The software analyzes each spot for morphology; that is, circular shape and uniform pixel intensities across the spot. Spots with low intensity or unsatisfactory morphology are recorded as failed. For spots that pass the morphology test, the values of the fluorescence at the two wavelengths are recorded in a database. The failed spots are recorded as empty (zero intensity) and are carried through the remainder of the analysis.
The GetGenos™ software module performs the next step of the image analysis and assigns the genotype calls to every SNP spot. Three different calls are possible for the SNP site on the two copies of a chromosome in an individual DNA sample: a homozygous genotype (represented by the general XX), indicating that both chromosomes have the same allele of one type; a heterozygous genotype (XY), indicating that the two chromosomes have two different SNP alleles; and the homozygous YY indicating that both chromosomes have the same allele, opposite to the XX type. The software assigns one of these three calls to each point by collecting signals into three clusters, according to the ratio of intensities from two fluorescent colors in the SNP spot and a set of built-in values for cluster geometry and minimum color intensity thresholds. GetGenos may also fail some SNP signal spots if they cannot be included in any cluster or if their intensities are below the default intensity baseline for both allele colors.
For ultra-high throughput work, the sheer volume of plots makes manual review for quality impractical. Therefore, we have developed a neural network algorithm to automatically grade plots as "Pass" or "Fail." The algorithm uses 64 statistical measures of plot quality, derived from the genotype calls from the GetGenos phase, for its automatic grading. The goal of the neural network training was to match the "Pass/Fail" grades of the manual grading made by trained reviewers. The present paper describes the training of the network and the analysis of a large test set of manually graded plots by the GetGenos procedure and then by the trained neural network.
Table. GetGenos suggested grade accuracy. Distributions of the suggested grades of "pass," "look," and "fail" assigned by GetGenos, compared to the P-plot validation made by the trained reviewers. Column headings: GetGenos™ suggested grade; % from total with this GetGenos grade.
Statistical scores measuring signal call clusters quality
The scores are summarized in Additional file 1. A stored procedure calculates the parameters that provide the input for the neural net. The parameters measure the geometry of the GetGenos signal call clusters and their relative separation, with the exception of the test for deviation from Hardy-Weinberg equilibrium (score #60), which measures the data reliability from the point of view of statistical genetics (see Additional file 1).
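The exact definition of score #60 is given in Additional file 1 and is not reproduced here. As a minimal sketch, assuming the score is the textbook chi-square statistic computed from the three genotype counts of a plot, the calculation might look as follows. The function name and argument names are illustrative and not part of the published procedure.

```python
def hardy_weinberg_chi_square(n_xx: int, n_xy: int, n_yy: int) -> float:
    """Chi-square statistic for deviation from Hardy-Weinberg equilibrium.

    Assumption: the standard textbook test on observed genotype counts;
    the exact form of score #60 may differ.
    """
    n = n_xx + n_xy + n_yy
    if n == 0:
        return 0.0  # no genotype calls, nothing to test
    p = (2 * n_xx + n_xy) / (2 * n)  # frequency of allele X
    q = 1.0 - p                      # frequency of allele Y
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_xx, n_xy, n_yy)
    # Skip classes with zero expected count (monomorphic SNPs).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

Counts that match the expected proportions exactly (for example 25 XX, 50 XY, 25 YY) give a statistic of zero, whereas a plot with no heterozygotes at intermediate allele frequency gives a large value.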
Review by the neural net
We identified several parameters of the neural network that affected the learning accuracy the most: neuron activation function, number of learning epochs, and frequency of crossovers between different populations in a genetic algorithm. The learning accuracy was estimated as the percentage of the plots classified correctly by the neural net compared with the human validation. These parameters were tested with different values to find the optimal result. The optimal parameters for neural network training are described in the "Methods" section. The size of the training set was also very important. Most failed P-plots in the database did not have any SNP calls from the entire micro-titer plate. Such trivial cases were not included in the training set. We found only 986 P-plots in the database that had SNP calls and were failed by manual review. Therefore, we had to use 10,000 "Passed" P-plot examples to achieve good accuracy. Smaller training sets produced a less accurate neural network. For example, training with 1,986 P-plots (986 "Failed" and 1,000 "Passed") yielded a net with 67% prediction accuracy.
Table. Distributions of Pass/Fail grades assigned by the neural net and by reviewers. Column heading: % of total for this neural net grade.
To investigate the reason for the less accurate predictions on the failed P-plots, we manually re-reviewed about 20 of the 2,027 manually passed plots that the net graded as "failed." This second manual inspection indicated that "Fail" was the correct grade for more than half of them, suggesting that the actual prediction accuracy was higher for plots graded "Fail" by the net. The major sources of human error are fatigue, inexperience, and insufficient time to complete the manual review in a high-throughput mode. Because our reviewers had been trained, we concluded that the most likely reason for misjudgment was fatigue due to a large volume of data. All SNPs erroneously labeled as "Passed" had very few successful genotype calls made by GetGenos, and these calls had a low signal intensity on the border of the minimum allowed intensity threshold.
Though training the net required 1.5 days, the processing of the 26,854 test plots by the trained net took only 10 minutes. The trained neurons were recovered from the training program, written in the C programming language, and transferred to an Oracle stored procedure written in the PL/SQL programming language. The procedure's input is a parameter vector from the procedure calculating 64 statistical scores from Table 1. Every P-plot is classified further as "Pass" or "Fail" in the UHT instrument by applying the neural net to the parameter vector.
It is important to understand that the automated procedure for the validation of the genotype calls described in this paper does not validate the accuracy of the calls themselves. The genotype call is made by the GetGenos software, which has a reported accuracy of 99.5%. Our method validates the reliability of the GetGenos call as applied to a particular SNP measured in a particular sample set on a single micro-titer dish. The neural net algorithm does not modify the GetGenos genotype call; it simply evaluates whether this call is trustworthy or not.
The successful genotyping of a particular SNP depends on a variety of factors. In the case of the SNPstream instrument, they include the quality of the DNA samples, the DNA sequence surrounding the SNP site, and the quality of the micro-titer dish preparation. Human and instrumentation errors can also be a source of failure. These factors are likely to be sources of failure for other SNP genotyping technologies as well. We designed our quality control algorithm to detect failures due to these factors automatically, in a high-throughput manner, without manual intervention. The SNPs failed by the algorithm should not affect the results of any statistical analysis, because the algorithm does not discriminate SNPs based on their sequence and fails the entire measurement, including both alleles; it therefore creates no bias towards a particular allele. Consequently, the accuracy of the genotype call remains 99.5% when calculated for passed plots only, as reported previously. At most, the algorithm may fail a particular SNP variation, for example "G/C," more often than other variations on average. More investigation is necessary to find such "unfavorable" variations. However, the result of such an investigation would only reflect the difficulty of genotyping the particular SNP variation with this technology.
With the addition of the neural net for the final review of the GetGenos results, it is possible to automate the entire procedure for assigning genotypes and monitoring the quality of results. The goal of our neural net development was to match the binary output to the Pass/Fail grades assigned to P-plots by trained reviewers. The reviewers agreed with the fully trained net in 99.98% of the cases that the net graded as "pass" but it is more stringent than the reviewers in that it fails 20.2% of the plots passed by the reviewers. The overall agreement is 92.4%.
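Summary figures of this kind can be derived from a 2 × 2 table of net grades against reviewer grades. The sketch below assumes that the correlation coefficient is the phi coefficient (the Pearson correlation of two binary variables), which this text does not state explicitly. The function name, argument names, and the counts used in the usage note are hypothetical and are not taken from the study.

```python
import math

def agreement_and_phi(both_pass, net_pass_only, net_fail_only, both_fail):
    """Return (fraction of plots graded identically, phi coefficient).

    both_pass:     net Pass, reviewer Pass
    net_pass_only: net Pass, reviewer Fail
    net_fail_only: net Fail, reviewer Pass
    both_fail:     net Fail, reviewer Fail
    """
    total = both_pass + net_pass_only + net_fail_only + both_fail
    agreement = (both_pass + both_fail) / total
    numerator = both_pass * both_fail - net_pass_only * net_fail_only
    denominator = math.sqrt(
        (both_pass + net_pass_only)    # all plots passed by the net
        * (both_pass + net_fail_only)  # all plots passed by reviewers
        * (both_fail + net_pass_only)  # all plots failed by reviewers
        * (both_fail + net_fail_only)  # all plots failed by the net
    )
    phi = numerator / denominator if denominator else 0.0
    return agreement, phi
```

With hypothetical counts of 40, 10, 10, and 40, the function returns an agreement of 0.8 and a phi coefficient of 0.6, which shows that the two measures need not coincide.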
The SNP genotyping project used for developing the neural net involved a large number of SNPs and a relatively small number of samples. For production work and other cases with a small number of SNPs and a large number of samples, it would be worthwhile to train the neural net on each SNP individually. We have found that, where studies involve samples from related individuals, automated checking of Mendelian Inheritance is an additional useful tool.
The major advantage of the automated approach is that it eliminates the bottleneck that accompanies manual review of the cluster data. In addition, it provides a uniform approach to review that is not attainable with even the most experienced group of reviewers. The neural net method, in particular, should be applicable to other data from high-throughput projects.
We have developed an approach to automatically validate color or fluorescent biochemical reactions. The procedure clusters results from multiple individual assays and requires a training set consisting of manually validated signal clusters. The procedure automatically compares new signal measurements from the instrument with the distribution in the training set. The current work demonstrates the success of the approach with a high-throughput SNP genotyping reaction, but it is also applicable to other assays involving review of groups of results.
SNP-IT primer extension reaction
The SNP-IT primer extension reaction has been described previously [5, 6]. In brief, the multiplex SNP-IT reaction requires three oligonucleotide primers for each SNP marker and involves the following three steps: 1) multiplex PCR amplification of the sequence surrounding a SNP from the two chromosome copies, 2) multiplex single nucleotide cycling primer extension using the third tagged SNP-IT primer and fluorescent-labeled dideoxynucleoside triphosphates, 3) tag hybridization of SNP-IT primer to complementary tag oligonucleotide spotted on the solid surface (Figure 1).
The GetGenos procedure converts the blue (B) and green (G) intensities for each point in a sample set to an angle using: Angle = arctan(B/G). The program then finds signal clusters by splitting the angle space into 90 one-degree bins and finding the populated groups of bins. To be in one group, the bins with signals must be closer than a built-in, user-set bin distance threshold (see Appendix). Using the average angle for the group, the procedure classifies each group as XX, XY, or YY. Once all the groups have classifications, it sets the boundaries for the XX, XY, and YY genotype clusters. If a calculated boundary is outside a built-in, user-set boundary limit (see Appendix), the boundary is set to the boundary limit. The procedure also determines the threshold for the combined signal strength based on the distribution of all the points.
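The angle conversion and bin grouping can be illustrated with a short sketch. It is not the GetGenos implementation: the function names are hypothetical, the defaults are taken from the Appendix values MIN_SPACE, BIG_GROUP, XX_MIN_ANGLE, and YY_MAX_ANGLE, and the classification uses the mean angle only, omitting the XX% and YY% rules and the boundary-setting step. Using atan2 instead of arctan(B/G) is an assumption made so that points with zero green intensity are handled.

```python
import math
from collections import Counter

def to_angle(blue: float, green: float) -> float:
    """Angle in degrees (0 to 90 for non-negative intensities)."""
    return math.degrees(math.atan2(blue, green))

def find_groups(angles, min_space=4, big_group=4):
    """Group populated one-degree bins that lie closer than min_space.

    Returns a list of groups, each a sorted list of bin indices (0..89).
    Groups holding fewer than big_group points are discarded.
    """
    bins = Counter(min(int(a), 89) for a in angles)  # 90 one-degree bins
    groups, current = [], []
    for b in sorted(bins):
        if current and b - current[-1] >= min_space:
            groups.append(current)  # gap is wide enough: close the group
            current = []
        current.append(b)
    if current:
        groups.append(current)
    return [g for g in groups if sum(bins[b] for b in g) >= big_group]

def classify(mean_angle, xx_min_angle=7.5, yy_max_angle=82.5):
    """Simplified call from the average angle of a group."""
    if mean_angle < xx_min_angle:
        return "XX"
    if mean_angle > yy_max_angle:
        return "YY"
    return "XY"
```

For a plot with three well-separated clusters near 3, 45, and 87 degrees, find_groups returns three groups, and classify assigns them XX, XY, and YY respectively.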
Manual review of P-plots
About 40,000 manually classified P-plots have accumulated in the database during the development of the instrument. The manual P-plot validation was done by five trained reviewers on three different instruments using the QCreview™ interface for visualization of P-plots from GetGenos results.
Neural network architecture and training
The training was done using 10,986 manually reviewed P-plots from the database. The training set contained 10,000 passed plots and 986 failed plots. The genetic algorithm used for neural net training is described in . The code was optimized substantially to include direct operations on memory for population crossover and mixing functions. The neural activation functions were also changed as described previously in this paper. These optimizations accelerated the learning algorithm more than tenfold. The learning accuracy was also increased by about 10%.
For the genetic algorithm we used 10 populations. Every population contained 60 "people" or "individuals." Each "individual" was a single vector that concatenated the weights of all neurons with the seven constants (one per neuron) that are added to each neuron's vector multiplication product. Thus, for an input vector with 64 dimensions, the population matrix for the genetic algorithm had a size of 10 × 60 × 397:
10 populations; 60 people in every population; 397 = 64 × 6 (six first-layer neurons, each with a 64-component weight vector) + 6 (first-layer constants) + 6 × 1 (one second-layer neuron with a 6-component weight vector) + 1 (second-layer constant).
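The length of 397 can be checked with a short sketch that also splits one flat "individual" into its layers. The ordering of the segments inside the vector (first-layer weights, first-layer constants, second-layer weights, second-layer constant) is an assumption made for illustration; the text gives only the segment sizes. All names are hypothetical.

```python
N_INPUTS = 64  # statistical scores per P-plot
N_HIDDEN = 6   # neurons in the first layer

def individual_length(n_inputs=N_INPUTS, n_hidden=N_HIDDEN):
    """Number of values stored in one individual of the population."""
    return n_inputs * n_hidden + n_hidden + n_hidden * 1 + 1

def unpack(individual, n_inputs=N_INPUTS, n_hidden=N_HIDDEN):
    """Split a flat vector into (w1, b1, w2, b2) under the assumed ordering."""
    if len(individual) != individual_length(n_inputs, n_hidden):
        raise ValueError("unexpected individual length")
    pos = n_inputs * n_hidden
    # One weight vector of n_inputs values per first-layer neuron.
    w1 = [individual[i * n_inputs:(i + 1) * n_inputs] for i in range(n_hidden)]
    b1 = individual[pos:pos + n_hidden]  # first-layer constants
    pos += n_hidden
    w2 = individual[pos:pos + n_hidden]  # weights of the output neuron
    pos += n_hidden
    b2 = individual[pos]                 # constant of the output neuron
    return w1, b1, w2, b2
```

Calling individual_length() with the defaults returns 397, matching the last dimension of the population matrix.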
The initial neuron weights were assigned randomly to all 10 populations. The weights for the first half of each population were random values between -1 and 1, and the weights for the second half were random values between -100 and 100. Every evolution epoch, or cycle of adjustment of the neuron weights, included the following steps. The five best "people" and the eight "people" selected at random in every population were kept intact. Ten new "people" were added to every population at every epoch. The weights for the new "people" received random values between -10 and 10. Seven "people" were mutated at random by changing the existing weights by no more than twofold. The remaining 35 "individuals" were mutated by crossovers with donors selected randomly from the best five "people" of the same population. The crossover length and site were chosen at random. Every second epoch, the crossover was done with donors from the top five "people" of another randomly selected population. After crossover, the weights in the acceptor vector were also slightly mutated by no more than 1.25 times.
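The crossover step can be sketched as follows. This is an illustration rather than the original C code: the function name is hypothetical, and "mutated by no more than 1.25 times" is interpreted here as multiplying each weight by a random factor between 1/1.25 and 1.25, which is an assumption.

```python
import random

def crossover(acceptor, donor, rng, max_factor=1.25):
    """Copy a random segment of the donor into the acceptor, then mutate.

    The crossover site and length are chosen at random, as in the text.
    Each resulting weight is scaled by a random factor in
    [1 / max_factor, max_factor] (assumed interpretation of the mutation).
    """
    n = len(acceptor)
    site = rng.randrange(n)               # start of the copied segment
    length = rng.randint(1, n - site)     # at least one weight is copied
    child = list(acceptor)
    child[site:site + length] = donor[site:site + length]
    return [w * rng.uniform(1.0 / max_factor, max_factor) for w in child]
```

Passing a seeded generator such as random.Random(0) makes a run reproducible, which is convenient when comparing training parameters.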
The scoring function evaluating the performance of a single "individual" in the population was calculated as the average of the percentage of correctly predicted "passed" vectors and the percentage of correctly predicted "failed" vectors. One thousand learning epochs were executed during training. The training took 1.5 days on an 800 MHz Gateway Pentium PC. The best prediction rate was 65% at the beginning of the training, and the final prediction rate was 95%. Improvement slowed markedly as training progressed. For example, the prediction rate of 85% was achieved after 150 epochs and 90% after 250 epochs.
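The scoring function described above can be written in a few lines. The sketch below is an illustration, not the original code; the function name and the boolean encoding of grades (True for "Pass") are assumptions.

```python
def score(predicted, actual):
    """Average of the per-class percentages of correct predictions.

    predicted, actual: equal-length sequences of booleans, True = "Pass".
    """
    pass_total = sum(1 for a in actual if a)
    fail_total = len(actual) - pass_total
    pass_hits = sum(1 for p, a in zip(predicted, actual) if a and p)
    fail_hits = sum(1 for p, a in zip(predicted, actual) if not a and not p)
    pass_rate = 100.0 * pass_hits / pass_total if pass_total else 0.0
    fail_rate = 100.0 * fail_hits / fail_total if fail_total else 0.0
    return (pass_rate + fail_rate) / 2.0
```

Averaging the two percentages matters because the training set is unbalanced (10,000 passed and 986 failed plots). A net that grades every plot "Pass" would classify about 91% of the plots correctly, yet it scores only 50 with this function.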
Microsoft Visual Studio™ was the development platform for the neural net, which was written in C programming language. It runs as a compiled .EXE file, with a text file exported from an Oracle table as the input.
User-set values for GetGenos
XX_MIN_ANGLE: The lowest angle for an XY cluster center (average). The default value is 7.5.
YY_MAX_ANGLE: The highest angle for an XY cluster center. The default value is 82.5.
XX_ANGLE: A value used in calculating XX%, below. The default value is 8.
YY_ANGLE: A value used in calculating YY%, below. The default value is 82.
XX%: The minimum percentage of points in a cluster that must be lower than XX_ANGLE. Otherwise, the cluster will be considered an XY cluster. The default percentage value is 10.
YY%: The minimum percentage of points in a cluster that must be higher than YY_ANGLE. Otherwise, the cluster will be considered an XY cluster. The default percentage value is 10.
BIG_GROUP: The minimum number of points necessary to form a group or cluster. The default value is 4.
MIN_SPACE: Set the minimum distance, in degrees, between two groups or clusters. The default value is 4.
MIN_BASELINE: The minimum total intensity (Blue + Green) necessary for a point to pass. Points with a total intensity less than this value fail, regardless of angle. The default value is 1000.
PASS_RATE: The percentage of passed sample points must be greater than this value for a plot to receive a provisional grade of "Pass". Default value is 90.
SUGGESTED_PASS_RATE: The percentage of passed sample points must be greater than this value for a plot to receive a provisional grade of "Suggested Pass". The default value is 75.
SUGGESTED_FAIL_RATE: The percentage of passed sample points must be greater than this value for a plot to receive a provisional grade of "Suggested Fail". The default value is 50.
FAIL_RATE: The percentage of passed sample points for a plot that receives a provisional grade of "Fail" is less than this value. The default value is 30.
PASS_HW_SCORE: The maximum Hardy-Weinberg chi-square allowed for a plot that receives a provisional grade of "Pass" or "Suggested Pass". The default value is 5.
SUGGEST_FAIL_HW_SCORE: The Hardy-Weinberg chi-square for a plot that receives a provisional grade of "Suggested Fail" exceeds this value. The default value is 10.
We thank Lori Wilson for proofreading the manuscript. We thank an anonymous reviewer for suggesting the correlation coefficient as the measure of neural net's overall accuracy.
1. Yuryev A, Huang J, Pohl M, Patch R, Watson F, Bell P, Donaldson M, Phillips MS, Boyce-Jacino MT: Predicting success of primer extension genotyping assay using statistical modeling. Nucleic Acids Res 2002, 30(23):e131. 10.1093/nar/gnf131
2. Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 2000, 407(6803):513–516. 10.1038/35035083
3. The International SNP Map Working Group Nature 2001, 409:928–933. 10.1038/35057149
4. SNP database Home Page [http://www.ncbi.nlm.nih.gov/SNP/]
5. Bell PA, Chaturvedi S, Gelfand CA, Huang CY, Kochersperger M, Kopla R, Modica F, Pohl M, Varde S, Zhao R, Zhao X, Boyce-Jacino MT: GenomeLab SNPstream: ultra-high throughput SNP genotyping for pharmacogenomics and drug discovery. Biotechniques 2002, Suppl:70–77.
6. Nikiforov TT, Rendle RB, Goelet P, Rogers YH, Kotewicz ML, Anderson S, Trainor GL, Knapp MR: Genetic Bit Analysis: a solid phase method for typing single nucleotide polymorphisms. Nucleic Acids Res 1994, 22:4167–4175.
7. Weir BS: Genetic data analysis II. Sunderland, MA: Sinauer Associates; 1996.
8. Weisman O, Pollack Z: NNUGA – Neural Network Using Genetic Algorithms. [http://www.cs.bgu.ac.il/~omri/NNUGA/]
9. Fan JB, Chen X, Halushka MK, Berno A, Huang X, Ryder T, Lipshutz RJ, Lockhart DJ, Chakravarti A: Parallel genotyping of human SNPs using generic high-density oligonucleotide tag arrays. Genome Res 2000, 10(6):853–860. 10.1101/gr.10.6.853
10. SAS Institute Inc: SAS/STAT User's Guide, Version 8. Cary, NC; 1999.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.