Application of machine learning in SNP discovery
© Matukumalli et al; licensee BioMed Central Ltd. 2006
Received: 11 August 2005
Accepted: 06 January 2006
Published: 06 January 2006
Single nucleotide polymorphisms (SNP) constitute more than 90% of the genetic variation, and hence can account for most trait differences among individuals in a given species. Polymorphism detection software PolyBayes and PolyPhred give high false positive SNP predictions even with stringent parameter values. We developed a machine learning (ML) method to augment PolyBayes to improve its prediction accuracy. ML methods have also been successfully applied to other bioinformatics problems in predicting genes, promoters, transcription factor binding sites and protein structures.
The ML program C4.5 was applied to a set of features in order to build a SNP classifier from training data based on human expert decisions (True/False). The training data were 27,275 candidate SNP generated by sequencing 1973 STS (sequence tag sites) (12 Mb) in both directions from 6 diverse homozygous soybean cultivars and PolyBayes analysis. Test data of 18,390 candidate SNP were generated similarly from 1359 additional STS (8 Mb). SNP from both sets were classified by experts. After training the ML classifier, it agreed with the experts on 97.3% of test data compared with 7.8% agreement between PolyBayes and experts. The PolyBayes positive predictive values (PPV) (i.e., fraction of candidate SNP being real) were 7.8% for all predictions and 16.7% for those with 100% posterior probability of being real. Using ML improved the PPV to 84.8%, a 5- to 10-fold increase. While both ML and PolyBayes produced a similar number of true positives, the ML program generated only 249 false positives as compared to 16,955 for PolyBayes. The complexity of the soybean genome may have contributed to high false SNP predictions by PolyBayes and hence results may differ for other genomes.
A machine learning (ML) method was developed as a supplementary feature to the polymorphism detection software for improving prediction accuracies. The results from this study indicate that a trained ML classifier can significantly reduce human intervention and in this case achieved a 5–10 fold enhanced productivity. The optimized feature set and ML framework can also be applied to all polymorphism discovery software. ML support software is written in Perl and can be easily integrated into an existing SNP discovery pipeline.
Machine learning (ML) is the study and computer modeling of learning processes including the acquisition of new declarative knowledge, organization of new knowledge into general effective representations, and the discovery of new facts through observation and experimentation. Machine learning programs are advantageous in many cases where the input/output pairs can be specified, but the concise relationship between the input/output pairs is not known. Machine learning programs can help in extracting the complex relationships and correlations hidden in large data sets (a process sometimes known as data mining).
The prediction accuracy of different machine learning programs varies and depends on the type of problem, dataset and the algorithm used. Examples of application domains include protein classification tissue classification for different types of cancer, protein secondary structure prediction , text mining, protein-protein interactions and RNA binding proteins. The most common ML algorithms include decision trees, production rules, support vector machines, naïve Bayes, neural networks, and genetic algorithms. There are several free software suites available, including Weka , C4.5 , and GIST .
Single nucleotide polymorphisms (SNP) are single base variations or short insertions/deletions in the nucleotide sequence from different individuals or between homologous sequences within an individual. SNP markers are relatively dense and abundant when compared to other marker types. SNP can be used for distinguishing between individuals and species, genetic analysis of disease and complex traits, assessment of linkage disequilibrium (LD), haplotype map generation, pharmacogenomics, etc. In a large scale SNP discovery project after sequencing and assembly of the sequences from different individuals/genotypes, candidate SNP are usually identified by using programs like PolyBayes  or PolyPhred .
PolyPhred is more commonly used for SNP detection in re-sequencing data as it can detect the heterozygotes. PolyBayes was designed to statistically detect SNPs in multiple alignments of overlapping EST or shotgun sequences. However, PolyBayes is more suitable for soybean re-sequencing data as soybean is an extensively in-bred species and most "heterozygous" bases observed would be due to single base differences between paralogs. Each of the candidate SNP identified by PolyBayes is expertly verified by visual inspection. The criteria for a good SNP include high quality phred scores of the varying base position, minor allele frequency, agreement between the forward and reverse reads and co-variance of polymorphisms for the same genotype. This paper addresses the issue of reducing the amount of intervention required by human experts.
Application of machine learning in polymorphism discovery
identify features that can influence the polymorphism scoring decisions,
develop a software program for applying the ML program C4.5 to SNP features,
optimize features to improve prediction accuracy of the classifier,
use optimized feature set on a large dataset for improved prediction accuracy.
Training and test data
The training/test candidate polymorphism data for implementing ML algorithms was extracted from a large-scale soybean STS amplification and sequencing project. For the primers designed the STS that produced a single discrete band PCR product on agarose gel electrophoresis were sequenced. A total of 3332 STS comprising 20 Mb were sequenced from both directions in 6 inbred individuals representing each of 6 diverse soybean genotypes previously identified by Zhu et al . Most of these data have a uniform sequence depth of 6 reads in each direction. These data were split into a training set consisting of 1973 STS (12 Mb sequence) with 27,275 candidate polymorphisms (identified by the PolyBayes program) and a test set of 1359 STS (8 Mb sequence) with 18,390 candidate polymorphisms. Subject matter experts classified the above candidate polymorphisms as 2969 true and 24,306 false in the training set and 1435 true and 16,955 false in the test set.
Application of PolyPhred
PolyPhred is a commonly used tool for polymorphism identification in re-sequencing data as it can detect heterozygotes. Application of PolyPhred on the test data resulted in only 1346 candidates (743 true positive, 563 false positive). Thus the sensitivity of this tool for this dataset is only 54.5% with a positive predictive value of 58.1%. The poor performance of polyphred in this case may be partly because of the un-suitability of this tool for in-bred species like soybean where heterozygosity is mostly due to sequencing noise or co-amplification of paralog sequences. In the latter case all genotypes appear to be "heterozygous" at a given position and PolyPhred identifies a SNP at that position.
Feature selection and optimization
Final set of optimized features chosen for machine learning
transition transversion indel
Frequency of major allele
Frequency of minor allele
Relative distance from closest end
Agreement in the forward and reverse reads
Maximum quality of the major allele
Maximum quality of the minor allele
Average quality of major allele
Average quality of minor allele
Haplotype of second variation
Local average quality
Overall average quality
Description of these features is given in the methods section. The optimization runs for feature selection are discussed in more detail at the website containing supplementary material .
Application of C4.5 on training data and evaluation on test dataset
A software program was developed to extract the features described above, execute C4.5 and analyze the results. The software features are described in the methods section 5.2. A five fold cross-validation was performed on the training data of 27,275 cases using both available options with C4.5 i.e., decision tree and production rules. To perform the cross-validation, the data were divided into five parts and the ML classifier was recursively trained on four parts and tested on the remaining part (analogous to a jack-knife procedure). The performance of the resulting decision trees/production rules was evaluated (definitions of the measures used are given in the methods section 5.3). The average prediction accuracies of validation runs were above 96.5%.
Comparison of ML and PolyBayes on test data set
Positive Predictive Value
Negative Predictive Value
Since only PolyBayes predictions were used in both training and test data sets, the true negative and false negative terms for PolyBayes are not known for this study. The prediction accuracy of ML algorithms using both decision trees and production rules was above 96%. Also implementation of ML algorithms resulted in 10 fold increase in productivity by increasing the PPV from 7.8% to 84.8%.
Comparison of positive predictive values (PPV) from PolyBayes and ML predictor
P ≤ 0.60
0.60 < P ≤ 0.70
0.70 < P ≤ 0.80
0.80 < P ≤ 0.90
0.90 < P ≤ 0.95
0.95 < P ≤ 0.97
0.97 < P ≤ 0.99
P = 1.00
ML programs other than C4.5
Several ML algorithms other than C4.5 such as neural networks, SVM and genetic algorithms are being widely used. In this study we explored the use of feed forward neural networks (Matlab toolbox) for the same dataset with different options (number of nodes, layers and training algorithms) and obtained similar overall accuracies (97%) and marginal increase in PPV up to 87%. Details of these runs are provided as supplementary materials on the website. C4.5 is free software that can be implemented with relative ease with an equivalent performance for the options tested with neural networks.
The soybean has a complex genome, with studies suggesting multiple rounds of duplication of some genomic regions resulting in high paralog frequency [13–15], This complex genome structure may account to some extent for observing a higher number of false positive SNP from the PolyBayes analysis. Soybean is an extensively inbred species and the variation between the two homologous chromosomes is negligible. Hence, PolyBayes, rather than PolyPhred, was used for SNP discovery for these data. This paper only attempts to automate the expert confirmation process. To evaluate the performance of expert scoring a different ML training and test dataset is required. Confirmed SNP are currently being mapped to the soybean genome.
PolyBayes and PolyPhred are primarily used for analyzing small sequence datasets. Large-scale, genome-wide SNP discovery projects routinely use customized versions of neighborhood quality standard (NQS) . NQS is a set of rules for SNP filtering based on the sequence quality of the varying base along with the quality of the neighborhood bases.
The application of ML is not dependent on the screening method used, but instead can be used with any of the aforementioned tools that are used for SNP discovery. The ML tool simply automates the rule development and can be applied to small and large datasets where good training data are available.
Machine learning has been applied to polymorphism discovery from amplified STS and was demonstrated to have a positive impact in polymorphism discovery. The optimized ML feature set can be tailored and applied to other instances of polymorphism discovery and ML in general can be applied to other genomics and bioinformatics decision making problems.
Major efforts are now being undertaken in polymorphism discovery in several species, including humans, to help characterize population differences. ML can enhance the prediction accuracies of these existing programs. In this study we have
Identified a feature set to enhance polymorphism prediction accuracies,
Used the ML program C4.5 to generate a decision tree (production rules) from a training set to obtain an overall prediction accuracy of 97% in the five fold cross-validation and from a new unseen test set,
Enhanced the PPV by 5- to 10-fold compared to using only PolyBayes for these data, and
Developed an open source software package in Perl to apply machine learning in polymorphism discovery with modules for computation of the values of the optimized feature set, execution in test mode to retrieve predictions, a graphical interface for easy SNP scoring and a provision to store feature values of new data for further improvements. The system and source code along with test and training data are provided in Additional file 1.
ML enhanced the prediction efficiency overall (97%) along with the PPV (85%) in soybean sequences with a complex genome that might have contributed to high false positives being predicted by PolyBayes. Hence the PPV with sequences from other genomes may vary.
Feature selection and optimization
While ML programs are useful for creating classifiers based on a given set of features, the selection of the relevant features is often a challenging task, usually requiring an iterative approach. We first selected a set of 10 features that were likely to influence the human expert when classifying a putative SNP. These features were then optimized by modifying the existing features and adding new features that helped in improving the prediction accuracy. The final set of optimized features is given in Table 1. The optimization runs for feature selection are discussed in more detail at the website containing supplementary material .
Sequence depth (feature #1) is the count of number of sequences in the alignment at the position of variation. All sequences in the alignment may not overlap at the position of variation; hence this number is different from the total number of the sequences in the alignment. Having more sequence reads at the polymorphic position improves the confidence in making a judgment. We defined the sequence depth sd as:
sd = a + t + g + c + i
Where a, t, g, c and i are the number of occurrences of A, T, G, C and insertions/deletions(indel), respectively, at the position of variation
Variation type (feature #2) can be a transition, a transversion or an indel. In humans, transitions are reported to be more common than transversions with a ratio of 2 to 1 , however in soybean  transitions occur at nearly equal rates as transversions (48 vs 52%). ML programs can learn from the training data and may give more weight based on the variation type observed for a given species. Also, some general rules may evolve when the polymorphism is an indel.
The PolyBayes program  assigns a Bayesian posterior probability value (feature #3) for each called SNP using the frequency priors given for observing a variation at that position. However, the frequencies can be estimated for only very few species and can vary by region (hotspots vs. islands). The PolyBayes engine with its default values still makes a very good judgment in identifying high quality SNP from large sequence alignment data. The default prior probability of PolyBayes is in close agreement with the observed average polymorphism rate in soybean genome.
However, for each STS the localized values tend to be highly variable, which cannot be accounted for by PolyBayes. Table 3 shows an improvement in prediction accuracy by increasing the PolyBayes threshold probability values. Hence, this feature was included to improve the performance of the ML algorithm.
The number of occurrences of different bases including indels at the position of variation is important in determining a polymorphic position. Frequencies of the first (major allele) and the second (minor allele) most commonly observed bases at the positions of variation were selected and are expressed as a ratio with respect to sequence depth. (features # 4, 5)
We first computed the sorted values:
(s1, s2, s3, s4, s5) = sort descending (a, t, g, c, i)
where a, t, g, c, and i are defined as above. We then computed the frequency ratios:
Frequency of the major allele = s1/sd
Frequency of the minor allele = s2/sd
Sequence quality at the ends of the alignment tends to be poor due to inherent limitations of current sequencing technology. Sequence alignment programs like Phrap do not trim the sequence for low quality and use the low quality sequence information to identify overlap between any two reads for creating longer alignments. Hence polymorphisms detected at either end of an alignment tend to be unreliable. To account for this factor, the polymorphism position was represented as the ratio of the distance in the consensus sequence from the closest end, or the relative distance (feature # 6)
rd = p/L (if p/L ≤ 0.5)
rd = 1 - p/L (if p/L > 0.5)
Where rd is the relative distance, p is the polymorphic position in the alignment and L is the length of the consensus sequence.
The PCR amplified DNA fragments or sequence tagged sites (STS) were routinely sequenced in both directions. This often results in a sequence overlap. If the polymorphic site lies on the overlapping sequence segment the polymorphic base is more informative. Hence the number of such sequences having the same base when sequenced from both directions was chosen as one of the features (feature # 7).
Sequence quality of the bases at the polymorphic site is a very important parameter when considering possible SNP. Frequently, variation is observed because of a poor quality base, and such polymorphism will be rejected in the scoring process. Since there will be several reads at the polymorphic site, aggregated features were defined:
These values were derived as follows. As described in 2.4 the sorted frequency values of bases (s1, s2, s3, s4 and s5) were calculated for the polymorphic position. Let b1 be the base for the major allele (s1) and b2 be the base for the minor allele (s2), then the aggregate parameters for all the sequencing reads in the polymorphic position were defined as maximum qualities maxQ(b1), maxQ(b2) and average qualities avgQ(b1), avgQ(b2) (features # 8,9,10 and 11).
Algorithm for haplotype variation factor determination
N is the total number of polymorphic positions
For each polymorphic position i = 1 to N
List of chromatograms having the major allele b1, minor allele b2 are b1(i) and b2(i) respectively.
Set Sum(HapVariationFactor) to zero.
For each of the polymorphic position j = 1 to N and i ≠ j
List of chromatograms having the major allele b1 and minor allele b2 are b1(j) and b2(j)
c(i,j) is the number of elements (chromatograms) common in b2(i) and b2(j) and t is the number of elements in b2(j) then
Sum(HapVariationFactor) += c(i,j)/t
End of For loop
HaplotypeFactor = Sum(HapVariationFactor)/N
End of For loop
Create a list of all chromatograms that had coverage in the polymorphic position.
Set the alignment penalty parameter to zero.
For each chromatogram from the list above
▪ In the neighborhood (+/- 5 bases) of the polymorphism site all the mismatches with the consensus sequence are given a penalty and the penalty is more if the mismatch is an indel.
▪ Ignore the mismatches at other polymorphic positions in the neighborhood.
The alignment penalty parameter is then scaled from 0 to 1, where 1 is the highest quality alignment with no mismatches or indels in the neighborhood of the polymorphic base.
The RepeatMasker program is used to identify low complexity DNA sequences and common repeats that are specific for a given species . Common repeats observed were (A)n, A-rich, AT_rich, (CAG)n, (GAA)n, (GA)n, GC_rich, (TA)n, (TATG)n and (TC)n. Some of the SNP are due to changes in the number of repeat elements also referred to as simple sequence repeats (SSR). This was provided as an optional feature to be able to distinguish SNP from SSR (feature # 16).
Implementation of ML in polymorphism discovery
A software package to support ML in polymorphism discovery was written in Perl that uses other open source Perl modules from Bioperl  and CPAN . The software code will be portable to most platforms where Perl can be executed. The software has modules for:
Extracting the ML features (Table 1) from the output files generated by the sequence assembly program (phredPhrap or CAP3) and polymorphism detection programs (PolyBayes or PolyPhred) and creating a data file in the format required for C4.5 execution.
Running the C4.5 programs to obtain predictions for test cases using either a decision tree or production rules. These programs run in a fraction of the time required to run phredPhrap.
Creating a graphical display similar to the Consed viewer on a web page along with the ML prediction to facilitate easy scoring of polymorphisms.
Distinguishing certain polymorphic heterozygous positions that are common in all genotypes sampled (ignored by PolyPhred). This can help in distinguishing paralogs.
This package [see Additional file 1] can also be integrated as a part of a polymorphism analysis pipeline. The software features were optimized for discovering SNP in homologous chromosomes. For enhancing the prediction accuracies of heterozygous SNP (identified by PolyPhred) additional features can be incorporated. Similarly, the feature set can be tailored for other instances of polymorphism discovery (WGS, EST) depending on the availability of sequence and quality information.
List of abbreviations
Single nucleotide polymorphism
Expressed sequence tag
Simple sequence repeat
Sequence tagged site
Polymerase chain reaction
Positive predictive value
Negative predictive value
Whole genome shotgun sequence
Comprehensive Perl Archive Network
Neighborhood quality standard
The authors like to thank Raymond Fernalld in validating SNP as a subject matter expert.
- Cai YD, Doig AJ: Prediction of Saccharomyces cerevisiae protein functional class from functional domain composition. Bioinformatics 2004, 20: 1292–1300. 10.1093/bioinformatics/bth085View ArticlePubMedGoogle Scholar
- Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 2004, 20: 2429–2437. 10.1093/bioinformatics/bth267View ArticlePubMedGoogle Scholar
- Cai YD, Liu XJ, Li YX, Xu XB, Chou KC: Prediction of beta-turns with learning machines. Peptides 2003, 24: 665–669. 10.1016/S0196-9781(03)00133-5View ArticlePubMedGoogle Scholar
- Dobrokhotov PB, Goutte C, Veuthey AL, Gaussier E: A probabilistic information retrieval approach to medical annotation in SWISS-PROT. Stud Health Technol Inform 2003, 95: 421–426.PubMedGoogle Scholar
- Zhang LV, Wong SL, King OD, Roth FP: Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 2004, 5: 38. 10.1186/1471-2105-5-38PubMed CentralView ArticlePubMedGoogle Scholar
- Han LY, Cai CZ, Lo SL, Chung MC, Chen YZ: Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA 2004, 10: 355–368. 10.1261/rna.5890304PubMed CentralView ArticlePubMedGoogle Scholar
- Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics 2004.Google Scholar
- Quinlan JR: C4.5: programs for machine learning. San Francisco, CA, USA, Morgan Kaufmann Publishers Inc; 1993.Google Scholar
- Pavlidis P, Wapinski I, Noble WS: Support vector machine classification on the web. Bioinformatics 2004, 20: 586–587. 10.1093/bioinformatics/btg461View ArticlePubMedGoogle Scholar
- Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier L, Kwok PY, Gish WR: A general approach to single-nucleotide polymorphism discovery. Nat Genet 1999, 23: 452–456. 10.1038/70570View ArticlePubMedGoogle Scholar
- Nickerson DA, Tobe VO, Taylor SL: PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res 1997, 25: 2745–2751. 10.1093/nar/25.14.2745PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu YL, Song QJ, Hyten DL, Van Tassell CP, Matukumalli LK, Grimm DR, Hyatt SM, Fickus EW, Young ND, Cregan PB: Single-nucleotide polymorphisms in soybean. Genetics 2003, 163: 1123–1134.PubMed CentralPubMedGoogle Scholar
- Hadley HH, Hymowitz T: Speciation and Cytogenetics. Madison, WI, Agron. Monogr; 1973:97–116.Google Scholar
- Lackey JA: Chromosome numbers in the Phaseoleae (Fabaceae:Faboideae) and their relation to taxonomy. Am J Bot 1980, 67: 595–602.View ArticleGoogle Scholar
- Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC: Mining EST databases to resolve evolutionary events in major crop species. Genome 2004, 47: 868–876. 10.1139/g04-047View ArticlePubMedGoogle Scholar
- Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 2000, 407: 513–516. 10.1038/35035083View ArticlePubMedGoogle Scholar
- Zhao Z, Boerwinkle E: Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome. Genome Res 2002, 12: 1679–1686. 10.1101/gr.287302PubMed CentralView ArticlePubMedGoogle Scholar
- Barker G, Batley J, O' Sullivan H, Edwards KJ, Edwards D: Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics 2003, 19: 421–422. 10.1093/bioinformatics/btf881View ArticlePubMedGoogle Scholar
- Batley J, Barker G, O'Sullivan H, Edwards KJ, Edwards D: Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol 2003, 132: 84–91. 10.1104/pp.102.019422PubMed CentralView ArticlePubMedGoogle Scholar
- Smit AFA, Hubley R, Green P: Repeat Masker Open - 3.0.1996. [http://www.repeatmasker.org]Google Scholar
- Supplementary Information[http://bfgl.anri.barc.usda.gov/ML/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.