- Oral presentation
- Open Access
Revealing sequence variation patterns in rice with machine learning methods
© Bohnert et al; licensee BioMed Central Ltd 2008
- Published: 30 October 2008
- Rice Cultivar
- Machine Learning Method
- Polymorphic Region
- Important Crop Plant
- Dideoxy Sequencing
A major breakthrough at the turn of the millennium was the completion of genome sequences for individuals from many species, including human, worm and rice. More recently, it has also become important to describe sequence variation within a single species, a first step towards linking genetic variation to traits.
Today, rice is the most important source of human caloric intake, making up 20% of the calorie supply and feeding millions of people daily. A more detailed understanding of the molecular makeup of phenotypically distinct rice varieties will therefore be essential for future improvements in rice cultivation and breeding. To reveal patterns of sequence variation in Oryza sativa (rice), the non-repetitive portion of the genomes of 20 diverse rice cultivars was resequenced, in collaboration with Perlegen Sciences, Inc., using high-density oligonucleotide microarray technology.
Based on experience gained in polymorphism studies of Arabidopsis thaliana, we developed a method for identifying single nucleotide polymorphisms (SNPs) from the array data using Support Vector Machines (SVMs). In a two-layered approach, we trained SVMs to discriminate between SNP and non-SNP positions, first using information from each cultivar individually and, in a second step, using information across all cultivars.
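The two-layered idea can be illustrated with a small toy sketch. It is not the authors' implementation: the abstract does not specify features, kernels or training details, so the two per-cultivar features, the synthetic data, the Pegasos-style linear SVM trainer and all function names below are assumptions made purely for illustration. For brevity both layers are trained and scored on the same toy data; a real setup would use held-out data.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style stochastic subgradient descent for a linear SVM
    (hinge loss with L2 penalty). Stand-in for a real SVM library."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)              # last weight acts as the bias
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)             # decaying step size
            xi = X[i] + [1.0]
            margin = y[i] * sum(a * b for a, b in zip(w, xi))
            w = [(1.0 - eta * lam) * a for a in w]   # L2 shrinkage
            if margin < 1.0:                  # hinge loss is active
                w = [a + eta * y[i] * b for a, b in zip(w, xi)]
    return w

def svm_score(w, x):
    """Signed distance-like score; positive means 'SNP'."""
    return sum(a * b for a, b in zip(w, x + [1.0]))

def accuracy(scores, labels):
    hits = sum((s > 0) == (l > 0) for s, l in zip(scores, labels))
    return hits / len(labels)

# Synthetic toy data (assumption): 40 positions x 4 cultivars, two
# hypothetical array-derived features per (position, cultivar).
N_POSITIONS, N_CULTIVARS = 40, 4
rng = random.Random(1)
feats, labels = {}, {}
for p in range(N_POSITIONS):
    polymorphic = rng.random() < 0.5
    for c in range(N_CULTIVARS):
        is_snp = polymorphic and rng.random() < 0.5
        centre = 1.0 if is_snp else -1.0
        feats[p, c] = [rng.gauss(centre, 0.3), rng.gauss(centre, 0.3)]
        labels[p, c] = 1 if is_snp else -1
keys = sorted(feats)
y = [labels[k] for k in keys]

# Layer 1: classify each (position, cultivar) from its own features only.
w1 = train_linear_svm([feats[k] for k in keys], y)
s1 = {k: svm_score(w1, feats[k]) for k in keys}

# Layer 2: combine a cultivar's own layer-1 score with the mean layer-1
# score of the other cultivars at the same position.
def layer2_features(p, c):
    others = [s1[p, o] for o in range(N_CULTIVARS) if o != c]
    return [s1[p, c], sum(others) / len(others)]

X2 = [layer2_features(p, c) for p, c in keys]
w2 = train_linear_svm(X2, y)
s2 = [svm_score(w2, x) for x in X2]

layer1_acc = accuracy([s1[k] for k in keys], y)
layer2_acc = accuracy(s2, y)
print(f"layer 1 training accuracy: {layer1_acc:.2f}")
print(f"layer 2 training accuracy: {layer2_acc:.2f}")
```

The second layer sees only scores, not raw features, which is one simple way to let evidence from other cultivars at the same position inform each call.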
For training and evaluation we compiled a set of reference polymorphisms obtained by dideoxy sequencing of more than 3,500 fragments from the 20 cultivars.
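Evaluating predictions against such a reference set could look roughly like the sketch below. The data layout (tuples of cultivar, chromosome and position), the function name and the toy values are hypothetical; the key point is that predictions are only scored at positions actually covered by the dideoxy-sequenced fragments.

```python
def precision_recall(predicted, reference_snps, covered):
    """Score SNP calls against reference polymorphisms.

    predicted      -- set of (cultivar, chrom, pos) called as SNPs
    reference_snps -- set of (cultivar, chrom, pos) confirmed by sequencing
    covered        -- set of (cultivar, chrom, pos) inside sequenced fragments
    """
    scored = predicted & covered          # ignore calls we cannot verify
    truth = reference_snps & covered
    true_pos = len(scored & truth)
    precision = true_pos / len(scored) if scored else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall

# Toy example with made-up coordinates.
covered = {("cv1", "chr1", pos) for pos in range(100, 110)}
reference = {("cv1", "chr1", 101), ("cv1", "chr1", 104), ("cv1", "chr1", 108)}
predicted = {("cv1", "chr1", 101), ("cv1", "chr1", 104),
             ("cv1", "chr1", 106),      # false positive
             ("cv1", "chr1", 500)}      # outside sequenced fragments, not scored

precision, recall = precision_recall(predicted, reference, covered)
print(f"precision={precision:.2f} recall={recall:.2f}")
```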
In addition to the SNP predictions, our polymorphic region predictor discovered a substantial additional proportion of polymorphic regions, yielding between ~65,000 and ~203,000 polymorphic regions per cultivar (cf. Figure 2B).
We identified hundreds of thousands of polymorphisms on a genome-wide scale, providing the first whole-genome set of polymorphisms for the world's most important crop plant. These polymorphism data represent a valuable resource for further functional studies and modern breeding of rice.
Based on the SNP data, high-density genotyping arrays will be designed to investigate genomic variation in many more rice cultivars. The polymorphic region predictions will, for example, help to constrain primer design to conserved regions and thus increase PCR success rates.
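As a rough illustration of the primer-design use case, the sketch below keeps only candidate primers that do not overlap any predicted polymorphic region. The half-open interval convention, the function names and the coordinates are assumptions for this example, not part of the published pipeline.

```python
def overlaps(a, b):
    """True if half-open intervals a=[start, end) and b=[start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def conserved_primers(candidates, polymorphic_regions):
    """Return candidate primer intervals that avoid every polymorphic region."""
    return [primer for primer in candidates
            if not any(overlaps(primer, region) for region in polymorphic_regions)]

# Made-up coordinates on a single chromosome.
polymorphic_regions = [(100, 120), (300, 310)]
candidates = [(80, 100), (110, 130), (200, 220), (305, 325)]

kept = conserved_primers(candidates, polymorphic_regions)
print(kept)
```

A linear scan is enough for a handful of regions; with hundreds of thousands of regions per cultivar, a sorted list with binary search or an interval tree would be the more natural choice.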
- Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA, Chen H, Frazer KA, Huson DH, Schölkopf B, Nordborg M, Rätsch G, Ecker JR, Weigel D: Common Sequence Polymorphisms Shaping Genetic Diversity in Arabidopsis thaliana. Science 2007, 317: 338–42. doi:10.1126/science.1138632
- Zeller G, Clark RM, Schneeberger K, Bohlen A, Weigel D, Rätsch G: Detecting Polymorphic Regions in the Arabidopsis thaliana Genome with Resequencing Microarrays. Genome Research 2008, 18: 918–29. doi:10.1101/gr.070169.107
- Tsochantaridis I, Joachims T, Hofmann T, Altun Y: Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research 2005, 6: 1453–1484.
- Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR: Whole-genome Patterns of Common DNA Variation in Three Human Populations. Science 2005, 307: 1072–9. doi:10.1126/science.1105436
This article is published under license to BioMed Central Ltd.