Revealing sequence variation patterns in rice with machine learning methods

Bohnert, Regina; Zeller, Georg; Clark, Richard M; Childs, Kevin L; Ulat, Victor; Stokowski, Renee; Ballinger, Dennis; Frazer, Kelly; Cox, David; Bruskiewich, Richard; Buell, C Robin; Leach, Jan; Leung, Hei; McNally, Kenneth L; Weigel, Detlef; Rätsch, Gunnar

doi:10.1186/1471-2105-9-S10-O8

Volume 9 Supplement 10

Highlights from the Fourth International Society for Computational Biology (ISCB) Student Council Symposium

Oral presentation
Open access
Published: 30 October 2008


Regina Bohnert¹,
Georg Zeller¹,²,
Richard M Clark²,³,
Kevin L Childs⁴,
Victor Ulat⁵,
Renee Stokowski⁶,
Dennis Ballinger⁶,
Kelly Frazer⁶,
David Cox⁶,
Richard Bruskiewich⁵,
C Robin Buell⁴,
Jan Leach⁷,
Hei Leung⁵,
Kenneth L McNally⁵,
Detlef Weigel² &
Gunnar Rätsch¹

BMC Bioinformatics volume 9, Article number: O8 (2008)


Motivation

A major breakthrough at the turn of the millennium was the completion of genome sequences for individuals from many species, including human, worm and rice. More recently, it has become equally important to describe sequence variation within a species, which is the first step towards linking genetic variation to traits.

Today, rice is the most important source of human caloric intake, making up 20% of the calorie supply and feeding millions of people daily. A more detailed understanding of the molecular make-up of phenotypically distinct rice varieties will therefore be essential for future improvements in rice cultivation and breeding. To reveal patterns of sequence variation in Oryza sativa (rice), the non-repetitive portion of the genomes of 20 diverse rice cultivars was resequenced, in collaboration with Perlegen Sciences, Inc., using high-density oligonucleotide microarray technology.

Methods

Based on experience gained in polymorphism studies of Arabidopsis thaliana [1], we developed a method for identifying single nucleotide polymorphisms (SNPs) from the array data using Support Vector Machines (SVMs). In a two-layered approach, we trained SVMs to discriminate between SNP and non-SNP positions, first using information from each cultivar individually and, in a second step, information across all cultivars.
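
The two-layered idea can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors, the cultivar names A, B and C, the choice of second-layer inputs (a cultivar's own first-layer score plus the mean score of the other cultivars at the same position) and the simple subgradient trainer are all assumptions made for illustration. The toy data are also mirrored so that a position is a SNP in every cultivar or in none, and the model is applied to its own training data, which a real evaluation would avoid.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM fitted by full-batch subgradient descent on the
    L2-regularised hinge loss (a stand-in for a real SVM solver)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:  # margin violation
                for j, xij in enumerate(xi):
                    grad_w[j] -= yi * xij / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision(model, x):
    w, b = model
    return dot(w, x) + b


def layer2_features(scores, cultivars, cultivar, pos):
    """Assumed second-layer input: own score and mean score of the others."""
    others = [scores[(c, pos)] for c in cultivars if c != cultivar]
    return [scores[(cultivar, pos)], sum(others) / len(others)]


def two_layer_calls(features, labels, cultivars):
    keys = sorted(features)
    y = [labels[k] for k in keys]
    # Layer 1: one classifier on per-cultivar hybridisation features.
    svm1 = train_linear_svm([features[k] for k in keys], y)
    scores = {k: decision(svm1, features[k]) for k in keys}
    # Layer 2: re-classify using evidence pooled across cultivars.
    X2 = [layer2_features(scores, cultivars, c, p) for c, p in keys]
    svm2 = train_linear_svm(X2, y)
    return {k: (1 if decision(svm2, x2) > 0 else -1) for k, x2 in zip(keys, X2)}


# Hypothetical toy features for SNP-like sites (positions 0 and 1).
snp_like = {
    ("A", 0): [1.0, 0.8], ("A", 1): [0.9, 1.1],
    ("B", 0): [1.2, 0.7], ("B", 1): [0.8, 0.9],
    ("C", 0): [0.9, 1.0], ("C", 1): [0.6, 0.5],
}
features, labels = {}, {}
for (cultivar, pos), x in snp_like.items():
    features[(cultivar, pos)] = x
    labels[(cultivar, pos)] = 1
    # Mirrored non-SNP sites at positions 2 and 3.
    features[(cultivar, pos + 2)] = [-v for v in x]
    labels[(cultivar, pos + 2)] = -1

calls = two_layer_calls(features, labels, ["A", "B", "C"])
print(sum(1 for c in calls.values() if c == 1), "SNP calls")
```

The point of the second layer is that a weak signal in one cultivar can be supported, or contradicted, by what is seen at the same position in the other cultivars.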

Wherever several SNPs or deletion/insertion polymorphisms occur in close vicinity, the hybridisation is suppressed and SNP calling in these regions becomes infeasible. We therefore adapted a machine learning method for sequence segmentation [2, 3] to predict highly polymorphic regions in O. sativa (cf. Figure 1). These regions can then be analysed in more detail using alternative experimental techniques.
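
As a rough illustration of sequence segmentation, the sketch below labels each array position as conserved or polymorphic by two-state Viterbi decoding. It is a simplified stand-in rather than the discriminatively trained method of [2, 3]: the per-position scores (positive where hybridisation looks suppressed), the symmetric scoring of the two states and the fixed `switch_penalty` are assumptions chosen for illustration.

```python
def segment(scores, switch_penalty=1.0):
    """Return one label per position: 0 = conserved, 1 = polymorphic.

    State 1 earns +score and state 0 earns -score at each position;
    changing state between neighbouring positions costs switch_penalty.
    """
    if not scores:
        return []
    best = [-scores[0], scores[0]]  # best path score ending in state 0 / 1
    back = []                       # back-pointers for positions 1..n-1
    for s in scores[1:]:
        emit = (-s, s)
        new_best, ptr = [], []
        for state in (0, 1):
            stay = best[state]
            switch = best[1 - state] - switch_penalty
            if stay >= switch:
                new_best.append(stay + emit[state])
                ptr.append(state)
            else:
                new_best.append(switch + emit[state])
                ptr.append(1 - state)
        best = new_best
        back.append(ptr)
    state = 1 if best[1] > best[0] else 0
    labels = [state]
    for ptr in reversed(back):
        state = ptr[state]
        labels.append(state)
    labels.reverse()
    return labels


def regions(labels):
    """Collapse labels into half-open [start, end) polymorphic intervals."""
    out, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i
        elif lab == 0 and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(labels)))
    return out


# Hypothetical per-position scores; positive means suppressed hybridisation.
scores = [-1.0, -0.8, 0.9, 1.2, -0.1, 1.0, 0.8, -1.1, -0.9, 0.3, -1.0]
labels = segment(scores)
print(regions(labels))  # [(2, 7)]
```

The switch penalty smooths the labelling: the small dip at position 4 stays inside the region, and the isolated weak signal at position 9 is not reported as a region of its own.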

For training and evaluation we compiled a set of reference polymorphisms obtained by dideoxy sequencing of more than 3,500 fragments from the 20 cultivars.

Results

Across all cultivars, the machine learning (ML) method discovered 1,349,341 SNPs at 316,373 non-redundant positions. Compared with a model-based (MB) SNP calling approach implemented by Perlegen Sciences, Inc. [4], the ML method was considerably more sensitive: it recovered 20.9% of all known SNPs at a precision of 91.7%, compared with 14.4% and 90.9%, respectively, for the MB approach (cf. Figure 2A). The intersection of MB and ML predictions contained 761,606 SNP predictions at 159,879 non-repetitive positions, constituting a set of markedly higher quality with a precision of 97.1%.
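
The quantities reported above can be made concrete with a short sketch. It assumes that each SNP call is a (cultivar, position) pair and that precision is measured only at sites covered by the reference set; the function name, the toy call sets and the reading of "SNPs per non-redundant position" as a mean number of calls per position are illustrative assumptions, not details taken from the study.

```python
def precision_recall(calls, known_snps, known_sites):
    """Precision and recall of SNP calls against a reference set.

    calls, known_snps: sets of (cultivar, position) pairs.
    known_sites: every (cultivar, position) pair checked by the reference.
    Calls outside known_sites cannot be judged and are ignored.
    """
    checked = calls & known_sites
    true_pos = checked & known_snps
    precision = len(true_pos) / len(checked) if checked else float("nan")
    recall = len(true_pos) / len(known_snps) if known_snps else float("nan")
    return precision, recall


# Hypothetical toy reference: two cultivars, ten checked positions each.
known_sites = {(c, p) for c in "AB" for p in range(10)}
known_snps = {("A", 1), ("A", 3), ("A", 5), ("B", 1), ("B", 7)}

# Hypothetical call sets from two callers.
ml_calls = {("A", 1), ("A", 3), ("B", 1), ("B", 2), ("A", 42)}
mb_calls = {("A", 1), ("B", 1), ("A", 8)}
both = ml_calls & mb_calls  # calls supported by both methods

ml_pr = precision_recall(ml_calls, known_snps, known_sites)
mb_pr = precision_recall(mb_calls, known_snps, known_sites)
both_pr = precision_recall(both, known_snps, known_sites)
print(ml_pr, mb_pr, both_pr)

# Arithmetic on the figures reported in the text.
calls_per_position_ml = 1_349_341 / 316_373
calls_per_position_both = 761_606 / 159_879
relative_sensitivity = 20.9 / 14.4
print(round(calls_per_position_ml, 2),
      round(calls_per_position_both, 2),
      round(relative_sensitivity, 2))
```

In the toy data, as in the reported results, intersecting the two call sets trades some recall for higher precision.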

In addition to SNP predictions, our polymorphic region predictor discovered a substantial amount of additional polymorphism, resulting in between ~65,000 and ~203,000 polymorphic regions per cultivar (cf. Figure 2B).

Conclusion

We identified hundreds of thousands of polymorphisms on a genome-wide scale, providing the first whole-genome set of polymorphisms for the world's most important crop plant. These polymorphism data represent a valuable resource for further functional studies and modern breeding of rice.

Based on the SNP data, high-density genotyping arrays will be designed to investigate genomic variation in many more rice cultivars. The polymorphic region (PR) predictions will, for example, help to constrain primer design to conserved regions and thus increase PCR success rates.
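
One way to use PR predictions in primer design is to discard every candidate primer window that overlaps a predicted region. The sketch below shows this under stated assumptions: regions are half-open (start, end) intervals that do not overlap one another, coordinates are 0-based, and the function name and example values are hypothetical.

```python
from bisect import bisect_right


def conserved_primer_sites(seq_len, primer_len, regions):
    """Start positions of primer windows that avoid all polymorphic regions.

    regions: non-overlapping half-open (start, end) intervals.
    """
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    sites = []
    for pos in range(seq_len - primer_len + 1):
        end = pos + primer_len
        # Last region that starts before the primer window ends.
        i = bisect_right(starts, end - 1) - 1
        if i >= 0 and regions[i][1] > pos:
            continue  # that region reaches into the window
        sites.append(pos)
    return sites


# Hypothetical example: 30 bp fragment, 5 bp primers, two predicted regions.
sites = conserved_primer_sites(30, 5, [(8, 12), (20, 22)])
print(sites)
```

Because the regions do not overlap, checking only the last region that starts before the window ends is sufficient, which keeps each lookup logarithmic in the number of regions.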

References

1. Clark RM, Schweikert G, Toomajian C, Ossowski S, Zeller G, Shinn P, Warthmann N, Hu TT, Fu G, Hinds DA, Chen H, Frazer KA, Huson DH, Schölkopf B, Nordborg M, Rätsch G, Ecker JR, Weigel D: Common Sequence Polymorphisms Shaping Genetic Diversity in Arabidopsis thaliana. Science 2007, 317: 338–42. 10.1126/science.1138632
2. Zeller G, Clark RM, Schneeberger K, Bohlen A, Weigel D, Rätsch G: Detecting Polymorphic Regions in the Arabidopsis thaliana Genome with Resequencing Microarrays. Genome Research 2008, 18: 918–29. 10.1101/gr.070169.107
3. Tsochantaridis I, Joachims T, Hofmann T, Altun Y: Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research 2005, 6: 1453–1484.
4. Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR: Whole-genome Patterns of Common DNA Variation in Three Human Populations. Science 2005, 307: 1072–9. 10.1126/science.1105436


Author information

Authors and Affiliations

Friedrich Miescher Laboratory, Max Planck Society, 72076, Tübingen, Germany
Regina Bohnert, Georg Zeller & Gunnar Rätsch
Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076, Tübingen, Germany
Georg Zeller, Richard M Clark & Detlef Weigel
Department of Biology, University of Utah, Salt Lake City, UT, 84112, USA
Richard M Clark
Department of Plant Biology, Michigan State University, East Lansing, MI, 48824, USA
Kevin L Childs & C Robin Buell
International Rice Research Institute, Metro Manila, The Philippines
Victor Ulat, Richard Bruskiewich, Hei Leung & Kenneth L McNally
Perlegen Sciences, Inc., Mountain View, CA, 94043, USA
Renee Stokowski, Dennis Ballinger, Kelly Frazer & David Cox
Bioagricultural Sciences and Pest Management, Colorado State University, Colorado, CO, 80523, USA
Jan Leach

Corresponding author

Correspondence to Regina Bohnert.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Bohnert, R., Zeller, G., Clark, R.M. et al. Revealing sequence variation patterns in rice with machine learning methods. BMC Bioinformatics 9 (Suppl 10), O8 (2008). https://doi.org/10.1186/1471-2105-9-S10-O8


Published: 30 October 2008
DOI: https://doi.org/10.1186/1471-2105-9-S10-O8
