Massively parallel sequencing (MPS) has produced, and will continue to produce, tremendous biological insights [1]. However, the ability to answer certain genomic questions depends on read length, and in some cases the most commonly available read lengths are shorter than required [2]. For example, read length may limit the robustness of de novo genome assembly [3]. Single-molecule sequencing of high molecular weight (HMW) DNA by PacBio has produced significantly longer reads than many other technologies [4], but even its impressive maximum read length, 40 kb, may still be too short to answer some questions regarding genomic structural variants [5]. Until sequencing technologies are able to characterize longer molecules, alternative methods for HMW DNA assembly are required. Restriction fragment length analysis has long been a preferred method for analyzing longer DNA molecules [6–8].
Recent technical developments commercialized by BioNano Genomics (BNG) have increased throughput for this type of long-molecule characterization. The BNG method uses modified restriction enzymes to introduce single-strand breaks at restriction sites, which are then labeled by using a polymerase to incorporate fluorescent nucleotide analogs. The labeled sample is loaded into an array of nanofabricated channels that linearize the DNA. Waves of DNA can be loaded into the channels and imaged with a high-powered microscope and high-resolution camera. Individual molecules are then assembled, based on shared patterns of restriction sites, into representations of the entire genome.
As with any single-molecule technology, there is significant noise in the raw data. Sources of noise in this type of mapping include limitations in camera resolution, enzyme efficiency (particularly in the presence of contaminants), and non-uniform behavior of fluorescent molecules and the DNA duplex [9]. Additionally, depending on genome size and complexity, restriction fragment length patterns may be similar at different genomic loci by chance. Successful assembly algorithms must compensate for this noise to reconstruct accurate models of chromosomes. These algorithms incorporate noise compensation measures such as fuzzy matching for lengths between restriction sites, modeling enzyme error probabilities, and requiring whole-molecule alignments that are long and similar enough to be unlikely to result from chance alone [9]. These compensating measures rely in large part on descriptions of the data error profile, which are provided by the user as input parameters. Therefore, optimum assembly requires that the user select appropriate input parameters.
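To make the idea of fuzzy matching concrete, the following minimal sketch compares the lengths between restriction sites on two molecules under a simple tolerance model. The function names, the default tolerances, and the error model itself (a fixed resolution floor plus an error proportional to fragment length) are assumptions made for illustration; they are not the model used by the BNG software or by the algorithms cited in [9].

```python
def lengths_match(observed, expected, rel_tol=0.1, abs_tol=500.0):
    """Return True when two inter-site lengths (in bp) agree within tolerance.

    Hypothetical error model: a fixed floor (abs_tol) standing in for
    limited camera resolution, plus a term proportional to the expected
    length (rel_tol) standing in for uneven stretching of the DNA.
    """
    return abs(observed - expected) <= abs_tol + rel_tol * expected


def count_matching_intervals(mol_a, mol_b, rel_tol=0.1, abs_tol=500.0):
    """Count position-by-position matches between two lists of lengths.

    This assumes the two molecules are already in register and have no
    missing or extra sites; a real aligner must also handle those cases.
    """
    return sum(
        lengths_match(a, b, rel_tol, abs_tol) for a, b in zip(mol_a, mol_b)
    )
```

In this sketch, `rel_tol` and `abs_tol` play the role of the user-supplied error profile: setting them too tight rejects true overlaps, while setting them too loose accepts chance similarities.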
Methods exist for empirically estimating error profiles, but many of them rely on substantial genomic resources. For example, BNG provides software that maps a random subset of molecules to a reference genome sequence assembly and selects error parameters that maximize both the number of molecules that align and the goodness of fit for those alignments. However, this method depends on a highly contiguous sequence assembly for the organism of interest, which might not be available. One potential alternative for selecting accurate parameters is trial and error. Using a variety of input parameters yields a variety of assemblies, from which an optimal solution might be chosen.
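The trial and error strategy can be sketched as an exhaustive sweep over combinations of candidate parameter values. In the sketch below, `assemble` and `score` are hypothetical callables supplied by the caller (one runs an assembly for a parameter set, the other rates the result), and the parameter names are placeholders; none of them correspond to actual BNG software options.

```python
from itertools import product


def parameter_grid(options):
    """Yield one dict per combination of the candidate parameter values."""
    names = sorted(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def best_assembly(options, assemble, score):
    """Run `assemble` for every combination and keep the top-scoring one.

    Returns a (score, params, result) tuple, or None for an empty grid.
    """
    best = None
    for params in parameter_grid(options):
        result = assemble(params)
        value = score(result)
        if best is None or value > best[0]:
            best = (value, params, result)
    return best
```

The number of assemblies grows as the product of the number of candidate values per parameter, which is why the cost of each individual run matters so much for this strategy.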
Trial and error is a computationally expensive strategy. To be feasible, it should minimize redundant calculations and re-use intermediate results wherever possible. BioNano Genomics produces its own software for de novo assembly, which, internally, can make use of intermediate results. However, the user interface for that software makes result re-use impractical. Therefore, to test the effectiveness of the trial and error strategy, a new interface had to be developed.
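One way to picture result re-use is a runner that caches the output of an expensive early stage and recomputes it only when the parameters that stage depends on change. The two-stage split, the class name, and the callables below are assumptions made purely for illustration; they do not describe the internal structure of the BNG assembler or of the interface developed in this work.

```python
class StagedRunner:
    """Run a two-stage pipeline, caching stage-one results by parameter."""

    def __init__(self, stage_one, stage_two, stage_one_keys):
        self.stage_one = stage_one
        self.stage_two = stage_two
        # Only these parameter names influence the first stage.
        self.stage_one_keys = tuple(sorted(stage_one_keys))
        self._cache = {}
        self.stage_one_calls = 0

    def run(self, params):
        # Build the cache key from the stage-one parameter values only.
        key = tuple(params[k] for k in self.stage_one_keys)
        if key not in self._cache:
            self.stage_one_calls += 1
            subset = {k: params[k] for k in self.stage_one_keys}
            self._cache[key] = self.stage_one(subset)
        # The second stage sees the cached intermediate and all parameters.
        return self.stage_two(self._cache[key], params)
```

With this arrangement, varying only the later-stage parameters costs one first-stage computation in total rather than one per trial.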
We approached the problems of short read limitations, noise in physical map data, and the computational intensity of the trial and error strategy using a specific dataset. Gossypium raimondii is a cotton species that is the closest living relative of one of the subgenome progenitors of the agriculturally significant allopolyploid Gossypium hirsutum [10]. Gossypium raimondii has a high-quality reference genome sequence assembly that was created using MPS, as well as genetic and traditional physical maps [11]. An extended abstract describing a portion of this work has been published previously [12].