A compartmentalized approach to the assembly of physical maps
© Bozdag et al; licensee BioMed Central Ltd. 2009
Received: 30 December 2008
Accepted: 15 July 2009
Published: 15 July 2009
Physical maps have historically been one of the cornerstones of genome sequencing and map-based cloning strategies. They also support marker-assisted breeding and EST mapping. Building a high-quality physical map is computationally challenging because of unavoidable noise in the input fingerprint data.
We propose a novel compartmentalized method for the assembly of high quality physical maps from fingerprinted clones. The knowledge of genetic markers enables us to group clones into clusters so that clones in the same cluster are more likely to overlap. For each cluster of clones, a local physical map is first constructed using FingerPrinted Contigs (FPC). Then, all the individual maps are carefully merged into the final physical map. Experimental results on the genomes of rice and barley demonstrate that the compartmentalized assembly produces significantly more accurate maps, and that it can detect and isolate clones that would induce "chimeric" contigs if used in the final assembly.
The software is available for download at http://www.cs.ucr.edu/~sbozdag/assembler/
A physical map is a linear ordering of a set of overlapping clones in a genomic library. Physical maps are obtained by processing the signatures, or fingerprints, of all the clones in a library. Fingerprints can be generated by digesting clones with one or more restriction enzymes, or by hybridizing them to a carefully designed set of DNA probes. The computational problem is to build an overlap map of the clones that is consistent with the fingerprint data.
Physical maps have historically been one of the cornerstones of genome sequencing projects. For instance, in clone-by-clone sequencing, a physical map is constructed first; then a minimum-cardinality set of overlapping clones that spans the genomic region represented by the physical map, called the minimal tiling path (MTP), is selected. Finally, the clones in the MTP are sequenced one by one. The clone-by-clone sequencing method has been used to sequence several genomes, including C. elegans, A. thaliana, H. sapiens, and O. sativa. In several recent whole-genome shotgun sequencing projects, physical maps have also been employed to validate and improve the quality of the sequence assembly. This validation step has been used, for example, in the assembly of M. musculus, G. gallus, and O. anatinus.
The rapid market penetration of next-generation sequencing instruments (Roche/454, Illumina, and ABI SOLiD) is expected to bring physical mapping back to the center stage of genomics. Next-gen sequencing technologies produce massive amounts of short reads (about 200–300 bases for 454, 35 bases for Illumina and SOLiD), and therefore the de novo assembly of whole eukaryotic genomes is extremely challenging. Arguably, the only realistic approach at this time is clone-by-clone sequencing, where each clone in the MTP is sequenced using next-gen technology, and the assembly is carried out separately clone by clone (see [12–14] and references therein).
In addition to their prominence in sequencing projects, physical maps also provide a robust infrastructure required by many applications in genomics, such as marker-assisted breeding, map-based cloning of a set of genes of interest, and EST mapping.
Physical maps can be built from data obtained by restriction digestion or hybridization experiments. In the former case, overlaps between clones are determined by a statistical method, then clones are arranged in an order that is consistent with the restriction fingerprint data. In the latter case, clone-probe associations (i.e., which clones hybridize to which probe) are used to find an arrangement of the probes such that clones can be ordered consistently. In practice, however, hybridization experiments rarely use single probes. Due to the time and cost involved, hybridizations between probes and clones are typically carried out for a pool of probes. In this work, we assume that only clone-pool associations (hereafter called hybridization fingerprints) are available.
Nowadays almost all physical mapping projects that are based on restriction fingerprint data rely on a tool called FingerPrinted Contigs (FPC). FPC implements an algorithm called consensus band (CB) that constructs a physical map using a combination of greedy and heuristic approaches. At the core of the CB algorithm, clones are assigned to contigs based on a coincidence score, called the Sulston score, which measures the probability that two clones share a given number of restriction fragments (bands) according to a binomial probability distribution. Two bands are considered shared if their sizes are within a given tolerance value. Two clones are declared overlapping if their Sulston score is below a given cutoff threshold. For each contig, FPC builds a CB map, which is a coordinate system to which clones are aligned.
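As a rough illustration of the coincidence score described above, the following sketch computes a binomial tail probability in the spirit of the Sulston score. The parameterization (a per-band chance-match probability of `2 * tolerance / gel_length`, and the names `n_low`, `n_high`, `gel_length`) is an assumption made for illustration and is not taken from the FPC source code.

```python
from math import comb


def sulston_like_score(n_low, n_high, shared, tolerance, gel_length):
    """Probability that at least `shared` of the `n_low` bands of one clone
    match some band of the other clone (with `n_high` bands) by chance.

    Assumption: one band matches one other band by chance with
    probability 2 * tolerance / gel_length.
    """
    b = 2.0 * tolerance / gel_length      # chance match for a single band pair
    p_no_match = (1.0 - b) ** n_high      # a band matches none of the n_high bands
    p_match = 1.0 - p_no_match
    # Binomial tail: at least `shared` matches out of n_low trials.
    return sum(
        comb(n_low, m) * p_match ** m * p_no_match ** (n_low - m)
        for m in range(shared, n_low + 1)
    )


def overlaps(score, cutoff):
    # Two clones are declared overlapping when the score is below the cutoff.
    return score < cutoff
```

A smaller score means the observed number of shared bands is less likely to be a coincidence, which is why the overlap test compares the score against a small cutoff.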
FPC does not attempt to resolve all the conflicts arising in the assembly of the physical map, but instead provides interactive features for manual editing. Although manual editing appears to be an unavoidable final step in any physical mapping project, this process is tedious, very time-consuming, and requires significant expertise. Obviously, the better the initial quality of the physical map produced by the algorithm, the smaller the amount of manual work involved.
With the objective of producing more accurate physical maps, here we propose a novel algorithmic pipeline that is capable of integrating both restriction and hybridization fingerprints. We note that the availability of both types of data is common in large-scale genomic projects. In this paper, the hybridization fingerprint data is obtained by hybridizing pools of short oligonucleotide probes to a Bacterial Artificial Chromosome (BAC) clone library.
Given the trust of the scientific community in FPC, we decided not to "reinvent the wheel" but to include some of its modules in our algorithmic pipeline as subroutines. In order to integrate FPC in our workflow, some minor changes to its source code had to be carried out. We use FPC to compute the local assemblies on each cluster of clones; we also use FPC together with a novel algorithm described later to merge contigs iteratively. FPC's merge process is based on shared bands between contigs, whereas our algorithm is based on shared clones between contigs. In general, the strategy behind the design of our assembler is "be conservative first". For example, at the beginning of the assembly process, we merge contigs only when there is strong evidence that they overlap, while later we allow riskier moves.
We tested our assembler extensively on several data sets. In the experimental section, we report results on the assembly of the physical map of two important crop plants, namely rice and barley. Real fingerprinting data is available for both plants, but real hybridization data is available only for barley, while rice hybridization data was simulated in silico. We constructed physical maps using standard and compartmentalized assembly for both plants, and evaluated different pre-clustering strategies for our assembler. We compared the accuracy of the maps produced by the two methods using a variety of metrics. We also compared the rice maps to the manually edited physical map of rice. Our evaluations show that the compartmentalized assembly produces significantly more accurate maps than maps produced by the standard assembly. In addition, our method is capable of detecting and isolating clones that would induce chimeric contigs if used by the standard assembly.
Results and discussion
Pre-clustering of clones
During the pre-clustering phase, clones that are more likely to be overlapping are assigned to the same cluster. We pre-cluster clones according to hybridization and/or restriction fingerprint data. Below are four pre-clustering approaches we implemented and analyzed.
Pre-clustering based on hybridization fingerprint data (HYB)
In HYB, overlapping clones are detected via an overlap score based on hybridization fingerprint data. Consider two clones c_i and c_j that each hybridize to at least one of the probes in probe pool p. In this case, we say that p is a positive pool for c_i and c_j, and that there is a positive concordance between the two clones. If both clones hybridize to none of the probes in probe pool q, then we say that q is a negative pool for c_i and c_j, and that there is a negative concordance between the two clones. If one clone hybridizes to a probe pool but the other does not, then we say there is discordance between the two clones.
Our overlap score is based on two observations. First, observe that positive and negative concordance weigh in favor of clone overlaps, whereas discordance weighs against them. Second, observe that the strength of the positive concordance between two clones should be inversely proportional to the number of probes in the pool. In fact, the probability that the two clones hybridize to the same probe (and therefore overlap) increases as the size of the probe pool decreases. Conversely, the strength of the negative concordance between two clones should be directly proportional to the number of probes in the pool. This is because as the negative pool size increases, the size of the region that the clones might occupy gets smaller and smaller (i.e., the probability that they reside in the same genomic region increases).
In the resulting overlap score, called the PNC score, P_I and P_U are the intersection and union of the positive pools for the pair of clones c_i, c_j; N_I and N_U are the intersection and union of the negative pools for the pair of clones c_i, c_j; and |p| denotes the size of pool p. The PNC score ranges between zero and two, inclusive.
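The displayed formula for the PNC score is not reproduced in this text. The sketch below shows one form that is consistent with the two observations and with the stated range of zero to two: a positive term that weights each pool by 1/|p| and a negative term that weights each pool by |p|, each normalized by the same weighted sum over the union. This exact form is an assumption, and the names `pnc_score`, `pos_i`, `pos_j`, and `pool_sizes` are hypothetical.

```python
def pnc_score(pos_i, pos_j, pool_sizes):
    """Assumed PNC-style overlap score for two clones.

    pos_i, pos_j: sets of pool ids that are positive for each clone.
    pool_sizes:   dict mapping pool id -> number of probes in the pool.
    """
    all_pools = set(pool_sizes)
    neg_i, neg_j = all_pools - pos_i, all_pools - pos_j

    def ratio(inter, union, weight):
        denom = sum(weight(p) for p in union)
        return sum(weight(p) for p in inter) / denom if denom else 0.0

    # Positive concordance: small pools count more (weight 1/|p|).
    positive = ratio(pos_i & pos_j, pos_i | pos_j,
                     lambda p: 1.0 / pool_sizes[p])
    # Negative concordance: large pools count more (weight |p|).
    negative = ratio(neg_i & neg_j, neg_i | neg_j,
                     lambda p: float(pool_sizes[p]))
    return positive + negative
```

Each term lies between zero and one, so the sum stays within the range given in the text.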
Any clustering algorithm can be used in combination with the PNC score. For simplicity, we used single-linkage hierarchical clustering. Consequently, disjoint (hard) clusters were generated (i.e., each clone belongs to exactly one cluster).
We performed extensive comparative evaluations of the proposed PNC score against other popular overlap scores, such as the Sulston score, the algebraically corrected Sulston score, the Mott score, the weighted shared bands score, and the positive concordance score. The latter two scores are defined as follows. The weighted shared bands score of two clones is defined as the total weight of their shared bands. The weight of a band b is the ratio of the total number of bands of all clones to the frequency of bands with size in [b - t/2, b + t/2], where t is the tolerance value. The weight of a band is therefore inversely proportional to its frequency. The positive concordance score of two clones is defined as the ratio between the intersection and union of their positive pools (i.e., the first term of the PNC score).
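Cutting a single-linkage hierarchy at a fixed similarity threshold yields the connected components of the graph that links every pair whose score reaches the threshold. The sketch below uses that equivalence with a small union-find structure; the function name, the `score` callback, and the `threshold` parameter are hypothetical and stand in for whatever cut level a project would choose.

```python
def single_linkage_clusters(clones, score, threshold):
    """Disjoint clusters obtained by linking clone pairs with score >= threshold."""
    clones = list(clones)
    parent = {c: c for c in clones}

    def find(c):
        # Find the cluster representative, compressing the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for idx, a in enumerate(clones):
        for b in clones[idx + 1:]:
            if score(a, b) >= threshold:
                parent[find(a)] = find(b)   # merge the two clusters

    clusters = {}
    for c in clones:
        clusters.setdefault(find(c), set()).add(c)
    return list(clusters.values())
```

The pairwise loop is quadratic in the number of clones, which is acceptable for a sketch but would need indexing or batching for a full library.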
Soft pre-clustering based on hybridization fingerprint data (sHYB)
When probes are sparsely distributed in the genome, most clones have many more negative pools than positive pools. This generates negative concordances between many clone pairs whether they overlap or not. In addition, since the probability that a probe occurs in the overlap between two clones is very small, we expect only very few clone pairs to show positive concordance. Consequently, the PNC score will be close to one for most of the clone pairs whether they overlap or not.
In this case, a soft clustering might be more appropriate. In soft clustering, clones are assigned to multiple clusters when there is not sufficient evidence for a unique assignment. Initially, an empty cluster is generated for each pool. Then, clones are assigned to the clusters that correspond to their positive pools, i.e., clones that belong to the same pool are assigned to the same cluster. Finally, the clustering is finalized based on the restriction fingerprint data. When sHYB is used, redundant clones and/or redundant contigs can be present in the merged physical map. Additional steps are performed downstream to eliminate this redundancy (see Phase B in Physical Map Construction section).
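The first two steps of the soft assignment can be sketched as follows. The function and argument names are hypothetical, and the final refinement based on restriction fingerprint data is not modeled here.

```python
def soft_precluster(pools, positive_pools):
    """Assign each clone to one cluster per positive pool.

    pools:          iterable of pool ids.
    positive_pools: dict mapping clone name -> set of positive pool ids.
    """
    clusters = {pool: set() for pool in pools}   # one empty cluster per pool
    for clone, positives in positive_pools.items():
        for pool in positives:
            clusters[pool].add(clone)            # a clone may land in several clusters
    return clusters
```

Because a clone with several positive pools appears in several clusters, the same clone can later be assembled more than once, which is the source of the redundancy mentioned above.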
Pre-clustering based on restriction fingerprint data (RESTR)
When no hybridization fingerprint data are available, one could consider pre-clustering clones based on the restriction fingerprint data. Note that the similarity between clones' fingerprints will be evaluated again during the actual contig assembly, so it is not obvious that pre-clustering using only restriction fingerprint data would bring any improvement in accuracy. For this choice of pre-clustering, we used single-linkage hierarchical clustering to generate disjoint clusters by assigning clones with similar restriction fingerprints to the same cluster.
Random pre-clustering (RAND)
In random clustering, the size of a cluster is first drawn at random from the size distribution of the clusters generated by the HYB method; then clones that have not yet been assigned to a cluster are randomly assigned to this cluster until the cluster size is reached. This step is repeated until all clones are assigned to some cluster.
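A minimal sketch of this baseline follows, assuming the HYB cluster sizes are available as a plain list (`hyb_cluster_sizes`, a hypothetical name) from which sizes are sampled uniformly.

```python
import random


def random_precluster(clones, hyb_cluster_sizes, seed=0):
    """Randomly partition clones into clusters whose sizes follow the
    empirical size distribution of the HYB clusters."""
    rng = random.Random(seed)
    unassigned = list(clones)
    rng.shuffle(unassigned)                          # random assignment order
    clusters = []
    while unassigned:
        size = max(1, rng.choice(hyb_cluster_sizes)) # draw a cluster size
        clusters.append(set(unassigned[:size]))      # last cluster may be smaller
        unassigned = unassigned[size:]
    return clusters
```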
Physical map construction
The first step in the compartmentalized method is to run FPC independently on each clone set (i.e., cluster) obtained in the pre-clustering phase. Observe that clone sets are not necessarily disjoint. Since FPC does not offer all of its functionalities in batch mode, we instrumented it so as to enable batch-mode processing of functions such as END-MERGER, DQER, and REBUILD-CONTIGS. We also added a user remark to identify questionable clones (Q-clones). A clone is called a Q-clone if more than 50% of its bands do not align to the CB map (e.g., chimeric clones, which are formed by merging two or more non-overlapping clones during cloning). No other modification to the internal code of FPC was performed. FPC's key parameters such as cutoff, tolerance, and fromEnd can be set by the user as usual. The proposed compartmentalized method consists of six phases as described below.
A. Initial contig assembly
FPC's BUILD-CONTIGS procedure is run on each clone set. This step generates a "local" physical map for each clone set, composed of contigs and singletons. Clones that are completely contained in other clones are buried by FPC. After the contig assembly, all the local physical maps are concatenated into a single project. When a soft pre-clustering is used, a complication is that FPC cannot handle multiple instances of a clone with the same name. We resolve this problem by adding a distinct suffix so that we can distinguish multiple copies of the same clone. The renaming process is transparent to users and in the final physical map, all clones will have their original (unique) names.
B. Redundancy removal
If soft clustering is employed in the pre-clustering step, the process of concatenating local physical maps can result in redundant clones and contigs.
We call a contig redundant if all of its clones (excluding Q-clones) are completely contained in another contig. By computing the number of common clones between all contig pairs, redundant contigs are detected. In a group of several identical contigs, only one of them is kept alive. All Q-clones that belong to a redundant contig are moved to the singleton set.
We call a clone redundant if either (1) it is a singleton and it also occurs in a contig or (2) it occurs multiple times in the set of singleton clones or (3) it occurs multiple times in the same contig. In a group of several identical clones, only one of them is kept alive.
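The contig-level rule can be sketched as a set-containment test. This is an illustration only: contigs are represented as sets of original clone names (without the suffixes added for soft clustering), and the tie-breaking rule for identical contigs (keep the smallest id) is an assumption.

```python
def find_redundant_contigs(contigs, q_clones):
    """Return ids of contigs whose non-Q clones are all contained in another contig.

    contigs:  dict mapping contig id -> set of clone names.
    q_clones: set of clone names flagged as Q-clones.
    """
    core = {cid: clones - q_clones for cid, clones in contigs.items()}
    redundant = set()
    for a in sorted(contigs):
        for b in sorted(contigs):
            if a == b or b in redundant:
                continue
            if core[a] <= contigs[b]:
                # Mutually contained contigs: keep only the smallest id.
                if core[b] <= contigs[a] and a < b:
                    continue
                redundant.add(a)
                break
    return redundant
```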
C. FPC processing
In this phase, the main FPC procedures are run iteratively on the merged map. Steps (C2)-(C5) are repeated until convergence. For more details on FPC functionalities, see [18, 26].
(C1) Resolve Q-clones
We run the FPC procedure DQER that reduces the number of Q-clones in an attempt to split the incorrectly merged contigs. DQER runs the CB algorithm on contigs that contain more than q% of Q-clones, where q is an input parameter.
(C2) Merge contigs
We execute the FPC procedure END-MERGER, which merges two contigs A and B if M distinct pairs of end clones, one of which is in A and the other in B, match each other with a Sulston score lower than the cutoff value. A clone in a contig is an end clone if it is within fromEnd CB units from one of the ends of the contig, where fromEnd is an input parameter. To avoid making wrong merges early in the process, we run END-MERGER with increasingly lower values of M (6 for the first iteration, 4 for the second, and 3 for the following iterations).
(C3) Eliminate redundant contigs and clones
See Phase B.
(C4) Rebuild contigs
We execute the FPC procedure REBUILD-CONTIGS at this point because END-MERGER does not update the CB map (in FPC v8.0 or above). REBUILD-CONTIGS executes the CB algorithm on the current version of the contigs in order to improve the clone ordering.
(C5) Resolve Q-clones
See Step (C1).
In this fourth phase, we merge contigs with a novel algorithm described below and then we remove possible redundancies present in the physical map. Steps D2-D4 are repeated a few times until convergence. Phase D is needed only when soft clustering is used in the pre-clustering step.
(D1) Eliminate redundant Q-clones
A redundant Q-clone is a Q-clone that occurs as a non-Q-clone in another contig. The removal of redundant Q-clones is performed only in this phase, since DQER resolves most of the Q-clones in the FPC processing phase (see Phase C).
(D2) Merge contigs
MERGE-SIMILAR-CONTIGS is run until no further merging is possible. As in step (C2), the threshold T_p is increased at each iteration until it reaches a user-supplied maximum (0 for the first iteration, 1e-30 for the second, and 1e-15 for the following iterations).
(D3) Eliminate redundant contigs and clones
See Phase B.
(D4) Move redundant clones to the singleton set
After merging contigs, there may still be some clones that occur in multiple contigs. Since the location of these clones in the physical map is ambiguous, they are moved to the set of singletons.
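This step amounts to counting how many contigs each clone appears in. A minimal sketch, with hypothetical names and contigs represented as sets of clone names:

```python
from collections import Counter


def move_ambiguous_clones(contigs, singletons):
    """Remove clones that occur in more than one contig and add them to the singletons.

    contigs:    dict mapping contig id -> set of clone names.
    singletons: iterable of singleton clone names.
    """
    counts = Counter(c for clones in contigs.values() for c in clones)
    ambiguous = {c for c, n in counts.items() if n > 1}
    cleaned = {cid: clones - ambiguous for cid, clones in contigs.items()}
    return cleaned, set(singletons) | ambiguous
```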
E. Singleton processing
Up to this phase, singleton clones that belong to distinct clone sets have not been processed. In order to check whether singleton clones are overlapping, an additional round of contig assembly is performed by running FPC's BUILD-CONTIGS procedure on the singleton set.
In this sixth and final phase, some final adjustments are carried out on the physical map. Specifically, we reorder the clones (see Step C4) and try to resolve any Q-clone introduced in the last phase (see Step C1).
The compartmentalized assembler is implemented in C/C++ and Perl. Contig assembly is performed by FPC, which is implemented in C. Our software tool compiles and runs under Linux and Mac OS.
We used the genomic data of two plants, namely barley and rice, to compare our compartmentalized approach to the standard FPC assembly.
For barley, we used OLIGO SPAWN to design 12,467 36-mer oligonucleotide probes from a dataset of 53,799 barley unigenes obtained from the HarvEST website. A unigene is obtained as a product of assembling several ESTs. Probes were grouped in 70 pools of usually 192 probes each, with a maximum of 310 overgos in a single pool.
Restriction fingerprint data of gene-bearing BAC clones of barley were obtained by using High Information Content Fingerprinting (HICF) as part of our NSF funded project (manuscript in preparation). Exactly 61,647 of these clones were successfully fingerprinted (M.C. Luo, personal communication). The average insert size of these clones is 106 kb, and the average number of bands is 92.
Since the barley genome has not been sequenced yet, we had to resort to an organism with a known genome for our comparative evaluations. We used agarose gel-based restriction fingerprint data and the manually edited physical map of rice obtained from Arizona Genomic Institute [31, 32] for this purpose. The restriction fingerprint data were real, but the hybridization fingerprint data were obtained by carrying out the hybridization of rice BAC clones to 36-mer rice oligonucleotide probes in silico.
To model the hybridization, we considered a probe to hybridize to a BAC clone if it matched the clone perfectly over 30 or more consecutive bases (out of 36). We also introduced noise in the hybridization experiment (i.e., false positive and false negative hybridizations). To model the noise, FN% of the clone-probe hits were discarded to generate false negatives, and FP% of the clone-probe pairs that did not hybridize were considered positive to generate false positive hybridization errors.
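The hybridization rule and the noise model can be sketched as below. This is a simplified illustration rather than the authors' implementation: sequences are plain strings, only the given strand is searched (reverse complements are ignored), and the noise is drawn independently per pair with rates given as fractions. All function and parameter names are hypothetical.

```python
import random


def hybridizes(probe, clone_seq, min_match=30):
    """True if the probe and the clone share an exact run of >= min_match bases."""
    return any(
        probe[s:s + min_match] in clone_seq
        for s in range(len(probe) - min_match + 1)
    )


def add_noise(hits, all_pairs, fn_rate, fp_rate, seed=0):
    """Drop true hits with probability fn_rate and add non-hits with
    probability fp_rate. Pairs are (clone, probe) tuples."""
    rng = random.Random(seed)
    kept = {h for h in sorted(hits) if rng.random() >= fn_rate}
    non_hits = sorted(set(all_pairs) - set(hits))
    false_pos = {p for p in non_hits if rng.random() < fp_rate}
    return kept | false_pos
```

Checking every window of length `min_match` in the probe is sufficient, because any longer exact match contains such a window.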
In order to carry out the hybridization of rice BAC clones to oligonucleotide probes in silico, we obtained the sequences of rice clones indirectly by uniquely locating their BAC end sequences (BESs) obtained from Arizona Genomic Institute [6, 34] on the rice genome.
There were 59,430 rice BAC clones for which BAC-end sequences (BESs) were available, but only 65% of them had both BESs sequenced. In order to uniquely locate BAC clones in the rice genome, we BLASTed the BESs against the rice genome (fourth release [6, 35–37]) and filtered out low-scoring BLAST hits. We set e-value and word size parameters to 1e-100 and 11, respectively. We also enabled filtering and allowed gaps. If a BAC clone had at least one pair of good BLAST hits, it was selected for further analysis. For each selected BAC clone, we checked all possible pairs of left and right BES hits. The coordinates were assigned only when there was only one pair for which (1) the hits were on the same chromosome, (2) the distance between them was consistent with the typical length of a BAC clone, and (3) the orientations of the alignment for the two ends were opposite to each other. If more than one pair met the criteria (1–3), we declared that the location of that clone in the genome was ambiguous, thus could not be determined. Following this procedure, we obtained 26,469 rice BAC clones for which the sequence could be uniquely determined.
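The three placement criteria can be expressed as a filter over all pairs of left and right BES hits. The sketch below assumes each filtered BLAST hit has been reduced to a `(chromosome, position, strand)` tuple and that "consistent with the typical length of a BAC clone" is expressed by hypothetical bounds `min_len` and `max_len`; neither representation is taken from the original pipeline.

```python
def place_clone(left_hits, right_hits, min_len, max_len):
    """Return (chromosome, start, end) if exactly one pair of BES hits is
    consistent with criteria (1)-(3); otherwise return None.

    Each hit is a (chromosome, position, strand) tuple, strand being '+' or '-'.
    """
    candidates = []
    for l_chrom, l_pos, l_strand in left_hits:
        for r_chrom, r_pos, r_strand in right_hits:
            same_chromosome = l_chrom == r_chrom               # criterion (1)
            span_ok = min_len <= abs(r_pos - l_pos) <= max_len # criterion (2)
            opposite = l_strand != r_strand                    # criterion (3)
            if same_chromosome and span_ok and opposite:
                candidates.append((l_chrom, min(l_pos, r_pos), max(l_pos, r_pos)))
    # More than one consistent pair means the location is ambiguous.
    return candidates[0] if len(candidates) == 1 else None
```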
We verified the correctness of this procedure by matching the sequences obtained by our method against the small subset of 3,413 BAC clones sequenced by the International Rice Genome Sequencing Project (IRGSP). When we aligned the sequences obtained by our method against the actual sequenced BAC clones using MUMmer, only 0.8% of the sequences turned out to be misaligned.
The final dataset contained clones for which a unique location in the rice genome was determined and the restriction and hybridization fingerprint data were available. It contained 22,486 clones (about 10× genome equivalence), where the average insert size is 145 kb and the average number of bands is 29.
[Table residue; cell values not preserved. Surviving labels: column "Max. cluster size"; caption "Physical map statistics"; rows Rice FPC Standard, Rice Comp. sHYB, Rice Comp. HYB, Rice Comp. RESTR, Rice Comp. RAND, Barley FPC Standard, Barley Comp. sHYB, Barley Comp. HYB.]
According to the statistics in Table 2, the compartmentalized assembler produces physical maps that contain fewer Q-contigs, but also more contigs and singletons, than the maps produced by the standard method. As expected, our compartmentalized assembler is more "stringent" than the standard method because it prevents clones that are in different clusters from being assembled in the same contig. This stringency reduces the number of mis-assembled (i.e., falsely merged) contigs, as will be discussed in Evaluation I in the Comparative Evaluations of Rice Physical Maps section.
The statistics of the compartmentalized maps of rice obtained by applying RESTR or RAND in the pre-clustering phase and those of the standard map are almost identical. The maps themselves were also almost identical, which suggests that the compartmentalized assembly with RESTR and RAND performs like the standard method, i.e., pre-clustering with RESTR and RAND brings no benefits. This conclusion was somewhat expected because no additional information is exploited in RESTR and RAND.
We also observe that about 99.6% of the singletons in the standard map of rice are also singletons in the maps using sHYB and HYB pre-clustering. For barley, about 92% and 93% of the singletons in the standard map are also singletons in the maps using sHYB and HYB pre-clustering, respectively. When we analyzed the extra singletons in the rice map obtained with sHYB pre-clustering, we determined that about 81% of these extra singleton clones were misplaced in the standard physical map of rice (see Evaluation I in Comparative Evaluations of Rice Physical Maps section for definition of a misplaced clone). In addition, 99% of the misplaced clones in the standard map of rice are singletons in the HYB rice map. The analysis demonstrates that our method is capable of detecting and isolating problematic clones.
Comparative evaluations of rice physical maps
Since the genomic coordinates of the clones in the rice physical maps are known, more precise comparative evaluations can be carried out for this organism. Specifically, we report on four evaluation metrics to compare the rice physical maps produced by the compartmentalized and standard method, as well as the manually edited map. We were unable to evaluate the quality of the contig assembly in the manually edited physical map of rice (Evaluation I-III), since most of the clones in this map cannot be uniquely located in the rice genome.
Evaluation I (Contig assembly)
In this evaluation, we assess the quality of the contig assembly. We considered a contig to be of high quality when most (here 70%) of its clones were truly close to each other in the genome. In order to objectively measure the quality of each contig in the map, we first grouped the clones according to their locations in the genome. Each pair of clones in a given contig were assigned to the same group if they were on the same chromosome and the distance between them was smaller than a predefined threshold. Our tests showed that the choice of specific value for the threshold (in the range 1 kb-100 kb) does not have an impact on the grouping. This suggests that clones are assigned to different groups usually because they reside on different chromosomes (see Table 1 in Additional file 1). In the following evaluation, we show results based on grouping with 1 kb threshold.
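The grouping step and the 70% criterion can be sketched as follows. Several details are assumptions made for illustration: clones are given as `(name, chromosome, start, end)` tuples, the distance between two clones is taken to be the gap between their intervals, groups are formed transitively, and a contig counts as high quality when its largest group holds at least 70% of its clones.

```python
def group_clones(clones, threshold=1000):
    """Group the clones of one contig by genomic location.

    clones: list of (name, chromosome, start, end) tuples.
    Clones on the same chromosome whose gap is below `threshold`
    (in bases) end up in the same group.
    """
    groups, current = [], []
    current_chrom, current_end = None, None
    for name, chrom, start, end in sorted(clones, key=lambda c: (c[1], c[2])):
        if current and chrom == current_chrom and start - current_end < threshold:
            current.append(name)
            current_end = max(current_end, end)
        else:
            if current:
                groups.append(current)
            current, current_chrom, current_end = [name], chrom, end
    if current:
        groups.append(current)
    return groups


def is_high_quality(clones, threshold=1000, min_fraction=0.7):
    # Fraction of the contig's clones that fall in its largest location group.
    largest = max(len(g) for g in group_clones(clones, threshold))
    return largest / len(clones) >= min_fraction
```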
The assembly score, the global ordering score, the number of misplaced clones, and the number of mis-assembled contigs for several physical maps of rice. [Cell values not preserved; surviving column label: Assembly score (%).]
A further analysis showed that the set of misplaced clones in the maps using sHYB/HYB pre-clustering is completely contained in the set of misplaced clones in the standard map (see Table 2 in Additional file 1). Moreover, the map using HYB isolates 98.5% of the additional misplaced clones in the standard map to the singleton set. This analysis shows that our method can detect and isolate clones that are otherwise misplaced by the standard method. These misplaced clones are usually responsible for connecting contigs that are not truly overlapping and therefore creating chimeric contigs if left in the assembly.
Evaluation II (Clone ordering)
It is well known that, due to the noise in the restriction fingerprint data, determining the correct ordering of the clones is a challenging problem [26, 40]. Nonetheless, since we have the coordinates of rice clones, we can compute a clone ordering score for each contig. We define the ordering score of a contig as the absolute value of the Pearson product-moment correlation coefficient between the ranking of its clones in the genome and their order in the contig.
The rankings of clones in the genome are obtained from their coordinates if they belong to the same chromosome. If two clones belong to two different chromosomes, then the clone with the lower chromosome number is ranked lower than the other. For this evaluation, we computed a global ordering score for each physical map as the weighted mean of the ordering scores of all contigs in the physical map, using the contig size as the weighting factor.
The results in Table 3 show that the compartmentalized method produces contigs whose clone ordering is more accurate than that of the standard FPC method. Among the compartmentalized maps, the rice map using HYB has the maximum global ordering score. From our experience, we conjecture that the global ordering score is inversely correlated with the number of mis-assembled contigs.
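The ordering score and its weighted mean can be sketched as follows. The sketch assumes each contig is given as a list of `(chromosome_number, coordinate)` tuples in contig order, that contigs have at least two clones at distinct positions, and that contig size means the number of clones; the last point is an assumption, since the text does not specify the unit.

```python
from math import sqrt


def genome_ranks(clones):
    """Rank of each clone in the genome; clones are (chromosome, coordinate)
    tuples listed in contig order. Lower chromosome numbers rank first."""
    order = sorted(range(len(clones)), key=lambda k: clones[k])
    ranks = [0] * len(clones)
    for rank, k in enumerate(order):
        ranks[k] = rank
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ordering_score(clones):
    # Absolute correlation, so a fully reversed contig also scores 1.
    return abs(pearson(genome_ranks(clones), list(range(len(clones)))))


def global_ordering_score(contigs):
    # Weighted mean of per-contig scores, weighted by number of clones.
    total = sum(len(c) for c in contigs)
    return sum(len(c) * ordering_score(c) for c in contigs) / total
```

Taking the absolute value reflects that a contig has no preferred orientation, so a map read right to left is as correct as one read left to right.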
Evaluation III (Minimal tiling path)
As mentioned above, the minimal tiling path (MTP) of a physical map is a critical component of clone-by-clone sequencing projects [41, 42]. The quality of an MTP critically depends on the overall quality of its physical map. In this evaluation, we first computed an MTP for all the rice physical maps by using the most recent version of FPC (v8.9 at the time of writing) with default parameters. Then, we compared the number of MTP clones, the coverage of the MTP clones on the genome, and the percentage of consecutive MTP clones that truly overlap on the genome.
Evaluation IV (Overlap detection)
In our final evaluation, we focused on the set of clones that overlap in the genome. For each pair of clones that are actually overlapping, we checked whether they were in the same contig (counted as true positive, or TP) or not (counted as false negative, or FN). If one or both clones were in the singleton set, the pair was added to the singleton count. Only clone pairs that overlap by at least 100 kb were considered in the evaluation because FPC can join two clones only if they overlap by at least 70% of their length (70% of the BAC clone size is about 100 kb).
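The counting scheme of this evaluation can be sketched as follows, assuming clone locations are `(chromosome, start, end)` tuples and that `contig_of` maps each clone to a contig id, with `None` (or a missing entry) for singletons. These representations and names are hypothetical.

```python
def overlap_length(a, b):
    """Length of the genomic overlap between two (chromosome, start, end) clones."""
    (chrom_a, start_a, end_a), (chrom_b, start_b, end_b) = a, b
    if chrom_a != chrom_b:
        return 0
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def evaluate_overlaps(locations, contig_of, min_overlap=100_000):
    """Count TP, FN, and singleton pairs among truly overlapping clone pairs."""
    counts = {"TP": 0, "FN": 0, "singleton": 0}
    names = sorted(locations)
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            if overlap_length(locations[a], locations[b]) < min_overlap:
                continue                      # pair is not considered at all
            ca, cb = contig_of.get(a), contig_of.get(b)
            if ca is None or cb is None:
                counts["singleton"] += 1      # at least one clone is a singleton
            elif ca == cb:
                counts["TP"] += 1
            else:
                counts["FN"] += 1
    return counts
```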
Evaluation results for standard, compartmentalized, and manually edited physical maps of rice (based on overlapping clones) and barley (based on genetic markers). [Cell values not preserved; surviving row labels: Rice FPC Standard, Rice Comp. sHYB, Rice Comp. HYB, Rice Comp. RESTR, Rice Comp. RAND, Barley Comp. sHYB, Barley Comp. HYB.]
Comparative evaluation of barley physical maps
Since the barley genome has not been sequenced yet, none of the evaluations discussed above can be carried out. We were able, however, to obtain from the Institute of Plant Genetics and Crop Plant Research (IPK) a list of 340 sets of BAC clones that are known to hybridize to a single oligonucleotide probe. The list was extended to 731 sets of BACs by incorporating BAC-gene deconvolution data obtained from the barley genetic map. The assumption is that the clones in each set should overlap each other because they are all positive for a single probe. In practice, this is not necessarily true for all the clones in the sets due to noise in the hybridization experiment, or because BAC clones overlap a repeat region or a gene family. Although this evaluation is not 100% reliable, it was the best validation of the barley map available.
The evaluation was carried out as follows. For each clone set identified by a probe, we first computed the contig ID that contains the majority of the clones in the set. Then for all clones in the set, we computed the number of clones that were either in that contig (counted as TP), or in another contig (counted as FN), or in the singleton set.
The results shown in Table 5 illustrate that the barley HYB map has the fewest errors among all barley maps. In other words, the compartmentalized method with HYB is able to isolate some clones to the singleton set that would have otherwise been misplaced by the standard method.
We also computed the number of gaps in the physical maps of barley. If a set of clones that should belong to one contig is distributed over more than one contig, we record this event as a gap. More precisely, for each list of clones that are assumed to be overlapping, we computed the number of contigs that they belong to and computed the number of gaps for each list. According to the results, barley HYB has 240 gaps, barley sHYB has 293 gaps, and the standard map has 280 gaps. The results show that the number of gaps correlates with the FN rate of the maps.
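The per-set counts and the gap count can be sketched together. The sketch assumes `contig_of` maps each clone to a contig id, with `None` or a missing entry for singletons, and it takes the number of gaps in a set to be the number of distinct contigs minus one; that definition is an assumption, since the text does not give the formula.

```python
from collections import Counter


def evaluate_probe_set(clone_set, contig_of):
    """TP/FN/singleton counts and gaps for one set of clones that are
    expected to overlap because they hybridize to the same probe."""
    placed = [contig_of[c] for c in clone_set if contig_of.get(c) is not None]
    singletons = len(clone_set) - len(placed)
    if not placed:
        return {"TP": 0, "FN": 0, "singleton": singletons, "gaps": 0}
    by_contig = Counter(placed)
    majority = by_contig.most_common(1)[0][1]    # size of the majority contig
    return {
        "TP": majority,                          # clones in the majority contig
        "FN": len(placed) - majority,            # clones in any other contig
        "singleton": singletons,
        "gaps": len(by_contig) - 1,              # assumed: contigs spanned minus one
    }
```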
We proposed a novel compartmentalized approach to the construction of physical maps from fingerprinted clones. The compartmentalized method exploits both the restriction and hybridization fingerprint data, which allows it to construct more accurate physical maps. Consequently, we argue that the compartmentalized method reduces the amount of manual editing that is an inevitable step in any physical mapping project. Additionally, we showed that the MTP produced from the compartmentalized physical map is more reliable, and that should help clone-by-clone sequencing projects and de novo sequence assembly with short reads. The software is available in the public domain at http://www.cs.ucr.edu/~sbozdag/assembler/.
This project was supported in part by NSF CAREER IIS-0447773, NSF DBI-0321756 and USDA CSREES Barley-CAP (visit  for more information on this project).
The authors would like to thank the members of Prof. Carol Soderlund's group, in particular Dr. William Nelson, for helpful discussions regarding FPC. They are also grateful to the people in Prof. Rod Wing's lab, in particular Dr. Andrea Zuccolo and José Luis Goicoechea, for their help with the rice BESs and physical map. Finally, the authors would like to thank Dr. Ming-Cheng Luo for providing the fingerprinting data of barley.
- Nelson DO, Speed TR: Statistical issues in constructing high resolution physical maps. Statist Sci 1994, 9: 334–354.
- Green E: Strategies for the Systematic Sequencing of Complex Genomes. Nature Reviews Genetics 2001, 2: 573–583.
- Coulson A, Sulston J, Brenner S, Karn J: Toward a Physical Map of the Genome of the Nematode Caenorhabditis elegans. PNAS 1986, 83(20):7821–7825.
- Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408(6814):796–815.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860–921.
- International Rice Genome Sequencing Project: The map-based sequence of the rice genome. Nature 2005, 436(7052):793–800.
- Warren RL, Varabei D, Platt D, Huang X, Messina D, Yang SP, Kronstad JW, Krzywinski M, Warren WC, Wallis JW, Hillier LW, Chinwalla AT, Schein JE, Siddiqui AS, Marra MA, Wilson RK, Jones SJM: Physical map-assisted whole-genome shotgun sequence assemblies. Genome Res 2006, 16(6):768–775.
- Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al.: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420(6915):520–562.
- International Chicken Genome Sequencing Consortium: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 2004, 432(7018):695–716.
- Warren WC, Hillier LW, Marshall Graves JA, Birney E, Ponting CP, Grutzner F, Belov K, Miller W, Clarke L, Chinwalla AT, et al.: Genome analysis of the platypus reveals unique signatures of evolution. Nature 2008, 453(7192):175–183.
- Morozova O, Marra MA: Applications of next-generation sequencing technologies in functional genomics. Genomics 2008, 92(5):255–264.
- Pop M, Salzberg S: Bioinformatics challenges of new sequencing technology. Trends Genet 2008, 24(3):142–149.
- Sundquist A, Ronaghi M, Tang H, Pevzner P, Batzoglou S: Whole-genome sequencing and assembly with high-throughput, short-read technologies. PLoS ONE 2007, 2(5):e484.
- Zerbino D, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008, 18(5):821–829.
- Gardner M, Hymowitz T, Xu S, Hartman G: Physical Map Location of the Rps1-k Allele in Soybean. Crop Sci 2001, 41(5):1435–1438.
- Rymarquis LA, Handley JM, Thomas M, Stern DB: Beyond Complementation. Map-Based Cloning in Chlamydomonas reinhardtii. Plant Physiol 2005, 137(2):557–566.
- Wu C: BAC as tools for genome sequencing. Plant Physiology and Biochemistry 2001, 39: 195–209.
- Soderlund C, Humphray S, Dunham A, French L: Contigs Built with Fingerprints, Markers, and FPC V4.7. Genome Res 2000, 10(11):1772–1787.
- Alizadeh F, Karp RM, Newberg LA, Weisser DK: Physical mapping of chromosomes: A combinatorial problem in molecular biology. Algorithmica 1995, 13: 52–76.
- Madishetty K, Condamine P, Svensson JT, Rodriguez E, Close TJ: An improved method to identify BAC clones using pooled overgos. Nucleic Acids Res 2007, 35: e5.
- Sulston J, Mallett F, Staden R, Durbin R, Horsnell T, Coulson A: Software for genome mapping by fingerprinting techniques. Comput Appl Biosci 1988, 4: 125–132.
- Scalabrin S, Morgante M, Meyers B: Mapping and Sequencing Complex Genomes: Let's get Physical! Nature Reviews Genetics 2004, 5: 578–588. [10.1038/nrg1404]
- McQuitty LL: Elementary Linkage Analysis for Isolating Orthogonal and Oblique Types and Typal Relevancies. Educational and Psychological Measurement 1957, 17: 207–229.
- Wendl MC: Algebraic correction methods for computational assessment of clone overlaps in DNA fingerprint mapping. BMC Bioinformatics 2007, 8: 127.
- Soderlund C, Longden I, Mott R: FPC: a system for building contigs from restriction fingerprinted clones. Comput Appl Biosci 1997, 13(5):523–535.
- Nelson W, Soderlund C: Software for restriction fragment physical maps. In The Handbook of Genome Mapping: Genetic and Physical Mapping. Edited by: Meksem K, Kahl G. Wiley-VCH; 2005:285–306.
- Nelson W, Bharti A, Butler E, Wei F, Fuks G, Kim H, Wing R, Messing J, Soderlund C: Whole-Genome Validation of High-Information-Content Fingerprinting. Plant Physiol 2005, 139: 27–38.
- Zheng J, Svensson JT, Madishetty K, Close TJ, Jiang T, Lonardi S: OligoSpawn: a software tool for the design of overgo probes from large unigene datasets. BMC Bioinformatics 2006, 7: 7.
- HarvEST home page [http://www.harvest-web.org/]
- Yu Y, Tomkins J, Waugh R, Frisch D, Kudrna D, Kleinhofs A, Brueggeman R, Muehlbauer G, Wise R, Wing R: A bacterial artificial chromosome library for barley (Hordeum vulgare L.) and the identification of clones containing putative resistance genes. Theor Appl Genet 2000, 101: 1093–1099.
- Rice Physical Map Dataset (Authorization Required) [ftp://ftp.genome.arizona.edu/pub/fpc/rice/]
- Chen M, Presting G, Barbazuk WB, Goicoechea JL, Blackmon B, Fang G, Kim H, Frisch D, Yu Y, Sun S, Higingbottom S, Phimphilai J, Phimphilai D, Thurmond S, Gaudette B, Li P, Liu J, Hatfield J, Main D, Farrar K, Henderson C, Barnett L, Costa R, Williams B, Walser S, Atkins M, Hall C, Budiman MA, Tomkins JP, Luo M, Bancroft I, Salse J, Regad F, Mohapatra T, Singh NK, Tyagi AK, Soderlund C, Dean RA, Wing RA: An integrated physical and genetic map of the rice genome. Plant Cell 2002, 14(3):537–545.
- NCBI Rice Unigenes Homepage [ftp://ftp.ncbi.nih.gov/repository/UniGene/Oryza_sativa/]
- Rice BAC library home page [http://www.genome.arizona.edu/stc/rice]
- TIGR Rice Genome Sequence Version 4 [ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_4.0]
- Yu J, Hu S, Wang J, Wong GKS, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al.: A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 2002, 296(5565):79–92.
- Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al.: A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 2002, 296(5565):92–100.
- International Rice Genome Sequencing Project home page [http://rgp.dna.affrc.go.jp/E/IRGSP/index.html]
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5(2):R12.
- Flibotte S, Chiu R, Fjell C, Krzywinski M, Schein JE, Shin H, Marra MA: Automated ordering of fingerprinted clones. Bioinformatics 2004, 20(8):1264–1271.
- Nelson W, Soderlund C: Integrating sequence with FPC fingerprint maps. Nucl Acids Res 2009, 37(5):36.
- Engler FW, Hatfield J, Nelson W, Soderlund CA: Locating sequence on FPC maps and selecting a minimal tiling path. Genome Res 2003, 13(9):2152–2163.
- Barley CAP Website [http://barleycap.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.