NucDiff: in-depth characterization and annotation of differences between two sets of DNA sequences

Background
Comparing sets of sequences is a situation frequently encountered in bioinformatics, examples being comparing an assembly to a reference genome, or two genomes to each other. The purpose of the comparison is usually to find where the two sets differ, e.g. to find where a subsequence is repeated or deleted, or where insertions have been introduced. Such comparisons can be done using whole-genome alignments. Several tools for making such alignments exist, but none of them 1) provides detailed information about the types and locations of all differences between the two sets of sequences, 2) enables visualisation of alignment results at different levels of detail, and 3) carefully takes genomic repeats into consideration.

Results
We here present NucDiff, a tool aimed at locating and categorizing differences between two sets of closely related DNA sequences. NucDiff is able to deal with very fragmented genomes, repeated sequences, and various local differences and structural rearrangements. NucDiff determines differences by a rigorous analysis of alignment results obtained by the NUCmer, delta-filter and show-snps programs in the MUMmer sequence alignment package. All differences found are categorized according to a carefully defined classification scheme covering all possible differences between two sequences. Information about the differences is made available as GFF3 files, thus enabling visualisation using genome browsers as well as usage of the results as a component in an analysis pipeline. NucDiff was tested with varying parameters for the alignment step and compared with existing alternatives, called QUAST and dnadiff.

Conclusions
We have developed a whole genome alignment difference classification scheme together with the program NucDiff for finding such differences. The proposed classification scheme is comprehensive and can be used by other tools. NucDiff performs comparably to QUAST and dnadiff but gives much more detailed results that can easily be visualized. NucDiff is freely available on https://github.com/uio-cels/NucDiff under the MPL license.

Electronic supplementary material
The online version of this article (doi:10.1186/s12859-017-1748-z) contains supplementary material, which is available to authorized users.


Effect of different MUMmer parameters: result comparison approach
In general, a simulated difference is considered correctly detected if its location intersects the location of a NucDiff-detected difference of the same type. However, some exceptions had to be made to allow a fair comparison in cases where nearby bases are identical purely by chance. First, for all types of deletions and for simple relocations and translocations, the detected difference may be located at most 3 bases before or after the simulated difference. Second, some differences are allowed to have several corresponding types: simulated simple relocations and translocations may be detected by NucDiff either as simple relocations and translocations or as relocations and translocations with overlap. Regardless of the chosen NUCmer and delta-filter parameter values, we expect all repeat-related differences shorter than 30 bases to be detected as non-repeat-related. Likewise, we expect a fragment relocated to another place in the query sequence to be detected as a simple insertion if it is shorter than 30 bases.
These limits are tool independent and were introduced to avoid detection of random duplications and fragment relocations.
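The matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not NucDiff's own evaluation code: the `Difference` class, the type labels (`relocation_overlap`, `collapsed_repeat`, and so on), the fallback mapping for short repeat-related differences, and the exact reading of "at most 3 bases before or after" as a 3-base slack on an inclusive interval test are all assumptions made for the example. The rule for short relocated fragments is left out for brevity.

```python
from dataclasses import dataclass

TOLERANCE = 3        # bases of slack allowed for some difference types
MIN_REPEAT_LEN = 30  # shorter repeat-related differences count as non-repeat

# Hypothetical type labels, chosen only for this sketch.
TOLERANT_KINDS = {"deletion", "collapsed_repeat", "relocation", "translocation"}
EQUIVALENT_KINDS = {
    "relocation": {"relocation", "relocation_overlap"},
    "translocation": {"translocation", "translocation_overlap"},
}
# Assumed non-repeat counterparts of repeat-related types.
SHORT_FALLBACK = {"duplication": "insertion", "collapsed_repeat": "deletion"}


@dataclass(frozen=True)
class Difference:
    kind: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def accepted_kinds(simulated: Difference) -> set:
    """Return the detected types that count as a match for a simulated one."""
    length = simulated.end - simulated.start + 1
    if length < MIN_REPEAT_LEN and simulated.kind in SHORT_FALLBACK:
        return {SHORT_FALLBACK[simulated.kind]}
    return EQUIVALENT_KINDS.get(simulated.kind, {simulated.kind})


def intersects(a: Difference, b: Difference, slack: int = 0) -> bool:
    """Inclusive interval intersection, widened by `slack` bases on both sides."""
    return a.start <= b.end + slack and b.start <= a.end + slack


def is_detected(simulated: Difference, detected: Difference) -> bool:
    """Apply the type rule first, then the (possibly tolerant) location rule."""
    if detected.kind not in accepted_kinds(simulated):
        return False
    slack = TOLERANCE if simulated.kind in TOLERANT_KINDS else 0
    return intersects(simulated, detected, slack)
```

For example, under these assumptions a simulated deletion at positions 100 to 150 matches a detected deletion ending at position 97 (3 bases before it), but not one ending at position 96.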

Supplementary figures and tables
Figure S1 Reference fragments placement order depending on query fragment orientations during detection of local differences. a) The placement order of A* and B* when A and B have the same orientation as A* and B*. b) The placement order of A* and B* when A and B have the opposite orientation. The placement relation between A and B, and between A* and B*, may differ from what is shown here.

In Table S1, case 1 may appear only after merging the fragments in the nested fragment cases. It never occurs in the NUCmer output. In all cases with overlaps between query or reference fragments (cases 5, 6, 7, and 8), the lengths of corresponding differences (

In the simulated modifications the following lengths of regions were used:
1. Len(A/C) = 500 bases in all described cases, except reshuffling case
2. Len(x/y) = 10500 bases in all described cases
3. Distance between each manipulation case = 10500
4. Len(B) = {5,20,50,65,85,88,100,150,200,250,300,350,400} bases in deletion (1,2,3,5), insertion (1,2,5), relocation (1,5,6), translocation (2,3) and inversion (1)
(2) case. B3 contains one nucleotide difference with each of B1 and B2. The differences are located in the reference sequence at the same positions where B1 and B2 have two differences of one nucleotide length.
11. Len(correct/completely wrong seq) = {5,20,50,65,85,88,100,150,200,250,300,350,400} bases in all unaligned sequence cases.

All other NUCmer parameters, except --maxmatch, have the NUCmer default values and remained fixed in our tests. The --maxmatch parameter, which tells NUCmer to use all anchor matches regardless of their uniqueness, is not used by default in NUCmer, but is required for NucDiff and thus is used in all tests.
As for the delta-filter filtering parameters, the -q parameter (query alignment using length*identity weighted LIS [longest increasing subset]) is required for NucDiff to obtain the output needed for the analysis and is used in all tests performed.
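The two required options can be illustrated with a short Python sketch that assembles the corresponding MUMmer command lines. It is not taken from NucDiff itself: the function names, the file names, and the `.filter` output suffix are hypothetical, and only the `--maxmatch` and `-p` options of `nucmer` and the `-q` option of `delta-filter` are used; any other parameters would be appended as needed.

```python
import subprocess


def mummer_commands(reference: str, query: str, prefix: str):
    """Build argument lists for the alignment and filtering steps."""
    # --maxmatch: use all anchor matches regardless of their uniqueness.
    # -p: output prefix, so nucmer writes <prefix>.delta.
    nucmer_cmd = ["nucmer", "--maxmatch", "-p", prefix, reference, query]
    # -q: query alignment using length*identity weighted LIS.
    filter_cmd = ["delta-filter", "-q", f"{prefix}.delta"]
    return nucmer_cmd, filter_cmd


def run_alignment(reference: str, query: str, prefix: str) -> str:
    """Run both steps; requires MUMmer to be installed and on PATH."""
    nucmer_cmd, filter_cmd = mummer_commands(reference, query, prefix)
    subprocess.run(nucmer_cmd, check=True)
    filtered_path = f"{prefix}.filter"  # hypothetical output file name
    # delta-filter writes the filtered alignments to standard output.
    with open(filtered_path, "w") as out:
        subprocess.run(filter_cmd, check=True, stdout=out)
    return filtered_path
```

Keeping command construction separate from execution makes it easy to inspect or log the exact parameters used in each test run before anything is launched.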
In the QUAST-like tests, we ran a test with the same parameter values used by QUAST, except for the -q parameter. It is not used by QUAST, but is required for NucDiff.

Table 1 and Table 3)
relocation, relocation with overlap, relocation with insertion, relocation with insertion and inserted gap | relocation
relocation with inserted gap | relocation, fake: scaffold gap size wrong estimation
unaligned sequence | unaligned
all translocation types | translocation
reshuffling | local misassembly