JunctionViewer (2.0) is a Perl script using the Tk module to manage its GUI and the BioPerl module to parse biological data files [see Additional file 1]. These modules can be obtained through CPAN (http://www.cpan.org/).
Input to the script is provided by a single text file, which defines all user parameters. If no file is given, the script generates a template file with comments describing the argument fields.
The parameters file is composed of six sections: 1) general parameters (e.g., where the query sequence file is located), 2) NCBI BLAST parameters for each subject sequence database (e.g., E-value filter), 3) WU-BLAST parameters for each subject sequence database, 4) color schemes for displaying BLAST database sequences, 5) color assignments for cross_match subject sequence databases, and 6) custom annotation chart parameters (i.e., which charts will overlap others and in what color each chart will be drawn).
The script runs nucleotide NCBI BLAST, WU-BLAST, cross_match, and MUMmer, and combines these results, as well as custom annotation data (in the form of a numerical value per query nucleotide) into one graphical display per query sequence (see data flow in Figure 1).
Tick marks at the top of each display indicate sequence locations at 1 kb intervals. If 100 contiguous Ns are found within the query sequence (representing gaps within sequence assemblies), a grey vertical bar is drawn. A secondary value below each tick mark represents the number of nucleotides relative to the end coordinate of the last gap.
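The gap and tick-mark logic described above can be illustrated with a minimal Python sketch. It is not taken from the JunctionViewer source (which is written in Perl); the function names `find_gaps` and `tick_labels`, the 0-based half-open coordinates, and the reading of "100 contiguous Ns" as "at least 100" are assumptions made for illustration.

```python
import re

MIN_GAP_LENGTH = 100   # run of Ns treated as an assembly gap (value from the text)
TICK_INTERVAL = 1000   # tick marks at 1 kb intervals (value from the text)


def find_gaps(sequence, min_len=MIN_GAP_LENGTH):
    """Return (start, end) pairs, 0-based half-open, for each run of >= min_len Ns."""
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


def tick_labels(sequence_length, gaps, interval=TICK_INTERVAL):
    """For each tick, return (absolute position, position relative to the
    end of the last gap that finishes at or before the tick)."""
    labels = []
    for pos in range(interval, sequence_length + 1, interval):
        last_gap_end = 0  # before any gap, the secondary value equals the position
        for _start, end in gaps:
            if end <= pos:
                last_gap_end = end
        labels.append((pos, pos - last_gap_end))
    return labels
```

For a 3,000 nt sequence with a 100 nt gap ending at position 1,600, the tick at 2,000 would carry a secondary value of 400 under these assumptions.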
Below the query coordinates, custom data is represented as overlapping charts in two chart sets. Different y-axes (scale/maximum) may be used for each of the two chart sets (foreground and background). A numerical y-value can be assigned to each nucleotide on each chart, and a different color can be assigned to each chart. The y-axis is automatically extended to the largest value in a given chart.
In the current implementation, JunctionViewer displays homology data in separate cross_match, BLAST and MUMmer panels. Below the charts, sequence homologies based on cross_match results are represented as filled boxes. Longer contiguously masked sequences are drawn over and eliminate shorter ones, unless the overlap is within a user-defined allowance (set in the parameters file). Each masking database may be assigned a color. Query sequence coordinates of the homologous sequences are also indicated.
BLAST alignments, shown as large filled arrows that indicate HSP orientation, are drawn below the cross_match results. Start and stop coordinates are given for both query and subject sequences. Parameters for each subject database are defined individually. All NCBI and WU BLAST HSPs are combined and competed, i.e., longer HSPs eliminate shorter overlapping ones if they correspond to sequences from different databases, unless the overlap is within a user-defined allowance. If HSPs are generated from sequences within the same database, longer HSPs will overlap but not eliminate shorter ones. This aids in the display of tandem repeats. Each sequence in a BLAST database may be assigned a color. Additionally, subsequence positions within subject sequences may be assigned different colors.
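The competition rule for HSPs can be sketched as follows. This is an illustrative Python reading of the description, not the script's Perl implementation: the `Hsp` class, the `compete` function, the greedy longest-first order, and the choice that only retained HSPs can eliminate others are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hsp:
    database: str  # subject database the hit came from
    start: int     # query start, inclusive
    end: int       # query end, inclusive

    @property
    def length(self):
        return self.end - self.start + 1


def overlap(a, b):
    """Number of query positions shared by two HSPs (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def compete(hsps, allowance=0):
    """Keep HSPs longest first. An HSP is dropped only when a longer kept HSP
    from a different database overlaps it by more than `allowance`."""
    kept = []
    for hsp in sorted(hsps, key=lambda h: h.length, reverse=True):
        beaten = any(
            k.database != hsp.database and overlap(k, hsp) > allowance
            for k in kept
        )
        if not beaten:
            kept.append(hsp)
    return kept
```

Because HSPs from the same database never eliminate one another in this sketch, overlapping hits to tandemly repeated elements are all retained, which matches the stated purpose of the rule.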
At the bottom of the display, exact sequence matches identified with MUMmer (using a user-defined minimum match length) are drawn as thin line arrows. Solid lines indicate exactly matching sequence regions and are connected by dashed lines. Direct and indirect repeats are colored red and blue, respectively. Lines can be drawn on a fixed number of levels (currently set in the main program to 30). The longest matches are drawn first on the top level, and smaller matches that overlap are drawn on lower levels until the number of levels is exceeded.
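The level assignment for MUMmer matches resembles a greedy interval-packing scheme, sketched below in Python. The sketch treats each match as a single span on the query with inclusive coordinates, which is a simplifying assumption; `assign_levels` is a hypothetical name and the code is not drawn from the JunctionViewer source.

```python
MAX_LEVELS = 30  # number of drawing levels (value from the text)


def assign_levels(matches, max_levels=MAX_LEVELS):
    """Place (start, end) spans on levels, longest first, level 0 being the top.

    A span goes on the highest level where it overlaps nothing already placed.
    Spans that fit on none of the levels are left out of the result.
    Returns a list of ((start, end), level) pairs in drawing order.
    """
    levels = [[] for _ in range(max_levels)]
    placed = []
    for match in sorted(matches, key=lambda m: m[1] - m[0], reverse=True):
        for level, occupied in enumerate(levels):
            if all(match[1] < s or match[0] > e for s, e in occupied):
                occupied.append(match)
                placed.append((match, level))
                break
    return placed
```

Drawing the longest matches first means that, when the level limit is reached, it is the shortest overlapping matches that go undrawn.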
If a FingerPrinted Contig (FPC [19]) assembly project exists for the query sequences, the project file (.fpc) can be provided as an argument and the script will label and sort the query sequences in the GUI and name PostScript output files based on their positions in the FPC project. This is useful for labeling individual annotated BACs of a set (e.g., from a single FPC contig), which facilitates subsequent arrangement in the correct order (e.g., if they are to be printed out on paper).
Results can be viewed through the JunctionViewer GUI or as PostScript files. Within the GUI, the number of nucleotides displayed per pixel can be modified, thus changing the zoom of the resulting image. PostScript files are created automatically each time a given sequence is visualized in the GUI, either by highlighting the ID and pressing the "Display single selection" button or by pressing the "Process all displays" button. PostScript files can be converted into other graphic file formats and imported into Microsoft® PowerPoint®, enabling the display of entire centromeres or other regions spanning 5-10 megabases. We routinely use JunctionViewer to analyze 210 kb genomic fragments that overlap by 10 kb to ensure that features of 10 kb or less (e.g., LTRs) are drawn completely at least once.
On an AMD Athlon™ 64 Processor 3500+ (2.2 GHz) with 4 Gb 166 MHz DDR3 RAM running Fedora™ Linux, it took 4 minutes 43 seconds to generate graphical displays of the two 210 kb centromeric chromosome 5 sequences shown below. Maximum RAM usage was 205 Mb, and output files, including PostScript output, totaled 56 Mb.
JunctionViewer 2.0 can be set up and used in four steps: 1) installing supporting software (e.g., Perl), 2) populating a parameters file, 3) revising the parameters file as necessary after testing with previously characterized sequences, and 4) using the refined parameters file to automatically annotate uncharacterized sequences [see Additional file 2].
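The fragmenting scheme mentioned above (210 kb fragments overlapping by 10 kb) can be sketched in Python as shown below. Splitting sequences is not described as a JunctionViewer function, so this is only an illustration of how such input fragments might be prepared; the function name `tile` and the 0-based half-open coordinates are assumptions.

```python
def tile(total_length, window=210_000, overlap=10_000):
    """Return (start, end) fragments, 0-based half-open, that cover a sequence
    with `window`-sized pieces, consecutive pieces sharing `overlap` bases."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window size")
    step = window - overlap
    tiles = []
    start = 0
    while True:
        end = min(start + window, total_length)
        tiles.append((start, end))
        if end >= total_length:
            break
        start += step
    return tiles
```

With a 10 kb overlap, any feature of 10 kb or less that is cut by one fragment boundary lies entirely within the neighboring fragment.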
Although defining all parameters for each repeat and algorithm takes some time during the initial setup for each new organism, this up-front cost is quickly compensated by the ease with which large tracts of genomic DNA can subsequently be analyzed and reanalyzed as genome sequence is improved.