Dual Computation mode: Single or Parallel Central Processing Units (CPUs)
ClustalXeed provides two computation platforms: physical RAM addressing, which uses the clustalw-xeed_mem file, a modified version similar to the original ClustalW or ClustalW-MPI tools [5, 11]; and distributed file allocation, which uses the clustalw-xeed file. The new clustalw-xeed file enables large-scale disk-based storage and smart allocation of vast sequence data sets to a disk swap space, both for construction of temporary pair-align matrices and for accelerated computation. ClustalXeed uses a sequential file writing method that provides a straightforward and efficient way to read and write files. For a large number of sequences, ClustalXeed converts the input sequences into individual sequence pairs and stores each pair using the naming rule /tmp/xxxxx000P, where P is the pair sequence number and the generation number in the file name is always incremented by 1.
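The pair-splitting and sequential naming idea can be sketched as follows. This is a minimal illustration, not the ClustalXeed implementation: the function name `write_pair_files`, the `prefix` argument, the six-digit counter, and the FASTA-like file content are all assumptions made for the example, and the actual file-name stem used by ClustalXeed is not reproduced.

```python
import itertools
import os
import tempfile


def write_pair_files(sequences, out_dir, prefix="pair"):
    """Write every unordered sequence pair to its own sequentially numbered file.

    `prefix` and the zero-padded counter are hypothetical stand-ins for the
    naming rule described in the text.
    """
    paths = []
    pairs = itertools.combinations(range(len(sequences)), 2)
    for pair_no, (i, j) in enumerate(pairs, start=1):
        # The counter grows by exactly 1 for each new pair file.
        path = os.path.join(out_dir, f"{prefix}{pair_no:06d}")
        with open(path, "w") as fh:
            fh.write(f">{i}\n{sequences[i]}\n>{j}\n{sequences[j]}\n")
        paths.append(path)
    return paths


seqs = ["ACGT", "ACGGT", "TTGA", "ACG"]
with tempfile.TemporaryDirectory() as tmp:
    files = write_pair_files(seqs, tmp)
    print(len(files))  # 6 pairs for 4 sequences
```

Because each pair lives in its own file, a worker only needs to read one small file per alignment rather than hold the whole data set in RAM.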
A distance matrix, stored as a single file named matrix-file080.tmp, is also generated on disk from the all-to-all pair sets of the input sequences. The pair scores calculated on each computation node are stored in the /tmp directory of the master node. Implementing this process required modification of the three main programming functions in the original ClustalW (alloc_aln, pairalign_new.c, pairalign), as well as the pairalign message-passing-interface (MPI) algorithm in ClustalW-MPI [11].
For cluster analysis, ClustalXeed uses the distance matrix and the neighbor-joining clustering method to construct a similarity tree (guide tree). During this step, temporary changes in tree values are recorded sequentially: a new similarity tree file containing the updated records is created at each computation stage. This technique provides efficient file handling for analyses that involve frequent writing and reading of large data sets, as is required for dynamic programming using the sum of pairs (SP) scoring method. The similarity matrix files generated at each stage are named using the same naming rule as described for the input file storage system. The final multiple sequence alignment (MSA) results are stored in a {*.aln} file for easy data retrieval.
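For readers unfamiliar with neighbor joining, the following is a minimal in-memory sketch of the textbook algorithm. It is not the ClustalXeed code and does not write the per-stage tree files described above; the function name, the `frozenset`-keyed distance dictionary, and the example distances are assumptions chosen for illustration.

```python
def neighbor_joining(names, dist):
    """Run textbook neighbor joining.

    `dist` maps frozenset({a, b}) to the distance between taxa a and b.
    Returns (joins, remaining_nodes, distances), where each join is
    (left, right, left_branch_length, right_branch_length, new_node_name).
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums r_i of the current distance matrix.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Choose the pair minimising Q(i, j) = (n - 2) * d(i, j) - r_i - r_j.
        i, j = min(
            ((a, b) for k, a in enumerate(nodes) for b in nodes[k + 1:]),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node u.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"({i},{j})"
        # Distances from u to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]
        joins.append((i, j, li, lj, u))
    return joins, nodes, d


names = list("abcde")
m = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {
    frozenset((names[i], names[j])): m[i][j]
    for i in range(5)
    for j in range(i + 1, 5)
}
joins, remaining, _ = neighbor_joining(names, dist)
print(joins[0])  # ('a', 'b', 2.0, 3.0, '(a,b)')
```

Each pass through the loop corresponds to one "computation stage" at which an updated tree record could be written out.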
Dynamic Scan Load Balancing (DSLB)
Both ClustalW and ClustalX calculate pair-align scores to generate a guide tree for multiple sequence alignment. A distributed computation strategy can accelerate this step, but the performance of any distributed parallel-computation approach depends on the balance of the workloads among the distributed nodes (Figure 2A). In the ideal case, the elements to be processed are divided uniformly among the work-node threads.
Depending on both the volume of the data to be aligned and the required accuracy of the comparisons, computation by dynamic programming needs time-consuming iterations to achieve high accuracy. For example, ClustalX has a parallelized version, ClustalX IRIX, that is optimized to run on SGI Origin parallel computers running IRIX 6.5. The main problem with this parallelized version, however, is that the user has to spend extra time pre-sorting the input data to reduce load imbalances.
If the pair-align computation nodes are well balanced overall, performance is greatly improved (Figure 2B). The effect becomes dramatic when some input sequences dominate the computation time because they are excessively long or short strings. Load balancing is particularly necessary in highly parallel file-swap systems, where it is a key feature for speeding up computation. Physical RAM addressing is preferred for small-to-medium-sized sequences, whereas the file-swap (disk-storage) system is used for very large volume computations.
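A small numerical sketch shows why a few long sequences unbalance a naive split. It assumes, for illustration only, that the cost of aligning a pair by dynamic programming is proportional to the product of the two sequence lengths, and that pairs are divided into contiguous equal-count chunks; the function names are hypothetical.

```python
from itertools import combinations


def pair_costs(lengths):
    """Approximate the DP cost of each pair as the product of the two lengths."""
    return [a * b for a, b in combinations(lengths, 2)]


def static_chunk_loads(costs, n_nodes):
    """Split the pair list into contiguous, equal-count chunks, one per node."""
    chunk = -(-len(costs) // n_nodes)  # ceiling division
    return [sum(costs[k * chunk:(k + 1) * chunk]) for k in range(n_nodes)]


def imbalance(loads):
    """Ratio of the busiest node's load to the mean load (1.0 is ideal)."""
    return max(loads) / (sum(loads) / len(loads))


# One long sequence among three short ones.
costs = pair_costs([1000, 100, 100, 100])
loads = static_chunk_loads(costs, n_nodes=2)
print(loads)                        # [300000, 30000]
print(round(imbalance(loads), 2))   # 1.82
```

In this toy case one node receives all three pairs that involve the long sequence, so it does ten times the work of the other node.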
To overcome the slow speed of file (disk) memory computation, we designed an efficient load balancer that uses a fast, intelligent scanning strategy to find sleeping computation nodes. Based on the previous distribution characteristics of ClustalW-MPI, we propose a dynamic scan-load balancing algorithm for efficient job assignment among non-uniformly (unequally) loaded pair-align computation nodes, which is designed to increase the speed of interprocessor (inter-node) communication (Figure 3). The original parallel version of ClustalW-MPI uses a fixed-size chunk scheduler algorithm to distribute sequence pairs to each node [14]. The main job of the master node in our system is to keep the slave nodes busy as long as there is work to be done. That is, when a computation node completes its processing, it requests additional queued sequence pairs from the master. This form of dynamic load balancing continues until all of the sequences have been aligned. Once the job is submitted, it can be monitored and controlled via ClustalXeed main.
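The on-demand behaviour described above, in which an idle node asks the master for more pairs, can be illustrated with a small deterministic simulation. This is a sketch under stated assumptions, not the DSLB algorithm itself: communication cost is ignored, each request hands out exactly one pair, and the static baseline (one contiguous chunk per worker) is a simplification rather than the actual ClustalW-MPI scheduler. All function names are hypothetical.

```python
import heapq


def dynamic_makespan(costs, n_workers):
    """Simulate a master handing the next pair to whichever worker is idle first.

    Returns the time at which the last worker finishes.
    """
    # Each heap entry is (time at which the worker becomes idle, worker id).
    idle_at = [(0, w) for w in range(n_workers)]
    heapq.heapify(idle_at)
    for cost in costs:
        t, w = heapq.heappop(idle_at)           # earliest idle worker asks for work
        heapq.heappush(idle_at, (t + cost, w))  # it is busy until t + cost
    return max(t for t, _ in idle_at)


def static_makespan(costs, n_workers):
    """Baseline: give each worker one contiguous chunk of equal pair count."""
    chunk = -(-len(costs) // n_workers)  # ceiling division
    return max(sum(costs[k * chunk:(k + 1) * chunk]) for k in range(n_workers))


costs = [10, 10, 10, 1, 1, 1]  # three expensive pairs followed by three cheap ones
print(static_makespan(costs, 2))   # 30
print(dynamic_makespan(costs, 2))  # 20
```

With dynamic dispatch the second worker picks up the cheap pairs while the first is still busy, so the total run time drops from 30 to 20 time units in this example.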
New Features in ClustalXeed Core
Figure 1 shows the input panel for an MSA task. The main frame of ClustalXeed is the same as that of ClustalX, because ClustalX is very familiar to PC-based MSA users. Apart from several new features for parallel computation, the interface is consistent with ClustalX and the other functions are very similar.
New sequence editing options
ClustalXeed allows the user to change the order of sequences by simply cutting and pasting the sequence names. The sequence block grabber also provides a box-shaded area that enables the user to realign badly aligned sequences in a new window format, and to realign a small box region again. This option provides an independent task for the refinement of aligned sequences. The realignment range can be selected by simply clicking the mouse and dragging on the new sequence area (ClustalX and ClustalX 2.0 do not provide this). Easy Sequence finder enables a search for nucleic acids or proteins based on a partial sequence input.
Enhanced phylogenetic tree view option
For convenient direct tree drawing, the open-source TreeView [15] was embedded, because some environments cannot read "unweighted pair group method with arithmetic mean" (UPGMA) trees. This was a known problem in the previous versions, ClustalX and ClustalX 2.0. Users can save the resulting tree image as a PostScript {*.ps} file. ClustalXeed supports standard TrueType and PostScript fonts, which may increase the resolution of the resulting tree when it is enlarged, especially for a huge sequence data set.
Real-time in-process dialogue box
A real-time in-process dialogue box was built to enable the user to quickly monitor the current status of the computation, the progress of the job, and the number of nodes involved in the calculation. Neither ClustalX nor ClustalX 2.0 gives feedback on the computation status. When the MSA is finished, the user can save the entire computation history to a {result.log} file. The parameters used are also saved in the working directory as a {result.par} file. This allows the user to view and edit tag information about all the individual batch sequence-alignment jobs.
Secondary structure prediction
The GOR IV [16] and PHD [17] options were added, although their installation requires permission from the original developer. We do not provide this permission in ClustalXeed.
Protein weight matrices
ClustalXeed provides more options for selecting protein-weight matrices. The previous versions (ClustalX and ClustalX 2.0) provide only three protein-weight matrices: BLOSUM 30, PAM 350, and Gonnet 250. We added more than 34 types of protein-weight matrices, which specify the scoring tables used for easy and accurate adjustment and improvement of the SP score. This allows the user to reduce or increase the sensitivity of the multiple sequence alignment.
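To make the link between the weight matrix and the SP score concrete, here is a minimal sketch of sum-of-pairs scoring with a pluggable scoring function. The two toy scoring functions, the gap penalty of -1, and the convention that a gap aligned with a gap scores 0 are assumptions for illustration; they are not BLOSUM, PAM, or Gonnet values and not the ClustalXeed scoring code.

```python
from itertools import combinations


def sp_score(alignment, score, gap_penalty=-1):
    """Sum-of-pairs score: add up the pairwise scores in every alignment column."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue                 # assumed convention: gap vs gap scores 0
            if a == "-" or b == "-":
                total += gap_penalty     # residue aligned with a gap
            else:
                total += score(a, b)     # residue vs residue, from the weight matrix
    return total


def lenient(a, b):
    """Toy scoring table that rewards matches and mildly penalises mismatches."""
    return 2 if a == b else -1


def strict(a, b):
    """Toy scoring table that penalises mismatches more heavily."""
    return 2 if a == b else -4


aligned = ["AC-", "ACG", "A-G"]
print(sp_score(aligned, lenient))  # 6
```

Swapping `lenient` for `strict` changes the score of any alignment that contains mismatches, which is the sense in which the choice of matrix tunes alignment sensitivity.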