Figure 3 | BMC Bioinformatics

Figure 3

From: Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets

Scaled-up computational pipeline for sequence clustering. As with the basic pipeline, the scaled-up workflow begins with a raw sequence file. Before calculating genetic distances, the file is divided into in-sample and out-of-sample sets for use in Interpolative MDS. Full MDS and NW distance calculations on the in-sample data yield trained distances, which are used to interpolate the remaining distances. The interpolation step includes on-the-fly pairwise NW distance calculation. The overall complexity of the pipeline is reduced from O(N²) for the basic pipeline to O(M² + (N-M)*M) for the pipeline with interpolation, where N is the size of the original sequence set and M is the size of the in-sample data. To enhance computational job management and resource availability, all computational portions of the depicted pipeline were implemented using the Twister Iterative Map Reduce runtime.
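
The complexity reduction stated in the caption can be made concrete by counting pairwise distance evaluations. The following is a minimal sketch, not the authors' code: the function name `pairwise_distance_counts` is hypothetical, and it assumes that the basic pipeline evaluates every unordered pair of the N sequences, while the interpolated pipeline evaluates every unordered pair within the M in-sample sequences plus one distance from each of the N-M out-of-sample sequences to each in-sample sequence.

```python
def pairwise_distance_counts(n: int, m: int) -> tuple[int, int]:
    """Count pairwise distance evaluations for both pipelines.

    n: size of the original sequence set (N in the caption)
    m: size of the in-sample set (M in the caption)

    Assumption (not taken from the source): distances are symmetric,
    so only unordered pairs are counted for the all-vs-all steps.
    """
    if not 0 <= m <= n:
        raise ValueError("m must satisfy 0 <= m <= n")

    # Basic pipeline: all-vs-all over N sequences, O(N^2).
    full = n * (n - 1) // 2

    # Interpolated pipeline: all-vs-all over the M in-sample sequences,
    # plus each out-of-sample sequence against each in-sample sequence,
    # O(M^2 + (N - M) * M).
    interpolated = m * (m - 1) // 2 + (n - m) * m

    return full, interpolated


# Illustrative sizes only; these are not figures from the article.
full, interpolated = pairwise_distance_counts(1000, 100)
print(full, interpolated)  # 499500 94950
```

Under these assumptions the two counts coincide when M = N, and the saving grows as M shrinks relative to N.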
