Multiple alignments of homologous sequences provide important information on the evolution and the sequence-function relationships of protein families. Two types of methods, tree-based and space-based, can be used to compare sequences (reviewed in ). Both depend on a multiple alignment of homologous sequences. Tree methods assume a hierarchical, binary structure of the data to infer phylogenetic relationships. Space methods, in contrast, are based on multivariate analysis of a matrix of distances between the sequences and do not assume a specific structure for the data. One such method is metric multidimensional scaling (MDS), a powerful way to visualize distances between elements [2–5]. MDS, also named principal coordinate analysis, starts from a matrix of distances between elements and represents these elements in a low-dimensional space in which the distances best approximate the original distances. Applied to biological sequences, this method usefully complements phylogeny [6–11].
The completion of the genome sequencing of a wide variety of organisms has paved the way to the comparison of protein families from different species. A very interesting property of MDS is the possibility to project supplementary elements onto a reference or “active” space. The positions of the supplementary elements (a.k.a. “out of sample” elements) are obtained from their distance to the active elements [2, 12, 13]. This property provides a very useful tool to compare orthologous sequences to a reference sequence set. In particular, when several orthologous protein families are compared, this method can be used to visualize evolutionary drifts .
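The projection of supplementary elements can be illustrated with a short Python/NumPy sketch. It assumes that the active space was obtained by classical MDS, so that the active coordinates are centered and their axes are mutually orthogonal, and it uses a standard closed-form expression for adding a point to such a space. The function name `project_supplementary` and the toy data are hypothetical; this is not necessarily how any particular package implements the projection.

```python
import numpy as np


def project_supplementary(active_dist, active_coords, sup_dist):
    """Project supplementary elements onto an existing MDS space.

    active_dist   : (n, n) distances between the active elements
    active_coords : (n, k) principal coordinates of the active elements
                    (assumed centered, with mutually orthogonal axes)
    sup_dist      : (m, n) distances from each supplementary element
                    to each active element
    """
    active_sq = np.asarray(active_dist, dtype=float) ** 2
    sup_sq = np.atleast_2d(np.asarray(sup_dist, dtype=float)) ** 2
    coords = np.asarray(active_coords, dtype=float)
    # For principal coordinates, the column sums of squares are the eigenvalues.
    eigenvalues = (coords ** 2).sum(axis=0)
    # Mean squared distance from each active element to all active elements.
    row_means = active_sq.mean(axis=1)
    # y = 1/2 * Lambda^-1 * X^T (row_means - d_sup^2), one row per supplementary element.
    return 0.5 * (row_means[None, :] - sup_sq) @ coords / eigenvalues


# Hypothetical active space: the four corners of a square (centered, orthogonal axes).
active_coords = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
active_dist = np.linalg.norm(
    active_coords[:, None, :] - active_coords[None, :, :], axis=-1
)

# One supplementary element, known here only through its distances to the active ones.
true_position = np.array([[0.5, -0.25]])
sup_dist = np.linalg.norm(
    true_position[:, None, :] - active_coords[None, :, :], axis=-1
)

projected = project_supplementary(active_dist, active_coords, sup_dist)
print(np.allclose(projected, true_position))
```

Only the distances to the active elements enter the calculation, so the active space itself is left unchanged when supplementary elements are added. In this toy example the distances are exactly Euclidean and all axes are retained, so the projected position coincides with the true one; with sequence-derived distances the projection is an approximation.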
MDS is based on the eigen-decomposition (i.e., principal component analysis) of a cross-product matrix derived from the distance matrix [2–5] and can be performed with the default tools included in the R statistical language (e.g., the cmdscale function). In addition, several R packages such as ade4, made4, adegenet, and vegan [14–17] have been developed to provide multivariate analysis in the field of bioinformatics, including MDS. For example, the dudi.pco function in ade4 or the wcmdscale function in vegan performs MDS analysis. However, the projection technique has not been widely used yet and, to the best of our knowledge, is not included in the available R packages.
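As an illustration of the eigen-decomposition step, the following Python/NumPy sketch builds the cross-product matrix by double-centering the squared distances and derives coordinates from its leading eigenvectors. It is a minimal sketch of classical MDS under the assumption of a symmetric distance matrix; the function name `classical_mds` and the toy distances are hypothetical, and the code is not taken from any of the R packages mentioned above.

```python
import numpy as np


def classical_mds(dist, n_axes=2):
    """Classical (metric) MDS of a symmetric distance matrix.

    Returns the coordinates on the first `n_axes` principal axes and all
    eigenvalues of the cross-product matrix, sorted in decreasing order.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Centering matrix J = I - (1/n) * 1 1^T.
    centering = np.eye(n) - np.ones((n, n)) / n
    # Cross-product matrix obtained by double-centering the squared distances.
    cross_product = -0.5 * centering @ (dist ** 2) @ centering
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cross_product)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Scale eigenvectors by sqrt(eigenvalue); negative eigenvalues, which can
    # arise from non-Euclidean distances, are clipped to zero.
    scale = np.sqrt(np.clip(eigenvalues[:n_axes], 0.0, None))
    coords = eigenvectors[:, :n_axes] * scale
    return coords, eigenvalues


# Hypothetical toy data: four elements lying on a line at 0, 1, 3 and 6.
positions = np.array([0.0, 1.0, 3.0, 6.0])
toy_dist = np.abs(positions[:, None] - positions[None, :])

coords, eigenvalues = classical_mds(toy_dist, n_axes=1)

# Distances recomputed in the one-dimensional MDS space.
recovered = np.abs(coords[:, 0][:, None] - coords[:, 0][None, :])
print(np.allclose(recovered, toy_dist))
```

The sign of each axis is arbitrary, so only the inter-element distances, not the raw coordinates, should be compared across runs or implementations. The relative size of the eigenvalues indicates how much of the original distance structure each axis captures.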
Thus, we have developed the R package bios2mds (from BIOlogical Sequences to MultiDimensional Scaling) to provide all the tools necessary to perform the MDS analysis of multiple sequence alignments. This package includes a function that projects supplementary sequences onto a reference space and, thus, makes it possible to compare orthologous sequence sets.