Our study shows that ordination of distance matrices, while introducing a small amount of distortion, recovers phylogenetic signal remarkably well. For non-ambiguous data with a 'known' topology derived from uncoded DNA, INAASE and PICS-Ord with Clustal identity scores performed similarly, with most but not all clades recovered accurately. PICS-Ord with Ngila zeta cost scores slightly outperformed both methods, whereas the performance of ARC could be best characterized as fair. Problems with ARC have been reported  and are based on the fact that the recoding method used in ARC is not distance-based but encodes sequences based on length and relative frequency of individual bases and base pairing patterns . Under certain, usually rare circumstances, this can lead to erroneous codes, as the following example illustrates: consider sequences (1) TTGGCCAACCGGTT, (2) AACCGGTTGGCCAA, and (3) AGCCAGCTGGCTAA. Sequences (2) and (3) are more similar to one other, differing in four transitions only, whereas sequences (1) and (2) are dissimilar. However, because they have similar base and base pair frequencies, ARC will encode sequences (1) and (2) as being more similar to one other and sequences (2) and (3) as being dissimilar, as the ARC codes demonstrate: (1) 00000000000000000000000, (2) 01001100000000100001001, and (3) 01001011111111211112112. Therefore, ARC may not only recover topologies in conflict with non-ambiguous portions of the alignment but also in conflict with the phylogenetic signal contained in the ambiguous regions. Distance-based methods avoid this problem. INAASE has been shown to perform well when the dataset is sufficiently small, recovering phylogenetic signal with great accuracy, even though the actual number of codes is very small, with a single character representing each ambiguous region [37, 38]. For large datasets with over 32 distinct sequence patterns in ambiguous regions, PICS-Ord with Ngila zeta cost scores offers a good and fast alternative. Zeta cost scores slightly outperformed simple identity and cost scores in our analysis, confirming the results of previous studies [25, 49].
Since PCoA ordination is an eigenvector analysis, the eigenvalues can be used to assess the amount of information represented by each ordination axis and be implemented as weight factor. However, if the PICS-Ord codes are used as ordered characters, the coding method encodes the ordination scores proportionally to the amount of variance explained by each axis, and a weighting factor will not markedly affect the overall performance. Weighting of the axes based on eigenvalues is recommended when the codes (equivalent to columns or sites) produced by PICS-Ord are analyzed as unordered characters or in a GTR model under maximum likelihood, although tests (results not shown) did not suggest marked changes in topology or support with unweighted or weighted PICS-Ord codes. One might also consider weighting to balance the influence of DNA versus PICS-Ord characters in a partitioned dataset. However, in general this will not be necessary. The number of code columns (sites) retained by PICS-Ord for each ambiguous region depends on the number of different sequence motifs present, with a maximum number corresponding to the number of OTUs. In our experience, only about 25-35% of sites will have positive eigenvalues and about 15-25% will be retained after removing invariant sites. The first ambiguous region each of the 100-OTU Graphidaceae, the 706-OTU Physciaceae, and the 1814-OTU Parmeliaceae dataset retained 20, 172, and 320 sites, respectively. In addition, only the first few axes will be clade-informative, that is they contain structure largely congruent with clades resolved by non-coded DNA, and hence increase clade support, whereas the higher axes tend to be 'near-constant'. In a typical dataset of 100-1000 OTUs, the number of sites retained by PICS-Ord for each ambiguous region that are 'clade-informative', will be roughly 5-25. In ITS datasets containing roughly 450 unambiguously-aligned nucleotide sites, the 'clade-informative' PICS-Ord axes, assuming 2-3 ambiguous regions, would therefore add roughly about 15-75 sites, replacing originally ambiguous portions of roughly 100-150 bases in length.
The usefulness of including ambiguous regions in phylogenetic analyses and the performance of the corresponding recoding method can be evaluated using two criteria: improved confidence (statistical support) and improved topology (phylogenetic accuracy). Topology can be judged indirectly: when two different methods applied to the same dataset result in topological differences, but under certain conditions the topologies converge, this can be seen as improvement towards phylogenetic accuracy, as long as the resolution does not decrease and no novel topologies appear [50, 51]. Ambiguous regions likely contain homoplastic phylogenetic signal which could mask the signal contained in non-ambiguous portions of the alignment. A simple way to test this is to plot distance matrices obtained from non-ambiguous and ambiguous regions against each other. If there is an acceptable level of congruence, one would expect that inclusion of ambiguous regions by means of a coding method should improve support and/or topology. The best way of testing these criteria is through the use of simulation studies [52, 53]. However, simulated data are typically not as 'messy' as real biological data, and only a combined approach inclu-ding biological and simulated data allowed us to assess the performance of our novel recoding method.
The simulation study showed that excluding ambiguous regions resulted in significantly worse topologies and that including them by means of PICS-Ord allowed the recovery of a substantial part of the phylogenetic signal contained therein. The most accurate topologies were obtained when analyzing the simulated datasets unchanged ('as is'); however, since in real biological data we cannot know the true alignment, the inclusion of ambiguous regions by means of recoding, rather than excluding them, is the next best option. In PICS-Ord, recoding ambiguous regions is based on a single optimal solution for each pairwise alignment given NGILA's model of log-affine gap costs, and the transformation of these pairwise alignments into distances reduces the risk of misinterpretation of positional homologies compared to frequency-based methods such as ARC.
The potential power of recovering phylogenetic signal contained in ambiguous regions is shown in our analysis of the 100-OTU Graphidaceae dataset. The topology and support obtained when including ambiguous regions of the mtSSU gene by means of PICS-Ord matches the topology and support obtained by a three-gene tree [unpubl. data] better than the topology based on exclusion of ambiguous regions. Published 2-gene and 3-gene phylogenies of Graphidaceae [[46, 47]; unpubl. data] recovered the Fissurina, Ocellularia, Phaeographis, and Thelotrema clades with strong support (90-100% maximum likelihood bootstrap and 0.98-1.00 posterior probability). In our approach, PICS-Ord recoding for these clades increased support by 9% for the Fissurina clade and by 18-20% for the Ocellularia and Thelotrema clades. Graphis was supported sister to the Ocellularia clade when ambiguous regions were excluded but that support disappeared when using PICS-Ord recoding; in 2-gene and 3-gene phylogenies, Graphis does not appear as sister to the Ocellularia clade. Similarly, Wirthiotrema appeared sister to the Thelotrema clade when ambiguous regions are excluded, but sister to a clade including Diploschistes under PICS-Ord, which is more in line with published 2-gene and 3-gene phylogenies. This indicates that the phylogenetic signal contained in the ambiguous portion of the mtSSU gene is congruent with the phylogenetic signal contained in other genes (nuLSU, RPB2) and therefore should not be excluded from phylogenetic analysis. The predictive power contained in ambiguous portions of the mitochondrial small subunit ribosomal DNA and in the nuclear internal transcribed spacer ribosomal DNA, especially at lower hierarchical level, make these genes promising candidates for DNA barcoding in Ascomycota [54–56].
During our study, we made some preliminary comparisons (results not shown) between PICS-Ord and direct optimization methods such as POY, BALi-Phy, PRANK, and SATè [14–20, 57]. Although PICS-Ord recoding introduces a slight distortion of the original data, we did not find that topological accuracy was increased when using DO instead of PICS-Ord recoding, either with the simulated or with the biological datasets. In fact, some of the DO methods tested consistently returned less accurate topologies, even if the alignment and tree scores were improved under variable cost settings. In addition, inference time was substantially increased for all DO methods compared to analyzing mixed datasets including PICS-Ord codes under maximum likelihood in RAxML, by a factor of ten to one hundred or even more depending on the method. The detailed results of our comparison with DO methods will be presented in a forthcoming publication.
PICS-Ord thus offers a simple and cheap-to-compute alternative to direct optimization and recoding methods such as INAASE , when phylogenetic trees are derived from fixed multiple alignments with substantial ambiguous portions. For smaller datasets capable of being handled by INAASE (maximum of 32 or 64 sequence patterns per ambiguous region), INAASE and PICS-Ord coding with Clustal give fairly similar results, and INAASE might be the preferred method, since the step matrices reflect the actual sequence distances before ordination. However, INAASE can only be implemented with parsimony analysis and not within a Bayesian or maximum likelihood framework. Also, PICS-Ord with Ngila zeta costs scores outperformed INAASE, and application of a power-law model of indel evolution was shown to be superior to other methods for sequences with variable indels [25, 49]. A criticism of PICS-Ord might be that the resulting codes are abstract entities and do not directly correspond to DNA or phenotype data. However, since the ordination axes are perpendicular to each other and the ordination space is a reflection of the original pairwise distance matrix space, the PICS-Ord codes can be interpreted as mathematically independent components of the original sequences' distances, thus fulfilling two important requirements for their phylogenetic analysis: mathematical independence and reflecting the original distance space. Besides easy computation of ambiguous region codes and seamless integration of data partitions (no user-defined step matrices are required), PICS-Ord allows for a practically unlimited number of OTUs and sequence patterns to be analyzed.
The modularity of PICS-Ord allows for flexible parameter settings, including transition:transversion ratio and gap penalties similar to those of INAASE when calculating simple pairwise cost scores in Ngila [25, 49]. Alternative distance measures other than those provided by Clustal (identity) or Ngila (simple or zeta cost scores) are also conceivable, such as those based on 'N-local decoding' [35, 36]. Another possibility for fine-tuning PICS-Ord lies in the number of ordination axes selected for recoding and in the way the principal coordinates are encoded. This allows for adjustments of PICS-Ord codes with respect to the relative length of ambiguous regions within a given alignment. The fact that PICS-Ord codes are simple integer values permits combined analysis of DNA and PICS-Ord code partitions in a Bayesian framework , with up to 6-state ordered codes, and under maximum likelihood with RAxML 7.2.6, using a GTR or MK model or characters as ORDERED, with up to 32 states. This was previously impossible with mixed letter/integer codes or codes representing user-defined step matrices. In addition to the improved topological accuracy, a further argument for using PICS-Ord versus direct optimization or INAASE is computational speed: recoding the ambiguous region in our 705-OTU dataset on a dual-core INTEL processor took two minutes, and analysis of the partitioned dataset under maximum likelihood in RAxML on the same machine required about 36 hours including rapid bootstrapping (100 replicates). For the 1814-OTU Parmeliaceae dataset, recoding took about 35 minutes for each region and maximum likelihood analysis including rapid bootstrapping (100 replicates) in RAxML lasted eight days. This is comparable to the time a 50-100-OTU dataset would have taken to be analysed on the same processor under maximum parsimony in PAUP with ambiguous regions included as INAASE step matrices. Computational speed can further be substantially increased when running the software on multi-processor computers and web servers . Since PICS-Ord uses a set of integer codes to represent each ambiguous region for a given OTU, another problem of INAASE is avoided: the limited number of available symbols when coding an entire ambiguous region as a single character.
While PICS-Ord recoding was here applied to DNA data, the underlying method can be used to incorporate any kind of multidimensional distance matrix as unidimensional columns in a phylogenetic dataset and hence simplify the analytical approach and considerably increase computational speed.