Visualizing phylogenetic tree landscapes

Background
Genomic-scale sequence alignments are increasingly used to infer phylogenies in order to better understand the processes and patterns of evolution. Different partitions within these new alignments (e.g., genes, codon positions, and structural features) often favor hundreds if not thousands of competing phylogenies. Summarizing and comparing phylogenies obtained from multi-source data sets using current consensus tree methods discards valuable information and can disguise potential methodological problems. Efficient and accurate dimensionality reduction methods that display at once, in 2 or 3 dimensions, the relationships among these competing phylogenies will help practitioners diagnose the limits of current evolutionary models and potential problems with phylogenetic reconstruction methods when analyzing large multi-source data sets. We introduce several dimensionality reduction methods to visualize in 2 and 3 dimensions the relationships among competing phylogenies obtained from gene partitions found in three mid- to large-sized mitochondrial genome alignments. We test the performance of these dimensionality reduction methods by applying several goodness-of-fit measures. The intrinsic dimensionality of each data set is also estimated to determine whether projections in 2 and 3 dimensions can be expected to reveal meaningful relationships among trees from different data partitions. Several new approaches to aid in the comparison of different phylogenetic landscapes are presented.

Results
Curvilinear Components Analysis (CCA) and a stochastic gradient descent (SGD) optimization method give the best representation of the original tree-to-tree distance matrix for each of the three mitochondrial genome alignments and greatly outperform the method currently used to visualize tree landscapes. The CCA + SGD method converged at least as fast as previously applied methods for visualizing tree landscapes. We demonstrate for all three mtDNA alignments that 3D projections significantly increase the fit between the tree-to-tree distances and can facilitate the interpretation of the relationships among phylogenetic trees.

Conclusions
We demonstrate that the choice of dimensionality reduction method can significantly influence the spatial relationships among a large set of competing phylogenetic trees. We highlight the importance of selecting a dimensionality reduction method to visualize large multi-locus phylogenetic landscapes and demonstrate that 3D projections of mitochondrial tree landscapes better capture the relationships among the trees being compared.

Electronic supplementary material
The online version of this article (doi:10.1186/s12859-017-1479-1) contains supplementary material, which is available to authorized users.

When the lower and upper dimensions are equal, their common value, D_q(µ), is the q-dimension of µ [46]. The correlation dimension is the special case q = 2 [25]. If this dimension estimator is applied to a countable set of points, Y = {y_1, y_2, ...}, (1) becomes

C_2(ε) = lim_{N→∞} (2 / (N(N − 1))) Σ_{i<j} H(ε − dist(y_i, y_j)),

where H(x) = 0 if x < 0 and H(x) = 1 if x ≥ 0. When it exists, the correlation dimension has the form

d_COR = D_2(µ) = lim_{ε→0} log C_2(ε) / log ε.
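In practice, C_2(ε) is computed for a finite sample and d_COR is estimated as the slope of log C_2(ε) against log ε over a suitable range of scales. A minimal sketch, assuming a precomputed distance matrix D (as would be available for tree-to-tree distances); the function names and scale grid are illustrative choices, not from the original study:

```python
import numpy as np

def correlation_sum(D, eps):
    """C_2(eps): fraction of distinct point pairs with distance < eps."""
    iu = np.triu_indices(D.shape[0], k=1)
    return np.mean(D[iu] < eps)

def correlation_dimension(D, eps_values):
    """Estimate d_COR as the least-squares slope of log C_2(eps)
    against log eps over the supplied scales."""
    xs, ys = [], []
    for eps in eps_values:
        c = correlation_sum(D, eps)
        if c > 0:                     # log C_2 undefined for empty counts
            xs.append(np.log(eps))
            ys.append(np.log(c))
    slope, _ = np.polyfit(xs, ys, 1)
    return float(slope)
```

The scale range must be chosen so that C_2(ε) is in its approximately linear (log-log) regime; very small or very large ε distort the slope.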

Maximum Likelihood estimator
This estimator assumes that the points Y = {y_1, y_2, ..., y_N} are independent, identically distributed samples from some probability density f(y), and that f(y) is approximately constant near any point y_i. It is also assumed that the number of points in the ball B_t(y) follows an inhomogeneous spatial Poisson process, P(t, y), with rate

λ(t) = f(y) V(d_ML) d_ML t^{d_ML − 1},

where V(d_ML) = π^{d_ML/2} / Γ(d_ML/2 + 1) is the volume of the unit sphere in R^{d_ML} and Γ(z) is the Gamma function. The log-likelihood of the process P(t, y) is

L(d_ML, θ) = ∫_0^R log λ(t) dP(t, y) − ∫_0^R λ(t) dt,

where θ = log f(y) and R is the radius of the ball around y on which f is approximately constant [47]. The dimension, as a function of R and y, at which L(d_ML, θ_ML) is maximized is

d̂_ML(R, y) = [ (1 / N(R, y)) Σ_{j=1}^{N(R,y)} log(R / T_j(y)) ]^{−1},

where N(R, y) is the number of points within distance R of y and T_j(y) is the distance between y and its j-th nearest neighbor. Levina and Bickel [27] suggest that, in practice, it is more convenient to fix the number of neighbors, k, rather than the radius, and to replace (4) with

d̂_ML(k, y) = [ (1 / (k − 1)) Σ_{j=1}^{k−1} log(T_k(y) / T_j(y)) ]^{−1},

where T_k(y) is the distance between point y and the k-th nearest point to y. Rather than choosing a single y_i from the set, the mean over all points, d̄_ML(k), is considered for a range of k values. The interval [k_1, k_2] is chosen so that the profile of d̄_ML(k) is close to flat, and the estimate of the dimension is defined as

d_ML = (1 / (k_2 − k_1 + 1)) Σ_{k=k_1}^{k_2} d̄_ML(k).
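A sketch of the fixed-k estimator and its average over [k_1, k_2], assuming a precomputed distance matrix with no duplicate points (function and parameter names are our own):

```python
import numpy as np

def mle_dimension(D, k1=5, k2=15):
    """Levina-Bickel maximum likelihood intrinsic dimension estimate.

    D is a matrix of pairwise distances.  After sorting each row and
    dropping the zero self-distance, T[i, j-1] holds T_j(y_i), the
    distance from y_i to its j-th nearest neighbour.  For fixed k the
    per-point estimate inverts the mean of log(T_k / T_j), j < k; the
    final estimate averages over all points and over k in [k1, k2].
    """
    T = np.sort(D, axis=1)[:, 1:]              # drop the zero self-distance
    estimates = []
    for k in range(k1, k2 + 1):
        log_ratio = np.log(T[:, k - 1][:, None] / T[:, :k - 1])
        d_hat = 1.0 / log_ratio.mean(axis=1)   # d_ML(k, y_i) for each point
        estimates.append(d_hat.mean())         # average over the points
    return float(np.mean(estimates))           # average over k in [k1, k2]
```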

Nearest Neighbor estimator
This estimator assumes that the points Y = {y_1, y_2, ..., y_N} are independent, identically distributed samples from some probability density f(y). Pettis et al. [28] derive the approximation

E[T̄_k] ≈ G_{k,d}^{−1} C_n k^{1/d},   where T̄_k = (1/N) Σ_{i=1}^N T_k(y_i)

and V(d) and T_k(y) are defined as in the Maximum Likelihood estimator. This yields G_{k,d} = k^{1/d} Γ(k) / Γ(k + 1/d), with C_n a constant, independent of k, that absorbs the remaining density and volume terms raised to the power −1/d. Pettis et al. [28] show that C_n is independent of k and that log(G_{k,d}) is close to 0, so the relation between log k and log T̄_k is close to a linear function with slope d_NN.
A nearest neighbor estimator can therefore be computed using the algorithm:
1. compute T̄_k = (1/N) Σ_{i=1}^N T_k(y_i), for k = 1, 2, ...;
2. plot log k versus log T̄_k;
3. estimate the slope of the resulting line; the slope is d_NN.
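A sketch of this procedure, replacing the visual slope estimate with a least-squares fit of log k against log T̄_k (the choice of k range is illustrative):

```python
import numpy as np

def nn_dimension(D, k_max=20):
    """Pettis et al. nearest neighbour dimension estimate from a
    distance matrix: regress log k on log Tbar_k, where Tbar_k is the
    mean distance to the k-th nearest neighbour; the slope is d_NN."""
    T = np.sort(D, axis=1)[:, 1:]            # drop the zero self-distance
    ks = np.arange(1, k_max + 1)
    Tbar = T[:, :k_max].mean(axis=0)         # Tbar[k-1] = mean of T_k(y_i)
    slope, _ = np.polyfit(np.log(Tbar), np.log(ks), 1)
    return float(slope)
```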

Visual Inspection
Lee and Verleysen [12] suggested estimating the dimension by examining the profile of the final value of the NLDR cost function as a function of the dimension to which the data is reduced. The dimension estimate is taken to be an integer value in the interval where the profile is relatively flat.
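This visual procedure can be roughly automated; the heuristic below is our own illustration (not from [12]): given the final stress values for dimensions 1, 2, 3, ..., it returns the first dimension after which the relative improvement falls below a threshold.

```python
def dimension_from_profile(costs, tol=0.05):
    """costs[s-1] is the final NLDR cost when reducing to dimension s.
    Return the first dimension s at which adding one more dimension
    improves the cost by less than a fraction tol (a flat profile)."""
    for s in range(1, len(costs)):
        drop = (costs[s - 1] - costs[s]) / max(costs[s - 1], 1e-12)
        if drop < tol:
            return s
    return len(costs)
```

In practice the profile should still be inspected by eye, since noise in the optimized stress values can trip a fixed threshold.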

Nonlinear Dimensionality Reduction Methods
In this section, we present the basic definitions of the stress functions and optimization algorithms. Difficulties or important optimizations for particular stress-function/algorithm pairs are discussed briefly.
In the discussions below, we assume that we are given N points, Y = {y_1, y_2, ..., y_N}, in a space, S, not necessarily Euclidean, and an associated distance function, dist(y_i, y_j). NLDR produces N points in R^s, X = {x_1, x_2, ..., x_N}, that are represented by the n = Ns components of X = [X_1, X_2, ..., X_n]. We assume that successive groups of s components of X define the projected vectors x_i, i.e., X = vec(x_1, x_2, ..., x_N). In the following formulas, d̂_ij = dist(y_i, y_j) and d_ij(X) = ‖x_i − x_j‖ for some norm on R^s.
The Normalized stress function is defined as

σ_n(X) = Σ_{i<j} (d̂_ij − d_ij(X))² / Σ_{i<j} d̂²_ij.

This stress function is normalized by a single weight that is related to the data in S and is therefore a constant, i.e., independent of the projection onto R^s.
The Kruskal-1 stress function is defined as

σ_1(X) = [ Σ_{i<j} (d̂_ij − d_ij(X))² / Σ_{i<j} d²_ij(X) ]^{1/2}.

It is also normalized by a single weight, but this weight is determined by the projected data in R^s. However, after a rescaling of X, the minimizers of the Kruskal-1 stress function are the same as those of the Normalized stress function.

Sammon's stress function, or the NLM stress function, is defined as

σ_s(X) = (1/c) Σ_{i<j} (d̂_ij − d_ij(X))² / d̂_ij,   where c = Σ_{i<j} d̂_ij.

Note that this stress function has a weight, 1/d̂_ij, for each term in the sum of squares of the distance differences. Sammon's stress function therefore aims to preserve distances between points that are close in S when assigning positions in R^s.

Like Sammon's stress function, the CCA stress function also has local weights. However, CCA's weights are related to the points in R^s. The CCA stress function is defined as

σ_cca(X) = Σ_{i<j} (d̂_ij − d_ij(X))² H_λ(d_ij(X)),

where the weight function H_λ(z) is a decreasing positive function dependent on a parameter λ. Demartines and Herault [32] discuss two choices for this function.
The value of λ decreases on each iteration. Initially it is large, the function H_λ is close to 1, and σ_cca(X), like σ_n(X), considers all points similarly. As the iteration proceeds, λ decreases and the importance of a term in the stress function depends on the proximity of the associated pair of points in R^s. Therefore, the number of points involved in updating the positions in X generally decreases as the iteration progresses. The manner in which λ is updated is very important to the applicability of this cost function and depends on the algorithm used to optimize σ_cca(X). In particular, the influence of the update of λ on the relationship between the convergence behavior and the complexity of an iteration of the optimization algorithm is a crucial consideration. The details of this consideration for each algorithm below are omitted, but the interested reader is directed to [12, Section 4.2.4] for a discussion of this issue for the simple gradient descent and Stochastic Gradient Descent algorithms.
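To make the cost functions concrete, the sketch below evaluates them for a given high-dimensional distance matrix and candidate embedding. This is an illustration rather than the study's code; for σ_cca we take H_λ to be the step function 1(z < λ), a simple decreasing positive weight used here only for illustration:

```python
import numpy as np

def pairwise(X):
    """Euclidean distances d_ij(X) between the projected points."""
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

def normalized_stress(Dhat, X):
    D, iu = pairwise(X), np.triu_indices(len(X), 1)
    return ((Dhat[iu] - D[iu]) ** 2).sum() / (Dhat[iu] ** 2).sum()

def kruskal1_stress(Dhat, X):
    D, iu = pairwise(X), np.triu_indices(len(X), 1)
    return np.sqrt(((Dhat[iu] - D[iu]) ** 2).sum() / (D[iu] ** 2).sum())

def sammon_stress(Dhat, X):
    D, iu = pairwise(X), np.triu_indices(len(X), 1)
    return ((Dhat[iu] - D[iu]) ** 2 / Dhat[iu]).sum() / Dhat[iu].sum()

def cca_stress(Dhat, X, lam):
    """CCA stress with the step weight H_lam(z) = 1 if z < lam, else 0."""
    D, iu = pairwise(X), np.triu_indices(len(X), 1)
    return ((Dhat[iu] - D[iu]) ** 2 * (D[iu] < lam)).sum()
```

All four vanish for a distance-preserving embedding; they differ in how residual distortions are weighted.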

Optimization Algorithms
We have considered four algorithms to optimize the stress functions: majorization, Gauss-Seidel-Newton, stochastic gradient descent, and MCMC simulated annealing. The first two are classical descent methods, while the latter two are based on stochastic ideas that allow occasional ascent steps.

Majorization
The majorization algorithm is a classical approach to minimizing a function F(X) that has been applied in the NLDR context by [48]. A function G(X, Z) that satisfies G(X, X) = F(X) and F(X) ≤ G(X, Z) for all Z is called a majorizing function. Minimizing F(X) directly corresponds to searching for a minimum along the line Z = X; the majorization algorithm exploits Z to search along another path. Given a majorizing function, we have F(X) = G(X, X) ≤ G(X, Z), and the algorithm is:
0. given Z_0 = X_0, set k = 1;
1. X_k = argmin_X G(X, Z_{k−1});
2. Z_k = X_k;
3. if the stopping condition is not satisfied, set k = k + 1 and go to step 1.
According to the properties of the majorizing function, taking X_k as a minimizer of G(X, Z_{k−1}) gives

F(X_k) ≤ G(X_k, Z_{k−1}) ≤ G(Z_{k−1}, Z_{k−1}) = F(X_{k−1}),

so a sequence X_i is produced for which F(X_i) is nonincreasing. This method usually converges to a local minimum. The main limitation of majorization is, of course, the need for a majorizing function. However, even when G(X, Z) exists, finding the minimizer over X may be more difficult than the original problem. The lack of a suitable majorizing function is the reason why, in practice, majorization is not used for NLDR based on σ_cca(X). For σ_n(X), σ_1(X), and σ_s(X), analytical expressions for the minimizer of G(X, Z) over X given a value of Z are known for the respective majorizing functions.
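For unweighted squared-distance stress, the standard majorizing function leads to the SMACOF algorithm, whose inner minimizer has the closed form known as the Guttman transform. The sketch below is an illustrative implementation for the raw stress Σ_{i<j}(d̂_ij − d_ij(X))², not the study's code, and the function names are our own:

```python
import numpy as np

def pairwise(X):
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

def raw_stress(Dhat, X):
    iu = np.triu_indices(len(X), 1)
    return ((Dhat[iu] - pairwise(X)[iu]) ** 2).sum()

def smacof(Dhat, X0, iters=100):
    """Majorization of the raw stress: each Guttman transform
    X <- B(X) X / N exactly minimizes the majorizing quadratic
    G(X, Z) at the current iterate Z, so the stress never increases."""
    N = len(X0)
    X = X0.copy()
    for _ in range(iters):
        D = pairwise(X)
        with np.errstate(divide="ignore", invalid="ignore"):
            B = np.where(D > 0.0, -Dhat / D, 0.0)
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))   # make the row sums of B zero
        X = B @ X / N
    return X
```

Because each Guttman transform solves argmin_X G(X, Z) in closed form, the monotone decrease of the stress guaranteed by the majorization argument above holds at every iteration.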

Gauss-Seidel-Newton
Coordinate descent methods are another classical approach to minimization that adapts standard methods for solving nonlinear equations to optimization [49]. Their advantage is their simplicity; although their convergence may, in general, be unacceptably slow, they can be effective for particular problems.
The basic idea is to find a root of the gradient of F(X), i.e., ∇F(X*) = 0, by allowing only a subset of the variables to change. The simplest approach is to fix all components of X except one, which is updated in a manner that reduces F(X) and moves towards a root of ∇F(X). Each of the components of X is updated based on some schedule that defines the method, and each update requires taking one or more steps of a scalar nonlinear equation solver such as Newton's method. The one-step Gauss-Seidel-Newton method is so called because a Gauss-Seidel relaxation schedule is used to order the updates of the components of X, i.e., when updating a component of X, the most recent values of all other components are used to compute a single scalar Newton's method update. Specifically, the one-step nonlinear Gauss-Seidel-Newton algorithm updates the scalar component x_j via

x_j ← x_j − α (∂F(X)/∂x_j) / |∂²F(X)/∂x_j²|,

where α > 0 is the stepsize. Note that the algorithm takes a single step of the scalar Newton's method to update each component of X. The absolute value guarantees that the method is a descent method for infinitesimal α. Sammon applied one-step Jacobi-Newton to the NLDR problem using σ_s(X), i.e., each component was updated using the old values of all other components with a fixed stepsize, but did nothing to guarantee a descent step [31]. The one-step Jacobi-Newton method can also be viewed as an inexact Newton method to update X in which the Hessian is approximated by its diagonal elements only. In order to guarantee a descent step with a nontrivial α, we use a line search and impose the strong Wolfe conditions to determine α on each step. Note that the algorithm is easily extended to the block case where, rather than updating a single component of X, the s components that define the projected vector x_i are updated using one step of Newton's method for a system of s nonlinear equations. In that case, taking the absolute value is, of course, no longer a valid method of guaranteeing descent for the s × s local Hessians.
One-step Gauss-Seidel-Newton is easily applied to σ_n(X), σ_1(X), and σ_s(X). For σ_cca(X), the update of λ and its relationship to convergence yields a noticeable increase in the complexity of a single iteration compared to the other algorithms applied to σ_cca(X).
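A generic sketch of one sweep of the one-step Gauss-Seidel-Newton update. To stay stress-function-agnostic, it approximates the scalar derivatives by central finite differences; the damping, tolerance, and finite-difference step are our own illustrative choices, whereas the study uses analytic derivatives and a strong-Wolfe line search:

```python
import numpy as np

def gs_newton_sweep(F, X, alpha=0.3, h=1e-5):
    """One sweep of the one-step Gauss-Seidel-Newton method over the
    components of X.  Each scalar component is updated as
    x <- x - alpha * F'(x) / |F''(x)|, with the derivatives taken by
    central finite differences; because each update sees the already
    updated values of earlier components, the schedule is Gauss-Seidel
    rather than Jacobi."""
    X = X.copy()
    for idx in np.ndindex(X.shape):
        def f(t, idx=idx):
            Xt = X.copy()          # X holds the most recent components
            Xt[idx] = t
            return F(Xt)
        x = X[idx]
        grad = (f(x + h) - f(x - h)) / (2.0 * h)            # F' at x
        hess = (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2  # F'' at x
        if abs(hess) > 1e-12:
            X[idx] = x - alpha * grad / abs(hess)
    return X
```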

Stochastic Gradient Descent
The majorization and Gauss-Seidel-Newton algorithms discussed above are classical descent techniques. The remaining two algorithms we have considered take a stochastic approach that allows the occasional ascent step, particularly early in the iteration, in order to provide a more global view of the cost function and convergence. The Stochastic Gradient Descent (SGD) method modifies the classical deterministic descent method; see [12, Appendix C] for a description. Note that uses of the name Stochastic Gradient vary considerably in the numerical literature. We adopt it here due to its use for this NLDR algorithm by [12]. The name is also used to describe algorithms based on various forms of probabilistic perturbations of a deterministic gradient, as used, for example, in the MCMC Simulated Annealing algorithm discussed in the next section.
It is assumed that the cost function F(X) has the form

F(X) = Σ_{i=1}^N g_i(X),

where each term g_i(X) is associated with the distances from one of the projected vectors, x_i, to all of the others. The deterministic gradient descent algorithm updates the s components of X corresponding to each projected vector x_j ∈ R^s using

x_j ← x_j − α ∇_{x_j} F(X),

where α > 0 is set based on optimization considerations. SGD updates each component of X using g_i(X), a single randomly selected term in F(X). Given the choice of i, the update is

x_j ← x_j − α ∇_{x_j} g_i(X),  for j = 1, 2, ..., N.

To ensure convergence, the step size α is set based on a Robbins-Monro condition [12]. Demartines and Herault [32] applied a version of SGD to σ_cca(X). It is easily applied to the other cost functions since all can be written in the assumed form of F(X). When applied to σ_n(X), σ_s(X), and σ_cca(X), ∇_{x_j} g_i(X) depends only on x_j and x_i. However, when applied to σ_1(X), the update of each x_j is significantly more complex, by a factor of O(N), than the updates for the other stress functions. This is due to the dependence on all of the x_i in the weight (Σ_{i<j} d²_ij(X))^{−1}.
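A minimal SGD sketch, here applied to the unweighted raw stress Σ_{i<j}(d̂_ij − d_ij(X))² rather than σ_cca; the schedule constants a and b are assumed values chosen only for illustration. One point i is drawn per iteration and every other projected point is moved along −∇_{x_j} g_i(X):

```python
import numpy as np

def sgd_stress(Dhat, X0, iters=3000, a=0.5, b=100.0, seed=0):
    """SGD on the raw stress sum_{i<j} (dhat_ij - d_ij(X))^2.

    g_i(X) collects the pair terms involving point i.  One i is sampled
    per iteration and every other projected point x_j moves along
    -grad_{x_j} g_i(X) = -2 (d_ij - dhat_ij) (x_j - x_i) / d_ij, while
    x_i itself is held fixed.  The decaying Robbins-Monro-style
    schedule alpha_t = a / (1 + t/b) is an assumed choice.
    """
    rng = np.random.default_rng(seed)
    X = X0.copy()
    N = len(X)
    for t in range(iters):
        i = rng.integers(N)
        d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))   # d_ij(X) for all j
        d[i] = 1.0                                   # avoid dividing by zero
        coef = 2.0 * (d - Dhat[i]) / d               # scalar part of the gradient
        coef[i] = 0.0                                # x_i does not move
        alpha = a / (1.0 + t / b)
        X -= alpha * coef[:, None] * (X - X[i])
    return X
```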

MCMC Simulated Annealing algorithm
The final optimization algorithm used is the MCMC Simulated Annealing algorithm, which is discussed in detail in [45]. The method combines a discrete MCMC iteration that samples a distribution related to a cost function F(X) with deterministic simulated annealing to transform the sampling into an efficient optimization algorithm for F(X). The natural form of the algorithm, which finds a maximum of a function F(X), is used here; it is easily modified for minimization by applying it to −F(X). The MCMC aspect of the algorithm is derived by combining a diffusion process with a Metropolis-Hastings-type acceptance-rejection test. Consider the probability density p(X) = e^{F(X)}/G, where G is a normalizing constant which makes p(X) integrate to 1. Note that the peaks of p(X) occur at local maxima of F(X). Therefore, a brute-force method of determining the location of a maximum is to repeatedly sample X using the density p(X). After a large number of samples, the frequency of the X observed would indicate a peak in the density and a maximum of F(X). The sampling is accomplished using a proposal density q(Z|X) and the acceptance-rejection function

ρ(X, Z) = min{ p(Z) q(X|Z) / (p(X) q(Z|X)), 1 }.

For MCMC Simulated Annealing, the proposal density q(Z|X) is taken to be the normal density with mean X + ∇F(X)δ and variance δ. This implies that the acceptance-rejection function is ρ(X, Z) = min{ e^{F(Z)−F(X)}, 1 }. Note that repeatedly sampling from the proposal density function, i.e., X_{t+1} ∼ q(Z|X_t), corresponds to following the deterministic gradient with a random perturbation on each step. This is a diffusion process, also referred to as a stochastic gradient flow. The addition of the Metropolis-Hastings acceptance-rejection step yields a discrete MCMC algorithm, the Metropolis-adjusted Langevin algorithm, that samples p(X).
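The sampler can be sketched as follows. The function names and constants are our own toy choices; the proposal follows the text (normal with mean X + ∇F(X)δ and variance δ), and, for safety, the full Metropolis-Hastings ratio with the asymmetric proposal is used. Tracking the best state visited stands in for the annealing schedule introduced next:

```python
import numpy as np

def mala_maximize(F, gradF, x0, delta=0.05, iters=4000, seed=0):
    """Metropolis-adjusted Langevin sampling of p(x) ~ exp(F(x)).

    The proposal is normal with mean x + delta*gradF(x) and variance
    delta; the Metropolis-Hastings test keeps the chain sampling p.
    Returning the best point seen turns the sampler into a crude
    maximizer; annealing (F/T with T decreasing) sharpens this.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float)
    best, fbest = x.copy(), F(x)
    for _ in range(iters):
        mu = x + delta * gradF(x)
        z = rng.normal(mu, np.sqrt(delta), size=x.shape)
        # log q(x|z) - log q(z|x) for the asymmetric proposal
        mu_back = z + delta * gradF(z)
        logq_fwd = -((z - mu) ** 2).sum() / (2.0 * delta)
        logq_back = -((x - mu_back) ** 2).sum() / (2.0 * delta)
        log_rho = F(z) - F(x) + logq_back - logq_fwd
        if np.log(rng.uniform()) < log_rho:   # Metropolis-Hastings test
            x = z
            if F(x) > fbest:
                best, fbest = x.copy(), F(x)
    return best
```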
This sampling algorithm is a very inefficient method of optimization. This is fixed by applying simulated annealing and adding a temperature T to the density