MBEToolbox is written in the MATLAB language and has been tested on the WINDOWS platform with MATLAB version 6.1.0. The main functions implemented are: sequence manipulation, computation of evolutionary distances derived from nucleotide-, amino acid- or codon-based substitution models, phylogenetic tree construction, sequence statistics and graphics functions to visualize the results of analyses. Although it implements only a small fraction of the multiplicity of existing methods used in molecular evolutionary analyses, interested users can easily extend the toolbox.
Input data and formats
MBEToolbox requires a single ASCII file containing the nucleotide or amino acid sequence alignment in either PHYLIP [1], CLUSTALW [2] or fasta format. The toolbox does provide a built-in CLUSTALW [2] interface if an unaligned sequence file is provided. Protein-coding DNA sequences can be automatically aligned based on the corresponding protein alignment with the command alignseqfile.
After input, in common with the MATHWORKS bioinformatics toolbox, MBEToolbox represents the alignment as a numeric matrix with every element standing for a nucleic or amino acid character. Nucleotides A, C, G and T are converted to integers 1 to 4, and the 20 amino acids are converted to integers 1 to 20. A header, containing information about the names and type of the sequences as well as the relevant genetic code for protein-coding nucleotides, is attached to the alignment matrix to form a MATLAB structure. An example alignment structure, aln, in MATLAB code follows:
aln =
seqtype: 2
geneticcode: 1
seqnames: {1 × n cell}
seq: [n × m double]
where n is the number of sequences and m is the length of the aligned sequences. The type of sequence is denoted by 1, 2 or 3 for sequences of non-coding nucleotides, protein coding nucleotides and amino acids, respectively.
Sequence manipulation and statistics
The alignment structure, aln, can be manipulated using the MATLAB language. For example, aln.seq(x,:) will extract the x th sequence from the alignment, while aln.seq(:, [i: j]) will extract columns i to j from the alignment. Users may easily extract more specific positions by using functions developed in the toolbox, such as extractpos(aln, 3) or extractdegeneratesites to obtain the third codon positions or fourfold degenerate sites, respectively. For each sequence, some basic statistics such as the nucleotide composition (ntcomposition) and GC content, can be reported. Other functions include the calculation of the relative synonymous codon usage (RSCU) and the codon adaptation index (CAI), counts of segregating sites, taking the reverse complement or translating a sequence, and determining the sequence complexity.
Evolutionary distances
The evolutionary distance is one of the important measures in molecular evolutionary studies. It is required to measure the diversity among sequences and to infer distance-based phylogenies. MBEToolbox contains a number of functions to calculate evolutionary distances based on the observed number of differences. The formulae used in these functions are analytical solutions of a variety of Markov substitution models, such as JC69 [3], K2P [4], F84 [1], HKY [5] (see [6] for detail). Given the stationarity condition, the most general form of Markov substitution models is the General Time Reversible (GTR or REV) model [7–10]. There is no analytical formula to calculate the GTR distance directly. A general method, described by Rodriguez et al. [9], has been implemented here. In this method a matrix F, where F
ij
denotes the proportion of sites for which sequence 1 (s1) has an i and sequence 2 (s2) has a j, is formed. The GTR distance between s1 and s2 is then given by
where ∏ denotes the diagonal matrix with values of nucleotide equilibrium frequencies on the diagonal, and tr(A)denotes the trace of matrix A. The above formula can be expressed in MATLAB syntax directly as:
>> d = -trace(PI*logm(inv(PI)*F))
MBEToolbox also calculates the gamma distribution distance and the LogDet distance [11] (i.e., Lake's paralinear distance [12]).
For alignments of codons, the toolbox provides calculation or estimation of the synonymous (K
s
) and non-synonymous (K
a
) substitution rates by the counting method of Nei and Gojobori [13], the degenerate methods of Li, Wu and Luo [14] and the method of Li or Pamilo and Bianchi [15, 16], as well as the maximum likelihood method through PAML [17]. All these methods for calculating K
s
and K
a
require that the input sequences are aligned in the appropriate reading frame, which can be performed by the function alignseqfile. Unresolved codon sites will be removed automatically. In addition, several quantities, including the number of substitutions per site at only synonymous sites, at only non-synonymous sites, at only four-fold-degenerate sites, or at only zero-fold-degenerate sites can be calculated. The output from these calculations are distance matrices which can be exported into text or excel files, or used directly in further operations.
Phylogeny inference
Two distance-based tree creation algorithms, Unweighted Pair Group Method with Arithmetic mean (UPGMA) and neighbor-joining (NJ) [18] are provided and trees from these methods can be displayed or exported. Maximum parsimony and maximum likelihood algorithms can be applied to nucleotide or amino acid alignments through an interface to the phylip package [1]. As properly implemented maximum likelihood methods are the best vehicles for statistical inference of evolutionary relationships among species from sequence data, several maximum likelihood functions have been explicitly implemented in MBEToolbox. These functions allow users to incorporate various evolutionary models, estimate parameters and compare different evolutionary trees.
The simplest case of estimation of the evolutionary distance between two sequences, s 1 and s 2, can be considered as the estimation of the branch length (the number of substitutions along a branch) separating ancestor and descendent nodes. Branch lengths, relative to a calibrated molecular clock, can reveal the time interval for this separation. A continuous time Markov process is generally used to model evolution along the branch from s 1 to s 2. A transition rate matrix, Q, is used to indicate the rate of changing from one state to another. For a specified time interval or distance, t, the transition probability matrix is calculated from P(t) = eQt. If there are N sites, the full likelihood is
In this equation,
and
are the i th bases of sequences 1 and 2 respectively;
is the expected frequency of base
.
In MBEToolbox, to calculate the likelihood, L, at a given time interval (or distance) t, we have to specify a substitution model by using an appropriate model defining function, such as modeljc, modelk2p or modelgtr for non-coding nucleotides, modeljtt or modeldayhoff for amino acids, or modelgy94 for codons. These functions return a model structure composed of an instantaneous rate matrix, R, and an equilibrium frequency vector, pi which give Q, (Q = R*diag(pi)). Once the model is specified, the function likelidist(t, model, s1, s2) can calculate the log likelihood of the alignment of the two sequences, s 1 and s2, with respect to the time or distance, t, under the substitution model, model.
In most cases we wish to estimate t instead of calculating L as a function of t, so the function optimlikelidist (model, s1, s2) will search for the t that maximises the likelihood by using the Nelder-Mead simplex (direct search) method, while holding the other parameters in the model at fixed values. This constraint can be relaxed by allowing every parameter in the model to be estimated by functions, such as optimlikelidistk2p, that can estimate both t and the model's parameters. Figure (1a and 1b) illustrates the estimation of the evolutionary distance between two ribonuclease genes through the fixed- and free-parameter K2P models, respectively. When the K2P model's parameter, kappa, is fixed, the result and trace of the optimisation process is illustrated by the graph of L and t (Fig. 1a). When kappa is a free parameter, a surface shows the result and trace of the optimisation process (Fig. 1b).
When calculating the likelihood of a phylogenetic tree, where s 1 and s 2 are two (descendant) nodes in a tree joined to an internal (ancestor) node, s
a
, we must sum over all possible assignments of nucleotides to s
a
to get the likelihood of the distance between s 1 and s 2. Consequently, the number of possible combinations of nucleotides becomes too large to be enumerated for even moderately sized trees. The pruning algorithm introduced by Felsenstein [19] takes advantage of the tree topology to evaluate the summation in a computationally efficient (but mathematically equivalent) manner. This and a simple and elegant mapping from a 'parentheses' encoding of a tree to the matrix equation for calculating the likelihood of a tree, developed in the MATLAB software, PHYLLAB [20], have been adopted in likelitree.
Combination of functions
Basic operations can be combined to give more complicated functions. A simple combination of the function to extract the fourfold degenerate sites with the function to calculate GC content produces a new function (countgc4) that determines the GC content at 4-fold degenerate sites (GC4). A subfunction for calculating synonymous and nonsynonymous differences between two codons, getsynnonsyndiff, can be converted into a program for calculating codon volatility [21] with trivial effort. Similarly, karlinsig which returns Karlin's genomic signature (the dinucleotide relative abundance or bias) for a given sequence can be easily re-formulated to estimate relative di-codon frequencies, which may be a new index of biological signals in a coding sequence. In addition, the menu-driven user interface, MBEGUI, is also a good example illustrating the power of combination of basic MBEToolbox functions.
Graphics and GUI
Good visualisation is essential for successful numerical model building. Leveraging the rich graphics functionality of MATLAB, MBEToolbox provides a number of functions that can be used to create graphic output, such as scatterplots of K
s
vs K
a
, plots of the number of transitions and transversions against genetic distance, sliding window analyses on a nucleotide sequence and the Z-curve (a 3-dimensional curve representation of a DNA sequence [22]). A simple menu-driven graphical user interface (GUI) has been developed by using GUIDE in MATLAB. The top menu includes File, Sequences, Distances, Phylogeny, Graph, Polymorphism and Help submenus (Fig. 2). It aids the usage of the most frequently required functions so that users do not have to run any scripts or functions from the MATLAB command line in most cases.