- Methodology article
- Open Access
GapCoder automates the use of indel characters in phylogenetic analysis
© Young and Healy; licensee BioMed Central Ltd. 2003
- Received: 29 November 2002
- Accepted: 19 February 2003
- Published: 19 February 2003
Several ways of incorporating indels into phylogenetic analysis have been suggested. Simple indel coding has two strengths: (1) biological realism and (2) efficiency of analysis. In the method, each indel with different start and/or end positions is considered to be a separate character. The presence/absence of these indel characters is then added to the data set.
We have written a program, GapCoder, to automate this procedure. The program reads aligned data sets in PIR format, finds the indels, and adds the indel-based characters. The output is a NEXUS format file, which includes a table showing which region each indel character is based on. If regions are excluded from analysis, this table makes it easy to identify the corresponding indel characters for exclusion.
Manual implementation of the simple indel coding method can be very time-consuming, especially in data sets where indels are numerous and/or overlapping. GapCoder automates this method and is therefore particularly useful during procedures where phylogenetic analyses need to be repeated many times, such as when different alignments are being explored or when various taxon or character sets are being explored. GapCoder is currently available for Windows from http://www.home.duq.edu/~youngnd/GapCoder.
- Character State
- Large Indel
- Indel Event
- Short Indel
- Separate Character
The position of insertion/deletion mutations (indels) in molecular data sets can be useful phylogenetic information [1–4], yet this information is rarely used, especially in large data sets with many indels. There are three main reasons for this. First, some workers believe that indels may be unreliable as characters. However, numerous studies that compared indel characters with already established tree topologies have found that indels are reliable for constructing phylogenies [6–11]. Second, it can be very time-consuming to determine character states based on gaps and enter this information into a data matrix by hand. Third, there is disagreement as to the best method of defining homologous character states for indels. Several different methods for incorporating indels into phylogenetic analyses have been used. We discuss five of the most useful of these methods.
The computer program MALIGN uses the first of these methods of including indels in sequence alignment and phylogenetic analysis of sequences. In this method, gap characters are considered to be a fifth character state for bases in DNA, as in Eernisse and Kluge. Therefore, adjacent gap characters are considered independently of their neighbors, although subsequent gap characters after the first may be weighted less heavily to reflect the possibility of longer indel regions [12, 13]. Essentially, each individual gap position is treated as if it were a separate indel event. This is not very realistic, because insertion or deletion events often span multiple bases [14–16]. Since many gap characters do not arise independently of one another, counting each gap character as a separate event causes a single indel event to be counted multiple times when determining phylogenetic relationships. This over-weights the indels and can distort phylogenies. Simmons and Ochoterena also note a theoretical objection: because gaps are the product of the alignment procedure, and are not actually found in organisms or their sequences, sequences with gap characters do not have anything to compare with other sequences at the point where the gap occurs. For these reasons, gaps should not be considered as a fifth character state for nucleotide characters.
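The over-weighting described above can be illustrated with a small sketch. It is not taken from MALIGN or GapCoder; the function name `gap_runs` and the example sequence are hypothetical, and the sketch assumes that each contiguous run of `-` in an aligned sequence corresponds to one indel event.

```python
import re

def gap_runs(seq):
    """Return (start, end) column pairs, 0-based and inclusive,
    for every contiguous run of '-' in an aligned sequence."""
    return [(m.start(), m.end() - 1) for m in re.finditer(r"-+", seq)]

# Hypothetical aligned sequence with one 5-column gap and one 2-column gap.
aligned = "ACGT-----ACGTAC--GT"

runs = gap_runs(aligned)
gap_positions = aligned.count("-")

# Fifth-state coding scores every gap column separately,
# whereas the assumption here is one event per contiguous run.
print("gap columns scored under fifth-state coding:", gap_positions)
print("contiguous gaps (assumed indel events):", len(runs), runs)
```

In this example the two gaps would contribute seven gap columns under fifth-state coding, although under the stated assumption they represent only two events.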
The second method, optimization alignment, is implemented in the program POY. POY achieves a phylogenetic analysis, including indels as character state changes, without ever creating a multiple-sequence alignment. Although this avoids the major problems with MALIGN, it has a limitation. Indel changes may be weighted more heavily than substitutions, but the same weight is used both for determining the position of indels and for phylogenetic analysis. For example, it is not possible to use a gap weight of 10 (an indel is equivalent to 10 substitutions), as is common in protein-coding regions, without also weighting that change 10 times as much as a substitution in phylogenetic analysis.
The third method to be considered is the multistate gap region method [4, 18–20]. In this method, areas of overlapping indels, called gap regions, are coded as individual characters. Different indels within each region are considered to be different states of the corresponding multistate gap region character. Within the DNA sequences, gap characters are coded as missing data, and the gap region characters are then placed at the end of each sequence. This method is useful because it codes indels as separate characters and treats contiguous gap characters as related. However, the number of character states for each gap region can be quite large. Because there are so many possible states, these characters can be less informative about relationships than characters produced by other methods.
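A minimal sketch of this coding scheme follows. It is one reading of the description above, not code from any of the cited studies: overlapping gaps from all sequences are merged into regions, and each distinct gap pattern within a region receives an arbitrary, unordered state number (0 is assumed to mean "no gap in this region"). The function names and the example alignment are hypothetical.

```python
import re

def gap_runs(seq):
    """(start, end) pairs, 0-based inclusive, for each run of '-'."""
    return [(m.start(), m.end() - 1) for m in re.finditer(r"-+", seq)]

def multistate_gap_regions(alignment):
    """alignment: dict mapping taxon name -> aligned sequence.
    Returns (regions, codes): a list of (start, end) gap regions and a
    dict mapping taxon name -> list of state numbers, one per region."""
    runs = {name: gap_runs(seq) for name, seq in alignment.items()}
    all_runs = sorted({r for taxon_runs in runs.values() for r in taxon_runs})

    # Merge gaps that share at least one column into a single region.
    regions = []
    for start, end in all_runs:
        if regions and start <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])

    codes = {name: [] for name in alignment}
    for r_start, r_end in regions:
        state_of = {(): 0}  # empty pattern: no gap inside this region
        for name in alignment:
            pattern = tuple(r for r in runs[name]
                            if r_start <= r[0] and r[1] <= r_end)
            # A pattern not seen before gets the next unused state number.
            state = state_of.setdefault(pattern, len(state_of))
            codes[name].append(state)
    return [tuple(r) for r in regions], codes

# Hypothetical four-taxon alignment.
alignment = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "AC--ACGTAC",
    "taxonC": "A-----GTAC",
    "taxonD": "AC--AC--AC",
}
regions, codes = multistate_gap_regions(alignment)
print(regions)
for name, states in codes.items():
    print(name, states)
```

Here the first region already needs three states for four taxa, which hints at how quickly the state count can grow in larger alignments.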
Simmons and Ochoterena have proposed a fourth method for coding indels, termed "simple indel coding". Like the third method, this process codes indels as separate characters in a data matrix, which is then considered along with the DNA base characters in phylogenetic analysis. Each indel with different start and/or end positions is considered to be a separate character, which each of the taxa under consideration either has or lacks. If one of the indels completely overlaps an indel contained within another sequence, the sequences containing the longer indel are coded as inapplicable for the shorter indel. This is done because it is impossible to determine whether or not the shorter indel is present in the sequences containing the longer one. Simple indel coding has the advantages of being conservative and easy to implement while still allowing indels to be highly informative in determining a correct phylogeny.
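The coding rule just described can be sketched in a few lines. This is an illustration of the rule as stated in the text, not GapCoder's source code; the function names, the example alignment, and the use of `-` for "inapplicable" are assumptions, and the sketch gives no special treatment to leading or trailing gaps.

```python
import re

def gap_runs(seq):
    """(start, end) pairs, 0-based inclusive, for each run of '-'."""
    return [(m.start(), m.end() - 1) for m in re.finditer(r"-+", seq)]

def simple_indel_coding(alignment):
    """alignment: dict mapping taxon name -> aligned sequence.
    Returns (indels, codes): the sorted list of distinct (start, end)
    indels and a dict mapping taxon name -> string of character states:
    '1' present, '0' absent, '-' inapplicable."""
    runs = {name: gap_runs(seq) for name, seq in alignment.items()}
    # Every distinct start/end combination becomes one character.
    indels = sorted({r for taxon_runs in runs.values() for r in taxon_runs})

    codes = {}
    for name, taxon_runs in runs.items():
        states = []
        for start, end in indels:
            if (start, end) in taxon_runs:
                states.append("1")
            elif any(s <= start and end <= e for s, e in taxon_runs):
                # A longer gap in this taxon completely covers the indel,
                # so its presence or absence cannot be determined.
                states.append("-")
            else:
                states.append("0")
        codes[name] = "".join(states)
    return indels, codes

# Hypothetical four-taxon alignment.
alignment = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "AC--ACGTAC",
    "taxonC": "A-----GTAC",
    "taxonD": "AC--AC--AC",
}
indels, codes = simple_indel_coding(alignment)
print(indels)
for name, states in codes.items():
    print(name, states)
```

In this example taxonC carries the long gap spanning columns 1 to 5, so it is scored as inapplicable for the shorter indel at columns 2 to 3 that taxonB and taxonD share.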
The final method for indel coding, also described by Simmons and Ochoterena, is called complex indel coding. This method attempts to better account for the fact that indels are evolutionarily related to one another, and that an indel region may be modified through additional insertion/deletion events to yield a different indel region in another sequence. Complex indel coding, like simple indel coding, codes indels with different start and end positions as individual characters. However, overlapping indels may represent an evolutionary transition sequence, and step matrices are constructed to accommodate this possibility. Complex indel coding utilizes more of the available information and never implies fewer steps than is biologically realistic. However, this method generates some multistate characters and step matrices and is thus more complicated to program. In addition, the step matrices slow down phylogenetic programs. For a more thorough discussion of indels and their purpose in phylogenetic analysis, see reference .
The GapCoder program
GapCoder has the potential to be useful in phylogenetics, especially in non-protein-coding regions where indels can be as plentiful as substitutions. Whenever multiple phylogenetic analyses are performed, or greater resolution is required, GapCoder provides an efficient way to incorporate the phylogenetic information contained in the indels. For example, the output resulting from GapCoder may be used in exploratory analyses of optimal DNA sequence alignment. Such an analysis would likely include GapCoder as part of an objective method with four stages. In the first stage, several alignments would be created using a program such as ClustalX. GapCoder would then be used to code the indels into the data matrix. Next, a phylogenetic analysis of the data would be performed using software such as PAUP. Finally, the best alignment could be chosen using the desired optimality criterion. GapCoder is also useful when different character sets and/or taxon sets are being explored, such as when different combinations of outgroups are tried. This often requires re-aligning the data set for each taxon set; GapCoder allows the indel characters to be quickly added each time.
- Eernisse DJ, Kluge AG: Taxonomic congruence versus total evidence, and amniote phylogeny inferred from fossils, molecules, and morphology. Mol Biol Evol 1993, 10: 1170–1195.
- Vogler AP, DeSalle R: Evolution and phylogenetic information content of the ITS-1 region in the tiger beetle Cicindela dorsalis. Mol Biol Evol 1994, 11: 393–405.
- Simmons AM, Mayden RL: Phylogenetic relationships of the creek chubs and the spine-fins: An enigmatic group of North American cyprinid fishes (Actinopterygii: Cyprinidae). Cladistics 1997, 13: 187–206.
- Freudenstein JV, Chase MW: Analysis of mitochondrial nad1 b-c intron sequences in Orchidaceae: Utility and coding of length-change characters. Syst Bot 2001, 26: 643–657.
- Golenberg EM, Clegg MT, Durbin ML, Doebley J, Ma DP: Evolution of a non-coding region of the chloroplast genome. Mol Phylogenet Evol 1993, 2: 52–64.
- Lloyd DG, Calder VL: Multi-residue gaps, a class of molecular characters with exceptional reliability for phylogenetic analyses. J Evol Biol 1991, 4: 9–21.
- Van Ham RCHJ, Hart H, Mes THM, Sandbrink JM: Molecular evolution of noncoding regions of the chloroplast genome in the Crassulaceae and related species. Curr Genet 1994, 25: 558–566.
- Johnson LA, Soltis DE: Phylogenetic inference in Saxifragaceae sensu stricto and Gilia (Polemoniaceae) using matK sequences. Ann Mo Bot Gard 1995, 82: 149–175.
- Baldwin BG, Markos S: Phylogenetic utility of the external transcribed spacer (ETS) of 18S–26S rDNA: Congruence of ETS and ITS trees of Calycadenia (Compositae). Mol Phylogenet Evol 1998, 10: 449–463.
- Prather LA, Jansen RK: Phylogeny of Cobaea (Polemoniaceae) based on sequence data from the ITS region of nuclear ribosomal DNA. Syst Bot 1998, 23: 57–72.
- Simmons MP, Ochoterena H, Carr TG: Incorporation, relative homoplasy, and effect of gap characters in sequence-based phylogenetic analysis. Syst Biol 2001, 50: 454–462.
- Wheeler WC, Gladstein DS: MALIGN: A multiple sequence alignment program. J Hered 1994, 85: 417–418.
- Giribet G, Wheeler WC: On gaps. Mol Phylogenet Evol 1999, 13: 132–143.
- Pascarella S, Argos P: Analysis of insertions/deletions in protein structures. J Mol Biol 1992, 224: 461–471.
- Gu X, Li W-H: The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J Mol Evol 1995, 40: 464–473.
- Simmons MP, Ochoterena H: Gaps as characters in sequence-based phylogenetic analyses. Syst Biol 2000, 49: 369–381.
- Wheeler WC: Optimization alignment: the end of multiple alignment in phylogenetics? Cladistics 1996, 12: 1–9.
- Baum DA, Sytsma KJ, Hoch PC: A phylogenetic analysis of Epilobium (Onagraceae) based on nuclear ribosomal DNA sequences. Syst Bot 1994, 19: 363–388.
- Young ND, Steiner KE, dePamphilis CW: The evolution of parasitism in Scrophulariaceae/Orobanchaceae: plastid gene sequences refute an evolutionary transition series. Ann Missouri Bot Gard 1999, 86: 876–893.
- Lutzoni F, Wagner P, Reeb V, Zoller S: Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. Syst Biol 2000, 49: 628–651.
- Swofford DL: PAUP*. Phylogenetic analysis using parsimony (* and other methods). Version 4. Sinauer Associates, Sunderland, Massachusetts 1998.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.