OrthoParaMap: Distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies

Background In eukaryotic genomes, most genes are members of gene families. When comparing genes from two species, therefore, most genes in one species will be homologous to multiple genes in the second. This often makes it difficult to distinguish orthologs (separated through speciation) from paralogs (separated by other types of gene duplication). Combining phylogenetic relationships and genomic position in both genomes helps to distinguish between these scenarios. This kind of comparison can also help to describe how gene families have evolved within a single genome that has undergone polyploidy or other large-scale duplications, as in the case of Arabidopsis thaliana – and probably most plant genomes. Results We describe a suite of programs called OrthoParaMap (OPM) that makes genomic comparisons, identifies syntenic regions, determines whether sets of genes in a gene family are related through speciation or internal chromosomal duplications, maps this information onto phylogenetic trees, and infers internal nodes within the phylogenetic tree that may represent local – as opposed to speciation or segmental – duplication. We describe the application of the software using three examples: the melanoma-associated antigen (MAGE) gene family on the X chromosomes of mouse and human; the 20S proteasome subunit gene family in Arabidopsis, and the major latex protein gene family in Arabidopsis. Conclusion OPM combines comparative genomic positional information and phylogenetic reconstructions to identify which gene duplications are likely to have arisen through internal genomic duplications (such as polyploidy), through speciation, or through local duplications (such as unequal crossing-over). The software is freely available at .


Background
To extend knowledge about genes in a model species to other related species, it is important to distinguish genes that are directly related to one another through speciation (orthologs) from genes that have duplicated independent of speciation (paralogs) [1]. Paralogs may be a result of many different types of gene duplication, including unequal crossing-over, transposon-mediated duplications, or polyploidy, and may have occurred recently or long before some speciation event of interest. One-to-one orthologous relationships at least hint at conservation of gene function, whereas functional relationships among complex many-to-many paralogous relationships are much more difficult to infer. Numerous studies highlight the fact that such many-to-many relationships are common, complicating the extrapolation from characterized genes in model species to candidate genes in other species [1][2][3][4]. Identifying the nature of duplications is also important for investigating how different gene families evolve. One might expect that some gene families are arranged in a genome in such a way that paralogous duplications (and gene losses) are common, while other gene families have little tolerance for this sort of high turnover. For example, an important group of plant genes involved in pathogen resistance, the nucleotide binding site -leucine rich repeat (NBS-LRR) gene family, undergoes evolution through recombination and gene birth and death within large NBS-LRR clusters [5][6][7][8]. An example of a very different pattern (with little clustering and low rates of gene birth and death) is the gene family comprised of the 20S proteasome subunits (described later in this paper and [9]). Characterizing gene families in these terms requires identifying genes as having paralogous or orthologous origins.
A slightly different comparison can be made within a genome that has undergone polyploidy. Although any homologous genes in such a genome are technically paralogs, not all paralogs are equal: some originiate due to the polyploidy event, and others arise through other gene duplication mechanisms either before or after polyploidy. In a genome such as Arabidopsis thaliana, a history of polyploidy is greatly complicated by multiple rounds of (whole or partial) genome duplication, followed by extensive losses, rearrangements, and degradation of homoeologous (duplicated) regions [10][11][12][13][14]. Homologous genes within a single genome can be described as "segmental duplicates" (if they arose through polyploidy or other duplication of large genomic segments), or "tandem duplicates" (if they arose through unequal crossingover), or "other duplicates" (for all other cases, such as ectopic duplications). Identifying nodes in a gene phylogeny as arising from segmental duplications can help to determine whether a set of internal duplications were part of a larger genomic duplication, or whether they came from independent, localized events. In turn, this type of characterization suggests mechanisms of the past and ongoing evolutionary mechanisms underlying the development of a gene family. Similarly, we can predict relative ages for the birth of various groups of genes by identifying whether gene births occurred before or after a particular segmental duplication. The same approaches can also be used in comparisons of two species to describe evolutionary patterns of gene births and deaths in a gene family relative to speciation timing.
We describe a method for integrating comparative genomic positional information and gene phylogenies to infer which nodes in a phylogeny were due to (1) speciation or internal genomic segmental duplications ("ortho-") and (2) which were due to tandem gene duplications ("para-") or other mechanisms. The method, implemented in a suite of three programs called DiagHunter, OrthoMap, and ParaMap (or OrthoParaMap (OPM) as shorthand for the suite), consists of identifying conserved, collinear regions in a two-way genome comparison ("diagonals" in a dot plot), calculating a gene family phylogeny, mapping the gene family onto the genome comparison, mapping the gene family/genome/genome comparison back onto the phylogeny, and inferring internal nodes at which segmental or tandem duplications probably occurred. By way of nomenclature, in the comparative genomic literature, large chromosomal regions that are similar in content and organization are frequently referred to as "synteny blocks" [15][16][17][18][19] or (in the case of a comparison of a genome to itself), "segmental duplications" [10,12,13,20]. The terms "homology" and "collinearity" are also used to describe such regions, but we will reserve "homology" to refer to gene relationships.
The problem of integrating phylogenetic and comparative genomic positional data bears some resemblance to the problem of combining information from species and gene phylogenies in order to distinguish orthologous from paralogous duplications. In a two-species gene tree, in the absence of additional gene duplications or losses, all genes should occur as ortholog pairs. In a gene tree with N species, all clades should, after some point, contain N genes (one for each species). Under the assumption that these kinds of orthologous relationships are more common than gene duplications and losses, several algorithms and program implementations have been developed to infer likely duplications and losses, given a gene phylogeny and a species phylogeny. Two programs that take this approach are GeneTree [21,22] and RIO (Resampled Inference of Orthologs) [23,24]. These "tree reconciliation" methods successfully identify likely orthologies, but do not make use of gene positional information to infer duplication mechanism, and do not distinguish segmental from tandem duplicates within a single polyploid genome. These are both objectives of OPM.
We describe an application of OPM to a gene family whose members reside primarily in the mammalian X chromosome: the melanoma-associated antigen (MAGE) genes in the mouse and human genomes [25,26]. The results show that several groups of genes within this gene family have followed very different evolutionary histories. We also apply OPM to two gene families from Arabidopsis thaliana, relating these to internal duplications within this genome. The approach shows that the selected gene families have followed strikingly different evolutionary trajectories, and also helps to characterize and confirm the relative age of a putative polyploidy event that gave rise to segmental duplications in the Arabidopsis genome.

Implementation
The three main programatic steps in the OPM process, DiagHunter, OrthoMap, and ParaMap, require algorithm descriptions. All programs were implemented in Perl. Parts of the suite make use of the BioPerl libraries [27] and the GD graphics module [28].

DiagHunter
Syntenic regions could, in principle, be identified using any of several genomic comparison programs, including PipMaker [29], MUMmer [30], VISTA [31], GRIMM s [32], BLASTZ [33], FORRepeats [34], or REPuter [35]. None of these, however, meet all of the critical criteria of 1) identifying both small and large (multi-megabase), contiguous or interrupted syntenic regions, 2) identifying synteny blocks with diverse data sets and genomes; 3) being freely available; 4) providing simple text output of gene pairs and coordinates of diagonals. Thus, we developed Diag-Hunter to meet these needs, and to be a part of the OPM distribution. Nevertheless, it should be emphasized that any program that can be adapted to produce gene pair coordinates appropriately can also be used as the first step in the OPM process. This might be advantageous if in specific cases a program has been well tested with a particular data set, for example. DiagHunter will be described briefly here. The algorithm and measures of sensitivity and selectivity are described in more detail a software report in Cannon et al. [36], and in the distribution at http:// www.tc.umn.edu/~cann0010/Software.html.
Briefly, the synteny-identifying algorithm of DiagHunter walks through a pre-computed array of filtered similarity hits. The array is an M × N matrix of coordinates (x i , y j ), where x i and y j are gene "midpoint" ((5' + 3')/2) positions in genomes of sizes M and N. All coordinates are scaled by some factor to bring hits closer together (a parameter that depends on the average gene density in the genomes to be compared). At each hit, the algorithm checks in the neighborhood for hits that might either be other members of "direct" or "inverted" repeats, choosing the nearest and most favorable positions first. Once a candidate diagonal has been initiated, only hits with appropriate orientations are considered. The program follows these chains recursively, checking up to 75 possible positions in the vicinity for each step. Each position contributes a score from a scoring matrix that gives the best (lowest) scores to the nearest and most-nearly-diagonal hits. At each step, a running-average score for the candidate diagonal is computed. The program repeats the search to extend the current diagonal until a score threshold is passed or until no other candidate hits are located. If the diagonal meets the selection parameters (numbers of hits in the diagonal and average diagonal score), then it is retained, and all hits identified in this search are removed from subsequent searches. Then the program starts walking again from where it started the diagonal. Sensitivity is increased, and time to process the data is decreased, if the sparse hits are brought "closer together" by compressing the original matrix (a DiagHunter parameter). If this sort of compression is performed, then the original coordinates are recovered at the end.

OrthoMap
Conceptually, this program works by identifying threeway intersections between pairs of gene family members and diagonals. Such intersections require that genes be sufficiently similar and that they have the same genomic contexts in both genomes (or in both copies of a genomic segmental duplication, in the case of a comparison of a genome with itself). For the genome self-comparison, such homologs would technically be paralogs, but as mentioned in the introduction, these share some characteristics of orthologs -and to simplify the discussion in this section, will be referred to as orthologs or ortholog candidates.
The "three-way intersection" can be visualized in terms of a dot plot comparing two chromosomes. If two genes in a gene family (one from each chromosome) "hit" within a diagonal (representing a syntenic region), then a natural assumption is that the two genes behaved like their neighbors in both chromosomes -in other words, that they are part of that diagonal, and split from one another at the same time as speciation occurred (or polyploidy or segmental duplication occurred, in the case of a comparison of a genome with itself). More specifically, the program reads a hash of "syntenic gene pair" coordinates generated by OrthoMap, then reads the positions of genes in the gene family, then BLASTs the genes in the gene family against one another, and checks whether sufficiently strong hits (coordinate pairs from the gene family) are found in or near a diagonal. "Sufficiently strong" and "near" are parameters that are specified by the user. In the analyses described here we used a BLASTP [37] E-value cutoff for the gene family of 10 -25 , and tested "near" cutoffs ranging from 10 kb to 250 kb in tests with mouse × human and Arabidopsis × self. On the basis of other tests for rates of gene duplication by distance (not shown), 50 kb was chosen as an appropriate parameter in Arabidopsis. The same parameter was used in the human -mouse comparison, because it was sufficient to identify most obviously clustered MAGE genes -although for general purposes, 50 kb may be conservative given the low gene density in the mouse and human genomes. More lenient values for the "near" cutoff have a beneficial effect of including more tandem duplicates as ortholog candidates, and a detrimental effect of (potentially) falsely identifying some gene pairs as members of diagonals. Identifying several tandem duplicates as ortholog candidates does not also prevent these from being identified later as paralogs by ParaMap.
Once ortholog candidates have been identified, OrthoMap reads a phylogenetic tree file, and appends the diagonal names onto gene names in the gene phylogeny. To do this, the program takes advantage of the "extended New Hampshire" or NHX phylogeny format [38], which can include additional annotation tags at genes or internal nodes. Trees with this format can be viewed or re-annotated with the ATV application [38]. Nodes probably giving rise to orthologies are easy to spot: these are the most recent common ancestors between two genes or groups of genes that have been identified by OrthoMap as probable orthologs.

ParaMap
Local gene duplication appears to be common in many gene families. Inferring at which point in a gene phylogeny a local duplication has occurred is tedious, and is made difficult by the fact that relationships may be nontransitive (A is close to B and B to C, but A is not directly close to C), several genes in a clade may be genomically close but not appear to be near one another in a gene tree, and the definition of "close" may need to be changed for different genome comparisons. The ParaMap program reads a rooted, bifurcating tree file in which gene positions have been appended to gene names. It reads the tree from root to tip (from most ancient common ancestor out to individual genes), and recursively asks whether any two subtrees contain genes that are genomically near. If so, and if the depth in the tree is no greater than the depth specified as a parameter (for example, half of the average tree height from root to tips), then that node is annotated as a candidate origin of a local duplication. A final script combines NHX tags from "ortholog" and "paralog" trees to produce a tree with orthologous and paralogous duplications identified.

Overview of Results and Discussion
In this section, we describe 1) objectives of the OPM suite; 2) application of the method in the MAGE gene family in mouse and human 3) application of the method to the 20S proteasome gene family in Arabidopsis; 4) application of the method to another gene family, the "major latex proteins" in Arabidopsis. 4) discussion of performance.

Objectives of the OrthoParaMap suite
Briefly, the OPM procedure consists first of pre-processing of genomic data to obtain BLAST [37] (or other similarity) scores between all genes in two genomes, followed by a search for syntenic regions, then gene family phylogeny construction, and finally "mapping" of the gene family onto the genome comparison and back onto the phylog-eny to identify potential orthologs. All scripts and programs are freely available at http://www.tc.umn.edu/ cann0010/Software.html. These steps are as follows (details are in the Methods section): • Construct a fasta file that contains ID lines with unique gene names, species and chromosome identifiers, and gene positions.
• BLAST all gene sequences from two genomes against one another (or in the case of the Arabidopsis example, against itself), and parse to give a similarity or "hit" matrix.
• For a gene family of interest, identify all genes from the genomes to be compared.
• Construct final, trimmed alignments, and a phylogeny.
• Plot gene family members in both genomes, calculate similarities between gene family members from both genomes, and map the gene family similarities onto the genome comparison. This was done with OrthoMap. The general approach consists of identifying three-way intersections between pairs of gene family members and synteny blocks from two genomes: these represent probable orthologs.
• Infer nodes responsible for orthologous gene duplications or paralogous duplications. This is done for the paralogs that appear to have arisen through local duplications, using the ParaMap program. This recursively walks through the tree, identifying internal nodes that give rise to genes or other nodes that are physically near one another on the chromosome.

MAGE family example in mouse and human
The melanoma associated antigen (MAGE) gene family [25,26] serves as a convenient and informative case study for several reasons. Most of the 20+ gene family members come from the X chromosome in both mouse and human. Because X is a sex chromosome, the MAGE and other genes on this chromosome have essentially been trapped on the chromosome (in the human genome, we identified 24 MAGE genes on the X chromosome and four on chromosomes 3 or 15; and in the mouse genome, we identified 19 MAGE genes on the X chromosome, and four on chromosomes 7 or 19). Rearrangements within this chromosome have also not been extensive [16,39,40], so synteny and collinearity have been retained to an unusual degree between the mouse and human X chromosomes. The MAGE gene family also nicely illustrates at least two distinct evolutionary patterns.
We used the OPM programs first to identify syntenic regions between the mouse and human X chromosomes. We then used these syntenic regions to identify orthologous and paralogous duplications that have given rise to the mouse / human MAGE phylogeny. The results are shown in Figure 1, which shows predicted syntenic regions for a portion of the comparison between mouse and human X chromosomes.
If two MAGE genes are found within a syntenic region in the two genomes (a diagonal feature in the dot plot), these should simply be one of the hits or dots in the diagonal that represents this region. Because local insertions, transpositions, deletions and tandem duplications are common in genomes, it is important also to consider hits that may fall slightly off the predicted diagonal. This is done by considering hits that fall within a specified "offdiagonal threshold" for a vertical and horizontal distance in which a gene may be considered to be part of a diagonal. With a threshold of zero, ten pairs of MAGE genes are identified as having a speciation origin. This means that these genes were also identified by DiagHunter as being on a diagonal path, and are therefore parts of the synteny block predicted by that algorithm. Because small inversions and duplications are common, a higher off-diagonal threshold is desirable (though this parameter depends on features of the genomes being compared, including gene density and time to speciation, affecting amount of rearrangement and duplication). The analysis in Figure 1 used an off-diagonal threshold of 50 kb, which resulted in 17 pairs of MAGE genes being identified as having a speciation origin.
Once pairs of genes in the gene family have been mapped to predicted synteny blocks, those pairs can also be identified in a phylogeny for that family. This can be done conveniently using ParaMap, taking advantage of the "extended New Hampshire" format (NHX), described in [38]. The extended format allows additional tags to be associated with nodes or sequence names. OrthoMap determines the intersections of genes and diagonals (synteny blocks or segmental duplications), and then inserts NHX tags that contain diagonal names and chromosome identifiers, into a phylogenetic tree for the gene family. For example, at the top of Figure 1, there are two clades that each contain one Hx and one Mx sequences (standing for human X and mouse X chromosomes). Each of these would be natural ortholog candidates, simply on the basis of this phylogenetic pattern (simple one-to-one relationships, with one human sequence corresponding to one mouse sequence). The 14b and 14a tags on these sequences indicate that these came from diagonal 14 (out-side the portion of the dot plot shown in Figure 1, but a similar case is shown for diagonal 9, with members near the center of the tree). The human sequence comes from the "b" (vertical) axis, and the mouse sequence comes from the "a" (horizontal) axis. There are also three other cases of likely speciation origin for MAGE genes outside of the X chromosome. These are apparent as three H15/M7 doublets, from syntenic regions on human chromosome 15 and mouse chromosome 7 (data not shown, but synteny relationships depicted in [16]). In both the comparison of MAGE genes on the X chromosomes and on 15 and 7, The combination of positional data and phylogenetic context makes it clear that many of these gene pairs are very likely to be orthologs. Once the diagonal-name tags have been added, it is a simple matter to use ATV to manually add "S" tags (for speciation or segmental duplication, depending on the genomic comparison, and distinct from tandem duplications) at internal nodes.
The last major step is to infer which internal nodes may represent local or "tandem" gene duplications. This does not require two-way genome comparisons, because tandem duplication occurs in only one genome, but the inference cannot be made from a straightforward visual inspection of the phylogeny, even if positional information is included in the phylogeny. The reason is that all pairs of sequences within subtrees (defined in terms of a portion of the total tree depth or in absolute distance terms) need to be evaluated for "closeness," and the most recent common ancestor of "close" sequences in a clade then needs to be flagged as a candidate origin of a tandem duplication. These candidate nodes are indicated in the phylogeny by "t" at tandem duplication nodes. Two such nodes are highlighted in the top half of the figure, and many are present in the bottom half of the figure. The two tandem duplications in the top half are interesting because each significantly predates speciation duplications in duplication blocks 14 and 9.
Duplication patterns in the bottom portion of the tree are dramatically different than in the top or middle of the tree. The bottom-most clade of 12 human genes has been described as MAGE family A [25,26], which maps to three adjacent clusters within 3.5 Mb centered near chromosomal band Xq28 [25]. The next clade of 4 human genes has been described as MAGE family B, which maps to a cluster at chromosomal band Xp21 [41]. ParaMap identifies likely tandem duplication origins for most of these genes. All fall into three clusters within a 3.5 Mb region. Likewise, all of the genes in the MAGE B group are identified by ParaMap as originating through tandem duplications. OrthoMap identifies the members of family A as "orthologous"to five paralogs in mouse (using a relaxed definition of orthology that allows for many-to-many relationships for genes in two species). It should be noted A comparison of the mouse and human X chromosomes and the MAGE gene phylogeny Figure 1 A comparison of the mouse and human X chromosomes and the MAGE gene phylogeny. Dot plot on the left shows the mouse X chromosome on the horizontal axis and the human X chromosome on the vertical axis, with the 5'-end of both chromosomes at the upper left corner of the dot plot. Syntenic regions predicted by DiagHunter [36,72] are highlighted in blue and numbered in the order that the program identified them. The locations of MAGE [25,26] genes are shown with short green lines on the axes. Where two of these genes from mouse and human intersect with a diagonal, they are highlighted with bulls-eyes, both on the diagonal and both axes. These points represent candidate orthologs. OrthoMap uses these diagonal names to annotate the phylogeny, shown on the right. Names in the phylogeny have the form "Mx-0035427_69351 9a". First character indicates Mouse or Human; second character or digit(s) indicates chromosome number (with u being undetermined); middle digits (after the dash) are the last seven digits of the Ensembl gene ID; digits after the underscore are the gene midpoint position in Kb; and last characters (e.g. 9a) correspond to diagonal numbers from the dot plot on, with a or b signifying horizontal or vertical axis/chromosome origin, respectively. In the tree, S indicates inferred speciation, and t indicates inferred tandem duplication (as inferred by ParaMap). Lines drawn between the middle dot plot and nodes in the phylogeny show where segmental duplications have been "mapped" between the genomic dot plot and the phylogenetic analysis. Two cases of ancient (pre-speciation) tandem gene duplications are indicated on the tree, as are cases of tandem duplications that have occurred in mouse and human after speciation.

PAM units
Bootstrap support: that probable recombination and gene conversion in these regions may decrease certainty (and bootstrap values) in this portion of the tree. Clearly, this group of genes is evolving much more rapidly than those at the top of the phylogeny -both in terms of gene births through tandem duplications, and in terms of rate of change in coding sequence.
Examples of known functions for some family members suggest ways that biological roles may have shaped the evolutionary history and vice versa. The two topmost human genes in the phylogeny, Ensembl gene IDs 67445 and 102316, are the MAGE-D1 and MAGE-D2 genes. These appear to play key developmental roles in the brain [42,43] -and therefore, might be expected to be highly conserved. The large cluster of human MAGE A genes, centered near Xq28 [25,26], have been found to be highly expressed in tumors of various histological types [26], particularly in melanoma and breast carcinoma cell lines [44]. Proteins coded by these genes are also the targets of autoantibodies from patients with systemic lupus erythematosus, and so may play a role in autoimmune diseases [44]. A natural speculation is that proteins important in both autoimmune disease and cancer-recognition might require (or acquire) a nimble evolutionary strategy, analogous to the clustered and rapidly-evolving Major Histocompatibility Complex (MHC) gene family in mammals [45,46] -and in contrast to the MAGE-D1 genes that are involved in early brain development.

Arabidopsis 20S proteasome comparison and internal genomic duplications
Making sense of the relationship between gene function and evolution in the 20S proteasome gene family requires an understanding of the structure of the proteasome. The proteasome is responsible in eukaryotes for recycling proteins through degradation of ubiquitin-tagged proteins [9,47]. The proteasome is a large complex, consisting of a 28-subunit catalytic cylindrical structure, called the 20S proteasome, and an ATP-dependent 19S "regulatory particle," consisting of an additional set of approximately 18 subunits [48]. The combination of regulatory particle and 20S proteasome constitutes the 26S proteasome. The 20S proteasome (the catalytic core the 26S proteasome) is made up of four stacked rings. The two inner rings are each composed of seven 20S beta polypeptides, and these rings are sandwiched between two alpha rings each composed of a ring of seven alpha 26S alpha polypeptides, giving an alpha 7 beta 7 beta 7 alpha 7 structure [9]. Given this type of arrangement, it might be expected that the minimum number of kinds of proteins making up the 20S proteasome would be two: alpha and beta. Alternatively, each of the seven alpha and beta subunits might be somewhat different from one another, requiring 14 kinds of protein to make up the 20S proteasome. The simpler arrangement is seen in the archeon Thermoplasma acidophilum, which has a single alpha-type subunit and a single beta-type subunit [49,50]. The more complex arrangement is seen in yeast (Saccharomyces cerevisiae), which has seven distinct alpha-type and seven beta-type subunits [51].
In the Arabidopsis 20S proteasome there are 23 genes encoding subunits of the 20S proteasome [47,[52][53][54] but why 23 rather than 14? The phylogeny in Figure 2 suggests the answer. There are 14 groups of 20S proteasome subunits, in two groups of seven. Most (18) of the 23 proteins occur in pairs, and five are singletons. As might be expected, the two groups of seven correspond with the alpha-type subunits (top of Figure 2) and beta-type subunits (bottom of Figure 2) [9,53]. It appears that there are two nearly complete sets of alpha and beta subunits in Arabidopsis. OPM identifies seven of the nine pairs of subunit types as having segmental duplication origins. Furthermore, all seven of the duplication blocks involved likely occurred during the same polyploidy event. By the analysis of Blanc et al. [10,55], these duplicated segments all come from a probable round of polyploidy that they estimate occurred before the Arabidopsis/Brassica rapa split and probably during the early emergence of the crucifer family (24-40 Mya) [10]. The timing of this polyploidy episode is in agreement with other recent analyses [11,14,17]. Ermolaeva et al. [14] place this event at roughly 30 -35 Myr, after the Brassicaceae / Malvaceae split [17], and before the Arabidopsis/Brassica split.
This relative timing of duplications in the Arabidopsis 20S proteasome gene family is supported when sequences from other plant species are considered. The bottom of Figure 2 shows two Arabidopsis proteasome subunits and probable orthologs from soybean, Medicago, tomato, potato, and maize. These sequences come from TIGR EST "Tentative Consensus" sequences (TCs) [56], so are inherently error-prone (i.e., may contain mis-reads or misassemblies). Nevertheless, in this highly expressed family of relatively small protein subunits, TCs do appear to be of high enough quality to use for approximate phylogenetic work: most TCs from these species in this gene family cover the full gene length, and most of the 14 20S subunits do have at least one representative from each included species. In the sample shown at the bottom of Figure 2, the two Arabidopsis sequences group together, as do the Medicago and Glycine sequences and the Lycopersicon and Solanum sequences. Furthermore, the Zea sequences are placed basally in the phylogeny, as would be expected of any monocot. Likewise, the solanaceous and legume sequences placements recapitulate the species phylogenies for these taxa. Though not all clades of orthologs in the phylogeny are this tidy, this example is generally typical of the remaining multi-species clades for 20S proteasome gene family from Arabidopsis Each dot represents the coordinates of two proteins with bit score cutoffs of 560 (expect value of 10 -68 for these data). Red dots indicate homologous proteins with the same orientations in both chromosomes, and blue indicate proteins with opposite orientations. This information is used by DiagHunter to predict synteny blocks, which are reported as text coordinates and as images (lower left). Where two 20S proteasome members intersect with a diagonal, they are highlighted with bulls-eyes, both on the diagonal and on both axes. In this version, other hits are also highlighted, with percent identity indicated using a yellowto-black color scheme (black = 100% identity). Hits between gene family members and a diagonal represent candidate orthologs. OrthoMap uses these diagonal names to annotate the phylogeny, shown on the right. Gene names have the form "At3g22630_8010 1b", where first nine characters are the Arabidopsis Genome Initiative name; the number after the underscore is the position in kb on the chromosome (indicated after 'At'), and the orange numbers/letters after the space indicate the diagonal name from a chromosome comparison with that gene. Nodes giving rise to tandem gene duplications would be inferred by ParaMap and shown by a red 't' (none were found in this phylogeny). Nodes giving rise to segmental duplicates were manually inferred using the OrthoMap tags and annotated using the ATV tree-viewing program [38]. The small phylogeny at the bottom shows the positions of EST consensus sequence homologs from soybean, Medicago truncatula, tomato, potato, and maize (see text). These help to pinpoint when the segmental duplication occurred in this clade in Arabidopsis.  the gene family. For example, eight of the nine Arabidopsis 20S gene pairs are reciprocally their phylogenetically nearest neighbors (and in the one exception, a Solanum sequences nests with the two Arabidopsis sequences). This argues strongly for common origin of the 20S proteasome duplicates in Brassicaceae, after the split from Solanaceae and other dicot families -and consistent with current estimates for this Arabidopsis polyploidy episode [10,17].
The scenario suggested by the combination of biological, phylogenetic, and genome contextual information from OPM is that following polyploidy, roughly 20-40 Mya [10,11,14,17], nearly all of the members of the "extra" proteasome subunits have been maintained. Therefore, in the current Arabidopsis, there are two (nearly) complete sets of 20S subunits. Though there appear to have been five losses in the gene family since polyploidy, no "double losses" have been tolerated -which would have brought the number of alpha or beta subunits below seven. It is also worth noting that we see no tandem (local) duplications in the gene family. The low rates of loss or duplication in the gene family, apart from one probable round of whole-family duplication, suggest that maintenance of the stoichiometry of the 20S components is important, and that there is a significant cost to the loss or duplication of a single component. This pattern of low rates of gene duplication or loss in highly conserved, multi-subunit proteins contrasts with the following example, which shows rapid turnover of gene family members, including loss of major gene lineages in some plant families.

Arabidopsis Major Latex Protein gene family and internal genomic duplications
The Major Latex Protein (MLP) family encodes proteins that were originally isolated from the latex of opium poppy, with high levels of RNA expression in poppy capsules and lactifers [57,58]. MLPs are have also been found in a wide range of plants and tissues ( [59] and references therein). Functions of MLP are not known, but the MLPs do show significant similarity to a group of intracellular pathogenesis-related (IPR or PR10) proteins [60]. The IPRs typically show increased expression following wounding, pathogen attack, or stress, and several have been shown to have antibacterial, antifungal, or ribonuclease activity [60][61][62][63]. The two gene families (MLP and IPR-PR10) show only about 25% identity, but have similar structures, sizes, and pI's, and sequence and structural analyses indicate that they are similar enough to be considered to be part of a single superfamily and to be included in a single phylogenetic analysis [60].
The top phylogeny in Figure 3 shows all MLP homologs in Arabidopsis, with the same OPM analysis as described in the MAGE and 20S proteasome gene families. The bottom phylogeny also includes MLP homologs from Glycine, Medicago, and tomato, with a rooting at approximately the node leading to the IPR/PR10 subfamily (not shown). Interestingly, there are no Arabidopsis homologs that group with the IPR/PR10 subfamily [60].
Differences between the MLP and proteasome families are immediately apparent. The MLP phylogeny is "bushy" and uneven, in contrast to the regular, deeply divided, paired structure in the proteasome family. Also, while the proteasome family has no tandem duplications and at least seven (and perhaps nine) segmental duplications, the MLP family has 11 tandem and three segmental duplications. Distances to predicted segmental duplications are greater in the MLP than in the proteasome family. In the MLP family, protein distances to segmental duplications range from about 15 to 60 PAM units [64], but in the proteasome 20S family, range from 0 to about 4 PAM. Nevertheless, the MLP duplications do appear to come from the same polyploidy event as was observed in the proteasome phylogeny. By the analysis of Blanc et al. [10,55], the three duplication blocks identified in Figure 3 are all part of the "recent" polyploidy event that occurred early in the evolution of the Brassicaceae. In fact, duplication blocks 10a/21b in clades B and C in Figure 3 are part of the same duplication complex on chromosome 1 where two of the proteasome 20S duplications reside (blocks 7a/23b from the alpha subunits, and 8a/22b from the beta subunits).
Clearly, the MLP members have been evolving much more rapidly following polyploidy than have the proteasome 20S subunits. This evolution has consisted of wholegene duplications as well as nonsynonymous changes, as is apparent from the 11 tandem duplications among the 24 MLP members.
As with the proteasome subunits, adding sequences from related taxa helps to provide evolutionary context. For example, it is possible to determine whether predicted segmental duplications may have occurred before the split between Fabaceae and Brassicaceae (not expected, if the segmental duplications are from the more recent polyploidy event rather than a more ancient event). In fact, all legume (and, for that matter, tomato) sequences have a basal placement relative to segmental duplications in the Arabidopsis clades, supporting the hypothesis of a more recent polyploidy in the Arabidopsis genome. The use of sequences from several species also provides some indication of rates of gene birth in the MLP family. The Glycine, Medicago, and tomato sequences are all EST-derived, so are subject to under-sampling and/or sequence errors, but do provide at least crude indications that genes have duplicated in particular lineages (and have probably been lost from others) following separation of these plant families.
The "major latex protein" (MLP) gene family from Arabidopsis and three comparison species Figure 3 The "major latex protein" (MLP) gene family from Arabidopsis and three comparison species. The "major latex protein" (MLP) gene family from Arabidopsis is shown in the top phylogeny. In the bottom phylogeny, the same gene family is shown, but including sequences from tomato, soybean, and Medicago truncatula are also included. The top phylogeny was constructed using one of nine equally parsimonious topologies from a maximum parsimony search, followed by branch length estimation using Tree-Puzzle. The bottom phylogeny uses the consensus of 22 equally parsimonious topologies from a maximum parsimony search, followed by branch length estimation using Tree-Puzzle. Gene nomenclature is the same as described in Figure 2. Both trees were rooted at the approximate location of the connecting branch to the intracellular pathogenesis-related (IPR or PR10) proteins (not shown).

Discussion of performance
There are several areas in which the OPM approach and implementation might be improved. The identification of syntenic regions using DiagHunter currently is strongly dependent on parameters that will differ from genome to genome. For example, the much lower gene density in the mouse and human genomes than in Arabidopsis meant that we needed to compress the mammalian "hit matrix" more than for the Arabidopsis hit matrix. Evaluations of DiagHunter specificity and selectivity are described at [36]. There is also no built-in statistical evaluation of parameters. This might be improved using a maximum likelihood or Bayesian framework.
The assignment of gene pairs to diagonals is quite straightforward, though is subject to some sources of error. A pair of genes might by chance have a hit that is near a diagonal, or might be inappropriately be judged to be outside of a diagonal, perhaps because of local rearrangements in one genome. The probability that a gene falls within any given region can be though of as a Poisson process, so it should also be a Poisson process for the coordinates of a pair of genes to randomly fall within a specified region (the space near a diagonal) in a two-dimensional comparison of genomes. The larger the gene family in comparison to the proteome, the more likely it is that a false positive will be identified. These statistical analyses could potentially be incorporated into the analysis, however, determining the sources of most (probable) errors is essentially an empirical task, and will differ between genome comparisons. For example, ongoing transposing duplications can occur in Arabidopsis, but cannot occur between mouse and human, so the amount of background noise should be inherently higher in the single-genome comparison -making it difficult to produce general estimates of error. Some, but not all, false positives or false negatives will be apparent in the context of a phylogeny. For example, in the 20S proteasome, it appears that two duplications in the alpha subunit subfamily probably originated as part of the same duplication that gave rise to the rest of the paired genes. Presumably, these come from duplicated regions that were too small to be picked up by DiagHunter.
An additional potential source of error is in the inference of tandem duplication origins (by ParaMap), or of segmental duplication origins. Both inferences depend on the correctness of the phylogenetic reconstruction. The MLP trees in Figure 3 illustrate the problem. These clustered, rapidly-evolving sequences, have probably undergone recombination and conversion -processes that will tend to make a phylogeny more uncertain [65]. This is evident in the number of branches with poor bootstrap support the MLP Arabidopsis tree, and in the alternate (equally parsimonious) solutions in clade A in the top phylogeny and the bottom multi-species phylogeny. In the lower tree, At1g14960 is placed basally in the clade, but in a derived position in the upper tree. This poses a dilemma as to which node was the actual segmental duplication origin. What is more certain is that there was some segmental duplication event near the base of this clade. There are similar difficulties in the assignment of tandem duplications. One way to approach this problem would be to calculate duplication origins for resampled bootstrap data, and then to assign segmental or tandem duplications and the frequencies of observed duplication origins to a consensus tree. The current program (ParaMap) does not currently perform these kinds of bootstrap calculations. Given this limitation, therefore, it is important to keep in mind that inferences about internal character states (duplication origins) are, in fact, reconstructions, so trees should be interpreted as possible and not absolute explanations of the data.
How do the OPM results compare with other phylogenybased orthology identification programs such as GeneTree [21,22] or RIO [23,24]? RIO asks how often, in phylogenies calculated from a resampled gene family alignment, genes from different species appear to be paralogs or orthologs. GeneTree attempts to identify a minimum number of gene duplications needed to reconcile a gene tree with a species tree. Examining GeneTree in more detail with the MAGE data for Figure 1, the species tree has only two terminal nodes: mouse and human. Figure 4 shows a portion of the MAGE phylogeny from duplication region 9, with the OPM results on top and the GeneTree results on the bottom. In the GeneTree figure, duplications are nodes with squares, speciations are nodes without squares, and additional inferred genes are in gray. GeneTree and RIO generally appear to do a better job of identifying probable gene duplications and speciations than OPM. In this example, GeneTree identifies five probable duplications and four probable speciations vs. whereas OPM identifies three tandem duplications and two speciations. However, OPM also provides different, complementary information: it suggests mechanisms of gene duplications by taking into account gene location and synteny information, it provides a means of mapping phylogenetic data into a comparative genomic view and vice versa, and it provides data that can be used to test predictions of orthology. OPM can also be used to describe patterns of gene family evolution within a single genome, as was shown in the MLP and proteasome examples.

Conclusions
We have described a suite of programs called OrthoPara-Map (OPM) that combines comparative genomic positional information and phylogenetic reconstructions of gene families to identify which gene duplications are likely to have arisen through internal genomic duplications (such as polyploidy), or through speciation, or through local gene duplications. We described the application of the software using three examples: the melanoma-associated antigen (MAGE) gene family on the X chromosomes of mouse and human; the 20S proteasome subunit gene family in Arabidopsis, and the major latex protein gene family in Arabidopsis. In the MAGE family, the software effectively identifies orthologs and paralogs, and highlights parts of the gene family that have undergone rapid evolution and expansion since speciation, as well as parts of the family that are highly conserved. Tandem duplications are evident both before and after speciation. In the two examples from Arabidopsis, OPM identifies duplications that occurred as a result of polyploidy and those that occurred due to local gene duplications, illustrating strikingly different evolutionary patterns in the two gene families. Figure 4 Comparisons of approaches of GeneTree and OrthoParaMap. The top tree is a clade from the middle of the MAGE phylogeny in Figure 1. The bottom tree shows the "reconciled" species and gene family tree predicted by GeneTree [22]. Gene duplications in the GeneTree prediction are indicated with small squares, and all other nodes are predicted speciations. Inferred sequence losses are shown in gray. The two procedures provide complementary results: OPM identifies which genes are part of synteny blocks or clusters, and GeneTree predicts some additional gene duplications or losses based on speciation and gene family trees.