 Proceedings
 Open access
 Published:
Identification of conserved gene clusters in multiple genomes based on synteny and homology
BMC Bioinformatics volume 12, Article number: S18 (2011)
Abstract
Background
Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. We build upon the series of works on gene teams and homology teams.
Results
Our primary contribution is a local slidingwindow SYNS (SYNtenic teamS) algorithm that refines an existing family structure into orthologous subfamilies by analyzing the neighborhoods around the members of a given family with a locally sliding window. The neighborhood analysis is done by computing conserved gene clusters. We evaluate our algorithm on the existing homologous families from the Genolevures database over five genomes of the Hemyascomycete phylum.
Conclusions
The result is an efficient algorithm that works on multiple genomes, considers paralogous copies of genes and is able to uncover orthologous clusters even in distant genomes. Resulting orthologous clusters are comparable to those obtained by manual curation.
Background
Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. It is often suggested that regions with similar gene content among different species are evidence for phylogenetic relationship and trace through evolution the inheritance of function from a common ancestor. Within one genome, the presence of large duplicated blocks may be due to the ancient largescale or whole genome duplication, while presence of segments with homologous genes, named conserved gene clusters in multiple genomes more likely indicates an evolutionary constraint for a functionally related group. Our primary contribution is a local slidingwindow algorithm that starts from an existing protein family classification and produces two results: first, concerved gene clusters, and second, a subdivision of families into orhtologous subgroups. Our approach can be seen as using conserved gene clusters in order to sift through the family structure to uncover orthology. We evaluate the biological relevance of our approach on the example of Protoploid yeasts [1].
A number of studies indicate that regions of conserved homology among multiple species may result from functional pressure to keep these genes close, but it may also be conserved because the genomes under study have not sufficiently diverged. For the former, the most well known examples are that of operons in prokaryotes [2], but also the existence of functional interactions [3] and similar expression patterns [4] in closely located genes. For the latter, existence of conserved gene clusters is the computational basis for ancestral genome reconstruction [5] and search for ancestral homologs among genes in the same family [6]. Orthologs are homologous genes related by speciation [7, 8] which retain the same functionality as their common ancestors. Homologous genes related by duplication within one lineage are called paralogs and generally differ in functionality [9–12]. A number of papers introduce algorithms to compute conserved gene clusters and orthologous groups, see for example, [13–16]. These approaches vary on a number of parameters. First, there are authors who consider strictly conserved chromosomal segments with similar gene order and orientation [17–19]. Second, come the approaches where one considers conserved contiguous regions but without colinearity [13, 20]. Third, the authors relax the definition of conserved regions by allowing gaps [18, 21–24]. Four, paralogous gene copies within one chromosome are allowed in order to explore manytomany homologous relationships [13, 25]. Finally, some authors study the effect of varying the gap between adjacent neighbors [24, 26, 27].
In this paper we start from the notion of gene teams introduced in [28]. This model allows only one copy of a gene on a given chromosome. We relax this restriction by following the approach of homology teams defined in [13]. Furthermore, we set the gap threshold not only for adjacent genes, but by requiring the distance for any two genes considered as being neighbors to be smaller than a certain threshold. A similar choice was made in [20]. We call the obtained gene clusters synteny teams.
Our SYNS (SYNtenic teamS) algorithm refines existing families into orthologous subfamilies, by analyzing the neighborhoods around the members of a given family with a locally sliding window. This is done for all pairs of chromosomes in multiple genomes on which family members appear. The pairwise conserved contiguous segments are agglomerated by relying on a partial homology and biological criteria introduced in [1] between segments. This results in larger conserved segments that we call syntenic zones. We evaluate our algorithm on the existing homologous families for five genomes of the Hemyascomycete yeasts from the Genolevures database [29]. Indeed, there already exists a subclassification of these families into orthologous subfamilies [1] that has undergone expert validation and thus can be used as a reference point for the evaluation of biological relevance of our results. We further illustrate the results of our method for the particular case of the Pdrp (pleiotropic drug resistance proteins) phylogenetic subfamily of ABC transporters that has been manually analyzed in [6].
Methods
In this section we define the notion of unordered conserved gene clusters that allows for paralogous copies and gaps on multiple genomes. Following the work of [20, 30, 31], we allow one homologous gene to appear more than once in one chromosome. We refine the approach of homology teams [13] by distinguishing between orthologous and paralogous copies of genes. Large syntenic zones are built my merging clusters based on genes common among them instead of directly merging the ordered chains with overlapping families as in [32]. For mathematical notations and examples in a textual format we follow [28].
Definition 1 A chromosome is defined as a pair c = (Σ, G), where Σ = {f_{1}, f_{2}, …, f_{ m }} is the set of homologous families and G = (g_{1}, g_{2}, ..., g_{ n }) is an ordered sequence of genes. Each gene g_{ i } ∈ G is a couple (p_{ i }, f_{ i }), where p_{ i } is the position of gene g_{ i }on c and g_{ i }belongs to some homologous group f_{ i } ∈ Σ.
Here, Σ is the alphabet for any chromosome c and p_{ i } is an integer. When it is necessary to indicate to which chromosome belongs a given gene, this is done by a subscript: (p_{ i }, f_{ i })_{ c }.
Definition 2 Given a chromosome c, with two genes g_{ i } = (p_{ i }, f_{ i }) and g_{ j } = (p_{ j }, f_{ j }), the distance between g_{ i } and g_{ j } is defined by Δ(g_{ i }, g_{ j }) = p_{ i }– p_{ j }.
Example 1 Let c_{1}and c_{2}be two chromosomes over the same alphabet Σ = {f_{1}, f_{2}, f_{3}, f_{4}} of homologous families with genes on c_{1}being (1, f_{2}), (2, f_{1}), (4, f_{4}), (7, f_{3}), (8, f_{1}), and on c_{2}being (1, f_{1}), (2, f_{2}), (3, f_{2}), (4, f_{3}), (6, f_{4}). This is denoted by:
c_{1} = 〈f_{2}f_{1}*f_{4}**f_{3}f_{1}〉,
c_{2} = 〈f_{1}f_{2}f_{2}f_{3}*f_{4}〉.
Asterisks stand for genes that are unassigned to homologous groups; notice that * is not part of the alphabet Σ.
A gene subset G′ ⊆ G induces the subset of families Σ′ denoted by Σ(G′) such that f_{ i } ∈ Σ′ if and only if there exists g_{ i } ∈ G′ such that g_{ i } = (p_{ i }, f_{ i }). A set of genes G′ from the same chromosome, forms a chromosomal segment s = (Σ′, G′, c) with or without gaps. When it is clear from the context, we will assimilate a set of genes G′ with the corresponding chromosomal segment.
For example, in the case of G′ = {(2, f_{1}), (4, f_{4}), (8, f_{1})} and alphabet Σ′ = Σ(G′) = {f_{1}, f_{4}}, G′ defines a chromosomal segment with gaps on c_{1} = 〈f_{2}f_{1} * f_{4} * *f_{3}f_{1}〉. This segment G′ is noncontiguous on c_{1}; the gaps correspond to (5, *), (6, *) and (7, f_{3}).
Definition 3 A chromosomal segment s = (Σ′, G′, c) is contiguous if for any two genes g_{ i } = (p_{ i }, f_{ i }) and g_{ j } = (P_{ i }, f_{ j }) from G′ and any p such that p_{ i } <p <p_{ j }, either the gene g = (p, f) at the position p belongs to G' or this position corresponds to an asterisk. Otherwise, the segment is said to be noncontiguous For example, G′ = {(4, f_{4}), (7, f_{3}), (8, f_{1})} on c_{1} = 〈f_{2}f_{1} * f_{4} * *f_{3}f_{1}〉 forms a contiguous segment.
Synteny teams
Two genes g_{ i } = (p_{ i }, f_{ i }) and g_{ j } = (p_{ j }, f_{ j }) on the same chromosome are considered to be neighbors when Δ(g_{ i }– g_{ j }) <δ for a given threshold δ > 0. For a gene g_{ i }, we denote the set of neighbor genes N_{ i } to be centered around it, that is N_{ i } = {g_{ k } = (p_{ k }, f_{ k })  p_{ i } – ⌊δ/2⌋ ≤ p_{ k } ≤ p_{ i } + ⌊δ/2⌋}.
Definition 4 A chromosomal segment s is called a δ— segment if every pair of genes of s is separated by a distance smaller than δ, that is s = {g_{ i }  ∀g_{ j } ∈ s, Δ(g_{ i }, g_{ j }) <δ}. A window w is a contiguous δsegment.
Definition 5 We say that Σ′ ⊆ Σ is a δ— subset if there exists at least one δ—segment s′ = (Σ′, G′, c) such that Σ' = Σ(G'). We say that s' is the witness of this δ—subset.
Example 2 For δ = 3, the δ—subsets on chromosome c_{2} = 〈f_{1}f_{2}f_{2}f_{3} * f_{4}〉 are the following:
 {f_{1}, f_{2}} as witnessed by ((1, f_{1}), (2, f_{2})), ((1, f_{1}), (3, f_{2})), and ((1, f_{1}), (2, f_{2}), (3, f_{2}));
 {f_{2}, f_{3}} as witnessed by ((2, f_{2}), (4, f_{3})), ((3, f_{2}), (4, f_{3})), and ((2, f_{2}), (3, f_{2}), (4, f_{3}));
 {f_{3}, f_{4}} as witnessed by ((4, f_{3}), (6, f_{4})).
Definition 6 Let Σ be the set of homologous families over a set of chromosomes C. We say that Σ′ ⊆ Σ is a δ— cluster if Σ′ is a δ—subset for all chromosomes in some C′ ⊆ C, where C′ ≥ 2. We say that the set of genes
witnesses the δ—cluster Σ′.
A witness S is thus a set of all genes that participate in the segments witnessing the relevant (δsubsets. Let Σ and Σ' to be two (δclusters such that Σ ∩ Σ′ ≠ ∅. Let S and S′ be the corresponding witness sets. Denote by S_{∩} and the sets of genes in each of these witness sets that are members of the families in Σ ∩ Σ′.
Definition 7 A δ—cluster Σ is said to be a (δsynteny if (a) the corresponding witness set S has genes belonging to at least two different chromosomes and (b) there does not exist a δcluster Σ′ with a witness set S′ such that
Example 3 Let c_{1}, c_{2}and c_{3}be chromosomes as shown in figure1.
c_{1} = 〈f_{3} * *f_{5}f_{4}f_{1} * f_{2}* f_{5}f_{4}〉
c_{2} = 〈f_{1}f_{2} * *f_{3}*f_{4}f_{5}f_{1} * f_{5}〉
c_{3} = 〈f_{2} * f_{3} **f_{5}*f_{1}f_{4} * f_{5}〉
Let (δ = 3. We obtain the following nontrivial δ—clusters: {f_{4}, f_{1}}, {f_{5}, f_{1}}, {f_{4}, f_{5}, f_{1}}, {f_{1}, f_{2}} and {f_{4}, f_{5}} between c_{1}and c_{2}; and {f_{1}, f_{5}}, {f_{1}, f_{4}} and {f_{4}, f_{5}} between c_{1}and c_{3}. The nontrivial δsyntenies are {f_{4}, f_{5}}, {f_{1}, f_{2}}, {f_{4}, f_{1}}, {f_{5}, f_{1}} and {f_{4}, f_{5}, f_{1}}.
The superset inclusion in definition 7 implies that for the computational purposes there is no need to consider the smaller of the two sets and thus causes merging of the syntenies if the witness of one synteny is a complete subset of another in our algorithm.
Example 4 Let c_{1}, c_{2}and c_{3}be three chromosomes in figure2.
c_{1} = 〈f_{1} * *f_{4}* f_{6}f_{6}f_{7}f_{8}f_{9}〉
c_{2} = 〈*f_{5}* f_{3}f_{6}* f_{2}f_{4}f_{8}f_{7}f_{9}〉
c_{3} = 〈f_{4}f_{8}f_{4}f_{7}f_{8}f_{8}* f_{8}f_{2}〉
Let (δ = 3 and consider f_{8}. Nontrivial δ—clusters are: {f_{7}, f_{8}, f_{9}}, {f_{7}, f_{8}} and {f_{8}, f_{9}} between c_{1}and c_{2}, {f_{7}, f_{8}} between c_{1}and c_{3}and { f_{8}, f_{2}}, {f_{7}, f_{8}} and {f_{8}, f_{4}} between c_{2}and c_{3}. Therefore, we obtain the following nontrivial δ—syntenies: {f_{7}, f_{8}, f_{9}}, {f_{7}, f_{8}}, {f_{8}, f_{2}} and {f_{8}, f_{4}}. Notice that the δcluster {f_{7}, f_{8}, f_{9}} covers witnesses of the δcluster {f_{8}, f_{9}}, but the witnesses of the δcluster {f_{7}, f_{8}} on chromosome c_{3}do not witness the δcluster {f_{7}, f_{8}, f_{9}}. Therefore, we merge the δcluster {f_{8}, f_{9}} in th e δsynteny {f_{7}, f_{8}, f_{9}}; however, {f_{7}, f_{8}} remains as a separate δsynteny.
We have seen that a (δ—synteny must contain the maximal (δ—cluster with respect to subset inclusion. All (δ—syntenies for a set of chromosomes C, with C >= 2 are included in the result. Such a synteny set is informally called a synteny team following the terminology introduced in [28, 32] for gene teams.
Definition 8 Given a δ—synteny teamwe say that Σ_{ i }and Σ_{ j }are transitively connected if the witnesses S_{ i } and S_{ j } overlap, that is S_{ i } ∩ S_{ j } ≥ 1. We further define a δzone as a union of transitively connected δsyntenies Σ_{ i }and Σ_{ j }.
Example 5 Consider C = {c_{1}, c_{2}, c_{3}} from example 4 and δ = 3. Suppose that we compute clusters in the neighborhood of f_{8}. Nontrivial syntenies are the following: Σ_{1} = {f_{7}, f_{8}} for witness S_{1} = {(8, f_{7})_{ c }, (9, f_{8})c1, (9, f8)c2, (10, f_{7})_{c2}, (2, f_{8})_{c3} (4, f_{7})_{c3}, (5, f_{8})_{c3}, (6, f_{8})_{c3}} and Σ_{2} = {f_{4}, f_{8}} for witness S_{2} = {(8, f_{4})_{c2}, (9, f_{8})_{c2}, (1, f_{4})_{c3}, (3, f_{4})_{c3}, (2, f_{8})_{c3}}. Notice, that Σ_{1} ∩ Σ_{2} = {f_{8}} ≠ ∅ and S_{1} ∩ S_{2} = {(9, c_{2}, f_{ 8 }), (2, c_{3}, f_{8})} ≠ ∅. We obtain one nontrivial δ—zone {f_{4}, f_{8}, f_{7}} by agglomerating δ—syntenies Σ_{1}and Σ_{2}based on the transitivity (see figure3). Notice that this leaves the gene (8, f_{8})_{c3}, out of the δzone. The transitivity relationship in the SYNS algorithm combines each pair of two δ— syntenies sharing at least one witness into one δzone. The notion of a δ— zone aims at uncovering even distant evolutionary relationships based on conservation of gene content within neighborhoods. It is slightly amended based on the following two considerations.

1.
Several paralogous genes may exist on the same chromosome. When two or more paralogs appear within one window of size δ, we include them in the same witness set of a δsynteny since it is not possible to computationally distinguish between them.

2.
It may happen that two distinct δsyntenies share only one paralogous gene. This is what we call a weak bond. Creating a δzone based on a single gene intersection may either lead to a δzone that is phylogenetically valid or may create an erroneous result (see [6]).
Definition 9 Given a δ—synteny teamand its witness set S = {S_{ i }, S_{ j }} we say that S forms a weak bond if S_{ i } ∩ S_{ j } = 1. We further define g = S_{ i } ∩ S_{ j } to be the witness of a weak bond.
The δ— zone {Σ_{ i }, Σ_{ j }} resulting from a weak bond may be erroneous. We rely on phylogeny to solve this issue. We consider a total order over all the species under study defined by phylogeny: a ≺ b if species b has diverged from the common ancestor earlier than species a (≺ corresponds then to the relative speciation time). When no other witness from a other than g exists, we split the erroneously obtained synteny in two parts: one that contains the orthology relationships within a given family f and another one that keeps the supposed paralogs. The details of how this is done are presented in Results section.
Definition 10 Letbe a δ— synteny team over the witness set S = {S_{ i }, S_{ j }} such that S_{ i } > S_{ j } and let g = S_{ i } ∩ S_{ j } be the witness of a weak bond. If g is from the biggest species according to ≺ in S_{ j }, we say that S_{ i }witnesses a maximal orthologous (δsynteny Σ_{ i }and S'_{ j } = S_{ j } \ g witnesses a paralogous δsynteny Σ_{ j }.
Example 6 Consider C′ = {c_{2}, c_{3}} from example 4 and figure4supposing that c_{3} ≺ c_{2}and consider neighborhoods around f_{8}with (δ = 3. Two nontrivial δsyntenies are connected by a weak bond: Σ_{1} = {f_{8}, f_{2}} with witness S_{1} = {(8, f_{2})_{c2}, (9, f_{8})_{c2}, (8, f_{8})_{c3}, (9, f_{2})_{c3}} and Σ_{2} = { f_{4}, f_{8}} with witness S_{2} = {(7, f_{4})_{c2}, (9, f_{8})_{c2}, (1, f_{4})_{c3}, (2, f_{8})_{c3} (3, f_{4})_{c3}, (5, f_{8})_{c3}}. Indeed, {(9, f_{8})_{c2}} is the witness of this weak bond. Since c_{3} ≺ c_{2}, then Σ_{2}is the maximal orthologous δsynteny with witness S_{2}, while Σ_{1}is the one with the paralogous copy of f_{8} (at position 9 on c_{2}). The set S_{1}becomes Members of a family are split into an orthologous and paralogous subsets present in different syntenies. At the end of our procedure, only the largest orthologous (δzone and the nonintersecting paralogous (δzones covering any given homologous family remain in the result.
Syntenic TeamS algorithm
In this section, we present the SYntenic TeamS (SYNS) algorithm which computes δ— zones in multiple genomes. In previous work gene teams between two chromosomes of size m and n are computed by an O(m + n)log^{2}(m + n) algorithm consisdering only onetoone homologous relationships [32]. The approach by [20] solves the ordered gene clusters problem by proposing a directed acyclic graph model and an NPhard longest path solution; results contain maximal but also nonmaximal orthologous clusters. Our approach relies on the same slidingwindow general approach as in [20]. However, we gain in time efficiency by limiting the sliding of the window only around positions of family members. Given a set of families Σ and a predefined window size δ, we examine neighborhoods of each family f ∈ Σ in all chromosomes. For all genes of f including paralogous copies, we consider a neighborhood from –δ to +δ around them. This neighborhood is examined by a sliding window of size δ and we form sets of genes corresponding to families in a given window position. These sets are intersected to look for common gene content if they belong to different chromosomes. The intersections define synteny conservation within the family neighborhoods by using definitions in Methods section. We further look for transitivity among δ— syntenies and build (δzones. To do this, we search for overlaps among witnesses of δ— clusters. If the witness intersection size is > 1 then the δ— syntenies are agglomerated to form one δ— zone. Three different cases corresponding to phylogenetic topologies shown in figure 5 are considered for solving the weak bond problem. Let S_{ i } and S_{ j } to be the two witnesses connected by a weak bond, we sort the genes of these witnesses according to the ≺ order of speciation. If the witness of a weak bond occurs in the biggest species according to ≺ or if there is no any other witness from a bigger species, then we consider that (cases A and B in figure 5) the two clusters define a valid (δzone. Case C in figure 5 shows the situation where forming a (δzone can not be justified from the evolutionary perspective. For cases A and B we continue to search for paralogous gene clusters. We gather all maximal δ— zones in the final result.
Time complexity
Table 1 shows the comparative time complexity analysis of our approach and other existing ortholog detection algorithms for the cases where such information is available. In the SYNS algorithm, we consider that one homologous family f may appear in at most c × t locations in all genomes, where c is the total number of chromosomes and t is the maximal number of paralogous copies. Given that we explore neighborhoods of size 2 × δ + 1, the number of all windows of size δ for f is (δ + 1) × c × t. The computation of all witnesses for a given family takes O(((δ + 1) × c × t)^{2}). If in this computation all the possible intersections are nonempty, then in the worst case scenario we obtain for f the set of (δclusters of size ((δ +1) × c × t)^{2}. Which implies that the (δsynteny computation takes O(((δ + 1) × c × t)^{4}); which is repeated for all families f ∊ Σ.
Evaluation of results
The Genolevures database provides families of proteins across the phylum of hemyascomycetous yeasts. To evaluate the performance of our algorithm, we have executed it on the existing families from the Genolevures Release 3 Candiate 3 (20080924) [33], [29] with 4949 families covering 25196 protein coding genes from five protoploid yeast species [1].
Comparison with other methods
The critical windowsize parameter δ of SYNS was set to 7 for all experiments. This value was obtained in order to match our results with the previously defined and expert validated orthologous subgroups [1]. We have compared the orthologous groups obtained by SYNS on the yeast data to those obtained by the following methods: Cococl [34], MultiParanoid [35] and OrthoMCL [36]. Table 2 shows the numbers of orthologous groups classified by these methods. OrthoMCL [36] was run with default inflation index= 1.5, evalue cutoff= –5 and percent match cutoff = 50 values starting from input fasta files. Cococl was run recursively starting with fasta sequences with boostrap threshold score= 1 and split score= 0.5 and using ClustalW for multiple sequence alignment. Multiparanoid was run using default parameters (no cutoff and no duplicate appearance of gene in clusters), using BLOSUM62 matrix for Blast alignments. Table 2 shows the total number of classified proteins and the total number of orthologous groups detected by SYNS and these algorithms using the original Genolevures families as a baseline [33]. In comparison with the SONS method, the SYNS classifies a comparable number of proteins, but generates more orthologous groups, implying that these groups are more finegrained.
We compare the orthologous groups between the SYNS method and those obtained by other algorithms in table 3. To compare two classifications we first look at how many groups are identical between two methods (Id column) and compute the similarity value (between 0 and 1) over the intersection of the covered protein sets (for definition see [33]). Second, we analyze the differences between two classifications. For these we report the number of proteins that are classified only by the SYNS (SYNS column) when compared to those only classified a given method (meth. column). The remaining differences are classified according to granularity: a split when a group obtained by a given method is split into multiple subgroups by the SYNS algorithm, a merge in the opposite case, and messy when the split/merge relationship is complicated. We further analyze the differences with respect to SONS classification case by case (available at http://www.cbib.ubordeaux2.fr/redmine/projects/syns/files). We have found that in the case of splits between the resulting groups (50 groups in table 3, the more finegrained groups obtained by the SYNS algorithm are more functionally relevant in general. For the cases of merges (141 groups) and messy events (70 groups) there is no clearcut qualitative difference. However, for these 211 cases more functionally plausible groups can be obtained by SYNS when using a smaller window size δ = 5. Overall, SYNS method appears to be the best match with the curated SONS results [1], while relying on a clear mathematical definitions and having satisfactory running time.
Analysis of two protein families
We illustrate the functional relevance the SYNS algorithm by considering the classification of Pdrp (pleiotropic drug resistance transporter proteins) subfamily performed in [6]. This is a subset of the PDR proteins from the GL3C0025 (total 60 proteins) Genolevures family. We compare this manual analysis with the results obtained automatically by SONS and SYNS algorithms.
Seven SONS, six SYNS and seven groups obtained by manual curation provide hypothethis on the evolution of this protein family. The manually curated orthologous groups are confirmed by gene cluster analysis. But in some cases the results differ. Groups P_{1} through P_{4} in table 4 denote four orthologous groups over five species annotated in [6] according to their S. cerevisiae members, namely Pdr12p group (P_{3}, 5 members), Snq2p group (P_{1} + P_{2}, 5+4=9 members) and Pdrp5p/15p group (P_{4}, 3 members). Groups P_{5} through P_{7} in table 4 contain genes whose relationship to Pdr5p/15p is based on phylogenetic evidence only [6]. Three tandem gene repeats appear in ERGO (Eremothecium gossypii), KLLA (Kluyveromyces lactis) and SAKL (Saccharomyces kluyveri) and are found in a similar neighborhood [6] in groups P_{1} and P_{2}.
Comparatively to the SONS classification, our approach proposes a more conservative classification for these proteins into orthologous groups. Indeed, SONS exclude ZYRO0D17710g from the Snq2/YNR070w phylogenetic cluster, while regrouping the remaining proteins belonging to P_{1} and P_{2}. Moreover, according to [6], SAKL0F04312g belongs to the Aus1p/Pdr11p group which has no shared neighborhood in preWGD five species. Thus, it is not surprising that this gene is missing in the SYNS classification (SONS algorithm classifies it in an independent group, not shown in table 4).
A similar analysis is done for the GL3C0026 family that has 57 members and four different functionally annotated groups. Figure 6 illustrates the evolutionary pattern based on the combination of phylogenetic analysis and functional annotations of this family. SONS algorithm produces 7 orthologous gene clusters, while SYNS generates 8 clusters functionally more relevant. Both SONS and SYNS successfully classify the Lornithine transaminase (OTAse) group (with the S. cerevisiae member YLR438w CAR2). However, SONS classification fails to distinguish the YGR019w UGA1 Gammaaminbutyrate (GABA) transaminase group from the YNR058w aminopelargonic acid aminotransferase (DAPA) group. On the contrary, SYNS method separates the cluster having the YGR019w UGA1 gene according to its functional anotation. Our algorithm also succeeds to correctly distinguish the single orthog gene clusters from the YGR019w UGA1 group. For the YOL140w ARG8 Acetylornithine aminotransferase group, both SONS and SYNS algorithms provide similar conserved gene clusters. However, SONS erroneously mixes some genes of this group with YGR019w UGA1 cluster and YNR058w BIO3 cluster, whereas SYNS algorithm succeeds to distinguish them. The combined functional annotations and neighborhood analysis support the evolutionary pattern illustrated in figure 6 for the GL3C0026 family. Therefore we can conclude that the final δzones in our algorithm may preserve a functionally meaningful conserved gene clusters.
Conclusion
The double goal of this study is to identify locally conserved gene clusters and to use them in order to subdivide an existing family structure into orthologous groups. To this end, we define a model for unordered local synteny and propose an algorithm to identify conserved gene clusters and their division into orthologous and paralogous clusters among multiple genomes. To validate our approach we have executed our method for the five Hemyascomycetous yeasts and genomes and examined the conserved nonoverlapping gene clusters that arise from each homologous family of Genolevures database [29]. Our approach shows 99% protein coverage for existing homologous groups.
We perform similar comparisons with the existing SONS groups [6] over the Genolevures families. The 90% similarity between our approach and SONS groups indicates that our automatic method comes close to the manually curated results, especially since part of the differences between these groups can be explained by the nonclassification of the paralogous conserved gene clusters by SONS. This confirms the pertinence of our definition of conserved neighborhoods based on transitivity and phylogenetic constraints that make it possible to include tandem repeats as well as loss, fusions or transpositions of gene copies in chromosomal rearrangements of genomes. The SYNS method makes it possible to distinguish between orthologous and paralogous conserved gene clusters and thus makes it possible to include tandem repeats as well as loss, fusions or transpositions of gene copies in chromosomal rearrangements of genomes. This implies that the proposed sliding window and partial traversal approach, efficiently produces biologically relevant conserved gene clusters and corresponding orthologous groups with O(n((δ + 1) × c × t)^{4}) worstcase complexity, for a predefined window size δ.
References
Consortium G: Comparative genomics of protoploid Saccharomycetaceae. Genome Res 2009, 19(10):1696–709.
Ermolaeva M: Operon finding in bacteria. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics 2005, 2886–2891.
Snel B, Bork P, Huynen M: The identification of functional modules from the genomic association of gene. Proc Natl Acad Sci USA 2002, 99(9):5890–5895. 10.1073/pnas.092632599
Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: A fingerprint of proteins that physically interact. Trends Biochem. Sci. 1998, 23(9):324–328. 10.1016/S09680004(98)012742
Bergeron A, Blanchette M, Chateau A, Chauve C: Reconstructing Ancestral Gene Orders Using Conserved Intervals. In WABI. Volume 3240. Edited by: Jonassen I KJ, Lecture Notes in Computer Science. Springer; 2004:14–25.
Seret ML, Diffels JF, Goffeau A, Baret PV: Combined phylogeny and neighborhood analysis of the evolution of the ABC transporters conferring multiple drug resistance in hemiascomycete yeasts. BMC genomics 2009, 10(459):1–11.
Fitch WM: Distinguishing homologous from analogous proteins. Syst. Zool 1970, 19: 99–113. 10.2307/2412448
Fitch W: Homology a personal view on some of the problems. Trends Genet 2000, 16: 227–231. 10.1016/S01689525(00)020059
Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV: Selection in the evolution of gene duplications. Genome Biol 2002, 3: RESEARCH0008.
Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science 2000, 290: 1151–1155. 10.1126/science.290.5494.1151
Lynch M, Force A: The probability of duplicated gene preseration by subfunctionalization. Genetics 2000, 154: 459–473.
Ohno S: Evolution be gene duplication. New York: Springer; 1970.
He X, Goldwasser MH: Identifying conserved gene clusters in the presence of homology families. Journal of computational biology 2005, 12(6):638–656. 10.1089/cmb.2005.12.638
Bansal AK: An automated comparative analysis of 17 complete microbial genomes. Bioinformatics 1999, 15: 900–908. 10.1093/bioinformatics/15.11.900
Goldberg D, McCouch S, Kleinberg J: Algorithms for constructing comparative maps. In Comparative Genomics. Edited by: Shankoff D, Nadeau JH. NL: Kluwer Academic Press; 2000:281–294.
Housworth EA, Postlethwait J: Measures of synteny conservation between species pairs. Genetics 2002, 162: 441–448.
Nadeau JH, Shankoff D: Counting on comparative maps. Trends Genet 1998, 14(12):495–501. 10.1016/S01689525(98)016072
Tamames J: Evolution of gene order conservation in prokaryotes. Genome Biol 2001, 6(2):0020.1–11.
Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature 1997, 387: 708–713. 10.1038/42711
Yang Q, Yi G, Zhang F, Thon MR, Sze SH: Identifying gene clusters within localized regions in multiple genomes. Journal of Computational Biology 2010, 17(5):657–668. 10.1089/cmb.2009.0116
Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. PNAS 1999, 96: 2896–2901. 10.1073/pnas.96.6.2896
Vandepoele K, Saeys Y, Simillion C, Raes J, Peer YVD: The automatic detection of homologous regions (ADHoRe) and its application to microcolinearity between arabidopsis and rice. Genome Research 2002, 12(11):1792–1801. 10.1101/gr.400202
Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis. Science 2000, 290: 2114–2117. 10.1126/science.290.5499.2114
Hoberman R, Sankoff D, Durand D: The Statistical Significance of MaxGap Clusters. In Comparative Genomics. Volume 3388. Edited by: Lecture Notes in Computer Science. Edited by Lagergren J. Springer Berlin /Heidelberg; 2005:55–71. 10.1007/9783540322900_5
Parida L: Gapped permutation pattern discovery for gene order comparisons. J. Comput. Biol. 2007, 14: 45–55. 10.1089/cmb.2006.0103
Heber S, Stoye J: Algorithms for finding gene clusters. Lect. notes Comput. Sci. 2001, 2149: 252–263. 10.1007/3540446966_20
Kim S, Choi JH, Yang J: Gene teams with relaxed proximity constraint. Proc. IEEE Comput. Sys. Bioinformatics Conf. 2005, 44–55.
Bergeron A, Corteel S, Raffinot M: The algorithmic of gene teams. In Proc. 2nd Annual Workshop on Algorithms in Bioinformatics (WABI), Volume 2452 of Lectures Notes in Computer Science. New York: SpringerVerlag; 2002:464–476.
Sherman DJ, Martin T, Nikolski M, Cayla C, Souciet J, Durrens P: Genolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes. Nucleic Acids Research 2009, 37(Database issue):D550D554.
Didier G: Common intervals of two sequences. Lect. Notes Comput. Sci 2003, 2812: 17–24. 10.1007/9783540397632_2
Schmidt T, Stoye J: Quadratic time algorithms for finding common intervals in two or more sequences. Lect. Notes Comput. Sci 2004, 3109: 347–358. 10.1007/9783540278016_26
Beal MP, Bergeron A, Corteel S, Raffinot M: An algorithmic view of gene teams. Theoret. Comput. Sci 2004, 320(2–3):395–418. 10.1016/j.tcs.2004.02.036
Nikolski M, Sherman D: Family relationships: should consensus reign?  consensus clustering for protein families. Bioinformatics 2007, 23(2):e71e76. 10.1093/bioinformatics/btl314
Jothi R, Zotenko E, Tasneem A, Przytycka TM: COCOCL: hierarchical clustering of homology relations based on evolutionary correlations. Bioinformatics 2006, 22(7):779–788. 10.1093/bioinformatics/btl009
Alexeyenko A, Tamas I, Liu G, Sonnhammer E: Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 2006, 22(14):e9e15. 10.1093/bioinformatics/btl213
Li L, Stoeckert C, Roos D: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13(9):2178–89. 10.1101/gr.1224503
Acknowledgements
The authors would like to thank Pascal Durrens for constructive discussions and help with the analysis of biological relevance of results.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S9.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Conceived and designed the experiments: AS, MN. Performed the experiments and analyzed the data: AS. Wrote the paper: AS, HS, MN.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Sarkar, A., Soueidan, H. & Nikolski, M. Identification of conserved gene clusters in multiple genomes based on synteny and homology. BMC Bioinformatics 12 (Suppl 9), S18 (2011). https://doi.org/10.1186/1471210512S9S18
Published:
DOI: https://doi.org/10.1186/1471210512S9S18