Identification of conserved gene clusters in multiple genomes based on synteny and homology

Sarkar, Anasua; Soueidan, Hayssam; Nikolski, Macha

doi:10.1186/1471-2105-12-S9-S18

Volume 12 Supplement 9

Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics

Proceedings
Open access
Published: 05 October 2011

Anasua Sarkar¹,
Hayssam Soueidan² &
Macha Nikolski¹

BMC Bioinformatics volume 12, Article number: S18 (2011)

Abstract

Background

Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. We build upon the series of works on gene teams and homology teams.

Results

Our primary contribution is a local sliding-window SYNS (SYNtenic teamS) algorithm that refines an existing family structure into orthologous sub-families by analyzing the neighborhoods around the members of a given family with a locally sliding window. The neighborhood analysis is done by computing conserved gene clusters. We evaluate our algorithm on the existing homologous families from the Genolevures database over five genomes of the Hemiascomycete phylum.

Conclusions

The result is an efficient algorithm that works on multiple genomes, considers paralogous copies of genes and is able to uncover orthologous clusters even in distant genomes. Resulting orthologous clusters are comparable to those obtained by manual curation.

Background

Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. It is often suggested that regions with similar gene content among different species are evidence for a phylogenetic relationship and trace through evolution the inheritance of function from a common ancestor. Within one genome, the presence of large duplicated blocks may be due to an ancient large-scale or whole-genome duplication, while the presence of segments with homologous genes in multiple genomes, named conserved gene clusters, more likely indicates an evolutionary constraint on a functionally related group. Our primary contribution is a local sliding-window algorithm that starts from an existing protein family classification and produces two results: first, conserved gene clusters, and second, a subdivision of families into orthologous subgroups. Our approach can be seen as using conserved gene clusters in order to sift through the family structure to uncover orthology. We evaluate the biological relevance of our approach on the example of protoploid yeasts [1].

A number of studies indicate that regions of conserved homology among multiple species may result from functional pressure to keep these genes close, but such regions may also be conserved because the genomes under study have not sufficiently diverged. For the former, the best-known examples are operons in prokaryotes [2], but also the existence of functional interactions [3] and similar expression patterns [4] in closely located genes. For the latter, the existence of conserved gene clusters is the computational basis for ancestral genome reconstruction [5] and for the search for ancestral homologs among genes in the same family [6]. Orthologs are homologous genes related by speciation [7, 8] which retain the same functionality as their common ancestors. Homologous genes related by duplication within one lineage are called paralogs and generally differ in functionality [9–12]. A number of papers introduce algorithms to compute conserved gene clusters and orthologous groups, see for example [13–16]. These approaches vary on a number of parameters. First, there are authors who consider strictly conserved chromosomal segments with similar gene order and orientation [17–19]. Second come the approaches where one considers conserved contiguous regions but without co-linearity [13, 20]. Third, some authors relax the definition of conserved regions by allowing gaps [18, 21–24]. Fourth, paralogous gene copies within one chromosome are allowed in order to explore many-to-many homologous relationships [13, 25]. Finally, some authors study the effect of varying the gap between adjacent neighbors [24, 26, 27].

In this paper we start from the notion of gene teams introduced in [28]. This model allows only one copy of a gene on a given chromosome. We relax this restriction by following the approach of homology teams defined in [13]. Furthermore, we set the gap threshold not only for adjacent genes, but by requiring the distance for any two genes considered as being neighbors to be smaller than a certain threshold. A similar choice was made in [20]. We call the obtained gene clusters synteny teams.

Our SYNS (SYNtenic teamS) algorithm refines existing families into orthologous sub-families by analyzing the neighborhoods around the members of a given family with a locally sliding window. This is done for all pairs of chromosomes in multiple genomes on which family members appear. The pairwise conserved contiguous segments are agglomerated by relying on a partial homology between segments and on the biological criteria introduced in [1]. This results in larger conserved segments that we call syntenic zones. We evaluate our algorithm on the existing homologous families for five genomes of the Hemiascomycete yeasts from the Genolevures database [29]. Indeed, there already exists a sub-classification of these families into orthologous sub-families [1] that has undergone expert validation and thus can be used as a reference point for evaluating the biological relevance of our results. We further illustrate the results of our method for the particular case of the Pdrp (pleiotropic drug resistance proteins) phylogenetic subfamily of ABC transporters that has been manually analyzed in [6].

Methods

In this section we define the notion of unordered conserved gene clusters that allows for paralogous copies and gaps on multiple genomes. Following the work of [20, 30, 31], we allow one homologous gene to appear more than once in one chromosome. We refine the approach of homology teams [13] by distinguishing between orthologous and paralogous copies of genes. Large syntenic zones are built by merging clusters based on the genes they have in common, instead of directly merging the ordered chains with overlapping families as in [32]. For mathematical notations and examples in a textual format we follow [28].

Definition 1 A chromosome is defined as a pair c = (Σ, G), where Σ = {f₁, f₂, …, f_m} is the set of homologous families and G = (g₁, g₂, …, g_n) is an ordered sequence of genes. Each gene g_i ∈ G is a couple (p_i, f_i), where p_i is the position of gene g_i on c and g_i belongs to some homologous group f_i ∈ Σ.

Here, Σ is the alphabet for any chromosome c and p_i is an integer. When it is necessary to indicate to which chromosome belongs a given gene, this is done by a subscript: (p_i, f_i)_c.

Definition 2 Given a chromosome c, with two genes g_i = (p_i, f_i) and g_j = (p_j, f_j), the distance between g_i and g_j is defined by Δ(g_i, g_j) = |p_i – p_j|.

Example 1 Let c₁ and c₂ be two chromosomes over the same alphabet Σ = {f₁, f₂, f₃, f₄} of homologous families with genes on c₁ being (1, f₂), (2, f₁), (4, f₄), (7, f₃), (8, f₁), and on c₂ being (1, f₁), (2, f₂), (3, f₂), (4, f₃), (6, f₄). This is denoted by:

c₁ = 〈f₂f₁*f₄**f₃f₁〉,

c₂ = 〈f₁f₂f₂f₃*f₄〉.

Asterisks stand for genes that are unassigned to homologous groups; notice that * is not part of the alphabet Σ.
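The textual notation of Example 1 can be turned into the (position, family) couples of Definition 1 with a few lines of Python. This is only an illustrative sketch: the helper names `parse_chromosome` and `distance`, and the ASCII spelling `f2` for f₂, are assumptions made here and are not part of the original method.

```python
import re

def parse_chromosome(text):
    """Read notation such as 'f2f1*f4**f3f1' into (position, family) couples.

    Positions are 1-based; asterisks occupy a position but yield no gene,
    since they are not part of the alphabet of families.
    """
    tokens = re.findall(r"f\d+|\*", text)
    return [(pos, fam) for pos, fam in enumerate(tokens, start=1) if fam != "*"]

def distance(g_i, g_j):
    """Distance of Definition 2: absolute difference of the two positions."""
    return abs(g_i[0] - g_j[0])

c1 = parse_chromosome("f2f1*f4**f3f1")
c2 = parse_chromosome("f1f2f2f3*f4")
print(c1)  # [(1, 'f2'), (2, 'f1'), (4, 'f4'), (7, 'f3'), (8, 'f1')]
print(distance(c1[0], c1[2]))  # 3
```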

A gene subset G′ ⊆ G induces the subset of families Σ′ denoted by Σ(G′) such that f_i ∈ Σ′ if and only if there exists g_i ∈ G′ such that g_i = (p_i, f_i). A set of genes G′ from the same chromosome, forms a chromosomal segment s = (Σ′, G′, c) with or without gaps. When it is clear from the context, we will assimilate a set of genes G′ with the corresponding chromosomal segment.

For example, in the case of G′ = {(2, f₁), (4, f₄), (8, f₁)} and alphabet Σ′ = Σ(G′) = {f₁, f₄}, G′ defines a chromosomal segment with gaps on c₁ = 〈f₂f₁ * f₄ * *f₃f₁〉. This segment G′ is non-contiguous on c₁; the gaps correspond to (5, *), (6, *) and (7, f₃).

Definition 3 A chromosomal segment s = (Σ′, G′, c) is contiguous if for any two genes g_i = (p_i, f_i) and g_j = (p_j, f_j) from G′ and any p such that p_i < p < p_j, either the gene g = (p, f) at the position p belongs to G′ or this position corresponds to an asterisk. Otherwise, the segment is said to be non-contiguous. For example, G′ = {(4, f₄), (7, f₃), (8, f₁)} on c₁ = 〈f₂f₁*f₄**f₃f₁〉 forms a contiguous segment.
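The contiguity test of Definition 3 amounts to checking that no assigned gene lying strictly between the extreme positions of the segment is missing from it. The following sketch illustrates this on chromosome c₁ of Example 1; the function name `is_contiguous` and the tuple encoding of genes are assumptions of this illustration.

```python
def is_contiguous(segment, chromosome):
    """Check Definition 3 for a set of (position, family) genes.

    `chromosome` lists only the assigned genes, so asterisk positions are
    simply absent from it and never break contiguity.
    """
    positions = [pos for pos, _ in segment]
    lo, hi = min(positions), max(positions)
    members = set(segment)
    return all(gene in members for gene in chromosome if lo < gene[0] < hi)

# Chromosome c1 of Example 1, without the asterisk positions.
c1 = [(1, "f2"), (2, "f1"), (4, "f4"), (7, "f3"), (8, "f1")]

print(is_contiguous({(4, "f4"), (7, "f3"), (8, "f1")}, c1))  # True
print(is_contiguous({(2, "f1"), (4, "f4"), (8, "f1")}, c1))  # False, (7, f3) is skipped
```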

Synteny teams

Two genes g_i = (p_i, f_i) and g_j = (p_j, f_j) on the same chromosome are considered to be neighbors when Δ(g_i, g_j) < δ for a given threshold δ > 0. For a gene g_i, we denote by N_i the set of neighbor genes centered around it, that is N_i = {g_k = (p_k, f_k) | p_i – ⌊δ/2⌋ ≤ p_k ≤ p_i + ⌊δ/2⌋}.

Definition 4 A chromosomal segment s is called a δ-segment if every pair of genes of s is separated by a distance smaller than δ, that is s = {g_i | ∀g_j ∈ s, Δ(g_i, g_j) < δ}. A window w is a contiguous δ-segment.

Definition 5 We say that Σ′ ⊆ Σ is a δ-subset if there exists at least one δ-segment s′ = (Σ′, G′, c) such that Σ′ = Σ(G′). We say that s′ is the witness of this δ-subset.

Example 2 For δ = 3, the non-trivial δ-subsets (those with more than one family) on chromosome c₂ = 〈f₁f₂f₂f₃*f₄〉 are the following:

- {f₁, f₂} as witnessed by ((1, f₁), (2, f₂)), ((1, f₁), (3, f₂)), and ((1, f₁), (2, f₂), (3, f₂));

- {f₂, f₃} as witnessed by ((2, f₂), (4, f₃)), ((3, f₂), (4, f₃)), and ((2, f₂), (3, f₂), (4, f₃));

- {f₃, f₄} as witnessed by ((4, f₃), (6, f₄)).
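Example 2 can be reproduced by brute force: every δ-segment is enumerated from its leftmost gene, and segments are grouped by the set of families they induce. The sketch below is a direct reading of Definitions 4 and 5, not the SYNS algorithm itself. It is exponential in the number of genes per window, which is acceptable only for small δ. The function name `delta_subsets` is an assumption of this illustration.

```python
from itertools import combinations

def delta_subsets(chromosome, delta):
    """Map each family set to the list of delta-segments witnessing it."""
    result = {}
    genes = sorted(chromosome)
    for i, first in enumerate(genes):
        # Genes that may share a delta-segment whose leftmost gene is `first`.
        reach = [g for g in genes[i + 1:] if g[0] - first[0] < delta]
        for k in range(len(reach) + 1):
            for rest in combinations(reach, k):
                segment = (first,) + rest
                families = frozenset(fam for _, fam in segment)
                result.setdefault(families, []).append(segment)
    return result

# Chromosome c2 of Example 1, without the asterisk position.
c2 = [(1, "f1"), (2, "f2"), (3, "f2"), (4, "f3"), (6, "f4")]
subsets = delta_subsets(c2, delta=3)
non_trivial = {fams: wits for fams, wits in subsets.items() if len(fams) > 1}
for fams, wits in non_trivial.items():
    print(sorted(fams), wits)
```

Restricting the output to family sets with more than one family yields the three δ-subsets listed in Example 2, with three, three and one witnesses respectively.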

Definition 6 Let Σ be the set of homologous families over a set of chromosomes C. We say that Σ′ ⊆ Σ is a δ-cluster if Σ′ is a δ-subset for all chromosomes in some C′ ⊆ C, where |C′| ≥ 2. We say that the set of genes S, formed by the genes of all the δ-segments that witness Σ′ on the chromosomes of C′, witnesses the δ-cluster Σ′.

A witness S is thus a set of all genes that participate in the segments witnessing the relevant δ-subsets. Let Σ and Σ′ be two δ-clusters such that Σ ∩ Σ′ ≠ ∅. Let S and S′ be the corresponding witness sets. Denote by S_∩ and S′_∩ the sets of genes in each of these witness sets that are members of the families in Σ ∩ Σ′.

Definition 7 A δ-cluster Σ is said to be a δ-synteny if (a) the corresponding witness set S has genes belonging to at least two different chromosomes and (b) there does not exist a δ-cluster Σ′ with a witness set S′ such that Σ ⊂ Σ′ and S_∩ ⊆ S′_∩.

Example 3 Let c₁, c₂ and c₃ be chromosomes as shown in figure 1.

c₁ = 〈f₃ * *f₅f₄f₁ * f₂* f₅f₄〉

c₂ = 〈f₁f₂ * *f₃*f₄f₅f₁ * f₅〉

c₃ = 〈f₂ * f₃ **f₅*f₁f₄ * f₅〉

Let δ = 3. We obtain the following non-trivial δ-clusters: {f₄, f₁}, {f₅, f₁}, {f₄, f₅, f₁}, {f₁, f₂} and {f₄, f₅} between c₁ and c₂; and {f₁, f₅}, {f₁, f₄} and {f₄, f₅} between c₁ and c₃. The non-trivial δ-syntenies are {f₄, f₅}, {f₁, f₂}, {f₄, f₁}, {f₅, f₁} and {f₄, f₅, f₁}.
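The pairwise δ-clusters of Example 3 can be checked with a short script. It relies on the observation that a family set is a δ-subset exactly when all its families occur within some run of δ consecutive positions, so it is enough to collect the families seen in each such run and intersect the results of two chromosomes. This is a sketch of Definitions 5 and 6 only. The helper names and the ASCII notation are assumptions, and the δ-synteny filtering of Definition 7 is not attempted.

```python
import re
from itertools import combinations

def parse_chromosome(text):
    """Read notation such as 'f3**f5f4' into 1-based (position, family) couples."""
    tokens = re.findall(r"f\d+|\*", text)
    return [(pos, fam) for pos, fam in enumerate(tokens, start=1) if fam != "*"]

def family_subsets(chromosome, delta, min_size=2):
    """All family sets of at least `min_size` families that are delta-subsets."""
    found = set()
    for start, _ in chromosome:
        # Families present in the delta positions starting at a gene.
        window = {fam for pos, fam in chromosome if start <= pos < start + delta}
        for k in range(min_size, len(window) + 1):
            found.update(frozenset(c) for c in combinations(sorted(window), k))
    return found

def pairwise_clusters(chrom_a, chrom_b, delta):
    """Family sets that are delta-subsets on both chromosomes."""
    return family_subsets(chrom_a, delta) & family_subsets(chrom_b, delta)

c1 = parse_chromosome("f3**f5f4f1*f2*f5f4")
c2 = parse_chromosome("f1f2**f3*f4f5f1*f5")
c3 = parse_chromosome("f2*f3**f5*f1f4*f5")

for cluster in pairwise_clusters(c1, c2, 3):
    print("c1/c2:", sorted(cluster))
for cluster in pairwise_clusters(c1, c3, 3):
    print("c1/c3:", sorted(cluster))
```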

The superset inclusion in definition 7 implies that, for computational purposes, there is no need to consider the smaller of the two sets. In our algorithm, two syntenies are therefore merged when the witness of one is entirely contained in the witness of the other.

Example 4 Let c₁, c₂ and c₃ be the three chromosomes shown in figure 2.

c₁ = 〈f₁ * *f₄* f₆f₆f₇f₈f₉〉

c₂ = 〈*f₅* f₃f₆* f₂f₄f₈f₇f₉〉

c₃ = 〈f₄f₈f₄f₇f₈f₈* f₈f₂〉

Let δ = 3 and consider f₈. Non-trivial δ-clusters are: {f₇, f₈, f₉}, {f₇, f₈} and {f₈, f₉} between c₁ and c₂, {f₇, f₈} between c₁ and c₃, and {f₈, f₂}, {f₇, f₈} and {f₈, f₄} between c₂ and c₃. Therefore, we obtain the following non-trivial δ-syntenies: {f₇, f₈, f₉}, {f₇, f₈}, {f₈, f₂} and {f₈, f₄}. Notice that the δ-cluster {f₇, f₈, f₉} covers the witnesses of the δ-cluster {f₈, f₉}, but the witnesses of the δ-cluster {f₇, f₈} on chromosome c₃ do not witness the δ-cluster {f₇, f₈, f₉}. Therefore, we merge the δ-cluster {f₈, f₉} into the δ-synteny {f₇, f₈, f₉}; however, {f₇, f₈} remains a separate δ-synteny.

We have seen that a δ-synteny must contain the maximal δ-cluster with respect to subset inclusion. All δ-syntenies for a set of chromosomes C, with |C| ≥ 2, are included in the result. Such a synteny set is informally called a synteny team, following the terminology introduced in [28, 32] for gene teams.

Definition 8 Given a δ-synteny team {Σ_i, Σ_j}, we say that Σ_i and Σ_j are transitively connected if the witnesses S_i and S_j overlap, that is |S_i ∩ S_j| ≥ 1. We further define a δ-zone as a union of transitively connected δ-syntenies Σ_i and Σ_j.

Example 5 Consider C = {c₁, c₂, c₃} from example 4 and δ = 3. Suppose that we compute clusters in the neighborhood of f₈. Non-trivial syntenies are the following: Σ₁ = {f₇, f₈} for witness S₁ = {(8, f₇)_c1, (9, f₈)_c1, (9, f₈)_c2, (10, f₇)_c2, (2, f₈)_c3, (4, f₇)_c3, (5, f₈)_c3, (6, f₈)_c3} and Σ₂ = {f₄, f₈} for witness S₂ = {(8, f₄)_c2, (9, f₈)_c2, (1, f₄)_c3, (3, f₄)_c3, (2, f₈)_c3}. Notice that Σ₁ ∩ Σ₂ = {f₈} ≠ ∅ and S₁ ∩ S₂ = {(9, f₈)_c2, (2, f₈)_c3} ≠ ∅. We obtain one non-trivial δ-zone {f₄, f₈, f₇} by agglomerating the δ-syntenies Σ₁ and Σ₂ based on transitivity (see figure 3). Notice that this leaves the gene (8, f₈)_c3 out of the δ-zone.

The transitivity relationship in the SYNS algorithm combines each pair of δ-syntenies sharing at least one witness into one δ-zone. The notion of a δ-zone aims at uncovering even distant evolutionary relationships based on conservation of gene content within neighborhoods. It is slightly amended based on the following two considerations.

1. Several paralogous genes may exist on the same chromosome. When two or more paralogs appear within one window of size δ, we include them in the same witness set of a δ-synteny since it is not possible to computationally distinguish between them.
2. It may happen that two distinct δ-syntenies share only one paralogous gene. This is what we call a weak bond. Creating a δ-zone based on a single gene intersection may either lead to a δ-zone that is phylogenetically valid or may create an erroneous result (see [6]).
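The agglomeration of Definition 8 can be sketched as a merge of every group of δ-syntenies whose witness sets overlap. The code below uses the data of Example 5, with genes written as hypothetical (position, family, chromosome) triples; the function name `build_zones` is an assumption of this illustration. It applies the plain overlap criterion |S_i ∩ S_j| ≥ 1 and does not include the special treatment of weak bonds described in the two considerations above.

```python
def build_zones(syntenies):
    """Merge (families, witness) pairs whose witness sets share a gene.

    The zones kept so far are pairwise disjoint, so absorbing every zone
    that overlaps the incoming synteny preserves transitivity.
    """
    zones = []
    for families, witness in syntenies:
        families, witness = set(families), set(witness)
        untouched = []
        for zone_families, zone_witness in zones:
            if zone_witness & witness:
                families |= zone_families
                witness |= zone_witness
            else:
                untouched.append((zone_families, zone_witness))
        untouched.append((families, witness))
        zones = untouched
    return zones

# Witness sets of Example 5 as (position, family, chromosome) triples.
s1 = {(8, "f7", "c1"), (9, "f8", "c1"), (9, "f8", "c2"), (10, "f7", "c2"),
      (2, "f8", "c3"), (4, "f7", "c3"), (5, "f8", "c3"), (6, "f8", "c3")}
s2 = {(8, "f4", "c2"), (9, "f8", "c2"), (1, "f4", "c3"), (3, "f4", "c3"),
      (2, "f8", "c3")}

zones = build_zones([({"f7", "f8"}, s1), ({"f4", "f8"}, s2)])
zone_families, zone_witness = zones[0]
print(sorted(zone_families))               # ['f4', 'f7', 'f8']
print((8, "f8", "c3") in zone_witness)     # False
```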

Definition 9 Given a δ-synteny team {Σ_i, Σ_j} and its witness set S = {S_i, S_j}, we say that S forms a weak bond if |S_i ∩ S_j| = 1. We further define g = S_i ∩ S_j to be the witness of a weak bond.

The δ-zone {Σ_i, Σ_j} resulting from a weak bond may be erroneous. We rely on phylogeny to solve this issue. We consider a total order over all the species under study defined by phylogeny: a ≺ b if species b has diverged from the common ancestor earlier than species a (≺ then corresponds to the relative speciation time). When no witness from a other than g exists, we split the erroneously obtained synteny into two parts: one that contains the orthology relationships within a given family f and another one that keeps the supposed paralogs. The details of how this is done are presented in the Results section.

Definition 10 Let {Σ_i, Σ_j} be a δ-synteny team over the witness set S = {S_i, S_j} such that |S_i| > |S_j| and let g = S_i ∩ S_j be the witness of a weak bond. If g is from the biggest species according to ≺ in S_j, we say that S_i witnesses a maximal orthologous δ-synteny Σ_i and S′_j = S_j \ g witnesses a paralogous δ-synteny Σ_j.

Example 6 Consider C′ = {c₂, c₃} from example 4 and figure 4, supposing that c₃ ≺ c₂, and consider neighborhoods around f₈ with δ = 3. Two non-trivial δ-syntenies are connected by a weak bond: Σ₁ = {f₈, f₂} with witness S₁ = {(7, f₂)_c2, (9, f₈)_c2, (8, f₈)_c3, (9, f₂)_c3} and Σ₂ = {f₄, f₈} with witness S₂ = {(8, f₄)_c2, (9, f₈)_c2, (1, f₄)_c3, (2, f₈)_c3, (3, f₄)_c3, (5, f₈)_c3}. Indeed, {(9, f₈)_c2} is the witness of this weak bond. Since c₃ ≺ c₂, Σ₂ is the maximal orthologous δ-synteny with witness S₂, while Σ₁ is the one with the paralogous copy of f₈ (at position 9 on c₂). The set S₁ becomes S′₁ = S₁ \ {(9, f₈)_c2}.

Members of a family are thus split into orthologous and paralogous subsets present in different syntenies. At the end of our procedure, only the largest orthologous δ-zone and the non-intersecting paralogous δ-zones covering any given homologous family remain in the result.
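The split of Definition 10 can be illustrated on the data of Example 6. In this sketch genes are hypothetical (position, family, chromosome) triples, with the positions on c₂ read from the chromosome string of Example 4 (f₂ at position 7, f₄ at position 8), and the order ≺ is encoded as a rank per chromosome (a larger rank means "bigger" under ≺). The function name `split_weak_bond` and the decision to return `None` when the conditions of the definition do not hold are assumptions of this illustration.

```python
def split_weak_bond(s_i, s_j, rank):
    """Apply Definition 10 to two witness sets joined by a weak bond.

    Returns (orthologous witness, paralogous witness) or None when the
    sets do not form a weak bond or the shared gene is not from the
    biggest species of the smaller witness set.
    """
    shared = s_i & s_j
    if len(shared) != 1:
        return None
    if len(s_i) < len(s_j):
        s_i, s_j = s_j, s_i  # make s_i the larger witness set
    (g,) = shared
    biggest = max(rank[chrom] for _, _, chrom in s_j)
    if rank[g[2]] != biggest:
        return None
    return s_i, s_j - {g}

# Example 6: c3 precedes c2 in the order, so c2 receives the larger rank.
rank = {"c3": 0, "c2": 1}
s1 = {(7, "f2", "c2"), (9, "f8", "c2"), (8, "f8", "c3"), (9, "f2", "c3")}
s2 = {(8, "f4", "c2"), (9, "f8", "c2"), (1, "f4", "c3"), (2, "f8", "c3"),
      (3, "f4", "c3"), (5, "f8", "c3")}

orthologous, paralogous = split_weak_bond(s1, s2, rank)
print(sorted(paralogous))  # [(7, 'f2', 'c2'), (8, 'f8', 'c3'), (9, 'f2', 'c3')]
```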

Syntenic TeamS algorithm

In this section, we present the SYNtenic TeamS (SYNS) algorithm, which computes δ-zones in multiple genomes. In previous work, gene teams between two chromosomes of sizes m and n are computed by an O((m + n) log²(m + n)) algorithm considering only one-to-one homologous relationships [32]. The approach of [20] solves the ordered gene clusters problem by proposing a directed acyclic graph model and an NP-hard longest path solution; its results contain maximal but also non-maximal orthologous clusters. Our approach relies on the same general sliding-window approach as [20]. However, we gain in time efficiency by sliding the window only around the positions of family members.

Given a set of families Σ and a predefined window size δ, we examine the neighborhoods of each family f ∈ Σ in all chromosomes. For all genes of f, including paralogous copies, we consider a neighborhood from –δ to +δ around them. This neighborhood is examined by a sliding window of size δ, and we form the sets of genes corresponding to the families present at a given window position. Sets that belong to different chromosomes are intersected to look for common gene content. The intersections define synteny conservation within the family neighborhoods according to the definitions in the Methods section.

We further look for transitivity among δ-syntenies and build δ-zones. To do this, we search for overlaps among witnesses of δ-clusters. If the witness intersection size is > 1, the δ-syntenies are agglomerated to form one δ-zone. Three different cases, corresponding to the phylogenetic topologies shown in figure 5, are considered for solving the weak bond problem. Let S_i and S_j be the two witnesses connected by a weak bond; we sort the genes of these witnesses according to the ≺ order of speciation. If the witness of a weak bond occurs in the biggest species according to ≺, or if there is no other witness from a bigger species, then we consider that the two clusters define a valid δ-zone (cases A and B in figure 5). Case C in figure 5 shows the situation where forming a δ-zone cannot be justified from the evolutionary perspective. For cases A and B we continue to search for paralogous gene clusters. We gather all maximal δ-zones in the final result.
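The local sliding-window step can be sketched as follows, using chromosomes c₁, c₂ and c₃ of Example 4 and the family f₈. This is a simplified illustration rather than the SYNS implementation: it assumes that only the windows of δ consecutive positions containing a member of the family are examined, it intersects family sets rather than gene sets, and it stops before the δ-synteny and δ-zone steps. All function names are hypothetical.

```python
import re

def parse_chromosome(text):
    """Read notation such as 'f1**f4' into 1-based (position, family) couples."""
    tokens = re.findall(r"f\d+|\*", text)
    return [(pos, fam) for pos, fam in enumerate(tokens, start=1) if fam != "*"]

def windows_around(chromosome, family, delta):
    """Family sets seen by each window of delta positions covering a member of `family`."""
    seen = []
    for pos, fam in chromosome:
        if fam != family:
            continue
        for start in range(pos - delta + 1, pos + 1):
            seen.append(frozenset(
                f for p, f in chromosome if start <= p < start + delta))
    return seen

def local_clusters(chrom_a, chrom_b, family, delta):
    """Non-trivial family sets shared by windows of two different chromosomes."""
    clusters = set()
    for window_a in windows_around(chrom_a, family, delta):
        for window_b in windows_around(chrom_b, family, delta):
            common = window_a & window_b
            if len(common) >= 2:
                clusters.add(common)
    return clusters

c1 = parse_chromosome("f1**f4*f6f6f7f8f9")
c2 = parse_chromosome("*f5*f3f6*f2f4f8f7f9")
c3 = parse_chromosome("f4f8f4f7f8f8*f8f2")

print([sorted(c) for c in local_clusters(c1, c2, "f8", 3)])
print([sorted(c) for c in local_clusters(c1, c3, "f8", 3)])
```

Because every examined window contains a member of the family, the number of windows grows with the number of family copies rather than with the chromosome length, which is the source of the efficiency gain described above.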

Time complexity

Table 1 shows the comparative time complexity analysis of our approach and other existing ortholog detection algorithms for the cases where such information is available. In the SYNS algorithm, we consider that one homologous family f may appear in at most c × t locations in all genomes, where c is the total number of chromosomes and t is the maximal number of paralogous copies. Given that we explore neighborhoods of size 2 × δ + 1, the number of all windows of size δ for f is (δ + 1) × c × t. The computation of all witnesses for a given family takes O(((δ + 1) × c × t)²). If in this computation all the possible intersections are non-empty, then in the worst case we obtain for f a set of δ-clusters of size ((δ + 1) × c × t)². This implies that the δ-synteny computation takes O(((δ + 1) × c × t)⁴), which is repeated for all families f ∈ Σ.
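To give a feeling for these bounds, the worst-case counts can be evaluated for concrete parameter values. The values of c and t below are purely hypothetical and chosen for illustration; only the formulas come from the analysis above.

```python
def syns_worst_case(delta, c, t):
    """Worst-case counts per family, following the formulas of the analysis."""
    windows = (delta + 1) * c * t
    return {
        "windows": windows,               # (delta + 1) * c * t
        "witness_steps": windows ** 2,    # pairwise window intersections
        "synteny_steps": windows ** 4,    # pairwise comparison of clusters
    }

# Hypothetical setting: delta = 7, two chromosomes, three paralogous copies.
bounds = syns_worst_case(delta=7, c=2, t=3)
print(bounds)
# {'windows': 48, 'witness_steps': 2304, 'synteny_steps': 5308416}
```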

Table 1 Comparison of time complexity of OrthoMCL, GCFinder and SYNS. All experiments have been run on a dual-core Intel Xeon 2.33 GHz server. Results are also available for MultiParanoid (approx. 2 hours run time) and CoCo-CL (approx. 3 hours run time), for which no time complexity is found in the literature.

Evaluation of results

The Genolevures database provides families of proteins across the phylum of hemiascomycetous yeasts. To evaluate the performance of our algorithm, we have executed it on the existing families from the Genolevures Release 3 Candidate 3 (2008-09-24) [29, 33], with 4949 families covering 25196 protein coding genes from five protoploid yeast species [1].

Comparison with other methods

The critical window-size parameter δ of SYNS was set to 7 for all experiments. This value was chosen in order to match our results with the previously defined and expert validated orthologous subgroups [1]. We have compared the orthologous groups obtained by SYNS on the yeast data to those obtained by the following methods: Coco-cl [34], MultiParanoid [35] and OrthoMCL [36]. OrthoMCL [36] was run with the default inflation index = 1.5, e-value cut-off = –5 and percent match cut-off = 50, starting from input fasta files. Coco-cl was run recursively starting with fasta sequences, with bootstrap threshold score = 1 and split score = 0.5, and using ClustalW for multiple sequence alignment. MultiParanoid was run using default parameters (no cut-off and no duplicate appearance of a gene in clusters), using the BLOSUM62 matrix for Blast alignments. Table 2 shows the total number of classified proteins and the total number of orthologous groups detected by SYNS and these algorithms, using the original Genolevures families as a baseline [33]. Compared with the SONS method, SYNS classifies a comparable number of proteins but generates more orthologous groups, implying that these groups are more fine-grained.

We compare the orthologous groups between the SYNS method and those obtained by other algorithms in table 3. To compare two classifications we first look at how many groups are identical between two methods (Id column) and compute the similarity value (between 0 and 1) over the intersection of the covered protein sets (for the definition see [33]). Second, we analyze the differences between two classifications. For these we report the number of proteins that are classified only by SYNS (SYNS column) and the number classified only by a given method (meth. column). The remaining differences are classified according to granularity: a split when a group obtained by a given method is split into multiple subgroups by the SYNS algorithm, a merge in the opposite case, and messy when the split/merge relationship is complicated. We further analyze the differences with respect to the SONS classification case by case (available at http://www.cbib.u-bordeaux2.fr/redmine/projects/syns/files). We have found that in the case of splits between the resulting groups (50 groups in table 3), the more fine-grained groups obtained by the SYNS algorithm are in general more functionally relevant. For the cases of merges (141 groups) and messy events (70 groups) there is no clear-cut qualitative difference. However, for these 211 cases more functionally plausible groups can be obtained by SYNS when using a smaller window size δ = 5. Overall, the SYNS method appears to be the best match with the curated SONS results [1], while relying on clear mathematical definitions and having a satisfactory running time.
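The granularity categories used in this comparison can be illustrated with a small sketch. The exact criteria follow [33] and are not restated here, so the rules below are assumptions: a method group is identical if it also occurs among the SYNS groups, a split if it equals the union of several SYNS groups, part of a merge if several method groups together equal one SYNS group, and messy otherwise. The function name and the toy protein labels are hypothetical.

```python
def compare_classifications(method_groups, syns_groups):
    """Count identical, split, merge and messy events between two classifications."""
    method_groups = [frozenset(g) for g in method_groups]
    syns_groups = [frozenset(g) for g in syns_groups]
    counts = {"identical": 0, "split": 0, "merge": 0, "messy": 0}

    # A merge: one SYNS group is exactly the union of several method groups.
    merged = set()
    for syns in syns_groups:
        parts = [m for m in method_groups if m & syns]
        if len(parts) > 1 and frozenset().union(*parts) == syns:
            counts["merge"] += 1
            merged.update(parts)

    for group in method_groups:
        if group in merged:
            continue
        parts = [s for s in syns_groups if s & group]
        if group in syns_groups:
            counts["identical"] += 1
        elif len(parts) > 1 and frozenset().union(*parts) == group:
            counts["split"] += 1
        else:
            counts["messy"] += 1
    return counts

method = [{"a", "b"}, {"c", "d", "e"}, {"f"}, {"g"}, {"h", "i", "j"}, {"k"}]
syns = [{"a", "b"}, {"c"}, {"d", "e"}, {"f", "g"}, {"h", "i"}, {"j", "k"}]
result = compare_classifications(method, syns)
print(result)  # {'identical': 1, 'split': 1, 'merge': 1, 'messy': 2}
```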

Table 2 Comparisons of SYNS and other classifications with the existing family structure as baseline.

Table 3 Comparison of different computations of orthologous clusters with SYNS results on the Genolevures data. Each line compares a given method with SYNS; we report the number of genes classified only by the given method (meth), only by the SYNS algorithm (SYNS), the similarity value (sim) between two cluster sets (varying between 0 and 1 as defined in [33]), the number of genes that appear as singletons, the number of splits and merges between two cluster sets, as well as the number of unclassifiable cases (messy).

Analysis of two protein families

We illustrate the functional relevance of the SYNS algorithm by considering the classification of the Pdrp (pleiotropic drug resistance transporter proteins) subfamily performed in [6]. This is a subset of the PDR proteins from the GL3C0025 Genolevures family (60 proteins in total). We compare this manual analysis with the results obtained automatically by the SONS and SYNS algorithms.

The seven SONS groups, six SYNS groups and seven groups obtained by manual curation provide hypotheses on the evolution of this protein family. The manually curated orthologous groups are confirmed by gene cluster analysis, but in some cases the results differ. Groups P₁ through P₄ in table 4 denote four orthologous groups over five species annotated in [6] according to their S. cerevisiae members, namely the Pdr12p group (P₃, 5 members), the Snq2p group (P₁ + P₂, 5+4=9 members) and the Pdr5p/15p group (P₄, 3 members). Groups P₅ through P₇ in table 4 contain genes whose relationship to Pdr5p/15p is based on phylogenetic evidence only [6]. Three tandem gene repeats appear in ERGO (Eremothecium gossypii), KLLA (Kluyveromyces lactis) and SAKL (Saccharomyces kluyveri) and are found in a similar neighborhood [6] in groups P₁ and P₂.

Compared with the SONS classification, our approach proposes a more conservative classification of these proteins into orthologous groups. Indeed, SONS excludes ZYRO0D17710g from the Snq2/YNR070w phylogenetic cluster, while re-grouping the remaining proteins belonging to P₁ and P₂. Moreover, according to [6], SAKL0F04312g belongs to the Aus1p/Pdr11p group, which has no shared neighborhood in the five pre-WGD species. Thus, it is not surprising that this gene is missing in the SYNS classification (the SONS algorithm classifies it in an independent group, not shown in table 4).

A similar analysis is done for the GL3C0026 family, which has 57 members and four different functionally annotated groups. Figure 6 illustrates the evolutionary pattern based on the combination of phylogenetic analysis and functional annotations of this family. The SONS algorithm produces 7 orthologous gene clusters, while SYNS generates 8 clusters that are functionally more relevant. Both SONS and SYNS successfully classify the L-ornithine transaminase (OTAse) group (with the S. cerevisiae member YLR438w CAR2). However, the SONS classification fails to distinguish the YGR019w UGA1 Gamma-aminobutyrate (GABA) transaminase group from the YNR058w amino-pelargonic acid aminotransferase (DAPA) group. In contrast, the SYNS method separates the cluster containing the YGR019w UGA1 gene according to its functional annotation. Our algorithm also succeeds in correctly distinguishing the single ortholog gene clusters from the YGR019w UGA1 group. For the YOL140w ARG8 Acetylornithine aminotransferase group, both the SONS and SYNS algorithms provide similar conserved gene clusters. However, SONS erroneously mixes some genes of this group with the YGR019w UGA1 cluster and the YNR058w BIO3 cluster, whereas the SYNS algorithm succeeds in distinguishing them. The combined functional annotations and neighborhood analysis support the evolutionary pattern illustrated in figure 6 for the GL3C0026 family. Therefore we can conclude that the final δ-zones in our algorithm may preserve functionally meaningful conserved gene clusters.

Table 4 Comparisons of orthologous clusters subdividing the Pdrp Genolevures family. The Pdrp Genolevures family GL3C0025 as analysed by a) SONS results, b) SYNS results, c) after manual curation. The comparisons have been performed over the same sets of genes as in figure 3 in [6] for the Pdrp "sensu stricto" proteins subset of the GL3C0025 family.

Conclusion

The double goal of this study is to identify locally conserved gene clusters and to use them in order to subdivide an existing family structure into orthologous groups. To this end, we define a model for unordered local synteny and propose an algorithm to identify conserved gene clusters and their division into orthologous and paralogous clusters among multiple genomes. To validate our approach we have executed our method on the genomes of five Hemiascomycetous yeasts and examined the conserved non-overlapping gene clusters that arise from each homologous family of the Genolevures database [29]. Our approach shows 99% protein coverage for existing homologous groups.

We perform similar comparisons with the existing SONS groups [6] over the Genolevures families. The 90% similarity between our approach and the SONS groups indicates that our automatic method comes close to the manually curated results, especially since part of the differences between these groups can be explained by the non-classification of the paralogous conserved gene clusters by SONS. This confirms the pertinence of our definition of conserved neighborhoods based on transitivity and phylogenetic constraints. The SYNS method makes it possible to distinguish between orthologous and paralogous conserved gene clusters and thus to include tandem repeats as well as losses, fusions or transpositions of gene copies in chromosomal rearrangements of genomes. This implies that the proposed sliding-window and partial-traversal approach efficiently produces biologically relevant conserved gene clusters and corresponding orthologous groups with O(n((δ + 1) × c × t)⁴) worst-case complexity, for a pre-defined window size δ.

References

Consortium G: Comparative genomics of protoploid Saccharomycetaceae. Genome Res 2009, 19(10):1696–709.
Article Google Scholar
Ermolaeva M: Operon finding in bacteria. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics 2005, 2886–2891.
Google Scholar
Snel B, Bork P, Huynen M: The identification of functional modules from the genomic association of gene. Proc Natl Acad Sci USA 2002, 99(9):5890–5895. 10.1073/pnas.092632599
Article PubMed CAS PubMed Central Google Scholar
Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: A fingerprint of proteins that physically interact. Trends Biochem. Sci. 1998, 23(9):324–328. 10.1016/S0968-0004(98)01274-2
Article PubMed CAS Google Scholar
Bergeron A, Blanchette M, Chateau A, Chauve C: Reconstructing Ancestral Gene Orders Using Conserved Intervals. In WABI. Volume 3240. Edited by: Jonassen I KJ, Lecture Notes in Computer Science. Springer; 2004:14–25.
Google Scholar
Seret ML, Diffels JF, Goffeau A, Baret PV: Combined phylogeny and neighborhood analysis of the evolution of the ABC transporters conferring multiple drug resistance in hemiascomycete yeasts. BMC genomics 2009, 10(459):1–11.
Google Scholar
Fitch WM: Distinguishing homologous from analogous proteins. Syst. Zool 1970, 19: 99–113. 10.2307/2412448
Article PubMed CAS Google Scholar
Fitch W: Homology a personal view on some of the problems. Trends Genet 2000, 16: 227–231. 10.1016/S0168-9525(00)02005-9
Article PubMed CAS Google Scholar
Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV: Selection in the evolution of gene duplications. Genome Biol 2002, 3: RESEARCH0008.
Article PubMed PubMed Central Google Scholar
Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science 2000, 290: 1151–1155. 10.1126/science.290.5494.1151
Article PubMed CAS Google Scholar
Lynch M, Force A: The probability of duplicated gene preseration by subfunctionalization. Genetics 2000, 154: 459–473.
PubMed CAS PubMed Central Google Scholar
Ohno S: Evolution be gene duplication. New York: Springer; 1970.
Chapter Google Scholar
He X, Goldwasser MH: Identifying conserved gene clusters in the presence of homology families. Journal of computational biology 2005, 12(6):638–656. 10.1089/cmb.2005.12.638
Article PubMed CAS Google Scholar
Bansal AK: An automated comparative analysis of 17 complete microbial genomes. Bioinformatics 1999, 15: 900–908. 10.1093/bioinformatics/15.11.900
Article PubMed CAS Google Scholar
Goldberg D, McCouch S, Kleinberg J: Algorithms for constructing comparative maps. In Comparative Genomics. Edited by: Shankoff D, Nadeau JH. NL: Kluwer Academic Press; 2000:281–294.
Google Scholar
Housworth EA, Postlethwait J: Measures of synteny conservation between species pairs. Genetics 2002, 162: 441–448.
PubMed CAS PubMed Central Google Scholar
Nadeau JH, Sankoff D: Counting on comparative maps. Trends Genet 1998, 14(12):495–501. 10.1016/S0168-9525(98)01607-2
Tamames J: Evolution of gene order conservation in prokaryotes. Genome Biol 2001, 2(6):0020.1–11.
Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature 1997, 387: 708–713. 10.1038/42711
Yang Q, Yi G, Zhang F, Thon MR, Sze SH: Identifying gene clusters within localized regions in multiple genomes. Journal of Computational Biology 2010, 17(5):657–668. 10.1089/cmb.2009.0116
Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. PNAS 1999, 96: 2896–2901. 10.1073/pnas.96.6.2896
Vandepoele K, Saeys Y, Simillion C, Raes J, Peer YVD: The automatic detection of homologous regions (ADHoRe) and its application to microcolinearity between arabidopsis and rice. Genome Research 2002, 12(11):1792–1801. 10.1101/gr.400202
Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis. Science 2000, 290: 2114–2117. 10.1126/science.290.5499.2114
Hoberman R, Sankoff D, Durand D: The Statistical Significance of Max-Gap Clusters. In Comparative Genomics. Volume 3388 of Lecture Notes in Computer Science. Edited by: Lagergren J. Springer Berlin/Heidelberg; 2005:55–71. 10.1007/978-3-540-32290-0_5
Parida L: Gapped permutation pattern discovery for gene order comparisons. J. Comput. Biol. 2007, 14: 45–55. 10.1089/cmb.2006.0103
Heber S, Stoye J: Algorithms for finding gene clusters. Lect. Notes Comput. Sci 2001, 2149: 252–263. 10.1007/3-540-44696-6_20
Kim S, Choi JH, Yang J: Gene teams with relaxed proximity constraint. Proc. IEEE Comput. Sys. Bioinformatics Conf. 2005, 44–55.
Bergeron A, Corteel S, Raffinot M: The algorithmic of gene teams. In Proc. 2nd Annual Workshop on Algorithms in Bioinformatics (WABI), Volume 2452 of Lecture Notes in Computer Science. New York: Springer-Verlag; 2002:464–476.
Sherman DJ, Martin T, Nikolski M, Cayla C, Souciet J, Durrens P: Genolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes. Nucleic Acids Research 2009, 37(Database issue):D550-D554.
Didier G: Common intervals of two sequences. Lect. Notes Comput. Sci 2003, 2812: 17–24. 10.1007/978-3-540-39763-2_2
Schmidt T, Stoye J: Quadratic time algorithms for finding common intervals in two or more sequences. Lect. Notes Comput. Sci 2004, 3109: 347–358. 10.1007/978-3-540-27801-6_26
Beal MP, Bergeron A, Corteel S, Raffinot M: An algorithmic view of gene teams. Theoret. Comput. Sci 2004, 320(2–3):395–418. 10.1016/j.tcs.2004.02.036
Nikolski M, Sherman D: Family relationships: should consensus reign? - consensus clustering for protein families. Bioinformatics 2007, 23(2):e71-e76. 10.1093/bioinformatics/btl314
Jothi R, Zotenko E, Tasneem A, Przytycka TM: COCO-CL: hierarchical clustering of homology relations based on evolutionary correlations. Bioinformatics 2006, 22(7):779–788. 10.1093/bioinformatics/btl009
Alexeyenko A, Tamas I, Liu G, Sonnhammer E: Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 2006, 22(14):e9-e15. 10.1093/bioinformatics/btl213
Li L, Stoeckert C, Roos D: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13(9):2178–89. 10.1101/gr.1224503

Acknowledgements

The authors would like to thank Pascal Durrens for constructive discussions and help with the analysis of the biological relevance of the results.

This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S9.

Author information

Authors and Affiliations

LaBRI, CNRS/Université Bordeaux 1, 351 crs Libération, 33405, Talence, France
Anasua Sarkar & Macha Nikolski
Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands
Hayssam Soueidan

Corresponding author

Correspondence to Anasua Sarkar.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Conceived and designed the experiments: AS, MN. Performed the experiments and analyzed the data: AS. Wrote the paper: AS, HS, MN.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Sarkar, A., Soueidan, H. & Nikolski, M. Identification of conserved gene clusters in multiple genomes based on synteny and homology. BMC Bioinformatics 12 (Suppl 9), S18 (2011). https://doi.org/10.1186/1471-2105-12-S9-S18

Published: 05 October 2011
DOI: https://doi.org/10.1186/1471-2105-12-S9-S18
