Volume 12 Supplement 9

Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics

Open Access

Identification of conserved gene clusters in multiple genomes based on synteny and homology

BMC Bioinformatics 2011, 12(Suppl 9):S18

DOI: 10.1186/1471-2105-12-S9-S18

Published: 5 October 2011

Abstract

Background

Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. We build upon the series of works on gene teams and homology teams.

Results

Our primary contribution is the SYNS (SYNtenic teamS) algorithm, which refines an existing family structure into orthologous sub-families by analyzing the neighborhoods around the members of a given family with a locally sliding window. The neighborhood analysis is done by computing conserved gene clusters. We evaluate our algorithm on the existing homologous families from the Genolevures database over five genomes of the Hemiascomycete phylum.

Conclusions

The result is an efficient algorithm that works on multiple genomes, considers paralogous copies of genes and is able to uncover orthologous clusters even in distant genomes. Resulting orthologous clusters are comparable to those obtained by manual curation.

Background

Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. It is often suggested that regions with similar gene content among different species are evidence for phylogenetic relationship and trace through evolution the inheritance of function from a common ancestor. Within one genome, the presence of large duplicated blocks may be due to an ancient large-scale or whole genome duplication, while the presence of segments with homologous genes in multiple genomes, named conserved gene clusters, more likely indicates an evolutionary constraint for a functionally related group. Our primary contribution is a local sliding-window algorithm that starts from an existing protein family classification and produces two results: first, conserved gene clusters, and second, a subdivision of families into orthologous subgroups. Our approach can be seen as using conserved gene clusters in order to sift through the family structure to uncover orthology. We evaluate the biological relevance of our approach on the example of Protoploid yeasts [1].

A number of studies indicate that regions of conserved homology among multiple species may result from functional pressure to keep these genes close, but they may also be conserved because the genomes under study have not sufficiently diverged. For the former, the best known examples are operons in prokaryotes [2], but also the existence of functional interactions [3] and similar expression patterns [4] in closely located genes. For the latter, the existence of conserved gene clusters is the computational basis for ancestral genome reconstruction [5] and the search for ancestral homologs among genes in the same family [6]. Orthologs are homologous genes related by speciation [7, 8] which retain the same functionality as their common ancestors. Homologous genes related by duplication within one lineage are called paralogs and generally differ in functionality [9–12]. A number of papers introduce algorithms to compute conserved gene clusters and orthologous groups, see for example [13–16]. These approaches vary on a number of parameters. First, there are authors who consider strictly conserved chromosomal segments with similar gene order and orientation [17–19]. Second come the approaches where one considers conserved contiguous regions but without co-linearity [13, 20]. Third, some authors relax the definition of conserved regions by allowing gaps [18, 21–24]. Fourth, paralogous gene copies within one chromosome are allowed in order to explore many-to-many homologous relationships [13, 25]. Finally, some authors study the effect of varying the gap between adjacent neighbors [24, 26, 27].

In this paper we start from the notion of gene teams introduced in [28]. This model allows only one copy of a gene on a given chromosome. We relax this restriction by following the approach of homology teams defined in [13]. Furthermore, we set the gap threshold not only for adjacent genes, but by requiring the distance for any two genes considered as being neighbors to be smaller than a certain threshold. A similar choice was made in [20]. We call the obtained gene clusters synteny teams.

Our SYNS (SYNtenic teamS) algorithm refines existing families into orthologous sub-families by analyzing the neighborhoods around the members of a given family with a locally sliding window. This is done for all pairs of chromosomes in multiple genomes on which family members appear. The pairwise conserved contiguous segments are agglomerated by relying on partial homology between segments and on the biological criteria introduced in [1]. This results in larger conserved segments that we call syntenic zones. We evaluate our algorithm on the existing homologous families for five genomes of the Hemiascomycete yeasts from the Genolevures database [29]. Indeed, there already exists a sub-classification of these families into orthologous sub-families [1] that has undergone expert validation and thus can be used as a reference point for the evaluation of the biological relevance of our results. We further illustrate the results of our method for the particular case of the Pdrp (pleiotropic drug resistance proteins) phylogenetic subfamily of ABC transporters that has been manually analyzed in [6].

Methods

In this section we define the notion of unordered conserved gene clusters that allows for paralogous copies and gaps on multiple genomes. Following the work of [20, 30, 31], we allow one homologous gene to appear more than once in one chromosome. We refine the approach of homology teams [13] by distinguishing between orthologous and paralogous copies of genes. Large syntenic zones are built by merging clusters based on the genes they have in common, instead of directly merging the ordered chains with overlapping families as in [32]. For mathematical notations and examples in a textual format we follow [28].

Definition 1 A chromosome is defined as a pair $c = (\Sigma, G)$, where $\Sigma = \{f_1, f_2, \ldots, f_m\}$ is the set of homologous families and $G = (g_1, g_2, \ldots, g_n)$ is an ordered sequence of genes. Each gene $g_i \in G$ is a couple $(p_i, f_i)$, where $p_i$ is the position of gene $g_i$ on $c$ and $g_i$ belongs to some homologous group $f_i \in \Sigma$.

Here, $\Sigma$ is the alphabet for any chromosome $c$ and $p_i$ is an integer. When it is necessary to indicate the chromosome to which a given gene belongs, this is done by a subscript: $(p_i, f_i)_c$.

Definition 2 Given a chromosome $c$ with two genes $g_i = (p_i, f_i)$ and $g_j = (p_j, f_j)$, the distance between $g_i$ and $g_j$ is defined by $\Delta(g_i, g_j) = |p_i - p_j|$.

Example 1 Let c1 and c2 be two chromosomes over the same alphabet Σ = {f1, f2, f3, f4} of homologous families, with genes on c1 being (1, f2), (2, f1), (4, f4), (7, f3), (8, f1), and on c2 being (1, f1), (2, f2), (3, f2), (4, f3), (6, f4). This is denoted by:

c1 = 〈f2f1*f4**f3f1〉,

c2 = 〈f1f2f2f3*f4〉.

Asterisks stand for genes that are unassigned to homologous groups; notice that * is not part of the alphabet Σ.

A gene subset $G' \subseteq G$ induces the subset of families $\Sigma'$, denoted by $\Sigma(G')$, such that $f_i \in \Sigma'$ if and only if there exists $g_i \in G'$ such that $g_i = (p_i, f_i)$. A set of genes $G'$ from the same chromosome forms a chromosomal segment $s = (\Sigma', G', c)$ with or without gaps. When it is clear from the context, we identify a set of genes $G'$ with the corresponding chromosomal segment.

For example, in the case of G′ = {(2, f1), (4, f4), (8, f1)} and alphabet Σ′ = Σ(G′) = {f1, f4}, G′ defines a chromosomal segment with gaps on c1 = 〈f2f1 * f4 * *f3f1〉. This segment G′ is non-contiguous on c1; the gaps correspond to (5, *), (6, *) and (7, f3).

Definition 3 A chromosomal segment $s = (\Sigma', G', c)$ is contiguous if, for any two genes $g_i = (p_i, f_i)$ and $g_j = (p_j, f_j)$ from $G'$ and any $p$ such that $p_i < p < p_j$, either the gene $g = (p, f)$ at position $p$ belongs to $G'$ or this position corresponds to an asterisk. Otherwise, the segment is said to be non-contiguous. For example, G′ = {(4, f4), (7, f3), (8, f1)} on c1 = 〈f2f1 * f4 * *f3f1〉 forms a contiguous segment.
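Definitions 1 to 3 can be made concrete with a small sketch. The names `Gene`, `parse_chromosome`, `distance` and `is_contiguous` are hypothetical helpers introduced here for illustration; the space-separated input format is likewise an assumption, not the authors' data format.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    pos: int      # position p_i on the chromosome
    family: str   # homologous family f_i, or "*" when unassigned


def parse_chromosome(text: str) -> list[Gene]:
    """Turn a space-separated string such as 'f2 f1 * f4' into positioned genes."""
    return [Gene(i, tok) for i, tok in enumerate(text.split(), start=1)]


def distance(a: Gene, b: Gene) -> int:
    """Definition 2: absolute difference of the two positions."""
    return abs(a.pos - b.pos)


def is_contiguous(segment: set[Gene], chromosome: list[Gene]) -> bool:
    """Definition 3: every position strictly inside the segment's span is
    either a gene of the segment or an asterisk."""
    lo = min(g.pos for g in segment)
    hi = max(g.pos for g in segment)
    return all(g in segment or g.family == "*"
               for g in chromosome if lo < g.pos < hi)


# Chromosome c1 of Example 1.
c1 = parse_chromosome("f2 f1 * f4 * * f3 f1")
gapped = {Gene(2, "f1"), Gene(4, "f4"), Gene(8, "f1")}    # skips (7, f3)
compact = {Gene(4, "f4"), Gene(7, "f3"), Gene(8, "f1")}   # only asterisks skipped
print(is_contiguous(gapped, c1), is_contiguous(compact, c1))
```

Checking only the span between the leftmost and rightmost gene is equivalent to checking every pair, since any position between two genes of the segment lies inside that span.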

Synteny teams

Two genes $g_i = (p_i, f_i)$ and $g_j = (p_j, f_j)$ on the same chromosome are considered to be neighbors when $\Delta(g_i, g_j) < \delta$ for a given threshold $\delta > 0$. For a gene $g_i$, we denote by $N_i$ the set of neighbor genes centered around it, that is $N_i = \{g_k = (p_k, f_k) \mid p_i - \delta/2 \le p_k \le p_i + \delta/2\}$.

Definition 4 A chromosomal segment $s$ is called a δ-segment if every pair of genes of $s$ is separated by a distance smaller than $\delta$, that is $s = \{g_i \mid \forall g_j \in s,\ \Delta(g_i, g_j) < \delta\}$. A window $w$ is a contiguous δ-segment.

Definition 5 We say that $\Sigma' \subseteq \Sigma$ is a δ-subset if there exists at least one δ-segment $s' = (\Sigma', G', c)$ such that $\Sigma' = \Sigma(G')$. We say that $s'$ is the witness of this δ-subset.

Example 2 For δ = 3, the δ-subsets on chromosome c2 = 〈f1f2f2f3 * f4〉 are the following:

- {f1, f2} as witnessed by ((1, f1), (2, f2)), ((1, f1), (3, f2)), and ((1, f1), (2, f2), (3, f2));

- {f2, f3} as witnessed by ((2, f2), (4, f3)), ((3, f2), (4, f3)), and ((2, f2), (3, f2), (4, f3));

- {f3, f4} as witnessed by ((4, f3), (6, f4)).
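Example 2 can be reproduced with a brute-force enumeration. This is a minimal sketch under the assumption that a chromosome is given as a list of (position, family) pairs with unassigned genes left out; the function name `delta_subsets` is hypothetical, and single-family subsets are skipped as trivial.

```python
from itertools import combinations


def delta_subsets(genes, delta):
    """Return every family set of size >= 2 induced by some delta-segment.

    A delta-segment is a gene set whose pairwise distances are all < delta,
    so it always fits in the span that starts at its leftmost gene.
    """
    genes = sorted(genes)
    found = set()
    for i, (start, _) in enumerate(genes):
        # Families of all genes within distance < delta of the leftmost gene.
        fams = sorted({f for p, f in genes[i:] if p - start < delta})
        # Any choice of families from this span is witnessed by a delta-segment.
        for size in range(2, len(fams) + 1):
            found.update(frozenset(c) for c in combinations(fams, size))
    return found


# Chromosome c2 of Example 1, asterisk at position 5 omitted.
c2 = [(1, "f1"), (2, "f2"), (3, "f2"), (4, "f3"), (6, "f4")]
for subset in sorted(delta_subsets(c2, 3), key=sorted):
    print(sorted(subset))
```

The enumeration is exponential in the number of families per span, which stays small for the window sizes discussed here.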

Definition 6 Let $\Sigma$ be the set of homologous families over a set of chromosomes $C$. We say that $\Sigma' \subseteq \Sigma$ is a δ-cluster if $\Sigma'$ is a δ-subset for all chromosomes in some $C' \subseteq C$, where $|C'| \ge 2$. We say that the set of genes
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_Equa_HTML.gif

witnesses the δ-cluster $\Sigma'$.

A witness $S$ is thus the set of all genes that participate in the segments witnessing the relevant δ-subsets. Let $\Sigma$ and $\Sigma'$ be two δ-clusters such that $\Sigma \cap \Sigma' \ne \emptyset$. Let $S$ and $S'$ be the corresponding witness sets. Denote by S and https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_IEq1_HTML.gif the sets of genes in each of these witness sets that are members of the families in $\Sigma \cap \Sigma'$.

Definition 7 A δ-cluster $\Sigma$ is said to be a δ-synteny if (a) the corresponding witness set $S$ has genes belonging to at least two different chromosomes and (b) there does not exist a δ-cluster $\Sigma'$ with a witness set $S'$ such that https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_IEq2_HTML.gif

Example 3 Let c1, c2 and c3 be chromosomes as shown in figure 1.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_Fig1_HTML.jpg
Figure 1

Example of δ-clusters and δ-syntenies. The non-trivial δ-syntenies for example 3 that cover c1, c2 and c3 are {f4, f5}, {f4, f1} and {f5, f1}; those that cover c1 and c2 are {f1, f2} and {f4, f5, f1}. Colors indicate homology relationships. Connections indicate the relevant δ-clusters.

c1 = 〈f3 * * f5 f4 f1 * f2 * f5 f4〉

c2 = 〈f1 f2 * * f3 * f4 f5 f1 * f5〉

c3 = 〈f2 * f3 * * f5 * f1 f4 * f5〉

Let δ = 3. We obtain the following non-trivial δ-clusters: {f4, f1}, {f5, f1}, {f4, f5, f1}, {f1, f2} and {f4, f5} between c1 and c2; and {f1, f5}, {f1, f4} and {f4, f5} between c1 and c3. The non-trivial δ-syntenies are {f4, f5}, {f1, f2}, {f4, f1}, {f5, f1} and {f4, f5, f1}.
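The pairwise δ-clusters of Example 3 follow from intersecting the δ-subsets of two chromosomes (Definition 6 with |C′| = 2). The sketch below is illustrative only: `parse`, `delta_subsets` and `delta_clusters` are hypothetical names, and the brute-force enumeration is not the SYNS algorithm itself.

```python
from itertools import combinations


def parse(text):
    """'f3 * * f5' -> [(1, 'f3'), (4, 'f5')]: asterisks occupy a position but are dropped."""
    return [(i, tok) for i, tok in enumerate(text.split(), start=1) if tok != "*"]


def delta_subsets(genes, delta):
    """Family sets (size >= 2) induced by gene sets with pairwise distance < delta."""
    found = set()
    for i, (start, _) in enumerate(genes):
        fams = sorted({f for p, f in genes[i:] if p - start < delta})
        for size in range(2, len(fams) + 1):
            found.update(frozenset(c) for c in combinations(fams, size))
    return found


def delta_clusters(chrom_a, chrom_b, delta):
    """Family sets that are delta-subsets on both chromosomes."""
    return delta_subsets(chrom_a, delta) & delta_subsets(chrom_b, delta)


c1 = parse("f3 * * f5 f4 f1 * f2 * f5 f4")
c2 = parse("f1 f2 * * f3 * f4 f5 f1 * f5")
c3 = parse("f2 * f3 * * f5 * f1 f4 * f5")

for name, pair in (("c1-c2", (c1, c2)), ("c1-c3", (c1, c3))):
    print(name, sorted(sorted(s) for s in delta_clusters(*pair, 3)))
```

With δ = 3 this yields five clusters between c1 and c2 and three between c1 and c3, in agreement with the lists given in the example.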

The superset inclusion in definition 7 implies that, for computational purposes, there is no need to consider the smaller of the two sets. In our algorithm, a synteny is therefore merged into another one whenever its witness is entirely contained in the witness of the other.

Example 4 Let c1, c2 and c3 be the three chromosomes shown in figure 2.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_Fig2_HTML.jpg
Figure 2

Merging of δ-syntenies. Merging among δ-syntenies for the chromosomes in example 4: when considering f8, the δ-cluster {f8, f9} is merged into the δ-synteny {f7, f8, f9}. Colors indicate the homology relationship. Connections indicate the relevant δ-clusters (only those relevant for merging are shown).

c1 = 〈f1 * * f4 * f6 f6 f7 f8 f9〉

c2 = 〈* f5 * f3 f6 * f2 f4 f8 f7 f9〉

c3 = 〈f4 f8 f4 f7 f8 f8 * f8 f2〉

Let δ = 3 and consider f8. Non-trivial δ-clusters are: {f7, f8, f9}, {f7, f8} and {f8, f9} between c1 and c2, {f7, f8} between c1 and c3, and {f8, f2}, {f7, f8} and {f8, f4} between c2 and c3. Therefore, we obtain the following non-trivial δ-syntenies: {f7, f8, f9}, {f7, f8}, {f8, f2} and {f8, f4}. Notice that the δ-cluster {f7, f8, f9} covers the witnesses of the δ-cluster {f8, f9}, but the witnesses of the δ-cluster {f7, f8} on chromosome c3 do not witness the δ-cluster {f7, f8, f9}. Therefore, we merge the δ-cluster {f8, f9} into the δ-synteny {f7, f8, f9}; however, {f7, f8} remains as a separate δ-synteny.

We have seen that a δ-synteny must contain the maximal δ-cluster with respect to subset inclusion. All δ-syntenies for a set of chromosomes C, with |C| ≥ 2, are included in the result. Such a synteny set is informally called a synteny team, following the terminology introduced in [28, 32] for gene teams.

Definition 8 Given a δ-synteny team https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_IEq3_HTML.gif we say that $\Sigma_i$ and $\Sigma_j$ are transitively connected if the witnesses $S_i$ and $S_j$ overlap, that is $|S_i \cap S_j| \ge 1$. We further define a δ-zone as a union of transitively connected δ-syntenies $\Sigma_i$ and $\Sigma_j$.

Example 5 Consider C = {c1, c2, c3} from example 4 and δ = 3. Suppose that we compute clusters in the neighborhood of f8. Non-trivial syntenies are the following: Σ1 = {f7, f8} for witness S1 = {(8, f7)c1, (9, f8)c1, (9, f8)c2, (10, f7)c2, (2, f8)c3, (4, f7)c3, (5, f8)c3, (6, f8)c3} and Σ2 = {f4, f8} for witness S2 = {(8, f4)c2, (9, f8)c2, (1, f4)c3, (3, f4)c3, (2, f8)c3}. Notice that Σ1 ∩ Σ2 = {f8} ≠ ∅ and S1 ∩ S2 = {(9, f8)c2, (2, f8)c3} ≠ ∅. We obtain one non-trivial δ-zone {f4, f8, f7} by agglomerating the δ-syntenies Σ1 and Σ2 based on transitivity (see figure 3). Notice that this leaves the gene (8, f8)c3 out of the δ-zone. The transitivity relationship in the SYNS algorithm combines each pair of δ-syntenies sharing at least one witness into one δ-zone. The notion of a δ-zone aims at uncovering even distant evolutionary relationships based on conservation of gene content within neighborhoods. It is slightly amended based on the following two considerations.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_Fig3_HTML.jpg
Figure 3

Agglomeration of δ-syntenies in δ-zones. Example of the agglomeration of δ-syntenies by transitivity (see example 5): δ-clusters Σ1 = {f7, f8} and Σ2 = {f4, f8} are witnessed by gene sets indicated by ✰ and • symbols, respectively. They are merged in one δ-zone {f4, f8, f7} based on the witness intersection {(9, f8)c2, (2, f8)c3}.

  1. Several paralogous genes may exist on the same chromosome. When two or more paralogs appear within one window of size δ, we include them in the same witness set of a δ-synteny since it is not possible to computationally distinguish between them.

  2. It may happen that two distinct δ-syntenies share only one paralogous gene. This is what we call a weak bond. Creating a δ-zone based on a single gene intersection may either lead to a δ-zone that is phylogenetically valid or may create an erroneous result (see [6]).

Definition 9 Given a δ-synteny team https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_IEq4_HTML.gif and its witness set $S = \{S_i, S_j\}$, we say that $S$ forms a weak bond if $|S_i \cap S_j| = 1$. We further define $g = S_i \cap S_j$ to be the witness of a weak bond.
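Definitions 8 and 9 can be sketched with a union-find structure over witness sets. This is an illustration under stated assumptions rather than the SYNS implementation: genes are encoded as hypothetical (position, family, chromosome) tuples, `agglomerate` is an invented name, and every overlap is merged while single-gene overlaps are reported separately so that they can be examined afterwards. Gene positions are read from the chromosome listings of example 4.

```python
def agglomerate(witnesses):
    """Group synteny indices whose witness sets overlap (Definition 8) and
    report overlaps of exactly one gene as weak bonds (Definition 9)."""
    parent = list(range(len(witnesses)))

    def find(i):
        # Follow parent links to the representative, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    weak_bonds = []
    for i in range(len(witnesses)):
        for j in range(i + 1, len(witnesses)):
            shared = witnesses[i] & witnesses[j]
            if shared:
                parent[find(i)] = find(j)
                if len(shared) == 1:
                    weak_bonds.append((i, j, next(iter(shared))))

    zones = {}
    for i in range(len(witnesses)):
        zones.setdefault(find(i), []).append(i)
    return sorted(zones.values()), weak_bonds


# Witnesses of {f7, f8} and {f4, f8} from example 5.
s_f7f8 = {(8, "f7", "c1"), (9, "f8", "c1"), (9, "f8", "c2"), (10, "f7", "c2"),
          (2, "f8", "c3"), (4, "f7", "c3"), (5, "f8", "c3"), (6, "f8", "c3")}
s_f4f8 = {(8, "f4", "c2"), (9, "f8", "c2"), (1, "f4", "c3"), (3, "f4", "c3"),
          (2, "f8", "c3")}
# A witness of {f8, f2} on c2 and c3, which meets s_f4f8 in a single gene.
s_f8f2 = {(7, "f2", "c2"), (9, "f8", "c2"), (8, "f8", "c3"), (9, "f2", "c3")}

print(agglomerate([s_f7f8, s_f4f8]))
print(agglomerate([s_f4f8, s_f8f2]))
```

The first call merges the two syntenies through their two shared genes and reports no weak bond; the second call reports (9, f8) on c2 as the witness of a weak bond.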

The δ-zone $\{\Sigma_i, \Sigma_j\}$ resulting from a weak bond may be erroneous. We rely on phylogeny to solve this issue. We consider a total order $\prec$ over all the species under study, defined by phylogeny: $a \prec b$ if species $b$ has diverged from the common ancestor earlier than species $a$ ($\prec$ then corresponds to the relative speciation time). When no witness from $a$ other than $g$ exists, we split the erroneously obtained synteny in two parts: one that contains the orthology relationships within a given family $f$ and another one that keeps the supposed paralogs. The details of how this is done are presented in the Results section.

Definition 10 Let https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_IEq5_HTML.gif be a δ-synteny team over the witness set $S = \{S_i, S_j\}$ such that $|S_i| > |S_j|$ and let $g = S_i \cap S_j$ be the witness of a weak bond. If $g$ is from the biggest species according to $\prec$ in $S_j$, we say that $S_i$ witnesses a maximal orthologous δ-synteny $\Sigma_i$ and $S'_j = S_j \setminus g$ witnesses a paralogous δ-synteny $\Sigma_j$.

Example 6 Consider C′ = {c2, c3} from example 4 and figure 4, supposing that c3 ≺ c2, and consider neighborhoods around f8 with δ = 3. Two non-trivial δ-syntenies are connected by a weak bond: Σ1 = {f8, f2} with witness S1 = {(7, f2)c2, (9, f8)c2, (8, f8)c3, (9, f2)c3} and Σ2 = {f4, f8} with witness S2 = {(8, f4)c2, (9, f8)c2, (1, f4)c3, (2, f8)c3, (3, f4)c3, (5, f8)c3}. Indeed, {(9, f8)c2} is the witness of this weak bond. Since c3 ≺ c2, Σ2 is the maximal orthologous δ-synteny with witness S2, while Σ1 is the one with the paralogous copy of f8 (at position 9 on c2). The set S1 becomes https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_IEq6_HTML.gif Members of a family are split into orthologous and paralogous subsets present in different syntenies. At the end of our procedure, only the largest orthologous δ-zone and the non-intersecting paralogous δ-zones covering any given homologous family remain in the result.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_Fig4_HTML.jpg
Figure 4

Example of a weak bond. A weak bond among δ-syntenies considering family f8: {(9, f8)c2} is the witness of a weak bond between the δ-syntenies {f8, f2} and {f4, f8}. Colors indicate the homology relationship. Connections indicate the relevant δ-clusters. The crossed box indicates the witness of a weak bond.
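The split described in definition 10 can be sketched as follows. Several points are assumptions made for illustration: the function name `split_weak_bond`, the (position, family, species) encoding, a `rank` dictionary in which a larger value stands for a species that is "bigger" in the phylogenetic order, and the use of chromosome names as species labels, as in example 6. Gene positions follow the chromosome listings of example 4.

```python
def split_weak_bond(s_i, s_j, rank):
    """Return (orthologous witness, paralogous witness) when definition 10
    applies, or None otherwise.

    s_i, s_j: witness sets sharing exactly one gene g.
    rank: maps a species label to its place in the assumed total order.
    """
    if len(s_i) < len(s_j):
        s_i, s_j = s_j, s_i                  # make s_i the larger witness
    (g,) = s_i & s_j                         # raises ValueError unless one shared gene
    biggest = max((species for _, _, species in s_j), key=rank.__getitem__)
    if g[2] == biggest:
        return s_i, s_j - {g}                # g is removed from the smaller witness
    return None


s1 = {(7, "f2", "c2"), (9, "f8", "c2"), (8, "f8", "c3"), (9, "f2", "c3")}
s2 = {(8, "f4", "c2"), (9, "f8", "c2"), (1, "f4", "c3"), (2, "f8", "c3"),
      (3, "f4", "c3"), (5, "f8", "c3")}
rank = {"c3": 0, "c2": 1}                    # c3 precedes c2 in the assumed order

orthologous, paralogous = split_weak_bond(s1, s2, rank)
print(sorted(paralogous))
```

Here the larger witness S2 is kept as the orthologous synteny and the shared gene (9, f8) on c2 is dropped from S1, which matches the outcome stated in example 6.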

Syntenic TeamS algorithm

In this section, we present the SYNtenic teamS (SYNS) algorithm, which computes δ-zones in multiple genomes. In previous work, gene teams between two chromosomes of size m and n are computed by an $O((m + n)\log^{2}(m + n))$ algorithm considering only one-to-one homologous relationships [32]. The approach by [20] solves the ordered gene clusters problem by proposing a directed acyclic graph model and an NP-hard longest path solution; its results contain maximal but also non-maximal orthologous clusters. Our approach relies on the same general sliding-window approach as [20]. However, we gain in time efficiency by sliding the window only around the positions of family members.

Given a set of families Σ and a predefined window size δ, we examine the neighborhoods of each family f ∈ Σ in all chromosomes. For all genes of f, including paralogous copies, we consider a neighborhood from −δ to +δ around them. This neighborhood is examined by a sliding window of size δ and we form sets of genes corresponding to families in a given window position. These sets are intersected to look for common gene content if they belong to different chromosomes. The intersections define synteny conservation within the family neighborhoods by using the definitions in the Methods section.

We further look for transitivity among δ-syntenies and build δ-zones. To do this, we search for overlaps among witnesses of δ-clusters. If the witness intersection size is > 1, then the δ-syntenies are agglomerated to form one δ-zone. Three different cases, corresponding to the phylogenetic topologies shown in figure 5, are considered for solving the weak bond problem. Let $S_i$ and $S_j$ be the two witnesses connected by a weak bond; we sort the genes of these witnesses according to the order of speciation. If the witness of a weak bond occurs in the biggest species according to $\prec$, or if there is no other witness from a bigger species, then we consider that the two clusters define a valid δ-zone (cases A and B in figure 5). Case C in figure 5 shows the situation where forming a δ-zone cannot be justified from the evolutionary perspective. For cases A and B we continue to search for paralogous gene clusters. We gather all maximal δ-zones in the final result.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_Fig5_HTML.jpg
Figure 5

Three topologies for a weak bond. Examples of topologies where two δ-syntenies $\Sigma_i$ and $\Sigma_j$ with witness sets $S_i$ and $S_j$ have a weak bond. Species are ordered by phylogenetic order $Sp_i \prec Sp_{i+1}$. Cases A (g1 is the witness of the weak bond) and B (g2 is the witness of the weak bond) are considered to be plausible from the evolutionary perspective, while case C (g2 is the witness of the weak bond) is difficult to explain. Different colors represent the orthologous and paralogous δ-syntenies emerging from these cases. Vertical links represent synteny while horizontal arrows represent the weak bond.

https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S18/MediaObjects/12859_2011_Article_4825_Equb_HTML.gif
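The window step of the algorithm can be illustrated on the chromosomes of example 4. The sketch below is a simplification under explicit assumptions: only the windows of δ consecutive positions that contain a family member are examined, window contents are compared as whole family sets, and the names `parse`, `windows_around` and `shared_content` are hypothetical. It does not include the merging, transitivity or weak bond steps.

```python
def parse(text):
    """'f1 * * f4' -> [(1, 'f1'), (4, 'f4')]: asterisks occupy a position but are dropped."""
    return [(i, tok) for i, tok in enumerate(text.split(), start=1) if tok != "*"]


def windows_around(genes, family, delta):
    """For each member of `family`, yield the gene content of every window of
    `delta` consecutive positions that contains this member."""
    for p, f in genes:
        if f != family:
            continue
        for start in range(p - delta + 1, p + 1):
            yield frozenset((q, g) for q, g in genes if start <= q < start + delta)


def shared_content(chrom_a, chrom_b, family, delta):
    """Family sets of size >= 2 common to a window on each chromosome."""
    shared = set()
    for win_a in windows_around(chrom_a, family, delta):
        for win_b in windows_around(chrom_b, family, delta):
            common = {f for _, f in win_a} & {f for _, f in win_b}
            if len(common) >= 2:
                shared.add(frozenset(common))
    return shared


c1 = parse("f1 * * f4 * f6 f6 f7 f8 f9")
c2 = parse("* f5 * f3 f6 * f2 f4 f8 f7 f9")
c3 = parse("f4 f8 f4 f7 f8 f8 * f8 f2")

print(sorted(sorted(s) for s in shared_content(c1, c2, "f8", 3)))
print(sorted(sorted(s) for s in shared_content(c1, c3, "f8", 3)))
```

For family f8 and δ = 3 the sketch finds {f7, f8}, {f8, f9} and {f7, f8, f9} between c1 and c2, and {f7, f8} between c1 and c3, as listed in example 4. Restricting the windows to the positions of family members is what keeps the number of examined windows proportional to the number of family copies rather than to chromosome length.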

Time complexity

Table 1 shows the comparative time complexity analysis of our approach and other existing ortholog detection algorithms for the cases where such information is available. In the SYNS algorithm, we consider that one homologous family f may appear in at most c × t locations in all genomes, where c is the total number of chromosomes and t is the maximal number of paralogous copies. Given that we explore neighborhoods of size 2 × δ + 1, the number of all windows of size δ for f is (δ + 1) × c × t. The computation of all witnesses for a given family takes $O(((\delta + 1) \times c \times t)^2)$. If all the possible intersections in this computation are non-empty, then in the worst case we obtain for f a set of δ-clusters of size $((\delta + 1) \times c \times t)^2$. This implies that the δ-synteny computation takes $O(((\delta + 1) \times c \times t)^4)$, which is repeated for all families f ∈ Σ.
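The counting argument can be laid out step by step. The shorthand W for the number of windows per family is introduced here for readability only, and reading each squared term as an all-pairs comparison is an interpretation of the text, not a statement taken from it.

```latex
\begin{align*}
W &= (\delta + 1)\, c\, t
  && \text{windows examined for one family } f \\
T_{\text{witness}} &= O(W^{2})
  && \text{all pairs of windows are intersected} \\
\#\{\delta\text{-clusters of } f\} &\le W^{2}
  && \text{worst case: every intersection is non-empty} \\
T_{\text{synteny}} &= O\!\left((W^{2})^{2}\right) = O(W^{4})
  && \text{all pairs of } \delta\text{-clusters are compared} \\
T_{\text{SYNS}} &= O(n\, W^{4}) = O\!\left(n\,((\delta + 1)\, c\, t)^{4}\right)
  && \text{repeated for each of the } n \text{ families}
\end{align*}
```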
Table 1

Comparison of time complexity of OrthoMCL, GCFinder and SYNS. All experiments have been run on the dual-core Intel Xeon 2.33 GHz server. Results are also available for MultiParanoid (approx. 2 hours run time) and CoCo-CL (approx. 3 hours run time), for which no time complexity is found in the literature.

| Method | Time complexity | Execution time | Notations used |
|---|---|---|---|
| OrthoMCL | $O(Nk^{2})$ | 76 min (excluding Blast) | N = #genes, k = pruning constant |
| GCFinder (ordered) | $O(nd(k + d))$ | 1546 min | n = #families, k = #genes, d = window size |
| GCFinder (unordered) | $O(k^{2} nD(tD + 1)^{k-1})$ | interrupted after 3 days | n, k as above; D = max #genes in a window, t = max # paralogous copies of a gene in a chr. |
| SYNS | $O(n((\delta + 1) \times c \times t)^{4})$ | 15 min | n = #families, c = #chr., δ = window size, t = max # of paralogous copies of a gene in a chr. |

Evaluation of results

The Genolevures database provides families of proteins across the phylum of hemiascomycetous yeasts. To evaluate the performance of our algorithm, we have executed it on the existing families from the Genolevures Release 3 Candidate 3 (2008-09-24) [29, 33], with 4949 families covering 25196 protein coding genes from five protoploid yeast species [1].

Comparison with other methods

The critical window-size parameter δ of SYNS was set to 7 for all experiments. This value was chosen in order to match our results with the previously defined and expert-validated orthologous subgroups [1]. We have compared the orthologous groups obtained by SYNS on the yeast data to those obtained by the following methods: Coco-cl [34], MultiParanoid [35] and OrthoMCL [36]. OrthoMCL [36] was run with default values (inflation index = 1.5, e-value cut-off = –5 and percent match cut-off = 50), starting from input fasta files. Coco-cl was run recursively starting with fasta sequences, with bootstrap threshold score = 1 and split score = 0.5, and using ClustalW for multiple sequence alignment. MultiParanoid was run using default parameters (no cut-off and no duplicate appearance of a gene in clusters), using the BLOSUM62 matrix for Blast alignments. Table 2 shows the total number of classified proteins and the total number of orthologous groups detected by SYNS and these algorithms, using the original Genolevures families as a baseline [33]. In comparison with the SONS method, SYNS classifies a comparable number of proteins but generates more orthologous groups, implying that these groups are more fine-grained.

We compare the orthologous groups obtained by the SYNS method with those obtained by other algorithms in table 3. To compare two classifications, we first look at how many groups are identical between two methods (Id column) and compute the similarity value (between 0 and 1) over the intersection of the covered protein sets (for the definition see [33]). Second, we analyze the differences between two classifications. For these we report the number of proteins that are classified only by SYNS (SYNS column) and the number classified only by a given method (meth. column). The remaining differences are classified according to granularity: a split when a group obtained by a given method is split into multiple subgroups by the SYNS algorithm, a merge in the opposite case, and messy when the split/merge relationship is complicated. We further analyze the differences with respect to the SONS classification case by case (available at http://www.cbib.u-bordeaux2.fr/redmine/projects/syns/files). We have found that in the case of splits between the resulting groups (50 groups in table 3), the more fine-grained groups obtained by the SYNS algorithm are in general more functionally relevant. For the cases of merges (141 groups) and messy events (70 groups) there is no clear-cut qualitative difference. However, for these 211 cases more functionally plausible groups can be obtained by SYNS when using a smaller window size δ = 5. Overall, the SYNS method appears to be the best match with the curated SONS results [1], while relying on clear mathematical definitions and having a satisfactory running time.
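The split, merge and messy categories can be sketched on toy data. The counting rules below are one possible reading of the description above and are not the procedure used to produce table 3; the function names and the example groupings are hypothetical, and the similarity value of [33] is not reproduced.

```python
def is_union_of(group, others):
    """True when `group` is exactly the union of two or more groups of `others`."""
    parts = [o for o in others if o & group]
    return len(parts) >= 2 and all(p <= group for p in parts)


def compare_groupings(method, syns):
    """Count identical, split, merge and messy relations between two groupings,
    each given as a list of disjoint sets of protein identifiers."""
    # Restrict both groupings to the proteins they have in common.
    common = set().union(*method) & set().union(*syns)
    method = [g & common for g in method if g & common]
    syns = [g & common for g in syns if g & common]

    identical = [g for g in method if g in syns]
    splits = [g for g in method if is_union_of(g, syns)]    # method group cut up by SYNS
    merges = [s for s in syns if is_union_of(s, method)]    # SYNS group joining method groups
    explained = identical + splits
    messy = [g for g in method
             if g not in explained and not any(g <= s for s in merges)]
    return {"identical": len(identical), "split": len(splits),
            "merge": len(merges), "messy": len(messy)}


method_groups = [{"a", "b", "c"}, {"d"}, {"e"}, {"f", "g"}, {"h", "i"}, {"j", "k"}]
syns_groups = [{"a", "b"}, {"c"}, {"d", "e"}, {"f", "g"}, {"h", "j"}, {"i", "k"}]
print(compare_groupings(method_groups, syns_groups))
```

In this toy case {a, b, c} is a split, {d, e} is a merge, {f, g} is identical, and the two groups that exchange members across {h, j} and {i, k} are counted as messy.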
Table 2

Comparisons of SYNS and other classifications with the existing family structure as baseline

| Method | # proteins | Protein coverage (%) | # groups |
|---|---|---|---|
| OrthoMCL | 23399 | 92 | 4146 |
| MultiParanoid | 15937 | 63 | 15888 |
| Coco-cl | 24396 | 96 | 5252 |
| GCFinder | 10080 | 40 | 1779 |
| SONS | 24016 | 95 | 5424 |
| SYNS | 25147 | 99 | 6441 |
| Genolevures Families | 25196 | 100 | 4949 |

Table 3

Comparison of different computations of orthologous clusters with SYNS results on the Genolevures data. Each line compares a given method with SYNS; we report the number of genes classified only by the given method (meth), only by the SYNS algorithm (SYNS), the similarity value (sim) between two cluster sets (varying between 0 and 1 as defined in [33]), the number of genes that appear as singletons, the number of splits and merges between two cluster sets as well as the number of unclassifiable cases (messy).

| Method | Id | sim | meth | SYNS | singls | merges | splits | messy |
|---|---|---|---|---|---|---|---|---|
| OrthoMCL | 3447 | 0.76 | 41 | 1794 | 1044 | 594 | 18 | 32 |
| MultiParanoid | 4325 | 0.26 | 4 | 20518 | 1988 | 4 | 121 | 1 |
| Coco-cl | 3632 | 0.82 | 42 | 793 | 774 | 383 | 511 | 103 |
| GCFinder | 470 | 0.24 | 769 | 9781 | 3417 | 749 | 4 | 46 |
| SONS | 4968 | 0.90 | 27 | 1158 | 874 | 141 | 51 | 70 |

Analysis of two protein families

We illustrate the functional relevance of the SYNS algorithm by considering the classification of the Pdrp (pleiotropic drug resistance transporter proteins) subfamily performed in [6]. This is a subset of the PDR proteins from the GL3C0025 Genolevures family (60 proteins in total). We compare this manual analysis with the results obtained automatically by the SONS and SYNS algorithms.

Seven SONS groups, six SYNS groups and seven groups obtained by manual curation provide hypotheses on the evolution of this protein family. The manually curated orthologous groups are confirmed by gene cluster analysis, but in some cases the results differ. Groups P1 through P4 in table 4 denote four orthologous groups over five species annotated in [6] according to their S. cerevisiae members, namely the Pdr12p group (P3, 5 members), the Snq2p group (P1 + P2, 5+4=9 members) and the Pdr5p/15p group (P4, 3 members). Groups P5 through P7 in table 4 contain genes whose relationship to Pdr5p/15p is based on phylogenetic evidence only [6]. Three tandem gene repeats appear in ERGO (Eremothecium gossypii), KLLA (Kluyveromyces lactis) and SAKL (Saccharomyces kluyveri) and are found in a similar neighborhood [6] in groups P1 and P2.

Compared to the SONS classification, our approach proposes a more conservative classification of these proteins into orthologous groups. Indeed, SONS excludes ZYRO0D17710g from the Snq2/YNR070w phylogenetic cluster, while re-grouping the remaining proteins belonging to P1 and P2. Moreover, according to [6], SAKL0F04312g belongs to the Aus1p/Pdr11p group, which has no shared neighborhood in the five pre-WGD species. Thus, it is not surprising that this gene is missing in the SYNS classification (the SONS algorithm classifies it in an independent group, not shown in table 4).

A similar analysis is done for the GL3C0026 family, which has 57 members and four different functionally annotated groups. Figure 6 illustrates the evolutionary pattern based on the combination of phylogenetic analysis and functional annotations of this family. The SONS algorithm produces 7 orthologous gene clusters, while SYNS generates 8 clusters that are functionally more relevant. Both SONS and SYNS successfully classify the L-ornithine transaminase (OTAse) group (with the S. cerevisiae member YLR438w CAR2). However, the SONS classification fails to distinguish the YGR019w UGA1 Gamma-aminobutyrate (GABA) transaminase group from the YNR058w amino-pelargonic acid aminotransferase (DAPA) group. In contrast, the SYNS method separates the cluster containing the YGR019w UGA1 gene according to its functional annotation. Our algorithm also succeeds in correctly distinguishing the single ortholog gene clusters from the YGR019w UGA1 group. For the YOL140w ARG8 Acetylornithine aminotransferase group, both SONS and SYNS provide similar conserved gene clusters. However, SONS erroneously mixes some genes of this group with the YGR019w UGA1 cluster and the YNR058w BIO3 cluster, whereas the SYNS algorithm succeeds in distinguishing them. The combined functional annotations and neighborhood analysis support the evolutionary pattern illustrated in figure 6 for the GL3C0026 family. Therefore we can conclude that the final δ-zones in our algorithm may preserve functionally meaningful conserved gene clusters.
Table 4

Comparisons of orthologous clusters subdividing the Pdrp Genolevures family The Pdrp Genolevures family GL30025 as analysed by a) SONS results b) SYNS results c) after manual curation. The comparisons have been performed over the same sets of genes as in figure 3 in [6] for the Pdrp ”sensu stricto” proteins subset of the GL3C0025 family.

| SONS orthologous groups | SYNS orthologous groups | Manual curation |
| --- | --- | --- |
| S1 = {ZYRO0A04114g SAKL0C11616g SAKL0C11704g KLTH0A01914g ERGO0B08140g ERGO0B08162g KLLA0D03432g KLLA0D03476g} | Y1 = {ZYRO0A04114g SAKL0C11616g SAKL0C11704g KLTH0A01914g ERGO0B08140g ERGO0B08162g KLLA0D03432g KLLA0D03476g ZYRO0D17710g} | P1 = {ZYRO0A04114g SAKL0C11616g KLTH0A01914g ERGO0B08140g KLLA0D03432g} |
| S2 = {ZYRO0D17710g} | | P2 = {ZYRO0D17710g KLLA0D03476g ERGO0B08162g SAKL0C11704g} |
| S3 = {SAKL0C05654g SAKL0H10670g KLLA0B09702g ZYRO0F08866g ZYRO0F08888g} | Y2 = {SAKL0C05654g SAKL0H10670g KLLA0B09702g ZYRO0F08866g ZYRO0F08888g} | P3 = {SAKL0C05654g SAKL0H10670g KLLA0B09702g ZYRO0F08866g ZYRO0F08888g} |
| S4 = {ZYRO0D11836g ZYRO0D11858g ZYRO0D11880g} | Y3 = {ZYRO0D11836g ZYRO0D11880g ZYRO0D11858g} | P4 = {ZYRO0D11836g ZYRO0D11880g ZYRO0D11858g} |
| S5 = {SAKL0G08008g KLLA0F21692g} | Y4 = {SAKL0G08008g KLLA0F21692g} | P5 = {SAKL0G08008g KLLA0F21692g} |
| S6 = {ERGO0G05126g} | Y5 = {ERGO0G05126g} | P6 = {ERGO0G05126g} |
| S7 = {KLTH0G19448g KLTH0E17138g} | Y6 = {KLTH0G19448g KLTH0E17138g} | P7 = {KLTH0G19448g} |
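The comparison in Table 4 can be checked mechanically. The sketch below is only an illustration, not the similarity measure used in this study: it transcribes the three columns of Table 4 and counts the groups that have exactly the same members in two classifications. The helper names `parse_groups`, `identical` and `report` are hypothetical.

```python
def parse_groups(text):
    """Turn one group per line into a set of frozensets of gene names."""
    return {frozenset(line.split()) for line in text.strip().splitlines()}

# Gene names transcribed from Table 4. Assumption: SYNS group Y5 is the single
# gene ERGO0G05126g, as in the SONS and manual curation columns.
SONS = parse_groups("""
ZYRO0A04114g SAKL0C11616g SAKL0C11704g KLTH0A01914g ERGO0B08140g ERGO0B08162g KLLA0D03432g KLLA0D03476g
ZYRO0D17710g
SAKL0C05654g SAKL0H10670g KLLA0B09702g ZYRO0F08866g ZYRO0F08888g
ZYRO0D11836g ZYRO0D11858g ZYRO0D11880g
SAKL0G08008g KLLA0F21692g
ERGO0G05126g
KLTH0G19448g KLTH0E17138g
""")

SYNS = parse_groups("""
ZYRO0A04114g SAKL0C11616g SAKL0C11704g KLTH0A01914g ERGO0B08140g ERGO0B08162g KLLA0D03432g KLLA0D03476g ZYRO0D17710g
SAKL0C05654g SAKL0H10670g KLLA0B09702g ZYRO0F08866g ZYRO0F08888g
ZYRO0D11836g ZYRO0D11880g ZYRO0D11858g
SAKL0G08008g KLLA0F21692g
ERGO0G05126g
KLTH0G19448g KLTH0E17138g
""")

MANUAL = parse_groups("""
ZYRO0A04114g SAKL0C11616g KLTH0A01914g ERGO0B08140g KLLA0D03432g
ZYRO0D17710g KLLA0D03476g ERGO0B08162g SAKL0C11704g
SAKL0C05654g SAKL0H10670g KLLA0B09702g ZYRO0F08866g ZYRO0F08888g
ZYRO0D11836g ZYRO0D11880g ZYRO0D11858g
SAKL0G08008g KLLA0F21692g
ERGO0G05126g
KLTH0G19448g
""")


def identical(a, b):
    """Groups that appear with exactly the same members in both classifications."""
    return a & b


def report(name_a, a, name_b, b):
    """Print how many groups are shared verbatim between two classifications."""
    same = identical(a, b)
    print(f"{name_a} vs {name_b}: {len(same)} identical groups "
          f"(of {len(a)} and {len(b)})")


report("SONS", SONS, "manual", MANUAL)
report("SYNS", SYNS, "manual", MANUAL)
report("SONS", SONS, "SYNS", SYNS)
```

Exact set identity is a deliberately strict criterion. A softer measure, such as the overlap between the best-matching groups, may be more appropriate when one classification merges or splits groups of another.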

Figure 6

Analysis of the Pdrp family. Relationships between the 57 members of the GL3C0026 family based on their functional annotations. Each line lists genes from one species (indicated on the left); each box represents one gene. For example, in line ZYRO, the first box on the left, A11990, stands for the ZYROA11990 gene. The numbers below the boxes represent the relative gene order (position) on the chromosomes. Genes with similar functional annotations are connected using the same color.

Conclusion

The twofold goal of this study is to identify locally conserved gene clusters and to use them to subdivide an existing family structure into orthologous groups. To this end, we define a model for unordered local synteny and propose an algorithm that identifies conserved gene clusters and divides them into orthologous and paralogous clusters among multiple genomes. To validate our approach, we ran our method on the five hemiascomycetous yeast genomes and examined the conserved non-overlapping gene clusters that arise from each homologous family of the Genolevures database [29]. Our approach shows 99% protein coverage for existing homologous groups.

We perform similar comparisons with the existing SONS groups [6] over the Genolevures families. The 90% similarity between our approach and the SONS groups indicates that our automatic method comes close to the manually curated results, especially since part of the differences between these groups can be explained by the fact that SONS does not classify the paralogous conserved gene clusters. This confirms the pertinence of our definition of conserved neighborhoods based on transitivity and phylogenetic constraints. By distinguishing between orthologous and paralogous conserved gene clusters, the SYNS method makes it possible to include tandem repeats as well as losses, fusions or transpositions of gene copies in chromosomal rearrangements of genomes. These results indicate that the proposed sliding-window and partial-traversal approach efficiently produces biologically relevant conserved gene clusters and corresponding orthologous groups, with O(n((δ + 1) × c × t)^4) worst-case complexity for a pre-defined window size δ.
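To make the notion of an unordered local window concrete, the following minimal sketch compares the family content around two genes within a window of δ positions, ignoring gene order inside the window. It is not the SYNS implementation: the encoding of a genome as a list of family labels, the function names `window_families` and `shared_families`, and the toy data are all assumptions made for illustration.

```python
from typing import List, Set


def window_families(genome: List[str], pos: int, delta: int) -> Set[str]:
    """Family labels found within delta positions of pos, excluding pos itself."""
    lo = max(0, pos - delta)
    hi = min(len(genome), pos + delta + 1)
    return {fam for i, fam in enumerate(genome[lo:hi], start=lo) if i != pos}


def shared_families(g1: List[str], p1: int,
                    g2: List[str], p2: int, delta: int) -> Set[str]:
    """Families present in both delta-windows, regardless of their order."""
    return window_families(g1, p1, delta) & window_families(g2, p2, delta)


# Toy chromosomes (hypothetical): each entry is the family label of one gene.
genome_a = ["F1", "F2", "X", "F3", "F4"]
genome_b = ["F4", "F9", "X", "F2", "F1"]

# Compare the neighborhoods of the two members of family X with delta = 2.
common = shared_families(genome_a, 2, genome_b, 2, delta=2)
print(sorted(common))
```

In this toy case the two members of family X share three neighboring families even though the gene order differs, which is the kind of evidence an unordered synteny model can use to group the two copies together.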

Declarations

Acknowledgements

The authors would like to thank Pascal Durrens for constructive discussions and help with the analysis of biological relevance of results.

This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S9.

Authors’ Affiliations

(1) LaBRI, CNRS/Université Bordeaux 1
(2) Netherlands Cancer Institute

References

  1. Consortium G: Comparative genomics of protoploid Saccharomycetaceae. Genome Res 2009, 19(10):1696–709.
  2. Ermolaeva M: Operon finding in bacteria. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics 2005, 2886–2891.
  3. Snel B, Bork P, Huynen M: The identification of functional modules from the genomic association of genes. Proc Natl Acad Sci USA 2002, 99(9):5890–5895. doi:10.1073/pnas.092632599
  4. Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: A fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23(9):324–328. doi:10.1016/S0968-0004(98)01274-2
  5. Bergeron A, Blanchette M, Chateau A, Chauve C: Reconstructing Ancestral Gene Orders Using Conserved Intervals. In WABI. Volume 3240. Lecture Notes in Computer Science. Edited by: Jonassen I, Kim J. Springer; 2004:14–25.
  6. Seret ML, Diffels JF, Goffeau A, Baret PV: Combined phylogeny and neighborhood analysis of the evolution of the ABC transporters conferring multiple drug resistance in hemiascomycete yeasts. BMC Genomics 2009, 10(459):1–11.
  7. Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19: 99–113. doi:10.2307/2412448
  8. Fitch W: Homology: a personal view on some of the problems. Trends Genet 2000, 16: 227–231. doi:10.1016/S0168-9525(00)02005-9
  9. Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV: Selection in the evolution of gene duplications. Genome Biol 2002, 3: RESEARCH0008.
  10. Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science 2000, 290: 1151–1155. doi:10.1126/science.290.5494.1151
  11. Lynch M, Force A: The probability of duplicated gene preservation by subfunctionalization. Genetics 2000, 154: 459–473.
  12. Ohno S: Evolution by gene duplication. New York: Springer; 1970.
  13. He X, Goldwasser MH: Identifying conserved gene clusters in the presence of homology families. J Comput Biol 2005, 12(6):638–656. doi:10.1089/cmb.2005.12.638
  14. Bansal AK: An automated comparative analysis of 17 complete microbial genomes. Bioinformatics 1999, 15: 900–908. doi:10.1093/bioinformatics/15.11.900
  15. Goldberg D, McCouch S, Kleinberg J: Algorithms for constructing comparative maps. In Comparative Genomics. Edited by: Sankoff D, Nadeau JH. NL: Kluwer Academic Press; 2000:281–294.
  16. Housworth EA, Postlethwait J: Measures of synteny conservation between species pairs. Genetics 2002, 162: 441–448.
  17. Nadeau JH, Sankoff D: Counting on comparative maps. Trends Genet 1998, 14(12):495–501. doi:10.1016/S0168-9525(98)01607-2
  18. Tamames J: Evolution of gene order conservation in prokaryotes. Genome Biol 2001, 6(2):0020.1–11.
  19. Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature 1997, 387: 708–713. doi:10.1038/42711
  20. Yang Q, Yi G, Zhang F, Thon MR, Sze SH: Identifying gene clusters within localized regions in multiple genomes. J Comput Biol 2010, 17(5):657–668. doi:10.1089/cmb.2009.0116
  21. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. PNAS 1999, 96: 2896–2901. doi:10.1073/pnas.96.6.2896
  22. Vandepoele K, Saeys Y, Simillion C, Raes J, Peer YVD: The automatic detection of homologous regions (ADHoRe) and its application to microcolinearity between Arabidopsis and rice. Genome Res 2002, 12(11):1792–1801. doi:10.1101/gr.400202
  23. Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis. Science 2000, 290: 2114–2117. doi:10.1126/science.290.5499.2114
  24. Hoberman R, Sankoff D, Durand D: The Statistical Significance of Max-Gap Clusters. In Comparative Genomics. Volume 3388. Lecture Notes in Computer Science. Edited by: Lagergren J. Springer Berlin/Heidelberg; 2005:55–71. doi:10.1007/978-3-540-32290-0_5
  25. Parida L: Gapped permutation pattern discovery for gene order comparisons. J Comput Biol 2007, 14: 45–55. doi:10.1089/cmb.2006.0103
  26. Heber S, Stoye J: Algorithms for finding gene clusters. Lect Notes Comput Sci 2001, 2149: 252–263. doi:10.1007/3-540-44696-6_20
  27. Kim S, Choi JH, Yang J: Gene teams with relaxed proximity constraint. Proc IEEE Comput Sys Bioinformatics Conf 2005, 44–55.
  28. Bergeron A, Corteel S, Raffinot M: The algorithmic of gene teams. In Proc. 2nd Annual Workshop on Algorithms in Bioinformatics (WABI), Volume 2452 of Lecture Notes in Computer Science. New York: Springer-Verlag; 2002:464–476.
  29. Sherman DJ, Martin T, Nikolski M, Cayla C, Souciet J, Durrens P: Genolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes. Nucleic Acids Res 2009, 37(Database issue):D550-D554.
  30. Didier G: Common intervals of two sequences. Lect Notes Comput Sci 2003, 2812: 17–24. doi:10.1007/978-3-540-39763-2_2
  31. Schmidt T, Stoye J: Quadratic time algorithms for finding common intervals in two or more sequences. Lect Notes Comput Sci 2004, 3109: 347–358. doi:10.1007/978-3-540-27801-6_26
  32. Beal MP, Bergeron A, Corteel S, Raffinot M: An algorithmic view of gene teams. Theoret Comput Sci 2004, 320(2–3):395–418. doi:10.1016/j.tcs.2004.02.036
  33. Nikolski M, Sherman D: Family relationships: should consensus reign? - consensus clustering for protein families. Bioinformatics 2007, 23(2):e71-e76. doi:10.1093/bioinformatics/btl314
  34. Jothi R, Zotenko E, Tasneem A, Przytycka TM: COCO-CL: hierarchical clustering of homology relations based on evolutionary correlations. Bioinformatics 2006, 22(7):779–788. doi:10.1093/bioinformatics/btl009
  35. Alexeyenko A, Tamas I, Liu G, Sonnhammer E: Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 2006, 22(14):e9-e15. doi:10.1093/bioinformatics/btl213
  36. Li L, Stoeckert C, Roos D: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13(9):2178–89. doi:10.1101/gr.1224503

Copyright

© Sarkar et al; licensee BioMed Central Ltd. 2011

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
