Identification of conserved gene clusters in multiple genomes based on synteny and homology

Background Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. We build upon the series of works on gene teams and homology teams. Results Our primary contribution is a local sliding-window SYNS (SYNtenic teamS) algorithm that refines an existing family structure into orthologous sub-families by analyzing the neighborhoods around the members of a given family with a locally sliding window. The neighborhood analysis is done by computing conserved gene clusters. We evaluate our algorithm on the existing homologous families from the Genolevures database over five genomes of the Hemyascomycete phylum. Conclusions The result is an efficient algorithm that works on multiple genomes, considers paralogous copies of genes and is able to uncover orthologous clusters even in distant genomes. Resulting orthologous clusters are comparable to those obtained by manual curation.


Background
Uncovering the relationship between the conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. It is often suggested that regions with similar gene content among different species are evidence for phylogenetic relationship and trace through evolution the inheritance of function from a common ancestor. Within one genome, the presence of large duplicated blocks may be due to the ancient large-scale or whole genome duplication, while presence of segments with homologous genes, named conserved gene clusters in multiple genomes more likely indicates an evolutionary constraint for a functionally related group. Our primary contribution is a local sliding-window algorithm that starts from an existing protein family classification and produces two results: first, concerved gene clusters, and second, a subdivision of families into orhtologous subgroups. Our approach can be seen as using conserved gene clusters in order to sift through the family structure to uncover orthology. We evaluate the biological relevance of our approach on the example of Protoploid yeasts [1].
A number of studies indicate that regions of conserved homology among multiple species may result from functional pressure to keep these genes close, but it may also be conserved because the genomes under study have not sufficiently diverged. For the former, the most well known examples are that of operons in prokaryotes [2], but also the existence of functional interactions [3] and similar expression patterns [4] in closely located genes. For the latter, existence of conserved gene clusters is the computational basis for ancestral genome reconstruction [5] and search for ancestral homologs among genes in the same family [6]. Orthologs are homologous genes related by speciation [7,8] which retain the same functionality as their common ancestors. Homologous genes related by duplication within one lineage are called paralogs and generally differ in functionality [9][10][11][12]. A number of papers introduce algorithms to compute conserved gene clusters and orthologous groups, see for example, [13][14][15][16]. These approaches vary on a number of parameters. First, there are authors who consider strictly conserved chromosomal segments with similar gene order and orientation [17][18][19]. Second, come the approaches where one considers conserved contiguous regions but without co-linearity [13,20]. Third, the authors relax the definition of conserved regions by allowing gaps [18,[21][22][23][24]. Four, paralogous gene copies within one chromosome are allowed in order to explore many-to-many homologous relationships [13,25]. Finally, some authors study the effect of varying the gap between adjacent neighbors [24,26,27].
In this paper we start from the notion of gene teams introduced in [28]. This model allows only one copy of a gene on a given chromosome. We relax this restriction by following the approach of homology teams defined in [13]. Furthermore, we set the gap threshold not only for adjacent genes, but by requiring the distance for any two genes considered as being neighbors to be smaller than a certain threshold. A similar choice was made in [20]. We call the obtained gene clusters synteny teams.
Our SYNS (SYNtenic teamS) algorithm refines existing families into orthologous sub-families, by analyzing the neighborhoods around the members of a given family with a locally sliding window. This is done for all pairs of chromosomes in multiple genomes on which family members appear. The pairwise conserved contiguous segments are agglomerated by relying on a partial homology and biological criteria introduced in [1] between segments. This results in larger conserved segments that we call syntenic zones. We evaluate our algorithm on the existing homologous families for five genomes of the Hemyascomycete yeasts from the Genolevures database [29]. Indeed, there already exists a sub-classification of these families into orthologous subfamilies [1] that has undergone expert validation and thus can be used as a reference point for the evaluation of biological relevance of our results. We further illustrate the results of our method for the particular case of the Pdrp (pleiotropic drug resistance proteins) phylogenetic subfamily of ABC transporters that has been manually analyzed in [6].

Methods
In this section we define the notion of unordered conserved gene clusters that allows for paralogous copies and gaps on multiple genomes. Following the work of [20,30,31], we allow one homologous gene to appear more than once in one chromosome. We refine the approach of homology teams [13] by distinguishing between orthologous and paralogous copies of genes. Large syntenic zones are built my merging clusters based on genes common among them instead of directly merging the ordered chains with overlapping families as in [32]. For mathematical notations and examples in a textual format we follow [28].
Definition 1 A chromosome is defined as a pair c = (Σ, G), where Σ = {f 1 , f 2 , …, f m } is the set of homologous families and G = (g 1 , g 2 , ..., g n ) is an ordered sequence of genes. Each gene g i G is a couple (p i , f i ), where p i is the position of gene g i on c and g i belongs to some homologous group f i Σ.
Here, Σ is the alphabet for any chromosome c and p i is an integer. When it is necessary to indicate to which chromosome belongs a given gene, this is done by a subscript: Definition 2 Given a chromosome c, with two genes g i = (p i , f i ) and g j = (p j , f j ), the distance between g i and g j is defined by Δ(g i , g j ) = |p ip j |.
A gene subset G′ ⊆ G induces the subset of families Σ′ denoted by Σ(G′) such that f i Σ′ if and only if there exists g i G′ such that g i = (p i , f i ). A set of genes G′ from the same chromosome, forms a chromosomal segment s = (Σ′, G′, c) with or without gaps. When it is clear from the context, we will assimilate a set of genes G′ with the corresponding chromosomal segment.
Definition 3 A chromosomal segment s = (Σ′, G′, c) is contiguous if for any two genes g i = (p i , f i ) and g j = (P i , f j ) from G′ and any psuch that p i <p <p j , either the gene g = (p, f) at the position p belongs to G' or this position corresponds to an asterisk. Otherwise, the segment is said to be non-contiguous For example, G′ = {(4, f 4 ), (7, f 3 ), (8, f 1 )} on c 1 = 〈f 2 f 1 * f 4 * *f 3 f 1 〉 forms a contiguous segment.

Synteny teams
Two genes g i = (p i , f i ) and g j = (p j , f j ) on the same chromosome are considered to be neighbors when Δ(g ig j ) <δ for a given threshold δ > 0. For a gene g i , we denote the set of neighbor genes N i to be centered around it, Definition 4 A chromosomal segment s is called a δsegment if every pair of genes of s is separated by a distance smaller than δ, that is s = {g i | ∀g j s, Δ(g i , g j ) <δ}. A window w is a contiguous δ-segment.
We obtain the following non-trivial δclusters: The superset inclusion in definition 7 implies that for the computational purposes there is no need to consider the smaller of the two sets and thus causes merging of the syntenies if the witness of one synteny is a complete subset of another in our algorithm.
Example 4 Let c 1 , c 2 and c 3 be three chromosomes in figure 2. We have seen that a (δ-synteny must contain the maximal (δ-cluster with respect to subset inclusion. All (δ-syntenies for a set of chromosomes C, with |C| >= 2 are included in the result. Such a synteny set is informally called a synteny team following the terminology introduced in [28,32] for gene teams.
Definition 8 Given a δ-synteny teamS = ∑ { }, i we say that Σ i and Σ j are transitively connected if the witnesses S i and S j overlap, that is |S i ∩ S j | ≥ 1. We further define a δ-zone as a union of transitively connected δ-syntenies Σ i and Σ j .
Example 5 Consider C = {c 1 , c 2 , c 3 } from example 4 and δ = 3. Suppose that we compute clusters in the Sarkar et al. BMC Bioinformatics 2011, 12(Suppl 9):S18 http://www.biomedcentral.com/1471-2105/12/S9/S18 neighborhood of f 8 . Non-trivial syntenies are the following: We obtain one non-trivial δzone {f 4 , f 8 , f 7 } by agglomerating δ-syntenies Σ 1 and Σ 2 based on the transitivity (see figure 3). Notice that this leaves the gene (8, f 8 ) c 3 , out of the δ-zone. The transitivity relationship in the SYNS algorithm combines each pair of two δ-syntenies sharing at least one witness into one δ-zone. The notion of a δ-zone aims at uncovering even distant evolutionary relationships based on conservation of gene content within neighborhoods. It is slightly amended based on the following two considerations.
1. Several paralogous genes may exist on the same chromosome. When two or more paralogs appear within one window of size δ, we include them in the same witness set of a δ-synteny since it is not possible to computationally distinguish between them.
2. It may happen that two distinct δ-syntenies share only one paralogous gene. This is what we call a weak bond. Creating a δ-zone based on a single gene intersection may either lead to a δ-zone that is phylogenetically valid or may create an erroneous result (see [6]).
Definition 9 Given a δ-synteny teamS = ∑ ∑ { , } i j and its witness set S = {S i , S j } we say that S forms a weak bond if |S i ∩ S j | = 1. We further define g = S i ∩ S j to be the witness of a weak bond.
The δ-zone {Σ i , Σ j } resulting from a weak bond may be erroneous. We rely on phylogeny to solve this issue. We consider a total order over all the species under study defined by phylogeny: a ≺ b if species b has diverged from the common ancestor earlier than species a (≺ corresponds then to the relative speciation time). When no other witness from a other than g exists, we split the erroneously obtained synteny in two parts: one that contains the orthology relationships within a given family f and another one that keeps the supposed paralogs. The details of how this is done are presented in Results section.
Definition 10 LetS = ∑ ∑ { , } i j be a δsynteny team over the witness set S = {S i , S j } such that |S i | > |S j | and let g = S i ∩ S j be the witness of a weak bond. If g is from the biggest species according to ≺ in S j , we say that S i witnesses a maximal orthologous (δ-synteny Σ i and S' j = S j \ g witnesses a paralogous δ-synteny Σ j .
Example 6 Consider C′ = {c 2 , c 3 } from example 4 and figure 4 supposing that c 3 ≺ c 2 and consider neighborhoods around f 8 with (δ = 3. Two non-trivial δ-syntenies are connected by a weak bond: f

Syntenic TeamS algorithm
In this section, we present the SYntenic TeamS (SYNS) algorithm which computes δ-zones in multiple genomes. In previous work gene teams between two chromosomes of size m and n are computed by an O(m + n)log 2 (m + n) algorithm consisdering only one-to-one homologous relationships [32]. The approach by [20] solves the ordered gene clusters problem by proposing a directed acyclic graph model and an NP-hard longest path solution; results contain maximal but also non-maximal orthologous clusters. Our approach relies on the same sliding-window general approach as in [20]. However, we gain in time efficiency by limiting the sliding of the window only around positions of family members. Given a set of families Σ and a predefined window size δ, we examine neighborhoods of each family f Σ in all chromosomes. For all genes of f including paralogous copies, we consider a neighborhood from -δ to +δ around them. This neighborhood is examined by a sliding window of size δ and we form sets of genes corresponding to families in a given window position. These sets are intersected to look for common gene content if they belong to different chromosomes. The intersections define synteny conservation within the family neighborhoods by using definitions in Methods section. We further look for transitivity among δ-syntenies and build (δ-zones. To do this, we search for overlaps among witnesses of δ-clusters. If the witness intersection size is > 1 then the δsyntenies are agglomerated to form one δ-zone. Three different cases corresponding to phylogenetic topologies shown in figure 5 are considered for solving the weak bond problem. Let S i and S j to be the two witnesses connected by a weak bond, we sort the genes of these witnesses according to the ≺ order of speciation. If the witness of a weak bond occurs in the biggest species according to ≺ or if there is no any other witness from a bigger species, then we consider that (cases A and B in figure 5) the two clusters define a valid (δzone. Case C in figure 5 shows the situation where forming a (δ-zone can not be justified from the evolutionary perspective. For cases A and B we continue to search for paralogous gene clusters. We gather all maximal δ-zones in the final result.    Time complexity Table 1 shows the comparative time complexity analysis of our approach and other existing ortholog detection algorithms for the cases where such information is available. In the SYNS algorithm, we consider that one homologous family f may appear in at most c × t locations in all genomes, where c is the total number of chromosomes and t is the maximal number of paralogous copies. Given that we explore neighborhoods of size 2 × δ + 1, the number of all windows of size δ for f is (δ + 1) × c × t. The computation of all witnesses for a given family takes O(((δ + 1) × c × t) 2 ). If in this computation all the possible intersections are non-empty, then in the worst case scenario we obtain for f the set of (δ-clusters of size ((δ +1) × c × t) 2 . Which implies that the (δ-synteny computation takes O(((δ + 1) × c × t) 4 ); which is repeated for all families f ∊ Σ.

Evaluation of results
The Genolevures database provides families of proteins across the phylum of hemyascomycetous yeasts. To evaluate the performance of our algorithm, we have executed it on the existing families from the Genolevures Release 3 Candiate 3 (2008-09-24) [33], [29] with 4949 families covering 25196 protein coding genes from five protoploid yeast species [1].

Comparison with other methods
The critical window-size parameter δ of SYNS was set to 7 for all experiments. This value was obtained in order to match our results with the previously defined and expert validated orthologous subgroups [1]. We have compared the orthologous groups obtained by SYNS on the yeast data to those obtained by the following methods: Coco-cl [34], MultiParanoid [35] and OrthoMCL [36]. Table 2 shows the numbers of orthologous groups classified by Three topologies for a weak bond Examples of topologies where two δ-syntenies Σ i and Σ j with witness sets S i and S j have a weak bond. Species are ordered by phylogenetic order Sp i ≺ Sp i+1 . Cases A (g 1 is the witness of weak bond) and B (g 2 is the witness of weak bond) are considered to be plausible from the evolutionary perspective, while case C (g 2 is the witness of weak bond) is difficult to explain. Different colors represent the orthologous and paralogous δ-syntenies emerging from these cases. Vertical links represent synteny while horizontal arrows represent the weak bond. O(nd(k + d)) O(k 2 nD(tD + 1) k-1 ) 1546min interrupted after 3 days n = #families k = #genes, d = window size D = max #genes in a window t = max # paralogous copies of a gene in a chr.
Sarkar et al. BMC Bioinformatics 2011, 12(Suppl 9):S18 http://www.biomedcentral.com/1471-2105/12/S9/S18 these methods. OrthoMCL [36] was run with default inflation index= 1.5, e-value cut-off= -5 and percent match cut-off = 50 values starting from input fasta files. Coco-cl was run recursively starting with fasta sequences with boostrap threshold score= 1 and split score= 0.5 and using ClustalW for multiple sequence alignment. Multiparanoid was run using default parameters (no cut-off and no duplicate appearance of gene in clusters), using BLOSUM62 matrix for Blast alignments. Table 2 shows the total number of classified proteins and the total number of orthologous groups detected by SYNS and these algorithms using the original Genolevures families as a baseline [33]. In comparison with the SONS method, the SYNS classifies a comparable number of proteins, but generates more orthologous groups, implying that these groups are more fine-grained. We compare the orthologous groups between the SYNS method and those obtained by other algorithms in table 3. To compare two classifications we first look at how many groups are identical between two methods (Id column) and compute the similarity value (between 0 and 1) over the intersection of the covered protein sets (for definition see [33]). Second, we analyze the differences between two classifications. For these we report the number of proteins that are classified only by the SYNS (SYNS column) when compared to those only classified a given method (meth. column). The remaining differences are classified according to granularity: a split when a group obtained by a given method is split into multiple subgroups by the SYNS algorithm, a merge in the opposite case, and messy when the split/ merge relationship is complicated. We further analyze the differences with respect to SONS classification case by case (available at http://www.cbib.u-bordeaux2.fr/redmine/projects/syns/files). We have found that in the case of splits between the resulting groups (50 groups in table 3, the more fine-grained groups obtained by the SYNS algorithm are more functionally relevant in general. For the cases of merges (141 groups) and messy events (70 groups) there is no clear-cut qualitative difference. However, for these 211 cases more functionally plausible groups can be obtained by SYNS when using a smaller window size δ = 5. Overall, SYNS method appears to be the best match with the curated SONS results [1], while relying on a clear mathematical definitions and having satisfactory running time.

Analysis of two protein families
We illustrate the functional relevance the SYNS algorithm by considering the classification of Pdrp (pleiotropic drug resistance transporter proteins) subfamily performed in [6]. This is a subset of the PDR proteins from the GL3C0025 (total 60 proteins) Genolevures family. We compare this manual analysis with the results obtained automatically by SONS and SYNS algorithms.
Seven SONS, six SYNS and seven groups obtained by manual curation provide hypothethis on the evolution of this protein family. The manually curated orthologous groups are confirmed by gene cluster analysis. But in some cases the results differ. Groups P 1 through P 4 in table 4 denote four orthologous groups over five species annotated in [6] according to their S. cerevisiae members, namely Pdr12p group (P 3 , 5 members), Snq2p group (P 1 + P 2 , 5+4=9 members) and Pdrp5p/15p group (P 4 , 3 members). Groups P 5 through P 7 in table 4 contain genes whose relationship to Pdr5p/15p is based on phylogenetic evidence only [6]. Three tandem gene repeats appear in ERGO (Eremothecium gossypii), KLLA (Kluyveromyces lactis) and SAKL (Saccharomyces kluyveri) and are found in a similar neighborhood [6] in groups P 1 and P 2 .
Comparatively to the SONS classification, our approach proposes a more conservative classification for these proteins into orthologous groups. Indeed, SONS exclude ZYRO0D17710g from the Snq2/YNR070w phylogenetic cluster, while re-grouping the remaining proteins belonging to P 1 and P 2 . Moreover, according to [6], SAKL0F04312g belongs to the Aus1p/Pdr11p group which has no shared neighborhood in pre-WGD five  Table 3 Comparison of different computations of orthologous clusters with SYNS results on the Genolevures data Each line compares a given method with the SYNS; we report the number of genes classified only in the given method (meth), only by the SYNS algorithm (SYNS), the similarity value (sim) between two cluster sets (varying between 0 and 1 as defined in [33]), the number of genes that appear as singletons, the number of splits and merges between two cluster sets as well as the number of unclassifiable cases (messy). A similar analysis is done for the GL3C0026 family that has 57 members and four different functionally annotated groups. Figure 6 illustrates the evolutionary pattern based on the combination of phylogenetic analysis and functional annotations of this family. SONS algorithm produces 7 orthologous gene clusters, while SYNS generates 8 clusters functionally more relevant. Both SONS and SYNS successfully classify the L-ornithine transaminase (OTAse) group (with the S. cerevisiae member YLR438w CAR2). However, SONS classification fails to distinguish the YGR019w UGA1 Gamma-aminbutyrate (GABA) transaminase group from the YNR058w amino-pelargonic acid aminotransferase (DAPA) group. On the contrary, SYNS method separates the cluster having the YGR019w UGA1 gene Table 4 Comparisons of orthologous clusters subdividing the Pdrp Genolevures family The Pdrp Genolevures family GL30025 as analysed by a) SONS results b) SYNS results c) after manual curation. The comparisons have been performed over the same sets of genes as in figure 3 in [6] for the Pdrp "sensu stricto" proteins subset of the GL3C0025 family. SONS orthologous groups SYNS orthologous groups Manual curation  according to its functional anotation. Our algorithm also succeeds to correctly distinguish the single orthog gene clusters from the YGR019w UGA1 group. For the YOL140w ARG8 Acetylornithine aminotransferase group, both SONS and SYNS algorithms provide similar conserved gene clusters. However, SONS erroneously mixes some genes of this group with YGR019w UGA1 cluster and YNR058w BIO3 cluster, whereas SYNS algorithm succeeds to distinguish them. The combined functional annotations and neighborhood analysis support the evolutionary pattern illustrated in figure 6 for the GL3C0026 family. Therefore we can conclude that the final δ-zones in our algorithm may preserve a functionally meaningful conserved gene clusters.

Conclusion
The double goal of this study is to identify locally conserved gene clusters and to use them in order to subdivide an existing family structure into orthologous groups. To this end, we define a model for unordered local synteny and propose an algorithm to identify conserved gene clusters and their division into orthologous and paralogous clusters among multiple genomes. To validate our approach we have executed our method for the five Hemyascomycetous yeasts and genomes and examined the conserved non-overlapping gene clusters that arise from each homologous family of Genolevures database [29]. Our approach shows 99% protein coverage for existing homologous groups. We perform similar comparisons with the existing SONS groups [6] over the Genolevures families. The 90% similarity between our approach and SONS groups indicates that our automatic method comes close to the manually curated results, especially since part of the differences between these groups can be explained by the non-classification of the paralogous conserved gene clusters by SONS. This confirms the pertinence of our definition of conserved neighborhoods based on transitivity and phylogenetic constraints that make it possible to include tandem repeats as well as loss, fusions or transpositions of gene copies in chromosomal rearrangements of genomes. The SYNS method makes it possible to distinguish between orthologous and paralogous conserved gene clusters and thus makes it possible to include tandem repeats as well as loss, fusions or transpositions of gene copies in chromosomal rearrangements of genomes. This implies that the proposed sliding window and partial traversal approach, efficiently produces biologically relevant conserved gene clusters and corresponding orthologous groups with O(n((δ + 1) × c × t) 4 ) worst-case complexity, for a pre-defined window size δ.