A new method for 2D gel spot alignment: application to the analysis of large sample sets in clinical proteomics
 Sabine Pérès^{1}Email author,
 Laurence Molina^{1},
 Nicolas Salvetat^{1},
 Claude Granier^{1} and
 Franck Molina^{1}Email author
DOI: 10.1186/147121059460
© Pérès et al; licensee BioMed Central Ltd. 2008
Received: 04 June 2008
Accepted: 28 October 2008
Published: 28 October 2008
Abstract
Background
In current comparative proteomics studies, the large number of images generated by 2D gels is currently compared using spot matching algorithms. Unfortunately, differences in gel migration and sample variability make efficient spot alignment very difficult to obtain, and, as consequence most of the software alignments return noisy gel matching which needs to be manually adjusted by the user.
Results
We present Sili2DGel an algorithm for automatic spot alignment that uses data from recursive gel matching and returns meaningful Spot Alignment Positions (SAP) for a given set of gels. In the algorithm, the data are represented by a graph and SAP by specific subgraphs. The results are returned under various forms (clickable synthetic gel, text file, etc.). We have applied Sili2DGel to study the variability of the urinary proteome from 20 healthy subjects.
Conclusion
Sili2DGel performs noiseless automatic spot alignment for variability studies (as well as classical differential expression studies) of biological samples. It is very useful for typical clinical proteomic studies with large number of experiments.
Background
Twodimensional gel electrophoresis is a high resolution technique that is widely used in proteomics to separate thousands of proteins from a complex sample. After separation, a 2D map is obtained in which each protein, or isoform, is represented by a spot. In clinical proteomics the user has to analyze 2D maps of a large number of proteins as, very often, dozens of controls and pathological samples are compared. To allow this comparison, maps from all gels have to be aligned. Unfortunately, differences in gel migration and sample variability can render spot alignment very difficult [1]. There are two types of general limitations for 2D profiling: i) those due to variations in proteome composition and ii) those due to inadequacy of the analytical methods [2]. Computeraided image analysis contributes to the second kind of limitations and may lead to analytical pitfalls [1]. For instance, 2D gel migration can cause geometrical distortion and variable spot coordinates in different gels [3, 4] for many reasons [5]. During the process of image analysis, spot alignment is a critical step since it will condition spot comparison. Spot alignment can be performed mainly in two way: i) spot detection followed by spotbased image warping and finally spot alignment, or ii) pixelbased image warping followed by spot detection and then spot alignment [3]. In the first method the spotbased image warping corrects image distortion using userdefined landmark spots. This process can eventually be fully automated by making the spatial correction implicit [3]. The spot alignment is often expressed using a fusion gel that is representative to the whole experiment [6, 7]. When the number of gels to be aligned is high, the distortion has to be modelled with a loworder polynomial transformation [3, 8]. In this case, local geometric distortions are poorly corrected leading to an increase of noise in the spot alignment. In the second (pixelbased warping) method, the spatial correction is performed directly from rawimage data, taking advantage of techniques originating from image processing research. This approach leads to a more flexible image distortion (followed by spot detection) virtually eliminating matching problems. However, even if this method is more convenient, it remains bias due for instance to discontinuous change in intensity among the set of aligned gels [9] ending to affected spot intensity quantitation. In addition, the user must systematically adjust or control the spot alignment process by hand [4, 6]. This is time consuming and a source of errors.
Implementation
Alignment representation using graph theory
If N is the number of gels and S the set of all spots of the N gels, then for any spot i, gel matching will give all the spots j which match with i. We use the notation i → j when spot i matches with spot j. An alignment of spot i includes i and all its matching spots (see Figure 2), noted as A(i):A(i) = {j ∈ S : i → j} ∪ {i}.
Alignments are represented by a weighted undirected graph which is called matching graph. A node corresponds to a specific spot of a given gel and an edge represents the matches between spots. The weight of an edge is the number of matches between two spots. Therefore, it is less or equal to the number of gels.
If G_{ N }is the matching graph of N gels and S the size of a set S, then we have:G_{ N }= (V, E, w)
where :

V = S is the set of vertices of G_{ N }

E = {(i, j) : i ∈ S, j ∈ A(i), i ≠ j} is the set of edges of G_{ N }

w(i, j) = {k ∈ S : i ∈ A(k), j ∈ A(k)}, ∀(i, j) ∈ E is the weight of the edge (i, j).
The search of SAP on N gels comes down to finding specific subgraphs of the matching graph G_{ N }. In the best case, all the spots, which represent the same element, will pairwise match in n gels, with n ≤ N. In the matching graph the nodes representing these n spots are all connected together. This subgraph is called a clique (i.e. a complete graph); moreover all its edges are weighted by n. A graph G = (V, E) is a clique if all vertices are pairwise adjacent, i.e. ∀i, j ∈ V with i ≠ j, we have (i, j) ∈ E. However, alignments are not always perfect. The case where all the spots match at least once, but are not all pairwise adjacent, is represented in the matching graph by a clique in which at least one edge is weighted by a value lower than n. In the worst case, some spots will be missing in an alignment even through they should belong to it. If two spots from different gels never match together during the whole recursive matching procedure, but match with many of the other spots, they are not adjacents in the matching graph and the subgraph is not a clique. This subgraph contains a clique and some other nodes which are adjacents to several nodes of the clique; we call it a pseudoclique. In all cases, SAP are represented by dense clusters of nodes in the graph (i.e. nodes that are highly connected to each other) which are either cliques or pseudocliques.
Algorithm of SAP identification
Reducing the graph
Finding a maximal clique is a classical NPhard problem [14], thus exact algorithms are guaranteed to return a solution only in a time which increases exponentially with the number of vertices in the graph [15]. Therefore, one can expect exact solution methods to have limited performance on large datasets. To overcome this difficulty, we decomposed the graph into reduced subgraphs and then we determined the SAP in the corresponding reduced search spaces.
To reduce the search space, the graph is partitioned in all its connected components (i.e the maximal connected subgraphs). Before searching the connected component, we suppressed the edges weighting 1, as we assumed that an edge with a weight of 1 was not sufficient to belong to a pseudoclique, and that it could not represent an alignment of size 2 (which contains 2 spots). The suppression of these edges will not distort the results because if these spots are really in the same alignment, they should have a high connectivity with some other spots of the alignment and so they would be later restored.
Where u, v ∈ V, M_{ u }= N_{ u }\N_{ v }and W_{ uv }= N_{ u }∩ N_{ v }with N(u), N(v) denote the neighborhoods of u and v. e(A, B) (or e(A)) denotes the number of edges between the two sets A and B (or within a set A). The first term counts the number of triads (cycles of length 3) containing the edge (u, v) and the later computes the relative number of cycles of size 4 containing the edge.
Algorithm of SAP search
SAP_search(sm, γ):  

1. Graph construction.  • Remove edges having a weight of 1. 
2. Graph reduction:  • Remove edges for which the strength metric is lower than sm. 
• Decompose the graph into its connected components.  
3. For each connected components, cluster_search(γ).  
4. Return the new graph. 
Algorithm of clusters search
Clusters_search(γ):  

1. Clique and pseudoclique search:  • Find the maximal cliques and remove the cliques of size two if their nodes have more than 1 neighbour 
• For all cliques C, remove the nodes n for which max w(n, i)_{i∈C}≤ τ(C)  
• For all cliques C, add to C the γdense nodes n such that min w(n, i)_{i∈C}≥ τ(C)  
• Remove from the list of cluster the included clusters.  
2. Select clusters according to their svalue:  • Remove the "worst clusters": For all clusters C_{1}, C_{2} such that ${V}_{{c}_{1}}$ ≥ ${V}_{{c}_{2}}$, if ${V}_{{c}_{1}}$ ∩ ${V}_{{c}_{2}}$ ≥ γ × ${V}_{{c}_{1}}$ and s(C_{1}) ≥ s(C_{2}) and Gelnb(C_{1}) ≥ Gelnb(C_{2}) then remove C_{2}. 
• For all clusters C_{1}, C_{2} such that ${V}_{{c}_{1}}$ ∩ ${V}_{{c}_{2}}$ ≥ γ × min(${V}_{{c}_{1}}$, ${V}_{{c}_{2}}$), add the nodes of min(s(C_{1}), s(C_{2})) which have a maximal weight greater than τ(C_{ min }) in max(s(C_{1}), s(C_{2})).  
3. Select nodes which belong to several clusters:  • For all nodes n which belong to several clusters, remove n from all clusters where it does not have its maximal MeanW. 
Clusters search
The principle of the algorithm of cluster search is to find sets of nodes which are highly connected but not necessary all pairwise adjacent and with edges of high weight value (with respect to the size of the set). Maximal cliques were searched by using the BronKerbosch algorithm [17] with the heuristics of Koch [18] and Cazals [19]. Moreover, we kept the cliques of size 2 only if they were disconnected from the rest of the graph. The search of cliques did not take into account the weight of the edges, which had to be checked; moreover the nodes which are highly connected to the clique had to be added. After finding all the maximal cliques, we assumed that the nodes characterized by edges with a small maximal weight in the clique were noisy spots and therefore we removed them. If they are not, they will be restored in the next steps. In conclusion, the nodes n which have a maximal weight in a clique C smaller than a threshold value τ(C) (1) are removed (i.e. if ∃n ∈ C s.t max w(n, i)_{i∈C}≤ τ(C) then n is removed from C). On the principle, the threshold value insures a high tolerance to weakly connected nodes within a great clique and a low tolerance within a small clique.
where GelNb(C) is the number of gels in C. We can note that GelNb(C) can be different of V_{ c } because all nodes in C are not always coming from different gels, (c.f. spots c_{1} and ${{c}^{\prime}}_{1}$ in Figure 2). Thus, GelNb(C) ≤ V_{ c }. τ(C) gets its values from interval [$\frac{1}{2}$V_{ c }, V_{ c }]. The threshold gives a value which is close to $\frac{1}{2}$V_{ c } if the clique is of great size and a value closed to V_{ c } if the clique is of small size. It is worth noting that this formula is valid for clusters of nodes which are not cliques.
A node n is selected to be added to a clique C if it is connected with at least γ × V_{ c } nodes of C where γ ∈ [0, 1] is a parameter. γ is the percentage of nodes that a node has to be connected to belong to a clique. This value is chosen depending the quality of the gels and/or the matching. If qualities are not very good, γ has to be low to tolerate nodes which might have been missed in the matching process else has to γ be high not to give noisy SAP. We say that n is γdense in C. If G_{ N }= (V, E) is a a graph and C = (V_{ c }, E_{ c }) a subgraph of G_{ N }, then a node i ∈ V is γdense in C if there is a subset ${{V}^{\prime}}_{c}$ ⊂ V_{ c }such that for each j ∈ ${{V}^{\prime}}_{c}$ the edge (i, j) ∈ E and ${{V}^{\prime}}_{c}$ ≥ γ × V_{ c }. Moreover a node which is added to a clique C must have matched several times with the nodes of C (i.e min w(n, i)_{i∈C}≥ τ(C)). If at least one node has been added to a clique, the resulting set of nodes is not a clique anymore; we will called it in the following a cluster. It is worth noting that all the γdense nodes of a maximal clique C of a graph G_{ N }belong to a maximal clique of G_{ N }. This means that, at this step, many clusters are likely to be included in other clusters. When this happens, the included clusters are removed.
Select clusters according to their quality criteria value
Where the binomial coefficient $\left(\begin{array}{c}\left{V}_{c}\right\\ 2\end{array}\right)$ gives the maximum number of edges between the vertices in C. s(C) is the ratio between the number of edges and the highest possible number of edges. For all clusters C in G, we have s(C) ∈ [0, 1]. If s(C) = 1, C is a clique. Thus, a cluster represents a good alignment if its svalue is close to 1. If we find two clusters such that $\left{V}_{{c}_{1}}\cap {V}_{{c}_{2}}\right\ge \gamma \times max\left(\left{V}_{{c}_{1}}\right,\left{V}_{{c}_{2}}\right\right)$ we will remove the one with the lower svalue and the lower number of gels. As few clusters of small size have been removed, so clusters which share a lot of nodes with a bigger one may still remain. Our aim is to adjust the clusters in order to obtain a high value of s(C). Therefore, if there are two clusters such that the cluster with smaller number of nodes is γdense in the other (i.e $\left{V}_{{c}_{1}}\cap {V}_{{c}_{2}}\right\ge \gamma \times min\left(\left{V}_{{c}_{1}}\right,\left{V}_{{c}_{2}}\right\right)$), the nodes of the cluster with smaller svalue which have a maximal weight greater than τ(C_{ min }) are added in the cluster with greater svalue.
Select nodes which belong to several clusters
Thus, if a node n belongs to several clusters, n remains only in the cluster where n has its highest mean weight.
Results and discussion
We used Sili2DGel to study the variability of the urinary proteome from 20 healthy subjects. After 2D gel electrophoresis, silver straining and imaging as in [21], a recursive matching was performed with the Melanie software to identify every spots in each gel. The matching graph of these alignments had 16 386 nodes and 236 593 edges.
We also calculated the overall signal loss after Sili2Dgel spot alignment by comparing the total volume of each gel before and after treatment (Figure 6). All together, the sum of the percentage of the conserved intensity of all SAPrelated spots (1598) in the synthetic gel represented 80% of the sum of the percentage of intensity of all spots (2000) from the original gels. The 924 SAP of the synthetic gel covered 80% of the experimental signal. The remaining 20% of the signal was found in the set of rejected spots after spot alignment, and could be accessed by the user in the output table for possible further manual analysis (see also synthetic gel in Figure 5d). The SAPrelated spots constituted the set of data that were considered suitable for further statistical analysis.
Conclusion
In comparative proteomics studies, the large number of images generated by 2D gels is currently compared using spot matching algorithms. However, most of the software alignments return noisy gel matching which needs to be manually adjusted by the biologist. Moreover, several of these systems pair each gel only against a single reference gel and therefore some spots might be missed. To restore them, it is necessary to make recursive alignments. To meet the needs of clinical proteomics of comparing large sets of 2D gels, we have developed Sili2Dgel an automatic gel alignment method based on graph theory to find SAP (without manual adjustment) after a recursive alignment procedure. This method first constructs a matching graph and then reduces its complexity by searching all its maximal cliques, adding the γdense nodes with a high minimal weight, selecting the clusters with high size and quality values and selecting nodes which belong to several clusters. Each cluster is considered as a SAP in the synthetic gel and indicates the equivalent spot position in the complete set of gels.
All SAPrelated spots are available to the user for further statistical analysis. In addition, our method allows one to address recurrent clinical questions about the variability of biological samples leading to the issue of the conservation of proteins in the studied proteome. We used Sili2Dgel to analyze 20 normal urinary proteomes and we could show that spot conservation was heterogeneous, probably reflecting individual variations.
Finally, the input and output files of Sili2DGel (tabular text files) are compatible with the main 2D gel analysis systems on the market and this allows users to easily combine our method with their familiar environment, making Sili2Dgel a companion tool for users of current commercial proteome analysis software. It performs, after recursive gel matching, an automatic global spot alignment of large sets of related gels with little loss of global signal and a large number of SAP. If needed, the SAP can be used to enhance the resolution of other spots using the spot resolution from the best gels of the set. Sili2DGel performs noiseless automatic spot alignment for variability studies (as well as classical differential expression studies) of biological samples. It makes it very useful for typical clinical proteomic studies with large number of experiments.
Availability and requirements

Project name: Sili2DGel

Project home page: http://www.sysdiag.cnrs.fr/publications/supplementarymaterials/Sili2DGel/

Operating system(s): Platform independent

Programming language: Java
Declarations
Acknowledgements
We thank BaSysBio project for financial support and Guy Melançon for fruitful discussions and support. We also thank the three anonymous referees for their comments which helped improve the manuscript.
Authors’ Affiliations
References
 Biron D, Brun C, Lefevre T, Lebarbenchon C, Loxdale H, Chevenet F, Brizard J, Thomas F: The pitfalls of proteomics experiments without the correct use of bioinformatics tools. Proteomics 2006, 6: 5577–5596. 10.1002/pmic.200600223View ArticlePubMedGoogle Scholar
 Garbis S, Lubec G, Fountoulakis M: Limitations of current proteomics technologies. Journal of Chromatography A 2005, 1077: 1–18. 10.1016/j.chroma.2005.04.059View ArticlePubMedGoogle Scholar
 Aittokallio T, Salmi J, Nyman T, Nevalainen O: Geometrical distortions in twodimensional gels: applicable correction methods. Journal of Chromatography B 2005, 815: 25–37. 10.1016/j.jchromb.2004.07.037View ArticleGoogle Scholar
 Marengo E, Robotti E, Antonucci F, Cecconi D, Campostrini N, Righetti P: Numerical approaches for quantitative analysis of twodimensional maps: A review of commercial software and homemade systems. Proteomics 2005, 5: 654–666. 10.1002/pmic.200401015View ArticlePubMedGoogle Scholar
 Gustafsson J, Blomberg A, M R: Warping twodimensional electrophoresis gel images to correct for geometric distortions of the spot pattern. Electrophoresis 2002, 23: 1731–1744. 10.1002/15222683(200206)23:11<1731::AIDELPS1731>3.0.CO;2#View ArticlePubMedGoogle Scholar
 Berth M, Moser F, Kolbe M, Bernhardt J: The state of the art in the analysis of twodimensional gel electrophoresis images. Appl Microbiol Biotechnol 2007, 76: 1223–1243. 10.1007/s0025300711280PubMed CentralView ArticlePubMedGoogle Scholar
 Luhn S, Berth M, Hecker M, Bernhardt J: Using standard positions and image fusion to create proteome maps from collections of twodimensional gel electrophoresis images. Proteomics 2003, 3(7):1117–1127. 10.1002/pmic.200300433View ArticlePubMedGoogle Scholar
 Panek J, Vohradsky J: Point pattern matching in the analysis of twodimensional gel electropherograms. Electrophoresis 1999, 20: 3483–3491. 10.1002/(SICI)15222683(19991201)20:18<3483::AIDELPS3483>3.0.CO;2RView ArticlePubMedGoogle Scholar
 Dowsey A, Dunn M, Yang G: Automated image alignment for 2D gel electrophoresis in a highthroughput proteomics pipeline. Bioinformatics 2008, 24(7):950–957. 10.1093/bioinformatics/btn059View ArticlePubMedGoogle Scholar
 Garrels J: The QUEST system for quantitative analysis of twodimensional gels. J Biol Chem 1989, 264: 5269–5282.PubMedGoogle Scholar
 Appel R, Vargas J, Palagi P, Walther D, Hochstrasser D: Melanie IIa thirdgeneration software package for analysis of twodimensional electrophoresis images: I. Features and user interface. Electrophoresis 1997, 18: 2735–2748. 10.1002/elps.1150181507View ArticlePubMedGoogle Scholar
 Faergestad E, Rye M, Walczak B, Gidskehaug L, Wold J, Grove H, Jia X, Hollung K, Indahl U, Westad F, Berg F, Martens H: Pixelbased analysis of multiple images for the identification of changes: a novel approach applied to unravel proteome patters of 2D electrophoresis gel images. Proteomics 2007, 7: 3450–3461. 10.1002/pmic.200601026View ArticlePubMedGoogle Scholar
 Lemkin P, Lipkin L, EP L: Some extensions to the gellab TwoDimensional Electrophoretic Gel Analysis System. Clin Chem 1982, 28(4 Pt 2):840–849.PubMedGoogle Scholar
 Karp R: Reducibility among combinatorial problems. Complexity of Computer Computations 1972, 85–104.View ArticleGoogle Scholar
 Garey M, Johnson D: Computers and Intractability: a guide to the theory of NPcompleteness. WH Freeman 1983.Google Scholar
 Melançon G, Sallaberry A: Edge Metrics for Visual Graph Analytics: A Comparative Study. iv 2008, 0: 610–615.Google Scholar
 Bron C, Kerbosch J: Algorithm 457: finding all cliques of an undirected graph. Commun ACM 1973, 16: 575–577. 10.1145/362342.362367View ArticleGoogle Scholar
 Koch I: Enumerating all connected maximal common subgraphs in two graphs. Theor Comput Sci 2001, 250: 1–30. 10.1016/S03043975(00)002863View ArticleGoogle Scholar
 Cazals F, Karande C: Reporting maximal cliques: new insights into an old problem. Tech rep INRIA 2005.Google Scholar
 Chiricota Y, Jourdan F, Melançon G: Software components capture using graph clustering. In Theor Comput Sci. 11th International Workshop on Program Comprehension; 2003:217–226.Google Scholar
 Molina L, Grimaldi M, RobertHebmann V, Espert L, Varbanov M, Devaux C, Granier C, BiardPiechaczyk M: Proteomic analysis of the cellular responses induced in uninfected immune cells by cellexpressed X4 HIV1 envelope. Proteomics 2007, 7: 3116–3130. 10.1002/pmic.200700306View ArticlePubMedGoogle Scholar
 Adachi J, Kumar C, Zhang Y, Olsen J, Mann M: The human urinary proteome contains more than 1500 proteins including a large proportion of membrane proteins. Genome biology 2006., 7:Google Scholar
 Zerefos P, Vougas K, Dimitraki P, Kossida S, Petrolekas A, Stavrodimos K, Giannopoulos A, Fountoulakis M, A V: Charaterization of the human urine proteome by preparative electrophoresis in combination with 2DE. Proteomics 2006, 6: 4346–4355. 10.1002/pmic.200500671View ArticlePubMedGoogle Scholar
 Auber D: Tulip: A huge graph visualisation framework. Graph Drawing Softwares, Mathematics and Visualization 2003, 105–126.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.