A new method for 2D gel spot alignment: application to the analysis of large sample sets in clinical proteomics

Background In current comparative proteomics studies, the large number of images generated by 2D gels is currently compared using spot matching algorithms. Unfortunately, differences in gel migration and sample variability make efficient spot alignment very difficult to obtain, and, as consequence most of the software alignments return noisy gel matching which needs to be manually adjusted by the user. Results We present Sili2DGel an algorithm for automatic spot alignment that uses data from recursive gel matching and returns meaningful Spot Alignment Positions (SAP) for a given set of gels. In the algorithm, the data are represented by a graph and SAP by specific subgraphs. The results are returned under various forms (clickable synthetic gel, text file, etc.). We have applied Sili2DGel to study the variability of the urinary proteome from 20 healthy subjects. Conclusion Sili2DGel performs noiseless automatic spot alignment for variability studies (as well as classical differential expression studies) of biological samples. It is very useful for typical clinical proteomic studies with large number of experiments.


Background
Two-dimensional gel electrophoresis is a high resolution technique that is widely used in proteomics to separate thousands of proteins from a complex sample. After separation, a 2D map is obtained in which each protein, or isoform, is represented by a spot. In clinical proteomics the user has to analyze 2D maps of a large number of proteins as, very often, dozens of controls and pathological samples are compared. To allow this comparison, maps from all gels have to be aligned. Unfortunately, differences in gel migration and sample variability can render spot alignment very difficult [1]. There are two types of general limitations for 2D profiling: i) those due to variations in proteome composition and ii) those due to inadequacy of the analytical methods [2]. Computer-aided image analysis contributes to the second kind of limitations and may lead to analytical pitfalls [1]. For instance, 2D gel migration can cause geometrical distortion and variable spot coordinates in different gels [3,4] for many reasons [5]. During the process of image analysis, spot alignment is a critical step since it will condition spot comparison. Spot alignment can be performed mainly in two way: i) spot detection followed by spot-based image warping and finally spot alignment, or ii) pixel-based image warping followed by spot detection and then spot alignment [3]. In the first method the spot-based image warping corrects image distortion using user-defined landmark spots. This process can eventually be fully automated by making the spatial correction implicit [3]. The spot alignment is often expressed using a fusion gel that is representative to the whole experiment [6,7]. When the number of gels to be aligned is high, the distortion has to be modelled with a low-order polynomial transformation [3,8]. In this case, local geometric distortions are poorly corrected leading to an increase of noise in the spot alignment. In the second (pixel-based warping) method, the spatial correction is performed directly from raw-image data, taking advantage of techniques originating from image processing research. This approach leads to a more flexible image distortion (followed by spot detection) virtually eliminating matching problems. However, even if this method is more convenient, it remains bias due for instance to discontinuous change in intensity among the set of aligned gels [9] ending to affected spot intensity quantitation. In addition, the user must systematically adjust or control the spot alignment process by hand [4,6]. This is time consuming and a source of errors.
Up to now comparative analysis of 2D gels has been based on the utilization of commercial gel analysis systems (e.g. Pdquest [10], Melanie [11], Samespots [12], Proteomweaver, Gellab [13], etc.), which identify spots of interest by image comparison, a process called gel matching. While some systems pair each gel of a matching set against a single "reference gel" (e.g. Melanie, Pdquest, etc.), some other algorithms follow the concept of recursive gel matching (e.g. Samespot, Proteomweaver, etc.). This means that each gel of a matching set is recursively used as "reference gel" once during the matching process. However, the resulting spot alignment remains noisy and is not suitable for further statistical analysis. We propose herein a new algorithm for automatic spot alignment, called Sili2DGel, which uses data from recursive gel matching to return only the meaningful Spot Alignment Positions (SAP) for a given set of gels ( Figure 1). Sili2DGel is based on graph theory, input data are represented by a graph in which specific subgraphs are searched. The results are returned under various formats (clickable synthetic gel, text file, etc.). This approach provides the user with an automatic and efficient spot alignment tool suitable for analysis of a large set of 2D gels.

Alignment representation using graph theory
Ideally, after different experiments, a given protein should be represented by spots displayed at the same coordinates on each gel. However, if only a single reference gel is selected for a match set, spots that are not found in that reference gel will not form alignments. For instance in Figure 2, if Gel 1 is the reference gel then the alignment of {d2;d3} will not be recorded as they are not present in Gel 1. Moreover, different kinds of distortions can skew the matching. As a consequence, some spots are likely to end up in an alignment where they should not, and others will not be attributed to an alignment when they should. For instance in Figure 2, assuming that the spots b 1 , b 2 and b 3 represent the same protein, if b 2 matches with b 3 , it should be aligned with b 1 . Spots which belong to an alignment due to an error have to be eliminated (noise spots), and spots which are missing have to be restored (missing spots). The meaningful Spot Alignment Positions (SAP) correspond to the set of spots which represent the same protein after exclusion of the noise spots and reinstatement of the missing spots. SAP can be determined by analysing the alignments given by the recursive gel matching method. One should note that this program depends on prior accurate processing of the spots indentifications and the preliminary spot alignments which are not trivial tasks. So a spot that has not been recognised due to low signal level in spot identifications of the prior process, will be missing in the following analysis.
If N is the number of gels and S the set of all spots of the N gels, then for any spot i, gel matching will give all the spots j which match with i. We use the notation i → j when Principle of the Sili2DGel algorithm Figure 1 Principle of the Sili2DGel algorithm. Sili2DGel represents the result of recursive gel matching with a graph, decomposes it in disconnected subgraphs, searches specific subgraphs which represent the SAP and returns them under various formats.
Alignments are represented by a weighted undirected graph which is called matching graph. A node corresponds to a specific spot of a given gel and an edge represents the matches between spots. The weight of an edge is the number of matches between two spots. Therefore, it is less or equal to the number of gels.
If G N is the matching graph of N gels and |S| the size of a set S, then we have: where : . It is therefore necessary to clean the graph to find the sets of nodes which represent the same spots. This is done by removing the edges which represent wrong connexions.
The search of SAP on N gels comes down to finding specific subgraphs of the matching graph G N . In the best case, all the spots, which represent the same element, will pairwise match in n gels, with n ≤ N. In the matching graph the nodes representing these n spots are all connected together. This subgraph is called a clique (i.e. a complete graph); moreover all its edges are weighted by n. A graph However, alignments are not always perfect. The case where all the spots match at least once, but are not all pairwise adjacent, is represented in the matching graph by a clique in which at least one edge is weighted by a value lower than n. In the worst case, some spots will be missing in an alignment even through they should belong to it. If two spots from different gels never match together during the whole recursive matching procedure, but match with many of the other spots, they are not adjacents in the matching graph and the subgraph is not a clique. This subgraph contains a clique and some other nodes which are adjacents to several nodes of the clique; we call it a pseudoclique. In all cases, SAP are represented by dense clusters of nodes in the graph (i.e. nodes that are highly connected to each other) which are either cliques or pseudocliques.

Algorithm of SAP identification
Reducing the graph Finding a maximal clique is a classical NP-hard problem [14], thus exact algorithms are guaranteed to return a solution only in a time which increases exponentially with the number of vertices in the graph [15]. Therefore, one can expect exact solution methods to have limited performance on large datasets. To overcome this difficulty, we decomposed the graph into reduced subgraphs and then we determined the SAP in the corresponding reduced search spaces.
To reduce the search space, the graph is partitioned in all its connected components (i.e the maximal connected subgraphs). Before searching the connected component, we suppressed the edges weighting 1, as we assumed that an edge with a weight of 1 was not sufficient to belong to a pseudoclique, and that it could not represent an alignment of size 2 (which contains 2 spots). The suppression Example of alignments of three gels Gel 1 of these edges will not distort the results because if these spots are really in the same alignment, they should have a high connectivity with some other spots of the alignment and so they would be later restored.
Pseudocliques represent very dense clusters of the graph. To select very dense clusters, the isthmuses (i.e. the edges which separate a set of nodes of the graph in two highly connected subgraphs) have to be eliminated. The strength metric [16] allows the isthmuses determination by measuring how much an edge is likely to separate a graph in two highly connected subgraphs. It is defined as: Values of sm are between 0 and 2. A low value indicates that the edge is more likely to act as an isthmus whereas a high value signifies that it is potentially at the centre of a cluster. It is worth noting that a null strength metric is an isthmus and a value of two is an edge which belongs to an isolated clique. Thus, edges with a small strength metric (lower than a threshold value sm) are suppressed to reduce the graph. If the graph is highly connected, sm value should not to be too low to allow a good reduction of the graph. Then, the connected components are calculated and SAP can be researched into all these reduced graphs. Table 1 represents the overall algorithm of SAP search and Table 2 the algorithm of cluster search for each connected component of the graph.

Clusters search
The principle of the algorithm of cluster search is to find sets of nodes which are highly connected but not necessary all pairwise adjacent and with edges of high weight value (with respect to the size of the set). Maximal cliques were searched by using the Bron-Kerbosch algorithm [17] with the heuristics of Koch [18] and Cazals [19]. Moreover, we kept the cliques of size 2 only if they were disconnected from the rest of the graph. The search of cliques did not take into account the weight of the edges, which had to be checked; moreover the nodes which are highly connected to the clique had to be added. After finding all the maximal cliques, we assumed that the nodes characterized  Figure 3 Graph representations. (a) raw graph (before treatment) and (b) treated graph (after treatment). Graphs were represented with the Tulip software [24]. The raw graph is composed of good SAP (3a, bottom left panel) and noisy SAP (3a, bottom right panel)

a) b)
by edges with a small maximal weight in the clique were noisy spots and therefore we removed them. If they are not, they will be restored in the next steps. In conclusion, the nodes n which have a maximal weight in a clique C smaller than a threshold value τ(C) (1) are removed (i.e. if ∃n ∈ C s.t max w(n, i) i∈C ≤ τ(C) then n is removed from C). On the principle, the threshold value insures a high tolerance to weakly connected nodes within a great clique and a low tolerance within a small clique.
If G N = (V, E) is a matching graph of N gels and C = (V c , E c ) a subgraph of G N , then we define the threshold of C as a function τ such that: where GelNb(C) is the number of gels in C. We can note that GelNb(C) can be different of |V c | because all nodes in A node n is selected to be added to a clique C if it is connected with at least γ × |V c | nodes of C where γ ∈ [0, 1] is a parameter. γ is the percentage of nodes that a node has to be connected to belong to a clique. This value is chosen depending the quality of the gels and/or the matching. If qualities are not very good, γ has to be low to tolerate nodes which might have been missed in the matching process else has to γ be high not to give noisy SAP. We say that n is γ-dense in C. If G N = (V, E) is a a graph and C = (V c , E c ) a subgraph of G N , then a node i ∈ V is γ-dense in C if there is a subset ⊂ V c such that for each j ∈ the edge (i, j) ∈ E and | | ≥ γ × |V c |. Moreover a node which 1. Graph construction.
• Remove edges having a weight of 1.

Graph reduction:
• Remove edges for which the strength metric is lower than sm.
• Decompose the graph into its connected components.
3. For each connected components, cluster_search(γ). • For all cliques C, remove the nodes n for which max w(n, i) i∈C ≤ τ(C) • For all cliques C, add to C the γ-dense nodes n such that min w(n, i) i∈C ≥ τ(C)

Return the new graph.
• Remove from the list of cluster the included clusters.
3. Select nodes which belong to several clusters: • For all nodes n which belong to several clusters, remove n from all clusters where it does not have its maximal MeanW.
is added to a clique C must have matched several times with the nodes of C (i.e min w(n, i) i∈C ≥ τ(C)). If at least one node has been added to a clique, the resulting set of nodes is not a clique anymore; we will called it in the following a cluster. It is worth noting that all the γ-dense nodes of a maximal clique C of a graph G N belong to a maximal clique of G N . This means that, at this step, many clusters are likely to be included in other clusters. When this happens, the included clusters are removed.

Select clusters according to their quality criteria value
Clusters which share a lot of nodes can remain, whereas, clusters which are characterized by a small size and low quality criteria will be removed. The clustering quality measure [20] for a cluster C, s(C), is defined as follows: Where the binomial coefficient gives the maximum number of edges between the vertices in C. s(C) is the ratio between the number of edges and the highest possible number of edges. For all clusters C in G, we have s(C) ∈ [0, 1]. If s(C) = 1, C is a clique. Thus, a cluster represents a good alignment if its s-value is close to 1. If we find two clusters such that we will remove the one with the lower s-value and the lower number of gels. As few clusters of small size have been removed, so clusters which share a lot of nodes with a bigger one may still remain. Our aim is to adjust the clusters in order to obtain a high value of s(C). Therefore, if there are two clusters such that the cluster with smaller number of nodes is γ-

Select nodes which belong to several clusters
The last step is to remove nodes, which belong to several clusters, from their worst clusters. If G N = (V, E) is a graph and C = (V c , E c ) a subgraph of G N , then the mean weight of a node n ∈ V in C is defined as the sum of all weights of all edges of n divided by the total number of vertices in C: Thus, if a node n belongs to several clusters, n remains only in the cluster where n has its highest mean weight.

Results and discussion
We used Sili2DGel to study the variability of the urinary proteome from 20 healthy subjects. After 2D gel electrophoresis, silver straining and imaging as in [21], a recursive matching was performed with the Melanie software to identify every spots in each gel. The matching graph of these alignments had 16 386 nodes and 236 593 edges.
The raw matching graph (Figure 3a) allowed us to notice that the spots (i.e nodes of the graph) were very connected while the graph should have been composed of subgraphs which look like cliques. Therefore, it was not possible to make a relevant large scale statistical analysis at that stage. By applying Sili2DGel which withdrew background noises with parameter settings γ = 0.4 and sm = 0.8, we obtained 924 SAP of which 634 were cliques, 152 contained several spots in the same gels, 92 were conserved in all gels (Figure 4a) and only 25 had a clustering measure lower than 0.7 (Figure 4b). The closer sm is to zero, the more an edge will represent an isthmus and the closer sm is to 2, the more the edge in the centre of a clique. Looking at the raw data, we never found any edges representing an isthmus. So, after probing various values for the sm parameter, we found a value of 0.8 as the best compromise. The resulting graph contained 11 746 nodes and 80 769 edges (Figure 3b). All the subgraphs were clusters which represented the alignments and were either cliques or pseudocliques. These clusters represented good choices because the s-value for 770 of them was greater than 0.9 ( Figure 4b) and only few clusters with a low s-value were left. Our software provided a synthetic gel that conveniently represented the SAP distribution and spot conservation among the studied gels (Figure 4a). We observed that spot conservation in the urinary proteome was heterogeneous. Indeed, by looking at the SAP length distribution we observed occurrences for all the possible SAP length from 3 to 20 gels. Interestingly, we noticed that the highest occurrence is found for the spots strictly conserved among the 20 gels. The more variable spots (SAP length of 4) are more rare in this study. This heterogeneity is consistent with experimental data found in the literature [22,23].
The analysis showed that 152 SAP contained several spots from the same gels. This suggests that for a given SAP, gels where single spots are found have a lower resolution than gels with duplicated spots in the corresponding SAP (for instance, see spots c 2 and c 3 of Figure 2). As a consequence, the resolution of a specific spot of a low resolution gel could be enhanced by using the corresponding spot from a better resolution gel. This set of heterogeneous SAP is provided to the user to allow specific analysis. Our software provided a synthetic gel which corresponds to all the SAP found among all the gels which have been identified (Figure 5a) with the algorithm. Figure 5 shows the raw spots from Gel1 before edge reduction (Figure 5c) built from the Gel 1 image (Figure 5b) and the difference between the SAP-related spots from Gel 1 (Figure 5d) and the rejected spots (Figure 5e). When we compared the percentage of the volume of the SAP-related spots in the synthetic gel to that of the rejected spots for Gel 1, we observed an average conservation of 80% ( Figure 6) of the original signal. Indeed, out of the 947 spots of Gel1 (100% of volume), 717 spots (88% of volume) were related to a SAP of the synthetic gel. The rejected spots, which represented on average the remaining 12% of the spots, were considered as ambiguous signals (see rejected spots set for Gel1 in figure 5d).
We also calculated the overall signal loss after Sili2Dgel spot alignment by comparing the total volume of each gel before and after treatment ( Figure 6). All together, the sum of the percentage of the conserved intensity of all SAP-related spots (1598) in the synthetic gel represented 80% of the sum of the percentage of intensity of all spots (2000) from the original gels. The 924 SAP of the synthetic gel covered 80% of the experimental signal. The remaining 20% of the signal was found in the set of rejected spots after spot alignment, and could be accessed by the user in the output table for possible further manual analysis (see also synthetic gel in Figure 5d). The SAPrelated spots constituted the set of data that were considered suitable for further statistical analysis.

Conclusion
In comparative proteomics studies, the large number of images generated by 2D gels is currently compared using spot matching algorithms. However, most of the software alignments return noisy gel matching which needs to be manually adjusted by the biologist. Moreover, several of these systems pair each gel only against a single reference gel and therefore some spots might be missed. To restore them, it is necessary to make recursive alignments. To meet the needs of clinical proteomics of comparing large sets of 2D gels, we have developed Sili2Dgel an automatic gel alignment method based on graph theory to find SAP (without manual adjustment) after a recursive alignment procedure. This method first constructs a matching graph and then reduces its complexity by searching all its maximal cliques, adding the γ-dense nodes with a high minimal weight, selecting the clusters with high size and quality values and selecting nodes which belong to several clusters. Each cluster is considered as a SAP in the synthetic gel and indicates the equivalent spot position in the complete set of gels.
All SAP-related spots are available to the user for further statistical analysis. In addition, our method allows one to address recurrent clinical questions about the variability of biological samples leading to the issue of the conservation of proteins in the studied proteome. We used Sili2Dgel to analyze 20 normal urinary proteomes and we could show that spot conservation was heterogeneous, probably reflecting individual variations.
Finally, the input and output files of Sili2DGel (tabular text files) are compatible with the main 2D gel analysis systems on the market and this allows users to easily combine our method with their familiar environment, making Sili2Dgel a companion tool for users of current commercial proteome analysis software. It performs, after recursive gel matching, an automatic global spot alignment of large sets of related gels with little loss of global signal and a large number of SAP. If needed, the SAP can be used to SAP distribution and clustering measure enhance the resolution of other spots using the spot resolution from the best gels of the set. Sili2DGel performs noiseless automatic spot alignment for variability studies (as well as classical differential expression studies) of biological samples. It makes it very useful for typical clinical proteomic studies with large number of experiments.