During the past decade, protein sequence databases have grown exponentially, reaching several million entries, while 3D protein structure databases have grown quadratically, reaching, in the case of the Protein Data Bank (PDB), ∼30000 non-redundant structures sharing less than 90% sequence identity. Various methods exist for assigning a structure, and then a function, to as many new sequences as possible. When a sequence is similar enough to the sequence of one or more known 3D structures, methods based on homology modeling give satisfactory results. When sequence similarity falls in the “twilight zone” (i.e. under 30% sequence identity), one has to resort to other methods. Among those, threading methods take advantage of available 3D structures to infer a 3D structure for a new sequence. Using statistical filters parametrized on a library of structural cores (i.e. a bank of invariant structural motifs of protein families), they correlate 1D (i.e. sequential) and 3D information. In this context, the predictive ability of a threading method directly depends on the representativeness and exhaustiveness of the core library. Such a library can be built upon a set of representative structures taken from expert structural classifications [2, 3] such as SCOP and CATH. However, because they require careful manual inspection of the data, these expert classifications have difficulty coping with the growing number of newly determined protein structures. For instance, since the last version of SCOP (1.75), the total number of non-redundant protein chains in the PDB has grown by about 21% (from 10417 to 12643; VAST non-redundant set for a BLAST p-value of 10^-7, available at ftp://ftp.ncbi.nih.gov/mmdb/nrtable/). Hence automatic and fast clustering approaches become necessary.
Over the past decade there have been many attempts to develop automatic classification procedures, mainly by applying supervised classification methods that use part of a reference classification as labels for known 3D structures. Jain and Hirst proposed such a supervised machine learning (ML) algorithm, based on random forests, to learn how to classify a new 3D structure into a SCOP family. In their approach, a 3D structure is described using a set of global structural descriptors composed from four to six secondary structural elements (SSEs) for protein domains. However, supervised classification methods depend heavily on the reference classification, whose labels are fixed, and therefore only partially address the problem of automatic classification of 3D structures.
Røgen and Fain suggested an unsupervised approach using a description of protein structures derived from knot theory in order to describe the compared structures globally. Zemla et al. proposed a similarity scoring function that aims at automatically identifying local and global structurally conserved regions in order to drive a clustering algorithm. Sam et al. investigated a variety of tree-cutting strategies and found some irreducible differences between the best possible automatic partitions and SCOP classifications. These results have been confirmed by the work of Pascual-Garcia et al., who investigated the non-transitivity of objective structural similarity measures: a protein A can be found similar to another protein B, and protein B similar to a third protein C, and still proteins A and C may share no similarity. They have shown that non-transitivity, which does occur at low similarity levels, leads to non-uniqueness of the partition resulting from the clustering process. For fine granularity (i.e. high similarity levels), structural transitivity is satisfied with few violations within a given cluster, and different classification procedures converge to the same partition. For coarser granularities (i.e. lower similarity levels), the similarity measures are computed on distorted and divergent 3D motifs, so partitioning the set of structures requires choosing which transitivity violations to ignore. Depending on these choices, classifications may differ significantly.
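The non-transitivity described above can be made concrete with a small sketch. The scores, the threshold value and the function name `transitivity_violations` are illustrative assumptions, not data or code from the cited work; the only point is that thresholding a pairwise similarity measure can link A to B and B to C while leaving A and C unlinked.

```python
from itertools import permutations

def transitivity_violations(similarity, threshold):
    """List triples (a, b, c) with a~b and b~c but not a~c.

    `similarity` maps frozenset({x, y}) to a pairwise score (toy values);
    two proteins count as similar when the score reaches `threshold`.
    """
    proteins = sorted({p for pair in similarity for p in pair})

    def similar(x, y):
        return similarity.get(frozenset((x, y)), 0.0) >= threshold

    violations = []
    for a, b, c in permutations(proteins, 3):
        # a < c avoids reporting the same triple twice (as a-b-c and c-b-a).
        if a < c and similar(a, b) and similar(b, c) and not similar(a, c):
            violations.append((a, b, c))
    return violations

# Hypothetical scores: A and C are each close to B but not to each other.
scores = {
    frozenset(("A", "B")): 0.8,
    frozenset(("B", "C")): 0.7,
    frozenset(("A", "C")): 0.2,
}
print(transitivity_violations(scores, threshold=0.5))
```

With these toy values, a threshold of 0.5 yields one violation centered on B, whereas a very permissive threshold of 0.1 links all three proteins and yields none, which mirrors the dependence on granularity discussed above.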
Furthermore, such similarity-based classification procedures for 3D structures consider only a single overall pairwise similarity measure or score, derived from local similarities, and do not make use of the detailed mapping of similar parts computed during the alignment process. As a consequence, by ignoring the mapping information, these procedures may cluster together proteins that do not all share a common motif. This point is further illustrated in the Simple case studies section. Therefore, prior to running a graph-based clustering process, we propose to make use of the mapping information in ternary similarity constraints applied to triples of structures. Our experiments compare the agreement between the SCOP reference classification and the automatic classifications obtained with and without that preliminary processing.
First, we use the degree of similarity between two protein structures to build a similarity graph whose vertices are protein structures and whose edges correspond to similarities exceeding a given threshold. Such a graph can be given directly as input to a graph-based clustering process. However, our proposal is to use the mapping information to define similarities between protein alignments, as follows. Let us define an alignment between two proteins A and B as a one-to-one mapping of (sub)parts of A onto (sub)parts of B. A similarity between two alignments is then defined if the two alignments share a common sequence. More precisely, the alignment between protein A and protein B and the alignment between protein B and protein C are said to be similar if the (sub)parts of B involved in both alignments constitute a significant part of at least one of the two alignments. In other words, we consider a ternary similarity between A, B and C, centered on B, which is stronger when the regions of B involved in its similarity with A are also involved in its similarity with C. The aim of the preprocessing step is to require that, whenever there is an edge between proteins A and B and an edge between proteins B and C, the ternary similarity centered on B, which quantifies the common part shared by the three proteins, is high enough. In that case we say that the ternary constraints are satisfied. The preprocessing step then consists in reducing the original graph to a graph satisfying the ternary constraints.
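As a minimal sketch of the ternary similarity centered on B, one can represent each alignment by the set of residue positions of B that it maps. The scoring formula below (shared positions divided by the size of the smaller aligned region, so that the overlap is a large fraction of at least one alignment) is an assumption made for illustration; the text above does not fix the exact formula, and the names `ternary_similarity`, `satisfies_ternary_constraint` and `min_fraction` are hypothetical.

```python
def ternary_similarity(b_in_ab, b_in_bc):
    """Overlap of the regions of B used in the A-B and B-C alignments.

    Both arguments are sets of residue positions of B. The score is the
    number of shared positions divided by the size of the smaller region
    (an assumed formula, chosen so that a high score means the overlap
    covers a significant part of at least one alignment).
    """
    if not b_in_ab or not b_in_bc:
        return 0.0
    shared = len(b_in_ab & b_in_bc)
    return shared / min(len(b_in_ab), len(b_in_bc))

def satisfies_ternary_constraint(b_in_ab, b_in_bc, min_fraction=0.5):
    """True when the ternary similarity reaches the assumed cutoff."""
    return ternary_similarity(b_in_ab, b_in_bc) >= min_fraction

# Toy example: A aligns to residues 10..49 of B, C to residues 40..99 of B.
b_with_a = set(range(10, 50))
b_with_c = set(range(40, 100))
print(ternary_similarity(b_with_a, b_with_c))            # 10 shared / 40
print(satisfies_ternary_constraint(b_with_a, b_with_c))
```

In this toy case A and C are both similar to B, but through largely different regions of B, so the triple would not be considered to share a significant common part under the assumed cutoff.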
To summarize, the method, briefly introduced previously, starts by building a graph of 3D structures whose edges represent pairwise similarities. That graph is first transformed into its line graph, which represents the adjacencies between the graph edges. Applying the ternary constraints results in eliminating some vertices of the line graph. A maximal line graph is then extracted from the resulting graph. The graph of 3D structures corresponding to this maximal line graph now satisfies the ternary constraints: every triple of linked proteins corresponds to a significant structural motif. In our experiments, MCL, a graph clustering algorithm previously applied with success to the large-scale clustering of protein sequences into families, is used to obtain the final classification. That classification is then compared to the expert classification SCOP at its finest granularity (i.e. the SCOP “Family” level). We also experiment with a standard clustering method suited to applications involving a large and unknown number of clusters; the preprocessing step is applied in these experiments as well.
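The first steps of this pipeline can be sketched as follows, under stated assumptions. The sketch builds the line graph of a small similarity graph and reports which adjacent edge pairs violate a ternary constraint; it does not implement the vertex elimination, the extraction of a maximal line graph, or the MCL clustering. The overlap formula, the `min_fraction` cutoff, the `aligned` lookup table and all toy data are hypothetical.

```python
from itertools import combinations

def line_graph(edges):
    """Line graph of a simple undirected graph.

    Each original edge becomes a vertex (a frozenset of its two endpoints);
    two such vertices are adjacent when the edges share an endpoint.
    """
    vertices = [frozenset(e) for e in edges]
    adjacency = {v: set() for v in vertices}
    for u, v in combinations(vertices, 2):
        if u & v:  # the two edges share a protein
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency

def violated_adjacencies(adjacency, aligned, min_fraction=0.5):
    """Adjacent edge pairs whose ternary similarity is below the cutoff.

    `aligned[(edge, protein)]` is the set of residues of `protein` mapped
    by the alignment behind `edge` (hypothetical input format).
    """
    violations = set()
    for u, neighbours in adjacency.items():
        for v in neighbours:
            (center,) = u & v  # distinct edges share exactly one protein
            ru, rv = aligned[(u, center)], aligned[(v, center)]
            smallest = min(len(ru), len(rv))
            overlap = len(ru & rv) / smallest if smallest else 0.0
            if overlap < min_fraction:
                violations.add(frozenset((u, v)))
    return violations

# Toy similarity graph: a path A - B - C - D.
ab, bc, cd = (frozenset(p) for p in (("A", "B"), ("B", "C"), ("C", "D")))
aligned = {
    (ab, "B"): set(range(0, 40)), (bc, "B"): set(range(30, 90)),
    (bc, "C"): set(range(0, 50)), (cd, "C"): set(range(5, 45)),
}
lg = line_graph([ab, bc, cd])
bad = violated_adjacencies(lg, aligned)
print(bad)
```

In this toy graph the triple centered on B is flagged, because the A-B and B-C alignments use mostly different regions of B, while the triple centered on C passes. How violations are then resolved into a graph satisfying all ternary constraints is the part of the method this sketch leaves out.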