Multiple graph regularized protein domain ranking
 Jim JingYan Wang^{1},
 Halima Bensmail^{2} and
 Xin Gao^{1, 3}
https://doi.org/10.1186/1471-2105-13-307
© Wang et al; licensee BioMed Central Ltd. 2012
Received: 22 May 2012
Accepted: 29 October 2012
Published: 19 November 2012
Abstract
Background
Protein domain ranking is a fundamental task in structural biology. Most protein domain ranking methods rely on the pairwise comparison of protein domains while neglecting the global manifold structure of the protein domain database. Recently, graph regularized ranking that exploits the global structure of the graph defined by the pairwise similarities has been proposed. However, the existing graph regularized ranking methods are very sensitive to the choice of the graph model and parameters, and this remains a difficult problem for most of the protein domain ranking methods.
Results
To tackle this problem, we have developed the Multiple Graph regularized Ranking algorithm, MultiGRank. Instead of using a single graph to regularize the ranking scores, MultiGRank approximates the intrinsic manifold of protein domain distribution by combining multiple initial graphs for the regularization. Graph weights are learned with ranking scores jointly and automatically, by alternately minimizing an objective function in an iterative algorithm. Experimental results on a subset of the ASTRAL SCOP protein domain database demonstrate that MultiGRank achieves a better ranking performance than single graph regularized ranking methods and pairwise similarity based ranking methods.
Conclusion
The problem of graph model and parameter selection in graph regularized protein domain ranking can be solved effectively by combining multiple graphs. This aspect of generalization introduces a new frontier in applying multiple graphs to solving protein domain ranking applications.
Background
Proteins contain one or more domains each of which could have evolved independently from the rest of the protein structure and which could have unique functions[1, 2]. Because of molecular evolution, proteins with similar sequences often share similar folds and structures. Retrieving and ranking protein domains that are similar to a query protein domain from a protein domain database are critical tasks for the analysis of protein structure, function, and evolution[3–5]. The similar protein domains that are classified by a ranking system may help researchers infer the functional properties of a query domain from the functions of the returned protein domains.
The output of a ranking procedure is usually a list of database protein domains that are ranked in descending order according to a measure of their similarity to the query domain. The choice of a similarity measure largely defines the performance of a ranking system as argued previously[6]. A large number of algorithms for computing similarity as a ranking score have been developed:
Pairwise protein domain comparison algorithms compute the similarity between a pair of protein domains either by protein domain structure alignment or by comparing protein domain features. Protein structure alignment based methods compare protein domain structures at the level of residues, and sometimes even atoms, to detect structural similarities with high sensitivity and accuracy. For example, Carpentier et al. proposed YAKUSA[7], which compares protein structures using one-dimensional characterizations based on protein backbone internal angles, while Jung and Lee proposed SHEBA[8] for structural database scanning based on environmental profiles. Protein domain feature based methods extract structural features from protein domains and compute their similarity using a similarity or distance function. For example, Zhang et al. used the 32-D tableau feature vector in a comparison procedure called IR tableau[3], while Lee and Lee introduced a measure called WDAC (Weighted Domain Architecture Comparison) that is used in the protein domain comparison context[9]. Both of these methods use cosine similarity for comparison purposes.
Graph-based similarity learning algorithms address a shortcoming of the traditional protein domain comparison methods mentioned above, which focus on detecting pairwise sequence alignments while neglecting all other protein domains in the database and their distributions. To tackle this problem, a graph-based transductive similarity learning algorithm has been proposed[6, 10]. Instead of computing pairwise similarities for protein domains, graph-based methods take advantage of the graph formed by the existing protein domains. By propagating similarity measures between the query protein domain and the database protein domains via graph transduction (GT), a better metric for ranking database protein domains can be learned.
The main component of graph-based ranking is the construction of a graph as an estimate of the intrinsic manifold of the database. As argued by Cai et al.[11], there are many ways to define different graphs with different models and parameters. However, up to now there have been, in general, no explicit rules for choosing graph models and parameters. In[6], the graph parameters were determined by a grid search over different pairs of parameters. In[11], several graph models were considered for graph regularization, and exhaustive experiments were carried out for the selection of a graph model and its parameters. However, these kinds of grid-search strategies select parameters from discrete values in the parameter space, and thus lack the ability to approximate an optimal solution. Cross-validation[12, 13] can also be used for parameter selection, but it does not always scale well to many graph parameters, and it may overfit the training and validation sets while not generalizing well to the query set.
In[14], Geng et al. proposed an ensemble manifold regularization (EMR) framework that combines automatic intrinsic manifold approximation with semi-supervised learning (SSL)[15, 16] of a support vector machine (SVM)[17, 18]. Based on the EMR idea, we attempted to solve the problem of graph model and parameter selection by fusing multiple graphs to obtain a ranking score learning framework for protein domain ranking. We first outlined the graph regularized ranking score learning framework by optimizing ranking score learning with both relevance and graph constraints, and then generalized it to the multiple graph case. First, a pool of initial guesses of the graph Laplacian with different graph models and parameters is computed, and then they are combined linearly to approximate the intrinsic manifold. The optimal graph models with optimal parameters are selected by assigning larger weights to them. Meanwhile, ranking score learning is also restricted to be smooth along the estimated graph. Because the graph weights and ranking scores are learned jointly, a unified objective function is obtained. The objective function is optimized alternately and conditionally with respect to multiple graph weights and ranking scores in an iterative algorithm. We have named our Multiple Graph regularized Ranking method MultiGRank. It is composed of an offline graph weights learning algorithm and an online ranking algorithm.
Methods
Graph model and parameter selection
We are given a data set of protein domains represented by their 32-D tableau feature vectors[3], $\mathcal{X}=\{x_1,x_2,\cdots,x_N\}$, where $x_i\in\mathbb{R}^{32}$ is the tableau feature vector of the ith protein domain; $x_q$ is the query protein domain, and the others are database protein domains. We define the ranking score vector as $\mathbf{f}=[f_1,f_2,\cdots,f_N]^{\top}\in\mathbb{R}^{N}$, in which $f_i$ is the ranking score of $x_i$ to the query domain. The problem is to rank the protein domains in $\mathcal{X}$ in descending order according to their ranking scores and return several of the top ranked domains as the ranking results, so that the returned protein domains are as relevant to the query as possible. Here we define two types of protein domains: relevant when they belong to the same SCOP fold type[19], and irrelevant when they do not. We denote the SCOP-fold labels of the protein domains in $\mathcal{X}$ as $\mathcal{L}=\{l_1,l_2,\cdots,l_N\}$, where $l_i$ is the label of the ith protein domain and $l_q$ is the query label. The optimal ranking scores of relevant protein domains $\{x_i\}, l_i=l_q$ should be larger than those of the irrelevant ones $\{x_i\}, l_i\ne l_q$, so that the relevant protein domains will be returned to the user.
Graph regularized protein domain ranking
We applied two constraints on the optimal ranking score vector f to learn the optimal ranking scores:
Relevance constraint: Because the query protein domain reflects the search intention of the user, f should be consistent with protein domains that are relevant to the query. We also define a relevance vector of the protein domains as $\mathbf{y}=[y_1,y_2,\cdots,y_N]^{\top}\in\{1,0\}^{N}$, where $y_i=1$ if $x_i$ is relevant to the query and $y_i=0$ if it is not. Because the type label $l_q$ of a query protein domain $x_q$ is usually unknown, we know only that the query is relevant to itself and have no prior knowledge of whether or not others are relevant; therefore, we can only set $y_q=1$, while $y_i$, $i\ne q$, is unknown.
where D is a diagonal matrix whose entries are $D_{ii}=\sum_{j=1}^{N}W_{ij}$ and $L=D-W$ is the graph Laplacian matrix. This is a basic identity in spectral graph theory and it provides some insight into the remarkable properties of the graph Laplacian.
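For reference, the identity in its usual textbook form is written out below. It assumes a symmetric weight matrix W and the conventional factor 1/2; neither is stated explicitly in the surrounding text, so this is a sketch rather than a quotation of the original equation.

$$
\frac{1}{2}\sum_{i,j=1}^{N}(f_i-f_j)^2\,W_{ij}
=\sum_{i=1}^{N}f_i^2\,D_{ii}-\sum_{i,j=1}^{N}f_i f_j\,W_{ij}
=\mathbf{f}^{\top}D\,\mathbf{f}-\mathbf{f}^{\top}W\,\mathbf{f}
=\mathbf{f}^{\top}L\,\mathbf{f}
$$

The left-hand side is small when strongly connected domains receive similar scores, which is why $\mathbf{f}^{\top}L\,\mathbf{f}$ can act as a smoothness penalty.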
The two constraints are combined in an objective function O(f), where α is a trade-off parameter of the smoothness penalty. The solution is obtained by setting the derivative of O(f) with respect to f to zero as f = (U + αL)^{−1}U y. In this way, information from both the query protein domain provided by the user and the relationship of all the protein domains in $\mathcal{X}$ is used to rank the protein domains in $\mathcal{X}$. The query information is embedded in y and U, while the protein domain relationship information is embedded in L. The final ranking results are obtained by balancing the two sources of information. In this paper, we call this method Graph regularized Ranking (GRank).
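The closed-form solution can be tried on a toy graph with a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `grank_scores`, the four-node weight matrix and α = 1 are all hypothetical. One further assumption is labeled in the code. If U has a single nonzero entry (the query) and the graph is connected, the constant vector solves the linear system exactly, so the demo sets U = I and treats unknown entries of y as 0 to obtain informative scores.

```python
import numpy as np

def grank_scores(W, y, u_diag, alpha=1.0):
    """Solve (U + alpha * L) f = U y for the ranking score vector f."""
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W
    U = np.diag(u_diag)                     # diagonal matrix U
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(U + alpha * L, U @ y)

# Hypothetical toy graph: two tight pairs (0,1) and (2,3) joined by a weak edge 1-2.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
y = np.array([1.0, 0.0, 0.0, 0.0])          # domain 0 plays the role of the query
u_all = np.ones(4)                          # ASSUMPTION: U = I, unknown y_i treated as 0
f = grank_scores(W, y, u_all, alpha=1.0)
ranking = np.argsort(-f)                    # indices sorted by descending score
```

In this toy case the scores decay with graph distance from the query, so the domain that shares a strong edge with the query is ranked directly after it.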
Multiple graph learning and ranking: MultiGRank
Here we describe the multiple graph learning method to directly learn a self-adaptive graph for ranking regularization. The graph is assumed to be a linear combination of multiple predefined graphs (referred to as base graphs). The graph weights are learned in a supervised way by considering the SCOP fold types of the protein domains in the database.
Multiple graph regularization
The main component of graph regularization is the construction of a graph. As described previously, there are many ways to find the neighbors${\mathcal{N}}_{i}$ of x_{ i } and to define the weight matrix W on the graph[11]. Several of them are as follows:

Gaussian kernel weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing the squared Euclidean distance as, $\begin{array}{c}\|x_i-x_j\|^2=x_i^{\top}x_i-2x_i^{\top}x_j+x_j^{\top}x_j\end{array}$(4)
and the edges are weighted by a Gaussian kernel of this distance, where σ is the bandwidth of the kernel.

Dot-product weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing the squared Euclidean distance and the weighting is computed as the dot product as, $\begin{array}{c}W_{ij}=\left\{\begin{array}{ll}x_i^{\top}x_j, & \mathrm{if}\ (i,j)\in\mathcal{E}\\ 0, & \mathrm{else}\end{array}\right.\end{array}$(6)

Cosine similarity weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing cosine similarity as, $\begin{array}{c}C(x_i,x_j)=\frac{x_i^{\top}x_j}{\|x_i\|\,\|x_j\|}\end{array}$(7)

Jaccard index weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing the Jaccard index[20] as, $\begin{array}{c}J(x_i,x_j)=\frac{|x_i\cap x_j|}{|x_i\cup x_j|}\end{array}$(9)

Tanimoto coefficient weighted graph: $\mathcal{N}_i$ of $x_i$ is found by comparing the Tanimoto coefficient as, $\begin{array}{c}T(x_i,x_j)=\frac{x_i^{\top}x_j}{\|x_i\|^2+\|x_j\|^2-x_i^{\top}x_j}\end{array}$(11)
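As a rough illustration of how such a pool of base graphs might be built, the NumPy sketch below constructs symmetric k-nearest-neighbour graphs with Gaussian, dot-product, cosine and Tanimoto weights and forms each Laplacian L = D − W. The function names, the neighbourhood size k, the symmetrization rule, the Gaussian form exp(−d²/(2σ²)) and the use of each similarity value as the edge weight are assumptions made for this sketch. The Jaccard graph is left out because it needs a set representation of the features, and the random matrix X only stands in for real tableau feature vectors.

```python
import numpy as np

def knn_edges(S, k):
    """Symmetric k-nearest-neighbour edge mask from a similarity matrix S (larger = closer)."""
    n = S.shape[0]
    S = S.astype(float).copy()
    np.fill_diagonal(S, -np.inf)                  # a node is never its own neighbour
    nearest = np.argsort(-S, axis=1)[:, :k]       # k most similar nodes per row
    E = np.zeros((n, n), dtype=bool)
    E[np.repeat(np.arange(n), k), nearest.ravel()] = True
    return E | E.T                                # keep an edge if either end selects it

def base_graphs(X, k=2, sigma=1.0):
    """Weight matrices of four base graphs; assumes nonzero feature vectors."""
    G = X @ X.T                                   # dot products x_i^T x_j
    sq = np.diag(G)                               # squared norms
    d2 = np.maximum(sq[:, None] - 2.0 * G + sq[None, :], 0.0)   # Eq. (4)
    norms = np.sqrt(sq)
    cosine = G / np.outer(norms, norms)           # Eq. (7)
    tanimoto = G / (sq[:, None] + sq[None, :] - G)  # Eq. (11)
    E_dist = knn_edges(-d2, k)                    # neighbours by Euclidean distance
    return {
        # ASSUMPTION: Gaussian weight exp(-d^2 / (2 sigma^2)) on the edges
        "gaussian": np.where(E_dist, np.exp(-d2 / (2.0 * sigma ** 2)), 0.0),
        "dot": np.where(E_dist, G, 0.0),          # Eq. (6)
        # ASSUMPTION: the similarity value itself is used as the edge weight
        "cosine": np.where(knn_edges(cosine, k), cosine, 0.0),
        "tanimoto": np.where(knn_edges(tanimoto, k), tanimoto, 0.0),
    }

def laplacian(W):
    """Graph Laplacian L = D - W with D_ii = sum_j W_ij."""
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
X = rng.random((6, 32))                           # placeholder 32-D feature vectors
graphs = base_graphs(X, k=2, sigma=1.0)
laplacians = {name: laplacian(W) for name, W in graphs.items()}
```

Repeating the construction for several values of k and σ would give one candidate Laplacian per model and parameter setting.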
With so many possible choices of graphs, the most suitable graph with its parameters for the protein domain ranking task is often not known in advance; thus, an exhaustive search on a predefined pool of graphs is necessary. When the size of the pool becomes large, an exhaustive search will be quite time-consuming and sometimes not possible. Hence, a method for efficiently learning an appropriate graph, so that the performance of the employed graph-based ranking method is robust or even improved, is crucial for graph regularized ranking. To tackle this problem we propose a multiple graph regularized ranking framework that provides a series of initial guesses of the graph Laplacian and combines them to approximate the intrinsic manifold in a conditionally optimal way, inspired by a previously reported method[14].
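Given a set of M candidate graph Laplacians $\mathcal{T}=\{L_1,\cdots,L_M\}$, the combined graph Laplacian is their linear combination, $L=\sum_{m=1}^{M}\mu_m L_m$(13)
where $\mu_m$ is the weight of the mth graph. To avoid any negative contribution, we further constrain $\sum_{m=1}^{M}\mu_m=1,\ \mu_m\ge 0.$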
where μ = [μ_{1},⋯,μ_{ M }]^{⊤} is the graph weight vector.
Offline supervised multiple graph learning
To avoid the parameter μ overfitting to one single graph, we also introduce the $l_2$ norm regularization term $\|\mu\|^2$ to the objective function. The difference between $f_q$ and $y_q$ should be noted: $y_q\in\{1,0\}^{N}$ plays the role of the given ground truth in the supervised learning procedure, while $f_q\in\mathbb{R}^{N}$ is the variable to be solved. While $y_q$ is the ideal solution for $f_q$, it is not always achieved after the learning. Thus, we introduce the first term in (16) to make $f_q$ as similar to $y_q$ as possible during the learning procedure.
Objective function
where F =[f_{1},⋯,f_{ N }] is the ranking score matrix with the qth column as the ranking score vector of qth protein domain, and Y =[y_{1},⋯,y_{ N }] is the relevance matrix with the qth column as the relevance vector of the qth protein domain.
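For reference, an objective that is consistent with both the derivative in (18) and the reduced problem in (19) would be the following. It is a reconstruction from those two equations rather than a quotation, with $\|\cdot\|_F$ denoting the Frobenius norm and β the weight of the $l_2$ regularization term.

$$
\min_{F,\mu}\; O(F,\mu)=\|F-Y\|_F^{2}
+\alpha\sum_{m=1}^{M}\mu_m\,\mathrm{Tr}\!\left(F^{\top}L_mF\right)
+\beta\|\mu\|^{2}
\quad\text{s.t.}\quad \sum_{m=1}^{M}\mu_m=1,\;\mu_m\ge 0
$$

Differentiating the first two terms with respect to F gives $2(F-Y)+2\alpha\sum_{m}\mu_mL_mF$ for symmetric $L_m$, and dropping the first term, which does not depend on μ, leaves the problem in (19).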
Optimization
Because direct optimization of (17) is difficult, we instead adopt an iterative, two-step strategy to alternately optimize F and μ. At each iteration, either F or μ is optimized while the other is fixed, and then the roles are switched. Iterations are repeated until a maximum number of iterations is reached.

Optimizing F: By fixing μ, the analytic solution for (17) can be easily obtained by setting the derivative of O(F,μ) with respect to F to zero. That is, $\begin{array}{rcl}\frac{\partial O(F,\mu)}{\partial F} & = & 2(F-Y)+2\alpha\sum_{m=1}^{M}\mu_m\left(L_mF\right)=0\\ F & = & \left(I+\alpha\sum_{m=1}^{M}\mu_mL_m\right)^{-1}Y\end{array}$(18)

Optimizing μ: By fixing F and removing items irrelevant to μ from (17), the optimization problem (17) is reduced to, $\begin{array}{rl}\underset{\mu}{\min} & \alpha\sum_{m=1}^{M}\mu_m\mathrm{Tr}\left(F^{\top}L_mF\right)+\beta\|\mu\|^{2}\\ & =\alpha\sum_{m=1}^{M}\mu_me_m+\beta\sum_{m=1}^{M}\mu_m^{2}\\ & =\alpha e^{\top}\mu+\beta\mu^{\top}\mu\\ \text{s.t.} & \sum_{m=1}^{M}\mu_m=1,\quad\mu_m\ge 0.\end{array}$(19)
where e_{ m } = Tr(F^{⊤}L_{ m }F) and e =[e_{1},⋯,e_{ M }]^{⊤}. The optimization of (19) with respect to the graph weight μ can then be solved as a standard quadratic programming (QP) problem[4].
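A short worked step, offered as an aside rather than as the authors' procedure: for β > 0, completing the square shows that this particular QP is a Euclidean projection onto the probability simplex, so a generic QP solver is one option but not the only one.

$$
\alpha e^{\top}\mu+\beta\mu^{\top}\mu
=\beta\left\|\mu+\frac{\alpha}{2\beta}e\right\|^{2}-\frac{\alpha^{2}}{4\beta}e^{\top}e
$$

The last term does not depend on μ, so minimizing (19) under $\sum_{m}\mu_m=1,\ \mu_m\ge 0$ amounts to finding the point of the simplex closest to $-\frac{\alpha}{2\beta}e$. Graphs with a small smoothness cost $e_m$ therefore receive larger weights, and a larger β spreads the weight over more graphs.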
Offline algorithm
The offline μ learning algorithm is summarized as Algorithm 1.
Algorithm 1.
MultiGRank: offline graph weights learning algorithm.
Require: Candidate graph Laplacians set$\mathcal{T}$;
Require: SCOP type label set of database protein domains$\mathcal{L}$;
Require: Maximum iteration number T;
Construct the relevance matrix $Y=[y_{iq}]^{N\times N}$ where $y_{iq}=1$ if $l_i=l_q$, 0 otherwise;
Initialize the graph weights as $\mu_m^{0}=\frac{1}{M}$, m = 1,⋯,M;
for t = 1,⋯,T do
Update the ranking score matrix $F^{t}$ according to the previous $\mu_m^{t-1}$ by (18);
Update the graph weight $\mu^{t}$ according to the updated $F^{t}$ by (19);
end for
Output graph weight $\mu=\mu^{t}$.
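Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the default values of α, β and the iteration count are hypothetical, and the μ update is computed by a sort-based Euclidean projection onto the simplex instead of a general QP solver. The substitution relies on β > 0, in which case minimizing α e^⊤μ + β μ^⊤μ over the simplex is the same as projecting −αe/(2β) onto it.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {mu : sum(mu) = 1, mu >= 0} (sort-based)."""
    u = np.sort(v)[::-1]                          # entries in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]      # last index that stays positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def learn_graph_weights(laplacians, labels, alpha=1.0, beta=1.0, max_iter=10):
    """Alternate the F update (18) and the mu update (19) for max_iter rounds."""
    labels = np.asarray(labels)
    n, M = labels.size, len(laplacians)
    Y = (labels[:, None] == labels[None, :]).astype(float)   # y_iq = 1 if l_i == l_q
    mu = np.full(M, 1.0 / M)                                 # uniform initial weights
    for _ in range(max_iter):
        L = sum(m * Lm for m, Lm in zip(mu, laplacians))     # combined Laplacian
        F = np.linalg.solve(np.eye(n) + alpha * L, Y)        # Eq. (18)
        e = np.array([np.trace(F.T @ Lm @ F) for Lm in laplacians])
        # ASSUMPTION: Eq. (19) solved by simplex projection (requires beta > 0)
        mu = project_to_simplex(-alpha * e / (2.0 * beta))
    return mu, F

# Hypothetical toy data: the first graph links domains of the same fold, the second does not.
labels = [0, 0, 1, 1]
W_good = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
W_bad = np.array([[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
laps = [np.diag(W.sum(axis=1)) - W for W in (W_good, W_bad)]
mu, F = learn_graph_weights(laps, labels, alpha=1.0, beta=1.0, max_iter=5)
```

On this toy input the graph whose edges agree with the fold labels ends up with the larger weight, which is the behaviour the supervised weight learning is meant to produce.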
Online ranking regularized by multiple graphs
Given a newly discovered protein domain submitted by a user as query $x_0$, its SCOP type label $l_0$ will be unknown and the domain will not be in the database $\mathcal{D}=\{x_1,\cdots,x_N\}$. To compute the ranking scores of $x_i\in\mathcal{D}$ to query $x_0$, we extend the size of the database to N + 1 by adding $x_0$ into the database and then solve the ranking score vector for $x_0$, which is defined as $\mathbf{f}=[f_0,\cdots,f_N]\in\mathbb{R}^{N+1}$, using (3). The parameters in (3) are constructed as follows:

Laplacian matrix L: We first compute the M graph weight matrices $\{W_m\}_{m=1}^{M}$, $W_m\in\mathbb{R}^{(N+1)\times(N+1)}$, with their corresponding Laplacian matrices $\{L_m\}_{m=1}^{M}$, $L_m\in\mathbb{R}^{(N+1)\times(N+1)}$, for the extended database $\{x_0,x_1,\cdots,x_N\}$. Then, with the graph weight μ learned by Algorithm 1, the new Laplacian matrix L can be computed as in (13).
Online graph weight computation: When a new query $x_0$ is added to the database, we calculate its K nearest neighbors in the database $\mathcal{D}$ and the corresponding weights $W_{0j}$ and $W_{j0}$, j = 1,⋯,N. If adding this new query to the database does not affect the graph in the database space, the neighbors and weights $W_{ij}$, i,j = 1,⋯,N for the protein domains in the database are fixed and can be precomputed offline. Thus, we only need to compute N edge weights for each graph instead of (N + 1) × (N + 1).

Relevance vector y: The relevance vector for $x_0$ is defined as $\mathbf{y}=[y_0,\cdots,y_N]^{\top}\in\{1,0\}^{N+1}$ with only $y_0=1$ known and $y_i$, i = 1,⋯,N unknown.

Matrix U: In this situation, U is an (N + 1)×(N + 1) diagonal matrix with $U_{00}=1$ and $U_{ii}=0$, i = 1,⋯,N.
The online ranking algorithm is summarized as Algorithm 2.
Algorithm 2.
MultiGRank: online ranking algorithm.
Require: protein domain database$\mathcal{D}=\{{x}_{1},\cdots \phantom{\rule{0.3em}{0ex}},{x}_{N}\}$;
Require: Query protein domain x_{0};
Require: Graph weight μ;
Extend the database to size (N + 1) by adding $x_0$ and compute M graph Laplacians of the extended database;
Obtain the multiple graph Laplacian L by linear combination of the M graph Laplacians with weight μ as in (13);
Construct the relevance vector $\mathbf{y}\in\mathbb{R}^{(N+1)}$ where $y_0=1$ and the diagonal matrix $U\in\mathbb{R}^{(N+1)\times(N+1)}$ with $U_{ii}=1$ if i = 0 and 0 otherwise;
Solve the ranking vector f for $x_0$ as in (20);
Rank the protein domains in $\mathcal{D}$ according to ranking scores f in descending order.
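The online steps can be sketched as follows, assuming the Laplacians of the extended database have already been computed and that index 0 is the query. The function name `rank_query`, the toy graphs and the weight vector are hypothetical. One deviation from the listing is deliberate and labeled in the code: with $U_{00}=1$ as the only nonzero entry and a connected graph, the constant vector satisfies (U + αL)f = Uy exactly, so all scores would be equal. The sketch therefore defaults to U = I with unknown relevances set to 0, which mirrors the offline update (18); the listed U can still be passed through `u_diag`.

```python
import numpy as np

def rank_query(laplacians_ext, mu, alpha=1.0, u_diag=None):
    """Score the extended database (query at index 0) and rank database domains 1..N."""
    n1 = laplacians_ext[0].shape[0]
    L = sum(m * Lm for m, Lm in zip(mu, laplacians_ext))   # combined Laplacian, as in (13)
    y = np.zeros(n1)
    y[0] = 1.0                                             # only the query is known relevant
    if u_diag is None:
        u_diag = np.ones(n1)                               # ASSUMPTION: U = I (see lead-in)
    U = np.diag(u_diag)
    f = np.linalg.solve(U + alpha * L, U @ y)
    order = 1 + np.argsort(-f[1:])                         # database indices, best first
    return f, order

# Hypothetical extended database: query 0 strongly linked to 1, weakly to the pair (2, 3).
W_path = np.array([[0.0, 1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.1, 0.0],
                   [0.0, 0.1, 0.0, 1.0],
                   [0.0, 0.0, 1.0, 0.0]])
W_full = np.ones((4, 4)) - np.eye(4)                       # uninformative complete graph
laps_ext = [np.diag(W.sum(axis=1)) - W for W in (W_path, W_full)]
mu = np.array([1.0, 0.0])                                  # stand-in for weights from Algorithm 1
f, order = rank_query(laps_ext, mu, alpha=1.0)
```

Because the database-to-database weights do not change when a query is added, only the row and column for index 0 of each weight matrix would need to be recomputed per query in a real system.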
Protein domain database and query set
We used the SCOP 1.75A database[21] to construct the database and query set. In the SCOP 1.75A database, there are 49,219 protein domain PDB entries and 135,643 domains, belonging to 7 classes and 1,194 SCOP fold types.
Protein domain database
Query set
We also randomly selected 540 protein domains from the SCOP 1.75A database to construct a query set. For each query protein domain that we selected, we ensured that there was at least one protein domain belonging to the same SCOP fold type in the ASTRAL SCOP 1.75A 40% database, so that for each query there was at least one "positive" sample in the protein domain database. However, it should be noted that the 540 protein domains in the query data set were randomly selected and do not necessarily represent 540 different folds. Here we call our query set the 540 query dataset because it contains 540 protein domains from the SCOP 1.75A database.
Evaluation metrics
By varying the length of the returned list, different TPR, FPR, recall and precision values are obtained.
ROC curve
Using FPR as the abscissa and TPR as the ordinate, the ROC curve can be plotted. For a high-performance ranking system, the ROC curve should be as close to the top-left corner as possible.
Recall-precision curve
Using recall as the abscissa and precision as the ordinate, the recall-precision curve can be plotted. For a high-performance ranking system, this curve should be close to the top-right corner of the plot.
AUC
The AUC is computed as a single-figure measurement of the quality of an ROC curve. AUC is averaged over all the queries to evaluate the performances of different ranking methods.
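These metrics can be computed for a single query as sketched below. The sketch assumes the standard definitions (TPR = recall = TP/(TP + FN), FPR = FP/(FP + TN), precision = TP/(TP + FP)), evaluates them at every length of the returned list, and integrates the ROC curve with the trapezoid rule. The function names and the four-domain example are hypothetical.

```python
import numpy as np

def ranking_curves(scores, relevant):
    """TPR (= recall), FPR and precision for every length of the returned list."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # best score first
    rel = np.asarray(relevant, dtype=bool)[order]
    tp = np.cumsum(rel)                 # relevant domains among the top-k
    fp = np.cumsum(~rel)                # irrelevant domains among the top-k
    tpr = tp / rel.sum()
    fpr = fp / (~rel).sum()
    precision = tp / (tp + fp)
    return tpr, fpr, precision

def roc_auc(tpr, fpr):
    """Area under the ROC curve by the trapezoid rule, starting from the point (0, 0)."""
    x = np.concatenate(([0.0], fpr))
    y = np.concatenate(([0.0], tpr))
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

# Hypothetical query: four database domains, two of them from the query's fold.
scores = [0.9, 0.8, 0.7, 0.6]
relevant = [1, 0, 1, 0]
tpr, fpr, precision = ranking_curves(scores, relevant)
auc_value = roc_auc(tpr, fpr)           # 0.75 for this toy ranking
```

Averaging `auc_value` over all queries gives the single figure reported per method.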
Results and discussion
We first compared our MultiGRank against several popular graph-based ranking score learning methods for ranking protein domains. We then evaluated the ranking performance of MultiGRank against other protein domain ranking methods using different protein domain comparison strategies. Finally, a case study of a TIM barrel fold is described.
Comparison of MultiGRank against other graph-based ranking methods
The figure shows the ROC and the recall-precision curves obtained using the different graph ranking methods. As can be seen, the MultiGRank algorithm significantly outperformed the other graph-based ranking algorithms; the precision difference got larger as the recall value increased and then tended to converge as the precision tended towards zero (Figure 2(b)). The GRank algorithm outperformed GT in most cases; however, both GRank and GT were much better than the pairwise ranking, which neglects the global distribution of the protein domain database.
AUC results of different graph-based ranking methods

| Method | AUC |
| --- | --- |
| MultiGRank | 0.9730 |
| GRank | 0.9575 |
| GT | 0.9520 |
| PairwiseRank | 0.9478 |
 1. GRank and GT produced similar performances on our protein domain database, indicating that there is no significant difference in the performance of the graph transduction based or graph regularization based single graph ranking methods for unsupervised learning of the ranking scores.
 2. Pairwise ranking produced the worst performance even though the method uses a carefully selected similarity function as reported in [3]. One reason for the poorer performance is that similarity computed by pairwise ranking is focused on detecting statistically significant pairwise differences only, while more subtle sequence similarities are missed. Hence, the variance among different fold types cannot be accurately estimated when the global distribution is neglected and only the protein domain pairs are considered. Another possible reason is that pairwise ranking usually produces a better performance when there is only a small number of protein domains in the database; therefore, because our database contains a large number of protein domains, the ranking performance of the pairwise ranking method was poor.
 3. MultiGRank produced the best ranking performance, implying that both the discriminant and geometrical information in the protein domain database are important for accurate ranking. In MultiGRank, the geometrical information is estimated by multiple graphs and the discriminant information is included by using the SCOP-fold type labels to learn the graph weights.
Comparison of MultiGRank with other protein domain ranking methods
AUC results for different protein domain ranking methods

| Method | AUC |
| --- | --- |
| MultiGRank | 0.9730 |
| IR Tableau | 0.9478 |
| YAKUSA | 0.9537 |
| SHEBA | 0.9421 |
| QP tableau | 0.9364 |
The results in Table 2 show that, with the advantage of exploring data characteristics from various graphs, MultiGRank can achieve significant improvements in the ranking outcomes; in particular, AUC increases from 0.9478 to 0.9730 with MultiGRank, which uses the same Tableau feature as IR Tableau. MultiGRank also outperforms QP Tableau, SHEBA, and YAKUSA: AUC improves from 0.9364, 0.9421 and 0.9537, respectively, to 0.9730 with MultiGRank. Furthermore, because of its better use of effective protein domain descriptors, IR Tableau outperforms QP Tableau.
To evaluate the effect of using protein domain descriptors for ranking instead of direct protein domain structure comparisons, we compared IR Tableau with YAKUSA and SHEBA. The main difference between them is that IR Tableau considers both protein domain feature extraction and comparison procedures, while YAKUSA and SHEBA compare only pairs of protein domains directly. The quantitative results in Table 2 show that, even by using the additional information from the protein domain descriptor, IR Tableau does not outperform YAKUSA.
This result strongly suggests that ranking performance improvements are achieved mainly by graph regularization and not by using the power of a protein domain descriptor.
Plots of TPR versus FPR obtained using MultiGRank and various field-specific protein domain ranking methods as the ranking algorithms are shown in Figure 3(a), and the recall-precision curves obtained using them are shown in Figure 3(b). As can be seen from the figure, in most cases, our MultiGRank algorithm significantly outperforms the other protein domain ranking algorithms. The performance differences get larger as the length of the returned protein domain list increases. The YAKUSA algorithm outperforms SHEBA, IR Tableau and QP Tableau in most cases. When only a few protein domains are returned to the query, the numbers of both true positive and false positive samples are small, so in this case all the algorithms yield low FPR and TPR. As the number of returned protein domains increases, the TPR of all of the algorithms increases. However, MultiGRank tends to converge when the FPR is more than 0.3, whereas the other ranking algorithms seem to converge only when the FPR is more than 0.5.
Case Study of the TIM barrel fold
Besides considering the results obtained for the whole database, we also studied an important protein fold, the TIM beta/alpha-barrel fold (c.1). The TIM barrel is a conserved protein fold that consists of eight α-helices and eight parallel β-strands that alternate along the peptide backbone[22]. TIM barrels are one of the most common protein folds. In the ASTRAL SCOP 1.75A 40% database, there are a total of 373 proteins, belonging to 33 different superfamilies and 114 families, that have TIM beta/alpha-barrel SCOP fold type domains. In this case study, the TIM beta/alpha-barrel domains from the query set were used to rank all the protein domains in the database. The ranking was evaluated both at the fold level of the SCOP classification and at lower levels of the SCOP classification (i.e., superfamily level and family level). To evaluate the ranking performance, we defined "true positives" at three levels:
Fold level
When the returned database protein domain is from the same fold type as the query protein domain.
Superfamily level
When the returned database protein domain is from the same superfamily as the query protein domain.
Family level
When the returned database protein domain is from the same family as the query protein domain.
Conclusion
The proposed MultiGRank method introduces a new paradigm to fortify the broad scope of existing graph-based ranking techniques. The main advantage of MultiGRank lies in its ability to represent the learning of a unified space of ranking scores for a protein domain database in multiple graphs. Such flexibility is important in tackling complicated protein domain ranking problems because it allows more prior knowledge to be explored for effectively analyzing a given protein domain database, including the possibility of choosing a proper set of graphs to better characterize diverse databases, and the ability to adopt a multiple graph-based ranking method to appropriately model relationships among the protein domains. Here, MultiGRank has been evaluated comprehensively on a carefully selected subset of the ASTRAL SCOP 1.75A protein domain database. The promising experimental results that were obtained further confirm the usefulness of our ranking score learning approach.
Declarations
Acknowledgements
The study was supported by grants from National Key Laboratory for Novel Software Technology, China (Grant No. KFKT2012B17), 2011 Qatar Annual Research Forum Award (Grant No. ARF2011), and King Abdullah University of Science and Technology (KAUST), Saudi Arabia. We appreciate the valuable comments from Prof. Yuexiang Shi, Xiangtan University, China.
Authors’ Affiliations
References
 1. Zhang Y, Sun Y: HMM-FRAME: accurate protein domain classification for metagenomic sequences containing frameshift errors. BMC Bioinformatics 2011, 12: 198. 10.1186/1471-2105-12-198
 2. Ochoa A, Llinas M, Singh M: Using context to improve protein domain identification. BMC Bioinformatics 2011, 12: 90. 10.1186/1471-2105-12-90
 3. Zhang L, Bailey J, Konagurthu AS, Ramamohanarao K: A fast indexing approach for protein structure comparison. BMC Bioinformatics 2010, 11(Suppl 1):S46. 10.1186/1471-2105-11-S1-S46
 4. Stivala A, Wirth A, Stuckey P: Tableau-based protein substructure search using quadratic programming. BMC Bioinformatics 2009, 10: 153. 10.1186/1471-2105-10-153
 5. Stivala AD, Stuckey PJ, Wirth AI: Fast and accurate protein substructure searching with simulated annealing and GPUs. BMC Bioinformatics 2010, 11: 446. 10.1186/1471-2105-11-446
 6. Bai X, Yang X, Latecki LJ, Liu W, Tu Z: Learning Context-Sensitive Shape Similarity by Graph Transduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 2010, 32(5):861–874.
 7. Carpentier M, Brouillet S, Pothier J: YAKUSA: A fast structural database scanning method. Proteins-Structure Function and Bioinformatics 2005, 61(1):137–151. 10.1002/prot.20517
 8. Jung J, Lee B: Protein structure alignment using environmental profiles. Protein Engineering 2000, 13(8):535–543. 10.1093/protein/13.8.535
 9. Lee B, Lee D: Protein comparison at the domain architecture level. BMC Bioinformatics 2009, 10(Suppl 15):S5. 10.1186/1471-2105-10-S15-S5
 10. Weston J, Kuang R, Leslie C, Noble W: Protein ranking by semi-supervised network propagation. BMC Bioinformatics 2006, 7(Suppl 1):S10. 10.1186/1471-2105-7-S1-S10
 11. Cai D, He X, Han J, Huang TS: Graph Regularized Nonnegative Matrix Factorization for Data Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2011, 33(8):1548–1560.
 12. Varma S, Simon R: Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 2006, 7: 91. 10.1186/1471-2105-7-91
 13. Nagel K, Jimeno-Yepes A, Rebholz-Schuhmann D: Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb. BMC Bioinformatics 2009, 10(Suppl 8):S4. 10.1186/1471-2105-10-S8-S4
 14. Geng B, Tao D, Xu C, Yang L, Hua X: Ensemble Manifold Regularization. IEEE Transactions on Pattern Analysis and Machine Intelligence 2012, 34(6):1227–1233.
 15. You ZH, Yin Z, Han K, Huang DS, Zhou X: A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinformatics 2010, 11: 343. 10.1186/1471-2105-11-343
 16. Song M, Yu H, Han WS: Combining active learning and semi-supervised learning techniques to extract protein interaction sentences. BMC Bioinformatics 2011, 12(Suppl 12):S4. 10.1186/1471-2105-12-S12-S4
 17. Kandaswamy KK, Pugalenthi G, Hazrati MK, Kalies KU, Martinetz T: BLProt: prediction of bioluminescent proteins based on support vector machine and relieff feature selection. BMC Bioinformatics 2011, 12: 345. 10.1186/1471-2105-12-345
 18. Tung CW, Ziehm M, Kaemper A, Kohlbacher O, Ho SY: POPISK: T-cell reactivity prediction using support vector machines and string kernels. BMC Bioinformatics 2011, 12: 446. 10.1186/1471-2105-12-446
 19. Shi JY, Zhang YN: Fast SCOP classification of structural class and fold using secondary structure mining in distance matrix. In Pattern Recognition in Bioinformatics. Proceedings 4th IAPR International Conference, PRIB 2009. ENGLAND: Sheffield; 2009:344–53.
 20. Albatineh AN, Niewiadomska-Bugaj M: Correcting Jaccard and other similarity indices for chance agreement in cluster analysis. Advances in Data Analysis and Classification 2011, 5(3):179–200. 10.1007/s11634-011-0090-y
 21. Chandonia J, Hon G, Walker N, Lo Conte L, Koehl P, Levitt M, Brenner S: The ASTRAL Compendium in 2004. Nucleic Acids Research 2004, 32(SI):D189–D192.
 22. Kim C, Basner J, Lee B: Detecting internally symmetric protein structures. BMC Bioinformatics 2010, 11: 303. 10.1186/1471-2105-11-303
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.