Integrated web service for improving alignment quality based on segments comparison

Plewczynski, Dariusz; Rychlewski, Leszek; Ye, Yuzhen; Jaroszewski, Lukasz; Godzik, Adam

doi:10.1186/1471-2105-5-98

Software
Open access
Published: 22 July 2004


Dariusz Plewczynski^1,2, Leszek Rychlewski^1, Yuzhen Ye^3, Lukasz Jaroszewski^4 & Adam Godzik^3,4

BMC Bioinformatics volume 5, Article number: 98 (2004)

Abstract

Background

Defining the blocks that form the global protein structure on the basis of local structural regularity is a fruitful idea, used extensively in the description of protein structure and in its prediction from sequence information alone. For many years secondary structure elements have been used as building blocks with great success. Specially prepared sets of possible structural motifs can be used to describe similarity between very distant, non-homologous proteins. The reason for using structural information in the description of proteins is straightforward: structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate.

Results

Here we provide a new fragment library for Local Structure Segment (LSS) prediction, called FRAGlib, which is integrated with the previously described segment alignment algorithm SEA. A joint FRAGlib/SEA server provides easy access to both algorithms, offering a one-stop alignment service that uses a novel network matching approach to protein sequence alignment. Used as a secondary structure predictor, FRAGlib achieves only 73% accuracy in the Q3 measure, but when combined with SEA it achieves a significant improvement in pairwise sequence alignment quality compared with the previous SEA implementation and other public alignment algorithms. The FRAGlib algorithm takes ~2 min to search the FRAGlib database for a typical query protein of 500 residues. The SEA service aligns two typical proteins in ~5 min. All supplementary materials (detailed results of all the benchmarks, the list of test proteins and the whole fragment library) are available for download at http://ffas.ljcrf.edu/darman/results/.

Conclusions

The joint FRAGlib/SEA server will be a valuable tool both for molecular biologists working on protein sequence analysis and for bioinformaticians developing computational methods for protein structure prediction and alignment.

Background

Protein structure is clearly modular: similar structural segments, such as alpha helices and beta strands, are found in unrelated proteins. Such segments, identified from structure, are used extensively in the description and analysis of protein structures [1, 2]. Several groups have demonstrated that a small library of segments is sufficient to rebuild experimental protein structures with high accuracy [3]. Predicted local structure segments (PLSSs) are also used in structure prediction, starting with the nearest neighbor approach to secondary structure prediction [4–6]. This idea was later extended and led to even more successful applications of PLSSs in ab initio structure prediction by Baker and colleagues, who developed a library of sequence-structure motifs called I-sites [7]. Those motifs are then assembled into a complete protein structure by the program ROSETTA [8]. Predicted local structure segments are also used in a novel protein alignment algorithm, SEA, which treats the PLSSs of each protein as a network and finds a common path through the networks describing the two proteins [9]. The underlying idea in all these approaches is that, because global folding constraints can override local preferences, the prediction of structure segments from local sequence is necessarily uncertain. Therefore, instead of trying to predict a single correct local structure, all possible local solutions are identified and other constraints (a folded structure in ROSETTA, or a compatible alignment in SEA) are used to identify a globally consistent solution.

Prediction of local structure segments can be approached in two different ways. The first, used in most nearest neighbor secondary structure algorithms, is to use a representative set of proteins with known structure as the source of structure segments, without any restriction on the number or type of segments. This approach makes no assumptions about the composition and distribution of segments in the library and can be compared to unsupervised learning. In the second approach, used for instance in the I-site method, only segments from a specifically constructed fragment library are used in prediction; this approach is similar to supervised learning. Interestingly, some limited tests suggest that the former approach leads to lower prediction accuracy [10]. The same tests suggested that different segment libraries could lead to different predictions and that some segment libraries are likely to be better suited to some tasks.

Following this observation, we have developed FRAGlib, a fragment library specifically designed to complement the segment alignment algorithm SEA. The SEA algorithm was developed previously in our group [9] and was originally used in conjunction with the I-site library. The I-site library [7] was developed for ab initio folding predictions, and anecdotal evidence suggested that it may not be ideally suited for alignment purposes. In this note we describe a combined FRAGlib/SEA server and the first benchmarking results of this method.

Implementation

Database of Short Fragments

FRAGlib is based on the idea of providing uniform coverage of all known types of local structural regularity, with a distribution based on that observed in natural proteins. The collection of segments is constructed from a representative set of proteins from the ASTRAL database [11, 12]. For each protein in this set, each continuous segment with regular secondary structure, including the flanking residues on both sides, is added to FRAGlib (see below for details). We do not apply any further clustering algorithm, so our database contains non-unique entries and is redundant in terms of both structure and sequence information.

Local structure is described by SLSR (Symbolized Local Structures Representation) codes consisting of 11 symbols {HGEeBdbLlxc}, each representing a certain region of backbone dihedral angles (phi and psi) [7, 13]. Protein local structure is described as a string of local-structure symbols, and a local structure segment is defined as a fragment of 5–17 amino acids with constant local structure codes. Segments are then extended by two additional flanking residues, at the beginning and at the end of the segment. We store all such segments with their sequence, their SLSR local structure codes and the homology profile [14, 15] derived from that of their parent protein. The library is highly redundant, i.e. there are many segments with the same structural description, but each redundant fragment comes from a different parent protein (or a different part of the same parent protein) and therefore has a different sequence and a different profile associated with it.
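The segment definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "constant local structure codes" means a run of one repeated SLSR symbol, and the function name extract_segments and the parameter flank are hypothetical. The default of one flanking residue per side is also an assumption, chosen only so that the total lengths fall in the 7 to 19 residue range used in the prediction step.

```python
import re

SLSR_ALPHABET = "HGEeBdbLlxc"  # the 11 symbols listed in the text


def extract_segments(sequence, codes, min_len=5, max_len=17, flank=1):
    """Return (start, end, sequence, codes) for each qualifying run.

    A run is a stretch of one repeated SLSR symbol (an assumption about
    what "constant" means). `flank` residues are added on each side,
    clipped at the chain ends. Coordinates are 0-based, end exclusive.
    """
    if len(sequence) != len(codes):
        raise ValueError("sequence and codes must have the same length")
    segments = []
    for match in re.finditer(r"(.)\1*", codes):
        run_len = match.end() - match.start()
        if not (min_len <= run_len <= max_len):
            continue
        start = max(0, match.start() - flank)
        end = min(len(codes), match.end() + flank)
        segments.append((start, end, sequence[start:end], codes[start:end]))
    return segments
```

Because no clustering is applied, every qualifying run from every parent protein would be kept, which is what makes such a library redundant.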

FRAGlib prediction

In the next step, the FRAGlib segment library is used to assign local structure segments to a new (query) protein based only on sequence information, using a variant of the FFAS profile-profile alignment algorithm [16]. A profile for the query protein is calculated following the FFAS protocol. Then, for all possible overlapping segments of 7 to 19 amino acids, the segment profiles are compared with those of the segments in the FRAGlib database, and the score of each alignment is calculated using an FFAS-like scalar product of the composition vectors at each position. Since the segments being compared have the same length, no dynamic programming alignment is necessary and the score calculation can be highly optimized.
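The gapless comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the FFAS scoring function: a plain dot product of per-position composition vectors stands in for the FFAS-like score, profiles are represented as lists of equal-length numeric vectors, and the names segment_score and scan_query are hypothetical.

```python
def segment_score(query_window, fragment_profile):
    """Sum of per-position dot products of two equal-length profiles."""
    if len(query_window) != len(fragment_profile):
        raise ValueError("profiles must have the same length")
    return sum(
        sum(q * f for q, f in zip(q_col, f_col))
        for q_col, f_col in zip(query_window, fragment_profile)
    )


def scan_query(query_profile, library, min_len=7, max_len=19):
    """Score every query window against each library fragment of that length.

    `library` is a list of (fragment_id, profile) pairs. Returns a list of
    (score, start, fragment_id) tuples; no gaps are considered, so no
    dynamic programming is needed.
    """
    hits = []
    for frag_id, frag_profile in library:
        length = len(frag_profile)
        if not (min_len <= length <= max_len):
            continue
        for start in range(len(query_profile) - length + 1):
            window = query_profile[start:start + length]
            hits.append((segment_score(window, frag_profile), start, frag_id))
    return hits
```

Each window is scored in time proportional to its length, which is why the equal-length restriction makes the search cheap compared with a full profile-profile alignment.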

As a result of this procedure, each position in the query protein can be assigned to any of the possible LSSs in the database, each with a specific score (see Figure 1). Only a reduced set of predicted LSSs, rather arbitrarily limited to the 20 highest scoring segments, is kept for further analysis. This cut-off is the only free parameter of the method and can be set by the user through the Web interface of the server. The Q3 accuracy of FRAGlib used as a secondary structure prediction algorithm, with the prediction based on the single best scoring segment for each position, is 73% on a standard secondary structure prediction benchmark (data not shown). Q3 is the percentage of residues predicted correctly across all three conformational states: helix, strand and coil.
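The cut-off and the Q3 measure can be sketched in a few lines of Python. The hit representation as (score, start, fragment_id) tuples, the function names keep_top and q3, and the one-letter state strings are assumptions made for this illustration; whether the server applies the cut-off per position or per query is not specified here, and the sketch simply keeps the best hits from whatever list it is given.

```python
import heapq


def keep_top(hits, keep=20):
    """Keep the `keep` highest scoring hits, best first."""
    return heapq.nlargest(keep, hits, key=lambda hit: hit[0])


def q3(predicted, observed):
    """Percentage of residues whose three-state label (H, E or C) matches."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For example, a prediction of "HHHECC" against an observed "HHEECC" has five of six residues correct, so q3 returns about 83.3.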

SEA Segment Alignment Approach to Protein Comparison

The principal motivation for developing FRAGlib segment prediction was to further improve alignment quality when comparing distantly related proteins, which is one of the most important problems in the practical application of comparative modeling and fold recognition [17]. To address this problem, we previously developed the SEA algorithm, which compares the networks of predicted local structure segments (PLSSs) of two proteins using a network matching approach. In a previous paper we demonstrated that the SEA algorithm, using the I-site server for PLSS prediction and simple sequence-sequence scoring for segment comparison, produced better alignments than the FFAS profile-profile alignment algorithm and several other alignment tools.

A full description of the SEA algorithm is available in the previous manuscript [9], so only a brief summary is presented here. Every residue in each of the proteins being aligned is described as a vertex in a graph. Two artificial vertices are added to each protein: a source vertex at the very beginning and a sink vertex at the end. Each PLSS is described as an edge between the vertices representing its first and last positions. For some PLSS protocols, parts of the protein may not be covered by any predicted segment, so virtual edges are added between all neighboring residues to form a complete, continuous network. Each assembly of connected PLSSs corresponds to a path in this network. In the next step, the PLSS networks of the two proteins are compared by the SEA algorithm. For each pair of positions i and j, with position i coming from the first protein and position j from the second, all possible segments covering each of the positions must be considered in a combinatorial way and compared to obtain the optimal similarity score. It is not the sequences or secondary structures at the two positions that are compared, but all segments that cover these two positions. This is the main feature that distinguishes SEA from standard pairwise sequence alignment. The computational complexity of SEA is about O(NMC₁C₂), where C₁ and C₂ are the average numbers of segments that cover a position in each protein (the segment coverage). A detailed description of the SEA algorithm, together with benchmark results obtained using PLSS networks calculated by the I-site server, can be found elsewhere [9].
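The network construction summarized above can be sketched as an adjacency map. This illustrates only the graph layout, not the SEA matching step, and makes several assumptions: vertex 0 is the source, vertices 1 to length are residues, vertex length + 1 is the sink, segments are given as 1-based (first, last) pairs, and the names build_plss_network and segment_coverage are hypothetical.

```python
def build_plss_network(length, segments):
    """Return {vertex: set of successor vertices} for one protein.

    Each PLSS becomes an edge from its first to its last residue, and
    virtual edges link every pair of neighboring vertices so that a
    continuous path from source to sink always exists.
    """
    edges = {vertex: set() for vertex in range(length + 2)}
    for first, last in segments:
        if not (1 <= first < last <= length):
            raise ValueError(f"invalid segment ({first}, {last})")
        edges[first].add(last)
    for vertex in range(length + 1):
        edges[vertex].add(vertex + 1)  # virtual edge, includes source and sink
    return edges


def segment_coverage(length, segments):
    """Average number of segments covering a residue (C in the text)."""
    covered = sum(last - first + 1 for first, last in segments)
    return covered / length
```

With about 20 retained segments per position in each protein, the coverage factors multiply the cost of an ordinary pairwise comparison, which is consistent with the reported runtime of several minutes per pair.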

The integrated FRAGlib and SEA server is available at [18]. The FRAGlib database and segment prediction provide the PLSS network for each protein to be aligned, and the SEA algorithm aligns the two networks. Figure 2 presents the flowchart of the integrated web service. Preliminary benchmarks for the FRAGlib/SEA server are presented below. A full paper on the FRAGlib algorithm is in preparation.

Results and Discussion

As a benchmark we use the database of 409 pairs of proteins similar at the family level [19]. Each protein pair shares at least one similar domain as identified by SCOP [20]. Segments coming from proteins of the same SCOP family as the proteins being compared were removed from the PLSS network calculated by FRAGlib. Further analysis of the SEA results also confirmed that memorization is not a problem here, as all the SEA alignments are built predominantly from segments that are not locally optimal.

To evaluate the improvement we use two measures of alignment quality: the classical root mean square deviation (RMSD) and the shift score [1]. The shift score measures the misalignment between a predicted alignment of two proteins and a reference alignment. It ranges from −ε (−0.2 by default) to 1.0, where 1.0 means an identical alignment. RMSD depends on the alignment length and the shift score depends on the reference alignment, so both measures are less than perfect for comparing alignments. In our case we use as the reference the alignment provided by the CE structural method [21]. We chose CE, which is available as a single-file executable for various operating systems, as an example of a purely structural alignment tool. It is a method for fast calculation of pairwise structure alignments, which aligns two protein chains using characteristics of their local geometry as defined by vectors between Cα positions. Heuristics are used to define a set of optimal paths joining aligned fragment pairs, with gaps as needed. The path with the best RMSD is subjected to dynamic programming in order to achieve an optimal alignment. For specific families of proteins, additional characteristics are used to weight the alignment.
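As a reminder of what the RMSD measure computes, the following minimal Python sketch evaluates it for a set of aligned Cα pairs. It assumes that the two coordinate sets have already been superimposed (no optimal superposition is performed here) and that the coordinate lists contain only the aligned residue pairs in matching order; the function name rmsd is used for illustration only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation over paired 3D points.

    Assumes the structures are already superimposed; only the aligned
    pairs are passed in, so the value depends on the alignment length.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

Because the average runs only over aligned pairs, a short alignment covering a well-conserved core can score a lower RMSD than a longer and more informative one, which is the length dependence noted above.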

'Table 1 [see Additional file 5]' compares the quality of the FRAGlib/SEA alignment (identified as SEA_F in the table) with that of the structural alignment prepared with the CE algorithm [21], the SEA algorithm used with I-site segment prediction (SEA_I), the SEA algorithm used with the actual (not predicted) local structure segments (SEA_T), SEA with single predicted local structures (SEA_loc) and a few other publicly available alignment tools. All the results other than the FRAGlib/SEA alignments, as well as the alignment quality evaluation, were adopted from the original SEA manuscript [9]. The results presented in 'Table 1 [see Additional file 5]' show that SEA_F significantly improves the alignment quality compared with all other methods, including SEA_I, bringing it close to (and, in the shift-based quality measure, actually improving on) the SEA algorithm using the actual structure segments.

Table 1 General performance of classical methods for building alignments together with segment alignment algorithm incorporating different local structure diversities.

Conclusions

The benchmarks show that the integrated SEA with FRAGlib (SEA_F) prediction service incorporates the diversity of local structure predictions better than known methods. It also produces more accurate alignments than SEA_I (based on the I-site library) or SEA with single predicted structures (SEA_loc). Comparing these pairwise sequence alignments, we observe that predicted local structure information appears to improve alignment quality. Alignments from SEA using the FRAGlib description of the diversity of local structure predictions have the same quality as alignments using the true local structures derived from known 3D structures (SEA_T).

Availability and requirements

An integrated SEA/FRAGlib server is available at [18]. Both components can be used separately (SEA alignment with arbitrary PLSSs, and FRAGlib for purposes other than segment alignment), but the integrated server provides the complete alignment method for comparing pairs of protein sequences using a network matching algorithm. The fragment library prediction method (FRAGlib) is also available as a separate HTTP server at [22]. The software is freely available to academics. Contact Dariusz Plewczynski (darman@bioinfo.pl) or Adam Godzik (adam@burnham.org) for information on obtaining a local copy of the software.

References

1. Cline M, Hughey R, Karplus K: Predicting reliable regions in protein sequence alignments. Bioinformatics 2002, 18: 306–314. 10.1093/bioinformatics/18.2.306
2. Fischer D, Eisenberg D: Protein fold recognition using sequence-derived predictions. Protein Science 1996, 5: 947–955.
3. Levitt M, Gerstein M: A unified statistical framework for sequence comparison and structure comparison. Proc Natl Acad Sci 1998, 95: 5913–5920. 10.1073/pnas.95.11.5913
4. Yi TM, Lander ES: Protein secondary structure prediction using nearest-neighbor methods. J Mol Biol 1993, 232: 1117–1129. 10.1006/jmbi.1993.1464
5. Rychlewski L, Godzik A: Secondary structure prediction using segment similarity. Protein Engineering 1997, 10: 1143–1153. 10.1093/protein/10.10.1143
6. Xu H, Aurora R, Rose GD, White RH: Identifying two ancient enzymes in archaea using predicted secondary structure alignment. Nature Structural Biology 1999, 6: 750–754. 10.1038/11525
7. Bystroff C, Baker D: Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 1998, 281: 565–577. 10.1006/jmbi.1998.1943
8. Simons KT, Bonneau R, Ruczinski II, Baker D: Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 1999, 37: 171–176. 10.1002/(SICI)1097-0134(1999)37:3+<171::AID-PROT21>3.3.CO;2-Q
9. Ye Y, Jaroszewski L, Li W, Godzik A: A segment alignment approach to protein comparison. Bioinformatics 2003, 19: 742–749. 10.1093/bioinformatics/btg073
10. Godzik A: unpublished personal communication. 2003.
11. Chandonia JM, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: ASTRAL compendium enhancements. Nucleic Acids Research 2002, 30: 260–263. 10.1093/nar/30.1.260
12. Brenner SE, Koehl P, Levitt M: The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research 2000, 28: 254–256. 10.1093/nar/28.1.254
13. I-sites/HMMSTR backbone angle regions [http://www.bioinfo.rpi.edu/~bystrc/hmmstr/rama.html]
14. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999
15. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
16. Rychlewski L, Jaroszewski L, Li W, Godzik A: Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science 2000, 9: 232–241.
17. Elofsson A: A study on protein sequence alignment quality. Proteins 2002, 46: 330–339. 10.1002/prot.10043
18. SEgment Alignment (SEA) server (Protein pairwise alignment based on network matching algorithm) [http://ffas.ljcrf.edu/Servers/sea.html]
19. Jaroszewski L, Li W, Godzik A: Improving the quality of twilight-zone alignments. Protein Science 2001, 9: 1487–1496.
20. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159
21. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 1998, 11: 739–747. 10.1093/protein/11.9.739
22. Fragments Library Tool using profile-profile alignments [http://ffas.ljcrf.edu/Servers/frag.html]

Acknowledgments

This work was supported by the USA grant ("SPAM" GM63208) and BioSapiens project within 6FP EU programme (LHSG-CT-2003-503265).

Author information

Authors and Affiliations

Bioinformatics Laboratory, BioInfoBank Institute, Poznan, Poland
Dariusz Plewczynski & Leszek Rychlewski
Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Poland
Dariusz Plewczynski
The Burnham Institute, La Jolla, USA
Yuzhen Ye & Adam Godzik
Bioinformatics Core JCSG, University of California San Diego, La Jolla, USA
Lukasz Jaroszewski & Adam Godzik

Corresponding author

Correspondence to Dariusz Plewczynski.

Additional information

Authors' contributions

DP designed, implemented, and evaluated the FRAGlib program. The benchmark dataset and programme for aligning two short sequence profiles were provided by LJ. The integration of FRAGlib predictions within SEA network alignment software together with benchmark evaluation of the SEA method was done by YY. AG was responsible for the overall project coordination. All authors have read and approved the final manuscript.

About this article

Plewczynski, D., Rychlewski, L., Ye, Y. et al. Integrated web service for improving alignment quality based on segments comparison. BMC Bioinformatics 5, 98 (2004). https://doi.org/10.1186/1471-2105-5-98

Received: 30 March 2004
Accepted: 22 July 2004
Published: 22 July 2004
DOI: https://doi.org/10.1186/1471-2105-5-98