High throughput profile-profile based fold recognition for the entire human proteome
© McGuffin et al; licensee BioMed Central Ltd. 2006
Received: 22 February 2006
Accepted: 07 June 2006
Published: 07 June 2006
In order to maintain the most comprehensive structural annotation databases we must carry out regular updates for each proteome using the latest profile-profile fold recognition methods. The ability to carry out these updates on demand is necessary to keep pace with the regular updates of sequence and structure databases. Providing the highest quality structural models requires the most intensive profile-profile fold recognition methods running with the very latest available sequence databases and fold libraries. However, running these methods on such a regular basis for every sequenced proteome requires large amounts of processing power.
In this paper we describe and benchmark the JYDE (Job Yield Distribution Environment) system, which is a meta-scheduler designed to work above cluster schedulers, such as Sun Grid Engine (SGE) or Condor. We demonstrate the ability of JYDE to distribute the load of genomic-scale fold recognition across multiple independent Grid domains. We use the most recent profile-profile version of our mGenTHREADER software in order to annotate the latest version of the Human proteome against the latest sequence and structure databases in as short a time as possible.
We show that our JYDE system is able to scale to large numbers of intensive fold recognition jobs running across several independent computer clusters. Using our JYDE system we have been able to annotate 99.9% of the protein sequences within the Human proteome in less than 24 hours, by harnessing over 500 CPUs from 3 independent Grid domains.
This study clearly demonstrates the feasibility of carrying out on demand high quality structural annotations for the proteomes of major eukaryotic organisms. Specifically, we have shown that it is now possible to provide complete regular updates of profile-profile based fold recognition models for entire eukaryotic proteomes, through the use of Grid middleware such as JYDE.
Maintaining databases that contain the most up-to-date, high quality models of the protein structures for entire proteomes is becoming increasingly difficult. Structural annotation databases such as 3D-Genomics, Gene3D, Superfamily and the Genomic Threading Database[4, 5] contain predicted models for all of the proteins encoded within key genomes. The quality of the models deposited in these databases relies on obtaining the closest templates for each target sequence and constructing the best possible sequence to structure alignment. Most of the methods used to construct models currently rely on sequence-profile based searches; however keeping these models up-to-date is problematic for several reasons. The proteome sequences of key organisms are often updated on a monthly basis, the protein structures within the Protein Data Bank  are updated every week and many of the non redundant protein sequence databases , used to construct PSI-BLAST  profiles for example, are updated every single day. It is clear from the LiveBench experiments  that the reliability of fold recognition methods is greatly affected by the templates that are available in a fold library. A hard target sequence, regarded as having an analogous or novel fold one day, may have a homologous template available the next. It is also clear that the quality of sequence profiles is likely to increase with the availability of closer homologues.
Currently the best fold recognition methods are those which employ profile-profile based searching. A number of studies have shown that these methods greatly outperform the sequence-profile based methods which are often used to populate structural annotation databases (see Ohlsen et al. for a recent review) . Clearly the most comprehensive structural annotation for a given proteome would be achieved through more rigorous profile-profile based searches. However, the trade-off for the increased coverage of high confidence annotations is the speed at which predictions can be made. The main added computational overhead is due to the required construction of profiles for every unique target sequence within a proteome.
In this paper we describe our JYDE (Job Yield Distribution Environment) system, a meta scheduler designed to be run above cluster schedulers such as Sun Grid Engine  and Condor  [see Additional File 1]. We demonstrate that JYDE is able to distribute large numbers of intensive fold recognition jobs on demand running across several computer clusters within independent Grid domains. We use the most recent profile-profile version of our mGenTHREADER software  in order to annotate the latest ENSEMBL  version of the Human proteome as quickly as possible. Using our JYDE system to harness over 500 CPUs from 3 independent Grid domains we have been able to annotate 99.9% of the protein sequences within the human proteome in less than 24 hours. This study demonstrates that the prospect of carrying out on demand snapshots of the structural annotations for key eukaryotic organisms is now entirely feasible.
Profile-profile based fold recognition using mGenTHREADER
The most recent mGenTHREADER protocol  was followed for profile-profile based fold recognition. The comparison method used was designed to directly compare PSI-BLAST position-specific scoring matrix (PSSM) scores, and makes use of an optimized heuristic comparison metric. The target PSSM was built using 10 iterations of PSI-BLAST (-j 10), searching the UNIREF100 sequence data bank with low complexity regions, coiled coil regions and transmembrane segments filtered out. The profile-profile scoring scheme we use is essentially based on the dot product of two PSSM vectors X (from the target) and Y (from the template), though where any negative values in the target PSSM are set to zero:
To find the correct alignment parameters (i.e. gap penalties) for this scoring function, the parameters were optimized using a grid search to maximize the sum of model quality for each top hit across a benchmark set of 50 difficult fold recognition targets. Using a standard affine gap penalty term, a gap-opening penalty of 7.4 and extension penalty of 0.47 were found to be optimal.
A non-redundant template fold library for mGenTHREADER was constructed from 6331 representatives from the PDB. All pairs of proteins within the fold library had low sequence homology to one another (FASTA  E > 0.001 and < 30% identity). The profile-profile version of mGenTHREADER was run on the Human proteome sequences downloaded from the ENSEMBL website  (Version NCBI35 from November 2005) using the JYDE pipeline described below. The method was also run on the sequences from the fold library itself in order to estimate the reliability of the output scores (see below).
Sequence-profile based fold recognition using GenTHREADER
The GenTHREADER protocol  was used for sequence-profile based fold recognition. The same procedure was carried out as for mGenTHREADER above using the identical fold library and target sequences.
Calculation of P-values for GenTHREADER and mGenTHREADER fold assignments
It is essential to provide a quantitative measure of the confidence we have in any particular fold assignment from both methods. For this we used an approach similar to that used in a previous study  based on hypothesis testing. We determined the statistical significance of obtaining a fold match with a given score or better when compared to a null model. Our null model is based on the criteria that a match of this score or better has occurred by chance and does not actually signify that the sequence has the specified fold. Clearly the alternative model is that the match score is due to the sequence actually having the given fold.
In more detail, we generated random pairings of sequences which are known not to have the same fold. This was carried out by comparing the TM-score of pairs of proteins within the fold library, using the TM-align method and recommended parameters . Applying GenTHREADER and mGenTHREADER to these sequences provided a score distribution for the null model. Unlike our previous study , which assumed an extreme value distribution for the null distribution, in this study we fitted a generic density model based on a Gaussian kernel using the R software . This was found to more accurately reflect particular features of our current data. We were then able to determine the statistical significance of any score using a one-sided test based on this distribution. The p-value gives the proportion of non-matching folds which, on average, would be incorrectly assigned as matches. The coverage of sequences with assigned structures was determined within the Human proteomes using standard p-value cut-offs (< 0.001, < 0.01 and < 0.05).
The JYDE pipeline for high throughput structural annotation
The JYDE (Job Yield Distribution Environment) software package was developed as part of the e-Protein project (e-Protein.org) in order to distribute structural annotation jobs across multiple co-operating processing clusters at independent Grid domains. JYDE consists of a meta-scheduler that works above cluster schedulers, such as SGE or Condor, and supports multiple submission front-ends. The software currently in use at UCL has a Bioinformatics tailored web interface which is powered by a Tomcat servlet. The servlet allows authenticated users to upload a proteome sequence file via a web form and then prepares the data for processing by bioinformatics software such as mGenTHREADER. The proteome sequence files are subdivided into smaller files which are then passed through to the Portal. The Portal maintains the job queues with different priorities for different users and projects.
The Portal requests permission for job execution from the Distribution Manager (DM). The Distribution Manager on the submission server is part of a peer-to-peer network with the DMs at other Grid sites and attempts to balance the load across them. The DM issues permits to the Portal which instructs the portal to execute a particular job at a particular site. The Portal has different modules to support communication with different kinds of clusters, e.g. a specific pluggable module talks to sites which are running SGE6 or Condor. This module submits the jobs to the cluster, reports on their status and returns the results when they finish.
JYDE hardware setup
Three clusters were used within three independent Grid domains at UCL and Imperial. At the peak of the experiment 148 Pentium IIIs (1.3 GHz) were available at the UCL Computer Science domain (cs.ucl.ac.uk), 243 Opterons (various speeds) were available at the Imperial LESC domain (lesc.imperial.ac.uk) and 192 Xeons (2.8 GHz) were available at the UCL Information Systems domain (ccc.ucl.ac.uk). In total 515 CPUs were available to carry out the structural annotations.
We arranged to have increased access to each of the clusters between the 2nd and 4th December 2005. Exclusive access to the cluster at the UCL Computer Science domain was obtained throughout the experiment. We had increased priority to the cluster at the Imperial LESC domain and restrictions were imposed on the job lengths of other users. At the UCL Information Systems domain restrictions were put in place to prevent any other users from submitting jobs during the experiment.
JYDE software setup
Three components of JYDE were installed on the submission web server including: the Bioinformatics front end to the Portal (Tomcat servlet), the Portal itself and the Distribution Manager.
Installing our software on the clusters was relatively straightforward and merely required a standard user account to be setup on the front end machine. A tar file containing the binaries for our mGenTHREADER software was uploaded into the NFS-shared home directory on each cluster. Java was also installed in the home directory where it was found to be unavailable. Public keys were setup in order to enable the submission web server to SSH into each clusters and submit jobs. This SSH connection also enabled the clusters to transfer results files back to the main data store on the file server. Finally we configured the mGenTHREADER shell scripts for each cluster. These scripts were run by SGE on each node, which in turn called the mGenTHREADER binaries. The SGE 'qsub' command was run on the front end machine with the shell script and input parameters.
The mGenTHREADER program required read-access to large database files containing up-to-date sequence and structure data. An rsync through SSH was setup to keep these files updated on each cluster.
Data flow during rapid annotation of the human proteome
We downloaded the latest available human proteome sequence file from ENSEMBL (version NCBI35, November 2005) and submitted it for annotation using mGenTHREADER via the web interface.
The human proteome sequence file contained 32010 unique sequences. For reasons of efficiency, we chose to allocate 5 sequences to each job so that JYDE would run a total of 6402 jobs. The parameters were passed to the Portal, the input file was pre-processed and subdivided into 6402 smaller files.
The Portal queried the Distribution Manager (DM) for permits to run 6402 jobs. The DM initially issued a few hundred permits to the Portal in order to fill each cluster with jobs. Each permit specified the identity of a cluster where the Portal was allowed to run a job. For each of the 6402 waiting jobs, the Portal matched the job up to a permit and then submitted a permit to a specified cluster. For each job the input data file, containing 5 protein sequences, was then uploaded to the cluster via SSH. Since the clusters in this experiment were all running SGE, the job was then submitted using the 'qsub' command via an SSH connection to the cluster front end.
As each of the jobs ran, log files were written containing information about the job status. The log files were then transferred back to the central data store on the file server. The status of each run could then be queried using the web interface on the web server. The Tomcat servlet was able to read the log files and display which jobs were queued, running or completed on each of the clusters.
The DM worked by constantly checking the lengths of the wait queues at each site. When a queue on a particular site fell below a certain threshold, new permits were issued for that site, the Portal would then submit more jobs to that site. The aim of this strategy was to keep every CPU at every site running jobs and to keep a few jobs waiting at each site at any time, but not so many that it would hinder the DM's ability to make scheduling decisions. Prior to running, the majority of jobs were queued in Portal's queue on submission web server and not in the SGE cluster queues on the Grid sites.
Throughput of human proteome annotation
CPU usage at independent Grid domains
The tailing-off phase in CPU usage is clearly indicated on the plot (Figure 4). This occurs when the number of CPUs exceeds the number of jobs left to run. Towards the end of the run the last remaining jobs were redirected to the fastest cluster (lesc.imperial.ac.uk) in attempt to minimize the tailing-off effect.
Comparison of GenTHREADER and mGenTHREADER: estimated coverage of the human proteome
Comparison of GenTHREADER and mGenTHREADER: processing benchmarks
Processing benchmarks for GenTHREADER versus mGenTHREADER. The processing time required for the sequence-profile method GenTHREADER is compared to the profile-profile version of mGenTHREADER. The data was collected by running both methods on sequences from the Human proteome using 148 identical CPUs (1.3 GHz Pentium IIIs from the cs.ucl.ac.uk domain) over a period of 10 hours. The throughput was measured as the mean number of sequences with structural annotations per hour.
Mean sequences per hour on 148 identical CPUs
Mean sequences per hour on 1 CPU
Mean time for 1 Human sequence on 1 CPU (min)
Estimated time for all Human sequences on 1 CPU (days)
Analysis of latest available templates
The PDB  is updated on a weekly basis. During the week that the fold library was constructed (2nd November, 2005), 15 new structures became available which had no detectable sequence homology to known templates within the fold library. These structures have the following PDB codes and chain identifiers: 2esnA, 2es7A, 2bryA, 2beiA, 2bduA, 2bdqA, 2axwA, 2auaA, 2aneA, 2ahuA, 1z7aA, 1ytlA, 1ys4A, 1ylqA, 1on1A, 1wp7A.
These latest template structures were assigned to 211 protein sequences within the Human proteome using mGenTHREADER. Of these 211 sequences, 23 had new assignments which were estimated to be significant (p < 0.05).
In this study we have provided a proof of concept for carrying out on demand profile-profile fold recognition for large eukaryotic proteomes. We have developed and benchmarked Grid middleware in the form of a meta-scheduler called JYDE [see Additional File 1]. This first version of JYDE has been designed specifically for distributing fold recognition software such as mGenTHREADER for large-scale structural annotations but it is also easily extensible to other bioinformatics applications.
Profile-profile based methods for fold recognition are able to detect more remote homologues with higher confidence than can be found with sequence-profile based methods . Figure 5 clearly demonstrates the advantage of using the profile-profile version of mGenTHREADER for annotation of the Human proteome over the sequence-profile method GenTHREADER. There have been many studies on the advantages of profile-profile methods over sequence-profile methods since the first method developed by Rychlewski and colleagues . In a recent review, Ohlson et al.  observed that profile-profile methods performed at least 30% better than standard sequence-profile methods both in alignment quality and in their ability to recognize distantly related proteins. This observation is reflected in the difference in estimated coverage of significant fold assignments obtained by mGenTHREADER over GenTHREADER (Figure 5).
It is important that structural annotation of whole proteomes can be updated on demand in order to ensure that the most accurate models are obtained for every sequence. New fold templates are released on a weekly basis and new sequences appear daily. The example model in Figure 6 was constructed from one of 15 new templates which were only made available during the week that the fold library was constructed. This fold was assigned to a domain of a human protein where no structural template was previously available. So why not just align sequences to the new structures incrementally, as and when they come out? On the surface this may appear to be a pragmatic approach; however the sequence databases from which profiles are constructed are also updated frequently. This means that the ranking of models may change as more accurate sequence to structure alignments can be made. This would, in turn, mean that model rankings from week-to-week could not be accurately compared against one another. Additionally, versions of proteome sequences are often updated on a monthly basis which would further complicate updates, were they to be carried out incrementally. Finally, other structure prediction applications which make use of sequence profiles would also benefit from continual updates. For example Przybylski et al.  observed that secondary structure prediction improves as sequence databases increase in size. Secondary structure alignment scores using predicted secondary structures have been an important feature of many fold recognition methods . It is clear from the LiveBench  assessments that the highest quality fold recognition models are produced by profile-profile based methods which maintain both continually updated fold libraries and sequence databases.
The trade off for increased accuracy is the speed at which predictions can be made. The profile-profile methods such as mGenTHREADER are significantly more CPU intensive than the sequence-profile methods such as GenTHREADER (Table 1). However, through the development of JYDE we have provided a Grid platform to enable proteome wide profile-profile fold recognition to be carried out on demand. The system could be easily scaled up to include more clusters and is extensible to other high throughput CPU intensive bioinformatics applications.
While it may not be necessarily economic to carry out a complete update of the structures for just the Human proteome every 24 hours, it is perhaps necessary to carry out an update at least every month in order to maintain accurate models for the latest sequence versions. This could be carried out in line with the new release cycle of the proteome sequence. If we were to use the JYDE system in its current setup to run mGenTHREADER predictions 24 hours a day, it would be possible (and perhaps more justifiable) to carry out complete structural annotations for up to 31 or so large eukaryotic proteomes per month, in line with their ENSEMBL release versions.
The JYDE system is not restricted to protein fold recognition and could be applied to any bioinformatics applications where there is a need to regularly distribute intensive methods on large-scale data sets. The system could also be deployed in situations where the performance of a stand-alone prediction server would have insufficient power, for example serving ab initio/new fold predictions for individual sequences on demand.
Although the current version of the JYDE system has proved to be robust and efficient there are some aspects which could be further improved. For instance, we are aware that the current method initialises a large number of SSH sessions, and the Distribution Manager is not yet optimally configured for job allocation. We are hoping to address these issues in the next version of the software.
It is clear that the tailing off phase could be further reduced in order to decrease the total time taken. One option would be to reduce the number of sequences per job to one at the end of the run, as the number of sequences left to run equals the total number of CPUs available. It is impractical to initially set the number of sequences per job to one. This is because of the inefficiencies of handling 32010 jobs in the initial submission, which hinders performance at the start of the run. However, it should be possible to build in dynamically configurable job sizes in future versions of JYDE.
Another consideration for reducing the tailing off phase would be to redirect all final jobs to the fastest cluster when the number of sequences left to run equals the number of nodes available in the fastest cluster. This was attempted at the end of the second run where we redirected the last remaining jobs to the Imperial-LESC cluster (lesc.imperial.ac.uk) (Figure 4). It is difficult to accurately predict how long a fold recognition job will need to run based on the sequence information alone. However, it would be possible to monitor long running jobs and resubmit them to faster nodes in the early stages of the run. Each of these performance improvements can be carried out automatically and will be added to future versions of JYDE, indeed it is our priority to minimize the tailing off phase as far as possible.
We succeeded in annotating 99.9% of the Human proteome in under 24 hours. The JYDE system was able to effectively schedule jobs dynamically across three different Grid domains, achieving a maximum throughput of 1487 sequences per hour and using 515 CPUs at the peak of the run. This study clearly demonstrates the feasibility of on demand high-throughput structural annotations of the proteomes of major eukaryotic organisms. The use of grid middleware such as JYDE software should allow us to maintain continually updated structural annotation databases containing the highest possible quality models of protein structures for key eukaryotic organisms.
Availability and requirements
Project names: e-Protein, JYDE
Operating system(s): Platform independent (Linux/UNIX preferred)
Programming language: Java 1.5
Other requirements: Ant
License: Freely available for academic use
Any restrictions to use by non-academics: License required
We are very grateful to Keith Sephton at LESC and William Hay at UCL for arranging priority access to their clusters for this experiment. This work was supported by the UK Biotechnology and Biological Sciences Research Council and the UK Department of Trade and Industry (LJM, RTS) along with the BioSapiens Network of Excellence (European Commission FP6 Programme contract number LSHG-CT-2003-503265).
- Fleming K, Muller A, MacCallum RM, Sternberg MJ: 3D-GENOMICS: a database to compare structural and functional annotations of proteins between sequenced genomes. Nucleic Acids Res 2004, (32 Database):D245–50. 10.1093/nar/gkh064Google Scholar
- Yeats C, Maibaum M, Marsden R, Dibley M, Lee D, Addou S, Orengo CA: Gene3D: modelling protein structure, function and evolution. Nucleic Acids Res 2006, (34 Database):D281–4. 10.1093/nar/gkj057Google Scholar
- Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of Hidden Markov Models that represent all proteins of known structure. J Mol Biol 2001, 313: 903–919. 10.1006/jmbi.2001.5080View ArticlePubMedGoogle Scholar
- McGuffin LJ, Street S, Sørensen SA, Jones DT: Genomic Threading Database. Bioinformatics 2004, 20: 131–2. 10.1093/bioinformatics/btg387View ArticlePubMedGoogle Scholar
- McGuffin LJ, Street SA, Bryson K, Sorensen SA, Jones DT: The Genomic Threading Database: a comprehensive resource for structural annotations of the genomes from key organisms. Nucleic Acids Res 2004, (32 Database):D196–9. 10.1093/nar/gkh043Google Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235PubMed CentralView ArticlePubMedGoogle Scholar
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, (34 Database):D187–91. 10.1093/nar/gkj161Google Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Rychlewski L, Fischer D: LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci 2005, 14: 240–5. 10.1110/ps.04888805PubMed CentralView ArticlePubMedGoogle Scholar
- Ohlson T, Wallner B, Elofsson A: Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins 2004, 57: 188–97. 10.1002/prot.20184View ArticlePubMedGoogle Scholar
- Sun Grid Engine Project Homepage[http://gridengine.sunsource.net]
- Condor Project Homepage[http://www.cs.wisc.edu/condor]
- Jones DT, Bryson K, Coleman A, McGuffin LJ, Sadowski MI, Sodhi JS, Ward JJ: Prediction of novel and analogous folds using fragment assembly and fold recognition. Proteins 2005, (Suppl 7):143–51. 10.1002/prot.20731Google Scholar
- Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Flicek P, Graf S, Hammond M, Herrero J, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Kokocinski F, Kulesha E, London D, Longden I, Melsopp C, Meidl P, Overduin B, Parker A, Proctor G, Prlic A, Rae M, Rios D, Redmond S, Schuster M, Sealy I, Searle S, Severin J, Slater G, Smedley D, Smith J, Stabenau A, Stalker J, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Hubbard TJ: Ensembl 2006. Nucleic Acids Res 2006, (34 Database):D556–61. 10.1093/nar/gkj133Google Scholar
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 1988, 85: 2444–8. 10.1073/pnas.85.8.2444PubMed CentralView ArticlePubMedGoogle Scholar
- McGuffin LJ, Jones DT: Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics 2003, 19: 874–81. 10.1093/bioinformatics/btg097View ArticlePubMedGoogle Scholar
- Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005, 33: 2302–9. 10.1093/nar/gki524PubMed CentralView ArticlePubMedGoogle Scholar
- The R Project for Statistical Computing[http://www.r-project.org/]
- Jones DT: GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 1999, 287: 797–815. 10.1006/jmbi.1999.2583View ArticlePubMedGoogle Scholar
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159PubMedGoogle Scholar
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH – A hierarchic classification of protein domain structures. Structure 1997, 5: 1093–1108. 10.1016/S0969-2126(97)00260-8View ArticlePubMedGoogle Scholar
- Holm L, Sander C: Dali: a network tool for protein structure comparison. Trends Biochem Sci 1995, 11: 478–80. 10.1016/S0968-0004(00)89105-7View ArticleGoogle Scholar
- Rychlewski L, Jaroszewski L, Li W, Godzik A: Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 2000, 9: 232–41.PubMed CentralView ArticlePubMedGoogle Scholar
- Przybylski D, Rost B: Alignments grow, secondary structure prediction improves. Proteins 2002, 46: 197–205. 10.1002/prot.10029View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.