PAT: predictor for structured units and its application for the optimization of target molecules for the generation of synthetic antibodies
- Jouhyun Jeon†1,
- Roland Arnold†1,
- Fateh Singh1,
- Joan Teyra1,
- Tatjana Braun1 and
- Philip M. Kim1, 2, 3
© Jeon et al. 2016
Received: 17 March 2016
Accepted: 23 March 2016
Published: 1 April 2016
The identification of structured units in a protein sequence is an important first step for most biochemical studies. Importantly for this study, the identification of stable structured regions is a crucial first step in generating novel synthetic antibodies. While many approaches exist to find domains or predict structured regions, important limitations remain, such as poorly optimized domain boundaries and the failure to identify non-domain structured units. Moreover, no integrated tool exists to find and optimize structural domains within protein sequences.
Here, we describe a new tool, PAT (http://www.kimlab.org/software/pat), that can efficiently identify both domains (with optimized boundaries) and non-domain putative structured units. PAT automatically analyzes various structural properties, evaluates the folding stability, and reports possible structural domains in a given protein sequence. To evaluate the reliability of PAT, we applied it to identify antibody target molecules, based on the notion that soluble proteins with well-defined secondary and tertiary structures are appropriate target molecules for synthetic antibodies.
PAT is an efficient and sensitive tool to identify structured units. A performance analysis shows that PAT can characterize structurally well-defined regions in a given sequence and outperforms other efforts to define reliable domain boundaries. In particular, PAT successfully identifies experimentally confirmed target molecules for antibody generation. PAT also offers pre-calculated results for 20,210 human proteins to accelerate common queries. PAT can therefore help to investigate structured domains on a large scale and improve the success rate of synthetic antibody generation.
Protein domains are fundamental units for studying protein structure, conformation, function and evolution. A protein domain is generally defined as a structural unit that can fold independently and has its own biological function, while domain identification usually relies on evolutionary conservation. The identification of structural domains has become increasingly important for engineering protein properties by experimental means, modeling protein structures using computational approaches, and determining 3D structures using X-ray crystallography and nuclear magnetic resonance (NMR). In particular, the identification of stable structural domains is a crucial first step in generating novel synthetic antibodies. For these reasons, many approaches have been suggested to identify structural domains. In earlier work, Huang et al. implemented a method (DisMeta) that identifies structured regions by excluding disordered regions, thereby detecting stably folded structures only implicitly. A number of methods have also been developed to identify protein structural domains directly: Marsden et al. developed DomPred, which predicts structural domains by aligning the predicted secondary structure of a given target against the secondary structures of known domains. A number of ab initio methods have also attempted to predict structural domains. They incorporate position-specific physico-chemical properties of amino acids, amino acid composition, relative solvent accessibility, and evolutionary information in the form of sequence profiles [9, 10]. While such approaches exist, there is still no efficient and integrative computational pipeline that identifies structural domains while optimizing their likelihood of expression and folding. Furthermore, no user-friendly web server is available to predict these targets.
To address this need, we developed an integrated computational framework, PAT (Predictor for structural domains to design Antibody Target molecules), that can predict optimal structural domains. PAT automatically analyzes various structural properties, evaluates the folding stability, and identifies possible structured units in a given protein sequence. PAT identifies two types of structured regions with reliable boundaries. The first are traditional domains, i.e., strongly conserved stretches of protein sequence that usually adopt compact folds and are annotated in common databases such as Pfam. The second are putative structural units, i.e., parts of the protein that adopt stable folds but are not contained in current domain databases, presumably due to a lack of sequence conservation (unassigned regions). To identify putative structural units, PAT employs a novel scoring system that measures the relevance of each structural property, integrates the properties systematically, and generates a target score that represents the folding stability of target molecules. PAT also provides users with the results of each intermediate calculation, including residue-specific evolutionary rate, disorder, secondary structure, presence of transmembrane regions and signal peptides, hydrophobicity, antigenicity, and a compilation of primary amino acid sequences homologous to the query, which can support further analyses of the user’s proteins of interest.
In this study, to show the wide applicability of structural domain prediction, we applied PAT to identify target molecules of synthetic antibodies. Synthetic antibodies are invaluable tools for the recognition of specific protein targets and have numerous applications in clinical studies and biological science. Antibodies are also applied in high-throughput proteome-wide studies to explore expression levels, subcellular localizations, and physical associations of target proteins. It has been shown that protein fragments that fold into stable structures are preferred as target molecules and consistently lead to high-affinity antibodies [6, 13]. Furthermore, these structural domains have been used as targets to produce affinity reagents and suitable constructs for antigen cell-surface display. One of the major bottlenecks of synthetic antibody generation is the identification and production of suitable antibody targets (sometimes referred to as antigens), since potential target proteins often fail to express or do not lead to high-affinity binders. In our proof-of-principle experiment, we showed that integrating structural properties of RNA-binding proteins (RBPs) can characterize protein regions that act as targets of synthetic antibodies. In this study, we show that PAT can be broadly applied to all protein families and effectively identifies structural domains that can serve as target molecules for synthetic antibody generation.
Identifying protein structured units
PAT integrates four domain databases to identify protein domains (Fig. 1a). First, PAT defines two types of domains (see Additional file 1 for details): sequence-based domains (from Pfam, SMART, and PROSITE) and structure-based domains (from Gene3D). Then, the sequence-based and structure-based domains are compared to find a consensus domain. We encounter three different cases. First, if one sequence-based domain maps to more than 50 % of one structure-based domain (we refer to this case as “good overlap”), the region that covers both types of domains is taken as the consensus domain. Second, if several structure-based domains map to one sequence-based domain (“fragmented structure-based domain”) or vice versa (“fragmented sequence-based domain”), the structure-based domain is taken as the consensus domain. Third, if a given protein contains only sequence-based domain annotations, the sequence-based domain is selected.
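The three cases above can be sketched in a few lines of Python. This is only an illustrative sketch, not PAT's implementation: the function names, the 1-based inclusive coordinate convention, and the handling of situations the text does not cover (a weak overlap of 50 % or less, or a structure-based domain with no sequence-based partner, both of which keep the structure-based domain here) are assumptions.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # assumed convention: 1-based, inclusive (start, end)


def overlap_len(a: Region, b: Region) -> int:
    """Number of residues shared by two inclusive regions."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def consensus_domains(seq_domains: List[Region],
                      struct_domains: List[Region],
                      min_cover: float = 0.5) -> List[Region]:
    """Combine sequence-based and structure-based domains of one protein."""
    consensus: List[Region] = []
    used_seq = set()  # indices of sequence-based domains that touch a structure-based one
    for s in struct_domains:
        s_len = s[1] - s[0] + 1
        hits = [i for i, q in enumerate(seq_domains) if overlap_len(q, s) > 0]
        used_seq.update(hits)
        if len(hits) == 1:
            q = seq_domains[hits[0]]
            partners = [t for t in struct_domains if overlap_len(q, t) > 0]
            # "Good overlap": a one-to-one pair in which the sequence-based
            # domain covers more than 50 % of the structure-based domain.
            if len(partners) == 1 and overlap_len(q, s) / s_len > min_cover:
                consensus.append((min(q[0], s[0]), max(q[1], s[1])))
                continue
        # Fragmented cases: keep the structure-based domain. Weak overlaps and
        # unpartnered structure-based domains also land here (an assumption).
        consensus.append(s)
    # Sequence-based domains with no structure-based counterpart are kept as-is.
    consensus.extend(q for i, q in enumerate(seq_domains) if i not in used_seq)
    return sorted(consensus)
```

For example, a sequence-based domain at residues 10 to 100 and a structure-based domain at residues 20 to 90 form a good overlap, so the sketch returns the covering region (10, 100).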
Identifying putative structural units
The optimized target score achieves an area under the ROC curve of 0.68. This value reflects performance at the level of individual amino acids; accuracy at the protein level is substantially higher when some boundary error is allowed. Next, PAT determines putative structural units that are enriched in high-scoring residues. To do this, PAT employs a density grid clustering algorithm. First, PAT divides the protein sequence into “grids” of 5 residues and calculates the average target score of each grid. Then, the grid with the highest average target score is defined as the center of the putative structural unit. Finally, the putative structural unit is extended as long as its target score remains above a defined cut-off. At a target score cut-off of 0.52, PAT shows the best balanced accuracy (68.91 %, the mean of sensitivity and specificity), with a specificity of 62.94 % and a sensitivity of 74.88 % (Additional file 1: Figure S1). We retain all putative structural units longer than 40 residues. As a result, PAT reports a set of structural domains, comprising well-defined structural domains and putative structural units, together with their boundaries (Fig. 1c).
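The grid-based procedure described above can be illustrated with a short Python sketch. It is a simplified reading of the text rather than the published algorithm: the function name is hypothetical, the extension rule (add a neighboring grid while that grid's average score exceeds the cut-off) is one possible interpretation of the stopping criterion, and repeating the seeding step to find several units per protein is an assumption. The grid size (5), cut-off (0.52), and minimum length (more than 40 residues) come from the text.

```python
from typing import List, Tuple


def putative_units(scores: List[float], grid: int = 5,
                   cutoff: float = 0.52, min_len: int = 40) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) regions enriched in high scores."""
    # Average score of each consecutive, non-overlapping grid
    # (the last grid may be shorter than `grid` residues).
    grids = [scores[i:i + grid] for i in range(0, len(scores), grid)]
    means = [sum(g) / len(g) for g in grids]
    # Grids that pass the cut-off and are not yet part of a unit.
    free = [m > cutoff for m in means]
    units = []
    while any(free):
        # Seed: the highest-scoring grid that is still available.
        center = max((i for i, ok in enumerate(free) if ok),
                     key=means.__getitem__)
        left = right = center
        # Extend in both directions while the neighboring grid passes the cut-off.
        while left > 0 and free[left - 1]:
            left -= 1
        while right + 1 < len(means) and free[right + 1]:
            right += 1
        for i in range(left, right + 1):
            free[i] = False
        start = left * grid + 1
        end = min((right + 1) * grid, len(scores))
        # Keep only units longer than `min_len` residues.
        if end - start + 1 > min_len:
            units.append((start, end))
    return sorted(units)
```

With per-residue scores that are low for 20 residues, high for the next 50, low for 20, and moderately high for a final 30, the sketch reports a single unit spanning residues 21 to 70; the final 30-residue stretch passes the cut-off but is discarded by the length filter.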
Results and discussion
Performance evaluation of PAT
Table: Comparative performance of PAT and TargetTrack
Number of targets: 173 (82.38 %); 32,904 (66.84 %); 145 (69.05 %); 15,617 (31.72 %)
Table: Performance of PAT to identify putative structural units
Columns: Target score prediction; Balanced accuracy (%)
For a comparative performance evaluation of PAT predictions, we also applied DisMeta and DomPred to these 75 experimentally characterized constructs (Additional file 4). We found that PAT outperforms the other two methods. Only 6 constructs (8 %) and 41 constructs (54.67 %) have a reciprocal overlap of more than 70 % with the DisMeta and DomPred predictions, respectively. Also, the overall reciprocal overlap of PAT (84.43 %, standard deviation ± 9.86) is about 1.9 times that of DisMeta (43.72 %, standard deviation ± 16.25) and about 1.2 times that of DomPred (71.30 %, standard deviation ± 27.36).
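The comparison above rests on the reciprocal overlap between a predicted region and an experimentally characterized construct. The text does not give the exact formula, so the following Python sketch assumes a common definition: the shared length divided by the longer of the two regions, so that the overlap must hold in both directions. The function names and the 1-based inclusive coordinates are also assumptions.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # assumed convention: 1-based, inclusive (start, end)


def reciprocal_overlap(pred: Region, ref: Region) -> float:
    """Percentage of the longer region that is shared by both regions."""
    shared = max(0, min(pred[1], ref[1]) - max(pred[0], ref[0]) + 1)
    pred_len = pred[1] - pred[0] + 1
    ref_len = ref[1] - ref[0] + 1
    # Dividing by the longer region equals taking the smaller of the two
    # one-directional overlap fractions.
    return 100.0 * shared / max(pred_len, ref_len)


def fraction_above(pairs: List[Tuple[Region, Region]],
                   threshold: float = 70.0) -> float:
    """Percentage of (prediction, construct) pairs exceeding the threshold."""
    if not pairs:
        return 0.0
    passing = sum(1 for p, r in pairs if reciprocal_overlap(p, r) > threshold)
    return 100.0 * passing / len(pairs)
```

Under this definition, a prediction at residues 10 to 109 and a construct at residues 20 to 119 share 90 of 100 residues, giving a reciprocal overlap of 90 %, which would pass the 70 % criterion used in the text.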
Description of PAT
The availability of high-quality protein structural domains is a necessary prerequisite for protein engineering, protein structure determination and successful antibody generation. PAT is an effective tool for finding potential structural domains by means of a novel integrative scoring scheme, and it has been shown to do so efficiently. We believe that PAT has great practical value for researchers focusing on large-scale production of structured targets and will ultimately improve the success rate of synthetic antibody generation and follow-up studies.
Availability and requirements
Project name: PAT.
Project home page: http://www.kimlab.org/software/pat.
Operating system(s): Linux for the distributed source code and operating system independent for the web servers.
Programming language: Python 2.6 and C++.
License: Non-commercial use only.
Any restrictions to use by non-academics: Contact authors for permission.
We thank Taehyung Kim and Alexey Strokach for technical assistance and valuable discussion. This work was supported by an operating grant of the Canadian Institutes of Health Research (MOP-123526), the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education, Science and Technology (357-2011-1-C00143), and the NSERC-CREATE Training Program (384338–10).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Kong L, Ranganathan S. Delineation of modular proteins: domain boundary prediction from sequence information. Brief Bioinform. 2004;5(2):179–92.
2. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J. The Pfam protein families database. Nucleic Acids Res. 2012;40(Database issue):D290–301.
3. Gulich S, Uhlen M, Hober S. Protein engineering of an IgG-binding domain allows milder elution conditions during affinity chromatography. J Biotechnol. 2000;76(2–3):233–44.
4. Chivian D, Kim DE, Malmstrom L, Bradley P, Robertson T, Murphy P, Strauss CE, Bonneau R, Rohl CA, Baker D. Automated prediction of CASP-5 structures using the Robetta server. Proteins. 2003;53 Suppl 6:524–33.
5. Folkers GE, van Buuren BN, Kaptein R. Expression screening, protein purification and NMR analysis of human protein domains for structural genomics. J Struct Funct Genomics. 2004;5(1–2):119–31.
6. Konthur Z, Hust M, Dubel S. Perspectives for systematic in vitro antibody generation. Gene. 2005;364:19–29.
7. Huang YJ, Acton TB, Montelione GT. DisMeta: a meta server for construct design and optimization. Methods Mol Biol. 2014;1091:3–16.
8. Marsden RL, McGuffin LJ, Jones DT. Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Sci. 2002;11(12):2814–24.
9. Cheng J, Sweredoski MJ, Baldi P. DOMpro: protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks. Data Min Knowl Discov. 2006;1(13):1–10.
10. Eickholt J, Deng X, Cheng J. DoBo: Protein domain boundary prediction by integrating evolutionary signals and machine learning. BMC Bioinformatics. 2011;12:43.
11. Borrebaeck CA. Antibodies in diagnostics - from immunoassays to protein chips. Immunol Today. 2000;21(8):379–82.
12. Mersmann M, Meier D, Mersmann J, Helmsing S, Nilsson P, Graslund S, Structural Genomics C, Colwill K, Hust M, Dubel S. Towards proteome scale antibody selections using phage display. N Biotechnol. 2010;27(2):118–28.
13. Fellouse FA, Esaki K, Birtalan S, Raptis D, Cancasci VJ, Koide A, Jhurani P, Vasser M, Wiesmann C, Kossiakoff AA, et al. High-throughput generation of synthetic antibodies from highly functional minimalist phage-displayed libraries. J Mol Biol. 2007;373(4):924–40.
14. Wittrup KD. Protein engineering by cell-surface display. Curr Opin Biotechnol. 2001;12(4):395–9.
15. Schofield DJ, Pope AR, Clementel V, Buckell J, Chapple S, Clarke KF, Conquer JS, Crofts AM, Crowther SR, Dyson MR, et al. Application of phage display to high throughput antibody generation and characterization. Genome Biol. 2007;8(11):R254.
16. Na H, Laver JD, Jeon J, Singh F, Ancevicius K, Fan Y, Cao WX, Nie K, Yang Z, Luo H, et al. A high-throughput pipeline for the production of synthetic antibodies for analysis of ribonucleoprotein complexes. RNA. 2016.
17. Letunic I, Doerks T, Bork P. SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res. 2012;40(Database issue):D302–5.
18. Sigrist CJ, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L, Xenarios I. New and continuing developments at PROSITE. Nucleic Acids Res. 2013;41(D1):D344–7.
19. Lees J, Yeats C, Perkins J, Sillitoe I, Rentzsch R, Dessailly BH, Orengo C. Gene3D: a domain-based resource for comparative genomics, functional annotation and protein network analysis. Nucleic Acids Res. 2012;40(Database issue):D465–71.
20. Koga N, Tatsumi-Koga R, Liu G, Xiao R, Acton TB, Montelione GT, Baker D. Principles for designing ideal protein structures. Nature. 2012;491(7423):222–7.
21. Myers JK, Oas TG. Preorganized secondary structure as an important determinant of fast protein folding. Nat Struct Biol. 2001;8(6):552–8.
22. Mirny L, Shakhnovich E. Evolutionary conservation of the folding nucleus. J Mol Biol. 2001;308(2):123–9.
23. Dyson MR, Shadbolt SP, Vincent KJ, Perera RL, McCafferty J. Production of soluble mammalian proteins in Escherichia coli: identification of protein features that correlate with successful expression. BMC Biotechnol. 2004;4:32.
24. Sussman JL, Lin D, Jiang J, Manning NO, Prilusky J, Ritter O, Abola EE. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr. 1998;54(Pt 6 Pt 1):1078–84.
25. Sarmah RD, Bhattacharyya DK. A distributed algorithm for intrinsic cluster detection over large spatial data. World Acad Sci Eng Technol. 2008;21:856–66.
26. Savitsky P, Bray J, Cooper CD, Marsden BD, Mahajan P, Burgess-Brown NA, Gileadi O. High-throughput production of human proteins for crystallization: the SGC experience. J Struct Biol. 2010;172(1):3–13.
27. Chen L, Oughtred R, Berman HM, Westbrook J. TargetDB: a target registration database for structural genomics projects. Bioinformatics. 2004;20(16):2860–2.
28. Buchan DW, Ward SM, Lobley AE, Nugent TC, Bryson K, Jones DT. Protein annotation and modelling servers at University College London. Nucleic Acids Res. 2010;38(Web Server issue):W563–8.
29. UniProt C. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012;40(Database issue):D71–5.