PAT: predictor for structured units and its application for the optimization of target molecules for the generation of synthetic antibodies

Background: The identification of structured units in a protein sequence is an important first step for most biochemical studies. Importantly for this study, the identification of stable structured regions is a crucial first step in generating novel synthetic antibodies. While many approaches to find domains or predict structured regions exist, important limitations remain, such as the optimization of domain boundaries and the lack of identification of non-domain structured units. Moreover, no integrated tool exists to find and optimize structural domains within protein sequences. Results: Here, we describe a new tool, PAT (http://www.kimlab.org/software/pat), that can efficiently identify both domains (with optimized boundaries) and non-domain putative structured units. PAT automatically analyzes various structural properties, evaluates the folding stability, and reports possible structural domains in a given protein sequence. To evaluate the reliability of PAT, we applied it to identify antibody target molecules, based on the notion that soluble protein regions with well-defined secondary and tertiary structure are appropriate target molecules for synthetic antibodies. Conclusion: PAT is an efficient and sensitive tool to identify structured units. A performance analysis shows that PAT can characterize structurally well-defined regions in a given sequence and outperforms other efforts to define reliable boundaries of domains. Specifically, PAT successfully identifies experimentally confirmed target molecules for antibody generation. PAT also offers pre-calculated results for 20,210 human proteins to accelerate common queries. PAT can therefore help to investigate structured domains on a large scale and improve the success rate of synthetic antibody generation. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1001-1) contains supplementary material, which is available to authorized users.


Identifying protein domains and their boundaries
To identify protein domains with optimized boundaries, PAT defines two types of domains: sequence-based domains and structure-based domains. Sequence-based domains are derived from sequence profiles based on alignments of known domain sequences, and structure-based domains are defined from the structural relationships among known domain structures [1]. It has been shown that the different methods used to delineate sequence-based and structure-based domains complement each other and thus help to capture reliable protein domains with structural boundaries [2]. To identify sequence-based domains, PAT combines the domain information of Pfam [3], SMART [4], and PROSITE [5]. A sequence-based domain is defined if more than 2 databases capture the same region as a domain. Structure-based domains are derived from the domain information of Gene3D [6]. Gene3D is a recently updated database of protein domains that provides comprehensive structural annotation using HMM models based on the CATH domain families [7]. We used the most recently updated versions of these databases in this study.
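The consensus rule for sequence-based domains can be illustrated with a minimal sketch. It assumes that "the same region" is judged residue by residue and that domain hits are given as 1-based inclusive intervals; the function name `consensus_regions` and the `min_support` parameter are hypothetical and are not part of PAT. With three databases, "more than 2" corresponds to `min_support=3`; the threshold is left as a parameter so that other readings can be tested.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def consensus_regions(
    hits: Dict[str, List[Interval]], seq_len: int, min_support: int
) -> List[Interval]:
    """Return regions covered by at least `min_support` databases.

    `hits` maps a database name (e.g. "Pfam") to its domain intervals.
    Each database contributes at most one vote per residue.
    """
    support = [0] * (seq_len + 2)
    for intervals in hits.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(max(1, start), min(seq_len, end) + 1))
        for pos in covered:
            support[pos] += 1

    regions: List[Interval] = []
    region_start = None
    # Walk one position past the end so an open region is always closed.
    for pos in range(1, seq_len + 2):
        if pos <= seq_len and support[pos] >= min_support:
            if region_start is None:
                region_start = pos
        elif region_start is not None:
            regions.append((region_start, pos - 1))
            region_start = None
    return regions


# Hypothetical hits for a 100-residue protein.
example_hits = {"Pfam": [(10, 50)], "SMART": [(12, 48)], "PROSITE": [(20, 60)]}
print(consensus_regions(example_hits, seq_len=100, min_support=3))
```

A residue-level vote is only one possible reading; an interval-level criterion (for example, a minimum reciprocal overlap between hits) would give different boundaries.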

Identifying putative structural units
To compile structure-related information on putative structural units, PAT uses PSIPRED for secondary structure identification [8], InterProScan to obtain known domain information [9], TMHMM [10] and SignalP 4.0 [11] to examine the presence of transmembrane regions and signal peptides, and ConSeq to calculate the residue-specific evolutionary rate, based on the notion that structurally important residues evolve slowly [12]. DISOPRED2 is used to define ordered and disordered residues in proteins. DISOPRED2 generates position-specific score matrices by analyzing sequence profiles of homologous sequences and estimates the disorder probability of each residue. DISOPRED2 then determines the structural state of each residue from its disorder probability (ordered if the disorder probability is below the false positive rate threshold, disordered if it is above the threshold) [13]. We used the result of this two-state prediction. To identify antibody-targetable structural regions, PAT integrates antigenicity and hydrophobicity, based on the notion that residues such as Cys, Leu and Val are more likely to be part of antibody recognition sites [14] and that hydrophobic residues can affect interactions between target molecules and antibodies [15]. The Antigenic and Hmoment programs from the EMBOSS package are used to measure the antigenicity and hydrophobicity of residues [16]. Antigenic analyzes the physicochemical properties of amino acids and their frequencies of occurrence in experimentally known antibody recognition regions and finds antigenic sites in proteins. All tools integrated in the PAT pipeline are run with default options.
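The two-state assignment described above amounts to thresholding a per-residue disorder probability. The sketch below is an illustration only, not DISOPRED2 code: the function name `two_state` is hypothetical, the threshold is supplied by the caller, and a probability exactly equal to the threshold is treated as disordered, which is an assumption because the text does not cover that case.

```python
from typing import List, Sequence


def two_state(disorder_prob: Sequence[float], threshold: float) -> List[str]:
    """Label each residue as "order" or "disorder".

    A residue is ordered when its disorder probability is below the
    threshold. Ties are labelled "disorder" (assumption).
    """
    return ["order" if p < threshold else "disorder" for p in disorder_prob]


# Hypothetical probabilities and threshold, for illustration only.
print(two_state([0.01, 0.50, 0.04], threshold=0.05))
```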
Homologous sequences are selected from Swiss-Prot/TrEMBL to calculate the residue-specific evolutionary rate (ConSeq) and to predict disorder (DISOPRED) and secondary structure (PSIPRED). Only sequences whose length is 0.7 to 1.4 times the query sequence length and that share less than 90% similarity with other sequences are considered. It has been shown that PSIPRED and DISOPRED2 perform best in predicting secondary structures [17] and disordered regions [18], respectively. Therefore, we used their prediction results to characterize putative structural units.
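The homolog selection criteria can be sketched as a two-step filter. The code below is a minimal illustration under several assumptions: the length bounds are inclusive, redundancy is removed greedily in input order, and the similarity measure is supplied by the caller as a function returning a value between 0 and 1. The names `filter_homologs` and `toy_similarity` are hypothetical; `toy_similarity` uses Python's `difflib` purely as a stand-in and is not an alignment-based similarity.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence


def filter_homologs(
    query_len: int,
    candidates: Sequence[str],
    similarity: Callable[[str, str], float],
    min_ratio: float = 0.7,
    max_ratio: float = 1.4,
    max_similarity: float = 0.9,
) -> List[str]:
    """Keep candidates of suitable length that are not near-duplicates."""
    kept: List[str] = []
    for seq in candidates:
        ratio = len(seq) / query_len
        if not (min_ratio <= ratio <= max_ratio):
            continue  # too short or too long relative to the query
        # Greedy redundancy filter against the sequences kept so far.
        if all(similarity(seq, other) < max_similarity for other in kept):
            kept.append(seq)
    return kept


def toy_similarity(a: str, b: str) -> float:
    """Stand-in similarity (assumption); replace with an alignment score."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical candidates for a 10-residue query.
pool = ["ACDEFGHIKL", "ACDEFGHIKL", "WWWWWWWWWW", "ACD", "A" * 15]
print(filter_homologs(10, pool, toy_similarity))
```

Because the filter is greedy, the result depends on the order of the candidates; sorting them by relevance to the query first would be a natural refinement.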
After compiling this information, PAT assigns scores that represent structure-related properties. Since we focused on the identification of structural regions, we assigned high scores to residues that are involved in secondary structure (0: secondary structure absent, 1: secondary structure present), lie in an ordered region (0: disordered residue, 1: ordered residue) and have a slow evolutionary rate (0 to 1 in steps of 0.2: evolutionary rates below the average are divided into 3 equal intervals, and the same 3 intervals are used for values above the average, giving 6 grades of conservation in total, where 0 and 1 indicate low and high conservation, respectively). Furthermore, to consider structural regions that can be recognized by antibodies, we assigned high scores to residues that are predicted as antigenic sites (0: nonantigenic site, 1: antigenic site) and tend to be hydrophobic (0: < top 50%, 0.5: top 50% ≤ hydrophobic score ≤ top 75%, 1: ≥ top 75% of hydrophobic scores). Transmembrane regions, signal peptides, and known domains are excluded from further analysis so that only putative structural units are found. Next, we measured the relevance of each structural feature and optimized the scoring scheme. A grid-search method was applied to find the best weight for each structural feature. The weight of each feature ranged from 0.1 to 1.0 with a step size of 0.1, and all possible combinations of feature weights were tested for performance on the training set. As a training set, we used 164 mammalian proteins whose known structures are not listed as domains (putative structural units; Table S1). We collected all solved mammalian protein structures (each covering part of an entire protein) from the PDB database [19].
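The weight optimization can be sketched as an exhaustive grid search. The example assumes that the per-residue score is a weighted sum of the five feature scores, which the text does not state explicitly, and it leaves the performance measure to a caller-supplied function because the metric used on the training set is not specified here. The names `residue_scores`, `grid_search` and `evaluate` are hypothetical. With five features and ten candidate weights each, the search covers 10^5 = 100,000 combinations.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

FEATURES = (
    "secondary_structure",
    "order",
    "conservation",
    "antigenicity",
    "hydrophobicity",
)


def residue_scores(
    features: Dict[str, Sequence[float]], weights: Dict[str, float]
) -> List[float]:
    """Weighted sum of feature scores per residue (assumed combination rule)."""
    n = len(next(iter(features.values())))
    return [sum(weights[name] * features[name][i] for name in weights) for i in range(n)]


def grid_search(
    evaluate: Callable[[Dict[str, float]], float],
    names: Sequence[str] = FEATURES,
) -> Tuple[Dict[str, float], float]:
    """Try every weight combination from 0.1 to 1.0 in steps of 0.1.

    `evaluate` returns a performance value for a weight assignment;
    higher is better. The first best combination found is returned.
    """
    grid = [round(0.1 * k, 1) for k in range(1, 11)]
    best_weights: Dict[str, float] = {}
    best_score = float("-inf")
    for combo in product(grid, repeat=len(names)):
        weights = dict(zip(names, combo))
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

In practice `evaluate` would score every training protein with `residue_scores`, derive predicted units, and compare them with the known structured regions; that comparison is deliberately left open in this sketch.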

Comparing PAT prediction with experimentally characterized constructs for antibody generation
From an in-house pipeline for the generation of synthetic antibodies, we selected 75 targets that are localized to extracellular regions and against which antibodies were successfully produced using phage display (Additional file 4). The boundaries of these experimental constructs were determined by manual inspection and several rounds of experimental trial and error. We assumed that these manually characterized boundaries are the optimized boundaries for obtaining expressed and purified constructs. Among the 75 constructs, 26 contain predicted putative structural units. We compared the boundaries of the experimental constructs with the boundaries predicted by PAT. To further evaluate the performance of PAT, we also applied DisMeta (http://www.wenmr.eu/wenmr/dismeta-disorder-prediction-metaserver) [20] and DomPred (http://bioinf.cs.ucl.ac.uk/psipred/?dompred=1) [8] to the 75 experimental constructs.
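One simple way to compare a predicted region with an experimental construct is to report the signed offset at each boundary together with the fraction of the construct covered by the prediction. The sketch below is an assumption about how such a comparison could be made, not the measure used in this study; the function names and the example coordinates are hypothetical, and intervals are 1-based and inclusive.

```python
from typing import Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def boundary_offsets(predicted: Interval, experimental: Interval) -> Tuple[int, int]:
    """Signed differences (predicted minus experimental) at start and end."""
    return (predicted[0] - experimental[0], predicted[1] - experimental[1])


def overlap_fraction(predicted: Interval, experimental: Interval) -> float:
    """Fraction of the experimental construct covered by the prediction."""
    shared = min(predicted[1], experimental[1]) - max(predicted[0], experimental[0]) + 1
    construct_len = experimental[1] - experimental[0] + 1
    return max(0, shared) / construct_len


# Hypothetical coordinates for one construct.
print(boundary_offsets((25, 130), (30, 120)))
print(overlap_fraction((25, 130), (30, 120)))
```

A negative start offset or a positive end offset means that the prediction extends beyond the experimental construct on that side.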