ProLego: tool for extracting and visualizing topological modules in protein structures

Background In protein design, correct use of topology is among the initial and most critical features. Meticulous selection of backbone topology drastically reduces the structure search space. With ProLego, we present a server application to explore the component aspect of protein structures and provide an intuitive and efficient way to scan the protein topology space. Results We have implemented an in-house developed “topological representation” in an automated pipeline to extract protein topology from a given protein structure. Using the topology string, ProLego compares the topology against an extensive non-redundant topology database (ProLegoDB) and extracts its constituent topological modules. The platform offers interactive topology visualization graphs. Conclusion ProLego provides an alternative but comprehensive way to scan and visualize protein topology, along with an extensive database of protein topology. ProLego can be found at http://www.proteinlego.com Electronic supplementary material The online version of this article (10.1186/s12859-018-2171-9) contains supplementary material, which is available to authorized users.


Supplementary Information Taushif Khan, Shailesh Kumar Panday, Indira Ghosh

Approaches to analyze protein topology
Over the years, methods such as PROMOTIF [1], TopDraw [2], TOPs+ [3], Pro-Origami [4] and PTGL [5] have implemented different algorithms to generate protein topology graphs from protein 3D atomic structures. With the objective of understanding and simplifying the atomic structure space, topological approaches have been used at different scales. However, only a handful of methods provide automatic generation of the protein topology diagram [4,5] (Table S1).

Based on H-bonds, inspired by the work of Jane Richardson (1). Hutchinson and Thornton. Algorithm, no public web server.
2 PROMOTIF (1996). Provides details of the location of structural motifs in PDB; used to compare protein topologies. Hutchinson et al. [1]. Stand-alone code available on request from the authors.
3 TOPs (2000). Protein topology by analyzing relative position and orientation of secondary structures.

Contact string (CS) from Protein chain
The contact string is the linear construct of the protein secondary structure (SS) adjacency matrix (Fig. S2, lower panel). Along with SS contact information, this adjacency matrix contains the type of each contact and its orientation. Detailed contact definitions are listed in Table S2. In the "contact string", we decompose the adjacency matrix diagonally and construct a string in which each segment represents the contact information of sequentially distanced neighbours, in increasing order of distance. The segments are marked with dashes "-" and the individual contacts (elements of the adjacency matrix) with dots ".". This linear notation makes topology comparison, storage and visualization easy and efficient.
The definition of SS contact has been carefully estimated and selected [7]. The contact distance criterion is found to be in agreement with conventional use [8,9]. A brief description of the construction of the adjacency matrix and the "contact string" can be seen in Figure S2.
A working example of contact string generation for a 4-helix protein can be found in Fig. S2.
The contact matrix M, of dimension (H x H), captures the contacting alpha helices as well as their orientation.
The matrix dimension depends on the number of alpha helices (H) in the protein chain. The matrix element (i,j) shows the contact and orientation of two alpha helices i and j in the protein. If i and j are not in contact then M(i,j) is '0'; otherwise it is 'a', 'r' or 'p', for anti-parallel, orthogonal and parallel helix orientation respectively. The near-diagonal elements represent contacts between sequentially adjacent alpha helices (contact distance = 1), and the contact distance increases as we move diagonally upward. Contact distance is the separation between two contacting alpha helices, counted as the number of alpha helices by which they are apart in sequence.
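The diagonal decomposition described above can be sketched in a few lines of Python. This is a minimal illustration and not the ProLego implementation: the function name `contact_string`, the example matrix, and the exact use of the separators (elements within one diagonal joined by ".", successive diagonals joined by "-") are assumptions made for this sketch.

```python
def contact_string(matrix):
    """Build a contact string from a square SS contact matrix.

    matrix[i][j] holds '0' (no contact) or an orientation code
    ('a', 'r' or 'p'). Assumed layout: one segment per contact
    distance d = 1 .. n-1, elements joined by '.', segments by '-'.
    """
    n = len(matrix)
    segments = []
    for d in range(1, n):  # d is the contact distance (diagonal offset)
        diagonal = [matrix[i][i + d] for i in range(n - d)]
        segments.append(".".join(diagonal))
    return "-".join(segments)


# Hypothetical symmetric contact matrix for a 4-helix protein
M = [
    ["0", "a", "r", "a"],
    ["a", "0", "a", "0"],
    ["r", "a", "0", "p"],
    ["a", "0", "p", "0"],
]

print(contact_string(M))  # a.a.p-r.0-a
```

Only the upper triangle is read, since the matrix is symmetric. The first segment lists contacts between sequentially adjacent helices, and the last segment holds the single contact between the first and last helix.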
We take the contact distance of a helix with itself as 0, between sequentially adjacent helices as 1, and so forth, with each contact distance marked by "-" in the contact string.

The ProLego server application defines contact between two secondary structures (alpha (H), beta (E)) with the rules tabulated above. The residue contact definition follows the most popular convention: two residues are in contact when the distance between their heavy atoms is within the sum of their van der Waals radii, with a threshold of 0.6 Angstroms. The minimum number of "contacting residues" required for two SSEs to be in contact is given in the column "Definition". The orientation definitions for two contacting SSEs are given in the column "Orientation".

Figure S1: ProLego architecture. The layered architecture makes ProLego easy to use and maintain. The user layer offers different input options for the protein chain. Layers are shown in different boxes. The pipeline shows the flow of instructions in black and blue lines.

Extracting topological modules using contact string
For different secondary structure content, the presence of topological modules can be searched.
Modules are the structural patterns of lower SS content that occur in proteins of higher SS content. For example, a protein with 4 alpha helices can contain the structural patterns of two 3-alpha-helix proteins. This can be generalized for "n" SSEs (∀ n > 4), where topological modules of size 3 to n-1 can be extracted. A pictorial representation of this concept is given in Figure S2. These "topological modules" may be used to build and design more complex proteins with higher SS content from the structural roots of proteins with lower SS content. Topological modules for different sets of SS are formed from the contact string. A working example for chain L of a photoreaction protein (PDB: 1JB0:L) is provided in Table S3. This protein chain has 5 alpha helices. Using the contact string, 8 different topological modules can be identified, which have a significant presence in ProLegoDB.
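As a rough illustration of module extraction, the sketch below enumerates sub-matrices of the contact matrix and reports the contact string of each. It assumes that a module is a window of sequentially consecutive secondary structures, which reproduces the "two 3-helix modules in a 4-helix protein" example but may not match the module definition used by ProLego. The names `contact_string` and `topological_modules` and the example matrix are hypothetical.

```python
def contact_string(matrix):
    """Contact string of a square contact matrix (assumed layout:
    '.' between elements of a diagonal, '-' between diagonals)."""
    n = len(matrix)
    return "-".join(
        ".".join(matrix[i][i + d] for i in range(n - d))
        for d in range(1, n)
    )


def topological_modules(matrix, min_size=3):
    """List (start, size, contact_string) for every window of
    consecutive SS elements with min_size <= size <= n - 1."""
    n = len(matrix)
    modules = []
    for size in range(min_size, n):
        for start in range(n - size + 1):
            rows = matrix[start:start + size]
            sub = [row[start:start + size] for row in rows]
            modules.append((start, size, contact_string(sub)))
    return modules


# Hypothetical 4-helix contact matrix
M = [
    ["0", "a", "r", "a"],
    ["a", "0", "a", "0"],
    ["r", "a", "0", "p"],
    ["a", "0", "p", "0"],
]

for start, size, cs in topological_modules(M):
    print(f"helices {start + 1}-{start + size}: {cs}")
```

For this matrix the two 3-helix windows have the contact strings "a.a-r" and "a.p-0", which could then be looked up in a topology database.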
With the ProLego contact string, it is straightforward to identify and decompose repeating protein units. An example of a zinc and Tax binding protein is discussed in Figure S4.

1chuA03, 1dd5A01, 1eh1A01, 1ge9A01, 1gvnC00, 1h5wB01, 1hx8A02, 1i6zA00, 1is1A01, 1iseA01, 1k04A02, 1knrA03, 1m62A00, 1o3xA00, 1qsdA00, 1sumB01, 1uk5A00, 1uurA01, 1wfdA00, 1wrdA00, 1x9bA00, 1yd8G00, 1z8uA00, 2c5kT00, 2dl1A01, 2i0mA01, 2jwsA00, 2kdlA00, 2ptfB02, 2v6yA00, 2v8sV00, 2vqeT00, 2w2uA00, 3a8yD00, 3axjA02, 3ldqB00, 3mxzA00, 3nvoB02, 3qb5K0

Figures A and B show different representations of two proteins that have the same topology but share very low sequence identity (19%, as shown in Fig. C). For each case (A, B), the upper diagram shows the linear topology, with strands represented as triangles (relative orientation shown as up/down triangles) and helices represented as rectangles. The length of each helical rectangle is scaled to the number of residues in the helix. The protein chain is coloured from red to green to blue as it passes from the N to the C terminal. The straight lines connecting secondary structure (SS) blocks show the chain connectivity, whereas the arcs represent the spatial connectivity and the type of SS contact (colour coded as labelled, Table S2). The secondary structure contact map shows all spatial contacts between pairs of SS. A 3D cartoon representation (generated with VMD) and the 2D topology cartoon plot generated by ProLego are also shown. The 2D ProLego cartoon shows contacts between two SS blocks as red dotted lines and chain connectivity as a black continuous line.

Using the standalone suite, contact strings have been generated for the decoys in each protein case.
The ab-initio design procedure filters candidates by their energy contributions, and a set of minimum-energy templates is listed as designed decoys. With ProLego topology we can scan for the presence of different topology scaffolds, which depend on the contacting secondary structures.
While examining the decoys, we found that on average ~85% (± ~15%) can be filtered out because they have "non-contacting (NC)" secondary structures. The remaining "contacting" topologies can be scanned against ProLegoDB to label them as "P" (prevalent) or "NP" (non-prevalent) topologies. As shown in Table S6, across all examined cases, on average ~8% of the generated topologies can be mapped to the "Prevalent" topology class. These selected topologies (~20 out of 200) can be used as top-ranked templates for further refinement.
Overall, ProLego can filter out a large number of decoys whose topologies are not found in the already studied non-redundant datasets. For smaller proteins (number of secondary structures <= 10), the topology space has been exhaustively analysed and therefore the filter is conventional. This data can be accessed from the GitHub repository https://github.com/taushifkhan/plv-DecoySearch .
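The decoy filter described above can be sketched as a two-step classification on contact strings. This is an illustrative sketch only. It assumes that a decoy counts as "non-contacting" when at least one secondary structure has no contact with any other, that the contact string uses "." within a diagonal and "-" between diagonals, and that the prevalent and non-prevalent topology sets are available as plain Python sets. The function names and the example strings are hypothetical and are not taken from ProLegoDB.

```python
def isolated_elements(contact_str):
    """Return indices of SS elements with no contact to any other.

    Element k of the segment for contact distance d describes the
    pair (k, k + d); '0' means no contact.
    """
    segments = [seg.split(".") for seg in contact_str.split("-")]
    n = len(segments) + 1  # number of SS elements
    contacted = set()
    for d, seg in enumerate(segments, start=1):
        for k, code in enumerate(seg):
            if code != "0":
                contacted.update((k, k + d))
    return [i for i in range(n) if i not in contacted]


def classify_decoy(contact_str, prevalent, non_prevalent):
    """Label a decoy topology as 'NC', 'P', 'NP' or 'absent'."""
    if isolated_elements(contact_str):
        return "NC"
    if contact_str in prevalent:
        return "P"
    if contact_str in non_prevalent:
        return "NP"
    return "absent"  # contacting, but not present in the database


# Hypothetical database content and decoys
prevalent = {"a.a.a-0.0-a"}
non_prevalent = {"a.a.p-r.0-a"}
decoys = ["a.0.0-0.0-0", "a.a.a-0.0-a", "a.a.p-r.0-a", "a.a.a-p.p-p"]

for cs in decoys:
    print(cs, classify_decoy(cs, prevalent, non_prevalent))
```

Decoys labelled "NC" or "absent" would be discarded, and those labelled "P" would be kept as candidate templates.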

ProLego in protein designing
The protein topology representation in ProLego is "string based". The inherent component-based approach provides a relative ranking of the different topologies possible for a given secondary structure (SS) composition. We have compiled ProLegoDB with topology information from the analysis of different non-redundant protein structure databases. At present, ProLegoDB holds topologies for proteins of different sizes (30 to 500 residues).

Description of designed dataset:
Recently, in the study and design of mini-proteins (3 and 4

Experimental setup
From each design round, the main chain topology has been extracted from the data provided in the above link using the ProLego standalone suite. The corresponding sequence stability score of the main chain has been categorized as stable (>1) or non-stable (<1), as reported by the authors. We have investigated the influence of topology classification ("prevalent" and "non-prevalent") in the reported pool of designed topologies. For each topology group (e.g. 'HHH'), the occurrence of ProLego topologies and their stability has been monitored.

Table S6: Designed protein templates in ProLego topologies.

Results
ProLego topologies in the 3H designed templates (data from Rocklin et al. [11]). The rounds of experiments show the enrichment in sequence stability. A stability score of >1 has been assigned as "stable" and <1 as "Not-stable". ProLego status "1" and "0" represent "Prevalent" and "Non-prevalent" topologies respectively.
The reported experiment has a feedback design that enriches the sequence stability in each round. As shown in Table S6, the proportion of stable designs increases in subsequent rounds (from ~20% in round 1 to 87% in round 4), which is consistent with the reported data. Here, we have investigated the population of stable and non-stable designs for each ProLego topology. As discussed in the main text, ProLego examines all possible topologies in an SS-group. For 3-helix SS, ProLegoDB has 50 different topologies, of which ~32% (16 out of 50) are "prevalent". Investigating the nature of the topology space in the synthetic designs, we observed a nearly equal presence of "prevalent" (8) and "non-prevalent" (7) topologies in round 1. As the sequence enrichment proceeds from round 2 onwards, however, "prevalent" topologies are selectively present. Moreover, the number of "stable" folds increases in every "prevalent" topology. We have made similar observations of preferred topologies in the other 3 cases (EHEE, HEEH, EEHEE), as shown in the datasheet.
This selective occurrence of certain topologies in the above synthetic dataset is in agreement with the "preferred" set in the naturally observed topology dataset (reported in ProLegoDB). As shown in previous work [7], the "preferred" topologies can support functionally diverse scaffolds. This gives an evolutionary advantage to the frequent use of certain topologies in the structure space. Even in the de novo synthetic datasets, these naturally occurring topologies emerge as the "useful" scaffolds.

Overall percentage occurrence of ProLego topologies in stable decoys of Rocklin et al. [11].
For each SS construct (case), a different number of topologies found to be present in ProLegoDB has been analyzed (1). All topologies are then grouped into prevalent ("P") and non-prevalent ("NP") groups based on the described method. The table above shows the percentage occurrence of prevalent and non-prevalent topologies in the "stable" decoy pool of each round.
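The tabulation described above (percentage of prevalent and non-prevalent topologies among the stable designs of each round) can be sketched as follows. The record layout `(round, stability_score, prolego_status)`, the function name and the example values are assumptions made for illustration; the values are not taken from the Rocklin et al. dataset. The stability cut-off of 1 and the status coding (1 = prevalent, 0 = non-prevalent) follow the text.

```python
from collections import defaultdict


def prevalence_among_stable(designs, threshold=1.0):
    """Percentage of 'P' and 'NP' topologies among stable designs.

    designs: iterable of (round_id, stability_score, prolego_status),
    with prolego_status 1 for prevalent and 0 for non-prevalent.
    A design is counted as stable when its score exceeds threshold.
    """
    counts = defaultdict(lambda: {"P": 0, "NP": 0})
    for round_id, score, status in designs:
        if score > threshold:
            counts[round_id]["P" if status == 1 else "NP"] += 1

    percentages = {}
    for round_id, c in sorted(counts.items()):
        total = c["P"] + c["NP"]
        percentages[round_id] = {k: 100.0 * v / total for k, v in c.items()}
    return percentages


# Made-up example records, for illustration only
designs = [
    (1, 1.4, 1), (1, 0.3, 0), (1, 1.2, 0),
    (2, 1.8, 1), (2, 1.1, 1), (2, 0.9, 0), (2, 1.5, 0), (2, 1.3, 1),
]

for round_id, pct in prevalence_among_stable(designs).items():
    print(round_id, pct)
```

Rounds without any stable design are simply absent from the result, which avoids a division by zero.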

ProLego time estimation
The time ProLego takes to compute the contact string of a protein increases steadily with protein size, as shown in Figure S6. When protein size is measured by total secondary structure content and by total number of residues, the time profiles vary at different scales.
For small proteins (i.e. total residues < 150), ProLego generates results within 20 seconds, whereas when the number of residues exceeds 200 the total run time can increase to 90 seconds. As shown in Fig. S6(A), for a total secondary structure content < 10 the mean run time is within 60 seconds.
In the current implementation of ProLego, computation for small proteins is relatively fast, i.e. within 30 seconds. For bigger proteins, as the total secondary structure content increases (> 10 SS and total residues > 300), the mean run time increases to 90 seconds with observable variation.

Figures B and C describe time with respect to the number of residues as protein size. In Figure B, proteins have been grouped into three groups: 'Small' (residues < 250), 'Medium' (residues between 250 and 400) and 'Big' (residues greater than 400), with similar properties as Fig. A. For 'Small' proteins, Fig. C shows time w.r.t. the total number of residues.

Detail of Dataset Analysis in ProLegoDB
The Protein Data Bank (PDB) has been filtered for X-ray structures with good resolution (< 3 Angstrom) and for sequence identity clusters of 80%, 60% and 30%. The non-redundant subsets are generated with CD-HIT and the PISCES server. The main goal of varying the data is to check the consistency of the resulting topology groups and the robustness of the prevalence classes.
In each dataset, protein chains have been analyzed in classes of SS-composition, which is defined by the arrangement of SS in the protein chain from the N to the C terminal. In each composition group, protein chains are then clustered by their topology. The statistical significance of each topology group has been computed with the Chi-Square test. We consider the topology group distribution in an SS-composition significant if the P-value is < 0.001, i.e. the chance of finding such a distribution at random is 1/1000.
The frequencies of occurrence of the topologies in an SS-composition are ranked by their percentile scores. The topologies that belong to the top quartile (percentile score > 75) are grouped into the "Prevalent (P)" topology group. The remaining topologies are grouped into the "Non-prevalent (NP)" topology group.
We evaluate the statistical significance of the difference between the P and NP groups using the Wilcoxon rank-sum test. If the difference is found to be statistically significant (P-value < 0.01), we consider the subset of topologies as "prevalent" (or most frequent).
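The percentile-based grouping described above can be sketched with the Python standard library. This is a simplified illustration, not the ProLegoDB procedure: it assumes that the cut-off is the 75th percentile of the per-topology occurrence counts within one SS-composition, computed with linear interpolation, and it leaves out the Chi-Square and Wilcoxon rank-sum tests. The function name and the example counts are hypothetical.

```python
from statistics import quantiles


def split_by_prevalence(topology_counts):
    """Split topologies into prevalent (P) and non-prevalent (NP).

    topology_counts: dict mapping contact string -> number of chains.
    A topology is 'P' when its count lies above the 75th percentile
    of all counts in the SS-composition (assumed cut-off rule).
    """
    counts = list(topology_counts.values())
    # quantiles(..., n=4) returns the three quartile cut points
    q3 = quantiles(counts, n=4, method="inclusive")[2]
    prevalent = {t for t, c in topology_counts.items() if c > q3}
    non_prevalent = set(topology_counts) - prevalent
    return prevalent, non_prevalent, q3


# Made-up occurrence counts for one SS-composition
counts = {
    "a.a-0": 40, "a.a-a": 25, "a.p-0": 10, "r.a-0": 5,
    "0.a-r": 3, "p.p-0": 2, "r.r-a": 1, "a.r-p": 1,
}

P, NP, cutoff = split_by_prevalence(counts)
print(cutoff)     # 13.75
print(sorted(P))  # ['a.a-0', 'a.a-a']
```

With these example counts, two of the eight topologies fall above the cut-off. A significance test between the two groups would follow as a separate step.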