COZOID overview
Our newly proposed system enables the efficient visual exploration of a large number of PPI complexes. For a better understanding, we introduce the following notation. A protein P consists of a set of amino acids forming a polypeptide chain. A complex C is represented by a set of mutually interacting proteins. In our case, we focus primarily on the interactions between two protein structures P1 and P2, which form a complex C(P1,P2). The mutual spatial orientation of the interacting proteins in the complex forms a configuration. The i-th configuration of complex C(P1,P2), denoted as CONF_i(C(P1,P2)), represents one of the possible mutual orientations of this complex. Generally, there can be n (1≤i≤n) possible configurations for a given complex, and the task is to select the configuration that is the most relevant one from a proteomics point of view. The decision is based on various pieces of knowledge about the geometric arrangement of the configuration as well as other aspects, such as knowledge of the contacts between the amino acids present in the contact zone of the given configuration. Therefore, the selection of the most relevant configurations cannot be completed automatically and requires insights from the proteomic expert. This represents a typical domain-related problem, which has to be supported by specifically designed visualizations.
The visualization methods proposed in this paper allow the user to visually explore a set of possible configurations detected by one of the existing computational tools and to select the most proteomically relevant ones. The users have to iteratively filter out those configurations that do not fulfill the given specific criteria. The proteomic expert workflow, along with our proposed visual support of its individual stages, is depicted in Fig. 3. The input datasets, consisting of dozens of configurations between two interacting proteins, were computed using the HADDOCK [11] and PyDock [12] tools. However, any of the existing tools for protein-protein docking can serve as a source of input data for our system.
The proposed visualizations are based on the precondition that the users already have initial knowledge about the interacting proteins. Thus, the experts are able to define a pair of amino acids that are expected to interact. This is not restrictive, as computational tools also require this information to produce a meaningful set of configurations. In other words, we are using similar input information as the computational tools. The second possibility is that the users do not have this information but are aware of an already explored protein complex with a similar structure that can serve as a reference (primary) complex for further comparison and exploration. In this case, the computational tools usually produce even more configurations, but most of them are irrelevant and have to be filtered out. Our tool can utilize the information about the interactions in the primary complex and enhance the filtering process.
Our methods have been designed specifically to help proteomic experts answer the following questions:
- Q1: Which configurations contain a selected interacting pair of amino acids (and what is the frequency of the occurrence of this pair in all configurations)?
- Q2: Which pairs of amino acids are present in a given configuration?
- Q3: How close are the amino acids in the contact zone and which are the closest ones?
- Q4: How similar and different are the contact zones in the configurations?
- Q5: What are the physico-chemical properties of the amino acids in the contact zone?
- Q6: What are the differences between the sets of amino acids in the contact zones of different configurations?
Answering these questions helps the proteomic experts to better understand the interactions in the protein-protein complexes and to evaluate the correctness of the given configurations. The proposed visualizations enable one to find the answers by interactively exploring the configurations, which is also demonstrated in the supplementary video (see Additional file 1). In the following sections, we introduce our proposed views in detail.
Matrix view
When using a computational tool to generate possible configurations, the size n of the resulting set S = {CONF_i(C(P1,P2)); 1≤i≤n} can be very large, ranging from dozens to hundreds. This amount is impossible to explore manually; thus, some preliminary filtering is crucial. The filtering stage is designed to answer question Q1. We propose a matrix-based visualization inspired by commonly used heat maps (Fig. 4a). The rows and columns in the Matrix view correspond to the interacting proteins P1 and P2, respectively. Each row or column represents one amino acid present in a contact zone in some of the configurations CONF_i(C(P1,P2)). The rows and columns are formed only by those amino acids from the interacting proteins that are in contact in at least one configuration. The contact between the amino acids is based on their Euclidean distance. Two amino acids are considered to be in contact if their distance is between 3 and 5 Å. This range can be interactively changed by the user. The color of each cell in the matrix corresponds to the number of occurrences of the corresponding interacting amino acids in the set S of all configurations. The colored lists of amino acids can be interpreted as histograms, encoding the number of their occurrences. The intense red color represents the pairs of amino acids that are interacting in most of the configurations. The Matrix view serves directly for filtering out improbable solutions using the interactive user-driven selection of cells. The selection is performed by clicking on individual cells. Moreover, the matrix allows the expert to select a combination of several pairs of amino acids. This is useful if the user wants to further explore only those configurations that contain specific interactions, such as between the amino acid pair A, B and simultaneously the pair C, D.
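The counting and filtering behind the matrix can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the residue identifiers, and the simplification that every amino acid is represented by a single 3D coordinate (the text does not specify how the distance between two amino acids is measured) are assumptions made for illustration only.

```python
import math
from collections import Counter

def contact_pairs(p1, p2, lo=3.0, hi=5.0):
    """Return the set of (residue of P1, residue of P2) pairs in contact.

    p1 and p2 map a residue id to one representative (x, y, z) coordinate
    in angstroms (ASSUMPTION: one point per amino acid). A pair is in
    contact if its Euclidean distance lies in the adjustable range [lo, hi].
    """
    return {
        (r1, r2)
        for r1, xyz1 in p1.items()
        for r2, xyz2 in p2.items()
        if lo <= math.dist(xyz1, xyz2) <= hi
    }

def contact_matrix(configurations, lo=3.0, hi=5.0):
    """Count, for every residue pair, the number of configurations containing it.

    configurations is a list of (p1, p2) tuples, one per configuration.
    Returns (counts, per_conf): the matrix cell values and the contact
    pairs of each individual configuration.
    """
    counts = Counter()
    per_conf = []
    for p1, p2 in configurations:
        pairs = contact_pairs(p1, p2, lo, hi)
        per_conf.append(pairs)
        counts.update(pairs)  # each configuration increments its cells by one
    return counts, per_conf

def filter_configurations(per_conf, required_pairs):
    """Indices of configurations that contain ALL selected pairs (cells)."""
    required = set(required_pairs)
    return [i for i, pairs in enumerate(per_conf) if required <= pairs]

# Hypothetical toy data: one reference protein, two placements of its partner.
p1 = {"A10": (0.0, 0.0, 0.0), "A11": (10.0, 0.0, 0.0)}
conf0 = (p1, {"B5": (4.0, 0.0, 0.0), "B6": (10.0, 4.0, 0.0)})
conf1 = (p1, {"B5": (0.0, 3.5, 0.0), "B6": (30.0, 0.0, 0.0)})

counts, per_conf = contact_matrix([conf0, conf1])
both = filter_configurations(per_conf, [("A10", "B5"), ("A11", "B6")])
```

In this toy input, the cell (A10, B5) receives the count 2 and the cell (A11, B6) the count 1, and selecting both cells keeps only the first configuration. Since the matrix stores one counter per cell, its size depends on the number of contacting amino acids rather than on the number of configurations.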
The big advantage of the Matrix view is its independence from the size of the input set of possible configurations. The number of rows and columns is limited by the size of the interacting proteins, meaning that in the worst case, it corresponds to the total number of amino acids in these proteins. However, in most cases, the number of amino acids in the contact zones is much smaller than the total number of amino acids. Each configuration of the input dataset then increases the counters in the respective matrix cells. In the case of many interacting amino acids, the cells in the matrix can become too small. In these situations, the users can employ the table lens technique introduced by Rao and Card [13], which can be applied to both rows and columns in the matrix (Fig. 4a).
To provide the users with more detailed information about individual configurations, the Matrix view contains an additional side view, which is positioned directly next to the matrix (Fig. 4b). The user can select a primary configuration to which all the remaining configurations are compared. An example of a primary configuration can be a crystal structure downloaded from the PDB database. We propose the following ranking score, which indicates the similarity between the contact zone of a given configuration and the primary configuration. One of the interacting proteins, e.g., P1, is selected as a reference protein, while the second protein, e.g., P2, is marked as the paired protein. The score is computed in the following way.
- For each match of an amino acid in the contact zones from the reference proteins of the compared and the primary configuration, the similarity score is increased by one.
- For each matching interaction pair in the contact zones from the compared and the primary configuration, the similarity score is increased by four.
- For each missing interaction pair in the contact zones from the compared and the primary configuration, the similarity score is decreased by one.
This score was determined experimentally while designing and testing the view (see Results chapter). The central part of the side view consists of a scrollable list of individual configurations from a subset of S that was filtered with the Matrix view. The configurations are ordered according to their similarity scores, from the most similar to the least similar ones. The primary configuration is always displayed as the first one on the top of the list.
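The three scoring rules and the resulting ordering can be sketched in Python as follows. This is an illustrative reading, not the authors' code: the function names and residue identifiers are hypothetical, and a "missing interaction pair" is interpreted here as a pair that is present in the primary configuration but absent from the compared one, which the text does not state explicitly.

```python
def similarity_score(primary_pairs, compared_pairs):
    """Similarity of a compared configuration to the primary configuration.

    Both arguments are collections of (reference residue, paired residue)
    interaction pairs taken from the respective contact zones.
    """
    primary = set(primary_pairs)
    compared = set(compared_pairs)

    # Rule 1: +1 for every reference-protein amino acid found in both contact zones.
    ref_primary = {ref for ref, _ in primary}
    ref_compared = {ref for ref, _ in compared}
    score = len(ref_primary & ref_compared)

    # Rule 2: +4 for every interaction pair found in both configurations.
    score += 4 * len(primary & compared)

    # Rule 3: -1 for every missing pair (ASSUMPTION: in primary, not in compared).
    score -= len(primary - compared)
    return score

def rank_configurations(primary_pairs, others):
    """Order configuration names from the most to the least similar.

    others maps a configuration name to its interaction pairs.
    """
    return sorted(
        others,
        key=lambda name: similarity_score(primary_pairs, others[name]),
        reverse=True,
    )

# Hypothetical toy data.
primary = [("A10", "B5"), ("A11", "B6"), ("A12", "B7")]
others = {
    "conf_b": [("A10", "B9")],
    "conf_a": [("A10", "B5"), ("A11", "B6")],
}
score_a = similarity_score(primary, others["conf_a"])  # 2 + 8 - 1
score_b = similarity_score(primary, others["conf_b"])  # 1 + 0 - 3
order = rank_configurations(primary, others)
```

With these inputs, conf_a scores 9 and conf_b scores -2, so conf_a is listed directly after the primary configuration.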
The side view helps to answer questions Q2 and Q3, as it enables an iterative search through the list of configurations and the exploration of all pairs of interacting amino acids for each configuration. The user can select a configuration to focus on by clicking on it. By default, each configuration in focus contains one polyline connecting two amino acids from the contact zone that are the closest among all the possible pairs (Fig. 4b). The user can hover the mouse over the lists of amino acids on the left and right side and inspect the corresponding connection lines for a given amino acid. By clicking on the rectangle representing a given amino acid, the connection lines remain in the view. The pairs of amino acids that form the configuration in focus can be highlighted in the matrix (with green border rectangles in Fig. 4a). From the color of the matrix cells, the user can immediately estimate the number of configurations in which these pairs are present. Vice versa, by interacting with the matrix and selecting the given rectangles, the side view is automatically filtered to show only those configurations that satisfy the filtering condition.
The Matrix view serves as the first filtration tool for selecting only those configurations that contain a desired combination of interacting amino acids. This filtering cannot be automated because the frequency of a given pair in configurations does not correlate with the importance of these configurations. The most frequent pair of interacting amino acids can be of the same interest as a pair interacting only in one configuration. Therefore, insights from the proteomic expert in combination with the interaction possibilities from the Matrix view have proven to be a very efficient and powerful solution. Selected configurations can be further processed by the following visualization methods.
Exploded view
The proteomics experts are already familiar with the manipulation of molecules in a three-dimensional (3D) environment; thus, a 3D representation has to be an integral part of the workflow. Moreover, the 3D space helps to find answers for questions Q3-Q5, which are related to the appearance of the contact zones of selected configurations and the properties of interacting amino acids (expressed by different coloring schemes). Exploring and comparing many structures in 3D at once suffers from problems such as high overlap, occlusion, and visual clutter (Fig. 5b), so the traditionally used spatial representations are not sufficient. To overcome these limitations, we adapted an exploded-view technique to enlarge the distance between the interacting proteins. Figure 5c shows the comparison of three configurations using our proposed Exploded view.
The main principle of the Exploded view is the following. First, all the reference proteins taken from the configurations selected in the Matrix view are aligned using the Combinatorial Extension (CE) structural-alignment algorithm [14] so that their 3D spatial representations overlap (Fig. 5). Here, it is important to understand that the reference protein shown in Fig. 5b (the brown one) actually represents three overlapping aligned reference proteins, each coming from one configuration. The set of paired proteins interacting with the reference proteins is positioned around the aligned reference proteins with an enlarged distance.
To ensure that the paired proteins in the Exploded view will not collide with each other, we arrange the paired proteins into a parabolic regular grid. For each reference protein and its paired protein, the Exploded view retains the information about their interaction. If several configurations are exploded at once, the Exploded view contains many paired proteins arranged around the aligned reference proteins. As the change in the position of the exploded proteins can cause disorientation in the scene, the pairing information between the corresponding reference proteins (aligned) and paired proteins (“exploded”) is initially indicated as a partially transparent tube that connects the centers of their contact zones. The radius of the tube is modulated (it is smaller in the middle of the tube to reduce the visual clutter). Once the user understands the overview of the protein spatial arrangement, the tube can be switched off. The pairing information is also encoded by color (a different color is used for each configuration). If the contact zones contain colliding amino acids (i.e., their mutual distance is less than 3 Å), the residues are indicated by a red color.
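The exact construction of the parabolic grid and of the tube profile is not given in the text, so the following Python sketch shows only one possible interpretation. The regular x/y grid bent onto a paraboloid, the quadratic radius profile, and all names and parameter values (spacing, offset, curvature, r_end, r_mid) are assumptions chosen for illustration.

```python
import math

def parabolic_grid(n, spacing=60.0, offset=80.0, curvature=0.005):
    """Return n anchor positions (x, y, z) for the exploded paired proteins.

    ASSUMPTION: the aligned reference proteins sit at the origin and the
    paired proteins are placed on a regular x/y grid whose z coordinate
    follows a paraboloid, so that the grid bends around the reference.
    Neighbouring anchors differ by at least `spacing` in x or y, which
    keeps proteins smaller than `spacing` from colliding.
    """
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    points = []
    for k in range(n):
        row, col = divmod(k, cols)
        x = (col - (cols - 1) / 2) * spacing
        y = (row - (rows - 1) / 2) * spacing
        z = offset - curvature * (x * x + y * y)
        points.append((x, y, z))
    return points

def tube_radius(t, r_end=2.0, r_mid=0.5):
    """Radius of the connection tube at parameter t in [0, 1].

    ASSUMPTION: a quadratic profile that equals r_end at both contact
    zones (t = 0 and t = 1) and shrinks to r_mid in the middle (t = 0.5).
    """
    return r_mid + (r_end - r_mid) * (2.0 * t - 1.0) ** 2

anchors = parabolic_grid(3)
radii = [tube_radius(t) for t in (0.0, 0.5, 1.0)]
```

For three configurations, this yields three anchors that are at least 60 Å apart, and a tube that is four times thinner in its middle than at its ends.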
Figure 5 depicts a set of three configurations before (a, b) and after (c) applying the Exploded view. The Exploded view removes the problem of overlapping paired proteins. It also helps to see the shape and position of the contact zones. However, this solution does not solve the problem where the contact zones face each other, meaning that the user has to adjust the camera to observe the contact zones of the reference and paired proteins from a perpendicular viewing direction. This manipulation does not enable the user to see both contact zones simultaneously. This problem is solved by the proposed Open-Book view, which is presented in the following section.
Open-Book view
The Exploded view does not allow one to observe both parts of a given contact zone simultaneously. The proposed Open-Book view is designed to specifically answer questions similar to Q5, which addresses a detailed exploration of one selected contact zone in the complex C(P1,P2). This involves the presentation of the information about different properties of individual amino acids forming the contact zone and their pairing.
The Open-Book view is activated if the user selects one of the configurations from the Exploded view. The selection is performed by clicking on the connection tube from the desired configuration CONF_i(C(P1,P2)) in the Exploded view. The other configurations are automatically hidden, the selected configuration returns to its initial position (before applying the Exploded view), and an animated transition for the opening of CONF_i(C(P1,P2)) is launched. When animating the opening, the reference and paired proteins are rotated and translated so that they are positioned next to each other and the contact zones are facing towards the observer (see Fig. 6).
The algorithm performing the opening computes the vectors defining the orientation of the contact zones (their normal vectors). From the normal vectors and the camera position, we compute the rotation angle, which is then applied to the reference and paired protein. To maintain the information about the amino acid pairings, the user can also visualize individual connections between these pairs through simple lines.
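One way to obtain such a rotation from a contact-zone normal and the camera position is sketched below in Python. It is a generic axis-angle construction using Rodrigues' rotation formula, not necessarily the computation used in the tool; the function names and the example coordinates are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)

def opening_rotation(normal, zone_center, camera_pos):
    """Axis and angle that turn a contact-zone normal towards the camera."""
    n = normalize(normal)
    to_camera = normalize(tuple(c - z for c, z in zip(camera_pos, zone_center)))
    angle = math.acos(max(-1.0, min(1.0, dot(n, to_camera))))
    axis = cross(n, to_camera)
    if dot(axis, axis) < 1e-18:
        # Normal is parallel or antiparallel to the view direction:
        # any axis perpendicular to the normal works.
        axis = cross(n, (1.0, 0.0, 0.0))
        if dot(axis, axis) < 1e-18:
            axis = cross(n, (0.0, 1.0, 0.0))
    return normalize(axis), angle

def rotate(v, axis, angle):
    """Rotate vector v around a unit axis by angle (Rodrigues' formula)."""
    c, s = math.cos(angle), math.sin(angle)
    kxv = cross(axis, v)
    kdv = dot(axis, v)
    return tuple(v[i] * c + kxv[i] * s + axis[i] * kdv * (1.0 - c) for i in range(3))

# Hypothetical example: the two contact zones face each other along the x axis
# and the camera looks at the complex from the positive z axis.
camera = (0.0, 0.0, 10.0)
center = (0.0, 0.0, 0.0)
ref_normal, paired_normal = (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)

ref_axis, ref_angle = opening_rotation(ref_normal, center, camera)
paired_axis, paired_angle = opening_rotation(paired_normal, center, camera)
ref_opened = rotate(ref_normal, ref_axis, ref_angle)
paired_opened = rotate(paired_normal, paired_axis, paired_angle)
```

In this example, both proteins are rotated by 90 degrees around opposite axes, like the two halves of a book being opened, and both normals end up pointing at the camera. An animated transition could interpolate the angle from zero to its final value.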
The contact zones represented by their surfaces can be color-coded according to multiple criteria. The color can encode the distance between the amino acids or represent different physico-chemical properties of the amino acids or their atoms, such as hydrophobicity or partial charges. The coloring scheme used in the Matrix view represents the so-called conservation of the amino acids in all configurations. It can also be used to color the contact zone. The surfaces can be augmented with labels to inform the users about the type and identifier of individual amino acids.
In both the Exploded view and the Open-Book view, a protein can also be represented by other traditionally used visualization styles, such as cartoon, spheres, balls&sticks, or sticks. Moreover, these methods can be combined. For example, the proteins can be represented by the cartoon style and the amino acids in the contact zones can be visualized using the sticks representation to see their spatial orientation.
If the task is to compare individual configurations with respect to the pairs of interacting amino acids, a further drill-down is necessary. Therefore, in the next section, we propose another abstract view supporting mainly the comparison of paired amino acids in individual contact zones from selected configurations.
Contact-Zone list-view
The Contact-Zone list-view helps to answer questions related to the comparison of the contact zones at the level of the individual amino acids, such as in Q6. The list for one configuration consists of two sets of amino acids in the contact zones, each set coming from one interacting protein (see Fig. 7). The left part of the view contains all amino acids coming by default from the reference protein, while the right part is formed by their interaction counterparts in the paired protein. However, the order of proteins in the list-view can be changed. The order depends on the current task, i.e., if we want to compare the constitution of contact zones from the reference or the paired protein in the given configurations. The view contains all possible connections (with respect to the distance) between the amino acids from both contact zones. To avoid the intersection of lines representing the connections, some amino acids on the right side are repeated – one instance for each reference protein amino acid within a user-defined distance. This solution was adopted because without these repetitions, there would be many line intersections, which substantially decreases the readability of the representation (see Fig. 2b).
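The repetition of amino acids on the right side can be sketched in Python as follows. The function name, the alphabetical ordering of the partners, and the residue identifiers are assumptions for illustration; the sketch only shows why emitting one right-side instance per connection keeps the lines from crossing.

```python
def build_list_view(left_order, pairs):
    """Lay out one Contact-Zone list for a single configuration.

    left_order: reference-protein residue ids in their display order.
    pairs: (reference residue, paired residue) connections within the
           user-defined distance.
    Returns (right_rows, links), where right_rows may repeat a paired
    residue and links are (left row index, right row index) tuples.
    """
    partners = {}
    for ref, paired in pairs:
        partners.setdefault(ref, []).append(paired)

    right_rows, links = [], []
    for left_index, ref in enumerate(left_order):
        # One right-side instance per connection of this reference residue.
        # ASSUMPTION: partners of one residue are listed alphabetically.
        for paired in sorted(partners.get(ref, [])):
            links.append((left_index, len(right_rows)))
            right_rows.append(paired)
    return right_rows, links

def has_crossing(links):
    """Two lines cross if their left and right row orders disagree."""
    return any(
        (a[0] - b[0]) * (a[1] - b[1]) < 0
        for i, a in enumerate(links)
        for b in links[i + 1:]
    )

# Hypothetical toy data: B5 interacts with both A10 and A11.
left = ["A10", "A11"]
pairs = [("A10", "B6"), ("A10", "B5"), ("A11", "B5")]
right_rows, links = build_list_view(left, pairs)
```

Here, B5 appears twice on the right side (rows 0 and 2). Because the right rows are generated in the order of the left rows, the row indices of all links increase together and no two lines intersect, at the cost of a longer right column.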
For each configuration, one list-view is created and all the list-views are juxtaposed so the user can see and visually compare the constitution of the contact zones from all selected configurations. The user can modify this representation by changing the color, which can encode different properties for the amino acids mapped onto their corresponding rectangles. The properties are the same as those mapped onto the surface of the contact zone in the Exploded and Open-Book views. The left part of the list can then be sorted according to these properties (see Fig. 8). Moreover, by clicking on individual rectangles representing the amino acids, the corresponding amino acids are selected in the 3D view as well.
The principal steps for building the Contact-Zone list-view are the following. For all configurations that should be visualized in the Contact-Zone list-view, we find the interacting pairs of amino acids in their contact zones.
Then, the list of amino acids present in all reference proteins from the selected configurations is created. Now, for each configuration, we take the interacting amino acids from the paired proteins, sort them according to a selected criterion (e.g., hydrophobicity), and add them to the Contact-Zone list-view. The amino acids in the left part of the Contact-Zone list-view are always sorted in the same way for all depicted configurations. Similar to the Matrix view, the user can select a primary configuration to which all the remaining configurations are compared (see Fig. 7b) using the proposed ranking score algorithm, which is described in the “Matrix view” section. The Contact-Zone list-view plots the configurations ordered from left to right by the similarity score, from the most similar to the least similar. The Contact-Zone list-view of the primary configuration is always displayed as the first one from the left side of the view.
The user can select between two visualization modes: the compare and the compact list-view. In compare mode, the amino acids in the contact zone in the primary configuration that are not present in the contact zone from any other configuration are depicted as white rectangles with labels giving the names of the missing amino acids (see Fig. 7b). The compact mode omits these missing amino acids to save space. In both modes, the matches with amino acids in the primary configuration are highlighted with red bordered rectangles and connecting lines. This way, the user can immediately see which amino acids are present in both the primary configuration and the other configurations and which amino acids are missing. To guide the visual comparison, we also introduced interactive highlighting and, if necessary, zooming to corresponding amino acids in different configurations.
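The difference between the two modes can be illustrated with a short Python sketch that builds the left column of one compared configuration. The function name, the row labels ("match", "missing", "extra"), and the handling of amino acids that occur only in the compared configuration are assumptions and are not taken from the tool itself.

```python
def list_view_rows(primary_residues, config_residues, mode="compare"):
    """Rows of the left column for one configuration compared to the primary.

    Each row is (residue id, state), where state is
      "match"   - present in both (drawn with a red border),
      "missing" - only in the primary (white labelled rectangle, compare mode only),
      "extra"   - only in the compared configuration (ASSUMPTION: appended last).
    """
    if mode not in ("compare", "compact"):
        raise ValueError("mode must be 'compare' or 'compact'")

    present = set(config_residues)
    primary = set(primary_residues)
    rows = []
    for residue in primary_residues:
        if residue in present:
            rows.append((residue, "match"))
        elif mode == "compare":
            rows.append((residue, "missing"))
        # compact mode: missing residues are omitted to save space
    rows.extend((r, "extra") for r in config_residues if r not in primary)
    return rows

# Hypothetical toy data.
primary_zone = ["A10", "A11", "A12"]
other_zone = ["A10", "A12", "A15"]
compare_rows = list_view_rows(primary_zone, other_zone, mode="compare")
compact_rows = list_view_rows(primary_zone, other_zone, mode="compact")
```

In compare mode, A11 is kept as a "missing" placeholder, so all juxtaposed lists stay row-aligned with the primary configuration; in compact mode, the same list is one row shorter.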