HTS provides a means to quickly test a large number of chemical compounds against a biological target in order to determine potentially interesting compounds (hits) which can then be investigated further. Once the data has been collected, the researcher must go through the following stages:
Data processing & quality control
The researcher normalizes the data against the control values and analyzes the result to check for abnormal behaviors. A number of biases and outliers can exist in an HTS experiment. In our environment we typically check for assay plates with a low Z-factor [9], a measure of assay quality based on the controls. This phase is also supported by a tool we developed previously [10] for visually exploring the values in the plates. The researchers can directly filter out compounds with unreliable readouts or simply mark them for future analysis.
Hit selection
The goal of this phase is to identify the compounds that reacted in the experiment. Typically, the researcher organizes the results in a list sorted according to activity level and chooses a threshold value above which the compounds are considered active. These compounds are called hits.
Hit confirmation
After hit identification, the researcher re-tests the selected hits in a new and more focused screening (e.g. testing different concentrations of the compounds and calculating the half maximal inhibitory concentration (IC50)) in which confirmation of activity is sought. This phase is normally done manually given the small number of molecules to test.
Hit exploration
After hit confirmation the researcher explores the chemical space around the hits. Typically, he or she is in search of relationships between molecular structure and activity to isolate molecular fragments that induce activity (a process called structure-activity relationship (SAR) analysis).
Hit expansion
Similar to hit exploration, hit expansion focuses on exploring the space around the hits but the focus is on identifying alternative molecules that retain the desired properties and meet additional requirements (e.g. solubility).
Each one of these steps can involve data and computational resources. While in our environment we provide support for all these stages, HiTSEE provides only support for a subset of them, namely: selection, exploration, and expansion.
Data processing and tasks
In the following we provide additional details about the data and describe how it is processed before entering into the description of HiTSEE. Then, we describe how we gathered the requirements for the design of the tool and discuss our main motivation for focusing on a subset of the HTS tasks.
Data processing
HiTSEE KNIME integrates fully into the KNIME platform [3]. KNIME is a well-known data mining framework based on a workflow paradigm where data is processed by connecting data processing nodes one to another. It features an extensive and extensible library of nodes with a variety of purposes, e.g., querying, data mining algorithms, biochemical libraries. Before entering the HiTSEE KNIME nodes (more details in Section HiTSEE for KNIME) additional nodes preprocess the data within the platform:
1. Data Normalization - The system enables several kinds of normalization to be applied and take different plate formats into account. Typically at this stage the system normalizes the data, taking the values found in the control cells into consideration, and using the average value of the positive and negative controls.
2. Quality Control - At this stage the researcher can use a variety of tools we have developed to assess the quality of the experiment and to filter out or mark values with unusual behaviors. Many of the functions we have implemented rely on a plate view through which the user can observe the distribution of the activity levels across the plates.
3. Fingerprint Generation - The molecular structure of each compound (described in the SMILES format in the database) has to be translated into a format that allows structural similarity comparisons between the molecules. Thus, we transform the molecular descriptions into binary vectors called fingerprints.
Since this last fingerprint generation step is critical for the way HiTSEE arranges the molecules in its main view we provide additional details about it below.
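Before turning to fingerprints, the normalization and quality control steps listed above can be illustrated with a minimal sketch. It assumes a simple control-based rescaling (negative-control mean mapped to 0, positive-control mean mapped to 1) and the commonly used control-based form of the Z-factor; the function names, the use of the sample standard deviation, and the example readouts are illustrative assumptions, not the normalization variants actually offered by the system.

```python
from statistics import mean, stdev


def normalize_plate(values, pos_controls, neg_controls):
    """Rescale raw readouts using the control means (assumed scheme):
    the negative-control mean maps to 0, the positive-control mean to 1."""
    mu_pos = mean(pos_controls)
    mu_neg = mean(neg_controls)
    span = mu_pos - mu_neg
    if span == 0:
        raise ValueError("positive and negative controls have the same mean")
    return [(v - mu_neg) / span for v in values]


def z_factor(pos_controls, neg_controls):
    """Control-based Z-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|.
    Uses the sample standard deviation (an assumption of this sketch)."""
    separation = abs(mean(pos_controls) - mean(neg_controls))
    if separation == 0:
        raise ValueError("controls are not separated")
    return 1 - 3 * (stdev(pos_controls) + stdev(neg_controls)) / separation


# Hypothetical control readouts for one plate.
pos = [100, 102, 98, 100]
neg = [10, 12, 8, 10]

normalized = normalize_plate([10, 55, 100], pos, neg)
quality = z_factor(pos, neg)
print(normalized, round(quality, 3))
```

A plate whose Z-factor falls below a chosen cutoff (0.5 is a widely used rule of thumb) would be a candidate for filtering or marking in the quality control step.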
Fingerprint generation
Chemoinformatics applications use fingerprints (FPs) as a way to allow similarity searches and comparisons between molecules. The basic idea behind FPs is to describe each molecule with a numeric vector that captures relevant properties of molecules. While FPs can capture a variety of molecular features, structural fingerprints are above all the most popular [11]. Structural fingerprints are based on the concept of molecular fragments, that is, subsets of atoms and bonds found in the original sets, and describe each molecule in terms of the presence or absence of a molecular fragment. A fingerprint is thus a (normally very long) binary vector where each entry represents a fragment. The value is set to one if the fragment is contained in the molecule and to zero otherwise. Through such a binary representation it is possible to compare the structural similarity of molecules: two molecules with similar vectors contain similar molecular fragments.
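As a toy illustration of this presence/absence encoding, the sketch below builds a bit vector from a fixed fragment dictionary. The fragment strings, the dictionary, and the function name are hypothetical; in practice the fragments are derived from the molecular structure by a chemoinformatics toolkit, and real fingerprints are far longer.

```python
def fingerprint(molecule_fragments, fragment_index):
    """Return a binary vector with one entry per known fragment:
    1 if the molecule contains the fragment, 0 otherwise."""
    bits = [0] * len(fragment_index)
    for fragment in molecule_fragments:
        position = fragment_index.get(fragment)
        if position is not None:  # fragments outside the dictionary are ignored
            bits[position] = 1
    return bits


# Hypothetical fragment dictionary: fragment -> bit position.
fragment_index = {"C=O": 0, "c1ccccc1": 1, "N": 2, "O": 3, "Cl": 4}

# Hypothetical molecules, each given as the set of fragments it contains.
molecule_a = {"c1ccccc1", "O"}
molecule_b = {"c1ccccc1", "O", "Cl"}

fp_a = fingerprint(molecule_a, fragment_index)
fp_b = fingerprint(molecule_b, fragment_index)
print(fp_a)
print(fp_b)
```

The two example vectors differ in a single bit, which reflects that the two molecules share all but one of their fragments.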
For more information on fingerprints and related techniques in chemoinformatics Leach and Gillet [12] provide an excellent introduction to the aforementioned concepts. Bender and Glen [13] provide an overview on molecule similarity measures (with and without the use of fingerprints) which can partially be applied using HiTSEE in the KNIME environment. Additionally, thorough reviews on fingerprints and chemical similarity can be found in the following papers: [14–16].
Tasks
The requirements we have gathered to design HiTSEE are the result of a long-term collaboration between the Department of Information Science and the Konstanz Research School in Chemical Biology at the University of Konstanz. We organized regular meetings between the involved groups to become acquainted with the biochemical problems and to gather information about current practices and data analysis needs. HiTSEE is the latest in a series of prototypes developed over a year and a half of collaboration. We used the prototypes as a way to probe the design space, to better understand the domain problems, and eventually to isolate the tasks that needed real support in terms of visual analytics tools.
While we originally developed prototypes for a diverse number of tasks throughout the range of the HTS steps, e.g., data processing, quality control, and chemical libraries overviews, HiTSEE has been designed specifically to support hit selection, exploration and expansion. More precisely we provide support for two main visual analytics tasks:
1. Setting a threshold in hit selection. One of the challenges we encountered early on in the process was the definition of an activity threshold value in the hit selection process. From our observations and discussions with the domain experts we realized that the hits are normally selected through a fuzzy process. The researcher sorts the molecules according to their activity value and chooses a threshold going by eye, searching for a trade-off between the number of hits (to be kept low for later, more in-depth, testing) and the risk of missing important molecules. One need voiced by our collaborators was the possibility to gain, already at this stage, a better view on the selected hits in order to make the hit selection process more informed.
2. Exploring the neighborhood of confirmed hits. A second major need we spotted during our collaboration is the exploration of the neighborhood of one or more confirmed hits in the hit expansion phase. This stage starts when one or more molecules are declared to be active in a secondary screening. At this point, the researcher wants to explore the neighborhood to: (1) understand how small structural changes influence the chemical behavior with the selected target; (2) find a trade-off between the activity level expressed by the compounds and other chemical features of interest. In our specific case, for instance, the solubility of the compounds (measured in LogP values) is a critical element to isolate molecules of interest.
HiTSEE supports these two tasks in an integrated environment in which the user can project elements of interest in a scatter plot view, expand the projected items to include their neighbors, and perform several interactive operations that support flexible navigation and details on demand. In the following we describe HiTSEE in detail and explain how it supports the aforementioned tasks.
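The threshold trade-off described in the first task can be made concrete with a small sketch that counts how many compounds would be declared hits at several candidate thresholds. The compound identifiers, activity values, and candidate thresholds are made up for illustration.

```python
def hits_above(activities, threshold):
    """Return the compounds whose activity lies above the threshold,
    sorted from most to least active."""
    selected = [cid for cid, value in activities.items() if value > threshold]
    return sorted(selected, key=lambda cid: -activities[cid])


# Hypothetical normalized activity values.
activities = {"A": 0.95, "B": 0.80, "C": 0.62, "D": 0.40, "E": 0.10}

# A lower threshold yields more hits to re-test but misses fewer active compounds.
hit_counts = {t: len(hits_above(activities, t)) for t in (0.9, 0.6, 0.3)}
for threshold, count in hit_counts.items():
    print(f"threshold {threshold}: {count} hits")
```

Such a table only quantifies the number of hits; it does not show which structures are gained or lost at each threshold, which is the gap the visual approach is meant to address.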
HiTSEE
HiTSEE's interface is organized around three main views: the list+projection view (Figure 1 (left, middle)), the molecules detail view (Figure 1 (right)), and the substructure search view (Figure 2), which support exploration, in-depth investigation, and structural queries. The list+projection view lets the user select molecules of interest and project them in a scatter plot visualization to form clusters of (structurally) similar compounds. The view supports the investigation of relationships between activity levels, structural features, and other chemical properties. The molecules detail view and the substructure search view show the molecular structure of compounds selected in the projection view and trigger substructure searches.
In the following we describe each component in detail together with the interaction capabilities offered by each one.
List+Projection view
The List+Projection view consists of two interactive elements: a compound list and a linked scatter plot view (Figure 3 (left)). The compound list organizes the full set of compounds in the library in a list format sorted by activity level. Each item is represented by its molecular structure and by a bar with length proportional to its activity level.
The user can select one or more items in the list, project them in the scatter plot view, and expand the selection to a user-defined number of neighbors. The neighbors are the compounds that are structurally most similar to the current selection. The structural similarity is calculated from the fingerprint bit-vectors (see Fingerprint Section).
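The expansion step can be sketched as a nearest-neighbor query over the fingerprints. The sketch assumes that a candidate's similarity to the selection is its highest Tanimoto similarity to any selected compound; the text does not specify how similarity to a multi-compound selection is aggregated, so this choice, the function names, and the example data are assumptions.

```python
def tanimoto_similarity(a, b):
    """Tanimoto similarity of two equal-length bit vectors:
    shared on-bits divided by on-bits present in either vector."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def expand_selection(selected, fingerprints, k):
    """Return the ids of the k unselected compounds most similar to the selection."""
    if not selected:
        return []
    scores = {}
    for cid, fp in fingerprints.items():
        if cid in selected:
            continue
        # Assumed aggregation: best similarity to any selected compound.
        scores[cid] = max(tanimoto_similarity(fp, fingerprints[s]) for s in selected)
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ranked[:k]


# Hypothetical 4-bit fingerprints.
fingerprints = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 1, 0],
    "C": [0, 0, 1, 1],
    "D": [1, 0, 0, 0],
}

neighbors = expand_selection({"A"}, fingerprints, k=2)
print(neighbors)
```

Ties are broken by compound id so that the result is deterministic.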
The compounds are represented by circles and positioned in the view through a multidimensional scaling (MDS) projection such that compounds with similar structures occupy similar positions. Size represents the activity level and color is used to distinguish between those compounds included in the initial selection and those added by the expansion mechanism. Each circle also contains a small modified pie chart representing additional chemical properties of interest (in our case the LogP value). The pie chart is designed in a way to turn its fill color into a more prominent one (darker blue) when the value of interest goes beyond a predefined threshold.
The MDS projection takes a distance matrix of metric distance values as input. For each pair of compounds, we calculate the Tanimoto distance [17] between their fingerprint bit-vectors. Two problems emerge from MDS-based projections: overlapping items and fuzzy boundaries between the groupings. To cope with these two issues we implemented two additional features. First, we use an overlap removal mechanism that displaces items from their original positions if they overlap each other. Second, in order to facilitate the grouping of the items, we cluster the items and draw a "bubble" around them to reinforce the perception of grouping. The clustering algorithm takes the screen-space positions of the items as input and clusters them into bubble sets [18]. For each cluster, we determine the common substructure of all the compounds it contains and position it to the left of the cluster.
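The input to the projection can be sketched as follows. The sketch takes the Tanimoto distance to be one minus the Tanimoto similarity of the on-bits, which is the usual definition; the function names and the example fingerprints are illustrative, and the MDS step itself (performed by an external library in HiTSEE) is not reproduced here.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two bit vectors: 1 - |A and B| / |A or B|,
    where A and B are the sets of on-bit positions."""
    on_a = {i for i, bit in enumerate(a) if bit}
    on_b = {i for i, bit in enumerate(b) if bit}
    union = on_a | on_b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(on_a & on_b) / len(union)


def distance_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto distances with a zero diagonal."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = tanimoto_distance(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical 4-bit fingerprints for three compounds.
fps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]
dist = distance_matrix(fps)
for row in dist:
    print([round(value, 3) for value in row])
```

The resulting matrix is what an MDS routine would consume to place structurally similar compounds close to each other.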
In designing the projection view, we tried to optimize its visual effectiveness towards reading patterns with biological interest. In the following we provide a summary of the rationale behind our main design choices.
Since position is the visual variable that is perceived most accurately pre-attentively [19], we use it to convey molecular similarity (through the proximity data given by MDS), which is the most important piece of information in the data. Activity level is mapped to circle size (with a square root mapping to take into account the area effect) to allow for easy discrimination among the molecules. While visual variables like bar length allow for a more accurate comparison of values [19], we decided to use circles and their size because: (1) they cluster more naturally than shapes with other aspect ratios, (2) they are more robust w.r.t. the overlap removal mechanism, (3) they allow for easy discrimination between high vs. low activity molecules while keeping the visualization compact, (4) reading the activity values accurately is not the main purpose of the visualization (as long as major differences can be spotted). A third parameter (LogP) is encoded with the visual variable angle. To improve readability we visualize the angle as a filled pie chart with a single slice embedded in each circle. While a number of alternatives exist to encode two parameters, for instance stacked bars and nested circles (see Figure 3), we decided to use a modified pie chart because it fits well with the circular shape we adopted and its readability scales better than that of nested rings across items of different sizes.
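The encodings described above can be summarized in a minimal sketch. The square root mapping makes circle area, rather than radius, proportional to activity. The maximum radius, the value range used for the angle, the color names, and all function names are assumptions made for illustration.

```python
import math


def circle_radius(activity, max_activity, max_radius=20.0):
    """Square root mapping: the circle area grows linearly with activity."""
    if max_activity <= 0:
        raise ValueError("max_activity must be positive")
    fraction = min(max(activity / max_activity, 0.0), 1.0)
    return max_radius * math.sqrt(fraction)


def pie_angle(value, low, high):
    """Map a property value (e.g. LogP) in [low, high] to a slice angle in degrees."""
    if high <= low:
        raise ValueError("high must be greater than low")
    fraction = min(max((value - low) / (high - low), 0.0), 1.0)
    return 360.0 * fraction


def pie_color(value, threshold):
    """Use the more prominent fill once the value goes beyond the threshold."""
    return "dark blue" if value > threshold else "light blue"


# Hypothetical compound: activity 25 on a 0..100 scale, LogP 2.5 on a 0..5 scale.
radius = circle_radius(25, 100)
angle = pie_angle(2.5, 0.0, 5.0)
color = pie_color(2.5, threshold=3.0)
print(radius, angle, color)
```

With this mapping, a compound with a quarter of the maximum activity is drawn with half the maximum radius and therefore a quarter of the maximum area.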
Molecules detail view and substructure search view
From the projection view the user can select a group of interesting compounds to be investigated in detail. Figure 1 (right) shows the detail view with its core features.
The selected set of compounds is visualized as an ordered list of high-resolution molecule renderings. We map the chemical features activity and LogP to small bar charts to the right; the pKa values are rendered directly into the molecule.
During the investigation of the molecules we permit the user to start a search on a particular pattern by selecting a molecular fragment and issuing a query for retrieving all the compounds containing the selected fragment. We support this function by providing the substructure search view, which opens when the user double-clicks on a molecule in the detail view.
The substructure search view (Figure 2) is based on the JChem Marvin Sketch applet (see Section Implementation Details), which provides a common interactive method for selecting substructures. The user starts a search on the selected substructure, the search results are highlighted as selections in the List+Projection view, and the user can project them in the projection view for investigation.
HiTSEE for KNIME
Integration of HiTSEE into the KNIME platform is achieved through a series of processing nodes that realize the functions developed in HiTSEE, together with additional helper nodes that make it possible to build a fully functional pipeline. Figure 4 shows an example of the workflow. The data can be read and pre-processed using the large set of available nodes in the KNIME platform (many specifically built for processing biological data). In the example, the first set of nodes loads the data into the system, preprocesses it to calculate derived information such as LogP and fingerprints, and applies some normalization functions. The nodes between "Loop Start" and "Loop End" represent the core functionality offered by the HiTSEE KNIME integration and reflect the main behavior of HiTSEE as a continuous iteration loop that focuses on a specific set of compounds.
Each iteration starts with the selection of a molecule subset, either directly by the user or indirectly through a user-defined search request. The first set of nodes (b) is responsible for the generation of the expanded set (for a project-and-expand operation) and for the search of the target molecules of a substructure search. The second set of nodes (c) is responsible for projection and clustering. More specifically, this set of nodes implements the following functions: (1) calculate the distances between the molecules in the selected (and expanded) subset; (2) project the distances; (3) cluster the projected positions; (4) find a most common substructure as cluster label. Finally, the last node (d) is the one that implements the visualization, the interactive functions, and the loop control.
The "Project" and "Project and Expand" buttons trigger the HiTSEE loop and execute one iteration of the steps described above under the parameters given by the user. The main view is split into the previously described list view, projection view, and detail view. The example in Figure 1 shows a selection made by the user (highlighted in gray). When the "Project and Expand" button is clicked, the following parameters are set for the next loop cycle: the three selected molecules become the new subset of interest, and an expansion of the subset by 15 additional compounds is requested. The loop is started with these parameters for one cycle and the main view changes accordingly.
Flexibility
Integration into the KNIME platform allows greater flexibility in changing the behavior of HiTSEE. The major advantage rests in the fact that the processing workflow can be easily changed. In the following, we provide a few examples of changes that could be applied by simply replacing or adding nodes in the workflow:
- The distance measure between molecules can be changed. The Tanimoto metric originally used to calculate the similarity between the molecules can be replaced by alternative metrics, and the fingerprints by alternative fingerprints.
- The activity value can be transformed, e.g., to reduce highly dynamic behaviors. This is exemplified in Figure 4 by applying a square root scaling to the activity value (see "SQRT normalize activity" node before the loop start).
- Data preprocessing can be modified and allows richer automation. We applied an algorithm from Meinl et al. [20] to initially select molecules of high structural diversity with a high activity value (see "Meinl Selection" in Figure 4).
Many more functions can be integrated according to the specific needs of the analysis and the provision of nodes from the KNIME platform. By the integration and deployment of HiTSEE KNIME we reach a higher degree of generalization and enable our approach to be adopted for a wider range of biochemical challenges.
Implementation details
The KNIME version of HiTSEE is programmed in Java. For rendering molecules, finding common substructures, and making the interactive selection we use the KNIME integration of the JChem library (version 5.4, 2011; ChemAxon, http://www.chemaxon.com, and infocom, http://infocom-science.jp/product/detail/jchemextensions_english.html). For the implementation of visual components we use the Processing framework v1.0 (http://processing.org/) and giCentreUtils v3.1.0 (http://www.gicentre.org/utils/). MDS projection is performed by the Java Library for Multidimensional Scaling v0.2 (http://www.inf.uni-konstanz.de/algo/software/mdsj/). The cluster shapes are generated with BubbleSets (https://github.com/JosuaKrause/Bubble-Sets).