KNIME-CDK has been developed in Java 1.6 and is available via the KNIME update mechanism. The plug-in, including its sources, is available as a release (stable) build and a nightly (pre-release) build under the GNU LGPL v3. It has been tested on KNIME Desktop versions 2.6 and 2.7 (the latter uses Java 1.7) with 2 GB of memory and otherwise default settings, using the ChEBI compound library [12]. Over the last year, the plug-in and its underlying core library have been updated, reducing memory requirements and improving overall performance. The KNIME-CDK community site and forum [13] provide an overview of the implemented functionality and user support, respectively.
Following KNIME’s data model, the individual CDK molecule representations are stored in their own data cell type, the atomic unit for tabular data transfer from one node to another. A node can be considered a single worker carrying out a single function. Here, node names are written in italics. Data persistence is guaranteed via the Chemical Markup Language (CML) [14], which is used to serialize the molecule when necessary. The underlying CDK molecules are handled and stored within data cells in standardized form, i.e., with implicit hydrogen atoms added, atom types perceived, and aromaticity detected. This guarantees consistency across all nodes and makes the plug-in easier to use by hiding technical details from the user, allowing the scientist to focus on the task at hand.
The plug-in accepts molecules in CML, SDFile, MDL Mol, InChI, and SMILES formats [15] via the Molecule to CDK node and writes SDFiles via the CDK to Molecule node, thereby converting the CDK molecule back to the default SDFile cell, which can be used with other cheminformatics plug-ins. In addition, the implemented SDFile interface ensures that all nodes accepting SDFile cells can be connected directly to KNIME-CDK nodes.
All subsequent operations are carried out on the internal CDK molecule representation and include, inter alia, generation of coordinates, atom signatures of various heights, common fingerprints, e.g., MACCS and PubChem, two- and three-dimensional molecular descriptor values including XLogP and Lipinski’s Rule of Five, chemical name lookup via OPSIN [16], and substructure search (Figure 1a). Different routes in a workflow can run in parallel, and nodes always run multi-threaded. In Figure 1b, a chemical library is filtered for molecules containing a phenol group, followed by successive hydrogen bond acceptor and donor counts, while in parallel branches it is used for MACCS fingerprint and atom signature generation. The out-port view, i.e., the resulting data table, is shown for the Atom Signatures node. Further use cases of workflows using the KNIME-CDK plug-in include the management and analysis of chemical libraries through molecular descriptors, conformer analysis via RMSD, and NMR spectra prediction. Example workflows for these tasks can be found in the repository [17] of the myExperiment virtual research environment [18].
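To make the Rule of Five descriptor concrete, the following minimal Java sketch counts violations of the commonly cited Lipinski thresholds (molecular weight above 500 Da, logP above 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors). It is an illustration only and not the plug-in's or the CDK's implementation: the class name `RuleOfFiveSketch` and its methods are hypothetical, the descriptor values are assumed to have been computed upstream (e.g., by an XLogP and a hydrogen bond count node), and the tolerance of one violation is a common convention rather than something stated in this paper.

```java
/**
 * Illustrative sketch only: counts Rule of Five violations from descriptor
 * values computed elsewhere. The class and method names are hypothetical
 * and are not part of KNIME-CDK or the CDK.
 */
final class RuleOfFiveSketch {

    private RuleOfFiveSketch() {
        // static helpers only
    }

    /** Returns the number of violated criteria, from 0 to 4. */
    static int countViolations(double molecularWeight, double logP,
                               int hBondDonors, int hBondAcceptors) {
        int violations = 0;
        if (molecularWeight > 500.0) {
            violations++; // molecular weight above 500 Da
        }
        if (logP > 5.0) {
            violations++; // octanol-water partition coefficient above 5
        }
        if (hBondDonors > 5) {
            violations++; // more than 5 hydrogen bond donors
        }
        if (hBondAcceptors > 10) {
            violations++; // more than 10 hydrogen bond acceptors
        }
        return violations;
    }

    /** Assumed convention: a molecule passes with at most one violation. */
    static boolean passes(double molecularWeight, double logP,
                          int hBondDonors, int hBondAcceptors) {
        return countViolations(molecularWeight, logP,
                               hBondDonors, hBondAcceptors) <= 1;
    }

    public static void main(String[] args) {
        // Approximate, illustrative descriptor values for a small molecule.
        int violations = countViolations(180.16, 1.2, 1, 4);
        System.out.println("Violations: " + violations);
    }
}
```

In a workflow, the same logic would correspond to a descriptor node followed by a row filter on the resulting violation count column.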
Complementing the signature node, the KNIME preference page contains a CDK tab to set global visualisation preferences. Given two- or three-dimensional coordinates, a renderer is provided to draw the molecules using the JChemPaint library [19]. By default, the element symbol is drawn. The preference page allows canonical or sequential atom numbers to be drawn instead, either for all atoms or for carbon/hydrogen atoms only.