Fig. 1
From: GCAC: galaxy workflow system for predictive model building for virtual screening

QSAR-based predictive model building, a typical protocol used in GCAC: The initial data is a molecular structure file (SDF/MOL/SMILES), from which molecular descriptors are generated. Once descriptors are generated, data cleaning is performed to remove redundant data. Preprocessing is performed in two steps. In the first step, missing values and near-zero-variance features are removed, as they are not useful for model building. The input data is then split into training and test datasets: the training dataset is used for model building and the test dataset for model evaluation. In the second step, further treatment is applied according to the method selected for model building. It includes removal of zero-variance features and highly correlated features, followed by centering and scaling of the training data. In the model building step, learning and hyper-parameter optimization are carried out using resampling, internal cross-validation and performance evaluation over the chosen set of parameters. The most accurate model is selected and evaluated on the test dataset. The selected model is saved and later used to predict the activity of new compounds whose activity is unknown.
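
The two preprocessing steps and the train/test split described in the caption can be sketched as follows. This is a minimal illustration in plain Python, not GCAC's actual implementation. The function names, the variance threshold used as a stand-in for the near-zero-variance criterion, the 0.9 correlation cutoff, the 25% test fraction, and the choice to drop whole compounds that have a missing descriptor are all assumptions made for the example. The sketch stops before the model building step.

```python
import math
import random
import statistics


def low_variance_columns(rows, threshold):
    """Indices of descriptor columns whose population variance is <= threshold."""
    return {j for j, col in enumerate(zip(*rows))
            if statistics.pvariance(col) <= threshold}


def drop_columns(rows, drop):
    """Return rows without the columns whose indices are in `drop`."""
    return [[v for j, v in enumerate(r) if j not in drop] for r in rows]


def pearson(x, y):
    """Pearson correlation of two columns (both must have non-zero variance)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_columns(rows, cutoff):
    """Greedy filter: mark the later column of any pair with |r| > cutoff."""
    cols = list(zip(*rows))
    drop = set()
    for i in range(len(cols)):
        if i in drop:
            continue
        for j in range(i + 1, len(cols)):
            if j not in drop and abs(pearson(cols[i], cols[j])) > cutoff:
                drop.add(j)
    return drop


def preprocess(X, y, test_fraction=0.25, nzv_threshold=1e-8,
               corr_cutoff=0.9, seed=0):
    """Hypothetical helper mirroring the caption's preprocessing order."""
    # Step 1: drop compounds with missing descriptors (assumption: row-wise),
    # then drop near-zero-variance descriptors from the whole dataset.
    pairs = [(r, t) for r, t in zip(X, y)
             if not any(v is None or math.isnan(v) for v in r)]
    X = [r for r, _ in pairs]
    y = [t for _, t in pairs]
    X = drop_columns(X, low_variance_columns(X, nzv_threshold))

    # Split into training and test sets with a seeded shuffle.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
    X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

    # Step 2: every filter and statistic is derived from the training data
    # only, then applied unchanged to the test data.
    drop = low_variance_columns(X_train, 0.0)
    X_train, X_test = drop_columns(X_train, drop), drop_columns(X_test, drop)
    drop = correlated_columns(X_train, corr_cutoff)
    X_train, X_test = drop_columns(X_train, drop), drop_columns(X_test, drop)

    means = [statistics.fmean(c) for c in zip(*X_train)]
    sds = [statistics.pstdev(c) for c in zip(*X_train)]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

    return scale(X_train), y_train, scale(X_test), y_test
```

Deriving the second-step filters and the centering and scaling statistics from the training data alone keeps the test set from influencing the model, which is why the split sits between the two preprocessing steps.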