Performance of the ML model as a function of training set composition. (Left) Performance as a function of training set size (i.e., the number of combinatorial libraries). The experimental settings are similar to those presented in Figure 2; each point corresponds to the cross-validation performance obtained when only a portion of the training data is used. (Right) Success rate as a function of the minimal distance between test and training targets. The values 1, 2, 3 give the distance in number of bases, and 100%, 80%, 20% give the proportion of the training set that is kept after removing targets that are too similar to targets in the test set. Distance subsampling: distance-based selection of targets. Uniform subsampling: random selection of a training set of equivalent size. r gives the drop (ratio) in performance score due to the distance-based selection of training targets.
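
The two subsampling schemes compared in the right panel can be illustrated with a minimal sketch. It makes assumptions that the caption does not confirm: that the distance "in number of bases" is a Hamming distance between equal-length target sequences, and that a training target is kept only if it lies at least the minimal distance from every test target. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
import random


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ.

    Assumption: the caption's "distance in number of bases" is a Hamming distance.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_subsample(train, test, min_dist):
    """Keep training targets that are at least min_dist bases from every test target."""
    return [t for t in train if all(hamming(t, s) >= min_dist for s in test)]


def uniform_subsample(train, size, seed=0):
    """Randomly pick a training set of the given size (the size-matched control)."""
    rng = random.Random(seed)
    return rng.sample(train, size)


# Toy data, for illustration only.
train = ["ACGT", "ACGA", "TTTT", "AGGA"]
test = ["ACGG"]

kept = distance_subsample(train, test, min_dist=2)
control = uniform_subsample(train, size=len(kept))
print(kept, control)
```

Training the model once on `kept` and once on `control` gives two scores for training sets of equal size, so any difference between them reflects the removal of near neighbours rather than the smaller training set.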