Understanding the relationship between the intrinsic characteristics of a chemical compound and its interaction with a protein target is essential for evaluating potential protein-binding function in virtual drug screening. The similarity between compounds can be characterized in different ways, depending on which features are measured. Measuring the similarity of small molecules has been the focus of essentially every compound-based approach to designing or identifying novel drug candidates. In novel drug screening, however, the representation of a compound varies dramatically, which leads to different similarity measurements. These differences give rise to distinct similarity-ranked lists of candidate compounds, which generally overlap by only about 15%. It is therefore both necessary and challenging to integrate information from multiple data sources into a comprehensive representation and assessment of the similarity between small molecules, which is expected to improve the results of virtual drug screening.
Drug candidates are generally related to specific targets. Investigating the nature of target-specific structure–activity relationships of molecules should therefore draw on the available data concerning structure, activity and target binding from a comprehensive and integrative perspective. Fortunately, public resources are growing rapidly in both the quantity and the variety of data, which provides a valuable opportunity to further mine the relationship between compounds and their targets. Besides the classic representations of small molecules, such as the various fingerprints that characterize chemical structure, public high-throughput experimental data describing compound bioactivity are accumulating quickly with the development of online databases, including PubChem (http://pubchem.ncbi.nlm.nih.gov/), Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) and DrugBank (http://drugbank.ca/). These data offer an alternative way to characterize molecules, based on bioactivity profiles. Several recent studies on the relationship between different compound features reported correlations between bioactivity profiles and target networks, especially when the chemical structures were similar [2, 6–8]. By simply combining public repositories of compound targets and compound bioactivity, these studies indicate that comparing bioactivity profiles can provide insight into the mode of action (MOA) at the molecular level, which will facilitate the knowledge-based discovery of novel compounds. However, although various relationships were found between multiple features, no effective quantitative method has been proposed or evaluated for integrating these multi-view features.
Inspired by these previous works, we identify two important computational questions that need to be investigated: (1) Is there a quantitative relationship between compound features (bioactivity profile and structural features) and compound targets that can be described specifically? (2) Since previous works implied that integrating multiple compound features may yield a better measurement of target-specific compound similarity than adopting a single feature type, how can such integration be optimized to combine, quantitatively and automatically, information from various views of compound representation, i.e., structural features, bioactivity features and others? In this study, we treat the description and integration of multiple compound features as a multi-view data representation and learning problem, and we aim to present a quantitative relationship between target-specific compound similarity and multi-view representations of compound features within an efficient multi-view learning scheme.
It should be noted that the term “multi-view learning” originated in 3D object recognition within the machine learning and graphics communities. As its name implies, multi-view learning combines models built from different aspects of the same entity to obtain an overall, comprehensive representation for further study. Multi-view learning was classically introduced as co-training, a semi-supervised learning procedure that distinguishes webpages using two different types of data. Since then, the concept of integrating different information sources has been developed for years in the field of information retrieval [11–13]. On the unsupervised side, multi-view clustering algorithms can in general be divided into two categories: (1) fusion of similarity data, which derives a convex combination of the similarities from different views so as to minimize a given penalty error [15, 16]; and (2) fusion of clustering decisions derived from each view separately [17, 18]. In the clustering process, other techniques such as canonical correlation analysis (CCA) and matrix factorization have been employed to reduce the feature dimension or to reconcile clustering groups. These applications of multi-view learning commonly yield better performance than single-view learning. In our study, because structure and bioactivity are two distinct intrinsic features describing a small molecule, it is natural to integrate the chemical space (molecular structure) and the genetic space (bioactivity profile) of molecules for a better evaluation of molecular properties and similarity comparison.
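To make the first category concrete, the following minimal sketch forms a convex combination of per-view similarity matrices. It is an illustration only, not the method evaluated in this study: the function name `fuse_similarities`, the toy matrices and the fixed equal weights are assumptions, and in a similarity-fusion method the weights would be chosen by minimizing a penalty error rather than set by hand.

```python
def fuse_similarities(views, weights):
    """Return the convex combination sum_v w_v * S_v of square similarity matrices.

    `views` is a list of n x n matrices (lists of lists), one per view.
    `weights` must be non-negative and sum to 1 (a convex combination).
    """
    if len(views) != len(weights):
        raise ValueError("need exactly one weight per view")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")

    n = len(views[0])
    fused = [[0.0] * n for _ in range(n)]
    for matrix, w in zip(views, weights):
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * matrix[i][j]
    return fused


# Hypothetical toy similarities for three compounds (values are made up).
structure_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
bioactivity_sim = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
]

# Equal weights are an arbitrary choice made for illustration.
fused = fuse_similarities([structure_sim, bioactivity_sim], [0.5, 0.5])
```

Because the weights are non-negative and sum to one, every fused entry lies between the smallest and largest value that the individual views assign to that compound pair, so the fused matrix stays on the same scale as its inputs.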
In this study, a data set of 37 compounds (Additional file 1: Table S1) from a previous study based on bioactivity profile similarity was first adopted. Two similarity matrices, characterizing bioactivity profile similarity and structural similarity respectively, were calculated. Because we wished to investigate the hierarchical structure of similarity among compounds with respect to multiple data sources, rather than only to obtain an integrated ranking decision, a similarity fusion method was employed and modified to automatically optimize the weights used to combine the different similarity data. A hierarchical clustering was then produced from the fused similarity and discussed. Next, to evaluate the fusion method on a large-scale dataset, the Connectivity Map dataset, containing 1267 compounds with their gene expression profiles and structural fingerprint representations, was used to perform virtual drug screening based on similarity searching. The compound-target interactions in these experiments were also analysed and compared quantitatively to demonstrate the benefits of integrating multiple data representations.
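The similarity-searching step can be sketched as follows. This is a hedged illustration under stated assumptions, not the implementation used in this study: the Tanimoto coefficient for fingerprints, the Pearson correlation (rescaled to [0, 1]) for expression profiles, the fixed weight `w_structure`, and all compound names and values are assumptions introduced for the example.

```python
from math import sqrt


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pearson(x, y):
    """Pearson correlation between two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sd_x = sqrt(sum((p - mean_x) ** 2 for p in x))
    sd_y = sqrt(sum((q - mean_y) ** 2 for q in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def rank_by_fused_similarity(query, library, w_structure=0.5):
    """Rank library compounds by a weighted sum of structural and bioactivity similarity."""
    scores = {}
    for name, compound in library.items():
        s_structure = tanimoto(query["fp"], compound["fp"])
        # Map the correlation from [-1, 1] to [0, 1] so both views share a scale.
        s_bioactivity = (pearson(query["profile"], compound["profile"]) + 1.0) / 2.0
        scores[name] = w_structure * s_structure + (1.0 - w_structure) * s_bioactivity
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical query and three-compound library (all values are made up).
query = {"fp": {1, 2, 3, 4}, "profile": [1.0, 2.0, 3.0]}
library = {
    "A": {"fp": {1, 2, 3, 5}, "profile": [1.1, 2.1, 2.9]},
    "B": {"fp": {7, 8}, "profile": [3.0, 2.0, 1.0]},
    "C": {"fp": {1, 2, 3, 4}, "profile": [3.0, 2.0, 1.0]},
}
ranking = rank_by_fused_similarity(query, library)
```

In this toy library, compound C is structurally identical to the query but has an opposite expression profile, so a structure-only search would rank it first; with both views combined it is ranked below compound A, which is similar to the query in both views.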