SNF-NN: computational method to predict drug-disease interactions using similarity network fusion and neural networks

Background Drug repositioning is an emerging approach in pharmaceutical research for identifying novel therapeutic potentials for approved drugs and discovering therapies for untreated diseases. Due to its time and cost efficiency, drug repositioning plays an instrumental role in optimizing the drug development process compared to the traditional de novo drug discovery process. Advances in genomics, together with the enormous growth of large-scale publicly available data and the availability of high-performance computing capabilities, have further motivated the development of computational drug repositioning approaches. More recently, the rise of machine learning techniques has made computational drug repositioning an area of intense activity. Results In this study, a novel deep learning framework, SNF-NN, is presented. It predicts novel drug-disease interactions from heterogeneous drug-related similarity information, disease-related similarity information, and known drug-disease interactions. SNF-NN uses similarity selection, similarity network fusion, and a highly tuned novel neural network model to predict new drug-disease interactions. The robustness of SNF-NN is evaluated by comparing its performance with nine baseline machine learning methods.
The proposed framework outperforms all baseline methods (AUC-ROC = 0.867 and AUC-PR = 0.876) using stratified 10-fold cross-validation. To further demonstrate the reliability and robustness of SNF-NN, two datasets are used to fairly validate the proposed framework’s performance against seven recent state-of-the-art methods for drug-disease interaction prediction. SNF-NN achieves remarkable performance in stratified 10-fold cross-validation with AUC-ROC ranging from 0.879 to 0.931 and AUC-PR from 0.856 to 0.903. Moreover, the efficiency of SNF-NN is verified by validating predicted unknown drug-disease interactions against clinical trials and published studies. Conclusion In conclusion, computational drug repositioning research can significantly benefit from integrating similarity measures in heterogeneous networks and deep learning models for predicting novel drug-disease interactions.
The data and implementation of SNF-NN are available at http://pages.cpsc.ucalgary.ca/ tnjarada/snf-nn.php.

Drugs with similar ADEs are more likely to have common target proteins and therefore to share similar underlying pathways. The similarity between drug pairs was calculated using the Jaccard coefficient and ADEs.

Drug Chemical Structures
Drug chemical structure information tends to indicate transcriptional response similarity between drugs, and thus it is a powerful source of information for therapeutic repositioning opportunities.
Drug 2D chemical structure information was retrieved from PubChem [15]. Subsequently, CDK [16] was used to encode each drug into a binary feature vector of PubChem substructures.
A binary matrix was constructed to represent the drug-chemical-substructure associations where each drug r is represented as a binary feature vector of PubChem 2D chemical substructures, and the absence or presence of a drug-chemical-substructure association is denoted by 0 or 1, respectively.
Drugs with similar chemical structures are anticipated to have similar biological indications and, therefore, cure common diseases. The similarity between drug pairs was calculated using the Jaccard coefficient and the 2D chemical fingerprint descriptor.

Drug GO Annotation Targets
GO annotations of drug target-coding genes were obtained from the Gene Ontology Consortium Database [17]. Experimentally validated and literature-derived information (i.e., molecular functions, biological processes, cellular components) was utilized while computationally annotated information was kept out.
Binary matrices were constructed to represent the three drug pairwise semantic similarities, where each drug r in the MF, BP, and CC semantic similarity matrices is represented as a binary feature vector of the GO molecular function (MF) terms, GO biological process (BP) terms, and GO cellular component (CC) terms, respectively.
Drugs with semantically similar target-coding genes appear to share similar underlying mechanisms. The MF, BP, and CC semantic similarities between drug pairs were calculated using the information-based measure proposed in [18].

Drug Metabolism Enzymes
Drug metabolism refers to the set of biochemical processes and metabolic breakdown of drugs within any living organism. Drug metabolism is usually represented by the enzymatic alteration caused by the drug; hence, it can help identify similarities between drugs.
Drug metabolism enzymes information was retrieved from the Human Metabolome Database (HMDB) [19].
A binary matrix was constructed to represent the drug-metabolism-enzyme associations where each drug r is represented as a binary feature vector of enzymes, and the absence or presence of a drug-metabolism-enzyme association is denoted by 0 or 1, respectively.
Drugs with similar metabolism enzymes are more likely to share similar underlying pathways. The similarity between drug pairs was calculated using the Jaccard coefficient and metabolism enzymes.
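The Jaccard-based similarities used throughout this section can be illustrated with a minimal sketch over binary feature vectors. The function names, the toy vectors, and the convention of returning 0.0 when both vectors are all zeros are assumptions made for illustration; they are not taken from the SNF-NN implementation.

```python
from typing import Dict, List, Sequence


def jaccard_binary(u: Sequence[int], v: Sequence[int]) -> float:
    """Jaccard coefficient of two equal-length binary vectors.

    Computed as (#positions where both are 1) / (#positions where either is 1).
    Assumption: two all-zero vectors are given similarity 0.0.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either if either else 0.0


def pairwise_jaccard(features: Dict[str, List[int]]) -> Dict[str, Dict[str, float]]:
    """Build a symmetric drug-by-drug similarity table from binary feature vectors."""
    names = sorted(features)
    return {a: {b: jaccard_binary(features[a], features[b]) for b in names} for a in names}


# Hypothetical toy data: three drugs described by four enzyme features.
drug_enzymes = {
    "drug_A": [1, 1, 0, 0],
    "drug_B": [1, 0, 1, 0],
    "drug_C": [0, 0, 0, 1],
}
similarity = pairwise_jaccard(drug_enzymes)
print(similarity["drug_A"]["drug_B"])
```

The same routine applies unchanged to any of the binary matrices described here, since only the meaning of the feature columns differs.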

Drug Protein Sequences
Drug protein sequence information refers to the amino acid sequence information of a protein, or part of a protein, which characterizes the protein post-translational modifications. Such information helps in studying the similarity between drugs and thus identifying novel drug repositioning candidates.
A binary matrix was constructed to represent the drug-protein sequence associations, where each drug r is represented by a binary feature vector of protein sequences, and the absence or presence of a drug-protein-sequence association is denoted by 0 or 1, respectively.
Drugs with similar protein amino acid sequences tend to share common functional mechanisms. The similarity between drug pairs, S^(proteins), was calculated using the normalized average of the Smith-Waterman (SW) sequence alignment similarity scores of proteins [22].
In this measure, p(r_a) and p(r_b) represent the target protein sets of drugs r_a and r_b, respectively; |p(r_a)| and |p(r_b)| are the total numbers of target proteins of drugs r_a and r_b, respectively; and SW(p_i(r_a), p_j(r_b)) is the sequence alignment similarity score between the two proteins p_i(r_a) and p_j(r_b). The S^(proteins) value is in the range [0, 1].
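The displayed equation for this measure is not present in the text above. One form consistent with the symbol definitions and with the phrase "normalized average" is sketched below; it is a reconstruction under the assumption that each SW score has already been normalized to [0, 1], and it should be checked against [22] before use.

```latex
\[
S^{(\mathrm{proteins})}(r_a, r_b) =
\frac{\displaystyle\sum_{i=1}^{|p(r_a)|} \sum_{j=1}^{|p(r_b)|}
      SW\bigl(p_i(r_a),\, p_j(r_b)\bigr)}
     {|p(r_a)|\,|p(r_b)|}
\]
```

Under that assumption the value is an average of numbers in [0, 1], which agrees with the stated range of S^(proteins).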

Drug Anatomical Therapeutic Chemical Classification Codes
The Anatomical Therapeutic Chemical (ATC) classification system divides the active drug substances into different groups based on the organ or system on which they act as well as their therapeutic effect, pharmacological activities, and chemical characteristics. The ATC classification system consists of five different classification levels: (i) main anatomical or pharmacological groups, (ii) pharmacological or therapeutic subgroups, (iii) & (iv) chemical, pharmacological or therapeutic subgroups, and (v) chemical substances. Such a classification system provides useful insight into drugs and helps in predicting novel drug targets.
Drug-ATC classification codes were collected from the DrugBank Database [1].
A binary matrix was constructed to represent the drug-ATC classification code associations, where each drug r is represented by a binary feature vector of ATC classification codes, and the absence or presence of a drug-ATC association is denoted by 0 or 1, respectively.
Drugs with similar structures and classification codes appear to have similar medical indications. The similarity between drug pairs was calculated using the similarity measure proposed in [10].
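The five nested ATC levels can be made concrete with a short sketch that splits a seven-character ATC code into its level prefixes (lengths 1, 3, 4, 5, and 7) and counts how many levels two codes share. This only illustrates the hierarchy; it is not the similarity measure of [10], and the helper names are hypothetical.

```python
from typing import List

# Prefix length of each of the five ATC levels in a full seven-character code.
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code: str) -> List[str]:
    """Return the five hierarchical prefixes of a full ATC code."""
    if len(code) != 7:
        raise ValueError("expected a full seven-character ATC code")
    return [code[:n] for n in ATC_LEVEL_LENGTHS]


def shared_atc_depth(code_a: str, code_b: str) -> int:
    """Count how many levels, starting from the top, two ATC codes have in common."""
    depth = 0
    for prefix_a, prefix_b in zip(atc_levels(code_a), atc_levels(code_b)):
        if prefix_a != prefix_b:
            break
        depth += 1
    return depth


print(atc_levels("A10BA02"))
print(shared_atc_depth("A10BA02", "A10BB01"))
```

A hierarchy-aware similarity would typically give more weight to codes that agree at deeper levels; how that weighting is done in this work is defined in [10].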

Drug Pairwise Interactions
Drug-Drug Interactions (RRIs) refer to the adverse effects that may result from the concurrent consumption of two or more drugs. RRIs play a significant role in delaying, decreasing, or enhancing the absorption and expected molecular alterations of the concurrently consumed drugs.
Drug-drug interactions were collected from the DrugBank Database [1].
A binary adjacency matrix was constructed to represent the drug-drug interactions, where each drug r is represented by a binary feature vector of drugs, and the absence or presence of an RRI is denoted by 0 or 1, respectively.
Drugs with interactive chemicals are anticipated to have similar characteristics and biological functions.
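As an illustration of the binary adjacency matrix described above, the following sketch builds one from a list of interacting drug pairs. The drug names are placeholders, and treating every interaction as symmetric is an assumption of this sketch rather than something stated in the text.

```python
from typing import List, Sequence, Tuple


def build_adjacency(drugs: Sequence[str],
                    interactions: Sequence[Tuple[str, str]]) -> List[List[int]]:
    """Return a drug-by-drug 0/1 matrix; entry [i][j] is 1 if drugs i and j interact."""
    index = {name: i for i, name in enumerate(drugs)}
    size = len(drugs)
    matrix = [[0] * size for _ in range(size)]
    for first, second in interactions:
        i, j = index[first], index[second]
        matrix[i][j] = 1
        matrix[j][i] = 1  # assumption: an interaction is undirected
    return matrix


# Hypothetical toy data: three drugs and one recorded interaction.
drugs = ["drug_A", "drug_B", "drug_C"]
interactions = [("drug_A", "drug_B")]
adjacency = build_adjacency(drugs, interactions)
for row in adjacency:
    print(row)
```

Each row of the resulting matrix is the binary feature vector of one drug over all drugs.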

2 Disease-related similarity data

Disease Genes
Disease-Gene Associations (DGAs) describe the relationships between genetic diseases and the genes associated with them. Such associations help in understanding the underlying mechanism of complex diseases, and therefore they support the efforts of developing preventive and therapeutic solutions.
Human Phenotype Ontology (HPO) disease-gene associations were also retrieved from the HPO KnowledgeBase [25]. Diseases were mapped from their Disease Ontology Identifiers (DOIDs) to their UMLS Concept Unique Identifiers (CUIs).
Literature-based disease-gene associations were extracted from Medline abstracts published in the period (January 1970-December 2019) using the BeFree text mining tool [26]. BeFree consists of two modules, namely Biomedical Named Entity Recognition (BioNER) and Relation Extraction (RE).
BioNER utilizes gene and disease dictionaries to apply fuzzy and pattern matching methods in order to recognize gene and disease entities mentioned in the literature [27]. Subsequently, RE identifies DGAs by exploiting semantic and morphosyntactic information from the text extracted from the literature [28]. Genes were mapped to their UniProt accession numbers [7].
Binary matrices were constructed to represent the curated, HPO, and literature-based DGAs where each disease d is represented as a binary feature vector of genes, and the absence or presence of a DGA is denoted by 0 or 1, respectively.
Diseases associated with similar genes tend to share similar underlying mechanisms. The similarity between disease pairs was calculated using the Jaccard coefficient and curated, HPO, and literature-based genes, separately.
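The identifier mapping and matrix construction steps described above can be sketched as follows. All identifiers are made-up placeholders, and dropping associations whose disease has no mapping is an assumption of this sketch; the text does not say how unmapped diseases were handled.

```python
from typing import Dict, List, Sequence, Tuple


def build_binary_matrix(
    associations: Sequence[Tuple[str, str]],
    doid_to_cui: Dict[str, str],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Map disease IDs to CUIs and build a disease-by-gene 0/1 matrix."""
    # Assumption: associations for diseases without a CUI mapping are skipped.
    mapped = [(doid_to_cui[d], g) for d, g in associations if d in doid_to_cui]
    diseases = sorted({d for d, _ in mapped})
    genes = sorted({g for _, g in mapped})
    row = {d: i for i, d in enumerate(diseases)}
    col = {g: j for j, g in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in diseases]
    for disease, gene in mapped:
        matrix[row[disease]][col[gene]] = 1
    return diseases, genes, matrix


# Placeholder identifiers, not real DOIDs, CUIs, or UniProt accessions.
associations = [("DOID:1", "P1"), ("DOID:1", "P2"), ("DOID:2", "P2"), ("DOID:9", "P3")]
doid_to_cui = {"DOID:1": "CUI_A", "DOID:2": "CUI_B"}
diseases, genes, matrix = build_binary_matrix(associations, doid_to_cui)
print(diseases, genes, matrix)
```

Running this once per source (curated, HPO, literature-based) yields the three separate matrices from which the three disease similarities are computed.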

Disease Variants
Disease-Variant Associations (DVAs) relate to the relationships between genetic diseases and the genetic variants associated with them. Variants refer to the gene's DNA sequence changes that are caused by a disease. Gene variants can alter the gene functionality (e.g., produce a non-functioning protein) and, thus, provide useful insight into disease conditions. Curated disease-variant associations were obtained from UniProt/SWISS-PROT KnowledgeBase [20] and the GWASdb Database of human genetic variants identified by genome-wide association studies [29].
Variants were mapped to their Single Nucleotide Polymorphisms (SNPs) using the NCBI Database of Genetic Variation (dbSNP) [30].
Literature-based disease-variant associations were extracted from Medline abstracts published in the period (January 1970-December 2019) using both the SETH text mining tool [31] and the BeFree tool.
SETH extracts gene variants and maps them to their dbSNP's SNPs, while the RE module of the BeFree tool identifies DVAs from the extracted information.
Binary matrices were constructed to represent the curated and literature-based DVAs where each disease d is represented as a binary feature vector of variants, and the absence or presence of a DVA is denoted by 0 or 1, respectively.
Diseases associated with similar variants are expected to share similar underlying pathways. The similarity between disease pairs was calculated using the Jaccard coefficient and curated and literature-based variants, separately.

Disease MicroRNAs
Disease-MicroRNA Associations (DMAs) refer to the single-stranded non-coding RNA sequences associated with a disease. MicroRNAs (miRNAs) regulate up to 60% of protein-encoding genes. They are also considered as key mediators of the host response to diseases [32].
Disease-microRNA associations were collected from the DincRNA Database [33]. Diseases were mapped from their DOIDs to their UMLS's CUIs.
A binary matrix was constructed to represent the disease-miRNA associations, where each disease d is represented by a binary feature vector of miRNAs, and the absence or presence of a DMA is denoted by 0 or 1, respectively.
Diseases that affect similar miRNAs are expected to have similar underlying pathways. The similarity between disease pairs was calculated using the Jaccard coefficient and miRNAs.

Disease Long Non-coding RNAs
Disease-Long Non-coding RNAs Associations (DLAs) describe the relationship between the genetic diseases and the Long Non-coding RNAs (lncRNAs) associated with them. lncRNAs are RNA transcripts with lengths exceeding 200 nucleotides. Despite lacking protein-coding potential, lncRNAs are emerging as important regulators in gene expression networks for their roles in modulating mRNA stability and controlling nuclear architecture and transcription as well as translation and post-translational modifications in the cytoplasm [34].
Disease-lncRNAs associations were collected from the DincRNA Database [33]. Diseases were mapped from their DOIDs to their UMLS's CUIs.
A binary matrix was constructed to represent the disease-lncRNA associations, where each disease d is represented by a binary feature vector of lncRNAs, and the absence or presence of a DLA is denoted by 0 or 1, respectively.
Diseases with similar lncRNAs tend to share similar cellular functional effects. The similarity between disease pairs was calculated using the Jaccard coefficient and lncRNAs.

Disease HPO Phenotypes
Disease phenotypes present the observable characteristics of the disease and tend to describe the quality of an organism (e.g., morphology, development, behaviour) as opposed to the organism's genotype [35].
Disease-phenotype associations were retrieved from the HPO KnowledgeBase [25]. Diseases were mapped from their DOIDs to their UMLS's CUIs.
A binary matrix was constructed to represent the disease-phenotype associations, where each disease d is represented by a binary feature vector of phenotypes, and the absence or presence of a phenotype is denoted by 0 or 1, respectively.
Diseases can be defined using their phenotypes; therefore, diseases with similar phenotypes are anticipated to share similar underlying pathways. The similarity between disease pairs was calculated using the Jaccard coefficient and phenotypes.

Inferred Disease-Disease Associations
Inferred disease pairwise similarity methods have attracted considerable attention recently for their key role in identifying disease-causing genes [36] and predicting novel drug indications [37]. Such computational methods quantify the shared characteristics of disease pairs based on semantic and/or functional information.
Semantic-based methods use terms of gene ontology [38] and human phenotype ontology [25] to calculate pairwise disease similarity, while function-based methods use functional associations of disease genetic entities (e.g., genes, miRNAs, lncRNAs) [39].
Resnik [41] introduced a semantic similarity measure in an IS-A taxonomy based on the notion of information content. Lin [42] used an information-theoretic definition to compute semantic pairwise similarity scores. Wang [43] developed a semantic similarity measure based on GO terms. Mathur [44] leveraged implicit semantic similarity to propose the PSB similarity measure. Cheng [45] integrated semantic and gene functional information to present the SemFunSim similarity measure. Finally, Piñero [40] utilized curated disease genes and variants to compute the disease pairwise similarity reported in the DisGeNET KnowledgeBase.
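To make the first two measures concrete, the sketch below computes Resnik and Lin similarity on a toy IS-A taxonomy, using their standard definitions: the information content of a term is IC(t) = -log p(t), Resnik similarity is the IC of the most informative common ancestor, and Lin similarity is twice that value divided by IC(t1) + IC(t2). The taxonomy, the term probabilities, and the convention of returning 1.0 when both terms have zero IC are illustrative assumptions.

```python
import math
from typing import Dict, Set


def information_content(term: str, prob: Dict[str, float]) -> float:
    """IC(t) = -log p(t), written as log(1/p) so the root yields 0.0."""
    return math.log(1.0 / prob[term])


def resnik(t1: str, t2: str, ancestors: Dict[str, Set[str]], prob: Dict[str, float]) -> float:
    """IC of the most informative common ancestor (ancestor sets include the term itself)."""
    common = ancestors[t1] & ancestors[t2]
    return max(information_content(c, prob) for c in common)


def lin(t1: str, t2: str, ancestors: Dict[str, Set[str]], prob: Dict[str, float]) -> float:
    """Lin similarity: 2 * IC(common ancestor) / (IC(t1) + IC(t2))."""
    total = information_content(t1, prob) + information_content(t2, prob)
    if total == 0.0:
        return 1.0  # assumption: both terms are the root, treated as identical
    return 2.0 * resnik(t1, t2, ancestors, prob) / total


# Hypothetical taxonomy: root -> P -> {A, B}, and root -> C.
ancestors = {
    "A": {"A", "P", "root"},
    "B": {"B", "P", "root"},
    "C": {"C", "root"},
}
prob = {"root": 1.0, "P": 0.5, "A": 0.25, "B": 0.25, "C": 0.5}

print(resnik("A", "B", ancestors, prob))
print(lin("A", "B", ancestors, prob))
```

In this toy example, terms A and B share the ancestor P, while A and C share only the root, so their Resnik and Lin similarities are both zero.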
Binary adjacency matrices were constructed to represent the disease-disease associations, where each disease d is represented by a binary feature vector of diseases, and the absence or presence of an Inferred Disease-Disease Association (IDDA) is denoted by 0 or 1, respectively.
Diseases with high similarity scores are anticipated to have similar characteristics, genomic effects, and underlying pathways.