### Work frame

To clarify our approach, a flow chart of the research work is shown in Fig. 1. First, we collect information on diseases and metabolites. After obtaining three data sets, we integrate them into a one-to-one disease-metabolite format through a semantic text mining algorithm. We also collect known metabolites that are related to the diseases. Then the method ‘InfDisSim’ is employed to calculate the similarity between diseases. After that, the method ‘MISIM’ is applied to obtain the similarity between metabolites, from which we build a metabolite similarity network. Finally, we identify novel disease-metabolite relationships by Random Walk.

To obtain the basic relationships between metabolites and diseases, the following three datasets are used: HMDB, NCBO Annotator and Disease Ontology.

### Data collection and database content

#### Human metabolome database

We downloaded the metabolite data from the Human Metabolome Database (HMDB) [31]. This widely used and comprehensive database contains more than 40,000 metabolite entries. It holds three kinds of data: chemical data, clinical data and biochemical data, which were collected from thousands of public sources.

The dataset we obtained contains disease-related metabolites spread over many complex files, so we used the other datasets to further interpret these data.

#### Disease ontology

Disease Ontology [32] started as part of the NUgene project at Northwestern University in 2003. By integrating other datasets, Disease Ontology supports the study of heredity, environmental factors and other causes of diseases, which helps researchers understand diseases better.

Each disease or disease concept is a node. Each node carries cross-reference annotations from the literature, and each disease is assigned a DOID name. The nodes in a lower layer are subclasses or subtypes of the nodes in the layer above, and the parent-child relationships between DOIDs are preserved in the data. All the diseases are classified into the following groups: diseases caused by environmental origin, diseases caused by infectious agent, diseases of anatomical entity, diseases of behavior, diseases of biological process, hereditary disease, disease syndrome and gene ontology. All the nodes are connected in a directed acyclic graph (DAG).

After obtaining the disease-related metabolite data from HMDB, we used Disease Ontology to annotate the diseases. In this way, we obtain the name and related information of each disease.

#### National center for biomedical ontology

To improve the semantic expressiveness and open interconnection of data, the National Center for Biomedical Ontology (NCBO) [33] proposed a data sharing project to address the lack of integration tools for scientific ontologies. The datasets of each domain exist as isolated information islands. Most of the information cannot be semantically identified by machines, which obstructs interaction between the information nodes and hinders biomedical research and knowledge discovery. NCBO has six core components: computer science and biomedical informatics research, promoting biology projects and external research collaboration, infrastructure, education, communication and management.

We can further interpret and annotate the HMDB data through NCBO, which yields a disease-to-metabolite data file.

### Method

#### Calculating similarity of pair-wise diseases

Diseases are similar to one another to a certain degree, and this similarity is often caused by shared molecular origins. Interactions between protein-coding genes reflect the mechanisms of diseases to some extent. Therefore, the similarity between diseases can be derived from the genes underlying them.

In this paper, we used the method named ‘InfDisSim’ [13, 34] to calculate the similarity between diseases. This method measures disease similarity on a gene functional network, which provides the information flow used to calculate the similarity. To analyze the information flow, ITM Probe [35] is employed, which includes three models: absorbing, emitting and channel. Each disease is a boundary node in the network, and each gene is a transient node.

Each disease has several related metabolites. If the number of metabolites is *N*, the weight vector of disease *t*_{1} is:

$$ {WV}_{t_1}=\left\{{w}_{1,1},{w}_{1,2},\kern0.5em \dots \kern0.5em ,{w}_{1,i},\kern0.5em \dots \kern0.5em ,{w}_{1,N}\right\} $$

(1)

Here \( {\mathrm{WV}}_{t_1} \) is the weight vector of *t*_{1}, and *w*_{1, i} is the weight score of *t*_{1} on the *i*th dimension. The cosine of the two weight vectors is used to represent the disease similarity:

$$ Inf\left({t}_1,{t}_2\right)=\frac{\sum \limits_{i=1}^N{w}_{1,i}\cdot {w}_{2,i}}{\sqrt{\sum \limits_{i=1}^N{w}_{1,i}^2}\sqrt{\sum \limits_{j=1}^N{w}_{2,j}^2}} $$

(2)

The similarity between two diseases is defined as follows:

$$ InfDisSim\left({t}_1,{t}_2\right)= Inf\left({t}_1,{t}_2\right)\frac{\left|{G}_1\right|\left|{G}_2\right|}{{\left|{G}_{MICA}\right|}^2} $$

(3)

where *G*_{1} and *G*_{2} denote the metabolite sets of *t*_{1} and *t*_{2}, respectively, and *G*_{MICA} is the metabolite set of *t*_{3}, the most informative common ancestor (MICA) of *t*_{1} and *t*_{2}. ∣·∣ represents the number of terms in the specified set.

Then we could obtain the similarity of the diseases.
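As a minimal sketch of Eqs. (1)-(3), the snippet below computes the standard cosine similarity of two weight vectors and scales it by the set-size factor of Eq. (3). The function names, the toy weight vectors and the toy metabolite sets are hypothetical and only serve as an illustration; returning 0 for an all-zero weight vector is also an assumption.

```python
import math


def cosine_similarity(w1, w2):
    """Eq. (2): cosine of two equal-length weight vectors."""
    if len(w1) != len(w2):
        raise ValueError("weight vectors must have the same length")
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an all-zero vector has similarity 0
    return dot / (norm1 * norm2)


def inf_dis_sim(w1, w2, g1, g2, g_mica):
    """Eq. (3): cosine score scaled by |G1| * |G2| / |G_MICA|^2."""
    if not g_mica:
        raise ValueError("G_MICA must be non-empty")
    scale = (len(g1) * len(g2)) / (len(g_mica) ** 2)
    return cosine_similarity(w1, w2) * scale


# Hypothetical toy data: two parallel weight vectors and small metabolite sets.
wv_t1 = [1.0, 0.0, 2.0]
wv_t2 = [2.0, 0.0, 4.0]
g1 = {"m1", "m2"}
g2 = {"m2", "m3"}
g_mica = {"m1", "m2", "m3", "m4"}

score = inf_dis_sim(wv_t1, wv_t2, g1, g2, g_mica)
print(score)
```

Because the two toy vectors are parallel, the cosine term is close to 1 and the result is dominated by the set-size factor.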

#### Calculating similarity of pair-wise metabolites

A method named ‘MISIM’ was proposed by Dong Wang et al. [36] to estimate the functional similarity of microRNAs (miRNAs). They pointed out that genes with similar functions are often associated with similar diseases, so the similarity of diseases can be computed from the DAG. This idea is quite similar to the one behind ‘InfDisSim’, and it is also the premise for calculating the similarity of metabolites. Based on this idea and the miRNA-disease association data, they presented ‘MISIM’ to infer the functional similarity of miRNAs from the relationships between their associated diseases.

In our research, the goal is instead to compute the similarity of metabolites. Since the background and theoretical basis are the same, we applied ‘MISIM’ to calculate the similarity of metabolites from the similarity of diseases.

First, the semantic similarity, which describes the relationship between diseases, is defined. Then the similarity between one disease and a group of diseases is calculated as follows:

$$ S\left(d,D\right)=\underset{1\le i\le k}{\max}\left(S\left(d,{d}_i\right)\right) $$

(4)

Here *d* represents one disease and *D* = {*d*_{1}, …, *d*_{k}} is a disease group. *S*(*d*, *D*) is the maximum similarity between *d* and any disease in *D*.

After obtaining the similarity of diseases, we can calculate the similarity of metabolites. Suppose metabolite *M*_{1} is related to a disease group *D*_{1} containing *m* diseases, and metabolite *M*_{2} is related to another disease group *D*_{2} containing *n* diseases. The similarity of the two metabolites is then computed by:

$$ Similarity\left({M}_1,{M}_2\right)=\frac{\sum \limits_{1\le i\le m}S\left({d}_{1i},{D}_2\right)+\sum \limits_{1\le i\le n}S\left({d}_{2i},{D}_1\right)}{m+n} $$

(5)

In this way, the similarity between *M*_{1} and *M*_{2} is obtained.
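Eqs. (4) and (5) can be illustrated with a short sketch. The pairwise disease similarity is passed in as a function; here it is backed by a hypothetical lookup table (`PAIR_SIM`, `toy_sim`) rather than real InfDisSim scores, and all names are illustrative only.

```python
def disease_to_group(d, group, sim):
    """Eq. (4): best similarity between disease d and any member of group."""
    return max(sim(d, other) for other in group)


def metabolite_similarity(group1, group2, sim):
    """Eq. (5): average of best-match similarities in both directions."""
    if not group1 or not group2:
        raise ValueError("both disease groups must be non-empty")
    total = sum(disease_to_group(d, group2, sim) for d in group1)
    total += sum(disease_to_group(d, group1, sim) for d in group2)
    return total / (len(group1) + len(group2))


# Hypothetical pairwise disease similarities (symmetric, 1.0 on the diagonal).
PAIR_SIM = {
    frozenset(("A", "B")): 0.5,
    frozenset(("A", "C")): 0.2,
    frozenset(("B", "C")): 0.8,
}


def toy_sim(x, y):
    if x == y:
        return 1.0
    return PAIR_SIM.get(frozenset((x, y)), 0.0)


D1 = ["A", "B"]  # diseases related to metabolite M1 (m = 2)
D2 = ["B", "C"]  # diseases related to metabolite M2 (n = 2)

result = metabolite_similarity(D1, D2, toy_sim)
print(result)  # (0.5 + 1.0 + 1.0 + 0.8) / 4
```

The shared disease B contributes the maximum score of 1.0 in both directions, which is why metabolites with overlapping disease groups score high under this measure.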

#### Predicting novel disease-metabolite relationships using random walk

Random Walk is an important class of stochastic process. For example, if an ant starts from *X*_{t}, it takes a step forward with probability 0.5 (*X*_{t + 1} = *X*_{t} + 1) or a step back with probability 0.5 (*X*_{t + 1} = *X*_{t} − 1). The positions that the ant occupies at successive moments constitute a one-dimensional random walk process.
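The ant example can be simulated in a few lines. This is only an illustrative sketch; the function name `simulate_walk` and the `seed` parameter are assumptions added for reproducibility.

```python
import random


def simulate_walk(steps, start=0, seed=None):
    """Simulate a one-dimensional random walk with +1/-1 steps of equal probability."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        # Step forward (+1) or back (-1), each with probability 0.5.
        path.append(path[-1] + rng.choice((1, -1)))
    return path


path = simulate_walk(10, start=0, seed=42)
print(path)
```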

Random walk can be regarded as a special case of a Markov chain. Given the current state, the past (the historical states) is irrelevant to the prediction of the future (the future states). At each step of the Markov chain, the system can change from one state to another or remain in the current state according to a probability distribution. A change of state is called a transition, and the probability associated with each change is called the transition probability. If *A* is the adjacency matrix of graph *G*, we can normalize *A* as follows:

$$ P={D}^{-1}A $$

(6)

*D* is the degree matrix of *A*, which is a diagonal matrix with diagonal elements *D*(*i*, *i*) = ∑_{*j*} *A*(*i*, *j*). Here *P* is the random walk matrix, and for each node the jump probabilities to all other nodes sum to 1.

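A random walk matrix corresponds to a Markov chain, which we assume to be irreducible, that is, any two states can reach each other. Starting from an arbitrary state distribution *A*_{t} (a row vector of state probabilities, not the adjacency matrix *A*), the distribution at the next step is: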

$$ {A}_{t+1}={A}_tP $$

(7)

The process keeps moving, and after a certain period of time the equilibrium state is reached. The equilibrium state means that the probability distribution over the states no longer changes. The equilibrium distribution is calculated as follows:

$$ {\pi}_i=D\left(i,i\right)/\sum \limits_i\sum \limits_jA\left(i,j\right) $$

(8)

When *πP* = *π*, the equilibrium state is reached.
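The relation *πP* = *π* can be checked numerically. The sketch below assumes an undirected (symmetric) graph and a row-normalized random walk matrix, so that each row of *P* sums to 1 as described above; the 4-node toy graph and all function names are hypothetical.

```python
def random_walk_matrix(adj):
    """Row-normalize the adjacency matrix so that each row sums to 1."""
    matrix = []
    for row in adj:
        degree = sum(row)
        if degree == 0:
            raise ValueError("isolated node: its row cannot be normalized")
        matrix.append([a / degree for a in row])
    return matrix


def degree_distribution(adj):
    """Equilibrium distribution: node degree divided by the sum of all entries."""
    total = sum(sum(row) for row in adj)
    return [sum(row) / total for row in adj]


def step(dist, matrix):
    """One transition: row vector times the random walk matrix."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Hypothetical undirected graph with edges 0-1, 0-2, 1-2, 2-3.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

P = random_walk_matrix(adjacency)
pi = degree_distribution(adjacency)
pi_next = step(pi, P)
print(pi, pi_next)  # the two vectors agree up to rounding
```

For a directed or otherwise asymmetric graph, the degree-based formula is generally not an equilibrium distribution, so the symmetry assumption matters here.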

The fundamental matrix of a Markov chain is defined as:

$$ Z={\left(I-P+W\right)}^{-1} $$

(9)

where *I* is the identity matrix, *P* is the corresponding random walk matrix, and *W* is the matrix whose rows are all equal to the equilibrium distribution *π*. For a regular Markov chain, *W* is the limit of *P*^{n} as *n* tends to infinity.
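The following sketch computes the fundamental matrix for a small chain. It assumes the standard textbook form for regular chains, *Z* = (*I* − *P* + *W*)^{−1}, and uses a hypothetical two-state transition matrix; the helper `mat_inverse` is a plain Gauss-Jordan routine written here only to keep the example self-contained.

```python
def mat_inverse(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fundamental_matrix(P, pi):
    """Z = (I - P + W)^{-1}, where every row of W equals pi (assumed form)."""
    n = len(P)
    m = [
        [(1.0 if i == j else 0.0) - P[i][j] + pi[j] for j in range(n)]
        for i in range(n)
    ]
    return mat_inverse(m)


# Hypothetical regular two-state chain and its equilibrium distribution.
P = [[0.5, 0.5], [0.25, 0.75]]
pi = [1.0 / 3.0, 2.0 / 3.0]

Z = fundamental_matrix(P, pi)
print(Z)
```

Two properties give a quick sanity check under this definition: every row of *Z* sums to 1, and *πZ* = *π*.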

The algorithm flow is as follows:

Step 1: Given the initial iteration point *x*, the step length *λ* and the control accuracy ℓ.

Step 2: Let *N* be the maximum number of iterations and *k* the current iteration count.

Step 3: While *k* < *N*, randomly generate an *n*-dimensional vector *u* = (*u*_{1}, *u*_{2}, …, *u*_{n}), then take the first step *x*_{1} = *x* + *λu*^{'}.

Step 4: If *f*(*x*_{1}) < *f*(*x*), set *k* = 1 and return to Step 2; otherwise set *k* = *k* + 1 and return to Step 3.

Step 5: If no better solution is found within *N* iterations, the optimal solution is taken to lie in a neighborhood centered on the current best solution.
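The steps above can be sketched as a small random-walk minimizer. Several details are assumptions that the steps do not state: the random vector is normalized to unit length before stepping, an improved point replaces the current point, and the step length is halved after *N* consecutive failures until it drops below the control accuracy. The objective function and all names are hypothetical.

```python
import math
import random


def random_walk_minimize(f, x0, step=1.0, accuracy=1e-3, max_tries=100, seed=0):
    """Random-walk search for a minimum of f, starting from x0."""
    rng = random.Random(seed)
    x = list(x0)
    fx = f(x)
    while step >= accuracy:  # assumption: stop once the step is below the accuracy
        k = 1
        while k <= max_tries:
            u = [rng.uniform(-1.0, 1.0) for _ in x]
            norm = math.sqrt(sum(c * c for c in u))
            if norm == 0.0:
                continue  # degenerate direction, draw again
            # Assumption: move by `step` along the normalized random direction.
            x1 = [xi + step * ui / norm for xi, ui in zip(x, u)]
            f1 = f(x1)
            if f1 < fx:
                x, fx = x1, f1  # assumption: accept the better point
                k = 1
            else:
                k += 1
        step /= 2.0  # assumption: refine the search around the current best point
    return x, fx


def sphere(v):
    """Hypothetical objective: sum of squares, minimized at the origin."""
    return sum(c * c for c in v)


best_point, best_value = random_walk_minimize(sphere, [2.0, -3.0])
print(best_point, best_value)
```

Since a candidate is accepted only when it lowers the objective, the returned value is never worse than the value at the starting point.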

Random walk with restart (RWR) is a global network ranking algorithm. Starting from one or several seed nodes, the walker moves at each step to a neighboring node with a probability determined by the edge between the two nodes. Let γ be the probability of returning to the seed nodes at each step; the RWR algorithm is then defined as follows:

$$ {P}_{t+1}=\left(1-\gamma \right){AP}_t+\gamma {P}_0 $$

(10)

Here, *A* is the column-normalized adjacency matrix, *P*_{0} is the initial probability vector and *P*_{t} is the probability vector whose *i*th element is the probability of being at node *i* at step *t*. Following a previous study, γ is set to 0.85 [37].

Fig. 2 shows the calculation process of Random Walk for identifying disease-related metabolites. First, we set the parameters; then we iterate until the difference between *P*_{t + 1} and *P*_{t} falls below the threshold. Finally, we obtain all the candidate disease-related metabolites.
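A minimal sketch of the iteration in Eq. (10) is given below. The toy network, the seed choice, the convergence threshold `tol`, the L1 norm used to measure the difference between successive vectors, and the function names are assumptions for illustration; in practice the matrix would be the metabolite similarity network.

```python
def column_normalize(adj):
    """Scale each column of the adjacency matrix so that it sums to 1."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(adj, seeds, gamma=0.85, tol=1e-10, max_iter=1000):
    """Iterate P_{t+1} = (1 - gamma) * A * P_t + gamma * P_0 until convergence."""
    n = len(adj)
    a = column_normalize(adj)
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)  # equal initial probability on each seed node
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1.0 - gamma) * sum(a[i][j] * p[j] for j in range(n)) + gamma * p0[i]
            for i in range(n)
        ]
        diff = sum(abs(x - y) for x, y in zip(nxt, p))  # L1 change between steps
        p = nxt
        if diff < tol:
            break
    return p


# Hypothetical undirected network with edges 0-1, 0-2, 1-2, 2-3; node 0 is the seed.
network = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

scores = random_walk_with_restart(network, seeds=[0])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Nodes are ranked by their final probability, so non-seed nodes with high scores are the candidates most closely connected to the seeds.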