In silico ranking of phenolics for therapeutic effectiveness on cancer stem cells

Background: Cancer stem cells (CSCs) can self-renew, differentiate into defined progenies, and initiate tumor growth. Cancer treatments include drugs, chemotherapy, radiotherapy, or a combination of these, yet these therapeutic strategies often fail. One possible reason is that the stem-like properties of CSCs make them more dynamic and complex and may cause therapeutic resistance. Another limitation is the side effects associated with chemotherapy or radiotherapy. To explore better or alternative treatment options, the current study investigates natural drug-like molecules that can be used as CSC-targeted therapy. Among natural products, the anticancer potential of phenolics is well established. We collected 21 phytochemicals from the phenolic group and their interacting CSC genes from publicly available databases. A bipartite graph was then constructed with the collected CSC genes as one set and their interacting phenolics as the other. This graph was transformed into a weighted bipartite graph by considering the interaction strength between the phenolics and the CSC genes. The CSC genes were also weighted by two scores, the Disease Specificity Index (DSI) and the Disease Pleiotropy Index (DPI). For each gene, the DSI score reflects how specific its relationship with a disease is, and the DPI score reflects its association with multiple diseases. Finally, a ranking technique based on the PageRank (PR) algorithm was developed to rank the phenolics.
Results: We collected 21 phytochemicals from the phenolic group and 1118 CSC genes. The top ranked phenolics were evaluated by their molecular and pharmacokinetic properties and disease association networks. We selected the top five ranked phenolics (Resveratrol, Curcumin, Quercetin, Epigallocatechin Gallate, and Genistein) for further examination of their oral bioavailability through molecular properties, their drug likeness through pharmacokinetic properties, and their associated network with CSC genes.
Conclusion: Our PR-based ranking approach is useful for ranking the phenolics that are associated with CSC genes. Our results suggest that some phenolics are potential molecules for CSC-related cancer treatment.


Background
Cancers diagnosed at an early stage can often be cured through conventional treatments such as surgery, chemotherapy and radiotherapy [1][2][3][4]. However, cancers diagnosed at a later stage are more progressive and aggressive, and they often metastasize to multiple organs. While significant progress has been made in diagnosis and surveillance, this has not done much to improve overall cancer survival rates [5,6]. Even when cancer is diagnosed and treated at an early stage, not all cancer cells can be killed, and tumor recurrence is frequently reported. When a tumor recurs, the cancer becomes more aggressive and metastatic [7][8][9]. Growing evidence [10][11][12] indicates that these residual cells play a crucial role in therapeutic resistance and possess the property of self-renewal (stem-like properties); they are known as cancer stem cells (CSCs). CSCs behave in the same way as normal stem cells. Moreover, they have multi-differentiative potential and the capability of generating multiple cancer cell types that eventually develop into tumors. The self-renewal property of CSCs enables them to give rise to other types of malignant cells [13,14]; therefore, they can be described as phenotypically and functionally diverse immortal tumor cells. Such cells have been found in various types of human tumors and might be attractive targets for cancer treatment [11,12,15,16,17]. CSCs generally make up just 1% to 5% of all cells in a tumor [18]. Most CSCs are believed to be resistant to chemotherapy or radiotherapy, indicating that CSCs play an important role in cancer relapse and metastasis. Because CSCs have diverse and still undiscovered characteristics, novel, diverse, and multi-targeted approaches to cancer treatment need to be developed. In practice, however, clinicians are still struggling to find CSC-targeting therapies with no or limited side effects.
The currently available treatment options for cancer are surgery, radiation therapy and chemotherapy. More recently, systemic chemotherapy [2,19,20,21] has become a popular option for cancer treatment. Along with cancer cells, however, healthy cells are also damaged by chemotherapeutic drugs, which may cause side effects in patients. The lack of major progress in molecular targeted therapies has led researchers to explore the prospects of natural anticancer agents from plants, known as phytochemicals. Over the years, phytochemicals have become a major topic of research because of their natural healing capability. For diseases such as cancer, they have been reported to have the potential to target heterogeneous populations of cancer cells and CSCs. Moreover, they are capable of targeting the key signaling pathways of cancer while leaving normal cells intact or causing minimal toxicity. However, laboratory-based experiments for identifying the drug targets of natural products are expensive, labor intensive, and time consuming. Computational approaches for drug (phytochemical) ranking can therefore greatly speed up the traditional drug discovery process [15,22] and can provide potential candidates for follow-up experimental validation. To date, there remains a strong need for systematic and comprehensive computational approaches to identify and validate phytochemicals for cancer cells.
In this study, CSC genes and their interacting phytochemicals from the phenolic group are systematically collected and curated from available databases. A bipartite graph is then built from the collected data, in which the CSC genes form one disjoint independent set and the interacting phytochemicals form the other. The graph is weighted according to the interaction strength between the phenolics and the CSC genes. Two different metrics are used to weight the CSC genes: DSI, which indicates the extent to which a gene is specific to a disease, and DPI, which indicates the association of a gene with a set of diseases (pleiotropy). After the weighted bipartite graph is formed, a ranking technique based on PageRank (PR) is applied to rank the phenolics according to their influence on the CSC genes. Different datasets and platforms are used to validate the resulting phenolics.

Methods
CSCs, like all stem cells, are unspecialized and can divide and renew themselves, as well as give rise to specialized cells. These stem cells make up a small proportion of a tumor and can replicate tumor cells; thus, they may lead to tumor growth and migration. They can be left behind even after a course of cancer treatment is completed, allowing the tumor to recur and spread around the body. Given the difficulty of treating CSCs, natural products may be a reliable option for discovering novel treatments. Work on CSCs is still in its early stages and is currently taking place primarily in research laboratories. Early clinical trials are aimed at developing effective anti-cancer strategies. Because few experiments have been carried out, CSC-related databases [23] are also rare, and those that exist contain little CSC-related information.

CSC related genes data
We collected 1118 CSC-related genes from the CSCdb database (https://bioinformatics.ustc.edu.cn/cscdb) [23]. CSCdb is a literature-based database (compiled from about 13,000 articles) that is useful for CSC-related research. The database contains CSC marker genes, CSC-related genes and their functional annotations. It could be an important resource for finding new CSCs and their potential therapeutic targets. CSCdb provides complete information on 1769 genes that have been found to be associated with the functional regulation of CSCs. In addition, 74 marker genes along with 9475 annotations on 13 CSC-related functions have been reported.

Phenolics data
In addition to the common cancer treatments (surgery, radiotherapy and chemotherapy), systemic chemotherapy has become an alternative cancer treatment. Two common problems associated with chemotherapy are drug resistance and toxicity: chemotherapy damages healthy cells, causing them to secrete proteins that accelerate the growth of cancer and promote drug resistance in patients. To address these limitations of cytotoxic chemotherapy, researchers are keenly interested in natural products, as some recent studies have demonstrated their chemo-protective properties, such as anticancer properties [15]. Natural therapies, such as the use of plant-derived products in cancer treatment, may reduce adverse side effects. Currently, a few plant products are being used to treat cancer. The list of phytochemicals is collected from the literature [24,25]. Different groups of phytochemicals are available from different natural products. In this paper, only 21 phenolics are considered; they are listed in Table 1. We then searched for these 21 phenolics in the PCIDB database [26]. For each phenolic, the interacting genes are collected, along with the number of interactions between the phenolic and each gene. According to the literature [22,24,27], satisfactory clinical evidence supports the anticancer effects of Allium sativum, camptothecin, curcumin, green tea, Panax ginseng, resveratrol, Rhus verniciflua and Viscum album. The experiments on natural products clearly show that they can be used as complementary therapeutics against various types of cancer.

DisGeNet
DisGeNet is a database that assigns scores to genes according to various metrics [28]. Here, the DSI and DPI scores of each gene are considered. The DSI score indicates how specific a gene is to a disease. For example, if a gene is associated with very many diseases, its DSI score approaches 0; if a gene is associated with only one or a few diseases, its DSI score approaches 1. It is calculated according to Eq. 1, where N_d is the number of diseases associated with the gene and N_T is the total number of diseases in DisGeNet. The DPI score of a gene approaches 1 if the gene is associated with many different classes of diseases and 0 if its diseases fall in the same class. It is calculated according to Eq. 2, where N_dc is the number of different MeSH disease classes of the diseases associated with the gene and N_TC is the total number of MeSH disease classes in DisGeNet.
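As a rough illustration of the two scores, the sketch below assumes the forms usually given in the DisGeNET documentation, DSI = log2(N_d / N_T) / log2(1 / N_T) and DPI = N_dc / N_TC. The function names and the unscaled DPI ratio are assumptions made here (DisGeNET may report DPI multiplied by 100); the unscaled ratio is used so that the values stay in the 0 to 1 range described above.

```python
import math


def dsi(n_d, n_t):
    """Disease Specificity Index (assumed form): 1 for a gene linked to a
    single disease, 0 for a gene linked to every disease in the database."""
    if n_t < 2 or not 1 <= n_d <= n_t:
        raise ValueError("need n_t >= 2 and 1 <= n_d <= n_t")
    return math.log2(n_d / n_t) / math.log2(1 / n_t)


def dpi(n_dc, n_tc):
    """Disease Pleiotropy Index as an unscaled ratio (an assumption):
    fraction of MeSH disease classes covered by the gene's diseases."""
    if n_tc < 1 or not 0 <= n_dc <= n_tc:
        raise ValueError("need n_tc >= 1 and 0 <= n_dc <= n_tc")
    return n_dc / n_tc
```

A gene linked to one disease out of 1000 gets a DSI of 1, while a gene whose diseases span 5 of 20 disease classes gets a DPI of 0.25.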

PageRank (PR)
PR, invented by Google founders Larry Page and Sergey Brin, is a way of measuring the importance of website pages [29]. It is an algorithm used by Google Search to rank websites in its search engine results. Essentially, PR does not rank web sites as a whole, but is determined for each page individually. Further, the PR of page A is recursively defined by the PR of those pages which link to page A. When site A links to a web page, Google considers this as site A endorsing, or casting a vote for, that page. Google takes all of these link votes (i.e., the website's link profile) into consideration to draw conclusions about the relevance and significance of individual webpages and of the website as a whole. This is the basic concept behind PR. In short, PR is a "vote" by all the other pages on the Web regarding how important a page is. A link to a page counts as a vote of support. When there is no link, there is no support (but this is an abstention from voting rather than a vote against the page). In the original Google paper [29], PR is defined as in Eq. 3:
$$PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right) \tag{3}$$
where PR(A) is the PR of page A, PR(T_i) is the PR of a page T_i that links to page A, and C(T_i) is the number of outgoing links on page T_i, as each page spreads its vote out evenly amongst all its outgoing links. The number of outgoing links is C(T_1) for page 1, C(T_n) for page n, and so on for all pages, and d is a damping factor which can be set between 0 and 1; it is usually set to 0.85. Note that the PR values form a probability distribution over pages, so the sum of all web pages' PR will be 1. PR(A) can be calculated using a simple iterative algorithm and corresponds to the principal eigenvector of the normalized link matrix of the web. This means that a page's PR can be calculated without knowing the final PR values of the other pages: in each run, the calculation gets closer to the final value, so the calculation is repeated until the numbers change by less than a threshold value.
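The iterative calculation described above can be sketched as follows. This is a minimal illustration, not Google's implementation: it uses the normalized variant in which the constant term is (1 − d)/N, so that the scores sum to 1, and it spreads the score of pages without outgoing links evenly over all pages. Both choices, as well as the function name and its parameters, are assumptions of this sketch.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterative PageRank on a dict {page: [pages it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}  # uniform starting guess
    for _ in range(max_iter):
        # Score held by pages with no outgoing links, shared by everyone.
        dangling = sum(pr[v] for v in nodes if not links.get(v))
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for src, targets in links.items():
            if targets:
                share = d * pr[src] / len(targets)  # vote split evenly
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[v] - pr[v]) for v in nodes)
        pr = new
        if delta < tol:  # numbers have stopped changing
            break
    return pr


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy_web)
```

In this toy web, page C receives votes from both A and B and therefore ends up with the highest score, followed by A and then B.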

Preprocessing of the dataset
Assume that p = {p_1, p_2, …, p_m} is the list of phenolics, whose sets of interacting genes are G_p = {G_p1, G_p2, …, G_pm}, where G_p1, G_p2, …, G_pm are the gene sets that interact with p_1, p_2, …, p_m, respectively. CSC genes are collected from the CSCdb database. Suppose q CSC genes are collected and described as CSC = {cg_1, cg_2, …, cg_q}. We then take the genes common to each interacting set and the CSC gene set, which generates G_p1 ∩ CSC, G_p2 ∩ CSC, …, G_pm ∩ CSC, denoted G_1, G_2, …, G_m. Next, out of these gene sets, the common gene set of s genes is taken out, and these genes are searched in the DisGeNet database to collect their scores. A few of them do not have scores and are therefore excluded from the set. Finally, n genes (n < s) are retained for further processing.
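The preprocessing steps above amount to a few set operations. The sketch below is one possible reading: the "common gene set" is interpreted as the pool of CSC genes that interact with at least one phenolic, which is an assumption, and the function name, argument names and gene symbols in the example are purely illustrative.

```python
def preprocess(interactions, csc_genes, scored_genes):
    """interactions: {phenolic: iterable of interacting gene symbols}
    csc_genes: genes collected from the CSC database
    scored_genes: genes for which DSI/DPI scores are available
    """
    csc = set(csc_genes)
    # G_i = G_pi ∩ CSC for every phenolic p_i
    per_phenolic = {p: set(genes) & csc for p, genes in interactions.items()}
    # Pool the s CSC genes reached by any phenolic (assumed interpretation).
    pooled = set().union(*per_phenolic.values())
    # Drop genes without scores, leaving the final n genes.
    kept = pooled & set(scored_genes)
    filtered = {p: genes & kept for p, genes in per_phenolic.items()}
    return filtered, kept


example_edges, example_genes = preprocess(
    {"resveratrol": ["TP53", "SOX2", "ALB"], "curcumin": ["SOX2", "NANOG"]},
    csc_genes=["TP53", "SOX2", "NANOG"],
    scored_genes=["TP53", "SOX2"],
)
```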

Proposed method
From the collected datasets, a weighted bipartite graph is constructed in which one set is the set of phenolics (i.e., p) and the other is the set of n genes. The edges are weighted according to the number of ways in which a phenolic interacts with a gene. These weights are normalized using the mean and standard deviation, and the absolute values of the normalized weights are used. The n genes are also weighted in terms of their DSI and DPI scores. Given this weighted bipartite graph, the job of the algorithm is to rank the phenolics. This is where the concept of PageRank is used to build our model. Starting with a random ranking of the phenolics, the edge weights and gene weights are used to recalculate the ranks, which gradually settle into the final ranks of the phenolics. The critical question is when to stop recalculating: the ranks are recalculated until no change is found between the last two rankings. A pictorial description of the proposed method is shown in Fig. 1.
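The procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the exact method: each phenolic's rank is assumed to be updated multiplicatively by the sum of edge weight times gene weight over its neighbouring genes, any further normalization term in the update is left out, the ranks are rescaled to sum to 1 after each pass to keep the numbers stable, and a single precomputed weight per gene stands in for whatever combination of DSI and DPI is used. All function and parameter names are hypothetical.

```python
import random
from statistics import mean, pstdev


def abs_zscores(counts):
    """Absolute z-scores of raw interaction counts
    (population standard deviation; an assumption)."""
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return [0.0] * len(counts)
    return [abs((c - mu) / sigma) for c in counts]


def rank_phenolics(edges, gene_weight, init=None, max_iter=100, seed=0):
    """edges: {phenolic: {gene: normalized edge weight}}
    gene_weight: {gene: weight derived from DSI and DPI}
    init: optional {phenolic: starting rank}; random in [0, 1) if omitted.
    """
    rng = random.Random(seed)
    rank = dict(init) if init is not None else {p: rng.random() for p in edges}
    # Assumed update factor: sum of edge weight * gene weight per phenolic.
    factor = {p: sum(w * gene_weight[g] for g, w in nbrs.items())
              for p, nbrs in edges.items()}
    order = sorted(rank, key=rank.get, reverse=True)
    for _ in range(max_iter):
        rank = {p: rank[p] * factor[p] for p in rank}
        total = sum(rank.values())
        if total > 0:
            rank = {p: r / total for p, r in rank.items()}  # rescale
        new_order = sorted(rank, key=rank.get, reverse=True)
        if new_order == order:  # last two rankings agree, so stop
            break
        order = new_order
    return order, rank


toy_order, toy_rank = rank_phenolics(
    {"A": {"g1": 1.0, "g2": 0.5}, "B": {"g1": 0.2}},
    gene_weight={"g1": 0.5, "g2": 0.4},
    init={"A": 0.2, "B": 0.8},
)
```

In the toy example, phenolic B starts with the higher random rank, but A has stronger weighted links to the genes and overtakes it after the first pass; the second pass leaves the ranking unchanged, so the loop stops.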
Rank calculation: Let p_1 be a phenolic that is initially given a random rank r_1. A random value between 0 and 1 is generated for each phenolic; sorting these values in non-increasing order produces the initial ranking of the phenolics. Thus, r_1 is a value between 0 and 1. If phenolic p_1 interacts with x genes with edge weights w_1, w_2, …, w_x, and these x genes have the weights gw_1, gw_2, …, gw_x given by DSI and DPI, then the new rank of phenolic p_1 is calculated as
$$r_{new} = r_1 \sum_{i=1}^{x} w_i \cdot gw_i \,/\, \left|\mathrm{normalized}(x)\right|$$

Results and discussion
Among the 21 phenolics, the top five are Resveratrol, Curcumin, Quercetin, Epigallocatechin Gallate and Genistein. For demonstration purposes, only these top ranked phenolics are studied for their oral bioavailability through molecular properties, their drug likeness through pharmacokinetic properties, and their associated network with CSC genes.

Calculation of molecular properties
All the calculated parameters, namely molecular weight, logP, the number of rotatable bonds, polar surface area, the numbers of hydrogen bond donors and acceptors, Lipinski rule violations, aromatic rings and heavy atoms, are thought to be associated with the molecular flexibility, oral bioavailability, solubility and permeability of drugs, which are the basic requirements for any drug to have good pharmacokinetic parameters. These properties are calculated from ChEMBL, a large bioactivity database [30]. The molecular weight describes the molecular flexibility and oral bioavailability. As summarized in Table 2, the molecular weights of the five phenolics are 228.25, 368.39, 302.24, 458.38 and 270.24, respectively. This information indicates that the top ranked phenolics have high molecular flexibility as well as oral bioavailability. Molecular flexibility has been seen to correlate with molecular weight, that is, larger compounds tend to be more flexible. The logP is a measure of the lipophilicity of a compound; for all five phenolics, the logP values are greater than or equal to 2 but less than 5. A rotatable bond is defined as any single bond, not in a ring, bound to a nonterminal heavy (i.e., non-hydrogen) atom. The majority of the compounds have seven or fewer rotatable bonds, which is associated with higher oral bioavailability according to the literature [31]. Polar Surface Area (PSA) characterizes drug absorption, including intestinal absorption and bioavailability; the five phenolics have high PSA, especially Epigallocatechin Gallate (197.37). It has been established in the literature [31] that 12 or fewer Hydrogen Bond (H-Bond) Acceptors (HBA) and H-Bond Donors (HBD) are favorable for high oral bioavailability. In this study, we found that the top ranked phenolics have fewer than 12 HBAs and HBDs.
The Lipinski rule of 5 is based on five criteria, namely molecular mass, high lipophilicity (logP), hydrogen bond donors, hydrogen bond acceptors and molar refractivity. Except for Epigallocatechin Gallate, none of the top ranked phenolics violates the Lipinski rule of 5. It has been well established that more than three aromatic rings in a molecule correlate with poorer drug developability [32]. All the top five phenolics have 3 or fewer aromatic rings, indicating their druggability.
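A rule-of-five check of this kind can be sketched as a simple count of violated thresholds. The sketch below uses the four commonly cited limits (molecular weight at most 500 Da, logP at most 5, at most 5 H-bond donors, at most 10 H-bond acceptors); molar refractivity is not checked here. The function name is hypothetical, and apart from the molecular weights quoted above, the descriptor values in the example are illustrative placeholders rather than the values of Table 2.

```python
def lipinski_violations(mol_weight, logp, hbd, hba):
    """Count how many of the four classic rule-of-five limits are exceeded."""
    checks = [
        mol_weight > 500,  # molecular weight in Da
        logp > 5,          # lipophilicity
        hbd > 5,           # hydrogen bond donors
        hba > 10,          # hydrogen bond acceptors
    ]
    return sum(checks)


# Molecular weights are taken from the text; the other numbers are placeholders.
small_molecule = lipinski_violations(228.25, logp=3.0, hbd=3, hba=3)
polyhydroxylated = lipinski_violations(458.38, logp=2.0, hbd=8, hba=11)
```

With these placeholder descriptors, the first molecule passes every check, while the second exceeds both hydrogen bond limits.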

Resveratrol
Resveratrol belongs to the stilbenoids. It is a stilbenol, that is, a stilbene in which the phenyl groups are substituted at positions 3, 5, and 4′ by hydroxy groups. The chemical structure of resveratrol is given in Fig. 2. It has anticancer properties, inhibits lipid peroxidation of low-density lipoprotein (LDL) and prevents the cytotoxicity of oxidized LDL [33]. Resveratrol also increases the activity of some antiretroviral drugs in vitro.

Curcumin
Curcumin belongs to the diarylheptanoids. It is a beta-diketone, namely methane in which two of the hydrogens are substituted by feruloyl groups. It is a natural dyestuff found in the root of Curcuma longa. Curcumin has antioxidant, anti-inflammatory, antiviral and antifungal actions [34,35]. The chemical structure of curcumin is given in Fig. 3.

Quercetin
Quercetin is a flavonoid. It is a pentahydroxyflavone with the five hydroxy groups placed at the 3-, 3′-, 4′-, 5- and 7-positions. It is one of the most abundant flavonoids in edible vegetables, fruit and wine. Its health effects include an improvement of cardiovascular health, a reduced risk of cancer, and protection against osteoporosis. This phytochemical has anti-inflammatory, anti-allergic and antitoxic effects [36]. The chemical structure of quercetin is shown in Fig. 4.

Epigallocatechin gallate
Epigallocatechin gallate belongs to the flavan-3-ols. It is a gallate ester obtained by the formal condensation of gallic acid with the (3R)-hydroxy group of (-)-epigallocatechin. A number of chronic diseases have been associated with free radical damage, including cancer, arteriosclerosis, heart diseases and accelerated aging [37]. Epigallocatechin gallate interferes with many enzyme systems: it inhibits fatty acid synthase (fast-binding and reversible), increases tyrosine phosphorylation of the insulin receptor, and activation of ornithine decarboxylase. The chemical structure of epigallocatechin gallate is given in Fig. 5.

Genistein
Genistein is an isoflavone, namely 7-hydroxyisoflavone with additional hydroxy groups at positions 5 and 4′. It is a phytoestrogenic isoflavone with antioxidant properties. It acts as a phytoestrogen, antioxidant and anti-cancer agent, and it could help people with metabolic syndrome [38]. The chemical structure of genistein is given in Fig. 6.

Drug likeness analysis
The pharmacokinetic properties of a chemical indicate the drug-like ability of a molecule and are therefore an important consideration. These pharmacokinetic properties are calculated on the pkCSM platform [39]. The water solubility of a compound (logS) reflects the solubility of the molecule in water at 25 °C and is given as the logarithm of the molar concentration (log mol/L). A compound is considered to have high Caco-2 permeability if it has a P_app > 8 × 10^−6 cm/s, which corresponds to a predicted value > 0.90. From Table 3, it is clear that most of the phenolics have a Caco-2 permeability greater than 0.90. For a given compound, the intestinal absorption predicts the percentage that will be absorbed through the human intestine. A molecule with an absorbance of less than 30% is considered to be poorly absorbed. All the top five ranked phenolics from our experiment have intestinal absorption values greater than 30%. A compound is considered to have relatively low skin permeability if it has logKp > −2.5. The outcome of our experiment shows that all the top five ranked phenolics have logKp > −2.5. The ability of a drug to cross into the brain is an important measure for reducing side effects. Blood-brain permeability is measured as the logarithmic ratio of the brain to plasma drug concentrations (logBB). For a given compound, a logBB > 0.3 is taken to mean that it readily crosses the blood-brain barrier (BBB), while molecules with logBB < −1 are poorly distributed to the brain (Table 3).
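The cut-offs quoted above can be collected into a small helper for reading predicted values. This is only a sketch of the thresholds as stated in the text: the function name, the dictionary keys and the "intermediate" label for logBB values between −1 and 0.3 are assumptions, and the example inputs are hypothetical rather than taken from Table 3.

```python
def interpret_adme(caco2, intestinal_abs_pct, log_kp, log_bb):
    """Apply the cut-offs quoted in the text to predicted ADME values."""
    if log_bb > 0.3:
        bbb = "readily crosses"
    elif log_bb < -1.0:
        bbb = "poorly distributed"
    else:
        bbb = "intermediate"  # label chosen here for the in-between range
    return {
        "high_caco2_permeability": caco2 > 0.90,
        "poorly_absorbed": intestinal_abs_pct < 30.0,
        "low_skin_permeability": log_kp > -2.5,
        "bbb": bbb,
    }


# Hypothetical predicted values for a single compound.
profile = interpret_adme(caco2=1.2, intestinal_abs_pct=90.0,
                         log_kp=-2.7, log_bb=-0.5)
```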

Association with CSC genes
To find the association between the top ranked phenolics and CSC genes, the Comparative Toxicogenomics Database (CTD) [40] has been used. As shown in Table 4, the majority of them are associated with prostatic neoplasms, breast neoplasms, hepatocellular carcinoma, stomach neoplasms, and colorectal neoplasms, as sorted by their inference score. The table also shows the association between the phenolics and diseases (neoplasms class) through the cancer genes in the inference network, and the CSC genes involved are tabulated as well. The inference score of the network and the references are collected from the CTD database. All the phenolics interact with the highest number of CSC genes in breast neoplasms. The biological relevance of the top ranked phenolics is described in Table 5. To assess biological relevance computationally, the top ten interacting genes, the top five pathways with their p-values and the top five GO terms with their p-values are collected from the CTD database. It is clear from the table that most of the top interacting genes are cancer related; however, it is still unknown whether they are also CSC related. Nevertheless, many studies have examined combinations of drugs targeting different CSC genes [41][42][43][44].
The entire study has been carried out computationally. From dataset collection to validation of the top ranked phenolics, our results rely on information from different databases and the literature. In future, we will extend the study to cover not only CSC-related genes but also their druggability.

Conclusions
The phenolics have already been reported to have significant anti-cancer potential. Here, we further explored them from a mechanistic perspective as potential anti-cancer lead molecules for CSC genes. Computationally, a bipartite graph has been formed in which one group is the set of collected CSC genes and the other group is the set of interacting phenolics. The edges represent the interactions and are weighted according to the strength of interaction between the phenolics and the CSC genes. The CSC genes are also weighted by two metrics, namely DSI and DPI. A ranking technique inspired by the PR algorithm has then been developed to rank the phenolics, although other ranking algorithms (e.g., matrix factorization) could also be applied. The ranks of the phenolics indicate their association with the CSC genes. From data collection to validation, several databases have been used. In this study, a few phytochemicals have been tested and validated computationally for their strong effects on CSCs. Further efforts should be made to experimentally validate their potential to target CSCs, their toxicity and their druggability. The associated pathways for all the top ranked phenolics are related to cancer, the immune system, metabolism, signal transduction, etc. Moreover, the low p-values associated with the pathways indicate the statistical significance of the association between the phenolics and those pathways. The low p-values of the GO terms, evident from the table, indicate that the resulting phenolics are statistically significant and were not selected at random. As future work, we will extend the current study by including combinations of drugs targeting different CSC genes, as well as by collecting more data for a larger number of phenolics.