Incorporating topological information for predicting robust cancer subnetwork markers in human protein-protein interaction network

Background Discovering robust markers for cancer prognosis based on gene expression data is an important yet challenging problem in translational bioinformatics. By integrating additional information in biological pathways or a protein-protein interaction (PPI) network, we can find better biomarkers that lead to more accurate and reproducible prognostic predictions. In fact, recent studies have shown that, “modular markers,” that integrate multiple genes with potential interactions can improve disease classification and also provide better understanding of the disease mechanisms. Results In this work, we propose a novel algorithm for finding robust and effective subnetwork markers that can accurately predict cancer prognosis. To simultaneously discover multiple synergistic subnetwork markers in a human PPI network, we build on our previous work that uses affinity propagation, an efficient clustering algorithm based on a message-passing scheme. Using affinity propagation, we identify potential subnetwork markers that consist of discriminative genes that display coherent expression patterns and whose protein products are closely located on the PPI network. Furthermore, we incorporate the topological information from the PPI network to evaluate the potential of a given set of proteins to be involved in a functional module. Primarily, we adopt widely made assumptions that densely connected subnetworks may likely be potential functional modules and that proteins that are not directly connected but interact with similar sets of other proteins may share similar functionalities. Conclusions Incorporating topological attributes based on these assumptions can enhance the prediction of potential subnetwork markers. We evaluate the performance of the proposed subnetwork marker identification method by performing classification experiments using multiple independent breast cancer gene expression datasets and PPI networks. We show that our method leads to the discovery of robust subnetwork markers that can improve cancer classification. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1224-1) contains supplementary material, which is available to authorized users.


Introduction
In this work, we focus on one of the problems in translational genomics which is the identification of biomarkers from microarray gene expression data to classify type or state of complex disease. This problem is generally challenging and practically difficult because it normally involves with: 1) Small sample size of clinical data, 2) Large number of potential markers, and 3) Heterogeneity across patient and samples.
Several studies have been working on identifying gene markers which are selected based solely on gene expression data. These markers have shown to be useful to build classifiers for disease prediction. However, there are some limitations of these gene-based markers. For example, given two large-scale-dataset studies of breast cancer metastasis [1,2]. Both studies tried to find out what would be the gene markers to look at in order to estimate the risk of cancer metastasis. Both of them identified around 70 gene markers with 60-70 % of accuracy. However, they shared only 3 genes in common from 55 of possible genes that might share across two platforms [3]. These genebased markers yielded low performance on cross-dataset experiments. Afterward, many studies have been proposed to improve prediction accuracy and reproducibility of the identified biomarkers.
As cancer is a complex disease which its progression involves dysregulation of multiple genetic processes, there is an alternative approach based on the assumption that genes which are known to be in common pathways [4][5][6][7][8] or genes whose protein products are functionally related in protein-protein interaction (PPI) networks [9][10][11] should be interpreted together as a single feature. This approach analyzes gene expression data at "modular" level by integrating biological information, such as known molecular pathways or PPI networks. Many studies have shown that this "integrative approach" tends to be more robust than single gene markers and may improve classification accuracy.
This approach has drawn the attention to several studies to find what might be the effective way to integrate the expression of genes that belong to the same module. Several ideas have been proposed such as using mean or median, sum, or difference of the expression levels of the gene that belong to the same modules as modular activity. PPI network has been shown to overcome the limited numbers of known pathway information. Chuang et al. [9], one of the first studies in this field, proposed a greedy search algorithm for finding discriminative subnetwork markers. Su et al. [10] proposed dynamic programming method to identify and greedily combined paths containing differentially expressed and coexpressed genes to obtain subnetwork markers for predicting breast cancer metastasis. More recently, in our previous work [11], we utilized a message-passing clustering algorithm to identify subnetwork markers with high-accuracy disease prediction. The method is capable to simultaneously predict multiple non-overlapping subnetwork markers which may lead to cover more genes with lower computational cost compared to the existing methods.
With these advantages, we adopt our previous messagepassing based approach while incorporating the topological information from the PPI network to identify the potential functional modules-or subnetworks. Initially, we adopt widely made assumptions that densely connected subnetworks may likely be potential functional modules and that proteins that are not directly connected but interact with similar sets of other proteins may share similar functionalities. We employ association indices to estimate the topological information.
Association indices have been shown to be one of powerful tools for measuring similarity between genes [12]. For example, Jaccard index has been successfully used to measure neighborhood similarity for clustering and constructing Power Graph in the work of Royer et al. [13].
In this paper, we propose a novel method for incorporating PPI network topological information to enhance identification of subnetwork markers for predicting cancer prognosis. We utilize various association coefficients to estimate the topological similarity and also apply different approaches to integrate into our previous messagepassing based method. We assess the identified subnetwork markers and evaluate their discriminative power and their classification performance through experiments using publicly available independent breast cancer gene expression datasets and PPI networks.

Datasets
In this study, we obtained two independent breast cancer microarray gene expression datasets from the public domain, which we refer to as GSE2034 [2] and NKI295 [14]. GSE2034 was profiled on the Affymetrix U133a platform (GPL96) and downloaded from the Gene Expression Omnibus (GEO) database [15]. NKI295 was profiled on Agilent Hu25K platform and downloaded from the supplement information from Chang et al. [16]. We used both datasets as published by their original studies. GSE2034 contains expression profiles of 286 breast cancer patients, NKI295 contains expression profiles of 295 patients. For 108 patients in GSE2034 and 78 patients in NKI295, metastasis had been detected within 5 years of surgery. We labeled them as "metastatic", while the remainder was labeled as "non-metastatic".
Four publicly available human PPI networks were used in this study which we refer to as Chuang, HPRD, GASO-LINE, and BioGRID. Chuang was obtained from a previous study by Chuang et al. [9]. HPRD was downloaded from the Human Protein Reference Database Release 9 [17]. GASOLINE was obtained from the work of Micale et al. [18]. It was derived from STRING database [19] considering only experimentally verified protein interactions. BioGRID was downloaded from the Biological General Repository for Interaction Datasets version 3.4.134 (Homo Sapiens) [20]. We did not combine all the PPI networks because they were compiled based on different criteria and domain of interest. Table 1 shows the number of unique proteins and interactions for each PPI network. BioGRID contains the largest number of interactions while HPRD contains the largest number of proteins.
We overlaid the gene microarray datasets with each PPI network by mapping each gene to its corresponding protein in the network. After removing the proteins that do not have corresponding genes in both gene expression datasets, we obtained an induced networks with the statistics shown in Table 2. After data integration, the numbers of proteins are quite similar to each other. BioGRID still contains the largest number of interactions while the others contain approximately the same.

Affinity propagation-based subnetwork identification
We adopt the subnetwork identification procedure from our previous study [11], where we utilized a messagepassing clustering algorithm, called affinity propagation, to cluster genes whose protein products interact with each other or are closely located in PPI network. The input of this clustering algorithm is the measure of similarity between genes. We originally defined the similarity of genes based entirely on the discriminative power to distinguish between the two class labels as follows: where t i , and t k are t-test statistics score of the loglikelihood ratio (LLR) between metastatic and nonmetastatic samples of genes i, and k, respectively. t ik is the t-test score of the summation of the LLRs of genes i, and k.
The LLR, λ, of gene i, λ(x i ), is based on probabilistic inference strategy proposed in [7] and it is computed by  where x i is the expression level of the gene i and f j (x i ) is the conditional Gaussian probability density function of x i under phenotype j. The last term is the penalty term measured by the difference between discriminative power of considering genes. The parameter, α, is defined between [ 0, 1] to control this term. It is shown in our previous work [11] that the size of the network decreases as α gets larger. It is because a larger α tends to cluster genes with similar discriminative power. As a result of that, it yields small subnetworks with fewer genes.
The Eq. 1 is based on original assumptions that when considering similarity between two genes, the gene itself should have high discriminative power, combining both genes as subnetwork should increase the overall discriminative power, and both genes should have similar discriminative power.

Incorporating topological information for computing the similarity between genes
With the assumption that the proteins corresponding to the genes in the same subnetwork should have common topological attributes, we consider two following points: • Densely connected subnetworks may likely be potential functional modules. • Proteins that are not directly connected but interact with similar sets of other proteins may share similar functionalities.
Based on these considerations, we incorporate the topological information of proteins in the PPI network by measuring their association coefficient-or topological similarity. We measure topological attribute using different types of association coefficients. Let N i and N k be the neighborhood binary vectors of protein i and k. We define the topological similarity between proteins i and k, s T (i, k), based on different similarity indexes as follows: 1. Jaccard index: We define topological similarity, Jaccard index is widely used to quantify the similarity 2. Kulczyński index: This measure, s T K (i, k), represents the average proportion of the number of common neighbors to the total number of neighbors of each protein. It is given by 3. Tversky index: We define topological similarity based on Tversky index, s T T (i, k), as In order to indicate the direction of similarity (asymmetric similarity), we let a T T = 1 and b T T = 0. This asymmetric definition lets the exemplars of the identified clusters be more densely connected than other non-exemplars. We can rewrite the equation as followings Tversky index can be viewed as a general form of Tanimoto coefficient (Jaccard index) when a T T = 1 and b T T = 1, and Dice coefficient when a T T = 0.5 and b T T = 0.5.
We do not include other similarity indices whose results are in the same order (no alteration in the ranks) because they give the same output when applying affinity propagation. For example, Dice coefficient, |N i |+|N k | , and Jaccard index share similar results in terms of ranking. Ochiai index (or Cosine index), |N i ∩N k | √ |N i |·|N k | , and Geometric index, |N i |·|N k | provide the same ranks as of Kulczyński index. As we focus on retrieving topological information from the PPI network, we do not make use of the number of common non-neighbor proteins |¬N i ∩¬N k | in this study.
Finally, we add the topological similarity, (3), (4) and (6), to the computation of similarity between genes i and k, s(i, k), in two different ways.
1. Similarity between genes i and k, s(i, k), as a product of the topological similarity s T (i, k) and the discriminative power based similarity s DP (i, k). We define as: 2. Similarity between genes i and k, s(i, k), as a combination of the topological similarity s T (i, k) and the discriminative power based similarity s DP (i, k). We first scale the discriminative power based similarity s DP (i, k) into the range [ 0, 1] as same as topological similarity's bŷ where s DP is the set of all discriminative power based similarity of all gene pairs. Then, we combine them as follows where β =[ 0, 1] is used to control the magnitude between each similarity. Topological similarity, s T (i, k), has more effects as β increases. It should be noted that s(i, k) can be viewed as the summation of topological similarity and discriminative power based similarity when β = 0.5.
We use the same setting for preference as in [11]. The selfsimilarity is set to s(k, k) = c for all k, where s(i, k) ≤ c for only 1 % of all gene pairs (g i , g k ) to guarantee that every gene gets equal chance to be an exemplar at the initial stage of clustering process.

Probabilistic inference of subnetwork activity
To estimate the modular-or subnetwork-activity of identified subnetwork, we employ the probabilistic inference method proposed in [7] which is the aggregation of the LLRs of all member genes to represent the activity level of the subnetwork markers, A(G). It is computed by where x i is the expression level of the gene g i in the subnetwork G = g 1 , g 2 , . . . , g n . This inference method can be viewed as the aggregation of the probabilistic evidence of the expression level of genes in the subnetworks.

Experimental set-up
We identified subnetwork markers incorporating three different strategies to measure topological similarity which we referred to as Jaccard-based, Kulczyński-based, and Tversky-based. As mentioned previously, we used two different approaches to integrate topological similarity to measure similarity between genes: 1) Product of topological and discriminative power based similarity, namely, "product-based approach", and 2) Linear combination of topological and discriminative power based similarity, namely, "linear-combination-based approach". In the latter approach, we used three different values of β(= 0.25, 0.5, 0.75) to investigate the impact of topological similarity to the subnetwork identification. In fact, we can also setup the experiments the other way around to find the optimal the value of β for each data.
After computing similarity between genes and applying affinity propagation-based subnetwork identification, all output clusters were ranked based on the t-test statistics score of their activity level. Then we selected the top 50 clusters with high discriminative power as the potential subnetwork markers for assessing their classification performance.
We repeated these processes to both gene expression datasets and all four PPI networks.

Results
For comparison, we also evaluated the method proposed in [9], and [11] which we refer to as the 'greedy' method, and the ' AP-based' method, respectively. We applied the greedy method with 5 % minimum required improvement which is the same setting as originally published in [9]. In the AP-based method, we set the magnitude of the penalty term, α, to 0.5 by reason shown in [11] that it yields high and consistent classification performance as of smaller α with the smaller size of identified subnetworks compared to larger α. For simplicity in displaying Tables and Figures in this section, we abbreviate Jaccard-based, Kulczyński-based, and Tversky-based to jac, kul, and tve, respectively. The suffixes, _p, and _lc are appended to indicate productbased approach, and linear-combination-based approach, respectively. Table 3 shows the average size of top 50 highly discriminative subnetwork markers identified by each method on GSE2034 and NKI295. Each column shows the results for each PPI network. The average size of markers identified by product-based and linear-combination-based approach is similar to the original AP-based method. We can clearly see that the average size of top markers identified by the proposed method and AP-based is larger than the greedy-based.

Statistics of the subnetwork markers
As we can see from Table 3, the average size of top 50 highly discriminative subnetwork markers increases as the PPI network with larger number of interactions and unique proteins is used. This trend can be clearly seen when BioGRID is employed. Among productbased approach group, Tversky-based similarity, tve_p, yields larger subnetworks. In linear-combination-based approach, we can see that the average size decreases as β increases in most cases. However, we cannot see this trend distinctly in Tversky-based, tve_lc. The main reason is that Tversky-based similarity mostly provides higher similarity index compared with the others as it is designed  Table 6 The number of genes in top 50 highly discriminative subnetwork markers from tve_p method on GASOLINE categorized by their GO terms to indicate the direction of the similarity. For instance, when a gene shares all of its neighbors with another gene (|N i ∩ N k | = |N i |), it returns the maximum similarity (s T T (i, k) = 1), whereas the other topological similarities yield lower because they depend on the number of neighbors the both genes. As defined in Eq. 9, the clustering process relies more on topological information as β gets larger. Therefore, in this case, more genes tend to be clustered into the same subnetwork.
We can see the similar trends for the number of unique genes in top 50 discriminative subnetwork markers as shown in Table 4. We can also clearly see that the top markers identified by the proposed method and APbased cover more genes than the greedy-based. The larger unique genes covered show that the proposed method may increase the chance to discover genes that are not known to be related to the disease. This also means the higher probability of identifying new subnetwork and pathway.
Next, we studied the overlap between the top 50 highly discriminative subnetwork markers identified on different gene expression datasets. The proposed method yield larger overlap when comparing to all of the previous methods as shown in Table 5. Again, similar trends as in Table 3 can also be observed here. The larger overlaps show that more of common genes are covered and shared among identified subnetworks from independent dataset from different platforms. This may lead us to more robust classifiers, we demonstrate the robustness by providing classification performance charts showing that the experimental results from the proposed method are consistent in the next section.
Additionally, we analyzed enriched functions of the genes in the subnetwork markers using Panther [21], a web-based system designed to facilitate analysis of large numbers of genes and provide comprehensive function information which includes up-to-date comprehensive Gene Ontology (GO) annotations (GO database version 1.2, released 2016-05-20 with 44,588 total annotations). An example of the enrichment analysis of the top 50 highly discriminative subnetworks identified using tve_p method on GASOLINE is shown in Table 6. We can see that the genes in identified subnetworks from different gene expression datasets also share common GO terms.

Discriminative power of the subnetwork markers
We evaluated the discriminative power of the subnetwork markers based on the same procedure as previously used in these studies [6][7][8]10]. We computed the t-test score of the inferred subnetwork activity level. And then we sorted the absolute value in descending order. The average absolute t-test score of the top K = 10, 20, 30, 40, 50  Fig. 1. We can see that the discriminative power of subnetwork markers identified by product-based approach, and linear-combinationbased approach are considerably higher than the result of the greedy method. Among product-based approach group, Tversky-based yields the highest in most of the results.
We also assessed how the subnetwork markers identified on specific gene expression dataset perform in another independent dataset. We sorted the subnetwork markers based on their t-test score of the inferred subnetwork activity level on one dataset and we reevaluated the discriminative power on the other dataset. As shown in Fig. 2, we can see that the trends of discriminative power of subnetwork markers across different gene expression datasets are similar to those observed in Fig. 1. The analysis of discriminative power of the subnetwork markers identified on NKI295 data also shows a similar trend (Figures S1 and S2 in Additional file 1).
About the impact of different PPI networks, the PPI network with larger number of interactions tends to yield the higher discriminative power. One of the reasons may be that it contains more topological information which may help to measure the similarity between genes. As intuitively expected, we can see that BioGRID is advantageous to the other PPI networks because it contains the largest number of interactions (as shown in Figures 1d and Additional file 1: Figure S1(d)).

Evaluating the reproducibility of the identified subnetwork markers
In order to evaluate the reproducibility of subnetwork markers, we performed five-fold cross-validation experiments based on a similar set-up that has been commonly used in previous studies [6][7][8][9][10][11], where the entire process was repeated for 100 random partitions.
We identified potential subnetwork markers and selected the top 50 subnetworks as a feature set for the classifier on one gene expression dataset. After that, we built the linear discriminant analysis (LDA) classifiers based on the selected features and evaluated the accuracy on the other dataset. The classification performance assessed by the area under ROC curve (AUC) is shown in Fig. 3. We can see that both productbased approach and linear-combination based approach yield consistently high performance across different gene expression datasets and PPI networks.
In this work, we use the term, 'reproducibility' in the sense of the ability to identify common discriminative genes or subnetworks across different independent datasets. Therefore, using these subnetworks as biomarkers for disease classification may lead to consistent Fig. 2 Discriminative power of subnetwork markers across independent gene expression datasets. The markers were identified and ranked on GSE2034 and their discriminative power was evaluated on NKI295. We computed the mean absolute t-score of the top K=10, 20, 30, 40, and 50 markers by different methods for the following PPI datasets: a Chuang, b HPRD, c GASOLINE, and d BioGRID a b Fig. 3 Reproducibility of subnetwork markers identified by various methods. The bars show the cross-dataset classification performance (average AUC) of different methods. a GSE2034 was used for identifying the potential markers and NKI295 was used for training and evaluating the classifier, b We repeated as NKI295 was used for identifying the markers and GSE2034 was used for training and evaluation of the classifier performance. Furthermore, in terms of reproducibility in practical usage, the AP-based methods, including our proposed methods, cost less computation time compared to the greedy algorithm as shown in [11].

Conclusion
In this paper, we propose a novel method that incorporates topological information to identify subnetwork markers that can be used in cancer prognosis prediction. We demonstrate how widely used association coefficients, such as Jaccard index, Kulczyński index, and Tversky index can be utilized to measure topological similarity. Also, we show how to integrate these measures by two different approaches, product-based, and linearcombination based. Based on our experimental results, Tversky-based strategy is most suitable to measure similarity between genes when the direction of interaction is involved. It yields consistently high discriminative power across different datasets. Furthermore, utilizing the larger PPI network with larger number of unique proteins and interactions, such as BioGRID, may lead to the better subnetwork identification with higher classification performance.
The proposed method considerably increases the coverage of genes and also the overlap of genes when identified across different independent datasets. Through extensive evaluations using various independent breast cancer gene expression datasets and PPI networks, the experimental results show that our method leads to the identification of robust and reproducible subnetwork markers that may lead to better cancer classification.

Additional file
Additional file 1: Supplementary materials. Figure S1: Discriminative power of subnetwork markers identified on NKI295 by different methods. Figure S2: Discriminative power of subnetwork markers across independent gene expression datasets. (PDF 1260 kb)