Network topology measures for identifying disease-gene association in breast cancer
© The Author(s) 2016
Published: 25 July 2016
Massive biological datasets are generated in different locations all over the world. Analysis of these datasets is required in order to extract knowledge that might be helpful for biologists, physicians and pharmacists. Recently, analysis of biological networks has received a lot of attention, as an understanding of the network can reveal information about life at the cellular level. Biological networks can be generated that examine the interaction between proteins or the relationship amongst different genes at the expression level. Identifying information from biological networks is recognized as a significant challenge, due to the inherent complexity of the structures. Computational techniques are used to analyze such complex networks with varying success.
In this paper, we propose a new method for predicting phenotype-gene association in breast cancer using biological network analysis. Several network topological measures are computed and fed as features into two classification models to investigate phenotype-gene association in breast cancer. More importantly, to overcome the problem of skewed datasets, a synthetic minority oversampling technique (SMOTE) is adapted in order to transform an imbalanced dataset into a balanced one. We have applied our method to the gene co-expression network (GCN), the protein–protein interaction network (PPI), and the integrated functional interaction network (FI), which combines the PPIs and gene co-expression, amongst others. We assess the quality of our proposed method using a slightly modified cross-validation.
Our method can identify phenotype-gene association in breast cancer. Moreover, use of the integrated functional interaction network (FI) has the potential to reveal more information and hidden patterns than the other networks. The software and accompanying examples are freely available at http://faculty.kfupm.edu.sa/ics/eramadan/NetTop.zip.
Understanding crosstalk and feedback among oncogenic pathways is critical in order to predict and overcome resistance to targeted anticancer therapy. The topology of biological networks has increasingly been used to complement studies based on individual genes or gene sets. Several network applications are relevant to the study of pathway crosstalk in drug resistance. The identification of modules and sub-networks that are relatively isolated from the rest of the network can lead to an understanding of the direct interaction and cooperation among molecules and to more detailed or dynamic models of the network. Network topological characteristics can potentially be predictive biomarkers through network-based classification [1, 2].
Protein interaction networks and gene co-expression networks potentially represent patterns of network connectivity among genes/proteins that differ between clinically relevant phenotypes. Various topological measures that identify relationships between genes, such as node degree, betweenness, or bridging, may contribute to the ability to predict phenotype-gene association.
Here, we apply several techniques for network analysis to demonstrate their utility in studying biological networks in breast cancer. We utilize network topological measures to expose the important nodes (genes/proteins) within the network, and identify marker genes (genes related to breast cancer) from gene co-expression networks, protein interaction networks, or integrated functional networks.
In the present work, we have extracted thirteen topological measurements (features) from a publicly available gene co-expression network and a protein interaction network. We then use classification models to investigate the phenotype-gene association in breast cancer. Moreover, we apply this approach to the integrated functional network of PPI and gene expression in order to investigate the hidden patterns of breast cancer that might not be revealed in the protein network or gene co-expression network.
Recently, the network-based approach has also been used for this purpose. For instance, Zhang et al. proposed a network-based Cox regression model (Net-Cox). The model was intended to investigate the gene expression signatures that contribute to death or recurrence in ovarian cancer treatment. Moreover, Ruan et al. proposed a general co-expression network-based technique that allows analysis of genes and samples obtained from microarray datasets. This technique uses a rank-based network construction method, a parameter-free module discovery algorithm, and a reference network-based metric for module evaluation. The study used a number of different datasets for evaluation, such as yeast and human cancer microarray data.
Guan et al. proposed an approach that used a mouse genome-wide functional relationship network and a support vector machine classifier to investigate bone mineral density (BMD), a phenotype related to osteoporotic fracture. Two genes (Timp2 and Abcg8) were revealed that are related to bone density defects and that were not identified by other statistical methods (i.e. genome-wide association studies/quantitative trait loci).
Wu et al. developed a naive Bayes classifier (NBC) to construct a functional interaction (FI) network that combines both curated protein–protein interaction networks and pathway information. The computed FI network was used to investigate two glioblastoma multiforme (GBM) datasets by projecting the cancer candidate genes onto the FI network.
Step 1: Extract topological measures from biological networks.
Step 2: Identify the breast cancer signature genes.
Step 3: Apply SMOTE in order to make a balanced dataset.
Step 4: Use classification models in order to investigate the phenotype-gene association in breast cancer.
Details about these steps are described below:
Bary center score
Katz status index
The degree of a vertex v in a graph G=(V,E) is the number of connections it has. Here V is the set of vertices (genes or proteins) in the graph and E is the set of edges (links) in the graph. The distance $\sigma_{vw}$ of a vertex v from another vertex w is the number of edges in the shortest path between them. A path in a graph is a sequence of edges that connects a sequence of vertices, with no repeated vertices allowed. A walk is such a sequence in which vertices or edges may be repeated.
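As a minimal sketch of the two definitions above, the degree and the shortest-path distance can be computed on a small undirected graph with a breadth-first search. The toy graph and the function names are hypothetical and serve only as an illustration; they are not taken from the accompanying software.

```python
from collections import deque

# Hypothetical toy network: adjacency sets of an undirected graph.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def degree(graph, v):
    """Number of edges incident to vertex v."""
    return len(graph[v])

def distances_from(graph, source):
    """Breadth-first search: shortest-path length (in edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:  # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist
```

Here `distances_from(graph, "A")` returns the distance from A to every reachable vertex, so the distance between any pair can be read from one search per source vertex.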
The numerator in the fraction shows the number of shortest paths joining s and t on which v is an intermediate vertex.
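The fraction referred to here is presumably the one in the standard definition of betweenness centrality. Under that assumption, and writing $n_{st}$ for the number of shortest paths joining s and t and $n_{st}(v)$ for the number of those paths that pass through v (both symbols are introduced here for illustration only), the measure reads:

```latex
C_B(v) = \sum_{\substack{s, t \in V \\ s \neq v \neq t}} \frac{n_{st}(v)}{n_{st}}
```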
The Bary center score ranks each vertex of the graph according to its total shortest-path distance. Shortest-path distances are computed for each vertex in the graph, and a score is assigned to each vertex based on the lengths of the shortest paths that go through the vertex.
The clustering coefficient measures the degree of cohesiveness in a given graph. For a given vertex v, $C_{cc}(v)$ is defined as the ratio of the actual number of edges $E_i$ within its neighborhood to the maximum number of possible edges in that neighborhood.
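The ratio described above can be sketched as follows for an undirected graph, where a vertex with k neighbors has at most k(k−1)/2 edges among them. The toy graph and the function name are hypothetical; returning 0 for vertices with fewer than two neighbors is a convention assumed here.

```python
from itertools import combinations

# Hypothetical toy network: adjacency sets of an undirected graph.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def clustering_coefficient(graph, v):
    """Edges among the neighbors of v divided by the maximum possible number."""
    neighbors = graph[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined ratio is reported as 0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))
```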
The coreness value measures the set of vertices that are highly and mutually interconnected. The k-core is the largest subgraph, comprising vertices of a degree at least k, and is derived by recursively removing vertices with a degree lower than k until none remain.
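The recursive removal described above can be sketched with a simple peeling procedure: the vertex of lowest remaining degree is removed repeatedly, and the coreness of a vertex is the largest k reached by the time it is removed. This is an illustrative, quadratic-time version on a hypothetical toy graph, not an optimized implementation.

```python
# Hypothetical toy network: adjacency sets of an undirected graph.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def core_numbers(graph):
    """Coreness of each vertex, obtained by peeling minimum-degree vertices."""
    degrees = {v: len(nbrs) for v, nbrs in graph.items()}
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degrees.get)  # lowest remaining degree
        k = max(k, degrees[v])               # k never decreases while peeling
        core[v] = k
        remaining.remove(v)
        for w in graph[v]:
            if w in remaining:
                degrees[w] -= 1
    return core
```

In the toy graph the triangle A, B, C forms the 2-core, while D and E are peeled off at k = 1.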
The within-module z-score measures how well connected vertex i is to other vertices in the module.
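A common way to compute such a score, assumed here because the formula is not given in the text, is the definition of Guimerà and Amaral: the number of links from vertex i to other vertices of its module, minus the module average of that quantity, divided by its standard deviation within the module. The toy graph, the module and the function name below are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical toy network: adjacency sets of an undirected graph.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def within_module_z_scores(graph, module):
    """z-score of the within-module degree of each vertex in the module."""
    inside = {v: len(graph[v] & module) for v in module}
    mu = mean(inside.values())
    sd = pstdev(inside.values())  # population standard deviation (assumed)
    # If all vertices have the same within-module degree, report 0.
    return {v: ((k - mu) / sd if sd > 0 else 0.0) for v, k in inside.items()}

z_scores = within_module_z_scores(graph, {"B", "D", "E"})
```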
To apply the structural hole concept, we identify nodes utilizing Burt’s aggregate constraint measure (Equation 2.7 in Burt’s book on structural holes). Burt’s structural hole argument is that social capital is created by a network in which individuals in the social network can broker connections between otherwise disconnected segments. This concept builds on a metaphor of ‘social capital’ that is made concrete with network models in which topological measures rank nodes by their connectivity and lack of redundancy. The argument further posits that since there is some cost of maintaining connections, non-redundancy increases the influence of a node.
Breast cancer signature genes
We have extracted 451 genes that are related to breast cancer from the databases mentioned above. We fed these genes as class labels into the classifiers. The class labels in the dataset are thus represented as ‘Yes’ (genes that influence breast cancer) and ‘No’ (genes that do not influence breast cancer).
Synthetic minority oversampling technique
Synthetic Minority Oversampling Technique (SMOTE) is a sampling approach used to transform an imbalanced dataset into a balanced one. A dataset can be considered imbalanced if one group of observations has far fewer samples than the other group of observations in the same dataset. It is well known that a machine learning classifier tends to perform poorly if the dataset is highly imbalanced. The dataset we used in this study is imbalanced by nature, and hence SMOTE was applied to transform it into a balanced one.
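The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched as below. This is a simplified illustration under stated assumptions: the function name, the parameters `n_new`, `k` and `seed`, and the toy feature vectors are hypothetical, and seed samples are drawn at random rather than following the exact schedule of the original algorithm.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority feature tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x by squared Euclidean distance
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbor)))
    return synthetic

# Hypothetical minority-class feature vectors, invented for illustration.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.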
In this study we have used two different classifiers, Decision Tree Bagger (DTB) and Random Under Sampling Boost (RUSBoost), to classify the data, with the extracted topological measures as features and the breast cancer signature genes as the class label.
Decision Tree Bagger employs a classic decision tree as the classifier and then uses a bagging methodology to further enhance its classification performance. A decision tree is a widely used classifier that divides the dataset such that the impurity level in each partition is reduced compared to the dataset before partitioning. The impurity level of a dataset is measured using the class labels of its records. The most popular impurity measure is the Gini Index. Following a tree structure view, the source dataset is considered the root node of the tree, while each partitioned dataset is considered a child node of the node from which it was split. Partitioning is repeated at each of the sub-partitions with the aim of achieving pure partitions at the leaf nodes of the tree. Once the tree is induced from the training dataset, traversing the tree from the root to each of the leaf nodes generates rules. These rules are then applied to classify unknown data.
Since the decision tree is induced from the training dataset, the tree structure might vary with varying sets of data for the same problem. Hence, the performance of the respective decision tree could also vary. To overcome this and achieve enhanced classification performance, a number of bootstrap replicas of the dataset are generated. This process of generating multiple replicas of the dataset, by varying the records in each replica, is called the bagging or bootstrap methodology. Each resulting replica of the training dataset is used to induce a decision tree. Thus, there are as many decision trees as there are generated replicas. Each replica is sampled by randomly choosing N observations out of N with replacement, where N is the total number of records in the dataset.
Furthermore, the average of the classification outputs of the individual trees is taken as the output of the decision tree bagger.
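Two ingredients of the description above, the Gini Index as an impurity measure and the bootstrap replica drawn with replacement, can be sketched as follows. Combining the trees by majority vote is one common choice and is an assumption here; the function names and the toy labels are hypothetical.

```python
import random
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def bootstrap_replica(records, rng):
    """Draw N records out of N with replacement."""
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]

def majority_vote(predictions):
    """Combine the per-tree predictions for one record (assumed rule)."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical class labels, invented for illustration.
labels = ["Yes", "No", "No", "No"]
replica = bootstrap_replica(labels, random.Random(0))
```

A pure partition has a Gini Index of 0, and an evenly split two-class partition has a Gini Index of 0.5, so a split is useful when it lowers this value in the resulting partitions.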
Random Under Sampling Boost (RUSBoost) is another approach used to enhance the performance of the base decision tree classifier so that it deals better with an imbalanced dataset. In this approach, the data that belongs to the minority class is considered the basic population, while data belonging to the majority class is under-sampled so that the classes become balanced. Suppose there are N observations that belong to the minority class in the training data. Following the RUSBoost approach, these N observations are considered the basic population for sampling. Thus, a total of N observations are sampled from the data belonging to the majority class. Note: if more than one class is considered a majority class, N observations are sampled from each of these classes. All of the sampled data is merged with the N observations from the minority class to form a balanced dataset. A decision tree is then induced from this balanced dataset.
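The under-sampling step described above can be sketched as follows. Only the balancing step is shown, not the boosting rounds or the tree induction; the function name, its parameters and the toy data are hypothetical.

```python
import random

def random_undersample(records, labels, minority_label, seed=0):
    """Keep all minority records and sample N records from every other class."""
    rng = random.Random(seed)
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    n = len(by_class[minority_label])  # N: size of the minority class
    balanced = []
    for label, class_records in by_class.items():
        if label == minority_label:
            chosen = class_records
        else:
            chosen = rng.sample(class_records, n)  # without replacement
        balanced.extend((record, label) for record in chosen)
    return balanced

# Hypothetical data: 2 minority ('Yes') and 8 majority ('No') records.
records = list(range(10))
labels = ["Yes"] * 2 + ["No"] * 8
balanced = random_undersample(records, labels, "Yes")
```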
We consider several measures in order to evaluate each classifier's performance:
The receiver operating characteristic (ROC) curve is a well-known performance measure used to evaluate a classifier, particularly when the dataset is highly imbalanced. The ROC curve is drawn on a two-dimensional Cartesian plot in which the x-axis represents 1−SP (one minus specificity) and the y-axis represents SN (sensitivity). Varying the threshold used to classify the data into two classes (e.g. either 1 or 0) changes both measures, so the ROC plot reflects these variations in terms of both sensitivity and specificity. Analysis of the ROC plot therefore shows which threshold level provides the best performance for a classifier. The best possible performance is achieved if both sensitivity and specificity reach 100 %; in other words, the ROC curve that passes through the upper-left corner of the ROC space yields the best performance, and the closer the curve is to that corner, the better the performance. Alternatively, the area under the curve (AUC) can be used as an indication of classifier performance. If the area under the curve covers the whole ROC space, the classifier is called a perfect classifier. An AUC value of 1 indicates a perfect classifier, and a value close to 1 indicates performance close to that of the perfect classifier. An AUC value of about 0.5 corresponds to random classifier performance.
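The construction described above can be sketched by sweeping the threshold over the observed scores, recording (1−SP, SN) at each step, and integrating with the trapezoidal rule. The scores, labels and function names are hypothetical and invented for illustration.

```python
def sensitivity_specificity(scores, labels, threshold):
    """SN and SP when scores >= threshold are classified as positive (1)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def roc_points(scores, labels):
    """ROC points (1 - SP, SN), from the strictest to the loosest threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        sn, sp = sensitivity_specificity(scores, labels, threshold)
        points.append((1.0 - sp, sn))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

For these toy values, eight of the nine positive-negative pairs are ranked correctly, which matches the area of 8/9 obtained from the curve.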
To achieve a generalized performance estimate of the proposed method, we applied the well-known k-fold cross-validation schema. In this schema, the dataset is divided into k equal partitions and a computational model is generated using k−1 partitions, while the k-th partition is kept untouched in order to test the model later. These steps are repeated k times such that each individual record is used to test the efficacy of the proposed model. For a k-fold partition, a total of k models with varying training datasets are generated. As our proposed model involves identifying features based on the performance of the model, we selected features using only the k−1 training partitions, keeping the data belonging to the k-th partition aside. By doing so we achieve a more general performance estimate of the proposed model without bias towards any class of data.
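The partitioning scheme described above can be sketched as an index generator. The function name is hypothetical, and the comments mark where feature selection and model fitting would take place so that the held-out partition is never used for either.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        # Feature selection and model fitting would use `train` only;
        # `test` is kept aside and used solely for evaluation.
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
```

In practice the records would usually be shuffled before partitioning; that step is omitted here to keep the sketch deterministic.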
Results and discussion
Comparison of classification results which adapt SMOTE sampling
We applied a 10-fold cross-validation schema. We then computed the 95 % confidence interval for the mean with the following formula: $Q = M \pm Z_{.95}\,\sigma_M$, where $Z_{.95}$ is the number of standard deviations extending from the mean of a normal distribution required to contain 0.95 of the area and $\sigma_M$ is the standard error of the mean. The DTB classification model, which adapts SMOTE sampling and uses topological features extracted from the integrated functional interaction network (FI), has the highest G-Mean value (0.90±0.02), as illustrated in Table 2. A high G-Mean value indicates that a high proportion of the cancerous genes (the breast cancer genes’ signature) are predicted correctly. On the other hand, the DTB classification models that adapt SMOTE sampling and use topological features extracted from the other two networks, GCN and PPI, have lower G-Mean values of (0.89±0.02) and (0.88±0.02), respectively. This suggests that using an integrated functional interaction network can reveal more information about phenotype-gene association in breast cancer. RUSBoost gives similar results but has one major drawback: it uses its own sampling method, which conflicts with the SMOTE sampling method.
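The confidence interval formula above can be sketched as follows, taking 1.96 as the value of Z for 95 % of the area under a normal distribution and estimating the standard error of the mean from the sample standard deviation (an assumption). The per-fold values are hypothetical and are not results from this study.

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval_95(values):
    """Return (lower, upper) bounds of M ± Z * standard error of the mean."""
    m = mean(values)
    sem = stdev(values) / sqrt(len(values))  # standard error of the mean
    half_width = 1.96 * sem                  # Z for a 95 % interval
    return m - half_width, m + half_width

# Hypothetical per-fold G-Mean values, invented for illustration only.
fold_scores = [0.88, 0.91, 0.90, 0.89, 0.92, 0.90, 0.87, 0.91, 0.90, 0.92]
low, high = confidence_interval_95(fold_scores)
```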
Feature importance analysis: Accuracy, Gini Index, and the combined score are listed (features: Katz status index, Bary center score)
List of some genes that are misclassified by the method as breast cancer related genes
Gene name | Associated diseases
- | CD4+ lymphocyte deficiency
amyloid beta (A4) precursor protein | Alzheimer disease 1, Amyloidosis, Dementia, early-onset progressive, autosomal recessive,
cyclin-dependent kinase 2 | A novel susceptibility locus for type 1 diabetes.
- | Glomerulopathy with fibronectin deposits.
interferon regulatory factor 1 | Gastric cancer, Macrocytic anemia, Myelodysplastic syndrome, preleukemic, Myelogenous leukemia, acute, Nonsmall cell lung cancer.
- | Alzheimer disease, Cardiomyopathy, Pick disease.
signal transducer and activator of transcription 1 | Mycobacterial infection, atypical, familial disseminated.
solute carrier family 25 | Mitochondrial phosphate carrier deficiency.
son of sevenless homolog 1 | Fibromatosis, gingival, Noonan syndrome 4.
We have compared various topological measures that have the potential to identify phenotype-gene association for breast cancer. We have extracted thirteen features from publicly available gene co-expression networks and protein interaction networks. We have used two classification models to investigate the phenotype-gene association in breast cancer. Moreover, we have applied this approach to the integrated functional network of PPI and gene expression in order to investigate the hidden pattern of breast cancer that might not be revealed in the protein networks or gene co-expression networks.
In conclusion, our approach is capable of effectively detecting the phenotype-gene association in breast cancer.
The authors wish to acknowledge King Fahd University of Petroleum and Minerals (KFUPM) for the use of its various facilities in carrying out this research. Many thanks are due to the anonymous referees for their detailed and helpful comments.
Publication cost of this article was personally funded by the authors. This article has been published as part of BMC Bioinformatics Volume 17 Supplement 7, 2016: Selected articles from the 12th Annual Biotechnology and Bioinformatics Symposium: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-7.
Availability of data and materials
The source and data are freely available at http://faculty.kfupm.edu.sa/ics/eramadan/NetTop.zip.
The idea came from Dr. Emad Ramadan. He also coordinated the research, writing and submission of this paper. Sadiq Al-Insaif carried out the study, developed and implemented the methodology. Dr. Rafiul Hassan helped in the analysis and writing of the data mining aspects of this paper. All authors read and approved the final manuscript.
All authors declare they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Tuck D, Kluger H, Kluger Y. Characterizing disease states from topological properties of transcriptional regulatory networks. BMC Bioinformatics. 2006; 7:236. doi:1471-2105-7-236.
- Chuang H, Lee E, Liu Y, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007; 3:140. doi:msb4100180.
- Brandes U. A faster algorithm for betweenness centrality. J Math Soc. 2001; 25:163–77.
- Ramadan E, Osgood C, Pothen A. Discovering overlapping modules and bridge proteins in proteomic networks. In: Proceedings of ACM International Conference Bioinformatics and Computational Biology. vol. 5: 2010.
- Furey T, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000; 16:906–14.
- Ramaswamy S, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci. 2001; 98:15149–54.
- Li X, et al. Gene mining: a novel and powerful ensemble decision approach to hunting for disease genes using microarray expression profiling. Nucleic Acids Res. 2004; 32:2685–94.
- Zhang W, et al. Network-based survival analysis reveals subnetwork signatures for predicting outcomes of ovarian cancer treatment. PLoS Comput Biol. 2013; 9(3):e1002975.
- Ruan J, Dean A, Zhang W. A general co-expression network-based approach to gene expression analysis: comparison and applications. BMC Syst Biol. 2010; 4(1).
- Guan Y, et al. Functional genomics complements quantitative genetics in identifying disease-gene associations. PLoS Comput Biol. 2010; 6(11).
- Wu G, Feng X, Stein L. A human functional protein interaction network and its application to cancer data analysis. Genome Biol. 2010; 11(5).
- Wilson G, Banzhaf W. Discovery of email communication networks from the Enron corpus with a genetic algorithm using social network analysis. Evol Comput. 2009.
- Mering V, et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature. 2002; 417(6887).
- White S, Smyth P. Algorithms for estimating relative importance in networks. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining: 2003.
- Burt R. Structural holes: the social structure of competition: Harvard University Press; 1995.
- Becker K, et al. The genetic association database. Nat Genet. 2004; 36(5).
- Smith C, Goldsmith C, Eppig J. The mammalian phenotype ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol. 2004; 6(1).
- Robinson P, et al. The human phenotype ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet. 2008; 83(5).
- Chawla N, et al. SMOTE: synthetic minority over-sampling technique. 2011. http://arxiv.org/abs/1106.1813.
- Breiman L. Bagging predictors. Mach Learn. 1996; 24.
- Seiffert C, et al. RUSBoost: Improving classification performance when training data is skewed. In: 19th International Conference on Pattern Recognition: 2008.
- Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006; 27:861–74.
- Hedenfalk I, et al. Gene-expression profiles in hereditary breast cancer. N Engl J Med. 2001; 344(8).
- Breitkreutz B, et al. The BioGRID interaction database. Nucleic Acids Res. 2008; 36(suppl 1).