Clustering analysis of tumor metabolic networks

Background: Biological networks represent the diverse molecular interactions that occur within cells. Among the most commonly studied are protein-protein interaction, gene regulatory, and metabolic networks. Metabolic networks are probably the most studied, as they directly influence all physiological processes. Exploring biochemical pathways through multigraph representations is important for understanding complex regulatory mechanisms. Feature extraction and clustering of these networks enable the grouping of samples obtained from different biological specimens, separating networks according to their mutual similarity.

Results: We present a clustering analysis of tissue-specific metabolic networks for single samples from three primary tumor sites: breast, lung, and kidney. The metabolic networks were obtained by integrating genome-scale metabolic models with gene expression data. We performed network simplification to reduce the computational time needed to compute network distances, and we empirically showed that network clustering can characterize groups of patients in multiple conditions.

Conclusions: We provide a computational methodology to explore and characterize the metabolic landscape of tumors, thus providing a general methodology to integrate analytic metabolic models with gene expression data. This method represents a first attempt at clustering large-scale metabolic networks. Moreover, the approach yields valuable information on the effects of different conditions on the overall metabolism.


Ichcha Manipur, Ilaria Granata, Lucia Maddalena and Mario R. Guarracino

Additional File 5 - Clustering Metrics
The evaluation of clustering algorithms is generally carried out in terms of internal and external validation indices [1]. Internal indices aim at evaluating the goodness of a computed data partition using quantities and features extracted from the data itself. External indices, on the other hand, rely on the existence of a given ground truth partition and evaluate how accurately a clustering technique partitions the data compared to that ground truth.
In our applications, we are given ground truths for all the considered datasets and, rather than evaluating a clustering technique per se, we are interested in evaluating its ability to partition the given data for the problems at hand. Therefore, we select an extended set of external accuracy metrics often adopted for clustering evaluation [1,2,3,4,5], described in detail in the following. Let S be the set of n elements to be clustered, CP = {CP_1, ..., CP_c} the partition of S into c clusters computed by the clustering technique, and GT = {GT_1, ..., GT_c} the ground truth partition. For the pairs of elements of S we define:
• TP = number of pairs of elements in S that are in the same cluster in CP and in the same cluster in GT;
• TN = number of pairs of elements in S that are in different clusters in CP and in different clusters in GT;
• FP = number of pairs of elements in S that are in the same cluster in CP but in different clusters in GT;
• FN = number of pairs of elements in S that are in different clusters in CP but in the same cluster in GT.
The total number of possible pairs is given by
$$TP + TN + FP + FN = \frac{n(n-1)}{2}.$$
Intuitively, TP + TN can be considered as the number of agreements between the computed partition CP and the ground truth partition GT, i.e., the number of correct decisions made by the clustering technique of assigning two similar elements to the same cluster (TP) or two dissimilar elements to different clusters (TN). Likewise, FP + FN can be considered as the number of disagreements between CP and GT, i.e., the number of wrong decisions of assigning two dissimilar elements to the same cluster (FP) or two similar elements to different clusters (FN).
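As an illustration (not part of the original analysis), the following Python sketch counts the pairwise quantities defined above from two per-sample label vectors; the function name pair_counts and the toy labels are purely hypothetical.

```python
# Illustrative sketch: counting the pairwise quantities TP, TN, FP, FN
# from a computed partition CP and a ground truth GT, both given as
# per-sample label lists.
from itertools import combinations

def pair_counts(cp, gt):
    """Return (TP, TN, FP, FN) over all n(n-1)/2 pairs of samples."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(cp)), 2):
        same_cp = cp[i] == cp[j]
        same_gt = gt[i] == gt[j]
        if same_cp and same_gt:
            tp += 1          # together in both partitions
        elif same_cp and not same_gt:
            fp += 1          # together in CP, apart in GT
        elif not same_cp and same_gt:
            fn += 1          # apart in CP, together in GT
        else:
            tn += 1          # apart in both partitions
    return tp, tn, fp, fn

# toy example with hypothetical labels
cp = [0, 0, 1, 1, 1]
gt = ["breast", "breast", "lung", "lung", "kidney"]
print(pair_counts(cp, gt))   # (2, 6, 2, 0); the four counts sum to 5*4/2 = 10
```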

Rand Index (RI) and Adjusted Rand Index (ARI)
Named by [6] after W. M. Rand [7], the Rand Index measures the percentage of decisions taken by the clustering technique under examination that are correct, i.e., its accuracy:
$$RI = \frac{TP + TN}{TP + FP + FN + TN} = \frac{2(TP + TN)}{n(n-1)}.$$
RI assumes values in [0,1]. As observed in [3,6], RI does not guarantee that random partitions will get a value close to zero. To counter this effect, some authors prefer to adopt the so-called Adjusted Rand Index, defined as
$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]},$$
where E[RI] indicates the expected RI of a random partition. ARI assumes values in [-1,1]; negative values indicate that the computed partition is worse than a random partition.
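For concreteness, a minimal sketch of computing RI and ARI with scikit-learn, assuming the two partitions are available as per-sample label vectors (the toy data below is hypothetical):

```python
# Rand Index and its chance-adjusted version computed by scikit-learn.
from sklearn.metrics import rand_score, adjusted_rand_score

cp = [0, 0, 1, 1, 1]   # computed partition (toy labels)
gt = [0, 0, 1, 1, 2]   # ground truth partition (toy labels)

print(rand_score(gt, cp))           # (TP + TN) / (n(n-1)/2) = 8/10 = 0.8
print(adjusted_rand_score(gt, cp))  # adjusted for chance, can be negative
```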

Misclassification Rate (MR)
The Misclassification Rate measures the percentage of wrong decisions taken by a clustering technique [4], computed as
$$MR = \frac{FP + FN}{TP + FP + FN + TN} = \frac{2(FP + FN)}{n(n-1)} = 1 - RI.$$
It assumes values in [0,1].
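Under the pairwise definition above, MR is simply the complement of RI; a minimal sketch (hypothetical toy labels, scikit-learn assumed available):

```python
# Misclassification rate as the fraction of wrong pair decisions,
# i.e. the complement of the Rand Index.
from sklearn.metrics import rand_score

cp = [0, 0, 1, 1, 1]
gt = [0, 0, 1, 1, 2]
mr = 1.0 - rand_score(gt, cp)    # (FP + FN) / (n(n-1)/2)
print(mr)                        # 0.2 for this toy example
```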

F-Measure (F1)
The F-measure, also known as Figure of Merit, is the harmonic mean of Precision and Recall:
$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall},$$
where Precision and Recall are defined as
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}.$$
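A small sketch of pairwise Precision, Recall, and F1 computed directly from the pair counts; the helper name f_measure and the example counts are illustrative only:

```python
# Pairwise Precision, Recall and F1 from the pair counts TP, FP, FN.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

# toy pair counts: TP = 2, FP = 2, FN = 0
print(f_measure(tp=2, fp=2, fn=0))   # 2 * 0.5 * 1.0 / 1.5 ≈ 0.667
```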

Fowlkes-Mallows Index (FMI)
The Fowlkes-Mallows index [8] is defined as the geometric mean of Precision and Recall:
$$FMI = \sqrt{Precision \cdot Recall} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}.$$
It assumes values in [0,1].
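As a sanity check, FMI can be computed either from the pair counts or directly from the label vectors with scikit-learn; the values below refer to the same hypothetical toy example used above:

```python
# Fowlkes-Mallows index from the pairwise definition and via scikit-learn.
from math import sqrt
from sklearn.metrics import fowlkes_mallows_score

tp, fp, fn = 2, 2, 0                      # toy pair counts
print(tp / sqrt((tp + fp) * (tp + fn)))   # sqrt(Precision * Recall) ≈ 0.707

cp = [0, 0, 1, 1, 1]
gt = [0, 0, 1, 1, 2]
print(fowlkes_mallows_score(gt, cp))      # same value, computed from labels
```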

Cluster Accuracy (CA)
Cluster Accuracy is defined [1] as
$$CA = \frac{1}{n}\sum_{i=1}^{c} f(CP_i\,|\,GT),$$
where f(CP_i|GT) indicates the number of elements of cluster CP_i whose label corresponds to the ground truth label most frequent in the cluster. CA assumes values in (0,1].
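A possible implementation of CA uses the contingency matrix between GT and CP; the helper cluster_accuracy below is not from the paper and the toy labels are illustrative:

```python
# Cluster Accuracy: each computed cluster is credited with the number of
# elements carrying its most frequent ground-truth label.
from sklearn.metrics.cluster import contingency_matrix

def cluster_accuracy(gt, cp):
    # rows: ground-truth classes, columns: computed clusters
    c = contingency_matrix(gt, cp)
    # per-cluster majority counts, summed and normalized by n
    return c.max(axis=0).sum() / c.sum()

cp = [0, 0, 1, 1, 1]
gt = [0, 0, 1, 1, 2]
print(cluster_accuracy(gt, cp))   # (2 + 2) / 5 = 0.8
```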

Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI)
The Normalized Mutual Information measures the amount of statistical information shared by the random variables representing the cluster assignments and the ground truth label assignments of the elements. Given the computed partition CP of the set S into c clusters, its entropy, i.e., the amount of uncertainty for that partition, is defined [3] as
$$H(CP) = -\sum_{i=1}^{c} p(i)\,\log p(i),$$
where p(i) = |CP_i|/n is the probability that an element picked at random falls into cluster CP_i. Likewise, the entropy of the ground truth partition GT is given by
$$H(GT) = -\sum_{j=1}^{c} p'(j)\,\log p'(j),$$
where p'(j) = |GT_j|/n is the probability that an element picked at random falls into class GT_j. The mutual information between CP and GT is computed as
$$I(CP, GT) = \sum_{i=1}^{c}\sum_{j=1}^{c} p(i,j)\,\log\frac{p(i,j)}{p(i)\,p'(j)},$$
where p(i,j) = |CP_i ∩ GT_j|/n is the probability that an element picked at random falls into both CP_i and GT_j. The Normalized Mutual Information is defined [9] as
$$NMI(CP, GT) = \frac{I(CP, GT)}{\sqrt{H(CP)\,H(GT)}}$$
and assumes values in [0,1]. As with the Rand Index, NMI is not adjusted for chance. To counter this effect, the Adjusted Mutual Information can be adopted [10,11,12], which corrects the mutual information for chance by subtracting its expected value under random partitions, analogously to ARI.
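Both quantities are available in scikit-learn; the sketch below (hypothetical toy labels) uses the geometric-mean normalization of the entropies for NMI, consistent with the definition above, alongside the chance-corrected AMI:

```python
# NMI and AMI computed by scikit-learn; the entropy normalization for NMI
# is selectable (geometric vs. arithmetic mean, etc.).
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_mutual_info_score)

cp = [0, 0, 1, 1, 1]
gt = [0, 0, 1, 1, 2]

print(normalized_mutual_info_score(gt, cp, average_method="geometric"))
print(adjusted_mutual_info_score(gt, cp))   # corrected for chance, as ARI
```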