SUPPLEMENTARY: SFSSClass: An integrated approach for miRNA based tumor classification

1 Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
2 Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
3 Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
4 MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Tsinghua University, Beijing 100084, China


Preprocessed Dataset
The data are preprocessed as suggested in [1] by filtering out those miRNAs whose expression value never exceeds a minimal cutoff (7.25 on the log2 scale) in any of the samples.
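The filtering step can be illustrated with a minimal sketch. The function name `filter_low_expression`, the dictionary layout, and the miRNA names and values are hypothetical and chosen only for illustration; the cutoff of 7.25 on the log2 scale is the one stated above.

```python
def filter_low_expression(expr, cutoff=7.25):
    """Keep only miRNAs whose log2 expression exceeds `cutoff` in at least one sample.

    `expr` maps a miRNA name to its list of log2 expression values (one per sample).
    """
    return {name: values for name, values in expr.items() if max(values) > cutoff}


# Toy data (hypothetical names and values, not taken from the study).
expr = {
    "miR-a": [6.1, 7.0, 7.2],     # never exceeds the cutoff -> removed
    "miR-b": [5.0, 9.3, 6.8],     # exceeds the cutoff in one sample -> kept
    "miR-c": [7.25, 7.25, 7.25],  # equals but never exceeds the cutoff -> removed
}

kept = filter_low_expression(expr)
print(sorted(kept))  # prints ['miR-b']
```

A miRNA is retained as soon as a single sample exceeds the cutoff, which matches the "never exceeds" wording of the filter.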

1.2.1
Exp1: A set of More Differentiated Tumor (MDT) samples from 9 tissue types has been classified based on the following data sets and parameters.

Selection of optimal parameters
The performance of the Uncorrelated Shrunken Centroid (USC) classifier [3] depends on two threshold values: the shrinkage threshold (∆) and the correlation threshold (ρ). Selecting a small number of relevant features is a prerequisite for classification, because the cost of classifying a patient's sample is directly proportional to the number of features that must be tested to make the diagnosis. However, a reduced number of features alone may not provide good prediction accuracy. The parameters ∆ and ρ therefore need to be chosen in such a way that the classification accuracy is increased. To determine the optimal parameters, ten random fourfold cross-validations were performed on the training set, and the parameter values giving the minimum average classification error rate were noted. As can be seen from the figures mentioned above, the optimal parameters (∆ and ρ) are chosen based on the cross-validation result. However, the average classification error rate alone is not sufficient for the SFSSClass algorithm to determine the optimal parameters. In the proposed method, a set of relevant features has already been selected through Simultaneous Feature and Sample Selection (SFSS) using biclustering and the cancer-miRNA network, and these miRNAs are assumed to carry significant class information; further reduction of the number of potential miRNAs is therefore not desirable. As can be seen from Figure
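The cross-validation procedure described in this section (ten random fourfold splits of the training set, keeping the parameter pair with the lowest average error rate) can be sketched as follows. This is only an illustration under stated assumptions: `toy_shrunken_centroid` is a simplified stand-in for USC, not the USC implementation of [3]. It soft-thresholds class centroids by ∆ and ignores ρ entirely. All function names, the parameter grid and the toy data are hypothetical.

```python
import random
from statistics import mean


def repeated_kfold_error(X, y, classify, params, k=4, repeats=10, seed=0):
    """Average error rate over `repeats` random k-fold cross-validations."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
        wrong = 0
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            preds = classify([X[i] for i in train], [y[i] for i in train],
                             [X[i] for i in fold], **params)
            wrong += sum(p != y[i] for p, i in zip(preds, fold))
        errors.append(wrong / n)
    return mean(errors)


def select_parameters(X, y, classify, grid, **cv_kwargs):
    """Return (lowest average error, parameters) over the grid."""
    scored = [(repeated_kfold_error(X, y, classify, p, **cv_kwargs), p) for p in grid]
    return min(scored, key=lambda t: t[0])


def toy_shrunken_centroid(train_X, train_y, test_X, delta, rho):
    """Simplified stand-in for USC (assumption): nearest centroid with
    soft-thresholding by `delta`. `rho` is accepted but NOT used here."""
    n_feat = len(train_X[0])
    overall = [mean(row[j] for row in train_X) for j in range(n_feat)]
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, t in zip(train_X, train_y) if t == label]
        cent = []
        for j in range(n_feat):
            d = mean(r[j] for r in rows) - overall[j]
            shrunk = max(abs(d) - delta, 0.0) * (1 if d >= 0 else -1)
            cent.append(overall[j] + shrunk)
        centroids[label] = cent

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda c: dist(x, centroids[c])) for x in test_X]


# Toy training set (hypothetical): two well-separated classes, two features.
X = [[0.0 + 0.1 * i, 1.0] for i in range(6)] + [[10.0 + 0.1 * i, 1.0] for i in range(6)]
y = ["A"] * 6 + ["B"] * 6
grid = [{"delta": d, "rho": 0.8} for d in (0.0, 100.0)]

best_error, best_params = select_parameters(X, y, toy_shrunken_centroid, grid)
print(best_error, best_params)
```

In this toy setting a very large ∆ shrinks both class centroids onto the overall centroid, so the classifier can no longer separate the classes and the smaller ∆ is selected.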

Biclustering algorithm SAMBA: A short note
Groups of genes showing similar activity patterns under a specific subset of the experimental conditions can be identified by a biclustering algorithm [5]. The concept of biclustering was first introduced by Hartigan in 1972 [6], and the technique was first applied to gene expression data by Cheng and Church in 2000 [7]. Microarray data are stored in an n × p matrix M, which is defined by a set of rows (genes) R = {R_1, R_2, …, R_n} and a set of columns C = {C_1, C_2, …, C_p}. An entry m_gs of M is a real value representing the expression level of gene g for sample s. A bicluster is a submatrix M_GS of M, where G ⊆ R and S ⊆ C, whose entries have a similar activity pattern. It consists of a subset of genes expected to be co-expressed within a subset of the conditions (those belonging to that bicluster).
In the proposed article we have considered SAMBA [8], a graph-theoretic approach to biclustering combined with a statistical data model. In this algorithm the gene expression matrix is modeled as a bipartite graph G = (U, V, E), where U is the set of conditions, V is the set of genes, and (u, v) ∈ E iff gene v responds in condition u. A bicluster produced by SAMBA is a subgraph H = (U′, V′, E′) of G, where V′ is a subset of genes that are co-expressed within a subset of conditions U′. A likelihood score assesses the significance of an observed subgraph. In [8] it is shown how to assign weights to the vertex pairs of the bipartite graph, and the quality of a computed bicluster is defined by the weight of the corresponding subgraph. The objective of SAMBA is to identify maximum-weight subgraphs, on the assumption that the weight of a subgraph corresponds to its statistical significance. Finding the most significant biclusters under the weighting scheme is thus equivalent to selecting the heaviest subgraphs in the model bipartite graph. In SAMBA, genes whose degree exceeds a threshold d are ignored. This is biologically relevant, since genes that show high expression in many conditions contribute little to a bicluster. Moreover, such genes are typically involved in several biological processes and hence do not exhibit the specific effect that is desirable in a bicluster.
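The graph view used above can be made concrete with a small sketch: building the bipartite edges, ignoring genes whose degree exceeds d, and scoring a candidate subgraph by its weight. Several pieces are assumptions made only for illustration: a gene is taken to "respond" when its absolute value reaches a threshold, and vertex pairs are weighted +1 for an edge and -1 for a non-edge, whereas SAMBA [8] derives its weights from a likelihood model. The function names and toy values are hypothetical.

```python
from collections import Counter


def build_bipartite_edges(expr, threshold):
    """Edges (condition, gene) for every condition in which the gene responds.

    Assumption: a gene 'responds' when |value| >= threshold.
    """
    return {(cond, gene)
            for gene, values in expr.items()
            for cond, v in enumerate(values)
            if abs(v) >= threshold}


def drop_high_degree_genes(edges, d):
    """Ignore genes whose degree (number of responding conditions) exceeds d."""
    degree = Counter(gene for _, gene in edges)
    return {(u, gene) for u, gene in edges if degree[gene] <= d}


def subgraph_weight(edges, conditions, genes, w_edge=1.0, w_non_edge=-1.0):
    """Weight of the subgraph induced by `conditions` and `genes`.

    Assumption: constant +1 / -1 weights instead of SAMBA's likelihood-based weights.
    """
    return sum(w_edge if (u, v) in edges else w_non_edge
               for u in conditions for v in genes)


# Toy expression changes (hypothetical): 4 genes, 4 conditions.
expr = {
    "g1": [2.0, 1.8, 0.1, 0.0],
    "g2": [1.9, 2.2, 0.2, -0.1],
    "g3": [2.1, 2.0, 1.9, 2.3],  # responds everywhere -> degree 4
    "g4": [0.0, 0.1, 0.2, 1.7],
}

edges = build_bipartite_edges(expr, threshold=1.5)
filtered = drop_high_degree_genes(edges, d=3)  # removes g3

tight = subgraph_weight(filtered, {0, 1}, {"g1", "g2"})
loose = subgraph_weight(filtered, {0, 1, 3}, {"g1", "g2", "g4"})
print(tight, loose)  # prints 4.0 1.0
```

The dense subgraph on conditions {0, 1} and genes {g1, g2} receives a higher weight than the larger, sparser candidate, which is the sense in which heavier subgraphs correspond to better biclusters.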
In [8] two statistical models are considered for the resulting biclusters. In the simpler model, it is assumed that all the values of all the genes in a given bicluster have changed relative to their normal level in the subset of conditions that form the bicluster, without considering any kind of coherence among the values m_gs. Each value m_gs is represented by the symbol A_1 (change) or A_0 (no change) instead of its true value. An edge is established between a gene and a condition if that gene has a changed expression level in that specific condition; no edge means no change. In the refined model, every two conditions in a bicluster must have the same or opposite effect on each of the genes. This model takes the sign of the change into account by assigning a sign C_ij ∈ {-1, 1} to each edge of the graph and then looking for a bicluster (I, J) and an assignment τ : The algorithm finds the K heaviest bicliques in the graph. In a post-processing phase, SAMBA performs greedy addition or removal of vertices of the selected subgraphs in order to improve the biclusters locally.
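The two models can be illustrated with a short sketch. The first function discretizes values into change / no change as in the simpler model; the second checks the refined-model requirement that every two conditions have the same or the opposite effect on each gene of a candidate bicluster. The function names, the threshold and the toy sign patterns are hypothetical, and the check is only one straightforward reading of the requirement, not the procedure used in [8].

```python
def discretize(values, threshold):
    """Simple model: 1 (change) if |value| >= threshold, else 0 (no change).

    Assumption: 'change' is decided by an absolute-value threshold.
    """
    return [1 if abs(v) >= threshold else 0 for v in values]


def conditions_coherent(signs):
    """Refined model check on a candidate bicluster.

    `signs` maps each gene to its list of edge signs (+1 / -1), one per condition.
    Two conditions have the same effect if the product of their signs is +1 for
    every gene, and the opposite effect if it is -1 for every gene.
    """
    rows = list(signs.values())
    n_cond = len(rows[0])
    for a in range(n_cond):
        for b in range(a + 1, n_cond):
            products = {row[a] * row[b] for row in rows}
            if len(products) > 1:  # mixed: same for some genes, opposite for others
                return False
    return True


print(discretize([2.0, -1.8, 0.1], threshold=1.5))  # prints [1, 1, 0]

# Conditions 0 and 1 act alike, condition 2 acts oppositely, for every gene.
good = {"g1": [1, 1, -1], "g2": [-1, -1, 1], "g3": [1, 1, -1]}
# Conditions 0 and 1 agree for g1 but disagree for g2.
bad = {"g1": [1, 1], "g2": [1, -1]}

print(conditions_coherent(good), conditions_coherent(bad))  # prints True False
```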

Cancer-miRNA network
A complete list of all the miRNAs involved in different cancer types is provided in Table S1. The table also gives the differential expression pattern of the miRNAs in different tumor tissues, along with a list of references (PubMed identifiers, PMIDs). The information was obtained from an extensive literature search [9]. Other relevant parameters that have been considered are the location of the miRNAs at fragile sites and cancer-associated genomic regions, epigenetic alteration of miRNA expression, and abnormalities in miRNA processing target genes and proteins.