Variable selection performance with simulation dataset
We describe herein the performance of our method with two network inference methods on a simulated dataset. In the hard thresholding transformation, we considered the effects of two parameters on the performance of variable selection: the hard threshold τ that determines the number of genes and edges included in the unweighted co-expression network, and the percentage of genes to be selected based on their connectivity in the resulted network, determined by equation (3). With a greater value of τ, the resulting network will have a smaller number of genes and edges, but the connection strength (correlation coefficient) between paired genes will be higher. We reported the average F-scores and CER values based on 100 simulated datasets for two different dimensionalities (d=500, Figure 1a and b, and d=1000, Additional file 1: Figure S1a and b).
It is not surprising to observe that selecting all of the 500 genes in the dataset can only lead to a low F-score (0.15) and a high CER (0.29), as shown in Figure 1a and b, simply because too many noninformative genes were included without the variable selection step. When we varied the threshold τ to generate a network with reduced size but kept all the genes in the resulting network regardless of their connectivity (in the case of genes with top 100% connectivity being selected), the performance evaluated by the F-score was improved but still poor, regardless of how many genes were in the resulting network. However, both the F-scores and CER were shown to improve further with an additional step of gene filtering by the gene connectivity rank. Generally, the more stringent the gene connectivity rank filtering, the lower the number of genes selected. To compare the performance of different gene connectivity ranks with the same number of genes selected, we had to decrease the threshold τ to achieve a large network size when the gene connectivity rank was more stringent. As shown in the Figure 1a and b, the gene filtering with the top 20 percentile connectivity resulted in the highest F value and the lowest CER when 100 genes were selected. However, this was not always the case. When 40 genes were included, the top 40% percentile rank achieved best performance among all these filtering scales. This indicates that the performance of variable selection is affected by both the connection strength and the connectivity of the selected genes. We observed similar results in both simulated datasets with different dimensionalities (d= 1000, Additional file 1: Figure S1a and b).
As shown in above analysis, the network variable selection based on hard-thresholding transformation was influenced by both the network size and the gene connectivity filtering. It may be challenging to optimize both of these two factors in real data analysis. To resolve this problem, we used a soft thresholding transformation for gene selection that is only dependent on the power function parameter β. It not only takes into account the information of all the genes, but also reduces the effect of noise induced correlation by the power function, assuming that the noise correlation occurs more likely at smaller values than the correlation associated with true gene relationships.
In the soft thresholding transformation, we varied the value of β to construct a series of gene co-expression networks. Results in Figure 1c and d demonstrated that the power transformation significantly improved the performance of variable selection and led to a higher F-score peak and lower CER than the original non-transformed one (β=1). We further found the performance was not a monotonic function of β. Among the four power functions with different parameters β, the optimal value of F-score and CER were achieved when β was set to 3, which may result in the optimized state for emphasizing the correlation associated with true gene relationships by diminishing the noisy effects in this simulation setting. We observed similar results when d=1000 (Additional file 1: Figure S1c and d).
In comparison with other feature selection methods
To further demonstrate the effectiveness of our proposed network based analysis for variable selection, we compared it with two other classic filter algorithms, the Laplacian Score [7] and the Max-Variance. As a special case of a spectral feature selection algorithm, Laplacian score selects those features that can best preserve the local manifold structure (He et al. [7]). In the comparison, we set up the parameter β with the optimized value obtained in the aforementioned simulation result (β =3). The genes selected from each method were used for clustering samples with the K-means algorithm. Our proposed network algorithm consistently outperformed the other two methods, particularly at a low dimensionality (20–100), where all of these methods reached their best performance (Figure 2b). The Laplacian score method achieved an optimal CER value of 0.25, close to that of the network without a power transformation. Similar results were observed in the comparison of the F-scores (Figure 2a). This indicates the effectiveness of our variable selection method in clustering analysis.
Selecting genes that support a common clustering structure by module identification
The gene co-expression network analysis captures the relationships among the genes so that it can help identify a small number of sets of highly correlated genes, each of which tends to assemble into a functional module that can be involved in biological pathways or molecular complexes. Also, these genes together assure a specific clustering of samples, which might be different from other sets of correlated genes. The genes selected by a module usually have greater intramodular connectivity than those that do not belong to the module. Therefore, module analysis not only captures the connectivity information of individual nodes as what we did in the variable selection, but also reveals the higher order organization of gene topological similarity in the entire gene space.
We applied both the hard-thresholding and soft-thresholding transformation on the gene co-expression network for module identification. Note that the sensitivity of this method varies depending on the co-expression network size and the composition of variable space. In the analysis, we assigned the value of μ to 1.5 in the simulation setting to demonstrate a clear module structure. For hard thresholding transformation of network (d=500), we varied the threshold τ to retain the top 1% of correlation coefficients for calculating the Jaccard similarity coefficients between genes. This cutoff is roughly consistent with our simulation setting where there are 40 informative genes and 40*40 informative gene pairs among 500*500 total number of gene pairs (0.64%). We also tested the performance on the top 5% of correlations. Figure 3b and c showed the discovered modules in the network when the top 1% and top 5% of correlations were kept in the transformed network, respectively. Two ‘hot blocks’ can be clearly identified along the diagonal, each of which corresponding to the original defined module in the simulation setting with only a few missing genes. Due to varying numbers of edges included, the boundaries between blocks exhibited distinctive sharpness in Figure 3b and c, but the module structure and the genes included in each module were the same. Therefore, this simulation result can serve as a guideline for determining cutoffs. For a hard threshold of 1%, we are assuming that roughly 10% of the genes are informative and 90% are not. This assumption is of course not optimal for every dataset. Fortunately, this simulated example suggests that module identification is not very sensitive to this parameter. Therefore, in a real dataset, if we use a hard threshold, we will first set the threshold to select the top 1% of edges, and also vary the threshold slightly, while checking whether the hot block appears to be consistent with respect to small changes of the cutoff.
We also performed soft thresholding transformation and obtained similar results (Figure 3d and e), indicating the relative robustness of our method in module identification. Furthermore, each of the blocks induced a unique bipartitioning of the sample space that is equivalent to the sample partition inferred by the corresponding modules in our simulation setting (C1 vs. others and C2 vs. others). We observed similar results when d=1000 (Additional file 1: Figure S2).
Application in real datasets
Along with simulations, we applied our method to two real experimental datasets: Leukemia [8] and Colon cancer data [20].
The leukemia dataset consists of 72 patients with two subtypes of acute leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The latter is composed of two subclasses, B-cell and T-cell types. Therefore there could be two possible biologically meaningful clustering solutions including one with two-clusters of samples (AML and ALL) and the other with three clusters (T cell ALL, B cell ALL and AML).
Following Dudoit and Fridlyand [21], three pre-processing steps were applied to the original data matrix and a final 72 X 3571 data matrix was obtained. Because the pre-processing steps included thresholding the gene expression values with a floor and a ceiling boundaries, many artificially high correlations were introduced. We filtered out these genes whose medians equal the boundary values and obtained 3033 genes totally. We first studied the module organization of the gene space and the associated sample partition in the leukaemia dataset. In the implementation, the value of τ was chosen to include only the edges corresponding to the top 1% of paired correlation coefficients in the network. As shown in Figure 4a, the topological similarity matrix exhibited a sharp separation of modules from its neighboring genes. We evaluated the sample clustering performance of modules by using the gene set included in each module for sample partitioning. We found that most of them induced a meaningful partition of the sample space. Specifically, the first module at the bottom right corner rendered a dichotomy of the samples according to the known classification, ALL/AML, with the CER value equalling 0.155, whereas the second module tends to distinguish B cell ALL patients from the rest with a CER value of 0.2, indicating the unrecognized similarity between T cell ALL samples and AML samples in the dataset. The other modules also imposed a potential novel partition of samples. These results confirmed multiple possible clustering solutions in the leukemia dataset. We also performed variable selection to select individual genes based on their connectivity in the transformed network using soft thresholding transformation. For the three-cluster solution, the 100 genes selected from the network-based analysis yielded the sample partition coinciding almost precisely with the known classification (T cell ALL, B cell ALL and AML) with CER equalling to 0.09, when the parameter β reached 40 or above (Figure 4c). In Figure 4c, we also compared the performance of our network-based method with other filter methods. Results showed that our approach achieved a comparable optimal CER value with the Laplacian and the Max-Variance methods, when more than 100 genes were selected. Out of the top 100 informative genes selected from our method, 28 and 35 genes were found by the Laplacian and the Max-Variance method, respectively (Additional file 1: Figure S4a). The genes selected from our method may also represent new sample partitions. This is supported by the observation that our method achieved better separation of B cell AML samples from the rest compared to other methods (Additional file 1: Figure S5a). We further examined the gene lists selected from soft thresholding transformation with varied β, and found that the overlap among the gene lists increased as β became larger. After β reached 80, the selected gene lists were almost unchanged. This is consistent with their similar clustering performance as shown in Figure 4c. However, recall that in our simulation studies the optimal CER was achieved when β was small (β =3), reflecting possibly different variable compositions between the simulated dataset and the leukemia data. As shown in Figure 5a and b, these implications became clear. In Figure 5a, the correlation of 40 informative genes in the simulated dataset followed a uniform distribution. Unlike the 40 informative genes in simulation, the correlation of the 100 selected informative genes from the leukemia data followed a mixture distribution with two components, one at the high end of the distribution and another close to zero (Figure 5b). This is reasonable for real datasets given the assumption that gene expression data are influenced by many active biological processes, where genes within each of the processes tend to be highly correlated with one another, but may not be well correlated with those participating in other biological processes. Therefore, the correlation values between genes corresponding to different biological processes will be small, located in the low end of the correlation distribution. Since the component in the high end was well separable from the other for the full gene space, the corresponding highly connected genes tend to be always selected as informative genes disregarding the value of β. Therefore, the selected genes should not be sensitive to β when β is above a certain threshold that aims to remove the noisy correlations. However, this was not the case for the simulated dataset, where the correlation of the informative genes followed a uniform distribution while that of the full set of genes was close to zero, so a small value of β should be able to remove the noise induced correlations.
We also analyzed the colon cancer data by Alon et al. [20], which contains two classes of samples based on disease status: 40 tumor samples and 22 normal samples. In addition, an independent study reported that there was a difference in the experimental protocols used to process the samples [22]. There are 22 samples processed by protocol 1, and the other 40 samples were processed by protocol 2. Taking the different protocols into consideration, the study has at least three different possible sample partition structures based on disease status, sample protocols, and their combination. In the analysis, we first employed the module analysis on the dataset, followed by variable selection for the clustering analysis. As we did with the leukemia dataset, the edges corresponding to the top 1% correlation were kept for module analysis. As shown in Figure 4b, dozens of modules were identifiable along the diagonal of the similarity matrix. Each module exhibited a distinctive partition of the sample space. Among the first three modules at the bottom right, module 1 had strong tendency to partition the samples according to the normal versus tumor classification with a CER value (0.35), whereas module 3 was informative for the partition based on different protocols (CER=0.27). Also the number of genes included in these two modules differed. Module 1 was the largest in terms of the number of genes included. These results indicate that the classification of tumor versus normal samples is a more dominant factor in the sample clustering compared to the different sample protocols. It was interesting to observe that the number of genes in module 2 was similar compared to module 1. However their clustering behaviours differed, suggesting that module 2 may inform a novel sample partition of the colon cancer dataset. The aforementioned module analysis reinforced the idea that the colon cancer dataset has at least three clustering solutions. Here, a soft thresholding transformation was implemented to select the feature genes. As shown in Figure 4d, in a four-cluster solution that combines the disease status and the sample protocol changes, the clustering performance of our method based on different values of β is similar given a β value of 40 or above. Here, the CER value was higher than that obtained from the leukemia dataset, possibly due to mislabeled samples [23]. Nevertheless, our approach achieved a better performance than the other methods. In Figure 4d, when β was 40 or above, our method had a CER value equal to 0.25 with 100 genes selected, whereas Laplacian and Max-Variance obtained the CER values of 0.32 and 0.29, respectively. Only 18 and 8 genes selected from our approach were also obtained from the Laplacian and the Max-Variance method, respectively (Additional file 1: Figure S4b). For the other clustering solution based on different experimental protocols, our method outperformed others as well (Additional file 1: Figure S5b). We also plotted the distribution of correlations for full gene space and the 100 selected informative genes respectively (Figure 5c). Similar to the leukemia dataset, the gene set has two well-separated components of correlations at two ends, which may explain their saturating behaviour of clustering performance when β reaches a certain value.