 Methodology article
 Open Access
Analyzing the similarity of samples and genes by MGPCC algorithm, tSNESS and tSNESG maps
BMC Bioinformatics volume 19, Article number: 512 (2018)
Abstract
Background
Clustering and visualization of samples and genes are important methods for analyzing gene expression data sets collected under different samples. However, it is difficult to integrate clustering and visualization techniques when the similarities of samples and genes are defined by the PCC (Pearson correlation coefficient) measure.
Results
Here, for gene expression data sets with few samples, we use the MGPCC (minigroups defined by PCC) algorithm to divide the samples into minigroups, and use tSNESSP maps to display these minigroups. The idea of the MGPCC algorithm is that nearest neighbors should be in the same minigroup. A tSNESSP map is selected from a series of tSNE (t-distributed Stochastic Neighbor Embedding) maps of standardized samples that differ in their perplexity parameter. Moreover, PCC clusters of large numbers of genes are displayed by a tSNESGI map, which is selected from a series of tSNE maps of standardized genes that differ in their initialization dimensions. The tSNESSP and tSNESGI maps are selected by Avalue, which is modeled from the areas of clustering projections: the tSNESSP (or tSNESGI) map is the tSNE map with the smallest Avalue.
Conclusions
From the analysis of cancer gene expression data sets, we demonstrate that the MGPCC algorithm is able to put tumor and normal samples into their respective minigroups, and that tSNESSP (or tSNESGI) maps are able to display the relationships between minigroups (or PCC clusters) clearly. Furthermore, tSNESS(m) (or tSNESG(n)) maps are able to construct independent tree diagrams of the nearest sample (or gene) neighbors, where each tree diagram corresponds to a minigroup of samples (or genes).
Background
With the rapid development of high-throughput biotechnologies, it has become easy to collect large amounts of gene expression data from many biological and medical subjects [1]. Here, we focus on gene expression data sets that come from tumor and normal samples. These data sets are often characterized by a large number of genes but a relatively small number of samples, with rows corresponding to genes and columns representing samples [2]. They usually incorporate several thousand probes of varying relevance to cancer [3]. Thus, filtering approaches are applied to each probe before data analysis in order to find differentially expressed genes, such as T-statistics, Significance Analysis, Adaptive Ranking, Combined Adaptive Ranking and Two-way Clustering [4, 5]. For the samples of gene expression data sets, a major challenge is to resolve their subtypes and compare them across different diseased states [4, 6]. Much work has been done on exploring cancer subtypes with methods such as hierarchical clustering, K-means, penalised likelihood methods and the random forest [7, 8]. Moreover, to determine the intrinsic dimensionality of genes, clustering analysis is used to search for patterns and to group genes into expression clusters that provide additional insight into the biological function and relevance of differentially expressed genes [9–13]. Furthermore, to display the classification of genes (or samples) in a meaningful way for exploration, presentation and comprehension of diseased states and normal differentiation, many dimension reduction techniques are used to embed high-dimensional data in 2D (two-dimensional) spaces for visualization [14–17]. Techniques such as hierarchical clustering dendrograms, PCA (principal component analysis), tSNE, heat maps and network graphs have been successful in complementing clusters defined by Euclidean distance [14–18].
For the samples of gene expression data sets, high dimensionality often results in different sample types being isometric under Euclidean distance [9]. Thus, PCC is also commonly used in the clustering analysis of samples and genes [10, 12, 13]. The simplest way to think about PCC is to plot the expression curves of two genes, with PCC telling us how similar the shapes of the two curves are. However, many projection techniques usually give poor visualizations of PCC clusters of gene expression data [16]. To map PCC clusters efficiently, PCC has been defined on transformed genes, as in PCCF (PCC of Fpoints) and PCCMCP (PCC of multiple-cumulative probabilities) [19, 20]. PCAF and tSNEMCPO give good visualizations of clusters of PCCF and PCCMCP, respectively. However, PCAF and tSNEMCPO also give poor visualizations of PCC clusters of the original gene expression points [19, 20].
Here, we use the MGPCC algorithm to divide the samples of gene expression data sets into minigroups, where the similarities of samples are defined by the PCC measure and the idea of the algorithm is that nearest neighbors should be in the same minigroup. That is, for any sample in a minigroup, its nearest neighbor is also in that minigroup. Moreover, we use tSNESSP maps to display the relationships of minigroups, where a tSNESSP map is selected from a series of tSNE maps of standardized samples that have different perplexity parameters and an initialization dimension of thirty. In tSNE, the perplexity can be viewed as a knob that sets the number of effective nearest neighbors; it is comparable to the number of nearest neighbors employed in many manifold learners [21, 22].
Furthermore, for gene clusters generated from PCC, we use tSNESGI maps to display them, where tSNESGI maps are selected from a series of tSNE maps of standardized genes. In contrast to tSNESSP maps, a tSNESGI map is selected from tSNE maps that have the same perplexity parameter but different initialization dimensions, where the perplexity parameter equals the dimension of the genes. For gene expression data sets under different samples, the genes are numerous and dense, and tSNE requires a larger perplexity to perform well on such data.
We use Avalue to select the tSNESSP and tSNESGI maps, where Avalue is modeled from the areas of clustering projections, and a tSNE map is selected as tSNESSP (or tSNESGI) if its Avalue is the smallest among the candidates. For clusterings with different numbers of clusters, the tSNESGI maps may come from different tSNE maps.
To evaluate the reliability of MGPCC and tSNESSP, we applied them to gene expression data sets of lung cancers [23, 24]. The results showed that the MGPCC algorithm was able to put tumor and normal samples into their respective minigroups, and tSNESSP maps gave these minigroups clear boundaries, which helps in mining cancer subtypes. Moreover, tSNESGI maps gave better visualizations of PCC clusters of genes than tSNE maps of the original and normalized genes, which integrates clustering and visualization techniques more closely. Furthermore, tSNESS(m) (or tSNESG(n)) maps were able to give the nearest sample (or gene) neighbors independent tree diagrams, where each tree diagram corresponded to a minigroup of samples (or genes).
Materials and methods
Data and data source
The first data set, GDS3837, provides insight into potential prognostic biomarkers and therapeutic targets for non-small cell lung carcinoma; it has 54674 genes and 60 normal and 60 tumor samples taken from nonsmoking females [23, 24]. The second data set, GDS3257, provides insight into the molecular basis of lung carcinogenesis induced by smoking; it has a total of 22283 genes and contains 107 samples taken from former, current and never smokers [23, 24]. GDS3837 and GDS3257 can be downloaded from NCBI’s GEO Database.
Here, we first use GDS3257 and GDS3837 to construct 5 matrices, where A_{k} (k = 1, 2, 3, 4 and 5) is the kth matrix, the ith row of A_{k} represents the ith gene, the jth column represents the jth sample, the genes of A_{k} are filtered by T-test (hypothesis testing for the difference in means of the two types of samples), and the details of A_{k} are summarized in Table 1. Then, 5 sample data sets are constructed from A_{k}, where datak (k = 1, 2, 3, 4 and 5) is the kth sample data set and is the transpose of A_{k}. Finally, 5 gene data sets are constructed from A_{k}, where data(k+5) (k = 1, 2, 3, 4 and 5) is the kth gene data set and contains only the tumor samples of A_{k}. That is, A_{k} is represented by (B_{k}, C_{k}), where B_{k} and C_{k} contain the tumor and normal samples respectively, and data(k+5) is B_{k}.
Methods
Here, we use X_{i} to represent the ith sample of datak(k =1, 2, 3, 4 and 5), and Y_{j} to represent the jth gene of data(k+5). That is, X_{i} is the ith row of datak(k =1, 2, 3, 4 and 5), and Y_{j} is the jth row of data(k+5), where
Spoints
Here, X_{i} and Y_{j} are standardized into SS_{i} and SG_{j}, where SS_{i} and SG_{j} are called as Ssample and Sgene of X_{i} and Y_{j} respectively, and
MGPCC algorithm
Here, for X_{j1} and X_{j2}, they are used to construct the first minigroup, where
ρ(X_{i},X_{j}) is PCC between X_{i} and X_{j}, and u is the number of samples. For X_{j3}, it is put into the first minigroup if it satisfies
When the first minigroup contains (t−1)(t>3) samples, X_{jt} is put into the first minigroup if it satisfies
where X_{j1}, X_{j2}, ⋯, X_{j(t−1)} belong to the first minigroup. This step is repeated, and the first minigroup is complete when no remaining sample satisfies Eq. (5).
The remaining samples repeat the above steps until all minigroups are built. A minigroup is complete when no remaining sample satisfies Eq. (4) or (5); that is, a minigroup contains at least two samples. Similarly, the MGEuclidean algorithm can be used to construct minigroups, where the algorithm uses Euclidean distance to define the similarities of samples.
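The grouping idea can be illustrated with a minimal Python sketch. It does not implement Eqs. (3) to (5), which are not reproduced in this text. Instead it rests on an assumption drawn from the stated idea of the algorithm: every sample is linked to the sample with which it has the highest PCC, and minigroups are taken to be the connected components of these links. The function names `pcc` and `nearest_neighbor_minigroups` are hypothetical.

```python
from math import sqrt


def pcc(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Constant vectors have zero variance and would raise ZeroDivisionError.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def nearest_neighbor_minigroups(samples):
    """Group samples so that each sample shares a group with its nearest neighbor.

    Assumption (not the paper's Eqs. (3)-(5)): groups are the connected
    components of the graph linking each sample to its highest-PCC neighbor.
    Requires at least two samples.
    """
    u = len(samples)
    parent = list(range(u))  # union-find forest over sample indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(u):
        nearest = max((j for j in range(u) if j != i),
                      key=lambda j: pcc(samples[i], samples[j]))
        parent[find(i)] = find(nearest)  # merge i with its nearest neighbor

    groups = {}
    for i in range(u):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


toy_samples = [[1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1], [9, 6, 4, 2]]
toy_groups = nearest_neighbor_minigroups(toy_samples)
```

Under this assumption every group has at least two members, because each sample is always merged with its nearest neighbor, which matches the statement that a minigroup contains at least two samples.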
The Avalue
For samples of each minigroup, we plot the boundary of their projections by a closed line, where the closed line is called as boundaryline of the minigroup, the boundaryline forms a convex hull of their projections, and the area of the convex hull is called as Avalue of the minigroup. Here, we use Avalue to describe the consistency between samples and their projections, where
a_{i} is the Avalue of the ith minigroup, a is the Avalue of the data set, and v is the number of minigroups.
In general, the convex hulls of adjacent minigroups often overlap to some extent. Thus, the smaller the Avalue, the more consistent the projections are with the points.
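As an illustration of the per-minigroup quantity a_{i}, the following Python sketch computes the area of the convex hull of each group's 2D projections, using Andrew's monotone chain for the hull and the shoelace formula for the area. It is a sketch only: the function names are hypothetical, and the way the a_{i} are combined into the data set value a is given by Eq. (6), which is not reproduced here, so no aggregation is attempted.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_area(points):
    """Area enclosed by the convex hull of 2D points (shoelace formula)."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0  # fewer than three distinct hull vertices enclose no area
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def group_areas(projections, labels):
    """Hull area of each group's projected points, keyed by group label."""
    groups = {}
    for p, g in zip(projections, labels):
        groups.setdefault(g, []).append(tuple(p))
    return {g: hull_area(pts) for g, pts in groups.items()}


toy_points = [(0, 0), (2, 0), (0, 2), (5, 5), (6, 5), (5, 6), (6, 6)]
toy_labels = ["a", "a", "a", "b", "b", "b", "b"]
toy_areas = group_areas(toy_points, toy_labels)
```

Here group "a" is a right triangle with legs of length 2 and group "b" is a unit square, so the hull areas can be checked by hand.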
The tSNESSP and tSNESGI
Using tSNE requires tuning some parameters, notably the perplexity and the initialization dimension. Although tSNE results are fairly robust to parameter settings, in practice we still have to choose parameters interactively by visually comparing results under multiple settings. For minigroups and clusters of samples generated from PCC, we empirically validate that tSNE maps of the standardized samples with an appropriate perplexity can display them clearly, where the initialization dimension of these tSNE maps is thirty. For PCC clusters of genes, tSNE maps of Sgenes with an appropriate initialization dimension can give good visualizations, where the perplexity parameter of these tSNE maps is the dimension of the genes.
Here, for minigroups and clusters of samples generated from PCC, the tSNESSP map is selected by Avalue from a series of tSNESS(k) maps, where tSNESS(k) is the tSNE map of the standardized samples with an initialization dimension of thirty and perplexity parameter k, and k ranges from 3 to 30. That is, tSNESS(t) is selected as tSNESSP if its Avalue is the smallest among all tSNESS(k). Similarly, for PCC clusters of genes, the tSNESGI map is selected by Avalue from a series of tSNESG(i) maps, where tSNESG(i) is the tSNE map of Sgenes whose perplexity parameter is the dimension of the genes and whose initialization dimension is i, and i ranges from 3 to the dimension of the genes.
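The selection rule amounts to an argmin over candidate maps, which the following Python sketch makes explicit. The helper `select_map` is hypothetical, and `toy_embed` and `toy_avalue` are stand-ins so that the sketch runs without any tSNE library; in practice `embed(k)` would compute the tSNE map for parameter value k, and `avalue` would compute the Avalue of the resulting projection.

```python
def select_map(param_values, embed, avalue):
    """Return (best_param, best_score, scores) for the map with the smallest score.

    embed(k)   -> 2D projection for parameter value k (e.g. a perplexity)
    avalue(p)  -> score of projection p; smaller is better
    """
    scores = {}
    for k in param_values:
        scores[k] = avalue(embed(k))
    best = min(scores, key=scores.get)
    return best, scores[best], scores


# Toy stand-ins (hypothetical): a triangle whose area is smallest at k = 12.
def toy_embed(k):
    return [(0.0, 0.0), (abs(k - 12) + 1.0, 0.0), (0.0, 1.0)]


def toy_avalue(projection):
    (x1, y1), (x2, y2), (x3, y3) = projection
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)) / 2.0


# Perplexity candidates 3..30, as in the range stated for tSNESS(k).
best_k, best_score, all_scores = select_map(range(3, 31), toy_embed, toy_avalue)
```

Keeping the full `scores` dictionary makes it easy to plot the score against the parameter, as one would when comparing candidate maps.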
Accuracy, F-measure, RI and NMI
Since tSNE maps are able to give good visualizations of clusters defined by Euclidean distance, they can be successful in complementing PCC clusters that are relatively consistent with the Euclidean ones. Here, we use Accuracy, F-measure, RI (Rand index) and NMI (normalized mutual information) [25, 26] (http://nlp.stanford.edu/IRbook/html/htmledition/evaluationofclustering1.html) to evaluate the consistency between PCC and Euclidean clusters, where the Euclidean clusters are taken as the gold standard for genes. In general, Accuracy is a simple and transparent evaluation measure, RI penalizes both false positive and false negative decisions during clustering, F-measure in addition supports differential weighting of these two types of errors, and NMI can be interpreted information-theoretically. Detailed explanations of these four criteria are given in [25, 26] and at the URL above, and their MATLAB codes are available in Additional file 1. The higher the values of these four criteria, the more consistent the PCC and Euclidean clusters.
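For readers who want a concrete picture of two of these criteria, the following Python sketch computes RI by pair counting and NMI as mutual information divided by the mean of the two entropies. It is an illustrative sketch, not the MATLAB code of Additional file 1, and Accuracy and F-measure are omitted. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def rand_index(gold, pred):
    """Fraction of item pairs on which the two clusterings agree.

    Requires at least two items; cost is quadratic in the number of items.
    """
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum((gold[i] == gold[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(gold, pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(gold)
    joint = Counter(zip(gold, pred))
    cg, cp = Counter(gold), Counter(pred)
    mi = sum(c / n * log(n * c / (cg[g] * cp[p])) for (g, p), c in joint.items())
    denom = (entropy(gold) + entropy(pred)) / 2
    # Convention chosen here: two single-cluster labelings count as identical.
    return mi / denom if denom > 0 else 1.0


same_ri = rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, renamed labels
cross_ri = rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that cut across
same_nmi = nmi([0, 0, 1, 1], [1, 1, 0, 0])
cross_nmi = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Both measures depend only on the partitions and not on the label names, which is why the first comparison scores as a perfect match.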
Results
The reliability of minigroups
To test the reliability of minigroups, we applied the MGPCC and MGEuclidean algorithms to the 5 sample data sets, where MGEuclidean was applied to each of the standardized samples, Osamples and Nsamples (the original and normalized samples, respectively), and the resulting minigroups are summarized in Table 2. A minigroup was regarded as a tumor group if it contained more tumor samples than normal ones; otherwise, it was regarded as a normal group. A tumor (or normal) sample was misjudged if it was put into a normal (or tumor) group. Table 2 showed that MGPCC and MGEuclidean correctly judged all samples of data1 and data3, and misjudged only a few samples of data2, data4 and data5. For instance, only 2 normal and 2 tumor samples of data5, which contained 60 normal and 60 tumor samples, were misjudged by the MGPCC algorithm. That is, the MGPCC algorithm was able to put tumor and normal samples into their respective minigroups, which could help us to compare different diseased states with normal samples.
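The counting rule described above can be written down directly. The Python sketch below labels each minigroup by its majority class, with ties going to "normal" as the rule implies, and counts the samples whose own class differs from their group's label. The function name and the toy inputs are hypothetical and are not taken from Table 2.

```python
from collections import Counter


def count_misjudged(minigroup_ids, true_labels):
    """Count misjudged samples per true class ("tumor" or "normal").

    A minigroup is a tumor group if it has more tumor than normal samples,
    otherwise a normal group; a sample is misjudged if its class differs
    from the label of its minigroup.
    """
    members = {}
    for g, lab in zip(minigroup_ids, true_labels):
        members.setdefault(g, []).append(lab)

    misjudged = Counter()
    for labs in members.values():
        counts = Counter(labs)
        group_label = "tumor" if counts["tumor"] > counts["normal"] else "normal"
        for lab in labs:
            if lab != group_label:
                misjudged[lab] += 1
    return dict(misjudged)


toy_ids = [0, 0, 0, 1, 1, 1, 2, 2]
toy_classes = ["tumor", "tumor", "normal",
               "normal", "normal", "tumor",
               "tumor", "normal"]
toy_misjudged = count_misjudged(toy_ids, toy_classes)
```

In the toy input, group 2 is tied and therefore counts as a normal group, so its tumor sample is misjudged.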
The clustering feature of Sgenes
Here, for data6, 7, 8, 9 and 10, the Sgenes, Ngenes and Ogenes were each divided into clusters by K-means with Euclidean distance and with PCC, where Ngenes and Ogenes were the normalized and original genes, respectively. Then, Accuracy, F-measure, RI and NMI were used to measure the consistency between PCC and Euclidean clusters, where the Euclidean clusters were taken as the gold standard for genes. For comparison, the Accuracy, F-measure, RI and NMI of these PCC clusters are summarized in Table 3. For every data set, Table 3 showed that the Accuracy, F-measure, RI and NMI of Sgenes were far higher than those of Ngenes and Ogenes. That is, the PCC and Euclidean clusters of Sgenes were more consistent than those of Ogenes and Ngenes.
In general, for data with a normal distribution, the patterns revealed by clusters under PCC and Euclidean distance roughly agree with each other. However, for Ogenes and Ngenes of complex gene expression data sets, the results showed that PCC and Euclidean clusters differed significantly.
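One possible reading of why standardized data behave this way can be checked numerically. The sketch below assumes that standardization means the usual z-score (subtract the mean, divide by the population standard deviation); this is an assumption, since the standardization formula is not reproduced in this text. Under that assumption, the squared Euclidean distance between two standardized vectors of length n equals 2n(1 − ρ), where ρ is their PCC, so distance and correlation rank pairs in exactly opposite order. The function names are hypothetical.

```python
from math import sqrt


def standardize(x):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    n = len(x)
    mean = sum(x) / n
    sd = sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]


def pcc(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def squared_distance(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


x = [2.0, 4.0, 6.0, 9.0]
y = [1.0, 3.0, 2.0, 5.0]
sx, sy = standardize(x), standardize(y)

lhs = squared_distance(sx, sy)        # distance after standardization
rhs = 2 * len(x) * (1 - pcc(x, y))    # 2n(1 - rho) from the original vectors
rho_before, rho_after = pcc(x, y), pcc(sx, sy)
```

The two sides agree up to rounding, and the PCC is unchanged by standardization, since a z-score is a positive affine transformation of each vector.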
The reliability of Avalue
Here, we used clusters of data5 to illustrate that Avalue was able to quantify the validity of projection maps, where the samples of data5 were divided into 5 and 3 clusters by K-means with PCC. The 5 and 3 clusters of data5 were displayed on tSNESS(20) and tSNESS(30) maps (Fig. 1(a) and 1(b)) respectively, together with the boundarylines of the clustering projections. The tSNESS(30) map of data5 gave a good visualization of the 3 clusters (Fig. 1(b)), but tSNESS(20) showed slight intermixing of the 5 clusters (Fig. 1(a)). Moreover, the boundarylines of the 5 clusters overlapped more than those of the 3 clusters, and Avalue increased with the area of overlap. That is, the larger the Avalue, the less consistent the projections were with the points.
Selecting tSNESSP maps by Avalue
Here, the Osamples of data4 and data5 were each divided into 5 clusters by K-means with PCC. Then, for the clustering results of data4 and data5, the Avalues of different tSNESS(k) maps were obtained by Eq. (6) and are shown by blue lines in Fig. 2 (a) and (b), respectively. Fig. 2 (a) and (b) showed that the Avalues of different tSNESS(k) maps differed significantly, and that the Avalues of tSNESS(20) and tSNESS(25) were the minimum for the 5 clusters of data4 and data5, respectively. That is, tSNESS(20) and tSNESS(25) were the optimal 2D maps for the 5 clusters of data4 and data5, respectively.
Moreover, for the 2 clusters of data4 and data5 defined by normal and tumor samples, the Avalues of different tSNESS(k) maps are shown by red lines in Fig. 2 (a) and (b). From Fig. 2 (a) and (b), tSNESS(30) was not the optimal 2D map for either data set. For the 2 clusters of data5, the Avalue of tSNESS(30) was 0.58779, while that of tSNESS(18) was 0.52373. That is, tSNESS(18) was more appropriate for displaying the 2 clusters of data5.
Selecting tSNESGI maps by Avalue
Here, for gene clusters of data7 and data9, the Avalues of tSNESG(i) maps are shown in Fig. 2 (c) and (d) respectively, where the Ogenes of each data set were divided into 3 and 5 clusters by K-means with PCC. Figure 2 (c) and (d) showed that tSNESG(m) maps were not the optimal 2D maps for any clustering result: tSNESG(4) was the tSNESGI map of the 3 clusters of both data7 and data9, tSNESG(7) was the tSNESGI map of the 5 clusters of data7, and tSNESG(8) was the tSNESGI map of the 5 clusters of data9.
By Accuracy, F-measure, RI and NMI, we demonstrated that PCC and Euclidean clusters of Sgenes were relatively consistent, which enabled tSNESG(i) maps to display PCC clusters. However, a tSNE map with randomly chosen parameters could give a poor visualization of PCC clusters, which could lead to misinterpretation of the clusters. Here, we used Avalue to quantify the quality of tSNESG(i) maps, which enabled tSNESGI maps to project genes of the same cluster together and neighboring clusters into adjacent regions.
The biological reliability of tSNESSP maps
Here, we used data1, 2, 3 and 4 to assess the biological reliability of tSNESSP maps. These four data sets were mapped on tSNESSP maps according to the population membership of samples (Fig. 3), where the tSNESSP maps of data1, 2, 3 and 4 were tSNESS(20), tSNESS(19), tSNESS(30) and tSNESS(18), respectively. For tumor and normal samples of different populations, the biological partitioning is not always obvious from the differentially expressed genes, but Fig. 3 clearly showed that tSNESSP maps were able to project samples of the same population together, which could help us to understand the relationships between different populations.
The consistency between MGPCC algorithm and tSNESSP maps
Here, the minigroups of data1, 2, 3 and 4 were used to assess the consistency between the MGPCC algorithm and tSNESSP maps, where data1, 2, 3 and 4 were divided into 4, 7, 5 and 13 minigroups by the MGPCC algorithm, respectively. These four data sets were mapped on tSNESSP maps according to the minigroup membership of samples (Fig. 4), where the tSNESSP maps of data1, 2, 3 and 4 were tSNESS(6), tSNESS(8), tSNESS(12) and tSNESS(12), respectively. From Fig. 4(a), (b) and (c), the tSNESSP maps of data1, 2 and 3 were able to project samples of the same minigroup together. For data4, however, the seventh minigroup, whose samples were marked by black points, intermixed slightly with others (Fig. 4(d)). Likewise, for the 23 minigroups of data5 generated by the MGPCC algorithm, the relationships were not clearly displayed by the tSNESSP map. That is, the display quality of tSNESSP maps might weaken when the number of minigroups was relatively large.
Comparison of tSNESSP and PCAS
Here, the tumor and normal samples of data5 were mapped on PCAS and tSNESSP maps (Fig. 5 (a) and (b)) according to their population memberships, where PCAS is PCA of samples and the tSNESSP map of data5 was tSNESS(18). Then, the samples of data5 were divided into 4 clusters by K-means with PCC, and the clustering result was overlaid on PCAS and tSNESSP (tSNESS(19)) maps (Fig. 5 (c) and (d)). For the biological classifications and PCC clusters of data5, Fig. 5 showed that tSNESSP maps provided good 2D projections (Fig. 5 (b) and (d)), but PCAS maps showed significant intermixing (Fig. 5 (a) and (c)). PCAS also gave poor visualizations of the biological classifications and PCC clusters of the other data sets in this paper.
The optimization criterion of PCA depicts the relationships of distant points as accurately as possible, while small interpoint distances may be distorted [14]. Moreover, there may be no single linear projection that gives a good view of most gene expression data [14]. Thus, for complex gene expression data sets, many linear projection methods may fail.
The reliability of tSNESGI maps
The 3, 4, 5 and 6 clusters of data9 generated by K-means with PCC were shown on tSNESGI maps (Fig. 6), where the tSNESGI maps of the 3, 4, 5 and 6 clusters were tSNESG(4), tSNESG(5), tSNESG(8) and tSNESG(9), respectively. Figure 6 showed that tSNESGI gave relatively clear 2D projections of the 3, 4 and 5 clusters, but showed significant intermixing for the 6 clusters. That is, the display quality of tSNESGI maps might weaken when the number of clusters was relatively large.
Unlike K-means clustering, the MGPCC algorithm does not presuppose the number of clusters. For genes, however, the MGPCC algorithm generates a large number of minigroups, which can place genes with similar biological function into different minigroups. Thus, the MGPCC algorithm is not appropriate for clustering genes.
Comparison of tSNESGI, tSNEN and tSNEO maps
Here, the Ogenes of data7 and data8 were first divided into 3 clusters by K-means with PCC, and then these clustering results were overlaid on tSNESGI, tSNEN and tSNEO maps (Fig. 7), where tSNEN and tSNEO maps were tSNE maps of Ngenes and Ogenes respectively, and their initialization dimensions were the same as those of tSNESGI. Figure 7 showed that tSNESGI provided good 2D projections of these clustering results, but tSNEN and tSNEO maps showed significant intermixing.
For PCC clusters of data6, 9 and 10, tSNEN and tSNEO maps also gave poor visualizations. The reason was that the PCC and Euclidean clusters of Ogenes and Ngenes differed significantly.
Constructing the nearest sample neighbor map by tSNESS(m)
For gene expression data sets, hierarchical clustering is usually used to display sample neighbors [27], but the method is likely to produce loose sample neighbors. By Dplots [19], tSNESS(m) maps were able to generate more valid sample neighbors than tSNESSP, where m was the dimension of samples. Here, we constructed the nearest sample neighbors on tSNESS(m) maps, where sample neighbors were defined by PCC. The sample neighbors of data1, 2, 3 and 4 were displayed in Fig. 8, where nearest sample neighbors were connected by red lines.
Figure 8 showed that the sample neighbors formed several independent tree diagrams, each corresponding to a minigroup of samples. Thus, the combination of the tSNESS(m) map and the MGPCC algorithm was able to help us to search for subtypes of samples.
Constructing the nearest gene neighbor map by tSNESG(n)
tSNE has previously been used to construct gene neighbors, with the initialization dimension of tSNE set to the dimension of the genes [14, 20]. Here, we constructed the nearest gene neighbors on tSNESG(n) maps, where gene neighbors were defined by PCC, and we focused our attention on data6. The gene neighbors of data6 were displayed in Fig. 9(a). From Fig. 9(a), the gene neighbors also formed many independent tree diagrams, and these tree diagrams corresponded to minigroups generated by the MGPCC algorithm.
Based on GDS3837, GDS3257 and GDS3054, nine differentially expressed genes associated with lung cancer had been extracted; these 9 genes were smoking independent, and they were AGER, CA4, EDNRB, FAM107A, GPM6A, NPR1, PECAM1, RASIP1 and TGFBR3 [16]. Here, we used the tSNESG(n) map to display the nine minigroups that contained these nine genes (Fig. 9(b)). The nine independent tree diagrams in Fig. 9(b) might help us to search for correlated genes.
Discussion
For the samples of cancer gene expression data sets, there is usually no clear boundary between subtypes [7]. The reason is that the high dimensionality of samples often results in different subtypes being isometric [9]. Here, we use the MGPCC algorithm to divide samples into minigroups, and the results show that the algorithm can put tumor and normal samples into their respective minigroups. The MGPCC algorithm puts nearest neighbors into the same minigroup, which can distinguish inconspicuous differences between subtypes of samples. However, when the MGPCC algorithm is applied to genes, it generates a large number of minigroups. That is, genes with similar expression patterns may be put into different minigroups, which makes it difficult to group genes with similar biological function together. The reason is that the MGPCC algorithm does not presuppose the number of minigroups, and similar genes are not necessarily nearest neighbors. Moreover, with a large number of minigroups, any dimension reduction technique may give messy visualizations of the entire data set. Thus, the MGPCC algorithm is not appropriate for dividing genes.
To display the minigroups of samples generated by the MGPCC algorithm efficiently, we first verify that PCC and Euclidean clusters of the standardized samples are more consistent than those of the original and normalized samples, and that the PCC of the standardized samples is the same as that of the original and normalized ones. Since tSNE maps have been successful in displaying clusters defined by Euclidean distance, tSNE maps of the standardized samples can also give good visualizations of minigroups. However, tSNE maps of the standardized samples differ significantly under different parameters, and most of them give poor visualizations of minigroups. To select the optimal tSNE maps of minigroups, we then construct tSNESSP maps, which are selected from tSNE maps of the standardized samples with different perplexity parameters. The results show that tSNESSP maps give good visualizations of both minigroups and PCC clusters of samples. However, when tSNESSP maps are used to display PCC clusters of genes, they give fuzzy visualizations. The reason may be that the dimension of samples is far larger than that of genes. To map PCC clusters of genes efficiently, tSNESGI maps are constructed, which are selected from tSNE maps of the standardized genes with different initialization dimensions. Using several cancer gene expression data sets, we verify that tSNESGI maps can give good visualizations of PCC clusters of genes. Furthermore, we use tSNESS(m) and tSNESG(n) maps to display the nearest neighbors of samples and genes respectively, which makes the relationships between samples (or genes) easy to visualize and understand. In total, these four types of tSNE maps make cancer gene expression data sets easy and intuitive to interpret.
Conclusion
In this article, we use the MGPCC algorithm to divide the samples of gene expression data sets into minigroups, and tSNESSP maps to display the relationships of these minigroups. Moreover, we provide tSNESGI maps to display PCC clusters of genes, and tSNESS(m) and tSNESG(n) maps to display the nearest neighbors of samples and genes respectively. In total, the MGPCC algorithm and these four types of tSNE maps, used together, can help us to understand entire gene expression data sets.
Abbreviations
2D: two dimensional
Avalue: the quantifying criterion of the projecting maps
MGPCC: the algorithm using PCC to put the nearest neighbors into the same minigroups
Ngenes: the normalized genes
Ogenes: the original genes
PCA: principal component analysis
PCAS: PCA of the standardized samples
PCC: Pearson correlation coefficient
Sgenes: the standardized genes
Ssamples: the standardized samples
tSNE: t-distributed Stochastic Neighbor Embedding
tSNEN: tSNE of Ngenes
tSNEO: tSNE of Ogenes
tSNESG: tSNE of standardized genes
tSNESGI: the tSNESG map whose Avalue is the smallest
tSNESS: tSNE of standardized samples
tSNESSP: the tSNESS map whose Avalue is the smallest
References
Brazma A, Vilo J. Gene expression data analysis. Febs Lett. 2000; 480(1):17–24.
Yu X, Yu G, Wang J. Clustering cancer gene expression data by projective clustering ensemble. PLoS ONE. 2017; 12(2):e171429.
Grimes ML, Lee WJ, van der Maaten L, Shannon P. Wrangling phosphoproteomic data to elucidate cancer signaling pathways. PLoS ONE. 2013; 8(1):e52884.
Shaik JS, Yeasin M. A unified framework for finding differentially expressed genes from microarray experiments. BMC Bioinformatics. 2007; 8:347.
Kong X, Mas V, Archer KJ. A nonparametric metaanalysis approach for combining independent microarray datasets: application using two microarray datasets pertaining to chronic allograft nephropathy. BMC Genomics. 2008; 9:98.
Cavalli F, Hubner JM, Sharma T, Luu B, Sill M, Zapotocky M, Mack SC, Witt H, Lin T, Shih D, et al. Heterogeneity within the PFEPNB ependymoma subgroup. Acta Neuropathol. 2018; 136(2):227–37.
Tishchenko I, Milioli HH, Riveros C, Moscato P. Extensive Transcriptomic and Genomic Analysis Provides New Insights about Luminal Breast Cancers. PLoS ONE. 2016; 11(6):e158259.
Zucknick M, Richardson S, Stronach EA. Comparing the characteristics of gene expression profiles derived by univariate and multivariate classification methods. Stat Appl Genet Mol Biol. 2008; 7(1):e7.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999; 286(5439):531–7.
Yao J, Chang C, Salmi ML, Hung YS, Loraine A, Roux SJ. Genomescale cluster analysis of replicated microarrays using shrinkage correlation coefficient. BMC Bioinformatics. 2008; 9:288.
Roche KE, Weinstein M, Dunwoodie LJ, Poehlman WL, Feltus FA. Sorting Five Human Tumor Types Reveals Specific Biomarkers and Background Classification Genes. Sci Rep. 2018; 8(1):8180.
Jaskowiak PA, Campello RJ, Costa IG. On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics. 2014; 15(Suppl 2):S2.
Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci U S A. 1998; 95(25):14863–8.
Bushati N, Smith J, Briscoe J, Watkins C. An intuitive graphical visualization technique for the interrogation of transcriptome data. Nucleic Acids Res. 2011; 39(17):7380–9.
Gehlenborg N, O’Donoghue SI, Baliga NS, Goesmann A, Hibbs MA, Kitano H, Kohlbacher O, Neuweger H, Schneider R, Tenenbaum D, et al. Visualization of omics data for systems biology. Nat Methods. 2010; 7(3 Suppl):S56–68.
Sanguinetti G. Dimensionality reduction of clustered data sets. IEEE Trans Pattern Anal Mach Intell. 2008; 30(3):535–40.
Huisman S, van Lew B, Mahfouz A, Pezzotti N, Hollt T, Michielsen L, Vilanova A, Reinders M, Lelieveldt B. BrainScope: interactive visual exploration of the spatial and temporal human brain transcriptome. Nucleic Acids Res. 2017; 45(10):e83.
Tzeng WP, Frey TK. Mapping the rubella virus subgenomic promoter. J Virol. 2002; 76(7):3189–201.
Jia X, Zhu G, Han Q, Lu Z. The biological knowledge discovery by PCCF measure and PCAF projection. PLoS ONE. 2017; 12(4):e175104.
Jia X, Liu Y, Han Q, Lu Z. Multiple-cumulative probabilities used to cluster and visualize transcriptomes. FEBS Open Bio. 2017; 7(12):2008–20.
Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008; 9:2579–605.
Xu W, Jiang X, Hu X, Li G. Visualization of genetic disease-phenotype similarities by multiple maps t-SNE with Laplacian regularization. BMC Med Genomics. 2014; 7(Suppl 2):S1.
Lu TP, Tsai MH, Lee JM, Hsu CP, Chen PC, Lin CW, Shih JY, Yang PC, Hsiao CK, Lai LC, et al. Identification of a novel biomarker, SEMA5A, for non-small cell lung carcinoma in non-smoking women. Cancer Epidemiol Biomarkers Prev. 2010; 19(10):2590–7.
Hasan AN, Ahmad MW, Madar IH, Grace BL, Hasan TN. An in silico analytical study of lung cancer and smokers datasets from gene expression omnibus (GEO) for prediction of differentially expressed genes. Bioinformation. 2015; 11(5):229–35.
Milligan GW, Cooper MC. A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis. Multivariate Behav Res. 1986; 21(4):441–58.
Estevez PA, Tesmer M, Perez CA, Zurada JM. Normalized mutual information feature selection. IEEE Trans Neural Netw. 2009; 20(2):189–201.
Bruse JL, Zuluaga MA, Khushnood A, McLeod K, Ntsinjana HN, Hsia TY, Sermesant M, Pennec X, Taylor AM, Schievano S. Detecting Clinically Meaningful Shape Clusters in Medical Image Data: Metrics Analysis for Hierarchical Clustering Applied to Healthy and Pathological Aortic Arches. IEEE Trans Biomed Eng. 2017; 64(10):2373–83.
Acknowledgements
This work rests almost entirely on open data, and we gratefully acknowledge the contributors of those data. We also sincerely thank Mrs Xianchun Sun (Haidian district, Beijing garrison district, the fourth leaving cadre rehabilitation center) and Miss Tian Wei (Nanjing No. 9 High School, PR China), who carefully reviewed our manuscript.
Funding
This work was supported by the Major Program of the National Natural Science Foundation of China (2016YFA0501600).
Availability of data and materials
The data sets were collected from the NCBI database. A more detailed description of the data sets is given in the “Materials and methods” section of the article.
Author information
Contributions
XJ analyzed and discussed the model and wrote the manuscript. QH contributed a portion of the model. ZL supervised the study. All authors commented on and improved the manuscript, and read and approved the final version.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
MATLAB algorithm. A freely available MATLAB implementation that performs MGPCC, tSNESS and tSNESG and draws the nearest sample (or gene) neighbors for a data set. (ZIP 6873 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Jia, X., Han, Q. & Lu, Z. Analyzing the similarity of samples and genes by MGPCC algorithm, tSNESS and tSNESG maps. BMC Bioinformatics 19, 512 (2018). https://doi.org/10.1186/s12859-018-2495-5
DOI: https://doi.org/10.1186/s12859-018-2495-5
Keywords
 PCC
 MGPCC
 tSNESSP
 tSNESGI
 Avalue