 Research
 Open Access
 Published:
Combining gene ontology with deep neural networks to enhance the clustering of single cell RNASeq data
BMC Bioinformatics volume 20, Article number: 284 (2019)
Abstract
Background
Single cell RNA sequencing (scRNAseq) is applied to assay the individual transcriptomes of large numbers of cells. The gene expression at singlecell level provides an opportunity for better understanding of cell function and new discoveries in biomedical areas. To ensure that the singlecell based gene expression data are interpreted appropriately, it is crucial to develop new computational methods.
Results
In this article, we try to reconstruct a neural network based on Gene Ontology (GO) for dimension reduction of scRNAseq data. By integrating GO with both unsupervised and supervised models, two novel methods are proposed, named GOAE (Gene Ontology AutoEncoder) and GONN (Gene Ontology Neural Network) respectively.
Conclusions
The evaluation results show that the proposed models outperform some stateoftheart dimensionality reduction approaches. Furthermore, incorporating with GO, we provide an opportunity to interpret the underlying biological mechanism behind the neural networkbased model.
Background
In the past decade, transcriptome studies have benefited from nextgeneration sequencing (NGS) based on RNA expression profiling (RNAseq) [1–3]. However, the resulting expression value based on bulk RNAseq is an average of its expression levels across a large population of input cells [4]. Such bulk expression profiles are insufficient to provide insight into the stochastic nature of gene expression [5]. Therefore, bulk measures of gene expression may not help researchers to understand the distinct function and role of different cells [4]. To address the problem, single cell RNAseq (scRNAseq) is applied to assay the individual transcriptomes of large numbers of cells [6, 7]. The gene expression at singlecell level provides an opportunity for better understanding of cell function and new discoveries in biomedical areas [8, 9].
ScRNAseq data analysis poses several new computational challenges. To ensure that the singlecell based gene expression data are interpreted appropriately, it is crucial to develop new computational methods. One of the most important applications of scRNAseq is to group cells and identify new cell types. The major computational challenge in this application is to cluster cells based on the gene expression at singlecell level. Clustering based on scRNAseq data may help us to understand underlying cellular mechanisms, which can promote the discovery of new markers on specific types of cells [10], and identification of tumor subtypes [11], etc.
In the clustering problem, cells are partitioned into different cell types based on their global transcriptome profiles. Each cell type has a significantly distinctive expression signature from the others. Since the expression values are always with high dimensionality and noise from the sequencing result, dimensionality reduction is usually performed before clustering. Till now, several methods have been proposed to eliminate the influence of noise and reduce the dimension. The existing methods could be loosely grouped into two categories, unsupervised method and supervised method.
In the unsupervised category, the main idea is to perform dimensionality reduction before clustering. The simplest method is based on the principal component analysis(PCA) [12]. As one of the most popular methods for dimensionality reduction, PCA has been studied extensively for clustering single cells [13–16]. Assuming that the data is normally distributed, PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, which are called principal components. However, for scRNAseq datasets, they are not exactly linearly separable. Tdistributed stochastic neighbor embedding (tSNE) [17] is a nonlinear dimensionality reduction technique, which is also used for clustering single cells recently [15, 16]. Based on the Gaussian kernel, tSNE converts high dimension data to low dimension space. But, it usually maps multidimensional data to two or more dimensions suitable for human observation. Hence it always accompanies with dimension restriction. Besides, similar to PCA, tSNE also does not consider the drop out events of scRNAseq data. To consider the specificity of scRNAseq data, ZIFA[18] uses zeroinflated factors to deal with the drop out events in scRNAseq data. Assuming that drop out events may lead to zero counts, ZIFA models these counts exactly zero rather than close to zero in the dataset. The evaluation test shows that ZIFA performs better than PCA and tSNE on some datasets. But the hypothesis of ZIFA is that zero is inflated as Gauss distribution, and the transformation between the descending dimension and original data is linear. Given the expression profiles of single cells, SNNCliq computes the similarity between cells by using the concept of shared nearest neighbor (SNN), and implements clustering algorithm based on graph theory [19]. By combining multiple clustering methods, SC3 performs a consensus clustering which includes spectral transformation, kmeans algorithm, and complete link approach to achieve high accuracy and robustness [20]. However, SC3 and SNNCliq cannot build a relationship between data representation and quantity and property of cell types. Integrating PCA and hierarchical clustering, pcaReduce tries to improve the original PCA method by finding a connection between the PCAbased representations and the number of resolvable cell types. Meanwhile, denoising autoencoder (DAE) [15] is used to reconstruct the data from high dimensions to low dimension space.
Motivated by the success of neural networks in other areas, Lin et al. develop a supervised method to generate the low dimensional representation of scRNAseq data based on neural networks (NN) [15]. NN model combines the neural network with the proteinprotein interaction (PPI) network to classify a number of cells. Given cells with know cell types, this model can be trained as a supervised model. After that, the hidden layer of the trained neural networks is used for generating the low dimensional representation of scRNAseq data. The experimental test shows that this supervised method performs better than most of the existing unsupervised models.
Although many attempts have been made to cluster single cells based on the global transcriptome profiles, most of them only consider the transcriptome profiles neglecting the prior biological knowledge. This large limits the performance of stateofart systems. Inspired by the success of neural network in modeling the hierarchical structure and function of a cell [21], we ask whether combining the rich prior biological knowledge in gene ontology (GO) with neural networks could enhance the clustering of cells based on their global transcriptome profiles. Gene Ontology (GO) [22], which has been widely used in many areas [23–28], provides a popular vocabulary system for systematically describing the attributes of genes and other biological entities. As one of the most popular bioinformatics sources, it contains reliable and easyinterpreted prior biological knowledge. In this article, we try to construct the structure of neural networks based on the prior knowledge of GO. By integrating GO with both supervised and unsupervised models, two novel methods are proposed, named GOAE (Gene Ontology AutoEncoder) and GONN (Gene Ontology Neural Network) respectively, for clustering of scRNAseq data. The major contributions of this article are as follows:

To better dimensionality reduction of scRNAseq data, we propose a novel neural work structure considering the prior knowledge in GO.

We propose two novel models, named GOAE and GONN, to enhance cluster cells based on their transcriptome profiles.

The evaluation results show that the proposed models outperform some stateoftheart approaches.

Incorporating with GO, we provide an opportunity to interpret the underlying biological mechanism behind the neural networkbased model.
Methods
We propose a novel model to obtain the low dimensional representation of scRNAseq data by combining the Gene Ontology and neural network model. We use the terms in GO to replace the neuron in the neural network and convert the fullyconnected neural network as partialconnected. Based on this idea, we propose two novel methods: an unsupervised method based on an autoencoder model and a supervised method based on a traditional neural network model. The basic idea of our models is to perform the dimensionality reduction by training a neural network (or autoencoder) model and extract the latent layer as low dimensional representation. This section consists of the following components. First, we will introduce how to select significant GO terms from the whole GO structure. Second, we combine GO with an autoencoder to build an unsupervised model for dimensionality reduction, named GOAE. Third, we combine GO with a neural network to build a supervised model for dimensionality reduction, named GONN. Finally, the low dimensional representation is used for clustering of cells based on a clustering method.
Selection of significant GO terms
Gene Ontology (GO) is a popular vocabulary system for systematically describing the attributes of gene and gene product. Each GO term could annotate a set of genes. GO consists of three different categories, which are biology process, molecular function and cellular component. GO is structured as a directed acyclic graph. Each term has defined relations with other terms in the same or various categories. In this step, we choose GO terms that are used in the following model. We only use terms in the biological process and molecular function category since these terms might be more functional related. In GO, a parent term annotates all the genes annotated by its descendants. The main steps of selecting GO terms used in the following steps are as follows.
First, we select all the GO terms in the third layer. Evaluation test shows that GO terms at the third layer can achieve the best performance. The number of GO terms at different levels is shown in Table 1. These 1543 GO terms at the third level are the candidate terms that connect with the input layer in the neural network.
Second, we remove redundancy terms from the candidate terms obtained from the last step. The annotated genes of different terms may have overlap. Therefore, we remove the redundancy terms to decrease the information redundancy and the parameters in the following neural networkbased model.
Specifically, let GO_{i}:{gene_{1},gene_{2},⋯gene_{n}} be a GO term, named GO_{i}, annotating a set of annotation genes gene_{1},gene_{2},…gene_{n}. The unique score U_{ij} of two GO terms is defined as follows:
If the unique score U_{ij} of two GO terms is larger than 0.5, the two GO terms are considered as not unique. Then, we will delete the GO term that has fewer annotation genes.
Third, we remove the terms annotating genes that have similar expression profiles in different cells. Different genes may have different expression level in different cells. We tend to select the genes that have different expression levels for clustering. Therefore, we select the terms annotating genes having diverse expression levels in different cells. The diversity of a GO terms could be measured by gene expression values. Zscorebased method is used for normalization on gene dimension. Following this normalize operation, the expression values of each gene is normalized as a standard normal distribution. We define std_{j} as standard deviation of gene_{j}. The diversity score H_{i} of a GO term GO_{i} is calculated as follows:
where n is the number of genes annotated by GO_{i}. If the diversity score of GO_{i} is less than the given threshold (in this case 0.1), GO_{i} is considered as low diversity term. We then delete the low diversity GO terms.
After these three steps, we obtain a set of GO terms with low redundancy and high diversity.
Architecture of unsupervised model (GOAE)
In the task of scRNAseq data clustering, an unsupervised dimensionality reduction model is a key component. To perform the dimensionality reduction, we combine the Gene Ontology with autoencoder that has been widely used in other areas, like image processing, natural language processing.
To combine the GO with neural network, we add GO terms to the neural network as partialconnected neurons. The structure of this model is formulated from extensive prior knowledge of gene ontology. The architecture of GOAE is shown in Fig. 1.
The input layer is genes involved in the scRNAseq datasets. In hidden layer 1, BP neurons and MF neurons are added based on the biological process and molecular function terms obtained from GO. As shown in Fig. 1, BP and MF neurons are partially connected. Only genes annotated by the corresponding GO term are fed to the GO term neuron.
GOAE consists of two components: encoder and decoder. From the input layer to hidden layer 2 are the encoder. The decoder part is exactly a mirror of the encoder part, which from hidden layer 2 to the output layer. Let x_{i} be the output of the ith hidden layer. The forward propagation of the neural network is:
where W_{i} represents the weight matrix of the edge from i−1 th layer to ith layer in the neural network, b_{i} is the bias of each ith hidden layer node, f(·) is the activation function. We choose tanh function in our case, which is:
In this GOAE model, we use the mean square error as a loss function. Let x_{0j} be the input vector of sample j, and x_{4j} is the output vector. n represents the number of training sample. The loss function is defined as follows:
After several training epochs, the hidden layer 2 could be a lowdimension space of the input data.
Since the encoder and decoder are completely symmetric, both input layer and output layer are partial connection.
After training GOAE model, the hidden layer 2 could be used as the lowdimension representation of a cell. Then we can use a clustering method, (in our case, kmeans++), for the clustering of single cells.
Architecture of supervised model (GONN)
A supervised dimensionality reduction model may also be needed in single cell clustering or retrieval [15]. Similar to the GOAE model, we replace the hidden layer1 neurons of the neural network with GO term nodes, which are partialconnected to the input layer neurons that represents the genes. In the GONN model, another hidden layer with 100 fullyconnected neurons are added (see Fig. 2). After the training phase, the hidden layer with 100 fullyconnected neurons is considered as the low dimensional representation of the input.
At the output layer, softmax function is used for classification. Softmax function is defined as:
where x is the input vector of output layer and c is the number of all cell types. Based on softmax activation function, we can obtain the probability vector that a cell is classified into different cell types. Finally, we use top1 method (the label which has the largest probability) to decide the cell type of a cell. In GONN, the loss is defined as:
where n is the number of samples in the training dataset. The first part of Eq. 7 is cross entropy. y_{j} and \(y_{j}^{\prime }\) represent the desired output and the predicted output of sample j respectively. The second part is L2 regularization, where λ is the L2 regularization coefficient. w represents the training parameter vector. We combine cross entropy and L2 regularization to avoid overfitting and optimize parameters.
After training GONN models by known label cells, we extract the information of the last hidden layer(hidden layer2) as the lowdimension representation. Then we can use a clustering method, (in our case, kmeans ++), for the clustering of single cells.
Evaluation criteria
We use the adjusted rand index(ARI) [29] to compare the clustering results of single cells with the true labels. ARI score can measure the similarity between two clustering results. It is defined as follows. Let X={X_{1},…,X_{r}} and Y={Y_{1},…,Y_{s}} be two different clustering results. n_{ij} represents the number of objects in common between X_{i} and Y_{j}. Let \(a_{i}={\sum }_{j}n_{ij}\) and \(b_{j}={\sum }_{i}n_{ij}\), the ARI is defined as follow:
The scale of ARI score is between 1 and 1. The higher the ARI score is, the more similar two clustering results are.
Furthermore, normalized mutual information(NMI) [30] is also used for evaluation. NMI uses the concept of information entropy to compare different clustering results. NMI score is calculated as follows:
H(X) is the entropy of X, which is calculated as follows:
I(X,Y) is the mutual information between X and Y, which is calculated as follows:
where \(N={\sum }_{i}{\sum }_{j}n_{ij}\). NMI scores are between 0 and 1. The higher the NMI score is, the more similar two clustering results are. In the following evaluations, we run each experiment 10 times and calculate their average scores as final results.
Data preparation
We evaluate our models on three scRNAseq datasets. The first dataset is a human scRNAseq data from [31]. In our experiment, 300 cells involving 11 cell types are used. The involved cell types are listed as follows: CRL2338(epithelial), CRL2339(lymphoblastoid), BJ(fibroblast from human foreskin), GW(gestational 16, 21, 21+3 weeks from fetal cortex), HL60(myeloid from acute leukemia), iPS(pluripotent), K562(myeloid from chronic leukemia), Kera(foreskin keratinocyte) and NPC(neural progenitor cells). We remove the genes that have missing values in these cell types. Eigth thousand six hundred eighty six genes are involved in the evaluation dataset. The second dataset is obtained from [15]. It integrates three mus musculus scRNAseq datasets [14, 32, 33], which contains 402 cells involving 16 cell types. Similarly, after removing the genes with missing values, 9437 genes are included in the evaluation dataset. The third dataset is also a mus musculus dataset from [15], which has more than 17,000 singlecell RNAseq data from different 31 datasets. We use this dataset to evaluate cell type assignment. The gene ontology data is downloaded from http://www.geneontology.org/.
Results and discussion
We test our models on two different scRNAseq datasets. We compare our methods with two supervised methods (i.e. NN(ppi/tf) [15] and NN(dense)) and six unsupervised methods(i.e. PCA [12], tSNE [17], ICA [34], pcaReduce [35], ZIFA [18], DAE [36]). We set batch size as 64, epoch number as 100, learning rates as 1e3 for GOAE model. We set the batch size as 64, epoch number as 200, learning rates as 0.2 for GONN model. For NN(dense) model, it has the same architecture as the twolayer GONN model but without partial connection between the input layer and hidden layer1. The NN(dense) model is used to test whether combining GO information can improve the supervised model. The DAE model is used to test whether the addition of GO information can improve the unsupervised neural network model. We also compare our model with other unsupervised methods. In all tests, we use kmeans++ for clustering based on different lowdimensional representations from different dimensionality reduction methods. The models are implemented using Python 3.6 and tensorflow 1.4.1 package.
Performance evaluation on human scRNAseq dataset
We test GOAE model (Fig. 1) and GONN model (Fig. 2) for clustering of human cells. 1174 GO terms satisfy the criteria described in 2.1 subsection. These terms are used in the GOAE and GONN model.
In the unsupervised test, all the unsupervised models are applied to the whole data set. All 11 types of cells are involved. Overall, GOAE performs the best among all tested methods. Similar with the experiment design in [15], several possible parameters (number of components) are tested for PCA and ICA method. We reduce the dimension of all data and using kmeans++ method to cluster all 11 cell types data. Figure 3 shows that GOAE perfects the best among all tested methods. The ARI and NMI score of GOAE are 0.917 and 0.933 respectively, while the scores of the runnerup method ZIFA are 0.873 and 0.914 respectively. The experiment result indicates that combining Gene Ontology and autoencoder can improve the performance of clustering of single cells.
For the supervised model, we compare GONN with the stateofart method NN(ppi/tf) [15] and the original neural network model (NN). We apply the same experimental protocol used in [15]. The cell types not used in the training phase are used as the test set. There are 11 cell types involved in this data set. We randomly select 2, 4 and 6 cell types as the test set in the evaluation test.
Overall, GONN method performs better than other methods (Tables 2 and 3). With the increase of the number of cell types in the test set, the clustering task becomes more challenging. The result shows that GONN performs the best when the number of cell types equals to 2, 4 and 6. Furthermore, when the number of cell types is 6, the ARI score of GONN is 0.8189, which is significantly higher than the runnerup method (Table 2). Unsurprisingly, GONN method also achieves the highest NMI score. The NMI score of GONN is 0.8803 even when the number of cell types is 6, while the value of the second best method is 0.8434.
Figure 4 is the 2D visualization of low dimensional representation based on GONN and GOAE. We use tSNE as the visualization tool. It is shown that the single cells are partitioned into different clusters based on GONN and GOAE, indicating that GONN and GOAE can learn a low dimensional representation for single cell data.
Performance evaluation on mus musculus dataset
Similar with evaluation test on the human dataset, we also test these models on mus musculus dataset that contains 16 cell types. For unsupervised models, we randomly select 2, 4, 6, 8, 10 and 12 cell types as test sets. For supervised models, since sufficient training set is necessary, we only randomly select 2, 4, 6 and 8 cell types as test sets. The rest of data are used as the training set. For GOAE and GONN model, 854 GO terms satisfy the criteria described in 2.1 subsection.
As shown in Tables 4 and 5, for the unsupervised model, GOAE achieves the highest performance on datasets with different numbers of cell types. The average of ARI scores of GOAE on all datasets is 0.7671, which is around 0.03 higher than the runnerup method DAE. More details are shown in Table 4. The trend of NMI scores is similar to ARI scores. GOAE can achieve the highest NMI scores on datasets with different numbers of cell types. The complexity of the problem increases with the increase in the number of cell types. When the number of cell types is 8, the NMI score of GOAE is 0.8545 that is 0.04 higher than the runnerup method DAE. The evaluation test on mus musculus dataset indicates that combining gene ontology with neural network can improve the performance of single cell RNAseq data clustering.
For the supervised model, GONN performs better than other compared methods. The ARI score decreases with the increase in the number of cell types involved in the test set. GONN can achieve a high ARI score (0.7599) even the number of cell types is 8, while the value of runnerup method is 0.6832. Similarly, GONN also achieves the highest NMI score in all tested methods. The average NMI score of different datasets is 0.9103, which is significantly higher than NN(dense) and NN(ppi/tf) method. The corresponding values of NN(dense) and NN(ppi/tf) are 0.8623 and 0.8496 respectively.
Effect of GO terms
One of the major contributions of our work is to add GO terms as neurons in the neural networks. To test whether the GO terms are selected appropriately, we rerun GONN and GOAE by varying the GO terms involved in the model. We use the mus musculus dataset on this test.
To determine the threshold selection for the U_{ij} and H_{i} scores, we varied one parameter and fix other parameters to conduct experiments on GONN (see Tables 6 and 7). The evaluation test shows that GONN can achieve the highest performance when the unique score and high expression score are set as 0.5 and 0.1 respectively.
As described in subsection 2.1, we remove the redundancy GO terms and GO terms with low expression scores. In this test, we create GONN_{v} and GOAE_{v} where the redundancy and lowdiversity GO terms are not removed. In GONN_{v} and GOAE_{v}, 1486 GO terms are involved, while only 854 GO terms involved in GONN and GOAE. Figure 5a and b show that GONN is clearly better than GONN_{v}, indicating that selecting appropriate GO terms contributes to the performance and this step has been appropriately designed. Similarly, Fig. 5c and (d) show that GOAE is clearly better than GOAE_{v}. Particularly, on the datasets with 8 and 10 cell types, the average ARI of GOAE are about 23% higher than GOAE_{v}.
Functional analysis on hidden layer nodes
For GOAE model, we train the model using samples of a certain cell type. Then, we could also obtain the top 10 highest GOterm nodes of the hidden layer. We select 8cell, 16cell, ES, earlyblast, and lateblast in this test, since training the GOAE model requires a sufficient amount of samples. For GONN model, we multiply the weight matrices W2 and W3 to represent the degree of importance between each cell type and the GO terms in the hidden layer 1. For each cell type, we selected the top10 important GO terms for analysis. Table 8 shows some of the highly weighted GOterm nodes in the GOAE and GONN models. For example, regulation of transporter activity (GO:0032409) is mainly associated with ES(embryonic stem cell) [37], and embryonic placenta development (GO:0001892) is always relative with zygote cell [38].
Cell type assignment
Another important application in single cell analysis is cell type assignment. To verify the effectiveness of our model in cell assignment and retrieval. We use a mus musculus dataset from Lin et al. paper [15], which has more than 17,000 single cells from different 31 datasets. We designed experiment according to [15].
To measure the results of cell type assignment, we calculate the percentage of the correctly predicted cell types by using top K nearest neighbors(K=100). Nine cell types are involved in the experiment, including 2 cell, 4 cell, 8 cell, zygote, embryonic stem cell(ESC), neurons, thymus, spleen and hematopoietic stem cell(HSC). Mean of average precision(MAP) [15, 39] is used to measure the assignment performance.
We compare our GONN model with NN(ppi/tf) model, the results are shown in Fig. 6. Our model GONN performs better in 2 cell, 8 cell, zygote cell types. Besides, GONN has higher average of MAP than NN (ppi/tf).
Conclusions
In this paper, we combine neural networks with Gene Ontology for reducing the dimensions of scRNAseq data, which can improve the clustering of scRNAseq data. We propose two models GOAE and GONN that are unsupervised and supervised model respectively.
The proposed model mainly contains two key components: the selection of significant GO terms and combination GO terms with the neural networkbased model. When selecting important GO terms, it is crucial to choose the appropriate thresholds. If the threshold is not properly selected, deleting too much or too few GO terms will affect the final result.
Performance evaluation on two datasets shows that GONN and GOAE perform better than existing stateofart dimensionality reduction methods for scRNAseq data.
Abbreviations
 ARI:

Adjusted rand index
 DAE:

Denoising autoencoder
 GO:

Gene ontology
 GOAE:

Gene ontology autoencoder
 GONN:

Gene ontology neural network
 MAP:

Mean of average precision
 NMI:

Normalized mutual information
 PCA:

Principle component analysis
 PPI:

Protein protein interaction
 SC3:

Singlecell consensus clustering
 scRNAseq:

Single cell RNA sequence
 SNNCliq:

Shared nearest neighbor Cliq
 TSNE:

Tdistributed stochastic neighbor embedding
 ZIFA:

Zero inflated factors analysis
References
Wang Z, Gerstein M, Snyder M. Rnaseq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10(1):57.
Cheng L, Hu Y, Sun J, Zhou M, Jiang Q. Dincrna: a comprehensive webbased bioinformatics toolkit for exploring disease associations and ncrna function. Bioinformatics. 2018; 34(11):1953–56.
Cheng L, Wang P, Tian R, Wang S, Guo Q, Luo M, Zhou W, Liu G, Jiang H, Jiang Q. Lncrna2target v2.0: a comprehensive database for target genes of lncrnas in human and mouse. Nucleic Acids Res. 2019; 47(D1):D140D144.
Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in singlecell transcriptomics. Nat Rev Genet. 2015; 16(3):133.
Raj A, van Oudenaarden A. Nature, nurture, or chance: stochastic gene expression and its consequences. Cell. 2008; 135(2):216–26.
Kolodziejczyk A, Kim JK, Svensson V, Marioni J, Teichmann S. The technology and biology of singlecell rna sequencing. Mol Cell. 2015; 58(4):610–20.
Hu Y, Tianyi Z, Tianyi Z, Ying Z, Liang C. Identification of alzheimer’s diseaserelated genes based on data integration method. Front Genet. 2018; 9:703.
Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A. mrnaseq wholetranscriptome analysis of a single cell. Nat Methods. 2009; 6(5):377–82.
Wu AR, Neff NF, Kalisky T, Dalerba P, Treutlein B, Rothenberg ME, Mburu FM, Mantalas GL, Sim S, Clarke MF. Quantitative assessment of singlecell rnasequencing methods. Nat Methods. 2014; 11(1):41–46.
Jaitin DA, Kenigsberg E, KerenShaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, Jung S, Tanay A. Massively parallel singlecell rnaseq for markerfree decomposition of tissues into cell types. Science. 2011; 343(6172):776–9.
Chung W, Eum HH, Lee HO, Lee KM, Lee HB, Kim KT, Ryu HS, Kim S, Lee JE, Park YH. Singlecell rnaseq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun. 2017; 8:15081.
Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometr Intell Lab Syst. 1987; 2(1):37–52.
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of celltocell heterogeneity in singlecell rnasequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015; 33(2):155–60.
Shalek AK, Satija R, Adiconis X, Gertner RS, Gaublomme JT, Raychowdhury R, Schwartz S, Yosef N, Malboeuf C, Lu D. Singlecell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature. 2013; 498(7453):236.
Lin C, Jain S, Kim H, Barjoseph Z. Using neural networks for reducing the dimensions of singlecell rnaseq data. Nucleic Acids Res. 2017; 45(17):156.
Li X, Chen W, Chen Y, Zhang X, Gu J, Zhang MQ. Network embeddingbased representation learning for single cell rnaseq data. Nucleic Acids Res. 2017; 45(19):166.
Maaten L, Hinton G. Visualizing data using tsne. J Mach Learn Res. 2008; 9(2605):2579–605.
Yau C, Pierson E. Dimensionality reduction for zeroinflated single cell gene expression analysis. Genome Biol. 2015; 16(1):241.
Xu C, Su Z. Identification of cell types from singlecell transcriptomes using a novel clustering method. Bioinformatics. 2015; 31(12):1974–80.
Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al.Sc3: consensus clustering of singlecell rnaseq data. Nat Methods. 2017; 14(5):483.
Ma J, Yu MK, Fong S, Ono K, Sage E, Demchak B, Sharan R, Ideker T. Using deep learning to model the hierarchical structure and function of a cell. Nat Methods. 2018; 15(4):290.
Carbon S, Ireland. A, Mungall CJ, Shu SQ, Marshall B, Lewis S, Hub TA. Amigo: online access to ontology and annotation data. Bioinformatics. 2009; 25(2):288–9.
Peng J, Hui W, Shang X. Measuring phenotypephenotype similarity through the interactome. BMC Bioinformatics. 2018; 19(5):114.
Peng J, Xue H, Shao Y, Shang X, Wang Y, Chen J. A novel method to measure the semantic similarity of hpo terms. Int J Data Min Bioinforma. 2017; 17(2):173–88.
Melott JM, Weinstein JN, Broom BM. Pathwaysweb: a gene pathways api with directional interactions, expanded gene ontology, and versioning. Bioinformatics. 2016; 32(2):312–4.
Peng J, Zhang X, Hui W, Lu J, Li Q, Liu S, Shang X. Improving the measurement of semantic similarity by combining gene ontology and cofunctional network: a random walk based approach. BMC Syst Biol. 2018; 12(2):18.
Pesaranghader A, Matwin S, Sokolova M, Beiko RG. simdef: definitionbased semantic similarity measure of gene ontology terms for functional similarity analysis of genes. Bioinformatics. 2016; 32(9):1380–7.
Peng J, Wang T, Wang J, Wang Y, Chen J. Extending gene ontology with gene association networks. Bioinformatics. 2015; 32(8):1185–94.
Hubert L, Arabie P. Comparing partitions. J Classif. 1985; 2(1):193–218.
Vinh NX, Epps J, Bailey J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Cambridge: JMLR.org; 2010, pp. 1073–80.
Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH, Li N, Szpankowski L, Fowler B, Chen P. Lowcoverage singlecell mrna sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol. 2014; 32(10):1053–8.
Sasagawa Y, Nikaido I, Hayashi T, Danno H, Uno KD, Imai T, Ueda HR. Quartzseq: a highly reproducible and sensitive singlecell rna sequencing method, reveals nongenetic geneexpression heterogeneity. Genome Biol. 2013; 14(4):3097.
Deng Q, Ramsköld D, Reinius B, Sandberg R. Singlecell rnaseq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014; 343(6167):193–6.
Comon P. Independent Component Analysis, a New Concept?Oxford: Elsevier NorthHolland, Inc.; 1994, pp. 287–314.
žurauskienė J, Yau C. pcareduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics. 2016; 17(1):140.
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res. 2010; 11(12):3371–408.
Sene KH, Porter CJ, Palidwor G, Pereziratxeta C, Muro EM, Campbell PA, Rudnicki MA, Andradenavarro MA. Gene function in early mouse embryonic stem cell differentiation. BMC Genomics. 2007; 8(1):85.
Pawel K, Vijay C, Carsten P. Simulating the mammalian blastocyst  molecular and mechanical interactions pattern the embryo. PloS Comput Biol. 2011; 7(5):1001128.
Zhang E, Yi Z. Average Precision. Boston: Springer; 2009. pp. 192–93.
Cruz DSGD, Lima APND, Neto JP, Massoco C. Effects of unilateral cervical vagotomy on murine dendritic cells. Am J Immunol. 2015; 11(2):48–55.
Ko MSH, Zalzman M, Sharova LV. Methods for enhancing genome stability and telomere elongation in embryonic stem cells. US; 2015. U.S. Patent Application 14/259,600, filed August 21, 2014.
Acknowledgements
We thank all the anonymous reviewers.
Funding
The publication costs for this article were funded by the corresponding author’s institution. This work was supported by National Natural Science Foundation of China (No. 61702421, 61332014, 61772426), China Postdoctoral Science Foundation (No. 2017M610651), China Postdoctoral Science Foundation (No. 2017BSHTDZZ11), Fundamental Research Funds for the Central Universities (No. 3102018zy033).
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 8, 2019: Decipher computational analytics in digital health and precision medicine. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement8.
Author information
Authors and Affiliations
Contributions
JP and XS designed the algorithm; XW implemented the algorithm; JP and XW wrote this manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Peng, J., Wang, X. & Shang, X. Combining gene ontology with deep neural networks to enhance the clustering of single cell RNASeq data. BMC Bioinformatics 20 (Suppl 8), 284 (2019). https://doi.org/10.1186/s1285901927696
Published:
DOI: https://doi.org/10.1186/s1285901927696
Keywords
 Single cell RNAseq data
 Gene ontology
 Autoencoder
 Neural network