Combining gene ontology with deep neural networks to enhance the clustering of single cell RNA-Seq data

Background Single cell RNA sequencing (scRNA-seq) is applied to assay the individual transcriptomes of large numbers of cells. The gene expression at single-cell level provides an opportunity for better understanding of cell function and new discoveries in biomedical areas. To ensure that the single-cell based gene expression data are interpreted appropriately, it is crucial to develop new computational methods. Results In this article, we try to re-construct a neural network based on Gene Ontology (GO) for dimension reduction of scRNA-seq data. By integrating GO with both unsupervised and supervised models, two novel methods are proposed, named GOAE (Gene Ontology AutoEncoder) and GONN (Gene Ontology Neural Network) respectively. Conclusions The evaluation results show that the proposed models outperform some state-of-the-art dimensionality reduction approaches. Furthermore, incorporating with GO, we provide an opportunity to interpret the underlying biological mechanism behind the neural network-based model.


Background
In the past decade, transcriptome studies have benefited from next-generation sequencing (NGS) based on RNA expression profiling (RNA-seq) [1][2][3]. However, the resulting expression value based on bulk RNA-seq is an average of its expression levels across a large population of input cells [4]. Such bulk expression profiles are insufficient to provide insight into the stochastic nature of gene expression [5]. Therefore, bulk measures of gene expression may not help researchers to understand the distinct function and role of different cells [4]. To address the problem, single cell RNA-seq (scRNA-seq) is applied to assay the individual transcriptomes of large numbers *Correspondence: shang@nwpu.edu.cn 1 School of Computer Science, Northwestern Polytechnical University, 710072 Xi'an, China 2 Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, 710072 Xi'an, China Full list of author information is available at the end of the article of cells [6,7]. The gene expression at single-cell level provides an opportunity for better understanding of cell function and new discoveries in biomedical areas [8,9].
ScRNA-seq data analysis poses several new computational challenges. To ensure that the single-cell based gene expression data are interpreted appropriately, it is crucial to develop new computational methods. One of the most important applications of scRNA-seq is to group cells and identify new cell types. The major computational challenge in this application is to cluster cells based on the gene expression at single-cell level. Clustering based on scRNA-seq data may help us to understand underlying cellular mechanisms, which can promote the discovery of new markers on specific types of cells [10], and identification of tumor subtypes [11], etc.
In the clustering problem, cells are partitioned into different cell types based on their global transcriptome profiles. Each cell type has a significantly distinctive expression signature from the others. Since the expression values are always with high dimensionality and noise from the sequencing result, dimensionality reduction is usually performed before clustering. Till now, several methods have been proposed to eliminate the influence of noise and reduce the dimension. The existing methods could be loosely grouped into two categories, unsupervised method and supervised method.
In the unsupervised category, the main idea is to perform dimensionality reduction before clustering. The simplest method is based on the principal component analysis(PCA) [12]. As one of the most popular methods for dimensionality reduction, PCA has been studied extensively for clustering single cells [13][14][15][16]. Assuming that the data is normally distributed, PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, which are called principal components. However, for scRNA-seq datasets, they are not exactly linearly separable. T-distributed stochastic neighbor embedding (t-SNE) [17] is a nonlinear dimensionality reduction technique, which is also used for clustering single cells recently [15,16]. Based on the Gaussian kernel, t-SNE converts high dimension data to low dimension space. But, it usually maps multidimensional data to two or more dimensions suitable for human observation. Hence it always accompanies with dimension restriction. Besides, similar to PCA, t-SNE also does not consider the drop out events of scRNA-seq data. To consider the specificity of scRNA-seq data, ZIFA [18] uses zero-inflated factors to deal with the drop out events in scRNA-seq data. Assuming that drop out events may lead to zero counts, ZIFA models these counts exactly zero rather than close to zero in the dataset. The evaluation test shows that ZIFA performs better than PCA and t-SNE on some datasets. But the hypothesis of ZIFA is that zero is inflated as Gauss distribution, and the transformation between the descending dimension and original data is linear. Given the expression profiles of single cells, SNN-Cliq computes the similarity between cells by using the concept of shared nearest neighbor (SNN), and implements clustering algorithm based on graph theory [19]. By combining multiple clustering methods, SC3 performs a consensus clustering which includes spectral transformation, k-means algorithm, and complete link approach to achieve high accuracy and robustness [20]. However, SC3 and SNN-Cliq cannot build a relationship between data representation and quantity and property of cell types. Integrating PCA and hierarchical clustering, pcaReduce tries to improve the original PCA method by finding a connection between the PCA-based representations and the number of resolvable cell types. Meanwhile, denoising autoencoder (DAE) [15] is used to reconstruct the data from high dimensions to low dimension space.
Motivated by the success of neural networks in other areas, Lin et al. develop a supervised method to generate the low dimensional representation of scRNA-seq data based on neural networks (NN) [15]. NN model combines the neural network with the protein-protein interaction (PPI) network to classify a number of cells. Given cells with know cell types, this model can be trained as a supervised model. After that, the hidden layer of the trained neural networks is used for generating the low dimensional representation of scRNA-seq data. The experimental test shows that this supervised method performs better than most of the existing unsupervised models.
Although many attempts have been made to cluster single cells based on the global transcriptome profiles, most of them only consider the transcriptome profiles neglecting the prior biological knowledge. This large limits the performance of state-of-art systems. Inspired by the success of neural network in modeling the hierarchical structure and function of a cell [21], we ask whether combining the rich prior biological knowledge in gene ontology (GO) with neural networks could enhance the clustering of cells based on their global transcriptome profiles. Gene Ontology (GO) [22], which has been widely used in many areas [23][24][25][26][27][28], provides a popular vocabulary system for systematically describing the attributes of genes and other biological entities. As one of the most popular bioinformatics sources, it contains reliable and easy-interpreted prior biological knowledge. In this article, we try to construct the structure of neural networks based on the prior knowledge of GO. By integrating GO with both supervised and unsupervised models, two novel methods are proposed, named GOAE (Gene Ontology AutoEncoder) and GONN (Gene Ontology Neural Network) respectively, for clustering of scRNAseq data. The major contributions of this article are as follows: • To better dimensionality reduction of scRNA-seq data, we propose a novel neural work structure considering the prior knowledge in GO. • We propose two novel models, named GOAE and GONN, to enhance cluster cells based on their transcriptome profiles. • The evaluation results show that the proposed models outperform some state-of-the-art approaches. • Incorporating with GO, we provide an opportunity to interpret the underlying biological mechanism behind the neural network-based model.

Methods
We propose a novel model to obtain the low dimensional representation of scRNA-seq data by combining the Gene Ontology and neural network model. We use the terms in GO to replace the neuron in the neural network and convert the fully-connected neural network as partial-connected. Based on this idea, we propose two novel methods: an unsupervised method based on an autoencoder model and a supervised method based on a traditional neural network model. The basic idea of our models is to perform the dimensionality reduction by training a neural network (or autoencoder) model and extract the latent layer as low dimensional representation. This section consists of the following components. First, we will introduce how to select significant GO terms from the whole GO structure. Second, we combine GO with an autoencoder to build an unsupervised model for dimensionality reduction, named GOAE. Third, we combine GO with a neural network to build a supervised model for dimensionality reduction, named GONN. Finally, the low dimensional representation is used for clustering of cells based on a clustering method.

Selection of significant GO terms
Gene Ontology (GO) is a popular vocabulary system for systematically describing the attributes of gene and gene product. Each GO term could annotate a set of genes. GO consists of three different categories, which are biology process, molecular function and cellular component. GO is structured as a directed acyclic graph. Each term has defined relations with other terms in the same or various categories. In this step, we choose GO terms that are used in the following model. We only use terms in the biological process and molecular function category since these terms might be more functional related. In GO, a parent term annotates all the genes annotated by its descendants. The main steps of selecting GO terms used in the following steps are as follows.
First, we select all the GO terms in the third layer. Evaluation test shows that GO terms at the third layer can achieve the best performance. The number of GO terms at different levels is shown in Table 1. These 1543 GO terms at the third level are the candidate terms that connect with the input layer in the neural network.
Second, we remove redundancy terms from the candidate terms obtained from the last step. The annotated genes of different terms may have overlap. Therefore, we remove the redundancy terms to decrease the information redundancy and the parameters in the following neural network-based model. Specifically, let GO i : {gene 1 , gene 2 , · · · gene n } be a GO term, named GO i , annotating a set of annotation genes gene 1 , gene 2 , . . . gene n . The unique score U ij of two GO terms is defined as follows: If the unique score U ij of two GO terms is larger than 0.5, the two GO terms are considered as not unique. Then, we will delete the GO term that has fewer annotation genes. Third, we remove the terms annotating genes that have similar expression profiles in different cells. Different genes may have different expression level in different cells. We tend to select the genes that have different expression levels for clustering. Therefore, we select the terms annotating genes having diverse expression levels in different cells. The diversity of a GO terms could be measured by gene expression values. Z-score-based method is used for normalization on gene dimension. Following this normalize operation, the expression values of each gene is normalized as a standard normal distribution. We define std j as standard deviation of gene j . The diversity score H i of a GO term GO i is calculated as follows: where n is the number of genes annotated by GO i . If the diversity score of GO i is less than the given threshold (in this case 0.1), GO i is considered as low diversity term. We then delete the low diversity GO terms. After these three steps, we obtain a set of GO terms with low redundancy and high diversity.

Architecture of unsupervised model (GOAE)
In the task of scRNA-seq data clustering, an unsupervised dimensionality reduction model is a key component. To perform the dimensionality reduction, we combine the Gene Ontology with autoencoder that has been widely used in other areas, like image processing, natural language processing.
To combine the GO with neural network, we add GO terms to the neural network as partial-connected neurons. The structure of this model is formulated from extensive prior knowledge of gene ontology. The architecture of GOAE is shown in Fig. 1.
The input layer is genes involved in the scRNA-seq datasets. In hidden layer 1, BP neurons and MF neurons are added based on the biological process and molecular function terms obtained from GO. As shown in Fig. 1, BP and MF neurons are partially connected. Only genes annotated by the corresponding GO term are fed to the GO term neuron.
GOAE consists of two components: encoder and decoder. From the input layer to hidden layer 2 are the encoder. The decoder part is exactly a mirror of the encoder part, which from hidden layer 2 to the output layer. Let x i be the output of the ith hidden layer. The forward propagation of the neural network is: where W i represents the weight matrix of the edge from i − 1 th layer to ith layer in the neural network, b i is the bias of each ith hidden layer node, f (·) is the activation function. We choose tanh function in our case, which is: In this GOAE model, we use the mean square error as a loss function. Let x 0j be the input vector of sample j, and x 4j is the output vector. n represents the number of training sample. The loss function is defined as follows: After several training epochs, the hidden layer 2 could be a low-dimension space of the input data.
Since the encoder and decoder are completely symmetric, both input layer and output layer are partial connection.
After training GOAE model, the hidden layer 2 could be used as the low-dimension representation of a cell. Then we can use a clustering method, (in our case, kmeans++), for the clustering of single cells.

Architecture of supervised model (GONN)
A supervised dimensionality reduction model may also be needed in single cell clustering or retrieval [15]. Similar to the GOAE model, we replace the hidden layer1 neurons of the neural network with GO term nodes, which are partial-connected to the input layer neurons that represents the genes. In the GONN model, another hidden layer with 100 fully-connected neurons are added (see Fig. 2). After the training phase, the hidden layer with 100 fully-connected neurons is considered as the low dimensional representation of the input.
At the output layer, softmax function is used for classification. Softmax function is defined as: where x is the input vector of output layer and c is the number of all cell types. Based on softmax activation function, we can obtain the probability vector that a cell is classified into different cell types. Finally, we use top-1 method (the label which has the largest probability) to decide the cell type of a cell. In GONN, the loss is defined as: where n is the number of samples in the training dataset. The first part of Eq. 7 is cross entropy. y j and y j represent the desired output and the predicted output of sample j respectively. The second part is L2 regularization, where λ is the L2 regularization coefficient. w represents the training parameter vector. We combine cross entropy and L2 regularization to avoid overfitting and optimize parameters. After training GONN models by known label cells, we extract the information of the last hidden layer(hidden layer2) as the low-dimension representation. Then we can use a clustering method, (in our case, kmeans++), for the clustering of single cells.

Evaluation criteria
We use the adjusted rand index(ARI) [29] to compare the clustering results of single cells with the true labels. ARI score can measure the similarity between two clustering results. It is defined as follows. Let X = {X 1 , . . . , X r } and Y = {Y 1 , . . . , Y s } be two different clustering results. n ij represents the number of objects in common between X i and Y j . Let a i = j n ij and b j = i n ij , the ARI is defined as follow: The scale of ARI score is between -1 and 1. The higher the ARI score is, the more similar two clustering results are. Furthermore, normalized mutual information(NMI) [30] is also used for evaluation. NMI uses the concept of information entropy to compare different clustering results. NMI score is calculated as follows: H(X) is the entropy of X, which is calculated as follows: I(X, Y ) is the mutual information between X and Y, which is calculated as follows: where N = i j n ij . NMI scores are between 0 and 1. The higher the NMI score is, the more similar two clustering results are. In the following evaluations, we run each experiment 10 times and calculate their average scores as final results.

Data preparation
We evaluate our models on three scRNA-seq datasets. The first dataset is a human scRNA-seq data from [31]. In our experiment, 300 cells involving 11 cell types are used. The involved cell types are listed as follows: CRL-2338(epithelial), CRL-2339(lymphoblastoid), BJ(fibroblast from human foreskin), GW(gestational 16, 21, 21+3 weeks from fetal cortex), HL60(myeloid from acute leukemia), iPS(pluripotent), K562(myeloid from chronic leukemia), Kera(foreskin keratinocyte) and NPC(neural progenitor cells). We remove the genes that have missing values in these cell types. Eigth thousand six hundred eighty six genes are involved in the evaluation dataset. The second dataset is obtained from [15]. It integrates three mus musculus scRNA-seq datasets [14,32,33], which contains 402 cells involving 16 cell types. Similarly, after removing the genes with missing values, 9437 genes are included in the evaluation dataset. The third dataset is also a mus musculus dataset from [15], which has more than 17,000 singlecell RNA-seq data from different 31 datasets. We use this dataset to evaluate cell type assignment. The gene ontology data is downloaded from http://www.geneontology.org/.

Performance evaluation on human scRNA-seq dataset
We test GOAE model (Fig. 1) and GONN model (Fig. 2) for clustering of human cells. 1174 GO terms satisfy the criteria described in 2.1 subsection. These terms are used in the GOAE and GONN model.  In the unsupervised test, all the unsupervised models are applied to the whole data set. All 11 types of cells are involved. Overall, GOAE performs the best among all tested methods. Similar with the experiment design in [15], several possible parameters (number of components) are tested for PCA and ICA method. We reduce the dimension of all data and using kmeans++ method to cluster all 11 cell types data. Figure 3 shows that GOAE perfects the best among all tested methods. The ARI and NMI score of GOAE are 0.917 and 0.933 respectively, while the scores of the runner-up method ZIFA are 0.873 and 0.914 respectively. The experiment result indicates that combining Gene Ontology and autoencoder can improve the performance of clustering of single cells.
For the supervised model, we compare GONN with the state-of-art method NN(ppi/tf ) [15] and the original neural network model (NN). We apply the same experimental protocol used in [15]. The cell types not used in the training phase are used as the test set. There are 11 cell types involved in this data set. We randomly select 2, 4 and 6 cell types as the test set in the evaluation test.
Overall, GONN method performs better than other methods (Tables 2 and 3). With the increase of the number of cell types in the test set, the clustering task becomes more challenging. The result shows that GONN performs the best when the number of cell types equals to 2, 4 and 6. Furthermore, when the number of cell types is 6, the ARI score of GONN is 0.8189, which is significantly higher than the runner-up method (Table 2). Unsurprisingly, GONN method also achieves the highest NMI score. The NMI score of GONN is 0.8803 even when the number of cell types is 6, while the value of the second best method is 0.8434. Figure 4 is the 2D visualization of low dimensional representation based on GONN and GOAE. We use t-SNE as the visualization tool. It is shown that the single cells are partitioned into different clusters based on GONN and GOAE, indicating that GONN and GOAE can learn a low dimensional representation for single cell data.

Performance evaluation on mus musculus dataset
Similar with evaluation test on the human dataset, we also test these models on mus musculus dataset that contains 16 cell types. For unsupervised models, we randomly select 2, 4, 6, 8, 10 and 12 cell types as test sets. For supervised models, since sufficient training set is necessary, we only randomly select 2, 4, 6 and 8 cell types as test sets. The rest of data are used as the training set. For GOAE and GONN model, 854 GO terms satisfy the criteria described in 2.1 subsection.
As shown in Tables 4 and 5, for the unsupervised model, GOAE achieves the highest performance on datasets with different numbers of cell types. The average of ARI scores of GOAE on all datasets is 0.7671, which is around 0.03 higher than the runner-up method DAE. More details are shown in Table 4. The trend of NMI scores is similar to ARI scores. GOAE can achieve the highest NMI scores on datasets with different numbers of cell types.  The complexity of the problem increases with the increase in the number of cell types. When the number of cell types is 8, the NMI score of GOAE is 0.8545 that is 0.04 higher than the runner-up method DAE. The evaluation test on mus musculus dataset indicates that combining gene ontology with neural network can improve the performance of single cell RNA-seq data clustering. For the supervised model, GONN performs better than other compared methods. The ARI score decreases with the increase in the number of cell types involved in the test set. GONN can achieve a high ARI score (0.7599) even the number of cell types is 8, while the value of runnerup method is 0.6832. Similarly, GONN also achieves the highest NMI score in all tested methods. The average NMI score of different datasets is 0.9103, which is significantly higher than NN(dense) and NN(ppi/tf ) method.
The corresponding values of NN(dense) and NN(ppi/tf ) are 0.8623 and 0.8496 respectively. The highest values are shown in blodface. See Table 4 for the more details

Effect of GO terms
One of the major contributions of our work is to add GO terms as neurons in the neural networks. To test whether the GO terms are selected appropriately, we re-run GONN and GOAE by varying the GO terms involved in the model. We use the mus musculus dataset on this test. To determine the threshold selection for the U ij and H i scores, we varied one parameter and fix other parameters to conduct experiments on GONN (see Tables 6  and 7). The evaluation test shows that GONN can achieve the highest performance when the unique score and high expression score are set as 0.5 and 0.1 respectively. Before GO terms selection, it selects epoch number=300, learning rate=1e-4. After selection, it selects epoch number=100, learning rate=1e-3. All GONN models in (c) and (d) all select epoch number=200 and learning rate=0.2. All these hyperparameters selection are achieve their best across a lots of hyperparameter groups As described in subsection 2.1, we remove the redundancy GO terms and GO terms with low expression scores. In this test, we create GONN v and GOAE v where the redundancy and low-diversity GO terms are not removed. In GONN v and GOAE v , 1486 GO terms are involved, while only 854 GO terms involved in GONN and GOAE. Figure 5a and b show that GONN is clearly better than GONN v , indicating that selecting appropriate GO terms contributes to the performance and this step has been appropriately designed. Similarly, Fig. 5c and (d) show that GOAE is clearly better than GOAE v . Particularly, on the datasets with 8 and 10 cell types, the average ARI of GOAE are about 2-3% higher than GOAE v .

Functional analysis on hidden layer nodes
For GOAE model, we train the model using samples of a certain cell type. Then, we could also obtain the top 10 highest GO-term nodes of the hidden layer. We select 8cell, 16cell, ES, earlyblast, and lateblast in this test, since training the GOAE model requires a sufficient amount of samples. For GONN model, we multiply the weight matrices W 2 and W 3 to represent the degree of importance between each cell type and the GO terms in the hidden layer 1. For each cell type, we selected the top-10 important GO terms for analysis. Table 8 shows some of the highly weighted GO-term nodes in the GOAE and GONN models. For example, regulation of transporter activity (GO:0032409) is mainly associated with ES(embryonic stem cell) [37], and embryonic placenta development (GO:0001892) is always relative with zygote cell [38]. ES GO:0032409Regulation of transporter activity [37] GONNES GO:0022417Protein maturation by protein folding [37] ES GO:0140101Catalytic activity, acting on a tRNA [37] BMDC GO:0099590Neurotransmitter receptor internalization [40] BMDC GO:0050881Musculoskeletal movement [40] Zygote GO:0001892Embryonic placenta development [38] Early 2cellGO:0032552Deoxyribonucleotide binding [41] Cell type assignment Another important application in single cell analysis is cell type assignment. To verify the effectiveness of our model in cell assignment and retrieval. We use a mus musculus dataset from Lin et al. paper [15], which has more than 17,000 single cells from different 31 datasets. We designed experiment according to [15].
To measure the results of cell type assignment, we calculate the percentage of the correctly predicted cell types by using top K nearest neighbors(K=100). Nine cell types are involved in the experiment, including 2 cell, 4 cell, 8 cell, zygote, embryonic stem cell(ESC), neurons, thymus, spleen and hematopoietic stem cell(HSC). Mean of average precision(MAP) [15,39] is used to measure the assignment performance. We compare our GONN model with NN(ppi/tf ) model, the results are shown in Fig. 6. Our model GONN performs better in 2 cell, 8 cell, zygote cell types. Besides, GONN has higher average of MAP than NN (ppi/tf ).

Conclusions
In this paper, we combine neural networks with Gene Ontology for reducing the dimensions of scRNA-seq data, which can improve the clustering of scRNA-seq data. We propose two models GOAE and GONN that are unsupervised and supervised model respectively.
The proposed model mainly contains two key components: the selection of significant GO terms and combination GO terms with the neural network-based model. When selecting important GO terms, it is crucial to choose the appropriate thresholds. If the threshold is not properly selected, deleting too much or too few GO terms will affect the final result.
Performance evaluation on two datasets shows that GONN and GOAE perform better than existing state-of-art dimensionality reduction methods for scRNA-seq data.