Overview
The main idea of our method is to integrate PPI-based information as prior knowledge into a graph neural network, so as to capture the relationships between proteins and RNAs as well as among proteins, and thereby improve the accuracy of protein abundance prediction. The whole pipeline is described in Fig. 5a and Algorithm 1. After noise reduction by SAVER-X, we divide the cells into two disjoint datasets, i.e. a training set and a test set. During training, we feed the training set to the model for parameter estimation and save the parameter values that yield the minimum MSE loss over all epochs computed so far. During testing, the model loads these saved parameters and directly predicts the protein abundances of the cells in the test set.
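The following minimal sketch illustrates this train/test protocol; the use of scikit-learn's train_test_split, the 80/20 split ratio and all function names are our own illustrative choices rather than part of the published pipeline:

```python
from sklearn.model_selection import train_test_split

def split_cells(denoised_rna, protein_abundance, test_fraction=0.2, seed=0):
    """denoised_rna: cells x genes matrix from SAVER-X;
    protein_abundance: cells x N matrix of measured (log-transformed) abundances."""
    # Divide the cells into two disjoint datasets: a training set and a test set.
    return train_test_split(denoised_rna, protein_abundance,
                            test_size=test_fraction, random_state=seed)

# X_train, X_test, Y_train, Y_test = split_cells(denoised_rna, protein_abundance)
# Training: estimate parameters and keep those with the lowest MSE across epochs.
# Testing: reload the saved parameters and predict protein abundances for X_test.
```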
Our model consists of two main modules. The first is the PPI-based graph neural network, shown as the “PPI-based graph neural network part” in Fig. 5a. Protein–protein interactions provide a channel for information transmission between proteins, meaning that proteins jointly promote specific biological functions, e.g. by inhibiting or promoting each other [31]. Intuitively, we encode the PPIs with a graph structure, where nodes are proteins and edges represent interactions, and we use a graph neural network to compute the result of information transmission through these interactions. The second module is the embedding of prior knowledge, such as co-expression and gene co-occurrence, also described in Fig. 5a. Since PPI relationships tend to be conserved across different cell types [31], the PPIs recorded in large-scale databases such as STRING can be used for the knowledge embedding.
The whole structure of the model is shown in Fig. 5a. The input is the denoised data from SAVER-X. Then, similar to cTP-net [7], we extract the RNA representation from the input RNA data using a feature-extraction neural network with two fully-connected layers, shown as the blue part in Fig. 5a. After that, to represent the features of the N proteins independently in a high-dimensional space, we use N 1-layer forward networks to map the RNA representation to N protein feature vectors, and combine all the protein feature vectors into matrix \(V_r\in {\mathbb {R}}^{N\times {d_r}}\), where \(d_r\) is the number of dimensions of the protein representations, shown as the orange vectors in Fig. 5a. In addition, the prior knowledge from different sources is embedded into matrix \(V_k\in {\mathbb {R}}^{N\times {d_k}}\), where \(d_k\) is the number of dimensions of the target vector space of the knowledge embedding, shown as the purple matrices in Fig. 5a. By concatenating the row vectors of the two matrices that correspond to the same protein, the high-dimensional representation of each protein is
$$\begin{aligned} v_i=v_{r_i}\oplus {v_{k_i}}, \end{aligned}$$
(1)
where \(v_i\in {\mathbb {R}}^{1\times {d}}, i=1,2,\dots ,N\), \(d=d_r+d_k\), and \(\oplus\) is the concatenation operation. Thus, the PPI network has the set of nodes \(V=\{v_1,v_2,\dots ,v_N\}\), with \(V\in {\mathbb {R}}^{N\times {d}}\), and the interactions between the proteins are represented as the set of edges \(E\subseteq {V\times {V}}\). Therefore, graph \(G= (V, E)\) represents the PPI network, as shown in the PPI-based Graph Neural Network part of Fig. 5a. To model the information transmission in the PPI network, we apply a graph neural network to G. Finally, to map the N representations in d dimensions to the abundance values \(\hat{Y}\in {\mathbb {R}}^{N\times {1}}\), we reduce the dimensions of the node vectors from d to 1 through the predictor, a 1-layer feed-forward network.
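As a concrete illustration, the following PyTorch sketch shows how the per-protein node matrix V could be assembled according to Eq. (1). The layer sizes follow the text, while the class and variable names, the initialization, and the single-cell input shape are our own assumptions:

```python
import torch
import torch.nn as nn

class ProteinNodeBuilder(nn.Module):
    def __init__(self, n_genes, n_proteins, d_r=64):
        super().__init__()
        # RNA feature extractor: two fully-connected hidden layers (1024 and 128 units).
        self.rna_encoder = nn.Sequential(
            nn.Linear(n_genes, 1024), nn.ReLU(),
            nn.Linear(1024, 128), nn.ReLU())
        # N independent 1-layer forward networks, one per protein, mapping the
        # 128-dimensional RNA representation to a d_r-dimensional protein feature vector.
        self.protein_heads = nn.ModuleList(
            [nn.Linear(128, d_r) for _ in range(n_proteins)])

    def forward(self, x, V_k):
        """x: (n_genes,) denoised RNA of one cell; V_k: (N, d_k) knowledge embedding."""
        h = self.rna_encoder(x)                                      # RNA representation
        V_r = torch.stack([head(h) for head in self.protein_heads])  # V_r: (N, d_r)
        return torch.cat([V_r, V_k], dim=1)                          # Eq. (1): v_i = v_{r_i} + v_{k_i} concatenated
```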
PPI-based graph neural network
In this paper, we assume that the proteins whose abundances are to be predicted have some relations with each other. Such relations could be due to physical interactions, crosstalk between signaling pathways, shared mechanisms of gene regulation from transcription to translation, or some other functional relationships. For convenience, we consider such relations as “protein–protein interactions” (PPIs) in the general sense, i.e. the PPIs include both direct and indirect interactions. A PPI network is naturally represented as an undirected graph denoted by \(G = (V, E)\), where each node in V corresponds to a protein and each edge in E corresponds to the interaction between two proteins.
To represent the edges in set E, we use a weight matrix \(W\in {\mathbb {R}}^{d\times {d}}\) to capture the relations among the features of the proteins, and an adjacency matrix \(A\in {\mathbb {R}}^{N\times {N}}\) containing edge weights to describe the connectivity among the proteins. The values in both matrices are initialized randomly and are adjusted when the model is trained, following the definition of graph neural networks in [36]. During training, the nodes transmit feature information to each other, and the result is:
$$\begin{aligned} V^{e}=\sigma (AVW), \end{aligned}$$
(2)
where matrix \(V^e\in {\mathbb {R}}^{N\times {d}}\) contains the node vectors transformed from the node vectors in V through A, W and the sigmoid function \(\sigma (x)=\frac{1}{1+e^{-x}}\), which is applied element-wise to the matrix AVW. After that, we use a Feed-Forward (FF) layer to reduce the dimensions of the node features from \(N\times {d}\) to \(N\times {1}\), where N is the number of proteins. Unlike cTP-net [7], which fits the centered log-ratio (CLR) transformed protein abundance [4] with the ReLU function \(ReLU(x)=\max (0,x)\), we use the PReLU function \(PReLU(x)=\max (0,x)+0.25\times \min (0,x)\) in the last layer so that the model can predict values below 0. Note that, in the CITE-seq data, the protein abundance values are log-transformed and can therefore be negative. Thus, the output is
$$\begin{aligned} \hat{Y}=PReLU(FF(V^{e})). \end{aligned}$$
(3)
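A minimal sketch of the graph propagation and prediction steps (Eqs. 2–3) is given below. Treating A and W as randomly initialized trainable parameters follows the description above, while the class name, the initialization scale and the learnable PReLU slope are our own choices:

```python
import torch
import torch.nn as nn

class PPIGraphLayer(nn.Module):
    def __init__(self, n_proteins, d):
        super().__init__()
        # Trainable adjacency matrix A (N x N) and feature-relation matrix W (d x d).
        self.A = nn.Parameter(torch.randn(n_proteins, n_proteins) * 0.01)
        self.W = nn.Parameter(torch.randn(d, d) * 0.01)
        # Predictor: 1-layer feed-forward network reducing each node vector from d to 1.
        self.ff = nn.Linear(d, 1)
        # PReLU with negative slope initialized at 0.25 allows negative (log-scale) outputs.
        self.prelu = nn.PReLU(init=0.25)

    def forward(self, V):
        """V: (N, d) protein node matrix; returns (N, 1) predicted abundances."""
        V_e = torch.sigmoid(self.A @ V @ self.W)   # Eq. (2): V^e = sigma(A V W)
        return self.prelu(self.ff(V_e))            # Eq. (3): Y_hat = PReLU(FF(V^e))
```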
Prior knowledge
In the previous section we built a PPI network from a specific dataset, but additional prior knowledge about PPIs is available from other sources. The STRING database collects PPI information from different angles, such as co-expression and gene co-occurrence. Therefore, we use this superset of PPI information to improve model performance. To represent these features, we embed this prior knowledge into \(d_k\) dimensions, which adds constraints to the protein predictions in the graph neural network. The structure is shown in Fig. 5b and the algorithm is described in Algorithm 3.
We use M independent features \(C=\{C_1,C_2,\dots , C_M\}\) of the PPIs in the STRING database [31]. Each feature \(C_i\) is represented by a graph with N protein nodes and \(N\times {N}\) edges weighted by the interaction scores, where N is the number of proteins. We transform every \(C_i\) into an adjacency matrix \({C_i}'\in {\mathbb {R}}^{N\times {N}\times {1}}\). When a protein is missing from the prior knowledge database, its connections to the other proteins are absent, and we set the weights of those connections to 0. To obtain the high-dimensional features of each adjacency matrix, each column vector in matrix \({C_i}'\) is encoded by N 1-layer fully-connected networks with \(d_c\) output dimensions, yielding \(A_{c_i}\in {\mathbb {R}}^{N\times {N}\times {d_c}}\). Then, through the attention mechanism defined in [37], the importance scores of the features are merged into matrix \(A_c\in {\mathbb {R}}^{N\times {N}\times {d_c}}\),
$$\begin{aligned} A_c &= elu\left( \frac{1}{M}\sum _{i=1}^{M}(a_{c_i}W_{a_i}A_{c_i})\right) , \end{aligned}$$
(4)
$$\begin{aligned} a_{c_i}&= \frac{\mathrm{exp}{(elu(A_{c_i}))}}{\sum _{e=1}^{N}\mathrm{exp}{(elu(A_{c_e}))}}, \end{aligned}$$
(5)
where \(a_{c_i}\) is the normalized attention coefficient, \(W_{a_i}\) is the weight matrix for the i-th coefficient, and \(elu(x)=\max (0,x)+\min (0,\mathrm{exp}(x)-1)\).
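The following sketch illustrates one possible implementation of the attention-based merging of the M prior-knowledge feature graphs (Eqs. 4–5). Normalizing the coefficients across the M feature graphs and applying \(W_{a_i}\) along the feature dimension are our interpretation of the equations; all names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAttention(nn.Module):
    def __init__(self, n_features, d_c):
        super().__init__()
        # One weight matrix W_{a_i} per prior-knowledge feature (co-expression, ...).
        self.W_a = nn.Parameter(torch.randn(n_features, d_c, d_c) * 0.01)

    def forward(self, A_c_stack):
        """A_c_stack: (M, N, N, d_c) stacked high-dimensional feature adjacency tensors."""
        # Eq. (5): normalized attention coefficients a_{c_i} = softmax(elu(A_{c_i})),
        # here normalized across the M feature graphs (our interpretation).
        a = F.softmax(F.elu(A_c_stack), dim=0)
        # Eq. (4): A_c = elu( (1/M) * sum_i a_{c_i} * (A_{c_i} W_{a_i}) ).
        weighted = a * torch.einsum('mijc,mcd->mijd', A_c_stack, self.W_a)
        return F.elu(weighted.mean(dim=0))      # merged matrix A_c: (N, N, d_c)
```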
To combine the prior knowledge with each protein node and thereby constrain the information transmission, we divide \(A_c\) into N submatrices \(A_{c_j}\in {\mathbb {R}}^{N\times {d_c}}\), where \(0<j\le {N}\), each corresponding to one of the N proteins. To reflect the different degrees of importance of the protein pairs, we re-weight all the relationships. In the following, \(A_{k_j}\in {\mathbb {R}}^{N\times {d_c}}\) denotes the re-weighted relationships:
$$\begin{aligned} a_{k_j}&= \frac{\mathrm{exp}{(elu(A_{k_j}))}}{\sum _{e=1}^{N}\mathrm{exp}{(elu(A_{k_e}))}}, \end{aligned}$$
(6)
$$\begin{aligned} A_{k_j} & = elu(a_{k_j}W_{k_j}A_{c_j}), \end{aligned}$$
(7)
where \(a_{k_j}\) is the normalized attention coefficient for the different constrained features. Because a pair of proteins may be influenced by multiple intermediate proteins, we concatenate all the prior knowledge of protein interactions for each node into a feature vector, as follows:
$$\begin{aligned} V_k=A_{k_1}\oplus {A_{k_2}}\oplus \dots \oplus {A_{k_N}}, \end{aligned}$$
(8)
where \(V_k\in {\mathbb {R}}^{N\times {d_k}}, d_k=N\times {d_c}\), and \(\oplus\) is the concatenation operation.
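A sketch of the per-protein re-weighting and concatenation (Eqs. 6–8) could look as follows; computing the attention coefficients from \(A_{c_j}\) and flattening each re-weighted \(N\times {d_c}\) block into one row of \(V_k\) are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeEmbedding(nn.Module):
    def __init__(self, n_proteins, d_c):
        super().__init__()
        # One weight matrix W_{k_j} per protein node.
        self.W_k = nn.Parameter(torch.randn(n_proteins, d_c, d_c) * 0.01)

    def forward(self, A_c):
        """A_c: (N, N, d_c) merged prior knowledge; returns V_k of shape (N, N*d_c)."""
        # Eq. (6): normalized attention coefficients across the N submatrices A_{c_j}.
        a_k = F.softmax(F.elu(A_c), dim=0)
        # Eq. (7): A_{k_j} = elu(a_{k_j} * (A_{c_j} W_{k_j})), re-weighting each relationship.
        A_k = F.elu(a_k * torch.einsum('jnc,jcd->jnd', A_c, self.W_k))
        # Eq. (8): concatenate each protein's (N x d_c) block into a d_k = N*d_c vector.
        return A_k.reshape(A_k.shape[0], -1)
```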
Model training
Before training, we set the model hyperparameters. In the fully connected layers, the two hidden layers for the RNA representation have 1024 and 128 output neurons, respectively, and the connected layer for the prior knowledge embedding has 32 hidden neurons. In the feed-forward network, we set \(d_r\) to 64, \(d_c\) to 32 and \(d_k\) to \(d_c \times N\). The number of nodes N in our graph neural network depends on the dataset, i.e. \(N=10\) for PBMC and \(N=13\) for CBMC. Thus, \(d_k=320\) and \(d=d_r+d_k=384\) for PBMC, and \(d_k=416\) and \(d=d_r+d_k=480\) for CBMC.
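For clarity, the dimension bookkeeping in this paragraph can be written as a small worked example (the function name is ours; the numbers follow the text):

```python
def model_dims(n_proteins, d_r=64, d_c=32):
    """Return d_k = d_c * N and d = d_r + d_k."""
    d_k = d_c * n_proteins
    return d_k, d_r + d_k

print(model_dims(10))  # PBMC: N = 10 -> (320, 384)
print(model_dims(13))  # CBMC: N = 13 -> (416, 480)
```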
For training, we set the number of epochs to 350 and the batch size to 32. For the optimization of the loss function based on mean squared error (MSE), we first initialize the global \(MSE_{loss}'\) to infinity. In each epoch, if the current \(MSE_{loss}\) is smaller than the global \(MSE_{loss}'\), we update \(MSE_{loss}'\) to \(MSE_{loss}\) and save the model parameters of that epoch. We assume that all proteins have equal weights in the MSE loss:
$$\begin{aligned} MSE_{loss}(Y,\hat{Y})=\sum _{i=1}^{N}{(\hat{y}_i-y_i)}^2, \end{aligned}$$
(9)
where Y contains the ground-truth measurements and \(\hat{Y}\) contains the predicted protein abundances. The initial learning rate is set to \(10^{-6}\). The model parameters are estimated by minimizing the MSE loss with the Adam optimizer via back propagation.
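A minimal PyTorch sketch of this training procedure with best-loss checkpointing is shown below; the model and data loader objects are placeholders, and using MSELoss with sum reduction to mirror Eq. (9) is our own reading:

```python
import copy
import math
import torch

def train(model, train_loader, epochs=350, lr=1e-6):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = torch.nn.MSELoss(reduction="sum")    # equal weights for all proteins, Eq. (9)
    best_loss, best_state = math.inf, None     # global MSE'_loss initialized to infinity
    for epoch in range(epochs):
        epoch_loss = 0.0
        for x, y in train_loader:              # batches of cells (batch size 32)
            optimizer.zero_grad()
            y_hat = model(x)
            loss = mse(y_hat, y)
            loss.backward()                    # back propagation
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss < best_loss:             # keep the parameters of the best epoch
            best_loss = epoch_loss
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)          # parameters used at test time
    return model
```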