Kernelized multiview signed graph learning for single-cell RNA sequencing data

Background: Characterizing the topology of gene regulatory networks (GRNs) is a fundamental problem in systems biology. The advent of single-cell technologies has made it possible to construct GRNs at finer resolutions than bulk and microarray datasets. However, cellular heterogeneity and the sparsity of single-cell datasets invalidate the standard Gaussian assumptions used for constructing GRNs. Additionally, most GRN reconstruction approaches estimate a single network for the entire dataset, which can cause a loss of information when single-cell datasets are generated from multiple treatment conditions or disease states.

Results: To better characterize single-cell GRNs under different but related conditions, we propose the joint estimation of multiple networks using multiple signed graph learning (scMSGL). The proposed method builds on recently developed graph signal processing (GSP) based graph learning, where GRNs and gene expressions are modeled as signed graphs and graph signals, respectively. scMSGL learns multiple GRNs by optimizing the total variation of gene expressions with respect to the GRNs, while ensuring that the learned GRNs are similar to each other through regularization with respect to a learned signed consensus graph. We further kernelize scMSGL, with the kernel selected to suit the structure of single-cell data.

Conclusions: scMSGL is shown to have superior performance over existing state-of-the-art methods in GRN recovery on simulated datasets. Furthermore, scMSGL successfully identifies well-established regulators in a mouse embryonic stem cell differentiation study and a cancer clinical study of medulloblastoma.

The vectorized form of the problem in (1) is given in (2), where the first term in the summation corresponds to the first term in (1), and the correspondence between the remaining terms and those in (1) can be deduced from the hyperparameters. The first two constraints correspond to the trace constraints in (1). The constraint $\ell_{i,+} \perp \ell_{i,-}$ together with $\ell_{i,+} \leq 0$ and $\ell_{i,-} \leq 0$ is known as a complementarity constraint (Scheel and Scholtes, 2000) and corresponds to $(L_{i,+}, L_{i,-}) \in \mathcal{C}$ in (1).
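Written out, the complementarity constraints say that each edge can carry at most one sign. Using the notation above (the set symbol $\mathcal{S}$ anticipates the projection step below; the element-wise form is our own rendering):
$$\mathcal{S} = \{(x, y) : x \leq 0,\ y \leq 0,\ x \odot y = 0\},$$
where $\odot$ denotes the element-wise product, so that no edge is simultaneously assigned both a positive and a negative weight.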
The problem in (2) is non-convex due to the complementarity constraints. However, ADMM has been shown to converge for problems with complementarity constraints under certain assumptions (Wang et al., 2019).
To write the problem in standard ADMM form, we introduce auxiliary variables $v_i = \ell_{i,+}$ and $w_i = \ell_{i,-}$ for all $i$. Similarly, we introduce $v = \ell_{+}$ and $w = \ell_{-}$, and let $V = \{v_1, \dots, v_N, v\}$ and $W = \{w_1, \dots, w_N, w\}$. The problem in its standard ADMM form is then given in (3). The augmented Lagrangian can be written as in (4), where $\rho$ is the penalty parameter of the augmented Lagrangian and $\lambda_{i,+}$, $\lambda_{i,-}$, $\lambda_{+}$ and $\lambda_{-}$ are the Lagrange multipliers.
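For reference, with the splitting above the augmented Lagrangian takes the standard ADMM form; the rendering below is our own reconstruction, with $f$ collecting the objective terms of (2):
$$\begin{aligned}
\mathcal{L}_\rho = f(\{\ell_{i,\pm}\}, \ell_{\pm}) &+ \sum_{i=1}^{N} \Big[ \lambda_{i,+}^\top(\ell_{i,+} - v_i) + \frac{\rho}{2}\|\ell_{i,+} - v_i\|_2^2 + \lambda_{i,-}^\top(\ell_{i,-} - w_i) + \frac{\rho}{2}\|\ell_{i,-} - w_i\|_2^2 \Big] \\
&+ \lambda_{+}^\top(\ell_{+} - v) + \frac{\rho}{2}\|\ell_{+} - v\|_2^2 + \lambda_{-}^\top(\ell_{-} - w) + \frac{\rho}{2}\|\ell_{-} - w\|_2^2.
\end{aligned}$$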
Using the augmented Lagrangian, the ADMM steps at the $k$th iteration are found as follows, where the superscripts $k$ and $k-1$ denote the values of the variables at the $k$th and $(k-1)$th iterations, respectively. To solve (5), we use the fact that it can be solved for each $(v_i, w_i)$ pair (and $(v, w)$) separately. This separation leads to a set of optimization problems, all of which can be solved by projection onto the complementarity set $\mathcal{S}$. The problem in (6) is separable across the positive and negative Laplacian variables, leading to two optimization problems, both of which can be solved with Block Coordinate Descent (BCD) (Shi et al., 2016).
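Since the projection onto $\mathcal{S}$ decouples across coordinates, it admits a simple closed form: for each entry pair, keep whichever clipped value is closer to the input and zero the other. A minimal sketch of this projection (our own illustration; the function and variable names are not from the paper):

```python
import numpy as np

def project_complementarity(v, w):
    """Project (v, w) onto S = {(x, y): x <= 0, y <= 0, x * y = 0}.

    The problem decouples per coordinate: either x_j = min(v_j, 0) with
    y_j = 0, or x_j = 0 with y_j = min(w_j, 0); keep whichever choice has
    the smaller squared distance to (v_j, w_j).
    """
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    v_clip = np.minimum(v, 0.0)                 # candidate x when y = 0
    w_clip = np.minimum(w, 0.0)                 # candidate y when x = 0
    cost_keep_v = (v - v_clip) ** 2 + w ** 2    # distance if (v_clip, 0) kept
    cost_keep_w = v ** 2 + (w - w_clip) ** 2    # distance if (0, w_clip) kept
    keep_v = cost_keep_v <= cost_keep_w
    return np.where(keep_v, v_clip, 0.0), np.where(keep_v, 0.0, w_clip)
```

For instance, `project_complementarity(np.array([-2.0]), np.array([-3.0]))` returns `([0.], [-3.])`: keeping the more negative entry and zeroing the other incurs the smaller squared distance.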

Signed AUPRC Ratio
Edges in GRNs can be activating, inhibitory, or non-existent. Thus, the performance of a GRN inference algorithm should be measured by how well it infers activating, inhibitory, and non-existent edges.
Given the ground-truth GRN $G$ and the output $\hat{G}$ of a GRN inference algorithm, let $G_{+}$ and $G_{-}$ be the activating and inhibitory edges in the ground-truth GRN, and $\hat{G}_{+}$ and $\hat{G}_{-}$ be the activating and inhibitory edges in the inferred network. We compare $\hat{G}_{+}$ to $G_{+}$ with AUPRC to measure how well the algorithm finds the activating edges. Similarly, we compare $\hat{G}_{-}$ to $G_{-}$ to measure the performance on inhibitory edges. Let $\mathrm{AUPRC}_{+}$ and $\mathrm{AUPRC}_{-}$ represent these values. We calculate the signed AUPRC ratios as
$$\mathrm{AUPRC\ ratio}_{+} = \frac{\mathrm{AUPRC}_{+}}{\mathrm{AUPRC}_{+}^{\mathrm{random}}}, \qquad \mathrm{AUPRC\ ratio}_{-} = \frac{\mathrm{AUPRC}_{-}}{\mathrm{AUPRC}_{-}^{\mathrm{random}}},$$
where $\mathrm{AUPRC}_{+}^{\mathrm{random}}$ and $\mathrm{AUPRC}_{-}^{\mathrm{random}}$ are the performance measures of a random estimator. Finally, note that if an algorithm infers an unsigned GRN, we use the inferred GRN for both $\hat{G}_{+}$ and $\hat{G}_{-}$.
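For concreteness, a sketch of this metric using scikit-learn, where the AUPRC of a random estimator equals the prevalence of true edges of the given sign; the function name and matrix conventions are our assumptions, not code from the paper:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def signed_auprc_ratios(true_w, pred_w):
    """Signed AUPRC ratios between a ground-truth and an inferred GRN.

    true_w, pred_w: (n, n) signed weight matrices; positive entries denote
    activating edges, negative entries inhibitory edges, zeros non-edges.
    Returns (ratio_plus, ratio_minus).
    """
    iu = np.triu_indices_from(true_w, k=1)  # undirected graphs: upper triangle
    t, p = true_w[iu], pred_w[iu]

    # Activating edges: labels are the positive ground-truth edges; large
    # positive predicted weights should rank highest.
    auprc_pos = average_precision_score(t > 0, p)
    # Inhibitory edges: negate scores so strong negative weights rank highest.
    auprc_neg = average_precision_score(t < 0, -p)

    # AUPRC of a random estimator equals the fraction of true edges per sign.
    rand_pos = (t > 0).mean()
    rand_neg = (t < 0).mean()
    return auprc_pos / rand_pos, auprc_neg / rand_neg
```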

Hyperparameter Sensitivity
In this section, we study the sensitivity of scMSGL to hyperparameter selection. As described above, the hyperparameters are selected by setting $d_{+}$, $d_{-}$, $c_{+}$ and $c_{-}$. We set $d_{+} = d_{-} = d$ and $c_{+} = c_{-} = c$, and vary the values of $c$ and $d$. In Figure 1, we report the performance of scMSGL with the correlation kernel for varying $c$ and $d$ values. The two datasets from the right panel of Figure 1 of the main text are considered. In both datasets, the highest AUPRC ratios are obtained when the learned graphs are densest. Comparing the left and right plots of Figure 1, it is also observed that the optimal value of $c$ changes as the view similarity of the ground-truth graphs decreases. Note that the lowest AUPRC ratio is still comparable to or higher than those of the benchmarking algorithms reported in Figure 1 of the main text.
Overall, it can be concluded that there is a large range of values around the optimal d and c for which the learned graphs have satisfactory AUPRC ratios.
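The sensitivity study above amounts to a grid sweep over $d$ and $c$. A minimal sketch, assuming hypothetical callables `learn_fn` (the scMSGL routine configured with the target density and similarity) and `score_fn` (e.g., a signed AUPRC ratio as in the previous section):

```python
import numpy as np

def sweep_d_c(expressions, ground_truths, learn_fn, score_fn, d_grid, c_grid):
    """Evaluate learned graphs over a grid of density (d) and similarity (c).

    learn_fn(expressions, d, c) -> list of learned graphs, one per view.
    score_fn(gt, g) -> scalar quality score for one view.
    Returns a (len(d_grid), len(c_grid)) array of mean scores.
    """
    scores = np.zeros((len(d_grid), len(c_grid)))
    for a, d in enumerate(d_grid):
        for b, c in enumerate(c_grid):
            graphs = learn_fn(expressions, d, c)
            scores[a, b] = np.mean([score_fn(gt, g)
                                    for gt, g in zip(ground_truths, graphs)])
    return scores
```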

Data Generation Procedure
1. For each simulation setting, we first draw a baseline binary graph $G$ from a random graph model with $n$ genes. We considered the Erdős–Rényi and Barabási–Albert random graph models, as mentioned in the main text.
2. $N = 5$ perturbed graphs are generated by randomly adding $0.9 \times n^2 \times \eta$ edges to the baseline graph, where $\eta > 0$ is the fraction of added edges.
3. For the $k$th graph, each edge is assigned a weight $W^k_{ij}$ such that:
4. For the $k$th graph, $p_k$ random samples are drawn from a multivariate Gaussian distribution with precision matrix $W^k$. The random samples are used as the columns of the matrix $X^k \in \mathbb{R}^{n \times p_k}$.
5. To mimic the dropout phenomenon present in real single-cell datasets, we next introduced additional zeros into the gene expression matrix $X^k$. Following (Pierson and Yau, 2015), the dropout probability for each entry of $X^k$ was calculated as $\pi_{ij} = \exp\!\big(-\alpha (X^k_{ij})^2\big)$, where $\alpha$ is an exponential decay parameter that controls the dependence between the dropout probability and the gene expression level (see the sketch after this list).
6. A binary indicator was then sampled for each entry, $\xi^k_{ij} \sim \mathrm{Bernoulli}(\pi_{ij})$, with $\xi^k_{ij} = 1$ indicating that the corresponding entry $X^k_{ij}$ is replaced by 0. The dropout proportion for each gene was calculated as $\omega^k_i = \frac{1}{p_k}\sum_{j=1}^{p_k} \xi^k_{ij}$.
7. Using a modification of the NORTA (Normal to Anything) method (Yahav and Shmueli, 2012), we generated samples from a multivariate zero-inflated negative binomial distribution based on the $X^k$ generated in Step 4, using mean $\lambda$, dispersion $\kappa$ and zero-inflation parameters $\omega^k_i$'s.
8. To mirror the behaviour of real scRNA-seq gene expression data, $\lambda$ and $\kappa$ were estimated from the MB scRNA-seq dataset (GSE119926) (Hovestadt et al., 2019) analyzed in the main manuscript.

$\lambda_2$: Search for the value that returns graphs with $c_{+} = c_{-} = c$

Table 1: Selected hyperparameters of baseline methods and scMSGL. $n$ is the number of genes, $d_{+}$ and $d_{-}$ are the desired positive and negative edge densities of the learned graphs, respectively. $c_s$ is the desired pairwise similarity between $G_{i,s}$ and $G_{j,s}$, $\forall i \neq j$ and $s \in \{+, -\}$.
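As referenced in Step 5, the dropout mechanism of Steps 5 and 6 can be sketched as follows; a minimal illustration of the zero-inflation step only (function and variable names are ours, and the NORTA step of Step 7 is omitted):

```python
import numpy as np

def add_dropout(X, alpha, rng=None):
    """Zero-inflate an expression matrix following Pierson & Yau (2015).

    Each entry is dropped with probability pi_ij = exp(-alpha * X_ij^2),
    so low-magnitude expressions are the most likely to be zeroed.
    Returns the masked matrix and the per-gene dropout proportions omega.
    """
    rng = np.random.default_rng() if rng is None else rng
    pi = np.exp(-alpha * X ** 2)       # entry-wise dropout probabilities
    xi = rng.random(X.shape) < pi      # Bernoulli(pi) dropout indicators
    X_dropped = np.where(xi, 0.0, X)   # replace dropped entries with zero
    omega = xi.mean(axis=1)            # dropout proportion per gene (row)
    return X_dropped, omega
```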