OscoNet: inferring oscillatory gene networks

Cutillo, Luisa; Boukouvalas, Alexis; Marinopoulou, Elli; Papalopulu, Nancy; Rattray, Magnus

doi:10.1186/s12859-020-03561-y

Research
Open access
Published: 25 August 2020

OscoNet: inferring oscillatory gene networks

Luisa Cutillo²,
Alexis Boukouvalas¹^na1,
Elli Marinopoulou¹,
Nancy Papalopulu¹ &
…
Magnus Rattray¹

BMC Bioinformatics volume 21, Article number: 351 (2020) Cite this article

2858 Accesses
2 Citations
8 Altmetric
Metrics details

Abstract

Background

Oscillatory genes, with periodic expression at the mRNA and/or protein level, have been shown to play a pivotal role in many biological contexts. However, with the exception of the circadian clock and cell cycle, only a few such genes are known. Detecting oscillatory genes from snapshot single-cell experiments is a challenging task due to the lack of time information. Oscope is a recently proposed method to identify co-oscillatory gene pairs using single-cell RNA-seq data. Although promising, the current implementation of Oscope does not provide a principled statistical criterion for selecting oscillatory genes.

Results

We improve the optimisation scheme underlying Oscope and provide a well-calibrated non-parametric hypothesis test to select oscillatory genes at a given FDR threshold. We evaluate performance on synthetic data and three real datasets and show that our approach is more sensitive than the original Oscope formulation, discovering larger sets of known oscillators while avoiding the need for less interpretable thresholds. We also describe how our proposed pseudo-time estimation method is more accurate in recovering the true cell order for each gene cluster while requiring substantially less computation time than the extended nearest insertion approach.

Conclusions

OscoNet is a robust and versatile approach to detect oscillatory gene networks from snapshot single-cell data addressing many of the limitations of the original Oscope method.

Background

Oscillating genes are expressed in a periodic manner leading to alternating appearance and disappearance of the corresponding mRNA and protein. Oscillatory genes have been shown to play a pivotal role in many developmental processes, by enabling individual systems to implement diverse functions [1]. To identify oscillatory genes a combination of time-lapse microscopy techniques and fluorescent reporter genes is required, which prohibits the assaying of a large number of potential oscillators. In turn, this limits the ability of the experimenter to uncover novel oscillators.

Recent developments in single-cell transcriptomics provide the potential to capture genome-wide transcriptional events. However, single-cell RNA-seq (scRNA-seq) experiments, even when done as time series, pose challenges associated in identifying oscillating genes because cells may be asynchronous in vivo or may lose synchronisation in the time required for sample preparation and sequencing.

To overcome these problems, Leng et al. [2] have proposed a computational approach to identify oscillating genes in static, unsynchronised scRNA-seq experiments. They have constructed a parameterised distance between pairs of putative oscillating genes which is minimised for co-oscillating pairs that have the same temporal profile but different phase. Their approach allows gene pairs to be ranked by how close they are according to this co-oscillation distance. However, they do not provide a statistical test to decide which gene pairs are oscillatory.

In this contribution, we provide a well-calibrated hypothesis test which allows us to remove the less interpretable cut-off in the first step of the Oscope pipeline [2]. We do this by selecting gene pairs according to a false discovery rate (FDR) cut-off. Our non-parametric bootstrap approach requires us to optimise the gene-pair distance many times, and therefore we have implemented an efficient optimisation algorithm to minimise the gene-pair distance function. We then construct an undirected network of co-oscillating genes based on all gene pairs which pass the FDR threshold. Community extraction and network analysis methods can then be used to identify statistically significant co-oscillating groups of genes. We also discuss three distinct statistical tests that can be employed to assess the significance of different aspects of the inferred network communities, namely community distinctiveness and enrichment, in terms of both genes (graph vertices) and their co-oscillations (graph edges). We conclude with a description of a simple method to infer pseudo-time for each gene community, inferring a cyclical ordering on a low-dimensional projection.

Review of the oscope method

The Oscope method [2] uses a paired-sine model and K-medoids clustering to identify groups of oscillatory genes. For each oscillatory group, an extended nearest insertion algorithm (ENI) is used to construct the cyclic order of cells, defined as the order that specifies each cell’s position within one cycle of the oscillation of that group [3]. The method requires normalised gene expression and the use of only high mean, high variance genes is recommended [3].

The method relies on the computation of a pairwise gene expression similarity. Let two genes be X= sin(ωt+ϕ) and Y= sin(ωt+ϕ+Ψ_XY), that is they have identical profiles except for a phase shift Ψ_XY. For N cells the co-oscillation distance is d(X,Y|Ψ_XY):

$$ \sum_{s=1}^{N}\left[Y_{s}^{2}+X_{s}^{2}-2Y_{s}X_{s}\cos\left(\Psi_{XY}\right)-\sin^{2}\left(\Psi_{XY}\right)\right] $$

(1)

The derivation is provided in the Supplementary Material (Section 3). The assumed sinusoidal form is not a major restriction, given that the time t is not required in the co-oscillation distance^{Footnote 1}. This is the crucial step in the algorithm as it allows the gene-pair similarity to be estimated without knowledge of the pseudo-time.

A limitation of this approach is that the cell population is assumed to be homogeneous. Including cell populations that are not relevant to the process under investigation will result in reduced accuracy as the co-oscillation distance will be averaged across both informative and non-informative cells. We therefore recommend to remove irrelevant cell populations prior to the application of the Oscope and OscoNet methods. This can be accomplished by visualisation and clustering methods such as t-SNE and k-means to identify cell populations and using gene markers or other information to characterise each cell cluster.

In the original Oscope approach, only genes that appear in the top T% of the ranked pairwise similarity list are selected for the subsequent K-medoids clustering step. The choice of T can have a significant impact on downstream analysis. However, the optimal choice of T is not obvious and will likely differ depending on the characteristics of the data, e.g. number of co-oscillatory pairs, technical/biological noise levels, number of cells in the dataset etc. In our work, we instead propose to use a well-calibrated non-parametric test to select gene pairs with a given FDR. Genes that contribute to these gene pairs are then selected for further downstream analysis.

A further limitation of the Oscope approach is that linearly correlated genes cannot be disambiguated from gene pairs with 0 or π phase shift. Therefore a heuristic is used to remove clusters that have a high proportion of similarities with 0 or π phase shift.

Finally, the ENI algorithm is used to estimate pseudo-time per gene cluster. The use of the ENI algorithm is not critical to the algorithm and can be replaced by a computationally more efficient probabilistic approach as we describe below.

Methods

Our overall approach is summarised in Fig. 1. Following a quality control step (see Appendix), we propose three main phases, namely: building a network of gene co-oscillations using a non-parametric hypothesis test, identify the statistically significant communities in the resulting network and infer pseudo-time for each gene community.

Non-parametric test on co-oscillation

The key insight in the Oscope algorithm is that a gene-pair oscillation score can be defined without requiring a prior estimation of pseudo-time. Rather than specifying a threshold on the number of genes to keep, here we develop a non-parametric hypothesis test to assess the significance of two genes co-oscillating.

We employ a bootstrap approach which randomly permutes the order of cells for one of the genes in the gene-pair distance estimation in order to perform a statistical test. The cell index is permuted for one gene in a pair and we calculate the resulting pairwise gene similarity $d\left (X,\mathcal {P}(Y)|\Psi _{XY}\right)$ (recall Eq. 1) where $\mathcal {P}(Y)$ is the permuted cell order for gene Y. For each randomisation, we estimate the phase shift by minimising the pairwise distance,

$$ d_{perm} = \min_{\Psi_{XY}} d\left(X,\mathcal{P}(Y)|\Psi_{XY}\right) \ . $$

(2)

The p-value estimate for the gene pair XY is the fraction of times d_perm<d. Specifically for B randomisations the bootstrap p-value is

$$ p\text{-value}_{XY} = \frac{1}{B} \sum_{i=1}^{B} I\left(d_{perm}^{i} < d\right) \,, $$

(3)

where $d_{perm}^{i}$ the i^th randomisation, d=d(X,Y|Ψ_XY) the unpermuted pairwise distance and I(.) the indicator function, which is 1 when the inequality is true and 0 otherwise.

In our approach, we need to correct for the effect of multiple hypothesis tests since we are testing many gene pairs. We have selected to use the q-value approach of [4] which controls the FDR. The simple Bonferroni correction is too conservative in the case of very large numbers of tests and is also based on an unrealistic assumption that tests are independent. The q-value is an adjustment of the traditional p-value where the FDR is minimised based on the distribution of the p-values across all the tests performed. The q-value is an estimate of the FDR at a given p-value threshold obtained from modelling background (null) and signal contributions to the p-value distribution. This approach does not assume independent tests and has been shown to perform well in applications with very large numbers of tests. The q-value provides an estimate of the number of false edges in our co-oscillation network.

As the bootstrap approach is computationally demanding, we have expressed the distance measure using matrix operations and we provide an efficient implementation. We have implemented the bootstrap test in Python using NumPy vector operations and we have also implemented a faster version in Cython, which allows C code to be called from Python. A performance comparison of the different implementations is given in Fig. 2 wherein we compare our implementation of the pairwise distance calculation (NumPy, Cython) to the standard Oscope [2] R implementation. As all three implementations are able to leverage multi-core parallelism, our test is performed across 12 CPU cores. We see that both Numpy and Cython are substantially faster than the iterative R version which uses the Oscope bioconductor package. The Numpy and Cython implementations achieve an average speed-up of 6.5 and 93 respectively. The significant performance improvement allows for the practical use of the bootstrap algorithm. The computational penalty for a fully non-parametric significance test is therefore minimised.

Network analysis

We use the gene-pair test to construct a network of genes in which connections denote statistically significant co-oscillation. In the network setting, a variety of algorithms have been developed to extract communities within the entire network. The problem of community detection, or graph partitioning, in a network has been widely studied in a variety of research fields. In the present work, we consider community extraction algorithms based on maximising modularity, a measure of community distinctiveness. The modularity often referred to as Q, was defined by [5] as the fraction of the edges that fall within a given group of nodes (community partition) minus the expected fraction if edges were distributed at random. The concept of modularity is based on the idea that a random graph is not expected to have a cluster structure. In the “Synthetic study” section, we compare three popular community extraction algorithms^{Footnote 2}, namely: walktrap [6], fast-greedy [7] and infomap [8], and find the walktrap algorithm to be the most robust.

To further interpret the estimated network, which is still highly complex, we tried to identify the underlying community structures [9]. The identified communities will allow us to create a large scale map of our network [10]. We aim to use individual communities to shed light on the function of the co-oscillation represented by the network itself. Indeed, in biological networks, communities often correspond to functional units of a system. For example in metabolic networks [11], such functional groups correspond to cycles or pathways, whereas in the protein-protein interaction networks [12], communities correspond to proteins with similar functionality inside a cell.

To assess the significance of each community, we employ three statistical tests that examine different aspects of the community structure. Our first step is to apply a Wilcoxon test for each community, to compare the distribution of the intra-community edges versus the distribution of the inter-community edges. This tests allows assessing how well-defined a community is and complements the maximisation of the modularity objective during the community extraction stage.

We then validate our results by cross-referencing with a list of known oscillating targets. To this end, for each identified community, we perform a set of gene enrichment tests with respect to Cell Cycle (CC) related genes (Gene Ontology term GO:0007049). Firstly, we test each community for standard gene enrichment using a hypergeometric test; this test checks whether the genes in every single community are significantly enriched in the list of CC related genes. However, this test only considers membership to a community and does not take into account the within community connections corresponding to gene co-oscillations. For a more stringent test, we also employ a network enrichment test [13]. For every community that contains CC related genes, we are performing a network enrichment test between two lists of genes, namely: list A - the list of CC related genes in the community, list B - the list of all the genes in the community. This test aims at looking at the connection between our target genes to see if their links are statistically significant, hence if there is an enrichment. The main idea behind this test is that, under the null hypothesis of no enrichment, the number of links between two gene sets A and B follows a hypergeometric distribution. This test considers the connectedness among CC related genes within a single community. Therefore, it may be more stringent than the hypergeometric test, given that CC genes, which are only connected through intermediaries in a CC enriched community, are not found to directly co-oscillate. In our results, we show the validation of communities in terms of the Wilcoxon test and the CC enrichment in terms of network enrichment test [13].

We also introduce an index, community density, expressing the fraction of the actual number of edges in a community with respect to the total number of edges in the network. We also provide a relative community density, defined as the fraction of edges in a community given the maximum number of edges feasible in a community of that size.

Pseudo-time estimation

We use Laplace Eigenmaps [14] to reduce the gene expression space to a two-dimensional latent representation. Laplace Eigenmaps, also known as spectral embedding, is a non-linear dimension reduction method using a spectral decomposition of the graph Laplacian to preserve local distances in the gene expression space. The method is fast to compute as it involves a partial eigenvalue decomposition of a sparse graph matrix, namely the graph Laplacian. Further, the minimisation problem has a unique solution and therefore, does not suffer from local minima. As [14] discussed, the method is robust to outliers and noise as only local distances are used to estimate the neighbourhood graph.

The method consists of

1
Graph Construction: create sparse adjacency matrix A using n nearest neighbours.
2
Graph Laplacian Construction: L=D−A, where D is the degree matrix, the number of edges attached to each vertex.
3
Partial Eigenvalue Decomposition. Find top-Q eigenvalues of sparse matrix where Q is the dimensionality of the latent space. In our case this is always Q=2.

The critical parameter in the algorithm is n, the number of nearest neighbours. Our pseudo-time estimation algorithm consists of a search on a grid of values specified by a minimum m and a maximum value M. For n∈{m,…,M}:

1
Compute Laplacian eigenmap on 2 dimensions.
2
pseudo-time: compute angle in radians [−π,π] by arc tangent. Convert to [0,1].

The latter step imposes a periodic constraint on the pseudo-time: pseudo-time is defined as the angle of a cell in a circle.

To select the best value for the number of neighbours n, we apply a probabilistic periodic model to assess each pseudo-time for all genes in the cluster. We employ a Bayesian Gaussian process latent variable model [15] with a one-dimensional periodic kernel. All hyperparameters are optimised except the pseudo-time. We use 25 inducing points to ensure efficient optimisation and calculation of the model likelihood. We then use the marginal model likelihood to select the best pseudo-time and the number of neighbours n.

Results

In order to validate our overall approach and compare to the original results in [2], we evaluate the ability of OscoNet to identify oscillating groups of genes by applying it both to synthetic data, a microarray time-series study of oscillating genes [16] and to the profiles of single undifferentiated human embryonic stem cells (hESCs) [17]. For the biological data, we evaluate the enrichment of the discovered clusters with respect to the Gene Ontology CC biological process related genes.

Synthetic study

We generate a synthetic dataset of 1000 genes and 100 cells following the setup in [2]. We simulate 1000 genes, of which 180 are samples from a sinusoidal function with additive Gaussian noise. The 180 oscillators are simulated in 3 frequency groups, each group containing 60 genes. The relative frequencies of the three groups are proportional to 2:3:6.

Genes in each group were further exposed to strong or weak signals. Half of the oscillatory genes were simulated as strong oscillators with noise variance σ². The other half were simulated as weak oscillators with $\sigma ^{2}_{wk} = (2 \sigma)^{2}$. The starting phase Φ varies in different genes within a frequency group. The remaining genes were simulated as independent Gaussian noise.

Hypothesis test

We calculate the FDR on all gene pairs, defined as:

$$ \text{FDR}=\frac{F}{S}. $$

(4)

where S is the estimated number of significant gene pairs and F the number of false positive pairs. This is a more stringent criterion than in [2], where the number of oscillating genes was used and not co-oscillating pairs.

The accuracy of the non-parametric test with different number of bootstrap samples and the original Oscope method is shown in Fig. 3a for noise level of σ²=0.05. When considering higher noise levels the results are similar (see Supplementary). Very small number of bootstrap samples result into not well-calibrated tests (B=100). However, when the number of samples is increased the bootstrap test better approximates the true FDR.

When decreasing the desired level of FDR, we observe an elbow effect. At a certain point of significance, the observed FDR is not further reduced as we decrease the desired FDR. This occurs at increasingly smaller thresholds as the number of bootstrap samples is increased. However, we suspect that the number of cells and the variability of gene expression also play a role; when increasing the number of cells for instance, statistical power is expected to increase as we get higher resolution on the tails of the null distribution. Another factor is the amount of noise in the data as is demonstrated in Fig. 3b where we show the effect of different noise levels on the threshold value of the elbow after which no improvement in FDR is observed. As the noise level decreases, the statistical test is able to achieve lower FDR.

A possible improvement on our test would be to introduce an explicit model for the tails of the null distribution. Such a semi-parametric bootstrap approach would allow for lower achievable FDR levels as long as the modelling assumptions are accurate.

The poor performance of standard Oscope can be understood by the nature of the threshold used. At the lowest noise level (0.05) the method correctly identifies all the co-oscillating gene pairs at the default 5% threshold achieving a perfect true positive rate (TPR=1.0) as does the OscoNet hypothesis test. However the former gets a false positive rate of 10% (FPR=0.10) while the latter achieves a lower rate of 0.1% (FPR=0.001). The resulting FDR for Oscope is 0.91 and for OscoNet is 0.09, reflecting the highest sensitivity of the latter. The large FPR achieved by Oscope demonstrates the deficiency of setting such a threshold. In the Supplementary (Section 1), we present the FDR, FPR and TPR achieved across all noise levels.

Note that a bootstrap test is implemented in the Oscope pipeline after the clustering step to flag clusters that should be removed prior to downstream analysis. As this test is applied after the threshold of the top-ranked pairs of genes, the choice of the threshold has a significant impact on the results. In addition, by applying the bootstrap test on the cluster level, an entire cluster is either kept or removed, whereas our approach maintains all significant gene pairs. We argue our approach is logically more coherent since the clustering should be applied on only significant gene pairs rather than testing for significance after the clustering is completed.

Comparing community extraction methods

In order to validate our approach, we want to find densely connected subgraphs, also called communities, in the constructed synthetic graphs. To this aim, we compare three well-known network clustering algorithms available in the R package igraph, namely: walktrap, fastgreedy and infomap. In walktrap, the communities are assigned via random walks [6]. The idea is that short random walks tend to stay in the same community. In fastgreedy, the communities are estimated via directly optimising a modularity score [7]. On the other end, infomap aims to find a community structure that minimises the expected description length of a random walker trajectory [8]. We refer to the original papers for a detailed description. In an ideal case, true simulated oscillating genes should be clustered in the same community. For each of the five noise levels, as defined in the “Synthetic study” section, and for each one of the three community detection algorithms, we compare the recovered network partition to the true underlying one computing the Adjusted Rand Index (ARI). This measure is a scalar comparing two partitions on the same network. The ARI index, first introduced in [18], has zero expected value in the case of random partition, and it is 1 in the case of perfect agreement between two partitions. As depicted in Fig. 4, the index decreases when the level of noise increases and the most stable results are achieved by the walktrap algorithm. We therefore select the walktrap as the community extraction method in the clustering step of our overall approach.

cDNA microarray time course

We applied OscoNet to a microarray time-series study of oscillating genes studied in [16].

Following Leng et al. [2], we use experiment 3 from the original study [16] where double thymidine block was used to synchronise HeLa cells which were profiled at 48 time points following synchronisation. We start with a set of 41508 probes containing 1134 CC oscillators [16]. This corresponds to a set of 9559 genes containing 874 oscillating genes [16].

We filter probes using variance based filtering with a cut-off of 0.9 [19]. This results in 4151 probes containing 360 CC oscillators. This filtering step is more stringent than the default test on mean and variance of expression used in the Oscope paper. This filtering step also ensures that the set of probes we consider does not include high error measurements and includes the most informative genes only. In both cases, the data is normalised to [-1, 1].

Finally, we identify a total of 3884 genes that have at least one significant co-oscillation where the total number co-oscillations is 185,548. This corresponds to a highly sparse network with a sparseness of 0.0123. Here sparseness is defined as the ratio of existing edges to the maximum number of possible edges in the network. All the 60 CC oscillators are present in the network.

In the following, we summarise the results when applying the standard Oscope approach or respectively our approach. The Oscope approach was applied to the same data using the same normalisation. We utilize the standard paired sine model and only gene pairs with a similarity score in the top 5%. This corresponds to the default setting recommended in Oscope. We apply K-medoids with the maximum number of clusters set to K=5. Finally, we remove clusters flagged by Oscope as having mostly linear effects.

The application of standard Oscope led to the identification of two significant clusters of transcripts: the first cluster includes 72 probes whereof 70 are CC-related and the second cluster includes 136 probes of which 13 are CC-related. We note that no clusters were removed by the linear filtering step of the Oscope pipeline.

The application of OscoNet to the same dataset led to the identification of 5 significant clusters, according to the Wilcoxon test (significance α=0.01) described in the previous section. Singleton communities with a single gene were excluded.

In Table 1 we show all the significant clusters according to the Wilcoxon test ranked by the relative density. Of these five, three have at least one CC probes. Of those three, only one is enriched according to the NEAT and hypergeometric tests with α=0.01.

Table 1 Miroarray data Clusters Significance

Full size table

The only CC-enriched community has 239 CC genes out of 265 and is the second-ranked community as per relative density. This community is well-separated from the network and is enriched in CC from both enrichment perspectives, namely traditional enrichment in terms of genes and the network interconnections between the CC genes present in the community.

Importantly, this community includes the 70 CC related probes identified as co-oscillating by Oscope. Therefore this community gives a more detailed and richer description of the CC processes than the one uncovered by Oscope.

pseudo-time

We evaluate the performance of the ENI method of Leng et al. [2] and spectral methods on estimating pseudo-time. We run the ENI method using 30000 iterations, with a sliding polynomial of degree 3 and 4 starting points for the polynomial fitting. The pseudo-time order is not changed after approximately 6500 iterations in our experiments even when setting the maximum number of iterations to 100k. The ENI algorithm took on average 5-7 h in our experiments while the spectral method takes less than 1 min including the time to search over the parameter space. This makes the method very easy to use in practice in addition to the superior performance which we now discuss.

We assess the performance of the method when trained on the 72 gene cluster defined by Oscope and the 265 gene cluster defined by OscoNet. We also evaluate the performance of the pseudo-time on both of these clusters.

The dimensionality reduction and circle estimated by the spectral method for the two gene clusters is shown in Fig. 5. In both cases, the cells can be well-described by a circular path in the 2-D latent space projection of the data.

Our main validation measure is the Spearman correlation of the estimated peak times in the reconstructed gene expression to the ground truth (Table 2). The ground truth has been constructed by folding the time series in [16] along every period. ENI provides consistently poor results despite the large iteration count (30k iterations). The spectral method is simple and provides good results. It can also be easily extended in an intuitive manner to cover more complex cases - see the Conclusions section. Comparing the pseudo-time achieved by the different clusters using the spectral method, we note that using the larger gene cluster identified by OscoNet (265 genes) results in a more accurate pseudo-time. This is reflected in the higher peak time rank correlation; the larger cluster estimated by OscoNet is more informative on the true time compared to the smaller cluster estimated by Oscope. In the Supplementary (Section 2), the estimated and actual peak times for each gene are shown.

Table 2 Microarray data: spearman rank correlation of peak times

Full size table

We also report on the roughness measure of the true time series and the reconstructed version. This encapsulates how smooth a time series is by comparing neighbouring values with smaller values corresponding to a smoother time series; see Appendix for details. In Table 3 we report the difference of the median roughness of the estimated pseudo-time and true time. Overall the ENI results in an increase in the roughness measure while the Spectral method results in lower values reflecting smoother time courses. The spectral method achieves similar performance under both the smaller Oscope cluster (72 genes) and the larger OscoNet cluster (265). Therefore the differences reported above in peak time accuracy are not caused by smoother time courses.

Table 3 Microarray data: change in median roughness

Full size table

In Fig. 6 we show the gene expression for three genes using the experimental time. In Figures 7 we show the reconstructed base cycle and estimated gene expression. The pseudo-time methods have been estimated using the larger OscoNet cluster. As expected, the ENI profiles are poor due to the poor pseudo-time estimate. The spectral pseudo-time is anticorrelated and therefore provides a mirror image of the true time series profile.

Single-cell embryonic stem cells

To evaluate OscoNet on scRNA-seq data, we analysed profiles of single undifferentiated hESCs as described in Thomson et al. [17]. In particular, we analysed three replicate scRNA-seq experiments on H1 hESCs, with a sample size n=213.

We applied our method to n=213 samples with G=2375 genes following the same steps as [2] in terms of quality control. In particular, genes have been filtered by mean level and variance as in [2]. A total of 1914 genes have at least one significant co-oscillation with 5057 co-oscillatory pairs found. The sparseness value for the network is 0.001, an order of magnitude smaller than the previous study suggesting a sparser network.

A total of 677 genes were used to test for CC enrichment. After quality control, 153 CC genes were left of which 134 are subsequently present in the co-oscillatory network.

The original algorithm in Oscope identified 29 genes as co-oscillating, 21 of which are annotated with the cell cycle process GO-term. The Oscope algorithm was run with default values as previously reported [2]. In total the k-medoids Oscope step finds 5 clusters, 2 of which are eliminated by the subsequent linear filtering step. The CC enrichment of each cluster is given in Table 4.

Table 4 scRNA-seq: CC enrichment for each Oscope cluster

Full size table

After applying OscoNet, we assessed the significance of each community using a Wilcoxon test as described previously. The results are shown in Table 5. Hence we identify six significant communities using the Wilcoxon test (α=0.01). As in the previous study, we eliminate singleton communities that contain only one gene.

Table 5 scRNA-seq Data Clusters Significance

Full size table

Only one community is enriched in nodes and edges according to the hypergeometric and NEAT tests (α=0.01). This community is enriched in CC with 32 genes out of 77 belonging to the term GO : 0007049. This community contains the standard Oscope cluster of 29 genes reported in [2]. As in the previous study, we find that the OscoNet approach is more sensitive than Oscope identifying a cluster with a superset of CC genes.

The two highest ranked communities in terms of relative density have a small size of only 4 and 6 genes respectively. These are ranked highly because of their exceptionally small size and are not CC enriched. A threshold on the community size would exclude such small communities.

pseudo-time

We compared pseudo-time using the 77 and 29 gene clusters estimated by OscoNet and Oscope respectively. ENI takes an average of 13 h of computing time to estimate the pseudo-time for both clusters while the spectral method requires approximately 30 min. The difference in computing time is that ENI is leveraged the 2-opt algorithm to reduce the likelihood of local minima. In contrast, the spectral method has a unique solution for a given number of neighbours n and is quick to evaluate leveraging efficient sparse eigenvalue solvers. The grid search we performed over different values of n ensures the robustness of the results.

In Fig. 8 we show the fit for the spectral method for both gene sets. There is more noise in the data than in the previous case study, and the pattern is less circular. Despite this, the algorithm is able to recover the cell cycle order as we now show.

We compared the performance of the algorithm estimating pseudo-time using all 460 cells and assessing the separation of CC stages on the labelled subset of cells, which consists of 247 cells. After pseudo-time estimation, we applied the k-means algorithm with K=3 to identify the three clusters associated with G1, G2 and S phase. We computed the ARI to contrast the clustering results with the CC labels; the ARI measure is in the range [0,1] with higher values reflecting better agreement between the estimated and true cell cycle stages.

In Table 6 we show the ARI results in comparison to the spectral and ENI method. The performance of the latter is poor, which can be validated when looking at the gene expression output (Fig. 9). The spectral method provides better accuracy for both gene clusters with an ARI of 0.43 when using the Oscope 29 gene cluster and 0.49 when using the OscoNet 77 gene cluster. The higher ARI for the latter reflects the richer information content of the OscoNet cluster compared to the Oscope cluster; the OscoNet spectral ordering contains fewer cells assigned to an incorrect CC stage than the Oscope spectral ordering as is evident in Fig. 9 for four example genes.

Table 6 scRNA-seq: ARI on K=3 K-mean clustering vs true cell-cycle stage

Full size table

The poor performance of the ENI algorithm is confirmed when looking at the roughness value (Table 7). The ENI algorithm has higher roughness values for all combinations examined while the spectral method achieves similar roughness values with both clusters.

Table 7 scRNA-seq data: change in median roughness

Full size table

Conclusions

We have presented a well-calibrated hypothesis test that avoids a need for a threshold cut-off on the pairwise distance matrix. Further, our implementation is highly scalable due to the use of parallel vectorisation which has been demonstrated to be orders of magnitude faster than the one used in Oscope.

We have then performed network analysis to identify statistically significant communities of genes again avoiding any need to set the maximum number of communities as is needed in the K-medoids approach of [2]. We have also been able to select the significant communities based on a standard statistical test and assess the significance of enrichment in terms of both genes and their co-oscillatory relationship. We have validated our approach on synthetic data demonstrating that it is well-calibrated with respect to the false discovery rate, unlike [2].

We contrasted our approach to that of Leng et al. [2] on both microarray and scRNA-seq data with known oscillators. In both cases, we are able to recover the original Oscope cluster of genes in addition to discovering more oscillators.

We estimated a unique pseudo-time per gene community using a spectral approach. The spectral approach requires an order of magnitude less computation than the ENI approach proposed in Leng et al. [2] and is more accurate as reflected in both peak time correlation in the microarray data and cell cycle stage separation in the single-cell data.

One issue with our current approach is that for a low number of cells, the small number of possible permutations in the similarity measure reduces the power of the bootstrap hypothesis test resulting in a high minimum achievable FDR. A semi-parametric bootstrap approach should help to increase the power of the test, particularly in case of small numbers of cells and low FDRs, which is typically required.

Finally, extending this approach as a probabilistic model including the effect of dropout which is prominent in single-cell datasets, may further increase the accuracy of the method and allow for quantification of uncertainty in the resulting co-oscillating gene clusters.

The pseudo-time model could be extended to use a probabilistic model to estimate pseudo-time while imposing a periodicity constraint jointly. This will also allow for selecting the number of latent dimensions by maximising the model evidence. A probabilistic formulation of Laplace Eigenmaps latent variable model has already been shown [20] where the Laplacian Eigenmaps latent variable model is described. In future work, one could extend this class of model, in the probabilistic or deterministic formulation, to include a periodic constraint.

The method we have presented can be incorporated in a single-cell analysis pipeline to allow biologists to uncover novel periodic dynamic gene regulatory mechanisms. Our approach avoids less interpretable cut-offs and relies on well-calibrated statistical tests. As the availability and size of single-cell whole transcriptome data increases, the non-parametric and automatic nature of our approach will allow for a corresponding increase in resolution of the analysis resulting in a more detailed description of oscillatory gene expression systems.

Appendix

Quality control

In any single-cell pipeline, it is necessary to perform some rudimentary quality control to ensure downstream analysis is not affected by low-quality measurements. In the Oscope pipeline [2], a threshold on the mean level and a log-linear cut-off on the variance is applied.

We prefer using a simpler approach based on variance based filtering with a high variance cut-off of 0.9 [19]. This ensures a simple and consistent quality control mechanism that has been shown to remove low quality genes [19].

Roughness measure

Roughness for a particular gene was defined in [21] as

$$\frac{1}{\sigma_{g}} \sqrt{ \frac{1}{C-1} \sum_{c=1}^{C-1} \left(x_{g, z_{c}} - x_{g, z_{c+1}} \right)^{2} }$$

where σ_g the standard deviation of gene expression and $x_{g, z_{c}}$ is the gene expression for gene g at pseudo-time order z_c.

This metric measures the smoothness of the gene expression profile by looking at the differences of consecutive measurements. Smaller values indicate a smoother response.

Availability of data and materials

All data used in the paper are freely available; see [2] for details on download instructions. The OscoNet code is publicly available at https://github.com/alexisboukouvalas/OscoNet.

Notes

Intuitively, any monotonic transformation of time will not alter the value of the objective function.
These are available in the R package igraph.

Abbreviations

FDR:: False discovery rate
ENI:: Extended nearest insertion
CC:: Cell cycle
hESCs:: Human embryonic stem cells
ARI:: Adjusted rand index

References

Levine JH, Lin Y, Elowitz MB. Functional roles of pulsing in genetic circuits. Science. 2013; 342(6163):1193–200.
Article CAS Google Scholar
Leng N, Chu L-F, Barry C, Li Y, Choi J, Li X, Jiang P, Stewart RM, Thomson JA, Kendziorski C. Oscope identifies oscillatory genes in unsynchronized single-cell rna-seq experiments. Nat Methods. 2015; 12(10):947–50.
Article CAS Google Scholar
Bacher R, Kendziorski C. Design and computational analysis of single-cell rna-sequencing experiments. Genome Biol. 2016; 17(1):1.
Article Google Scholar
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003; 100(16):9440–5.
Article CAS Google Scholar
Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys Rev. 2004; 69(11):026113.
CAS Google Scholar
Pons P, Latapy M. Computing Communities in Large Networks Using Random Walks. Berlin: Springer Berlin Heidelberg; 2005, pp. 284–93.
Book Google Scholar
Clauset A, Newman MEJ, Moore C. Finding community structure in very large networks. Phys Rev E. 2004; 066111:70. https://doi.org/10.1103/physreve.70.066111.
Google Scholar
Rosvall M, Bergstrom CT. Maps of information flow reveal community structure in complex networks. PNAS. 2008:105–1118. https://doi.org/10.1073/pnas.0706851105.
Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Natl Acad Sci USA. 2002; 99(12):7821–6.
Article CAS Google Scholar
Fortunato S. Community detection in graphs. Phys Rep. 2010; 066111:75–174.
Article Google Scholar
Lawson CE, et al.Metabolic network analysis reveals microbial community interactions in anammox granules. Nat Commun. 2017. https://doi.org/10.1038/ncomms15416.
Hermjakob H, et al.The hupo psi’s molecular interaction format–a community standard for the representation of protein interaction data. Nat Biotechnol. 2004:177–83. https://doi.org/10.1038/nbt926.
Signorelli M, Vinciotti V, Wit E. Neat: an efficient network enrichment analysis test. BMC Bioinformatics. 2016; 12(352). https://doi.org/10.1186/s12859-016-1203-6.
Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003; 15(6):1373–96.
Article Google Scholar
Titsias MK, Lawrence ND. Bayesian gaussian process latent variable model. In: International Conference on Artificial Intelligence and Statistics. PMLR, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics: 2010. p. 844–51. http://proceedings.mlr.press/v9/titsias10a/titsias10a.pdf.
Michael L, Whitfield GS, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, et al.Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002; 13(6):1977–2000.
Article Google Scholar
Thomson JA, Itskovitz-Eldor J, Shapiro SS, Waknitz MA, Swiergiel JJ, Marshall VS, Jones JM. Embryonic stem cell lines derived from human blastocysts.Science. 1998; 282(5391):1145–7.
Article CAS Google Scholar
Hubert L, Arabie P. Comparing partitions. Journal of the Classification, pages. 1985; 2:193–218.
Article Google Scholar
Gentleman R, Carey V, Huber W, Hahne F. genefilter: genefilter: methods for filtering genes from high-throughput experiments; 2020. R package version 1.70.0. https://bioconductor.org/packages/release/bioc/html/genefilter.html.
Zhengdong L, Sminchisescu C, Carreira-Perpiñán MÁ. People tracking with the laplacian eigenmaps latent variable model In: Platt JC, Koller D, Singer Y, Roweis ST, editors. Advances in neural information processing systems. Curran Associates, Inc.: 2008. p. 1705–12. http://papers.nips.cc/paper/3226-people-tracking-with-the-laplacian-eigenmapslatent-variable-model.pdf.
Reid JE, Wernisch L. Pseudotime estimation: deconfounding single cell time series. Bioinformatics. 2016; 32(19):2973–80.
Article CAS Google Scholar

Download references

Acknowledgements

We thank the Oscope authors for sharing their processed microarray Whitfield data.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 21 Supplement 10, 2020: Proceedings from the 13th Bioinformatics and Computational Biology International Conference - BBCC2018. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-21-supplement-10.

Funding

MR and AB were supported by MRC award MR/M008908/1. LC was supported by H2020-EU EU Marie Curie fellowship CONTESSA (ID: 660388). NP is funded through a Wellcome Trust Senior Research Fellowship (106185/Z/14/Z). EM is funded through a Sir Henry Wellcome Fellowship (201380/Z/16/Z). Publication costs are covered by the University of Manchester Wellcome Trust funds.

Author information

Luisa Cutillo and Alexis Boukouvalas are co-first authors.

Authors and Affiliations

Faculty of Biology, Medicine and Health, University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK
Alexis Boukouvalas, Elli Marinopoulou, Nancy Papalopulu & Magnus Rattray
University of Leeds, Leeds, UK
Luisa Cutillo

Authors

Luisa Cutillo
View author publications
You can also search for this author in PubMed Google Scholar
Alexis Boukouvalas
View author publications
You can also search for this author in PubMed Google Scholar
Elli Marinopoulou
View author publications
You can also search for this author in PubMed Google Scholar
Nancy Papalopulu
View author publications
You can also search for this author in PubMed Google Scholar
Magnus Rattray
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

LC and AB did all the computational work and equally contributed as first authors. MR, NP and EM conceived the study. All authors participated in writing the paper. All authors have read the paper and approve it.

Corresponding author

Correspondence to Luisa Cutillo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Luisa Cutillo and Alexis Boukouvalas contributed equally to this work.

Supplementary information

Additional file 1

A supplementary report is available (supplementary.pdf).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Cutillo, L., Boukouvalas, A., Marinopoulou, E. et al. OscoNet: inferring oscillatory gene networks. BMC Bioinformatics 21 (Suppl 10), 351 (2020). https://doi.org/10.1186/s12859-020-03561-y

Download citation

Published: 25 August 2020
DOI: https://doi.org/10.1186/s12859-020-03561-y

OscoNet: inferring oscillatory gene networks

Abstract

Background

Results

Conclusions

Background

Review of the oscope method

Methods

Non-parametric test on co-oscillation

Network analysis

Pseudo-time estimation

Results

Synthetic study

Hypothesis test

Comparing community extraction methods

cDNA microarray time course

pseudo-time

Single-cell embryonic stem cells

pseudo-time

Conclusions

Appendix

Quality control

Roughness measure

Availability of data and materials

Notes

Abbreviations

References

Acknowledgements

About this supplement

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Supplementary information

Additional file 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us