 Research
 Open Access
 Published:
OscoNet: inferring oscillatory gene networks
BMC Bioinformatics volume 21, Article number: 351 (2020)
Abstract
Background
Oscillatory genes, with periodic expression at the mRNA and/or protein level, have been shown to play a pivotal role in many biological contexts. However, with the exception of the circadian clock and cell cycle, only a few such genes are known. Detecting oscillatory genes from snapshot singlecell experiments is a challenging task due to the lack of time information. Oscope is a recently proposed method to identify cooscillatory gene pairs using singlecell RNAseq data. Although promising, the current implementation of Oscope does not provide a principled statistical criterion for selecting oscillatory genes.
Results
We improve the optimisation scheme underlying Oscope and provide a wellcalibrated nonparametric hypothesis test to select oscillatory genes at a given FDR threshold. We evaluate performance on synthetic data and three real datasets and show that our approach is more sensitive than the original Oscope formulation, discovering larger sets of known oscillators while avoiding the need for less interpretable thresholds. We also describe how our proposed pseudotime estimation method is more accurate in recovering the true cell order for each gene cluster while requiring substantially less computation time than the extended nearest insertion approach.
Conclusions
OscoNet is a robust and versatile approach to detect oscillatory gene networks from snapshot singlecell data addressing many of the limitations of the original Oscope method.
Background
Oscillating genes are expressed in a periodic manner leading to alternating appearance and disappearance of the corresponding mRNA and protein. Oscillatory genes have been shown to play a pivotal role in many developmental processes, by enabling individual systems to implement diverse functions [1]. To identify oscillatory genes a combination of timelapse microscopy techniques and fluorescent reporter genes is required, which prohibits the assaying of a large number of potential oscillators. In turn, this limits the ability of the experimenter to uncover novel oscillators.
Recent developments in singlecell transcriptomics provide the potential to capture genomewide transcriptional events. However, singlecell RNAseq (scRNAseq) experiments, even when done as time series, pose challenges associated in identifying oscillating genes because cells may be asynchronous in vivo or may lose synchronisation in the time required for sample preparation and sequencing.
To overcome these problems, Leng et al. [2] have proposed a computational approach to identify oscillating genes in static, unsynchronised scRNAseq experiments. They have constructed a parameterised distance between pairs of putative oscillating genes which is minimised for cooscillating pairs that have the same temporal profile but different phase. Their approach allows gene pairs to be ranked by how close they are according to this cooscillation distance. However, they do not provide a statistical test to decide which gene pairs are oscillatory.
In this contribution, we provide a wellcalibrated hypothesis test which allows us to remove the less interpretable cutoff in the first step of the Oscope pipeline [2]. We do this by selecting gene pairs according to a false discovery rate (FDR) cutoff. Our nonparametric bootstrap approach requires us to optimise the genepair distance many times, and therefore we have implemented an efficient optimisation algorithm to minimise the genepair distance function. We then construct an undirected network of cooscillating genes based on all gene pairs which pass the FDR threshold. Community extraction and network analysis methods can then be used to identify statistically significant cooscillating groups of genes. We also discuss three distinct statistical tests that can be employed to assess the significance of different aspects of the inferred network communities, namely community distinctiveness and enrichment, in terms of both genes (graph vertices) and their cooscillations (graph edges). We conclude with a description of a simple method to infer pseudotime for each gene community, inferring a cyclical ordering on a lowdimensional projection.
Review of the oscope method
The Oscope method [2] uses a pairedsine model and Kmedoids clustering to identify groups of oscillatory genes. For each oscillatory group, an extended nearest insertion algorithm (ENI) is used to construct the cyclic order of cells, defined as the order that specifies each cell’s position within one cycle of the oscillation of that group [3]. The method requires normalised gene expression and the use of only high mean, high variance genes is recommended [3].
The method relies on the computation of a pairwise gene expression similarity. Let two genes be X= sin(ωt+ϕ) and Y= sin(ωt+ϕ+Ψ_{XY}), that is they have identical profiles except for a phase shift Ψ_{XY}. For N cells the cooscillation distance is d(X,YΨ_{XY}):
The derivation is provided in the Supplementary Material (Section 3). The assumed sinusoidal form is not a major restriction, given that the time t is not required in the cooscillation distance^{Footnote 1}. This is the crucial step in the algorithm as it allows the genepair similarity to be estimated without knowledge of the pseudotime.
A limitation of this approach is that the cell population is assumed to be homogeneous. Including cell populations that are not relevant to the process under investigation will result in reduced accuracy as the cooscillation distance will be averaged across both informative and noninformative cells. We therefore recommend to remove irrelevant cell populations prior to the application of the Oscope and OscoNet methods. This can be accomplished by visualisation and clustering methods such as tSNE and kmeans to identify cell populations and using gene markers or other information to characterise each cell cluster.
In the original Oscope approach, only genes that appear in the top T% of the ranked pairwise similarity list are selected for the subsequent Kmedoids clustering step. The choice of T can have a significant impact on downstream analysis. However, the optimal choice of T is not obvious and will likely differ depending on the characteristics of the data, e.g. number of cooscillatory pairs, technical/biological noise levels, number of cells in the dataset etc. In our work, we instead propose to use a wellcalibrated nonparametric test to select gene pairs with a given FDR. Genes that contribute to these gene pairs are then selected for further downstream analysis.
A further limitation of the Oscope approach is that linearly correlated genes cannot be disambiguated from gene pairs with 0 or π phase shift. Therefore a heuristic is used to remove clusters that have a high proportion of similarities with 0 or π phase shift.
Finally, the ENI algorithm is used to estimate pseudotime per gene cluster. The use of the ENI algorithm is not critical to the algorithm and can be replaced by a computationally more efficient probabilistic approach as we describe below.
Methods
Our overall approach is summarised in Fig. 1. Following a quality control step (see Appendix), we propose three main phases, namely: building a network of gene cooscillations using a nonparametric hypothesis test, identify the statistically significant communities in the resulting network and infer pseudotime for each gene community.
Nonparametric test on cooscillation
The key insight in the Oscope algorithm is that a genepair oscillation score can be defined without requiring a prior estimation of pseudotime. Rather than specifying a threshold on the number of genes to keep, here we develop a nonparametric hypothesis test to assess the significance of two genes cooscillating.
We employ a bootstrap approach which randomly permutes the order of cells for one of the genes in the genepair distance estimation in order to perform a statistical test. The cell index is permuted for one gene in a pair and we calculate the resulting pairwise gene similarity \(d\left (X,\mathcal {P}(Y)\Psi _{XY}\right)\) (recall Eq. 1) where \(\mathcal {P}(Y)\) is the permuted cell order for gene Y. For each randomisation, we estimate the phase shift by minimising the pairwise distance,
The pvalue estimate for the gene pair XY is the fraction of times d_{perm}<d. Specifically for B randomisations the bootstrap pvalue is
where \(d_{perm}^{i}\) the i^{th} randomisation, d=d(X,YΨ_{XY}) the unpermuted pairwise distance and I(.) the indicator function, which is 1 when the inequality is true and 0 otherwise.
In our approach, we need to correct for the effect of multiple hypothesis tests since we are testing many gene pairs. We have selected to use the qvalue approach of [4] which controls the FDR. The simple Bonferroni correction is too conservative in the case of very large numbers of tests and is also based on an unrealistic assumption that tests are independent. The qvalue is an adjustment of the traditional pvalue where the FDR is minimised based on the distribution of the pvalues across all the tests performed. The qvalue is an estimate of the FDR at a given pvalue threshold obtained from modelling background (null) and signal contributions to the pvalue distribution. This approach does not assume independent tests and has been shown to perform well in applications with very large numbers of tests. The qvalue provides an estimate of the number of false edges in our cooscillation network.
As the bootstrap approach is computationally demanding, we have expressed the distance measure using matrix operations and we provide an efficient implementation. We have implemented the bootstrap test in Python using NumPy vector operations and we have also implemented a faster version in Cython, which allows C code to be called from Python. A performance comparison of the different implementations is given in Fig. 2 wherein we compare our implementation of the pairwise distance calculation (NumPy, Cython) to the standard Oscope [2] R implementation. As all three implementations are able to leverage multicore parallelism, our test is performed across 12 CPU cores. We see that both Numpy and Cython are substantially faster than the iterative R version which uses the Oscope bioconductor package. The Numpy and Cython implementations achieve an average speedup of 6.5 and 93 respectively. The significant performance improvement allows for the practical use of the bootstrap algorithm. The computational penalty for a fully nonparametric significance test is therefore minimised.
Network analysis
We use the genepair test to construct a network of genes in which connections denote statistically significant cooscillation. In the network setting, a variety of algorithms have been developed to extract communities within the entire network. The problem of community detection, or graph partitioning, in a network has been widely studied in a variety of research fields. In the present work, we consider community extraction algorithms based on maximising modularity, a measure of community distinctiveness. The modularity often referred to as Q, was defined by [5] as the fraction of the edges that fall within a given group of nodes (community partition) minus the expected fraction if edges were distributed at random. The concept of modularity is based on the idea that a random graph is not expected to have a cluster structure. In the “Synthetic study” section, we compare three popular community extraction algorithms^{Footnote 2}, namely: walktrap [6], fastgreedy [7] and infomap [8], and find the walktrap algorithm to be the most robust.
To further interpret the estimated network, which is still highly complex, we tried to identify the underlying community structures [9]. The identified communities will allow us to create a large scale map of our network [10]. We aim to use individual communities to shed light on the function of the cooscillation represented by the network itself. Indeed, in biological networks, communities often correspond to functional units of a system. For example in metabolic networks [11], such functional groups correspond to cycles or pathways, whereas in the proteinprotein interaction networks [12], communities correspond to proteins with similar functionality inside a cell.
To assess the significance of each community, we employ three statistical tests that examine different aspects of the community structure. Our first step is to apply a Wilcoxon test for each community, to compare the distribution of the intracommunity edges versus the distribution of the intercommunity edges. This tests allows assessing how welldefined a community is and complements the maximisation of the modularity objective during the community extraction stage.
We then validate our results by crossreferencing with a list of known oscillating targets. To this end, for each identified community, we perform a set of gene enrichment tests with respect to Cell Cycle (CC) related genes (Gene Ontology term GO:0007049). Firstly, we test each community for standard gene enrichment using a hypergeometric test; this test checks whether the genes in every single community are significantly enriched in the list of CC related genes. However, this test only considers membership to a community and does not take into account the within community connections corresponding to gene cooscillations. For a more stringent test, we also employ a network enrichment test [13]. For every community that contains CC related genes, we are performing a network enrichment test between two lists of genes, namely: list A  the list of CC related genes in the community, list B  the list of all the genes in the community. This test aims at looking at the connection between our target genes to see if their links are statistically significant, hence if there is an enrichment. The main idea behind this test is that, under the null hypothesis of no enrichment, the number of links between two gene sets A and B follows a hypergeometric distribution. This test considers the connectedness among CC related genes within a single community. Therefore, it may be more stringent than the hypergeometric test, given that CC genes, which are only connected through intermediaries in a CC enriched community, are not found to directly cooscillate. In our results, we show the validation of communities in terms of the Wilcoxon test and the CC enrichment in terms of network enrichment test [13].
We also introduce an index, community density, expressing the fraction of the actual number of edges in a community with respect to the total number of edges in the network. We also provide a relative community density, defined as the fraction of edges in a community given the maximum number of edges feasible in a community of that size.
Pseudotime estimation
We use Laplace Eigenmaps [14] to reduce the gene expression space to a twodimensional latent representation. Laplace Eigenmaps, also known as spectral embedding, is a nonlinear dimension reduction method using a spectral decomposition of the graph Laplacian to preserve local distances in the gene expression space. The method is fast to compute as it involves a partial eigenvalue decomposition of a sparse graph matrix, namely the graph Laplacian. Further, the minimisation problem has a unique solution and therefore, does not suffer from local minima. As [14] discussed, the method is robust to outliers and noise as only local distances are used to estimate the neighbourhood graph.
The method consists of

1
Graph Construction: create sparse adjacency matrix A using n nearest neighbours.

2
Graph Laplacian Construction: L=D−A, where D is the degree matrix, the number of edges attached to each vertex.

3
Partial Eigenvalue Decomposition. Find topQ eigenvalues of sparse matrix where Q is the dimensionality of the latent space. In our case this is always Q=2.
The critical parameter in the algorithm is n, the number of nearest neighbours. Our pseudotime estimation algorithm consists of a search on a grid of values specified by a minimum m and a maximum value M. For n∈{m,…,M}:

1
Compute Laplacian eigenmap on 2 dimensions.

2
pseudotime: compute angle in radians [−π,π] by arc tangent. Convert to [0,1].
The latter step imposes a periodic constraint on the pseudotime: pseudotime is defined as the angle of a cell in a circle.
To select the best value for the number of neighbours n, we apply a probabilistic periodic model to assess each pseudotime for all genes in the cluster. We employ a Bayesian Gaussian process latent variable model [15] with a onedimensional periodic kernel. All hyperparameters are optimised except the pseudotime. We use 25 inducing points to ensure efficient optimisation and calculation of the model likelihood. We then use the marginal model likelihood to select the best pseudotime and the number of neighbours n.
Results
In order to validate our overall approach and compare to the original results in [2], we evaluate the ability of OscoNet to identify oscillating groups of genes by applying it both to synthetic data, a microarray timeseries study of oscillating genes [16] and to the profiles of single undifferentiated human embryonic stem cells (hESCs) [17]. For the biological data, we evaluate the enrichment of the discovered clusters with respect to the Gene Ontology CC biological process related genes.
Synthetic study
We generate a synthetic dataset of 1000 genes and 100 cells following the setup in [2]. We simulate 1000 genes, of which 180 are samples from a sinusoidal function with additive Gaussian noise. The 180 oscillators are simulated in 3 frequency groups, each group containing 60 genes. The relative frequencies of the three groups are proportional to 2:3:6.
Genes in each group were further exposed to strong or weak signals. Half of the oscillatory genes were simulated as strong oscillators with noise variance σ^{2}. The other half were simulated as weak oscillators with \(\sigma ^{2}_{wk} = (2 \sigma)^{2}\). The starting phase Φ varies in different genes within a frequency group. The remaining genes were simulated as independent Gaussian noise.
Hypothesis test
We calculate the FDR on all gene pairs, defined as:
where S is the estimated number of significant gene pairs and F the number of false positive pairs. This is a more stringent criterion than in [2], where the number of oscillating genes was used and not cooscillating pairs.
The accuracy of the nonparametric test with different number of bootstrap samples and the original Oscope method is shown in Fig. 3a for noise level of σ^{2}=0.05. When considering higher noise levels the results are similar (see Supplementary). Very small number of bootstrap samples result into not wellcalibrated tests (B=100). However, when the number of samples is increased the bootstrap test better approximates the true FDR.
When decreasing the desired level of FDR, we observe an elbow effect. At a certain point of significance, the observed FDR is not further reduced as we decrease the desired FDR. This occurs at increasingly smaller thresholds as the number of bootstrap samples is increased. However, we suspect that the number of cells and the variability of gene expression also play a role; when increasing the number of cells for instance, statistical power is expected to increase as we get higher resolution on the tails of the null distribution. Another factor is the amount of noise in the data as is demonstrated in Fig. 3b where we show the effect of different noise levels on the threshold value of the elbow after which no improvement in FDR is observed. As the noise level decreases, the statistical test is able to achieve lower FDR.
A possible improvement on our test would be to introduce an explicit model for the tails of the null distribution. Such a semiparametric bootstrap approach would allow for lower achievable FDR levels as long as the modelling assumptions are accurate.
The poor performance of standard Oscope can be understood by the nature of the threshold used. At the lowest noise level (0.05) the method correctly identifies all the cooscillating gene pairs at the default 5% threshold achieving a perfect true positive rate (TPR=1.0) as does the OscoNet hypothesis test. However the former gets a false positive rate of 10% (FPR=0.10) while the latter achieves a lower rate of 0.1% (FPR=0.001). The resulting FDR for Oscope is 0.91 and for OscoNet is 0.09, reflecting the highest sensitivity of the latter. The large FPR achieved by Oscope demonstrates the deficiency of setting such a threshold. In the Supplementary (Section 1), we present the FDR, FPR and TPR achieved across all noise levels.
Note that a bootstrap test is implemented in the Oscope pipeline after the clustering step to flag clusters that should be removed prior to downstream analysis. As this test is applied after the threshold of the topranked pairs of genes, the choice of the threshold has a significant impact on the results. In addition, by applying the bootstrap test on the cluster level, an entire cluster is either kept or removed, whereas our approach maintains all significant gene pairs. We argue our approach is logically more coherent since the clustering should be applied on only significant gene pairs rather than testing for significance after the clustering is completed.
Comparing community extraction methods
In order to validate our approach, we want to find densely connected subgraphs, also called communities, in the constructed synthetic graphs. To this aim, we compare three wellknown network clustering algorithms available in the R package igraph, namely: walktrap, fastgreedy and infomap. In walktrap, the communities are assigned via random walks [6]. The idea is that short random walks tend to stay in the same community. In fastgreedy, the communities are estimated via directly optimising a modularity score [7]. On the other end, infomap aims to find a community structure that minimises the expected description length of a random walker trajectory [8]. We refer to the original papers for a detailed description. In an ideal case, true simulated oscillating genes should be clustered in the same community. For each of the five noise levels, as defined in the “Synthetic study” section, and for each one of the three community detection algorithms, we compare the recovered network partition to the true underlying one computing the Adjusted Rand Index (ARI). This measure is a scalar comparing two partitions on the same network. The ARI index, first introduced in [18], has zero expected value in the case of random partition, and it is 1 in the case of perfect agreement between two partitions. As depicted in Fig. 4, the index decreases when the level of noise increases and the most stable results are achieved by the walktrap algorithm. We therefore select the walktrap as the community extraction method in the clustering step of our overall approach.
cDNA microarray time course
We applied OscoNet to a microarray timeseries study of oscillating genes studied in [16].
Following Leng et al. [2], we use experiment 3 from the original study [16] where double thymidine block was used to synchronise HeLa cells which were profiled at 48 time points following synchronisation. We start with a set of 41508 probes containing 1134 CC oscillators [16]. This corresponds to a set of 9559 genes containing 874 oscillating genes [16].
We filter probes using variance based filtering with a cutoff of 0.9 [19]. This results in 4151 probes containing 360 CC oscillators. This filtering step is more stringent than the default test on mean and variance of expression used in the Oscope paper. This filtering step also ensures that the set of probes we consider does not include high error measurements and includes the most informative genes only. In both cases, the data is normalised to [1, 1].
Finally, we identify a total of 3884 genes that have at least one significant cooscillation where the total number cooscillations is 185,548. This corresponds to a highly sparse network with a sparseness of 0.0123. Here sparseness is defined as the ratio of existing edges to the maximum number of possible edges in the network. All the 60 CC oscillators are present in the network.
In the following, we summarise the results when applying the standard Oscope approach or respectively our approach. The Oscope approach was applied to the same data using the same normalisation. We utilize the standard paired sine model and only gene pairs with a similarity score in the top 5%. This corresponds to the default setting recommended in Oscope. We apply Kmedoids with the maximum number of clusters set to K=5. Finally, we remove clusters flagged by Oscope as having mostly linear effects.
The application of standard Oscope led to the identification of two significant clusters of transcripts: the first cluster includes 72 probes whereof 70 are CCrelated and the second cluster includes 136 probes of which 13 are CCrelated. We note that no clusters were removed by the linear filtering step of the Oscope pipeline.
The application of OscoNet to the same dataset led to the identification of 5 significant clusters, according to the Wilcoxon test (significance α=0.01) described in the previous section. Singleton communities with a single gene were excluded.
In Table 1 we show all the significant clusters according to the Wilcoxon test ranked by the relative density. Of these five, three have at least one CC probes. Of those three, only one is enriched according to the NEAT and hypergeometric tests with α=0.01.
The only CCenriched community has 239 CC genes out of 265 and is the secondranked community as per relative density. This community is wellseparated from the network and is enriched in CC from both enrichment perspectives, namely traditional enrichment in terms of genes and the network interconnections between the CC genes present in the community.
Importantly, this community includes the 70 CC related probes identified as cooscillating by Oscope. Therefore this community gives a more detailed and richer description of the CC processes than the one uncovered by Oscope.
pseudotime
We evaluate the performance of the ENI method of Leng et al. [2] and spectral methods on estimating pseudotime. We run the ENI method using 30000 iterations, with a sliding polynomial of degree 3 and 4 starting points for the polynomial fitting. The pseudotime order is not changed after approximately 6500 iterations in our experiments even when setting the maximum number of iterations to 100k. The ENI algorithm took on average 57 h in our experiments while the spectral method takes less than 1 min including the time to search over the parameter space. This makes the method very easy to use in practice in addition to the superior performance which we now discuss.
We assess the performance of the method when trained on the 72 gene cluster defined by Oscope and the 265 gene cluster defined by OscoNet. We also evaluate the performance of the pseudotime on both of these clusters.
The dimensionality reduction and circle estimated by the spectral method for the two gene clusters is shown in Fig. 5. In both cases, the cells can be welldescribed by a circular path in the 2D latent space projection of the data.
Our main validation measure is the Spearman correlation of the estimated peak times in the reconstructed gene expression to the ground truth (Table 2). The ground truth has been constructed by folding the time series in [16] along every period. ENI provides consistently poor results despite the large iteration count (30k iterations). The spectral method is simple and provides good results. It can also be easily extended in an intuitive manner to cover more complex cases  see the Conclusions section. Comparing the pseudotime achieved by the different clusters using the spectral method, we note that using the larger gene cluster identified by OscoNet (265 genes) results in a more accurate pseudotime. This is reflected in the higher peak time rank correlation; the larger cluster estimated by OscoNet is more informative on the true time compared to the smaller cluster estimated by Oscope. In the Supplementary (Section 2), the estimated and actual peak times for each gene are shown.
We also report on the roughness measure of the true time series and the reconstructed version. This encapsulates how smooth a time series is by comparing neighbouring values with smaller values corresponding to a smoother time series; see Appendix for details. In Table 3 we report the difference of the median roughness of the estimated pseudotime and true time. Overall the ENI results in an increase in the roughness measure while the Spectral method results in lower values reflecting smoother time courses. The spectral method achieves similar performance under both the smaller Oscope cluster (72 genes) and the larger OscoNet cluster (265). Therefore the differences reported above in peak time accuracy are not caused by smoother time courses.
In Fig. 6 we show the gene expression for three genes using the experimental time. In Figures 7 we show the reconstructed base cycle and estimated gene expression. The pseudotime methods have been estimated using the larger OscoNet cluster. As expected, the ENI profiles are poor due to the poor pseudotime estimate. The spectral pseudotime is anticorrelated and therefore provides a mirror image of the true time series profile.
Singlecell embryonic stem cells
To evaluate OscoNet on scRNAseq data, we analysed profiles of single undifferentiated hESCs as described in Thomson et al. [17]. In particular, we analysed three replicate scRNAseq experiments on H1 hESCs, with a sample size n=213.
We applied our method to n=213 samples with G=2375 genes following the same steps as [2] in terms of quality control. In particular, genes have been filtered by mean level and variance as in [2]. A total of 1914 genes have at least one significant cooscillation with 5057 cooscillatory pairs found. The sparseness value for the network is 0.001, an order of magnitude smaller than the previous study suggesting a sparser network.
A total of 677 genes were used to test for CC enrichment. After quality control, 153 CC genes were left of which 134 are subsequently present in the cooscillatory network.
The original algorithm in Oscope identified 29 genes as cooscillating, 21 of which are annotated with the cell cycle process GOterm. The Oscope algorithm was run with default values as previously reported [2]. In total the kmedoids Oscope step finds 5 clusters, 2 of which are eliminated by the subsequent linear filtering step. The CC enrichment of each cluster is given in Table 4.
After applying OscoNet, we assessed the significance of each community using a Wilcoxon test as described previously. The results are shown in Table 5. Hence we identify six significant communities using the Wilcoxon test (α=0.01). As in the previous study, we eliminate singleton communities that contain only one gene.
Only one community is enriched in nodes and edges according to the hypergeometric and NEAT tests (α=0.01). This community is enriched in CC with 32 genes out of 77 belonging to the term GO : 0007049. This community contains the standard Oscope cluster of 29 genes reported in [2]. As in the previous study, we find that the OscoNet approach is more sensitive than Oscope identifying a cluster with a superset of CC genes.
The two highest ranked communities in terms of relative density have a small size of only 4 and 6 genes respectively. These are ranked highly because of their exceptionally small size and are not CC enriched. A threshold on the community size would exclude such small communities.
pseudotime
We compared pseudotime using the 77 and 29 gene clusters estimated by OscoNet and Oscope respectively. ENI takes an average of 13 h of computing time to estimate the pseudotime for both clusters while the spectral method requires approximately 30 min. The difference in computing time is that ENI is leveraged the 2opt algorithm to reduce the likelihood of local minima. In contrast, the spectral method has a unique solution for a given number of neighbours n and is quick to evaluate leveraging efficient sparse eigenvalue solvers. The grid search we performed over different values of n ensures the robustness of the results.
In Fig. 8 we show the fit for the spectral method for both gene sets. There is more noise in the data than in the previous case study, and the pattern is less circular. Despite this, the algorithm is able to recover the cell cycle order as we now show.
We compared the performance of the algorithm estimating pseudotime using all 460 cells and assessing the separation of CC stages on the labelled subset of cells, which consists of 247 cells. After pseudotime estimation, we applied the kmeans algorithm with K=3 to identify the three clusters associated with G1, G2 and S phase. We computed the ARI to contrast the clustering results with the CC labels; the ARI measure is in the range [0,1] with higher values reflecting better agreement between the estimated and true cell cycle stages.
In Table 6 we show the ARI results in comparison to the spectral and ENI method. The performance of the latter is poor, which can be validated when looking at the gene expression output (Fig. 9). The spectral method provides better accuracy for both gene clusters with an ARI of 0.43 when using the Oscope 29 gene cluster and 0.49 when using the OscoNet 77 gene cluster. The higher ARI for the latter reflects the richer information content of the OscoNet cluster compared to the Oscope cluster; the OscoNet spectral ordering contains fewer cells assigned to an incorrect CC stage than the Oscope spectral ordering as is evident in Fig. 9 for four example genes.
The poor performance of the ENI algorithm is confirmed when looking at the roughness value (Table 7). The ENI algorithm has higher roughness values for all combinations examined while the spectral method achieves similar roughness values with both clusters.
Conclusions
We have presented a wellcalibrated hypothesis test that avoids a need for a threshold cutoff on the pairwise distance matrix. Further, our implementation is highly scalable due to the use of parallel vectorisation which has been demonstrated to be orders of magnitude faster than the one used in Oscope.
We have then performed network analysis to identify statistically significant communities of genes again avoiding any need to set the maximum number of communities as is needed in the Kmedoids approach of [2]. We have also been able to select the significant communities based on a standard statistical test and assess the significance of enrichment in terms of both genes and their cooscillatory relationship. We have validated our approach on synthetic data demonstrating that it is wellcalibrated with respect to the false discovery rate, unlike [2].
We contrasted our approach to that of Leng et al. [2] on both microarray and scRNAseq data with known oscillators. In both cases, we are able to recover the original Oscope cluster of genes in addition to discovering more oscillators.
We estimated a unique pseudotime per gene community using a spectral approach. The spectral approach requires an order of magnitude less computation than the ENI approach proposed in Leng et al. [2] and is more accurate as reflected in both peak time correlation in the microarray data and cell cycle stage separation in the singlecell data.
One issue with our current approach is that for a low number of cells, the small number of possible permutations in the similarity measure reduces the power of the bootstrap hypothesis test resulting in a high minimum achievable FDR. A semiparametric bootstrap approach should help to increase the power of the test, particularly in case of small numbers of cells and low FDRs, which is typically required.
Finally, extending this approach as a probabilistic model including the effect of dropout which is prominent in singlecell datasets, may further increase the accuracy of the method and allow for quantification of uncertainty in the resulting cooscillating gene clusters.
The pseudotime model could be extended to use a probabilistic model to estimate pseudotime while imposing a periodicity constraint jointly. This will also allow for selecting the number of latent dimensions by maximising the model evidence. A probabilistic formulation of Laplace Eigenmaps latent variable model has already been shown [20] where the Laplacian Eigenmaps latent variable model is described. In future work, one could extend this class of model, in the probabilistic or deterministic formulation, to include a periodic constraint.
The method we have presented can be incorporated in a singlecell analysis pipeline to allow biologists to uncover novel periodic dynamic gene regulatory mechanisms. Our approach avoids less interpretable cutoffs and relies on wellcalibrated statistical tests. As the availability and size of singlecell whole transcriptome data increases, the nonparametric and automatic nature of our approach will allow for a corresponding increase in resolution of the analysis resulting in a more detailed description of oscillatory gene expression systems.
Appendix
Quality control
In any singlecell pipeline, it is necessary to perform some rudimentary quality control to ensure downstream analysis is not affected by lowquality measurements. In the Oscope pipeline [2], a threshold on the mean level and a loglinear cutoff on the variance is applied.
We prefer using a simpler approach based on variance based filtering with a high variance cutoff of 0.9 [19]. This ensures a simple and consistent quality control mechanism that has been shown to remove low quality genes [19].
Roughness measure
Roughness for a particular gene was defined in [21] as
where σ_{g} the standard deviation of gene expression and \(x_{g, z_{c}}\) is the gene expression for gene g at pseudotime order z_{c}.
This metric measures the smoothness of the gene expression profile by looking at the differences of consecutive measurements. Smaller values indicate a smoother response.
Availability of data and materials
All data used in the paper are freely available; see [2] for details on download instructions. The OscoNet code is publicly available at https://github.com/alexisboukouvalas/OscoNet.
Notes
 1.
Intuitively, any monotonic transformation of time will not alter the value of the objective function.
 2.
These are available in the R package igraph.
Abbreviations
 FDR:

False discovery rate
 ENI:

Extended nearest insertion
 CC:

Cell cycle
 hESCs:

Human embryonic stem cells
 ARI:

Adjusted rand index
References
 1
Levine JH, Lin Y, Elowitz MB. Functional roles of pulsing in genetic circuits. Science. 2013; 342(6163):1193–200.
 2
Leng N, Chu LF, Barry C, Li Y, Choi J, Li X, Jiang P, Stewart RM, Thomson JA, Kendziorski C. Oscope identifies oscillatory genes in unsynchronized singlecell rnaseq experiments. Nat Methods. 2015; 12(10):947–50.
 3
Bacher R, Kendziorski C. Design and computational analysis of singlecell rnasequencing experiments. Genome Biol. 2016; 17(1):1.
 4
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003; 100(16):9440–5.
 5
Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys Rev. 2004; 69(11):026113.
 6
Pons P, Latapy M. Computing Communities in Large Networks Using Random Walks. Berlin: Springer Berlin Heidelberg; 2005, pp. 284–93.
 7
Clauset A, Newman MEJ, Moore C. Finding community structure in very large networks. Phys Rev E. 2004; 066111:70. https://doi.org/10.1103/physreve.70.066111.
 8
Rosvall M, Bergstrom CT. Maps of information flow reveal community structure in complex networks. PNAS. 2008:105–1118. https://doi.org/10.1073/pnas.0706851105.
 9
Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Natl Acad Sci USA. 2002; 99(12):7821–6.
 10
Fortunato S. Community detection in graphs. Phys Rep. 2010; 066111:75–174.
 11
Lawson CE, et al.Metabolic network analysis reveals microbial community interactions in anammox granules. Nat Commun. 2017. https://doi.org/10.1038/ncomms15416.
 12
Hermjakob H, et al.The hupo psi’s molecular interaction format–a community standard for the representation of protein interaction data. Nat Biotechnol. 2004:177–83. https://doi.org/10.1038/nbt926.
 13
Signorelli M, Vinciotti V, Wit E. Neat: an efficient network enrichment analysis test. BMC Bioinformatics. 2016; 12(352). https://doi.org/10.1186/s1285901612036.
 14
Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003; 15(6):1373–96.
 15
Titsias MK, Lawrence ND. Bayesian gaussian process latent variable model. In: International Conference on Artificial Intelligence and Statistics. PMLR, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics: 2010. p. 844–51. http://proceedings.mlr.press/v9/titsias10a/titsias10a.pdf.
 16
Michael L, Whitfield GS, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, et al.Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002; 13(6):1977–2000.
 17
Thomson JA, ItskovitzEldor J, Shapiro SS, Waknitz MA, Swiergiel JJ, Marshall VS, Jones JM. Embryonic stem cell lines derived from human blastocysts.Science. 1998; 282(5391):1145–7.
 18
Hubert L, Arabie P. Comparing partitions. Journal of the Classification, pages. 1985; 2:193–218.
 19
Gentleman R, Carey V, Huber W, Hahne F. genefilter: genefilter: methods for filtering genes from highthroughput experiments; 2020. R package version 1.70.0. https://bioconductor.org/packages/release/bioc/html/genefilter.html.
 20
Zhengdong L, Sminchisescu C, CarreiraPerpiñán MÁ. People tracking with the laplacian eigenmaps latent variable model In: Platt JC, Koller D, Singer Y, Roweis ST, editors. Advances in neural information processing systems. Curran Associates, Inc.: 2008. p. 1705–12. http://papers.nips.cc/paper/3226peopletrackingwiththelaplacianeigenmapslatentvariablemodel.pdf.
 21
Reid JE, Wernisch L. Pseudotime estimation: deconfounding single cell time series. Bioinformatics. 2016; 32(19):2973–80.
Acknowledgements
We thank the Oscope authors for sharing their processed microarray Whitfield data.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 21 Supplement 10, 2020: Proceedings from the 13th Bioinformatics and Computational Biology International Conference  BBCC2018. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume21supplement10.
Funding
MR and AB were supported by MRC award MR/M008908/1. LC was supported by H2020EU EU Marie Curie fellowship CONTESSA (ID: 660388). NP is funded through a Wellcome Trust Senior Research Fellowship (106185/Z/14/Z). EM is funded through a Sir Henry Wellcome Fellowship (201380/Z/16/Z). Publication costs are covered by the University of Manchester Wellcome Trust funds.
Author information
Affiliations
Contributions
LC and AB did all the computational work and equally contributed as first authors. MR, NP and EM conceived the study. All authors participated in writing the paper. All authors have read the paper and approve it.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Luisa Cutillo and Alexis Boukouvalas contributed equally to this work.
Supplementary information
Additional file 1
A supplementary report is available (supplementary.pdf).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Cutillo, L., Boukouvalas, A., Marinopoulou, E. et al. OscoNet: inferring oscillatory gene networks. BMC Bioinformatics 21, 351 (2020). https://doi.org/10.1186/s1285902003561y
Published:
Keywords
 Singlecell
 Network analysis
 Nonparametric hypothesis test