Problem statement
This paper aims to devise an algorithm to annotate circRNA molecules with annotation terms. To do so, we exploit the interaction graph between circRNA–miRNA and miRNA–mRNA molecules. As we assume that the annotations of circRNAs are independent, we can process the individual circRNAs sequentially and restrict ourselves to a single circRNA molecule in our description. Assume a fixed ordering on miRNA and mRNA molecules. Assume that the count of mRNAs (miRNAs) is \(|m|\) (\(|\mu |\)).
Formally, we can define the interaction graph between the selected circRNA and miRNAs using a vector \(\mathbf {a}^{\mu , c}\in \{0,1\}^{|\mu |}\) where each field represents whether a particular miRNA interacts with the circRNA. Interactions between miRNAs and mRNAs are represented by an adjacency matrix \(\mathbf {A}^{m, \mu }\in \{0,1\}^{|m|\times |\mu |}\) where each row is a vector indicating which miRNAs interact with a particular mRNA.Footnote 2 We assume that the graph edges are directed only from circRNA to miRNA and from miRNA to mRNA, so that a directed path cannot connect two molecules of the same type. A simple network is shown in Fig. 1.
In our notation, an annotation term will be defined by a set of mRNAs and miRNAs it annotates. The membership of mRNAs (miRNAs respectively) is formalized using binary vectors \(\mathbf {g}^m\in \{0,1\}^{|m|}\) (\(\mathbf {g}^\mu \in \{0,1\}^{|\mu |}\) respectively). As a shorthand notation, we will use the symbol \(g\) to denote the tuple of the latter, i.e., \(g=(\mathbf {g}^m, \mathbf {g}^\mu )\). Having these definitions on hand, we can define the problem to be solved in this paper.
Definition 1
(CircRNA annotation problem) For a circular RNA, let \(\mathbf {A}^{m, \mu }\), \(\mathbf {a}^{\mu , c}\) be its interaction graph. Decide whether the circRNA should be annotated with a term \(g=(\mathbf {g}^m, \mathbf {g}^\mu )\).
Proposed statistic
To solve the problem, we will develop a simple yet powerful statistic to annotate a circRNA. The concept is based on the ”guilt by association” principle [16, 21]. The circRNA should be annotated with a term if and only if this molecule frequently interacts with miRNAs (and through them indirectly with mRNAs) annotated with the term. We will capture this frequency in statistic s. This statistic will quantify the size of the neighborhood of the circRNA that is annotated with the term. As the complete tripartite graph is only a unification of two bipartite graphs and remains fixed for the circRNA, we might calculate this number precisely:
$$\begin{aligned} s(c, g) = (\mathbf {a}^{\mu , c})^T \mathbf {g}^\mu + (\mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c})^T \mathbf {g}^m. \end{aligned}$$
(1)
The first addend shows how many paths of length one end in a miRNA annotated with the term. The term \(\mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c}\) shows how many paths go from \(c\) to each mRNA. The second addend in Formula (1) calculates how many paths of length two terminate in an mRNA that is annotated with the term.
However, the frequency represented by \(s\) is hard to explain without knowing the entire neighborhood. Larger neighborhoods, as well as more abundant gene terms, tend to generate a larger frequency. The importance of the term could better be captured by a relative frequency. Assume that we start a random walk in circRNA \(c\). We might calculate the probability that this random walk ends in an RNA annotated with the term \(g\). For a fixed circRNA, the size of the neighborhood is fixed. Therefore, the aforementioned probability is equal to \(s(c, g)\) but for a normalization factor.
We will continue to use the number of paths represented in \(s(c, g)\), knowing that they are only linearly scaled, preserving the ordering of the above-mentioned probabilities. To increase interpretability, we will further develop a normalized \(s\) as well as the p-value for the statistic so that standard statistical reasoning is applicable. In this work, we will avoid random Monte-Carlo sampling (used among others in [6, 12,13,14,15]) and simulating the random walk and claim its low efficiency in our setting for the p-value calculation.
Normalization
As the value of statistic \(s\) grows by definition with the size of annotation term and the size of the interaction graph (it gives the number of distinct paths to mRNAs and miRNAs annotated with the term), we present the user with a more explainable output. We normalize the statistic (1) by its expected value
$$\begin{aligned} \mathbb {E}\left( s(c, g) \right) = \frac{\Vert \mathbf {g}^\mu \Vert _1}{|\mu |} (\mathbf {a}^{\mu , c})^T\mathbf {1}+ \frac{\Vert \mathbf {g}^m \Vert _1}{|m|} (\mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c})^T \mathbf {1} \end{aligned}$$
(2)
where \(\mathbf {1}= (1, 1, \ldots , 1)^T\) is the vector of ones and \(\Vert \cdot \Vert _1\) denotes the L1-norm. The expected value gives the expected number of random walks that end in an RNA annotated with the term if the annotations were assigned randomly. The user is then presented with the ratio of the statistic and its expected value
$$\begin{aligned} \frac{s(c, g)}{\mathbb {E}\left( s(c, g) \right) }. \end{aligned}$$
(3)
This ratio represents the normalized statistic. The value above 1 then stands for terms that tend to interact with the circRNA under observation more than expected as can be seen in Table 1.
Influence of individual RNAs
Once circGPA predicts that a circRNA should be annotated with a term, users might be interested in which miRNAs and mRNAs back up this annotation. In other words, knowledge of which RNAs connect the circRNA to the annotation term is important. Fortunately, it is possible to split \(s(c, g)\) among individual molecules. A natural method of explaining how much the RNA adds to the statistic is to remove this RNA with all its incoming and outcoming edges from the graph. On such a modified graph, we recalculate the score and calculate the difference in the score value. We can calculate this difference for all miRNAs and mRNAs at once using linear algebra. We denote the vector of those differences \(\Delta ^{m}\) (\(\Delta ^{\mu }\)) for mRNAs (miRNAs). For a vector \(\mathbf {v}\), let \(\mathop {{diag}}(\mathbf {v})\) denote a diagonal matrix with elements of \(\mathbf {v}\) on its diagonal. Then we derive that
$$\begin{aligned} \Delta ^{\mu }&= \mathop {{diag}}(\mathbf {a}^{\mu , c}) \left( \mathbf {g}^\mu + (\mathbf {A}^{m, \mu })^T \mathbf {g}^m\right) , \end{aligned}$$
(4)
$$\begin{aligned} \Delta ^{m}&= \mathop {{diag}}(\mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c}) \mathbf {g}^m. \end{aligned}$$
(5)
One can notice that the L1 norm of \(\Delta ^{\mu }\) is equal to \(s(c, g)\). We use values \(\Delta ^{\mu }, \Delta ^{m}\) to sort mi/mRNAs in a report that shows the influence of individual RNAs. An example output will be seen in the "Results" section.
p-value calculation
To understand and compare the values of statistic \(s\) among different circRNAs and annotation terms, we need to calculate its p-value. The p-value cannot stem solely from s itself as other circular RNAs have a different number of connections to the remaining RNAs. In addition, more frequent annotation terms will reach higher scores. Formally, statistic \(s(c,g)\) is an outcome of the statistical test, whose null hypothesis is that the given \(c, g\) pair is not related. In other words, the null hypothesis is that circRNA \(c\) has no preference in interactions with miRNAs (or mediated interactions with mRNAs) annotated with term \(g\). The alternative hypothesis states that \(c\) should be annotated with \(g\) as \(g\) is overrepresented in the neighborhood of \(c\).
The p-value, in our case, represents the probability that a random annotation term of the same size in the same interaction graph reaches the same statistic \(s\) or higher. The literal implementation of the null distribution simulation would thus be empirical random sampling with replacement [22]. In our case, this Monte-Carlo approach would be based on enumerating the random subsets of m/miRNAs of the same size as the evaluated annotation term and calculating the statistic value based on Formula (1).
This paper proposes an exact approach that does not depend on random trials but uses generating polynomials instead to compute the p-value. We should first reformulate the problem so that we can easily describe its mathematical solution. Denote \(\Vert \mathbf {g}^\mu \Vert _1\) the number of miRNA molecules annotated by the term. Formally, \(\Vert \mathbf {g}^\mu \Vert _1\) is the L1-norm of the \(\mathbf {g}^\mu\) vector. Define \(\Vert \mathbf {g}^m \Vert _1\) similarly. For each miRNA, there is a fixed number denoting its weight in the statistic (1). This weight is 1 if and only if the circRNA of interest is connected to the miRNA, zero otherwise. The weight is stored in the respective field of \(\mathbf {a}^{\mu , c}\). Out of all miRNAs, we select \(\Vert \mathbf {g}^\mu \Vert _1\). For mRNA, the weight can be seen in the respective field of \(\mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c}\). Out of all interacting mRNAs, \(\Vert \mathbf {g}^m \Vert _1\) mRNAs are selected.
To calculate the p-value, the molecules of mRNA and miRNA are selected randomly given the weights and the fact that \(\Vert \mathbf {g}^\mu \Vert _1\) and \(\Vert \mathbf {g}^m \Vert _1\) need to be preserved. For now, we consider only miRNAs. Imagine a bag full of balls with numbers written on them. Each number is a field in \(\mathbf {a}^{\mu , c}\) (one field equals one ball). Now we randomly select \(\Vert \mathbf {g}^\mu \Vert _1\) balls from the bag and sum the numbers written on them. By repeating this procedure many times, we get the null distribution for the first part of the statistic (1). If we include a second bag with numbers taken from \(\mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c}\), we get the null distribution for the whole statistic.
Having built an informal intuition, we can proceed to introduce the generating polynomials by which we denote a polynomial which is a multiple of the well-established probability-generating functions [20]. Consider an mRNA that is connected by five paths to the circRNA. The weight of this mRNA is 5. In a random annotation term, this mRNA is either included or not. This gives two possibilities. We can formulate the generating polynomial for this mRNA as
$$\begin{aligned} 1 + x^5 y^1. \end{aligned}$$
(6)
The variable x keeps track of weights, y keeps track of the number of selected mRNAs. Having a simple graph with only one mRNA, we have two options for building a random mRNA set: either we use zero mRNAs, and the sum of weights is zero (the term 1, which equals \(x^0 y^0\)), or we use one, and the sum is 5 (the term \(x^5 y^1\)). If we also consider a new mRNA with weight 3, the resulting polynomial that represents the extended graph is
$$\begin{aligned} (1 + x^5 y^1) \cdot (1 + x^3 y^1) = 1 + x^3 y^1 + x^5 y^1 + x^8 y^2. \end{aligned}$$
(7)
We immediately see that if we select no mRNAs, we can only get the sum of weights 0; by selecting one, the weights will be either 3 or 5, and by selecting two, the sum of the weights will be eight. The coefficients by terms with \(y^1\) show a single possibility of getting a weight of three or five. Another helpful view on the formula above might be as on a dynamic programming algorithm in a 2D array where the power of x denotes a row, the power of y denotes a column, and the coefficient is the number at the particular position of the table. Now, we can define the generating polynomial for a weight vector.
Definition 2
(Generating polynomial) Let \(\mathbf {w}\) be a vector of weights (of mRNA or miRNA). Then the generating polynomial is
$$\begin{aligned} \hbox {genpoly}_{\mathbf {w}}(x,y) = \prod _{w\in \mathbf {w}} (1 + x^wy). \end{aligned}$$
(8)
Next, we define an operator that restricts the polynomial only on a selected power of one or more variables. We will denote the operator \(\mid x^n\) and use it to denote only terms that contain \(x^n\). For example, for the polynomial (7), operator \(\mid y^1\) will return \(x^3 + x^5\). The following theorem allows us to calculate the null distribution of the statistic (1).
Theorem 1
Consider statistic \(s\) for a fixed circRNA \(c\), interaction graph \(\mathbf {a}^{\mu , c}\), \(\mathbf {A}^{m, \mu }\) and annotation term sizes \(\Vert \mathbf {g}^\mu \Vert _1\), \(\Vert \mathbf {g}^m \Vert _1\). Then coefficients of the polynomial
$$\begin{aligned} \left( \hbox {genpoly}_{\mathbf {a}^{\mu , c}}(x,y) \mid y^{\Vert \mathbf {g}^\mu \Vert _1} \right) \left( \hbox {genpoly}_{\mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c}}(x,y) \mid y^{\Vert \mathbf {g}^m \Vert _1} \right) \end{aligned}$$
(9)
are the null distribution of statistic \(s\) up to a normalization factor.
Proof
From what precedes, it can be seen that the first multiplicand coefficients are the number of ways to reach a particular value of the miRNA part of the statistic (1) by selecting a particular number of miRNAs. The restriction to the \(y^{\Vert \mathbf {g}^\mu \Vert _1}\) ensures that the number of miRNAs in the annotation term is preserved. The same holds for the second multiplicand and mRNAs.
After multiplying the polynomials, the polynomial coefficients will hold the number of unique ways the value of the statistic can be achieved. The normalization to 1 then finishes the calculation of the null distribution. \(\square\)
Once the null distribution is calculated, the p-value is then obtained by a standard approach in which we sum probabilities of all statistic values greater than \(s(c, g)\).
Computational complexity
If we focus on the computational complexity of circGPA, most of the work is done in the Generating-Polynomial function. The two inner loops depend on variables maxx and maxy, where maxx is, in the worst case, linearly dependent on the L1-norm of vector \(\mathbf {w}\); maxy is equal to the number of RNAs annotated with the terms. Their product, therefore, is linearly dependent on the product of \(\Vert \mathbf {w}\Vert _1\) times the size of the annotation term. The two outer loops in function Generating-Polynomial do at most n operations for each unique non-zero entry in the vector of weights of count n. Sum of the fields of weight vector \(\mathbf {w}\) is, therefore, the same as the number of evaluations of the two outer loops. The computational complexity of the two outer loops is in the worst case equal to \(\Vert \mathbf {w}\Vert _1\). We may conclude that one call to the Generating-Polynomial function is in \({\mathcal {O}}(\Vert \mathbf {w}\Vert ^2 \cdot maxy)\). Other terms in function AnnotateCircRNA are asymptotically smaller than the runtime of the Generating-Polynomial function. The overall runtime is, therefore, in
$$\begin{aligned} {\mathcal {O}}\left( (\Vert \mathbf {a}^{\mu , c}\Vert _1)^2 \Vert \mathbf {g}^\mu \Vert _1 + (\Vert \mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c}\Vert _1 )^2 \Vert \mathbf {g}^m\Vert _1\right) . \end{aligned}$$
(10)
Referential approach
A standard approach for p-value calculation would be to enumerate subsets of miRNAs/mRNAs as random annotation terms. The size of the term is preserved, and we count how many times the score is higher than \(s(c, g)\). This sampling Monte-Carlo approach then allows estimation of the p-value using the biased estimator \(\frac{r+1}{n+1}\), where r is the number of trials with a high enough score and n is the number of all trials [23].
Implementation
Using a naive implementation, we need at most \(2^{|m|} \cdot 2^{|\mu |}\) multiplications to evaluate polynomial (9). This number is the theoretically possible maximum number of terms in polynomial (7). The real number is, however, much smaller. As polynomial exponents repeat, the bound can be tightened. The power of x goes from 0 to \(\Vert \mathbf {a}^{\mu , c} \Vert _1\) in the case of miRNAs, and from 0 to \(\Vert \mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c} \Vert _1\) in the case of mRNAs. The y variable goes from 0 to \(|\mu |\) (0 to \(|m|\) in case of mRNA); however, relevant fields are only up to \(\Vert \mathbf {g}^\mu \Vert _1\) (\(\Vert \mathbf {g}^m \Vert _1\)). The x variable can be trimmed similarly using the fact that only \(\Vert \mathbf {g}^\mu \Vert _1\) (or \(\Vert \mathbf {g}^m \Vert _1\)) highest terms of \(\mathbf {a}^{\mu , c}\) (or \(\mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c}\)) may be used.
If the weight of a mi/mRNA occurs more than once, we can exploit the binomial expansion instead of term-by-term multiplication in the Equation (8). The multiplication could be implemented using the dynamic programming approach and pointers. A similar approach was used for the fast p-value calculation of the unweighted GSEA [24]. Details can be seen in Algorithm 1. The whole pipeline is illustrated in Fig. 2.
Input data and used libraries
Our implementation combines R code and C++ code for the critical p-value calculation. To connect the C++ code and R code, we use the Rcpp package [25].
To construct the graph, we exploit several databases and R packages. The circRNA–miRNA interactions are downloaded from the CircInteractome [26] database that uses the TargetScan [27] interaction prediction algorithm. The miRNA–mRNA interactions are downloaded from the TarBase [28], miRecords [29], and miRTarBase [30] databases via the multiMiR package [31]. In the case of miRNA–mRNA interactions, we used verified interactions only.
The GO annotation for the miRNAs is downloaded using the miRBase [32] and ENA Quick GO [33] databases. Annotation of mRNAs is done via the org.Hs.eg.db R package. The annotation terms are obtained from the MSigDB database [34] C5 category using the msigdbr R package.
Other R packages used include miRBaseConverter, GO.db, biomaRt,stringr, httr, openxlsx, geometry, tictoc, ggnet, network and polynom.
The graph constructed in the presented way can annotate 3, 009 circRNAs. There are 1,761 miRNAs connected by 81, 391 edges. This means that one circRNA interacts with 29 miRNAs on average. One miRNA interacts with 50 circRNAs on average. The circRNA that interacts most with other molecules is hsa_circ_0000005 with 307 interactions. The miRNA with the most frequent interactions is hsa-miR-942, with 799.
The graph contains 19, 375 mRNAs with 465, 741 known interactions with miRNAs. Therefore, one mRNA interacts with 24 miRNAs on average, while one miRNA interacts with 264 mRNAs on average. The most frequent miRNA is hsa-miR-1-3p with 7491 interactions, the most frequent mRNA is NUFIP2 with 331 interactions.
We work with 10, 189 annotation terms. The average size of those is 82 mRNA or miRNA molecules. If we exclude annotation terms which are too narrow or too broad (see Sect. Results for details), we end up with 7075 annotation terms with 89 molecules on average. One RNA is annotated with 43 terms on average.
All in one example
Consider the interaction network depicted in Fig. 1. In this figure, we have a circRNA of interest connected with 3 miRNAs that connect to 5 mRNAs. The edges in the graph can then be described by a vector and a matrix.
$$\begin{aligned} \mathbf {a}^{\mu , c}&= (1, 1, 1)^T,&\mathbf {A}^{m, \mu }&= \begin{pmatrix} 1 &{} 1 &{} 0 \\ 1 &{} 1 &{} 1 \\ 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 \\ 0 &{} 1 &{} 1 \end{pmatrix}. \end{aligned}$$
The annotation term contains 2 miRNAs and 3 mRNAs. It is formalized as
$$\begin{aligned} \mathbf {g}^\mu&= (1,1,0)^T,&\mathbf {g}^m&= (1,1,1,0,0)^T. \end{aligned}$$
The weights of the miRNAs and mRNAs are
$$\begin{aligned} \mathbf {a}^{\mu , c}&= (1,1,1)^T,&\mathbf {A}^{m, \mu }\mathbf {a}^{\mu , c}&= (2,3,1,1,2)^T. \end{aligned}$$
Which gives a statistic of
$$\begin{aligned} s(c, g) = 2 + 6 = 8. \end{aligned}$$
The first addend is for miRNAs, the second for mRNAs. The expected value of the statistic is
$$\begin{aligned} \mathbb {E}\left( s(c, g) \right) = \frac{2}{3} (1+1+1) + \frac{3}{5} (2+3+1+1+2) = 7.4. \end{aligned}$$
The normalized statistic is, therefore, \(\frac{8}{7.4} \cong 1.08\). The generating polynomial for miRNAs is
$$\begin{aligned} (1 + x^1y)^3 = 1 + 3xy + 3x^2y^2 + x^3y^3. \end{aligned}$$
As we have two miRNAs that are annotated with the terms, we immediately see that there are three options for achieving two paths from the circRNA that end in an annotated miRNA (see term \(3x^2y^2\) – variable y stands for the number of miRNAs, variable x stands for the number of paths). For mRNA, the generating polynomial is
$$\begin{aligned} (1 + x^1y)^2 (1 + x^2y)^2 (1 + x^3y) = \cdots + (2x^4 + 3x^5 + 4x^6 + x^7) y^3 + \cdots . \end{aligned}$$
The relevant terms contain the variable y to the power of three – the term annotates three mRNAs. And the corresponding terms show that there are two options for selecting mRNAs such that there are 4 paths that go from the circRNA to an annotated mRNA. Those two options are illustrated in Fig. 3. Similarly, we can see that there are four options so that six paths end in an annotated mRNA and so on.
The generating polynomial for the whole statistic is, therefore,
$$\begin{aligned} 3x^2 (2x^4 + 3x^5 + 4x^6 + x^7) = 6x^6+ 9x^7 + 12 x^8 + 3 x^9. \end{aligned}$$
From the polynomial, we see that there are 12 ways to obtain a statistic value equal to 8. One of these is the situation depicted in Fig. 1 and solved in this example. The p-value is equal to the ratio of the number of combinations that reach a statistic value greater or equal to 8 to the number of all possible combinations. These can be seen from the polynomial or calculated as \({3 \atopwithdelims ()2}{5 \atopwithdelims ()3} = 30\). Hence, the p-value is \(\frac{12+3}{6+9+12+3}\).
We can also see that the expected value is the same if calculated from the null distribution. There are six ways to get statistic equal to six, nine ways to get statistic equal to seven and so forth, i.e., the expected value is \(\frac{6\cdot 6 + 9\cdot 7+ 12 \cdot 8 + 3\cdot 9}{6+9+12+3} = \frac{222}{30} = 7.4\).