Algorithm
ProbeAlign identifies the homologous ncRNAs in a profile-based search manner. The profile is generated by using the multiple sequence alignment of a given ncRNA family. The aligned columns formed by a majority of gap are excluded in the search profile. In addition, the consensus structure of the family is considered as the structural information of the profile. The targets of search are usually the genomic or transcriptomic sequences with experimentally determined reactivities. In the latest implementation of ProbeAlign, higher reactivity of a site indicates higher chance of being unpaired, and vice versa.
Assume the alphabet of RNA sequences is {A, C, G, U, X}, in which X represents all unknown nucleotides. First we denote the query of an ncRNA family as Q = {P, S}, where P is the sequence profile of the family and S is the pairing pattern in the corresponding consensus structure. Let the length of the profile be n, then P = 〈p1, p2, . . . , p
n
〉 and S = 〈s1, s2, . . . , s
n
〉. Here, , which is a vector that contains the frequency of the nucleotides and gap at site i. s
i
is a boolean value indicating whether site i is paired in the consensus structure or not (0 means i is paired and vice versa). Note that the specific pairing relationship between sites in P is not considered in S, which is similar to the folded-BLAST [32]. For target T of length m, denote B = 〈b1, b2, . . . , b
m
〉 as the genomic sequence and R = 〈r1, r2, . . . , r
m
〉 as the observed reactivities. Denote D
i,j
, I
i,j
, M
i,j
as the optimal alignment scores for deleting, inserting and matching a column in the search profile, respectively. They can be computed using the following recursive functions:
(1)
Here, ε0 and ε
e
are the gap open penalty and the gap extension penalty, respectively. In our implementation, a "semi-global alignment" setting [33] is adopted. Therefore, these three functions are initialized with: M0,0 = 0, Mi,0= ε0 + i × ε
e
, M0,j= 0, and D0,j= Ii,0= −∞. τ and σ are functions to assess the structural and the sequence similarities between the queries and the targets, respectively. α and β are weights assigned to these two types of similarities.
The sequence similarity between the query profile and the target sequence is computed using the following formula:
(2)
where m(x, y) is the substitution score between nucleotides x and y.
The general function to compute structural similarity is as follows:
(3)
Given the reactivity r
j
, p(π
j
| r
j
) is computed to compare the structural aspect of b
j
with s
i
. π
j
is a random variable and π
j
∈ {0, 1}, 0 means b
j
is paired and 1 means b
j
is not paired. According to the Bayes' theorem, the probability can be computed as:
(4)
The probabilities p(r|π = 0) and p(r|π = 1) can be inferred from the reactivities retrieved from the RNAs with known secondary structures [34]. To simplify the computation, we assume p(π = 0) is equal to p(π = 1) and then define the function f as:
(5)
Note that the probability p(r|π) varies among different probing techniques. Even for one protocol, the reactivity distributions may be different due to the distinct computational methods for transferring the chemical signals from biological experiments. Therefore, it may be hard to apply Equation 5 on some techniques whose statistical properties have not been studied yet. To overcome this limitation and make the implementation of ProbeAlign easy to use, a simplified scoring function is proposed:
(6)
In Equation 6, r
c
is a cutoff value which is used to annotate the structural aspects of targets. Any site that has higher reactivity than r
c
is deemed as unpaired, and vice versa (r
j
> r
c
⇒ p(π
j
= 1|r
j
) = 1; r
j
<= r
c
⇒ p(π
j
= 0|r
j
) = 1). We have compared two different types of structural similarity functions by taking SHAPE protocol as an example. The benchmark results show that the optimal performance of those two functions is comparable. Therefore, the simplified scoring function is practical for universal usage, while the protocol-specific scoring function may be a better option if the reactivity distribution is known.
The above described dynamic programming algorithm computes the optimal alignment between the query profile and the target sequence with the consideration of both structural and sequence similarity. After alignment, traceback is performed to check the base pairing consistency between the query structure and the target. Bonus scores are assigned to the possible pairing bases. Such additional information can be used to detect potential false positive hits that have high alignment scores but low structural consistency with the query. For example, in Figure 1, two alignment scores are the same. However, the target in alignment 1 is more conserved with the query, compared with the target in alignment 2, because the red letters can form canonical pairs, while the green ones can not. Structure consistency scores can help us distinguish these two targets.
P-value estimation
A robust scheme for evaluating the statistical significance of prediction results is important for the homology search applications. However, what statistical distribution the ncRNA alignment scores should follow is still unclear. In this case, we simulated the ProbeAlign scores by searching 106 Rfam families (as defined by the Infernal RMARK3 dataset [20]) against five artificial sequences, whose GC content ranging from 30% to 70%. Each artificial sequence was constructed by concatenating random RNA sequences generated by GenRGenS [35] with a simple context-free grammar [36]. The secondary structure of each random sequence was predicted by mfold [11]. The corresponding reactivities of the secondary structure were simulated based on the SHAPE technology [34].
We fitted the ProbeAlign score density for each Rfam family with four different distributions: the normal, Gumbel, GEV (Generalized Extreme Value), and Gamma distributions. The goodness of fitting was measured with K-S test (Kolmogorov-Smirnov test). The fitting results on the five artificial sequences show that the ProbeAlign scores follow the Gamma distribution for most of Rfam families. Take the fitting on the artificial sequence with 50% GC content as an example. Out of 106 tested families, 103 families fit best with the Gamma distribution, and the other three families (bicoid_3, OLE, and rne5) fit best with the normal distribution. The score distribution fitting of the Corona_FSE family (which follows Gamma) and the rne5 family (which follows normal) is shown in Figure 2. It is clear that even rne5 fits better with the normal distribution, the Gamma distribution also fit the ProbeAlign score distribution well. So the Gamma distribution was chosen to evaluate the p-values in ProbeAlign.