In this section, we propose a new method for measuring the similarity between two RNA sequences without known structures. The proposed method is applied to the hierarchical clustering of ncRNAs with the weighted pair group method with averaging (WPGMA) algorithm. Given a set of sequences, we calculate an all-against-all similarity matrix using our method. Then, we derive the distance matrix by one minus the similarity, and obtain the cluster tree by the WPGMA algorithm.
The idea of our similarity measure is to approximate the Sankoff algorithm for structural alignment by the combination of the SW algorithm for sequence alignment, and the McCaskill algorithm for secondary structures. This approximation allows to utilize the ensembles of all possible sequence alignments and all possible secondary structures separately from each of the two algorithms. First, we describe a similarity measure using the scores of all possible sequence alignments between two RNAs. Next, we design a scoring function for these alignments using all possible secondary structures of each of the two RNAs.
Ensemble of all possible sequence alignments
To measure the similarity between two RNAs, one common approach is to perform pairwise alignment, and to calculate its alignment score. The Sankoff algorithm simultaneously models sequence alignments and secondary structures, and is extremely time-consuming. Therefore, we first approximate the Sankoff algorithm by the SW algorithm that only models sequence alignments apart from secondary structures. Although this is a strong approximation, we attempt to improve the reliability by utilizing all possible sequence alignments rather than one optimal sequence alignment.
For an RNA sequence x, we denote its length by |x|. For each position 1 ≤ i ≤ |x| in x, we denote the nucleotide by x
i
∈ {A, C, G, U}.
For two sequences, x and y, let Π
xy
be the set of all possible sequence alignments in the SW algorithm. Let π
xy
denote one particular sequence alignment in Π
xy
.
We calculate the similarity between x and y by accumulating the alignment score of π
xy
over Π
xy
. For this purpose, we employ local alignment (LA) kernel [16] defined as follows:
where β ≥ 0 is a parameter, and Score(π
xy
) is the alignment score of π
xy
under a given scoring scheme (gap penalties and match scores). In practice, we take the logarithm of LA kernel, and similarity values are normalized to range from 0 to 1:
LA kernel (1) can be computed by the variant of the SW algorithm as follows:
Initialization:
for i ∈ {0,…, |x|} and j ∈ {0, …, |y|} do
M(i, 0) = I
X
(i, 0) = I
Y
(i, 0) = T
X
(i, 0) = T
Y
(i, 0) = 0
M(0, j) = I
X
(0, j) = I
Y
(0, j) = T
X
(0, j) = T
Y
(0, j) = 0
end for
Iteration:
for i ∈ {1,…,|x|} and j ∈ {1,…,|y|} do
M(i, j) = eβSxy(i,j)(1 + I
X
(i – 1, j – 1) + I
Y
(i – 1, j – 1) + M(i – 1, j – 1))
I
X
(i, j) = eβgM(i – 1, j) + eβdI
X
(i – 1, j)
I
Y
(i, j) = eβg(M(i, j – 1) + I
X
(i, j – 1)) + eβdI
Y
(i, j – 1)
T
X
(i, j) = M(i – 1, j) + T
X
(i – 1, j)
T
Y
(i, j) = M(i, j – 1) + T
X
(i, j – 1) + T
Y
(i, j – 1)
end for
Termination:
K(x, y) = 1 + T
X
(|x|, |y|) + T
Y
(|x|, |y|) + M(|x|, |y|)
where the parameters g and d are the penalties for gap opening and gap extension, respectively, and S
xy
(i, j) is a scoring function for matching the i-th position in x and the j-th position in y. The design of S
xy
(i, j) impacts the performance of the resulting similarity measure, and will be described later.
At this point, we note that our method can take into account all possible sequence alignments in O(|x||y|) time. If we use the exact Sankoff algorithm instead, it takes prohibitive O(|x|3|y|3) time, which is not practical. In the case of the approximate Sankoff-style algorithms employed in the existing methods, all possible sequence alignments cannot be incorporated to the reduced dynamic programming matrix. Therefore, LA kernel based on the SW algorithm is an efficient way to deal with the ensemble of all possible sequence alignments.
Ensemble of all possible secondary structures
To design a scoring function S
xy
(i, j) for LA kernel, we need secondary structures of x and y. As mentioned above, the Sankoff algorithm models secondary structures simultaneously with sequence alignments which we have already modeled by the SW algorithm. Therefore, we next employ the McCaskill algorithm that only models secondary structures apart from sequence alignments. Although this is an additional approximation, we attempt to improve the reliability by utilizing all possible secondary structures rather than one optimal secondary structure.
For an RNA sequence x, let Θ
x
be the set of all possible secondary structures. Let θ
x
denote one particular secondary structure in Θ
x
. We represent a secondary structure as a set of binary variables θ
x
= {θ
x
(i, j)}1≤
i
<
j
≤|
x
|, where θ
x
(i, j) = 1 means that the i-th position and the j-th position in x form a base pair. For each position 1 ≤ i ≤ |x| in x, we represent the state of base-pairing using three kinds of binary variable: L
x
(i) = ∑
j:j
>
i
θ
x
(i, j) = 1 means that a base pair is formed with one of the downstream positions; R
x
(i) = ∑
j
:
j
<
i
θ
x
(j, i) = 1 means that a base pair is formed with one of the upstream positions; and U
x
(i) = 1 – L
x
(i) – R
x
(i) = 1 means that the position is unpaired. Given a fixed pair of secondary structures, θ
x
and θ
y
, we can measure the similarity between the i-th position in x and the j-th position in y using their state of base pairing:
W
xy
(i, j|θ
x
, θ
y
) = α (L
x
(i)L
y
(j) + R
x
(i)R
y
(j)) + s(x
i
, y
j
) U
x
(i)U
y
(j), (3)
where α ≥ 0 is a weight parameter for structural similarity, and s(x
i
, y
j
) is a substitution matrix for RNA sequences like the RIBOSUM 85-60 matrix [17]. This scoring function takes a non-zero value in three different cases: it takes α when both of the two positions form a base pair with one of their downstream positions, respectively; it takes α when both of the two positions form a base pair with one of their upstream positions, respectively; and it takes s(x
i
, y
j
) when both of the two positions are unpaired.
The McCaskill algorithm defines a probability distribution P(θ
x
|x) over Θ
x
. The binary variables θ
x
(i, j) and {L
x
(i), R
x
(i), U
x
(i)} are converted to the probabilities by taking the expectation over Θ
x
. For θ
x
(i, j), we obtain a base-pairing probability P
x
(i, j) that the i-th and the j-th positions form a base pair:
For {L
x
(i), R
x
(i), U
x
(i)}, we obtain three kinds of probability that the i-th position is paired with one of the downstream/upstream positions, or unpaired, respectively:
We design a scoring function S
xy
(i, j) by taking the expectation of (3) over Θ
x
and Θ
y
:
The proposed method is obtained by combining the normalized LA kernel (2) with the scoring function (4).
It should be noted that our method can take into account all possible secondary structures in O(|x|3 + |y|3) time, thanks to the McCaskill algorithm. Just as in all possible sequence alignments, the exact Sankoff algorithm results in O(|x|3|y|3) time, and the existing methods cannot incorporate all possible secondary structures. Our method requires O(|x||y|) + O(|x|3 + |y|3) time in total, which is more efficient than the exact Sankoff algorithm. Therefore, our strategy that combines the SW algorithm and the McCaskil algorithm allows to utilize the ensemble information with the reasonable computational cost.
Variations of the proposed method
The scoring function (4) proposed in this study is similar to the scoring function used in BPLA kernel [18, 19]. BPLA kernel is a prediction method that we previously developed for detecting new members of known ncRNA families. Although BPLA kernel was not applied to clustering problems in our previous study, we here clarify its relation to the proposed method. The scoring function used in BPLA kernel is defined as follows:
where
,
, and
. Therefore, the scoring function (5) can be regarded as a variation of the proposed scoring function (4) with the additional coefficients CL, CR, and CU. These coefficients take large values when the probabilities
and
are small. That is, BPLA kernel emphasizes the contribution of low-probability (unsure) secondary structures compared to the proposed method. In the next section, we experimentally verify this theoretical implication; the proposed method outperforms BPLA kernel.
Because of the resemblance between the scoring functions, (4) and (5), we set the parameters of the proposed method as used in BPLA kernel: α = 1.0, β = 0.1, g = –27, and d = –0.1