Table 2 Definition of the adopted homology metrics (Alignment–free)

From: Detection of long non–coding RNA homology, a comparative study on alignment and alignment–free metrics

Metric

Definition

Description

n-gram distance

\({qgram}_{n}(X,Y) = \min \limits _{\substack { \scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}} \left (\frac {\sum _{i}|q^{x}_{i}-q^{y}_{i}|}{len(x)+len(y)}\right)\)

An n-gram is a subsequence of n consecutive characters of a string [48]. If \(\mathbf {q}^{x} = \left (q^{x}_{1}, q^{x}_{2}, \dots, q^{x}_{K}\right)\) is the vector of n-gram occurrence counts in the sequence x, the n-gram distance is the sum of the absolute differences \(|q^{x}_{i}-q^{y}_{i}|\), where \(q^{x}_{i}\) and \(q^{y}_{i}\) are the counts of the i-th unique n-gram in x and y respectively, obtained by sliding a window n characters wide over x and y and recording the n-grams that occur. The time complexity is O(len(x)·len(y)).
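As a minimal sketch of the per-pair term of this definition (the minimization over seq(X) and seq(Y) is left out), the following Python snippet counts n-grams with a sliding window and divides the sum of absolute count differences by len(x)+len(y). The function names `ngram_counts` and `qgram_distance` are illustrative and do not come from the paper.

```python
from collections import Counter


def ngram_counts(seq: str, n: int) -> Counter:
    """Count n-grams by sliding a window of width n over seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def qgram_distance(x: str, y: str, n: int) -> float:
    """Sum of absolute n-gram count differences, divided by len(x) + len(y)."""
    qx, qy = ngram_counts(x, n), ngram_counts(y, n)
    # A Counter returns 0 for n-grams it has not seen, so the union of keys
    # covers every n-gram occurring in either sequence.
    l1 = sum(abs(qx[g] - qy[g]) for g in qx.keys() | qy.keys())
    return l1 / (len(x) + len(y))


if __name__ == "__main__":
    print(qgram_distance("ACGT", "ACGA", 2))
```

The sketch assumes at least one of the two sequences is non-empty, since otherwise the denominator is zero.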

Cosine similarity

\({cosine}_{n}(X,Y) = \max \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}} \frac {\mathbf {q}^{x} \cdot \mathbf {q}^{y}}{\|\mathbf {q}^{x}\|\|\mathbf {q}^{y}\|}\)

The cosine similarity is the cosine of the angle between the two n-gram vectors \(\mathbf {q}^{x}\) and \(\mathbf {q}^{y}\) [40]. The time complexity is O(len(x)+len(y)).
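A possible Python sketch of the per-pair cosine term is shown below, using sparse n-gram count vectors. The names `ngram_counts` and `cosine_similarity` are hypothetical, and returning 0.0 when a sequence is shorter than n (so that its vector is empty) is an assumption of this sketch, not something stated in the table.

```python
import math
from collections import Counter


def ngram_counts(seq: str, n: int) -> Counter:
    """Count n-grams by sliding a window of width n over seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def cosine_similarity(x: str, y: str, n: int) -> float:
    """Cosine of the angle between the n-gram count vectors of x and y."""
    qx, qy = ngram_counts(x, n), ngram_counts(y, n)
    # Only n-grams present in both sequences contribute to the dot product.
    dot = sum(qx[g] * qy[g] for g in qx.keys() & qy.keys())
    norm_x = math.sqrt(sum(c * c for c in qx.values()))
    norm_y = math.sqrt(sum(c * c for c in qy.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty n-gram vector is treated as dissimilar
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    print(cosine_similarity("ACGT", "ACGA", 2))
```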

Jaccard similarity

\({jaccard}_{n}(X,Y) = \max \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}} \left (\frac {\sum _{i} \left (\mathbbm {1}_{q^{x}_{i}>0} + \mathbbm {1}_{q^{y}_{i}>0}\right)}{\sum _{i} \mathbbm {1}_{q^{x}_{i}>0} \cdot \mathbbm {1}_{q^{y}_{i}>0}} - 1\right)^{-1}\)

The Jaccard coefficient measures the similarity between two finite sets, and is defined as the size of the intersection divided by the size of the union of the sample sets [49]. The size is computed from the set of unique n-grams by means of \(\mathbbm {1}_{q^{x}_{i} > 0}\), the indicator function having the value 1 if the i-th n-gram is present in x, 0 otherwise. The time complexity is O(len(x)+len(y)).
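The prose definition (size of the intersection divided by size of the union of the unique n-gram sets) can be sketched in Python as follows. The names `ngram_set` and `jaccard_similarity` are illustrative, and the value returned when both sets are empty is an assumption of this sketch.

```python
def ngram_set(seq: str, n: int) -> set:
    """Set of unique n-grams found by sliding a window of width n over seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def jaccard_similarity(x: str, y: str, n: int) -> float:
    """Size of the intersection over size of the union of the n-gram sets."""
    a, b = ngram_set(x, n), ngram_set(y, n)
    union = a | b
    if not union:
        return 0.0  # assumption: no n-grams at all is treated as no similarity
    return len(a & b) / len(union)


if __name__ == "__main__":
    print(jaccard_similarity("ACGT", "ACGA", 2))
```

Working on sets rather than count vectors corresponds to the indicator function in the definition: an n-gram contributes once if it is present, regardless of how often it occurs.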

Base–base correlation distance

\(BBC(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}\sqrt {\sum _{i=1}^{16}(V_{x_{i}} - V_{y_{i}})^{2}}\)

The base–base correlation measures sequence similarity by computing the Euclidean distance between two 16-dimensional feature vectors, \(V_{x}\) and \(V_{y}\), which contain the mutual information of all base pairs [50]. The time complexity is O(len(x)·len(y)).

Average common substring distance

\(ACS(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}\frac {1}{2} \left (\sum _{i=1}^{len(x)} \frac {lcs(x(i),y)}{len(x)} + \sum _{i=1}^{len(y)} \frac {lcs(y(i),x)}{len(y)}\right)\)

The average common substring distance is based on the average length of the longest common substrings, an approach used for constructing phylogenetic trees [51]. Specifically, lcs(x(i),y) (lcs(y(i),x)) is the length of the longest substring of x (y) that starts at position i of x (y) and exactly matches some substring of y (x). The time complexity is O(len(x)+len(y)).
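To make the role of lcs(x(i),y) concrete, here is a deliberately naive Python sketch that evaluates the bracketed expression of the definition for one pair of sequences. The names `longest_match_from` and `acs_term` are hypothetical. The sketch assumes both sequences are non-empty and does not attain the linear time quoted above, which would require an index structure such as a suffix tree.

```python
def longest_match_from(s: str, i: int, t: str) -> int:
    """Length of the longest prefix of s[i:] that occurs as a substring of t."""
    length = 0
    # If a prefix of length k + 1 occurs in t, so does every shorter prefix,
    # so the prefix can be extended one character at a time.
    while i + length < len(s) and s[i:i + length + 1] in t:
        length += 1
    return length


def acs_term(x: str, y: str) -> float:
    """Symmetric average of the per-position longest match lengths."""
    avg_xy = sum(longest_match_from(x, i, y) for i in range(len(x))) / len(x)
    avg_yx = sum(longest_match_from(y, i, x) for i in range(len(y))) / len(y)
    return 0.5 * (avg_xy + avg_yx)


if __name__ == "__main__":
    print(acs_term("ACGT", "ACGT"))
```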

Lempel–Ziv complexity distance

\(LZ(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}\frac {c(xy)-c(x)+c(yx)-c(y)}{\frac {1}{2}[c(xy)+c(yx)]} \)

The Lempel–Ziv complexity distance is defined in terms of the minimum number of components over all production histories of x and y, c(x) and c(y), and of their concatenations, c(xy) and c(yx) [52]. The time complexity is O(len(x)·len(y)).
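The following Python sketch illustrates one way to compute c(·) and the per-pair distance. It reads the numerator as c(xy) − c(x) + c(yx) − c(y), in line with the concatenations named in the description, and counts components by a left-to-right parse in which each component is the shortest substring not reproducible from the preceding text. The names `lz_complexity` and `lz_distance` are illustrative, and the substring search is naive rather than optimized.

```python
def lz_complexity(s: str) -> int:
    """Number of components in a left-to-right Lempel-Ziv parse of s."""
    i, components, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while it can still be copied from the text
        # that precedes its last character.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        components += 1
        i += length
    return components


def lz_distance(x: str, y: str) -> float:
    """(c(xy) - c(x) + c(yx) - c(y)) divided by half of (c(xy) + c(yx))."""
    c_xy, c_yx = lz_complexity(x + y), lz_complexity(y + x)
    numerator = c_xy - lz_complexity(x) + c_yx - lz_complexity(y)
    return numerator / (0.5 * (c_xy + c_yx))


if __name__ == "__main__":
    print(lz_complexity("AACGTACCATTG"), lz_distance("ACGT", "ACGT"))
```

The sketch assumes at least one of the two sequences is non-empty, so that the denominator is positive.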

Jensen–Shannon distance

\(JSD(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}\frac {1}{2}KL(V_{x},V_{M}) + \frac {1}{2}KL(V_{y},V_{M})\)

The Jensen–Shannon distance is computed by averaging the Kullback–Leibler divergence (KL) of \(V_{x}\) with respect to \(V_{M}\) and of \(V_{y}\) with respect to \(V_{M}\), where \(V_{x}\) and \(V_{y}\) are the same 16-dimensional feature vectors defined for BBC, and \(V_{M} = \frac {V_{x}+V_{y}}{2}\) [41]. The time complexity is O(len(x)+len(y)).
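The averaging of the two KL terms can be sketched in Python as below. This is only an illustration under stated assumptions: the inputs are taken to be non-negative vectors that each sum to 1 (the table does not say how the feature vectors are normalized), the logarithm is taken in base 2, and the names `kl_divergence` and `jensen_shannon` are hypothetical. Toy two-dimensional vectors stand in for the 16-dimensional ones.

```python
import math


def kl_divergence(p, m):
    """Kullback-Leibler divergence of p with respect to m (base-2 logarithm)."""
    # Entries with p_i == 0 contribute nothing to the sum and are skipped.
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def jensen_shannon(vx, vy):
    """Average of KL(vx, vm) and KL(vy, vm), with vm the midpoint vector."""
    vm = [(a + b) / 2 for a, b in zip(vx, vy)]
    return 0.5 * kl_divergence(vx, vm) + 0.5 * kl_divergence(vy, vm)


if __name__ == "__main__":
    print(jensen_shannon([0.5, 0.5], [0.9, 0.1]))
```

Because the midpoint vector is positive wherever either input is positive, the ratio inside the logarithm is always well defined for the entries that are summed.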

Hamming distance

\(HDist(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}hd(r(x),r(y))\)

The Hamming distance is defined between two strings of the same length as the number of positions in which corresponding values are different. We adopt two bit strings of length n, namely r(x) and r(y), representing the regulatory transcriptional machinery of x and y respectively, where n is the number of all transcription factors available in JASPAR [24]. Each position i of these bit strings is equal to 1 if the i-th transcription factor binds the promoter and 0 otherwise. The time complexity is O(n).
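As a small illustration of hd(r(x), r(y)), the Python sketch below compares two binding profiles position by position. The function name `hamming_distance` and the five-entry toy profiles are made up for the example; real profiles would have one entry per JASPAR transcription factor.

```python
def hamming_distance(rx, ry) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    if len(rx) != len(ry):
        raise ValueError("bit vectors must have the same length")
    return sum(a != b for a, b in zip(rx, ry))


if __name__ == "__main__":
    # Hypothetical binding profiles over five transcription factors:
    # 1 means the factor binds the promoter, 0 means it does not.
    r_x = [1, 0, 1, 1, 0]
    r_y = [1, 1, 0, 1, 0]
    print(hamming_distance(r_x, r_y))
```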

  1. X and Y are two candidate long non-coding genes; seq(X) and seq(Y) are the sets of representative sequences of X and Y respectively (promoter or transcript); len(x) and len(y) are the lengths of sequences x and y respectively. Where applicable, a metric is normalized with respect to the sum of the sequence lengths [42] and is minimized (maximized) for distance (similarity) metrics over all pairs of transcript sequences \(x \in seq(X), y \in seq(Y)\).
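The outer minimization (for distances) or maximization (for similarities) over all sequence pairs could be written generically as in the Python sketch below. The helper name `best_over_pairs` and its `kind` argument are assumptions made for illustration, and the length-difference metric in the usage lines is a stand-in, not one of the metrics in the table.

```python
from itertools import product


def best_over_pairs(seqs_x, seqs_y, metric, kind: str) -> float:
    """Aggregate a per-pair metric over every (x, y) in seqs_x x seqs_y.

    kind == "distance" keeps the smallest value, kind == "similarity"
    keeps the largest one.
    """
    scores = [metric(x, y) for x, y in product(seqs_x, seqs_y)]
    if kind == "distance":
        return min(scores)
    if kind == "similarity":
        return max(scores)
    raise ValueError("kind must be 'distance' or 'similarity'")


if __name__ == "__main__":
    # Stand-in metric for demonstration only: absolute length difference.
    def toy(x, y):
        return abs(len(x) - len(y))

    print(best_over_pairs(["A", "ACGT"], ["AC"], toy, "distance"))
    print(best_over_pairs(["A", "ACGT"], ["AC"], toy, "similarity"))
```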