Detection of long non–coding RNA homology, a comparative study on alignment and alignment–free metrics

Noviello, Teresa M. R.; Di Liddo, Antonella; Ventola, Giovanna M.; Spagnuolo, Antonietta; D’Aniello, Salvatore; Ceccarelli, Michele; Cerulo, Luigi

doi:10.1186/s12859-018-2441-6

BMC Bioinformatics

Table 2 Definition of the adopted homology metrics (Alignment–free)


Metric	Definition	Description
n-gram distance	\({qgram}_{n}(X,Y) = \min \limits _{\substack { \scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}} \left (\frac {\sum _{i}\left |q^{x}_{i}-q^{y}_{i}\right |}{len(x)+len(y)}\right)\)	An n-gram is a subsequence of n consecutive characters of a string [48]. If \(\mathbf {q}^{x} = \left (q^{x}_{1}, q^{x}_{2}, \dots, q^{x}_{K}\right)\) is the vector of n-gram counts for the sequence x, the n-gram distance is the sum of the absolute differences \(\left |q^{x}_{i}-q^{y}_{i}\right |\), where \(q^{x}_{i}\) and \(q^{y}_{i}\) are the counts of the i-th unique n-gram in x and y respectively, obtained by sliding a window n characters wide over x and y and recording the n-grams that occur. The time complexity is O(len(x)+len(y)).
Cosine similarity	\({cosine}_{n}(X,Y) = \max \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}} \frac {\mathbf {q}^{x} \cdot \mathbf {q}^{y}}{\|\mathbf {q}^{x}\|\|\mathbf {q}^{y}\|}\)	The cosine similarity is the cosine of the angle between the two n-gram vectors \(\mathbf {q}^{x}\) and \(\mathbf {q}^{y}\) [40]. The time complexity is O(len(x)+len(y)).
Jaccard similarity	\({jaccard}_{n}(X,Y) = \max \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}} \left (\frac {\sum _{i} \left (\mathbbm {1}_{q^{x}_{i}>0} + \mathbbm {1}_{q^{y}_{i}>0}\right)}{\sum _{i} \mathbbm {1}_{q^{x}_{i}>0} \cdot \mathbbm {1}_{q^{y}_{i}>0}} - 1\right)^{-1}\)	The Jaccard coefficient measures the similarity between two finite sets and is defined as the size of the intersection divided by the size of the union of the sample sets [49]. The sizes are computed from the sets of unique n-grams by means of \(\mathbbm {1}_{q^{x}_{i} > 0}\), the indicator function that takes the value 1 if the i-th n-gram is present in x and 0 otherwise. The time complexity is O(len(x)+len(y)).
Base–base correlation distance	\(BBC(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}\sqrt {\sum _{i=1}^{16}(V_{x_{i}} - V_{y_{i}})^{2}}\)	The base–base correlation measures sequence similarity as the Euclidean distance between two 16-dimensional feature vectors, \(V_{x}\) and \(V_{y}\), which contain the mutual information of all base pairs [50]. The time complexity is O(len(x)·len(y)).
Average common substring distance	\(ACS(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}\frac {1}{2} \left (\sum _{i=1}^{len(x)} \frac {lcs(x(i),y)}{len(x)} + \sum _{i=1}^{len(y)} \frac {lcs(y(i),x)}{len(y)}\right)\)	The average common substring measure averages the lengths of the longest common substrings and has been used for constructing phylogenetic trees [51]. Specifically, lcs(x(i),y) is the length of the longest substring of x that starts at position i and exactly matches some substring of y; lcs(y(i),x) is defined symmetrically. The time complexity is O(len(x)+len(y)).
Lempel–Ziv complexity distance	\(LZ(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}\frac {c(xy)-c(x)+c(yx)-c(y)}{\frac {1}{2}[c(xy)+c(yx)]} \)	The Lempel–Ziv complexity distance is defined in terms of the minimum number of components over all production histories of x and y, c(x) and c(y), and of their concatenations, c(xy) and c(yx) [52]. The time complexity is O(len(x)·len(y)).
Jensen–Shannon distance	\(JSD(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}\frac {1}{2}KL(V_{x},V_{M}) + \frac {1}{2}KL(V_{y},V_{M})\)	The Jensen–Shannon distance is computed by averaging the Kullback–Leibler Divergence (KL) of V_x with respect to V_M and V_y with respect to V_M, where V_x and V_y are the same 16-dimensional feature vectors defined for BBC, and \(V_{M} = \frac {V_{x}+V_{y}}{2}\) [41]. The time complexity is O(len(x)+len(y)).
Hamming distance	\(HDist(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}}hd(r(x),r(y))\)	The Hamming distance between two strings of the same length is the number of positions at which the corresponding values differ. We adopt two bit strings of length n, r(x) and r(y), representing the regulatory transcriptional machinery of x and y respectively, where n is the number of transcription factors available in JASPAR [24]. Position i of each bit string is 1 if the i-th transcription factor binds the promoter and 0 otherwise. The time complexity is O(n).

X and Y are two candidate long non-coding genes; seq(X) and seq(Y) are the sets of representative sequences of X and Y respectively (promoter or transcript); len(x) and len(y) are the lengths of sequences x and y respectively. Where applicable, a metric is normalized with respect to the sum of the sequence lengths [42]. Distance metrics are minimized, and similarity metrics maximized, over all pairs of sequences x∈seq(X), y∈seq(Y).
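
To make the three n-gram based definitions in the table concrete, the following is a minimal Python sketch, not the authors' implementation. The function names (`ngram_counts`, `qgram_distance`, `cosine_similarity`, `jaccard_similarity`, `best_over_pairs`) are hypothetical, and returning 0.0 when a sequence is shorter than n (so that no n-gram exists) is an assumption the table does not address.

```python
from collections import Counter
from math import sqrt


def ngram_counts(seq, n):
    """Count n-grams by sliding a window of width n over seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def qgram_distance(x, y, n):
    """Sum of absolute count differences, normalized by len(x) + len(y)."""
    qx, qy = ngram_counts(x, n), ngram_counts(y, n)
    # A Counter returns 0 for n-grams it has not seen.
    diff = sum(abs(qx[g] - qy[g]) for g in qx.keys() | qy.keys())
    return diff / (len(x) + len(y))


def cosine_similarity(x, y, n):
    """Cosine of the angle between the two n-gram count vectors."""
    qx, qy = ngram_counts(x, n), ngram_counts(y, n)
    dot = sum(qx[g] * qy[g] for g in qx.keys() & qy.keys())
    norm = sqrt(sum(v * v for v in qx.values())) * sqrt(sum(v * v for v in qy.values()))
    return dot / norm if norm else 0.0  # assumption: 0.0 when a vector is empty


def jaccard_similarity(x, y, n):
    """Shared unique n-grams divided by all unique n-grams."""
    sx, sy = set(ngram_counts(x, n)), set(ngram_counts(y, n))
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0  # assumption: 0.0 if empty


def best_over_pairs(metric, seqs_x, seqs_y, n, pick):
    """Apply pick (min for distances, max for similarities) over all pairs."""
    return pick(metric(x, y, n) for x in seqs_x for y in seqs_y)
```

For example, `best_over_pairs(qgram_distance, seqs_x, seqs_y, 3, min)` would take the smallest 3-gram distance over all pairs of representative sequences, while passing `max` with `cosine_similarity` or `jaccard_similarity` mirrors the maximization used for similarity metrics.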


ISSN: 1471-2105
