Detection of long non–coding RNA homology, a comparative study on alignment and alignment–free metrics

BMC Bioinformatics

Table 1 Definition of the adopted homology metrics (Alignment–based)

Metric	Definition	Description
Smith–Waterman similarity	\(SW(X,Y) = \max \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}} \left (\frac {sw(x,y)}{len(x)+len(y)}\right)\)	The Smith–Waterman similarity sw(x,y) is the maximum score of a local alignment between the two strings, where the alignment is built from insertions, deletions, and substitutions of single characters [46]. Deletions/insertions (gaps) receive a zero score, matches are rewarded with +5, and substitutions are penalized with -4 (NUC 4.4 substitution matrix). The time complexity is O(len(x)·len(y)).
Damerau–Levenshtein distance	\(DLevDist(X,Y) = \min \limits _{\substack {\scriptscriptstyle x \in seq(X)\\ \scriptscriptstyle y \in seq(Y)}} \left (\frac {dl(x,y)}{len(x)+len(y)}\right)\)	The Damerau–Levenshtein distance dl(x,y) is given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters [47]. The time complexity is O(len(x)·len(y)).

X and Y are two candidate long non-coding genes; seq(X) and seq(Y) are the sets of representative sequences of X and Y, respectively (promoter or transcript); len(x) and len(y) are the lengths of sequences x and y, respectively. Where applicable, a metric is normalized by the sum of the sequence lengths [42] and is minimized (for distance metrics) or maximized (for similarity metrics) over all pairs of sequences x∈seq(X), y∈seq(Y).
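
The two definitions in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the function names (sw_score, osa_distance, gene_similarity, gene_distance) are hypothetical. Two simplifying assumptions apply: the scoring uses only the +5/-4 match/mismatch values and a zero gap score quoted in the table, not the full NUC 4.4 matrix with ambiguity codes, and the distance is the optimal string alignment (restricted) variant of Damerau–Levenshtein, which never edits a transposed pair a second time.

```python
def sw_score(x, y, match=5, mismatch=-4, gap=0):
    """Local alignment score sw(x, y) with a linear gap penalty (assumed 0)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(0,                  # restart the local alignment
                          prev[j - 1] + sub,  # match or substitution
                          prev[j] - gap,      # gap in y
                          curr[j - 1] - gap)  # gap in x
            best = max(best, curr[j])
        prev = curr
    return best


def osa_distance(x, y):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # transposition of two adjacent characters
            if i > 1 and j > 1 and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


def gene_similarity(seqs_x, seqs_y):
    """SW(X, Y): maximum length-normalized score over all sequence pairs."""
    return max(sw_score(x, y) / (len(x) + len(y))
               for x in seqs_x for y in seqs_y)


def gene_distance(seqs_x, seqs_y):
    """DLevDist(X, Y): minimum length-normalized distance over all pairs."""
    return min(osa_distance(x, y) / (len(x) + len(y))
               for x in seqs_x for y in seqs_y)


# Toy sequences, not real transcripts.
print(gene_similarity(["ACGT"], ["ACGT"]))        # 20 / 8 = 2.5
print(gene_distance(["ACGT", "AAAA"], ["ACTG"]))  # 1 / 8 = 0.125
```

Both dynamic programs fill a len(x) by len(y) table, which matches the O(len(x)·len(y)) time complexity stated in the table; the gene-level functions add a further factor equal to the number of sequence pairs compared.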

ISSN: 1471-2105