The number of reduced alignments between two DNA sequences

Andrade, Helena; Area, Iván; Nieto, Juan J; Torres, Ángela

doi:10.1186/1471-2105-15-94

Research article
Open access
Published: 01 April 2014

Helena Andrade¹,
Iván Area²,
Juan J Nieto^1,3 &
…
Ángela Torres⁴

BMC Bioinformatics volume 15, Article number: 94 (2014) Cite this article

Abstract

Background

In this study we consider DNA sequences as mathematical strings. Total and reduced alignments between two DNA sequences have been considered in the literature to measure their similarity. Explicit representations for some of these alignment counts have already been obtained.

Results

We present exact, explicit and computable formulas for the number of different possible alignments between two DNA sequences and a new formula for a class of reduced alignments.

Conclusions

A unified approach for a wide class of alignments between two DNA sequences has been provided. The formula is computable and, if complemented by software development, will provide a deeper insight into the theory of sequence alignment and give rise to new comparison methods.

AMS Subject Classification

Primary 92B05, 33C20; Secondary 39A14, 65Q30

Background

Let us consider a DNA sequence as a mathematical string

x = (x_{1}, x_{2}, \dots, x_{n}),

where x_i∈{A,G,C,T} is one of the four nucleotides, i=1,2,…,n, i.e. A denotes adenine, C cytosine, G guanine and T thymine. In these conditions, the sequence x is of length n.

Our main goal is to compare the sequence x with another DNA sequence

y = (y_{1}, y_{2}, \dots, y_{m}),

to measure the similarity between both strings and also to determine their residue-residue correspondences.

Sequence comparison and alignment is a central and crucial tool in molecular biology. For example, Pairwise Sequence Alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid) [1].

For some recent developments and directions we refer the reader to [2–7], and to [8] for a general review of different alignment methods.

To align the sequences CGT and ACTT, one can use EMBOSS Needle for nucleotide sequence [9] that creates an optimal global alignment of the two sequences using the Needleman-Wunsch algorithm to get

\begin{matrix} \text{EMBOSS-001} & 1 & - & C & G & T & 3 \\ & & & | & \cdot & | & \\ \text{EMBOSS-001} & 1 & A & C & T & T & 4 \end{matrix}
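A minimal Python sketch of the Needleman-Wunsch recursion reproduces the alignment shown above. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) and the function name `needleman_wunsch` are assumptions made for illustration only; EMBOSS Needle uses a substitution matrix and affine gap penalties, so this is not its implementation.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings x and y under a simple, assumed scoring scheme.

    Returns (aligned_x, aligned_y, score).
    """
    n, m = len(x), len(y)

    def pair_score(i, j):
        # Score for aligning x[i-1] with y[j-1] (1-based table indices).
        return match if x[i - 1] == y[j - 1] else mismatch

    # score[i][j] = best score for aligning x[:i] with y[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # residue over residue
                score[i - 1][j] + gap,                   # gap in y
                score[i][j - 1] + gap,                   # gap in x
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return "".join(reversed(aligned_x)), "".join(reversed(aligned_y)), score[n][m]


top, bottom, best = needleman_wunsch("CGT", "ACTT")
print(top)     # -CGT
print(bottom)  # ACTT
```

Under this assumed scoring the optimum is unique, so the traceback order does not affect the result for these two sequences; for other inputs ties may be broken differently than in EMBOSS Needle.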

Following Lesk [10], in order to compare the amino acids appearing at corresponding positions in two sequences, their correspondences must be assigned; a sequence alignment is the identification of residue-residue correspondences. For some references on sequence alignment we refer the reader to [10–16].

To compare two sequences, there exist mainly three different possibilities leading to three different numbers of total alignments [10, 11, 13]:

1. The total number of alignments, denoted by f(n,m), which was solved in [13].
2. A gap in one sequence is followed by a gap in the other sequence, as in Alignments 1 and 2 for the sequences x=CGT and y=ACTT (see Tables 1 and 2 below). Considering these two alignments as equivalent to Alignment 3 (see Table 3), which has no gaps in those positions, we obtain the number of reduced alignments, denoted by h(n,m), and obviously h(n,m)<f(n,m). This case was solved in [11], and we give here another representation in terms of hypergeometric series.
3. In the interesting case where Alignments 1 and 2 are equivalent to each other but different from Alignment 3, we have a number of reduced alignments g(n,m), where h(n,m)<g(n,m)<f(n,m). This last case is new and we present an explicit formula for g.

Table 1 Alignment 1

Table 2 Alignment 2

Table 3 Alignment 3

Results and discussion

Number of f(x,y) alignments

The total number of alignments f(n,m) between a sequence x of length n and a sequence y of length m satisfies the following recurrence relation [13]

f (n, m) = f (n - 1, m) + f (n, m - 1) + f (n - 1, m - 1),

with initial conditions f(n,0)=f(0,m)=1 for n,m=0,1,2,…. The solution of the above partial difference equation is given by

f(n, m) = \sum_{k=0}^{\min\{n, m\}} 2^{k} \binom{m}{k} \binom{n}{k},

(see formula (10) in [13]) and the generating function [17, 18] is

F (x, y) = - \frac{1}{xy + x + y - 1} .

Therefore the coefficients f(n,m) in the expansion

F (x, y) = \sum_{n = 0}^{\infty} \sum_{m = 0}^{\infty} f (n, m) x^{n} y^{m}

are given in terms of a hypergeometric series by

f(n, m) = {}_{2}F_{1}(-m, -n; 1; 2).

This relation seems to be new in this form. Here, the generalized hypergeometric series is defined as (see e.g. [19, Chapter 16])

{}_{p}F_{q}(a_{1}, \dots, a_{p}; b_{1}, \dots, b_{q}; d) = \sum_{k=0}^{\infty} \frac{(a_{1})_{k} (a_{2})_{k} \cdots (a_{p})_{k}}{k! \, (b_{1})_{k} (b_{2})_{k} \cdots (b_{q})_{k}} d^{k},

and (A)_k=A(A+1)⋯(A+k−1), with (A)₀=1, denotes the Pochhammer symbol. It is assumed that b_j≠−k in order to avoid singularities in the denominators. If one of the parameters a_j equals a non-positive integer, then the sum becomes a terminating series.
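The three representations of f(n,m) given above (recurrence, binomial sum and terminating hypergeometric series) can be compared numerically. The following Python sketch does so with exact arithmetic; the function names are hypothetical, and the series evaluator is a simple term-ratio loop that assumes at least one upper parameter is a non-positive integer and that no lower parameter vanishes before the series terminates.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def f_recurrence(n, m):
    """f(n, m) from the three-term recurrence with f(n, 0) = f(0, m) = 1."""
    if n == 0 or m == 0:
        return 1
    return (f_recurrence(n - 1, m) + f_recurrence(n, m - 1)
            + f_recurrence(n - 1, m - 1))


def f_binomial(n, m):
    """f(n, m) as the finite sum of 2^k * C(m, k) * C(n, k)."""
    return sum(2**k * comb(m, k) * comb(n, k) for k in range(min(n, m) + 1))


def hypergeometric_terminating(a_params, b_params, z):
    """Exact value of a terminating pFq series, summed term by term."""
    if not any(a <= 0 and a == int(a) for a in a_params):
        raise ValueError("no upper parameter is a non-positive integer")
    total, term, k = Fraction(0), Fraction(1), 0
    while term != 0:
        total += term
        # Ratio of consecutive terms: prod(a + k) / prod(b + k) * z / (k + 1).
        ratio = Fraction(z) / (k + 1)
        for a in a_params:
            ratio *= Fraction(a) + k
        for b in b_params:
            ratio /= Fraction(b) + k
        term *= ratio
        k += 1
    return total


def f_hypergeometric(n, m):
    """f(n, m) as 2F1(-m, -n; 1; 2)."""
    return int(hypergeometric_terminating([-m, -n], [1], 2))


# The sequences CGT (n = 3) and ACTT (m = 4) admit 129 alignments in total.
print(f_recurrence(3, 4), f_binomial(3, 4), f_hypergeometric(3, 4))
```

The values f(n,n) are the central Delannoy numbers 1, 3, 13, 63, 321, …, which gives an independent check of all three functions.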

Number of h(x,y) alignments

In this case, the recurrence relation for the h(n,m) coefficients is [11]

h(n, m) = h(n - 1, m) + h(n, m - 1) - h(n - 2, m - 2), \quad n, m \geq 2,

with initial conditions h(n,0)=h(0,m)=1. Therefore, the generating function [17, 18] is

H (x, y) = \frac{1 - xy}{x^{2} y^{2} - x - y + 1},

and the coefficients in the expansion

H (x, y) = \sum_{n = 0}^{\infty} \sum_{m = 0}^{\infty} h (n, m) x^{n} y^{m}

are given by

h(n, m) = \sum_{i=0}^{A} \frac{(-1)^{i} (m + n - 3i)!}{i! \, (m - 2i)! \, (n - 2i)!} - \sum_{i=0}^{B} \frac{(-1)^{i} (m + n - 3i - 2)!}{i! \, (m - 2i - 1)! \, (n - 2i - 1)!},

where

A = \min\left\{\left[\frac{n}{2}\right], \left[\frac{m}{2}\right]\right\}, \quad B = \min\left\{\left[\frac{n - 1}{2}\right], \left[\frac{m - 1}{2}\right]\right\}.

The above coefficients can be written in terms of (terminating) hypergeometric series as

\begin{aligned} h(n, m) &= \frac{(m + n)!}{m! \, n!} \, {}_{4}F_{3}\left( \begin{array}{c} \frac{1 - m}{2}, -\frac{m}{2}, \frac{1 - n}{2}, -\frac{n}{2} \\ \frac{-m - n}{3}, \frac{-m - n + 1}{3}, \frac{-m - n + 2}{3} \end{array} \,\middle|\, \frac{16}{27} \right) \\ &\quad - \frac{(m + n - 2)!}{(m - 1)! \, (n - 1)!} \, {}_{4}F_{3}\left( \begin{array}{c} \frac{1 - m}{2}, 1 - \frac{m}{2}, \frac{1 - n}{2}, 1 - \frac{n}{2} \\ \frac{-m - n + 2}{3}, \frac{-m - n + 3}{3}, \frac{-m - n + 4}{3} \end{array} \,\middle|\, \frac{16}{27} \right). \end{aligned}
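The explicit double-sum expression for h(n,m) can be cross-checked against the recurrence. The sketch below is an illustration with hypothetical function names. One assumption is made about the boundary: since the recurrence is only stated for n, m ≥ 2, the row and column with index 1 are obtained from the generating function identity (1 − x − y + x²y²) H(x,y) = 1 − xy, which adds a correction of −1 at (n,m) = (1,1) and treats coefficients with negative indices as zero.

```python
from functools import lru_cache
from math import factorial


def h_explicit(n, m):
    """h(n, m) from the two finite sums with upper limits A and B."""

    def partial_sum(p, q):
        # sum_i (-1)^i (p + q - 3i)! / (i! (q - 2i)! (p - 2i)!)
        if p < 0 or q < 0:
            return 0
        total = 0
        for i in range(min(p // 2, q // 2) + 1):
            multinomial = factorial(p + q - 3 * i) // (
                factorial(i) * factorial(q - 2 * i) * factorial(p - 2 * i))
            total += (-1) ** i * multinomial
        return total

    # The second sum is the first one evaluated at (n - 1, m - 1).
    return partial_sum(n, m) - partial_sum(n - 1, m - 1)


@lru_cache(maxsize=None)
def h_recurrence(n, m):
    """h(n, m) from the recurrence, with the boundary taken from H(x, y)."""
    if n < 0 or m < 0:
        return 0
    if n == 0 or m == 0:
        return 1
    correction = -1 if (n, m) == (1, 1) else 0  # numerator term -xy
    return (h_recurrence(n - 1, m) + h_recurrence(n, m - 1)
            - h_recurrence(n - 2, m - 2) + correction)


print(h_explicit(2, 2), h_explicit(3, 3))  # 3 9
```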

Number of g(x,y) alignments

As indicated before, the main aim of this paper is to give an explicit representation in this case. The recurrence relation for the g(n,m) coefficients is [11]

g(n, m) = g(n - 1, m - 1) + g(n - 1, m) + g(n, m - 1) - 2 g(n - 2, m - 2), \quad n, m \geq 2,

with initial conditions g(n,0)=g(0,m)=1. Thus, the generating function [17, 18] is

G (x, y) = \frac{1 - xy}{2 x^{2} y^{2} - xy - x - y + 1} .

(1)
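The Taylor coefficients of the three generating functions F, H and G can be extracted with one generic routine, which is convenient for checking values of f, h and g side by side. The sketch below is an illustration, not the method used in this paper: `series_coefficients` is a hypothetical helper that represents a bivariate polynomial as a dictionary mapping exponent pairs to coefficients and solves denominator × series = numerator coefficient by coefficient, assuming the denominator has constant term 1.

```python
def series_coefficients(numerator, denominator, size):
    """Coefficients c[n][m], 0 <= n, m <= size, of numerator / denominator.

    Both polynomials are dicts {(a, b): coeff} for the monomial x^a y^b.
    """
    if denominator.get((0, 0)) != 1:
        raise ValueError("denominator must have constant term 1")
    c = [[0] * (size + 1) for _ in range(size + 1)]
    for n in range(size + 1):
        for m in range(size + 1):
            total = numerator.get((n, m), 0)
            for (a, b), d in denominator.items():
                # Every entry used here was filled in an earlier iteration.
                if (a, b) != (0, 0) and a <= n and b <= m:
                    total -= d * c[n - a][m - b]
            c[n][m] = total
    return c


ONE = {(0, 0): 1}
ONE_MINUS_XY = {(0, 0): 1, (1, 1): -1}

# F = 1 / (1 - x - y - xy)
f = series_coefficients(ONE, {(0, 0): 1, (1, 0): -1, (0, 1): -1, (1, 1): -1}, 10)
# H = (1 - xy) / (1 - x - y + x^2 y^2)
h = series_coefficients(ONE_MINUS_XY,
                        {(0, 0): 1, (1, 0): -1, (0, 1): -1, (2, 2): 1}, 10)
# G = (1 - xy) / (1 - x - y - xy + 2 x^2 y^2), equation (1)
g = series_coefficients(ONE_MINUS_XY,
                        {(0, 0): 1, (1, 0): -1, (0, 1): -1, (1, 1): -1, (2, 2): 2}, 10)

print(h[2][2], g[2][2], f[2][2])  # 3 8 13
print(g[10][10])                  # 2003204
```

The printed values illustrate the ordering h(n,m)<g(n,m)<f(n,m) at (2,2).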

Theorem 1

The coefficients α_{n,m} in the expansion

G (x, y) = \frac{1 - xy}{2 x^{2} y^{2} - xy - x - y + 1} = \sum_{n = 0}^{\infty} \sum_{m = 0}^{\infty} α_{n, m} x^{n} y^{m}

(2)

are explicitly given by

α_{n, m} = \left( \sum_{i = U(n, m)}^{n + m} \; \sum_{j = A(i, n, m)}^{B(i, n, m)} β_{i, j, n, m} \right) - \left( \sum_{i = U(n, m) - 1}^{n + m - 2} \; \sum_{j = C(i, n, m)}^{D(i, n, m)} γ_{i, j, n, m} \right),

(3)

where

\begin{matrix} β_{i, j, n, m} = \frac{{(- 1)}^{i - j} 2^{i - j} i!}{(i - j)! (2 i - j - m)! (2 i - j - n)! (3 j - 4 i + m + n)!}, \end{matrix}

(4)

\begin{matrix} γ_{i, j, n, m} = \frac{{(- 1)}^{i - j} 2^{i - j} i!}{(i - j)! (2 i - j - m + 1)! (2 i - j - n + 1)! (3 j - 4 i + m + n - 2)!}, \end{matrix}

(5)

A (i, n, m) = max \{0, [\frac{4 i - m - n}{3}]\},

(6)

B (i, n, m) = min \{i, 2 i - m, 2 i - n, [\frac{4 i - n - m}{2}]\},

(7)

C (i, n, m) = max \{0, [\frac{4 i - m - n - 2}{3}]\},

(8)

D(i, n, m) = \min\left\{i, 2i - m + 1, 2i - n + 1, \left[\frac{4i - n - m + 2}{2}\right]\right\},

(9)

U(n, m) = \begin{cases} m - [n / 2], & n \leq m, \\ [(m + 1) / 2] + n - m, & n \geq m, \end{cases}

(10)

and [x] denotes the integer part of x.
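Formula (3) can be evaluated directly with integer arithmetic. The Python sketch below is one possible illustration under stated assumptions, not the implementation used for the timings reported in this paper. Instead of the summation limits (6) to (10), it loops over the simpler ranges 0 ≤ i ≤ n+m and 0 ≤ j ≤ i and treats any term with a negative factorial argument as zero, which selects the same non-vanishing terms. It also uses the observation that γ_{i,j,n,m} in (5) equals β_{i,j,n−1,m−1}. All function names are hypothetical.

```python
from math import factorial


def beta(i, j, n, m):
    """Term (4); taken to be zero when any factorial argument is negative."""
    arguments = (i - j, 2 * i - j - m, 2 * i - j - n, 3 * j - 4 * i + m + n)
    if min(arguments) < 0:
        return 0
    denominator = 1
    for a in arguments:
        denominator *= factorial(a)
    # The four arguments add up to i, so the quotient is a multinomial
    # coefficient and the integer division is exact.
    return (-2) ** (i - j) * (factorial(i) // denominator)


def beta_sum(n, m):
    """Double sum of beta over all admissible i and j."""
    if n < 0 or m < 0:
        return 0
    return sum(beta(i, j, n, m)
               for i in range(n + m + 1)
               for j in range(i + 1))


def g_explicit(n, m):
    """alpha_{n,m} = g(n, m): first double sum minus the shifted second one."""
    return beta_sum(n, m) - beta_sum(n - 1, m - 1)


print(g_explicit(1, 1))    # 2
print(g_explicit(10, 10))  # 2003204
```

The brute-force ranges cost O((n+m)²) term evaluations per coefficient; the tighter limits (6) to (10) visit fewer index pairs.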

Proof

If we expand,

\begin{aligned} G(x, y) &= (1 - xy) \sum_{i=0}^{\infty} (x + y + xy - 2x^{2}y^{2})^{i} \\ &= (1 - xy) \sum_{i=0}^{\infty} \sum_{j=0}^{i} \sum_{k=0}^{j} \sum_{s=0}^{k} (-1)^{i-j} 2^{i-j} \binom{i}{j} \binom{j}{k} \binom{k}{s} y^{2i-j-s} x^{2i-j-k+s}, \end{aligned}

(11)

we have two summands to be computed, namely

\sum_{i=0}^{\infty} \sum_{j=0}^{i} \sum_{k=0}^{j} \sum_{s=0}^{k} (-1)^{i-j} 2^{i-j} \binom{i}{j} \binom{j}{k} \binom{k}{s} y^{2i-j-s} x^{2i-j-k+s}

(12)

- xy \sum_{i=0}^{\infty} \sum_{j=0}^{i} \sum_{k=0}^{j} \sum_{s=0}^{k} (-1)^{i-j} 2^{i-j} \binom{i}{j} \binom{j}{k} \binom{k}{s} y^{2i-j-s} x^{2i-j-k+s}.

(13)

In order to compute the first sum (12) let us introduce

m = 2 i - j - s, n = 2 i - j - k + s.

(14)

Therefore, the summation to be done reads as

\sum_{n=0}^{\infty} \sum_{m=0}^{\infty} \left( \sum_{i=U}^{V} \sum_{j=A}^{B} (-1)^{i-j} 2^{i-j} \binom{i}{j} \binom{j}{4i - 2j - m - n} \binom{4i - 2j - m - n}{2i - j - m} \right) x^{n} y^{m}

where U, V, A and B must be computed in terms of the initial indices.

The product of binomials can be simplified to

\frac{i!}{(i - j)! (2 i - j - m)! (2 i - j - n)! (3 j - 4 i + m + n)!} .

Thus,

i \geq 0, \quad j \geq 0, \quad 4i - 2j - m - n \geq 0, \quad 2i - j - m \geq 0, \quad i - j \geq 0, \quad 2i - j - n \geq 0, \quad 3j - 4i + m + n \geq 0,

and then

A(i, n, m) = A = \max\left\{0, \left[\frac{4i - m - n}{3}\right]\right\} \leq j \leq \min\left\{i, 2i - m, 2i - n, \left[\frac{4i - n - m}{2}\right]\right\} = B(i, n, m) = B.

Finally, the summation reads as

\sum_{n=0}^{\infty} \sum_{m=0}^{\infty} \left( \sum_{i = U(n, m)}^{n + m} \sum_{j=A}^{B} \frac{(-1)^{i-j} 2^{i-j} i!}{(i - j)! \, (2i - j - m)! \, (2i - j - n)! \, (3j - 4i + m + n)!} \right) x^{n} y^{m},

where

U(n, m) = \begin{cases} m - [n / 2], & n \leq m, \\ [(m + 1) / 2] + n - m, & n \geq m. \end{cases}

Proceeding in the same way with the second summand (13) leads to the final result.
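As a small hand check of (3) to (5), consider n = m = 1. The worked example below assumes the usual convention that a term whose denominator contains the factorial of a negative integer vanishes (for instance i = 1, j = 0 in the first sum), so that only the index pairs listed contribute.

```latex
\begin{align*}
\sum_{i=1}^{2} \sum_{j} \beta_{i,j,1,1}
  &= \beta_{1,1,1,1} + \beta_{2,2,1,1}
   = \frac{1!}{0!\,0!\,0!\,1!} + \frac{2!}{0!\,1!\,1!\,0!} = 1 + 2 = 3, \\
\sum_{i=0}^{0} \sum_{j} \gamma_{i,j,1,1}
  &= \gamma_{0,0,1,1} = \frac{0!}{0!\,0!\,0!\,0!} = 1, \\
\alpha_{1,1} &= 3 - 1 = 2 .
\end{align*}
```

This agrees with the coefficient of xy in the expansion of (1), and lies between h(1,1) = 1 and f(1,1) = 3.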

Some numerical values are g(10,10)=2003204, g(50,50)≈2.71972×10³⁴ and g(100,100)≈7.55997×10⁶⁹, and we note that g(n,n)>10⁸⁰ for n≥115. This last inequality is relevant since 10⁸⁰ is an estimate of the number of protons in the universe [13].
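These diagonal values can be reproduced with exact integer arithmetic. The sketch below is an illustration with hypothetical function names; it fills a table from the coefficient identity implied by (1), namely g(n,m) = g(n−1,m) + g(n,m−1) + g(n−1,m−1) − 2g(n−2,m−2) plus a source term that is +1 at (0,0) and −1 at (1,1), with coefficients of negative index taken as zero, and then searches the diagonal for the first value above a given bound.

```python
def g_diagonal(size):
    """Return the list [g(0,0), g(1,1), ..., g(size,size)]."""
    g = [[0] * (size + 1) for _ in range(size + 1)]

    def at(a, b):
        # Coefficients with a negative index are zero.
        return g[a][b] if a >= 0 and b >= 0 else 0

    for n in range(size + 1):
        for m in range(size + 1):
            # Numerator 1 - xy of the generating function (1).
            source = (1 if n == m == 0 else 0) - (1 if n == m == 1 else 0)
            g[n][m] = (source + at(n - 1, m) + at(n, m - 1)
                       + at(n - 1, m - 1) - 2 * at(n - 2, m - 2))
    return [g[n][n] for n in range(size + 1)]


def first_index_exceeding(values, bound):
    """Smallest index whose value is strictly greater than bound, or None."""
    for index, value in enumerate(values):
        if value > bound:
            return index
    return None


diagonal = g_diagonal(200)
print(diagonal[10])                               # 2003204
print(f"{diagonal[50]:.5e}")                      # compare with g(50,50) above
print(f"{diagonal[100]:.5e}")                     # compare with g(100,100) above
print(first_index_exceeding(diagonal, 10**80))    # compare with n = 115 above
```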

Conclusions

A unified approach for a wide class of alignments between two DNA sequences has been provided. We conclude also that our approach gives an explicit formula filling a gap in the theory of sequence alignment. The formula is computable and, if complemented by software development, will provide a deeper insight into the theory of sequence alignment and give rise to new comparison methods. It may be used also, in the future, to get explicit formulas and compute the number of total, reduced, and effective alignments for multiple sequences.

Methods

We have performed a number of numerical computations to compare our formulae with the Mathematica® [20] command Coefficient applied to the series expansion of (1), on a MacBook Pro featuring a 45 nm “Penryn” 2.66 GHz Intel “Core 2 Duo” processor (P8800), with two independent processor “cores” on a single silicon chip, and 8 GB of 1066 MHz DDR3 SDRAM (PC3-8500). Our approach is markedly faster: for example, g(100,100) is computed in Mathematica® in 0.125165 seconds using the new formulas presented in this paper, whereas the Mathematica® command Coefficient needs 99.167659 seconds.

References

1. The European Bioinformatics Institute: Pairwise Sequence Alignment. http://www.ebi.ac.uk/Tools/psa/
2. Orobitg M, Lladós J, Guirado F, Cores F, Notredame C: Scalability and accuracy improvements of consistency-based multiple sequence alignment tools. EuroMPI. Edited by: Dongarra J, Blas JG, Carretero J. 2013, New York, USA: ACM International Conference Proceeding Series, 259-264.
3. Orobitg M, Cores F, Guirado F, Roig C, Notredame C: Improving multiple sequence alignment biological accuracy through genetic algorithms. J Supercomput. 2013, 65 (3): 1076-1088. 10.1007/s11227-012-0856-9.
4. Montañola A, Roig C, Guirado F, Hernández P, Notredame C: Performance analysis of computational approaches to solve multiple sequence alignment. J Supercomput. 2013, 64 (1): 69-78. 10.1007/s11227-012-0751-4.
5. Zhong C, Zhang S: Efficient alignment of RNA secondary structures using sparse dynamic programming. BMC Bioinformatics. 2013, 14: 269. 10.1186/1471-2105-14-269.
6. Veeneman BA, Iyer MK, Chinnaiyan AM: Oculus: faster sequence alignment by streaming read compression. BMC Bioinformatics. 2012, 13: 297. 10.1186/1471-2105-13-297.
7. Chaisson M, Tesler G: Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): theory and application. BMC Bioinformatics. 2012, 13: 238. 10.1186/1471-2105-13-238.
8. Löytynoja A: Alignment methods: Strategies, challenges, benchmarking, and comparative overview. Evolutionary Genomics. Methods in Molecular Biology. Edited by: Anisimova M. 2012, New York, USA: Humana Press, 203-235.
9. The European Bioinformatics Institute: Pairwise Sequence Alignment (Nucleotide). http://www.ebi.ac.uk/Tools/psa/emboss_needle/nucleotide.html
10. Lesk AM: Introduction to Bioinformatics. 2002, Oxford, UK: Oxford University Press.
11. Andrade H: Análise matemática dalgunhos problemas no estudo de secuencias biolóxicas. PhD thesis, Universidade de Santiago de Compostela, Departamento de Análise Matemática (2013).
12. Bai F, Zhang J, Zheng J: Similarity analysis of DNA sequences based on the EMD method. Appl Math Lett. 2011, 24 (2): 232-237. 10.1016/j.aml.2010.09.010.
13. Cabada A, Nieto JJ, Torres A: An exact formula for the number of alignments between two DNA sequences. DNA Sequence (continued as Mitochondrial DNA). 2003, 14: 427-430.
14. Eger S: Sequence alignment with arbitrary steps and further generalizations, with applications to alignments in linguistics. Inform Sci. 2013, 237: 287-304.
15. Morgenstern B: A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences. Appl Math Lett. 2002, 15 (1): 11-16. 10.1016/S0893-9659(01)00085-4.
16. Zhang J, Wang R, Bai F, Zheng J: A quasi-MQ EMD method for similarity analysis of DNA sequences. Appl Math Lett. 2011, 24 (12): 2052-2058. 10.1016/j.aml.2011.05.041.
17. Srivastava HM, Manocha HL: A Treatise on Generating Functions. Ellis Horwood Series: Mathematics and its Applications. 1984, Chichester: Ellis Horwood Ltd.
18. Wilf HS: Generatingfunctionology. 2006, Wellesley, MA: A K Peters Ltd.
19. Abramowitz M, Stegun IA: Handbook of Mathematical Functions, with Formulas, Graphs, and Mathematical Tables. 1966, New York: Dover Publications Inc.
20. Wolfram Research, Inc.: Mathematica, Version 9.01. 2013, Champaign, Illinois: Wolfram Research, Inc.

Acknowledgements

The authors are grateful to Prof. Marko Petkovšek for helpful comments. The work of I. Area has been partially supported by the Ministerio de Economía y Competitividad of Spain under grant MTM2012–38794–C02–01, co-financed by the European Community fund FEDER. J.J. Nieto also acknowledges partial financial support by the Ministerio de Economía y Competitividad of Spain under grant MTM2010–15314, co-financed by the European Community fund FEDER.

Author information

Authors and Affiliations

Departamento de Análise Matemática, Facultade de Matemáticas, Universidade de Santiago de Compostela, 15782, Santiago de Compostela, Spain
Helena Andrade & Juan J Nieto
Departamento de Matemática Aplicada II, E.E. Telecomunicación, Universidade de Vigo, 36310, Vigo, Spain
Iván Area
Faculty of Science, King Abdulaziz University, P.O. Box 80203, 21589, Jeddah, Saudi Arabia
Juan J Nieto
Departamento de Psiquiatría, Radioloxía e Saúde Pública, Facultade de Medicina, Universidade de Santiago de Compostela, 15782, Santiago de Compostela, Spain
Ángela Torres

Corresponding author

Correspondence to Juan J Nieto.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Each of the authors, HA, IA, JJN and AT, contributed equally to each part of this study and read and approved the final version of the manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Andrade, H., Area, I., Nieto, J.J. et al. The number of reduced alignments between two DNA sequences. BMC Bioinformatics 15, 94 (2014). https://doi.org/10.1186/1471-2105-15-94

Received: 10 January 2014
Accepted: 19 March 2014
Published: 01 April 2014
DOI: https://doi.org/10.1186/1471-2105-15-94
