Overview
Figure 1 shows the flow of Cgaln. First, Cgaln divides the input sequences into "blocks" with a fixed length (Figure 1A). These blocks are taken as "elements" to be aligned at the block level. Each cell of the meshlike structure is associated with a block-to-block similarity score. The similarity between two blocks, each from the two input sequences, is evaluated by the frequency of seeds (k-mers) commonly found in the blocks. Similar methods based on seed counts have been used to quickly estimate the degree of similarity between two protein sequences [22, 23]. Based on these similarity scores, the block-level alignments are obtained by a local DP algorithm (Figure 1B). For the local DP, we apply the Smith-Waterman algorithm [24] modified so that sub-optimal similarities are also reported [25]. The size of a block is arbitrary within the limits that both the number of blocks and the size of each block should be < 216 in order to be represented by unsigned short integers.
The nucleotide-level alignment is conducted on the restricted regions included in the block-level alignment found in the first stage (Figure 1C). We adopt a seed-extension strategy widely used in homology search programs such as Blast [26]. The obtained seed matches are integrated into gapless HSPs. The HSPs are then filtered and chained to conform to coherent alignment. If necessary, the chained HSPs may serve as anchor points, the subsequences between which are aligned by a standard DP algorithm or recursive search for shorter seeds.
The above-mentioned algorithm implicitly assumes that genomic rearrangements smaller than the block size of Cgaln (about 10 Kb) are rare, and they are not searched for with the default settings. To discover rearrangements smaller than the block size, we prepared the "-sr" option that considers inversions as small as a few hundred bases at the recursive phase applied to inter-HSP regions (see below).
Cgaln accepts two single- or multi-fasta files. When either or both files are in multi-fasta form, Cgaln aligns every pair of single-fasta entry sequences and outputs the united results. In this section, we assume for simplicity that both input sequences are single-fasta files. By setting the "-r" option, Cgaln examines both orientations of a sequence in a single run.
Seed designing
Cgaln uses the "spaced seed" proposed in PatternHunter [27] for fast and sensitive seed matching. A k/w spaced seed is a discrete series of nucleotides of length w in which k <w positions are examined for nucleotide matching. Cgaln uses an 11/18 spaced seed by default, with the same pattern as that used in PatternHunter (expressed as 111*1**1*1**11*111, where "1" and "*" respectively indicate the positions to be examined or ignored). This pattern was designed to be most sensitive for a pair of randomly generated sequences with 70% nucleotide identities. Although some other seed patterns are more sensitive than this pattern under some conditions (e.g., for coding regions), we chose this seed design so that Cgaln could be generally applicable to whole genome sequences of various species. Optionally, a 12/19 or 13/20 spaced seed can be used, as suggested by Mak and Benson [28] (expressed as 1111*1*1**11**1*111 and 1111*1**11**11*1*111, respectively). We use the term k-mer hereafter to refer to these k/w spaced seeds, as the value for k is most important. A larger k is appropriate for longer sequences but requires more memory, and hence there is a compromise in the choice of the best value for k. The "code" of each seed is its quaternary expression calculated from the k examined positions by converting nucleotides A, C, G, and T into numerals 0, 1, 2, and 3, respectively. Thus, the code of a k/w seed is between 0 and 4k- 1.
Block-level alignment (BA)
In this subsection, we describe block-level alignment (abbreviated as BA hereafter).
Measuring the score between two blocks
Let us denote the given input genomic or chromosomal (single-fasta) sequences G
a
and G
b
. Let L
a
and L
b
be the lengths of G
a
and G
b
, respectively. First, Cgaln divides G
a
and G
b
into blocks with the fixed length of J, except that the last block may be shorter than J. The numbers of blocks in G
a
and G
b
are denoted as m
a
and m
b
, and thus m
a
= ⌈L
a
/J⌉ and m
b
= ⌈L
a
/J⌉. Let
be the x-th block of G
a
and
be the y-th block of G
b
(1 ≤ x ≤ m
a
, 1 ≤ y ≤ m
b
).
The score M
x,y
, the measure of similarity between
and
, is defined as the summation of the scores of seeds commonly found in both
and
. The score of seed k
i
is evaluated by the probability of the chance of finding one or more matches of k
i
in a block pair. Let
be the total number of k
i
in G
a
, then the expected mean number of occurrences of k
i
in a block
is
We assume the Poisson distribution for the probability of k
i
occurring f
i
times in a block
, i.e.,
p
b
(f
i
) is obtained analogously. When k
i
occurs f
i
times in
and h
i
times in
, k
i
matches f
i
h
i
times between
and
. It is clearly an overestimation to use this number as the measure of similarity between
and
, because true homologous matches should line up around a single diagonal. In such a case, we regard min(f
i
, h
i
) as the number of matches as suggested by [23]. Thus, the score of seed k
i
is
where the probability of seed k
i
occurring more than or equal to x times in a block is
Summing up s(k
i
) for all k-mers, we obtain the similarity score as:
Local DP alignment at block level
After obtaining M
x,y
for (1 ≤ x ≤ m
a
and 1 ≤ y ≤ m
b
), the local alignment of BA is conducted by using local DP as follows:
where d is the gap penalty. In practice, Mx, yis calculated in the process of this DP procedure to save the required memory from O(m
a
m
b
) to O(min(m
a
, m
b
)). To obtain the optimal and suboptimal locally best-matched alignments, we use the algorithm proposed by Gotoh [25]. This algorithm assigns each block cell to a "colony", which is a candidate for local alignment. To define the colony borders, we apply the X-drop-off approach [29]. At a certain block cell, if the score Fx, yincreases from 0 to a positive value, a new colony starts. A colony is "significant" if Fx, y> T
col
somewhere in the colony, where T
col
denotes a predefined threshold. A colony ends when Fx, ybecomes ≤ 0 or falls by more than T
col
from the maximum score of the colony, then Fx, yis reset to 0. Only significant colonies are retained for further analysis. This method can greatly reduce the storage requirement, while the computational time remains O(m
a
m
b
).
It should be noted that Mx, yis always ≥ 0 even between nonhomologous blocks, while it is a prerequisite that, on average, the similarity score must be less than zero for the local DP algorithm to work properly [30]. Consequently, we subtract a constant bias B from each Mx, y, so that
, i.e., the X-drop-off should finish after passing through, on average, a unrelated cells (a = 5 by default). If there are repetitive sequences that escaped masking, BA may output them many times. Cgaln has an option for suppressing such repetitious outputs. If there are inconsistent colonies that overlap each other, the "-fc" option filters all of them out except for the one with the highest scored.
Tables of input sequences
BA uses four kinds of tables: a "seed table", an "index table", and two "Poisson tables". The seed table stores the number of occurrences of each seed k
i
in a genomic sequence, whereas the index table stores a list of blocks where each k
i
resides. The seeds occurring more than a certain number of times (1024 by default) are omitted in subsequent analyses as highly repetitive sequences. The two Poisson tables respectively store the probabilities that a fixed number of k
i
s (p(x
i
)) and more than a fixed number of k
i
s (p(≥ x
i
)) occur in a block, where the probabilities are calculated based on the Poisson distribution. The probabilities of occurrences exceeding an upper limit (3 by default) are ignored as repeats. It is sufficient to construct these tables only once for each genomic sequence and for each orientation. They are stored in binary files and read into memory at run time. By using these tables, seed matching on G
a
can finish in O(L
a
/4k) per k-mer in G
b
[31]. Thus, the similarity measure matrix, Mx, y(x = 1.. m
a
, y = 1.. m
b
), is computed in O(L
a
L
b
/4k).
Nucleotide-level alignment (NA)
Cgaln applies the nucleotide-level alignment (abbreviated as NA hereafter) within the restricted areas that are composed of block cells included in the local alignments of BA. However, the area covered by the set of block cells thus obtained is insufficient for NA, because the expected results of NA may shift slightly away from the aligned cells. Hence, we extend the area to be searched by NA to the "envelope" of the block cells found in the first stage (see Figure 2A).
Generating and chaining HSPs
At NA, Cgaln adopts a seed-finding approach with the 11-mer spaced seed. Figure 2B shows NA within a cell. First, the seed matches are searched for by using an index table again. This index table stores the list of positions at the nucleotide level within the restricted region, while the index table for BA stores genome-wide positions at the block level. A group of matches are integrated into one larger matching segment if the matches are closer to each other than a given threshold (20 by default) with no gap (i.e., if they lie on the same diagonal in the dot matrix). We define such a gapless matching segment as an HSP. Lonesome matches, individual matches that are not integrated with any other matches, are filtered out. The score of each HSP is defined as usual as the sum of scores assigned to individual matches and mismatches included in it. The masked bases are omitted. Third, the HSPs are extended to both sides with no gap until the HSP score drops below a threshold. Finally, the extended HSPs are chained by computing a maximal-scoring collinear subset of them by a sparse-DP algorithm [32]. This step also helps to eliminate noisy HSPs. Cgaln penalizes a predefined value, pen
overlap
= max(overlap length - 10, 0) by default, for partially inconsistent (overlapping in either projection) HSP pairs. Cgaln may output some inconsistent but high-scoring HSPs. If one wants "unique" outputs, e.g. to detect SNPs, the "-cs" option of Cgaln increases this penalty to infinity and suppresses any outputs that are inconsistent with the highest-scoring alignment.
Iterative alignment within inter-HSP regions
Additionally, Cgaln aligns iteratively the regions between neighboring HSPs to improve sensitivity. First, the HSPs are extended with gaps toward both sides. We adopt DP with X-drop-off for this gapped extension. The second step varies with the length of the inter-HSP region. (1) If an inter-HSP region is shorter than the lower-threshold T
low
(50 bp by default), Cgaln uses standard global DP. (2) Or else, if it is below a higher-threshold T
high
(twice the block size by default), Cgaln applies the seed finding and chaining approaches again with shorter-spaced seeds (7-mer by default), and the interval regions are aligned by global DP if the longer interval is less than the given threshold T
dp
(200 bp by default). (3) If the inter-HSP region is longer than T
high
, that region is left unaligned.
At the iterative alignment step, Cgaln unmasks the repetitive sequences because there might be homologous regions in repetitive (masked) regions. By default, the DP match/mismatch scores of Cgaln are set to be identical to those of Blastz, derived by Chiaromonte et al. [33] with the gap open and extension penalties of 400 and 30, respectively.
Evaluation method
It is difficult to evaluate the accuracy of genomic alignment because of the lack of "true" alignment data [34]. In this paper, we focused our attention on the orthologous protein-coding genes and corresponding coding exons. This approach has obvious drawbacks, as most genomic alignment programs, including Cgaln, are designed to find not only orthologs but also other homologs. Moreover, alignment of non-coding genes and intergenic regions can be misinterpreted in our procedure. However, we expect that this kind of imprecision would not much affect the evaluation of the relative performance of various methods.
Figure 3 schematically illustrates a case of alignment results. For a bacterial genome pair, we considered how many homologous base pairs in the reference alignment of orthologous gene pairs are correctly aligned by a given method for calculating sensitivity and specificity. A true positive (TP) indicates the number of genomically aligned bases that coincide with those in the reference alignment, a false negative (FN) indicates the number of bases in the reference alignment not observed in the genomic alignment, and a false positive (FP) indicates the number of genomically aligned bases outside the reference alignment. Thus, the sensitivity (Sn) and specificity (Sp) are defined as:
For the example shown in Figure 3,
and
For a mammalian chromosomal pair, sensitivity is evaluated as before, in which homologous exons are considered homologous regions. However, it is dificult to evaluate specificity because the biologically significant regions that hold large amounts of the entire genome are not clear. In this examination, we counted the total length of generated HSPs ((b) + (c) + (e) + (f) in Figure 3) as an indicator of specificity, for if the total HSP length from one alignment program is considerably greater than that from other programs with the same sensitivity, the program may be judged to have low specificity. We also evaluated specificity with a human chromosome 20 - mouse chromosome 2 pair. Human chromosome 20 is considered completely orthologous to mouse chromosome 2 [35], and the alignment result of human chromosome 20 and mouse chromosome 2 should coincide with that of human chromosome 20 and the whole genomic sequence of mouse. Therefore, specificity can be evaluated as
where len20-2 is the total length of HSPs for human chromosome 20 vs. mouse chromosome 2, while len20-allis that for human chromosome 20 vs. whole mouse genome.
Preparation of data
We obtained the whole-genome sequences of Escherichia coli CFT073 (Accession No. NC_004431.1, 5,231,428 bp), E. coli K12 (NC_000913.2, 4,639,675 bp), and Salmonella typhimurium (NC_003197.1, 4,857,432 bp) from NCBI http://www.ncbi.nlm.nih.gov/. Before applying these genomic sequences to alignment programs, we masked repetitive sequences by MaskerAid [36] with default parameters. We also prepared the whole-genome sequences of human (build hg19) and mouse (build mm9) from the UCSC Genome Browser http://genome.ucsc.edu/. These human and mouse genome sequences were already soft-masked, and we did not use any masking tool. We also downloaded the latest version (KOREF_20090224) of Korean individual (SJK) genomic sequence [37] from the FTP site ftp://ftp.kobic.kr/pub/KOBIC-KoreanGenome/.
Our examination requires reference data on the locations of homologous regions on the input sequences. To obtain such a reference dataset, we first collected sets of orthologous gene pairs. For bacterial genomes, we used MBGD (Microbial Genome Database for Comparative Analysis) [38] for this purpose. MBGD is a database for comparative analysis of microbial genomes, and possesses data on orthologous gene clusters of bacteria. While COG [39] is well known as an orthologous gene database, we preferred MBGD because MBGD is better than COG at clustering orthologous genes in more detail. Data on orthologous gene pairs between human and mouse were obtained from RefSeq http://www.ncbi.nlm.nih.gov/RefSeq/ and Ensembl http://www.ensembl.org/. Next, we aligned corresponding cDNA sequences of all gene pairs by the standard local DP algorithm [24]. Finally, we located each gene on the reference genome. For bacterial pairs, the information on location was obtained from GenBank. For mammalian pairs, we mapped the cDNA sequences on the respective genomic regions by Spaln [40, 41]. From 10042 original gene pairs, 9373 pairs (= 18746 genes) could be mapped on the reference genomes for the homologous genomic regions. As 97% (18252) cDNAs were mapped on the reference genomes with 100% identity and the other genes were mapped with more than 95% identity, we regarded these data as valid. By combining the cDNA alignment and the mapping coordinates, we obtained the homologous exonic regions on the genomes.
Programs used for tests
To compare the performance of our algorithm with other leading programs presently available, we developed the Cgaln program that implements the above-mentioned algorithm in C on a Linux platform. We report on comparisons of the accuracy and computational speed of Cgaln with those of Blastz, AVID, and NUCmer. Blastz is one of the principal pairwise alignment programs for long sequences, and is used as an internal engine of several multiple genomic sequence alignment programs, such as MultiPipMaker [42], TBA, and MultiZ [43]. We examined Blastz with two sets of parameter values; with the default parameter set and with the tuned parameter set (T = 2, C = 2). The option "T = 2" disregards transitions as matches; this speeds up computation but slightly reduces sensitivity. The option "C = 2" directs "chain and extend", which helps to reduce false positives. AVID is also a fast and accurate genomic alignment program, but it is a global aligner and not suitable for the alignment of genomes with large rearrangements. We applied AVID only to bacterial genome pairs. NUCmer is a variant of MUMmer 3.0. It clusters the matches of MUMmer 3.0 and tries to align the non-exact regions between the matches by DP.
Although there are other genomic alignment tools, such as CHAOS, LAGAN, SLAGAN, DIALIGN, GLASS, and WABA, they are either too slow to execute or their source codes are not available. All experiments were performed on a 2.0 GHz Core2Quad (64-bit CPU) with 8 GB memory.