There are many different algorithms for string searching and pattern matching that will match sequences with various numbers of differences. However, the purpose of this program, ADaM, is to show the improvements that can be gained with an *exact* range query. Given a user-specified range, ADaM will return *all* matching sequences with zero false negatives.

Under the hood, ADaM contains a forest data structure known as the APF (Adaptive Projection Forest). The APF is based off of the Excluded Middle Vantage Point (EMVP) tree described in depth in [18–20], but has been extended to provide a greater branching factor, better control over the number of trees produced, and guaranteed bounded search times for range searches below the given radius.

### Euclidean space vs metric space

Metric-space indexes are designed to work on data that has no euclidean representation, the most common example being a 2-d graph with an origin at (0, 0). When searching for points in a euclidean space with dimension *d*, for example, the vector {0}^{d} can always be used as the origin, and all other points could be described by their *d*-dimensional distance from the origin.

DNA sequences do not have a natural euclidean representation, i.e. there is no defined origin that makes sense. Instead, the location of individual sequences in a given space are only defined *in reference to* other sequences. For example, the sequence *aaaa* does not have a pre-determined placement, but is considered a distance of *d*(*aaaa, aaca*) away from point *aaca*. The use of this generic distance function allows for different similarity measures in metric spaces (Hamming distance verses Needleman-Wunsch, for example), but also generalizes to point-space queries with point-space distance functions (such as the L2 norm). The distance function is what determines how "similar" two sequences are, and identifies where to search in the pre-compiled database. A distance function, *d*(*x, y*), on a metric space is defined with the following properties:

\begin{array}{cc}\hfill non\text{-}negativity& \hfill d\left(x,y\right)\ge 0\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\\ \hfill symmetry& \hfill d\left(x,y\right)=d\left(y,x\right)\phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\\ \hfill triangleinequality& \hfill d\left(x,z\right)\le d\left(x,y\right)+d\left(y,z\right)\\ \hfill identityofindiscernibles& \hfill d\left(x,y\right)=0\to x=y\phantom{\rule{1em}{0ex}}\end{array}

For DNA sequences, there are two typical distance functions: Hamming distance, and banded Needleman-Wunsch distance. The Hamming distance is simply a count of the number of characters that are different between two strings; because the chemical makeup of DNA is such that transitions (between *a* and *t* or *c* and *g*) are more likely than transversions (any other nucleotide pairings), differences may be weighted accordingly. Since DNA sequences form discrete metric spaces (and not continuous ones like the number line), this also increases the amount of total variation among nucleotides, and effectively also increases the query speed.

The second distance metric commonly used is the banded Needleman-Wunsch. This is the typical Needleman-Wunsch dynamic programming algorithm with the search space limited to a pre-determined number of gaps. The band limits the size of the dp-matrix, corresponding to a decrease in the number of calculations required.

### Guarantees of exact methods

Unlike the methods that apply heuristics to increase the speed in the string-searching process, the APF is a data structure for obtaining the guaranteed best accuracy. In other words, it can be shown that for any query searched in the forest, any and every element in the tree that lies within a user-defined distance on a given distance function will be returned as a result. Put more formally, for query *q*, result set *R*, set of index points *T*, distance function *d*(*a, b*), and distance range, *r*:

\forall t\in T:t\in R\iff d\left(q,t\right)\le r.

(1)

While a brute-force scan of all points would return a successful result, any metric-space algorithm seeks to reduce the overall number of distance calculations performed. Because of this, the measure of success or failure of the algorithm is not accuracy (all are guaranteed to be accurate), but in the running time of the algorithm or, since the running time is dominated by the distance function, the total number of distance comparisons performed.

In the next several sections, we will give an overview of some of the important features of the APF and how it differs from other metric indexes.

#### Implementing an exclusion area

The major difference between a *k*-d tree and the EMVP forest (as described by [19]) is that of an *exclusion area*. Figure 2 shows a single query, *q*, with radius *r*, plotted on a single dimension with index sequences plotted according to distance from *p* 1. To reduce the search time, the sequences can be split in half, all those with distance less than the median distance, *d*_{
m
}, placed on one branch of the tree, and all those greater than or equal to *d*_{
m
} on another branch. However, because the query radius in Figure 2 crosses the *d*_{
m
}, both branches must be included in the search. If this pattern continues at multiple levels in the tree, this can quickly degrade to a linear scan of all sequences instead of a *O*(log *n*) search.

Adding an exclusion area of width 2 ∗ *τ* helps to mitigate this problem. As the EMVP forest is constructed, all the points *within* the exclusion area (defined by a range of ±*τ* from the median distance) will be removed, collected, and used as the starting points in a new tree. Thus, for a query with radius *r* ≤ *τ*, at most one side plus a possible excluded region will need to be searched (see Figure 3).

The exclusion area is not only used to decrease the search time, but is also used to turn the tree into a forest, optimizing it for parallel processing. This forest can easily be ported to the MapReduce framework: each processor has a different tree, so individual queries are mapped once on each processor, then reduced to find the best mapping location.

#### Controlling the exclusion area

Implementing an exclusion area with a single parameter does not entirely solve the search space problem. On the one hand, fewer branches in the tree will need to be searched, but on the other, removing points within the excluded region can do more harm than good, especially considering the high frequency of points that surround the median. If more points are in the exclusion area than in the actual indexed tree, it is easy to see that this will degrade into a linear scan through the points. For example, for a random 64-bp sequence from the human X-chromosome, approximately 15% of the data lies *d*_{
m
} ± 1, and over 50% of the data lies *d*_{
m
} ± 5. If the data structure consistently excluded 50% of the data, there would be hundreds of trees and the logarithmic search time would be multiplied by a large constant. An optimal dataset for this type of build would have a bimodal distribution with only a small number of points surrounding the median, but this rarely happens in practice.

To overcome this problem, the APF uses three variables, which are set to control the maximum search radius and the number of processors being used. The variables are: the width of the exclusion area, *τ*; the maximum percentage of points in the exclusion area, *m*; and the maximum number of pivots at each node, *D*. The total exclusion region is calculated from the intersection of all pivots at that level in the tree. Each bisection of the data by a new pivot creates a new branch in the tree, for a maximum of 2^{D} possible tree nodes (called *children*) at the next level. At the start of the build process, either *τ* or *m* is set, and the other is calculated at runtime. If *m* was set to 0.1, a given node in the tree might have a *τ* of only 1. If *τ* was set to 2, *m* might be determined at that node to be 0.153. For data with high separability, a single pivot might be enough to confidently split the data; for highly dense data, the value of *D* might be set to 6 or 10. A major implementation detail that separates the APF apart from other indices that use exclusion is the ability to control the exclusion region and the number of points inside to create a more balanced forest.

#### Pivot selection

One of the most crucial decisions in constructing the APF is pivot selection. Since there is typically more than one pivot at each level, these pivots are ideally are selected to minimize the difference between numbers of points at each child. In other words, these pivots should be selected so that the variance *between* pivots is maximized, and the total percentage of pivots in the exclusion area is as small as possible. A poorly-selected pivot would require additional computation without providing additional information, and would result in an unbalance in the number of points at each child. For example, selecting the point {*a*}^{k}as *p*_{1} and *c*{*a*}^{k-1}as *p*_{2} provides very little information gain. Sequences that are close to *p*_{1} are also close to *p*_{2}, and sequences distant from *p*_{1} are also distant from *p*_{2}.

On the other hand, there is a tradeoff between build time and query time. The fastest *build* method for finding a pivot is obviously to just select a random point, but this can lead to an unbalanced tree. The fastest *query* method (optimizing tree structure) is to perform principle components analysis (PCA) with the entire set of points, as described by [21]. While this would maximize the variance between pivots, it quickly becomes computationally intractable as the size of the database increases. Further analysis of pivot selection and its impact on query time can be found later in this paper.

The APF implemented in ADaM selects pivots intelligently, leveraging the data distribution. Since the data is distributed approximately normally around the median, the optimal second pivot would lie at the peak of the distribution, and would select pivots similar in a process to that done with PCA. In order to save time and avoid costly sorts and distance comparisons, successive pivots are selected in this same manner, relying on the random distribution of the data to ensure that pivots are providing information. See Figure 4 for the point distribution for two pivots, with *p*_{2} selected at the median of *p*_{1}.

### Constructing and querying the APF

#### Construction

The APF is constructed by selecting a set of pivots, sorting the points according to their distance from each pivot, and using the median distance, *d*_{
m
}, as the partitioning plane to create the exclusion area, *d*_{
m
} ± *τ* (see Figure 3). The exclusion percentage, *m*, is calculated from all the intersecting exclusion areas, and if it is too large (and the maximum number of pivots has not been reached), an additional pivot is added.

From these *d* pivots, 2^{d} children are created, and labeled with a number from 0. . . 2^{d} − 1 to identify its sector location in the *d*-dimensional space. Since the children will later become trees themselves, points are assigned to these children in the following manner: Let the label *l*_{
i
} be the binary representation of child *c*_{
i
} with *d* digits, each digit corresponding to a distinct pivot. If the bitwise and between 2^{j} and label *l*_{
i
} is zero, then all the points assigned to this child would be *strictly less than* the median distance for pivot *p*_{
j
}, {d}_{m}\left({p}_{j}\right). If this value is 1, the points are required to be greater than or equal to {d}_{m}\left({p}_{j}\right).

**Example 1** If *i* = 13, *l*_{
i
} will be the sequence 1101, and child *c*_{
i
} would contain all points that are greater than or equal to the median for pivots *p*_{3}, *p*_{2}, and *p*_{0}, and less than the median for pivot *p*_{1}.

Points that lie in the intersecting excluded regions are saved until the tree is formed, and the process of selecting pivots and assigning points to children is recursed on each child until the size of the point set is too small, or until all points lie in an exclusion area. The exclusion points from the creation of one tree are used to form the next tree, creating a forest of AP-Trees.

#### Querying

To run a range query on the APF, the query is issued successively on each tree, returning the set of combined results. The correct traversal path through a single tree is found as follows: At each node, the distance from the query point, *q*, to each of the *d* pivots is turned into a bit vector, the reverse process of assigning points to children described above. If this is an exact query, only the single child corresponding to this bit vector need be queried, and the search space is reduced drastically. However, since a query with range *r* may not lie entirely above or below a given pivots middle distance, additional children might need to be queried. If the distance from *q* to *p* _{
i
} plus the query range, *d*(*q, p*_{
i
}) ± *r*, does not cross the boundary partition, it can be placed uniquely in one child node. If, however, adding the range to the query point causes the total distance to *entirely* cross the boundary (spanning 2*τ*), additional children will be searched.

**Example 2** Let {d}_{m}\left({p}_{j}\right) be [16, 15, 14, 15], for pivots [*p*_{3}, *p*_{2}, *p*_{1}, *p*_{0}], and let *τ* = 2. Let *d*(*q, p*_{
i
}) be [26, 18, 14, 2], for each *p*_{
i
}. For a query of range *r* = 4, the results lie strictly below *p*_{0}, both above and below *p*_{1} (because *d*(*q, p*_{1}) ± *r* = [10, 18] spans {d}_{m}\left({p}_{1}\right)\pm \tau =\left[12,16\right]), and strictly above *p*_{2} and *p*_{3} (because the boundary line, *d*(*q, p*_{2}) − *r* = 14, is still greater than {d}_{m}\left({p}_{2}\right)-\tau =13). The query for *q* ± *r* would only need to visit two children: 0011 and 0111.

Each query is executed on each tree, and the results are aggregated and returned to ADaM. In theory, a marker for whether the queries spanned the exclusion area could be set, and subsequent trees only need be searched if this area was crossed, but since all exclusion regions from the creation of the entire tree are combined (a decision that is shown to be empirically fastest by [18]), it is highly unlikely that such an event would occur.

### Integration into ADaM

Since the APF consists of sequences with a pre-determined length, the ADaM wrapper is needed to make it useful for reads longer than the given query sequence. The ADaM algorithm can basically be thought of as a seed and extension algorithm, where the APF selects the top-scoring seeds, ADaM extends the seed alignments to find the total score, and the genomic location with the highest score is returned. Using multiple seeds will take a linearly-increasing amount of time, but will reduce the overall number of extensions needed. Currently, ADaM only supports alignments with Hamming distance, but this can easily be extended to gapped alignments with a Needleman-Wunsch or Smith-Waterman distance function.