This section consists of four subsections. In the first two subsections, we discuss, at a conceptual level, the Euler algorithm and its crucial step of arborescence generation, in preparation of the discussion of implementation details. In the third subsection, we present the algorithmic engineering details of our implementation. In the fourth subsection, we describe the software organization and user interfaces of the uShuffle tool. To justify our algorithm choices and to explain our optimization techniques, the discussions in the first three subsections are necessarily technical. The readers who are not particularly interested in the theoretical discussion of graph algorithms or the technical details of algorithmic engineering can safely skip to the fourth subsection for the software organization and user interfaces.
The Euler Algorithm
In this subsection, we review some basic concepts of the Euler algorithm.
Directed multigraph
A k-let is a subsequence of k consecutive elements in a sequence. Let S be a sequence to be permuted. Let T
k
be a uniform random sequence that preserves the k-let counts of S. (For example, T1 is a simple permutation of S, and T2 is a permutation of S with the same dinucleotide counts.) To generate T
k
for k ≥ 2, the Euler algorithm [2, 15] first constructs a directed multigraph G. We refer to Figure 1 for an example. For each distinct (k - 1)-let in S, G has a vertex. For each k-let L in S, which contains two (k - 1)-lets L1 and L2 such that L1 precedes L2, G has a directed edge from the vertex for L1 to the vertex for L2. Duplicates of k-lets may exist in S, so there may be multiple edges between the vertices.
Correspondence between permutations and Eulerian walks
As we scan the k-lets in S one by one, we also walk in the directed multigraph G from vertex to vertex. When all the k-lets are scanned, each edge in G is visited exactly once: the walk is Eulerian. On the other hand, given an Eulerian walk in G, we can recover a sequence by spelling out the (k - 1)-lets of the vertices along the walk (and discarding the overlaps). Since each k-let in S corresponds to an edge in G, every Eulerian walk in G corresponds to a sequence with the same k-let counts as S. Kandel et al. [15] showed that, as long as an Eulerian walk starts and ends at the same two vertices s and t that correspond to the starting and the ending (k - 1)-lets of S, the i-let counts for all 1 ≤ i ≤ k are preserved. Therefore, generating a uniform random sequence T
k
reduces to generating a uniform random Eulerian walk in G from s to t.
Correspondence between Eulerian walks and arborescences
For an Eulerian walk in G, each vertex v of G except the ending vertex t has a last edge e
v
that exits from v for the last time. The set of last edges for all vertices except t forms an arborescence rooted at t: a directed spanning tree in which all vertices can reach t. Given an arborescence A rooted at t, a random Eulerian walk from s to t with the last edges conforming to A can be easily generated [2, 15]:
-
1.
For each vertex v, collect the list of edges E
v
exiting from v. Permute each edge list E
v
separately while keeping e
v
the last edge on the list.
-
2.
Walk the graph G in accordance with the edge lists {E
v
}: start from s (set u ← s), take the first unmarked edge (u, v) from the list E
u
, mark the edge, then move to the next vertex v (set u ← v); continue until all edges are marked and the walk ends at t.
In directed multigraphs, there is a nice correspondence between Eulerian walks and arborescences: every arborescence rooted at t corresponds to exactly the same number of Eulerian walks [3, 15]. Therefore, generating a uniform random Eulerian walk in G from s to t reduces to generating a uniform random arborescence in G rooted at t. In the next subsection, we discuss algorithms for generating random arborescences, some of which are based on, quite amusingly, random walks again.
Generating Random Arborescences
In this subsection, we review the existing algorithms for arborescence generation, and explain our choice of Wilson's algorithm [19, 23]. There are two major approaches to generating random arborescences and spanning trees: determinant algorithms and random-walk algorithms.
Determinant algorithms
Determinant algorithms are based on the matrix tree theorem [3, Chapter II, Theorem 14]. For a graph G, the probability that a particular edge e appears in a uniform random spanning tree is the ratio of two numbers: the number of spanning trees that contain the edge e, and the total number of spanning trees. The matrix tree theorem allows one to compute the exact number of spanning trees of a graph by evaluating the determinant of the combinatorial Laplacian (or Kirchhoff matrix) of the graph. A random spanning tree can be generated by repeatedly contracting or deleting edges according to their probabilities.
The first determinant algorithms were given by Guénoche [14] and Kulkarni [16]: for a graph of n vertices and m edges, a random spanning tree can be generated in O(n3m) time. This running time was later improved to O(n3) [7]. Colbourn, Myrvold, and Neufeld [8] simplified the O(n3) time algorithm and showed that the running time can be further reduced to O(n2.376), the best upper bound for multiplying two n × n matrices [9].
Random-walk algorithms
Random-walk algorithms use an entirely different approach to generating random spanning trees. Aldous [1] and Broder [5] (after discussing the matrix tree theorem with Diaconis) independently discovered an interesting connection between random spanning trees and random walks:
Simulate a uniform random walk in a graph G starting at an arbitrary vertex s until all vertices are visited. For each vertex v ≠ s, collect the edge {u, v} that corresponds to the first entrance to v. The collection T of edges is a uniform random spanning tree of G.
For a graph G and a vertex v in it, define the cover time C
v
(G) as the expected number of steps a random walk starting from v takes to visit all vertices of G. The running time of the Aldous-Broder algorithm [1, 5] is clearly linear in the cover time. In the context of shuffling biological sequences, Kandel et al. [15] extended Aldous-Broder algorithm [1, 5] to generate uniform random arborescences of Eulerian directed graphs in the cover time. Wilson and Propp [24] then presented an algorithm for generating uniform random arborescences of general directed graphs in 18 cover times.
Wilson's algorithm
Wilson [19, 23] showed that random arborescences and spanning trees can be generated more quickly than the cover time by a cycle-popping algorithm which simulates loop-erased random walks. For a graph G and two vertices u and v in it, define the hitting time hu,v(G) as the expected number of steps a random walk takes from u to v. The running time of Wilson's algorithm [19, 23] is linear in the maximum or mean hitting times of the corresponding stochastic graphs. As Wilson [19, 23] noted, the mean and maximum hitting times are always less than the cover time, and the differences can be quite significant in certain graphs. Therefore, for generating uniform random arborescences, Wilson's algorithm [19, 23] is superior to Kandel et al.'s algorithm [15].
For completeness of presentation, we include in the following the pseudocode of Wilson's algorithm [19, 23]:
RandomTreeWithRoot(r)
1 for i ← 1 to n
2 InTree [i] ← false
3 Next [r] ← nil
4 InTree [r] ← true
5 for i ← 1 to n
6 u ← i
7 while not InTree [u]
8 Next [u] ← RandomSuccessor(u)
9 u ← Next [u]
10 u ← i
11 while not InTree [u]
12 InTree [u] ← true
13 u ← Next [u]
14 return Next
Let E
u
be the set of directed edges exiting from the vertex u. The function RandomSuccessor(u) chooses a uniformly random edge (u, v) from E
u
, then returns the vertex v.
Unlike the Aldous-Broder algorithm [1, 5], which simulates a single random walk from the root to visit all vertices, Wilson's algorithm [19, 23] simulates multiple random walks: starting from each unvisited vertex, a random walk continues until it joins a growing arborescence which initially contains only the root. A random walk follows the Next [·] pointers; whenever a previously visited vertex is encountered again, a loop is formed and immediately erased because the Next [·] pointer is overwritten (in the first while loop). As soon as a walk reaches the growing arborescence, all vertices in the walk join the arborescence as one more branch.
A comparison of the two approaches
We now give a comparison of the two approaches to generating random arborescences. Kandel et al. [15] proved that the cover time of an Eulerian directed multigraph of n vertices and m edges is O(n2m). From our preceding discussion on the cover time and the hitting time, it follows that the expected running time of Wilson's algorithm [19, 23] on the same multigraph is at most O(n2m) too, neglecting the log n factors.
For a multigraph, the number m of edges can be arbitrarily larger than the number n of vertices. So it might appear that the determinant algorithm by Colbourn et al. [8], which runs in deterministic O(n3) time or even O(n2.376) time, would be a better alternative than the random walk algorithms [15, 19, 23]. However, we note that when m is large the intermediate values of the determinant computation can be large too. On the typical computer systems today, the arithmetic operations on floating-point numbers do not have enough precision to guarantee the accuracy and stability in the numerical computation of the determinant algorithms. The random walk algorithms [15, 19, 23], on the other hand, require only basic operations on small integers, and do not have these numerical problems. Therefore, we have decided to implement Wilson's random-walk algorithm [19, 23] for arborescence generation.
Implementation Details
In this subsection, we describe the details of our implementation of the Euler algorithm [2, 15, 19, 23] for generating k-let-preserving random sequences.
Kandel et al.'s data structure
As suggested by Kandel et al. [15], a simple implementation of the Euler algorithm [2, 15] may use a look-up table of size σk-1for all possible (k - 1)-lets as vertices in the directed multigraph G, then build an adjacency matrix of size σk-1× σk-1for the edges in G. When both σ and k are small constants, the space requirement of this simple approach, σ2k-2, may not look severe. However, a calculation shows that, even for σ = 20 (the alphabet size of proteins) and k = 3 (a typical choice of k), the space requirement amounts to
σ2k-2= 204 = 160, 000.
On the other hand, the typical length of a protein sequence is below 1000. Even though a sequence itself may be stored in only 1 kilo-bytes, the permutation algorithm still requires hundreds of times more space regardless. The situation becomes even worse when k is further increased: even for the rather innocent-looking parameters σ = 20 and k = 5, the space requirement
σ2k-2= 208 > 168 = 232
exceeds all 4 giga-bytes of memory that can be accommodated by a 32-bit computer! We note that the two sets of parameters that Coward [11] used for experiments on his shufflet program were only
σ = 4, k = 6, σ2k-2= 1, 048, 576
and
σ = 20, k = 3, σ2k-2= 160, 000.
We will discuss more about this in our comparison of uShuffle and shufflet in the Results and Discussion section.
Representing directed multigraph in linear space
To make the uShuffle program scalable, it is clear that careful algorithmic engineering are necessary in the implementation. As we discussed in the previous subsection on the Euler algorithm, the directed multigraph G contains a vertex for each distinct (k - 1)-let in S. Since the number of (k - 1)-lets in S is exactly l - k + 2, G has at most l - k + 2 vertices, and hence exactly l - k + 1 directed edges between consecutive (k - 1)-lets. This implies that the size of G is in fact linear in the length l of the sequence S to be permuted. With suitable data structures, uShuffle needs only linear space.
In the following, we first explain the construction and representation of the directed multigraph G, then explain the random sequence generation after the graph construction. The graph construction consists of two steps: determine the set of vertices, then add the directed edges.
Determining vertices
We use a hashtable to determine the set of vertices. The hashtable consists of a bucket array of size b = l - k + 2, the number of (k - 1) lets in S, and a linked list at each bucket to avoid collision by chaining [10]. Each (k - 1)-let x = x1x2⋯xk-1has a polynomial hash code
where a = /2 is the reciprocal of the golden ratio; the index of x to the bucket array is
i(x) = ⌊h(x)·b⌋ mod b.
Initialize the hashtable to be empty, then try to insert the (k - 1)-lets into the hashtable one by one. If a (k - 1)-let is the first of its kind, it is assigned a new vertex number then inserted into the hashtable; its starting index to the sequence S is also recorded. If a (k - 1)-let has been inserted before, it is not inserted to the hashtable: its vertex number and index to the sequence S are copied from those of the first (k - 1)-let of its kind. After insertions, we can deduce the total number of vertices in the directed multigraph from the largest vertex number assigned. The memory for vertices are then allocated.
Adding directed edges
To add the directed edges, we use an adjacency-list representation to avoid the excessive memory requirement of an adjacency-matrix. In an adjacency-list representation, two edge lists need to be maintained at each vertex: a list of incoming edges and a list of outgoing edges. The outgoing edge lists are necessary for generating Eulerian walks [2]. The incoming edges lists are necessary for generating arborescences when Kandel et al.'s algorithm [15] is used (as in the implementation by Coward [11]). We use Wilson's algorithm [19, 23] for generating arborescences. As we discussed in the previous section, Wilson's algorithm [19, 23] is faster than Kandel et al.'s algorithm [15]. Furthermore, we note here that Wilson's algorithm [19, 23] has another advantage over Kandel et al.'s algorithm [15] in terms of the ease of implementation. Instead of one backward random walk from the ending vertex t to reach all other vertices as in Kandel et al.'s algorithm [15], Wilson's algorithm [19, 23] uses multiple forward random walks from each unvisited vertex to join the arborescence rooted at t: the outgoing edge lists alone are sufficient for generating both the Eulerian walks and the arborescences.
Representing edge lists and managing memory
For maximum efficiency, we implement each edge list as an array of vertices. The numbers of outgoing edges differ from vertex to vertex; if we allocate a fixed-size array for each vertex, then we would have to make each array large enough to hold all edges in the worst case, and the resulting space requirement would become quadratic in the length l of the sequence S. We could of course first count the number of outgoing edges for each vertex, then allocate a separate array just large enough for each vertex. However, this would require us to call the relatively expensive memory allocation function once for each vertex.
In our implementation, we allocate one large array for all edges (the total number of edges is l - k + 1), then parcel out pieces to individual vertices. To achieve this, we first scan the sequence S to count the number of outgoing edges for each vertex, then point the array (outgoing edge list) of each vertex to successive offsets of the large array. With this optimization, the number of memory allocations is reduced to only 4: one for the hashtable bucket array, one for the array of (k - 1)-lets as hashtable entries, one of the array of vertices, and one for the array of edges. The memory for the bucket array and the hashtable entries can be freed as soon as the directed multigraph is constructed.
Sequence generation after graph construction
After the construction of the directed multigraph, we can generate a random sequence in three steps. As discussed in the previous section, we need to first simulate the loop-erased random walks [19, 23] to generate an arborescence, next permute the individual edge lists while maintaining the last edges, then simulate an Eulerian walk guided by the edge lists and output the sequence along the walk. Since each edge list is implemented as an array, the permutation can be executed very efficiently. To output the random sequence along the walk is also easy, since each vertex keeps the starting index of its first occurrence in the input sequence.
Software Organization and User Interfaces of the uShuffle Tool
In this subsection, we describe the software organization and user interfaces of the uShuffle tool.
C Library and command-line tool
Our initial implementation of uShuffle is in the C programming language. The C version of uShuffle consists of two components: a uShuffle library (ushuffle.c and ushuffle.h) and a command-line tool (main.c).
In a typical scenario, multiple k-let-preserving random sequences are generated for each input sequence. The graph construction stage of the uShuffle program needs to be done only once for the multiple output sequences. To give the users an option for optimization, we export three interface functions in the uShuffle library:
void shuffle(const char *s, char *t, int l, int k);
void shuffle1(const char *s, int l, int k);
void shuffle2(char *t);
The function shuffle accepts four parameters: s is the sequence to be permuted, t is the output random sequence, l is the length of s, and k is the let size k. The function shuffle simply calls shuffle1 first and shuffle2 next: shuffle1 implements the construction of the directed multigraph; shuffle2 implements the loop-erased random walks in the directed multigraph and the generation of the random sequence. The statistical behavior of a random permutation depends heavily on the random number generator.
Coward [11] noted that the default implementations of random number generators on various platforms are often unsatisfying, so he implemented his own generator using an arguably better algorithm. We note that there are numerous algorithms for random number generation, and new algorithms are continuously being proposed: whether one algorithm is superior to the other can be quite subjective. Instead of limiting the users to a particular implementation, we set the default generator to the random function from the standard C library, then export an interface function to allow sophisticated users to customize the generator:
typedef long (*randfunc_t)();
void set_randfunc(randfunc_t randfunc);
The command-line uShuffle tool is a minimal front-end of the uShuffle library that illustrates a typical use of the library. It has the following four options:
-
s <string> specifies the input sequence,
-
n <number> specifies the number of random sequences to generate,
-
k <number> specifies the let size,
-
seed <number> specifies the seed for random number generator.
Java applet
The uShuffle program is ported to the Java programming language. Beside having a library and a command-line tool, the Java version of the uShuffle program can also run as an applet in a web browser. We refer to Figure 2 for a screenshot of the uShuffle Java applet1: The interface of the applet is minimal and consists of three parts: an input text area at the top, an output text area at the bottom, and a control panel in the middle. The control panel contains two text fields and a button. The maximum let size k and the number n of output sequences can be set in the two text fields. When the "Shuffle" button is clicked, the applet takes the input sequence from the input text field, strips away the white spaces, generates n random sequences that preserve the k-let counts, then outputs the sequences in the output text area. The output is in the Fasta format when n > 1: each output sequence is preceded by a comment line containing a sequence number ranging from 1 to n.
The uShuffle Java applet keeps all the output sequences in memory for display in the output text area. When the number n of output sequences and the input sequence length l are exorbitantly large, for example, n = 10, 000, 000 and l = 100, the total memory required to hold the output sequences may exceed the maximum heap size of the Java virtual machine (JVM) and the applet may hang. This is not a bug in our program but is due to the limit of JVM; nevertheless, we prepared a web page to instruct the users how to increase the maximum heap size of JVM.
C#/Perl/Python versions
The uShuffle program is also ported to the C# programming language. Perl and Python are popular programming languages for bioinformatics; they allow easy integration with programs written in C. Instead of porting the uShuffle program to Perl and Python at the source code level, we prepared two web pages to instruct the users how to extend the Perl and Python environments with the uShuffle library.