Library Design
To arrive at its basic design, SeqAn went through an extensive conceptual phase in which we evaluated many designs and prototype implementations. SeqAn now has a generic programming design that guarantees high performance, generality, extensibility, simplicity, and easy integration with other libraries. This design is based on four principles, which we describe in the following.
Generic Programming
SeqAn adopts the generic programming paradigm that proved to be a key technique for achieving high performance algorithms in the C++ standard library [22]. Generic programming refers to a special style of programming where concrete types are substituted by exchangeable template types. Hence, classes and algorithms are written only once, but can be applied to different data types.
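As a minimal sketch of this idea (the function name countOccurrences is hypothetical and not part of SeqAn), the following template is written once and then works for any container type that supports iteration and element comparison:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not a SeqAn function: counts how often `value`
// occurs in any container that supports range-based iteration.
template <typename TContainer, typename TValue>
std::size_t countOccurrences(TContainer const & container, TValue const & value)
{
    std::size_t count = 0;
    for (auto const & element : container)
        if (element == value)
            ++count;
    return count;
}
```

The same source code is instantiated for std::string, std::vector<int>, or any other suitable container; the compiler generates a separate, fully typed version for each.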
Global Function Interfaces
SeqAn uses global functions instead of member functions to access objects (here we follow the advice of [23], Section 6.10.2). This strategy improves the flexibility and the scalability of our library, since global functions, unlike member functions, can be added to a program at any time without changing the existing code. Moreover, global function interfaces enable us to incorporate the C++ built-in types and to handle them like user-defined types. It is even possible to adapt arbitrary interfaces, e.g., those of classes implemented in external libraries, to a common interface by using small global functions called 'shims' (Chapter 20 in [24]). Algorithms that access objects only via global functions can therefore be applied to a great variety of types, including built-in types and external classes.
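The shim idea can be illustrated with a small sketch. All names below (ExternalBuffer, length, sameLength) are chosen for illustration only and are not taken from SeqAn; ExternalBuffer stands in for a class from an external library that cannot be modified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for a class from an external library that we cannot change.
struct ExternalBuffer
{
    char const * data;
    std::size_t numBytes;
    std::size_t byteCount() const { return numBytes; }
};

// Shims: small global functions that map each type onto one common interface.
inline std::size_t length(std::string const & s) { return s.size(); }
inline std::size_t length(char const * s) { return std::strlen(s); }
inline std::size_t length(ExternalBuffer const & b) { return b.byteCount(); }

// Generic algorithm that touches its arguments only through the global interface.
template <typename TLeft, typename TRight>
bool sameLength(TLeft const & left, TRight const & right)
{
    return length(left) == length(right);
}
```

Under these assumptions, sameLength accepts a standard string, a built-in C string, or the external class in any combination, and supporting a further type only requires one more length overload.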
Traits
Generic algorithms usually have to know certain types that correspond to their arguments: an algorithm on strings may need to know which type of character is stored in the string, or what kind of iterator can be used to browse it. SeqAn uses type traits [25] for that purpose. In C++, trait classes are implemented as class templates that map types or constants given by template arguments to other types, but also to other C++ entities like constants, functions, or objects, at compile time. Most of the advantages we already stated for global functions also apply to traits, i.e., new traits and new specializations of already existing traits can be added without changing other parts of the library.
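A minimal sketch of such a trait is given below. The names Value and firstElement are assumptions made for this example and should not be read as SeqAn's actual declarations. The trait maps a container type to the type of its elements, and a specialization for built-in arrays is added without touching the primary template.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

// Illustrative trait: maps a container type to the type of its elements.
template <typename T>
struct Value
{
    using Type = typename T::value_type;   // default for standard-like containers
};

// Specialization added later for built-in arrays; the primary template is unchanged.
template <typename T, std::size_t N>
struct Value<T[N]>
{
    using Type = T;
};

// Generic algorithm that asks the trait for its return type at compile time.
template <typename TContainer>
typename Value<TContainer>::Type firstElement(TContainer const & container)
{
    return *std::begin(container);
}
```

The algorithm firstElement never names a concrete element type; it obtains the type from the trait, so it works for standard containers and built-in arrays alike.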
Template Argument Subclassing
SeqAn uses a special kind of hierarchical structure that we call 'template argument subclassing', which means that different specializations of a given class template are specified by template arguments. For example, String<Compressed> is a subclass of String in the sense that all functions and traits which are applicable to String can also be applied to String<Compressed>, while it is possible to overload some functions specifically for String<Compressed>. The rules of C++ overload resolution guarantee that the compiler always applies the most specific variant out of all existing implementations when an algorithm or trait is invoked. This approach resembles class derivation in standard object-oriented programming, but it is often faster for two reasons. First, no type conversion is required when a subclass calls a function that is already defined for the base class. Second, the actual type of the object used in a function is already known at compile time, so it need not be determined at run time using virtual functions. Non-virtual functions have the advantage that C++ compilers can inline them and thereby eliminate the call overhead completely. Template argument subclassing enables us both to specialize functions and to delegate tasks soundly to base classes while still maintaining static binding.
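The mechanism can be sketched in a few lines. The class and function bodies below are simplified assumptions for illustration, not SeqAn's implementation; only the names String and Compressed are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Tags that select a specialization.
struct Plain {};
struct Compressed {};

// Simplified stand-in for a string class template.
template <typename TSpec = Plain>
struct String
{
    std::string data;
};

// Generic versions: valid for String<TSpec> with any TSpec (the "base class").
template <typename TSpec>
std::size_t length(String<TSpec> const & s) { return s.data.size(); }

template <typename TSpec>
std::string describe(String<TSpec> const &) { return "generic"; }

// Overload for the "subclass" String<Compressed>. Overload resolution prefers
// it because it is more specific than the template above.
inline std::string describe(String<Compressed> const &) { return "compressed"; }
```

Here length is inherited by String<Compressed> from the generic version, whereas describe is overridden. Both calls are resolved at compile time, so no virtual function table is involved.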
Design Goals
These design principles support our design goals in the following way:
- Performance: The library produces code that is competitive with manually optimized programs. Template argument subclassing makes it possible to plug in optimized specializations for algorithms whenever needed. Our generic programming design also speeds up the code by avoiding unnecessary virtual function calls.
- Generality: All parts of the library are as flexible as possible. Algorithms can be applied to various data types, and new types can be added if necessary. For example, generic alignment algorithms in SeqAn work on strings over arbitrary alphabets. However, specialized implementations that make use of certain attributes of the alphabet can still be developed using template argument subclassing.
- Integration: SeqAn components are designed to fulfill the requirements specified in the C++ standard. In addition, SeqAn easily interacts with other libraries because the global interface can be expanded. Hence, algorithms and classes of other libraries are at hand.
- Extensibility: The open-closed principle ('Be open for extension but closed for modifications!') is satisfied insofar as it is possible to extend the library by simply adding new code. SeqAn has this feature because it relies on stand-alone global functions and traits that can be added at any time without changing the existing code.
- Simplicity: While a pure object-oriented library may be more familiar to some users, SeqAn is still simple enough to be used even by developers with average skills in C++.
Library Contents
SeqAn is a software library that is intended to cover all areas of sequence analysis. Fig. 2 gives an overview of the contents of the library in its current state.
Sequences
The storage and manipulation of sequences is essential for all algorithms in the field of sequence analysis. In SeqAn, sequences are represented as strings of characters over various alphabets. Multiple string classes for different settings are available: Large sequences can be stored in secondary memory using external strings, bit-packed strings can be used to take advantage of small alphabets, or strings allocated on the stack can be used to guarantee fast access. String modifiers can be used to implement distinct views on a given sequence without copying it. A string segment, for instance, is a string modifier used to get access to an infix, suffix, or prefix of a given sequence.
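To illustrate how a bit-packed string can exploit a small alphabet, the following sketch stores DNA with two bits per base. The class PackedDna and its layout (32 bases per 64-bit word) are assumptions for this example and do not describe SeqAn's own packed string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative 2-bit packed DNA string: 32 bases fit into one 64-bit word.
class PackedDna
{
public:
    void append(char base)
    {
        std::size_t const bit = 2 * (size_ % 32);   // bit offset inside the current word
        if (bit == 0)
            words_.push_back(0);                    // start a new word every 32 bases
        words_.back() |= encode(base) << bit;
        ++size_;
    }

    char operator[](std::size_t i) const
    {
        assert(i < size_);
        return "ACGT"[(words_[i / 32] >> (2 * (i % 32))) & 3u];
    }

    std::size_t size() const { return size_; }
    std::size_t bytes() const { return words_.size() * sizeof(std::uint64_t); }

private:
    static std::uint64_t encode(char base)
    {
        switch (base)
        {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
        }
        throw std::invalid_argument("not a DNA base");
    }

    std::vector<std::uint64_t> words_;
    std::size_t size_ = 0;
};
```

In this sketch a sequence of 42 bases occupies two 64-bit words, i.e., 16 bytes, instead of 42 bytes in a byte-per-character string.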
Alignments
Alignments require the insertion of gaps into sequences. SeqAn does not actually insert these gaps directly into the sequence but treats them separately. The benefit is twofold: a single sequence can be used in multiple alignments simultaneously, and the actual alphabet of the string need not include a special gap character. SeqAn offers both pairwise and multiple sequence alignment algorithms. Algorithms can be configured for different scoring schemes and different treatments of sequence ends (e.g., end-space free alignments). In the pairwise case, local and global alignment algorithms are available. Besides the classical Needleman-Wunsch algorithm [19], more sophisticated algorithms are available, including an affine gap cost alignment [26] and Myers' bit vector algorithm [2]. Moreover, SeqAn offers efficient algorithms to chain alignment fragments [27, 28]. We are also currently integrating code for motif finding in multiple sequences.
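For readers unfamiliar with the Needleman-Wunsch recurrence, a compact sketch of the score computation with linear gap costs follows. The names SimpleScore and globalAlignmentScore are hypothetical; this is a textbook formulation written for this example, not SeqAn's implementation, and it returns only the score, not the alignment itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative scoring scheme; gap and mismatch are usually negative.
struct SimpleScore
{
    int match;
    int mismatch;
    int gap;
};

// Global alignment score with linear gap costs, using two rows of the DP matrix.
template <typename TSeq>
int globalAlignmentScore(TSeq const & a, TSeq const & b, SimpleScore const & sc)
{
    std::size_t const n = a.size();
    std::size_t const m = b.size();
    std::vector<int> prev(m + 1), curr(m + 1);

    for (std::size_t j = 0; j <= m; ++j)
        prev[j] = static_cast<int>(j) * sc.gap;       // first row: only gaps

    for (std::size_t i = 1; i <= n; ++i)
    {
        curr[0] = static_cast<int>(i) * sc.gap;       // first column: only gaps
        for (std::size_t j = 1; j <= m; ++j)
        {
            int const diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? sc.match : sc.mismatch);
            curr[j] = std::max({diag, prev[j] + sc.gap, curr[j - 1] + sc.gap});
        }
        prev.swap(curr);
    }
    return prev[m];
}
```

Because the function is a template over the sequence type, it accepts any container with size() and operator[], which mirrors the point that the alignment code does not depend on a particular alphabet.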
Indices
The enhanced suffix array (ESA) [29] is probably the most fundamental indexing data structure in bioinformatics with various applications, e.g., finding maximal repeats, supermaximal repeats, or maximal unique matches in sequences. An enhanced suffix array is a normal suffix array extended with an additional lcp table that stores the length of the longest common prefix of adjacent suffixes in the suffix array. SeqAn offers an ESA that can be built in primary or in secondary memory, depending on the sequence size. The user has two choices to access the ESA, either as a regular suffix array or as a suffix tree. The latter view on an ESA is realized using the concept of iterators that simulate a tree traversal. A more space- and time-efficient data structure for top-down traversals through only parts of the suffix tree is the lazy suffix tree [30], which is also implemented in SeqAn. Besides the sophisticated ESA, simpler indices are available, including basic hash tables like gapped and ungapped q-gram indices (for their use see [31–33]).
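The two tables described above can be made concrete with a deliberately naive sketch. The function names are hypothetical, and the construction shown here (sorting suffixes by direct comparison) is far slower than the algorithms a production library would use; it only serves to show what the suffix array and the lcp table contain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: start positions of all suffixes in lexicographic order.
std::vector<std::size_t> buildSuffixArray(std::string const & text)
{
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// lcp[i] is the length of the longest common prefix of the suffixes
// starting at sa[i - 1] and sa[i]; lcp[0] is defined as 0.
std::vector<std::size_t> buildLcpTable(std::string const & text,
                                       std::vector<std::size_t> const & sa)
{
    std::vector<std::size_t> lcp(sa.size(), 0);
    for (std::size_t i = 1; i < sa.size(); ++i)
    {
        std::size_t const a = sa[i - 1];
        std::size_t const b = sa[i];
        std::size_t len = 0;
        while (a + len < text.size() && b + len < text.size() &&
               text[a + len] == text[b + len])
            ++len;
        lcp[i] = len;
    }
    return lcp;
}
```

For the text "banana", the suffix array is 5, 3, 1, 0, 4, 2 and the lcp table is 0, 1, 3, 0, 0, 2; the entry 3 records that the adjacent suffixes "ana" and "anana" share the prefix "ana".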
Searching
Flexible pattern matching algorithms are fundamental to sequence analysis. Exact and approximate string matching algorithms are provided. For the exact string matching task, SeqAn offers the algorithms Shift-And, Shift-Or, Horspool, Backward Oracle Matching, and Backward Nondeterministic Dawg Matching [34]. For searching multiple patterns, SeqAn currently supports the Multiple Shift-And, the Set Horspool, and the Aho-Corasick algorithm [34]. Myers' bit vector algorithm [2] can be used for approximate string matching. Note that SeqAn's index data structures can naturally be used to search for strings as well.
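As an example of the bit-parallel algorithms listed above, here is a small sketch of Shift-And for exact matching. The function name shiftAndSearch and the restriction to patterns of at most 64 characters (one machine word) are choices made for this example; it is not SeqAn's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Shift-And exact matching; returns the start positions of all occurrences.
// Bit j of `state` is set if pattern[0..j] matches the text ending at position i.
std::vector<std::size_t> shiftAndSearch(std::string const & text,
                                        std::string const & pattern)
{
    std::size_t const m = pattern.size();
    if (m == 0 || m > 64)
        throw std::invalid_argument("pattern length must be between 1 and 64");

    // masks[c] has bit j set if pattern[j] == c.
    std::uint64_t masks[256] = {};
    for (std::size_t j = 0; j < m; ++j)
        masks[static_cast<unsigned char>(pattern[j])] |= std::uint64_t{1} << j;

    std::uint64_t const accept = std::uint64_t{1} << (m - 1);
    std::uint64_t state = 0;
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        state = ((state << 1) | 1u) & masks[static_cast<unsigned char>(text[i])];
        if (state & accept)
            hits.push_back(i + 1 - m);   // occurrence ends at i
    }
    return hits;
}
```

Each text character is processed with one shift, one OR, and one AND, independent of the pattern length, which is what makes this family of algorithms attractive for short patterns.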
Graphs
Graphs are increasingly important to a number of bioinformatics problems. Prime examples are string matching algorithms (e.g., Aho-Corasick, Backward Oracle Matching [34]), phylogenetic algorithms (e.g., UPGMA, neighbor joining tree [35]), or alignment representations [36]. Hence, we decided to include our own graph type implementation, including directed graphs, undirected graphs, trees, automata, alignment graphs, tries, wordgraphs, and oracles. Graph algorithms currently comprise breadth-first search, depth-first search, topological sort, strongly connected components, minimum spanning trees (e.g., Prim's algorithm, Kruskal's algorithm), shortest path algorithms (e.g., Bellman-Ford, Dijkstra, Floyd-Warshall), transitive closure, and the Ford-Fulkerson maximum flow algorithm [37]. Trees are heavily used in clustering algorithms and as guide trees during a progressive multiple sequence alignment. Alignment graphs are used to implement a heuristic multiple sequence alignment algorithm, which is similar to T-Coffee [38] but makes use of segments and a sophisticated refinement algorithm [39] to enable large-scale sequence alignments.
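To give one concrete instance of the graph algorithms listed above, the following sketch runs a breadth-first search on a minimal adjacency-list graph. The type DirectedGraph and the function breadthFirstDistances are hypothetical names introduced here; SeqAn's graph types are considerably richer.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Minimal directed graph stored as adjacency lists (illustrative only).
struct DirectedGraph
{
    std::vector<std::vector<std::size_t>> adjacency;

    explicit DirectedGraph(std::size_t numVertices) : adjacency(numVertices) {}

    void addEdge(std::size_t from, std::size_t to) { adjacency[from].push_back(to); }
};

// Breadth-first search: number of edges on a shortest path from `source`
// to every vertex, or -1 for vertices that cannot be reached.
std::vector<int> breadthFirstDistances(DirectedGraph const & graph, std::size_t source)
{
    std::vector<int> distance(graph.adjacency.size(), -1);
    std::queue<std::size_t> pending;
    distance[source] = 0;
    pending.push(source);
    while (!pending.empty())
    {
        std::size_t const v = pending.front();
        pending.pop();
        for (std::size_t w : graph.adjacency[v])
        {
            if (distance[w] == -1)          // first visit yields the shortest distance
            {
                distance[w] = distance[v] + 1;
                pending.push(w);
            }
        }
    }
    return distance;
}
```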
Biologicals
Besides the fundamental alphabets for biological purposes, like DNA or amino acids, SeqAn offers different scoring schemes for evaluating the distance between two characters, e.g., PAM and BLOSUM. SeqAn also supports several file formats that are common in the field of bioinformatics, e.g., FASTA, EMBL, and GenBank. It is possible to access (e.g., to search) sequence data stored in such file formats without loading the whole data into memory. The integration of external tools (e.g., BLAST) and the parsing of metainformation is ongoing work.