Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics

Dutheil, Julien; Gaillard, Sylvain; Bazin, Eric; Glémin, Sylvain; Ranwez, Vincent; Galtier, Nicolas; Belkhir, Khalid

doi:10.1186/1471-2105-7-188

Software
Open access
Published: 04 April 2006

Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics

Julien Dutheil¹,
Sylvain Gaillard¹,
Eric Bazin¹,
Sylvain Glémin¹,
Vincent Ranwez²,
Nicolas Galtier¹ &
…
Khalid Belkhir¹

BMC Bioinformatics volume 7, Article number: 188 (2006) Cite this article

16k Accesses
86 Citations
Metrics details

Abstract

Background

A large number of bioinformatics applications in the fields of bio-sequence analysis, molecular evolution and population genetics typically share input/ouput methods, data storage requirements and data analysis algorithms. Such common features may be conveniently bundled into re-usable libraries, which enable the rapid development of new methods and robust applications.

Results

We present Bio++, a set of Object Oriented libraries written in C++. Available components include classes for data storage and handling (nucleotide/amino-acid/codon sequences, trees, distance matrices, population genetics datasets), various input/output formats, basic sequence manipulation (concatenation, transcription, translation, etc.), phylogenetic analysis (maximum parsimony, markov models, distance methods, likelihood computation and maximization), population genetics/genomics (diversity statistics, neutrality tests, various multi-locus analyses) and various algorithms for numerical calculus.

Conclusion

Implementation of methods aims at being both efficient and user-friendly. A special concern was given to the library design to enable easy extension and new methods development. We defined a general hierarchy of classes that allow the developer to implement its own algorithms while remaining compatible with the rest of the libraries. Bio++ source code is distributed free of charge under the CeCILL general public licence from its website http://kimura.univ-montp2.fr/BioPP.

Background

The design of re-usable software components into libraries has proved to be useful to the rapid development of bioinformatics applications, as witnesses the success of the ever growing open source libraries known as the Bio{*} projects. These libraries cover many fields of bioinformatics, but rarely offer tools for intensive calculus, as required for phylogenetic analyses for instance. Such computer-expensive applications are usually written in C, a compiled low-level language. Programming time-consuming tasks often requires the implementation of specific algorithms optimized for each particular problem. However, despite these specifities, there are some common points (e.g. likelihood calculation, minimization), and developers will appreciate to use modules pre-programmed in a flexible manner.

The use of the C++ language combines both the computer efficiency of the C language and the convenience of a Object-Oriented (OO) approach. We present Bio++, a set of C++ libraries dedicated to sequence analysis, molecular evolution and population genetics. Other libraries taking these advantages of C++ have been developed, including the libsequence library [1] for population genetics analysis (single nucleotides polymorphism and coalescent), PDBlib for manipulating tridimensional structures [2] or libcov [3] for protein phylogenetics and molecular evolution. The Bioinformatics Template Library [4] also provides algorithms for manipulating and analysing three dimensional structures, using a generic programming approach.

Implementation

We did a collaborative effort to design a hierarchy of useful classes for the rapid development of efficient applications in the fields of sequence analysis, phylogenetics, molecular evolution and population genetics. These libraries combine tools for data handling and numerical calculations, providing an easy-to-use, powerful and general development environment. The three main objectives of Bio++ are (i) to speed up the implementation of a new method in order to test it, (ii) to make easier the development of robust applications for distribution and (iii) to provide a high level programming language for biologists.

Library design

We used object-oriented programming paradigms to develop Bio++. We defined for each class an abstract basal class that contains a number of purely virtual functions (dummy functions with an empty boddy). Such a basal class is called an interface and can be seen has a kind of contract that a class must follow in order to be used by other methods. Such a contract only specifies what an object is at least able to do, and not how it does it. Most classes in Bio++ are derived from interfaces, and some common instanciations are proposed for each interface. The user may, however, write its own implementation of a given method, and still remain compatible with the rest of the library. For instance, one may easily design a new alphabet to handle a new kind of sequences, or write a new substitution model and perform likelihood calculation without writing any likelihood computation function. Conversely, one may implement a new likelihood computation algorithm without re-developping substitution model classes. Several interfaces and abstract classes (partial implementations) linked by inheritance specify different levels of specialization. Depending on their objectives and programming level, users might simply use fully-specified objects, re-implement specific methods, or even design new classes.

The sequence container hierarchy is a representative example of the class hierarchies defined in Bio++ (see figure 1). The SequenceContainer basal interface only specifies that sequences can be accessed by their name. The OrderedSequenceContainer has the additional requirement that sequences can be accessed by index, which is generally the case, even if ordering has no biological meaning. The SiteContainer interface requires that sequences have the same length (i.e. are aligned) and hence may also be accessed by site (= column in the alignment). Several implementations are proposed: in the VectorSequenceContainer class, data are stored as a vector of sequences. Each sequence can hence be accessed in O(1). If there are N sequences, they are hence accessed in O(N). The AlignedSequenceContainer is derived from the VectorSequenceContainer class, and adds site access. Since data are stored as sequences, the access time for a site is in O(N) and the complete set of L sites is acessed in O(N × L). The VectorSiteContainer proposes an alternative implementation by storing sequences as sites instead of sequence objects. The sequence access is hence achieved in O(L), and the site access in O(1). All methods working on containers only deal with the SequenceContainer or the OrderedSequenceContainer interface, whith no assumption about the implementation. The execution time, however, may vary depending on the implementation used.

Development is facilitated by the use of the code documentation generated by the doxygen program [5]. Full class documentation can be consulted online or downloaded in navigable HTML format. A short tutorial is also available.

Library content

We split Bio++ classes into five libraries: three main biological libraries (SeqLib, PhylLib and PopGen-Lib) and two utility libraries (Utils and NumCalc), see table 1 for there content and dependencies:

Table 1 Libraries available and dependencies.

Full size table

the Utils library contains core classes and utilitary functions;
the NumCalc library contains classes for numerical calculus, including several optimization tools, random number generators, probability distributions and statistical functions;
the SeqLib library is dedicated to sequence analysis: sequence storage and manipulation, input/output toward several file formats, alphabet translation, etc.
the PhylLib library deals with phylogenetic trees: Markov models, likelihood calculation, etc. The library is more focused on molecular evolution (model fitting, ancestral state reconstruction, across-site rate variation) than pure phylogenetic reconstruction, although several topology reconstruction methods are implemented (distance methods, nearest-neighbor interchanges (NNI) search for maximum likelihood (ML) and parsimony scores);
the PopGenLib library is dedicated to population genetics, with particular sequence and codominant markers storage. Methods are provided for polymorphism analysis, population divergence estimation and neutrality tests.

Results

Figure 2 shows a full program example. This application builds a Neighbor-Joining tree from a sequence file in Phylip format, and re-estimates parameters (branch lengths and shape of the rate across-site distribution) using maximum likelihood. The final tree is then written to a file in Newick format. Additional output files are also created, providing detailed information about the estimation procedure. The program begins with the creation of a ProteicAlphabet object (line 3):

const ProteicAlphabet * alphabet = new ProteicAlphabet();

This object specifies the type of sequence the program will use. The Alphabet object family contains classes for nucleic alphabets (DNA and RNA), proteins or codons (nuclear and mitochondrial, for vertebrates, echinoderms/nematodes and other invertebrates). A dataset is then readed from a Phylip sequence file. A sequence reader is created using

Phylip * seqReader = new Phylip(false, false);

Other file formats for reading and writing are supported, including the commonly used Fasta and Clustal. Sequences are then read and stored in a container:

SiteContainer * sites = seqReader->read("Myoglobin.phy", alphabet);

Since sequences are aligned, they are actually stored in a SiteContainer object (see figure 1). Line 7 to 10 display on screen some properties of the container, namely the number of sequences and sites. In this example we need to restrict our analysis to sites without any gap, so at line 11 a new container called "completeSites" is created from the former:

SiteContainer * completeSites = SiteContainerTools: :getSitesWithoutGaps(* sites);

Several {*}Tools classes providing utilitary functions are available. For instance, the SiteContainerTools class contains static functions that work on SiteContainer objects.

A JTT proteic SubstitutionModel object is created [6, 7]:

SubstitutionModel * model = new JTT92(alphabet);

Here again, several commonly used models are available. We also use a discrete gamma rate across sites distribution with 4 rate classes and a shape parameter of 0.5 [8]:

DiscreteDistribution * rateDist = new GammaDiscreteDistribution(4, 0.5);

This model is used first to estimate the distance matrix from the data. A DistanceEstimation object is then created, by giving the model and data as parameters:

DistanceEstimation distEstimation(model, rateDist, completeSites);

DistanceMatrix * matrix = distEstimation.getMatrix();

Distances will be estimated using a maximum likelihood (ML) procedure. The NumCalc library provides an object-oriented implementation of several general optimization procedures (Optimizer objects). For phylogenetic optimization, specific optimizers are also available. The program then computes a neighbor-joining tree [9]. A Neighbor Joining object is instantiated and the resulting tree stored in a variable:

NeighborJoining nj(* matrix, false);

Tree<Node> * tree = nj.getTree();

This variable is passed to a TreeLikelihood object, together with the data set, model and rate distribution. An additional parameter tells the object to display a few informations. The DRHomogeneousTreeLikelihood object is for homogeneous models, with rate across sites variations.

DiscreteRatesAcrossSitesTreeLikelihood * likFunction = new DRHomogeneousTreeLikelihood(*tree, *completeSites, model, rateDist, true);

Several TreeLikelihood implementations are available. The DRHomogeneousTreeLikelihood object uses one 'view' by neighbor node for each node, i.e. three views by node in case of bifurcating trees [10]. This implementation saves computational time during NNI-mediated topology search, and is also more convenient to compute branch-lengths derivatives. An alternative implementation is proposed: HomogeneousTreeLikelihood, and others could be easily added if required. For parameter optimization, we used a utilitary function from the OptimizationTools class. This function automatically instanciates the appropriate optimizer object. It receives as argument a pointer toward the likelihood function, a tolerance number and a maximum number of function evaluations. Two optional log-files are provided to monitor the optimization process ("profiler" and "messenger", which contain several detailed informations, like for instance all parameters and function values at each step of the process), and the last parameter specifies the verbose level:

OptimizationTools:: optimizeNumericalParameters(likFunction, 0.000001, 1000000, messenger, profiler, 3);

Finally, the optimized tree is writen to a file named output. dnd in Newick file format:

Newick newick;

newick.write(*tree, "output.dnd");

This program creates three files: two for the ML tree estimation and one for the final tree.

Conclusion

Bio++ is a mature project which has been used in previous works like molecular coevolution analysis [11] or codon analysis [12]. However it is an active project still receiving new methods and improvements.

Development snapshots may be accessed by anonymous CVS on the library website. The website also provides on-line documentation of classes, tutorial and several code examples. Several programs have been developped with the Bio++ libraries, including bppML for ML likelihood tree estimation and bppSG for sequence generation by simulation under different kinds of models. Any contribution will be welcome, as specific functions or as additional libraries compatible with the present ones.

Availability and requirements

C++ libraries are not organized in a tree-like hierarchy as java packages or perl modules. They are bundled in a non-nested way, and may be compiled in two flavours, either static or shared (= dynamic). Dynamic libraries are loaded during the program execution and hence can be shared by several applications, while static libraries are hard-coded into the executable, which no longer requires the library to be installed.

Each Bio++ library is complient with the GNU standards and uses the autotools suite for compilation and installation. Bio++ can hence be built with these GNU tools on any unix-like system. Alternatively, sources can be imported and compiled in any C++ development environment (IDE). It has been successfully installed on Linux, MacOS X and Windows using the Cygwin port and MinGW. Bio++ is distributed freely under the CeCILL public license (the French free software license, compatible with the GNU General Public License). It is available from its website at http://kimura.univ-montp2.fr/BioPP.

References

Thornton K: Libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics 2003, 19: 2325–7. 10.1093/bioinformatics/btg316
Article CAS PubMed Google Scholar
Chang W, Shindyalov IN, Pu C, Bourne PE: Design and application of PDBlib, a C++ macromolecular class library. Computer Applications In The Biosciences 1994, 10: 575–86.
CAS PubMed Google Scholar
Butt D, Roger AJ, Blouin C: libcov: a C++ bioinformatic library to manipulate protein structures, sequence alignments and phylogeny. BMC Bioinformatics 2005, 6: 138–138. 10.1186/1471-2105-6-138
Article PubMed Central PubMed Google Scholar
Pitt WR, Williams MA, Steven M, Sweeney B, Bleasby AJ, Moss DS: The Bioinformatics Template Library – generic components for biocomputing. Bioinformatics 2001, 17: 729–37. 10.1093/bioinformatics/17.8.729
Article CAS PubMed Google Scholar
van Heesch D: Doxygen, a documentation system for C++.2005. [http://www.stack.nl/~dimitri/doxygen/]
Google Scholar
Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Computer Applications In The Biosciences 1992, 8: 275–82.
CAS PubMed Google Scholar
Kosiol C, Goldman N: Different versions of the Day-hoff rate matrix. Molecular Biology And Evolution 2005, 22: 193–9. 10.1093/molbev/msi005
Article CAS PubMed Google Scholar
Yang Z: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates oversites: approximate methods. Journal Of Molecular Evolution 1994, 39: 306–14. 10.1007/BF00160154
Article CAS PubMed Google Scholar
Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology And Evolution 1987, 4: 406–25.
CAS PubMed Google Scholar
Felsenstein J: Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates; 2004.
Google Scholar
Dutheil J, Pupko T, Jean-Marie A, Galtier N: A model-based approach for detecting coevolving positions in a molecule. Molecular Biology And Evolution 2005, 22: 1919–28. 10.1093/molbev/msi183
Article CAS PubMed Google Scholar
Galtier N, Bazin E, Bierne N: GC-biased segregation of non-coding polymorphisms in Drosophila. Genetics 2005, 172: 221–8. 10.1534/genetics.105.046524
Article PubMed Google Scholar
Friedman N, Ninio M, Pe'er I, Pupko T: A structural EM algorithm for phylogenetic inference. Journal Of Computational Biology 2002, 9: 331–53. 10.1089/10665270252935494
Article CAS PubMed Google Scholar
Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Computer Applications In The Biosciences 1997, 13: 555–6.
CAS PubMed Google Scholar

Download references

Acknowledgements

JD would like to thank Tal Pupko for providing help with C++ in the early development of this project. Some code in the NumCalc library have been adapted from the SEMPHY library [13], Ziheng Yang's PAML package [14] and the C++ port of the Java Matrix (JAMA) package http://math.nist.gov/tnt. This work was supported by French Ministére de la Recherche ACI IMPBio.

Author information

Authors and Affiliations

CNRS UMR 5171 – Génome, Populations, Interactions, Adaptation (GPIA), Université Montpellier 2, France
Julien Dutheil, Sylvain Gaillard, Eric Bazin, Sylvain Glémin, Nicolas Galtier & Khalid Belkhir
CNRS UMR 5554 – Institut des Sciences de I'Evolution de Montpellier (ISE-M), Université Montpellier 2, France
Vincent Ranwez

Authors

Julien Dutheil
View author publications
You can also search for this author in PubMed Google Scholar
Sylvain Gaillard
View author publications
You can also search for this author in PubMed Google Scholar
Eric Bazin
View author publications
You can also search for this author in PubMed Google Scholar
Sylvain Glémin
View author publications
You can also search for this author in PubMed Google Scholar
Vincent Ranwez
View author publications
You can also search for this author in PubMed Google Scholar
Nicolas Galtier
View author publications
You can also search for this author in PubMed Google Scholar
Khalid Belkhir
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Julien Dutheil.

Additional information

Authors' contributions

JD developped the Utils, NumCalc, SeqLib and PhylLib libraries and drafted the manuscript. SGa developped the PopGenLib library, provided algorithms for the NumCalc library and helped with the GNU configuration tools. EB developped tools for the PopGenLib library. SG1 provided tools for the PopGenLib and SeqLib libraries. VR helped in the development of the PhylLib library. NG participated in the design of the library. KB suppervised the whole project. SG1, VR, NG and KB helped to draft the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

Additional File 1: Sources and data files of the example as a zipped archive. (ZIP 29 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Dutheil, J., Gaillard, S., Bazin, E. et al. Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics 7, 188 (2006). https://doi.org/10.1186/1471-2105-7-188

Download citation

Received: 06 December 2005
Accepted: 04 April 2006
Published: 04 April 2006
DOI: https://doi.org/10.1186/1471-2105-7-188

Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics