Extending the BEAGLE library to a multiFPGA platform
 Zheming Jin†^{1} and
 Jason D Bakos^{1}Email author
https://doi.org/10.1186/147121051425
© Jin and Bakos; licensee BioMed Central Ltd. 2013
Received: 29 May 2012
Accepted: 4 January 2013
Published: 19 January 2013
Abstract
Background
Maximum Likelihood (ML)based phylogenetic inference using Felsenstein’s pruning algorithm is a standard method for estimating the evolutionary relationships amongst a set of species based on DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel for these tools. These computations are data intensive but contain fine grain parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein’s pruning algorithm for various data parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)based platform called the Convey HC1.
Results
The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floatingpoint operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be calculated as a function of the host platform’s peak memory bandwidth and the implementation’s memory efficiency, as 2.03 × peak bandwidth × memory efficiency. Our FPGAbased platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, which gives an average throughput of 78 Gflops. This represents a ~40X speedup when compared with BEAGLE’s CPU implementation on a dual Xeon 5520 and 3X speedup versus BEAGLE’s GPU implementation on a Tesla T10 GPU for very large data sizes. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt.
Conclusions
The use of data parallel architectures to achieve high performance for likelihoodbased phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool to generate a floatingpoint pipeline with resource and throughput constraints to match the target platform. We have found that using lowlatency floatingpoint operators can significantly reduce FPGA area and still meet timing requirement on the target platform. We found that this design methodology can achieve performance that exceeds that of a GPUbased coprocessor.
Background
Computing the PLF and tree likelihood for candidate trees comprises the computational kernel of these tools. Since multiple tools use a common likelihood computation, Ayres et al., implemented a finely tuned implementation of the likelihood computation as a generalpurpose library called BEAGLE [2]. BEAGLE supports CPU and GPUbased architectures, but does not yet support Field Programmable Gate Array (FPGA)based architectures such as the Convey HC1 [3]. In this paper, we describe our effort to add FPGA support to BEAGLE as well as the resultant performance.
Related work
There has been previous work in accelerating the PLF to FPGAbased coprocessor architectures. Mak and Lam were perhaps the first team to implement likelihoodbased phylogeny inference on an FPGA [4]. They used specialpurpose logic in the FPGA fabric to perform the PLF using fixedpoint arithmetic. Alachiotis et al. also implemented the PLF in special purpose logic and achieved an average speedup of four relative to software on a sixteen core processor [5, 6].
There has also been recent work in using Graphical Processor Units (GPUs) as coprocessors for MLbased phylogenetic inference. In recent work, Suchard et al. used the NVIDIA CUDA framework [7] to implement single and double precision versions of the PLF [8]. Using three NVIDIA GTX280 GPUs, they achieved a speedup of 20 for the single precision nucleotide model as compared to singlethreaded software. Zhou et al. developed a GPU implementation and evaluated it on a CPU running four processes and two NVIDIA GTX 480 GPUs [9]. They achieved a speedup of 42.3 relative to MrBayes running on a CPU with a single thread, and 13.9 when compared to an optimized CPU version with 4 threads.
Descriptions of PLF kernel
The kernel function of BEAGLE depends on which options and evolutionary models are used for the analysis. When using the 4state nucleotide model, nearly all the execution time is spent evaluating the loglikelihood score. This evaluation consists of evaluating the PLF, normalizing the conditional likelihood tables to avoid numerical underflow, and updating two logscaler arrays.
The kernel algorithm is described in Cstyle pseudocode below. Note that although our implementation produces consistent results with that of BEAGLE, the pseudocode describes the authors’ implementation and is not necessarily descriptive of the corresponding kernel included in BEAGLE.
Pseudocode 1: BEAGLE Kernel
// for each nucleotide in the sequence perform the PLF to complete the fourcolumn
// conditional likelihood table
h = 0;
for (k = 0; k < nsites; k++) {
State is {AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT};
Base is {A, C, G, T};
State = State.first;
for (i = 0; i < 4; i++) {
Base = Base.first;
sopL = sopR = 0;
for (j = 0; j < 4; j++) {
sopL = sopL + tipL[State] * clL[Base];
sopR = sopR + tipR[State] * clR[Base];
State = State.next;
Base = Base.next;
}
clP[h + i] = sopL * sopR;
}
// find the maximum of the previously computed values (scaler value)
scaler = 0;
for (i = 0; i < 4; i++)
if (clP[h + i] > scaler) scaler = clP[h + i];
// normalize the previously computed values
for (i = 0; i < 4; i++)
clP[h + i] = clP[h + i] / scaler;
// store the log of the scaler value and store in scP array
scP[k] = log(scaler);
// update the lnScaler array
lnScaler[k] = scP[k] * lnScaler[k];
// accumulate the loglikelihood value
condLike = 0;
for (i = 0; i < 4; i++)
condLike = condLike + bs[i] * clP[h + i];
lnL = lnL + numSites[k] * (lnScale[k] + log(condLike));
// increment counters
h = h + 4; clL = clL + 4; clR = clR + 4;
}
double log (double x) {
// initialize binary search
log_comp = −16;
coeff_set = 0;
coeff_incr = 8;
log_comp_incr = 8;
// perform a logarithmic binary search
for (i = 0;i < 4;i++) {
if (x < 10^log_comp) {
log_comp = log_comp  log_comp_incr;
} else {
log_comp = log_comp + log_comp_incr;
coeff_set = coeff_set + coeff_incr;
}
coeff_incr = coeff_incr / 2;
log_comp_incr = log_comp_incr / 2;
}
// compute the polynomial approximation
return_val = 0;
pow_x = 1.0;
for (i = 0; i < 5; i++) {
return_val + = return_val + coeff[coeff_set][i] * pow_x;
pow_x = pow_x * x;
}
return return_val;
}
The natural log approximation is implemented as an order5 polynomial where the coefficients are divided into 16 segments whose range scales logarithmically. The coefficients are computed using a Chebyshev approximation [10]. In all of our experiments, we verify that the results computed from our design, including the log approximation, are accurate to within 1% of the results delivered by BEAGLE.
Implementation
Our objective is to implement the PLF and tree likelihood computations on a Convey HC1 heterogeneous platform as a hardware/software codesign. This application is a good match for the HC1, which provides high memory bandwidth and its FPGAs are wellsuited for computing data intensive loops containing no loopcarried dependences except for the accumulate required for the final likelihood value. We first briefly describe the platform then discuss our implementation in detail.
Platform
The coprocessor board contains four userprogrammable Virtex5 LX 330 FPGAs called “application engines (AEs)”. The coprocessor board also contains eight memory controllers, each of which is implemented on its own Virtex5 FPGA. Each of the AEs is connected to each of the eight memory controllers through a 4x4 crossbar switch on each memory controller.
Each memory controller allows up to two independent 64bit memory transactions (read or write) per cycle on a 150 MHz clock, giving a peak theoretical bandwidth of 16 bytes * 150 MHz = 2.4 GB/s per memory controller, or 8 * 2.4 GB/s = 19.2 GB/s per AE, or 4 * 19.2 GB/s = 76.8 GB/s of aggregate bandwidth for the coprocessor board.
Although the HC1’s coprocessor shares the same memory space as the host CPU, the coprocessor’s memory is partitioned into eight disjoint regions corresponding to each of the eight memory controllers. In other words, each memory region is only accessible by one of the eight memory interfaces.
Each of the eight memory controllers are connected to two Conveydesigned scattergather DIMM (SGDIMM) modules. Each of these two DIMMs contains eight DRAMs, each with eight decoupled DRAM banks that can be addressed independently. Each of the memory controllers attempts to maximize bandwidth by scheduling incoming memory requests to each of the sixteen banks, routing requests to nonbusy banks and grouping reads and writes into bursts to minimize bus turns (state change between read and write). A reorder module, developed by Convey, rearranges the loaded data to match the request order before delivering it back to the user logic.
The effectiveness of their scheduler depends on the memory access pattern issued by the user logic on the AEs. Contention for crossbar ports and bank request FIFOs cause the HC1’s memory controller to throttle the AE by asserting a “stall” feedback signal. These stalls reduce the effective memory bandwidth.
Hardware implementation of the kernel
On the coprocessor, each AE has eight processing units (PEs). To achieve maximum memory efficiency, each should make full use of the 128bit per cycle memory channel. Each PE implements the kernel described in Pseudocode 1 and thus each consumed 10 inputs per loop iteration (for each character in the input sequence), performing 130 single precision floatingpoint operations, and producing six outputs. Since the memory interface can only deliver four floating values per cycle, each PE requires three cycles to read its inputs and one and a half cycles to write its results for each input character.
Resource and throughput constrained synthesis
The data introduction interval (DII) is the number of cycles required for the pipeline to read all of its inputs from the available input ports, i.e. $\mathit{DII}=\u2308\frac{l}{p}\u2309$, where l = number of logical input ports and p = the number of physical input ports. In this case, the number of logical input ports is ten and the number of physical input ports is four. Given these parameters, our pipeline synthesis tool synthesizes this expression into an arithmetic pipeline having minimum DII, minimum critical path latency, and using the minimum number of floatingpoint functional units.
In conventional scheduling it is sufficient to provide at least one functional unit of each required functional unit type to ensure that a schedule exists. However, in order to achieve maximum throughput, the minimum number of functional units of an operation type is $R\ge \u2308\frac{M}{\mathit{DII}}\u2309$, where M is the number of operators of a type in the DFG, and DII the data introduction interval.
We use the “As Soon As Possible (ASAP)” scheduling technique for data path synthesis. ASAP repeatedly schedules the ready operations to the time slot in a manner of firstcomefirstserved. The start time of each operation is the minimum allowed by the dependencies of operations.
Our pipeline synthesis tool takes, as input, the number of input ports of the target platform (physical input ports), the number of input variables derived from the target expression (logical input ports), and a data flow graph (DFG) representing the target expression. Given these parameters, the tool converts the DFG into a pipeline described in Verilog hardware description language with minimum number of floatingpoint functional units.
As an example, consider the high level code below, which computes one column of the conditional probability table.
clP[0] = (tipL[AA]*clL[A] + tipL[AC]*clL[C] + tipL[AG]*clL[G] + tipL[AT]*clL[T]) * (tipR[AA]*clR[A] + tipR[AC]*clR[C] + tipR[AG]*clR[G] + tipR[AT]*clR[T]);
The two 4x4 transition likelihood tables tipL and tipR are invariant across all nodes and are thus treated as constants. As such, this expression has eight inputs and one output.
Xilinx IEEE754 singleprecision floatingpoint operator’s latency and slice register usage
Floatingpoint operator  Lowlatency  Slice registers  Maxlatency  Slice registers 

fadd  3  139  12  547 
fmul  3  87  8  361 
fdiv  11  499  28  1377 
fcomp  1  2  2  8 
Since each PE has two independent 64bit physical ports, we synthesized the PLF kernel to a deeppipelined singleprecision floatingpoint circuit with four input ports that receive four 32bit data per cycle. The latency of the entire pipeline is 125 cycles.
Hardware architecture of PLF kernel
In the write portion of the memory interface there are four FIFOs for clP, scP and lnScaler pipeline outputs. Four clP outputs (clP0, clP1, clP2, and clP3) correspond to four consecutive locations of clP array. By writing data into four FIFOs of scP and lnScaler in a roundrobin manner, the memory interface can store 128 bits per cycle into the memory, increasing the output throughput.
Until the output FIFOs have not reached full state, the controller repeats this sequence. When the output FIFOs become almost full, the current load request is interrupted and the machine state jumps to its store result state. The output results in the twelve FIFOs are then stored back to memory. When stores are finished, the interrupted load request will be resumed. Note that the pipeline can continue consuming input data if the input and output FIFOs are not full. After all load requests are satisfied the machine goes to the store state to store the remaining data in the FIFOs. When it is complete, the machine returns to the initial state IDLE.
Steps of calculating root log likelihood
BEAGLE provides a set of interface functions needed for the user to describe a candidate tree and request that its likelihood be computed. Specifically, these include functions that:

initialize input arrays and scalars, including the transition probability tables and leaf node (tip) conditional likelihood tables, and all other necessary scalars and arrays,

describe the topology of the proposed tree,

request that the BEAGLE runtime library traverse the tree and update the conditional likelihood tables of all internal nodes, and

request that BEAGLE compute the likelihood of the tree.
Descriptions of BEAGLE API implementation
BEAGLE API call  Arrays initialized and coprocessor action 

beagleSetEigenDecomposition beagleUpdateTransitionMatrix beagleGetTransitionMatrix  4x4 transition probability matrices for each node (initialize arrays tipL and tipR in Pseudocode 1) 
beagleSetTipPartials  Copy an array of partials into an instance buffer (initialize arrays clL and clR in Pseudocode 1) 
beagleSetStateFrequencies  Copy a state frequency array into an instance buffer (initialize array bs in Pseudocode 1) 
beagleSetPatternWeights  Set the vector of pattern weights for an instance (initialize array numSites in Pseudocode 1) 
beagleResetScaleFactor  Reset a cumulative scale buffer 
beagleUpdatePartials  Calculate partials for all internal nodes (compute array clP in Pseudocode 1) 
beagleCalculateRootLogLike  Calculate loglikelihood of root node. (compute arrays lnScaler and scP in Pseudocode 1, calculate lnL in Pseudocode 1) 
Data organization
In order to utilize eight processing elements (PEs) on each FPGA we distribute the input data among four FPGAs evenly. We copy the host data arrays to the coprocessor’s memory space based on the organization and partition of coprocessor’s memory.
Convey’s memory mapping scheme requires that memory addressing of each PE be aligned to the MC controller number. Because of this, the data is arranged using a stride of 64 bytes for each PE in an AE and 512 bytes for the same PE space between two consecutive AEs. (e.g. AE0 and AE1).
Results and discussion
We implemented the kernel on a multiFPGA platform Convey HC1. The platform has four Xilinx Virtex5 LX330 FPGAs. The design is described using our pipeline synthesis tool which generates deep floatingpoint pipeline in Verilog HDL. Xilinx ISE 13.2 [17] is used to synthesize, map, place and route the design, and generate the final bitstream. A single bitstream file is used to configure all the FPGAs. Each FPGA works on different input data based on the FPGA numbers and processing elements in each FPGA.
Performance results of our implementations
In the software implementation using BEAGLE APIs, we timed the elapsed time of function beagleUpdatePartials(), beagleAccumulateScaleFactors() and beagleCalculateRootLogLike() since they comprise the computation kernel. In the hardware implementation, we timed the kernel execution on multiFPGA. We assume the number of sites is a multiple of 128 to allow each PE to process the same amount of input data.
Performance results of our design
nsites  CPU(us)  GPU(us)  FPGA(us)  Memory efficiency (%)  Speedup FPGA vs. CPU  Speedup FPGA vs. GPU 

128  27  93  96  2  0.28  0.97 
256  41  93  101  4  0.41  0.92 
512  69  94  103  7  0.67  0.91 
1024  133  99  106  13  1.25  0.93 
2048  225  107  107  20  2.10  1.00 
4096  462  130  115  30  4.02  1.13 
8192  944  167  125  28  7.55  1.34 
16384  1894  240  148  28  12.80  1.62 
32768  3873  385  207  28  18.71  1.86 
65536  7922  672  304  29  26.06  2.21 
131072  15898  1247  415  38  38.31  3.00 
262144  31774  n/a  764  40  41.59  n/a 
524288  63696  n/a  1240  44  51.37  n/a 
1048576  127957  n/a  2280  46  56.12  n/a 
8192000  1028649  n/a  15750  49  65.31  n/a 
We achieved a memory efficiency of around 50% for large data sizes. This is competitive with similar implementations in the literature. For example, Cong et al. reported their implementation of a bandwidthbounded application that has 32 PEs and utilizes all the memory access ports on Convey HC1 [18] having 30% efficiency. In general, the factors that contribute to the efficiency are external memory access order, memory buffer size, and frequency of memory bus turns (alternating between read and write operations).
Convey’s unique interleaved memory addressing attempts to maximize memory system utilization by distributing memory accesses across all memory banks. Memory stalls occur when the number of pending memory load requests reaches the size of memory request queue in the memory controller. In order to avoid this, the size of the custom memory buffer in the user design must be close to the size of the request queue. A smaller memory buffer cannot overcome the long latency of DDR2 memory access while a larger one increases memory stalls.
Frequently alternating between memory read and memory write will reduce the effectiveness of the Convey memory scheduler and reduce memory bandwidth. The use of deep output FIFOs and writing the entire contents of the output FIFOs when they fill will reduce the frequency of readwrite transitions and improve bandwidth.
FPGA resource usage
Area results of our design
Resource  Utilized  Total available  Utilization ratio 

LUT  193,518  207,360  93% 
FF  200,833  207,360  97% 
Slice  51,629  51,840  99% 
BRAM  235  288  82% 
DSP48E  152  192  79% 
Power consumption
The Xilinx power analyzer xpa reports that each FPGA design consumes about 23 W, or 92 W for all four FPGAs. The thermal design power of the Tesla GPU card is around 200 W. The FPGA implementation delivers a better performance while consuming less than half of the power of the Tesla GPU.
Discussion
Design motivation
This work is based on the “MrBayes accelerator” design, previously developed in the author’s lab [8]. The original design performed the same basic computations as described in this paper, but its pipeline was designed by hand and did not incorporate any functional unit reuse. As such, the original design instanced one functional unit for each operator in the DFG. The resulting design was large, allowing only one PE to fit on a single FPGA. In order to automate the design process and improve the resource efficiency, the authors developed a highlevel synthesis tool that generates a pipeline from a dataflow graph description of the kernel, and exploits functional unit reuse in such a way as to achieve the maximum throughput as bounded by the available memory bandwidth on the target platform. This synthesis tool was developed specifically for this application, but can be also used for any dataintensive kernel that has no loopcarried dependencies.
In addition, we implemented the new version of the design on the Convey HC1 reconfigurable computer, which has 13.2 times the amount of FPGA resources, 28.4 times the memory bandwidth, and over 1000 times the memory capacity of the original platform, an Annapolis Micro Systems WildStar II Pro. In order to make the design more general purpose, we integrated our design with the BEAGLE library instead of integrating it only into MrBayes 3 as in the original work.
Performance results
Our design achieves the highest possible level of performance as allowed by the memory system of the HC1. Memory efficiency was relatively low, and can be potentially improved by rearranging the order in which inputs are requested from the memory. Specifically, this can be performed by buffering a set of consecutive values from each input array before streaming the values into the pipeline in the order implied by the outermost loop in Pseudocode 1. This would require that the input values be read from memory in a different order than read by the pipeline. While we expect this to improve memory efficiency, the buffer would need to be designed in a way that doesn’t itself contain inefficiencies that negate its benefit. This is part of the authors’ future work.
Conclusions
In this paper we described an FPGAbased implementation of the core computations in the BEAGLE library that perform the phylogenetic likelihood function and tree likelihood computations. With this design we achieve 3X the performance of BEAGLE’s GPUbased implementation for large datasets.
The kernel implemented in this work is characterized by having a relatively low arithmetic intensity, making its performance dependent on the effective memory bandwidth achievable by the target platform. In order to achieve high performance under this condition, we developed a design automation tool that synthesizes the kernel’s data flow graph in a way that matches the pipeline’s throughput to the platform’s memory bandwidth while minimizing hardware requirements.
Availability and requirements
Project name: BEAGLE_HC1
Project home page: http://www.cse.sc.edu/~jbakos/software.shtml
Operating system(s): Linux
Programming language: C and Verilog
Other requirements: Must be run on a Convey HC1
License: GNU GPL v4
Any restrictions to use by nonacademics: None
Notes
Declarations
Acknowledgements
The authors wish to thank Glen Edwards of Convey Computer Corporation for his assistance in this work. The authors would also like to thank the anonymous reviewers for their insightful comments that allowed us to improve the quality of this paper.
This material is based upon work supported by the National Science Foundation under Grant No. 0844951.
Authors’ Affiliations
References
 Felsenstein J: Inferring Phylogenies. Sunderland, MA: Sinarer Associates, Inc. Publishers; 2004.Google Scholar
 Ayres DL, Aaron D, Zwick DJ, Peter B, Holder MT, Lewis PO, Huelsenbeck JP, Fredrik R, Swofford DL, Cummings MP, Andrew R, Suchard MA: BEAGLE: an Application Programming Interface and HighPerformance Computing Library for Statistical Phylogenetics. Syst Biol 2012,61(1):170173. 10.1093/sysbio/syr100PubMed CentralView ArticlePubMedGoogle Scholar
 Convey HC1 family. http://www.conveycomputer.com
 Mak TST, Lam KP: Embedded Computation of MaximumLikelihood Phylogeny Inference Using Platform FPGA. Proc. IEEE Computational Systems Bioinformatics Conference table of contents 2004, 512514.Google Scholar
 Alachiotis N, Sotiriades E, Dollas A, Stamatakis A: Exploring FPGAs for accelerating the Phylogenetic Likelihood Function. Proc. Eighth IEEE International Workshop on High Performance Computational Biology 2009, 18.Google Scholar
 Alachiotis N, Sotiriades E, Dollas A, Stamatakis A: A Reconfigurable Architecture for the Phylogenetic Likelihood Function. Proc. International Conference on Field Programmable Logic and Applications 2009, 674678.Google Scholar
 Nvidia CUDA Framework. http://www.nvidia.com/object/cuda_home_new.html
 Suchard MA, Rambaut A: ManyCore Algorithms for Statistical Phylogenetics. Bioinformatics 2009,25(11):13701376. 10.1093/bioinformatics/btp244PubMed CentralView ArticlePubMedGoogle Scholar
 Zhou J, Liu X, Stones DS, Xie Q, Wang G: MrBayes on a Graphics Processing Unit. Bioinformatics 2011,27(No. 9):12551261. 10.1093/bioinformatics/btr140View ArticlePubMedGoogle Scholar
 Stephanie Z, Bakos JD: FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods. BMC Bioinforma 2010, 11: 184. 10.1186/1471210511184View ArticleGoogle Scholar
 Nallatech, Intel Xeon FSB FPGA Accelerator Module. http://www.nallatech.com/IntelXeonFSBSocketFillers/fsbdevelopmentsystems.html
 DRC Computer, DRC Reconfigurable Processor Units (RPU). http://www.drccomputer.com/drc/modules.html
 XtremeData Inc., XD2000i™ FPGA InSocket Accelerator for Intel FSB. http://www.xtremedata.com/products/accelerators/insocketaccelerator/xd2000i
 Xilinx Core Generator. http://www.xilinx.com/ise/products/coregen_overview.pdf
 Xilinx Floating Operator Data Sheet. 2009. http://www.xilinx.com/support/documentation/ip_documentation/floating_point_ds335.pdf
 Virtex5 FPGA User Guide. http://www.xilinx.com/support/documentation/user_guides/ug190.pdf
 Xilinx ISE Design Suite 13. http://www.xilinx.com/support/documentation/dt_ise132.htm
 Cong J, Huang M, Zou Y Proceedings of the 9th IEEE Symposium on Application Specific Processors (SASP 2011). In 3D Recursive Gaussian IIR on GPU and FPGAs, A Case Study for Accelerating BandwidthBounded Applications. San Diego, CA; 2011:7073.Google Scholar
 Xilinx UG193 Virtex5 XtreameDSP Design Considerations. http://www.xilinx.com/support/documentation/user_guides/ug193.pdf
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.