
This chapter provided an introduction to the main concepts that are necessary to comprehend the rest of the thesis, highlighted the importance of accelerating MCMC to tackle large-scale problems, gave a brief introduction to current hardware platforms and presented a review of previous related work.

The literature review revealed that, although a significant amount of work on MCMC acceleration has been published during the last five years, this work has been limited in several ways. Research that uses FPGAs tackles only the acceleration of the Bayesian likelihood computation (for a specific Bayesian model) and not the acceleration of the MCMC algorithm itself. Also, the special features of FPGAs (e.g. custom precision) have not been exploited. Even research that employs CPUs and GPUs to accelerate MCMC is typically limited to direct mappings of algorithms to hardware. The idea of adapting the MCMC algorithm to make it more suitable for the underlying platform is unexplored.

The following chapters of this thesis focus on extending the existing MCMC acceleration research in various ways, so that the above limitations are tackled. In particular, they look at how FPGA technology can be exploited to accelerate classes of MCMC methods for which only CPU and GPU implementations currently exist. Novel algorithms, which are more suitable for FPGA mapping, are also proposed. Finally, the following chapters explore various ways in which custom precision can be used to accelerate MCMC.

Algorithms and architectures for Population-based MCMC

3.1 Introduction

A common form of complexity in Bayesian posterior distributions is multi-modality, i.e. the existence of two or more separate modes in the probability density. Multi-modal distributions appear in many Bayesian inference applications, e.g. machine learning using Restricted Boltzmann Machines or mixture models [22, 40, 7], computational genetics [13, 25] and biological simulations [27]. They cause baseline MCMC samplers (e.g. the Metropolis sampler [30]) to get stuck in one of the modes of the distribution for a long time, which makes these samplers inefficient.

Population-based MCMC (popMCMC) [7] is a class of methods specifically designed to address multi-modality in the target distribution. Parallel Tempering (PT) [41] is the most popular of these methods (see Section 2.3.2 and [41]). This chapter proposes ways to tackle the computational challenges of PT, e.g. the processing burden of running multiple MCMC chains instead of the one chain used by basic MCMC methods. The chapter focuses on combining hardware acceleration (using FPGAs but also CPUs and GPUs) with novel algorithmic modifications based on the use of custom arithmetic precision. Both the characteristics of the FPGA architecture and the structure of PT are exploited to accelerate inference. The main questions that this chapter seeks to answer are the following:

• “How can PT be parallelized in multi-core CPU, GPU and FPGA implementations and what are the gains?”

• “Is there a way to reduce the arithmetic precision in large parts of the algorithm without affecting sampling accuracy?”

• “What extra speedup does such a strategy deliver in each platform?”

The results of this chapter demonstrate that significant speedups are possible when parallelizing PT and that smart modifications to the algorithm permit the reduction of precision in the majority of PT computations without any cost in sampling accuracy. This reduction translates to significant area savings and throughput improvement in FPGA designs.
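As a reminder of the computation being accelerated, the structure of PT can be illustrated with a minimal Python sketch. It is not the implementation developed in this chapter: the bimodal toy target, the temperature ladder, the step size and all function names are assumptions chosen for illustration only. Each chain performs a Metropolis update on a tempered version of the target, and a randomly chosen pair of neighbouring chains then proposes to exchange states.

```python
import math
import random

def log_target(x):
    """Unnormalized log-density of a toy bimodal target:
    an equal mixture of N(-4, 1) and N(4, 1) (illustrative assumption)."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def parallel_tempering(n_iters, temps, step=1.0, seed=0):
    """Return the samples of the chain at temps[0] (expected to be 1.0)."""
    rng = random.Random(seed)
    xs = [-4.0] * len(temps)  # all chains start in the left mode
    samples = []
    for _ in range(n_iters):
        # Local Metropolis update of every chain on target**(1/T).
        # The chains are independent in this loop.
        for i, t in enumerate(temps):
            prop = xs[i] + rng.gauss(0.0, step)
            log_alpha = (log_target(prop) - log_target(xs[i])) / t
            # 1 - random() lies in (0, 1], so the logarithm is defined.
            if math.log(1.0 - rng.random()) < log_alpha:
                xs[i] = prop
        # Exchange move between a random pair of neighbouring chains.
        i = rng.randrange(len(temps) - 1)
        j = i + 1
        log_alpha = (1.0 / temps[i] - 1.0 / temps[j]) * (
            log_target(xs[j]) - log_target(xs[i])
        )
        if math.log(1.0 - rng.random()) < log_alpha:
            xs[i], xs[j] = xs[j], xs[i]
        samples.append(xs[0])
    return samples

samples = parallel_tempering(20000, temps=[1.0, 2.0, 4.0, 8.0])
```

In this sketch the inner loop over chains has no dependencies between iterations, which is the kind of parallelism a multi-core CPU, GPU or FPGA implementation can exploit. Only the exchange move couples the chains.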

Chapter outline

Section 3.2 repeats basic background information on the PT algorithm for easier reference (PT has already been presented in Chapter 2). It also describes the available forms of parallelism in the algorithm. The remaining sections contain the main contributions of the chapter, which are the following:

1. An optimized FPGA accelerator for PT, which employs double precision and delivers a speedup of up to 174x over sequential code running on a single-core CPU. Highly optimized implementations of PT on a multi-core CPU and a GPU are also proposed, delivering speedups of up to 16.1x and 165x respectively over sequential code. Each implementation takes advantage of specific features of the respective hardware platform in order to maximize sampling throughput (Section 3.3.1).

2. Two novel, custom precision methods (i.e. algorithmic modifications) for PT, which allow the use of reduced precision in parts of the algorithm and thus lead to reduced runtimes (Section 3.3.2). Both methods guarantee that the use of reduced precision does not affect sampling quality, i.e. does not introduce error in the Monte Carlo estimate of Equation (2.8). Instead, reducing precision affects the mixing of the PT algorithm, allowing for a trade-off between raw speedup and mixing. The first method uses a weighting scheme to correct errors (Weighted PT - WPT). The second method uses custom precision in parts of the algorithm which do not affect output accuracy (Mixed-Precision PT - MPPT). A theoretical proof that MPPT maintains the detailed balance condition (which is necessary to guarantee convergence to the correct target distribution) is presented. WPT is guaranteed to converge to the correct target distribution because it is essentially an Importance Sampling (IS) method. All necessary conditions for IS to sample from the correct distribution apply in the WPT case by design.

3. Two tailored architectures which map the two custom precision methods to an FPGA, taking advantage of reduced precision to improve performance (Section 3.3.3). These accelerators offer further speedups of up to 6.5x over the baseline (double precision) FPGA accelerator. The two custom precision methods are also mapped to a CPU and a GPU, delivering speedups of up to 1.4x and 3.2x respectively over the double precision samplers.

4. A precision optimization process for WPT and MPPT on FPGAs, which finds the precision configuration that maximizes effective sampling throughput, defined later in the text (Section 3.6.4). The optimization process takes advantage of the trade-off between speedup and mixing to deliver maximum effective performance.
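The Importance Sampling argument mentioned for WPT in item 2 can be illustrated with a generic, self-normalized IS estimator. The sketch below is not WPT itself: the exact target (a standard normal), the wider Gaussian that stands in for a density evaluated inaccurately, and all function names are assumptions made for illustration. It shows only that samples drawn from the "wrong" density still give a correct Monte Carlo estimate once each sample is weighted by the ratio of the exact density to the density actually sampled.

```python
import math
import random

def log_p(x):
    """Unnormalized log-density of the exact target, N(0, 1) (assumption)."""
    return -0.5 * x * x

def log_q(x):
    """Unnormalized log-density of the sampled density, N(0, 1.5^2).
    It stands in for an inaccurately evaluated target (assumption)."""
    return -0.5 * (x / 1.5) ** 2

def weighted_estimate(f, n, seed=0):
    """Self-normalized IS estimate of E_p[f(x)] from samples of q."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.5) for _ in range(n)]
    log_w = [log_p(x) - log_q(x) for x in xs]
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    return sum(wi * f(x) for wi, x in zip(w, xs)) / sum(w)

def unweighted_estimate(f, n, seed=0):
    """Plain average over the same samples, ignoring the mismatch."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.5) for _ in range(n)]
    return sum(f(x) for x in xs) / n

second_moment = weighted_estimate(lambda x: x * x, 50000)
biased_moment = unweighted_estimate(lambda x: x * x, 50000)
```

Under these assumptions the weighted estimate of the second moment approaches the exact value of 1, whereas the unweighted average approaches 2.25, the second moment of the density that was actually sampled. Because the weights are normalized by their sum, the unknown normalizing constants of both densities cancel.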

The performance of the various accelerators is evaluated using a case study representative of the types of problems that PT is applied to: Bayesian inference on a mixture model [40, 16]. This case study leads to a multi-modal posterior. An investigation of the way the performance of the accelerators scales with the size of the chain population, the size of the data set and the size of the hardware device is also presented. Finally, results on the power efficiency of each accelerator are included (Section 3.6.3).