

Approximate computing is an emerging area of research [31]. The general idea is to design applications and hardware so that they can tolerate a loss of quality or accuracy in their results in exchange for improved performance or energy efficiency.

Some existing work has integrated approximate computing concepts into compilation systems. PetaBricks is a language and compiler which supports variable accuracy algorithms [7]. Programmers provide multiple implementations of algorithms, and with the use of benchmarking and auto-tuning, the system selects the appropriate implementation for a new platform. Programmers can also provide accuracy constraints for the implementation of an algorithm using annotations, which the compiler can incorporate into the code generation process, producing faster but less accurate code. Green is another compiler system which supports approximate computing, and is targeted specifically at reducing the energy used by programs [9].

Chapter 4 introduces a method of generating low-level vectorized code to implement customized data storage formats, which are designed to represent reduced-accuracy results produced by approximate algorithms. The idea is to save space in memory (and reduce memory bandwidth) by using formats which have lower precision than natively supported types.

2.5.1 Floating Point Programs with Reduced Accuracy

Exploiting reduced accuracy requirements in floating point computations to accelerate applications is the topic of several related works. Some, like Buttari et al. [16], choose to reduce precision in increments of the available native types on the platform. Buttari et al. perform the computationally expensive portions of numerical algorithms using single precision arithmetic, and the less expensive portions using double precision arithmetic. They note that the choice of single precision arithmetic leads to speedups of up to 2x versus double precision for applications which are memory bound, due to the 2x reduction in the amount of data transferred through the memory bus.
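The 2x reduction in data volume follows directly from the element sizes of the two native types. The following minimal sketch is illustrative only and is not taken from Buttari et al.; the array length and contents are arbitrary assumptions, and it uses Python's standard `array` module to hold the same values in both precisions.

```python
from array import array

# Hypothetical data set: n evenly spaced values in [0, 1).
n = 100_000
values = [i / n for i in range(n)]

# 'd' is the native C double, 'f' the native C float.
# On mainstream platforms these occupy 8 and 4 bytes respectively.
as_double = array("d", values)
as_single = array("f", values)

bytes_double = as_double.itemsize * len(as_double)
bytes_single = as_single.itemsize * len(as_single)

# Ratio of bytes that must cross the memory bus to stream each array once.
ratio = bytes_double / bytes_single

# The price of the smaller footprint: the worst rounding error introduced
# by storing each value in single precision.
max_abs_error = max(abs(d - s) for d, s in zip(as_double, as_single))

print(f"double: {bytes_double} bytes, single: {bytes_single} bytes")
print(f"traffic ratio: {ratio}, max abs error: {max_abs_error:.3e}")
```

For a memory-bound kernel, halving the bytes streamed is what bounds the achievable speedup at roughly 2x.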

This approach is also used by Rubio-Gonzalez et al. [71], who present an automated system which finds a compile-time instantiation of the types of floating point variables in the program which improves performance. Their approach is subject to accuracy constraints which are specified by the programmer via annotations in the source code. Like Buttari et al. [16], they consider only the available native machine types.

Lam et al. [41] also pursue this approach, using binary instrumentation and translation to modify existing binaries and automatically find a satisfactory mixed-precision version of a program. Input programs are constrained to use double precision arithmetic only, and the system searches for program variables which can have their precision lowered without adversely affecting the computed results. Again, only natively supported types are considered for replacement.

2.5.2 Customized Data Storage Formats

Programs which produce approximate results with reduced accuracy often do not require the full precision of native types to represent those results. In this situation, the use of full precision native formats can lead to inflated storage requirements and excess memory traffic. Using a specialized data storage format can reduce the amount of memory traffic created by the program, with associated gains in energy efficiency and overall runtime.
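As a concrete illustration of a storage format narrower than any native type, the sketch below keeps only the three most significant bytes of each IEEE-754 single precision value. The format and the names `pack_float24` and `unpack_float24` are hypothetical and chosen for this example; the code is scalar and uses simple truncation, so it shows the storage layout only, not a vectorized or correctly rounded implementation.

```python
import struct


def pack_float24(values):
    """Store each value as the top 3 bytes of its binary32 encoding.

    The low 8 mantissa bits are discarded (truncation), leaving
    1 sign bit, 8 exponent bits and 15 explicit mantissa bits.
    """
    out = bytearray()
    for v in values:
        # Reinterpret the single precision encoding as a 32-bit integer.
        bits = struct.unpack(">I", struct.pack(">f", v))[0]
        out += (bits >> 8).to_bytes(3, "big")
    return bytes(out)


def unpack_float24(data):
    """Expand 3-byte values back to floats by zero-filling the low byte."""
    values = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big") << 8
        values.append(struct.unpack(">f", struct.pack(">I", bits))[0])
    return values


sample = [1.0, 3.14159, -2.5]
packed = pack_float24(sample)
print(len(packed), "bytes instead of", 4 * len(sample))
print(unpack_float24(packed))
```

Under these assumptions each element occupies 3 bytes rather than 4, a 25% reduction in memory traffic relative to single precision, at the cost of 8 bits of mantissa.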

Using vectorization to support specialized storage formats is particularly attractive. Vectorized execution provides a low-cost means of accelerating data reorganization and conversion by exploiting fine-grained data parallelism across multiple accesses. Vectorized implementations of specialized storage formats can make use of special vector reorganization instructions provided by modern processors, which are not available for scalar registers. In addition, vectorized execution is already the norm for many numerical applications.

2.5.3 Multibyte Floating Point

Jenkins et al. [33] propose to accelerate the I/O performance of applications by reducing the resolution of the data. Although the goal of that work is similar to ours, it differs from the approach we propose in several important ways.

The most important practical difference is that our approach parallelizes at the level of loop iterations, using vector memory access and vector reorganization to achieve a parallel speedup. Jenkins et al. evaluate their low-resolution scheme at extreme scale using thread-level parallelism via MPI, and use MPI to perform the data layout transformations. They note that a low-level vectorized approach appears to be promising future work, given that data reorganization using MPI incurs significant overhead. In addition, the work of Jenkins et al. only deals with reads, whereas we propose an end-to-end approach which can deal with writes of reduced precision data in addition to reads.

Another important refinement over the work of Jenkins et al. is the topic of rounding, which is not addressed by that work. Jenkins et al. use simple truncation to obtain low-resolution data. This approach causes a measure of avoidable error which can be reduced significantly by performing correct rounding to select the nearest representable value in the target format for a given input value. In Chapter 4 we present a scheme for supporting custom floating point storage formats with correct rounding of data.
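The difference between truncation and rounding can be seen in a small scalar sketch. It is not the scheme of Chapter 4: it assumes a hypothetical 16-bit target format consisting of the upper half of an IEEE-754 single precision value, handles finite inputs only (NaN is not treated specially), and the function names are illustrative.

```python
import struct


def f32_bits(x):
    """Return the binary32 encoding of x as a 32-bit unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_f32(b):
    """Return the float whose binary32 encoding is the integer b."""
    return struct.unpack(">f", struct.pack(">I", b))[0]


def truncate16(x):
    """Keep the upper 16 bits and zero the rest (round toward zero)."""
    return bits_f32(f32_bits(x) & 0xFFFF0000)


def round16(x):
    """Round to the nearest 16-bit value, ties to even (finite x only)."""
    bits = f32_bits(x)
    # Adding 0x7FFF plus the lowest kept bit carries into the kept bits
    # exactly when the discarded part exceeds half a unit in the last
    # place, or equals half and the kept value is odd.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return bits_f32((bits + bias) & 0xFFFF0000)


# Sample values in [1, 2) that are exactly representable in binary32.
samples = [i / 4096 for i in range(4096, 8192)]
max_trunc_err = max(abs(truncate16(v) - v) for v in samples)
max_round_err = max(abs(round16(v) - v) for v in samples)
print(max_trunc_err, max_round_err)
```

For the input 1.005859375, truncation yields 1.0 while rounding yields 1.0078125, which is the closer of the two representable neighbours. Over the sample range the worst-case error under rounding is bounded by half a unit in the last place of the target format, whereas under truncation it approaches a full unit.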