
Chapter 3 Scalable SIMD processor architecture

The instruction set of the scalable SIMD processor architecture can be divided into instructions on the scalar processing units and instructions on the vector processing units. The analysis of algorithms on the EVP shows that scalar processing is necessary for two purposes: control flow operations and the calculation of scalar parameters that are later broadcast to vectors. In rare cases, access to single vector elements (for setting or reading one value at a time) is required. The vector data path requires arithmetic instructions (addition, subtraction, negation, multiplication, MAC), as well as comparison and maximum/minimum instructions on pairs of data vectors for sorting (e.g. for MIMO detection and channel decoding). Furthermore, shift instructions for scaling and instructions for determining a shift distance (e.g. by counting leading bits of two's complement data) are useful. Permutation instructions are required to reorder vector elements and are elaborated in more detail in section 3.1.5.

Table 3.1: Supported basic arithmetic data types (X = supported, - = no entry)

Word length   Data format   Saturation support   Rounding support   Description
16 bits       integer       X                    -                  default integer data type
1 bit         Boolean       -                    -                  Boolean data type
16 bits       Q.15          X                    X                  default fixed-point data type
16+16 bits    Q.15          X                    X                  complex-valued fixed-point data type using consecutive vector elements for imaginary and real part
40 bits       Q8.31         X                    X                  accumulator data type for multiplication and MAC operation
40+40 bits    Q8.31         X                    X                  complex-valued accumulator data type
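To make the saturation and rounding columns of Table 3.1 concrete, the following minimal sketch models Q.15 arithmetic on raw 16-bit integers. The function names and the round-half-up rounding mode are assumptions made for illustration; the source does not specify which rounding mode the hardware uses.

```python
Q15_MIN, Q15_MAX = -(1 << 15), (1 << 15) - 1  # 16-bit two's complement range


def saturate_q15(value: int) -> int:
    """Clamp to the 16-bit range instead of letting the result wrap around."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x: float) -> int:
    """Convert a real number to its raw Q.15 representation (hypothetical helper)."""
    return saturate_q15(round(x * (1 << 15)))


def from_q15(raw: int) -> float:
    """Convert a raw Q.15 integer back to a real number (hypothetical helper)."""
    return raw / (1 << 15)


def add_q15(a: int, b: int) -> int:
    """Saturating addition of two raw Q.15 values."""
    return saturate_q15(a + b)


def mul_q15(a: int, b: int) -> int:
    """Q.15 x Q.15 multiplication with rounding and saturation.

    Assumption: round half up, implemented by adding half an LSB
    before discarding the 15 surplus fractional bits.
    """
    product = a * b                          # exact, 30 fractional bits
    rounded = (product + (1 << 14)) >> 15    # back to 15 fractional bits
    return saturate_q15(rounded)
```

Without saturation, 0.75 + 0.5 would wrap to a negative value in 16 bits; with saturation the result is clamped to the largest representable Q.15 value. The only multiplication that can overflow is (-1) x (-1), which is clamped in the same way.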

In addition to these basic requirements on the instruction set, most SIMD-based processor architectures implement instructions and processing units that accelerate one or several applications [vHM+05, MG08, SVPG+10, WLS+08a]. Examples of application-independent specialized operations that can be accelerated are division, square root, and reciprocal square root. For example, a reciprocal square root is required for the QR matrix decomposition [GVL96], which is utilized in different MIMO algorithms, such as the sphere decoder in chapter 5. An example of a specialized processing unit that accelerates some algorithms is the EVP's intra vector unit (IVU, see section 2.3.1). This processing unit supports minimum and maximum search over all elements of one data vector and summation of vector elements. For example, the QRD-M MIMO detection algorithm (section 5.2) benefits from minimum search support for the detection of the most likely transmitted symbols. Although there is a potential performance gain from these specialized instructions and processing units, the focus of this thesis is on the analysis and assessment of the benefit of an increased SIMD vector width and not on the design of accelerators. Therefore, neither specialized instructions nor specialized processing units have been considered during the design of the scalable SIMD processor architecture.

3.1 Development of the SIMD processor architecture based on algorithm requirements

[Figure 3.1: Definitions of arithmetic data types. 16-bit fixed-point data type: bit 15 is the sign bit, bits 14 to 0 carry the weights 2^-1 to 2^-15. 16-bit integer data type: bit 15 is the sign bit, bits 14 to 0 carry the weights 2^14 to 2^0. 40-bit accumulator data type: bit 39 is the sign bit, bits 38 to 31 carry the weights 2^7 to 2^0, bits 30 to 0 carry the weights 2^-1 to 2^-31.]
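The bit weights of Figure 3.1 can be checked with a small decoder. This is a sketch that assumes two's complement encoding, so the sign bit carries the negative weight of the next higher power of two; the function name `decode` and its parameters are hypothetical.

```python
def decode(word: int, total_bits: int, frac_bits: int) -> float:
    """Interpret `word` as a two's complement number with `frac_bits` fractional bits."""
    word &= (1 << total_bits) - 1           # keep only the stored bits
    if word & (1 << (total_bits - 1)):      # sign bit set: value is negative
        word -= 1 << total_bits
    return word / (1 << frac_bits)          # apply the binary point


# The same bit pattern under the three formats of Figure 3.1:
q15_value = decode(0x4000, total_bits=16, frac_bits=15)    # bit 14 has weight 2^-1
int_value = decode(0x4000, total_bits=16, frac_bits=0)     # bit 14 has weight 2^14
acc_value = decode(1 << 31, total_bits=40, frac_bits=31)   # bit 31 has weight 2^0
```

The eight integer bits of the accumulator format extend the value range from [-1, 1) to [-256, 256), which is what allows intermediate MAC results to grow without saturating immediately.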

As an alternative to branch-based control flow, conditional instruction execution makes the execution of an instruction depend on the value of a Boolean condition register. If the condition is true, the instruction is executed normally; otherwise, the instruction does not execute and the destination register is left unchanged. Conditional execution of SIMD operations can be controlled either by a scalar condition, referred to as predicated execution or predication, or on an element-by-element basis by a condition vector. The latter case is denoted as masked execution or masking: depending on the values of the condition mask, some elements of the destination vector are updated with newly computed values, while the remaining elements are left unchanged. Predicated execution is commonly used in processor architectures that support instruction level parallelism (ILP, see section 3.1.3), because predication avoids conditional control flow, which would otherwise limit ILP. Masked execution is useful for any kind of SIMD processor architecture, as it allows more flexibility during algorithm design by enabling the vectorization of conditional code. Furthermore, masking makes it possible to exclude some SIMD vector elements from a computation. Hence, the SIMD vector length can be temporarily decreased.
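The difference between predication and masking can be illustrated with a behavioural model in which vectors are plain Python lists. This is only a sketch of the semantics described above; the function names are hypothetical and do not correspond to instructions of any particular processor.

```python
from typing import Callable, List, Sequence


def predicated(op: Callable[[int, int], int], dest: Sequence[int],
               a: Sequence[int], b: Sequence[int], predicate: bool) -> List[int]:
    """One scalar condition decides whether the whole vector operation executes."""
    if not predicate:
        return list(dest)                    # destination register left unchanged
    return [op(x, y) for x, y in zip(a, b)]


def masked(op: Callable[[int, int], int], dest: Sequence[int],
           a: Sequence[int], b: Sequence[int], mask: Sequence[bool]) -> List[int]:
    """One condition per element decides which destination elements are updated."""
    return [op(x, y) if m else d for d, x, y, m in zip(dest, a, b, mask)]


def shortened(op: Callable[[int, int], int], dest: Sequence[int],
              a: Sequence[int], b: Sequence[int], active: int) -> List[int]:
    """Temporarily reduce the vector length by masking off the trailing elements."""
    mask = [i < active for i in range(len(dest))]
    return masked(op, dest, a, b, mask)
```

The `shortened` helper shows the last point of the paragraph: a mask that is true only for the first `active` elements behaves like an operation on a shorter SIMD vector.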

The EVP supports masking and predication for most vector operations and predication for most scalar operations, which makes it possible to evaluate the implemented algorithms for use cases of masking and predication. Furthermore, the EVP supports a so-called conditional add/subtract operation, which performs an addition where the mask element is true and a subtraction otherwise. The evaluation showed that predicated execution is never used for SIMD vector operations. The only use cases are the calculation of scalar parameters and conditional pointer updates. Masked execution is used for permutation operations during the calculation of radix-2 and mixed-radix FFTs (see chapter 4), as well as for masked arithmetic operations. The conditional add/subtract operation is repeatedly used in the HSDPA spreader.
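The conditional add/subtract operation can be modelled in a few lines. The sketch below follows the definition given above (add where the mask is true, subtract otherwise). The spreading example is an assumption about how such an operation might be applied to a code of +1/-1 values; it is not taken from the actual HSDPA spreader implementation.

```python
from typing import List, Sequence


def cond_add_sub(acc: Sequence[int], data: Sequence[int],
                 mask: Sequence[bool]) -> List[int]:
    """Element-wise: acc + data where mask is true, acc - data otherwise."""
    return [a + d if m else a - d for a, d, m in zip(acc, data, mask)]


def spread_symbol(symbol: int, code_is_plus_one: Sequence[bool]) -> List[int]:
    """Hypothetical use: multiply one symbol by a +1/-1 code without a multiplier."""
    zeros = [0] * len(code_is_plus_one)
    broadcast = [symbol] * len(code_is_plus_one)   # scalar broadcast to a vector
    return cond_add_sub(zeros, broadcast, code_is_plus_one)
```

Because the code only takes the values +1 and -1, the multiplication degenerates into a sign selection, which is exactly what the mask expresses.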

Based on the analysis, six vector and five scalar processing units have been implemented in the scalable SIMD processor architecture. The processing units and their supported operation types are listed in tables 3.2 and 3.3. As predicated execution is not necessary for vector processing units, only the scalar ALU and scalar MAC support predication. The VALU, VMAC and VPU support masking for all operations; the VALU also includes the special conditional add/subtract operation. The VMAC supports both complex-valued data types and 40-bit accumulator data types for intermediate results with increased precision.

Table 3.2: Vector processing units and supported operation types (X = masking supported, - = no entry)

Processing unit                   Abbreviation   Masking   Description of operation types
Vector arithmetic logic unit      VALU           X         arithmetical and logic operations including shift, comparison and conditional add/subtract
Mask arithmetic logic unit        MALU           -         logic operations on vector masks
Vector multiply-accumulate unit   VMAC           X         multiplication, MAC; accumulator & complex-valued data types
Vector load/store unit            VLSU           -         memory access, address updates
Vector permutation unit           VPU            X         vector permutations on a vector permutation network
Vector move unit                  VMU            -         move operations for masks / accumulator registers
Scalar exchange unit              SXU            -         vector element access, scalar broadcast

Table 3.3: Scalar processing units and supported operation types (X = predication supported, - = no entry)

Processing unit                   Abbreviation   Predication   Description of operation types
Scalar arithmetic logic unit      ALU            X             arithmetical and logical operations including shift and comparison
Predicate arithmetic logic unit   PALU           -             logic operations on Boolean predicates
Scalar multiply-accumulate unit   MAC            X             multiplication, MAC
Scalar load/store unit            LSU            -             memory access, address updates
Branch control unit               BU             -             branches, zero-overhead loops

Most vector and scalar operations are designed for single-cycle latency, as displayed in table 3.4. Exceptions are control flow operations, memory access operations, and complex-valued multiplication and MAC operations. In table 3.4, the initiation interval of an operation is defined as the minimum interval between starting the execution of one operation and starting another operation on the same unit. Hence, operations with an initiation interval of one cycle can be started every clock cycle. The complex-valued multiplication and MAC operations have an initiation interval of two clock cycles, because the computation is split into two parts that are executed in two consecutive clock cycles (see equation (3.1)), with the same multipliers used in both clock cycles to reduce the hardware overhead.

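\[
\begin{aligned}
\operatorname{Re}\{a \cdot b\} &= \operatorname{Re}\{a\} \cdot \operatorname{Re}\{b\} - \operatorname{Im}\{a\} \cdot \operatorname{Im}\{b\} \\
\operatorname{Im}\{a \cdot b\} &= \underbrace{\operatorname{Re}\{a\} \cdot \operatorname{Im}\{b\}}_{\text{cycle 1}} + \underbrace{\operatorname{Im}\{a\} \cdot \operatorname{Re}\{b\}}_{\text{cycle 2}}
\end{aligned}
\tag{3.1}
\]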
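The two-cycle split of equation (3.1) can be sketched as follows. The model assumes two real multipliers per complex element that are reused in the second cycle, with the first cycle producing the products that contain Re{a} and the second cycle those that contain Im{a}. This assignment is one reading of the cycle labels in the equation, and the function name is hypothetical.

```python
from typing import Tuple


def complex_mul_two_cycles(a_re: int, a_im: int,
                           b_re: int, b_im: int) -> Tuple[int, int]:
    """Complex multiplication a * b using two multipliers over two cycles."""
    # Cycle 1: both multipliers take Re{a} as one operand.
    mult0 = a_re * b_re
    mult1 = a_re * b_im
    result_re, result_im = mult0, mult1

    # Cycle 2: the same two multipliers are reused with Im{a}.
    mult0 = a_im * b_im
    mult1 = a_im * b_re
    result_re -= mult0        # Re{a}Re{b} - Im{a}Im{b}
    result_im += mult1        # Re{a}Im{b} + Im{a}Re{b}
    return result_re, result_im
```

A fully parallel implementation would need four multipliers per element and could start a new operation every cycle; the sketch trades that throughput for half the multiplier count, which corresponds to the initiation interval of two cycles.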

Branch and loop operations have an initiation interval equal to the instruction latency, because only one control flow operation can be processed at a time.

Table 3.4: Latencies of scalar and vector instructions measured in clock cycles

Operation type                        On unit     Latency   Init. interval
Load/store operations                 VLSU, LSU   3         1
Complex-valued multiplication / MAC   VMAC        2         2
Branch operation                      BU          4         4
Zero-overhead loop                    BU          4         4
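The interplay of latency and initiation interval can be made concrete with a small cycle-count model. The sketch assumes independent operations issued back to back on a single unit, with no stalls from data dependencies or memory conflicts; the function name is hypothetical.

```python
def cycles_for_sequence(n_ops: int, latency: int, initiation_interval: int) -> int:
    """Cycles until the last of `n_ops` back-to-back operations has completed."""
    if n_ops <= 0:
        return 0
    # The last operation starts after (n_ops - 1) initiation intervals
    # and then needs its full latency to deliver the result.
    return (n_ops - 1) * initiation_interval + latency


# Values taken from Table 3.4:
loads = cycles_for_sequence(8, latency=3, initiation_interval=1)          # 10
complex_macs = cycles_for_sequence(8, latency=2, initiation_interval=2)   # 16
branches = cycles_for_sequence(2, latency=4, initiation_interval=4)       # 8
```

Under these assumptions, eight pipelined loads complete in 10 cycles even though each load has a latency of 3 cycles, while eight complex-valued MAC operations need 16 cycles because a new one can only start every second cycle.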
