The instruction set of the scalable SIMD processor architecture can be divided into instructions on the scalar processing units and instructions on the vector processing units. The analysis of algorithms on the EVP shows that scalar processing is necessary for two purposes: control flow operations and the calculation of scalar parameters that are later broadcast to vectors. In rare cases, access to single vector elements (for setting or reading one value at a time) is required. The vector data path requires arithmetical instructions (addition, subtraction, negation, multiplication, MAC), as well as comparison and maximum/minimum instructions on pairs of data vectors for sorting (e.g. for MIMO detection and channel decoding). Furthermore, shift instructions for scaling and instructions for determining a shift distance (e.g. by counting the leading bits of two's complement data) are useful. Permutation instructions are required to reorder vector elements and are elaborated in more detail in section 3.1.5.
Chapter 3 Scalable SIMD processor architecture
Table 3.1: Supported basic arithmetic data types
Word length   Data format   Saturation support   Rounding support   Description
16 bits       integer       X                    -                  default integer data type
1 bit         Boolean       -                    -                  Boolean data type
16 bits       Q.15          X                    X                  default fixed-point data type
16+16 bits    Q.15          X                    X                  complex-valued fixed-point data type using consecutive vector elements for imaginary and real part
40 bits       Q8.31         X                    X                  accumulator data type for multiplication and MAC operation
40+40 bits    Q8.31         X                    X                  complex-valued accumulator data type
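The shift-distance calculation mentioned in this section can be illustrated with a minimal Python sketch. It assumes 16-bit two's complement elements and a block-wise normalization in which all elements of a vector share one shift distance; the function names `leading_sign_bits` and `normalize` are hypothetical and do not correspond to instructions of the architecture.

```python
def leading_sign_bits(value: int, width: int = 16) -> int:
    """Count how far a two's complement value can be shifted left without overflow."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError("value does not fit the word length")
    count = 0
    # Another left shift by one bit is safe while the shifted value still fits.
    while count < width - 1 and lo <= (value << (count + 1)) <= hi:
        count += 1
    return count


def normalize(vector, width: int = 16):
    """Scale all elements by the largest common shift that avoids overflow (assumed scheme)."""
    shift = min(leading_sign_bits(v, width) for v in vector)
    return [v << shift for v in vector], shift


scaled, shift = normalize([3, -5, 100])
print(scaled, shift)
```

In this sketch the element with the largest magnitude determines the common shift distance, which is why a minimum over the per-element counts is taken.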
In addition to these basic requirements on the instruction set, instructions and processing units that accelerate one or several applications are implemented in most SIMD-based processor architectures [vHM+05, MG08, SVPG+10, WLS+08a]. Examples of application-independent specialized operations that can be accelerated are division, square root, and reciprocal square root. For example, a reciprocal square root is required for the QR matrix decomposition [GVL96], which is utilized in different MIMO algorithms, such as the sphere decoder in chapter 5. An example of a specialized processing unit that accelerates some algorithms is the EVP's intra vector unit (IVU, see section 2.3.1). This processing unit supports minimum and maximum search over all elements of one data vector and summation of vector elements. For example, the QRD-M MIMO detection algorithm (section 5.2) benefits from minimum search support for the detection of the most likely transmitted symbols. Although there is a potential performance gain from these specialized instructions and processing units, the focus of this thesis is on the analysis and assessment of the benefit of an increased SIMD vector width and not on the design of accelerators. Therefore, neither specialized instructions nor specialized processing units have been considered during the design of the scalable SIMD processor architecture.
3.1 Development of the SIMD processor architecture based on algorithm requirements
[Figure 3.1, panels: 16-bit fixed-point data type (bit 15: sign; bits 14 to 0: weights 2^-1 to 2^-15); 16-bit integer data type (bit 15: sign; bits 14 to 0: weights 2^14 to 2^0); 40-bit accumulator data type (bit 39: sign; bits 38 to 31: weights 2^7 to 2^0; bits 30 to 0: weights 2^-1 to 2^-31)]
Figure 3.1: Definitions of arithmetic data types
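The following Python sketch illustrates how the Q.15 and Q8.31 formats could interact in a multiply-accumulate operation. It rests on several assumptions that are not stated in the text: the Q.15 product (30 fractional bits) is aligned to Q8.31 by a left shift of one bit, rounding is round-half-up, and saturation clips to the representable range. All function names are hypothetical.

```python
import math


def saturate(value: int, width: int) -> int:
    """Clip an integer to the two's complement range of `width` bits."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, value))


def to_q15(x: float) -> int:
    """Convert a real number to Q.15 with round-half-up and saturation."""
    return saturate(math.floor(x * (1 << 15) + 0.5), 16)


def mac_q15(acc: int, a: int, b: int) -> int:
    """acc (Q8.31, 40 bits) += a * b, with a and b in Q.15."""
    # Q.15 * Q.15 has 30 fractional bits; one left shift aligns it to 31 (assumption).
    product = (a * b) << 1
    return saturate(acc + product, 40)


def acc_to_q15(acc: int) -> int:
    """Reduce a Q8.31 accumulator to Q.15 with rounding and saturation."""
    rounded = (acc + (1 << 15)) >> 16  # drop 16 fractional bits, round half up
    return saturate(rounded, 16)


acc = mac_q15(0, to_q15(0.5), to_q15(0.25))
print(acc_to_q15(acc) / (1 << 15))  # 0.125
```

The eight integer bits of the accumulator provide headroom, so that intermediate sums larger than one do not saturate before the final conversion back to Q.15.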
As an alternative to branch-based control flow, conditional instruction execution enables the conditional execution of an instruction based on the value of a Boolean condition register. If the value of the condition is true, the instruction is executed normally; otherwise, the instruction does not execute and the value of the destination register is left unchanged. Conditional execution of SIMD operations can be performed either using scalar conditions, referred to as predicated execution or predication, or on an element-by-element basis using a condition vector. The latter case is denoted as masked execution or masking because, depending on the values of the condition mask, some elements of the destination vector are updated with newly computed values, while the remaining elements are left unchanged. Predicated execution is commonly used in processor architectures that support instruction level parallelism (ILP, see section 3.1.3), because predication avoids conditional control flow, which would otherwise limit ILP. Masked execution is useful for any kind of SIMD processor architecture, as it allows more flexibility during the algorithm design by enabling the vectorization of conditional code. Furthermore, masking makes it possible to exclude some SIMD vector elements from a computation. Hence, the SIMD vector length can be temporarily decreased.
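The difference between predication and masking can be sketched in a few lines of Python. Vectors are modelled as plain lists, and the helper names `predicated` and `masked` are hypothetical; this is only an illustration of the semantics described above, not of the actual instruction encoding.

```python
from operator import add


def predicated(op, predicate, dest, *srcs):
    """Scalar condition: the whole vector operation executes or is skipped."""
    if not predicate:
        return list(dest)  # destination register is left unchanged
    return [op(*elems) for elems in zip(*srcs)]


def masked(op, mask, dest, *srcs):
    """Condition vector: each destination element is updated only if its mask bit is true."""
    return [op(*elems) if m else d
            for m, d, elems in zip(mask, dest, zip(*srcs))]


a, b, dest = [1, 2, 3, 4], [10, 20, 30, 40], [0, 0, 0, 0]
print(predicated(add, False, dest, a, b))                     # [0, 0, 0, 0]
print(masked(add, [True, False, True, False], dest, a, b))    # [11, 0, 33, 0]
print(masked(add, [True, True, False, False], dest, a, b))    # [11, 22, 0, 0]
```

The last call shows how a mask that is true only for the first elements effectively shortens the vector length for one operation.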
The EVP supports masking and predication for most vector operations and predication for most scalar operations, which makes it possible to evaluate the implemented algorithms for use cases of masking and predication. Furthermore, the EVP supports a so-called conditional add/subtract operation, which performs an addition where the mask element is true and a subtraction otherwise. The evaluation showed that predicated execution is never used for SIMD vector operations. The only use cases are the calculation of scalar parameters and conditional pointer updates. Masked execution is used for permutation operations during
the calculation of radix-2 and mixed-radix FFTs (see chapter 4), as well as for masked arithmetical operations. The conditional add/subtract operation is repeatedly used in the HSDPA spreader.
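The conditional add/subtract operation can be sketched as follows. The function name `cond_add_sub` is hypothetical, saturation is ignored, and the interpretation of the mask as a sequence of binary chips in the usage example is only an assumption about how a spreader might use such an operation.

```python
def cond_add_sub(mask, a, b):
    """Element-wise a + b where the mask is true, a - b otherwise."""
    return [x + y if m else x - y for m, x, y in zip(mask, a, b)]


# Assumed usage: accumulate a symbol with a sign given by a binary chip sequence.
chips = [True, False, False, True]
accumulators = [0, 0, 0, 0]
symbol = [7, 7, 7, 7]
print(cond_add_sub(chips, accumulators, symbol))  # [7, -7, -7, 7]
```

Compared with two masked operations using complementary masks, a single combined operation of this kind needs only one issue slot per vector.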
Based on the analysis, six vector and five scalar processing units have been implemented in the scalable SIMD processor architecture. The processing units and their supported
Table 3.2: Vector processing units and supported operation types
Processing unit                   Abbreviation   Masking   Description of operation types
Vector arithmetic logic unit      VALU           X         arithmetical and logic operations including shift, comparison and conditional add/subtract
Mask arithmetic logic unit        MALU           -         logic operations on vector masks
Vector multiply-accumulate unit   VMAC           X         multiplication, MAC; accumulator & complex-valued data types
Vector load/store unit            VLSU           -         memory access, address updates
Vector permutation unit           VPU            X         vector permutations on a vector permutation network
Vector move unit                  VMU            -         move operations for masks / accumulator registers
Scalar exchange unit              SXU            -         vector element access, scalar broadcast
operation types are listed in tables 3.2 and 3.3. As predicated execution is not necessary for vector processing units, only the scalar ALU and scalar MAC support predication. The VALU, VMAC and VPU support masking for all operations; the VALU also includes the special conditional add/subtract operation. The VMAC supports both complex-valued data types and 40-bit accumulator data types for intermediate results with increased precision. Most vector and scalar operations are designed for single-cycle latency as displayed in table 3.4. Exceptions are control flow operations, memory access operations, and complex-valued multiplication and MAC operations. In table 3.4, the initiation interval of an operation is defined as the minimum interval between starting the execution of one operation and starting another operation on the same unit. Hence, operations with an initiation interval of one cycle can be started every clock cycle. The complex-valued
Table 3.3: Scalar processing units and supported operation types
Processing unit                   Abbreviation   Predication   Description of operation types
Scalar arithmetic logic unit      ALU            X             arithmetical and logical operations including shift and comparison
Predicate arithmetic logic unit   PALU           -             logic operations on Boolean predicates
Scalar multiply-accumulate unit   MAC            X             multiplication, MAC
Scalar load/store unit            LSU            -             memory access, address updates
Branch control unit               BU             -             branches, zero-overhead loops
multiplication and MAC operations have an initiation interval of two clock cycles, as the computation is split into two parts in two consecutive clock cycles (see equation (3.1)), but the same multipliers are used in both clock cycles to reduce the hardware overhead.
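\begin{equation}
\begin{aligned}
\operatorname{Re}\{a \cdot b\} &= \operatorname{Re}\{a\} \cdot \operatorname{Re}\{b\} - \operatorname{Im}\{a\} \cdot \operatorname{Im}\{b\} \\
\operatorname{Im}\{a \cdot b\} &= \underbrace{\operatorname{Re}\{a\} \cdot \operatorname{Im}\{b\}}_{\text{cycle 1}} + \underbrace{\operatorname{Im}\{a\} \cdot \operatorname{Re}\{b\}}_{\text{cycle 2}}
\end{aligned}
\tag{3.1}
\end{equation}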
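The two-cycle split of equation (3.1) can be sketched in Python as follows. The sketch assumes two multipliers per complex element that are reused in the second cycle; the number of multipliers and the function name `complex_mul_two_cycles` are assumptions made for illustration, and fixed-point scaling, rounding and saturation are omitted.

```python
def complex_mul_two_cycles(a_re, a_im, b_re, b_im):
    """Complex product computed in two steps that reuse the same two multipliers."""
    # Cycle 1: both multipliers take Re{a} as their first operand.
    m0 = a_re * b_re
    m1 = a_re * b_im
    re, im = m0, m1

    # Cycle 2: the same multipliers take Im{a}; the results are combined
    # with the partial results of cycle 1.
    m0 = a_im * b_im
    m1 = a_im * b_re
    re = re - m0
    im = im + m1
    return re, im


print(complex_mul_two_cycles(1, 2, 3, 4))  # (-5, 10), i.e. (1+2j)*(3+4j)
```

Because the multipliers are occupied in both cycles, a new complex multiplication on the same unit can only start every second cycle, which matches the initiation interval of two.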
Branch and loop operations have an initiation interval equal to the instruction latency, because only one control flow operation can be processed at a time.
Table 3.4: Latencies of scalar and vector instructions measured in clock cycles
Operation type                          On unit     Latency   Init. interval
Load/store operations                   VLSU, LSU   3         1
Complex-valued multiplication / MAC     VMAC        2         2
Branch operation                        BU          4         4
Zero-overhead loop                      BU          4         4
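The relation between latency and initiation interval in table 3.4 can be made concrete with a small Python sketch. It models a single processing unit that issues operations back to back and ignores data dependencies between them; the class `OpTiming`, the dictionary keys and the function `schedule` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpTiming:
    latency: int        # cycles from start until the result is available
    init_interval: int  # minimum cycles between two starts on the same unit


# Values taken from table 3.4; all other operations are treated as single-cycle.
TIMING = {
    "load_store": OpTiming(3, 1),
    "complex_mul": OpTiming(2, 2),
    "branch": OpTiming(4, 4),
    "zero_overhead_loop": OpTiming(4, 4),
}
SINGLE_CYCLE = OpTiming(1, 1)


def schedule(ops):
    """Return (start cycle, result-ready cycle) for operations issued on one unit."""
    next_free = 0
    result = []
    for name in ops:
        timing = TIMING.get(name, SINGLE_CYCLE)
        start = next_free
        result.append((start, start + timing.latency))
        next_free = start + timing.init_interval
    return result


print(schedule(["load_store"] * 3))   # [(0, 3), (1, 4), (2, 5)]
print(schedule(["complex_mul"] * 3))  # [(0, 2), (2, 4), (4, 6)]
```

Load/store operations overlap in this model because their initiation interval is shorter than their latency, whereas complex-valued multiplications start only every second cycle.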