The register files (RFs) of a SIMD architecture account for a significant share of the total energy and area. Lin et al. [LLW+06] report that 7 percent of the area of SODA is needed
for the SIMD register file (16 × 512 bit); the SIMD register file consumes 37 percent of the total power for W-CDMA (2 Mbps) and 27 percent for 802.11a (24 Mbps). Hence, careful dimensioning of the register files is necessary.
The area, the delay, and the power consumption of a register file depend on the number of registers Nreg and the number of register ports p. According to register file models based on Rixner et al. [RDK+00], the area of a register file has an asymptotic complexity of $O(N_{reg} \cdot p^2)$, the delay of a register file with a large number of ports has a complexity of $O(\sqrt{N_{reg}} \cdot p)$, and the power dissipation of a register file with many ports grows as $O(N_{reg} \cdot p^2)$. Area, delay, and power dissipation can be reduced by using hierarchical, distributed, or streaming register files [RDK+00]. However, these optimizations are beyond the scope of this thesis.
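To get a feeling for these scaling laws, the following minimal Python sketch evaluates the three asymptotic terms with all constant factors omitted, so only ratios between configurations are meaningful. The function names `rf_scaling` and `relative` are illustrative, and treating p as the sum of read and write ports is an assumption. The example compares a 16-register file with 11 ports (seven read, four write) against variants with 13 ports or 32 registers.

```python
import math


def rf_scaling(n_reg: int, ports: int) -> dict:
    """Asymptotic cost terms of a register file (constant factors omitted)."""
    return {
        "area": n_reg * ports ** 2,          # O(Nreg * p^2)
        "delay": math.sqrt(n_reg) * ports,   # O(sqrt(Nreg) * p)
        "power": n_reg * ports ** 2,         # O(Nreg * p^2)
    }


def relative(candidate: dict, baseline: dict) -> dict:
    """Ratio of each cost term of `candidate` to the same term of `baseline`."""
    return {key: candidate[key] / baseline[key] for key in baseline}


# Assumption: p is taken as the total port count (read ports + write ports).
base = rf_scaling(n_reg=16, ports=7 + 4)
more_ports = relative(rf_scaling(n_reg=16, ports=8 + 5), base)
more_regs = relative(rf_scaling(n_reg=32, ports=7 + 4), base)

print("13 ports vs. 11 ports:", more_ports)
print("32 registers vs. 16 registers:", more_regs)
```

Under these assumptions, going from 11 to 13 ports raises the area and power terms by about 40 percent, while doubling Nreg doubles them and increases the delay term by a factor of √2.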
The most important register file is the general-purpose SIMD register file, which is accessed by the vector arithmetic units (VALU and VMAC). In the following, design decisions for this register file are discussed. The numbers of read and write ports, pr and pw, depend on the number of LIW slots (see section 3.1.3) and the port requirements of the functional units. pr and pw can be reduced by sharing ports between functional units. Yet, this technique prevents the parallel execution of these units. An analysis of algorithms on the EVP showed that vector element permutations and scalar element access or broadcast operations rarely occur in the same clock cycle. Hence, the ports of the VPU and SXU are shared. Altogether, the port requirements of the SIMD units are as follows:
• Two read ports and one write port for the VALU
• Three read ports and one write port for the VMAC
• One read and one write port for the VLSU
• One read and one write port shared by the VPU and SXU if a single-vector permutation network is utilized (see section 3.1.5), or two read and two write ports shared by the VPU and SXU if a double-vector permutation network is utilized
Figure 3.5 shows the connections between the general-purpose SIMD register file and the SIMD units. Assuming a single-vector permutation network, which operates on one input vector and generates one output vector, seven read and four write ports are required for the register file. A double-vector permutation network supports permutations on pairs of vectors. Therefore, a double-vector permutation network requires one additional read port and one additional write port, as illustrated by the dashed arrows in figure 3.5.
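The port totals stated above can be checked with a short Python sketch that sums the per-unit requirements from the list. The dictionary layout and the function name `total_ports` are illustrative choices, not part of the architecture description.

```python
from typing import Tuple

# (read ports, write ports) per SIMD unit, taken from the list above.
FIXED_UNITS = {
    "VALU": (2, 1),
    "VMAC": (3, 1),
    "VLSU": (1, 1),
}

# Ports shared by the VPU and SXU, depending on the permutation network.
VPU_SXU = {
    "single": (1, 1),
    "double": (2, 2),
}


def total_ports(network: str) -> Tuple[int, int]:
    """Return (pr, pw) of the general-purpose SIMD register file."""
    units = dict(FIXED_UNITS)
    units["VPU/SXU"] = VPU_SXU[network]
    reads = sum(r for r, _ in units.values())
    writes = sum(w for _, w in units.values())
    return reads, writes


print(total_ports("single"))  # (7, 4)
print(total_ports("double"))  # (8, 5)
```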
3.1 Development of the SIMD processor architecture based on algorithm requirements
[Block diagram: the general-purpose SIMD register file connected to the VALU, VMAC, VLSU, and VPU/SXU. Legend: register read; register write; additional register read (double-vector network); additional register write (double-vector network).]
Figure 3.5: Read/write connections between the general-purpose SIMD register file and the SIMD units
The number of registers Nreg has been selected based on the algorithm requirements: Nreg should be sufficient to avoid bottlenecks caused by spilling values to memory, but the register file should also not be too large. As a rule of thumb, Rixner et al. [RDK+00] claim that four registers are needed per ALU and per cycle of memory latency for a LIW processor. However, the actual number of required registers depends on the processed algorithms. The algorithm implementations on the EVP have been analyzed to identify the register demands of the algorithms. The vector register file of the EVP contains 16 registers. Except for a pair of loops in the 1024-point and 256-point FFT implementations², none of the implemented algorithms is dominated by memory access due to a lack of registers. Hence, 16 vector registers are apparently sufficient. On the other hand, reducing the number of registers to eight would significantly degrade the performance, as demonstrated by the exemplary discussion of radix-2 FFT loops below.
One radix-2 butterfly operation requires two input operands (which are stored in vectors) and (at most) one twiddle factor operand. As consecutive radix-2 butterfly stages operate on different input operands, four input operands (and twiddle factor operands) need to be available for computing two consecutive radix-2 butterfly stages without spilling data to memory. Correspondingly, eight input operands (and twiddle factor operands) need to be available for grouping three radix-2 butterfly stages together (see figure 3.6). Each radix-2 stage requires one operation per vector operand on the VMAC³ and on the VALU, while memory access always requires two operations (one load and one store operation) per vector. Table 3.6 summarizes the impact of the number of registers Nreg on the grouping of radix-2 butterfly stages: if only eight registers are available, the arithmetic operations on each unit are as numerous as the operations for loading and storing data vectors. If vectors containing twiddle factors need to be loaded from memory as well, the performance is dominated by memory access. For Nreg = 16, three radix-2 butterfly stages can be grouped together, improving the ratio of arithmetic to memory access operations. Hence, memory access for reading twiddle factor vectors can be hidden by arithmetic operations, which are executed in the same clock cycle during the LIW execution. An increase to 32 registers further improves the ratio of arithmetic to memory access operations, yet there is no performance gain.

² These loops contain the processing of two consecutive radix-2 FFT stages and require memory access for loading twiddle factors on the fly.
³ Here, one complex-valued multiplication that occupies the VMAC for two clock cycles is counted as two operations.

Chapter 3 Scalable SIMD processor architecture

[Signal-flow graph: inputs x0 to x7 in natural order, three radix-2 butterfly stages with twiddle factors, outputs in bit-reversed order.]
Figure 3.6: 8-point decimation in frequency (DIF) FFT
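The structure of such a DIF FFT can be illustrated with a minimal scalar Python sketch. It is not the vectorized EVP implementation. The forward-transform sign convention exp(-j2π/N) and all function names are assumptions made for illustration. Each pass of the outer loop corresponds to one radix-2 butterfly stage, and the result is left in bit-reversed order, which is typical for an in-place DIF FFT.

```python
import cmath


def dif_fft(x):
    """In-place style radix-2 DIF FFT; the output is in bit-reversed order."""
    n = len(x)  # assumed to be a power of two
    data = list(x)
    span = n // 2  # distance between the two butterfly operands
    while span >= 1:  # one iteration per radix-2 stage
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))  # twiddle factor
                a = data[start + k]
                b = data[start + k + span]
                data[start + k] = a + b
                data[start + k + span] = (a - b) * w
        span //= 2
    return data


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


# Output position i of the 8-point DIF FFT holds the DFT bin bit_reverse(i, 3).
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Grouping stages, as discussed in the text, corresponds to executing several iterations of the outer loop on the same set of operands while they remain in registers.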
Table 3.6: Evaluation of the grouping of radix-2 butterfly stages for different Nreg. Smax denotes the maximum number of consecutive radix-2 butterfly stages without spilling to memory; Nvec describes the number of vectors for the FFT operands needed to achieve Smax.

Nreg  Smax  Nvec  VALU/VMAC operations  Memory operations
8     2     4     8 + 8                 8 + twiddle loads
16    3     8     24 + 24               16 + twiddle loads
32    4     16    64 + 64               32 + twiddle loads
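The entries of table 3.6 can be reproduced with a short Python sketch. The rule Nvec = Nreg / 2 is an assumption fitted to the table (half of the registers hold FFT operands, while the remainder is left for twiddle factors and temporaries); it is not stated explicitly in the text. Twiddle factor loads are not counted, and the function name `grouping` is illustrative.

```python
def grouping(n_reg: int) -> dict:
    """Operation counts for grouping radix-2 stages with n_reg vector registers."""
    if n_reg < 4 or n_reg & (n_reg - 1):
        raise ValueError("n_reg is assumed to be a power of two >= 4")
    n_vec = n_reg // 2               # assumption: half of the registers hold operands
    s_max = n_vec.bit_length() - 1   # log2(n_vec) stages can be grouped
    per_unit = s_max * n_vec         # one operation per vector and stage, per unit
    memory = 2 * n_vec               # one load and one store per vector
    return {
        "s_max": s_max,
        "n_vec": n_vec,
        "valu_ops": per_unit,
        "vmac_ops": per_unit,
        "memory_ops": memory,        # twiddle loads excluded
        "ratio": per_unit / memory,  # arithmetic per unit vs. memory operations
    }


for n_reg in (8, 16, 32):
    print(n_reg, grouping(n_reg))
```

With these assumptions, the per-unit ratio of arithmetic to memory operations is 1.0 for eight registers, 1.5 for 16 registers, and 2.0 for 32 registers. This matches the observation that only the larger register files leave free load/store slots for twiddle factor loads.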
Based on the analysis above, the general-purpose SIMD register file (RF) has been implemented with 16 registers. Similar optimizations of register file sizes, as well as port sharing, have been applied to the other register files. The attributes of the implemented register files are summarized in table 3.7. Most register types support bypassing to avoid read accesses and to speed up algorithms. Furthermore, the write back to the register file can be disabled for some register types on some processing units to reduce register file power. In this case, the computed result is only available via bypassing.
Table 3.7: Register files in the scalable SIMD processor architecture. Nbit and Nreg denote the register bit-width and the number of registers in the register file, respectively. For SIMD register files, Nbit denotes the width of one element of the distributed register file. The number of read and write ports is described by pr and pw. The size of the permutation pattern registers depends on the permutation network and the SIMD width and is not listed. BYP denotes bypassing support and dis. WB describes optionally disabled write back.

Register file description                      Nbit     Nreg  pr  pw  BYP / dis. WB
General-purpose SIMD RF: single-vector perm.   16       16    7   4   X X
General-purpose SIMD RF: double-vector perm.   16       16    8   5   X X
SIMD accumulator RF                            40       2     2   2   X X
Vector mask RF: single-vector perm.            1        8     5   2   X X
Vector mask RF: double-vector perm.            1        8     6   2   X X
Permutation pattern RF                         special  8     1   1
General-purpose scalar RF                      16       16    9   5   X X
Scalar predicate RF                            1        8     3   1   X X
Pointer RF                                     16       8     2   3   X
Range, base RFs for modulo addressing          2 · 16   2     2   1
Pointer offset RF                              16       8     2   1   X