As the scalable SIMD processor architecture has a limited number of vector registers (16 general-purpose SIMD vector registers, see chapter 3.1.4), not all FFTs can be implemented without spilling data to memory between FFT stages. Short radix-2 FFTs that require at most eight data vectors² can be implemented in a single step, without spilling.
The maximum FFT size for processing the FFT in a single step is 8 · V; hence, the three shortest implemented radix-2 FFTs can be implemented in one step for all vector lengths. The FFT sizes are listed in table 4.4.
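The 8 · V limit can be checked against table 4.4 with a few lines of Python. This is a minimal sketch, not part of the described implementation: the function name `short_fft_sizes` and the parameter `element_bits` are hypothetical, and the sketch assumes that one complex sample occupies 32 bit, so that a vector holds V = SIMD width / 32 samples (an assumption that happens to reproduce the table).

```python
def short_fft_sizes(simd_bits, element_bits=32):
    """Return the three radix-2 FFT sizes up to the single-step limit 8 * V."""
    # Assumption: one complex sample occupies element_bits bits, so a vector
    # holds V = simd_bits / element_bits samples.
    v = simd_bits // element_bits
    max_size = 8 * v  # eight data vectors of V samples each
    # Three consecutive powers of two that end at the 8 * V limit.
    return [max_size // 4, max_size // 2, max_size]


for bits in (128, 256, 512, 1024):
    print(bits, short_fft_sizes(bits))
```

Under the stated assumption, the printed lists match the rows of table 4.4 for all four SIMD widths.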
4.5 Radix-2 and mixed-radix FFT implementations based on LTE

Table 4.4: Short radix-2 FFTs that fit into the vector register file

SIMD width   128 bit     256 bit      512 bit       1024 bit
FFT sizes    8, 16, 32   16, 32, 64   32, 64, 128   64, 128, 256

² The remaining vector registers are used for twiddle factor vectors and intermediate results.

Longer radix-2 and mixed-radix FFTs have to be split into groups of consecutive FFT stages, which process a subset of the complete DFT. The complete processing of the grouped FFT stages is realized by loops on the input data. The FFT stages are grouped to achieve a good ratio between computational operations on the VALU and VMAC and memory access operations on the VLSU. The computational operations cannot be avoided, as they are necessary for the FFT algorithm. Hence, the utilization of the VALU and the VMAC is a lower bound for the runtime of a loop on a LIW processor architecture. Useful VALU and VMAC operations and memory access operations can be performed in parallel in one LIW operation. If the number of memory access operations is smaller than or equal to the number of useful operations on the VALU or VMAC, an overhead due to memory access can potentially be avoided by efficient LIW programming. If more memory access operations are required than computational operations on the VALU or VMAC, the runtime is determined by the number of memory access operations.

As each FFT stage operates on complete vectors and consecutive stages process different data values, the register demand increases with the number of consecutive DFT stages. In addition to the registers for input data, further registers are required for twiddle factors and intermediate results. In particular, radix-3, radix-5, and radix-6 FFT stages require many data vectors for intermediate results (see section 4.5.2). Based on these restrictions, at most three consecutive radix-2 stages can be grouped together (eight input data vectors). Radix-5 and radix-6 stages cannot be efficiently grouped together with other FFT stages, due to the high register demand for intermediate values. Multiple radix-3 stages also cannot be grouped together, yet a radix-3 stage can potentially be combined with one or two radix-2 FFT stages (depending on the number of registers required for twiddle factors). In the majority of cases, consecutive radix-3 and radix-2 stages should be replaced by a radix-6 stage, which has a lower computational complexity (see section 4.5.2).
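The register argument for the limit of three grouped radix-2 stages can be sketched as a small budget calculation. The sketch assumes that a group of k radix-2 stages works on 2^k data vectors (the text states this only for k = 3, i.e. eight vectors). The function name and the number of registers reserved for twiddle factors and intermediate results are hypothetical and only serve as illustration.

```python
NUM_VECTOR_REGISTERS = 16  # from the text: 16 general-purpose SIMD vector registers


def max_grouped_radix2_stages(reserved_registers):
    """Largest k such that 2**k data vectors plus the reserve fit into the register file.

    reserved_registers is a hypothetical count of registers kept free for
    twiddle factor vectors and intermediate results.
    """
    k = 0
    # Assumption: grouping k radix-2 stages requires 2**k input data vectors.
    while 2 ** (k + 1) + reserved_registers <= NUM_VECTOR_REGISTERS:
        k += 1
    return k


for reserved in (1, 4, 8, 9):
    print(reserved, max_grouped_radix2_stages(reserved))
```

With any reserve between one and eight registers the sketch yields three stages, because a fourth stage would need all 16 registers for data alone. A larger reserve reduces the group to two stages.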
If only two radix-2 FFT stages can be grouped together (i.e. all other stages have already been grouped), the number of operations for loading and storing data is the same as the number of useful operations on the VALU and the VMAC (see section 3.1.4, table 3.6). If further memory access is necessary for twiddle factors, the runtime is dominated by memory access. A single radix-2 FFT stage always requires more clock cycles on the VLSU than on the VALU or the VMAC.
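The runtime argument used above can be written as a simple bound. This is a sketch under the assumption that each functional unit (VALU, VMAC, VLSU) issues at most one operation per LIW instruction. The function name and the operation counts in the usage lines are hypothetical and not taken from table 3.6.

```python
def loop_bound(valu_ops, vmac_ops, vlsu_ops):
    """Lower bound on LIW instructions per loop iteration and the limiting resource.

    Assumption: every unit executes at most one operation per LIW instruction
    and operations on different units can be issued in parallel.
    """
    compute = max(valu_ops, vmac_ops)
    cycles = max(compute, vlsu_ops)
    # Memory access only limits the loop if it exceeds the useful computations.
    limited_by = "memory access" if vlsu_ops > compute else "computation"
    return cycles, limited_by


# Hypothetical counts: balanced loop versus a loop with extra twiddle factor loads.
print(loop_bound(valu_ops=8, vmac_ops=4, vlsu_ops=8))
print(loop_bound(valu_ops=8, vmac_ops=4, vlsu_ops=10))
```

In the first call the memory accesses can be hidden behind the computations. In the second call the additional loads make the VLSU the limiting unit, which corresponds to the case in which twiddle factors have to be loaded inside the loop.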
Table 4.5 lists the decompositions of radix-2 and mixed-radix FFTs into groups of FFT stages for different SIMD widths. Groups of FFT stages whose performance is degraded by memory access are emphasized by using bold font and underlines (e.g. 2). In most cases, the decomposition into groups of FFT stages is the same for all SIMD vector lengths. The only exceptions are the 1024-point and the 384-point FFTs. The 384-point FFT requires a different grouping of FFT stages on 512-bit and 1024-bit SIMD processors than on processors with a smaller SIMD width, as a different FFT algorithm is used, because the constraint for the vectorization of the mixed-radix FFT is not satisfied (see section 4.5.4).

Chapter 4 Radix-2 and mixed-radix FFTs for OFDM-A and SC-FDMA

Table 4.5: Decomposition of long radix-2 and mixed-radix FFTs into groups of FFT stages in loops. The notation 2^x means that x radix-2 stages are grouped together. A dash marks a cell without an entry.

SIMD bit width   128 bit              256 bit              512 bit              1024 bit
64-pt. FFT       2^3, 2^3             short FFT            short FFT            short FFT
128-pt. FFT      2^3, 2, 2^3          2^3, 2, 2^3          short FFT            short FFT
256-pt. FFT      2^3, 2^2, 2^3        2^3, 2^2, 2^3        2^3, 2^2, 2^3        short FFT
512-pt. FFT      2^3, 2^3, 2^3        2^3, 2^3, 2^3        2^3, 2^3, 2^3        2^3, 2^3, 2^3
1024-pt. FFT     2^3, 2^3, 2, 2^3     2^3, 2^3, 2, 2^3     2^3, 2^3, 2, 2^3     2^3, 2^2, 2^2, 2^3
2048-pt. FFT     2^3, 2^3, 2^2, 2^3   2^3, 2^3, 2^2, 2^3   2^3, 2^3, 2^2, 2^3   2^3, 2^3, 2^2, 2^3
192-pt. FFT      2^3, 3, 2^3          2^3, 3, 2^3          -                    -
384-pt. FFT      2^3, 6, 2^3          2^3, 6, 2^3          2^3, 3 · 2, 2^3      2^2, 2 · 3 · 2, 2, 2^2
576-pt. FFT      2^3, 3, 3, 2^3       2^3, 3, 3, 2^3       -                    -
768-pt. FFT      2^3, 3, 2^2, 2^3     2^3, 3, 2^2, 2^3     2^3, 3, 2^2, 2^3     -
960-pt. FFT      2^3, 5, 3, 2^3       2^3, 5, 3, 2^3       -                    -
1152-pt. FFT     2^3, 6, 3, 2^3       2^3, 6, 3, 2^3       -                    -
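Each decomposition in table 4.5 must multiply to the FFT size, which gives a quick consistency check. The following sketch uses a hypothetical string notation in which `2^3` stands for three grouped radix-2 stages and `*` replaces the dot inside a combined group such as 3 · 2. The helper names are illustrative only.

```python
from math import prod


def group_size(group):
    """Size of the sub-DFT handled by one group, e.g. '2^3' -> 8, '3*2' -> 6."""
    size = 1
    for factor in group.split("*"):
        base, _, exponent = factor.strip().partition("^")
        size *= int(base) ** int(exponent or 1)
    return size


def fft_size(decomposition):
    """Product of all group sizes of a comma-separated decomposition."""
    return prod(group_size(group) for group in decomposition.split(","))


# A few entries of table 4.5 in the hypothetical string notation.
print(fft_size("2^3, 2^3, 2, 2^3"))      # 1024-pt. FFT, 128-bit to 512-bit
print(fft_size("2^3, 2^2, 2^2, 2^3"))    # 1024-pt. FFT, 1024-bit
print(fft_size("2^2, 2*3*2, 2, 2^2"))    # 384-pt. FFT, 1024-bit
print(fft_size("2^3, 5, 3, 2^3"))        # 960-pt. FFT
```

The same check can be applied to every row of the table to catch transcription errors in the stage groupings.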
The 128-bit, 256-bit, and 512-bit implementations of the 1024-point FFT comprise one separate radix-2 FFT stage, preceded by a group of three radix-2 FFT stages. The runtime of the separate radix-2 stage is determined by memory access, while the runtime of the group of three radix-2 stages is determined by useful computations. If these radix-2 stages are instead grouped in two pairs of radix-2 stages, the performance of both corresponding loops is determined by memory access for loading twiddle factors, leading to a slightly worse performance than with the proposed decomposition. On a 1024-bit SIMD processor, all required twiddle factor vectors can be stored in registers and no memory access operations during loops are needed for loading twiddle factors. Hence, a grouping of pairs of radix-2 FFT stages offers the best performance on a 1024-bit SIMD processor architecture.

Table 4.5 also shows that only few FFTs suffer from performance degradations due to memory access. Furthermore, increasing the SIMD width counteracts performance degradations due to memory access, as long as the vectorization constraints on the ratio between FFT size and SIMD width are still satisfied.
All implementations of FFTs that satisfy the constraints on the FFT size share common loops for groups of radix-2 FFT stages that can be reused for all FFT sizes and in part also for all SIMD widths: The FFTs start and end with groups of three radix-2 FFT stages. The first group of radix-2 stages is the same for all FFT sizes and SIMD widths, only parameters, such as address offsets and twiddle factors, change. The last group of radix-2 stages performs the reordering of vector elements or part of the reordering of vector elements and can be used for all FFT implementations on the same SIMD processor architecture³. Radix-3, radix-5, and radix-6 FFT stages can be reused for different SIMD widths; they can also be reused for different FFT sizes as long as the necessary reordering of vectors is adjusted.
Memory requirements of the FFT algorithms
All short radix-2 FFTs, which can be realized by a single loop, can be performed in place, i.e. the input values are overwritten by the final output of the FFT. The memory requirements of longer radix-2 and mixed-radix FFTs depend on the grouping of FFT stages. An FFT can be implemented in place if the groups of FFT stages can perform the necessary reordering of data vectors.
FFTs with N_DFT = V · M · V perform the reordering of data vectors during the M-point FFT.⁴ If an M-point FFT fits into the register file, all necessary permutations of complete data vectors can be done in place. If M vectors do not fit into the register file, the FFT can only be computed in place if the permutation of data vectors can be split into a series of smaller permutation operations, which can be performed on the input or output of groups of FFT stages that fit into the register file. Otherwise, there is a small memory overhead for storing intermediate results during the sorting of vectors. The memory overhead can be avoided by inserting a separate sorting stage, at the cost of an increased runtime of the FFT, or by smartly overlapping memory read and write access for the same group of FFT stages on different input data, which makes it possible to perform more complex permutations of complete vectors without memory access. The latter approach leads to an increased (doubled, tripled, or quadrupled) code size of the corresponding loop. Yet, the increase in code size is significantly lower than the decrease in data memory overhead.
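The factorization N_DFT = V · M · V can be evaluated directly to see which M-point FFT carries the vector reordering for a given size. The sketch below is an illustration only: the function name is hypothetical, V is taken as the number of samples per vector, and the values 4, 8, 16, and 32 for the four SIMD widths rest on the assumption of 32-bit complex samples. It does not model the additional vectorization constraints of section 4.5.4.

```python
def middle_fft_size(n_dft, v):
    """Return M with n_dft == v * M * v, or None if no such integer M exists."""
    m, remainder = divmod(n_dft, v * v)
    return m if remainder == 0 and m > 0 else None


# Assumed samples per vector for 128-bit, 256-bit, 512-bit, and 1024-bit SIMD.
for v in (4, 8, 16, 32):
    print(v, middle_fft_size(192, v), middle_fft_size(768, v))
```

A result of `None` means that the size cannot be written in the form V · M · V for that vector length, so the in-place argument of this section does not apply directly.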