FPGAs are well suited to implementing the final search phase over a key sub-space. Linear feedback shift registers (LFSRs) form the main part of the DSC algorithm, and they can be implemented much more efficiently on an FPGA than on a CPU or GPU platform. We therefore decided to implement the time-consuming final search for the correct key on an FPGA.
Basic Implementation Idea
An FPGA implementation of the improved DSC attack requires knowledge of a valid reference (IV, Keystream) pair. It must iterate over all potentially valid cipher keys (according to an equation system), compute the keystream for each key, and compare it to the known keystream. The FPGA implementation therefore needs three components: a cipher key generator, a DSC keystream generator, and a compare unit that checks the keystream output against the reference keystream.
When an attack against a particular key is run, the FPGA is given a reference tuple (IV, Keystream) as well as equation systems of the form Ak = b that constrain the key space by only allowing keys k that satisfy all equations. The device must report every cipher key that produces the reference keystream to the PC that controls the FPGA.
The most convincing way to implement the key generator is to use a counter or full-cycle LFSR that generates the independent bits, together with a combinatorial function that takes the independent bits as input and generates the dependent bits. For this purpose, the equation systems must be transformed beforehand such that the dependent bits are described as a function of the independent bits. The DSC keystream generator can be implemented straightforwardly as described in Section 4.3.
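The transformation of the equation system can be modelled in software. The following Python sketch is an illustration only, not the VHDL design: it row-reduces Ak = b over GF(2) so that each dependent (pivot) bit becomes an XOR of independent (free) bits plus a constant, and then enumerates the independent bits with a plain counter. The function names and the 4-bit example system are assumptions chosen for brevity.

```python
from itertools import product

def solve_gf2(A, b):
    """Row-reduce the augmented system [A|b] over GF(2).

    Returns (pivots, free, rows): pivots are the dependent bit positions,
    free are the independent bit positions, and rows[i] is the reduced
    equation (coefficients followed by the right-hand side) for pivots[i].
    """
    n = len(A[0])
    rows = [list(r) + [bi] for r, bi in zip(A, b)]
    pivots, r = [], 0
    for c in range(n):
        p = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if p is None:
            continue  # no pivot in this column: bit c stays independent
        rows[r], rows[p] = rows[p], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                rows[i] = [x ^ y for x, y in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    if any(row[n] for row in rows[r:]):
        raise ValueError("equation system is inconsistent")
    free = [c for c in range(n) if c not in pivots]
    return pivots, free, rows[:r]

def generate_keys(A, b):
    """Yield every key k with A k = b by counting over the independent bits."""
    n = len(A[0])
    pivots, free, rows = solve_gf2(A, b)
    for values in product((0, 1), repeat=len(free)):  # models the counter
        k = [0] * n
        for c, v in zip(free, values):
            k[c] = v
        for c, row in zip(pivots, rows):  # models the combinatorial function
            bit = row[n]
            for f in free:
                bit ^= row[f] & k[f]
            k[c] = bit
        yield k

if __name__ == "__main__":
    A = [[1, 1, 0, 0],
         [0, 1, 1, 1]]
    b = [1, 0]
    for key in generate_keys(A, b):
        print(key)
```

In hardware, the inner loop over the pivot rows corresponds to a fixed XOR network, so only the counter over the independent bits needs to be clocked.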
Optimizations
Compared to a straightforward implementation, optimizations are possible on several levels:
Basic Improvements. This includes sharing the key generator among several DSC units, removing unnecessary control signals, and keeping logic delays short by inserting registers on critical paths.
DSC Speedup. The fundamental DSC implementation as described in Section 4.3 requires three clock cycles per bit of keystream output. This can be reduced to one clock cycle by multiplexing and re-arranging the feedback taps. The new tap positions were determined by using a matrix representation of the LFSRs [2].
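The idea behind the matrix representation can be sketched in Python. This is a toy model under stated assumptions: the register length, tap positions, and shift convention below are invented for illustration and are not the DSC registers. One LFSR step is a matrix M over GF(2), so several steps collapse into a single power of M, and each row of that power lists the state bits that must be XORed to produce one bit of the new state in a single clock cycle.

```python
def step_matrix(n, taps):
    """Transition matrix of a toy n-bit Fibonacci LFSR.

    Convention (assumption): bit 0 receives the XOR of the tapped bits,
    and every other bit i takes the previous value of bit i - 1.
    """
    M = [[0] * n for _ in range(n)]
    for t in taps:
        M[0][t] = 1
    for i in range(1, n):
        M[i][i - 1] = 1
    return M

def mat_mul(X, Y):
    """Matrix product over GF(2)."""
    n = len(X)
    return [[sum(X[i][k] & Y[k][j] for k in range(n)) & 1
             for j in range(n)] for i in range(n)]

def mat_vec(M, s):
    """Apply a GF(2) matrix to a state vector."""
    return [sum(m & x for m, x in zip(row, s)) & 1 for row in M]

def mat_pow(M, e):
    """M raised to the power e over GF(2), by repeated multiplication."""
    n = len(M)
    R = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(e):
        R = mat_mul(M, R)
    return R

def taps_per_bit(M):
    """For each new state bit, the old state bits that feed its XOR."""
    return [[j for j, v in enumerate(row) if v] for row in M]

if __name__ == "__main__":
    M = step_matrix(5, (2, 4))   # toy register, not a DSC polynomial
    M3 = mat_pow(M, 3)           # three steps merged into one
    for bit, sources in enumerate(taps_per_bit(M3)):
        print(f"new bit {bit} = XOR of old bits {sources}")
```

If a register has to advance by a varying number of steps, one option is to compute the corresponding matrix powers and select between them with a multiplexer; that selection logic is not modelled here.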
Key Loading. Section 4.3 proposes to load the session key in 128 clock cycles. The same can be done in one cycle by loading the initial state in parallel instead of clocking it in. The combinatorial function transforming the cipher key into the initial state can be obtained by using the matrix representation of the LFSRs.
As a second step of improvement, the calculation of the full cipher key can be skipped: as described before, the dependent part of the cipher key is a combinatorial function of the independent cipher key bits. Therefore, the whole initial state can be expressed directly as a function of the independent cipher key bits.
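Because serial key loading is linear over GF(2), the parallel load function can be derived by feeding unit vectors through a serial model. The Python sketch below illustrates this with a toy register; the register length, taps, 16-bit key length, and the way key bits enter the feedback are all assumptions and do not reproduce the DSC key loading.

```python
import random

N = 8          # toy register length (assumption)
TAPS = (3, 7)  # toy feedback taps (assumption)

def load_serial(key_bits):
    """Clock the key in bit by bit, starting from the all-zero state."""
    state = [0] * N
    for kb in key_bits:
        fb = kb
        for t in TAPS:
            fb ^= state[t]
        state = [fb] + state[:-1]
    return state

def derive_load_matrix(key_len):
    """Build L such that load_serial(key) == L * key over GF(2).

    Column j of L is the state reached when only key bit j is set; this
    is valid because loading from the zero state is a linear map.
    """
    cols = []
    for j in range(key_len):
        unit = [0] * key_len
        unit[j] = 1
        cols.append(load_serial(unit))
    return [[cols[j][i] for j in range(key_len)] for i in range(N)]

def load_parallel(L, key_bits):
    """One-step load: each state bit is an XOR of selected key bits."""
    return [sum(a & b for a, b in zip(row, key_bits)) & 1 for row in L]

if __name__ == "__main__":
    L = derive_load_matrix(16)
    key = [random.randint(0, 1) for _ in range(16)]
    print(load_parallel(L, key) == load_serial(key))
```

Composing L with the affine map from independent key bits to the full key would give the initial state directly as a function of the independent bits, which is the second improvement step described above.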
Hard-Coding. Where the plain attack from Section 7.3 proposes one equation system Ak = b, the key ranking allows us to reuse the matrix A and just invert one or more equations, i.e. modify b, if no key has been found for a particular sub-space.
Therefore, only the b vector needs to be loaded into the FPGA at run time, while the A matrix can be hard-coded into the design by a VHDL preprocessor. This saves hardware resources on the FPGA and reduces the complexity. It also eliminates a potentially critical path, as it is no longer necessary to combine all independent counter bits. The reference keystream can be hard-coded as well.
Early Abort. A cipher key can be considered invalid as soon as one bit comparison with the reference keystream fails, so the next key can be loaded immediately after a failed comparison. For a wrong key, the comparison already fails at the second keystream bit on average, such that n−2 cycles are saved on average when the reference keystream has a length of n. Early Abort introduces non-determinism, however, which increases the control overhead.
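The average of two comparisons can be checked with a short calculation. It assumes that, for a wrong key, each keystream bit matches the reference bit independently with probability 1/2; the symbol T for the position of the first failing comparison is introduced here for illustration only.

```latex
% T = index of the first failing comparison for a wrong key,
% assuming independent matches with probability 1/2 per bit.
\begin{align*}
  \Pr[T = t] &= 2^{-t}, \qquad t = 1, 2, \dots \\
  \mathrm{E}[T] &= \sum_{t \ge 1} t \, 2^{-t} = 2 \\
  \mathrm{E}[\min(T, n)] &= \sum_{t=0}^{n-1} \Pr[T > t]
                          = \sum_{t=0}^{n-1} 2^{-t}
                          = 2 - 2^{1-n}
\end{align*}
% Cycles saved per wrong key: n - E[min(T, n)] = n - 2 + 2^{1-n},
% which is approximately n - 2 for all but very short keystreams.
```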
Pre-Ciphering Pipeline. With the Early Abort optimization, several DSC units compete to be loaded with a new initial state. As the complexity of the arbitration logic rises with the number of competing units, this number should be kept low. A good way to do this is to move the pre-ciphering phase into a strictly sequential, deterministic pipeline. With this optimization, the state after pre-ciphering is loaded directly into the competing DSC units.
Input Buffering. Idle time of the FPGA has a negative impact on the effective performance. Therefore, an input FIFO is inserted such that the PC can queue multiple tasks and the FPGA can directly load the next task as soon as the previous one is finished.
Implementation
For our implementation, a Xilinx Spartan-3E 1200 (XC3S1200E) FPGA on a Digilent Nexys 2 board was used. The PC communication was implemented via the on-board RS-232 interface.
Our final implementation includes all optimizations described in the previous section. The runtime of the design is not entirely deterministic because, for a specific keystream, the position of the first failing comparison with the reference keystream is not known in advance. Therefore, the key generator can be paused, which is necessary when all available DSC units are busy.
One pipelined key generator was chosen to serve four DSC units. The complete design consumed only about 30% of the FPGA resources in total, such that three instances could be created on our device.
Performance Evaluation
This section compares the performance achieved by our FPGA implementation with the CUDA performance published in [29].
To evaluate the maximum frequency, we synthesized the design for five different, randomly generated equation systems. Table 5 shows a summary of our results.
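Table 5: Performance Evaluation

| Platform            | Max Frequency | Performance [keys/s] | Cost [US$] | Cost-Performance [keys/(US$·s)] |
|---------------------|---------------|----------------------|------------|---------------------------------|
| FPGA                | 140 MHz       | 408.8·10^6           | 169        | 2.42·10^6                       |
| CUDA / GTX 260 [29] | unknown       | 148·10^6             | 190        | 0.78·10^6                       |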
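The cost-performance column is the performance divided by the cost. The short Python check below recomputes it from the table values; the function name is an assumption for illustration. On these figures the FPGA comes out roughly a factor of three ahead of the GPU per dollar.

```python
def cost_performance(keys_per_second, cost_usd):
    """Keys tested per second for each US dollar of hardware cost."""
    return keys_per_second / cost_usd

if __name__ == "__main__":
    fpga = cost_performance(408.8e6, 169)  # FPGA row of Table 5
    gpu = cost_performance(148e6, 190)     # CUDA / GTX 260 row of Table 5
    print(f"FPGA: {fpga / 1e6:.2f}e6 keys/(US$*s)")
    print(f"GPU:  {gpu / 1e6:.2f}e6 keys/(US$*s)")
    print(f"FPGA advantage: {fpga / gpu:.1f}x")
```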