To the best of our knowledge there were no previous attempts to implement fast binary-field arithmetic on the Cell. The cycle counts for all binary-field operations are summarized in Table 5.2 for both approaches. Our experiments showed that on the Cell processor the bitsliced implementation of highly parallel binary-field arithmetic is more efficient than the standard (non-bitsliced) implementation. For applications that do not process large batches of different independent computations the non-bitsliced approach remains of interest.
Using the bitsliced normal-basis implementation (which uses DMA transfers to main memory to support a batch size of 512 for Montgomery inversions) on all six SPUs of a Sony PlayStation 3 in parallel, we can compute 25.57 million iterations per second. The expected total number of iterations required to solve the ECDLP given in the ECC2K-130 challenge is $2^{60.9}$ (see [7]). This number of iterations can be computed in 2,654 PlayStation 3 years.
5.6 Conclusion
In the first part of this chapter we developed SIMD multiplication modulo primes of the form $(2^{32\ell} \pm m)/c$ for small $\ell, m, c \in \mathbb{Z}_{>0}$ that achieves a speedup of approximately 30% over more traditional methods. It uses a redundant representation modulo $2^{32\ell} \pm m$ and a truncation-based reduction method, whose probability of producing an incorrect result has been argued to be very small. The method is suitable for error-tolerant applications, such as cryptanalytic ones.
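The idea of working in a redundant representation can be illustrated with a minimal Python sketch. It assumes the "minus" case with $\ell = 4$, $m = 3$ and $c = 11 \cdot 6949$, which gives the 112-bit prime $(2^{128} - 3)/(11 \cdot 6949)$; these concrete values and all function names are chosen here for illustration. The sketch performs an exact folding reduction, not the truncation-based SIMD reduction developed in this chapter.

```python
# Illustrative sketch only: exact arithmetic in a redundant representation
# modulo R = 2^(32*l) - m, where the prime P = R / c divides R.
# Parameter values and names are assumptions made for this example.

L_WORDS = 4                      # l: number of 32-bit words
M_SMALL = 3                      # m: small constant in 2^(32*l) - m
C = 11 * 6949                    # c: small cofactor
BITS = 32 * L_WORDS              # 128
R = (1 << BITS) - M_SMALL        # redundant modulus 2^128 - 3
assert R % C == 0
P = R // C                       # the 112-bit prime

MASK = (1 << BITS) - 1


def reduce_redundant(x):
    """Fold x below 2^128 using 2^128 = 3 (mod R).

    The result is congruent to x modulo R (hence modulo P) but is
    not necessarily the canonical residue.
    """
    while x >> BITS:
        x = (x & MASK) + M_SMALL * (x >> BITS)
    return x


def mul_redundant(a, b):
    """Multiply two redundant residues; the result stays below 2^128."""
    return reduce_redundant(a * b)


def canonical(x):
    """Map a redundant residue to its unique representative modulo P."""
    return x % P
```

Intermediate values only need to be kept below $2^{128}$; the comparatively expensive reduction modulo the prime itself is postponed until a canonical result is required.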
As an application, we have shown the cryptanalytic potential of a commonly available toy by using a cluster of PlayStation 3 game consoles to solve an elliptic curve discrete logarithm problem over a 112-bit prime field. The runtimes and their extrapolations provide upper bounds for the effort required to solve larger instances of the same problem using a larger
network of game consoles. Such a network is in principle accessible using programs such as BOINC [3]. Although surreptitious application of such programs would not be difficult to arrange for any miscreant who desires to do so, the effort required to solve a “practically relevant” problem remains staggering.
In the second part of this chapter we have outlined a novel approach to implement fast (non-bitsliced) binary-field arithmetic. Although a bitsliced approach turned out to be faster in practice for this setting, the standard approach (unlike the bitsliced approach) can be used to speed up arithmetic in single-stream settings such as cryptography.
Chapter 6
Efficient SIMD arithmetic modulo a Mersenne number
Numbers of a special form often allow faster modular arithmetic operations than generic moduli. This is exploited in a variety of applications and has led to a substantial body of literature on the subject of fast special arithmetic. Speeding up calculations using special moduli was already proposed in the mid-1960s by Merrill [142] in the setting of residue number systems (RNS) [88]. Other applications range from speeding up fast Fourier transform based multiplication [64], enhancing the performance of digital signal processing [69, 187, 195], to faster elliptic curve cryptography (ECC; [124,143]), such as in [12].
Another application area of special moduli is in factorization attempts of so-called Cunningham numbers, numbers of the form $b^n \pm 1$ for b = 2, 3, 5, 6, 7, 10, 11, 12 up to high powers.
This long-term factorization project, originally reported in the Cunningham tables [66] and still continuing in [52], has a long and distinguished record of inspiring algorithmic developments and large-scale computational projects [48, 49, 130, 134, 149, 163]. Factorizations from [52] with b = 2 are used in formal correctness proofs of floating point division methods [101]. Several of these developments [133] turned out to be applicable beyond special form moduli, and are relevant for security assessment of various common public-key cryptosystems.
This chapter concerns efficient arithmetic modulo a Mersenne number, an integer of the form $2^M - 1$. These numbers, and a larger family of numbers called generalized Mersenne numbers [8, 58, 188], have found many arithmetic applications ranging from number theoretic transforms [44] to cryptography. In the latter they are used to run calculations concurrently using RNS [9] or to improve the speed of finite field arithmetic in ECC based schemes [188, 199]. The Great Internet Mersenne Prime Search project [89] is based on an implementation of the Lucas-Lehmer primality test [129, 139] for Mersenne numbers in the many-million-bit range. Hence, efficient arithmetic modulo a Mersenne number is a widely studied subject, not just of interest in its own right but with many applications.
Our interest in arithmetic modulo a Mersenne number was triggered by a potential (special) number field sieve (NFS) project [133], for which we need a list of composites dividing
$2^M - 1$ for exponents M in the range from 1000 to 1200. The Cunningham tables contain over 20 composite Mersenne numbers (or composite factors thereof) in the desired range that have not been fully factored yet. It may be expected that some of these composites are not suitable candidates for our list because they can be factored faster using the elliptic curve method (ECM) for integer factorization [136] than by means of special NFS (SNFS). The only way to find out whether ECM is indeed preferable is to subject each candidate to an extensive ECM effort (which, though it may be substantial, is small compared to the effort that would be required by SNFS): only candidates that ECM failed to factor should be included in the list.
The efficiency of ECM factoring attempts relies on the efficiency of integer arithmetic modulo the number being factored. Given the need to do extensive ECM pre-testing for over 20 composite Mersenne numbers, we developed arithmetic operations modulo a Mersenne number suitable for implementation of ECM on the platform that we intended to use for the calculations: the Cell processor as found in the Sony PlayStation 3 (PS3) game console.
Because each ECM effort consists of a large number of independent attempts that can be executed in single instruction multiple data (SIMD) mode and because each core of the Cell processor can be interpreted as a 4-way SIMD environment, our arithmetic modulo a Mersenne number is geared towards SIMD implementation.
This chapter is published as [39].
6.1 Arithmetic Modulo $2^M - 1$ on the SPE
In this section we describe the SPE-arithmetic that we developed for arithmetic modulo $N = 2^M - 1$, for M in the range from 1000 to 1200 (allowing larger values as well). Notice that the following description can easily be carried over to numbers of the form $2^M + 1$.
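The reason a Mersenne modulus is attractive is that $2^M \equiv 1 \pmod{N}$, so the high part of a product can simply be added onto the low part. The following Python sketch shows this folding on arbitrary-precision integers; the function names are chosen here for illustration, and the sketch does not model the limb-based SIMD representation used on the SPE.

```python
# Illustrative sketch only: reduction modulo N = 2^M - 1 using shifts,
# masks and additions, based on 2^M = 1 (mod N).

def mod_mersenne(x, M):
    """Return x mod (2^M - 1) for a non-negative integer x and M >= 1."""
    N = (1 << M) - 1
    while x > N:
        # Split x = hi * 2^M + lo and replace it by lo + hi.
        x = (x & N) + (x >> M)
    # At this point 0 <= x <= N; N itself represents zero.
    return 0 if x == N else x


def mul_mersenne(a, b, M):
    """Multiply a and b modulo 2^M - 1."""
    return mod_mersenne(a * b, M)
```

For example, `mul_mersenne(a, b, 1193)` agrees with `(a * b) % (2**1193 - 1)` for non-negative `a` and `b`, without performing a division.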
Assume that $M < 13 \cdot 96 - 2 = 1246$ (larger M-values can be accommodated by putting $M < u \cdot v - 2$ with $v \cdot (2^{u-1})^2 < 2^{31}$). Our approach aims to optimize overall throughput as opposed to minimizing per-process latency. Two variants are presented: a first approach where addition and subtraction are fast at the cost of a radix conversion before and after the multiplication, and an alternative approach where radix conversions are avoided at the cost of slower addition and subtraction. This second variant turns out to be faster for our ECM application. In applications with a different balance between the various operations the first approach could be preferable, so it is described as well. All our methods are particularly suited to SPE-implementation, but the approach may have broader applicability. See Section 2.1 for the notation of the integer representation.
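The parameter constraint at the start of this section can be checked numerically. The sketch below reads the bound as $v \cdot (2^{u-1})^2 < 2^{31}$, that is, a sum of $v$ products of digits of magnitude at most $2^{u-1}$ must fit in a signed 32-bit word; this reading, and the interpretation of $u$ as the digit size in bits and $v$ as the number of digits, are assumptions made here. The function name is hypothetical.

```python
# Illustrative sketch only: check the parameter constraints, assuming
# u is the number of bits per digit and v is the number of digits.

def params_ok(M, u, v):
    """Return True if M < u*v - 2 and v * (2^(u-1))^2 < 2^31."""
    fits_exponent = M < u * v - 2
    fits_accumulator = v * (2 ** (u - 1)) ** 2 < 2 ** 31
    return fits_exponent and fits_accumulator


print(params_ok(1200, 13, 96))   # True: 96 * 2^24 < 2^31 and 1200 < 1246
print(params_ok(1246, 13, 96))   # False: M must be strictly below 1246
print(params_ok(1500, 13, 128))  # False: 128 * 2^24 equals 2^31
```

Under this reading, the choice $u = 13$, $v = 96$ covers every exponent in the range from 1000 to 1200.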