As mentioned earlier, Type 1 pairings can be evaluated on supersingular curves. These curves fall into three sub-classes: curves over binary fields ($q = 2^m$ with $k = 4$), curves over fields of characteristic 3 ($q = 3^m$ with $k = 6$) and curves over fields of large prime characteristic ($q = p$, $p > 3$ with $k = 2$). Elements of $\mathbb{F}_{3^m}$ are not straightforward to represent on a binary system. The case $k = 2$ requires $p$ to have at least 512 bits for security reasons, and arithmetic in such a large finite field would be very slow on a constrained 8-bit processor. Therefore, the most suitable curve for implementing Type 1 pairings on sensor nodes is the binary curve with embedding degree $k = 4$.
In order to achieve an adequate security level, the binary field can be chosen as $\mathbb{F}_{2^{271}}$. The equation of the supersingular curve over $\mathbb{F}_{2^{271}}$ is as follows:
$$y^2 + y = x^3 + x \tag{5.8}$$
In this case the number of points on the curve is known to be $2^{271} + 2^{136} + 1 = 487805 \cdot r$, where $r$ is a large prime, large enough to make any Pohlig-Hellman-like attack [58] on the Elliptic Curve Discrete Logarithm Problem infeasible. The next consideration is the security of the discrete logarithm in $\mathbb{F}_{(2^{271})^4}$, which is vulnerable to index-calculus attacks. The Discrete Logarithm Problem over extension fields $\mathbb{F}_{2^{m \cdot k}}$ is not well studied. However, it is believed to be roughly as hard as the Discrete Logarithm Problem over $\mathbb{F}_{2^m}$ for prime $m$, where $m \approx 4 \cdot 271 = 1084$. In this setting the current record is for $m = 613$ [68], so there is (for now) a relatively wide margin of safety.
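The parameter claims above can be checked with arbitrary-precision integer arithmetic. The following Python sketch is not part of Micro-pairings and its variable names are illustrative only. It confirms that 487805 divides the group order and searches for the smallest $k$ such that $r$ divides $2^{km} - 1$. The primality of $r$ is taken from the text and is not tested here.

```python
# Sanity checks on the curve parameters quoted in the text (m = 271).
m = 271
cofactor = 487805

# Number of points on y^2 + y = x^3 + x over F_{2^271}, as stated in the text.
group_order = 2**m + 2**((m + 1) // 2) + 1

# The cofactor has to divide the group order exactly.
assert group_order % cofactor == 0
r = group_order // cofactor

# Embedding degree: smallest k such that r divides 2^(k*m) - 1.
embedding_degree = next(k for k in range(1, 13) if pow(2, k * m, r) == 1)

print("bits in r:              ", r.bit_length())
print("embedding degree k:     ", embedding_degree)
print("bits in extension field:", embedding_degree * m)
```

The last printed value is the field size that the index-calculus comparison with the $m = 613$ record refers to.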
From the previous sections, it is known that the fastest pairing algorithm on supersingular curves is the $\eta_T$ pairing [7]. Supersingular curves lead to more efficient implementations in terms of bandwidth, memory usage and processing speed, hence they are more suitable for wireless sensor networks. The $\eta_T$ pairing was implemented according to Algorithm 9. Elements of the extension field $\mathbb{F}_{2^{4m}}$ were represented as polynomials with four coefficients in $\mathbb{F}_{2^m}$. For example, the element $b(z) = B_3 z^3 + B_2 z^2 + B_1 z + B_0$ in $\mathbb{F}_{2^{4m}}$ was represented as the vector $[B_3, B_2, B_1, B_0]$. The elements $s, t \in \mathbb{F}_{2^{4m}}$ were equal to $[0, 1, 1, 0]$ and $[0, 0, 1, 0]$ respectively. Both values were derived from the distortion map $\Psi$.
Looking at Algorithm 9, one can see that the most expensive part is the for loop, which is executed $(m + 1)/2$ times. All the operations inside this loop have to be optimized for execution speed in order to achieve good performance of the algorithm. The most time-consuming operation inside the loop is polynomial multiplication, and its implementation has a major impact on the overall execution time of the $\eta_T$ pairing. In particular, the multiplication of $\mathbb{F}_{2^{4m}}$ values (line 7) is very time-critical, but it consists of multiplications in $\mathbb{F}_{2^m}$, which again emphasizes the importance of base field multiplication. Using the Karatsuba method [74] for $\mathbb{F}_{2^{4m}}$ multiplication decreases the number of necessary operations to nine modular multiplications and some additions, which are very cheap in binary fields (just a bitwise exclusive-or of the coefficients of the polynomials). Inside the main loop there are also other important operations, such as squaring and calculation of the square root of a field element. However, these functions are not as complex as binary polynomial multiplication and, when efficiently implemented, can be performed much faster.

Table 5.3: Cost of the $\eta_T$ pairing on the curve $y^2 + y = x^3 + x$
              ModMuls   ModSquares   Square roots
Main loop        1904          544            544
Final exp.        114          139              0
The last stage of the $\eta_T$ algorithm is the final exponentiation. This step is not very time consuming, as it costs only $(m + 1)/2$ extension field squarings, four extension field multiplications, one extension field division and some cheaper arithmetic operations.
5.3.2.1 Binary field arithmetic
The total cost of the $\eta_T$ pairing ($m = 271$) in terms of basic arithmetic operations is given in Table 5.3. Additions are not taken into consideration because they are fast in $\mathbb{F}_{2^m}$ (being just XOR operations). The key to an efficient $\eta_T$ implementation lies in the performance of binary polynomial multiplication. Therefore, Micro-pairings used assembly language routines for all the basic arithmetic operations on binary polynomials.
The binary polynomial multiplication in $\mathbb{F}_{2^{271}}$ was implemented using the optimized hierarchical method described in Section 3.4.1.1. The optimizations for particular hardware platforms were performed according to Section 3.4.1.1 as well. Other polynomial arithmetic operations such as squaring, reduction and calculation of the square root were also implemented in assembly language on all three target platforms. Modular reduction and calculation of the square root were optimized specifically for the irreducible polynomial $f(z) = z^{271} + z^{201} + 1$. Reduction modulo $f(z)$ was implemented based on Algorithm 3, and squaring of binary polynomials was performed as described in Section 3.4.2. Calculation of the square root in $\mathbb{F}_{2^{271}}$ was implemented using the techniques from Section 3.4.3. Table 5.4 summarizes the performance of the main arithmetic routines in $\mathbb{F}_{2^{271}}$ on all three target processors. All values are in clock cycles of the given CPU, and the multiplication and squaring timings include the reduction operation.
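To make the role of the trinomial concrete, the following Python sketch shows reduction modulo $f(z) = z^{271} + z^{201} + 1$ and squaring by bit spreading. It is a bit-level illustration under the assumption that polynomials are stored as Python integers. It does not reproduce Algorithm 3 or the word-level assembly routines, and the function names are hypothetical.

```python
M, K = 271, 201                  # f(z) = z^271 + z^201 + 1
MASK = (1 << M) - 1

def reduce_trinomial(c):
    """Reduce a binary polynomial modulo f(z) using z^271 = z^201 + 1."""
    while c >> M:
        high = c >> M            # the part of degree >= 271, divided by z^271
        c = (c & MASK) ^ high ^ (high << K)
    return c

def square(a):
    """Square in F_{2^271}: move bit i to bit 2i, then reduce.

    Squaring is linear in characteristic 2, so no cross terms appear.
    """
    spread = 0
    i = 0
    while a:
        if a & 1:
            spread |= 1 << (2 * i)
        a >>= 1
        i += 1
    return reduce_trinomial(spread)

def clmul(a, b):
    """Carry-less product, used here only to cross-check square()."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc

x = pow(7, 90) & MASK            # arbitrary test element
print(square(x) == reduce_trinomial(clmul(x, x)))
```

Because $f(z)$ has only three terms, each reduction pass needs just two shifted XORs, which is what makes a field-specific routine much cheaper than a generic one.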
Table 5.4: Timings in clock cycles for modular arithmetic routines in $\mathbb{F}_{2^{271}}$
             ATmega128                MSP430                   PXA271
Operation    Mul      Sqr     Sqrt    Mul      Sqr     Sqrt    Mul      Sqr     Sqrt
Assembly     13557    1581    1730    10147    1363    1644    4926     499     546
C code       66271    4711    12021   40666    3667    11212   13183    2375    2496
Decrease     80%      66%     86%     75%      63%     85%     62%      79%     78%
All the figures in Table 5.4 were obtained in simulation environments: AVR Studio (ATmega128) and IAR Embedded Workbench (MSP430 and PXA271). At the same optimization level, different compilers (for example gcc and the IAR compiler) produced significantly different results. That is why the same compiler settings were used in all cases. The optimization flag -O0 was set during all simulations, so the results in Table 5.4 can be directly compared with other implementations no matter which compiler is used. Results achieved with the assembly language routines are compared with a C-only implementation to show the savings in execution time.
As can be seen in Table 5.4, the difference between the standard C functions and the specially optimized assembly routines is significant. Handcrafted code improved execution time on all tested hardware platforms: all operation timings decreased by between 62% and 86%. Square root computation was around four to seven times faster and, most significantly for the $\eta_T$ algorithm, polynomial multiplication improved by up to five times. Field-specific assembly code gives the maximum speed-up for the $\eta_T$ pairing algorithm. The timings in clock cycles for the $\eta_T$ pairing, together with memory occupation on all three processors, are presented in Table 5.5 and Table 5.6.
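The speed-up factors quoted above follow directly from the cycle counts in Table 5.4. As a small worked check, the Python sketch below recomputes the percentage decrease and the speed-up ratio for each routine. The cycle counts are copied from the table; the helper names are illustrative.

```python
# (assembly cycles, C cycles) per platform and operation, from Table 5.4.
cycles = {
    "ATmega128": {"Mul": (13557, 66271), "Sqr": (1581, 4711), "Sqrt": (1730, 12021)},
    "MSP430":    {"Mul": (10147, 40666), "Sqr": (1363, 3667), "Sqrt": (1644, 11212)},
    "PXA271":    {"Mul": (4926, 13183),  "Sqr": (499, 2375),  "Sqrt": (546, 2496)},
}

def decrease_percent(asm, c):
    """Relative saving of the assembly routine over the C routine."""
    return 100.0 * (c - asm) / c

def speedup(asm, c):
    """How many times faster the assembly routine is."""
    return c / asm

for platform, ops in cycles.items():
    for op, (asm, c) in ops.items():
        print(f"{platform:9s} {op:4s} "
              f"decrease {decrease_percent(asm, c):5.1f}%  "
              f"speed-up {speedup(asm, c):.2f}x")

sqrt_speedups = [speedup(*ops["Sqrt"]) for ops in cycles.values()]
mul_speedups = [speedup(*ops["Mul"]) for ops in cycles.values()]
```

The square root ratios span roughly 4.6 to 6.9 and the largest multiplication ratio is just under 4.9, which is where the "four to seven times" and "up to five times" figures come from.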
Table 5.5: Performance of the $\eta_T$ pairing on ATmega128 and MSP430
            ATmega128                          MSP430
            Cycles        ROM        Stack     Cycles        ROM        Stack
Assembly    19,660,993    47.41KB    3.17KB    14,097,304    23.66KB    4.17KB
C code      80,608,843    41.23KB    3.17KB    50,684,686    23.01KB    4.17KB
Decrease    76%           -15%       0%        72%           -3%        0%
Table 5.6: Performance of the $\eta_T$ pairing on the PXA271
            PXA271
            Cycles        ROM        Stack
Assembly    6,002,134     29.55KB    4.12KB
C code      16,974,044    37.24KB    4.12KB
Decrease    65%           20%        0%

With the introduction of specially optimized arithmetic routines, Micro-pairings calculates the $\eta_T$ pairing 65-76% quicker. In the best case (the ATmega128 CPU) the execution time drops from about 80.6 million to about 19.7 million clock cycles. Optimization of critical routines in assembly language thus leads to a large performance increase on embedded microcontrollers. On standard desktop computers, assembly language usually yields savings of only around 20-30%.
The results presented in Tables 5.5 and 5.6 are especially significant because in almost all cases the same level of memory usage was achieved. The memory requirements for the $\eta_T$ pairing on the three platforms tested are reasonable, considering the complexity of the operations. Stack usage remained the same in all implementations, as the assembly routines did not use any additional variables. RAM utilization may seem high, but the memory is reserved only for the duration of the pairing calculation. After that, all of it is released and can be reused for other purposes. The stack sizes presented in Tables 5.5 and 5.6 are peak values during program execution; average stack utilization was usually 60% of those values. The increase in memory overhead is considerable only on the ATmega128 platform, which in return shows the largest improvement in execution time. For the MSP430 processor, the 3% increase in ROM utilization is negligible, as it comes with a 72% improvement in execution time. On the PXA271 microcontroller the assembly routines resulted in a 20% decrease in program code size.