Low-Cost Elliptic Curve Digital Signature Coprocessor for Smart Cards

Guerric Meurice de Dormale∗, Renaud Ambroise∗, David Bol†, Jean-Jacques Quisquater, Jean-Didier Legat
{gmeurice,ambroise,bol,quisquater,legat}@dice.ucl.ac.be
UCL Crypto Group, Laboratoire de Microélectronique
Université Catholique de Louvain
Place du Levant, 3, B-1348 Louvain-La-Neuve, Belgium

Abstract
This paper proposes different low-cost coprocessors for public key authentication on 8-bit smart cards. Elliptic curve cryptography is used for its efficiency per key bit, and the Elliptic Curve Digital Signature Algorithm (ECDSA) is chosen. For this functionality, an area-constrained coprocessor is probably the best approach to perform the most computation-intensive operations at an acceptable speed, considering the limited memory and power of the selected platform. For that purpose, the scalar point multiplication in GF(2^m) was implemented in both affine and projective coordinates in order to compare their performances with the same level of optimization and the same technology. A hardware/software co-design strategy was also used to avoid the need for a dedicated register file. The performances of the arithmetic units of both coprocessors were adjusted in order to achieve the same throughput. They were then implemented on a 0.13 μm CMOS process for the extension degrees 163, 193 and 233. Resource usage and a 10% improvement in silicon area show that the affine version can be an attractive solution.
Keywords: digital signature, elliptic curve, low-cost, coprocessor, hardware/software co-design, affine coordinates.

1 Introduction
Public key authentication is a functionality needed by a huge number of resource-constrained embedded devices like wireless sensor nodes, smart cards and Radio Frequency IDentification (RFID) tags. In order to offer reasonable performance on area-constrained devices, it is necessary to speed up the computation of the digital signature.

∗ Supported by the Belgian fund for industrial and agricultural research.
† Supported by the Belgian national fund for scientific research (FNRS).

Application-specific Systems, Architectures and Processors (ASAP'06) 0-7695-2682-9/06 $20.00 © 2006

For that purpose, elliptic curve cryptography (ECC) is an attractive solution. Proposed independently by V. Miller and N. Koblitz in 1985, the public key scheme it provides is currently one of the most secure per key bit. This means that it requires less processing power, storage, bandwidth and power consumption. ECC has become the standard to protect U.S. government mission-critical information as well as sensitive but unclassified data. In the set of recommended algorithms known as “Suite B”, the Elliptic Curve Digital Signature Algorithm (ECDSA) was chosen for authentication.

This paper proposes different low-cost coprocessors for ECDSA in area-constrained systems like 8-bit smart cards. For such devices, the expected lifecycle makes the use of fully flexible architectures inappropriate. For instance, a single key size can be applied, and the use of a fixed irreducible polynomial [20] (like the trinomials and pentanomials recommended by standards such as [17, 15]) is a reasonable choice. Plenty of possibilities are still available by varying the base point and curve parameters. It is assumed that the devices are not self-powered. As only an acceptable speed for the signature is required, the silicon area of the hardware is the prime concern. For instance, the coprocessor uses the memory of the processor, following a hardware/software co-design strategy, to avoid the cost of a dedicated register file (cf. Section 7).

The most computation-intensive part of ECDSA is the scalar point multiplication over GF(2^m). The algorithm also requires other functionalities like a hash function and very few arithmetic operations in GF(p): one inversion and two multiplications. As a general purpose processor is available, it seems inefficient to add hardware support for this kind of arithmetic. Moreover, the costly inversion can be computed by the processor while it waits for the communications with the coprocessor.
The extra cost of dual-field units [19], working in both GF(p) and GF(2^m), therefore does not seem justified.

In order to provide an area-efficient coprocessor, the scalar point multiplication in GF(2^m) was implemented in both affine and projective coordinates with the same technology. In particular, the performances of the arithmetic units of both coprocessors were adjusted in order to achieve the same throughput. To the authors' knowledge, this is the first real comparison between area-constrained coprocessors using affine and projective coordinates. The scalar multiplication algorithm was carefully selected and optimized for each coordinate system in order to minimize the temporary registers, the set of different operations and the I/O transfers. The achieved results therefore allow a fair comparison between the two main kinds of coordinates.

This paper is structured as follows: Section 2 introduces the previous works and completes the explanation of our contribution. Section 3 gives a short mathematical background and presents the scalar multiplication algorithms. The description of the different arithmetic circuits stands in Section 4. Then, the architecture of both coprocessors is explained in Section 5. The flexibility of the affine coprocessor is discussed in Section 6 and the impact of a register file is addressed in Section 7. Finally, the achieved results are presented in Section 8 and conclusions are drawn in Section 9.

2 Previous Works and Contribution
There are several ways to achieve the computation of digital signatures: pure software, instruction-set extensions (ISE), a complete elliptic curve coprocessor and hardware/software co-design. Though the efficiency of ECC makes it suitable for embedded devices, pure software implementations of ECDSA [4] can lead to unacceptable computation times, especially when a human user has to wait for the result. This is mainly due to the lack of GF(2^m) instructions for the scalar multiplication. ISE can therefore be used to fill this gap [5], but the slow data transfer rate between an 8-bit GF(2^m) ALU and the CPU thwarts this optimization. A complete elliptic curve coprocessor could then be built to take care of the whole computation, without any interaction with the CPU [16]. Nevertheless, as only an acceptable speed is required, the extra area cost of such a “high-speed” solution and the use of one large or several parallel multipliers is not justified.

A natural choice is therefore a hardware/software co-design approach [8, 10, 11, 1]. For this solution, the extra area cost of a dedicated register file is also not justified. The load and store operations have to be done through the low-bandwidth bus of the software platform. In order to present a complete analysis, the cost of different kinds of register files is nevertheless given in Section 7.

One of the first problems while dealing with elliptic curves is the choice of coordinate system. Using the classical affine coordinates, a modular inversion is required for


each point addition or doubling. As inversion algorithms are commonly considered slow, projective coordinates are usually preferred (as in [8, 5, 11, 4]), trading the Inversion (I) for several Multiplications (M), Squarings (S) and Additions (A). However, more coordinate values must be stored, the control logic is more complex and an inversion is still needed to convert the final result back to affine coordinates. In the literature, the choice of coordinate system is usually based on the ratio I/M (as in [1]). Nevertheless, this criterion is not accurate when a low-bandwidth bus is used for load/store operations. The amount of such operations must be taken into account, depending on the selected scalar multiplication algorithm and the hardware capabilities.

Even if affine coordinates are faster, adding support for modular inversion has a cost. A good practice is to share the hardware between the multiplier and the inverter, as in [10, 1]. This paper proposes an alternative: adding a hardwired modular squarer and omitting support for modular multiplication. This is possible through the use of the Montgomery ladder [12] for the scalar multiplication. As a hardwired irreducible trinomial or pentanomial is used, a dedicated squarer can indeed be afforded (as shown in Table 3). The extra hardware cost of a digit-serial by parallel multiplier, as in [1], is therefore avoided. The possible loss of flexibility is addressed in Section 6.

While the Montgomery ladder algorithm has some built-in resistance against timing and other side-channel attacks [9], the architecture presented in this paper is not claimed to defeat those attacks. For that purpose, the underlying multiplication and division circuits should be thoroughly analyzed, but this is not the topic of this paper. As an added value, the Montgomery ladder in affine coordinates with the use of the additive point randomization technique is provided in Section 6.2.

3 Elliptic Curve Scalar Multiplication
In this section, an overview of the theoretical basis of ECC over GF(2^m) is first given. Then, the scalar multiplication algorithm in different coordinate systems is described.

3.1 Mathematical Background
Let p(z) ∈ GF(2)[z] be an irreducible polynomial of degree m generating the finite field GF(2^m). A non-supersingular elliptic curve E over GF(2^m) is defined as the set of points (x, y) satisfying the reduced Weierstraß equation

E : y^2 + xy = x^3 + ax^2 + b,

where a, b ∈ GF(2^m) and b ≠ 0, together with the point at infinity O. For a thorough description of the topic, the reader is referred to the literature [2].

3.2 Scalar Multiplication Algorithm

While using ECC, the main operation is the multiplication of a point P by a scalar k on a chosen curve: Q = kP. This scalar multiplication is basically performed by a binary algorithm, also called the double-and-add method [6]. Another popular choice is the Montgomery ladder algorithm [14], presented as Algorithm 1. It is based on the observation that the x-coordinate of the sum of two points whose difference is known can be computed in terms of the x-coordinates of the involved points. The computation of the y-coordinate is therefore avoided. It is possible to recover this coordinate at the end of the scalar multiplication. Nevertheless, only the x-coordinate is required for the signature generation of ECDSA. For a comprehensive analysis of the Montgomery ladder, the reader is referred to [9].

The GF(2^m) version of this algorithm [12] works as follows: consider the base point P = (x, y) and the two points P1 = (x1, y1) and P2 = (x2, y2). With P2 = P1 + P, the x-coordinate of P1 + P2, namely x3, is computed as follows in affine coordinates:

  x3 = x + (x1/(x1 + x2))^2 + x1/(x1 + x2)   if P1 ≠ P2
  x3 = (x1^4 + b) / x1^2                     if P1 = P2

In projective coordinates, where the x-coordinate of Pi is represented by Xi/Zi, the point doubling and the x3-coordinate of P1 + P2 become:

  Z3 = (X1·Z2 + X2·Z1)^2         if P1 ≠ P2
  Z3 = Zi^2 · Xi^2               if P1 = P2
  X3 = x·Z3 + (X1·Z2)·(X2·Z1)    if P1 ≠ P2
  X3 = Xi^4 + b·Zi^4             if P1 = P2

Algorithm 1 Montgomery ladder algorithm
Input: P ∈ E, k = (k_{l−1}, ..., k1, k0)_2
Output: x(k · P)
  P1 ← P, P2 ← 2P
  for i from l − 2 downto 0 do
    if ki = 1 then x(P1) ← x(P1 + P2), x(P2) ← x(2P2)
    else x(P2) ← x(P1 + P2), x(P1) ← x(2P1)
  return x(P1)

This algorithm is popular for implementing scalar multiplication in projective coordinates as it requires few arithmetic operations. The relevant amount of modular operations is 6mM + 5mS + (I + 2S + 1M), where M stands for a Multiplication, S for a Squaring and I for an Inversion. In this work, this algorithm is also used for the implementation in affine coordinates, as it does not need multiplications. It is not the fastest algorithm, but it is fast enough for the targeted application. The relevant amount of operations is 2mD + 3mS, where D stands for a modular Division.

4 Arithmetic Circuits

As the arithmetic circuits are the main body of the coprocessors, the chosen algorithms and their different implementations have to be presented. For those selections, an important parameter is the use of hardwired irreducible polynomials. The three polynomials p(z) applied in this work are those recommended by the NIST [15] and the SECG [17]: p(z) = z^163 + z^7 + z^6 + z^3 + 1, p(z) = z^193 + z^15 + 1 and p(z) = z^233 + z^74 + 1. The addition is a bitwise xor, and four kinds of modular arithmetic circuits have to be analyzed: the squarer, the multiplier, the divider, and another kind of cheap inverter for the final transformation when using projective coordinates.
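To make the two coordinate systems concrete, the x-only ladder steps of Section 3.2 can be cross-checked in software. The sketch below is our own illustration, not the paper's hardware: it runs the Montgomery ladder in both affine and projective coordinates over GF(2^193) with the standard trinomial; the curve constant b and the base x-coordinate are arbitrary nonzero values chosen for the example.

```python
# Cross-check of the affine and projective x-only ladder steps (our sketch).
# Field: GF(2^193) with p(z) = z^193 + z^15 + 1.
M = 193
P = (1 << 193) | (1 << 15) | 1

def gf_mul(a, b):
    # shift-and-add polynomial multiplication, reduced on the fly
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= P
    return r

def gf_inv(a):
    # Fermat: a^(2^m - 2) by square-and-multiply
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def ladder_affine(k, x, b):
    xs = gf_mul(x, x)
    x1, x2 = x, xs ^ gf_mul(b, gf_inv(xs))          # x(P), x(2P)
    for i in range(k.bit_length() - 2, -1, -1):
        t = gf_mul(x1, gf_inv(x1 ^ x2))             # x1/(x1+x2)
        xadd = x ^ gf_mul(t, t) ^ t                 # x(P1+P2)
        xd = gf_mul(x2, x2) if (k >> i) & 1 else gf_mul(x1, x1)
        xdbl = xd ^ gf_mul(b, gf_inv(xd))           # doubling: x^2 + b/x^2
        x1, x2 = (xadd, xdbl) if (k >> i) & 1 else (xdbl, xadd)
    return x1

def ladder_proj(k, x, b):
    X1, Z1 = x, 1
    xs = gf_mul(x, x)
    X2, Z2 = gf_mul(xs, xs) ^ b, xs                 # x(2P) = (x^4 + b)/x^2
    for i in range(k.bit_length() - 2, -1, -1):
        t1, t2 = gf_mul(X1, Z2), gf_mul(X2, Z1)
        Za = gf_mul(t1 ^ t2, t1 ^ t2)               # (X1*Z2 + X2*Z1)^2
        Xa = gf_mul(x, Za) ^ gf_mul(t1, t2)
        Xs, Zs = (X2, Z2) if (k >> i) & 1 else (X1, Z1)
        Xs2, Zs2 = gf_mul(Xs, Xs), gf_mul(Zs, Zs)
        Xd = gf_mul(Xs2, Xs2) ^ gf_mul(b, gf_mul(Zs2, Zs2))  # X^4 + b*Z^4
        Zd = gf_mul(Xs2, Zs2)                                # X^2 * Z^2
        X1, Z1, X2, Z2 = (Xa, Za, Xd, Zd) if (k >> i) & 1 else (Xd, Zd, Xa, Za)
    return gf_mul(X1, gf_inv(Z1))                   # final conversion to affine
```

Both ladders compute the same rational function of x (the projective formulas are the affine ones with denominators cleared), so their outputs must agree after the final projective-to-affine division.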

Application-specific Systems, Architectures and Processors (ASAP'06) 0-7695-2682-9/06 $20.00 © 2006

4.1 Squarer
As a hardwired irreducible r-nomial is used, a dedicated modular squarer can be afforded. The squaring itself is achieved by inserting a 0 bit between consecutive bits of the binary representation of the polynomial [6]. The result of this expansion is then reduced modulo p(z). As a result, this operator is cheap and can perform a modular squaring in one cycle. The number of xor gates required for the chosen p(z) is 237 for m = 163, 103 for m = 193 and 153 for m = 233; this amount is directly proportional to the number of coefficients of p(z).
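The expand-then-reduce squarer can be modeled in a few lines. This is an illustrative software model (our naming, not the paper's netlist), using the m = 193 trinomial:

```python
# Software model of the dedicated squarer: insert a 0 bit between
# consecutive bits (a(z) -> a(z^2) = a(z)^2 over GF(2)), then reduce by
# the hardwired trinomial p(z) = z^193 + z^15 + 1.
M = 193
P = (1 << 193) | (1 << 15) | 1

def spread(a):
    # expansion step: bit i of a moves to bit 2i
    r = 0
    for i in range(M):
        if (a >> i) & 1:
            r |= 1 << (2 * i)
    return r

def reduce_mod_p(a):
    # spread(a) has degree at most 2M-2; fold high bits down with p(z)
    for i in range(2 * M - 2, M - 1, -1):
        if (a >> i) & 1:
            a ^= P << (i - M)
    return a

def gf_square(a):
    return reduce_mod_p(spread(a))
```

In hardware the whole map collapses into a fixed xor network, which is why one squaring fits in a single cycle.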

4.2 Multiplier
For the modular multiplication, digit-serial by parallel architectures are best suited for systems requiring a moderate sample rate and where area and power consumption are critical [18]. In this paper, a Most Significant Digit (MSD)-first architecture was selected. It is based on [18] but differs slightly: the control signals are precomputed in order to avoid the final modular reduction. The main purpose of this precomputation is therefore not to achieve a high working frequency, as is usually the case. Let D be the digit size and d the total number of digits. The iterative multiplication algorithm is based on:

  C^(i) = ( C^(i−1) · z^D + B · A_{d−i} ) mod p(z)

Without the initialization, d = ⌈m/D⌉ cycles are necessary to compute a multiplication. The corresponding architecture is presented in Fig. 1, without the initialization. The first set of and gates multiplies the digit-serial operand A_{d−i} by the parallel operand B. The second set of and gates is used for the modular reduction. Finally, the xor gate tree sums up those products and accumulates the shifted partial product C^(i−1).

Figure 1. Modular multiplier architecture
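A behavioral software model of the MSD-first recurrence can make the datapath easier to follow. The sketch below is ours, with D = 4 (the digit size selected later in the paper); it consumes one D-bit digit of A per "cycle":

```python
# Behavioral model of the MSD-first digit-serial multiplier (our sketch):
#   C_i = (C_{i-1} * z^D + B * A_{d-i}) mod p(z)
M, D = 193, 4
P = (1 << 193) | (1 << 15) | 1

def reduce_mod_p(a):
    # fold any bits of degree >= M down using p(z)
    while a.bit_length() > M:
        a ^= P << (a.bit_length() - 1 - M)
    return a

def clmul_digit(digit, b):
    # carry-less product of a D-bit digit by the parallel operand B
    r = 0
    for j in range(D):
        if (digit >> j) & 1:
            r ^= b << j
    return r

def digit_serial_mul(a, b):
    d = -(-M // D)                  # number of digits, ceil(m/D)
    c = 0
    for i in range(1, d + 1):       # most significant digit first
        digit = (a >> (D * (d - i))) & ((1 << D) - 1)   # A_{d-i}
        c = reduce_mod_p((c << D) ^ clmul_digit(digit, b))
    return c
```

In the actual circuit the reduction is folded into precomputed control signals rather than performed as a separate step, but the algebra per cycle is the same.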

4.3 Divider
An efficient way to perform a modular division is to use gcd-based algorithms, especially binary versions of Stein's algorithm. The costly full-length comparison is then avoided thanks to the counter δ idea of Brent and Kung. More details can be found in [13]. The chosen serial modular division algorithm is presented as Algorithm 2, where ⊕ stands for the bitwise xor operation. The operations between brackets are performed in parallel. Without the initialization, 2m − 1 cycles are required to compute a division. This algorithm is less complex than the one used in [10, 1].

Algorithm 2 Serial modular division
Input: X(z), Y(z) ∈ GF(2^m), p(z)
Output: X(z)/Y(z) mod p(z)
  U ← Y, V ← p, R ← X, S ← 0, k ← 2m − 2, δ ← −1
  while k ≥ 0 do
    k ← k − 1
    if u0 = 0 then
      δ ← δ − 1, U ← U/z, R ← R/z mod p
    else if δ ≥ 0 then
      δ ← δ − 1, U ← U/z ⊕ V/z, R ← R/z ⊕ S/z mod p
    else
      δ ← −δ − 1, {U ← U/z ⊕ V/z, V ← U}, {R ← R/z ⊕ S/z mod p, S ← R}
  return S

The corresponding architecture is presented in Fig. 2, without the initialization. It is able to perform on U–V and R–S the set of three different operations {Shift, Shift-Add, Shift-Add & Swap}.

Figure 2. Modular division architecture
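Algorithm 2 can be modeled directly in software. The following is our sketch for GF(2^193) with the standard trinomial; a small shift-and-add multiplier is included only to check the quotient:

```python
# Software model of Algorithm 2 (our sketch): Stein-style binary division
# with the Brent-Kung counter delta, for GF(2^193), p(z) = z^193 + z^15 + 1.
M = 193
P = (1 << 193) | (1 << 15) | 1

def halve(r):
    # r/z mod p: shift right, adding p first when the constant term is set
    return r >> 1 if r & 1 == 0 else (r ^ P) >> 1

def gf_div(x, y):
    # computes x/y mod p in 2m-1 iterations (y must be nonzero)
    u, v, r, s = y, P, x, 0
    k, delta = 2 * M - 2, -1
    while k >= 0:
        k -= 1
        if u & 1 == 0:
            delta -= 1
            u >>= 1
            r = halve(r)
        elif delta >= 0:
            delta -= 1
            u = (u ^ v) >> 1
            r = halve(r ^ s)
        else:
            delta = -delta - 1
            u, v = (u ^ v) >> 1, u        # shift-add and swap, in parallel
            r, s = halve(r ^ s), r
    return s

def gf_mul(a, b):
    # reference multiplier, used only to verify the quotient below
    res = 0
    while b:
        if b & 1:
            res ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= P
    return res
```

Note that the loop runs a fixed 2m − 1 times regardless of the operands, which matches the constant cycle count stated above.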

4.4 Inverter
While using projective coordinates, the result of the scalar multiplication has to be converted into affine coordinates. An inversion is therefore required. As this operation is not critical, available functionalities of the coprocessor have to be used. The best approach is to compute the inverse by Fermat's little theorem: β^(−1) = β^(2^m − 2), with β ≠ 0 ∈ GF(2^m). As a dedicated squaring circuit is available, the multiplication-chain technique of Itoh and Tsujii [7] can be employed. Their algorithm can be rewritten as Algorithm 3.


Algorithm 3 Rewriting of the Itoh–Tsujii inversion algorithm
Input: a ∈ GF(2^m), m = (m_{l−1}, ..., m1, m0)_2
Output: a^(−1) mod p(z)
  b ← a^2 · a, s ← 1
  for i from l − 2 downto 1 do
    if mi = 0 then s ← 2s
    else s ← 2s + 1, b ← b^2 · a
    b ← b^(2^s) · b
  return b^2

The number of multiplications required to compute this expression is ⌊log2(m − 1)⌋ + W(m − 1) − 1, where W(·) stands for the Hamming weight function.
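Algorithm 3 and its multiplication count can be checked in software. The sketch below is our rendering (helper names are ours), again for m = 193 with the standard trinomial:

```python
# Our software rendering of Algorithm 3: Itoh-Tsujii inversion driven by
# the bits of m, counting the multiplications to compare with the formula.
M = 193
P = (1 << 193) | (1 << 15) | 1

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= P
    return r

def gf_square(a):
    return gf_mul(a, a)

def itoh_tsujii_inv(a):
    bits = bin(M)[2:]                 # m = (m_{l-1}, ..., m_0)_2
    b = gf_mul(gf_square(a), a)       # b <- a^2 * a
    s, muls = 1, 1
    for bit in bits[1:-1]:            # i from l-2 downto 1
        if bit == '0':
            s = 2 * s
        else:
            s = 2 * s + 1
            b = gf_mul(gf_square(b), a)
            muls += 1
        t = b
        for _ in range(s):            # b^(2^s): s uses of the cheap squarer
            t = gf_square(t)
        b = gf_mul(t, b)              # b <- b^(2^s) * b
        muls += 1
    return gf_square(b), muls         # a^(-1) = b^2
```

The chain exploits exactly the asymmetry of this coprocessor: squarings are nearly free in hardware, so only the multiplication count matters.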

5 Coprocessors
In this section, the architecture of both the affine and the projective coprocessor is presented. In addition to their main arithmetic circuits, the coprocessors have to embed some other functionalities: addition, load and store, an I/O buffer, and instruction interpretation and processing. For instance, the load/store operations were implemented to minimize the number of I/O transfers. Moreover, the scalar multiplication algorithm was slightly modified to avoid the need for an extra temporary register (as in [10]).

In order to have an efficient coprocessor, the data I/O and control overheads have to be reduced as much as possible. The data sent by the coprocessor do not have to be read by the processor for the control; Direct Memory Access (DMA) controlled operation with a status bit can therefore be used. The behavior of the coprocessor is fully deterministic, so the latency of the initialization of a memory burst transfer can be partially compensated. For a new operation, the first byte embeds the op-code and small data; subsequent data transfers do not need this extra byte. Currently, only high-level instructions are implemented (by a finite state machine) in order to reduce the communication overheads: Init, Add & Double, Inv1 and Inv2(s) (cf. Tables 6 and 7). The Mult instruction is also required for the conversion to affine coordinates. Nevertheless, other low-level instructions like Add, Square, Div, Mult & Accumulate, ... could easily be added.

The datapaths of both coprocessors are presented in Fig. 3 and Fig. 4. Db stands for the bus size of the processor. Multiplexers with shift by Db are used for load/store operations. For the projective coprocessor (ProC), shift by D is selected in multiplication mode. C without any shift can also be selected in order to allow addition by B. The transfer from

C to the B register reuses logic available for the load operation, thanks to a small Db-bit multiplexer.

Figure 3. Projective coprocessor

For the affine coprocessor (AffC), the shifted input of register R and input R of register S are selected for division. The unshifted input of R is used for the addition with S. The reset of register R uses this addition (R ← R ⊕ S(= R)). The transfer from S to the U register also reuses the load logic. An extra Db-bit register was added to allow loading the numerator from S into the R register (R ← S ⊕ R(= 0)) while finishing the transfer of the denominator from S to the U register.

Figure 4. Affine coprocessor

To compare the areas of AffC and ProC, the digit size D was adjusted to achieve the same throughput. This was done by inspecting the cycles needed to perform the most important operation: Add & Double (repeated m − 1 times). Let ls' = m/Db be the cycles required by an internal data transfer and ls the cycles needed by an m-bit DMA transfer. Assume an optimistic value of ls = ls' + 2 for the overheads. Let Mul = m/D and Div = 2m − 1 be the cycles for a multiplication and a division. Then, from Tables 6 and 7, the cycle counts for Add & Double are:

  Affine:     7 ls + 18 + 2 Div + ls'
  Projective: 23 ls + 15 + 6 Mul + 2 ls'

For m = 163, ls = 23, Div = 325 and Mul = 55 (D = 3) or 41 (D = 4), this operation requires 850 cycles for AffC and 920 (D = 3) or 836 (D = 4) cycles for ProC. D = 4 was therefore selected. For those parameters, the bus is busy 63% (ProC) and 19% (AffC) of the time. The time spent in I/O transfers therefore plays an important role.

6 Flexibility of the affine coprocessor
For the sake of optimization, AffC was built without a multiplier. This can be seen as a lack of flexibility, since a multiplication could occur in some contexts. Fortunately, the divider can be used to compute this operation. The process is slow but not critical, so this trick is sufficient. Moreover, the additive point randomization technique is developed in affine coordinates in order to show that multiplications are not required in this context either.

6.1 Multiplying with the divider
A way to multiply with a divider is to perform an inversion followed by a division: a · b = b / a^(−1). Using the proposed architecture, this can be achieved by the operations of Table 1. To compute an inversion with the divider, the numerator has to be set to 1. To alleviate the task of sending a 1 to the coprocessor, the SetLSB() operation was added.

Table 1. a · b with the affine architecture
  S = R
  R = R + S, SetLSB(R)
  Load(S ← a)
  Load(U ← S)
  DIVISION
  ...
  Load(U ← S)
  S = R
  R = R + S
  Load(S ← b)
  DIVISION
  Write(RAM ← S)
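The trick can be checked at the field level. The helpers below are ours (the paper realizes this with the divider's register operations, not these functions); the point is only the identity a · b = b / a^(−1):

```python
# Field-level check of multiplying with a divider: a*b = b / a^(-1).
# Helper functions are our own; GF(2^193), p(z) = z^193 + z^15 + 1.
M = 193
P = (1 << 193) | (1 << 15) | 1

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= P
    return r

def gf_inv(a):
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def gf_div(x, y):
    return gf_mul(x, gf_inv(y))

def mul_via_divider(a, b):
    # first pass: invert a (numerator forced to 1); second pass: b / a^(-1)
    return gf_div(b, gf_inv(a))
```

Two passes through the divider thus replace a dedicated multiplier, at roughly twice the division latency.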

6.2 Additive point randomization
It could be interesting to randomize the base point P of the scalar multiplication algorithm in order to defeat some kinds of side-channel attacks (see [3] for references). For that purpose, the additive point randomization method based on curve isomorphisms could be used [3]. Concretely, the base point is masked by two random numbers ε and σ, with (x, y) → (x + ε, y + σ). Combining this approach with the method of López and Dahab [12], the scalar multiplication formulas of Section 3.2 become:

  x3 = x + (x1/(x1 + x2))^2 + x1/(x1 + x2)      if P1 ≠ P2
  x3 = ε + (x1 + ε)^2 + (σ + b)/(x1 + ε)^2      if P1 = P2

As only the x-coordinate is used, σ can be set to zero. As a result, a multiplier is still not needed.
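With σ = 0, the masked doubling formula should return the true doubling offset by ε. This can be checked in software; the helpers below are our own illustration, not part of the paper's design:

```python
# Check of the sigma = 0 masked-doubling formula of Section 6.2: with the
# x-coordinate masked as x + eps, the formula yields x(2P) + eps.
M = 193
P = (1 << 193) | (1 << 15) | 1

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= P
    return r

def gf_inv(a):
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def double_x(x1, b):
    # plain doubling: x1^2 + b / x1^2
    t = gf_mul(x1, x1)
    return t ^ gf_mul(b, gf_inv(t))

def masked_double_x(xm, eps, b):
    # masked formula with sigma = 0: eps + (xm + eps)^2 + b/(xm + eps)^2
    t = gf_mul(xm ^ eps, xm ^ eps)
    return eps ^ t ^ gf_mul(b, gf_inv(t))
```

The masked ladder therefore never leaves the additive mask, and still uses only squarings, additions and divisions.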

7 Register File
A register file could be used in order to implement a stand-alone processor (as in [8, 10]). The required form factor for the RAM is usually not available, and regular gates therefore have to be used. This extra functionality is not needed for the targeted application, but it is nevertheless described in order to provide a complete analysis.

The register file can be built with enabled regular registers and output multiplexers (as in [20]). However, this synchronous RAM consumes a lot of silicon area. Another method is to use latches or looped inverters for the memory cells and tri-state gates for write access; the output is also selected by multiplexers. Post-synthesis results show that this method can save about 70% of silicon area compared with flip-flops and 15% compared with latches.

The use of a register file is not advantageous for projective coordinates. Both kinds of coordinates require a cheap ROM with 2 memory locations, but projective coordinates need a RAM with 4 memory locations instead of 2 for affine ones. At the expense of flexibility, a ROM alone could be used to provide the constant data. Nevertheless, it only allows a reduction of 5 to 10% of the I/O transfers, and the cost of the serial load interface is clearly not negligible.

Table 2. Register file areas [μm²]
Circuit \ m              163     193     233
RAMx2                  11154   13183   15903
RAMx4                  23645   27565   33186
ROMx2                    753     862    1008
AffC (RAMx2 + ROMx2)   13963   16496   19846
ProC (RAMx4 + ROMx2)   26499   30920   37170

Table 3. Arithmetic circuit areas [μm²]
Circuit \ m      163     193     233
Squarer         2866    1161    1850
Mul (D = 3)    32871   38760   46678
Mul (D = 4)    36143   42610   51369
Div            38598   45344   54375

8 Results
In order to compare the area of both coprocessors, they were implemented using a 0.13 μm bulk technology with 9 metal layers from UMC. Only a standard cell library was employed. Synopsys was used for the logical synthesis, Modelsim for the power consumption and Silicon Ensemble for the place-and-route step (PAR). First, the areas of the arithmetic circuits are reported in Table 3. These results were obtained by extrapolation of the synthesis results, with an error of less than 2%. Then, the area of both coprocessors after the PAR step and the power consumption of

Table 4. Coprocessor area [μm²] \ power [μW] at 10 MHz
Copro. \ m          163           193           233
ProC (D = 3)   66564 \ 380   72781 \ 428   84309 \ 578
ProC (D = 4)   72280 \ 461   81296 \ 549   93660 \ 715
AffC           66522 \ 461   73984 \ 518   86463 \ 613


Table 5. Scalar multiplication timings [cycles]
Copro. \ m            163       193       233
ProC (64% I/O)    135,987   189,291   271,983
AffC (19% I/O)    138,122   193,652   279,462

the main operation (Mul or Div) are shown in Table 4. Finally, synthesis results of the register files, without power ring, are provided in Table 2. The cycle counts for a full scalar multiplication, with the I/O transfer ratios, are given in Table 5. A typical operating frequency of 10 MHz was chosen.

As stated in Section 4, the area of the dedicated squarer is small. As expected, even with the optimizations described in Section 7, the cost of the register file is quite high. The most interesting result is a 10% difference between the silicon areas of the projective and the affine coprocessor. The reported power consumptions include 5% of leakage and are for a 10 MHz clock. Those numbers are quite high, but it was not the goal of this work to add power-reduction techniques (like clock gating). In order to compare both coprocessors accurately, the consumption of the processor, and especially of the bus, should be taken into account.

The most relevant work for a comparison of area and timings is [1]. Their coprocessor computes the scalar multiplication in the field GF(2^191) and is implemented with a 0.13 μm technology. Their circuit occupies an area of 159,434 μm² and requires 341,430 cycles to compute the scalar multiplication (without the GF(p) inversion). Compared with our GF(2^193) affine coprocessor, their architecture needs 40% more computation time. The most important result is that our coprocessor consumes less than half the silicon area. Of course, this is partially due to the availability of a GF(p) inverter in their design. Nevertheless, as explained in the introduction, the processor can handle this operation. Moreover, our approach uses a more efficient GF(2^m) inversion, takes care of the number of I/O transfers and employs a more efficient scalar multiplication algorithm.

9 Conclusion
Different coprocessors for the scalar multiplication on low-cost smart cards were presented. In particular, the performances of affine and projective coordinates were compared. Implementation results show that, at the same throughput, the affine coprocessor requires 10% less silicon area. Other advantages of affine coordinates are lower smart card memory requirements, a smaller complexity of the scalar multiplication algorithm and a much smaller utilization of the bus. Using a hardware/software co-design approach, the number of I/O transfers is definitely of prime importance for the timings of the coprocessor, but also for the utilization of the processor. Indeed, when the processor is idle, it is available for other purposes like the GF(p) inverse. As an added value, the Montgomery ladder in affine coordinates with the use of the additive point randomization technique was provided. Moreover, the analysis concerning the register file can be interesting for related problems like RFID and could be useful for work like [20].

Table 6. Projective coprocessor code
– INIT
  Load(X1 ← x), Load(Z1 ← 1)
  Load(B ← x), Reset(C)
  C = C + B
  C = C²
  Store(Z2 ← C), Load(B ← C)
  C = C + B
  C = C²
  Load(B ← b)
  C = C + B
  Store(X2 ← C)
– ADD
  Load(B ← Z12), Reset(C)
  Load(B ← X21, A ← B)
  MULT
  Store(Z12 ← C)
  Load(B ← X12)
  Load(B ← Z21, A ← B)
  MULT
  Store(X12 ← C)
  C = C + B
  Load(B ← Z12)
  C = C + B
  C = C²
  Store(Z12 ← C)
  Load(B ← X12, A ← B)
  MULT
  Store(X12 ← C)
  Load(B ← Z12)
  Load(B ← x, A ← B)
  MULT
  Load(B ← X12)
  C = C + B
  Store(X12 ← C)
– INV1, b² · a
  C = C²
  Load(B ← Z1)
  Load(B ← C, A ← B)
  MULT
– INV2(s), b^(2^s) · b
  Load(B ← C)
  C = C + B
  C = C² (repeat s times)
  Load(B ← C, A ← B)
  MULT
– DOUBLE
  Load(B ← Z), Reset(C)
  C = C + B
  C = C²
  Store(Z ← C)
  Load(B ← X)
  C = C + B
  C = C²
  Load(B ← C)
  C = C + B
  C = C²
  Store(X ← C)
  Load(B ← Z, A ← B)
  MULT
  Store(Z ← C)
  C = C + B
  C = C²
  Load(B ← C)
  Load(B ← b, A ← B)
  MULT
  Load(B ← X)
  C = C + B
  Store(X ← C)

Table 7. Affine coprocessor code
– INIT
  Load(x1 ← x)
  S = R
  R = R + S
  Load(S ← x)
  S = S²
  R = R + S
  Load(S ← √b)
  R = R + S
  R = R + S, S = R
  S = S²
  Load(U ← S)
  DIVISION
  Store(x2 ← S)
– ADD
  S = R
  R = R + S
  Load(S ← x1)
  R = R + S
  Load(S ← x2, U ← S)
  R = R + S
  DIVISION
  R = R + S, S = R
  R = R + S
  S = R
  S = S²
  R = R + S
  Load(S ← x)
  R = R + S
  S = R
  R = R + S
  Store(x12 ← S)
– DOUBLE
  Load(S ← x21)
  S = S²
  R = R + S
  Load(S ← √b)
  R = R + S
  R = R + S, S = R
  S = S²
  Load(U ← S)
  DIVISION
  Store(x21 ← S)

References
[1] H. Aigner, H. Bock, M. Hütter, J. Wolkerstorfer, A Low-Cost ECC Coprocessor for Smartcards, CHES'04, LNCS 3156, pp. 107–118, 2004.
[2] I.F. Blake, G. Seroussi, N.P. Smart, Elliptic Curves in Cryptography, London Mathematical Society LNS 265, Cambridge University Press, 1999.
[3] M. Ciet, M. Joye, (Virtually) Free Randomization Techniques for Elliptic Curve Cryptography, ICICS'03, pp. 348–359, 2003.
[4] V. Gupta, M. Millard et al., Sizzle: A Standards-Based End-to-End Security Architecture for the Embedded Internet, PerCom'05, pp. 247–256, 2005.
[5] J. Großschädl, E. Savaş, Instruction Set Extensions for Fast Arithmetic in Finite Fields GF(p) and GF(2^m), CHES'04, LNCS 3156, 2004.
[6] D. Hankerson, A. Menezes, S. Vanstone, Guide to Elliptic Curve Cryptography, Springer Professional Computing, Springer, 2004.
[7] T. Itoh, S. Tsujii, A Fast Algorithm for Computing Multiplicative Inverses in GF(2^m) Using Normal Bases, Information and Computation, vol. 78, pp. 171–177, 1988.
[8] S. Janssens et al., Hardware/Software Co-design of an Elliptic Curve Public-Key Cryptosystem, IEEE Signal Processing Systems, pp. 209–216, 2001.
[9] M. Joye, S.-M. Yen, The Montgomery Powering Ladder, CHES'02, LNCS 2523, pp. 291–302, 2002.
[10] J.-H. Kim, D.-H. Lee, A Compact Finite Field Processor over GF(2^m) for Elliptic Curve Cryptography, ISCAS'02, vol. 2, pp. 340–343, 2002.
[11] S. Kumar, C. Paar, Reconfigurable Instruction Set Extension for Enabling ECC on an 8-Bit Processor, FPL'04, LNCS 3203, pp. 586–595, 2004.
[12] J. López, R. Dahab, Fast Multiplication on Elliptic Curves over GF(2^m) without Precomputation, CHES'99, LNCS 1717, pp. 316–327, 1999.
[13] G. Meurice de Dormale, J.-J. Quisquater, Iterative Modular Division over GF(2^m): Novel Algorithm and Implementations on FPGA, ARC'06, LNCS 3985, pp. 370–382, 2006.
[14] P.L. Montgomery, Speeding the Pollard and Elliptic Curve Methods of Factorization, Mathematics of Computation, vol. 48, pp. 243–264, 1987.
[15] U.S. Department of Commerce / National Institute of Standards and Technology (NIST), Digital Signature Standard, FIPS PUB 186-2 (with Change Notice 1), 2000.
[16] R. Schroeppel, C.L. Beaver et al., A Low-Power Design for an Elliptic Curve Digital Signature Chip, CHES'02, LNCS 2523, pp. 366–380, 2002.
[17] Certicom Research, SEC 2: Recommended Elliptic Curve Domain Parameters, v1.0, 2000.
[18] L. Song, K.K. Parhi, Low-Energy Digit-Serial/Parallel Finite Field Multipliers, Journal of VLSI Signal Processing, Kluwer, vol. 19, pp. 149–166, 1998.
[19] J. Wolkerstorfer, Dual-Field Arithmetic Unit for GF(p) and GF(2^m), CHES'02, LNCS 2523, pp. 500–514, 2002.
[20] J. Wolkerstorfer, Scaling ECC Hardware to a Minimum, ECRYPT workshop CRASH'05, 2005.