
Radix-4 Modular Multiplication and Exponentiation Algorithms for the RSA Public-Key Cryptosystem

Jin-Hua Hong and Cheng-Wen Wu
Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan 30013, R.O.C.

Abstract

We propose a radix-4 modular multiplication algorithm based on Montgomery's algorithm, and a radix-4 cellular-array modular multiplier based on Booth's multiplication algorithm. The radix-4 modular multiplier can be used to implement a fast RSA cryptosystem. Owing to the reduced number of iterations and to pipelining, our modular multiplier is four times faster than the cellular-array modular multiplier based on the original Montgomery's algorithm. The time to calculate a modular exponentiation is about $n^2$ clock cycles, where $n$ is the word length, and the clock cycle is roughly equal to the delay time of a full adder. The utilization of the multiplier is 100% when consecutive exponentiations are interleaved. Locality, regularity, and modularity make the proposed architecture suitable for VLSI implementation.

Keywords: cellular array, Montgomery algorithm, modular multiplication, high radix modular multiplier, public-key cryptography, RSA.

1 Introduction

With the increasing popularity of electronic communication, data security is becoming more and more important. In 1976, Diffie and Hellman invented a new concept of public-key cryptography [1]. Later, in 1978, Rivest, Shamir, and Adleman proposed the RSA public-key cryptosystem which is relatively secure and easy to implement [2]. In the RSA cryptosystem, both the encryption and decryption are modular exponentiation, which can be done by a sequence of modular multiplications. Therefore, fast modular multiplication becomes the key to real-time encryption and decryption in such a scheme.

Various algorithms for modular multiplication have been proposed in the past, and some of them have been realized [3–6]. One of the most attractive modular multiplication algorithms was proposed by Montgomery [7]. Montgomery's algorithm needs $n$ iterations in each modular multiplication and two additions per iteration, where $n$ is the word length. Cellular arrays based on Montgomery's algorithm can be found in [8–10].

A modified Montgomery's algorithm was first reported in [11], where the multiplication and modular reduction steps in Montgomery's algorithm are separated such that only one addition is required in each iteration. However, the number of iterations in the modified algorithm is two times that of Montgomery's, hence the overall computation time is not reduced. In [12, 13], the algorithm was further modified to reduce the number of iterations, doubling the speed of modular multiplication.

In this paper, we reduce the number of partial products by using radix-4 Booth's algorithm, and propose a radix-4 modular exponentiation algorithm. The proposed algorithm can be implemented using a linear cellular array. For a word length of $n$, the array has about $n$ cells, and each cell contains two full adders and some logic gates. The two full adders in the cell are pipelined. The number of iterations in the multiplication process is $\lceil (n+3)/2 \rceil$ as compared with $2n$ in the original multiplier. Therefore, the speed of the proposed modular multiplier is quadrupled. The time to calculate a modular exponentiation is $n^2 \tau$, where $\tau$ is the delay time of a full adder.

To improve the utilization for radix-$r$ modular multiplication without interleaving, we can merge $\log_2 r$ cells into one processing element (PE), resulting in a digit-level array.

2 Modular Multiplication Algorithm

In RSA, to encrypt a message using the encryption key $(E, N)$, we first partition the message (a string of bits) into a sequence of blocks and consider each block $M$ as an integer between 0 and $N-1$. Then, we encrypt the message by raising each block to the $E$th power modulo $N$, i.e., $C = M^E \pmod{N}$, for each message block $M$. Similarly, to decrypt the ciphertext using the decryption key $(D, N)$, we raise each ciphertext block to the power $D$ modulo $N$, i.e., $M = C^D \pmod{N}$, for each ciphertext block $C$.
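As a small illustration of the block-wise encryption and decryption rule above, the following Python sketch runs a round trip with a deliberately tiny key. The primes 61 and 53, the exponents 17 and 2753, and the helper names `encrypt_blocks` and `decrypt_blocks` are assumptions chosen for this example only; they do not come from the paper, and Python's built-in `pow` stands in for the hardware modular exponentiation.

```python
# Toy RSA round trip: C = M^E mod N, then M = C^D mod N.
# The key below is an illustrative assumption and far too small for real use.
P, Q = 61, 53
N = P * Q                      # modulus
E = 17                         # encryption exponent
D = 2753                       # decryption exponent, E*D = 1 mod (P-1)*(Q-1)


def encrypt_blocks(blocks, e, n):
    """Raise each message block (an integer in [0, n-1]) to the e-th power mod n."""
    return [pow(m, e, n) for m in blocks]


def decrypt_blocks(blocks, d, n):
    """Raise each ciphertext block to the d-th power mod n."""
    return [pow(c, d, n) for c in blocks]


message = [65, 1234, 0, 3232]              # every block lies in [0, N-1]
cipher = encrypt_blocks(message, E, N)
recovered = decrypt_blocks(cipher, D, N)
print(cipher, recovered)
```

Each block is processed independently, which is the property the later sections rely on when interleaving the exponentiations of successive blocks.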

Exponentiation is performed by repeated (iterated) squaring and multiplication operations. Let the binary representation of the exponent $E$ be $e_{n-1} e_{n-2} \cdots e_1 e_0$; then $M^E = M^{2^{n-1} e_{n-1}} \cdots M^{2^2 e_2} M^{2 e_1} M^{e_0}$. A simple way to perform modular exponentiation is to repeat the modular squaring ($M_i^2$) and modular multiplication ($M_i P_i$) operations from the least-significant bit (LSB) of $E$. This is called the L-algorithm. In the L-algorithm, $n$ iterations are needed and each iteration needs two modular multiplications. However, the modular multiplications $M_i^2$ and $M_i P_i$ can be done in parallel. To reduce the time complexity, we have developed a radix-4 modular multiplication algorithm and designed a radix-4 modular multiplier based on Montgomery's modular arithmetic.

2.1 Review of Montgomery's Algorithm

Suppose $N = (n_{n-1}, \ldots, n_1, n_0)$ is an $n$-bit odd integer. Let $A = (a_{n-1}, \ldots, a_1, a_0)$ and $B = (b_{n-1}, \ldots, b_1, b_0)$ be two $n$-bit integers, where $0 \le A, B < N$. Montgomery's modular multiplication is shown below, which generates a series of numbers $S[0], S[1], \ldots, S[n]$ as outputs.

MG(A, B, N)
{
    S[0] = 0;
    for (i = 0; i < n; i++) {
        q_i = (S[i] + a_i B) (mod 2);
        S[i+1] = (S[i] + a_i B + q_i N) / 2;
    }
    return S[n];
}

In each iteration of the above procedure we need to accumulate three $n$-bit integers and divide the result by 2. This can be done by using two $n$-bit adders and right-shifting the adder output by one bit. Therefore, each cell is composed of two full adders. Furthermore, by induction, the value of $S[n]$ falls in the range $[0, 2N)$ instead of the initial range $[0, N)$. Therefore, post adjustment is required before the next modular multiplication is performed.

2.2 Radix-4 Modular Multiplication

We present the radix-4 Montgomery algorithm. By using radix-4 numbers for multiplication and modular reduction, the number of iterations can be reduced by half. The number is $n$ in Montgomery's algorithm, which is equivalent to the number of partial products. To reduce the number of iterations (and the number of partial products to be accumulated) we propose a radix-4 Montgomery algorithm which requires only $\lceil (n+3)/2 \rceil$ iterations. The radix-4 Montgomery algorithm can be implemented by Booth's multiplier.

Let $A$ be an $(n+3)$-bit and $B$ be an $(n+1)$-bit 2's-complement number and $N = (n_{n-1}, \ldots, n_1, n_0)$ be an $n$-bit odd integer, where $-N \le A, B < N$. Also, let $PP_i = (pp_{i(n+1)}, \ldots, pp_{i1}, pp_{i0})$ represent the radix-4 Booth partial product of iteration $i$. Since the radix-4 Booth recoding is considered, we have $PP_{\lceil (n+1)/2 \rceil} = 0$ and $-2N \le PP_i < 2N$, where $0 \le i < \lceil (n+1)/2 \rceil$. Suppose $T_i = S[i] + PP_i$; then the two LSBs of $T_i$ (i.e., $t_{i1}$ and $t_{i0}$) can be used to determine the modular reduction value (i.e., $N_i$, where $N_i = 2N$, $\pm N$, or 0). The proposed radix-4 Montgomery algorithm is shown below.

R4MG(A, B, N)
{
    S[0] = 0;
    for (i = 0; i < ⌈(n+3)/2⌉; i++) {
        (t_i1, t_i0) = (S[i] + PP_i) (mod 4);
        if (t_i0 = 0) {
            if (t_i1 = 0) S[i+1] = (S[i] + PP_i) / 4;
            else S[i+1] = (S[i] + PP_i + 2N) / 4;
        }
        else {
            if (t_i1 = n_1) S[i+1] = (S[i] + PP_i - N) / 4;
            else S[i+1] = (S[i] + PP_i + N) / 4;
        }
    }
    return S[⌈(n+3)/2⌉];
}

Procedure R4MG() has about $n/2$ iterations, while previous algorithms require $n$ or $2n$ iterations [7, 11]. In each iteration, two additions are required, which is the same as in MG(). Therefore, it is faster than Procedure MG(). By induction, the value of $S[\lceil (n+3)/2 \rceil]$ falls in the range $[-N, N)$, which is the same as the initial range. Therefore, no post adjustment is required during the entire exponentiation process.
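Procedures MG() and R4MG() can be cross-checked in software. The Python sketch below is a behavioural model only, not the cellular-array design: it uses Python's unbounded integers in place of the adders, and the names `mg`, `booth_digits`, `r4mg` and `r4_modexp` as well as the toy modulus 239 are assumptions made for this illustration. The last helper chains `r4mg` calls in the L-algorithm order (square and multiply from the LSB of the exponent) to show that the signed result can be fed back without adjustment; the conversion constant $R^2 \bmod N$ with $R = 4^k$ is the usual Montgomery device and is assumed here.

```python
def mg(a, b, n_mod, n_bits):
    """Procedure MG(): returns a*b*2^(-n_bits) mod N as a value in [0, 2N)."""
    s = 0
    for i in range(n_bits):
        a_i = (a >> i) & 1
        q_i = (s + a_i * b) & 1                 # q_i = (S[i] + a_i*B) mod 2
        s = (s + a_i * b + q_i * n_mod) >> 1    # the sum is even, so the shift is exact
    return s


def booth_digits(a, count):
    """Radix-4 Booth digits of a two's-complement integer, least significant first."""
    digits, prev = [], 0                        # prev holds the bit below the window
    for i in range(count):
        lo, hi = (a >> (2 * i)) & 1, (a >> (2 * i + 1)) & 1
        digits.append(lo + prev - 2 * hi)       # digit in {-2, -1, 0, 1, 2}
        prev = hi
    return digits


def r4mg(a, b, n_mod, n_bits):
    """Procedure R4MG(): returns a*b*4^(-k) mod N in [-N, N), k = ceil((n+3)/2)."""
    k = (n_bits + 4) // 2                       # integer form of ceil((n+3)/2)
    n1 = (n_mod >> 1) & 1
    s = 0
    for digit in booth_digits(a, k):
        t = s + digit * b                       # T_i = S[i] + PP_i
        t0, t1 = t & 1, (t >> 1) & 1
        if t0 == 0:
            reduction = 0 if t1 == 0 else 2 * n_mod
        else:
            reduction = -n_mod if t1 == n1 else n_mod
        s = (t + reduction) >> 2                # multiple of 4, so the shift is exact
    return s


def r4_modexp(m, e, n_mod, n_bits):
    """L-algorithm exponentiation built on r4mg(), adjusting only the final result."""
    k = (n_bits + 4) // 2
    c = pow(4, 2 * k, n_mod)                    # assumed constant R^2 mod N, R = 4^k
    m_i, p_i = r4mg(m, c, n_mod, n_bits), 1
    for i in range(n_bits):
        if (e >> i) & 1:
            p_i = r4mg(m_i, p_i, n_mod, n_bits)
        m_i = r4mg(m_i, m_i, n_mod, n_bits)
    return p_i + n_mod if p_i < 0 else p_i


N_BITS, N_MOD = 8, 239                          # toy 8-bit odd modulus (assumption)
print(mg(200, 123, N_MOD, N_BITS), r4mg(-200, 123, N_MOD, N_BITS))
print(r4_modexp(7, 201, N_MOD, N_BITS), pow(7, 201, N_MOD))
```

The model relies on Python's arithmetic right shift and bitwise AND behaving as two's-complement operations on negative integers, which is what lets the signed intermediate values of R4MG() be handled directly.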

2.3 Radix-4 Modular Exponentiation

Our modular exponentiation algorithm is based on the L-algorithm. Let $R = 4^{\lceil (n+3)/2 \rceil}$ and $C = R^2 \pmod{N}$; then the proposed exponentiation procedure is as shown below, where the final value $P_n$ is equal to $M^E \pmod{N}$. Also, as discussed above, the range for the intermediate numbers ($M_i$'s and $P_i$'s) will not grow since we use R4MG().

R4ME(M, E, N)
{
    M_0 = R4MG(M, C, N);
    P_0 = 1;
    for (i = 0; i < n; i++) {
        M_{i+1} = R4MG(M_i, M_i, N);
        if (e_i = 1) P_{i+1} = R4MG(M_i, P_i, N);
        else P_{i+1} = P_i;
    }
    if (P_n < 0) P_n = P_n + N;
    return P_n;
}

Note that we keep intermediate results $M_i$ and $P_i$ in the range $[-N, N)$ during the entire exponentiation process and convert only the final result $P_n$ to the range $[0, N)$ by adding $N$ to $P_n$ if $P_n < 0$. The intermediate values are shown in Table 1. We can perform the exponentiation for one message block concurrently when we read the next message block and write the previously processed cipher block (with post adjustment).

Table 1: Intermediate values of Procedure R4ME() (all values modulo $N$).

i     M_i            P_i
0     $MR$           $1$
1     $M^2 R$        $M^{e_0}$
2     $M^4 R$        $M^{2e_1 + e_0}$
3     $M^8 R$        $M^{4e_2 + 2e_1 + e_0}$
...
n     $M^{2^n} R$    $M^E$

3 Modular Multipliers

3.1 Montgomery's Modular Multiplier

We implement Montgomery's algorithm using a cellular array circuit. The dependence graph (DG) and signal flow graph (SFG) [14] of the modular multiplication algorithm are shown in Fig. 1. In the figure, the projection direction vector is $\vec{d} = (0, 1)$ and the schedule vector is $\vec{s} = (1, 2)^T$. The utilization ratio of the cells is 50%. The resulting modular multiplier is shown in Fig. 2, where the signal $q_i$ is generated by XORing the LSB of $S[i]$ and that of $a_i B$. We use two $n$-bit adders to add $S[i]$, $a_i B$, and $q_i N$. The pipeline technique is applied to the design to reduce the clock period. We divide a cell into three subcells which belong to different pipeline stages. As shown in Fig. 2, the D flip-flops (FFs) are inserted where the dashed lines cross the signals. The clock period is about the delay time of a full adder.

Figure 1: The DG and SFG of a 4-bit modular multiplier (projection direction $\vec{d} = (0, 1)$, schedule vector $\vec{s} = (1, 2)$). Each internal cell passes $a_i$, $b_i$, $n_i$ and $q_i$ through and computes $\{C_o[0], t_{ij}\} = a_i b_i + s_i + C_i[0]$ and $\{C_o[1], s_o\} = q_i n_i + t_{ij} + C_i[1]$; the boundary cell generates $q_o = s_i \oplus a_i b_i$.

3.2 Radix-4 Modular Multiplier

The recoding rules for the radix-4 Booth algorithm are shown in Table 2 [15]. The signals Code[2:0] are used to select $\pm B$, $\pm 2B$, or 0 as the partial product. The DG and SFG of the radix-4 modular multiplication algorithm are shown in Fig. 3. We let the projection direction vector be $\vec{d} = (0, 1)$ and the schedule vector be $\vec{s} = (1, 3)^T$. Note that the utilization ratio of the cells is 33%. The utilization can be improved by interleaving as shown in Fig. 4, where inputs $(a, b)$ are interleaved with the inputs $(a', b')$. Interleaving two modular multiplications increases the utilization to 67%.

As described in R4MG(), we must determine the modular reduction value $N_i$, where $N_i = (n_{i(n+1)}, \ldots, n_{i1}, n_{i0})$. Three control signals q[2:0] are used to select $2N$, $\pm N$, or 0 as the reduction value, where q[0] $= t_{i0}$, q[1] $= t_{i1}$, and q[2] is formed from $t_{i1}$ and $n_1$. Note that subtracting $N$ is easier than adding $3N$, so we subtract $N$ when $t_{i0} = 1$ and $t_{i1} = n_1$. Therefore, a 2's-complement modular multiplier is required. In 2's-complement multiplication, sign extension is needed when we accumulate the partial products. This can be done by using signals ctr1 and ctr2 and the $pp_{i4}$ XNOR gate as shown in Fig. 4.

Figure 2: A 4-bit Montgomery modular multiplier.

Figure 3: The DG and SFG of a radix-4 modular multiplier (projection direction $\vec{d} = (0, 1)$, schedule vector $\vec{s} = (1, 3)$). Each internal cell passes Code[2:0] and q[2:0] through and computes $\{C_o[0], t_{ij}\} = pp_{ij} + s_i + C_i[0]$ and $\{C_o[1], s_o\} = n_{ij} + t_{ij} + C_i[1]$.

Table 2: Radix-4 Booth's recoding rules.

$a_i$   $a_{i-1}$   $a_{i-2}$   Booth code     Action   Code[2:0]
0       0           0           $00$           +0       000
0       1           0           $01$           +B       001
1       0           0           $\bar{1}0$     -2B      110
1       1           0           $0\bar{1}$     -B       101
0       0           1           $01$           +B       001
0       1           1           $10$           +2B      010
1       0           1           $0\bar{1}$     -B       101
1       1           1           $00$           +0       000
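The recoding rules of Table 2 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names `booth_recode` and `partial_product` are hypothetical, and the reading of Code[2:0] as (sign, two, one) is inferred from the table entries and the signal labels in Fig. 4 rather than stated explicitly in the text.

```python
def booth_recode(a_i, a_im1, a_im2):
    """Recode one radix-4 Booth window (a_i, a_{i-1}, a_{i-2}).

    Returns (digit, code): digit is the multiple of B to accumulate (-2..2),
    code is the 3-character string Code[2:0], read here as sign, two, one.
    """
    digit = a_im2 + a_im1 - 2 * a_i
    sign = 1 if digit < 0 else 0
    two = 1 if abs(digit) == 2 else 0
    one = 1 if abs(digit) == 1 else 0
    return digit, f"{sign}{two}{one}"


def partial_product(code, b):
    """Select 0, +/-B or +/-2B from B according to Code[2:0]."""
    sign, two, one = (int(ch) for ch in code)
    magnitude = 2 * b if two else (b if one else 0)
    return -magnitude if sign else magnitude


# Enumerate the eight windows in the row order of Table 2.
rows = []
for a_im2 in (0, 1):
    for a_i in (0, 1):
        for a_im1 in (0, 1):
            digit, code = booth_recode(a_i, a_im1, a_im2)
            rows.append((a_i, a_im1, a_im2, digit, code))
            print(a_i, a_im1, a_im2, f"{digit:+d}B", code)
```

Because the sign is carried as a separate bit, the two encodings of a zero digit both map to 000, matching the first and last rows of the table.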

Figure 4: The radix-4 modular multiplier (n=3).

As shown in Fig. 5, we need to add at most three D FFs in each cell. However, the utilization ratio is reduced to 25% due to pipelining. Fortunately, in RSA, a message to be encrypted is divided into a sequence of blocks, and each block is raised to the $E$th power modulo $N$ independently, so the modular exponentiations of the successive message blocks can be done in parallel by interleaving. Furthermore, in a modular exponentiation, the computation of $M_i^2$ and $M_i P_i$ can also be interleaved. Our design thus can execute four modular multiplications at the same time by this double interleaving. The utilization is then increased to 100%. Since the number of partial products is reduced by half, the number of iterations is reduced accordingly. Compared with [12, 13], we have two times the speed and 1.5 times the hardware cost. The extra 50% hardware cost is due to the interleaving control circuit and the pipeline registers.

Figure 5: The pipelined radix-4 modular multiplier.

The time to compute four interleaved modular multiplications is about $2n$ clock cycles, and the time to compute a modular exponentiation is about $n^2$ clock cycles. Table 3 compares the hardware and time complexity of several linear-array RSA systems. In the table, $\alpha$ and $\tau$ are roughly the area and delay of a full adder cell, respectively. We normalize the clock period to the delay time $\tau$. From the table, our architecture has the lowest computation time as compared with other works. When the size of the message block and encryption key are 512 bits, the encryption throughput rate is about 300K bps using a 150MHz clock.

Table 3: Comparison of linear-array RSA systems.

Approach   Time           Area
[8]        $4n^2\tau$     $4n\alpha$
[9]        $4n^2\tau$     $2n\alpha$
[11]       $4n^2\tau$     $2n\alpha$
[12]       $2n^2\tau$     $2n\alpha$
[13]       $2n^2\tau$     $2n\alpha$
Ours       $n^2\tau$      $3n\alpha$

4 Digit-Level Modular Multiplier

For a radix-$r$ system, the number of iterations is $\lceil (n + (\log_2 r + 1)) / \log_2 r \rceil$. However, the utilization is reduced to $1/(\log_2 r + 1)$. To increase the utilization of a radix-$r$ modular multiplier without interleaving, we can merge $\log_2 r$ cells into one processing element (PE) and form a digit-level array multiplier. For example, in Fig. 6 two cells are merged into one PE, so the utilization is increased from 1/3 to 1/2. In a radix-4 digit-level cellular array, each PE contains two 2-bit adders. The critical path is equal to the delay of three full adders. Therefore, the clock period is roughly $3\tau$.

Figure 6: The DG and SFG of a radix-4 digit-level multiplier (projection direction $\vec{d} = (0, 1)$, schedule vector $\vec{s} = (1, 2)$).

Though the number of iterations is reduced, the hardware complexity and the clock period increase. As Fig. 7 shows, each PE of a radix-16 multiplier contains an N-Selector circuit, a B-Selector circuit, and two 4-bit adders. For $n$-bit modular multiplication, the number of iterations is $\lceil (n+5)/4 \rceil$, but the clock period is roughly $5\tau$ (i.e., $\log_2 r + 1$). Because the utilization is 50%, we need roughly $n/2$ clock cycles to compute two interleaved modular multiplications. We can reduce the clock period by using a fast 4-bit adder (such as the carry look-ahead adder), but this is basically a tradeoff between computation time and silicon area. Another possibility is that the signed digit number system can be used to avoid carry propagation in the PE and hence increase the speed of the high-radix modular multiplier.
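The closed-form counts quoted in Sections 3.2 and 4 are easy to tabulate. The sketch below evaluates them for the radix-4 and radix-16 cases and reproduces the throughput estimate of Section 3.2; the function names are hypothetical, the radix is assumed to be a power of two, and the throughput helper simply assumes one $n$-bit block per $n^2$ clock cycles, as stated for the pipelined radix-4 design.

```python
def log2_radix(r):
    """Base-2 logarithm of a power-of-two radix, computed exactly on integers."""
    assert r >= 2 and r & (r - 1) == 0, "radix must be a power of two"
    return r.bit_length() - 1


def iterations(n, r):
    """Iterations per n-bit modular multiplication: ceil((n + log2(r) + 1) / log2(r))."""
    lg = log2_radix(r)
    return -(-(n + lg + 1) // lg)               # integer ceiling division


def bit_level_utilization(r):
    """Cell utilization of the bit-level array without interleaving: 1 / (log2(r) + 1)."""
    return 1 / (log2_radix(r) + 1)


def digit_level_clock_period(r):
    """Clock period of the digit-level array in units of tau: log2(r) + 1."""
    return log2_radix(r) + 1


def throughput_bps(n, clock_hz):
    """Bits per second when one n-bit block takes about n^2 clock cycles."""
    return n * clock_hz / n ** 2


for radix in (4, 16):
    print(radix, iterations(512, radix), bit_level_utilization(radix),
          digit_level_clock_period(radix))
print(throughput_bps(512, 150e6))
```

For $n = 512$ and a 150 MHz clock the last line evaluates to roughly 293 kbit/s, consistent with the "about 300K bps" figure quoted with Table 3.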

Figure 7: A radix-16 digit-level modular multiplier.

5 Conclusions

We have presented radix-4 modular multiplication and modular exponentiation algorithms. In the algorithms, we are able to keep the intermediate results in the range $[-N, N)$, so no post adjustment is required for each multiplication during the modular exponentiation process. A cellular-array modular multiplier based on the algorithm and Booth's multiplication has been presented, which is especially efficient for the RSA cryptosystem, where modular multiplications are performed iteratively. With our pipelined radix-4 modular multiplier, the time for a modular exponentiation is about $n^2 \tau$, where $\tau$ is roughly the delay of a full adder. The proposed multiplier is four times faster than those based on the original Montgomery algorithm, and is two times faster than those reported in [12, 13]. A high-radix digit-level modular multiplier was also discussed. Extending the design to larger moduli is straightforward.

References

[1] W. Diffie and M. E. Hellman, "New directions in cryptography," IEEE Trans. Information Theory, vol. 22, pp. 644–654, Nov. 1976.

[2] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, pp. 120–126, Feb. 1978.

[3] Ç. K. Koç and C. Y. Hung, "Bit-level systolic arrays for modular multiplication," J. VLSI Signal Processing, vol. 3, pp. 215–223, 1991.

[4] Ç. K. Koç, "RSA hardware implementation," technical report, RSA Laboratories, RSA Data Security, Inc., Redwood City, CA, 1995.

[5] N. Takagi and S. Yajima, "Modular multiplication hardware algorithms with a redundant representation and their application to RSA cryptosystem," IEEE Trans. Computers, vol. 40, pp. 887–891, July 1992.

[6] N. Takagi, "A radix-4 modular multiplication hardware algorithm for modular exponentiation," IEEE Trans. Computers, vol. 41, pp. 949–956, Aug. 1992.

[7] P. L. Montgomery, "Modular multiplication without trial division," Math. Computation, vol. 44, pp. 519–521, 1985.

[8] P. Kornerup, "A systolic, linear-array multiplier for a class of right-shift algorithms," IEEE Trans. Computers, vol. 43, pp. 892–898, Aug. 1994.

[9] C. D. Walter, "Systolic modular multiplication," IEEE Trans. Computers, vol. 42, pp. 376–378, Mar. 1993.

[10] M. Shand and J. Vuillemin, "Fast implementation of RSA cryptography," in Proc. 11th IEEE Symp. Computer Arithmetic, (Windsor, Ontario), pp. 252–259, June 1993.

[11] P.-S. Chen, S.-A. Hwang, and C.-W. Wu, "A systolic RSA public key cryptosystem," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), (Atlanta), pp. 408–411, May 1996.

[12] C.-C. Yang, T.-S. Chang, and C.-W. Jen, "A new RSA cryptosystem hardware design based on Montgomery's algorithm," IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, pp. 908–913, July 1998.

[13] C.-Y. Su, S.-A. Hwang, P.-S. Chen, and C.-W. Wu, "An improved Montgomery algorithm for high-speed RSA public-key cryptosystem," IEEE Trans. VLSI Systems, vol. 7, pp. 280–284, June 1999.

[14] S.-Y. Kung, VLSI Array Processors. Englewood Cliffs, New Jersey: Prentice-Hall Inc., 1988.

[15] I. Koren, Computer Arithmetic Algorithms. Englewood Cliffs, New Jersey, 07632: Prentice-Hall Inc., 1993.