Acta Appl Math (2006) 93: 161–178 DOI 10.1007/s10440-006-9071-0

Scalable and Systolic Architecture for Computing Double Exponentiation Over $GF(2^m)$

Chiou-Yng Lee · Jim-Min Lin · Che Wun Chiou

Received: 29 March 2006 / Revised: 29 March 2006 / Accepted: 10 April 2006 / Published online: 22 September 2006 © Springer Science + Business Media B.V. 2006

Abstract Double-exponentiation is a crucial arithmetic operation for many cryptographic protocols. Several efficient double-exponentiation algorithms based on systolic architectures have been proposed. However, systolic architectures require a large circuit area, which increases the cost of implementing the protocol. This is a drawback when designing circuits for systems that require low cost and low power consumption. In such systems, including portable devices and many embedded systems, some cost savings can be attained by compromising on speed. This study proposes a scalable and systolic $AB^2$ circuit and a scalable and systolic $A \times B$ circuit, which are the core circuit modules of double-exponentiation. A scalable and systolic double-exponentiation can thus be obtained based on the proposed scalable $AB^2$ and $A \times B$ architectures. Embedded system engineers may specify a target double-exponentiation circuit with an appropriately scaled systolic array. The proposed circuit has a lower space complexity (cost) and a low propagation delay compared with other circuits.

Key words double-exponentiation · Galois field · polynomial basis · systolic architecture · scalable architecture · cryptography.

C.-Y. Lee (B) Department of Computer Information and Network Engineering, Lunghwa University of Science and Technology, Taoyuan County 3306, Taiwan, Republic of China e-mail: [email protected]
J.-M. Lin Department of Information Engineering and Computer Science, Feng Chia University, Taichung City 407, Taiwan, Republic of China e-mail: [email protected]
C. W. Chiou Department of Computer Science and Information Engineering, Ching Yun University, Chung-Li 320, Taiwan, Republic of China e-mail: [email protected]


1 Introduction

The power-sum operation ($C + AB^2$) is a significant arithmetic operation for public-key cryptosystems [9], including RSA [32] and elliptic curve cryptosystems [6], and in decoding multiple-error-correcting binary BCH and RS codes [25, 38, 43]. The $AB^2$ operation is the major portion of this power-sum operation and is therefore the target calculation of this study.

Finite field (or Galois field) arithmetic operations have been successfully adopted in many complex applications, such as cryptography [18], switching theory [2], coding theory [19], pseudorandom number generation [36] and digital signal processing [5, 29]. Finite field arithmetic operations can be conducted in three different representations, namely normal basis (NB), polynomial basis (PB) and dual basis (DB). The performance of an operation depends on the representation of the field elements. A significant benefit of NB [10, 20, 24, 30, 31, 34, 35, 37] is that it computes the square of an element with a cyclic shift of the binary representation. PB [1, 7, 11, 12, 15–17, 21, 22, 26, 27] is extensively adopted and leads to efficient implementations of finite field arithmetic operations. PB has a low design complexity and an extensible system architecture for various applications owing to its greater simplicity, regularity and modularity in architecture compared with the other two representations. The DB representation [3, 44, 45], in contrast, requires a smaller chip area than the other two representations. This study employs the PB representation. Although several efficient power-sum architectures [13, 39–41] have been developed with the PB representation of $GF(2^m)$, their high space and time complexities are major limitations in cryptographic applications. Therefore, further research on efficient power-sum architectures with low space and time complexities is required.
This study proposes a scalable and systolic array implementation of the $AB^2$ circuit with low space complexity by employing circuit folding. This scalable and systolic $AB^2$ architecture can be adopted to derive several important arithmetic operations, such as multiplication and double exponentiation. Double-exponentiation is a crucial arithmetic operation for many modern cryptographic protocols, such as public-key cryptosystems (Diffie and Hellman [9] and RSA [32]), encoding of Reed–Solomon codes [28], and other cryptographic protocols [4, 23, 33]. However, double exponentiation is a rather complicated operation, so efficient and high-speed double exponentiation algorithms are hard to obtain. Double-exponentiation is traditionally performed by iterating multiplication operations. This approach is time-consuming and has high space complexity. This study therefore aims to compute double-exponentiation using the proposed Galois field based algorithms.

Systolic architecture can lead to cost-effective, high-performance, high-throughput special-purpose systems for a wide range of problems. In particular, systolic architecture has been successfully and widely employed for finite field arithmetic operations [8, 13, 16, 17, 21, 22, 36–41] owing to its regularity and ease of reconfiguration. A systolic array of n stages can perform up to n multiplications at the same time in a pipelined manner. Systolic architecture has high performance and high throughput, and is therefore appropriate for high-performance server applications. However, the large scale of systolic circuits makes them expensive. High performance and low cost are major concerns in many embedded applications for


personal use or low system load, while high throughput is unimportant. Therefore, a scalable low-cost systolic architecture that fits embedded systems is desirable. This study presents a scalable and systolic architecture for calculating double exponentiation in which $AB^2$ is the core scalable and systolic circuit. In the proposed scalable $AB^2$ architecture, a systolic array can be n-folded, with the result produced by n iterations of the n-folded circuit. Consequently, the circuit space is reduced to 1/n of that of the original full systolic architecture, while maintaining its original performance.

The remainder of this study is structured as follows. Section 2 briefly reviews the mathematical background. Section 3 then introduces the proposed scalable and systolic power-sum circuit based on n-folding. Next, Section 4 presents a multiplication circuit based on the proposed scalable and systolic power-sum circuit. The novel double exponentiation circuit, which is based on the proposed power-sum and multiplication circuits in Sections 3 and 4, is then discussed in Section 5. Conclusions are finally drawn in Section 6.

2 Mathematical Background

Galois field arithmetic is the key mathematical foundation of this study. This study assumes that readers already have basic knowledge of Galois field operations, which are described in detail by Lidl and Niederreiter [18]. This section reviews some fundamental results on Galois fields. Let $GF(2^m)$ be a finite field of $2^m$ elements. $GF(2^m)$ is an extension field of the ground field $GF(2)$, which has elements 0 and 1. Every finite field contains a zero element, a unit element and a primitive element, and $GF(2^m)$ has at least one primitive irreducible polynomial $P(x) = p_0 + p_1 x + \cdots + p_{m-1} x^{m-1} + x^m$ of degree m over $GF(2)$ associated with it. The primitive element $\alpha$ is a root of the primitive polynomial $P(x)$.
The non-zero elements of $GF(2^m)$ can be represented as powers of the primitive element $\alpha$, i.e., $GF(2^m) = \{0, 1, \alpha, \alpha^2, \ldots, \alpha^{2^m-2}\}$. Since $P(\alpha) = 0$, we have $\alpha^m = p_0 + p_1\alpha + \cdots + p_{m-1}\alpha^{m-1}$. Therefore, the elements of $GF(2^m)$ can also be expressed as polynomials in $\alpha$ of degree less than m by performing a mod $P(x)$ computation over $GF(2)$, i.e., $GF(2^m) = \{A \mid A = a_0 + a_1\alpha + \cdots + a_{m-2}\alpha^{m-2} + a_{m-1}\alpha^{m-1}$, where $a_i \in GF(2)$, $0 \le i \le m-1\}$. The basis $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ is known as the standard basis and is often referred to as the polynomial basis, conventional basis or canonical basis.

We now briefly describe multiplication over $GF(2^m)$ using the canonical basis. Let $P(x)$ be the primitive irreducible polynomial of degree m for $GF(2^m)$ and let $\alpha$ be a root of $P(x)$. Let $A = \sum_{i=0}^{m-1} a_i\alpha^i$ and $B = \sum_{i=0}^{m-1} b_i\alpha^i$ be two elements in $GF(2^m)$, and let $C = \sum_{i=0}^{m-1} c_i\alpha^i$ be the product of A and B. The product C can be written as follows:

$$\begin{aligned}
C &= AB \bmod P(x) \\
  &= A(b_0 + b_1\alpha + b_2\alpha^2 + \cdots + b_{m-1}\alpha^{m-1}) \bmod P(x) \\
  &= Ab_0 \bmod P(x) + Ab_1\alpha \bmod P(x) + \cdots + Ab_{m-1}\alpha^{m-1} \bmod P(x)
\end{aligned} \tag{1}$$


or

$$C = (\cdots((Ab_{m-1} \bmod P(x))\alpha + Ab_{m-2} \bmod P(x))\alpha + \cdots)\alpha + Ab_0 \bmod P(x) \tag{2}$$

The above equations lead to two types of realization, the least significant bit (LSB) first scheme and the most significant bit (MSB) first scheme, where LSB and MSB refer to the bits of the multiplier B. The LSB-first multiplication is based on Equation (1). In step k, $1 \le k \le m$, the following computations are performed in parallel:

$$A^{(k)} = A\alpha^k = (A\alpha^{k-1})\alpha \bmod P(x), \qquad C^{(k)} = C^{(k-1)} + A^{(k-1)} b_{k-1}$$

where $A^{(0)} = A$, $C^{(0)} = 0$ and $C = AB = C^{(m)}$. The MSB-first multiplication is based on Equation (2). In step k, $1 \le k \le m$, the following computation is performed:

$$C^{(k)} = C^{(k-1)}\alpha + Ab_{m-k} \bmod P(x) \tag{3}$$

where $C^{(0)} = 0$. The essential step in Equation (3) can be identified as follows:

$$C^{(k)} = c_0^{(k)} + c_1^{(k)}\alpha + \cdots + c_{m-2}^{(k)}\alpha^{m-2} + c_{m-1}^{(k)}\alpha^{m-1} \tag{4}$$

where

$$\begin{aligned}
c_i^{(k)} &= c_{i-1}^{(k-1)} + c_{m-1}^{(k-1)} p_i + a_i b_{m-k}, \quad \text{for } 1 \le i \le m-1 \\
c_0^{(k)} &= c_{m-1}^{(k-1)} p_0 + a_0 b_{m-k}
\end{aligned} \tag{5}$$
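The two schemes can be illustrated in software. The following is a minimal Python sketch, not the authors' circuit: field elements are held as integers whose bit i is the coefficient of $\alpha^i$, and the field $GF(2^4)$ with $P(x) = x^4 + x + 1$ is an assumed example. The function names are hypothetical.

```python
# Sketch only: GF(2^m) elements as Python ints, bit i = coefficient of alpha^i.
M = 4
P = 0b10011  # assumed example polynomial P(x) = x^4 + x + 1 (includes the x^m term)

def mul_reference(a, b, p=P, m=M):
    """Schoolbook product over GF(2), then reduction modulo P(x)."""
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for d in range(2 * m - 2, m - 1, -1):  # clear degrees 2m-2 down to m
        if (prod >> d) & 1:
            prod ^= p << (d - m)
    return prod

def times_alpha(a, p=P, m=M):
    """Multiply by alpha and reduce, using alpha^m = p_0 + ... + p_{m-1} alpha^{m-1}."""
    a <<= 1
    if (a >> m) & 1:
        a ^= p
    return a

def mul_lsb_first(a, b, p=P, m=M):
    """LSB-first: C(k) = C(k-1) + A(k-1) b_{k-1}, A(k) = A(k-1) alpha mod P."""
    c = 0
    for k in range(1, m + 1):
        if (b >> (k - 1)) & 1:
            c ^= a
        a = times_alpha(a, p, m)
    return c

def mul_msb_first(a, b, p=P, m=M):
    """MSB-first, Equation (3): C(k) = C(k-1) alpha + A b_{m-k} mod P."""
    c = 0
    for k in range(1, m + 1):
        c = times_alpha(c, p, m)
        if (b >> (m - k)) & 1:
            c ^= a
    return c
```

In this sketch both loops visit the bits of B once; they differ only in whether the running operand A or the accumulator C is multiplied by $\alpha$ at each step.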

The operations performed in both multiplication algorithms can be identified as multiply-by-α, generate-current-products and accumulate-to-previous-result. The multiply-by-α operation is common to both schemes. In the LSB-first scheme, the three operations are performed in parallel, while in the MSB-first scheme they are performed sequentially. The MSB-first scheme leads to more efficient implementations with a systolic architecture than the LSB-first scheme.

In VLSI designs, systolic architectures are fundamentally suited to rapid computation and depend on regular circuitry to perform arithmetic operations over finite fields $GF(2^m)$. Their common nature supports architectural characteristics such as concurrency, I/O balance, and simple and regular design. Systolic arrays are one- or two-dimensional arrays of simple processing elements that perform a specific task, e.g., matrix–vector multiplication. In the MSB-first scheme, a polynomial basis multiplier using the idea in Equation (5), developed down to the bit level, has proved to be a good way to obtain decent designs for VLSI realization of arithmetic processing elements. For example, Wang–Lin [46] adopts the unidirectional data flow concept to present a bit-parallel multiplier over $GF(2^m)$. The circuit consists of m × m identical cells, each of which comprises two two-input AND gates, one three-input XOR gate and seven 1-bit latches. The array can provide a maximum throughput of one result per clock cycle after an initial delay of 3m clock periods. The propagation delay of each cell is the total delay of one two-input AND gate, one two-input XOR gate and one 1-bit latch.


3 Proposed Scalable and Systolic $AB^2$ Operation in $GF(2^m)$

$AB^2$ is a commonly used computation in many significant applications, such as decoding BCH codes and RS codes, and computing inversions and divisions. Wei–Wei [42] presented a systolic circuit for $AB^2$ with bidirectional data flow using a polynomial basis representation of $GF(2^m)$. However, this architecture is not appropriate for testable design. Wang and Guo [39] then also utilized a polynomial basis to present a systolic array for $AB^2$ computation with unidirectional data flow, low space complexity, short latency and fault tolerance. Instead of applying the LSB-first schemes of conventional designs, Kim et al. [13] utilized the MSB-first approach to further lower the space and time complexities of existing $AB^2$ circuits. However, such systolic $AB^2$ architectures still have drawbacks, such as high space complexity and long latency, when applied to cryptographic applications. Hence, this study proposes a novel systolic $AB^2$ architecture that adopts the concept of folded computation to reduce the space complexity.

This study develops a scalable and systolic architecture for $AB^2$ in $GF(2^m)$. The term 'scalable' describes an architecture in which an original m-stage systolic array performing the $AB^2$ function can be scaled down to a t-stage systolic array, ranging from two-folded to at most m-folded. Such a scaled-down architecture is therefore a core and reusable circuit module. The calculation of $AB^2$ for any word size of bit length m can thus be derived by iteratively operating this reusable folded circuit module up to $n = \lceil m/t \rceil$ times. The main benefit of this scaled-down architecture is the reduction in cost (due to the lower circuit space requirement) to approximately 1/n of that of the original circuit. Assume that a scalable systolic array circuit of t stages processes two input data, A and B, of length m bits, in n iterations, where $n = \lceil m/t \rceil$.
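Let

$$\begin{aligned}
A &= a_0 + a_1\alpha + \cdots + a_{m-1}\alpha^{m-1} \\
B &= b_0 + b_1\alpha + \cdots + b_{m-1}\alpha^{m-1} \\
  &= b_0 + b_1\alpha + \cdots + b_{m-1}\alpha^{m-1} + (b_m\alpha^m + b_{m+1}\alpha^{m+1} + \cdots + b_{nt-1}\alpha^{nt-1}) \\
  &= (b_0 + b_1\alpha + \cdots + b_{t-1}\alpha^{t-1}) + (b_t\alpha^t + b_{t+1}\alpha^{t+1} + \cdots + b_{2t-1}\alpha^{2t-1}) \\
  &\quad + \cdots + (b_{(n-1)t}\alpha^{(n-1)t} + b_{(n-1)t+1}\alpha^{(n-1)t+1} + \cdots + b_{nt-1}\alpha^{nt-1}) \\
  &= (b_0 + b_1\alpha + \cdots + b_{t-1}\alpha^{t-1}) + (b_t + b_{t+1}\alpha + \cdots + b_{2t-1}\alpha^{t-1})\alpha^t \\
  &\quad + \cdots + (b_{(n-1)t} + b_{(n-1)t+1}\alpha + \cdots + b_{nt-1}\alpha^{t-1})\alpha^{(n-1)t} \\
  &= B_0 + B_1\alpha^t + \cdots + B_{n-1}\alpha^{(n-1)t}
\end{aligned}$$

where $b_m = \cdots = b_{nt-1} = 0$, $B_k = b_{kt} + b_{kt+1}\alpha + \cdots + b_{kt+(t-1)}\alpha^{t-1}$ and $n = \lceil m/t \rceil$.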
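As a small illustration of this decomposition, the following Python sketch splits an m-bit operand into $n = \lceil m/t \rceil$ words of t bits, with the missing high bits treated as zero. The helper names `split_words` and `join_words` are hypothetical and only serve this example.

```python
def split_words(b, m, t):
    """Return [B_0, ..., B_{n-1}], where B_k holds bits kt .. kt+(t-1) of b."""
    n = -(-m // t)                 # n = ceil(m / t)
    mask = (1 << t) - 1
    return [(b >> (k * t)) & mask for k in range(n)]

def join_words(words, t):
    """Rebuild B = B_0 + B_1 alpha^t + ... + B_{n-1} alpha^{(n-1)t} as a bit pattern."""
    b = 0
    for k, w in enumerate(words):
        b |= w << (k * t)
    return b
```

For m = 5 and t = 2 this gives n = 3 words, the last of which carries one real bit and one padding zero.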


We then compute $AB^2$:

$$\begin{aligned}
Y &= AB^2 \bmod P(x) \\
  &= A(B_0 + B_1\alpha^t + \cdots + B_{n-1}\alpha^{(n-1)t})^2 \bmod P(x) \\
  &= AB_0^2 + AB_1^2\alpha^{2t} + \cdots + AB_{n-1}^2\alpha^{2(n-1)t} \bmod P(x) \\
  &= (\cdots(((0)\alpha^{2t} + AB_{n-1}^2)\alpha^{2t} + AB_{n-2}^2)\alpha^{2t} + \cdots)\alpha^{2t} + AB_0^2 \bmod P(x)
\end{aligned} \tag{6}$$

Therefore, the above equation can be expressed in the following recursive form:

$$\begin{aligned}
Y_0 &= 0 \\
Y_{k+1} &= Y_k\alpha^{2t} + AB_{n-(k+1)}^2 \\
  &= Y_k\alpha^{2t} + A(b_{(n-(k+1))t} + b_{(n-(k+1))t+1}\alpha + \cdots + b_{(n-(k+1))t+(t-1)}\alpha^{t-1})^2 \\
  &= Y_k\alpha^{2t} + Ab_{(n-(k+1))t} + Ab_{(n-(k+1))t+1}\alpha^2 + \cdots + Ab_{(n-(k+1))t+(t-1)}\alpha^{2t-2} \\
  &= (\cdots((Y_k)\alpha^2 + Ab_{(n-(k+1))t+(t-1)})\alpha^2 + \cdots)\alpha^2 + Ab_{(n-(k+1))t} \\
AB^2 &= Y_n
\end{aligned} \tag{7}$$

where $Y_k$ denotes the output of the k-th iteration. To compute Equation (7), we express it in iterative form:

$$S_{i+1} = S_i\alpha^2 + Ab_{(n-(k+1))t+(t-1)-i} = S_i\alpha^2 + Ab_{(n-k)t-(i+1)} \tag{8}$$

where $0 \le i \le t-1$, $S_0 = Y_k$ and $S_i \in GF(2^m)$. Let $S_i$ be represented in the polynomial basis, so that

$$S_i = s_{i,0} + s_{i,1}\alpha + s_{i,2}\alpha^2 + \cdots + s_{i,m-1}\alpha^{m-1}$$

where $s_{i,j} \in GF(2)$, $0 \le i \le t-1$, $0 \le j \le m-1$. Because $\alpha$ is a root of $P(x)$, $P(\alpha) = 0$ and we have the following results:

$$\alpha^m = p_0 + p_1\alpha + p_2\alpha^2 + \cdots + p_{m-1}\alpha^{m-1} \tag{9}$$

$$\alpha^{m+1} = p'_0 + p'_1\alpha + p'_2\alpha^2 + \cdots + p'_{m-1}\alpha^{m-1} \tag{10}$$

where

$$p'_j = p_{m-1}p_j + p_{j-1} \ \text{ for } 1 \le j \le m-1, \qquad p'_0 = p_{m-1}p_0, \qquad p_0 = 1.$$
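The coefficients $p'_j$ can be checked numerically. The sketch below, with hypothetical function names, evaluates the formula given with Equation (10) and compares it with a direct computation of $\alpha^{m+1} = \alpha^m \cdot \alpha \bmod P(x)$. Polynomials are held as integers with bit j equal to $p_j$, and the argument `p` is assumed to include the leading $x^m$ term.

```python
def p_prime_eq10(p, m):
    """Coefficients p'_j from p'_j = p_{m-1} p_j + p_{j-1}, p'_0 = p_{m-1} p_0."""
    low = p & ((1 << m) - 1)           # p_0 .. p_{m-1}
    top = (low >> (m - 1)) & 1         # p_{m-1}
    out = 0
    for j in range(m):
        p_j = (low >> j) & 1
        p_jm1 = (low >> (j - 1)) & 1 if j >= 1 else 0
        out |= ((top & p_j) ^ p_jm1) << j
    return out

def alpha_m_plus_1(p, m):
    """Direct computation: shift alpha^m (Equation (9)) by one and reduce."""
    x = (p & ((1 << m) - 1)) << 1
    if (x >> m) & 1:
        x ^= p
    return x
```

For example, with the assumed polynomial $P(x) = x^4 + x^3 + 1$ both functions return the bit pattern of $1 + \alpha + \alpha^3$.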


The circuit for computing $p'_j$ is shown in Figure 1.

Figure 1 Circuit for computing $P'(x)$ (inputs $p_0, \ldots, p_{m-1}$; outputs $p'_0, \ldots, p'_{m-1}$).

By substituting Equations (9) and (10) into Equation (8), the iteration can be expanded as follows:

$$\begin{aligned}
S_{i+1} &= S_i\alpha^2 + Ab_{(n-k)t-(i+1)} \\
  &= (s_{i,0} + s_{i,1}\alpha + s_{i,2}\alpha^2 + \cdots + s_{i,m-1}\alpha^{m-1})\alpha^2 + (a_0 + a_1\alpha + a_2\alpha^2 + \cdots + a_{m-1}\alpha^{m-1})b_{(n-k)t-(i+1)} \\
  &= (s_{i,0}\alpha^2 + s_{i,1}\alpha^3 + s_{i,2}\alpha^4 + \cdots + s_{i,m-1}\alpha^{m+1}) + (a_0 + a_1\alpha + a_2\alpha^2 + \cdots + a_{m-1}\alpha^{m-1})b_{(n-k)t-(i+1)} \\
  &= s_{i,0}\alpha^2 + s_{i,1}\alpha^3 + s_{i,2}\alpha^4 + \cdots + s_{i,m-3}\alpha^{m-1} \\
  &\quad + s_{i,m-2}(p_0 + p_1\alpha + p_2\alpha^2 + \cdots + p_{m-1}\alpha^{m-1}) \\
  &\quad + s_{i,m-1}(p'_0 + p'_1\alpha + p'_2\alpha^2 + \cdots + p'_{m-1}\alpha^{m-1}) \\
  &\quad + (a_0 + a_1\alpha + a_2\alpha^2 + \cdots + a_{m-1}\alpha^{m-1})b_{(n-k)t-(i+1)} \\
  &= (s_{i,m-2}p_0 + s_{i,m-1}p'_0 + a_0 b_{(n-k)t-(i+1)}) \\
  &\quad + (s_{i,m-2}p_1 + s_{i,m-1}p'_1 + a_1 b_{(n-k)t-(i+1)})\alpha \\
  &\quad + (s_{i,0} + s_{i,m-2}p_2 + s_{i,m-1}p'_2 + a_2 b_{(n-k)t-(i+1)})\alpha^2 + \cdots \\
  &\quad + (s_{i,m-3} + s_{i,m-2}p_{m-1} + s_{i,m-1}p'_{m-1} + a_{m-1} b_{(n-k)t-(i+1)})\alpha^{m-1} \\
  &= s_{i+1,0} + s_{i+1,1}\alpha + s_{i+1,2}\alpha^2 + \cdots + s_{i+1,m-1}\alpha^{m-1}
\end{aligned} \tag{11}$$

where

$$\begin{aligned}
s_{i+1,0} &= s_{i,m-2}p_0 + s_{i,m-1}p'_0 + a_0 b_{(n-k)t-(i+1)} \\
s_{i+1,1} &= s_{i,m-2}p_1 + s_{i,m-1}p'_1 + a_1 b_{(n-k)t-(i+1)} \\
s_{i+1,j} &= s_{i,j-2} + s_{i,m-2}p_j + s_{i,m-1}p'_j + a_j b_{(n-k)t-(i+1)}, \quad \text{for } 2 \le j \le m-1
\end{aligned} \tag{12}$$
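The folded computation can be simulated at the bit level. The following Python sketch is a software model under stated assumptions, not the authors' circuit: `next_row` applies Equation (12) to one row of m cells, and `ab2_scalable` reuses t rows for $n = \lceil m/t \rceil$ rounds as in Equation (7). All function names are hypothetical, elements are integers with bit j equal to the coefficient of $\alpha^j$, `p` includes the $x^m$ term, and $m \ge 2$ is assumed.

```python
def gf_mul(a, b, p, m):
    """Reference product in GF(2^m), used only to check the model."""
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for d in range(2 * m - 2, m - 1, -1):
        if (prod >> d) & 1:
            prod ^= p << (d - m)
    return prod

def p_prime(p, m):
    """Bit pattern of alpha^(m+1) mod P(x), i.e. the coefficients p'_j."""
    x = (p & ((1 << m) - 1)) << 1
    return x ^ p if (x >> m) & 1 else x

def next_row(s, a, b_bit, p, m):
    """One row of cells: S_{i+1} = S_i alpha^2 + A b, coefficient by coefficient."""
    low, ppr = p & ((1 << m) - 1), p_prime(p, m)
    hi2 = (s >> (m - 2)) & 1           # s_{i,m-2}
    hi1 = (s >> (m - 1)) & 1           # s_{i,m-1}
    out = 0
    for j in range(m):
        s_jm2 = (s >> (j - 2)) & 1 if j >= 2 else 0
        bit = (s_jm2
               ^ (hi2 & (low >> j) & 1)
               ^ (hi1 & (ppr >> j) & 1)
               ^ (b_bit & (a >> j) & 1))
        out |= bit << j
    return out

def ab2_scalable(a, b, p, m, t):
    """Y = A B^2 mod P(x) using a t-row array for n = ceil(m/t) rounds."""
    n = -(-m // t)
    y = 0                               # Y_0 = 0
    for k in range(n):                  # round k produces Y_{k+1}
        s = y                           # S_0 = Y_k
        for i in range(t):
            idx = (n - k) * t - (i + 1) # bit of B consumed by row i
            s = next_row(s, a, (b >> idx) & 1, p, m)
        y = s
    return y
```

Bits of B with index at or above m are read as zero, which matches the padding $b_m = \cdots = b_{nt-1} = 0$ used in the decomposition of B.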

Based on Equation (12), the scalable and systolic architecture of size t × m for Equation (7) is shown in Figure 2, where there are four input data, namely $A$, $B_k$, $P$ and $P'$. In the circuit, $S_{i,j}$ denotes an arbitrary cell. Assume that F is the output after each round. F is then forwarded to an input of the next round. After n rounds, the output F is assigned to the desired result Y. Figure 3 illustrates the block diagram of a reusable function module for $Y = AB^2$. This simplified representation is employed for later computations based on $AB^2$. Figure 4 displays the detailed circuit of the cell S in Figure 2. In round 0 (see Figure 2), the systolic array is used to calculate $F = AB_{n-1}^2$. In round 1, the result F produced in round 0 is forwarded as

Figure 2 The proposed scalable and systolic architecture for $Y = AB^2$.


Figure 3 Block diagram of a reusable function module for $Y = A \times B^2 \bmod P$ (inputs $A$, $B$, $P$ and $P'$; output $Y$).

the input of the same semi-systolic array to derive the result of the next round. After n iterations, the final F is produced as the result of the computation Y.

The number of XOR gates in a circuit module is commonly used as the basis for comparing space complexities. This study uses some common assumptions about space complexity from Weste and Eshraghian [43]: (1) a three-input XOR gate and a four-input XOR gate are constructed from two two-input XOR gates and three two-input XOR gates, respectively; (2) a two-input AND gate, a one-bit latch, and a two-input XOR gate consist of six, eight, and six transistors, respectively. The following paragraphs compare the proposed architecture with existing $AB^2$ array architectures. Table 1 lists the results of the comparison for various $AB^2$ array architectures. The proposed single cell S contains 68 transistors. Table 1 indicates that the proposed

Figure 4 Detailed circuit of an S cell.


Table 1 Comparison of systolic arrays for computing $AB^2$ in $GF(2^m)$

| Items | Wang–Guo [39] | Wei [40] | Wei [41] | Ours |
|---|---|---|---|---|
| Function | $C + AB^2$ | $C + AB^2$ | $C + AB^2$ | $AB^2$ |
| Number of cells | $m^2/2$ | $m^2$ | $m^2$ | $m\lceil m/n \rceil$ |
| Throughput (unit = 1/cycle) | 1 | 1 | 1 | $1/n$ |
| Latency (unit = cycles) | $2.5m$ | $4m$ | $m$ | $n\lceil m/n \rceil$ |
| Type | Systolic | Systolic | Semi-systolic | Semi-systolic |
| Data flow | Unidirectional | Bi-directional | Bi-directional | Unidirectional |
| Propagation delay through one cell | $T_{AND2} + T_{XOR4}$ | $T_{AND} + T_{XOR3}$ | $T_{AND} + T_{XOR3}$ | $T_{AND2} + T_{XOR4}$ |
| Cell complexity | 6 AND2, 2 XOR4, 17 L1 | 3 AND2, 1 XOR2, 1 XOR3, 13 L1 | 3 AND2, 1 XOR2, 1 XOR3, 4 L1 | 3 AND2, 1 XOR4, 4 L1 |
| Transistor count | $104m^2$ | $140m^2$ | $68m^2$ | $68m\lceil m/n \rceil$ |
| Algorithm | MSB | MSB | LSB | MSB |
| Scalable design | No | No | No | Yes |

Note: ANDi: i-input AND gate, XORi: i-input XOR gate, L1: 1-bit latch.

scalable and systolic architecture for the $AB^2$ circuit module can save about $(n-1) \times 100/n$% in space complexity as compared with other existing $AB^2$ array architectures. For instance, the proposed architecture saves about 50% for n = 2, and 75% for n = 4. The propagation delay through one cell is assumed to be the same for all of the array architectures listed in Table 1, since the propagation delays of a three-input XOR gate and a four-input XOR gate are taken to be equal. Table 1 also reveals that the proposed $AB^2$ array architecture runs as quickly as other existing $AB^2$ array architectures. Additionally, the unidirectional data flow of the proposed $AB^2$ array architecture makes fault-tolerant circuit design easy and practical.

As for the throughput analysis, previous proposals [39–41] are based on a full systolic array with fully pipelined processing for both a single datum and multiple data (the throughput is one datum per clock after an initial latency of m clocks). By contrast, the proposed architecture is scalable, i.e. a partial systolic array has pipeline processing only within a single datum and not across contiguous data (the throughput is one datum per $\lceil m/n \rceil$ clocks after an initial latency of $n\lceil m/n \rceil$ clocks). In other words, the proposal has the same processing rate as a full systolic array for a single datum, but a lower throughput for contiguous data. Hence, the proposed scalable architecture is appropriate for an environment that requires fast response and low cost but in which only a few data need to be processed, particularly personal and mobile embedded devices.

Consequently, the proposed $AB^2$ array architecture, which reuses the folded semi-systolic array, has the advantages of lower space complexity and shorter execution time compared with other existing $AB^2$ array architectures. $AB^2$ is a valuable computation because many operations, such as $AB$ and $A^K B^H$, can be transformed into computation forms based on $AB^2$. Therefore, the following sections introduce the calculation of $AB$ and $A^K B^H$ based on $AB^2$.
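The transistor counts quoted in Table 1 follow from the gate-level assumptions stated earlier in this section. The Python sketch below redoes that arithmetic; the dictionary layout and function names are illustrative choices, and the saving is computed relative to a full array of $m^2$ cells of the same type.

```python
from math import ceil

# Transistors per gate under the stated assumptions:
# AND2 = 6, L1 = 8, XOR2 = 6, XOR3 = two XOR2, XOR4 = three XOR2.
TRANSISTORS = {"AND2": 6, "XOR2": 6, "XOR3": 12, "XOR4": 18, "L1": 8}

CELLS = {
    "Wang-Guo [39]": {"AND2": 6, "XOR4": 2, "L1": 17},
    "Wei [40]": {"AND2": 3, "XOR2": 1, "XOR3": 1, "L1": 13},
    "Wei [41]": {"AND2": 3, "XOR2": 1, "XOR3": 1, "L1": 4},
    "Ours": {"AND2": 3, "XOR4": 1, "L1": 4},
}

def cell_transistors(gates):
    """Transistors in one cell, given a {gate name: count} description."""
    return sum(TRANSISTORS[g] * count for g, count in gates.items())

def space_saving(m, n):
    """Fraction of cells saved by an m x ceil(m/n) array relative to m x m."""
    return 1 - ceil(m / n) / m

counts = {name: cell_transistors(g) for name, g in CELLS.items()}
```

With these assumptions the Wang–Guo cell has 208 transistors, which over $m^2/2$ cells gives the $104m^2$ entry of Table 1, and the saving for m divisible by n reduces to $(n-1)/n$.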


4 The Proposed Multiplication Operation in $GF(2^m)$

This section develops a multiplication operation $AB$ in $GF(2^m)$ based on the proposed scalable and systolic $AB^2$. The proposed $AB$ computation is needed to compute the double exponentiation described later. Let

$$A = a_0 + a_1\alpha + a_2\alpha^2 + \cdots + a_{m-1}\alpha^{m-1}, \qquad B = b_0 + b_1\alpha + b_2\alpha^2 + \cdots + b_{m-1}\alpha^{m-1}$$

Then

$$\begin{aligned}
D &= AB \bmod P(x) \\
  &= A(b_0 + b_1\alpha + b_2\alpha^2 + \cdots + b_{m-1}\alpha^{m-1}) \bmod P(x) \\
  &= A\left[(b_0 + b_2\alpha^2 + \cdots + b_{m-2}\alpha^{m-2}) + (b_1\alpha + b_3\alpha^3 + \cdots + b_{m-1}\alpha^{m-1})\right] \bmod P(x) \\
  &= A\left[(b_0 + b_2\alpha + \cdots + b_{m-2}\alpha^{\frac{m}{2}-1})^2 + \alpha(b_1 + b_3\alpha + \cdots + b_{m-1}\alpha^{\frac{m}{2}-1})^2\right] \bmod P(x) \\
  &= AB_1^2 + \alpha AB_2^2 \bmod P(x) \\
  &= AB_1^2 + A'B_2^2 \bmod P(x)
\end{aligned} \tag{13}$$

where

$$\begin{aligned}
A' &= a_0\alpha + a_1\alpha^2 + a_2\alpha^3 + \cdots + a_{m-1}\alpha^m \bmod P(x) \\
   &= a_0\alpha + a_1\alpha^2 + a_2\alpha^3 + \cdots + a_{m-2}\alpha^{m-1} + a_{m-1}(p_0 + p_1\alpha + p_2\alpha^2 + \cdots + p_{m-1}\alpha^{m-1}) \\
   &= (a_{m-1}p_0) + (a_{m-1}p_1 + a_0)\alpha + \cdots + (a_{m-1}p_{m-1} + a_{m-2})\alpha^{m-1} \\
B_1 &= b_0 + b_2\alpha + \cdots + b_{m-2}\alpha^{\frac{m}{2}-1} \\
B_2 &= b_1 + b_3\alpha + \cdots + b_{m-1}\alpha^{\frac{m}{2}-1}
\end{aligned} \tag{14}$$

In Equation (14), $A'$ can be obtained through a circuit similar to that for computing $P'(x)$, as shown in Figure 5. Since $AB_1^2$ and $A'B_2^2$ have the same form as the proposed computation $AB^2$, Equation (13) can be computed using two power-sum iterations. Figure 6 depicts the circuit corresponding to Equation (13), which employs switches SW1 and SW2 to select the input pair $A$ and $B_1$ (for calculating $AB_1^2$) or $A'$ and $B_2$ (for calculating $A'B_2^2$) for the two power-sum iterations. Additionally, a switch, SW3, directs the result of the first iteration to $Y_1$, which is connected to a latch buffer, and the result of the second iteration to $Y_2$. Finally, $D = AB$ is obtained from $Y_1$ XOR $Y_2$. Table 2 presents the switch controls to obtain the correct results in


Figure 5 Detailed circuit of $A'$ (inputs $a_0, \ldots, a_{m-1}$ and $p_0, \ldots, p_{m-1}$; outputs $a'_0, \ldots, a'_{m-1}$).

different iterations. Each $AB^2$ computation takes $n \times t$ clock cycles, so $AB$ takes $2n \times t$ clock cycles, where $n = \lceil m/2t \rceil$. Table 3 compares the proposed and existing multipliers, indicating that the proposed multiplier saves about 75% in space complexity over Wei's multiplier [41], but has a similar latency. The proposed multiplier also saves about 79% in space complexity over Wang–Lin's multiplier [46]. Notably, Wei's multiplier [41] can perform both functions $AB^2 + C$ and $AB + C$.
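The decomposition of Equation (13) can be checked in software. In the Python sketch below, `ab2` is a plain stand-in for the $AB^2$ module (it is not the folded array), `times_alpha` plays the role of the $A'$ circuit, and `split_even_odd` forms $B_1$ and $B_2$ from the even- and odd-indexed bits of B. All names are hypothetical and the representation is an integer with bit i equal to the coefficient of $\alpha^i$.

```python
def gf_mul(a, b, p, m):
    """Reference product in GF(2^m); p includes the x^m term."""
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for d in range(2 * m - 2, m - 1, -1):
        if (prod >> d) & 1:
            prod ^= p << (d - m)
    return prod

def ab2(a, b, p, m):
    """Stand-in for the AB^2 module: returns A * B^2 mod P(x)."""
    return gf_mul(a, gf_mul(b, b, p, m), p, m)

def times_alpha(a, p, m):
    """A' = A * alpha mod P(x), as in Equation (14)."""
    a <<= 1
    return a ^ p if (a >> m) & 1 else a

def split_even_odd(b, m):
    """B_1 packs b_0, b_2, ...; B_2 packs b_1, b_3, ..."""
    b1 = b2 = 0
    for i in range(m):
        bit = (b >> i) & 1
        if i % 2 == 0:
            b1 |= bit << (i // 2)
        else:
            b2 |= bit << (i // 2)
    return b1, b2

def mul_via_ab2(a, b, p, m):
    """D = A B_1^2 + A' B_2^2, i.e. two AB^2 passes followed by an XOR."""
    b1, b2 = split_even_odd(b, m)
    y1 = ab2(a, b1, p, m)                     # first pass:  SW1 = A,  SW2 = B_1
    y2 = ab2(times_alpha(a, p, m), b2, p, m)  # second pass: SW1 = A', SW2 = B_2
    return y1 ^ y2
```

The identity relies on squaring being linear in characteristic 2, so that $B_1^2$ and $B_2^2$ spread the packed bits back to the even powers of $\alpha$.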

5 Proposed Double Exponentiation Operation in $GF(2^m)$

Traditional implementations of double-exponentiation multiply the results of two separate exponentiation computations [14]. For instance, a double-exponentiation $A^K B^H$ can be performed with two exponentiations, $A^K$ and $B^H$, and one multiplication, $A^K B^H$. This study develops a novel approach to performing double-exponentiation by iterating $AB^2$. No additional multiplication circuits are needed in this approach. Let

$$A^K = A^{k_0 + k_1 2 + \cdots + k_{m-1} 2^{m-1}}, \qquad B^H = B^{h_0 + h_1 2 + \cdots + h_{m-1} 2^{m-1}}$$

Then, the double exponentiation is

$$\begin{aligned}
A^K B^H &= A^{k_0 + k_1 2 + \cdots + k_{m-1} 2^{m-1}} B^{h_0 + h_1 2 + \cdots + h_{m-1} 2^{m-1}} \\
  &= (A^{k_0} B^{h_0})(A^{k_1} B^{h_1})^2 \cdots (A^{k_{m-1}} B^{h_{m-1}})^{2^{m-1}} \\
  &= (\cdots((A^{k_{m-1}} B^{h_{m-1}})^2 A^{k_{m-2}} B^{h_{m-2}})^2 \cdots)^2 A^{k_0} B^{h_0}
\end{aligned} \tag{15}$$

Figure 6 The circuit for $D = AB \bmod P(x)$ (switches SW1 and SW2 select the inputs IA and IB of the $Y = IA \times IB^2$ module of $t \times m$ cells; SW3 routes the output to $Y_1$, held in latch L, or to $Y_2$).

Therefore, the above equation can be expressed in iterative form:

$$C_0 = 1, \qquad C_i = (A^{k_{m-i}} B^{h_{m-i}})\,C_{i-1}^2, \quad \text{for } 1 \le i \le m$$

Finally, after m iterations, we get $A^K B^H = C_m$. Chiou and Lee [8] proposed a multiplexer-based double exponentiation scheme that requires only m multiplications and saves about 66% time complexity compared

Table 2 States for switches in $D = AB$ using the $AB^2$ circuit module

| Iterations of $AB^2$ | First iteration | Second iteration |
|---|---|---|
| SW1 | $A$ | $A'$ |
| SW2 | $B_1$ | $B_2$ |
| SW3 | $Y_1$ | $Y_2$ |


Table 3 Comparison of systolic arrays for computing $AB$ in $GF(2^m)$

| Items | Wang–Lin [46] | Wei [41] | Ours |
|---|---|---|---|
| Number of cells | $m^2$ | $m^2$ | $m\lceil m/2n \rceil$ |
| Throughput (unit = 1/cycle) | 1 | 1 | $1/2n$ |
| Latency (unit = cycles) | $3m$ | $m$ | $2n\lceil m/2n \rceil$ |
| Type | Systolic | Semi-systolic | Semi-systolic |
| Data flow | Unidirectional | Bi-directional | Unidirectional |
| Propagation delay through one cell | $T_{AND2} + T_{XOR3}$ | $T_{AND2} + T_{XOR3}$ | $T_{AND} + T_{XOR4}$ |
| Cell complexity | 2 AND2, 1 XOR3, 7 L1 | 3 AND2, 1 XOR2, 1 XOR3, 4 L1 | 3 AND2, 1 XOR4, 4 L1 |
| Transistor count | $80m^2$ | $68m^2$ | $68m\lceil m/2n \rceil + 6m$ |
| Algorithm | MSB | LSB | MSB |
| Propagation delay of whole function unit | $m(T_{AND2} + T_{XOR3})$ | $m(T_{AND2} + T_{XOR3})$ | $2n\lceil m/2n \rceil(T_{AND} + T_{XOR4})$ |
| Scalable design | No | No | Yes |

Note: ANDi: i-input AND gate, XORi: i-input XOR gate, L1: 1-bit latch.

with standard binary methods. This study utilizes the same multiplexer-based approach to calculate double exponentiation as follows. Let $Y_{MUX} = A^{k_i} B^{h_i}$, then derive a truth table for a multiplexer according to $k_i$ and $h_i$ (see Table 4). The exponent bits $k_i$ and $h_i$ can be generated simply through parallel-to-serial registers for the exponents $K$ and $H$, respectively. Since $AB$ can be precalculated using Equation (13), only the $AB^2$ circuit module is needed, and no extra $AB$ circuit is necessary for double exponentiation. Figure 7 illustrates the circuit for calculating Equation (15). Like the multiplication circuit, this circuit has three switches controlling the inputs to and the output from the $AB^2$ circuit module. The switches are the key modules controlling the timing of reusing the $AB^2$ circuit module. Table 5 presents the switch states required to generate the desired result at the appropriate times. To determine the processing time of generating $A^K B^H$, the iteration of the core circuit $AB^2$ is assumed to take most of the processing time, and the processing time of the auxiliary circuits is thus disregarded. The time required for calculating $A^K B^H$

Table 4 Truth table for the 4-to-1 multiplexer

| Input $k_i$ | Input $h_i$ | Output $Y_{MUX}$ |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | $B$ |
| 1 | 0 | $A$ |
| 1 | 1 | $AB$ |

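Figure 7 The circuit for $A^K B^H \bmod P(x)$ (the exponents $K$ and $H$ pass through parallel-to-serial (PTS) registers to give $k_i$ and $h_i$, which select $Y_{MUX}$ from 1, $B$, $A$ and $D$ in a 4-to-1 MUX; switches SW1, SW2 and SW3 control the $Y = IA \times IB^2$ module of $t \times m$ cells, which produces $D = A \times B$ and $Z = A^K B^H$).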

mostly includes the iterations of $AB^2$ for precomputing $D = AB$ and the iterations of $AB^2$ for generating $A^K B^H$. The calculation of $D = AB$ takes two $AB^2$ iterations, that is $2nt$ clock cycles, and the multiplexer-based $A^K B^H$ computation takes m times

Table 5 States for switches in $Z = A^K B^H$ using the $AB^2$ circuit module

| Iterations of $AB^2$ | First iteration | Second iteration | Third iteration | Fourth iteration to (m + 1)-th iteration | (m + 2)-th iteration |
|---|---|---|---|---|---|
| SW1 | $A$ | $A'$ | $Y_{MUX}$ | $Y_{MUX}$ | $Y_{MUX}$ |
| SW2 | $B_1$ | $B_2$ | 1 | $C$ | $C$ |
| SW3 | $Y_1$ | $Y_2$ | $C$ | $C$ | $Z$ |


Table 6 Comparison of systolic arrays for computing $A^K B^H \bmod P(x)$ in $GF(2^m)$

| Items | Using Wang–Guo [39] | Using Wei [40] | Using Wei [41] | Ours |
|---|---|---|---|---|
| Number of cells | $m^2/2$ | $m^2$ | $m^2$ | $m\lceil m/n \rceil$ |
| Throughput (unit = 1/cycle) | $1/2m$ | $1/2m$ | $1/2m$ | $1/(mn)$ |
| Latency (unit = cycles) | $5m(m-1)$ | $8m(m-1)$ | $2m(m-1)$ | $(m+2)n\lceil m/n \rceil$ |
| Type | Systolic | Systolic | Semi-systolic | Semi-systolic |
| Data flow | Unidirectional | Bi-directional | Bi-directional | Unidirectional |
| Propagation delay through one cell | $T_{AND2} + T_{XOR4}$ | $T_{AND} + T_{XOR3}$ | $T_{AND} + T_{XOR3}$ | $T_{AND2} + T_{XOR4}$ |
| Cell complexity | 6 AND2, 2 XOR4, 17 L1 | 3 AND2, 1 XOR2, 1 XOR3, 13 L1 | 3 AND2, 1 XOR2, 1 XOR3, 4 L1 | 3 AND2, 1 XOR4, 4 L1 |
| Transistor count | $104m^2$ | $140m^2$ | $68m^2$ | $68m\lceil m/n \rceil$ |
| Algorithm | MSB | MSB | LSB | MSB |
| Scalable design | No | No | No | Yes |

Note: ANDi: i-input AND gate, XORi: i-input XOR gate, L1: 1-bit latch.

of $AB^2$ iterations, that is $mnt$ clock cycles. Therefore, the approximate total processing time needed to generate $A^K B^H$ is $(m + 2)nt$ clock cycles. Table 6 compares the proposed and traditional double exponentiation architectures. The proposed double exponentiation architecture saves not only space complexity but also time complexity over the traditional double exponentiation architectures. For example, the proposed architecture saves about 50% (75%) in space complexity and 50% (50%) in time complexity for n = 2 (n = 4).
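The multiplexer-based iteration $C_i = Y_{MUX} \cdot C_{i-1}^2$ can be modelled in a few lines. The Python sketch below is illustrative only: `ab2` stands in for the $AB^2$ module, the exponents are assumed to be m-bit integers, and the function names, including the cycle estimate `total_cycles`, are hypothetical.

```python
def gf_mul(a, b, p, m):
    """Reference product in GF(2^m); p includes the x^m term."""
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for d in range(2 * m - 2, m - 1, -1):
        if (prod >> d) & 1:
            prod ^= p << (d - m)
    return prod

def ab2(a, b, p, m):
    """Stand-in for the AB^2 module: A * B^2 mod P(x)."""
    return gf_mul(a, gf_mul(b, b, p, m), p, m)

def gf_pow(a, e, p, m):
    """Naive exponentiation by repeated multiplication, for checking only."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a, p, m)
    return r

def double_exp(a, b, k, h, p, m):
    """Z = A^K B^H via C_0 = 1, C_i = Y_MUX(k_{m-i}, h_{m-i}) * C_{i-1}^2."""
    d = gf_mul(a, b, p, m)        # D = AB, precomputed (two AB^2 passes in the circuit)
    mux = {(0, 0): 1, (0, 1): b, (1, 0): a, (1, 1): d}   # Table 4
    c = 1
    for i in range(1, m + 1):
        ki = (k >> (m - i)) & 1
        hi = (h >> (m - i)) & 1
        c = ab2(mux[(ki, hi)], c, p, m)   # IA = Y_MUX, IB = C
    return c

def total_cycles(m, n, t):
    """Approximate cycle count: 2 passes for D = AB plus m passes, nt cycles each."""
    return (m + 2) * n * t
```

Only the $AB^2$ operation appears inside the loop, which mirrors the statement that no separate multiplier is required once $D = AB$ has been precomputed.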

6 Conclusions

This study develops a novel double exponentiation architecture with low space complexity based on a core scalable and systolic $AB^2$ circuit. Compared with existing architectures, the proposed architecture can save at least 50% of the space complexity (cost) while maintaining approximately the same processing performance for a single datum. Furthermore, the proposed architecture also saves about 50% in time complexity over traditional architectures. The proposed architecture can be tailored to the desired scale and applied to many embedded devices in which space complexity is a major design concern. Further applications of the proposed core scalable and systolic $AB^2$ circuit could be explored in future work.

Acknowledgement The authors would like to thank the National Science Council of the Republic of China, Taiwan for financially supporting this research under Contract No. NSC94-2213-E-231-021 and NSC94-2218-E-262-003.


References

1. Bartee, T.C., Schneider, D.J.: Computation with finite fields. Inform. and Comput. 6, 79–98 (1963)
2. Benjauthrit, B., Reed, I.S.: Galois switching functions and their applications. IEEE Trans. Comput. C-25, 78–86 (1976)
3. Berlekamp, E.R.: Bit-serial Reed–Solomon encoders. IEEE Trans. Inform. Theory IT-28, 869–874 (1982)
4. Brickell, E.F., McCurley, K.S.: Interactive identification and digital signatures. ATT Tech. J. 73–86 (1991)
5. Blahut, R.E.: Fast Algorithms for Digital Signal Processing. Addison-Wesley, Reading, Massachusetts (1985)
6. Blake, I., Seroussi, G., Smart, N.: Elliptic Curves in Cryptography. Cambridge University Press, New York (1999)
7. Chiou, C.W., Lin, L.C., Chou, F.H., Shu, S.F.: Low complexity finite field multiplier using irreducible trinomials. Electron. Lett. 39(24), 1709–1711 (2003)
8. Chiou, C.W., Lee, C.-Y.: Multiplexer-based double-exponentiation for normal basis of GF(2^m). Comput. Secur. 24(1), 83–86 (2005)
9. Diffie, W., Hellman, M.E.: New directions in cryptography. IEEE Trans. Inform. Theory 22(6), 644–654 (1976)
10. Fan, H., Dai, Y.: Key function of normal basis multipliers in GF(2^n). Electron. Lett. 38(23), 1431–1432 (2002)
11. Hasan, M.A., Wang, M., Bhargava, V.K.: Modular construction of low complexity parallel multipliers for a class of finite fields GF(2^m). IEEE Trans. Comput. 41(8), 962–971 (1992)
12. Itoh, T., Tsujii, S.: Structure of parallel multipliers for a class of fields GF(2^m). Inform. and Comput. 83, 21–40 (1989)
13. Kim, N.-Y., Kim, H.-S., Yoo, K.-Y.: Computation of AB^2 multiplication in GF(2^m) using low-complexity systolic architecture. IEE Proc., Circuits Devices Syst. 150(2), 119–123 (2003)
14. Knuth, D.E.: The Art of Computer Programming, vol. 2: Seminumerical Algorithms. Addison-Wesley, Reading, Massachusetts (1981)
15. Koc, C.K., Sunar, B.: Low-complexity bit-parallel canonical and normal basis multipliers for a class of finite fields. IEEE Trans. Comput. 47(3), 353–356 (1998)
16. Lee, C.Y., Lu, E.H., Lee, J.Y.: Bit-parallel systolic multipliers for GF(2^m) fields defined by all-one and equally-spaced polynomials. IEEE Trans. Comput. 50(5), 385–393 (2001)
17. Lee, C.Y.: Low complexity bit-parallel systolic multiplier over GF(2^m) using irreducible trinomials. IEE Proc., Comput. Digit. Tech. 150(1), 39–42 (2003)
18. Lidl, R., Niederreiter, H.: Introduction to Finite Fields and their Applications. Cambridge University Press, New York (1994)
19. MacWilliams, F.J., Sloane, N.J.A.: The Theory of Error-Correcting Codes. North Holland, Amsterdam (1977)
20. Massey, J.L., Omura, J.K.: Computational method and apparatus for finite field arithmetic. US Patent 4,587,627, May 1986
21. Mastrovito, E.D.: VLSI architectures for multiplication over finite field GF(2^m). In: Mora, T. (ed.) Applied Algebra, Algebraic Algorithms, and Error-Correcting Codes, Proceedings of Sixth International Conference, AAECC-6, Rome, pp. 297–309 (July 1988)
22. Mastrovito, E.D.: VLSI architectures for computations in Galois fields. PhD thesis, Linkoping University, Department of Electrical Engineering, Linkoping, Sweden (1991)
23. NIST: A proposal federal information processing standard for digital signature standard (DSS). Federal Registration 56, 42980–42982 (1991)
24. Oh, S., Kim, C.H., Lim, J., Cheon, D.H.: Efficient normal basis multipliers in composite fields. IEEE Trans. Comput. 49(10), 1133–1138 (2000)
25. Okano, H., Imai, H.: A construction method of high-speed decoders using ROM's for Bose–Chaudhuri–Hocquenghem and Reed–Solomon codes. IEEE Trans. Comput. C-36, 1165–1171 (1987)
26. Paar, C.: A new architecture for a parallel finite field multiplier with low complexity based on composite fields. IEEE Trans. Comput. 45(7), 856–861 (1996)
27. Paar, C., Fleischmann, P., Roelse, P.: Efficient multiplier architectures for Galois Fields GF(2^{4n}). IEEE Trans. Comput. 47(2), 162–170 (1998)

28. Reed, I.S., Solomon, G.: Polynomial codes over certain finite fields. SIAM J. Appl. Math. 8, 300–304 (1960)
29. Reed, I.S., Truong, T.K.: The use of finite fields to compute convolutions. IEEE Trans. Inform. Theory IT-21(2), 208–213 (1975)
30. Reyhani-Masoleh, A., Hasan, M.A.: A new construction of Massey–Omura parallel multiplier over GF(2^m). IEEE Trans. Comput. 51(5), 511–520 (2002)
31. Reyhani-Masoleh, A., Hasan, M.A.: Fast normal basis multiplication using general purpose processors. IEEE Trans. Comput. 52(11), 1379–1390 (2003)
32. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 21(2), 120–126 (1978)
33. Schnorr, C.P.: Efficient identification and signature for smart cards. In: Advances in Cryptology. Crypto’89, Lecture Notes in Computer Science, vol. 435, pp. 239–252. Springer, Berlin Heidelberg New York (1990)
34. Sunar, B., Koc, C.K.: An efficient optimal normal basis type II multiplier. IEEE Trans. Comput. 50(1), 83–87 (2001)
35. Takagi, N., Yoshiki, J.-I., Takagi, K.: A fast algorithm for multiplicative inversion in GF(2^m) using normal basis. IEEE Trans. Comput. 50(5), 394–398 (2001)
36. Wang, C.C., Pei, D.: A VLSI design for computing exponentiation in GF(2^m) and its application to generate pseudorandom number sequences. IEEE Trans. Comput. 39(2), 258–262 (1990)
37. Wang, C.C., Truong, T.K., Shao, H.M., Deutsch, L.J., Omura, J.K., Reed, I.S.: VLSI architectures for computing multiplications and inverses in GF(2^m). IEEE Trans. Comput. C-34(8), 709–717 (1985)
38. Wang, C.-L., Bair, W.-J.: A VLSI architecture for implementation of the decoder for binary BCH codes. In: Proceedings of Symposium on Communication, Taiwan, pp. 36–40 (Dec 1991)
39. Wang, C.-L., Guo, J.-H.: New systolic arrays for C + AB^2, inversion, and division in GF(2^m). IEEE Trans. Comput. 49(10), 1120–1125 (2000)
40. Wei, S.-W.: A systolic power-sum circuit for GF(2^m). IEEE Trans. Comput. 43(2), 226–229 (1994)
41. Wei, S.-W.: VLSI architectures for computing exponentiations, multiplicative inverses, and divisions in GF(2^m). IEEE Trans. Circuits Systems I Fund. Theory Appl. 44, 847–855 (1997)
42. Wei, S.-W., Wei, C.H.: High speed decoder of Reed–Solomon codes. IEEE Trans. Commun. 41(11), 1588–1593 (1993)
43. Weste, N., Eshraghian, K.: Principles of CMOS VLSI Design: A System Perspective. Addison-Wesley, Reading, Massachusetts (1985)
44. Wu, H., Hasan, M.A.: Low complexity bit-parallel multipliers for a class of finite fields. IEEE Trans. Comput. 47(8), 883–887 (1998)
45. Wu, H., Hasan, M.A., Blake, I.F.: New low-complexity bit-parallel finite field multipliers using weakly dual bases. IEEE Trans. Comput. 47(11), 1223–1234 (1998)
46. Wang, C.L., Lin, J.L.: Systolic array implementation of multipliers for GF(2^m). IEEE Trans. Circuits Syst. II 38(7), 796–800 (1991)