

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 5, NO. 3, JULY-SEPTEMBER 2008

High-Level Side-Channel Attack Modeling and Simulation for Security-Critical Systems on Chips

Francesco Menichelli, Renato Menicocci, Mauro Olivieri, Member, IEEE, and Alessandro Trifiletti

Abstract: The design flow of a digital cryptographic device must take into account the evaluation of its security against attacks based on side-channel observation. The adoption of high-level countermeasures and the verification of the feasibility of new attacks presently require either time-consuming physical measurements on the prototype product or simulation at a low abstraction level. Starting from these assumptions, we developed an exploration approach centered on high-level simulation in order to evaluate the actual implementation of a cryptographic algorithm, whether software or hardware based. The simulation is performed within a unified tool based on SystemC, which can model a software implementation running on a microprocessor-based architecture, a dedicated hardware implementation, or a mixed software-hardware implementation with cycle-accurate resolution. Here, we describe the tool and provide a large set of design explorations and characterizations based on actual implementations of the AES cryptographic algorithm, demonstrating how the large number of experiments made possible by the fast simulation engine can lead to important improvements in the knowledge and identification of the weaknesses in cryptographic algorithm implementations (“Side Channel Analysis Resistant Design Flow”).

Index Terms: Power-analysis attacks, smart cards, AES, code breaking, system-level simulation.

1 INTRODUCTION

SECURITY-CRITICAL systems on chips (SoCs) such as microprocessor-based smart cards are gaining dramatic importance in the consumer electronics market. In this market sector, security issues play a central role and directly affect the design strategies of the system. Although encryption technology has been continuously improving for decades, a whole class of attack techniques has been developed that can extract critical information from the system without analyzing the regular data flow, exploiting instead the information contained in side-effect physical phenomena such as instantaneous power consumption and electromagnetic emissions. The effectiveness of this class of attack techniques, often referred to as side-channel analysis (SCA), is well demonstrated in the literature [1], [2], [3], which mostly describes attacks exploiting measurements of the power consumption, called Power Analysis (PA), and sometimes of electromagnetic emissions, called Electromagnetic Analysis (EMA). Presently, countermeasures have been taken at the circuit level [4], [5], [6], [7] to make the current absorption less dependent on the processed data values, at the gate level and register-transfer level (RTL) [8], [9], [10], [11], and at the software level [12], [13], [14]. Nevertheless, no existing technique has been proved capable of defeating SCA a priori; existing techniques only make it statistically less effective.

The authors are with the Department of Electronic Engineering, University of Rome La Sapienza, 00184 Rome, Italy. E-mail: {menichelli, menicocci, olivieri, trifiletti}@die.uniroma1.it. Manuscript received 2 Aug. 2006; revised 16 May 2007; accepted 4 Oct. 2007; published online 26 Oct. 2007. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TDSC-0110-0806. Digital Object Identifier no. 10.1109/TDSC.2007.70234. 1545-5971/08/$25.00 © 2008 IEEE

Published by the IEEE Computer Society

In principle, the design flow of secure cryptosystems can take advantage of PA simulation for an early assessment of the susceptibility of a given system to selected PA attacks, but the ever-increasing complexity of SoC architectures makes it difficult to quantitatively understand the degree of vulnerability of a system at design time. In fact, circuit-level, gate-level, and even RTL simulations of a whole SoC are already nearly unfeasible for average performance and power consumption estimation, and they are even more time consuming for SCA simulation, which requires much finer detail than average power simulation. System-level performance and average power simulations have been in use for some years now, based on modeling languages such as SystemC, C/C++, and Java [15], [16], [17], [18], [19], [20], [21], but there is presently a lack of consolidated tools and results in system-level SCA simulation that can assess the potential failure points of a SoC architecture with respect to SCA attacks [22]. Several previous works have addressed the issue of accelerating PA simulation at various abstraction levels. In [10], a countermeasure against PA attacks is evaluated by analyzing power traces generated from accelerated transistor-level simulations (Synopsys Nanosim). In [23], a specialized CMOS logic style is evaluated by executing a PA attack on power traces produced by SPICE simulations. In [24], simulations of PA attacks are performed on power traces generated by Synopsys Primepower(TM) from a postlayout gate-level model of the design. A sophisticated extension of this approach is presented in [25], which describes a methodology for simulating EMA attacks that takes into account both the circuit and the package of the given system yet still relies on postlayout power traces.


In [26], a countermeasure against PA attacks is evaluated by analyzing cycle-accurate power traces from a prelayout gate-level model of the given system. In [27], a PA attack is executed on RTL power simulations accounting for logic-value transitions in CMOS registers, and the results are compared with the corresponding ones based on real power traces collected on the given physical system. The first instance of system-level simulation is shown in [28] and [29], which present PINPAS, a Java tool for simulating side-channel effects in smart card microprocessors. The simulator models processor instructions and accounts for the Hamming weight (HW) of the bits in the logical registers of the processor architecture. Another interesting example of the instruction-level simulation of PA attacks is shown in [30], which uses cycle-accurate power traces generated by the SimplePower processor simulator [31]. In this work, we illustrate our experience in developing a methodology for system-level modeling and simulation of a whole SoC architecture executing an industry-standard cryptographic algorithm. The approach is based on the SystemC 2.0 language [32], which is a de facto standard in complex digital system simulations because of its straightforward and efficient hardware-software cosimulation capabilities. The flexibility inherent to SystemC allows us to model and evaluate both microprocessor-based designs (that is, relying on software-implemented algorithms) and hardwired designs (that is, relying on some application-specific hardware units), including the presence of hardware and/or software countermeasures against SCA. Our case studies demonstrate the feasibility of SCA attack simulations in the early phases of SoC design, assessing the potential weak points of an architecture and the effectiveness of countermeasures.
In the following, Sections 2 and 3 illustrate the principles of the approach and the platform architecture used in our analyses. Sections 4, 5, 6, 7, 8, and 9 report the results of a comprehensive set of case studies. Section 10 reports a comparison with a circuit-level analysis, and Section 11 summarizes our conclusions.

2 HIGH-LEVEL SIDE-CHANNEL ANALYSIS SIMULATION ENVIRONMENT

2.1 Reference Hardware/Software Platform
The reference SoC platform architecture is based on a 32-bit ARM7 CPU core connected to an AMBA bus and peripherals, configured as shown in Fig. 1. The software environment used to compile the algorithm for the ARM CPU is gcc version 3.2. The simulation platform used as a starting point for our environment is the MPARM simulator [33], a signal-accurate, cycle-accurate SystemC model developed for the architectural exploration of embedded SoCs. In our environment, MPARM was configured with one ARM CPU and 256 Kbytes of zero-wait-state memory directly connected to the local bus (Fig. 1) and used for both program and data. In some of the experiments, we also considered the inclusion of an AES coprocessor to be implemented as an ASIC (see Section 3.2), whose SystemC model was abstracted from an RTL specification. Thus, as far as the model abstraction level and the signal tracing mechanism are concerned, there is no particular difference between the SystemC high-level simulation of the ASIC and that of the ARM processor.

The original MPARM simulator supports the collection of statistical performance and total power consumption data based on power models well documented in the literature [20], [21], [34], [35]. Such data is not useful for SCA simulations, as it does not take into account the dependency on the algorithmic data values being processed.

Fig. 1. Architecture configured in MPARM for the project.

2.2 Side-Channel Analysis Power Model Features
Usually, high-level power simulations aim at evaluating the average power consumption. In a few cases, they claim some degree of cycle accuracy [20], [21], [36], [37]. Although cycle accuracy is certainly needed by SCA simulation, the type of approximation used in power models for absolute consumption estimation is unacceptable for SCA purposes: in order to compute the absolute power consumption, the most significant contributions are considered, while those whose magnitude is negligible with respect to the total are neglected (see footnote 1). Conversely, in an SCA simulation, power consumption plays the role of a signal carrying information, so the relevant contributions are those that are strongly correlated with data values, even if they contribute little to the total power. Thus, in the SCA simulation, we should discard power contributions not correlated with the data values under attack, independent of their magnitude.

In our approach, we model the dependency of the drawn electrical currents on the data being processed, possibly discarding other contributions that are not correlated with data, even if they could have a major impact on the total power consumption. The model operates at an abstraction level intentionally independent of the physical technology. We chose to trace the internal logic values (that is, hardware registers, buses, and signals, including those internal to the CPU core) in the form of a function related to physical events (that is, the HW and the Hamming distance (HD)) with clock-cycle accuracy. That way, we can model potential information leakage through side channels if the physical measure (for example, power consumption) is dependent on one of these functions or on a linear combination of them. We based our choice on the results of previous work, where physical measurements were compared to RTL simulation models [27]. In that work, Örs et al. conclude that a HW/HD power model simulation shows a good match with physical measurements. For a complete description of the functions traced, see Section 3.3.

Due to the high level of abstraction, the model might also be used to address side channels other than power consumption (for example, electromagnetic emission and temperature) if the physical measure is dependent on one of the functions traced or on a linear combination of them.

1. Instruction-level CPU power models are consolidated tools [20], [21], [34], [35]. For example, Tiwari et al. [20] model the CPU power by considering the cost of single instructions (base cost) and the cost of a state change from one instruction to another as a second-order contribution (interinstruction cost). The dependency of power consumption on data is considered a higher order contribution and is not analyzed at all. Other approaches [38] analyze data dependency, but due to the extremely large data space, they consider the average Hamming distances, which, in turn, removes the data-dependent power contribution needed by SCA simulation.

2.3 Sensitive Data Tracing
Sensitive data tracing provides crucial indications for identifying the critical components of the architecture, that is, those involved in information leakage. In complex SoCs, this preliminary step can be useful to optimize the subsequent simulation of attacks by restricting the tracing activity to the interesting parts. “Sensitive” can be considered an attribute of data that is critical for the security of the system (for example, the keys of a cryptographic algorithm). In a microprocessor system, for instance, data is usually stored in the memory, transferred into CPU registers during execution, processed, and then written back to the memory. This means that security-critical data originally stored in a restricted number of locations is spread throughout the system in a way that is usually not under control, especially if the software is written in a high-level language. In this version of the tools, we assumed only two degrees of sensitivity for each register or memory location (that is, sensitive or not sensitive), and the update policy was “if the sources were sensitive, the register/memory location used as a destination becomes sensitive.” This update policy is applied at the end of each clock cycle, after recognizing the registers/memory locations that acted as sources and destinations.
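The update policy can be illustrated with a small Python sketch. It is not the tool's SystemC code: the function name, the location labels, and the `clear_on_overwrite` switch are hypothetical, and whether a destination written only from non-sensitive sources loses its mark is an assumption, since the text states only the rule for becoming sensitive.

```python
def propagate_sensitivity(sensitive, transfers, clear_on_overwrite=False):
    """Apply the two-level sensitivity policy for one clock cycle.

    sensitive: set of location names marked sensitive at the start of the cycle.
    transfers: list of (destination, [sources]) pairs observed during the cycle.
    clear_on_overwrite: ASSUMPTION, not stated in the text; if True, a destination
        written only from non-sensitive sources is unmarked.
    Returns a new set; the input set is left unchanged.
    """
    updated = set(sensitive)
    for dest, sources in transfers:
        # Sources are judged by their status at the start of the cycle,
        # because the policy is applied at the end of each clock cycle.
        if any(src in sensitive for src in sources):
            updated.add(dest)
        elif clear_on_overwrite:
            updated.discard(dest)
    return updated


# Hypothetical three-cycle fragment: load a key word, combine it, spill it.
cycles = [
    [("r0", ["mem[key]"])],                    # r0 <- key word
    [("r1", ["r0", "r2"])],                    # r1 <- r0 op r2
    [("mem[stack]", ["r1"]), ("r0", ["r3"])],  # spill r1, reuse r0
]
state = {"mem[key]"}
for transfers in cycles:
    state = propagate_sensitivity(state, transfers)
```

After the three cycles the mark has spread from the key location to two registers and a stack slot, which is the kind of uncontrolled diffusion the section describes.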

3 CASE STUDIES

In order to evaluate the potentials of our technique, we explored several implementations of the AES cryptographic algorithm [39]. The exploration particularly stressed the problem of information leakage, but we also took into consideration other aspects such as the complexity and the performance of each implementation.

3.1 Theoretical Background of the Simulated Attacks
We used AES encryption with 128-bit keys, based on 11 rounds R0, R1, ..., R10 and 11 round keys RK0, RK1, ..., RK10, of which RK0 is the external key K. The goal of the attack is to find the external key K. Details on the AES algorithm are recalled in the Appendix.


We explored the applicability of a type of attack known as Correlation Power Analysis (CPA) [40], which exploits the knowledge of N plaintexts P1, P2, ..., PN, encrypted under key K, and of the corresponding simulated traces T1, T2, ..., TN, each composed of M samples (one per system clock cycle). Instead of P1, P2, ..., PN, one can use the corresponding ciphertexts C1, C2, ..., CN. To recover the ith byte of RK0, a set of correlation curves must be produced. Each correlation curve is obtained through the following steps [40]:

1. Guess a value g for the ith byte of RK0.
2. By using g on the ith byte of the N plaintexts, guess the corresponding values g'(1), g'(2), ..., g'(N) for the ith byte at the output of SubBytes (we say that the logical attack point is the SBox output at round R1).
3. For each of those guesses, estimate the associated power consumption contribution by applying a “prediction” function.
4. Collect the N predicted power contributions into a vector U (prediction vector).
5. Simulate the AES device operation with secret key RK0 and trace internal logic values, and for each clock cycle, extract a vector V of N elements from the N simulated traces.
6. Compute the Pearson correlation coefficient between U and V.
7. Repeat step 6 for all the M samples (clock cycles), each with its own vector V, producing a correlation curve Cg associated with g.

By repeating the above steps for the 256 possible values of g, 256 correlation curves are produced. One of these correlation curves corresponds to the right guess ĝ for the ith byte of RK0. The recovery is successful if it is possible to distinguish Cĝ from the curves corresponding to wrong guesses. Observe that in our experiments, we always used the HW as the prediction function (see footnote 2). This choice is significant because, unlike HD, it requires only the knowledge of the algorithm, without any hypothesis on the architecture.

The described procedure can be easily adapted to other logical attack points, for example, the SBox input at round R1. In addition, instead of the N plaintexts, the N corresponding ciphertexts can be used to target RK10, moving the logical attack point to the SBox output or input at round R10.
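The CPA procedure for one key byte can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' tool: the traces are synthetic (four samples of Gaussian noise, with the HW of the SBox output added at sample 2), the secret byte, the seed, and all function names are hypothetical, and the S-box is generated from its standard definition rather than copied from any implementation.

```python
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """AES S-box: field inverse (0 maps to 0) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = 0x63
        for shift in range(5):  # inv ^ rotl(inv, 1) ^ ... ^ rotl(inv, 4)
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s)
    return sbox


def hamming_weight(x):
    return bin(x).count("1")


def pearson(u, v):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / (var_u * var_v) ** 0.5


def cpa_curves(plain_bytes, traces, sbox):
    """Return 256 correlation curves, one list of M values per key guess g."""
    columns = list(zip(*traces))  # columns[c] is the vector V of clock cycle c
    curves = []
    for g in range(256):
        # Prediction vector U: HW of the guessed SBox output for each plaintext.
        u = [hamming_weight(sbox[p ^ g]) for p in plain_bytes]
        curves.append([pearson(u, v) for v in columns])
    return curves


# Synthetic experiment (all values below are illustrative assumptions).
rng = random.Random(1)
SBOX = build_sbox()
SECRET = 0x2B
plain = [rng.randrange(256) for _ in range(256)]
traces = []
for p in plain:
    t = [rng.gauss(0, 0.5) for _ in range(4)]
    t[2] += hamming_weight(SBOX[p ^ SECRET])  # leakage at clock cycle 2
    traces.append(t)

curves = cpa_curves(plain, traces, SBOX)
best_guess = max(range(256), key=lambda g: max(abs(c) for c in curves[g]))
peak_cycle = max(range(4), key=lambda c: abs(curves[best_guess][c]))
```

In this toy setting the curve of the correct guess peaks at the leaking sample, while the other 255 curves stay much lower, which mirrors the success criterion stated above.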

2. The prediction function used to estimate the power contribution before simulating is not to be confused with the logic functions traced during simulation, namely, HW, HD, HDZO, and HDOZ, as described in Section 3.3.

Fig. 2. Basic architecture of the AES coprocessor.

Fig. 3. Sensitiveness status after key expansion.

3.2 Implementations Analyzed
The implementations that we explored are the following:

– AES software version (AES SW). This is a C language implementation of the public AES algorithm optimized for 32-bit architectures [41]. The code was ported and compiled with the gcc compiler version 3.2 for the ARM CPU.
– AES software version with countermeasure (AES SW+). This is the same algorithm as AES SW but incorporates specific software countermeasures against SCA attacks. The basic idea is the use of random masks [12], [14] in order to hide some critical attack points of the AES algorithm, such as KeyExpansion and AddRoundKey, by XOR masking data before these operations.
– AES hardware version (AES HW). This is a hardwired implementation of a coprocessor dedicated to AES execution. The module was described in SystemC and attached to the MPARM local bus (see Fig. 1), where it is seen as an external device by the main CPU. Its architecture is shown in Fig. 2.
– AES hardware version with countermeasure (AES HW+). This is the same architecture as AES HW but incorporates architectural countermeasures against SCA attacks. The countermeasure is essentially based on precharging registers with random values before actual data are written, according to [10].
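The intuition behind both countermeasures can be checked numerically on single bytes. The sketch below is an illustration under assumptions (8-bit values, uniformly distributed masks and precharge values, hypothetical function names) and not the code of AES SW+ or AES HW+: it shows that the average HW of a masked value, and the average HD between a random precharge value and the data written over it, do not depend on the data.

```python
def hamming_weight(x):
    return bin(x).count("1")


def hamming_distance(a, b):
    return hamming_weight(a ^ b)


def mean_hw_masked(value):
    """Average HW of (value XOR mask) over all 256 equally likely masks."""
    return sum(hamming_weight(value ^ mask) for mask in range(256)) / 256


def mean_hd_after_precharge(value):
    """Average HD between a uniformly random precharge byte and the data byte."""
    return sum(hamming_distance(pre, value) for pre in range(256)) / 256


# Both averages are the same for every data byte, so a first-order HW or HD
# prediction computed from the unmasked data has nothing to correlate with.
masked_means = {mean_hw_masked(v) for v in range(256)}
precharge_means = {mean_hd_after_precharge(v) for v in range(256)}
```

This argument covers only the averaged, first-order behavior of one value or one register transition; it says nothing about other registers or about higher-order attacks.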

3.3 Simulation Environment Configuration
For each case study, we simulated the encryption of 256 or 1,024 data blocks, each data block being 128 bits wide. A set of traces was generated, contributed by the internal registers of the ARM7 CPU visible at the assembly level [42] and additional registers simulated by MPARM [33] (AES SW and AES SW+). As for the coprocessor (AES HW and AES HW+), we traced all the registers both in the data path and in the FSM controller (Fig. 2). This conservative choice was made possible by the limited complexity of the simulated architectures. A methodology for activating only the tracing of selected parts of the architecture is presented in Section 4. The traces contained, for each clock cycle, the following information:

– The HW of the current values of traced registers (in the following, HW traces),
– The standard HD between the current and previous values of traced registers (HD traces),
– The modified HD Zero to One (HDZO), that is, the number of “1s” in the current value coming from “0s” in the previous value (HDZO traces), and
– The modified HD One to Zero (HDOZ), that is, the number of “0s” in the current value coming from “1s” in the previous value (HDOZ traces).
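The four traced functions can be written directly from their definitions. The Python sketch below is illustrative only: the function name, the register width, and the assumption that the register holds 0 before the first traced cycle are ours, not taken from the tool.

```python
def popcount(x):
    return bin(x).count("1")


def trace_register(values, width=32, initial=0):
    """Return one (HW, HD, HDZO, HDOZ) tuple per clock cycle for one register.

    values: register contents, one entry per clock cycle.
    initial: ASSUMED register content before the first traced cycle.
    """
    mask = (1 << width) - 1
    rows = []
    prev = initial & mask
    for value in values:
        cur = value & mask
        hw = popcount(cur)                   # ones in the current value
        hd = popcount(prev ^ cur)            # bits that changed
        hdzo = popcount(~prev & cur & mask)  # bits that went 0 -> 1
        hdoz = popcount(prev & ~cur & mask)  # bits that went 1 -> 0
        rows.append((hw, hd, hdzo, hdoz))
        prev = cur
    return rows


# Example on a 4-bit register: 0000 -> 1100 -> 1010.
rows = trace_register([0b1100, 0b1010], width=4)
```

By construction HD = HDZO + HDOZ in every cycle, so the two modified distances split the standard HD into its rising and falling transitions.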

Notably, the simulation time for all the AES implementations was below 10 minutes (for the 256 data blocks case) running on a 1.5-GHz PC under Linux OS.

4 RESULTS ON CRITICAL COMPONENT IDENTIFICATION

In this section, we report the outcome of sensitivity analysis on the specific architectures, showing its capability of identifying critical parts for further attack analysis. The selection of critical components can be used to speed up the subsequent simulation of the attacks in the case of complex SoC architectures. In the reported experiments in the following sections, we conservatively traced all registers in the architecture without assuming the possibility of isolating any specialized target component.

4.1 Advanced Encryption Standard Software Version Sensitive Data Tracing
Figs. 3, 4, and 5 report the status of the sensitive attribute (see Section 2.3) after the key expansion, at the end of the first ciphering, and at the end of the ciphering, respectively. Registers and memory locations marked with an “O” contain sensitive data; each “O” and “-” represents a group of 16 bytes for conciseness. The figures show the diffusion of sensitive data and the presence of spurious sensitive locations in the stack space located at the bottom of the memory space (that is, the last line) at the end of the processing (see Fig. 5), whereas the CPU registers do not contain sensitive data.

4.2 Advanced Encryption Standard Hardware Version Sensitive Data Tracing
Having set the key K as “sensitive data” (represented in the AES coprocessor by four 32-bit registers), Fig. 6 shows a portion of the evolution obtained by dumping the sensitive status at every cycle. Each line represents the full set of internal registers of the coprocessor. In Fig. 6, we note that at the first cycle, the “sensitive status” is confined to the four key registers. After a few cycles, it spreads to some of the other registers and remains confined to these registers until the end of the encryption (about 1,700 cycles, not fully reported in the picture).

Fig. 4. Sensitiveness status after the first ciphering.

Fig. 5. Sensitiveness status at the end of the ciphering.

5 RESULTS OF ATTACKS ON AES SOFTWARE IMPLEMENTATION WITHOUT COUNTERMEASURE

5.1 Side-Channel Analysis

5.1.1 Experiment 1
In this experiment, we encrypted 256 random plaintexts (4,096 random bytes in total) under the same key. We exploited the given plaintexts and the corresponding HW traces to execute a correlation run (see Section 3.1), with the SBox output at round R1 selected as the logical attack point and the leftmost byte of RK0 (i.e., K) being the target. Fig. 7 reports the values taken by one of the correlation curves Cg within a selected range of clock cycles (the whole encryption lasts about 2,500 clock cycles). Clock cycles are on the horizontal axis. The diagram shows the curve corresponding to the actual target key byte (denoted as K0 in the legend), which is the only one presenting a peak outside the reference bounds, empirically set to (-0.3, 0.3). The remaining correlation curves (corresponding to wrong hypotheses on the key byte value) stayed well within these bounds, which apply to the whole trace length. This characteristic allows one to distinguish the curve Cĝ corresponding to the right key byte value from those corresponding to wrong values and to affirm that the attack was successful.

The same description is valid for all the following experiments and figures. The diagrams trace only the correlation curves that exceed the bounds. Generally, 16 curves are traced, identified by different graphical markers, corresponding to the 16 key bytes of the AES encryption algorithm. The curves exceed the bounds only in a limited range of clock cycles (peaks). The position of these peaks is closely related to critical operations of the algorithm, as we will show in Sections 5.2 and 7.2.

Fig. 7. CPA based on SBox output at round R1. Horizontal axis: time (clock cycle). Vertical axis: correlation function value.

5.1.2 Experiment 2
Under the same key used in Experiment 1, we encrypted 256 partially random plaintexts, each formed by repeating one random byte 16 times. We again selected the SBox output at round R1 as the logical attack point and executed a correlation run by exploiting the given plaintexts and the corresponding HW, HD, and HDZO traces. It is easy to see that in this case, specifying the target key byte (one of the bytes of RK0, that is, K) is superfluous, and the 16 possible correlation runs reduce to just one. Note that unless some knowledge of the specific implementation is given (see Section 5.2), this kind of attack cannot be expected to reveal the exact order of the recovered bytes of RK0. The results obtained by exploiting HW, HD, and HDZO traces are shown in Figs. 8, 9, and 10, respectively. The pictures report the significant portion of the set of correlation curves corresponding to the set of distinct bytes of the actual RK0. Based on the knowledge of RK0, we could associate its bytes, denoted as K0, K1, ..., K15 in the picture legends, with the curves of the above set. We also sorted K0, K1, ..., K15 by the occurrence order of the corresponding correlation peaks (notice that for the chosen K, we had K4 = K7). The remaining correlation curves (corresponding to wrong hypotheses on the set of distinct bytes of RK0) stayed well within the reported bounds, which apply to the whole trace length. As a result, an attack would be successful on any physical side channel whose values are correlated with HW, HD, or HDZO.

Fig. 6. Sensitiveness analysis for the AES coprocessor.

Fig. 8. CPA based on SBox output at round R1 (HW traces). Horizontal axis: time (clock cycle). Vertical axis: correlation function value.

Fig. 9. CPA based on SBox output at round R10 (HD traces). Horizontal axis: time (clock cycle). Vertical axis: correlation function value.

5.1.3 Experiment 3
In this case, we reused the HW traces produced by Experiment 2 but tried an attack based on ciphertext knowledge. Our traces were then associated with the ciphertexts resulting from the said experiment (256 partially random plaintexts encrypted under the same key K). Based on guessing the SBox input at round R10, we executed 16 correlation runs, each targeting a specific byte of RK10. The results are partially shown in Fig. 11, which reports, from four distinct correlation runs, the significant portion of the correlation curves corresponding to the four leftmost bytes of the actual RK10 (denoted as K0, K1, K2, and K3 in the legend). Again, the remaining curves, corresponding to wrong hypotheses about the specific target byte, stayed well within the reported bounds, which apply to the whole trace length.

Fig. 11. CPA based on SBox input at round R10. Horizontal axis: time (clock cycle). Vertical axis: correlation function value.

5.2 Matching Leakage Evidence and Software Code
It is possible to compare the CPA simulation results with the assembly code of the cryptographic algorithm, identifying the critical parts. Moreover, we will show that a partial knowledge of the algorithm implementation can lead to reconstructing the correct byte order of the key in Experiments 2 and 3 (in which the byte order could not be recovered by the attack). Experiment 2 is an attack based on the SBox output at round R1. The C code of the first round is reported in Fig. 12. This is an optimized version of the Rijndael algorithm for 32-bit CPUs:

– rk[i] is an array of 32-bit integers containing the round keys. Thus, rk[0]-rk[3] is the 128-bit key, rk[4]-rk[7] is the round key of the first round, etc.
– Te0, Te1, Te2, and Te3 are precomputed tables used to perform the operations of the round in one step.
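To see which key byte feeds each table lookup in a round of this form, one can enumerate the lookups in program order. The sketch below assumes the usual T-table layout of 32-bit optimized Rijndael code, in which word tj combines the most significant byte of sj with successively lower bytes of the following state words, and in which byte b of sj depends on key byte 4j + b because s0-s3 are the plaintext words XORed with rk[0]-rk[3]. The function name is hypothetical, and the indexing pattern is an assumption, not a transcription of Fig. 12.

```python
def first_round_key_byte_order():
    """List key-byte indices in the order the round-1 table lookups use them.

    ASSUMED structure (standard T-table round):
        t[j] = Te0[s[j] >> 24] ^ Te1[(s[j+1] >> 16) & 0xFF]
             ^ Te2[(s[j+2] >> 8) & 0xFF] ^ Te3[s[j+3] & 0xFF] ^ rk[4 + j]
    with word indices taken modulo 4.
    """
    order = []
    for j in range(4):              # lines computing t0, t1, t2, t3
        for table in range(4):      # lookups in Te0, Te1, Te2, Te3
            word = (j + table) % 4  # which of s0..s3 supplies the index
            byte_in_word = table    # 0 = most significant byte (>> 24)
            order.append(4 * word + byte_in_word)
    return order


order = first_round_key_byte_order()
labels = ", ".join(f"K{i}" for i in order)
```

Under these assumptions the first four entries are K0, K5, K10, K15, and each key byte appears exactly once among the 16 lookups.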

It is possible to verify that t0 depends on key bytes K0, K5, K10, and K15. This dependency is introduced, respectively, by the expressions s0 >> 24 (the most significant byte of s0), (s1 >> 16) & 0xFF, (s2 >> 8) & 0xFF, and s3 & 0xFF. A similar analysis can be repeated for the lines computing t1, t2, and t3, concluding that the order of the key bytes in the first round is

K0, K5, K10, K15, K4, K9, K14, K3, K8, K13, K2, K7, K12, K1, K6, K11.

The plaintext used to perform the attack is composed of one random byte repeated 16 times so that a single analysis reveals 16 curves containing correlation peaks (Fig. 8). The peaks are divided into four subgroups so that each subgroup has a length (distance between the first and the last peak) of about 32 cycles. The division into subgroups corresponds to the division into the four lines of C code that implement round R1. In fact, Fig. 13 reports the C line computing t0 during round R1 and the corresponding assembly code produced by the ARM cross compiler (remaining lines). Comparing the position of the correlation peaks (cycles) with the trace of instructions executed at each cycle (generated by MPARM), we find that the correlation peaks are produced by the four lines marked with a dash in Fig. 13. The four lines correspond to the load instructions that access tables Te0-Te3 by using portions of s0-s3 as indices. The correlation peak starts during the second cycle of the ldr instruction (a load instruction lasts three cycles on the ARM7 CPU), exactly when the data from the tables are read and brought inside the CPU, and vanishes when these data are overwritten by a subsequent instruction (for the first peak, the instruction at address 94bc, which overwrites register r2). The above analysis also explains why the first two peaks in the group almost overlap: the first two load instructions are consecutive. The same analysis is possible for the remaining three subgroups of correlation peaks, since the C source code for t1, t2, and t3 and the corresponding assembly code are very similar to those of t0.

In Experiment 3, the main correlation peaks (Fig. 11) for the four bytes are located between cycles 2,100 and 2,300. After this principal peak group, three shorter secondary peak groups are present. The byte order changes from one peak group to the next. Table 1 summarizes the above observations. We can note how the order of the bytes in the principal and secondary peaks changes: K0, K1, K2, and K3 for the main peak group and K1, K2, K3, and K0; K2, K3, K0, and K1; and K3, K0, K1, and K2, respectively, for the others.

This behavior can be explained by observing the implementation of round R10, which is reported in Fig. 14. In the figure, we can see that t0-t3, which are the inputs of round R10, that is, the target of the attack, are used in the exact order noticed above when s0-s3 (corresponding to the ciphertext) are computed. For example, in order to produce s0, we use t0, t1, t2, and t3 in this order, whereas the order changes for s1, s2, and s3. As in the former case, the peak cycles correspond to the moment at which the target variable is loaded inside the CPU.

Fig. 10. CPA based on SBox output at round R1 (HDZO traces). Horizontal axis: time (clock cycle). Vertical axis: correlation function value.

Fig. 12. C source code of round R1.

Fig. 13. ARM assembly code of t0 computing.

TABLE 1 Peak Cycles in Experiment 3

5.3 Concluding Remarks on AES Software Experiments
As expected given the complete absence of countermeasures, this implementation proved vulnerable to all of our attacks. We tried different attack points and different information leakage models, which all revealed the secret key. The experiments clearly show how a system without countermeasures is potentially open to a wide variety of attacks (known plaintext, known ciphertext, etc.) and different attack points, even without knowledge of the implementation.

6 RESULTS OF ATTACKS ON AES SOFTWARE IMPLEMENTATION WITH COUNTERMEASURE

6.1 Side-Channel Analysis Having introduced a countermeasure based on data randomization, this implementation was expected to be secure, at least, to our attacks (first-order CPA analysis). We performed a series of experiments, always obtaining the same results. Here, we report one of the experiments. 6.1.1 Experiment 1 We used the same key and the same 256 partially random plaintexts as in AES SW Experiment 2 (Section 5.1). As for the random values needed by AES SW+ to enforce data randomization, we used a new value for each encryption. The random values were assumed not to be available for the following attack.

Fig. 14. C source code for round R10.




We executed a correlation run based on guessing the SBox output at round R1 and targeting the set of the distinct bytes of RK0 (see Experiment 2 in Section 5.1). As expected, the attack was not successful, as none of the correlation curves had significant peaks: the 256 correlation curves produced from HW and HD traces remained confined in the ranges (-0.30, 0.30) and (-0.32, 0.32), respectively.

7 RESULTS OF ATTACKS ON AES HARDWARE IMPLEMENTATION WITHOUT COUNTERMEASURE

7.1 Side-Channel Analysis

7.1.1 Experiment 1

For this experiment, we used the same 256 random plaintexts and the same key K as in Experiment 1 on AES SW (Section 5.1) and collected the resulting HW and HD traces. Both traces were exploited for a correlation run targeting the leftmost byte of RK0 based on guessing the SBox output at round R1. The attack based on HW traces was unsuccessful, as all of the 256 correlation curves stayed well within the range (-0.30, 0.30) for the whole trace length. A first step toward understanding this result can be found in Experiment 2 below. When using HD traces, we obtained the successful results shown in Fig. 15, which reports the significant portion of the correlation curve corresponding to the actual target key byte (denoted as K0 in the legend). The remaining correlation curves (corresponding to wrong hypotheses about the target key byte) all remained within the reported bounds.

7.1.2 Experiment 2

For this experiment, we used the same 256 random plaintexts and the same key K as in Experiment 2 on AES SW (Section 5.1) and collected the resulting HW and HD traces. Then, we again tried a correlation run targeting the set of distinct bytes of RK0 based on guessing the SBox output at round R1. The correlation analysis based on HW traces was partially successful, since it allowed us to recover only four elements in the set of distinct bytes of RK0. Fig. 16 shows the significant portion of the four relevant correlation curves, along with the bounds limiting the remaining curves. Knowing RK0, we could associate four of its bytes (denoted as K3, K7, K11, and K15 in the picture legend) with the four selected curves. Observe that the legend shows K3, K7, K11, and K15 sorted by the occurrence order of the corresponding correlation peaks and that, as in Experiment 1, K0 is not one of the recoverable key bytes.

When using HD traces, we got the successful results shown in Fig. 17, which reports the significant portion of the set of correlation curves corresponding to the set of the distinct bytes of the actual RK0. We exploited the knowledge of RK0 to associate its bytes, denoted as K0, K1, ..., K15, with the curves of the above set and to sort K0, K1, ..., K15 by the occurrence order of the corresponding correlation peaks (recall that K4 = K7). The remaining correlation curves (corresponding to wrong hypotheses about the set of the distinct bytes of the actual RK0) remained within the reported bounds for the whole trace length.

Fig. 16. CPA based on SBox output at round R1 (HW traces). Horizontal axis: time (clock cycle). Vertical axis: correlation function value.

7.2 Matching Leakage Evidence and Architecture Model

Here, we propose an explanation for the results on the AES HW implementation. We focus on the results of the attacks based on HW traces (Fig. 16). During the attack, based on guesses on the key bytes and exploiting the known plaintexts, we produce guesses on the output bytes from SubBytes at round R1. In addition, recall that our prediction vectors contain the HW of these last guesses.




Since the relevant correlation peaks appeared during the execution of MixColumns at round R1, we give a brief description of its implementation in the AES HW coprocessor. According to the AES encryption specifications, if $b_0, b_1, \ldots, b_{15}$ are the 16 bytes after SubBytes, then the four columns to be processed by MixColumns are $C_0 = (b_0, b_5, b_{10}, b_{15})$, $C_1 = (b_4, b_9, b_{14}, b_3)$, $C_2 = (b_8, b_{13}, b_2, b_7)$, and $C_3 = (b_{12}, b_1, b_6, b_{11})$. In the given coprocessor, the output bytes from SubBytes are generated in the order $b_{11}$, $b_6$, $b_1$, $b_{12}$, $b_7$, $b_2$, $b_{13}$, $b_8$, $b_3$, $b_{14}$, $b_9$, $b_4$, $b_{15}$, $b_{10}$, $b_5$, and $b_0$, and the four columns are processed in the order $C_3$, $C_2$, $C_1$, and $C_0$. Each column is transformed in four steps. At each step, a new byte coming from the combinatorial SBox is suitably transformed by a combinatorial network into a 32-bit word, which is then XORed with the status of a 32-bit register ACC. The result of this XOR is accumulated into ACC, where, after four steps, the new column value is available. Before mixing any column, ACC is reset to zero.

We show here how the mixing of $C_3$ works. With $\oplus$ denoting a bitwise XOR, the temporal sequence of the values stored into ACC when computing the new value of $C_3$ is

$$X = (b_{11}, b_{11}, 3b_{11}, 2b_{11}) \oplus 0,$$
$$Y = (b_6, 3b_6, 2b_6, b_6) \oplus X,$$
$$W = (3b_1, 2b_1, b_1, b_1) \oplus Y,$$
$$Z = (2b_{12}, b_{12}, b_{12}, 3b_{12}) \oplus W,$$

where the elements of the form $2b$ or $3b$ are products in GF($2^8$).

Now, we can show how the leakage observed in the experiments to be explained comes from register ACC. First, observe that, cycle by cycle, the HW $h(\cdot)$ of the ACC contents contributes as an addend to the relevant HW traces. Then, focus on the four clock cycles $t_1$, $t_2$, $t_3$, and $t_4$ when $X$, $Y$, $W$, and $Z$ are respectively stored into ACC. We have the following:

- On $t_1$, $h(X)$ has to contain $h(b_{11})$ as an addend.
- On $t_2$, $t_3$, and $t_4$, $h(Y)$, $h(W)$, and $h(Z)$ cannot be expected to contain $h(b_6)$, $h(b_1)$, and $h(b_{12})$ as the respective addends.
This should explain why a correlation peak appeared on $t_1$ when using the right key guess for K11 to guess the value of $b_{11}$. It should also explain why correlation peaks did not appear on $t_2$, $t_3$, and $t_4$ when using the right key guesses for K6, K1, and K12 to respectively guess the values of $b_6$, $b_1$, and $b_{12}$. The proposed approach can be easily extended to the other columns to explain the origin of the correlation peaks, revealing K7, K3, and K15 in the given order.
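The accumulation of column C3 into ACC described above can be reproduced with a small sketch. It is only an illustration under stated assumptions: the 32-bit register is modeled as a tuple of four bytes, the byte values are arbitrary example inputs, and the names `mix_column_steps`, `STEP_COEFFS`, and `acc_hw` are hypothetical rather than taken from the coprocessor description.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def hw(value):
    """Hamming weight of an integer."""
    return bin(value).count("1")

# Coefficients applied to the incoming byte at each of the four steps,
# in arrival order b11, b6, b1, b12 (taken from the sequence X, Y, W, Z).
STEP_COEFFS = [(1, 1, 3, 2), (1, 3, 2, 1), (3, 2, 1, 1), (2, 1, 1, 3)]

def mix_column_steps(incoming):
    """Return the successive ACC contents X, Y, W, Z as 4-byte tuples."""
    acc = (0, 0, 0, 0)          # ACC is reset before mixing a column
    history = []
    for byte, coeffs in zip(incoming, STEP_COEFFS):
        word = tuple(gf_mul(c, byte) for c in coeffs)
        acc = tuple(a ^ w for a, w in zip(acc, word))
        history.append(acc)
    return history

def acc_hw(acc):
    """Hamming weight of the whole modeled register."""
    return sum(hw(b) for b in acc)

# Example inputs (assumed values): C3 = (b12, b1, b6, b11).
b12, b1, b6, b11 = 0xDB, 0x13, 0x53, 0x45
X, Y, W, Z = mix_column_steps([b11, b6, b1, b12])

# On t1 the register weight is a sum of terms that depend on b11 only,
# and h(b11) itself appears twice among them.
print(acc_hw(X), 2 * hw(b11) + hw(gf_mul(2, b11)) + hw(gf_mul(3, b11)))
# On later cycles the XOR with the previous contents mixes several bytes.
print(acc_hw(Y), acc_hw(W), acc_hw(Z))
print([hex(v) for v in Z])
```

In this model the final value Z coincides with the standard MixColumns output for the column, while only X has a Hamming weight that decomposes into terms depending on a single SBox output byte.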

7.3 Concluding Remarks on AES Hardware

Like AES SW, due to the complete absence of countermeasures, the AES HW implementation was expected to be vulnerable to all of our attacks. For comparison, we replicated the same conditions (attack point, plaintexts, and key values) already used for the AES SW version, obtaining a successful attack. An important outcome is that two such completely different implementations as AES SW and AES HW turned out to be vulnerable to the very same CPA setup.


Fig. 18. AES HW+ CPA based on SBox output at round R1 (HD traces). Horizontal axis: time (clock cycle). Vertical axis: correlation function value.

8 RESULTS OF ATTACKS ON AES HARDWARE IMPLEMENTATION WITH COUNTERMEASURE

8.1 Side-Channel Analysis

8.1.1 Experiment 1

We used the same key and the same 256 partially random plaintexts as in AES HW Experiment 2 (Section 7.1). For the random values needed by AES HW+ to enforce random precharging, we used a new value for each encryption. These random values were assumed not to be available. We executed a correlation run exploiting HW and HD traces and targeting the set of the distinct bytes of RK0 based on guessing the SBox output at round R1. Because of the random precharging, this attack was expected to be unsuccessful, at least when using traces dependent on the coprocessor state transitions (such as the HD traces). However, the results obtained when using HW traces were practically identical to the ones reported in Experiment 2 in Section 7.1. That is, we could recover the four key bytes shown in Fig. 16 in the same cycle order. For the sake of conciseness, we do not report the result of this attack.

The attack based on HD traces was only partially successful in that it allowed us to recover four elements in the set of the distinct bytes of RK0. Fig. 18 shows the significant portion of the four relevant correlation curves, along with the bounds limiting the remaining curves for the whole trace length. Knowing RK0, we could associate four of its bytes (denoted as K0, K4, K8, and K12 in the picture legend) with the four selected curves. Observe that the picture legend shows K0, K4, K8, and K12 sorted by the occurrence order of the corresponding correlation peaks.

8.2 Matching Leakage Evidence and Architecture Model

It is possible to explain the limited effect of the countermeasure in AES HW+, referring to HW-trace-based attacks. First, the explanation of the success of the attack reported in Section 7.2 is valid also for AES HW+. In addition, we note that for AES HW+, the relevant correlation peaks appeared during the execution of MixColumns at round R1. The relevant variation when moving from AES HW to AES HW+ is that because of the implemented


random precharging, most of the coprocessor registers are forced to alternate dummy and proper values. As for register ACC, this is done by coupling ACC with a second 32-bit register ACC’. Before mixing any column, ACC and ACC’ contain a null and a random value, respectively. Normally, when ACC is updated, ACC’ is also updated with the old value of ACC. The proper values to be stored into ACC are always computed after computing dummy values, which are stored into ACC as well. This applies to the proper values X, Y, W, and Z, which finally explains, for the attack based on HW traces, the similarities between the experimental results for AES HW and AES HW+.

8.3 Concluding Remarks on AES Hardware with Countermeasure

Having introduced a countermeasure against first-order CPA based on state transitions, this implementation was expected to be secure against attacks based on HD traces. Surprisingly, the experiment based on HD traces demonstrated a partial information leakage (Fig. 18), which had never been observed before. After an in-depth analysis of the design, we found that this phenomenon was due to an imperfect implementation of the countermeasure, which could be revealed only with a large number of simulations. Interestingly, our approach allowed us to find such an imperfection, which would not be practically feasible with slower low-level simulations.

9 EVALUATION OF THE IMPACT OF COUNTERMEASURES ON ENCRYPTION SPEED

As a further characterization step, we report the performance of the four AES implementations considered in our experiments. Since our simulation environment is cycle accurate, we can obtain measures of the encryption time of the AES implementations under test. Fig. 19 reports the encryption speed, expressed as the number of clock cycles needed to complete the encryption of a 16-byte block. The insertion of SCA countermeasures in the algorithm implementations has a huge impact on the encryption speed (although we must say that the present AES SW+ implementation was not optimized for speed). As for the AES HW+ implementation, the architectural countermeasures have a performance impact of about 25 percent (but we can expect an area overhead in the ASIC synthesis).

Fig. 19. Encoding time (clock cycles).

10 COMPARISON RESULTS WITH CIRCUIT-LEVEL ANALYSIS

High-level simulations are faster than lower level approaches, trading accuracy for speed. In the previous sections, we showed that the accuracy of our simulation environment is sufficient to perform detailed attack simulations in the early phases of architecture design. From the speed point of view, it is possible to compare our work and results with other lower level approaches (that is, circuit simulations or physical measurements) referring to the same encryption algorithm (AES) and hardware architecture (AES coprocessor).

In [10], an architectural countermeasure against DPA attacks is presented and evaluated with transistor-level simulations. The coprocessors used in that work are the same as the “AES HW” and “AES HW+” cases presented here. Transistor-level simulations of the whole architectures were used to build a database of simulated current consumption and perform a DPA attack. The simulation time reported in [10] was 60 hours to collect a database of 256 current waveforms, from which the identification of one byte of the encryption key was possible: for the correct key hypothesis, the DPA curves showed a marked peak for the version without countermeasures and a much less evident peak for the version with countermeasures. Actually, given the particular structure of the attack (one plaintext byte varying from 0 to 255 while the others are kept fixed in order to minimize the algorithmic noise), the complete identification of the 128-bit key requires the repetition of the simulation for the remaining bytes of the key, leading to a 960-hour total simulation time. Using our simulation environment, we were able to successfully repeat the same attacks (collection of 256 waveforms relative to the complete encryption of 256 different plaintexts) in a simulation time of about 10 minutes.
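The figures quoted above can be turned into rough speedup ratios with a few lines of arithmetic. The sketch below only restates numbers given in the text. Treating the 10-minute run as directly comparable to either one 60-hour database or the full 960-hour campaign is an assumption made here for illustration, and the variable names are hypothetical.

```python
# Figures quoted in the text.
TRANSISTOR_HOURS_PER_DATABASE = 60   # 256 waveforms, one key byte, from [10]
KEY_BYTES = 16                       # 128-bit AES key
HIGH_LEVEL_MINUTES = 10              # 256 waveforms in the high-level tool

# Repeating the transistor-level run once per key byte.
transistor_total_hours = TRANSISTOR_HOURS_PER_DATABASE * KEY_BYTES

# Assumption: compare the 10-minute run with one 256-waveform database.
speedup_per_database = TRANSISTOR_HOURS_PER_DATABASE * 60 // HIGH_LEVEL_MINUTES

# Assumption: compare the 10-minute run with the whole-key campaign.
speedup_full_key = transistor_total_hours * 60 // HIGH_LEVEL_MINUTES

print(transistor_total_hours, speedup_per_database, speedup_full_key)
```

Depending on which comparison is considered fair, the ratio is between two and four orders of magnitude.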
Although the proposed high-level approach is less accurate in determining the amplitude of the peak value of the CPA curves (since all the monitored switching nodes have equal contribution), we were able to exactly identify which architectures were vulnerable to CPA attacks, as confirmed by the circuit-level analysis reported in [10]. Referring to physically measured DPA attacks, the work reported in [27] presents a case study on an ASIC AES implementation, detailing the practical aspects of the analysis. In addition to the complex setup (test system with very high speed oscilloscopes), it states that the attack based on measurements required 160 times more samples than an RTL-simulated one because of the presence of noise in the real measurements. Although the implementation is not the same as ours, the comparison with [27] highlights the huge amount of time required by an attack based on real measurements with respect to our approach on a different but similar ASIC architecture implementing the same AES algorithm. (Interestingly, there is a strong correlation between some of our attacks and the one described in [27], having the same target, that is, one byte of the key after the initial key addition operation.)

11 CONCLUSIONS AND FUTURE WORK

We defined and realized a system-level architecture exploration environment, which determines the effectiveness of SCA countermeasures in security-critical digital SoC devices. We applied this methodology to a real-world case study by analyzing several implementations of the AES algorithm with and without countermeasures against SCA-based attacks. The approach allowed us to efficiently take into account several implementations and, for each of them, to perform a very wide range of correlation analyses, which would be impractical with lower level simulations. Executing a wide range of analyses proved particularly important, since it revealed unexpected or not otherwise visible points of attack, which could remain hidden, especially to a software programmer or to the designer of a complex hardware implementation. The cycle-accurate matching between information leakage evidence and assembly code (for the software implementations) or RTL description (for the hardware implementations) turns out to be particularly important to understand the actual source of information leakage.

From the particular experiments conducted on the AES implementations, we demonstrated that both software and hardware implementations without countermeasures were completely attackable by CPA by using typical attack points in the AES algorithm (that is, without considering particular hypotheses on the implementation). The masking countermeasures implemented in the AES SW+ version proved completely effective, although we have to limit the validity of this statement to the simulated type of attacks (first-order CPA). The AES HW+ implementation analysis seemed to still reveal information on a portion of the key. An accurate identification of the source of information leakage in the RTL code is still in progress.

As our future work, we mention the possibility of refining the model with information coming from postcircuit or postlayout design phases (backannotation). For example, it could be possible to take into account different weights based on the concept that a transition on a very capacitively loaded line (that is, a bus) has a stronger impact on consumption than that of an internal signal. Another important feature would be the support for active information leakage simulation, which brings us to other classes of attacks based not only on the passive recording of physical quantities but also on the introduction of forced deviations from the correct computing flow (fault analysis).

APPENDIX

BACKGROUND ON THE ADVANCED ENCRYPTION STANDARD ALGORITHM

AES encryption with 128-bit keys (see Fig. 20) is based on 11 rounds R0, R1, ..., R10, each transforming a 128-bit input into a 128-bit output under the control of a 128-bit round key.

Fig. 20. AES block diagram.

The input at R0 is the plaintext P. The output from round R10 is the ciphertext C. Each round from R1 to R10 takes as input the output from the previous round. The 11 round keys RK0, RK1, ..., RK10 are generated from the external key K by a specific algorithm (KeyExpansion). Observe that RK0 = K and that any round key can be determined from any other round key. R0 produces its output by XORing the 16 bytes of P with the corresponding 16 bytes of RK0 (AddRoundKey). The rounds from R1 to R9 are given as follows: Each of the 16 input bytes is mapped into a new value by an invertible function called SBox (SubBytes), and the new 16 bytes are permuted according to a fixed rule (ShiftRows). Then, the 16 bytes are divided into four columns of four bytes, and for each byte of any column, a new value is computed by a specific linear combination, over GF($2^8$), of the four bytes in the column (MixColumns). These 16 bytes are finally XORed with the corresponding 16 bytes of the specific round key (AddRoundKey). R10 differs in that MixColumns is omitted.
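The round structure summarized above can be written down compactly. The following is an unoptimized reference sketch of AES-128 encryption, not either of the implementations analyzed in this paper. The function names are hypothetical, and the SBox is computed from its algebraic definition rather than stored as a table.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def rotl8(x, shift):
    return ((x << shift) | (x >> (8 - shift))) & 0xFF

def sbox_entry(x):
    # Multiplicative inverse (0 maps to 0) followed by the affine map.
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

SBOX = [sbox_entry(x) for x in range(256)]

def key_expansion(key):
    """KeyExpansion: derive RK0..RK10 (16 bytes each) from a 16-byte key."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = list(words[i - 1])
        if i % 4 == 0:
            t = t[1:] + t[:1]                 # rotate the word by one byte
            t = [SBOX[b] for b in t]          # substitute each byte
            t[0] ^= rcon                      # add the round constant
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - 4], t)])
    return [sum(words[4 * r:4 * r + 4], []) for r in range(11)]

def shift_rows(s):
    # State is column-major: byte i sits in row i % 4, column i // 4.
    return [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

def mix_columns(s):
    out = []
    for c in range(4):
        a0, a1, a2, a3 = s[4 * c:4 * c + 4]
        out += [
            gf_mul(2, a0) ^ gf_mul(3, a1) ^ a2 ^ a3,
            a0 ^ gf_mul(2, a1) ^ gf_mul(3, a2) ^ a3,
            a0 ^ a1 ^ gf_mul(2, a2) ^ gf_mul(3, a3),
            gf_mul(3, a0) ^ a1 ^ a2 ^ gf_mul(2, a3),
        ]
    return out

def encrypt_block(plaintext, key):
    """Encrypt one 16-byte block with a 16-byte key (rounds R0..R10)."""
    round_keys = key_expansion(key)
    state = [p ^ k for p, k in zip(plaintext, round_keys[0])]    # R0
    for rnd in range(1, 11):
        state = [SBOX[b] for b in state]                         # SubBytes
        state = shift_rows(state)                                # ShiftRows
        if rnd != 10:
            state = mix_columns(state)                           # MixColumns
        state = [s ^ k for s, k in zip(state, round_keys[rnd])]  # AddRoundKey
    return bytes(state)

plaintext = bytes.fromhex("00112233445566778899aabbccddeeff")
key = bytes(range(16))
print(encrypt_block(plaintext, key).hex())
```

The plaintext and key used in the example are those of the AES-128 example vector in FIPS-197, Appendix C.1, whose listed ciphertext is 69c4e0d86a7b0430d8cdb78070b4c55a. The output of `shift_rows` on the bytes after SubBytes also reproduces the column grouping C0 to C3 used in Section 7.2.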

ACKNOWLEDGMENTS

This work has been developed as part of the European Union-funded Project IST-2002-507270 SCARD.

REFERENCES

[1] P.C. Kocher, “Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems,” Lecture Notes in Computer Science, vol. 1109, pp. 104-113, 1996.
[2] P. Kocher, J. Jaffe, and B. Jun, “Differential Power Analysis,” Lecture Notes in Computer Science, vol. 1666, pp. 388-397, 1999.
[3] E. Biham and A. Shamir, “Differential Fault Analysis of Secret Key Cryptosystems,” Lecture Notes in Computer Science, vol. 1294, pp. 513-525, 1997.
[4] K. Tiri, M. Akmal, and I. Verbauwhede, “A Dynamic and Differential CMOS Logic with Signal Independent Power Consumption to Withstand Differential Power Analysis on Smartcards,” Proc. 28th European Solid-State Circuits Conf., pp. 403-406, 2002.
[5] K. Tiri and I. Verbauwhede, “Charge Recycling Sense Amplifier Based Logic: Securing Low-Power Security IC’s against Differential Power Analysis,” Cryptology ePrint Archive, Report 2004/067, 2004.
[6] T. Popp and S. Mangard, “Masked Dual-Rail Pre-Charge Logic: DPA-Resistance without Routing Constraints,” Proc. Seventh Int’l Workshop Cryptographic Hardware and Embedded Systems, pp. 172-186, 2005.
[7] G.B. Ratanpal, R.D. Williams, and T.N. Blalock, “An On-Chip Signal Suppression Countermeasure to Power Analysis Attacks,” IEEE Trans. Dependable and Secure Computing, vol. 1, no. 3, pp. 179-189, July-September 2004.
[8] T. Popp and S. Mangard, “Masked Dual-Rail Pre-Charge Logic: DPA-Resistance without Routing Constraints,” Proc. Seventh Int’l Workshop Cryptographic Hardware and Embedded Systems (CHES ’05), pp. 172-186, 2005.
[9] W. Fischer and B.M. Gammel, “Masking at Gate Level in the Presence of Glitches,” Proc. Seventh Int’l Workshop Cryptographic Hardware and Embedded Systems (CHES ’05), pp. 187-200, 2005.
[10] M. Bucci, M. Guglielmo, R. Luzzi, and A. Trifiletti, “A Power Consumption Randomization Countermeasure for DPA-Resistant Cryptographic Processors,” Proc. 14th Int’l Workshop Power and Timing Modeling, Optimization and Simulation, pp. 481-490, 2004.
[11] A. Shamir, “Protecting Smart Cards from Passive Power Analysis with Detached Power Supplies,” Proc. Second Int’l Workshop Cryptographic Hardware and Embedded Systems, pp. 71-77, 2000.
[12] J.D. Golic and C. Tymen, “Multiplicative Masking and Power Analysis of AES,” Proc. Fourth Int’l Workshop Cryptographic Hardware and Embedded Systems (CHES ’02), pp. 198-212, 2002.
[13] E. Trichina and L. Korkishko, “Secure and Efficient AES Software Implementation for Smart Cards,” Proc. Fifth Int’l Workshop Information Security Applications (WISA ’04), pp. 425-439, 2004.
[14] E. Oswald and K. Schramm, “An Efficient Masking Scheme for AES Software Implementations,” Proc. Sixth Int’l Workshop Information Security Applications (WISA ’05), pp. 292-305, 2006.
[15] D. Marculescu, R. Marculescu, and M. Pedram, “Information Theoretic Measures for Power Analysis,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 599-610, 1996.
[16] E. Macii, M. Pedram, and F. Somenzi, “High-Level Power Modeling, Estimation, and Optimization,” Proc. 34th Design Automation Conf. (DAC ’97), pp. 504-511, 1997.
[17] M. Nemani and F.N. Najm, “Towards a High-Level Power Estimation Capability [Digital ICS],” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 588-598, 1996.
[18] R.J. Evans and P. Franzon, “Energy Consumption Modeling and Optimization for SRAMS,” IEEE J. Solid-State Circuits, vol. 30, no. 5, pp. 571-579, 1995.
[19] E. Macii, O.G. Koufopavlou, and V. Paliouras, Proc. 14th Int’l Workshop Integrated Circuit and System Design, Power and Timing Modeling, Optimization and Simulation (PATMOS), 2004.
[20] V. Tiwari, S. Malik, and A. Wolfe, “Power Analysis of Embedded Software: A First Step towards Software Power Minimization,” IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 437-445, 1994.
[21] C. Brandolese, F. Salice, W. Fornaciari, and D. Sciuto, “Static Power Modeling of 32-bit Microprocessors,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 21, no. 11, pp. 1306-1316, 2002.
[22] M. Renaudin, F. Bouesse, P. Proust, J.P. Tual, L. Sourgen, and F. Germain, “High Security Smartcards,” Proc. Conf. Design, Automation and Test in Europe (DATE ’04), pp. 228-233, 2004.
[23] K. Tiri and I. Verbauwhede, “A VLSI Design Flow for Secure Side-Channel Attack Resistant ICs,” Proc. Conf. Design, Automation and Test in Europe (DATE ’05), pp. 58-63, 2005.
[24] J.J.A. Fournier, S.W. Moore, H. Li, R.D. Mullins, and G.S. Taylor, “Security Evaluation of Asynchronous Circuits,” Proc. Fifth Int’l Workshop Cryptographic Hardware and Embedded Systems (CHES ’03), pp. 137-151, 2003.
[25] H. Li, A.T. Markettos, and S.W. Moore, “Security Evaluation against Electromagnetic Analysis at Design Time,” Proc. Seventh Int’l Workshop Cryptographic Hardware and Embedded Systems (CHES ’05), pp. 280-292, 2005.
[26] L. Benini, A. Macii, E. Macii, E. Omerbegovic, F. Pro, and M. Poncino, “Energy-Aware Design Techniques for Differential Power Analysis Protection,” Proc. 40th Design Automation Conf. (DAC ’03), pp. 36-41, 2003.
[27] S.B. Örs, F.K. Gürkaynak, E. Oswald, and B. Preneel, “Power-Analysis Attack on an ASIC AES Implementation,” Proc. IEEE Int’l Conf. Information Technology: Coding and Computing (ITCC ’04), pp. 546-552, 2004.
[28] J. den Hartog, J. Verschuren, E.P. de Vink, J. de Vos, and W. Wiersma, “PINPAS: A Tool for Power Analysis of Smartcards,” Proc. SEC ’03, pp. 453-457, 2003.
[29] J.I. den Hartog and E.P. de Vink, “Virtual Analysis and Reduction of Side-Channel Vulnerabilities of Smartcards,” Proc. Second Int’l Workshop Formal Aspect of Security and Trust (FAST ’04), pp. 85-98, Aug. 2004.
[30] S. Yang, W. Wolf, N. Vijaykrishnan, D.N. Serpanos, and Y. Xie, “Power Attack Resistant Cryptosystem Design: A Dynamic Voltage and Frequency Switching Approach,” Proc. Design, Automation and Test in Europe Conf. (DATE ’05), pp. 64-69, 2005.
[31] N. Vijaykrishnan, M.T. Kandemir, M.J. Irwin, H.S. Kim, and W. Ye, “Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower,” Proc. 27th Ann. Int’l Symp. Computer Architecture (ISCA ’00), pp. 95-106, 2000.
[32] SystemC Language Reference Manual Version 2.0, http://www.systemc.org, 2007.
[33] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri, “MPARM: Exploring the Multi-Processor SoC Design Space with SystemC,” J. VLSI Signal Processing, vol. 41, no. 2, pp. 169-182, 2005.
[34] V. Tiwari, S. Malik, A. Wolfe, and M. Lee, “Instruction Level Power Analysis and Optimization of Software,” J. VLSI Signal Processing, pp. 1-18, 1996.
[35] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto, “Energy Estimation for 32-Bit Microprocessors,” Proc. Eighth Int’l Workshop Hardware/Software Codesign (CODES ’00), pp. 24-28, 2000.
[36] T. Simunic, L. Benini, and G.D. Micheli, “Cycle-Accurate Simulation of Energy Consumption in Embedded Systems,” Proc. 36th Design Automation Conf. (DAC ’99), pp. 867-872, 1999.
[37] W. Ye, N. Vijaykrishnan, M.T. Kandemir, and M.J. Irwin, “The Design and Use of Simplepower: A Cycle-Accurate Energy Estimation Tool,” Proc. 37th Design Automation Conf. (DAC ’00), pp. 340-345, 2000.
[38] G.A.D. Sarta and D. Trifone, “A Data Dependent Approach to Instruction Level Power Estimation,” Proc. IEEE Alessandro Volta Memorial Workshop Low-Power Design, pp. 182-190, 1999.
[39] Advanced Encryption Standard (AES), FIPS, Nov. 2001.
[40] E. Brier, C. Clavier, and F. Olivier, “Correlation Power Analysis with a Leakage Model,” Proc. Sixth Int’l Workshop Cryptographic Hardware and Embedded Systems (CHES ’04), pp. 16-29, 2004.
[41] Rijndael Algorithm, http://efgh.com/software/rijndael.htm, 2007.
[42] ARM7TDMI Datasheet. ARM, 1995.

Francesco Menichelli received the bachelor’s (summa cum laude) and PhD degrees in electronic engineering from the University of Rome “La Sapienza” in 2001 and 2005, respectively. He is currently a research assistant in the Department of Electronic Engineering, University of Rome “La Sapienza”. His research interest is low-power digital design, in particular system-level techniques for low-power consumption, power modeling, and simulation of digital systems.

Renato Menicocci received the bachelor’s degree in electronic engineering from the University of Rome “La Sapienza” in 1991. From 1992 to 2000, he was with the Cryptology Group, Fondazione Ugo Bordoni (FUB). From 2000 to 2002, he was with the Crypto Design R&D Centre, Gemplus. Since 2003, he has been cooperating with the Department of Electronic Engineering, University of Rome La Sapienza, where he was under a research fellowship connected with the EU Project SCARD from 2004 to 2006, and the Security Group, FUB. He is currently serving the Italian National Body for the Common Criteria Security Certification.



Mauro Olivieri received the master’s degree (cum laude) in electronic engineering and the PhD degree in electronic and computer engineering from the University of Genoa, Genoa, Italy, in 1991 and 1994, respectively. He was with the University of Genoa as an assistant professor. In 1998, he joined the University of Rome “La Sapienza,” where he is currently an associate professor of digital electronics and VLSI system architectures in the Department of Electronic Engineering. His research interests include digital systems on chips and microprocessor core design. He is the author of more than 80 research papers and was a reviewer of several IEEE Transactions. He supervises research projects supported by private and public funding in the field of VLSI system design. He is a member of the IEEE.


Alessandro Trifiletti received the bachelor’s degree in electronic engineering from the University of Rome “La Sapienza.” In 1991, he joined the Department of Electronic Engineering, University of Rome “La Sapienza” as a research assistant and is currently an associate professor. His research interests include high-speed circuit design techniques and III-V device modeling.
