Design, Implementation, and Verification of Active Cache Emulator (ACE)

Jumnit Hong
IPD, Intel Corporation, Hillsboro, OR
[email protected]

Eriko Nurvitadhi
ECE Department, Carnegie Mellon University, Pittsburgh, PA
[email protected]

Shih-Lien L. Lu
MRL, Intel Corporation, Hillsboro, OR
[email protected]

ABSTRACT
This paper presents the design, implementation, and verification of the Active Cache Emulator (ACE), a novel FPGA-based emulator that models an L3 cache actively and in real time. ACE leverages interactions with its host system to model the target system (i.e. the hypothetical system under study). Unlike most existing FPGA-based cache emulators, which collect only memory traces from their host system, ACE provides feedback to its host by modeling the impact of the emulated cache on the system. Specifically, delays are injected to time-dilate the host system, which then experiences the hit/miss latencies of the emulated cache. Such active emulation expands the context of performance measurements by capturing processor performance metrics (e.g. cycles per instruction) in addition to the typical cache-specific performance metrics (e.g. miss ratio). ACE is designed to interface with the front-side bus (FSB) of a typical Pentium®-based PC system. To actively emulate cache latencies, ACE utilizes the snoop stall mechanism of the FSB to inject delays into the system. At present, ACE is implemented on a Xilinx XC2V6000 FPGA running at 66MHz, the same speed as its host's FSB. Verification of ACE includes using the Cache Calibrator and RightMark Memory Analyzer software to confirm proper detection of the emulated cache by the host system, and comparing ACE results with SimpleScalar software simulations.

Categories and Subject Descriptors
C.0 [Computer Systems Organization]: General – Modeling of Computer Architecture

General Terms
Design, Experimentation, Measurement

Keywords
FPGA-based emulator, Cache modeling, Real-time emulation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA’06, February 22–24, 2006, Monterey, CA, USA. Copyright 2006 ACM 1-58113-000-0/00/0004…$5.00.

1. INTRODUCTION
As FPGA technologies continue to improve and proliferate, there are more opportunities to utilize FPGA-based systems in various applications. One area of importance is computer architecture research. In particular, several existing studies [2, 7, 9, 15, 16] have proposed utilizing FPGA-based systems to model experimental cache designs in real time by leveraging interactions between an FPGA-based emulator and a real host system. This approach has gained popularity because it complements the commonly used software-based simulation for evaluating experimental computer architecture designs.

Software-based simulation offers several benefits, such as flexibility and cost-effectiveness, owing to easy-to-modify software and affordable commercial PCs as execution platforms. On the other hand, software simulators face persistent speed and scalability issues. First, as more details are incorporated into software models, longer simulation times are needed. Increasingly complex advances in computer architecture lead to the need to model more sophisticated system designs, heightening the need for faster simulation. Second, in order to perform software simulation within a reasonable amount of time and resources, the application (workload) dataset used in the simulation often needs to be scaled down. This may introduce inaccuracies and could lead to misleading conclusions [9]. With multi-core processors and multi-threaded applications, we can expect future workloads of interest to grow in complexity, resulting in larger dataset sizes and worsening the scalability issue of software simulation.

The aforementioned FPGA-based emulators address these issues by allowing fast online evaluation of experimental designs using a real host system that can run workloads of interest with realistic dataset sizes. The reconfigurability of FPGAs provides the flexibility needed for design-space exploration, while the use of hardware offers realism and fast modeling speed. However, most of the existing FPGA-based emulators exploit only passive interactions between themselves and their real system host to model a target system¹. They act as hardware monitors that collect traces generated by the host and use these traces as inputs to the experimental component they emulate.

¹ The target system is the hypothetical system being studied; here it is modeled by the real system host together with the emulator hardware.

® Pentium is a registered trademark of Intel Corp. or its subsidiaries in the United States and other countries.

Such a passive approach, which we refer to as passive-emulation, has implications for the type of evaluation that can be done. Measurements for observing performance trends can only be made within the context of the emulator itself, because the impact of the emulated component on the rest of the target system (i.e. as represented by the real system host that generates the traces) is not modeled. Specifically for cache modeling, the missing interactions are the hit/miss latencies of the emulated L3 cache, which are never experienced by the real system host. Passive emulators only collect memory traces generated by the host via the system bus and use the traces as inputs for online cache emulation. Thus, an experimental cache design can only be evaluated by observing cache statistics (e.g. miss ratio) obtained from the hardware emulator. System performance trends due to the experimental cache cannot be studied. For example, processor speedup or slowdown due to an experimental L3 cache cannot be obtained without actually modeling the emulated L3 hit/miss latencies as perceived by the system. One might argue that the system performance impact can be estimated from the miss ratio. However, such estimates may be inaccurate due to various non-blocking effects in different parts of the system. With non-blocking caches and an out-of-order superscalar processor, some cache misses may be overlapped with useful work and have no impact on system performance, while others do. Miss ratio by itself does not capture this information.

To ameliorate this limitation in evaluation context, we propose active-emulation, an emulation approach that leverages active interactions between the emulator hardware and a real system host. Here, the emulator provides feedback to its host system in addition to collecting traces, altering the host's behavior in a way that models the impact of the experimental component on the system. Active emulation provides a complementary solution to the issues of the passive emulation used by the aforesaid existing FPGA-based cache emulators.

Note that besides passive-emulation, it is also possible to do full-emulation, where all of the components of the target system are modeled by an emulator. While this approach is very flexible, building such an emulator can be a large endeavor. Moreover, running realistic commercial workloads on such a custom emulator framework may be difficult. We know of only one emulator [1] that fits this category. It can do full-emulation of MIMD systems, including emulation of caches, but it has never been used to study large-scale commercial applications. The background information in Section 2 compares active emulation with these existing emulation approaches in more detail.

In this paper, we present the Active Cache Emulator (ACE), a novel FPGA-based L3 cache emulator that adopts the active-emulation approach. More specifically, ACE interfaces with the front-side bus (FSB) of its real system host and injects delays into the FSB to model L3 hit/miss latencies as perceived by the time-dilated host. This is different from the previously mentioned emulators, which only observe traces. By injecting delays, ACE alters the behavior of its host system in an orchestrated way, effectively modeling the impact of the experimental L3 cache on the target system.
Time dilation is used to make the host perceive the existence of the emulated L3 cache, while time scaling is used to map the results of emulation on a slow host to a faster target system. Time dilation and time scaling are discussed in Section 3.

ACE provides a fast way to explore various L3 cache designs on a real system running software applications of interest with realistic dataset sizes. Thus, similar to existing emulators, ACE is useful for rapidly identifying high-level characteristics of experimental cache designs, which can be used to quickly find design spaces of interest. If needed, software simulators can then be used to investigate a design space in more detail and with more flexibility. Furthermore, ACE improves on the previously mentioned cache emulators by allowing such exploration to be done with a more complete view (system-level and emulator-level). Not only can the various L3 cache designs be evaluated by looking at cache-specific metrics (e.g. miss ratios), but they can also be evaluated based on their impact on system performance (e.g. cycles per instruction, or CPI). This is useful because different cache designs can have different latency requirements, and evaluating experimental cache designs based on cache-specific metrics alone may lead to misleading conclusions. A well-known example is the comparison between a direct-mapped cache and a cache with high set-associativity [5]. While a direct-mapped cache may have a higher miss ratio, it typically has a lower hit latency than a set-associative cache. Thus, looking only at miss ratio may suggest that the set-associative cache is preferable, while the opposite conclusion may be reached when the impact of these caches on system performance is considered (e.g. adding a direct-mapped L3 cache may yield a larger CPI improvement than adding a set-associative L3 cache).

Currently, ACE is designed to work with a typical Pentium-based PC system as a host. A dual-processor motherboard with two Slot-1 connectors is used: ACE interfaces to the host's FSB via one of the Slot-1 connectors, while the other is used by the real processor. The snoop stall mechanism of the Intel FSB is used by ACE to perform delay injection. ACE is implemented on a Xilinx XC2V6000 FPGA running at 66MHz, the same speed as the FSB.

Verification of ACE is done via the following two studies. First, the Cache Calibrator and RightMark Memory Analyzer software are used to confirm proper detection of the emulated L3 cache by the host and to show that time dilation works. Second, ACE results are compared with results from the well-known SimpleScalar software simulator. This verifies the validity of the cache performance impact emulated by ACE and the applicability of time scaling.

The rest of the paper is organized as follows. Section 2 discusses background information. Section 3 provides an overview of ACE. Section 4 presents the design of ACE. Section 5 elaborates on the implementation of ACE. Section 6 discusses the verification of ACE. Section 7 discusses the usefulness and limitations of ACE. Finally, Section 8 provides concluding remarks.

2. BACKGROUND
2.1 Existing FPGA-based Cache Emulators
Previous studies have proposed several FPGA-based cache emulators that utilize passive interactions with a real system host to model the target system. These emulators interface with the host system's bus and use the memory references observed on the bus as inputs for the emulated cache. The emulation is done online by the emulator hardware as a workload of interest runs on the host system, without causing any slowdown to the host. In this case, the host system models most of the target system (e.g. processors, memories, disks, etc.), while the cache modeled by the emulator constitutes the experimental component of the target system under study.

One of the earlier works relevant to these cache emulators is BACH [3][4], a trace collector developed at BYU. BACH utilizes a logic analyzer to interface with its host system. Traces collected from the host system are buffered; a high-priority interrupt halts the system when the buffer is full, at which point the buffer is moved to disk. After that, execution of the host system resumes and the same trace-collection routine repeats. This way, BACH is capable of collecting traces from long workload runs. However, the halting mechanism of BACH may alter the original behavior of the host system: it was reported that a system with BACH generated 1.125% more references than the original system without BACH.

HACS [15] is a successor of BACH. While BACH acts mostly as a trace collector, HACS can conduct emulation online as the host system executes, using the traces collected in real time. For trace collection, HACS utilizes a FIFO board to interface with the host system. Emulation is done by a WILD-ONE FPGA board (equipped with two Xilinx XC4000XLA-series FPGAs and 4MB of SRAM), which processes the traces as they arrive through the FIFO board in real time. Emulation results are generated by the FPGA board and collected via a PCI interface.

RACFCS [16] also provides the capability to process the collected traces on the fly. Trace collection uses a latch board that connects to the microprocessor's output pins and a trace board equipped with SRAMs for storing the collected traces. Emulation is done by the Flying Cache emulator, which interfaces with the trace board and outputs the number of hits and misses of the experimental cache structure being emulated.

MemorIES [9] is another emulator that can perform online emulation of caches. MemorIES works with IBM S70-class RS/6000 or AS/400 servers. It is implemented with 7 FPGAs that function as a cache controller, plus 1GB of SDRAM that stores the simulated cache tags and state tables. Since it sits on a symmetric multiprocessor (SMP) bus, MemorIES can emulate a shared L3: multiple processors are grouped by their respective CPU IDs, and a single emulated cache is assigned to that set of processors. Multiple groups can be emulated by deploying multiple MemorIES boards, each emulating a shared L3 for one group of processors. Additionally, MemorIES can emulate multiple cache configurations in parallel.

The most recent FPGA-based cache emulator reported in the literature is PHA$E [2, 7], which is capable of emulation in real time on Pentium-based system hosts. To match the maximum speed of its host, PHA$E interleaves the trace processing for emulation among four Xilinx XC2V1000 FPGAs. PHA$E complements MemorIES by providing support for a different hardware platform (i.e. the IA32 architecture) and different cache parameters.

PHA$E also provides the ability to emulate multiple cache configurations with differing set-associativities simultaneously.

Finally, RPM [1] is another emulator capable of emulating experimental cache designs. However, its emulation approach is rather different from the cache emulators mentioned previously. Rather than exploiting interactions with a real system host to model the target system, RPM emulates the entire target system within its emulator hardware; the cache emulation in RPM is thus part of the full system it emulates. RPM can emulate various MIMD multiprocessor systems. The emulator hardware consists of 8 boards, each emulating a single multiprocessor node (i.e. processor, caches, and memory), and another board acting as an I/O interface to the host system that controls the execution of the emulator hardware. Multiple FPGAs are used on each board to emulate the cache and memory controllers.

2.2 Active Emulation
Virtually all of the cache emulators discussed in the previous section interact passively (i.e. collect traces) with a real system host to model the target system. For our discussion, we classify this type of emulation as passive-emulation. Using only passive interactions limits the evaluation boundary of the emulated experimental designs. For L3 cache modeling, only cache-specific performance statistics can be obtained, while the impact of hit/miss latencies on the target system (e.g. changes in CPI) cannot be measured, since those latencies are not emulated.

On the other hand, using an emulator to model the entire target system provides much more flexibility and creates a larger evaluation boundary than passive-emulation. We refer to this approach as full-emulation. However, developing an emulator for full-emulation takes more effort than developing one for passive-emulation: full-emulation requires hardware that can model every aspect of the target system, while passive-emulation requires hardware that models only a particular aspect of it. Another issue with full-emulation is the overhead of porting workloads to run on the emulation platform. Of course, these two emulation approaches have their own benefits and drawbacks and should be chosen depending on research needs.

Active-emulation is proposed to complement the aforementioned approaches. It is more closely related to passive-emulation and specifically addresses passive-emulation's evaluation-boundary limitation. The idea is to have the emulator interact actively with its real system host, providing feedback in an orchestrated manner that effectively models the impact of the emulated component on the target system. Therefore, performance measurements of experimental designs can be made at the system level, in addition to the emulator-level measurements provided by typical passive emulators.

3. ACTIVE CACHE EMULATOR (ACE)
3.1 Overview
ACE interfaces with the FSB of a typical Pentium-based PC system and monitors the memory requests seen on the FSB. ACE performs emulation in real time using the observed transactions as emulation inputs, and injects appropriate delays to emulate cache hit/miss latencies as perceived by the host. Cache statistics are sampled periodically and sent via a serial interface to a data-collection computer. Performance-monitoring software, such as VTune [6], is used in conjunction with ACE to collect processor performance statistics via the processor performance counters. ACE is capable of emulating various aspects of L3 caches (e.g. cache architectures, sizes, associativities, replacement policies, etc.), but for this paper ACE is configured to emulate an 8-way write-back cache with a pseudo-LRU replacement policy. Figure 1a shows an overview of the ACE system setup. The host uses a dual-processor motherboard: a real Pentium processor sits in one of the processor slots, while ACE interfaces with the FSB via the other.

3.2 Modeling L3 Latencies with Time Dilation
ACE uses time dilation to emulate the latencies of the L3 cache and memory of its target system. This is done by injecting delays into the FSB to lengthen (dilate) the host's memory access latency by a certain amount. Figure 1b shows the cache and memory latencies seen by the host with and without ACE's time dilation. ACE dilates its host's memory latency using two adjustable parameters, HIT and MISS, to emulate the L3 cache and memory latencies. ACE's host thus experiences two different dilated memory access latencies: the smaller latency (original memory latency + HIT) is perceived as the L3 cache latency, and the larger one (original memory latency + MISS) as the memory latency. Therefore, the target system emulated using ACE is a system with emulated L3 cache and memory latencies, and real L1 and L2 cache latencies.

ACE's method for injecting delays into the FSB is as follows (refer to Figure 1b). If a memory request hits in the emulated L3 cache, a HIT delay (which can be zero) is injected into the FSB. On a miss in the emulated cache, the MISS delay is injected instead. The minimum hit latency that can be emulated is equal to the real memory latency, because ACE stores only cache tags for emulation and lets the memory supply the actual data. While this is a limitation, future systems targeted by ACE are expected to have larger cache and memory latencies relative to processor speed as the processor-memory speed gap continues to grow. Note that ACE cannot adjust the L1 and L2 cache latencies, since these caches are on the processor die.
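To make the latency model concrete, the following minimal C sketch (ours, for illustration only; ACE itself is implemented in VHDL) captures the two dilated latencies described above. The MISS value of 64 FSB cycles mirrors the configuration used in Section 6; the memory latency argument is a placeholder.

    /* A minimal sketch of the Section 3.2 latency model; all values
     * share one time unit. HIT_DELAY = 0 and MISS_DELAY = 64 FSB cycles
     * mirror the Section 6 configuration; mem_latency is the host's
     * real memory latency. */
    enum { HIT_DELAY = 0, MISS_DELAY = 64 };

    /* Memory access latency as perceived by the time-dilated host. */
    int dilated_latency(int mem_latency, int l3_hit)
    {
        return mem_latency + (l3_hit ? HIT_DELAY : MISS_DELAY);
    }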

Figure 1. (a) ACE system setup. (b) The latencies seen by the host with and without time dilation by ACE.

3.3 Modeling Faster Targets with Time Scaling
Time scaling is complementary to time dilation with ACE. It is applied to map the performance results from running workloads on a real system host that is time-dilated by ACE to the target system's performance. In essence, time scaling [1] means that if the parameters of a system are expressed in terms of relative latencies and bandwidths among its components (e.g. caches and main memory), then another system with the same organization and the same relative latencies and bandwidths will behave the same way, regardless of the absolute parameter values. For example, a system with a 1GHz processor, 2 ns L1 cache latency, and 30 ns average memory latency has the same processor utilization as an emulated system with a 50MHz processor, 40 ns L1 cache latency, and 600 ns average memory latency (i.e. everything scaled by 20), assuming both systems have the same architecture. By expressing the L1, L2, L3, and memory latencies of the time-dilated host in terms of processor clocks, ACE can emulate faster target systems. Thus, even though ACE is hosted by a relatively old Pentium III-based system, it can be used to approximate the behavior of future higher-frequency systems, under the assumption that those systems have the same architecture and similar relative latencies and bandwidths as ACE's host.
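The worked example above reduces to a few lines of arithmetic. This C sketch (ours, not part of ACE's tooling) expresses both systems' latencies in processor clocks to show that they are identical under a scale factor of 20:

    #include <stdio.h>

    /* Latency in processor clocks: ns * (cycles per ns) = ns * MHz / 1000. */
    static double clocks(double latency_ns, double clock_mhz)
    {
        return latency_ns * clock_mhz / 1000.0;
    }

    int main(void)
    {
        /* Numbers from the example in Section 3.3. */
        printf("fast:     L1 = %.0f clk, mem = %.0f clk\n",
               clocks(2.0, 1000.0), clocks(30.0, 1000.0));   /* 2, 30 */
        printf("emulated: L1 = %.0f clk, mem = %.0f clk\n",
               clocks(40.0, 50.0), clocks(600.0, 50.0));     /* 2, 30 */
        return 0;
    }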

3.4 Emulated L3 Cache
ACE emulates a non-blocking cache with in-order completion that is inclusive with main memory. The non-blocking characteristic comes from the capability of the Intel FSB to handle up to 8 outstanding transactions, effectively emulating up to 8 outstanding requests to the L3 cache. This means the processor does not have to wait until a cache miss is satisfied before continuing execution, unless there is a dependence on the requested data that forces the processor to stall. However, perhaps unlike a typical non-blocking cache, the outstanding requests in the L3 cache emulated by ACE complete in order, because the FSB enforces in-order completion of its transactions. Thus, outstanding requests are serviced in the order of their issuance on the FSB.

The inclusiveness of the emulated cache with respect to main memory is ensured for the following two reasons. First, a line can be brought into the emulated cache only by a real memory request seen on the FSB; thus, lines that ACE brings into its emulated cache are also in main memory when the cache fills occur. Second, if a line is removed from main memory (i.e. the corresponding page is swapped out), invalidation requests are issued on the FSB. ACE monitors such requests and invalidates the associated emulated cache line upon detecting one. Thus, lines that are not in main memory cannot remain in the emulated cache. Together, these two properties ensure inclusiveness: lines cached by ACE are also in main memory.

Note that, at present, ACE does not maintain inclusiveness between the L2 and L3 caches: when a line is evicted from the L3, no invalidation is sent to the L2. It would be possible to have ACE issue an invalidation request to the L2 on a line eviction, but we chose not to, because adding such a mechanism would complicate ACE's design and further constrain the speed of the implementation. Furthermore, we are mostly interested in studying large L3 caches (at least 1MB) that are much larger than the L2 cache, so inclusiveness of the L2 cache with respect to the L3 cache would not have a big impact on modeling results: with a large L3 cache, a replaced line is likely no longer present in the much smaller L2 cache.
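The invalidation rule above is simple to state in code. Here is our sketch, under the assumption of a reserved "no line" tag value (the real design presumably uses a valid bit instead):

    #include <stdint.h>

    /* Our sketch of the inclusion rule in Section 3.4: an invalidation
     * transaction observed on the FSB clears the matching emulated
     * line, so ACE never caches a line absent from main memory. */
    #define WAYS 8
    #define INVALID_TAG 0xFFFFFFFFu   /* assumed "no line" marker */

    void on_fsb_invalidation(uint32_t tags[WAYS], uint32_t tag)
    {
        for (int w = 0; w < WAYS; w++)
            if (tags[w] == tag)
                tags[w] = INVALID_TAG;   /* drop the line from the set */
    }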

4. ACE DESIGN
This section elaborates on the design of ACE. The first subsection presents the top-level design of ACE, providing an overview of ACE's major modules and their interactions. Following that are the details of how ACE interfaces with the FSB. Then, the finite-state machines (FSMs) used to keep track of FSB transactions are presented. Finally, the various delay parameters and their relationships are explained.

4.1 Top-Level Design
At the top level, the ACE design consists of four modules: (1) the FSB interface, (2) the cache emulator, (3) the statistic collector, and (4) the serial interface. Figure 2 depicts the top-level design of ACE.

Figure 2. Top-level design of ACE.

4.1.1 FSB Interface
The FSB interface module handles all interactions involving the FSB. Its tasks are monitoring transactions, keeping track of the phase each outstanding transaction is in, and injecting delays. The module comprises several finite-state machines (FSMs): eight instances of the trans_track FSM, which keep track of the maximum of eight outstanding transactions allowed by the FSB protocol, and one instance of the trans_select FSM, which manages the eight trans_track FSM instances. The workings of these FSMs are discussed in Section 4.3. The FSB interface module also contains a delay device, a counter used to track the amount of delay injected on the FSB; the emulated cache latencies are set through the delay device. The FSB interface module interacts with the cache emulator module: it collects transaction information as it becomes available on the FSB, converts the information into cache requests containing the set address, tag address, request type (e.g. code read, data read, etc.), and processor ID, and then passes each request to the cache emulator.
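As an illustration of the request conversion just described, the following sketch (ours; the field widths are assumptions chosen for a 1MB, 8-way, 32B-line configuration, and are set by VHDL generics in the real design) splits an FSB address into the set and tag fields of a cache request:

    #include <stdint.h>

    /* Illustrative cache request, as handed to the cache emulator. */
    typedef struct {
        uint32_t set;       /* indexes the BRAM-resident tag array    */
        uint32_t tag;       /* compared against the stored tags       */
        uint8_t  req_type;  /* e.g. code read, data read, invalidate  */
        uint8_t  cpu_id;    /* issuing processor                      */
    } cache_request;

    cache_request make_request(uint32_t addr, uint8_t req_type, uint8_t cpu_id)
    {
        /* Assumed geometry: 32B lines (5 offset bits) and 4096 sets
         * (12 index bits), i.e. a 1MB 8-way cache. */
        enum { LINE_BITS = 5, SET_BITS = 12 };
        cache_request r;
        r.set      = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
        r.tag      = addr >> (LINE_BITS + SET_BITS);
        r.req_type = req_type;
        r.cpu_id   = cpu_id;
        return r;
    }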

4.1.2 Cache Emulator
The cache emulator module accepts cache requests from the FSB interface and models the L3 cache. It contains a block RAM (BRAM), combinational logic that emulates the cache controller functions, and logic for detecting events of interest. The BRAM contains the cache tags and is indexed by the set address. The cache controller logic accepts a cache set (containing eight cache tags in the case of an 8-way cache), checks for a cache hit/miss, performs the tag shifting necessary to model the LRU replacement policy, and outputs the updated cache tags, which are written back to the BRAM in the next cycle. The event detection logic identifies cache events of interest (e.g. cache hits, cache misses, etc.) to be counted by the statistic collector module. The cache-miss event, in particular, is supplied to the FSB interface module, where the FSM tracking the current transaction uses the delay device and interacts with the FSB to insert delays based on this information. More detail on the mechanics of delay insertion is provided in Section 4.2.

4.1.3 Statistic Collector and Serial Interface
The statistic collector consists of a set of counters that are incremented based on the signals driven by the event detection logic in the cache emulator module. Each time an event of interest occurs in the cache emulator, it is counted by a statistic counter. The values of the counters are supplied as inputs to the serial interface. The serial interface module samples the values of the event counters in the statistic collector module periodically; currently, a 1-second sampling period is used. The sampled counter values are then sent through a serial port to a PC that serves as the data collection system.
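To make the cache-controller step concrete, here is a software sketch (ours) of the set lookup and tag-shift update described in Section 4.1.2. The configured policy in Section 3.1 is pseudo-LRU; the shift scheme below implements the plain recency ordering described here, with valid bits omitted:

    #include <stdint.h>

    /* Our sketch of the single-cycle set update: check the eight tags
     * of a set for a hit and shift tags to keep them in recency order,
     * so the last way is the replacement victim. The hardware does
     * this as one read-modify-write of the BRAM-resident tag array. */
    #define WAYS 8

    typedef struct { uint32_t tags[WAYS]; } cache_set;

    /* Returns 1 on a hit, 0 on a miss; updates the set in place. */
    int lookup_and_update(cache_set *set, uint32_t tag)
    {
        int hit = 0;
        int pos = WAYS - 1;              /* on a miss, evict the LRU way */
        for (int w = 0; w < WAYS; w++) {
            if (set->tags[w] == tag) { hit = 1; pos = w; break; }
        }
        for (int w = pos; w > 0; w--)    /* shift to preserve recency    */
            set->tags[w] = set->tags[w - 1];
        set->tags[0] = tag;              /* requested tag becomes MRU    */
        return hit;
    }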

4.2 Injecting Delays into the FSB
4.2.1 FSB Protocol Overview
The Pentium III FSB protocol consists of the following phases. When a bus agent (e.g. a processor) wants to issue an FSB request, it must first request ownership of the FSB. During the arbitration phase, the FSB protocol selects, among the possibly multiple requesting agents, the agent that will be allowed to issue a request on the FSB. This is followed by the request phase, in which the agent that won arbitration issues its request on the bus; information such as the requested address and the request type is broadcast on the FSB. Next comes the error phase, in which the parity bits of the issued transaction are checked; if an error occurs, the transaction is canceled. If no error occurs, the snoop phase follows. During this phase, all snooping agents (i.e. bus agents that have caches and are subject to coherency control) check their caches and broadcast their findings on the FSB, ensuring that all cached copies are kept coherent. Each bus agent that has a copy of the requested data asserts the HITM or HIT signal, depending on whether its copy is in a modified state or not. Snooping agents can require different latencies for this check: an agent that is slow and needs more time to check its cache can stall the FSB by asserting both the HIT and HITM signals during the snoop phase. This stall is referred to as a snoop stall. The response phase comes after the snoop phase; here, the responding agent publishes on the bus how the transaction will be completed. A data phase may occur after the response phase, in which the requested data is transferred over the FSB. These phases, from arbitration to completion, constitute an FSB transaction. Intel documentation [10] provides further details on the FSB protocol.

4.2.2 Delay Injection by ACE
ACE injects delay into its host system using the snoop stall mechanism described above. The trans_track FSMs in Figure 2 keep track of the phase that each outstanding transaction is in; Section 4.3 provides further details on how this tracking is done. Note that for modeling the L3 cache, ACE tracks only memory and invalidation transactions. When any of the eight trans_track FSM instances detects that a transaction is in the request phase, it directs the transaction information broadcast on the FSB from ACE's input signals to the cache emulator module. The module then performs the cache controller functions and determines whether there is a cache hit. By the time the transaction passes the error phase (assuming no parity error) and enters the snoop phase, the cache emulator module has already determined whether the transaction is a hit or a miss in the emulated cache. Based on this information, ACE performs a snoop stall of the appropriate length to model the cache hit/miss latency. After the snoop stall, the FSB protocol proceeds normally. Figure 3 illustrates the FSB transaction phases and the related actions taken by ACE.

Figure 3. Overview of the FSB protocol and ACE actions.
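The snoop-phase behavior can be summarized as follows. This is our rendering with illustrative conventions: on the real bus, HIT# and HITM# are active-low pins, and we assume ACE simply deasserts both once the delay has elapsed, since it holds no data and thus reports no snoop hit:

    /* Our sketch of ACE's snoop-phase output: assert HIT# and HITM#
     * together to stall the bus while emulated delay remains;
     * otherwise deassert both. */
    typedef struct { int hit_n, hitm_n; } snoop_pins;   /* 0 = asserted */

    snoop_pins drive_snoop(int delay_remaining)
    {
        snoop_pins p;
        p.hit_n  = (delay_remaining > 0) ? 0 : 1;
        p.hitm_n = (delay_remaining > 0) ? 0 : 1;
        return p;
    }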

4.3 Keeping Track of FSB Transactions ACE tracks FSB transactions utilizing the FSMs in the FSB interface module mentioned previously. There are two types of FSMs used: trans_track and trans_select. The trans_track FSM follows a transaction as it progresses through the various phases of the FSB protocol. The trans_select FSM manages multiple instances of the trans_track FSM. It assigns one of these instances to each new FSB transaction in a round-robin fashion.

4.3.1 Tracking Individual FSB Transactions
Figure 4 shows the trans_track FSM. At the start, the FSM enters the Init state, where it waits for two conditions: (1) a valid transaction (i.e. one containing a memory or invalidation request) is observed on the FSB, and (2) the trans_sel signal permits this trans_track instance to track the transaction. When both conditions are met, the FSM moves to the Error Check state. In the Error Check state, the FSM waits for the previous trans_track instance to complete its snoop phase; this round-robin hand-off tracks the completion of snoop phases. The transaction this FSM is tracking can then enter the snoop phase (since the preceding transaction no longer blocks it from entering this phase). As the transaction proceeds to the snoop phase, the FSM tracking it enters the Tag Check state, at which time it signals the cache emulator module to perform the cache modeling (e.g. look up the set, check for matching tags, shift tags to maintain LRU positions, etc.). The cache emulator module completes its modeling at the end of the cycle. If there is an L3 cache hit and the emulated hit latency equals the main memory latency, then no delay needs to be injected and the FSM jumps to the Valid Snoop1 state. Otherwise, it jumps to the Snoop Stall1 state, where a snoop stall is initiated; the delay device counts the injected delay to model the desired hit/miss latency. The Snoop Stall1, Snoop Stall2, and Snoop Wait states ensure that an even number of stall cycles is inserted on the bus, which is the snoop-stall requirement imposed by the FSB protocol. When the desired delay has been injected, the Valid Snoop1 state is entered. In the next cycle, the Valid Snoop2 state is reached, at which point the snoop-completion indication is passed to the succeeding trans_track FSM, allowing that FSM to progress to its own Tag Check state. The Response state follows Valid Snoop2; in it, the FSM waits for the response status of the transaction to be broadcast on the FSB. When this happens, the FSM completes its tracking and eventually returns to the Init state, where it waits for the trans_select FSM to assign it another transaction to track.

Figure 4. State machine for tracking an FSB transaction.
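The state flow of Figure 4 can be rendered in software roughly as follows. This is our sketch: signal names are illustrative, the parity-error path is omitted, and the exact stall-loop mechanics are an assumption beyond the even-cycle requirement stated above:

    typedef enum {
        INIT, ERROR_CHECK, TAG_CHECK, SNOOP_STALL1, SNOOP_STALL2,
        SNOOP_WAIT, VALID_SNOOP1, VALID_SNOOP2, RESPONSE
    } tt_state;

    /* One clock of a trans_track instance. delay_left counts remaining
     * stall cycles; stalls are consumed two at a time so the total is
     * even, as the FSB protocol requires. */
    tt_state trans_track_step(tt_state s, int valid_trans, int my_turn,
                              int prev_snoop_done, int hit_needs_no_delay,
                              int response_seen, int *delay_left)
    {
        switch (s) {
        case INIT:          /* wait to be handed a valid transaction   */
            return (valid_trans && my_turn) ? ERROR_CHECK : INIT;
        case ERROR_CHECK:   /* wait for the preceding tracker's snoop  */
            return prev_snoop_done ? TAG_CHECK : ERROR_CHECK;
        case TAG_CHECK:     /* cache emulator resolves hit/miss now    */
            return hit_needs_no_delay ? VALID_SNOOP1 : SNOOP_STALL1;
        case SNOOP_STALL1:  /* first cycle of a stall pair             */
            return SNOOP_STALL2;
        case SNOOP_STALL2:  /* second cycle; check remaining delay     */
            *delay_left -= 2;
            return (*delay_left > 0) ? SNOOP_WAIT : VALID_SNOOP1;
        case SNOOP_WAIT:    /* continue stalling with another pair     */
            return SNOOP_STALL1;
        case VALID_SNOOP1:  /* drive the final snoop result            */
            return VALID_SNOOP2;
        case VALID_SNOOP2:  /* hand snoop completion to next tracker   */
            return RESPONSE;
        case RESPONSE:      /* wait for the response phase, recycle    */
            return response_seen ? INIT : RESPONSE;
        }
        return INIT;
    }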

4.3.2 Tracking Multiple Outstanding Transactions
Multiple outstanding transactions are tracked using the eight trans_track FSM instances. These instances are managed by the trans_select FSM, which is depicted in Figure 5. The trans_select FSM follows a round-robin policy for assigning a trans_track FSM instance to each new FSB transaction. At the start, the trans_select FSM enters the Trans1 state, where it enables the first instance of the trans_track FSM (i.e. trans_track1) to monitor the FSB for a new transaction to track. At this point, trans_track1 is still in its Init state. When trans_track1 observes a valid FSB transaction, it moves to the Error Check state, because it has observed a valid transaction and the signal from the trans_select FSM indicates that it is its turn to claim one. In the Error Check state, it signals the trans_select FSM, acknowledging that it will track the current transaction and prompting trans_select to assign the next trans_track FSM instance to monitor the valid FSB transaction that follows. From here the procedure repeats, moving from Trans2 to Trans3 and so on, wrapping around at Trans8.

Figure 5. State machine to handle outstanding transactions.
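The round-robin management reduces to a single enable that advances on acknowledgment; a minimal sketch (ours):

    /* Our sketch of trans_select: exactly one trans_track instance is
     * enabled to claim the next valid FSB transaction; the enable
     * advances (Trans1 -> ... -> Trans8 -> Trans1) when that instance
     * acknowledges, from its Error Check state, that it has claimed
     * the transaction. */
    enum { NUM_TRACKERS = 8 };

    int trans_select_step(int enabled, int ack_from_enabled)
    {
        return ack_from_enabled ? (enabled + 1) % NUM_TRACKERS : enabled;
    }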

5. ACE IMPLEMENTATION
ACE was developed using a typical Xilinx ISE design flow. First, VHDL is used to describe the design. ModelSim is then used to simulate the VHDL code for functional verification. Finally, the Xilinx ISE software is used to synthesize the design and generate the bit file. Changing cache parameters requires only changing generic parameters in the VHDL code, and batch scripts automate bit-file generation for the various cache configurations.

The system setup involves two PCs: one hosts ACE (and models the target system), while the other runs the data collection and automation scripts. The automation scripts manage the runs on ACE's host machine by remotely starting and stopping workload execution and collecting data from the processor performance counters. The data collector machine also collects the cache performance data generated by ACE via the serial interface. Table 1 summarizes the parameters of the system used to host ACE. Figure 6 shows a picture of the system; from left to right are the ACE FPGA board, the Pentium III CPU, and 4 DRAM DIMMs. The host machine's motherboard has two Slot-1 connectors for a dual-processor configuration; one connector interfaces with the ACE board while the other is occupied by an Intel Pentium III processor, as depicted in the ACE setup in Figure 1a.

Table 1. ACE system parameters

  Component     Description
  Motherboard   Intel Lancewood Server Board; dual-processor Slot-1;
                L440GX chipset; BIOS ver. 14.3; 66MHz front-side bus
  CPU           Intel Pentium III, 333MHz;
                L1-I and L1-D = 16KB, 4-way, 3 cycles;
                L2 = 256KB, 8-way, 7 cycles
  Memory        2GB SDRAM, avg. lat. = 56 CPU cycles
  OS            Windows XP, Service Pack 1

Figure 6. The ACE hardware.

A Xilinx XC2V6000 Virtex II FPGA (package: ff1152, speed grade: -4) is featured on the ACE board. It has 2,592 Kbits of block RAM (BRAM). ACE uses this BRAM to store the cache tags so that reads and writes of the tags complete within one cycle, allowing ACE to determine a hit/miss in the emulated cache within one cycle (i.e. BRAM access plus tag-checking logic). Single-cycle hit/miss determination is needed so that ACE can meet the snoop-phase deadline for injecting stalls, which in turn allows the minimum emulated L3 hit latency to equal the main memory latency (i.e. ACE does not have to inject any delay on the FSB). Alternatively, the cache emulator module could be pipelined so that hit/miss determination need not complete in one cycle. This could yield an implementation with a higher clock frequency, but the minimum L3 hit latency emulated by ACE would then be larger than the real memory latency (delay would always be injected while waiting for the hit/miss determination, with the amount of delay depending on the number of pipeline stages). Since the system still functions normally at the reduced 66MHz FSB speed, and is fast enough to complete large workloads in a reasonable amount of time, we chose not to pipeline the cache emulator.

In general, ACE utilizes only a small fraction of the FPGA's resources (about 6% of the available slices). However, the design is limited by the speed of the host system's FSB clock and by the constraint of determining the hit/miss outcome of the emulated cache in one cycle. At present, with the -4 speed grade of the FPGA, ACE can run faster than 66MHz but cannot meet the host's original 100MHz FSB speed. Thus, the FSB is slowed down to 66MHz; consequently, the processor frequency drops from the original 500MHz to 333MHz.

6. ACE VERIFICATION
As a basic verification, we simulated the HDL code extensively with testbenches. Moreover, since FPGAs are reconfigurable, we were able to analyze internal signals, making the design fully observable. Using a logic analyzer, ACE's FSMs were validated by observing the FSB signals and their interaction with the system. Snoop stalls were observed when both the HIT and HITM lines were asserted, stalling all bus traffic in accordance with the FSB protocol. In addition to these basic verification practices, we also verified the validity of the two main techniques used by ACE: time dilation and time scaling.

6.1 Verifying Time Dilation
We utilize the Cache Calibrator [8] to verify that time dilation by ACE actually results in an emulated L3 cache that is recognized by the host system. The Cache Calibrator measures the time it takes to perform memory operations and, based on those times, determines the different cache levels. Each design loaded on the ACE board was verified with the Cache Calibrator to match the intended cache size and latency.

As an illustration of the detected cache sizes and latencies, we used the RightMark Memory Analyzer v3.58 [11] to generate the graph shown in Figure 7. Like the Cache Calibrator, RightMark can analyze system caches and memories; it is used here instead of the Cache Calibrator because it can generate graphs.

Figure 7. Snapshot from the RightMark Memory Analyzer for ACE emulating a 1MB L3 (hit latency = real memory latency; miss latency = real memory latency + 16 FSB cycles).

The x-axis shows the varying sizes of the working sets accessed by the analyzer's tests, while the y-axis shows access latency in CPU cycles (left) and in time (right). The RightMark 'forward' test shown in the figure performs sequential read accesses at cache-line granularity from the beginning to the end of a working set. As the working set grows, accesses no longer hit in the current level of the memory hierarchy, revealing the access latency of the next level. The four latency steps shown in the graph represent the latencies of the L1, L2, L3, and memory, and the working-set size at which each step occurs indicates the size of the cache at that level. As the figure shows, the emulated L3, along with the L1, L2, and memory parameters, is detected correctly by the host. Adding 16 FSB cycles results in an increase of 84 CPU cycles, meaning that 1 FSB cycle corresponds to roughly 5 CPU cycles; this is expected, since our CPU frequency is 5 times the FSB frequency (333MHz vs. 66MHz). Therefore, time dilation works correctly, as indicated by the correct L3 parameters perceived by the host machine.
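As a quick consistency check of these numbers (our arithmetic), the expected increase is

    16 FSB cycles x (333 MHz / 66 MHz) = 16 x 5.05 = ~81 CPU cycles,

which agrees well with the 84-cycle increase reported by RightMark.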

6.2 Verifying Time Scaling
To verify time scaling, we compared the results of emulation with ACE against simulation of a target machine with the same relative latencies. If the results compare well, then the notion of time scaling holds. For this, we ran several SPECCPU2000 benchmarks [14] on ACE and on the widely used SimpleScalar simulator. We chose the SPECCPU2000 benchmarks because their behavior has often been studied and is quite well understood. Moreover, checkpoints are available for these benchmarks, so we do not need to simulate each benchmark's entire execution. SimpleScalar is configured to model a target system with the same relative cache and memory latencies as ACE (i.e. the same latencies in CPU cycles). Other unpublished Intel parameters (e.g. the number of outstanding requests the L1 and L2 caches can handle) are estimated. Precise matching of these parameters is less important, since we are interested in relative performance rather than absolute simulation numbers. Each SimpleScalar simulation is started from a checkpoint made with SimPoint [13] and run for 20M instructions. While the SimpleScalar simulation does not run the entire benchmark, each of these checkpoints has been reported to very closely represent the behavior of the complete execution of its associated SPECCPU2000 benchmark [12].

We model various L3 cache and block sizes, with a delay injection for the L3 miss latency (MISS) of 64 FSB cycles. The L3 hit latency is set equal to the main memory latency (no delay injection, i.e. HIT equal to 0). From the modeling results, a performance curve is created by plotting CPI normalized to a perfect cache (an L3 with a 100% hit ratio, i.e. no delay injection by ACE) over the range of cache sizes studied. Verification is done by comparing the performance curves generated by ACE against those generated by SimpleScalar. Note that we do not expect the absolute SimpleScalar performance numbers to perfectly match the ACE results, due to factors such as the ISA difference (PISA versus IA32), the simulation length difference (20M instructions versus a full run), and the level of modeling detail (e.g. SimpleScalar does not model any OS effects). However, we do expect similarly shaped performance curves, because such curves should be dominated by a benchmark's sensitivity to L3 cache size rather than by differences in the modeling frameworks.

Figure 8 shows the performance curve comparison. The y-axis represents CPI slowdown, which is the CPI of the experimental design (with MISS equal to 64 FSB cycles) divided by the CPI obtained with a perfect L3 cache (i.e. no delay injection on a cache miss). For example, a CPI slowdown of 1.5 means that the performance of the experimental system is 1.5 times worse (i.e. its CPI is 1.5 times the ideal CPI) than that of the baseline system with a perfect L3 cache. The x-axis shows the various L3 cache configurations evaluated.

Figure 8. Comparison of performance curves obtained with ACE and SimpleScalar. (Four panels, for art, gzip, eon, and mgrid, plot CPI slowdown against cache size / block size configurations spanning 1MB-64MB caches and 32B-512B blocks, with one curve each for ACE and SimpleScalar.)

As the figure shows, the shapes of the performance curves obtained from ACE approximate the curves obtained using SimpleScalar well. For example, several levels of CPI slowdown appear in the mgrid curves generated by both ACE and SimpleScalar. As another example, for eon, which is not sensitive to L3 caches, both ACE and SimpleScalar produce a flat performance curve. This verification work indicates that ACE emulates the L3 cache performance impact reasonably, and that time scaling is a valid technique for mapping ACE's emulation results to target-system performance.

7. ACE USEFULNESS AND LIMITATIONS
7.1 Usefulness of ACE
Like the existing passive emulators, ACE is useful for rapid exploration of experimental designs on real machines using realistic workload setups and dataset sizes. Furthermore, ACE adds the capability to measure system performance in addition to the performance of the emulated component. In particular, ACE is useful for quickly exploring target systems with various experimental L3 designs and parameters. Questions such as how the performance of a hypothetical future system running sophisticated workloads (e.g. a multi-tier application server) with large datasets (e.g. terabytes) changes when a different L3 cache configuration is used can be investigated with ACE. Additionally, ACE's quick exploration helps narrow down the design space of interest; detailed study using simulators may then follow, depending on research needs, once that design space is identified.

Secondly, ACE acts as a proof of concept for the proposed notion of active-emulation. ACE shows an example of how an emulator can provide feedback to a real system host in an orchestrated manner. Even with its limitations (discussed in the next subsection), ACE is useful and sheds light on the potential of active-emulation.

7.2 Limitations of ACE
ACE has several limitations. First, as mentioned earlier, the minimum L3 hit latency that ACE can emulate is equal to the real latency of main memory. This may be acceptable, however, since ACE is used to study future systems: with the continuing growth of the processor-memory speed gap, future systems are expected to have larger L3 latencies relative to processor speed. Second, injecting delays into the FSB may cause other complications that we need to study further. However, our verification indicates that, overall, the modeling results from ACE agree well with those from SimpleScalar. Further, even if the performance results are only approximations, they are still valuable, since such information could be difficult to obtain by other means: modeling complicated workloads with large datasets in software simulators may not be feasible, while passive emulators simply do not allow measurements at the system level. Of course, these tools can be used in conjunction with ACE; software simulators can be used for more detailed study, while passive emulators may be modified to perform active emulation. Finally, emulation with ACE may not be as fast as emulation using passive emulators that process traces in real time, because ACE's delay injections slow the host down. The amount of slowdown depends on the emulated cache latencies, which in turn depend on factors such as the experimental cache design and the behavior of the workload under study. Note, however, that ACE's active-emulation capability can always be turned off (no delay injection), in which case ACE acts as a passive emulator and induces no slowdown. Thus, no passive-emulation feature is traded off for the active-emulation capability.

8. CONCLUSION
The continued improvement of FPGA technology opens opportunities to use FPGA-based systems in various application areas. One important area is computer architecture research, where such systems can be used for emulation. Computer architects have commonly used software simulators to evaluate the experimental architectures they study. As architectures and software applications grow in complexity, the speed and dataset-scaling drawbacks of simulators worsen. FPGA-based emulators address the disadvantages of software simulators and the passiveness of most existing hardware emulators. Passive emulators can only measure performance within the context of their emulator hardware. Alternatively, the full-emulation approach can be used to increase flexibility and the evaluation context, but it requires full-fledged hardware that can model all aspects of the target system, as well as adjustments to software applications to run on the emulator.

This paper has described the design, implementation, and verification of the Active Cache Emulator (ACE), an FPGA-based emulator that interacts with a real system host in an active manner, obtaining inputs from the system bus and providing feedback to the host system via delay injections on the bus, all happening online at a speed that matches the host system's. Overall, ACE creates orchestrated alterations of the host system's behavior, effectively emulating the impact of the experimental cache design on the host system. This way, performance measurements can be made at both the system and the emulator levels. ACE expands the evaluation boundary beyond passive-emulation while retaining the benefits of passive-emulation over full-emulation. Furthermore, ACE's active-emulation capability does not trade off any passive-emulation features (e.g. real-time emulation, leveraging a real system host).

Future opportunities for extending ACE are abundant. Three immediate extensions to the current work are as follows. First, with faster FPGAs, it is possible to mitigate the limit on the emulatable L3 latencies by actually storing the cache data in the emulator. This would allow ACE to provide data on the bus right after the snoop stall injection, rather than waiting for memory to supply it. In this case, ACE would act as a real L3 sitting on the bus, or as the on-chip L3 of a hypothetical system modeled by the time-dilated host. Second, ACE can be used for experimentation with real system deployments, providing insight into the behavior of realistic workloads and how experimental cache designs and parameters affect that behavior. Lastly, ACE can also be used to emulate shared L3 caches in multiprocessor systems, in the same way that MemorIES [9] emulates shared caches.

From a broader view, active-emulation also opens opportunities for future research. One possible direction is to apply active emulation to other aspects of computer systems. Another is to explore implementation techniques for active emulators, such as ways to interface with and provide feedback to the host system. In general, opportunities for using FPGAs in computer architecture research should be explored. One idea we are currently exploring is to use FPGAs to accelerate software simulation by having the simulator offload heavy-duty work to FPGAs. This could allow computer-system modeling with software flexibility and near-hardware speed. Most existing emulators use real processors, making it difficult to investigate an experimental processor microarchitecture; this hybrid approach can address that issue by implementing the processor model in flexible but slower software, while other components are implemented in less flexible but faster FPGAs. Finally, it is also important to develop a methodology for conducting computer architecture research that takes advantage of the various modeling approaches (e.g. software- and hardware-based) at computer architects' disposal.

9. REFERENCES
[1] L. A. Barroso, S. Iman, J. Jeong, K. Oner, K. Ramamurthy, and M. Dubois, "RPM: A Rapid Prototyping Engine for Multiprocessor Systems," IEEE Computer Magazine, pp. 26-34, February 1995.
[2] N. Chalainanont, E. Nurvitadhi, R. Morrison, L. Su, K. Chow, S. L. Lu, and K. Lai, "Real-time L3 Cache Simulations Using the Programmable Hardware-Assisted Cache Emulator (PHA$E)," Sixth Annual Workshop on Workload Characterization, October 27, 2003.
[3] J. K. Flanagan, B. Nelson, J. Archibald, and K. Grimsrud, "BACH: BYU Address Collection Hardware, the Collection of Complete Traces," Proc. of the 6th International Conference on Modeling Techniques and Tools for Computer Performance Evaluation, Edinburgh, U.K., September 1992, pp. 128-137.
[4] K. Grimsrud, J. Archibald, M. Ripley, K. Flanagan, and B. Nelson, "BACH: A Hardware Monitor for Tracing Microprocessor-based Systems," Microprocessors and Microsystems, v.17, n.8, Elsevier Science, Amsterdam, October 1993, pp. 443-458.
[5] M. D. Hill, "A Case for Direct-Mapped Caches," Computer, v.21, n.12, pp. 25-40, December 1988.
[6] Intel VTune Performance Analyzer, http://www.intel.com/cd/software/products/asmona/eng/vtune/vpa/219637.htm
[7] S.-L. Lu and K. Lai, "Implementation of Hardware Cache Simulator (Hw$im) - A Real Time Cache Simulator," Proc. of FPL 2003, Portugal, September 2003.
[8] S. Manegold, Cache Calibrator v.0.9e, http://www.cwi.nl/~manegold/Calibrator/
[9] A. Nanda, K. Mak, K. Sugavanam, R. K. Sahoo, V. Soundararajan, and T. B. Smith, "MemorIES: A Programmable, Real-Time Hardware Emulation Tool for Multiprocessor Server Design," Proc. of the 9th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, November 2000, pp. 37-48.
[10] Pentium Pro Family Developer's Manual, Volume 1: Specifications. Intel Corp., 1996.
[11] RightMark Memory Analyzer Website, http://cpu.rightmark.org/products/rmma.shtml
[12] T. E. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002.
[13] SimPoint Website, http://www.cs.ucsd.edu/~calder/simpoint/
[14] Standard Performance Evaluation Corporation (SPEC) Website, http://www.specbench.org/
[15] M. Watson and J. Flanagan, "Simulating L3 Caches in Real Time Using Hardware Accelerated Cache Simulation (HACS): a Case Study with SPECint 2000," 14th Symposium on Computer Architecture and High Performance Computing, October 2002.
[16] H. Yoon, G. Park, K. Lee, T. Han, S. Kim, and S. Yang, "Reconfigurable Address Collector and Flying Cache Simulator," Proc. of High Performance Computing Asia '97, Seoul, Korea, April 1997, pp. 552-556.