TSINGHUA SCIENCE AND TECHNOLOGY ISSN  1007-0214  11/20  pp691-699 Volume 12, Number 6, December 2007

Reducing Network Traffic of Token Protocol Using Sharing Relation Cache*

WANG Haixia**, WANG Dongsheng, LI Peng, WANG Jinglei, LI Chongmin

Research Institute of Information Technology, National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China

Abstract: Token protocol provides a new coherence framework for shared-memory multiprocessor systems. It avoids the indirections of directory protocols for common cache-to-cache transfer misses, and it achieves higher interconnect bandwidth and lower interconnect latency than snooping protocols. However, its broadcasting increases network traffic, limiting the scalability of the token protocol. This paper describes an efficient technique to reduce the token protocol's network traffic, called the sharing relation cache. This cache provides destination set information for cache-to-cache miss requests by caching directory information for recently shared data. This paper introduces how to implement the technique in a token protocol. Simulations using the SPLASH-2 benchmarks show that in a 16-core chip multiprocessor system, the cache reduced the network traffic by 15% on average.

Key words: token protocol; sharing relation cache; network traffic

Received: 2007-01-08; revised: 2007-06-20
* Supported by the National Natural Science Foundation of China (No. 60673145), the Basic Research Foundation of the Tsinghua National Laboratory for Information Science and Technology (TNList), Intel/University Sponsored Research, the National Key Basic Research and Development (973) Program of China (No. 2006CB303100), and the IBM China Research Laboratory
** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-13641106732

Introduction

Multiple-processor systems, such as symmetric multiprocessor systems and cluster systems, are widely used in modern commercial and scientific computing infrastructures. A chip multiprocessor (CMP)[1-3], which integrates multiple processor cores into a single chip, is a promising technique that efficiently exploits the inherent thread-level parallelism of modern workloads. CMP systems share many critical design issues with traditional shared-memory multiprocessor systems, especially the cache-coherence protocols.

The shared bus in a shared-memory multiprocessor system offers a convenient way to maintain cache coherence with snooping mechanisms[4,5]. In a snooping protocol, cache miss requests are broadcast to all the other processors through the bus, and every processor in the system snoops on the bus to pick up the messages addressed to it. Although this broadcast-based protocol is simple and easy to implement, the shared-bus architecture serializes all messages in the system, which limits system scalability.

Directory-based protocols[6,7] were proposed to solve the coherence problem in a different way from snooping mechanisms. They can be applied on unordered interconnects and introduce a global directory that keeps records of the locations of cached copies. In directory-based protocols, cache miss requests are first sent to the directory, and the directory entry is then used to forward those requests to the processors with cached copies.

Several optimized directory-based cache-coherence protocols have been proposed, including the distributed directory[8], the limited directory[8], and the chained directory[9]. With a fixed destination set, the network traffic of a directory-based protocol is much less than that of a snooping protocol, so the protocol is applicable to large-scale shared-memory multiprocessor systems. Unfortunately, directory-based protocols suffer from longer latency for cache-to-cache transfer misses.

The token protocol[10,11] directly sends broadcasts on unordered interconnects, thus avoiding the indirection that the directory-based protocol imposes on cache-to-cache misses. Unfortunately, token protocols broadcast requests to the maximum destination set, which creates heavy network traffic.

The sharing relation cache (SRC)[12] is an efficient technique to reduce network traffic in broadcast-based protocols. The SRC provides destination set information for cache-to-cache miss requests by caching directory information for recently shared data. Unlike the directory protocol, the SRC keeps directory information only for shared data in each processor node instead of a global directory structure for all data. Shared data accesses among multiple processors show temporal locality in many parallel applications[12], so the SRC can achieve fairly high hit rates with minimal cache space consumption.

A similar approach to reducing network traffic is the destination-set prediction technique[13]. This technique predicts the destination set to be one of the following three kinds: the owner node (with newly added logic to record the owner node), the maximal destination set (all processor nodes), or the minimal destination set (from directory entries). In the token protocol, this technique can only predict the owner node and the maximal destination set, so it reduces only the read request traffic, with no reduction in the write request traffic.

This paper integrates the SRC technique into the token protocol to reduce network traffic; the resulting protocol is referred to as the token-SRC protocol. The implementation adds an SRC to each processor core. On each cache miss, the processor core first looks in the SRC for a destination set. If an SRC entry matching the memory data location is found, the cache miss request is sent to the destination set denoted by the SRC entry. Otherwise, the system broadcasts the request to all processor cores.

A preliminary evaluation showed that, for the SPLASH-2 parallel benchmarks on a 16-core CMP, the token-SRC protocol achieved an average 15% reduction in network traffic per cache miss compared with the classical token protocol.

1 CMP System Design with SRC

1.1 Architecture of CMP system with SRC

The CMP architecture with the SRC is shown in Fig. 1. Each processor core owns a private SRC, which records directory information for data shared between this processor and the other processor cores.

Fig. 1 CMP architecture with SRC

On a read or write operation, the processor first looks in its private cache. If the load or store cannot be satisfied by the local cache, the processor sends requests to other processors or memory nodes. In a cache coherence protocol with the SRC, before issuing these requests, the processor searches the SRC to find which processors own valid data copies. If a matching entry is found in the processor's SRC, the processor sends requests to the destination set pointed to by the SRC entry. Otherwise, requests are broadcast to all processors.

An important optimization is to remove the SRC lookup from the critical memory access path. The SRC lookup can be designed to run in parallel with the normal data cache lookup, so that when the data cache completes its lookup, the SRC lookup result is also available.

The SRC is organized just like a normal data cache. Each SRC entry has three fields: valid, tag, and sharer. The sharer field records the identities of the processors that share the data with the host processor. The SRC address lookup process is the same as for a normal data cache.
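To make the organization concrete, the following C++ sketch shows one plausible layout of a 4-way set-associative SRC with the three fields described above, using a bit-vector sharer field for a 16-core CMP. It is a minimal sketch: the set count, field widths, and class interface are illustrative assumptions rather than parameters taken from the paper.

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <optional>

constexpr int kCores = 16;  // processor cores in the CMP
constexpr int kSets  = 16;  // number of SRC sets (illustrative)
constexpr int kWays  = 4;   // 4-way set associative, matching Table 3

struct SrcEntry {
    bool valid = false;            // valid field
    std::uint64_t tag = 0;         // tag field (block address tag)
    std::bitset<kCores> sharer;    // sharer field: cores sharing the block
};

class SharingRelationCache {
    std::array<std::array<SrcEntry, kWays>, kSets> sets_{};
public:
    // Lookup works like a normal data cache lookup: index with the low-order
    // bits of the block address, then compare tags within the selected set.
    std::optional<std::bitset<kCores>> lookup(std::uint64_t blockAddr) const {
        const auto& set = sets_[blockAddr % kSets];
        for (const auto& e : set) {
            if (e.valid && e.tag == blockAddr / kSets) {
                return e.sharer;   // hit: destination set for the miss request
            }
        }
        return std::nullopt;       // miss: caller falls back to broadcast
    }
};
```

Because the lookup is a plain indexed tag match, it can run in parallel with the data cache lookup, as the optimization above requires.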


1.2 Correctness substrate and performance consideration

In broadcast-based protocols (such as the snooping protocol and the token protocol), the destination set of a cache miss request includes all processor nodes; this is the maximum destination set. In directory-based protocols, the destination set of a cache miss request is the minimum destination set: one owner processor for a read request, and all processors with valid cached copies for a write request. The minimum destination set consists of all processors whose participation is necessary to satisfy the cache miss request; if any of them does not receive the cache miss request, the request cannot be satisfied. Thus, for correctness, an SRC-based protocol must ensure that every data location either has no entry in the SRC or that the destination set denoted by its SRC entry is a superset of the minimum processor set. Figure 2 illustrates the inclusion relationships among the destination sets of broadcast-based protocols, directory-based protocols, and SRC-based protocols. Compared with the broadcast-based protocol, the SRC-based protocol removes processors that have no valid cached copies from the destination set, which maintains correctness and reduces network traffic.

Fig. 2 Correct destination set with SRC

The size of the destination set determines how many messages are sent for one cache miss request. A small destination set means fewer messages and less network traffic. Thus, the SRC design objectives are to make the SRC hit rate as high as possible and to make the destination set on an SRC hit as small as possible.

2 Implementation of Token-SRC Protocol

2.1 Correctness substrate and performance consideration

The token protocol extends broadcast protocols from an ordered network to an unordered network while keeping the safety property by enforcing the coherence invariant of a single writer or multiple readers. In the token protocol, a processor is allowed to read a data block only when it holds at least one token for the block, and to write a data block only when it holds all tokens for the block. The token-counting mechanism ensures that conflicts cannot break the coherence invariant, though it does not ensure that a request will eventually be satisfied (for example, in a write conflict). When a processor detects potential starvation, the token protocol initiates a persistent request. The token protocol activates at most one persistent request per block, using a fair-arbitration mechanism, which ensures that all conflicting requests are eventually satisfied and completed in order.

The SRC can be integrated with the token protocol with no extra correctness machinery, because the token protocol already provides the correctness substrate. When a request is not sent to a superset of the minimum destination set, it will not collect enough tokens. The token protocol classifies that case as starvation, and the request is eventually satisfied by issuing a persistent request. Figure 3 illustrates the inclusion relationships among the maximum destination set of the broadcast-based protocol, the minimum destination set of the directory-based protocol, and an arbitrary destination set of the token-SRC protocol. In conclusion, any destination set stored in the SRC is safe in the token-SRC protocol.

Fig. 3 Correct destination set in token-SRC protocol

Regardless of which destination set is stored in the SRC, the token protocol keeps the system correct. However, overall system performance degrades when the SRC destination set is not a superset of the minimum destination set, because resolving starvation is rather time-consuming. Therefore, token-SRC protocol designs should avoid starvation: for good performance, the destination set denoted by the SRC should be a superset of the minimum destination set while staying as close to it as possible.
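The correctness argument can be condensed into a short sketch. The fragment below restates the token-counting rule and the request-sending fallback chain described above; the enum and function names are hypothetical, not from the paper.

```cpp
constexpr int kTokensPerBlock = 16;  // one token per core (a common choice)

// Token-counting safety rule: a processor may read a block only while holding
// at least one of its tokens, and may write it only while holding all of them.
bool mayRead(int tokensHeld)  { return tokensHeld >= 1; }
bool mayWrite(int tokensHeld) { return tokensHeld == kTokensPerBlock; }

// Request sending in the token-SRC protocol: an SRC hit narrows the request
// to the cached destination set; a miss broadcasts. If a stale SRC entry
// causes too few tokens to arrive, the requester detects starvation and
// issues a persistent request, which restores progress.
enum class SendMode { SrcCast, Broadcast, PersistentRequest };

SendMode chooseSendMode(bool srcHit, bool starvationDetected) {
    if (starvationDetected) return SendMode::PersistentRequest;
    return srcHit ? SendMode::SrcCast : SendMode::Broadcast;
}
```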

2.2 Token-SRC protocol design

The cache controller design for a CMP system using the token-SRC protocol must decide: (1) which states are used to describe a cache block; (2) which events cause cache state transitions; (3) what the SRC records for each cache state; (4) how the cache state transitions (that is, the next cache state for each cache state and each event from the local or remote processors); and (5) how each event is handled (the actions taken in the local and remote processors for each cache state and each event from the local processor).

2.2.1 Cache state definitions

The token-SRC protocol design uses the MOESI protocol, with six states describing a cache block. The M (modified) state means the cache block has been modified. The O (owned) state means the local processor owns the data block, although other processors may have shared copies. The S (shared) state means the local processor has a valid data copy. The E (exclusive) state means that only the local processor has a data copy and this copy has not been modified. The I (invalid) state means that the cache block in the local processor has been invalidated by another processor. The NP (not present) state means that the data block is not in the cache, so it is not a real state saved in the cache block.

2.2.2 Cache event definitions

The token-SRC protocol has four kinds of events coming from the local processor: cache read miss, cache write miss, data cache replacement, and SRC replacement. To handle local events, the processor may generate three kinds of remote events: remote read request, remote write request, and remote data cache replacement. The following cache state transition analysis covers only local events, because their handling subsumes the remote events.

2.2.3 SRC contents definition

The SRC contents are defined in Table 1. Data blocks in the NP state do not exist in the local cache, so the SRC need not save entries for such blocks. For data blocks in the M or E states, read or write requests always hit, so the SRC is never searched for a destination set. Reads of data blocks in the O or S states also hit, but writes to those blocks must invalidate the other valid copies in the CMP system, so the SRC can supply the destination set. For data blocks in the I state, read misses must request the data from the owner processor, and write misses must invalidate all processors with valid cached copies. With the definitions in Table 1, the SRC provides the owner processor for every read miss, but it does not record the full set of valid copies for a block in the I state; thus, a write to a data block in the I state has to broadcast its request.

Table 1  SRC contents for each cache state

  Cache state    SRC contents
  M              None
  O              All processors with valid cached copies
  E              None
  S              All processors with valid cached copies
  I              Owner processor
  NP             None
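Table 1 translates directly into a destination-set selection rule for misses. The following sketch encodes it, reusing the MOESI state names; the optional-based interface and the function name are assumptions made for illustration, and returning no set means the request must be broadcast.

```cpp
#include <bitset>
#include <optional>

constexpr int kCores = 16;
enum class State { M, O, E, S, I, NP };
using DestSet = std::bitset<kCores>;

// Destination-set selection implied by Table 1. Returns a set only when the
// SRC entry can serve the miss; std::nullopt means broadcast is required.
std::optional<DestSet> srcDestination(State blockState, bool isWrite,
                                      const std::optional<DestSet>& entry) {
    if (!entry) return std::nullopt;         // no SRC entry: broadcast
    switch (blockState) {
    case State::O:
    case State::S:
        // Reads hit locally; writes must reach all valid copies, which the
        // SRC records for O- and S-state blocks.
        return isWrite ? entry : std::nullopt;
    case State::I:
        // The SRC keeps only the owner: enough for reads, not for writes.
        return isWrite ? std::nullopt : entry;
    default:
        return std::nullopt;                 // M/E always hit; NP has no entry
    }
}
```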

2.2.4 Cache state transition graph

Cache states may change on each event. Figure 4 shows the state transition graph of the token-SRC protocol, which is that of a traditional MOESI protocol. Events that cause no state transitions are omitted from the figure.

Fig. 4 State transition graph of token-SRC protocol
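For reference, a minimal sketch of the local-event transitions of such a MOESI protocol follows; the helper flag is an illustrative assumption, and remote-event transitions (for example, M to O on a remote read, S to I on a remote write) are omitted.

```cpp
enum class State { M, O, E, S, I, NP };
enum class Event { LoadMiss, StoreMiss, Replace };

// Next state for local events under a traditional MOESI protocol;
// `othersHaveCopy` selects between the S and E outcomes of a load miss.
State nextState(State current, Event e, bool othersHaveCopy) {
    switch (e) {
    case Event::LoadMiss:  return othersHaveCopy ? State::S : State::E;
    case Event::StoreMiss: return State::M;   // the writer collects all tokens
    case Event::Replace:   return State::NP;  // the block leaves the cache
    }
    return current;  // events that cause no transition keep the state
}
```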

2.2.5 Event handling process

The event handling process for each cache state and each event from the local processor consists of three successive phases: the request sending process in the local processor, the request response process in the remote processor, and the response receiving process in the local processor.

The token-SRC protocol introduces some new actions, beyond those of the traditional token protocol, involving the SRC in each phase of the event handling process. First, the request sending process in the local processor adds the SRC lookup: if the SRC hits, requests are sent to the destination set denoted by the SRC; otherwise, requests are broadcast to all processor nodes. Second, the request response process in the remote processor requires two changes: if the remote processor is the owner of a data block, it searches its own SRC for the sharers and sends the sharer list back together with the data and tokens; in addition, the remote processor updates its SRC according to the request type and its own cache block state. Finally, in the response receiving process, the local processor updates its SRC according to the response message type and its own cache block state.

Figure 5 shows the general event handling process. The handling differs for each event and each cache state. Table 2 lists all the SRC actions in the event handling process of the token-SRC protocol. In the table, "-" denotes no action, and "SRC-cast" refers to the request sending mode in which, if the SRC hits, the request is sent to the destination set denoted by the SRC, and otherwise the request is broadcast to the whole destination set. For the request sending and response receiving columns, the cache state is that of the block in the local processor P; for the request response column, it is that of the block in the remote processor Q.

Fig. 5 General event handling process for the token-SRC protocol

Table 2  Event handling process for the token-SRC protocol

Event: Cache read
  State  Request sending in P  Request response in Q                                            Response receiving in P
  NP     Broadcast             -                                                                Add sharer to SRC in P
  M      -                     Send data back with sharer {Q}; build SRC in Q to be {P}         -
  O      -                     Search SRC and send data back with sharer; add {P} to SRC in Q   -
  E      -                     Send data back with sharer {Q}; build SRC in Q to be {P}         -
  S      -                     Add {P} to SRC in Q                                              -
  I      SRC-cast              -                                                                Add sharer to SRC in P

Event: Cache write
  NP     Broadcast             -                                                                -
  M      -                     Build SRC in Q to be {P}                                         -
  O      SRC-cast              Build SRC in Q to be {P}                                         Delete SRC in P
  E      -                     Build SRC in Q to be {P}                                         -
  S      SRC-cast              Build SRC in Q to be {P}                                         Delete SRC in P
  I      Broadcast             Build SRC in Q to be {P}                                         Delete SRC in P

Event: Data cache replacement
  NP     -                     -                                                                -
  M      Broadcast             -                                                                -
  O      SRC-cast              Delete {P} from SRC in Q                                         Delete SRC in P
  E      -                     -                                                                -
  S      SRC-cast              Delete {P} from SRC in Q                                         Delete SRC in P
  I      -                     Delete {P} from SRC in Q                                         -

Event: SRC replacement
  All    -                     -                                                                -
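As an illustration of the second phase, the sketch below computes the remote processor Q's new SRC entry after it responds to a request from processor P, following the request response column of Table 2. The function and its argument types are hypothetical.

```cpp
#include <bitset>

constexpr int kCores = 16;
enum class State { M, O, E, S, I, NP };
using DestSet = std::bitset<kCores>;

// New SRC entry in remote processor Q after responding to P's request,
// per the "request response in processor Q" column of Table 2.
DestSet updatedSrcInQ(State qState, bool isWrite, int p, DestSet current) {
    DestSet onlyP;
    onlyP.set(p);
    if (isWrite) {
        // After a remote write, P holds the only valid copy, so Q rebuilds
        // its sharing relation for the block as {P}.
        return onlyP;
    }
    switch (qState) {                // remote read request:
    case State::M:
    case State::E: return onlyP;     // Q was the sole holder; now P shares
    case State::O:
    case State::S: current.set(p);   // Q adds P to its recorded sharers
                   return current;
    default:       return current;   // I/NP: Q takes no SRC action
    }
}
```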

3 Simulation Environment

3.1 Target system

The evaluated system is a 16-core SPARC CMP running unmodified Solaris 8. Each processor core is a simple in-order processor with split first-level instruction and data caches and a local second-level cache. Table 3 lists the memory system parameters of the target system.

Table 3  Memory system parameters of the target system

  L1 instruction cache      64 KB, 4-way
  L1 data cache             64 KB, 4-way
  L2 cache (private)        16 MB, 4-way
  Memory                    4 GB, 16 banks
  Block size                64 B
  Network topology          2-D torus
  On-chip link latency      1 cycle
  Out-of-chip link latency  40 cycles
  SRC                       4 KB, 4-way

A 2-D torus topology was used to interconnect the 16 processor nodes, with an on-chip (processor-to-processor) link latency of 1 ruby cycle and an out-of-chip (processor-to-directory) link latency of 40 ruby cycles. A sketch of the interconnection network topology is given in Fig. 6. P0, ..., P15 represent the 16 processor cores, each with private L1 and L2 caches; D0, ..., D15 are the 16 memory banks.
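For reasoning about message cost on this interconnect, a small hypothetical helper for hop counts on the 4x4 torus might look as follows; it assumes minimal, dimension-ordered routing, which the paper does not specify.

```cpp
#include <algorithm>
#include <cstdlib>

constexpr int kDim = 4;  // the 16 cores form a 4x4 grid with wrap-around links

// Hop count between two processor nodes of the 2-D torus.
int torusHops(int a, int b) {
    auto ring = [](int from, int to) {
        int d = std::abs(from - to);
        return std::min(d, kDim - d);  // the wrap-around link may be shorter
    };
    return ring(a % kDim, b % kDim) + ring(a / kDim, b / kDim);
}
```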

Fig. 6 Interconnection network topology

3.2 Simulation method

The target system was simulated with the General Execution-driven Multiprocessor Simulator (GEMS)[14], an open-source simulator developed by the Wisconsin Multifacet project. GEMS provides a set of modules for Virtutech Simics, a full-system multiprocessor simulator[15]. It extends Simics with detailed processor, memory hierarchy, and interconnection network models to compute execution times, enabling detailed simulation of multiprocessor systems, including CMPs. The token protocol and the directory protocol are already implemented in GEMS, so only the token protocol needed to be extended to integrate the SRC. All three protocols were implemented on the same target system and used the same MOESI protocol to describe the cache state.

The SPLASH-2[16] parallel benchmark suite was used as the workload. SPLASH-2 provides a suite of shared-memory benchmarks for parallel systems, with a set of kernel and application components. A representative subset of SPLASH-2 was selected as the workloads. Table 4 lists the input datasets for the eight selected benchmarks: four kernels and four applications. The kernel benchmark LU is run with both contiguous and non-contiguous blocks, and the application benchmark OCEAN with both contiguous and non-contiguous partitions, giving ten workloads in all.

Table 4  Execution parameters of SPLASH-2

  Benchmark       Parameters
  FFT             -p16 (1024 data points)
  RADIX           -p16 (256x1024 integers, 1024 radix)
  LU(Con)         -p16 (512x512 matrix, 16x16 blocks)
  LU(NonCon)      -p16 (512x512 matrix, 16x16 blocks)
  CHOLESKY        -p16 inputs/tk15.O
  BARNES          -p16 (8x1024 nbody)
  OCEAN(Con)      -n258 -p16 (258x258 ocean)
  OCEAN(NonCon)   -n258 -p16 (258x258 ocean)
  RADIOSITY       room -ae5000 -en0.05 -bf0.1 -batch -p16
  RAYTRACE        -p16 inputs/teapot.env

Though GEMS provides a detailed out-of-order processor model, opal, the Simics in-order processor driver was used for fast simulation, together with ruby, the detailed memory hierarchy simulation module. To further accelerate simulation, statistics were collected only for the parallel part of each workload rather than for the complete workload.

4 Simulation Results

The execution time, request traffic, and network traffic are compared for the directory protocol, the token protocol, and the token-SRC protocol. The SRC hit rates are also analyzed.

4.1 Workload execution time

The workload execution time is measured in ruby cycles in GEMS. Figure 7 shows the execution times of the token protocol and the token-SRC protocol normalized to that of the directory protocol. The results show that the execution times of the three protocols are similar for the SPLASH-2 benchmarks.

Fig. 7 Normalized execution time (16-core CMP)

The token protocol's advantage of short cache-to-cache miss latency did not translate into higher execution speeds in these tests because of frequent data conflicts: in the token protocol simulations, 16% of all requests on average were long-latency persistent requests. The token protocol used in this simulation targets a flat CMP architecture[10]; the protocol was not tested on a multiple-CMP system[11], where the higher data locality of the token protocol might achieve higher execution speeds.

4.2 Network traffic

Network traffic is measured as the amount of information delivered in the network per time unit. Here, the amount of information delivered in the network is measured in total network message bytes, and the time unit is measured in cache misses:

network traffic = total network message bytes / cache misses
                = (request messages x 8 + data response messages x 72 + data-sharer response messages x 76) / cache misses

In the token-SRC protocol, all request messages are 8 B, data responses are 72 B (64 B of data with an 8-B header), and data responses with a sharer list are 76 B (64 B of data and 4 B of sharers with an 8-B header). The SRC reduces the number of request messages in the token-SRC protocol, which reduces the network traffic. The request messages per cache miss are analyzed first, and then the network traffic.

Figure 8 displays the number of request messages per cache miss for the three protocols. Requests include forwarded requests, retried requests, and persistent requests. In the statistics, a request sent to k destination nodes is counted as k messages.
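The accounting above is simple enough to restate as a helper; the sketch below applies the paper's formula with the stated message sizes (the function name is illustrative).

```cpp
// Network traffic per cache miss in bytes: 8-B requests, 72-B data responses,
// and 76-B data responses carrying a sharer list.
double trafficBytesPerMiss(long requests, long dataResponses,
                           long dataSharerResponses, long cacheMisses) {
    return (requests * 8.0 + dataResponses * 72.0 +
            dataSharerResponses * 76.0) / cacheMisses;
}
```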

Fig. 8 Request messages per miss (16-core CMP)

The directory protocol issues 2.16 request messages per cache miss on average over all ten benchmarks. The token protocol has a much higher bandwidth usage, with 20.27 request messages per cache miss on average. The token-SRC protocol issues 13.85 request messages per cache miss on average; compared with the token protocol, it reduces request traffic by 32% on average.

Figure 9 shows the normalized network traffic per cache miss for the three protocols, with the network traffic of the directory protocol normalized to 1. For the given system configuration, the token protocol uses about 66% more interconnect bandwidth on average than the directory protocol, while the token-SRC protocol uses about 42% more. Thus, the token-SRC protocol generates about 15% less network traffic than the token protocol (1.42 versus 1.66 relative to the directory protocol, i.e., (1.66 - 1.42)/1.66, about 15%). Because the 72-B point-to-point response messages are much larger than the 8-B request messages, the token-SRC protocol obtains only a 15% reduction in network traffic per cache miss even though it issues 32% fewer request messages than the token protocol.

Fig. 9 Network traffic per miss (16-core CMP)

4.3 SRC hit rate and average sharer size

The effect of the SRC on reducing the number of request messages in the token-SRC protocol depends on the SRC hit rate and the average sharer size on an SRC hit. Figure 10 shows both for the token-SRC protocol. The SRC hit rate varies from 68% (BARNES) to 94% (OCEAN with non-contiguous partitions), with an average of 88%. The SRC hit rate is determined by the organization and replacement policies of the normal data cache and the SRC. The average sharer size varies from 1.69 to 2.33 across the workloads, with an average of 2.07.

Fig. 10 SRC hit rate and average sharer size (16-core CMP)

Therefore, the SPLASH-2 workloads have high SRC hit rates and issue fewer request messages on SRC hits, which should reduce the number of request messages dramatically. Unfortunately, many requests in the token-SRC protocol are broadcast directly without searching the SRC, such as read or write requests for non-cached data blocks (NP state), write requests for invalidated data blocks (I state), and persistent requests.

5 Conclusions and Future Work

This paper introduces the sharing relation cache into the token protocol to reduce the network traffic in multiprocessor systems, especially in chip multiprocessor systems. Evaluations based on the SPLASH-2 benchmarks show that the token-SRC protocol reduces interconnection network traffic by 15% on average relative to the token protocol.

The SRC technique can be developed further. More tests are needed to analyze how the SRC organization, SRC size, data block size, and normal cache replacement policy affect system performance and network traffic. Tests are also needed to evaluate whether the current token-SRC implementation creates more persistent requests and whether other SRC implementation optimizations would be more effective, especially for write miss cases. Finally, a token-independent cache coherence protocol using the SRC could be implemented and evaluated.

References

[1] Hammond L, Nayfeh B, Olukotun K. A single-chip multiprocessor. IEEE Computer, 1997, 30(9): 79-85.

[2] Olukotun K, Nayfeh B, Hammond L, Wilson K, Chung K. The case for a single-chip multiprocessor. In: Proceedings of the Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII). Cambridge, MA, USA, 1996: 2-11.

[3] Hammond L, Hubbert B, Siu M, Prabhu M, Chen M, Olukotun K. The Stanford Hydra. IEEE Micro, 2000, 20(2): 71-84.

[4] Goodman J R. Using cache memory to reduce processor-memory traffic. In: Proceedings of the Int'l Symp. on Computer Architecture. Stockholm, Sweden: IEEE Computer Society, 1983: 124-131.

[5] Katz R, Eggers S, Wood D, Perkins C, Sheldon R. Implementing a cache consistency protocol. In: Proceedings of the Int'l Symp. on Computer Architecture. Boston, MA, USA: IEEE Computer Society, 1985: 276-283.

[6] Tang C K. Cache design in the tightly coupled multiprocessor system. In: AFIPS Conference Proceedings of the National Computer Conference. New York, USA, 1976: 749-753.

[7] Censier L M, Feautrier P. A new solution to coherence problems in multicache systems. IEEE Transactions on Computers, 1978, C-27(12): 1112-1118.

[8] Agarwal A, Simoni R, Hennessy J, Horowitz M. An evaluation of directory schemes for cache coherence. In: Proceedings of the Int'l Symp. on Computer Architecture. Honolulu, Hawaii, USA: IEEE Computer Society, 1988: 353-362.

[9] James D, Laundrie A, Gjessing S, Sohi G. Distributed-directory scheme: Scalable coherent interface. IEEE Computer, 1990, 23(6): 74-77.

[10] Martin M, Hill M, Wood D. Token coherence: Decoupling performance and correctness. In: Proceedings of the Int'l Symp. on Computer Architecture. San Diego, California, USA: IEEE Computer Society, 2003: 182-193.

[11] Marty M, Bingham J, Hill M, Hu A, Martin M, Wood D. Improving multiple-CMP systems using token coherence. In: Proceedings of the Int'l Symp. on High-Performance Computer Architecture. San Francisco, CA, USA: IEEE Computer Society, 2005: 328-339.

[12] Wang Haixia, Wang Dongsheng, Li Peng. SRC-based cache coherence protocol in chip multiprocessor. In: Proceedings of the Japan-China Joint Workshop on Frontier of Computer Science and Technology. Fukushima, Japan: IEEE Press, 2006: 60-67.

[13] Martin M, Harper P, Sorin D, Hill M, Wood D. Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors. In: Proceedings of the Int'l Symp. on Computer Architecture. San Diego, California, USA: IEEE Computer Society, 2003: 206-217.

[14] Martin M, Sorin D, Beckmann B, Marty M, Xu M, Alameldeen A, Moore K, Hill M, Wood D. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. ACM SIGARCH Computer Architecture News, 2005, 33(4): 92-99.

[15] Magnusson P, Christensson M, Eskilsson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B. Simics: A full system simulation platform. IEEE Computer, 2002, 35(2): 50-58.

[16] Woo S, Ohara M, Torrie E, Singh J, Gupta A. The SPLASH-2 programs: Characterization and methodological considerations. In: Proceedings of the Int'l Symp. on Computer Architecture. Santa Margherita Ligure, Italy: IEEE Computer Society, 1995: 24-36.
