Application Driven Optimization of VLIW ... - Semantic Scholar

1 downloads 320 Views 788KB Size Report
num. of clusters, issue width, data cache ports, num. of reg. files, gen. purpose/branch regs, int. ALUs/MULs. 03/08/2005, RTAS 2005 – A. Ferrante – Application ...
Application Driven Optimization of VLIW Architectures: a Hardware–Software Approach Alberto Ferrante

Giuseppe Piscopo, Stefano Scaldaferri

DTI,

ALaRI Institute,

University of Milan

University of Lugano

E-mail: [email protected]

E-mail: {piscopo, scaldaferri}@a.alari.ch

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.1/23

Presentation Outline 1. Motivations and Goals; 2. Optimization Steps; 3. Case Study; 4. Conclusions and Future Work.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.2/23

1/4. Motivations and Goals •

We want to optimize an architecture for a specific task: • a detailed description of a methodology is lacking; • it is well known in industrial environment; • goal is to provide a step-by-step description through

a case study.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.3/23

2/4. Main Optimization Steps (1) •

Preliminary Analysis: • profiling; • preliminary simulations;



design space and cost function definition;



simulations and study of the results obtained.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.4/23

2/4. Main Optimization Steps (2) •

Addictional steps: • code tuning for the optimized architecture; • custom instructions.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.5/23

3/4. Case Study •

We considered the Imaging Pipeline benchmark: • provided by HP Labs; • main operations: • decompression of a jpeg image into a RGB one; • resizing of RGB image; • color-space conversion from RGB to CMYK ; • dithering to obtain a half-toned image.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.6/23

3/4. C.S.: Reference System (1) •

VLIW architecture;



VEX toolset: • developed by HP Labs; • compiler simulator derived from the Multiflow C

compiler ; • clustered VLIW processors, HP Lx ISA.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.7/23

3/4. C.S.: Reference System (2) •

Architectural parameters: • visible to the compiler; • num. of clusters, issue width, data cache ports,

num. of reg. files, gen. purpose/branch regs, int ALUs/MULs.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.8/23

3/4. C.S.: Reference System (3) •

Microarchitectural parameters: • not visible to the compiler; • Dcache and Icache size, associativity, line size,

miss penalty.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.9/23

3/4. C.S.: Preliminary Analysis (1) Dynamic profiling: Relative execution time

70% rgb_to_cmyk_and_dither

60%

interpolate_x interpolate_y

50%

write_ppm_row

40% 30% 20% 10% 0% Routines

we focus on rgb_to_cmyk_and_dither University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.10/23

3/4. C.S.: Preliminary Analysis (2) 3,50 rgb_to_cmyk_and_dither interpolate_x interpolate_y write_ppm_row

3,00

ILP

2,50 2,00 1,50 1,00 0,50 0,00 Routines

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.11/23

3/4. C.S.: Design Space Definition (1) Preliminary simulations: relevant parameters? Architectural configurations Parameter

Range

Combin.

4, 8, 16, 32

4

1, 2, 4

3

ALU per cluster

n

1

MemLoad (= p)

n/2, n

2

MemStore

p

1

MemPorts per cluster

p

1

Multipliers per cluster

2

1

Issue Width (= n) # clusters

University of Lugano ALaRI

Total Architectural Combinations

24

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.12/23

3/4. C.S.: Design Space Definition (2) Microarchitectural configurations Parameter

Explored Range

Combin.

DCache Size (kbyte)

32, 64, 128

3

DCache Associativity

4, 8

2

32, 64

2

ICache Size (kbyte)

32, 64, 128

3

ICache Associativity

1

1

ICache Line Size (bit)

64

1

DCache Line Size (bit)

Total Microarchitectural Combinations

36

University of Lugano ALaRI

Total: 864 configurations DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.13/23

3/4. C.S.: Design Space Definition (3) Default architecture Machine Model Parameters Number of Clusters

1

Issue Width

4

Latencies of Main Operations (cycles)

Data Cache Ports

1

Load

2

Register Files

2

Store

0

General-Purpose Registers (32 bit)

64

ALU

0

Branch Registers (1 bit)

8

MPY

1

Integer ALU

4

Integer Multiplier

2

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.14/23

3/4. C.S.: Cost Function • A: area estimation; • Adef : area estimation of default architecture;

A T C(A, T ) = × Adef Tdef

• T : simulated time (in clock cycles); • Tdef : simulated time for the default architecture.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.15/23

3/4. C.S.: Area Estimation A = Amem + Ncluster (Areg + Nalu Aalu + Nmul Amul ) •

measured in register bit equivalent area (rbe);



related to the underlying technology.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.16/23

3/4. C.S.: Design Space Exploration (1) Results of the simulations: 1,1

Relative cycles

1 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0

10

20

30

40

50

60

70

Relative cost University of Lugano ALaRI

Overall simulation time: 8 hours.

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.17/23

3/4. C.S.: Design Space Exploration (2) Zoomed version of the DSE results: 1,1

Relative cycles

1,0 0,9 0,8 0,7 0,6 0,5 0,4 0,3 University of Lugano

0

2

4

6

8

ALaRI

Relative cost

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.18/23

3/4. C.S.: Optimal Point •

Previous analysis allowed to identify an optimal point among the explored ones;



benchmark runs almost 3 times faster than on the base architecture;



cost is 3 times higher than base configuration.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.19/23

3/4. C.S.: Code Optimization •

to improve performance with the optimal architecture;



original code is already highly optimized;



providing compiler directives for selection loop unrolling of 16 furtherly improves performance by 5%.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.20/23

3/4. C.S.: Custom Instructions •

Custom instructions have been introduced for the most time consuming parts ot the code: • 4 instructions like: int dx_c=(ex_c*7+ey_c*5+eyp_c*3+eym_c)/16;

• were transformed into: int dx_c=(_DITH75(ex_c,ey_c)+_DITH31(eyp_c,eym_c))/16;



performance are improved by another 6%.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.21/23

4/4. Conclusions •

Through an example, we have shown a methodology allowing to optimize an architecture for a specific task;



we have obtained an optimal performance–area tradeoff point in a reasonable amount of time: • different tradeoffs can be optimized, by changing

the cost function. University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.22/23

4/4. Future Work •

Study of a model for evaluating the influence of architectural changes on the clock cycle time is ongoing.

University of Lugano ALaRI

DTI

03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.23/23