Application Driven Optimization of VLIW Architectures: a Hardware–Software Approach Alberto Ferrante
Giuseppe Piscopo, Stefano Scaldaferri
DTI,
ALaRI Institute,
University of Milan
University of Lugano
E-mail:
[email protected]
E-mail: {piscopo, scaldaferri}@a.alari.ch
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.1/23
Presentation Outline 1. Motivations and Goals; 2. Optimization Steps; 3. Case Study; 4. Conclusions and Future Work.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.2/23
1/4. Motivations and Goals •
We want to optimize an architecture for a specific task: • a detailed description of a methodology is lacking; • it is well known in industrial environment; • goal is to provide a step-by-step description through
a case study.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.3/23
2/4. Main Optimization Steps (1) •
Preliminary Analysis: • profiling; • preliminary simulations;
•
design space and cost function definition;
•
simulations and study of the results obtained.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.4/23
2/4. Main Optimization Steps (2) •
Addictional steps: • code tuning for the optimized architecture; • custom instructions.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.5/23
3/4. Case Study •
We considered the Imaging Pipeline benchmark: • provided by HP Labs; • main operations: • decompression of a jpeg image into a RGB one; • resizing of RGB image; • color-space conversion from RGB to CMYK ; • dithering to obtain a half-toned image.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.6/23
3/4. C.S.: Reference System (1) •
VLIW architecture;
•
VEX toolset: • developed by HP Labs; • compiler simulator derived from the Multiflow C
compiler ; • clustered VLIW processors, HP Lx ISA.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.7/23
3/4. C.S.: Reference System (2) •
Architectural parameters: • visible to the compiler; • num. of clusters, issue width, data cache ports,
num. of reg. files, gen. purpose/branch regs, int ALUs/MULs.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.8/23
3/4. C.S.: Reference System (3) •
Microarchitectural parameters: • not visible to the compiler; • Dcache and Icache size, associativity, line size,
miss penalty.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.9/23
3/4. C.S.: Preliminary Analysis (1) Dynamic profiling: Relative execution time
70% rgb_to_cmyk_and_dither
60%
interpolate_x interpolate_y
50%
write_ppm_row
40% 30% 20% 10% 0% Routines
we focus on rgb_to_cmyk_and_dither University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.10/23
3/4. C.S.: Preliminary Analysis (2) 3,50 rgb_to_cmyk_and_dither interpolate_x interpolate_y write_ppm_row
3,00
ILP
2,50 2,00 1,50 1,00 0,50 0,00 Routines
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.11/23
3/4. C.S.: Design Space Definition (1) Preliminary simulations: relevant parameters? Architectural configurations Parameter
Range
Combin.
4, 8, 16, 32
4
1, 2, 4
3
ALU per cluster
n
1
MemLoad (= p)
n/2, n
2
MemStore
p
1
MemPorts per cluster
p
1
Multipliers per cluster
2
1
Issue Width (= n) # clusters
University of Lugano ALaRI
Total Architectural Combinations
24
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.12/23
3/4. C.S.: Design Space Definition (2) Microarchitectural configurations Parameter
Explored Range
Combin.
DCache Size (kbyte)
32, 64, 128
3
DCache Associativity
4, 8
2
32, 64
2
ICache Size (kbyte)
32, 64, 128
3
ICache Associativity
1
1
ICache Line Size (bit)
64
1
DCache Line Size (bit)
Total Microarchitectural Combinations
36
University of Lugano ALaRI
Total: 864 configurations DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.13/23
3/4. C.S.: Design Space Definition (3) Default architecture Machine Model Parameters Number of Clusters
1
Issue Width
4
Latencies of Main Operations (cycles)
Data Cache Ports
1
Load
2
Register Files
2
Store
0
General-Purpose Registers (32 bit)
64
ALU
0
Branch Registers (1 bit)
8
MPY
1
Integer ALU
4
Integer Multiplier
2
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.14/23
3/4. C.S.: Cost Function • A: area estimation; • Adef : area estimation of default architecture;
A T C(A, T ) = × Adef Tdef
• T : simulated time (in clock cycles); • Tdef : simulated time for the default architecture.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.15/23
3/4. C.S.: Area Estimation A = Amem + Ncluster (Areg + Nalu Aalu + Nmul Amul ) •
measured in register bit equivalent area (rbe);
•
related to the underlying technology.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.16/23
3/4. C.S.: Design Space Exploration (1) Results of the simulations: 1,1
Relative cycles
1 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0
10
20
30
40
50
60
70
Relative cost University of Lugano ALaRI
Overall simulation time: 8 hours.
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.17/23
3/4. C.S.: Design Space Exploration (2) Zoomed version of the DSE results: 1,1
Relative cycles
1,0 0,9 0,8 0,7 0,6 0,5 0,4 0,3 University of Lugano
0
2
4
6
8
ALaRI
Relative cost
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.18/23
3/4. C.S.: Optimal Point •
Previous analysis allowed to identify an optimal point among the explored ones;
•
benchmark runs almost 3 times faster than on the base architecture;
•
cost is 3 times higher than base configuration.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.19/23
3/4. C.S.: Code Optimization •
to improve performance with the optimal architecture;
•
original code is already highly optimized;
•
providing compiler directives for selection loop unrolling of 16 furtherly improves performance by 5%.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.20/23
3/4. C.S.: Custom Instructions •
Custom instructions have been introduced for the most time consuming parts ot the code: • 4 instructions like: int dx_c=(ex_c*7+ey_c*5+eyp_c*3+eym_c)/16;
• were transformed into: int dx_c=(_DITH75(ex_c,ey_c)+_DITH31(eyp_c,eym_c))/16;
•
performance are improved by another 6%.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.21/23
4/4. Conclusions •
Through an example, we have shown a methodology allowing to optimize an architecture for a specific task;
•
we have obtained an optimal performance–area tradeoff point in a reasonable amount of time: • different tradeoffs can be optimized, by changing
the cost function. University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.22/23
4/4. Future Work •
Study of a model for evaluating the influence of architectural changes on the clock cycle time is ongoing.
University of Lugano ALaRI
DTI
03/08/2005, RTAS 2005 – A. Ferrante – Application Driven Optimization of VLIW Architectures – p.23/23