Hardware Support to Reduce Overhead in Fine-Grain Media Codes

Deependra Talla, Lizy K. John, and Doug Burger

Technical Report, Laboratory for Computer Architecture, The University of Texas at Austin

Abstract

The growing importance of media and media-like codes has caused general-purpose processors to incorporate SIMD-like extensions, such as MMX, SSE, and AltiVec. While these media extensions do improve performance, significant parallelism in these codes remains unexploited. In this paper, we propose and evaluate a programmable loop engine (PLE for short) that executes media codes and kernels efficiently by moving most of the overhead associated with the media programs into hardware. We study a range of media codes to determine which features and characteristics of these programs are sufficiently frequent to merit implementation in hardware. The common classes of operations that we find are multiple nested loop control, address generation, data transformations, and streaming memory operations. We quantify the fraction of instructions that can be eliminated for our benchmarks, showing that it is over 65% on average. We evaluate a design of one PLE, showing how it can be integrated cleanly with a general-purpose pipeline, defining an instruction set interface, and quantifying its area and power overhead with a VHDL implementation. We find that the PLE requires a mere 10% of the area required by the MMX and SSE extensions. Furthermore, the PLE improves performance of a set of media kernels, over that of a 4-way SIMD processor, by more than 7X on average, and accelerates a set of five media applications by an average of 54%. For all kernels and 3 of 5 applications, a 4-way processor with the PLE outperforms an 8-way SIMD processor without the PLE. For all the kernels and 2 of the 5 applications, a 4-way processor enhanced with the PLE outperforms even a 16-way processor.

This research was supported in part by a State of Texas Advanced Technology program grant. L. K. John is also partially supported by the National Science Foundation under grants EIA-9807112 and ECS-0113105 and by Dell, Intel, Microsoft, Tivoli, Motorola and IBM Corporations. D. Burger is supported by a grant from the Intel Research Council, an NSF CAREER Award, an IBM University Partnership Award, and a Sloan Foundation Fellowship.

1 Introduction

The growing importance of audio and video applications has resulted in media-specific support in most major instruction set architectures. Since general-purpose processors are not well suited to executing codes with copious fine-grain parallelism, media extensions permit a higher rate of execution within a mostly unmodified general-purpose pipeline. Examples of media extensions are Intel's MMX, SSE1, and SSE2, Sun's VIS, HP's MAX, Compaq's MVI, MIPS's MDMX, and Motorola's AltiVec [1][2]. Such SIMD (single instruction multiple data) extensions have been implemented not only in commercial general-purpose processors, but also in DSP processors such as the TMS320C64 processor from Texas Instruments [3] and the TigerSharc processor from Analog Devices [4].

For many codes, however, the overhead associated with the loops themselves limits the parallelism that can be achieved through SIMD-like media extensions. In this paper, we focus on using hardware support to reduce the time spent on executing the instructions that support the SIMD instructions, rather than focusing on accelerating the computational, fine-grain parallel instructions themselves.

We first analyze a range of media workloads to determine the amount of serial overhead in these applications. We find that more than 75% of instructions that are executed in the media instruction stream are not the computation instructions, but are instructions that are managing data and control for the computation. Amdahl's law dictates that these overhead instructions will limit the fine-grain parallelism that may be extracted from these media codes.

We show an example of this phenomenon in Table 1, which shows the execution time of a one-dimensional discrete cosine transform (1-D DCT) kernel that is used in the JPEG image standard and the MPEG video standard. The 1-D DCT is essentially a multiply of two 8x8 matrices. In a non-MMX implementation, if all of the compute units were fully utilized, the multiplication would require 512 cycles. With MMX, the full utilization of the functional units would complete the DCT in 128 cycles. We ran this code on a Pentium III processor both with and without prefetching, and with and without MMX. We see that while prefetching and MMX both provide performance boosts, the level of parallelism that is exploited is far from ideal. Without MMX, a potential five-fold speedup remains even after prefetching, and with MMX, a more than 13-fold improvement in performance is possible, if the computations were to be executed at the peak computation rate.
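The ideal cycle counts quoted above follow from simple arithmetic, which the sketch below reproduces. It assumes that an 8x8 by 8x8 product costs 8*8*8 multiply-accumulates retired one per cycle, and that MMX handles four 16-bit lanes per instruction; both are our reading of the text rather than something it states. The measured cycle counts are copied from Table 1, and all names are illustrative.

```python
# Measured cycle counts for the 1-D DCT kernel, copied from Table 1.
measured = {
    ("no MMX", "compiler opts"): 3500,
    ("no MMX", "perfect prefetch"): 2737,
    ("MMX", "compiler opts"): 2375,
    ("MMX", "perfect prefetch"): 1578,
}

MACS = 8 * 8 * 8        # multiply-accumulates in an 8x8 by 8x8 matrix product
SIMD_LANES = 4          # assumption: four 16-bit lanes per 64-bit MMX register

ideal = {
    "no MMX": MACS,               # assumption: one multiply-accumulate per cycle
    "MMX": MACS // SIMD_LANES,    # four multiply-accumulates per cycle
}


def headroom(cycles, ideal_cycles):
    """Speedup still available if the kernel ran at the ideal cycle count."""
    return cycles / ideal_cycles


for (isa, setup), cycles in measured.items():
    print(f"{isa:7s} {setup:17s} headroom = {headroom(cycles, ideal[isa]):.2f}x")
```

With the Table 1 numbers, the prefetched runs leave roughly 5.3x (no MMX) and 12.3x (MMX) of headroom under these assumptions.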

Since we are studying media workloads running on general-purpose processors, any added hardware support should support efficient execution of a broad range of media loops. The workload analysis we present in this paper breaks overhead instructions and necessary supporting mechanisms down into four major classes, which cover the vast majority of overhead instructions contained in these media loops: loop processing, address generation, data transformations, and streaming memory accesses. Loop processing involves maintaining the loop indices and issuing the branches that control the loops. Address generation consists of computing addresses for multi-dimensional arrays with various constant strides for each dimensional index. Data transformations reorder data accessed from memory for fast consumption by the SIMD instructions. The major transformations that we found include packing, unpacking, and multicasting of operands.

We evaluate a hardware design called a Programmable Loop Engine (PLE) intended to reduce the time spent executing general media loops. It is a hardware unit that takes a single 140-byte command. It can then execute up to five nested loops, accessing up to four streams of data (three input and one output) with different strides, transforming the data in ways required by the computation algorithm and writing the result stream back. The PLE contains hardware looping, hardware address generation and address transformation, and fetch mechanisms to stream data in and out. The PLE can be cleanly integrated with a superscalar pipeline, requiring minimal control to initiate and terminate hardware-supported loop execution. The single macro-instruction needs to be fetched and decoded only once per nested loop structure, resulting in a substantial reduction of fetch/decode activity and corresponding power savings.

Table 1. 1-D DCT without and with MMX on a Pentium III processor

                                      Pentium III         Pentium III with MMX
                                      Cycles    IPC       Cycles    IPC
Maximum compiler optimizations        3500      1.47      2375      1.04
Perfect memory system/prefetching     2737      1.88      1578      1.56
Full compute utilization              ~512      -         ~128      -

In Section 2 of this paper, we analyze a suite of media applications and kernels to quantify the loop overhead required to support media computation. We decompose that overhead into discrete classes, with the goal of designing a single, simple hardware engine that can provide broad coverage of the applications' media loops. In Section 3, we describe our PLE design, and show how it is programmed by a single large instruction. We also describe how that instruction can be incorporated into a conventional out-of-order processor pipeline. In Section 4, we measure the performance improvements that the PLE provides over a pipeline that already uses SIMD ISA extensions. For a 4-way baseline, we show speedups of 13x, 11x, 8x and 2.3x on the four kernels we evaluate, and performance improvements of 3.5x, 8%, 94%, 61%, and 1% on the five applications we measure. Similar or higher speedups are obtained with 8-way and 16-way baseline configurations. In Section 5, we quantify the area and power overhead of the proposed PLE by analyzing its VHDL implementation, which we also use to validate the timing assumptions made in our simulation environment. In Section 6, we discuss some of the copious related work, and we conclude the paper in Section 7.

2 Analyzing Loop Overheads in Media Code

We list the nine benchmarks that we used to study the characteristics of media code in Table 2. We include the dynamic instruction counts to show that the counts are sufficiently high to amortize cold-start effects in the simulation environment. Our suite includes several common media applications (g711, aud, jpeg, ijpeg, and decrypt) and four common kernels (cfa, dct, motest, and scale). The kernels used are major components in image and video processing standards such as JPEG, MPEG, H.263, etc. Several of these benchmarks are also part of media benchmark suites such as MediaBench [10]. For those benchmarks that are not included in the MediaBench suite, we have made the source code available on-line [9].

Table 2. Description of the multimedia benchmarks

Benchmark   Instruction count   Description
cfa         349 million         Color filter array interpolation of a 2 million pixel image with a 5x5 filter (16-bit data)
dct         160 million         2-D discrete cosine transform of a 2 million pixel image (16-bit data)
motest      136 million         Motion estimation routine on a frame of 2 million pixels (8-bit data)
scale       3 million           Linear scaling of an image of 2 million pixels (8-bit data)
g711        63 million          G.711 speech coding standard (A-law to u-law and vice-versa) conversions on 2 million audio samples (8-bit data)
aud         283 million         Audio effects on 2 million audio samples (echo, signal mixing and filtering) (16-bit data)
jpeg        208 million         JPEG image compression on an 800-by-600 pixel image
ijpeg       136 million         JPEG image de-compression resulting in an 800-by-600 pixel image
decrypt     125 million         IDEA decryption on 192,000 bytes of data

To provide hardware support that can accelerate a wide range of media codes with a simple design, we must first understand what is common across these media codes. An analysis of our benchmark suite led to the following four categories of "computational support" operations: (1) loop processing, (2) address generation, (3) data transformation, and (4) streaming memory accesses. To show an example of the relative frequencies of these categories in a media loop, we show the assembly code of a one-dimensional discrete cosine transform (1-D DCT) routine in Fig. 1. This assembly code is for a Pentium III processor based on the P6 microarchitecture [17], and was compiled using maximum compiler optimizations (including loop unrolling) and intrinsics by the Intel C/C++ compiler version 4.5.

In the assembly code, we have marked the true computation instructions in boldface (they carry the category "Not overhead" in the listing). These instructions perform the computational essence of the 1-D DCT, which are the multiply and the accumulate operations. The rest of the instructions in the loop are overhead/supporting instructions, such as the instructions necessary to perform address computations or to accomplish multiple levels of looping as required by the algorithm. We have broken them down into the four classes of overhead as defined in the previous paragraph. We note that some instructions may fall into more than one of our defined categories, since an operation on a register may be used for both loop management and address calculation.

    Pentium III assembly instruction (MMX code)    Comment                              Category of overhead

            lea     ebx, DWORD PTR [ebp+128]       ; load/address overhead              2
            mov     DWORD PTR [esp+28], ebx        ; load/address overhead              2
    $B1$2:  xor     eax, eax                       ; address overhead                   2
            mov     edx, ecx                       ; address overhead                   2
            lea     edi, DWORD PTR [ecx+16]        ; load/address overhead              2
            mov     DWORD PTR [esp+24], ecx        ; load/address overhead              2
    $B1$3:  movq    mm1, MMWORD PTR [ebp]          ; load overhead                      4
            pxor    mm0, mm0                       ; initialization overhead            3
            pmaddwd mm1, MMWORD PTR [eax+esi]      ; true data parallel computation     Not overhead
            movq    mm2, MMWORD PTR [ebp+8]        ; load overhead                      4
            pmaddwd mm2, MMWORD PTR [eax+esi+8]    ; true data parallel computation     Not overhead
            add     eax, 16                        ; address overhead                   2
            paddw   mm1, mm0                       ; true data parallel computation     Not overhead
            paddw   mm2, mm1                       ; true data parallel computation     Not overhead
            movq    mm0, mm2                       ; load related overhead              4
            psrlq   mm2, 32                        ; SIMD reduction overhead            3
            movd    ecx, mm0                       ; SIMD load overhead                 4
            movd    ebx, mm2                       ; SIMD load overhead                 4
            add     ecx, ebx                       ; SIMD conv. overhead                3
            mov     WORD PTR [edx], cx             ; store overhead                     4
            add     edx, 2                         ; loop overhead                      1
            cmp     edi, edx                       ; branch related overhead            1
            jg      $B1$3                          ; loop branch overhead               1
    $B1$4:  mov     ecx, DWORD PTR [esp+24]        ; load/address overhead              2
            add     ebp, 16                        ; loop/address overhead              1/2
            add     ecx, 16                        ; address overhead                   2
            mov     eax, DWORD PTR [esp+28]        ; load/address overhead              2
            cmp     eax, ebp                       ; branch related overhead            1
            jg      $B1$2                          ; loop branch overhead               1

Fig. 1. Optimized assembly code for the 1-D DCT routine (essentially an 8x8 matrix multiply). The major component of the discrete cosine transform is an 8x8 matrix multiply. In this example, we have not included the instructions necessary for transposing the second matrix, which would have increased the supporting overhead instructions further.

It is clear that, in DCT at least, the majority of the instructions in the loop are overhead instructions. The high fraction of these instructions in the loop is partly due to the programming conventions of general-purpose processors, abstractions and control flow structures used in programming, and the mismatch between how data is used in computations versus the sequence in which data is stored in memory.

If the SIMD units are widened, some of the loop instructions can perform more operations per instruction. In Fig. 1, however, only the instructions in bold can benefit from wider arithmetic units. As one sees, the vast majority of instructions in the instruction stream are just supporting the core computation instructions. Hence it is essential to improve their performance if the overall performance is to get better. It is clear from Amdahl's law that, given the fraction of overhead in the loop, wider SIMD units will provide only incremental performance benefits.
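A small numeric sketch makes the Amdahl's law argument concrete. It assumes that dynamic instruction count is a fair proxy for execution time and takes the computation fraction to be 25%, in line with the "more than 75%" overhead figure quoted in the introduction; the function and variable names are illustrative.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only `accelerated_fraction` of the work runs `factor` times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


COMPUTE_FRACTION = 0.25   # assumption: 75% of the instruction stream is overhead

for widening in (2, 4, 8):
    print(f"SIMD units {widening}x wider -> overall speedup "
          f"{amdahl_speedup(COMPUTE_FRACTION, widening):.3f}x")

# Upper bound with infinitely wide SIMD units: the overhead alone remains.
print(f"limit -> {1.0 / (1.0 - COMPUTE_FRACTION):.3f}x")
```

Under these assumptions, doubling the SIMD width buys about 14% and even unbounded widening cannot exceed 1.33x, which is why the overhead instructions are the better target.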

We showed the 1-D DCT example to illustrate the breakdown of computation versus overhead. In Fig. 2, we show a similar breakdown for six of the nine benchmarks we studied. We did not include jpeg, ijpeg, and decrypt in this experiment because the source code for these three benchmarks includes initialization routines and file I/O. We ran the six benchmarks on a Pentium III processor with MMX support. In five of the benchmarks (g711 was the exception), all core computation was executed using SIMD instructions. In Fig. 2 we decomposed the execution into the four defined classes, except that we were unable to separate address and loop arithmetic. For these codes, the overhead and supporting instructions account for 75%-85% of the dynamic instruction stream; the true computation instructions are a small percentage.

[Fig. 2. Breakdown of dynamic instructions into various classes. Stacked bars from 0% to 100% for cfa, dct, scale, motest, aud, and g711; legend: streaming memory, address transformation, loop branches, address/loop arith., true computation.]

2.1 Sources of overhead instructions

In this subsection we discuss the reasons that such a high fraction of these media loops are overhead. Essentially, media applications use nested loops and use multiple strides. Current general-purpose processors (GPPs) have little support to handle multiple loops or the abundant address generation efficiently. Hardware to generate multiple address sequences is not overly complicated, but current ISAs require a large number of instructions to produce them, as the available addressing modes are limited. Furthermore, there is not enough support for keeping track of multiple indices/strides efficiently in GPPs. Keeping track of multiple loop nests/bounds involves a combination of several addressing modes and instructions. Thus, even though GPPs are enhanced with SIMD extensions to extract data-level parallelism in multimedia programs, there is a mismatch between the requirements of media applications (for address generation and nested loops) and the capabilities of GPPs to support the nested loops, memory operations, and address generation efficiently. We elaborate on the loop management and address generation overhead below.

Loop management: Desktop/workstation multimedia applications such as streaming video encoding/decoding (MPEG1/2/4 and Motion JPEG), audio encoding/decoding (ADPCM, G.711, MP3, etc), video conferencing (H.323, H.261, etc), 3D games, and image processing (JPEG, filtering) typically operate on sub-blocks in a large 1- or 2-dimensional block of data. Audio applications operate on chunks of one-dimensional data samples at a time. Image and video applications operate on sub-blocks of two-dimensional data at a time. For example, the DCT algorithm operates on 8x8 segments of data in images of sizes like 1600x1200 pixels. These algorithms typically use multiple nested loops with statically known loop boundaries to process the sub-blocks and the data streams. To determine the depth of loop nests, we analyzed our benchmark suite and a number of other common media codes, and show the results in Table 3.

Table 3. Loop nest depth and common addressing sequences in key media applications

Multimedia/DSP algorithm                             Nested loops   Addressing sequences
Discrete Cosine Transform (JPEG & MPEG coding)       5              Sequential and sequential with multiple offsets/strides
Motion Est./Comp. (MPEG, H.263, etc)                 5              Sequential and sequential with multiple offsets/strides
Wavelet Transform (JPEG2000)                         >5             Sequential and sequential with multiple offsets/strides
Color Space Conversion (JPEG, MPEG, 3D graphics)     >4             Sequential, sequential with offsets, and shuffled
Scaling and matrix operations (image/video)          3              Sequential and sequential with multiple offsets/strides
Fast Fourier transform                               >3             Shuffled and bit-reversed
Color Filter Array, median filtering, correlation    2-5            Sequential and sequential with multiple offsets/strides
Convolution, FIR, and IIR filtering                  3-4            Sequential, sequential with offsets, and reflected
Edge detection, alpha saturation (image/video)       2-5            Sequential and sequential with multiple offsets/strides
Up/Down sampling, 3-D transformation (graphics)      3-5            Sequential and sequential with multiple offsets/strides
Quantization (JPEG, MPEG)                            2-4            Sequential and sequential with multiple offsets/strides
ADPCM, G.711 (speech)                                2-3            Sequential and sequential with multiple offsets/strides

Address generation and stride management: The division of data into sub-blocks results in the data being accessed with different strides at various instances in the algorithm. Managing multiple strides results in numerous software instructions. In general, addressing sequences in media programs may be classified into the sequences shown in Fig. 3. We have included the prevalent addressing sequences in the analyzed media applications in Table 3. By supporting sequential accesses, multiple strided accesses, reflected and shuffled data transformations in hardware, we can cover the bulk of the addressing and data streaming across this spectrum of media codes.

2.2 Effect of prevalent overhead instructions

The net effect of the overhead instructions is that the SIMD execution units are often idle. In this section, we evaluate what utilization of the SIMD execution units is achieved for a subset of our media applications. Peak throughput can be achieved if the required operands are available as soon as the SIMD execution units are available. In Table 4, we show how well the SIMD units are utilized on the Pentium III baseline with MMX. The utilization achieved by the media extensions is low (1-12%), despite the abundance of data-level parallelism in these codes, because the supporting units are not able to feed the right data, in the right packed form, at the required rate.

The results of this section show both that the properties of the overhead instructions are common across a broad range of media applications, and that significant potential for improved performance exists due to the low utilization of the SIMD units. In the next section, we describe the Programmable Loop Engine (PLE), which implements those common overhead operations in hardware, reducing the serial overhead and allowing a greater utilization of the SIMD units.

Given a sequence of length $N$, if $A_m$ is address $m$ in the range $0 \le m \le N-1$, most multimedia and DSP kernels can be considered to be composed of primitive addressing sequences such as the following:
(i) Sequential addressing: $A_0, A_1, A_2, \ldots, A_{N-1}$
(ii) Sequential with offset ($k$)/stride addressing: $A_{0+k}, A_{1+k}, A_{2+k}, \ldots, A_{N-1+k}$
(iii) Shuffled addressing (base $r$, $N/r = p$): $A_0, A_p, A_{2p}, \ldots, A_1, A_{p+1}, A_{2p+1}, \ldots, A_2, A_{p+2}, A_{2p+2}, \ldots, A_{N-1}$
(iv) Bit-reversed addressing (e.g. $N = 8$): $A_0, A_4, A_2, A_6, A_1, A_5, A_3, A_7$
(v) Reflected addressing: $A_0, A_{N-1}, A_1, A_{N-2}, \ldots, A_m, A_{N-1-m}, \ldots, A_{N/2-1}, A_{N/2}$

Fig. 3. Typical access patterns in multimedia and DSP kernels [13]

Table 4. Execution statistics and utilization of media programs (Pentium III, MMX & SSE)

Benchmark   Inst. count    Cycle count    Actual fraction of peak throughput
cfa         404,290,544    231,616,932    5.16 %
dct         188,798,806    123,944,326    6.2 %
scale       2,170,274      20,756,929     2.31 %
motest      156,734,613    113,623,185    3.38 %
aud         220,320,505    150,386,375    11.97 %
g711        59,066,806     64,006,729     1.12 %

3 Hardware Support for Efficient Media Processing

The media application characterization presented in the previous sections, especially Fig. 2, presents a compelling case for providing hardware support for multiple levels of looping and address generation. Such facilities, common in DSP processors, can be integrated in a cost-effective fashion into general-purpose cores. While a variety of implementations are possible, we present an example architecture to study the effectiveness of the proposed hardware acceleration mechanisms in GPPs. We refer to this as the PLE (programmable loop engine) architecture. Fig. 4 illustrates the proposed scheme. The shaded units are the new additions. Essentially, we add a new instruction (called the Programmable Loop or PL instruction), equivalent to a multidimensional vector instruction, which indicates the multiple nested loops that are required in the computation, the different address strides, loop bounds, etc. Once the PL instruction is encountered, program execution proceeds in the shaded region without using the fetch, decode, rename and issue blocks of the original processor. The details of the added instruction and the added units are provided below. While we have made some choices regarding the number of levels of looping or simultaneous data streams, our objective is only to prove the effectiveness of the general concept of providing hardware support for these functionalities. Simple ASICs can perform these tasks efficiently; however, loss of programmability and flexibility is a weakness of that approach, and hence we favor a programmable processor. But the point is that programmability does not need to extend to every aspect of the functionality. Programmability can be limited for these structured media applications, to improve efficiency.

[Fig. 4. A superscalar processor enhanced with the PLE architecture. Pipeline blocks: Fetch, Decode, Rename, Issue, Read Registers, Memory Access, Execute, Writeback. Shaded (new) blocks: Hardware Loop Instr. Mem. & Dec., Hardware Looping, Address Generation, Data Reorganization.]

The new hardware units (shaded blocks in Fig. 4) are the address generation units, hardware looping, and PL instruction memory & decoder. Their functions are described below:

• Loop processing: To eliminate branch instruction overhead, the PLE employs zero-overhead branch processing using dedicated hardware loop control and supports up to five levels of loop nesting. We chose to have five levels since that would be sufficient to handle the most common algorithms and routines used in media processing. All branches and instructions related to loop increments are handled by this technique. This approach is fairly simple and straightforward to implement and has been implemented in many conventional DSP processors such as the Motorola 56000 and the TMS320C5x from Texas Instruments [18].

• Address calculation: The current PLE allows for three input data structures/streams and produces one output structure. The choice was made because many media algorithms can benefit from this capability (current SIMD execution units sometimes operate on three input registers to produce one output value). Dedicated address arithmetic hardware generates all input and output address streams/data structures concurrently with the computations.

• Data reorganization: In many algorithms, the logical access sequence of data is vastly different from the physical storage pattern. Various permute operations, including pack and unpack instructions, are used. For example, the first element in eight columns of a matrix needs to be packed into a single row (or SIMD register). Similarly, at times, a single element (16 bits wide) needs to be broadcast into all the four sub-words of a SIMD register (64 bits wide). The PLE efficiently handles the task of reordering data with explicit hardware support. As an example of the operations supported, there is multicasting of data into selected registers in programmable patterns. Multicasting eliminates the need for transposing data structures, allows for reordering of the computations, and increases reuse of data items soon after fetch by exploiting DLP in outer-level loops. Multicasting means copying one or many data items into several registers or buffers at the same time. For example, a data value A may be copied into 8 registers (or 8 sections of a big SIMD register) resulting in the pattern A,A,A,A,A,A,A,A, or two items A and B may be copied to 8 registers in the pattern A,A,B,B,A,A,B,B or A,B,A,B,A,B,A,B or another such pattern.

• Hardware loop instruction storage and decoding: In order to program/control the hardware units in the PLE architecture, a special instruction called the PL instruction is formulated. The PL instruction is equivalent to a multidimensional vector instruction. The PL instruction memory stores these instructions once they enter the processor.
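The multicast patterns quoted in the data reorganization item can be modeled in a few lines of software. This is only a behavioral sketch: the function name and the `repeat` parameter are hypothetical, and the text does not specify how the PLE actually encodes a pattern.

```python
def multicast(items, lanes, repeat=1):
    """Fill `lanes` register slots by cycling through `items`,
    writing each item `repeat` times in a row before moving on."""
    unit = [x for x in items for _ in range(repeat)]
    return [unit[i % len(unit)] for i in range(lanes)]


print(multicast(["A"], 8))                  # one value broadcast to all 8 slots
print(multicast(["A", "B"], 8, repeat=2))   # A,A,B,B,A,A,B,B
print(multicast(["A", "B"], 8))             # A,B,A,B,A,B,A,B
```

In software each of these patterns costs extra pack, unpack, or shuffle instructions; the point of doing it in the PLE is that the pattern is selected once by the PL instruction rather than rebuilt on every iteration.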

3.1 PLE implementation details

The major hardware additions, the looping circuitry and the address generation circuitry, are described below.

Loop support: Fig. 5 illustrates the looping circuitry. Loop index values are produced every clock cycle based on the loop bound for each level of nesting (bounds for each of the five loops are specified in the PL instruction). The value of a loop index varies from 1 (lower bound) to the corresponding loop bound (upper bound), and resets to its lower bound once the upper bound has been reached in the previous cycle. The execution of the PL instruction ends when the outermost loop (loop1 in Fig. 5) reaches its upper bound. On encountering either an exception or a stall, the loop indices are stored and the increment logic is halted; the counting process is started once the exception/stall is serviced. Each of the five comparators (32 bits wide) operates in parallel to generate flag (1 bit wide) signals that are priority encoded to determine which one of the five loop counters to increment. When a loop counter is incremented by 1 (by a circuit for incrementing a 32-bit value by 1), all the loop counters belonging to its inner levels are reset (for example, if loop3 is incremented by 1, then loop4 and loop5 are reset to their lower bound).

[Fig. 5. Block diagram of the hardware looping circuitry. Each of the five loops has a count register (Loop1-count to Loop5-count) and an index register (index-1 to index-5) feeding a comparator (comparator-1 to comparator-5) that produces a flag (flag-1 to flag-5). A priority encoder takes the flags and drives the End-of-all-loops signal and the increment-by-1 units (incL1, incL2, ...) that update index-1 to index-5.]

Address generation: The PLE architecture supports three input and one output data structures/streams. Each of the four data streams has a dedicated address generation hardware unit. Address arithmetic on each stream is performed based on the strides and mask values indicated in the PL instruction. For each clock cycle, depending on the mask bits and loop index counts, one of the five possible strides is selected. The new address value is then computed based on the selected stride and the previous address value. Fig. 6 depicts the block diagram of the address generation circuitry for a single data stream/structure.

The last_val comparators determine which of the four inner-level loop counters have reached their upper bound. The outermost loop comparison is not necessary because the PL instruction finishes execution at the instant when the outermost loop counter reaches its upper bound. The inc-cond and inc-combine blocks generate flag signals based on the output from the last_val comparators and the mask values from the PL instruction. If none of the flag signals are true, then stride-5 is used to update the prev-address; otherwise, the appropriate stride-(1-4) is selected depending on flag-(1-4). The address-generate block uses a 32-bit adder to add the selected stride to the previous address. On either an exception or a stall, only the loop counters are stored by the hardware looping circuitry. For each of the four data structures/streams, the prev-address value also needs to be stored. The last_val comparators portion of the logic is shared, but the remaining hardware needs to be replicated.

Loop instruction decoder: A stand-alone instruction decoder for the PL instructions eliminates the need to modify the conventional instruction decoder of current GPPs. A PL instruction needs to be decoded only once since various control parameters are stored in hardware registers after the decoding process. The implementation of the PL instruction decoder was merged into the address generation and looping circuitry.
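The behavior of the looping circuitry can be checked against ordinary nested loops with a short software model. The sketch below is an assumption-laden illustration, not the VHDL design: it treats the comparators as equality tests against the bounds, the priority encoder as "pick the innermost counter that is not yet at its bound", and it ends after the iteration in which every counter sits at its bound. All names are ours.

```python
from itertools import product


def loop_indices(bounds):
    """Yield one tuple of loop indices per cycle, outermost loop first.
    Every index runs from 1 (lower bound) to its bound (upper bound)."""
    idx = [1] * len(bounds)
    while True:
        yield tuple(idx)
        # Comparators: a flag is raised for each loop sitting at its upper bound.
        flags = [i == b for i, b in zip(idx, bounds)]
        # Priority encoder: the innermost loop that has not reached its bound.
        level = next((k for k in reversed(range(len(bounds))) if not flags[k]), None)
        if level is None:
            return                      # end-of-all-loops
        idx[level] += 1                 # increment-by-1 on the selected counter
        for inner in range(level + 1, len(bounds)):
            idx[inner] = 1              # inner counters reset to the lower bound


bounds = [2, 1, 3, 1, 2]                # five nesting levels, as in the PLE
reference = list(product(*[range(1, b + 1) for b in bounds]))
print(list(loop_indices(bounds)) == reference)
```

The generator visits exactly the index tuples that five nested software loops would visit, in the same order, without executing a single compare or branch instruction in the instruction stream.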

[Fig. 6. Block diagram of address generation hardware (per data stream). The last_val comparators take Loop(2-5)-count and indice-(2-5) and produce lastval-(2-5). Together with mask-1 to mask-4, these feed the inc-cond1 to inc-cond4 and inc-combine1 to inc-combine4 blocks, which produce flag-1 to flag-4. The address-generate block uses the flags to select among stride-(1-5) and adds the selected stride to prev-address to form updated-address.]
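A per-stream address generator can be modeled the same way. The sketch below is one plausible reading of the description, stated as an assumption: on each cycle the stride of the loop level that advances is added to the previous address, provided the mask bit of that level is set for this stream. The function name and the example strides are hypothetical.

```python
def address_stream(start, bounds, strides, masks):
    """Yield one address per cycle for a single data stream.
    bounds, strides and masks hold one entry per loop level, outermost first."""
    addr = start
    idx = [1] * len(bounds)
    while True:
        yield addr
        # Innermost loop level that has not yet reached its last value.
        level = next((k for k in reversed(range(len(bounds))) if idx[k] < bounds[k]), None)
        if level is None:
            return
        idx[level] += 1
        for inner in range(level + 1, len(bounds)):
            idx[inner] = 1
        if masks[level]:
            addr += strides[level]      # adder: prev-address + selected stride


# Walk a 2x3 sub-block of a row-major array whose rows are 10 elements apart:
# the inner stride is 1, the outer stride jumps from the end of one row to the
# start of the next (10 - 2 = 8).
print(list(address_stream(0, [2, 3], [8, 1], [True, True])))
```

Negative strides rewind a stream (for example, to reuse the same coefficients on every outer iteration), and clearing a mask bit holds the address constant across that loop level.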

3.2 RequiredISAsupport ThemajorISAaddition

isanewinstruction,thePLinstruction,whichconveysinform

ouslevelsoloops f a nd theirstridesttohehardware.Thest 7.ThePLinstructionfacilitatesmultiplestrides(onea

ructure othe f addedPLinstructioniesxplainedinFig. each t levelofloopnesting,i.e.,taotaloffivestride

eachotfhethreeinputstreamsandoneoutputstream.The

stridesindicateaddressincrement/decrementvalues

basedontheloop-nestlevel.Dependingonthemaskvaluesforeac

hstream (indicatedinthePLinstruction)and

theloop-nestlevel,oneothe f five possiblestridesiussed

to updatetheaddresspointer.Ifan applicationdoesnot

need five levelsonesting, f non-constantstridescan bgeener

atedwith theextralevelsolooping f [19].

Datatypesoef achstream/structurearealsoindicated

inthePLinstruction.CurrentSIMDextensions

providedatareorganizationinstructionsforsolvingthepro

blemofhavingdifferentelementsizesacrossthedata

structures(packing,unpacking,andpermute)andintroduceadditi

onalinstructionoverhead.Byprovidingthis

informationinthePLinstruction,specialhardwareintheP tionoperationsa ndthisiaslsoindicatedinthePLinst

ationonthevari-

LEperformsthisfunction.ThePLEperformsreducruction(forexample,multipleindependentresultsisinna

gleSIMDregisterarecombinedtogetherindotproductwhich

requireadditionalinstructionsincurrentDLP

techniques).Supportforsigned/unsignedarithmetic,saturation

shifting/scaling , ofinal f resultsias llindicatedin

11

s)for

Loop1-count

Loop2-count

Loop3-count

Loop4-count

Loop5-count

Starting Address of IS-1



Starting Addressof OS

OPR/ Legend RedOp / Shift LL /

Stride-1 I S-1

Stride-2 IS-1

Stride-3 IS-1

Stride-4 IS-1

Stride-5 IS-1

Stride-1 I S-2

Stride-2 IS-2

Stride-3 IS-2

Stride-4 IS-2

Stride-5 IS-2

Stride-1 I S-3

Stride-2 IS-3

Stride-3 IS-3

Stride-4 IS-3

Stride-5 IS-3

Stride-1 OS

Stride-2 OS

Stride-3 OS

Stride-4 OS

Stride-5 OS

Masks -

Masks -

IS-1 and IS-2

IS-3 a ndOS

IS input - stream OS output stream OPR operation code RedOp reduction operation LL loop - level to writeresults

Multicast anddatatypes ofeach streamwith remaining bits unused

32-bits

Fig.7. Structure othe f PLInstruction

thePLinstruction.Thiseliminatesadditionalinstruction

tshatareotherwiseneededforconventionalRISCproc-

essors. Withthesupportformultiplelevelsolfoopingandmultiple

strides,thePLinstructioniacomplex s in-

structionanddecodingsuchaninstructionicomplex as processi

ncurrentRISCprocessors.PLEinsteadhandles

thetaskodecoding f othe f PLinstruction.PLEhasitsown

instructionmemorytohold PL a instruction.Twoad-

ditional32-bitinstructionsarealsoaddedtotheISAofthe

general-purposeprocessorformarkingtheSTART

andSTOPofthePLEsectionotfhecode.These32-bitinstruc structionissuelogic)indicatethestartandthelengthof

tions(fetchedanddecodedbythetraditionalinthePLinstruction.WheneverP a Linstructioniesncoun-

teredinthedynamicinstructionstream,thedynamicinstruc

tionspriortothePLinstructionareallowedtofinish

afterwhichthePLEinstructiondecoderdecodesthePLinstruct

ion.Inourcurrentimplementation,wehaltthe

superscalarpipeline until theexecutionothe f PLinstruction units.Otherwise,arbitrationoresources f ins ecessaryto

iscompleted because the PLEusesexisting hardware allowforoverlapothe f PLinstructionandothersuper-

scalar instructions. Encodingalltheoverhead/supportingoperationsalongwiththeSIMD hastheadvantagethatthePLinstructioncanpotentially

true/corecomputationinstructions

replacemillionsodynamic f RISCinstructionsthathave

tobefetched,decoded,andissuedeverycycleinanormal

superscalarprocessor.SIMDinstructionsinGPPs

themselvesreducethenumberoifnstructionfetchesbecause

oneinstructionoperatesonmultipledata.ThePL

instruction a dditionallycapturesalltheoverheadoperationsa

longwith theSIMDcomputationoperationsthereby

drasticallyreducingrepeated(andunnecessary)fetchan the PLEarchitecture advantagessimilarto ASIC-basedac

d ecodeotfhesameinstructions.Thisresultsingiving celeration i[n20].

It is possible that an exception or interrupt occurs while a PL instruction is in progress. The state of all five loops, their current counts, and loop bounds are saved and restored when the instruction returns. This is similar to the handling of exceptions during move instructions with REP (Repeat Prefix) in x86. The PLE has registers to hold the loop parameters for all the loops.
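The save-and-restore behavior described above can be illustrated with a small software model. The following is a minimal sketch, not the PLE hardware: the class name `LoopNestState`, its methods, and the helper `run` are hypothetical, and the model assumes that the loop bounds and current counts are the only state that must survive an interrupt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoopNestState:
    """Hypothetical model of the loop state of a PL instruction in flight."""
    bounds: List[int]                                 # trip counts, outermost first
    counts: List[int] = field(default_factory=list)   # current index of each loop

    def __post_init__(self):
        if not self.counts:
            self.counts = [0] * len(self.bounds)

    def step(self) -> bool:
        """Advance to the next iteration; return False when the nest is done."""
        for level in reversed(range(len(self.bounds))):
            self.counts[level] += 1
            if self.counts[level] < self.bounds[level]:
                return True
            self.counts[level] = 0                    # loop wrapped; carry outward
        return False

    def save(self) -> tuple:
        """Snapshot taken when an exception or interrupt arrives."""
        return (tuple(self.bounds), tuple(self.counts))

    @classmethod
    def restore(cls, snapshot: tuple) -> "LoopNestState":
        bounds, counts = snapshot
        return cls(list(bounds), list(counts))


def run(state, max_steps=None):
    """Return (visited index tuples, finished?); stop early after max_steps."""
    visited = []
    while True:
        visited.append(tuple(state.counts))
        if not state.step():
            return visited, True                      # loop nest completed
        if max_steps is not None and len(visited) == max_steps:
            return visited, False                     # "interrupted"


# An interrupted run followed by a restored run visits the same iterations
# as an uninterrupted one.
uninterrupted, _ = run(LoopNestState([2, 3, 2]))
state = LoopNestState([2, 3, 2])
before, _ = run(state, max_steps=5)
after, _ = run(LoopNestState.restore(state.save()))
print(before + after == uninterrupted)
```

In this model the snapshot is taken between iterations, which mirrors the way a REP-prefixed x86 instruction is resumed from its updated count register rather than restarted.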

3.2.1 PL instruction encoding example

The PL instruction is a densely encoded instruction, and hence most media algorithms can be processed in just a few PL instructions. Fig. 8 illustrates the actions during the execution of a PL instruction using pseudo-code. In a scenario in which all the loop nests and data streams are processed, the PLE executes (in hardware) the following equivalent number of dynamic software instructions (in conventional ILP processors):

• five branches
• three loads and one store
• four address value generations (one on each stream, with each address generation representing multiple RISC instructions)
• one SIMD operation (2-way to 16-way parallelism depending on each data element size)
• one accumulation of SIMD result and one SIMD reduction operation
• four SIMD data reorganization (pack/unpack, permute, etc.) operations
• shifting & saturation of SIMD results

Common kernels such as the DCT, color space conversion, motion estimation, and filtering can be mapped to either one or two PL instructions. Fig. 9 illustrates the PL instruction mapping of the 1-D DCT assuming an 8-way SIMD for 16-bit data. For the 1-D DCT routine, only four of the five possible loop nests are needed, with the loop boundaries indicated in the PL instruction. The starting address of each stream is represented by the starting address of each of the arrays. The third input stream is not used for this algorithm. The value of the strides is computed based on the loop indices and the value of the address pointer in the previous cycle. The address pointer is updated each clock cycle, choosing one stride depending on the nesting level of the loops.

    IS1 = start_address_IS1; IS2 = start_address_IS2;
    IS3 = start_address_IS3; OS  = start_address_OS;

    increment_address(level) {
        if (mask_IS1[level]) IS1 += stride_IS1[level];
        if (mask_IS2[level]) IS2 += stride_IS2[level];
        if (mask_IS3[level]) IS3 += stride_IS3[level];
        if (mask_OS[level])  OS  += stride_OS[level];
    }

    if      ((i_5 + 1) == loop1_count) increment_address(4);
    else if ((i_4 + 1) == loop2_count) increment_address(3);
    else if ((i_3 + 1) == loop3_count) increment_address(2);
    else if ((i_2 + 1) == loop4_count) increment_address(1);
    else                               increment_address(5);

    SIMD_data_reorganization(R1, R2);
    SIMD_compute(MAC, R1, R2, R4);
    SIMD_data_reorganization(R4);

Fig. 8. Pseudo-code illustrating operations during execution of a PLE compute instruction
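The address update in Fig. 8 can be mimicked in software. The following is a simplified sketch under stated assumptions, not the PLE logic itself: the function `stream_addresses` and its parameters are hypothetical, levels are numbered outermost-first rather than 1 to 5, and each stride is treated as a signed delta applied when the corresponding loop advances (so an outer-level stride must already include any rewind of the inner loops).

```python
def stream_addresses(start, loop_counts, strides, mask=None):
    """Yield one address per innermost iteration of a loop nest.

    loop_counts and strides are listed outermost-first. After every
    iteration exactly one stride is applied: that of the innermost loop
    that has not yet reached its last trip. Loops nested inside it wrap
    back to index zero. mask[level] == False suppresses the update for
    that level, as the per-stream mask bits do in Fig. 8.
    """
    if any(count < 1 for count in loop_counts):
        raise ValueError("every loop needs at least one trip")
    if mask is None:
        mask = [True] * len(loop_counts)
    index = [0] * len(loop_counts)
    address = start
    while True:
        yield address
        level = len(loop_counts) - 1
        # Find the innermost loop that still has iterations left.
        while level >= 0 and index[level] == loop_counts[level] - 1:
            index[level] = 0
            level -= 1
        if level < 0:
            return                      # whole nest finished
        index[level] += 1
        if mask[level]:
            address += strides[level]


# Walk a 2-row by 3-column block inside a row-major array whose rows are
# 10 elements apart: inner stride 1, outer stride 10 - 2 = 8.
print(list(stream_addresses(0, [2, 3], [8, 1])))   # [0, 1, 2, 10, 11, 12]
```

Folding the rewind into the outer stride is what lets a single add per cycle follow an arbitrary nest, which is one plausible reading of the negative stride entries in Fig. 9.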

    1D_DCT( image[1200][1600], dct_coef[8][8], output[8][8] )
    {
      for (i = 0; i < 1200/8; i++)
        for (j = 0; j < 1600/8; j++)
          for (k = 0; k < 8; k++) {
            temp_simd_vector = 0;
            for (l = 0; l < 8; l++)
              /* Since there is 8-way SIMD parallelism, the innermost loop
                 folds into one iteration and is not required */
              temp_simd_vector += multicast(dct_coef[k][l] * image[i*8+k][j*8+l]);
            output[i*8][k*8] = temp_simd_vector >> s_bits;
          }
    }

[Figure 9, PL instruction fields: loop counts 1200/8, 1600/8, 8, and 8, with the remaining loop unused; starting addresses of image, dct_coef, and output, with NONE for the third input stream; OPR = MAC, Shift = s_bits, LL = 4; stream masks IS-1 = 01111, IS-2 = 01111, IS-3 = 00000, OS = 01100. Stride entries shown: 16 bytes, 2 bytes, 3200 bytes, -126 bytes, -22384 bytes, -22400 bytes, and NONE for unused levels. Multicast is used for the DCT coefficients; the data type of each stream is set to 16-bit data.]

Fig. 9. PL instruction mapping of the 1-D DCT
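Several of the byte strides listed for Fig. 9 follow directly from the array shape. The short calculation below is a sketch of that arithmetic. The interpretation of each stride (next image row, rewind to the top row of a block, rewind plus one SIMD vector to the right) is an assumption on our part, and the constant names are illustrative.

```python
ELEM_BYTES = 2        # 16-bit data, as stated for Fig. 9
SIMD_WAYS = 8         # 8-way SIMD
IMAGE_ROWS = 1200
IMAGE_COLS = 1600
BLOCK = 8             # 8x8 DCT blocks

row_pitch = IMAGE_COLS * ELEM_BYTES              # bytes between consecutive image rows
vector_bytes = SIMD_WAYS * ELEM_BYTES            # bytes covered by one SIMD access
rewind_block = -(BLOCK - 1) * row_pitch          # from the last row of a block to its first
next_block_right = rewind_block + vector_bytes   # rewind, then move one vector to the right

# Iterations of the i, j, k loops once the innermost loop is folded into SIMD.
iterations = (IMAGE_ROWS // BLOCK) * (IMAGE_COLS // BLOCK) * BLOCK

print(row_pitch, vector_bytes, rewind_block, next_block_right, iterations)
# prints: 3200 16 -22400 -22384 240000
```

Under these assumptions a single PL instruction covers 240,000 iterations of loop control and address arithmetic for this image size.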

4.

Performance Evaluation and Results

To measure performance of the PLE architecture, we modified the PISA version of SimpleScalar-3.0 (sim-outorder) and simulated PL instructions using instruction annotations. We use the same SIMD execution units' configuration as in a Pentium III processor (two 64-bit SIMD ALUs and one 64-bit SIMD multiplier). Table 5 shows the speedup obtained for each of the benchmarks using the PLE architecture, with a 4-way processor with SIMD extensions as the baseline. The baseline as well as the PLE-enhanced architecture incorporate prefetching.

Table 5. Speedups of 4-way, 8-way and 16-way processors enhanced with PLE

                     cfa     dct     mot   scale     aud    g711    jpeg   ijpeg  decrypt
4-way baseline      1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
4-way + PLE        13.30   11.70    8.30    2.30    3.53    1.08    1.94    1.61    1.01
8-way baseline      1.24    1.29    1.33    1.96    1.47    1.68    1.18    1.40    1.18
8-way + PLE        26.00   22.14   16.00    4.58    5.49    1.83    2.23    2.09    1.19
16-way baseline     1.24    1.66    1.35    3.35    2.11    2.50    1.91    1.77    1.55
16-way + PLE       51.03   42.07   31.00    9.12    6.13    2.74    2.25    2.29    1.56
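The gain of each PLE-enhanced configuration over its own baseline can be summarized as a geometric mean of per-benchmark ratios taken from Table 5. The sketch below shows one way to compute it. The grouping of the first four columns as kernels and the remaining five as applications is an assumption, and the helper names are illustrative. Because the result depends on the table values as transcribed here, small differences from figures quoted in the text are possible.

```python
import math

BENCHMARKS = ["cfa", "dct", "mot", "scale", "aud", "g711", "jpeg", "ijpeg", "decrypt"]
KERNELS = BENCHMARKS[:4]          # assumption: the first four columns are kernels
APPLICATIONS = BENCHMARKS[4:]     # assumption: the remaining five are applications

# Speedups relative to the 4-way baseline, one list per row of Table 5.
TABLE5 = {
    "4-way baseline":  [1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    "4-way + PLE":     [13.30, 11.70, 8.30, 2.30, 3.53, 1.08, 1.94, 1.61, 1.01],
    "8-way baseline":  [1.24, 1.29, 1.33, 1.96, 1.47, 1.68, 1.18, 1.40, 1.18],
    "8-way + PLE":     [26.00, 22.14, 16.00, 4.58, 5.49, 1.83, 2.23, 2.09, 1.19],
    "16-way baseline": [1.24, 1.66, 1.35, 3.35, 2.11, 2.50, 1.91, 1.77, 1.55],
    "16-way + PLE":    [51.03, 42.07, 31.00, 9.12, 6.13, 2.74, 2.25, 2.29, 1.56],
}


def geomean(values):
    """Geometric mean of a non-empty list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def ple_gain(width, names):
    """Geometric-mean speedup of '<width>-way + PLE' over '<width>-way baseline'."""
    base = TABLE5[f"{width}-way baseline"]
    ple = TABLE5[f"{width}-way + PLE"]
    columns = [BENCHMARKS.index(name) for name in names]
    return geomean([ple[c] / base[c] for c in columns])


for width in (4, 8, 16):
    print(f"{width}-way: kernels {ple_gain(width, KERNELS):.2f}x, "
          f"applications {ple_gain(width, APPLICATIONS):.2f}x")
```

A geometric mean is used because the entries are ratios, so one very large kernel speedup does not dominate the summary the way it would in an arithmetic mean.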

Incorporation of hardware looping and address generation, coupled with savings in repeated fetch/decode activity, results in significant speedup in all the kernels and several applications. One application that does not benefit is decrypt. G711 and decrypt are applications that have the smallest fraction of SIMD instructions (i.e., it is the superscalar pipeline that accounts for the bulk of the execution time rather than the PLE pipeline). Considering the geometric mean of speedups, incorporating the PLE enhances a 4-way processor by 7.3x in kernels and 1.54x in applications. We also computed the speedups of the 8-way enhanced architecture over the 8-way baseline and the 16-way enhanced architecture over the 16-way baseline architecture. In the case of the 8-way processor, the speedup in kernels is 10x and the speedup in applications is 1.63x (over the 8-way baseline). In the case of the 16-way architecture, the speedup over the 16-way baseline is 15.9x in kernels and 1.37x in applications.

One may also note that incorporating PLE support for looping and address generation makes a 4-way processor perform better than an 8-way processor in all the kernels and 3 of 5 applications. For all the kernels and 2 of the 5 applications, a 4-way processor enhanced with the PLE outperforms even a 16-way processor. Hence, if one is searching for complexity-effective enhancements, incorporating hardware looping, address generation, and address transformation is an effective way to achieve it. We discuss the area, time and power savings associated with our proposal in the next section.

Power savings: Since the PL instruction is densely encoded, few PL instructions are needed for any media-processing algorithm. The number of dynamic instructions that need to be fetched and decoded shrinks tremendously, leading to a reduced use of the instruction fetch, decode, and issue logic in a superscalar processor. The instruction fetch and issue logic are a significant consumer of power in speculative out-of-order processors. Once the PL instruction is interpreted, the instruction fetch, decode, and issue logic in the superscalar processor can be shut down for the duration of the loop nest. An indication of the power savings can be obtained by examining the savings in fetch and decode. As shown in Fig. 10, the use of the vector-style PL instruction can eliminate more than half of the instructions from the original program (65% on average). The instructions required to implement looping, address computations, and transformations are removed. Each eliminated instruction results in energy savings in the fetch, decode and register renaming stages.

[Figure 10: bar chart titled "% Reduction in dynamic instructions". Y-axis: % eliminated instructions (0 to 100). X-axis benchmarks: cfa, dct, motest, scale, g711, aud, jpeg, ijpeg, decrypt. Bar labels, in descending order: 99.90, 99.90, 99.90, 99.90, 91.00, 42.60, 41.70, 11.30, 0.20.]

Fig. 10. Reduction in dynamic instructions by using the PLE architecture. Power savings proportional to the instruction count savings can be expected for fetch, decode and renaming energy.
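The 65% average quoted for Fig. 10 can be checked from the nine bar labels. The snippet below is a small sketch of that arithmetic. It uses only the label values and does not assume which bar belongs to which benchmark.

```python
# Bar labels recovered from Fig. 10, in descending order (percent eliminated).
eliminated = [99.90, 99.90, 99.90, 99.90, 91.00, 42.60, 41.70, 11.30, 0.20]

average = sum(eliminated) / len(eliminated)
print(f"{average:.1f}% of dynamic instructions eliminated on average")
```

The arithmetic mean is dominated by the four benchmarks near 99.9%, so the per-benchmark spread matters more than the average when estimating fetch and decode energy for a particular workload.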

5

Hardware cost of the PLE Architecture

In this section, we describe a VHDL implementation of the PLE, which we developed to estimate the PLE's area and power overhead, as well as to validate the timing assumptions that we used in our simulation environment. Using Synopsys synthesis tools [21], we used a cell-based methodology to target the VHDL models to two ASIC cell libraries from LSI Logic [22][23]. Table 6 lists the libraries and technologies used for evaluating the implementation cost.

Table 6. Cell-based libraries (LSI Logic) used in synthesis

Library name        Description
lcbg12-p (G12-p)    A 0.18-micron L-drawn (0.13-micron L-effective) CMOS process. Highest performance solution at 1.8 V with high drive cells optimized for long interconnects associated with large designs.
lcbg11-p (G11-p)    A 0.25-micron L-drawn (0.18-micron L-effective) CMOS process. Highest performance solution at 2.5 V.

We use the default wire load models provided by LSI Logic's ASIC libraries. The Synopsys synthesis tools compute timing information based on the cells in the design and their corresponding parameters defined in the ASIC technology library. The area information provided by the synthesis tools is prior to layout and is computed based on the wire load models of the associated cells in the design. Average power consumption is measured based on the switching activity of the nets in the design. In our experiments, the switching activity factor originates from the RTL models gathered by the tool from simulation. The area, power, and timing estimates are obtained after performing maximum optimizations for performance in the synthesis tools. More information about the details of these tools can be found elsewhere [21].

Table 7 shows the composite estimates of timing, area, and power consumption for the hardware looping and address generation circuitry when implemented using the cell-based methodology. The power and area estimates in Table 7 correspond to a clock frequency of 1 GHz. The hardware cost of commercial SIMD implementations [25][26] is also shown in Table 7. We discuss each of the three categories below.

Area: The overall chip area required for implementing the hardware loops, address generation (for all four data streams), and the PL instruction decoder (merged into looping and address generation logic) is approximately 0.31 mm2 in the 0.18-micron library. In a 0.29-micron process, the increase in chip area for implementing the Visual Instruction Set (VIS) hardware into the Sparc processor family was 4 mm2, MMX into the Pentium family was 15 mm2, and AltiVec into the PowerPC family was 30 mm2 [25]. In a 0.25-micron process, the AltiVec hardware was expected to occupy 15 mm2. In a 0.18-micron technology, the die size of a Pentium III processor was 106 mm2, with the MMX and SSE execution units requiring approximately 3.6 mm2 [26]. Thus, the increase in area due to the added hardware units is less than 10% of SIMD-related hardware and the overall increase in chip area is less than 0.3%.
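The area total, and the power total reported for the same units, can be re-derived from the per-unit G12-p (0.18-micron) entries of Table 7. The following is a sketch of that budget, assuming one looping unit plus four identical address generators. The 50 W figure is the low end of the processor power range cited in the text, and the variable names are illustrative.

```python
# Per-unit G12-p (0.18-micron) estimates from Table 7.
LOOPING_AREA_UM2 = 72830      # hardware looping (5 loops)
ADDRGEN_AREA_UM2 = 57398      # address generation, per stream
LOOPING_POWER_MW = 88.57
ADDRGEN_POWER_MW = 85.16
STREAMS = 4                   # three input streams and one output stream

UM2_PER_MM2 = 1_000_000
ple_area_mm2 = (LOOPING_AREA_UM2 + STREAMS * ADDRGEN_AREA_UM2) / UM2_PER_MM2
ple_power_mw = LOOPING_POWER_MW + STREAMS * ADDRGEN_POWER_MW

# Comparison points quoted in the text.
SIMD_AREA_MM2 = 3.6           # MMX + SSE units in a 0.18-micron Pentium III
DIE_AREA_MM2 = 106.0          # Pentium III die in a 0.18-micron process
CPU_POWER_MW = 50_000.0       # low end of the 50 W to 150 W range

print(f"PLE area  : {ple_area_mm2:.3f} mm2")
print(f"  vs SIMD : {100 * ple_area_mm2 / SIMD_AREA_MM2:.1f}%")
print(f"  vs die  : {100 * ple_area_mm2 / DIE_AREA_MM2:.2f}%")
print(f"PLE power : {ple_power_mw:.2f} mW")
print(f"  vs CPU  : {100 * ple_power_mw / CPU_POWER_MW:.2f}%")
```

The sums land close to the approximate totals given in the text, and the ratios stay below the stated bounds of 10% of SIMD area, 0.3% of die area, and 1% of processor power even against the lowest-power processor in the cited range.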

Table 7. Timing, Area, and Power estimates for hardware looping and address generation
(The instruction decoder was merged into the looping and address generation)

                  Hardware Looping (5 loops)                 Address Generation (per stream)
                  Time (ns)   Area (µm2)    Power (mW)       Time (ns)   Area (µm2)    Power (mW)
G12-p (0.18 µ)    1.00 ns     72830 µm2     88.57 mW         1.74 ns     57398 µm2     85.16 mW
G11-p (0.25 µ)    1.49 ns     273249 µm2    249.30 mW        2.60 ns     165099 µm2    193.20 mW

Area of commercial SIMD and GPP units for comparison [25][26]

VIS                                     4 mm2 in a 0.29-micron process
MMX                                     15 mm2 in a 0.29-micron process
AltiVec                                 15 mm2 in a 0.25-micron process
Pentium III processor                   106 mm2 in a 0.18-micron process
MMX + SSE in a Pentium III processor    3.6 mm2 in a 0.18-micron process

Power: The power consumed by the looping, address generation (all four streams), and the PL instruction decoder is approximately 430 mW in the 0.18-micron library. General-purpose processors with speeds over 1 GHz typically consume power ranging from 50 W to 150 W; thus, the PLE hardware would increase power by less than 1%. The overall energy consumption of the PLE architecture would be clearly less than that of a superscalar processor with SIMD extensions, since the PL instruction significantly reduces the total instruction count, as explained in Fig. 10.

Timing: The PLE hardware can be incorporated into a high-speed processor without elongating the critical path of the processor (after appropriate pipelining). In Table 7, some of the units in our implementation do not support a 1 GHz clock; however, pipelining the hardware looping logic into two stages (in a 0.18-micron technology) would permit gigahertz frequencies. Similarly, the address generation stage needs to be divided into three pipe stages to achieve frequencies greater than 1 GHz.

6

Related Work

Corbal et al. [36] proposed to exploit DLP in two dimensions instead of one dimension as in current SIMD extensions. Vassiliadis et al. [37][38] have concurrently proposed the Complex Streamed Instruction set (CSI) that can exploit two levels of looping. Though they are able to eliminate some overhead because each of their complex instructions can eliminate two loops, our results show that more overhead can be eliminated with 5 loops.

Lee and Stoodley [39] proposed simple vector microprocessors for media applications, but they used in-order simple processors for scalar processing and vectors for media processing. Such an architecture can perform well over limited domains only, because the scalar processor is in-order. Ranganathan et al. [5] observe that out-of-order execution is beneficial to media applications. There are several components in many multimedia applications that cannot exploit DLP, but require good branch prediction and speculation to exploit ILP, and hence we also favor the use of the out-of-order processor.

Rixner et al. [40] developed the Imagine architecture for bandwidth-efficient media processing. This architecture is based on clusters of ALUs processing large data streams and is built as a co-processor for a high-end multimedia system. The methodology adopted is to put in additional computation units, while the PLE approach improves the utilization of the existing computation units by reducing looping and address generation overhead. Another related effort is the reconfigurable PipeRench coprocessor [41].

There are a few research efforts in identifying the bottlenecks in exploiting sub-word parallelism using SIMD extensions. Fridman discusses approaches to data alignment for sub-word parallelism in the TigerSharc processor using four sub-word MAC units in [28]. Thakkar and Huff discuss the need for data alignment for SSE extensions in [29]. The Burroughs Scientific Processor (BSP) [42] was a pure-SIMD array processor that had special-purpose hardware (called Alignment networks) for packing and unpacking data. There also has been research in specialized access processors and address generation coprocessors [13][35] and decoupled access execute processors [30, 31, 32, 19, 33, 34], which also tried to accelerate the overhead component of the instruction stream.

Vermeulen et al. [20] described how DCT, Reed-Solomon code and other similar media-oriented operations could be enhanced with a hardware accelerator that works in conjunction with a GPP. However, the accelerator has to be designed for each algorithm. Retargeting the accelerator to another algorithm incurs significant effort, while, in our case, we still have a fully programmable engine.

7

Conclusions

While SIMD extensions have improved media application performance, there are additional bottlenecks that exist in the media instruction stream. We study a range of media codes to determine which features and characteristics of these programs are sufficiently frequent to merit implementation in hardware. The common classes of operations that we find are multiple nested loop control, address generation, data transformations, and streaming memory operations. We note that:

• Approximately 75-85% of instructions in the dynamic instruction stream of media workloads are only supporting the actual/core computations. These instructions are mostly performing address generation, data rearrangement, loop branches, and loads/stores.

• The utilization of the SIMD computation units in current SIMD extensions is very low because of the copious overhead/supporting instructions. Our measurements on a Pentium III processor with a variety of media kernels and applications illustrate SIMD utilization ranging from 1% to 12%.

Then, we propose and evaluate a programmable loop engine (a PLE) that executes media codes and kernels efficiently by moving most of the overhead associated with the media loops into hardware.

• On the average, 65% of all instructions in the instruction stream can be eliminated with the addition of the proposed hardware. This leads to performance and power savings.

• Incorporating hardware looping and address generation into a 4-way processor with SIMD extensions results in a speedup of up to 1.54X in applications and 7.3X in kernels.

• For all kernels and 3 of 5 applications, a 4-way processor with the PLE outperforms an 8-way SIMD processor without the proposed hardware. For all the kernels and 2 of the 5 applications, a 4-way processor enhanced with the PLE outperforms even a 16-way processor.

• The cost of adding the PLE hardware to a SIMD GPP is negligible compared to the performance improvements. We find that the PLE hardware units occupy less than 0.3% of the overall processor area, consume less than 1% of the total processor power, and on appropriate pipelining do not elongate the critical path of a GPP.

Our solution essentially approximates the performance of hardware solutions, but retains the flexibility of software solutions. The necessity to run a variety of workloads including desktop, database, media, Java, scientific and technical applications justifies not abandoning the aggressive general-purpose core in favor of a media-specific solution. The right solution is to append simple hardware support for tasks that can be done efficiently and elegantly in hardware.

References

[1] R. B. Lee, "Multimedia extensions for general-purpose processors," Proc. IEEE Workshop on Signal Processing Systems, pp. 9-23, Nov. 1997.
[2] K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales, "AltiVec extension to PowerPC accelerates media processing," IEEE Micro, vol. 20, no. 2, pp. 85-95, Mar/Apr 2000.
[3] TMS320C64x DSP Technical Brief. Available: http://www.ti.com/sc/docs/products/dsp/c6000/c64xmptb.pdf.
[4] J. Fridman and Z. Greenfield, "The TigerSHARC DSP architecture," IEEE Micro, vol. 20, no. 1, pp. 66-76, Jan/Feb. 2000.
[5] P. Ranganathan, S. Adve, and N. Jouppi, "Performance of image and video processing with general-purpose processors and media ISA extensions," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 124-135, May 1999.
[6] E. Salami, J. Corbal, M. Valero, and R. Espasa, "An Evaluation of different DLP alternatives for the embedded domain," Proc. Workshop on Media Processors and DSPs in conjunction with Micro-32, Nov. 1999.
[7] R. Bhargava, L. K. John, B. L. Evans, and R. Radhakrishnan, "Evaluating MMX technology using DSP and multimedia applications," Proc. IEEE/ACM Sym. on Microarchitecture, pp. 37-46, Dec. 1998.
[8] H. V. Nguyen and L. K. John, "Exploiting SIMD parallelism in DSP and multimedia algorithms using the AltiVec technology," Proc. ACM Int. Conf. on Supercomputing, pp. 11-20, Jun. 1999.
[9] Sample source code for the Benchmarks. Link suppressed for BLIND review.
[10] C. Lee, M. Potkonjak, and W. H. Smith, "MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems," Proc. of 30th IEEE/ACM Sym. on Microarchitecture, pp. 330-335, Dec. 1997.
[11] D. Burger and T. M. Austin, "The SimpleScalar toolset," Version 2.0. Technical Report 1342, Univ. of Wisconsin-Madison, Comp. Sci. Dept, 1997.
[12] J. Fritts and W. Wolf, "Dynamic parallel media processing using speculative broadcast loop (SBL)," Proc. Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (held in conjunction with IPDPS'01), Apr. 2001.
[13] P. T. Hulina, L. D. Coraor, L. Kurian, and E. John, "Design and VLSI implementation of an address generation coprocessor," IEE Proc. on Computers and Digital Techniques, vol. 142, no. 2, pp. 145-151, Mar. 1995.
[14] J. E. Smith, "Decoupled access/execute computer architectures," ACM Trans. on Computer Systems, vol. 2, no. 4, pp. 289-308, Nov. 1984.
[15] J. E. Smith, S. Weiss, and N. Y. Pang, "A simulation study of decoupled architecture computers," IEEE Trans. on Computers, vol. C-35, no. 8, pp. 692-701, Aug. 1986.
[16] J. Corbal, R. Espasa, and M. Valero, "On the efficiency of reductions in micro-SIMD media extensions," Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, Sep. 2001.
[17] Intel Architecture Optimization Reference Manual. Available: http://developer.intel.com/design/pentiumii/manuals/245127.htm.
[18] P. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals: Architectures and Features, Chapter 8, IEEE Press series on Signal Processing, ISBN 0-7803-3405-1, 1997.
[19] A. R. Pleszkun and E. S. Davidson, "Structured memory access architecture," Proc. IEEE Intl. Conf. on Parallel Processing, pp. 461-471, 1983.
[20] F. Vermeulen, L. Nachtergaele, F. Catthoor, D. Verkest, and H. De Man, "Flexible hardware acceleration for multimedia oriented microprocessors," Proc. IEEE/ACM Sym. on Microarchitecture, pp. 171-177, Dec. 2000.
[21] Synopsys Sold Documentation, version 2000-0.5-1. Distributed with Synopsys CAD tools.
[22] LSI Logic ASIC technologies. Available: http://www.lsilogic/products/asic/technologies/index.html.
[23] LSI Logic ASKK Documentation System. Distributed with LSI Logic CAD tools.
[24] H. G. Cragon and W. J. Watson, "The TI advanced scientific computer," IEEE Computer Magazine, pp. 55-64, Jan. 1989.
[25] L. Gwennap, "AltiVec vectorizes PowerPC," Microprocessor Report, vol. 12, no. 6, May 11, 1998.
[26] Pentium III implementation (IA-32). Available: http://www.sandpile.org/impl/p3.htm.
[27] K. Wilcox and S. Manne, "Alpha processors: A history of power issues and a look at the future," Cool Chips Tutorial in conjunction with IEEE/ACM Sym. on Microarchitecture, Nov. 1999.
[28] J. Fridman, "Sub-word parallelism in digital signal processing," IEEE Signal Processing Magazine, vol. 17, no. 2, pp. 27-35, Mar. 2000.
[29] S. Thakkar and T. Huff, "Internet streaming SIMD extensions," IEEE Computer Magazine, vol. 32, no. 12, pp. 26-34, Dec. 1999.
[30] J. E. Thornton, "Parallel operation in the Control Data 6600," Fall Joint Computers Conference, vol. 26, pp. 33-40, 1961.
[31] R. R. Shively, "Architecture of a programmable digital signal processor," IEEE Trans. Computers, vol. C-31, pp. 16-22, Jan. 1978.
[32] J. R. Goodman, T. J. Hsieh, K. Liou, A. R. Pleszkun, P. B. Schechter, and H. C. Young, "PIPE: A VLSI decoupled architecture," Proc. IEEE Sym. on Computer Architecture, pp. 20-27, Jun. 1985.
[33] Wm. A. Wolf, "Evaluation of the WM architecture," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 382-390, May 1992.
[34] Y. Zhang and G. B. Adams, "Performance modeling and code partitioning for the DS architecture," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 293-304, Jun. 1998.
[35] A. S. Berrached, P. T. Hulina, and L. D. Coraor, "Specification of a coprocessor for efficient access of data structures," Proc. Ann. Hawaii Int. Conf. on System Sciences, pp. 496-505, Jan. 1992.
[36] J. Corbal, M. Valero, and R. Espasa, "Exploiting a new level of DLP in multimedia applications," Proc. IEEE/ACM Sym. on Microarchitecture, pp. 72-79, Nov. 1999.
[37] S. Vassiliadis, B. Juurlink, and E. A. Hakkennes, "Complex streamed instructions: introduction and initial evaluation," Proc. IEEE Euromicro Conf., vol. 1, pp. 400-408, Sep. 2000.
[38] B. Juurlink, D. Tcheressiz, S. Vassiliadis, and H. Wijshoff, "Implementation and evaluation of the complex streamed instruction set," Proc. Int. Conf. on Parallel Architectures and Compilation Techniques, Sep. 2001.
[39] C. G. Lee and M. G. Stoodley, "Simple vector microprocessors for multimedia applications," Proc. 31st IEEE/ACM Sym. on Microarchitecture, pp. 25-36, Dec. 1998.
[40] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. R. Mattson, and J. D. Owens, "A bandwidth-efficient architecture for media processing," Proc. 32nd IEEE/ACM Sym. on Microarchitecture, pp. 3-13, Dec. 1998.
[41] S. C. Goldstein, H. Schmit, M. Moe, M. Nudiu, S. Cadambi, R. R. Taylor, and R. Laufer, "PipeRench: A coprocessor for streaming multimedia acceleration," Proc. 26th IEEE/ACM Sym. on Computer Architecture, pp. 28-39, May 1999.
[42] D. J. Kuck and R. A. Stokes, "The Burroughs scientific processor (BSP)," IEEE Trans. on Computers, vol. 31, no. 5, pp. 363-376, 1982.
[43] T. M. Conte, P. K. Dubey, M. D. Jennings, R. B. Lee, A. Peleg, S. Rathnam, M. Schlansker, P. Song, and A. Wolfe, "Challenges to combining general-purpose and multimedia processors," IEEE Computer Magazine, pp. 33-37, Dec. 1997.
[44] P. Ranganathan, S. Adve, and N. Jouppi, "Reconfigurable caches and their application to media processing," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 214-224, Jun. 2000.
[45] A. S. Mckee, "Maximizing memory bandwidth for streamed computations," Ph.D. Thesis, School of Engineering and Applied Science, University of Virginia, May 1995.
[46] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, "CHIMAERA: A high-performance architecture with a tightly coupled reconfigurable functional unit," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 225-235, Jun. 2000.
[47] H. Lieske, J. Wittenburg, W. Hinrichs, H. Kloos, M. Ohmacht, and P. Pirsch, "Enhancements for a Second Generation Parallel Multimedia-DSP," Proc. Workshop on Media Processors and DSPs in conjunction with Micro-32, Nov. 1999.
[48] Tech report. Link suppressed for blind review.