Building 10,000 Spoken Dialogue Systems

Stephen Sutton, David G. Novick, Ronald Cole, Pieter Vermeulen, Jacques de Villiers, Johan Schalkwyk and Mark Fanty

Center for Spoken Language Understanding, Oregon Graduate Institute, P.O. Box 91000, Portland, OR 97291, USA

ABSTRACT

1. INTRODUCTION

In the past decade, research in spoken language technology has enjoyed strong support in the United States. However, spoken language systems are not yet ubiquitous, and there are many problems that need to be solved before they can become so. These problems include high expertise requirements, a lengthy and expensive development process, and the lack of portability of spoken language technology. To reach a future with ubiquitous spoken dialogue systems, the field needs to make it possible for nonexpert developers to build portable spoken-language applications and interfaces rapidly. This approach should enable domain authors and even end-users to create and adapt speech technology to the tasks of thousands of specialized domains in which they have their own expertise.

To achieve this vision, we propose to address the barriers to ubiquity through a toolkit approach. This paper presents the CSLU toolkit, which supports rapid prototyping, iterative design, evaluation of spoken language systems, training of specialized speech recognizers, and research into spoken language technology, and which provides a modular and flexible environment that allows the sharing of resources (e.g. telephony cards) over a network.

2. BARRIERS TO UBIQUITY

While many aspects of spoken-language technology, such as recognition accuracy, out-of-vocabulary rejection and mixed-initiative dialogue, continue to pose serious difficulties, a number of spoken-language systems are already in successful use. These systems are characterized by limited vocabularies, simple grammars, and well-defined tasks. But the number of such systems remains small relative to their promise; we attribute this to three principal barriers:

Lack of expertise. Development and research of spoken language systems currently requires technical expertise across several subject areas. Because of these requirements, development and research is currently limited to a few specialized laboratories.

High development costs. The development process is lengthy and expensive, often requiring months or even years to produce a spoken language application. Data collection for training recognizers and for building language and dialogue models is costly and often must be done via "wizard-of-oz" simulation, with humans attempting to mimic the performance of a spoken language system.

Lack of portability. Current spoken language systems technology is not very portable [1, 4]. That is, the technology cannot adequately support the development of new applications with acceptable performance without significant engineering for each new task. The combination of these problems severely hinders application development and limits the role that spoken-dialogue technology can play in key areas such as education.

3. THE CSLU TOOLKIT

The CSLU toolkit is made up of a number of layers, as shown in Figure 1 [6]. At the highest layer, a direct manipulation interface (CSLUrp) enables authors to design a spoken language system graphically. This graphical specification is translated into Tcl scripts, which are then executed inside a programming shell (CSLUsh). This shell is made up of a collection of core libraries, written mostly in C. The libraries can also be used without the shell for developing stand-alone C applications and third-party applications.

The CSLU toolkit supports the complete life-cycle of a spoken-language system. It enables research by providing the essential tools and infrastructure for advancing speech technology.


Communication between modules is performed by passing object handles. This works across the network as well as across platforms. For example, it is quite easy in CSLUsh to send an object to a CSLUsh server running on another machine, perform some computation and receive a result object back. CSLUsh automatically takes care of repacking the underlying structure to handle platform dependencies such as byte order.
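To illustrate the kind of repacking described above, the following is a minimal sketch in Python rather than CSLUsh. It is not the toolkit's wire format: the function names and the layout (a 4-byte sample count followed by 16-bit samples) are assumptions made for the example. It only shows how fixing a byte order on the wire lets two machines with different native orders exchange the same object.

```python
import struct
import sys


def pack_samples(samples):
    """Serialize 16-bit signed samples in network (big-endian) byte order.

    Hypothetical layout: a 4-byte unsigned sample count, then the samples.
    """
    # "!" selects network byte order with no padding; "I" is uint32, "h" is int16.
    return struct.pack("!I%dh" % len(samples), len(samples), *samples)


def unpack_samples(payload):
    """Rebuild the sample list on the receiver, whatever its native byte order."""
    (count,) = struct.unpack_from("!I", payload, 0)
    return list(struct.unpack_from("!%dh" % count, payload, 4))


if __name__ == "__main__":
    wave = [4158, -5168, 646]
    wire = pack_samples(wave)
    print("native byte order:", sys.byteorder)
    print("bytes on the wire:", wire.hex())
    print("decoded samples:", unpack_samples(wire))
```

Because both sides agree on the wire order, neither needs to know the other's architecture; the conversion happens once at each end.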

To give the flavor of CSLUsh programming, here is the transcript of an interactive session in which the user reads a speech file, scales it so that the max sample is 15,000 and writes it back to disk. The identifiers "wave:0" and "wave:1" are object handles [6].

    % wave read NU-2234.zipcode.wav
    wave:0
    % wave info -max wave:0
    {4158 5168 646.0}
    % wave scale wave:0 [expr 15000.0/4158.0]
    wave:1
    % wave info -max wave:1
    {14999 5168 646.0}
    % wave write wave:1 new-wav
    wave:1
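The arithmetic behind the session above can be sketched outside CSLUsh. The Python fragment below is an illustration only: the function names are hypothetical, the peak is taken as the largest absolute sample value, and truncation toward zero is an assumption (the transcript reports 14999 rather than 15000, which suggests some rounding loss, but the toolkit's actual rounding rule is not stated here).

```python
def peak(samples):
    """Return the largest absolute sample value."""
    return max(abs(s) for s in samples)


def scale_to_peak(samples, target_peak):
    """Scale integer samples so the largest magnitude is about target_peak."""
    current = peak(samples)
    if current == 0:
        raise ValueError("cannot scale a silent signal")
    # Same ratio as the transcript's [expr 15000.0/4158.0].
    factor = float(target_peak) / current
    # int() truncates toward zero; this rounding choice is an assumption.
    return [int(s * factor) for s in samples]


if __name__ == "__main__":
    original = [100, -50, 25]
    scaled = scale_to_peak(original, 200)
    print(peak(original), "->", peak(scaled))
    print(scaled)
```

Returning a new list mirrors the session, where scaling "wave:0" yields a new handle "wave:1" instead of modifying the original object.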

In addition to technology created at CSLU, CSLUsh incorporates implementations of many common and widely-used algorithms essential for creating spoken language systems.

The toolkit also supports research into other aspects of human-computer interaction, such as dialogue design. It allows quick and easy development of spoken-language system prototypes. It supports applications development through activities such as collecting speech data and training recognizers, and includes integrated tools for browsing and labeling speech. Finally, it serves as a valuable educational tool, offering the opportunity for "hands-on" learning. We now describe the main components of the toolkit, CSLUsh and CSLUrp, in more detail.

Figure 1: Overview of CSLU toolkit.

3.1. The CSLU Shell

The base component of the toolkit is the CSLU Shell (CSLUsh, pronounced "slush"), which is based on the extensible scripting language Tcl. We have added a number of modular and dynamically loadable packages which add new commands to the language, designed to support research and development of spoken language systems. Basic functions include manipulating wave files, performing signal analysis (e.g. FFT, mel cepstrum), extracting features, training and utilizing artificial neural networks, and doing speech recognition for isolated words and continuous speech with finite-state grammars. It includes a general-purpose (vocabulary-independent) recognizer and a number of special-purpose recognizers for common vocabularies such as digits and alphabets.

Fundamental to CSLUsh is the use of objects, which are essentially C structures hidden from the CSLUsh developer. Access to the objects is provided by the new Tcl commands.

3.2. The CSLU Rapid Prototyper

CSLUsh is a good development and research environment, but building a telephone dialogue using CSLUsh still takes several steps that may be intimidating for non-experts. The CSLU Rapid Prototyper (CSLUrp, pronounced "slurp") is a graphically-based authoring environment built on top of CSLUsh that incorporates all the steps necessary for building and executing simple spoken-dialogue systems. The main strengths of CSLUrp include: (a) the speed with which application prototypes can be built; (b) an easy-to-use interface; (c) strong support for authors who lack specialized technical expertise in speech recognition; (d) the ability to create a wide range of real-world applications; and (e) suitability for a broad community of users.

CSLUrp includes a graphical palette of dialogue objects and a simple drag-and-drop interface. The dialogue objects serve as visual-programming building blocks. During the design phase, the author selects and arranges appropriate objects, linking them together to create a finite-state dialogue model. Then, during the run phase, CSLUrp provides a real-time animated view of the dialogue. The author can alternate between the design and run phases, enabling the incremental development and iterative refinement of spoken language systems. The set of objects in the palette covers a range of fundamental spoken language system functions including answering the telephone, speaking a prompt, recording speech input, recognizing speech input, and identifying DTMF tones.

The interface is designed to require minimal technical expertise on the author's part and to simplify the design and specification process. For example, specifying a speech recognizer is largely automatic; all that is required of the author is to enter the recognition vocabulary by typing or (for speaker-dependent recognition) saying it.
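A finite-state dialogue model of the kind an author assembles from linked objects can be sketched in a few lines. The Python fragment below is not CSLUrp's internal representation; the state names, prompts, and vocabulary are hypothetical, and a scripted list of words stands in for the speech recognizer. Each state carries a prompt and a set of transitions keyed by the recognized word, so the keys double as that state's recognition vocabulary.

```python
# Hypothetical dialogue: every name, prompt, and vocabulary word is illustrative.
DIALOGUE = {
    "greet": {
        "prompt": "Would you like sales or support?",
        "next": {"sales": "sales", "support": "support"},
    },
    "sales": {"prompt": "Connecting you to sales.", "next": {}},
    "support": {"prompt": "Connecting you to support.", "next": {}},
}


def run_dialogue(model, start, recognized_words):
    """Walk the model using scripted recognizer results.

    Returns the list of states visited. A word outside the current state's
    vocabulary leaves the dialogue in the same state (the prompt is repeated).
    """
    state = start
    visited = [state]
    for word in recognized_words:
        transitions = model[state]["next"]
        if not transitions:  # terminal state: nothing further to recognize
            break
        state = transitions.get(word, state)
        visited.append(state)
    return visited


if __name__ == "__main__":
    for name in run_dialogue(DIALOGUE, "greet", ["um", "support"]):
        print(name, ":", DIALOGUE[name]["prompt"])
```

Keeping the model as plain data, separate from the code that walks it, is what makes it natural to edit the model in a design phase and animate it in a run phase.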


As authors become more experienced and familiar with CSLUsh's capabilities, they can move beyond the scope of CSLUrp's initial set of functions and take advantage of the CSLUsh level to develop a wide range of interesting applications such as speech interfaces for existing text-based applications, speech front-ends to replace existing DTMF (touch-tone) interfaces, voice-response questionnaires and applications for spoken access to the World-Wide Web.

In summary, the CSLU toolkit is designed to address many of the problems associated with the lack of portability of spoken language system technology discussed above: the need for multidisciplinary expertise, substantial infrastructure requirements, and the effort required to develop systems for each new task. These problems are addressed (a) by providing a toolkit that incorporates most of the infrastructure needed to design, develop and investigate spoken language systems (most research at CSLU is now performed within the toolkit); (b) by making all of CSLU's speech corpora available to university researchers free of charge and providing tools to train new networks; and (c) by packaging other public-domain language resources within the toolkit.

Figure 2: Prototype system being developed using CSLUrp.

3.3. Platform Portability

To ensure portability and ease technology transition, the CSLU toolkit includes complete specifications for putting together turnkey systems. It also specifies the hardware and software requirements to make porting to new platforms as easy as possible. Our principal target platform is an Intel x86 (Pentium or better) processor running Solaris, a Dialogic telephone board, and DECtalk¹ text-to-speech software (which we have ported to x86 Solaris in cooperation with DEC, and which can be commercially licensed). The toolkit provides a generic interface for text-to-speech engines, including a common set of embedded commands, such as for changing pitch. The toolkit also provides a generic interface for speech I/O and a guide for writing servers for new devices. It supports the microphone-speaker of a SoundBlaster in a PC running Solaris x86 and the standard microphone-speaker on a Sun Sparcstation. Dialogic and Linkon telephony boards are supported, and device drivers exist for several platforms. These devices can be shared between platforms in a client-server fashion. Multi-platform porting is facilitated by the toolkit's C/Tcl implementation.
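The idea of a generic text-to-speech interface with embedded commands can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the toolkit's interface: the class and function names are hypothetical, and the "[pitch=NUMBER]" markup is an invented syntax, since the actual embedded-command format is not given here. The point is that prompts written against the common command set stay the same when the engine behind the interface changes.

```python
import re
from abc import ABC, abstractmethod


class TextToSpeechEngine(ABC):
    """Hypothetical engine-neutral interface; a real engine would wrap its own API."""

    @abstractmethod
    def set_pitch(self, hertz):
        ...

    @abstractmethod
    def say(self, text):
        ...


class LoggingEngine(TextToSpeechEngine):
    """Stand-in engine that records calls instead of producing audio."""

    def __init__(self):
        self.calls = []

    def set_pitch(self, hertz):
        self.calls.append(("pitch", hertz))

    def say(self, text):
        self.calls.append(("say", text))


# Invented embedded-command syntax for this sketch: "[pitch=NUMBER]".
_COMMAND = re.compile(r"\[pitch=(\d+)\]")


def speak(engine, prompt):
    """Split the prompt at embedded commands and dispatch each piece in order."""
    position = 0
    for match in _COMMAND.finditer(prompt):
        text = prompt[position:match.start()].strip()
        if text:
            engine.say(text)
        engine.set_pitch(int(match.group(1)))
        position = match.end()
    tail = prompt[position:].strip()
    if tail:
        engine.say(tail)


if __name__ == "__main__":
    engine = LoggingEngine()
    speak(engine, "Hello. [pitch=180] Goodbye.")
    print(engine.calls)
```

Supporting a new engine then means writing one more subclass; the prompts and the dispatch code are untouched.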

4. TECHNOLOGY TRANSITION

CSLUrp is fully integrated into the local environment. For instance, after designing a prototype the author simply clicks the "build" button followed by the "run" button, after which the system is ready to be called at the phone number displayed.

One of the major objectives of the CSLU toolkit is to make spoken-dialogue technology less exclusive and more accessible by promoting technology transition. To encourage sharing of applications and technological advances, the toolkit is designed to make it easy to add new software, to be as platform independent as possible, to include no proprietary software and to incorporate or be compatible with the most useful and widely-used software development tools. Our goal is to create a toolkit that will be embraced by a generation of students, developers and researchers simply because it is useful. We believe that such a toolkit approach will create and benefit from a multiplier effect. As the capability to build spoken language systems spreads, speech interfaces will be introduced in a growing number of applications, the limitations of the technology will become apparent, and more effort will be expended to improve the technology to enable better applications. Better applications will yield more value, more use and more interest in improving the technology, and the multiplier effect will move the development of spoken language interfaces out of the laboratory and into the public sector. The CSLU toolkit is designed to support the many activities needed to make this happen: system design and deployment, research and training.

¹ DECtalk is a trademark of Digital Equipment Corporation.

We are providing the CSLU toolkit to the academic community free of charge. We are also aiding the process of sharing software and ideas by developing and maintaining a World-Wide Web site where developers, researchers and users can contribute and obtain useful software, such as recognizers, subdialogues and working systems. We plan to incorporate the most useful of these into new releases of the toolkit. We have also set up a mailing list for reporting problems and requesting help.

Finally, we are supporting technology development and transition by offering short courses in which we teach people to use the toolkit for designing and developing spoken language systems. An earlier version of the toolkit has already formed the basis of a short course described in [2]. We are working with other universities to deliver the toolkit to their sites and use it to develop laboratory courses in spoken language technology, to be incorporated into their undergraduate curricula.

5. FUTURE WORK

We are committed to continuing to develop the scope of the toolkit and its underlying technology. In addition to incorporating more advanced speech recognition capabilities such as interpreting spontaneous speech using robust parsing techniques, we are investigating fundamental advances in dialogue technology. For instance, we would like to support more fluid and usable human-computer dialogues by moving beyond structured dialogues with simple finite-state grammars and fixed vocabularies. Research is needed to develop a more general basis for building flexible spoken language systems using high-level representations of knowledge, such as the goals and expectations of the user and the system. Improved dialogue representations should capture the dynamics of mixed-initiative interaction and provide authors with dialogue control structures suitable for tracking the dialogue focus, monitoring the mutuality and coherence of contributions, and providing a basis for performing automatic repair.

Specifying a spoken-dialogue system from scratch requires a great deal of expertise both about the domain and about the nature of human-computer interaction. This burden can be lessened through the reuse of knowledge. While it may be difficult to generalize interactions across domains, capturing and reusing knowledge for specific domains and tasks is a real possibility. Certain tasks and subtasks may recur in different applications, such as getting someone's name and address, obtaining an order, retrieving messages, and scheduling meetings. Research is needed to discover suitable knowledge for reuse and to explore the building of libraries of dialogues, subdialogues, metadialogues, and referent objects. CSLUrp's subdialogue objects are designed to support this effort by representing entire task-oriented subdialogues as a single icon.

6. CONCLUSION

As demand increases for ubiquitous spoken-dialogue applications, there is a critical need to make spoken-dialogue technology less exclusive, more affordable and more accessible. An important step towards satisfying this need is to place development of spoken-dialogue systems in the hands of the real domain experts rather than limit it to technical specialists. The technology will have succeeded when a large number of spoken-dialogue systems are developed and used. To reach this goal, we need technology that will make it possible to build 10,000 spoken-dialogue systems. To address this need, we have developed and are distributing the CSLU toolkit.

7. ACKNOWLEDGEMENTS

This research was supported by U S WEST, the National Science Foundation, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the member companies of the Center for Spoken Language Understanding. We thank Azdine Tadrist for his help in creating CSLUrp.

8. REFERENCES

1. Cole, R. A., Hirschman, L., et al. "The challenge of spoken language systems: Research directions for the nineties." IEEE Transactions on Speech and Audio Processing, 3(1), 1-21, 1995.

2. Colton, D., Cole, R., Novick, D., and Sutton, S. "A laboratory course for designing and testing spoken dialogue systems." Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP 96), Atlanta, GA, 1129-1132, 1996.

3. Hansen, B., Novick, D., and Sutton, S. "Systematic design of spoken prompts." Conference on Human Factors in Computing Systems (CHI 96), Vancouver, BC, 157-164, 1996.

4. Hirschman, L., et al. Summary report from the workshop on toolkits for language interface portability: The toolip workshop. Technical Report MP-9530000173, MITRE, Bedford, MA, 1995.

5. Schalkwyk, J., Colton, D., and Fanty, M. The CSLU toolkit for automatic speech recognition. Technical Report No. CSLU-011-96, Center for Spoken Language Understanding, Oregon Graduate Institute of Science & Technology, 1996.

6. Sutton, S., Vermeulen, P., de Villiers, J., Schalkwyk, J., Fanty, M., Novick, D., and Cole, R. Technical specification of the CSLU toolkit. Technical Report No. CSLU-013-96, Center for Spoken Language Understanding, Oregon Graduate Institute of Science & Technology, 1996.