Whence and Whither: The Automatic Recognition of Emotions in Speech


PIT 2008, 17.6.2008

Anton Batliner
Lehrstuhl für Mustererkennung (Informatik 5)
Friedrich-Alexander-Universität Erlangen-Nürnberg


Overview

- whence: a short history
- terminology: what "is" emotion
- corpus engineering
  - scenario
  - annotation
  - segmentation
- features & feature selection
- classification, evaluation
- whither: progress and applications
- the future


Whence: A Short History

- basic studies on emotion in speech from the 1920s onwards
- bulk of basic research from the 1980s onwards
- first studies on automatic recognition of acted emotions: mid-1990s
- first studies using "realistic" data: end of the 1990s and this century ⇒ tenth anniversary
- still, too many studies use acted data


What "is" Emotion: Terminological Remedies

- {despair, fear, joy, ...} = (full-blown) emotion
  vs. {stress, tiredness, interest, ...}
  vs. mood, ...
- cover terms used:
  - emotions ("emotional intelligence")
  - emotion-related states
  - affective states
  - user states (changing over time) vs. user traits (stable)
  - pervasive emotions (ex negativo: whatever is present in most of life but absent when people are emotionless)
- term not used:
  - (speech) register (a subset of a language used for a particular purpose or in a particular social setting)


Corpus Engineering: Scenario, Design

- sub-optimal but common breakdown: ± {acted, induced, natural(istic)} vs. ± {natural, spontaneous}
- representative non-acted/non-prompted scenarios:
  - mother-child interaction
  - call-center interaction
  - stress detection in a driving scenario
  - human-machine (computer/robot) interaction
  - tutoring dialogues
  - appointment-scheduling dialogues
  - human-human multi-party interaction


Corpus Engineering: Annotation

- basics: orthographic transcription plus lexicon
- labellers: experts or naïve, and how many?
- catalogue of terms:
  - deduced (catalogue of terms) or data-driven
  - categories or dimensions
  - mixed ("early or late mixture") or pure
- assessment:
  - reliability (kappa etc. vs. classifier performance); see the sketch below
  - external ground truth
  - ecological validity
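To make the reliability bullet concrete: a minimal sketch of chance-corrected agreement between two labellers, assuming scikit-learn is available; the labellers and label sequences are invented for illustration.

```python
# A minimal sketch: inter-labeller reliability (Cohen's kappa) for
# categorical emotion labels; the two label sequences are invented.
from sklearn.metrics import cohen_kappa_score

labeller_a = ["anger", "neutral", "anger", "joy", "neutral", "neutral"]
labeller_b = ["anger", "neutral", "joy",   "joy", "neutral", "anger"]

kappa = cohen_kappa_score(labeller_a, labeller_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect, 0.0 = chance agreement
```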


Corpus Engineering: Segmentation

- normally, units of analysis are not addressed:
  - either given trivially (acting with semantically void sentences or short dialogue moves)
  - or defined on an intuitive basis
- beneficial/necessary in the case of longer utterances
- emotion units correlate with:
  - prosodic units (intonation units and/or pauses); a pause-based sketch follows below
  - syntactic-semantic units & dialogue acts
  - size depends on register (e.g. in FAU-Aibo, 2.7 words/unit)
- better performance & time constraints within an end-to-end system (time-alignment)
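A minimal sketch of pause-based chunking into candidate emotion units, assuming word-level time alignments are available (e.g. from forced alignment); the words, times, and the 0.5 s threshold are illustrative assumptions, not values from the talk.

```python
# A minimal sketch: split a word-aligned utterance into candidate emotion
# units at long pauses. The (word, start_s, end_s) triples and the 0.5 s
# threshold are invented for illustration.
words = [("well", 0.0, 0.3), ("yes", 0.4, 0.7),
         ("no", 1.6, 1.9), ("stop", 2.0, 2.4)]

PAUSE_THRESHOLD = 0.5  # seconds; an assumed value, not from the talk

units, current = [], [words[0]]
for prev, word in zip(words, words[1:]):
    if word[1] - prev[2] >= PAUSE_THRESHOLD:  # long pause -> new unit
        units.append(current)
        current = []
    current.append(word)
units.append(current)

print([[w for w, _, _ in unit] for unit in units])
# [['well', 'yes'], ['no', 'stop']]
```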


Corpus Engineering: Selection of Cases

- almost never:
  - random selection
  - using all cases
- choosing more prototypical cases (majority voting); see the sketch below
- up- or down-sampling
- usually no rejection class
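A minimal sketch of prototype selection by majority voting over several labellers; the case IDs, labels, and agreement threshold are invented for illustration.

```python
# A minimal sketch: keep only prototypical cases on which most labellers
# agree (majority voting); case IDs, labels, and threshold are invented.
from collections import Counter

cases = {
    "turn_01": ["anger", "anger", "anger", "neutral", "anger"],
    "turn_02": ["anger", "joy", "neutral", "joy", "anger"],
}

MIN_AGREEMENT = 4  # at least 4 of 5 labellers must agree (an assumption)

prototypes = {}
for case_id, labels in cases.items():
    label, votes = Counter(labels).most_common(1)[0]
    if votes >= MIN_AGREEMENT:
        prototypes[case_id] = label

print(prototypes)  # {'turn_01': 'anger'}; turn_02 is too ambiguous
```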


Features

- extraction:
  - frame-based (ASR), segment-based, global
  - manual or automatic
- low-level (or high-level: sub-optimal)
- raw or normalized (implicitly or explicitly)
- types:
  - Low-Level Descriptors (LLDs): acoustic, linguistic
  - functionals


Features

- acoustic LLDs (see the extraction sketch below):
  - voice quality (jitter/shimmer, HNR)
  - pitch
  - spectrum and formants
  - cepstrum (MFCC features from ASR)
  - energy
  - duration
  - and: Teager operator, dynamic features for HMMs, ...
- linguistic LLDs:
  - non-verbals, disfluencies
  - stemming, e.g. part of speech (POS)
  - bag of words (from document-retrieval tasks)
  - n-grams (mostly uni-grams)
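A minimal sketch of frame-level acoustic LLD extraction (pitch, energy, MFCCs), assuming the librosa library; the file name and parameter values are illustrative, not from the talk.

```python
# A minimal sketch: frame-level acoustic LLDs (pitch, energy, MFCCs),
# assuming librosa; file name and parameter values are illustrative.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

f0 = librosa.yin(y, fmin=80, fmax=400, sr=sr)       # pitch contour
energy = librosa.feature.rms(y=y)[0]                # frame-wise energy
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # cepstral features

print(f0.shape, energy.shape, mfcc.shape)  # one value (or vector) per frame
```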


Functionals

- percentiles (e.g. quartiles)
- specific functions (e.g. regressional)
- extremes (e.g. min/max)
- higher statistical moments (e.g. std. dev.)
- means
- sequential and combinatorial (e.g. two functionals applied, e.g. mean of max); see the sketch below

⇒ CEICES feature encoding scheme
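A minimal sketch of applying functionals to a frame-level LLD contour to obtain a fixed-length feature vector; numpy only, and the pitch contour is invented.

```python
# A minimal sketch: map a variable-length LLD contour onto a fixed-length
# vector of functionals; the pitch contour is invented.
import numpy as np

f0 = np.array([120.0, 135.0, 150.0, 160.0, 140.0, 130.0])  # Hz, per frame

functionals = {
    "mean": np.mean(f0),
    "std": np.std(f0),                      # higher statistical moment
    "min": np.min(f0), "max": np.max(f0),   # extremes
    "q1": np.percentile(f0, 25),            # percentiles (quartiles)
    "q3": np.percentile(f0, 75),
    "slope": np.polyfit(np.arange(len(f0)), f0, 1)[0],  # regressional
}
print(functionals)
```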


Feature Selection

- # features: from some tens up to some thousands
  - small number of patterns + high number of features ⇒ reduction + selection necessary
- feature reduction: PCA, LDA, ICA, ...
- feature selection (see the sketch below):
  - wrapper or filter methods such as IGR
  - sequential forward selection (SFS) or backward selection (SBS), ..., Sequential Floating Forward Selection (SFFS)
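A minimal sketch of feature reduction (PCA) plus wrapper-style sequential forward selection, assuming scikit-learn; note that scikit-learn's SequentialFeatureSelector implements plain SFS/SBS, not floating SFFS. The data are random placeholders.

```python
# A minimal sketch: PCA for reduction and sequential forward selection
# (SFS) for selection, assuming scikit-learn; random placeholder data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # 200 cases, 50 features
y = rng.integers(0, 2, size=200)      # binary emotion labels

X_reduced = PCA(n_components=10).fit_transform(X)  # feature reduction

sfs = SequentialFeatureSelector(SVC(), n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features
```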


Click"Best" The to editFeature Master Vector? title style

„ a holy grail - there is no "best" feature vector „ depends on what's available ƒ ƒ ƒ

spoken word chain or ASR result manual and/or automatic extraction of feature values ...

„ (high) correlation between (types of) features

⇒ some (any?) combination will be adequate

© A. Batliner

page Seite 13/27 13


Click"Best" The to editFeature Master Vector: title style a Suggestion „ types of acoustic features ƒ ƒ ƒ ƒ ƒ

energy MFCC duration pitch voice quality, spectrum

varied data speaker-independent no full-blown emotions

„ any linguistic information „ types of functionals ƒ ƒ ƒ ƒ

means (robustness) extremes higher statistical moments (regression) layered (combining smaller and larger units of analysis)

© A. Batliner

uniform data (synthesis) personalized sadness vs. anger


Classification: Methods

- pattern recognition & data mining
- availability of tools such as WEKA and HTK
- from changing standard methods to multiplicity:
  - Linear Discriminant Classifiers, Nearest Neighbour, Decision Trees (Random Forests), Artificial Neural Networks
  - Naive Bayes, Support Vector Machines, Hidden Markov Models, Ensembles, ...
- regression
- fusion: early or late (see the sketch below)
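A minimal sketch of two out-of-the-box classifiers plus late fusion by averaging their class posteriors, assuming scikit-learn; the data are random placeholders.

```python
# A minimal sketch: two standard classifiers and late fusion by averaging
# their class posteriors, assuming scikit-learn; placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))
y_train = rng.integers(0, 4, size=200)    # four emotion classes
X_test = rng.normal(size=(50, 20))

svm = SVC(probability=True).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# late fusion: combine the two models' posterior estimates
posteriors = (svm.predict_proba(X_test) + forest.predict_proba(X_test)) / 2
y_pred = posteriors.argmax(axis=1)
print(y_pred[:10])
```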


Classification: Evaluation

- train vs. test (see the sketch below):
  - leave-one-case-out
  - leave-one-part-out (10-fold cross-validation in WEKA)
  - leave-one-speaker-out
  - stratified cross-validation
  - train + validation + test
- measures:
  - recognition rate (RR)
  - class-wise computed recognition rate (CL)
  - ...
  - basis: confusion matrix
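A minimal sketch of leave-one-speaker-out evaluation, with RR (overall recognition rate) and CL (class-wise averaged recognition rate) both derived from the confusion matrix; assuming scikit-learn, with invented data and speaker IDs.

```python
# A minimal sketch: leave-one-speaker-out evaluation; RR and CL are both
# computed from the confusion matrix. Data and speaker IDs are invented.
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = rng.integers(0, 2, size=120)           # two emotion classes
speakers = np.repeat(np.arange(6), 20)     # 6 speakers, 20 cases each

y_true, y_pred = [], []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    clf = SVC().fit(X[train_idx], y[train_idx])
    y_true.extend(y[test_idx])
    y_pred.extend(clf.predict(X[test_idx]))

cm = confusion_matrix(y_true, y_pred)
rr = cm.trace() / cm.sum()                    # RR: overall accuracy
cl = np.mean(cm.diagonal() / cm.sum(axis=1))  # CL: mean per-class recall
print(f"RR: {rr:.2f}  CL: {cl:.2f}")
```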


Classification: Assessment

- a simple ranking of classifier performance is not very enlightening
- a "worse" classifier often just means less classifier tuning
- any classification method is as good as another, provided a good feature vector
- there is no free lunch
- out-of-the-box procedures are competitive
- standards nowadays, e.g.: SVM and ensemble methods


Whither: an Epistemological Loop?

- basic research:
  - clash of cultures
  - if based on acted data, no transfer to the analysis/recognition of real data
- real data are necessary
- from real data to real(istic) applications:
  - the proof of the pudding
  - not generic, too focused?
  - recall & false alarm rate can/should be different for different applications


Applications: a Taxonomy

System design:
- online: system reacts (immediately/delayed) while interacting with the user
- offline: no system reaction, or delayed reaction after the actual interaction
- mirroring: user gets feedback on his/her emotional expression
- non-mirroring: system does not give any explicit feedback
- emotional: system itself reacts in an emotional way
- non-emotional: system does not behave emotionally but stays "neutral"

Meta-assessment:
- critical: the application's aims are impaired if emotion is processed erroneously
- non-critical: erroneous emotion processing does not impair the application's aims


Emotional Monitoring (1)

Main scenario: call center; anger detection with discontent users; quality of service
- online (interaction with the user) vs. offline (quality control of the agent)
- mirroring vs. non-mirroring
- fully automatic system vs. handing over to an agent
- emotional vs. non-emotional
- critical vs. non-critical

⇒ necessary: very low false alarm rate


Emotional Monitoring (2)

Main scenario: call center; anger detection with discontent users; quality of service
- online (interaction with the user) vs. offline (quality control of the agent)
- mirroring vs. non-mirroring
- fully automatic system vs. handing over to an agent
- emotional vs. non-emotional
- critical vs. non-critical

⇒ beneficial: high recall (and not too many false alarms)


Emotional Monitoring (3)

Main scenario: call center; anger detection with discontent users; quality of service
- online (interaction with the user) vs. offline (quality control of the agent)
- mirroring vs. non-mirroring
- fully automatic system vs. handing over to an agent
- emotional vs. non-emotional
- critical vs. non-critical

⇒ screening, i.e. average recall over time should and can be close to human performance - but don't forget the trade unions!


Nothing Said Yet About

- performance: that's what you remember
  - state of the art:
    - > 60% for 4 classes, ~80% for 2 classes
    - up to > 80% for 4 classes and > 90% for 2 classes
  - worse with ASR, esp. for linguistic features
  - what about rejection classes?
- generation and synthesis:
  - ECAs are speaker-dependent and want to convey emotions
  - our field is mostly speaker-independent and we want to recognize emotions
  - bridging is not possible yet


The Future

- more natural data needed
- taking realistic applications into account
- fine-tuning classification will help, but not much
- examples of promising applications:
  - call centers: millions of calls, it pays off
  - autism therapy: adequate treatment
  - Tamagotchi: "meaningless" and successful
  - MP3 player: yet another feature
  - the sex industry: real money
- the Rubicon: higher performance than now, lower than needed for dictation systems
- the challenge: to bring together
  - analysis & synthesis
  - basic research & applications


Some Take-Away Messages

- acted data are not useful
- segmentation into emotion units is necessary
- the "best" feature vector might consist of some combination
- there is no best classifier
- let's use ASR and all cases

And two more caveats; it's no panacea:
- to employ voice-quality features: they are speaker-dependent and multi-functional
- to employ multi-modality (= facial gestures and speech): sequential multi-modality is OK, but simultaneous multi-modality will mostly not pay off because the ground truth is vague and partly antagonistic


Acknowledgments

Many thanks to my colleagues in the CEICES initiative ⇒

The Automatic Recognition of Emotions in Speech. Anton Batliner, Björn Schuller, Dino Seppi, Stefan Steidl, Laurence Devillers, Laurence Vidrascu, Thurid Vogt, Vered Aharonson, Noam Amir. In: Paolo Petta et al. (eds.): HUMAINE Handbook on Emotion. Springer, 2009.


Thank you for your attention
