CINTI 2011 • 12th IEEE International Symposium on Computational Intelligence and Informatics • 21-22 November, 2011 • Budapest, Hungary

Evaluation of User Perceived Quality Based on User Behavior

Hamid Reza Hasani*, Naser Movahhedinia**, and Behrouz Shahgholi Ghahfarokhi***
* Department of Computer Engineering, University of Isfahan, [email protected]
** Department of Computer Engineering, University of Isfahan, [email protected]
*** Department of Computer Engineering, University of Isfahan, [email protected]

Abstract:

User judgment about quality is an important aspect of quality evaluation. The current approach to evaluating quality of service is to measure it from network-related parameters such as delay, jitter and loss, or from objective estimations that evaluate the degradation of speech. However, the user's view is more meaningful than the network view or an objective estimation, since the end user is the final judge of quality. Involving the user's real opinion is therefore an appropriate way to improve the performance of multimedia communication networks. The evaluation of user satisfaction depends on the user's conditions and varies from user to user. This research attempts to evaluate voice quality from user feedback inferred from his/her vocal behaviour. The idea behind this novelty is that quality degradations affect the user's mental state, including the user's emotion and cognitive load (CL), and such effects are assessable from the user's speech.

Keywords: User Satisfaction Level, User Perception Quality, Cognitive Load, Emotion, Quality of Experience.

I. Introduction

Growing demand for voice communication applications over wireless mobile networks emphasizes that quality of service provisioning must be considered intelligently. Quality of service in mobile communications can be viewed from the network perspective and from the end-user perspective. Network-based quality can be measured from simple network parameters such as loss rate, delay, jitter and so on. However, the user's perception is more meaningful than network parameters. Moreover, the user's satisfaction level varies from user to user. User Perception Quality (UPQ) assessment involves both "subjective" and "objective" evaluation methods. Subjective evaluation in most studies refers to human judgment of speech quality [1]. Subjective quality is measured by the MOS metric, proposed by the International Telecommunication Union (ITU) in 1996 in ITU-T Rec. P.800. MOS is defined as "the values on a predefined scale that subjects assign to their opinion of the performance of the telephone transmission system used either for conversation or for listening to spoken material [2]".

Subjective methods are laboratory-based, too expensive, and cannot be used in online applications; moreover, they only indicate the average user's opinion. On the other side, objective evaluations attempt to simulate human judgment by computational methods. Objective methods are divided into two classes: one is based on input-output comparison (double-ended, or full-reference) and the other on output evaluation. In double-ended methods, speech quality is estimated by measuring the difference between the input (clean) and output (degraded) signals and employing regression to map the difference values to the predicted quality metric. One such intrusive method has been standardized by the ITU-T as P.862 (PESQ) [3]. In some situations an intrusive approach may not be applicable, since the input speech signal may be unavailable or other difficulties such as the time-shifting problem exist. The remaining objective methods are output-based measurements (single-ended, or reference-free), which exploit only the degraded speech signal and are more challenging approaches to objective speech quality estimation, such as ITU-T Rec. P.563 [4] and methods that attempt to improve on it. For example, in [5] a new method has been proposed that provides high correlation with subjective listening quality scores, yielding accuracy similar to that of ITU-T P.563 while maintaining a relatively low computational complexity. However, to the best of our knowledge, objective methods have not taken the subject's own opinion into account.

The user's quality of experience and satisfaction level depend on the user's condition and personality, and may differ from one user to another: a given person may be satisfied in the presence of some amount of delay or jitter, while the same impairments may not be acceptable to him/her in another situation or to someone else. On the other hand, subjective methods that directly involve the user in the assessment are not applicable to online (in-service) applications. This paper proposes a new method to evaluate the user's perceived quality of a voice call using indirect user feedback inferred from his/her vocal behaviour. The proposed method employs the cognitive load and emotional state of the listener as assessment parameters.


The next section describes the place of the proposed approach and its differences from other quality evaluation approaches. The proposed UPQ evaluation model is presented in Section II, the user's behavioural indicators are explained in Section III, and experimental results and the contribution of the proposed approach are given in Section IV. Finally, Section V summarizes the paper and outlines future work.

II. Proposed Method

In this section the proposed UPQ evaluation framework is introduced. Fig. 1 shows the general UPQ evaluation framework, which compares the proposed method to traditional ones.

Fig. 1: Proposed method and traditional ones.

As the figure shows, traditional approaches evaluate UPQ with two kinds of methods: single-ended methods such as P.563, which estimate MOS from the received speech signal, and double-ended methods such as P.862, which compare the original speech signal (i.e. partner A's speech) with the transferred speech signal (i.e. partner A's speech as received at the other end) to calculate MOS. In the proposed method, by contrast, the user's answers (i.e. B's speech) are processed to determine the user's behaviour in response to the received sound, and the UPQ level is estimated from this response. When the quality of the received sound decreases, the user has to process more to recognize the input sound; therefore any change in the quality of the received sound affects the user's Cognitive Load (CL), emotion and other behaviours, which can be recognized by processing the listener's speech. Fig. 2 shows the structure of the proposed method. When the quality of the received voice decreases, the user's dissatisfaction is perceptible from his/her responses: the user usually becomes angry and his/her CL increases; moreover, the user may repeat some words or use negative words. In this paper, CL, emotional state and word repetition are used to estimate the user's perceived quality. The method exploits a learning-based model to obtain the relationship between the above parameters and the user's opinion.

Fig. 2: Proposed method.

During the training phase, users are asked to report the perceived quality throughout the conversations as a five-level MOS score; this user evaluation is called the User Opinion Score (UOS) in this paper. The learning model is based on a Multi-Layer Perceptron (MLP) Neural Network (NN) classifier. The NN's input layer has five nodes for the user behaviour factors, and a single output node provides the estimated UOS. After the training phase, the NN-based model is used to estimate the perceived quality from the above factors of the listener's speech. In the next section these factors and the methods used to evaluate them are introduced in more detail.
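The extracted text does not give the network's exact configuration beyond five input nodes and a single output node. As a minimal illustrative sketch (not the authors' implementation), assuming scikit-learn and placeholder vectors for the five behaviour factors, the mapping from listener behaviour to UOS could be trained as follows:

# Minimal sketch: an MLP that maps five per-utterance behaviour features
# (e.g. cognitive-load score, emotion score, word-repetition count, ...)
# to the listener's reported User Opinion Score (UOS).  Placeholder data
# stands in for the real feature-extraction pipeline.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 5))                         # five behaviour factors per utterance
y = rng.integers(1, 6, size=200).astype(float)   # UOS on the 1..5 MOS scale

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(10,),   # hidden-layer size is an assumption
                     max_iter=2000, random_state=0)
model.fit(X_tr, y_tr)

uos_hat = np.clip(model.predict(X_te), 1.0, 5.0)  # keep estimates on the MOS scale
print("mean absolute error:", np.abs(uos_hat - y_te).mean())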

III. Behavioural Factors

A. Cognitive load

Cognitive load has been exploited as a metric to evaluate the satisfaction level from the user's speech. "Cognitive load refers to the amount of mental demand imposed by a particular task, which has been closely associated with the limited capacity of working memory and learning [6]." The rationale behind employing cognitive load for UPQ evaluation is that the user's cognitive load increases as the clarity of the heard sound diminishes. This is because the user has to process too much information at once to identify lost or damaged voice samples and repair them in order to recognize what the partner is saying. "Speech is highly suitable for measuring cognitive load due to its non-intrusive nature and ease of collection [7]." It is useful to think of the overhead involved in the human processing of degraded and delayed speech, which is closely related to the idea of cognitive load. The increased difficulty of perceiving degraded speech causes fluctuations in the user's talk owing to the increased CL. Cognitive load rises in these situations because the brain is presented with ambiguous speech and therefore needs to work harder to identify the phonological content of the spoken message, leaving less working memory available to process semantic and interpersonal information [8].

A special test, the Stroop test, has been employed to gather a set of speech recordings under different cognitive loads; these recordings are used to build the CL classifier. The Stroop test was originally developed by John Ridley Stroop for experimental psychology research on CL [9]. A number of cards are prepared for this test, where the name of a color is written on each card in an incongruent font color.






There are two types of tests: 'Reading Color Names' (RCN), in which participants are asked to read out the words while ignoring the font color, and 'Naming Colored Words' (NCW), in which the actual font color of the words has to be read out. Reading interferes with the recognition process, and participants have to try harder to override the meaning of the words in order to name the actual font color. Later research by Edith Kaplan's group extended the Stroop test to more situations [10], such as naming color fields, congruent color words, incongruent color words, and combinations of them. Given their nature, these tests are extremely useful for creating situations with different cognitive loads.

In our method, two levels of CL are assumed (low CL and high CL). The basic parameters for CL classification are MFCC and Shifted Delta Coefficients (SDC) [11]. A Support Vector Machine (SVM) classifier is employed to discriminate between high and low CL from these parameters. An SVM treats the separation boundary between classes as a hyperplane in the feature space; this boundary maximizes the distance between the two separated classes, so the training task is to find the support vectors that separate them. A number of Stroop tests were undertaken by randomly selected participants and the relevant data were collected. 108 feature vectors are used for training and testing, with the training and test data selected by a holdout cross-validation scheme. Experimental results show that about 90% accuracy is achieved by our CL classifier after training.
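As an illustration of this classification step, the sketch below builds a low/high CL classifier from MFCC and shifted delta coefficients with an SVM and a holdout split. It is a sketch under assumptions rather than the authors' exact setup: librosa is assumed for MFCC extraction, a simple d-P-k SDC parameterisation and mean pooling over frames are assumed, and synthetic placeholder vectors stand in for the 108 Stroop-test feature vectors.

# Sketch of a low/high cognitive-load classifier from MFCC + SDC features.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def sdc(mfcc, d=1, P=3, k=3):
    """Shifted delta coefficients: c(t + j*P + d) - c(t + j*P - d), j = 0..k-1."""
    blocks = []
    for j in range(k):
        plus = np.roll(mfcc, -(j * P + d), axis=1)
        minus = np.roll(mfcc, -(j * P - d), axis=1)
        blocks.append(plus - minus)
    return np.vstack(blocks)                      # (k * n_mfcc, n_frames)

def utterance_features(path, sr=16000, n_mfcc=13):
    """13 MFCCs plus their SDCs, mean-pooled to one vector per utterance."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.vstack([mfcc, sdc(mfcc)]).mean(axis=1)

# In practice:  X = np.array([utterance_features(p) for p in stroop_wav_paths])
# Synthetic placeholders keep the sketch self-contained (108 vectors, 13 + 39 dims).
rng = np.random.default_rng(0)
X = rng.normal(size=(108, 52))
y = rng.integers(0, 2, size=108)                  # 0 = low CL, 1 = high CL

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print("holdout accuracy:", clf.score(X_te, y_te))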

B. Emotion

Emotion is used as another factor in evaluating the user's behaviour. Poor or unsatisfactory quality may make the user angry, and this can be recognized and used as a behavioural cue. MFCC, LPC, PLP and energy are baseline features in emotion recognition systems [12, 13], so these features are used to classify speech into angry and neutral classes. An SVM is used as the classifier and the German emotional speech database [14] is used to train it. Since emotional speech cues are largely independent of language [15, 16], the use of German emotional speech in this research does not significantly affect the results of the approach. Results show that our classifier separates anger from other emotions with about 90% accuracy.
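By way of illustration, one possible realisation of this angry/neutral classifier is sketched below. It assumes the Berlin emotional speech database (EmoDB) is available locally with its standard filename coding (the sixth character from the end of each filename encodes the emotion, 'W' for anger and 'N' for neutral), and it restricts the features to MFCC and frame-energy statistics; LPC/PLP features and tuning details are omitted.

# Sketch of an angry-vs-neutral speech classifier (MFCC + energy statistics, SVM).
import glob
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def emotion_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    energy = librosa.feature.rms(y=y)
    stats = lambda m: np.concatenate([m.mean(axis=1), m.std(axis=1)])
    return np.concatenate([stats(mfcc), stats(energy)])

paths = sorted(glob.glob("emodb/wav/*.wav"))           # assumed local EmoDB copy
paths = [p for p in paths if p[-6] in ("W", "N")]      # keep anger ('W') and neutral ('N')
if not paths:
    raise SystemExit("EmoDB recordings not found under emodb/wav/")

X = np.array([emotion_features(p) for p in paths])
y = np.array([1 if p[-6] == "W" else 0 for p in paths])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print("holdout accuracy:", clf.score(X_te, y_te))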


C. Normalization

The user's perceived quality is affected by his/her initial mental state (e.g. a user may start a conversation already under high cognitive load). To account for this, the user's initial cognitive load and emotional state should be taken as the user's normal state, so that the results better reflect reality. To obtain a normalized classification, the beginning of each conversation is used to normalize the remainder of the conversation. In the learning phase for cognitive load, the first CL level in the Stroop test is taken as the normal state and all other tests are normalized against it. For emotion, the user's neutral emotion is used as the normal state and the features of other emotions are normalized accordingly. The classifiers are trained with these normalized features, which are then used as another user behaviour factor.
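A minimal sketch of this baseline normalization, assuming per-utterance feature vectors in conversation order (the number of baseline utterances is an assumption, since the paper only states that the beginning of each conversation is used):

import numpy as np

def normalize_against_baseline(features, n_baseline=10, eps=1e-8):
    """Normalize per-utterance feature vectors against the user's initial state.

    features: array of shape (n_utterances, n_features) in conversation order;
    the first n_baseline utterances are taken as the user's normal state.
    """
    baseline = features[:n_baseline]
    mu, sigma = baseline.mean(axis=0), baseline.std(axis=0) + eps
    return (features - mu) / sigma

# Example with placeholder data: 50 utterances, 5 behaviour features each
feats = np.random.rand(50, 5)
normed = normalize_against_baseline(feats)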

D. Word repetition

When the quality of the conversation degrades, the user tends to ask his/her partner about missed words, usually by repeating the words he/she heard in order to check their correctness. This metric is therefore used as another factor for evaluating the perceived quality, as discussed in the previous section.
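The extracted text does not describe how repetitions are detected. One simple possibility, assuming transcripts of both sides of the call are available (an assumption not stated in the paper), is to count how many words of the listener's reply echo the partner's immediately preceding utterance:

import re

def repeated_word_count(partner_utterance: str, listener_reply: str) -> int:
    """Rough proxy for the word-repetition factor: count words in the reply
    that already occurred in the partner's previous utterance (word order
    and stop words are ignored)."""
    tokenize = lambda s: set(re.findall(r"[\w']+", s.lower()))
    return len(tokenize(partner_utterance) & tokenize(listener_reply))

# Example: the listener echoes a phone number to confirm it was heard correctly
print(repeated_word_count("the number is 3378338",
                          "3378338? did you say 3378338?"))   # -> 1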

IV. Experimental Results

To evaluate UPQ, a set of conversations has to be collected and a test plan has to be devised.

A. Test plan

Apart from free conversation, different types of conversation scenarios have been described in the literature. Naturalness and interactivity are two important requirements for a good scenario. Therefore the interactive Short Conversation Test (iSCT) introduced in [17] is used to produce the conversation data set. In an iSCT, both partners have a table of names, phone numbers and e-mail addresses in front of them; the names column is common to both partners and is used as the identifier for filling in the blank fields of the other columns. The second and third columns contain empty fields that require an interactive conversation to be filled in with the information obtained from the conversation partner. An example table for one of the speakers is shown in Table 1.




Table 1: Example of an iSCT scenario for user A. The table has three columns (Name, Phone Number, E-Mail) and ten rows of contacts, including Marzie Asadi Keshavarzi, Poya Akbari, Hamid Hasani, Mahbobe Adham, Asghar Rahimi, Somaye Karimi, Hamid Gholipour, Mojtaba Asgharpour and Maryam Jafari. Some phone-number and e-mail cells are pre-filled (e.g. 3378338, [email protected]), while the rest are left blank to be completed during the conversation.

The two partners sit in two separate rooms without any special restriction. They talk to each other through a VoIP system and fill in the empty fields. Speech is recorded at three points: at the VoIP server, at the phone client, and with a separate microphone. It is necessary to know the bandwidth condition at the time the user speaks, so the recordings from the VoIP server and the phone client are used to synchronize the timing between the client phone and the server. The speech recorded by the microphone contains only the talker's voice and is used to calculate the user's perceived quality. During the conversation, the quality of the channel is switched between low and high in a random manner, for periods of 1 to 2 minutes, by reducing the bandwidth of the communication channel (using the tc traffic control tool).


The user is asked to rate the conversation quality on the five-level MOS scale during the conversation and can change his/her opinion about the channel quality at any time using a purpose-built application, in which the user simply selects a number from 1 to 5.
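The paper only names the tc traffic-control tool; the sketch below shows one way the alternation could be scripted, with the network interface and the 'high'/'low' rates as assumptions (root privileges required):

# Randomly alternate the channel between a high- and a low-bandwidth state
# every 1-2 minutes using the Linux tc token bucket filter.
import random
import subprocess
import time

IFACE = "eth0"                                   # assumed interface carrying the VoIP traffic
RATES = {"high": "1mbit", "low": "64kbit"}       # assumed bandwidth levels

def set_rate(rate: str) -> None:
    subprocess.run(["tc", "qdisc", "replace", "dev", IFACE, "root",
                    "tbf", "rate", rate, "burst", "16kbit", "latency", "400ms"],
                   check=True)

def clear() -> None:
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=False)

try:
    while True:
        state = random.choice(["high", "low"])
        set_rate(RATES[state])
        print("channel set to", state)
        time.sleep(random.uniform(60, 120))      # hold each state for 1-2 minutes
except KeyboardInterrupt:
    clear()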


B. Evaluation results and discussions

The recorded speech that contains only the speaker's voice is filtered with a spectral subtraction algorithm [18] to remove background noise, and the voiced parts are then segmented for use in the user behaviour analysis. The results of the learning-based model are compared to the UOS for the two ends of a VoIP conversation in Fig. 3 and Fig. 4; the continuous (red) lines show the UOS while the circles show the model outputs (estimated UOS).
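Reference [18] describes a minimum-statistics noise estimator; the sketch below is a simpler, generic spectral-subtraction pass (noise spectrum estimated from the first few frames), intended only to illustrate the kind of preprocessing applied before the behavioural analysis:

# Simplified spectral subtraction: subtract a noise magnitude spectrum,
# estimated from the first frames, from every frame of the recording.
import numpy as np
import librosa
import soundfile as sf

def spectral_subtraction(y, noise_frames=10, n_fft=512, hop=128, floor=0.05):
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(stft), np.angle(stft)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # spectral floor limits musical noise
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop, length=len(y))

# Example usage (file name is a placeholder for one microphone recording):
# y, sr = librosa.load("listener_mic.wav", sr=16000)
# sf.write("listener_mic_denoised.wav", spectral_subtraction(y), sr)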

Fig. 3: User A's opinion score compared to the model results.

Fig. 4: User B's opinion score compared to the model results.

To analyse the results further, the five-level MOS scores are scaled to 1 (low quality) or 2 (high quality) according to a threshold (Table 2 and Table 3).

Fig. 5: User opinion score and the threshold applied to it.

The first row of each table shows the low-quality parts and the second row the high-quality parts of the conversation, as determined by the user's opinion score. Each column of the tables shows the average of the scaled scores obtained from the proposed model for that part of the speech.

Table 2: User A results — the average scaled model score for each low-quality and each good-quality part of the conversation (values between 1.0 and 1.857).

Table 3: User B results — the average scaled model score for each low-quality and each good-quality part of the conversation (values between 1.0 and 1.916).

As the tables show, excluding the second and third columns of the first row of Table 2 and the first, second, seventh and eighth columns of the first row of Table 3, all of the results are correct. The model therefore achieves an accuracy of about 75%.
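For reference, a small sketch of this scaling and per-segment accuracy check; the threshold (taken here as the midpoint of the 1-5 scale) and the segment boundaries are assumptions, since the extracted text does not state them:

import numpy as np

def scale_mos(scores, threshold=2.5):
    """Map five-level MOS values to 1 (low quality) or 2 (high quality)."""
    return np.where(np.asarray(scores, dtype=float) > threshold, 2, 1)

def segment_accuracy(estimated_segments, reference_segments, threshold=2.5):
    """Fraction of segments where the average scaled model output falls on the
    same side of the low/high boundary as the user's scaled opinion score."""
    est = np.array([scale_mos(s, threshold).mean() for s in estimated_segments])
    ref = np.array([scale_mos(s, threshold).mean() for s in reference_segments])
    return np.mean((est > 1.5) == (ref > 1.5))

# Placeholder example: two low-quality and two high-quality segments
est = [[1, 2, 1], [3, 2, 1], [4, 5, 4], [3, 4, 4]]
ref = [[1, 1, 2], [2, 1, 1], [5, 4, 4], [4, 4, 5]]
print("segment accuracy:", segment_accuracy(est, ref))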


V. Conclusion and Future Works

A new approach to estimating the user's satisfaction level has been introduced in this paper. The method is based on the analysis of the user's vocal feedback and investigates such behaviours to estimate the experienced quality of the conversation. Since the approach is based on the user's behaviour, it is closer to the user's real opinion and can adapt to the user's mental state and even to the environmental conditions. As the results show, it is a better metric for use in QoE-aware applications to improve the user's satisfaction with network communications. Employing further parameters, such as the user's physical behaviour (e.g. movement of the mobile phone), is considered as future work to improve the proposed model.

References

[1] S. Rein, et al., "Voice Quality Evaluation in Wireless Packet Communication Systems: A Tutorial and Performance Results for ROHC," IEEE Wireless Communications, vol. 12, pp. 60-67, 2005.
[2] ITU-T Rec. P.800, "Methods for Subjective Determination of Transmission Quality," 1996.
[3] ITU-T Rec. P.862, "Perceptual evaluation of speech quality (PESQ)," 2001.
[4] ITU-T Rec. P.563, "Single-ended method for objective speech quality assessment in narrow-band telephony applications," 2004.
[5] M. Abdulhussain and P. Dorel, "New single-ended objective measure for non-intrusive speech quality evaluation," Signal, Image and Video Processing, vol. 4, pp. 23-38, 2010.
[6] B. Yin, et al., "Automatic cognitive load detection from speech features," in Proceedings of the 19th Australasian Conference on Computer-Human Interaction: Entertaining User Interfaces, Adelaide, Australia, 2007.
[7] B. Yin, et al., "Investigating Speech Features and Automatic Measurement of Cognitive Load," in Multimedia Signal Processing, 2008 IEEE 10th Workshop on, 2008.
[8] M. Robert, "The combined effect of speech codec quality and transmission delay on human performance during complex spoken interactions," International Journal of Speech Technology, vol. 9, pp. 53-74, 2008.
[9] J. R. Stroop, "Studies of interference in serial verbal reactions," Journal of Experimental Psychology, vol. 18, pp. 643-662, 1935.
[10] D. C. Delis, et al., "The Delis-Kaplan Executive Function System," The Psychological Corporation, 2001.
[11] B. Y. Ruiz, et al., "Investigating speech features and automatic measurement of cognitive load," in Multimedia Signal Processing, 2008 IEEE 10th Workshop on, 2008.
[12] E. H. Kim, et al., "Speech emotion recognition separately from voiced and unvoiced sound for emotional interaction robot," in Control, Automation and Systems, 2008 (ICCAS 2008), International Conference on, 2008.
[13] I. Shafran, et al., "Voice Signatures," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2003, pp. 31-36.
[14] F. Burkhardt, et al., "A Database of German Emotional Speech," in Proc. Interspeech, 2005.
[15] T. Nwe, et al., "Speech emotion recognition using hidden Markov models," Speech Communication, 2003.
[16] K. R. Scherer, et al., "Emotion inferences from vocal expression correlate across languages and cultures," Journal of Cross-Cultural Psychology, vol. 32, pp. 76-92, 2001.
[17] A. Raake, Speech Quality of VoIP: Assessment and Prediction, 1st ed., Wiley, 2006.
[18] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech and Audio Processing, 2001.