Emotional Entrainment in Music Performance - Casa Paganini

Emotional Entrainment in Music Performance Giovanna Varni, Antonio Camurri, Paolo Coletta, Gualtiero Volpe Casa Paganini - InfoMus Lab, DIST, Università di Genova Viale Causa 13, 16145 Genova, Italy {varni,toni,colettap,volpe}@infomus.org

Abstract This work aims at defining a computational model of human emotional entrainment. Music, as a non-verbal language to express emotions, is chosen as an ideal test bed for these aims. We start from multimodal gesture and motion signals, recorded in a real world collaborative condition in an ecological setting. Four violin players were asked to play, alone or in duo, a music fragment in two different perceptual feedback modalities and in four different emotional states. We focused our attention on Phase Synchronisation of the head motions of the players. From observation by subjects (musicians and observers), an evidence of entrainment emerges between players. The preliminary results, based on a reduced data set, however do not grasp fully this phenomenon. A more extended analysis is current subject of investigation.

1. Introduction The study of human intended and unintended interpersonal co-ordination is one of the more interesting and challenging topics in the psychological and behavioral sciences (e.g., Schimdt et al. [17], Issartel et al. [7]). In the last years, co-ordination is receiving an increasing interest from the research community also in collaborative multimodal interfaces, ambient intelligence and social networks. The objective is to develop more natural and intelligent interfaces, with a focus on non-verbal communication, embodiment, and enaction (Camurri and Frisoli [3]). In this field, as in natural sciences and medicine, the co-ordination phenomenon is better-known as entrainment or synchronisation. There is no general accepted scientific definition of entrainment. Pikovsky et al. [13] define it as “an adjustment of rhythms of oscillating objects due to their weak interaction”. Entrainment and related phenomena can be studied focusing attention on different kinds of synchronisation (Phase Synchronisation, General Synchronisation, Complete Synchronisation) and with different approaches depending on experimental conditions (e.g., “passive” or “active” experiments (Pikovsky et al. [13])) and on

978-1-4244-2154-1/08/$25.00 ©2008 IE

physical observables (e.g. physiological data, Schäfer et al. [16], Rzeczinski et al. [15]). The idea behind this work is to extract knowledge on human emotional entrainment starting from multimodal gesture and motion recorded in a real world collaborative condition in an ecological setting. We focused on gesture with a twofold perspective: gesture as simple physical signals, i.e., gesture as physical movements, and expressive gesture, that is gesture as conveyer of nonverbal emotional content (e.g., Camurri et al. [2]). In this way, we intend to test how non-verbal expressive gestural communication can play a relevant role in entraining people under different perceptual coupling strengths and inducted emotional states. In order to have an ecological set-up, we chose to act in passive experiment conditions. Additional analysis challenges emerge in passive experiments, because we have not direct access to the system parameters affecting the coupling. Further, we can observe the dynamics of the system only under free running conditions. The experimental set up and data set, as described in Section 2, is ambitious and very challenging: in this work we focused on the sub-set of data concerning only heads motion. This paper is organized as follows. In Section 2 we describe the set up and we disclose details of the data set. Analysis techniques and synchronization detection techniques are described in Section 3 and in Section 4 respectively. A discussion on the obtained results is presented in Section 5.

2. Experimental set up and data set The real world scenario we set up is an on-stage live musical performance (Fig.1). We chose this scenario because it enables the observation of human to human interaction in a truly ecological setting. Moreover, performing arts, and in particular music, are an ideal testbed to investigate non-verbal expressive communication and expressive gesture. Four violin players were involved in the experiment and were asked to play alone or in duo a short musical excerpt (a canon) from the “Musical Offering” by J.S. Bach. This excerpt was selected since it is a piece with no agogical

indications. Also, the composer did not indicate any musical instrument for playing the piece. Therefore, the piece maintains its coherence (both from the point of view of the listener, and of the music performer) even when performed within a range of different interpretations, including the four emotional interpretations we chose, and different instruments. In particular, in the initial phase of our study, once individuated the musical excerpt, we made recording with both flutists and violinists, in order to confirm the previous hypotheses and to identify the non functional movements (basically, head movements in our study). Furthermore, the use of the same musical excerpt guaranteed the minimization of the differences in functional gesture (that is the gestures directly involved in the sound production process, e.g, moving the bow), and enabled us to focus mainly on non functional gesture. The piece was performed by the players in two different perceptual feedback modalities and in four different emotional states according to verbal instructions provided by psychologists. More specifically, in each trial, one player was verbally requested to play in order to convey and to arouse one specific emotional state in the other player by means of gesture (physical and expressive). We call emotional entrainment this emotional locking. The following feedback conditions were tested: in the first one, players could co-ordinate by means of the sound they produced and their glance, in the second one only by means of sound (no eye-contact). In this last condition, the glance effect was removed using a wallboard in between the players. Anger, Joy, Sadness, Serenity were chosen as emotional states according to the dimensional theories of emotion introduced by Russel [14]. Each player played the canon three times for all the possible feedback and emotional conditions. Each single performance lasted about 25 seconds. All players were naïve about the hypotheses being tested and all were verbally informed about the general purpose of the research. During the performances for each player we recorded the set of multimodal data listed in the following. In order to deal with multimodality, from a more technical and operative point of view, we needed to set up a multi-sensor data recording environment. Technical complexity resulted also from the need to handle the different features and the synchronization of the heterogeneous hardware devices employed. We used four video cameras, two cameras for each player: two ultra-Fast Full-Frame Shutter Digital Cameras (1280 x 800 pixels, 45 fps) and two b/w video cameras (720 x 576 pixels, 25 fps). The black and white cameras were placed 5-meters above each player, downward looking; the other two were placed in front of the players. This video set-up allowed us to record gestures from different viewpoints (e.g., full-body gestures and arms/hand/head gestures). Furthermore, we used four

Figure 1: The set-up of the scenario developed at our Lab site of Casa Paganini (www.casapaganini.org). The two players are performing the music excerpt in the non-eye contact condition: a wallboard is here placed between the players.

microphones (two Neumann KM 184 cardioid microphones and two radio microphones AKG C444) sampled at 48 kHz and 16 bits per channel. Two microphones were used to capture environmental sound. The other two microphones were placed directly on the violin body, thus filtering out the room acoustics. Finally, BioMuse sensors collected and processed human bioelectric signals acquired using standard non-invasive transdermal electrodes. Data were recorded on four computers in the uncompressed custom file format for multimodal signals supported by EyesWeb (ebf format), maintaining synchronization among signals. Each computer ran EyesWeb XMI (Camurri et al. [1]) to manage in real-time the different sensor devices and to record in a synchronized way with timestamp the data streams in real time. Audio signals were used as the main clock in EyesWeb XMI to collect all multimodal signals.

3. Head motion analysis Head motion analysis seems to have a great relevance in expressive gesture analysis as referred in Dahl and Friberg [4], Davidson [5], and Timmers et al. [18]. In order to have an ecological setting, we used a video-based marker-less approach based on EyesWeb XMI. In this work, we focused on the video streams obtained from the b/w top cameras: from such viewpoint, it is reasonable to assume that heads move on the horizontal plane, thus neglecting the small displacement from such plane when the body is twisted. We assumed that head contours have an elliptical shape and we used the time-series of a bivariate state vector to describe the dynamics of the players’ head. We chose as state vector the center of mass (CoM) coordinates of the approximating ellipse. In this way the series of these vectors represented the trajectory in the physical plane of image.

For the analysis, an off-line EyesWeb XMI modular application was developed to detect and track the heads. We extracted the 2D-blobs corresponding to the heads by means of a simple identification algorithm based on a pixel value threshold and on the computation of neighborhoods pixels of nth degree. The next step was filtering out blobs which area was too small or too large to be identified as head. The tracking of the remaining blobs was performed with the following approach. The tracking module takes as input a list of the blobs detected in the current frame of the video-stream and compares them with the blobs list detected in the previous frame and stored in an internal buffer of the module. A weighted distance-area criterion is used to match and track blobs frame by frame. More specifically, at each frame, the state of this module consists of a couple of lists: the list of the identified blobs and the list of the unidentified blobs at the previous frame. At each new frame, for each of the identified blobs, a search in the input list is performed: the blob having the minimum difference among the blobs moved from their previous position within neighborhood with prefixed radius is chosen. The same numerical identifier as the previous frame is assigned to this blob. The input blobs not yet identified are tracked as follows. For each blob unidentified at the previous frame, a new search in the input blobs list is done and the same algorithm described above is applied. In this case, a new numerical identifier is assigned to the blob with the minimum difference. No operation is done on left behind blobs. The minimum difference is expressed by Eq. (1):

minimum difference =

min

{diff }

L 2 − norm , Δν 0 , 0

(1)

where:

diff = w1 * (L 2 − normof blobs CoM ) + w2 * sqrt (Δν 0,0 ) + ξp , q

ν 0, 0 is the zero order central normalized moment of

blobs;

ξp, q are weighted contributes by non zero central normalized moment of blobs, and w1 and w2 are the weights of distance and area respectively, subject to: 0 ≤ w1 ≤ 1 0 ≤ w2 ≤ 1 w1 + w2 = 1 We obtained best results with w1 = 0.8 and w2 = 0.2. The presence of noise gave rise up to some erroneous and sharp contours of blobs so that it was not possible to use directly these blobs for the analysis and we decided for an elliptical approximation of the head.

Figure 2: Left panel: trajectory of Player A when the wallboard is removed. Right panel: trajectory of Player A with the wallboard is between the two players.

A module based on the computation of the central normalized moments of objects was employed to perform this approximation. Moments up to the second order are here considered. Data inspection revealed that ellipse approximation works better than simple blob or bounding box for all data set. In order to remove the residual noisy spikes, a Kalman filter was used to smooth the obtained trajectories. The state transition model for the filter is a simple kinematical model with no input.

4. Entrainment analysis Most of the difficulties to analyse real time-series grows out of the observation that such real data time-series are generally short, noisy, and non-stationary. Moreover, when time-series come from expressive gesture and movements and the interest is in making a comparison of raw data or features from these time-series, we have to take into account their different temporal length. For example, in our data set, a significant temporal variability in the duration of the performances of the four players and of the same player respectively was detected. Non-stationarity and non-linearity globally detected in the time-series did not allow to use usual linear techniques of analysis like cross-correlation, coherence and Fourier transform. Wavelet transform and cross-wavelet transform techniques (e.g., [8]) have been preferred in these last years to these standard methods. They can be applied to non-stationary data and take into account multiple frequencies that are usual in movement. Further, they enable to probe the time and frequency domains simultaneously. However, cross-wavelet transform has the drawback that it does not overcome the problem of handling time-series with different length. By visual inspection of the shapes of the heads trajectories, it appears that each player has a set of preferred or recurring positions in the image plane. Figure 2 depicts an example of trajectory of a performance. This simple observation encouraged us to consider the method

of recurrence plots (RPs) introduced by Eckmann et al. [6]. This approach is based on the recurrence property of dynamical systems and provides qualitative information on the dynamics of systems. Quantitative information on dynamics can also be obtained by the recurrence quantification analysis RQA (Marwan and Kurths [11], Webber and Zbilut [19], Zbilut and Webber [20]). RPs and RQA were already successfully applied in synchronisation analysis. We focused our attention on detection of Phase Synchronisation by means of RPs and RQA. Phase Synchronisation is the more common and the simplest detectable form of entrainment among two or more signals, and in particular motion signals (e.g., see [7][8]). As referred in Marwan et al. [10], the computation and the comparison of the probabilities of the system recurrences “to the ε-neighbourhood of a former point xj of the trajectory after τ time steps” can detect and quantify Phase Synchronization regime. An estimate of this probability function is given by Eq. 2 [10]:

pˆ (ε ,τ ) = RRτ ⎛⎜⎝ ε ⎞⎟⎠ =

1 N −τ

∑ Θ(ε − N −τ i =1

G G x i − xi + τ

)

(2)

where: N is the number of samples of time-series; ε is the radius of neighbourood; τ is the time delay; Θ is the Heaviside function; G G x i , x i + τ are two states of the system under study. We fixed the radius of neighbourood according to the criterion suggested in Koebbe and Mayer-Kress [9] and Zbilut and Webber [18] in order to minimise, for example, tangential motion effect. The maximum time delay τ was chosen equal to the maximum time interval in which possible periodicity was observable and taking into account the RQA principles. In this first approach to Phase Synchronisation, we considered also in the analysis the time delay introduced by the canon framework. Due to this delay, we computed the cross-correlation function of each pair of pˆ (τ ) (for this computation, pˆ (τ ) were normalised to zero mean and standard deviation of one). In case of Phase Synchronisation, a high value of crosscorrelation at a temporal lag value is expected. The CRP (Cross Recurrence Plot) Toolbox for Matlab® (provided by TOCSY: http://tocsy.agnld.uni-potsdam.de) was used to deal with recurrence plots and recurrence quantification analysis.

5. Results and discussion Table I, II, III, IV summarise preliminary results. Each table refers to a single emotional state we tested.

MVC 2-way co-ordination (sound – glance) Trial 1 0.26 Trial 2 0.31 Trial 3 -0.44

Lag (s) 4.72 4.40 17.20

MVC 1-way co-ordination (sound) 0.41 0.41 -0.15

Lag (s) 6 13.52 12.12

Table I: Anger-Player A vs Player B. Maximun value of crosscorrelation (MVC) and lag. MVC 2-way co-ordination (sound –glance) Trial 1 -0.52 Trial 2 -0.40 Trial 3 0.30

Lag (s) 14.88 11.72 11.20


Lag (s) 11.20 9.36 20.28

Table II: Sadness-Player C vs Player D. Maximun value of crosscorrelation (MVC) and lag. MVC 2-way co-ordination (sound –glance) Trial 1 -0.19 Trial 2 0.30 Trial 3 -0.41

Lag (s) 5.96 8.16 3.76


Lag (s) 6.24 3.08 11.04

Table III: Joy-Player C vs Player D. Maximun value of crosscorrelation (MVC) and lag. MVC 2-way co-ordination (sound –glance) Trial 1 0.45 Trial 2 0.33 Trial 3 - 0.43

Lag (s) 3.76 13.8 14.68

MVC 1-way co-ordination (sound) 0.14 -0.47 -0.19

Lag (s) 7.28 11.28 12.36

Table IV: Serenity-Player A vs Player B. Maximun value of cross-correlation (MVC) and lag.

We observed that about the 62.5% of data shows a maximum value of cross-correlation function very low (