International Workshop on Acoustic Signal Enhancement 2012, 4-6 September 2012, Aachen

CROSSTALK CANCELLATION SYSTEM USING A HEAD TRACKER BASED ON INTERAURAL TIME DIFFERENCES

Yesenia Lacouture-Parodi, Emanuël A. P. Habets

International Audio Laboratories Erlangen∗
Am Wolfsmantel 33, 91058 Erlangen, Germany
{yesenia.lacouture,emanuel.habets}@audiolabs-erlangen.de

ABSTRACT

To accurately reproduce binaural signals through loudspeakers, proper crosstalk cancellation filters have to be added to the reproduction chain. When the crosstalk cancellation filters are time invariant, head rotations cause the perceived position of virtual sound sources to shift towards the loudspeakers' region. Further, errors in the interaural time difference (ITD) of the reproduced virtual sound sources are more critical than errors in the interaural level differences (ILD). This suggests that, to relax the constraint on the location of the listener with respect to the loudspeakers and to reproduce the virtual sound sources correctly, we need good knowledge of either the location of the listener or the acoustic impulse responses (AIRs) between the loudspeakers and the ears of the listener. Given that the loudspeaker signals are highly correlated, on-line estimation of these AIRs remains a challenging task. This paper explores the use of microphones placed close to the ears of the listener to estimate the head orientation based on ITDs. By estimating the difference between the ITD of the signals arriving at the microphones and the ITD of the desired binaural signal, the orientation angle is estimated and the crosstalk cancellation filters are updated. A model describing the relation between the ITD error and the orientation angle is presented. Two ITD estimation methods are compared and results of a preliminary evaluation of the proposed system are discussed.

Index Terms— Crosstalk cancellation, interaural time difference, interaural cross-correlation, spherical-head model, head tracker

1. INTRODUCTION

To accurately reproduce binaural signals through loudspeakers, we need to mitigate the crosstalk that exists between the loudspeakers and the contralateral ears. This can be achieved by incorporating appropriate crosstalk cancellation filters into the reproduction chain. Usually, crosstalk cancellation filters are designed for a specific location of the listener with respect to the loudspeakers, and the reproduction area is limited to a rather narrow region known as the sweet spot [1]. It has been previously shown that, when using crosstalk cancellation filters, head rotations cause the perceived position of the virtual sound sources to shift towards the loudspeakers' region [2]. It has also been shown that even though the sweet-spot size with respect to head rotations is relatively large (when looking only at the magnitude ratio of the crosstalk to the direct signal), the errors introduced in the interaural time differences (ITDs) of the reproduced virtual sound sources due to head rotations are rather critical [3]. Thus, to attain a crosstalk cancellation system that relaxes the constraints on the listener's location, we need good knowledge of either the position of the listener or the acoustic impulse responses (AIRs) between the loudspeakers and the ears of the listener.

There are a number of known technologies to track the head of the listener, such as magnetic trackers and video cameras. For instance, Karjalainen et al. proposed in [4] a system that uses two microphones placed at the ears of the listener and a set of known anchor sources to estimate the location and orientation of the listener. The system uses the signals emitted by the anchor sources to estimate times of arrival. In this paper we propose a system that also makes use of two microphones placed near the ears of the listener to estimate the orientation of the head. Instead of using additional anchor sources, we make use of the signals rendered by the crosstalk cancellation system. In principle, the head orientation can be inferred from the AIRs between the loudspeakers and the ears of the listener. Unfortunately, the loudspeaker signals are often highly correlated, which makes the identification of the AIRs problematic. We propose instead to use the ITD error between the desired binaural signals and the signals captured by the microphones to estimate the orientation angle with respect to the loudspeakers. Based on the known spherical-head model [5], we derive a model of the ITD error as a function of the orientation angle. We also discuss several ITD estimation methods and compare two of them in the simulations. We then present an example of a simple adaptation algorithm to test the proposed system.

∗ A joint institution of the University Erlangen-Nuremberg and Fraunhofer IIS.

Fig. 1. Simplified diagram of an adaptive crosstalk cancellation system (CCS) with a head tracker based on the microphone signals vi (a) and the geometry used in the model (b).

2. MODEL FORMULATION

Fig. 1 shows a basic diagram of the proposed system and the geometry that will be used throughout the paper. The angle φs is the span angle between the loudspeakers and α is the orientation angle with respect to the middle point between the loudspeakers. The angle θe is the angle of the observation point on the sphere, i.e. the angle of the ears with respect to the median plane.

The functions Hij correspond to the transfer functions from the i-th loudspeaker to the j-th ear, and α̂ is the estimated orientation angle based on the ITD error between the desired binaural signals di and the measured signals at the ears vi. The inputs to the crosstalk cancellation system (CCS) are the desired binaural signals di and the transfer functions Ĥij, which are an approximation of the functions Hij and are calculated using the Spherical Head Model (SHM).

2.1. Spherical head related transfer functions

To analyze the changes in ITD at the ears with head rotations and to implement our system, we make use of the spherical-head model to model the transfer functions Hij (∀ i, j). The spherical-head related transfer function (SHRTF), for a given source S, is defined as [5]

    \mathrm{SHRTF}(\rho, kr, \theta_s - \theta_e) = -\frac{r_s}{k}\, e^{-i k r \rho}\, \Psi(\rho, kr, \theta_s - \theta_e),    (1)

where θs is the angle of the source with respect to the median plane, ρ = rs/r is the normalized distance, rs is the distance between the source and the center of the sphere, r is the radius of the sphere, k = ω/c is the wave number and

    \Psi(\rho, kr, \theta_s - \theta_e) = \sum_{l=0}^{\infty} (2l + 1)\, P_l\!\big(\cos(\theta_s - \theta_e)\big)\, \frac{h_l(k r \rho)}{h'_l(k r)},    (2)

for ρ > 1. The term P_l is the Legendre polynomial of degree l, h_l is the l-th order spherical Hankel function and h'_l is the first-order derivative of h_l with respect to its argument. Using the Woodworth-Schlosberg ray-tracing formula [6, p. 76], it can be shown that the ITD of the sphere is given by

    \mathrm{ITD} = \frac{r}{c} \begin{cases} 2\theta_s, & 0 \le \theta_s \le \theta_e - \theta_0 \\ \theta_e + \theta_s + \Gamma(\theta_s), & \theta_e - \theta_0 \le \theta_s \le \pi - \theta_e \\ 2\pi - \theta_e - \theta_s + \Gamma(\theta_s), & \pi - \theta_e \le \theta_s \le \pi \end{cases}    (3)

where \Gamma(\Omega) = -\theta_0 + \sqrt{\rho^2 - 1} - \sqrt{\rho^2 - 2\rho\cos(\theta_e - \Omega) + 1} and \theta_0 = \cos^{-1}(1/\rho) for \rho \ge 1.
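As a numerical illustration of (3), the following Python sketch evaluates the Woodworth-Schlosberg ITD of a spherical head. It is not part of the paper's implementation; the head radius r = 8.75 cm is an assumed typical value, and the default ear angle follows the θe = ±100° used later in the simulations.

```python
import numpy as np

def woodworth_schlosberg_itd(theta_s, rho, r=0.0875, c=343.0,
                             theta_e=np.deg2rad(100.0)):
    """Spherical-head ITD of Eq. (3) for a source at angle theta_s [rad]
    from the median plane and normalized source distance rho = r_s / r > 1.

    r       : head radius in meters (assumed value, not given in the paper)
    c       : speed of sound in m/s
    theta_e : ear angle with respect to the median plane [rad]
    """
    theta_0 = np.arccos(1.0 / rho)

    def gamma(omega):
        # Gamma(Omega) as defined after Eq. (3)
        return (-theta_0 + np.sqrt(rho**2 - 1.0)
                - np.sqrt(rho**2 - 2.0 * rho * np.cos(theta_e - omega) + 1.0))

    if 0.0 <= theta_s <= theta_e - theta_0:
        phase = 2.0 * theta_s
    elif theta_s <= np.pi - theta_e:
        phase = theta_e + theta_s + gamma(theta_s)
    elif theta_s <= np.pi:
        phase = 2.0 * np.pi - theta_e - theta_s + gamma(theta_s)
    else:
        raise ValueError("theta_s must lie in [0, pi]")
    return (r / c) * phase

# Example: virtual source at 45 deg, 0.75 m away (rho = 0.75 / 0.0875)
print(woodworth_schlosberg_itd(np.deg2rad(45.0), rho=0.75 / 0.0875) * 1e6, "us")
```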

2.2. Crosstalk cancellation filters

Perfect crosstalk cancellation is obtained when the signals at the ears are equal to the desired binaural signals, i.e. vi = di (see Fig. 1). Expressing the system in matrix form in the frequency domain, we ideally have H · C = I, where H is a matrix containing the transfer functions Hij (∀ i, j) and C is a matrix containing the crosstalk cancellation filters Cij (∀ i, j). Note that in this paper we omit the dependence on ω. An exact solution to this system does not exist, due to the non-minimum-phase characteristics of the functions H. There are a number of methods to approximate the solution, of which the ones based on least-squares approximations are the most commonly used [7, 8]. Even though in this study we make use of the least-squares approximation to calculate C, we derive our model for the head tracker using the Generic Crosstalk Canceler (GCC) described in [9] such that a mathematically tractable solution can be obtained. This method is based on a minimum-phase approximation. The system is modeled in terms of the interaural transfer functions (ITF), which are calculated as the ratio between the minimum-phase components of the ipsilateral and contralateral transfer functions. The excess-phase components are approximated as a frequency-independent delay:

    C = \frac{1}{A} \begin{bmatrix} 1 & -\mathrm{ITF}_{\mathrm{LS2}} \\ -\mathrm{ITF}_{\mathrm{LS1}} & 1 \end{bmatrix} \begin{bmatrix} \frac{1}{H_{11}} & 0 \\ 0 & \frac{1}{H_{22}} \end{bmatrix},    (4)

where A = 1 − ITFLS1 ITFLS2 and

    \mathrm{ITF}_{\mathrm{LS}i} \cong \frac{H_{ij}^{\mathrm{minphase}}}{H_{ii}^{\mathrm{minphase}}}\, e^{-j\omega\, \mathrm{ITD}_{\mathrm{LS}i}},

for i ≠ j, where Hij and Hii are, respectively, the contralateral and ipsilateral transfer functions of the i-th loudspeaker, and ITDLSi is the frequency-independent interaural time difference.
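For concreteness, the sketch below assembles filters of the form (4) on a DFT grid. It is a minimal sketch, not the paper's implementation: the minimum-phase components are obtained with a standard real-cepstrum (homomorphic) method, the function names are illustrative, and the transfer functions are assumed to come from real impulse responses on an even-length DFT grid.

```python
import numpy as np

def minimum_phase_spectrum(H):
    """Minimum-phase spectrum with the same magnitude as H (full DFT grid),
    obtained by folding the real cepstrum (standard homomorphic method).
    Assumes H stems from a real impulse response and len(H) is even."""
    N = len(H)
    c = np.fft.ifft(np.log(np.maximum(np.abs(H), 1e-12))).real
    fold = np.zeros(N)
    fold[0] = c[0]
    fold[1:N // 2] = 2.0 * c[1:N // 2]
    fold[N // 2] = c[N // 2]
    return np.exp(np.fft.fft(fold))

def gcc_filters(H11, H22, H12, H21, itd_ls1, itd_ls2, fs):
    """Generic crosstalk canceler of Eq. (4) per DFT bin.
    H11, H22: ipsilateral; H12, H21: contralateral transfer functions.
    itd_ls1, itd_ls2: frequency-independent ITDs in seconds."""
    N = len(H11)
    w = 2.0 * np.pi * np.fft.fftfreq(N, d=1.0 / fs)   # angular frequency grid
    itf1 = (minimum_phase_spectrum(H12) / minimum_phase_spectrum(H11)
            * np.exp(-1j * w * itd_ls1))
    itf2 = (minimum_phase_spectrum(H21) / minimum_phase_spectrum(H22)
            * np.exp(-1j * w * itd_ls2))
    A = 1.0 - itf1 * itf2
    C = np.empty((2, 2, N), dtype=complex)
    C[0, 0] = 1.0 / (A * H11)
    C[0, 1] = -itf2 / (A * H22)
    C[1, 0] = -itf1 / (A * H11)
    C[1, 1] = 1.0 / (A * H22)
    return C
```

In practice the transfer functions would come either from measurements or from the SHM of Sec. 2.1; DFT length and regularization are design choices that (4) leaves open.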

3. HEAD TRACKER

To properly update the crosstalk cancellation filters C, we need good knowledge of the acoustic transfer functions H. Because the loudspeaker signals are highly correlated, it is rather problematic to estimate H directly from the microphone signals vi. Instead of trying to estimate H directly from vi, we estimate the current orientation angle α using delay differences. In the following derivations, Ĥ denotes the transfer functions used to calculate the crosstalk cancellation filters C. Further, we assume that the current transfer functions H differ from Ĥ. The signals at the ears are described in the frequency domain as (see Fig. 1)

    V_i = \sum_{j=1}^{2} H_{ij} X_j, \quad i \in \{1, 2\}.    (5)
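The reproduction chain of Fig. 1 and (5) can be written compactly per frequency bin. The sketch below is illustrative only; the array layout and function name are assumptions, not the paper's code.

```python
import numpy as np

def ear_signals(H, C, D):
    """Frequency-domain signal chain of Fig. 1 / Eq. (5), per DFT bin:
    loudspeaker signals X = C @ D and ear signals V = H @ X.

    H, C : arrays of shape (2, 2, N), N = number of frequency bins
    D    : desired binaural spectra, shape (2, N)
    """
    # einsum performs the 2x2 matrix-vector product independently in every bin
    X = np.einsum('ijn,jn->in', C, D)   # signals fed to the loudspeakers
    V = np.einsum('ijn,jn->in', H, X)   # signals arriving at the ears, Eq. (5)
    return V, X
```

With H = Ĥ and C designed as in (4) (or with the least-squares method of [8]), V approximates D; a mismatch between H and Ĥ leaves residual crosstalk, which is what the head tracker below reacts to.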

Let us now assume that the transfer functions Hij (∀ i, j) and the signals Xi can be modeled as a magnitude and a linear delay, i.e. Xi = |Xi| e^{−jωτxi} and Hij = |Hij| e^{−jωτij}. We can thus formulate the interaural difference in terms of the interchannel differences between the signals at the loudspeakers:

    \frac{V_2}{V_1} = \left[ \frac{|H_{12}|\, e^{-j\omega(\tau_{12} - \tau_{22})} + |H_{22}|\, \mathrm{ICLD}_x\, e^{-j\omega\, \mathrm{ICTD}_x}}{|H_{21}|\, \mathrm{ICLD}_x\, e^{-j\omega(\tau_{21} - \tau_{11})} + |H_{11}|\, e^{-j\omega\, \mathrm{ICTD}_x}} \right] e^{-j\omega \Delta_v},    (6)

where Δv = ICTDx + τ22 − τ11, and ICTDx = τx2 − τx1 and ICLDx = |X2|/|X1| are, respectively, the interchannel time difference and the interchannel level difference between the signals X1 and X2. Using the crosstalk cancellation filters described in (4), we can express the interchannel differences at the loudspeakers as

    \frac{X_2}{X_1} = \left[ \frac{|C_{21}||D_1|\, e^{-j\omega\, \mathrm{ITD}_{\mathrm{LS1}}} + |C_{22}||D_2|\, e^{-j\omega\, \mathrm{ITD}_d}}{|C_{12}||D_2|\, e^{-j\omega\, \mathrm{ITD}_{\mathrm{LS2}}} + |C_{11}||D_1|\, e^{j\omega\, \mathrm{ITD}_d}} \right] e^{-j\omega \Delta_x},    (7)

where Δx = ITDd − τ̂22 + τ̂11. The delays τ̂ij correspond to the delays of the transfer functions Ĥij and ITDLSi = \sum_{j=1}^{2} (−1)^j τ̂ij. ITDd and |Di| are, respectively, the ITD and magnitude of the input binaural signals d1 and d2. Assuming that the terms in brackets of (6) and (7) contain mainly frequency-dependent phase components and that the linear delay of the system is described by the second term, we can model the frequency-independent ITD at the ears of the listener as

    \mathrm{ITD}_{\mathrm{ears}} \approx \mathrm{ITD}_d - \hat{\tau}_{22} + \hat{\tau}_{11} + \tau_{22} - \tau_{11}.    (8)

Using (3) to calculate τ̂22 − τ̂11, it can be shown that the difference in ITD between the desired signal and the reproduced signal according to our model is

    \mathrm{ITD}_d - \mathrm{ITD}_{\mathrm{ears}} \approx \tau_{11} - \tau_{22} + \frac{r}{c} \begin{cases} \theta_e - \frac{\phi_s}{2} - \hat{\alpha} + \Gamma\!\left(\frac{\phi_s}{2} - \hat{\alpha}\right), & -\frac{\phi_s}{2} \le \hat{\alpha} \le -\theta_{\mathrm{lim}} \\ -2\hat{\alpha}, & -\theta_{\mathrm{lim}} \le \hat{\alpha} \le \theta_{\mathrm{lim}} \\ -\theta_e + \frac{\phi_s}{2} - \hat{\alpha} - \Gamma\!\left(\frac{\phi_s}{2} + \hat{\alpha}\right), & \theta_{\mathrm{lim}} \le \hat{\alpha} \le \frac{\phi_s}{2} \end{cases}    (9)

where θlim = θe − θ0 − φs/2. Note that in this model we are assuming that head rotations do not exceed the range of the loudspeakers, i.e. |α| ≤ φs/2. The ITD difference ITDears − ITDd is zero when α̂ = α. Thus, equating (9) to zero, we can solve for the current orientation angle α as a function of the current delay difference τ22 − τ11. It follows that, for φs ≤ θe − θ0,

    \alpha = \frac{c\,(\tau_{11} - \tau_{22})}{2 r}.    (10)
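A small numerical sketch of the model: it evaluates the piecewise ITD-error expression (9) for a candidate angle α̂ and the closed-form estimate (10). The head radius and the function names are illustrative assumptions, not values or identifiers from the paper.

```python
import numpy as np

def itd_error_model(alpha_hat, tau11_minus_tau22, phi_s, theta_e, rho,
                    r=0.0875, c=343.0):
    """Modeled ITD error of Eq. (9) for a candidate orientation angle
    alpha_hat [rad], given the ipsilateral delay difference tau11 - tau22 [s].
    Angles in radians; r is an assumed head radius."""
    if abs(alpha_hat) > phi_s / 2.0:
        raise ValueError("model assumes |alpha_hat| <= phi_s / 2")
    theta_0 = np.arccos(1.0 / rho)
    theta_lim = theta_e - theta_0 - phi_s / 2.0

    def gamma(omega):  # Gamma(Omega) as defined after Eq. (3)
        return (-theta_0 + np.sqrt(rho**2 - 1.0)
                - np.sqrt(rho**2 - 2.0 * rho * np.cos(theta_e - omega) + 1.0))

    if alpha_hat <= -theta_lim:
        f = theta_e - phi_s / 2.0 - alpha_hat + gamma(phi_s / 2.0 - alpha_hat)
    elif alpha_hat <= theta_lim:
        f = -2.0 * alpha_hat
    else:
        f = -theta_e + phi_s / 2.0 - alpha_hat - gamma(phi_s / 2.0 + alpha_hat)
    return tau11_minus_tau22 + (r / c) * f

def alpha_from_delay_difference(tau11_minus_tau22, r=0.0875, c=343.0):
    """Closed-form estimate of Eq. (10); valid when phi_s <= theta_e - theta_0."""
    return c * tau11_minus_tau22 / (2.0 * r)
```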

Thus, in principle, we can directly estimate the current orientation angle α by measuring the delays τ11 and τ22. This is, however, not a simple task, given that the microphone signals contain not only the direct signals but also residual crosstalk signals, which are in general highly correlated. Another approach is to recursively estimate the orientation angle α by minimizing the instantaneous error between ITDd and ITDears. In this study we used a simple sign-error algorithm, i.e.

    \hat{\alpha}(m) = \hat{\alpha}(m - 1) + \beta(m)\, \mathrm{sgn}\big(\mathrm{ITD}_{\mathrm{error}}(m)\big),    (11)

where sgn(·) is the sign function, ITDerror(m) = ITDd(m) − ITDears(m), β(m) is the adaptation step size and m is the time frame index. Note that this is a general approach, which does not assume linear delays of the system and is not constrained by the span angle φs.
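The recursion (11) itself is a one-line update; the sketch below states it explicitly, with illustrative names and the step size supplied by the caller.

```python
import numpy as np

def sign_error_update(alpha_hat_prev, itd_d, itd_ears, beta):
    """One iteration of the sign-error recursion of Eq. (11).

    alpha_hat_prev : previous orientation estimate alpha_hat(m - 1) [rad]
    itd_d, itd_ears: desired and measured ITDs for the current frame [s]
    beta           : adaptation step size beta(m) [rad]
    """
    itd_error = itd_d - itd_ears
    return alpha_hat_prev + beta * np.sign(itd_error)
```

Section 4 additionally makes β(m) proportional to the magnitude of a smoothed ITD error; a sketch of that complete loop is given after (12).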

3.1. ITD estimation

To estimate the orientation angle α̂ using (11), ITDd and ITDears should be accurately calculated. One conventional method to estimate the ITD is the interaural cross-correlation (IACC) method, which estimates the ITD as the time lag at which the cross-correlation between the left- and right-ear signals is maximized. This method is known to be robust to noise, though it over-estimates the ITD values, especially near the interaural axis [10]. Another approach is to calculate the ITD as the slope of the phase difference between the channels at low frequencies. In [11] the regression problem is solved using a weighted least squares (WLS) approach, where the weighting is included to give more emphasis to the cross-spectrum coefficients with high energy.

To select the proper estimation method, we first compared the above-mentioned methods in simulations without head rotations, i.e. for the case α̂ = α = 0. A two-channel CCS was simulated using (1) to model the transfer functions H, and the crosstalk cancellation filters were calculated using the least-squares approximation method proposed in [8]. The ears were assumed to be located at θe = ±100° with respect to the median plane. The loudspeakers spanned 30° and were placed 1.2 m away from the listener. The virtual sound sources were located 0.75 m from the listener and were also simulated using the spherical-head model. The angle of the virtual sources was varied between 0° (median plane, front) and 180° (median plane, back). Fig. 2(a) shows the ITD calculated using (3) (W/S) and the ITD at the ears estimated with the aforementioned methods. Compared to the numerical model, both methods systematically over-estimate the ITD as the angle approaches the interaural axis. Nevertheless, when looking at the error between the estimated ITDd and ITDears, the IACC proves more robust to errors introduced by the CCS than the WLS (see Fig. 2(b)).

Fig. 2. Performance of the ITD estimators for α̂ = α = 0. (a) ITD at the ears as a function of the angle of the virtual source: ITD calculated with (3) (dashed line), estimated using IACC (dash-dotted line) and WLS (solid line). (b) Estimation error between ITDd and ITDears.
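The two estimators compared above can be sketched as follows. This is a simplified illustration, not the exact implementations of [10] and [11]: the IACC variant is a plain cross-correlation peak search, and the WLS variant is an energy-weighted phase-slope fit over an assumed low-frequency band.

```python
import numpy as np

def itd_iacc(left, right, fs, max_lag_s=1e-3):
    """ITD as the lag maximizing the interaural cross-correlation (IACC).
    Positive ITD means the right channel lags the left one."""
    xcorr = np.correlate(left, right, mode='full')
    lags = np.arange(-(len(right) - 1), len(left))   # lag axis of np.correlate
    keep = np.abs(lags) <= int(max_lag_s * fs)       # physically plausible lags
    peak = lags[keep][np.argmax(xcorr[keep])]        # left[n + peak] ~ right[n]
    return -peak / fs                                # flip sign: right-channel delay

def itd_wls(left, right, fs, f_max=1500.0):
    """ITD as the energy-weighted slope of the interaural phase difference
    at low frequencies (simplified WLS phase regression)."""
    L, R = np.fft.rfft(left), np.fft.rfft(right)
    f = np.fft.rfftfreq(len(left), d=1.0 / fs)
    band = (f > 0) & (f < f_max)                     # low-frequency band only
    cross = L[band] * np.conj(R[band])
    phase = np.unwrap(np.angle(cross))               # ~ omega * ITD for a pure delay
    w = np.abs(cross)                                # emphasize high-energy bins
    omega = 2.0 * np.pi * f[band]
    return np.sum(w * omega * phase) / np.sum(w * omega ** 2)
```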

Fig. 3. Simulation results of the proposed head-tracker algorithm (11) using IACC (dash-dotted line) and WLS (solid line), and of the model (10) (dashed line): (a) ITD error as a function of time. (b) Estimated angle as a function of time compared with the original trajectory (thick solid line). The gray areas correspond to |α| ≤ (θe − θ0)/2.

4. SIMULATIONS

To evaluate the proposed system, we simulated the following scenario in MATLAB: a virtual source is reproduced through a two-channel CCS as depicted in Fig. 1. The virtual source consisted of white Gaussian noise with a duration of 32 s, which was convolved with the SHRTFs corresponding to 45° with respect to the median plane at rs = 0.75 m. The sampling frequency was 48 kHz. The loudspeakers were placed symmetrically with respect to the center of the head of the listener at a distance of 1.2 m, and the span angle between the loudspeakers was 30°. The listener rotated his head from right to left at a constant speed of 4 degrees per second.

The adaptation was done on a frame-by-frame basis with a frame size of M = 2048. For each frame, ITDd, ITDears and the delay difference τ11 − τ22 were estimated. The latter was estimated as the time lag at which the cross-correlation between the inverse Fourier transforms of H11 and H22 is maximized.¹ Note that this is equivalent to the IACC approach but using only the ipsilateral impulse responses. The orientation angle α̂(m) was updated using both (10) and (11). During the experiments we found that, when using (11), a better adaptation is obtained with a variable step size that depends on the absolute value of the ITD error, i.e. β(m) = B |ITDerror(m)|, where B is a scaling factor that was set to 3.9 × 10^4. With the newly estimated α̂(m), new transfer functions Ĥ were calculated using (1). These transfer functions were then used to update the crosstalk cancellation filters C, which were calculated with the least-squares approximation method proposed in [8].

Given that in our implementation we were estimating the ITDs on a frame-by-frame basis, we counteracted possible random fluctuations in the ITD estimates by recursively smoothing the ITD error over time,

    \overline{\mathrm{ITD}}_{\mathrm{error}}(m) = \lambda\, \overline{\mathrm{ITD}}_{\mathrm{error}}(m - 1) + (1 - \lambda)\, \mathrm{ITD}_{\mathrm{error}}(m),    (12)

where ITDerror(m) is the instantaneous estimate of the ITD error and λ is a forgetting factor that controls the influence of previously estimated errors. We replaced the instantaneous estimate ITDerror(m) in (11) by the time-averaged \overline{\mathrm{ITD}}_{\mathrm{error}}(m). The forgetting factor λ was set to 0.85.

¹ In this ideal case we assume prior knowledge of the ipsilateral transfer functions H11 and H22.
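The frame-by-frame adaptation described above can be summarized as follows. This is a minimal sketch under the parameter values reported here (M = 2048, B = 3.9 × 10^4, λ = 0.85); the estimator and filter-update callbacks are illustrative hooks, not the paper's MATLAB implementation.

```python
import numpy as np

# Parameter values reported in the paper
M, B, LAM, FS = 2048, 3.9e4, 0.85, 48000

def track_head(frames_d, frames_v, itd_estimator, update_filters, alpha0=0.0):
    """Frame-by-frame head tracking: Eq. (11) with variable step size
    beta(m) = B * |ITD_error(m)| and the smoothed error of Eq. (12).

    frames_d / frames_v : iterables of (left, right) frames of the desired
                          binaural signal and of the microphone signals
    itd_estimator       : e.g. the itd_iacc sketch from Sec. 3.1 above
    update_filters      : callback that rebuilds H_hat via (1) and the CCS
                          filters C for the new angle (illustrative hook)
    """
    alpha_hat, err_smooth = alpha0, 0.0
    trajectory = []
    for (d_l, d_r), (v_l, v_r) in zip(frames_d, frames_v):
        itd_d = itd_estimator(d_l, d_r, FS)
        itd_ears = itd_estimator(v_l, v_r, FS)
        err = itd_d - itd_ears
        err_smooth = LAM * err_smooth + (1.0 - LAM) * err   # Eq. (12)
        beta = B * abs(err_smooth)                          # variable step size
        alpha_hat += beta * np.sign(err_smooth)             # Eq. (11)
        update_filters(alpha_hat)                           # recompute H_hat and C
        trajectory.append(alpha_hat)
    return np.array(trajectory)
```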

Fig. 3(a) shows the ITD error as a function of time for the model (10) and the adaptive algorithm (11) using the aforementioned ITD estimators. The gray areas correspond to angles within the limit in which (10) is valid, i.e. |α| ≤ (θe − θ0)/2. The head tracker based on the model (10) shows in general ITD errors below 20 μs, and below 10 μs in the areas where α is within the model's limits. The sign-error algorithm with the IACC approach shows errors below 10 μs throughout the whole movement and, as expected, proves rather robust to variations in the input signal. The WLS approach, on the other hand, shows rather large estimation errors. In [2] it is argued that ITD errors above 10 μs become audible and can potentially add ambiguous cues to the virtual sound source, even though errors in the range of 20 μs do not have a significant impact on the channel separation. Looking at the estimated angle as a function of time shown in Fig. 3(b), we can observe that the movement of the head is followed by all three approaches. In the case of the algorithm (11) with the WLS approach, the large ITD estimation errors are smoothed due to the forgetting factor introduced in (12), though the angle errors are still in the range of 5°. In contrast, the head tracker with the IACC approach does not show fluctuations throughout the whole movement of the head and its errors are below 1°. The model (10) shows slightly larger angle estimation errors (around 2°) than the IACC method when α is outside the model's limits.

One important measure of the performance of a CCS is the channel separation (CHSP), which is a measure of how much crosstalk leaks into the direct signal [9]. Fig. 4 shows the CHSP as a function of frequency and time. The angle estimation errors observed with the adaptive algorithm using the WLS estimator have a significant impact on the CCS's performance, which could result in audible artifacts, whereas the performance of the head tracker with the IACC estimator shows only a slight degradation in CHSP at the time frames where the direction of the movement changes. The results obtained with the model (10) show some degradation at the time frames where α is outside the limits of the model (e.g. between 5 and 10 seconds), though they are still comparable to those of the head tracker with the IACC estimator.

Fig. 4. Channel separation (CHSP) [9] as a function of frequency at the left ear, evaluated at each time frame. Left panel: algorithm (11) with IACC. Central panel: algorithm (11) with WLS. Right panel: model (10).

5. DISCUSSION

We presented a head-tracker system that makes use of two microphones located close to the ears of the listener to estimate the orientation angle of the head. Instead of estimating the AIRs from the loudspeakers to the ears using the microphone signals, we proposed to estimate the orientation angle based on the ITD difference between the desired binaural signals and the signals captured by the microphones. Based on the spherical-head model and the Woodworth-Schlosberg ray-tracing formula, we derived a relation between the ITD error and the orientation angle. Two different ITD estimation methods were also discussed and the performance of the proposed system was evaluated in a simulated environment.

The direct estimation of the orientation angle based on the delays of the direct signals proved to be a good approximation, even outside the range where the model is valid. This approach is, however, problematic in practice, since we would need to accurately estimate these delays from signals that also contain residual crosstalk. We demonstrated that by minimizing the ITD error recursively instead, we can accurately track head rotations and adapt the crosstalk cancellation filters accordingly, improving the overall performance of the CCS, even though a rather simple recursive algorithm to estimate the orientation angle was used. The estimation of the ITD is, however, critical, and a robust method is required. The IACC method proved rather consistent and robust to variations in the input signal, while the WLS method resulted in estimation errors that degrade the performance of the crosstalk cancellation filters. The natural next step is a systematic evaluation of the proposed system in more adverse conditions, e.g. noisy environments and multiple sources. Additionally, other ITD estimation techniques should be evaluated, as well as adaptation algorithms robust to outliers.

6. REFERENCES

[1] Y. Lacouture Parodi and P. Rubak, "Objective Evaluation of the Sweet Spot Size in Spatial Sound Reproduction Using Elevated Loudspeakers," J. Acoust. Soc. Am., vol. 128, no. 3, pp. 1045-1055, September 2010.
[2] Y. Lacouture-Parodi, A Systematic Study of Binaural Reproduction Systems Through Loudspeaker: A Multiple Stereo-Dipole Approach, Ph.D. thesis, Aalborg University, 2010.
[3] Y. Lacouture Parodi and P. Rubak, "Sweet Spot Size in Virtual Sound Reproduction: A Temporal Analysis," in Principles and Applications of Spatial Hearing, World Scientific, 2011.
[4] M. Karjalainen, M. Tikander, and A. Härmä, "Head-Tracking and Subject Positioning Using Binaural Headset Microphones and Common Modulation Anchor Sources," in IEEE Int. Conf. on Acoust., Speech, and Signal Proc. (ICASSP '04), 2004.
[5] R. O. Duda and W. L. Martens, "Range dependence of the response of a spherical head model," J. Acoust. Soc. Am., pp. 3048-3057, 1998.
[6] J. Blauert, Spatial Hearing, Hirzel-Verlag, 3rd edition, 2001.
[7] O. Kirkeby and P. A. Nelson, "Digital Filter Design for Inversion Problems in Sound Reproduction," J. Audio Eng. Soc., vol. 47, no. 7/8, pp. 583-595, July/August 1999.
[8] O. Kirkeby, P. A. Nelson, H. Hamada, and F. Orduna-Bustamante, "Fast Deconvolution of Multi-Channel Systems Using Regularization," IEEE Trans. on Speech and Audio Proc., vol. 6, no. 2, pp. 189-195, 1998.
[9] Y. Lacouture Parodi and P. Rubak, "Analysis of Design Parameters for Crosstalk Cancellation Filters Applied to Different Loudspeaker Configurations," J. Audio Eng. Soc., vol. 59, no. 5, pp. 304-320, May 2011.
[10] P. Minaar, J. Plogsties, S. Krarup, F. Christensen, and H. Møller, "The Interaural Time Difference in Binaural Synthesis," in 108th AES Convention, February 2000.
[11] C. Tournery and C. Faller, "Improved Time Delay Analysis/Synthesis for Parametric Stereo Audio Coding," in 120th AES Convention, May 2005.