2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 20-23, 2013, New Paltz, NY

SPOTFORMING USING DISTRIBUTED MICROPHONE ARRAYS

Maja Taseska and Emanuël A. P. Habets

International Audio Laboratories Erlangen (a joint institution of the University Erlangen-Nuremberg and Fraunhofer IIS, Germany)
Am Wolfsmantel 33, 91058 Erlangen, Germany
{maja.taseska, emanuel.habets}@audiolabs-erlangen.de

ABSTRACT

Extracting sounds that originate from a specific location, while reducing noise and interferers, is required in many hands-free communication systems. We propose a spotforming approach that uses distributed microphone arrays and aims at extracting sounds that originate from a pre-defined spot of interest (SOI), while reducing background noise and sounds that originate from outside the SOI. The spotformer is realized as a linear spatial filter based on the signal statistics of sounds from the SOI, the signal statistics of sounds from outside the SOI, and the background noise signal statistics. The required signal statistics are estimated from the microphone signals, while taking into account the uncertainty in the location estimates of the desired and interfering sound sources. The applicability of the method is demonstrated by simulations, and the quality of the extracted signal is evaluated in different scenarios.

Index Terms — distributed arrays, PSD matrix estimation, source extraction, spatial filtering

1. INTRODUCTION

Acquisition and enhancement of desired sounds while reducing background noise and interferers is required in many modern hands-free communication systems. Many of the spatial filtering methods developed for this task are based on beamforming, where the signals from closely spaced microphones are linearly combined to enhance sounds from certain directions [1]. Beamformers can be fixed, to achieve a desired spatial selectivity, or signal-dependent, based on the statistics of the desired and undesired signals. The authors in [2] use a fixed beamformer and a postfilter to extract signals from the endfire direction of a microphone array used in a headset. In [3], a signal-dependent beamformer is proposed that provides an arbitrary spatial response using multiple instantaneous direction of arrival (DOA) estimates, while minimizing the noise and diffuse power at the output. Signal-dependent beamformers usually achieve better noise and interference reduction than fixed beamformers; however, the desired and undesired signal statistics need to be estimated from the microphone signals. For further signal-dependent beamforming methods, the reader is referred to the references in [1, 3].

In scenarios where distributed microphone arrays are applicable, spatial selectivity can be specified in terms of positions instead of DOAs. Recent work with distributed arrays is presented in [4], where sources at certain locations are extracted by computing the signal of a virtual microphone. This method does not explicitly aim at signal enhancement, and additive noise was not considered.


Moreover, it is a parametric approach, in which the output signal is obtained by applying a spectral gain to a reference microphone signal.

In this work, we propose a signal-dependent spatial filtering approach in which the signals from all microphones of the distributed arrays are linearly combined to extract sounds originating from a pre-defined spot of interest (SOI), while reducing noise and interference. We refer to this kind of spatial filtering as spotforming. The approach is based on the power spectral density (PSD) matrices corresponding to signals that originate inside the SOI, signals that originate outside the SOI, and the background noise. The PSD matrices are estimated using position-based probabilities, in a framework similar to [7, 8]. In this manner, the uncertainty about the positions of active sources inside and outside the SOI is taken into account in the spotformer design.

The paper is organized as follows: in Section 2, the signal model and the spotforming problem are formulated. The computation of the spotforming spatial filter coefficients and the estimation of the PSD matrices are discussed in Section 3. The computation of the speech presence probability (SPP) and the spot probability needed to estimate the PSD matrices is detailed in Section 4. In Section 5, the proposed method is evaluated, and Section 6 concludes the paper.

2. PROBLEM FORMULATION

We consider scenarios where M microphones from at least two distributed arrays are used to capture the sound field. We assume that the sound field is composed of a direct component corresponding to one or more speech sources and an additive component corresponding to background noise and late reverberation. By defining an SOI S, the direct component can be further decomposed into a component that corresponds to sources inside the SOI and a component that corresponds to sources outside the SOI. Therefore, the signal at the m-th microphone can be written in the short-time Fourier transform (STFT) domain as follows:

  Y_m(n,k) = X_{m,i}(n,k) + X_{m,o}(n,k) + V_m(n,k)
           = ∫_{r∈S} g_{m,r}(k) S_r(n,k) dr + ∫_{r∉S} g_{m,r}(k) S_r(n,k) dr + V_m(n,k),   (1)

where X_{m,i}, X_{m,o} and V_m denote the STFT coefficients of the signal component inside the SOI, the signal component outside the SOI, and the background noise, respectively. The time and frequency indices are denoted by n and k, respectively, g_{m,r} denotes the m-th element of the array propagation vector g_r corresponding to a position r, and S_r denotes the STFT coefficients of a signal originating from position r. In the following, the time and frequency indices are omitted when possible. We introduce the vector notation y = [Y_1 ... Y_M]^T and define the PSD matrix of y as Φ_y(n) = E{y y^H}, where (·)^H denotes the conjugate transpose of a vector or a matrix. The vectors x_i, x_o and v and their respective PSD matrices Φ_i, Φ_o and Φ_v are defined similarly.


The signals X_{m,i}(n,k), X_{m,o}(n,k) and V_m(n,k) are mutually uncorrelated, zero-mean random processes, such that

  Φ_y(n,k) = Φ_i(n,k) + Φ_o(n,k) + Φ_v(n,k).   (2)

The aim in this paper is formulated as follows: given an arbitrary SOI S, compute a filter h that provides an estimate of the direct signal originating from S at a reference microphone m, while reducing noise and direct signals originating from outside S, i.e.,

  X̂_{m,i}(n,k) = h_m^H(n,k) y(n,k).   (3)

In the following, we refer to the filtering in (3) as spotforming.

3. SPOTFORMING

Given the PSD matrices of the direct signal inside S, the direct signal outside S, and the noise, a spotformer h can be computed which preserves direct signals originating from S, while minimizing the power of the noise signal and of direct signals originating from outside S, according to

  h_m = [ (Φ_o + Φ_v)^{-1} Φ_i ] e_m / ( 1 + tr{ (Φ_o + Φ_v)^{-1} Φ_i } ),   (4)

where tr{·} denotes the trace of a matrix and e_m is given by

  e_m = [0 ... 0 1 0 ... 0]^T,   (5)

with m−1 zeros before and M−m zeros after the single one.
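The mapping from the PSD matrices to the filter coefficients in (4) is a direct per-bin matrix computation. The following minimal numpy sketch illustrates it; the function name and the use of a linear solve instead of an explicit inverse are our choices, not part of the paper.

```python
import numpy as np

def spotformer_weights(phi_i, phi_o, phi_v, m):
    """Spotforming filter of Eq. (4) for one time-frequency bin.

    phi_i, phi_o, phi_v: (M, M) PSD matrices of the spot signal, the
    off-spot signal, and the noise. m: reference microphone (0-based).
    """
    phi_u = phi_o + phi_v              # PSD matrix of all undesired signals
    A = np.linalg.solve(phi_u, phi_i)  # (Phi_o + Phi_v)^{-1} Phi_i
    # Multiplying by e_m selects the m-th column of A; the trace is
    # real-valued in theory, so its real part is taken for robustness.
    return A[:, m] / (1.0 + np.real(np.trace(A)))

# Applying the filter to the stacked microphone vector y of one bin:
# x_hat_mi = np.vdot(h, y)   # h^H y, cf. Eq. (3)
```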

Note that in the case when a single source is located in S, such that Φ_i is of rank one, the filter given by (4) represents a minimum variance distortionless response (MVDR) filter [1] for S. Alternatively, a linearly constrained minimum variance (LCMV) filter [1] can be employed to extract multiple sources located in S. The reference microphone m is chosen from the M microphones of the distributed arrays such that its distance to the centroid of S is minimal.

The PSD matrices required in (4) are usually not available and need to be estimated from the microphone signals. In the proposed PSD matrix estimation framework, we use the following hypotheses:

  H_i: indicates the presence of a direct signal inside S,   (6a)
  H_o: indicates the presence of a direct signal outside S,   (6b)
  H_v: y = v, indicates the absence of a direct signal,   (6c)
  H_x: indicates the presence of a direct signal (H_i ∪ H_o = H_x).   (6d)

3.1. Estimating the noise PSD matrix

The noise PSD matrix is approximated by a recursive temporal average, such that for each time-frequency (TF) bin, an estimate is obtained as a weighted sum of the instantaneous PSD matrix of the current frame and the estimate from the previous frame as follows:

  Φ̂_v(n) = α_v(n) Φ̂_v(n−1) + [1 − α_v(n)] y(n) y^H(n).   (7)

Note that the parameter α_v is time- and frequency-dependent. To avoid leakage of the speech signal into the noise PSD matrix estimate, the value of α_v(n,k) needs to accurately represent the certainty that a direct signal component (speech) is absent in TF bin (n,k). In state-of-the-art approaches, α_v(n,k) is computed using the posterior SPP [6, 7], denoted by p[H_x | y(n)], such that for a chosen constant α̃_v ∈ [0, 1],

  α_v(n) = p[H_x | y(n)] + α̃_v (1 − p[H_x | y(n)]).   (8)
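As a sketch, the update (7)-(8) for one frequency bin can be written as follows; the function and argument names are ours, and the constant α̃_v = 0.9 used later in Section 5 is passed in as a default for illustration only.

```python
import numpy as np

def update_noise_psd(phi_v_prev, y, p_spp, alpha_tilde_v=0.9):
    """SPP-driven recursive noise PSD update, Eqs. (7)-(8), for one bin.

    phi_v_prev: (M, M) noise PSD estimate of the previous frame.
    y:          (M,) current microphone signal vector.
    p_spp:      posterior speech presence probability p[Hx | y(n)].
    """
    # Eq. (8): alpha_v -> 1 when speech is likely present (estimate frozen),
    # alpha_v -> alpha_tilde_v when speech is likely absent.
    alpha_v = p_spp + alpha_tilde_v * (1.0 - p_spp)
    # Eq. (7): recursive average with the instantaneous PSD y y^H.
    return alpha_v * phi_v_prev + (1.0 - alpha_v) * np.outer(y, y.conj())
```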


parameter αi . Φi (n, k) is updated based on a posterior spot probability p[Hi | y(n)], such that αi is computed as follows αi (n) = 1 − p[Hi | y(n)] + α ˜ i p[Hi | y(n)], α ˜ i ∈ [0, 1]. (9) Due to the constant presence of background noise, the spot PSD matrix is obtained in two steps as follows b i,v (n) = αi (n) Φ b i,v (n − 1) + [1 − αi (n)] y(n)y H (n), (10a) Φ b i (n) = Φ b i,v (n) − Φ b v (n). Φ

(10b)
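A minimal sketch of the two-step estimate follows, including the probability threshold p_th introduced below; the threshold value is illustrative, since the paper does not state the one used, and the handling of possibly indefinite results of the subtraction (10b) is omitted.

```python
import numpy as np

def update_spot_psd(phi_iv_prev, phi_v, y, p_spot,
                    alpha_tilde_i=0.85, p_th=0.6):
    """Two-step spot PSD estimate, Eqs. (9)-(10), for one frequency bin.

    phi_iv_prev: (M, M) previous estimate of Phi_{i,v} (spot signal + noise).
    phi_v:       (M, M) current noise PSD estimate.
    y:           (M,) microphone signal vector.
    p_spot:      posterior spot probability p[Hi | y(n)].
    p_th:        update threshold (illustrative value).
    """
    if p_spot > p_th:
        # Eq. (9): a high spot probability yields a small alpha_i (fast update).
        alpha_i = (1.0 - p_spot) + alpha_tilde_i * p_spot
        # Eq. (10a): recursive average of the noisy spot PSD.
        phi_iv = alpha_i * phi_iv_prev + (1.0 - alpha_i) * np.outer(y, y.conj())
    else:
        phi_iv = phi_iv_prev  # the update (10a) is skipped below the threshold
    phi_i = phi_iv - phi_v    # Eq. (10b): remove the noise contribution
    return phi_iv, phi_i
```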

In order to avoid leakage of direct signals from outside the spot into the spot PSD matrix due to erroneous spot probability estimates, an additional threshold is incorporated, such that the update (10a) takes place only if p[H_i | y(n)] > p_th. The off-spot PSD matrix can be estimated analogously using the off-spot probability. However, experimental results showed that estimating Φ_o with a fixed parameter α_o, rather than a probability-dependent parameter, leads to a more stable spotformer and better performance. Therefore, we compute the update parameter α_o as follows:

  α_o(n) = I_{p[H_i|y] > p_th} + α̃_o (1 − I_{p[H_i|y] > p_th}),   (11)

where I_{p[H_i|y] > p_th} = 1 if p[H_i | y] > p_th and 0 otherwise. In this manner, it is ensured that updates of the spot and off-spot PSD matrices never take place simultaneously, which would lead to distortion of the output signal. Similarly to (10a) and (10b), the off-spot PSD matrix is estimated in two steps.

4. SPOT AND SPEECH PRESENCE PROBABILITIES

In this section, the computation of the SPP p[H_x | y] and the spot probability p[H_i | y] required for the PSD matrix estimation is discussed. The spot probability can be decomposed as follows:

  p[H_i | y(n,k)] = p[H_i, H_x | y(n,k)] = p[H_i | H_x, y(n,k)] · p[H_x | y(n,k)].   (12)

The factor p[H_x | y(n,k)] represents the posterior SPP, and the factor p[H_i | H_x, y(n,k)] represents the posterior probability that a detected direct component originates from a source located inside the spot. We refer to this probability as the conditional spot probability (conditioned on the presence of a direct signal). As in [8], we propose to approximate the conditional spot probability using bin-wise position estimates r̂, i.e.,

  p[H_i | H_x, y(n,k)] ≈ p[H_i | H_x, r̂(n,k)].   (13)

The position-based approximation is valid under the assumption that speech signals are sparse in the STFT domain [5] and that the reverberation level is sufficiently low. The signal model in (1) can then be reformulated as a single-wave model, where the main energy contribution of the direct signal component in a TF bin (n,k) corresponds to a single wave originating from a position r, i.e.,

  y(n,k) = g_r(k) S_r(n,k) + v(n,k).   (14)

4.1. Speech presence probability

If the STFT coefficients of the speech and the noise signals are modeled as complex Gaussian vectors [9], the SPP is given by

  p[H_x | y(n)] = ( 1 + [q(n) / (1 − q(n))] [1 + ξ(n)] e^{−β(n)/(1+ξ(n))} )^{−1},   (15)

where q(n) denotes the a priori speech absence probability (SAP) and

  ξ(n) = tr{ Φ_v^{−1}(n) [Φ_i(n) + Φ_o(n)] },   (16)
  β(n) = y^H(n) Φ_v^{−1}(n) [Φ_i(n) + Φ_o(n)] Φ_v^{−1}(n) y(n).   (17)


The sum Φ_i(n) + Φ_o(n) represents the PSD matrix of the speech signals (inside and outside the SOI). In order to detect all coherent signal components as speech, we use the direct-to-diffuse ratio (DDR)-based a priori SAP estimator proposed by the present authors in [10]. Note that in order to compute the SPP in the current time frame by (15), the noise PSD matrix Φ̂_v(n−1) of the previous time frame is used, whereas the PSD matrix Φ̂_v(n) is updated after computing α_v(n) in (8). The matrix sum Φ_i(n) + Φ_o(n) is estimated as Φ̂_y(n) − Φ̂_v(n−1), where Φ̂_y(n) is an estimate of Φ_y(n).
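The following sketch evaluates (15)-(17) for one TF bin under the complex Gaussian model of [9]; passing the previous-frame noise PSD estimate and Φ̂_y(n) − Φ̂_v(n−1) for the speech PSD, as described above, is left to the caller. The names are ours.

```python
import numpy as np

def speech_presence_probability(y, phi_v, phi_x, q):
    """Posterior SPP of Eq. (15) for one time-frequency bin.

    y:     (M,) microphone signal vector.
    phi_v: (M, M) noise PSD matrix (previous-frame estimate).
    phi_x: (M, M) speech PSD matrix Phi_i + Phi_o, e.g. estimated
           as Phi_y(n) - Phi_v(n-1).
    q:     a priori speech absence probability.
    """
    phi_v_inv = np.linalg.inv(phi_v)
    xi = np.real(np.trace(phi_v_inv @ phi_x))                      # Eq. (16)
    beta = np.real(np.vdot(y, phi_v_inv @ phi_x @ phi_v_inv @ y))  # Eq. (17)
    # Eq. (15): generalized likelihood ratio form of the posterior SPP.
    return 1.0 / (1.0 + (q / (1.0 - q)) * (1.0 + xi)
                  * np.exp(-beta / (1.0 + xi)))
```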

4.2. Conditional spot probability

If r denotes the true position of a source, the position-based approximation of the conditional spot probability in (13) implies that

  p[H_i | H_x, r̂] = p[r ∈ S | H_x, r̂],   (18)

which can be further expressed as follows:

  p[r ∈ S | H_x, r̂] = ∫_{r∈S} f(r | H_x, r̂) dr   (19a)
                     = ∫_{r∈S} f(r̂ | H_x, r) · [f(r) / f(r̂ | H_x)] dr,   (19b)

where f(·) denotes a probability density function (PDF), and (19b) follows from (19a) using Bayes' theorem. If the PDFs are known, the integral can be approximated by sampling S, such that

  ∫_{r∈S} f(r̂ | H_x, r) · [f(r) / f(r̂ | H_x)] dr ≈ Σ_{i=1}^{N} f(r̂ | H_x, r_i) · [f(r_i) / f(r̂ | H_x)],

where N denotes the number of sampled positions in S. The PDFs f(r), f(r̂ | H_x) and f(r̂ | H_x, r_i) usually depend on the array configuration, the positions of the sources, and the noise and reverberation levels. The PDF of the true source positions f(r) can be based on prior knowledge about the possible source locations in the room. In order to obtain the conditional PDF f(r̂ | H_x, r_i), a source is placed at r_i and the bin-wise position estimates are observed during a certain training time interval. An analytic function can then be fitted to the observations for each r_i, i = 1, ..., N. Finally, the PDF f(r̂ | H_x) is obtained by marginalization as follows:

  f(r̂ | H_x) = Σ_{i=1}^{N} f(r̂, r_i | H_x) = Σ_{i=1}^{N} f(r̂ | H_x, r_i) · f(r_i).   (20)

To avoid the training process, we modelled f(r̂ | H_x, r_i) directly by a symmetric two-dimensional Gaussian distribution with mean r_i and variance σ²I. The variance σ² usually depends on the noise and reverberation levels and can be determined empirically.
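A sketch of the sampled computation of (19b) with this Gaussian model and the marginalization (20) is given below. We assume here that the candidate positions r_i cover the whole region where sources may be located, with a boolean mask marking those inside S (otherwise the ratio in (19b) would trivially equal one); this reading, and all names, are ours.

```python
import numpy as np

def gauss2d(r_hat, r, sigma2):
    """Symmetric 2-D Gaussian f(r_hat | Hx, r) with mean r, covariance sigma2*I."""
    d2 = np.sum((np.asarray(r_hat) - r) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma2)) / (2.0 * np.pi * sigma2)

def conditional_spot_probability(r_hat, r_candidates, in_spot,
                                 sigma2=0.1, prior=None):
    """Sampled approximation of p[Hi | Hx, r_hat], Eqs. (19b) and (20).

    r_hat:        (2,) bin-wise position estimate.
    r_candidates: (N, 2) sampled candidate source positions r_i.
    in_spot:      (N,) boolean mask, True where r_i lies inside the SOI S.
    sigma2:       variance of the Gaussian model (0.1 is used in Sec. 5).
    prior:        (N,) prior values f(r_i); uniform if None.
    """
    n = len(r_candidates)
    prior = np.full(n, 1.0 / n) if prior is None else prior
    lik = gauss2d(r_hat, r_candidates, sigma2)   # f(r_hat | Hx, r_i)
    evidence = np.sum(lik * prior)               # Eq. (20): f(r_hat | Hx)
    return np.sum(lik[in_spot] * prior[in_spot]) / evidence

# Eq. (12): the posterior spot probability is then the product
# p[Hi | y] = conditional_spot_probability(...) * speech_presence_probability(...)
```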

[Figure 1: Scenario A (left) and Scenario B (right). The circles represent the borders of different SOIs (S1, S2, S3); the talker positions and microphone arrays are indicated, with axes in metres.]

5. PERFORMANCE EVALUATION

5.1. Experimental setup and performance measures

The proposed spotformer was evaluated in two simulated scenarios (see Figure 1). In Scenario A, the goal is to enhance the signal of a single talker in the presence of background noise by choosing an SOI that contains the talker. In Scenario B, two talkers and background noise are present, where the signals of the two talkers are of approximately equal power. Three different SOIs were defined: one for each talker, and one containing none of the talkers. The effect of changing the size of the SOI was investigated. The following objective performance measures were considered: the segmental speech distortion index ν_sd [1], the segmental output-to-input noise ratio Δ_v, and the segmental output-to-input off-spot signal ratio Δ_o (computed over time segments of 30 ms). The ratios Δ_v and Δ_o were obtained by averaging the segment-wise values in the log domain, as in the sketch below.
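A plausible implementation of such a segmental log-domain ratio is sketched next; the exact definition (e.g., whether silent segments are excluded) is not specified in the paper, so this minimal version is an assumption.

```python
import numpy as np

def segmental_ratio_db(out_component, in_component,
                       fs=16000, seg_ms=30.0, eps=1e-12):
    """Segmental output-to-input power ratio in dB (log-domain average).

    Used in the spirit of Delta_v / Delta_o: in_component is the noise (or
    off-spot) component at the reference microphone, out_component the same
    component after spotforming. Negative values indicate reduction.
    """
    seg = int(fs * seg_ms / 1000.0)
    n_seg = min(len(out_component), len(in_component)) // seg
    ratios = [
        10.0 * np.log10(
            (np.mean(out_component[s * seg:(s + 1) * seg] ** 2) + eps)
            / (np.mean(in_component[s * seg:(s + 1) * seg] ** 2) + eps))
        for s in range(n_seg)
    ]
    return float(np.mean(ratios))
```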

The speech distortion index ν_sd was computed by comparing the signal of the talker inside the SOI as received at the reference microphone with the same signal after applying the spotformer. Note that this reference signal contains reverberant parts that do not correspond to a direct signal from the SOI; therefore, high values of ν_sd do not necessarily indicate low quality of the spot signal. The motivation for computing ν_sd in this manner was to have a fixed reference signal for different sizes of the SOI, so that the values of ν_sd can be compared. It is expected that ν_sd will decrease with increasing SOI size, as more of the reverberant signal is captured in the SOI.

The length of the speech signal in Scenario A was 22 s. The signal in Scenario B contained 6 s where only the first talker is active, 14 s where both talkers are active simultaneously, and 6 s where only the second talker is active. The microphone signals were obtained by convolving clean speech signals with simulated room impulse responses [11], adding uncorrelated sensor noise with a segmental speech-to-noise ratio of 50 dB, and adding diffuse babble noise [12] with a segmental speech-to-noise ratio of 4 dB in Scenario A and 10 dB in Scenario B (computed only during double talk). The method was evaluated for reverberation times T60 of 0.2 s and 0.4 s. The sampling frequency was 16 kHz and the STFT frame length was 64 ms (50% overlap). Two uniform circular arrays were used, each with three omnidirectional microphones and a diameter of 2.5 cm, with an array spacing of 1.5 m. The parameters in (8), (9) and (11) were chosen as α̃_v = 0.9, α̃_i = 0.85 and α̃_o = 0.9, and the variance of the Gaussians modelling the PDFs f(r̂ | H_x, r_i) was σ² = 0.1. The bin-wise position estimates were computed by triangulating instantaneous DOA vectors from the two arrays, as done in [8].

5.2. Results

The evaluation results are summarized in Tables 1 and 2. In Scenario A, the radius of the SOI was varied in the range 0.2–0.8 m, whereas in Scenario B, two radii were evaluated for each SOI centre (see Figure 1). Note that the higher values of ν_sd for T60 = 0.4 s are partly due to estimation errors, but also arise because the reference signal is more reverberant, so the power contribution that corresponds to a direct signal in the SOI decreases. In both scenarios, significant noise reduction and off-spot signal reduction, given by Δ_v and Δ_o, are achieved. In Scenario B, the amount of off-spot signal reduction is quite different for the two talkers, due to the inaccuracy of modelling f(r̂ | H_x, r) as a fixed Gaussian PDF, while in general it depends on r. As expected, the values of ν_sd decrease with increasing SOI radius; however, the amount of noise and off-spot signal reduction generally decreases as well. Example spectrograms are given in Figures 2 and 3; Figure 3 shows a segment where only the first talker is active, a segment during double talk, and a segment where only the second talker is active.

[Figure 2: Spectrograms for Scenario A (T60 = 0.4 s). Left to right: mixture, clean speech, S with radius 0.2 m, S with radius 0.8 m. Axes: time [s] vs. frequency [kHz].]

[Figure 3: Spectrograms for Scenario B (T60 = 0.4 s). Left to right: mixture, S1, S2, S3 (see Figure 1). Axes: time [s] vs. frequency [kHz].]

SOI radius       0.2 m   0.3 m   0.4 m   0.5 m   0.6 m   0.7 m   0.8 m
T60 = 0.2 s
  ν_sd            0.12    0.11    0.09    0.08    0.08    0.07    0.07
  Δ_v (dB)       -17.3   -16.8   -16.2   -15.8   -15.6   -15.2   -15.0
T60 = 0.4 s
  ν_sd            0.34    0.31    0.29    0.27    0.25    0.24    0.23
  Δ_v (dB)       -19.6   -18.6   -18.1   -17.0   -16.6   -16.3   -16.1

Table 1: Performance evaluation for Scenario A. T60 = 0.2 s (top) and T60 = 0.4 s (bottom).

                    S1              S2              S3
SOI radius       0.2 m   0.6 m   0.2 m   0.6 m   0.2 m   0.6 m
T60 = 0.2 s
  ν_sd            0.10    0.07    0.12    0.07      –       –
  Δ_v (dB)       -16.5   -17.4   -18.9   -17.1   -26.0   -19.7
  Δ_o (dB)       -10.6    -9.4   -21.9   -19.1   -39.6   -33.0
T60 = 0.4 s
  ν_sd            0.28    0.26    0.37    0.30      –       –
  Δ_v (dB)       -12.1   -13.9   -17.0   -11.3   -26.2   -21.5
  Δ_o (dB)       -15.9   -15.0   -20.6   -14.1   -32.1   -28.1

Table 2: Performance evaluation for Scenario B. T60 = 0.2 s (top) and T60 = 0.4 s (bottom). ν_sd is not reported for S3, which contains no talker.

6. CONCLUSIONS

A spotforming method was proposed that preserves sounds originating from a given spot of interest while reducing background noise and interferers. The coefficients of the spotformer were obtained by exploiting spot and off-spot probabilities based on instantaneous bin-wise position estimates, together with a speech presence probability. Simulations demonstrated that sounds originating from the SOI can be extracted with low distortion, while the noise and the interferers outside the spot are significantly reduced. Future work includes evaluating the method with measured data, improving the spatial filter, and evaluating the possible performance gain of a training-based approach.

7. REFERENCES

[1] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Berlin, Germany: Springer-Verlag, 2008.

[2] I. Tashev, M. Seltzer, and A. Acero, "Microphone array for headset with spatial noise suppressor," in Proc. IWAENC, Eindhoven, The Netherlands, 2005.
[3] O. Thiergart and E. A. P. Habets, "An informed LCMV filter based on multiple instantaneous direction-of-arrival estimates," in Proc. IEEE ICASSP, May 2013.
[4] G. Del Galdo, O. Thiergart, T. Weller, and E. A. P. Habets, "Generating virtual microphone signals using geometrical information gathered by distributed arrays," in Proc. HSCMA, Edinburgh, United Kingdom, May 2011.
[5] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, pp. 1830–1847, 2004.
[6] T. Gerkmann and R. C. Hendriks, "Noise power estimation based on the probability of speech presence," in Proc. IEEE WASPAA, New Paltz, NY, 2011.
[7] M. Souden, S. Araki, K. Kinoshita, T. Nakatani, and H. Sawada, "A multichannel MMSE-based framework for joint blind source separation and noise reduction," in Proc. IEEE ICASSP, 2012.
[8] M. Taseska and E. A. P. Habets, "MMSE-based source extraction using position-based posterior probabilities," in Proc. IEEE ICASSP, 2013.
[9] M. Souden, J. Chen, J. Benesty, and S. Affes, "Gaussian model-based multichannel speech presence probability," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 1072–1077, July 2010.
[10] M. Taseska and E. A. P. Habets, "MMSE-based blind source extraction in diffuse noise fields using a complex coherence-based a priori SAP estimator," in Proc. IWAENC, Sept. 2012.
[11] E. A. P. Habets, "Room impulse response generator," Technische Universiteit Eindhoven, Tech. Rep., 2006.
[12] E. A. P. Habets, I. Cohen, and S. Gannot, "Generating nonstationary multisensor signals under a spatial coherence constraint," J. Acoust. Soc. Am., vol. 124, no. 5, pp. 2911–2917, Nov. 2008.