Broadband DOA Estimation using Convolutional Neural Networks

Broadband DOA Estimation using Convolutional Neural Networks Trained with Noise Signals Soumitro Chakrabarty and Emanuël Habets WASPAA 2017

Motivation Signal processing methods §  Cross-correlation-based methods •  GCC-PHAT •  SRP-PHAT •  MCCC…

Challenges - 

Performance degradation in presence of noise and reverberation

- 

High computational cost

§  Subspace-based methods •  MUSIC... §  Model-based methods •  Maximum-likelihood estimation... §  ...

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 2

Motivation Supervised learning methods §  Advantage: Supervised learning methods can be adapted to

different acoustic environments §  Recently, deep neural network (DNN) based supervised learning

methods have been successful across a range of applications: §  §  § 

Automatic speech recognition Object recognition in images Machine translation….

§  Few DNN based methods that estimate DOA of a sound source from

the observed signals

[1] Xiao et al. "A learning-based approach to direction of arrival estimation in noisy and reverberant environments,“ 2015 [2] Takeda et al. "Sound source localization based on deep neural networks with directional activate function exploiting phase information“ 2016 © AudioLabs, 2017 Soumitro Chakrabarty


Motivation DNN based DOA estimation

Aim: A DNN based supervised learning method for DOA estimation that §  Estimates DOA per time frame given the STFT representation of the

observed signals §  Simple input representation to learn relevant features during training



Problem Formulation DOA estimation as classification Θ = {θ1 , . . . , θI } ... I classes ...

i ≈ θi ∈ Θ = {θ1 , . . . , θI }

§  DOA estimation is formulated as an I class classification problem §  Discretize the whole DOA range into I discrete values to obtain a set

of possible DOA values: ⇥ = {✓1 , . . . , ✓I } §  Each class corresponds to a possible DOA value in the set © AudioLabs, 2017 Soumitro Chakrabarty


Problem Formulation DOA estimation as classification Θ = {θ1 , . . . , θI } ... I classes ...

i ≈ θi ∈ Θ = {θ1 , . . . , θI }

§  For each frame, compute the posterior probability for each class §  DOA estimate is the DOA of the class with the highest posterior



System Overview Supervised learning framework True DOA Labels

Training data

STFT

Input feature

Train DOA classifier

Training Inference/Test

Trained parameters

Test data

STFT

Input feature

DOA classifier

Posterior probabilities

DOA estimate © AudioLabs, 2017 Soumitro Chakrabarty



Training data

STFT

Input feature



Trained parameters

Test data

STFT

Input feature

DOA classifier




Input feature representation STFT magnitude and phase component

icr

op

ho

ne

s

K − Frequency Bins

m (n,k)

Magnitude


Phase

Am (n, k)


m (n, k)

− M

−

N − Time frames

M

N − Time frames

M

M

icr

op

ho

ne

s


Ym (n, k) = Am (n, k)ej

Input feature representation Phase map


for each time frame

n

Soumitro Chakrabarty

s ne ho op −

M

icr

Phase

M

M

n

N − Time frames

© AudioLabs, 2017

K

m (n, k)

Phase Map


n


Training data

STFT

Input feature

Φn



Trained parameters

Test data

STFT

Input feature

Φn

DOA classifier




Total Feature Maps: 64 Size: (M − 2) × (K − 2) Total Feature Maps: 64 Size: (M − 3) × (K − 3)

Conv1: 2 × 2, 64

Conv2: 2 × 2, 64

Conv3: 2 × 2, 64

§  Pooling is not employed §  Experiments showed decreased performance



FC2: 512

Total Feature Maps: 64 Size: (M − 1) × (K − 1)

FC1: 512

Input: M × K

Output: I × 1

CNN Architecture

Training with synthesized noise signals

§  CNN learns the required information for DOA estimation from the

phase map §  CNN can be trained using synthesized noise signals (!?)

§  Advantages: 1. 

No speech/audio database required

2. 

Easier to create training data

§  Spectrally white noise was used



System overview True DOA Labels

Training data (Noise signals)

STFT

Input feature

Φn

Train CNN based DOA classifier

Training Inference/Test Test data (Speech signals)

Trained parameters

STFT

Input feature

Φn

CNN based DOA classifier

p(θi |Φn )

DOA estimate θˆn = arg max p(θi |Φn ) θi



Evaluation Experiments 1. 

Generalization to speech and robustness to noise

2. 

Performance in acoustic conditions different from training

3. 

Robustness to small perturbations in microphone positions

4. 

Real acoustic environments



Evaluation Experiments 1. 

Generalization to speech and robustness to noise

2. 

Performance in acoustic conditions different from training

3. 

Robustness to small perturbations in microphone positions

4. 

Real acoustic environments



Evaluation Performance measure § 

Performance of CNN compared to SRP-PHAT

§ 

Evaluation measure § 

Frame-level accuracy ˆ N A(%) = c ⇥ 100, Ns Ns ˆc N


Total time frames with speech active Time frames with correct estimate


Evaluation Experimental parameters § 

Uniform linear array (ULA) §  Number of microphones = 4 §  Inter-microphone distance = 3 cm

§ 

STFT length 256, 50% overlap

§ 

Resolution for classes: 5 degrees, I = 37 classes

§ 

Training data is simulated using RIR generator [3]

[3] https://github.com/ehabets/RIR-Generator © AudioLabs, 2017 Soumitro Chakrabarty


Evaluation CNN training conditions

Room with training positions Simulated training data Signal Synthesized noise signals Room size R1: (6 ⇥ 6) m , R2: (5 ⇥ 5) m Array positions in room 7 di↵erent positions in each room Source-array distance 1 m and 2 m for position RT60 R1: 0.3 s, R2: 0.2 s SNR Uniformly sampled from 0 to 20 dB © AudioLabs, 2017 Soumitro Chakrabarty


Evaluation CNN training parameters §  Training data: 5.6 million time frames §  Validation data: 20% split from the training data §  Loss: Cross-entropy §  Activation: ReLU, Softmax (final layer) §  Optimizer: Adam §  Batchsize: 512 §  Nb Epochs: 10 §  Regularization: Dropout rate 0.5 (After Conv.3 layer, and each FC layer)



Evaluation Test conditions §  Database: TIMIT test §  Speech samples: 500, 4 s each §  Test data size: 100000 active time frames

Simulated test data Signal Speech signals from TIMIT Room size Room 1: (7 ⇥ 6) m , Room 2: (8 ⇥ 8) m Array positions in room 1 random position for each room Source-array distance 1.5 m for both rooms RT60 Room 1: 0.45 s, Room 2: 0.53 s SNR 2 categories: 5 dB, and 15 dB © AudioLabs, 2017 Soumitro Chakrabarty


Evaluation Test conditions Frame-level accuracies (%)

CNN SRP-PHAT

Room 1 5 dB 15 dB 56.2 69.8 22.6 33.6

Room 2 5 dB 15 dB 54.1 68.2 21.8 38.4

§  Better performance compared to SRP-PHAT §  For all cases, CNN is accurate for the majority of the frames



Evaluation Qualitative results - Room 2, 15 dB D OA ( ◦ )

1 150 100

0.5

50 10

20

30

40

50

60

Tim e fr am e s

70

80

90

100

CNN

0

D OA ( ◦ )

1 150 100

0.5

50 10

20

30

40

50

60

Tim e fr am e s

70

80

90

100

SRP-PHAT

0

D OA ( ◦ )

1 150 100

0.5

50 10

Fr e que nc y bins

True DOA

20

30

40

50

60

Tim e fr am e s

70

80

90

100

0

120 100 80 60 40 20

−10 −20 −30 10


0

20

30

40

50

60

Tim e fr am e s

70

80

90


100

Speech segment

Evaluation Qualitative results – Average over 0.8 s

CNN

Pr obability

1

SRP−PHAT

True DOA

0.8 0.6 0.4 0.2 0

0

20

40

60

80

100

120

D OA



140

160

180

Evaluation Real environment – Measured RIRs §  Database: Multichannel Impulse Response Database from Bar-Ilan §  Source setup: [0 , 180 ] , steps of 15 degrees §  Array configuration: M = 8 microphones, d = 8 cm §  Source-array distances: 1 m and 2 m §  Test sample: 15 s long speech sample

15 resolution {1, 2} m



Evaluation Real environment – Measured RIRs §  Database: Multichannel Impulse Response Database from Bar-Ilan §  Source setup: [0 , 180 ] , steps of 15 degrees §  Array configuration: [8,8,8,8,8,8,8], M = 8 microphones §  Source-array distances: 1 m and 2 m §  Test sample: 15 s long speech sample §  CNN was retrained for the new array geometry (with simulated data)

6m

6m


{1,2} m


Evaluation Real environment – Measured RIRs Frame-level accuracies (%)

CNN SRP-PHAT

RT60 = 0.160 s 1m 2m 91.8 88.7 94.4 69.0

RT60 = 0.360 s 1m 2m 86.8 79.4 87.1 68.3

RT60 = 0.610 s 1m 2m 72.3 67.3 71.7 62.4

§ 

For 2 m distance, CNN outperforms SRP-PHAT

§ 

SRP-PHAT performs better at lower reverberation times for 1 m source-array distance



Conclusions § 

Proposed a CNN based supervised learning method for DOA estimation with a simple input representation

§ 

CNN trained with synthesized noise signals is able to localize a speech source

§ 

Proposed system performs better than SRP-PHAT in unmatched simulated acoustic conditions

§ 

Adaptability to unseen real acoustic environments was also demonstrated



Current Work Multi-speaker localization § 

Aim: Localize simultaneously active speakers

§ 

Formulate multi-speaker localization as a multi-class multi-label classification problem



Current Work Multi-speaker localization 1

D OA

150

0.8 0.6

100

0.4

CNN

50 0.2 0 10

20

30

40

50

60

70

80

90

100

110

0

Ti m e Fr am e s 1

D OA

150

0.8 0.6

100

0.4

True DOA

50 0.2 0 10

20

30

40

50

60

70

80

90

100

110

0

Ti m e Fr am e s 1

D OA

150

0.8 0.6

100

0.4 50

SRP-PHAT

0.2 0 10

20

30

40

50

60

70

80

90

100

110

0

Fr e que nc y bins

Ti m e Fr am e s 250

15

200

10

150

5

100

0 −5

50 10

20

30

40

50

60

70

80

90

100

110

Ti m e fr am e s



−10

Speech segment

Current Work Multi-speaker localization

Pr obability

CNN

SRP−PHAT

True DOAs

1 0.8 0.6 0.4 0.2 0

0

20

40

60

80

100

120

140

160

D OA

Still trained with synthesized noise signals!!



180

Thank you for your attention!

Trained model and weights available: Soumitro-Chakrabarty/Single-speaker-localization