Broadband DOA Estimation using Convolutional Neural Networks

0 downloads 0 Views 3MB Size Report
Broadband DOA Estimation using Convolutional. Neural Networks Trained with Noise Signals. Soumitro Chakrabarty and Emanuël Habets. WASPAA 2017 ...
Broadband DOA Estimation using Convolutional Neural Networks Trained with Noise Signals Soumitro Chakrabarty and Emanuël Habets WASPAA 2017

Motivation Signal processing methods §  Cross-correlation-based methods •  GCC-PHAT •  SRP-PHAT •  MCCC…

Challenges - 

Performance degradation in presence of noise and reverberation

- 

High computational cost

§  Subspace-based methods •  MUSIC... §  Model-based methods •  Maximum-likelihood estimation... §  ...

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 2

Motivation Supervised learning methods §  Advantage: Supervised learning methods can be adapted to

different acoustic environments §  Recently, deep neural network (DNN) based supervised learning

methods have been successful across a range of applications: §  §  § 

Automatic speech recognition Object recognition in images Machine translation….

§  Few DNN based methods that estimate DOA of a sound source from

the observed signals

[1] Xiao et al. "A learning-based approach to direction of arrival estimation in noisy and reverberant environments,“ 2015 [2] Takeda et al. "Sound source localization based on deep neural networks with directional activate function exploiting phase information“ 2016 © AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 3

Motivation DNN based DOA estimation

Aim: A DNN based supervised learning method for DOA estimation that §  Estimates DOA per time frame given the STFT representation of the

observed signals §  Simple input representation to learn relevant features during training

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 4

Problem Formulation DOA estimation as classification Θ = {θ1 , . . . , θI } ... I classes ...

i ≈ θi ∈ Θ = {θ1 , . . . , θI }

§  DOA estimation is formulated as an I class classification problem §  Discretize the whole DOA range into I discrete values to obtain a set

of possible DOA values: ⇥ = {✓1 , . . . , ✓I } §  Each class corresponds to a possible DOA value in the set © AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 5

Problem Formulation DOA estimation as classification Θ = {θ1 , . . . , θI } ... I classes ...

i ≈ θi ∈ Θ = {θ1 , . . . , θI }

§  For each frame, compute the posterior probability for each class §  DOA estimate is the DOA of the class with the highest posterior

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 6

System Overview Supervised learning framework True DOA Labels

Training data

STFT

Input feature

Train DOA classifier

Training Inference/Test

Trained parameters

Test data

STFT

Input feature

DOA classifier

Posterior probabilities

DOA estimate © AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 7

System Overview Supervised learning framework True DOA Labels

Training data

STFT

Input feature

Train DOA classifier

Training Inference/Test

Trained parameters

Test data

STFT

Input feature

DOA classifier

Posterior probabilities

DOA estimate © AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 8

Input feature representation STFT magnitude and phase component

icr

op

ho

ne

s

K − Frequency Bins

m (n,k)

Magnitude

© AudioLabs, 2017 Soumitro Chakrabarty

Phase

Am (n, k)

Broadband DOA Estimation with CNN 9

m (n, k)

− M



N − Time frames

M

N − Time frames

M

M

icr

op

ho

ne

s

K − Frequency Bins

Ym (n, k) = Am (n, k)ej

Input feature representation Phase map

K − Frequency Bins

for each time frame

n

Soumitro Chakrabarty

s ne ho op −

M

icr

Phase

M

M

n

N − Time frames

© AudioLabs, 2017

K

m (n, k)

Phase Map

Broadband DOA Estimation with CNN 10

n

System Overview Supervised learning framework True DOA Labels

Training data

STFT

Input feature

Φn

Train DOA classifier

Training Inference/Test

Trained parameters

Test data

STFT

Input feature

Φn

DOA classifier

Posterior probabilities

DOA estimate © AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 11

Total Feature Maps: 64 Size: (M − 2) × (K − 2) Total Feature Maps: 64 Size: (M − 3) × (K − 3)

Conv1: 2 × 2, 64

Conv2: 2 × 2, 64

Conv3: 2 × 2, 64

§  Pooling is not employed §  Experiments showed decreased performance

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 12

FC2: 512

Total Feature Maps: 64 Size: (M − 1) × (K − 1)

FC1: 512

Input: M × K

Output: I × 1

CNN Architecture

Training with synthesized noise signals

§  CNN learns the required information for DOA estimation from the

phase map §  CNN can be trained using synthesized noise signals (!?)

§  Advantages: 1. 

No speech/audio database required

2. 

Easier to create training data

§  Spectrally white noise was used

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 13

System overview True DOA Labels

Training data (Noise signals)

STFT

Input feature

Φn

Train CNN based DOA classifier

Training Inference/Test Test data (Speech signals)

Trained parameters

STFT

Input feature

Φn

CNN based DOA classifier

p(θi |Φn )

DOA estimate θˆn = arg max p(θi |Φn ) θi

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 14

Evaluation Experiments 1. 

Generalization to speech and robustness to noise

2. 

Performance in acoustic conditions different from training

3. 

Robustness to small perturbations in microphone positions

4. 

Real acoustic environments

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 15

Evaluation Experiments 1. 

Generalization to speech and robustness to noise

2. 

Performance in acoustic conditions different from training

3. 

Robustness to small perturbations in microphone positions

4. 

Real acoustic environments

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 16

Evaluation Performance measure § 

Performance of CNN compared to SRP-PHAT

§ 

Evaluation measure § 

Frame-level accuracy ˆ N A(%) = c ⇥ 100, Ns Ns ˆc N

© AudioLabs, 2017 Soumitro Chakrabarty

Total time frames with speech active Time frames with correct estimate

Broadband DOA Estimation with CNN 17

Evaluation Experimental parameters § 

Uniform linear array (ULA) §  Number of microphones = 4 §  Inter-microphone distance = 3 cm

§ 

STFT length 256, 50% overlap

§ 

Resolution for classes: 5 degrees, I = 37 classes

§ 

Training data is simulated using RIR generator [3]

[3] https://github.com/ehabets/RIR-Generator © AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 18

Evaluation CNN training conditions

Room with training positions Simulated training data Signal Synthesized noise signals Room size R1: (6 ⇥ 6) m , R2: (5 ⇥ 5) m Array positions in room 7 di↵erent positions in each room Source-array distance 1 m and 2 m for position RT60 R1: 0.3 s, R2: 0.2 s SNR Uniformly sampled from 0 to 20 dB © AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 19

Evaluation CNN training parameters §  Training data: 5.6 million time frames §  Validation data: 20% split from the training data §  Loss: Cross-entropy §  Activation: ReLU, Softmax (final layer) §  Optimizer: Adam §  Batchsize: 512 §  Nb Epochs: 10 §  Regularization: Dropout rate 0.5 (After Conv.3 layer, and each FC layer)

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 20

Evaluation Test conditions §  Database: TIMIT test §  Speech samples: 500, 4 s each §  Test data size: 100000 active time frames

Simulated test data Signal Speech signals from TIMIT Room size Room 1: (7 ⇥ 6) m , Room 2: (8 ⇥ 8) m Array positions in room 1 random position for each room Source-array distance 1.5 m for both rooms RT60 Room 1: 0.45 s, Room 2: 0.53 s SNR 2 categories: 5 dB, and 15 dB © AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 21

Evaluation Test conditions Frame-level accuracies (%)

CNN SRP-PHAT

Room 1 5 dB 15 dB 56.2 69.8 22.6 33.6

Room 2 5 dB 15 dB 54.1 68.2 21.8 38.4

§  Better performance compared to SRP-PHAT §  For all cases, CNN is accurate for the majority of the frames

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 22

Evaluation Qualitative results - Room 2, 15 dB D OA ( ◦ )

1 150 100

0.5

50 10

20

30

40

50

60

Tim e fr am e s

70

80

90

100

CNN

0

D OA ( ◦ )

1 150 100

0.5

50 10

20

30

40

50

60

Tim e fr am e s

70

80

90

100

SRP-PHAT

0

D OA ( ◦ )

1 150 100

0.5

50 10

Fr e que nc y bins

True DOA

20

30

40

50

60

Tim e fr am e s

70

80

90

100

0

120 100 80 60 40 20

−10 −20 −30 10

© AudioLabs, 2017 Soumitro Chakrabarty

0

20

30

40

50

60

Tim e fr am e s

70

80

90

Broadband DOA Estimation with CNN 23

100

Speech segment

Evaluation Qualitative results – Average over 0.8 s

CNN

Pr obability

1

SRP−PHAT

True DOA

0.8 0.6 0.4 0.2 0

0

20

40

60

80

100

120

D OA

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 24

140

160

180

Evaluation Real environment – Measured RIRs §  Database: Multichannel Impulse Response Database from Bar-Ilan §  Source setup: [0 , 180 ] , steps of 15 degrees §  Array configuration: M = 8 microphones, d = 8 cm §  Source-array distances: 1 m and 2 m §  Test sample: 15 s long speech sample

15 resolution {1, 2} m

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 25

Evaluation Real environment – Measured RIRs §  Database: Multichannel Impulse Response Database from Bar-Ilan §  Source setup: [0 , 180 ] , steps of 15 degrees §  Array configuration: [8,8,8,8,8,8,8], M = 8 microphones §  Source-array distances: 1 m and 2 m §  Test sample: 15 s long speech sample §  CNN was retrained for the new array geometry (with simulated data)

6m

6m

© AudioLabs, 2017 Soumitro Chakrabarty

{1,2} m

Broadband DOA Estimation with CNN 26

Evaluation Real environment – Measured RIRs Frame-level accuracies (%)

CNN SRP-PHAT

RT60 = 0.160 s 1m 2m 91.8 88.7 94.4 69.0

RT60 = 0.360 s 1m 2m 86.8 79.4 87.1 68.3

RT60 = 0.610 s 1m 2m 72.3 67.3 71.7 62.4

§ 

For 2 m distance, CNN outperforms SRP-PHAT

§ 

SRP-PHAT performs better at lower reverberation times for 1 m source-array distance

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 27

Conclusions § 

Proposed a CNN based supervised learning method for DOA estimation with a simple input representation

§ 

CNN trained with synthesized noise signals is able to localize a speech source

§ 

Proposed system performs better than SRP-PHAT in unmatched simulated acoustic conditions

§ 

Adaptability to unseen real acoustic environments was also demonstrated

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 28

Current Work Multi-speaker localization § 

Aim: Localize simultaneously active speakers

§ 

Formulate multi-speaker localization as a multi-class multi-label classification problem

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 29

Current Work Multi-speaker localization 1

D OA

150

0.8 0.6

100

0.4

CNN

50 0.2 0 10

20

30

40

50

60

70

80

90

100

110

0

Ti m e Fr am e s 1

D OA

150

0.8 0.6

100

0.4

True DOA

50 0.2 0 10

20

30

40

50

60

70

80

90

100

110

0

Ti m e Fr am e s 1

D OA

150

0.8 0.6

100

0.4 50

SRP-PHAT

0.2 0 10

20

30

40

50

60

70

80

90

100

110

0

Fr e que nc y bins

Ti m e Fr am e s 250

15

200

10

150

5

100

0 −5

50 10

20

30

40

50

60

70

80

90

100

110

Ti m e fr am e s

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 30

−10

Speech segment

Current Work Multi-speaker localization

Pr obability

CNN

SRP−PHAT

True DOAs

1 0.8 0.6 0.4 0.2 0

0

20

40

60

80

100

120

140

160

D OA

Still trained with synthesized noise signals!!

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 31

180

Thank you for your attention!

Trained model and weights available: Soumitro-Chakrabarty/Single-speaker-localization

© AudioLabs, 2017 Soumitro Chakrabarty

Broadband DOA Estimation with CNN 32