GTM-UVigo Systems for the Query-by-Example ...

GTM-UVigo Systems for the Query-by-Example Search on Speech Task at MediaEval 2015
Paula Lopez-Otero, Laura Docio-Fernandez, Carmen Garcia-Mateo
{plopez,ldocio,carmen}@gts.uvigo.es
AtlantTIC Research Center, Multimedia Technologies Group, Universidade de Vigo

Overview
• Our systems consist of a fusion of 11 DTW-based subsystems
• All the subsystems use a phoneme posteriorgram representation

• We used posteriorgrams in 6 different languages: Galician (GA), Spanish (ES), English (EN), Czech (CZ), Hungarian (HU), Russian (RU)
• We focused on two topics:
– Neural network architectures for phoneme posteriorgram extraction
– Phoneme unit selection ⇒ more is not always better

System            Description            minCnxe Dev   minCnxe Eval
Primary           Phoneme selection      0.905         0.905
Contrastive       No selection           0.918         0.923
Primary late      Primary + z-norm       0.847         0.838
Contrastive late  Contrastive + z-norm   0.864         0.852

→ Phoneme unit selection works! → Normalization is VERY important
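As a rough illustration of the two ingredients named above (DTW search on posteriorgrams, followed by z-norm), here is a minimal sketch. The poster does not give the local cost, the DTW step pattern or the scope of the normalization, so the following are assumptions: a negative-log dot-product cost, three standard steps, predecessors chosen by raw accumulated cost, and z-norm applied to all document scores of one query. The function names and the toy posteriorgrams are made up for the example.

```python
import math
from statistics import mean, pstdev

def local_cost(q, d, eps=1e-10):
    """Assumed local cost: -log of the dot product of two posterior vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    return -math.log(max(dot, eps))

def subsequence_dtw(query, document):
    """Return (length-normalized cost, start frame, end frame) of the best match.

    The whole query is aligned, but the match may start and end at any
    document frame.
    """
    n, m = len(query), len(document)
    # The first query frame may align with any document frame.
    acc = [[local_cost(query[0], document[j]) for j in range(m)]]
    length = [[1] * m]
    start = [list(range(m))]
    for i in range(1, n):
        acc.append([0.0] * m)
        length.append([0] * m)
        start.append([0] * m)
        for j in range(m):
            # Allowed predecessors: vertical, diagonal, horizontal.
            preds = [(i - 1, j)]
            if j > 0:
                preds += [(i - 1, j - 1), (i, j - 1)]
            pi, pj = min(preds, key=lambda p: acc[p[0]][p[1]])
            acc[i][j] = acc[pi][pj] + local_cost(query[i], document[j])
            length[i][j] = length[pi][pj] + 1
            start[i][j] = start[pi][pj]
    # Pick the end frame with the lowest cost per path step.
    end = min(range(m), key=lambda j: acc[-1][j] / length[-1][j])
    return acc[-1][end] / length[-1][end], start[-1][end], end

def z_norm(scores):
    """Normalize the scores of one query to zero mean and unit variance."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]

# Toy 2-phoneme posteriorgrams (made-up values, one list per frame).
query = [[1.0, 0.0], [0.0, 1.0]]
documents = [
    [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # contains the query
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
    [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],
]
costs = [subsequence_dtw(query, doc)[0] for doc in documents]
normalized = z_norm(costs)
```

In this toy case the first document, which contains the query frames in order, receives the lowest normalized cost. Normalizing per query puts the costs of different queries on a comparable scale before a single decision threshold is applied.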

Neural networks
• We tried three different architectures:
– lstm: long short-term memory NN using Kaldi
– dnn: deep NN context-dependent phone recognizer using Kaldi, trained with LDA-STC-fMLLR features
– traps: BUT phone decoders

minCnxe Dev
        GA     ES     EN     CZ     HU     RU     ISF    PMUI
lstm    0.895  0.879  0.915  0.904  -      -      0.34   0.22
dnn     0.898  0.897  0.915  0.922  0.934  0.944  2.93   6
traps   -      -      -      0.939  0.934  0.944  0.10   0.01

→ Best architecture: lstm → Fastest architecture: traps → Best performing language: Spanish (WHY?)

*CZ training data are different for CZtraps

Phoneme unit selection
Computation of the contribution of each phonetic unit to the cost of the best path P(Q, D) of length K. For example, the relevance of phoneme β is

$$R(P(Q, D), \beta) = \frac{1}{K} \sum_{k=1}^{K} c(q_{i_k}, d_{j_k}, \beta)$$

[Figure: minCnxe (0.88 to 1) versus number of phoneme units (20 to 80) for the subsystems CZlstm, GAdnn, GAlstm, ESdnn, CZtraps, HUtraps, RUtraps, CZdnn, ESlstm, ENdnn and ENlstm]

→ No improvement when the number of phoneme units is small
→ Then, what happened in EN systems?
→ Need for automatic criterion to select the number of phoneme units
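The relevance measure above averages a per-phoneme contribution c over the K steps of the best alignment path. The poster does not define c, so the sketch below takes it as a function argument and, purely for illustration, plugs in a hypothetical contribution: the share of phoneme β in the dot product of the two aligned frames. The names `relevance` and `product_contribution` and the toy values are assumptions.

```python
def relevance(path, query, document, beta, contribution):
    """R(P(Q, D), beta) = (1/K) * sum over the path of c(q_ik, d_jk, beta)."""
    K = len(path)
    total = sum(contribution(query[i], document[j], beta) for i, j in path)
    return total / K

def product_contribution(q, d, beta):
    """Hypothetical c: the term of phoneme beta in the frame dot product."""
    return q[beta] * d[beta]

# Toy posteriorgrams over 2 phoneme units and a 2-step alignment path.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]
path = [(0, 0), (1, 1)]  # pairs (i_k, j_k) of query and document frames

r0 = relevance(path, query, document, 0, product_contribution)
r1 = relevance(path, query, document, 1, product_contribution)
```

With these toy values, unit 0 contributes 1.0 on the first step and 0.0 on the second, and unit 1 contributes 0.0 and 0.5, so the averages are 0.5 and 0.25.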

We also...
• Assessed fingerprint (FP) representation of acoustic features as in previous work (zero-resource approach)

minCnxe Dev   MFCC 0DA   RASTA-PLP   GFCC 0DA
No FP         0.996      0.987       0.994
FP            0.983      0.987       0.975

→ Performance isn’t very impressive
• Assessed model selection techniques:
– Borrowed from automatic speech recognition
– For each document, find the acoustic model with highest likelihood
– Assessed using CZtraps, HUtraps and RUtraps
→ Hasn’t worked (yet) ⇒ Maybe due to normalization issues
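The model selection step described above (for each document, keep the acoustic model with the highest likelihood) amounts to an argmax over per-model scores. A minimal sketch follows, assuming the likelihoods are already available as log-likelihoods per model. The function name and the numbers are invented for illustration and are not results from the poster.

```python
def select_model(log_likelihoods):
    """Return the name of the acoustic model with the highest log-likelihood."""
    return max(log_likelihoods, key=log_likelihoods.get)

# Hypothetical log-likelihoods of one document under three acoustic models.
doc_scores = {"CZtraps": -1523.4, "HUtraps": -1498.7, "RUtraps": -1510.2}
best = select_model(doc_scores)
```

Only the posteriorgram of the selected model would then be used to search that document. Raw likelihoods from different models may not be directly comparable, which fits the normalization concern noted above.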

Right now...

• We are speeding up and reducing memory requirements of dnn systems
• Performance of ESdnn system after modifying different aspects:

Description             minCnxe   ISF    PMUI
Initial version         0.897     2.63   6
Phoneme network         0.896     9.96   1.73
One ASR pass            0.896     4.48   1.73
No lattice determinize  0.895     1.66   2.04
No fMLLR                0.867     0.62   0.48

→ Mission accomplished
• We are focusing on zero-resource approaches
– Large set of acoustic, spectral and prosodic features extracted with OpenSMILE
– Feature selection using the proposed approach for phoneme units
→ Outperforms many of the phoneme posteriorgram-based subsystems described above
→ Gives us some knowledge about which features are useful for this task

[Figure: minCnxe (0.9 to 0.94) versus number of features (20 to 200)]

Acknowledgements

This research was funded by the Spanish Government (’SpeechTech4All Project’ TEC2012-38939-C03-01), the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and ’AtlantTIC Project’ CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under project TACTICA.
