Landmarks

A. Kovács¹, M. Coath², S. Denham², I. Winkler¹,³

¹ Institute of Cognitive Neuroscience and Psychology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest; ² Cognition Institute, Plymouth University, UK; ³ Institute of Psychology, University of Szeged, Hungary [email protected]

Introduction

Results

Enhancement of auditory transients is well documented in the auditory periphery and midbrain, and transients are also known to be important in, for example, speech comprehension, object recognition and grouping. In this work we introduce a novel approach that uses an artificial neural network to implement a model of auditory transient extraction based on the asymmetry of the distribution of energy inside a frequency-dependent time window. We compare the outputs of this method with those of the original model and with other methods that identify salient events in auditory stimuli, motivated by phonological-acoustical (Landmarks) and information-theoretical (Bayesian Surprise) analyses of speech. We introduce STRF (Spectrotemporal Response Field) kernels derived from speech stimuli, which also identify events in speech, and compare the outputs of these four salient event detector methods. The results show that the transients identify, in most cases, the same stimulus features as the other methods.

Implementation of transient detection using ANN

Methods

Cochlear model and SKV representation

Speech is processed using a Simple Cochlear Model (SCM) consisting of a bank of 128 Gammatone filters (Slaney, 1994), with center frequencies ranging from 100 to 8000 Hz, spaced evenly on the Equivalent Rectangular Bandwidth (ERB) scale (Glasberg, Moore, 1990). The model of transient responses used here (Coath, Denham, 2005) is based on the skewness of the distribution of energy in a variable time window and is referred to as SKV. Skewness is a measure of the asymmetry of a distribution's tails. The SKV method generates responses that agree with experimental data in a number of respects (Krumbholz et al., 2003; Wiegrebe, 2001). As part of this work we present the SKV method implemented by training an artificial neural network (ANN).
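The two numerical ingredients of this section can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the ERB-number formula is the one from Glasberg and Moore (1990), and the skewness function assumes that the energy inside a window is treated as a distribution over time positions, which is one plausible reading of the SKV description. The frequency-dependent window length and the sign convention for onsets versus offsets are not given in the text and are left out.

```python
import numpy as np

def erb_number(f_hz):
    """ERB-number of a frequency in Hz (Glasberg and Moore, 1990)."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_number_to_hz(e):
    """Inverse of erb_number: ERB-number back to frequency in Hz."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def erb_spaced_centre_frequencies(f_low=100.0, f_high=8000.0, n=128):
    """Centre frequencies spaced evenly on the ERB-number scale."""
    e = np.linspace(erb_number(f_low), erb_number(f_high), n)
    return erb_number_to_hz(e)

def temporal_skewness(energy):
    """Skewness of the energy in one window, treated as a distribution
    over sample positions (assumption: energy values are non-negative)."""
    energy = np.asarray(energy, dtype=float)
    total = energy.sum()
    if total <= 0.0:
        return 0.0                      # silent window: no asymmetry
    p = energy / total                  # normalise to a probability mass
    t = np.arange(energy.size)
    mean = (p * t).sum()
    var = (p * (t - mean) ** 2).sum()
    if var <= 0.0:
        return 0.0                      # all energy in a single sample
    return (p * (t - mean) ** 3).sum() / var ** 1.5
```

With this definition, a window whose energy rises towards its end and one whose energy falls towards its end give skewness values of opposite sign, while a symmetric window gives zero. The filterbank in the cited toolbox may space its channels slightly differently.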

Spectrotemporal Response Fields

The structure of spectrotemporal response fields (STRFs) in the human auditory cortex is not known, but if they develop through early acoustic experience, speech might play a large part in this process. It has previously been shown that STRFs derived from speech fragments contain significant information (Coath, Denham, 2005). In this work, STRF refers to a spectrogram-like response field consisting of the short-term skewness, a higher-order spectrogram. When the STRFs are used in a model of auditory processing, the method generates responses that are compared with the results of the other methods.
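The discussion describes the STRF response as a convolution of the onset transient activity with kernels representing cortical filters. A minimal sketch of that operation for a single kernel follows; the function name, the array layout (channels by time) and the causal, per-channel convolution summed over channels are all assumptions, since the poster does not specify how the kernels are applied.

```python
import numpy as np

def strf_response(skv, kernel):
    """Summed response of one hypothetical STRF kernel.

    skv    : array (n_channels, n_time), transient activity per channel
    kernel : array (n_channels, n_lags), weights per channel and time lag
    Returns an array (n_time,) with
        out[t] = sum over c, lag of kernel[c, lag] * skv[c, t - lag]
    """
    skv = np.asarray(skv, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    if skv.shape[0] != kernel.shape[0]:
        raise ValueError("skv and kernel need the same number of channels")
    n_time = skv.shape[1]
    out = np.zeros(n_time)
    for c in range(skv.shape[0]):
        # Causal convolution along time, truncated to the input length.
        out += np.convolve(skv[c], kernel[c], mode="full")[:n_time]
    return out
```

A set of kernels would be handled by calling the function once per kernel and, for a summed response, adding the outputs.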

The results summarized in this section indicate that the speech processed with the SKV algorithm and with the trained neural network implementation (𝑆𝐾𝑉𝑛𝑛) clearly agree.

The summed activity across all cochlear channels of the model's onset-sensitive neurons (SKV representation, blue) is compared with the summed STRF response (green) and the Bayesian Surprise (red). The onset landmarks are marked in black.

Comparison of the methods using 3 languages

To compare the events identified by the four salient event detector algorithms above, we introduced a windowed F-measure calculation. We used tolerance windows of different sizes to calculate precision and recall. A good result exhibits both high precision and high recall, so we combined them into the F-measure, their weighted harmonic mean.

SKV (upper) and 𝑆𝐾𝑉𝑛𝑛 (lower) representations of Hungarian speech sample onsets (red) and offsets (blue). The two methods clearly agree in the pattern of onsets and offsets identified.
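The windowed F-measure calculation described above can be sketched as follows. The matching rule is an assumption: each detected event is paired with the nearest still-unmatched reference event within the tolerance window, and the balanced F-measure (equal weight on precision and recall) is used. The function name and the millisecond units are illustrative.

```python
def windowed_f_measure(detected, reference, tolerance):
    """Precision, recall and balanced F-measure for event times.

    A detected event counts as a hit if an unmatched reference event
    lies within +/- tolerance of it; each reference is used at most once.
    """
    detected = sorted(detected)
    reference = sorted(reference)
    used = [False] * len(reference)
    hits = 0
    for d in detected:
        best = None
        for i, r in enumerate(reference):
            if used[i] or abs(d - r) > tolerance:
                continue
            if best is None or abs(d - r) < abs(d - reference[best]):
                best = i
        if best is not None:
            used[best] = True
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f

# Example: event times in ms, 30 ms tolerance window.
p, r, f = windowed_f_measure([100, 520, 900], [120, 500], tolerance=30)
```

Sweeping `tolerance` over a range such as 0 to 100 ms gives one F-measure value per window size, which is the kind of curve the comparison relies on.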

Summed absolute difference between the SKV and 𝑆𝐾𝑉𝑛𝑛 representations in the previous figure for different ranges of SKV values. For a network with only one neuron per input in the hidden layer, the errors in the ANN-derived values are large, and they are greatest at either end of the range of values. Errors for hidden layers of 5 and 10 neurons per input are also illustrated.
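An error summary of this kind can be sketched as below, assuming the two representations are arrays of the same shape and that the ranges are defined by a list of bin edges on the SKV value. The function name and the choice of half-open bins are assumptions, not details from the poster.

```python
import numpy as np

def summed_abs_error_by_range(skv, skv_nn, bin_edges):
    """Sum |skv - skv_nn| separately for each range of SKV values.

    bin_edges must be increasing; bin k covers
    bin_edges[k] <= skv < bin_edges[k + 1]. Values outside all bins
    (including values equal to the last edge) are ignored.
    """
    skv = np.asarray(skv, dtype=float).ravel()
    skv_nn = np.asarray(skv_nn, dtype=float).ravel()
    err = np.abs(skv - skv_nn)
    n_bins = len(bin_edges) - 1
    idx = np.digitize(skv, bin_edges) - 1     # bin index per sample
    valid = (idx >= 0) & (idx < n_bins)
    return np.bincount(idx[valid], weights=err[valid], minlength=n_bins)
```

Calling it once per network size (for example 1, 5 and 10 hidden neurons per input) gives one error profile per network for comparison.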

SKV and Bayesian Surprise

Comparison of the Bayesian surprise for each cochlear channel with the SKV representation shows that both identify broadly the same spectrotemporal regions of the speech, as illustrated below.

Landmarks

Acoustic landmarks are features in speech defined on linguistic grounds; they are patterns of spectrotemporal change and are identified by distinctive features (Stevens, 2002). The landmarks are a guide to the presence of underlying segments, which organize distinctive features into groups. The landmarks are therefore the times in an utterance when the acoustic correlates of the distinctive features are most salient. According to the theory, the listener focuses on these landmarks to decipher the underlying distinctive features in the speech.

Bayesian Surprise

As an alternative way of detecting change in speech, surprise has been introduced as a general information-theoretical concept (Itti, Baldi, 2009). It is derived from first principles and formalized analytically across spatio-temporal scales, sensory modalities and, more generally, data types and sources. Bayesian Surprise has been used as a salient event detector for visual stimuli, but it has not been widely explored in auditory experiments.
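In the formulation of Itti and Baldi, surprise is the Kullback-Leibler divergence between the posterior and the prior belief of an observer after new data arrive. The sketch below illustrates the idea on a one-dimensional signal; the Gaussian model with known noise variance, the function names and the parameter defaults are assumptions chosen for simplicity, not the model used on the poster.

```python
import math

def gaussian_update(prior_mean, prior_var, x, noise_var):
    """Conjugate update of a Gaussian belief about a mean after seeing x."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + x / noise_var)
    return post_mean, post_var

def kl_gaussian(m1, v1, m0, v0):
    """KL( N(m1, v1) || N(m0, v0) ) in nats; v0 and v1 are variances."""
    return 0.5 * (math.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)

def surprise_trace(samples, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
    """Surprise per sample: KL(posterior || prior), with the posterior
    becoming the prior for the next sample."""
    trace = []
    mean, var = prior_mean, prior_var
    for x in samples:
        post_mean, post_var = gaussian_update(mean, var, x, noise_var)
        trace.append(kl_gaussian(post_mean, post_var, mean, var))
        mean, var = post_mean, post_var
    return trace

# A steady signal with one abrupt change at index 4.
trace = surprise_trace([0.0, 0.0, 0.0, 0.0, 5.0, 0.0])
```

In this example the largest surprise value falls on the sample where the signal changes abruptly. Applied per cochlear channel, the same computation would give one surprise trace per channel.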

The SKV representation of a short section of speech (lower) and the Bayesian surprise calculated for each cochlear channel (upper).

Landmarks

In this approach we compared the ∑ SKV, ∑ STRF and Bayesian Surprise peaks with those identified by algorithms designed to detect Landmarks. In all these cases we have, for simplicity and clarity, included only the events associated with onsets; in each case there is a matching offset landmark event.

The figures show how the F-measure values change over a range of tolerance window sizes (from 0 ms to 100 ms) for the English, German and Hungarian speech corpora. Each corpus contains one female speaker saying 10 different sentences, and the four methods (SKV, STRF, WOW, LM) were evaluated on these corpora. There is a small difference between the results for the different languages, but more work is needed to establish its significance.

Discussion

It has previously been argued that within-channel, transient-sensitive processing on multiple frequency-related time scales is related to the goal of efficient coding of naturalistic, behaviourally relevant stimuli. The ∑ SKV gives phasic peaks that can be said to define events, or short time windows, where there are changes in overall energy or spectral content; a change of spectral content is a special sort of energy change. The STRF response is the output of a cortical model: it convolves the onset transient activity with a set of kernels representing cortical filters. These filters maximize the information in the speech and generate salient events in the sound. The Bayesian Surprise and Landmarks algorithms identify many of the same events, but only the SKV and STRF methods are biologically plausible. The detected salient events are not language dependent. With parallel processing, multiprocessor and neuromorphic systems, ANN detection of salient features of sounds represents an efficient implementation suitable for, e.g., auditory prostheses such as brainstem implants.

References

• Slaney, M., Auditory toolbox documentation, Technical Report 45, Apple Computers Inc, 1994.
• Glasberg, B. R., Moore, B. C., Hear Res 47 (1990) 103.
• Coath, M., Denham, S. L., Biol Cybern 93 (2005) 22.
• Krumbholz, K., Patterson, R. D., Seither-Preisler, A., Lammertmann, C., Lütkenhöner, B., Cereb Cortex 13 (2003) 756.
• Wiegrebe, L., J Acoust Soc Am 109 (2001) 1082.
• Stevens, K. N., J Acoust Soc Am 111 (2002) 1872.
• Itti, L., Baldi, P., Vision Res 49 (2009) 1295.

Acknowledgements: This work was funded by: Lendület Projekt (Magyar Tudományos Akadémia) 0183-13