Post-Cochlear Auditory Modelling for Sound Localisation using Bio-Inspired Techniques

Julie Wall, MSc.
Faculty of Computing and Engineering

A thesis submitted for the degree of Doctor of Philosophy
April 2010

"I conrm that the word count of this thesis is less than 100,000 words"

Contents

Acknowledgements
Abstract
Declaration
List of Figures
List of Tables
Glossary

1 Introduction
  1.1 Objectives of the Thesis
  1.2 Thesis Contributions
  1.3 Outline of the Thesis

2 Sound Localisation in the Mammalian Auditory System
  2.1 Introduction
    2.1.1 Interaural Time Difference
    2.1.2 Interaural Intensity Differences
  2.2 Mammalian Auditory Pathway for Sound Localisation
    2.2.1 Cochlea
    2.2.2 Basilar Membrane
    2.2.3 Organ of Corti
    2.2.4 Hair Cells
    2.2.5 Auditory Nerve
    2.2.6 Cochlear Nucleus
      2.2.6.1 Cell types
    2.2.7 Superior Olivary Complex
    2.2.8 Higher Auditory Pathways
      2.2.8.1 Lateral Lemniscus
      2.2.8.2 Inferior Colliculus
      2.2.8.3 Thalamus
      2.2.8.4 Auditory Cortex
  2.3 Conclusion

3 Neural Networks and Sound Localisation Modelling
  3.1 Introduction
  3.2 Cochlea and Auditory Cell Models
  3.3 Artificial Neural Networks
    3.3.1 Types of Artificial Neural Networks
    3.3.2 Learning
  3.4 Spiking Neurons
    3.4.1 Biological Neuron
    3.4.2 Computational Models of Neurons
    3.4.3 Dynamic Synapses
    3.4.4 Training Algorithms
      3.4.4.1 Unsupervised Learning
      3.4.4.2 Supervised Learning
      3.4.4.3 Reinforcement Learning
  3.5 Receptive Fields
  3.6 State of the Art in Sound Source Localisation Modelling
    3.6.1 Artificial Neural Network Methods
    3.6.2 Spiking Neural Network Methods
  3.7 Conclusion

4 Spiking Neural Network Model of the Medial Superior Olive
  4.1 Introduction
  4.2 Initial MSO Model
    4.2.1 Preliminary Results and Analysis
  4.3 Extended MSO Architecture
    4.3.1 Input Layer
    4.3.2 Bushy Cell Layer
    4.3.3 Output Layer
      4.3.3.1 Generic Delay Structure
  4.4 Conclusions

5 Spiking Neural Network Model of the Lateral Superior Olive
  5.1 Introduction
  5.2 Initial LSO Model
  5.3 Complete LSO Architecture
    5.3.1 Input Layer
    5.3.2 Hidden Layers
    5.3.3 Output Layer
    5.3.4 Training Algorithm
    5.3.5 Testing
    5.3.6 Results
  5.4 Conclusions

6 Duplex Spiking Neural Network Model of Sound Localisation
  6.1 Introduction
  6.2 Sound Localisation Across the Frequency Range
  6.3 Addition of Noise
  6.4 Generalisation Testing
  6.5 Conclusion

7 Conclusions and Recommendations
  7.1 Comparison to Similar Work
  7.2 Concluding Summary
  7.3 Contributions of the Thesis
  7.4 Future Work

Acknowledgements

It is a pleasure to thank my supervisors, Dr. Liam McDaid, Professor Liam Maguire and Professor Martin McGinnity. Their encouragement, supervision and support from the beginning of the PhD to the final stages enabled me to develop an understanding of the subject and made this thesis possible. I would like to thank my colleagues in the Intelligent Systems Research Centre for their assistance and support, especially Dr. Neil Glackin, who read many drafts of this thesis and provided unwavering support. I would also like to thank Dr. D. J. Tollin at the University of Colorado Medical School for providing me with the HRTF data and recommending which auditory periphery models to use. I would like to thank my family for all their love and encouragement, especially my parents, who have supported me in my studies for many years. Lastly, I am very grateful to the university for granting me the Vice-Chancellor's Research Scholarship, which has enabled me to study at the University of Ulster.


Abstract

This thesis presents spiking neural architectures which simulate the sound localisation capability of the mammalian auditory pathways. This localisation ability is achieved by exploiting important differences in the sound stimulus received by each ear, known as binaural cues. Interaural time difference and interaural intensity difference are the two binaural cues which play the most significant role in mammalian sound localisation. These cues are processed by different regions within the auditory pathways and enable the localisation of sounds at different frequency ranges: interaural time difference is used to localise low frequency sounds whereas interaural intensity difference localises high frequency sounds. Interaural time difference refers to the different points in time at which a sound from a single location arrives at each ear, and interaural intensity difference refers to the difference in sound pressure levels of the sound at each ear, measured in decibels.

Taking inspiration from the mammalian brain, two spiking neural network topologies were designed to extract each of these cues. The architecture of the spiking neural network designed to process the interaural time difference cue was inspired by the medial superior olive. The lateral superior olive was the inspiration for the architecture designed to process the interaural intensity difference cue. The development of these spiking neural network architectures required the integration of other biological models, such as an auditory periphery (cochlea) model, models of bushy cells and the medial nucleus of the trapezoid body, leaky integrate-and-fire spiking neurons, facilitating synapses, receptive fields and the appropriate use of excitatory and inhibitory neurons. Two biologically inspired learning algorithms were used to train the architectures to perform sound localisation. Experimentally derived HRTF acoustical data from adult domestic cats was employed to validate the localisation ability of the two architectures.

The localisation abilities of the two models are comparable to other computational techniques employed in the literature. The experimental results demonstrate that the two SNN models behave in a similar way to the mammalian auditory system, i.e. the spiking neural network for interaural time difference extraction performs best when localising low frequency data, and the interaural intensity difference spiking neuron model performs best when localising high frequency data. Thus, the combined models form a duplex system of sound localisation. Additionally, both spiking neural network architectures show a high degree of robustness when the HRTF acoustical data is corrupted by noise.


Declaration

I hereby declare that with effect from the date on which the thesis is deposited in the Library of the University of Ulster, I permit the Librarian of the University to allow the thesis to be copied in whole or in part without reference to me on the understanding that such authority applies to the provision of single copies made for study purposes or for inclusion within the stock of another library. This restriction does not apply to the British Library Thesis Service (which is permitted to copy the thesis on demand for loan or sale under the terms of a separate agreement) nor to the copying or publication of the title and abstract of the thesis. IT IS A CONDITION OF USE OF THIS THESIS THAT ANYONE WHO CONSULTS IT MUST RECOGNISE THAT THE COPYRIGHT RESTS WITH THE AUTHOR AND THAT NO QUOTATION FROM THE THESIS AND NO INFORMATION DERIVED FROM IT MAY BE PUBLISHED UNLESS THE SOURCE IS PROPERLY ACKNOWLEDGED.


List of Figures

2.1 Auditory pathways for sound localisation
2.2 Low frequency sound wave approaching each ear
2.3 Out-of-phase waves
2.4 Volley theory of hearing
2.5 Geometric view of Rayleigh's simple formula for determining the ITD
2.6 The Jeffress Model
2.7 Head shadow causes intensity difference for high frequency sounds
2.8 HRTFs for the left and right ears from three different angles over all frequencies
2.9 Interaural intensity differences as a function of azimuthal angles
2.10 Anatomy of the ear
2.11 Functional diagram of the ear
2.12 Tonotopic organisation of the auditory system
2.13 Inner and outer hair cells
2.14 Cochlear nucleus
2.15 Cell types of cochlear nucleus
2.16 Response type of cochlear nucleus cells
2.17 Difference in phase-locking of spikes between the auditory nerve and bushy cells
2.18 Medial nucleus of the trapezoid body cell
2.19 Possible output if contralateral and ipsilateral inputs to lateral superior olive do not arrive simultaneously
2.20 Lateral superior olive cells' sensitivity to differing interaural intensity differences
2.21 Connections between the lateral superior olive and the dorsal nucleus of the lateral lemniscus
2.22 Cell types of the central nucleus of the inferior colliculus
2.23 Thalamocortical relay cell
2.24 Schematic of the auditory pathways
3.1 Auditory periphery model
3.2 Artificial neural network
3.3 Artificial neuron
3.4 Biological neuron
3.5 Leaky integrate and fire neuron model
3.6 Spike response model
3.7 Spike-timing-dependent plasticity
3.8 Remote Supervision Method learning windows
4.1 Network topology for initial medial superior olive model
4.2 Delay line structure
4.3 Weight values after training for medial superior olive model
4.4 Extended medial superior olive network architecture
4.5 Range of angles chosen for classification
4.6 3-D mesh surface plot of the HRTF acoustical input data
4.7 Comparison of the interaural time difference values determined by Rayleigh's model to Nordlund's experimentally derived values
4.8 Stimulus time-domain waveforms and their associated output from cochlea model
4.9 Cochlea model outputs and their associated bushy cell outputs
4.10 Final network architecture for medial superior olive model
4.11 Training error plot
4.12 Final weights on synapses between the delay structure and each output neuron
5.1 Initial lateral superior olive model
5.2 Lateral superior olive model responses
5.3 Mapping of lateral superior olive model output frequencies to differential of inputs
5.4 Extended lateral superior olive model
5.5 Range of responses produced by lateral superior olive neuron
5.6 Receptive field layer
5.7 Output of receptive field layer
5.8 Stable weight distribution
5.9 Training error
6.1 Response of both spiking neural network models to full range of sound frequencies
6.2 Presence of phase-locking in the spike train output of the cochlea model for low frequency sounds
6.3 Bushy cell layer removed from medial superior olive model for sound frequencies greater than 3 kHz
6.4 Extra layer of spiking neurons to classify the intermediate range of sound frequencies
6.5 Results from the extra layer of spiking neurons to classify the intermediate range of sound frequencies
6.6 Classification accuracies for each individual angle
6.7 Incorporating noise into the input data
6.8 Classification accuracies when noise is added to the input data
6.9 Generalisation across non-neighbouring sounds
7.1 Classification of complex sounds to angles of location

List of Tables

4.1 Sample of HRTF data
4.2 Interaural time difference values for each azimuthal angle
4.3 Results for initial medial superior olive model
4.4 Final medial superior olive model results
5.1 Lateral superior olive model results

Glossary

AI      Primary Auditory Cortex
AII     Secondary Auditory Cortex
AN      Auditory Nerve
ANN     Artificial Neural Network
AVCN    Anteroventral Cochlear Nucleus
BMT     Balanced Model Truncation
CAM     Content-Addressable Memory
CF      Characteristic Frequency
CMOS    Complementary Metal-Oxide-Semiconductor
DCN     Dorsal Cochlear Nucleus
DFE     Diffuse-Field Equalisation
DMGB    Dorsal Medial Geniculate Body
DNLL    Dorsal Nucleus of the Lateral Lemniscus
EE      Excited-Excited
EI      Excited-Inhibited
EPSP    Excitatory Post-Synaptic Potential
ES      Evolutionary Strategy
FFT     Fast Fourier Transform
FPGA    Field Programmable Gate Array
HRTF    Head-Related Transfer Function
IC      Inferior Colliculus
ICC     Central Nucleus of the IC
ICP     Pericentral Nucleus of the IC
ICX     External Nucleus of the IC
IE      Inhibited-Excited
II      Inhibited-Inhibited
IHC     Inner Hair Cell
IID     Interaural Intensity Difference
INLL    Intermediate Nucleus of the Lateral Lemniscus
IPD     Interaural Phase Difference
IPSP    Inhibitory Post-Synaptic Potential
ITD     Interaural Time Difference
LIF     Leaky Integrate-and-Fire
LL      Lateral Lemniscus
LSM     Liquid State Machine
LSO     Lateral Superior Olive
LTD     Long Term Depression
LTP     Long Term Potentiation
MGB     Medial Geniculate Body
MMGB    Medial MGB
MNTB    Medial Nucleus of the Trapezoid Body
MSO     Medial Superior Olive
NRT     Reticular Nucleus of the Thalamus
OHC     Outer Hair Cell
PCA     Principal Component Analysis
PVCN    Posteroventral Cochlear Nucleus
RBF     Radial Basis Function
ReSuMe  Remote Supervision Method
RNA     Ribonucleic Acid
SHL     Supervised Hebbian Learning
SOC     Superior Olivary Complex
SNN     Spiking Neural Network
SPL     Sound Pressure Level
SRM     Spike Response Model
STDP    Spike-Timing-Dependent Plasticity
TDOA    Time Delay of Arrival
VLSI    Very Large Scale Integration
VMGB    Ventral MGB
VNLL    Ventral Nucleus of the Lateral Lemniscus
VNLLc   Columnar Nucleus of the VNLL
VNLLm   Multipolar Cell area of the VNLL

Chapter 1

Introduction

Sound localisation is the ability to perceive the direction from which a sound originates. This ability provides mammals with the capacity to interact with the environment by being aware of prey, potential mates and, most importantly, predators. Localisation is achieved by exploiting important differences in the sound stimulus received by each ear. These differences are known as binaural cues. The two binaural cues which play the most significant role in the localisation of a sound source are interaural time difference (ITD) and interaural intensity difference (IID), which are processed by different pathways in the auditory system. ITD refers to the different points in time at which a sound from a single location arrives at each ear, and IID refers to the difference in sound pressure levels (SPL) of the sound at each ear, measured in decibels. Low frequency sounds are localised by the ITD binaural cue and high frequency sounds are processed using IID.

Low frequency sounds consist of wavelengths which are larger than the diameter of the head; thus each ear receives the sound wave at a different time. For example, a sound originating to the left of the head will arrive at the left ear before reaching the right ear. Low frequency sounds also have the ability to phase-lock, i.e. the resultant stimulus generated by the cochlea consists of spikes which occur at the period of the sound wave. Phase-locked spike trains form the basis for the extraction of the ITD from the stimulus at each ear. The phase-locked stimulus from each ear passes through each auditory nerve (AN), which preserves the temporal information within the stimulus, i.e. the phase-locking, and it is then routed to each anteroventral cochlear nucleus (AVCN). Cells of the AVCN, called bushy cells, maintain the phase-locked features of the stimuli, which continue upwards through the auditory pathway to the medial superior olive (MSO), a nucleus of the superior olivary complex (SOC). It is at the MSO that the low frequency stimulus from each ear is combined. A prevailing theory of low frequency sound localisation, [Jeffress, 1948], describes how the ipsilateral inputs (from the ear closest to the sound source) pass directly to the MSO neurons while the contralateral inputs (from the ear furthest from the sound source) pass through a graded series of delays. In this way, for a sound source at a particular angle to the head, only one delay will allow the ipsilateral and contralateral stimuli to match in time and the associated neuron to fire optimally.

High frequency sound waves have a wavelength that is smaller than the human head. This causes a shadowing effect on the sound wave approaching the ear furthest from the sound, producing a difference in intensity for the same sound at each ear. This intensity difference is then reflected in the frequency of the stimulus generated by each cochlea. Again, the high frequency sound stimuli from each ear pass through their own AN to the bushy cells of the cochlear nucleus. From the cochlear nucleus, the ipsilateral input travels directly to the lateral superior olive (LSO) located in the SOC. In contrast, the contralateral stimulus is routed to the medial nucleus of the trapezoid body (MNTB), also located in the SOC, where it is converted to an inhibitory stimulus before travelling to the LSO. Thus, neurons of the LSO receive an excitatory ipsilateral input and an inhibitory contralateral input, and the interaction of these two stimuli results in a neural form of subtraction. Put simply, the inhibitory stimulus detracts from the excitatory stimulus, and their combination causes neurons of the LSO to produce a discharge rate relating to the IID.

Sound localisation is currently present in many applications, from virtual reality to hearing aids, and the power and speed of mammalian sound localisation can only enhance these applications. The ability to model the ways in which mammals localise a sound source can:

• Allow for the development of better virtual realities, whereby sound appears more real by making it appear as if speech is coming from the individual characters or objects in a virtual reality world
• Increase the intelligent behaviour of robots and enable them to become more human-like in their navigation
• Improve group teleconferencing by providing realistic communication sensations
• Provide surveillance systems with omni-directional sensitivity to threats which occur out of the line-of-sight of cameras
• Improve the design of cinemas, opera houses and theatres by indicating where to place sound-reflective surfaces which can enhance the enjoyment of music, film and theatrical performances
• Enhance hearing aids by improving the localisation of individual sounds (current hearing aids have difficulty separating speech from background noise).

These enhancements, enabled by the ability to understand and model mammalian sound localisation, provide the rationale for the research outlined in this thesis; its specific objectives are set out in the next section.

1.1 Objectives of the Thesis

The aim of this research is the development of models, using spiking neurons, which process and extract the binaural cues of ITD and IID with topologies inspired by the mammalian auditory pathways described above. To that end, this research proposes spiking neural network (SNN) models which emulate the way in which mammals localise sound. SNNs are used in order to maintain biological realism: spiking neurons are the most biologically inspired type of computational neuron model. Biological neurons process and circulate information by electrochemical signalling using spikes, or action potentials, and spiking neurons can model this behaviour. Thus, topologies of spiking neurons can closely model a neural circuit; in the case of this research, the neural circuit is the auditory pathway. Before any architectures were developed, it was necessary to gain a thorough understanding of how the mammalian brain processes these binaural cues to produce an angle of location. This involved studying each part of the auditory pathway involved in sound localisation, from the cochlea to the auditory cortex. With this understanding, each model was then developed as an SNN incorporating key areas of the auditory pathways. The purpose of the models is to process and classify experimentally derived head-related transfer function (HRTF) acoustical data into angles of location with satisfactory accuracy. Thus, the chief objectives of the thesis can be summarised as follows:

• To perform a review of the literature regarding the mammalian auditory system. This is an important factor in this research: a thorough understanding of the mammalian auditory pathways is necessary in order to implement biologically inspired computational models which can process and extract binaural cues for sound localisation.
• To perform a review of the SNN literature required for the development of SNN models. In particular, it is necessary to understand, implement and choose the appropriate spiking neuron model and learning algorithm for the specific architectures developed in this research. Other features of SNNs, such as dynamic synapses, receptive fields and excitatory and inhibitory neurons, are reviewed and their applicability for implementation within the models determined.
• To carry out a review of the many different ways in which other researchers have tackled the problem of developing a sound localisation system, including purely computational, signal processing, artificial neural network (ANN) and SNN techniques. This provides the knowledge of the area needed to determine where a novel contribution can be made.
• To choose an auditory periphery (cochlea) model which can encode the experimental HRTF acoustical data used in this research into spike trains to be processed by the SNN. The auditory periphery model is required to produce spike trains which directly relate to the input data and from which the ITD and IID binaural cues can be extracted and used for classification into angles of location.
• To develop an SNN topology designed to process ITDs from low frequency sounds, which includes models of the cochlea, bushy cells and MSO.
• To design an SNN topology which processes high frequency sounds to decode the IID, composed of models of the cochlea, MNTB, LSO, receptive fields and facilitating synapses.
• To train these SNN models to classify experimentally derived HRTF acoustical data to angles of location.
• To analyse and evaluate the performance of both the ITD and IID models and to suggest future improvements with respect to the following points:
  – To determine the range of frequencies over which each SNN can perform sound localisation to a reasonable accuracy, thus establishing the need for a duplex sound localisation system in which the two binaural cues are processed differently and localise very different ranges of sounds. It is expected that the SNN developed to process the ITD cue will achieve high classification accuracies for low frequency sound data and low classification accuracies for high frequency data; conversely, the SNN developed to process the IID cue is expected to achieve high classification accuracies for high frequency data and low accuracies for low frequency data.
  – To contaminate the input HRTF data with differing levels of noise and hence test the degree of robustness of each SNN model.
  – To test the generalisation abilities of both the ITD and IID models, by determining how far removed from the sound frequency used to train an SNN model the localisation ability of the SNN can be maintained; i.e. if an SNN is trained with a 5 kHz sound frequency, how well will it perform sound localisation when tested with a much higher or lower sound frequency?
  – To propose a future work direction with regard to the localisation of complex sounds.

1.2 Thesis Contributions

The research outlined in this thesis represents a substantial contribution to the area of biologically inspired sound localisation modelling. The work has been peer reviewed in the form of two published conference papers, [Wall et al., 2007, 2008], and has contributed towards the submission of two journal papers, [Wall et al., 2009, Glackin et al., 2010]. The primary contributions of the thesis are:

• The integration of biological models within an SNN, involving an auditory periphery (cochlea) model, leaky integrate-and-fire (LIF) spiking neurons, facilitating synapses, receptive fields and the appropriate use of excitatory and inhibitory neurons.
• An SNN which can process and extract the ITD cue from low frequency experimentally derived HRTF sound data, enabling classification of the data into angles of location.
• An SNN which can process and extract the IID cue from high frequency experimentally derived HRTF sound data, enabling classification of the data into angles of location.
• The use of two biologically inspired supervised learning algorithms to train the two SNN models. The Remote Supervision Method (ReSuMe) was used to train the model based on the IID cue, while Supervised Hebbian Learning (SHL) was used for training the model based on the ITD cue. The rationale as to why each model needed a different learning algorithm is also highlighted.
• The use of experimentally derived acoustical HRTF data from adult domestic cats as input to two distinct SNNs which were trained to produce angles of location with satisfactory accuracies. Classification results for both the ITD and IID models compare well against similar work in the development of sound localisation systems.
• Experimental results which show that the two SNN models behave in a similar way to the mammalian auditory system, i.e. the SNN which extracts and processes the ITD cue performs best when localising low frequency data, and the SNN which extracts and processes the IID cue performs best when localising high frequency data. Thus, the combined models form a duplex system of sound localisation.
• The robustness of both SNN models to noise, i.e. the SNN models developed for each binaural cue have a high degree of robustness when the HRTF data is contaminated with noise.

1.3 Outline of the Thesis

The thesis is organised as follows:

Chapter 2 introduces and discusses how the two binaural cues of ITD and IID facilitate mammalian sound localisation. Key methodologies which underpin the research presented in this thesis are introduced, namely Jeffress' theoretical computational model of how ITD is used in mammals to determine the angle of origin of a sound signal, and Rayleigh's duplex theory of sound localisation. This is followed by an outline of those parts of the mammalian auditory system which process sound and binaural cues, from the outer ear to the auditory cortex.

Chapter 3 begins with a review of cochlea and auditory models, with particular attention paid to the auditory periphery model used in this work. This is followed by a discussion of the ANN and SNN literature which outlines the merits of the different neuron models, learning algorithms and architectures. Additionally, dynamic synapses and receptive fields are reviewed. This review in particular was necessary in order to fully understand, and thus implement, the appropriate SNN architectures that utilise the binaural cues for sound localisation. Finally, the state of the art across the many different computational techniques employed in sound localisation models is examined. Over the last twenty years, research in this area has involved the use of many different techniques including geometry, cross-correlation, signal processing, probability, statistics, ANNs, fuzzy neural networks and SNNs. These techniques range from pure computational modelling to more biologically inspired approaches.

Chapter 4 outlines the development of a biologically inspired SNN architecture to model the way the binaural cue of ITD is processed by the auditory pathways. The major inspiration for this architecture is Jeffress' theoretical computational model of interaural time-based sound localisation, Jeffress [1948]. The chapter outlines initial work which investigates the modelling of the Jeffress architecture using an SNN. This early work involves a simulated data set which is encoded into single spikes and trained using the spike-timing-dependent plasticity (STDP) learning algorithm. This work is then extended to form a multi-layered SNN using experimentally derived HRTF acoustical data as input in the form of spike trains. The SNN architecture consists of an auditory periphery model to encode the input data, LIF neurons, models of bushy cells from the AVCN and a delay line structure to represent the MSO. The SHL learning algorithm is used to train the network to enable classification of the low frequency HRTF acoustical data into angles of location.

Chapter 5 outlines the development of another biologically inspired SNN architecture, this time addressing the other binaural cue, IID. Again, similar to Chapter 4, initial work is outlined which consists of the development of a single neuron model which can process IIDs in a similar fashion to the LSO, using simulated data in the form of spike trains. This initial work is extended to a multi-layered SNN consisting of an auditory periphery model, LIF neurons to represent the MNTB and the LSO, facilitating synapses and receptive fields. The ReSuMe learning algorithm is used to train the network to classify high-frequency experimentally derived HRTF acoustical data, encoded into spike trains, to angles of location.

Chapter 6 presents an analysis of the capabilities of the two SNN models developed to process the binaural cues, ITD and IID, in order to localise experimentally derived HRTF data to angles of location. Sound localisation experiments are outlined which involve the processing of the full range of sound frequencies available. These experiments confirm the need for two different binaural cues and thus a duplex system of sound localisation. Additionally, the classification accuracies for each individual angle across the entire sound frequency range are reported. Experiments are outlined to test the degree of robustness which both SNN models display in the presence of differing levels of noise embedded in the HRTF input data. Furthermore, the generalisation abilities of both SNN models are outlined when testing is performed on unseen data from sound frequencies not presented during training.

Chapter 7 presents a conclusion to the thesis in which the results and analysis of the two independent architectures and the duplex system are summarised and discussed. The chapter compares and contrasts the work presented in this thesis with closely related work in the literature, thus emphasising the contributions made by this research and how they advance the field. Finally, future research directions which would enhance the biological plausibility of this research are proposed, in particular the use of inhibition to enhance the decision-making algorithm in the integrated duplex system.

Chapter 2

Sound Localisation in the Mammalian Auditory System

2.1 Introduction

Of all the organs in the body, there are few that can compare to the ear with regard to the degree of functionality it contains within such a small and compressed space. Sound localisation is one function that the ears and auditory pathways perform, defined as determining where a sound signal is generated in relation to the position of the head. It is a powerful aspect of mammalian perception, allowing an awareness of the environment and permitting mammals to locate prey, potential mates and predators, [McAlpine and Grothe, 2003]. The neural components of sound localisation are complex, as the location of a stimulus can only be determined by combining input from both ears, [Yin, 2002]. There is a considerable amount of knowledge as to how these neural components achieve sound localisation within the mammalian brain. In fact, more is known about how we perceive the angle of origin of a sound source than about any other feature of the auditory system, e.g. pitch perception or vowel discrimination, [Yin, 2002]. Nowadays sound localisation is not as important for humans as it once was; however, the same auditory machinery serves another purpose, commonly known as the cocktail-party effect. The features of the auditory system which localise sound accurately also serve to discern sounds in a noisy environment.

Figure 2.1: Auditory pathways for LSO and MSO processing. Figure adapted with permission from [Joris et al., 1990].

This chapter will introduce the two binaural cues which the brain uses to localise sound in the horizontal plane, ITD and IID. Following that, each area of the auditory system involved in the processing of these cues is described, from the outer ear to the auditory cortex. Special attention is paid to those parts which are solely involved in the localisation of sound sources.

Sound localisation can be defined as the mammalian perception which enables one to determine the point of origin of a sound source, expressed in terms of angles. In humans, sound localisation depends on binaural cues, which are extracted from the sound signal at each ear and compared to each other to determine from which direction the sound is travelling. They are the dominant cues for azimuth (horizontal) angle estimation but cannot be used to determine whether the sound source is above or below the listener. Sounds in the vertical plane are localised by means of the physical shape of the outer ear, particularly the pinna, the head and the shoulders; these features create what are called spectral cues. A comprehensive review of vertical sound localisation can be found in [Young and Davis, 2001]. The azimuth θ is an angle on the horizontal plane, which can be imagined as a straight line passing through the centre of the head with 0° in front, -90° to the left and +90° to the right of the head. The two binaural cues which play the most dominant role in sound localisation are the ITD, processed in the MSO of the auditory system, and the IID, processed in the LSO. Their combination is better known as the duplex theory of sound localisation, first devised by Thompson and Rayleigh around the late 19th and early 20th century, [Thompson, 1882, Rayleigh, 1875-1876, 1907].

Figure 2.2: Low frequency sound wave approaching each ear

Figure 2.1 outlines the mammalian auditory pathways involved in sound localisation. ITD processing involves the cochlea, AN, cochlear nucleus and MSO. IID processing also involves the cochlea, AN and cochlear nucleus, but the stimulus then passes to the MNTB and on to the LSO. These areas of the auditory system will be discussed in greater detail in later sections of this chapter. Robust sound localisation requires a combination of these two cues to localise across all sound frequencies.

Rayleigh carried out many experiments while researching how humans localise sounds. One of his first experiments involved placing a listener with their eyes closed in the middle of a lawn. The listener was then surrounded by several other people who moved around him speaking periodically. With a little practice the listener was able to point in the direction of the currently speaking person with considerable accuracy, using sounds from a single word down to a single vowel. It did not matter whether the speaker was in front, behind, to the left or to the right of the listener. However, when the speaker emitted unnatural sounds such as squeaks or grunts, the accuracy of localisation greatly decreased. This led Rayleigh to believe that the complex sound of human speech made it easier for the listener to localise, possibly due to our great familiarity and daily practice with it. It was this and many more experiments which led Rayleigh to understand why the two binaural cues were needed, and which led to the development of his duplex theory. For further information about these early experiments on sound localisation see [Rayleigh, 1875-1876].

2.1.1 Interaural Time Difference

ITD can be defined as the small difference in arrival times between a sound signal reaching each individual ear, [Lewicki, 2006]. From this arrival time difference, the brain can calculate the angle of the sound source in relation to the head, [Carr, 1993, Grothe, 2003]. The ITD cue works most effectively for sounds from ∼200 Hz to about 1.5 kHz in humans, since at these frequencies the sound wavelengths are wide and sound intensity is not discernibly weakened by the size of the head, [Burger and Rubel, 2008]. Low frequency sound waves have a wavelength that is greater than the diameter of the head; therefore each ear receives the sound wave at a different point in time, see Figure 2.2. For example, if a sound signal originates to the extreme left of the head (-90°), it will reach the left ear first and, after a time delay specific to the azimuthal angle of the sound source, it will then reach the right ear, generating the ITD.

Figure 2.3: Out-of-phase waves

ITDs occur both at the onset of the sound and throughout its duration; these are known as onset ITDs and ongoing ITDs respectively, [Joris and Yin, 2007]. The ITDs in continuous and periodic sounds produce interaural phase differences (IPD), i.e. differences in the phase of the sound wave approaching each ear, see Figure 2.3. The fibers of the AN which respond best to low frequencies produce spike trains which are time-locked to the signal's sine curve, meaning the interval between spikes is the period of the curve or a multiple of that period. This feature of the AN is called phase-locking; it also occurs in bushy cells of the cochlear nucleus (discussed later in this chapter) and can only occur at low frequencies, [Yin, 2002]. It occurs at both the signal onset and during the ongoing signal, and is important in sound localisation for extracting the ITD from the sound arriving at each ear, [Smith et al., 1998, D'Angelo et al., 1999, Grothe and Park, 2000, Grothe, 2003, Ryugo and Parks, 2003]. Each individual AN fiber phase-locks to the input stimulus; however, a single fiber cannot phase-lock to every cycle of the stimulus, only to a subset of the wave's cycles.
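The following minimal sketch (illustrative code written for this section, not taken from the thesis; the firing probability and jitter values are assumptions) shows one way such a phase-locked spike train can be generated for a pure tone: a spike is emitted near a fixed phase of each cycle, but only on a random subset of cycles, mimicking a single AN fiber as described above.

```python
import numpy as np

def phase_locked_spikes(freq_hz, duration_s, p_fire=0.3, jitter_s=2e-4, seed=0):
    """Toy model of a single AN fibre phase-locking to a pure tone.

    A spike is placed at a fixed phase of each stimulus cycle, but only on a
    random subset of cycles (p_fire), with a small Gaussian timing jitter, so
    inter-spike intervals are (noisy) multiples of the stimulus period.
    Returns spike times in seconds.
    """
    rng = np.random.default_rng(seed)
    period = 1.0 / freq_hz
    cycle_starts = np.arange(0.0, duration_s, period)   # one candidate per cycle
    fires = rng.random(cycle_starts.size) < p_fire      # fibre skips most cycles
    times = cycle_starts[fires] + rng.normal(0.0, jitter_s, fires.sum())
    return np.sort(times[(times >= 0) & (times < duration_s)])

# A population of such fibres, pooled together, covers nearly every cycle:
# this is the volley principle discussed next.
population = np.sort(np.concatenate(
    [phase_locked_spikes(500.0, 0.05, seed=s) for s in range(10)]))
```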

The entire population of AN fibers for each ear produces a combined discharge which is a temporal representation of the input stimulus.

Figure 2.4: Volley theory of hearing. Individual fibers of the AN phase-lock to a portion of the input stimulus; their combination produces an output which is a temporal representation of the input stimulus.

This combined discharge, which relays the temporal information of the input stimulus, is known as the Volley Theory of Hearing, see Figure 2.4, [Wever, 1949]. The stimuli at each ear, which differ in phase, cause the AN fibers to produce spike trains which also exhibit a phase difference. The MSO combines these spike trains from each ear; the ipsilateral inputs are delivered directly while the contralateral inputs pass through a graded series of delays. For a sound source at a particular angle to the listener, only one particular delay will allow the ipsilateral and contralateral inputs to match, i.e. the originally out-of-phase phase-locked spike trains will now come into phase. Each ITD value depends on the distance between the two ears of the listener, the speed of sound and the angle from which the sound originated, [Burger and Rubel, 2008]. Rayleigh devised a simple formula which can calculate the ITD for each azimuthal angle, [Alim and Farag, 2000]. He considered a sound wave travelling at the speed of sound c, 343 m/s, which makes contact with a spherical head of radius r from a direction at an angle θ. The sound arrives at the first ear and then has to travel the extra distance rθ + r sin θ to reach the other ear, see Figure 2.5. Dividing that distance by the speed of sound gives the simple formula for the ITD:

ITD = (r/c)(θ + sin θ),  −π/2 ≤ θ ≤ π/2    (2.1)
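For illustration, Equation 2.1 is easy to evaluate numerically. The sketch below is a minimal implementation written for this section (the head radius of 8.75 cm is an assumed, typical human value, not a parameter taken from the thesis):

```python
import math

def rayleigh_itd(theta_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Equation 2.1: ITD = (r/c) * (theta + sin(theta)), theta in [-90, 90] deg.

    Returns the interaural time difference in microseconds.
    """
    theta = math.radians(theta_deg)
    if not -math.pi / 2 <= theta <= math.pi / 2:
        raise ValueError("azimuth must lie between -90 and +90 degrees")
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta)) * 1e6

for angle in (0, 30, 60, 90):
    print(f"{angle:3d} deg -> {rayleigh_itd(angle):6.1f} us")
# 0 deg gives 0 us; 90 deg gives roughly 656 us, the same order as the
# ~700 us human maximum quoted below.
```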

Delays can range from 0 µsec for a sound directly in front of or behind the head (azimuthal angle 0°) to about 700 µsec at an angle of ±90° in humans; small mammals have a maximum ITD of about 130 µsec, [Burger and Rubel, 2008]. Humans can locate sources to an accuracy of a few degrees; we can therefore measure ITDs with an accuracy of ∼10 µsec. At higher frequencies, as the wavelength of the sound becomes similar to or smaller than the diameter of the human head, this time delay between the sound arriving at the two ears cannot be distinguished, and so the other binaural cue, IID, is used for localisation. Complex sounds contain both low and high frequencies; however, it is the ITDs of the low frequencies which provide the dominant azimuthal information for sound localisation, since the AN can phase-lock to the low frequency components of the sound, [Yin, 2002].

In 1948, Jeffress created a theoretical computational model to show how ITD works in mammals to determine the angle of origin of a sound signal, [Jeffress, 1948], see Figure 2.6. This is one of the earliest and most durable models of binaural hearing and is used to this day as a basis for binaural hearing research, which is remarkable considering how little was known at the time about the structure of the auditory system. The model involves three distinct theories:

1. The inputs to the binaural cells are phase-locked and thus retain accurate timing information; low frequency sounds are phase-locked by the AN and bushy cells before reaching the MSO.
2. A set of delay lines varies the axonal path lengths arriving at each neuron.
3. An array of coincidence detector neurons fires maximally when presented with coincident inputs from both ears. These coincident inputs only occur when the ITD is exactly compensated for by the delay lines. The neurons are organised tonotopically, i.e. in a spatial organisation by frequency (a simple computational sketch of this coincidence-detection scheme follows below).
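The sketch below is a minimal illustration of the coincidence-detection idea, written for this section rather than taken from the thesis; the tolerance window and tone parameters are illustrative assumptions. Each candidate internal delay is applied to one spike train, and the delay that brings the two phase-locked trains into best agreement reads out the stimulus ITD.

```python
import numpy as np

def best_delay(direct, delayed_input, candidate_delays_us, tol_us=50.0):
    """Toy Jeffress array: one 'coincidence neuron' per candidate delay.

    direct:        spike times (us) that reach the array undelayed.
    delayed_input: spike times (us) routed through the graded delay lines.
    The neuron whose delay line best re-aligns the two phase-locked trains
    fires most; its delay identifies the stimulus ITD.
    """
    coincidences = []
    for d in candidate_delays_us:
        shifted = delayed_input + d
        # count direct-path spikes with a shifted spike within the window
        hits = sum(np.min(np.abs(shifted - t)) <= tol_us for t in direct)
        coincidences.append(hits)
    return candidate_delays_us[int(np.argmax(coincidences))]

# 500 Hz tone (period 2000 us); the direct train lags the delayed-path train
# by an external ITD of 400 us, so a 400 us internal delay re-aligns them.
delayed_path = np.arange(0.0, 20000.0, 2000.0)
direct_path = delayed_path + 400.0
delays = np.arange(0.0, 701.0, 100.0)
print(best_delay(direct_path, delayed_path, delays))   # -> 400.0
```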

The fundamental importance of Jeffress' model, and why it has become the prevailing model of binaural sound localisation, is its ability to depict auditory space with a neural representation in the form of a topological map, even though Jeffress himself acknowledged the simplicity of his model. Up to the 1980s this model remained hypothetical, until evidence was found which showed that the nucleus laminaris of the barn owl (homologous to the MSO in mammals) works in the same manner, [Carr and Konishi, 1990].

Figure 2.5: Geometric view of Rayleigh's simple formula for determining the ITD, adapted from [Rumsey, 2001]

Figure 2.6: The Jeffress model of ITD-based binaural hearing

One of the earliest studies of the MSO was that of Goldberg and Brown in 1969, which showed that MSO neurons were most responsive to low frequencies and extremely sensitive to ITDs, [Burger and Rubel, 2008]. Also of significance was their finding that the spike output of MSO neurons varied with ITD, affirming them to be among the most temporally sensitive neurons in the nervous system. They also showed that different neurons of the MSO were most sensitive to a particular ITD, called their best ITD, which depended on the time delay of their inputs, i.e. neurons fired maximally only when their inputs passed through a delay which allowed them to arrive in coincidence at the neuron. Consequently, Goldberg and Brown gave weight to Jeffress' simple model for processing ITDs over twenty years later. Many other researchers continued in this vein, producing findings which supported and augmented Goldberg and Brown's work, [Moore, 2000, Fitzpatrick et al., 2002, Bazwinsky et al., 2003, Kulesza, 2007, Burger and Rubel, 2008].

For many years it was thought that small, high-frequency hearing mammals which generate very small ITDs, such as bats, contained no MSOs. A researcher in 1926, Poljak, had described an MSO region in two different bat species, but his findings were either forgotten or ignored, [Poljak, 1926]. In the early 1980s this belief was revisited, and it is now understood that bat MSOs are arranged tonotopically and that their neurons respond to binaural inputs in a similar manner to other mammals. However, it is also believed that the MSO is used for other purposes, such as enhancing temporal information in the stimulus at an early stage, and not just as a detector for ITDs. This is probably the reason for the original denial of the existence of the MSO in the bat, [Grothe and Park, 2000, Grothe and Neuweiler, 2000]. In 1990, Yin and Chan studied cells in the MSO of the cat for responsiveness to changes in IPDs of low-frequency tones and in ITDs of tones and broadband noise signals, [Yin and Chan, 1990]. They showed evidence for the cells working as coincidence detectors and also provided the first evidence for a tonotopic organisation in a mammalian MSO. Recent studies of ITD processing show that another factor is involved, that of synaptic inhibition: findings show that finely tuned temporal inhibition adjusts the sensitivity of coincidence detector neurons to the range of ITDs, and this is a current focus of auditory research, [Grothe, 2003, McAlpine and Grothe, 2003, Joris and Yin, 2007]. For further information on the auditory circuit for ITDs in birds, see [Olsen et al., 1989, Carr and Konishi, 1990, Carr, 1993, Konishi, 2000, 2003, Burger and Rubel, 2008].

2.1.2 Interaural Intensity Differences

IID can be defined as the difference in SPL of the sound signal between the two ears at a particular frequency, measured in decibels, [Hartmann, 1999, Tollin, 2003]. For high frequency sound waves, which have a wavelength similar to or smaller than the diameter of the human head, a shadowing effect occurs on the sound wave approaching the ear furthest from the sound, as shown in Figure 2.7, [Willert et al., 2006]. This shadowing of the sound wave produces a difference of intensity between the two sound signals at each ear, i.e. the head acts as a lowpass filter, causing IIDs of up to 20 dB. Processing in the LSO involves taking as input the two sound signals in the form of a neural stimulus from each ear. The stimulus closest to the sound takes an excitatory form and the other is inhibitory. Put simply, the interaction between the two stimuli works as a neural form of subtraction which produces an output relating to the IID, [Tollin, 2003]. This will be discussed further in a later section.

Figure 2.7: Head shadow causes intensity difference for high frequency sounds

Figure 2.8 shows HRTFs for the left and right ears corresponding to a sound source originating from three different locations, at -90°, 0° and +90° respectively. When the sound originates from ±90°, the gain values are significantly different for all frequencies, and the IID, the difference between the gains, is clear to see. However, when the sound originates from 0° (in front of the head), the gain values are similar and it is therefore difficult to distinguish an IID. This is due to the sound reaching both ears at the same time without being attenuated by the head shadow. The lowest sound frequencies are found at the beginning of each plot, where the IIDs are at their smallest. This is to be expected, as at these low frequencies the IID cue is not reliable and the other cue, ITD, is fully functional. Another feature that can contribute to the IID is the amplification ability of the pinna (part of the outer ear, discussed in a later section) closest to the sound source. This characteristic has been found in adult cats and can amplify the SPL at that ear by up to 25 dB, [Tollin, 2003]. However, for most frequencies it is the sound shadow generated by the head which causes the differing SPLs. High frequency sound sources can be localised with this method; however, azimuthal localisation at the crossover region between using ITDs and IIDs, 1.8 kHz ≤ f ≤ 4.2 kHz, is not very reliable, [Zhou, 2002]. Furthermore, similar to the ITD cue, localisation is performed best when the sound is directly in front of the head. Any ambiguity in determining the angle of location can be removed by turning the head, thus improving the chances of the correct angle being localised.

Figure 2.8: HRTFs for the left and right ears from three different angles over all frequencies

IIDs have a complex relationship with the azimuth of the sound, as can be seen in Figure 2.9. The IIDs for all frequencies, bar a very few low frequencies, vary with azimuth as predicted: they are lowest at 0° and 180° and highest at about 90°. This is as expected: if the sound is directly to the left of the head at -90°, then the right ear will be in the head shadow and IIDs will be at their greatest; this can be seen for all frequencies in Figures 2.9 a, b and c. Figures 2.8 and 2.9 were created using the experimentally derived acoustical HRTF data provided by [Tollin, 2004, 2008, Tollin et al., 2008], which is used throughout this thesis. For details on how this data was generated from adult domestic cats see [Tollin and Koka, 2009]. This experimentally generated biological data consists of a set of HRTFs for a given sound wave frequency for both the left and right ears at a specific azimuthal angle. The data describes the filtering of a sound before it reaches the cochlea, after the diffraction and reflection properties of the head, pinna and torso have affected it.
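To illustrate how an IID is read off such HRTF measurements, the sketch below (written for this section) subtracts the right-ear gain from the left-ear gain at matching frequencies; the gain values are invented for illustration and are not values from the Tollin data set.

```python
import numpy as np

def iid_db(gain_left_db, gain_right_db):
    """IID at each frequency: difference of the two ears' HRTF gains in dB.

    Because the gains are already logarithmic, a simple subtraction gives the
    interaural intensity difference (positive = louder at the left ear).
    """
    return np.asarray(gain_left_db) - np.asarray(gain_right_db)

# Hypothetical gains at a handful of frequencies for a source at -90 deg:
freqs_khz = np.array([0.5, 2.0, 8.0, 16.0])
left_db = np.array([2.0, 5.0, 12.0, 18.0])    # near ear, pinna gain
right_db = np.array([1.5, 2.0, -3.0, -6.0])   # far ear, in the head shadow
print(dict(zip(freqs_khz, iid_db(left_db, right_db))))
# IIDs grow with frequency: small below the ~1.8-4.2 kHz crossover region
# and large (tens of dB) at high frequencies, as described above.
```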

Figure 2.9: IIDs as a function of azimuthal angles. (a) Frequencies less than 10 kHz; (b) frequencies between 10 kHz and 20 kHz; (c) frequencies greater than 20 kHz.

Figure 2.10: Anatomy of the human ear, from [Brockmann, 2009]

2.2 Mammalian Auditory Pathway for Sound Localisation The human ear (Figure 2.10) consists of three components, the outer, middle and inner ear. The outer ear sometimes called the

external ear

consists of the

(auricle), the funnel-like innermore part (concha) and the ear canal (external auditory meatus), [ASH, 2006]. Outer ears resemble a narrow tube pinna

which decreases in size; the problem with this shape is that sounds below 1-2 kHz are transmitted less eectively. The pinna bends and focuses sound waves and also determines the direction and source of sound. The ear canal is about 3cm long and

½cm

in diameter extending from the pinna to the ear

drum and functions as a resonator to increase the volume of incoming sound, [Pujol et al., 1999, Ashmore, 2002]. The middle ear consists of the eardrum

(tympanic membrane),

ossicular

chain, Eustachian tube, and oval and round window. The Eustachian tube supplies ventilation and the equalisation of air pressure on both sides of the eardrum which separates the outer and middle ear, [ASH, 2006]. The eardrum is connected to the hammer, hammer to anvil and anvil to stirrup; vibrations move through these three bones causing the base of the stirrup

(stapes footplate)

to move in-and-out. This footplate ts into the oval win-

dow, allowing the now mechanical wave access to the inner ear. The surface

22

ratio of eardrum to oval window is 20:1; permitting a sucient energy transfer of the sound pressure between the air and the uids of the inner ear. The middle ear functions as an impedence adapter and if not present about 98% of energy would be reected out before entering the inner ear, [Pujol et al., 1999]. The inner ear organs are contained in a large uid lled chamber called the vestibulus; consisting of the cochlea, the organ of hearing, and the vestibule, the organ of equilibrium; all located within the temporal bone, [Ashmore, 2002].

2.2.1 Cochlea

The cochlea is a fluid-filled coiled tube with a bony structure; its name comes from its resemblance to a snail (the Latin for snail is cochlea), [Ashmore, 2002, Werner, 2007]. The cochlea is closed at one end, known as the apical cochlea, and open at the other, the basal cochlea. The open end contains the oval and round windows. These are flexible membranes; the stapes footplate sits on top of the oval window, and the function of the round window is to act as a pressure valve allowing the fluids to be in motion, [Weedman Molavi, 1997]. The cochlea is divided into three channels or scalae (media, vestibuli and tympani) by two membranes (Reissner's and basilar). The scala media (cochlear duct) contains a fluid called endolymph, while the scalae vestibuli and tympani surrounding the scala media contain another fluid called perilymph. The modiolus is the central, conical bony core of the cochlea holding the spiral ganglion (ganglion of Corti), an elongated group of bipolar cell bodies forming a nerve centre in the cochlea, [Stedman, 2004]. At the top of the modiolus the fluids within the scalae vestibuli and tympani communicate through a foramen (an opening or orifice in the bone) called the helicotrema, enabling equal pressures between the two scalae, [Mammano and Nobili, 2005]. Reissner's membrane separates the scala media from the scala vestibuli and the basilar membrane divides it from the scala tympani, [van Hengel, 1996].

2.2.2 Basilar Membrane

The basilar membrane is the main component of the cochlea. It is a non-cellular membrane comprised of radially-oriented collagen fibres and acts as a structural support for the sensory cells of the inner ear, [Ashmore, 2002]. It changes in width from lean and slender at the base of the cochlea to wider at the apex. It vibrates to high frequencies at the base and to low frequencies at the apex, forming what is known as a tonotopic map, a spatial arrangement of sound frequency. Figure 2.11 shows the different frequencies detected at each section of the basilar membrane. This sensitivity to frequency, or tonotopic organisation, continues to be a feature of those elements of the auditory system which are involved in the process of sound localisation, see Figure 2.12.

Figure 2.11: Mapping of frequencies in the basilar membrane, from [Ashmore, 2002]

2.2.3 Organ of Corti

The organ of Corti is the sensory receptor and is located in the scala media. It holds the hair cells, the nerve receptors for hearing which convert mechanical stimuli (sound energy) to nerve signals, and pillar and Deiter's cells (support cells), [ASH, 2006, Werner, 2007, Kandler et al., 2009]. The tectorial membrane and cuticular plate enclose the top of the organ of Corti, which sits on the basilar membrane. The organ was discovered by the Italian anatomist Corti in the 19th century.

2.2.4 Hair Cells

There are two types of hair cells, inner hair cells (IHC) and outer hair cells (OHC), with as many as 4000 IHCs and 12000 OHCs; IHCs are the auditory receptors and OHCs, also known as cochlear amplifiers, help to fine tune the cochlea by amplifying the membrane vibration, [Weedman Molavi, 1997]. Different hair cells respond to different sound frequencies; depending on the frequency of a particular sound, only certain hair cells are stimulated, [ASH, 2006]. The characteristic frequency (CF), the particular frequency to which a hair cell is most sensitive, of adjacent hair cells differs by only 0.2%, in comparison to adjacent piano strings which differ by 6%. IHCs are the main source of the afferent (carrying nerve impulses towards the central nervous system) signals to the AN, while the OHCs primarily receive efferent (carrying nerve impulses away from the central nervous system) inputs. IHCs (Figure 2.13a) consist of a nucleus, mitochondria, underdeveloped cisternae, stereocilia, a lateral plasma membrane and a synaptic complex; a synaptic complex is composed of an afferent synapse between the IHC and an afferent AN bouton and an efferent synapse between the IHC and the efferent AN bouton. Usually, one IHC is innervated by ten synaptic complexes, [Pujol et al., 1999, Mammano and Nobili, 2005].

Figure 2.12: Spatial organisation by frequency occurring in elements of the auditory system which achieve sound localisation, from [Kandler et al., 2009]

OHCs (Figure 2.13b) are very similar to IHCs; however, they have a cylindrical shape and contain a calcium mass called Hensen's body (a rounded modified Golgi net containing a calcium store), [Dorland et al., 2003, Mammano and Nobili, 2005]. Stereocilia are hair-like extensions which jut out of the end of the hair cells into the cochlear fluid, ordered in v-shaped rows. Each row of stereocilia is taller than the previous row and the tip of each is linked to the side of the stereocilium in the next row by a tip link, [Werner, 2007]. IHC stereocilia are free to move in the cochlear fluid, while OHC stereocilia are embedded in the tectorial membrane. Acoustic sound waves travel through the ear canal, ossicular chain and oval window, and into the cochlear fluids. The movement of the fluids causes the basilar membrane to shift up and down, causing the stereocilia to be shorn back and forth against the tectorial membrane. If the stereocilia are moved in the right direction, the hair cell depolarizes, i.e. the membrane potential changes to a positive potential. This depolarization occurs as mechanically gated ion channels are opened when the stereocilia are moved; once open, positively charged ions can enter the cell from the endolymph in the cochlear duct. A receptor potential occurs, causing voltage-gated calcium channels to open and allowing calcium ions to enter the cell. Neurotransmitters (glutamate) are released at the base of the cell and move across the space between the hair cell and the nerve terminal, attach themselves to the receptors and trigger action potentials in the nerve, creating the signal. The signal is carried to a nerve process under the organ of Corti and this neuron transmits the signal to the AN, where it is carried upwards through the auditory pathway. The auditory pathway begins at the AN and travels through the cochlear nucleus, SOC, lateral lemniscus (LL), inferior colliculus (IC) and medial geniculate nucleus, and finally ends at the auditory cortex.

Figure 2.13: Inner and outer hair cells: (a) composition of an IHC; (b) composition of an OHC, from [Pujol et al., 1999]

2.2.5 Auditory Nerve

The AN, also called the cochlear nerve (cranial nerve VIII), carries auditory signals from the hair cells to the cochlear nucleus in the midbrain, [ASH, 2006]. It links to the brainstem at the ponto-medullary junction and joins the ipsilateral cochlear nucleus, whereby the axons bifurcate, with ascending branches entering the AVCN and descending branches going to the posteroventral cochlear nucleus (PVCN) and dorsal cochlear nucleus (DCN). Each fiber of the AN connects to a single hair cell and responds to a single excitatory best frequency, i.e. each fiber is tuned to a specific CF. All of the CFs together cover the entire range of audible frequencies. Processing carried out by the AN involves conveying information about the frequency, intensity and phase of the sound signal, improving the signal and protecting it from noise, and preserving temporal information (phase-locking) so that ITDs can be processed further up the pathway. The AN is a combination of about 30,000 nerve fibers carrying both afferent and efferent information. There are two types of afferent fibers, type I and type II. Type I fibers are myelinated, bipolar and large in size. They carry information from the IHCs and make up 90-95% of the total AN fibers. Type II fibers are unmyelinated, pseudobipolar and smaller in size than the type I fibers. They carry information from the OHCs and make up only 5-10% of the total AN fibers. Due to their large size and myelination, type I fibers convey information 1-2 microseconds faster than type II fibers.

Figure 2.14: Cochlear nucleus of the cat, from [Wittig Jr, 2004]

2.2.6 Cochlear Nucleus

The cochlear nucleus (Figure 2.14) receives only monaural inputs (from one ear only) while distributing auditory information to several different areas in the auditory pathway, and carries out processing of both time and frequency information. This is where parallel processing begins in the auditory system. It consists of the AVCN, the PVCN and the DCN. These three separate regions are defined by the different morphologic cell types present and the structures to which the regions connect. Within the cochlear nucleus, features of the signal are processed simultaneously by the different cell types, each reacting differently to individual features of the sound, [Ryugo and Parks, 2003].

The AVCN, the starting point of the binaural brainstem pathway, has a simple tonotopicity and a straightforward relay of frequencies to higher centres in the auditory pathway. This pathway is believed to be concerned with spatial localisation and processing which requires binaural auditory signals. The AN terminates here as a large axosomatic ending, namely the endbulb of Held. The AVCN transfers low frequency sounds via the MNTB to the MSO both ipsilaterally and contralaterally (on the other side); high frequency sounds to the LSO, both directly and ipsilaterally via the MNTB; and outputs to the IC contralaterally. It consists of stellate and bushy cells, [Ryugo and Parks, 2003]; cell types will be discussed in the next section.

The intermediate brainstem pathway starts with the PVCN, which has a pattern of tonotopicity that lies between the two extremes of the AVCN and DCN. Its functionality is for the most part unknown. The PVCN outputs to the periolivary nuclei both contralaterally and ipsilaterally; to the LL and its nuclei contralaterally and ipsilaterally; and to the IC contralaterally. It consists almost entirely of octopus cells but does have some bushy and spindle cells, [Ryugo and Parks, 2003].

The monaural contralateral pathway begins at the DCN, which is a three-layered structure and contains interneurons (neither purely sensory nor motor, but connecting neurons to other neurons). The three layers are the outer molecular layer, a granular layer and a deep layer or central region. It conveys information contralaterally, and carries a substantial amount of information down through the auditory pathway. This descending input from the LL and IC is believed to create the unique responses of cells in the DCN; it has more complex response patterns than the AVCN. Its output frequencies, mostly high frequencies, go to the LL via the dorsal acoustic stria and terminate in the contralateral IC. The DCN not only transfers information but carries out frequency analysis, with its main role being to estimate sound elevation. It consists of bushy, spindle, octopus and pyramidal cells, [Ryugo and Parks, 2003].

2.2.6.1 Cell types

Figure 2.15: Cell types of the cochlear nucleus: (a) stellate cell; (b) bushy cell; (c) octopus cell; (d) pyramidal cell

Figure 2.16: Response types of cochlear nucleus cells: (a) chopper response type of stellate cell; (b) pauser response type of pyramidal cell; (c) primary-like response type of spherical bushy cell; (d) primary-like with notch response type of globular bushy cell; (e) onset response type of octopus cell

Stellate cells (Figure 2.15a) are also known as multipolar, chopper or spindle cells, and have a star-like facade with sharp pointed dendritic trees, [Wittig Jr, 2004]. Their soma can be triangular or polygonal-shaped, with dendrites protruding from each pole, [Feng and Lin, 1996]. Their nucleus is oval-shaped and found randomly within the cytoplasm, with the nucleolus occupying the centre of the nucleus, [Feng and Lin, 1996]. Each stellate cell responds to input from the AN at a different CF, and together the cells can encode all the frequencies existing within a sound, [Hudspeth, 1991]. There are two response types for stellate cells. The first is the onset chopper; these neurons have a soma completely surrounded by synaptic contacts. They are usually found in the PVCN and sometimes the AVCN, [Delgutte and Oxenham, 2005]. Chopping cells completely reset after spiking and enter a refractory period, during which another spike cannot be generated. When this phase is over, input spikes are integrated over a certain period of time until the threshold is reached and the neuron fires again. This behaviour gives a reasonably constant firing rate; however, if the sustained inputs are too small to reach the firing threshold, the cell will only chop at the auditory onset, see Figure 2.16a, [Van Schaik et al., 1996]. The neuron's chopping rate rises with the intensity of the input spikes, and it is thus thought to be involved with coding the intensity of sound. Outputs are inhibitory and keep other neurons working within their dynamic range, [Van Schaik et al., 1996]. The other response type of a stellate cell is the sustained chopper. This neuron has only a few synapses on its soma, receiving the majority of its inputs through dendrites. These dendrites act as a low-pass filter on incoming spikes, causing the membrane potential to rise smoothly and giving a regular spike pattern. Sustained chopper cells receive inhibitory inputs from onset chopper stellate cells, ensuring they work within their dynamic range. Their functionality involves encoding the intensity of the sound, the extraction of the amplitude modulation frequency and the extraction of the pitch of a speech signal, [Van Schaik et al., 1996].
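To make the integrate, reset and refractory behaviour described above concrete, the following is a minimal leaky integrate-and-fire sketch (illustrative code, not the neuron model developed in this thesis); with a sustained suprathreshold drive it chops at a reasonably constant rate, and the rate rises with the input intensity. All parameter values are arbitrary.

```python
import numpy as np

def chopper_response(input_current, dt=1e-4, tau=5e-3,
                     threshold=1.0, refractory=1e-3):
    """Leaky integrate-and-fire sketch of chopping behaviour.

    input_current: per-time-step input drive (arbitrary units).
    Returns spike times in seconds. All values are illustrative.
    """
    v, spikes, t_last = 0.0, [], -np.inf
    for i, drive in enumerate(input_current):
        t = i * dt
        if t - t_last < refractory:          # absolute refractory period
            continue
        v += dt * (-v / tau + drive)         # leaky integration of input
        if v >= threshold:                   # threshold reached: fire
            spikes.append(t)
            v = 0.0                          # complete reset after spiking
            t_last = t
    return spikes

# The chopping rate rises with the intensity of the sustained input.
for drive in (250.0, 400.0, 600.0):
    print(drive, len(chopper_response(np.full(2000, drive))))
```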

Bushy cells (Figure 2.15b) have a distinctive dendritic tree with stout primary dendrites and several levels of branching. They have a primary response type: they respond only to the occurrence of a new sound and are thought to be extremely important in sound localisation along the azimuthal axis. For the most part, bushy cells phase-lock to the stimulus with more accuracy than the AN fibres at low frequencies, see Figure 2.17. The figure shows the comparison between an AN fibre and a bushy cell phase-locking to a short tone input. Each dot in the raster plots signifies a spike occurrence when the cell is presented with its best frequency: 350 Hz for the AN fibre and 340 Hz for the bushy cell. Individual rows indicate the response to each of the 200 repetitions of the stimulus. It is clear that the bushy cell produces a much more distinct and less dispersed output than the fibre, creating a smaller temporal window during which coincidence can happen, which in turn gives more exact ITDs in the MSO, [Joris et al., 1998, Yin, 2002]. The reason why bushy cells, which take their input from AN fibres, can produce a more accurate and less distributed phase-lock is that there is a form of coincidence detection at work. It is thought that bushy cells do not have a one-to-one response to AN input at low frequencies, as previously thought, but that a number of inputs occurring in a short space of time are required to cause the bushy cell to respond, [Yin, 2002].

Bushy cells can be divided into spherical and globular bushy cells depending on form and location, [Smith et al., 1993]. Spherical bushy cells are located in the anterior AVCN and have a primary-like response type (one spike in, one spike out) at frequencies greater than 4 kHz, see Figure 2.16c, [Van Schaik et al., 1996, Delgutte and Oxenham, 2005]. They take inputs from one or two endbulbs of Held and output to the ipsilateral LSO and the bilateral MSO, [Van Schaik et al., 1996, Ryugo et al., 1997]. They also receive inhibitory inputs, which are involved in spectral sharpening (the elimination of noise in the auditory signal), [Van Schaik et al., 1996]. Globular bushy cells have an oval-shaped soma and usually only one primary dendrite, [Feng and Lin, 1996]. The nucleus can be round or ovoid in shape and is located randomly within the cytoplasm, while the nucleolus can be found in the centre of the nucleus, [Feng and Lin, 1996]. They are located in the posterior AVCN and have a primary-like with notch response type (similar to the primary-like response but with a 1-2 ms period of no firing after the first spike) for frequencies greater than 4 kHz, see Figure 2.16d, [Delgutte and Oxenham, 2005]. They take input from around nineteen somatic terminals (modified endbulbs), smaller than those of their spherical counterparts, [Yin, 2002], and project to the calyx of Held in the contralateral MNTB. The probability of a spike occurring at the onset of the auditory signal is almost 100%, because many of the input fibers carry a spike at onset; however, more than one AN spike is needed to create an output spike, [Van Schaik et al., 1996, Zacksenhouse et al., 1998, Delgutte and Oxenham, 2005]. The notch occurs as the cell spikes with input and then enters a refractory period, [Van Schaik et al., 1996].
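The coincidence-detection account of bushy cell phase-locking can be sketched as a simple counting rule: an output spike is produced only when several input spikes arrive within a narrow temporal window. The window width and required spike count below are assumptions for illustration, not physiological values.

```python
def bushy_coincidence(input_spike_times, window=0.0005, required=2):
    """Hypothetical bushy-cell rule: fire only when at least `required`
    AN input spikes arrive within a `window`-second coincidence window."""
    out, recent = [], []
    for t in sorted(input_spike_times):
        recent = [s for s in recent if t - s <= window] + [t]
        if len(recent) >= required:
            out.append(t)
            recent = []          # reset after an output spike
    return out

# Dispersed AN spikes (one per window) produce no output; two near-
# coincident spikes do, tightening the temporal precision of the output.
print(bushy_coincidence([0.0010, 0.0031, 0.0052]))          # []
print(bushy_coincidence([0.0010, 0.0012, 0.0031, 0.0052]))  # [0.0012]
```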

Figure 2.17: Difference in phase-locking of spikes between the AN and bushy cells when presented with a short tone input. Adapted from [Joris et al., 1998]

Octopus cells (Figure 2.15c) were named by Osen due to the shape of their soma, which looks like the head of an octopus or a padlock, [Osen, 1969, Feng and Lin, 1996]. They have thick dendritic limbs and a polar arrangement of axons and dendrites. The nucleus is oval-shaped and can be found at the curved end of the cytoplasm, while the nucleolus is located randomly within the nucleus, [Feng and Lin, 1996]. They have an onset locker response: they respond only at the onset of a specific frequency or frequency range, see Figure 2.16e, [Delgutte and Oxenham, 2005]. The octopus cell has in the region of sixty AN fiber inputs that are sub-threshold (excitation with one input will not create a spike) and a high leakage current; therefore it needs a large number of simultaneous action potentials to spike, [Van Schaik et al., 1996]. Some octopus cells have an onset inhibitory response type and display an initial peak with very little subsequent activity due to the cell's high threshold; only at the onset of the stimulus is the input large enough to elicit a spike, [Van Schaik et al., 1996]. Another explanation could be that the octopus cell might receive direct excitatory inputs from the AN and even some delayed inhibitory contacts; this has only been theorized, but experiments have shown that some octopus cells stay depolarized after the initial spike, so their membrane potential stays above the threshold and the cell cannot generate another spike until there is a break in arriving inputs, [Van Schaik et al., 1996]. These cells react very quickly to the start of stimulation, occasionally within 100 µs. Octopus cells can be found at the medial border of the DCN and in the PVCN and are mostly involved in the detection of sound and the initiation of certain reflexes, [Van Schaik et al., 1996].

Pyramidal (also known as fusiform) cells (Figure 2.15d) have a spindle-shaped soma with primary dendrites branching from the two extended ends of the cell body, [Hewitt and Meddis, 1995, Feng and Lin, 1996]. The nucleus is oval-shaped and located randomly within the cytoplasm, with the nucleolus situated randomly within the nucleus, [Feng and Lin, 1996]. They demonstrate excitatory or inhibitory responses to stimulus frequencies, and have a spatial firing pattern which is important for sound localisation along the elevational (vertical) axis, [Hudspeth, 1991]. Their response type is usually buildup-pauser but can vary to a chopper, see Figure 2.16b, [Hewitt and Meddis, 1995, Delgutte and Oxenham, 2005]. The buildup-pauser response type elicits a spike and subsequently a 5-10 ms pause, [Wittig Jr, 2004]. Pyramidal cells can be found in the ventrolateral region of the DCN, [Feng and Lin, 1996].

There are many other types of cells in the cochlear nucleus which cannot be characterised as distinctly as those above and do not appear to play a major part in the functionality of the auditory system. These are round, small, granule and tuberculoventral cells. Round cells are similar to fusiform cells as they protrude two primary dendrites; however, their soma are round with some deformation, and they reside in the central region of the DCN, [Feng and Lin, 1996]. Small cells have characteristics of many other identified cell types but also include some that cannot be classified; these are grouped together as one group and can be found scattered widely across the DCN, [Feng and Lin, 1996]. Granule cells are extremely small neurons, only about 10 micrometers in diameter, with a small rounded or oval-shaped nucleus. They receive inputs from somatosensory nuclei and communicate information about the position of the pinna to the DCN, [Osen, 1969]. The tuberculoventral cell sends delayed inhibitory output to stifle the response of neurons in the VCN to echoes, [Hudspeth, 1991].

2.2.7 Superior Olivary Complex

The function of the SOC is to process information about interaural delays and amplitudes, while simultaneously acting as a crossover site for spatially oriented auditory information. Studies have shown that the SOC is essential for the localisation of a sound source. It consists of four structures: the MSO, the LSO, the MNTB and the periolivary nuclei. These structures differ based upon the different anatomic and functional input types that cells in these regions receive. The SOC is the first point in the auditory system to receive binaural input. It contains some pre-olivary and post-olivary nuclei, both of which receive mostly efferent innervation. Within the SOC there are four types of neurons: excited-excited (EE), excited-inhibited (EI), inhibited-excited (IE) and inhibited-inhibited (II). EE neurons are excited by signals from both ears; EI neurons are excited by signals from the ipsilateral ear but inhibited by signals from the contralateral ear; IE neurons are inhibited by signals from the ipsilateral ear but excited by signals from the contralateral ear; while II neurons are inhibited by signals from both ears. The SOC also plays a role in the stapedius reflex, which protects the middle ear from loud sounds.

The MSO is disk-shaped and the largest of the nuclei in the SOC; it is the dominant nucleus for sound localisation in humans, [Kulesza, 2007]. It has a tonotopic response pattern that favours low frequencies, in concurrence with the duplex theory of sound localisation. It receives excitatory innervation bilaterally from spherical bushy cells of the AVCN, which phase-lock the sound signal they are transmitting to preserve its temporal features; inhibitory innervation contralaterally from the MNTB and the lateral nucleus of the trapezoid body; and outputs to the IC, [Burger and Rubel, 2008]. Most cells in the MSO are EE cells, but it does contain some EI cells, [Otorhinolaryngology, 2002]. It is thought that the inhibitory inputs to the MSO exist to increase selectivity to ITDs; however, to date there is only limited data to support this, with further study required on this subject, [Burger and Rubel, 2008]. EE cells display cyclic dependence for both inhibition and excitation phases, have a characteristic delay, and display the strongest response when the signal has been synchronised from both ears. The MSO is made up of three different cell types: principal, multipolar and marginal cells. The principal cells are thought to work as coincidence detectors to identify ITDs, [Otorhinolaryngology, 2002, Yin, 2002, Delgutte and Oxenham, 2005].

ITD allows for the recognition of the azimuth of low frequency sounds; thus the MSO is important in the localisation process. The principal cells in the MSO appear to be bipolar, with dendrites reaching out rostrocaudally (from head to tail) over a vast extent of the MSO and receiving input from their proximal ear, [Yin, 2002, Burger and Rubel, 2008]. Neurons in the MSO respond to both binaural and monaural stimuli, [Abbas, 1988, Grothe and Park, 2000, Grothe, 2003].

Jeffress' theory of the MSO is that it takes as input a combination of the sound from the two ears; the ipsilateral inputs arrive directly, while the contralateral inputs pass through a graded series of delays. For a sound source at a particular angle to the listener, only one particular delay will allow the ipsilateral and contralateral inputs to match. The sound source location in space is read in as a phase signal, which is converted to a place map in the MSO. In birds, the nucleus laminaris, which is homologous to the mammalian MSO, exhibits the graded axon lengths found in the Jeffress model, [Carr and Konishi, 1990, Carr, 1993, Konishi, 2000, Burger and Rubel, 2008]. However, the occurrence, structure and function of this simple delay line model in the MSO of mammals has been debated at length. Studies of the cat MSO have found evidence for differing axon lengths from the contralateral ear to the MSO, where the shortest axons innervate the rostral MSO cells and the longer axons innervate caudal MSO cells, [Smith et al., 1993, Beckius et al., 1999]. These studies indicate agreement with the Jeffress model. However, both studies also show that each axon only innervates a small portion of the MSO, unlike in the nucleus laminaris of the bird, in which the entire nucleus is innervated. Ultimately, there have been far fewer studies carried out on the anatomical and physiological structure and function of the mammalian MSO, in comparison to the avian nucleus laminaris, to determine conclusively whether the Jeffress model is an appropriate representation or not.
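As an illustration of the Jeffress scheme (a sketch of the idea only, not the MSO model developed in this thesis), the code below pairs each coincidence detector with a graded contralateral delay; the detector whose delay cancels the stimulus ITD registers the most coincidences, giving a place code for azimuth. The delay values and coincidence window are illustrative.

```python
import numpy as np

def jeffress_place_map(ipsi_spikes, contra_spikes,
                       delays_us=(0, 100, 200, 300, 400), window_us=50):
    """Count near-coincident spike pairs for each contralateral delay.

    Spike times are in microseconds. Returns coincidence counts per
    delay line; the best-matching delay (the 'place' on the map) is
    the one with the highest count.
    """
    ipsi = np.asarray(ipsi_spikes)
    counts = []
    for d in delays_us:
        delayed = np.asarray(contra_spikes) + d   # graded axonal delay
        # a coincidence: an ipsilateral spike falls within the window
        hits = sum(np.any(np.abs(ipsi - t) <= window_us) for t in delayed)
        counts.append(hits)
    return counts

# A 200 µs ITD (the contralateral ear hears the sound first) is
# cancelled by the 200 µs delay line, which therefore fires most.
delays = (0, 100, 200, 300, 400)
contra = [0, 1000, 2000, 3000]
ipsi = [t + 200 for t in contra]
counts = jeffress_place_map(ipsi, contra, delays_us=delays)
print(counts, '-> best delay:', delays[int(np.argmax(counts))])
```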

Some evidence has been found for a number of Jeffress' theories, and questions remain to be answered on others.

The MNTB lies medial to the MSO and is the smallest of the SOC nuclei. It has principal cells which are large and bulbous in shape, with abundantly branching dendritic trees, see Figure 2.18. They take excitatory input from contralateral AVCN globular bushy cells, which phase-lock the sound signal at low frequencies.

These inputs make contact with the MNTB cells at synapses called the calyces of Held, which are the largest, fastest and most temporally secure synapses in mammals. The MNTB provides inhibitory input to the ipsilateral LSO (its main function), the MSO and the ventral nucleus of the LL, [Grothe and Park, 2000]. MNTB cells usually have a primary-like with notch response similar to bushy cells, although sometimes a chopper response type similar to stellate cells can be seen, see Figures 2.16a and 2.16d, with each cell responding to a distinct CF, [Smith et al., 1998, Zacksenhouse et al., 1998, Grothe, 2003].

Figure 2.18: MNTB cell, from [Grothe, 2003]

The LSO appears as a folded sheet of EI neurons with high CFs, in agreement with the duplex theory of sound localisation; it is significantly smaller than the MSO in humans, [Pollak et al., 2002].

The size of the LSO in different mammals is consistent with the range of usable frequencies; e.g. bats and porpoises, which can process extremely high frequencies, have large LSOs, whereas humans are not as sensitive to such frequencies and as such have a smaller LSO, [Moore, 2000, Bazwinsky et al., 2003]. The LSO favours high frequencies, as the head casts a clearer shadow in the soundscape, producing differing SPLs of the signals arriving at each ear for a particular frequency, [Otorhinolaryngology, 2002, Dorland et al., 2003, Tollin, 2003, Shi and Horiuchi, 2005]. It has a tonotopical organisation similar to the MSO: high frequencies are represented in the middle of the LSO and progressively lower frequencies at the sides, [Yin, 2002]. It is excited by innervation from small spherical bushy cells of the ipsilateral AVCN and inhibited by innervation from the contralateral MNTB, [Moore, 2000, Otorhinolaryngology, 2002, Konishi, 2003]. LSO cells are of three distinct types: fusiform, stellate and round. Fusiform cells compose about one quarter of the population, round cells make up another quarter, and about half of the population consists of stellate cells, [Kulesza, 2007]. The LSO is stimulated by excitatory innervation producing an excitatory post synaptic potential (EPSP), while the inhibitory innervation produces an inhibitory post synaptic potential (IPSP). These are combined to produce a discharge rate relating to the IID at a particular sound frequency, i.e. the IPSP is subtracted from the EPSP. The LSO is sensitive to IIDs as small as 10 dB SPL, [Otorhinolaryngology, 2002].
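This subtractive interaction can be sketched as follows (a toy model, not the LSO network developed in this thesis): the discharge rate is the rectified difference between an ipsilateral excitatory drive and a contralateral inhibitory drive, so it is minimal at an IID of 0 dB and grows as the ipsilateral ear becomes more intense. The drive values and gain are illustrative.

```python
def lso_rate(ipsi_excitation, contra_inhibition, gain=1.0):
    """Subtractive LSO sketch: discharge rate grows as the ipsilateral
    (excitatory) drive exceeds the contralateral (inhibitory) drive.
    Inputs are hypothetical drive levels derived from the two ears'
    sound pressure levels; gain is an illustrative scaling factor."""
    return max(0.0, gain * (ipsi_excitation - contra_inhibition))

# An IID of 0 dB (equal drive) gives minimal discharge; the discharge
# rate rises as the ipsilateral ear becomes relatively more intense.
for ipsi, contra in [(30, 30), (30, 20), (30, 5), (20, 45)]:
    print(ipsi - contra, 'dB ->', lso_rate(ipsi, contra))
```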

It projects an inhibitory output ipsilaterally to the IC, and to the dorsal nuclei of the LL both ipsilaterally and contralaterally. For the LSO to produce a discharge which relates correctly to the IID, the inputs from the two ears must be in phase, i.e. arriving at the same point in time. However, unlike the MSO, there are no delay lines in the LSO which could correctly bring the two inputs into phase. There are three features of the auditory pathway which would in fact cause the contralateral input to be further delayed, [Park et al., 1996, Tollin, 2003]:

1. As can be seen from Figure 2.1, the route to the LSO from the contralateral ear is longer than from the ipsilateral ear.

2. The input from the contralateral ear must pass through another synapse, the MNTB, before arriving at the LSO, which the ipsilateral input does not.

3. The stimulus will naturally contain an ITD unless its original location is directly in front of or behind the head.

Figure 2.19 shows the effect a contralateral time delay will produce on the LSO discharge rate. The top image shows a hypothetical response from the LSO with an IID of 0 dB (the sound is located directly in front of or behind the head) and no delayed contralateral input. As expected, the LSO produces a minimal discharge. If the three features above are applied, then there will be a delay in the inhibitory contralateral input (bottom image), causing a much larger discharge rate which will not accurately reflect the IID.

Figure 2.19: Possible output if contralateral and ipsilateral inputs to the LSO do not arrive simultaneously, from [Joris and Yin, 1995]

However, there are three other features of the auditory pathway which compensate for the delay produced by the longer route, the extra synapse and the natural ITD. First, the axons of the globular bushy cells which provide input to the contralateral MNTB are about three times larger than the axons of the spherical bushy cells which provide direct ipsilateral input to the LSO, [Yin, 2002, Tollin, 2003]. This difference in axon size causes considerably shorter axonal conduction times through the globular bushy cells of the contralateral pathway than through the small spherical bushy cells on the ipsilateral side. Second, it is thought that the MNTB does not work simply as a relay which provides inhibitory input to the LSO, but that the calyx of Held synapse can retain both the temporal and spectral properties of the stimulus produced by the globular bushy cells, [Yin, 2002, Tollin, 2003]. Lastly, the inputs from the MNTB arrive directly onto or near to the soma of the LSO neurons, whereas the ipsilateral inputs arrive at distal dendrites, i.e. those furthest away from the soma, [Yin, 2002, Tollin, 2003]. Together, these three latter features can cause the contralateral stimulus to arrive within 200 µs of the ipsilateral input in most cases, [Joris and Yin, 1998].

Four decades of physiological studies by many different researchers have consistently found that the cells of the LSO are sensitive to IIDs and that the discharge rates of LSO cells have a relationship with the IID present in the stimulus, [Park et al., 1997, Tollin, 2003, Park et al., 2004]. As discussed above, when the stimulus from both ears exhibits an IID of 0 dB, the corresponding discharge rate of the LSO neuron is minimal. The way in which LSO cells are normally studied is to present a pure tone stimulus at the best frequency of the LSO cell with differing IIDs present. In Figure 2.20, the LSO cell has a best frequency of 16 kHz and is presented with a 300 ms tone burst twenty times, [Tollin and Yin, 2002b, Tollin, 2003]. The responses of the cell are displayed in the form of a dot raster plot and a poststimulus time histogram. Images A to D represent the LSO cell's response to a fixed ipsilateral stimulus of 30 dB while the contralateral stimulus is varied from 5 dB to 45 dB. The corresponding responses of the LSO cell change in accordance with the varied contralateral stimulus. Image A produces the greatest discharge rate; image B produces a much lower discharge rate than A; image C produces a minimal discharge rate; and image D produces no discharge, as the IID favours the contralateral input. Image D also illustrates the relationship between the two LSOs found in the auditory system. When a sound is closest to the left side of the head, the ipsilateral (left) LSO produces a discharge responding to the IID, but the contralateral LSO is silenced.


Figure 2.20: LSO cells' sensitivity to differing IIDs, from [Tollin and Yin, 2002b]


2.2.8 Higher Auditory Pathways

2.2.8.1 Lateral Lemniscus

The LL is sensitive to changes in both the timing and amplitude of sound and is involved in the startle reflex. Also called the Moro reflex after the Austrian pediatrician Ernst Moro, this is a reaction to an unexpected loud noise and is thought to be the only unlearned fear in humans. The nuclei of the LL contain three distinct regions: the dorsal nuclei (DNLL), the ventral nuclei (VNLL) and the intermediate region (INLL). The LL performs spectral analysis (e.g. vowel detection, line spectra tracking) and the detection of transients (strong and out-of-synch spikes), and has a role in measuring the timing of echoes.

DNLL neurons are EI-type binaural cells, are responsive to all frequencies, and are mainly GABAergic, providing inhibitory output. These neurons take ipsilateral input from the MSO; both ipsilateral and bilateral input from the LSO; and contralateral and bilateral input from the AVCN. They output to the DNLL on the other side of the auditory pathway via the commissure of Probst, see Figure 2.24 at the end of this chapter; ipsilaterally and contralaterally to the IC; ipsilaterally to the medial geniculate body (MGB); bilaterally to the superior colliculus; to the SOC; and finally to the midbrain reticular formation. Neurons with onset response types have been found in this region. Because its principal connections are from the LSO, it is involved with IID as part of the sound localisation process. Figure 2.21 shows the inhibitory (dotted line) and excitatory (solid line) connections from the LSO to the DNLL, [Covey and Casseday, 1991, Shi and Horiuchi, 2005].

Figure 2.21: Connections between the LSO and DNLL, from [Shi and Horiuchi, 2005]

The VNLL is made up of a three-dimensional matrix of cells; temporal processing occurs here and some studies suggest that it is essential for the decoding of amplitude modulated sounds in the IC. It takes input contralaterally from both the AVCN and PVCN and ipsilaterally from the MNTB. It outputs to the DNLL, MGB, periolivary nuclei and MNTB, and ipsilaterally to the IC. The VNLL is most sensitive to sounds in the contralateral ear, i.e. it is connected to the monaural system. The VNLL can be split into two types of cell groups, the columnar nucleus (VNLLc) and the multipolar cell area (VNLLm). All cells of the VNLLc are of the same type, analogous to the spherical bushy cells of the AVCN, i.e. round to oval in shape with one large dendrite whose extensive branching only begins a certain distance away from the cell body. These branching fibers lie in parallel to the other ascending fibers of the LL. The name columnar nucleus comes from the way the cells are arranged, packed tightly together in columns between fibers. A phasic (onset-like) response type makes up 95% of the population, and the other 5% includes primary-like neurons. Cells of the VNLLm are larger than those of the VNLLc and take their name from their multipolar shape. They have some broad dendrites which are only sparsely branched. Between 20-30% of neurons in the VNLLm have a phasic response type, while some chopper, tonic (similar to choppers but not locked to the stimulus onset), pauser and primary-like response types can also be found, see Figures 2.16a, 2.16b and 2.16c, [Covey and Casseday, 1991, Merchan and Berbel, 1998].

The INLL takes input contralaterally from the AVCN and PVCN and ipsilaterally from the MNTB. It outputs to the MGB and ipsilaterally to the IC. Its cells are monaural and elongated in shape, with numerous dendrites extending from both ends of the soma in a plane orthogonal to the ascending fibers of the entire LL and parallel to the fibers entering it. Neurons of the INLL have very short integration times, and response types include phasic (about 20-30%), chopper, tonic, pauser and primary-like, see Figures 2.16a, 2.16b and 2.16c, [Covey and Casseday, 1991].

2.2.8.2 Inferior Colliculus

The IC is an auditory reflex centre with the key functionalities of auditory relay and spatial localisation. It can be split into three distinct areas: the central nucleus (ICC), the pericentral nucleus (ICP) (also known as the 'dorsal cortex') and the external nucleus (ICX) (also known as the 'external cortex'). The IC is an integrative station that is responsive to interaural delay and amplitude difference. It is believed to provide a spatio-topic map of the auditory environment. It receives inputs from all previous regions of the auditory pathway, and both ICs communicate with each other. Several response types exist in the neurons of the IC, including: a sustained response, for the entire duration of the stimulus; off effects, when the stimulus ends; and inhibition, a decreasing response over the duration of the stimulus. Similar to the SOC, neurons here can encode location specific information and are involved in the localisation process, [Abbas, 1988, Murray et al., 2004].

The most well known part of the IC is the ICC, a specialised auditory pathway. It outputs bilaterally to the MGB and auditory cortex; to deep layers of the superior colliculus; contralaterally and ipsilaterally to the IC; to the reticular formation; the periaqueductal gray; the VNLL; the DNLL; the SOC; and finally to the MNTB. It consists of two types of cells, see Figure 2.22. About 70% are principal cells, which have small, round, planar dendritic trees lying in the same plane as groups of axons coming up from the LL, with their long axis parallel to the axonal tract. Each axon forms a synapse with multiple principal cells as it runs through the ICC. There are two response types, sustained and onset. The sustained response type can again be divided into rebound and buildup-pauser. Rebound neurons have complex, lightly branched dendritic trees and can be either pyramidal or stellate cells. Buildup-pauser neurons have dendrites branched in close proximity to the soma and are similar to the buildup response type found in the cochlear nucleus. Again, the onset neurons are similar to those neurons in the cochlear nucleus with an onset response type, [Peruzzi et al., 2000]. Multipolar cells make up about 30% of cells in the ICC. These have irregularly shaped dendritic trees and lie at right angles to, and sometimes cover, the axonal tracts. It is unclear whether they form synapses with these axons. The ICC combines the complex frequency analysis of the DCN with the sound localising ability of the SOC. It is characterised by iso-frequency sheets, with each sheet representing a single CF; low CF sheets are organised rostrally and high CF sheets caudally.

The ICP and ICX contain about three layers of cells. The ICP does not receive much input but interacts with the rest of the IC and outputs to the MGB. The ICX is multisensory, receiving auditory and somatosensory input.

Figure 2.22: Cell types of the ICC: (a) principal cell; (b) multipolar cell

2.2.8.3 Thalamus

The last location before the cortex on the auditory pathway is the thalamus, which is shaped like a football. Its function is to relay information to the cortex and perform relative intensity and duration comparisons. It is divided into the MGB, the lateral posterior nuclei and the reticular nucleus (NRT). The MGB can again be divided into three subsections: the ventral (VMGB), medial (MMGB) and dorsal (DMGB). The primary nucleus responsive to the auditory pathway is the MGB, although the other nuclei are to some extent sensitive to auditory stimuli. Monaural cells make up 10% of the cells in the MGB, while the other 90% are binaural. Monaural cells are primarily responsive to sound in the contralateral hemifield, while the binaural cells are similar to the EE or EI types found in the IC, [Abbas, 1988].

Of the three regions of the MGB, the VMGB is primarily auditory. It is a layered structure with low CF layers located laterally and high CF layers located medially. The VMGB takes ipsilateral input from the ICC, contralateral input from the ICP and the NRT, and further input from the auditory cortex. It outputs to three sections of the auditory cortex: the anterior, primary and posterior. It contains two types of cells, thalamocortical relay cells, see Figure 2.23, and intrathalamic interneurons, [Otorhinolaryngology, 2002]. Thalamocortical relay cells (also known as principal neurons) take input from two sets of dendritic trees located on opposite poles of the cell. The long axes of the relay cells lie parallel to each other, running superior-inferiorly, with the dendritic trees of cells within the same iso-frequency band. The dendrites of the cells form a synaptic nest with ascending axons from the (excitatory) IC and (inhibitory) intrathalamic interneurons. Intrathalamic interneurons (also called Golgi type 2 cells) supply the inhibitory input GABA to the relay cells at the synaptic nests. Some interneurons target relay cells, some target other interneurons. The VMGB is believed to be predominantly responsible for relaying frequency, intensity and binaural information to the cortex, [Abbas, 1988].

Figure 2.23: Thalamocortical relay cell, from [Destexhe, 2009]

The MMGB takes ipsilateral input from the ICC; both ipsilateral and contralateral input from the LL, superior colliculus, periolivary nuclei and secondary auditory cortex (AII); input from the NRT; and some somatosensory and vestibular influences. It outputs to the auditory cortex, ipsilaterally to the primary auditory cortex (AI), the AII, and the anterior and posterior auditory cortex. The cells of the MMGB have large, irregularly shaped dendritic trees. Its main functionality is the detection of the relative intensity and duration of a sound. Cell types include binaural cells, which can be EE, EI or IE. Most cells in the MMGB respond for the duration of the stimulus with very little variation. Individual cells can be tuned to certain frequencies, but they often have more than one CF, [Abbas, 1988]. The DMGB takes input from the ICP, ICX, auditory cortex and other thalamic nuclei. It outputs only to the auditory cortex. The cell types of the DMGB are unclear; however, two distinct principal cell types have been found along with two distinct types of interneurons. The cells in the DMGB are broadly tuned, but some cells appear to respond only to complex stimuli. Other cells are multimodal, responding to both somatosensory and auditory stimuli.

2.2.8.4 Auditory Cortex

Auditory signals terminate at the auditory cortex, which is located in the sylvian fissure of the temporal lobe. Figure 2.24 describes the auditory pathways through which a sound travels from the cochlea to the auditory cortex. There are two auditory regions in the cortex, the AI and the AII. The AI contains an auditory-specific region called the planum temporale; this area can keep track of and identify acoustic objects, [Griffiths and Warren, 2002]. Inputs to the auditory cortex come from the MGB; contralaterally from the AI; and from the AII and other cortical areas. It outputs to sensory association areas in the parietal and temporal lobes; to speech areas such as Broca's area and Wernicke's area; and to the MGB and the IC. The human auditory cortex can be divided into six layers. Layers I and IV receive input from the thalamus; cells in Layer IV project to pyramidal cells in Layer III; from Layer III, auditory information diverges to several different locations, including other areas of the auditory cortex and the layers of the AI; while Layers V and VI project out of the auditory cortex. The auditory cortex is organised tonotopically, with higher frequencies located dorsally and lower frequencies located ventrally. It has been suggested that there are several tonotopic maps represented in the cortex, with specific functions for the demodulation of speech and other sounds. There are also suggestions of a spatiotopic map, with sounds from the contralateral hemifield increasingly more excitatory for each given hemisphere. Areas with specific temporally sensitive responses may play a role in several phenomena of perception, including virtual pitch perception, timbre discrimination, spatial localisation and even noise filtering. Cell types here are involved in the sound localisation of the sound signal and in motion detection, a unique functionality of the auditory cortex, [Abbas, 1988, Otorhinolaryngology, 2002].

Figure 2.24: Schematic of the auditory pathways including the cochlea, trapezoid body, superior olivary complex, lateral lemniscus, commissure of Probst, commissure of the inferior colliculus, inferior colliculus, medial geniculate body and the auditory cortex, adapted from [de Jonge, 2008]

2.3 Conclusion

To summarise, the duplex theory of mammalian sound localisation states that the localisation of low frequency sounds is based on ITDs and that of high frequency sounds on IIDs. These cues are processed in the parts of the auditory system appropriate to sound localisation, mainly the MSO and LSO. Jeffress' computational model of ITD-based localisation was discussed and compared to evidence from physiological studies. Both the duplex theory and Jeffress' computational model are key methodologies which underpin the research presented in this thesis. HRTF data was used to demonstrate the complex relationship between IIDs and both frequency and azimuth. Finally, the cell types of neurons found throughout the auditory pathways were described. This review of the auditory system from the external ear through to the auditory cortex sets the context for the work presented in this thesis. However, the experimental work described in later chapters will primarily focus on the MSO and LSO.

This review of the literature of mammalian sound localisation is a key factor in this research, as a thorough understanding of the mammalian auditory pathways is required in order to implement a biologically inspired computational model which can process and extract the binaural cues of sound localisation. The implications of faithfully modelling each attribute of the auditory pathway are twofold: the number of neurons which populate the auditory pathways from the cochlear nuclei to the auditory cortex is vast, and the differing response types of the neurons are quite complex. Considering these implications, the aim of this research is to develop networks of spiking neurons with topologies inspired by the auditory pathways, to emulate the way in which mammals localise sounds. The next chapter will review computational models which enable the development of biologically inspired sound localisation systems, e.g. neuron models, learning algorithms and network architectures. Furthermore, biological and non-biological models of sound localisation developed over the last twenty years will be outlined, from geometrical methods to biologically inspired SNNs.

Chapter 3

Neural Networks and Sound Localisation Modelling

3.1 Introduction

The development of computational sound localisation models encompasses numerous distinct research directions: cochlea models, auditory neuron implementations, the extraction of ITD and IID binaural cues from real biological data, and the processing of these cues to generate or estimate an angle of location of a sound source. All or some of these components can be included in a sound localisation model, and they can be realised in a purely computational way, with a biologically inspired technique, or even in a fully biologically plausible manner. However, it is commonly understood that the more biologically plausible a model aims to be, the more complex it becomes. This chapter begins with a review of cochlea and auditory models, with a description of the auditory periphery model used in this work. The research areas of ANNs and SNNs are then introduced, with a description of the properties and functionality of the biological neuron. The latter parts of the chapter summarise the different methods other researchers have used in the development of sound localisation models from the late 1980s onwards.

3.2 Cochlea and Auditory Cell Models

For the last thirty years, the majority of relevant papers in the field of cochlea modelling have been based on, or at least refer to, Lyon's model, a simplification of the activities of the cochlea, [Lyon, 1982]. In the early 1980s there was considerable research on biological factors of hearing, but no computational models existed until Lyon produced his model. Later in the decade, Lyon and Mead introduced the analog electronic cochlea, built in very large scale integration (VLSI) complementary metal-oxide-semiconductor (CMOS) technology. It involved a cascade of second-order filter stages to mimic the travelling-wave system of fluids in the cochlea. Test results showed that it matched both previous theories and observations of real cochleas, [Lyon and Mead, 1988]. In the mid 1990s, Kuszta outlined two main methodologies for designing artificial cochleas. The first method involved parallel banks of bandpass filters using switched capacitor techniques, and the other is based on Mead's description of VLSI systems containing electronic analog circuits that mimic neuro-biological architectures present in the nervous system, [Kuszta, 1998].

This area of research then extended to modelling neurons of the auditory pathways that work in combination with cochlear models. Hewitt et al. created several computer models of specific cell types within the auditory system: the ventral cochlear nucleus stellate cell, [Hewitt et al., 1992, Hewitt and Meddis, 1993], an IC cell, [Hewitt and Meddis, 1994], and a DCN pyramidal cell, [Hewitt and Meddis, 1995]. The stellate cell model included the simulated output of AN fibers, which was used as input to the cell-soma model, a digital simulation of the Hodgkin and Huxley model of spike generation (discussed later in this chapter). Outputs of the model replicated those found during in vivo research. The IC cell was modelled as a coincidence-detecting point neuron with inputs from the stellate cell models. The model only fired when it received a number of synchronous inputs and was able to encode temporal information into a rate-based code. The DCN pyramidal cell model was again a digital simulation of the Hodgkin and Huxley model; however, it was modified by the addition of a transient potassium conductance which created the unique properties of these biological cells. Comparisons were made to cells studied in vivo and the model data replicated the neural data in most cases; for example, when the magnitude of the depolarising pulse was increased, there was an increase in the firing rate of both the real cell and the computational cell.

The research of [Jones et al., 2000] moved auditory modelling away from analog and into digital. These researchers created a neuromorphic pitch detection system implemented on a field programmable gate array (FPGA). The auditory model from [Hewitt and Meddis, 1994] was used, which included cochlear filters, IHCs, stellate cells and IC coincidence cells. Implementation on an FPGA allowed the system to run in real time. Van Schaik's research involved models of the cochlea as a three-tier design that included the artificial cochlea, an IHC model and a spiking neuron circuit on one chip, [Van Schaik et al., 1996, Van Schaik, 2003, Chan et al., 2006]. On the circuit, thirty-two of these neurons could be combined to create a small and simple network that could reproduce the spiking behaviour of neurons in the auditory system. [Abdalla and Horiuchi, 2005] produced a cochlea-like binaural ultrasonic filterbank connected to integrate-and-fire neurons for modelling the bat echolocation system. The model consists of an input signal passing through a bandpass filter layer, where each filter represents a set of exponentially-increasing frequencies. The output of each filter passes to a full-wave rectifier and is transformed into a current, which is used as input to the neuron models; the output spikes relate to the amplitude of the input stimulus.

The auditory periphery model used in this research for processing the HRTF data into spike trains, see Figure 3.1, was created by [Zilany and Bruce, 2006, 2007]. This auditory periphery model was chosen because the HRTF data used as input to the SNN models developed in this research was generated from adult domestic cats, and this auditory periphery model was developed based on empirical observations in the cat. The model also has the ability to generate spike trains which directly relate to the HRTF data, thus enabling the binaural cues of sound localisation to be extracted and classified into angles of location. The input stimulus time-domain waveform initially passes through a middle-ear filter and then through three parallel filter paths: a wideband filter which depicts the filtering properties of the basilar membrane, a chirping filter which is similar to the wideband filter but does not include properties from the OHCs, and a control path filter which models the effects of the OHCs' functionality. The outputs of the wideband and chirping filters then pass through models of IHCs, after which the two outputs are summed and then low-pass filtered to produce the IHC receptor potential.

Figure 3.1: Plan of the auditory periphery model, from [Zilany and Bruce, 2006, 2007]

This potential causes activity at a synapse model and ultimately spikes are generated through an inhomogeneous Poisson encoding process, i.e. one in which time is included as a parameter of the rate, [Virtamo, 2005]. The next section of this chapter reviews the ANN and SNN literature. This was an important factor in this research, as it provided the ability to understand the intricacies of designing network architectures and implementing neuron models and learning algorithms.
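By way of illustration, the sketch below generates spikes from a time-varying rate by thinning, a standard way of sampling an inhomogeneous Poisson process; it is not the [Zilany and Bruce, 2006, 2007] synapse model, and the rate function is a placeholder standing in for the IHC receptor potential.

```python
import numpy as np

def inhomogeneous_poisson(rate_fn, duration, rate_max, rng=None):
    """Generate spike times from a time-varying rate via thinning.

    rate_fn: callable giving the instantaneous firing rate (Hz) at time t;
    rate_max must bound rate_fn over [0, duration]. A sketch only; the
    published synapse model shapes the rate in a far more detailed way.
    """
    rng = rng or np.random.default_rng(0)
    t, spikes = 0.0, []
    while True:
        t += rng.exponential(1.0 / rate_max)   # candidate inter-event time
        if t > duration:
            return spikes
        if rng.random() < rate_fn(t) / rate_max:
            spikes.append(t)                   # keep with prob r(t)/r_max

# Placeholder rate: a 'receptor potential' modulating firing at 100 Hz.
rate = lambda t: 50.0 + 40.0 * np.sin(2 * np.pi * 100.0 * t)
print(len(inhomogeneous_poisson(rate, duration=1.0, rate_max=90.0)))
```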

3.3 Artificial Neural Networks

A mathematical or computational model which aims to imitate the framework and functionality of a biological neural network is called an ANN; the idea for these types of networks came from physiological studies of the nervous systems of many living beings. An ANN is typically a multi-layered network comprised of artificial neurons: a combination of an input layer, any number of hidden layers and an output layer; thus information is processed in the network in a connectionist manner, see Figure 3.2. The neurons in each layer can be connected either completely or partly to the neurons in the next layer, and feedback connections to preceding layers are also possible. ANNs can be utilised for discerning patterns in data and representing complex relationships between input and output data, [Engelbrecht, 2002]. ANNs of the first generation consist of the simplistic McCulloch-Pitts threshold neuron models, [McCulloch and Pitts, 1943], while neurons of the second generation use a continuous activation function, [Maass, 1997, Vreeken, 2002]. There are countless applications for which an ANN would be appropriate, but the four main groups for which they are correctly used are function approximation, classification, data processing and control.

Figure 3.2: ANN consisting of three feedforward layers of artificial neurons (an input, a hidden and an output layer), where x1, x2 and xN are the inputs and y is the output

main groups for which they are correctly used are function approximation, classication, data processing and control. An articial neuron is a mathematical model of a biological neuron which accepts inputs from the environment and other neurons.

The neuron is

presented with one or more inputs akin to one or more dendrites on the biological neuron.

The weights on each of these inputs are summed after

multiplication by input and passed through a non-linear function called an activation or transfer function to produce an output; this process is equivalent to the ring threshold in the soma of a biological neuron. Figure 3.3 and Equation 3.1 demonstrate this functionality of a neuron; where activation function, of

xi

are the inputs and

wi

f

is the

are the weights with an output

y: y = f (net) = f

N X

! x i wi

(3.1)

i=0 There are four main types of activation function:

step, linear, ramp and

sigmoid, [Haykin, 2008]. The step function, also called a threshold function, produces a value of either 1 or -1 depending on whether the sum of the weights on the inputs are greater than or less than a predened threshold. The linear activation function multiplies the sum of the weights by a predened value. The ramp function is similar to the linear function but a set of upper and lower limits are introduced. Between the limits, it behaves as a

54

i number of inputs and weights with y

Figure 3.3: Articial neuron containing activation function

f (net)

and output

linear function but outside the limits the step function is used. Finally, the sigmoid activation function is a non-linear function which takes as input any number between

±∞

and produces an output in the range of 0 and 1.
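As a concrete illustration of Equation 3.1 and the four activation functions described above, the following minimal sketch computes the output of a single artificial neuron; the input and weight values are arbitrary choices for illustration.

```python
import math

def step(net, threshold=0.0):
    # Step (threshold) activation: 1 or -1 either side of the threshold.
    return 1.0 if net > threshold else -1.0

def linear(net, gain=1.0):
    # Linear activation: scales the weighted sum by a fixed gain.
    return gain * net

def ramp(net, lower=-1.0, upper=1.0):
    # Ramp activation: linear between the limits, saturating outside them.
    return max(lower, min(upper, net))

def sigmoid(net):
    # Sigmoid activation: maps any real input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-net))

def neuron_output(inputs, weights, activation=sigmoid):
    # Equation 3.1: y = f(sum_i x_i * w_i)
    net = sum(x * w for x, w in zip(inputs, weights))
    return activation(net)

y = neuron_output([0.5, 1.0, -0.2], [0.8, -0.4, 0.3])
```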

There is no consensual statement which exactly defines an ANN; however, it is generally understood and agreed that ANNs consist of a network of individual and quite simple nodes, called neurons, which can perform complicated global operations dependent on their connective structure and parameter values. The power of ANNs comes from the dynamic strengths, called weights, of the connections between the neurons. The network can be trained to recognise patterns or to create a relationship between a set of input and output data by changing these weights using an algorithm (a process or set of rules). This process is called learning and there are many different types of learning algorithms for ANNs. When deciding to use an ANN for solving a particular type of problem, there are four main factors to take into account. Firstly, the type of activation function has to be chosen. Next, the network structure; there are many different ANN structures, from a relatively simple feedforward to a more complex recurrent network. Depending on the problem, the appropriate network should be chosen, i.e. overly complicated structures can have difficulty with learning. The next choice is to select the learning algorithm and whether the problem requires supervised or unsupervised learning. Finally, the last issue to consider is robustness; a trained network should have the ability to tolerate faults, given that the correct network, training algorithm and activation function are selected.


3.3.1 Types of Artificial Neural Networks

There are many different types of network structures to choose from; this section will outline a number of them. In a feedforward network, the input data passes through each layer of the network in a sequential manner; there can be any number of layers of neurons in the network. One of the most widely known feedforward networks is the perceptron network, developed by [Rosenblatt, 1957]. It is a linear classifier which maps the input data onto a single binary value in the output. This perceptron network was extended to a more powerful multi-layer perceptron, the difference being that non-linear activation functions are used and multiple hidden layers can be employed.

A radial basis function (RBF) network is a type of feedforward ANN which uses radial basis functions as activation functions in the single hidden layer, [Broomhead and Lowe, 1988, Moody and Darken, 1989, Bors, 2001]. Neurons in the output layer perform a weighted sum of the hidden layer outputs, so the usually non-linear input is converted to a linear output. Due to their non-linear approximation abilities they are able to perform complex mappings in a small network. Kohonen self-organising networks, sometimes called self-organising maps, were developed by [Kohonen, 1989]. They are similar to feedforward networks and their function is to map the input data onto a low-dimensional set of coordinates in the output, using neighbourhood functions to preserve the spatial properties of the input data. The learning rule involved in self-organising maps is outlined in the next section.

Recurrent networks differ from feedforward ANNs in that data can be routed back through earlier layers of the network, i.e. the data can move both forwards and backwards through the network. A simple recurrent network, called the Elman network after its creator Jeff Elman, is a modification of the multi-layer perceptron, [Elman, 1990]. This network has three layers and a set of context units which have a constant weight of 1. The output of each hidden layer neuron is sent to both the output layer and to the context units, which then maintain a copy of the hidden layer values before learning occurs. This extra functionality over the multi-layer perceptron allows for the solving of sequence-prediction type problems. Simple recurrent networks are not fully recurrent; to be fully recurrent each neuron can take input from, and output to, every other neuron in the network, i.e. the network does not have a typically layered architecture. Also, only a small group of neurons will receive the set of input data and another group will produce the output of the network. Hopfield networks, developed by John Hopfield in 1982, are fully recurrent networks which have a step activation function, producing one of two output values depending on whether the input exceeds a threshold, [Hopfield, 1982]. This learning rule is outlined in the next section. Hopfield networks are used for content-addressable memory (CAM) systems; these are search systems where the user provides a word and the CAM searches its entire memory to find where that word is located. The CAM then returns the list of memory storage addresses where that word has been used.

Reservoir computing is an area of computing which encompasses the design and learning of recurrent networks. The reservoir is composed of a large recurrent network with randomly connected units which can be artificial neurons or spiking neurons (discussed later in this chapter). The reservoir maps an input signal to a higher dimension which can then be trained and mapped to a desired output. The two main types of reservoir computing are echo state networks and liquid state machines (LSM). The echo state network has a sparsely connected hidden layer. Learning only occurs at the output, as these are the only weights that can be modified. They work very well for finding or duplicating temporal patterns. LSMs consist of a large number of randomly connected units which can perform non-linear functions. It has been argued that LSMs represent an advancement over ANNs, as they are not designed to be used for only one task and they can cater for inputs with various time scales. Tasks for which LSMs are employed include speech recognition and computer vision. For further information on echo state networks and LSMs see [Jaeger and Haas, 2004] and [Maass and Markram, 2004] respectively.
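To make the RBF architecture above concrete, the following minimal sketch computes the forward pass of a small RBF network: a Gaussian hidden layer followed by a weighted sum at the output. The centre positions, widths and output weights are illustrative values, not taken from any system described in this thesis.

```python
import math

def rbf_forward(x, centres, widths, output_weights):
    # Hidden layer: one Gaussian radial basis function per centre.
    hidden = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                       / (2 * s ** 2))
              for c, s in zip(centres, widths)]
    # Output layer: weighted sum of the hidden activations.
    return sum(h * w for h, w in zip(hidden, output_weights))

# Two hidden units centred at (0,0) and (1,1), mapping a 2D input to a scalar.
y = rbf_forward([0.5, 0.5], centres=[[0, 0], [1, 1]], widths=[0.5, 0.5],
                output_weights=[0.3, 0.7])
```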

3.3.2 Learning

Learning in an ANN aims to adjust the weights and biases using a learning rule or algorithm. There are three types of learning rules for ANNs: supervised, unsupervised and reinforcement. Supervised learning, also called associative learning, occurs when the ANN is provided with an input data set and a target output. Every time the network is presented with the training data set, the weights and biases for each input are adjusted so that the error between the actual output and the target output is minimised. The intention of unsupervised learning is to discern patterns or repeating traits in the input data without any comparison to a target output. The network must self-organise to respond to these trends in the data by performing a clustering of the uncovered patterns. The last type of learning in ANNs is reinforcement learning; this can be considered as being midway between supervised and unsupervised learning. It works by rewarding individual neurons, or even sections of the overall network, for desired performance and penalising `poor performance' by changing weights and biases, [Engelbrecht, 2002].

There are many learning rules. Some rules are based on research into biological learning whereas others come from studies which look at how nature deals with learning; however, most are a variation of the oldest and most widely known learning rule, Hebb's rule, [Hebb, 1949]. Essentially, the rule states: When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased. The Hopfield rule is similar to Hebbian learning, [Hopfield, 1982, Rojas, 1996]; if two connecting cells are active, the weight on the connection between them is increased by a defined learning rate. However, the difference to Hebbian learning is that if two connecting cells are not active then the weight on the connection between them can be decreased. The delta rule, also known as the Widrow-Hoff or Least Mean Square learning rule, is again a modification of Hebb's rule, except that the weights are continually changed to reduce the difference between the actual and desired output; to reduce the mean squared error of the network, the error difference is back-propagated through the network until it reaches the first layer. The gradient descent rule is a generalisation of the delta rule, incorporating a learning rate into the calculation of the new weight values in addition to the activation function; the backpropagation algorithm is the most popular algorithm of this type. Kohonen's learning rule is used in a self-organising map. The motivation for this rule came from studies on learning in biological systems. This is a form of competitive learning where each neuron competes for the chance of changing its weights. The winning neuron is the neuron with the largest output and it can inhibit its competitors while exciting its neighbours. Therefore only the winning neuron and its neighbours can alter their weights. During training with a Kohonen rule, neighbourhood sizes can vary, from a large neighbourhood at the start to promote global ordering, to a more compact size as training progresses.
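As a concrete illustration of the delta rule described above, the sketch below trains a single linear neuron to reduce the difference between actual and target outputs; the data set, learning rate and epoch count are arbitrary choices for illustration.

```python
def delta_rule_epoch(samples, weights, learning_rate=0.1):
    # One pass of the delta (Widrow-Hoff) rule for a single linear neuron:
    # each weight moves so as to reduce the actual-versus-target error.
    for inputs, target in samples:
        actual = sum(x * w for x, w in zip(inputs, weights))
        error = target - actual
        weights = [w + learning_rate * error * x
                   for w, x in zip(weights, inputs)]
    return weights

# Learn y = x1 + x2 from a few examples (the trailing 1.0 acts as a bias input).
data = [([0.0, 0.0, 1.0], 0.0), ([0.0, 1.0, 1.0], 1.0),
        ([1.0, 0.0, 1.0], 1.0), ([1.0, 1.0, 1.0], 2.0)]
w = [0.0, 0.0, 0.0]
for _ in range(100):
    w = delta_rule_epoch(data, w)
```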

3.4 Spiking Neurons

The most biologically inspired ANNs are those of the third generation, called SNNs, as individual spikes are used as input, [Maass, 1997]. In essence, a spike (action potential or pulse) is a short-term change in electrical potential and is the basis of communication between neurons in the brain. In the third generation, spikes allow for spatio-temporal information to be included in the computation. In 1952, Alan Lloyd Hodgkin and Andrew Huxley produced the first model of a spiking neuron, [Hodgkin and Huxley, 1952a,b,c, Hodgkin et al., 1952]. Their model showed how spikes are created and how they can travel through a network. The propagation of spikes between neurons requires neurotransmitter, which allows spikes to travel across the synapse from the presynaptic neuron to the dendrites of the postsynaptic neuron. This complexity produced many more biological models, each of which represented the biological neuron in a somewhat different way, leading to both biologically realistic and computational type models. Initial studies of spiking neuron models showed that the temporal coding functionality of spiking neurons could provide more computational power than their second generation counterparts, the ANNs. Also, it was found that any multi-layered perceptron network could be duplicated by an SNN. Not only are SNNs very powerful tools for processing spatio-temporal data, they have also been used to investigate the functioning of biological neural circuits. A theoretical idea of the topology and functionality of a neural circuit is developed with an SNN in a computational manner. Subsequent electrophysiological recordings of said biological neural circuits can then be extracted and compared to the output of the computational SNN to determine the correctness of the original theoretical idea.

The amplitude of a spike is thought to be unimportant; it is the timing and number of spikes that are important, and the outputs of a spiking neuron are the points in time when the neuron fired. A spiking neuron fires when its potential, the sum of the excitatory and inhibitory presynaptic potentials, reaches a predefined threshold. These presynaptic potentials arrive as inputs to the neuron and are generated from the firing of other neurons. If the presynaptic potential does not reach the threshold and no more spikes arrive within a certain period of time, the presynaptic potential will decay over time.

3.4.1 Biological Neuron

The biological neuron is found in the nervous system, which consists of the brain, vertebrate spinal cord, invertebrate ventral nerve cord and the peripheral nerves. Its function is to process and circulate information by electrochemical signalling, and it can be separated into three distinctive parts, the soma, axon and dendrites, see Figure 3.4, each with their own individual functionalities, [Marian, 2002]. There are many different types of neurons: sensory neurons, which are sensitive to sound, light, touch, taste, etc.; motor neurons, which receive information from the brain and spinal cord, induce muscle movement and also affect glands; and interneurons, which link neurons to other neurons. The morphology of the biological neuron consists of a cell soma (body) containing the nucleus and an axon surrounded by the myelin sheath. The soma is surrounded by branched projections called dendrites and the end of the axon has similar branches called axon terminals. Essentially, a neuron works as follows: if the total input to the neuron exceeds a threshold, an output signal is created. This action potential is created at the axon hillock and travels down the axon to the terminal buttons (boutons), where neurotransmitter carries it across the synapse (the junction between two neurons) to other neurons. Neurons at different times can be termed a postsynaptic cell (receiving neuron) or a presynaptic cell (sending neuron). In the mammalian brain, each neuron typically connects to more than 10^4 other neurons, either in the same locality or, when their axonal branch stretches over several centimetres, to other parts of the brain, [Gerstner and Kistler, 2002, Trappenberg, 2002].

The soma, from the Greek meaning body, is the round or bulging part of a neuron which contains the cell nucleus. As there are many different types of neurons, somas can range in size from only 3 µm to over 1 mm.

Figure 3.4: Biological neuron showing the nucleus, dendrites, soma, axon, myelin sheath, nodes of Ranvier, Schwann cell and the axon terminals, from [Jarosz, 2009]

The nucleus is the essential part of the soma; it is the source of the greatest amount of the ribonucleic acid (RNA) produced in neurons. The nucleus also contains the cell's DNA, and mitochondria which provide energy. The axon is a lengthy and narrow nerve fibre which transmits electrical signals away from the soma of the neuron. Axons differ from dendrites in many aspects. Axons preserve their width over the length of the fibre, whereas dendrites can reduce in thickness or taper off. Axons are usually much longer and typically relay the stimulus to other neurons, while dendrites receive the stimulus. The end of the axon is called the axon terminal and is located at the synapse, [Marian, 2002]. The action potential travels through the axon to the synapse and the membrane of the synapse depolarises, causing channels to open which emit calcium ions. These ions increase the calcium concentration in the area, causing calcium-sensitive proteins to change shape, resulting in the synaptic vesicles to which they are affixed opening and releasing a neurotransmitter which transports the action potential across the synaptic cleft. The axon is surrounded by a fatty insulating matter called the myelin sheath, first observed by Rudolf Virchow in 1854, [Ndubaku and de Bellard, 2008]. Myelin is critical for the nervous system to function correctly and develops from glial cells, which give support, nourishment and protection to neurons and provide a stable environment. Along the myelin sheath, gaps roughly 1 µm long can be found at evenly-spaced intervals; these nodes of Ranvier allow for rapid conduction of the nerve impulse. Another type of cell found in axons is the Schwann cell, a non-neuronal cell which takes part in the myelination of the axon, removes cellular debris and provides navigational guidance to neurons, [Bhatheja and Field, 2006].

Dendrites, also from the Greek, meaning tree, are the branching structures which surround the soma. They allow signals (electrochemical stimulation) from other neurons to enter the soma. Dendrites combine the stimuli from many neurons in two different ways: temporally and spatially. Temporal integration involves summing the stimuli which arrive in quick sequence, whereas spatial integration involves summing the excitatory and inhibitory stimuli approaching from individual branches. The skin-like tissue surrounding the dendrites, the membrane, contains a rich supply of proteins, some of which weaken and others of which increase or amplify the stimuli. Sodium, calcium and potassium ions also play a role in modulating the input stimulus, affecting characteristics such as reaction time, electrical conductance, stimulation voltage and duration. These characteristics allow stimuli which originate from distal neurons to have the same amplification at the soma as stimuli coming from proximal neurons. Dendrites also have the ability to backpropagate stimuli, which plays an important part in synapse modulation and long-term potentiation (LTP) (the long-lasting change in the reactivity of a postsynaptic neuron). This backpropagation only occurs when an action potential is created. As the action potential travels down the axon, the soma becomes depolarised, causing voltage-gated calcium channels on the dendrites to depolarise and propagate a dendritic action potential, [Waters et al., 2005].

3.4.2 Computational models of neurons

Biological neurons behave differently depending on the input they receive and their purpose in the nervous system, and can produce many different varieties of output, [Paulin, 1998]. This section outlines computational models of neurons; in particular, the biological plausibility of each model will be highlighted.

The Hodgkin-Huxley model was developed by [Hodgkin and Huxley, 1952a,b,c, Hodgkin et al., 1952]. It is one of the most important and biologically plausible models of the biological neuron, describing how action potentials are created and propagated. It consists of a set of nonlinear differential equations which describe the ionic mechanisms in the squid giant axon, [Llinas et al., 1981, Mishra et al., 2006]:

C \frac{dv}{dt} = I_{NA} + I_K + I_L + I_{EXT} \qquad (3.2)

\tau_m \frac{dm}{dt} = -m + m_\infty \qquad (3.3)

\tau_n \frac{dn}{dt} = -n + n_\infty \qquad (3.4)

\tau_h \frac{dh}{dt} = -h + h_\infty \qquad (3.5)

where

I_{NA} = m^3 h G_{NA} (V_{NA} - v) \qquad (3.6)

I_K = n^4 G_K (V_K - v) \qquad (3.7)

I_L = G_L (V_L - v) \qquad (3.8)

where I_{NA} refers to the Sodium current; I_K is the Potassium current; I_L is the leakage current; I_{EXT} is the input current; m and n are the Sodium and Potassium activation gating variables respectively; h is the Sodium inactivation gating variable; m_\infty, n_\infty and h_\infty are constants; G_{NA} and G_K are the maximum Sodium and Potassium conductances respectively; G_L is the leakage conductance; V_{NA} and V_K are the Sodium and Potassium reversal voltages respectively; V_L is the leakage reversal voltage; and v is the membrane potential. With the many equations and parameters involved, it is a complex model to implement, and simulations are typically limited to a small number of Hodgkin-Huxley neurons for computational efficiency. For further information on this model see [Nelson and Rinzel, 1995].

The LIF model, [Stein, 1967], is one of the simplest, most computationally efficient and most popular models of a spiking neuron, enabling simulations of networks with up to thousands of neurons.


Figure 3.5: Leaky Integrate-and-Fire Neuron Model, [Gerstner and Kistler, 2002]

With this model, spikes are not created in a bio-physical manner; instead, action potentials are generated with a straightforward rule: if the membrane potential surpasses a threshold value, an action potential is generated and directly afterwards the membrane potential is reset to a predefined value, [Vogels et al., 2005]. The basic circuit of the LIF model includes a capacitor C in parallel with a resistor R, both driven by a current I(t), see Figure 3.5. The voltage v(t) across the capacitance is compared to the threshold \vartheta and, if v(t) > \vartheta, an output spike is generated and the membrane potential is reset to a new value, v_{reset}, [Gerstner and Kistler, 2002]. As well as resetting the membrane potential, the neuron enters a refractory period during which the membrane potential cannot build, [Vreeken, 2002]. The LIF model can be implemented with the following equation, [Gerstner and Kistler, 2002]:

\tau_{mem} \frac{dv}{dt} = -v(t) + R_{in} I_{syn}(t) \qquad (3.9)

where \tau_{mem} refers to the membrane time constant of the neuron, through which the potential leaks away, v is the membrane potential and R_{in} is the membrane resistance driven by a synaptic current I_{syn}(t). LIF neuron models display the biological neuron behaviours of tonic spiking, class 1 excitability and integration, [Izhikevich, 2004].

The spike response model (SRM) is a more simplified form of the classic LIF model where parameters rely on the time the last spike was fired, [Marian, 2002]. In this case, the neuron i can be depicted by a sole variable u_i, where u describes the voltage on the membrane. If \hat{t} is the time of the last spike, then for every time t > \hat{t} the neuron i can be described by the following equation:

u_i(t) = \eta(t - \hat{t}_i) + \sum_j w_{ij} \sum_f \varepsilon_{ij}(t - \hat{t}_i, t - t_j^{(f)}) + \int_0^{\infty} \kappa(t - \hat{t}_i, s) I^{ext}(t - s)\, ds \qquad (3.10)

where t_j^{(f)} are the spikes of presynaptic neurons j, w_{ij} represents the synaptic efficacy (weight), I^{ext} is the external current, and s = t - t_j^{(f)} in the equation. The functions \eta, \kappa and \varepsilon_{ij} are called response kernels, where \eta depicts the reset of the membrane potential after a spike at \hat{t}, \kappa portrays the membrane potential response to an input current, while \varepsilon_{ij} models the EPSPs or IPSPs. Unlike the LIF model, the threshold value \vartheta is not a predefined and fixed value; it depends on the time of the last spike, [Gerstner and Kistler, 2002].

Figure 3.6: Spike response model, from [Gerstner, 2001, Gerstner and Kistler, 2002]

Figure 3.6 illustrates the SRM neuron model, where each input spike causes the excitatory presynaptic potential to build until it reaches a threshold \vartheta, upon which an output spike is produced and the membrane potential resets.

The [Izhikevich, 2003, 2004] neuron model is a combination of the biological plausibility of the Hodgkin-Huxley neuron model with the computational capabilities of the LIF neuron. It can be described by the following differential equations:

\frac{dv}{dt} = 0.04v^2 + 5v + 140 - u + I \qquad (3.11)

\frac{du}{dt} = a(bv - u) \qquad (3.12)

\text{if } v \geq +30\,\text{mV, then } \begin{cases} v \leftarrow c \\ u \leftarrow u + d \end{cases} \qquad (3.13)

where v refers to the membrane potential and u is the membrane recovery variable, which describes the activation of K+ and inactivation of Na+ ionic currents respectively and provides negative feedback to v. If v is greater than 30 mV, then both v and u are reset according to Equation 3.13; 30 mV is not the threshold, it is the maximum value allowed for v, and the threshold can vary. The constant parameters a, b, c and d are chosen to ensure that v is in mV and t is in ms, and to configure different types of neuron behaviour.
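The Izhikevich equations above are simple enough to integrate directly. The sketch below is a minimal Euler integration of Equations 3.11-3.13; the parameters a = 0.02, b = 0.2, c = -65 and d = 8 are the regular-spiking values published in [Izhikevich, 2003], while the input current and step size are illustrative choices.

```python
def izhikevich(I, a=0.02, b=0.2, c=-65.0, d=8.0, dt=0.25, steps=4000):
    # Euler integration of Equations 3.11-3.13; v in mV, t in ms.
    v, u = c, b * c
    spike_times = []
    for step in range(steps):
        v += dt * (0.04 * v ** 2 + 5 * v + 140 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:           # 30 mV is the spike peak, not the threshold
            spike_times.append(step * dt)
            v, u = c, u + d     # reset rule of Equation 3.13
    return spike_times

spikes = izhikevich(I=10.0)  # tonic spiking under a constant input current
```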

3.4.3 Dynamic Synapses

It is now known that synapses of the neocortex are dynamic; neuron responses are not as simple as an operation that multiplies a postsynaptic input by a synaptic weight, rather the neuron response is a reaction to short-term input, [Denham, 2001, Shon and Rao, 2002]. There are two known types of dynamic synapses: facilitating synapses, which can be found between pyramidal neurons and inhibitory interneurons, and depressing synapses, which can be seen between pyramidal neurons, [Tsodyks et al., 1998, Abbott et al., 1997]. Facilitating synapses gradually use their synaptic resources and produce a sustained response. Depressing synapses consume all of their resources in the first few spikes and become unresponsive very quickly. Depressing synapses can be modelled using the following equations, [Tsodyks et al., 1998]:

\frac{dx}{dt} = \frac{z}{\tau_{rec}} - U_{SE}\, x(t_{sp}) \qquad (3.14)

\frac{dy}{dt} = -\frac{y}{\tau_{in}} + U_{SE}\, x(t_{sp}) \qquad (3.15)

\frac{dz}{dt} = \frac{y}{\tau_{in}} - \frac{z}{\tau_{rec}} \qquad (3.16)

These equations depict the inactive (x), active (y) and recovered (z) states of the synapse, where \tau_{rec} is the recovery time period, U_{SE} is a constant value which denotes the maximum amount of neurotransmitter which can be released after each presynaptic spike arrives, t_{sp} is the presynaptic spike arrival time, and \tau_{in} is the inactivation period, usually of a few milliseconds, [Mejías and Torres, 2007]. The postsynaptic current can then be determined using:

I_{syn}(t) = A_{SE}\, y(t) \qquad (3.17)

where the current is calculated as being proportional to the fraction of resources in the active state (y); A_{SE} is a constant value which represents the maximum postsynaptic current a synapse can produce, [Tsodyks et al., 1998, Mejías and Torres, 2007]. Equations (3.14 - 3.17) model a depressing synapse. Facilitating synapses need an additional equation:

\frac{dU_{SE}}{dt} = -\frac{U_{SE}}{\tau_{facil}} + U_1 (1 - U_{SE})\, \delta(t - t_{sp}) \qquad (3.18)

where \tau_{facil} is the facilitation time constant and U_1 is the initial value of U_{SE}, [Tsodyks et al., 1998].
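A minimal sketch of a depressing synapse follows, integrating Equations 3.14-3.17 with the Euler method; the time constants and amplitudes are illustrative values in the general range used in Tsodyks-style models, not parameters taken from this thesis.

```python
def depressing_synapse(spike_times, U_SE=0.5, tau_rec=800.0, tau_in=3.0,
                       A_SE=250.0, dt=0.1, duration=500.0):
    # Euler integration of Equations 3.14-3.17 (all times in ms).
    x, y, z = 1.0, 0.0, 0.0       # synaptic resource fractions, x + y + z = 1
    spikes = set(round(t / dt) for t in spike_times)
    current = []
    for step in range(int(duration / dt)):
        released = U_SE * x if step in spikes else 0.0  # used per spike
        x += dt * (z / tau_rec) - released
        y += dt * (-y / tau_in) + released
        z += dt * (y / tau_in - z / tau_rec)
        current.append(A_SE * y)  # Equation 3.17: I_syn proportional to y
    return current

# The response to a regular spike train weakens spike by spike (depression).
I_syn = depressing_synapse(spike_times=[10, 30, 50, 70, 90])
```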

3.4.4 Training algorithms

Up until quite recently it was thought that the important information processed by a neuron was conveyed by rate coding, i.e. the frequency or intensity of the input directly influences the frequency of the output of the neuron. However, it is now known that the timing of the action potentials or spikes within the input, known as temporal coding, can give additional information. In the last chapter, it was discussed how temporal coding exists within the auditory system, where low frequency sounds are phase-locked by the AN and bushy cells to allow the MSO to process ITDs. As with ANNs, for networks of spiking neurons to perform complex computations involving spatio-temporal data, the process of learning is crucial and, again as with ANNs, the main types of learning in SNNs are supervised, unsupervised and reinforcement.


3.4.4.1 Unsupervised Learning

In 1981, the BCM rule of unsupervised learning for SNNs was developed by Elie Bienenstock, Leon Cooper and Paul Munro, based on Hebbian-type learning, [Bienenstock et al., 1982]. BCM works on the rule that the firing threshold of the neuron is considered a function of postsynaptic activity; if postsynaptic activity is low the threshold moves to allow for potentiation and if postsynaptic activity is high the threshold moves to allow for depression. This learning rule enables stable learning in a network and also encourages competition between synapses. Evidence has been found for this type of learning in the neocortex and the hippocampus, in fact at some of the same locations for which evidence of STDP had been found, [Izhikevich and Desai, 2003]. The BCM function can be described as follows, [Bienenstock et al., 1982]:

\frac{dm_j(t)}{dt} = \phi(c(t))\, d_j(t) - \epsilon\, m_j(t) \qquad (3.19)

where m_j is the synaptic weight of the synapse j, d_j is the presynaptic input, c is the weighted presynaptic output, \phi is the postsynaptic activation function which can change sign at the threshold, and \epsilon is the time constant of decay, which is consistent throughout all synapses. Drawbacks of the model are that it necessitates both LTP and long term depression (LTD), which has not been found throughout the cortex, and it requires a dynamic threshold and some fixed constants. Positive attributes of the learning rule are that it brings stability to unsupervised learning and the uniform decay corresponds to the square of the output.

STDP, see Figure 3.7, occurs naturally in neurons and is a form of synaptic plasticity, i.e. the capacity for the synapse connecting two neurons to change strength, [Martin et al., 2000, Marian, 2002, Roberts and Bell, 2002, Legenstein et al., 2005]. It is a form of Hebbian learning where the timing of the presynaptic and postsynaptic spikes is important, [Bi and Poo, 2001]. Within a sliding time window, STDP strengthens the weights of the synapses that present spikes before the postsynaptic spike is generated and weakens those synaptic weights that present spikes after the postsynaptic spike is generated, [Izhikevich and Desai, 2003, Vogelstein et al., 2003]. Evidence for this type of learning has been found in neurons of the neocortex and hippocampus, [Izhikevich and Desai, 2003].

Figure 3.7: Spike-timing-dependent plasticity. (a) Diagrammatic image of the time difference between presynaptic and postsynaptic spikes in STDP; (b) synaptic modification learning windows based on repeated pairs of presynaptic and postsynaptic spikes, LTP and LTD are superimposed on the spike pairs; where EPSC refers to excitatory postsynaptic currents, from [Bohte and Mozer, 2005]

The STDP function is modelled as follows, as described by [Song et al., 2000]:

\Delta w = \begin{cases} A_+ \exp(\Delta t / \tau_+) & \text{if } \Delta t < 0 \\ -A_- \exp(-\Delta t / \tau_-) & \text{if } \Delta t \geq 0 \end{cases} \qquad (3.20)

where \Delta w is the weight change, A_+ is the maximum value of the weight potentiation, A_- is the maximum value of weight depression, \Delta t is the output spike time minus the input spike time, \tau_+ is the width of the window for LTP and \tau_- is the width of the window for LTD. This form of STDP produces bimodal weights and can have problems with stability and convergence. The learning algorithm was extended to perform a multiplicative form of STDP, which is modelled by the following equation, [Rubin et al., 2001, Gutig et al., 2003]:

\Delta w = \begin{cases} A_+ (1 - w)^{\mu} \exp(\Delta t / \tau_+) & \text{if } \Delta t < 0 \\ -A_- w^{\mu} \exp(-\Delta t / \tau_-) & \text{if } \Delta t \geq 0 \end{cases} \qquad (3.21)

where \mu is a non-negative value. Multiplicative STDP provides a more stable version of STDP.

To relate STDP and BCM, or to ensure that STDP behaves as BCM, the nearest-neighbour implementation of STDP is used. In this case, only one presynaptic and one postsynaptic spike are involved in the synaptic potentiation and depression rule. This implementation produces low activity when the synapse is depressing and high activity when the synapse is potentiating, similar to a BCM synapse. A relaxed version of this, the semi-nearest-neighbour implementation, where only one presynaptic spike but all postsynaptic spikes are involved, also produces behaviour akin to a BCM synapse. There are many other variations on the classical STDP model, such as suppression additive STDP, classical additive STDP with correlated spike trains, nearest-spike additive STDP and more. For a review of these modifications to classical STDP see [Izhikevich and Desai, 2003].

work learned the pattern with a learning rate of 0.01 using both inhibitory and excitatory neurons. It was then tested on interpolated XOR where the network could learn the input dataset with an accuracy of the order of the integration time step of the algorithm. Benchmark datasets were then applied, namely the Iris dataset, Wisconsin breast cancer dataset and Statlog Landsat dataset. Training accuracies for these problems were equivalent to a sigmoidal neural network and the algorithm always converged, which was not guaranteed with the ANN-type algorithms to which it was compared. Disadvantages of this learning rule involve the inability to change synaptic weights when the postsynaptic neuron does not re for any input; also only the rst spike is important in this algorithm, and therefore it can only be used in time-to-rst-spike coding strategies, [Kasinski and Ponulak, 2006].

70

Statistical learning methods for supervised learning in SNNs were developed in 2003, where training is controlled by the possibility of producing the desired behaviour.

It was rst studied by [Barber, 2003] who looked at

supervised learning for neurons in discrete time while [Pster et al., 2003, 2006] extended the work to the continuous case.

The aim of learning in

[Pster et al., 2003] is to optimise the weights between SRM neurons in order to increase the likelihood that the postsynaptic neuron will produce a set of desired ring times; therefore this rule can be considered as a type of probabilistic spike-based Hebbian learning. In [Pster et al., 2006], this work is continued and tested with experiments such as having diering supervision signals; enabling or preventing postsynaptic spikes ring when not desired; and the postsynaptic neuron producing a desired spike train only in response to a particular presynaptic response pattern. These experiments showed that the learning rule was capable of producing a spike train of desired spike times. However, since the postsynaptic outputs contained no more than two spikes it is uncertain whether this supervised learning algorithm can accommodate inputs and outputs of many more spikes, [Kasinski and Ponulak, 2006]. In 2004, linear algebra was used by [Carnell and Richardson, 2005] to train a neuron to produce specic spike times, i.e. spike trains with many spikes. First they describe a time series

S(t)

of spikes at predened times and a

w S(t) which gives a weight for each spike. The inner weighted time series product of two weighted time series and the metric norm of a weighted time series is then dened. These allow the projection of the rst weighted time series onto the second. These denitions are now used to prepare two algorithms which can produce a desired output and weights

w;

S d (t)

from a set of inputs

S in (t)

the Gran-Schmidt algorithm and an interative algorithm.

The Gran-Schmidt algorithm can nd the orthogonal basis for the subspace spanned by a set of inputs; which gives the best approximation in the subspace for any component of the desired output. The iterative algorithm nds the dierence between the inputs and the desired output; this is repeated until the dierence or error is minimal.

The iterative algorithm was used

successfully in two experiments, the rst was to demonstrate how the algorithm can change the values of the weights to allow the input to become close to the desired output; the second experiment tried to produce the desired output.

A minor drawback of this algorithm is that weight updates

are executed in a batch mode which is not appropriate for online learning,

71

[Kasinski and Ponulak, 2006]. Belatreche et al. employed an evolutionary strategy (ES) for training SNNs, [Belatreche et al., 2003, 2004].

ES is one of the most popular algorithms

for solving multi-faceted optimisation problems, [Cochenour et al., 2005]. Based upon Darwinian theories of evolution, it searches through a set of all possible solutions. For every generation of the algorithm new solutions are generated from older ones using evolutionary operators; the most common operator being mutation.

A tness measure determines the best solutions

and these are the only ones kept for the next generation.

In Belatreche's

network synaptic delays were also incorporated into the training. The feedforward and fully connected SNN consisted of SRM neurons which could re only once.

This algorithm proved successful in comparison to well known

classication algorithms, such as BP, LM and SpikeProp, when applied to non-linear seperable problems such as XOR and the Iris benchmark dataset. However this approach is time consuming, [Kasinski and Ponulak, 2006]. Scientic studies of the brain have shown that the temporal performance of neurons within the brain is very accurate; an example of this has been discussed in the previous chapter where a sound signal reaches each ear at a dierent time but can be linked at the MSO to relate information about the location of the sound. The stimulus passes through a chain of neurons to the point where it becomes meaningful; this is termed a synre chain, [Kasinski and Ponulak, 2006]. In SNNs, a synre chain is a feedforward network with multiple layers where spiking activity travels through the network from layer to layer. Each layer needs input from each neuron in the previous layer to propagate the signal onwards. In 2001, Sougne created an SNN version of the synre chain called INFERNET which can learn a single ring time, [Sougne, 2001]. The network consists of clusters of nodes called subnets; each node can be one of two states: on, where they can re and o, where they are inactive. Within each subnet the nodes are fully connected, however each subnet does not connect to every other subnet in the network. Learning within the synre chain involves recreating the temporal relationship between two successive inputs, e.g. at time

49

to link two nodes

a

and

g

where

a

res at time

0

and

g

involves nding the chain of nodes which link them causing

res

g

to

re at the appropriate time. Weight updates in this approach are similar to Hebbian learning; the weight update is based on a learning window which is dened by the time dierence between the presynaptic and postsynaptic

72

neurons ring and a synaptic delay. Experimental results showed that the network was able to create a synre chain causing the desired node to re at the appropriate time. However, the desired node also red at earlier times than the appropriate time, [Kasinski and Ponulak, 2006]. Hebbian learning for ANNs is one of the oldest and most popular learning algorithms. In 1997, Ruf and Schmitt extended this rule to work with a single synapse from a LIF neuron and two presynaptic and one postsynaptic spikes ring during each epoch, [Ruf and Schmitt, 1997].

The rst presynaptic

spike represents the input and the second is considered a teaching signal which provides a target time for the postsynaptic spike.

The change of

weights in the network is modelled by:

4w = η(tout − tdes ) where

η

is the learning rate which is greater than zero,

(3.22)

tout

des is the desired ring time. the postsynaptic spike and t

is the time of At the end of

out successfully converges to tdes . The algorithm was extended to training, t cope with many more presynaptic inputs to produce a target set of weights relating to the dierence between the presynaptic and postsynaptic ring times, [Kasinski and Ponulak, 2006]:

 ∆w = η(tdes − tin ), 1 ≤ i ≤ n, i i normalise the resulting weight vector w, such that kwk = 1. where

tin

refers to the input and

i

to the presynaptic neurons.

(3.23)

In 2005,

Legenstein et al. developed a supervised version of Hebbian learning (SHL) based on STDP where a teaching signal or spike train performs the supervision ensuring the output neuron only res at predened times, [Legenstein et al., 2005].

They used this rule with a LIF neuron model and dynamic

synapses to perform experiments which entailed varying selections of uncorrelated and correlated inputs along with a pure and noisy teaching signal. Only the weights on the excitatory connections were changed by the learning rule. SHL was able to produce the desired outputs satisfactorily. Legenstein et al. reported some drawbacks to SHL; generalisation (achieving satisfactory accuracy during both training and testing phases) can be problematic

73

Figure 3.8: ReSuMe learning windows where amplitude of change depends on the ring times of the training signal and the postsynaptic spikes, adapted from [Ponulak, 2005]

and the network overtrains, i.e. when the optimal weight vectors have been found SHL continues to change parameters. However, of all the supervised training algorithms discussed heretofore, SHL is the rst method which can learn a desired output spike train from an input, [Kasinski and Ponulak, 2006]. Remote supervision method (ReSuMe), [Ponulak, 2005], learning is similar to SHL, [Hebb, 1949, Ruf and Schmitt, 1997] in that a supervisory training signal is used to supervise the training. However ReSuMe, unlike SHL learning, does not feed these target signals directly to the current learning output neuron but still controls the update of the synaptic ecacies on the active connections leading to the learning output neuron. Hence, the name

remote supervision.

The goal of the ReSuMe learning algorithm is to train an

SNN to produce a desired output in response to a given input. The learning rule modies the synaptic weights using a remote supervisory signal. This modication is done using Equation 3.24, [Ponulak, 2005]:

  Z ∞ d d d d d in d d wki (t) = S (t) a + W (s )S (t − s )ds + dt 0 S

out

 Z out (t) a +



W

out

out

(s

in

out

)S (t − s

)ds

out

 (3.24)

0 d d dt wki (t) is the rate of change of the weights over time; S (t) is the rein out (t) is the mote supervisory signal; S (t) refers to the input spike trains; S

where

74

actual output of the output neurons;

ad

and

aout

are the amplitudes of the

non-Hebbian processes of weight modications and

W d (sd )

and

W out (sout )

are the learning windows themselves; the learning windows can be seen in Figure 3.8. ReSuMe can be considered a biologically plausible learning rule as it is based on Hebbian learning and evidence has been found for remote supervision in biological synapses, [Kasinski and Ponulak, 2006].

ReSuMe

was tested in many experiments to determine that it can learn desired output spike trains. One such experiment involved the ReSuMe learning rule applied to a LSM composed of 800 LIF neurons, [Ponulak, 2005]. Both the input and desired spike trains were generated randomly and the network successfully produced an output almost identical to the desired signal. For more information on the ReSuMe learning algorithm see [Kasinski and Ponulak, 2005, Ponulak and Kasi«ski, 2006, Ponulak and Kasinski, 2006, Ponulak, 2006].

3.4.4.3 Reinforcement Learning Reinforcement learning in SNNs involves hedonistic (reward-seeking) synapses. The idea is that synaptic potentation and depression result when a synapse is rewarded or punished. This can be related back to mammalian learning where a mammal is more likely to repeat a learned action if a reward or punishment is applied immediately after learning, [Seung, 2003].

A standard

global reinforcement learning rule is dened by [Urbanczik and Senn, 2009]:

∆wiv = η(R − 1)Eiv (T ) where

i refers to each synapse and v

to each neuron;

(3.25)

η

is the learning rate;

R

is the global reward which provides feedback on the output and can be either

1

for a correct output or

−1

for an incorrect output;

Eiv (T )

is an eligibility

trace which keeps track of what happens at the synapse of each neuron. Global reinforcement learning can be unreliable as each output neuron is rewarded or punished based on the performance of every other neuron in the network. The rule can be altered quite simply to cater for this, [Urbanczik and Senn, 2009]:

∆wiv = η(rv − 1)Eiv (T )

75

(3.26)

where each individual neuron receives its own form of reward or punishment. Numerous researchers have used reinforcement learning in some way for training SNNs, for more information on this topic see [Seung, 2003, Xie and Seung, 2004, de Queiroz et al., 2006, Farries and Fairhall, 2007, Baras and Meir, 2007, Urbanczik and Senn, 2009].

3.5 Receptive Fields Receptive elds have been found in sensory neurons of the auditory, somatosensory and visual systems.

The receptive eld of a sensory neuron

transforms the ring of that neuron depending on its spatial input, [Paulin, 1998].

Usually there is an inhibitory region surounding a receptive eld

which suppresses any stimulus which is not ltered by the bounds of the receptive eld.

Receptive elds in the auditory system are thought to be

located throughout the auditory pathways; each region consists of receptive elds representing an area of space which responds selectively to sound frequency, see Figure 2.12. It is also thought that auditory receptive elds are sensitive to ITD, IID and other monaural spectral cues, [Zella et al., 2001, Pena and Konishi, 2001, Tollin and Yin, 2002b,a]. In this work, receptive elds take the form of a Gaussian function:

kij = e−((xm −yo )/dm ) where

kij yo

(3.27)

is a scalar variable which will modify the output spike train fre-

quency of the related neuron, eld,

2

xm

is the operating frequency of the receptive

is the input spike train frequency to the receptive eld and

denotes the width of the receptive eld.

dm

There are many computational

techniques for conguring receptive elds such as symmetrical placement, mean approach, median approach and clustering. For information on these techniques see [Abdelbar et al., 2006].

3.6 State of the Art in Sound Source Localisation Modelling There are many dierent methods that have been used for the development of sound source localisation systems. The main techniques fall into the fol-

76

lowing categories: geometry, cross-correlation, probability, statistics, signal processing, ANNs, fuzzy neural networks and SNNs. This section involves a review of the work that has been published in this area; to review this area in its entirety is impractical in this thesis, therefore this section describes research from each of the prementioned categories during the period from the late eighties to the present time, with an emphasis on ANN and SNN techniques. In 1988, Sugie, Huang and Ohnishi developed a novel system for localising multiple sound sources using geometry, [Sugie et al., 1988]. The system consisted of three microphones, three bandpass lters to process the input from the microphones, and a personal computer to perform the computation. The output from each bandpass lter is an almost pure tone; this enables the ITD to be computed using the zerocrossing method (the time at which the wave crosses zero, from [Cavaco and Hallam, 1999], who also use zerocrossings in their azimuth estimation system).

The three microphones and the sound

source are coplaner and their coordinates are

(ds , 0)

respectively.

If

ds

(x1 , y1 ), (x2 , y2 ), (x3 , y3 ),

and

is larger than the intermicrophone distance, the

ITD between two out of the three microphones

i

and

j

can be determined

using the following equation:

∆t0ij = −(xi − xj )/C where

∆t0ij

refers to the ITD,

i, j = 1, 2, 3,

and

(3.28)

C

is the speed of sound.

Knowing the ITD between two microphones, it can now be estimated what the single sound source azimuthal angle is by nding the ITDs for all three microphones:

E(θ) =

X

(∆tij − ∆t0ij )2

(3.29)

i,j To localise multiple sound sources, the onset of each sound is used. Again using the zerocrossing method, a large number of ITDs are found and modelled using histograms.

The peaks of the histograms are measured, these

relate to the real ITDs. To distinguish the ITDs of multiple sound sources the following constraint is employed:

∆t12 + ∆t23 + ∆t31 = 0 77

(3.30)

where

∆t12

is the ITD between microphones 1 and 2, and so on.

Once

the ITDs are calculated the onset of each sound can be determined, the location of each individual sound source can be found using Equation 3.29. These methods were tested in an anechoic chamber with satisfactory results. Many other researchers have used geometrical techniques for the extraction of binaural cues in the development of sound localisation systems, see [Huang et al., 1999, Nakadai et al., 2000, 2002, 2003, Handzel et al., 2003, Li et al., 2009] [Martin, 1995] used statistics for sound localisation involving the use of a statistical estimator to determine both azimuth and elevation. This model involves an input of HRTF measurements which pass through a lter bank to simulate the frequency analysis capability of the cochlea.

Amplitude-

modulation envelopes are emitted from the cochlea lter bank and tested for on-sets.

If present, the time and relative intensity of energy peaks in

the sound signal are noted. This information is then used by an interaural dierences estimator to model the precedence eect. Interaural dierences are used to generate a spatial likelihood map; the global maximum of this likelihood map is attributed to the maximum likelihood position estimate, i.e.

the sound source location.

Results presented show the model rarely

gives a localisation error of more than 5°; however it is indicated that this is probably due to the absence of internal noise in the model. Similar work models the spectral analysis of the cochlea with a lter-bank and extracts localisation cues from the resulting energy patterns, [Chau and Duda, 1995]. Maximum-likelihood methods are used to obtain the elevation estimates. Results show that the model works very well to generate the azimuthal angle with accuracies of up to 2°; elevation accuracies were quite low when tested on monaural cues but signicantly increased when binaural cues were used. Also, elevation errors were good when the sound source was in front, but as the sound source moved to the back, performance dropped considerably. Similar work to this can be found in [Dabak, 1990]. [Ono and Ando, 2001] created a neuromorphic sensor modelled on the auditory mechanisms for sound localisation of the barn owl. The sensor has two micophones sensitive to both azimuthal and elevational cues and reectors which represent the ears of the barn owl. The cochlea is portrayed by an 8 channel constant-Q Gabor lter bank which processes each sound frequency seperately between 3 kHz and 9 kHz; although it should be noted that these

78

frequencies are rather high to be processed using the azimuthal cue of ITD. The azimuthal cues of ITD and IID are decoded using a neuromorphic signal processing algorithm. The decoded ITDs and IIDs are mapped onto both azimuthal and elevational directions using a look-up table which is calibrated by experimentation. Experimental results show that the sensor can locate a band-limited Gaussian noise sound signal with a localisation error of no more than 3° for both azimuth and elevation. Cross-correlation measures the similarity of two waveforms when a time delay is applied to one of them. It is used by [Valin et al., 2003] in sound source localisation for robot mobility to determine the time delay of arrival (TDOA) between signals received by microphones. Multiple microphones are used to balance out the high complexity of the human auditory system; it is not limited to two microphones as this makes it very dicult to determine if the sound is coming from the front or the back of the robot. When the TDOA is estimated, geometrical calculations are then employed to compute the sound source location. Results show a precision of 3° at a distance of between three and ve meters. The system does not need any noise cancellation techniques and works on short-duration noises. Cross-correlation is again employed to enable a robot to locate and orientate towards a sound source, [Murray et al., 2004]. Once more TDOA is utilised to determine the time dierence of the sound wave arriving at the two microphones. Cross-correlation determines the point at which the signals received at two microphones most closely match, i.e. starting with an initial azimuthal angle of 0°, then by moving one of the signals across the other, a correlation vector can be created of dierent delay times. Trigonometry functions of right angled triangles are then used to determine the angle of location.

This system was put on a robot and

tested on real-world pre-recorded sounds. Results indicate that this system can localise accurately with an error of no more than

±5°;

however currently

there are problems with the system having the ability to determine if the sound is coming from in front of, or behind the robot. For information on other sound localisation systems based on cross-correlation see [Guentchev and Weng, 1998, Calmes, 2002, Nakashima et al., 2003]. [Willert et al., 2006] developed a biologically-inspired system which calculates the azimuthal angle for a sound source using probability theory. Using both preceding and current stimulus information, the estimated azimuth is updated at every time step. The input to the system is in the form of sound

79

waves which are converted to cochleagrams by a lterbank which represents the cochlea and is composed of a set number of frequency channels.

The

cochleagram is a matrix of amplitude values for every CF over the length of the input in time. The cochleagrams are correlated to produce the ITD and IID maps. Each map is compared against prelearned maps of ITD and IID to produce two further likelihood maps representing the input at an azimuthal angle for each CF and for each binaural cue at a particular timestep.

To

further increase the correctness of these likelihood maps, external knowledge about mammalian sound localisation is used. For example, probability values for the ITD cue can be decreased if the sound is of a high frequency. The two likelihood maps for ITD and IID are merged using marginalisation to produce one likelihood distribution.

This nal map is propagated over

time using Bayes' theorem to get the posterior distribution from which the azimuthal angle can be generated; this estimated output will be improved upon with every timestep.

The system was tested on audible signals, for

example, human speech signals, from angles in the range

±90°

with a reso-

lution of 15°. When testing occurred in an anechoic room, the localisation accuracy of the system was 98.9%; when testing occurred in a reverberant room, the localisation of the system was 68.69% with a resolution of 15°, which increased to 87.9% when the resolution decreased to every 30°. In the same year Keyrouz, Naous and Diepold used the inverse lters of a dataset of HRTFs with a correlation factor for sound localisation, [Keyrouz et al., 2006b]. The HRTFs are reduced using three dierent methods, diuseeld equalisation (DFE), balanced model truncation (BMT) and principal component analysis (PCA) to increase the speed of the model, for a description of these methods see [Huopaniemi and Karjalainen, 1996]. The three reduced HRTF datasets are Fast Fourier Transformed (FFT) and inverse ltered, i.e.

the inverse dataset is equal to

1/reduced dataset.

A corre-

lation factor and minimum distance measure of the left and right datasets are computed, if they are the same the correlation would be 1 and the minimum distance would produce a minimum value. When tested on a simulated sound signal, DFE achieved a localisation accuracy of 96%, BMT localised to an accuracy between the range of 53% to 92% and PCA achieved localisation results of between 42% and 91%. However it was noted that all of the misclassications did actually localise to angles close to the target angle. Another experiment involved sound produced in a reverberant room using

80

two microphones, this time only the BMT and PCA methods were tested. All the output angles were either the target angle or again very close to the target angle. This work was continued, with changes made to the inversion method, [Keyrouz et al., 2006a].

Instead of simply inverting the datasets

as described above, a state space inversion method is used.

Repeated ex-

periments with simulated sounds produced superior results; the localisation accuracy of the DFE method increased to 99% and PCA improved but still had a wide range of between 55% and 97%. The experiment where sound was produced in a reverberant room was also repeated with similarly increased classication accuracies.

3.6.1 Articial Neural Network Methods The vast majority of research in biologically inspired sound localisation involves the use of ANNs which aim to model the interconnecting system of neurons in the auditory pathway. In 1991, Palmieri, Datum and Shah used an ANN to imitate the sound localisation behaviour of the owl, [Palmieri et al., 1991a,b]. Inputs to the system involved both binaural time and intensity cues to determine the azimuth and elevation of a sound source. The neural network has three layers and was trained with the multiple extended Kalman algorithm. The error was determined by nding the dierence between the estimated azimuthal position produced as output from the ANN against the actual position which was measured with an ideal optical sensor. Using simulated input data sets, the average error produced was 1.86° and 0.81° for the azimuth and elevation respectively. In 1993, Backman and Karjalainen developed a backpropagation trained ANN which takes two input vectors

C

and

R

respectively, [Backman and Karjalainen, 1993].

relating to the ITD and IID

C

was generated using cross-

correlation between the left and right ear signals.

To produce

R,

the left

and right ear signals were Hamming windowed, Fourier transformed, and converted to the Bark scale, from which loudness ratios for each pair of left and right signals were created.

Data from both an anechoic chamber and a reverberant environment were used to test the network, and many experiments were outlined. Initial results showed that the performance of the network increased as the number of hidden layer neurons increased. Extremely accurate results were obtained when the network was trained with data from the anechoic chamber; only slightly less accurate results were achieved when the network was trained with data from a reverberant environment. However, when the network was trained with a combination of the two, the results were significantly worse. The type of input data also produced differing results; noise or pulse inputs produced considerably better results than music or speech inputs.

To read more on these experiments and other backpropagation-trained ANNs for sound localisation see [Backman and Karjalainen, 1993, Anderson et al., 1994]. In 1994, Lim and Duda extracted interaural time and intensity differences from a cochlear model, Lim et al. [1994]. Cross-correlation was used to compute the ITD while a logarithmic function was used to determine the IID. The azimuth and elevation of a sound sample were found using the nearest-neighbour approach and a set of reference vectors. Results presented showed an average absolute error of 0.8° in azimuth and 16° in elevation. This work was extended in 1995, Chau and Duda [1995], to include monaural spectral cues. They showed that using these monaural cues can increase the accuracy of binaural localisation significantly. Neural networks have been used to some degree to model the cochlear nuclei in the brain; other computational models of auditory neurons were discussed in Section 3.2.
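The cross-correlation ITD estimate used by Lim and Duda, and by several of the other models reviewed here, can be sketched in a few lines; the function below is a generic illustration, not their code, and the sign convention is an assumption.

```python
import numpy as np

def itd_by_cross_correlation(left, right, fs):
    """Estimate the ITD as the lag (in seconds) that maximises the
    cross-correlation of the two ear signals; a positive value means
    the left signal lags the right (sign convention illustrative)."""
    corr = np.correlate(left, right, mode="full")
    lag_samples = int(np.argmax(corr)) - (len(right) - 1)
    return lag_samples / fs
```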

[Sheikhzadeh and Deng, 1999] created a three-layer feed-forward ANN of the DCN. They first created a model of the basilar membrane, the IHC and the action potential generator. The basilar membrane model was built on the biophysical mechanisms behind basilar membrane vibration. This gave a dynamic, nonlinear basilar membrane filter function, as opposed to previous models which used simple linear digital filters. The spiking action potential generator produced random sequences of AN action potentials to act as input to the DCN model. The model was tested with both synthetic and natural speech sounds and processed these sounds in a similar way to a true biological auditory model. In 2000, an ANN was applied to localise sound using more biological rather than simulated data, [Alim and Farag, 2000]. The HRTFs of five human subjects were used to train and test the network. The IID was determined by calculating the SPL at each ear for every sound frequency used in the training. The ITD was calculated by correlating the right and left ear HRTF data. The neural network has four layers and was trained with back-propagation. Results show an error of 25%, yet when compared against localisation tests


carried out on human subjects, both sets of results were very similar. For similar research employing ANNs and HRTFs see [Chung et al., 2000, Hao et al., 2007]. In the previous section, it was discussed how [Murray et al., 2004] created a sound source localisation system for a robot using cross-correlation and trigonometry functions. This work was extended to include prediction of the next position of a moving sound source, [Murray et al., 2005]. A four-layer recurrent neural network with an Elman architecture was implemented; the input is the estimated azimuthal angle from [Murray et al., 2004] at two sequential timesteps, which activates a particular input neuron. Standard backpropagation is used to train the network, with a stopping criterion based on the sum of the squared error, i.e. the difference between the actual and desired output. This estimation method was tested on pre-recorded sound files. However, even though the azimuth-generation stage of the network again produced satisfactory results with a maximum error of 17.5%, the predictor stage classified correctly less than 50% of the time. It is thought that this is due to the incorrect azimuthal classifications causing the predictor network to forecast the wrong location. More recently, Murray et al. [2009] built on this work to produce a robotic sound source localisation model which would be active within a noisy environment and could also interact with a human operator.

As with much of the sound localisation modelling research, the authors found it difficult to benchmark their work against similar models. However, they do report accuracies of ±1.5° for angles around 0° and ±7.5° for angles around ±90°, which is close to human sound localisation accuracies. In 2008, Keyrouz and Diepold continued their previous work, [Keyrouz et al., 2006b,a], by combining fuzzy logic with ANNs to localise, with high accuracy, angles which were not presented to the network during training, [Keyrouz and Diepold, 2008]. Sound signals in an anechoic environment are short-time Fourier transformed, multiplied with the spectrum of an HRTF database to simulate the sound coming from different directions, and passed through a filterbank with centre frequencies in the range of 160 Hz to 20 kHz to simulate the processing ability of the cochlea. The ITD and IID cues are then extracted as follows:


ITD_j = \frac{\mathrm{mod}_{2\pi}\left(\arg(X_{Li}(\omega_j)) - \arg(X_{Ri}(\omega_j))\right)}{\omega_j}        (3.31)

IID_j = 20\log_{10}\frac{X_{Li}(\omega_j)}{X_{Ri}(\omega_j)}        (3.32)

where X_{Li} and X_{Ri} are the signals from the left and right ears and ω_j is the centre frequency of that signal. These cues are used as input to the feed-forward ANN governed by a fuzzy logic rule base to determine both elevation and azimuth.

Results show that the system can localise broadband noise with accuracies similar to those of humans. Fuzzy neural networks have previously been applied to the development of sound localisation systems; see [Nandy and Ben-Arie, 1996] for further details.
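A direct transcription of Equations 3.31 and 3.32 might look as follows, assuming one complex spectral value per filterbank channel and centre frequencies supplied in Hz (converted to rad/s); the function and variable names are illustrative.

```python
import numpy as np

def binaural_cues(x_left, x_right, centre_freqs_hz):
    """Per-channel cues following Equations 3.31 and 3.32.
    x_left, x_right: complex spectral values, one per filterbank
    channel; centre_freqs_hz: the channels' centre frequencies."""
    w = 2.0 * np.pi * np.asarray(centre_freqs_hz)             # rad/s
    phase = np.angle(x_left) - np.angle(x_right)
    itd = np.mod(phase, 2.0 * np.pi) / w                      # Eq. 3.31
    iid = 20.0 * np.log10(np.abs(x_left) / np.abs(x_right))   # Eq. 3.32
    return itd, iid
```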

3.6.2 Spiking Neural Network Methods

More recently, SNNs have been used for developing sound localisation systems inspired by neurophysiological studies on the functionalities of specific auditory neurons. In 1996, Gerstner et al. [1996], Kempter et al. [1996] used integrate-and-fire neurons to model the high-precision temporal processing of the barn owl auditory pathway. Before learning, the model of the nucleus magnocellularis neuron is converged upon by many auditory input fibers and no phase-locking information can be derived. After unsupervised Hebbian learning is applied, many of the connections have disappeared and those connections which remain are stronger. The remaining connections are also tuned to the frequency of the signal which stimulated the neuron during training. During learning, if the presynaptic and postsynaptic spikes occur within a predefined window, the synaptic efficacy is changed by a small amount. Depending on whether the presynaptic spike was active before or after the postsynaptic firing, the synaptic efficacy can be increased or decreased. Binaural input was then considered, relating to the input from both ears with a fixed ITD. The purpose of the learning rule in this instance is to select the appropriate synapses which enable the neuron to produce a phase-locked output stimulus and thus produce the maximal output firing rate for the exact ITD used during training, i.e. if the neuron is presented with binaural inputs with a different ITD than that presented during training, the neuron will not exhibit phase-locked outputs and thus will not produce a maximum output firing rate. Finally, a group of these neurons with individually tuned ITD responses is stimulated by a tone with an unknown ITD. The authors estimated that about 100 neurons are required to estimate the ITD with a temporal precision of 5 µs, i.e. the temporal precision of the barn owl auditory system can be achieved. In 2001, Leslie S. Smith used depressing synapses to detect the onsets in a phase-locked sound signal, [Smith, 2001]. The onsets can be used to measure the ITDs and thus to perform sound source localisation.

An audio signal was converted into a spike train by passing it through a cochlear filter, a half-wave rectifier, and a logistic spike generation function. The spike train was routed to a depressing synapse which produced one spike at the onset of the spike train. Digitised sound signals were played in the presence of a model head with two microphones positioned in the ear canals. The signals were played from -70° to 30° with a resolution of 10°, at frequencies of 220, 380, 660, 1000, 1500, 2250 and 3250 Hz with linear rise times of 10 ms. The ITDs were estimated as the time difference between the onsets at the right and left ears. As the spikes were generated randomly, the simulation was run multiple times to find a mean ITD. This method showed that ITDs can be estimated, with the best accuracies found for lower angles at higher sound frequencies. [Schauer et al., 2000] have based their work extensively on the Jeffress sound localisation model.

Their initial research involved a biologically inspired model of binaural sound localisation, again by means of the ITD, using an SRM for implementation in analog VLSI. A slight modification to the Jeffress model was made, including a digital delay line with AND gates; the network produces an output based on the winner-takes-all approach. Data recorded in an open environment was used in testing, which was carried out offline; results showed that the model was proficient at localising single sound sources for sixty-five azimuthal angles. [Schauer and Gross, 2001] extended this work to discriminate between sound sources of different orientations. However, this was achieved in a biologically implausible way. The authors simply specified one microphone for the front and the other for the back. Differences in the sound colour of the binaural signals, calculated using a short-term FFT, determined from which direction the sound approached. Again, positive results were achieved during testing in open environments. [Schauer and Gross, 2003] continued this work by developing a computational model


for early auditory-visual integration, with the aim of developing a robust multimodal attention mechanism in artificial systems. They combined their auditory model from [Schauer and Gross, 2001] with a visual and a bimodal map, the visual map being based on spatio-temporal intensity differences. To test their model they combined recordings of real-world situations and off-line simulations. The authors perceived their model as a benchmark for future research in audio-visual integration. In 2007, BiSoLaNN was developed with functionality based on the ITD auditory cue, [Voutsas and Adamy, 2007]. The network can be described as a cross-correlation model of spiking neurons with multiple delay lines and both inhibitory and excitatory connections. Also developed was a model of the cochlea, IHCs and coincidence neurons. The coincidence neurons cater for the range of sound frequencies used and the different ITDs; they were tuned by an evolutionary-algorithm-type method to ensure that this was the case. The system was tested on pure-tone sound signals between 120 Hz and 1240 Hz which were recorded in an anechoic chamber using the Darmstadt robotic head. The localisation accuracy for these frequencies was 59%; signals coming from in front localised better than those at the sides of the robotic head. When tested for lateralisation, i.e. whether the sound was coming from the right or left side of the head, an accuracy of almost 90% was achieved. In the same year, Poulsen and Moore demonstrated how SNNs could be combined with an evolutionary learning algorithm to facilitate sound localisation, [Poulsen and Moore, 2007]. It involved a simulation in a 2-dimensional environment wherein multiple agents possess an SNN which controls their movements based on binaural acoustical inputs. The evolutionary learning algorithm is employed to evolve the connectivity and weights between neurons. An SRM was selected for each neuron in the network; the agents only take pulse signals as input. Based on the position of the ears relative to the sound source the ITD input is determined, i.e. the time difference between the pulse signals arriving at each ear. Calculation of the IID input is greatly simplified: the signal is given a base strength, and each ear receives a signal strength equal to the base strength divided by the distance from the source to that ear, so the closer ear receives the stronger signal. The evolutionary training algorithm involved updating an agent's fitness score if it moved closer (increase) or further


away (decrease) from the source. After training, most agents were able to localise single sound sources; however, this ability decreased when multiple sound sources were tested. Further research on SNNs for sound localisation outlined the development of an auditory processing system to provide live sound source positions, utilising both ITD and IID to localise a broadband sound, [Liu et al., 2008a]. Input involves the sound from two microphones passing through a Gammatone filterbank, which splits it into a number of frequency channels that are then encoded as phase-locking spikes. These spikes then take two routes, through the ITD and IID pathways. The ITD pathway consists of a LIF neuron which produces spikes relating to the time difference between the two inputs. The IID pathway uses a logarithmic ratio which computes the intensity difference and produces a spike based on this value. The results of the ITD and ILD pathways are multiplied by a weight array to produce a map of ITDs and ILDs. Only the ITD map produces an angle of location; however, the IID map indicates whether the sound came from the left or right. They tested their network on artificial sound using the angles -90° to 90° in steps of 30° and found the highest localisation accuracy at 0°; this accuracy decreases as the sound moves to the sides. Overall localisation efficiency is 80%, and distinguishing on the subset of angles -45°, 0° and 45° they achieved an efficiency of 90%. Further work involved replacing the logarithmic ratio mathematics of the IID pathway with a neuron model to determine whether the sound comes from the right or the left, [Liu et al., 2008b].

As before, the generated ITD gives the azimuthal angle and the ILD determines whether the sound came from the left or right, removing ambiguity from the ITD result. The correct ITD is chosen from the map using the winner-takes-all or weighted mean method. Finally, the azimuthal angle is determined by the following equation:

\theta = \arcsin(ITD \times V_{sound} / d_{ear})        (3.33)

where V_{sound} refers to the speed of sound and d_{ear} is the distance between the two microphones on the robot.
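A minimal sketch of Equation 3.33 follows; the speed of sound and microphone spacing are illustrative values, not those of the robot in [Liu et al., 2008b], and the clip guards against estimated ITDs marginally outside the physical range.

```python
import numpy as np

V_SOUND = 343.0   # speed of sound in air, m/s (assumed)
D_EAR = 0.18      # microphone spacing, m (illustrative)

def azimuth_from_itd(itd_s):
    """Equation 3.33: azimuth in degrees from an ITD in seconds."""
    x = np.clip(itd_s * V_SOUND / D_EAR, -1.0, 1.0)
    return np.degrees(np.arcsin(x))
```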

Experiments were performed on the network using artificial sound and real tones, again with the same set of angles. Similar results of 80% occurred for the artificial sound across all angles; between the angles -45° and 45° they achieved an efficiency of 95%. Testing on a real sound gave 65% accuracy across all angles. In 2009, this work was extended again by incorporating a model of the IC into the network, [Liu et al., 2009]. The ITD pathway appears unchanged; however, the calculation of IIDs has reverted to a logarithmic ratio, the result of which causes a corresponding neuron to fire in the IC layer. The calculated ITD and IID spikes are then merged at the IC model. The input ITDs and IIDs are weighted using conditional probability; the IIDs are presented twice to the IC neuron, once as an excitatory stimulus and once as an inhibitory stimulus. The inhibitory stimulus represents azimuthal angles from the opposite hemisphere of the sound source, which regulates the ITD input. The IC was designed to produce either a sustained-regular output with no IID inhibition or an onset-type output with IID inhibition. The model was tested on two sound sources: the first a 500 Hz pure tone which originated from -90° in all cases; the second, speech and white noise originating from angles between

±90°. The sustained-regular IC neuron accurately localised the speech and white noise but failed to localise the pure tone. However, the onset IC neuron produced better accuracies for both types of sound. This section has outlined the many different methods that have been used for the development of sound localisation systems. All of these techniques involve some, all, or a variation of the following attributes:

• An implementation approach, which can range from purely computational techniques, such as geometry and probability, to biologically inspired techniques, such as SNN models.

• The type of data processed by the models, i.e. pure tones, broadband sounds, HRTF measurements, and whether it is simulated or experimentally derived.

• The granularity of the angles to be localised, i.e. a finer resolution of angles provides more angular accuracy if the model can localise successfully.

• The range of sound frequencies used for testing the sound localisation systems, and whether the appropriate sound ranges were used in regard to the two binaural cues of ITD and IID, i.e. whether the ITD binaural cue was used to localise low frequency sounds and the IID binaural cue used to localise high frequency sounds.

• Whether a learning algorithm was involved to train the system to localise incoming sounds to angles of location.

By comparing the reviewed sound localisation models against these attributes, it became clear that a novel contribution could be made in this area by developing a sound localisation model combining the following attributes: a biologically inspired implementation approach based on the mammalian auditory pathways, i.e. models of spiking neurons; using experimentally derived HRTF data generated from adult domestic cats; performing sound localisation with a fine resolution of angles, i.e. localising to every 10°; performing sound localisation across a wide range of sounds including low, medium and high frequencies; processing the two binaural cues of ITD and IID and using them to successfully localise sounds in their appropriate frequency ranges; and utilising a biologically inspired learning algorithm which can train the networks of spiking neurons to localise the HRTF data to angles of location. Chapter 7 will compare and contrast the experimental results achieved by this work with the more closely related research in the literature using the above attributes.

3.7 Conclusion

In this chapter, the different components required for the modelling of a sound localisation system have been discussed. A brief review of the cochlear and auditory modelling research field was provided, with a summary of the auditory periphery model used in this work. Both research areas of ANNs and SNNs were examined with a focus on network topologies, computational neuron models and learning algorithms. The chapter concluded with a discussion of the many different methods previous researchers have employed for the development of sound localisation systems. Many of the approaches outlined above rely on purely computational or signal processing techniques to facilitate sound localisation. The ANN and SNN approaches are more biologically inspired, and when trained with real experimental data their biological credibility increases even further. The aim of this research is to develop networks of spiking neurons with topologies inspired by the auditory pathways to emulate the way in which mammals localise sounds. This review encompasses the three distinct areas of cochlea modelling, neural networks and sound localisation modelling. It provides an awareness of the strengths of the various other sound localisation techniques in the sound localisation modelling literature. It also provides the ability to improve on their weaknesses by developing a sound localisation system which furthers the biological realism. The next chapter will outline the development of a biologically inspired SNN architecture to model the way in which the binaural cue of ITD is processed by the auditory pathways for the purposes of localising experimentally derived HRTF measurements.


Chapter 4

Spiking Neural Network Model of the Medial Superior Olive

4.1 Introduction

Many of the approaches to sound localisation modelling outlined in the previous review chapters rely on purely non-neuronal techniques to facilitate sound localisation. In contrast, the ANN and SNN approaches are more biologically inspired, as they mostly focus on modelling the interconnecting system of neurons in the auditory pathway to achieve mammalian-inspired sound localisation. However, the work presented in this chapter differs from those reviewed in that the sound localisation model consists of a fully connected SNN which takes in real, experimentally derived biological data and uses a learning algorithm to classify the outputs of the MSO model into distinct angles. A finer selection of angles is used than in other ANN and SNN approaches, with each angle separated by 10°, and a wide range of sound frequencies is processed by the network. Also, as with the biological MSO, only the ITD auditory cue and low frequency sounds are used in the work presented in this chapter. Additionally, the localisation accuracies compare well to other SNN-based research on biologically inspired sound localisation.


4.2 Initial MSO Model

This section presents an STDP-trained SNN implementation of the Jeffress sound localisation model, [Jeffress, 1948], for a limited number of angles in the azimuthal plane using simulated data. This initial work aims to determine whether an SNN can be trained to perform sound localisation in a manner similar to the Jeffress model. The ITD will be encoded in the two inputs to the network, which signify the times the stimulus arrives at each ear. Their difference produces the ITD, and the output neurons determine the angle of location from these two inputs using both axonal delay lines and coincidence functions. This initial implementation of the Jeffress sound localisation model using SNNs considers a sound source at five distinct angles on the horizontal azimuthal plane: 0°, 45°, 90°, 135° and 180°. The topology of the SNN, shown in Figure 4.1, consists of five processing neurons, implemented using LIF neurons, which model the coincidence-detection neurons of the MSO. The inputs

t1 and t2 (from a set of synthetic data) correspond to the times the sound reaches each cochlea respectively; the difference in these times gives the ITD. These inputs are routed to the processing neurons via the cochlear nodes, which encode them as single spikes. The synapse on each pathway encompasses a multiple delay structure, similar to the graded series of delays found in the biological MSO. Figure 4.2 shows how delay lines are used in this model, where t_pre is the presynaptic spike time, d_i are the axonal delays, w_i are the weights, and t_post is the postsynaptic spike time. The output spike from neuron A is passed to each of m interneuron connecting pathways, each with its own weight w_i, where i = 1, ..., m.

The input spike times are trained to produce an angle of location using STDP, which selects the optimal delay line to facilitate coincidence at one of the output neurons. STDP was discussed in Chapter 3, and the STDP functions used in this work are described in Equation 3.20. For this work single-spike encoding is used, where the sound source was assigned arbitrarily chosen values of t1 and t2 to represent the times the sound reaches each cochlea respectively. Each combination of inputs passes through the delay line structure, causing the inputs to reach the output neurons at a series of predefined times. Each angle to be classified was assigned unique arrival times; consequently there were five training sets.


Figure 4.1: Network topology for initial MSO model

Figure 4.2: Pre- and postsynaptic neurons with interconnecting delay lines d1 to dm and weights w1 to wm, from [Bohte et al., 2000b]


Supervised training is used in this work, where each training set is passed to the network and the weight values for each delay line are calculated using STDP. For the delay lines which cause the two inputs to arrive in coincidence at the appropriate output neuron, STDP increases the weights, and the weights of the delay lines which do not cause coincidence are decreased. For instance, the neurons corresponding to each of the five angles (0° to 180° in steps of 45°) were passed inputs t1 and t2, and after training, the classifying neuron for each angle will only fire when presented with its unique combination of inputs. The post-trained network consists of each output neuron being connected to two delay structures, one for each input. The delay lines which cause the inputs to coincide at the output neuron will have increased weights, and those delays which do not cause the inputs to coincide at the output neuron will have decreased weights. After a period of training (40 epochs), the ITD encoded by the inputs t1 and t2 for each sample of data will arrive in coincidence at the appropriate output neuron, due to the post-trained weights on each delay line chosen by STDP, and only in that case will the neuron fire. The other output neurons will also receive the two inputs, but as these will not be in coincidence, the neurons will remain silent.
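The selection mechanism can be illustrated with a toy version of the delay-line/STDP interaction: the delay line that brings the two inputs into coincidence just before the supervised output spike is potentiated, and all other delay lines are depressed. All numerical values (delays, learning rates, STDP time constant, coincidence tolerance) are assumptions for illustration; only the 0.5 initial weight and the ceiling of roughly 2.5 echo values reported in this section.

```python
import numpy as np

delays = np.array([0.0, 0.1, 0.2, 0.3, 0.4])    # ms, one per delay line (assumed)
A_PLUS, A_MINUS, TAU = 0.05, 0.03, 0.1          # STDP parameters (assumed)
COINC_TOL = 0.05                                 # coincidence tolerance, ms (assumed)

def stdp_step(weights, t1, t2, t_post):
    """One supervised STDP step: t1 travels the delay lines, t2 arrives
    directly, and the supervised output spike at t_post potentiates the
    delay line that brought the two inputs into coincidence shortly
    before it; every other line is depressed."""
    arrivals = t1 + delays
    dt = t_post - arrivals                       # > 0 means pre before post
    coincident = np.abs(arrivals - t2) < COINC_TOL
    dw = np.where((dt > 0) & coincident,
                  A_PLUS * np.exp(-dt / TAU),    # potentiation window
                  -A_MINUS)                      # depression elsewhere
    return np.clip(weights + dw, 0.0, 2.5)       # ~2.5 ceiling, cf. Figure 4.3

w = np.full(len(delays), 0.5)                    # pre-training value 0.5
w = stdp_step(w, t1=0.0, t2=0.2, t_post=0.25)    # one step for an ITD of 0.2 ms
```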

4.2.1 Preliminary Results and Analysis

To evaluate the SNN model, the network was trained by passing the inputs (t1 and t2), encoded as single spikes, to the processing neurons. Dependent on the azimuthal angle, the output neuron was supervised to fire at a predetermined time, thus allowing the STDP rule to select the best pathway to facilitate coincidence. Figure 4.3 shows the weight distribution, on the vertical axis, of the trained SNN for the left and right synapses connecting to each output neuron; the horizontal axis is the spatial distribution of synapses across the network. From the figure it appears that multiple delay structures have the same final weight distribution. However, the combination of left and right delay connections for each output neuron is unique. All pre-trained weights have a value of 0.5. This ensures that every output neuron will respond identically to each input before training commences; it will thus be the responsibility of STDP to choose the appropriate set of final weights. It should be noted from Figure 4.3 that there is a post-trained bimodal weight


Figure 4.3: MSO weight values after training

distribution, which is characteristic of the STDP process. Potentiated weights at a value of approximately 2.5 are associated with pathways that have been selected by the STDP training rule, as their delays cause coincidence at the appropriate classifying neuron. To test the model, each of the five neurons corresponding to one of the five angles for classification was fed a number of input test sets containing random input values, including their unique input data. In all cases the SNN was able to make accurate classifications of all the input data, i.e. each neuron classified to its own respective output with 100% accuracy. This initial MSO model entailed a biologically plausible SNN that implemented the Jeffress sound localisation model. When presented with simulated times of arrival of the sound signal at each cochlea, the SNN was able to learn the angle of location of the sound source. The SNN architecture contains fully connected pathways, each with a delay line structure. This enabled STDP to optimise the network to facilitate coincidence at the appropriate output neuron for each combination of inputs to the network. Five angles were chosen for classification, and the network was trained to relate these angles to specific inputs. Results show that after testing, all neurons


classified to their respective outputs correctly. The above network was extended to localise angles in steps of 5°, whereby the number of delay lines was increased and the number of output neurons was increased to thirty-seven. Results for this experiment showed the same classification accuracy. To extend the architecture to classify down to 1-2° would merely involve increasing the number of neurons in the output layer again. Classification accuracies would not decrease, as there is no fundamental limit on the angle; only the size of the network would be affected, since in this architecture the number of processing neurons is directly dependent on the resolution of the localisation angle. Further work discussed in this chapter will involve extending the network to deal with real experimental data and spike trains, rather than simulated data and single spikes. However, the simulated data and single spikes established the proof of concept of using an SNN with STDP to implement the Jeffress architecture for extracting ITDs.

4.3 Extended MSO Architecture

The initial MSO model showed that an SNN could emulate the Jeffress architecture and accurately encode the ITD from two inputs, corresponding to the stimuli from the left and right ears, to produce a correct angle of location. However, the initial work entailed the use of simulated data which was encoded as single spikes. The rest of this chapter involves a similar network structure which must process real biological HRTF data encoded in the form of spike trains. The objective is to determine whether or not this extended network can perform sound localisation to a satisfactory accuracy using real data. At this point, it must be remembered from Chapter 2 that the MSO is equipped to process low frequency sounds, i.e. sounds which fall below 1.5 kHz, and that the AN and bushy cells perform phase-locking on the input stimulus, which allows the ITD to be extracted and used for localising the origin of the sound. The topology of the network can be seen in Figure 4.4, where (a) corresponds to the left MSO, which processes angles of location in the range -60° to 0°, and (b) corresponds to the right MSO, which processes angles in the range 10° to 60°. The input data from the left and right ears passes through a cochlea model which converts the HRTF data to spike trains, which are then routed through a bushy cell neuron. This neuron maintains the phase-locked signal


(a) MSO network for angles originating from the range -60° to 0° (left network)

(b) MSO network for angles originating from the range 10° to 60° (right network)

Figure 4.4: Initial MSO network architecture


and minimises noise in the stimulus. When the sound originates from the left of the head, i.e. from an angle in the range of -60° to 0°, the input from the left ear passes through a delay structure connected to each output neuron, whereas the input from the right ear is routed directly to the output neurons. The opposite is the case in Figure 4.4b, where the sound originates from an angle in the range of 10° to 60°. This network is designed to process a single sound frequency. The mammalian MSO consists of approximately 10,000 neurons organised tonotopically by frequency; therefore this work considers the network to be akin to the subset of biological neurons assigned to dealing with an individual frequency. Multiple networks would be required to process multiple sound frequencies. However, in this work, for computational efficiency, the same network structure with no change to parameters was reused to train and test multiple sound frequencies.

4.3.1 Input Layer

The input layer consists of two auditory periphery (cochlea) models developed by [Zilany and Bruce, 2006], based on empirical observations in the cat; as such, this model is appropriate for the HRTF inputs used here, which are also based on the cat, [Tollin and Koka, 2009]. There is one cochlea model for each ear, which encodes the input data into spike trains, as shown in Figure 4.4. The acoustical HRTF data used in this research was provided by [Tollin, 2004, 2008, Tollin et al., 2008]. For details on how this data was generated from adult domestic cats see [Tollin and Koka, 2009]. The data describes the filtering of a sound before it reaches the cochlea, after the diffraction and reflection properties of the head, pinna and torso have affected it. Data is available for thirty-six different azimuthal angles (-180° to 170° in steps of 10°) at 148 distinct sound frequencies (600 Hz to 30 kHz in steps of 200 Hz) for both the left and right ears. Table 4.1 gives a sample of the experimental data used in this research. Thirteen angles are used for classification, corresponding to the angles within ±60° in steps of 10°. The angles within the range of ±60° were chosen as they constitute a continuous range of angles that lend themselves to classification. Figure 4.5 shows the right ear HRTF data for sounds between 3 kHz and 5.8 kHz. From this figure it can be seen how the angles in the range of ±60° can be linearly distinguished.

Azimuth               -20°       -10°        0°        10°       20°

Right Ear Gains
600 Hz            -29.5939  -29.6343  -29.5772  -29.7262  -29.7022
800 Hz            -30.3110  -29.9027  -29.4999  -29.0648  -28.7320
1000 Hz           -33.1215  -32.1547  -31.2565  -30.2232  -29.4183
1200 Hz           -29.6815  -29.2299  -28.7582  -28.2494  -27.7894
1400 Hz           -28.6423  -28.1257  -27.6168  -27.0250  -26.6097
1600 Hz           -28.3939  -27.5741  -26.9583  -26.3100  -25.8073

Left Ear Gains
600 Hz            -29.4590  -29.0460  -29.0106  -28.6745  -29.8786
800 Hz            -28.7409  -29.7185  -30.0571  -30.2264  -31.0531
1000 Hz           -29.9233  -29.7033  -30.5096  -30.0105  -34.0587
1200 Hz           -26.9743  -27.3484  -27.7603  -28.0474  -28.6346
1400 Hz           -27.1191  -27.5482  -27.8399  -28.4545  -28.7454
1600 Hz           -26.7435  -27.1700  -27.4791  -27.8651  -28.2479

Table 4.1: Sample of experimentally derived HRTF acoustical input data

However, it should be noted that the data for the angles 70°, 80° and 90° overlaps with this range. It is for this reason that the angles in the range of ±60° were selected to demonstrate the capabilities of the two SNNs developed in this research to achieve sound localisation. Figure 4.6 demonstrates the highly complex and non-linear nature of this data for a classification-type problem. The HRTF data does not contain any information about ITD; it does, however, describe the intensity of the sound at each ear. This will be necessary when modelling the LSO in Chapter 5, but for now the ITDs must be embedded in the input data. To do this, Rayleigh's equation, described in Chapter 2, is employed to generate the ITD:

ITD = \frac{r}{c}(\theta + \sin\theta), \quad -\frac{\pi}{2} \le \theta \le \frac{\pi}{2}        (4.1)

where r relates to the radius of the head, c relates to the speed of sound and θ is the azimuthal angle on the horizontal plane.
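Equation 4.1 is easy to check numerically; with r = 9 cm and c ≈ 343 m/s, the sketch below reproduces the Table 4.2 values to within rounding (the exact speed of sound used in the thesis is not stated, so c here is an assumption).

```python
import numpy as np

R_HEAD = 0.09     # 9 cm head radius, as used for Table 4.2
C_SOUND = 343.0   # m/s; assumed value, reproduces the tabulated ITDs

def rayleigh_itd_ms(theta_deg):
    """Equation 4.1, returning the ITD in milliseconds."""
    theta = np.radians(theta_deg)
    return 1000.0 * (R_HEAD / C_SOUND) * (theta + np.sin(theta))

for angle in (60, 50, 40, 30, 20, 10, 0):
    print(f"±{angle:2d}°  {rayleigh_itd_ms(angle):.3f} ms")   # 0.502, 0.430, ...
```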

This formula gives a distinct ITD for every angle in the horizontal azimuthal plane which is independent of the frequency of the sound. Table 4.2 gives the ITDs in ms for every angle processed in this work, where the radius r is 9 cm (the radius of a human head).


Figure 4.5: Range of angles chosen for classification

Angle    ITD (ms)
±60°     0.502
±50°     0.42998
±40°     0.3518
±30°     0.268
±20°     0.18133
±10°     0.09
0°       0

Table 4.2: ITDs for each angle processed by the SNN

It is important to note that the resultant ITD for each angle ± is the same, i.e. the ITD for a -60° angle is the same as for a +60° angle. This is the reason why angles in the range of -60° to 0° (see Figure 4.4a) are not processed together with angles in the range of 10° to 60° (see Figure 4.4b); to do so would cause confusion when the network is being trained to produce an estimated angle as output. It was decided at the outset to place 0° in the left network.

However, the data for 0° could have been processed by either of the two networks. Rayleigh's formula determines the ITD based on the assumption that the head is spherical or round. However, the human head is not a spherical shape. In consideration of this, it was decided to look at the research of Nordlund [1962]. He undertook a series of experiments to measure ITDs using a model head. In the experiments, every effort was made to ensure the model had the form of a true standard head, i.e. not spherical. Figure 4.7 shows a plot of the ITD values as a function of azimuth for those calculated


(a) HRTF data for left ear

(b) HRTF data for right ear

Figure 4.6: 3-D mesh surface plot of the HRTF acoustical input data


Figure 4.7: Interaural time difference as a function of azimuth, comparing the approximate ITD values determined by Rayleigh's formula to those experimentally measured by Nordlund [1962]

by Rayleigh's formula and those measured in Nordlund's experiments. It can be seen from the figure that the ITDs approximated using Rayleigh's formula for a spherical head differ little from those measured using a standard head-shaped model. In fact, the average error difference is only 0.0457 ms across the full range of angles from 0° to 180°. Once the ITDs have been calculated, they need to be embedded into the input data. This is performed at the input layer, i.e. the ITD is embedded by the cochlea model. The cochlea model from [Zilany and Bruce, 2006, Zipser et al., 1993] was adapted to perform this function. If the sound originates from the left side of the head, from an angle in the range of -60° to 0°, the right input waveform is zero-padded by an amount appropriate to the ITD, i.e. a number of zeros representing the ITD value are appended to the beginning of the waveform. For example, for a -60° angle, which originates closest to the left ear, the right waveform is zero-padded for a time amounting to 0.502 ms, see Table 4.2. By embedding the ITDs produced by this formula into the cochlea model, the spike trains produced are suitable for use in a biologically inspired MSO network. Also, by embedding the ITDs at the input stage of the network, no further pre-processing needs to be done on the generated Poisson spike trains.
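The zero-padding step can be sketched as follows; the sign convention used to choose which ear to pad is illustrative, not the thesis's convention.

```python
import numpy as np

def embed_itd(left_wave, right_wave, itd_ms, fs):
    """Embed the ITD by prepending zeros to the far ear's waveform.
    Positive itd_ms is taken to mean the source is on the left, so
    the right waveform is delayed (sign convention illustrative)."""
    pad = np.zeros(int(round(abs(itd_ms) * 1e-3 * fs)))
    if itd_ms > 0:
        right_wave = np.concatenate([pad, right_wave])
    elif itd_ms < 0:
        left_wave = np.concatenate([pad, left_wave])
    return left_wave, right_wave

# e.g. for -60°, delay the right ear by 0.502 ms (Table 4.2):
# left, right = embed_itd(left, right, itd_ms=0.502, fs=44100)
```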


Each cochlea model takes the frequency and the HRTF of a sound at a particular angle as input and produces a spike train based on and relating to that input. Each spike train produced by the cochlea model is characterised by bursts of spikes which are phase-locked to the original sound frequency, see Figure 4.8. For this figure the sound source originates from a -60° angle, which is closer to the left ear than to the right, causing the left waveform to begin at an earlier time than the right. This also produces differing amplitudes, which can be seen clearly in the figure. This feature of the sound wave will be involved in extracting and processing the IIDs, which will be discussed in the next chapter. The embedded ITD calculated for this angle causes the two waveforms to be out of phase with one another, thus producing spike trains which are also out of phase. For the purposes of training the SNN to recognise and classify this data, multiple spike trains were generated for training and testing. As the spike trains generated by the cochlea model are encoded by a Poisson process, when the same data point is passed through the cochlea multiple times the spike trains generated will be different each time. This allows for the creation of training and test data consisting of different patterns of spikes relating to the same angle and frequency. In these experiments, ten samples were produced for each data point, i.e. ten spike trains were generated for every angle at a particular frequency. When training the SNNs, pairs of spike trains are passed through the networks in sequential order, beginning with the first sample from the left and right data sets, which corresponds to the angles -60° and 0° respectively.
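A minimal Poisson spike generator shows why repeated encodings of the same data point yield distinct training samples; the rate and duration below are placeholders, not the cochlea model's actual output statistics.

```python
import numpy as np

def poisson_spike_train(rate_hz, duration_s, dt=1e-4, rng=None):
    """Draw one Poisson-encoded spike train (spike times in seconds);
    each call produces a different realisation of the same rate."""
    rng = np.random.default_rng() if rng is None else rng
    spikes = rng.random(int(duration_s / dt)) < rate_hz * dt
    return np.nonzero(spikes)[0] * dt

# Ten distinct samples for one angle/frequency data point:
samples = [poisson_spike_train(rate_hz=200.0, duration_s=0.1) for _ in range(10)]
```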

4.3.2 Bushy Cell Layer

Knowledge of the bushy cells in biology is limited; however, it is known that the main function of these cells is to maintain the phase-locked signal and to minimise noise. Bushy cells were discussed in more detail in Chapter 2. In the network, spike trains such as those in Figure 4.8 proved difficult to train due to their bursting nature and the initial erroneous spikes which were not phase-locked to the waveform. Therefore, the role of the bushy cell layer in this network is to remove any erroneous spikes in the spike train (i.e. to remove noise) and to transform the phase-locked bursts to single spike instances, see Figure 4.9.


Figure 4.8: The left and right input stimulus time-domain waveforms for an 800 Hz sound and the bursting spike trains produced by the cochlea in response to the waveforms. The Poisson encoding process and the differing amplitudes of the waveforms produce distinctly different outputs which are not in phase with each other.

This processing was implemented using a LIF neuron. The phase-locked single-spike output in place of a burst was achieved through selection of an appropriate neuron threshold and refractory period. The parameters are fixed for every bushy cell in the network, i.e. the same parameters are used for every sound frequency with which the network was trained and tested. It is not the time of the first spike in each resulting spike train that is important when the two spike trains synchronise at the output neurons; each individual spike is necessary for the sound localisation process to be achieved. Although Figure 4.9 refers to a -60° angle, the first spike actually occurs in the right spike train, even though the sound was closer to the left ear. This can occur when the bursting spike trains are being processed by the bushy cell neurons, as the initial section of the spike train is usually characterised by noise in the form of erroneous spikes. This is clear to see from the two bursting spike trains in the figure. However, both spike trains in their entirety are out of synch with one another, not just the first spike. Therefore, when a delay structure causes the two spike trains to become coincident at the output layer, each spike in the left spike train will be in coincidence with a spike in the right spike train. In some cases, due to the Poisson encoding, the first spike


Figure 4.9: The left and right bursting spike trains produced by the cochlea models, and the response of the bushy cell neurons to these inputs

in one of the spike trains will not have a matching pair. This will not affect the behaviour of the network, as the entire spike train is involved in the processing.
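The burst-to-single-spike behaviour of the bushy cell layer can be sketched with a simple LIF neuron: the threshold is set so that only a burst (not an isolated, erroneous spike) can cross it, and the refractory period silences the remainder of the burst. All parameter values here are illustrative, not the fixed values used in the network.

```python
import numpy as np

def lif_bushy_cell(spike_times, duration_s, dt=1e-4, tau=0.005,
                   w_in=0.6, threshold=1.0, refractory=0.003):
    """LIF sketch of a bushy cell: a single input spike (w_in below
    threshold) cannot fire it, so stray erroneous spikes are filtered
    out; a phase-locked burst pushes the membrane over threshold, and
    the refractory period converts the rest of the burst into silence,
    leaving one output spike per burst."""
    idx = set(np.round(np.asarray(spike_times) / dt).astype(int))
    v, t_last, out = 0.0, -np.inf, []
    for i in range(int(duration_s / dt)):
        t = i * dt
        if t - t_last < refractory:
            v = 0.0
            continue
        v *= np.exp(-dt / tau)           # membrane leak
        if i in idx:
            v += w_in                    # integrate the input spike
        if v >= threshold:
            out.append(t)                # one spike marks the burst
            v, t_last = 0.0, t
    return out
```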

4.3.3 Output Layer

Reflecting the two MSOs in the mammalian brain, there are two SNNs: one for the left network, seen in Figure 4.4a, and another for the right network, seen in Figure 4.4b. The left SNN consists of seven LIF neurons in the output layer; each output neuron is assigned to an azimuthal angle in the range of -60° to 0°. The right SNN consists of six LIF neurons in the output layer; each output neuron is assigned to an azimuthal angle in the range of 10° to 60°. For the duration of this section, processing of the left network will be described. However, both networks have identical processing; they can be considered mirror images of one another. For instance, in the case of the -60° angle in the left network, the right waveform is zero-padded; and in the case of the 60° angle in the right network, the left waveform is zero-padded by an amount relating to the same ITD. At the output layer the contralateral (right) input is passed directly to each output neuron. The ipsilateral (left) input is passed to a set of delay structures. The delay structures in this network are equivalent to those described


in the initial section of this chapter and depicted in Figure 4.2. Each delay line is assigned a weight value, and the delay value itself is the amount of time by which the left stimulus will be delayed. Each output neuron is allocated its own delay structure. In this particular architecture, the delay structure corresponding to each output neuron is unique. For example, the output neuron for -60° is assigned a delay structure with four delays, d1 to d4, i.e. the output neuron receives the left stimulus four times, each time deferred by a differing amount of time; the output neuron for -50° is also assigned a delay structure with four different delays, d5 to d9; and so on. Only a delay line from the delay structure assigned to a particular output neuron, and thus a particular output angle, can cause the left and right spike trains from that angle to coincide. Therefore, in most cases only the output neuron which is associated with the current sample of input data will receive spike trains in coincidence. The threshold is set to ensure that the output neuron fires when it receives the left and right spike trains in phase. When the left spike train passes through the delay structures of other output neurons, it remains out of synch with the right spike train and in some cases becomes even more misaligned. ReSuMe, described in Chapter 3, is used as the supervisory training algorithm. Supervision involves supplying each output neuron with an additional input spike train only when the data for its assigned angle is currently being routed through the network. All initial pre-training weights in the output layer are the same. For an input relating to a particular angle, the training affects the weights on the delay line structures in two ways:

1. If a spike occurs in the supervisory spike train, the weights on the delay structures are increased. Recall that only the output neuron which is associated with that angle receives the supervisory spike train.

2. If the output neuron produces a spike, the weights on the delay lines are decreased.

The aim of the training is to associate the input data for each angle to its appropriate output neuron, i.e. the final weights on the delay lines after training ensure that the appropriate output neuron produces the highest volume of spikes when presented with the input data of its allotted angle. Classification is calculated on a maximum-spikes basis: the estimated output of the network refers to the output neuron that produces the highest number of spikes.


            Sound (Hz)   Accuracy %   Accuracy % ±10°
Training        800       92.1154        99.0385
Testing         800       89.0384        98.2692
Testing         600       57.0128        84.5898
Testing        1000       87.0385        98.4231

Table 4.3: Training and testing results for the initial MSO network architecture

In this network, it is desirable to use unique delay lines within each output neuron's delay structure to facilitate crisp classification. The network is trained on a particular sound frequency and then tested with different samples from that sound frequency (the samples are different due to the Poisson encoding process) and from neighbouring sound frequencies. Results are presented in two different ways. The first type of classification, termed here absolute classification, is where the actual angle produced by the network is equal to the desired angle; in this way a sample produces a result which is either correct or incorrect. The second type of classification takes into account the possibility of the network producing an angle at its output which is equal to the desired angle ±10°, i.e. if the desired angle is 30°, an output of 20° or 40° is also deemed acceptable. With this approach, the initial results achieved from the experiment are outlined in Table 4.3. The training and testing results were satisfactory; the low testing results for 600 Hz were expected, as the original data is noisy. The drawback of this network is the use of unique delay structures for each output neuron. It is questionable whether training is needed in the network if each output neuron is connected to a delay structure which will cause its appropriate inputs to become coincident in any case. Training was only necessary to increase the weight of the optimal delay line in each delay structure to produce the maximum number of output spikes. In all probability, this behaviour could be produced with parameter tuning, as the network can be considered to be pre-designed. Yet even with this pre-designed network, the results achieved are not absolutely accurate, i.e. 100%. The reason for this is the Poisson encoding scheme, which produces a different spike train each time from the same data point. As such, when the network was trained with samples from an 800 Hz sound and achieved an absolute accuracy of 92.11%,


testing with samples from an 800 Hz sound produced slightly less accurate results of 89.03%. This considerable drawback of network pre-design led to the development of the network structure in Figure 4.10, where each output neuron is fully connected to a generic delay structure, which requires a training algorithm to associate the input data with each output neuron.

4.3.3.1 Generic Delay Structure

The updated output layer involves a generic delay structure, which can be seen in Figure 4.10. Again, there are two SNNs, one assigned to angles to the left of the head and another for angles to the right of the head. Once more, processing of the left network will be described, as the right is a reflection of the left. Both the input and bushy cell layers remain the same as discussed in previous sections. The input from the right cochlea passes through a bushy neuron and is then routed directly to each output neuron. The input from the left cochlea also passes through a bushy neuron but is routed to the generic delay structure. The delay structure consists of seven delay lines, each assigned a delay value which again causes the stimulus passing through to be delayed by a period of time.

bushy neuron and is then routed directly to each output neuron. The input from the left cochlea also passes through a bushy neuron but is routed to the generic delay structure. The delay structure consists of seven delay lines, each assigned a delay value which again causes the stimulus passing through to be delayed by a period of time.

Each delay allows the left and right

stimulus for each azimuthal angle to come into phase, e.g. for an angle of 60° only one delay will provide coincidence while the other delay connections will not.

However, each delay line is connected to every output neuron

producing forty-nine synaptic connections in the output layer; the synaptic weights are identical before training begins. Therefore, every output neuron will receive both in-phase and out-of-phase inputs for every angle.

The

objective of the trained network is to associate each delay to a particular output neuron. For example, the rst delay will become associated with the rst output neuron, the second delay connection will become associated with the second output neuron, and so on.

To do this the post-trained weight

on the connection between the associated delay and output neuron must be larger than the weights on any of the other connections also providing stimulus to that output neuron.

Ultimately, the association of particular

delay lines within the delay structure to particular output neurons is specied by the training algorithm alone, and is not a matter of network design. ReSuMe was again used as the supervisory training algorithm, however it was unable to cope when each output neuron has access to the shared delay

108

(a) SNN network to process ITD for angles originating from the range -60° to 0°

(b) SNN network to process ITD for angles originating from the range 10° to 60°

Figure 4.10: Final SNN network architecture for processing ITDs

109

structure and inital results were poor; the classication accuracy for training was about 50% and testing was similar. The main reason for the poor performance is the inability of ReSuMe in this network structure to associate an output neuron to a specic delay connection when it receives virtually the same information from all of the delay connections; the only dierence is in the delayed timing of the stimulus from each connection. In addition, the creation of the supervisory spike train is problematic; it is dicult to determine what the correct pattern of supervisory spikes should be. Many dierent patterns including high frequency, low frequency and bushy cell output type patterns were tried with no signicant improvement in accuracy.

It was clear that an alternative supervisory learning algorithm was

required, one which was not focused on producing spike trains with precise spike timing. SHL, as discussed in Chapter 3, using STDP windows was employed and proved to be more suitable with respect to producing the desired output of the network. During training the following behaviour occurs: 1. Determine whether the current output neuron is being supervised or not. 2. If it is supervised, the positive part of the STDP window is used to increase the weights on the synapses between the supervised output neuron and the appropriate delay lines providing the current input. 3. If it is not supervised, the negative part of the STDP window is employed to decrease the weights on the synapses between the nonsupervised output neuron and any delay lines providing the current input. This training algorithm proved successful in producing the desired output of the network, i.e.

the appropriate output neuron has the highest ring

frequency when its associated input data is routed through the network. However, training is not stable; over the course of training the accuracy of the network increases to its maximum and then begins to decrease again. Due to this, a multiplicative form of the SHL algorithm is implemented and produces the same accuracies, but remains stable throughout the training period. Figure 4.11 plots the error of the network for each epoch of training, the stability of training is clear to see. Epoch 6 produces the lowest error

110

Figure 4.11: Overall training error across the angles -60° to 0° for the sound 800 Hz showing the accuracy for the desired angle alone and the accuracy for the desired angle

±10°.

with a classication accuracy of 87.14% when the actual angle produced by the network is compared to the desired angle for each sample of training data; at the same epoch the error is lower with a classication accuracy of 98.57% when the actual angle produced by the network is equal to the desired angle

±10°. Sound localisation diers from many classication tasks in that there is a relationship between the classes (angles).

The dierence between the ex-

perimental input data corresponding to neighbouring angles is small and in some cases identical for particular sound frequencies.

This makes the

task of distinguishing between neighbouring angles dicult for the network. Conversely, in most cases, non-neighbouring angles can not only be visually dierent but can be more easily distinguished by an SNN. This is arguably why many researchers present results using a coarser selection of angles. The results presented here show that when a margin of error of ten degrees is allowed, the classication results are signicantly improved, as can be seen from Table 4.4. The network was trained with three dierent sound frequencies: 800, 1400 and 2000Hz; testing the network involves generating new samples from the training sound frequency and samples from the neighbouring sound frequencies. The weighted average training and testing results across all the angles

111

Training Testing1 Testing2 Testing3 Training Testing1 Testing2 Testing3

Sound (Hz) Accuracy % Accuracy % ±10° Experiment1 800

91.1539

98.8462

600

69.2308

92.4359

800

82.5641

94.8718

1000

89.1026

96.7949

Experiment2 1400

87.3077

89.6154

1200

88.4615

92.3077

1400

85.0000

90.3846

1600

72.6923

75.7692

Table 4.4: Final MSO Network Architecture Results

from -60° to 60° are outlined in Table 4.4. The weighted average results represent the classication accuracy of the network across all angles from -60° to 60°, i.e.

the accuracies from both left and right networks are averaged

together using the following

weighted average

formula:

(7 ∗ al ) + (6 ∗ ar ) 13 where

al

0°, and

(4.2)

is the accuracy of the left network which has seven angles, -60° to

ar

is the accuracy of the right network which has six angles, 10° to

Each time the network is trained with a particular sound frequency, it is tested on that frequency and on two other sound frequencies, again over the range of angles -60° to 60°, see Table 4.4. These other frequencies are neighbours of the sound chosen to train the network. It is interesting that not only is there a relationship between neighbouring angles in the input data, but there is also a relationship between neighbouring sound frequencies, as the same network can be used to process multiple sounds to a high classification accuracy. Using neighbouring sounds for testing also gives the advantage of testing the network on completely unseen data. Note also from the table how the classification accuracies begin to decrease as the sound frequencies increase. For example, the classification accuracy when testing a 1.6 kHz sound is the lowest result in Table 4.4. This is to be expected, as the cut-off point between low and high frequency sounds is 1.5 kHz, and as discussed in Chapter 2 it is well known that the MSO processes low

112

Figure 4.12: Final weights on synapses between the delay structure and each output neuron

frequency sounds. This issue will be discussed in further detail in Chapter 6. Additionally, comparing the results reported in this chapter against both biological performance and the published work by other researchers in the area of sound localisation modelling will be discussed in Chapter 7. The nal weights on the synapses between the delay line connections and each output neuron at the end of training are bimodal, i.e. the delay which is associated with an output neuron after training has the largest weight and the other delays have lower weights. This can be seen in Figure 4.12 where for output neuron 1, the rst delay line has the highest weight; for output neuron 2, the second delay line has the highest weight; and so on.

This

distribution of weights allows this fully-connected network with a generic delay structure to produce the desired angles with the accuracies outlined in Table 4.4.

4.4 Conclusions This chapter has outlined a series of experiments for the development of a biologically inspired SNN architecture to model the way the binaural cue of ITD is processed by the auditory pathways. An initial MSO model was investigated to determine whether Jeress' model could be implemented with an SNN; Jeress' theoretical computational model was a major inspiration for the work developed in this chapter.

This section of the work involved

simulated data in the form of single spikes trained using STDP and was

113

successful in its implementation.

Following on from this initial work, low

frequency experimentally-derived HRTF data encoded into spike trains was used and the network structure was extended to cater for this. An auditory periphery model to encode the data was appended to the front of the network and a layer of bushy cell neurons were included to maintain the phase-locked signal routed from the cochlea models and to remove noise, i.e. they convert the bursts of spikes to single spikes and remove erroneous spikes. Two output layer topologies were outlined; the rst was successful but could be criticised as the design of the network determined the classication accuracies.

A

second output layer was then developed which consisted of a generic delay structure which fully connected each delay line to every output neuron. The SHL algorithm trained the network to provide similarly successful accuracies. The primary contributions of this chapter are: the integration of biological models within an SNN, including an auditory periphery model, LIF neurons and a structure consisting of multiple delayed synaptic connections; the use of experimentally-derived acoustical HRTF data from adult domestic cats as input to the SNN model; the development of an SNN which can process and extract the binaural cue of ITD from low frequency HRTF data, and the use of the SHL learning algorithm which enables the classication and thus localisation of the data. Limitations of this chapter include a lack of analysis and evaluation to determine how the SNN model would perform in the midst of noise; when presented with high frequency sound data; and when tested on non-neighbouring sound frequencies. These limitations will form the basis of the work presented in Chapter 6. The next chapter will outline the development of a biologically inspired SNN architecture to model the way in which the other binaural cue of IID is processed by the auditory pathways.

114

Chapter 5

Spiking Neural Network Model of the Lateral Superior Olive 5.1 Introduction This chapter presents the development of a SNN model of the LSO which performs sound localisation based on IIDs. The SNN contains various layers of processing which aim to mimic the auditory pathways and a computational classication layer which outputs the desired azimuthal angle in response to HRTF input data. The work outlined in this chapter begins with a model of an LSO neuron using a LIF neuron model and simulated input data. This LSO model was then extended to a full scale network which utilised biological HRTF data and a supervisory training algorithm for classication of the desired output angles. The full scale network consists of two subnetworks; the rst processes data corresponding to angles to the left of the head, i.e. those in the range of -60° to 0°; the second subnetwork processes angles to the right of the head, i.e. those in the range of 10° to 60°. These subnetworks are identical in structure and function, the only dierence is their input data sets and conguration of network parameters.

5.2 Initial LSO Model This section outlines the development of an initial LSO model implemented with spiking neurons. Using simulated input data, it considers how a LIF

115

Figure 5.1: LSO neuron model to compute the dierence of the inputs from the excitatory and inhibitory synapses.

Note the inclusion of the MNTB

node through which only the contralateral input is routed to provide an inhibitory input to the LSO neuron. LIF parameters are: voltage threshold = 2.5V; refractory period = 1ms; voltage reset = 0V. Synapse parameters are: initial membrane voltage: 0V; time constant = 37ms.

neuron model can be designed to mimic the behaviour of a mammalian LSO neuron. The aim is that the LIF neuron will have the ability to relate the frequency at its output cies

f1

and

f2 ,

fo

to the dierence of two input spike train frequen-

see Figure 5.1.

sound source at an angle

θ.

lateral ear with an SPL of

The LSO neuron is stimulated by a single

The sound reaches the cochlea node of the ipsi-

E1

and the contralateral ear with an SPL of

The cochlea node encodes each SPL into a spike train, i.e. spike train with a frequency of frequency of

f2

f1

and

E2

E1

E2.

maps to a

is mapped to a spike train with a

where the frequencies of the spike trains relate to each SPL.

In these experiments

f1

is a constant frequency, while

ing dierent intensities at

E2; f2

is varied reect-

their combination reects diering angles on

the horizontal plane. Spike train to the LSO neuron; while

f2

f1

from the ipsilateral ear travels directly

is routed to the MNTB node which trans-

forms the stimulus to an inhibitory input. Thus, the LSO neuron receives two inputs; an ipsilateral excitatory stimulus and an inhibitory contralateral stimulus. The IPSP is combined with the EPSP, i.e. the inhibitory response is subtracted from the excitatory response producing a stimulus for the LSO neuron that reects the dierence in frequency (intensity) between the two inputs.

Parameters for both the neuron and the synapses were chosen by

ne-tuning the neuron model to achieve appropriate frequency ranges at the output

fo .

The LSO neuron generates an output frequency

fo

the dierence between the two input frequencies,

116

which is a measure of

f1

and

f2 . fo

is a key

component in the way the LSO determines the azimuthal angle of the sound signal as the range of output frequencies can be mapped to the range of angles on the horizontal azimuthal plane, [Solodovnikov and Reed, 2001]. When the LSO neuron produces no output, i.e.

fo

= 0, it can be concluded that the

sound signal is at 90°; the sound reaches both ears at the same time causing the sound at each ear to have the same SPL, therefore both

f1

f2

and

the same encoded frequency and the IID is 0. As the IID increases,

have

fo

will

also increase as the angle of the sound source tends towards either 0° or 180°. Figure 5.2 plots dierent combinations of spike train frequencies for

f2

which produce dierent output frequencies,

fo ,

f1

and

at the LSO neuron. The

excitatory frequency of 100 Hz and the three inhibitory frequencies of 80 Hz, 90 Hz and 100 Hz were chosen arbitrarly for the purpose of demonstrating the system. The output frequencies were determined by counting the number of spikes in the spike train for a stimulus duration of one second; all frequencies measured in this chapter use this method. The LSO model was tested with eleven sets of input frequencies to reect eleven dierent angles on the azimuthal plane.

Each test set reected a

dierent IID in the combination of the frequencies of the spike trains corresponding to the two inputs.

Frequency

f1

1 varied over a range of 0 Hz to 100 Hz. With

was xed at 100 Hz while

f1

=

f2 ,

f2

the LSO neuron pro-

duced no output spikes. As the inhibitory frequency was reduced for each subsequent test set, the output frequency increased as expected. Figure 5.3 shows the relationship between the LSO neuron output frequency dierence of its inputs,

f1

and

f2 .

fo

and the

It can be seen that as the dierence of

the input frequencies changes the ring rate of the LSO neuron also changes in a monotonically decreasing or non-linear fashion. With a combination of a high excitatory input frequency and a low inhibitory input frequency, the output frequency of the LSO neuron is high. While with a combination of a high excitatory input frequency and a high inhibitory input frequency, the LSO neuron becomes silent.

It should be pointed out that the LSO neu-

ron model presented here had xed weight values as no training took place. Training was not necessary for this initial neuron model as this work was carried out for the purpose of demonstrating the combination of an IPSP and

1

The author is aware that high frequencies are more typically associated with the

LSO; however for the initial experiments on the one neuron LSO model, low frequencies reduced the complexity of the inputs. Additionally, when scaling up to the LSO network architecture in the next section, input frequencies will be in the range of 1.8 kHz - 30 kHz.

117

(a) Excitatory input of 100 Hz and inhibitory input of 100 Hz

(b) Excitatory input of 100 Hz and inhibitory input of 90 Hz

(c) Excitatory input of 100 Hz and inhibitory input of 80 Hz

Figure 5.2: Matab plots of the LSO model response to two dierent combinations of inputs 118

Figure 5.3: Mapping of the LSO output frequency to the dierential of the input frequencies

f1

and

f2 .

EPSP, and how their dierence when used as input to a neuron produces a signicant output frequency that can be used for sound localisation. However, the relationship between trains,

f1

and

f2 ,

fo

and the dierence between the input spike

could be altered by selectively adjusting the parameters

for both the inhibitory and excitatory synapses. This could then be used to map

fo

to azimuthal angles for the purpose of sound localisation. Moreover,

the relationship between the SPL at each ear and the encoded spike train frequencies needs to be determined in order to relate the output frequency of the LSO neuron to an accurate angle of location for the sound source. This preliminary work outlines how LIF neurons can be employed to emulate the functionality of the LSO, i.e. how the frequency of the output can be related to the IID (dierence of the inhibitory and excitatory input spike train frequencies). This is a key component in the way the LSO determines the azimuthal angle of the sound signal as the range of output frequencies can be mapped to the range of angles on the horizontal azimuthal plane. The rest of the chapter outlines a fully-connected SNN which processes real experimental data for the purpose of biologically inspired sound localisation. These realistic inputs to the network will be classied to their appropriate azimuthal angles by training with a supervised learning algorithm.

119

5.3 Complete LSO Architecture Figure 5.4 outlines the fully-connected feed-forward SNN for sound localisation which consists of two separate networks akin to the mammalian auditory system which has two LSOs; one deals with the data corresponding to the sound originating to the left of the head, i.e. angles in the range of -60° to 0°; the other deals with data corresponding to the sound originating to the right of the head, i.e. angles in the range of 10° to 60°. For the duration of this chapter, the functionality of each layer of the left network will be described as both networks process data identically. However, results will be given for both networks. The network consists of an input layer which encodes the HRTF data into spike trains. The contralateral input passes to an MNTB node and the output of this is then combined with the ipsilateral input at the LSO neuron which decodes the IID. The outputs of the LSO are routed through a layer of receptive elds. The function of the receptive elds and their corresponding neurons is to respond to unique spike frequency ranges and to encode the output responses of the LSO into linear spike trains for the supervisory training algorithm to classify. The nal layers of the network classify the outputs of the LSO model into angles. The purpose of the network is to produce the correct output angles for each input combination of HRTF data from both the left and the right ears. Parameters of the synapses and neurons at all layers are independent of the dierent sound frequencies used in individual experiments. The mammalian LSO consists of approximately 4,500 - 5,000 neurons organised tonotopically by frequency, therefore this work considers the network to be akin to the subset of biological neurons assigned to dealing with an individual frequency. Multiple networks would be required to process multiple sound frequencies. However, for computational eciency, the same network structure with no change to parameters was reused to train and test multiple sound frequencies. Additionally, a layer of bushy cells are not included in this architecture as the sound frequencies being processed are greater than 3 kHz. At these frequencies, biological bushy cells have a primary-like response type, i.e. for every spike which arrives at a bushy cell, one spike is generated, [Yin, 2002]. For computational eciency, the ipsilateral outputs of the cochlea model are routed directly to the LSO neuron and the contralateral outputs of the cochlea are routed directly to the MNTB

120

neuron.

5.3.1 Input Layer The input layer cochlea models and HRTF data is the same as that used for the MSO network described in the previous chapter. However, there are some dierences.

The HRTF data contains information about the SPL of

the stimulus at each ear, i.e. the IIDs can be extracted from the HRTF data alone. Therefore, the ITDs generated by Equation 4.1 were not included in the cochlea model to be embedded in the resulting spike trains.

Also, as

before, the spike trains generated by the cochlea model are encoded by a Poisson process, i.e. when the same data point is passed through the cochlea multiple times the spike train frequencies generated will not be identical each time. However they will all be distributed around a mean frequency. Spike train frequencies generated are usually within

±30

Hz of that mean

frequency, but in some cases a spike train will be generated with a frequency which is far removed from that mean frequency; these outliers will not facilitate exact classications, since in this network processing is focused on the frequency of the spike trains rather than the timing of the spikes as with the MSO network.

However, it is decided to retain these outliers in the

processing as to remove them would hinder the biological plausibility of this research.

5.3.2 Hidden Layers There are three hidden layers in this network, the MNTB layer, the LSO layer, and the receptive eld layer, see Figure 5.4. The MNTB layer consists of a LIF neuron which represents the inhibitory neurons of the MNTB for an individual sound frequency. The MNTB neuron takes as input the spike train from the contralateral input layer and converts it to an inhibitory stimulus with the same pattern of spikes. This output is then routed to the LSO layer. All LIF neurons in this network are dened by Equation 3.9, as described in Chapter 3. The LSO layer consists of a LIF neuron with excitatory and inhibitory facilitating synapses which model the functionality of the biological LSO to determine the IID; the model consists of one LIF neuron as it relates to an individual sound frequency. The facilitating synapse models used in this work

121

(a) LSO network for angles originating from the range -60° to 0° (left network)

(b) LSO network for angles originating from the range 10° to 60° (right network)

Figure 5.4: LSO network architecture

122

are from [Tsodyks et al., 1998], as described in Chapter 3. The neuron takes as input the excitatory spike train from the excitatory facilitating synapse, and the inhibitory spike train which passed through the contralateral MNTB neuron and inhibitory facilitating synapse. The dierence in these spike train frequencies relates to the IID and is reected in the LSO output response which is used to classify the azimuthal angle of the input stimulus in latter layers of the network. To calculate the dierence, the EPSP and IPSP are summed; essentially the IPSP generates the neural equivalent of subtraction. The resultant PSP generated from this summation is the input to the LIF neuron and the associated output response is a measure of the dierence between the two input frequencies. Both inhibitory and excitatory facilitating synapses in this layer use the dynamic synapse dierential Equations 3.14, 3.15, 3.16, 3.17, and 3.18 as described in Chapter 3. Parameters chosen for the facilitating synapses can be found in [Tsodyks et al., 1998]. Figure 5.5 shows the output responses for two dierent sounds, 5 kHz and 15 kHz, produced by the LSO neuron for 100 samples of data at each azimuthal angle in the range of and the

y-axis

±60°.

The

x-axis

portrays the thirteen dierent angles

shows the output responses produced by the LSO neuron

in response to the input data relating to each angle.

Notice the series of

spike train frequencies for each sample of data at each angle.

With 0° as

the centre point, the angles to the left and right are approximately mirror images of each other, as expected. This is the main reason for having separate networks assigned for angles to the left and right of the head respectively. One combined network would be unable to classify between -60° and 60°, -50° and 50°, and so on. Another point to note from Figure 5.5 is that there is a spike train frequency overlap between neighbouring angles, i.e. several training samples for multiple angles are identical.

The amount of overlap also varies with sound

frequency; in Figure 5.5 the LSO output responses for the 5 kHz sound overlap considerably more in comparison to the 15 kHz sound. As the rate of overlap between angles increases, the problem of localisation becomes more dicult, i.e. classication of the 5 kHz sound is more complicated in comparison to classication of the 15 kHz sound. The proceeding layers (receptive eld and output layer) of the network will aim to classify this overlapping data for multiple angles using frequency selective receptive elds and a supervised training algorithm. Lastly, from Figure 5.5 it can be seen that the

123

(a) 5 kHz Sound

(b) 15 kHz Sound

Figure 5.5: Range of responses produced by the LSO neuron for each angle at the sound (a) 5 kHz and (b) 15 kHz. It is possible to see where the clusters of training samples overlap between the angles, resulting in overlapping frequency selective receptive elds. Note also that the overlap is more signicant for the 5 kHz sound in comparison to the 15 kHz sound however in some cases the overlap of neighbouring angles varies throughout the datasets for the two sounds.

124

range of responses for each angle produced by the LSO neuron is wider for the 5 kHz sound than for the 15 kHz sound. This dierence indicates that the receptive elds designed for each sound will vary greatly even with respect to the same angles, i.e. the receptive eld designed for 0° at 5 kHz will be very dierent to the receptive eld designed for 0° at 15 kHz. This dierence in receptive eld widths is consistent with [Tollin and Yin, 2002b] who report that the spatial receptive elds of LSO neurons for lower sound frequencies around 5 kHz are much wider than those for higher frequencies.

Overall,

Figure 5.5 highlights the issues the supervised training algorithm will have with classifying the LSO outputs and also how the success of classication can dier across varying sound frequencies, i.e.

it is clear to see from the

gure that the task of classifying the 5 kHz sound is more dicult than for the 15 kHz sound as the overlap is signicantly worse for the 5 kHz sound. In order to validate the performance of the LSO neuron it was necessary to establish whether the resulting output spike frequencies could be classied by a biologically inspired SNN. The spike trains resulting from the cochlea model and subsequently from the LSO neuron presented a number of diculties. The rst diculty was that the nature of the Poisson encoding, which in the majority of cases resulted in spike trains with extreme bursting activity, made any subsequent classication dicult. The second diculty was a practical one: namely that the time step utilised in the generation of the spike trains, originating in the cochlea model, and subsequently processed by the LSO neuron, was represented by a spike train with a length of 15000 time steps, which placed limits, in terms of computer memory on the number of spike trains that could be represented at any one time in the SNN. In addition, the range of responses produced by the LSO neuron across the angles was very large. To overcome these diculties a layer of receptive elds [Bohte et al., 2002b, 2000a] was created to encode the LSO output spike trains into linear spike trains for the supervisory training algorithm to classify. The receptive elds took the form of a Gaussian function; a Gaussian function was chosen as it provided a smooth transition between the activation of neighbouring neurons:

kij = e−((xm −yo )/dm )

125

2

(5.1)

where

kij

is a scalar variable which will modify the output spike train fre-

quency of the LSO region, eld,

yo

xm

is the operating frequency of the receptive

is the LSO output spike train frequency and

dm

denotes the width

of the receptive eld. A receptive eld was created for every angle being processed by the network using the half maximum distance method from fuzzy logic systems and radial basis function networks, [Bugmann, 1998].

Each

receptive eld was assigned an operating frequency and width based on the LSO output frequencies of the specic angle assigned to it.

To determine

the operating frequency and width of each receptive eld the following steps were taken:

1. For each angle generate 100 samples of data from the left and right cochleas and pass through the LSO neuron. The spike train frequency of each Poisson sample is found by passing through the entire spike train and counting the number of spikes over the time-length of the spike train. Unlike linear spike trains, the ISI cannot be used to determine the frequency of a Poisson spike train. 2. Find the average spike train frequency produced by the LSO neuron. This average frequency becomes the operating frequency of the receptive eld. 3. Find the maximum and minimum frequency produced by the LSO neuron. 4. Determine the dierence between the maximum and the average frequency, and the minimum and the average frequency. Use the largest of these dierences for the width of the receptive eld. 5. Repeat these steps for every receptive eld created for each angle.

Originally, each receptive eld was assigned the same arbitrarily chosen width. In some cases this value was too narrow and some samples of data did not activate their own designated receptive elds.

Similarly, in other

cases this value was too large and neighbouring receptive elds overlapped too much causing diculty with classication. Fine-tuning of the receptive elds led to the use of the above algorithm to determine the widths and operating frequency for each individual receptive eld.

126

In any case, when

Figure 5.6: Spread of receptive elds across the LSO output responses for the angles 10° to 60° of the 15kHz sound frequency. The gure shows how the LSO responses are scaled into an output frequency which can be processed by the supervised training algorithm for classication. LSO output responses range from 0Hz to 600Hz across the angles chosen for classication.

In

this case, when the rst receptive eld for the angle 10° receives an input frequency of

f2 Hz (which is equal to the operating frequency of the receptive

eld), the receptive eld produces an output spike train of 40 Hz. When the receptive eld receives an input frequency of

f1 Hz ,

an output of 7 Hz is

produced.

training and testing the network, samples will pass through their own receptive eld and the receptive eld on either side; hindering the selectivity of the neurons to individual angles. The function of the receptive elds and their respective neurons is to scale the LSO output response to fall into the arbitrarily chosen range of [0, 40 Hz]. If the LSO output spike train frequency equals the operating frequency, an output frequency of 40 Hz is encoded to be routed to the output layer for classication.

Similarly, if an input frequency does not lie within the

scope of the receptive eld, an output frequency of 0 Hz is encoded. This is illustrated by Figure 5.6. It should be noted that this re-scaling method of LSO responses was identical for all LSO spike train outputs; hence the relationship between spike train frequencies of the LSO output was preserved. This processing of the receptive eld is similar to work done by Bohte et al. [Bohte et al., 2002b, 2000a]. In their time-to-rst spike algorithm, inspiration is taken from the local receptive elds of biological neurons. The receptive eld is used to encode the delay of the rst spike time at the input layer. Similarly, in the work outlined in this thesis, as the entire spike train is used, the receptive eld is used to encode the ISI of the linear spike trains routed to the classication layer. Figure 5.7 shows the receptive eld layer output of the 15 kHz sound when

127

the input data samples are processed by the network in sequential order from -60° to 0°. As outlined above, the output spike trains from the LSO neuron are passed though a receptive eld and the corresponding neurons encode an output frequency in the range of [0, 40 Hz] depending on the activation of the receptive eld. The gure shows how the receptive elds lter the data routed from the LSO neuron to the output layer.

The

y-axis

shows each

of the seven receptive eld layer neurons A to G, from Figure 5.4(a). The receptive elds for these neurons were created based on each individual angle, therefore neuron A relates to -60° up to neuron G relating to 0°. The portrays these input samples passing through the network in

order .

x-axis

sequential

As can be seen from the gure, neuron A corresponding to the -

60° angle encodes spike trains of a frequency determined by its receptive eld when training samples for -60° and -50° are processed by the network; although higher frequency spike trains are encoded for -60° as this was the data the receptive eld was designed for.

Neuron A encodes 0 Hz spike

trains when training samples from the other ve angles are processed by the network, i.e. the data from these other angles does not lie within the scope of the receptive eld designed for neuron A. Similar behaviour occurs for neuron B, it encodes spike trains when it receives training samples from both -60° and -50°; again higher frequency spike trains are encoded for 50°. Neurons A and B encode spike trains for both angles because the spike train frequencies produced by the LSO overlap to some degree causing their receptive elds to also overlap. Neurons C, D and E corresponding to the angles -40°, -30° and -20° respectively, produce only very few spike trains when presented with training samples from neighbouring angles.

In these

cases, the LSO output responses for each angle do not overlap very much with neighbouring angles. For the last two encoding neurons, F and G, relating to the angles -10° and 0°, the LSO output responses are for the most part identical and completely overlap. Therefore, each encoding neuron produces a spike train when presented with training samples from either angle. This gure is consistent with Figure 5.5; any angles which had overlapping LSO output responses in that gure cause the behaviour which is shown in Figure 5.7. The technique outlined above for tuning receptive elds reduces computational overhead. The alternative to this is to have many more arbitrary small overlapping receptive elds. This would result in hundreds of receptive elds

128

Figure 5.7: Outputs of left receptive eld layer neurons for the 15 kHz sound frequency. The input to these neurons comes from the LSO which is ltered by receptive elds.

with hundreds of corresponding neurons, which would also increase the time taken to train the networks. Ultimately, including the layer of receptive elds in the network increases neuron selectivity to individual angles and thus decreases the complexity of assigning the angle data to individual neurons in the output layer.

5.3.3 Output Layer The output layer consists of seven LIF neurons relating to the angles -60° to 0° in steps of 10°, see Figure 5.4.

The training algorithm used in this

research is ReSuMe introduced by [Ponulak, 2005], as described in Chapter 3. The aim of the training is to ensure that when data from each angle passes through the network only the appropriate output neuron is activated, i.e. the correct angle is produced as output. Determining what angle is produced, i.e. decoding the output, is done in a straightforward manner. Each output neuron is assigned an angle of location and whichever neuron res with the highest ring frequency, the associated angle is considered to be the output of the network.

129

5.3.4 Training Algorithm Each output neuron was trained to be associated with a particular angle. During each epoch of training, the network was fed the training data for each angle in sequential order from -60° to 0°. For instance, -60° data was routed through the network to all the output neurons.

This data passes through

the receptive eld layer onto one or more of the corresponding encoding neurons.

The encoding neurons send on the stimulus to each neuron in

the output layer in a fully connected manner. As outlined in the previous section, several spike trains will pass through more than one receptive eld, but each receptive eld will provide a dierent activation. The appropriate receptive eld should produce the highest output frequency with most of the receptive elds producing spike trains of 0 Hz. Consistent with ReSuMe, the connections from the neuron producing the highest frequency will undergo the most amount of learning and thus the most signicant weight updates on their connections. In the case of this research where multiple samples of diering spike train frequencies exist for each angle, the aim of ReSuMe is to produce a set of nal weights which will generate the highest output ring frequency at the appropriate output neuron for each individual angle. The same supervisory target signal (a spike train with predetermined spike times) was used for training the network for every angle.

Using the same

target signal for all angles ensures that training is equal for all angles, and allows the network structure to be reproducible for training other sounds without changing parameters to suit any angle or sound. Classication results using the ReSuMe supervisory training algorithm are presented in the next section. Figure 5.8 shows the weights produced by training over thirty epochs; these weights are located on the connections between receptive eld layer neurons and the output neurons. It can be seen from the gure that the weights stabilise over the course of training with many of the weights on the connections falling below zero while others stabilise at a positive value, this distribution of weights is what enables the network to classify the input data.

5.3.5 Testing The training accuracy is determined by passing through the training input data sets with the nal weights from that epoch.

130

This ensures that the

Figure 5.8: Stable weight distribution over thirty epochs of training on the connections between the receptive eld and output layer neurons for the left network of the 15 kHz sound.

ReSuMe training algorithm is producing a set of nal weights which will accurately classify the input data when the algorithm is not in use. A high classication accuracy for training was required before moving on to testing the network with unseen data. The network is tested in three dierent ways. The rst testing set involves generating ten new samples from the cochlea models using the same input data from the sound used for training. As outlined previously, spike trains generated by the cochlea models for the same input data will dier to those generated for training due to the models use of Poisson encoding.

This

allows testing of the network using essentially the same HRTF data but with random variations in the encoded spike trains. The second and third types of testing involve using data from neighbouring sound frequencies; e.g.

in

Experiment 1 (see Table 5.1) the training data corresponds to the sound frequency 5 kHz and the rst and third testing set correspond to 4.8 kHz and 5.2 kHz respectively.

These last two forms of testing ensure that the

network is tested with completely unseen input data.

5.3.6 Results The classication accuracy of the network was determined based on which output neuron is ring with the highest frequency. In this way, if a training

131

Figure 5.9: Overall training error across the angles -60° to 0° for the sound 15kHz showing the accuracy for the desired angle alone and the accuracy for the desired angle

±10°.

or testing sample produced the highest ring frequency at its appropriate output neuron, the sample is deemed to have correct classication. Results of the network are presented in two dierent ways, as discussed in Chapter 4, these are

absolute classication

and

desired angle classication ±10°.

Figure

5.9 plots the training accuracy for the 15 kHz sound over 30 epochs of training.

After 11 epochs the absolute classication accuracy reaches 92.86%

however after only 3 epochs the desired angle classication

±10°

reaches

100% for each sample of training data. Three sound frequencies were chosen to be trained and classied by the network, 5 kHz, 15 kHz and 25 kHz, these frequencies were chosen as they represent a medium, high and very high frequency from all the sound frequencies available.

Training and testing results across all the angles from

-60° to 60° can be seen in Table 5.1.

132

Sound

Accuracy(%)

Accuracy±10°(%)

Training

5kHz

53.27

81.54

Testing1

4.8kHz

49.10

86.92

Testing2

5kHz

52.49

86.15

Testing3

5.2kHz

40.15

84.15

Training

15kHz

84.61

100

Testing1

14.8kHz

76.54

98.46

Testing2

15kHz

83.07

100

Testing3

15.2kHz

78.85

96.15

Training

25kHz

44.62

73.08

Testing1

24.8kHz

43.46

74.62

Testing2

25kHz

39.99

71.54

Testing3

25.2kHz

41.93

69.23

Experiment1

Experiment2

Experiment3

Table 5.1: Classication Results

The network performed at its best when trained and tested with a high frequency sound (15 kHz), achieving absolute classication accuracies of approximately 80% and classication accuracies

±10°

of approximately 99%.

The results for the medium frequency sound of 5 kHz were lower, with absolute classication accuracies of 48% and classication accuracies

±10°

of

approximately 84%. Even though the absolute classication accuracies are low, as the classication accuracies

±10° are considerably higher, it is fair to

say that there is a reasonable degree of classication occurring, i.e. the angle of the incoming sound is being localised to the neighbouring angle in many cases, not a random and completely incorrect non-neighbouring angle. When the highest frequency of 25 kHz was processed by the network, the poorest results were obtained with absolute classication accuracies of 43% and classication accuracies

±10°

of approximately 72%.

This result is consistent

with [Tollin and Yin, 2002b] who report that IIDs vary non-monotonically with azimuth at this very high sound frequency and thus are dicult to classify. Comparing the results reported in this chapter against published work by other researchers, in the area of biologically inspired sound localisation modelling using SNNs, will be discussed in Chapter 7.

133

5.4 Conclusions In conclusion, this chapter investigated the creation of a biologically inspired SNN which when presented with biological experimental data was able to localise that input based on the IID binaural cue. An initial LSO model was outlined which consisted of a single neuron model which can process IIDs using simulated data in the form of spike trains. This proof of concept was extended to a multi-layered SNN architecture modelled on the mammalian auditory system. This topology consists of an auditory periphery model and models of the MNTB and LSO. To facilitate the SNN model being able to process and classify the input data to angles of location certain biologically inspired computational models were used, for instance, facilitating synapses, LIF neurons and receptive elds.

Training with the ReSuMe supervised

learning algorithm enabled the classication of experimentally-derived HRTF acoustical data into angles of location. The primary contributions of this chapter are: the integration of biological models within an SNN, including an auditory periphery model, LIF neurons, facilitating synapses and receptive elds; the use of experimentally-derived acoustical HRTF data from adult domestic cats as input to the SNN model; the development of an SNN which can process and extract the binaural cue of IID from high frequency HRTF data, and the use of the ReSuMe learning algorithm which enables the classication and thus localisation of the data. Limitations of this chapter are similar to those outlined from the previous chapter, i.e. a lack of analysis and evaluation to determine how the SNN model would perform in the midst of noise; when presented with low frequency sound data; and when tested on non-neighbouring sound frequencies.

These limitations form the basis for the work presented in Chapter

6. Another signicant limitation for both the ITD and IID models involves the way in which only the left or right sub-networks can be active at any one time. This is not a feature of the mammalian auditory system and this activity will form the scope for future work on this topic of research.

A

further issue with the IID model is the lack of an onset delay, i.e. even at high frequencies there will be a time dierence between the sound reaching each ear and this time delay is not incorporated into this model. Again, this issue will be dealt with in Chapter 6. The next chapter presents an analysis of the capabilities of the two SNN

134

models developed to process the ITD and IID binaural cues. Sound localisation experiments are performed across the full range of sound frequencies, from 600 Hz to 30 kHz and the individual accuracies achieved by each azimuthal angle are reported. The robustness of both SNN models to HRTF data embedded with diering levels of noise is discussed. Finally, the generalisation abilities of both SNN models are outlined when testing is performed on unseen data from sound frequencies not presented during training, i.e. non-neighbouring sounds.

135

Chapter 6

Duplex Spiking Neural Network Model of Sound Localisation 6.1 Introduction The objective of this chapter is to provide analysis of the processing abilities of both the ITD and the IID SNN models. In the previous two chapters, these two SNN models were introduced, each of which extracted and processed the binaural cues of sound localisation, ITD and IID, for the purposes of localising experimental HRTF data into azimuthal angles. The SNN model developed for the ITD binaural cue reported classication results for six low frequency sounds while the SNN model developed for the IID binaural cue reported classication results for nine high frequency sounds.

These

classication results related to the average accuracy of the combined left and right networks, i.e. reported.

the individual accuracies for each angle were not

The SNN model developed for the IID binaural cue used the

original HRTF data as input, while the SNN model developed for the ITD binaural cue used the same HRTF data with the ITD encoded into the input waveforms.

The main aim of this chapter is to perform additional

experiments to determine the strengths and weaknesses of each SNN model.

136

6.2 Sound Localisation Across the Frequency Range The previous experiments outlined in Chapters 4 and 5 reported the classication accuracies for six low frequency sounds using the SNN model which processes the ITD binaural cue, and nine high frequency sounds using the SNN model which processes the IID binaural cue.

To fully determine the

processing capabilities of each model across all sound frequencies from low to high, it was decided to develop and train thirty-one SNN models based on thirty-one sound frequencies for each binaural cue.

These trained net-

works were then tested on the training sound frequencies with new Poisson spike train samples generated from the cochlea model input layer and also tested on the neighbouring sound frequencies. For these experiments, both models use input data where the ITD is encoded in the inputs. In this way, the initial delay created by the sound arriving at the two ears at dierent times will be incorporated into both the ITD and IID models. Before these experiments began, it was envisaged that the SNN model for ITD should produce the best classication accuracies for low frequency sounds, whereas the SNN model for IID would produce the best classication accuracies for high frequency sounds. The results of these experiments can be seen in Figure 6.1, where the average classication accuracy for both the left and right networks are reported along with the average classication accuracy

±10°.

As expected, the ITD model achieves high classication accuracies for low frequency sounds,

≤ 1.6 kHz , but does not perform well for sounds ≥ 1.6 kHz .

This is a very small range of sound frequencies for which this network can be used for sound localisation. This relates back to Section 2.1.1 from Chapter 2, where it was outlined that the ITD cue works most eectively for sounds greater than

∼ 200 Hz

to

∼ 1.5 kHz

in humans [Burger and Rubel,

2008]. Conversely, the results from the IID model show the reverse. The IID model achieves high classication accuracies for high frequency sounds,

≥ 4 kHz ,

but does not localise well for sounds

≤ 4 kHz .

However, the IID

model does better at localising low frequency sounds than the ITD model manages with high frequency sounds. The range of high frequency sounds is much larger than the low frequency sounds and overall classication results for the IID model are quite high. However, there are two areas where the localisation ability of the IID model is lower than normal, around 10 kHz and above 25 kHz.

These problems seemed to come from the experimen-

137

138

Figure 6.1: Results of both the ITD and IID models when tested by the entire range of sound frequencies,

600 Hz ≤ f ≤ 30 kHz

tal HRTF input data.

The gain values across all of the input data are at

their maximum around these sound frequencies. Initial experiments for these sound frequencies produced very low classication results. The HRTF data was scaled to counteract this problem.

Scaling the data involves shifting

both the left and right ear HRTF datasets to another numerical range, i.e. in the right ear dataset, every data value is divided by the maximum data value of that dataset and then multiplied by the new maximum value. This operation ensures every data value lies in a range below the new maximum value.

After scaling, the results did improve but those problematic sound

frequencies continued to achieve the lowest results across the entire range of sound frequencies. It appears that sound data with a wide range of HRTF gain values will prove to be problematic for a classication task, which can be improved by scaling all of the data but will still manifest itself in comparison to the results reported for all of the other sound frequencies. Furthermore, as discussed in Chapter 5, [Tollin and Yin, 2002b] reported that IIDs vary non-monotonically with azimuth at very high sound frequencies (≥

25 kHz )

and thus are dicult to classify, providing further insight for the lower classication accuracies within this range. Nevertheless, in spite of these issues, Figure 6.1 shows there is clearly a need for a duplex sound localisation system, where the two binaural cues are processed dierently and localise very dierent ranges of sounds. The ITD model cannot classify the input HRTF data to azimuthal angles in sounds

≥ 1.6 kHz

quency sounds.

due to the absence of phase-locking in these high fre-

In low frequency sounds, the outputs of the cochlea are

phase-locked to the input waveform and this is crucial for the ability of the model to decode the ITD and classify the input data (localise input data to angles of location). Figure 6.2 demonstrates the presence of phase-locking in low frequency sounds and how this feature gradually disappears as the frequency of the incoming sound increases. Up to 1.4 kHz, phase-locking is clearly visible in the spike train output of the cochlea model. Increasing the sound frequency by 800 Hz to 2.2 kHz, the presence of the phase-locking is beginning to decrease, it is not as visibly crisp as with the lower frequencies. At 3 kHz, the presence of phase-locking is for the most part gone; there are gaps in the spike train where it could be assumed a spike should be located to achieve the phase-locking, and there are many erroneous spikes which hide the pattern of the input waveform. Once the pattern of the input

139

Figure 6.2:

Phase-locking in the spike train output of the cochlea model

disappears as the sound frequency increases. The data for this plot comes from the -60° angle for each of the following sound frequencies: 600 Hz, 1400 Hz, 2200 Hz and 3000 Hz.

140

Figure 6.3: SNN network to process the ITD for angles originating from the range -60° to 0°. The bushy cell neurons have been removed as the sound frequencies being processed are greater than 3 kHz.

waveform cannot be constructed from the cochlea output, the ITD model cannot decode the ITD from the two inputs and therefore cannot localise the input HRTF data to azimuthal angles.

On a computational note, the

bushy cell layer was removed from the ITD model when the sound frequency being processed was greater than 3 kHz, see Figure 6.3. At these frequencies, biological bushy cells have a primary-like response type, i.e. for every spike which arrives at a bushy cell, one spike is generated, [Yin, 2002]. For computational eciency, the ipsilateral outputs of the cochlea model at these frequencies were routed directly to the output neurons and the contralateral outputs of the cochlea were routed directly to the delay-line structure. From Figure 6.1, it can be seen that there is an intermediate region of frequencies in the range of

1.8 kHz ≥ f ≤ 4.2 kHz

where neither the ITD or

IID model can accurately localise the input HRTF data to angles of location. This can also be seen in the biological auditory system; the crossover between the localisation of low and high frequencies cannot be localised to any great accuracy, [Zhou, 2002].

It was decided to combine the outputs

of both the ITD and IID models at these frequencies to determine the appropriate angle of location.

However, as the classication accuracies from

both models are quite poor in this range, it was doubtful whether their combination could accurately decide on the actual angle. To do this, an extra output layer of LIF spiking neurons was developed which received as input

141

Figure 6.4: Extra layer of spiking neurons which take input from the outputs of the ITD and IID models to provide further classication of the intermediate range of sound frequencies,

1.8 kHz ≤ f ≤ 4.2 kHz .

142

Figure 6.5: Comparison of results from the extra layer of spiking neurons whch combine the outputs of the ITD and IID models for the intermediate range of frequencies,

1.8 kHz ≤ f ≤ 4.2 kHz

the outputs of both the ITD and IID model in a fully connected manner, see Figure 6.4. Again, SHL was used as the supervising learning algorithm to associate the inputs from the ITD and IID models to angles of location. The behaviour of the network during training is the same as discussed in Chapter 4 when training the ITD model with SHL. Classication is calculated on a maximum-spikes basis, the estimated output of the network refers to which output neuron produces the highest number of spikes. ReSuMe was initially used, but again proved problematic for reasons similar to the problems discussed in Chapter 4 with training the ITD model. ReSuMe was unable to associate a specic output neuron to specic input neurons as the network is fully connected and it was dicult to determine what the correct pattern of supervisory spikes should be. The results from this extra layer of spiking neurons can be seen in Figure

143

6.5.

The gure plots the classication results from the ITD model, IID

model and the extra training layer for the intermediate range of frequencies,

1.8 kHz ≥ f ≤ 4.2 kHz .

The results seem to improve slightly in comparison

to those reported from the ITD and IID models. However, it appears the extra layer of spiking neurons improves the classication by reproducing the best results from either the ITD or IID model. The learning algorithm only succeeded in combining the outputs of the ITD and IID models and choosing the higher classication accuracy at each sound frequency. Figure 6.6 shows the testing accuracies for each azimuthal angle,

θ ≤ +60°, across all sound frequencies from 600Hz ≤ f ≤ 30 kHz . ing accuracies for each angle

±10°

The test-

are also displayed. The data from these

plots comes from three dierent sources. Low frequency results,

f ≤ 1600 Hz

−60° ≤

600 Hz ≤

come from the accuracies achieved by the ITD models. The

intermediate range of results,

1.8 kHz ≥ f ≤ 4.2 kHz ,

come from the accu-

racies generated by the extra layer of spiking neurons which takes as input the outputs of both the ITD and IID networks. The high frequency results

≥ 4.8 kHz

come from the accuracies generated by the IID models.

In all

cases, there are higher accuracies achieved across the entire range of sounds for the angle

±10°

in comparison to the classication accuracy of the actual

angle. This can be seen clearly where an angle has problems with classication across the frequency range, e.g. the 50° angle misclassies the input data for two large periods of sound frequencies, 5 kHz to 12 kHz and

≥ 21 kHz .

However, 50°±10° shows almost 100% classication accuracy across the entire frequency range.

In these cases, samples of input data corresponding

to 50° are actually being classied to either 40° or 60°. Furthermore, in accordance with Figure 6.1, many of the angles show a reduced accuracy for sound frequencies around 10 kHz and greater than 25 kHz. An unexpected outcome from reporting the testing accuracies of each individual azimuthal angle is the high classication accuracies for both 0° and 60° across all sound frequencies. It is thought that the reason for this could be the order of delivery of input data to the models during training. Both learning algorithms used in this research, SHL and ReSuMe, increase the weights of those connections which cause correct classications and decrease the weights of those connections which cause misclassications. The data is passed to the network in order of angle and the output neurons for 0° and 60° are represented by the nal output neuron in each of the left and right

144

Figure 6.6: Classication accuracies for each individual angle,

+60°,

across all sound frequencies,

600Hz ≤ f ≤ 30 kHz

145

−60° ≤ θ ≤

sub-networks respectively. During the training procedure, the last batch of input data samples which corresponds to these angles would be delivered to the network while those output neurons are currently being supervised by the learning algorithm. Therefore, the weights on the connections to these output neurons which cause correct classications receive weight increases at the end of each epoch of training. The weights on these connections were decreased while other data is passing through the network if any misclassications occur. But as the positive weight updates occur at the end of each epoch, it is possible that the weights on these connections would be higher than average ensuring very high classication accuracies of these two output neurons and thus the angles 0° and 60°. Further work on this research would involve presenting the data during training in a random order to determine whether this is the reason for the high classication accuracies of 0° and 60° angles.

6.3 Addition of Noise A signicant consideration when modelling mammalian sound localisation is the ability to localise a sound source in the midst of noise. This particular topic has been the focus of much research.

In 1976, Jacobsen determined

that when a pure tone signal is presented along with white noise, the ability to localise that pure tone is only compromised when the signal to noise ratio (SNR) falls below 20, [Jacobsen, 1976]. The SNR denes how much of the original signal has been corrupted by noise, in the case of this research, the signal relates to the HRTF input data and noise relates to the addition of white Gaussian noise. SNR can be dened as:

SN R = where

P

is the average power.

Psignal Pnoise

(6.1)

[Good and Gilkey, 1996] also performed

thorough investigations on the aect of noise. They found that a broadband click can be localised until the SNR falls into the negative range,. For further information on sound localisation experiments which incorporate noise, see [Stern et al., 2006]. It was deemed necessary to determine whether the SNN models of sound localisation presented in this research would have similar performance abilities

146

(a) Left ear data

(b) Right ear data

Figure 6.7: Mapping of original input data against that data when noise is incorporated for the SNRs of 0.1, 1, 5, 10, 20 and 30.

147

148

original classication accuracies of each sound with no noise added

Figure 6.8: Classication accuracies when noise is added with ve levels of SNR, from 0.1 to 30, where Orig. refers to the

to those experiments discussed above. To do this, white Gaussian noise was added to the HRTF data and both the ITD and IID models were tested with a range of low and high frequency sounds. A range of SNRs were chosen for this task: 0.1, 1, 5, 10, 20 and 30. The MATLAB

awgn function was used to

incorporate the white Gaussian noise into the HRTF input data. Figure 6.7 maps the original left and right HRTF input data for angles in the range of

±60°

for the 15 kHz sound against the same input data when noise is added

at dierent SNRs. As expected, an SNR of 0.1 provides the most change to the original input data as there is a higher ratio of noise to original data, while an SNR of 30 has little impact as there is a higher ratio of original data to noise. Figure 6.8 plots the classication accuracies when both the ITD and IID models are tested with noisy data. Each model is tested with three dierent sounds; the ITD model is tested with the 600 Hz, 800 Hz and 1000 Hz sound, while the IID model is tested with 5 kHz, 15 kHz and 25 kHz. Each subplot shows the classication accuracy of the original non-noisy data and for the ve dierent SNRs of noise added to the original data. For each sound, the classication accuracies decrease almost monotonically as the SNR decreases from 30 to 0.1.

However, in some cases, higher SNR ratios report lower

accuracies than for the same sound with a lower SNR. For example, the input data for the 600 Hz sound with an SNR of 1 produces higher classication accuracies than for the same input with an SNR of both 5 and 10.

It is

believed that the reason for this is due to both the random nature of the Poisson encoding scheme of the cochlea models at the input layer and the random nature of adding noise to data itself. However, apart from 600 Hz (original data for this sound frequency is poor), all the sounds tested show a high degree of robustness to all the levels of noise, in agreement with the experiments described at the beginning of this section. In conclusion, both the ITD and IID models developed in this work maintain a high degree of localisation accuracy in the presence of varying levels of noise.

6.4 Generalisation Testing

As discussed earlier in this chapter, SNN models were trained for thirty-one different sound frequencies for each of the binaural cues, ITD and IID. Testing was carried out on these networks using the training sound again, with different Poisson spike trains generated from the cochlea models, and using the neighbouring sound frequencies. It should be considered, however, whether there is a need for so many trained networks. Rather than just testing the fixed networks on neighbouring sounds, it would be interesting to determine just how wide a range of sounds can be tested on a trained network while still producing acceptable classification accuracies.

Figure 6.9: Generalisation across non-neighbouring sounds

To do this, both the ITD and IID networks were tested using non-neighbouring sound frequencies; Figure 6.9 demonstrates how both models fare with this wider range of generalisation testing. The IID network only generalises well when the input data of the testing sound frequency is similar to the input data from the sound frequency used to train the SNN model, with respect to the receptive field configuration. As the frequency of the sounds increases, the receptive field configurations need to be adapted to cater for the differing ranges of frequencies. This can be seen in Figure 5.5 of Chapter 5, where the range of output frequencies from the LSO neurons for each angle differs between the two sound frequencies, 5 kHz and 15 kHz. For the 5 kHz sound, the LSO neuron produces output frequencies across the angles ±60° which range between 250 Hz and 600 Hz. In contrast, when presented with the 15 kHz sound, a wider range of frequencies is produced by the LSO neuron, ranging from 50 Hz to 600 Hz. This indicates that sound frequencies far removed from each other require different receptive field parameters in order to localise the HRTF input data. Nevertheless, when the receptive field configurations are appropriate for the input data, the testing accuracy on non-neighbouring sounds is quite good. Figure 6.9 plots the testing accuracy of the IID model for the 5 kHz sound frequency. It also shows the classification accuracies achieved when tested with lower and higher non-neighbouring sound frequencies.

The lower frequencies of 3.8 kHz, 4 kHz and 4.2 kHz achieve a 0% classification accuracy, i.e. the receptive field configurations are so different between 5 kHz and these lower sound frequencies that no data can be routed to the output layer. The higher frequencies of 5.8 kHz, 6 kHz and 6.2 kHz produce decreasing classification accuracies as the sound frequency increases. Yet, the classification accuracies for ±10° do not change with the increasing sound frequencies. In these cases, the receptive field parameters are adequate for these higher non-neighbouring sound frequencies.

The ITD model for 1400 Hz was tested with lower sound frequencies of 600 Hz, 800 Hz and 1000 Hz, and higher sound frequencies of 1800 Hz, 2000 Hz and 2200 Hz. In contrast to the IID model, the ITD model generalises well across all of the lower non-neighbouring sound frequencies. The main reason is that there are no receptive fields in the ITD models; classification is achieved through the delay line structures. However, for the higher sound frequencies, the classification accuracies decrease as the frequencies increase. This is due to the absence of phase-locking at these higher frequencies. When phase-locking is not present in the spike trains, the delay line structures cannot localise the input HRTF data to angles of location. Therefore, for all of the low frequency sounds, i.e. sounds ≤ 1.6 kHz, there only needs to be one trained ITD model through which all of the low frequency sounds can be processed. This generalisation testing demonstrates that there is some redundancy in this work. Taking these results into account, it would be possible to remove the trained networks for sounds which can be classified by other trained networks. Doing so would reduce the amount of time required to train all of the thirty-one networks and would make the overall system more compact.
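The routing that these results suggest can be summarised in a small dispatcher. This is a sketch of the selection logic only: localise_itd and localise_iid are hypothetical placeholders for the trained ITD and IID networks, and the 1.6 kHz boundary is the one identified above.

    % Sketch of the duplex dispatch suggested by the generalisation results.
    % localise_itd / localise_iid are hypothetical stand-ins for the
    % trained SNN models.
    function angle = duplex_localise(signal, freq_hz, iid_train_freqs)
        if freq_hz <= 1600
            % A single trained ITD network covers all low-frequency sounds
            angle = localise_itd(signal);
        else
            % Choose the IID network whose training frequency (and hence
            % receptive-field configuration) lies closest to the input
            [~, idx] = min(abs(iid_train_freqs - freq_hz));
            angle = localise_iid(signal, idx);
        end
    end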

6.5 Conclusion

The purpose of this chapter was to analyse the capabilities of the two SNN models developed in this research and form them into a comprehensive duplex system, namely the combination of the ITD model developed for low frequency sound localisation and the IID model developed for high frequency sound localisation. The previous two chapters outlined the development of these models, and the aim of this chapter was to test the strengths and weaknesses of said models. Upon performing the sound localisation experiments on both the ITD and IID models with the full range of sound frequencies available, the need for the use of two different binaural cues for sound localisation was confirmed. The ITD model performed sound localisation for low frequency sounds while the IID model localised high frequency sounds. The alternative scenario, where the ITD model tried to localise high frequency sounds and the IID model attempted to localise low frequency sounds, was proven to be infeasible, as expected. These results confirmed the ability of both models to behave in a similar fashion to the mammalian auditory pathways upon which they were based.

Experiments were performed in which both models demonstrated their ability to localise sounds in the midst of varying levels of noise with a high degree of robustness. When testing on non-neighbouring sound frequencies, it was ascertained that the generalisation ability of the IID networks depended on receptive field configuration, while the generalisation ability of the ITD network was adequate across all low frequency sounds. This highlighted the possibility of reducing the number of trained networks in this work, thus optimising the computational and time overhead.

The main contributions of this chapter involve the experimental results which show that the two SNN models behave in a similar way to the mammalian auditory system, i.e. the SNN model which extracts and processes the ITD binaural cue performs best when localising low frequency data, and the SNN model for IID performs best when localising high frequency data. Their combination produces the duplex model of sound localisation, where sounds throughout the frequency range can be localised using one of the two binaural cues. The other main contribution relates to the robustness of the two SNN models to varying levels of noise. Neither model was completely inhibited in its ability to localise adequately by the presence of white Gaussian noise in the HRTF data; in fact, both show a high level of robustness to noise. Limitations of this chapter involve the unexpected outcome of the high classification accuracies for both the 0° and 60° angles. To explicitly determine the reason for this would involve implementing a different training strategy for the two models, and as such this would be interesting to investigate in future work. The next chapter will outline the overall conclusions to be made from the work outlined in this thesis.

Chapter 7

Conclusions and Recommendations

The ability to model the ways in which mammals localise sound has numerous applications, including robotics, virtual reality and teleconferencing, to mention but a few. As discussed in Chapter 1, biologically inspired techniques bring many enhancements to these applications. This attention to biological detail and its associated advantages provided the motivation for this work. Thus, the key objective of this work was to create SNN models which emulate the way in which mammals localise sound. This involved the development of models which process and extract the binaural cues of ITD and IID with topologies inspired by the mammalian auditory pathways. To do this, a thorough understanding of the workings of each individual component of the auditory pathways was required, as discussed in Chapter 2. Chapter 3 outlined many different techniques and methods for the development of sound localisation systems, ranging from the purely computational to the biologically inspired. Together with the knowledge of the workings of the mammalian auditory system and existing modelling techniques, it was possible to formulate ideas which would advance said techniques in a more biologically inspired way.

It became clear that there were certain ways in which an improvement to existing techniques of sound localisation could be made. These included the use of both binaural cues; a wide range of sound frequencies from low to high; topologies which are faithful to the architecture of the mammalian auditory pathways; the utilisation of real experimental data rather than simulated data; and a fine resolution of angles. Additionally, to evaluate the capabilities of both the ITD and IID models outlined in Chapters 4 and 5 respectively, a biologically plausible supervised learning algorithm was used to train the networks to localise the real experimental sound data to a high degree of accuracy. Once these models had been developed, it was necessary to analyse the processing abilities of both models with regards to localisation in the midst of noise and generalisation capabilities, as outlined in Chapter 6.

That chapter also provided experimental results of both models when they were presented with the full range of sound frequencies. As expected, the ITD model localised low frequency data accurately and the IID model was demonstrated to be more suitable for high frequency sound localisation. These models can thus be combined to form a comprehensive duplex system of sound localisation.

7.1 Comparison to Similar Work

It is difficult to make direct comparisons between the results achieved in this research and the work outlined by other researchers in Chapter 3. This is due to the many different methods that are used for sound localisation modelling, such as: the implementation approach, which can range from purely computational to biologically inspired; the use of either or both of the two binaural cues, ITD and IID; the type of data, from pure tones to HRTF measurements, and whether it is simulated or experimentally derived; the resolution of angles being localised; the range of sound frequencies tested by the model; and the use of a learning algorithm to train the models. However, out of all the techniques reviewed in Chapter 3, the closest and most comparable to this work is that of [Voutsas and Adamy, 2007, Liu et al., 2008b].

Voutsas and Adamy developed a biologically inspired cross-correlation model of spiking neurons with multiple delay lines using both excitatory and inhibitory connections; they used one binaural cue, ITD; their data consisted of pure tones recorded in an anechoic environment using the Darmstadt robotic head; their angle range of ±105° had a granularity of 30°; the sound frequencies ranged between 120 Hz and 1240 Hz, which is appropriate for the binaural cue of ITD, which localises low frequency sounds; an evolutionary algorithm was used to tune the network; and they reported localisation accuracies of 59%.

Liu et al. developed a biologically inspired network of spiking neurons; some experiments used only the ITD binaural cue, while others used both binaural cues of ITD and IID; their data consisted of both artificial and real pure tones; their angle range of ±90° had a resolution of 30°; the sound frequencies used in their experiments were 500 Hz, 1000 Hz and 2000 Hz; and a combination of the winner-take-all and the weighted mean methods was used to determine the ITD, while the IID only determined whether the sound originated from the left or right. The localisation accuracies are: artificial pure tone with the ITD cue only, 70%; artificial pure tone with the two binaural cues, 80%; real pure tone with the ITD cue only, 50%; and real pure tone with the two binaural cues, 65%.

In contrast, the work in this thesis outlines the development of two SNN models inspired by the architecture of the mammalian auditory system; both binaural cues of ITD and IID are used to process sounds in their respective and appropriate frequency ranges; the data consists of experimentally-derived HRTF measurements from adult domestic cats which are used to create pure tones for input to the cochlea models; the range of localised angles is smaller at ±60°, but with a finer resolution of 10°; the sound frequencies used in these experiments range from 600 Hz to 30 kHz, which allowed for the distinction of two categories of low and high frequencies, enabling the two binaural cues to process suitable sound frequencies, i.e. the full duplex system of sound localisation; and the SNN model based on the ITD cue used the SHL learning algorithm, while the SNN model based on the IID cue used the ReSuMe learning algorithm. The localisation accuracies were outlined in Section 6.2 of the previous chapter; to summarise, the SNN model for ITD achieved high localisation accuracies for low frequency data and the SNN model for IID achieved high localisation accuracies for high frequency data.

This discussion compares and contrasts the work presented in this thesis to the most closely related work in the literature. The comparison highlights the significant contributions made by this research and how they advance the work in this field.

7.2 Concluding Summary

The main conclusions of each chapter are summarised as follows:

Chapter 2 presented a review of the literature discussing the ability of the mammalian auditory system to localise sounds. Rayleigh's duplex theory describes the two binaural cues, ITD and IID, which are required for successful localisation of low and high frequency sounds. These cues are processed in the parts of the auditory system appropriate to sound localisation, namely the MSO and LSO. Jeffress' theoretical computational model describes how the ITD cue is processed and extracted in mammals to determine the angle of origin of a sound signal. This model and Rayleigh's duplex theory were the key methodologies which underpinned the research outlined in this thesis. The paths the sound takes as it travels through the auditory system from the outer ear to the auditory cortex are discussed at length, and the cell types and their functionalities are described at each phase of the auditory pathways. This review was a key factor in this research: a thorough understanding of the mammalian auditory pathways and their ability to localise sounds was essential for the implementation of biologically inspired SNN models which can process and extract the binaural cues in order to localise sound data.

Chapter 3 outlined the different computational techniques which were required for implementing the SNN models to localise sound. Both SNN models use an auditory periphery model as their input layers to encode the HRTF measurements to spike trains, from which the binaural cues can be extracted. A brief review of the cochlea modelling research field was provided, with particular attention paid to the auditory periphery model used in this work. ANNs, SNNs, learning algorithms, network design, receptive fields and dynamic synapses were discussed in terms of their biological plausibility and/or computational efficiency. This review was essential for the practical ability to develop the models with topologies inspired by the mammalian auditory pathways. Additionally, the final sections of the chapter discussed the many different methods previous researchers have employed for the development of sound localisation systems. Techniques ranged from the purely computational to the biologically inspired. Biological plausibility of sound localisation modelling can be achieved by using biologically inspired approaches, real non-simulated data, and the topology of the auditory system as an inspiration for architecture.

Chapter 4 introduced the SNN topology developed for processing the binaural cue of ITD. This topology, including an auditory periphery model, models of bushy cells from the AVCN, a delay line structure and spiking neurons, was inspired by Jeffress' computational model of the MSO. Initial work is described which involved simulated data, single spike encoding and the STDP learning algorithm. Successful experimental results from this led to the involvement of low frequency experimentally-derived HRTF measurements from cats. This required extending the network to cater for spike trains and the other auditory pathway components, the auditory periphery model and bushy cells. The delay structure was also extended. The outputs of the MSO neurons were classified to angles of location using the SHL learning algorithm. In this chapter, two different extended topologies were presented. The first was successful in its localisation abilities, but could be criticised in that the topology of the network was pre-designed and determined the classification accuracies. The second topology was then developed, which consisted of a generic delay structure, meaning the SHL learning algorithm was wholly responsible for the classification results.

Chapter 5 introduced the second SNN topology, developed for processing the binaural cue of IID. This topology included an auditory periphery model, facilitating synapses, spiking neurons and receptive fields, and was inspired by the behaviour of the mammalian LSO. Again, an initial proof of concept is described, consisting of a single spiking neuron which was modelled to emulate the behaviour of LSO neurons using simulated data. This initial work proved successful and led to the inclusion of high frequency experimentally-derived HRTF data. The network was extended from the single spiking neuron to a multiple layered SNN. Training with the ReSuMe supervised learning algorithm produced successful classification results.

The purpose of Chapter 6 was to analyse the capabilities of the two SNNs developed in this research and form them into a comprehensive duplex system, namely the combination of the ITD model developed for low frequency sound localisation and the IID model developed for high frequency sound localisation. The previous two chapters outlined the development of these SNNs, and the aim of this chapter was to test the strengths and weaknesses of said SNNs with regards to the following: performing additional experiments whereby both SNN models have the ITD incorporated into the input waveforms; processing the full range of sound from low to high frequencies; reporting the localisation accuracy for individual angles; determining the robustness of both models in the presence of noise; and testing the generalisation capabilities of both models.

7.3 Contributions of the Thesis

The primary contributions of the thesis are:

The integration of biological models with the two SNN models. This included the auditory periphery model, which comprised the input layers of both SNNs; LIF neurons, which were utilised throughout the SNN models; the LSO neuron in the SNN model for IID, which employed both excitatory and inhibitory facilitating synapses to produce the behaviour of the mammalian LSO neurons; and receptive fields, which were used to promote neuron selectivity and thus assisted the IID model in localising the input data to angles of location.

This work used experimentally derived acoustical HRTF data throughout the development of both SNNs. This data was generated from experiments with adult domestic cats. Despite the complex and non-linear nature of this data, it enhanced the biological plausibility of both the SNNs developed.

The development of an SNN model which processes and extracts the ITD binaural cue from low frequency experimental HRTF sound data. This SNN enables the classification (localisation) of this data to azimuthal angles on the horizontal plane.

The development of an additional SNN model which processes and extracts the other binaural cue, IID, from high frequency experimental HRTF sound data, again enabling the localisation of this data to azimuthal angles.

The evaluation of both SNN models using two biologically inspired supervised learning algorithms. SHL was used to train the weights on the synaptic connections within the delay structure in the ITD model, whereas ReSuMe was employed for training the output layer weights of the IID model. Both learning algorithms were demonstrated to be well suited to their particular tasks, i.e. the two distinct SNN models. The reasoning behind the need for two different learning algorithms was discussed.

The experimental results derived from the combination of both SNN models show that they behave in a manner similar to the mammalian auditory system with regards to the localisation of both low and high frequency sounds. Neither SNN model can localise the full range of sound frequencies, but together they demonstrate this capability, thus forming a duplex system of sound localisation.

Finally, both SNN models showed a high degree of robustness in sound localisation in the presence of noise. Experimental results were obtained to demonstrate this when the experimentally-derived input data was contaminated with differing levels of white Gaussian noise.

7.4 Future Work

There are various ways that this work could be continued. These include:

Synaptic inhibition, briefly introduced in Chapter 2, refers to finely tuned temporal inhibition which can adjust the sensitivity of the coincidence detector neurons to the range of ITDs, [Grothe, 2003, McAlpine and Grothe, 2003]. Incorporating synaptic inhibition into the SNN models outlined in this thesis would improve the performance in two ways. Firstly, incorporating inhibition into the delay structure of the SNN model which processes ITD would enhance the localisation abilities of that model by decreasing the impact of non-coincidental stimuli at the output layer neurons. Currently, the chosen angle is determined by the output neuron which has the highest firing frequency, but in some cases many output neurons have high firing frequencies as their two inputs are almost coincident. Including inhibition would remove any ambiguities in the maximum-firing technique. Inhibition would also remove the need for the left and right subnetworks of both SNN architectures. It is more biologically plausible for both MSOs and LSOs to be active concurrently, and incorporating inhibition into the SNN models would allow for this.
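For reference, the maximum-firing readout discussed above amounts to no more than an argmax over the output firing counts; the near-ties it suffers from are exactly what lateral inhibition would suppress. A minimal sketch, with spike_counts as hypothetical per-angle output firing counts:

    % Sketch of the maximum-firing readout used to choose an angle.
    angles       = -60:10:60;                          % azimuths used in this work
    spike_counts = [3 5 9 22 40 38 41 30 12 7 4 2 1];  % hypothetical counts

    [~, winner]     = max(spike_counts);               % winner-take-all over outputs
    estimated_angle = angles(winner);

    % The near-tie between 40 and 41 above is the kind of ambiguity that
    % inhibition between output neurons would be expected to remove.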

Currently, both the ITD and IID SNNs developed in this work are capable of processing pure tones only. However, there is the possibility of extending this work to process complex sounds, which are a collection of differing pure tones originating from the same location. This would involve implementing a change to the processing abilities of the cochlea models. The biological cochlea is sensitive to the many different frequency components of complex sounds and can separate these components and distribute them tonotopically through the auditory system. For the purposes of the work presented in this thesis, the cochlea models processed each sound frequency component individually, i.e. the cochlea models do not have the ability to recognise and separate a complex sound. By implementing this feature of the biological cochlea, a complex sound could be localised using the ITD and IID models. Future work on the sound localisation of complex sounds will incorporate this idea, as illustrated by Figure 7.1.

Figure 7.1: Classification of complex sounds to angles of location

The figure presents a complex sound consisting of various frequency components: 5 kHz, 15 kHz and 25 kHz. When separated into these distinct components (pure tones), they can be processed by the existing SNNs outlined in the two previous chapters to produce the same angle of location. The ability to produce the angle of location for multiple frequency components could also be used to increase the accuracy of sound localisation, as there is more information associated with each complex sound in the form of multiple pure tones.
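A sketch of how such a front end might operate is given below: identify the dominant spectral components of the complex tone, after which each component could be band-pass filtered out and routed to the appropriate model by frequency. This is an illustrative assumption about the decomposition step, not the cochlea models used in this thesis; findpeaks is from the MATLAB Signal Processing Toolbox.

    % Illustrative sketch of splitting a complex sound into pure-tone
    % components (assumed front end, not the thesis cochlea models).
    fs = 100e3;                                % sample rate (assumed)
    t  = 0:1/fs:0.1;
    x  = sin(2*pi*5e3*t) + sin(2*pi*15e3*t) + sin(2*pi*25e3*t);

    % Locate the three dominant components in the magnitude spectrum
    X    = abs(fft(x));
    f    = (0:numel(x)-1) * fs / numel(x);
    half = f <= fs/2;
    [~, locs]  = findpeaks(X(half), 'SortStr', 'descend', 'NPeaks', 3);
    comp_freqs = sort(f(locs));                % ~5, 15 and 25 kHz here

    % Agreement between the per-component angle estimates could then be
    % used to sharpen the final localisation decision.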

An area of research relating to sound localisation is known as the cocktail party effect.

This describes the ability to focus on a particular sound in the midst of competing sound sources. Continuing work in the area of auditory research will investigate the extraction of a sound signal from a mixture of several sound sources. This direction of research will also look into the integration of the other senses. It is generally understood that the ability to localise sounds is improved when positional feedback from the other sensory systems is incorporated. [Hofman, 2000] discussed how the visual system can discern spatial knowledge of a sound source and that this information is used by the auditory system to sharpen sound localisation. Furthermore, it is also believed that the somatosensory (touch) system can provide location information to the auditory system. This multimodal approach to sound localisation is grounded in biology and as such is an important area of future exploration.

In the introduction to this thesis, the rationale for research into mammalian sound localisation was outlined. It was mentioned that a key reason for this area of research is to increase the intelligent behaviour of robotics and to enable robots to become more human-like in their behaviour. Robot audition is an area of research in its infancy, thus there are many possible future directions to take.

For instance, future work could involve investigating localisation and navigation in a noisy and dynamic environment. There is also the added difficulty of interaction with humans. A further interest is in the development of humanoid robots, i.e. robots which can mimic human behaviour. Incorporating a human-like auditory system within a robot is an important future goal.

On a more practical note, it would be interesting to investigate alternative architectures for the extra training layer which was used for the intermediate range of sound frequencies. Currently, the results in this frequency range are based on the combination of the outputs from both the ITD and IID models. This was outlined in Chapter 6, where an additional feedforward layer was used in an attempt to provide higher localisation accuracies in this frequency range. Future work on this topic will aim to solve the problem of producing poor classifications when inaccurate data from either the ITD or IID model is presented to this output layer.

Also, the implementation of both SNN models in MATLAB is currently computationally intensive. For this reason, preliminary work has already been undertaken to develop these models on an FPGA platform, to improve computational speed and hence the real-world applicability of the models.

Finally, I would just like to say a few words on my experience of doing this PhD and thesis.

The research involved in doing a PhD needs to be fundamentally original, and for this reason I have found the entire experience to be interesting, stimulating and challenging. The wider advantages of doing a PhD involve becoming a specialist in a defined area; gaining the ability to undertake independent study and thus be able to self-manage in non-structured situations and also to work in teams; additionally, doing a PhD enables you to advance or learn skills in project development, fact finding and analysis, and report writing. Overall, it's an extremely rewarding process that provides many benefits to those who undertake it.

Bibliography

American Speech-Language-Hearing Association (ASHA). How hearing and balance work, 2006. URL http://www.asha.org/public/hearing/anatomy/default.htm.

P. J. Abbas. Electrophysiology of the auditory system. Clinical Physics and Physiological Measurement, 9(1):1–31, 1988.

L. F. Abbott, J. A. Varela, K. Sen, and S. B. Nelson. Synaptic depression and cortical gain control. Science, 275(5297):221, 1997.

H. Abdalla and T. K. Horiuchi. An ultrasonic filterbank with spiking neurons. In IEEE International Symposium on Circuits and Systems, ISCAS, pages 4201–4204, 2005.

A. M. Abdelbar, D. O. Hassan, G. A. Tagliarini, and S. Narayan. Receptive field optimization for ensemble encoding. Neural Computing & Applications, 15(1):1–8, 2006.

O. A. Alim and H. Farag. Modeling non-individualized binaural sound localization in the horizontal plane using artificial neural networks. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN), Como, Italy, volume 3, pages 642–647, 2000.

T. R. Anderson, J. A. Janko, and R. H. Gilkey. Modeling human sound localization with hierarchical neural networks. In IEEE International Conference on Neural Networks, IEEE World Congress on Computational Intelligence, volume 7, 1994.

J. Ashmore. Signals and Perception. The Fundamentals of Human Sensation, chapter 1: The mechanics of hearing, pages 1–15. Palgrave Macmillan, 2002.

J. Backman and M. Karjalainen. Modelling of human directional and spatial hearing using neural networks. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 1, 1993.

D. Baras and R. Meir. Reinforcement learning, spike-time-dependent plasticity, and the BCM rule. Neural Computation, 19(8):2245–2279, 2007.

D. Barber. Learning in spiking neural assemblies. Advances in Neural Information Processing Systems, pages 165–172, 2003.

I. Bazwinsky, H. Hilbig, H. J. Bidmon, and R. Ruebsamen. Characterization of the human superior olivary complex by calcium binding proteins and neurofilament H (SMI-32). The Journal of Comparative Neurology, 456(3):292–303, 2003.

G. E. Beckius, R. Batra, and D. L. Oliver. Axons from anteroventral cochlear nucleus that terminate in medial superior olive of cat: Observations related to delay lines. Journal of Neuroscience, 19(8):3146–3161, 1999.

A. Belatreche, L. P. Maguire, T. M. McGinnity, and Q. X. Wu. A method for supervised training of spiking neural networks. In Cybernetic Intelligence, Challenges and Advances, 2003.

A. Belatreche, L. P. Maguire, and T. M. McGinnity. Pattern recognition with spiking neural networks and dynamic synapses. In Proceedings of the 6th International FLINS Conference on Applied Computational Intelligence, Blankenberge, Belgium, September 1-3. World Scientific, 2004.

K. Bhatheja and J. Field. Schwann cells: origins and role in axonal maintenance and regeneration. International Journal of Biochemistry and Cell Biology, 38(12):1995–1999, 2006.

G. Q. Bi and M. M. Poo. Synaptic modification by correlated activity: Hebb's postulate revisited. Annual Reviews in Neuroscience, 24(1):139–166, 2001.

E. L. Bienenstock, L. N. Cooper, and P. W. Munro. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2(1):32, 1982.

S. M. Bohte and M. C. Mozer. Reducing spike train variability: A computational theory of spike-timing dependent plasticity. Advances in Neural Information Processing Systems, 17:201–208, 2005.

S. M. Bohte, J. N. Kok, and H. La Poutre. Unsupervised classification of complex clusters in networks of spiking neurons. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN) - Volume 3, 2000a.

S. M. Bohte, H. La Poutré, and J. N. Kok. SpikeProp: Error-backpropagation for networks of spiking neurons. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN), 2000b.

S. M. Bohte, J. N. Kok, and H. La Poutré. Spike-prop: Error backpropagation in multi-layer networks of spiking neurons. Neurocomputing, 48(1-4):17–37, 2002a.

S. M. Bohte, H. La Poutré, and J. N. Kok. Unsupervised clustering with spiking neurons by sparse temporal coding and multilayer RBF networks. IEEE Transactions on Neural Networks, 13(2):426–435, 2002b.

A. G. Bors. Introduction of the radial basis function (RBF) networks. In Online Symposium for Electronics Engineers, volume 1, pages 1–7, 2001.

C. L. Brockmann. Anatomy of the human ear, 2009. URL http://en.wikipedia.org/wiki/File:HumanEar.jpg.

D. S. Broomhead and D. Lowe. Radial basis functions, multi-variable functional interpolation and adaptive networks. 1988.

G. Bugmann. Normalized gaussian radial basis function networks. Neurocomputing, 20(1):97–110, 1998.

R. M. Burger and E. W. Rubel. Encoding of interaural timing for binaural hearing, pages 613–630. Academic Press, 2008.

L. Calmes. A Binaural Sound Source Localization System for a Mobile Robot. Master's thesis, Faculty of Mathematics, Computer Sciences and Natural Sciences, Rheinisch-Westfälische Technische Hochschule Aachen, 2002.

A. Carnell and D. Richardson. Linear algebra for time series of spikes. In Proc. of ESANN, pages 363–368. Citeseer, 2005.

C. E. Carr. Delay line models of sound localization in the barn owl. American Zoologist, 33(1):79–85, 1993.

C. E. Carr and M. Konishi. A circuit for detection of interaural time differences in the brain stem of the barn owl. Journal of Neuroscience, 10(10):3227–3246, 1990.

S. Cavaco and J. Hallam. A biologically plausible acoustic azimuth estimation system. In Proceedings of IJCAI Workshop on Computational Auditory Scene Analysis (CASA), pages 78–87. Citeseer, 1999.

V. Chan, A. Van Schaik, and S. C. Liu. Spike response properties of an AER EAR. In Proceedings of the IEEE International Symposium on Circuits and Systems, ISCAS, page 4. Citeseer, 2006.

W. Chau and R. O. Duda. Combined monaural and binaural localization of sound sources. In Conference Record of the Twenty-Ninth Asilomar Conference on Signals, Systems and Computers, volume 2, 1995.

W. Chung, S. Carlile, and P. Leong. A performance adequate computational model for auditory localization. The Journal of the Acoustical Society of America, 107:432–445, 2000.

G. Cochenour, J. Simon, S. Das, A. Pahwa, and S. Nag. A pareto archive evolutionary strategy based radial basis function neural network training algorithm for failure rate prediction in overhead feeders. In Proceedings of the Conference on Genetic and Evolutionary Computation, pages 2127–2132. ACM New York, USA, 2005.

E. Covey and J. H. Casseday. The monaural nuclei of the lateral lemniscus in an echolocating bat: parallel pathways for analyzing temporal features of sound. Journal of Neuroscience, 11(11):3456–3470, 1991.

A. G. Dabak. Binaural localization using interaural cues. Master's thesis, Department of Electrical and Computer Engineering, Rice University, Houston, TX, 1990.

W. R. D'Angelo, D. L. Oliver, and D. O. Kim. Modeling cochlear nucleus neurons: Responses to current pulse trains and current steps. In Proceedings of the IEEE 25th Annual Northeast Bioengineering Conference, pages 27–28, 1999.

R. de Jonge. Anatomy of the ear. University of Central Missouri, 2008.

M. S. de Queiroz, R. C. de Berrêdo, and A. de Pádua Braga. Reinforcement learning of a simple control task using the spike response model. Neurocomputing, 70(1-3):14–20, 2006.

B. Delgutte and A. Oxenham. Hearing and the auditory system: Overview. 2005.

M. J. Denham. The dynamics of learning and memory: Lessons from neuroscience. Lecture Notes in Computer Science, pages 333–347, 2001.

A. Destexhe. Cellular morphologies, 2009. URL http://cns.iaf.cnrs-gif.fr/.

W. A. N. Dorland, D. M. Anderson, J. Keith, P. D. Novak, and M. A. Elliott. Dorland's Illustrated Medical Dictionary. Saunders Philadelphia, 2003.

J. L. Elman. Finding structure in time. Connectionist Psychology: A Text with Readings, page 289, 1990.

A. P. Engelbrecht. Computational Intelligence: An Introduction. J. Wiley & Sons, 2002.

M. A. Farries and A. L. Fairhall. Reinforcement learning with modulated spike timing dependent synaptic plasticity. Journal of Neurophysiology, 98(6):3648, 2007.

A. S. Feng and W. Y. Lin. Neuronal architecture of the dorsal nucleus (cochlear nucleus) of the frog, Rana pipiens pipiens. The Journal of Comparative Neurology, 366:320–334, 1996.

D. C. Fitzpatrick, S. Kuwada, and R. Batra. Transformations in processing interaural time differences between the superior olivary complex and inferior colliculus: beyond the Jeffress model. Hearing Research, 168(1-2):79–89, 2002.

W. Gerstner. What's different with spiking neurons?, chapter 12, pages 23–48. Kluwer Academic Publishers, 2001.

W. Gerstner and W. M. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.

W. Gerstner, R. Kempter, J. L. van Hemmen, and H. Wagner. A neuronal learning rule for sub-millisecond temporal coding. Nature, 383(6595):76–78, 1996.

B. Glackin, J. A. Wall, T. M. McGinnity, L. P. Maguire, and L. J. McDaid. A spiking neural network model of the medial superior olive using spike timing dependent plasticity for sound localisation. Submitted to: Frontiers in Computational Neuroscience, 2010.

M. D. Good and R. H. Gilkey. Sound localization in noise: The effect of signal-to-noise ratio. The Journal of the Acoustical Society of America, 99:1108, 1996.

T. D. Griffiths and J. D. Warren. The planum temporale as a computational hub. Trends in Neurosciences, 25(7):348–353, 2002.

B. Grothe. New roles for synaptic inhibition in sound localization. Nature Reviews Neuroscience, 4(7):540–550, 2003.

B. Grothe and G. Neuweiler. The function of the medial superior olive in small mammals: temporal receptive fields in auditory analysis. Journal of Comparative Physiology A: Sensory, Neural, and Behavioral Physiology, 186:413–423, 2000.

B. Grothe and T. J. Park. Structure and function of the bat superior olivary complex. Microscopy Research and Technique, 51(4):382–402, 2000.

K. Guentchev and J. Weng. Learning-based three dimensional sound localization using a compact non-coplanar array of microphones. In Proceedings of the AAAI Symposium on Intelligent Environments, 1998.

R. Gutig, R. Aharonov, S. Rotter, and H. Sompolinsky. Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. Journal of Neuroscience, 23(9):3697, 2003.

A. A. Handzel, S. B. Andersson, M. Gebremichael, and P. S. Krishnaprasad. A biomimetic apparatus for sound-source localization. In Proc. IEEE Conf. on Decision and Control, 2003.

M. Hao, Z. Lin, H. Hongmei, and W. Zhenyang. A novel sound localization method based on head related transfer function. In 8th International Conference on Electronic Measurement and Instruments, ICEMI, page 428, 2007.

W. M. Hartmann. How we localize sound. Physics Today, 52(11):24–29, 1999.

S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2008.

D. O. Hebb. The organization of behavior: a neuropsychological theory. New York: Wiley, 1949.

M. J. Hewitt and R. Meddis. Regularity of cochlear nucleus stellate cells: A computational modeling study. The Journal of the Acoustical Society of America, 93:3390, 1993.

M. J. Hewitt and R. Meddis. A computer model of amplitude-modulation sensitivity of single units in the inferior colliculus. The Journal of the Acoustical Society of America, 95:2145, 1994.

M. J. Hewitt and R. Meddis. A computer model of dorsal cochlear nucleus pyramidal cells: Intrinsic membrane properties. The Journal of the Acoustical Society of America, 97:2405, 1995.

M. J. Hewitt, R. Meddis, and T. M. Shackleton. A computer model of a cochlear-nucleus stellate cell: Responses to amplitude-modulated and pure-tone stimuli. The Journal of the Acoustical Society of America, 91:2096, 1992.

A. L. Hodgkin and A. F. Huxley. Currents carried by sodium and potassium ions through the membrane of the giant axon of Loligo. The Journal of Physiology, 116(4):449, 1952a.

A. L. Hodgkin and A. F. Huxley. The components of membrane conductance in the giant axon of Loligo. The Journal of Physiology, 116(4):473, 1952b.

A. L. Hodgkin and A. F. Huxley. The dual effect of membrane potential on sodium conductance in the giant axon of Loligo. The Journal of Physiology, 116(4):497, 1952c.

A. L. Hodgkin, A. F. Huxley, and B. Katz. Measurement of current-voltage relations in the membrane of the giant axon of Loligo. The Journal of Physiology, 116(4):424, 1952.

P. M. Hofman. On the role of spectral pinna cues in human sound localization. PhD thesis, Radboud University Nijmegen, 2000.

J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554, 1982.

J. Huang, T. Supaongprapa, I. Terakura, F. Wang, N. Ohnishi, and N. Sugie. A model-based sound localization system and its application to robot navigation. Robotics and Autonomous Systems, 27(4):199–209, 1999.

A. J. Hudspeth. Principles of Neural Science, chapter 32: Hearing. Elsevier Science Publishing Co, Inc, 1991.

J. Huopaniemi and M. Karjalainen. Comparison of digital filter design methods for 3-D sound. In Proc. IEEE Nordic Signal Processing Symp. (NORSIG), pages 131–134. Citeseer, 1996.

E. M. Izhikevich. Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14(6):1569–1572, 2003.

E. M. Izhikevich. Which model to use for cortical spiking neurons? IEEE Transactions on Neural Networks, 15(5):1063–1070, 2004.

E. M. Izhikevich and N. S. Desai. Relating STDP to BCM. Neural Computation, 15(7):1511–1523, 2003.

T. Jacobsen. Localization in noise. Technical report, Technical University of Denmark Acoustics Laboratory, 1976.

H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78, 2004.

Q. Jarosz. Biological neuron, 2009. URL http://en.wikipedia.org/wiki/.

L. A. Jeffress. A place theory of sound localization. Journal of Comparative and Physiological Psychology, 41(1):35–39, 1948.

S. Jones, R. Meddis, S. C. Lim, and A. R. Temple. Toward a digital neuromorphic pitch extraction system. IEEE Transactions on Neural Networks, 11(4):978–987, 2000.

P. X. Joris and T. C. T. Yin. Envelope coding in the lateral superior olive. I. Sensitivity to interaural time differences. J. Neurophysiol., 73:1043–62, 1995.

P. X. Joris and T. C. T. Yin. Envelope coding in the lateral superior olive. III. Comparison with afferent pathways. Journal of Neurophysiology, 79(1):253–269, 1998.

P. X. Joris and T. C. T. Yin. A matter of time: Internal delays in binaural processing. Trends in Neurosciences, 30(2):70–78, 2007.

P. X. Joris, T. C. T. Yin, and P. H. Smith. Mechanisms of azimuthal sound localisation in the central nervous system of the cat. J. Dutch Acoust. Soc., 104:23–35, 1990.

P. X. Joris, P. H. Smith, and T. C. T. Yin. Coincidence detection in the auditory system: 50 years after Jeffress. Neuron, 21:1235–1238, 1998.

K. Kandler, A. Clause, and J. Noh. Tonotopic reorganization of developing auditory brainstem circuits. Nature Neuroscience, 12(6):711–717, 2009.

A. Kasinski and F. Ponulak. Experimental demonstration of learning properties of a new supervised learning method for the spiking neural networks. Lecture Notes in Computer Science, 3696:145, 2005.

A. Kasinski and F. Ponulak. Comparison of supervised learning methods for spike time coding in spiking neural networks. International Journal of Applied Mathematics and Computer Science, 16(1):101–113, 2006.

R. Kempter, W. Gerstner, J. L. van Hemmen, and H. Wagner. Temporal coding in the sub-millisecond range: Model of barn owl auditory pathway. In Advances in Neural Information Processing Systems 8, Editors: Touretzky, M. C., Hasselmo, M. E., Cambridge, MA, MIT Press, pages 124–130, 1996.

F. Keyrouz and K. Diepold. A novel biologically inspired neural network solution for robotic 3D sound source sensing. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 12(7):721–729, 2008.

F. Keyrouz, K. Diepold, and P. Dewilde. Robust 3D robotic sound localization using state-space HRTF inversion. In IEEE International Conference on Robotics and Biomimetics, ROBIO, pages 245–250, 2006a.

F. Keyrouz, Y. Naous, and K. Diepold. A new method for binaural 3-D localization based on HRTFs. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, volume 5, 2006b.

T. Kohonen. Self-organization and associative memory. Springer Information Sciences Series, page 312, 1989.

M. Konishi. Study of sound localization by owls and its relevance to humans. Comparative Biochemistry and Physiology, Part A, 126(4):459–469, 2000.

M. Konishi. Coding of auditory space. Annual Reviews in Neuroscience, 26(1):31–55, 2003.

R. J. Kulesza. Cytoarchitecture of the human superior olivary complex: Medial and lateral superior olive. Hearing Research, 225(1-2):80–90, 2007.

B. Kuszta. Silicon cochlea. In Proceedings of the IEEE Wescon Conference, pages 282–285, 1998.

R. Legenstein, C. Naeger, and W. Maass. What can a neuron learn with spike-timing-dependent plasticity? Neural Computation, 17(11):2337–2382, 2005.

M. S. Lewicki. Sound localization 1, 2006. URL http://www.cs.cmu.edu/~lewicki/cpsa/sound-localization1.pdf.

H. Li, J. Lu, J. Huang, and T. Yoshiara. Spatial localization of multiple sound sources in a reverberant environment. In International Computer Symposium (ICS), 2009.

C. Lim, R. O. Duda, A. L. Devices, and C. A. Sunnyvale. Estimating the azimuth and elevation of a sound source from the output of a cochlear model. In Conference Record of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, volume 1, 1994.

J. Liu, H. Erwin, and S. Wermter. Mobile robot broadband sound localisation using a biologically inspired spiking neural network. In IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, pages 2191–2196, 2008a.

J. Liu, H. Erwin, S. Wermter, and M. Elsaid. A biologically inspired spiking neural network for sound localisation by the inferior colliculus. In Proceedings of the 18th International Conference on Artificial Neural Networks, Part II, pages 396–405. Springer, 2008b.

J. Liu, D. Perez-Gonzalez, A. Rees, H. Erwin, and S. Wermter. Multiple sound source localisation in reverberant environments inspired by the auditory midbrain. In C. Alippi, M. Polycarpou, C. Panayiotou, and G. Ellinas, editors, Artificial Neural Networks - ICANN, volume 5768 of Lecture Notes in Computer Science, pages 208–217. Springer, 2009. ISBN 978-3-642-04273-7.

R. Llinas, I. Z. Steinberg, and K. Walton. Relationship between presynaptic calcium current and postsynaptic potential in squid giant synapse. Biophysical Journal, 33(3):323–351, 1981.

R. Lyon. A computational model of filtering, detection, and compression in the cochlea. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 7, 1982.

R. F. Lyon and C. Mead. An analog electronic cochlea. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(7):1119–1134, 1988.

W. Maass. Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10(9):1659–1671, 1997.

W. Maass and H. Markram. On the computational power of recurrent circuits of spiking neurons. Journal of Computer and System Sciences, 69(4):593–616, 2004.

F. Mammano and R. Nobili. The cochlea. 2005. URL http://147.162.36.50/cochlea/index.htm.

I. D. Marian. A biologically inspired model of motor control of direction. Master's thesis, University College Dublin, 2002.

K. D. Martin. Estimating azimuth and elevation from interaural differences. In IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, pages 96–99, 1995.

S. J. Martin, P. D. Grimwood, and R. G. M. Morris. Synaptic plasticity and memory: An evaluation of the hypothesis. Annual Reviews in Neuroscience, 23(1):649–711, 2000.

D. McAlpine and B. Grothe. Sound localization and delay lines - do mammals fit the model? Trends in Neurosciences, 26(7):347–350, 2003.

W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5(4):115–133, 1943.

J. F. Mejías and J. J. Torres. Improvement of spike coincidence detection with facilitating synapses. Neurocomputing, 70(10-12):2026–2029, 2007.

M. A. Merchan and P. Berbel. Anatomy of the ventral nucleus of the lateral lemniscus in rats: A nucleus with a concentric laminar organization. The Journal of Comparative Neurology, 372:245–263, 1998.

D. Mishra, A. Yadav, S. Ray, and P. K. Kalra. Exploring biological neuron models. Directions, 7:13–33, 2006.

J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281–294, 1989.

J. K. Moore. Organization of the human superior olivary complex. Microscopy Research and Technique, 51(4):403–412, 2000.

J. Murray, H. Erwin, and S. Wermter. Robotic sound-source localization and tracking using interaural time difference and cross-correlation. In Proceedings of NeuroBotics Workshop, pages 89–97, 2004.

J. Murray, S. Wermter, and H. Erwin. Auditory robotic tracking of sound sources using hybrid cross-correlation and recurrent networks. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3554–3559, 2005.

J. C. Murray, H. R. Erwin, and S. Wermter. Robotic sound-source localisation architecture using cross-correlation and recurrent neural networks. Neural Networks, 22(2):173–189, 2009.

K. Nakadai, T. Lourens, H. G. Okuno, and H. Kitano. Active audition for humanoid. In Proceedings of the National Conference on Artificial Intelligence, pages 832–839. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2000.

K. Nakadai, H. G. Okuno, and H. Kitano. Real-time sound source localization and separation for robot audition. In Seventh International Conference on Spoken Language Processing. ISCA, 2002.

K. Nakadai, H. G. Okuno, and H. Kitano. Robot recognizes three simultaneous speech by active audition. In IEEE International Conference on Robotics and Automation, ICRA, volume 1, 2003.

H. Nakashima, Y. Chisaki, T. Usagawa, and M. Ebata. Frequency domain binaural model based on interaural phase and level differences. Acoustical Science and Technology, 24(4):172–178, 2003.

D. Nandy and J. Ben-Arie. An auditory localization model based on high-frequency spectral cues. Annals of Biomedical Engineering, 24(6):621–638, 1996.

U. Ndubaku and M. E. de Bellard. Glial cells: Old cells with new twists. Acta Histochemica, 110(3):182–195, 2008.

M. Nelson and J. Rinzel. The Hodgkin-Huxley model. In Bower, J. M. and Beeman, eds., The Book of Genesis, pages 27–51, 1995.

B. Nordlund. Physical factors in angular localization. Acta Oto-Laryngologica, 54(1-6):75–93, 1962.

J. F. Olsen, E. I. Knudsen, and S. D. Esterly. Neural maps of interaural time and intensity differences in the optic tectum of the barn owl. Journal of Neuroscience, 9(7):2591–2605, 1989.

N. Ono and S. Ando. Sound source localization sensor with mimicking barn owls. In Proc. Transducers, volume 1, pages 1654–1657, 2001.

K. K. Osen. Cytoarchitecture of the cochlear nuclei in the cat. The Journal of Comparative Neurology, 136(4):453–84, 1969.

Department of Otorhinolaryngology. Auditory 2: Central mechanisms, 2002. University of Pennsylvania Health System.

F. Palmieri, M. Datum, A. Shah, and A. Moise. Learning binaural sound

Proceedings of the IEEE Seventeenth Annual Northeast Bioengineering Conference, pages 1314, 1991a.

localization through a neural network. In

F. Palmieri, M. Datum, A. Shah, and A. Moise. Sound localization with a neural network trained with the multiple extended Kalman algorithm. In

International Joint Conference on Neural Networks, IJCNN, Seattle,

volume 1, 1991b. T. J. Park, B. Grothe, G. D. Pollak, G. Schuller, and U. Koch. Neural delays shape selectivity to interaural intensity dierences in the lateral superior olive.

Journal of Neuroscience,

16(20):65546566, 1996.

T. J. Park, P. Monsivais, and G. D. Pollak. Processing of interaural intensity dierences in the LSO: Role of interaural threshold dierences.

Neurophysiology,

Journal of

77(6):28632878, 1997.

T. J. Park, A. Klug, M. Holinstat, and B. Grothe. Interaural level dierence processing in the lateral superior olive and the inferior colliculus.

of Neurophysiology,

Journal

92(1):289301, 2004.

M. G. Paulin. A method for analysing neural computation using receptive elds in state space.

Neural Networks,

J. L. Pena and M. Konishi. multiplication.

Science,

11(7-8):12191228, 1998.

Auditory spatial receptive elds created by

292(5515):249252, 2001.

D. Peruzzi, S. Sivaramakrishnan, and D. L. Oliver. types in brain slices of the inferior colliculus.

Identication of cell

Neuroscience,

101(2):403

416, 2000. J. P. Pster, D. Barber, and W. Gerstner. probabilistic point of view.

Optimal hebbian learning: A

Lecture Notes in Computer Science,

pages

9298, 2003. J. P. Pster, T. Toyoizumi, D. Barber, and W. Gerstner.

Optimal spike-

timing-dependent plasticity for precise action potential ring in supervised learning.

Neural Computation,

18(6):13181348, 2006.

S. Poljak. The connections of the acoustic nerve.

177

J. Anat,

60:465469, 1926.

G. D. Pollak, R. M. Burger, T. J. Park, A. Klug, and E. E. Bauer. Roles of inhibition for transforming binaural properties in the brainstem auditory system.

Hearing Research,

168(1-2):6078, 2002.

F. Ponulak. ReSuMe - New supervised learning method for spiking neural

Technical Report, Institute of Control and Information Engineering, Poznan University of Technology. Available at http://d1. cie. put. poznan. pl/fp, 2005. networks.

F. Ponulak. ReSuMe - Proof of convergence. Technical report, Institute of Control and Information Engineering, Poznan University of Technology, Poland, 2006. F. Ponulak and A. Kasinski. Generalization properties of snn trained with resume method. In

Euro. Symp. on Articial Neural Networks,

F. Ponulak and A. Kasi«ski.

ReSuMe learning method for spiking neural

networks dedicated to neuroprostheses control. In

Symposium,

2006.

Proc. of EPFL LATSIS

pages 119120, 2006.

T. M. Poulsen and R. K. Moore. Sound localization through evolutionary learning applied to spiking neural networks. In

dations of Computational Intelligence,FOCI, R.

Pujol,

S.

Blatrix,

and

T.

IEEE Symposium on Foun-

pages 350356, 2007.

Pujol.

Promenade

round

the

http://www. iurc. montp. inserm. fr/cric/audition/english/start. htm. cochlea.

1999.

URL

L. Rayleigh. On our perception of the direction of a source of sound.

ceedings of the Musical Assoc.,

Pro-

2nd Sess.:7584, 1875-1876.

L. Rayleigh. On our perception of sound direction.

Philos. Mag, 13:214232,

1907. P. D. Roberts and C. C. Bell. Spike timing dependent synaptic plasticity in biological systems. R. Rojas.

Biological Cybernetics,

87(5):392403, 2002.

Neural Networks: A Systematic Introduction,

chapter 13. The

Hopeld Model, pages 337371. Springer-Verlag, 1996. F. Rosenblatt.

The perceptron -A perceiving and recognizing automaton

(Technical Report 85-460-1).

Cornell Aeronautical Laboratory, 178

1957.

J. Rubin, D. D. Lee, and H. Sompolinsky. Equilibrium properties of temporally asymmetric Hebbian plasticity.

Physical Review Letters,

86(2):

364367, 2001. B. Ruf and M. Schmitt. Learning temporally encoded patterns in networks of spiking neurons. F. Rumsey.

Neural Processing Letters,

Spatial Audio.

5(1):918, 1997.

Focal Pr, 2001.

D. K. Ryugo and T. N. Parks. Primary innervation of the avian and mammalian cochlear nucleus. Brain Research Bulletin, 60(5-6):435–456, 2003.

D. K. Ryugo, T. Pongstaporn, D. M. Huchton, and J. K. Niparko. Ultrastructural analysis of primary endings in deaf white cats: Morphologic alterations in endbulbs of Held. The Journal of Comparative Neurology, 385:230–244, 1997.

C. Schauer and H. M. Gross. Model and application of a binaural 360° sound localization system. In Proceedings of the International Joint Conference on Neural Networks, IJCNN, volume 2, 2001.

C. Schauer and H. M. Gross. A computational model of early auditory-visual integration. Lecture Notes In Computer Science, 2781:362–369, 2003.

C. Schauer, T. Zahn, P. Paschke, and H. M. Gross. Binaural sound localization in an artificial neural network. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 2, 2000.

H. S. Seung. Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40(6):1063–1073, 2003.

H. Sheikhzadeh and L. Deng. A layered neural network interfaced with a cochlear model for the study of speech encoding in the auditory system. Computer Speech and Language, 13:39–64, 1999.

R. Z. Shi and T. K. Horiuchi. A VLSI model of the bat dorsal nucleus of the lateral lemniscus for azimuthal echolocation. In IEEE International Symposium on Circuits and Systems, ISCAS, pages 4217–4220, 2005.

A. P. Shon and R. P. N. Rao. Temporal sequence learning with dynamic synapses. Technical report, University of Washington, 2002.

L. S. Smith. Using depressing synapses for phase locked auditory onset detection. Lecture Notes In Computer Science, 2130:1103–1108, 2001.

P. H. Smith, P. X. Joris, and T. C. T. Yin. Projections of physiologically characterized spherical bushy cell axons from the cochlear nucleus of the cat: Evidence for delay lines to the medial superior olive. The Journal of Comparative Neurology, 331(2):245–260, 1993.

P. H. Smith, P. X. Joris, and T. C. T. Yin. Anatomy and physiology of principal cells of the medial nucleus of the trapezoid body (MNTB) of the cat. Journal of Neurophysiology, 79(6):3127–3142, 1998.

A. Solodovnikov and M. C. Reed. Robustness of a neural network model for differencing. Journal of Computational Neuroscience, 11(2):165–173, 2001.

S. Song, K. D. Miller, and L. F. Abbott. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3:919–926, 2000.

J. Sougne. A learning algorithm for synfire chains. In Connectionist models of learning, development and evolution: Proceedings of the Sixth Neural Computation and Psychology Workshop, Liège, Belgium, 16-18 September, page 23. Springer Verlag, 2001.

T. L. Stedman. The American Heritage Stedman's Medical Dictionary. Houghton Mifflin, 2004.

Proceedings of the Royal Society of London. Series B, Biological

Sciences,

167(1006):6486, 1967.

Computational auditory scene analysis: Principles, algorithms, and applications, chapter 5, pages 147

R. M. Stern, G. J. Brown, and D. Wang.

178. Wiley-IEEE Press, 2006. N. Sugie, J. Huang, and N. Ohnishi. Localizing sound source by incorporating biological auditory mechanism. In

Neural Networks,

pages 243250, 1988.

180

IEEE International Conference on

S. P. Thompson. On the function of the two ears in the perception of space.

Philos Mag,

13:406416, 1882.

D. J. Tollin. The development of the acoustical cues to sound localization in cats.

Assoc. Res. Otol.,

27:161, 2004.

D. J. Tollin. The lateral superior olive: A functional role in sound source localization. D. J. Tollin.

The Neuroscientist,

9(2):127, 2003.

Encoding of interaural level dierences for sound localization,

volume 3, pages 631654. Academic Press, 2008. D. J. Tollin and K. Koka. Postnatal development of sound pressure transformations by the head and pinnae of the cat: Monaural characteristics.

The Journal of the Acoustical Society of America,

125(2):980, 2009.

D. J. Tollin and T. C. T. Yin. The coding of spatial location by single units in the lateral superior olive of the cat. II. The determinants of spatial receptive elds in azimuth.

Journal of Neuroscience, 22:14681479, 2002a.

D. J. Tollin and T. C. T. Yin. The coding of spatial location by single units in the lateral superior olive of the cat. I. Spatial receptive elds in azimuth.

Journal of Neuroscience,

22(4):1454, 2002b.

D. J. Tollin, K. Koka, and J. J. Tsai. Interaural level dierence discrimination thresholds for single neurons in the lateral superior olive.

Neuroscience,

Journal of

28(19):4848, 2008.

T. P. Trappenberg.

Fundamentals of computational neuroscience.

Oxford

University Press, 2002. M. Tsodyks, K. Pawelzik, and H. Markram. Neural networks with dynamic synapses.

Neural Computation,

10(4):821835, 1998.

R. Urbanczik and W. Senn. Reinforcement learning in populations of spiking neurons.

Nature Neuroscience,

12(3):250252, 2009.

J. M. Valin, F. Michaud, J. Rouat, and D. Letourneau. Robust sound source

IEEE/RSJ International Conference on Intelligent Robots and Systems, (IROS), vollocalization using a microphone array on a mobile robot. In

ume 2, pages 12281233, 2003.

181

P. W. J. van Hengel.

Emissions from Cochlear Modelling.

Rijksuniversiteit

Groningen, 1996. A.

Van

Schaik.

The

electronic

auditory

pathway.

2003.

URL

http://www.eelab.usyd.edu.au/andre/eap/. A. Van Schaik, E. Fragnière, and E. Vittoz. An analogue electronic model of ventral cochlear nucleus neurons.

In

Proceedings of MicroNeuro,

vol-

ume 96, pages 5259, 1996. J.

Virtamo.

Poisson

process,

2005.

URL

http://www.netlab.tkk.fi/opetus/s383143/kalvot/. T. P. Vogels, K. Rajan, and L. F. Abbott. Neural network dynamics.

Reviews in Neuroscience,

Annual

28:357, 2005.

R. J. Vogelstein, F. Tenore, R. Philipp, M. S. Adlerstein, D. H. Goldberg, and G. Cauwenberghs. Spike timing-dependent plasticity in the address domain.

Advances in Neural Information Processing Systems,

15:914,

2003. K. Voutsas and J. Adamy. A biologically inspired spiking neural network for sound source lateralization.

IEEE Transactions on Neural Networks,

18

(6):17851799, 2007. J. Vreeken. Spiking neural networks, an introduction. Technical report, Institute for Information and Computing Sciences, Utrecht University, 2002. J. A. Wall, L. J. McDaid, L. P. Maguire, and T. M. McGinnity. A spiking neural network implementation of sound localisation. In

Irish Signals and Systems,

Proc. of the IET

pages 1923, 2007.

J. A. Wall, L. J. McDaid, L. P. Maguire, and T. M. McGinnity.

Spiking

neuron models of the medial and lateral superior olive for sound localisa-

IEEE International Joint Conference on Neural Networks, IJCNN (IEEE World Congress on Computational Intelligence), pages 26412647, tion. In

2008. J. A. Wall, L. J. McDaid, L. P. Maguire, and T. M. McGinnity.

Spiking

neural network model of the lateral superior olive for sound localisation. Submitted to: IEEE Transactions on Neural Networks, 2009.

182

J. Waters, A. Schaefer, and B. Sakmann. Backpropagating action potentials in neurones: measurement, mechanisms and potential functions.

in biophysics and molecular biology, D. Weedman Molavi.

Progress

87(1):145170, 2005.

Auditory and vestibular systems.

1997.

URL

http://thalamus.wustl.edu/course/audvest.html. L. A. Werner. Anatomy of the inner ear. University of Washington, 2007. E. G. Wever.

Theory of hearing.

Wiley New York, 1949.

V. Willert, J. Eggert, J. Adamy, R. Stahl, and E. Korner. A probabilistic

IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(5):982, 2006. model for binaural sound localization.

Encoding and enhancing acoustic information at the rst stages of the auditory system. PhD thesis, University of Pennsylvania,

J. H. Wittig Jr.

2004. X. Xie and H. S. Seung. Learning in neural networks by reinforcement of irregular spiking.

Physical Review E,

69(4):41909, 2004.

T. C. T. Yin. Neural mechanisms of encoding binaural localization cues in the auditory brainstem.

Pathway,

Integrative Functions in the Mammalian Auditory

pages 99159, 2002.

T. C. T. Yin and J. C. Chan. Interaural time sensitivity in medial superior olive of cat.

J. Neurophysiol.,

E. D. Young and K. A. Davis.

64:465488, 1990.

Integrative Functions in The Mammalian

Auditory Pathway, chapter 5. Circuitry and function of the dorsal cochlear nucleus, pages 160206. New York: Springer-Verlag, 2001. M. Zacksenhouse, D. H. Johnson, J. Williams, and C. Tsuchitani. neuron modeling of LSO unit responses.

Single-

Journal of Neurophysiology,

79

(6):30983110, 1998. J. C. Zella, J. F. Brugge, and J. W. H. Schnupp. Passive eye displacement alters auditory spatial receptive elds of cat superior colliculus neurons.

Nature Neuroscience,

4:11671168, 2001.

183

Z.H. Zhou. Sound localization and virtual auditory space.

University of Toronto,

Project report of

2002.

M. S. A. Zilany and I. C. Bruce. Modeling auditory-nerve responses for high sound pressure levels in the normal and impaired auditory periphery.

Journal of the Acoustical Society of America,

The

120:1446, 2006.

M. S. A. Zilany and I. C. Bruce. Representation of the vowel/ε/in normal and impaired auditory nerve bers: Model predictions of responses in cats.

The Journal of the Acoustical Society of America,

122:402, 2007.

D. Zipser, B. Kehoe, G. Littlewort, and J. Fuster. A spiking network model of short-term active memory.

Journal of Neuroscience,

1993.

184

13(8):34063420,
