
EFFECTS OF INDIVIDUALISED HEADPHONE CORRECTION ON FRONT/BACK DISCRIMINATION OF VIRTUAL SOUND SOURCES DISPLAYED USING INDIVIDUALISED HEAD RELATED TRANSFER FUNCTIONS

ABHISHEK GURU, WILLIAM L MARTENS, AND DOHEON LEE
Faculty of Architecture, Design and Planning, The University of Sydney, Sydney, NSW, Australia
[email protected]

AES 40th International Conference, Tokyo, Japan, 2010 October 8–10

Individualised Head Related Transfer Functions (HRTFs) were used to process brief noise bursts for a 2-interval forced choice (2IFC) front/back discrimination of virtual sound source locations presented via two models of headphones, whose frequency responses could be made nearly flat for each of 21 listeners using individualised headphone correction filters. In order to remove virtual source timbre as a cue for front/back discrimination, the spectral centroid of sources processed using rearward HRTFs was manipulated so as to be more or less similar to that of sources processed using frontward HRTFs. As this manipulation reduced front/back discrimination to chance levels for 12 of the 21 listeners, the performance of the 9 listeners showing "good discrimination" was analysed separately. For these 9 listeners, virtual sources presented using individualised headphone correction filters supported significantly better front/back discrimination rates than did virtual sources presented without correction of the headphone responses.

1 INTRODUCTION

It is of interest to spatial audio creators to present sounds to listeners such that the perceived locations of those sounds are under relatively precise control. Uses of such spatial audio displays include gaming and virtual reality, avionics and air traffic control displays, and advanced telecommunications systems. A general consensus in the literature has been that use of individualised Head Related Transfer Functions (HRTFs) to position virtual sources provides some assurance that precise sound localization will result, but there is less agreement regarding the importance of headphone equalization based upon individual measurement of headphone responses. A primary goal of the current study was to determine whether such individualised correction of headphone responses could improve the performance of a spatial audio system using individual HRTFs, as determined through a simple front/back discrimination task. A second goal was to evaluate performance under conditions in which localization tasks were difficult enough to allow such an improvement to be observed in a laboratory study.

1.1 Difference between Sound Localisation and Apparent Source Position

In experimentation under laboratory conditions, such as this study, listeners are often presented with otherwise identical noise bursts processed using binaural technology that conveys these sounds to the ears so they seem to arrive from precisely defined locations. These conditions reduce the complexity of more ecologically valid sound localization situations to a simpler situation

that makes it possible for listeners to attain nearly perfect performance. For example, while holding constant the spectrum of a sound source processed by a number of an individual's measured HRTFs, that individual is often able to make reliable, correct judgements of the locations at which those HRTFs were measured. When asked whether the apparent source positions they experience match the positions at which the HRTFs were measured, there is typically poor agreement between apparent and actual positions for the troublesome regions directly in front of and behind the listener. Nevertheless, listeners are able to indicate sound source locations accurately and quickly, even with limited head movements, perhaps based upon their knowledge of how different sources sound when in those locations. This leads to a curious question regarding what acoustical features of a sound arriving from a particular location make it seem to be arriving from there. Wightman and Kistler [1] have referred to these features as the acoustical determinants of apparent position. While they have also termed these features 'acoustical localization cues', there is an important distinction between localization of a physical source and the perception of source position. In his seminal book on the topic of "Spatial Hearing," Blauert [2] discussed this fundamental misunderstanding between the physical location of a sound source and the psychological location of an associated auditory event, highlighting that much of the literature incorrectly uses the term 'sound localisation' to refer to judgements about auditory images associated with binaural synthesis, rather than physical locations of acoustical events.


In sound localization, we use a combination of past knowledge and our inferential capabilities to assign a particular physical position to an acoustic event. Apparent source position, on the other hand, lies within the realm of imagined auditory space, wherein the listener subjectively experiences the presence of the acoustic event in this conceived or pictured expanse. In other words, the former deals with physical actuality while the latter deals with a perceived image. While it may be worthwhile to see how well these two approaches match in their identification of sound source locations, the complications arising from the disconnectedness of the two domains pose several questions. It is therefore prudent, in the context of controlled listening experiments of this kind, to obtain subjective information about virtual sound source locations which can then be objectively compared with actual, designed, or modified ones. We do not know whether the human mind separately detects cues present in acoustical information (such as spectra) concerning the source itself from cues concerning the transmission path. The least we can do is analyse the changes that occur from one position to another by accurately measuring what is being heard when the same sound is conveyed from different positions. Furthermore, it must be carefully communicated to listeners undergoing subjective tests involving virtual sound sources (such as a headphone-based display) that they are not expected to guess source positions based on where they may imagine a sound to be, but that they are required to identify these positions based on where the sound is arriving from.

1.2 Background in Related Studies

Experiments by Iwaya et al [3] have shown that the error-rate in front-back localization reduces with an increase in the duration of the source stimulus. The same experiments also showed that restricting head movements increases this error-rate relative to conditions where participants were allowed free head movements. A more detailed mathematical approach to calculating ITD, and to how the effects of head rotation on ITD are used by listeners as a reliable predictor to resolve front-back confusions, can be seen in the work of Hill et al [4]. It is therefore reasonable to presume that when the sound emitted by the source is of an even shorter duration, say 100 milliseconds, there may not be enough time for the listener to move the head after becoming aware of the sound, making it nearly impossible to localise such a sound source accurately. Nevertheless, even within the median plane, there are large variations between positions in the frequency-domain transfer function that defines the

transmission of sound from a source to the ear of the listener. Blauert [2] presumes that because of the orientation and shell-like structure of the pinnae, high frequencies tend to be more attenuated for sources in the rear than for sources in the front. Of course, the reality of spectral differences between identical sounds arriving from front and back is far more complex than this simplistic observation. A listener trained to recognise these timbral variations (for an identical short-duration sound played sequentially at these different positions) finds it significantly easier to localise a sound source. 'Binaural Room Synthesis (BRS)' is the general term for headphone playback of audio content that has been designed with natural free-field spatial hearing in mind. The content is developed for such a playback method by tweaking parameters such as Inter-aural Level Difference (ILD), Inter-aural Time Difference (ITD), and Inter-aural Cross Correlation (IACC) to simulate auditory images representative of source positions across a variety of azimuthal angles relative to the head, at varying distances from the head, and with different image sizes for a given direction. Wightman and Kistler have studied this technique in great detail and have tabulated the various cues that enable acoustic localization as follows.

           Monaural                  Binaural
Temporal   Monaural Phase            Interaural Time Difference (ITD)
Spectral   Overall Level,            Interaural Level Differences (ILD),
           Monaural Spectral Cues    Binaural Spectral Differences

Table 1: Potential acoustic localisation cues (taken from Wightman & Kistler [1]).

Apart from the cues enumerated in Table 1, Wightman and Kistler found that the front/back confusion rate (the percentage of responses that incorrectly identify a front-intended source as back-originated, and vice versa) increased with source spectrum scrambling, with attenuation of high-frequency energy, and when listening through spectral cues unfamiliar to the listener. Another study, by J.F. Burger in 1958, compared the performance of individuals in front-back discrimination of narrowband sounds under four different conditions: head free and ears uncovered; head clamped and ears uncovered; head clamped and one ear covered and masked (by wide-band random noise); and head free but with both ears covered (by a headphone). It was found that the most important determinant of a correct judgement of position was the movement of the head [12].


From as early as 1931, research has examined the effects of head movements on auditory localization. P.T. Young [7] performed an experiment with methods that may seem antiquated today but that convey the essence of obtaining subjective responses identifying sound source position without the advantage of head movements. One interesting result he observed was that, irrespective of the original source position, all listeners felt that the phantom source was located in the back. A very good review of research in the area of binaural reproduction systems was undertaken by Toole (1991) [8]. He stated that, at the time of writing, headphones had limited consumer appeal, so binaural reproduction had not achieved mainstream adoption, in spite of an improvement in clarity and spaciousness. He also noted that confusions between virtual sources originating from the front and back were common, and that front-oriented sources especially were frequently observed to fold over into the rear hemisphere. Consequently, listeners find it very difficult to achieve convincingly externalised localisations in the front hemisphere, and often localise very close to or within the head. Another drawback is that variations in the acoustical coupling of the headphone to the ear are unavoidable in regular use of headphones. Conflicts between the senses of sight and hearing can also cause confusions in the localisation of sound sources. Dynamic localisation of sound sources is also something not usually provided in binaural systems, which results in the unnatural experience of the whole auditory world rotating with the head. An early paper by Wenzel et al [13] also observes front-back confusions, defined therein as judgements indicating that a source in the front hemisphere, usually near the median plane, is perceived by the listener to be in the rear hemisphere. Complex monaural spectral cues play an important role in offering listeners information about sound source position, and any degradation of these cues increases front-back confusions significantly. While comparing localization performance between free-field and headphone stimuli, they found that front-back confusion increased by about 6 to 11% in the case of headphone-presented sounds. They also found that differences in measured HRTFs between listeners led to classifying these listeners as "good" and "poor" localizers, the former possessing bodily (primarily outer-ear) features that enabled better localization, for example the presence of elevation-dependent acoustical features in the 5 to 10 kHz region of a subject's HRTFs. The main focus of the study, though, was to see how non-individualised transfer functions could be used (more feasible for large populations) to compare free-field and virtual free-field localisation by inexperienced listeners. They came to the intuitively plausible

conclusion that nonindividualised HRTFs generate most of the interaural cues but misrepresent the spectral constructs associated with particular directions that listeners are naturally used to. Consequently, nonindividualised transforms primarily result in an increased rate of front-back confusions. This further justifies our intention to use individualised HRTFs for our own study. Møller et al performed an interesting study to explore the effect of errors due to non-individualised recordings on localisation performance in binaural reproduction [14]. To do this, they compared, for a set of eight listeners, results from the following four experimental conditions: subjects listening in real life; subjects listening to recordings made with their own ears; subjects listening to a mixture of recordings made with their own ears as well as others' ears; and subjects listening to recordings from one other subject. They showed that, overall, localisation errors occur more with nonindividual recordings than with individual ones and, pertinent to our own study, even more so in the median plane. Regarding front-back confusions, they showed that these errors were not as frequent with individual recordings: frontal sound sources were perceived as behind the subject in 7% of the cases, and sources behind the subject were perceived as in front in only 2% of the cases. They commented that such errors are observed in real-life listening as well and therefore need not originate from intricacies of the binaural technique. With nonindividual recordings, however, error rates in front-back reversals were much higher, at about 33%. A final observation of interest concerned the ability of listeners hearing with another subject's ears to gradually adapt to the differences between the ears, but this was not explored in detail as the results were not statistically significant. In a subsequent study [18], the authors found that for 20 subjects, localization errors were reduced when using individualized headphone equalization as compared to a common equalization. Ryan and Furlong in 1995 [16] investigated the effects of headphone placement in binaural reproduction. Their study primarily reports that an anomaly exists between subjective listening to headphone-based reproduction and the objective accuracy of the reproduced sounds, which they ascribe to four factors: absence of room/reflected sounds; absence of cues provided by head movements; absence of visual cues; and differences between the artificial head and the subject's ears. Pralong and Carlile [17] showed that the use of nonindividualized headphone transfer functions (HPTFs) is likely to result in a greater disruption of monaural spectral cues compared to binaural spectral differences, and that even nonindividualised HRTFs can be reconstructed


accurately at the listener's ears if the listener's own HPTFs are deconvolved from the recording.

1.3 Aims of This Study

The initial objective of the research study was to test a listener's ability to discriminate which direction a sound was arriving from, namely front or back, while listening over headphones to HRTF-processed sound stimuli. The processing used Head Related Transfer Functions (HRTFs) measured for each individual under anechoic conditions. Artificially generated dry noise signals were then convolved with these HRTFs and presented with and without the use of correction filters based upon individually measured Headphone Transfer Functions (HPTFs). This processing by a correction filter served potentially to reduce adverse effects of the headphone response, which is unavoidably introduced while listening through headphones and is effectively superimposed upon the HRTF. Sank proposed in 1978 [5] a personal equalization procedure (PEP), by which a pair of headphones is made to have a subjectively uniform frequency response as perceived by a particular individual. The result of the PEP should be reproduced signals that closely match what a listener would have heard were loudspeakers playing these signals in a room. Martens [6] also clearly describes the need for a correction filter for headphone playback where spatial sound reproduction accuracy is an important goal. The technique involves an inversion of the transfer function from the headphone speaker to the ear. Using this technique as a basis, what we have ventured to achieve here is to make the headphones transparent (in audio terms) so as to provide the listener with the auditory illusion closest to that of listening to the actual speakers. In a similar study conducted recently, it was found that certain listeners, while listening over headphones, classified all sounds as originating from the back, even those processed such that they should have appeared to come from the front. Hill et al [4] have pointed out that it is more common for listeners to localize a sound to the front when it should be in the rear, whereas Young [7] and Toole [8] have found that it is generally harder to produce realistic frontal virtual sources and that most listeners tend to perceive sources to be behind them. To eliminate this bias, it was decided that listeners would be presented pair-wise stimuli, either front followed by back or vice versa, and would be asked to respond whether the overall motion of the second sound relative to the first within the pair was frontward or backward.

1.3.1 Transfer Function Expressions

The final signal presented to the listener ($Y_E$) over the headphones therefore consists of the source convolved with the measured HRTF and filtered by the derived correction filter. For the sake of simplicity, we use the perfect correction filter, which can be expressed as a function of the following transfer functions:

$$\begin{aligned}
Y_E &= X \cdot YM_{speaker} \cdot PCF \cdot HP \cdot CN \\
    &= X \cdot RAW \cdot PCF \cdot HP \cdot CN \\
    &= X \cdot (A \cdot SPKR \cdot HRTF \cdot MIC) \cdot PCF \cdot HP \cdot CN \\
    &= X \cdot (A \cdot SPKR \cdot HRTF \cdot MIC) \cdot \frac{1}{A \cdot HP \cdot MIC} \cdot HP \cdot CN \\
    &= X \cdot SPKR \cdot HRTF \cdot CN
\end{aligned} \qquad (1)$$

where the abbreviations denote the following transfer functions:

A = analytic signal used for measurement
X = input signal to be spatialised
Y_E = signal received at the ear drum
YM_speaker = RAW = speaker signal at the microphone
PCF = perfect correction filter
CN = ear canal
HP = headphone
SPKR = loudspeaker
MIC = microphone
HRTF = naturally occurring transformation from an ideal source to an ideal receiver (here, at the blocked ear canal entrance) positioned in the free field, commonly denoted the head-related transfer function

Note that if we were to play the same input signal (X) on the loudspeaker (SPKR) used for the measurement of RAW for a given listener (with ear canal transfer function CN), we would end up with exactly the same combination of transfer functions as above. This suggests that using the correction filter to play the HRTF-processed (here, RAW-processed) input noise over headphones produces a virtual source image that matches the real source image of a speaker at a given position in front of or behind the listener. Apart from the correction filter, there has also been great advantage in using the same analytic signal (an impulse response obtained by deconvolution of a logarithmic sine sweep [9]) and the same microphone (with no changes to sensitivity) for both the speaker and headphone measurements (explained in detail in Sections 2.1.1 and 2.1.2), as they cancel perfectly in the expansion of terms above, producing the desired result. Also, the transfer function of the ear canal (CN) is unknown to the designer of the experiment and is included here to illustrate its presence in the natural hearing path.
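As a quick numerical illustration of the cancellation in Equation (1), the following MATLAB fragment multiplies random stand-in transfer functions in the frequency domain. All names and lengths here are placeholders for illustration, not measured data:

```matlab
% Numerical check of Eq. (1): PCF = 1/(A.HP.MIC) cancels the A, HP and MIC
% terms, leaving X.SPKR.HRTF.CN. All transfer functions are random stand-ins.
n = 1024;
tf = @() fft(randn(64,1), n);                 % random 64-tap FIR as a transfer function
A = tf(); HP = tf(); MIC = tf();
SPKR = tf(); HRTF = tf(); CN = tf();
X = fft(randn(n,1));
PCF = 1 ./ (A .* HP .* MIC);
YE  = X .* (A .* SPKR .* HRTF .* MIC) .* PCF .* HP .* CN;
disp(max(abs(YE - X .* SPKR .* HRTF .* CN)))  % ~0, up to round-off error
```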


2 METHOD

Two alternative experiments could have been performed, asking the listeners two fundamentally different questions. For example, they could have been asked "Where do you think the sound is coming from?" or "Where is the image of the sound as perceived by you?" The first question is a purely objective one, clarifying the spontaneously identified location of the source, while the second involves a perceptual qualification based on mental processes and the interaction between the brain and the auditory system. We chose the first approach for the reasons explained in Section 1.1. It also gives us the opportunity to validate the ability of the system to produce convincing and realistic reproductions that match an actual loudspeaker at a given location. The tests were performed on 21 listeners, all students enrolled in an audio and acoustics masters program at the University of Sydney. For both steps described, information regarding the nature of the test and the requirements of the listeners was read out to them before the test. These instructions were identical for all listeners so as not to bias a particular individual's responses through lesser or greater knowledge of the experiment. All initial measurements of impulse responses were made using Farina's logarithmic sine sweep technique [9].
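For concreteness, a minimal sketch of the log-sweep measurement and deconvolution [9] follows; the sweep length, band limits, and variable names are assumptions, not the parameters used in this study:

```matlab
% Minimal sketch of Farina's log-sweep IR measurement [9] (assumed parameters).
fs = 44100; T = 5; f1 = 20; f2 = 20000;   % 5 s sweep, 20 Hz - 20 kHz
t  = (0:1/fs:T).';
L  = T / log(f2/f1);                      % sweep rate constant
sweep = sin(2*pi*f1*L*(exp(t/L) - 1));    % exponential (log) sine sweep
invf  = flipud(sweep .* exp(-t/L));       % inverse filter: time-reversed sweep
                                          % with -6 dB/octave amplitude envelope
% After playing `sweep` through the system and recording the response `y`,
% the impulse response is recovered by convolution with the inverse filter:
% ir = conv(y, invf);   % harmonic distortion products land at negative delays
```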

2.1 Measurements of Transfer Functions

2.1.1 Headphones to Ear Canal Entrance

Initially, objective measurements were carried out to estimate the headphone transfer function (HPTF) from the headphone impulse response (HPIR) defining the transmission of sound from the headphone speaker to a microphone at the entrance of the ear canal. The headphones used were Sennheiser HD600 and AKG K701 reference headphones. DPA 4060 miniature microphones were embedded in a layer of Anti-Nois™ ear plug wax and positioned approximately at the centre of the entrance to the ear canal. The plugs not only provided a convenient holding mechanism for the microphone, but also blocked the ear canal, thereby eliminating unwanted ear canal resonances. This microphone position, referred to as the blocked-meatus position, is advantageous as a testing position from the perspective of headphone design [11]. The measured transfer function consists of the transfer function of headphone to in-ear microphone (HP) and the microphone transfer function (MIC) for a given analytic signal used for measurement (A). The measured signal arriving at the microphone (YM) can therefore be described by the following expression involving the aforementioned transfer functions (note: acronyms used are consistent with Begault [10]):

$$YM_{headphone} \; (= HPTF) = A \cdot HP \cdot MIC \qquad (2)$$

The headphone was re-seated 10 times to account for broadband variations in the frequency response, which were typically observed at middle to high frequencies. By averaging across the 10 different cases, these variations are minimized, as described in the following sections.

2.1.2 Speakers (Front and Back) to Ear Canal Entrance

Next, retaining the miniature microphones at the same location used in the previous case, measurements were made to obtain transfer functions (RAW), based on head-related impulse responses, describing the sound transmission from a Bose Acoustimass cube speaker located at seating-position height (1.2 m) at a distance of 2 metres from the ear. It must be noted that this transfer function (RAW) consists of the loudspeaker transfer function (SPKR), the naturally occurring HRTF in a free field (HRTF), and the microphone transfer function (MIC). The measured signal arriving at the microphone (YM) in this case, for the same analytic signal used for measurement (A, as in the case of the headphone), expands as below:

$$YM_{speaker} = RAW = A \cdot SPKR \cdot HRTF \cdot MIC \qquad (3)$$

These measurements were made for the front and back positions, i.e. with the listener facing the speaker and facing away from the speaker, corresponding to 0° and 180° azimuth angles respectively. These measurements were also repeated 10 times to account for broadband variations in the transfer function. In this case, the listener was asked to keep the same position, retaining the 2 metre distance from speaker to ear. All measurements were made in an anechoic room.

2.1.3 Derivation of the Correction Filters

The correction filter (CF) for each headphone is the individualized filter designed to counteract the effect of the headphone transfer function (HPTF). Since the HPTF is an inherent and therefore unavoidable spectral transformation introduced by playback over headphones, the CF is applied in pre-processing to neutralise this bias, producing in effect a frequency response at the entrance to the ear canal that is nearly identical to that of the source signal. This is often not achieved, for the following reasons. Firstly, the HPTF changes with re-seating of the headphones, with minor adjustments introducing significant changes in the high frequency range. Secondly, the CF is almost never ideal, i.e. a true inversion of the HPTF for a given headphone-ear couple, since it represents a filter arrived at by averaging across multiple re-seatings.


Lastly, although a minor point, the HPIRs used to derive the CF can, for practical reasons, only be measured using the blocked-meatus in-ear microphone arrangement, which can never exactly represent the unknown HPIR and HPTF of sound transmitted from the headphone speaker to the entrance of an open ear canal. Since this correction filter is, in its most ideal sense, an inverse of the measured HPTF, which has previously been described as a combination of the transfer functions of source signal, headphone, and microphone, the transfer function expression for the perfect correction filter (PCF) of a given headphone for a given individual can be expressed as follows, where HP is the transfer function of headphone to in-ear microphone, MIC is the microphone transfer function, and A is a given analytic signal used for measurement:

$$PCF_{headphone} = \frac{1}{YM_{headphone}} = \frac{1}{HPTF} = \frac{1}{A \cdot HP \cdot MIC} \qquad (4)$$

Although the perfect correction filter described above was used in the introduction (Section 1.3.1) to express, in equations, what we intended to achieve in the subjective listening tests, we did not use this ideal correction filter exactly. This is because the measured headphone transfer function varies to a certain extent with re-seating of the headphones, and the transfer function obtained during measurement (and subsequently used as a correction filter when pre-processing the HRTF-processed signals) is unlikely to match the transfer function present during the subjective listening. It is in fact most likely that a different headphone transfer function is present in the transmission chain from source to listener, and our correction filter obviously cannot neutralise this effect, as it inverts an earlier, different headphone transfer function. The best we can do at this stage is to adequately capture broadband features of the headphone transfer function summarised across the 10 different measurements made. The following equation summarizes this step simplistically, where AZ_CF is an all-zero (FIR) correction filter and AP_YM is an all-pole (IIR) fit to the measured response:

$$AZ\_CF_{headphone} = \frac{1}{AP\_YM_{headphone}} \qquad (5)$$

The following steps were employed (see Figures 2 and 3) in the detailed derivation of the average CF from the ten measured HPTFs. First, using 24-pole linear predictive coding (LPC, see Roads [15]), infinite impulse response (IIR) pole coefficients were obtained to represent each measured time-domain impulse response (HPIR) using the 'lpc.m' MATLAB routine. These pole coefficients were then represented as a magnitude transfer function in the frequency domain using the 'freqz.m' MATLAB routine, yielding a smoothed transfer function for each of the measurements. Next, for each of these smoothed transfer functions, the underlying 24 pole coefficients used to represent them were treated as zero coefficients in order to invert the derived filter. The 10 resulting impulse responses were averaged in the time domain to obtain a single impulse response. Lastly, the 24-pole LPC-fit IIR coefficients of this mean impulse response were used to derive the FIR filter describing the CF for correcting the HPTF.

We now look at Figure 3, where we consider the derivation of a correction filter for a single headphone specific to a single listener. Of the ten measured HPTFs for Listener 1 on Headphone 1, the first (case 1 of 10) is shown in Figure 3(a). In the time domain, a 24-pole LPC fit is made to this impulse response, and the frequency-domain representation of the IIR filter with these 24 pole coefficients is obtained. The same is done for all ten cases to give ten such derived impulse responses. These 10 derived impulse responses are averaged in the time domain, and the frequency-domain representation of the mean derived LPC-fit IR is shown in Figure 3(b). By averaging this impulse response across left and right ears, and then inverting it about its mean value (between 200 Hz and 1 kHz), we get a single monaural correction filter (CF), shown in Figure 3(c). The effect of using this filter on the original headphone transfer function is shown in Figure 3(d) (i.e. 3(d) = 3(a) + 3(c)).

Going from Figure 3(a) to Figure 3(b), we see that the linear predictive coding technique captures most of the broadband information in the headphone transfer function without the roughness or narrowband detail of the measured response. Averaging fits from ten such cases of the original transfer function, as in Figure 3(b), captures additional nuances of the headphone transfer function compared with a single case. The correction filter derived from this average, Figure 3(c), when applied to the first headphone transfer function, should ideally result in a flat transfer function, as desired per Equation (4), but we obtain instead the transfer function seen in Figure 3(d). By comparing Figures 3(a) and 3(d), we see that there is some advantage to using a correction filter, as the broadband variation of about +10 dB to −20 dB is compressed to a range of close to ±5 dB for most of the range (up to 10 kHz).
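As a rough illustration, the following MATLAB sketch mirrors the steps described for Figure 3; the data layout (a matrix `hpirs` holding the ten measured HPIRs) is an assumption, not the authors' code, and 'lpc.m' requires the Signal Processing Toolbox:

```matlab
% Sketch of the averaged correction-filter derivation (assumed data layout:
% hpirs is an N-by-10 matrix holding the ten measured headphone IRs).
order = 24;  N = size(hpirs, 1);  nMeas = size(hpirs, 2);
smoothIR = zeros(N, nMeas);
for k = 1:nMeas
    a = lpc(hpirs(:, k), order);          % 24-pole all-pole fit to the k-th HPIR
    smoothIR(:, k) = impz(1, a, N);       % smoothed impulse response of that fit
end
meanIR = mean(smoothIR, 2);               % time-domain average over re-seatings
cf = lpc(meanIR, order);                  % pole coefficients of the mean fit,
                                          % reused as zero (FIR) coefficients,
                                          % i.e. the inverse: the correction filter
[H, w] = freqz(cf, 1, 4096, 44100);       % inspect the CF magnitude response
plot(w, 20*log10(abs(H)));
```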


2.2 Testing of Directionally Oriented Stimuli

It was decided to present a pair of noise bursts in sequence rather than a single burst, as it was generally noticed that when a single noise burst was presented, listeners tended to guess only 'back' irrespective of which HRTF was used. By introducing a pair, listeners were forced to identify a net displacement in direction from one sound to the next from two choices, frontward or backward. This is commonly known as a 2IFC (2-Interval Forced Choice) task, 'interval' here denoting two sequentially presented sound stimuli, as opposed to simultaneously presented stimuli. Each listener in one session was asked to listen to 24 pairs of noise bursts on each headphone twice and to respond 'F' or 'B' each time. Each of the 24 pairs of noise bursts consisted of one front-HRTF and one back-HRTF processed noise burst, played as back-front (defined as frontward, overall) or front-back (defined as backward, overall). This overall simulated displacement of the auditory image, obtained by playback of noise bursts synthesised via actual directionally measured impulse response data (see Section 2.1.2), was termed the factor MEASURED (subsequently submitted to hierarchical log-linear multi-way analysis as a factor represented as M, having two levels corresponding to Frontward and Backward).

The final HRTF selected from the set of ten measurements for a given speaker direction for a given listener was based on two criteria: first, the inter-aural cross correlation coefficient (IACC) of the given measurement had to be the highest; and second, the inter-aural time delay (ITD) between the peak arrivals at the left and right ears had to be the least (ideally zero); a computational sketch of these criteria is given below.

Three noise inputs were used for the back-HRTF processed signals, corresponding to flat (white), brighter, and brightest spectra, which resulted in noise stimuli with spectral centroids that were respectively lowest, lower, or near, as compared to the spectral centroid of the front-HRTF processed noise. This choice of source (for back-HRTF processed noise only) was termed the factor SOURCE (subsequently submitted to hierarchical log-linear multi-way analysis as a factor represented as S, having three levels of spectral slope). This modification of the source is explained in greater detail in Section 2.2.2.

Playback of the noise bursts used either normal or interchanged left and right channels, and this switching was termed the factor INTERCHANGE (subsequently submitted to hierarchical log-linear multi-way analysis as a factor represented as I, having two levels corresponding to normal or switched playback). This was done to prevent listeners from responding purely on the basis of inter-aural differences.
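The two HRTF-selection criteria stated above can be computed directly from a measured binaural impulse-response pair; the sketch below shows one way, where the variable names `hL`, `hR` and the ±1 ms lag window are assumptions, not the authors' implementation:

```matlab
% Sketch: scoring one measurement by IACC and ITD (the selection criteria above).
% hL, hR: left/right blocked-meatus impulse responses; fs: sample rate.
maxLag = round(1e-3 * fs);                    % search ITD within +/-1 ms (assumed)
[c, lags] = xcorr(hL, hR, maxLag, 'coeff');   % normalised cross-correlation
[iacc, i] = max(abs(c));                      % IACC: want this highest
itd = lags(i) / fs;                           % ITD in seconds: want |itd| least
```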

Inputs were in turn either filtered by a monaural Correction Filter (CF, described in detail in Section 2.1.3) for the headphone or not filtered (presented as is), and this choice was termed the factor EQUALISATION (subsequently submitted to hierarchical log-linear multi-way analysis as a factor represented as E, having two levels corresponding to CF used or not used). The headphone model used for listening was an additional factor named HEADPHONE (represented as H, having two levels corresponding to models Hdph1 and Hdph2). Listeners were asked, for each pair of noise bursts heard, to judge the overall direction of motion by pressing the key F (frontward, overall) or B (backward, overall), and this response was an additional factor named RESPONSE (represented as R, having two levels corresponding to a response of frontward or backward). A related factor introduced in the hierarchical log-linear multi-way analysis (discussed in detail in Section 4) was CORRECTNESS of response, represented by the letter C, having two levels corresponding to a correct or incorrect response. A response is defined as correct if it matches the order of stimuli presented: for example, responding F (Frontward) when the first stimulus was processed by a back-HRTF and the second by a front-HRTF (an overall intended frontward displacement), and responding B (Backward) when the order of stimuli was reversed.

The HEADPHONE and RESPONSE factors were used in analysis only, not as factors affecting the input presented to a given listener; hence the number of input pairs presented to each listener was 24 (= n(M)×n(S)×n(I)×n(E) = 2×3×2×2) in a session. The total number of input pairs presented to a listener was 96 (= 24 × 2 sessions per headphone × 2 headphones). Normalisation of the wave file data for each listener was done such that the resultant signal, after processing an input noise source signal by HRTF and CF, consisted of sample data between −1 and 1. All files were processed at a 44100 Hz sampling rate, and noise signals had a frequency range from 0 Hz to the Nyquist frequency of 22.05 kHz. The subjective listening experiments were performed on the same 21 listeners, and the data obtained were summarised across all listeners and compiled for hierarchical log-linear multi-way analysis using SPSS. The following table lists all the factors used.


Figure 2: Flow chart showing the steps involved in deriving the Headphone correction filter.

Figure 3: Example – detail of the steps in deriving the headphone correction filter specific to a single listener and headphone ((a), (b) & (d): left ear (solid) & right ear (dashed)). (a) Measured HPTF for Listener 1 on Headphone 1 (case 1 of 10). (b) LPC-fit HPTF (mean of cases 1–10). (c) Monaural correction filter (CF) used for Headphone 1 & Listener 1. (d) Resulting composite transfer function for Listener 1 & Headphone 1 (observed residual, case 1 of 10).


Factor         Letter   Levels
MEASURED       M        Frontward, Backward
SOURCE         S        Lowest, Lower, Near
INTERCHANGE    I        Normal, Switched
EQUALISATION   E        CF used, CF not used
RESPONSE       R        Frontward, Backward
HEADPHONE      H        Hdph1, Hdph2
CORRECTNESS    C        Correct, Incorrect

Table 2: Different factors in the experiment.
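As a quick check on the per-session stimulus count given in Section 2.2, the following sketch (an illustration with integer-coded levels, not the authors' code) enumerates the 24 combinations of the four stimulus factors:

```matlab
% Sketch: enumerating the 24 per-session conditions from the four
% stimulus factors of Table 2 (M x S x I x E = 2 x 3 x 2 x 2).
[M, S, I, E] = ndgrid(1:2, 1:3, 1:2, 1:2);
conds = [M(:) S(:) I(:) E(:)];   % one row per noise-burst pair presented
disp(size(conds, 1));            % prints 24
```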

2.2.1 Preliminary Pilot Study

In a pilot study on three listeners, it was found that the listeners' overall performance in identifying the direction of displacement of noise bursts processed only by HRTF and CF was near perfect. On further questioning about the methods employed in judging direction, it became evident that the ease of discriminating direction resulted entirely from the perceived sharpness of the sound. By assuming that sharper sounds were in front and darker sounds were behind, identifying the direction of motion became trivially easy. Therefore, three different back-oriented noise stimuli were used, with spectral centroids that were lowest, lower, and near, as compared to the constant spectral centroid of the front-oriented stimulus.

2.2.2 Choice of Experimental Source Signal

A 140 ms source signal was gated on and off with a rise/fall time of 20 ms, and its spectrum was adjusted in the case of the back stimuli to have three spectral slopes, such that the spectral centroids of the back stimuli progressively approached that of the front stimulus. This was done to make it difficult to use a strictly timbre-based cue for discriminating between directions. Figure 5 shows this modification of the back stimuli, which began with a white noise signal (gaining 3 dB every octave in the magnitude spectrum) and progressed to signals with steeper spectra (gaining 4.5 dB and 6 dB every octave).

Figure 5: Spectra of noise inputs used for back oriented stimuli with increasing slope values.
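One plausible way to impose the spectral slopes of Figure 5 on a white-noise burst is FFT-domain gain shaping; the sketch below is an assumption about method and parameters, not the authors' processing chain:

```matlab
% Sketch: shaping a 140 ms white-noise burst to a +4.5 dB/octave slope,
% then applying the 20 ms rise/fall gate described in Section 2.2.2.
fs = 44100; N = round(0.14*fs);               % 140 ms burst (N is even here)
x  = randn(N,1);                              % white-noise source
slope_dB = 4.5;                               % assumed slope per octave
f  = (1:N/2).' * fs / N;                      % positive-frequency bins
g  = 10.^(slope_dB * log2(f/f(1)) / 20);      % gain rising slope_dB per octave
G  = [1; g; flipud(g(1:end-1))];              % conjugate-symmetric gain vector
y  = real(ifft(fft(x) .* G));                 % shaped noise, still real-valued
r  = round(0.02*fs);                          % 20 ms raised-cosine ramps
env = [0.5-0.5*cos(pi*(0:r-1)'/r); ones(N-2*r,1); 0.5+0.5*cos(pi*(0:r-1)'/r)];
y  = y .* env / max(abs(y));                  % gate and normalise to +/-1
```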

3 RESULTS

For ease of interpretation of the results and subsequent sections, the following conventions are maintained. The word 'Frontward' refers to a displacement between two sequentially presented sounds in an overall frontward direction, while 'Backward' refers to a displacement between a paired set of sounds in an overall backward direction. 'Front' refers to a single sound emanating from in front of the listener, and 'Back' to one emanating from behind the listener. The proportion of correct responses is the ratio of correct to overall responses for a given set of conditions. We define a 'correct' response to be one that matches the order of stimuli presented: for example, responding F (Frontward) when the first stimulus presented was processed by a back-HRTF and the second by a front-HRTF (an overall intended frontward displacement), and responding B (Backward) when the order of stimuli was reversed. The likelihood of observing by chance alone 60 correct responses in 96 trials is less than 1 in 100. So at a risk of error set to a probability of p
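The stated chance-level figure can be reproduced with a one-line binomial computation (MATLAB Statistics Toolbox; a reader's check, not the authors' analysis):

```matlab
% P(at least 60 correct out of 96 trials under pure guessing):
p = 1 - binocdf(59, 96, 0.5)   % ~0.0094, i.e. less than 1 in 100
```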