Intelligibility enhancement of human speech for severely hearing impaired persons by dedicated digital processing

A. Plinge, D. Bauer, M. Finke
Institut für Arbeitsphysiologie an der Universität Dortmund, 44139 Dortmund, Ardeystraße 67, Tel. +49-231-1084-265, Email [email protected]

Keywords: hearing impaired, digital processing, transposition

Abstract

Sensory hearing-impaired people with a severe auditory deficit are neither candidates for cochlear implants nor can they benefit sufficiently from conventional hearing aids. Since they usually suffer a total loss of reception of components from the higher spectrum, but may retain sufficient residual sensitivity and selectivity below 1.5 kHz to receive at least part of the speech energy directly, it seems adequate to enhance the speech features that consist mainly of lower frequencies and to transpose the others into that range. Experimental algorithms for sensory-matched signal processing have been developed in C++ and form the basis for evaluation.

1 Introduction

The approach was introduced by D. Bauer in [1] at this conference. It is a development that both enhances and supersedes the approach described in [2]. This demonstration focuses on the digital processing, both in its advances over the analogue processing of [3] and in its use of the new possibilities.

2 Simulation of Hearing Loss

The conditions of the induced hearing losses are again described in detail in [1]. We will use this opportunity to demonstrate the effects of such a loss to people with normal hearing by simulation.

[Figure 1: simulation of hearing loss — the input signal passes through a 2 kHz low-pass FIR filter and is summed with the output of a noise generator followed by an FIR filter to yield the output signal]

A simple simulation of the typical hearing loss can be achieved by low-pass filtering the audio signal and adding spectrally shaped noise, which raises the threshold and compresses the usable dynamic range between threshold and level of discomfort (cf. figure 1).
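As a minimal sketch of this simulation (the actual filter designs and noise shaping are not given in the text; `fir`, `simulateLoss`, and all coefficients are illustrative assumptions):

```cpp
#include <vector>
#include <cstddef>
#include <random>

// Direct-form FIR convolution (illustrative helper).
std::vector<float> fir(const std::vector<float>& x, const std::vector<float>& h) {
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size() && k <= n; ++k)
            y[n] += h[k] * x[n - k];
    return y;
}

// Simulated loss: low-pass the audio, then add spectrally shaped noise
// that raises the threshold for a normal-hearing listener.
std::vector<float> simulateLoss(const std::vector<float>& in,
                                const std::vector<float>& lowpass2kHz,
                                const std::vector<float>& noiseShape,
                                float noiseGain, unsigned seed = 0) {
    std::vector<float> filtered = fir(in, lowpass2kHz);
    std::mt19937 rng(seed);
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    std::vector<float> noise(in.size());
    for (float& v : noise) v = gauss(rng);
    std::vector<float> shaped = fir(noise, noiseShape);
    for (std::size_t n = 0; n < filtered.size(); ++n)
        filtered[n] += noiseGain * shaped[n];
    return filtered;
}
```

In a real demonstrator the two FIR coefficient sets would be designed for the 2 kHz cut-off and the desired noise spectrum; here they are left as parameters.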


3 Processing

The overall processing consists of an AGC for preconditioning the input signal, followed by feature enhancement and feature replacement units whose output signals are combined.

[Figure 2: overall processing structure — AGC and noise reduction feeds two parallel paths, filtering followed by feature enhancement and filtering followed by feature replacement, whose outputs are summed]
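The topology of figure 2 can be sketched as a function over exchangeable stages (a structural illustration only; `Stage` and `process` are hypothetical names, and each real stage would carry its own filtering and state):

```cpp
#include <vector>
#include <cstddef>
#include <functional>

// A processing stage maps an input block to an output block.
using Stage = std::function<std::vector<float>(const std::vector<float>&)>;

// Preconditioned input feeds two parallel paths whose outputs are summed.
std::vector<float> process(const std::vector<float>& in,
                           Stage agc, Stage enhance, Stage replace) {
    std::vector<float> pre = agc(in);    // AGC and noise reduction
    std::vector<float> a = enhance(pre); // feature enhancement path
    std::vector<float> b = replace(pre); // feature replacement path
    std::vector<float> out(pre.size());
    for (std::size_t n = 0; n < out.size(); ++n)
        out[n] = a[n] + b[n];            // combine the two paths
    return out;
}
```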

3.1 Feature Enhancement

The feature enhancement processes weak (mostly subthreshold) speech features within the audible frequency range. Plosives (esp. [t] and [k]), which are important for speech stream segmentation, are equalized in perceptual strength to that of the stronger speech elements, such as vowels. The 2nd formant of vowels is also enhanced if necessary (esp. that of [i]).

Figure 3: [yky] enhancement in the higher band

Figure 4: [ækæ] equalization in the higher band

A dedicated detection logic and highly adjustable compression techniques were introduced using analogue processing in SICONA [3, 29ff]. Essential extensions and improvements became possible through the use of digital processing. The digital design overcomes several shortcomings of the analogue realization. Decoupling processing time from physical time opens up new ways of signal processing. Look-ahead and the increased amount of decision time make it possible to avoid otherwise unavoidable auditory artifacts. Overshoots and phase distortion are eliminated, since the timing of control signal and amplification can be matched exactly through the use of linear-phase FIR filters and delay memory.
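The timing-match idea can be illustrated with a simple delay line: the audio is held back by the detector's look-ahead so that the gain it computed lines up exactly with the samples it refers to (a sketch under stated assumptions; `applyAlignedGain` and the delay-line model are illustrative, not the paper's implementation):

```cpp
#include <vector>
#include <cstddef>

// Delay the audio by the look-ahead length, then apply the control
// signal sample by sample, so no overshoot occurs at plosive onsets.
std::vector<float> applyAlignedGain(const std::vector<float>& audio,
                                    const std::vector<float>& gain,
                                    std::size_t lookahead) {
    std::vector<float> delayed(audio.size(), 0.0f);
    for (std::size_t n = lookahead; n < audio.size(); ++n)
        delayed[n] = audio[n - lookahead];   // delay memory
    std::vector<float> out(audio.size());
    for (std::size_t n = 0; n < audio.size(); ++n)
        out[n] = delayed[n] * gain[n];       // exactly timed amplification
    return out;
}
```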

3.2 Feature Replacement

The speech features of those sounds which cannot be selectively amplified beyond the auditory threshold within the residual hearing area remain completely inaudible to the hearing impaired. Since their main feature-bearing frequency components lie far outside the residual hearing area, they have to be replaced by audible synthetic sounds.


3.2.1 Fricative Detection

In SICONA the [s] and its voiced counterpart, the [z], were transposed [3]. They are detectable by simple spectral filtering, since their main feature-relevant energy is concentrated above 5 kHz. Extending the detection to [ç] and [M] leads to higher computational effort: now the feature-relevant energy lies between 2 and 5 kHz, and confusion becomes a major problem. Even worse, due to strong coarticulation effects, these sounds can yield similar spectral distributions. Transposing just the [ç], or both [M] and [ç] separately, requires a larger processing effort. A dedicated decision logic, using, among others, several spectral features, has been developed that shows good results in matching several prototypes. Based on statistical evaluation of labeled speech, a fixed set of decision rules could be derived, avoiding the computational effort of a neural network and making it possible to transfer them directly to a DSP. Transposing [f], [v] or even [S] and [] was not considered, since these have at least some energy in the baseband which could be made audible. However, if we wanted to transpose e.g. [S] selectively, it would become inevitable to separate voiced and unvoiced energy, again leading to a much higher computational effort. The high computational cost of exact pitch tracking (e.g. by normalized cross-correlation, cf. [5]), which in that case cannot be avoided, would only seem reasonable if the pitch estimate were also used for other purposes, such as adaptive filtering or tactile pitch transmission.
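A fixed rule set over band energies might look like the following (purely illustrative: the paper's actual rules, features, and thresholds derived from labeled speech are not given in the text):

```cpp
// Hypothetical band-energy features: below 2 kHz, 2-5 kHz, above 5 kHz.
struct BandEnergies { float low; float mid; float high; };

enum class Fricative { None, S_Z, Ch };  // [s]/[z] vs. a [ç]-like sound

// Toy decision logic standing in for the statistically derived rules.
Fricative classify(const BandEnergies& e, float threshold) {
    float total = e.low + e.mid + e.high;
    if (total < threshold) return Fricative::None;  // too weak overall
    if (e.high > e.mid && e.high > e.low)
        return Fricative::S_Z;                      // energy above 5 kHz
    if (e.mid > e.low)
        return Fricative::Ch;                       // energy between 2 and 5 kHz
    return Fricative::None;
}
```

A real rule set would combine several spectral features per frame, as the text describes, rather than three coarse band energies.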

3.2.2 Replacement Generation

The replacement sound has to meet several requirements. The overall objectives are an improvement of intelligibility and the control of one's own articulation. As described in [1], the physiological requirements prescribe a spectrally defined noise with a specific amplitude distribution. Since it does not seem feasible to compute such a noise on-line in a DSP, we use two sets of precalculated noise for [s, z] and [ç], which are stored in memory. The modulation of the selected noise can be done straightforwardly: the zero-crossing count of a specifically filtered signal is used to modulate the sinusoidal component, which is added to the noise signal. The sensitivity match is achieved by linearly interpolating values which are read off a "sensitivity table". Its values can be pre-adjusted according to the individual difference sensitivity of the user. The overall amplitude is controlled by the level of the transposed sound. Abnormal loudness growth can be compensated by a presettable and again individually adjustable compressive characteristic.

[Figure 5: feature replacement — band filters feed energy estimation, zero-crossing counting and the decision logic; noise generation and a sin(ωt) component are combined, FIR filtered, and adjusted before output]
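A sketch of one frame of this generator, under stated assumptions (the table formats, the zero-crossing-to-frequency mapping, and all function names are illustrative, not the paper's implementation):

```cpp
#include <vector>
#include <cstddef>
#include <cmath>
#include <algorithm>

// Linear interpolation into a user-adjustable sensitivity table,
// indexed by a normalized value in [0, 1].
float sensitivity(const std::vector<float>& table, float x) {
    if (table.empty()) return 1.0f;
    float pos = x * (table.size() - 1);
    std::size_t i = static_cast<std::size_t>(pos);
    if (i + 1 >= table.size()) return table.back();
    float frac = pos - i;
    return table[i] * (1.0f - frac) + table[i + 1] * frac;
}

// Zero-crossing count of a (band-filtered) frame, used to set the
// frequency of the sinusoidal component.
int zeroCrossings(const std::vector<float>& frame) {
    int count = 0;
    for (std::size_t n = 1; n < frame.size(); ++n)
        if ((frame[n - 1] < 0.0f) != (frame[n] < 0.0f)) ++count;
    return count;
}

// One frame of replacement sound: stored precalculated noise plus a
// modulated sinusoid, scaled via the interpolated sensitivity value.
std::vector<float> replacementFrame(const std::vector<float>& noiseTable,
                                    const std::vector<float>& sensTable,
                                    int zcCount, float level, float sampleRate) {
    float freq = zcCount * 10.0f;  // illustrative count-to-frequency mapping
    float gain = level * sensitivity(sensTable, std::min(level, 1.0f));
    std::vector<float> out(noiseTable.size());
    for (std::size_t n = 0; n < out.size(); ++n) {
        float s = std::sin(2.0f * 3.14159265f * freq * n / sampleRate);
        out[n] = gain * (noiseTable[n] + s);
    }
    return out;
}
```

The compressive characteristic mentioned in the text would be folded into the level-to-gain mapping before the table lookup.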

4 Conclusion

From the evaluations done so far we conclude that the digital realization of a fricative transposer is feasible. The user requirements, which were listed as a starting point, can be fulfilled to a sufficient degree. The method produces no artifacts. Sub-selectivity within the class of fricative sounds is limited by the strong coarticulation effects which must be faced in running speech and by the unavailability of top-down processing. We expect no further problems in transferring the C++ routines to fixed-point integer DSP routines.


References

[1] Dieter Bauer, Axel Plinge, and Martin Finke. Towards a wearable frication transposer for severely hearing impaired persons. In Proceedings of the 6th AAATE Conference, 2001.

[2] Dieter Bauer. Assistive technology for severely hearing impaired: towards closing essential gaps. In Proceedings of the 5th AAATE Conference. IOS Press, 1999.

[3] D. Bauer, I. Summers, N. van Son, D. Piroth, and J. P. Honoré. SICONA, Final Report, EU project #1090 (TIDE). Inst. f. Arbeitsphysiologie (IfADo), 44139 Dortmund, Germany, 1998.

[4] Hans Herrman, Dieter Bauer, Hannele Hypönnen, David Apard, Jani Jävinen, Maroc Mercinelli, and Peter Mayer. MORE, Final Report, EU project #3006. IMS Stuttgart, Germany, 2000.

[5] David Talkin. A robust algorithm for pitch tracking. In W. B. Kleijn and K. K. Paliwal, editors, Speech Coding and Synthesis. Elsevier Science B.V., 1995.
