Electronic Signal Processing Algorithm for Close-Talk System

Yi Jiang1,2, Hong Zhou2, Hao Zhang2, Jun Qi1, Yuanyuan Zu2, Baoshuai Dong1, and Wei Li2

1 Department of Electronic Engineering, Tsinghua University, Beijing, P.R. China
2 The Quartermaster Equipment Research Institute, CPLA, Beijing, P.R. China
[email protected]

Abstract. For portable electronic devices and high-noise environments, a close-talk system is often used to collect more of the target signal than a common sound-collection system does, but the performance of such systems is still not good enough in high-noise applications. In this paper, within the framework of computational auditory scene analysis (CASA), a binary-masking signal processing algorithm based on the BelaSigna 300 development tool is proposed. Two microphones capture the target sound and the far noise at the same time; a single chip then computes the inter-microphone intensity differences (IID) of the two microphones in time-frequency (T-F) units and uses them as cues to generate binary masks for the near microphone. Based on the theory of the ideal binary mask (IBM) and head-related transfer functions (HRTF), the threshold value of the IID is set to 2. Experiments with one interfering noise source at different positions were conducted to test this idea.

Keywords: speech signal segregation, electronic engineering, computational auditory scene analysis (CASA), inter-microphone intensity differences (IID), system on chip (SOC).

1 Introduction

In daily life, obtaining clean target speech is very important for communication and for robust automatic speech recognition (ASR). In common environments the target speech is always mixed with noise, so a close-talk system is often used to improve the quality of speech collection by placing a microphone near the target sound source, as in a mobile phone or a head-worn microphone. Even so, getting clean speech remains a hard task in complex auditory scenes, especially in high-noise environments such as railway stations, airports, and subway stations, where the collected speech still has a low signal-to-noise ratio. With the wide use of portable devices, speech enhancement for close-talk systems based on highly integrated circuits or a system on chip (SOC) has become important.

In recent years, great progress has been made in the study of computational auditory scene analysis (CASA) algorithms for separating speech from complex audio scenes [1], and the ideal binary mask (IBM) has become the computational goal of such systems [2] under the criterion of signal-to-noise ratio. Within the CASA framework, the key point is to find proper cues to generate binary masks that approach the IBM. The main cues in monaural speech segregation systems include the pitch period [3] and onset/offset [4].


A location-based sound segregation method was used to estimate the ideal binary mask for far sound sources [5]; it used a two-microphone system to generate inter-aural time difference (ITD) and inter-aural intensity difference (IID) cues, as a person's two ears do. Because the positions of the two microphones and the locations of the sound sources were not constrained, this algorithm can yield ambiguous spatial results, and segregation may fail at certain positions. Nevertheless, it provides a way to segregate speech with two microphones, and binaural cue extraction is the key point in such a location-based CASA system.

In a close-talk system, the target sound source is near one microphone, in most conditions within several centimeters. In this paper, another microphone is placed at the right ear, a little farther from the mouth, so the binary IID cues can be used to segregate the target speech. Results from studies of auditory localization [6-8] are used to refine the approach. Research on the head-related transfer functions (HRTF) of nearby sources, covering both theoretical calculations and experiments, indicates that the IID increases substantially for a lateral sound source as the distance decreases below 1 m, even at low frequencies, whereas the IID is small for distant sound sources [7, 9]. IID is therefore a robust cue to distinguish near speech from noise arriving from a far source. The BelaSigna 300 chip from ON Semiconductor, which integrates 4 ADCs and a HEAR configurable accelerator, is suitable for this application.

The rest of the paper is organized as follows. Section 2 describes the architecture of the proposed algorithm and introduces a method for estimating the threshold of the IID cues. Section 3 presents a systematic evaluation of the system for two sound sources at different locations.

2 Electronic Speech Signal Segregation Processing

The close-talk system is shown in Fig. 1. The proposed close-talk speech segregation processing consists of two parallel parts: the same auditory filterbank decomposes the input mixture signals from the two microphones into T-F units, and the IID cues between microphones A and B are extracted to generate the binary mask. Subsequently, the binary mask is applied to the output of the near microphone A, as the better ear, to group the target speech, which is then resynthesized to obtain the target speech waveform.

Fig. 1. Schematic diagram of the proposed algorithm. Microphone A was placed in front of the mouth, within several centimeters. Microphone B was placed near the right ear.
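To make the dataflow concrete, the following is a minimal Python sketch of the pipeline in Fig. 1. It assumes a 16 kHz sampling rate and substitutes an STFT for the paper's 128-channel gammatone/WOLA filterbank; the function name, frame sizes, and threshold constant are illustrative (the threshold of 2 anticipates the derivation in Sect. 2.2).

import numpy as np
from scipy.signal import stft, istft

FS = 16000        # assumed sampling rate
THRESHOLD = 2.0   # IID threshold derived in Sect. 2.2

def segregate(mic_a, mic_b, fs=FS, thr=THRESHOLD):
    """Estimate the target speech from near microphone A, using the
    inter-microphone intensity difference as the masking cue."""
    # Decompose both inputs into the same T-F representation
    # (20-ms frames with a 10-ms hop, as in Sect. 3).
    _, _, A = stft(mic_a, fs, nperseg=320, noverlap=160)
    _, _, B = stft(mic_b, fs, nperseg=320, noverlap=160)
    # IID per T-F unit (Eq. 4); a small floor avoids division by zero.
    iid = np.abs(A) ** 2 / (np.abs(B) ** 2 + 1e-12)
    mask = (iid >= thr).astype(float)       # binary mask from the IID cue
    # Apply the mask to the near microphone and resynthesize.
    _, target = istft(A * mask, fs, nperseg=320, noverlap=160)
    return target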

2.1 Inter-microphone Intensity Difference for CASA

Energy-based CASA systems use the energy of each time-frequency (T-F) unit as the cue; they have been widely used for their simple form and their optimality in signal-to-noise ratio. The basic idea of such a system is to estimate the main energy source of each T-F unit, which can be described as

m(t,f) = \begin{cases} 1, & |S(t,f)|^2 \ge |N(t,f)|^2 \\ 0, & \text{otherwise} \end{cases} \qquad (1)

where |S(t,f)|^2 is the energy of the target speech in the T-F unit, |N(t,f)|^2 is the energy of the mixed noise, and m(t,f) is the binary mask value at time t and frequency f: '1' marks the target speech, '0' otherwise.
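As a minimal illustration (not part of the original system), Eq. (1) can be computed directly when the clean speech and noise spectrograms are available, e.g. in an oracle evaluation:

import numpy as np

def ideal_binary_mask(S, N):
    """Eq. (1): a T-F unit is kept when target energy dominates noise energy.
    S and N are complex spectrograms of the clean speech and the noise."""
    return (np.abs(S) ** 2 >= np.abs(N) ** 2).astype(float)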

In this paper, the background noise adds acoustically to the clean speech with no correlation between them, so the nearby target sound source and the far noise can be discussed separately. Only the single-interferer condition is considered: of the two sound sources, the target is near and the interfering noise is far away, and there is no correlation between the two sounds. The energies at microphones A and B in each channel can then be written as

|X_A(t,f)|^2 = |S_A(t,f)|^2 + |N_A(t,f)|^2 \qquad (2)

|X_B(t,f)|^2 = |S_B(t,f)|^2 + |N_B(t,f)|^2 \qquad (3)

In the close-talk system of Fig. 1, |X_A(t,f)|^2 is the T-F unit energy at microphone A, |S_A(t,f)|^2 is the energy of the target speech at microphone A, and |N_A(t,f)|^2 is the energy of the noise at microphone A, while |X_B(t,f)|^2 is obtained from the peripheral analysis of microphone B. The inter-microphone intensity difference between microphones A and B, IID_AB, can then be calculated as

IID_{AB}(t,f) = \frac{|X_A(t,f)|^2}{|X_B(t,f)|^2} = \frac{|S_A(t,f)|^2 + |N_A(t,f)|^2}{|S_B(t,f)|^2 + |N_B(t,f)|^2} = \frac{\dfrac{|S_A(t,f)|^2}{|N_A(t,f)|^2} + 1}{\dfrac{|S_B(t,f)|^2}{|N_A(t,f)|^2} + \dfrac{|N_B(t,f)|^2}{|N_A(t,f)|^2}} \qquad (4)

We then define

IID_S(t,f) = \frac{|S_A(t,f)|^2}{|S_B(t,f)|^2} \qquad (5)

IID_N(t,f) = \frac{|N_A(t,f)|^2}{|N_B(t,f)|^2} \qquad (6)

to describe the inter-microphone intensity differences between microphones A and B for the near sound and the far noise, respectively.
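The identity in Eq. (4) is easy to verify numerically. The sketch below uses assumed per-unit energies (all values hypothetical) and checks that the ratio of mixture energies equals the form normalized by |N_A|^2:

import numpy as np

# Assumed per-unit energies for a single T-F unit (hypothetical values).
Sa2, Na2 = 4.0, 0.5    # |S_A|^2, |N_A|^2 at the near microphone
Sb2, Nb2 = 0.4, 0.45   # |S_B|^2, |N_B|^2 at the far microphone

# Eq. (4), left-hand form: ratio of the two mixture energies.
iid_ab = (Sa2 + Na2) / (Sb2 + Nb2)
# Eq. (4), right-hand form: everything normalized by |N_A|^2.
iid_ab_alt = (Sa2 / Na2 + 1.0) / (Sb2 / Na2 + Nb2 / Na2)
assert np.isclose(iid_ab, iid_ab_alt)   # both give about 5.29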


By the theory of the IBM, a unit belongs to the target if |S_A(t,f)|^2 \ge |N_A(t,f)|^2, so the local signal-to-noise ratio (LC) at microphone A, |S_A(t,f)|^2 / |N_A(t,f)|^2, can be calculated from IID_AB. For target units, LC should be equal to or larger than 1. From the study in [5], IID_AB(t,f) is larger than IID_N(t,f) and smaller than IID_S(t,f); substituting (5) and (6) into (4) and solving for the local SNR gives, in the close-talk system,

LC = \frac{\dfrac{IID_{AB}(t,f)}{IID_N(t,f)} - 1}{1 - \dfrac{IID_{AB}(t,f)}{IID_S(t,f)}} = \frac{|S_A(t,f)|^2}{|N_A(t,f)|^2} \qquad (7)

Setting LC \ge 1 and rearranging yields the decision rule

IID_{AB}(t,f) \ge \frac{2}{\dfrac{1}{IID_N(t,f)} + \dfrac{1}{IID_S(t,f)}} \qquad (8)
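Continuing the same hypothetical per-unit energies as above, the following sketch checks Eq. (7) against the true local SNR and applies the decision rule of Eq. (8):

import numpy as np

Sa2, Na2, Sb2, Nb2 = 4.0, 0.5, 0.4, 0.45   # hypothetical energies
iid_s = Sa2 / Sb2                          # Eq. (5): 10.0
iid_n = Na2 / Nb2                          # Eq. (6): ~1.11
iid_ab = (Sa2 + Na2) / (Sb2 + Nb2)         # Eq. (4): ~5.29

# Eq. (7): the local SNR recovered purely from the IID quantities.
lc = (iid_ab / iid_n - 1.0) / (1.0 - iid_ab / iid_s)
assert np.isclose(lc, Sa2 / Na2)           # equals 8.0, the true local SNR

# Eq. (8): threshold on IID_AB that corresponds to LC >= 1.
thr = 2.0 / (1.0 / iid_n + 1.0 / iid_s)    # exactly 2.0 for these values
print(iid_ab >= thr)                       # True -> unit labeled as target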

Fig. 2. IID estimate for a rigid spherical head

2.2 Inter-microphone Level Difference Estimate for a Rigid, Spherical Head

The ILD cues found in the HRTF for nearby sources are important for auditory distance perception in the proximal region. Because of the complexity of the auditory scenes that can arise in the proximal region, a simple sphere model has been used to explain the HRTFs intuitively [6]; Fig. 2 gives a crude illustration of this phenomenon. Ignoring the influence of the head, the IID values of the near target speech and the far noise can be calculated as

IID_S(t,f) = \frac{|S_A(t,f)|^2}{|S_B(t,f)|^2} = \frac{d_{SB}^2}{d_{SA}^2} \qquad (9)

IID_N(t,f) = \frac{|N_A(t,f)|^2}{|N_B(t,f)|^2} = \frac{d_{NB}^2}{d_{NA}^2} \qquad (10)

where d_SA is the distance between the mouth and microphone A, d_SB the distance from the mouth to microphone B, and d_NA, d_NB the distances from the noise source to microphones A and B.
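Under the free-field assumption of Eqs. (9) and (10), the IID values and the resulting threshold of Eq. (8) follow directly from the geometry. The distances below are hypothetical but representative of the close-talk setup:

# Hypothetical close-talk geometry in metres, ignoring head shadowing.
d_sa, d_sb = 0.03, 0.20   # mouth -> microphone A, mouth -> microphone B
d_na, d_nb = 1.00, 1.02   # far noise -> microphone A and microphone B

iid_s = (d_sb / d_sa) ** 2   # Eq. (9): ~44.4, large for the near source
iid_n = (d_nb / d_na) ** 2   # Eq. (10): ~1.04, near unity for far noise

thr = 2.0 / (1.0 / iid_n + 1.0 / iid_s)   # Eq. (8)
print(round(thr, 2))   # ~2.03, close to the fixed threshold of 2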


In a series of experiments, the levels computed with the rigid-sphere model for a sound source positioned at 90° azimuth and distances between 10 and 160 cm showed that the level increased with decreasing distance and was relatively flat across frequency [9]. At 160 cm the ILD was below 5 dB, meaning the largest inter-ear IID at that distance was less than about 3.1, and at low frequencies the value was almost 0 dB. In the close-talk system the two microphones are positioned closer to each other than two ears are, and the noise is far away, so d_NB and d_NA are close to each other and the noise IID is almost one. For nearby sound sources, such as at a 10 cm distance, the IID was larger than 10 dB and almost 15 dB at frequencies above 10 kHz. In our close-talk system the microphone is no more than 5 cm from the mouth, so IID_S is larger than in those experiments and also increases with frequency; 1/IID_S can therefore be set roughly to zero. Substituting IID_N ≈ 1 and 1/IID_S ≈ 0 into (8) gives a threshold of 2/(1 + 0) = 2 for IID_AB. Finally, the target speech is resynthesized from the output of microphone A with the binary mask generated by the proposed algorithm, yielding the target speech with little noise.

3 Evaluations and Comparison

In the single-chip electronic evaluation system, a head model based on Chinese head geometry was used together with the development tool of the BelaSigna 300, as shown in Fig. 3. Two microphones and ADCs captured the signals from the positions of the mouth and the right ear as the two inputs. The target loudspeaker was positioned about two centimeters in front of the mouth to simulate close talk; another loudspeaker was placed one meter in front of the mouth, at 0 or 45 degrees, to generate the interfering noise. The near target speech was a male utterance; a female utterance was used as the noise. The system modeled auditory filtering by decomposing the input mixture signal into the time-frequency domain using a bank of 128 gammatone filters or WOLA filters, with center frequencies equally distributed on the equivalent rectangular bandwidth rate scale from 50 to 8000 Hz. In each filter channel, the output was divided into 20-ms time frames with 10-ms overlap between consecutive frames.

Fig. 3. The functional diagram of the development tool BelaSigna 300
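For reference, the framing used in each filter channel (20-ms frames, 10-ms overlap) corresponds to the following sketch, assuming a 16 kHz sampling rate; the Hann window and the function name are illustrative choices:

import numpy as np

FS = 16000                # assumed sampling rate
FRAME = int(0.020 * FS)   # 20-ms analysis frames -> 320 samples
HOP = int(0.010 * FS)     # 10-ms overlap -> 160-sample hop

def frame_signal(x, frame=FRAME, hop=HOP):
    """Slice one filter-channel output into overlapping windowed frames."""
    n = 1 + max(0, (len(x) - frame) // hop)
    idx = np.arange(frame)[None, :] + hop * np.arange(n)[:, None]
    return x[idx] * np.hanning(frame)

frames = frame_signal(np.random.randn(FS))   # 1 s of input
print(frames.shape)                          # (99, 320)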


Fig. 4. Segregation results with noise at 0 degrees. The top panel shows the mixture signal at microphone B, which contains strong noise in which the target speech is hard to find. The middle panel shows the mixture signal at microphone A, in which the target speech is the main component: the target speaker is near microphone A, so it collects more target speech, and its placement also reduces the energy arriving from the far noise. The bottom panel shows the signal segregated from microphone A with the proposed algorithm.

In Fig. 4 it is obvious that the noise from 0 to 2 seconds is reduced. In Fig. 5, where the noise and the target sound are in different orientations, the segregation result is clearly good, especially compared with the mixture signal of microphone B. Because the influence of the interfering loudspeaker is reduced, the mixture signal of microphone A has larger values than in Fig. 4, which suggests that the threshold value could be increased there to obtain better performance.

Fig. 5. Segregation results with noise at 45 degrees. The signals from microphones B and A are shown in the top and middle panels; the bottom panel shows the segregation result.

4 Conclusions

We have presented an electronic signal processing algorithm for close-talk systems based on the SOC development tool BelaSigna 300. With the integrated ADCs and hearing accelerator, IID cues are extracted from the two microphones, and a binary masking algorithm with a fixed IID threshold is applied. The speech segregation experiments indicate good performance with one interfering noise source at a far position. As the HRTF findings show, IID rises as frequency rises, so the threshold value of the IID should vary with frequency. However, accurate IID values are hard to obtain: they depend on the test equipment and processor and also vary from person to person, so more experiments are needed to refine this algorithm. The performance of the proposed algorithm should also be evaluated with different noise types, various numbers of noise sources, and different signal-to-noise ratios. The circuits of the development tool also need careful design to reduce electronic circuit noise.

References

1. Brown, G.J., Cooke, M.: Computational auditory scene analysis. Computer Speech and Language 8(4), 297–336 (1994)
2. Wang, D.: On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis. In: Divenyi, P. (ed.) Speech Separation by Humans and Machines, pp. 181–197. Kluwer, Norwell (2005)
3. Brungart, D.S., Chang, P.S., Simpson, B.D., Wang, D.: Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation. Journal of the Acoustical Society of America 120(6), 4007–4018 (2006)
4. Hu, G., Wang, D.: A Tandem Algorithm for Pitch Estimation and Voiced Speech Segregation. IEEE Trans. Audio, Speech, and Language Processing 18(8), 2067–2079 (2010)
5. Chao-Ling, H., Jang, J.S.R.: On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset. IEEE Trans. Audio, Speech, and Language Processing 18(2), 310–319 (2010)
6. Yang, S., Srinivasan, S., Zhaozhang, J., Jin, Z., Wang, D.: A computational auditory scene analysis system for speech segregation and robust speech recognition. Computer Speech and Language 24(1), 77–93 (2010)
7. Hu, G., Wang, D.: Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. Neural Networks 15(5), 1135–1150 (2004)
8. Boll, S.F.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoustics, Speech and Signal Processing 27(2), 113–120 (1979)
9. Wang, D.L., Brown, G.J. (eds.): Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley and IEEE Press, Hoboken (2006)