Department of Computer Science LS XII - Pattern Recognition Group

Experiments in Distant Talking Speech Recognition Using a Standard Database

Gernot A. Fink and Sascha Hohenner
Proc. 31. Deutsche Jahrestagung für Akustik, DAGA '05, 2005.

München

Abstract

Distant talking speech recognition is applied in situations where the use of close-talking microphones is not feasible, e.g. when communicating with a mobile robot. The acoustic data are captured with multiple microphones at some distance from the talker. As a result, not only the desired speech but also sound from interfering sources is picked up. Additionally, the received signals are corrupted by echoes introduced by the acoustic environment. Consequently, recognition quality degrades in distant talking applications compared to results obtained on "clean" speech data. Various methods, such as beam-forming and adaptation techniques, have been proposed to obtain the best possible recognition quality even in such an adverse environment. However, a direct comparison of results is usually not possible, as larger speech databases recorded in both a distant talking and a clean-speech configuration are not available. To clearly assess the performance of the distant talking speech recognition approach developed for our mobile robot BIRON against the ideal close-talking scenario, we decided to make a "standard" speech database (namely the WSJ0 corpus) usable for a comparative study. We simulated a talker by setting up a high-quality studio loudspeaker at an appropriate distance from the robot and "uttering", i.e. re-playing, the complete corpus. These data were then recorded via the robot's stereo microphones, yielding a distant talking version of the original corpus. In first recognition experiments we observed that the word error rate approximately triples when training on clean speech and testing on distant speech (5k closed vocabulary test), a figure that could be reduced only slightly using a filter-and-sum beam-forming approach. However, when training on the distant talking version of the corpus, the word error rate increased by only about 84% relative.
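The relative figures quoted above can be illustrated with a minimal Python sketch. It computes the word error rate in its usual form (word-level edit distance divided by the number of reference words) and the relative increase between two error rates. The function names `word_error_rate` and `relative_increase` and the example sentences are hypothetical and chosen only for illustration. They are not taken from the paper, whose absolute error rates are not stated in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_increase(baseline, degraded):
    """Return the relative increase of `degraded` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (degraded - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical sentences: one substitution and one deletion in six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Hypothetical baseline of 10% WER: a tripled error rate is a 200% increase.
    print(relative_increase(10.0, 30.0))
```

Read this way, a word error rate that triples corresponds to a relative increase of 200%, whereas an increase of about 84% relative means that the error rate is a little less than twice the baseline.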