Decoding speech in the presence of other sources

Jon Barker, Martin Cooke, Dan Ellis.
Speech Communication

The statistical theory of speech recognition introduced several decades ago has brought about low word error rates for clean speech. However, it has been less successful in noisy conditions. Since extraneous acoustic sources are present in virtually all everyday speech communication conditions, the failure of the speech recognition model to take noise into account is perhaps the most serious obstacle to the application of ASR technology.

Approaches to noise-robust speech recognition have traditionally taken one of two forms. One set of techniques attempts to estimate the noise and remove its effects from the target speech. While noise estimation can work in low-to-moderate levels of slowly varying noise, it fails completely in louder or more variable conditions. A second approach utilises noise models and attempts to decode speech taking into account their presence. Again, model-based techniques can work for simple noises, but they are computationally complex under realistic conditions and require models for all sources present in the signal.

In this paper, we propose a statistical theory of speech recognition in the presence of other acoustic sources. Unlike earlier model-based approaches, our framework makes no assumptions about the noise background, although it can exploit such information if it is available. It does not require models for background sources, or an estimate of their number. The new approach extends statistical ASR by introducing a segregation model in addition to the conventional acoustic and language models. While the conventional statistical ASR problem is to find the most likely sequence of speech models which generated a given observation sequence, the new approach additionally determines the most likely set of signal fragments which make up the speech signal. Although the framework is completely general, we provide one interpretation of the segregation model based on missing-data theory. We derive an efficient HMM decoder, which searches both across subword state and across alternative segregations of the signal between target and interference. We call this modified system the speech fragment decoder.

The value of the speech fragment decoder approach has been verified through experiments on small-vocabulary tasks in high-noise conditions. For instance, in a noise-corrupted connected digit task, the new approach decreases the word error rate in the condition of factory noise at 5 dB SNR from over 59% for a standard ASR system to less than 22%.
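
The contrast between the conventional decoding problem and the extended one described above can be sketched as follows. The notation is an assumption made for illustration and does not appear in the abstract: X denotes a clean-speech observation sequence, Y the observed mixture of speech and other sources, W a sequence of speech models, and S a segregation hypothesis that labels which fragments of Y belong to the target speech.

```latex
% Conventional statistical ASR: search over model sequences only.
% P(X | W) is the acoustic model, P(W) the language model.
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)

% Speech fragment decoding (sketch): joint search over model
% sequences W and segregation hypotheses S of the mixture Y.
(\hat{W}, \hat{S}) = \arg\max_{W,\, S} P(W, S \mid Y)
```

Under this reading, the segregation model is whatever additional term scores S when the joint posterior is factorised; the form of that factorisation is not given in the abstract and is left unspecified here.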