Title: Single-Channel Speech Separation in Noisy and Reverberant Conditions
Abstract: An inevitable property of multi-party conversations is that more than one speaker will end up speaking simultaneously for portions of time. Many speech technologies, such as automatic speech recognition and speaker identification, are not designed to handle overlapping speech and suffer severe performance degradation under such conditions. Speech separation techniques aim to solve this problem by producing a separate waveform for each speaker in a recording in which multiple talkers speak simultaneously. The advent of deep neural networks has resulted in strong performance gains on the speech separation task. However, training and evaluation have been almost exclusively restricted to a single dataset of clean, near-field read speech, which is not representative of many multi-person conversational settings: these are frequently recorded on room microphones, introducing noise and reverberation. Given that other speech technologies degrade under these conditions, speech separation systems are expected to suffer a similar decrease in performance.
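Concretely, the single-channel problem in such conditions is commonly written with a mixture model of the following form (the notation here is an illustrative addition, not drawn from the proposal itself):

    x(t) = \sum_{i=1}^{N} (h_i \ast s_i)(t) + n(t),

where the s_i are the N individual speaker signals, the h_i are room impulse responses that introduce reverberation, \ast denotes convolution, and n(t) is additive noise. Separation aims to recover an estimate of each s_i given only the observed mixture x(t); the clean, near-field setting corresponds to the special case where each h_i is an impulse and n(t) is negligible.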
The primary goal of this proposal is to develop novel techniques to improve speech separation in noisy and reverberant recording conditions. One core component of this work is the creation of additional synthetic overlap corpora spanning a range of more realistic and challenging conditions; because no suitable data currently exist, this is a necessary first step toward benchmarking state-of-the-art methods in such settings. Another proposed line of investigation is the integration of speech separation with speech enhancement, the task of improving a degraded speech signal by removing noise and reverberation; the two are a natural combination due to similarities in problem formulation and general approach. Finally, we propose an investigation into the effectiveness of speech separation as a pre-processing step for speech technologies, such as automatic speech recognition, that struggle with overlapping speech, as well as tighter integration of speech separation with these “downstream” systems; a sketch of the pre-processing arrangement follows.
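As a concrete illustration of that pre-processing arrangement, the sketch below cascades a separation front end into a downstream recognizer. This is a minimal sketch under our own assumptions: separator and recognizer are hypothetical callables standing in for real models, not an API from the proposal or from any particular toolkit.

    from typing import Callable, List, Sequence

    # A mono audio signal, sample by sample (hypothetical type alias).
    Waveform = Sequence[float]

    def transcribe_overlapping(
        mixture: Waveform,
        separator: Callable[[Waveform], List[Waveform]],
        recognizer: Callable[[Waveform], str],
    ) -> List[str]:
        """Separate a multi-talker mixture, then transcribe each stream."""
        # Front end: estimate one waveform per speaker from the mixture.
        estimated_sources = separator(mixture)
        # Run the downstream recognizer independently on each stream.
        return [recognizer(source) for source in estimated_sources]

The tighter integration mentioned above would go beyond such a cascade, for example by optimizing the separator and the downstream system jointly rather than treating each as a fixed black box.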