When: Nov 30 @ 12:00 PM
Where: Hackerman 306

Note: This is both an in-person and a virtual presentation. Here is the link for where the presentation will be taking place virtually.

Title: A Synergistic Combination of Signal Processing and Deep Learning for Robust Speech Recognition

Abstract: When speech is captured with a distant microphone, it includes distortions caused by noise, reverberation, and overlapping speakers. Far-field speech processing systems need to be robust to those distortions to function in real-world applications, and hence have front-end components to handle them. Ideally, to be complete, such systems should be equipped to answer “who is speaking, what, when, and where?”. The front-end components typically used today have two issues: (1) they are optimized based on signal reconstruction objectives and (2) they don’t try to explicitly localize the direction of the speakers. This makes the overall speech processing system sub-optimal, as the front-end is optimized independently of the downstream task, and unexplainable, as it doesn’t address “where?”. In this thesis, these two issues are incrementally addressed.

Firstly, some new techniques are proposed to train front-end systems with application-oriented objectives. Emerging end-to-end neural methods have made it easier to optimize the front-end in such a manner. In this work, carefully designed multichannel speech enhancement/separation subnetworks are embedded inside a sequence-to-sequence automatic speech recognition (ASR) system. Although the entire network is trained only on the speech recognition error minimization criterion, the intermediate outputs can be reconstructed as enhanced signals and can be interpreted perceptually. This is achieved by a formulation that performs simultaneous dereverberation and denoising using a single differentiable speech recognition network, which also learns some important hyperparameters from the data. The dereverberation subnetwork is based on well-known linear prediction, where the filter order hyperparameter is estimated using a reinforcement learning approach, and the denoising (beamforming) subnetwork is based on a parametric multichannel Wiener filter, where the speech distortion factor is also estimated inside the network. This method gives a considerable gain in performance on real and unseen conditions. It is also shown how such a system, optimized based on the ASR objective, improves the speech enhancement quality on various signal-level metrics in addition to the ASR word error rate (WER) metric.
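As an illustration of the denoising step mentioned above, the following is a minimal NumPy sketch of the textbook rank-one parametric multichannel Wiener filter for a single frequency bin. It is not the thesis implementation: the function name `pmwf_weights`, the toy steering vector, and the random noise covariance are all assumptions made for this example, and the speech distortion factor `mu` is set by hand here rather than estimated inside a network.

```python
import numpy as np

def pmwf_weights(phi_s, phi_n, mu, ref=0):
    """Parametric multichannel Wiener filter for one frequency bin.

    phi_s, phi_n: (M, M) speech and noise spatial covariance matrices.
    mu: speech distortion factor (mu=0 gives MVDR when phi_s is rank one,
        mu=1 gives the standard multichannel Wiener filter).
    ref: index of the reference microphone.
    """
    ratio = np.linalg.solve(phi_n, phi_s)          # phi_n^{-1} phi_s
    return ratio[:, ref] / (mu + np.trace(ratio))  # (M,) filter weights

# Toy setup (hypothetical values): 4 microphones, one source at 30 degrees.
rng = np.random.default_rng(0)
num_mics = 4
d = np.exp(-1j * np.pi * np.arange(num_mics) * np.sin(np.deg2rad(30.0)))
phi_s = 2.0 * np.outer(d, d.conj())                # rank-one speech covariance
a = rng.standard_normal((num_mics, num_mics)) \
    + 1j * rng.standard_normal((num_mics, num_mics))
phi_n = a @ a.conj().T + num_mics * np.eye(num_mics)  # Hermitian, positive definite

w_mvdr = pmwf_weights(phi_s, phi_n, mu=0.0)
w_mwf = pmwf_weights(phi_s, phi_n, mu=1.0)

# Response of each filter to the target direction (w^H d).
gain_mvdr = np.vdot(w_mvdr, d)
gain_mwf = np.vdot(w_mwf, d)
```

With `mu=0` the filter passes the target direction undistorted; larger values of `mu` trade some speech distortion for stronger noise suppression, which is why it is a natural quantity to tune for the downstream task.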

Given the success of joint optimization in the single-source setting, it is subsequently extended to multi-talker scenarios. A target speech extraction method that combines both anchor speech and location information is proposed. An experimental comparison with a traditional pipeline system verifies that this task can also be realized by end-to-end ASR training objectives without using parallel clean data. The results are promising in mixtures of two speakers and noise. Although these experiments were performed with the ground-truth locations, they also serve as a proof of concept to establish the importance of source localization for far-field speech applications. This thesis subsequently introduces methods for estimating the locations.

The second part of the thesis starts with the introduction of novel supervised learning methods to estimate the direction of arrival (DOA) of all the speakers simultaneously from the audio mixture. At the heart of the method is a source splitting mechanism that creates source-specific intermediate representations inside the network. The experiments establish that a variant of the earth mover's distance (EMD) loss is very effective in classifying DOA at a very high resolution by modeling inter-class relationships. The experiments also show localization methods to be a very effective front-end for multi-talker speech recognition, with the potential to reduce the WER by about a factor of two.
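To see why an EMD-style loss models inter-class relationships, consider this minimal sketch of the plain one-dimensional earth mover's distance between two distributions over ordered angle classes. It is only the basic form, not the variant used in the thesis; the helper names, the 36-class grid (10 degrees per class), and the omission of circular wrap-around are assumptions for illustration.

```python
import numpy as np

def emd_1d(p, q):
    """1-D earth mover's distance between two distributions over ordered
    classes with unit spacing: the sum of absolute CDF differences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

def one_hot(index, num_classes):
    """Distribution with all mass on a single class."""
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

num_classes = 36                      # hypothetical 10-degree angle grid
target = one_hot(10, num_classes)     # true direction
near = one_hot(11, num_classes)       # prediction off by one class
far = one_hot(20, num_classes)        # prediction off by ten classes

near_cost = emd_1d(near, target)
far_cost = emd_1d(far, target)
```

Here the near miss costs 1 and the far miss costs 10, whereas a cross-entropy loss would penalize both confident wrong predictions identically. That sensitivity to how far off a prediction is lets neighboring angle classes inform each other at fine resolutions.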

Finally, combining the strengths of joint optimization and location-driven systems, the thesis introduces a new paradigm for handling far-field multi-speaker data in an end-to-end (E2E) neural network manner, called directional ASR. In directional ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance. All three functionalities of directional ASR (localization, separation, and recognition) are connected as a single differentiable neural network and trained solely based on ASR error minimization objectives. As directional ASR does not require explicit DOA supervision, it is more appropriate for realistic data. Directional ASR outperforms a strong far-field multi-speaker end-to-end system in both separation quality and ASR performance.

Committee Members

  • Shinji Watanabe, Language Technologies Institute, Carnegie Mellon University
  • Sanjeev Khudanpur, Department of Electrical and Computer Engineering
  • Hynek Hermansky, Department of Electrical and Computer Engineering
  • Najim Dehak, Department of Electrical and Computer Engineering