Thesis Proposal: Ruizhi Li

November 21, 2019 @ 3:00 pm – 4:00 pm
Olin Hall 305

Title: A Practical and Efficient Multi-Stream Framework for End-to-End Speech Recognition

Abstract: The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. In these cases, an appropriate strategy to fuse streams or select the most informative source is necessary. In recent years, with the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this proposal, a multi-stream framework is presented, based on the joint CTC/Attention E2E model, in which parallel streams are represented by separate encoders that aim to capture diverse information. On top of the regular attention networks, a secondary stream-fusion network is introduced to steer the decoder toward the most informative encoders.
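
The stream-fusion idea can be illustrated with a minimal sketch. It assumes each encoder has already produced one context vector for the current decoding step, and that a relevance score per stream is available. In the actual framework such scores would come from a learned network, which the abstract does not specify, so they are passed in as plain numbers here. The names `softmax` and `fuse_streams` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(contexts, stream_scores):
    """Combine per-encoder context vectors into one fused vector.

    contexts      -- one context vector (list of floats) per encoder
    stream_scores -- one scalar relevance score per encoder
                     (assumed given; a learned network would produce these)
    """
    weights = softmax(stream_scores)
    dim = len(contexts[0])
    fused = [
        sum(w * ctx[d] for w, ctx in zip(weights, contexts))
        for d in range(dim)
    ]
    return fused, weights


# Two toy streams; the first receives the higher relevance score.
contexts = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse_streams(contexts, [2.0, 0.0])
print(weights, fused)
```

Because the weights are a softmax over streams, a stream with a much lower score contributes little to the fused vector, which is one way a decoder could be steered toward the more informative encoder.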

Two representative frameworks have been proposed: Multi-Encoder Multi-Resolution (MEM-Res) and Multi-Encoder Multi-Array (MEM-Array). Moreover, because an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training scheme is further proposed in this work. Experiments are conducted on various corpora, including Wall Street Journal (WSJ), CHiME-4, DIRHA, and AMI. Compared with the best single-stream performance, the proposed framework achieves substantial improvement and also outperforms various conventional fusion strategies.

The future plan aims to improve the robustness of the proposed multi-stream framework. Measuring the performance of an ASR system without ground truth could be beneficial in multi-stream scenarios, making it possible to emphasize more informative streams over corrupted ones. In this proposal, four different Performance Monitoring (PM) techniques are investigated. The preliminary results suggest that PM measures on attention distributions and decoder posteriors are well correlated with true performance. Integrating PM measures and more sophisticated fusion mechanisms into the multi-stream framework will be the focus of future exploration.
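
The abstract does not name the four PM techniques. As an illustration only, the sketch below uses average entropy as one possible label-free measure over a sequence of probability distributions, such as per-step decoder posteriors or attention weights. This is an assumption made for the example, not the proposal's method, and `mean_entropy` is a hypothetical helper.

```python
import math


def mean_entropy(distributions):
    """Average Shannon entropy (in nats) over a list of distributions.

    Assumption for this sketch: lower entropy means the model is more
    confident, which may indicate a cleaner, more informative stream.
    """
    total = 0.0
    for dist in distributions:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(distributions)


# Toy posteriors: a confident stream versus a maximally uncertain one.
sharp = [[0.97, 0.01, 0.01, 0.01]] * 3
flat = [[0.25, 0.25, 0.25, 0.25]] * 3
print(mean_entropy(sharp), mean_entropy(flat))
```

A measure of this kind needs no reference transcript, so in principle it could be computed per stream at test time and used to favor the stream with the lower value.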
