This presentation will take place remotely. Click this link as early as 15 minutes before the scheduled start time to join the Zoom meeting.
Title: Context-aware Language Modeling and Adaptation for Automatic Speech Recognition
Abstract: Language models (LMs) are an important component of automatic speech recognition (ASR) systems and are usually trained on transcriptions. Language use is strongly influenced by factors such as domain, topic, style, and user preference. However, the transcriptions available from speech corpora are usually too limited to fully capture the contextual variability of test domains, and some of this information is only available at test time. A change of application domain often induces a mismatch in the lexicon and in the distribution of words, and even within the same domain, topics can shift and user preferences can vary. These observations indicate that LMs trained purely on transcriptions that may not be representative of test domains are far from ideal and may severely degrade ASR performance. To mitigate these mismatches, it is desirable to adapt LMs to contextual variables.
The goal of this work is to explore general and lightweight approaches to neural LM adaptation and context-aware modeling for ASR. In the adaptation direction, two approaches are investigated. The first is based on cache models. Although neural LMs outperform n-gram LMs at modeling longer context, previous studies show that some of them, LSTMs for example, still capture only a relatively short span of context. Cache models, which capture relatively long-term self-trigger information, have proved useful for n-gram LM adaptation. This work extends the fast marginal adaptation framework to neural LMs and adapts LSTM LMs in an unsupervised way: pre-trained LMs are adapted toward cache models estimated from decoded hypotheses. This method is lightweight, as it does not require retraining. The second approach is interpolation-based. Linear interpolation is a simple and robust adaptation approach, but it is suboptimal because its weights are optimized globally and are not aware of local context. To tackle this issue, a mixer model that combines pre-trained neural LMs with dynamic weighting is proposed. Experimental results show that it outperforms fine-tuning and linear interpolation in most scenarios. For context-aware modeling, this work proposes a simple and effective way to implicitly integrate cache models into neural LMs, providing a simple alternative to the pointer sentinel mixture model. Experiments show that the proposed method is more effective on relatively rare words and outperforms several baselines. Future work will focus on analyzing the importance and effect of various contextual factors on ASR and on developing approaches for representing and modeling these factors to improve ASR performance.
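As a rough illustration of the cache-based direction, the sketch below rescales a pre-trained LM's next-word distribution toward a unigram cache built from first-pass decoded hypotheses, in the spirit of fast marginal adaptation. The function name, the smoothing scheme, and the scaling exponent beta are illustrative assumptions, not the exact formulation used in this work.

    import numpy as np

    def fast_marginal_adaptation(p_base, cache_counts, p_background,
                                 beta=0.5, alpha=0.1, eps=1e-10):
        """Rescale a pre-trained LM's next-word distribution toward a unigram
        cache estimated from decoded hypotheses (illustrative sketch only).

        p_base       : (V,) next-word probabilities from the pre-trained neural LM
        cache_counts : (V,) word counts accumulated from first-pass ASR hypotheses
        p_background : (V,) background unigram distribution from the training data
        beta         : exponent controlling the strength of adaptation
        alpha        : smoothing weight for the cache unigram estimate
        """
        # Smoothed unigram cache distribution from the decoded hypotheses.
        p_cache = (1 - alpha) * cache_counts / max(cache_counts.sum(), 1) + alpha * p_background
        # Boost words that the cache makes more likely than the background unigram.
        scaled = p_base * (p_cache / (p_background + eps)) ** beta
        # Renormalize so the adapted scores form a proper distribution.
        return scaled / scaled.sum()

Because only the output distribution is rescaled, the pre-trained LM's parameters are left untouched, which is what makes this kind of adaptation lightweight: no retraining is required.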
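The interpolation-based direction can be illustrated with a small mixer that predicts context-dependent interpolation weights over several frozen, pre-trained LMs instead of using a single global weight per component. The class and parameter names below are hypothetical and the weight network is deliberately minimal; this is a sketch of the idea rather than the model evaluated in this work.

    import torch
    import torch.nn as nn

    class DynamicMixer(nn.Module):
        """Mix K frozen, pre-trained LMs with context-dependent weights (sketch)."""

        def __init__(self, context_dim, num_lms, hidden_dim=128):
            super().__init__()
            self.weight_net = nn.Sequential(
                nn.Linear(context_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, num_lms),
            )

        def forward(self, context, component_probs):
            # context         : (B, context_dim) summary of the local history
            # component_probs : (B, K, V) next-word distributions from the K frozen LMs
            # Per-context interpolation weights replace a single global weight.
            lam = torch.softmax(self.weight_net(context), dim=-1)    # (B, K)
            # Weighted sum over components gives the mixed next-word distribution.
            return torch.einsum('bk,bkv->bv', lam, component_probs)  # (B, V)

In such a setup, only the weight network would need to be trained while the component LMs stay fixed, keeping adaptation cheap relative to fine-tuning each LM.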