This presentation was given remotely; follow this link to view the recording. Please note that the talk does not begin until 30 minutes into the video.
Title: Learning Spoken Language Through Vision
Abstract: Humans learn spoken language and visual perception at an early age by being immersed in the world around them. Why can’t computers do the same? In this talk, I will describe our work to develop methodologies for grounding continuous speech signals at the raw waveform level to natural image scenes. I will first present self-supervised models capable of jointly discovering spoken words and the visual objects to which they refer, all without conventional annotations in either modality. Next, I will show how the representations learned by these models implicitly capture meaningful linguistic structure directly from the speech signal. Finally, I will demonstrate that these models can be applied across multiple languages, and that the visual domain can function as an “interlingua,” enabling the discovery of word-level semantic translations at the waveform level.
Bio: David Harwath is a research scientist in the Spoken Language Systems group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). His research focuses on multi-modal learning algorithms for speech, audio, vision, and text. His work has been published at venues such as NeurIPS, ACL, ICASSP, ECCV, and CVPR. His doctoral thesis, supervised by James Glass, introduced models for the joint perception of speech and vision; it was awarded the 2018 George M. Sprowls Award for the best Ph.D. thesis in computer science at MIT.
He holds a Ph.D. in computer science from MIT (2018), an S.M. in computer science from MIT (2013), and a B.S. in electrical engineering from UIUC (2010).