Title: A Theory and Practice of the Lifelong Learnable Forest
Abstract: Since Vapnik’s and Valiant’s seminal papers on learnability, various lines of research have generalized their concepts of learning and learners. In this paper, we formally define what it means to be a lifelong learner. Given this definition, we propose the first lifelong learning algorithm with theoretical guarantees that it can perform forward transfer and reverse transfer, while not experiencing catastrophic forgetting. Our algorithm, dubbed Lifelong Learning Forests, outperforms the current state-of-the-art deep lifelong learning algorithm on the CIFAR 10-by-10 challenge problem, despite its simplicity and mathematical tractability. Our approach immediately lends itself to further algorithmic developments that promise to exceed the performance limits of existing approaches.
Title: A Practical and Efficient Multi-Stream Framework for End-to-End Speech Recognition
Abstract: The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. In these cases, an appropriate strategy to fuse streams or select the most informative source is necessary. In recent years, with the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this proposal, a multi-stream framework is presented based on a joint CTC/Attention E2E model, where parallel streams are represented by separate encoders aiming to capture diverse information. On top of the regular attention networks, a secondary stream-fusion network is introduced to steer the decoder toward the most informative encoders.
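The stream-fusion idea can be illustrated with a minimal sketch. It assumes each encoder has already produced one context vector through its own attention network, and it scores streams with a plain dot product against the decoder state. The abstract does not specify the scoring function, so the dot product and the names `fuse_streams`, `contexts`, and `query` are illustrative assumptions, not the proposed model.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_streams(contexts, query):
    """Weight per-stream context vectors by their relevance to the decoder state.

    contexts: one context vector per stream (all the same length)
    query:    decoder state vector (same length); dot-product scoring is an assumption
    """
    scores = [sum(c * q for c, q in zip(ctx, query)) for ctx in contexts]
    weights = softmax(scores)
    dim = len(contexts[0])
    fused = [sum(w * ctx[d] for w, ctx in zip(weights, contexts)) for d in range(dim)]
    return weights, fused

# Two hypothetical streams; the decoder state aligns with the first one.
contexts = [[1.0, 0.0], [0.0, 1.0]]
query = [2.0, 0.0]
weights, fused = fuse_streams(contexts, query)
```

Here the first stream receives the larger weight because its context vector aligns with the decoder state, which is the sense in which a fusion layer can steer the decoder toward the more informative encoder.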
Two representative frameworks have been proposed: Multi-Encoder Multi-Resolution (MEM-Res) and Multi-Encoder Multi-Array (MEM-Array). Moreover, because an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training scheme is further proposed in this work. Experiments are conducted on various corpora including Wall Street Journal (WSJ), CHiME-4, DIRHA and AMI. Compared with the best single-stream performance, the proposed framework achieves substantial improvement and also outperforms various conventional fusion strategies.
The future plan aims to improve the robustness of the proposed multi-stream framework. Measuring the performance of an ASR system without ground truth could be beneficial in multi-stream scenarios, allowing more informative streams to be emphasized over corrupted ones. In this proposal, four different Performance Monitoring (PM) techniques are investigated. The preliminary results suggest that PM measures on attention distributions and decoder posteriors are well correlated with true performance. Integrating PM measures and more sophisticated fusion mechanisms into the multi-stream framework will be the focus of future exploration.
Title: “Honey I shrank the microscope!” And Other Adventures in Functional Imaging
Abstract: Imaging the brain in action, in awake freely behaving animals without the confounding effect of anesthetics poses unique design and experimental challenges. Moreover, imaging the evolution of disease models in the preclinical setting over their entire lifetime is also difficult with conventional imaging techniques. This lecture will describe the development and applications of a miniaturized microscope that circumvents these hurdles. This lecture will also describe how image acquisition, data visualization and engineering tools can be leveraged to answer fundamental questions in cancer, neuroscience and tissue engineering applications.
Bio: Dr. Pathak is an ideator, educator and mentor focused on transforming lives through the power of imaging. He received the BS in Electronics Engineering from the University of Poona, India. He received his PhD from the joint program in Functional Imaging between the Medical College of Wisconsin and Marquette University. During his PhD he was a Whitaker Foundation Fellow. He completed his postdoctoral fellowship at the Johns Hopkins University School of Medicine in Molecular Imaging. He is currently Associate Professor of Radiology, Oncology and Biomedical Engineering at Johns Hopkins University (JHU). His research is focused on developing new imaging methods, computational models and visualization tools to ‘make visible’ critical aspects of cancer, neurobiology and tissue engineering. His work has been recognized by multiple journal covers and awards including the Bill Negendank Award from the International Society for Magnetic Resonance in Medicine (ISMRM) given to “outstanding young investigators in cancer MRI” and the Career Catalyst Award from the Susan Komen Breast Cancer Foundation. He serves on review panels for national and international funding agencies, and the editorial boards of imaging journals. He is dedicated to mentoring the next generation of imagers and innovators. He has mentored over sixty students, was the recipient of the ISMRM’s Outstanding Teacher Award in 2014, a 125 Hopkins Hero in 2018 for outstanding dedication to the core values of JHU, and a Career Champion Nominee in 2018 for student career guidance and support.
Title: Deep Learning-based Novelty Detection
Abstract: In recent years, intelligent systems powered by artificial intelligence and computer vision that perform visual recognition have gained much attention. These systems observe instances and labels of known object classes during training and learn association patterns that can be used during inference. A practical visual recognition system should first determine whether an observed instance is from a known class. If it is from a known class, then the identity of the instance is queried through classification. The former process is commonly known as novelty detection (or novel class detection) in the literature. Given a set of image instances from known classes, the goal of novelty detection is to determine whether an observed image during inference belongs to one of the known classes.
We consider one-class novelty detection, where all training data are assumed to belong to a single class without any finer annotations available. We identify limitations of conventional approaches in one-class novelty detection and present a Generative Adversarial Network (GAN)-based solution. Our solution is based on learning latent representations of in-class examples using a denoising auto-encoder network. The key contribution of our work is our proposal to explicitly constrain the latent space to exclusively represent the given class. In order to accomplish this goal, firstly, we force the latent space to have bounded support by introducing a tanh activation in the encoder’s output layer. Secondly, using a discriminator in the latent space that is trained adversarially, we ensure that encoded representations of in-class examples resemble uniform random samples drawn from the same bounded space. Thirdly, using a second adversarial discriminator in the input space, we ensure all randomly drawn latent samples generate examples that look real.
Finally, we introduce a gradient-descent based sampling technique that explores points in the latent space that generate potential out-of-class examples, which are fed back to the network to further train it to generate in-class examples from those points. The effectiveness of the proposed method is measured across four publicly available datasets using two one-class novelty detection protocols where we achieve state-of-the-art results.
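The bounded-support constraint described above can be made concrete with a small sketch. It shows only why a tanh output layer and a uniform prior share the same support; the function names and the four-dimensional latent size are hypothetical, and no encoder, decoder, or discriminator is modeled here.

```python
import math
import random

def bound_latent(pre_activations):
    # tanh squashes every coordinate into [-1, 1], so the latent code has bounded support.
    return [math.tanh(v) for v in pre_activations]

def sample_uniform_latent(dim, rng):
    # Reference samples drawn uniformly from the same cube; a latent discriminator
    # would be trained to tell these apart from encoded in-class examples.
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

rng = random.Random(0)
z_encoded = bound_latent([-5.0, 0.0, 0.3, 12.0])  # hypothetical encoder pre-activations
z_prior = sample_uniform_latent(4, rng)
```

Because both vectors live in the same cube, an adversarial loss in the latent space can push the encoded distribution toward the uniform one without any mismatch in support.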
Title: Towards a better understanding of spoken conversations: Assessment of sentiment and emotion
Abstract: In this talk, we present our work on understanding the emotional aspects of spoken conversations. Emotions play a vital role in our daily life as they help us convey information impossible to express verbally to other parties.
While humans can easily perceive emotions, these are notoriously difficult to define and recognize by machines. However, automatically detecting the emotion of a spoken conversation can be useful for a diverse range of applications such as human-machine interaction and conversation analysis. In this work, we considered emotion recognition in two particular scenarios. The first scenario is predicting customer sentiment/satisfaction (CSAT) in a call center conversation, and the second consists of emotion prediction in short utterances.
CSAT is defined as the overall sentiment (positive vs. negative) of the customer about his/her interaction with the agent. In this work, we perform a comprehensive search for adequate acoustic and lexical representations.
For the acoustic representation, we propose to use the x-vector model, which is known for its state-of-the-art performance in the speaker recognition task. The motivation for using x-vectors for CSAT is our observation that emotion information encoded in x-vectors affected speaker recognition performance. For the lexical representation, we introduce a novel method, CSAT Tracker, which computes the overall prediction based on individual segment outcomes. Both methods rely on transfer learning to obtain the best performance. For classification, we used convolutional neural networks that combine the acoustic and lexical features. We evaluated our systems on US English telephone speech from call center data. We found that lexical models perform better than acoustic models and that fusing them provided significant gains. The error analysis shows that calls in which customers accomplished their goal but were still dissatisfied are the most difficult to predict correctly. We also found that the customer’s speech is more emotional than the agent’s speech.
For the second scenario of predicting emotion, we present a novel approach based on x-vectors. We show that adapting the x-vector model for emotion recognition provides the best published results on three public datasets.
Title: Single-Channel Speech Separation in Noisy and Reverberant Conditions
Abstract: An inevitable property of multi-party conversations is that more than one speaker will end up speaking simultaneously for portions of time. Many speech technologies, such as automatic speech recognition and speaker identification, are not designed to function on overlapping speech and suffer severe performance degradation under such conditions. Speech separation techniques aim to solve this problem by producing a separate waveform for each speaker in an audio recording with multiple talkers speaking simultaneously. The advent of deep neural networks has resulted in strong performance gains on the speech separation task. However, training and evaluation have been almost exclusively restricted to a single dataset of clean, near-field read speech, which is not representative of many multi-person conversational settings that are frequently recorded on room microphones, introducing noise and reverberation. Given the degradation of other speech technologies in these conditions, speech separation systems are expected to suffer a decrease in performance as well.
The primary goal of this proposal is to develop novel techniques to improve speech separation in noisy and reverberant recording conditions. One core component of this work is the creation of additional synthetic overlap corpora spanning a range of more realistic and challenging conditions. The lack of appropriate data necessitates a first step of creating appropriate conditions with which to benchmark the performance of state-of-the-art methods in these more challenging conditions. Another proposed line of investigation is the integration of speech separation techniques with speech enhancement, the task of enhancing a speech signal through the removal of noise or reverberation. This is a natural combination due to similarities in problem formulation and general approach. Finally, we propose an investigation into the effectiveness of speech separation as a pre-processing step to speech technologies, such as automatic speech recognition, that struggle with overlapping speech, as well as tighter integration of speech separation with these “downstream” systems.
This presentation happened remotely. Follow this link to view it. Please note that the presentation doesn’t start until 30 minutes into the video.
Title: Learning Spoken Language Through Vision
Abstract: Humans learn spoken language and visual perception at an early age by being immersed in the world around them. Why can’t computers do the same? In this talk, I will describe our work to develop methodologies for grounding continuous speech signals at the raw waveform level to natural image scenes. I will first present self-supervised models capable of jointly discovering spoken words and the visual objects to which they refer, all without conventional annotations in either modality. Next, I will show how the representations learned by these models implicitly capture meaningful linguistic structure directly from the speech signal. Finally, I will demonstrate that these models can be applied across multiple languages, and that the visual domain can function as an “interlingua,” enabling the discovery of word-level semantic translations at the waveform level.
Bio: David Harwath is a research scientist in the Spoken Language Systems group at the MIT Computer Science and Artificial Intelligence Lab (CSAIL). His research focuses on multi-modal learning algorithms for speech, audio, vision, and text. His work has been published at venues such as NeurIPS, ACL, ICASSP, ECCV, and CVPR. Under the supervision of James Glass, his doctoral thesis introduced models for the joint perception of speech and vision. This work was awarded the 2018 George M. Sprowls Award for the best Ph.D. thesis in computer science at MIT.
He holds a Ph.D. in computer science from MIT (2018), a S.M. in computer science from MIT (2013), and a B.S. in electrical engineering from UIUC (2010).
This presentation is happening remotely. Click this link as early as 15 minutes before the scheduled start time of the presentation to watch in a Zoom meeting.
Title: Interpretable End-to-End Neural Network for Audio and Speech Processing
Abstract: This talk introduces extensions of the basic end-to-end automatic speech recognition (ASR) architecture by focusing on its integration function to tackle major problems faced by current ASR technologies in adverse environments, including the cocktail party and data sparseness problems. The first topic is to integrate microphone-array signal processing, speech separation, and speech recognition in a single neural network to realize multichannel multi-speaker ASR for the cocktail party problem. Our architecture is carefully designed to maintain the role of each module as a differentiable subnetwork so that we can jointly optimize the whole network but still keep the interpretability of each subnetwork, including the speech separation, speech enhancement, and acoustic beamforming abilities in addition to ASR. The second topic is based on semi-supervised training using cycle-consistency, which enables us to leverage unpaired speech and/or text data by integrating ASR with text-to-speech (TTS) within the end-to-end framework. This scheme can be regarded as an interpretable disentanglement of audio signals with explicit decomposition of linguistic characteristics by ASR and speaker and speaking style characteristics by speaker embedding. These explicitly decomposed characteristics are converted back to the original audio signals by neural TTS; thus we form an acoustic feedback loop based on speech recognition and synthesis, like human hearing, and both components can be jointly optimized with only the audio data.
This presentation will be happening remotely over Zoom. Click this link as early as 15 minutes before the scheduled start time of the presentation to watch in a Zoom meeting.
Title: Compressive Sensing for Wireless Systems with Massive Antenna Arrays
Abstract: Over the past two decades the world has enjoyed exponential growth in wireless connectivity that has fundamentally changed the way people communicate and has opened the door to limitless new applications. With the advent of 5G, users will now begin to enjoy enhanced mobile broadband links supporting peak rates of over 10 gigabits per second. The 5G capability will also support massive machine-type communications and latencies of less than one millisecond to support ultra-reliable low-latency communication. Continuing to achieve greater increases in system capacity requires the continual advancement of new technology to make efficient use of finite spectrum resources.
Researchers have studied Multiple-Input-Multiple-Output (MIMO) communications over the last several decades as a way to increase system capacity. The MIMO channel is composed of multiple transmit (input) antennas and multiple receive (output) antennas. The channel is represented as the impulse response between each transmit and receive antenna pair. In the simplest of channels, the pairwise impulse response reduces to a single coefficient. Many theoretical MIMO results rely on Rayleigh channels featuring independently distributed complex Gaussian variables as channel coefficients.
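As a rough illustration of the single-coefficient Rayleigh model, the sketch below draws an i.i.d. complex Gaussian channel matrix and applies it to a transmit vector. Receiver noise is omitted for brevity, and the unit-variance normalization, the antenna counts, and the function names are assumptions made for this example only.

```python
import random

def rayleigh_channel(n_rx, n_tx, rng):
    # Each coefficient is complex Gaussian with unit average power:
    # real and imaginary parts are independent N(0, 1/2).
    sigma = 0.5 ** 0.5
    return [[complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
             for _ in range(n_tx)]
            for _ in range(n_rx)]

def apply_channel(H, x):
    # Noiseless received vector y = H x (one coefficient per antenna pair).
    return [sum(h * xj for h, xj in zip(row, x)) for row in H]

rng = random.Random(1)
H = rayleigh_channel(4, 2, rng)   # 4 receive antennas, 2 transmit antennas
x = [1 + 0j, -1 + 0j]             # hypothetical transmitted symbols
y = apply_channel(H, x)
```

Each entry of `H` is the single coefficient linking one transmit antenna to one receive antenna, so the whole channel is a 4-by-2 complex matrix.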
The concept of Massive MIMO emerged a decade ago and is a leading technology in 5G wireless. Massive MIMO features base stations that have massive antenna arrays that simultaneously service many users. The Massive MIMO array has many more antennas than users. Unlike traditional phased array antennas, Massive MIMO arrays have all (or a large portion of) their antennas connected to receive chains for baseband processing. Successfully decoding each user’s data stream requires estimates of the propagation channel. Channel estimation is usually aided through the use of pilot signals that are known to both the user terminal and the base station. Simultaneously estimating the channel matrix between each user and each antenna in a massive MIMO array creates challenges for pilot sequence design. More channel resources reserved for pilot sequences for channel estimation result in fewer resources for user data.
Several efforts have shown that the mm wave massive MIMO channel exhibits several sparse features. The number of distinct and resolvable paths between a user and a massive MIMO array is generally much less than the number of base station antennas. Early theoretical MIMO work relied on Rayleigh channels as they are useful for closed form solutions. In reality, the Massive MIMO mm wave channel is low rank as it can be modeled by a smaller number of resolvable multipath components. This opens opportunities for new channel estimation techniques using compressive sensing and sparse recovery.
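The sparsity claim can be checked numerically with a toy example. The sketch assumes a uniform linear array and three paths whose spatial frequencies fall exactly on the DFT grid; the array size, path gains, and grid indices are invented for illustration. Off-grid angles would leak energy into neighboring bins, and this is not a compressive sensing estimator, only a demonstration of the structure such an estimator would exploit.

```python
import cmath
import math

def steering_vector(n_ant, spatial_freq):
    # Array response of a uniform linear array for one path direction.
    return [cmath.exp(2j * math.pi * n * spatial_freq) for n in range(n_ant)]

def channel_from_paths(n_ant, paths):
    # Sum of a few weighted steering vectors: (complex gain, grid index) per path.
    h = [0j] * n_ant
    for gain, k in paths:
        a = steering_vector(n_ant, k / n_ant)
        h = [hn + gain * an for hn, an in zip(h, a)]
    return h

def dft(v):
    # Normalized DFT, i.e. the antenna-domain channel viewed in beamspace.
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]

n_ant = 64
paths = [(1.0 + 0.5j, 3), (0.4 - 0.2j, 17), (0.1j, 40)]  # hypothetical paths
h = channel_from_paths(n_ant, paths)
beamspace = dft(h)
support = [k for k, c in enumerate(beamspace) if abs(c) > 1e-9]
```

All 64 antenna-domain coefficients are nonzero, yet only three beamspace coefficients are, one per path. A sparse recovery method could in principle estimate those few coefficients from far fewer pilot measurements than antennas.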
Although Massive MIMO will be featured in future 5G services, there is still much untapped potential. Through developing better channel estimation schemes, additional system throughput can be achieved. This work will consider:
This was a virtual seminar that can be viewed by clicking here.
Title: Unifying Human Processes and Machine Models for Spoken Language Interfaces
Abstract: Recent years have witnessed tremendous progress in digital speech interfaces for information access (e.g., Amazon’s Alexa, Google Home, etc.). The commercial success of these applications is hailed as one of the major achievements of the “AI” era. Indeed, these accomplishments are made possible only by sophisticated deep learning models trained on enormous amounts of supervised data over extensive computing infrastructure. Yet these systems are not robust to variations (such as accent and out-of-vocabulary words), remain uninterpretable, and fail in unexpected ways. Most important of all, these systems cannot be easily extended to users with speech and language disabilities, who would potentially benefit the most from the availability of such technologies. I am a speech scientist interested in computational modelling of the human speech communication system towards building intelligent spoken language systems. I will present my research where I’ve tapped into human speech communication processes to build robust spoken language systems, drawing specifically on theories of phonology and on physiological data, including cortical signals in humans as they produce fluent speech. The insights from these studies reveal elegant organizational principles and computational mechanisms employed by the human brain for fluent speech production, the most complex of motor behaviors. These findings hold the key to the next revolution in human-inspired, human-compatible spoken language technologies that, besides alleviating the problems faced by current systems, can meaningfully impact the lives of millions of people with speech disability.
Bio: Gopala Anumanchipalli, PhD, is a researcher at the Department of Neurological Surgery and the Weill Institute for Neurosciences at the University of California, San Francisco. His interests are in (i) understanding neural mechanisms of human speech production towards developing next-generation Brain-Computer Interfaces, and (ii) computational modelling of human speech communication mechanisms towards building robust speech technologies. Earlier, Gopala was a postdoctoral fellow at UCSF working with Edward F Chang, MD, and he previously received a PhD in Language and Information Technologies from Carnegie Mellon University, working with Prof. Alan Black on speech synthesis.