Title: A Theory and Practice of the Lifelong Learnable Forest
Abstract: Since Vapnik’s and Valiant’s seminal papers on learnability, various lines of research have generalized their concepts of learning and learners. In this paper, we formally define what it means to be a lifelong learner. Given this definition, we propose the first lifelong learning algorithm with theoretical guarantees that it can perform forward transfer and reverse transfer, while not experiencing catastrophic forgetting. Our algorithm, dubbed Lifelong Learning Forests, outperforms the current state-of-the-art deep lifelong learning algorithm on the CIFAR 10-by-10 challenge problem, despite its simplicity and mathematical tractability. Our approach immediately lends itself to further algorithmic developments that promise to exceed current performance limits of existing approaches.
Title: A Practical and Efficient Multi-Stream Framework for End-to-End Speech Recognition
Abstract: The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. In these cases, an appropriate strategy to fuse streams or select the most informative source is necessary. In recent years, with the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this proposal, a multi-stream framework is presented based on a joint CTC/Attention E2E model, where parallel streams are represented by separate encoders aiming to capture diverse information. On top of the regular attention networks, a secondary stream-fusion network is introduced to steer the decoder toward the most informative encoders.
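As a rough illustration of the stream-fusion idea above, the sketch below combines one context vector per encoder using softmax weights computed from per-stream scores. It is a minimal sketch under stated assumptions, not the proposal's actual network: the function names `softmax` and `fuse_streams`, the scalar per-stream scores, and the toy vectors are all hypothetical, and in a real system the scores would come from a learned attention module conditioned on the decoder state.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_streams(contexts, stream_scores):
    # contexts: one context vector per encoder (all of the same length).
    # stream_scores: one relevance score per encoder (assumed given here).
    weights = softmax(stream_scores)
    dim = len(contexts[0])
    fused = [sum(w * c[d] for w, c in zip(weights, contexts)) for d in range(dim)]
    return fused, weights

# Toy example: two streams, the first one scored as more informative.
contexts = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse_streams(contexts, [2.0, 0.0])
print(weights, fused)
```

Because the weights sum to one, the fused vector is a convex combination of the per-stream contexts, so a stream with a low score contributes little to what the decoder sees.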
Two representative frameworks have been proposed: Multi-Encoder Multi-Resolution (MEM-Res) and Multi-Encoder Multi-Array (MEM-Array). Moreover, because an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training scheme is further proposed in this work. Experiments are conducted on various corpora including Wall Street Journal (WSJ), CHiME-4, DIRHA and AMI. Compared with the best single-stream performance, the proposed framework achieves substantial improvement and also outperforms various conventional fusion strategies.
The future plan aims to improve the robustness of the proposed multi-stream framework. Measuring the performance of an ASR system without ground truth could be beneficial in multi-stream scenarios, making it possible to emphasize more informative streams over corrupted ones. In this proposal, four different Performance Monitoring (PM) techniques are investigated. The preliminary results suggest that PM measures on attention distributions and decoder posteriors are well correlated with true performance. Integration of PM measures and more sophisticated fusion mechanisms into the multi-stream framework will be the focus of future exploration.
Title: Automated Spore Analysis Using Bright-Field Imaging and Raman Microscopy
Abstract: In 2015, it was determined that the United States Department of Defense had been shipping samples of B. anthracis spores which had undergone gamma irradiation but were not fully inactivated. In the aftermath of this event, alternative and orthogonal methods were investigated to analyze spores and determine their viability. In this thesis we demonstrate a novel analysis technique that combines bright-field microscopy images with Raman chemical microscopy.
We first developed an image segmentation routine based on the watershed method to locate individual spores within bright-field images. This routine was able to effectively demarcate 97.4% of the Bacillus spores within the bright-field images with minimal over-segmentation. Size and shape measurements, including major and minor axis lengths and area, were then extracted for 4048 viable spores and showed very good agreement with previously published values. When similar measurements were taken on 3627 gamma-irradiated spores, a statistically significant difference was noted for the minor axis length, ratio of major to minor axis, and total area when compared to the non-irradiated spores. Classification results show the ability to correctly classify 67% of viable spores with an 18% misclassification rate using the bright-field image by thresholding the minimum classification length.
Raman chemical imaging microscopy (RCIM) was then used to measure populations of viable, gamma-irradiated, and autoclaved spores of B. anthracis Sterne, B. atrophaeus, B. megaterium, and B. thuringiensis kurstaki. Significant spectral differences were observed between viable and inactivated spores due to the disappearance of features associated with calcium dipicolinate after irradiation. Principal component analysis showed the ability to distinguish viable spores of B. anthracis Sterne and B. atrophaeus from each other and from the other two Bacillus species.
Finally, Raman microscopy was used to classify mixtures of viable and gamma-inactivated spores. A technique was developed that fuses the size and shape characteristics obtained from the bright-field image to preferentially target viable spores. A practical demonstration of the technique was performed on a field of view containing approximately 7,000 total spores, of which only 12 were viable, to simulate a sample that was not fully irradiated. Ten of these spores were properly classified while interrogating just 25% of the total spores.
Title: Robust Adaptive Strategies for Myographic Prosthesis Movement Decoding
Abstract: Improving the condition-tolerance, stability, response time, and dexterity of neural prosthesis control strategies is a major clinical goal to aid amputees in achieving natural restorative upper-limb function. Currently, the dominant noninvasive neural source for prosthesis motor control is the skin-surface recorded electromyographic (EMG) signal. Decoding movement intentions from EMG is a challenging problem because this signal type is subject to a high degree of interference from noise and conditional influences. As a consequence, much of the movement intention information contained within the EMG signal has remained significantly under-utilized for the purposes of controlling robotic prostheses. We sought to overcome this information deficit through the use of adaptive strategies for machine learning, sparse representations, and signal processing to significantly improve myographic prosthesis control. This body of research represents the current state-of-the-art in condition-tolerant EMG movement classification (Chapter 3), stable and responsive EMG sequence decoding during movement transitions (Chapter 4), and positional regression to reliably control 7 wrist and finger degrees-of-freedom (Chapter 5). To our knowledge, the methods we describe in Chapter 5 elicit the most dexterous, biomimetic, and natural prosthesis control performance ever obtained from the surface EMG signal.
Title: “Honey I shrank the microscope!” And Other Adventures in Functional Imaging
Abstract: Imaging the brain in action in awake, freely behaving animals, without the confounding effect of anesthetics, poses unique design and experimental challenges. Moreover, imaging the evolution of disease models in the preclinical setting over their entire lifetime is also difficult with conventional imaging techniques. This lecture will describe the development and applications of a miniaturized microscope that circumvents these hurdles. This lecture will also describe how image acquisition, data visualization and engineering tools can be leveraged to answer fundamental questions in cancer, neuroscience and tissue engineering applications.
Bio: Dr. Pathak is an ideator, educator and mentor focused on transforming lives through the power of imaging. He received the BS in Electronics Engineering from the University of Poona, India. He received his PhD from the joint program in Functional Imaging between the Medical College of Wisconsin and Marquette University. During his PhD he was a Whitaker Foundation Fellow. He completed his postdoctoral fellowship at the Johns Hopkins University School of Medicine in Molecular Imaging. He is currently Associate Professor of Radiology, Oncology and Biomedical Engineering at Johns Hopkins University (JHU). His research is focused on developing new imaging methods, computational models and visualization tools to ‘make visible’ critical aspects of cancer, neurobiology and tissue engineering. His work has been recognized by multiple journal covers and awards including the Bill Negendank Award from the International Society for Magnetic Resonance in Medicine (ISMRM) given to “outstanding young investigators in cancer MRI” and the Career Catalyst Award from the Susan Komen Breast Cancer Foundation. He serves on review panels for national and international funding agencies, and the editorial boards of imaging journals. He is dedicated to mentoring the next generation of imagers and innovators. He has mentored over sixty students, was the recipient of the ISMRM’s Outstanding Teacher Award in 2014, a 125 Hopkins Hero in 2018 for outstanding dedication to the core values of JHU, and a Career Champion Nominee in 2018 for student career guidance and support.
Title: Loss Landscapes of Neural Networks and their Generalization: Theory and Applications
Abstract: In the last decade or so, deep learning has revolutionized entire domains of machine learning. Neural networks have helped achieve significant improvements in computer vision, machine translation, speech recognition, etc. These powerful empirical demonstrations leave a wide gap between our current theoretical understanding of neural networks and their practical performance. The theoretical questions in deep learning can be put under three broad but inter-related themes: 1) Architecture/Representation, 2) Optimization, and 3) Generalization. In this dissertation, we study the landscapes of different deep learning problems to answer questions in the above themes.
First, in order to understand what representations can be learned by neural networks, we study simple Autoencoder networks with one hidden layer of rectified linear units. We connect autoencoders to the well-known problem in signal processing of Sparse Coding. We show that the squared reconstruction error loss function has a critical point at the ground truth dictionary under an appropriate generative model.
Next, we turn our attention to a problem at the intersection of optimization and generalization. Training deep networks through empirical risk minimization is a non-convex problem with many local minima in the loss landscape. A number of empirical studies have observed that “flat minima” for neural networks tend to generalize better than sharper minima. However, quantifying the flatness or sharpness of minima has been an issue due to possible rescaling in neural networks with positively homogeneous activations. We use ideas from Riemannian geometry to define a new measure of flatness that is invariant to rescaling. We test the hypothesis that flatter minima generalize better through a number of different experiments on deep networks.
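The rescaling issue mentioned above can be seen in a toy example. The sketch below uses a hypothetical one-unit ReLU network, f(x) = w2 * relu(w1 * x), and shows that scaling w1 by a factor alpha while dividing w2 by alpha leaves the loss unchanged but changes a naive curvature-based sharpness value. It only illustrates the problem; it does not implement the rescaling-invariant Riemannian measure proposed in the dissertation, and the data, weights, and function names are assumptions made for illustration.

```python
def relu(z):
    return max(0.0, z)

def network(x, w1, w2):
    # One hidden ReLU unit followed by a linear output weight.
    return w2 * relu(w1 * x)

def loss(w1, w2, data):
    # Mean squared error over (input, target) pairs.
    return sum((network(x, w1, w2) - y) ** 2 for x, y in data) / len(data)

def curvature_w1(w1, w2, data, h=1e-3):
    # Central finite-difference estimate of the second derivative of the loss in w1.
    return (loss(w1 + h, w2, data) - 2.0 * loss(w1, w2, data)
            + loss(w1 - h, w2, data)) / (h * h)

data = [(1.0, 2.0), (2.0, 4.0)]   # toy data generated by y = 2x
alpha = 10.0                      # rescaling factor (any positive value)
w1, w2 = 2.0, 1.0                 # a minimizer of the toy loss
w1_scaled, w2_scaled = alpha * w1, w2 / alpha  # same function, different weights

loss_a = loss(w1, w2, data)
loss_b = loss(w1_scaled, w2_scaled, data)
curv_a = curvature_w1(w1, w2, data)
curv_b = curvature_w1(w1_scaled, w2_scaled, data)
print("loss:", loss_a, loss_b)
print("curvature in w1:", curv_a, curv_b)
```

Both parameter settings compute the same function, yet the curvature along w1 differs by a factor of alpha squared, which is why a sharpness measure based directly on such curvature is not well defined without accounting for rescaling.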
Finally, we apply deep networks to computer vision problems with compressed measurements of natural images and videos. We conduct experiments to characterize the situations in which these networks fail, and those in which they succeed. We train deep networks to perform object detection and classification directly on these compressive measurements of images, without trying to reconstruct the scene first. These experiments are conducted on public datasets as well as datasets specific to a sponsor of our research.
Title: Deep Learning-based Novelty Detection
Abstract: In recent years, intelligent systems powered by artificial intelligence and computer vision that perform visual recognition have gained much attention. These systems observe instances and labels of known object classes during training and learn association patterns that can be used during inference. A practical visual recognition system should first determine whether an observed instance is from a known class. If it is from a known class, then the identity of the instance is queried through classification. The former process is commonly known as novelty detection (or novel class detection) in the literature. Given a set of image instances from known classes, the goal of novelty detection is to determine whether an observed image during inference belongs to one of the known classes.
We consider one-class novelty detection, where all training data are assumed to belong to a single class without any finer annotations available. We identify limitations of conventional approaches in one-class novelty detection and present a Generative Adversarial Network (GAN) based solution. Our solution is based on learning latent representations of in-class examples using a denoising auto-encoder network. The key contribution of our work is our proposal to explicitly constrain the latent space to exclusively represent the given class. In order to accomplish this goal, firstly, we force the latent space to have bounded support by introducing a tanh activation in the encoder’s output layer. Secondly, using a discriminator in the latent space that is trained adversarially, we ensure that encoded representations of in-class examples resemble uniform random samples drawn from the same bounded space. Thirdly, using a second adversarial discriminator in the input space, we ensure all randomly drawn latent samples generate examples that look real.
Finally, we introduce a gradient-descent based sampling technique that explores points in the latent space that generate potential out-of-class examples, which are fed back to the network to further train it to generate in-class examples from those points. The effectiveness of the proposed method is measured across four publicly available datasets using two one-class novelty detection protocols where we achieve state-of-the-art results.
Title: Neural Circuit Mechanisms of Stimulus Selection Underlying Spatial Attention
Thesis Committee: Shreesh P. Mysore, Hynek Hermansky, Mounya Elhilali, Ralph Etienne-Cummings
Abstract: Humans and animals routinely encounter competing pieces of information in their environments, and must continually select the most salient in order to survive and behave adaptively. Here, using computational modeling, extracellular neural recordings, and focal, reversible silencing of neurons in the midbrain of barn owls, we uncovered how two essential computations underlying competitive selection are implemented in the brain: a) the ability to select the most salient stimulus among all pairs of stimulus locations, and b) the ability to signal the most salient stimulus categorically.
We first discovered that a key inhibitory nucleus in the midbrain attention network, called isthmi pars magnocellularis (Imc), encodes visual space with receptive fields that have multiple excitatory hotspots (‘‘lobes’’). Such (previously unknown) multilobed encoding of visual space is necessary for selection at all location-pairs in the face of a scarcity of Imc neurons. Although distributed seemingly randomly, the RF lobe-locations are optimized across the high-firing Imc neurons, allowing them to combinatorially solve selection across space. This combinatorially optimized inhibition strategy minimizes metabolic and wiring costs.
Next, we discovered that a ‘donut-like’ inhibitory mechanism in which each competing option suppresses all options except itself is highly effective at generating categorical responses. It surpasses motifs of feedback inhibition, recurrent excitation, and divisive normalization used commonly in decision-making models. We demonstrated experimentally not only that this mechanism operates in the midbrain spatial selection network in barn owls, but also that it is required for that network’s categorical signaling. Moreover, the pattern of inhibition in the midbrain forms an exquisitely structured ‘multi-holed’ donut consistent with this network’s combinatorial inhibitory function (computation 1).
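The connectivity pattern described above, in which each option inhibits every option except itself, can be sketched with an inhibition matrix whose diagonal is zero. The code below is a minimal sketch under stated assumptions, not the model used in the thesis: the subtractive, rectified response rule, the single shared inhibition weight, and the function names are all hypothetical choices made for illustration.

```python
def donut_weights(n, weight):
    # Inhibition matrix: entry [i][j] is the inhibition that option j sends to option i.
    # The zero diagonal is the "hole" of the donut: no option inhibits itself.
    return [[0.0 if i == j else weight for j in range(n)] for i in range(n)]

def respond(strengths, weight=0.5):
    # Assumed response rule: rectified input minus the summed inhibition from the others.
    n = len(strengths)
    w = donut_weights(n, weight)
    responses = []
    for i in range(n):
        inhibition = sum(w[i][j] * strengths[j] for j in range(n))
        responses.append(max(0.0, strengths[i] - inhibition))
    return responses

# Toy example: three competing stimuli of different strengths.
strengths = [0.6, 0.5, 0.2]
responses = respond(strengths)
print(responses)
```

Under this assumed rule, the strongest option receives the least inhibition because it is the only one spared its own suppressive output, which widens the gap between its response and those of its competitors.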
Our work demonstrates that the vertebrate midbrain uses seemingly carefully optimized structural and functional strategies to solve challenging computational problems underlying stimulus selection and spatial attention at all location pairs. The neural motifs discovered here represent circuit-based solutions that are generalizable to other brain areas, other forms of behavior (such as decision-making, action selection) as well as for the design of artificial systems (such as robotics, self-driving cars) that rely on the selection of one among many options.
Title: Towards a better understanding of spoken conversations: Assessment of sentiment and emotion
Abstract: In this talk, we present our work on understanding the emotional aspects of spoken conversations. Emotions play a vital role in our daily life as they help us convey information impossible to express verbally to other parties.
While humans can easily perceive emotions, these are notoriously difficult for machines to define and recognize. However, automatically detecting the emotion of a spoken conversation can be useful for a diverse range of applications such as human-machine interaction and conversation analysis. In this work, we considered emotion recognition in two particular scenarios. The first scenario is predicting customer sentiment/satisfaction (CSAT) in a call center conversation, and the second consists of emotion prediction in short utterances.
CSAT is defined as the overall sentiment (positive vs. negative) of the customer about his/her interaction with the agent. In this work, we perform a comprehensive search for adequate acoustic and lexical representations.
For the acoustic representation, we propose to use the x-vector model, which is known for its state-of-the-art performance in the speaker recognition task. The motivation behind using x-vectors for CSAT is our observation that emotion information encoded in x-vectors affected speaker recognition performance. For the lexical representation, we introduce a novel method, CSAT Tracker, which computes the overall prediction based on individual segment outcomes. Both methods rely on transfer learning to obtain the best performance. We performed classification using convolutional neural networks that combine the acoustic and lexical features. We evaluated our systems on US English telephone speech from call center data. We found that lexical models perform better than acoustic models and that fusing them provided significant gains. The analysis of errors reveals that the calls where customers accomplished their goal but were still dissatisfied are the most difficult to predict correctly. Also, we found that the customer’s speech is more emotional compared to the agent’s speech.
For the second scenario of predicting emotion, we present a novel approach based on x-vectors. We show that adapting the x-vector model for emotion recognition provides the best published results on three public datasets.
Title: Single-Channel Speech Separation in Noisy and Reverberant Conditions
Abstract: An inevitable property of multi-party conversations is that more than one speaker will end up speaking simultaneously for portions of time. Many speech technologies, such as automatic speech recognition and speaker identification, are not designed to function on overlapping speech and suffer severe performance degradation under such conditions. Speech separation techniques aim to solve this problem by producing a separate waveform for each speaker in an audio recording with multiple talkers speaking simultaneously. The advent of deep neural networks has resulted in strong performance gains on the speech separation task. However, training and evaluation have been almost exclusively restricted to a single dataset of clean, near-field read speech, which is not representative of many multi-person conversational settings that are frequently recorded on room microphones, introducing noise and reverberation. Due to the degradation of other speech technologies in these sorts of conditions, speech separation systems are expected to suffer a decrease in performance as well.
The primary goal of this proposal is to develop novel techniques to improve speech separation in noisy and reverberant recording conditions. One core component of this work is the creation of additional synthetic overlap corpora spanning a range of more realistic and challenging conditions. The lack of appropriate data necessitates a first step of creating appropriate conditions with which to benchmark the performance of state-of-the-art methods in these more challenging conditions. Another proposed line of investigation is the integration of speech separation techniques with speech enhancement, the task of enhancing a speech signal through the removal of noise or reverberation. This is a natural combination due to similarities in problem formulation and general approach. Finally, we propose an investigation into the effectiveness of speech separation as a pre-processing step to speech technologies, such as automatic speech recognition, that struggle with overlapping speech, as well as tighter integration of speech separation with these “downstream” systems.