Title: Soroban: A Mixed-Signal Neuromorphic Processing in Memory Architecture
Abstract: To meet the scientific demand for future data-intensive processing, for tasks ranging from everyday mundane ones such as searching via images to the most serious ones such as disease diagnosis in personalized medicine, we urgently need a new cloud computing paradigm and energy-efficient, i.e. “green”, technologies. We believe that a brain-inspired approach that employs unconventional processing offers an alternative paradigm for BIGDATA computing.
My research aims to go beyond state-of-the-art processor-in-memory architectures. In the realm of unconventional processors, charge-based computing has been an attractive solution since its introduction with charge-coupled device (CCD) imagers in the seventies. Such architectures have been modified into compute-in-memory arrays that have been used for signal processing, neural networks and pattern recognition using the same underlying physics. Other work has utilized the same concept in charge-injection devices (CIDs), which have also been used for similar pattern recognition tasks. However, these computing elements have not been integrated with the support infrastructure for high-speed input/output commensurate with streaming BIGDATA processing applications. In this work, the CID concept is taken to a smaller 55 nm CMOS node and has shown promising preliminary results as a multilevel-input computing element for hardware inference applications. A mixed-signal charge-based vector-vector multiplier (VMM) is explored which computes directly on a common readout line of a dynamic random-access memory (DRAM). Low power consumption and high area density are achieved by storing local parameters in a DRAM computing crossbar.
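As a rough illustration of what summing charge on a common readout line computes, the sketch below models an idealized multilevel-input dot product. It is not the Soroban circuit: the function names, the four input levels, and the absence of noise, nonlinearity and device physics are all assumptions made for illustration.

```python
def quantize(value, levels):
    """Snap a value in [0, 1] to the nearest of `levels` evenly spaced input levels."""
    step = 1.0 / (levels - 1)
    return round(value / step) * step


def readout_line_charge(inputs, weights, levels=4):
    """Idealized shared readout line (hypothetical model).

    Each cell contributes a charge proportional to its quantized input times
    its stored weight; the line simply accumulates the contributions, so the
    total is a dot product of the quantized inputs and the weights.
    """
    return sum(quantize(x, levels) * w for x, w in zip(inputs, weights))


# Three cells on one line, with inputs restricted to 4 levels (0, 1/3, 2/3, 1).
total = readout_line_charge([0.0, 1.0, 0.34], [1.0, 2.0, 3.0], levels=4)
```

In this toy model the input 0.34 is snapped to the level 1/3 before it is multiplied by its weight, which is the only place where the multilevel input enters.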
Title: Semi-supervised training for automatic speech recognition.
Abstract: State-of-the-art automatic speech recognition (ASR) systems use sequence-level objectives like Connectionist Temporal Classification (CTC) and Lattice-free Maximum Mutual Information (LF-MMI) for training neural network-based acoustic models. These methods are known to be most effective with large datasets containing hundreds or thousands of hours of data. It is difficult to obtain large amounts of supervised data other than in a few major languages like English and Mandarin. It is also difficult to obtain supervised data in a myriad of channel and environmental conditions. On the other hand, large amounts of unsupervised audio can be obtained fairly easily. There are enormous amounts of unsupervised data available in broadcast TV, call centers and YouTube for many different languages and in many environmental conditions. The goal of this research is to discover how to best leverage the available unsupervised data for training acoustic models for ASR.
In the first part of this thesis, we extend the Maximum Mutual Information (MMI) training to the semi-supervised training scenario. We show that maximizing Negative Conditional Entropy (NCE) over lattices from unsupervised data, along with state-level Minimum Bayes Risk (sMBR) on supervised data, in a multi-task architecture gives word error rate (WER) improvements without needing any confidence-based filtering.
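To make the negative conditional entropy objective concrete, here is a minimal sketch that evaluates it for one unsupervised utterance. It assumes the lattice has been flattened into an explicit list of path log-scores, which is a simplification: a real implementation would work on the lattice itself with forward-backward computations. The function names and the example scores are hypothetical.

```python
import math


def path_posteriors(log_scores):
    """Turn unnormalized path log-scores into posteriors with a stable softmax."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def negative_conditional_entropy(log_scores):
    """NCE = sum_W P(W|O) log P(W|O), which is always <= 0.

    Maximizing it pushes posterior mass onto fewer competing hypotheses.
    """
    return sum(p * math.log(p) for p in path_posteriors(log_scores) if p > 0.0)


confused = negative_conditional_entropy([0.0, 0.0, 0.0, 0.0])  # four equally likely paths
confident = negative_conditional_entropy([0.0, -50.0])          # one dominant path
```

The value moves toward zero as the posterior concentrates on a single path, so the objective needs no reference transcript.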
In the second part of this thesis, we investigate using lattice-based supervision as the numerator graph to incorporate uncertainties in unsupervised data in the LF-MMI training framework. We explore various aspects of creating the numerator graph, including splitting lattices for minibatch training, applying tolerance to frame-level alignments, pruning beam sizes, word LM scale and inclusion of pronunciation variants. We show that the WER recovery rate (WRR) of our proposed approach is 5-10% absolute better than that of the baseline of using the 1-best transcript as supervision, and is stable in the 40-60% range even on large-scale setups and multiple different languages.
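The abstract does not define WRR. A commonly used definition, assumed here, is the fraction of the gap between a supervised-only baseline and an oracle system (trained with true transcripts for the unsupervised data) that semi-supervised training closes. The function name and the WER values below are hypothetical.

```python
def wer_recovery_rate(baseline_wer, semisup_wer, oracle_wer):
    """WRR = (baseline - semi-supervised) / (baseline - oracle), under the assumed definition."""
    gap = baseline_wer - oracle_wer
    if gap <= 0:
        raise ValueError("oracle WER must be lower than baseline WER")
    return (baseline_wer - semisup_wer) / gap


# Hypothetical numbers: baseline 20% WER, semi-supervised 17%, oracle 15%.
wrr = wer_recovery_rate(20.0, 17.0, 15.0)
```

With these made-up numbers the semi-supervised system recovers 3 of the 5 WER points that are available, i.e. a WRR of 60%.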
Finally, we explore transfer learning for the scenario where we have unsupervised data in a mismatched domain. First, we look at the teacher-student learning approach for cases where parallel data is available in source and target domains. Here, we train a “student” neural network on the target domain to mimic a “teacher” neural network on the source domain data, but using sequence-level posteriors instead of the traditional approach of using frame-level posteriors.
We show that the proposed approach is very effective at dealing with acoustic domain mismatch in multiple scenarios of unsupervised domain adaptation: clean to noisy speech, 8kHz to 16kHz speech, and close-talk microphone to distant microphone.
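Teacher-student training of this kind is often expressed as minimizing a KL divergence from the teacher's posterior to the student's posterior. The sketch below assumes that loss and applies it to a posterior over a small set of competing hypotheses, as a stand-in for sequence-level posteriors. The actual loss used in the thesis is not stated in the abstract, and all names and numbers here are illustrative.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same outcomes."""
    return sum(
        t * math.log(t / max(s, eps))
        for t, s in zip(teacher, student)
        if t > 0.0
    )


# Teacher posterior over three hypotheses for a source-domain utterance,
# and two candidate student posteriors for the parallel target-domain utterance.
teacher_post = [0.7, 0.2, 0.1]
loss_matched = kl_divergence(teacher_post, [0.7, 0.2, 0.1])
loss_uniform = kl_divergence(teacher_post, [1 / 3, 1 / 3, 1 / 3])
```

The loss is zero when the student reproduces the teacher's posterior and grows as the two diverge. No transcript of the target-domain audio is required, only the parallel source-domain recording.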
Second, we investigate approaches to mitigate language domain mismatch, and show that a matched language model significantly improves WRR. We finally show that our proposed semi-supervised transfer learning approach works effectively even on large-scale unsupervised datasets with 2000 hours of audio in natural and realistic conditions.
Title: Strategies for Handling Out-of-Vocabulary Words in Automatic Speech Recognition
Abstract: Nowadays, most ASR (automatic speech recognition) systems deployed in industry are closed-vocabulary systems, meaning that the system can recognize only a limited vocabulary of words whose pronunciations are provided to it. Words outside this vocabulary are called out-of-vocabulary (OOV) words, for which either the pronunciations or both the spellings and the pronunciations are not known to the system. The basic motivations for developing strategies to handle OOV words are as follows. First, in the training phase, missing or wrong pronunciations of words in training data result in poor acoustic models. Second, in the test phase, words outside the vocabulary cannot be recognized at all, and misrecognition of OOV words may affect recognition performance on their in-vocabulary neighbors as well. Therefore, this dissertation is dedicated to exploring strategies for handling OOV words in closed-vocabulary ASR.
First, we investigate dealing with OOV words in ASR training data, by introducing an acoustic-data driven pronunciation learning framework using a likelihood-reduction based criterion for selecting pronunciation candidates from multiple sources, i.e. standard grapheme-to-phoneme algorithms (G2P) and phonetic decoding, in a greedy fashion. This framework effectively expands a small hand-crafted pronunciation lexicon to cover OOV words, for which the learned pronunciations have higher quality than approaches using G2P alone or using other baseline pruning criteria. Furthermore, applying the proposed framework to generate alternative pronunciations for in-vocabulary (IV) words improves both recognition performance on relevant words and overall acoustic model performance.
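The greedy selection described above can be sketched as follows. This is a simplified stand-in, not the framework's actual criterion: it assumes a table of per-utterance acoustic likelihoods for each pronunciation candidate, scores a candidate set by the best candidate per utterance, and repeatedly drops the candidate whose removal reduces the total log-likelihood least. All names, numbers and the stopping threshold are hypothetical.

```python
import math


def total_log_lik(lik, kept):
    """Sum over utterances of the log-likelihood of the best kept candidate."""
    return sum(math.log(max(row[p] for p in kept)) for row in lik)


def greedy_prune(lik, candidates, max_loss):
    """Remove candidates one by one while the likelihood reduction stays small."""
    kept = list(candidates)
    while len(kept) > 1:
        base = total_log_lik(lik, kept)
        # Likelihood reduction caused by removing each remaining candidate.
        losses = {
            p: base - total_log_lik(lik, [q for q in kept if q != p])
            for p in kept
        }
        victim = min(losses, key=losses.get)
        if losses[victim] > max_loss:
            break  # every remaining candidate explains some data well
        kept.remove(victim)
    return kept


# Hypothetical likelihoods for three utterances of one word, with two G2P
# candidates and one candidate from phonetic decoding.
lik = [
    {"g2p_1": 0.9, "g2p_2": 0.2, "phonetic_1": 0.1},
    {"g2p_1": 0.8, "g2p_2": 0.3, "phonetic_1": 0.1},
    {"g2p_1": 0.1, "g2p_2": 0.2, "phonetic_1": 0.9},
]
selected = greedy_prune(lik, ["g2p_1", "g2p_2", "phonetic_1"], max_loss=0.5)
```

In this toy data the second G2P candidate is never the best explanation of any utterance, so it is removed at no cost, while the phonetic-decoding candidate survives because it alone explains the third utterance.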
Second, we investigate dealing with OOV words in ASR test data, i.e. OOV detection and recovery. We first conduct a comparative study of a hybrid lexical model (HLM) approach for OOV detection, and several baseline approaches, with the conclusion that the HLM approach outperforms others in both OOV detection and first pass OOV recovery performance. Next, we introduce a grammar-decoding framework for efficient second pass OOV recovery, showing that with properly designed schemes of estimating OOV unigram probabilities, the framework significantly improves OOV recovery and overall decoding performance compared to first pass decoding.
Finally, we propose an open-vocabulary word-level recurrent neural network language model (RNNLM) rescoring framework, making it possible to rescore lattices containing recovered OOVs using a word-level RNNLM that was ignorant of OOVs when it was trained. Above all, the whole OOV recovery pipeline shows the potential of a highly efficient open-vocabulary word-level ASR decoding framework, tightly integrated into a standard WFST decoding pipeline.
Title: Advanced Image Reconstruction and Analysis for Fluorescence Molecular Tomography (FMT) and Positron Emission Tomography (PET)
Abstract: Molecular imaging provides efficient ways to monitor different biological processes noninvasively, and high-quality imaging is necessary in order to fully explore the value of molecular imaging. To this end, advanced image generation algorithms are able to significantly improve image quality and quantitative performance. In this research proposal, we focus on two imaging modalities, fluorescence molecular tomography (FMT) and positron emission tomography (PET), that fall in the category of molecular imaging. Specifically, we studied the following two problems: i) reconstruction problem in FMT and ii) partial volume correction in brain PET imaging.
Reconstruction in FMT: FMT is an optical imaging modality that uses diffuse light for imaging. The reconstruction problem for FMT is highly ill-posed due to photon scattering in biological tissue, and thus regularization techniques tend to be used to alleviate the ill-posed nature of the problem. Conventional reconstruction algorithms cause oversmoothing, which reduces the resolution of the reconstructed images. Moreover, a Gaussian model is commonly chosen as the noise model, although most FMT systems based on a charge-coupled device (CCD) or photomultiplier tube (PMT) are contaminated by Poisson noise. In our work, we propose a reconstruction algorithm for FMT using sparsity-initialized maximum-likelihood expectation maximization (MLEM). The algorithm preserves edges by exploiting sparsity and takes Poisson noise into consideration. Through simulation experiments, we compare the proposed method with a pure sparse reconstruction method and with MLEM with uniform initialization. We show that the proposed method holds several advantages over the other two methods.
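For readers unfamiliar with MLEM, the sketch below implements the standard multiplicative MLEM update for a Poisson model y ~ Poisson(Ax) on a toy problem. The 4x3 system matrix, the noiseless data, and the hand-picked "sparse" starting image are hypothetical; the sparse reconstruction step that would supply the initial image in the proposed method is not shown.

```python
import math


def forward(A, x):
    """Project image x through the system matrix A (a list of rows)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def poisson_log_lik(A, y, x, eps=1e-12):
    """Poisson log-likelihood of data y given image x, omitting the log(y!) term."""
    return sum(yi * math.log(max(pi, eps)) - pi for yi, pi in zip(y, forward(A, x)))


def mlem(A, y, x0, n_iter=50, eps=1e-12):
    """Standard MLEM: x_j <- x_j / s_j * sum_i a_ij * y_i / (A x)_i."""
    n_meas, n_pix = len(A), len(x0)
    sens = [sum(A[i][j] for i in range(n_meas)) for j in range(n_pix)]  # s_j
    x = list(x0)
    for _ in range(n_iter):
        proj = forward(A, x)
        ratio = [yi / max(pi, eps) for yi, pi in zip(y, proj)]
        back = [sum(A[i][j] * ratio[i] for i in range(n_meas)) for j in range(n_pix)]
        x = [x[j] * back[j] / sens[j] for j in range(n_pix)]
    return x


A = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5],
     [0.0, 0.5, 1.0],
     [0.3, 0.3, 0.3]]
x_true = [0.0, 4.0, 0.0]
y = forward(A, x_true)                    # noiseless toy measurements
x_uniform = mlem(A, y, [1.0, 1.0, 1.0])   # uniform initialization
x_sparse = mlem(A, y, [0.0, 1.0, 0.0])    # initialization with a sparse support
```

Because the update is multiplicative, the image stays non-negative and any pixel that starts at zero remains zero, so the support of the starting image constrains the final reconstruction. This is one reason the choice of initialization matters.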
Partial volume correction of brain PET imaging: The so-called partial volume effect (PVE) is caused by the limited resolution of PET systems and reduces the quantitative accuracy of PET imaging. Based on the stage of implementation, partial volume correction (PVC) algorithms can be categorized into reconstruction-based and post-reconstruction methods. Post-reconstruction PVC methods can be directly implemented on reconstructed PET images and do not require access to raw data or reconstruction algorithms of PET scanners. Many of these methods use anatomical information from MRI to further improve their performance. However, conventional MR-guided post-reconstruction PVC methods require segmentation of MR images and assume uniform activity distribution within each segmented region. In this proposal, we develop a post-reconstruction PVC method based on deconvolution via parallel level set regularization. The method is implemented with non-smooth optimization based on the split Bregman method. The proposed method incorporates MRI information without requiring segmentation or making any assumption on activity distribution. Simulation experiments are conducted to compare the proposed method with several other segmentation-free methods, as well as a conventional segmentation-based PVC method. The results show that the proposed method outperforms the other segmentation-free methods and shows stronger resistance to MR information mismatch than the conventional segmentation-based PVC method.
Title: Statistical Modeling and analysis of allele-specific DNA methylation at the haplotype level
Abstract: Epigenetics is the branch of biology concerned with the study of phenotypical changes due to alterations of DNA, maintained during cell division, excluding modifications of the sequence itself. Epigenetic information includes DNA methylation, histone modifications, and higher order chromatin structure among others. DNA methylation is a stable epigenetic mechanism that chemically marks the DNA by adding methyl groups at individual cytosines immediately adjacent to guanines (CpG sites). Methylation marks are used to identify cell-type specific aspects of gene regulation, since marks located within a gene promoter or enhancer typically act to repress gene transcription, whereas promoter or enhancer demethylation is associated with gene activation. Notably, patterns of methylation marks are highly polymorphic and stochastic, containing information about a broad range of normal and aberrant biological processes, such as development and differentiation, aging, and carcinogenesis.
The epigenetic information content of two homologous chromosomal regions need not be the same. For example, it is well established that the ability of a cell to methylate the promoter region of a specific copy of a gene (an allele), is crucial for proper development. In fact, many known phenotypical traits stem from allele-specific epigenetic marks. Moreover, some allele-specific epigenetic differences have been found to be associated with local genetic differences between copies of a chromosome. Thus, developing a framework for studying such epigenetic differences in diploid organisms is our main goal. More specifically, our objective is to develop a statistical method that can be used to detect regions in the genome, with genetic differences between homologous chromosomes, in which there are biologically relevant differences in DNA methylation between alleles.
State-of-the-art methods for allele-specific methylation modeling and analysis have critical shortcomings rendering them unsuitable for this type of analysis. We present a statistical-physics-inspired model for allele-specific methylation analysis that contains a sensible number of parameters, considering the limited sample size in whole genome bisulfite sequencing data, and that is nonetheless rich enough to capture the complexity in the data. We demonstrate the appropriateness of this model for allele-specific methylation analysis using simulation data as well as real data. Using our model, we compute mean methylation level differences between alleles, as well as information-theoretic quantities, such as the entropy of the methylation state in each allele and the mutual information between the methylation state and the allele of origin, and assess the statistical significance of each quantity by learning the null distribution from the data. This complementary set of statistics allows for an unparalleled level of insight in subsequent biological analysis. As a result, the developed framework provides unprecedented descriptive power to characterize (i) the circumstances under which allele-specific methylation events arise, and (ii) the cis-effect, or lack thereof, that genetic mutations have on DNA methylation.
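The three kinds of statistics named above can be illustrated on a deliberately simplified case: a single CpG site with methylated/total read counts per allele. This is an assumption made for illustration only; the model in the abstract describes methylation states over regions and estimates these quantities from a fitted model, and the significance testing against a learned null distribution is not shown. Function names and counts are hypothetical.

```python
import math


def bernoulli_entropy(p):
    """Entropy in bits of a binary methylation state with P(methylated) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def allele_statistics(meth1, total1, meth2, total2):
    """Mean methylation difference, per-allele entropy, and mutual information
    between the methylation state X and the allele of origin A:
    I(X; A) = H(X) - sum_a P(a) H(X | A = a)."""
    p1, p2 = meth1 / total1, meth2 / total2
    w1 = total1 / (total1 + total2)       # P(A = allele 1)
    w2 = 1.0 - w1
    pooled = w1 * p1 + w2 * p2            # P(X = methylated)
    h1, h2 = bernoulli_entropy(p1), bernoulli_entropy(p2)
    return {
        "mean_diff": p1 - p2,
        "entropy_1": h1,
        "entropy_2": h2,
        "mutual_info": bernoulli_entropy(pooled) - (w1 * h1 + w2 * h2),
    }


same = allele_statistics(50, 100, 50, 100)     # alleles behave identically
imprinted = allele_statistics(100, 100, 0, 100)  # one allele fully methylated
```

The two examples show why the statistics are complementary: in the first, both alleles are maximally stochastic yet carry no allele-specific signal, while in the second, each allele is deterministic and the methylation state fully identifies the allele of origin.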
Title: A Practical and Efficient Multi-Stream Framework for End-to-End Speech Recognition
Abstract: The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. In these cases, an appropriate strategy to fuse streams or select the most informative source is necessary. In recent years, with the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this proposal, a multi-stream framework is presented based on a joint CTC/Attention E2E model, where parallel streams are represented by separate encoders aiming to capture diverse information. On top of the regular attention networks, a secondary stream-fusion network is introduced to steer the decoder toward the most informative encoders.
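One way to picture the stream-fusion step is as a second softmax over streams: each encoder's attention network yields a context vector, and the fusion network assigns each stream a score that decides how much its context contributes. The sketch below assumes that form. The scores are passed in as plain numbers, whereas in the framework they would come from a learned network, and all names and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def fuse_streams(contexts, scores):
    """Weighted sum of per-stream context vectors using stream-level weights."""
    weights = softmax(scores)
    dim = len(contexts[0])
    fused = [sum(w * ctx[d] for w, ctx in zip(weights, contexts)) for d in range(dim)]
    return fused, weights


# Two streams with 2-dimensional context vectors for one decoding step.
contexts = [[1.0, 0.0], [0.0, 1.0]]
fused_equal, w_equal = fuse_streams(contexts, [0.0, 0.0])      # equally informative
fused_first, w_first = fuse_streams(contexts, [10.0, -10.0])   # second stream corrupted
```

When one stream receives a much higher score, the fused context is dominated by that stream, which is how the decoder can be steered away from a corrupted input.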
Two representative frameworks have been proposed: Multi-Encoder Multi-Resolution (MEM-Res) and Multi-Encoder Multi-Array (MEM-Array). Moreover, because an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training scheme is further proposed in this work. Experiments are conducted on various corpora including Wall Street Journal (WSJ), CHiME-4, DIRHA and AMI. Compared with the best single-stream performance, the proposed framework achieves substantial improvement and also outperforms various conventional fusion strategies.
The future plan aims to improve the robustness of the proposed multi-stream framework. Measuring the performance of an ASR system without ground truth could be beneficial in multi-stream scenarios, in order to emphasize more informative streams over corrupted ones. In this proposal, four different Performance Monitoring (PM) techniques are investigated. The preliminary results suggest that PM measures on attention distributions and decoder posteriors are well correlated with true performance. Integration of PM measures and more sophisticated fusion mechanisms in the multi-stream framework will be the focus of future exploration.
Title: Automated Spore Analysis Using Bright-Field Imaging and Raman Microscopy
Abstract: In 2015, it was determined that the United States Department of Defense had been shipping samples of B. anthracis spores which had undergone gamma irradiation but were not fully inactivated. In the aftermath of this event, alternative and orthogonal methods were investigated to analyze spores and determine their viability. In this thesis we demonstrate a novel analysis technique that combines bright-field microscopy images with Raman chemical microscopy.
We first developed an image segmentation routine based on the watershed method to locate individual spores within bright-field images. This routine was able to effectively demarcate 97.4% of the Bacillus spores within the bright-field images with minimal over-segmentation. Size and shape measurements, including major and minor axis lengths and area, were then extracted for 4048 viable spores and showed very good agreement with previously published values. When similar measurements were taken on 3627 gamma-irradiated spores, a statistically significant difference was noted for the minor axis length, ratio of major to minor axis, and total area when compared to the non-irradiated spores. Classification results show the ability to correctly classify 67% of viable spores with an 18% misclassification rate using the bright-field image by thresholding the minimum classification length.
Raman chemical imaging microscopy (RCIM) was then used to measure populations of viable, gamma-irradiated, and autoclaved spores of B. anthracis Sterne, B. atrophaeus, B. megaterium, and B. thuringiensis kurstaki. Significant spectral differences were observed between viable and inactivated spores due to the disappearance of features associated with calcium dipicolinate after irradiation. Principal component analysis showed the ability to distinguish viable spores of B. anthracis Sterne and B. atrophaeus from each other and from the other two Bacillus species.
Finally, Raman microscopy was used to classify mixtures of viable and gamma-inactivated spores. A technique was developed that uses the size and shape characteristics obtained from the bright-field image to preferentially target viable spores. A practical demonstration of the technique was performed on a field of view containing approximately 7,000 total spores, of which only 12 were viable, to simulate a sample that was not fully irradiated. Ten of these spores were properly classified while interrogating just 25% of the total spores.
Title: Robust Adaptive Strategies for Myographic Prosthesis Movement Decoding
Abstract: Improving the condition-tolerance, stability, response time, and dexterity of neural prosthesis control strategies are major clinical goals to aid amputees in achieving natural restorative upper-limb function. Currently, the dominant noninvasive neural source for prosthesis motor control is the skin-surface recorded electromyographic (EMG) signal. Decoding movement intentions from EMG is a challenging problem because this signal type is subject to a high degree of interference from noise and conditional influences. As a consequence, much of the movement intention information contained within the EMG signal has remained significantly under-utilized for the purposes of controlling robotic prostheses. We sought to overcome this information deficit through the use of adaptive strategies for machine learning, sparse representations, and signal processing to significantly improve myographic prosthesis control. This body of research represents the current state-of-the-art in condition-tolerant EMG movement classification (Chapter 3), stable and responsive EMG sequence decoding during movement transitions (Chapter 4), and positional regression to reliably control 7 wrist and finger degrees-of-freedom (Chapter 5). To our knowledge, the methods we describe in Chapter 5 elicit the most dexterous, biomimetic, and natural prosthesis control performance ever obtained from the surface EMG signal.
Title: Loss Landscapes of Neural Networks and their Generalization: Theory and Applications
Abstract: In the last decade or so, deep learning has revolutionized entire domains of machine learning. Neural networks have helped achieve significant improvements in computer vision, machine translation, speech recognition, etc. These powerful empirical demonstrations leave a wide gap between our current theoretical understanding of neural networks and their practical performance. The theoretical questions in deep learning can be put under three broad but inter-related themes: 1) Architecture/Representation, 2) Optimization, and 3) Generalization. In this dissertation, we study the landscapes of different deep learning problems to answer questions in the above themes.
First, in order to understand what representations can be learned by neural networks, we study simple Autoencoder networks with one hidden layer of rectified linear units. We connect autoencoders to the well-known problem in signal processing of Sparse Coding. We show that the squared reconstruction error loss function has a critical point at the ground truth dictionary under an appropriate generative model.
Next, we turn our attention to a problem at the intersection of optimization and generalization. Training deep networks through empirical risk minimization is a non-convex problem with many local minima in the loss landscape. A number of empirical studies have observed that “flat minima” for neural networks tend to generalize better than sharper minima. However, quantifying the flatness or sharpness of minima has been an issue due to possible rescaling in neural networks with positively homogeneous activations. We use ideas from Riemannian geometry to define a new measure of flatness that is invariant to rescaling. We test the hypothesis that flatter minima generalize better through a number of different experiments on deep networks.
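The rescaling issue can be seen in a toy network with one ReLU unit, f(x) = w2 * relu(w1 * x). Scaling w1 by a positive factor and dividing w2 by the same factor leaves the function unchanged, yet a naive curvature-based sharpness value changes. The sketch below demonstrates only this problem, not the Riemannian measure proposed in the dissertation; the network, the loss, and the chosen curvature are illustrative assumptions.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def net(w1, w2, x):
    """Tiny two-layer network with a positively homogeneous activation."""
    return w2 * relu(w1 * x)


def curvature_wrt_w2(w1, x):
    """Second derivative of the loss 0.5 * (net - target)^2 with respect to w2.

    Since net is linear in w2, this equals relu(w1 * x)^2 for any target.
    """
    return relu(w1 * x) ** 2


w1, w2, x, alpha = 1.0, 3.0, 2.0, 4.0

out_original = net(w1, w2, x)
out_rescaled = net(alpha * w1, w2 / alpha, x)     # same function, different weights

sharp_original = curvature_wrt_w2(w1, x)
sharp_rescaled = curvature_wrt_w2(alpha * w1, x)  # larger by a factor of alpha^2
```

Both parameter settings compute the same output and therefore generalize identically, but the curvature differs by a factor of alpha squared, so an unnormalized sharpness value cannot by itself separate "flat" from "sharp" minima.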
Finally, we apply deep networks to computer vision problems with compressed measurements of natural images and videos. We conduct experiments to characterize the situations in which these networks fail, and those in which they succeed. We train deep networks to perform object detection and classification directly on these compressive measurements of images, without trying to reconstruct the scene first. These experiments are conducted on public datasets as well as datasets specific to a sponsor of our research.
Title: Deep Learning-based Novelty Detection
Abstract: In recent years, intelligent systems powered by artificial intelligence and computer vision that perform visual recognition have gained much attention. These systems observe instances and labels of known object classes during training and learn association patterns that can be used during inference. A practical visual recognition system should first determine whether an observed instance is from a known class. If it is from a known class, then the identity of the instance is queried through classification. The former process is commonly known as novelty detection (or novel class detection) in the literature. Given a set of image instances from known classes, the goal of novelty detection is to determine whether an observed image during inference belongs to one of the known classes.
We consider one-class novelty detection, where all training data are assumed to belong to a single class without any finer annotations available. We identify limitations of conventional approaches in one-class novelty detection and present a Generative Adversarial Network (GAN)-based solution. Our solution is based on learning latent representations of in-class examples using a denoising auto-encoder network. The key contribution of our work is our proposal to explicitly constrain the latent space to exclusively represent the given class. In order to accomplish this goal, firstly, we force the latent space to have bounded support by introducing a tanh activation in the encoder’s output layer. Secondly, using a discriminator in the latent space that is trained adversarially, we ensure that encoded representations of in-class examples resemble uniform random samples drawn from the same bounded space. Thirdly, using a second adversarial discriminator in the input space, we ensure all randomly drawn latent samples generate examples that look real.
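The first two constraints can be pictured with a few lines of code: a tanh on the encoder output confines every latent code to the hypercube [-1, 1]^d, and the latent discriminator is trained to tell such codes apart from uniform samples drawn from the same hypercube. The sketch below shows only these two sources of latent vectors. The encoder, decoder and discriminators are omitted, and the names, the latent dimension, and the pre-activation values are hypothetical.

```python
import math
import random


def bounded_latent(pre_activations):
    """Apply tanh to the encoder's final pre-activations so each coordinate lies in [-1, 1]."""
    return [math.tanh(a) for a in pre_activations]


def sample_uniform_latent(dim, rng):
    """Draw a target sample from the uniform distribution on [-1, 1]^dim."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


rng = random.Random(0)                          # fixed seed for reproducibility
encoded = bounded_latent([-7.5, -0.3, 0.0, 2.1])  # stand-in for real encoder outputs
target = sample_uniform_latent(len(encoded), rng)
```

Because both kinds of vectors live in the same bounded set, matching their distributions leaves no region of the latent space unassigned to the given class, which is the stated aim of the constraint.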
Finally, we introduce a gradient-descent based sampling technique that explores points in the latent space that generate potential out-of-class examples, which are fed back to the network to further train it to generate in-class examples from those points. The effectiveness of the proposed method is measured across four publicly available datasets using two one-class novelty detection protocols where we achieve state-of-the-art results.