Title: Semi-supervised training for automatic speech recognition.
Abstract: State-of-the-art automatic speech recognition (ASR) systems use sequence-level objectives like Connectionist Temporal Classification (CTC) and Lattice-free Maximum Mutual Information (LF-MMI) for training neural network-based acoustic models. These methods are known to be most effective with large datasets containing hundreds or thousands of hours of data. It is difficult to obtain large amounts of supervised data other than in a few major languages like English and Mandarin. It is also difficult to obtain supervised data in a myriad of channel and environmental conditions. On the other hand, large amounts of unsupervised audio can be obtained fairly easily. There are enormous amounts of unsupervised data available in broadcast TV, call centers and YouTube for many different languages and in many environmental conditions. The goal of this research is to discover how to best leverage the available unsupervised data for training acoustic models for ASR.
In the first part of this thesis, we extend the Maximum Mutual Information (MMI) training to the semi-supervised training scenario. We show that maximizing Negative Conditional Entropy (NCE) over lattices from unsupervised data, along with state-level Minimum Bayes Risk (sMBR) on supervised data, in a multi-task architecture gives word error rate (WER) improvements without needing any confidence-based filtering.
In the second part of this thesis, we investigate using lattice-based supervision as the numerator graph to incorporate uncertainties in unsupervised data in the LF-MMI training framework. We explore various aspects of creating the numerator graph, including splitting lattices for minibatch training, applying tolerance to frame-level alignments, pruning beam sizes, word LM scale and inclusion of pronunciation variants. We show that the WER recovery rate (WRR) of our proposed approach is 5-10% absolute better than that of the baseline of using the 1-best transcript as supervision, and is stable in the 40-60% range even on large-scale setups and multiple different languages.
Finally, we explore transfer learning for the scenario where we have unsupervised data in a mismatched domain. First, we look at the teacher-student learning approach for cases where parallel data is available in source and target domains. Here, we train a “student” neural network on the target domain to mimic a “teacher” neural network on the source domain data, but using sequence-level posteriors instead of the traditional approach of using frame-level posteriors.
We show that the proposed approach is very effective at dealing with acoustic domain mismatch in multiple scenarios of unsupervised domain adaptation: clean to noisy speech, 8kHz to 16kHz speech, and close-talk microphone to distant microphone.
Second, we investigate approaches to mitigate language domain mismatch, and show that a matched language model significantly improves WRR. We finally show that our proposed semi-supervised transfer learning approach works effectively even on large-scale unsupervised datasets with 2000 hours of audio in natural and realistic conditions.
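The abstract quotes WER recovery rate (WRR) without defining it. A minimal sketch follows, assuming the commonly used definition: the fraction of the gap between a supervised-only baseline and an "oracle" system (trained with true transcripts for the unsupervised audio) that semi-supervised training closes. The function name and the numbers are illustrative assumptions, not results from the thesis.

```python
def wer_recovery_rate(baseline_wer, semisup_wer, oracle_wer):
    """Fraction of the baseline-to-oracle WER gap closed by semi-supervised training.

    Assumed definition: (baseline - semisup) / (baseline - oracle).
    """
    gap = baseline_wer - oracle_wer
    if gap <= 0:
        raise ValueError("oracle WER must be lower than baseline WER")
    return (baseline_wer - semisup_wer) / gap


# Hypothetical WERs in percent, chosen only to illustrate the arithmetic.
wrr = wer_recovery_rate(baseline_wer=20.0, semisup_wer=18.0, oracle_wer=16.0)
print(f"WRR = {wrr:.0%}")  # prints "WRR = 50%"
```

Under this definition, a WRR of 0% means the unsupervised data gave no benefit, and 100% means it was as useful as fully transcribed data.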
Title: Strategies for Handling Out-of-Vocabulary Words in Automatic Speech Recognition
Abstract: Nowadays, most ASR (automatic speech recognition) systems deployed in industry are closed-vocabulary systems, meaning the system can recognize only a limited vocabulary of words for which pronunciations are provided. Words outside this vocabulary are called out-of-vocabulary (OOV) words, for which either the pronunciations or both the spellings and pronunciations are not known to the system. The basic motivations for developing strategies to handle OOV words are as follows. First, in the training phase, missing or wrong pronunciations of words in training data result in poor acoustic models. Second, in the test phase, words outside the vocabulary cannot be recognized at all, and mis-recognition of OOV words may affect recognition performance of their in-vocabulary neighbors as well. Therefore, this dissertation is dedicated to exploring strategies for handling OOV words in closed-vocabulary ASR.
First, we investigate dealing with OOV words in ASR training data, by introducing an acoustic-data driven pronunciation learning framework using a likelihood-reduction based criterion for selecting pronunciation candidates from multiple sources, i.e. standard grapheme-to-phoneme algorithms (G2P) and phonetic decoding, in a greedy fashion. This framework effectively expands a small hand-crafted pronunciation lexicon to cover OOV words, for which the learned pronunciations have higher quality than approaches using G2P alone or using other baseline pruning criteria. Furthermore, applying the proposed framework to generate alternative pronunciations for in-vocabulary (IV) words improves both recognition performance on relevant words and overall acoustic model performance.
Second, we investigate dealing with OOV words in ASR test data, i.e. OOV detection and recovery. We first conduct a comparative study of a hybrid lexical model (HLM) approach for OOV detection, and several baseline approaches, with the conclusion that the HLM approach outperforms others in both OOV detection and first pass OOV recovery performance. Next, we introduce a grammar-decoding framework for efficient second pass OOV recovery, showing that with properly designed schemes of estimating OOV unigram probabilities, the framework significantly improves OOV recovery and overall decoding performance compared to first pass decoding.
Finally, we propose an open-vocabulary word-level recurrent neural network language model (RNNLM) rescoring framework, making it possible to rescore lattices containing recovered OOVs using a word-level RNNLM that was ignorant of OOVs when it was trained. Above all, the whole OOV recovery pipeline shows the potential of a highly efficient open-vocabulary word-level ASR decoding framework, tightly integrated into a standard WFST decoding pipeline.
Title: Advanced Image Reconstruction and Analysis for Fluorescence Molecular Tomography (FMT) and Positron Emission Tomography (PET)
Abstract: Molecular imaging provides efficient ways to monitor different biological processes noninvasively, and high-quality imaging is necessary in order to fully explore the value of molecular imaging. To this end, advanced image generation algorithms are able to significantly improve image quality and quantitative performance. In this research proposal, we focus on two imaging modalities, fluorescence molecular tomography (FMT) and positron emission tomography (PET), that fall in the category of molecular imaging. Specifically, we studied the following two problems: i) reconstruction problem in FMT and ii) partial volume correction in brain PET imaging.
Reconstruction in FMT: FMT is an optical imaging modality that uses diffuse light for imaging. The reconstruction problem for FMT is highly ill-posed due to photon scattering in biological tissue, and thus regularization techniques tend to be used to alleviate the ill-posed nature of the problem. Conventional reconstruction algorithms cause oversmoothing, which reduces the resolution of the reconstructed images. Moreover, a Gaussian model is commonly chosen as the noise model, although measurements in most FMT systems based on a charge-coupled device (CCD) or photomultiplier tube (PMT) are contaminated by Poisson noise. In our work, we propose a reconstruction algorithm for FMT using sparsity-initialized maximum-likelihood expectation maximization (MLEM). The algorithm preserves edges by exploiting sparsity, while also taking Poisson noise into consideration. Through simulation experiments, we compare the proposed method with a pure sparse reconstruction method and with MLEM with uniform initialization. We show that the proposed method holds several advantages over the other two methods.
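To make the MLEM step concrete, here is a minimal pure-Python sketch of the standard multiplicative MLEM update for a Poisson model y ~ Poisson(Ax). The toy system matrix, the measurements and the starting estimate are illustrative assumptions, and the starting estimate stands in for the output of a sparse reconstruction. How the proposal actually obtains its sparse initialization is not stated in the abstract.

```python
def mlem(A, y, x0, n_iter=50):
    """Standard MLEM iterations for y ~ Poisson(A x); A is a list of rows."""
    n_meas, n_vox = len(A), len(x0)
    # Sensitivity image: column sums of A, i.e. A^T applied to a vector of ones.
    sens = [sum(A[i][j] for i in range(n_meas)) for j in range(n_vox)]
    x = list(x0)
    for _ in range(n_iter):
        # Forward-project the current estimate.
        proj = [sum(A[i][j] * x[j] for j in range(n_vox)) for i in range(n_meas)]
        # Ratio of measured to predicted counts (guarding against division by zero).
        ratio = [y[i] / proj[i] if proj[i] > 0 else 0.0 for i in range(n_meas)]
        # Back-project the ratio and apply the multiplicative update.
        x = [
            x[j] * sum(A[i][j] * ratio[i] for i in range(n_meas)) / sens[j]
            if sens[j] > 0 else 0.0
            for j in range(n_vox)
        ]
    return x


# Toy, noise-free problem (hypothetical): true source is [2, 0, 4].
A = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0],
     [1.0, 1.0, 1.0]]
y = [2.0, 4.0, 6.0]

# Starting estimate with the support a sparse reconstruction might supply.
x_sparse_init = mlem(A, y, x0=[1.0, 0.0, 1.0], n_iter=100)
print([round(v, 6) for v in x_sparse_init])
```

Because the update is multiplicative, entries that start at zero stay at zero, so the support chosen by the initialization is preserved throughout the iterations. In this toy problem the system matrix is rank-deficient, so the starting estimate also determines which of the non-negative solutions MLEM approaches.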
Partial volume correction of brain PET imaging: The so-called partial volume effect (PVE) is caused by the limited resolution of PET systems, reducing quantitative accuracy of PET imaging. Based on the stage of implementation, partial volume correction (PVC) algorithms can be categorized into reconstruction-based and post-reconstruction methods. Post-reconstruction PVC methods can be directly implemented on reconstructed PET images and do not require access to raw data or reconstruction algorithms of PET scanners. Many of these methods use anatomical information from MRI to further improve their performance. However, conventional MR-guided post-reconstruction PVC methods require segmentation of MR images and assume uniform activity distribution within each segmented region. In this proposal, we develop a post-reconstruction PVC method based on deconvolution via parallel level set regularization. The method is implemented with non-smooth optimization based on the split Bregman method. The proposed method incorporates MRI information without requiring segmentation or making any assumption on activity distribution. Simulation experiments are conducted to compare the proposed method with several other segmentation-free methods, as well as a conventional segmentation-based PVC method. The results show the proposed method outperforms the other segmentation-free methods and shows stronger resistance to MR information mismatch compared to the conventional segmentation-based PVC method.
Note: This is a virtual seminar that will be broadcast in Olin Hall 305. Refreshments will be available outside Olin Hall 305 at 2:30 PM.
Title: Computational infrastructure to improve scientific reproducibility
Abstract: The massive increase in the dimensionality of scientific data and the proliferation of complex data analysis methods have raised increasing concerns about the reproducibility of scientific results in many domains of science. I will first present evidence that analytic flexibility in neuroimaging research is associated with surprising variability in scientific outcomes in the wild, even holding the raw data constant. These findings motivate the development of well-tested software tools for neuroimaging data processing and analysis. I will focus in particular on the role of software development tools such as containerization and continuous integration, which provide the potential to deliver automated and reproducible data analysis at scale. I will also discuss the challenging tradeoffs inherent in the usage of complex software by scientists, and the need for increased transparency and validation of scientific software.
Bio: Russell A. Poldrack is the Albert Ray Lang Professor in the Department of Psychology and Professor (by courtesy) of Computer Science at Stanford University, and Director of the Stanford Center for Reproducible Neuroscience. His research uses neuroimaging to understand the brain systems underlying decision making and executive function. His lab is also engaged in the development of neuroinformatics tools to help improve the reproducibility and transparency of neuroscience, including the Openneuro.org and Neurovault.org data sharing projects and the Cognitive Atlas ontology.
Title: Statistical Modeling and analysis of allele-specific DNA methylation at the haplotype level
Abstract: Epigenetics is the branch of biology concerned with the study of phenotypical changes due to alterations of DNA, maintained during cell division, excluding modifications of the sequence itself. Epigenetic information includes DNA methylation, histone modifications, and higher order chromatin structure among others. DNA methylation is a stable epigenetic mechanism that chemically marks the DNA by adding methyl groups at individual cytosines immediately adjacent to guanines (CpG sites). Methylation marks are used to identify cell-type specific aspects of gene regulation, since marks located within a gene promoter or enhancer typically act to repress gene transcription, whereas promoter or enhancer demethylation is associated with gene activation. Notably, patterns of methylation marks are highly polymorphic and stochastic, containing information about a broad range of normal and aberrant biological processes, such as development and differentiation, aging, and carcinogenesis.
The epigenetic information content of two homologous chromosomal regions need not be the same. For example, it is well established that the ability of a cell to methylate the promoter region of a specific copy of a gene (an allele) is crucial for proper development. In fact, many known phenotypical traits stem from allele-specific epigenetic marks. Moreover, some allele-specific epigenetic differences have been found to be associated with local genetic differences between copies of a chromosome. Thus, developing a framework for studying such epigenetic differences in diploid organisms is our main goal. More specifically, our objective is to develop a statistical method that can detect regions in the genome where genetic differences between homologous chromosomes coincide with biologically relevant differences in DNA methylation between alleles.
State-of-the-art methods for allele-specific methylation modeling and analysis have critical shortcomings that render them unsuitable for this type of analysis. We present a statistical-physics-inspired model for allele-specific methylation analysis that contains a sensible number of parameters, considering the limited sample size in whole-genome bisulfite sequencing data, yet is rich enough to capture the complexity in the data. We demonstrate the appropriateness of this model for allele-specific methylation analysis using simulated data as well as real data. Using our model, we compute mean methylation level differences between alleles, as well as information-theoretic quantities, such as the entropy of the methylation state in each allele and the mutual information between the methylation state and the allele of origin, and assess the statistical significance of each quantity by learning the null distribution from the data. This complementary set of statistics allows for an unparalleled level of insight in subsequent biological analysis. As a result, the developed framework provides unprecedented descriptive power to characterize (i) the circumstances under which allele-specific methylation events arise, and (ii) the cis-effect, or lack thereof, that genetic mutations have on DNA methylation.
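The three kinds of statistics named above can be illustrated at a single CpG site with simple plug-in estimates from read counts. This is only a sketch under stated assumptions: the talk computes these quantities from a fitted haplotype-level model, not from raw counts, and the function name, input layout and read counts below are hypothetical.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)


def allele_stats(counts):
    """counts maps allele -> (unmethylated_reads, methylated_reads).

    Assumes every allele has at least one read. Returns plug-in estimates of
    the mean methylation level and entropy per allele, plus the mutual
    information between methylation state and allele of origin.
    """
    total = sum(u + m for u, m in counts.values())
    mean_level = {a: m / (u + m) for a, (u, m) in counts.items()}
    per_allele_entropy = {
        a: entropy([1 - mean_level[a], mean_level[a]]) for a in counts
    }
    # Marginal methylation state distribution, pooled over alleles.
    p_meth = sum(m for _, m in counts.values()) / total
    h_state = entropy([1 - p_meth, p_meth])
    # Conditional entropy H(state | allele), weighted by allele read share.
    h_cond = sum(
        (u + m) / total * per_allele_entropy[a] for a, (u, m) in counts.items()
    )
    return {
        "mean_level": mean_level,
        "entropy": per_allele_entropy,
        "mutual_information": h_state - h_cond,
    }


# Hypothetical read counts: one allele fully unmethylated, the other fully methylated.
stats = allele_stats({"allele_1": (10, 0), "allele_2": (0, 10)})
print(stats["mean_level"]["allele_1"] - stats["mean_level"]["allele_2"])
print(stats["mutual_information"])
```

In this extreme case the methylation state fully identifies the allele, so the mutual information reaches its maximum of one bit. When both alleles have identical methylation frequencies it is zero.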
Title: Exploring scalable coating of inorganic semiconductor inks: the surface structure-property-performance correlations
Abstract: Inorganic semiconductor inks – such as colloidal quantum dots (CQDs) and transition metal oxides (MOs) – can potentially enable low-cost flexible and transparent electronics via ‘roll-to-roll’ printing. Surfaces of these nanometer-sized CQDs and MO ultra-thin films lead to surface phenomena with implications for film formation during coating, crystallinity and charge transport. In this talk, I will describe my recent efforts aimed at understanding the crucial role of surface structure in these materials using photoemission spectroscopy and X-ray scattering. Time-resolved X-ray scattering helps reveal the various stages of the CQD ink-to-film transformation during blade-coating. Interesting insights include evidence of an early onset of CQD nucleation toward self-assembly and superlattice formation. I will close by discussing fresh results which suggest that nanoscale morphology significantly impacts charge transport in MO ultra-thin (≈5 nm) films. Control over crystallographic texture and film densification allows us to achieve high-performing (electron mobility ≈40 cm² V⁻¹ s⁻¹), blade-coated MO thin-film transistors.
Bio: Dr. Ahmad R. Kirmani is a Guest Researcher in the Materials Science and Engineering Division, National Institute of Standards and Technology (NIST) in the group of Dr. Dean M. DeLongchamp and Dr. Lee J. Richter. He is exploring scalable coating of inorganic semiconductor inks using X-ray scattering. He received his PhD in Materials Science and Engineering from the King Abdullah University of Science and Technology (KAUST) under the supervision of Prof. Aram Amassian in 2017 for probing the surface structure-property relationship in colloidal quantum dot photovoltaics. He has published 30 articles in high-impact journals such as Advanced Materials, ACS Energy Letters and the Nature family. He has also been a volunteer science writer for the Materials Research Society (MRS) for the last couple of years and has contributed 10 news articles, opinions and perspectives.
Title: Electrets (Dielectrics with quasi-permanent Charges or Dipoles) – A long history and a bright future
Abstract: The history of electrets can be traced back to Thales of Miletus (approx. 624-546 B.C.E.) who reported that pieces of amber (“electron”) attract or repel each other. The science of fundamental electrical phenomena is closely intertwined with the development of electrets which came under such terms as “electrics”, “electrophores”, “charged/poled dielectrics”, etc. until about one century ago. Modern electret research started with Oliver Heaviside (1850-1925), who defined the concept of a “permanently electrized body” and proposed the name “electret” in 1885, and Mototarô Eguchi, who experimentally investigated carnauba wax electrets at the Higher Naval College in Tokyo around 1920. Today, we see a wide range of electret types, electret materials, and electret applications, which are being investigated and developed all over the world in a truly global endeavour. A classification of electrets will be followed by a few examples of useful electret effects and exciting device applications – mainly in the area of electro-mechanical and electro-acoustical transduction which started with the invention of the electret microphone by Sessler and West in the early 1960s. Furthermore, possible synergies between electret research and ultra-high-voltage DC electrical insulation will be mentioned.
Bio: Reimund Gerhard is a Professor of Physics and Astronomy at the University of Potsdam and the current President of the IEEE Dielectrics and Electrical Insulation Society (DEIS). He graduated from the Technical University of Darmstadt as Diplom-Physiker in 1978 and earned his PhD (Doktor-Ingenieur) in Communications Engineering from TU Darmstadt in 1984. From 1985 to 1994, Gerhard was a Research Scientist and Project Manager at the Heinrich-Hertz Institute for Communications Technology (now the Fraunhofer Institute) in Berlin, Germany. He was appointed as a Professor at the University of Potsdam in 1994. From 2004 to 2012, Gerhard served as the Chairman of the Joint Board for the Master-of-Science Program in Polymer Science of FU Berlin, HU Berlin, TU Berlin, and the University of Potsdam. He also served as the Dean of the Faculty of Science at the University of Potsdam from 2008 to 2012, eventually serving as a Senator of the University of Potsdam from 2014 to 2016.
Prof. Gerhard has received many awards and honors over his long career, including an Award (ITG-Preis) from the Information Technology Society (ITG) in the VDE, a silver medal from the Foundation Werner-von-Siemens-Ring, a First Prize Technology Transfer Award Brandenburg, Whitehead Memorial Lecturer of the IEEE CEIDP, and the Award of the EuroEAP Society “for his fundamental scientific contributions in the field of transducers based on dielectric polymers.” He is a Fellow of the American Physical Society (APS) and the Institute of Electrical and Electronics Engineers (IEEE). His research interests include polymer electrets with quasi-permanent space charge, ferro- or piezoelectrets (polymer films with electrically charged cavities), ferroelectric polymers with piezo- and pyroelectric properties, polymer composites with novel property combinations, physical mechanisms of dipole orientation and charge storage, electrically deformable dielectric elastomers (sometimes also called “electro-electrets”), as well as the physics of musical instruments.
Note: There will be a reception after the lecture.
Title: A Theory and Practice of the Lifelong Learnable Forest
Abstract: Since Vapnik’s and Valiant’s seminal papers on learnability, various lines of research have generalized their concepts of learning and learners. In this paper, we formally define what it means to be a lifelong learner. Given this definition, we propose the first lifelong learning algorithm with theoretical guarantees that it can perform forward transfer and reverse transfer, while not experiencing catastrophic forgetting. Our algorithm, dubbed Lifelong Learning Forests, outperforms the current state-of-the-art deep lifelong learning algorithm on the CIFAR 10-by-10 challenge problem, despite its simplicity and mathematical tractability. Our approach lends itself immediately to further algorithmic developments that promise to exceed the performance limits of existing approaches.
Title: A Practical and Efficient Multi-Stream Framework for End-to-End Speech Recognition
Abstract: The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. In these cases, an appropriate strategy to fuse streams or select the most informative source is necessary. In recent years, with the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this proposal, a multi-stream framework is presented based on a joint CTC/Attention E2E model, where parallel streams are represented by separate encoders aiming to capture diverse information. On top of the regular attention networks, a secondary stream-fusion network is introduced to steer the decoder toward the most informative encoders.
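One way to picture the stream-fusion step is as a second attention layer over the per-encoder context vectors: each stream receives a relevance score, the scores are normalized with a softmax, and the decoder consumes the weighted sum. The sketch below assumes that form. In the actual model the scores would come from a learned network conditioned on the decoder state, whereas here they are hypothetical inputs.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def fuse_streams(contexts, scores):
    """Combine per-encoder context vectors using stream-level attention weights.

    contexts: one context vector (list of floats) per encoder, equal lengths.
    scores:   one relevance score per encoder (hypothetical; learned in practice).
    """
    weights = softmax(scores)
    dim = len(contexts[0])
    fused = [sum(w * c[d] for w, c in zip(weights, contexts)) for d in range(dim)]
    return weights, fused


# Two hypothetical streams; the first is scored as more informative.
weights, fused = fuse_streams(
    contexts=[[1.0, 0.0], [0.0, 1.0]],
    scores=[2.0, 0.0],
)
print(weights, fused)
```

With equal scores the fusion reduces to a plain average of the streams. A large score gap approaches hard selection of a single encoder, which matches the "fuse or select" choice described above.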
Two representative frameworks have been proposed: Multi-Encoder Multi-Resolution (MEM-Res) and Multi-Encoder Multi-Array (MEM-Array). Moreover, because an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training scheme is further proposed in this work. Experiments are conducted on various corpora including Wall Street Journal (WSJ), CHiME-4, DIRHA and AMI. Compared with the best single-stream performance, the proposed framework achieves substantial improvement and also outperforms various conventional fusion strategies.
The future plan aims to improve the robustness of the proposed multi-stream framework. Measuring the performance of an ASR system without ground truth could be beneficial in multi-stream scenarios, making it possible to emphasize more informative streams over corrupted ones. In this proposal, four different Performance Monitoring (PM) techniques are investigated. The preliminary results suggest that PM measures on attention distributions and decoder posteriors are well correlated with true performance. Integration of PM measures and more sophisticated fusion mechanisms into the multi-stream framework will be the focus of future exploration.
Title: Automated Spore Analysis Using Bright-Field Imaging and Raman Microscopy
Abstract: In 2015, it was determined that the United States Department of Defense had been shipping samples of B. anthracis spores which had undergone gamma irradiation but were not fully inactivated. In the aftermath of this event, alternative and orthogonal methods were investigated to analyze spores and determine their viability. In this thesis, we demonstrate a novel analysis technique that combines bright-field microscopy images with Raman chemical microscopy.
We first developed an image segmentation routine based on the watershed method to locate individual spores within bright-field images. This routine was able to effectively demarcate 97.4% of the Bacillus spores within the bright-field images with minimal over-segmentation. Size and shape measurements, to include major and minor axis and area, were then extracted for 4048 viable spores which showed very good agreement with previously published values. When similar measurements were taken on 3627 gamma-irradiated spores, a statistically significant difference was noted for the minor axis length, ratio of major to minor axis, and total area when compared to the non-irradiated spores. Classification results show the ability to correctly classify 67% of viable spores with an 18% misclassification rate using the bright-field image by thresholding the minimum classification length.
Raman chemical imaging microscopy (RCIM) was then used to measure populations of viable, gamma-irradiated, and autoclaved spores of B. anthracis Sterne, B. atrophaeus, B. megaterium, and B. thuringiensis kurstaki. Significant spectral differences were observed between viable and inactivated spores due to the disappearance of features associated with calcium dipicolinate after irradiation. Principal component analysis showed the ability to distinguish viable spores of B. anthracis Sterne and B. atrophaeus from each other and from the other two Bacillus species.
Finally, Raman microscopy was used to classify mixtures of viable and gamma-inactivated spores. A technique was developed that uses the size and shape characteristics obtained from the bright-field image to preferentially target viable spores. A practical demonstration of the technique was performed on a field of view containing approximately 7,000 total spores, of which only 12 were viable, to simulate a sample that was not fully irradiated. Ten of these viable spores were properly classified while interrogating just 25% of the total spores.