Calendar

May
26
Tue
Dissertation Defense: Sonia Joy
May 26 @ 2:00 pm

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 1:45 PM EDT.

Title: Sparsity and Structure in UWB Synthetic Aperture Radar

Abstract: Synthetic aperture radar (SAR) is a form of radar that uses the motion of the radar to simulate a large antenna in order to create high-resolution imagery. Low-frequency ultra-wideband (UWB) SARs in particular use low frequencies and a large bandwidth, which provide them with penetration capabilities and high resolution. UWB SARs are typically used for near-field imaging applications such as foliage penetration, through-the-wall imaging, and ground penetration. SAR imaging is traditionally done by matched filtering, that is, by applying the adjoint of the projection operator that maps from the image to the SAR data. Matched-filter imaging suffers from disadvantages such as sidelobe artifacts, poor resolution of point targets, and lack of robustness to noise and missing data. Regularized imaging with sparsity priors is found to be advantageous; however, regularized imaging is implemented as an iterative process in which projections between the image domain and the data domain must be done many times. The projection operations (backprojection and reprojection) are highly complex; a brute-force implementation has a complexity of O(N^3). In this dissertation, a fast implementation of backprojection and reprojection is investigated. The implementation is explored in the context of regularized imaging as well as compressive sensing SAR.

The second part of the dissertation deals with a problem pertinent to UWB SAR imaging. The VHF/UHF bands used by UWB SAR are shared with other communication systems, which poses two problems: (i) RF interference (RFI) from other sources and (ii) missing spectral bands, because transmission is prohibited in certain bands. The first problem is addressed by using sparse and/or low-rank modeling. The SAR data is modeled as sparse, and the projection operator from above is used to capture that sparsity. The RFI is modeled as either sparse with respect to an appropriate dictionary or low-rank. Sparse estimation, or joint sparse and low-rank estimation, is then used to estimate the SAR signal and the RFI simultaneously. It is demonstrated that the new methods perform much better than traditional RFI mitigation techniques such as notch filtering. The missing frequency problem can be modeled as a special case of compressive sensing. Sparse estimation is applied to the data to recover the missing frequencies. Simulations show that the sparse estimation is robust to large spectral gaps.

Jun
4
Thu
Seminar: Carlos Castillo
Jun 4 @ 12:00 pm

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 11:45 AM EDT.

Title: Deep Learning for Face and Behavior Analytics

Abstract: In this talk I will describe the AI systems we have built for face analysis and complex activity detection. I will describe SfSNet, a DCNN that produces an accurate decomposition of an unconstrained image of a human face into shape, reflectance, and illuminance. We present a novel architecture that mimics Lambertian image formation and a training scheme that uses a mixture of labeled synthetic and unlabeled real-world images. I will describe our results on the properties of DCNN-based identity features for face recognition. I will show how DCNN features trained on in-the-wild images form a highly structured organization of image and identity information. I will also describe our results comparing the performance of our state-of-the-art face recognition systems to that of super recognizers and forensic face examiners.

I will describe our system for detecting complex activities in untrimmed security videos. In these videos the activities happen in small areas of the frame and some activities are quite rare. Our system is faster than real time, very accurate and works well with visible spectrum and IR cameras. We have defined a new approach to compute activity proposals.

I will conclude by highlighting future directions of our work.

Bio: Carlos D. Castillo is an assistant research scientist at the University of Maryland Institute for Advanced Computer Studies (UMIACS). He has done extensive work on face and activity detection and recognition for over a decade and has both industry and academic research experience. He received his PhD in Computer Science from the University of Maryland, College Park, where he was advised by Dr. David Jacobs. During the past 5 years he has been involved with the UMD teams in IARPA JANUS, IARPA DIVA, and DARPA L2M. He was a recipient of the best paper award at the International Conference on Biometrics: Theory, Applications and Systems (BTAS) 2016. The software he developed under IARPA JANUS has been transitioned to many USG organizations, including the Department of Defense, the Department of Homeland Security, and the Department of Justice. In addition, the UMD JANUS system is being used operationally by the Homeland Security Investigations (HSI) Child Exploitation Investigations Unit to provide investigative leads in identifying and rescuing child abuse victims, as well as catching and prosecuting criminal suspects. The technologies his team developed provided the technical foundations for a spinoff startup company, Mukh Technologies LLC, which creates software for face detection, alignment, and recognition. In 2018, Dr. Castillo received the Outstanding Innovation of the Year Award from the UMD Office of Technology Commercialization. His current research interests include face and activity detection and recognition, and deep learning.

Jun
18
Thu
Dissertation Defense: Yansong Zhu
Jun 18 @ 1:00 pm

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 12:45 PM EDT. 

Title: Improved Modeling and Image Generation for Fluorescence Molecular Tomography (FMT) and Positron Emission Tomography (PET)

Abstract: In this thesis, we aim to improve quantitative medical imaging with advanced image generation algorithms. We focus on two specific imaging modalities: fluorescence molecular tomography (FMT) and positron emission tomography (PET).

In the case of FMT, we present a novel photon propagation model for its forward model, and in addition, we propose and investigate a reconstruction algorithm for its inverse problem. In the first part, we develop a novel Neumann-series-based radiative transfer equation (RTE) that incorporates reflection boundary conditions in the model. In addition, we propose a novel reconstruction technique for diffuse optical imaging that incorporates this Neumann-series-based RTE as the forward model. The proposed model is assessed using a simulated 3D diffuse optical imaging setup, and the results demonstrate the importance of considering photon reflection at boundaries when performing photon propagation modeling. In the second part, we propose a statistical reconstruction algorithm for FMT. The algorithm is based on sparsity-initialized maximum-likelihood expectation maximization (MLEM), taking into account the Poisson nature of the data in FMT and the sparse nature of the images. The proposed method is compared with a pure sparse reconstruction method as well as a uniform-initialized MLEM reconstruction method. Results indicate that the proposed method is more robust to noise and shows improved qualitative and quantitative performance.
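
For readers unfamiliar with MLEM, the following is a minimal sketch of the standard multiplicative MLEM update for Poisson data, not the thesis implementation. The system matrix `A`, the toy data, and the function name `mlem` are assumptions made for illustration. The sparsity-based initialization mentioned in the abstract is not specified there, so the starting estimate is left as the optional parameter `x0` and defaults to a uniform image.

```python
def mlem(A, y, n_iter=200, x0=None):
    """Plain MLEM for y ~ Poisson(A x), with non-negative A and y.

    A is a list of rows (measurements x unknowns); x0 is an optional
    starting image (hypothetical hook for a sparsity-based initial guess).
    """
    n_rows, n_cols = len(A), len(A[0])
    # Sensitivity image: column sums of A (A^T applied to a vector of ones).
    sens = [sum(A[i][j] for i in range(n_rows)) for j in range(n_cols)]
    x = list(x0) if x0 is not None else [1.0] * n_cols
    for _ in range(n_iter):
        # Forward-project the current estimate.
        proj = [sum(A[i][j] * x[j] for j in range(n_cols)) for i in range(n_rows)]
        # Ratio of measured to predicted counts (guarding against division by zero).
        ratio = [y[i] / proj[i] if proj[i] > 0 else 0.0 for i in range(n_rows)]
        # Back-project the ratio and apply the multiplicative update.
        x = [x[j] / sens[j] * sum(A[i][j] * ratio[i] for i in range(n_rows))
             for j in range(n_cols)]
    return x


# Toy example: 3 measurements, 2 unknowns, noiseless data.
A = [[0.8, 0.1], [0.1, 0.8], [0.1, 0.1]]
x_true = [5.0, 1.0]
y = [sum(a * v for a, v in zip(row, x_true)) for row in A]
x_hat = mlem(A, y)
```

Because the update is multiplicative, a non-negative starting image stays non-negative, and any pixel initialized to zero remains zero. This is one reason the choice of initialization matters for sparse images.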

For PET, we present an MRI-guided partial volume correction algorithm for brain imaging, aiming to recover the qualitative and quantitative loss due to the limited resolution of the PET system while keeping image noise at a low level. The proposed method is based on an iterative deconvolution model with regularization using parallel level sets. A non-smooth optimization algorithm is developed so that the proposed method can be feasibly applied to 3D images and avoids the additional blurring caused by conventional smooth optimization processes. We evaluate the proposed method using both simulation data and in vivo human data collected from the Baltimore Longitudinal Study of Aging (BLSA). Our proposed method is shown to generate images with reduced noise and improved structural detail, as well as an increased number of statistically significant voxels in the study of aging. Results demonstrate that our method has promise to provide superior performance in clinical imaging scenarios.

Thesis Committee

  • Arman Rahmim, Department of Electrical and Computer Engineering, Department of Radiology and Radiological Sciences (advisor, primary reader)
  • Yong Du, Department of Radiology and Radiological Sciences (secondary reader)
  • Jin Kang, Department of Electrical and Computer Engineering
  • Trac Tran, Department of Electrical and Computer Engineering

Jul
28
Tue
Dissertation Defense: Ben Skerritt-Davis
Jul 28 @ 10:00 am

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 9:45 AM EDT.

Title: Statistical Inference in Auditory Perception

Abstract: The human auditory system effortlessly parses complex sensory inputs despite the ever-present randomness and uncertainty in real-world scenes. To achieve this, the brain tracks sounds as they evolve in time, collecting contextual information to construct an internal model of the external world for predicting future events. Previous work has shown the brain is sensitive to many predictable (and often complex) patterns in sequential sounds. However, real-world environments exhibit a broader spectrum of predictability, and moreover, the level of predictability is constantly in flux. How does the brain build robust internal representations of such stochastic and dynamic acoustic environments?

This question is addressed through the lens of a computational model based in statistical inference. Embodying theories from Bayesian perception and predictive coding, the model posits the brain collects statistical estimates from sounds and maintains multiple hypotheses for the degree of context to include in predictive processes. As a potential computational solution for perception of complex and dynamic sounds, this model is used to connect sensory inputs with listeners’ responses in a series of human behavioral and electroencephalography (EEG) experiments incorporating uncertainty. Experimental results point toward the underlying sufficient statistics collected by the brain, and the extension of these statistical representations to multiple dimensions is examined along spectral and spatial dimensions. The computational model guides interpretation of behavioral and neural responses, revealing multiplexed responses in the brain corresponding to different levels of predictive processing. In addition, the model is used to explain individual differences across listeners highlighted by uncertainty.

The proposed computational model was developed based on first principles, and its usefulness is not limited to the experiments presented here. The model was used to replicate a range of previous findings in the literature, unifying them under a single framework. Moving forward, this general and flexible model can be used as a broad-ranging tool for studying the statistical inference processes behind auditory perception, overcoming the need to minimize uncertainty in perceptual experiments and pushing what was previously considered feasible for study in the laboratory towards what is typically encountered in the “messy” environments of everyday listening.

Committee Members

Mounya Elhilali, Department of Electrical and Computer Engineering

Jason Fischer, Department of Psychological & Brain Sciences

Hynek Hermansky, Department of Electrical and Computer Engineering

James West, Department of Electrical and Computer Engineering

Aug
21
Fri
Dissertation Defense: Gary Li
Aug 21 @ 11:00 am

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 10:45 AM EDT.

Title: Task-based Optimization of Administered Activity for Pediatric Renal SPECT Imaging

Abstract: Like any real-world problem, the design of an imaging system always requires tradeoffs. For medical imaging modalities using ionizing radiation, a major tradeoff is between diagnostic image quality (IQ) and risk to the patient from absorbed dose (AD). In nuclear medicine, reducing the AD requires reducing the administered activity (AA). Lower AA to the patient can reduce risk and adverse effects, but can also result in reduced diagnostic image quality. Thus, ultimately, it is desirable to use the lowest AA that gives sufficient image quality for accurate clinical diagnosis.

In this dissertation, we proposed and developed tools for a general framework for optimizing RD with task-based assessment of IQ. Here, IQ is defined as an objective measure of how well the user performs the diagnostic task for which the images were acquired. To investigate IQ as a function of renal defect detectability, we have developed a projection image database modeling imaging of 99mTc-DMSA, a renal function agent. The database uses a highly realistic population of pediatric phantoms with anatomical and body morphological variations. Using the developed projection image database, we have explored patient factors that affect IQ and are currently in the process of determining relationships between IQ and AA in terms of these factors. Our data have shown that factors that are more local to the target organ may be more robust than weight for estimating the AA needed to provide a constant IQ across a population of patients. In the case of renal imaging, we have discovered that girth is more robust than weight (currently used in clinical practice) in predicting the AA needed to provide a desired IQ. In addition to exploring the patient factors, we also worked on improving the task-simulation capability of anthropomorphic model observers. We proposed a deep learning-based anthropomorphic model observer to fully and efficiently (in terms of both training data and computational cost) model the clinical 3D detection task using multi-slice, multi-orientation image sets. The proposed model observer could be readily adapted to model human observer performance on detection tasks for other imaging modalities such as PET, CT, or MRI.

Committee Members

Eric Frey – Department of Radiology and Radiological Science. Faculty adviser.

Yong Du – Department of Radiology and Radiological Science. Second reader.

Vishal Patel – Department of Electrical and Computer Engineering.

George Sgouros – Department of Radiology and Radiological Science.

Archana Venkataraman – Department of Electrical and Computer Engineering.

Dissertation Defense: Nathan Henry
Aug 21 @ 11:00 am

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 10:45 AM EDT.

Title: Mid-Infrared and Terahertz Frequency Combs from Quantum Cascade Lasers

Abstract: Optical frequency combs (FCs) allow for extremely high-resolution and broadband spectroscopic measurements that are captured contemporaneously rather than through some scanning action. Spectroscopic access to the infrared and THz is highly coveted, as many molecular resonances lie in this region. However, due to a lack of available materials, emission of FCs in the IR has been difficult, with many attempts resulting in low power and efficiency. In 2014 [1] the first mid-IR FC was characterized from a free-running QCL, requiring no extra elements. However, due to the inherently short upper-state lifetime of the laser, the FC is atypical in that it is not characterized by pulses but rather by frequency modulation (FM). While the QCL FC has advanced significantly, it is not fully understood. As a result, spectroscopic measurements can become unreliable and sensitive to environmental changes, and recovery of absolute frequency can be difficult.

To better understand the FC QCL, a set of rate equations adapted from the optical Bloch equations is developed and found to be fully adequate for describing the origins and dynamics of FM FCs. This work addresses two modes of operation (pseudo-random and chirped FM), calculating the dynamics of a QCL modeled after real-world measurements. Using specifications of real-world QCLs (THz and IR), the gain is modeled under various operational scenarios and the most efficient state is identified. The period of the FM is postulated to be determined by the relative strengths of the various hole-burning mechanisms, and stability is shown for multiple regimes.

Further work is presented addressing the stability of QCL FCs. We begin by deriving the linewidth of the FC-generating QCL and show that it can indeed be just as narrow as that of more conventional FCs. Subsequent to this work, we use a two-dimensional model to achieve an engineered power-law dispersion, which can mitigate offset frequency drift, offering the potential to significantly lower the phase noise. It is the hope of the author that this research will be used to develop a deeper understanding of FC-producing QCLs that contribute to many fields of human endeavor, such as medical diagnostics, remote sensing, and time standardization.

Committee Members

Jacob Khurgin – Department of Electrical and Computer Engineering. Adviser.

Susanna Thon – Department of Electrical and Computer Engineering.

Amy Foster – Department of Electrical and Computer Engineering.

Sep
4
Fri
Dissertation Defense: Yida Lin
Sep 4 @ 1:30 pm

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 1:15 PM EDT.

Title: Extending the Potential of Thin-film Optoelectronics via Optical Engineering

Abstract: Optoelectronics based on nanomaterials have become a research focus in recent years, and this technology bridges the fields of solid-state physics, electrical engineering and materials science. The rapid development in optoelectronic devices in the last century has both benefited from and spurred advancements in the science and engineering of photon detection and manipulation, image sensing, high-efficiency and high-power-density light emission, displays, communications and renewable energy harvesting. A particularly promising material class for optoelectronics is colloidal nanomaterials, due to their functionality, cost-efficiency and even new physics, thanks to their exotic properties in the areas of light-matter interaction, low-dimensionality, and solution-processability which dramatically reduces the time and cost required to fabricate thin film devices, and at the same time provides wide compatibility with existing materials interfaces and device structures. This thesis focuses on exploring and assessing the capabilities of lead sulfide quantum dot-based solar cells and photodetectors. The discussion involves advances in techniques such as implementing novel photonic structures, designing and building novel characterization systems and methods, and coupling to external optical structures and components.

This thesis comprises three sections. The first section focuses on the design and adaptation of photonic structures to tailor the function and response of photovoltaics and other absorption-based optoelectronics for specific applications. In the first part, we introduce consideration of complete multi-layer thin film interference effects into the design of solar cells. By numerical calculation and optimization of the film thicknesses, together with precise fabrication control, devices with specific target colors or optical transparency levels were achieved. In the second part, we investigate the presence of 2D photonic crystal bands in absorbing materials that can be readily incorporated into nanomaterial thin films through nanostructuring of the material. We carried out simulations and theoretical analyses and proposed a method to realize simultaneous selectivity in the device reflection, transmission and absorption spectra that are critical for optoelectronic applications.

The next section focuses on designing and building a multi-modal microscopy system for thin-film optoelectronic devices, accompanied by analyses and explanation of complex experimental data. The goal of the system was to provide simultaneous 2D spatial measurements of, among other quantities, photoluminescence spectra and time-resolved photocurrent and photovoltage responses, along with combinations of these measurements and their associated derived quantities, collected with micrometer resolution. The multi-dimensional data helped us understand the intercorrelation between local defective regions in films and the entire device behavior, and provided a more comprehensive profile of the mutual relationships between solar cell figures of merit.

In the last section, we discuss a new implementation of miniature solar concentrator arrays for lead sulfide quantum dot solar cells. First, we design and analyze the effects of a medium concentration ratio lens-type concentrator made from polydimethylsiloxane, a flexible organosilicon polymer. The concentrators were designed and optimized with the aid of ray-tracing simulation tools to achieve the best compatibility with colloidal nanomaterial-based solar cells. Experimentally, we produced an integrated concentrator system delivering 20-fold current and power enhancements close to the theoretical predictions, and also used our concentrator measurements to explain the rarely explored carrier dynamics critical to high-power operation of thin film solar cells. Next, we design a wide-acceptance-angle dielectric solar concentrator that can be adapted to many types of high-efficiency small-area solar cells. The design was generated using rigorous optical models that define the behaviors of light rays and was verified with ray-tracing optical simulations to yield results for the full annual 2D time-resolved collectible power for the resulting system. Finally, we discuss strategies for further extending the possibilities of nanomaterial-based optoelectronics for future challenges in energy production and related applications.

Committee Members

Susanna Thon – Department of Electrical and Computer Engineering

Jacob Khurgin – Department of Electrical and Computer Engineering

Mark Foster – Department of Electrical and Computer Engineering

Oct
16
Fri
Dissertation Defense: Golnoosh Kamali
Oct 16 @ 12:00 pm

Note: This is a virtual presentation. Here is the link for where the presentation will be taking place.

Title: Transfer function models of cortico-cortical evoked potentials for the localization of seizures in medically refractory epilepsy patients

Abstract: Surgical resection of the seizure onset zone (SOZ) could potentially lead to seizure freedom in medically refractory epilepsy (MRE) patients. However, localizing the SOZ is a time-consuming, subjective process involving visual inspection of intracranial electroencephalographic (iEEG) recordings captured during invasive passive patient monitoring. Cortical stimulation is currently performed on patients undergoing invasive EEG monitoring for the main purpose of mapping functional brain networks such as language and motor networks. We hypothesized that the evoked responses from single pulse electrical stimulation (SPES) can be used to localize the SOZ, as they may express the natural frequencies and connectivity of the iEEG network. We constructed patient-specific transfer function models from evoked responses recorded from 22 MRE patients who underwent SPES evaluation and iEEG monitoring. We then computed the frequency- and connectivity-dependent “peak gain” of the system, as measured by the H_∞ norm from systems theory, and the corresponding “floor gain,” which is the gain at which the H_∞ dipped 3 dB below the DC gain. In cases for which clinicians had high confidence in localizing the SOZ, the highest peak gain transfer functions with the smallest “floor gains” corresponded to when the clinically annotated SOZ and early spread regions were stimulated. In more complex cases, there was a large spread of the peak gains when the clinically annotated SOZ was stimulated. Interestingly, for patients who had successful surgeries, our ratio of peak-to-floor (PF) gains agreed with clinical localization, no matter the complexity of the case. For patients with failed surgeries, the PF ratio did not match clinical annotations. Our findings suggest that transfer function gains and their corresponding frequency responses computed from SPES evoked responses may improve SOZ localization and thus surgical outcomes.
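
As background on the "peak gain" idea, here is a minimal sketch that approximates the H_∞ norm of a stable single-input, single-output transfer function by sweeping its gain over a frequency grid and compares it with the DC gain. The second-order example system, the grid limits, and the function names are assumptions for illustration; the patient-specific models and the exact "floor gain" computation are not described in the abstract and are not reproduced here.

```python
import math


def gain(num, den, w):
    """|H(jw)| for H(s) = num(s)/den(s); coefficients in descending powers of s."""
    s = complex(0.0, w)

    def polyval(coeffs):
        acc = 0j
        for c in coeffs:
            acc = acc * s + c  # Horner's rule
        return acc

    return abs(polyval(num) / polyval(den))


def peak_gain(num, den, w_min=1e-2, w_max=1e2, n=20001):
    """Approximate the H-infinity norm as the largest gain on a log-spaced grid."""
    best_gain, best_w = 0.0, w_min
    for k in range(n):
        w = w_min * (w_max / w_min) ** (k / (n - 1))
        g = gain(num, den, w)
        if g > best_gain:
            best_gain, best_w = g, w
    return best_gain, best_w


# Hypothetical lightly damped second-order system (not a patient model).
zeta, wn = 0.1, 1.0
num = [wn ** 2]
den = [1.0, 2 * zeta * wn, wn ** 2]

peak, w_peak = peak_gain(num, den)
dc = gain(num, den, 0.0)
# Closed-form resonance peak for this system, valid for zeta < 1/sqrt(2).
analytic_peak = 1.0 / (2 * zeta * math.sqrt(1 - zeta ** 2))
```

A grid sweep only lower-bounds the true norm, so the grid must be fine enough near sharp resonances. Here the swept peak can be checked against the closed-form value `analytic_peak`.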

Committee Members

Sridevi V. Sarma, Department of Biomedical Engineering

Joon Y. Kang, Department of Neurology

Archana Venkataraman, Department of Electrical and Computer Engineering

Nathan E. Crone, Department of Neurology

Oct
23
Fri
Dissertation Defense: Gaspar Tognetti
Oct 23 @ 2:00 pm

Note: This is a virtual presentation. Here is the link for where the presentation will be taking place.

Title: Circuits and Architecture for Bio-Inspired AI Accelerators

Abstract: Technological advances in microelectronics envisioned through Moore’s law have led to more powerful processors that can handle complex and computationally intensive tasks. Nonetheless, these advancements through technology scaling have come at an unfavorable cost of significantly larger power consumption, which has posed challenges for data processing centers and computers at scale. Moreover, with the emergence of mobile computing platforms constrained by power and bandwidth for distributed computing, the necessity for more energy-efficient scalable local processing has become more significant.

Unconventional Compute-in-Memory (CiM) architectures such as the analog winner-takes-all associative-memory, the Charge-Injection Device (CID) processor, and analog-array processing have been proposed as alternatives. Unconventional charge-based computation has been employed for neural network accelerators in the past, where impressive energy efficiency per operation has been attained in 1-bit vector-vector multiplications (VMMs), and in recent work, multi-bit vector-vector multiplications. A similar approach was used in earlier work, where a charge-injection device array was utilized to store binary coded vectors, and computations were done using binary or multi-bit inputs in the charge domain; computation is carried out by counting quanta of charge at the thermal noise limit, using packets of about 1000 electrons. These systems are neither analog nor digital in the traditional sense but employ mixed-signal circuits to count the packets of charge and hence we call them Quasi-Digital. By amortizing the energy costs of the mixed-signal encoding/decoding over compute-vectors with a large number of elements, high energy efficiencies can be achieved.

In this dissertation, I present a design framework for AI accelerators using scalable compute-in-memory architectures. On the device level, two primitive elements are designed and characterized as target storage technologies: (i) a multilevel non-volatile computational cell and (ii) a pseudo Dynamic Random-Access Memory (pseudo-DRAM) computational bit-cell. Experimental results in deep-submicron CMOS processes demonstrate successful operation; subsequently, behavioral models were developed and employed in large-scale system simulations and emulations. Thereafter, at the level of circuit description, compute-in-memory crossbars and mixed-signal circuits were designed, allowing seamless connectivity to digital controllers. At the level of data representation, both binary and stochastic-unary coding are used to compute Vector-Vector Multiplications (VMMs) at the array level, demonstrating successful experimental results and providing insight into the integration requirements that larger systems may demand. Finally, on the architectural level, two AI accelerator architectures for data center processing and edge computing are discussed. Both designs are scalable multi-core Systems-on-Chip (SoCs), where vector-processor arrays are tiled on a 2-layer Network-on-Chip (NoC), enabling neighbor communication and a flexible compute vs. memory trade-off. General-purpose Arm/RISC-V co-processors provide adequate bootstrapping and system housekeeping, and a high-speed interface fabric facilitates Input/Output to main memory.
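
To give a sense of how a stochastic coding can turn multiplication into simple counting, the sketch below uses one textbook form of stochastic computing: each value in [0, 1] is encoded as a random bitstream, and the product is estimated by counting the ones in the bitwise AND of two independent streams. This is an assumption chosen for illustration. The dissertation's actual stochastic-unary coding and charge-domain circuits are not described in the abstract, and all names and parameters here are hypothetical.

```python
import random


def to_bitstream(p, n_bits, rng):
    """Encode p in [0, 1] as n_bits random bits, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n_bits)]


def stochastic_multiply(a, b, n_bits=20000, seed=0):
    """Estimate a * b by counting ones in the AND of two independent bitstreams."""
    rng = random.Random(seed)
    stream_a = to_bitstream(a, n_bits, rng)
    stream_b = to_bitstream(b, n_bits, rng)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / n_bits


def stochastic_dot(u, v, n_bits=20000, seed=0):
    """Vector-vector multiplication as a sum of stochastic products."""
    return sum(stochastic_multiply(a, b, n_bits, seed + k)
               for k, (a, b) in enumerate(zip(u, v)))


estimate = stochastic_multiply(0.5, 0.25)               # exact product is 0.125
dot_estimate = stochastic_dot([0.5, 0.2], [0.25, 0.5])  # exact value is 0.225
```

The estimate is noisy, with an error that shrinks roughly as the inverse square root of the stream length, which is the usual accuracy versus energy trade-off in this style of computation.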

Committee Members

Andreas Andreou, Department of Electrical and Computer Engineering

Ralph Etienne-Cummings, Department of Electrical and Computer Engineering

Philippe Pouliquen, Department of Electrical and Computer Engineering

Dissertation Defense: Ruizhi Li
Oct 23 @ 2:00 pm

Note: This is a virtual presentation. Here is the link for where the presentation will be taking place.

Title: An Efficient and Robust Multi-Stream Framework for End-to-End Speech Recognition

Abstract: In voice-enabled domestic or meeting environments, distributed microphone arrays aim to convert distant-speech interaction into text with high accuracy. However, in the presence of dynamic corruption from noise, reverberation, or human movement, there is no guarantee that any microphone array (stream) is constantly informative. In these cases, an appropriate strategy to dynamically fuse streams or select the most informative array is necessary.

The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. Such streams could be defined as microphone arrays, frequency bands, various modalities, etc. Hence, robust stream fusion is crucial to emphasize more informative streams over corrupted ones, especially under unseen conditions. This thesis focuses on improving the performance and robustness of speech recognition in multi-stream scenarios.

In recent years, with the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this thesis, a multi-stream framework is presented based on the joint Connectionist Temporal Classification/Attention (CTC/ATT) E2E model, where parallel streams are represented by separate encoders. On top of the regular attention networks, a secondary stream-fusion network is introduced to steer the decoder toward the most informative streams. Two representative frameworks are proposed: Multi-Encoder Multi-Array (MEM-Array) and Multi-Encoder Multi-Resolution (MEM-Res).
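
The stream-fusion idea can be illustrated with a minimal sketch: each stream contributes a context vector, a score per stream is turned into a weight with a softmax, and the decoder receives the weighted sum. The scores are hard-coded here as an assumption. In an attention-based model they would typically come from a learned network conditioned on the decoder state, and the thesis architecture itself is not reproduced.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(contexts, scores):
    """Weighted sum of per-stream context vectors using softmax stream weights."""
    weights = softmax(scores)
    dim = len(contexts[0])
    fused = [sum(w * c[d] for w, c in zip(weights, contexts)) for d in range(dim)]
    return fused, weights


# Two hypothetical streams (for example, two microphone arrays).
contexts = [[1.0, 0.0], [0.0, 1.0]]
scores = [2.0, 0.0]  # stream 0 is judged more informative
fused, weights = fuse_streams(contexts, scores)
```

With these toy inputs the fused vector leans toward the first stream. If one stream becomes corrupted, lowering its score shifts the weight to the other stream without changing the rest of the decoder.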

The MEM-Array model aims at improving far-field ASR robustness using multiple microphone arrays, each processed by a separate encoder. Because an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training strategy is designed to address these issues. Furthermore, a two-stage augmentation scheme is presented to improve the robustness of the multi-stream model, where a small amount of parallel data is sufficient to achieve competitive results. In MEM-Res, two heterogeneous encoders with different architectures, temporal resolutions, and separate CTC networks work in parallel to extract complementary information from the same acoustics. Compared with the best single-stream performance, both models achieve substantial improvements and also outperform various conventional fusion strategies.

While the proposed framework optimizes information in multi-stream scenarios, this thesis also studies Performance Monitoring (PM) measures to predict whether the recognition result of an end-to-end model is reliable, without ground-truth knowledge. Four different PM techniques are investigated, suggesting that PM measures on attention distributions and decoder posteriors are well correlated with true performance.

Committee Members

Hynek Hermansky, Department of Electrical and Computer Engineering

Shinji Watanabe, Department of Electrical and Computer Engineering

Najim Dehak, Department of Electrical and Computer Engineering

Gregory Sell, JHU Human Language Technology Center of Excellence
