Thesis Proposal: Aswin Shanmugam Subramanian
Apr 23 @ 3:00 pm

This presentation will be held remotely. Follow this link for access to the Zoom meeting where it will take place. Please do not log in to the meeting more than 15 minutes before the presentation’s start time.

Title: A Synergistic Combination of Signal Processing and Deep Learning for Robust Speech Processing

Abstract: When speech is captured with a distant microphone, it includes distortions caused by noise, reverberation, and overlapping speakers. Far-field speech processing systems need to be robust to those distortions to function in real-world applications and hence have front-end components to handle them. The front-end components are typically optimized based on signal reconstruction objectives. This makes the overall speech processing system sub-optimal, as the front-end is optimized independently of the downstream task. This approach has another significant constraint: the enhancement/separation system can be trained only with simulated data and hence does not generalize well to real data. Alternatively, these front-end systems can be trained with application-oriented objectives. Emergent end-to-end neural methods have made it easier to optimize the front-end in such a manner. The goal of this work is to encompass carefully designed multichannel speech enhancement/separation subnetworks inside a sequence-to-sequence automatic speech recognition (ASR) system. This work takes an explainable AI approach to this problem, where the intermediate outputs of the subnetworks can be interpreted although the entire network is trained only on the speech recognition error minimization criterion. This proposal looks at two directions: (1) simultaneous dereverberation and denoising using a single differentiable speech recognition network that also learns some important hyperparameters from the data, and (2) target speech extraction combining both anchor speech and location information, optimized with only the transcription as the target.
In the first direction, the dereverberation subnetwork is based on linear prediction, where the filter order hyperparameter is estimated using a reinforcement learning approach, and the denoising (beamforming) subnetwork is based on a parametric multichannel Wiener filter, where the speech distortion factor is also estimated inside the network. This method has shown a considerable gain in performance on real and unseen conditions. It is also shown how such a system, optimized based on the ASR objective, improves the speech enhancement quality on various signal-level metrics in addition to the ASR word error rate (WER) metric. In the second direction, a location and anchor speech guided target speech extraction subnetwork is trained end-to-end with an ASR network. An experimental comparison with a traditional pipeline system verifies that this task can be realized by end-to-end ASR training objectives without using parallel clean data. The results are promising in mixtures of two speakers and noise. The future plan is to optimize an explicit source localization front-end with a speech recognition objective. This can play an important role in realizing a conversation system that recognizes who is speaking what, when, and where.

Thesis Proposal: Ke Li
Apr 30 @ 3:00 pm

This presentation is happening remotely. Click this link as early as 15 minutes before the scheduled start time of the presentation to watch in a Zoom meeting.

Title: Context-aware Language Modeling and Adaptation for Automatic Speech Recognition

Abstract: Language models (LMs) are an important component in automatic speech recognition (ASR) and are usually trained on transcriptions. Language use is strongly influenced by factors such as domain, topic, style, and user preference. However, transcriptions from speech corpora are usually too limited to fully capture contextual variability in test domains, and some of the information is only available at test time. A change of application domain often induces a mismatch in the lexicon and the distribution of words. Even within the same domain, topics can shift and user preference can vary. These observations indicate that LMs trained purely on transcriptions that may not be representative of the test domains are far from ideal and may severely affect ASR performance. To mitigate the mismatches, adapting LMs to contextual variables is desirable.

The goal of this work is to explore general and lightweight approaches for neural LM adaptation and context-aware modeling for ASR. In the adaptation direction, two approaches are investigated. The first is based on cache models. Although neural LMs outperform n-gram LMs at modeling longer context, previous studies show that some of them, for example LSTMs, still capture only a relatively short span of context. Cache models that capture relatively long-term self-trigger information have proved useful for n-gram LM adaptation. This work extends a fast marginal adaptation framework to neural LMs and adapts LSTM LMs in an unsupervised way. Specifically, pre-trained LMs are adapted to cache models estimated from decoded hypotheses. This method is lightweight as it does not require retraining. The second approach is interpolation-based. Linear interpolation is a simple and robust adaptation approach, but it is suboptimal since the weights are globally optimized and not aware of local context. To tackle this issue, a mixer model that combines pre-trained neural LMs with dynamic weighting is proposed. Experimental results show that it outperforms finetuning and linear interpolation in most scenarios. As for context-aware modeling, this work proposes a simple and effective way to implicitly integrate cache models into neural LMs. It provides a simple alternative to the pointer sentinel mixture model. Experiments show that the proposed method is more effective on relatively rare words and outperforms several baselines. Future work is focused on analyzing the importance and the effect of various contextual factors on ASR and developing approaches for representing and modeling these factors to improve ASR performance.
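
The contrast between a globally weighted linear interpolation and a context-dependent weighting can be illustrated with a small sketch. This is not the mixer model from the proposal: the unigram cache, the repetition-based `dynamic_weight` heuristic, and all function names are hypothetical stand-ins chosen only to show how a fixed weight differs from one that reacts to the recent history.

```python
from typing import Dict, List

Dist = Dict[str, float]


def unigram_cache(history: List[str], vocab: List[str], smoothing: float = 1e-3) -> Dist:
    """Smoothed unigram cache distribution estimated from recently decoded words."""
    counts = {w: smoothing for w in vocab}
    for w in history:
        if w in counts:
            counts[w] += 1.0
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def interpolate(p_a: Dist, p_b: Dist, weight_a: float) -> Dist:
    """Linear interpolation: weight_a * p_a + (1 - weight_a) * p_b."""
    return {w: weight_a * p_a[w] + (1.0 - weight_a) * p_b[w] for w in p_a}


def dynamic_weight(history: List[str], window: int = 4) -> float:
    """Hypothetical heuristic: give the base LM less weight when recent words repeat."""
    recent = history[-window:]
    if not recent:
        return 1.0
    repetition = 1.0 - len(set(recent)) / len(recent)
    return 1.0 - 0.5 * repetition


vocab = ["the", "cat", "sat", "mat"]
base_lm = {w: 0.25 for w in vocab}          # stand-in for a pre-trained LM's output
history = ["cat", "cat", "the", "cat"]      # stand-in for decoded hypotheses
cache = unigram_cache(history, vocab)

global_mix = interpolate(base_lm, cache, weight_a=0.8)                     # one fixed weight
local_mix = interpolate(base_lm, cache, weight_a=dynamic_weight(history))  # history-aware weight
print(global_mix["cat"], local_mix["cat"])
```

With a repetitive history, the context-dependent weight shifts more probability mass toward the cached word than the fixed global weight does, which is the kind of local sensitivity a single global weight cannot provide.
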

Thesis Proposal: Arun Nair
May 14 @ 3:00 pm

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Please do not enter the meeting more than 15 minutes before the talk is scheduled to take place.

Title: Machine Learning for Collaborative Signal Processing in Beamforming and Compressed Sensing

Abstract: Life today has become inextricably linked with the many sensors working in concert in our environment, from the webcam and microphone in our laptops to the arrays of wireless transmitters and receivers in cellphone towers. Collaborative signal processing methods tackle the challenge of efficiently processing data from multiple sources. Recently, machine learning methods have become very popular tools for collaborative signal processing, largely due to the success of deep learning. The large volume of data created by multiple sensors pairs well with the data-hungry nature of modern machine learning models, holding great promise for efficient solutions.

This proposal extends ideas from machine learning to problems in collaborative signal processing. Specifically, this work will focus on two collaborative signal processing methods – beamforming and compressed sensing. Beamforming is commonly employed in sensor arrays for directional signal transmission and reception by combining the signals received in the array elements to enhance a signal of interest. On the other hand, compressed sensing is a widely applicable mathematical framework that guarantees exact signal recovery even at sub-Nyquist sampling rates if suitable sparsity and incoherence assumptions are satisfied. Compressed sensing accomplishes this via convex or greedy optimization to fuse the information in a small number of signal measurements.

The first part of this work was motivated by the common experience of attempting to capture a video on a mobile device but having the target of interest contaminated by the surrounding environment (e.g., construction sounds from outside the camera’s field of view). Fusing visual and auditory information, we propose a novel audio-visual zooming algorithm that directionally filters the received audio data using beamforming to focus only on audio originating from within the field of view of the camera. Second, we improve the quality of ultrasound image formation by introducing a novel beamforming framework that leverages the benefits of deep learning. Ultrasound images currently suffer from severe speckle and clutter degradations which cause poor image quality and reduce diagnostic utility. We propose to design a deep neural network to learn end-to-end transformations that extract information directly from raw received US channel data. Finally, we improve upon optimization-based compressed sensing recovery by replacing the slow iterative optimization algorithms with far faster convolutional neural networks.

Dissertation Defense: Tengfei Li
May 26 @ 9:00 am

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 8:45 AM EDT.

Title: Enhancement of Optical Properties in Artificial Metal-Dielectric Structures

Abstract: The electromagnetic properties of materials, crucial to the operation of all electronic and optical devices, are determined by their permittivity and permeability. Thus, the behavior of electromagnetic fields and currents can be controlled by manipulating permittivity and permeability. However, in natural materials these properties cannot be changed easily. To achieve a wide range of (dielectric) permittivity and (magnetic) permeability, artificial materials with unusual properties have been introduced. This body of research presents a series of novel artificial structures with unusually attractive optical properties. The first is the so-called hyperbolic metamaterials (HMMs), which are capable of supporting waves with a very large k-vector and thus carry the promise of large enhancement of spontaneous emission and high-resolution imaging. We put these assumptions to a rigorous test and show that the enhancement and resolution are severely limited by a number of factors (Chapters 2 and 3). We then analyzed and compared different mechanisms for achieving strong field enhancement in the mid-infrared region of the spectrum based on different metamaterials and structures (Chapter 4). Through design and lab fabrication, we realized planar metamaterials (metasurfaces) with the ability to modulate light reflection and absorption at a designated wavelength (Chapter 5). Based on an origami-inspired self-folding approach, we reversibly transformed 2D MoS2 into functional 3D optoelectronic devices, which show enhanced light interaction and are capable of angle-resolved photodetection (Chapter 6). Finally, to replace conventional magnetic-based optical isolators, we achieved two novel non-magnetic isolating schemes based on nonlinear frequency conversion in waveguides and four-wave mixing in semiconductor optical amplifiers (Chapter 7).

Committee Members:

Jacob Khurgin, Department of Electrical and Computer Engineering

Amy Foster, Department of Electrical and Computer Engineering

David Gracias, Department of Chemical and Biomolecular Engineering

Susanna Thon, Department of Electrical and Computer Engineering


Dissertation Defense: Sonia Joy
May 26 @ 2:00 pm

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 1:45 PM EDT.

Title: Sparsity and Structure in UWB Synthetic Aperture Radar

Abstract: Synthetic Aperture Radar (SAR) is a form of radar that uses the motion of the radar to simulate a large antenna in order to create high-resolution imagery. Low-frequency ultra-wideband (UWB) SARs in particular use low frequencies and a large bandwidth, which provide them with penetration capabilities and high resolution. UWB SARs are typically used for near-field imaging applications such as foliage penetration, through-the-wall imaging, and ground penetration. SAR imaging is traditionally done by matched filtering, that is, by applying the adjoint of the projection operator that maps from the image to the SAR data. Matched filter imaging suffers from disadvantages such as sidelobe artifacts, poor resolution of point targets, and lack of robustness to noise and missing data. Regularized imaging with sparsity priors is found to be advantageous; however, regularized imaging is implemented as an iterative process in which projections between the image domain and the data domain must be done many times. The projection operations (backprojection and reprojection) are highly complex; a brute-force implementation has a complexity of O(N^3). In this dissertation, a fast implementation of backprojection and reprojection is investigated. The implementation is explored in the context of regularized imaging as well as compressive sensing SAR.
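
To make the cubic cost of brute-force backprojection concrete, the following is a minimal sketch under strong simplifying assumptions: a linear aperture along the x-axis, magnitude-only data, nearest-neighbour range binning, and no phase or pulse-shape modelling. The function names and the geometry are hypothetical and are not taken from the dissertation. The three nested loops (image rows, image columns, aperture positions) are where the O(N^3) complexity comes from.

```python
import math
from typing import List, Tuple


def range_bin(sensor_x: float, pixel: Tuple[float, float], bin_size: float) -> int:
    """Nearest range bin for the distance between a sensor at (sensor_x, 0) and a pixel."""
    px, py = pixel
    return int(round(math.hypot(px - sensor_x, py) / bin_size))


def simulate_point_target(aperture: List[float], target: Tuple[float, float],
                          n_bins: int, bin_size: float) -> List[List[float]]:
    """Toy projection: each aperture position records a unit echo at the target's range bin."""
    data = [[0.0] * n_bins for _ in aperture]
    for k, sx in enumerate(aperture):
        data[k][range_bin(sx, target, bin_size)] = 1.0
    return data


def backproject(data: List[List[float]], aperture: List[float],
                grid_x: List[float], grid_y: List[float],
                bin_size: float) -> List[List[float]]:
    """Brute-force backprojection: for every pixel, sum the echo from every aperture position."""
    n_bins = len(data[0])
    image = [[0.0] * len(grid_x) for _ in grid_y]
    for iy, py in enumerate(grid_y):              # N rows
        for ix, px in enumerate(grid_x):          # N columns
            for k, sx in enumerate(aperture):     # N aperture positions
                b = range_bin(sx, (px, py), bin_size)
                if b < n_bins:
                    image[iy][ix] += data[k][b]
    return image


aperture = [0.0, 1.0, 2.0, 3.0]
grid_x = [0.0, 1.0, 2.0, 3.0]
grid_y = [1.0, 2.0, 3.0, 4.0]
data = simulate_point_target(aperture, target=(2.0, 3.0), n_bins=16, bin_size=0.5)
image = backproject(data, aperture, grid_x, grid_y, bin_size=0.5)
```

In this toy setup the pixel at the target location collects a contribution from every aperture position, while other pixels collect only partial, sidelobe-like contributions.
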

The second part of the dissertation deals with a problem pertinent to UWB SAR imaging. The VHF/UHF bands used by UWB SAR are shared by other communication systems, which poses two problems: i) RF interference (RFI) from other sources and ii) missing spectral bands because transmission is prohibited in certain bands. The first problem is addressed by using sparse and/or low-rank modeling. The SAR data is modeled to be sparse. The projection operator from above is used to capture the sparsity of the SAR data. The RFI is modeled to be either sparse with respect to an appropriate dictionary or assumed to be of low rank. The sparse estimation or the sparse and low-rank estimation is used to estimate the SAR signal and RFI simultaneously. It is demonstrated that the new methods perform much better than traditional RFI mitigation techniques such as notch filtering. The missing frequency problem can be modeled as a special case of compressive sensing. Sparse estimation is applied to the data to recover the missing frequencies. Simulations show that the sparse estimation is robust to large spectral gaps.

Seminar: Carlos Castillo
Jun 4 @ 12:00 pm

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 11:45 AM EDT.

Title: Deep Learning for Face and Behavior Analytics

Abstract: In this talk I will describe the AI systems we have built for face analysis and complex activity detection. I will describe SfSNet, a DCNN that produces an accurate decomposition of an unconstrained image of a human face into shape, reflectance, and illuminance. We present a novel architecture that mimics Lambertian image formation and a training scheme that uses a mixture of labeled synthetic and unlabeled real-world images. I will describe our results on the properties of DCNN-based identity features for face recognition. I will show how the DCNN features trained on in-the-wild images form a highly structured organization of image and identity information. I will also describe our results comparing the performance of our state-of-the-art face recognition systems to that of super recognizers and forensic face examiners.

I will describe our system for detecting complex activities in untrimmed security videos. In these videos the activities happen in small areas of the frame and some activities are quite rare. Our system is faster than real time, very accurate and works well with visible spectrum and IR cameras. We have defined a new approach to compute activity proposals.

I will conclude by highlighting future directions of our work.

Bio: Carlos D. Castillo is an assistant research scientist at the University of Maryland Institute for Advanced Computer Studies (UMIACS). He has done extensive work on face and activity detection and recognition for over a decade and has both industry and academic research experience. He received his PhD in Computer Science from the University of Maryland, College Park, where he was advised by Dr. David Jacobs. During the past 5 years he has been involved with the UMD teams in IARPA JANUS, IARPA DIVA, and DARPA L2M. He was the recipient of the best paper award at the International Conference on Biometrics: Theory, Applications and Systems (BTAS) 2016. The software he developed under IARPA JANUS has been transitioned to many USG organizations, including the Department of Defense, the Department of Homeland Security, and the Department of Justice. In addition, the UMD JANUS system is being used operationally by the Homeland Security Investigations (HSI) Child Exploitation Investigations Unit to provide investigative leads in identifying and rescuing child abuse victims, as well as catching and prosecuting criminal suspects. The technologies his team developed provided the technical foundations for a spinoff startup company, Mukh Technologies LLC, which creates software for face detection, alignment, and recognition. In 2018, Dr. Castillo received the Outstanding Innovation of the Year Award from the UMD Office of Technology Commercialization. His current research interests include face and activity detection and recognition, and deep learning.

Thesis Proposal: Uejima Takeshi
Jun 5 @ 10:00 am

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 9:45 AM EDT.

Title: A Unified Visual Saliency Model for Neuromorphic Implementation

Abstract: Human eyes capture and send large amounts of data from the environment to the brain. However, the visual cortex cannot process all the information in detail at once. To deal with the overwhelming quantity of input, the early stages of visual processing select a small subset of the input for detailed processing. Because only the fovea has high-resolution imaging, the observer needs to move the eyeballs for thorough scene inspection. Therefore, eye movements can be thought of as one of the observable outputs of the early visual process in the brain, which represents what is interesting and important for the observer. Modeling how the brain selects important information, and where humans fixate, is an intriguing research topic in neuroscience and computer vision and is generally referred to as visual saliency modeling. Beyond its scientific importance, a better understanding of this process will improve the effectiveness of graphic arts, advertisements, traffic signs, camouflage, and many other applications.

To date, there have been some studies on developing bioinspired saliency models. Russell et al. proposed a biologically plausible visual saliency model called the proto-object based saliency model. It has successfully predicted human fixations; however, it works exclusively on low-level features: intensity, color, and orientation. The Russell et al. model has been extended by the addition of a motion channel as well as a disparity (depth) channel. The texture feature, however, has neither been well studied in the visual saliency field nor been incorporated into a proto-object based model, and no attempt has been made to combine all of these features in one model. Here, we propose an augmented version of the model that incorporates texture, motion, and disparity features.

In addition to designing the unified proto-object based model, we investigate the rationality of the visual process in biological systems from the viewpoint of how efficiently it represents natural stimuli. This study will advance visual saliency modeling and improve the accuracy of human fixation prediction. In addition, it will deepen our knowledge of how the visual cortex deals with complex environments.

Committee Members:

Ralph Etienne-Cummings, Department of Electrical and Computer Engineering

Andreas Andreou, Department of Electrical and Computer Engineering

Philippe Pouliquen, Department of Electrical and Computer Engineering

Dissertation Defense: Yansong Zhu
Jun 18 @ 1:00 pm

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 12:45 PM EDT. 

Title: Improved Modeling and Image Generation for Fluorescence Molecular Tomography (FMT) and Positron Emission Tomography (PET)

Abstract: In this thesis, we aim to improve quantitative medical imaging with advanced image generation algorithms. We focus on two specific imaging modalities: fluorescence molecular tomography (FMT) and positron emission tomography (PET).

In the case of FMT, we present a novel photon propagation model for its forward model, and in addition, we propose and investigate a reconstruction algorithm for its inverse problem. In the first part, we develop a novel Neumann-series-based radiative transfer equation (RTE) that incorporates reflection boundary conditions in the model. In addition, we propose a novel reconstruction technique for diffuse optical imaging that incorporates this Neumann-series-based RTE as the forward model. The proposed model is assessed using a simulated 3D diffuse optical imaging setup, and the results demonstrate the importance of considering photon reflection at boundaries when performing photon propagation modeling. In the second part, we propose a statistical reconstruction algorithm for FMT. The algorithm is based on sparsity-initialized maximum-likelihood expectation maximization (MLEM), taking into account the Poisson nature of the data in FMT and the sparse nature of the images. The proposed method is compared with a pure sparse reconstruction method as well as a uniform-initialized MLEM reconstruction method. Results indicate that the proposed method is more robust to noise and shows improved qualitative and quantitative performance.
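
For readers unfamiliar with MLEM, the standard multiplicative update for Poisson data can be sketched in a few lines. This is the textbook MLEM iteration on a tiny dense system, not the thesis implementation: the system matrix `A`, the measurements `y`, and the initial estimates are made-up toy values. The sketch also shows why the choice of initialization matters: because the update is multiplicative, any pixel that starts at zero stays at zero, so a sparse initial estimate restricts the support of the final image.

```python
from typing import List

Matrix = List[List[float]]


def forward(A: Matrix, x: List[float]) -> List[float]:
    """Forward projection y_hat = A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def mlem(A: Matrix, y: List[float], x0: List[float],
         n_iter: int = 50, eps: float = 1e-12) -> List[float]:
    """Textbook MLEM: x_j <- x_j / s_j * sum_i A_ij * y_i / (A x)_i, with s_j = sum_i A_ij."""
    n_pix = len(x0)
    sensitivity = [sum(row[j] for row in A) for j in range(n_pix)]
    x = list(x0)
    for _ in range(n_iter):
        proj = forward(A, x)
        # Ratio of measured to predicted counts; eps guards against division by zero.
        ratio = [y_i / max(p_i, eps) for y_i, p_i in zip(y, proj)]
        back = [sum(A[i][j] * ratio[i] for i in range(len(A))) for j in range(n_pix)]
        x = [x[j] * back[j] / max(sensitivity[j], eps) for j in range(n_pix)]
    return x


A = [[1.0, 1.0],
     [0.0, 1.0]]
y = [3.0, 1.0]

uniform_start = mlem(A, y, x0=[1.0, 1.0])   # uniform initialization
sparse_start = mlem(A, y, x0=[1.0, 0.0])    # second pixel excluded by the initial support
print(uniform_start, sparse_start)
```

The update keeps the estimate non-negative and, after each iteration, the sensitivity-weighted sum of the estimate matches the total measured counts.
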

For PET, we present an MRI-guided partial volume correction algorithm for brain imaging, aiming to recover the qualitative and quantitative loss due to the limited resolution of the PET system while keeping image noise at a low level. The proposed method is based on an iterative deconvolution model with regularization using parallel level sets. A non-smooth optimization algorithm is developed so that the proposed method can be feasibly applied to 3D images and avoids the additional blurring caused by a conventional smooth optimization process. We evaluate the proposed method using both simulation data and in vivo human data collected from the Baltimore Longitudinal Study of Aging (BLSA). Our proposed method is shown to generate images with reduced noise and improved structural detail, as well as an increased number of statistically significant voxels in a study of aging. Results demonstrate that our method has promise to provide superior performance in clinical imaging scenarios.

Thesis Committee

  • Arman Rahmim, Department of Electrical and Computer Engineering, Department of Radiology and Radiological Sciences (advisor, primary reader)
  • Yong Du, Department of Radiology and Radiological Sciences (secondary reader)
  • Jin Kang, Department of Electrical and Computer Engineering
  • Trac Tran, Department of Electrical and Computer Engineering

Thesis Proposal: Soohyun Lee
Jun 18 @ 3:00 pm

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 2:45 PM EDT. 

Title: Optical Coherence Tomography (OCT)-Guided Ophthalmic Therapy

Abstract: Optical coherence tomography (OCT), which noninvasively provides cross-sectional images at the micrometer scale in real time, has been widely applied to the diagnosis and treatment guidance of ocular diseases.

Selective retina therapy (SRT) is an effective laser treatment method for retinal diseases associated with a degradation of the retinal pigment epithelium (RPE). The SRT selectively targets the RPE, so it reduces negative side effects and facilitates healing of the induced retinal lesions. However, the selection of proper laser energy is challenging because of ophthalmoscopically invisible lesions in the RPE and variance in melanin concentration between patients and even between regions within an eye. In the first part of this work, we propose and demonstrate SRT monitoring and temperature estimation based on speckle variance OCT (svOCT) for dosimetry control. SvOCT quantifies speckle pattern variation caused by moving particles or structural changes in biological tissues. We find that the svOCT peak values have a reliable correlation with the degree of retinal lesion formation. The temperature at the neural retina and RPE is estimated from the svOCT peak values using numerically calculated temperature, which is consistent with the observed lesion creation.
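
As a rough illustration of what speckle variance measures, the sketch below computes the per-pixel variance of intensity across a stack of repeated frames, which is the commonly used definition of a speckle variance image. The frame layout and the toy values are assumptions for illustration; the abstract does not specify how the authors compute or post-process svOCT values.

```python
from typing import List

Frame = List[List[float]]


def speckle_variance(frames: List[Frame]) -> Frame:
    """Per-pixel population variance of intensity across repeated frames of the same location."""
    n = len(frames)
    if n == 0:
        raise ValueError("at least one frame is required")
    rows, cols = len(frames[0]), len(frames[0][0])
    sv = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            values = [frame[r][c] for frame in frames]
            mean = sum(values) / n
            sv[r][c] = sum((v - mean) ** 2 for v in values) / n
    return sv


# Two toy 1x2 frames: the first pixel is static, the second changes between frames.
frames = [
    [[1.0, 2.0]],
    [[1.0, 4.0]],
]
sv_image = speckle_variance(frames)
print(sv_image)
```

Static structure yields a variance near zero, while pixels whose speckle pattern changes between frames, for example because of moving particles or structural change, yield a larger value.
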

In the second part, we propose to develop a hand-held subretinal injector actively guided by a common-path OCT (CP-OCT) distal sensor. Subretinal injection delivers drugs or stem cells into the space between the RPE and photoreceptor layers, so it can directly affect resident cells and tissues in the subretinal space. The technique requires high stability and dexterity from the surgeon due to the fine anatomy of the retina, and it is challenging because of the surgeon's physiological motions, such as hand tremor. We mainly focus on two aspects of the CP-OCT guided subretinal injector: (i) a high-performance fiber probe based on a high-index epoxy lensed fiber to enhance the CP-OCT retinal image quality in a wet environment; and (ii) automated layer identification and tracking, in which each retinal layer boundary, as well as the retinal surface, is tracked using convolutional neural network (CNN)-based segmentation for accurate localization of the needle. The CNN performing retinal layer segmentation is integrated into the CP-OCT system for targeted layer distance sensing, and the CP-OCT distal sensor guided system is tested on ex vivo bovine retina.

Dissertation Defense: Ben Skerritt-Davis
Jul 28 @ 10:00 am

This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 9:45 AM EDT.

Title: Statistical Inference in Auditory Perception

Abstract: The human auditory system effortlessly parses complex sensory inputs despite the ever-present randomness and uncertainty in real-world scenes. To achieve this, the brain tracks sounds as they evolve in time, collecting contextual information to construct an internal model of the external world for predicting future events. Previous work has shown the brain is sensitive to many predictable (and often complex) patterns in sequential sounds. However, real-world environments exhibit a broader spectrum of predictability, and moreover, the level of predictability is constantly in flux. How does the brain build robust internal representations of such stochastic and dynamic acoustic environments?

This question is addressed through the lens of a computational model based in statistical inference. Embodying theories from Bayesian perception and predictive coding, the model posits the brain collects statistical estimates from sounds and maintains multiple hypotheses for the degree of context to include in predictive processes. As a potential computational solution for perception of complex and dynamic sounds, this model is used to connect sensory inputs with listeners’ responses in a series of human behavioral and electroencephalography (EEG) experiments incorporating uncertainty. Experimental results point toward the underlying sufficient statistics collected by the brain, and the extension of these statistical representations to multiple dimensions is examined along spectral and spatial dimensions. The computational model guides interpretation of behavioral and neural responses, revealing multiplexed responses in the brain corresponding to different levels of predictive processing. In addition, the model is used to explain individual differences across listeners highlighted by uncertainty.

The proposed computational model was developed based on first principles, and its usefulness is not limited to the experiments presented here. The model was used to replicate a range of previous findings in the literature, unifying them under a single framework. Moving forward, this general and flexible model can be used as a broad-ranging tool for studying the statistical inference processes behind auditory perception, overcoming the need to minimize uncertainty in perceptual experiments and pushing what was previously considered feasible for study in the laboratory towards what is typically encountered in the “messy” environments of everyday listening.

Committee Members

Mounya Elhilali, Department of Electrical and Computer Engineering

Jason Fischer, Department of Psychological & Brain Sciences

Hynek Hermansky, Department of Electrical and Computer Engineering

James West, Department of Electrical and Computer Engineering
