Title: Towards a better understanding of spoken conversations: Assessment of sentiment and emotion
Abstract: In this talk, we present our work on understanding the emotional aspects of spoken conversations. Emotions play a vital role in our daily life as they help us convey information impossible to express verbally to other parties.
While humans can easily perceive emotions, they are notoriously difficult for machines to define and recognize. However, automatically detecting the emotion of a spoken conversation can be useful for a diverse range of applications such as human-machine interaction and conversation analysis. In this work, we considered emotion recognition in two particular scenarios. The first scenario is predicting customer sentiment/satisfaction (CSAT) in a call center conversation, and the second consists of emotion prediction in short utterances.
CSAT is defined as the overall sentiment (positive vs. negative) of the customer about his/her interaction with the agent. In this work, we perform a comprehensive search for adequate acoustic and lexical representations.
For acoustic representation, we propose to use the x-vector model, which is known for its state-of-the-art performance in the speaker recognition task. The motivation for using x-vectors for CSAT is our observation that emotion information encoded in x-vectors affected speaker recognition performance. For the lexical representation, we introduce a novel method, CSAT Tracker, which computes the overall prediction based on individual segment outcomes. Both methods rely on transfer learning to obtain the best performance. We performed classification using convolutional neural networks that combine the acoustic and lexical features. We evaluated our systems on US English telephone speech from call center data. We found that lexical models perform better than acoustic models and that fusing them provided significant gains. The error analysis reveals that calls in which customers accomplished their goal but were still dissatisfied are the most difficult to predict correctly. We also found that the customer’s speech is more emotional than the agent’s speech.
For the second scenario of predicting emotion, we present a novel approach based on x-vectors. We show that adapting the x-vector model for emotion recognition provides the best published results on three public datasets.
Title: Single-Channel Speech Separation in Noisy and Reverberant Conditions
Abstract: An inevitable property of multi-party conversations is that more than one speaker will end up speaking simultaneously for portions of time. Many speech technologies, such as automatic speech recognition and speaker identification, are not designed to function on overlapping speech and suffer severe performance degradation under such conditions. Speech separation techniques aim to solve this problem by producing a separate waveform for each speaker in an audio recording with multiple talkers speaking simultaneously. The advent of deep neural networks has resulted in strong performance gains on the speech separation task. However, training and evaluation have been almost exclusively restricted to a single dataset of clean, near-field read speech, which is not representative of the many multi-person conversational settings that are recorded on room microphones and therefore contain noise and reverberation. Because other speech technologies degrade in these conditions, speech separation systems are expected to suffer a decrease in performance as well.
The primary goal of this proposal is to develop novel techniques to improve speech separation in noisy and reverberant recording conditions. One core component of this work is the creation of additional synthetic overlap corpora spanning a range of more realistic and challenging conditions. Because appropriate data are lacking, a necessary first step is to create corpora with which to benchmark state-of-the-art methods under these more challenging conditions. Another proposed line of investigation is the integration of speech separation techniques with speech enhancement, the task of enhancing a speech signal through the removal of noise or reverberation. This is a natural combination due to similarities in problem formulation and general approach. Finally, we propose an investigation into the effectiveness of speech separation as a pre-processing step to speech technologies, such as automatic speech recognition, that struggle with overlapping speech, as well as tighter integration of speech separation with these “downstream” systems.
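As a rough illustration of how one synthetic overlap mixture might be assembled, the sketch below scales an interfering talker and a noise signal relative to a target talker and sums them. The function name `mix_at_levels`, the use of signal-to-interference and signal-to-noise ratios as the controlling parameters, and the omission of reverberation are all assumptions made for this example; the proposal does not specify its corpus recipe.

```python
import numpy as np


def mix_at_levels(target, interferer, noise, sir_db, snr_db):
    """Sum a target talker, an interfering talker, and noise.

    The interferer is scaled so the target-to-interferer power ratio is
    sir_db, and the noise so the target-to-noise power ratio is snr_db.
    Hypothetical helper for illustration only; no reverberation is applied.
    """
    # Trim all signals to a common length.
    n = min(len(target), len(interferer), len(noise))
    target = np.asarray(target[:n], dtype=float)
    interferer = np.asarray(interferer[:n], dtype=float)
    noise = np.asarray(noise[:n], dtype=float)

    def power(x):
        return np.mean(x ** 2)

    # Gains that realize the requested power ratios relative to the target.
    g_int = np.sqrt(power(target) / (power(interferer) * 10 ** (sir_db / 10)))
    g_noise = np.sqrt(power(target) / (power(noise) * 10 ** (snr_db / 10)))

    scaled_interferer = g_int * interferer
    scaled_noise = g_noise * noise
    mixture = target + scaled_interferer + scaled_noise
    return mixture, scaled_interferer, scaled_noise
```

Returning the scaled components alongside the mixture keeps the reference signals that separation metrics need. A more realistic corpus would also convolve each source with a room impulse response before mixing.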
This presentation will be happening remotely over Zoom. Click this link as early as 15 minutes before the scheduled start time of the presentation to watch in a Zoom meeting.
Title: Compressive Sensing for Wireless Systems with Massive Antenna Arrays
Abstract: Over the past two decades, the world has enjoyed exponential growth in wireless connectivity that has fundamentally changed the way people communicate and has opened the door to limitless new applications. With the advent of 5G, users will now begin to enjoy enhanced mobile broadband links supporting peak rates of over 10 gigabits per second. 5G will also support massive machine-type communications and ultra-reliable low-latency communications with latencies below one millisecond. Continuing to increase system capacity requires the continual advancement of new technology to make efficient use of finite spectrum resources.
Researchers have studied Multiple-Input-Multiple-Output (MIMO) communications over the last several decades as a way to increase system capacity. The MIMO channel is composed of multiple transmit (input) antennas and multiple receive (output) antennas. The channel is represented as the impulse response between each transmit and receive antenna pair. In the simplest of channels, the pairwise impulse response reduces to a single coefficient. Many theoretical MIMO results rely on Rayleigh channels featuring independently distributed complex Gaussian variables as channel coefficients.
The concept of Massive MIMO emerged a decade ago and is a leading technology in 5G wireless. Massive MIMO features base stations that have massive antenna arrays that simultaneously service many users. The Massive MIMO array has many more antennas than users. Unlike traditional phased array antennas, Massive MIMO arrays have all (or a large portion of) their antennas connected to receive chains for baseband processing. Successfully decoding each user’s data stream requires estimates of the propagation channel. Channel estimation is usually aided through the use of pilot signals that are known to both the user terminal and the base station. Simultaneously estimating the channel matrix between each user and each antenna in a massive MIMO array creates challenges for pilot sequence design. More channel resources reserved for pilot sequences for channel estimation result in fewer resources for user data.
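To make the pilot overhead concrete, here is a minimal sketch of textbook least-squares channel estimation with orthogonal pilots. It is not the estimator proposed in this work. It assumes a flat-fading uplink model Y = H P + N, where H holds one coefficient per antenna-user pair and P holds one pilot sequence per user; the function names and the choice of DFT rows as pilots are assumptions made for illustration.

```python
import numpy as np


def rayleigh_channel(num_antennas, num_users, rng):
    """Draw i.i.d. unit-variance complex Gaussian channel coefficients."""
    real = rng.standard_normal((num_antennas, num_users))
    imag = rng.standard_normal((num_antennas, num_users))
    return (real + 1j * imag) / np.sqrt(2)


def dft_pilots(num_users, pilot_len):
    """Return a (num_users, pilot_len) matrix of mutually orthogonal pilots.

    Orthogonality requires pilot_len >= num_users, which is exactly the
    overhead that grows with the number of simultaneously served users.
    """
    if pilot_len < num_users:
        raise ValueError("pilot_len must be at least num_users")
    k = np.arange(num_users)[:, None]
    t = np.arange(pilot_len)[None, :]
    return np.exp(-2j * np.pi * k * t / pilot_len)


def ls_channel_estimate(received, pilots):
    """Least-squares estimate of H from received = H @ pilots + noise."""
    gram = pilots @ pilots.conj().T
    return received @ pilots.conj().T @ np.linalg.inv(gram)
```

With T pilot symbols per user, T channel uses carry no user data, so lengthening the pilots to serve more users directly reduces the resources left for payload.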
Several efforts have shown that the millimeter-wave (mmWave) massive MIMO channel is sparse in several respects. The number of distinct and resolvable paths between a user and a massive MIMO array is generally much smaller than the number of base station antennas. Early theoretical MIMO work relied on Rayleigh channels because they are useful for closed-form solutions. In reality, the Massive MIMO mmWave channel is low rank, as it can be modeled by a small number of resolvable multipath components. This opens opportunities for new channel estimation techniques using compressive sensing and sparse recovery.
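One standard greedy sparse-recovery routine is orthogonal matching pursuit (OMP), sketched below as a generic example of the technique class named above. It is not necessarily the algorithm studied in this work. The sketch assumes a measurement model y = A x, in which the columns of A have unit norm and x has at most `sparsity` non-zero entries. In a channel-estimation setting, x would hold the gains of the few resolvable paths in some dictionary, which is an assumption of this example.

```python
import numpy as np


def omp(A, y, sparsity):
    """Orthogonal matching pursuit for y = A @ x with a sparse x.

    Assumes the columns of A are unit-norm. Returns an estimate of x
    with at most `sparsity` non-zero entries.
    """
    A = np.asarray(A)
    y = np.asarray(y)
    x = np.zeros(A.shape[1], dtype=np.result_type(A, y))
    residual = y.copy()
    support = []
    coeffs = None
    for _ in range(sparsity):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(A.conj().T @ residual)
        if support:
            correlations[support] = 0.0  # never pick a column twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected columns jointly by least squares.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    if support:
        x[support] = coeffs
    return x
```

Because only a handful of coefficients are estimated, the number of measurements can be far smaller than the number of unknowns in x, which is what makes shorter pilot sequences plausible when the channel is sparse.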
Although Massive MIMO will be featured in future 5G services, there is still much untapped potential. Through developing better channel estimation schemes, additional system throughput can be achieved. This work will consider:
This event will occur remotely in a Zoom meeting at this link. Please do not join the meeting until at least 15 minutes before the presentation is scheduled to start.
Title: Using Systems Modeling to Localize the Seizure Onset Zone in Epilepsy Patients from Single Pulse Electrical Stimulation Recordings
Abstract: Surgical resection of the seizure onset zone (SOZ) could potentially lead to seizure freedom in medically refractory epilepsy patients. However, localizing the SOZ can be a time-consuming and tedious process involving visual inspection of intracranial electroencephalographic (iEEG) recordings captured during passive patient monitoring. Single pulse electrical stimulation (SPES) is currently performed on patients undergoing invasive EEG monitoring, mainly to map functional brain networks such as language and motor networks. We hypothesize that evoked responses from SPES can also be used to localize the SOZ, as they may express the natural frequencies and connectivity of the iEEG network. To test our hypothesis, we construct patient-specific single-input multi-output transfer function models from the evoked responses recorded from eight epilepsy patients who underwent SPES evaluation and iEEG monitoring. Our preliminary results suggest that the stimulation electrodes that produced the highest system gain, as measured by the 𝓗∞ norm, correspond to those electrodes clinically defined in the SOZ in successfully treated patients.
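For a single-input multi-output model, the 𝓗∞ norm is the peak over frequency of the Euclidean norm of the vector of output frequency responses. The sketch below approximates it on a frequency grid. It assumes a continuous-time model in which every output shares one denominator polynomial, and the function name and the grid-based approximation (which gives a lower bound on the true norm) are choices made for this example rather than the method used in the study.

```python
import numpy as np


def hinf_norm_simo(numerators, denominator, omegas):
    """Grid approximation of the H-infinity norm of a SIMO transfer function.

    numerators:  list of polynomial coefficient lists, one per output,
                 highest power of s first (as used by np.polyval).
    denominator: shared denominator coefficients, highest power first.
    omegas:      angular frequencies (rad/s) at which to evaluate.
    Returns (peak gain, frequency at which the peak occurs).
    """
    omegas = np.asarray(omegas, dtype=float)
    s = 1j * omegas
    den = np.polyval(denominator, s)
    # Rows are outputs, columns are frequencies.
    responses = np.array([np.polyval(num, s) / den for num in numerators])
    # With one input, the largest singular value is the vector 2-norm.
    gains = np.linalg.norm(responses, axis=0)
    peak = int(np.argmax(gains))
    return float(gains[peak]), float(omegas[peak])
```

Under these assumptions, ranking stimulation electrodes would amount to fitting one such model per stimulated electrode and sorting the electrodes by the returned peak gain.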
This presentation will be done remotely. Follow this link for access to the Zoom meeting where it will be taking place. It is advised that you do not log in to the meeting until at least 15 minutes before the presentation’s start time.
Title: A Synergistic Combination of Signal Processing and Deep Learning for Robust Speech Processing
Abstract: When speech is captured with a distant microphone, it includes distortions caused by noise, reverberation, and overlapping speakers. Far-field speech processing systems need to be robust to those distortions to function in real-world applications and hence have front-end components to handle them. The front-end components are typically optimized based on signal reconstruction objectives. This makes the overall speech processing system sub-optimal, as the front-end is optimized independently of the downstream task. This approach has another significant constraint: the enhancement/separation system can be trained only with simulated data and hence does not generalize well to real data. Alternatively, these front-end systems can be trained with application-oriented objectives. Emergent end-to-end neural methods have made it easier to optimize the front-end in such a manner. The goal of this work is to embed carefully designed multichannel speech enhancement/separation subnetworks inside a sequence-to-sequence automatic speech recognition (ASR) system. This work takes an explainable AI approach to this problem, in which the intermediate outputs of the subnetworks can be interpreted even though the entire network is trained solely on the criterion of minimizing speech recognition error. This proposal looks at two directions: (1) simultaneous dereverberation and denoising using a single differentiable speech recognition network that also learns some important hyperparameters from the data, and (2) target speech extraction that combines anchor speech and location information and is optimized using only the transcription as the target.
In the first direction, the dereverberation subnetwork is based on linear prediction, where the filter order hyperparameter is estimated using a reinforcement learning approach, and the denoising (beamforming) subnetwork is based on a parametric multichannel Wiener filter, where the speech distortion factor is also estimated inside the network. This method has shown a considerable gain in performance on real and unseen conditions. It is also shown how such a system, optimized based on the ASR objective, improves speech enhancement quality on various signal-level metrics in addition to the ASR word error rate (WER) metric. In the second direction, a location- and anchor-speech-guided target speech extraction subnetwork is trained end-to-end with an ASR network. An experimental comparison with a traditional pipeline system verifies that this task can be realized by end-to-end ASR training objectives without using parallel clean data. The results are promising in mixtures of two speakers and noise. The future plan is to optimize an explicit source localization front-end with a speech recognition objective. This can play an important role in realizing a conversation system that recognizes who is speaking what, when, and where.
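For readers unfamiliar with the parametric multichannel Wiener filter mentioned in the first direction, its textbook form for one frequency bin is w = (Φ_ss + μ Φ_nn)^{-1} Φ_ss u, where Φ_ss and Φ_nn are the speech and noise spatial covariance matrices, u selects a reference microphone, and μ is the speech distortion factor. The sketch below evaluates that formula with NumPy. Here μ is a fixed argument, whereas the abstract describes estimating it inside the network, and the function names are hypothetical.

```python
import numpy as np


def pmwf_weights(phi_speech, phi_noise, mu=1.0, ref_channel=0):
    """Parametric multichannel Wiener filter for one frequency bin.

    phi_speech, phi_noise: (channels, channels) spatial covariance matrices.
    mu: speech distortion factor; mu = 1 gives the standard multichannel
        Wiener filter, and larger values suppress more noise at the cost
        of more speech distortion.
    ref_channel: index of the reference microphone.
    """
    num_channels = phi_speech.shape[0]
    u = np.zeros(num_channels)
    u[ref_channel] = 1.0
    # Solve (phi_speech + mu * phi_noise) w = phi_speech u without inverting.
    return np.linalg.solve(phi_speech + mu * phi_noise, phi_speech @ u)


def apply_filter(weights, observation):
    """Filter one multichannel STFT observation: y = w^H x."""
    return weights.conj() @ observation
```

In an end-to-end system of the kind described, the covariance matrices would themselves come from network-estimated masks, so the ASR loss can propagate back through this closed-form step. That wiring is not shown here.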
This presentation is happening remotely. Click this link as early as 15 minutes before the scheduled start time of the presentation to watch in a Zoom meeting.
Title: Context-aware Language Modeling and Adaptation for Automatic Speech Recognition
Abstract: Language models (LMs) are an important component in automatic speech recognition (ASR) and are usually trained on transcriptions. Language use is strongly influenced by factors such as domain, topic, style, and user preference. However, transcriptions from speech corpora are usually too limited to fully capture contextual variability in test domains, and some of the information is only available at test time. A change of application domain often induces a mismatch in the lexicon and in the distribution of words. Even within the same domain, topics can shift and user preferences can vary. These observations indicate that LMs trained purely on transcriptions, which may not be representative of test domains, are far from ideal and may severely affect ASR performance. To mitigate these mismatches, adapting LMs to contextual variables is desirable.
This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. It is advised that you do not enter the meeting until at least 15 minutes before the talk is scheduled to take place.
Title: Machine Learning for Collaborative Signal Processing in Beamforming and Compressed Sensing
Abstract: Life today has become inextricably linked with the many sensors working in concert in our environment, from the webcam and microphone in our laptops to the arrays of wireless transmitters and receivers in cellphone towers. Collaborative signal processing methods tackle the challenge of efficiently processing data from multiple sources. Recently, machine learning methods have become very popular tools for collaborative signal processing, largely due to the success of deep learning. The large volume of data created by multiple sensors pairs well with the data-hungry nature of modern machine learning models, holding great promise for efficient solutions.
This proposal extends ideas from machine learning to problems in collaborative signal processing. Specifically, this work will focus on two collaborative signal processing methods – beamforming and compressed sensing. Beamforming is commonly employed in sensor arrays for directional signal transmission and reception by combining the signals received in the array elements to enhance a signal of interest. On the other hand, compressed sensing is a widely applicable mathematical framework that guarantees exact signal recovery even at sub-Nyquist sampling rates if suitable sparsity and incoherence assumptions are satisfied. Compressed sensing accomplishes this via convex or greedy optimization to fuse the information in a small number of signal measurements.
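The simplest instance of the signal combination described above is the narrowband delay-and-sum beamformer, sketched here for a uniform linear array. The array geometry, the far-field narrowband assumption, and the function names are choices made for this example and are not taken from the proposal.

```python
import numpy as np


def steering_vector(num_sensors, spacing_wavelengths, angle_rad):
    """Far-field narrowband steering vector of a uniform linear array.

    spacing_wavelengths: element spacing divided by the wavelength.
    angle_rad: arrival angle measured from broadside.
    """
    n = np.arange(num_sensors)
    return np.exp(-2j * np.pi * spacing_wavelengths * n * np.sin(angle_rad))


def delay_and_sum(snapshots, spacing_wavelengths, look_angle_rad):
    """Phase-align the sensors toward look_angle_rad and average them.

    snapshots: complex array of shape (num_sensors, num_samples).
    Returns the beamformed signal of shape (num_samples,).
    """
    num_sensors = snapshots.shape[0]
    weights = steering_vector(num_sensors, spacing_wavelengths, look_angle_rad)
    weights = weights / num_sensors
    return weights.conj() @ snapshots
```

A source at the look angle passes with unit gain because its phases are exactly undone before averaging. Sources at other angles add with mismatched phases and are attenuated, and fully cancelled at the nulls of the array pattern.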
The first part of this work was motivated by the common experience of attempting to capture a video on a mobile device but having the audio of the target of interest contaminated by the surrounding environment (e.g., construction sounds from outside the camera’s field of view). Fusing visual and auditory information, we propose a novel audio-visual zooming algorithm that directionally filters the received audio data using beamforming to focus only on audio originating from within the field of view of the camera. Second, we improve the quality of ultrasound image formation by introducing a novel beamforming framework that leverages the benefits of deep learning. Ultrasound images currently suffer from severe speckle and clutter degradations, which cause poor image quality and reduce diagnostic utility. We propose to design a deep neural network to learn end-to-end transformations that extract information directly from raw received ultrasound channel data. Finally, we improve upon optimization-based compressed sensing recovery by replacing the slow iterative optimization algorithms with far faster convolutional neural networks.
This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 9:45 AM EDT.
Title: A Unified Visual Saliency Model for Neuromorphic Implementation
Abstract: Human eyes capture and send large amounts of data from the environment to the brain. However, the visual cortex cannot process all the information in detail at once. To deal with the overwhelming quantity of input, the early stages of visual processing select a small subset of the input for detailed processing. Because only the fovea has high-resolution imaging, the observer needs to move the eyes for thorough scene inspection. Therefore, eye movements can be thought of as one of the observable outputs of early visual processing in the brain, representing what is interesting and important to the observer. Modeling how the brain selects important information, and where humans fixate, is an intriguing research topic in neuroscience and computer vision and is generally referred to as visual saliency modeling. Beyond its scientific significance, a better understanding of this process will improve the effectiveness of graphic arts, advertisements, traffic signs, camouflage, and many other applications.
To date, there have been some studies on developing bioinspired saliency models. Russell et al. proposed a biologically plausible visual saliency model called the proto-object based saliency model. It has been shown to predict human fixations successfully; however, it works exclusively on low-level features: intensity, color, and orientation. The model of Russell et al. has been extended by the addition of a motion channel as well as a disparity (depth) channel. Texture, however, has neither been well studied in the visual saliency field nor been incorporated into a proto-object based model, and no attempt has been made to combine all of these features in one model. Here, we propose an augmented version of the model that incorporates texture, motion, and disparity features.
In addition to designing the unified proto-object based model, we investigate the rationality of visual processing in biological systems from the viewpoint of how efficiently it represents natural stimuli. This study will advance visual saliency modeling and improve the accuracy of human fixation prediction. In addition, it will deepen our knowledge of how the visual cortex deals with complex environments.
Ralph Etienne-Cummings, Department of Electrical and Computer Engineering
Andreas Andreou, Department of Electrical and Computer Engineering
Philippe Pouliquen, Department of Electrical and Computer Engineering
This presentation will be taking place remotely. Follow this link to enter the Zoom meeting where it will be hosted. Do not enter the meeting before 2:45 PM EDT.
Title: Optical coherence tomography (OCT)-guided ophthalmic therapy
Abstract: Optical coherence tomography (OCT), which noninvasively provides cross-sectional images with micrometer-scale resolution in real time, has been widely applied to the diagnosis and treatment guidance of ocular diseases.
Selective retina therapy (SRT) is an effective laser treatment method for retinal diseases associated with a degradation of the retinal pigment epithelium (RPE). The SRT selectively targets the RPE, so it reduces negative side effects and facilitates healing of the induced retinal lesions. However, the selection of proper laser energy is challenging because of ophthalmoscopically invisible lesions in the RPE and variance in melanin concentration between patients and even between regions within an eye. In the first part of this work, we propose and demonstrate SRT monitoring and temperature estimation based on speckle variance OCT (svOCT) for dosimetry control. SvOCT quantifies speckle pattern variation caused by moving particles or structural changes in biological tissues. We find that the svOCT peak values have a reliable correlation with the degree of retinal lesion formation. The temperature at the neural retina and RPE is estimated from the svOCT peak values using numerically calculated temperature, which is consistent with the observed lesion creation.
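Speckle variance is commonly computed as the per-pixel variance of OCT intensity across a short stack of frames acquired at the same location. The sketch below shows that computation and a sliding-window version that yields a time series from which peak values could be read. The window length, the use of population variance on intensity images, and the function names are assumptions for this example; the abstract does not give the authors' exact processing.

```python
import numpy as np


def speckle_variance(frames):
    """Per-pixel variance across a stack of co-located OCT frames.

    frames: array of shape (num_frames, depth, width) of intensities.
    Returns an array of shape (depth, width).
    """
    frames = np.asarray(frames, dtype=float)
    return frames.var(axis=0)


def sliding_speckle_variance(frames, window):
    """Speckle variance over each run of `window` consecutive frames.

    Returns an array of shape (num_frames - window + 1, depth, width).
    """
    frames = np.asarray(frames, dtype=float)
    if not 2 <= window <= len(frames):
        raise ValueError("window must be between 2 and the number of frames")
    starts = range(len(frames) - window + 1)
    return np.stack([frames[i:i + window].var(axis=0) for i in starts])
```

Static tissue gives values near zero, while regions whose speckle pattern changes between frames, for example during laser-induced structural change, give larger values. Taking the maximum of the sliding-window output over time is one plausible way to obtain a peak value per pixel or region.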
In the second part, we propose to develop a hand-held subretinal injector actively guided by a common-path OCT (CP-OCT) distal sensor. Subretinal injection delivers drugs or stem cells into the space between the RPE and photoreceptor layers, so it can directly affect resident cells and tissues in the subretinal space. The technique requires high stability and dexterity from the surgeon owing to the fine anatomy of the retina, and it is challenging because of the surgeon's physiological motions, such as hand tremor. We mainly focus on two aspects of the CP-OCT guided subretinal injector: (i) a high-performance fiber probe based on a high-index epoxy lensed fiber to enhance the CP-OCT retinal image quality in a wet environment; and (ii) automated layer identification and tracking, in which each retinal layer boundary, as well as the retinal surface, is tracked using convolutional neural network (CNN)-based segmentation for accurate localization of the needle. The CNN performing retinal layer segmentation is integrated into the CP-OCT system for targeted layer distance sensing, and the CP-OCT distal sensor guided system is tested on ex vivo bovine retina.
Note: This is a virtual presentation. Here is the link for where the presentation will be taking place.
Title: Fine-grained activity recognition for assembly videos
Abstract: When a collaborative robot is working with a human partner to build a piece of furniture or an industrial part, the robot must be able to perceive which parts are connected and where, and it must be able to reason about how these connections can change as the result of its partner’s actions. This need can also arise in industrial process monitoring and manufacturing applications, where an automated system verifies a product as it progresses through the assembly line. These assembly processes require systems that can reason geometrically and temporally, relating the structure of an assembly to the manipulation actions that created it.
Grounded in a behavioral study of spatial cognition, this proposal combines methods for physical and temporal reasoning to enable the analysis and automated perception of assembly actions. We develop a temporal model that relates manipulation actions to the structures they produce and describe its use in enabling fine-grained behavioral analyses. Then, we apply our sequence model to recognize assembly actions in a variety of assembly scenarios. Finally, we describe a method for part-based reasoning that makes our approach robust to occluded and previously unseen assemblies.
Sanjeev Khudanpur, Department of Electrical and Computer Engineering
Greg Hager, Department of Computer Science
Vishal Patel, Department of Electrical and Computer Engineering