Dissertation Defense: Blake Dewey
Sep 22 @ 2:30 pm

Note: This is a virtual presentation.

Title: Synthesis-Based Harmonization of Multi-Contrast Structural MRI

Abstract: The flexible design of the MRI system allows multiple images with different acquisition parameters to be collected in a single scanning session. However, because MRI has no standards regulating image acquisition (unlike other imaging modalities, such as computed tomography), differences in acquisition lead to variability in image appearance between manufacturers, imaging centers, and even individual scanners. This variability can significantly degrade the quality of automated analysis, setting the stage for harmonization.

This dissertation describes four main contributions to the literature on synthesis-based harmonization for structural brain MR images. In synthesis-based harmonization, harmonized images are created that can be used confidently in automated analysis pipelines such as whole-brain segmentation, where image variability can cause inconsistent results. In our first contribution, we acquired a cross-domain dataset to provide training and validation data for our harmonization methods. This dataset is crucial to our work, as it provides examples of the same subjects under two different acquisition environments. In our second contribution, we used this unique, cross-domain dataset directly to develop a supervised method of harmonization. Our method, called DeepHarmony, uses state-of-the-art deep learning architectures and training strategies to provide significantly improved image harmonization over other synthesis methods. In our third contribution, we proposed an unsupervised harmonization framework to allow for broader applications where cross-domain data is not available. This novel framework is based on representation learning, where we aim to separate anatomical features from the acquisition environment in a disentangled latent space. We used multi-contrast MR images from the same scanning session as internal supervision to encourage this disentangled latent representation, and we demonstrated that this regularization alone was able to generate disentanglement in a completely data-driven way. In our final contribution, we extended our unsupervised work to a more diverse clinical trial dataset, which included T2-FLAIR and PD-weighted images. For this substantially more complex dataset, we made improvements to the disentanglement architecture and training strategies to produce a more consistent latent space. This method was shown to properly enforce our expectations on the latent space and can also evaluate images for inconsistent acquisition.
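To make the disentanglement idea concrete, here is a minimal, hypothetical sketch of harmonization by code-swapping in a disentangled latent space. The "encoder" and "decoder" are toy stand-ins (not the dissertation's networks): the point is only that, once anatomy and acquisition are separated, keeping the source anatomy code and adopting the target scanner's acquisition code yields a harmonized image.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    # Toy "encoder": the first half of the flattened image plays the role
    # of the anatomy code, the second half the acquisition code.
    flat = image.ravel()
    mid = flat.size // 2
    return flat[:mid].copy(), flat[mid:].copy()

def decode(anatomy, acquisition):
    # Toy "decoder": exact inverse of the toy encoder above.
    return np.concatenate([anatomy, acquisition]).reshape(4, 4)

source_img = rng.normal(size=(4, 4))  # image from the source scanner
target_img = rng.normal(size=(4, 4))  # image from the target scanner

anatomy_src, _theta_src = encode(source_img)
_beta_tgt, theta_tgt = encode(target_img)

# Harmonize: keep the source anatomy, adopt the target acquisition code.
harmonized = decode(anatomy_src, theta_tgt)
```

In the real framework the codes come from trained networks, and multi-contrast images from the same session provide the internal supervision that encourages this separation.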

Committee Members

  • Jerry Prince, Department of Electrical and Computer Engineering
  • Vishal Patel, Department of Electrical and Computer Engineering
  • Webster Stayman, Department of Biomedical Engineering
  • Peter van Zijl, Department of Radiology
  • Peter Calabresi, Department of Neurology
Dissertation Defense: Raghavendra Pappagari
Sep 29 @ 3:30 pm

Note: This is a virtual presentation.

Title: Towards Better Understanding of Spoken Conversations: Assessment of Emotion and Sentiment

Abstract: Emotions play a vital role in our daily life, as they help us convey information to other parties that is impossible to express verbally. While humans can easily perceive emotions, emotions are notoriously difficult for machines to define and recognize. Nevertheless, automatically detecting the emotion in a spoken conversation can be useful for a diverse range of applications, such as human-machine interaction and conversation analysis. Automatic speech emotion recognition (SER) can be broadly classified into two types: SER from isolated utterances and SER from long recordings. In this thesis, we present machine-learning-based approaches to recognize emotion from both isolated utterances and long recordings.

Isolated utterances are usually shorter than 10 s in duration and are assumed to contain only one major emotion. One of the main obstacles to achieving high emotion recognition accuracy in this case is the lack of large annotated datasets. We proposed to mitigate this problem using transfer learning and data augmentation techniques. We show that utterance representations (x-vectors) extracted from speaker recognition models (x-vector models) contain emotion-predictive information, and that adapting those models provides significant improvements in emotion recognition performance. To further improve performance, we proposed a novel, perceptually motivated data augmentation method for isolated utterances, CopyPaste. Assuming that the presence of any emotion other than neutral dictates a speaker's overall perceived emotion in a recording, the concatenation of an emotional utterance (with emotion E) and a neutral utterance can still be labeled with emotion E. We show that training the model on this concatenated data along with the original training data improves performance. We presented three CopyPaste schemes and evaluated them on two models, one trained independently and another using transfer learning from an x-vector (speaker recognition) model, in both clean and noisy conditions. We validated the proposed approaches on three datasets, each collected with a different elicitation method: Crema-D (acted emotions), IEMOCAP (induced emotions), and MSP-Podcast (spontaneous emotions).
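The CopyPaste labeling rule described above can be sketched in a few lines. This is my reading of the abstract, not the author's code, and the waveforms are stand-in arrays: concatenating an emotional clip with a neutral clip produces an augmented training example that keeps the emotional label.

```python
import numpy as np

def copy_paste(emotional_wav, emotional_label, neutral_wav):
    """Return an augmented (waveform, label) training pair.

    Under the assumption that a non-neutral emotion dominates the
    perceived emotion of the whole recording, the concatenated clip
    inherits the emotional clip's label.
    """
    augmented = np.concatenate([emotional_wav, neutral_wav])
    return augmented, emotional_label

# Stand-ins: a 1 s "sad" utterance and a 0.5 s neutral one at 16 kHz.
sad = np.ones(16000)
neutral = np.zeros(8000)

wav, label = copy_paste(sad, "sad", neutral)
# wav has 24000 samples and is still labeled "sad".
```

The three schemes mentioned in the abstract presumably differ in how the emotional and neutral clips are paired and ordered; this sketch shows only the core labeling assumption.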

As isolated utterances are assumed to contain only one emotion, the proposed models make predictions at the utterance level, i.e., one emotion prediction for the whole utterance. However, these models cannot be directly applied to conversations, which can contain multiple emotions, unless the locations of the emotion boundaries are known. In this work, we propose to recognize emotions in conversations by performing frame-level classification, where predictions are made at regular intervals. We investigated several deep learning architectures that can exploit context in conversations: transformers, ResNet-34, and BiLSTM. We show that models trained on isolated utterances perform worse than models trained on conversations, suggesting the importance of context. Based on the inner workings of the attention operation, we propose a data augmentation method, DiverseCatAugment (DCA), to equip transformer models with better classification ability. However, these models do not exploit the turn-taking patterns available in conversations. Speakers in a conversation take turns exchanging information, and the emotion in each turn can depend on the speaker's and the conversation partner's emotions in past turns. We show that exploiting the information of who is speaking when in a conversation improves emotion recognition performance. The proposed models can exploit speaker information even in the absence of speaker segmentation information.
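The shift from utterance-level to frame-level prediction can be illustrated with a small sketch. The model below is a random stand-in (not the thesis's transformer, ResNet-34, or BiLSTM); what matters is the interface: one emotion label per fixed hop across the conversation, so emotion boundaries need not be known in advance.

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]
HOP_SECONDS = 0.25  # hypothetical prediction interval

rng = np.random.default_rng(1)

def frame_level_predict(num_frames):
    # Stand-in for a contextual model that scores every frame of the
    # conversation jointly; each row holds one score per emotion class.
    scores = rng.normal(size=(num_frames, len(EMOTIONS)))
    return [EMOTIONS[i] for i in scores.argmax(axis=1)]

conversation_seconds = 10.0
num_frames = int(conversation_seconds / HOP_SECONDS)
labels = frame_level_predict(num_frames)  # 40 frame-level predictions
```

An utterance-level model would instead collapse the whole clip to a single label, which is why it cannot handle a conversation whose emotion changes partway through.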

Annotating utterances with emotions is not a simple task: it is expensive, time-consuming, and depends on the number of emotions used for annotation. However, annotation schemes can be changed to reduce annotation effort, depending on the application. For example, some applications only need to classify utterances as positive or negative rather than into finer-grained emotions such as angry, happy, sad, and disgust. We considered one such application in this thesis: predicting customer satisfaction (CSAT) in call center conversations. CSAT is defined as the overall sentiment (positive vs. negative) of the customer about their interaction with the agent. As the goal is to predict only one label for the whole conversation, we perform utterance-level classification. We conducted a comprehensive search for adequate acoustic and lexical representations at different granularities of a conversation, such as the word/frame, turn, and call levels. From the acoustic signal, we found that the proposed x-vector representation combined with a feed-forward deep neural network outperformed widely used prosodic features. From transcripts, CSAT Tracker, a novel method that computes an overall prediction based on individual segment outcomes, performed best. Both methods rely on transfer learning to obtain their best performance. We also performed fusion of acoustic and lexical features using a convolutional network. We evaluated our systems on US English telephone speech from call center data. We found that lexical models perform better than acoustic models, and that fusing them provides significant gains. An analysis of errors revealed that calls where customers accomplished their goal but were still dissatisfied are the most difficult to predict correctly. We also found that the customer's speech is more emotional than the agent's.
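Aggregating segment-level outcomes into one call-level label can be sketched simply. This is an illustrative baseline, not the CSAT Tracker method from the abstract (whose aggregation is not specified here): it averages hypothetical per-segment positive-sentiment probabilities and thresholds the mean.

```python
def call_level_csat(segment_pos_probs, threshold=0.5):
    """Aggregate per-segment P(positive) values into one call label.

    segment_pos_probs: hypothetical per-segment probabilities that the
    customer's sentiment in that segment is positive.
    """
    mean_prob = sum(segment_pos_probs) / len(segment_pos_probs)
    return "satisfied" if mean_prob >= threshold else "dissatisfied"

# A call that starts well but ends on a sour note:
label = call_level_csat([0.9, 0.8, 0.3])
```

A learned aggregator such as CSAT Tracker can weight segments unevenly, which matters for the hard cases noted above, where a customer accomplishes their goal yet ends the call dissatisfied.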

Committee Members:

  • Najim Dehak, Department of Electrical and Computer Engineering
  • Jesús Villalba, Department of Electrical and Computer Engineering
  • Hynek Hermansky, Department of Electrical and Computer Engineering