Thesis Proposal: Raghavendra Pappagari
Title: Towards a better understanding of spoken conversations: Assessment of sentiment and emotion
Abstract: In this talk, we present our work on understanding the emotional aspects of spoken conversations. Emotions play a vital role in our daily life as they help us convey information impossible to express verbally to other parties.
While humans can easily perceive emotions, these are notoriously difficult to define and recognize by machines. However, automatically detecting the emotion of a spoken conversation can be useful for a diverse range of applications such as human-machine interaction and conversation analysis. In this work, we considered emotion recognition in two particular scenarios. The first scenario is predicting customer sentiment/satisfaction (CSAT) in a call center conversation, and the second consists of emotion prediction in short utterances.
CSAT is defined as the overall sentiment (positive vs. negative) of the customer about his/her interaction with the agent. In this work, we perform a comprehensive search for adequate acoustic and lexical representations.
For acoustic representation, we propose to use the x-vector model, which is known for its state-of-the-art performance in the speaker recognition task. The motivation behind using x-vectors for CSAT is we observed that emotion information encoded in x-vectors affected speaker recognition performance. For lexical, we introduce a novel method, CSAT Tracker, which computes the overall prediction based on individual segment outcomes. Both methods rely on transfer learning to obtain the best performance. We classified using convolutional neural networks combining the acoustic and lexical features. We evaluated our systems on US English telephone speech from call center data. We found that lexical models perform better than acoustic models and fusion of them provided significant gains. The analysis of errors uncovers that the calls where customers accomplished their goal but were still dissatisfied are the most difficult to predict correctly. Also, we found that the customer’s speech is more emotional compared to the agent’s speech.
For the second scenario of predicting emotion, we present a novel approach based on x-vectors. We show that adapting the x-vector model for emotion recognition provides the best-published results on three public datasets.