VESUS: Varied Emotion in Syntactically Uniform Speech

The Varied Emotion in Syntactically Uniform Speech (VESUS) repository is a lexically controlled database collected by the NSA lab. Here, actors read a semantically neutral script of words, phrases, and sentences with different emotional inflections. VESUS contains 252 distinct phrases, each read by 10 actors in 5 emotional states (neutral, angry, happy, sad, fearful). The script can be downloaded by filling out this form.

We obtained ten crowd-sourced ratings for each utterance to characterize the perceived emotion across a general population. In total, VESUS contains over 6 hours of pure speech and over 125,000 emotional annotations.
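
For a sense of scale, the numbers above compose as in the short Python sketch below. The counts are taken directly from this page; the small gap between the 12,600 possible recordings and the 12,594 released utterances is our assumption about a few dropped takes.

    # Back-of-the-envelope corpus size, using the counts quoted on this page.
    phrases = 252                 # distinct phrases in the script
    actors = 10                   # 5 male, 5 female speakers
    emotions = 5                  # neutral, angry, happy, sad, fearful
    ratings_per_utterance = 10    # crowd-sourced AMT ratings per recording

    max_utterances = phrases * actors * emotions              # 12,600 possible recordings
    max_annotations = max_utterances * ratings_per_utterance  # up to 126,000 ratings

    print(max_utterances, max_annotations)  # 12600 126000
    # The released database reports 12,594 utterances, so a handful of takes
    # were presumably excluded (our assumption, not stated on this page).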

Figure: Phonetic comparison of our script (blue) with the 3,000 most commonly used words in spoken English (red). Each bin corresponds to one of the 44 English phonemes, and the y-axis indicates the frequency of occurrence. The VESUS script is phonetically balanced.
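
The balance check in this figure can be approximated by tallying phoneme frequencies over each word list. Below is a minimal Python sketch using NLTK's copy of the CMU Pronouncing Dictionary; note that CMUdict uses a 39-phoneme ARPAbet inventory rather than the 44 phonemes shown in the figure, and the two word lists here are placeholders rather than the actual VESUS script.

    from collections import Counter
    from nltk.corpus import cmudict   # requires a one-time nltk.download('cmudict')

    pron = cmudict.dict()

    def phoneme_frequencies(words):
        """Relative frequency of each phoneme across a word list (stress digits stripped)."""
        counts = Counter()
        for word in words:
            for phone in pron.get(word.lower(), [[]])[0]:  # first listed pronunciation
                counts[phone.rstrip("012")] += 1
        total = sum(counts.values()) or 1
        return {p: c / total for p, c in counts.items()}

    # Placeholder word lists: the VESUS script vs. a common-word reference list.
    script_words = ["tell", "me", "more", "about", "it"]
    common_words = ["the", "be", "to", "of", "and"]

    print(phoneme_frequencies(script_words))
    print(phoneme_frequencies(common_words))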

Data Acquisition

We recruited ten English-speaking actors (5 male, 5 female) with varying levels of professional experience from the Baltimore area. Informed consent was obtained prior to each session according to an approved IRB protocol. The audio recordings took place in a sound-proof environment on the Johns Hopkins University campus. Our audio equipment consisted of an AKG Pro Audio C214 cardioid condenser microphone with an adjustable stand, a Focusrite Scarlett 2i2 preamplifier, and GLS cables.

The actors received a paper copy of the script. They were first asked to read the entire script aloud in a neutral voice. This process was repeated for each of the following emotions: happiness, sadness, anger, and fear. The actors were instructed to pause between utterances to give themselves time to reset, and they were given a break between script readings. Finally, we asked each actor to rate their level of confidence in each emotional portrayal (scale: 1–10).

We have also used Amazon Mechanical Turk (AMT) to crowd-source ten emotional annotations for each of the 12,594 utterances. Each AMT task involved listening to a single recorded utterance and answering two simple questions. First, users were asked which emotion (happiness, sadness, anger, fear, neutral) best described the attitude of the speaker. Users were explicitly instructed to base their decision on the tone of voice rather than on the semantic content. Second, users were asked to rate their level of confidence in the emotion they selected (scale: 1–5). We did not query secondary emotional categories in this study to avoid influencing the annotators' gut reactions.
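
As an illustration of how these crowd annotations might be consumed downstream, the sketch below aggregates the ten per-utterance ratings into a confidence-weighted soft label and a majority-vote perceived emotion. The record layout and the weighting scheme are our own illustration, not a documented part of the VESUS release format.

    from collections import Counter

    EMOTIONS = ["neutral", "angry", "happy", "sad", "fearful"]

    def aggregate_ratings(ratings):
        """ratings: list of (emotion, confidence) pairs, confidence on the 1-5 scale.
        Returns a confidence-weighted soft label over EMOTIONS and the majority emotion."""
        weights = Counter()
        for emotion, confidence in ratings:
            weights[emotion] += confidence
        total = sum(weights.values()) or 1
        soft_label = {e: weights[e] / total for e in EMOTIONS}
        majority = max(soft_label, key=soft_label.get)
        return soft_label, majority

    # Ten hypothetical AMT annotations for a single utterance.
    example = [("happy", 4), ("happy", 5), ("happy", 3), ("neutral", 2), ("happy", 4),
               ("happy", 5), ("neutral", 3), ("happy", 4), ("happy", 2), ("sad", 1)]
    print(aggregate_ratings(example))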

Release

VESUS will be made freely available for academic use. To gain access to the database, fill out this form. You will be provided a download link once we have received your information. Additionally, please cite the following paper in any work that uses VESUS:

J. Sager, R. Shankar, J. Reinhold, and A. Venkataraman. VESUS: A Crowd-Annotated Database to Study Emotion Production and Perception in Spoken English. In Proc. Interspeech: Conference of the International Speech Communication Association, pp. 316–320, 2019.