Note: This is a virtual presentation. Here is the link for where the presentation will be taking place.
Title: Improving speaker embedding in speaker verification: Beyond speaker discriminative training
Abstract: Speaker verification (SV) is the task of verifying a claimed identity from a voice signal. A well-performing SV system requires a method that transforms a variable-length recording into a fixed-length representation (a.k.a. an embedding vector), compacting the biometric information that distinguishes one speaker from another. There are two popular methods: i-vector and x-vector. Although the i-vector is still in use, the x-vector outperforms it in many SV tasks now that deep learning research has surged. The x-vector, however, has limitations, and we mainly tackle two of them in this proposal: 1) the embedding still includes information about the spoken text, and 2) it cannot leverage data without speaker labels because its training requires them.
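As a rough illustration of the fixed-length idea, the following minimal Python sketch pools frame-level features of any duration into one vector and scores a verification trial by cosine similarity. It is not the i-vector or x-vector pipeline: the mean-and-standard-deviation pooling, the function names (`stats_pool`, `cosine_score`, `verify`), the 24-dimensional random features, and the 0.5 threshold are all assumptions made for the example.

```python
import numpy as np

def stats_pool(frames: np.ndarray) -> np.ndarray:
    """Map a (num_frames, feat_dim) array to a fixed-length (2 * feat_dim,) vector."""
    # Mean and standard deviation over time remove the dependence on duration.
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings (higher means more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def verify(enroll_frames: np.ndarray, test_frames: np.ndarray, threshold: float = 0.5):
    """Accept the claimed identity if the score reaches the (hypothetical) threshold."""
    score = cosine_score(stats_pool(enroll_frames), stats_pool(test_frames))
    return score >= threshold, score

# Recordings of different lengths yield embeddings of the same size.
rng = np.random.default_rng(0)
short_rec = rng.normal(size=(120, 24))   # 120 frames of made-up features
long_rec = rng.normal(size=(900, 24))    # 900 frames of made-up features
assert stats_pool(short_rec).shape == stats_pool(long_rec).shape == (48,)
```

A false rejection in this sketch corresponds to `verify` returning `False` for two recordings of the same speaker, for example because the pooled vectors differ for reasons unrelated to speaker identity.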
In the first half, we tackle the text dependency of the x-vector speaker embedding. Spoken-text information remaining in the x-vector can degrade its performance in text-independent SV because utterances of the same speaker may have different embeddings when the spoken text differs. This could lead to a false rejection, i.e., the system rejects a valid target speaker. To tackle this issue, we propose to disentangle the spoken text and speaker identity into separate latent factors using a text-to-speech (TTS) model. A TTS model suits this purpose for two reasons. First, the multi-speaker end-to-end TTS system has text and speech encoders, each of which focuses on encoding information in its corresponding modality. These encoders enable text-independent speaker embedding learning by reconstructing the frames of a target speech segment, given a speaker embedding of another speech segment of the same utterance. Second, considerable effort in neural TTS research over recent years has improved speech synthesis quality. We hypothesize that speech synthesis quality and speaker embedding quality are positively correlated, since the speaker encoder in a TTS system needs to learn well to synthesize the speech of multiple speakers better. We confirm the above two points through a series of experiments.
In the second half, we focus on leveraging unlabeled data to learn speaker embeddings. Because much more unlabeled data exists than labeled data, leveraging it is essential, yet doing so is not straightforward with x-vector training. This, however, is possible with the proposed TTS method. First, we show how to use the TTS method for this purpose. The results show that it can leverage the unlabeled data, but it still requires some labeled data to post-process the embeddings for the final SV system. To develop a completely unsupervised SV system, we apply a self-supervised technique proposed in computer vision research, self-distillation with no labels (DINO), and compare it to the TTS method. The results show that the DINO method outperforms the TTS method in unsupervised scenarios and enables SV with no labels.
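For readers unfamiliar with DINO, the sketch below shows the general shape of its training objective in NumPy: a student network is trained to match the centered and sharpened output distribution of a teacher, with no labels involved. This is an illustration of the general technique, not the configuration used in this work. The function names, the temperatures, the momentum value, and the random arrays standing in for network outputs are assumptions, as is the idea that the two views fed to student and teacher would be different segments of the same utterance.

```python
import numpy as np

def log_softmax(logits: np.ndarray, temp: float) -> np.ndarray:
    """Temperature-scaled log-softmax over the last axis, computed stably."""
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04) -> float:
    """Cross-entropy from the centered, sharpened teacher output to the student output."""
    # Teacher targets: subtract the running center, then sharpen with a low temperature.
    # In real training these targets are treated as constants (no gradient).
    teacher_probs = np.exp(log_softmax(teacher_logits - center, teacher_temp))
    student_log_probs = log_softmax(student_logits, student_temp)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

def ema_update(old: np.ndarray, new: np.ndarray, momentum: float) -> np.ndarray:
    """Exponential moving average, used for the center (and for teacher weights)."""
    return momentum * old + (1.0 - momentum) * new

# Stand-ins for the outputs of the two networks on two views of the same input.
rng = np.random.default_rng(0)
student_out = rng.normal(size=(8, 16))
teacher_out = rng.normal(size=(8, 16))
center = np.zeros(16)

loss = dino_loss(student_out, teacher_out, center)
center = ema_update(center, teacher_out.mean(axis=0), momentum=0.9)
```

The centering and the low teacher temperature work against each other on purpose: centering discourages one output dimension from dominating, while sharpening discourages a uniform output, which together help avoid a collapsed, uninformative embedding.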
Future work will focus on 1) exploring the DINO-based method in semi-supervised scenarios and 2) fine-tuning the network for downstream tasks such as emotion recognition.