Abstract:
Audio event detection (AED) is a technology aimed at detecting and classifying sound events within an audio signal. AED plays a critical role in enabling machines to understand audio content in various contexts and has direct applications in content retrieval, audio analytics, surveillance systems, and sound monitoring, among others. In this thesis, we revise the classic AED paradigm to introduce a nuanced, real-valued “degree of presence” for individual audio objects within complex auditory landscapes. This approach goes beyond the limiting binary on/off models that have been the norm in the field, offering a richer and more accurate portrayal that closely mirrors human perception of sound. It has significant implications for diverse applications, including but not limited to content generation, audio captioning, and perceptual quality assessment.
To lay the groundwork for this approach, we tackle the understudied but crucial concept of auditory salience, which measures an object’s capacity to command attention. While visual salience is relatively well understood, its auditory counterpart has remained largely unexplored due to challenges in capturing attention mechanisms in free-listening conditions. We surmount this hurdle by employing a novel crowd-sourced dichotic salience paradigm. By rigorously validating the reliability of crowd-sourced data through direct comparison with controlled laboratory settings, we not only demonstrate the efficacy of this methodology but also pave the way for collecting large-scale, diverse auditory salience datasets.
Moreover, we expand the frontier of auditory salience research by exploring its often-overlooked semantic dimensions. Through carefully designed experimental studies that manipulate the direction of audio scenes, we reveal new insights into how auditory salience is significantly influenced by semantic cues in addition to acoustic attributes. Using empirical data derived from the crowd-sourced dichotic paradigm, together with predictive models of salience and salient events, we establish that perceptual salience balances low-level acoustic and high-level semantic attributes in guiding what stands out in a natural scene.
In addition to the salience models, we develop robust methodologies to detect overlapping audio events with diverse temporal dynamics. Mimicking how the human auditory cortex performs a rate-specific analysis that can selectively track objects over time, we design deep-learning models whose latent spaces are constrained to follow specific dynamics. Using these constraints as priors in a variational autoencoder framework, we leverage large-scale unlabeled data to train rate-specific audio encoders. These rate-specific encoders provide performance gains when fine-tuned within a semi-supervised AED framework. We also introduce a coherence-based regularization that enforces smoothness constraints on the latent space, which can lead to further gains in AED performance.
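To make this idea concrete, the minimal sketch below shows one way such latent-space constraints could be expressed in a variational autoencoder; the network shapes, the first-order autoregressive ("rate") prior, and the coherence penalty are illustrative assumptions for exposition, not the exact models developed in this thesis.

```python
# Illustrative sketch only: the layer sizes, the AR(1) rate prior, and the
# coherence penalty are assumptions chosen to make the idea concrete.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RateSpecificVAE(nn.Module):
    """VAE over log-mel frames whose latent prior favors a target temporal rate."""
    def __init__(self, n_mels=64, latent_dim=32, rho=0.9):
        super().__init__()
        self.rho = rho  # AR(1) coefficient: closer to 1 -> slower latent dynamics
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 2 * latent_dim, kernel_size=5, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(latent_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, x):                      # x: (batch, n_mels, time)
        mu, logvar = self.encoder(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar, z

def loss_fn(model, x, beta=1.0, gamma=0.1):
    x_hat, mu, logvar, z = model(x)
    recon = F.mse_loss(x_hat, x)
    # Rate-specific prior: z_t ~ N(rho * z_{t-1}, I); here approximated by
    # conditioning on mu_{t-1}. The KL term penalizes latent trajectories that
    # deviate from the target dynamics.
    prior_mu = model.rho * mu[..., :-1]
    kl = 0.5 * (
        torch.exp(logvar[..., 1:]) + (mu[..., 1:] - prior_mu) ** 2
        - 1.0 - logvar[..., 1:]
    ).mean()
    # Coherence regularizer: encourage smooth, slowly varying latent activations.
    coherence = (z[..., 1:] - z[..., :-1]).pow(2).mean()
    return recon + beta * kl + gamma * coherence

# Usage on unlabeled log-mel batches; the trained encoder would later be
# fine-tuned for AED in a semi-supervised setting.
model = RateSpecificVAE()
x = torch.randn(8, 64, 200)  # stand-in for a batch of log-mel spectrograms
loss = loss_fn(model, x)
loss.backward()
```

In this sketch, the AR(1) coefficient `rho` (a hypothetical parameter) plays the role of the target rate: different values yield encoders tuned to faster or slower event dynamics, while the coherence term enforces the smoothness constraint on the latent space.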