Humans can effortlessly pick out a single voice in a noisy room, a skill that has long challenged artificial intelligence. Now, researchers at Johns Hopkins have developed an AI system with a similar knack for homing in on specific sounds. Called FlexSED, short for Flexible Sound Event Detection, the model can recognize and precisely mark when a sound occurs within an audio recording, given only a plain-language description of that sound.
“For example, if someone types ‘dog barking,’ FlexSED can scan a long audio clip and highlight the exact moments when a bark occurs, down to the second,” said co-author Jiarui Hai, a PhD student in the Department of Electrical and Computer Engineering.
Presented in October at the 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), the system is a major leap toward open-vocabulary sound understanding. What sets it apart is its ability to understand sounds described in everyday language, a step beyond traditional models limited to a fixed set of labeled categories.
Traditional sound detection models can only identify sounds they were trained on, which limits how well they work in real-world settings. FlexSED takes a different approach: it can understand natural language descriptions of sounds, so it isn’t tied to a fixed list. “The system can respond to whatever sound the user describes,” said Hai. This allows FlexSED to recognize unfamiliar sounds (a process known as zero-shot learning) and quickly learn new ones with just a few examples (called few-shot learning), making it useful in settings like medical monitoring and wildlife tracking.
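In practice, detectors of this kind typically produce a presence score for each short audio frame given the text query, and a small post-processing step turns those scores into the second-level timestamps Hai describes for "dog barking." The sketch below illustrates that step only; the scores, threshold, and 20 ms hop size are made-up example values, not FlexSED's actual output format or code.

```python
# Illustrative post-processing only: converts per-frame detection scores,
# like those an open-vocabulary detector might produce for the query
# "dog barking", into timestamped segments. Scores, threshold, and the
# 20 ms hop size are hypothetical, not FlexSED's real settings.

def scores_to_events(scores, threshold=0.5, hop_seconds=0.02):
    """Turn per-frame probabilities into (start, end) times in seconds."""
    events, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = round(i * hop_seconds, 3)                  # sound onset
        elif s < threshold and start is not None:
            events.append((start, round(i * hop_seconds, 3)))  # sound offset
            start = None
    if start is not None:  # sound still active at the end of the clip
        events.append((start, round(len(scores) * hop_seconds, 3)))
    return events

# A 10-frame toy clip in which the queried sound is active twice.
frame_scores = [0.1, 0.2, 0.9, 0.95, 0.8, 0.1, 0.05, 0.7, 0.85, 0.2]
print(scores_to_events(frame_scores))  # [(0.04, 0.1), (0.14, 0.18)]
```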
The organizing committee recognized the work with a Spotlight oral presentation, an honor reserved for the small number of papers that receive high evaluations from reviewers.
To build this model, the researchers combined two pretrained systems: one that learns sound patterns from large amounts of unlabeled audio, and another that understands text descriptions such as “car horn,” “person laughing,” or “glass shattering.” An adaptive fusion strategy allows the model to integrate and fine-tune these components using a relatively small amount of labeled data, enabling open-vocabulary sound event detection without extensive task-specific retraining.
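Conceptually, the two-encoder recipe can be pictured as follows: the audio encoder emits one embedding per time frame, the text encoder emits one embedding for the query, and lightweight projections into a shared space let a per-frame similarity score indicate when the queried sound is present. The sketch below is a generic illustration of that idea under these assumptions, not the authors' adaptive fusion design; all names, dimensions, and the random "encoder outputs" are hypothetical.

```python
# A minimal, generic sketch of text-conditioned sound event detection with
# two pretrained encoders (not FlexSED's exact architecture).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_frames, audio_dim, text_dim, joint_dim = 100, 768, 512, 256

# Stand-ins for pretrained encoder outputs (random tensors for illustration).
frame_embeddings = torch.randn(num_frames, audio_dim)  # audio encoder: one vector per frame
query_embedding = torch.randn(text_dim)                # text encoder: "car horn", say

# Small projection heads map both modalities into a shared space; in a real
# system these are among the parts adapted on a modest amount of labeled data.
audio_proj = torch.nn.Linear(audio_dim, joint_dim)
text_proj = torch.nn.Linear(text_dim, joint_dim)

frames = audio_proj(frame_embeddings)  # (num_frames, joint_dim)
query = text_proj(query_embedding)     # (joint_dim,)

# Cosine similarity per frame, squashed to [0, 1] as a presence score.
scores = torch.sigmoid(F.cosine_similarity(frames, query.unsqueeze(0), dim=-1))
print(scores.shape)  # torch.Size([100]) -- one score per audio frame
```

In this toy version only the projection layers would need training; FlexSED's adaptive fusion strategy plays the analogous role of integrating and fine-tuning the pretrained components with relatively little labeled data.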
FlexSED performed better than traditional sound detection models on benchmark tests. “Even when tested on sounds outside its training set, the model showed a strong ability to identify them,” said Hai. “With just a few examples of new sounds, its accuracy rose even higher, showing that the system can quickly learn and adapt to new environments.”
Because FlexSED can understand everyday language, it could be used in many real-world settings, from spotting safety alerts in noisy workplaces to recognizing animal sounds in nature recordings. It can support audio-aware AI agents by helping them determine what happened in an audio clip, and when. Its speed and accuracy also make it a strong foundation for assistive technologies that help people with hearing loss interpret the sounds in their surroundings.
FlexSED is open-source, with code and pretrained models available on GitHub.
Study co-authors include Charles Renn Faculty Scholar and Professor Mounya Elhilali, and PhD students Helin Wang and Weizhe Guo in the Department of Electrical and Computer Engineering.