Our current understanding of how the brain segregates auditory scenes into meaningful objects is in line with a Gestalt framework. These Gestalt principles suggest a theory of how different attributes of the soundscape are extracted and then bound together into separate groups that reflect different objects or streams present in the scene. These cues are thought to reflect the underlying statistical structure of natural sounds, in much the same way that the statistics of natural images are closely linked to the principles that guide figure-ground segregation and object segmentation in vision. In the present study, we leverage inference in stochastic neural networks to learn emergent grouping cues directly from natural soundscapes including speech, music and sounds in nature. The model learns a hierarchy of local and global spectro-temporal attributes reminiscent of simultaneous and sequential Gestalt cues that underlie the organization of auditory scenes. These mappings operate at multiple time scales to analyze an incoming complex scene and are then fused using a Hebbian network that binds together coherent features into perceptually-segregated auditory objects. The proposed architecture successfully emulates a wide range of well-established auditory scene segregation phenomena and quantifies the complementary role of segregation and binding cues in driving auditory scene segregation.
To understand our surroundings, we effortlessly parse our sound environment into sound sources, extracting invariant information (or regularities) over time to build an internal representation of the world around us. Previous experimental work has shown the brain is sensitive to many types of regularities in sound, but theoretical models that capture underlying principles of regularity tracking across diverse sequence structures have been few and far between. Existing efforts often focus on sound patterns rather than the stochastic nature of sequences. In the current study, we employ a perceptual model for regularity extraction based on a Bayesian framework that posits the brain collects statistical information over time. We show this model can be used to simulate various results from the literature with stimuli exhibiting a wide range of predictability. This model can provide a useful tool for both interpreting existing experimental results under a unified model and providing predictions for new ones using more complex stimuli.
Deep neural networks have recently been shown to capture intricate information transformation of signals from the sensory profiles to semantic representations that facilitate recognition or discrimination of complex stimuli. In this vein, convolutional neural networks (CNNs) have been used very successfully in image and audio classification. Designed to imitate the hierarchical structure of the nervous system, CNNs reflect activation with increasing degrees of complexity that transform the incoming signal onto object-level representations. In this work, we employ a CNN trained for large-scale object classification to gain insights about the contribution of various audio representations that guide sound perception. The analysis contrasts activation of different layers of a convolutional neural network with acoustic features extracted directly from the scenes, perceptual salience obtained from behavioral responses of human listeners, as well as neural oscillations recorded by electroencephalography (EEG) in response to the same natural scenes. All three measures are tightly linked quantities believed to guide percepts of salience and object formation when listening to complex scenes. The results paint a picture of the intricate interplay between low-level and object-level representations in guiding auditory salience that is very much dependent on context and sound category.
Our ability to parse our acoustic environment relies on the brain's capacity to extract statistical regularities from surrounding sounds. Previous work in regularity extraction has predominantly focused on the brain's sensitivity to predictable patterns in sound sequences. However, natural sound environments are rarely completely predictable, often containing some level of randomness, yet the brain is able to effectively interpret its surroundings by extracting useful information from stochastic sounds. It has been previously shown that the brain is sensitive to the marginal lower-order statistics of sound sequences (i.e., mean and variance). In this work, we investigate the brain's sensitivity to higher-order statistics describing temporal dependencies between sound events through a series of change detection experiments, where listeners are asked to detect changes in randomness in the pitch of tone sequences. Behavioral data indicate listeners collect statistical estimates to process incoming sounds, and a perceptual model based on Bayesian inference shows a capacity in the brain to track higher-order statistics. Further analysis of individual subjects' behavior indicates an important role of perceptual constraints in listeners' ability to track these sensory statistics with high fidelity. In addition, the inference model facilitates analysis of neural electroencephalography (EEG) responses, anchoring the analysis relative to the statistics of each stochastic stimulus. This reveals both a deviance response and a change-related disruption in phase of the stimulus-locked response that follow the higher-order statistics. These results shed light on the brain's ability to process stochastic sound sequences.
Goal: Chest auscultations offer a noninvasive and low-cost tool for monitoring lung disease. However, they present many shortcomings, including inter-listener variability, subjectivity, and vulnerability to noise and distortions. This work proposes a computer-aided approach to process lung signals acquired in the field under adverse noisy conditions, by improving the signal quality and offering automated identification of abnormal auscultations indicative of respiratory pathologies. Methods: The developed noise-suppression scheme eliminates ambient sounds, heart sounds, sensor artifacts, and crying contamination. The improved high-quality signal is then mapped onto a rich spectrotemporal feature space before being classified using a trained support-vector machine classifier. Individual signal frame decisions are then combined using an evaluation scheme, providing an overall patient-level decision for unseen patient records. Results: All methods are evaluated on a large dataset with \textgreater1000 children enrolled, 1–59 months old. The noise suppression scheme is shown to significantly improve signal quality, and the classification system achieves an accuracy of 86.7\% in distinguishing normal from pathological sounds, far surpassing other state-of-the-art methods. Conclusion: Computerized lung sound processing can benefit from the enforcement of advanced noise suppression. A fairly short processing window size (less than 1 s) combined with detailed spectrotemporal features is recommended, in order to capture transient adventitious events without highlighting sharp noise occurrences. Significance: Unlike existing methodologies in the literature, the proposed work is not limited in scope or confined to laboratory settings: This work validates a practical method for fully automated chest sound processing applicable to realistic and noisy auscultation settings.
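As a rough illustration of how per-frame classifier outputs might be pooled into a patient-level decision, the sketch below uses a simple fraction-of-abnormal-frames rule. The abstract does not specify the evaluation scheme, so the function name `patient_decision`, the 0/1 label encoding, and the 0.5 threshold are all assumptions made for illustration only.

```python
from typing import Sequence


def patient_decision(frame_labels: Sequence[int], threshold: float = 0.5) -> int:
    """Pool per-frame labels (1 = abnormal, 0 = normal) into one patient label.

    Hypothetical rule (an assumption, not the scheme used in the study):
    flag the patient as abnormal when the fraction of abnormal frames
    is at least `threshold`.
    """
    if not frame_labels:
        raise ValueError("need at least one frame decision")
    abnormal_fraction = sum(frame_labels) / len(frame_labels)
    return 1 if abnormal_fraction >= threshold else 0


# Toy record: three of five frames were classified as abnormal.
frames = [0, 1, 1, 0, 1]
print(patient_decision(frames))
```

Raising the threshold would trade sensitivity for specificity at the patient level; the value that suits a given dataset would have to be chosen on held-out records.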
Introduction Paediatric lung sound recordings can be systematically assessed, but methodological feasibility and validity are unknown, especially from developing countries. We examined the performance of acoustically interpreting recorded paediatric lung sounds and compared sound characteristics between cases and controls. Methods Pneumonia Etiology Research for Child Health staff in six African and Asian sites recorded lung sounds with a digital stethoscope in cases and controls. Cases aged 1–59 months had WHO severe or very severe pneumonia; age-matched community controls did not. A listening panel assigned examination results of normal, crackle, wheeze, crackle and wheeze or uninterpretable, with adjudication of discordant interpretations. Classifications were recategorised into any crackle, any wheeze or abnormal (any crackle or wheeze) and primary listener agreement (first two listeners) was analysed among interpretable examinations using the prevalence-adjusted, bias-adjusted kappa (PABAK). We examined predictors of disagreement with logistic regression and compared case and control lung sounds with descriptive statistics. Results Primary listeners considered 89.5\% of 792 case and 92.4\% of 301 control recordings interpretable. Among interpretable recordings, listeners agreed on the presence or absence of any abnormality in 74.9\% (PABAK 0.50) of cases and 69.8\% (PABAK 0.40) of controls, presence/absence of crackles in 70.6\% (PABAK 0.41) of cases and 82.4\% (PABAK 0.65) of controls and presence/absence of wheeze in 72.6\% (PABAK 0.45) of cases and 73.8\% (PABAK 0.48) of controls. Controls, tachypnoea, \textgreater3 uninterpretable chest positions, crying, upper airway noises and study site predicted listener disagreement. Among all interpretable examinations, 38.0\% of cases and 84.9\% of controls were normal (p\textless0.0001); wheezing was the most common sound (49.9\%) in cases. Conclusions Listening panel and case–control data suggest that our methodology is feasible and likely valid, and that small airway inflammation is common in WHO pneumonia. Digital auscultation may be an important future pneumonia diagnostic in developing countries.
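The PABAK values quoted above follow from the observed agreement proportions. For k rating categories, PABAK = (k * Po - 1) / (k - 1), which reduces to 2 * Po - 1 for a binary present/absent judgment. The short sketch below applies that standard formula to the agreement percentages reported in the abstract; the function name `pabak` and the dictionary labels are illustrative choices, not part of the study.

```python
def pabak(observed_agreement: float, n_categories: int = 2) -> float:
    """Prevalence-adjusted, bias-adjusted kappa from the observed agreement proportion."""
    if not 0.0 <= observed_agreement <= 1.0:
        raise ValueError("observed_agreement must be a proportion in [0, 1]")
    if n_categories < 2:
        raise ValueError("need at least two rating categories")
    return (n_categories * observed_agreement - 1.0) / (n_categories - 1.0)


# Observed agreement proportions as reported in the abstract (binary judgments).
reported = {
    "any abnormality, cases": 0.749,
    "any abnormality, controls": 0.698,
    "crackles, cases": 0.706,
    "crackles, controls": 0.824,
    "wheeze, cases": 0.726,
    "wheeze, controls": 0.738,
}

for label, agreement in reported.items():
    print(f"{label}: PABAK = {pabak(agreement):.2f}")
```

Rounded to two decimals, these agree with the PABAK figures given alongside each percentage in the abstract.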
Sounds in everyday life seldom appear in isolation. Both humans and machines are constantly flooded with a cacophony of sounds that need to be sorted through and scoured for relevant information, a phenomenon referred to as the 'cocktail party problem'. A key component in parsing acoustic scenes is the role of attention, which mediates perception and behaviour by focusing both sensory and cognitive resources on pertinent information in the stimulus space. The current article provides a review of modelling studies of auditory attention. The review highlights how the term attention refers to a multitude of behavioural and cognitive processes that can shape sensory processing. Attention can be modulated by 'bottom-up' sensory-driven factors, as well as 'top-down' task-specific goals, expectations and learned schemas. Essentially, it acts as a selection process or processes that focus both sensory and cognitive resources on the most relevant events in the soundscape, with relevance being dictated by the stimulus itself (e.g. a loud explosion) or by a task at hand (e.g. listen to announcements in a busy airport). Recent computational models of auditory attention provide key insights into its role in facilitating perception in cluttered auditory scenes. This article is part of the themed issue 'Auditory and visual scene analysis'.
Studies of auditory scene analysis have traditionally relied on paradigms using artificial sounds (and conventional behavioral techniques) to elucidate how we perceptually segregate auditory objects or streams from each other. In the past few decades, however, there has been growing interest in uncovering the neural underpinnings of auditory segregation using human and animal neuroscience techniques, as well as computational modeling. This largely reflects the growth in the fields of cognitive neuroscience and computational neuroscience and has led to new theories of how the auditory system segregates sounds in complex arrays. The current review focuses on neural and computational studies of auditory scene perception published in the last few years. Following the progress that has been made in these studies, we describe (1) theoretical advances in our understanding of the most well-studied aspects of auditory scene perception, namely segregation of sequential patterns of sounds and concurrently presented sounds; (2) the diversification of topics and paradigms that have been investigated; and (3) how new neuroscience techniques (including invasive neurophysiology in awake humans, genotyping, and brain stimulation) have been used in this field.
Salience describes the phenomenon by which an object stands out from a scene. While its underlying processes are extensively studied in vision, mechanisms of auditory salience remain largely unknown. Previous studies have used well-controlled auditory scenes to shed light on some of the acoustic attributes that drive the salience of sound events. Unfortunately, the use of constrained stimuli in addition to a lack of well-established benchmarks of salience judgments hampers the development of comprehensive theories of sensory-driven auditory attention. The present study explores auditory salience in a set of dynamic natural scenes. A behavioral measure of salience is collected by having human volunteers listen to two concurrent scenes and indicate continuously which one attracts their attention. By using natural scenes, the study takes a data-driven rather than experimenter-driven approach to exploring the parameters of auditory salience. The findings indicate that the space of auditory salience is multidimensional (spanning loudness, pitch, spectral shape, as well as other acoustic attributes), nonlinear and highly context-dependent. Importantly, the results indicate that contextual information about the entire scene over both short and long scales needs to be considered in order to properly account for perceptual judgments of salience.
\textcopyright 2014 IEEE. Parsing natural acoustic scenes using computational methodologies poses many challenges. Given the rich and complex nature of the acoustic environment, data mismatch between train and test conditions is a major hurdle in data-driven audio processing systems. In contrast, the brain exhibits a remarkable ability to segment acoustic scenes with relative ease. When tackling challenging listening conditions that are often faced in everyday life, the biological system relies on a number of principles that allow it to effortlessly parse its rich soundscape. In the current study, we leverage a key principle employed by the auditory system: its ability to adapt the neural representation of its sensory input in a high-dimensional space. We propose a framework that mimics this process in a computational model for robust speech activity detection. The system employs a 2-D Gabor filter bank whose parameters are retuned offline to improve the separability between the feature representation of speech and nonspeech sounds. This retuning process, driven by feedback from statistical models of speech and nonspeech classes, attempts to minimize the misclassification risk of mismatched data, with respect to the original statistical models. We hypothesize that this risk minimization procedure results in an emphasis of unique speech and nonspeech modulations in the high-dimensional space. We show that such an adapted system is indeed robust to other novel conditions, with a marked reduction in equal error rates for a variety of databases with additive and convolutive noise distortions. We discuss the lessons learned from biology with regard to adapting to an ever-changing acoustic environment and the impact on building truly intelligent audio processing systems.
Behavioral and neural studies of selective attention have consistently demonstrated that explicit attentional cues to particular perceptual features profoundly alter perception and performance. The statistics of the sensory environment can also provide cues about what perceptual features to expect, but the extent to which these more implicit contextual cues impact perception and performance, as well as their relationship to explicit attentional cues, is not well understood. In this study, the explicit cues, or attentional prior probabilities, and the implicit cues, or contextual prior probabilities, associated with different acoustic frequencies in a detection task were simultaneously manipulated. Both attentional and contextual priors had similarly large but independent impacts on sound detectability, with evidence that listeners tracked and used contextual priors for a variety of sound classes (pure tones, harmonic complexes, and vowels). Further analyses showed that listeners updated their contextual priors rapidly and optimally, given the changing acoustic frequency statistics inherent in the paradigm. A Bayesian Observer model accounted for both attentional and contextual adaptations found with listeners. These results bolster the interpretation of perception as Bayesian inference, and suggest that some effects attributed to selective attention may be a special case of contextual prior integration along a feature axis.
GOAL Chest auscultation constitutes a portable low-cost tool widely used for respiratory disease detection. Though it offers a powerful means of pulmonary examination, it remains riddled with a number of issues that limit its diagnostic capability. Particularly, patient agitation (especially in children), background chatter, and other environmental noises often contaminate the auscultation, hence affecting the clarity of the lung sound itself. This paper proposes an automated multiband denoising scheme for improving the quality of auscultation signals against heavy background contaminations. METHODS The algorithm works on a simple two-microphone setup, dynamically adapts to the background noise and suppresses contaminations while successfully preserving the lung sound content. The proposed scheme is refined to balance maximal noise suppression against maintaining the integrity of the lung signal, particularly its unknown adventitious components that provide the most informative diagnostic value during lung pathology. RESULTS The algorithm is applied to digital recordings obtained in the field in a busy clinic in West Africa and evaluated using objective signal fidelity measures and perceptual listening tests performed by a panel of licensed physicians. A strong preference for the enhanced sounds is revealed. SIGNIFICANCE The strengths and benefits of the proposed method lie in the simple automated setup and its adaptive nature, both fundamental conditions for everyday clinical applicability. It can be simply extended to a real-time implementation, and integrated with lung sound acquisition protocols.
To navigate complex acoustic environments, listeners adapt neural processes to focus on behaviorally relevant sounds in the acoustic foreground while minimizing the impact of distractors in the background, an ability referred to as top-down selective attention. Particularly striking examples of attention-driven plasticity have been reported in primary auditory cortex via dynamic reshaping of spectro-temporal receptive fields (STRFs). By enhancing the neural response to features of the foreground while suppressing those to the background, STRFs can act as adaptive contrast matched filters that directly contribute to an improved cognitive segregation between behaviorally relevant and irrelevant sounds. In this study, we propose a novel discriminative framework for modeling attention-driven plasticity of STRFs in primary auditory cortex. The model describes a general strategy for cortical plasticity via an optimization that maximizes discriminability between the foreground and distractors while maintaining a degree of stability in the cortical representation. The first instantiation of the model describes a form of feature-based attention and yields STRF adaptation patterns consistent with a contrast matched filter previously reported in neurophysiological studies. An extension of the model captures a form of object-based attention, where top-down signals act on an abstracted representation of the sensory input characterized in the modulation domain. The object-based model makes explicit predictions in line with limited neurophysiological data currently available but can be readily evaluated experimentally. Finally, we draw parallels between the model and anatomical circuits reported to be engaged during active attention. The proposed model strongly suggests an interpretation of attention-driven plasticity as a discriminative adaptation operating at the level of sensory cortex, in line with similar strategies previously described across different sensory modalities.
One of the hallmarks of sound processing in the brain is the ability of the nervous system to adapt to changing behavioral demands and surrounding soundscapes. It can dynamically shift sensory and cognitive resources to focus on relevant sounds. Neurophysiological studies indicate that this ability is supported by adaptively retuning the shapes of cortical spectro-temporal receptive fields (STRFs) to enhance features of target sounds while suppressing those of task-irrelevant distractors. Because an important component of human communication is the ability of a listener to dynamically track speech in noisy environments, the solution obtained by auditory neurophysiology implies a useful adaptation strategy for speech activity detection (SAD). SAD is an important first step in a number of automated speech processing systems, and performance is often reduced in highly noisy environments. In this paper, we describe how task-driven adaptation is induced in an ensemble of neurophysiological STRFs, and show how speech-adapted STRFs reorient themselves to enhance spectro-temporal modulations of speech while suppressing those associated with a variety of nonspeech sounds. We then show how an adapted ensemble of STRFs can better detect speech in unseen noisy environments compared to an unadapted ensemble and a noise-robust baseline. Finally, we use a stimulus reconstruction task to demonstrate how the adapted STRF ensemble better captures the spectro-temporal modulations of attended speech in clean and noisy conditions. Our results suggest that a biologically plausible adaptation framework can be applied to speech processing systems to dynamically adapt feature representations for improving noise robustness.
The identity of musical instruments is reflected in the acoustic attributes of musical notes played with them. Recently, it has been argued that these characteristics of musical identity (or timbre) can be best captured through an analysis that encompasses both time and frequency domains, with a focus on the modulations or changes in the signal in the spectrotemporal space. This representation mimics the spectrotemporal receptive field (STRF) analysis believed to underlie processing in the central mammalian auditory system, particularly at the level of primary auditory cortex. How well this STRF representation captures the timbral identity of musical instruments in continuous solo recordings remains unclear. The current work investigates the applicability of the STRF feature space for instrument recognition in solo musical phrases and explores best approaches to leveraging knowledge from isolated musical notes for instrument recognition in solo recordings. The study presents an approach for parsing solo performances into their individual note constituents and adapting back-end classifiers using support vector machines to achieve a generalization of instrument recognition to off-the-shelf, commercially available solo music.
Listeners' ability to discriminate unfamiliar voices is often susceptible to the effects of manipulations of acoustic characteristics of the utterances. This vulnerability was quantified within a task in which participants determined if two utterances were spoken by the same or different speakers. Results of this task were analyzed in relation to a set of historical and novel parameters in order to hypothesize the role of those parameters in the decision process. Listener performance was first measured in a baseline task with unmodified stimuli, and then compared to responses with resynthesized stimuli under three conditions: (1) normalized mean-pitch; (2) normalized duration; and (3) normalized linear predictive coefficients (LPCs). The results of these experiments suggest that perceptual speaker discrimination is robust to acoustic changes, though mean-pitch and LPC modifications are more detrimental to a listener's ability to successfully identify same or different speaker pairings. However, this susceptibility was also found to be partially dependent on the specific speaker and utterances.
A new approach for the segregation of monaural sound mixtures is presented based on the principle of temporal coherence and using auditory cortical representations. Temporal coherence is the notion that perceived sources emit coherently modulated features that evoke highly-coincident neural response patterns. By clustering the feature channels with coincident responses and reconstructing their input, one may segregate the underlying source from the simultaneously interfering signals that are uncorrelated with it. The proposed algorithm requires no prior information or training on the sources. It can, however, gracefully incorporate cognitive functions and influences such as memories of a target source or attention to a specific set of its attributes so as to segregate it from its background. Aside from its unusual structure and computational innovations, the proposed model provides testable hypotheses of the physiological mechanisms of this ubiquitous and remarkable perceptual ability, and of its psychophysical manifestations in navigating complex sensory environments.
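To make the temporal coherence idea concrete, the sketch below groups feature channels whose responses are strongly correlated over time with a chosen anchor channel. It is a toy illustration under stated assumptions, not the algorithm proposed in the abstract: the Pearson correlation measure, the 0.8 threshold, the anchor-based grouping, and the names `pearson` and `coherent_channels` are all hypothetical choices.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length response traces."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # a constant trace carries no modulation to correlate
    return cov / (norm_x * norm_y)


def coherent_channels(
    responses: Sequence[Sequence[float]], anchor: int, threshold: float = 0.8
) -> List[int]:
    """Indices of channels whose responses co-vary with the anchor channel."""
    return [
        i
        for i, trace in enumerate(responses)
        if pearson(trace, responses[anchor]) >= threshold
    ]


# Toy scene: channels 0 and 2 follow one modulation, channels 1 and 3 another.
source_a = [1.0 + math.sin(2 * math.pi * 4 * t / 100) for t in range(100)]
source_b = [1.0 + math.sin(2 * math.pi * 7 * t / 100) for t in range(100)]
responses = [source_a, source_b, [0.5 * v for v in source_a], [2.0 * v for v in source_b]]

print(coherent_channels(responses, anchor=0))
print(coherent_channels(responses, anchor=1))
```

Choosing the anchor channel is one simple way to mimic attention to a particular attribute of the target; a fuller model would also reconstruct the source from the selected channels.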
Humans routinely segregate a complex acoustic scene into different auditory streams, through the extraction of bottom-up perceptual cues and the use of top-down selective attention. To determine the neural mechanisms underlying this process, neural responses obtained through magnetoencephalography (MEG) were correlated with behavioral performance in the context of an informational masking paradigm. In half the trials, subjects were asked to detect frequency deviants in a target stream, consisting of a rhythmic tone sequence, embedded in a separate masker stream composed of a random cloud of tones. In the other half of the trials, subjects were exposed to identical stimuli but asked to perform a different task: to detect tone-length changes in the random cloud of tones. In order to verify that the normalized neural response to the target sequence served as an indicator of streaming, we correlated neural responses with behavioral performance under a variety of stimulus parameters (target tone rate, target tone frequency, and the "protection zone", that is, the spectral area with no tones around the target frequency) and attentional states (changing task objective while maintaining the same stimuli). In all conditions that facilitated target/masker streaming behaviorally, MEG normalized neural responses also changed in a manner consistent with the behavior. Thus, attending to the target stream caused a significant increase in power and phase coherence of the responses in recording channels correlated with an increase in the behavioral performance of the listeners. Normalized neural target responses also increased as the protection zone widened and as the frequency of the target tones increased. Finally, when the target sequence rate increased, the buildup of the normalized neural responses was significantly faster, mirroring the accelerated buildup of the streaming percepts. Our data thus support close links between the perceptual and neural consequences of auditory stream segregation.
Bottom-up attention is a sensory-driven selection mechanism that directs perception toward a subset of the stimulus that is considered salient, or attention-grabbing. Most studies of bottom-up auditory attention have adapted frameworks similar to visual attention models whereby local or global "contrast" is a central concept in defining salient elements in a scene. In the current study, we take a more fundamental approach to modeling auditory attention, providing the first examination of the space of auditory saliency spanning pitch, intensity and timbre, and shedding light on complex interactions among these features. Informed by psychoacoustic results, we develop a computational model of auditory saliency implementing a novel attentional framework, guided by processes hypothesized to take place in the auditory pathway. In particular, the model tests the hypothesis that perception tracks the evolution of sound events in a multidimensional feature space, and flags any deviation from background statistics as salient. Predictions from the model corroborate the relationship between bottom-up auditory attention and statistical inference, and argue for a potential role of predictive coding as a mechanism for saliency detection in acoustic scenes.
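The hypothesis that salience corresponds to deviation from background statistics can be illustrated with a deliberately simple, single-feature sketch. It tracks an exponentially weighted running mean and variance of one feature (here a toy pitch track) and flags samples whose deviation exceeds a z-score threshold. This is not the model described in the abstract, which operates on a multidimensional feature space; the function name `salient_events`, the forgetting factor `alpha`, the threshold, and the warm-up length are all assumptions.

```python
from typing import List, Sequence


def salient_events(
    feature: Sequence[float],
    alpha: float = 0.1,
    z_threshold: float = 3.0,
    warmup: int = 5,
) -> List[int]:
    """Flag samples that deviate strongly from running background statistics."""
    mean = feature[0]
    var = 0.0
    flagged: List[int] = []
    for t in range(1, len(feature)):
        deviation = feature[t] - mean
        std = var ** 0.5
        # Test the new sample against the background estimated so far.
        if t >= warmup and std > 0.0 and abs(deviation) / std > z_threshold:
            flagged.append(t)
        # Exponentially weighted update of the background mean and variance.
        mean += alpha * deviation
        var = (1.0 - alpha) * (var + alpha * deviation ** 2)
    return flagged


# Toy pitch track (Hz): a steady alternation with one abrupt jump at index 20.
pitch = [200.0, 202.0] * 10 + [260.0] + [200.0, 202.0] * 5
print(salient_events(pitch))
```

Because the flagged sample is also folded into the background estimate, a sustained change stops being salient after a few samples, which loosely mirrors adaptation to a new context.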
Purpose: Lung auscultation has long been a standard of care for the diagnosis of respiratory diseases. Recent advances in electronic auscultation and signal processing have yet to find clinical acceptance; however, computerized lung sound analysis may be ideal for pediatric populations in settings where skilled healthcare providers are commonly unavailable. We described features of normal lung sounds in young children using a novel signal processing approach to lay a foundation for identifying pathologic respiratory sounds. Methods: 186 healthy children with normal pulmonary exams and without respiratory complaints were enrolled at a tertiary care hospital in Lima, Peru. Lung sounds were recorded at eight thoracic sites using a digital stethoscope. 151 (81 \%) of the recordings were eligible for further analysis. Heavy-crying segments were automatically rejected and features extracted from spectral and temporal signal representations contributed to profiling of lung sounds. Results: Mean age, height, and weight among study participants were 2.2 years (SD 1.4), 84.7 cm (SD 13.2), and 12.0 kg (SD 3.6), respectively, and 47 \% were boys. We identified ten distinct spectral and spectro-temporal signal parameters and most demonstrated linear relationships with age, height, and weight, while no differences between genders were noted. Older children had a faster decaying spectrum than younger ones. Features like spectral peak width, lower-frequency Mel-frequency cepstral coefficients, and spectro-temporal modulations also showed variations with recording site. Conclusions: Lung sound extracted features varied significantly with child characteristics and lung site. A comparison with adult studies revealed differences in the extracted features for children. While sound-reduction techniques will improve analysis, we offer a novel, reproducible tool for sound analysis in real-world environments.
Selecting pertinent events in the cacophony of sounds that impinge on our ears every day is regulated by the acoustic salience of sounds in the scene as well as their behavioral relevance as dictated by top-down task-dependent demands. The current study aims to explore the neural signature of both facets of attention, as well as their possible interactions in the context of auditory scenes. Using a paradigm with dynamic auditory streams with occasional salient events, we recorded neurophysiological responses of human listeners using EEG while manipulating the subjects' attentional state as well as the presence or absence of a competing auditory stream. Our results showed that salient events caused an increase in the auditory steady-state response (ASSR) irrespective of attentional state or complexity of the scene. Such an increase supplemented ASSR increases due to task-driven attention. Salient events also evoked a strong N1 peak in the ERP response when listeners were attending to the target sound stream, accompanied by an MMN-like component in some cases and changes in the P1 and P300 components under all listening conditions. Overall, bottom-up attention induced by a salient change in the auditory stream appears to mostly modulate the amplitude of the steady-state response and certain event-related potentials to salient sound events, though this modulation is affected by top-down attentional processes and the prominence of these events in the auditory scene as well.
Humans are quite adept at communicating in the presence of noise. However, most speech processing systems, like automatic speech and speaker recognition systems, suffer from a significant drop in performance when speech signals are corrupted with unseen background distortions. The proposed work explores the use of a biologically-motivated multi-resolution spectral analysis for speech representation. This approach focuses on the information-rich spectral attributes of speech and presents an intricate yet computationally-efficient analysis of the speech signal by careful choice of model parameters. Further, the approach takes advantage of an information-theoretic analysis of the message and speaker dominant regions in the speech signal, and defines feature representations to address two diverse tasks: speech recognition and speaker recognition. The proposed analysis surpasses the standard Mel-Frequency Cepstral Coefficients (MFCC) and its enhanced variants (via mean subtraction, variance normalization and time sequence filtering), and yields significant improvements over a state-of-the-art noise robust feature scheme on both speech and speaker recognition tasks.
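As an illustration of two of the MFCC enhancements named above (mean subtraction and variance normalization), here is a minimal NumPy sketch. The function name `cmvn`, the per-utterance normalization, and the `eps` guard are assumptions made for this example; the abstract does not specify how the baseline was implemented.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance mean and variance normalization (illustrative sketch).

    features: array of shape (n_frames, n_coeffs), e.g. cepstral coefficients.
    Returns an array of the same shape whose columns have roughly zero mean
    and unit standard deviation.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0, keepdims=True)   # mean subtraction term
    std = features.std(axis=0, keepdims=True)     # variance normalization term
    return (features - mean) / (std + eps)        # eps avoids division by zero

# Toy usage with random stand-in "cepstral" features (hypothetical data).
rng = np.random.default_rng(0)
toy = rng.normal(loc=3.0, scale=2.0, size=(200, 13))
normalized = cmvn(toy)
```

Normalizing each coefficient over the utterance removes a constant offset, which is one common motivation for applying it to cepstral features before recognition.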
The processing characteristics of neurons in the central auditory system are directly shaped by and reflect the statistics of natural acoustic environments, but the principles that govern the relationship between natural sound ensembles and observed responses in neurophysiological studies remain unclear. In particular, accumulating evidence suggests the presence of a code based on sustained neural firing rates, where central auditory neurons exhibit strong, persistent responses to their preferred stimuli. Such a strategy can indicate the presence of ongoing sounds, is involved in parsing complex auditory scenes, and may play a role in matching neural dynamics to varying time scales in acoustic signals. In this paper, we describe a computational framework for exploring the influence of a code based on sustained firing rates on the shape of the spectro-temporal receptive field (STRF), a linear kernel that maps a spectro-temporal acoustic stimulus to the instantaneous firing rate of a central auditory neuron. We demonstrate the emergence of richly structured STRFs that capture the structure of natural sounds over a wide range of timescales, and show how the emergent ensembles resemble those commonly reported in physiological studies. Furthermore, we compare ensembles that optimize a sustained firing code with one that optimizes a sparse code, another widely considered coding strategy, and suggest how the resulting population responses are not mutually exclusive. Finally, we demonstrate how the emergent ensembles contour the high-energy spectro-temporal modulations of natural sounds, forming a discriminative representation that captures the full range of modulation statistics that characterize natural sound ensembles. These findings have direct implications for our understanding of how sensory systems encode the informative components of natural stimuli and potentially facilitate multi-sensory integration.
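The definition of the STRF used above, a linear kernel that maps a spectro-temporal stimulus to an instantaneous firing rate, can be written as r(t) = sum over f and tau of h(f, tau) s(f, t - tau). The sketch below is only an illustration of that linear mapping; the function name `strf_response`, the array layouts, and the toy kernel are assumptions, and it does not implement the sustained-firing optimization described in the abstract.

```python
import numpy as np

def strf_response(strf, spectrogram):
    """Linear STRF prediction (illustrative sketch).

    strf:        array (n_freq, n_lags), kernel h(f, tau) with tau >= 0.
    spectrogram: array (n_freq, n_times), stimulus s(f, t).
    Returns r(t) of shape (n_times,): sum_f sum_tau h(f, tau) * s(f, t - tau).
    """
    strf = np.asarray(strf, dtype=float)
    spectrogram = np.asarray(spectrogram, dtype=float)
    n_freq, _ = strf.shape
    n_times = spectrogram.shape[1]
    rate = np.zeros(n_times)
    for f in range(n_freq):
        # Causal convolution along time, truncated to the stimulus length.
        rate += np.convolve(spectrogram[f], strf[f], mode="full")[:n_times]
    return rate

# Toy usage: a kernel that responds to channel 1 with a delay of 2 samples.
toy_strf = np.zeros((3, 4))
toy_strf[1, 2] = 1.0
toy_stim = np.zeros((3, 10))
toy_stim[1, 5] = 1.0          # brief energy in channel 1 at time index 5
toy_rate = strf_response(toy_strf, toy_stim)
```

In this toy case the predicted rate is nonzero only at time index 7, the stimulus time plus the kernel delay.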
There is strong neurophysiological evidence suggesting that processing of speech signals in the brain happens along parallel paths which encode complementary information in the signal. These parallel streams are organized around a duality of slow vs. fast: Coarse signal dynamics appear to be processed separately from rapidly changing modulations both in the spectral and temporal dimensions. We adapt such duality in a multistream framework for robust speaker-independent phoneme recognition. The scheme presented here centers around a multi-path bandpass modulation analysis of speech sounds with each stream covering an entire range of temporal and spectral modulations. By performing bandpass operations along the spectral and temporal dimensions, the proposed scheme avoids the classic feature explosion problem of previous multistream approaches while maintaining the advantage of parallelism and localized feature analysis. The proposed architecture results in substantial improvements over standard and state-of-the-art feature schemes for phoneme recognition, particularly in presence of nonstationary noise, reverberation and channel distortions.
Humans and other animals can attend to one of multiple sounds, and follow it selectively over time. The neural underpinnings of this perceptual feat remain mysterious. Some studies have concluded that sounds are heard as separate streams when they activate well-separated populations of central auditory neurons, and that this process is largely pre-attentive. Here, we propose instead that stream formation depends primarily on temporal coherence between responses that encode various features of a sound source. Furthermore, we postulate that only when attention is directed toward a particular feature (e.g., pitch or location) do all other temporally coherent features of that source (e.g., timbre and location) become bound together as a stream that is segregated from the incoherent features of other sources. Experimental neurophysiological evidence in support of this hypothesis will be presented. The focus, however, will be on a computational realization of this idea and a discussion of the insights learned from simulations to disentangle complex sound sources such as speech and music. The model consists of a representational stage of early and cortical auditory processing that creates a multidimensional depiction of various sound attributes such as pitch, location, and spectral resolution. The following stage computes a coherence matrix that summarizes the pair-wise correlations between all channels making up the cortical representation. Finally, the perceived segregated streams are extracted by decomposing the coherence matrix into its uncorrelated components. Questions raised by the model are discussed, especially on the role of attention in streaming and the search for further neural correlates of streaming percepts.
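The last two stages described above (a coherence matrix of pair-wise channel correlations, followed by a decomposition into uncorrelated components) can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the published model: the function name `coherence_streams`, the zero-lag covariance used as the coherence measure, the eigendecomposition used as the decomposition step, and the synthetic channel responses are all choices made for this example.

```python
import numpy as np

def coherence_streams(responses):
    """Coherence matrix and its uncorrelated components (illustrative sketch).

    responses: array (n_channels, n_times) of channel outputs over time.
    Returns (coherence, eigenvalues, eigenvectors) with components sorted
    from largest to smallest eigenvalue.
    """
    responses = np.asarray(responses, dtype=float)
    centered = responses - responses.mean(axis=1, keepdims=True)
    # Zero-lag pair-wise covariance between all channels.
    coherence = centered @ centered.T / responses.shape[1]
    # The matrix is symmetric, so eigh applies; it returns ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(coherence)
    order = np.argsort(eigenvalues)[::-1]
    return coherence, eigenvalues[order], eigenvectors[:, order]

# Toy scene: channels 0-1 follow one on/off envelope, channels 2-3 the
# alternating envelope, as in a two-tone streaming sequence.
t = np.arange(400)
env_a = (np.sin(2 * np.pi * t / 40) > 0).astype(float)
env_b = 1.0 - env_a
toy_responses = np.stack([1.0 * env_a, 0.8 * env_a, 1.0 * env_b, 0.6 * env_b])

_, toy_vals, toy_vecs = coherence_streams(toy_responses)
# The sign pattern of the leading component splits the channels in two groups.
toy_mask = toy_vecs[:, 0] > 0
```

In this toy case the channels that switch on and off together receive the same sign in the leading component, while the alternating channels receive the opposite sign, which gives a simple two-group partition.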
Music is a complex acoustic experience that we often take for granted. Whether sitting at a symphony hall or enjoying a melody over earphones, we have no difficulty identifying the instruments playing, following various beats, or simply distinguishing a flute from an oboe. Our brains rely on a number of sound attributes to analyze the music in our ears. These attributes can be straightforward like loudness or quite complex like the identity of the instrument. A major contributor to our ability to recognize instruments is what is formally called 'timbre'. Of all perceptual attributes of music, timbre remains the most mysterious and least amenable to a simple mathematical abstraction. In this work, we examine the neural underpinnings of musical timbre in an attempt to both define its perceptual space and explore the processes underlying timbre-based recognition. We propose a scheme based on responses observed at the level of mammalian primary auditory cortex and show that it can accurately predict sound source recognition and perceptual timbre judgments by human listeners. The analyses presented here strongly suggest that rich representations such as those observed in auditory cortex are critical in mediating timbre percepts.
INTRODUCTION: The WHO case management algorithm for paediatric pneumonia relies solely on symptoms of shortness of breath or cough and tachypnoea for treatment; it has poor diagnostic specificity and tends to increase antibiotic resistance. Alternatives, including oxygen saturation measurement, chest ultrasound and chest auscultation, exist but with potential disadvantages. Electronic auscultation has potential for improved detection of paediatric pneumonia but has yet to be standardised. The authors aim to investigate the use of electronic auscultation to improve the specificity of the current WHO algorithm in developing countries. METHODS: This study is designed to test the hypothesis that pulmonary pathology can be differentiated from normal using computerised lung sound analysis (CLSA). The authors will record lung sounds from 600 children aged ≤5 years, 100 each with consolidative pneumonia, diffuse interstitial pneumonia, asthma, bronchiolitis, upper respiratory infections and normal lungs at a children's hospital in Lima, Peru. The authors will compare CLSA with the WHO algorithm and other detection approaches, including physical exam findings, chest ultrasound and microbiologic testing to construct an improved algorithm for pneumonia diagnosis. DISCUSSION: This study will develop standardised methods for electronic auscultation and chest ultrasound and compare their utility for detection of pneumonia to standard approaches. Utilising signal processing techniques, the authors aim to characterise lung sounds and, through machine learning, develop a classification system to distinguish pathologic sounds. Data will allow a better understanding of the benefits and limitations of novel diagnostic techniques in paediatric pneumonia.
Humans exhibit a remarkable ability to reliably classify sound sources in the environment even in the presence of high levels of noise. In contrast, most engineering systems suffer a drastic drop in performance when speech signals are corrupted with channel or background distortions. Our brains are equipped with elaborate machinery for speech analysis and feature extraction, which holds great lessons for improving the performance of automatic speech processing systems under adverse conditions. The work presented here explores a biologically-motivated multi-resolution speaker information representation obtained by performing an intricate yet computationally-efficient analysis of the information-rich spectro-temporal attributes of the speech signal. We evaluate the proposed features in a speaker verification task performed on NIST SRE 2010 data. The biomimetic approach yields significant robustness in the presence of non-stationary noise and reverberation, offering a new framework for deriving reliable features for speaker recognition and speech processing. © 2012 Nemala et al.; licensee Springer.
Cochlear implant (CI) users demonstrate severe limitations in perceiving musical timbre, a psychoacoustic feature of sound responsible for 'tone color' and one's ability to identify a musical instrument. The reasons for this limitation remain poorly understood. In this study, we sought to examine the relative contributions of temporal envelope and fine structure for timbre judgments, in light of the fact that speech processing strategies employed by CI systems typically employ envelope extraction algorithms. We synthesized "instrumental chimeras" that systematically combined variable amounts of envelope and fine structure in 25% increments from two different source instruments with either sustained or percussive envelopes. CI users and normal hearing (NH) subjects were presented with 150 chimeras and asked to determine which instrument the chimera more closely resembled in a single-interval two-alternative forced choice task. By combining instruments with similar and dissimilar envelopes, we controlled the valence of envelope for timbre identification and compensated for envelope reconstruction from fine structure information. Our results show that NH subjects utilize envelope and fine structure interchangeably, whereas CI subjects demonstrate overwhelming reliance on temporal envelope. When chimeras were created from dissimilar envelope instrument pairs, NH subjects utilized a combination of envelope (p = 0.008) and fine structure information (p = 0.009) to make timbre judgments. In contrast, CI users utilized envelope information almost exclusively to make timbre judgments (p < 0.001) and ignored fine structure information (p = 0.908). Interestingly, when the value of envelope as a cue was reduced, both NH subjects and CI users utilized fine structure information to make timbre judgments (p < 0.001), although the effect was quite weak in CI users. Our findings confirm that impairments in fine structure processing underlie poor perception of musical timbre in CI users.
Humans and other animals can attend to one of multiple sounds and follow it selectively over time. The neural underpinnings of this perceptual feat remain mysterious. Some studies have concluded that sounds are heard as separate streams when they activate well-separated populations of central auditory neurons, and that this process is largely pre-attentive. Here, we argue instead that stream formation depends primarily on temporal coherence between responses that encode various features of a sound source. Furthermore, we postulate that only when attention is directed towards a particular feature (e.g. pitch) do all other temporally coherent features of that source (e.g. timbre and location) become bound together as a stream that is segregated from the incoherent features of other sources.
Processing of complex acoustic scenes depends critically on the temporal integration of sensory information as sounds evolve naturally over time. It has been previously speculated that this process is guided by both innate mechanisms of temporal processing in the auditory system, as well as top-down mechanisms of attention and possibly other schema-based processes. In an effort to unravel the neural underpinnings of these processes and their role in scene analysis, we combine magnetoencephalography (MEG) with behavioral measures in humans in the context of polyrhythmic tone sequences. While maintaining unchanged sensory input, we manipulate subjects' attention to one of two competing rhythmic streams in the same sequence. The results reveal that the neural representation of the attended rhythm is significantly enhanced in both its steady-state power and spatial phase coherence relative to its unattended state, closely correlating with its perceptual detectability for each listener. Interestingly, the data reveal a differential efficiency of rhythmic rates on the order of a few hertz during the streaming process, closely following known neural and behavioral measures of temporal modulation sensitivity in the auditory system. These findings establish a direct link between known temporal modulation tuning in the auditory system (particularly at the level of auditory cortex) and the temporal integration of perceptual features in a complex acoustic scene, as mediated by processes of attention.
Attention is essential for navigating complex acoustic scenes, when the listener seeks to extract a foreground source while suppressing background acoustic clutter. This study explored the neural correlates of this perceptual ability by measuring rapid changes of spectrotemporal receptive fields (STRFs) in primary auditory cortex during detection of a target tone embedded in noise. Compared with responses in the passive state, STRF gain decreased during task performance in most cells. By contrast, STRF shape changes were excitatory and specific, and were strongest in cells with best frequencies near the target tone. The net effect of these adaptations was to accentuate the representation of the target tone relative to the noise by enhancing responses of near-target cells to the tone during high-signal-to-noise ratio (SNR) tasks while suppressing responses of far-from-target cells to the masking noise in low-SNR tasks. These adaptive STRF changes were largest in high-performance sessions, confirming a close correlation with behavior.

This research is funded by: