Research Project

Towards robust speech processing systems

An indispensable processing element of most existing speech recognizers is the part that estimates the probabilities of speech sounds given the input signal.

Current speech processing systems are powerful but fragile. They work well as long as they are applied in the domains for which they were trained, but break easily when they encounter unexpected situations. This lack of robustness severely limits their acceptance by a wider user community. The reason for this weakness is that the dominant machine learning approaches derive their power from large amounts of training data that aim to cover all expected sources of harmful variability. The Achilles heel of such approaches becomes apparent when a machine encounters data that were not anticipated during its training.

Adaptation during recognition is a step towards addressing this fundamental weakness of machine learning. One biologically consistent way of identifying unexpected data items and adapting a classifier is inspired by the hierarchical parallel architecture of the human auditory perceptual system. The general concept of this engineering approach is shown in the figure. The processing capitalizes on redundancies in the coding of speech information. Different parallel processing streams attend to different aspects of the redundantly coded information and can, in principle, be influenced differently by prior information about the problem at hand. Unexpected inputs may affect only some of the streams, and the remaining reliable streams can still be used in further processing. While all elements of our scheme contribute to successful performance, the current critical element is the performance monitoring module, which indicates the corrupted streams. Since the correct output is not known during recognition, the module needs to rely on general characteristics of the classifier outputs that indicate corruption. Further, it needs to provide for efficient fusion of the information from the reliable streams. As in most practical applications, the processing needs to be reasonably fast to minimize the algorithmic delay of the system. We aim at systems with a human-like ability to deal with new, previously unseen data. While our research is in machine recognition of speech, its results and implications may also find use in other areas of machine learning.
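The multi-stream idea described above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the project's actual method: the text does not say which characteristic of the classifier outputs the monitoring module uses, so the sketch assumes mean per-frame entropy of the posteriors as a stand-in confidence measure, and simple averaging as the fusion rule. The stream names, the threshold value, and the functions `mean_entropy`, `select_reliable`, and `fuse` are all hypothetical.

```python
import math


def mean_entropy(posteriors):
    """Average per-frame entropy (in nats) of one stream's class posteriors.

    Assumption: low entropy is taken as a sign of a confident, uncorrupted
    stream. This needs no knowledge of the correct output.
    """
    total = 0.0
    for frame in posteriors:
        total -= sum(p * math.log(p) for p in frame if p > 0.0)
    return total / len(posteriors)


def select_reliable(streams, max_entropy):
    """Return the names of streams whose mean entropy is within the threshold."""
    return [name for name, post in streams.items()
            if mean_entropy(post) <= max_entropy]


def fuse(streams, reliable):
    """Average the posteriors of the reliable streams, frame by frame."""
    if not reliable:
        reliable = list(streams)  # fall back to using every stream
    first = streams[reliable[0]]
    n_frames, n_classes = len(first), len(first[0])
    return [
        [sum(streams[name][t][k] for name in reliable) / len(reliable)
         for k in range(n_classes)]
        for t in range(n_frames)
    ]


# Toy posteriors: 2 frames, 3 speech-sound classes per stream (made-up values).
streams = {
    "stream_a": [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10]],
    "stream_b": [[0.80, 0.10, 0.10], [0.20, 0.70, 0.10]],
    "stream_c": [[0.34, 0.33, 0.33], [0.30, 0.40, 0.30]],  # nearly uniform
}

reliable = select_reliable(streams, max_entropy=0.9)  # hypothetical threshold
fused = fuse(streams, reliable)
print(reliable)
print(fused)
```

In this toy example the nearly uniform stream is flagged as unreliable and left out of the fusion, while the two confident streams are averaged. A practical monitor would likely need a measure that also catches streams that are confidently wrong, which entropy alone does not.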
