Projects

Integration of visual and auditory information to detect aggression

A computational model of sound recognition

Linking physical processes to signal components

Applications of continuity preserving acoustic signal processing

Machine analysis and diagnostics

Automatic keyword spotting

Publications

Niessen, M.E., Van Maanen, L., Andringa, T.C. (2008). Disambiguating sound through context. International Journal of Semantic Computing 2(3), 327-341.
(abstract) (pdf)
Electronic version of an article published as doi:10.1142/S1793351X08000506 © [copyright World Scientific Publishing Company]
A central problem in automatic sound recognition is the mapping between low-level audio features and the meaningful content of an auditory scene. We propose a dynamic network model to perform this mapping. In acoustics, much research is devoted to low-level perceptual abilities such as audio feature extraction and grouping, which are translated into successful signal processing techniques. However, little work is done on modeling knowledge and context in sound recognition, although this information is necessary to identify a sound event rather than to separate its components from a scene. We first investigate the role of context in human sound identification in a simple experiment. Then we show that the use of knowledge in a dynamic network model can improve automatic sound identification by reducing the search space of the low-level audio features. Furthermore, context information dissolves ambiguities that arise from multiple interpretations of one sound event.
Krijnders, J.D., Andringa, T.C. (2008). Demonstration of online auditory scene analysis. Accepted for BNAIC 2008.
(abstract) (pdf)
We show an online system for auditory scene analysis. Auditory scene analysis is the analysis of complex sounds as they occur in natural settings. The analysis is based on models of human auditory processing. Our system includes models of the human ear, analysis in tones and pulses, and grouping algorithms. This system forms the basis for several sound recognition tasks, such as aggression detection and vowel recognition.
Van Elburg, R.A.J., Van Ooyen, A. (2008). Generalization of the event-based Carnevale-Hines integration scheme for integrate-and-fire models. Submitted.
(abstract) (preprint)
An event-based integration scheme for an integrate-and-fire neuron model with exponentially decaying excitatory synaptic currents and double exponential inhibitory synaptic currents has recently been introduced by Carnevale and Hines. This integration scheme imposes non-physiological constraints on the time constants of the synaptic currents it attempts to model which hamper the general applicability. This paper addresses this problem in two ways. First, we provide physical arguments to show why these constraints on the time constants can be relaxed. Second, we give a formal proof showing which constraints can be abolished. This proof rests on a generalization of the Carnevale-Hines lemma, which is a new tool for comparing double exponentials as they naturally occur in many cascaded decay systems including receptor-neurotransmitter dissociation followed by channel closing. We show that this lemma can be generalized and subsequently used for lifting most of the original constraints on the time constants. Thus we show that the Carnevale-Hines integration scheme for the integrate-and-fire model can be employed for simulating a much wider range of neuron and synapse type combinations than is apparent from the original treatment.
Niessen, M.E., Van Maanen, L., Andringa, T.C. (2008). Disambiguating sounds through context. Proceedings of IEEE ICSC, 88-95.
(abstract) (pdf)
A central problem in automatic sound recognition is the mapping between low-level audio features and the meaningful content of an auditory scene. We propose a dynamic network model to perform this mapping. In acoustics, much research has been devoted to low-level perceptual abilities such as audio feature extraction and grouping, which have been translated into successful signal processing techniques. However, little work is done on modeling knowledge and context in sound recognition, although this information is necessary to identify a sound event rather than to separate its components from a scene. We first investigate the role of context in human sound identification in a simple experiment. Then we show that the use of knowledge in a dynamic network model can improve automatic sound identification, by reducing the search space of the low-level audio features. Furthermore, context information dissolves ambiguities that arise from multiple interpretations of one sound event.
Vooren, H. van de, Violanda, R.R., Andringa, T.C. (2008). Robust harmonic grouping by octave error correction. Acoustic-08 Paris.
(abstract)
Harmonic grouping is a frequently applied technique in computational auditory scene analysis and automatic speech recognition systems. However, grouping is easily disrupted by noise and reverberation. For instance, a noise-induced signal component positioned roughly between two harmonics might undesirably be assigned to the harmonic complex (HC) as well. This results in an octave error: harmonics in an HC are assigned to harmonic numbers twice as high as the correct values. We propose a cost-function-based method to correct these octave errors. This function is designed to, on the one hand, improve the balance between odd and even harmonic numbers, and, on the other hand, minimize the number of signal components to be rejected. As a preprocessing step we applied short-time Fourier analysis to derive an instantaneous frequency representation from which we obtained the signal components. We used these as input for our harmonic grouping algorithm to obtain the HCs. Then we selected the optimal solution from the cost function and modified the composition of the HCs accordingly. As long as enough harmonics are sufficiently above the local noise level, this octave error correction mechanism works well for various sorts of harmonic sounds including speech.
Andringa, T.C. (2008). The texture of natural sounds. Acoustic-08 Paris.
(abstract)
The texture, a spectro-temporal pattern, of many sound sources is a robust and characteristic perceptual property that listeners use for sound source recognition. The robustness of the texture ensures that we can recognize sound sources like a helicopter, flowing water, or a surf breaking on pebbles in a wide variety of acoustic environments. This robustness suggests that textures can be used for automatic source identification or environment classification. We introduce a method to determine the presence of sound textures associated with, for example, flat noise, pulsed noises (helicopter), sweep-based textures (running water), and tonal noises (babble). The cumulative probability density of time-frequency fluctuations is matched with prototypical cumulative probability density functions (cpdf) with a running variant of the Kolmogorov-Smirnov test. Textures with a similar distribution as the target distribution contribute approximately equally to all values of the cpdf. The flatness of this distribution is used as a distance measure. When the pdfs of the target textures do not overlap strongly, the method can determine the texture of time-frequency regions as small as 100 ms by 6 semitones. This method can therefore also be used to determine the texture of the background.
Krijnders, J.D., Niessen, M.E., Andringa, T.C. (2008). A grouping approach to harmonic complexes. Acoustic-08 Paris.
(abstract)
Humans seem to perform sound-source separation for quasi-periodic sounds, such as speech, mostly on harmonicity cues. To model this function, most machine algorithms use a pitch-based approach to group the speech parts of the spectrum. In these methods the pitch is obtained either explicitly, in autocorrelation methods, or implicitly, as in harmonic sieves. If the estimation of pitch is wrong, the grouping will fail as well. In this paper we show a method that performs harmonic grouping without first calculating the pitch. Instead a pitch estimate is associated with each grouping hypothesis. Making the grouping independent of the pitch estimate makes it more robust in noisy settings. The algorithm obtains possible harmonics by tracking energy peaks in a cochleogram. Co-occurring harmonics are compared in terms of frequency difference. Grouping hypotheses are formed by combining harmonics with similar frequency differences. Consistency checks are performed on these hypotheses, and hypotheses with compatible properties are combined into harmonic complexes. Every harmonic complex is evaluated on the number of harmonics, the number of subsequent harmonics, and the presence of a harmonic at the pitch position. By using the number of subsequent harmonics, octave errors are prevented. Multiple concurrent harmonic complexes can be found as long as the spectral overlap is small.
Niessen, M.E., Van Elburg, R.A.J., Krijnders, J.D., Andringa, T.C. (2008). A computational model for auditory scene analysis. Acoustic-08 Paris.
(abstract)
Primitive auditory scene analysis (ASA) is based on intrinsic properties of the auditory environment. Acoustic features such as continuity and proximity in time or frequency cause perceptual grouping of acoustic elements. Various grouping attributes have been translated into successful signal processing techniques that may be used in source separation. A next step beyond primitive ASA is source identification through schema-based ASA. We present a computational model for ASA that is inspired by models from cognitive research. It dynamically builds a hierarchical network of hypotheses, which is based on (learned) knowledge of the sources. Each hypothesis in the network, initiated by bottom-up evidence, represents a possible sound event. The network is updated for each new input event, which may be any sound in an unconstrained environment. The analysis of new input events is guided by knowledge of the environment and previous events. As a result of this adaptive behavior, information about the environment increases and the set of possible hypotheses decreases. With this method of continuously improving sound event identification we make a promising advance in computational ASA of complex real-world environments.
Andringa, T.C., Grootel, M. van (2007). Predicting listeners' reports to environmental sounds. Proceedings of ICA 2007, ENV-09-005-IP.
(abstract) (pdf)
Spontaneous verbal descriptions of environmental sounds lead to a description of the contributing sound sources and the environments in which they occur. This is a form of perception that relies crucially on the rich structure of sounds, because only rich sounds can convey detailed information about individual sources and the transmission environment. This paper uses a semantic network with connection strengths derived from listener reports to represent the content of auditory scenes. The activity of the semantic network is based on a number of source specific cues for sounds such as birds, vehicles, speech, and footsteps. These cues are not based on spectral envelope and level, but on patterns in tones, pulses and noises that capture source specific structures. Generally the system performed in a similar way as a human listener in terms of concepts activated (or named) and the choice of the acoustic environment. The robustness and performance suggest the combination of a semantic network and source specific cues can be used to design systems for sound-based ambient awareness.
Zajdel, W., Krijnders, J.D., Andringa, T.C., Gavrila, D.M. (2007). Audio-video sensor fusion for aggression detection. Proceedings of AVSS 2007. Best paper award.
(abstract) (pdf)
This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is the exploitation of the complementary nature of audio and video sensing to disambiguate scene activity in real-life, noisy and dynamic environments. At the lower level, independent analysis of the audio and video streams yields intermediate descriptors of a scene such as "scream", "passing train" or "articulation energy". At the higher level, a Dynamic Bayesian Network is used as a fusion mechanism that produces an aggregate aggression indication for the current scene. Our prototype system is validated on a set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.
Hengel, P.W.J. van, Andringa, T.C. (2007). Verbal aggression detection in complex social environments. Proceedings of AVSS 2007.
(abstract) (pdf)
Niessen, M.E., Krijnders, J.D., Boers, J., Andringa, T.C. (2007). Assessing the reverberation level in speech. Proceedings of ICA 2007, CAS-03-020.
(abstract) (pdf)
The performance of automatic speech recognition (ASR) systems is seriously degraded in reverberant environments. We propose a method for assessing the reverberation level in speech that makes it possible to determine in real-time whether a speech signal is reverberant or not. Reverberation causes an increase in the variation of the energy and frequency of harmonics in speech. Speech with a variable pitch is especially affected by reverberation. To capture the effect of reverberation we measured six features on the harmonics of the speech signal, which represent energy and frequency variation in different ways. Speech from the Aurora database was artificially reverberated to demonstrate the validity of these features. Each feature predicted reverberation time for a different subset of the dataset. To test the overall separability of the speech samples using these features, speech from the dataset was automatically classified as being either inside or outside the reverberation radius. Most of the speech was correctly classified, which suggests that a reliable real-time classification algorithm can be developed to select good-quality speech. This algorithm can improve pre-processing methods, such as speech enhancement or voice activity detection, for more robust ASR.
Krijnders, J.D., Niessen, M.E., Andringa, T.C. (2007). Robust harmonic complex estimation in noise. Proceedings of ICA 2007, CAS-03-019. Best poster award at the BCN research school spring meeting.
(abstract) (pdf) (poster)
We present a new approach to robust pitch tracking of speech in noise. We transform the audio signal to the time-frequency domain using a transmission-line model of the human ear. The model output is leaky-integrated resulting in an energy measure, called a cochleogram. To select the speech parts of the cochleogram a correlogram, based on a running autocorrelation per channel, is used. In the correlogram possible pitch hypotheses are tracked through time and for every track a salience measure is calculated. The salience determines the importance of that track, which makes it possible to select tracks that belong to the signal while ignoring those of noise. The algorithm is shown to be more robust to the addition of pink noise than standard autocorrelation-based methods. An important property of the grouping is the selection of parts of the time-frequency plane that are highly likely to belong to the same source, which is impossible with conventional autocorrelation-based methods.
Lobanova, A., Spenader, J., Valkenier, B. (2007). Lexical and perceptual grounding of a sound ontology. Proceedings of TSD 2007, 180-187.
(abstract) (pdf)
Van Opzeeland, I., Slabbekoorn, H., Andringa, T.C., Ten Cate, C. (2007). Underwater racket: fish and noise pollution. De Levende Natuur 108, 39-43.
(abstract) (pdf)
Since the introduction of the Pollution of Surface Waters Act (Wet op de Verontreiniging van Oppervlaktewateren), the quality of surface water in the Netherlands has improved considerably. Through the Water Framework Directive (Kaderrichtlijn Water), work on better water quality now also takes place at the international level. However, one important aspect of water that has been neglected so far is the underwater sound level resulting from the sharp increase in shipping, recreation and construction work in and around Dutch surface waters. Almost nothing is known so far about the impact of this noise on underwater habitat quality.
Van Opzeeland, I., Slabbekoorn, H., Andringa, T., Ten Cate, C. (2007). Underwater noise pollution: effects on fishes. Visionair 4, 11-13.
(abstract) (pdf)
Almost every angler knows it: if you are too noisy at the waterside, the fish take off. Despite this piece of angler's wisdom, many people will not realize how important sound is to fish. Human activity on and around the surface waters of the Netherlands is increasing sharply. As a consequence, the underwater noise load will have increased correspondingly. Little is known about the effects of ambient sound on fish. This prompted Leiden University to devote an extensive literature study to the subject.
Andringa, T.C., Niessen, M.E. (2006). Real-world sound recognition: A recipe. Proceedings of LSAS 2006, 106-118.
(abstract) (pdf)
This article addresses the problem of recognizing acoustic events present in unconstrained input. We propose a novel approach to the processing of audio data which combines bottom-up hypothesis generation with top-down expectations, which, unlike standard pattern recognition techniques, can ensure that the representation of the input sound is physically realizable. Our approach gradually enriches low-level signal descriptors, based on Continuity Preserving Signal Processing, with more and more abstract interpretations along a hierarchy of description levels. This process is guided by top-down knowledge which provides context and ensures an interpretation consistent with the knowledge of the system. This leads to a system that can detect and recognize specific events for which the evidence is present in unconstrained real-world sounds.
Andringa, T., Hengel, P.W.J. van, Nillesen, M.M., Muchal, R. (2004). Aircraft sound-level measurements in residential areas using sound source separation. Proceedings of Internoise 2004, Prague, 460.
(abstract) (pdf)
Andringa, T.C. (2002). Continuity preserving signal processing. Thesis Rijksuniversiteit Groningen.
(abstract) (url)
The foundation of this research was laid in the period between 1994 and 1999, when I held a teaching position in the Technische Cognitiewetenschap (Technical Cognitive Science) programme. The aim of the research was to find speech recognition applications for the rich temporal information in a model of the human inner ear, or cochlea, that had been developed in the group of Prof. Duifhuis. When the first results suggested that this research could lead to important applications in, among other fields, speech technology, it was decided to postpone publication and to develop the work further with a view to patent applications. At the end of 1999, on the initiative and with the financial support of the RuG Houdstermaatschappij, it was decided to continue the research in a commercial setting. To this end the company Human Quality (HuQ) Speech Technologies BV was founded. Continuity Preserving Signal Processing (CPSP) is an entirely new signal analysis methodology based on the research described in this thesis. Because CPSP rests on generally valid basic assumptions, it is, unlike other signal analysis methods, suitable for the analysis of unknown (i.e. unrecognized) complex sound signals. With CPSP, top results are possible on the Aurora test, an international benchmark for noise-robust speech recognition. This research has developed into, on the one hand, the starting point for a theory of human (speech) sound analysis and, on the other hand, a new approach to speech technology. Partly for these reasons, the thesis was written as a tutorial that will continue to develop over time.

Sounds

The role of CPSP

Continuity Preserving Signal Processing (CPSP) was explicitly designed as a way to combine cognitive science with signal processing. As the name suggests, the feature that sets CPSP apart from other signal processing methods is the conservation of continuity, in particular the continuous development of the physical process that produced the signal. This ability allows our auditory system to track the development of individual sound sources and, by doing so, separate them from other sound sources. Traditional signal analysis methods such as FFTs and wavelets do not preserve this continuity and, consequently, have difficulties with tracking, separating, and recognizing mixtures of sound sources.

The most important consequence of the tracking of evidence of individual sound sources is the possibility to form signal components: physically coherent bits of information that contain information about a single source. These signal components form an ideal basis for consecutive analysis stages such as a recognition stage.
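The idea of tracking evidence over time can be illustrated with a small sketch. It is not the CPSP algorithm itself: it is a generic nearest-neighbour peak linker that assumes the peak frequencies per time frame are already available. The function name `track_peaks` and the threshold `max_jump_hz` are hypothetical choices made for this illustration only.

```python
from typing import List, Tuple

Track = List[Tuple[int, float]]  # list of (frame index, frequency in Hz)


def track_peaks(frames: List[List[float]], max_jump_hz: float = 30.0) -> List[Track]:
    """Link spectral peaks in consecutive frames into continuous tracks.

    frames[t] holds the peak frequencies (Hz) found in frame t.
    A peak extends an existing track when it is the nearest unused peak
    and lies within max_jump_hz of the track's most recent frequency.
    Peaks that extend no track start a new one.
    """
    finished: List[Track] = []
    active: List[Track] = []  # tracks that were extended in the previous frame
    for t, peaks in enumerate(frames):
        unused = sorted(peaks)
        still_active: List[Track] = []
        for track in active:
            last_freq = track[-1][1]
            # Greedy choice: the unused peak closest to the track's last frequency.
            best = min(unused, key=lambda f: abs(f - last_freq), default=None)
            if best is not None and abs(best - last_freq) <= max_jump_hz:
                track.append((t, best))
                unused.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # continuity broken: the track ends here
        # Every peak that did not continue a track starts a new one.
        still_active.extend([(t, f)] for f in unused)
        active = still_active
    return finished + active
```

Calling `track_peaks([[200.0, 1200.0], [205.0, 1190.0], [210.0], [215.0, 1195.0]])` yields one long track for the slowly rising component near 200 Hz and two short ones for the interrupted component near 1200 Hz. Each track is a candidate signal component in the sense used above. The greedy matching depends on the order in which tracks are visited, which a more careful implementation would have to address.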

CPSP, as a general auditory-inspired signal processing paradigm, can be used for everything we use our ears for. And because it is tunable (unlike our ear, in which one setting has to suffice for all uses), it can in many cases be adapted to surpass human performance. An example of this is the analysis of failures in rotating equipment (pumps, fans, motors, etc.). Humans have only a limited auditory memory span and have to come up with a percept in less than a second, while a computer-based analysis system can be tuned to analyze a sound in multiple ways without the time constraint that our auditory system has.

[Figure: cochleogram of a gear-box signal]
The cochleogram above shows a number of inhomogeneities in the signal of a gear-box. Two occur at low frequencies around t=0.4 and 1.4 s; the others occur around 1200 Hz at t=0.25, 0.35, and 1.1 s. Both types are audible, but many people will not notice the second type at first listening, and most listeners will have to listen several times before they hear the first two high-frequency ticks separately. Once you have heard them they are obvious. This shows that the visual representation above is often easier to interpret and to analyze than the sound itself. The analysis above is an efficient way to determine the nature of the faults in this gear-box.

For research and development in the field of auditory cognition, CPSP is the natural choice as a signal processing tool, but by no means the only one allowed: different tasks might require different signal processing approaches. It is the task that should determine the most suitable signal processing technique, not the preferences and background of the engineer or scientist. At this stage CPSP is still in its infancy, and many potential uses (each of which is impossible with standard signal processing techniques like FFTs) have not even been considered.

Contact

postal address
Auditory Cognition Group
Department of Artificial Intelligence
University of Groningen
P. O. Box 407
9700 AK Groningen
The Netherlands

visiting address
Bernoulliborg
Nijenborgh 9
9747 AG Groningen

 How to reach us