Publications

Andringa, T. C. (2010). Smart, general, and situation specific sensors. Proceedings of the workshop on “Smarter sensors, easier processing” associated with the 11th International Conference on the Simulation of Adaptive Behavior (SAB 2010). (abstract) (pdf)
The need for central processing in animals and animats can be reduced with smart sensors that reliably indicate the presence of behaviorally relevant events and environmental affordances. Another, equally important, reduction can be realized through a smart choice of what needs to be sensed and how much detail the task requires. The combination of smart processing and ‘expectation management’ accounts for some of the core properties of cognition and might be used to understand animal perception and to design improved animat sensors. In particular the combination couples gist, sensorimotor, and global workspace theories of perception and behavior selection to smart sensors in animals and animats.

Krijnders, J.D., Niessen, M.E., & Andringa, T.C. (2010). Sound event recognition through expectancy-based evaluation of signal-driven hypotheses. Pattern Recognition Letters, 31, 1552-1559. (abstract) (pdf)
A recognition system for environmental sounds is presented. Signal-driven classification is performed by applying machine-learning techniques on features extracted from a cochleogram. These possibly unreliable classifications are improved by creating expectancies of sound events based on context information.

Krijnders, J. D. (2010). Signal-driven sound processing for uncontrolled environments. PhD thesis, University of Groningen, 1-144. (pdf)

Krijnders, J. D., & Andringa, T.C. (2010). Differences between annotating a soundscape live and annotating behind a screen. Proceedings of Internoise 2010, Lisbon, Portugal. (abstract) (pdf)
What sensory information do we use when we are asked to listen? To answer this question we asked participants to annotate real-world city sounds in two conditions. In both conditions participants were present and annotated during the recording. In the first condition the participants could see the environment; in the second condition they sat behind a screen that blocked their view. The first condition corresponds to a normal situation for humans.

Niessen, M.E., Cance, C., & Dubois, D. (2010). Categories for soundscape: Toward a hybrid classification. Proceedings of Internoise 2010, 1-14. (abstract) (pdf)
To complement recent efforts in standardizing perceptual assessment of human sound preference through a taxonomy, we propose to contribute to the elaboration of a scientific classification of the diversity of soundscapes as presently studied, from our theoretical knowledge on categorization and naming. In a small collection of publications, selected on the basis of their explicit reference to soundscape studies, we identified exemplars of soundscapes, their structure, and naming. Furthermore, we assessed the consensus of categories mentioned by different research domains in this corpus. Through a linguistic analysis of the wording of the categories we identified different types of classifications, dependent on the research domain, and the object under investigation. Based on this finding we suggest that researchers should be explicit about the type of categorization they apply, and to which aspect of soundscape they are contributing. This suggestion is aimed at reaching a consensus not only on a generic definition, but also on the empirical investigations in a more explicitly structured domain, which the concept of soundscape intends to cover.

Niessen, M.E. (2010). Context-Based Sound Event Recognition. PhD thesis, University of Groningen, 1-144. (pdf)

Andringa, T.C. (2010). Soundscape and core affect regulation. Proceedings of Interspeech 2010. (abstract) (pdf)
“Why do people make detours through parks?” Or, put differently: “Why is being exposed to the stimuli in a park pleasant and restorative?” Or, more generally: “Why may sounds influence behavior?” The answer to these questions may lie in the interplay of core phenomena of cognitive science such as: the processes of hearing and listening, different forms of attention, meaning giving and associated effortful and less effortful mental states, core affect regulation, basic emotions, viability and health, and the restoration of the capacity for directed attention. These all contribute in predictable ways to how humans respond to sound (and, in general, to the environment). The hearing process may deem part of the sound salient enough to be analyzed in full by the listening process that gives meaning to the input through the activation of behavioral options. The selected behavioral options must comply with the demand that they preserve viability and help to regulate core affect (defined as the combination of perceived viability and resource allocation). Any response strategy is influenced by perception-activated basic emotions. Angry behavior is activated when sounds hinder goal achievement. The restorativeness of parks relies on perceptual fascination through involuntary attention capture by pleasant stimuli in combination with fairly simple perception-action responses. These mental states allow the systems for complex reasoning through directed attention time to return to normal values. Soundscape design should focus on allowing people sufficient opportunities for core affect self-regulation through the creation of fascinating “attractors”.

T.C. Andringa (2010). Audition: from sound to sounds. Chapter 4 of Machine Audition: Principles, Algorithms and Systems, Wenwu Wang (ed.), pp. 80-105. (abstract) (pdf)
This chapter addresses the functional requirements of auditory systems, both natural and artificial, to be able to deal with the complexities of uncontrolled real-world input. The demand to function in uncontrolled environments has severe implications for machine audition. The natural system has addressed this demand by adapting its function flexibly to changing task demands. Attentional processes and the concept of perceptual gist play an important role in this. Hearing and listening are seen as complementary processes. The process of hearing detects the existence and general character of the environment and its main and most salient sources. In combination with task demands these processes allow the pre-activation of knowledge about expected sources and their properties. Consecutive listening phases, in which the relevant subsets of the signal are analyzed, allow the level of detail required by task- and system-demands. This form of processing requires a signal representation that can be reasoned about. A representation based on source physics is suitable and has the advantage of being situation independent. The demand to determine physical source properties from the signal imposes restrictions on the signal processing. When these restrictions are not met, systems are limited to controlled domains. Novel signal representations are needed to couple the information in the signal to knowledge about the sources in the signal.

M.E. Niessen, J.D. Krijnders, T.C. Andringa (2009). Understanding a soundscape through its components. Accepted for oral presentation at Euronoise 2009, Edinburgh, Great Britain (abstract) (pdf)
Human perception of sound in real environments is a complex fusion of many factors, which are investigated by diverse research fields. Most approaches to assess and improve sonic environments (soundscapes) use a holistic approach. For example, in experimental psychology, subjective measurements usually involve the evaluation of a complete soundscape by human listeners, mostly through questionnaires. In contrast, psychoacoustic measurements try to capture low-level perceptual attributes in quantities such as loudness. However, these two types of soundscape measurements are difficult to link other than with correlational measures. We propose a method inspired by cognitive research to improve our understanding of the link between acoustic events and human soundscape perception. Human listeners process sound as meaningful events. Therefore, we developed a model to identify components in a soundscape that are the basis of these meaningful events. First, we select structures from the sound signal that are likely to stem from a single source. Subsequently, we use a model of human memory to predict the location at which a sound is recorded, and to identify the most likely events that constitute the components in the sound, given the location.

J.D. Krijnders, T.C. Andringa (2009). Soundscape annotation and environmental source recognition experiments in Assen (NL). In: Inter-noise 2009, Ottawa, Canada (abstract) (pdf)
We describe a soundscape annotation tool for unconstrained environmental sounds. The tool segments the time-frequency-plane into regions that are likely to stem from a single source. A human annotator can classify these regions. The system learns to suggest possible annotations and presents these for acceptance. Accepted or corrected annotations will be used to improve the classification further. Automatic annotations with a very high probability of being correct might be accepted by default. This speeds up the annotation process and makes it possible to annotate complex soundscapes both quickly and in considerable detail. We performed a pilot on recordings of uncontrolled soundscapes at locations in Assen (NL) made in the early spring of 2009. Initial results show that the system is able to propose the correct class in 75% of the cases.
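
The suggest-then-confirm loop sketched in this abstract can be illustrated with a toy example. The Python sketch below uses a nearest-centroid learner as a stand-in for the actual classifier; the feature vectors and class names are invented for the example and are not taken from the paper.

    import numpy as np

    class SuggestingAnnotator:
        """Toy stand-in for the annotation tool's learner: suggests a class
        for each region and is updated by accepted or corrected labels."""
        def __init__(self):
            self.sums, self.counts = {}, {}

        def suggest(self, features):
            if not self.sums:
                return None                        # nothing learned yet
            dists = {c: np.linalg.norm(features - self.sums[c] / self.counts[c])
                     for c in self.sums}
            return min(dists, key=dists.get)       # nearest class centroid

        def confirm(self, features, label):
            """Accepted or corrected annotations update the class centroids."""
            self.sums[label] = self.sums.get(label, 0.0) + features
            self.counts[label] = self.counts.get(label, 0) + 1

    annotator = SuggestingAnnotator()
    for features, true_label in [(np.array([1.0, 0.1]), "bird"),
                                 (np.array([0.1, 1.0]), "car"),
                                 (np.array([0.9, 0.2]), "bird")]:
        suggestion = annotator.suggest(features)   # shown to the human annotator
        annotator.confirm(features, true_label)    # human accepts or corrects
        print(suggestion, "->", true_label)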

R. R. Violanda, H. van de Vooren, R.A.J. van Elburg, T.C. Andringa (2009). Signal Component Estimation in Background Noise. In: Proceedings of NAG/DAGA 2009, 347, pp. 1588-1591. (abstract) (pdf)
Auditory scenes in a natural environment typically consist of multiple sound sources. From this mixture the human auditory system can segregate and identify individual audible sources with ease, even if the received signals are distorted due to noise and transmission effects [1]. Systems for computational auditory scene analysis (CASA) try to achieve human performance by decomposing a sound into coherent units and then grouping these to represent the individual sound sources in the scene [2-4]. Some CASA systems aim to estimate masks that enclose all regions dominated by a single sound source in the time-frequency (TF) plane [2]. This task becomes increasingly difficult when high levels of noise are present.
The aim underlying our approach to CASA is to estimate the physical properties of the source which give rise to its signal components. We define a signal component (SC) as a connected trajectory in the TF plane with a positive local signal-to-noise ratio (SNR). Ideally, if the sound stemming from a source is represented in the TF plane, the source’s characteristics carried by the sound are reflected as a pattern of signal components [5]. Hence, by correctly tracking and grouping the signal components, it is possible to retrieve the physical properties that shaped the source signal.
In this paper, we present two methods of estimating signal components. The first method uses both the energy and phase information derived from the time-frequency representation of the signal. The second method uses solely the energy of the signal. We compare the effectiveness of the two methods in the extraction of target sound signals from noisy backgrounds. Our results show the importance of the often neglected phase information in TF analysis.
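
As a minimal illustration of the signal-component definition above (connected time-frequency regions with positive local SNR), the Python sketch below labels connected regions that exceed a per-channel noise-floor estimate. The median-based noise floor and the 6 dB margin are assumptions for the example, not the authors' method.

    import numpy as np
    from scipy.ndimage import label

    def signal_components(tf_energy_db, margin_db=6.0):
        """tf_energy_db: 2-D array (frequency x time) of energies in dB.
        Returns connected regions that exceed a per-channel noise-floor
        estimate, i.e. candidate signal components."""
        noise_floor = np.median(tf_energy_db, axis=1, keepdims=True)
        mask = tf_energy_db > noise_floor + margin_db   # locally above the noise
        labels, n_components = label(mask)              # connected TF regions
        return labels, n_components

    # Toy example: a steady tone buried in noise shows up as one component.
    rng = np.random.default_rng(1)
    tf = rng.normal(0.0, 1.0, (64, 200))
    tf[20, 50:120] += 12.0
    labels, n = signal_components(tf)
    print(n, "candidate component(s)")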

M.E. Niessen, G. Kootstra, S. De Jong, T.C. Andringa (2009). Expectancy-based robot navigation through context evaluation. Proceedings of the 2009 International Conference on Artificial Intelligence, Las Vegas, NV, USA, pp 371-377. ISBN 1-60132-107-4 (abstract) (pdf)
Agents that operate in a real-world environment have to process an abundance of information, which may be ambiguous or noisy. We present a method inspired by cognitive research that keeps track of sensory information, and interprets it with knowledge of the context. We test this model on visual information from the real-world environment of a mobile robot in order to improve its self-localization. We use a topological map to represent the environment, which is an abstract representation of distinct places and the connections between them. Expectancies of the place of the robot on the map are combined with evidence from observations to reach the best prediction of the next place of the robot. These expectancies make a place prediction more robust to ambiguous and noisy observations. Results of the model operating on data gathered by a mobile robot confirm that context evaluation improves localization compared to a data-driven model.
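
The expectancy-plus-evidence update on a topological map can be written as a small discrete Bayes filter. In the Python sketch below, the place names, transition probabilities, and observation model are invented for the example; only the structure of the update reflects the approach described above.

    # Illustrative expectancy-based localization on a topological map.
    def normalize(dist):
        total = sum(dist.values())
        return {k: v / total for k, v in dist.items()}

    # Topological map: which places are reachable from each place.
    transitions = {
        "hall":     {"hall": 0.6, "corridor": 0.4},
        "corridor": {"corridor": 0.5, "hall": 0.25, "office": 0.25},
        "office":   {"office": 0.7, "corridor": 0.3},
    }

    # Observation model: P(observed visual cue | place).
    observation_model = {
        "hall":     {"door": 0.2, "plant": 0.5, "desk": 0.3},
        "corridor": {"door": 0.6, "plant": 0.3, "desk": 0.1},
        "office":   {"door": 0.2, "plant": 0.1, "desk": 0.7},
    }

    def predict(belief):
        """Expectancy step: propagate belief along the map's connections."""
        expected = {place: 0.0 for place in transitions}
        for place, p in belief.items():
            for nxt, p_move in transitions[place].items():
                expected[nxt] += p * p_move
        return expected

    def update(belief, observation):
        """Evidence step: weigh expectancies by the (possibly noisy) observation."""
        posterior = {place: p * observation_model[place].get(observation, 1e-6)
                     for place, p in belief.items()}
        return normalize(posterior)

    belief = {"hall": 1.0, "corridor": 0.0, "office": 0.0}
    for obs in ["door", "desk", "desk"]:    # ambiguous cues disambiguated over time
        belief = update(predict(belief), obs)
        print(max(belief, key=belief.get), belief)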

B. Valkenier, J.D. Krijnders, R.A.J. van Elburg, T.C. Andringa (2009). Robust Vowel Detection. Accepted for oral presentation at NAG/DAGA 2009, Rotterdam. In: Proceedings of NAG/DAGA 2009, 303, pp. 1306-1309. (abstract) (pdf)
Inspired by the robustness of human speech processing, we aim to overcome the limitations of most modern automatic speech recognition methods by applying general knowledge of human speech processing. As a first step we present a vowel detection method that uses features of sound believed to be important for human auditory processing, such as fundamental frequency range, possible formant positions, and minimal vowel duration. To achieve this we take the following steps: we identify high-energy cochleogram regions of suitable shape and sufficient size, extract possible harmonic complexes, complement them with less reliable signal components, determine local formant positions by interpolating between peaks in the harmonic complex, and finally keep formants of sufficient duration. We show the effectiveness of our formant detection method by applying it to the Hillenbrand dataset in clean, noisy, and reverberant conditions. In all three conditions the extracted formant positions agree well with Hillenbrand's findings. Thereby we show that, contrary to many modern automatic speech recognition methods, our results are robust to considerable levels of noise and reverberation.

Niessen, M.E., Van Maanen, L., Andringa, T.C. (2008). Disambiguating sound through context. International Journal of Semantic Computing 2(3), 327-341. (abstract) (pdf) Electronic version of an article published as doi:10.1142/S1793351X08000506
A central problem in automatic sound recognition is the mapping between low-level audio features and the meaningful content of an auditory scene. We propose a dynamic network model to perform this mapping. In acoustics, much research is devoted to low-level perceptual abilities such as audio feature extraction and grouping, which are translated into successful signal processing techniques. However, little work is done on modeling knowledge and context in sound recognition, although this information is necessary to identify a sound event rather than to separate its components from a scene. We first investigate the role of context in human sound identification in a simple experiment. Then we show that the use of knowledge in a dynamic network model can improve automatic sound identification by reducing the search space of the low-level audio features. Furthermore, context information dissolves ambiguities that arise from multiple interpretations of one sound event.

Krijnders, J.D., Andringa, T.C. (2008). Demonstration of online auditory scene analysis. Accepted for BNAIC 2008. (abstract) (pdf)
We show an online system for auditory scene analysis. Auditory scene analysis is the analysis of complex sounds as they occur in natural settings. The analysis is based on models of human auditory processing. Our system includes models of the human ear, analysis into tones and pulses, and grouping algorithms. This system forms the basis for several sound recognition tasks, such as aggression detection and vowel recognition.

Van Elburg, R.A.J., Van Ooyen, A. (2008). Generalization of the event-based Carnevale-Hines integration scheme for integrate-and-fire models. Submitted. (abstract) (preprint)
An event-based integration scheme for an integrate-and-fire neuron model with exponentially decaying excitatory synaptic currents and double exponential inhibitory synaptic currents has recently been introduced by Carnevale and Hines. This integration scheme imposes non-physiological constraints on the time constants of the synaptic currents it attempts to model, which hamper its general applicability. This paper addresses this problem in two ways. First, we provide physical arguments to show why these constraints on the time constants can be relaxed. Second, we give a formal proof showing which constraints can be abolished. This proof rests on a generalization of the Carnevale-Hines lemma, which is a new tool for comparing double exponentials as they naturally occur in many cascaded decay systems, including receptor-neurotransmitter dissociation followed by channel closing. We show that this lemma can be generalized and subsequently used for lifting most of the original constraints on the time constants. Thus we show that the Carnevale-Hines integration scheme for the integrate-and-fire model can be employed for simulating a much wider range of neuron and synapse type combinations than is apparent from the original treatment.
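
For reference, a double-exponential synaptic current is generically written as a difference of two exponentials; the normalization and symbols below follow a common convention and need not match the paper's exact parametrization:

    I_{\mathrm{syn}}(t) = \frac{g}{\tau_d - \tau_r}\left(e^{-t/\tau_d} - e^{-t/\tau_r}\right), \qquad \tau_d > \tau_r > 0.

The Carnevale-Hines lemma, and its generalization in this paper, concern comparisons between expressions of this kind, which an event-based scheme has to evaluate analytically between spike events.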

Niessen, M.E., Van Maanen, L., Andringa, T.C. (2008). Disambiguating sounds through context. Proceedings of IEEE ICSC, 88-95. (abstract) (pdf)
A central problem in automatic sound recognition is the mapping between low-level audio features and the meaningful content of an auditory scene. We propose a dynamic network model to perform this mapping. In acoustics, much research has been devoted to low-level perceptual abilities such as audio feature extraction and grouping, which have been translated into successful signal processing techniques. However, little work is done on modeling knowledge and context in sound recognition, although this information is necessary to identify a sound event rather than to separate its components from a scene. We first investigate the role of context in human sound identification in a simple experiment. Then we show that the use of knowledge in a dynamic network model can improve automatic sound identification, by reducing the search space of the low-level audio features. Furthermore, context information dissolves ambiguities that arise from multiple interpretations of one sound event.

Vooren, H. van de, Violanda, R.R., Andringa, T.C. (2008). Robust harmonic grouping by octave error correction. Acoustics'08 Paris. (abstract)
Harmonic grouping is a frequently applied technique in computational auditory scene analysis and automatic speech recognition systems. However, grouping is easily disrupted by noise and reverberation. For instance, a noise-induced signal component positioned roughly between two harmonics might undesirably be assigned to the harmonic complex (HC) as well. This results in an octave error: harmonics in an HC are assigned to harmonic numbers twice as high as the correct values. We propose a cost-function-based method to correct these octave errors. This function is designed to, on the one hand, improve the balance between odd and even harmonic numbers, and, on the other hand, minimize the number of signal components to be rejected. As a preprocessing step we applied short-time Fourier analysis to derive an instantaneous frequency representation from which we obtained the signal components. We used these as input for our harmonic grouping algorithm to obtain the HCs. Then we selected the optimal solution from the cost function and modified the composition of the HCs accordingly. As long as enough harmonics are sufficiently above the local noise level, this octave error correction mechanism works well for various sorts of harmonic sounds, including speech.
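
The odd/even-balance idea can be illustrated with a toy cost in Python; the actual cost function and its weighting in the paper are not reproduced here, so the tolerance and the equal weighting of imbalance and rejections below are assumptions for the example.

    def octave_cost(component_freqs, f0, tolerance=0.03):
        """Assign each component a harmonic number for candidate fundamental f0.
        Components that do not fall near a harmonic are 'rejected'.  A complex
        whose harmonic numbers are almost all even suggests an octave error
        (the true fundamental is probably 2 * f0)."""
        odd = even = rejected = 0
        for f in component_freqs:
            n = round(f / f0)
            if n >= 1 and abs(f - n * f0) <= tolerance * f0:
                if n % 2:
                    odd += 1
                else:
                    even += 1
            else:
                rejected += 1
        imbalance = abs(odd - even) / max(odd + even, 1)
        return imbalance + rejected / max(len(component_freqs), 1)

    # Candidate fundamentals an octave apart: pick the one with the lower cost.
    freqs = [200, 400, 600, 800, 1000, 1200]    # plain 200 Hz harmonic complex
    print(octave_cost(freqs, 100), octave_cost(freqs, 200))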

Andringa, T.C. (2008). The texture of natural sounds. Acoustics'08 Paris. (abstract)
The texture, a spectro-temporal pattern, of many sound sources is a robust and characteristic perceptual property that listeners use for sound source recognition. The robustness of the texture ensures that we can recognize sound sources like a helicopter, flowing water, or surf breaking on pebbles in a wide variety of acoustic environments. This robustness suggests that textures can be used for automatic source identification or environment classification. We introduce a method to determine the presence of sound textures associated with, for example, flat noise, pulsed noises (helicopter), sweep-based textures (running water), and tonal noises (babble). The cumulative probability density of time-frequency fluctuations is matched with prototypical cumulative probability density functions (cpdf) using a running variant of the Kolmogorov-Smirnov test. Textures with a distribution similar to the target distribution contribute approximately equally to all values of the cpdf. The flatness of this distribution is used as a distance measure. When the pdfs of the target textures do not overlap strongly, the method can determine the texture of time-frequency regions as small as 100 ms by 6 semitones. This method can therefore also be used to determine the texture of the background.
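
The flatness measure can be illustrated as follows: observations drawn from the prototype distribution, mapped through the prototype cpdf, are uniformly spread over [0, 1], and the deviation from uniformity gives a Kolmogorov-Smirnov-style distance. The Python sketch below illustrates that idea only; the prototype cpdf and the test data are invented for the example.

    import numpy as np
    from scipy.stats import norm

    def flatness_distance(fluctuations, prototype_cdf):
        """Map observations through the prototype CDF; if the texture matches,
        the mapped values are (close to) uniform on [0, 1].  The KS-style
        statistic below measures the deviation from that uniformity."""
        u = np.sort(prototype_cdf(np.asarray(fluctuations)))
        n = len(u)
        grid = np.arange(1, n + 1) / n
        return np.max(np.abs(u - grid))

    # Example: a Gaussian prototype versus matching and broader fluctuations.
    rng = np.random.default_rng(0)
    matching = rng.normal(0.0, 1.0, 500)     # same distribution as the prototype
    different = rng.normal(0.0, 2.0, 500)    # broader fluctuation distribution

    print(flatness_distance(matching, norm.cdf))   # small distance
    print(flatness_distance(different, norm.cdf))  # larger distance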

Krijnders, J.D., Niessen, M.E., Andringa, T.C. (2008). A grouping approach to harmonic complexes. Acoustics'08 Paris. (abstract)
Humans seem to perform sound-source separation for quasi-periodic sounds, such as speech, mostly on the basis of harmonicity cues. To model this function, most machine algorithms use a pitch-based approach to group the speech parts of the spectrum. In these methods the pitch is obtained either explicitly, as in autocorrelation methods, or implicitly, as in harmonic sieves. If the estimation of the pitch is wrong, the grouping will fail as well. In this paper we show a method that performs harmonic grouping without first calculating the pitch. Instead, a pitch estimate is associated with each grouping hypothesis. Making the grouping independent of the pitch estimate makes it more robust in noisy settings. The algorithm obtains possible harmonics by tracking energy peaks in a cochleogram. Co-occurring harmonics are compared in terms of frequency difference. Grouping hypotheses are formed by combining harmonics with similar frequency differences. Consistency checks are performed on these hypotheses, and hypotheses with compatible properties are combined into harmonic complexes. Every harmonic complex is evaluated on the number of harmonics, the number of consecutive harmonics, and the presence of a harmonic at the pitch position. Using the number of consecutive harmonics prevents octave errors. Multiple concurrent harmonic complexes can be found as long as the spectral overlap is small.

Niessen, M.E., Van Elburg, R.A.J., Krijnders, J.D., Andringa, T.C. (2008). A computational model for auditory scene analysis. Acoustics'08 Paris. (abstract)
Primitive auditory scene analysis (ASA) is based on intrinsic properties of the auditory environment. Acoustic features such as continuity and proximity in time or frequency cause perceptual grouping of acoustic elements. Various grouping attributes have been translated into successful signal processing techniques that may be used in source separation. A next step beyond primitive ASA is source identification through schema-based ASA. We present a computational model for ASA that is inspired by models from cognitive research. It dynamically builds a hierarchical network of hypotheses, which is based on (learned) knowledge of the sources. Each hypothesis in the network, initiated by bottom-up evidence, represents a possible sound event. The network is updated for each new input event, which may be any sound in an unconstrained environment. The analysis of new input events is guided by knowledge of the environment and previous events. As a result of this adaptive behavior, information about the environment increases and the set of possible hypotheses decreases. With this method of continuously improving sound event identification we make a promising advance in computational ASA of complex real-world environments.

Andringa, T.C., Grootel, M. van (2007). Predicting listeners' reports to environmental sounds. Proceedings of ICA 2007, ENV-09-005-IP. (abstract) (pdf)
Spontaneous verbal descriptions of environmental sounds lead to a description of the contributing sound sources and the environments in which they occur. This is a form of perception that relies crucially on the rich structure of sounds, because only rich sounds can convey detailed information about individual sources and the transmission environment. This paper uses a semantic network with connection strengths derived from listener reports to represent the content of auditory scenes. The activity of the semantic network is based on a number of source-specific cues for sounds such as birds, vehicles, speech, and footsteps. These cues are not based on spectral envelope and level, but on patterns in tones, pulses, and noises that capture source-specific structures. Generally the system performed in a similar way to a human listener in terms of the concepts activated (or named) and the choice of the acoustic environment. The robustness and performance suggest that the combination of a semantic network and source-specific cues can be used to design systems for sound-based ambient awareness.

Zajdel, W., Krijnders, J.D., Andringa, T.C., Gavrila, D.M. (2007). Audio-video sensor fusion for aggression detection. Proceedings of AVSS 2007. Best paper award. (abstract) (pdf)
This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is the exploitation of the complementary nature of audio and video sensing to disambiguate scene activity in real-life, noisy and dynamic environments. At the lower level, independent analysis of the audio and video streams yields intermediate descriptors of a scene, such as "scream", "passing train" or "articulation energy". At the higher level, a Dynamic Bayesian Network is used as a fusion mechanism that produces an aggregate aggression indication for the current scene. Our prototype system is validated on a set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.
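
The fusion step can be illustrated with a two-state toy model in which audio and video descriptors are treated as conditionally independent evidence about an 'aggression' variable tracked over time, i.e. a minimal dynamic Bayesian network. All probabilities and descriptor names in the Python sketch below are invented for the example and do not come from the CASSANDRA system.

    states = ("calm", "aggressive")
    transition = {"calm": {"calm": 0.9, "aggressive": 0.1},
                  "aggressive": {"calm": 0.3, "aggressive": 0.7}}
    p_audio = {"calm": {"scream": 0.05, "speech": 0.95},
               "aggressive": {"scream": 0.6, "speech": 0.4}}
    p_video = {"calm": {"high_motion": 0.2, "low_motion": 0.8},
               "aggressive": {"high_motion": 0.8, "low_motion": 0.2}}

    def step(belief, audio_obs, video_obs):
        """One time slice: predict along the transition model, then fuse the
        audio and video observations as independent likelihoods."""
        predicted = {s: sum(belief[p] * transition[p][s] for p in states) for s in states}
        posterior = {s: predicted[s] * p_audio[s][audio_obs] * p_video[s][video_obs]
                     for s in states}
        z = sum(posterior.values())
        return {s: v / z for s, v in posterior.items()}

    belief = {"calm": 0.95, "aggressive": 0.05}
    for audio, video in [("speech", "low_motion"), ("scream", "high_motion"),
                         ("scream", "high_motion")]:
        belief = step(belief, audio, video)
        print(f"P(aggressive) = {belief['aggressive']:.2f}")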

Hengel, P.W.J. van, Andringa, T.C. (2007). Verbal aggression detection in complex social environments. Proceedings of AVSS 2007. (abstract) (pdf)
The paper presents a knowledge-based system designed to detect evidence of aggression by means of audio analysis. The detection is based on the way sounds are analyzed and how they attract attention in the human auditory system. The performance achieved is comparable to human performance in complex social environments. The SIgard system has been deployed in a number of different real-life situations and was tested extensively in the inner city of Groningen. Experienced police observers have annotated ~1400 recordings with various degrees of shouting, which were used for optimization. All essential events and a small number of nonessential aggressive events were detected. The system produces only a few false alarms (non-shouts) per microphone per year and misses no incidents. This makes it the first successful detection system for a non-trivial target in an unconstrained environment.

Niessen, M.E., Krijnders, J.D., Boers, J., Andringa, T.C. (2007). Assessing the reverberation level in speech. Proceedings of ICA 2007, CAS-03-020. (abstract) (pdf)
The performance of automatic speech recognition (ASR) systems is seriously degraded in reverberant environments. We propose a method for assessing the reverberation level in speech that makes it possible to determine in real-time whether a speech signal is reverberant or not. Reverberation causes an increase in the variation of the energy and frequency of harmonics in speech. Speech with a variable pitch is especially affected by reverberation. To capture the effect of reverberation we measured six features on the harmonics of the speech signal, which represent energy and frequency variation in different ways. Speech from the Aurora database was artificially reverberated to demonstrate the validity of these features. Each feature predicted reverberation time for a different subset of the dataset. To test the overall separability of the speech samples using these features, speech from the dataset was automatically classified as being either inside or outside the reverberation radius. Most of the speech was correctly classified, which suggests that a reliable real-time classification algorithm can be developed to select good-quality speech. This algorithm can improve pre-processing methods, such as speech enhancement or voice activity detection, for more robust ASR.

Krijnders, J.D., Niessen, M.E., Andringa, T.C. (2007). Robust harmonic complex estimation in noise. Proceedings of ICA 2007, CAS-03-019. Best poster award at the BCN research school spring meeting. (abstract) (pdf) (poster)
We present a new approach to robust pitch tracking of speech in noise. We transform the audio signal to the time-frequency domain using a transmission-line model of the human ear. The model output is leaky-integrated resulting in an energy measure, called a cochleogram. To select the speech parts of the cochleogram a correlogram, based on a running autocorrelation per channel, is used. In the correlogram possible pitch hypotheses are tracked through time and for every track a salience measure is calculated. The salience determines the importance of that track, which makes it possible to select tracks that belong to the signal while ignoring those of noise. The algorithm is shown to be more robust to the addition of pink noise than standard autocorrelation-based methods. An important property of the grouping is the selection of parts of the time-frequency plane that are highly likely to belong to the same source, which is impossible with conventional autocorrelation-based methods.
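
A much-simplified, single-channel stand-in for the correlogram idea is a per-frame autocorrelation whose normalized peak height serves as a salience measure. The cochleogram front end and per-channel processing of the paper are not reproduced; the frame length, lag range, and salience definition in the Python sketch below are assumptions for illustration.

    import numpy as np

    def frame_pitch(signal, fs, fmin=80.0, fmax=400.0):
        """Return (pitch estimate in Hz, salience) for one frame."""
        lags = np.arange(int(fs / fmax), int(fs / fmin) + 1)
        x = signal - np.mean(signal)
        ac = np.correlate(x, x, mode="full")[len(x) - 1:]
        ac = ac / (ac[0] + 1e-12)                # normalized autocorrelation
        best = lags[np.argmax(ac[lags])]
        salience = ac[best]                      # ~1 for strongly periodic frames
        return fs / best, salience

    fs = 16000
    t = np.arange(0, 0.04, 1.0 / fs)
    frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.random.default_rng(2).normal(size=t.size)
    print(frame_pitch(frame, fs))                # close to (150 Hz, high salience)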

Lobanova, A., Spenader, J., Valkenier, B. (2007). Lexical and perceptual grounding of a sound ontology. Proceedings of TSD 2007, 180-187. (abstract) (pdf)
Sound ontologies need to incorporate source-unidentifiable sounds in an adequate and consistent manner. Computational lexical resources like WordNet have either inserted these descriptions into conceptual categories or made no attempt to organize the terms for these sounds. This work attempts to add structure to linguistic terms for source-unidentifiable sounds. Through an analysis of WordNet and a psychoacoustic experiment we make some preliminary proposals about which features are highly salient for sound classification. This work is essential for interfacing between source-unidentifiable sounds and linguistic descriptions of those sounds in computational applications, such as the Semantic Web and robotics.

Van Opzeeland, I., Slabbekoorn, H., Andringa, T.C., Ten Cate, C. (2007). Underwater racket: fish and noise pollution. De Levende Natuur 108, 39-43. (abstract) (pdf)
Since the introduction of the Wet op de Verontreiniging van Oppervlaktewateren (Pollution of Surface Waters Act), the quality of surface water in the Netherlands has improved considerably. Through the Kaderrichtlijn Water (Water Framework Directive), better water quality is now also being pursued through international cooperation. However, an important aspect of water that has so far been neglected is the underwater sound level resulting from the strong increase in shipping, recreation, and construction activities in and around Dutch surface waters. Almost nothing is yet known about its impact on underwater habitat quality.

Van Opzeeland, I., Slabbekoorn, H., Andringa, T., Ten Cate, C. (2007). Underwater noise pollution: effects on fishes. Visionair 4, 11-13. (abstract) (pdf)
Almost every angler knows it: if you are too noisy at the waterside, the fish take off. Despite this fisherman's wisdom, many people will not realize how important sound is to fish. Human activity on and around the surface waters of the Netherlands is increasing sharply, and as a result the underwater noise load will have increased correspondingly. Little is known about the effects of ambient sound on fish, which is why Leiden University devoted an extensive literature review to the topic.

Andringa, T.C., Niessen, M.E. (2006). Real-world sound recognition: A recipe. Proceedings of LSAS 2006, 106-118. (abstract) (pdf)
This article addresses the problem of recognizing acoustic events present in unconstrained input. We propose a novel approach to the processing of audio data that combines bottom-up hypothesis generation with top-down expectations and, unlike standard pattern recognition techniques, can ensure that the representation of the input sound is physically realizable. Our approach gradually enriches low-level signal descriptors, based on Continuity Preserving Signal Processing, with more and more abstract interpretations along a hierarchy of description levels. This process is guided by top-down knowledge, which provides context and ensures an interpretation consistent with the knowledge of the system. This leads to a system that can detect and recognize specific events for which the evidence is present in unconstrained real-world sounds.

Andringa, T., Hengel, P.W.J. van, Nillesen, M.M., Muchal, R. (2004). Aircraft sound-level measurements in residential areas using sound source separation. Proceedings of Internoise 2004, Prague, 460. (abstract) (pdf)
Measuring the contribution of a particular sound source to the ambient sound level at an arbitrary location is impossible without some form of sound source separation. This has made it difficult, if not impossible, to design automated systems that measure the contribution of a target sound to the ambient sound level. This paper introduces sound source separation technology that can be used to measure the contribution of a sound source, here a passing plane, in environments where planes are not the dominant sound source. This sound source separation and classification technology, developed by Sound Intelligence, makes it, in principle, possible to monitor the temporal development of any soundscape.
The plane detection technology is based on Continuity Preserving Signal Processing (CPSP): signal processing technology mimicking the tracking and streaming ability of the human auditory system. It relies on an efficient implementation of the human cochlea which, like the natural system, preserves the continuous development of sound sources better than is possible with traditional systems. Detection and classification are based on the characteristic development of the target sound source. In the case of airplanes the characteristic property set includes spectral cues in combination with a characteristic temporal development, and techniques to deal with prominent transmission effects (wind and reflections off buildings).
This airplane sound detection technology has not been published before. Experiments at a residential location about 25 km from Amsterdam Airport demonstrate that it is possible to estimate the contribution of passing planes to the ambient sound. Even when the target signal is only marginally (5 dB) above the background noise and degraded due to wind and reflections, a detection performance comparable to that of human listeners is achieved.

Andringa, T.C. (2002). Continuity preserving signal processing. PhD thesis, Rijksuniversiteit Groningen. (abstract) (url)
The basis of this research was laid in the period between 1994 and 1999, when I held a teaching position in the Technische Cognitiewetenschap (Technical Cognitive Science) programme. The goal of the research was to find speech recognition applications for the rich temporal information in a model of the human inner ear, or cochlea, that had been developed within the group of prof. Duifhuis. When the first results suggested that this research could lead to important applications in, among other fields, speech technology, it was decided to postpone publication and to develop the work further for patent applications. At the end of 1999 it was decided, on the initiative of and with financial support from the RuG Houdstermaatschappij, to continue the research in a commercial setting. To this end the company Human Quality (HuQ) Speech Technologies BV was founded.
Continuity Preserving Signal Processing (CPSP) is an entirely new signal analysis methodology based on the research described in this thesis. Because CPSP rests on generally valid basic assumptions, it is, unlike other signal analysis methods, suitable for the analysis of unknown (i.e., unrecognized) complex sound signals. With CPSP, top results are possible on the Aurora test, an international benchmark for noise-robust speech recognition. This research has developed into both the starting point for a theory of human (speech) sound analysis and a new approach to speech technology. Partly for these reasons, the thesis has been written as a tutorial that will continue to develop over time.