Publications Tjeerd Andringa

Submitted

Andringa, T.C., Weber, M., Payne, S.R., Krijnders, D.J., Dixon, M.N., van der Linden, R., de Kock, E., and Lanser, J.J.L. (2011) Repositioning Soundscape Research and Management. Submitted. (abstract) (pdf)
This paper is an outcome of a workshop that addressed the question of how soundscape research can improve its impact at the local level. It addresses a number of topics by complementing existing approaches and practices with possible future approaches and practices. The paper starts with an analysis of the impact of sound annoyance and suboptimal soundscapes on the lives of individuals and concludes that a good soundscape, or more generally a good sensescape, is pleasant and conducive to the adoption of healthy habits. The paper progresses with an outline of future measurement and simulation techniques designed specifically for the optimization of local soundscapes. To maintain or improve sensescape quality, urban planning needs improved design tools that allow for a more holistic optimization and an active role of the user. Associated with this is a gradual development from government to governance, in which optimization of the soundscape at the local (administrative or geographic) level is directly influenced by the users of spaces. The paper concludes that soundscape research can have a greater impact by helping urban planners design for health and pleasant experiences, as well as by developing tools for improved citizen involvement in local optimization.

Andringa, T.C. and Lanser, J.J.L. (2011) Pleasurable and annoying sounds and how these impact on life. Submitted. (abstract) (pdf)
This paper aims to provide a scientific definition of pleasurable and annoying sounds. In a web-based survey we asked 478 respondents to describe how sound annoyance in the home environment impacted their quality of life. We also asked why one particular sound was the most annoying. The diversity and structure of the answers to the first question led us to formulate the notion of viability self-regulation and the impact of pleasurable and annoying sounds on this process. The answers to the second question allowed us to formulate a complementary model describing how annoying sounds influence where attention will be aimed, with direct consequences for arousal and alertness. We proposed and confirmed that pleasurable sounds provide active indicators of safety, while annoying sounds either mask these or provide indications of potential danger. We concluded that sound annoyance is a rich phenomenon that should not be reduced to sound-level-based dose-response relations. Sound annoyance in the home situation is better approached from the viewpoint of preserving the essential functions of the home for viability self-regulation.

Valkenier, B., Duyne, J.Y., Andringa, T.C., & Baskent, D. Audio-visual perception of Dutch front vowels. Accepted by the Journal of Speech, Language, and Hearing Research.

Published

Visser, T., Verbrugge, L. C., & Andringa, T. C. (2011). Affective agents bridge the gap between life and mind. Presented at the Workshop on Artificial Autonomy ECAL 2011, Paris. (abstract) (pdf)
We develop an agent-based model which implements affect regulation as the motivation for decision-making in individual affective agents within a simulated social environment. The model design applies the life-mind continuity thesis to the field of social simulation. Agents are fully coupled to the environment through perception and action via core affect regulation. This makes affect the center of the regulation of the coupling between the agent and the environment. The model is used to generate the emergence of zones of cooperation in the Demographic Prisoner's Dilemma (DPD) paradigm. This model could provide a more biologically and cognitively plausible explanation for the dynamics of cooperation within the DPD. Although the model abstracts away from self-production and identity generation, it is sufficient to generate these dynamics and provides building blocks for more complex models of human behavior based on affect self-regulation and self-production, to gain a deeper understanding of the behavioral foundations of the emergence of social dynamics.

Andringa, T. C., & Lanser, J. J. L. (2011). Sound annoyance as loss of options for viability self-regulation. Proceedings of the 10th International Congress on Noise as a Public Health Problem (ICBEN) (pp. 898-905). London. (abstract) (pdf)
Sound annoyance is still ill-defined as a scientific concept. In contrast, as a common-sense concept, everyone is able to indicate which sounds are annoying and why they annoy. This paper has a single objective: it aims to couple a number of theoretical concepts to the breadth of responses to the question "Did you lose/gain something in terms of quality of life when the disturbing sound appeared in your life? If so, could you please describe?" These answers were given in an online questionnaire targeted at sound-annoyed persons. At the time of writing the questionnaire was still open and 179 respondents had answered. However, the pattern of these answers matched our theoretical expectations, which were based on the premise that for humans the sound of some sources can interfere with life's most basic requirement: the need to remain viable. Lindvall & Radford (1973) proposed that "Annoyance may be defined as a feeling of displeasure associated with any agent or condition known or believed by an individual or a group to be adversely affecting them" (Berglund et al. 1994). This paper proposes a more precise definition of the "adverse effect": namely, making it more difficult to self-regulate viability. It starts with an outline of a number of theoretical "ingredients" and their relation to sound annoyance. These ingredients are used to generate a preliminary (sub-)categorization of possible responses. The method section addresses some issues related to the interpretation of the actual responses. The paper ends with a short analysis of the match between the expected answers and the sub-categorization, and concluding remarks.

Andringa, T. C., & Lanser, J. J. L. (2011). Towards causality in sound annoyance. In Proceedings of Internoise 2011, Osaka, Japan. (abstract) (pdf)
The causal relations between being exposed to sound and sound annoyance are still unknown. This paper presents a conceptual model and investigates whether the explanatory scope of this model matches the different response categories of sound annoyed people when they are asked "Can you explain (briefly) why you choose [some source type] to be the most annoying sound?" or "In what activities do you participate to reduce the annoyance of the disturbance". The model comprises four attentional states (sleep, direct perception, directed attention, and directed attention with a strong sensory distractor), which become progressively more effortful and less restorative. The health effects arise from being forced to be in the least restorative and most effortful attentional state. The attentional states are activated by an Attention Control and Motivation system that in turn is influenced by a number of broad and imprecise influences that may be subtly different for different annoying sources. For example a difference between a continuous and a frequent source was matched by the respondent's answers. This suggests that the model has sufficient explanatory power to be used in conjunction with these answers.

Valkenier, B., Krijnders, J.D., van Elburg, R.A.J., & Andringa, T.C. (2011). Psycho-acoustically motivated formant feature extraction. NEALT Proceedings Series, Vol. 11. (abstract) (pdf)
Psycho-acoustical research investigates how human listeners are able to separate sounds that stem from different sources. This ability might be one of the reasons that human speech processing is robust to noise, but methods that exploit this are, to our knowledge, not used in systems for automatic formant extraction or in modern speech recognition systems. Therefore we investigate the possibility of using harmonics that are consistent with a harmonic complex as the basis for a robust formant extraction algorithm. With this new method we aim to overcome limitations of most modern automatic speech recognition systems by taking advantage of the robustness of harmonics at formant positions. We tested the effectiveness of our formant detection algorithm on Hillenbrand's annotated American English Vowels dataset and found that in pink noise the results are competitive with existing systems. Furthermore, our method needs no training and is implementable as a real-time system, in contrast to many of the existing systems.

Botteldooren, D., Lavandier, C., Preis, A., Dubois, D., Aspuru, I., Guastavino, C., Brown, L., Nilsson, M., & Andringa, T.C. (2011). Understanding urban and natural soundscapes. Forum Acusticum 2011. (abstract) (pdf)
The concept of soundscape has garnered increasing research attention over the last decade for studying and designing the sonic environment of public spaces. It is therefore critical to advance knowledge on how the soundscape of a place is evoked by its sonic environment, given visual, cultural, and situational contexts. Working Group 1 of the COST action "Soundscapes of European cities and landscapes" revolves around this question. In our current understanding, the sounds that are heard during normal activities in a place trigger meaning and emotions based on the match with the expectations of the people using and acting in that place. This complete package of human experience in relation to the sonic environment can be named the soundscape. In terms of design, this understanding opens several opportunities. The designer can decide which sounds should be heard and try to make this happen by guiding attention to particular sounds, or by simply removing, adding, or shaping sounds. In doing so, he or she should keep in mind the expectations of the local users. Expectations and meaning might be changed by suitable design of non-sonic features of the environment, including, besides the obvious visual context, openness, lighting, local climate, etc. Bringing these concepts to practice requires new tools and methodologies.

T.P. Schmidt, M.A. Wiering, A.C. van Rossum, R.A.J. van Elburg, T.C. Andringa, B. Valkenier. Robust Real-Time Vowel Classification with an Echo State Network, Workshop on "Cognitive and neural models for automated processing of speech and text" 2010 (CONAS).(abstract) (pdf)
In the field of reservoir computing, echo state networks (ESNs) and liquid state machines (LSMs) are the most commonly used networks. Comparative studies on these reservoirs identify the LSM as the network that yields the highest performance for speech recognition. But LSMs are not always usable in a real-time setting due to the computational costs of a large reservoir with spiking neurons. In this paper a vowel classification system is presented which consists of an ESN that processes cochlear-filtered audio. The performance of the system is tested on a vowel classification task using different signal-to-noise ratios (SNRs). The usefulness of this method is measured by comparing it to formant-based vowel classification systems. Results show that this ESN-based system achieves performance similar to formant-based vowel classification systems on the clean dataset with only a small reservoir, and even outperforms these methods on the noisy dataset.
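For readers unfamiliar with reservoir computing, the core of an ESN can be sketched in a few lines: a fixed random recurrent network, rescaled so its spectral radius stays below one, produces nonlinear state trajectories, and only a linear readout is trained. This is a generic illustrative sketch, not the system described above; the reservoir size, weight ranges, and ridge parameter are arbitrary choices.

```python
import numpy as np

def esn_states(inputs, n_reservoir=50, spectral_radius=0.9, seed=0):
    """Drive a fixed random reservoir with an input sequence.

    inputs: array of shape (T, n_in); returns states of shape (T, n_reservoir).
    """
    rng = np.random.default_rng(seed)
    n_in = inputs.shape[1]
    W_in = rng.uniform(-0.5, 0.5, (n_reservoir, n_in))
    W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
    # Rescale so the largest eigenvalue magnitude equals spectral_radius,
    # a common way to obtain the echo state (fading-memory) property.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    x = np.zeros(n_reservoir)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)  # leakless state update
        states.append(x)
    return np.array(states)

def train_readout(states, targets, ridge=1e-6):
    """Ridge-regression readout: the only trained part of an ESN."""
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)
```

In the paper's setting the inputs would be cochlear-filtered audio frames and the targets vowel labels; the sketch accepts any (T, n_in) sequence.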

Andringa, T. C. (2010). Soundscape And Core Affect Regulation. Proceedings of Interspeech 2010. (abstract) (pdf)
Why do people make detours through parks? Or, put differently: why is being exposed to the stimuli in a park pleasant and restorative? Or, even more generally: why may sounds influence behavior? The answer to these questions may lie in the interplay of core phenomena of cognitive science such as: the processes of hearing and listening, different forms of attention, meaning giving and the associated effortful and less effortful mental states, core affect regulation, basic emotions, viability and health, and the restoration of the capacity for directed attention. These all contribute in predictable ways to how humans respond to sound (and, in general, to the environment). The hearing process may deem part of the sound salient enough to be analyzed in full by the listening process, which gives meaning to the input through the activation of behavioral options. The selected behavioral options must comply with the demand that they preserve viability and help to regulate core affect (defined as the combination of perceived viability and resource allocation). Any response strategy is influenced by perception-activated basic emotions. Angry behavior is activated when sounds hinder goal achievement. The restorativeness of parks relies on perceptual fascination through the involuntary attention-capturing of pleasant stimuli in combination with fairly simple perception-action responses. These mental states allow the systems for complex reasoning through directed attention time to return to normal values. Soundscape design should focus on allowing people sufficient opportunities for core affect self-regulation through the creation of fascinating "attractors".

Andringa, T. C. (2010). Smart, general, and situation specific sensors. Proceedings of the workshop on “Smarter sensors, easier processing” associated with the 11th International Conference on the Simulation of Adaptive behavior (SAB2010). (abstract) (pdf)
The need for central processing in animals and animats can be reduced with smart sensors that reliably indicate the presence of behaviorally relevant events and environmental affordances. Another, equally important, reduction can be realized through a smart choice of what needs to be sensed and how much detail the task requires. The combination of smart processing and ‘expectation management’ accounts for some of the core properties of cognition and might be used to understand animal perception and to design improved animat sensors. In particular the combination couples gist, sensorimotor, and global workspace theories of perception and behavior selection to smart sensors in animals and animats.

Krijnders, J.D., Niessen, M.E., & Andringa, T.C. (2010). Sound event recognition through expectancy-based evaluation of signal-driven hypotheses. Pattern Recognition Letters, 31, 1552-1559. (abstract) (pdf)
A recognition system for environmental sounds is presented. Signal-driven classification is performed by applying machine-learning techniques on features extracted from a cochleogram. These possibly unreliable classifications are improved by creating expectancies of sound events based on context information.

Krijnders, J. D. (2010). Signal-driven sound processing for uncontrolled environments. PhD Thesis of the University of Groningen, 1-144. (supervised) (pdf)

Krijnders, J. D., & Andringa, T. C. (2010). Differences between annotating a soundscape live and annotating behind a screen. In Proceedings of Internoise 2010, Lisbon, Portugal. (abstract) (pdf)
What sensory information do we use when we are asked to listen? To answer this question we asked participants to annotate real-world city sounds under several conditions. In the first two conditions participants were present and annotated during the recording. In the first condition the participants could see the environment; in the second condition, they sat behind a screen that blocked their view. The first condition corresponds to a normal situation for humans.

Niessen, M.E. (2010). Context-Based Sound Event Recognition. PhD Thesis of the University of Groningen, 1-144. (supervised) (pdf)

T.C. Andringa (2010). Audition: from sound to sounds. In: Machine Audition: Principles, Algorithms and Systems (W. Wang, ed.), Chapter 4, pp. 80-105. (abstract) (pdf)
This chapter addresses the functional requirements of auditory systems, both natural and artificial, to be able to deal with the complexities of uncontrolled real-world input. The demand to function in uncontrolled environments has severe implications for machine audition. The natural system has addressed this demand by adapting its function flexibly to changing task demands. Attentional processes and the concept of perceptual gist play an important role in this. Hearing and listening are seen as complementary processes. The process of hearing detects the existence and general character of the environment and its main and most salient sources. In combination with task demands these processes allow the pre-activation of knowledge about expected sources and their properties. Consecutive listening phases, in which the relevant subsets of the signal are analyzed, allow the level of detail required by task- and system-demands. This form of processing requires a signal representation that can be reasoned about. A representation based on source physics is suitable and has the advantage of being situation independent. The demand to determine physical source properties from the signal imposes restrictions on the signal processing. When these restrictions are not met, systems are limited to controlled domains. Novel signal representations are needed to couple the information in the signal to knowledge about the sources in the signal.

J.D. Krijnders, M. van Grootel, T.C. Andringa (2009). Research database for everyday listening. Accepted for oral presentation at NAG/DAGA 2009, Rotterdam, in Proceedings of NAG/DAGA 2009, pp. 996-999, 346. (abstract) (pdf)
This paper describes the design process and content of a publicly available database for everyday listening. While databases for musical sounds and studio recordings are readily available, databases for everyday sounds are lacking. Such a database is essential as more and more research is aimed at environmental sounds and everyday listening. The current lack of a standardized collection of properly recorded and annotated sounds in realistic acoustic environments limits the comparability of results between research groups, and hampers interdisciplinary cooperation. The collection described in this study is intended to help solve these issues. The collection currently consists of approximately 120 fragments of 60-second recordings of real-life situations. The use of head-mounted binaural microphones ensures a realistic capture of the environment for headphone playback. The recordings differ in location (e.g. indoor, rural, urban) and activity (e.g. taking the bus, dish washing, typing). Detailed annotations, provided for all recordings, describe the environment of the recording as well as the sources and onset-offset information of each of the sounds present. We aim to expand and improve the collection with more recordings and more specific information. We hope that many researchers in the auditory field will benefit from this database.

M.E. Niessen, J.D. Krijnders, T.C. Andringa (2009). Understanding a soundscape through its components. Accepted for oral presentation at Euronoise 2009, Edinburgh, Great Britain (abstract) (pdf)
Human perception of sound in real environments is a complex fusion of many factors, which are investigated by diverse research fields. Most methods to assess and improve sonic environments (soundscapes) take a holistic approach. For example, in experimental psychology, subjective measurements usually involve the evaluation of a complete soundscape by human listeners, mostly through questionnaires. In contrast, psychoacoustic measurements try to capture low-level perceptual attributes in quantities such as loudness. However, these two types of soundscape measurements are difficult to link other than with correlational measures. We propose a method inspired by cognitive research to improve our understanding of the link between acoustic events and human soundscape perception. Human listeners process sound as meaningful events. Therefore, we developed a model to identify components in a soundscape that are the basis of these meaningful events. First, we select structures from the sound signal that are likely to stem from a single source. Subsequently, we use a model of human memory to predict the location at which a sound is recorded, and to identify the most likely events that constitute the components in the sound, given the location.

J.D. Krijnders, T.C. Andringa (2009). Soundscape annotation and environmental source recognition experiments in Assen (NL). In: Inter-noise 2009, Ottawa, Canada. (abstract) (pdf)
We describe a soundscape annotation tool for unconstrained environmental sounds. The tool segments the time-frequency-plane into regions that are likely to stem from a single source. A human annotator can classify these regions. The system learns to suggest possible annotations and presents these for acceptance. Accepted or corrected annotations will be used to improve the classification further. Automatic annotations with a very high probability of being correct might be accepted by default. This speeds up the annotation process and makes it possible to annotate complex soundscapes both quickly and in considerable detail. We performed a pilot on recordings of uncontrolled soundscapes at locations in Assen (NL) made in the early spring of 2009. Initial results show that the system is able to propose the correct class in 75% of the cases.

R. R. Violanda, H. van de Vooren, R.A.J. van Elburg, T.C. Andringa (2009). Signal Component Estimation in Background Noise. In: Proceedings of NAG/DAGA 2009, 347, pp. 1588-1591. (abstract) (pdf)
Auditory scenes in a natural environment typically consist of multiple sound sources. From this mixture the human auditory system can segregate and identify individual audible sources with ease, even if the received signals are distorted due to noise and transmission effects [1]. Systems for computational auditory scene analysis (CASA) try to achieve human performance by decomposing a sound into coherent units and then grouping these to represent the individual sound sources in the scene [2-4]. Some CASA systems aim to estimate masks that enclose all regions dominated by a single sound source in the time-frequency (TF) plane [2]. This task becomes increasingly difficult when high levels of noise are present.
The aim underlying our approach to CASA is to estimate the physical properties of the source which give rise to its signal components. We define a signal component (SC) as a connected trajectory in the TF plane with a positive local signal-to-noise ratio (SNR). Ideally, if the sound stemming from a source is represented in the TF plane, the source’s characteristics carried by the sound are reflected as a pattern of signal components [5]. Hence, by correctly tracking and grouping the signal components, it is possible to retrieve the physical properties that shaped the source signal.
In this paper, we present two methods of estimating signal components. The first method uses both the energy and phase information derived from the time-frequency representation of the signal. The second method uses solely the energy of the signal. We compare the effectiveness of the two methods in the extraction of target sound signals from noisy backgrounds. Our results show the importance of the often neglected phase information in TF analysis.
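The notion of a signal component as a connected time-frequency trajectory with positive local SNR can be illustrated with a small sketch: threshold a TF energy map against a crude noise-floor estimate and label 4-connected regions. This is a toy illustration under invented assumptions (the per-band median as noise estimate, a simple flood fill), not one of the two methods compared in the paper.

```python
import numpy as np

def signal_components(tf_energy, noise_floor_db=0.0):
    """Label connected time-frequency regions whose energy exceeds an
    estimated noise floor (a stand-in for 'positive local SNR').

    tf_energy: 2-D array (frequency x time) of energies in dB.
    Returns an int array of the same shape: 0 = background, 1..N = component id.
    """
    # Crude per-frequency-band noise estimate: the median over time.
    noise = np.median(tf_energy, axis=1, keepdims=True) + noise_floor_db
    mask = tf_energy > noise
    labels = np.zeros(tf_energy.shape, dtype=int)
    current = 0
    for f in range(mask.shape[0]):
        for t in range(mask.shape[1]):
            if mask[f, t] and labels[f, t] == 0:
                current += 1
                stack = [(f, t)]  # iterative flood fill, 4-connected
                while stack:
                    i, j = stack.pop()
                    if (0 <= i < mask.shape[0] and 0 <= j < mask.shape[1]
                            and mask[i, j] and labels[i, j] == 0):
                        labels[i, j] = current
                        stack += [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
    return labels
```

Each labeled region is a candidate signal component whose shape in the TF plane can then be related to the physics of its source.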

M.E. Niessen, G. Kootstra, S. De Jong, T.C. Andringa (2009). Expectancy-based robot navigation through context evaluation. Proceedings of the 2009 International Conference on Artificial Intelligence, Las Vegas, NV, USA, pp 371-377. ISBN 1-60132-107-4 (abstract) (pdf)
Agents that operate in a real-world environment have to process an abundance of information, which may be ambiguous or noisy. We present a method inspired by cognitive research that keeps track of sensory information, and interprets it with knowledge of the context. We test this model on visual information from the real-world environment of a mobile robot in order to improve its self-localization. We use a topological map to represent the environment, which is an abstract representation of distinct places and the connections between them. Expectancies of the place of the robot on the map are combined with evidence from observations to reach the best prediction of the next place of the robot. These expectancies make a place prediction more robust to ambiguous and noisy observations. Results of the model operating on data gathered by a mobile robot confirm that context evaluation improves localization compared to a data-driven model.
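The expectancy mechanism described above amounts to a Bayes-filter step on a topological map: propagate the previous belief through the place-to-place connections, then weight it by the observation likelihood. A minimal sketch, with the transition matrix and likelihood values as invented placeholders:

```python
import numpy as np

def localize(adjacency, likelihoods, prior):
    """One step of expectancy-based place estimation on a topological map.

    adjacency: (N, N) row-stochastic transition matrix between places.
    likelihoods: length-N observation likelihood P(observation | place).
    prior: length-N belief over places from the previous step.
    Returns the updated (normalized) belief.
    """
    expectancy = adjacency.T @ prior      # predicted place distribution
    posterior = expectancy * likelihoods  # weight prediction by evidence
    return posterior / posterior.sum()
```

With an observation that fits two distant places equally well, the propagated expectancy breaks the tie in favor of the place reachable from the previous belief, which is how context makes localization robust to ambiguous observations.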

B. Valkenier, J.D. Krijnders, R.A.J. van Elburg, T.C. Andringa (2009). Robust Vowel Detection. Accepted for oral presentation at NAG/DAGA 2009, Rotterdam, in Proceedings of NAG/DAGA 2009, pp. 1306-1309, 303. (abstract) (pdf)
Inspired by the robustness of human speech processing, we aim to overcome the limitations of most modern automatic speech recognition methods by applying general knowledge of human speech processing. As a first step we present a vowel detection method that uses features of sound believed to be important for human auditory processing, such as fundamental frequency range, possible formant positions, and minimal vowel duration. To achieve this we took the following steps: we identify high-energy cochleogram regions of suitable shape and sufficient size, extract possible harmonic complexes, complement them with less reliable signal components, determine local formant positions by interpolating between peaks in the harmonic complex, and finally keep formants of sufficient duration. We show the effectiveness of our method of formant detection by applying it to the Hillenbrand dataset in clean conditions as well as in noisy and reverberant conditions. In all three conditions the extracted formant positions agree well with Hillenbrand's findings. Thereby we showed that, in contrast to many modern automatic speech recognition methods, our results are robust to considerable levels of noise and reverberation.

Niessen, M.E., Van Maanen, L., Andringa, T.C. (2008). Disambiguating sound through context. International Journal of Semantic Computing 2(3), 327-341. (abstract) (pdf) Electronic version of an article published as doi:10.1142/S1793351X08000506
A central problem in automatic sound recognition is the mapping between low-level audio features and the meaningful content of an auditory scene. We propose a dynamic network model to perform this mapping. In acoustics, much research is devoted to low-level perceptual abilities such as audio feature extraction and grouping, which are translated into successful signal processing techniques. However, little work is done on modeling knowledge and context in sound recognition, although this information is necessary to identify a sound event rather than to separate its components from a scene. We first investigate the role of context in human sound identification in a simple experiment. Then we show that the use of knowledge in a dynamic network model can improve automatic sound identification by reducing the search space of the low-level audio features. Furthermore, context information dissolves ambiguities that arise from multiple interpretations of one sound event.

Krijnders, J.D., Andringa, T.C. (2008). Demonstration of online auditory scene analysis. Accepted for BNAIC 2008. (abstract) (pdf)
We show an online system for auditory scene analysis. Auditory scene analysis is the analysis of complex sounds as they occur in natural settings. The analysis is based on models of human auditory processing. Our system includes models of the human ear, an analysis into tones and pulses, and grouping algorithms. This system forms the basis for several sound recognition tasks, such as aggression detection and vowel recognition.

Niessen, M.E., Van Maanen, L., Andringa, T.C. (2008). Disambiguating sounds through context. Proceedings of IEEE ICSC, 88-95. (abstract) (pdf)
A central problem in automatic sound recognition is the mapping between low-level audio features and the meaningful content of an auditory scene. We propose a dynamic network model to perform this mapping. In acoustics, much research has been devoted to low-level perceptual abilities such as audio feature extraction and grouping, which have been translated into successful signal processing techniques. However, little work is done on modeling knowledge and context in sound recognition, although this information is necessary to identify a sound event rather than to separate its components from a scene. We first investigate the role of context in human sound identification in a simple experiment. Then we show that the use of knowledge in a dynamic network model can improve automatic sound identification, by reducing the search space of the low-level audio features. Furthermore, context information dissolves ambiguities that arise from multiple interpretations of one sound event.

Vooren, H. van de, Violanda, R.R., Andringa, T.C. (2008). Robust harmonic grouping by octave error correction. Acoustics'08, Paris. (abstract)
Harmonic grouping is a frequently applied technique in computational auditory scene analysis and automatic speech recognition systems. However, grouping is easily disrupted by noise and reverberation. For instance, a noise induced signal component positioned roughly between two harmonics, might undesirably be assigned to the harmonic complex (HC) as well. This results in an octave error: harmonics in an HC are assigned to harmonic numbers twice as high as the correct values. We propose a cost function based method to correct these octave errors. This function is designed to, on the one hand, improve the balance between odd and even harmonic numbers, and, on the other hand, minimize the amount of signal components to be rejected. As a preprocessing step we applied short-time Fourier analysis to derive an instantaneous frequency representation from which we obtained the signal components. We used these as input for our harmonic grouping algorithm to obtain the HCs. Then we selected the optimal solution from the cost function and modified the composition of the HCs accordingly. As long as enough harmonics are sufficiently above the local noise level, this octave error correction mechanism works well for various sorts of harmonic sounds including speech.
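The trade-off the abstract describes (improve the odd/even harmonic-number balance versus reject as few signal components as possible) can be sketched as a toy cost comparison. The weights and the specific cost terms below are invented for illustration and are not the paper's actual cost function.

```python
def octave_cost(harmonic_numbers, imbalance_weight=1.0, rejection_weight=1.0):
    """Toy cost for an octave-error hypothesis (weights are invented).

    harmonic_numbers: harmonic indices assigned under the current f0.
    Returns (cost_keep, cost_halve): the cost of keeping f0 versus doubling
    it, which halves the even harmonic numbers and rejects the odd ones.
    """
    odd = sum(1 for h in harmonic_numbers if h % 2)
    even = len(harmonic_numbers) - odd
    # Keeping f0: pay for the odd/even imbalance (a true harmonic complex
    # with the correct f0 should contain both odd and even harmonics).
    cost_keep = imbalance_weight * abs(odd - even)
    # Doubling f0: the imbalance disappears, but the odd-numbered
    # components no longer fit and must be rejected.
    cost_halve = rejection_weight * odd
    return cost_keep, cost_halve

def correct_octave(harmonic_numbers):
    """Halve harmonic numbers when the octave-error hypothesis is cheaper."""
    cost_keep, cost_halve = octave_cost(harmonic_numbers)
    if cost_halve < cost_keep:
        return [h // 2 for h in harmonic_numbers if h % 2 == 0]
    return list(harmonic_numbers)
```

An all-even assignment such as [2, 4, 6, 8] is the signature of an octave error and is cheaply repaired by halving, whereas a balanced assignment is left alone.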

Andringa, T.C. (2008). The texture of natural sounds. Acoustics'08 Paris. (abstract)
The texture of a sound source, a spectro-temporal pattern, is a robust and characteristic perceptual property that listeners use for sound source recognition. The robustness of the texture ensures that we can recognize sound sources like a helicopter, flowing water, or surf breaking on pebbles in a wide variety of acoustic environments. This robustness suggests that textures can be used for automatic source identification or environment classification. We introduce a method to determine the presence of sound textures associated with, for example, flat noise, pulsed noises (helicopter), sweep-based textures (running water), and tonal noises (babble). The cumulative probability density of time-frequency fluctuations is matched against prototypical cumulative probability density functions (cpdfs) with a running variant of the Kolmogorov-Smirnov test. Textures with a distribution similar to the target distribution contribute approximately equally to all values of the cpdf; the flatness of this distribution is used as a distance measure. When the pdfs of the target textures do not overlap strongly, the method can determine the texture of time-frequency regions as small as 100 ms by 6 semitones. The method can therefore also be used to determine the texture of the background.
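The flatness idea can be sketched in a few lines of Python. This is an assumed minimal variant, not the paper's implementation: patch values are mapped through the prototype cpdf, and if the patch follows the prototype distribution the mapped values are uniform on [0, 1]; the maximum deviation from uniformity (a Kolmogorov-Smirnov statistic) serves as the distance:

```python
import numpy as np

def texture_distance(patch, proto_values, proto_cdf):
    """KS-style distance between a time-frequency patch and a prototype
    texture given by its empirical cpdf (proto_values sorted ascending,
    proto_cdf the matching cumulative probabilities)."""
    # Probability-integral transform: same distribution -> uniform result.
    u = np.interp(np.sort(patch.ravel()), proto_values, proto_cdf)
    # Empirical CDF of the mapped values; uniform => straight line 0..1.
    ecdf = np.arange(1, u.size + 1) / u.size
    return float(np.max(np.abs(u - ecdf)))  # deviation from flatness
```

With a Gaussian prototype, a Gaussian patch yields a small distance while a patch drawn from a shifted uniform distribution yields a large one.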

Krijnders, J.D., Niessen, M.E., Andringa, T.C. (2008). A grouping approach to harmonic complexes. Acoustics'08 Paris. (abstract)
Humans seem to perform sound-source separation for quasi-periodic sounds, such as speech, mostly on harmonicity cues. To model this function, most machine algorithms use a pitch-based approach to group the speech parts of the spectrum. In these methods the pitch is obtained either explicitly, as in autocorrelation methods, or implicitly, as in harmonic sieves. If the pitch estimate is wrong, the grouping will fail as well. In this paper we present a method that performs harmonic grouping without first calculating the pitch; instead, a pitch estimate is associated with each grouping hypothesis. Making the grouping independent of a prior pitch estimate makes it more robust in noisy settings. The algorithm obtains possible harmonics by tracking energy peaks in a cochleogram. Co-occurring harmonics are compared in terms of frequency difference, and grouping hypotheses are formed by combining harmonics with similar frequency differences. Consistency checks are performed on these hypotheses, and hypotheses with compatible properties are combined into harmonic complexes. Every harmonic complex is evaluated on the number of harmonics, the number of subsequent harmonics, and the presence of a harmonic at the pitch position. Using the number of subsequent harmonics prevents octave errors. Multiple concurrent harmonic complexes can be found as long as the spectral overlap is small.
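The core idea, grouping by shared spacing rather than by a prior pitch estimate, can be illustrated with a small Python sketch (the function name, data layout, and tolerance are invented for this illustration, not taken from the paper):

```python
import itertools

def spacing_hypotheses(peak_freqs, rel_tol=0.03):
    """Group co-occurring spectral peaks into harmonic-complex hypotheses.

    Pairwise frequency differences between peaks are clustered; each
    cluster of similar differences forms one grouping hypothesis, and the
    cluster's mean spacing doubles as that hypothesis's pitch estimate,
    so no pitch has to be detected beforehand.
    """
    diffs = sorted(b - a for a, b in itertools.combinations(sorted(peak_freqs), 2))
    hypotheses = []
    for d in diffs:
        for h in hypotheses:
            if abs(d - h["pitch"]) <= rel_tol * h["pitch"]:
                h["spacings"].append(d)
                h["pitch"] = sum(h["spacings"]) / len(h["spacings"])
                break
        else:
            hypotheses.append({"pitch": d, "spacings": [d]})
    return hypotheses
```

For peaks at 100, 200, 300, and 400 Hz, the best-supported hypothesis collects the three 100 Hz spacings and carries a pitch estimate of 100 Hz.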

Niessen, M.E., Van Elburg, R.A.J., Krijnders, J.D., Andringa, T.C. (2008). A computational model for auditory scene analysis. Acoustics'08 Paris. (abstract)
Primitive auditory scene analysis (ASA) is based on intrinsic properties of the auditory environment. Acoustic features such as continuity and proximity in time or frequency cause perceptual grouping of acoustic elements. Various grouping attributes have been translated into successful signal processing techniques that may be used in source separation. A next step beyond primitive ASA is source identification through schema-based ASA. We present a computational model for ASA that is inspired by models from cognitive research. It dynamically builds a hierarchical network of hypotheses, which is based on (learned) knowledge of the sources. Each hypothesis in the network, initiated by bottom-up evidence, represents a possible sound event. The network is updated for each new input event, which may be any sound in an unconstrained environment. The analysis of new input events is guided by knowledge of the environment and previous events. As a result of this adaptive behavior, information about the environment increases and the set of possible hypotheses decreases. With this method of continuously improving sound event identification we make a promising advance in computational ASA of complex real-world environments.

Andringa, T.C., Grootel, M. van (2007). Predicting listeners' reports to environmental sounds. Proceedings of ICA 2007, ENV-09-005-IP. (abstract) (pdf)
Spontaneous verbal descriptions of environmental sounds lead to a description of the contributing sound sources and the environments in which they occur. This is a form of perception that relies crucially on the rich structure of sounds, because only rich sounds can convey detailed information about individual sources and the transmission environment. This paper uses a semantic network with connection strengths derived from listener reports to represent the content of auditory scenes. The activity of the semantic network is based on a number of source-specific cues for sounds such as birds, vehicles, speech, and footsteps. These cues are not based on spectral envelope and level, but on patterns in tones, pulses, and noises that capture source-specific structures. Generally, the system performed similarly to a human listener in terms of the concepts activated (or named) and the choice of acoustic environment. The robustness and performance suggest that the combination of a semantic network and source-specific cues can be used to design systems for sound-based ambient awareness.
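The abstract does not specify the network's update rule; as a rough sketch of the general mechanism, cue activity can be spread into concept nodes through a weight matrix of connection strengths (all names, the leaky update, and the example weights below are assumptions for illustration):

```python
import numpy as np

def activate_concepts(W, cue_activity, steps=5, decay=0.6):
    """Leaky spreading of acoustic-cue activity into concept nodes.

    W[i, j] is the connection strength from cue j (e.g. a tonal bird-like
    pattern) to concept i (e.g. "park"), as might be derived from listener
    reports. Repeated leaky updates let strongly connected concepts
    accumulate the most activity.
    """
    a = np.zeros(W.shape[0])
    for _ in range(steps):
        a = decay * a + (1.0 - decay) * (W @ cue_activity)
    return a
```

With concepts [park, street], cues [bird, engine], and a bird-dominated input, the park concept ends up with the higher activity.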

Zajdel, W., Krijnders, J.D., Andringa, T.C., Gavrila, D.M. (2007) Audio-video sensor fusion for aggression detection. Proceedings of AVSS 2007. Best paper award. (abstract) (pdf)
This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is its exploitation of the complementary nature of audio and video sensing to disambiguate scene activity in real-life, noisy, and dynamic environments. At the lower level, independent analysis of the audio and video streams yields intermediate scene descriptors such as "scream", "passing train", or "articulation energy". At the higher level, a Dynamic Bayesian Network is used as a fusion mechanism that produces an aggregate aggression indication for the current scene. Our prototype system was validated on a set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.

Hengel, P.W.J. van, Andringa, T.C. (2007). Verbal aggression detection in complex social environments. Proceedings of AVSS 2007. (abstract) (pdf)
The paper presents a knowledge-based system designed to detect evidence of aggression by means of audio analysis. The detection is based on the way sounds are analyzed and how they attract attention in the human auditory system. The performance achieved is comparable to human performance in complex social environments. The SIgard system has been deployed in a number of different real-life situations and was tested extensively in the inner city of Groningen. Experienced police observers have annotated ~1400 recordings with various degrees of shouting, which were used for optimization. All essential events and a small number of nonessential aggressive events were detected. The system produces only a few false alarms (non-shouts) per microphone per year and misses no incidents. This makes it the first successful detection system for a non-trivial target in an unconstrained environment.

Niessen, M.E., Krijnders, J.D., Boers, J., Andringa, T.C. (2007). Assessing the reverberation level in speech. Proceedings of ICA 2007, CAS-03-020. (abstract) (pdf)
The performance of automatic speech recognition (ASR) systems is seriously degraded in reverberant environments. We propose a method for assessing the reverberation level in speech that makes it possible to determine in real time whether a speech signal is reverberant or not. Reverberation causes an increase in the variation of the energy and frequency of harmonics in speech. Speech with a variable pitch is especially affected by reverberation. To capture the effect of reverberation we measured six features on the harmonics of the speech signal, which represent energy and frequency variation in different ways. Speech from the Aurora database was artificially reverberated to demonstrate the validity of these features. Each feature predicted reverberation time for a different subset of the dataset. To test the overall separability of the speech samples using these features, speech from the dataset was automatically classified as being either inside or outside the reverberation radius. Most of the speech was correctly classified, which suggests that a reliable real-time classification algorithm can be developed to select good-quality speech. This algorithm can improve pre-processing methods, such as speech enhancement or voice activity detection, for more robust ASR.
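The six features themselves are not listed in the abstract; the Python sketch below computes two variation measures in the same spirit (names and formulas assumed here, not taken from the paper): reverberation blurs a harmonic's frequency track and smears its energy over time, so both measures tend to increase for reverberant speech:

```python
import numpy as np

def variation_features(freq_track, energy_track):
    """Frame-to-frame variation of one tracked harmonic.

    Returns the standard deviation of the frequency increments and of the
    log-energy increments; both are expected to grow with reverberation.
    """
    freqs = np.asarray(freq_track, dtype=float)
    energies = np.asarray(energy_track, dtype=float)
    return {
        "freq_variation": float(np.std(np.diff(freqs))),
        "energy_variation": float(np.std(np.diff(np.log(energies + 1e-12)))),
    }
```

A smooth synthetic track scores near zero on both measures, while the same track with added jitter scores higher, which is the separation a classifier could exploit.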

Krijnders, J.D., Niessen, M.E., Andringa, T.C. (2007). Robust harmonic complex estimation in noise. Proceedings of ICA 2007, CAS-03-019. Best poster award at the BCN research school spring meeting. (abstract) (pdf) (poster)
We present a new approach to robust pitch tracking of speech in noise. We transform the audio signal to the time-frequency domain using a transmission-line model of the human ear. The model output is leaky-integrated, resulting in an energy measure called a cochleogram. To select the speech parts of the cochleogram, a correlogram based on a running autocorrelation per channel is used. In the correlogram, possible pitch hypotheses are tracked through time, and for every track a salience measure is calculated. The salience determines the importance of a track, which makes it possible to select tracks that belong to the signal while ignoring those of noise. The algorithm is shown to be more robust to the addition of pink noise than standard autocorrelation-based methods. An important property of the grouping is the selection of parts of the time-frequency plane that are highly likely to belong to the same source, which is impossible with conventional autocorrelation-based methods.
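The correlogram stage can be sketched directly: an autocorrelation per cochleogram channel over a range of lags, where a peak at a shared lag across channels marks a pitch period (this is a minimal textbook version, not the paper's implementation, and the windowing of the running variant is omitted):

```python
import numpy as np

def correlogram(cochleogram, max_lag):
    """Autocorrelation per frequency channel of a (channels x time)
    energy-like matrix. out[c, k] is the autocorrelation of channel c at
    lag k + 1; peaks at a common lag across channels indicate a pitch
    period and so yield pitch hypotheses."""
    chans, T = cochleogram.shape
    out = np.zeros((chans, max_lag))
    for c in range(chans):
        x = cochleogram[c] - cochleogram[c].mean()  # remove DC per channel
        for lag in range(1, max_lag + 1):
            out[c, lag - 1] = np.dot(x[:-lag], x[lag:]) / (T - lag)
    return out
```

For a channel oscillating with a period of 10 samples, the correlogram row peaks at lag 10; a channel with period 12 peaks at lag 12.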

Van Opzeeland, I., Slabbekoorn, H., Andringa, T.C., Ten Cate, C. (2007). Underwater racket: fish and noise pollution. De Levende Natuur 108, 39-43. (abstract) (pdf)
Since the introduction of the Wet op de Verontreiniging van Oppervlaktewateren (Surface Water Pollution Act), the quality of surface water in the Netherlands has improved considerably. Through the Kaderrichtlijn Water (Water Framework Directive), better water quality is now also being pursued through international cooperation. However, an important aspect of water that has been neglected to date is the underwater sound level resulting from the strong increase in shipping, recreation, and construction activities in and around the Dutch surface waters. Almost nothing is known so far about the impact of this noise on underwater habitat quality.

Van Opzeeland, I., Slabbekoorn, H., Andringa, T., Ten Cate, C. (2007). Underwater noise pollution: effects on fishes. Visionair 4, 11-13. (abstract) (pdf)
Almost every angler knows it: make too much noise at the waterside and the fish take off. Despite this piece of fishing wisdom, many people do not realize how important sound is to fish. Human activity on and around the surface waters of the Netherlands is increasing strongly, and the underwater noise load will have increased accordingly. Little is known about the effects of ambient sound on fish, which is why Leiden University devoted an extensive literature survey to the subject.

Andringa, T.C., Niessen, M.E. (2006). Real-world sound recognition: A recipe. Proceedings of LSAS 2006, 106-118. (abstract) (pdf)
This article addresses the problem of recognizing acoustic events present in unconstrained input. We propose a novel approach to the processing of audio data that combines bottom-up hypothesis generation with top-down expectations and, unlike standard pattern recognition techniques, can ensure that the representation of the input sound is physically realizable. Our approach gradually enriches low-level signal descriptors, based on Continuity Preserving Signal Processing, with increasingly abstract interpretations along a hierarchy of description levels. This process is guided by top-down knowledge, which provides context and ensures an interpretation consistent with the knowledge of the system. This leads to a system that can detect and recognize specific events for which the evidence is present in unconstrained real-world sounds.

Andringa, T., Hengel, P.W.J. van, Nillesen, M.M., Muchal, R. (2004). Aircraft sound-level measurements in residential areas using sound source separation. Proceedings of Internoise 2004, Prague, 460. (abstract) (pdf)
Measuring the contribution of a particular sound source to the ambient sound level at an arbitrary location is impossible without some form of sound source separation. This has made it difficult, if not impossible, to design automated systems that measure the contribution of a target sound to the ambient sound level. This paper introduces sound source separation technology that can be used to measure the contribution of a sound source, a passing plane, in environments where planes are not the dominant sound source. This sound source separation and classification technology, developed by Sound Intelligence, makes it possible, in principle, to monitor the temporal development of any soundscape.
The plane detection technology is based on Continuity Preserving Signal Processing (CPSP): signal processing technology that mimics the tracking and streaming ability of the human auditory system. It relies on an efficient implementation of a model of the human cochlea which, like the natural system, preserves the continuous development of sound sources better than is possible with traditional systems. Detection and classification are based on the characteristic development of the target sound source. In the case of airplanes, the characteristic property set includes spectral cues in combination with a characteristic temporal development, plus techniques to deal with prominent transmission effects (wind and reflections off buildings).
This airplane sound detection technology has not been published before. Experiments at a residential location about 25 km from Amsterdam Airport demonstrate that it is possible to estimate the contribution of passing planes to the ambient sound. Even when the target signal is only marginally (5 dB) above the background noise and degraded by wind and reflections, a detection performance comparable to that of human listeners is achieved.

Andringa, T.C. (2002). Continuity preserving signal processing. PhD thesis, Rijksuniversiteit Groningen. (abstract) (pdf)
The basis for this research was laid between 1994 and 1999, when I held a teaching position in the Technische Cognitiewetenschap programme. The aim of the research was to find speech recognition applications for the rich temporal information in a model of the human inner ear, or cochlea, that had been developed in the group of prof. Duifhuis. When the first results suggested that this research could lead to important applications in, among other fields, speech technology, it was decided to postpone publication and to develop the work further in support of patent applications. At the end of 1999, on the initiative and with the financial support of the RuG Houdstermaatschappij, it was decided to continue the research in a commercial setting. To this end the company Human Quality (HuQ) Speech Technologies BV was founded.
Continuity Preserving Signal Processing (CPSP) is an entirely new signal-analysis methodology based on the research described in this thesis. Because CPSP rests on generally valid basic assumptions, it is, unlike other signal-analysis methods, suitable for the analysis of unknown (i.e., not yet recognized) complex sound signals. With CPSP, top results are possible on the Aurora test, an international benchmark for noise-robust speech recognition. This research has developed into, on the one hand, the starting point for a theory of human (speech) sound analysis and, on the other hand, a new approach to speech technology. Partly for these reasons, the thesis was written as a tutorial that will continue to develop over time.