Projects
Integration of visual and auditory
information to detect aggression
A computational model of sound recognition
Linking physical processes to signal
components
Applications of continuity preserving acoustic signal processing
Machine analysis and diagnostics
Automatic keyword spotting
Publications
- Niessen, M.E., Van Maanen, L., Andringa,
T.C. (2008). Disambiguating sound through
context. International Journal of
Semantic Computing 2(3), 327-341.
(abstract) (pdf)
Electronic version of an article published as doi:10.1142/S1793351X08000506
© [copyright World Scientific Publishing Company]
-
A central problem in automatic sound recognition is the
mapping between low-level audio features and the meaningful
content of an auditory scene. We propose a dynamic network
model to perform this mapping. In acoustics, much research is
devoted to low-level perceptual abilities such as audio
feature extraction and grouping, which are translated into
successful signal processing techniques. However, little work
is done on modeling knowledge and context in sound
recognition, although this information is necessary to
identify a sound event rather than to separate its components
from a scene. We first investigate the role of context in
human sound identification in a simple experiment. Then we
show that the use of knowledge in a dynamic network model can
improve automatic sound identification by reducing the search
space of the low-level audio features. Furthermore, context
information dissolves ambiguities that arise from multiple
interpretations of one sound event.
- Krijnders, J.D., Andringa,
T.C. (2008). Demonstration of online auditory scene analysis. Accepted for
BNAIC 2008.
(abstract)
(pdf)
-
We show an online system for auditory scene analysis. Auditory scene analysis is the analysis of complex
sounds as they occur in natural settings. The analysis is based on models of human auditory processing.
Our system includes models of the human ear, analysis in tones and pulses and grouping algorithms.
This system forms the basis for several sound recognition tasks, such as aggression detection and vowel
recognition.
- Van Elburg, R.A.J., Van Ooyen, A. (2008). Generalization of
the event-based Carnevale-Hines integration scheme for
integrate-and-fire models. Submitted.
(abstract)
(preprint)
-
An event-based integration scheme for an integrate-and-fire
neuron model with exponentially decaying excitatory synaptic
currents and double exponential inhibitory synaptic currents has
recently been introduced by Carnevale and Hines. This integration
scheme imposes non-physiological constraints on the time constants
of the synaptic currents it attempts to model, which hamper its
general applicability. This paper addresses this problem in two
ways. First, we provide physical arguments to show why these
constraints on the time constants can be relaxed. Second, we give
a formal proof showing which constraints can be abolished. This
proof rests on a generalization of the Carnevale-Hines lemma,
which is a new tool for comparing double exponentials as they
naturally occur in many cascaded decay systems including
receptor-neurotransmitter dissociation followed by channel
closing. We show that this lemma can be generalized and
subsequently used for lifting most of the original constraints on
the time constants. Thus we show that the Carnevale-Hines
integration scheme for the integrate-and-fire model can be
employed for simulating a much wider range of neuron and synapse
type combinations than is apparent from the original treatment.
- Niessen, M.E., Van Maanen, L., Andringa,
T.C. (2008). Disambiguating sounds through context. Proceedings
of IEEE ICSC,
88-95.
(abstract)
(pdf)
-
A central problem in automatic sound recognition is the mapping
between low-level audio features and the meaningful content of an
auditory scene. We propose a dynamic network model to perform this
mapping. In acoustics, much research has been devoted to low-level
perceptual abilities such as audio feature extraction and
grouping, which have been translated into successful signal
processing techniques. However, little work is done on modeling
knowledge and context in sound recognition, although this
information is necessary to identify a sound event rather than to
separate its components from a scene. We first investigate the
role of context in human sound identification in a simple
experiment. Then we show that the use of knowledge in a dynamic
network model can improve automatic sound identification, by
reducing the search space of the low-level audio
features. Furthermore, context information dissolves ambiguities
that arise from multiple interpretations of one sound event.
- Vooren, H. van de, Violanda, R.R., Andringa, T.C. (2008). Robust
harmonic grouping by octave error
correction. Acoustic-08
Paris.
(abstract)
-
Harmonic grouping is a frequently applied technique in
computational auditory scene analysis and automatic speech
recognition systems. However, grouping is easily disrupted by
noise and reverberation. For instance, a noise-induced signal
component positioned roughly between two harmonics might
undesirably be assigned to the harmonic complex (HC) as well. This
results in an octave error: harmonics in an HC are assigned to
harmonic numbers twice as high as the correct values. We propose a
cost function based method to correct these octave errors. This
function is designed to, on the one hand, improve the balance
between odd and even harmonic numbers, and, on the other hand,
minimize the number of signal components to be rejected. As a
preprocessing step we applied short-time Fourier analysis to
derive an instantaneous frequency representation from which we
obtained the signal components. We used these as input for our
harmonic grouping algorithm to obtain the HCs. Then we selected
the optimal solution from the cost function and modified the
composition of the HCs accordingly. As long as enough harmonics
are sufficiently above the local noise level, this octave error
correction mechanism works well for various sorts of harmonic
sounds including speech.
- Andringa, T.C. (2008). The texture of natural
sounds. Acoustic-08
Paris.
(abstract)
-
The texture, a spectro-temporal pattern, of many sound sources is
a robust and characteristic perceptual property that listeners use
for sound source recognition. The robustness of the texture
ensures that we can recognize sound sources like a helicopter,
flowing water, or a surf breaking on pebbles in a wide variety of
acoustic environments. This robustness suggests that textures can
be used for automatic source identification or environment
classification. We introduce a method to determine the presence of
sound textures associated with, for example, flat noise, pulsed
noises (helicopter), sweep based textures (running water), and
tonal noises (babble). The cumulative probability density of
time-frequency fluctuations is matched with prototypical
cumulative probability density functions (cpdf) with a running
variant of the Kolmogorov-Smirnov test. Textures with a
distribution similar to the target distribution contribute
approximately equally to all values of the cpdf. The flatness of
this distribution is used as a distance measure. When the pdfs of
the target textures do not overlap strongly, the method can
determine the texture of time-frequency regions as small as 100 ms
by 6 semitones. This method can therefore also be used to determine
the texture of the background.
- Krijnders, J.D., Niessen, M.E., Andringa, T.C. (2008). A grouping
approach to harmonic complexes. Acoustic-08
Paris.
(abstract)
-
Humans seem to perform sound-source separation for quasi-periodic
sounds, such as speech, mostly on harmonicity cues. To model this
function, most machine algorithms use a pitch-based approach to
group the speech parts of the spectrum. In these methods the pitch
is obtained either explicitly, in autocorrelation methods, or
implicitly, as in harmonic sieves. If the estimation of pitch is
wrong, the grouping will fail as well. In this paper we show a
method that performs harmonic grouping without first calculating
the pitch. Instead a pitch estimate is associated with each
grouping hypothesis.
Making the grouping independent of the pitch estimate makes it
more robust in noisy settings.
The algorithm obtains possible harmonics by tracking energy peaks
in a cochleogram. Co-occurring harmonics are compared in terms of
frequency difference. Grouping hypotheses are formed by combining
harmonics with similar frequency differences. Consistency checks
are performed on these hypotheses and hypotheses with compatible
properties are combined into harmonic complexes. Every harmonic
complex is evaluated on the number of the harmonics, the number of
subsequent harmonics and the presence of a harmonic at the pitch
position. Using the number of subsequent harmonics prevents octave
errors.
Multiple concurrent harmonic complexes can be found as long as the
spectral overlap is small.
- Niessen, M.E., Van Elburg, R.A.J., Krijnders, J.D., Andringa,
T.C. (2008). A computational model for auditory scene
analysis. Acoustic-08
Paris.
(abstract)
-
Primitive auditory scene analysis (ASA) is based on intrinsic
properties of the auditory environment. Acoustic features such as
continuity and proximity in time or frequency cause perceptual
grouping of acoustic elements. Various grouping attributes have
been translated into successful signal processing techniques that
may be used in source separation. A next step beyond primitive ASA
is source identification through schema-based ASA. We present a
computational model for ASA that is inspired by models from
cognitive research. It dynamically builds a hierarchical network
of hypotheses, which is based on (learned) knowledge of the
sources. Each hypothesis in the network, initiated by bottom-up
evidence, represents a possible sound event. The network is
updated for each new input event, which may be any sound in an
unconstrained environment. The analysis of new input events is
guided by knowledge of the environment and previous events. As a
result of this adaptive behavior, information about the
environment increases and the set of possible hypotheses
decreases. With this method of continuously improving sound event
identification we make a promising advance in computational ASA of
complex real-world environments.
- Andringa, T.C., Grootel, M. van (2007). Predicting listeners'
reports to environmental sounds. Proceedings
of ICA
2007, ENV-09-005-IP.
(abstract)
(pdf)
-
Spontaneous verbal descriptions of environmental sounds lead to a
description of the contributing sound sources and the environments
in which they occur. This is a form of perception that relies
crucially on the rich structure of sounds, because only rich
sounds can convey detailed information about individual sources
and the transmission environment. This paper uses a semantic
network with connection strengths derived from listener reports to
represent the content of auditory scenes. The activity of the
semantic network is based on a number of source specific cues for
sounds such as birds, vehicles, speech, and footsteps. These cues
are not based on spectral envelope and level, but on patterns in
tones, pulses and noises that capture source specific
structures. Generally the system performed in a similar way as a
human listener in terms of concepts activated (or named) and the
choice of the acoustic environment. The robustness and performance
suggest the combination of a semantic network and source specific
cues can be used to design systems for sound-based ambient
awareness.
- Zajdel, W., Krijnders, J.D., Andringa, T.C., Gavrila,
D.M. (2007). Audio-video sensor fusion for aggression
detection. Proceedings of AVSS
2007. Best paper award.
(abstract)
(pdf)
-
This paper presents a smart surveillance system named CASSANDRA,
aimed at detecting instances of aggressive human behavior in
public environments. A distinguishing aspect of CASSANDRA is the
exploitation of the complementary nature of audio and video
sensing to disambiguate scene activity in real-life, noisy and
dynamic environments. At the lower level, independent analysis of
the audio and video streams yields intermediate descriptors of a
scene like: "scream", "passing train" or "articulation energy". At
the higher level, a Dynamic Bayesian Network is used as a fusion
mechanism that produces an aggregate aggression indication for the
current scene. Our prototype system is validated on a set of
scenarios performed by professional actors at an actual train
station to ensure a realistic audio and video noise
setting.
- Hengel, P.W.J. van, Andringa, T.C. (2007). Verbal
aggression detection in complex social
environments. Proceedings of AVSS
2007.
(abstract)
(pdf)
-
- Niessen, M.E., Krijnders, J.D., Boers, J., Andringa, T.C.
(2007). Assessing the reverberation level in
speech. Proceedings of ICA
2007, CAS-03-020.
(abstract)
(pdf)
-
The performance of automatic speech recognition (ASR) systems is
seriously degraded in reverberant environments. We propose a
method for assessing the reverberation level in speech that makes
it possible to determine in real-time whether a speech signal is
reverberant or not. Reverberation causes an increase in the
variation of the energy and frequency of harmonics in
speech. Speech with a variable pitch is especially affected by
reverberation. To capture the effect of reverberation we measured
six features on the harmonics of the speech signal, which
represent energy and frequency variation in different ways. Speech
from the Aurora database was artificially reverberated to
demonstrate the validity of these features. Each feature predicted
reverberation time for a different subset of the dataset. To test
the overall separability of the speech samples using these
features, speech from the dataset was automatically classified as
being either inside or outside the reverberation radius. Most of
the speech was correctly classified, which suggests that a
reliable real-time classification algorithm can be developed to
select good-quality speech. This algorithm can improve
pre-processing methods, such as speech enhancement or voice
activity detection, for more robust ASR.
- Krijnders, J.D., Niessen, M.E., Andringa, T.C.
(2007). Robust harmonic complex estimation in
noise. Proceedings of ICA
2007, CAS-03-019. Best poster award at the BCN
research school spring meeting.
(abstract)
(pdf)(poster)
-
We present a new approach to robust pitch tracking of speech in
noise. We transform the audio signal to the time-frequency domain
using a transmission-line model of the human ear. The model output is
leaky-integrated resulting in an energy measure, called a
cochleogram. To select the speech parts of the cochleogram a
correlogram, based on a running autocorrelation per channel, is
used. In the correlogram possible pitch hypotheses are tracked through
time and for every track a salience measure is calculated. The
salience determines the importance of that track, which makes it
possible to select tracks that belong to the signal while ignoring
those of noise. The algorithm is shown to be more robust to the
addition of pink noise than standard autocorrelation-based methods. An
important property of the grouping is the selection of parts of the
time-frequency plane that are highly likely to belong to the same
source, which is impossible with conventional autocorrelation-based
methods.
- Lobanova, A., Spenader, J., Valkenier, B. (2007). Lexical
and perceptual grounding of a sound ontology. Proceedings of TSD 2007, 180-187.
(abstract)
(pdf)
-
-
Van Opzeeland, I., Slabbekoorn, H., Andringa, T.C., Ten Cate,
C. (2007). Underwater racket: fish and noise pollution. De
Levende Natuur 108, 39-43.
(abstract)
(pdf)
-
Since the introduction of the Pollution of Surface Waters Act
(Wet op de Verontreiniging van Oppervlaktewateren), the quality of
surface water in the Netherlands has improved considerably.
Through the Water Framework Directive, countries now also
cooperate at the international level on better water quality.
However, an important aspect of water that has been neglected so
far is the underwater sound level resulting from the sharp
increase in shipping, recreation, and construction work in and
around Dutch surface waters. Almost nothing is known so far about
the impact of this on underwater habitat quality.
-
Van Opzeeland, I., Slabbekoorn, H., Andringa, T., Ten Cate,
C. (2007). Underwater noise pollution: effects on
fishes. Visionair 4, 11-13.
(abstract)
(pdf)
-
Almost every angler knows it: if you are too noisy at the
waterside, the fish take off. Despite this piece of angler's
wisdom, many will not realize how important sound is to fish.
Human activity on and around the surface waters of the
Netherlands is increasing sharply. As a result, the underwater
noise load will have increased correspondingly. Little is known
about the effects of ambient sound on fish. This was reason for
Leiden University to devote an extensive literature study to the
subject.
- Andringa, T.C., Niessen, M.E. (2006). Real-world sound
recognition: A recipe. Proceedings of LSAS
2006, 106-118.
(abstract)
(pdf)
-
This article addresses the problem of recognizing acoustic
events present in unconstrained input. We propose a novel approach
to the processing of audio data which combines bottom-up hypothesis
generation with top-down expectations, which, unlike standard pattern
recognition techniques, can ensure that the representation of the input
sound is physically realizable. Our approach gradually enriches low-level
signal descriptors, based on Continuity Preserving Signal Processing,
with more and more abstract interpretations along a hierarchy of description
levels. This process is guided by top-down knowledge which provides
context and ensures an interpretation consistent with the knowledge of
the system. This leads to a system that can detect and recognize specific
events for which the evidence is present in unconstrained real-world
sounds.
- Andringa, T., Hengel, P.W.J. van, Nillesen, M.M., Muchal, R. (2004).
Aircraft sound-level measurements in residential areas using sound
source separation. Proceedings of Internoise 2004, Prague, 460.
(abstract)
(pdf)
-
- Andringa, T.C. (2002). Continuity preserving signal
processing. Thesis Rijksuniversiteit Groningen.
(abstract)
(url)
-
The foundation of this research was laid in the period between 1994
and 1999, when I held a teaching appointment at the Technische
Cognitiewetenschap (Technical Cognitive Science) programme. The aim
of the research was to find speech recognition applications of the
rich temporal information in a model of the human inner ear, or
cochlea, that had been developed in the group of Prof. Duifhuis.
When the first results suggested that this research could lead to
important applications in, among other areas, speech technology, it
was decided to postpone publications and to develop matters further
for the benefit of patent applications. At the end of 1999, on the
initiative and with the financial support of the RuG
Houdstermaatschappij, it was decided to continue the research in a
company setting. To this end the company Human Quality (HuQ) Speech
Technologies BV was founded.
Continuity Preserving Signal Processing (CPSP) is an entirely new
signal analysis methodology based on the research described in this
thesis. Because CPSP is based on generally valid basic assumptions,
it is, in contrast to other signal analysis methods, suitable for
the analysis of unknown (i.e. unrecognized) complex sound signals.
With CPSP, top results are possible on the Aurora test: an
international benchmark test in the field of noise-robust speech
recognition. This research has developed into, on the one hand, the
beginnings of a theory of human (speech) sound analysis and, on the
other hand, a new approach to speech technology. Partly for these
reasons, the thesis was written as a tutorial that will develop
further over time.
Sounds
The role of CPSP
Continuity Preserving Signal Processing (CPSP)
was explicitly designed as a way to combine cognitive science with
signal processing. As the name suggests, the feature that sets
CPSP apart from other signal processing methods is the
conservation of continuity, in particular the continuity of the
development of the physical process that produced the
signal. This ability allows our auditory system to track the
development of individual sound sources, and by doing so separate
them from other sound sources. Traditional signal analysis methods
like FFTs, wavelets, etc., do not preserve this continuity and,
consequently, have difficulties with tracking, separating, and
recognizing mixtures of sound sources.
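As a rough illustration of processing that keeps the temporal
development of a signal continuous, the sketch below leaky-integrates
the squared output of a set of filter channels. The abstract of
Krijnders et al. (2007) in the publication list names leaky
integration as the step that turns a cochlea-model output into a
cochleogram. This is a minimal sketch and not the CPSP
implementation: the function name `leaky_integrate`, the time
constant `tau`, and the use of a plain array in place of a real
cochlea-model output are assumptions made for illustration.

```python
import numpy as np


def leaky_integrate(x, fs, tau=0.01):
    """Leaky-integrate the squared input, one row per channel.

    x   : array of shape (channels, samples) or (samples,);
          stands in for the output of a cochlea model (assumption).
    fs  : sampling rate in Hz.
    tau : integration time constant in seconds (hypothetical value).
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    alpha = np.exp(-1.0 / (fs * tau))  # per-sample decay factor
    energy = np.empty_like(x)
    state = np.zeros(x.shape[0])
    for t in range(x.shape[1]):
        # The estimate is updated at every sample, so the output has
        # no frame boundaries and evolves smoothly over time.
        state = alpha * state + (1.0 - alpha) * x[:, t] ** 2
        energy[:, t] = state
    return energy


# Usage on a synthetic two-channel signal (made-up data).
fs = 8000.0
time = np.arange(0, 0.1, 1.0 / fs)
channels = np.vstack([np.sin(2 * np.pi * 200 * time),
                      0.5 * np.sin(2 * np.pi * 1200 * time)])
energy = leaky_integrate(channels, fs)
```

The result has the same time resolution as the input, which is the
property that makes it possible to follow a source from sample to
sample instead of from frame to frame.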
The most important consequence of tracking the evidence of
individual sound sources is the possibility to form signal
components: physically coherent units that each
contain information about a single source. These signal components
form an ideal basis for subsequent analysis stages such as a
recognition stage.
CPSP, as a general auditory-inspired signal processing paradigm, can
be used for everything we use our ears for. And because it is
tunable (unlike our ear, in which one setting has to suffice for
all uses), it can in many cases be adapted to surpass human
performance. An example of this is the analysis of failures in
rotating equipment (pumps, fans, motors, etc.). Humans have
only a limited auditory memory span and have to come up with a
percept in less than a second, while a computer-based analysis
system can be tuned to analyze a sound in multiple ways without
the time constraint that our auditory system has.
The cochleogram above shows a number of inhomogeneities in the
signal of a gear-box. Two occur at low frequencies around t=0.4
and 1.4 sec. The others occur around 1200 Hz at t=0.25, 0.35,
and 1.1 sec. Both types are audible, but many people will not
notice the second type at first listening, and most listeners will
have to listen multiple times before the first two high-frequency
ticks are heard separately. Once you have heard them they are
obvious. This shows that the visual representation above is often
easier to interpret and to analyze than the sound itself. The
analysis above is an efficient way to determine the nature of the
faults in this gear-box.
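An automatic counterpart of this visual inspection can be sketched
as follows. The code flags time-frequency cells whose energy rises
well above the typical level of their own channel and reports the
times at which that happens. It is a hedged illustration only: the
median baseline, the 6 dB threshold, and the names
`flag_inhomogeneities` and `event_times` are assumptions, not the
method used to analyze the gear-box recording, and the data below
are synthetic.

```python
import numpy as np


def flag_inhomogeneities(energy_db, threshold_db=6.0):
    """Mark cells that exceed their channel's median level.

    energy_db    : array of shape (channels, frames) holding a
                   cochleogram-like energy representation in dB.
    threshold_db : excess over the per-channel median that counts
                   as an inhomogeneity (hypothetical value).
    """
    energy_db = np.asarray(energy_db, dtype=float)
    # The median over time serves as a simple estimate of the
    # stationary level of each channel.
    baseline = np.median(energy_db, axis=1, keepdims=True)
    return (energy_db - baseline) > threshold_db


def event_times(mask, frame_rate):
    """Return the times (in seconds) of frames with any flagged cell."""
    frames = np.flatnonzero(mask.any(axis=0))
    return frames / frame_rate


# Synthetic example: a flat background with two brief bursts.
frame_rate = 100.0                      # frames per second (assumed)
energy_db = np.zeros((4, 200))
energy_db[1, 40] = 10.0                 # burst in channel 1
energy_db[3, 140] = 12.0                # burst in channel 3
mask = flag_inhomogeneities(energy_db)
times = event_times(mask, frame_rate)
```

Because the baseline is computed per channel, a tick that is weak
relative to the overall signal can still stand out in its own
frequency region.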
For research and development in the field of auditory cognition,
CPSP is the natural choice as a signal processing tool, but of course
by no means the only one allowed: different tasks might require
different signal processing approaches. It is the task that should
determine the most suitable signal processing technique, not the
preferences and background of the engineer or scientist. At this
stage CPSP is still in its infancy and many potential uses (each
of which is impossible with standard signal processing techniques
like FFTs) have not even been considered.