Auditory model
  Parameters auditory model
Cochleogram
  Parameters cochleogram
  Reference spectrum
Signal components
  Pulses
  Ridges (tones)
  Noise
  Mixtures
List of abbreviations

This document provides the theoretical basis of Continuity Preserving Signal Processing (CPSP). CPSP is a convenient tool for the analysis of signals stemming from physical processes because it allows an analysis in terms of representations (called signal components) that are likely to stem from a single process.

Signal components, explained in more detail below, are subsets of the signal that ultimately correspond to individual physical processes and subprocesses. In many cases a signal stems from multiple physical processes, so it is important to base conclusions about those processes on representations that do not mix up information from multiple sources. CPSP facilitates this separation in a way that is favoured by evolution.

This document explains a number of important concepts of CPSP. It requires very little background in signal processing and mathematics. It starts with a very short introduction of the auditory model that is used. This model is used to derive cochleograms, in which signal components appear as members of a small family; it is assumed that all sounds can be described as combinations of the members of this family, which are introduced and discussed next. The document ends with a discussion of the analysis of mixtures of sounds. A more elaborate text on CPSP can be found in the first chapter of the thesis of Tjeerd Andringa.

Auditory model

The first processing step in our auditory system is the conversion of sound vibrations into neural information. This task is performed at the basilar membrane (BM), a structure in our inner ear, or cochlea, that is in some aspects similar to a vibrating string. Its anatomy and function are beautifully visualized in a movie (2 MB) from Cornell University. The BM is schematically visualized in the figure below:

Basilar membrane model

The lower drawing shows the uncoiled basilar membrane model and some of the 3000 hair cells that transduce vibrations into electrical potentials, which in turn are converted into neural spikes. The actual BM, in combination with the hair cells and the spiking neurons, behaves in a strongly non-linear way. This allows the cochlea to process a dynamic range of more than 100 dB (ten orders of magnitude). Although the numerical model that is used is completely linear, its output is translated into decibels.
The upper drawing shows an electrical equivalent of a single BM segment. The inductor L represents the combined mass (inertia) of the BM segment and the surrounding fluid at that position. The capacitance C represents the stiffness of the BM (the restoring force), and the resistance R the local energy dissipation. Each segment has an associated combination of mass, damping and stiffness and therefore behaves as a resonator. Each position is physically coupled to its neighbors, which ensures that their phases (stage in the oscillation cycle) are similar.
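To make the electrical analogy concrete: the resonance frequency and the quality factor of such a series RLC segment follow directly from L, C and R. The sketch below uses purely illustrative component values, not parameters of the actual BM model:

```python
import math

def resonator(L, C, R):
    """Resonance frequency (Hz) and quality factor of a series RLC
    resonator, the electrical equivalent of one BM segment."""
    # resonance frequency: f0 = 1 / (2*pi*sqrt(L*C))
    f0 = 1.0 / (2.0 * math.pi * math.sqrt(L * C))
    # quality factor: narrower resonance peaks give a higher Q
    Q = math.sqrt(L / C) / R
    return f0, Q

# illustrative values only (henry, farad, ohm)
f0, Q = resonator(L=1e-3, C=1e-6, R=2.0)
```

Lowering R (less damping) raises Q, which is why the "quality of the oscillators" parameter discussed below sharpens the resonance peaks.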

Key points of the mammalian basilar membrane are that

  • it is sensitive to different frequencies at different positions, and
  • it preserves continuity in time, place, and (via the place-frequency relation) frequency.

The preservation of continuity does not hold for commonly used forms of (speech) signal preprocessing: it is in fact difficult to guarantee without a good model of a natural basilar membrane. A suitable time-domain model of the basilar membrane was originally developed by prof. H. Duifhuis and adapted for commercial use by Sound Intelligence.

At the level of the BM, continuity preservation entails that the excitation xs(t) at time t of BM-model segment s is very similar to xs+1(t+dt), where dt is a very small time step (for numerical reasons about 5 times the Nyquist period) and s+1 a neighboring segment. This double continuity is important because it entails that a continuous development of a sound source, such as a voice, is preserved for further processing. A preprocessing stage based on, for example, the well-known and generally applied Fast Fourier Transform (FFT) preserves continuity neither in time nor in frequency. This is not necessarily wrong, but it is suboptimal, and its safe application cannot be justified physically for unknown mixtures of signals. The same is true for other signal processing techniques like LPC/PLP, wavelets, and many auditory-inspired signal processing techniques. (More information about the numerical model of the cochlea can be found in a separate file.)

Parameters auditory model

There are three parameters of the auditory model that directly influence the way a signal is processed and the way it is represented in the cochleogram (see the next section):

Number of segments
The BM is divided into segments, each being sensitive to a particular frequency. In the cochlea model the number of segments can be adjusted. The use of more segments causes some sharpening of the frequency resolution as a result of which details are more visible. For many applications 100 is a good initial choice.
Highest frequency
While the lowest frequency of the implemented model is fixed (to about 27 Hz), the highest frequency can be adjusted, because the higher frequency regions do not always contain useful information, and leaving them out reduces computation time. This parameter depends in the first place on the phenomenon to be studied. Generally a highest frequency of 4000 Hz is a good starting point.
Quality of the oscillators
The coupled filters act like oscillators, each one being sensitive to a particular frequency. The quality of these oscillators is defined as the peakedness of the resonance: the narrower the peaks compared to the peak frequency, the higher the quality. It is generally best to leave this parameter at its default value, since increasing it may lead to a numerically unstable system.


Cochleogram

The basilar membrane model can be used to form a so-called cochleogram: a spectrogram-like representation, based on the squared excitation xs(t)xs(t) of BM-segment s at time t. The cochleogram indicates the distribution of source energy as a function of time and frequency. A cochleogram rs(t) is derived from a leaky integration process (or low-pass filtering L):

      rs(t) = rs(t-dt)·exp(-dt/τs) + xs(t)xs(t)
            = L(xs(t)xs(t)),    s = 1 … smax

      segment independent integration: τs equal for all segments
      segment dependent integration:   τs = max(frameSize, 1/f(s))

This formula denotes the leaky-integration process L in which a fraction exp(-dt/τs), close to but smaller than 1, of the energy in the previous time step rs(t-dt) is retained, and to which the new squared BM excitation xs(t)xs(t) is added. The time constant τs determines the scope of memory of the leaky integration process: the larger τs, the more the temporal information is smoothed. A value of τs = 10 ms smooths out details shorter than about 10 ms. It is therefore adequate to visualize the main structures of speech, because the lowest frequency in most speech is about 100 Hz, which corresponds to a period of 10 ms (1/100 Hz = 10 ms), and speech seems mostly stationary over intervals of about 10 ms. After segment independent integration (i.e., a common τs for all segments) all temporal details smaller than the time constant are smoothed out, and the sampling period can be increased, without further loss of information, to a frame size of 5 or 10 ms. This is shown in the left panel below for the word "nul" spoken by a female speaker in Dutch.
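The leaky integration for one segment can be sketched in a few lines. This is a minimal illustration of the formula above, not the Sound Intelligence implementation; the sample values and time constants are arbitrary:

```python
import math

def leaky_integrate(x, dt, tau, r0=0.0):
    """One segment of the cochleogram:
    r(t) = r(t-dt)*exp(-dt/tau) + x(t)*x(t)."""
    decay = math.exp(-dt / tau)  # retained fraction, close to but below 1
    r, out = r0, []
    for sample in x:
        r = r * decay + sample * sample
        out.append(r)
    return out

# a unit excitation followed by silence decays with time constant tau
trace = leaky_integrate([1.0, 0.0, 0.0], dt=0.001, tau=0.010)
```

After the excitation stops, the output decays by a factor exp(-dt/tau) per time step, which is what makes details shorter than tau invisible.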

Cochleograms of a clean sound with ridges with
      100 and 200 segments

The horizontal axis corresponds to time in seconds and the vertical axis to frequency in Hz. The place-frequency relation is, like that of the real BM, approximately logarithmic. The color coding corresponds to energy in dB. The more or less horizontal black lines are explained below.

Cochleograms with segment (in)dependent

It is also possible to perform segment dependent integration. This allows, like a wavelet analysis, maximal temporal detail at the high-frequency side and maximal frequency detail at the low-frequency side. The integration time constant of each segment is bounded below by the characteristic period of the segment and by the frame size. The characteristic period of a segment is the duration of a single period at the frequency the segment is specialized in. To prevent the visualization of individual oscillations, the time constant for each segment is set to the maximum of the frame size and a multiple (nPeriod) of its characteristic period.

Parameters cochleogram

There are a few parameters that influence the way the signal is represented in the cochleogram (apart from the way it is processed by the auditory model):

Frame size (dt)
The frame size dt is the time step by which the cochleogram is sampled. Every time step results in a frame. Its value depends on the type of source you want to describe: sound sources that develop fast need to be sampled more often (a smaller frame size) than sources that develop slowly.
Integration time constant
When the integration time constant (τ) of the cochleogram is larger than 0, it is applied to all segments (segment independent integration). (For an explanation of the integration process see above.) If its value is 0, segment dependent integration is applied on the basis of the characteristic period of each segment, controlled by the period parameter below.
Period (nPeriod)
Together with the frame size this parameter determines how the segment dependent integration is executed. The integration time constant (τ) is the maximum of the frame size and nPeriod multiplied by the characteristic period of the segment (the inverse of its characteristic frequency).
Group delay correction
True or false. When set to true, the cochleogram is compensated for the frequency-dependent group delay of the BM segments (see the response to pulses below).
Dynamic range (dBRange)
Maximal difference in dB between the highest energy value Emax in the cochleogram and the lowest energy value. All values smaller than Emax-dBRange are set to Emax-dBRange.
Absolute energy
True or false. If set to true the energy values of the cochleogram are rescaled so that Emax-dBRange=0. Consequently all values lie between 0 and dBRange.
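Two of the parameters above reduce to one-line computations. The sketch below illustrates the segment dependent time constant and the dynamic-range clipping; the function names are ours, and the numeric arguments are illustrative:

```python
def integration_tau(frame_size, n_period, f_char):
    """Segment dependent integration: the time constant is the maximum
    of the frame size and nPeriod characteristic periods (1/f_char)."""
    return max(frame_size, n_period / f_char)

def clip_dynamic_range(energies_db, db_range):
    """All values below Emax - dBRange are set to Emax - dBRange."""
    e_max = max(energies_db)
    floor = e_max - db_range
    return [max(e, floor) for e in energies_db]
```

For a 100 Hz segment, two characteristic periods (20 ms) exceed a 5 ms frame size, so the longer value wins; for a 1000 Hz segment the frame size dominates.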

Reference spectrum

The basilar membrane model is, like the natural basilar membrane, not equally sensitive to all frequencies. For many signal processing applications this is not an advantage, since it obscures the relative importance of different frequency ranges. To be (to a large extent) independent of the transfer function of the cochlea model, a reference signal can be used. The mean energy per segment of this signal is used as the reference energy for that segment. When a reference signal is used with absolute energy set to TRUE, all energy values are expressed as fractions of the energy in the reference signal. For example, a value of 20 dB corresponds to a signal that is 100 times as energetic as the average value of the reference signal; a value of -10 dB corresponds to a value 10 times smaller than the reference value.
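The conversion to reference-relative decibels is simply a log-ratio. A minimal sketch (the function name is ours):

```python
import math

def db_re_reference(energy, ref_energy):
    """Energy in dB relative to the mean reference energy of the same
    segment: 10*log10(E / E_ref)."""
    return 10.0 * math.log10(energy / ref_energy)
```

This reproduces the examples in the text: an energy 100 times the reference gives 20 dB, and one tenth of the reference gives -10 dB.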

Reference and absolute energy compared

Signal components

Physical processes distribute their energy over the time-frequency plane (of which a cochleogram is a visualization). The energy distribution can be localized in time (as with a knock on the table), localized in frequency (as with a bell), or broad in frequency as well as in time (in the case of stationary noises). Energy distributions in the time-frequency plane that are coherent (a single whole) are called signal components. They can be described by specifying the temporal development of frequency and energy (phase is optional). The uncertainty relation for signals entails that there is always a trade-off between time and frequency, mathematically expressed as follows:

      Δf·Δt ≥ 1     (f in Hz), or
      Δω·Δt ≥ 2π    (ω in rad/s)
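
As a trivial numeric illustration of this trade-off (not part of CPSP itself): the shorter an event, the larger its minimal spread in frequency.

```python
def min_bandwidth_hz(delta_t):
    """Lower bound on the frequency spread of an event of duration
    delta_t seconds, from the uncertainty relation Δf·Δt >= 1."""
    return 1.0 / delta_t

# a 10 ms pulse spreads over at least roughly 100 Hz,
# a 1 ms pulse over at least roughly 1000 Hz
```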

In terms of this relation we can define some classes: pulses, tones and noise, successively discussed below.

Signal components - pulses

Pulses are strongly localized in time: Δt is small. A single pulse conveys minimal periodicity information: all frequencies are represented equally poorly, so a pulse excites all frequencies equally and Δf is maximal. Narrow, near-vertical structures in the cochleogram signify pulses. The response of a cochlea model to a perfect pulse is shown in the figure below as the vertical structure starting at t=0.1 s. The pulses starting at t=0.3, 0.5 and 0.8 s become gradually broader in time (and therefore further and further from the ideal realization of a pulse). This is perceptually noticeable: the last "pulse" begins to sound like a burst. (Note that the pulse starting at t=0.5 s is similar to the heartbeat described in the application section, which is in addition low-pass filtered.)

Cochleogram of pulses

When an ideal impulse (such as the first impulse at t=0.1 s) excites the basilar membrane model, it starts to resonate in a way similar to a bell. There are some important differences: a bell has a strong preference for some frequencies while the BM-model does not, and a normal bell rings out much more slowly than the BM-model. Nevertheless the width of the impulse response seems enormous: it lasts more than 200 ms at the low-frequency side. This is in part an optical effect due to the dynamic range of 60 dB: a small part of the impulse response contains most of the energy.

Note that the start of the BM (sensitive to high frequencies) responds faster (and much shorter) than the end of the BM (sensitive to low frequencies). This is an effect of group delay.

Signal components - ridges (tones)

Tones are sinusoids with a constant period and are consequently strongly localized at a certain frequency (Δf is small). The expression of a sinusoid is therefore very narrow (pulse-like) in the frequency domain. By definition, tones must last long, which entails that Δt is large. Tones are therefore not suitable to convey detailed temporal information. Narrow horizontal cochleogram structures signify sinusoids and are called ridges. Like pulses, tones come in different flavors. The ideal tone is long, steady and smooth; a two-second example is shown in the first two seconds of the figure below. This signal consists of successive 2-second signals that become broader and broader in frequency. The perceptual effect changes from a steady tone, via a slowly and then a rapidly modulated tone, to a "hollow" sounding narrow-band noise. The vertical structures at even seconds are pulses that result from concatenating these signals without ensuring a smoothly developing signal. The pulses help to segregate the signal into different perceptual units.

Cochleogram of a broadening tone

In the case of speech the more or less horizontal ridges indicate harmonics. In the cochleogram of the (clean) sound "nul" (see above) almost all ridges correspond to a single harmonic. The main contribution of the lowest ridge starts at a frequency just above 200 Hz and gradually rises to about 380 Hz. This ridge corresponds to the first harmonic, and its frequency corresponds to the fundamental frequency development of the word "nul". The second and higher ridges correspond to higher harmonics. The first 19 harmonics (with frequencies up to 4500 Hz) are represented as ridges. In the case of voiced (quasi-periodic) speech we need to combine all evidence of the harmonics of a single source into a single representation: a harmonic complex. For this we need the best instantaneous frequency estimation possible. The local instantaneous frequency (LIF) is the change in phase per time step (dφ/dt), and is therefore only meaningful for periodic signal components that change slowly: ridges. The LIF is further explained below.
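The definition of the LIF as phase change per time step can be sketched as follows. This is a minimal illustration of dφ/dt, not the CPSP implementation, and the phase-unwrapping convention is our assumption:

```python
import math

def local_instantaneous_frequency(phase_prev, phase_now, dt):
    """LIF in Hz from two successive phase samples (radians):
    f = (dphi/dt) / (2*pi), with dphi unwrapped into (-pi, pi]."""
    dphi = (phase_now - phase_prev + math.pi) % (2.0 * math.pi) - math.pi
    return dphi / (2.0 * math.pi * dt)
```

For a steady 200 Hz component sampled every millisecond, the phase advances by 0.4π per step, giving back exactly 200 Hz.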

Signal components - noise

While tones are defined by their periodicity, noises are defined by not being periodic: in noise there are no repeating structures, even though it lasts long. Noises thus consist of a broad continuum of frequency contributions (Δf is large) and last some time (Δt is large as well). Noises represent acoustic energy without much preference for certain frequencies or points in time. The response of a cochlea model to different noises is shown in the figure below. The signal starts with a slowly developing broadband noise (white noise). Superimposed on the offset of the broadband noise are three other "noises" that become more and more localized in time and frequency (500 Hz). The perception changes from a standard noise to noisy bursts with more and more tonal quality.

Cochleogram of different noises

Signal components - mixtures

The three kinds of signal components discussed above give insight into sounds, but will seldom occur in isolation in a natural environment: natural sounds are (almost) always a combination of these three components. This knowledge can be used for the classification of sound sources: on the basis of a priori knowledge of source characteristics and auditory scene analysis (ASA), the separate signal components that constitute a sound can be formed and used to classify sound sources. Any sound source that we can recognize must, by logical necessity, produce a sound with sufficient characteristics to allow recognition (or classification). A guitar cannot produce speech because its physics does not allow it to produce speech sounds. Conversely, our vocal system cannot produce the sound of a guitar for the same physical reason. The (physical) constraints of a sound source determine which sounds it can and cannot produce. The main constraint of speech is that it complies with the (very complex) rules of at least one language. The constraints of other sound sources are almost always much simpler. For all sounds it is possible to formulate models by combining the (physical) constraints that determine them.

As an example, let's take the sound "nul" in a noisy situation, obtained by adding cocktail-party noise to the clean signal ("nul"). We get a noisy signal (in this case with a global SNR of 0 dB) that not only contains information stemming from the target speaker but also information from other sound sources. Nevertheless the human auditory system has no difficulty (not even on first hearing) in finding, combining and recognizing the components of the noisy word. The cochleogram of this noisy signal is shown below:

Cochleogram of a sound with 0 dB SNR with

Compared to the clean situation, where it was trivial to combine information from a single source into a single representation, we now have to select a subset of the available evidence and discard the rest (or reassign it to other sources). For this subset to be assigned to a certain sound source it must, of course, comply with the constraints that determine this source. In other words, the subset must comply with the source model. As in the clean situation, we want to find a harmonic complex as evidence for a single source.

CPSP allows a very accurate estimation of the Local Instantaneous Frequency (LIF) under resolved ridges: usually the error is less than 1%, which is in the same range as human performance. This is very difficult, if not impossible, with frame-based methods, especially in the case of noisy input. The LIF is computed via the running autocorrelation along a ridge s(t):

      rs(t)(t+T) = L( xs(t)(t) · xs(t)(t+T) ),    T = [0, Tmax]

The periodic structure of the ridge autocorrelation determines the LIF.

LIF development for clean and noisy signal

This figure shows the local instantaneous frequency development under ridges for the clean signal (red circles) and for the noisy signal (filled blue dots). Target ridges that still dominate in the noisy condition enforce the same local response of the BM as in the clean signal, and consequently the derived frequency information is identical. Hence many of the LIF values of the noisy and the clean condition are almost identical. Visual inspection shows that almost all frequency information at formant positions is unperturbed in the noisy condition.


References

Allen, J.B. (1994). How do humans recognize speech?
Andringa, T.C. (2002). Continuity Preserving Signal Processing (PhD thesis).

List of abbreviations

ASA : Auditory Scene Analysis
BM : Basilar Membrane
CPSP : Continuity Preserving Signal Processing
FFT : Fast Fourier Transform
LIF : Local Instantaneous Frequency
SNR : Signal-to-Noise Ratio

[2004] Written by Tjeerd Andringa, Maria Niessen, and Maartje Nillesen