
What is a cochlea?
What is a basilar membrane?
Why is the BM a special signal processing device?
What does the cochlea model do?
What is the difference between the cochlea model and the human cochlea?
What is the time-frequency plane?
What is a cochleogram?
Why use segment dependent integration?
What is a signal component?
How should I interpret a cochleogram?
What is the dynamical range of a cochleogram?
What is the frequency range of a cochleogram?
What is the place-frequency relation?
What is the frequency specificity of a segment?
What is group delay?
What is the relation between the frequency sensitivity of each segment and the group delay?
What is the frequency resolution of a cochleogram?
What is the temporal resolution of a cochleogram?
What is a (single) sound?
What is the local SNR and why is it important?
What is masking?
What is the origin of local domination by signal components?
What is background noise?
How can sounds be recognized?
Can the input be fully recognized?
Can a BM help with sound source separation?
What is the advantage of a cochleogram over an FFT-spectrogram?
How does the uncertainty relation limit a cochleogram?
What is the link between the uncertainty relation and signal components?
How do I estimate and use ridges?
Why is the system running so slowly?

What is a cochlea?

The cochlea is the part of the inner ear that is shaped like a snail shell ("cochlea" is Latin for snail shell). In the cochlea, sound vibrations are transduced into neural activity.

What is a basilar membrane?

A membrane of about 35 mm long on the inside of the cochlea, called the basilar membrane (BM), separates two fluid-filled chambers along the length of the cochlea. The transduction of vibrations to neural activity takes place on the BM. One end (at the opening) is sensitive to high frequencies and the other end to low frequencies. Between these two ends the frequency-place relation is (approximately) logarithmic: a factor of two in frequency corresponds to a fixed distance. In the cochlea model (and on the real BM) neighbouring segments are coupled: when one segment vibrates, its neighbours make similar movements. This correlation decreases as the distance between segments increases. The figure below shows a cross-section of the cochlea with the chambers and the BM:

Cochlea and basilar membrane

Why is the BM a special signal processing device?

The spatial and temporal coupling of the BM ensures that acoustic information is represented at a range of BM positions. This causes the temporal development of the (physical) state of sound sources to be represented 1) with high accuracy and 2) without unphysical discontinuities in either the temporal or the frequency domain. Such discontinuities do arise with standard signal processing techniques such as the FFT, most filter banks, and wavelets. Continuity preservation allows optimal tracking and analysis of the development of the physical source that produces the signal.

What does the cochlea model do?

The cochlea model is a numerical model of the mechanics of the human BM that is computed in the time domain (many other models are computed block-wise in the frequency domain, which introduces delay). This means that the response (and the analysis capabilities) of the human auditory system can be approximated with minimal delay. Conceptually, the computation of the cochlea model is similar to the computation of a filter-bank response in which the filters are physically coupled. The cochlea model is a very efficient implementation of this filter bank. Internally, the BM response is computed by simulating the force of the incoming sound (samples) and the forces that each segment exerts on its neighbours. Segments are modelled as mechanical oscillators with mass, stiffness, and damping. More detailed information about the cochlea model can be found in a separate document.
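
To make this concrete, the sketch below treats each segment as a damped mass-spring oscillator that is driven by the incoming sound sample and by a coupling force from its neighbours. This illustrates the principle only, not the model's actual implementation; the function name, the integration scheme, and all parameters are assumptions.

    import numpy as np

    def bm_step(y, v, sound_sample, dt, mass, stiffness, damping, coupling):
        """Advance all BM segments by one time step (semi-implicit Euler).

        y, v: displacement and velocity per segment (1-D arrays).
        Each segment feels the input force, its own spring and damper,
        and a coupling force that pulls it toward its neighbours.
        """
        neighbour_force = coupling * (np.roll(y, 1) + np.roll(y, -1) - 2 * y)
        neighbour_force[0] = coupling * (y[1] - y[0])     # no left neighbour
        neighbour_force[-1] = coupling * (y[-2] - y[-1])  # no right neighbour

        force = sound_sample - stiffness * y - damping * v + neighbour_force
        v = v + dt * force / mass
        y = y + dt * v            # update position with the new velocity
        return y, v

Calling such a step once per sound sample yields the excitation x_s(t) used in the cochleogram below; in this sketch, mass, stiffness, and damping would be arrays that vary along the BM to give each segment its own characteristic frequency.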

What is the difference between the cochlea model and the human cochlea?

The main difference between the cochlea model and the human cochlea is that the human cochlea is strongly non-linear in its output amplitude (due to the limited dynamic range of neurons), while the cochlea model is completely linear. Advantages of a linear model are that the resulting signal is easier to interpret and that distortion products due to non-linearities are avoided. Because of the huge dynamic range of most sounds, the output of the linear system is often best studied on a logarithmic intensity scale such as dB. Note that, in spite of its non-linear implementation, the human perceptive system acts surprisingly linearly.

What is the time-frequency plane?

A time-frequency plane provides information about the energy as a function of time and frequency, E(t,f). FFTs and wavelets approximate the continuous time-frequency plane with coarse blocks. At the centers of these blocks the information is represented differently from the information near the edges. This is a reason to use a cochleogram as an alternative.

What is a cochleogram?

The BM can be used to form a so-called cochleogram, a spectrogram-like representation based on the squared excitation x_s(t)·x_s(t) of BM segment s at time t. The cochleogram leads to a representation of the time-frequency plane because each position corresponds to a narrow frequency range. The continuity-preserving properties of the BM ensure that a coherent region of the time-frequency plane corresponds to a coherent region of the cochleogram, without the bias for special frequencies and times that characterizes FFTs and wavelets (see the previous question). Mathematically, the cochleogram is realized as the distribution of source energy as a function of time and frequency, derived via a low-pass filter (leaky integration L):

      r_s(t) = r_s(t-dt)·exp(-dt/τ) + x_s(t)·x_s(t)
             = L(x_s(t)·x_s(t)),            s = 1 … s_max
      

This formula describes the leaky-integration process: a fraction exp(-dt/τ), which is close to but smaller than 1, of the energy in the previous time step r_s(t-dt) is retained, while the new squared BM excitation x_s(t)·x_s(t) is added. The time constant τ determines the memory span of the leaky-integration process. A value of τ = 10 ms is adequate for many common sound sources. After this form of low-pass filtering, the sampling period dt can be increased to 5 to 10 ms.
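
In code the leaky integration is a one-line recursion per sample. The sketch below (variable and function names are illustrative) also shows the subsequent increase of the sampling period by keeping one frame per 5 ms:

    import numpy as np

    def cochleogram(x, fs, tau=0.010, frame_step=0.005):
        """Leaky integration of the squared BM excitation x (n_seg x n_samples).

        Implements r_s(t) = r_s(t-dt)*exp(-dt/tau) + x_s(t)^2 and keeps
        one frame per frame_step seconds.
        """
        dt = 1.0 / fs
        decay = np.exp(-dt / tau)           # close to, but smaller than, 1
        step = int(round(frame_step * fs))  # samples per output frame
        r = np.zeros(x.shape[0])
        frames = []
        for n in range(x.shape[1]):
            r = r * decay + x[:, n] ** 2    # leak old energy, add new energy
            if n % step == 0:
                frames.append(r.copy())     # down-sample to 5 ms frames
        return np.array(frames).T           # n_seg x n_frames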

Why use segment dependent integration?

The leaky-integration process makes the cochleogram an expression of the energy development per segment. In the segment-dependent version the time constant of the low-pass filter is the inverse of the characteristic frequency of the segment concerned (with a minimum value of dt). This means that every part of the cochlea can react to changes on the order of 1 or 2 times its characteristic period. For segments in the high-frequency region the reaction time is much shorter than for segments in the low-frequency region. In a number of ways this resembles a wavelet analysis that is continuous in time and frequency.
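
A minimal sketch of the segment-dependent variant, assuming the characteristic frequencies of the segments are given:

    import numpy as np

    def segment_decay(f_char, dt):
        """Per-segment decay factors: tau_s = max(1/f_char_s, dt), so
        high-frequency segments forget quickly and low-frequency
        segments integrate over a longer time."""
        tau = np.maximum(1.0 / np.asarray(f_char), dt)
        return np.exp(-dt / tau)

    # decay = segment_decay(f_char, dt)     # precompute once
    # r = r * decay + x[:, n] ** 2          # inside the integration loop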

What is a signal component?

A signal component is a single physically coherent signal constituent that can be described by specifying the temporal development of frequency and energy (phase is optional). This definition, in combination with the continuity preserving properties of the BM, ensures that a single signal component will be expressed as a coherent whole. Consequently a signal component corresponds to a coherent region of the time-frequency plane. Possible signal components are, for example, pulses, tones, and noises.

How should I interpret a cochleogram?

Pulses cause narrow vertical structures in the cochleogram and (individual) tones cause narrow horizontal structures, while noises are broad in time and frequency with a fine structure that does not repeat itself (see also "What is the link between the uncertainty relation and signal components?"). Usually a cochleogram consists of a mixture of variants of these main structures. With the standard settings (dt=5 ms, nseg=120, maxf=4000 Hz, τ=10 ms), the visible structures (with the exception of the fine structure of noise) are as a rule part of an auditory percept. When you study the cochleogram in more detail, you may see signal details that are not perceptible. The fine structure of noise is usually irrelevant: humans are as a rule insensitive to the difference between similar noises with different fine structures. There is one important exception, namely when the fine structure of noise is (in part) repeated. This is valuable information that is often very difficult to estimate with alternative signal analysis methods.

What is the dynamical range of a cochleogram?

The dynamic range of a cochleogram is used to normalize the intensity of the input sound. A range of 60 dB is adequate for most natural sounds and corresponds roughly to the dynamic range of human hearing in everyday conditions.
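
A sketch of such a normalization, assuming cochleogram energies r and the default 60 dB range (function name is illustrative):

    import numpy as np

    def to_db(r, dyn_range=60.0):
        """Cochleogram energies in dB relative to the maximum, clipped
        to the chosen dynamic range (default 60 dB)."""
        r_db = 10.0 * np.log10(np.maximum(r, 1e-300))  # avoid log(0)
        r_db -= r_db.max()            # 0 dB at the most energetic point
        return np.clip(r_db, -dyn_range, 0.0)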

What is the frequency range of a cochleogram?

The highest frequency is somewhat higher than the maximum frequency specified in the Signal Companion, the lowest frequency is about 30 Hz.

What is the place-frequency relation?

The place-frequency relation relates segment number to characteristic frequency. In the low-frequency part of the cochlea the place-frequency relation is almost linear; above about 190 Hz the relation is logarithmic (each doubling in frequency corresponds to the same number of segments). In principle it is possible to choose a different place-frequency relation; in practice this is difficult.
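
The sketch below illustrates such a map. The transition frequency and frequency limits follow the text above, but the division of segments between the linear and logarithmic regions is an assumption, so the resulting frequencies are illustrative only:

    import numpy as np

    def place_to_frequency(n_seg=120, f_low=30.0, f_trans=190.0, f_max=4000.0):
        """Illustrative map from segment number to characteristic frequency:
        approximately linear below f_trans, logarithmic above it (a fixed
        number of segments per octave)."""
        n_log = int(round(0.8 * n_seg))     # assumption: most segments log
        n_lin = n_seg - n_log
        f_lin = np.linspace(f_low, f_trans, n_lin, endpoint=False)
        f_log = f_trans * (f_max / f_trans) ** (np.arange(n_log) / (n_log - 1))
        return np.concatenate([f_lin, f_log])   # low to high frequency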

What is the frequency specificity of a segment?

The frequency specificity of a segment is a measure of how specialized a segment is in a certain frequency range. Because of the coupling between segments, the specialization of segments can be greater when the number of segments is larger. Due to the logarithmic place-frequency relation, the difference between neighbouring high-frequency segments can be several hundred Hz, while the corresponding difference between low-frequency segments may be only a few Hz.

What is group delay?

All segments require sufficient information before they can decide how strongly they ought to respond. A segment that is specialized in a narrow frequency range requires more information before it can decide that it ought to respond vigorously. Group delay is a measure of this delay: a longer temporal scope allows a more narrowly tuned frequency response. Mathematically, the group delay of a segment corresponds to the center of gravity of its impulse response. The effects of group delay are visible as the tilt of the impulse response.

What is the relation between the frequency sensitivity of each segment and the group delay?

The frequency sensitivity of each segment and its group delay are inversely related: to achieve a greater frequency sensitivity, the group delay has to be increased. This is a manifestation of the uncertainty relation.

What is the frequency resolution of a cochleogram?

The frequency resolution depends, among other things, on the number of segments per frequency region (determined by both the maximum frequency maxf and the number of segments nseg). For the standard settings (dt=5 ms, nseg=120, maxf=4000 Hz) the frequency difference between neighbouring segments is about 4%. Components with similar energy levels that differ by more than 8% in frequency are separated by a valley. When the difference in frequency is less than 8% and the components have somewhat different energy levels, frequency information can still be deduced from the amplitude modulation of the energy. The frequency resolution that can be obtained during the analysis of a signal is a function of time: the longer a signal is stable, the smaller the frequency uncertainty.
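
The 4% figure can be checked with a rough calculation, assuming that the range from about 30 Hz to 4000 Hz is covered mostly logarithmically by 120 segments:

    import math

    octaves = math.log2(4000 / 30)   # the 30..4000 Hz range spans ~7.1 octaves
    ratio = 2 ** (octaves / 120)     # frequency ratio between neighbouring segments
    print(ratio - 1)                 # ~0.04, i.e. about 4% per segment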

What is the temporal resolution of a cochleogram?

The temporal resolution is equal to dt; it ought to correspond to the temporal detail one is interested in. When dt << τ the number of frames is much higher than required for a faithful representation of temporal detail. The minimal value of dt can in principle be chosen as small as the sample period (1/fs) of the input signal, although this is rarely useful. A suitable lower bound is 0.1 ms. This choice requires considerable computer memory to store and visualize the cochleograms.

What is a (single) sound?

In the context of the cochlea model a (single) sound is defined as the set of all (relevant) signal components stemming from the same physical source. Often a physical source can be defined on multiple levels: a choir consists of many individual singers, but because of the strong correlation between the individual sources (a defining characteristic of a good choir!) one might treat a choir singing two different parts, one for the male and one for the female voices, as two sources that produce two sounds.

What is the local SNR and why is it important?

The local SNR is the local signal-to-noise ratio, a term used when the target signal component is defined as the signal and everything but the target component as the noise. A better name, though, is target-to-non-target ratio (TNR). This ratio tells how reliable the information is that can be estimated from a signal component without having to rely on context. (Measured in dB, a higher SNR means relatively less noise.) In the context of CPSP the global SNR is of less importance than the local SNR, where "local" typically refers to one or a few segments and a short interval. A local SNR of 6 to 10 dB or better allows reliable information about signal components to be estimated. For a local SNR between -3 dB and +6 dB, progressively more useful information can be derived, but at the lowest SNRs additional information (prior knowledge, top-down information, or a model) is required to disambiguate the available evidence. Note that the global SNR can be -30 dB or worse while still allowing a positive local SNR for some component: for example when the target is a single sinusoid and the noise is predominantly in a different frequency range. Masking is also possible in the temporal direction: a loud bang might mask a tick 50 ms later, because its effects still dominate the cochleogram.
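
In dB terms, the local SNR over a small region can be computed as follows (a sketch; it assumes separate estimates of the target energy and the total energy are available, which in practice must themselves be estimated):

    import numpy as np

    def local_snr_db(target_energy, total_energy):
        """Local SNR (target-to-non-target ratio) in dB over a small
        cochleogram region: a few segments, a short interval."""
        noise = np.maximum(total_energy - target_energy, 1e-300)
        return 10.0 * np.log10(target_energy / noise)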

What is masking?

Perceptually, a sound is masked by other sounds when it cannot be detected in their presence while it can be detected in isolation. This corresponds, to a good approximation, to a situation in which a sound or signal component cannot be identified by visual inspection of a cochleogram with default settings. Typically this entails that no part of its signal components has a sufficiently high (positive) local SNR to allow an accurate signal description. When a priori knowledge (possibly derived from the first part of the sound or from repetitions) is available, a signal detection task changes into a hypothesis-checking task, which, because of the reduced uncertainty, requires less evidence. The only contribution of a masked signal is that it raises the total energy slightly.

As an example, when the cochlea model receives a single sinusoid as input, the entire BM vibrates. However, the part of the BM that is most sensitive to the frequency of the sinusoid vibrates most. Frequency components that cause a less powerful BM excitation than the sinusoid are not visible: they are masked by the sinusoid. The decrease in sensitivity is smaller towards high frequencies than towards low frequencies, so masking is stronger from low to high frequencies than vice versa.

What is the origin of local domination by signal components?

Local domination of cochleogram regions is important because most regions are dominated by a single source: the probability that two or more uncorrelated signal components are neither masked nor dominant is small, because this requires the signals to have 1) similar timing, 2) similar frequency ranges, and 3) similar energies. Signal energy is an important factor, since source properties in combination with transmission effects (such as those due to the spatial distance to the source) have strong effects on the signal intensity and consequently on the SNR. Signal energies often span many orders of magnitude, while the range for being neither masked nor dominant is a narrow region of about ±3 dB around 0 dB SNR.

What is background noise?

Background noise is special because 1) it is always present in some form, 2) it consists, by definition, entirely of signal components that are not dominant, and 3) its acoustic evidence cannot be assigned to a specific sound source. The existence of some background noise level is inevitable because of (in order of increasing importance):

  • numerical noise, which is below -120 dB since the cochlea model works with doubles,
  • quantization noise (depending on the number of bits per sample; about 90 dB below the maximum level for 16-bit samples),
  • limitations in the measurement setup (such as a limited dynamic range of the microphones or transmission channel), or
  • the presence of many uncorrelated sound sources of which the energy peaks in the time-frequency plane provide a more or less constant level.

The last class is often the most important. Note that each independent (noise) source that exceeds the background level sufficiently will appear as an individual signal component and will, under this definition, consequently not be assigned to the background noise. Take for example cocktail-party noise: the audible part of this noise consists mainly of the spectral peaks of many speakers. A problem with this definition is the subjective nature of requirement 3: individual words may now and again "pop out", and since these can be assigned to a single speaker they are strictly not part of the background noise. This is, however, only true for a listener who knows the language. A work-around for this subjective definition is to assign all parts of the signal that comply with the statistics of the noise to the background noise, and to treat deviant signal components as potentially interesting.

How can sounds be recognized?

A sound is recognizable when it is possible to derive sufficient information from the input signal to reduce the uncertainty about the interpretation of the sound to zero. There are two related situations: with and without a priori knowledge. Without accurate a priori knowledge the signal must contain enough information to elicit a set of interpretation hypotheses that contains the correct interpretation. With a priori knowledge this set is (in part) available beforehand (via any possible means). Selecting the correct hypothesis requires the estimation of evidence that is consistent with at least one interpretation. Inconsistencies lead to the deactivation of the corresponding hypotheses. The most supported consistent interpretation is the best possible recognition result. All (good) sound recognition systems require models describing the possible dynamics of all target sounds. Sometimes the required dynamics are rather simple, as for vehicle sounds; sometimes the models are extremely complex, as for speech and music.

Can the input be fully recognized?

There are two perspectives: that of an all-knowing judge and that of a limited system. The all-knowing judge (approximated by a human listener) knows how the signal ought to be interpreted in the most concise and informative way. Generally this entails that the input has been dissected into its constituting sounds and that each sound has been recognized and described. When a system has reached this interpretation it has processed the input fully. This is an ideal that is not realistic with the current state of the art. A more realistic approach takes the perspective of a limited system that has some general knowledge about sounds and some more detailed knowledge of particular sounds (such as key-words or aircraft sounds). Such a system works correctly when, for all possible inputs, all relevant knowledge has been applied correctly and the signal has been described as concisely as possible.

Can a BM help with sound source separation?

At any given time most BM positions are dominated by a single sound source. Studying the spectro-temporal development of this domination yields information about the dominating source. Often the same source dominates multiple positions in a way that is consistent (i.e. predictable) with the properties of the sound source. A typical example is speech, in which individual harmonics (integer multiples of some fundamental frequency contour) dominate regions that can be predicted once the fundamental period contour has been estimated. This makes it relatively easy to combine information from the same source, stemming from different spectro-temporal regions, into a single representation, and therefore to separate sounds.

What is the advantage of a cochleogram over an FFT-spectrogram?

Generally an FFT is preferred when its convenient mathematical properties can be exploited, and a cochleogram is favoured for an analysis of the physical properties of the original source.

Advantages of the FFT-spectrogram

  • Very efficient implementation
  • Good enough for many purposes
  • Perfect inverse transformation of FFT
  • Many convenient symmetries
  • Well known, extensive literature
  • Existence of many ad hoc ways around limitations

Advantages of the cochleogram

  • No introduced discontinuities and consequently a faithful representation of the development of individual signal components
  • What you see is what you hear
  • Everything that is visible is physically relevant
  • Approach based on human, evolutionary optimized, sound processing
  • Allows correct application of quasi-stationarity
  • No preference for special signals, times or frequencies
  • Yields reliable information for sound source separation
  • Requires minimal mathematical background
  • Allows detailed analysis of temporal development
  • Can be used to analyse detailed phase information
  • Allows direct link with the physics of the source

Disadvantages of the FFT-spectrogram

  • Consists of discontinuous time-frequency blocks and associated contamination of the input signal representation
  • Requires windowing with additional effects on the signal representation (such as reduced frequency resolution)
  • Unjustified application of quasi-stationarity on mixtures of (unknown) sound sources
  • Less faithful representation of rapidly changing signals (opposed to time-constant signals)
  • Yields limited information for sound source separation
  • Requires some mathematical background for safe use

Disadvantages of the cochleogram

  • Requires more computation
  • Not as well known, minimal literature
  • Requires some physical knowledge for safe use
  • Strong masking effects
  • Many properties not well studied and/or documented

How does the uncertainty relation limit a cochleogram?

The uncertainty relation

      Δf Δt ≥ 1    (f in Hz), or
      Δω Δt ≥ 2π  (ω in rad/s)
      

is a fundamental limit on the expression of frequency and temporal information. As such it limits and determines the structures that can be visualized with a cochleogram. The main effect of the uncertainty relation is that it influences the expression of frequency information as a function of time. Take for example a cosine excitation that starts at t=0 with amplitude 1. The first sample that the model processes resembles an impulse at t=0 and leads to an impulse-response-like excitation. The next samples do not return to 0 but resemble a step, and change the initial impulse response into a step-response-like excitation. As more and more samples become available, more and more frequency information becomes available and fewer and fewer segments receive an excitation consistent with their periodicity. Consequently, fewer and fewer segments respond.
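
The same trade-off can be demonstrated independently of the cochlea model: the spectral peak of a windowed sinusoid narrows roughly as the inverse of its duration. A sketch (all parameter values illustrative):

    import numpy as np

    fs = 8000.0                                  # sample rate
    f0 = 1000.0                                  # sinusoid frequency
    for duration in (0.005, 0.02, 0.1):          # 5 ms, 20 ms, 100 ms
        t = np.arange(0, duration, 1 / fs)
        x = np.cos(2 * np.pi * f0 * t)
        spectrum = np.abs(np.fft.rfft(x, n=8192))
        freqs = np.fft.rfftfreq(8192, 1 / fs)
        # Main-lobe width shrinks roughly as 1/duration (Δf·Δt ≈ 1).
        half = spectrum >= spectrum.max() / 2
        width = freqs[half].max() - freqs[half].min()
        print(f"{duration * 1000:5.0f} ms -> main lobe ~ {width:6.1f} Hz")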

What is the link between the uncertainty relation and signal components?

The uncertainty relation leads to a limited number of signal-component families: three kinds of special signals that correspond to the allowed extreme values, plus one prohibited combination:

  • Δf << 1, Δt >> 1: tones (sinusoids). Tones are sinusoids with a constant period and are consequently strongly localized at a certain frequency (Δf is small). The expression of a sinusoid is therefore very narrow (pulse-like) in the frequency domain. By definition, tones must last long, which entails that Δt is large. Tones are therefore not suitable to convey detailed temporal information. Narrow horizontal cochleogram structures signify sinusoids.
  • Δf >> 1, Δt << 1: pulses. Pulses are strongly localized in time: Δt is small. A single pulse conveys minimal periodicity information; all frequencies are represented equally badly, so a pulse excites all frequencies equally and Δf is maximal. Narrow, near-vertical structures signify pulses.
  • Δf >> 1, Δt >> 1: noises. Noises consist of a broad continuum of frequency contributions (Δf is large) and last some time (Δt is large as well). Noises represent acoustic energy without much preference for certain frequencies or points in time.
  • Δf << 1, Δt << 1: impossible. This combination is prohibited by the uncertainty relation.

Apart from these extremes, intermediate forms exist:

  • wavelets, these are short wave packets with a limited range of frequencies,
  • bursts, pulse-like noises (Δf large, Δt small), and
  • narrowband and broadband noises: lasting (not pulse-like) noises with a more or less pronounced frequency preference.

How do I estimate and use ridges?

Ridges are defined as strings of neighbouring peaks in the cochleogram. Many ridges correspond to (very) narrowband signals that are, to a good approximation, periodic. Ideally they appear as more or less horizontal lines in the cochleogram. Ridges are most useful when they correspond to tonal signal components. In this case it is possible to compute a local instantaneous frequency (LIF) of the signal component. Note that the LIF is only defined for narrowband signals whose frequency content at each moment can reasonably be approximated by a single number. Many ridges (sequences of cochleogram peaks) are estimated in noise. This is difficult to prevent, because some valid ridges, with useful periodicity information, can be estimated in very noisy conditions. Currently, ridges and their LIF development are only computed when dt is set to 5 ms or more. Ridge estimation may require considerable computational resources (up to 40% of the total).
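
A sketch of ridge estimation as peak tracking over cochleogram frames; the linking rule and thresholds below are illustrative assumptions, not the actual implementation:

    import numpy as np

    def find_ridges(coch, max_jump=1, min_length=5):
        """Link per-frame local maxima of a cochleogram (n_seg x n_frames)
        into ridges: strings of neighbouring peaks over time."""
        n_seg, n_frames = coch.shape
        ridges, active = [], {}            # active: last segment -> ridge
        for t in range(n_frames):
            col = coch[:, t]
            peaks = [s for s in range(1, n_seg - 1)
                     if col[s] > col[s - 1] and col[s] > col[s + 1]]
            new_active = {}
            for s in peaks:
                # Continue a ridge whose last peak is within max_jump segments.
                match = next((k for k in active if abs(k - s) <= max_jump), None)
                ridge = active.pop(match) if match is not None else []
                ridge.append((t, s))
                new_active[s] = ridge
            # Ridges that were not continued are finished; keep long ones.
            ridges.extend(r for r in active.values() if len(r) >= min_length)
            active = new_active
        ridges.extend(r for r in active.values() if len(r) >= min_length)
        return ridges                      # each ridge: list of (frame, segment)

For each ridge of a tonal component, the LIF could then be estimated from the segment positions (via the place-frequency relation) and refined with the amplitude or phase development along the ridge.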

Why is the system running so slowly?

The calculation of a cochleogram on a 1000 MHz computer should easily be faster than real time with the standard settings (dt=5 ms, nseg=120, maxf=4000 Hz). A 2400 MHz laptop can process 340 segments with maxf=4000 Hz in real time. Sometimes background processes, such as a virus checker, delay the calculations.


[2004] Written by Tjeerd Andringa, Maria Niessen, and Maartje Nillesen