This document provides the theoretical basis of Continuity Preserving Signal Processing (CPSP). CPSP is a convenient tool for the analysis of signals stemming from physical processes because it allows an analysis in terms of representations (called signal components) that are likely to stem from a single process.
Signal components, explained in more detail below, are subsets of the signal that ultimately correspond to individual physical processes and subprocesses. In many cases a signal contains contributions from multiple physical processes, so it is important to base conclusions about the processes that produced the signal on representations that do not mix information from multiple physical processes. CPSP facilitates this in a way that is favoured by evolution.
This document explains a number of important concepts of CPSP. It requires very little background in signal processing and mathematics. It starts with a very short introduction of the auditory model that is used. This model is used to derive cochleograms, in which signal components appear as members of a small family. It is assumed that all sounds can be described as combinations of the members of this family, which are introduced and discussed next. The document ends with a discussion of the analysis of mixtures of sounds. A more elaborate text on CPSP can be found in the first chapter of the thesis of Tjeerd Andringa.
The first processing step in our auditory system is the conversion of sound vibrations into neural information. This task is performed at the basilar membrane (BM), a structure in our inner ear, or cochlea, that is in some aspects similar to a vibrating string. Its anatomy and function are beautifully visualized in a movie (2 MB) from Cornell University. The BM is schematically visualized in the figure below:
The lower drawing shows the uncoiled basilar membrane model and some of the 3000 hair cells that transduce vibrations into electrical potentials, which in turn are changed into neural spikes. The actual BM, in combination with the hair cells and the spiking neurons, behaves in a strongly non-linear way. This allows the cochlea to process a dynamic range of more than 100 dB (ten orders of magnitude). Although the numerical model that is used is completely linear, its output is translated into
Key points of the mammalian basilar membrane are that
Commonly used forms of (speech) signal preprocessing do not preserve continuity: continuity preservation is in fact difficult to guarantee without a good model of a natural basilar membrane. A suitable time-domain model of the basilar membrane was originally developed by prof. H. Duifhuis and adapted for commercial use by Sound Intelligence.
At the level of the BM, continuity preservation entails that the excitation xs(t) at time t of BM-model segment s is very similar to xs+1(t+dt), where dt is a very small time step (for numerical reasons about 5 times the Nyquist period) and s+1 a neighboring segment. This double continuity is important because it entails that a continuous development of a sound source, such as a voice, will be preserved for further processing. A preprocessing stage such as one based on the well-known and generally applied Fast Fourier Transform (FFT) preserves continuity in neither time nor frequency. This is not necessarily wrong, but it is suboptimal, and its safe application cannot be justified physically for unknown mixtures of signals. The same is true for other signal processing techniques like LPC/PLP, wavelets, and many auditorily inspired signal processing techniques. (More information about the numerical model of the cochlea can be found in a separate file.)
Parameters of the auditory model
There are three parameters of the auditory model that directly influence the way a signal is processed and the way it is represented in the cochleogram (see the next section):
The basilar membrane model can be used to form a so-called cochleogram: a spectrogram-like representation, based on the squared excitation xs(t)xs(t) of BM-segment s at time t. The cochleogram indicates the distribution of source energy as a function of time and frequency. A cochleogram rs(t) is derived from a leaky-integration process (or low-pass filtering L):

rs(t) = exp(-dt/τs) · rs(t-dt) + xs(t)xs(t)
This formula denotes the leaky-integration process L in which a fraction exp(-dt/τs), close to but smaller than 1, of the energy in the previous time step rs(t-dt) is retained, and to which the new squared BM excitation xs(t)xs(t) is added. The time constant τs determines the scope of memory of the leaky-integration process: the larger τs, the more the temporal information is smoothed. A value of τs = 10 ms smooths out details that are smaller than about 10 ms. It is therefore adequate to visualize the main structures of speech, because the lowest frequency in most speech is about 100 Hz, which corresponds to a period of 10 ms (10 ms = 1/100 Hz), and speech is mostly stationary over intervals of about 10 ms. After segment-independent integration (i.e., a common τs for all segments) all temporal details smaller than the time constant are smoothed out and the sampling period can be increased, without further loss of information, to a frame size of 5 or 10 ms. This is shown in the left panel below for the word "nul" spoken by a female speaker in Dutch.
The horizontal axis corresponds to time in seconds and the vertical axis to frequency in Hz. The place-frequency relation is, like that of the real BM, approximately logarithmic. The color coding corresponds to energy in dB. The more or less horizontal black lines are explained below.
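The leaky-integration step can be sketched in a few lines of code. This is a minimal illustration, not the Sound Intelligence implementation; a sinusoid stands in for the BM excitation of a single segment:

```python
import numpy as np

fs = 8000                      # sample rate in Hz
dt = 1.0 / fs                  # time step
tau = 0.010                    # integration time constant: 10 ms
a = np.exp(-dt / tau)          # retained fraction, close to but smaller than 1

# Stand-in for the excitation x_s(t) of one BM segment: a 400 Hz sinusoid.
t = np.arange(0, 0.2, dt)
x = np.sin(2 * np.pi * 400 * t)

# Leaky integration: r_s(t) = exp(-dt/tau) * r_s(t - dt) + x_s(t)^2
r = np.empty_like(x)
acc = 0.0
for i, xi in enumerate(x):
    acc = a * acc + xi * xi
    r[i] = acc

# Details faster than ~10 ms are smoothed out, so the output can be
# downsampled to one value per 5 ms frame without further loss.
frames = r[:: int(0.005 * fs)]

# In steady state r approximates (mean power) / (1 - a); a unit-amplitude
# sinusoid has mean power 0.5.
print(frames[-1] * (1 - a))    # close to 0.5
```

Note that the integrator never resets: each frame value summarizes the recent past with an exponentially decaying memory of scope τs.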
It is also possible to perform segment-dependent integration. This allows, like a wavelet analysis, maximal temporal detail at the high-frequency side and maximal frequency detail at the low-frequency side. The integration time constant of each segment is limited by the characteristic period of the segment and by the frame size. The characteristic period of a segment is the duration of a single period at the frequency the segment is specialized in. To prevent the visualization of individual oscillations, the time constant of a segment is chosen to be at least its characteristic period.
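One plausible way to choose such segment-dependent time constants is sketched below; the exact rule used by the model is not given in this text, so the number of characteristic periods and the bounds are our assumptions:

```python
import numpy as np

def segment_time_constants(char_freqs_hz, frame_size=0.010, n_periods=1.0):
    """Illustrative segment-dependent integration time constants.

    Each tau_s covers at least n_periods characteristic periods, so that
    individual oscillations are smoothed out, and at most the frame size,
    so that the result can still be downsampled to frames.
    """
    char_period = 1.0 / np.asarray(char_freqs_hz, dtype=float)
    return np.clip(n_periods * char_period, None, frame_size)

# High-frequency segments get short time constants (maximal temporal detail);
# low-frequency segments get longer ones, capped at the frame size.
taus = segment_time_constants([4000.0, 1000.0, 100.0, 50.0])
print(taus)   # [0.00025  0.001  0.01  0.01]
```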
There are a few parameters that influence the way the signal is represented in the cochleogram (apart from the way it is processed by the auditory model):
The basilar membrane model is, like the natural basilar membrane, not equally sensitive to all frequencies. For many signal processing applications this is not an advantage, since it obscures the relative importance of different frequency ranges. To be (to a large extent) independent of the transfer function of the cochlea model, a reference signal can be used. The mean energy per segment of this signal is used as reference energy for that segment. When a reference signal is used with absolute energy set to TRUE, all energy values are expressed as fractions of the energy in the reference signal. For example, a value of 20 dB corresponds to a signal that is 100 times as energetic as the average value of the reference signal. A value of -10 dB corresponds to a value that is 10 times smaller than the reference value.
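The conversion to dB relative to a reference can be sketched as follows (the function name is ours, not part of the model):

```python
import numpy as np

def energy_db(energy, reference_energy):
    """Express an energy as dB relative to the per-segment reference energy."""
    return 10.0 * np.log10(energy / reference_energy)

ref = 2.5                         # mean energy of a segment in the reference signal
print(energy_db(100 * ref, ref))  # 20.0 dB: 100 times the reference energy
print(energy_db(0.1 * ref, ref))  # -10.0 dB: 10 times smaller than the reference
```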
Physical processes distribute their energy over the time-frequency plane (of which a cochleogram is a visualization). The energy distribution can be localized in time (as with a knock on the table), localized in frequency (as with a bell), or broad in frequency as well as in time (in the case of stationary noises). Energy distributions in the time-frequency plane that are coherent (a single whole) are called signal components. They can be described by specifying the temporal development of frequency and energy (phase is optional). The uncertainty relation for signals entails that there is always a trade-off between time and frequency, mathematically expressed as follows:
Δf Δt ≥ 1 (f in Hz), or Δω Δt ≥ 2π (ω in radians per second)
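The trade-off can be checked numerically: a rectangular pulse of duration Δt has its first spectral null at 1/Δt, so the product of its duration and the width of its spectral main lobe is 1. A small illustration using an FFT (the signal parameters are ours):

```python
import numpy as np

fs = 1000                       # sample rate in Hz
n = 1000                        # one second of signal
dt_pulse = 0.1                  # pulse duration: 100 ms

x = np.zeros(n)
x[: int(dt_pulse * fs)] = 1.0   # rectangular pulse of duration dt_pulse

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(n, 1.0 / fs)

# The first spectral null marks the width of the spectral main lobe.
first_null = freqs[1:][spectrum[1:] < 1e-9][0]
print(first_null, first_null * dt_pulse)   # 10.0 Hz, product 1.0
```

Halving the pulse duration doubles the frequency of the first null, and vice versa: localization in time is bought with spread in frequency.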
In terms of this relation we can define some classes: pulses, tones and noise, successively discussed below.
Signal components - pulses
Pulses are strongly localized in time: Δt is small. A single pulse conveys minimal periodicity information; all frequencies are represented equally badly, and consequently a pulse excites all frequencies equally and Δf is maximal. Narrow, near-vertical structures signify pulses. The response of a cochlea model to a perfect pulse is shown in the figure below as the vertical structure starting at t=0.1 s. The pulses starting at t=0.3, 0.5 and 0.8 become gradually broader in time (and are therefore further and further from the ideal realization of a pulse). This is perceptually noticeable: the last "pulse" begins to sound like a burst. (Note that the pulse starting at t=0.5 is similar to a heartbeat, described in the application section, which is in addition low-pass filtered.)
When an ideal impulse (such as the first impulse at t=0.1 s) excites the basilar membrane model, it starts to resonate in a way similar to a bell. There are some important differences: a bell has a strong preference for some frequencies while the BM-model does not, and a normal bell rings out much more slowly than the BM-model. Nevertheless the width of the impulse response seems enormous: it lasts more than 200 ms at the low-frequency side. This is in part an optical effect due to the dynamic range of 60 dB: a small part of the impulse response contains most of the energy.
Note that the start of the BM (sensitive to high frequencies) responds faster (and much more briefly) than the end of the BM (sensitive to low frequencies). This is an effect of group delay.
Signal components - ridges (tones)
Tones are sinusoids with a constant period and are consequently strongly localized at a certain frequency (Δf is small). The expression of a sinusoid is consequently very narrow (pulse-like) in the frequency domain. Per definition, tones last long, which entails that Δt is large. Tones are therefore not suitable to convey detailed temporal information. Narrow horizontal cochleogram structures signify sinusoids and are called ridges. Like pulses, tones come in different flavors. The ideal tone is long, steady and smooth. A two-second example is shown in the first two seconds of the figure below. This signal consists of successive 2-second signals that become broader and broader in frequency. The perceptual effect changes from a steady tone, via a slowly and a rapidly amplitude-modulated tone, to a "hollow"-sounding narrowband noise. The vertical structures at even seconds are pulses that result from concatenating these signals without ensuring a smoothly developing signal. The pulses help to segregate the signal into different perceptual units.
In the case of speech the more or less horizontal ridges indicate harmonics. In the cochleogram of the (clean) sound "nul" (see above) almost all ridges correspond to a single harmonic. The main contribution of the lowest ridge starts with a frequency just above 200 Hz and gradually rises in frequency to about 380 Hz. This ridge corresponds to the first harmonic, and its frequency corresponds to the fundamental frequency development of the word "nul". The second and higher ridges correspond to higher harmonics. The first 19 harmonics (with frequencies up to 4500 Hz) are represented as ridges. In the case of voiced (quasi-periodic) speech we need to combine all evidence of the harmonics of a single source into a single representation: a harmonic complex. For this we need the best instantaneous frequency estimation possible. The local instantaneous frequency (LIF) is the change in phase per time step (dφ/dt), and is therefore only of interest for periodic signal components that change slowly: ridges. The LIF is further explained below.
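How ridge frequencies can be grouped into a harmonic complex can be sketched as follows. This is an illustrative grid search, not the CPSP algorithm, and the assumed pitch range and step size are ours: pick the fundamental that makes all ridge frequencies closest to integer multiples of it.

```python
import numpy as np

def fit_fundamental(ridge_freqs_hz, f0_min=150.0, f0_max=400.0, step=0.1):
    """Grid search for the fundamental whose integer multiples best match
    the given ridge frequencies (assumed pitch range and step are ours)."""
    ridges = np.asarray(ridge_freqs_hz, dtype=float)
    candidates = np.arange(f0_min, f0_max, step)
    # Ratio of every ridge frequency to every candidate fundamental.
    ratios = ridges[None, :] / candidates[:, None]
    # Cost: summed deviation of each ridge from its nearest harmonic number.
    cost = np.abs(ratios - np.round(ratios)).sum(axis=1)
    return candidates[np.argmin(cost)]

# Three ridges that are harmonics of 220 Hz form one harmonic complex.
print(fit_fundamental([220.0, 440.0, 660.0]))   # about 220.0
```

The restricted search range sidesteps the usual subharmonic ambiguity (110 Hz would fit these ridges equally well); a full system would resolve this with additional evidence.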
Signal components - noise
While tones are defined by their periodicity, noises are defined by not being periodic: in noise no repeating structures are present, even though it lasts long. Noises therefore consist of a broad continuum of frequency contributions (Δf is large) and last some time (Δt is large as well). Noises represent acoustic energy without much preference for certain frequencies or points in time. The response of a cochlea model to different noises is shown in the figure below. The signal starts with a slowly developing broadband noise (white noise). Superimposed on the offset of the broadband noise are three other "noises" that become more and more localized in time and frequency (500 Hz). The perception changes from a standard noise to noisy bursts with more and more tonal quality.
Signal components - mixtures
The three kinds of signal components discussed above do give insight into sounds, but they will seldom occur in isolation in a natural environment: natural sounds will (almost) always be a combination of these three components. This knowledge can be used for the classification of sound sources: on the basis of a priori knowledge of source characteristics and auditory scene analysis (ASA), the separate signal components that constitute a sound can be formed and used to classify sound sources. Any sound source that we can recognize must, by logical necessity, produce a sound with sufficient characteristics to allow recognition (or classification). A guitar cannot produce speech because its physics does not allow it to produce speech sounds. Conversely, our vocal system cannot produce the sound of a guitar for the same physical reason. The (physical) constraints of a sound source determine which sounds it can and cannot produce. The main constraint of speech is that it complies with the (very complex) rules of at least one language. The constraints of other sound sources are almost always much simpler. For all sounds it is possible to formulate models by simply combining the (physical) constraints that determine them.
As an example, let us take the sound "nul" in a noisy situation, obtained by adding cocktail-party noise to the clean signal ("nul"). We get a noisy signal (in this case with a global SNR of 0 dB) that not only contains information stemming from the target speaker but also information from other sound sources. Nevertheless the human auditory system has no difficulty (not even on first hearing) with finding, combining and recognizing the components of the noisy word. The cochleogram of this noisy signal is shown below:
Compared to the clean situation, where it was trivial to combine information from a single source in a single representation, we now have to select a subset of the available evidence and discard the rest (or reassign it to other sources). For this subset to be assigned to a certain sound source it must, of course, comply with the constraints that determine this source. In other words, the subset must comply with the source model. As in the clean situation we want to find a harmonic complex as evidence for a single source.
CPSP allows a very accurate estimation of the local instantaneous frequency (LIF) under resolved ridges: usually the error is less than 1%, which is in the same range as human performance. This is very difficult, if not impossible, with frame-based methods, especially in the case of noisy input. The LIF is computed via a running autocorrelation along a ridge s(t):
rs(t)(t, T) = L( xs(t)(t) · xs(t)(t+T) ),   T = [0, Tmax]
The periodic structure of the ridge autocorrelation determines the LIF.
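A minimal sketch of this computation follows. It is a simplified stand-in: a pure sinusoid replaces the BM excitation under a ridge, the leaky integration is normalized, and the search range for the period is our assumption.

```python
import numpy as np

fs = 20000                     # sample rate in Hz
dt = 1.0 / fs
tau = 0.020                    # leaky-integration time constant
a = np.exp(-dt / tau)

f_true = 200.0                 # frequency of the (steady) ridge
t = np.arange(0, 0.2, dt)
x = np.sin(2 * np.pi * f_true * t)   # stand-in for x_s(t) under the ridge

# Running autocorrelation r(t, T) = L( x(t) * x(t + T) ) for T = [0, Tmax].
t_max = 150                    # maximum lag in samples
r = np.zeros(t_max + 1)
for i in range(len(x) - t_max):
    r = a * r + (1 - a) * x[i] * x[i : i + t_max + 1]

# The first peak of the periodic autocorrelation structure gives the period.
lo, hi = 66, 134               # assumed pitch range: roughly 150-300 Hz
lag = lo + np.argmax(r[lo:hi])
# Parabolic interpolation refines the peak position to sub-sample precision.
num = r[lag - 1] - r[lag + 1]
den = r[lag - 1] - 2 * r[lag] + r[lag + 1]
lag_refined = lag + 0.5 * num / den
lif = fs / lag_refined
print(lif)                     # close to 200.0 (error well below 1%)
```

Because the autocorrelation is built up by leaky integration, it follows a slowly changing ridge frequency with the same memory scope τ as the cochleogram itself.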
This figure shows the local instantaneous frequency development under ridges for the clean signal (red circles) and for the noisy signal (filled blue dots). Target ridges that still dominate in the noisy condition enforce the same local response of the BM as in the clean signal, and consequently the derived frequency information is identical. As a result, many of the LIF values of the noisy and the clean condition are almost identical. Visual inspection shows that almost all frequency information at formant positions is unperturbed in the noisy condition.
List of abbreviations
 Written by Tjeerd Andringa, Maria Niessen, and Maartje Nillesen