
Audio-Visual Speech Synthesis


The bimodal information associated with speech is conveyed through audio and visual cues, which are produced coherently at phonation time and transmitted through the auditory and visual channels, respectively. Visual speech cues carry basic information about the visible places of articulation and are associated with the shape of the lips, the position of the tongue, and the visibility of the teeth. This information is usually complementary to that conveyed by the acoustic cues of speech, which are mostly correlated with the voiced/unvoiced character of the sound.

The synthesis of visual speech cues is a problem with a large variety of possible solutions. These range from continuous approaches, where a small set of reference mouth images is variously "distorted" to approximate any possible mouth shape, to parametric approaches, where a synthetic mouth icon is "animated" simply by changing its parameters.
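As a rough illustration of the parametric approach, the following minimal sketch animates a mouth by interpolating linearly between two parameter sets. The parameter names (width, opening, protrusion), their normalized ranges, and the use of linear interpolation are assumptions made for this example only; they are not taken from the MIAMI project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MouthParams:
    """Hypothetical parameter set for a synthetic mouth icon (all values 0..1)."""
    width: float       # assumed: horizontal lip width
    opening: float     # assumed: vertical lip opening
    protrusion: float  # assumed: lip protrusion


def interpolate(a: MouthParams, b: MouthParams, t: float) -> MouthParams:
    """Blend linearly from shape a (t = 0) to shape b (t = 1)."""
    t = min(max(t, 0.0), 1.0)  # clamp t to the valid range
    return MouthParams(
        width=a.width + (b.width - a.width) * t,
        opening=a.opening + (b.opening - a.opening) * t,
        protrusion=a.protrusion + (b.protrusion - a.protrusion) * t,
    )


def animate(a: MouthParams, b: MouthParams, duration_ms: float,
            refresh_hz: float = 25.0) -> list[MouthParams]:
    """Return one parameter set per displayed frame, including both endpoints."""
    steps = max(1, round(duration_ms * refresh_hz / 1000.0))
    return [interpolate(a, b, i / steps) for i in range(steps + 1)]
```

For instance, `animate(closed, open_wide, 200)` with two hypothetical shapes `closed` and `open_wide` would yield the sequence of parameter sets for a 200 ms transition; a renderer would then draw the mouth icon once per set.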

Regardless of the specific methodology employed, the refresh frequency of the mouth shape must be high enough to guarantee the visual representation of the rapid transients of speech, which are very important for comprehension. The refresh frequency is usually 25 Hz or higher, which corresponds to a temporal resolution of 40 ms of speech or finer. The basic articulators that must be reproduced are the lips, teeth and tongue. Secondary indicators such as the cheeks, chin and nose are usually very useful as well. A minimum spatial resolution of 100x100 pixels is needed.
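The relation between refresh frequency and temporal resolution is simply the reciprocal: one frame at 25 Hz covers 1000 / 25 = 40 ms of speech. The short sketch below makes that arithmetic explicit; the function names are chosen for this example and are not part of any existing library.

```python
import math


def frame_period_ms(refresh_hz: float) -> float:
    """Duration of speech covered by one displayed frame, in milliseconds."""
    if refresh_hz <= 0:
        raise ValueError("refresh frequency must be positive")
    return 1000.0 / refresh_hz


def frames_needed(duration_ms: float, refresh_hz: float) -> int:
    """Number of frames required to cover a speech segment of the given length."""
    if refresh_hz <= 0:
        raise ValueError("refresh frequency must be positive")
    return math.ceil(duration_ms * refresh_hz / 1000.0)
```

Under these definitions, `frame_period_ms(25)` gives the 40 ms figure quoted above, and a higher refresh frequency gives a proportionally shorter frame period.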

Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995