Next: Audio-Visual Speech Recognition
Up: Audio-Visual Speech Synthesis
Previous: Animation of synthetic
Experiments on natural speech (see ICP-MIAMI report 94-1) allow us to
anticipate that similar effects will be obtained with a TtAVS synthesizer:
even if the quality of (most) current TtS systems is better than that of
highly degraded speech, it is obvious that even under very quiet conditions
synthesizers are much less intelligible than humans. Moreover, it is
realistic to predict that in the near future the spread of speech
synthesizers will lead to wide use in noisy environments, such as railway
stations. Such adverse conditions will call for a synchronized
presentation of the same information through another modality, for
instance an orthographic display of the text or the animation of a
synthetic face (especially useful for foreigners and illiterates). There
are hence several reasons to study and use Audio-Visual Speech Synthesis.
- Audio-Visual Speech Synthesis allows investigators to accurately
control stimuli for perceptual tests on bimodality:
Massaro and Cohen [212] studied how speech perception is
influenced by information presented to ear and eye by dubbing acoustic
tokens generated by a speech synthesizer [158] onto a
sequence of images generated by a facial model [258], as
modified by Pearce et al. [261]. Highly controlled
synthetic stimuli thus allowed them to investigate the McGurk effect
in detail.
- Audio-Visual Speech Synthesis is also a tool for basic research on
speech production:
Pelachaud studied the relationship between intonation and facial
expression by means of natural speech [262] and the facial model
developed by Platt [272].
- Thanks to the increasing capabilities of computer graphics, highly
natural (or hyper-realistic) rendering of 3D synthetic faces now allows
movie producers to create synthetic actors whose facial gestures must
be coherent with their acoustic production, given their human-like
quality and the expectations of the audience. Short computer-generated
movies clearly show this new trend:
Tony de Peltrie [18], Rendez-vous à
Montréal [203], Sextone for
President [159], Tin Toy [285],
Bureaucrat [356], Hi Fi Mike [370], and
Don Quichotte [117], among others.
- It is hence necessary for computer-assisted artists to be equipped
with software facilities so that the facial gestures and expressions of
their characters can be generated easily, quickly, automatically, and
coherently. Several attempts to synchronize synthetic faces with
acoustic (natural or synthetic) speech may be found in the literature
[186,140,177,215,238,70,230,263,367,226],
among others, for (British and American) English, Japanese, and French.
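The synchronization problem sketched above amounts to turning a timed phoneme sequence (as produced by a TtS front end) into facial targets. The following is a minimal illustrative sketch of that idea; the phoneme inventory, the many-to-one viseme grouping, and the durations are hypothetical placeholders, not the scheme of any system cited here.

```python
# Hypothetical coarse phoneme-to-viseme table: several phonemes share
# the same visible mouth shape (a many-to-one mapping).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "u": "rounded", "o": "rounded", "w": "rounded",
    "a": "open", "e": "spread", "i": "spread",
}

def viseme_track(phonemes):
    """Turn (phoneme, duration_ms) pairs into (viseme, start_ms, end_ms)
    keyframes, merging consecutive identical visemes so the synthetic
    face does not retarget the same shape needlessly."""
    track, t = [], 0
    for phon, dur in phonemes:
        vis = PHONEME_TO_VISEME.get(phon, "neutral")
        if track and track[-1][0] == vis:
            # Same mouth shape as before: extend the previous keyframe.
            track[-1] = (vis, track[-1][1], t + dur)
        else:
            track.append((vis, t, t + dur))
        t += dur
    return track

# Example: a made-up word /mobi/ with per-phoneme durations.
print(viseme_track([("m", 80), ("o", 120), ("b", 70), ("i", 110)]))
# → [('bilabial', 0, 80), ('rounded', 80, 200),
#    ('bilabial', 200, 270), ('spread', 270, 380)]
```

A real system would additionally interpolate between keyframes and model coarticulation, but the timeline structure is the common core.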
- Few audio-visual synthesizers have ever been integrated into a
text-to-speech system [367,226,71,138].
Unfortunately, most authors have only reported informal impressions from
colleagues about the quality of their system; as far as I am aware, none
of them has ever quantified the improvement in intelligibility obtained
by adding visual synthesis to the acoustic waveform. I strongly support
the idea that assessment methodologies should be standardized so that the
various approaches can be compared to one another. The next report will
present results of intelligibility tests run at the ICP with various
visual (natural and synthetic) displays of the lips, the jaw, and the
face under different conditions of background noise added to the original
acoustic signal.
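The quantitative assessment argued for above typically reduces to a percent-correct score per presentation condition, from which the gain contributed by the visual channel can be read off directly. The sketch below illustrates that computation; the stimulus set and the response data are invented placeholders, not ICP results.

```python
def intelligibility(responses, targets):
    """Intelligibility as the percentage of stimuli correctly reported."""
    correct = sum(r == t for r, t in zip(responses, targets))
    return 100.0 * correct / len(targets)

def audiovisual_gain(audio_score, av_score):
    """Absolute gain (in percentage points) from adding the visual channel."""
    return av_score - audio_score

# Made-up example data: five consonant-vowel stimuli, with listener
# responses in an audio-only and an audio-plus-synthetic-face condition.
targets  = ["ba", "da", "ga", "ma", "pa"]
audio    = ["ba", "ga", "ga", "na", "ta"]   # audio-only responses
audiovis = ["ba", "da", "ga", "ma", "ta"]   # audio + synthetic face

a  = intelligibility(audio, targets)      # 40.0 % correct
av = intelligibility(audiovis, targets)   # 80.0 % correct
print(audiovisual_gain(a, av))            # 40.0 points of gain
```

Repeating this over several signal-to-noise ratios yields the kind of condition-by-condition comparison that would make the cited systems commensurable.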
Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995