Next: Audio-Visual Speech Recognition
Up: Audio-Visual Speech Synthesis
Previous: Animation of synthetic
Experiments on natural speech (see ICP-MIAMI report 94-1) allow us to
anticipate that similar effects will be obtained with a TtAVS synthesizer:
even if the quality of (most) current TtS systems is better than that of
highly degraded speech, it is obvious that even under very quiet conditions
synthesizers are much less intelligible than humans. Moreover, it is
realistic to predict that in the near future the spread of speech
synthesizers will lead to wide use in noisy environments, such as railway
stations. Such adverse conditions will call for a synchronized
presentation of the same information through another modality, for
instance an orthographic display of the text or the animation of a
synthetic face (especially useful for foreigners and illiterates). There
are hence several reasons to study and use Audio-Visual Speech Synthesis.
- Audio-Visual Speech Synthesis allows investigators to accurately
control stimuli for perceptual tests on bimodality:
Massaro and Cohen [212] studied how speech perception is
influenced by information presented to ear and eye by dubbing acoustic
tokens generated by a speech synthesizer [158] onto a
sequence of images generated by a facial model [258], as
modified by Pearce et al. [261]. Highly controlled
synthetic stimuli thus allowed them to investigate the McGurk effect
in detail.
- Audio-Visual Speech Synthesis is also a tool for basic research on
speech production:
Pelachaud studied the relationship between intonation and facial
expression by means of natural speech [262] and the facial model
developed by Platt [272].
- Thanks to the increasing capabilities of computer graphics, highly
natural (or hyper-realistic) rendering of 3D synthetic faces now allows
movie producers to create synthetic actors whose facial gestures must
be coherent with their acoustic production, given their human-like
quality and the expectations of the audience. Short computer-generated
movies clearly show this new trend:
Tony de Peltrie [18], Rendez-vous à
Montréal [203], Sextone for
President [159], Tin Toy [285],
Bureaucrat [356], Hi Fi Mike [370], and
Don Quichotte [117], among others.
- It is hence necessary for computer-assisted artists to be equipped
with software facilities so that the facial gestures and expressions of
their characters can be generated easily, quickly, automatically, and
coherently. Several attempts to synchronize synthetic faces with
acoustic (natural or synthetic) speech may be found in the literature
[186,140,177,215,238,70,230,263,367,226],
among others, for (British and American) English, Japanese, and French.
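The synchronization problem sketched above amounts to turning a timed phoneme sequence (as produced by a TtS front end) into facial targets. The following is a minimal illustrative sketch of that idea; the phoneme inventory, the many-to-one viseme grouping, and the durations are hypothetical placeholders, not the scheme of any system cited here.

```python
# Hypothetical coarse phoneme-to-viseme table: several phonemes share
# the same visible mouth shape (a many-to-one mapping).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "u": "rounded", "o": "rounded", "w": "rounded",
    "a": "open", "e": "spread", "i": "spread",
}

def viseme_track(phonemes):
    """Turn (phoneme, duration_ms) pairs into (viseme, start_ms, end_ms)
    keyframes, merging consecutive identical visemes so the synthetic
    face does not retarget the same shape needlessly."""
    track, t = [], 0
    for phon, dur in phonemes:
        vis = PHONEME_TO_VISEME.get(phon, "neutral")
        if track and track[-1][0] == vis:
            # Same mouth shape as before: extend the previous keyframe.
            track[-1] = (vis, track[-1][1], t + dur)
        else:
            track.append((vis, t, t + dur))
        t += dur
    return track

# Example: a made-up word /mobi/ with per-phoneme durations.
print(viseme_track([("m", 80), ("o", 120), ("b", 70), ("i", 110)]))
# → [('bilabial', 0, 80), ('rounded', 80, 200),
#    ('bilabial', 200, 270), ('spread', 270, 380)]
```

A real system would additionally interpolate between keyframes and model coarticulation, but the timeline structure is the common core.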
- Few audio-visual synthesizers have ever been integrated into a
text-to-speech system [367,226,71,138].
Unfortunately, most authors have only reported informal impressions from
colleagues about the quality of their system; as far as I am aware, none
of them has ever quantified the improvement in intelligibility obtained
by adding visual synthesis to the acoustic waveform. I strongly support
the idea that assessment methodologies should be standardized so that the
various approaches can be compared to one another. The next report will
present results of intelligibility tests run at the ICP with various
visual (natural and synthetic) displays of the lips, the jaw, and the
face under different conditions of background noise added to the original
acoustic signal.
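The quantitative assessment argued for above typically reduces to a percent-correct score per presentation condition, from which the gain contributed by the visual channel can be read off directly. The sketch below illustrates that computation; the stimulus set and the response data are invented placeholders, not ICP results.

```python
def intelligibility(responses, targets):
    """Intelligibility as the percentage of stimuli correctly reported."""
    correct = sum(r == t for r, t in zip(responses, targets))
    return 100.0 * correct / len(targets)

def audiovisual_gain(audio_score, av_score):
    """Absolute gain (in percentage points) from adding the visual channel."""
    return av_score - audio_score

# Made-up example data: five consonant-vowel stimuli, with listener
# responses in an audio-only and an audio-plus-synthetic-face condition.
targets  = ["ba", "da", "ga", "ma", "pa"]
audio    = ["ba", "ga", "ga", "na", "ta"]   # audio-only responses
audiovis = ["ba", "da", "ga", "ma", "ta"]   # audio + synthetic face

a  = intelligibility(audio, targets)      # 40.0 % correct
av = intelligibility(audiovis, targets)   # 80.0 % correct
print(audiovisual_gain(a, av))            # 40.0 points of gain
```

Repeating this over several signal-to-noise ratios yields the kind of condition-by-condition comparison that would make the cited systems commensurable.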
Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995