 
    
    
    
      
 
 
Experiments on natural speech (see ICP-MIAMI report 94-1) allow us to
anticipate that similar effects will be obtained with a TtAVS synthesizer:
even if the current quality of (most) TtS systems is not as bad as that of
highly degraded speech, it is obvious that under very quiet conditions
synthesizers are much less intelligible than humans.  Moreover, it is
realistic to predict that in the near future the spread of speech
synthesizers will lead to their wide use in noisy environments, such as
railway stations.  Such adverse conditions will necessitate the
synchronized presentation of information through another modality, for
instance the orthographic display of the text or the animation of a
synthetic face (especially helpful for foreigners and illiterates).  There
are hence several reasons for studying and using Audio-Visual Speech
Synthesis:
-  Audio-Visual Speech Synthesis allows investigators to accurately
  control stimuli for perceptual tests on bimodality:
  Massaro and Cohen [212] studied how speech perception is
  influenced by information presented to ear and eye by dubbing acoustic
  tokens generated by a speech synthesizer (see [158]) onto a
  sequence of images generated by a facial model (see [258], as
  modified by Pearce et al. [261]).  Highly controlled
  synthetic stimuli thus allowed them to investigate the McGurk effect
  in detail (a minimal sketch of such a factorial stimulus design is
  given after this list).
 
-  Audio-Visual Speech Synthesis is also a tool for basic research on
  speech production:
  Pelachaud studied the relationship between intonation and facial
  expression by means of natural speech [262] and the facial model
  developed by Platt [272].
 
-  Thanks to the increasing capacities of computer graphics, highly
  natural (or hyper-realistic) rendering of 3D synthetic faces now allows
  movie producers to create synthetic actors; given their human-like
  quality and the demands of the audience, their facial gestures have to
  be coherent with their acoustic production.  Short computer-generated
  movies clearly show this new trend: Tony de Peltrie [18], Rendez-vous
  à Montréal [203], Sextone for President [159], Tin Toy [285],
  Bureaucrat [356], Hi Fi Mike [370], and Don Quichotte [117], among
  others.
 
-  It is hence necessary for computer-assisted artists to be equipped
  with software facilities so that the facial gestures and expressions of
  their characters can be generated easily, quickly, automatically, and in
  a coherent manner.  Several attempts to synchronize synthetic faces with
  acoustic (natural or synthetic) speech can be found in the literature
  [186,140,177,215,238,70,230,263,367,226],
  among others, for (British & American) English, Japanese, and French.
-  Only a few audio-visual synthesizers have ever been integrated into a
  text-to-speech system [367,226,71,138].
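
To make the stimulus control mentioned in the first item concrete, here
is a minimal sketch, in Python, of a factorial audio-visual stimulus
design.  The token inventories and field names are hypothetical
illustrations, not the actual Massaro and Cohen materials; the point is
only that every acoustic token can be dubbed onto every visual
articulation under identical synthesis conditions, yielding congruent
and incongruent (McGurk-type) pairs alike.

  import itertools

  # Hypothetical token inventories; the actual Massaro & Cohen
  # stimuli differed.  Crossing the two inventories produces
  # incongruent pairs such as audio /ba/ with visual /ga/
  # alongside the congruent ones.
  AUDIO_TOKENS = ["ba", "da", "ga"]
  VISUAL_TOKENS = ["ba", "da", "ga"]

  stimuli = [
      {"audio": a, "visual": v, "congruent": a == v}
      for a, v in itertools.product(AUDIO_TOKENS, VISUAL_TOKENS)
  ]
  # 9 stimuli in all; presented in random order, they let the
  # auditory and visual contributions to perception be separated.
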
Unfortunately, most of the authors only reported informal impressions from
colleagues about the quality of their system; as far as I am aware, none
of them has ever quantified the improvement in intelligibility gained by
adding visual synthesis to the acoustic waveform.  I strongly support the
idea that assessment methodologies should be standardized so that the
various approaches can be compared with one another.  The next report will
present the results of intelligibility tests run at the ICP with various
visual (natural and synthetic) displays of the lips, the jaw, and the face
under different conditions of background noise added to the original
acoustic signal.
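
A common way to realize such graded noise conditions is to mix recorded
background noise into the speech at prescribed signal-to-noise ratios.
The following Python sketch shows one such mixing procedure under assumed
parameters; the sample rate, SNR values, and signal stand-ins are all
hypothetical, and the ICP tests' actual procedure is not described here.

  import numpy as np

  def mix_at_snr(speech, noise, snr_db):
      """Add background noise to a speech signal so that the ratio of
      speech power to noise power equals snr_db (in dB)."""
      # Tile or truncate the noise to the length of the speech signal.
      reps = int(np.ceil(len(speech) / len(noise)))
      noise = np.tile(noise, reps)[:len(speech)]
      # Scale the noise so p_speech / (gain**2 * p_noise) = 10**(snr_db/10).
      p_speech = np.mean(np.asarray(speech, dtype=float) ** 2)
      p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
      gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
      return speech + gain * noise

  # Example with synthetic stand-ins (hypothetical 1 s signals at 16 kHz).
  rng = np.random.default_rng(0)
  speech = rng.standard_normal(16000)
  noise = rng.standard_normal(16000)
  for snr in (18, 6, -6, -18):      # hypothetical test conditions in dB
      degraded = mix_at_snr(speech, noise, snr)
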
 
 
    
    
    
      