Roughly speaking, the phoneme realizations
that are the most easily discriminated by ear are those which are the
most difficult to distinguish by eye, and vice versa. For instance,
/p/, /b/, and /m/ look alike in many languages, although they obviously
sound different, and are often grouped together as one viseme. On the other
hand, speech recognizers often confuse /p/ and /k/, although
they look very different on the speaker's lips. This implies that a
synthetic face can easily improve the intelligibility of a speech
synthesizer or of a character's voice in a cartoon if the facial movements
are coherent with the acoustic flow that is supposed to be produced by
them. If not, any contradictory information processed during the bimodal
integration by the viewer/listener may greatly damage the intelligibility
of the original message. This dramatic effect can unfortunately result if
the movements of the visible articulators are driven by the acoustic flow,
e.g., through an acoustic-phonetic decoder. Such a device might
inadvertently replicate the well-known McGurk effect [220], where
the simultaneous presentation of an acoustic /ba/ and of a visual /ga/ (a
predictable decoder error) makes the viewer/listener perceive a /da/! I
must emphasize that the McGurk effect is very compelling, as even subjects
who are well aware of the nature of the stimuli fall for the illusion.
Moreover, Green et al. found little difference in the magnitude of the
McGurk effect between subjects for whom the sex of the voice and the face
presented were either matched or mismatched [129]. They
concluded that the mechanism for integrating speech information from the
two modalities is insensitive to certain incompatibilities, even when they
are perceptually apparent.
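
The complementarity described above can be illustrated with a minimal sketch. The viseme table below is hypothetical and deliberately tiny (the class names and the set of phonemes are assumptions made for illustration, not a table taken from this report). It only shows that /p/, /b/, and /m/ collapse into one visual class while /p/ and /k/ do not, and how a synthetic face driven by a decoder error, such as /ga/ shown over an acoustic /ba/, could be flagged as visually incoherent.

```python
# Hypothetical phoneme-to-viseme table; class names are illustrative assumptions.
VISEME_OF = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "k": "velar", "g": "velar",
    "t": "alveolar", "d": "alveolar",
}


def visually_distinct(a, b):
    """Return True if the two phonemes belong to different viseme classes."""
    return VISEME_OF[a] != VISEME_OF[b]


def is_coherent(acoustic_phoneme, displayed_phoneme):
    """Return True if the face shows the viseme class of the phoneme heard."""
    return VISEME_OF[acoustic_phoneme] == VISEME_OF[displayed_phoneme]


if __name__ == "__main__":
    print(visually_distinct("p", "b"))  # same viseme class
    print(visually_distinct("p", "k"))  # different viseme classes
    print(is_coherent("b", "g"))        # acoustic /ba/ with a visual /ga/
```

Such a check says nothing about what the viewer/listener will actually perceive; it only marks the cases in which the visual channel contradicts the acoustic one.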
Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995