Source coherence

Roughly speaking, the phoneme realizations that are most easily discriminated by ear are those that are most difficult to distinguish by eye, and vice versa. For instance, /p/, /b/, and /m/ look alike in many languages, although they clearly sound different, and are often grouped together as one viseme. Conversely, speech recognizers often confuse /p/ and /k/, although the two look very different on the speaker's lips. This implies that a synthetic face can readily improve the intelligibility of a speech synthesizer, or of a character's voice in a cartoon, provided the facial movements are coherent with the acoustic flow they are supposed to produce. If they are not, any contradictory information processed by the viewer/listener during bimodal integration may greatly damage the intelligibility of the original message.

This damage can unfortunately result when the movements of the visible articulators are driven by the acoustic flow, e.g., through an acoustic-phonetic decoder. Such a device might involuntarily replicate the well-known McGurk effect [220], in which the simultaneous presentation of an acoustic /ba/ and a visual /ga/ (a predictable decoder error) makes the viewer/listener perceive /da/! I must emphasize that the McGurk effect is very compelling: even subjects who are well aware of the nature of the stimuli fall for the illusion. Moreover, Green et al. found little difference in the magnitude of the McGurk effect between subjects for whom the sex of the voice and of the face presented were either matched or mismatched [129]. They concluded that the mechanism for integrating speech information from the two modalities is insensitive to certain incompatibilities, even when these are perceptually apparent.
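
The viseme grouping described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class names "bilabial" and "velar", the table VISEME_OF, and both helper functions are hypothetical, and only the /p/, /b/, /m/ grouping and the /p/ versus /k/ contrast are taken from the text (placing /g/ with /k/ is an assumption). The idea is to flag the decoder errors that would put the face in visible conflict with the voice.

```python
# Illustrative phoneme-to-viseme table. Only the /p/, /b/, /m/ grouping and
# the /p/ vs /k/ contrast come from the text; the class names and the
# placement of /g/ alongside /k/ are assumptions made for this sketch.
VISEME_OF = {
    "p": "bilabial",
    "b": "bilabial",
    "m": "bilabial",
    "k": "velar",
    "g": "velar",
}


def visually_conflicting(spoken, decoded):
    """Return True if the decoded phoneme belongs to a different viseme
    class than the phoneme actually spoken, i.e. the animated lips would
    contradict the sound."""
    return VISEME_OF[spoken] != VISEME_OF[decoded]


def conflicting_positions(spoken_seq, decoded_seq):
    """Return the indices at which a decoder output would drive the face
    into a viseme class that contradicts the acoustic signal."""
    return [
        i
        for i, (spoken, decoded) in enumerate(zip(spoken_seq, decoded_seq))
        if visually_conflicting(spoken, decoded)
    ]
```

Under these assumptions, a /p/ decoded as /b/ is harmless for the face, since both share one viseme, whereas a /b/ decoded as /g/ yields exactly the acoustic /ba/ with visual /ga/ pairing of the McGurk effect.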

Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995