The intrinsic bimodality of speech communication

In 1989, Negroponte [242] predicted that ``the emphasis in user interfaces will shift from the direct manipulation of objects on a virtual desktop to the delegation of tasks to three-dimensional, intelligent agents parading across our desks'', and that ``these agents will be rendered holographically, and we will communicate with them using many channels, including speech and non-speech audio, gesture, and facial expressions.'' Historically, the talking machine with a human face has been a mystical means to power for charlatans and shamans. In that vein, the first speaking robots were probably the famous statues in ancient Greek temples, whose power as oracles derived from a simple acoustic tube! The statues themselves were inanimate, even though their impressionable listeners attributed a soul (anima) to them because of their supposed speech competence. If this simple illusion already made them seem alive, how much more powerful would it have been if the statues' faces had been animated? One can only wonder how children would perceive Walt Disney's or Tex Avery's cartoon characters if their facial movements were truly coherent with what they are meant to say, or with the dubbing of that speech into another language. Of course, these imaginary characters are given so many other extraordinary behavioral qualities that we easily forgive their peculiar mouth gestures. We have even become accustomed to ignoring the asynchrony between Mickey's mouth and his mouse voice.

What about natural speech? When a candidate for the presidency of the United States of America exclaims ``Read my lips!'', he is not asking his constituency to lip-read him; he is simply using a common English expression to insist that his audience believe him, as if his words were written on his lips: if they cannot believe their ears, they can believe their eyes! But even though such expressions are common, people generally underestimate the actual amount of information that is transmitted through the optic channel. Humans produce speech through the actions of several articulators (vocal folds, velum, tongue, lips, jaw, etc.), of which only some are visible. The continuous speech thus produced is not, however, continuously audible: it also contains significant stretches of silence, during voiceless plosives and during pauses, while the speaker makes gestures in anticipation of the following sound. To sum up, parts of speech movements are only visible, parts are only audible, and parts are both audible and visible. Humans take advantage of the bimodality of speech: from the same source, information is simultaneously transmitted through two channels (the acoustic and the optic flow), and the outputs are integrated by the perceiver. In the following discussion, I will highlight the importance of the visual intelligibility of speech for normal hearers, and discuss some of the most recent issues in the bimodal aspects of speech production and perception.

Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995