
Intelligibility of visible speech

It is well known that the hearing impaired rely on lip-reading to (partially) understand speech, using the information recoverable from visual speech. But as early as 1935, Cotton [73] stated that ``there is an important element of visual hearing in all normal individuals''. Even if the auditory modality is the most important for speech perception by normal hearers, the visual modality may allow subjects to better understand speech. Note that visual information, provided by movements of the lips, chin, teeth, cheeks, etc., cannot, in itself, provide normal speech intelligibility. However, a view of the talker's face enhances spectral information that is distorted by background noise. A number of investigators have studied how noise distortion affects speech intelligibility depending on whether the message is only heard, or heard with the speaker's face also visible [330,241,21,93,95,331].

  
Figure 2.1: Improved intelligibility of degraded speech through vision of the speaker's face. The box indicates the mean, and the whiskers the standard deviation.

Figure 2.1 replots articulation scores obtained in French by Benoît et al. on 18 nonsense words by 18 normal hearers in two test conditions: audition only and audition plus vision [17]. We observe that vision is largely unnecessary in rather clear acoustic conditions (S/N > 0 dB), whereas seeing the speaker's face allows the listener to understand around 12 items out of 18 under highly degraded acoustic conditions (S/N = -24 dB), where the auditory-alone message is not understood at all. One may object that such conditions are seldom found in our everyday lives, occurring only in very noisy environments such as discotheques, some streets or industrial plants. But using visual speech is not merely a matter of increasing acoustic intelligibility for hearers/viewers: it is also a matter of making speech more comprehensible, i.e., easier to understand.

It is well known that information is more easily retained by an audience when transmitted over the television than over the radio. To confirm this, Reisberg et al. [286] reported that passages read from Kant's Critique of Pure Reason were better understood by listeners (according to the proportion of correctly repeated words in a shadowing task) when the speaker's face was provided to them. Even if people usually do not speak the same way as Immanuel Kant wrote, this last finding is a clear argument in favor of a general improvement of linguistic comprehension through vision. It therefore also highlights the advantage of TtAVS synthesis for the understanding of automatically read messages, assuming that human-machine dialogue will be much more efficient under bimodal presentation of spoken information to the user.

An average 11 dB ``benefit of lip-reading'' was found by MacLeod and Summerfield [199]. This corresponds to the average difference between the lowest signal-to-noise ratios at which test sentences are understood with and without visual information. This finding must obviously be tempered by the conditions of visual presentation. Ostberg et al. tested the effects of six sizes of videophone display on the intelligibility of noisy speech [253]. They presented running speech to subjects who were asked to adjust the noise level so that the individual words in the story appeared at the borderline of being intelligible; they observed an increase in the mean benefit of lip-reading from 0.4 to 1.8 dB with the increase in display size. This observation confirms the intuitive idea that the better the visual information, the greater the improvement in intelligibility.
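The ``benefit of lip-reading'' measure described above, a difference between two signal-to-noise thresholds, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the function names, the 0.5 proportion-correct criterion used to decide that sentences are ``understood'', and the score tables, which are made-up numbers and not data from MacLeod and Summerfield [199] or from any other study cited here.

```python
from typing import Mapping


def reception_threshold_db(scores: Mapping[float, float], criterion: float) -> float:
    """Return the lowest S/N (in dB) at which the score reaches the criterion.

    `scores` maps an S/N value in dB to a proportion of correctly
    understood items (0.0 to 1.0). Hypothetical helper, not a published method.
    """
    passing = [snr for snr, score in scores.items() if score >= criterion]
    if not passing:
        raise ValueError("criterion is never reached in the tested S/N range")
    return min(passing)


def lipreading_benefit_db(
    audio_only: Mapping[float, float],
    audio_visual: Mapping[float, float],
    criterion: float = 0.5,  # assumed criterion, not taken from the source
) -> float:
    """Difference between the audio-only and audio-visual thresholds (dB)."""
    return (reception_threshold_db(audio_only, criterion)
            - reception_threshold_db(audio_visual, criterion))


# Made-up proportion-correct scores at five S/N values (illustration only).
audio_only = {-20.0: 0.0, -15.0: 0.1, -10.0: 0.3, -5.0: 0.6, 0.0: 0.9}
audio_visual = {-20.0: 0.4, -15.0: 0.6, -10.0: 0.8, -5.0: 0.9, 0.0: 1.0}

benefit = lipreading_benefit_db(audio_only, audio_visual)
print(f"benefit of lip-reading: {benefit} dB")
```

With these invented scores the audio-only threshold is -5 dB and the audio-visual threshold is -15 dB, so the sketch reports a benefit of 10 dB. A positive value means that seeing the speaker lets the listener tolerate correspondingly more noise.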






Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995