
The specific nature of speech coherence between acoustics and optics

Speaking is not the process of uttering a sequence of discrete units. Coarticulation systematically occurs in the transitions between the realizations of phonological units. Anticipation and perseveration of articulatory gestures across phonetic units are well known for their acoustic consequences, i.e., for the differences between allophones of a single phoneme. In French, for instance, /s/, which is considered a non-rounded consonant, is spread in /si/ but protruded in /sy/, due to regressive assimilation; conversely, /i/, which has the phonological status of a spread phoneme, is protruded in /Si/, due to progressive assimilation (which is less frequent). Such differences between allophones of the same phoneme are both auditorily relevant [65] and visually relevant [62].

A classic example of anticipation in lip rounding was first given by Benguerel and Cowan, who observed an articulatory influence of the /y/ on the first /s/ in /istrstry/, which occurred in the French sequence une sinistre structure [14] (though this has since been revised by Abry & Lallouache [3]). In fact, the longest effect of anticipation is observed during pauses, when no acoustic cues are provided: subjects are able to visually identify /y/ on average 185 ms before it is pronounced, during a 460 ms pause in /i#y/ [64]. In an experiment where subjects simply had to identify the final vowel in /zizi/ or /zizy/, Escudier et al. [98] showed that subjects visually identified the /y/ in /zizy/ from a photograph of the speaker's face taken about 80 ms before the point at which they could identify it auditorily (from gated excerpts of various lengths of the general form /ziz/). They also observed no difference in the time at which subjects could identify /i/ or /y/, auditorily or visually, in the transitions /zyzi/ or /zyzy/. This asymmetry is due to nonlinearities between articulatory gestures and their acoustic consequences [323]. In this example, French speakers can round their lips (and they do so!) before the end of the /i/ in /zizy/ without acoustic consequences, whereas spreading the lips too early during the /y/ in /zyzi/ would lead to a mispronunciation and therefore to a misidentification.

To produce an acoustically acceptable French /y/, the lips have to be rounded so that their interolabial area is less than 0.8, above which value the vowel is perceived as /i/ [2]. Lip control is therefore much more constrained for /y/ than for /i/, so that the anticipation of lip rounding in /i/-to-/y/ transitions lasts longer than the anticipation of lip spreading in /y/-to-/i/ transitions. These observations show that coarticulation largely determines how far subjects can process visual information before, or in the absence of, acoustic information. This natural asynchrony between the two modes of speech perception depends on the intrinsic nature of the phonetic units, as well as on the speech rate and the individual strategy of the speaker. The intelligibility gain that vision adds to audition clearly relies on this asynchrony.
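The threshold argument above can be illustrated with a minimal sketch. It is only a toy model under stated assumptions, not a procedure taken from the cited studies: the function names `perceived_vowel` and `rounding_onset_ms` are hypothetical, the area values must be expressed in whatever unit the 0.8 threshold refers to (the unit is not given in the text), and treating an area exactly equal to the threshold as /i/ is an arbitrary choice.

```python
from typing import Optional, Sequence

# Threshold quoted in the text; its unit is not stated there, so every
# area passed to these functions is assumed to use that same unit.
ROUNDING_THRESHOLD = 0.8


def perceived_vowel(interolabial_area: float) -> str:
    """Toy decision rule: 'y' below the threshold, 'i' otherwise.

    Classifying an area exactly at the threshold as 'i' is an assumption.
    """
    if interolabial_area < 0:
        raise ValueError("an area cannot be negative")
    return "y" if interolabial_area < ROUNDING_THRESHOLD else "i"


def rounding_onset_ms(areas: Sequence[float], frame_ms: float) -> Optional[float]:
    """Return the time (ms) of the first frame whose area is below the
    threshold, or None if the lips never get that rounded.

    `areas` is a hypothetical series of lip-area measurements sampled
    every `frame_ms` milliseconds, with the first frame at time 0.
    """
    for index, area in enumerate(areas):
        if perceived_vowel(area) == "y":
            return index * frame_ms
    return None
```

For example, with one measurement every 20 ms, `rounding_onset_ms([1.5, 1.1, 0.7, 0.4], 20)` locates the first frame below the threshold. Comparing such a visual onset with the acoustic onset of the vowel is one simple way to express the audio-visual asynchrony discussed above.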






Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995