
Automatic visual speech recognition

The visible facial articulation cues can be exploited for automatic visual speech recognition, which could enhance the performance of conventional acoustic speech recognizers, especially when the acoustic signal is degraded by noise. In this section we present some prototypical systems that process the optical signal as video input.

The optical automatic speech recognition system developed by Brooke and Petajan in 1986 [47] was based on the magnitude of the changes in radial vectors, measured from the midpoint of the line between the inner lip corners to points on the inner lip contour at equal angular intervals, with the zero angle at the left corner. The objective of this research was to find an optical automatic speech recognition algorithm based on this radial vector measure. The recognition experiment consisted of distinguishing between the three vowels /a/, /i/ and /u/ in consonant-vowel-consonant utterances from two speakers, using a similarity metric on the radial vector measure. The results show that, if the similarity threshold is held sufficiently high and consonant coarticulation effects are absent or neglected, perfect recognition of the three vowels is achieved.
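The radial vector measure can be sketched as follows, assuming the inner lip contour has already been extracted as a list of (x, y) points. The function names, the number of angular samples, and the cosine similarity metric are illustrative choices, not details of the original system:

```python
import math

def radial_vector(inner_lip, n_angles=16):
    """Brooke and Petajan-style measure: distances from the midpoint of the
    line between the inner lip corners to points on the inner lip contour
    at equal angular intervals, with the zero angle at the left corner.
    `inner_lip` is a list of (x, y) contour points."""
    left = min(inner_lip, key=lambda p: p[0])    # left inner lip corner
    right = max(inner_lip, key=lambda p: p[0])   # right inner lip corner
    cx, cy = (left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0
    base = math.atan2(left[1] - cy, left[0] - cx)  # direction of left corner

    def ang_dist(a, b):
        d = (a - b) % (2 * math.pi)
        return min(d, 2 * math.pi - d)

    radii = []
    for k in range(n_angles):
        theta = 2 * math.pi * k / n_angles
        # contour point whose direction (relative to the left corner)
        # is closest to the sampling angle theta
        p = min(inner_lip,
                key=lambda q: ang_dist(
                    math.atan2(q[1] - cy, q[0] - cx) - base, theta))
        radii.append(math.hypot(p[0] - cx, p[1] - cy))
    return radii

def cosine_similarity(r1, r2):
    # One plausible similarity metric between two radial vectors; the exact
    # metric used in the original experiment is not reproduced here.
    dot = sum(a * b for a, b in zip(r1, r2))
    norm = math.sqrt(sum(a * a for a in r1)) * math.sqrt(sum(b * b for b in r2))
    return dot / norm
```

A vowel would then be recognized by comparing the radial vector of an incoming frame against stored templates and accepting the best match only when its similarity exceeds a high threshold, mirroring the thresholding condition described above.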

Pentland developed an optical speech recognition system for isolated digit recognition [264]. He used optical flow techniques to capture the movement of muscles in four regions of the lower face, as evidenced by the changes in muscle shadows under normal illumination. The recognition rate was 70% for the ten digits. Smith used optical information from the derivatives of the area and the height of oral-cavity forms to distinguish among four words that an acoustic automatic speech recognizer had confused [316]. Using these two derivatives, he managed to distinguish perfectly among the four acoustically confused words. Goldschen built an optical speech recognizer for continuous speech [126]. The system used a probabilistic approach based on Hidden Markov Models (HMMs); the database consisted of sequences of binary images of contours or edges depicting the nostrils. The recognition rate was 25% for complete sentences.
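The core computation in an HMM-based recognizer such as Goldschen's can be sketched with the standard forward algorithm for a discrete HMM, which scores a sequence of quantized visual features against a word model. The parameters and symbol alphabet below are invented for illustration, not taken from the original system:

```python
import math

def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """Discrete-HMM forward algorithm: log P(observation sequence | model).
    `obs` is a list of integer feature symbols; `start_p[s]` is the initial
    probability of state s, `trans_p[s1][s2]` the transition probability,
    and `emit_p[s][o]` the probability that state s emits symbol o.
    Rescaling of alpha is omitted for brevity, so very long sequences
    would underflow in a real implementation."""
    n_states = len(start_p)
    # initialization: probability of being in each state after the first symbol
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n_states)]
    # induction: fold in one observation at a time
    for o in obs[1:]:
        alpha = [sum(alpha[sp] * trans_p[sp][s] for sp in range(n_states))
                 * emit_p[s][o]
                 for s in range(n_states)]
    return math.log(sum(alpha))
```

In such a recognizer each word (or phone-like visual unit) gets its own HMM, and the model with the highest forward likelihood for the observed feature sequence is chosen as the recognized word.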






Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995