
Technique used

A probabilistic approach, namely a Hidden Markov Model (HMM), is used as the main classifier. The system was first developed using an off-line analysis of the lips [171]. To extract the necessary visual information from the talker's face, the speaker's lips are made up in blue; a chroma-key turns this blue on-line into a saturated black. The vermilion area of the lips can therefore be detected easily by thresholding the luminance of the video signal. These images are then used as the input of an image-processing program that measures the parameters used by the visual recognition system. Figure C.7 gives an example of the input to our image-processing tool (left panel) and the measured parameters (right panel).

  
Figure C.7: An example of an image frame and the visual parameters
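
The segmentation and measurement step can be sketched as follows. The Python fragment below is only an illustration: the actual image-processing software is not described here, the threshold value is arbitrary, and the four geometric measures are hypothetical stand-ins for the parameters shown in figure C.7. It thresholds the luminance of a chroma-keyed frame and derives simple lip-shape parameters from the resulting mask.

    import numpy as np

    def segment_lips(luma, threshold=40):
        """Return a boolean mask of the lip area.

        After the chroma-key, the made-up lips appear as a saturated
        dark region, so thresholding the luminance isolates them.  The
        threshold value is illustrative and depends on the video chain.
        """
        return luma < threshold

    def lip_parameters(mask):
        """Derive four illustrative shape parameters from the lip mask.

        The actual four parameters selected in the report are shown in
        figure C.7; width, height, area and aspect ratio are used here
        as hypothetical stand-ins.
        """
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            return np.zeros(4)
        height = ys.max() - ys.min() + 1
        width = xs.max() - xs.min() + 1
        area = mask.sum()
        return np.array([width, height, area, width / height])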

The parameters are extracted from each video field (50 fields per second) and stored in a file that serves as the ``optical data'' input of our visual speech recognizer. The audio input is a 26-element vector composed of Mel-frequency cepstral coefficients, energy, and delta coefficients. To keep the audio and visual information cues fully synchronized, a 20 msec audio analysis window is used, so the visual and audio vectors remain in synchrony throughout the process.

As a first step towards building an audio-visual speech recognizer, we followed the architecture of the Direct Identification model (see MIAMI report 94-2 by Robert-Ribes): the video parameters are appended to each audio vector. Several tests were run to choose the best combination of visual parameters; in the end, only four parameters were selected as input to the speech recognition system. They are presented in figure C.7.

The corpus consists of 18 nonsense words of the form VCVCV. Each word was uttered nine times, so the audio database contains 162 words. In all the following results, the corpus is divided into a training set of 18x7 utterances and a test set of the remaining 18x2 utterances. The HMM has 9 states; for each state, the transition and output probabilities and the other parameters are computed with the forward-backward algorithm and Baum-Welch re-estimation.
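
The Direct Identification fusion itself reduces to a frame-wise concatenation. The sketch below (array names and shapes are assumptions, not part of the original system) appends the four visual parameters to each 26-element audio vector; since the 20 msec audio window matches the 50 fields/s video rate, the two streams are already frame-synchronous.

    import numpy as np

    def fuse_audio_visual(audio_frames, visual_fields):
        """Append the lip parameters to each audio vector
        (Direct Identification fusion).

        audio_frames  : (T, 26) MFCC + energy + delta vectors,
                        one per 20 msec analysis window
        visual_fields : (T, 4) lip parameters, one per video field

        Both streams run at 50 vectors per second, so they can simply
        be concatenated into (T, 30) audio-visual observations.
        """
        if len(audio_frames) != len(visual_fields):
            raise ValueError("audio and visual streams are out of sync")
        return np.hstack([audio_frames, visual_fields])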
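
Training and recognition can be sketched in the same spirit. The fragment below uses hmmlearn's GaussianHMM as a stand-in for the original recognizer (the report does not name a toolkit); its fit() routine performs the forward-backward / Baum-Welch re-estimation mentioned above, and one 9-state model is trained per word.

    import numpy as np
    from hmmlearn import hmm  # EM (Baum-Welch) training of Gaussian HMMs

    def train_word_model(utterances, n_states=9, n_iter=20):
        """Train one HMM on the training utterances of a single word.

        fit() re-estimates the transition and output (Gaussian)
        parameters with the forward-backward / Baum-Welch procedure.
        """
        X = np.vstack(utterances)               # stack the (T_i, 30) sequences
        lengths = [len(u) for u in utterances]  # one length per utterance
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag",
                                n_iter=n_iter)
        model.fit(X, lengths)
        return model

    def recognize(models, observation):
        """Return the word whose model scores the test sequence best."""
        return max(models, key=lambda word: models[word].score(observation))

With the split used here, each of the 18 word models would be trained on its 7 training utterances and the remaining 18x2 utterances classified by maximum likelihood.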





