Although acoustically based automatic speech recognition systems have witnessed enormous improvements over the past ten years, they still experience difficulties in several areas, including operation in noisy environments. However, using visual cues can significantly enhance the recognition rate. In this section, we provide an overall survey of existing audio-visual speech recognizers. The first audio-visual speech recognition system using the oral-cavity region was developed by Petajan .
The primary focus of his research was demonstrating that an existing acoustic automatic speech recognizer could achieve a greater recognition percentage when augmented with information from the oral-cavity region. This system used a single speaker to perform isolated-word recognition. Petajan built the audio-visual recognition system around a commercial acoustic automatic speech recognizer and utilized four (static) features from the oral-cavity region of each image frame, in decreasing order of weight: height, area, width, and perimeter. The commercial acoustic automatic speech recognizer alone had a recognition rate of 65%, nearly typical of the technology commercially available at that time; the audio-visual automatic speech recognizer, however, achieved a recognition rate of 78%.
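The four static features above can all be read directly off a binarized image of the oral-cavity region. The following sketch shows one plausible way to compute them from a boolean mask; the function name, the 4-neighbor perimeter rule, and the mask convention are illustrative assumptions, not Petajan's actual implementation.

```python
import numpy as np

def oral_cavity_features(mask):
    """Height, area, width, and perimeter of a binary oral-cavity mask
    (True = cavity pixel). Illustrative sketch, not the original method."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return dict(height=0, area=0, width=0, perimeter=0)
    height = int(ys.max() - ys.min() + 1)   # vertical extent of the cavity
    width = int(xs.max() - xs.min() + 1)    # horizontal extent
    area = int(mask.sum())                  # number of cavity pixels
    # Perimeter: cavity pixels with at least one 4-neighbor outside the mask.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((mask & ~interior).sum())
    return dict(height=height, area=area, width=width, perimeter=perimeter)
```

On a filled 3×4 rectangle this yields height 3, width 4, area 12, and perimeter 10 (the two fully surrounded pixels are interior).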
He further expanded his research by building an improved version of the recognizer. In an attempt to approximate real-time performance, the optical processor capturing the images utilized fast contour and region coding algorithms based on an extended variation of predictive differential quantization . The optical automatic speech recognizer utilized vector quantization to build a codebook of oral-cavity shadows. This new system performed faster than the earlier one while using only the area feature of the oral-cavity region. The entire system was tested in speaker-dependent experiments on isolated words. In two experiments, optical identification was 100% on 16 samples of each digit for one speaker and 97.5% on eight samples of each digit for another.
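Vector quantization of this kind trains a small codebook of representative feature vectors and then maps each observed oral-cavity shape to its nearest codeword. The sketch below uses a toy k-means training loop as the codebook builder; the function names, the k-means choice, and all parameters are assumptions for illustration, not the coding scheme used in Petajan's system.

```python
import numpy as np

def build_codebook(vectors, k, iters=20, seed=0):
    """Train a k-entry VQ codebook with a minimal k-means loop
    (illustrative stand-in for codebook construction)."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest codeword, then re-center.
        d = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = vectors[labels == j].mean(axis=0)
    return codebook

def quantize(vec, codebook):
    """Index of the nearest codeword for one feature vector."""
    return int(np.linalg.norm(codebook - vec, axis=1).argmin())
```

At recognition time, a sequence of oral-cavity feature vectors becomes a sequence of codeword indices, which is then matched against stored templates.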
Nishida, while working at the MIT Media Laboratory, used optical information from the oral cavity to find word boundaries for an acoustic automatic speech recognizer . His work focused on the derivative of dark areas, obtained from the difference between two consecutive binary images. Yuhas, in his doctoral research in 1989, used a neural network with both optical and acoustic information to recognize nine phonemes . The goal of his research was to accurately recognize the phonemes from optical information, independently of the acoustic noise level. As the acoustic recognition degraded with noise, the optical system maintained its performance.
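The dark-area derivative idea can be sketched compactly: track the dark-pixel count per binary frame, difference consecutive counts, and flag stretches where the mouth is nearly static as candidate word boundaries. The `quiet`/`thresh` rule below is a hypothetical boundary criterion added for illustration; only the frame-difference derivative itself comes from the description above.

```python
import numpy as np

def dark_area_derivative(frames):
    """Per-frame change in dark area, from consecutive binary frames
    (True = dark pixel). Matches the frame-difference idea in the text."""
    areas = np.array([int(f.sum()) for f in frames])
    return np.diff(areas)

def word_boundaries(frames, quiet=2, thresh=1):
    """Flag frame indices after `quiet` consecutive near-zero derivatives.
    This boundary rule and its parameters are illustrative assumptions."""
    deriv = np.abs(dark_area_derivative(frames))
    boundaries, run = [], 0
    for i, d in enumerate(deriv, start=1):
        run = run + 1 if d <= thresh else 0
        if run == quiet:
            boundaries.append(i)
    return boundaries
```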
Stork et al. used a time delay neural network (TDNN) to achieve speaker-independent recognition of ten phoneme utterances . The optical information consisted of four distances: (i) the distance between the nose and chin (lowering of the jaw), (ii) the height at the center of the upper and lower lips, (iii) the height at the corners of the lips, and (iv) the width of the lips. The TDNN achieved an optical recognition accuracy of 51% and a combined acoustic and optical recognition of 91%. Bregler used the TDNN as a preprocessor to the dynamic time warping algorithm to recognize 26 German letters in connected speech . The recognition rate for the combined acoustic and optical information was 97.2%, comparable to the acoustic-only rate (97.2%) and well above the optical-only rate (46.9%); the gain from the optical stream became more significant in noisy environments.
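A common way such systems combine the two streams, shown here only as a generic sketch and not as the fusion scheme of Stork et al. or Bregler, is a weighted sum of per-class acoustic and visual scores; lowering the acoustic weight `alpha` as noise increases lets the visual stream dominate in noisy conditions. All names and the weighting rule are illustrative assumptions.

```python
import numpy as np

def fuse_scores(acoustic_logp, visual_logp, alpha=0.7):
    """Pick the class with the best weighted log-score across streams.
    `alpha` weights the acoustic stream (illustrative late-fusion sketch)."""
    fused = (alpha * np.asarray(acoustic_logp)
             + (1 - alpha) * np.asarray(visual_logp))
    return int(fused.argmax())
```

With a high `alpha` the decision follows the acoustic scores; with a low `alpha` (e.g. under heavy noise) it follows the visual ones.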