These first results are an encouraging step towards an audio-visual speech recognition system, but many problems remain to be solved. First, our model neglects the weighting factors of the visual and auditory parameters, so the system still relies on unbalanced information. Second, other models of audio-visual integration have to be implemented in order to select the best way of combining audio and visual information in our case. For complete on-line audio-visual speech recognition, we plan to design a video analysis board capable of extracting the lip contours and computing the four necessary visual parameters with a short delay (< 40 ms). Such a board will make it possible to collect a much larger amount of audio-visual data within a reasonable period of time, and it should also allow the audio-visual speech recognizer to run in real time. The resulting system would be an ideal interface for bimodal speech dialogue on a multimedia platform.