Automatic speech recognition (ASR) promises to be of great importance in human-machine interfaces, but despite decades of extensive effort, acoustic-only recognition systems remain too inaccurate for the vast majority of conceivable applications, especially in noisy environments (automobiles, factory floors, crowded offices, etc.). While incremental advances can be expected within the current ASR paradigm, additional, novel approaches --- in particular those that also exploit visual information --- deserve serious study. Such hybrid (acoustic and visual) ASR systems have already been shown to achieve superior recognition accuracy, especially in noisy conditions, just as humans understand speech better when they can see the speaker's face --- for example, in the ``cocktail party'' situation of many competing voices.