
Articulatory synthesis

In face-to-face verbal communication, the analysis and comprehension of the various articulatory movements rely on a bimodal perceptive mechanism that continuously integrates coherent visual and acoustic stimuli [331]. The most natural way to visualize the cues of speech on a screen is therefore to synthesize speech on a ``talking human face''. In this way a message from the computer (output) can be represented according to a bimodal model and visualized through a multimedia interface.

Worldwide interest has recently been growing in the possibility of employing anatomical models of the human face, suitable for synthetic animation driven by speech, in applications such as multimedia man-machine interfaces, new algorithms for very low bitrate model-based videophony, and advanced applications in the fields of education, culture and entertainment.

Different approaches have been investigated and proposed to define the basic set of parameters necessary to model human mouth expressions, including lip, teeth and tongue movements. In [38], 14 elementary visual units responsible for the mouth shape are defined and associated with corresponding acoustic classes. Through suitable phonemic classification, a consistent sequence of visemes is obtained to visualize stable acoustic frames, while coarticulation transients are approximated by interpolation. In [139] a phonemic classifier is employed to determine a basic set of mouth parameters through synthesis by rule, while in [231] the intermediate stage of phonemic classification is bypassed and input acoustic parameters are mapped directly to suitable output visual parameters. All these approaches, however, are unsuited to applications where a very faithful reproduction of labial expressions is required, such as those oriented to synthetic lipreading. The main problems stem from the lack of a suitable anatomical model of the mouth and from the coarse representation of coarticulation phenomena.
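The viseme-and-interpolation scheme described above can be illustrated with a minimal sketch: a phoneme sequence is mapped to viseme classes, stable frames take the corresponding mouth parameters, and transitions between them are approximated by linear interpolation. The viseme table, parameter names and values below are illustrative placeholders, not the parameters used in the cited systems.

from dataclasses import dataclass

@dataclass
class MouthShape:
    width: float       # normalized lip width
    opening: float     # normalized lip opening
    protrusion: float  # normalized lip protrusion

# Hypothetical phoneme-to-viseme mapping (a real system defines ~14 classes).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "a": "open_vowel", "o": "rounded_vowel", "u": "rounded_vowel",
}

# Illustrative mouth parameters for each viseme class.
VISEME_SHAPES = {
    "bilabial":      MouthShape(0.5, 0.0, 0.2),
    "labiodental":   MouthShape(0.6, 0.1, 0.1),
    "open_vowel":    MouthShape(0.8, 0.9, 0.0),
    "rounded_vowel": MouthShape(0.3, 0.4, 0.8),
}

def interpolate(a: MouthShape, b: MouthShape, t: float) -> MouthShape:
    """Linear interpolation used to approximate coarticulation transients."""
    lerp = lambda x, y: x + t * (y - x)
    return MouthShape(lerp(a.width, b.width),
                      lerp(a.opening, b.opening),
                      lerp(a.protrusion, b.protrusion))

def mouth_trajectory(phonemes, frames_per_transition=5):
    """Stable viseme frames joined by interpolated transition frames."""
    shapes = [VISEME_SHAPES[PHONEME_TO_VISEME[p]] for p in phonemes]
    frames = []
    for cur, nxt in zip(shapes, shapes[1:]):
        frames.append(cur)  # stable acoustic frame -> stable viseme
        for i in range(1, frames_per_transition):
            frames.append(interpolate(cur, nxt, i / frames_per_transition))
    frames.append(shapes[-1])
    return frames

if __name__ == "__main__":
    for frame in mouth_trajectory(["m", "a", "u"]):
        print(frame)

As the text notes, this kind of piecewise-linear approximation of coarticulation is exactly what limits such schemes when very faithful labial reproduction is needed.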

A recent approach [61] based on Time-Delay Neural Networks (TDNNs) employs a novel fast learning algorithm capable of providing very accurate and realistic control of the mouth parameters. As the experimental results show, visible coarticulation parameters are correctly estimated by the network and the corresponding visualization is smooth enough to be understood by lipreading. In this work human faces are modeled through a flexible wire-frame mask adapted to their somatic characteristics, with increased resolution around high-detail features such as the eyes and mouth. This kind of wire-frame is generally referred to as 2.5D, meaning that it is neither a 2D nor a true 3D structure, but something in-between. A comparison can be made with a carnival mask reproducing the characteristics of a human face on a thin film of clay: the 3D information is incomplete, since only one half of the head is modeled by the mask. The wire-frame structure is organized as a set of subregions, each of them governed by predefined deformation rules that produce time-varying displacements simulating muscle contraction. Animation procedures have been designed on the basis of the psychological studies of Ekman and Friesen [88], which formalize the relationship between the subjective information content of the speaker's audio-visual communication and the corresponding facial expression. In particular, the set of basic animation units (AUs) defined in [5] for describing and coding facial mimics has been implemented with reference to the Parke model [257].
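A minimal sketch of the subregion-based deformation idea is given below. It is an assumption-laden illustration, not the Parke model or the cited TDNN system: a wire-frame mask holds a rest shape, each named subregion carries per-vertex weights and a displacement direction, and an animation-unit (AU) intensity drives a simple linear deformation rule. All class and parameter names are hypothetical.

import numpy as np

class WireFrameMask:
    def __init__(self, vertices: np.ndarray):
        # vertices: (N, 3) array; the mask covers only the frontal half
        # of the head, hence the "neither 2D nor true 3D" description.
        self.rest = vertices.copy()
        self.vertices = vertices.copy()
        self.subregions = {}  # name -> (index array, weights, direction)

    def add_subregion(self, name, indices, weights, direction):
        """Register a subregion with per-vertex weights and a displacement
        direction; larger weights near the region centre give a smooth falloff."""
        self.subregions[name] = (np.asarray(indices),
                                 np.asarray(weights, dtype=float)[:, None],
                                 np.asarray(direction, dtype=float))

    def apply_aus(self, au_intensities: dict):
        """Deform the rest shape according to AU intensities in [0, 1],
        simulating muscle contraction with a simple linear rule."""
        self.vertices = self.rest.copy()
        for name, intensity in au_intensities.items():
            idx, w, d = self.subregions[name]
            self.vertices[idx] += intensity * w * d

# Usage: a toy mouth-corner subregion pulled upward/outward by one AU.
verts = np.zeros((4, 3))
mask = WireFrameMask(verts)
mask.add_subregion("mouth_corner_raiser",
                   indices=[0, 1], weights=[1.0, 0.5],
                   direction=[0.0, 1.0, 0.1])
mask.apply_aus({"mouth_corner_raiser": 0.8})
print(mask.vertices)

In the approach described above, the AU intensities would be supplied over time (here, by the TDNN estimating visible coarticulation parameters from the acoustic input), so that the mask deforms frame by frame.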





