Articulatory description


Modern experimental techniques serve, first of all, to describe and quantify the manner and place of articulation involved in the production of each phone in different languages [149,69,167], and to provide a fruitful comparison with conventional articulatory classification [168,169]. Beyond that, they allow the definition of a far more accurate taxonomy of the articulatory parameters. Among the many techniques, those providing the most relevant evidence for experimental phonetics are X-ray kinematography, X-ray kymography, electropalatography, electrokymography, and labiography.

When articulatory movements are correlated with their corresponding acoustic output, associating each phonetic segment with a specific articulatory segment becomes a critical problem. Unlike a purely spectral analysis of speech, where phonetic units exhibit an intelligible structure and can consequently be segmented, articulatory analysis does not, by itself, provide any unambiguous indication of how to perform such segmentation.

A few fundamental aspects of speech bimodality have long inspired interdisciplinary studies in neurology [173,49,225], physiology [307], psychology [361,106], and linguistics [248].

Experimental phonetics has demonstrated that, besides speed and precision in reaching the phonetic target (that is, the articulatory configuration corresponding to a phoneme), speech exhibits high variability due to multiple factors such as:

psychological factors (emotions, attitudes);
linguistic factors (style, speed, emphasis);
articulatory compensation;
intra-segmental factors;
inter-segmental factors;
intra-articulatory factors;
inter-articulatory factors;
coarticulatory factors.

To give an idea of how complex the interaction among the many speech components is, note that emotions with high psychological activity automatically increase the speed of speech production, that high speed usually causes articulatory reduction (hypo-speech), and that a clear, emphasized articulation (hyper-speech) is produced when particular communication needs arise.

Articulatory compensation takes effect when a phono-articulatory organ is working under unusual constraints, for example when someone speaks while eating or with a cigarette between the lips [200,104,1].

Intra-segmental variability indicates the variety of articulatory configurations that correspond to the production of the same phonetic segment, in the same context and by the same speaker. Inter-segmental variability, on the other hand, indicates the interaction between adjacent phonetic segments [193,119] and can be expressed in "space", as a variation of the articulatory place, or in "time", meaning the extension of the characteristics of a phone to the following ones.

Intra-articulatory effects occur when the same articulator is involved in the production of all the segments within the phonetic sequence. Inter-articulatory effects indicate the interdependencies between independent articulators involved in the production of adjacent segments within the same phonetic sequence.

Coarticulatory effects indicate the variation, in direction and extent, of the articulators' movements during a phonetic transition [19,134,155]. Forward coarticulation takes effect when the articulatory characteristics of an upcoming segment are anticipated in the preceding segments, while backward coarticulation occurs when the articulatory characteristics of a segment are maintained and extended to the following segments. Coarticulation is considered "strong" when two adjacent segments correspond to a visible articulatory discontinuity [110], or "smooth" when the articulatory activity proceeds smoothly between two adjacent segments.
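The two directions of coarticulation can be illustrated with a toy example. The sketch below assumes a single binary articulatory feature per segment (lip rounding, for instance) and spreads it over a fixed number of neighbouring segments. The function name, the binary encoding, and the fixed spreading widths are illustrative assumptions, not a model taken from the cited literature.

```python
def spread_feature(targets, forward=0, backward=0):
    """Spread a binary articulatory feature over neighbouring segments.

    targets  : list of 0/1 flags, one per phonetic segment (hypothetical encoding)
    forward  : number of preceding segments that anticipate the feature
               (forward coarticulation in the terminology of the text)
    backward : number of following segments that retain the feature
               (backward coarticulation)
    """
    n = len(targets)
    result = list(targets)
    for i, active in enumerate(targets):
        if not active:
            continue
        # Anticipation: earlier segments already show the feature.
        for j in range(max(0, i - forward), i):
            result[j] = 1
        # Carry-over: later segments still show the feature.
        for j in range(i + 1, min(n, i + 1 + backward)):
            result[j] = 1
    return result


# Only the fourth segment is specified as rounded.
rounding = [0, 0, 0, 1, 0, 0]
anticipated = spread_feature(rounding, forward=2)   # [0, 1, 1, 1, 0, 0]
carried_over = spread_feature(rounding, backward=1)  # [0, 0, 0, 1, 1, 0]
```

Real articulator trajectories are continuous and speaker dependent, so the discrete spreading above only shows why segment boundaries seen in the articulation need not coincide with those of the phonetic sequence.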

The coarticulation phenomenon represents a major obstacle in lipreading as well as in artificial articulatory synthesis when the movements of the lips must be reconstructed from the acoustic analysis of speech, since there is no strict correspondence between phonemes and visemes [15]. The basic characteristic of these phenomena is the non-linearity between the semantics of the pronounced speech (regardless of the particular acoustic unit taken as reference) and the geometry of the vocal tract (representative of the status of each articulatory organ). It is apparent that speech segmentation cannot be performed by means of articulatory analysis alone. Articulators start and complete their trajectories asynchronously, exhibiting both forward and backward coarticulation with respect to the speech wave.
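One facet of the missing strict correspondence between phonemes and visemes can be shown with a small lookup table. The sketch below assumes a deliberately coarse, hypothetical grouping in which phonemes with a similar visible lip configuration share one viseme label. The table and its labels are illustrative assumptions only, and the example ignores coarticulation altogether.

```python
# Hypothetical, highly simplified phoneme-to-viseme table (assumption).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "a": "open",
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence using the table above."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


# Three acoustically distinct words collapse onto the same visible sequence.
pat = to_visemes(["p", "a", "t"])
bat = to_visemes(["b", "a", "t"])
man = to_visemes(["m", "a", "n"])
same_on_the_lips = pat == bat == man
```

Because the mapping is many-to-one, a viseme sequence cannot be inverted to a unique phoneme sequence. Coarticulation adds to this ambiguity by making the visible realization of each segment depend on its neighbours.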

Many theories exist, and many cognitive models of speech production and perception have been proposed, making reference to the different concepts of spatial, acoustic, or orosensory target. Spatial target refers to the theoretical approach to speech production based on the hypothesis that speech is produced by adapting the articulators toward a final target configuration, through an embedded spatial reference system. Acoustic target, as opposed to spatial target, implies that speech is produced by adapting the articulators in order to reach a final acoustic target [245]. Orosensory target, in contrast with the previous two theories, explains speech production in terms of tactile feedback coming from receptors located in the mouth region, used to control the articulatory adaptation [324,265,266].

Three basic models are currently adopted, namely the closed loop model, the open loop model, and the coordinative model. The closed loop model stems from Cybernetics (Cartesian machine) and considers the sensory feedback (tactile and/or acoustic) to be the intrinsic control of the articulatory activities during speech production. The articulatory trajectories are planned on-line, depending on the measured feedback, until the target (spatial or acoustic) is reached. The open loop model stems from Computer Science and Artificial Intelligence (Turing machine) and is based on the metaphor of the "brain as a computer". It assumes that each articulatory activity is controlled through a sequence of instructions [106]. The articulatory trajectories are pre-programmed (off-line) point by point and are executed deterministically. The coordinative model [307,106,163] is based on the neurophysiology metaphor, in contrast to the cybernetic and algorithmic models. It relies on coordinative structures, that is, functional groups of muscles behaving like coordinated units according to rules learnt by training. The articulatory trajectories are the result of the interactive work of coordinative structures aimed at reaching a "qualitative" target in a flexible and adaptive way.
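The difference between the closed loop and the open loop models can be made concrete with a one-dimensional toy articulator. The sketch below assumes a scalar position, a proportional correction rule for the closed loop case, and a single external perturbation (`bump`) applied at one step. All names and parameter values are hypothetical, and the coordinative model is not covered.

```python
def open_loop(plan, bump_at=None, bump=0.0):
    """Replay pre-programmed increments; feedback is never consulted."""
    position = 0.0
    path = [position]
    for i, step in enumerate(plan):
        position += step
        if i == bump_at:
            position += bump  # external perturbation, invisible to the plan
        path.append(position)
    return path


def closed_loop(target, bump_at=None, bump=0.0, gain=0.5, tol=1e-3, max_steps=200):
    """Plan each step on-line from the measured error until the target is reached."""
    position = 0.0
    path = [position]
    for i in range(max_steps):
        error = target - position  # measured feedback (assumed noise free)
        if abs(error) < tol:
            break
        position += gain * error
        if i == bump_at:
            position += bump  # same perturbation as in the open loop case
        path.append(position)
    return path


plan = [0.25, 0.25, 0.25, 0.25]  # pre-programmed trajectory toward 1.0
open_final = open_loop(plan, bump_at=1, bump=-0.3)[-1]
closed_final = closed_loop(1.0, bump_at=1, bump=-0.3)[-1]
```

In this sketch the open loop trajectory ends short of the target by the size of the perturbation, while the closed loop trajectory keeps correcting until the remaining error falls below the tolerance.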

Regardless of the particular approach, a very large set of signal observations is necessary to estimate the articulatory parameters faithfully. Classic methods generally aim at accurate phoneme recognition, followed by synthesis by rule to associate the corresponding viseme. In [176,270], two estimation algorithms have been proposed for the statistical analysis of the cepstrum space, in order to detect phonetic clusters, representative of stable configurations of the articulatory organs, and to model the transition paths linking clusters, representative of coarticulation.
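The general idea of looking for clusters in cepstrum space can be sketched as follows. This is not the estimation procedure of [176,270], whose details are not given here. It is a minimal stand-in that assumes real cepstral coefficients as features, plain k-means with a deterministic farthest-point initialization as the clustering step, and synthetic two-tone frames in place of recorded speech. All function names and parameter values are hypothetical.

```python
import numpy as np


def real_cepstrum(frame, n_coeffs=12):
    """Low-order real cepstrum of one frame (c0 dropped to ignore overall level)."""
    windowed = frame * np.hanning(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    cepstrum = np.fft.irfft(log_spectrum)
    return cepstrum[1:n_coeffs + 1]


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dist, axis=1)


def kmeans(points, k, n_iter=20):
    """Plain k-means; centroids start from the first point, then farthest points."""
    centroids = [points[0]]
    while len(centroids) < k:
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(dist)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return assign(points, centroids), centroids


# Synthetic stand-in for speech: four frames of one "stable configuration"
# followed by four frames of another (assumption, not real data).
rng = np.random.default_rng(0)
t = np.arange(256) / 8000.0


def synth(f1, f2):
    tones = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)
    return tones + 0.01 * rng.standard_normal(t.size)


frames = [synth(300, 700) for _ in range(4)] + [synth(2500, 3200) for _ in range(4)]
features = np.array([real_cepstrum(f) for f in frames])
labels, centroids = kmeans(features, k=2)

# Frame indices where the cluster label changes, a crude marker of a transition.
transitions = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
```

With real speech the clusters overlap and the transitions are gradual, so modelling the paths between clusters, as the cited work does, requires more than the hard label changes detected here.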


Esprit Project 8579/MIAMI (Schomaker et al., '95)
Thu May 18 16:00:17 MET DST 1995