Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications
by Lakhmi C. Jain; N.M. Martin
CRC Press LLC
ISBN: 0849398045   Pub Date: 11/01/98
  



In normal verbal communication, the analysis and comprehension of the various articulation movements rely on a bimodal perceptive mechanism for the continuous integration of coherent visual and acoustic stimuli. When the acoustic channel is impaired, whether by distance, noisy environments, transparent barriers such as a pane of glass, or pathologies, the perceptive task is consequently performed through the visual modality alone. In this case, only the movements and expressions of the visible articulatory organs are exploited for comprehension: vertical and horizontal lip opening, vertical jaw displacement, teeth visibility, tongue position, and other minor indicators such as cheek inflation and nose contraction.

Results from experimental phonetics show that hearing-impaired people behave differently from normal-hearing people in lipreading. In particular, visemes such as the bilabials /b, p, m/, the fricatives /f, v/, and the occlusive consonants /t, d/ are recognized by both groups, while other visemes such as /k, g/ are recognized only by hearing-impaired people. The rate of correct recognition for each viseme also differs between the two groups: for example, hearing-impaired people recognize the nasal consonants /m, n/ far more successfully than normal-hearing people do. These two specific differences in phoneme recognition can hardly be explained, since the velum, which is the primary articulator involved in phonemes like /k, g/ or /m, n/, is not visible and its movements cannot be perceived in lipreading. A possible explanation, stemming from recent results in experimental phonetics, relies on the exploitation of secondary articulation indicators that commonly go unnoticed by the normal observer.
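To make the many-to-one relationship between phonemes and visemes concrete, the following Python sketch groups a few consonants into the viseme classes cited above. The grouping is illustrative only (the class names and memberships are assumptions, not a standard viseme inventory); it shows why acoustically distinct consonants can be indistinguishable on the lips.

```python
# Illustrative phoneme-to-viseme grouping (assumed classes, not a standard inventory).
# Acoustically distinct phonemes that share the same visible articulation
# collapse into a single viseme, which is why lipreading alone is ambiguous.
VISEME_CLASSES = {
    "bilabial":    {"b", "p", "m"},  # lips fully closed
    "labiodental": {"f", "v"},       # lower lip against upper teeth
    "alveolar":    {"t", "d", "n"},  # tongue tip at alveolar ridge
    "velar":       {"k", "g"},       # velar closure, barely visible
}

def viseme_of(phoneme: str) -> str:
    """Return the viseme class of a phoneme, or 'unknown'."""
    for viseme, phonemes in VISEME_CLASSES.items():
        if phoneme in phonemes:
            return viseme
    return "unknown"

# /b/ and /p/ differ only in voicing, which is invisible on the lips:
assert viseme_of("b") == viseme_of("p") == "bilabial"
```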

2.2 Speech Articulation and Coarticulation

When articulatory movements are correlated with their corresponding acoustic output, associating each phonetic segment with a specific articulatory segment becomes a critical problem. Unlike a pure spectral analysis of speech, where phonetic units exhibit an intelligible structure and can consequently be segmented, articulatory analysis does not, on its own, provide any unique indication of how to perform such a segmentation.

A few fundamental aspects of speech bimodality have inspired interdisciplinary studies in neurology, physiology, psychology, and linguistics. Experimental phonetics has demonstrated that, in addition to speed and precision in reaching the phonetic target (that is, the articulatory configuration corresponding to a phoneme), speech exhibits high variability due to multiple factors, such as:

  psychological factors (emotions, attitudes);
  linguistic factors (style, speed, emphasis);
  articulatory compensation;
  intra-segmental factors;
  inter-segmental factors;
  intra-articulatory factors;
  inter-articulatory factors;
  coarticulatory factors.

To give an idea of the complexity of the interactions among the many speech components, note that emotions with high psychological activation automatically increase the speed of speech production; high speed usually causes articulatory reduction (hypo-speech), whereas particular communication needs produce a clear, emphasized articulation (hyper-speech).

Articulatory compensation takes effect when a phono-articulatory organ works under unusual constraints, for example, when someone speaks while eating or with a cigarette between the lips.

Intra-segmental variability indicates the variety of articulatory configurations that correspond to the production of the same phonetic segment, in the same context, by the same speaker. Inter-segmental variability, on the other hand, indicates the interaction between adjacent phonetic segments; it can be expressed in “space,” as a variation of the articulatory place, or in “time,” as the extension of the characteristics of a phone.

Intra-articulatory effects are apparent when the same articulator is involved in the production of all the segments within the phonetic sequence. Inter-articulatory effects indicate the interdependencies between independent articulators involved in the production of adjacent segments within the same phonetic sequence.

Coarticulatory effects indicate the variation, in direction and extension, of the articulators’ movements during a phonetic transition. Forward coarticulation takes effect when the articulatory characteristics of a segment to follow are anticipated by previous segments, while backward coarticulation occurs when the articulatory characteristics of a segment are maintained and extended into following segments. Coarticulation is considered “strong” when two adjacent segments correspond to a visible articulatory discontinuity, and “smooth” when the articulatory activity proceeds smoothly between the two segments.
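Forward and backward coarticulation can be illustrated numerically by blending per-segment articulatory targets with dominance functions that decay away from each segment’s center, in the spirit of the Cohen-Massaro coarticulation model. The sketch below is a minimal illustration over a single articulatory dimension (lip opening), with assumed parameter values; it is not the technique proposed in this chapter.

```python
import math

def dominance(t, center, magnitude=1.0, rate=8.0):
    """Exponentially decaying dominance of a segment's target at time t.
    The left tail yields forward (anticipatory) coarticulation, the right
    tail backward (carryover) coarticulation. Parameters are illustrative
    assumptions, not measured values."""
    return magnitude * math.exp(-rate * abs(t - center))

def lip_opening(t, segments):
    """Blend the articulatory targets of all segments, weighted by dominance.
    `segments` is a list of (center_time_s, target_opening) pairs."""
    weights = [dominance(t, c) for c, _ in segments]
    total = sum(weights)
    return sum(w * target for w, (_, target) in zip(weights, segments)) / total

# Two segments: an open vowel at 0.10 s followed by a bilabial closure at 0.25 s.
segments = [(0.10, 1.0), (0.25, 0.0)]
for t in (0.05, 0.10, 0.15, 0.20, 0.25):
    print(f"t={t:.2f}s  opening={lip_opening(t, segments):.2f}")
```

Even at the vowel’s own center the blended opening stays below the vowel’s target, because the upcoming closure already exerts its influence: a numerical picture of forward coarticulation.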

The coarticulation phenomenon represents a major obstacle in lipreading, as well as in artificial articulatory synthesis when lip movements must be reconstructed from the acoustic analysis of speech, since there is no strict correspondence between phonemes and visemes. The basic characteristic of these phenomena is the nonlinearity between the semantics of the pronounced speech (regardless of the particular acoustic unit taken as reference) and the geometry of the vocal tract (representative of the status of each articulatory organ). Experimentation reveals that speech segmentation cannot be performed by means of articulatory analysis alone; the articulators, in fact, start and complete their trajectories asynchronously, exhibiting both forward and backward coarticulation with respect to the speech wave.

If a lip-readable visual synthetic output is to be provided through the automatic analysis of continuous speech, much attention must be paid to the definition of suitable indicators capable of describing the visually relevant articulation places (labial, dental, and alveolar) with the least residual ambiguity. This methodological consideration has been taken into account in the proposed technique by extending the analysis-synthesis region of interest to the area around the lips, including the cheeks and the nose.
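As a rough illustration of what such indicators might look like in practice, the sketch below computes the visually relevant measures named at the start of this section (vertical and horizontal lip opening, jaw displacement) as distances between 2D facial landmarks. All landmark names and coordinates are hypothetical; the chapter’s actual region of interest and measurement protocol are not reproduced here.

```python
from math import hypot

def distance(p, q):
    """Euclidean distance between two 2D points (x, y)."""
    return hypot(p[0] - q[0], p[1] - q[1])

def articulation_indicators(landmarks):
    """Compute the visually relevant indicators named in the text from a
    dict of landmark points. All landmark names are assumptions made for
    this illustration."""
    return {
        "vertical_lip_opening":   distance(landmarks["upper_lip"], landmarks["lower_lip"]),
        "horizontal_lip_opening": distance(landmarks["left_corner"], landmarks["right_corner"]),
        "jaw_displacement":       distance(landmarks["chin"], landmarks["nose_tip"]),
    }

# Example frame (coordinates in normalized image units, purely illustrative):
frame = {
    "upper_lip": (0.50, 0.60), "lower_lip": (0.50, 0.68),
    "left_corner": (0.42, 0.64), "right_corner": (0.58, 0.64),
    "chin": (0.50, 0.85), "nose_tip": (0.50, 0.45),
}
print(articulation_indicators(frame))
```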


